Here you can find my research!

GANs in the Panorama of Synthetic Data Generation Methods

This paper focuses on the creation and evaluation of synthetic data to address the challenges of imbalanced datasets in machine learning (ML) applications, using fake news detection as a case study. We conducted a thorough literature review on generative adversarial networks (GANs) for tabular data, synthetic data generation methods, and synthetic data quality assessment. By augmenting a public news dataset with synthetic data generated by different GAN architectures, we demonstrate the potential of synthetic data to improve the performance of ML models in fake news detection. Our results show a significant improvement in classification performance, especially in the underrepresented class. We also modify and extend a data usage approach to evaluate the quality of synthetic data, and we investigate the relationship between synthetic data quality and data augmentation performance in classification tasks. We found a positive correlation between synthetic data quality and performance in the underrepresented class, highlighting the importance of high-quality synthetic data for effective data augmentation.

Survey on Synthetic Data Generation, Evaluation Methods and GANs

Synthetic data consists of artificially generated data. When data are scarce or of poor quality, synthetic data can be used, for example, to improve the performance of machine learning models. Generative adversarial networks (GANs) are state-of-the-art deep generative models that can generate novel synthetic samples that follow the underlying data distribution of the original dataset. Reviews on synthetic data generation and on GANs have already been written. However, to the best of our knowledge, none in the relevant literature has explicitly combined these two topics. This survey aims to fill this gap by combining synthetic data generation and GANs, and to act as a strong starting point for new researchers in the field, giving them a general overview of the key contributions and useful references. We conducted a review of the state of the art by querying four major databases: Web of Science (WoS), Scopus, IEEE Xplore, and ACM Digital Library. This allowed us to gain insights into the most relevant authors, the most relevant scientific journals in the area, the most cited papers, the most significant research areas, the most important institutions, and the most relevant GAN architectures. We thoroughly reviewed GANs, their most common training problems, and their most important breakthroughs, with a focus on GAN architectures for tabular data. Further, we present the main algorithms for generating synthetic data and their applications, and we give our thoughts on these methods. Finally, we reviewed the main techniques for evaluating the quality of synthetic data (especially tabular data) and provided a schematic overview of the information presented in this paper.


On Creation of Synthetic Samples from GANs for Fake News Identification Algorithms

The use of Generative Adversarial Networks (GANs) is almost traditional in creating synthetic images for medical purposes. This is probably the best use of GANs so far, as the results can easily be checked by the eye of specialists. In fake news detection, we have seen lately that neural models (and deep learning) can provide a considerable improvement over standard classifiers. Yet the main problem is still the lack of data, mostly fake news data, to feed these models. In this paper, we address that problem by proposing the use of a GAN. Results show a better capacity to generalize when models are trained on a dataset extended with synthetic samples created by this GAN.


What Makes a Movie Get Success? A Visual Analytics Approach

It is common for people to choose their next movie or show based on other viewers' reported experiences, such as those the Internet Movie Database (IMDb) presents. In this paper, we inspect the IMDb public datasets, process them, and use a visual analytics approach to understand what makes a movie successful among its fans. The exploration focuses on the regions into which titles are translated and on how the success of a title relates to its cast, crew, and award nominations and wins. Our methodology consists of formulating hypotheses from exploratory data analysis (EDA) and then testing them through visual analytics.
