GANs in the Panorama of Synthetic Data Generation Methods
This paper focuses on the creation and evaluation of synthetic
data to address the challenges of imbalanced datasets in machine
learning (ML) applications, using fake news detection as a case
study. We conducted a thorough literature review on generative
adversarial networks (GANs) for tabular data, synthetic data
generation methods, and synthetic data quality assessment. By
augmenting a public news dataset with synthetic data generated by
different GAN architectures, we demonstrate the potential of
synthetic data to improve ML models' performance in fake news
detection. Our results show a significant improvement in
classification performance, especially in the underrepresented
class. We also modify and extend a data usage approach to evaluate
the quality of synthetic data and investigate the relationship
between synthetic data quality and data augmentation performance
in classification tasks. We found a positive correlation between
synthetic data quality and performance in the underrepresented
class, highlighting the importance of high-quality synthetic data
for effective data augmentation.
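The augmentation step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's pipeline: `gaussian_sampler` is a hypothetical stand-in for a trained GAN generator (it is not a GAN), and the function names `augment_minority` and `recall_for_class` are introduced here for illustration only.

```python
import random
import statistics
from collections import Counter


def gaussian_sampler(real_rows, n, rng):
    """Hypothetical stand-in for a trained GAN generator (NOT a GAN):
    draws each feature independently from a Gaussian fitted to the
    real rows of one class."""
    columns = list(zip(*real_rows))
    means = [statistics.fmean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    return [
        [rng.gauss(m, s) for m, s in zip(means, stdevs)]
        for _ in range(n)
    ]


def augment_minority(rows, labels, sampler=gaussian_sampler, seed=0):
    """Append synthetic rows until every class is as large as the
    largest one. The real rows are kept unchanged and come first."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    new_rows, new_labels = list(rows), list(labels)
    for cls, count in counts.items():
        deficit = target - count
        if deficit == 0:
            continue  # already a majority-sized class
        real = [r for r, y in zip(rows, labels) if y == cls]
        new_rows.extend(sampler(real, deficit, rng))
        new_labels.extend([cls] * deficit)
    return new_rows, new_labels


def recall_for_class(y_true, y_pred, cls):
    """Recall on a single class, e.g. the underrepresented one."""
    relevant = [p for t, p in zip(y_true, y_pred) if t == cls]
    if not relevant:
        return 0.0
    return sum(p == cls for p in relevant) / len(relevant)
```

In use, a classifier would be trained once on the original data and once on the output of `augment_minority`, and `recall_for_class` compared on a held-out set of real samples only, so that synthetic rows never leak into the evaluation.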
Survey on Synthetic Data Generation, Evaluation Methods and GANs
Synthetic data consists of artificially generated data. When data
are scarce, or of poor quality, synthetic data can be used, for
example, to improve the performance of machine learning models.
Generative adversarial networks (GANs) are state-of-the-art deep
generative models that can generate novel synthetic samples that
follow the underlying data distribution of the original dataset.
Reviews of synthetic data generation and of GANs have already been
written. However, to the best of our knowledge, none has explicitly
combined these two topics. This survey aims to fill this gap by
covering synthetic data generation and GANs together, giving new
researchers in the field a solid starting point with a general
overview of the key contributions and useful references. We
reviewed the state of the art by querying four major databases: Web
of Science (WoS), Scopus, IEEE Xplore, and ACM Digital Library. This
allowed us to gain insights into the most relevant authors, the
most relevant scientific journals in the area, the most cited
papers, the most significant research areas, the most important
institutions, and the most relevant GAN architectures. We
thoroughly reviewed GANs, their most common training problems, and
their most important breakthroughs, with a focus on GAN
architectures for tabular data. We also discussed the main
algorithms for generating synthetic data and their applications,
and gave our assessment of these methods. Finally, we reviewed the main
techniques for evaluating the quality of synthetic data
(especially tabular data) and provided a schematic overview of the
information presented in this paper.
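For readers new to the topic, the adversarial training that the survey reviews is usually stated as the minimax objective of the original GAN formulation. The notation below (generator G, discriminator D, data distribution p_data, noise prior p_z) is the standard one and is an assumption here; it is not taken from the abstract.

```latex
\min_{G} \max_{D} V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)} \left[ \log D(x) \right]
  + \mathbb{E}_{z \sim p_{z}(z)} \left[ \log \left( 1 - D(G(z)) \right) \right]
```

The discriminator D maximizes V by separating real samples from generated ones, while the generator G minimizes it by producing samples that D cannot tell apart from real data.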
On Creation of Synthetic Samples from GANs for Fake News
Identification Algorithms
Generative Adversarial Networks (GANs) are by now well established
for creating synthetic images for medical purposes. This is
probably their best use so far, as the results can easily be
checked by eye by specialists. In fake news detection, neural
models (and deep learning) have recently been shown to provide a
considerable improvement over standard classifiers. Yet the main
problem remains the lack of data, mostly fake news data, to feed
these models. In this paper, we address that by proposing the use
of a GAN. Results show a better capacity to generalize when the
models are trained on a dataset extended with synthetic samples
created by this GAN.
What Makes a Movie Get Success? A Visual Analytics Approach
It is common for people to choose their next movie or show based on
other viewers' reported experiences, such as those presented on the
Internet Movie Database (IMDb). In this paper, we inspect the IMDb
public datasets, process them, and use a visual analytics approach
to understand what makes a movie successful among its fans. The
exploration focuses on the regions for which titles are translated
and on how the success of a title relates to its cast, crew, and
award nominations and wins. Our methodology formulates hypotheses
from exploratory data analysis (EDA) and then tests them through
visual analytics.