Conference Paper

Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks

Authors: Alec Radford, Luke Metz, Soumith Chintala

Abstract

In recent years, supervised learning with convolutional networks (CNNs) has seen huge adoption in computer vision applications. Comparatively, unsupervised learning with CNNs has received less attention. In this work we hope to help bridge the gap between the success of CNNs for supervised learning and unsupervised learning. We introduce a class of CNNs called deep convolutional generative adversarial networks (DCGANs) that have certain architectural constraints, and demonstrate that they are a strong candidate for unsupervised learning. Training on various image datasets, we show convincing evidence that our deep convolutional adversarial pair learns a hierarchy of representations from object parts to scenes in both the generator and discriminator. Additionally, we use the learned features for novel tasks, demonstrating their applicability as general image representations.
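The architectural constraints the abstract refers to are commonly summarized as: replace pooling with strided and fractionally-strided convolutions, use batch normalization in both networks, remove fully connected hidden layers, and use ReLU activations in the generator (Tanh at the output) with LeakyReLU in the discriminator. Below is a minimal PyTorch sketch of a generator in this style; the layer widths and the 64×64 output resolution are illustrative choices, not the paper's exact configuration.

```python
# A minimal DCGAN-style generator sketch: fractionally-strided (transposed)
# convolutions for upsampling, batch normalization, ReLU activations, and a
# Tanh output, with no fully connected hidden layers or pooling.
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, z_dim=100, feat=64, channels=3):
        super().__init__()
        self.net = nn.Sequential(
            # latent z (z_dim x 1 x 1) -> 4x4 feature map
            nn.ConvTranspose2d(z_dim, feat * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(feat * 8), nn.ReLU(True),
            # 4x4 -> 8x8
            nn.ConvTranspose2d(feat * 8, feat * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feat * 4), nn.ReLU(True),
            # 8x8 -> 16x16
            nn.ConvTranspose2d(feat * 4, feat * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feat * 2), nn.ReLU(True),
            # 16x16 -> 32x32
            nn.ConvTranspose2d(feat * 2, feat, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feat), nn.ReLU(True),
            # 32x32 -> 64x64 image scaled to [-1, 1] by Tanh
            nn.ConvTranspose2d(feat, channels, 4, 2, 1, bias=False),
            nn.Tanh(),
        )

    def forward(self, z):
        return self.net(z)

# Usage: sample 16 latent vectors and generate 64x64 RGB images.
g = Generator()
fake = g(torch.randn(16, 100, 1, 1))  # shape: (16, 3, 64, 64)
```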


... Deep Convolutional Generative Adversarial Networks (DCGANs) are an extension of the GAN model, first introduced by Radford, A., et al. [13]. In that paper, the DCGAN architecture explicitly uses convolutional-transpose layers in the generator and convolutional layers in the discriminator. ...
... Following the DCGAN paper by Radford, A., et al. [13], the WGAN-GP paper by Gulrajani, I., et al. [14], and the tutorial provided by PyTorch (Inkawhich), the DCGAN and WGAN-GP share the same generator and discriminator architecture and differ only in the loss function: the DCGAN uses a binary cross-entropy loss, whereas the WGAN-GP uses the Wasserstein distance with a gradient penalty. ...
... WGAN-GP was first introduced by Gulrajani, I., et al. [14], where it uses a different loss function for both the generator and the discriminator compared with the DCGAN described in Radford, A., et al. [13]. In the WGAN-GP architecture, the discriminator, called the critic, omits the sigmoid in its last layer. ...
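To make the loss-function difference described in these excerpts concrete, the following is a hedged PyTorch sketch of the two discriminator objectives: DCGAN's binary cross-entropy versus the WGAN-GP critic loss with gradient penalty. Here `D` stands for any discriminator network returning one logit or score per sample (with no sigmoid for the critic); the function names and the default penalty weight λ = 10 follow Gulrajani et al., but the rest is our simplification.

```python
# Sketch of the two discriminator losses: BCE for DCGAN, Wasserstein
# distance plus gradient penalty for WGAN-GP.
import torch
import torch.nn.functional as F

def dcgan_d_loss(D, real, fake):
    # Binary cross-entropy: push D(real) toward 1 and D(fake) toward 0.
    ones = torch.ones(real.size(0), 1, device=real.device)
    zeros = torch.zeros(fake.size(0), 1, device=fake.device)
    return (F.binary_cross_entropy_with_logits(D(real), ones)
            + F.binary_cross_entropy_with_logits(D(fake), zeros))

def wgan_gp_d_loss(D, real, fake, lam=10.0):
    # Critic loss: maximize D(real) - D(fake) (minimized as its negation),
    # plus a penalty pushing the gradient norm of D toward 1 on points
    # interpolated between real and fake samples.
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grad = torch.autograd.grad(D(interp).sum(), interp, create_graph=True)[0]
    penalty = ((grad.view(grad.size(0), -1).norm(2, dim=1) - 1) ** 2).mean()
    return D(fake).mean() - D(real).mean() + lam * penalty
```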
... Well-trained GANs are born out of contention between two models learning simultaneously: the generator creating better artificial images and the discriminator getting better at detecting them. Since their first appearance, GANs have been modified into different forms such as the Deep Convolutional GAN (DCGAN) [5], Conditional GAN [6], GANs with inference models [7], [8], Adversarial Autoencoders [9], [10] and the Adversarial Variational Bayes (AVB) framework [11]. GANs, however, are plagued with issues such as finding a Nash equilibrium in a constantly changing landscape [12] and failure to converge [9]. ...
... Recently, Alec Radford et al. showed that convolutional neural networks can be fused with generative adversarial learning to learn a hierarchy of features from image parts and scenes in both the generator and the discriminator. These Deep Convolutional Generative Adversarial Networks (DCGANs) are built following certain architectural guidelines to generate realistic images [5]. The high-level design of a DCGAN is shown in Figure 2. ...
... $\sin\left(\tan^{-1}\left(\frac{dy}{dx}\right)\right)$ (5), where $y$ is the distance of each point to its $k$-th nearest point and $x$ is the index of the points. The resulting distribution after applying the sine and inverse tangent is shown in Figure 11. ...
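Under our reading of this excerpt, the transform maps the slope of the sorted k-nearest-neighbor distance curve into [0, 1), since sin(tan⁻¹(s)) = s/√(1+s²), which normalizes steepness for locating the knee of the curve. A small numeric illustration with made-up distances:

```python
# Numeric illustration of Eq. (5): map the slope dy/dx of the sorted k-NN
# distance curve through sin(arctan(.)), yielding values in [0, 1) that are
# small on flat regions and near 1 where the curve shoots up. The data here
# is invented purely for illustration.
import numpy as np

y = np.array([0.10, 0.12, 0.15, 0.20, 0.35, 0.90, 2.50])  # sorted k-NN distances
x = np.arange(len(y), dtype=float)                         # point indices

slope = np.gradient(y, x)        # dy/dx
t = np.sin(np.arctan(slope))     # Eq. (5)

print(np.round(t, 3))  # small on the flat part, ~0.85 where the curve rises
```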
Article
Full-text available
Researchers gravitate towards Generative Adversarial Networks (GANs) to create artificial images. However, GANs suffer from convergence issues, mode collapse, and overall complexity in balancing the Nash equilibrium. Images generated are often distorted, rendering them useless. We propose a combination of Variational Autoencoders (VAEs) and a statistical oversampling method called K-Nearest Neighbor OveRsampling (KNNOR) to create artificial images. This combination of VAE and KNNOR results in more life-like images with reduced distortion. We fine-tune several pre-trained networks on a separate set of real and fake face images to test images generated by our method against images generated by conventional Deep Convolutional GANs (DCGANs). We also compare the combination of VAEs and the Synthetic Minority Oversampling Technique (SMOTE) to establish the efficacy of KNNOR against naive oversampling methods. Not only are our methods better able to convince the classifiers that the generated images are authentic, but the models are also half the size of DCGANs. The code is available on GitHub for public use.
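KNNOR itself includes density-aware safeguards beyond what this abstract states; the sketch below shows only the basic nearest-neighbor interpolation idea it shares with SMOTE, with every name and parameter being our own illustrative choice rather than the authors' implementation.

```python
# Simplified k-nearest-neighbor oversampling sketch in the spirit of
# SMOTE/KNNOR: each synthetic point is a random interpolation between a
# minority-class sample and one of its k nearest minority neighbors.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_oversample(X_min, n_new, k=5, seed=0):
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)              # idx[:, 0] is the point itself
    synth = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))           # random minority sample
        j = idx[i, rng.integers(1, k + 1)]     # one of its k neighbors
        a = rng.random()                       # interpolation factor in [0, 1)
        synth.append(X_min[i] + a * (X_min[j] - X_min[i]))
    return np.asarray(synth)

X_min = np.random.rand(50, 8)                  # toy minority-class features
X_new = knn_oversample(X_min, n_new=100)       # 100 synthetic samples
```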
... In GAN-based algorithms, a critical point is to define a distance that appropriately measures the agreement between the distribution of the generated samples $P_\theta$ and the target distribution $P_{data}$. Different definitions of the distance between distributions lead to different GANs, e.g., WGAN [13], SobolevGAN [14], MMD-GAN [15], and others [16,17,13,18]. This paper introduces a novel approach to generative modeling using a loss function based on elastic interaction energy (EIE). ...
... Moreover, we observe that the feature space exhibits greater diversity when this term is included, thereby enhancing the model's ability to capture the underlying data distribution. Experimental results show that EIEG GAN outperforms several standard GAN-based models [8,17,13,27,15] in terms of both sample diversity and training stability. ...
... Network architecture: We use the neural network architecture of DCGAN [17] for the generator $G_\theta$, and replace the output layer of the discriminator with an $n$-dimensional output to serve as our feature transformation network $D_\phi$. ...
Preprint
In this paper, we propose a novel approach to generative modeling using a loss function based on elastic interaction energy (EIE), which is inspired by the elastic interaction between defects in crystals. The utilization of the EIE-based metric presents several advantages, including its long range property that enables consideration of global information in the distribution. Moreover, its inclusion of a self-interaction term helps to prevent mode collapse and captures all modes of distribution. To overcome the difficulty of the relatively scattered distribution of high-dimensional data, we first map the data into a latent feature space and approximate the feature distribution instead of the data distribution. We adopt the GAN framework and replace the discriminator with a feature transformation network to map the data into a latent space. We also add a stabilizing term to the loss of the feature transformation network, which effectively addresses the issue of unstable training in GAN-based algorithms. Experimental results on popular datasets, such as MNIST, FashionMNIST, CIFAR-10, and CelebA, demonstrate that our EIEG GAN model can mitigate mode collapse, enhance stability, and improve model performance.
... One of the most important difficulties when designing a metric for GANs is the ability to capture both the quality and diversity of the generated data. Although this is still an open issue, there is consensus on some metrics, and many papers measure their results with the same ones [44][45][46][47][48][49]. The main problem in the time series domain is that it is not always possible to adapt the metrics to the particularities of this field, because most of the metrics are designed to be useful in computer vision-related tasks. ...
... An example of this use is the one proposed with SpecGAN [61], which operates on sound spectrograms that represent audio samples. This approach uses the deep convolutional GAN (DCGAN) [44] as the main algorithm for data augmentation, but prior to that, it processes the audio signal to generate an image for each audio track. The process of transforming audio into images "can be approximately inverted", in the authors' own words. ...
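SpecGAN's actual preprocessing (scaling, clipping, and the approximate inversion the authors mention) is more involved; the sketch below only illustrates the generic audio-to-log-magnitude-spectrogram step that turns a waveform into a 2-D array a DCGAN can treat as an image. Parameters are illustrative assumptions.

```python
# Rough sketch of the audio-to-image step: short-time Fourier transform,
# log-magnitude, then normalization to [0, 1] so the result can be fed to an
# image GAN. Real pipelines add per-bin standardization and clipping.
import numpy as np
from scipy.signal import stft

def audio_to_spectrogram_image(waveform, fs=16000, nperseg=256):
    _, _, Z = stft(waveform, fs=fs, nperseg=nperseg)
    mag = np.log1p(np.abs(Z))      # log-magnitude spectrogram
    return mag / mag.max()         # 2-D array: freq bins x time frames

audio = np.random.randn(16000)     # one second of toy "audio"
img = audio_to_spectrogram_image(audio)
print(img.shape)                   # e.g. (129, 126)
```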
Article
Full-text available
With the latest advances in deep learning-based generative models, it has not taken long to take advantage of their remarkable performance in the area of time series. Deep neural networks used to work with time series heavily depend on the size and consistency of the datasets used in training. These features are not usually abundant in the real world, where they are usually limited and often have constraints that must be guaranteed. Therefore, an effective way to increase the amount of data is by using data augmentation techniques, either by adding noise or permutations and by generating new synthetic data. This work systematically reviews the current state of the art in the area to provide an overview of all available algorithms and proposes a taxonomy of the most relevant research. The efficiency of the different variants will be evaluated as a central part of the process, as well as the different metrics to evaluate the performance and the main problems concerning each model will be analysed. The ultimate aim of this study is to provide a summary of the evolution and performance of areas that produce better results to guide future researchers in this field.
... In the first stage, we model P(R|Z) as an unsupervised GAN [31] which, being trained on a large number of unlabelled photos of a particular class, is capable of generating realistic photos G(z) given a random vector z ∼ N(0, 1) [31]. As GAN models learn the data distribution [68], we can loosely assume that any photo can be generated by sampling a specific z* from the GAN latent space [1]. Once the GAN model is trained, in the second stage, keeping G(·) fixed, we aim to learn P(Z|S) as a sketch mapper that encodes the input sketch S into a latent code Z corresponding to the paired photo R in the pre-trained GAN latent space. ...
... Instead of passing a random noise vector z ∈ Z directly as the network input [68], StyleGAN [46,47] eliminates the input layer and always starts from a learned constant tensor of size 4×4×d. The generator network G(·) consists of a number of progressive resolution blocks, each having the sequence conv3×3 → AdaIN → conv3×3 → AdaIN [46,47]. ...
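The conv3×3 → AdaIN sequence described above can be made concrete with a minimal adaptive instance normalization (AdaIN) module: the content feature map is instance-normalized, then rescaled and shifted per channel by statistics derived from the latent code. The linear layer mapping w to (scale, bias) is our simplification of StyleGAN's learned affine transform "A".

```python
# Minimal AdaIN sketch: normalize content features per channel, then apply a
# style-dependent scale and bias computed from the latent code w.
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    def __init__(self, w_dim, channels):
        super().__init__()
        self.affine = nn.Linear(w_dim, channels * 2)   # w -> (scale, bias)
        self.norm = nn.InstanceNorm2d(channels)

    def forward(self, x, w):
        scale, bias = self.affine(w).chunk(2, dim=1)
        scale = scale[:, :, None, None]                # broadcast over H, W
        bias = bias[:, :, None, None]
        # (1 + scale) keeps the modulation near identity at initialization.
        return (1 + scale) * self.norm(x) + bias

# Usage: modulate a 64-channel 16x16 feature map with a 512-d latent code.
ada = AdaIN(w_dim=512, channels=64)
out = ada(torch.randn(2, 64, 16, 16), torch.randn(2, 512))  # (2, 64, 16, 16)
```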
Preprint
Full-text available
Given an abstract, deformed, ordinary sketch from untrained amateurs like you and me, this paper turns it into a photorealistic image - just like those shown in Fig. 1(a), all non-cherry-picked. We differ significantly from prior art in that we do not dictate an edgemap-like sketch to start with, but aim to work with abstract free-hand human sketches. In doing so, we essentially democratise the sketch-to-photo pipeline, "picturing" a sketch regardless of how good you sketch. Our contribution at the outset is a decoupled encoder-decoder training paradigm, where the decoder is a StyleGAN trained on photos only. This importantly ensures that generated results are always photorealistic. The rest is then all centred around how best to deal with the abstraction gap between sketch and photo. For that, we propose an autoregressive sketch mapper trained on sketch-photo pairs that maps a sketch to the StyleGAN latent space. We further introduce specific designs to tackle the abstract nature of human sketches, including a fine-grained discriminative loss on the back of a trained sketch-photo retrieval model, and a partial-aware sketch augmentation strategy. Finally, we showcase a few downstream tasks our generation model enables, amongst them is showing how fine-grained sketch-based image retrieval, a well-studied problem in the sketch community, can be reduced to an image (generated) to image retrieval task, surpassing state-of-the-arts. We put forward generated results in the supplementary for everyone to scrutinise.
... Furthermore, these existing systems are not directly comparable to one another due to variations in the experimental environments and the data sets on which they were trained. In contrast to the augmentation techniques mentioned above, in this study we apply three techniques, SMOTE [16], M2m [26] and DCGAN [27], to deal with an imbalanced data set in plant disease detection. Most deep learning research for plant disease detection largely ignores the various preprocessing strategies for disease detection and classification; this work therefore also includes an in-depth analysis of them. ...
... Furthermore, our method is implemented in TensorFlow [38]. The generative and discriminative networks are trained with the Adam optimizer [27] with β1 = 0.5, β2 = 0.999 and a learning rate of 0.0001. The batch size is 32 and training runs for 2000 epochs. ...
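For reference, the quoted optimizer settings look as follows in code. This is shown in PyTorch purely for illustration (the cited work used TensorFlow), and the two one-layer networks are placeholders for the paper's actual generator and discriminator.

```python
# Adam with beta1 = 0.5, beta2 = 0.999 and learning rate 1e-4, matching the
# hyperparameters quoted above. The networks are trivial stand-ins.
import torch
import torch.nn as nn

generator = nn.Sequential(nn.Linear(100, 784))      # placeholder networks;
discriminator = nn.Sequential(nn.Linear(784, 1))    # real models are CNNs

opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4, betas=(0.5, 0.999))
```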
Article
Full-text available
The foundation of effectively predicting plant disease in the early stage using deep learning algorithms is ideal for addressing food insecurity, inevitably drawing researchers and agricultural specialists to contribute to its effectiveness. The input preprocessor, abnormalities of the data (i.e., incomplete and nonexistent features, class imbalance), classifier, and decision explanation are typical components of a plant disease detection pipeline based on deep learning that accepts an image as input and outputs a diagnosis. Data sets related to plant diseases frequently display a magnitude imbalance due to the scarcity of disease outbreaks in real field conditions. This study examines the effects of several preprocessing methods, class-imbalance approaches, and deep learning classifiers in the pipeline for detecting plant diseases on our data set. We notably want to evaluate whether additional preprocessing and effective handling of data inconsistencies in the plant disease pipeline may considerably assist deep learning classifiers. The evaluation's findings indicate that contrast limited adaptive histogram equalization (CLAHE) combined with image sharpening and a generative adversarial network (GAN)-based approach for resampling performed the best among the preprocessing and resampling techniques, with an average classification accuracy of 97.69% and an average F1-score of 97.62% when fed through ResNet-50 as the deep learning classifier. Lastly, this study provides a general workflow of a disease detection system that allows each component to be individually focused on depending on necessity.
... For a wide variety of image generation tasks, most approaches impose monolithic generators on the entire image, including unsupervised image generation [17,18,35], image-to-image translation [29,33,42,51], image inpainting [24,45,46], and image editing [15,26]. They share the same network structure and weights to generate all the content without specialized submodels for different semantic regions or classes. ...
Preprint
Semantic image editing provides users with a flexible tool to modify a given image guided by a corresponding segmentation map. In this task, the features of the foreground objects and the backgrounds are quite different. However, all previous methods handle backgrounds and objects as a whole using a monolithic model. Consequently, they remain limited in processing content-rich images and suffer from generating unrealistic objects and texture-inconsistent backgrounds. To address this issue, we propose a novel paradigm, Semantic Image Editing by Disentangling Object and Background (SIEDOB), the core idea of which is to explicitly leverage several heterogeneous subnetworks for objects and backgrounds. First, SIEDOB disassembles the edited input into background regions and instance-level objects. Then, we feed them into dedicated generators. Finally, all synthesized parts are embedded in their original locations, and a fusion network is used to obtain a harmonized result. Moreover, to produce high-quality edited images, we propose several innovative designs, including a Semantic-Aware Self-Propagation Module, a Boundary-Anchored Patch Discriminator, and a Style-Diversity Object Generator, and integrate them into SIEDOB. We conduct extensive experiments on the Cityscapes and ADE20K-Room datasets and show that our method remarkably outperforms the baselines, especially in synthesizing realistic and diverse objects and texture-consistent backgrounds.
... GANs are composed of two networks: a discriminator that distinguishes real data samples from generated samples and a generator that tries to generate samples to fool the discriminator. Many important works have been proposed to improve the original GAN for more stabilized training or producing high-quality samples, such as by proposing better loss functions or regularizations (Arjovsky, Chintala, and Bottou 2017; Gulrajani et al. 2017; Miyato et al. 2018), changing network structures (Radford, Metz, and Chintala 2015; Karras, Laine, and Aila 2019; Brock, Donahue, and Simonyan 2018; Schonfeld, Schiele, and Khoreva 2020), or combining GANs with inference networks or autoencoders (Donahue, Krähenbühl, and Darrell 2016; Dumoulin et al. 2016; Larsen et al. 2016; Srivastava et al. 2017; Ulyanov, Vedaldi, and Lempitsky 2018). The proposed SCAT is related to U-net GAN (Schonfeld, Schiele, and Khoreva 2020), but it differs from U-net GAN in two aspects: (1) SCAT is inspired by how humans recognize low-quality repaired images; (2) in contrast to simply classifying all pixels as real or fake as in U-net GAN, SCAT identifies the generated and the valid regions in the input images, which is specially tailored for image inpainting tasks. ...
Preprint
This paper presents a new adversarial training framework for image inpainting with segmentation confusion adversarial training (SCAT) and contrastive learning. SCAT plays an adversarial game between an inpainting generator and a segmentation network, which provides pixel-level local training signals and can adapt to images with free-form holes. By combining SCAT with standard global adversarial training, the new adversarial training framework exhibits the following three advantages simultaneously: (1) the global consistency of the repaired image, (2) the local fine texture details of the repaired image, and (3) the flexibility of handling images with free-form holes. Moreover, we propose the textural and semantic contrastive learning losses to stabilize and improve our inpainting model's training by exploiting the feature representation space of the discriminator, in which the inpainting images are pulled closer to the ground truth images but pushed farther from the corrupted images. The proposed contrastive losses better guide the repaired images to move from the corrupted image data points to the real image data points in the feature representation space, resulting in more realistic completed images. We conduct extensive experiments on two benchmark datasets, demonstrating our model's effectiveness and superiority both qualitatively and quantitatively.
... Therefore, Fabbri et al. [108] focus on the poor-resolution and occlusion challenges in recognizing attributes of people, such as gender, race, and clothing, in surveillance systems. The authors propose a model based on DCGAN [109] to improve the quality of images in order to overcome the mentioned problems. The model has three networks: one for attribute classification from full-body images, while the other two networks attempt to enhance the resolution and recover from occlusion. ...
Article
Full-text available
Although current computer vision systems are closer to the human intelligence when it comes to comprehending the visible world than previously, their performance is hindered when objects are partially occluded. Since we live in a dynamic and complex environment, we encounter more occluded objects than fully visible ones. Therefore, instilling the capability of amodal perception into those vision systems is crucial. However, overcoming occlusion is difficult and comes with its own challenges. The generative adversarial network (GAN), on the other hand, is renowned for its generative power in producing data from a random noise distribution that approaches the samples that come from real data distributions. In this survey, we outline the existing works wherein GAN is utilized in addressing the challenges of overcoming occlusion, namely amodal segmentation, amodal content completion, order recovery, and acquiring training data. We provide a summary of the type of GAN, loss function, the dataset, and the results of each work. We present an overview of the implemented GAN architectures in various applications of amodal completion. We also discuss the common objective functions that are applied in training GAN for occlusion-handling tasks. Lastly, we discuss several open issues and potential future directions.
... Moreover, the sample size is a key impediment. Even though unsupervised models like DCGAN (Radford et al., 2015) can be trained on large numbers of samples, handling downstream tasks is still challenging. Multi-modal task processing, including text-to-image generation, has become a hot area of research in recent years, as the drawbacks of limited training samples have been greatly reduced. ...
Preprint
The Stable Diffusion model has been extensively employed in the study of architectural image generation, but there is still room to enhance the controllability of the generated image content. A multi-network combined text-to-building-facade image generation method is proposed in this work. We first fine-tuned the Stable Diffusion model on the CMP Facades dataset using the LoRA (Low-Rank Adaptation) approach, then applied the ControlNet model to further control the output. Finally, we contrasted the facade generation outcomes under various architectural-style text contents and control strategies. The results demonstrate that the LoRA training approach significantly decreases the cost of fine-tuning the large Stable Diffusion model, and the addition of the ControlNet model increases the controllability of text-to-building-facade image generation. This provides a foundation for subsequent studies on the generation of architectural images.
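A hedged sketch of how such a LoRA-plus-ControlNet pipeline can be assembled with the Hugging Face diffusers library is shown below. The checkpoint names, the LoRA weight path, and the choice of an edge-map condition are all our illustrative assumptions; the paper's exact models and training setup may differ.

```python
# Sketch: Stable Diffusion with a ControlNet condition and LoRA weights
# loaded on top, generating a facade image guided by a text prompt and an
# edge map. Paths and model IDs are assumptions for illustration.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet, torch_dtype=torch.float16).to("cuda")
pipe.load_lora_weights("path/to/facade-lora")   # hypothetical LoRA weights

cond = load_image("facade_edges.png")           # hypothetical edge map
image = pipe("a modern glass office building facade",
             image=cond, num_inference_steps=30).images[0]
image.save("generated_facade.png")
```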
... The GAN has become one of the hottest topics in artificial intelligence and machine learning, and several variants have been developed in recent years (Pan et al. [96]). Typical GAN models include conditional generative adversarial nets (CGAN) [97], the semi-supervised GAN (SGAN) [98], deep convolutional generative adversarial networks (DCGAN) [99], and the Wasserstein GAN (WGAN) [6]. In reliability analysis, imbalanced data and high-dimensional cases may occur, which hinders further study. ...
Article
One of the most significant and growing research fields in mechanical and civil engineering is Structural Reliability Analysis (SRA). A reliable and precise SRA usually has to deal with complicated and numerically expensive problems. Artificial intelligence (AI)-based and, specifically, deep learning (DL)-based methods have been applied to SRA problems to reduce the computational cost and to improve the accuracy of reliability estimation. This article reviews the recent advances in using DL models in SRA problems. The review includes the most common categories of DL-based methods used in SRA. More specifically, the application of supervised, unsupervised, and hybrid deep learning methods in SRA is explained. In this paper, the supervised methods for SRA are categorized as Multi-Layer Perceptron (MLP), Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM), Bidirectional LSTM (Bi-LSTM) and Gated Recurrent Units (GRU). For the unsupervised methods, we have investigated approaches such as Generative Adversarial Networks (GAN), Autoencoders (AE), Self-Organizing Maps (SOM), Restricted Boltzmann Machines (RBM), and Deep Belief Networks (DBN). We have made a comprehensive survey of these methods in SRA. Aiming towards an efficient SRA, deep learning-based methods are applied for approximating the limit state function (LSF) with First/Second Order Reliability Methods (FORM/SORM), Monte Carlo simulation (MCS), or MCS with importance sampling (IS). Accordingly, the current paper focuses on the structure of different DL-based models and the applications of each DL method in various SRA problems. This survey helps researchers in mechanical and civil engineering, especially those who are engaged with structural and reliability analysis or dealing with quality assurance problems.
... GANs: have been widely used for image generation and synthesis tasks [13]. In recent work, several improvements have been proposed [2,14,22,30,36] over the original architecture. For example, the popularly used StyleGAN [22] model uses a mapping network to generate style codes which are then used to modulate the weights of the Conv layers. ...
Preprint
Implicit neural representations (INR) have gained significant popularity for signal and image representation for many end-tasks, such as superresolution, 3D modeling, and more. Most INR architectures rely on sinusoidal positional encoding, which accounts for high-frequency information in data. However, the finite encoding size restricts the model's representational power. Higher representational power is needed to go from representing a single given image to representing large and diverse datasets. Our approach addresses this gap by representing an image with a polynomial function and eliminates the need for positional encodings. Therefore, to achieve a progressively higher degree of polynomial representation, we use element-wise multiplications between features and affine-transformed coordinate locations after every ReLU layer. The proposed method is evaluated qualitatively and quantitatively on large datasets like ImageNet. The proposed Poly-INR model performs comparably to state-of-the-art generative models without any convolution, normalization, or self-attention layers, and with far fewer trainable parameters. With much fewer training parameters and higher representative power, our approach paves the way for broader adoption of INR models for generative modeling tasks in complex domains. The code is available at \url{https://github.com/Rajhans0/Poly_INR}
... One significant kind of improvement is designing advanced network architectures for both the generator and discriminator. For example, convolutional layers [17], residual connection [18], and self-attention layers [19] have become standard components [6], [20] for constructing GANs. Nowadays, StyleGANs [3]- [5] develop the most popular generator structure and are capable of synthesizing high-resolution images in various domains. ...
Article
Full-text available
Training generative adversarial networks (GANs) using limited training data is challenging since the original discriminator is prone to overfitting. The recently proposed label augmentation technique complements categorical data augmentation approaches for the discriminator, showing improved data efficiency in training GANs but lacking a theoretical basis. In this paper, we propose a novel regularization approach for the label-augmented discriminator to further improve the data efficiency of training GANs, with a theoretical basis. Specifically, the proposed regularization adaptively constrains the predictions of the label-augmented discriminator on generated data to be close to the moving averages of its historical predictions on real data, and vice versa. We theoretically establish a connection between the objective function with the proposed regularization and an f-divergence that is more robust than the previously used reversed Kullback-Leibler divergence. Experimental results on various datasets and diverse architectures show the significantly improved data efficiency of our proposed method compared to state-of-the-art data-efficient GAN training approaches under limited training data regimes.
... The limitation of AlignDRAW [274] is that the generated images are unrealistic and require an additional GAN for post-processing. Based on a deep convolutional generative adversarial network (DC-GAN) [340], the work in [351] is the first end-to-end differentiable architecture from the character level to the pixel level. To generate high-resolution images while stabilizing the training process, StackGAN [529] and StackGAN++ [530] propose a multistage mechanism in which multiple generators produce images of different scales, and high-resolution image generation is conditioned on the low-resolution images. ...
Preprint
Full-text available
As ChatGPT goes viral, generative AI (AIGC, a.k.a. AI-generated content) has made headlines everywhere because of its ability to analyze and create text, images, and beyond. With such overwhelming media coverage, it is almost impossible for us to miss the opportunity to glimpse AIGC from a certain angle. In the era of AI transitioning from pure analysis to creation, it is worth noting that ChatGPT, with its most recent language model GPT-4, is just a tool out of numerous AIGC tasks. Impressed by the capability of ChatGPT, many people are wondering about its limits: can GPT-5 (or other future GPT variants) help ChatGPT unify all AIGC tasks for diversified content creation? Toward answering this question, a comprehensive review of existing AIGC tasks is needed. As such, our work comes to fill this gap promptly by offering a first look at AIGC, ranging from its techniques to applications. Modern generative AI relies on various technical foundations, ranging from model architecture and self-supervised pretraining to generative modeling methods (like GAN and diffusion models). After introducing the fundamental techniques, this work focuses on the technological development of various AIGC tasks based on their output type, including text, images, videos, 3D content, etc., which depicts the full potential of ChatGPT's future. Moreover, we summarize their significant applications in some mainstream industries, such as education and creative content. Finally, we discuss the challenges currently faced and present an outlook on how generative AI might evolve in the near future.
... Generative adversarial networks [10] launched the generative revolution in image generation [6,17,36,19] and text generation [46,5,13]. This self-supervised training scheme enables the networks to consume large unlabeled realistic datasets, and provides a powerful baseline in various downstream tasks like image colorization [32], image compositing [49], and text synthesis [23]. ...
Preprint
We develop a diffusion-based approach for various document layout sequence generation. Layout sequences specify the contents of a document design in an explicit format. Our novel diffusion-based approach works in the sequence domain rather than the image domain in order to permit more complex and realistic layouts. We also introduce a new metric, Document Earth Mover's Distance (Doc-EMD). By considering similarity between heterogeneous categories document designs, we handle the shortcomings of prior document metrics that only evaluate the same category of layouts. Our empirical analysis shows that our diffusion-based approach is comparable to or outperforming other previous methods for layout generation across various document datasets. Moreover, our metric is capable of differentiating documents better than previous metrics for specific cases.
... Semi-Supervised Learning (SSL) is attractive because of its capability to further unveil the power of machine learning with abundant cheap unlabeled data [50,39,25,4,46,40]. Due to space limitations, this section only reviews self-training-based methods, one of the most engaging directions in SSL [37,43]. ...
Preprint
Full-text available
In this paper, we improve the challenging monocular 3D object detection problem with a general semi-supervised framework. Specifically, having observed that the bottleneck of this task lies in lacking reliable and informative samples to train the detector, we introduce a novel, simple, yet effective `Augment and Criticize' framework that explores abundant informative samples from unlabeled data for learning more robust detection models. In the `Augment' stage, we present the Augmentation-based Prediction aGgregation (APG), which aggregates detections from various automatically learned augmented views to improve the robustness of pseudo label generation. Since not all pseudo labels from APG are beneficially informative, the subsequent `Criticize' phase is presented. In particular, we introduce the Critical Retraining Strategy (CRS) that, unlike simply filtering pseudo labels using a fixed threshold (e.g., classification score) as in 2D semi-supervised tasks, leverages a learnable network to evaluate the contribution of unlabeled images at different training timestamps. This way, the noisy samples prohibitive to model evolution could be effectively suppressed. To validate our framework, we apply it to MonoDLE and MonoFlex. The two new detectors, dubbed 3DSeMo_DLE and 3DSeMo_FLEX, achieve state-of-the-art results with remarkable improvements for over 3.5% AP_3D/BEV (Easy) on KITTI, showing its effectiveness and generality. Code and models will be released.
... The network architecture used in this work is fiwGAN (Beguš, 2021b), an InfoGAN (Chen et al., 2016) adaptation of the WaveGAN (Donahue et al., 2019) model (which itself is based on DCGAN; Radford et al. 2015). Unlike InfoGAN, the fiwGAN features a separate Q-network and a binary code instead of a one-hot vector which enables featural learning. ...
Preprint
Full-text available
This paper proposes a methodology for discovering meaningful properties in data by exploring the latent space of unsupervised deep generative models. We combine manipulation of individual latent variables to extreme values outside the training range with methods inspired by causal inference into an approach we call causal disentanglement with extreme values (CDEV) and show that this approach yields insights for model interpretability. Using this technique, we can infer what properties of unknown data the model encodes as meaningful. We apply the methodology to test what is meaningful in the communication system of sperm whales, one of the most intriguing and understudied animal communication systems. We train a network that has been shown to learn meaningful representations of speech and test whether we can leverage such unsupervised learning to decipher the properties of another vocal communication system for which we have no ground truth. The proposed technique suggests that sperm whales encode information using the number of clicks in a sequence, the regularity of their timing, and audio properties such as the spectral mean and the acoustic regularity of the sequences. Some of these findings are consistent with existing hypotheses, while others are proposed for the first time. We also argue that our models uncover rules that govern the structure of communication units in the sperm whale communication system and apply them while generating innovative data not shown during training. This paper suggests that an interpretation of the outputs of deep neural networks with causal methodology can be a viable strategy for approaching data about which little is known and presents another case of how deep learning can limit the hypothesis space. Finally, the proposed approach combining latent space manipulation and causal inference can be extended to other architectures and arbitrary datasets.
... A key property of h-space is that it obeys vector arithmetic properties which have previously been demonstrated for GANs by Radford et al. [23]. Specifically, image editing can be done in h-space as follows. ...
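The vector arithmetic attributed to Radford et al. can be sketched in a few lines: average the latent codes of samples that share an attribute, subtract the average of codes that lack it, and add the resulting direction to a new code before decoding. The attribute grouping below is purely illustrative, and `G` stands for any pretrained generator.

```python
# Toy sketch of latent vector arithmetic: derive an attribute direction from
# groups of latent codes and apply it to edit a new sample.
import torch

def edit_direction(z_with_attr, z_without_attr):
    # Direction pointing from "attribute absent" toward "attribute present".
    return z_with_attr.mean(dim=0) - z_without_attr.mean(dim=0)

z_smiling = torch.randn(16, 100)   # codes whose decoded samples show the attribute
z_neutral = torch.randn(16, 100)   # codes whose decoded samples do not
direction = edit_direction(z_smiling, z_neutral)

z = torch.randn(1, 100)            # code to edit
z_edited = z + 1.5 * direction     # move toward the attribute
# images = G(z_edited)             # decode with a pretrained generator G
```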
Preprint
Denoising Diffusion Models (DDMs) have emerged as a strong competitor to Generative Adversarial Networks (GANs). However, despite their widespread use in image synthesis and editing applications, their latent space is still not as well understood. Recently, a semantic latent space for DDMs, coined `$h$-space', was shown to facilitate semantic image editing in a way reminiscent of GANs. The $h$-space is comprised of the bottleneck activations in the DDM's denoiser across all timesteps of the diffusion process. In this paper, we explore the properties of h-space and propose several novel methods for finding meaningful semantic directions within it. We start by studying unsupervised methods for revealing interpretable semantic directions in pretrained DDMs. Specifically, we show that global latent directions emerge as the principal components in the latent space. Additionally, we provide a novel method for discovering image-specific semantic directions by spectral analysis of the Jacobian of the denoiser w.r.t. the latent code. Next, we extend the analysis by finding directions in a supervised fashion in unconditional DDMs. We demonstrate how such directions can be found by relying on either a labeled data set of real images or by annotating generated samples with a domain-specific attribute classifier. We further show how to semantically disentangle the found direction by simple linear projection. Our approaches are applicable without requiring any architectural modifications, text-based guidance, CLIP-based optimization, or model fine-tuning.
... Deep convolutional GANs [18] were the first GANs to use convolutional layers, compared to the initial GAN, which used only fully connected layers. With its simplicity, DCGAN is often the de facto baseline GAN one implements. ...
Article
Full-text available
Generative adversarial networks (GANs) have become increasingly powerful, generating mind-blowing photorealistic images that mimic the content of the datasets they have been trained to replicate. One recurrent theme in medical imaging is whether GANs can be as effective at generating workable medical data as they are at generating realistic RGB images. In this paper, we perform a multi-GAN and multi-application study to gauge the benefits of GANs in medical imaging. We tested various GAN architectures, from the basic DCGAN to more sophisticated style-based GANs, on three medical imaging modalities and organs, namely cardiac cine-MRI, liver CT, and RGB retina images. GANs were trained on well-known and widely utilized datasets, from which their FID scores were computed to measure the visual acuity of their generated images. We further tested their usefulness by measuring the segmentation accuracy of a U-Net trained on these generated images and the original data. The results reveal that GANs are far from being equal, as some are ill-suited for medical imaging applications while others performed much better. The top-performing GANs are capable of generating realistic-looking medical images by FID standards that can fool trained experts in a visual Turing test and comply with some metrics. However, segmentation results suggest that no GAN is capable of reproducing the full richness of medical datasets.
... Goodfellow et al. [4] proposed the Generative Adversarial Network (GAN) in 2014, which relies on the idea of a game between two networks to conduct targeted learning and, on this basis, can be extended to generate realistic and clear images to achieve a blind-deblurring effect. Radford et al. [5] proposed the deep convolutional generative adversarial network in 2016, which improved on GAN to address the defect of GAN's learning instability. The generative model introduces a convolutional network structure, which effectively improves the learning capability of the network. ...
Article
Full-text available
In order to improve the detection accuracy of an algorithm in the complex environment of a coal mine, including low-illumination, motion-blur, occlusions, small-targets, and background-interference conditions; reduce the number of model parameters; improve the detection speed of the algorithm; and make it meet the real-time detection requirements of edge equipment, a real-time obstacle detection method in the driving of driverless rail locomotives based on DeblurGANv2 and improved YOLOv4 is proposed in this study. A blurred image was deblurred using DeblurGANv2. The improved design was based on YOLOv4, and the lightweight feature extraction network MobileNetv2 was used to replace the original CSPDarknet53 network to improve the detection speed of the algorithm. There was a high amount of background interference in the target detection of the coal mine scene. In order to strengthen the attention paid to the target, the SANet attention module was embedded in the Neck network to improve the detection accuracy of the algorithm under low-illumination, target-occlusion, small-target, and other conditions. To further improve the detection accuracy of the algorithm, the K-means++ algorithm was adopted to cluster prior frames, and the focal loss function was introduced to increase the weight loss of small-target samples. The experimental results show that the deblurring of the motion-blurred image can effectively improve the detection accuracy of obstacles and reduce missed detections. Compared with the original YOLOv4 algorithm, the improved YOLOv4 algorithm increases the detection speed by 65.85% to 68 FPS and the detection accuracy by 0.68% to 98.02%.
... Generative adversarial networks (GANs) based on a deeplearning (DL) architecture can be used to generate such synthetic images from different MR contrasts as input [19][20][21][22][23]. The iterative interaction of two networks, one generating images and one learning to differentiate between synthetic and true images [24,25], has already been used on MRI data from a variety of anatomical regions [26][27][28]. In the spine, GANs can generate T2-fs images from conventional T1-w and non-fs T2-w images [15,29]. ...
Article
Full-text available
Objectives: T2-weighted (w) fat sat (fs) sequences, which are important in spine MRI, require a significant amount of scan time. Generative adversarial networks (GANs) can generate synthetic T2-w fs images. We evaluated the potential of synthetic T2-w fs images by comparing them to their true counterpart regarding image and fat saturation quality, and diagnostic agreement in a heterogenous, multicenter dataset. Methods: A GAN was used to synthesize T2-w fs from T1- and non-fs T2-w. The training dataset comprised scans of 73 patients from two scanners, and the test dataset, scans of 101 patients from 38 multicenter scanners. Apparent signal- and contrast-to-noise ratios (aSNR/aCNR) were measured in true and synthetic T2-w fs. Two neuroradiologists graded image (5-point scale) and fat saturation quality (3-point scale). To evaluate whether the T2-w fs images are indistinguishable, a Turing test was performed by eleven neuroradiologists. Six pathologies were graded on the synthetic protocol (with synthetic T2-w fs) and the original protocol (with true T2-w fs) by the two neuroradiologists. Results: aSNR and aCNR were not significantly different between the synthetic and true T2-w fs images. Subjective image quality was graded higher for synthetic T2-w fs (p = 0.023). In the Turing test, synthetic and true T2-w fs could not be distinguished from each other. The intermethod agreement between synthetic and original protocol ranged from substantial to almost perfect agreement for the evaluated pathologies. Discussion: The synthetic T2-w fs might replace a physical T2-w fs. Our approach validated on a challenging, multicenter dataset is highly generalizable and allows for shorter scan protocols.
... Generative adversarial networks [18] are considered one of the most creative frameworks to produce high-fidelity, high-quality pictures. One of them, DCGANs [19], which consists of convolutional layers, can generate images from a dataset. Conditional GANs [20] extend the field to textto-image and image-to-image syntheses by controlling some specific labels. ...
Preprint
Due to the COVID-19 epidemic, video conferencing has evolved as a new paradigm of communication and teamwork. However, private and personal information can be easily leaked through cameras during video conferencing. This includes leakage of a person's appearance as well as the contents in the background. This paper proposes a novel way of using online low-resolution thermal images as conditions to guide the synthesis of RGB images, bringing a promising solution for real-time video conferencing when privacy leakage is a concern. SPADE-SR (Spatially-Adaptive De-normalization with Self Resampling), a variant of SPADE, is adopted to incorporate the spatial property of a thermal heatmap and the non-thermal property of a normal, privacy-free pre-recorded RGB image provided in a form of latent code. We create a PAIR-LRT-Human (LRT = Low-Resolution Thermal) dataset to validate our claims. The result enables a convenient way of video conferencing where users no longer need to groom themselves and tidy up backgrounds for a short meeting. Additionally, it allows a user to switch to a different appearance and background during a conference.
... Since E-GAN [22] is the most similar to our method, we further take E-GAN as the baseline to compare with IGASEN-EMWGAN. In addition, we also utilize several existing state-of-the-art GANs, including DCGAN [69], WGAN-GP [34], AEGAN [70], LSGAN [71], and EASGAN [72], to compare with the model proposed in this paper. In Table 5, we take various numbers of generators μ = {1, 2, 4, 8} for E-GAN and various numbers of discriminators τ = {1, 2, 4, 8} for IGASEN-EMWGAN. ...
Article
Full-text available
During the process of ship coating, various defects occur due to improper operation by workers, environmental changes, etc. The special characteristics of ship coating limit the amount of data and result in the problem of class imbalance, which is not conducive to ensuring the effectiveness of deep learning-based models. Therefore, a novel hybrid intelligent image generation algorithm called the IGASEN-EMWGAN model is proposed in this paper to tackle the aforementioned limitations for ship painting defect images. First, based on a subset of imbalanced ship painting defect image samples obtained by a bootstrap sampling algorithm, a batch of different base discriminators is trained independently with algorithm-parameter and sample perturbation methods. Then, an improved genetic algorithm based on simulated annealing is used to search for the optimal subset of base discriminators. Further, the IGASEN-EMWGAN model is constructed by fusing the base discriminators in this subset through a weighted integration strategy. Finally, the trained IGASEN-EMWGAN model is used to generate new defect images of the minority classes to obtain a balanced dataset of ship painting defects. Extensive experiments conducted on a real imbalanced ship coating defect database show that, compared with the baselines, the values of the ID and FID scores are improved by 4.92% and decreased by 7.29%, respectively, which proves the superior effectiveness of the proposed model.
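A strongly simplified sketch of the weighted discriminator-ensemble idea is shown below: several independently trained base discriminators are fused by a weighted sum of their outputs. The bootstrap sampling, simulated-annealing genetic search, and weight selection of IGASEN-EMWGAN are all omitted; the networks and weights here are placeholders of our own.

```python
# Weighted ensemble of base discriminators: the fused score is a fixed
# weighted sum of the individual discriminators' scores.
import torch
import torch.nn as nn

class EnsembleDiscriminator(nn.Module):
    def __init__(self, base_discriminators, weights):
        super().__init__()
        self.bases = nn.ModuleList(base_discriminators)
        self.register_buffer("w", torch.tensor(weights))

    def forward(self, x):
        scores = torch.stack([d(x) for d in self.bases], dim=0)  # (B, N, 1)
        return (self.w[:, None, None] * scores).sum(dim=0)       # (N, 1)

# Usage with four placeholder base discriminators on 28x28 inputs.
bases = [nn.Sequential(nn.Flatten(), nn.Linear(784, 1)) for _ in range(4)]
ens = EnsembleDiscriminator(bases, [0.4, 0.3, 0.2, 0.1])
out = ens(torch.randn(8, 1, 28, 28))   # fused score, shape (8, 1)
```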
... With the rapid development of deep learning and multi-modal learning, many studies have tried to understand the relationships between different modalities of data and have promoted research in cross-modal learning [1]-[5]. The emergence of generative adversarial networks (GANs) [6] and their variations [7]-[9] has motivated the development of cross-modal generation studies, such as text-to-image generation [10]-[12]. Unlike text-to-image generation, sound-to-image generation has had less of an impact due to its intrinsic limitations. ...
Article
Full-text available
Audio and visual modal data are essential elements of precise investigation in many fields. Sometimes it is difficult to obtain visual data while auditory data is easily available. In this case, generating visual data using audio data will be very helpful. This paper proposes a novel audio-to-visual cross-modal generation approach. The proposed sound encoder extracts the features of the auditory data and a generative model generates images using those audio features. This model is expected to learn (i) valid feature representation and (ii) associations between generated images and audio inputs to generate realistic and well-classified images. A new dataset is collected for this research called the Audio-Visual Corresponding Bird (AVC-B) dataset which contains the sounds and corresponding images of 10 different bird species. The experimental results show that the proposed method can generate class-appropriate images and achieve better classification results than the state-of-the-art methods.
... Conversely, those implicit counterparts [14,37,51,9] could benefit from more flexible generation without a fixed density form, thanks to the adoption of Generative Adversarial Networks (GANs) [11]. Still, GAN training has proven difficult, suffering from mode collapse [13,35] and unstable gradients [21,23]. Moreover, GANs sometimes make overconfident predictions [10], rendering it difficult to express predictive uncertainty. ...
Preprint
Tremendous efforts have been devoted to pedestrian trajectory prediction using generative modeling for accommodating uncertainty and multi-modality in human behaviors. An individual's inherent uncertainty, e.g., change of destination, can be masked by complex patterns resulting from the movements of interacting pedestrians. However, latent variable-based generative models often entangle such uncertainty with complexity, leading to either limited expressivity or overconfident predictions. In this work, we propose to separately model these two factors by implicitly deriving a flexible distribution that describes complex pedestrians' movements, whereas incorporating predictive uncertainty of individuals with explicit density functions over their future locations. More specifically, we present an uncertainty-aware pedestrian trajectory prediction framework, parameterizing sufficient statistics for the distributions of locations that jointly comprise the multi-modal trajectories. We further estimate these parameters of interest by approximating a denoising process that progressively recovers pedestrian movements from noise. Unlike prior studies, we translate the predictive stochasticity to the explicit distribution, making it readily used to generate plausible future trajectories indicating individuals' self-uncertainty. Moreover, our framework is model-agnostic for compatibility with different neural network architectures. We empirically show the performance advantages of our framework on widely-used benchmarks, outperforming state-of-the-art in most scenes even with lighter backbones.
Chapter
Statistical and machine learning methods have many applications in the environmental sciences, including prediction and data analysis in meteorology, hydrology and oceanography; pattern recognition for satellite images from remote sensing; management of agriculture and forests; assessment of climate change; and much more. With rapid advances in machine learning in the last decade, this book provides an urgently needed, comprehensive guide to machine learning and statistics for students and researchers interested in environmental data science. It includes intuitive explanations covering the relevant background mathematics, with examples drawn from the environmental sciences. A broad range of topics is covered, including correlation, regression, classification, clustering, neural networks, random forests, boosting, kernel methods, evolutionary algorithms and deep learning, as well as the recent merging of machine learning and physics. End‑of‑chapter exercises allow readers to develop their problem-solving skills, and online datasets allow readers to practise analysis of real data.
Article
Full-text available
Sharpness is an important factor for image inpainting in the future Internet, but the massive number of model parameters involved may produce insufficient edge consistency and reduce image quality. In this paper, we propose a two-stage transformer-based high-resolution image inpainting method to address this issue. The model consists of a coarse and a fine generator network. A self-attention mechanism is introduced to guide the transformation of higher-order semantics across the network layers, accelerate forward propagation and reduce the computational cost. An adaptive multi-head attention mechanism is applied to the fine network to control the input of the features in order to reduce redundant computations during training. The pyramid loss and perceptual loss are fused as the loss function of the generator network to improve the efficiency of the model. The comparison with Pennet, GapNet and Partial shows the significance of the proposed method in reducing parameter scale and improving the resolution and texture details of the inpainted image.
Preprint
Full-text available
This paper presents a novel approach to simulating electronic health records (EHRs) using diffusion probabilistic models (DPMs). Specifically, we demonstrate the effectiveness of DPMs in synthesising longitudinal EHRs that capture mixed-type variables, including numeric, binary, and categorical variables. To our knowledge, this represents the first use of DPMs for this purpose. We compared our DPM-simulated datasets to previous state-of-the-art results based on generative adversarial networks (GANs) for two clinical applications: acute hypotension and human immunodeficiency virus (ART for HIV). Given the lack of similar previous studies of DPMs, a core component of our work involves exploring the advantages and caveats of employing DPMs across a wide range of aspects. In addition to assessing the realism of the synthetic datasets, we also trained reinforcement learning (RL) agents on the synthetic data to evaluate their utility for supporting the development of downstream machine learning models. Finally, we estimated that our DPM-simulated datasets are secure and pose a low patient-exposure risk for public access.
Chapter
Deep learning is becoming increasingly important in our everyday lives. It has already made a big difference in industries like cancer diagnosis, precision medicine, self-driving cars, predictive forecasting, and speech recognition, to name a few. Traditional learning, classification, and pattern recognition methods necessitate feature extractors that aren't scalable for large datasets. Depending on the issue complexity, deep learning can often overcome the limitations of past shallow networks that hampered fast training and abstractions of hierarchical representations of multi-dimensional training data. Deep learning techniques have been applied successfully to vegetable infection by plant disease, demonstrating their suitability for the agriculture sector. The chapter looks at a few optimization approaches for increasing training accuracy and decreasing training time. The authors delve into the mathematics that underpin recent deep network training methods. Current faults, improvements, and implementations are discussed. The authors explore various popular deep learning architecture and their real-world uses in this chapter. Deep learning algorithms are increasingly being used in place of traditional techniques in many machine vision applications. Benefits include avoiding the requirement for specific handcrafted feature extractors and maintaining the integrity of the output. Additionally, they frequently grow better. The review discusses deep convolutional networks, deep residual networks, recurrent neural networks, reinforcement learning, variational autoencoders, and other deep architectures.
Chapter
Major advancements in the field of medical image science are mainly due to deep learning technology, which has demonstrated good performance in numerous applications such as segmentation and registration. Using generative adversarial networks (GANs), this study provides an outstanding data augmentation technique for developing synthetic chest X-ray images of pneumonia victims. The proposed model first leverages standard data augmentation methodologies in combination with GANs in order to produce more data. Unparalleled chest X-ray images of patients who suffer from pneumonia are developed using a unique application of GANs. The generated samples are used to train a deep convolutional neural network (DCNN) model to classify chest X-ray data. The performance metric values of existent and synthetic images were also compared and calculated. Keywords: Generative adversarial network (GAN) · Deep convolutional neural network (DCNN) · Deep convolutional generative adversarial network (DCGAN) · Chest X-ray images · Pneumonia
Article
To efficiently preserve texture and target information in source images, an image fusion algorithm, Regional Fusion Factor-Based Union Gradient and Contrast Generative Adversarial Network (R2F-UGCGAN), is proposed. Firstly, an adaptive gradient diffusion (AGD) decomposition algorithm is designed to extract representative features. A pair of infrared (IR) and visible (VIS) images is decomposed by AGD to obtain low-frequency components with salient targets and high-frequency components with rich edge gradient information. Secondly, in the high-frequency components, principal component analysis (PCA) is used for fusion to obtain more detailed images with texture gradients. R2F-UGCGAN is used to fuse the low-frequency components, which effectively ensures good consistency between the target region and the background region. Thus, a fused image is produced that inherits more thermal radiation information and important texture details. Finally, subjective and objective comparison experiments are performed on the TNO and RoadScene datasets against state-of-the-art image fusion methods. The experimental results of R2F-UGCGAN are prominent and consistent compared to these fusion algorithms in terms of both subjective and objective evaluation.
Article
Facial expression recognition (FER) is crucial to many applications. As technology advances and our needs evolve, compound emotion recognition is becoming increasingly important alongside basic emotion recognition. Although FER can be conducted using multiple sensors, research shows that using facial images/videos to recognize facial expressions is better, because visual presentation conveys information more efficiently. Among state-of-the-art methods for FER systems, to improve the accuracy of basic and compound FER systems, detection of facial action units (AUs) must be incorporated to detect basic and compound facial expressions. State-of-the-art results show that machine learning and deep learning-based approaches are more potent than conventional FER approaches. This paper surveys various learning frameworks for facial emotion recognition systems for detecting basic and compound emotions using diverse databases, and summarizes state-of-the-art results to give a good understanding of the impact of each learning framework used in FER systems.
Chapter
In this chapter, we will present the deep FRAME model or deep energy-based model as a recursive multi-layer generalization of the original FRAME model. We shall also present the generator model that can be considered a nonlinear multi-layer generalization of the factor analysis model. Such multi-layer models capture the fact that visual patterns and concepts appear at multiple layers of abstractions.
Chapter
This chapter gives a general introduction to three families of probabilistic models and their connections. Most of the models studied in the previous chapters, as well as most of the models in the current machine learning and deep learning literature, belong to these three families of models.
Article
The single-image super-resolution problem has been studied extensively in the literature using various deep learning-based techniques. Super-resolution based on deep convolutional networks has become a rapidly growing area of interest with numerous practical applications. The first deep learning-based studies, however, were based on convolutional neural networks and focused on the peak signal-to-noise ratio. Thanks to models based on generative adversarial networks developed in recent years, improving visual quality has become the main goal; however, this improvement is not visible when image quality metrics are examined. In this study, both mean squared error and perceptual loss are used for the network loss during training. In addition, the union of three different training datasets is used as a new training dataset. As a result of these factors, visual quality is improved and a significant increase in image quality metric values is achieved. Additionally, batch normalization layers are not included in the network architecture, and skip connections are used to increase the training speed of the deep network architecture. The performance of the proposed model is compared with prominent models from the literature. Peak signal-to-noise ratio and structural similarity index values are computed and evaluated separately on three different test datasets widely used in the literature. The results show that the proposed model is more successful than the other models and produces higher-quality images. Taken together, the findings show that the proposed model is more efficient than the other models in terms of both performance and training speed.