Conference Paper

Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks


Abstract

In recent years, supervised learning with convolutional networks (CNNs) has seen huge adoption in computer vision applications. Comparatively, unsupervised learning with CNNs has received less attention. In this work we hope to help bridge the gap between the success of CNNs for supervised learning and unsupervised learning. We introduce a class of CNNs called deep convolutional generative adversarial networks (DCGANs), that have certain architectural constraints, and demonstrate that they are a strong candidate for unsupervised learning. Training on various image datasets, we show convincing evidence that our deep convolutional adversarial pair learns a hierarchy of representations from object parts to scenes in both the generator and discriminator. Additionally, we use the learned features for novel tasks - demonstrating their applicability as general image representations.


... Generative models are a class of powerful models that aim to learn the distribution of data in order to generate new samples that resemble real data [1]. Common types of generative models include generative adversarial networks (GANs) [2][3][4][5][6][7], variational autoencoders (VAEs) [8][9][10][11], diffusion models [12][13][14][15], and autoregressive models [16]. These models have found widespread application in multimodal generation tasks [17][18][19][20][21][22]. ...
... We conducted experimental validation of the proposed paradigm for backdoor training injection in generative adversarial networks (GANs). Our experiments focused on three prominent GAN architectures: DCGAN [2], SRGAN [58], and CycleGAN [59]. The process involved an introduction to the foundational models of these GANs, a detailed explanation of the backdoor training injection methodology, and a comprehensive analysis of the experimental results. ...
... The generalized formulation in Equation (6) demonstrates robust applicability for backdoor injection across various GAN architectures. The discussed regularization frameworks for DCGAN [2], SRGAN [58], and CycleGAN [59] ensure effective backdoor embedding while maintaining the integrity of the generative process. ...
Article
Full-text available
Backdoor attacks remain a critical area of focus in machine learning research, with one prominent approach being the introduction of backdoor training injection mechanisms. These mechanisms embed backdoor triggers into the training process, enabling the model to recognize specific trigger inputs and produce predefined outputs post-training. In this paper, we identify a unifying pattern across existing backdoor injection methods in generative models and propose a novel backdoor training injection paradigm. This paradigm leverages a unified loss function design to facilitate backdoor injection across diverse generative models. We demonstrate the effectiveness and generalizability of this paradigm through experiments on generative adversarial networks (GANs) and Diffusion Models. Our experimental results on GANs confirm that the proposed method successfully embeds backdoor triggers, enhancing the model’s security and robustness. This work provides a new perspective and methodological framework for backdoor injection in generative models, making a significant contribution toward improving the safety and reliability of these models.
... As a result, we decide to use Ordinal Encoding for ordinal variables to maintain the inner ordered relations to make sure the GAN model can understand and harness this relationship during the generation step. By using the Ordinal Encoding, the representation of D_size = {small, medium, large} is [1, 2, 3]. We use d_{i,j} to represent the converted value from the ith categorical column and jth row. ...
... Training evolution of DCGAN[3] ...
... 3 shows the training evolution of a deep convolutional generative adversarial network[3] which is commonly used to generate images. It is an example of generating sunset images from the first epoch to the 500th epoch. ...
Preprint
Full-text available
As E-commerce platforms face surging transactions during major shopping events like Black Friday, stress testing with synthesized data is crucial for resource planning. Most recent studies use Generative Adversarial Networks (GANs) to generate tabular data while ensuring privacy and machine learning utility. However, these methods overlook the computational demands of processing GAN-generated data, making them unsuitable for E-commerce stress testing. This thesis introduces a novel GAN-based approach incorporating query selectivity constraints, a key factor in database transaction processing. We integrate a pre-trained deep neural network to maintain selectivity consistency between real and synthetic data. Our method, tested on five real-world datasets, outperforms three state-of-the-art GANs and a VAE model, improving selectivity estimation accuracy by up to 20% and machine learning utility by up to 6%.
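The ordinal encoding described in the excerpt for this entry can be illustrated with a minimal sketch. The helper name `ordinal_encode` and the sample column are assumptions made for illustration and are not taken from the cited work; only the mapping of {small, medium, large} to [1, 2, 3] comes from the excerpt.

```python
# Hypothetical helper: maps ordered categories to 1-based ranks,
# so that small < medium < large is preserved as 1 < 2 < 3.
def ordinal_encode(values, ordered_categories):
    """Return the 1-based rank of each value within ordered_categories."""
    rank = {category: index
            for index, category in enumerate(ordered_categories, start=1)}
    return [rank[value] for value in values]


size_levels = ["small", "medium", "large"]          # lowest to highest
column = ["medium", "small", "large", "small"]      # illustrative data only
encoded = ordinal_encode(column, size_levels)
print(encoded)
```

Unlike one-hot encoding, this keeps the ordering in a single numeric column, which is presumably why the excerpt prefers it for ordinal variables.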
... In recent years, researchers have proposed using generative adversarial networks (GAN) [8] to generate defect images [9,10]. GAN-based defect generation methods can be broadly categorized into two types: those based on deep convolutional generative adversarial networks (DCGAN) [11] and those based on cycle-consistent adversarial networks (CycleGAN) [12]. DCGAN-based methods directly generate defects from noise [9,13,14]. ...
... To verify the proposed method, we compare it with SOTA image inpainting methods and defect generation baseline methods, including EC [20], PIC [39], LaMa [24], AOT [23], StrDiffusion [27], Refusion [26], DCGAN [11], and Cycle- ... Table 4, it can be seen that our proposed method slightly underperforms LaMa [24] and AOT [23] in terms of FID for the open defect category but achieves the best results in all other evaluation metrics, demonstrating the superiority of our method. In terms of average FID, our method achieves improvements of 24.19% to 43.39% over other methods. ...
Article
Full-text available
In the electronic manufacturing process, deep learning (DL)-based defect detection models often suffer from limited training defect datasets. To enhance training data, a novel mask inpainting-based data generation architecture (MIDG) is developed for surface defect images with complex backgrounds. It consists of a mask inpainting block, an edge generation block, followed by a defect generation module. The defect generation module is proposed based on an encoder-decoder model with an edge attention block, which hybridizes the information from inpainted normal images and edge maps simultaneously, where the first focuses on texture information and the second on edge structure, generated respectively from the mask inpainting and edge generation blocks. Besides, an annotation strategy is developed, which is at the rectangular mask level and can be easily executed. Experimental results demonstrate that our proposed method can generate various and high-quality defects on flexible printed circuit (FPC) surfaces with irregular circuit lines and copper-covered regions. After adding the generated samples to the training set, the mean Average Precision (mAP) of DL-based detection models such as Faster RCNN, YOLOv8, and YOLOv5 for FPC defect detection increases by 3.1%, 2.7%, and 3.0%, respectively. The codes are available at https://github.com/chenjiaxuandaima/MIDG.git.
... Beyond data augmentation, generative AI, particularly GANs, also provides a way to learn meaningful feature representations in an unsupervised manner (Radford (2015)). By learning a latent space that effectively captures the underlying structure of neural data, GANs can generate interpolated samples that preserve key statistical and physiological properties. ...
... The model consists of two core components: a generator, responsible for producing synthetic EEG signals, and a discriminator (referred to as a "Critic" in the WGAN framework), which evaluates whether the generated data is real or synthetic. Our architecture follows the Deep Convolutional GAN (DC-GAN) (Radford (2015)) framework, incorporating a stack of convolutional layers with upsampling layers in the generator and strided convolutional layers in the Critic (Figure 1). For upsampling, we used linear interpolation, as it has been shown to produce significantly fewer high-frequency artifacts compared to nearest-neighbor upsampling (Hartmann et al. (2018)). ...
Preprint
Full-text available
Generative Adversarial Networks (GANs) have shown promise in synthesising realistic neural data, yet their potential for unsupervised representation learning in resting-state EEG remains underexplored. In this study, we implement a Wasserstein GAN with Gradient Penalty (WGAN-GP) to generate multi-channel resting-state EEG data and assess the quality of the synthesised signals through both visual and feature-based evaluations. Our results indicate that the model effectively captures the statistical and spectral characteristics of real EEG data, although challenges remain in replicating high-frequency oscillations in the frontal region. Additionally, we demonstrate that the Critic's learned representations can be fine-tuned for age group classification, achieving out-of-sample accuracy significantly better than a shuffled-label baseline. These findings suggest that generative models can serve not only as EEG data generators but also as unsupervised feature extractors, reducing the need for manual feature engineering. This study highlights the potential of GAN-based unsupervised learning for EEG analysis, suggesting avenues for more data-efficient deep learning applications in neuroscience.
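The excerpt for this entry contrasts linear interpolation with nearest-neighbor upsampling. The following one-dimensional sketch is only a toy illustration of that contrast, not the cited model: the function names, the factor of 2, and the sample signal are all assumptions. It compares the largest jump between neighbouring samples, a rough proxy for the abrupt steps that show up as high-frequency content.

```python
def upsample_nearest(signal, factor=2):
    """Repeat every sample `factor` times (piecewise-constant output)."""
    return [x for x in signal for _ in range(factor)]


def upsample_linear(signal, factor=2):
    """Insert linearly interpolated points between neighbouring samples."""
    out = []
    for a, b in zip(signal, signal[1:]):
        for step in range(factor):
            out.append(a + (b - a) * step / factor)
    out.append(signal[-1])  # keep the final sample as the end point
    return out


def max_jump(signal):
    """Largest absolute difference between adjacent samples."""
    return max(abs(b - a) for a, b in zip(signal, signal[1:]))


coarse = [0.0, 1.0, 0.0, -1.0, 0.0]   # illustrative low-resolution signal
nearest = upsample_nearest(coarse)
linear = upsample_linear(coarse)
print(max_jump(nearest), max_jump(linear))
```

In this toy case the nearest-neighbor output keeps the full step between coarse samples, while the linear output halves it.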
... To improve hiding capacity, some generative steganographic methods customized the message mapping between the secret message and the input noise vector of a GAN. Taking advantage of the high-quality images generated by Deep Convolutional GAN (DCGAN) (Radford et al. 2016), Hu et al. (2018) and Jiang et al. (2020) adopted DCGAN to generate realistic-looking stego-images, where the noise vectors were directly encoded by secret messages. Wasserstein GAN with gradient penalty (WGAN-GP) was adopted to generate images from secret messages for steganography (Li et al. 2020). ...
Article
Full-text available
Steganography aims to embed and extract secret information in digital media for enhancing information security, which is widely applied to covert communication, copyright and privacy protection, digital forensics, etc. To resist steganalysis detection, generative steganography is one of the most promising techniques with embedding secret information into a generated image. Although existing generative steganographic methods could perform well with low hiding capacity, most of them encode the secret information in non-distribution-preserving manners, leading to poor security performance against steganalyzers when hiding more secret information. Meanwhile, the secret information tends to be difficult to be extracted with these methods because the secret-to-image transformations are irreversible. To tackle these issues, in this paper, we propose a reversible generative steganography with distribution-preserving scheme, which is mainly composed of a secret message mapping strategy with distribution-preserving and a reversible Glow model. To improve the anti-detectability against steganalyzers, the message mapping strategy with distribution-preserving is customized to encode the secret information into latent vectors which follow the Gaussian distribution as they are usually done in typical image generation models. The Glow model is then trained with reversible transformation to map the latent vectors into the generated stego-images with information hiding. Owing to the distribution-preserving and reversibility of the message mapping and Glow model, the proposed generative steganographic method achieves superior security performance and accurate extraction of secret message. Extensive experimental results demonstrate that the proposed method outperforms several state-of-the-art methods in terms of information extraction accuracy and anti-detectability, especially for high hiding capacity (up to 4.0 bpp).
... Once trained, such models can efficiently make predictions with given observations. As a pioneer AI-based method for precipitation nowcasting, ConvLSTM [24] incorporates the convolution operation into LSTM [19] to model spatiotemporal patterns of radar sequences and delivers promising results compared with traditional approaches. ...
Preprint
Full-text available
Convection (thunderstorm) develops rapidly within hours and is highly destructive, posing a significant challenge for nowcasting and resulting in substantial losses to nature and society. After the emergence of artificial intelligence (AI)-based methods, convection nowcasting has experienced rapid advancements, with its performance surpassing that of physics-based numerical weather prediction and other conventional approaches. However, the lead time and coverage of it still leave much to be desired and hardly meet the needs of disaster emergency response. Here, we propose deep diffusion models of satellite (DDMS) to establish an AI-based convection nowcasting system. On one hand, it employs diffusion processes to effectively simulate complicated spatiotemporal evolution patterns of convective clouds, significantly improving the forecast lead time. On the other hand, it utilizes geostationary satellite brightness temperature data, thereby achieving planetary-scale forecast coverage. During long-term tests and objective validation based on the FengYun-4A satellite, our system achieves, for the first time, effective convection nowcasting up to 4 hours, with broad coverage (about 20,000,000 km²), remarkable accuracy, and high resolution (15 minutes; 4 km). Its performance reaches a new height in convection nowcasting compared to the existing models. In terms of application, our system operates efficiently (forecasting 4 hours of convection in 8 minutes), and is highly transferable with the potential to collaborate with multiple satellites for global convection nowcasting. Furthermore, our results highlight the remarkable capabilities of diffusion models in convective clouds forecasting, as well as the significant value of geostationary satellite data when empowered by AI technologies.
... The generator strives to produce realistic data to deceive the discriminator, while the discriminator attempts to distinguish between real and generated data. GANs have been used in various applications such as image generation (Karras et al., 2017; Radford et al., 2015), image-to-image translation (Zhu et al., 2017), and style transfer (Gatys et al., 2016; Johnson et al., 2016), showing success in generating high-quality data samples. ...
Article
Full-text available
The advent of X-ray Free Electron Lasers (XFELs) has opened unprecedented opportunities for advances in the physical, chemical, and biological sciences. With their state-of-the-art methodologies and ultrashort, and intense X-ray pulses, XFELs propel X-ray science into a new era, surpassing the capabilities of traditional light sources. Ultrafast X-ray scattering and imaging techniques leverage the coherence of these intense pulses to capture nanoscale structural dynamics with femtosecond spatial-temporal resolution. However, spatial and temporal resolutions remain limited by factors such as intrinsic fluctuations and jitters in the Self-Amplified Spontaneous Emission (SASE) mode, relatively low coherent scattering cross-sections, the need for high-performance, single-photon-sensitive detectors, effective sample delivery techniques, low parasitic X-ray instrumentation, and reliable data analysis methods. Furthermore, the high-throughput data flow from high-repetition rate XFEL facilities presents significant challenges. Therefore, more investigation is required to determine how Artificial Intelligence (AI) can support data science in this situation. In recent years, deep learning has made significant strides across various scientific disciplines. To illustrate its direct influence on ultrafast X-ray science, this article provides a comprehensive overview of deep learning applications in ultrafast X-ray scattering and imaging, covering both theoretical foundations and practical applications. It also discusses the current status, limitations, and future prospects, with an emphasis on its potential to drive advancements in fourth-generation synchrotron radiation, ultrafast electron diffraction, and attosecond X-ray studies.
... If the generated images successfully deceive the discriminator, they become the final output. DCGAN [24], based on a deep convolutional neural network, improves image generation quality. However, GANs still face issues such as training failures and mode collapse [23]. ...
Article
Full-text available
We propose a Generative Adversarial Network (GAN)-based method for image synthesis from remote sensing data. Remote sensing images (RSIs) are characterized by large intraclass variance and small interclass variance, which pose significant challenges for image synthesis. To address these issues, we design and incorporate two distinct attention modules into our GAN framework. The first attention module is designed to enhance similarity measurements within label groups, effectively handling the large intraclass variance by reinforcing consistency within the same class. The second module addresses the small interclass variance by promoting diversity between adjacent label groups, ensuring that different classes are distinguishable in the generated images. These attention mechanisms play a critical role in generating more realistic and visually coherent images. Our GAN-based framework consists of an advanced image encoder and a generator, which are both enhanced by these attention modules. Furthermore, we integrate optimal transport (OT) to approximate human perceptual loss, further improving the visual quality of the synthesized images. Experimental results demonstrate the effectiveness of our approach, highlighting its advantages in the remote sensing field by significantly enhancing the quality of generated RSIs.
... Regarding the IMLE objective, we conduct a comparative analysis using DCGAN [53] and WGAN [3] as baseline implicit models in Tab. 4. GAN-based distillation methods fail to deliver satisfactory performance due to their inherent training instability. In contrast, the IMLE training objective is both fast and stable, achieving state-of-the-art performance compared to existing baselines, as shown in Tab. 1. ...
Preprint
In this paper, we address the problem of human trajectory forecasting, which aims to predict the inherently multi-modal future movements of humans based on their past trajectories and other contextual cues. We propose a novel motion prediction conditional flow matching model, termed MoFlow, to predict K-shot future trajectories for all agents in a given scene. We design a novel flow matching loss function that not only ensures at least one of the K sets of future trajectories is accurate but also encourages all K sets of future trajectories to be diverse and plausible. Furthermore, by leveraging the implicit maximum likelihood estimation (IMLE), we propose a novel distillation method for flow models that only requires samples from the teacher model. Extensive experiments on the real-world datasets, including SportVU NBA games, ETH-UCY, and SDD, demonstrate that both our teacher flow model and the IMLE-distilled student model achieve state-of-the-art performance. These models can generate diverse trajectories that are physically and socially plausible. Moreover, our one-step student model is 100 times faster than the teacher flow model during sampling. The code, model, and data are available at our project page: https://moflow-imle.github.io
... Here, X ∈ R^d ∼ F represents a real dataset, and ξ ∈ R^p ∼ F_ξ represents a noise vector, where d > p. There are various approaches to address the mode collapse issue, including modifying the loss function (Nowozin et al., 2016; Li et al., 2015; Dellaporta et al., 2022; Fazeli-Asl et al., 2024) and exploring different architectures (Radford et al., 2015; Zhang et al., 2019). ...
Preprint
Full-text available
Mutual Information (MI) is a crucial measure for capturing dependencies between variables, but exact computation is challenging in high dimensions with intractable likelihoods, impacting accuracy and robustness. One idea is to use an auxiliary neural network to train an MI estimator; however, methods based on the empirical distribution function (EDF) can introduce sharp fluctuations in the MI loss due to poor out-of-sample performance, destabilizing convergence. We present a Bayesian nonparametric (BNP) solution for training an MI estimator by constructing the MI loss with a finite representation of the Dirichlet process posterior to incorporate regularization in the training process. With this regularization, the MI loss integrates both prior knowledge and empirical data to reduce the loss sensitivity to fluctuations and outliers in the sample data, especially in small sample settings like mini-batches. This approach addresses the challenge of balancing accuracy and low variance by effectively reducing variance, leading to stabilized and robust MI loss gradients during training and enhancing the convergence of the MI approximation while offering stronger theoretical guarantees for convergence. We explore the application of our estimator in maximizing MI between the data space and the latent space of a variational autoencoder. Experimental results demonstrate significant improvements in convergence over EDF-based methods, with applications across synthetic and real datasets, notably in 3D CT image generation, yielding enhanced structure discovery and reduced overfitting in data synthesis. While this paper focuses on generative models in application, the proposed estimator is not restricted to this setting and can be applied more broadly in various BNP learning procedures.
... Goodfellow et al., 2014; Radford et al., 2015; Mirza & Osindero, 2014;, image-to-image translation (Zhu et al., 2017a; Isola et al., 2017; Ma et al., 2018), domain generalization (Ganin & Lempitsky, 2015; Zhang et al., 2022; Huang et al., 2018), data augmentation (Antoniou et al., 2017; Zhu et al., 2017b; Calimeri et al., 2017), representation learning (Chen et al., 2016a; Donahue et al., 2017; Huang et al., 2017; Odena, 2016; Donahue et al., 2016), and active learning (Sinha et al., 2019; ...
Preprint
Full-text available
Active learning aims to select optimal samples for labeling, minimizing annotation costs. This paper introduces a unified representation learning framework tailored for active learning with task awareness. It integrates diverse sources, comprising reconstruction, adversarial, self-supervised, knowledge-distillation, and classification losses into a unified VAE-based ADROIT approach. The proposed approach comprises three key components - a unified representation generator (VAE), a state discriminator, and a (proxy) task-learner or classifier. ADROIT learns a latent code using both labeled and unlabeled data, incorporating task-awareness by leveraging labeled data with the proxy classifier. Unlike previous approaches, the proxy classifier additionally employs a self-supervised loss on unlabeled data and utilizes knowledge distillation to align with the target task-learner. The state discriminator distinguishes between labeled and unlabeled data, facilitating the selection of informative unlabeled samples. The dynamic interaction between VAE and the state discriminator creates a competitive environment, with the VAE attempting to deceive the discriminator, while the state discriminator learns to differentiate between labeled and unlabeled inputs. Extensive evaluations on diverse datasets and ablation analysis affirm the effectiveness of the proposed model.
... The generator and discriminator networks are made up of convolutional layers with no max pooling or fully connected layers. Instead, they use convolutional strides for down-sampling in the discriminator and transposed convolutions for up-sampling in the generator [26]. The design of the generator and discriminator in DCGAN is shown in Fig. 2. ...
Conference Paper
Full-text available
Cervical cancer is the fourth most common cancer among women worldwide. This study aims to classify cervical cancer cell images using deep learning techniques. Given the limited dataset, we propose a Deep Convolutional Generative Adversarial Network (DCGAN) to generate synthetic images and improve classification performance. The Pap smear images collected from Pomeranian Medical University were used to evaluate our approach. Several deep neural architectures were applied for image classification, including CNN, VGG16, MobileNet, ResNet50V2, InceptionV3, and Xception. Three experiments were conducted: (1) using the real dataset, (2) combining real and generated datasets, and (3) training with 80% real and generated images while testing on 20% real images not used in generation. In the third experiment, the model achieved accuracies of 96% for Xception and MobileNet, and 94% for ResNet50V2. These results demonstrate that DCGAN-based augmentation significantly improves classification performance and can play a crucial role in aiding the early detection of cervical cancer, enhancing diagnostic accuracy in clinical practice. Overall, this approach effectively addresses dataset limitations and boosts model accuracy for cervical cancer detection.
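The excerpt for this entry notes that DCGAN down-samples with strided convolutions in the discriminator and up-samples with transposed convolutions in the generator. The arithmetic behind that can be sketched with the standard output-size formulas (no dilation). The specific hyperparameters below (4x4 kernels, stride 2, padding 1, four layers, 64x64 images) are assumptions chosen for illustration and are not stated in the excerpt.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of an ordinary (strided) convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_transpose_out(size, kernel, stride, padding, output_padding=0):
    """Spatial output size of a transposed convolution (dilation = 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


# Assumed settings: kernel 4, stride 2, padding 1 in every layer.
KERNEL, STRIDE, PADDING = 4, 2, 1

# Discriminator: each strided convolution halves the feature map.
disc_sizes = [64]
for _ in range(4):
    disc_sizes.append(conv_out(disc_sizes[-1], KERNEL, STRIDE, PADDING))

# Generator: each transposed convolution doubles the feature map.
gen_sizes = [4]
for _ in range(4):
    gen_sizes.append(conv_transpose_out(gen_sizes[-1], KERNEL, STRIDE, PADDING))

print(disc_sizes)  # [64, 32, 16, 8, 4]
print(gen_sizes)   # [4, 8, 16, 32, 64]
```

With these settings the two networks mirror each other, so no pooling layer is needed to change resolution.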
... Although recent advancements in generative networks, such as Generative Adversarial Networks (GANs) [12] and diffusion models [13,14], have significantly advanced research in talking portrait video generation, visual dubbing still faces two major challenges: lip-speech synchronization and identity preservation. Lip-speech synchronization ensures that lip movements align accurately with phonemes while identity preservation maintains the speaker's facial appearance and expressions. ...
Preprint
Full-text available
Recent advances in diffusion-based lip-syncing generative models have demonstrated their ability to produce highly synchronized talking face videos for visual dubbing. Although these models excel at lip synchronization, they often struggle to maintain fine-grained control over facial details in generated images. In this work, we identify "lip averaging" phenomenon where the model fails to preserve subtle facial details when dubbing unseen in-the-wild videos. This issue arises because the commonly used UNet backbone primarily integrates audio features into visual representations in the latent space via cross-attention mechanisms and multi-scale fusion, but it struggles to retain fine-grained lip details in the generated faces. To address this issue, we propose UnAvgLip, which extracts identity embeddings from reference videos to generate highly faithful facial sequences while maintaining accurate lip synchronization. Specifically, our method comprises two primary components: (1) an Identity Perceiver module that encodes facial embeddings to align with conditioned audio features; and (2) an ID-CrossAttn module that injects facial embeddings into the generation process, enhancing model's capability of identity retention. Extensive experiments demonstrate that, at a modest training and inference cost, UnAvgLip effectively mitigates the "averaging" phenomenon in lip inpainting, significantly preserving unique facial characteristics while maintaining precise lip synchronization. Compared with the original approach, our method demonstrates significant improvements of 5% on the identity consistency metric and 2% on the SSIM metric across two benchmark datasets (HDTF and LRW).
... To effectively transfer knowledge from local models to the global model, one solution is to improve the transferability of synthetic data samples. Enlightened by the success of generative adversarial networks (GANs) (Goodfellow et al. 2014; Mirza and Osindero 2014; Radford, Metz, and Chintala 2015), we introduce an adversarial loss to encourage the generator G to generate difficult data samples for the training of the global model by maximizing the disagreement between the global model f and the ensemble of local models {f_k}_{k∈K}, as follows: ...
Preprint
Full-text available
Federated learning (FL) enables decentralized clients to collaboratively train a global model under the orchestration of a central server without exposing their individual data. However, the iterative exchange of model parameters between the server and clients imposes heavy communication burdens, risks potential privacy leakage, and even precludes collaboration among heterogeneous clients. Distillation-based FL tackles these challenges by exchanging low-dimensional model outputs rather than model parameters, yet it highly relies on a task-relevant auxiliary dataset that is often not available in practice. Data-free FL attempts to overcome this limitation by training a server-side generator to directly synthesize task-specific data samples for knowledge transfer. However, the update rule of the generator requires clients to share on-device models for white-box access, which greatly compromises the advantages of distillation-based FL. This motivates us to explore a data-free and black-box FL framework via Zeroth-order Gradient Estimation (FedZGE), which estimates the gradients after flowing through on-device models in a black-box optimization manner to complete the training of the generator in terms of fidelity, transferability, diversity, and equilibrium, without involving any auxiliary data or sharing any model parameters, thus combining the advantages of both distillation-based FL and data-free FL. Experiments on large-scale image classification datasets and network architectures demonstrate the superiority of FedZGE in terms of data heterogeneity, model heterogeneity, communication efficiency, and privacy protection.
... • Neural network-based discriminators were used to distinguish between photos that were fake and those that were real. We obtained probability scores that represented the chance of each image being authentic by utilizing a pre-trained GAN discriminator (Ghayoumi 2023;Radford, Metz and Chintala 2018). After evaluating each image, the discriminator produced a probability score that represented the confidence level regarding the image's veracity. ...
Article
Full-text available
Virtual Influencers (VIs) have become the most prolific research subjects in human–computer interaction and mass media and communication studies from a plethora of perspectives. Developed to integrate social traits and anthropomorphic minds in their social media posts, human-like VIs engage with followers via visually authentic personae, emotionally captivating multimodal storytelling, and semio-pragmatic labor-intensive strategies in conformity with the expectations (and pressures) of the contemporary influencer culture. Informed by Belk’s revisited model of and timely scholarly works on the extended self, we introduce a new conceptualization of the virtual self that performs identity in platformized spaces. To examine virtual personae’s identity performance, we adopt a trans-disciplinary mixed-method forensic netnographic research design, synergizing computer vision, natural language processing, and semio-pragmatic analytical tools. A convenient sample of 334 (sponsored) posts, retrieved from the official Instagram account of the quintessential virtual agent Lil Miquela, is scrutinized taking into consideration her posts’ images and accompanying captions. The paper carries out the tripartite analysis in serious attempt to unravel: (a) how humanoid her synthesized images appear to the naked eye in quest of authenticity building; (b) the techno-affects that contribute to her identity performance; and (c) the semio-pragmatic affordances appropriated and deployed in Instagrammable spaces, showcasing how the three serve the performance of her digital identity. Valuable insights reveal that her agency draws heavily on algorithmization and semiotic immateriality to produce action. The study’s findings contribute to the existing body of literature on VIs and the extended self within the context of artificial intelligence.
... This enables the network to retain information from the encoder required for decoding the image. This is important since there is no visible feature difference between crop and weed, and therefore the low-level features learnt are retained. For reconstruction, we have used fractional-stride convolutions of factor 0.5; note that this operation is often mistaken for deconvolution [44]. The number of kernels grows exponentially in the encoder network, which learns $2^l$ feature mappings at the latent space, where $l$ stands for the number of layers, and the inverse mapping is used for tuning the number of kernels for the decoder network. ...
Article
Full-text available
Crop weed segmentation is one of the most challenging tasks in the field of computer vision. This is because, unlike other object detection or segmentation tasks, crop and weed are similar in terms of spectral features, shape, dimensions, etc. For precision agriculture to flourish in terms of smart spraying of crops, efficient systems to distinguish between crop and weed are the need of the hour, which if precise, will take a huge step toward solving the issue of food scarcity. To tackle this issue, we propose new ensemble architecture of two models—a U-Net with a modified backbone and an encoder–decoder. These networks learn to distinguish between soil and crop and soil and weed, respectively, whose ensemble gives state-of-the-art results on pixel-wise annotations of combined crop and weed images. Moreover, it also learns that the model captures un-annotated features since each component of the architecture learnt either crop or weed features to high precision. Finally, the proposed architecture is compared with the U-Net and SegNet, which are popular segmentation networks, and consistently achieved better results.
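The fractional-stride (transposed) convolution mentioned in the citing excerpt above can be illustrated with its output-size arithmetic. The following is a minimal sketch, not code from any of the listed papers: the function names are hypothetical, dilation is assumed to be 1, and the kernel size 4, stride 2, padding 1 setting is an assumed configuration under which each layer doubles the spatial resolution (a fractional stride of 0.5).

```python
def transposed_conv_output_size(size_in, kernel, stride, padding=0, output_padding=0):
    """Spatial output size of a transposed (fractionally strided) convolution.

    Assumes dilation 1. A stride of 2 here corresponds to a fractional
    stride of 1/2 in the forward-convolution view.
    """
    return (size_in - 1) * stride - 2 * padding + kernel + output_padding


def decoder_sizes(size_in, num_layers, kernel=4, stride=2, padding=1):
    """Feature-map sizes after each of num_layers stacked upsampling layers.

    The defaults (kernel 4, stride 2, padding 1) are an assumed setting
    chosen so that every layer exactly doubles the resolution.
    """
    sizes = [size_in]
    for _ in range(num_layers):
        sizes.append(transposed_conv_output_size(sizes[-1], kernel, stride, padding))
    return sizes


if __name__ == "__main__":
    # Four doubling layers take a 4x4 feature map to 64x64.
    print(decoder_sizes(4, 4))
```

With these assumed settings, a 4x4 feature map passes through 8, 16, and 32 before reaching 64x64, which is why stacks of such layers are a common choice for decoders and generators.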
... The overall architecture of our generator is analogous to that of the DCGAN [32]. Our most significant enhancement is the incorporation of CCB as an integral component within each convolutional block. ...
Article
Full-text available
Data-free knowledge distillation has recently gained significant attention in the field of model compression, as it enables knowledge transfer from a trained teacher model to a smaller student model without requiring original training data. Current methods often utilize generative adversarial networks (GANs) to synthesize fake samples, but this approach introduces two main issues. First, mode collapse leads to instances lacking diversity for downstream tasks. Second, inefficient instance synthesis makes existing methods too time-consuming and thus difficult to adapt to large-scale datasets. Finally, the increased memory footprint makes deployment difficult. In this paper, we propose a novel paradigm called conditional contrast for data-free knowledge distillation (CC-DFKD), which integrates conditional generative adversarial network (CGAN) and contrastive learning. CGAN synthesizes class-specific diverse images to address the diversity challenge, while contrastive learning enriches the student model’s feature representations to tackle the reality challenge. Additionally, compared with the recent work, simplification of the distillation loss reduces instance generation time and memory usage during operation, achieving significant speed improvements (half an hour to nine hours reduction) and lower GPU memory usage (2000–5000 MB reduction). Empirical results across multiple datasets validate CC-DFKD’s effectiveness and efficiency under low-memory conditions. Code is available at: https://github.com/jcynxu/CC-DFKD.
... Training is performed using mini-batch stochastic gradient descent with a batch size of 16. The Adam optimizer [33] is employed with the following learning rates for each network: 0.0002 ($E_{nc}$), 0.0002 ($D_{ec}$), 0.0001 ($G_{en}$), and 0.0003 ($D$). For all datasets, we set the parameters $\mu_1$, $\mu_2$, $\mu_3$ in Equation (9) and $\lambda_1$, $\lambda_2$ in Equation (15) as 2.0, 0.5, 0.1, 2.0, and 0.5, respectively, as used by [7]. ...
Article
Full-text available
Face recognition (FR) is a less intrusive biometrics technology with various applications, such as security, surveillance, and access control systems. FR remains challenging, especially when there is only a single image per person as a gallery dataset and when dealing with variations like pose, illumination, and occlusion. Deep learning techniques have shown promising results in recent years using VAE and GAN, with approaches such as patch-VAE, VAE-GAN for 3D Indoor Scene Synthesis, and hybrid VAE-GAN models. However, in Single Sample Per Person Face Recognition (SSPP FR), the challenge of learning robust and discriminative features that preserve the subject’s identity persists. To address these issues, we propose a novel framework called AD-VAE, specifically for SSPP FR, using a combination of variational autoencoder (VAE) and Generative Adversarial Network (GAN) techniques. The proposed AD-VAE framework is designed to learn how to build representative identity-preserving prototypes from both controlled and wild datasets, effectively handling variations like pose, illumination, and occlusion. The method uses four networks: an encoder and decoder similar to VAE, a generator that receives the encoder output plus noise to generate an identity-preserving prototype, and a discriminator that operates as a multi-task network. AD-VAE outperforms all tested state-of-the-art face recognition techniques, demonstrating its robustness. The proposed framework achieves superior results on four controlled benchmark datasets—AR, E-YaleB, CAS-PEAL, and FERET—with recognition rates of 84.9%, 94.6%, 94.5%, and 96.0%, respectively, and achieves remarkable performance on the uncontrolled LFW dataset, with a recognition rate of 99.6%. The AD-VAE framework shows promising potential for future research and real-world applications.
... Generative Adversarial Networks (GANs) are made up of two networks: a generator and a discriminator trained on unlabeled data [10,29,32]. The generator G seeks to capture the data distribution and generate realistic video frames by constructing a data distribution for the input data V via a mapping from a previous latent space noise distribution z. ...
Article
Surveillance video refers to video footage captured by cameras for the purpose of monitoring and recording activities in specific environments. These videos are commonly used for security purposes in places such as airports, shopping malls, streets, industrial facilities, hospitals, and other public or private spaces. The primary objective of surveillance video systems is to maintain safety, detect suspicious activities, and collect evidence for investigation. Anomaly detection in surveillance video is an important and evolving field with applications across various industries. It involves analyzing video data to detect unusual or suspicious events, which could indicate threats, errors, or rare occurrences. While traditional methods have been useful, recent advancements in learning methods, particularly using 2D Convolutional Long Short-Term Memory, Autoencoders, and Generative Adversarial Networks, have made significant improvements in detecting complex anomalies. Our proposed system is based on an autoencoder with a Convolutional 2D Long Short-Term Memory unit in a Generative Adversarial Network. The model aims to learn the appropriate normal data distribution during training. Frames with a large variance in their regularity score are identified as anomalies based on this distribution. We have adopted depth-wise separable convolution with a Conv2DLSTM unit in the autoencoder to learn spatial and temporal features, to reconstruct frames and differentiate generated frames from real frames in a video sequence, and to make the model lightweight and efficient. The entire system has been evaluated on many benchmark datasets using metrics like AUC and Equal Error Rate (EER) and is shown to be reliable for complicated video anomaly identification.
... Some other major works include: similar "vector arithmetic" and interpretable direction results found in generative adversarial networks (Radford et al. (2016)); a significant body of research identifying neurons with interpretable behavior in RNNs (Karpathy et al. (2015)), CNNs (Zhou et al. (2015)), and GANs (Bau et al. (2020)). ...
Preprint
Full-text available
Understanding and interpreting the internal representations of large language models (LLMs) remains an open challenge. Patchscopes introduced a method for probing internal activations by patching them into new prompts, prompting models to self-explain their hidden representations. We introduce Superscopes, a technique that systematically amplifies superposed features in MLP outputs (multilayer perceptron) and hidden states before patching them into new contexts. Inspired by the "features as directions" perspective and the Classifier-Free Guidance (CFG) approach from diffusion models, Superscopes amplifies weak but meaningful features, enabling the interpretation of internal representations that previous methods failed to explain-all without requiring additional training. This approach provides new insights into how LLMs build context and represent complex concepts, further advancing mechanistic interpretability.
... Nauata et al. (2020) developed the House-GAN algorithm, generating house layouts using a graph-based approach, with room adjacency represented by nodes and edges, trained with WGAN-GP. Radford et al. (2016) introduced Deep Convolutional GANs (DCGANs), enhancing GAN stability and enabling the generation of high-resolution bedroom designs. Isola et al. (2017) created 'pix2pix,' a conditional GAN software for generating building façades and transforming images, which was later improved by Wang et al. (2018) with 'pix2pixHD' for high-resolution image generation. ...
Article
Full-text available
Generative AI has seen significant advances, particularly in text-to-image, with the potential to revolutionize industries, especially in creative fields such as art and design. This innovation is especially important in architecture, where idea visualization is critical. Text-to-image tools, a form of generative AI, enable architects and designers to visually bring their concepts to life. The study explores the impact of prompt-based AI generation on architecture, asking whether it is enhancing efficiency, creativity, and sustainability or threatening to replace architects. To address concerns about the role of AI in the profession, the research examines the perceptions of architecture professionals in Egypt. The authors conducted a survey and interviews with industry experts to assess the transformative impacts of AI on architecture. The findings reveal a strong awareness of AI's potential to enhance design quality and project outcomes, although some concerns about job prospects and control over AI outputs persist. Small firms view AI as vital for optimizing operations and attracting clients. Overall, AI shows promise in conceptualization and visualization, enhancing creativity and efficiency, with architects needing to adapt to AI as a tool for innovation rather than a competitor. Finally, the study proposes a roadmap for improving the use of AI in architecture.
... Portrait animation (Guo et al., 2024; Ma et al., 2024; Xie et al., 2024; Niu et al., 2024; Wang et al., 2021) has demonstrated impressive results with the recent advancements in generative models, such as Generative Adversarial Networks (GANs) (Goodfellow et al., 2020; Donahue et al., 2016; Odena et al., 2017; Radford, 2015) and diffusion models (Rombach et al., 2022; Nichol et al., 2021; Saharia et al., 2022; Ho et al., 2020; Song et al., 2021a). However, these methods depend on facial landmark recognition, and their performance is constrained by the generalization capability of facial landmark detection models (Zhou et al., 2023). ...
Preprint
In this paper, we present FaceShot, a novel training-free portrait animation framework designed to bring any character into life from any driven video without fine-tuning or retraining. We achieve this by offering precise and robust reposed landmark sequences from an appearance-guided landmark matching module and a coordinate-based landmark retargeting module. Together, these components harness the robust semantic correspondences of latent diffusion models to produce facial motion sequence across a wide range of character types. After that, we input the landmark sequences into a pre-trained landmark-driven animation model to generate animated video. With this powerful generalization capability, FaceShot can significantly extend the application of portrait animation by breaking the limitation of realistic portrait landmark detection for any stylized character and driven video. Also, FaceShot is compatible with any landmark-driven animation model, significantly improving overall performance. Extensive experiments on our newly constructed character benchmark CharacBench confirm that FaceShot consistently surpasses state-of-the-art (SOTA) approaches across any character domain. More results are available at our project website https://faceshot2024.github.io/faceshot/.
... 3) Representing sketches as discrete parameter sequences makes generating smooth and continuous shape transformations difficult, leading to unnatural results for latent space interpolation, as illustrated in Fig.1. This poses a challenge for generating plausible sketches, for which plausible interpolation is essential (Radford, Metz, and Chintala 2015;Goodfellow et al. 2014;Higgins et al. 2017). ...
Preprint
The integration of deep generative networks into generating Computer-Aided Design (CAD) models has garnered increasing attention over recent years. Traditional methods often rely on discrete sequences of parametric line/curve segments to represent sketches. Differently, we introduce RECAD, a novel framework that generates Raster sketches and 3D Extrusions for CAD models. Representing sketches as raster images offers several advantages over discrete sequences: 1) it breaks the limitations on the types and numbers of lines/curves, providing enhanced geometric representation capabilities; 2) it enables interpolation within a continuous latent space; and 3) it allows for more intuitive user control over the output. Technically, RECAD employs two diffusion networks: the first network generates extrusion boxes conditioned on the number and types of extrusions, while the second network produces sketch images conditioned on these extrusion boxes. By combining these two networks, RECAD effectively generates sketch-and-extrude CAD models, offering a more robust and intuitive approach to CAD model generation. Experimental results indicate that RECAD achieves strong performance in unconditional generation, while also demonstrating effectiveness in conditional generation and output editing.
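The latent-space interpolation that the citing excerpt above treats as a plausibility check can be sketched as follows. This is an illustrative sketch only: the helper names, the standard-normal prior, and the 100-dimensional latent are assumptions, and no generator is included. In practice each interpolated vector would be decoded by a trained generator to inspect whether the outputs change smoothly.

```python
import random


def sample_latent(dim, rng):
    """Draw one latent vector from a standard normal prior (assumed prior)."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def lerp(z0, z1, t):
    """Linear interpolation between two latent vectors for t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]


def interpolation_path(z0, z1, steps):
    """Evenly spaced latent vectors from z0 to z1, endpoints included."""
    if steps < 2:
        raise ValueError("steps must be at least 2 to include both endpoints")
    return [lerp(z0, z1, i / (steps - 1)) for i in range(steps)]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is repeatable
    z_start = sample_latent(100, rng)
    z_end = sample_latent(100, rng)
    path = interpolation_path(z_start, z_end, steps=9)
    # Each entry of `path` would be passed to a trained generator (not shown).
    print(len(path), len(path[0]))
```

Linear interpolation is the simplest choice here; spherical interpolation is a common alternative for Gaussian priors, but it is not shown.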
... Generative models, in particular, have made substantial contributions to anomaly detection and have paved the way for addressing more intricate tasks, particularly in self-supervised understanding and the generation of natural images. Authors in [33] introduced DC-GAN, showcasing GANs' ability to capture semantic image content, which has led to intriguing applications like vector arithmetic for manipulating visual concepts. Additionally, [50] trained GANs on natural images and employed the trained models for semantic image inpainting, demonstrating the versatility and potential of GANs in various image-related tasks. ...
Conference Paper
Full-text available
A robust anomaly detection mechanism should possess the capability to effectively remediate anomalies, restoring them to a healthy state, while preserving essential healthy information. Despite the efficacy of existing generative models in learning the underlying distribution of healthy reference data, they face primary challenges when it comes to efficiently repair larger anomalies or anomalies situated near high pixel-density regions. In this paper, we introduce a self-supervised anomaly detection method based on a diffusion model that samples from multi-frequency, four-dimensional simplex noise and makes predictions using our proposed Dynamic Transformer UNet (DTUNet). This simplex-based noise function helps address primary problems to some extent and is scalable for three-dimensional and colored images. In the evolution of ViT, our developed architecture serving as the backbone for the diffusion model, is tailored to treat time and noise image patches as tokens. We incorporate long skip connections bridging the shallow and deep layers, along with smaller skip connections within these layers. Furthermore, we integrate a partial diffusion Markov process, which reduces sampling time, thus enhancing scalability. Our method surpasses existing generative-based anomaly detection methods across three diverse datasets, which include BrainMRI, Brats2021, and the MVtec dataset. It achieves an average improvement of +10.1% in Dice coefficient, +10.4% in IOU, and +9.6% in AUC. Our source code is made publicly available on Github.
... [Information 2025, 16, 197] Generative Adversarial Network (DCGAN), which generates transitional images between data categories through interpolation techniques [4]. To achieve conditional control over generated image categories, Antoniou et al. developed the Data Augmentation Generative Adversarial Network (DAGAN), producing new images consistent with the original class [5]. ...
Article
Full-text available
Existing substation equipment image data augmentation models face challenges such as high dataset size requirements, difficult training processes, and insufficient condition control. This paper proposes a transformer equipment image data augmentation method based on a Stable Diffusion model. The proposed method incorporates the Low-Rank Adaptation (LoRA) concept to fine-tune the pre-trained Stable Diffusion model weights, significantly reducing training requirements while effectively integrating the essential features of transformer equipment image data. To minimize interference from complex backgrounds, the Segment Anything Model (SAM) is employed for preprocessing, thereby enhancing the quality of generated image data. The experimental results demonstrate significant improvements in evaluation metrics using the proposed method. Specifically, when implemented with the YOLOv7 model, the accuracy metric shows a 16.4 percentage point improvement compared to “Standard image transformations” (e.g., rotation and scaling) and a 2.3 percentage point improvement over DA-Fusion. Comparable improvements are observed in the SSD and Faster-RCNN object detection models. Notably, the model demonstrates advantages in reducing false-negative rates (higher Recall). The proposed approach successfully addresses key data augmentation challenges in transformer fault detection applications.
... Following the initial development of GANs, various architectures emerged, notably Deep Convolutional Generative Adversarial Networks (DCGANs) introduced by Radford et al. in 2015 [29], which extended the foundational GAN framework. While the Vanilla GAN's architecture contains simple downsampling and upsampling layers with ReLU activations and a Sigmoid activation for the discriminator, this variant of the GAN is made of strided convolution layers, batch norm layers, and LeakyReLU activation functions. ...
Preprint
Full-text available
This review surveys the state-of-the-art in text-to-image and image-to-image generation within the scope of generative AI. We provide a comparative analysis of three prominent architectures: Variational Autoencoders, Generative Adversarial Networks and Diffusion Models. For each, we elucidate core concepts, architectural innovations, and practical strengths and limitations, particularly for scientific image understanding. Finally, we discuss critical open challenges and potential future research directions in this rapidly evolving field.
... Unlike traditional discrepancy measures, NCFM operates within the complex plane to conduct minmax optimization. While instability is a common issue in minmax adversarial optimization, as seen in generative adversarial networks [2,37,39], NCFM consistently maintains stable optimization throughout training, as illustrated in Figure 7. This stability is further supported by theoretical guarantees of weak convergence in Theorem 1, demonstrating the robustness of the CF-based discrepancy under diverse conditions and contributing to NCFM's reliable convergence across datasets. ...
Preprint
Full-text available
Dataset distillation has emerged as a powerful approach for reducing data requirements in deep learning. Among various methods, distribution matching-based approaches stand out for their balance of computational efficiency and strong performance. However, existing distance metrics used in distribution matching often fail to accurately capture distributional differences, leading to unreliable measures of discrepancy. In this paper, we reformulate dataset distillation as a minmax optimization problem and introduce Neural Characteristic Function Discrepancy (NCFD), a comprehensive and theoretically grounded metric for measuring distributional differences. NCFD leverages the Characteristic Function (CF) to encapsulate full distributional information, employing a neural network to optimize the sampling strategy for the CF's frequency arguments, thereby maximizing the discrepancy to enhance distance estimation. Simultaneously, we minimize the difference between real and synthetic data under this optimized NCFD measure. Our approach, termed Neural Characteristic Function Matching (NCFM), inherently aligns the phase and amplitude of neural features in the complex plane for both real and synthetic data, achieving a balance between realism and diversity in synthetic samples. Experiments demonstrate that our method achieves significant performance gains over state-of-the-art methods on both low- and high-resolution datasets. Notably, we achieve a 20.5% accuracy boost on ImageSquawk. Our method also reduces GPU memory usage by over 300× and achieves 20× faster processing speeds compared to state-of-the-art methods. To the best of our knowledge, this is the first work to achieve lossless compression of CIFAR-100 on a single NVIDIA 2080 Ti GPU using only 2.3 GB of memory.
... The networks were trained adversarially using the loss functions defined in the Equations section. The latent noise vector z was sampled from Gaussian distribution, and the GAN was trained for 200 epochs [11]. ...
Conference Paper
In recent times, one of the important fields of research is anomaly detection in crowds, which is highly useful in public safety, urban planning, and event management issues. This paper focuses on cutting-edge machine learning approaches to address the problem of adaptive and explainable anomalous activity identification of running crowd behavior, where Recurrent Neural Networks (RNNs) and Generative Adversarial Networks (GANs) frameworks have been utilized to learn to perform predictive analysis. In this contribution, we present a framework that combines these approaches to boost the accuracy and interpretability of anomaly detection. We use RNNs to learn temporal characteristics and GANs to generate synthetic datasets from learned patterns in order to build a detection model. In addition to leveraging the most accurate models, we augment our predictions with Explainable Artificial Intelligence (XAI) techniques that provide transparency on how the decision-making in our models occurred, allowing better confidence in their predictions. The experimental results validate the applicability of the proposed approach on real crowd datasets, resulting in significant improvements in detection rate as compared to conventional approaches. The experimental results further indicate that our framework achieves high anomaly identification precision and provides explanations that are essential in real-world applications. By reducing the growing literature on anomaly detection to the goals of predictive performance and explainability, this research provides awareness towards developing safer and more efficient crowd management routines.
Article
Full-text available
In previous research on ghost imaging encoding transmission schemes, the influence of real transmission channels on the communication quality was weakened to some extent. Simultaneously, to ensure the imaging quality of the algorithm, it is often performed under full sampling or even supersampling, which undoubtedly requires a long sampling time. This paper proposes a ghost imaging reconstruction method that uses a generative adversarial network and Rayleigh fading channel. By introducing the channel transmission model (Rayleigh fading channel) in real scenes and the generative adversarial neural network model, the image is reconstructed under under-sampling and the imaging time is saved. To further explore how to improve the image transmission quality and reduce the channel interference as much as possible, this scheme provides a new imaging technology for the research of the image transmission field, which has good theoretical significance.
Chapter
This paper presents a framework for generating realistic photographs using Generative Adversarial Networks (GANs). Our framework comprises two key neural networks: the generator and the discriminator. The generator network creates images, while the discriminator network evaluates their authenticity. This framework highlights the potential and advancements of GANs in creating photorealistic visuals. For a comparative analysis, we created two distinct models with picture sizes of 128 pixels and 64 pixels, respectively, using two different GPUs, the T4x2 and the P100. The T4x2 model outperforms the P100 model on metrics such as LOSS_G, LOSS_D, Real Score, and Fake Score. From the P100 model to the T4x2 model, the average Fake Score shows a noteworthy improvement of about 59.34%, suggesting improved performance in identifying fake data in the later model.
Article
Full-text available
Today, due to the increasing recognition of the capabilities of machine learning and deep learning algorithms, the use of these algorithms is undergoing significant development. This ongoing evolution has led to the creation and enhancement of numerous algorithms and their variants. These advancements not only enhance the accuracy of algorithms but also pose challenges for researchers in terms of their understanding and utilization. The increasing capabilities of these algorithms have resulted in a dramatic rise in their utilization within seismic exploration. Among these, generative adversarial algorithms stand out due to their unique abilities and rapid progress, making them a crucial part of deep learning algorithms applied to various seismic exploration challenges. One notable characteristic of this algorithm is its high complexity and the existence of multiple variants. In this article, we aim to provide a comprehensive yet concise overview of generative adversarial algorithms, focusing on their theoretical foundations and mathematical underpinnings as they apply to seismic exploration. By doing so, we facilitate researchers' initial understanding of this algorithm, allowing them to grasp its fundamentals before delving into its intricacies and more time-consuming aspects. This approach enables researchers to intelligently and purposefully explore the algorithm according to their specific goals.
Chapter
Deep learning methods trained on 3D medical images typically do not generalize well as training data are relatively homogenous and small. One way to potentially overcome this issue is creating realistic-looking 3D medical images using generative models. This chapter describes the fundamental principles and architectures of generative models used for this purpose, such as those based on generative adversarial networks (GANs) and diffusion probabilistic models (DPMs). The chapter also reviews evaluation techniques for measuring the quality of synthetic medical images, including the evaluation of the biological plausibility of the anatomy displayed.
Article
Recently, there has been a growing interest in automatically collecting distributed solar photovoltaic (PV) installation information in smart grid systems, including the quantity and locations of solar PV deployments, as well as their profiling information across a given geospatial region. Most recent approaches still suffer from low detection accuracy due to insufficient sample and principal feature learning when building their models and also separation of rooftop object segmentation and identification during their detection processes. In addition, they cannot report accurate multi-deployment results. To address these problems, we design a new system, SolarDetector⁺, which can automatically and accurately detect and profile distributed solar PV arrays without any extra cost. In essence, SolarDetector⁺ first leverages multiple data augmentation techniques, including CycleGAN, Latent Diffusion Models, and Generative Adversarial Networks, to build a large rooftop satellite imagery dataset (RSID). Then, SolarDetector⁺ employs the Mask R-CNN algorithm to accurately identify rooftop solar PV arrays and learn the detailed installation information for each solar PV array simultaneously. We find that pre-trained SolarDetector⁺ yields an average Matthews correlation coefficient (MCC) of 0.862 to detect solar PV arrays over RSID, which is ∼50% better than the most recent open-source detection system, SolarFinder.
Article
Deep learning‐based side‐channel attacks (DL‐SCA) have attracted widespread attention in recent years, and most of the researchers are devoted to finding the optimal DL‐SCA method. At the same time, traditional SCA methods have lost their luster. However, traditional attacks still have certain advantages. Compared with the DL‐SCA method, they do not require cumbersome engineering of tuning DL models and hyperparameters, making them easier to implement. Correlation power analysis (CPA), as a traditional SCA method, is still widely used in various analysis scenarios and plays an important role. In CPA, the leakage model is the key to simulating the power consumption, and it decides the attack efficiency. However, the existing leakage models are designed based on theory but ignore the actual attack scene. We found that conditional generative adversarial networks (CGAN) can ideally learn the target device's leakage characteristics and real power consumption. We let CGAN pre‐learn the leakage of the target device, and then make the generator as the leakage model . The leakage model can characterize the leakages of the device and consider the presence of noise in the actual scenario. It can map the power consumption more realistically and accurately, which can lead to a more powerful CPA attack. In this work, three kinds of leakage models (1, 2, and 3 leakage models) corresponding to the labels least significant bit (LSB), hamming weight (HW), and identity (ID) of CGAN are discussed. The experimental results show that the 3 leakage model has better attack performance. Compared with the ordinary HW leakage model, the number of traces needed to recover the key on the ASCAD and SAKURA‐AES datasets reduced by about 38.9% and 85.9%, respectively.
Article
It is common in nonparametric estimation problems to impose a certain low-dimensional structure on the unknown parameter to avoid the curse of dimensionality. This paper considers a nonparametric distribution estimation problem with a structural assumption under which the target distribution is allowed to be singular with respect to the Lebesgue measure. In particular, we investigate the use of generative adversarial networks (GANs) for estimating the unknown distribution and obtain a convergence rate with respect to the $L^1$-Wasserstein metric. The convergence rate depends only on the underlying structure and noise level. More interestingly, under the same structural assumption, the convergence rate of GAN is strictly faster than the known rate of VAE in the literature. We also obtain a lower bound for the minimax optimal rate, which is conjectured to be sharp at least in some special cases. Although our upper and lower bounds for the minimax optimal rate do not match, the difference is not significant.
Article
This research paper explores the transformative potential of generative AI in the context of document processing within large financial organizations, with a particular focus on fraud detection. As financial institutions increasingly rely on vast amounts of documentation for operations ranging from customer onboarding to compliance, the inefficiencies and limitations of traditional manual processing methods become glaringly apparent. These legacy systems are not only time-consuming and prone to human error but also struggle with scalability, a critical requirement in today’s fast-paced financial environment. Moreover, manual systems and traditional Optical Character Recognition (OCR) engines often lack the necessary accuracy and contextual understanding to reliably process complex financial documents and detect fraudulent activities. While OCR technology has automated certain aspects of document processing, its inherent limitations in accuracy, particularly in dealing with degraded documents or complex layouts, and its inability to interpret context, significantly impede its effectiveness in high-stakes financial applications. Furthermore, OCR’s limited capability in detecting subtle indicators of fraud leaves financial organizations vulnerable to increasingly sophisticated fraudulent schemes. Generative AI emerges as a revolutionary solution to these challenges by enhancing the accuracy, scalability, and security of document processing systems. Unlike traditional OCR, generative AI models are designed to understand and interpret the context of documents, thereby significantly improving the accuracy of text recognition, even in complex scenarios. These AI models, trained on vast datasets, are capable of processing large volumes of documents in parallel, making them ideally suited for the high-speed, high-volume environments characteristic of financial institutions. 
Additionally, generative AI incorporates advanced algorithms that enhance fraud detection capabilities by analyzing patterns, detecting anomalies, and cross-referencing data across multiple documents. This approach not only improves the detection of fraudulent activities but also reduces the likelihood of false positives, thereby enhancing the overall reliability of the system. The paper further delves into the practical applications of generative AI in various critical areas within financial organizations. Key applications include Know Your Customer (KYC) compliance, where AI streamlines the processing and verification of customer documents, thereby ensuring both compliance with regulatory requirements and the authenticity of the information provided. In loan processing, generative AI accelerates the analysis of loan applications, providing real-time risk assessments that enable faster decision-making. Additionally, the technology is applied in invoice and payment processing, where it automates and verifies transactions, reducing errors and ensuring the timely execution of financial operations. In the realm of contract analysis, generative AI facilitates the extraction and interpretation of key terms and clauses, enabling more effective contract negotiation and management. Beyond its practical applications, the paper also addresses the continuous learning capabilities of generative AI models, which allow them to evolve and adapt to new data and document types over time. This feature is particularly crucial in the financial sector, where the types of documents and the nature of fraudulent activities are continually changing. The continuous learning aspect of generative AI ensures that the systems remain up-to-date and effective, even as new challenges and document types emerge. 
The research also highlights the comparative analysis between traditional OCR-based systems and AI-powered systems, demonstrating the superior performance, efficiency, and scalability of the latter. Moreover, the paper discusses the challenges associated with the implementation of generative AI in financial document processing. These include technical challenges such as the integration of AI systems with existing IT infrastructure, as well as regulatory and compliance issues that arise when deploying AI technologies in the highly regulated financial sector. Despite these challenges, the paper argues that the long-term benefits of adopting generative AI, including improved accuracy, enhanced fraud detection, and greater operational efficiency, far outweigh the initial hurdles. The research also considers the future of generative AI in financial document processing, suggesting that as the technology continues to advance, its applications and benefits will expand even further. Future research opportunities are identified, particularly in the areas of improving the efficiency and scalability of AI models, enhancing their ability to handle increasingly complex document types, and developing more sophisticated fraud detection algorithms. The paper concludes with a discussion on the potential long-term impact of generative AI on the financial industry, arguing that it will play a crucial role in shaping the future of financial operations by providing more accurate, scalable, and secure document processing solutions. This paper makes a significant contribution to the existing body of knowledge on the application of AI in financial services, particularly in the area of document processing and fraud detection. By providing a detailed analysis of the challenges faced by financial organizations and demonstrating how generative AI can address these challenges, the research offers valuable insights for both academic researchers and practitioners in the field. 
The findings presented in this paper have important implications for the future of document processing in financial organizations, suggesting that the adoption of generative AI will be essential for maintaining operational efficiency, accuracy, and security in an increasingly complex and fast-paced financial environment. In summary, this research not only highlights the transformative potential of generative AI in financial document processing but also provides a roadmap for its successful implementation in large financial organizations, with a particular emphasis on enhancing fraud detection capabilities.
Article
In recent years, convolutional neural networks (CNNs) have shown impressive feature representation abilities, but they have difficulty learning long-distance spatial structure information. Unlike CNNs, graph convolutional networks (GCNs) can handle the intrinsic manifold structures of hyperspectral images (HSIs) well. However, existing GCN-based classification methods do not fully utilize edge relationships, which limits their performance. In addition, the small number of training samples also hinders high-performance hyperspectral image classification. Therefore, this paper proposes a hybrid CNN-GCN network (HCGN) for hyperspectral image classification. Firstly, a graph edge enhanced module (GEEM) is designed to enhance the superpixel-level features of graph edge nodes and improve the spatial discrimination ability of ground objects. In particular, considering that multiscale information is complementary, a multiscale graph edge enhanced module (MS-GEEM) based on GEEM is proposed to fully utilize texture structures of different sizes. Then, in order to enhance the pixel-level multi-hierarchical fine feature representation of images, a multiscale cross fusion module (MS-CFM) based on the CNN framework is proposed. Finally, the extracted pixel-level features and superpixel-level features are cascaded. A series of experiments shows that, compared with some state-of-the-art methods, HCGN combines the advantages of the CNN and GCN frameworks and provides superior classification performance with limited training samples. The source code of the proposed method will be available publicly at https://github.com/Dilingliao/HCGN.
Article
Diffusion models (DMs) have achieved state-of-the-art performance on various generative tasks such as image synthesis, text-to-image, and text-guided image-to-image generation. However, the more powerful the DMs, the more harmful they can potentially be. Recent studies have shown that DMs are prone to a wide range of attacks, including adversarial attacks, membership inference attacks, backdoor injection, and various multi-modal threats. Since numerous pre-trained DMs are published widely on the Internet, potential threats from these attacks are especially detrimental to society, making DM-related security a topic worthy of investigation. Therefore, in this paper, we conduct a comprehensive survey of the security aspects of DMs, focusing on various attack and defense methods. First, we present essential background on five main types of DMs: denoising diffusion probabilistic models, denoising diffusion implicit models, noise conditioned score networks, stochastic differential equations, and multi-modal conditional DMs. We then survey recent works investigating different types of attacks that exploit the vulnerabilities of DMs, and thoroughly review potential countermeasures to mitigate each of the presented threats. Finally, we discuss open challenges of DM-related security and describe potential research directions for this topic.