Conference Paper

High-Resolution Image Synthesis with Latent Diffusion Models

... Text-to-image generation techniques have rapidly evolved thanks to the advances in image-generative models and language models. In particular, diffusion model-based methods, such as Imagen [47] and Stable Diffusion [44], achieve state-of-the-art quality. Although text input is helpful for intuitive control, it has spatial ambiguity and does not allow sufficient user control. ...
... For example, the second column from the left in Fig. 1 and the leftmost column in Fig. 9 show that it generates foreground objects and backgrounds with different styles. [Figure 1 caption: Spatially controlling text-to-image generation using visual guidance without additional training of Stable Diffusion [44]. Our method can generate images more faithful to given semantic masks than the state-of-the-art method (MultiDiffusion) [7].] Moreover, MultiDiffusion still struggles to generate images aligned to especially fine-grained masks (right four columns in Fig. 1). ...
... Masked-attention guidance simply encourages attention to increase or decrease according to semantic regions, and we leverage the ability of well-trained diffusion models to determine how much the network attends to which pixels and words. Masked-attention guidance is easy to implement and integrate into pre-trained off-the-shelf diffusion models (e.g., Stable Diffusion [44]). It can also be applied to the task of text-guided image editing. ...
Article
Full-text available
Text-to-image synthesis has achieved high-quality results with recent advances in diffusion models. However, text input alone has high spatial ambiguity and limited user controllability. Most existing methods allow spatial control through additional visual guidance (e.g., sketches and semantic masks) but require additional training with annotated images. In this paper, we propose a method for spatially controlling text-to-image generation without further training of diffusion models. Our method is based on the insight that the cross-attention maps reflect the positional relationship between words and pixels. Our aim is to control the attention maps according to given semantic masks and text prompts. To this end, we first explore a simple approach of directly swapping the cross-attention maps with constant maps computed from the semantic regions. Some prior works also allow training-free spatial control of text-to-image diffusion models by directly manipulating cross-attention maps. However, these approaches still suffer from misalignment to given masks because manipulated attention maps are far from actual ones learned by diffusion models. To address this issue, we propose masked-attention guidance, which can generate images more faithful to semantic masks via indirect control of attention to each word and pixel by manipulating noise images fed to diffusion models. Masked-attention guidance can be easily integrated into pre-trained off-the-shelf diffusion models (e.g., Stable Diffusion) and applied to the tasks of text-guided image editing. Experiments show that our method enables more accurate spatial control than baselines qualitatively and quantitatively.
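The guidance idea summarized above can be illustrated with a short, hedged sketch (this is not the authors' implementation; compute_attn, the mask layout, and the loss shape are illustrative assumptions): a loss rewards cross-attention mass inside each word's semantic mask and penalizes it outside, and the noisy latents are nudged along the negative gradient of that loss.

```python
import torch

def masked_attention_guidance_step(latents, compute_attn, masks, scale=1.0):
    """One illustrative guidance step (hypothetical helper names).

    compute_attn: callable mapping latents -> {word_idx: (H*W,) attention map},
                  in practice a U-Net forward pass with attention hooks.
    masks:        {word_idx: (H*W,) binary semantic mask for that word}.
    """
    latents = latents.detach().requires_grad_(True)
    attn = compute_attn(latents)
    # attention outside each word's mask is penalized, attention inside is rewarded
    loss = sum(((1.0 - masks[i]) * attn[i]).sum() - (masks[i] * attn[i]).sum()
               for i in attn)
    grad, = torch.autograd.grad(loss, latents)
    return (latents - scale * grad).detach()
```

Because the update acts on the noise fed to the diffusion model rather than overwriting the attention maps, the maps the network actually produces stay consistent with what it learned during training, which is the point the abstract makes against direct attention swapping.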
... Text-to-image generation Text-to-image models such as stable diffusion (Rombach et al., 2022) have gained extraordinary attention from both researchers and the general public. ControlNet (Zhang & Agrawala, 2023) builds upon this work and proposes to add controls to the diffusion network to make it adapt to task-specific conditions. ...
... It adapts a U-Net architecture similar to DDPM (Ho et al., 2020) but removes all self-attention layers. To inject clean content embeddings into the diffusion process, we introduce a cross-attention (Rombach et al., 2022) mechanism to learn semantic guidance from pre-trained VLMs. Considering the varying input sizes in image restoration tasks and the increasing cost of applying attention to high-resolution features, we only use cross-attention in the bottom blocks of the U-Net for sample efficiency. ...
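For context, a cross-attention block of the kind referenced here (queries from image features, keys and values from text or VLM embeddings) can be sketched as follows; the dimensions and module layout are illustrative rather than taken from the cited work.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Image tokens attend to context (text/VLM) embeddings."""
    def __init__(self, dim, ctx_dim, heads=8):
        super().__init__()
        self.heads, self.scale = heads, (dim // heads) ** -0.5
        self.to_q = nn.Linear(dim, dim, bias=False)      # queries from image features
        self.to_k = nn.Linear(ctx_dim, dim, bias=False)  # keys from context embeddings
        self.to_v = nn.Linear(ctx_dim, dim, bias=False)  # values from context embeddings
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, ctx):  # x: (B, N, dim), ctx: (B, M, ctx_dim)
        B, N, D = x.shape
        h = self.heads
        q = self.to_q(x).view(B, N, h, D // h).transpose(1, 2)
        k = self.to_k(ctx).view(B, -1, h, D // h).transpose(1, 2)
        v = self.to_v(ctx).view(B, -1, h, D // h).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)
        return self.proj(out)
```

Restricting such blocks to the low-resolution (bottom) levels of the U-Net, as described above, keeps the quadratic attention cost bounded while still injecting semantic guidance.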
... A key capability of deep learning is the possibility to learn latent representations of complex, high-dimensional data. The latent space is not only a crucial concept in deep generative models [67,68], it can also be used to compress bulky data [69,70], or to gain insight into hidden correlations [71][72][73]. ...
... Starting this iterative process on random noise, and combining it with some guidance using latent information about the original content, a denoising diffusion model can generate samples that match the target latent description. Such networks are very popular in computer vision and image generation ("text to image") [68,189,190]. Very recently, a first application of a diffusion model for metasurface inverse design has been demonstrated [191]. ...
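For reference, the iterative denoising described here follows the standard DDPM reverse update (Ho et al., 2020): starting from pure noise $x_T \sim \mathcal{N}(0, I)$, each step removes part of the predicted noise,

\[
x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z, \qquad z \sim \mathcal{N}(0, I),
\]

where $\epsilon_\theta$ is the trained denoising network and the latent description of the target content enters through the network's conditioning inputs.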
Article
Full-text available
Nanophotonic devices manipulate light at sub-wavelength scales, enabling tasks such as light concentration, routing, and filtering. Designing these devices to achieve precise light–matter interactions using structural parameters and materials is a challenging task. Traditionally, solving this problem has relied on computationally expensive, iterative methods. In recent years, deep learning techniques have emerged as promising tools for tackling the inverse design of nanophotonic devices. While several review articles have provided an overview of the progress in this rapidly evolving field, there is a need for a comprehensive tutorial that specifically targets newcomers without prior experience in deep learning. Our goal is to address this gap and provide practical guidance for applying deep learning to individual scientific problems. We introduce the fundamental concepts of deep learning and critically discuss the potential benefits it offers for various inverse design problems in nanophotonics. We present a suggested workflow and detailed, practical design guidelines to help newcomers navigate the challenges they may encounter. By following our guide, newcomers can avoid frustrating roadblocks commonly experienced when venturing into deep learning for the first time. In a second part, we explore different iterative and direct deep learning-based techniques for inverse design, and evaluate their respective advantages and limitations. To enhance understanding and facilitate implementation, we supplement the manuscript with detailed Python notebook examples, illustrating each step of the discussed processes. While our tutorial primarily focuses on researchers in (nano-)photonics, it is also relevant for those working with deep learning in other research domains. We aim at providing a solid starting point to empower researchers to leverage the potential of deep learning in their scientific pursuits.
... To overcome the limitations of existing anonymization tools, we introduce DiffSLVA, a novel anonymization approach leveraging large-scale pre-trained diffusion models, notably Stable Diffusion [28]. DiffSLVA is designed to tackle text-guided sign language anonymization. ...
... Diffusion models [11] have demonstrated exceptional performance in the field of generative AI. Once such models are trained on large-scale datasets (e.g., LAION [30]), text-guided latent diffusion models [28] (e.g., Stable Diffusion) are capable of producing diverse and high-quality images from a single text prompt. Additionally, ControlNet [46] presents a novel enhancement. ...
Preprint
Full-text available
Since American Sign Language (ASL) has no standard written form, Deaf signers frequently share videos in order to communicate in their native language. However, since both hands and face convey critical linguistic information in signed languages, sign language videos cannot preserve signer privacy. While signers have expressed interest, for a variety of applications, in sign language video anonymization that would effectively preserve linguistic content, attempts to develop such technology have had limited success, given the complexity of hand movements and facial expressions. Existing approaches rely predominantly on precise pose estimations of the signer in video footage and often require sign language video datasets for training. These requirements prevent them from processing videos 'in the wild,' in part because of the limited diversity present in current sign language video datasets. To address these limitations, our research introduces DiffSLVA, a novel methodology that utilizes pre-trained large-scale diffusion models for zero-shot text-guided sign language video anonymization. We incorporate ControlNet, which leverages low-level image features such as HED (Holistically-Nested Edge Detection) edges, to circumvent the need for pose estimation. Additionally, we develop a specialized module dedicated to capturing facial expressions, which are critical for conveying essential linguistic information in signed languages. We then combine the above methods to achieve anonymization that better preserves the essential linguistic content of the original signer. This innovative methodology makes possible, for the first time, sign language video anonymization that could be used for real-world applications, which would offer significant benefits to the Deaf and Hard-of-Hearing communities. We demonstrate the effectiveness of our approach with a series of signer anonymization experiments.
... Owing to the removal of group symmetry and polymer restraints, any modern generative methods can be straightforwardly applied on this latent space to generate and sample protein structures. Inspired by the success of latent generative models 36 , by further quantizing this latent space into a limited number of discrete tokens (called "ProTokens"), we not only obtained a new solution to protein structure compression, but also created a "protein language" which encodes sufficient information about protein structures and meanwhile remains amenable to LLMs. ProTokens are learned in a self-supervised manner, intuited by the profound connections between structure prediction and inverse folding. ...
... As in VQ-VAE 37,40 and VQ-GAN 44,45 , quantization is equivalent to clustering SLVs and degenerating SLVs within a cluster into one representation vector (also known as a "code"). Because the codebook (i.e., the set of cluster centers) is of finite size, quantization often plays the role of an "information bottleneck" in an autoencoder network 36 . ...
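A minimal sketch of the quantization step described here, i.e., a nearest-neighbour lookup into a finite codebook of cluster centers (shapes are illustrative):

```python
import torch

def quantize(z, codebook):
    """Map each latent vector to its nearest codebook entry (its 'code').

    z:        (N, D) encoder outputs
    codebook: (K, D) learned cluster centers
    Returns the quantized vectors and the integer token indices.
    """
    dist = torch.cdist(z, codebook)   # (N, K) pairwise Euclidean distances
    idx = dist.argmin(dim=1)          # index of the nearest code per vector
    return codebook[idx], idx         # quantized vectors and discrete tokens
```

The finite number K of codes is what creates the "information bottleneck" mentioned above: the continuous latent is forced through at most K distinct symbols.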
Preprint
Full-text available
Designing protein structures towards specific functions is of great value for science, industry and therapeutics. Although backbones can be designed with arbitrary variety in the coordinate space, the generated structures may not be stabilized by any combination of natural amino acids, resulting in a high failure risk for many design approaches. Aiming to sketch a compact space for designable protein structures, we present an unsupervised learning strategy by integrating the structure prediction and the inverse folding tasks, to encode protein structures into amino-acid-like discrete tokens and decode these tokens back to 3D coordinates. We show that tokenizing protein structures with proper perplexity can lead to compact and informative representations (ProTokens), which can reconstruct 3D coordinates with high fidelity and reduce the trans-rotational equivariance of protein structures. Therefore, protein structures can be efficiently compressed, stored, aligned and compared in the form of ProTokens. Besides, ProTokens enable protein structure design via various generative AI without the concern of symmetries, and even support joint design of the structure and sequence simultaneously. Additionally, as a modality transformer, ProTokens provide a domain-specific vocabulary, allowing large language models to perceive, process and explore the microscopic structures of biomolecules as effectively as learning a foreign language.
... Recently, large text-to-image Diffusion models [2,31,39,42,49] have shown promising results in synthesizing high quality images. These models empower users by enabling scene synthesis through natural language descriptions. ...
... Diffusion models [8,20,43] are a class of generative models that learn to generate novel scenes by learning to gradually denoise samples from the Gaussian prior x T (trained by adding noise ϵ ∼ N (0, 1) to x 0 ) back to the image x 0 . In this work, we study T2I models, namely the Latent Diffusion Model (LDM) [39] which is the predecessor of Stable Diffusion. Instead of directly adding noise and learning to denoise images, LDM operates on a pretrained autoencoder's projection of images in some latent space. ...
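Concretely, the noising process referred to here has the closed form (with $\bar{\alpha}_t = \prod_{s \le t} \alpha_s$):

\[
x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I),
\]

and in an LDM the same process is applied to the autoencoder latent $z_0 = \mathcal{E}(x_0)$ rather than to the image itself.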
Preprint
Full-text available
A text-based inversion method for text-to-image diffusion models that faithfully inverts adjective and verb concepts, learned from a few example images.
... This is achieved through the incorporation of two encoders, specifically the 14-layer Convolutional Neural Network (CNN14) [41] and Bidirectional Encoder Representations from Transformers (BERT) [42], in conjunction with contrastive learning. [43] introduce the Latent Diffusion Model (LDM), which is built upon the diffusion model [36], enabling high-resolution unconditional image generation and text-to-image synthesis. ...
Preprint
Full-text available
Process planning serves as a critical link between design and manufacturing, exerting a pivotal influence on the quality and efficiency of production. However, current intelligent process planning systems, like computer-aided process planning (CAPP), still contend with the challenge of realizing comprehensive automation in process decision-making. These obstacles chiefly involve, though are not confined to, issues like limited intelligence, poor flexibility, low reliability, and high usage thresholds. Generative artificial intelligence (AI) has attained noteworthy accomplishments in natural language processing (NLP), offering new perspectives to address these challenges. This paper summarizes the limitations of current intelligent process planning methods and explores the potential of integrating generative AI into process planning. By synergistically incorporating digital twins, this paper introduces a conceptual framework termed generative AI and digital twin-enabling intelligent process planning (GIPP). The paper elaborates on two supporting methodologies: process generative pre-trained transformer (ProcessGPT) modelling and a digital twin-based process verification method. Moreover, a prototype system is established to introduce the implementation and machining execution mechanism of GIPP for milling a specific thin-walled component. Three potential application scenarios and a comparative analysis are employed to elucidate the practicality of GIPP, providing new insights for intelligent process planning.
... (d) Entire face synthesis to generate fake face images that do not exist in reality. The common models include PGGAN [22], StyleGAN [23] and Latent Diffusion [24]. ...
Article
Full-text available
Convolutional neural networks (CNNs) have achieved impressive successes in fake face image detection. However, CNNs ignore tampering traces outside their attention scope. Moreover, example forgetting events can also pose negative impacts on the face forgery detection accuracy. To address these issues, this paper proposes an attention-expanded two-stage face forgery detector, named Attention-expanded Deepfake Spotter (ADS). In the first stage, the manipulated regions are preliminarily located by utilizing the Region Score Maps (RSMs) generated by the modified CNN. In the second stage, the Expanding and Undetectable Regions (EUR) loss function is designed to encourage another modified CNN to mine manipulation traces outside the manipulated areas exposed in the first stage. To fuse the manipulation traces extracted from different regions in the two stages and mitigate the problems caused by example forgetting events, RSM-weighted accumulation is adopted to integrate the detection information from both stages and obtain the final detection result. The proposed algorithm’s effectiveness for each component is analyzed through ablation experiments, and the method is evaluated on four publicly available datasets: FF++, HFF, DFDC, and Celeb-DF. The experimental results demonstrate that the proposed method has high detection rates and superior transferability, outperforming most existing algorithms.
... We chose to use cropped patches from the inspiratory and registered expiratory CT scans to reduce the training difficulty. Given the recently proposed diffusion models [44,45], the GAN model could be replaced with a diffusion model to improve the robustness of the generation model. Furthermore, as diffusion models can synthesize data from noisy inputs, we will further implement a diffusion model to obtain more training data and construct larger datasets with synthetic high-resolution CT. ...
Article
Full-text available
Objectives Parametric response mapping (PRM) enables the evaluation of small airway disease (SAD) at the voxel level, but requires both inspiratory and expiratory chest CT scans. We hypothesize that deep learning PRM from inspiratory chest CT scans can effectively evaluate SAD in individuals with normal spirometry. Methods We included 537 participants with normal spirometry, a history of smoking or secondhand smoke exposure, and divided them into training, tuning, and test sets. A cascaded generative adversarial network generated expiratory CT from inspiratory CT, followed by a UNet-like network predicting PRM using real inspiratory CT and generated expiratory CT. The performance of the prediction is evaluated using SSIM, RMSE and dice coefficients. Pearson correlation evaluated the correlation between predicted and ground truth PRM. ROC curves evaluated predicted PRMfSAD (the volume percentage of functional small airway disease, fSAD) performance in stratifying SAD. Results Our method can generate expiratory CT of good quality (SSIM 0.86, RMSE 80.13 HU). The predicted PRM dice coefficients for normal lung, emphysema, and fSAD regions are 0.85, 0.63, and 0.51, respectively. The volume percentages of emphysema and fSAD showed good correlation between predicted and ground truth PRM (|r| were 0.97 and 0.64, respectively, p < 0.05). Predicted PRMfSAD showed good SAD stratification performance with ground truth PRMfSAD at thresholds of 15%, 20% and 25% (AUCs were 0.84, 0.78, and 0.84, respectively, p < 0.001). Conclusion Our deep learning method generates high-quality PRM using inspiratory chest CT and effectively stratifies SAD in individuals with normal spirometry.
... Following prior works [2,45], we assess the visual quality using the FID metric and evaluate semantic consistency through the average CLIP similarity between video frames and the corresponding text. We benchmarked GPT4Video against four text-to-video specialized models: CogVideo [14], MakeVideo [37], Latent-VDM [35], and Latent-Shift [2], as well as two LLM-based methods, including CoDi [38] and NExt-GPT [45]. The results, presented in Table 3, demonstrate that GPT4Video outperforms the aforementioned models across all evaluated metrics. ...
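As a rough illustration of the CLIP-similarity protocol mentioned here (not the cited works' exact evaluation code; the checkpoint name and preprocessing are assumptions), the average frame-text cosine similarity can be computed with an off-the-shelf CLIP model:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def avg_clip_similarity(frames, prompt):
    """Average cosine similarity between video frames (PIL images) and a prompt."""
    inputs = processor(text=[prompt], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)   # unit-normalize embeddings
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img @ txt.T).mean().item()           # mean frame-text cosine similarity
```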
Preprint
Full-text available
While the recent advances in Multimodal Large Language Models (MLLMs) constitute a significant leap forward in the field, these models are predominantly confined to the realm of input-side multimodal comprehension, lacking the capacity for multimodal content generation. To fill this gap, we present GPT4Video, a unified multi-model framework that empowers Large Language Models (LLMs) with the capability of both video understanding and generation. Specifically, we develop an instruction-following-based approach integrated with the stable diffusion generative model, which has been demonstrated to handle video generation scenarios effectively and securely. GPT4Video offers the following benefits: 1) It exhibits impressive capabilities in both video understanding and generation scenarios. For example, GPT4Video outperforms Valley [25] by 11.8% on the Video Question Answering task, and surpasses NExt-GPT [45] by 2.3% on the Text to Video generation task. 2) It endows the LLM/MLLM with video generation capabilities without requiring additional training parameters and can flexibly interface with a wide range of models to perform video generation. 3) It maintains a safe and healthy conversation not only on the output side but also on the input side in an end-to-end manner. Qualitative and quantitative experiments demonstrate that GPT4Video holds the potential to function as an effective, safe and humanoid-like video assistant that can handle both video understanding and generation scenarios.
... For a problem where the input contains less information than the output, as in this case, a generative type of neural network is necessary. Recent large generative models such as diffusion 35 and transformer 36 have shown extremely powerful capability in generating image and language content, even stepping towards artificial general intelligence (AGI) 37 , yet for engineering problems on specific tasks, small-scale efficient models are still preferred, among which the tandem architecture has proven a very effective framework 20,21,[38][39][40] . ...
Article
Full-text available
Manipulating the electromagnetic (EM) scattering behavior from an arbitrary surface dynamically, for arbitrary design goals, is an ultimate ambition for many EM stealth and communication problems, yet it is nearly impossible to accomplish with conventional analysis and optimization techniques. Here we present a reconfigurable conformal metasurface prototype as well as a workflow that enables it to respond to multiple design targets on the reflection pattern with extremely low on-site computing power and time. The metasurface is driven by a sequential tandem neural network which is pre-trained using actual experimental data, avoiding any possible errors that may arise from calculation, simulation, or manufacturing tolerances. This platform empowers the surface to operate accurately in a complex environment including varying incident angle and operating frequency, or even with other scatterers present close to the surface. The proposed data-driven approach requires a minimal amount of prior knowledge and human effort yet provides maximal versatility in reflection control, stepping towards the end form of intelligent tunable EM surfaces.
... Cloud computing has also had an immense impact on Generative AI systems. It has been a key enabling factor in the development of massive Generative AI models such as LLMs (Large Language Models) [13,14] and Diffusion Models [15]. FalconLLM [13] is the one of the latest renditions in the line of LLMs and is currently one of the bestperforming open-source language models. ...
Article
Full-text available
As cloud computing rises in popularity across diverse industries, the necessity to compare and select the most appropriate cloud provider for specific use cases becomes imperative. This research conducts an in-depth comparative analysis of two prominent cloud platforms, Microsoft Azure and Amazon Web Services (AWS), with a specific focus on their suitability for deploying object-detection algorithms. The analysis covers both quantitative metrics—encompassing upload and download times, throughput, and inference time—and qualitative assessments like cost effectiveness, machine learning resource availability, deployment ease, and service-level agreement (SLA). Through the deployment of the YOLOv8 object-detection model, this study measures these metrics on both platforms, providing empirical evidence for platform evaluation. Furthermore, this research examines general platform availability and information accessibility to highlight differences in qualitative aspects. This paper concludes that Azure excels in download time (average 0.49 s/MB), inference time (average 0.60 s/MB), and throughput (1145.78 MB/s), and AWS excels in upload time (average 1.84 s/MB), cost effectiveness, ease of deployment, a wider ML service catalog, and superior SLA. However, the decision between either platform is based on the importance of their performance based on business-specific requirements. Hence, this paper ends by presenting a comprehensive comparison based on business-specific requirements, aiding stakeholders in making informed decisions when selecting a cloud platform for their machine learning projects.
... As suggested in Refs. [22][23][24], strong data augmentation can significantly support contrastive learning, leading to effective feature representation modeling. However, generating positive and negative pairs as samples for contrastive training is not feasible in our setup due to the lack of salient objects in conventional camouflage datasets. ...
Article
Chinese paper-cutting, as an ancient folk art, is facing difficulties in preserving and passing down its traditions due to a lack of skilled paper-cut artists. In contrast to other image generation tasks, paper-cut images not only necessitate symmetry and exaggeration but also demand a certain level of resemblance to human facial features. To address these issues, this paper proposes a dual-branch generative adversarial network model for automatically generating paper-cut images, referred to as CutGAN. Specifically, we first construct a paper-cut dataset consisting of 891 pairs of facial images and handcrafted paper-cut images to train and evaluate CutGAN. Next, during the pre-training phase, we utilize gender and eyeglasses recognition tasks to train the fixed encoder. In the fine-tuning phase, we design a flexible encoder based on the modified U-net structure without skip connections. Furthermore, we introduce an average face loss to augment the diversity and improve the quality of the generated paper-cut images. We conducted extensive qualitative and quantitative experiments, as well as ablation experiments, comparing CutGAN with state-of-the-art baseline models on the test set. The experimental results indicate that CutGAN outperforms other image translation models by generating paper-cut images that more accurately capture the essence of Chinese paper-cut art and closely resemble actual facial images.
Article
Full-text available
Deep neural networks are widely used in computer vision for image classification, segmentation and generation. They are also often criticised as “black boxes” because their decision-making process is often not interpretable by humans. However, learning explainable representations that explicitly disentangle the underlying mechanisms that structure observational data is still a challenge. To further explore the latent space and achieve generic processing, we propose a pipeline for discovering the explainable directions in the latent space of generative models. Since the latent space contains semantically meaningful directions and can be explained, we propose a pipeline to fully resolve the representation of the latent space. It consists of a Dirichlet encoder, conditional deterministic diffusion, a group-swap and a latent traversal module. We believe that this study provides an insight into the advancement of research explaining the disentanglement of neural networks in the community.
Chapter
The unsupervised anomaly detection problem holds great importance but remains challenging to address due to the myriad of data possibilities in our daily lives. Currently, distinct models are trained for different scenarios. In this work, we introduce a reconstruction-based anomaly detection structure built on the Latent Space Denoising Diffusion Probabilistic Model (LDM). This structure effectively detects anomalies in multi-class situations. When normal data comprises multiple object categories, existing reconstruction models often learn identical patterns. This leads to the successful reconstruction of both normal and anomalous data based on these patterns, resulting in the inability to distinguish anomalous data. To address this limitation, we implemented the LDM model. Its process of adding noise effectively disrupts identical patterns. Additionally, this advanced image generation model can generate images that deviate from the input. We have further proposed a classification model that compares the input with the reconstruction results, tapping into the generative power of the LDM model. Our structure has been tested on the MNIST and CIFAR-10 datasets, where it surpassed the performance of state-of-the-art reconstruction-based anomaly detection models.
Chapter
The empirical risk minimization approach of contemporary machine learning leads to potential failures under distribution shifts. While out-of-distribution data can be used to probe for robustness issues, collecting this at scale in the wild can be difficult given its nature. We propose a novel method to generate this data using pretrained foundation models. We train a language model to generate class-conditioned image captions that minimize their cosine similarity with that of corresponding class images from the original distribution. We then use these captions to synthesize new images with off-the-shelf text-to-image generative models. We show our method’s ability to generate samples from shifted distributions, and the quality of the data for both robustness testing and as additional training data to improve generalization.
Chapter
Recent years have witnessed the tremendous success of diffusion models in data synthesis. However, when diffusion models are applied to sensitive data, they also give rise to severe privacy concerns. In this paper, we present a comprehensive study about membership inference attacks against diffusion models, which aims to infer whether a sample was used to train the model. Two attack methods are proposed, namely loss-based and likelihood-based attacks. Our attack methods are evaluated on several state-of-the-art diffusion models, over different datasets in relation to privacy-sensitive data. Extensive experimental evaluations reveal the relationship between membership leakages and generative mechanisms of diffusion models. Furthermore, we exhaustively investigate various factors which can affect membership inference. Finally, we evaluate the membership risks of diffusion models trained with differential privacy.
Article
Full-text available
The rapid emergence of spatial transcriptomics (ST) technologies is revolutionizing our understanding of tissue spatial architecture and biology. Although current ST methods, whether based on next-generation sequencing (seq-based approaches) or fluorescence in situ hybridization (image-based approaches), offer valuable insights, they face limitations either in cellular resolution or transcriptome-wide profiling. To address these limitations, we present SpatialScope, a unified approach integrating scRNA-seq reference data and ST data using deep generative models. With innovation in model and algorithm designs, SpatialScope not only enhances seq-based ST data to achieve single-cell resolution, but also accurately infers transcriptome-wide expression levels for image-based ST data. We demonstrate SpatialScope’s utility through simulation studies and real data analysis from both seq-based and image-based ST approaches. SpatialScope provides spatial characterization of tissue structures at transcriptome-wide single-cell resolution, facilitating downstream analysis, including detecting cellular communication through ligand-receptor interactions, localizing cellular subtypes, and identifying spatially differentially expressed genes.
Article
Deep neural networks (DNNs) have enabled recent advances in the accuracy and robustness of video-oculography. However, to make robust predictions, most DNN models require extensive and diverse training data, which is costly to collect and label. In this work, we codevelop pylids, a pupil- and eyelid-estimation DNN model based on DeepLabCut. We show that the performance of pylids-based pupil estimation can be related to the distance of test data from the distribution of training data. Based on this principle, we explore methods for efficient data selection for training our DNN. We show that guided sampling of new data points from the training data approaches state-of-the-art pupil and eyelid estimation with fewer training data points. We also demonstrate the benefit of using an efficient sampling method to select data augmentations for training DNNs. These sampling methods aim to minimize the time and effort required to label and train DNNs while promoting model generalization on new diverse datasets.
Conference Paper
Full-text available
Stable-Diffusion enables the generation of optimal virtual interviewers through prompt engineering that includes various conditions. However, when trying to generate a full-body avatar, despite achieving high-quality results for the upper body, a limitation remains in that facial components are generated awkwardly. Therefore, this paper attempts prompt engineering to generate full-body avatars using Text2Human for the creation of virtual interviewers. Furthermore, we analyze performance by comparing against results obtained with Stable-Diffusion and explore the possibilities for application within a metaverse environment.
Conference Paper
Full-text available
Research on developing virtual interview systems in metaverse environments is being actively conducted [1]. However, manually creating diverse virtual interviewers requires considerable effort, so interest in automated methods based on generative models is growing. This paper aims to generate virtual interviewer images. The base Stable Diffusion [2] model struggles to produce satisfactory results even when the interviewer's detailed features, including facial components, are described in detail [3]. To address this problem, we present a model that fine-tunes Realistic Vision V5.1 [4], which can generate photorealistic images, with Dreambooth [5] so that it can generate virtual interviewers. Dreambooth enables high-quality results even from a small amount of data; as training data, we collect nine images of interviewers wearing suits via Google search. During training, the number of epochs is set to 200 and the initial learning rate to 0.000002. Experimental results confirm that the fine-tuned model consistently generates photorealistic, full-body interviewers.
Chapter
Recently, there has been an increase in the popularity of multimodal approaches in audio-related tasks, which involve using not only the audible modality but also textual or visual modalities in combination with sound. In this paper, we propose a robust audio representation learning method WavBriVL based on Bridging-Vision-and-Language (BriVL). It projects audio, image and text into a shared embedding space, so that multi-modal applications can be realized. We tested it on some downstream tasks, presented the images rearranged by our method, and evaluated them qualitatively and quantitatively. The main purpose of this article is to: (1) Explore new correlation representations between audio and images; (2) Explore a new way to generate images using audio. The experimental results show that this method can effectively match audio with images.
Article
In the rapidly advancing field of multi-modal machine learning (MMML), the convergence of multiple data modalities has the potential to reshape various applications. This paper presents a comprehensive overview of the current state, advancements, and challenges of MMML within the sphere of engineering design. The review begins with a deep dive into five fundamental concepts of MMML: multi-modal information representation, fusion, alignment, translation, and co-learning. Following this, we explore the cutting-edge applications of MMML, placing a particular emphasis on tasks pertinent to engineering design, such as cross-modal synthesis, multi-modal prediction, and cross-modal information retrieval. Through this comprehensive overview, we highlight the inherent challenges in adopting MMML in engineering design, and proffer potential directions for future research. To spur on the continued evolution of MMML in engineering design, we advocate for concentrated efforts to construct extensive multi-modal design datasets, develop effective data-driven MMML techniques tailored to design applications, and enhance the scalability and interpretability of MMML models. MMML models, as the next generation of intelligent design tools, hold a promising future to impact how products are designed.
Article
Biocatalysis harnesses enzymes to make valuable products. This green technology is used in countless applications from bench scale to industrial production and allows practitioners to access complex organic molecules, often with fewer synthetic steps and reduced waste. The last decade has seen an explosion in the development of experimental and computational tools to tailor enzymatic properties, equipping enzyme engineers with the ability to create biocatalysts that perform reactions not present in nature. By using (chemo)-enzymatic synthesis routes or orchestrating intricate enzyme cascades, scientists can synthesize elaborate targets ranging from DNA and complex pharmaceuticals to starch made in vitro from CO2-derived methanol. In addition, new chemistries have emerged through the combination of biocatalysis with transition metal catalysis, photocatalysis, and electrocatalysis. This review highlights recent key developments, identifies current limitations, and provides a future prospect for this rapidly developing technology.
Conference Paper
The autoencoder (AE) is an artificial intelligence (AI) technique considered useful for analyzing bio-signals (BS) and/or simulating them. Examples of such bio-signals include electrogastrograms (EGGs) and electroencephalograms (EEGs). In a previous study, we analyzed EGGs using an AE and compared mathematical models of EGGs in the seated posture with those in the supine posture. Here, the EEGs of normal subjects and of patients with Meniere's disease were converted to lower dimensions using a Variational AE (VAE), and the existence of characteristic differences was verified.
Article
Full-text available
We present SR3, an approach to image Super-Resolution via Repeated Refinement. SR3 adapts denoising diffusion probabilistic models [1], [2] to image-to-image translation, and performs super-resolution through a stochastic iterative denoising process. Output images are initialized with pure Gaussian noise and iteratively refined using a U-Net architecture that is trained on denoising at various noise levels, conditioned on a low-resolution input image. SR3 exhibits strong performance on super-resolution tasks at different magnification factors, on faces and natural images. We conduct human evaluation on a standard 8× face super-resolution task on CelebA-HQ for which SR3 achieves a fool rate close to 50%, suggesting photo-realistic outputs, while GAN baselines do not exceed a fool rate of 34%. We evaluate SR3 on a 4× super-resolution task on ImageNet, where SR3 outperforms baselines in human evaluation and classification accuracy of a ResNet-50 classifier trained on high-resolution images. We further show the effectiveness of SR3 in cascaded image generation, where a generative model is chained with super-resolution models to synthesize high-resolution images with competitive FID scores on the class-conditional 256×256 ImageNet generation challenge.
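The refinement described here follows the same DDPM-style update as unconditional models, except that the denoiser is additionally conditioned on the low-resolution input; a schematic form of one step is

\[
y_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(y_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\, f_\theta(x, y_t, t)\right) + \sigma_t z, \qquad z \sim \mathcal{N}(0, I),
\]

where $x$ is the low-resolution image (upsampled to the target resolution and provided to the U-Net alongside the noisy estimate $y_t$) and $f_\theta$ is the conditional denoiser; the exact noise-level parameterization in SR3 differs slightly from this schematic.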
Conference Paper
Full-text available
Numerous task-specific variants of conditional generative adversarial networks have been developed for image completion. Yet, a serious limitation remains that all existing algorithms tend to fail when handling large-scale missing regions. To overcome this challenge, we propose a generic new approach that bridges the gap between image-conditional and recent modulated unconditional generative architectures via co-modulation of both conditional and stochastic style representations. Also, due to the lack of good quantitative metrics for image completion, we propose the new Paired/Unpaired Inception Discriminative Score (P-IDS/U-IDS), which robustly measures the perceptual fidelity of inpainted images compared to real images via linear separability in a feature space. Experiments demonstrate superior performance in terms of both quality and diversity over state-of-the-art methods in free-form image completion and easy generalization to image-to-image translation. Code is available at https://github.com/zsyzzsoft/co-mod-gan.
Conference Paper
Full-text available
Over the last few years, deep learning techniques have yielded significant improvements in image inpainting. However, many of these techniques fail to reconstruct reasonable structures as they are commonly over-smoothed and/or blurry. This paper develops a new approach for image inpainting that does a better job of reproducing filled regions exhibiting fine details. We propose a two-stage adversarial model EdgeConnect that comprises an edge generator followed by an image completion network. The edge generator hallucinates edges of the missing region (both regular and irregular) of the image, and the image completion network fills in the missing regions using the hallucinated edges as a prior. We evaluate our model end-to-end over the publicly available datasets CelebA, Places2, and Paris StreetView, and show that it outperforms current state-of-the-art techniques quantitatively and qualitatively.
Conference Paper
Full-text available
Generative Adversarial Networks (GANs) excel at creating realistic images with complex models for which maximum likelihood is infeasible. However, the convergence of GAN training has still not been proved. We propose a two time-scale update rule (TTUR) for training GANs with stochastic gradient descent on arbitrary GAN loss functions. TTUR has an individual learning rate for both the discriminator and the generator. Using the theory of stochastic approximation, we prove that the TTUR converges under mild assumptions to a stationary local Nash equilibrium. The convergence carries over to the popular Adam optimization, for which we prove that it follows the dynamics of a heavy ball with friction and thus prefers flat minima in the objective landscape. For the evaluation of the performance of GANs at image generation, we introduce the "Fréchet Inception Distance" (FID) which captures the similarity of generated images to real ones better than the Inception Score. In experiments, TTUR improves learning for DCGANs and Improved Wasserstein GANs (WGAN-GP) outperforming conventional GAN training on CelebA, CIFAR-10, SVHN, LSUN Bedrooms, and the One Billion Word Benchmark. https://papers.nips.cc/paper/7240-gans-trained-by-a-two-time-scale-update-rule-converge-to-a-local-nash-equilibrium
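The FID introduced here fits a Gaussian to Inception features of real and generated images and compares the two distributions:

\[
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\!\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{1/2}\right),
\]

where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the mean and covariance of the Inception activations for real and generated samples; lower is better.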
Article
Full-text available
We introduce a new algorithm named WGAN, an alternative to traditional GAN training. In this new model, we show that we can improve the stability of learning, get rid of problems like mode collapse, and provide meaningful learning curves useful for debugging and hyperparameter searches. Furthermore, we show that the corresponding optimization problem is sound, and provide extensive theoretical work highlighting the deep connections to other distances between distributions.
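The WGAN objective replaces the original GAN loss with an estimate of the Wasserstein-1 distance via its Kantorovich-Rubinstein dual:

\[
\min_{G}\;\max_{\lVert f \rVert_L \le 1}\;\; \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[f(x)\big] - \mathbb{E}_{z \sim p(z)}\big[f(G(z))\big],
\]

where the critic $f$ is constrained to be 1-Lipschitz (enforced by weight clipping in the original formulation).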
Article
Full-text available
Semantic classes can be either things (objects with a well-defined shape, e.g. car, person) or stuff (amorphous background regions, e.g. grass, sky). While lots of classification and detection works focus on thing classes, less attention has been given to stuff classes. Nonetheless, stuff classes are important as they allow explaining important aspects of an image, including (1) scene type; (2) which thing classes are likely to be present and their location (determined through contextual reasoning); (3) physical attributes, material types and geometric properties of the scene. To understand stuff and things in context we annotate 10,000 images of the COCO dataset with a broad range of stuff classes, using a specialized stuff annotation protocol allowing us to efficiently label each pixel. On this dataset, we analyze several aspects: (a) the importance of stuff and thing classes in terms of their surface cover and how frequently they are mentioned in image captions; (b) the importance of several visual criteria to discriminate stuff and thing classes; (c) we study the spatial relations between stuff and things, highlighting the rich contextual relations that make our dataset unique. Furthermore, we show experimentally how modern semantic segmentation methods perform on stuff and thing classes and answer the question of whether stuff is easier to segment than things. We release our new dataset and the trained models online, hopefully promoting further research on stuff and stuff-thing contextual relations.
Conference Paper
Full-text available
Automatic synthesis of realistic images from text would be interesting and useful, but current AI systems are still far from this goal. However, in recent years generic and powerful recurrent neural network architectures have been developed to learn discriminative text feature representations. Meanwhile, deep convolutional generative adversarial networks (GANs) have begun to generate highly compelling images of specific categories such as faces, album covers, room interiors etc. In this work, we develop a novel deep architecture and GAN formulation to effectively bridge these advances in text and image modeling, translating visual concepts from characters to pixels. We demonstrate the capability of our model to generate plausible images of birds and flowers from detailed text descriptions.
Article
Full-text available
The state-of-the-art visual recognition algorithms are all data-hungry, requiring a huge amount of labeled image data to optimize millions of parameters. While there has been remarkable progress in algorithm and system design, the labeled datasets used by these models are quickly becoming outdated in terms of size. To overcome the bottleneck of human labeling speed during dataset construction, we propose to amplify human effort using deep learning with humans in the loop. Our procedure comes equipped with precision and recall guarantees to ensure labeling quality, reaching the same level of performance as fully manual annotation. To demonstrate the power of our annotation procedure and enable further progress of visual recognition, we construct a scene-centric database called "LSUN" containing millions of labeled images in each scene category. We experiment with popular deep nets using our dataset and obtain a substantial performance gain with the same model trained using our larger training set. All data and source code will be available online upon acceptance of the paper.
Article
Full-text available
There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net .
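A minimal two-level sketch of the contracting/expanding architecture described here, with a single skip connection; channel counts are illustrative, and unlike the original network this sketch uses padded convolutions so input and output resolutions match:

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    """Contracting path for context, expanding path for precise localization."""
    def __init__(self, in_ch=1, out_ch=2, base=32):
        super().__init__()
        self.enc1 = conv_block(in_ch, base)
        self.enc2 = conv_block(base, base * 2)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec1 = conv_block(base * 2, base)           # concatenated skip + upsampled
        self.head = nn.Conv2d(base, out_ch, 1)           # per-pixel class logits

    def forward(self, x):                                # H and W should be even
        e1 = self.enc1(x)                                # full-resolution features
        e2 = self.enc2(self.pool(e1))                    # half-resolution context
        d1 = self.dec1(torch.cat([self.up(e2), e1], 1))  # skip connection
        return self.head(d1)
```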
Article
Full-text available
We propose a deep learning framework for modeling complex high-dimensional densities via Non-linear Independent Component Estimation (NICE). It is based on the idea that a good representation is one in which the data has a distribution that is easy to model. For this purpose, a non-linear deterministic transformation of the data is learned that maps it to a latent space so as to make the transformed data conform to a factorized distribution, i.e., resulting in independent latent variables. We parametrize this transformation so that computing the determinant of the Jacobian and inverse Jacobian is trivial, yet we maintain the ability to learn complex non-linear transformations, via a composition of simple building blocks, each based on a deep neural network. The training criterion is simply the exact log-likelihood, which is tractable, and unbiased ancestral sampling is also easy. We show that this approach yields good generative models on four image datasets and can be used for inpainting.
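The tractable log-likelihood criterion described here is the change-of-variables formula, and the coupling design keeps the Jacobian term trivial:

\[
\log p_X(x) = \log p_Z\big(f(x)\big) + \log\left|\det \frac{\partial f(x)}{\partial x}\right|,
\qquad
y_1 = x_1,\quad y_2 = x_2 + m(x_1),
\]

where an additive coupling layer leaves the partition $x_1$ unchanged and shifts $x_2$ by a learned network $m$; its Jacobian is triangular with unit diagonal, so it contributes zero to the log-determinant, and inversion is simply $x_2 = y_2 - m(y_1)$.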
Article
We begin with the hypothesis that a model must be able to understand individual objects and relationships between objects in order to generate complex scenes with multiple objects well. Our layout-to-image-generation method, which we call Object-Centric Generative Adversarial Network (or OC-GAN), relies on a novel Scene-Graph Similarity Module (SGSM). The SGSM learns representations of the spatial relationships between objects in the scene, which lead to our model's improved layout-fidelity. We also propose changes to the conditioning mechanism of the generator that enhance its object instance-awareness. Apart from improving image quality, our contributions mitigate two failure modes in previous approaches: (1) spurious objects being generated without corresponding bounding boxes in the layout, and (2) overlapping bounding boxes in the layout leading to merged objects in images. Extensive quantitative evaluation and ablation studies demonstrate the impact of our contributions, with our model outperforming previous state-of-the-art approaches on both the COCO-Stuff and Visual Genome datasets. Finally, we address an important limitation of evaluation metrics used in previous works by introducing SceneFID -- an object-centric adaptation of the popular Fréchet Inception Distance metric, that is better suited for multi-object images.
Article
Recently, deep neural networks have achieved promising performance for in-filling large missing regions in image inpainting tasks. They have usually adopted the standard convolutional architecture over the corrupted image, leading to meaningless contents, such as color discrepancy, blur, and other artifacts. Moreover, most inpainting approaches cannot handle well the case of a large contiguous missing area. To address these problems, we propose a generic inpainting framework capable of handling incomplete images with both contiguous and discontiguous large missing areas. We pose this in an adversarial manner, deploying regionwise operations in both the generator and discriminator to separately handle the different types of regions, namely, existing regions and missing ones. Moreover, a correlation loss is introduced to capture the nonlocal correlations between different patches, and thus, guide the generator to obtain more information during inference. With the help of regionwise generative adversarial mechanism, our framework can restore semantically reasonable and visually realistic images for both discontiguous and contiguous large missing areas. Extensive experiments on three widely used datasets for image inpainting task have been conducted, and both qualitative and quantitative experimental results demonstrate that the proposed model significantly outperforms the state-of-the-art approaches, on the large contiguous and discontiguous missing areas.
Article
In this paper, we show that popular Generative Adversarial Network (GAN) variants exacerbate biases along the axes of gender and skin tone in the generated data. The use of synthetic data generated by GANs is widely used for a variety of tasks ranging from data augmentation to stylizing images. While practitioners celebrate this method as an economical way to obtain synthetic data to train data-hungry machine learning models or provide new features to users of mobile applications, it is unclear whether they recognize the perils of such techniques when applied to real world datasets biased along latent dimensions. Although one expects GANs to replicate the distribution of the original data, in real-world settings with limited data and finite network capacity, GANs suffer from mode collapse. First, we show readily-accessible GAN variants such as DCGANs ‘imagine’ faces of synthetic engineering professors that have masculine facial features and fair skin tones. When using popular GAN architectures that attempt to address mode-collapse, we observe that these variants either provide a false sense of security or suffer from other inherent limitations due to their design choice. Second, we show that a conditional GAN variant transforms input images of female and nonwhite faces to have more masculine features and lighter skin when asked to generate faces of engineering professors. Worse yet, prevalent filters on Snapchat end up consistently lightening the skin tones in people of color when trying to make face images appear more feminine. Thus, our study is meant to serve as a cautionary tale for practitioners and educate them about the side-effect of bias amplification when applying GAN-based techniques.
Article
With the remarkable recent progress on learning deep generative models, it becomes increasingly interesting to develop models for controllable image synthesis from reconfigurable structured inputs. This paper focuses on a recently emerged task, layout-to-image , whose goal is to learn generative models for synthesizing photo-realistic images from a spatial layout (i.e., object bounding boxes configured in an image lattice) and its style codes (i.e., structural and appearance variations encoded by latent vectors). This paper first proposes an intuitive paradigm for the task, layout-to-mask-to-image , which learns to unfold object masks in a weakly-supervised way based on an input layout and object style codes. The layout-to-mask component deeply interacts with layers in the generator network to bridge the gap between an input layout and synthesized images. Then, this paper presents a method built on Generative Adversarial Networks (GANs) for the proposed layout-to-mask-to-image synthesis with layout and style control at both image and object levels. The controllability is realized by a proposed novel Instance-Sensitive and Layout-Aware Normalization (ISLA-Norm) scheme. A layout semi-supervised version of the proposed method is further developed without sacrificing performance. In experiments, the proposed method is tested in the COCO-Stuff dataset and the Visual Genome dataset with state-of-the-art performance obtained.
Article
The field of artificial intelligence has experienced a dramatic methodological shift towards large neural networks trained on plentiful data. This shift has been fueled by recent advances in hardware and techniques enabling remarkable levels of computation, resulting in impressive advances in AI across many applications. However, the massive computation required to obtain these exciting results is costly both financially, due to the price of specialized hardware and electricity or cloud compute time, and to the environment, as a result of non-renewable energy used to fuel modern tensor processing hardware. In a paper published this year at ACL, we brought this issue to the attention of NLP researchers by quantifying the approximate financial and environmental costs of training and tuning neural network models for NLP (Strubell, Ganesh, and McCallum 2019). In this extended abstract, we briefly summarize our findings in NLP, incorporating updated estimates and broader information from recent related publications, and provide actionable recommendations to reduce costs and improve equity in the machine learning and artificial intelligence community.
Article
We present Open Images V4, a dataset of 9.2M images with unified annotations for image classification, object detection and visual relationship detection. The images have a Creative Commons Attribution license that allows sharing and adapting the material, and they have been collected from Flickr without a predefined list of class names or tags, leading to natural class statistics and avoiding an initial design bias. Open Images V4 offers large scale across several dimensions: 30.1M image-level labels for 19.8k concepts, 15.4M bounding boxes for 600 object classes, and 375k visual relationship annotations involving 57 classes. For object detection in particular, we provide 15× more bounding boxes than the next largest datasets (15.4M boxes on 1.9M images). The images often show complex scenes with several objects (8 annotated objects per image on average). We annotated visual relationships between them, which support visual relationship detection, an emerging task that requires structured reasoning. We provide in-depth comprehensive statistics about the dataset, we validate the quality of the annotations, we study how the performance of several modern models evolves with increasing amounts of training data, and we demonstrate two applications made possible by having unified annotations of multiple types coexisting in the same images. We hope that the scale, quality, and variety of Open Images V4 will foster further research and innovation even beyond the areas of image classification, object detection, and visual relationship detection.
Article
We propose an alternative generator architecture for generative adversarial networks, borrowing from style transfer literature. The new architecture leads to an automatically learned, unsupervised separation of high-level attributes (e.g., pose and identity when trained on human faces) and stochastic variation in the generated images (e.g., freckles, hair), and it enables intuitive, scale-specific control of the synthesis. The new generator improves the state-of-the-art in terms of traditional distribution quality metrics, leads to demonstrably better interpolation properties, and also better disentangles the latent factors of variation. To quantify interpolation quality and disentanglement, we propose two new, automated methods that are applicable to any generator architecture. Finally, we introduce a new, highly varied and high-quality dataset of human faces.
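For illustration, below is a minimal sketch of style modulation via adaptive instance normalization, in the spirit of the scale-specific control described above; the name `AdaIN` and the dimensions are illustrative, not taken from the paper's code.

```python
# A minimal sketch of style modulation via adaptive instance normalization (AdaIN):
# per-channel statistics are normalized away and replaced by a scale and bias
# predicted from a latent style vector.
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    def __init__(self, channels: int, w_dim: int):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.affine = nn.Linear(w_dim, 2 * channels)  # predicts (scale, bias) per channel

    def forward(self, x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
        scale, bias = self.affine(w).chunk(2, dim=1)          # (B, C) each
        scale = scale[:, :, None, None]
        bias = bias[:, :, None, None]
        return (1 + scale) * self.norm(x) + bias

x = torch.randn(4, 256, 16, 16)     # intermediate feature maps
w = torch.randn(4, 512)             # style code from a mapping network
print(AdaIN(256, 512)(x, w).shape)  # torch.Size([4, 256, 16, 16])
```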
Article
Recent work has shown local convergence of GAN training for absolutely continuous data and generator distributions. In this note we show that the requirement of absolute continuity is necessary: we describe a simple yet prototypical counterexample showing that in the more realistic case of distributions that are not absolutely continuous, unregularized GAN training is generally not convergent. Furthermore, we discuss recent regularization strategies that were proposed to stabilize GAN training. Our analysis shows that while GAN training with instance noise or gradient penalties converges, Wasserstein-GANs and Wasserstein-GANs-GP with a finite number of discriminator updates per generator update do not, in general, converge to the equilibrium point. We explain these results and show that both instance noise and gradient penalties constitute solutions to the problem of purely imaginary eigenvalues of the Jacobian of the gradient vector field. Based on our analysis, we also propose a simplified gradient penalty with the same effects on local convergence as more complicated penalties.
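As a concrete illustration, here is a minimal PyTorch sketch of a zero-centered gradient penalty on real samples, in the spirit of the simplified penalty discussed in the note; the toy discriminator and the weighting factor are placeholders.

```python
# A minimal sketch of a zero-centered gradient penalty on real data, in the spirit
# of the simplified penalty discussed in the note. The discriminator is a placeholder.
import torch
import torch.nn as nn

def r1_penalty(discriminator: nn.Module, real: torch.Tensor) -> torch.Tensor:
    real = real.detach().requires_grad_(True)
    scores = discriminator(real).sum()
    (grad,) = torch.autograd.grad(scores, real, create_graph=True)
    return grad.pow(2).flatten(1).sum(dim=1).mean()

disc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 1))
real_batch = torch.randn(8, 3, 32, 32)
penalty = r1_penalty(disc, real_batch)
# total discriminator loss would be: adversarial_loss + (gamma / 2) * penalty
print(penalty.item())
```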
Article
While it is nearly effortless for humans to quickly assess the perceptual similarity between two images, the underlying processes are thought to be quite complex. Despite this, the most widely used perceptual metrics today, such as PSNR and SSIM, are simple, shallow functions, and fail to account for many nuances of human perception. Recently, the deep learning community has found that features of the VGG network trained on the ImageNet classification task have been remarkably useful as a training loss for image synthesis. But how perceptual are these so-called "perceptual losses"? What elements are critical for their success? To answer these questions, we introduce a new Full Reference Image Quality Assessment (FR-IQA) dataset of perceptual human judgments, orders of magnitude larger than previous datasets. We systematically evaluate deep features across different architectures and tasks and compare them with classic metrics. We find that deep features outperform all previous metrics by huge margins. More surprisingly, this result is not restricted to ImageNet-trained VGG features, but holds across different deep architectures and levels of supervision (supervised, self-supervised, or even unsupervised). Our results suggest that perceptual similarity is an emergent property shared across deep visual representations.
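For illustration, the sketch below compares unit-normalized VGG features of two images; it shows the general recipe only and is not the calibrated LPIPS metric, which additionally learns per-channel weights from the human judgments. The chosen feature layers are an assumption.

```python
# A minimal sketch of a deep perceptual distance: compare unit-normalized VGG16
# features of two images at several layers and average the squared differences.
import torch
import torchvision.models as models

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()  # torchvision >= 0.13

def deep_feature_distance(x: torch.Tensor, y: torch.Tensor, layers=(3, 8, 15, 22)) -> torch.Tensor:
    dist = 0.0
    hx, hy = x, y
    for i, layer in enumerate(vgg):
        hx, hy = layer(hx), layer(hy)
        if i in layers:
            # unit-normalize along the channel dimension before comparing
            nx = hx / (hx.norm(dim=1, keepdim=True) + 1e-8)
            ny = hy / (hy.norm(dim=1, keepdim=True) + 1e-8)
            dist = dist + (nx - ny).pow(2).mean()
    return dist

a, b = torch.rand(1, 3, 224, 224), torch.rand(1, 3, 224, 224)
with torch.no_grad():
    print(deep_feature_distance(a, b).item())
```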
Article
Learning useful representations without supervision remains a key challenge in machine learning. In this paper, we propose a simple yet powerful generative model that learns such discrete representations. Our model, the Vector Quantised-Variational AutoEncoder (VQ-VAE), differs from VAEs in two key ways: the encoder network outputs discrete, rather than continuous, codes; and the prior is learnt rather than static. In order to learn a discrete latent representation, we incorporate ideas from vector quantisation (VQ). Using the VQ method allows the model to circumvent issues of "posterior collapse" -- where the latents are ignored when they are paired with a powerful autoregressive decoder -- typically observed in the VAE framework. Pairing these representations with an autoregressive prior, the model can generate high quality images, videos, and speech as well as doing high quality speaker conversion and unsupervised learning of phonemes, providing further evidence of the utility of the learnt representations.
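As a concrete illustration of the quantization step, here is a minimal sketch: each encoder output vector is mapped to its nearest codebook entry, with a straight-through estimator so gradients reach the encoder. The loss terms are shown in simplified form and hyperparameters are illustrative.

```python
# A minimal sketch of vector quantization with a straight-through estimator.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes: int, code_dim: int, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta

    def forward(self, z_e: torch.Tensor):
        # z_e: (B, D, H, W) continuous encoder outputs
        b, d, h, w = z_e.shape
        flat = z_e.permute(0, 2, 3, 1).reshape(-1, d)               # (B*H*W, D)
        dists = torch.cdist(flat, self.codebook.weight)             # (B*H*W, K)
        idx = dists.argmin(dim=1)                                   # nearest code index
        z_q = self.codebook(idx).view(b, h, w, d).permute(0, 3, 1, 2)
        # codebook loss + commitment loss (simplified)
        loss = F.mse_loss(z_q, z_e.detach()) + self.beta * F.mse_loss(z_e, z_q.detach())
        z_q = z_e + (z_q - z_e).detach()                            # straight-through estimator
        return z_q, loss, idx

vq = VectorQuantizer(num_codes=512, code_dim=64)
z_q, vq_loss, codes = vq(torch.randn(2, 64, 8, 8))
print(z_q.shape, vq_loss.item())
```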
Article
We describe a new training methodology for generative adversarial networks. The key idea is to grow both the generator and discriminator progressively: starting from a low resolution, we add new layers that model increasingly fine details as training progresses. This both speeds the training up and greatly stabilizes it, allowing us to produce images of unprecedented quality, e.g., CelebA images at 1024×1024. We also propose a simple way to increase the variation in generated images, and achieve a record inception score of 8.80 in unsupervised CIFAR10. Additionally, we describe several implementation details that are important for discouraging unhealthy competition between the generator and discriminator. Finally, we suggest a new metric for evaluating GAN results, both in terms of image quality and variation. As an additional contribution, we construct a higher-quality version of the CelebA dataset.
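For illustration, here is a minimal sketch of the fade-in used when a higher-resolution block is added: the new block's output is linearly blended with the upsampled output of the previous stage while alpha ramps from 0 to 1. Module names and sizes are illustrative.

```python
# A minimal sketch of progressive growing's fade-in blending.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FadeInStage(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.new_block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.LeakyReLU(0.2),
        )
        self.to_rgb_new = nn.Conv2d(out_ch, 3, 1)
        self.to_rgb_old = nn.Conv2d(in_ch, 3, 1)

    def forward(self, feats: torch.Tensor, alpha: float) -> torch.Tensor:
        up = F.interpolate(feats, scale_factor=2, mode="nearest")
        new_rgb = self.to_rgb_new(self.new_block(up))   # path through the new layers
        old_rgb = self.to_rgb_old(up)                   # old path, just upsampled
        return alpha * new_rgb + (1.0 - alpha) * old_rgb

stage = FadeInStage(in_ch=128, out_ch=64)
feats = torch.randn(2, 128, 16, 16)
for alpha in (0.0, 0.5, 1.0):                           # alpha grows during training
    print(stage(feats, alpha).shape)                    # torch.Size([2, 3, 32, 32])
```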
Technical Report
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively.
Article
The rise of multi-million-item dataset initiatives has enabled data-hungry machine learning algorithms to reach near-human semantic classification performance at tasks such as visual object and scene recognition. Here we describe the Places Database, a repository of 10 million scene photographs, labeled with scene semantic categories, comprising a large and diverse list of the types of environments encountered in the world. Using the state-of-the-art Convolutional Neural Networks (CNNs), we provide scene classification CNNs (Places-CNNs) as baselines, that significantly outperform the previous approaches. Visualization of the CNNs trained on Places shows that object detectors emerge as an intermediate representation of scene classification. With its high-coverage and high-diversity of exemplars, the Places Database along with the Places-CNNs offer a novel resource to guide future progress on scene recognition problems.
Article
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
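For illustration, below is a minimal sketch of the scaled dot-product attention at the core of the architecture, shown with a single head and no masking.

```python
# A minimal sketch of scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
import math
import torch

def scaled_dot_product_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (B, L_q, L_k)
    weights = scores.softmax(dim=-1)                    # attention weights over keys
    return weights @ v                                  # (B, L_q, d_v)

q = torch.randn(2, 5, 64)   # 5 query positions
k = torch.randn(2, 7, 64)   # 7 key positions
v = torch.randn(2, 7, 64)
print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([2, 5, 64])
```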
Article
PixelCNNs are a recently proposed class of powerful generative models with tractable likelihood. Here we discuss our implementation of PixelCNNs which we make available at https://github.com/openai/pixel-cnn. Our implementation contains a number of modifications to the original model that both simplify its structure and improve its performance. 1) We use a discretized logistic mixture likelihood on the pixels, rather than a 256-way softmax, which we find to speed up training. 2) We condition on whole pixels, rather than R/G/B sub-pixels, simplifying the model structure. 3) We use downsampling to efficiently capture structure at multiple resolutions. 4) We introduce additional short-cut connections to further speed up optimization. 5) We regularize the model using dropout. Finally, we present state-of-the-art log likelihood results on CIFAR-10 to demonstrate the usefulness of these modifications.
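To make point (1) concrete, here is a sketch of a single-component discretized logistic log-likelihood for pixels rescaled to [-1, 1]; the paper uses a mixture of such components with careful edge handling, so this only conveys the core idea and all numerical choices are assumptions.

```python
# A minimal, single-component sketch of a discretized logistic log-likelihood for
# pixel values rescaled to [-1, 1] with 256 levels: the probability of a pixel is
# the logistic CDF mass falling inside its bin, with the edge bins covering the tails.
import torch

def discretized_logistic_logprob(x, mean, log_scale, num_levels: int = 256):
    # x, mean, log_scale: tensors of the same shape; x already rescaled to [-1, 1]
    inv_scale = torch.exp(-log_scale)
    half_bin = 1.0 / (num_levels - 1)
    cdf_plus = torch.sigmoid(inv_scale * (x - mean + half_bin))
    cdf_minus = torch.sigmoid(inv_scale * (x - mean - half_bin))
    prob = torch.where(x < -0.999, cdf_plus,
           torch.where(x > 0.999, 1.0 - cdf_minus, cdf_plus - cdf_minus))
    return torch.log(prob.clamp(min=1e-12))

x = (torch.randint(0, 256, (2, 3, 8, 8)).float() / 127.5) - 1.0
mean = torch.zeros_like(x)
log_scale = torch.zeros_like(x)
print(discretized_logistic_logprob(x, mean, log_scale).sum().item())
```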
Article
We introduce a method to stabilize Generative Adversarial Networks (GANs) by defining the generator objective with respect to an unrolled optimization of the discriminator. This allows training to be adjusted between using the optimal discriminator in the generator's objective, which is ideal but infeasible in practice, and using the current value of the discriminator, which is often unstable and leads to poor solutions. We show how this technique solves the common problem of mode collapse, stabilizes training of GANs with complex recurrent generators, and increases diversity and coverage of the data distribution by the generator.
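For illustration, the sketch below performs a few lookahead discriminator updates on a copy and trains the generator against that copy. The full method also differentiates through the unrolled updates, which this first-order simplification omits; the toy networks and step counts are placeholders.

```python
# A minimal, first-order sketch of the unrolling idea: update a *copy* of the
# discriminator for k steps, then compute the generator loss against that copy.
import copy
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 2))        # toy generator: z -> 2-D sample
D = nn.Sequential(nn.Linear(2, 1))         # toy discriminator (updated separately in full training)
bce = nn.BCEWithLogitsLoss()
g_opt = torch.optim.Adam(G.parameters(), lr=1e-3)

def unrolled_generator_step(real: torch.Tensor, k: int = 3) -> float:
    D_unrolled = copy.deepcopy(D)
    d_opt = torch.optim.SGD(D_unrolled.parameters(), lr=1e-2)
    for _ in range(k):                      # k lookahead discriminator steps
        fake = G(torch.randn(real.shape[0], 16)).detach()
        d_loss = bce(D_unrolled(real), torch.ones(real.shape[0], 1)) + \
                 bce(D_unrolled(fake), torch.zeros(real.shape[0], 1))
        d_opt.zero_grad()
        d_loss.backward()
        d_opt.step()
    g_opt.zero_grad()
    fake = G(torch.randn(real.shape[0], 16))
    g_loss = bce(D_unrolled(fake), torch.ones(real.shape[0], 1))  # fool the unrolled copy
    g_loss.backward()
    g_opt.step()
    return g_loss.item()

print(unrolled_generator_step(torch.randn(32, 2)))
```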
Conference Paper
There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net .
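For illustration, here is a minimal two-level sketch of the contracting/expanding architecture with skip connections; real configurations use more levels, more channels, and task-specific heads, and all sizes here are illustrative.

```python
# A minimal two-level U-Net sketch: contracting path, bottleneck, and expanding
# path with skip connections concatenating encoder and decoder features.
import torch
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    def __init__(self, in_ch: int = 1, num_classes: int = 2, base: int = 32):
        super().__init__()
        self.enc1 = conv_block(in_ch, base)
        self.enc2 = conv_block(base, base * 2)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(base * 2, base * 4)
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, 2, stride=2)
        self.dec2 = conv_block(base * 4, base * 2)
        self.up1 = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec1 = conv_block(base * 2, base)
        self.head = nn.Conv2d(base, num_classes, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))   # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection
        return self.head(d1)

print(TinyUNet()(torch.randn(1, 1, 128, 128)).shape)  # torch.Size([1, 2, 128, 128])
```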
Article
This work explores conditional image generation with a new image density model based on the PixelCNN architecture. The model can be conditioned on any vector, including descriptive labels or tags, or latent embeddings created by other networks. When conditioned on class labels from the ImageNet database, the model is able to generate diverse, realistic scenes representing distinct animals, objects, landscapes and structures. When conditioned on an embedding produced by a convolutional network given a single image of an unseen face, it generates a variety of new portraits of the same person with different facial expressions, poses and lighting conditions. We also show that conditional PixelCNN can serve as a powerful decoder in an image autoencoder. Additionally, the gated convolutional layers in the proposed model improve the log-likelihood of PixelCNN to match the state-of-the-art performance of PixelRNN on ImageNet, with greatly reduced computational cost.
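For illustration, the sketch below shows the two ingredients in simplified form: a masked convolution that only sees already-generated pixels, and conditioning by adding a projection of a conditioning vector to the feature maps. It omits the gated activations and the class names are illustrative.

```python
# A minimal sketch of a masked convolution plus vector conditioning.
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    def __init__(self, mask_type: str, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ("A", "B")
        kH, kW = self.kernel_size
        mask = torch.ones(kH, kW)
        mask[kH // 2, kW // 2 + (mask_type == "B"):] = 0   # right of centre (and centre itself for type A)
        mask[kH // 2 + 1:, :] = 0                          # rows below the centre
        self.register_buffer("mask", mask[None, None])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return nn.functional.conv2d(x, self.weight * self.mask, self.bias,
                                    self.stride, self.padding, self.dilation, self.groups)

class ConditionalPixelLayer(nn.Module):
    def __init__(self, channels: int, cond_dim: int):
        super().__init__()
        self.conv = MaskedConv2d("B", channels, channels, kernel_size=3, padding=1)
        self.cond_proj = nn.Linear(cond_dim, channels)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # add the projected conditioning vector as a per-channel bias
        return torch.relu(self.conv(x) + self.cond_proj(cond)[:, :, None, None])

layer = ConditionalPixelLayer(channels=64, cond_dim=10)
feats = torch.randn(2, 64, 28, 28)
class_embedding = torch.randn(2, 10)
print(layer(feats, class_embedding).shape)  # torch.Size([2, 64, 28, 28])
```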
Article
Unsupervised learning of probabilistic models is a central yet challenging problem in machine learning. Specifically, designing models with tractable learning, sampling, inference and evaluation is crucial in solving this task. We extend the space of such models using real-valued non-volume preserving (real NVP) transformations, a set of powerful invertible and learnable transformations, resulting in an unsupervised learning algorithm with exact log-likelihood computation, exact sampling, exact inference of latent variables, and an interpretable latent space. We demonstrate its ability to model natural images on four datasets through sampling, log-likelihood evaluation and latent variable manipulations.
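For illustration, here is a minimal sketch of an affine coupling layer on flat vectors: half of the dimensions pass through unchanged and parameterize a scale-and-shift of the other half, giving an easily invertible map with an exact log-determinant. The tanh on the log-scale and the layer sizes are illustrative choices.

```python
# A minimal sketch of an affine coupling layer with exact inverse and log-det.
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, x: torch.Tensor):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(x1).chunk(2, dim=1)
        log_s = torch.tanh(log_s)                 # keep scales well-behaved
        y2 = x2 * torch.exp(log_s) + t
        log_det = log_s.sum(dim=1)                # exact log|det J|
        return torch.cat([x1, y2], dim=1), log_det

    def inverse(self, y: torch.Tensor) -> torch.Tensor:
        y1, y2 = y[:, :self.half], y[:, self.half:]
        log_s, t = self.net(y1).chunk(2, dim=1)
        log_s = torch.tanh(log_s)
        return torch.cat([y1, (y2 - t) * torch.exp(-log_s)], dim=1)

layer = AffineCoupling(dim=6)
x = torch.randn(4, 6)
y, log_det = layer(x)
print(torch.allclose(layer.inverse(y), x, atol=1e-5), log_det.shape)
```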
Article
Modeling the distribution of natural images is a landmark problem in unsupervised learning. This task requires an image model that is at once expressive, tractable and scalable. We present a deep neural network that sequentially predicts the pixels in an image along the two spatial dimensions. Our method models the discrete probability of the raw pixel values and encodes the complete set of dependencies in the image. Architectural novelties include fast two-dimensional recurrent layers and an effective use of residual connections in deep recurrent networks. We achieve log-likelihood scores on natural images that are considerably better than the previous state of the art. Our main results also provide benchmarks on the diverse ImageNet dataset. Samples generated from the model appear crisp, varied and globally coherent.
Article
Image-generating machine learning models are typically trained with loss functions based on distance in the image space. This often leads to over-smoothed results. We propose a class of loss functions, which we call deep perceptual similarity metrics (DeePSiM), that mitigate this problem. Instead of computing distances in the image space, we compute distances between image features extracted by deep neural networks. This metric better reflects the perceptual similarity of images and thus leads to better results. We show three applications: autoencoder training, a modification of a variational autoencoder, and inversion of deep convolutional networks. In all cases, the generated images look sharp and resemble natural images.
Article
A central problem in machine learning involves modeling complex data-sets using highly flexible families of probability distributions in which learning, sampling, inference, and evaluation are still analytically or computationally tractable. Here, we develop an approach that simultaneously achieves both flexibility and tractability. The essential idea, inspired by non-equilibrium statistical physics, is to systematically and slowly destroy structure in a data distribution through an iterative forward diffusion process. We then learn a reverse diffusion process that restores structure in data, yielding a highly flexible and tractable generative model of the data. This approach allows us to rapidly learn, sample from, and evaluate probabilities in deep generative models with thousands of layers or time steps.
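For illustration, the sketch below samples from the forward (noising) process in closed form under a simple linear variance schedule; the learned part of such models is the reverse process, which is not shown, and the schedule values are illustrative.

```python
# A minimal sketch of the forward diffusion process: with a variance schedule
# beta_1..beta_T, a noisy sample at step t can be drawn directly as
# x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)              # a simple linear schedule
alpha_bars = torch.cumprod(1.0 - betas, dim=0)     # cumulative products alpha_bar_t

def q_sample(x0: torch.Tensor, t: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    a_bar = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

x0 = torch.randn(8, 3, 32, 32)                     # stand-in for clean data
t = torch.randint(0, T, (8,))                      # a random diffusion step per sample
noisy = q_sample(x0, t, torch.randn_like(x0))
print(noisy.shape)  # torch.Size([8, 3, 32, 32])
```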
Article
Generative Adversarial Nets [8] were recently introduced as a novel way to train generative models. In this work we introduce the conditional version of generative adversarial nets, which can be constructed by simply feeding the data, y, we wish to condition on to both the generator and discriminator. We show that this model can generate MNIST digits conditioned on class labels. We also illustrate how this model could be used to learn a multi-modal model, and provide preliminary examples of an application to image tagging in which we demonstrate how this approach can generate descriptive tags which are not part of training labels.
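For illustration, here is a minimal sketch of the conditioning mechanism: a one-hot label y is concatenated to the generator's noise input and to the discriminator's data input, so both networks see the condition. Network sizes are illustrative and no training loop is shown.

```python
# A minimal sketch of conditioning both generator and discriminator on a label y.
import torch
import torch.nn as nn
import torch.nn.functional as F

NOISE_DIM, NUM_CLASSES, DATA_DIM = 100, 10, 784   # e.g. flattened 28x28 digits

G = nn.Sequential(nn.Linear(NOISE_DIM + NUM_CLASSES, 256), nn.ReLU(),
                  nn.Linear(256, DATA_DIM), nn.Tanh())
D = nn.Sequential(nn.Linear(DATA_DIM + NUM_CLASSES, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1))

z = torch.randn(16, NOISE_DIM)
y = F.one_hot(torch.randint(0, NUM_CLASSES, (16,)), NUM_CLASSES).float()

fake = G(torch.cat([z, y], dim=1))                # generate samples conditioned on y
score = D(torch.cat([fake, y], dim=1))            # judge samples together with y
print(fake.shape, score.shape)  # torch.Size([16, 784]) torch.Size([16, 1])
```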