Yang Zhang's research while affiliated with University of Illinois, Urbana-Champaign and other places

Publications (78)

Preprint
Although large language models (LLMs) have achieved great success in a vast range of real-world applications, their vulnerability to noisy inputs has significantly limited their use, especially in high-stakes environments. In these contexts, it is crucial to ensure that every prediction made by large language models is stable, i.e., LLM predictions sh...
Preprint
Full-text available
Despite the impressive performance recently achieved by automatic speech recognition (ASR), we observe two primary challenges that hinder its broader applications: (1) The difficulty of introducing scalability into the model to support more languages with limited training, inference, and storage overhead; (2) The low-resource adaptation ability tha...
Preprint
Full-text available
For effective human-robot interaction, robots need to understand, plan, and execute complex, long-horizon tasks described by natural language. The recent and remarkable advances in large language models (LLMs) have shown promise for translating natural language into robot action sequences for complex tasks. However, many existing approaches either...
Preprint
Full-text available
Temporal Logic (TL) can be used to rigorously specify complex high-level specifications for systems in many engineering applications. The translation between natural language (NL) and TL has been under-explored due to the lack of datasets and of models that generalize across different application domains. In this paper, we propose an accurate and generaliz...
Preprint
Diffusion-based models have achieved state-of-the-art performance on text-to-image synthesis tasks. However, one critical limitation of these models is the low fidelity of generated images with respect to the text description, such as missing objects, mismatched attributes, and mislocated objects. One key reason for such inconsistencies is the inac...
Preprint
Full-text available
Image inpainting refers to the task of generating a complete, natural image based on a partially revealed reference image. Recently, much research interest has focused on addressing this problem using fixed diffusion models. These approaches typically directly replace the revealed region of the intermediate or final generated images with tha...
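As a minimal sketch of the replace-and-denoise step that such fixed-diffusion inpainting pipelines typically perform at each reverse step: the denoiser, forward-diffusion sampler, and tensor layout below are assumed placeholders, not this paper's implementation.

```python
import torch

def inpaint_step(x_t, t, reference, mask, denoise_step, q_sample):
    """One reverse-diffusion step for inpainting with a fixed (frozen) model.

    x_t          : current noisy image estimate at timestep t
    reference    : partially revealed reference image (clean pixels where mask == 1)
    mask         : 1 where pixels are known, 0 where they must be generated
    denoise_step : frozen model's reverse step x_t -> x_{t-1}   (assumed given)
    q_sample     : forward-diffusion sampler q(x_{t-1} | x_0)   (assumed given)
    """
    # Denoise the whole image one step with the frozen diffusion model.
    x_prev_generated = denoise_step(x_t, t)

    # Re-noise the *known* region of the reference to the matching timestep,
    # so its noise statistics agree with the partially denoised sample.
    x_prev_known = q_sample(reference, t - 1)

    # Replace the revealed region with the (re-noised) reference pixels and
    # keep the model's prediction only where the image is missing.
    return mask * x_prev_known + (1.0 - mask) * x_prev_generated
```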
Preprint
Full-text available
We describe PromptBoosting, a query-efficient procedure for building a text classifier from a neural language model (LM) without access to the LM's parameters, gradients, or hidden representations. This form of "black-box" classifier training has become increasingly important as the cost of training and inference in large-scale LMs grows. But exist...
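As a rough illustration of query-only classifier building, the sketch below runs a toy AdaBoost loop over prompt-based weak learners. The `query_lm` callable and the binary +1/-1 label setup are assumptions for illustration; the actual PromptBoosting procedure additionally constructs verbalizers from the LM's output distribution.

```python
import numpy as np

def boost_prompts(prompts, query_lm, texts, labels, rounds=10):
    """Toy AdaBoost loop over prompt-based weak learners (labels in {-1, +1}).

    query_lm(prompt, text) -> +1 or -1 is assumed to be a black-box call that
    fills `text` into `prompt` and maps the LM's output tokens to a label,
    with no access to gradients or hidden states.
    """
    labels = np.asarray(labels)
    n = len(texts)
    weights = np.full(n, 1.0 / n)          # per-example weights
    ensemble = []                          # list of (alpha, prompt)

    for _ in range(rounds):
        # Pick the prompt whose predictions have the lowest weighted error.
        best_prompt, best_err, best_preds = None, float("inf"), None
        for p in prompts:
            preds = np.array([query_lm(p, x) for x in texts])
            err = weights[preds != labels].sum()
            if err < best_err:
                best_prompt, best_err, best_preds = p, err, preds

        # Standard AdaBoost weight update.
        eps = np.clip(best_err, 1e-8, 1 - 1e-8)
        alpha = 0.5 * np.log((1 - eps) / eps)
        weights *= np.exp(-alpha * labels * best_preds)
        weights /= weights.sum()
        ensemble.append((alpha, best_prompt))
    return ensemble
```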
Preprint
Full-text available
Robustness evaluation against adversarial examples has become increasingly important to unveil the trustworthiness of the prevailing deep models in natural language processing (NLP). However, in contrast to the computer vision domain where the first-order projected gradient descent (PGD) is used as the benchmark approach to generate adversarial exa...
Preprint
Generative models have been widely studied in computer vision. Recently, diffusion models have drawn substantial attention due to the high quality of their generated images. A key desired property of image generative models is the ability to disentangle different attributes, which should enable modification towards a style without changing the sema...
Preprint
Self-supervised learning (SSL) for rich speech representations has achieved empirical success in low-resource Automatic Speech Recognition (ASR) and other speech processing tasks, which mitigates the need for large amounts of transcribed speech and thus has driven a growing demand for on-device ASR and other speech processing. However, adva...
Preprint
In order to be effective partners for humans, robots must become increasingly comfortable with making contact with their environment. Unfortunately, it is hard for robots to distinguish between "just enough" and "too much" force: some force is required to accomplish the task but too much might damage equipment or injure humans. Traditional appr...
Preprint
Full-text available
Despite a surge of recent advances in promoting machine learning (ML) fairness, the existing mainstream approaches mostly require training or fine-tuning the entire weights of the neural network to meet the fairness criteria. However, this is often infeasible in practice for large-scale trained models due to large computational and storage cos...
Article
Activation maximization (AM) refers to the task of generating input examples that maximize the activation of a target class of a classifier, which can be used for class-conditional image generation and model interpretation. A popular class of AM methods, GAN-based AM, introduces a GAN pre-trained on a large image set, and performs AM over its input...
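For context, a minimal sketch of plain (non-GAN) activation maximization by gradient ascent on the input: the classifier, image shape, and optimizer settings are assumed placeholders, and the paper's GAN-based variant optimizes a pre-trained generator's latent code instead.

```python
import torch

def activation_maximization(classifier, target_class, shape=(1, 3, 224, 224),
                            steps=200, lr=0.05):
    """Plain gradient-ascent AM: optimize an input image so that the
    classifier's logit for `target_class` is maximized.

    `classifier` is any differentiable model mapping images to logits.
    """
    x = torch.randn(shape, requires_grad=True)      # start from noise
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        logit = classifier(x)[0, target_class]
        (-logit).backward()                          # ascend the target logit
        opt.step()
    return x.detach()
```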
Preprint
Pre-training serves as a broadly adopted starting point for transfer learning on various downstream tasks. Recent investigations of the lottery ticket hypothesis (LTH) demonstrate that such enormous pre-trained models can be replaced by extremely sparse subnetworks (a.k.a. matching subnetworks) without sacrificing transferability. However, practical securi...
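As a rough sketch of how such matching subnetworks are usually identified, the helper below computes a one-shot global magnitude-pruning mask. Lottery-ticket studies typically apply this iteratively with weight rewinding and retraining; the function name and threshold choice here are assumptions rather than this paper's procedure.

```python
import torch

def magnitude_mask(model, sparsity=0.9):
    """Global magnitude pruning: keep only the largest-magnitude weights.

    Returns a {parameter_name: binary_mask} dict covering weight matrices
    (parameters with dim > 1); biases and norms are left dense.
    """
    all_weights = torch.cat([p.detach().abs().flatten()
                             for p in model.parameters() if p.dim() > 1])
    k = max(1, int(sparsity * all_weights.numel()))
    threshold = all_weights.kthvalue(k).values       # value below which we prune
    return {name: (p.detach().abs() > threshold).float()
            for name, p in model.named_parameters() if p.dim() > 1}
```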
Article
Full-text available
A language-independent automatic speech recognizer (ASR) is one that can be used for phonetic transcription in languages other than the languages in which it was trained. Language-independent ASR is difficult to train, because different languages implement phones differently: even when phonemes in two different languages are written using the same...
Preprint
We propose DiffCSE, an unsupervised contrastive learning framework for learning sentence embeddings. DiffCSE learns sentence embeddings that are sensitive to the difference between the original sentence and an edited sentence, where the edited sentence is obtained by stochastically masking out the original sentence and then sampling from a masked l...
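A small sketch of how the "edited" sentence can be constructed, assuming a Hugging Face masked language model (`distilbert-base-uncased` is only a placeholder); the ELECTRA-style conditional discriminator that DiffCSE trains on top of the sentence embedding is omitted.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

def make_edited_sentence(sentence, mask_ratio=0.15,
                         model_name="distilbert-base-uncased"):
    """Randomly mask part of the input, then sample replacements from a masked
    language model. The sentence encoder is later trained so its embedding can
    tell the original and edited versions apart."""
    tok = AutoTokenizer.from_pretrained(model_name)
    mlm = AutoModelForMaskedLM.from_pretrained(model_name)

    ids = tok(sentence, return_tensors="pt")["input_ids"][0]
    # Choose positions to mask, skipping the special [CLS]/[SEP] tokens.
    candidates = torch.arange(1, ids.numel() - 1)
    n_mask = max(1, int(mask_ratio * candidates.numel()))
    masked_pos = candidates[torch.randperm(candidates.numel())[:n_mask]]

    corrupted = ids.clone()
    corrupted[masked_pos] = tok.mask_token_id

    with torch.no_grad():
        logits = mlm(corrupted.unsqueeze(0)).logits[0]
    # Sample one replacement token for each masked position.
    sampled = torch.multinomial(logits[masked_pos].softmax(-1), 1).squeeze(-1)
    edited = ids.clone()
    edited[masked_pos] = sampled
    return tok.decode(edited[1:-1])
```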
Preprint
Self-supervised learning (SSL) in speech involves training a speech representation network on a large-scale unannotated speech corpus, and then applying the learned representations to downstream tasks. Since the majority of the downstream tasks of SSL in speech largely focus on the content information in speech, the most desirable speech represe...
Preprint
Full-text available
An unsupervised text-to-speech synthesis (TTS) system learns to generate the speech waveform corresponding to any written sentence in a language by observing: 1) a collection of untranscribed speech waveforms in that language; 2) a collection of texts written in that language without access to any transcribed speech. Developing such a system can si...
Preprint
Full-text available
Large-scale auto-regressive language models pretrained on massive text have demonstrated their impressive ability to perform new natural language tasks with only a few text examples, without the need for fine-tuning. Recent studies further show that such a few-shot learning ability can be extended to the text-image setting by training an encoder to...
Preprint
Full-text available
SpeechSplit can perform aspect-specific voice conversion by disentangling speech into content, rhythm, pitch, and timbre using multiple autoencoders in an unsupervised manner. However, SpeechSplit requires careful tuning of the autoencoder bottlenecks, which can be time-consuming and less robust. This paper proposes SpeechSplit 2.0, which constrain...
Preprint
Full-text available
The study of language emergence aims to understand how human languages are shaped by perceptual grounding and communicative intent. Computational approaches to emergent communication (EC) predominantly consider referential games in limited domains and analyze the learned protocol within the game framework. As a result, it remains unclear how the em...
Preprint
We study the problem of aligning the supports of distributions. Compared to the existing work on distribution alignment, support alignment does not require the densities to be matched. We propose symmetric support difference as a divergence measure to quantify the mismatch between supports. We show that select discriminators (e.g. discriminator tra...
Preprint
Full-text available
Selective rationalization explains the prediction of complex neural networks by finding a small subset of the input that is sufficient to predict the neural model output. The selection mechanism is commonly integrated into the model itself by specifying a two-component cascaded system consisting of a rationale generator, which makes a binary select...
Preprint
Deep Neural Networks (DNNs) are known to be vulnerable to adversarial attacks, i.e., an imperceptible perturbation to the input can mislead DNNs trained on clean images into making erroneous predictions. To tackle this, adversarial training is currently the most effective defense method; it augments the training set with adversarial samples gener...
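For reference, a minimal sketch of the standard L-infinity PGD attack commonly used to generate the adversarial samples for such training; the epsilon, step size, and step count are conventional image-domain defaults, not values from this paper.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    """L-infinity PGD: take signed-gradient steps on the input and project back
    into the eps-ball around the clean image. Adversarial training then trains
    on (x_adv, y) instead of (x, y)."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()              # ascend the loss
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)     # project to eps-ball
        x_adv = x_adv.clamp(0, 1)                                 # stay a valid image
    return x_adv.detach()
```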
Preprint
Are end-to-end text-to-speech (TTS) models over-parametrized? To what extent can these models be pruned, and what happens to their synthesis capabilities? This work serves as a starting point to explore pruning both spectrogram prediction networks and vocoders. We thoroughly investigate the tradeoffs between sparsity and its subsequent effects on s...
Preprint
Prosody plays an important role in characterizing the style of a speaker or an emotion, but most non-parallel voice or emotion style transfer algorithms do not convert any prosody information. Two major components of prosody are pitch and rhythm. Disentangling the prosody information, particularly the rhythm component, from the speech is challengin...
Preprint
While maximizing deep neural networks' (DNNs') acceleration efficiency requires a joint search/design of three different yet highly coupled aspects, including the networks, bitwidths, and accelerators, the challenges associated with such a joint search have not yet been fully understood and addressed. The key challenges include (1) the dilemma of w...
Preprint
Full-text available
Recent work on speech self-supervised learning (speech SSL) demonstrated the benefits of scale in learning rich and transferable representations for Automatic Speech Recognition (ASR) with limited parallel data. It is then natural to investigate the existence of sparse and transferable subnetworks in pre-trained speech SSL models that can achieve...
Article
Selection of input features such as relevant pieces of text has become a common technique of highlighting how complex neural predictors operate. The selection can be optimized post-hoc for trained models or incorporated directly into the method itself (self-explaining). How...
Preprint
Full-text available
The computer vision world has been regaining enthusiasm for various pre-trained models, including both classical ImageNet supervised pre-training and recently emerged self-supervised pre-training such as SimCLR and MoCo. Pre-trained weights often boost a wide range of downstream tasks including classification, detection, and segmentation. Latest st...
Preprint
Full-text available
Contemporary speech enhancement predominantly relies on audio transforms that are trained to reconstruct a clean speech waveform. Here we investigate whether deep feature representations learned for audio classification tasks can be used to improve denoising. We first trained deep neural networks to classify either spoken words or environmental sou...
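A compact sketch of the deep-feature (perceptual) loss idea the abstract describes, i.e., matching a pre-trained audio classifier's activations on denoised and clean speech; the `feature_net` wrapper and the chosen layer indices are hypothetical.

```python
import torch
import torch.nn.functional as F

def deep_feature_loss(feature_net, denoised, clean, layers=(2, 5, 8)):
    """Deep-feature loss for speech enhancement: instead of matching waveforms
    directly, match the activations that a frozen audio *classifier* computes
    on the denoised and clean signals.

    `feature_net(x, return_layers=...)` is a hypothetical wrapper returning a
    list of intermediate activations from the classification network.
    """
    with torch.no_grad():
        ref_feats = feature_net(clean, return_layers=layers)
    out_feats = feature_net(denoised, return_layers=layers)
    # Average the L1 feature distances across the chosen layers.
    return sum(F.l1_loss(o, r) for o, r in zip(out_feats, ref_feats)) / len(layers)
```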
Article
We focus on the task of generating sound from natural videos, where the sound should be both temporally and content-wise aligned with visual signals. This task is extremely challenging because some sounds generated outside a camera cannot be inferred from video content. The model may be forced to learn an incorrect mapping between visual content a...
Preprint
Full-text available
In natural language processing (NLP), enormous pre-trained models like BERT have become the standard starting point for training on a range of downstream tasks, and similar trends are emerging in other areas of deep learning. In parallel, work on the lottery ticket hypothesis has shown that models for NLP and computer vision contain smaller matchin...
Preprint
We focus on the task of generating sound from natural videos, where the sound should be both temporally and content-wise aligned with visual signals. This task is extremely challenging because some sounds generated outside a camera cannot be inferred from video content. The model may be forced to learn an incorrect mapping between visual cont...
Preprint
Full-text available
Speech information can be roughly decomposed into four components: language content, timbre, pitch, and rhythm. Obtaining disentangled representations of these components is useful in many speech analysis and generation applications. Recently, state-of-the-art voice conversion systems have led to speech representations that can disentangle speaker-...
Preprint
Selective rationalization improves neural network interpretability by identifying a small subset of input features -- the rationale -- that best explains or supports the prediction. A typical rationalization criterion, i.e. maximum mutual information (MMI), finds the rationale that maximizes the prediction performance based only on the rationale. H...
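A minimal sketch of the generator-predictor cascade underlying the MMI criterion, where a per-token selector feeds only the chosen rationale to the classifier; the embedding, pooling, and straight-through relaxation choices are simplifications, not the paper's exact model.

```python
import torch
import torch.nn as nn

class MMIRationalizer(nn.Module):
    """Generator scores each token, a (relaxed) binary mask selects a small
    rationale, and the predictor must classify from the rationale alone."""
    def __init__(self, vocab_size, emb_dim=128, num_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.generator = nn.Linear(emb_dim, 1)            # per-token selection score
        self.predictor = nn.Linear(emb_dim, num_classes)  # classifies the rationale

    def forward(self, token_ids, sparsity_weight=0.01):
        h = self.emb(token_ids)                                  # [B, T, D]
        probs = torch.sigmoid(self.generator(h)).squeeze(-1)     # [B, T]
        mask = (probs > 0.5).float() + probs - probs.detach()    # straight-through
        rationale = h * mask.unsqueeze(-1)                       # zero out unselected tokens
        logits = self.predictor(rationale.mean(dim=1))           # pool and classify
        sparsity_penalty = sparsity_weight * mask.mean()         # keep rationales short
        return logits, mask, sparsity_penalty
```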
Article
Selective rationalization has become a common mechanism to ensure that predictive models reveal how they use any available features. The selection may be soft or hard, and identifies a subset of input features relevant for prediction. The setup can be viewed as a cooperative game between the selector (aka rationale generator) and the predictor making...
Preprint
Selective rationalization has become a common mechanism to ensure that predictive models reveal how they use any available features. The selection may be soft or hard, and identifies a subset of input features relevant for prediction. The setup can be viewed as a cooperative game between the selector (aka rationale generator) and the predictor makin...
Preprint
Selection of input features such as relevant pieces of text has become a common technique of highlighting how complex neural predictors operate. The selection can be optimized post-hoc for trained models or incorporated directly into the method itself (self-explaining). However, an overall selection does not properly capture the multi-faceted natur...
Preprint
There are two major paradigms of white-box adversarial attacks that attempt to impose input perturbations. The first paradigm, called the fix-perturbation attack, crafts adversarial samples within a given perturbation level. The second paradigm, called the zero-confidence attack, finds the smallest perturbation needed to cause misclassification, a...
Preprint
Full-text available
Non-parallel many-to-many voice conversion, as well as zero-shot voice conversion, remain under-explored areas. Deep style transfer algorithms, such as generative adversarial networks (GAN) and conditional variational autoencoder (CVAE), are being applied as new solutions in this field. However, GAN training is sophisticated and difficult, and ther...
Article
Full-text available
Patient portals to Electronic Health Record (EHR) systems are underused by older adults because of limited system usability and usefulness, including difficulty understanding numeric information. We investigated whether enhanced context for portal messages about test results improved responses to these messages, comparing verbally, graphically, and...
Article
Selection of input features such as relevant pieces of text has become a common technique of highlighting how complex neural predictors operate. The selection can be optimized post-hoc for trained models or incorporated directly into the method itself (self-explaining). However, an overall selection does not properly capture the multi-faceted natur...
Article
Full-text available
Multi-channel speech enhancement with ad-hoc sensors has been a challenging task. Speech model guided beamforming algorithms are able to recover natural sounding speech, but the speech models tend to be oversimplified or the inference would otherwise be too complicated. On the other hand, deep learning based enhancement approaches are able to learn...
Article
Full-text available
Notoriously, learning with recurrent neural networks (RNNs) on long sequences is a difficult task. There are three major challenges: 1) extracting complex dependencies, 2) vanishing and exploding gradients, and 3) efficient parallelization. In this paper, we introduce a simple yet effective RNN connection structure, the DilatedRNN, which simultaneo...
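A small sketch of the dilated-recurrence idea, where each layer's cell reads the hidden state from `dilation` steps back rather than the previous step; the GRU cell and stacking details are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DilatedRNNLayer(nn.Module):
    """One dilated-recurrence layer: the cell at step t receives the hidden
    state from step t - dilation, so stacking layers with exponentially
    growing dilations covers long dependencies cheaply."""
    def __init__(self, input_size, hidden_size, dilation):
        super().__init__()
        self.cell = nn.GRUCell(input_size, hidden_size)
        self.dilation = dilation
        self.hidden_size = hidden_size

    def forward(self, x):                      # x: [T, B, input_size]
        T, B, _ = x.shape
        zeros = x.new_zeros(B, self.hidden_size)
        states = []
        for t in range(T):
            # Recurrent input comes from `dilation` steps ago (zeros before that).
            prev = states[t - self.dilation] if t >= self.dilation else zeros
            states.append(self.cell(x[t], prev))
        return torch.stack(states)             # [T, B, hidden_size]
```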
Article
Convolutional autoregressive models have recently demonstrated state-of-the-art performance on a number of generation tasks. While fast, parallel training methods have been crucial for their success, generation is typically implemented in a naïve fashion where redundant computations are unnecessarily repeated. This results in slow generation, m...
Conference Paper
The increasing popularity of real-world recommender systems produces data continuously and rapidly, and it becomes more realistic to study recommender systems under streaming scenarios. Data streams present distinct properties, such as being temporally ordered, continuous, and high-velocity, which pose tremendous challenges to traditional recommender syst...
Article
We describe a project intended to improve the use of Electronic Medical Record (EMR) patient portal information by older adults with diverse numeracy and literacy abilities, so that portals can better support patient-centered care. Patient portals are intended to bridge patients and providers by ensuring patients have continuous access to their hea...
Article
Full-text available
This paper presents an efficient implementation of the Wavenet generation process called Fast Wavenet. Compared to a naive implementation that has complexity O(2^L) (L denotes the number of layers in the network), our proposed approach removes redundant convolution operations by caching previous calculations, thereby reducing the complexity to O(L)...
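The caching idea can be sketched with one queue per layer holding the activations from `dilation` steps ago, so each new sample touches every layer exactly once; the `layers[i](current, recurrent)` callables stand in for 2-tap dilated convolutions and are assumed, not the paper's code.

```python
from collections import deque

def fast_generate(layers, dilations, first_input, num_samples):
    """Fast WaveNet-style generation: each layer keeps a queue of its recent
    inputs so the dilated convolution reuses cached activations instead of
    recomputing the whole receptive field, giving O(L) work per sample."""
    queues = [deque([0.0] * d, maxlen=d) for d in dilations]
    output, current = [], first_input
    for _ in range(num_samples):
        x = current
        for layer, q in zip(layers, queues):
            recurrent = q.popleft()        # this layer's input from `dilation` steps ago
            out = layer(x, recurrent)      # 2-tap dilated conv on (new, cached) inputs
            q.append(x)                    # cache the new input for later reuse
            x = out
        output.append(x)
        current = x                        # autoregressive: feed the sample back in
    return output
```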
Conference Paper
Data for many problems in real-world systems, such as link prediction and one-class recommendation, share common characteristics. First, the data are in the form of positive-unlabeled (PU) measurements (e.g. Twitter "following", Facebook "like", etc.) that do not provide negative information, and they can be naturally represented as networks. Second, in the...
Article
Full-text available
The increasing popularity of real-world recommender systems produces data continuously and rapidly, and it becomes more realistic to study recommender systems under streaming scenarios. Data streams present distinct properties, such as being temporally ordered, continuous, and high-velocity, which pose tremendous challenges to traditional recommender syst...
Conference Paper
Full-text available
The Probabilistic Acoustic Tube (PAT) model is a probabilistic generative model of speech. By associating every generative parameter with a probability distribution, it becomes possible to convert every standard speech analysis task into a probabilistic inference task, thereby grounding every such task with quantifiable measures of bias and consist...
Article
Full-text available
We are interested in a multichannel transient acoustic signal classification task that suffers from additive/convolutive noise corruption. To address this problem, we propose a double-scheme classifier that takes advantage of multichannel data to improve noise robustness. Both schemes adopt task-driven dictionary learning as the basic frame...
Conference Paper
Full-text available
Current model-based speech analysis tends to be incomplete: only a subset of the parameters of interest (e.g. only the pitch or the vocal tract) is modeled, while the rest, which may be just as important, are disregarded. The drawback is that without joint modeling of correlated parameters, the analysis of speech parameters may be inaccurate or eve...
Conference Paper
Most speech analysis/synthesis systems are based on the basic physical model of speech production - the acoustic tube model. There are two main drawbacks with current speech analysis methods. First, a common design paradigm seems to build a special-purpose signal-processing front-end followed by (when appropriate) a back-end based on probabilistic...

Citations

... Recently, text-to-image (T2I) diffusion models [32,36,37,40,42] have shown promise in enabling an intuitive interface for users to control image generation, using natural language descriptions. However, gaining granular control and customized image generation has proven difficult with natural language descriptions alone [11,26,27,48]. Addressing this difficulty has started a line of research for inverting desired concepts in these large models and better tuning them for customized content creation. ...
... Learning a speech recognizer with only unpaired speech and text corpora, or unsupervised speech recognition (ASR-U) [1], is a self-supervised learning task crucial for developing speech technology for low-resource languages. Beyond converting speech to text without reliance on transcribed speech, an ASR-U system can serve as the linchpin for low-resource text-to-speech synthesis [2,3], speech translation [4,5] and other spoken language understanding tasks. Despite significant strides made in the domain [6,7,8,9,10], the stability of ASR-U systems remains a conspicuous bottleneck [11,2]. ...
... They also noted the benefit of using left and right context, as opposed to models learning only from past values. Finally, attempts have been made to augment large language models with speech capabilities, although more research is needed to achieve competitive performance [14]. Despite considerable improvement in speech modeling in recent years, speech representation models do not show the same ability as text-based language models to efficiently store semantic information, and a considerable amount of fine-tuning is necessary to achieve decent performance [10]. ...
... SimCSE [14] incorporates annotated pairs from natural language inference datasets into its contrastive learning framework, using entailment pairs as positives and contradiction pairs as hard negatives. DiffCSE [21] represents an instance of equivariant contrastive learning that generalizes contrastive learning and learns representations that are insensitive to certain types of augmentations and sensitive to other "harmful" types of augmentations. In addition, PromptBERT [17] uses a prompt-based contrastive learning method with template denoising to leverage the power of BERT in unsupervised settings, significantly reducing the gap between supervised and unsupervised performance. ...
... The activation maximization method aims to maximize the activation of specific neurons in neural networks. The network weights and biases are iteratively tuned during standard training so that the neural network's error is minimized across training examples [103]. It can be categorized into three types: Learning Level of Abstraction, Learning Semantic Concepts, and Learning Distributed Codes. ...