December 2023 · 79 Reads · 62 Citations
August 2023 · 36 Reads · 149 Citations
Neurocomputing
June 2023 · 193 Reads · 2 Citations
Large pre-trained speech models are widely used as the de-facto paradigm, especially in scenarios where only a limited amount of labeled data is available. However, finetuning all parameters of the self-supervised model can be computationally expensive, and becomes infeasible as the size of the model and the number of downstream tasks scale. In this paper, we propose a novel approach called Two Parallel Adapter (TPA) that is inserted into the conformer-based pre-trained model instead. TPA is based on systematic studies of the residual adapter, a popular approach for finetuning a subset of parameters. We evaluate TPA on various public benchmarks, and the experimental results demonstrate its superior performance, which is close to full finetuning across different datasets and speech tasks. These results show that TPA is an effective and efficient approach for serving large pre-trained speech models. Ablation studies show that TPA can also be pruned, especially in the lower blocks.
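The abstract does not spell out where the two adapters attach inside each conformer block, so the following PyTorch sketch only illustrates the general idea: small bottleneck adapters trained in parallel with a frozen pre-trained block, with their outputs added residually. All class, argument, and dimension names are illustrative assumptions rather than the paper's implementation.

```python
# Minimal sketch of parallel bottleneck adapters on a frozen block
# (assumed structure, not the paper's exact TPA placement).
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project."""
    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.ReLU()

    def forward(self, x):
        return self.up(self.act(self.down(x)))

class BlockWithParallelAdapters(nn.Module):
    """Wraps a frozen pre-trained block with two trainable parallel adapters."""
    def __init__(self, frozen_block: nn.Module, d_model: int):
        super().__init__()
        self.block = frozen_block
        for p in self.block.parameters():
            p.requires_grad = False            # backbone stays frozen
        self.adapter_a = Adapter(d_model)      # trainable
        self.adapter_b = Adapter(d_model)      # trainable

    def forward(self, x):
        # Parallel adapters: their outputs are added to the block output
        # rather than being applied sequentially after it.
        return self.block(x) + self.adapter_a(x) + self.adapter_b(x)
```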
June 2023 · 35 Reads · 81 Citations
May 2023 · 745 Reads · 45 Citations
We introduce PaLM 2, a new state-of-the-art language model that has better multilingual and reasoning capabilities and is more compute-efficient than its predecessor PaLM. PaLM 2 is a Transformer-based model trained using a mixture of objectives. Through extensive evaluations on English and multilingual language, and reasoning tasks, we demonstrate that PaLM 2 has significantly improved quality on downstream tasks across different model sizes, while simultaneously exhibiting faster and more efficient inference compared to PaLM. This improved efficiency enables broader deployment while also allowing the model to respond faster, for a more natural pace of interaction. PaLM 2 demonstrates robust reasoning capabilities exemplified by large improvements over PaLM on BIG-Bench and other reasoning tasks. PaLM 2 exhibits stable performance on a suite of responsible AI evaluations, and enables inference-time control over toxicity without additional overhead or impact on other capabilities. Overall, PaLM 2 achieves state-of-the-art performance across a diverse set of tasks and capabilities.

When discussing the PaLM 2 family, it is important to distinguish between pre-trained models (of various sizes), fine-tuned variants of these models, and the user-facing products that use these models. In particular, user-facing products typically include additional pre- and post-processing steps. Additionally, the underlying models may evolve over time. Therefore, one should not expect the performance of user-facing products to exactly match the results reported in this report.
March 2023 · 57 Reads
Assessing the aesthetics of an image is challenging, as it is influenced by multiple factors including composition, color, style, and high-level semantics. Existing image aesthetic assessment (IAA) methods primarily rely on human-labeled rating scores, which oversimplify the visual aesthetic information that humans perceive. Conversely, user comments offer more comprehensive information and are a more natural way to express human opinions and preferences regarding image aesthetics. In light of this, we propose learning image aesthetics from user comments, and exploring vision-language pretraining methods to learn multimodal aesthetic representations. Specifically, we pretrain an image-text encoder-decoder model with image-comment pairs, using contrastive and generative objectives to learn rich and generic aesthetic semantics without human labels. To efficiently adapt the pretrained model for downstream IAA tasks, we further propose a lightweight rank-based adapter that employs text as an anchor to learn the aesthetic ranking concept. Our results show that our pretrained aesthetic vision-language model outperforms prior works on image aesthetic captioning over the AVA-Captions dataset, and it has powerful zero-shot capability for aesthetic tasks such as zero-shot style classification and zero-shot IAA, surpassing many supervised baselines. With only minimal finetuning parameters using the proposed adapter module, our model achieves state-of-the-art IAA performance over the AVA dataset.
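As a concrete illustration of how text can serve as an anchor for zero-shot aesthetic assessment, here is a minimal NumPy sketch that scores an image by comparing its embedding against embeddings of a "good photo" and a "bad photo" prompt from a pretrained vision-language model. The prompt wording, the softmax formulation, and the function names are assumptions for illustration, not the paper's exact adapter.

```python
# Hedged sketch of text-anchored zero-shot aesthetic scoring.
# The embeddings are assumed to come from a pretrained image-text model.
import numpy as np

def zero_shot_aesthetic_score(image_emb: np.ndarray,
                              good_emb: np.ndarray,
                              bad_emb: np.ndarray) -> float:
    """Softmax over cosine similarities to two text anchor prompts."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    sims = np.array([cos(image_emb, good_emb), cos(image_emb, bad_emb)])
    probs = np.exp(sims) / np.exp(sims).sum()
    return probs[0]  # probability mass on the "good photo" anchor
```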
March 2023 · 553 Reads · 14 Citations
We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages. This is achieved by pre-training the encoder of the model on a large unlabeled multilingual dataset of 12 million (M) hours spanning over 300 languages, and fine-tuning on a smaller labeled dataset. We use multilingual pre-training with random-projection quantization and speech-text modality matching to achieve state-of-the-art performance on downstream multilingual ASR and speech-to-text translation tasks. We also demonstrate that despite using a labeled training set 1/7-th the size of that used for the Whisper model, our model exhibits comparable or better performance on both in-domain and out-of-domain speech recognition tasks across many languages.
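Random-projection quantization can be illustrated in a few lines of NumPy: a frozen random matrix projects each speech frame, and the nearest entry in a frozen random codebook supplies a discrete target for BERT-style masked prediction. The dimensions and codebook size below are illustrative assumptions, not USM's actual configuration.

```python
# Sketch of random-projection quantization for self-supervised targets
# (BEST-RQ-style); all sizes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
feat_dim, proj_dim, codebook_size = 80, 16, 8192

# Both the projection and the codebook are random and kept frozen.
projection = rng.normal(size=(feat_dim, proj_dim))
codebook = rng.normal(size=(codebook_size, proj_dim))
codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)

def quantize(frames: np.ndarray) -> np.ndarray:
    """Map log-mel frames (T, feat_dim) to discrete target ids (T,)."""
    z = frames @ projection
    z /= np.linalg.norm(z, axis=1, keepdims=True)
    # Nearest codeword by cosine similarity -> pre-training label per frame.
    return np.argmax(z @ codebook.T, axis=1)

targets = quantize(rng.normal(size=(100, feat_dim)))  # e.g. 100 frames
```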
January 2023 · 4 Reads · 6 Citations
December 2022 · 54 Reads
We propose AnyTOD, an end-to-end task-oriented dialog (TOD) system with zero-shot capability for unseen tasks. We view TOD as a program executed by a language model (LM), where the program logic and ontology are provided by a designer in the form of a schema. To enable generalization to unseen schemas and programs without prior training, AnyTOD adopts a neuro-symbolic approach. A neural LM keeps track of events that occur during a conversation, and a symbolic program implementing the dialog policy is executed to recommend the next actions AnyTOD should take. This approach drastically reduces data annotation and model training requirements, addressing a long-standing challenge in TOD research: rapidly adapting a TOD system to unseen tasks and domains. We demonstrate state-of-the-art results on the STAR and ABCD benchmarks, as well as AnyTOD's strong zero-shot transfer capability in low-resource settings. In addition, we release STARv2, an updated version of the STAR dataset with richer data annotations, for benchmarking zero-shot end-to-end TOD models.
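To make the schema-as-program idea concrete, here is a toy Python sketch: a neural LM would populate the tracked state (slot values, intents), and a small symbolic policy program defined over the schema recommends the next system action. The schema fields and action names are invented for illustration and are not taken from AnyTOD or the STAR/STARv2 schemas.

```python
# Toy illustration of a schema-driven symbolic dialog policy.
# The tracked_state dict stands in for what a neural LM would predict.
from typing import Dict, List

schema = {
    "required_slots": ["cuisine", "location", "party_size"],
    "actions": {"request": "REQUEST_SLOT", "book": "BOOK_RESTAURANT"},
}

def recommend_next_action(tracked_state: Dict[str, str]) -> List[str]:
    """Symbolic policy: ask for missing slots, otherwise book."""
    missing = [s for s in schema["required_slots"] if s not in tracked_state]
    if missing:
        return [f'{schema["actions"]["request"]}({missing[0]})']
    return [schema["actions"]["book"]]

print(recommend_next_action({"cuisine": "thai", "location": "SF"}))
# -> ['REQUEST_SLOT(party_size)']
print(recommend_next_action({"cuisine": "thai", "location": "SF",
                             "party_size": "2"}))
# -> ['BOOK_RESTAURANT']
```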
December 2022 · 114 Reads · 1 Citation
This work explores an efficient approach to establishing a foundational video-text model for tasks including open-vocabulary video classification, text-to-video retrieval, video captioning, and video question-answering. We present VideoCoCa, which reuses a pretrained image-text contrastive captioner (CoCa) model and adapts it to video-text tasks with minimal extra training. While previous works adapt image-text models with various cross-frame fusion modules (for example, a cross-frame attention layer or perceiver resampler) and finetune the modified architecture on video-text data, we surprisingly find that the generative attentional pooling and contrastive attentional pooling layers in the image-text CoCa design are instantly adaptable to "flattened frame embeddings", yielding a strong zero-shot transfer baseline for many video-text tasks. Specifically, the frozen image encoder of a pretrained image-text CoCa takes each video frame as input and generates N token embeddings per frame, for a total of T video frames. We flatten these token embeddings into a long sequence of frozen video representations and apply CoCa's generative attentional pooling and contrastive attentional pooling on top. All model weights, including the pooling layers, are loaded directly from the image-text CoCa pretrained model. Without any video or video-text data, VideoCoCa's zero-shot transfer baseline already achieves state-of-the-art results on zero-shot video classification on Kinetics 400/600/700, UCF101, HMDB51, and Charades, as well as zero-shot text-to-video retrieval on MSR-VTT and ActivityNet Captions. We also explore lightweight finetuning on top of VideoCoCa, and achieve strong results on video question-answering (iVQA, MSRVTT-QA, MSVD-QA) and video captioning (MSR-VTT, ActivityNet, Youcook2). Our approach establishes a simple and effective video-text baseline for future research.
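A minimal PyTorch sketch of the flattening step and an attentional pooler is given below; the pooler is a simplified stand-in for CoCa's generative and contrastive pooling layers, and all shapes and hyperparameters are illustrative assumptions.

```python
# Sketch of "flattened frame embeddings" fed to attentional poolers.
# Dimensions and the pooler itself are simplified assumptions.
import torch
import torch.nn as nn

class AttentionalPooler(nn.Module):
    """Learned queries cross-attend over a (long) token sequence."""
    def __init__(self, d_model: int, n_queries: int, n_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, d_model))
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, d_model)
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)
        return pooled

B, T, N, D = 2, 16, 256, 768            # batch, frames, tokens/frame, dim
frame_tokens = torch.randn(B, T, N, D)  # from a frozen image encoder
flattened = frame_tokens.reshape(B, T * N, D)

contrastive_pooler = AttentionalPooler(D, n_queries=1)    # one video embedding
generative_pooler = AttentionalPooler(D, n_queries=256)   # tokens for a decoder

video_embedding = contrastive_pooler(flattened)  # (B, 1, D)
decoder_tokens = generative_pooler(flattened)    # (B, 256, D)
```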
... The advancement of large language models (LLMs) in natural language processing (NLP) has significantly impacted speech processing research [1][2][3][4][5], particularly in developing instruction-based speech models [6][7][8][9][10][11]. Unlike conventional task-specific models trained for fixed tasks, instruction-based models use user prompts to perform various tasks, offering greater flexibility. A key goal is to achieve emergent capabilities for handling unseen tasks effectively, as done in NLP [12]. ...
December 2023
... Decision scenarios naturally occur in any task-oriented dialogue, such as flight booking, restaurant selection, or trip planning. Task-oriented dialogue is traditionally implemented with task-oriented dialogue systems (TODs). In recent times, TODs have evolved with advancements like ANYTOD [36] and TOD-BERT [32], enabling zero-shot learning and improved dialogue management. ...
January 2023
... PickScore (Kirstain et al. 2024) leverages user preferences to predict the appeal of generated images, combining CLIP model elements with InstructGPT's reward model objectives (Ouyang et al. 2022) for a nuanced understanding of user satisfaction. Alongside, the aesthetic score (Ke et al. 2023) assesses images based on aesthetics learned from image-comment pairs, providing a richer evaluation that includes composition, color, and style. ...
June 2023
... Early works such as CLIP [72] and ALIGN [42] leveraged contrastive learning techniques to align images and text into a joint representation space, facilitating zero-shot transfer via language prompts. Building on these foundations, subsequent research has focused on improving vision-language models through enhanced training methodologies [19,29,107,112], as well as scaling models and datasets [107,53,18,83,15,22,84,30] with their zero-shot transfer capabilities [42,111,71,55]. In contrast, our work focuses specifically on target tasks with compact models, aiming to distill knowledge from these large VLMs effectively. ...
August 2023
Neurocomputing
... Chen et al. [16] present parallel adapters, but they employ a randomly initialized decoder without pre-trained parameters. We further investigate the impact of parallel adapters on adapter-based MTL for speech translation tasks. ...
June 2023
... Information about our prompts and the detailed results of our study are available via our OSF link. ...
May 2023
... The study of CS in ASR has been a significant focus for scholars. Encoder-decoder attention-based ASR has been transformative in the field, providing impressive results in multilingual ASR systems such as Whisper [13], XLS-R [14], and USM [15]. However, these systems require significant data for training, and their ability to manage CS is not fully clear. ...
March 2023
... Morioka et al. [86] proposed a parameter-efficient few-shot speaker adaptation method. It uses a PnG NAT model [62], [87] as its backbone and adds trainable lightweight modules called residual adapters for each target speaker while keeping the backbone architecture frozen. Similarly, Hsieh et al. [88] proposed parameter-efficient adapter modules, using FastPitch [89] as a baseline TTS model and fine-tuning only those adapters for a target speaker during speaker adaptation. ...
Reference: Voice Cloning: Comprehensive Survey · August 2021
... In the domain of video analysis, foundation models are also emerging as a promising tool. [30][31][32] One example, which has been applied to animal behavior analysis, is VideoPrism, 33 a video foundation model trained on a massive dataset of approximately 600 million videos. VideoPrism is a video encoder that maps input videos to semantically meaningful representations, which then empower a range of downstream tasks in computer vision, including captioning, video question answering, and video-text retrieval. ...
December 2022
... Moreover, it also exhibits remarkable robustness under multiple natural distribution shifts. After this initial success, subsequent works propose further improvements to the CLIP framework, including scaling up tasks (Pham et al., 2021; Jia et al., 2021), using pre-trained visual encoders (Zhai et al., 2022), combining the sub-task of image captioning (Yu et al., 2022), or expanding to more data formats. As the performance of CLIP has a strong correlation with the datasets it is trained on (Fang et al., 2022), there are some efforts (Schuhmann et al., 2022; Thomee et al., 2016) to create plentiful and useful image-text pairs and make them open to the community. ...
Reference: Context-Aware Robust Fine-Tuning · November 2021