Ruoming Pang’s research while affiliated with Apple Inc. and other places


Publications (97)


Talking Turns: Benchmarking Audio Foundation Models on Turn-Taking Dynamics
  • Preprint

March 2025 · 8 Reads

Siddhant Arora · Zhiyun Lu · Chung-Cheng Chiu · [...] · Shinji Watanabe

The recent wave of audio foundation models (FMs) could provide new capabilities for conversational modeling. However, there have been limited efforts to evaluate these audio FMs comprehensively on their ability to have natural and interactive conversations. To engage in meaningful conversation with the end user, we would want the FMs to additionally perform a fluent succession of turns, without too much overlapping speech or long stretches of silence. Inspired by this, we ask whether recently proposed audio FMs can understand, predict, and perform turn-taking events. To answer this, we propose a novel evaluation protocol that assesses a spoken dialog system's turn-taking capabilities using, as a judge, a supervised model trained to predict turn-taking events in human-human conversations. Using this protocol, we present the first comprehensive user study evaluating existing spoken dialogue systems on their ability to perform turn-taking events, and we reveal many interesting insights: for example, they sometimes do not understand when to speak up, can interrupt too aggressively, and rarely backchannel. We further evaluate multiple open-source and proprietary audio FMs accessible through APIs on carefully curated test benchmarks from Switchboard, measuring their ability to understand and predict turn-taking events, and we identify significant room for improvement. We will open-source our evaluation platform to promote the development of advanced conversational AI systems.
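The core of the protocol is a supervised judge that labels what should happen at each candidate turn boundary, and a dialogue system is scored by how often its actual behavior matches those labels. A minimal Python sketch (the judge interface, event labels, and scoring loop are illustrative assumptions, not the paper's released code):

    # Hypothetical sketch of judge-based turn-taking evaluation.
    from dataclasses import dataclass

    EVENTS = ("take_turn", "backchannel", "keep_silence", "interrupt")

    class TurnTakingJudge:
        """Stand-in for a supervised model trained on human-human conversations."""
        def predict(self, audio_context: bytes) -> str:
            return "keep_silence"  # placeholder; a real judge returns one of EVENTS

    @dataclass
    class TurnWindow:
        audio_context: bytes  # dialogue audio up to the candidate turn boundary
        system_action: str    # what the evaluated dialogue system actually did

    def turn_taking_agreement(windows: list, judge: TurnTakingJudge) -> float:
        """Fraction of windows where the system's action matches the judge's label."""
        agree = sum(judge.predict(w.audio_context) == w.system_action for w in windows)
        return agree / max(len(windows), 1)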


Instruction-Following Pruning for Large Language Models

January 2025 · 4 Reads

With the rapid scaling of large language models (LLMs), structured pruning has become a widely used technique to learn efficient, smaller models from larger ones, delivering superior performance compared to training similarly sized models from scratch. In this paper, we move beyond the traditional static pruning approach of determining a fixed pruning mask for a model and propose a dynamic approach to structured pruning. In our method, the pruning mask is input-dependent and adapts dynamically based on the information described in a user instruction. Our approach, termed "instruction-following pruning", introduces a sparse mask predictor that takes the user instruction as input and dynamically selects the most relevant model parameters for the given task. To identify and activate effective parameters, we jointly optimize the sparse mask predictor and the LLM, leveraging both instruction-following data and the pre-training corpus. Experimental results demonstrate the effectiveness of our approach on a wide range of evaluation benchmarks. For example, our 3B activated model improves over the 3B dense model by 5-8 absolute points on domains such as math and coding, and rivals the performance of a 9B model.
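One way to picture the sparse mask predictor described above: a small network maps an instruction embedding to per-channel gates that switch feed-forward channels on or off for the whole request. The PyTorch sketch below is an illustrative assumption (module names, gating granularity, and the hard top-k mask are not taken from the paper):

    # Hypothetical sketch of an instruction-conditioned structured-pruning mask.
    import torch
    import torch.nn as nn

    class InstructionMaskPredictor(nn.Module):
        def __init__(self, instr_dim: int, n_channels: int, keep_ratio: float = 0.5):
            super().__init__()
            self.proj = nn.Linear(instr_dim, n_channels)
            self.keep = int(n_channels * keep_ratio)

        def forward(self, instr_emb: torch.Tensor) -> torch.Tensor:
            scores = self.proj(instr_emb)                             # (batch, n_channels)
            topk = scores.topk(self.keep, dim=-1).indices
            return torch.zeros_like(scores).scatter_(-1, topk, 1.0)  # hard 0/1 gates

    class GatedFFN(nn.Module):
        def __init__(self, d_model: int, d_ff: int):
            super().__init__()
            self.up = nn.Linear(d_model, d_ff)
            self.down = nn.Linear(d_ff, d_model)

        def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
            h = torch.relu(self.up(x)) * mask.unsqueeze(1)  # zero out pruned channels per request
            return self.down(h)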



Figures in this preprint (+6 more):
Figure 1: The upper figure questions whether training exclusively on direct-answer datasets can effectively teach CoT prediction. In the lower figure, generating CoT for prediction provides the additional benefit of reasoning alignment, allowing the model to improve by leveraging self-generated data.
Figure 4: Distillation of examples from various VLM task domains, highlighting the specific reasoning capabilities required.
Figure 5: The upper section displays the data sources used for the SFT experiments, while the lower section illustrates the data composition for model training.
Figure 8: Credit assignment of the DPO model on a portion of the responses from the ChartQA and AI2D datasets. The DPO token-level reward is computed for each token, with the rewards normalized to have a mean of 0. Negative scores are highlighted in cool colors (blue), while positive scores are highlighted in warm colors (orange). We observe that the DPO model is particularly sensitive to the first mistakes or hallucinations introduced in the response.
Figure A.1: GPT-4o system prompt for CoT distillation.

Improve Vision Language Model Chain-of-thought Reasoning
  • Preprint
  • File available

October 2024 · 54 Reads

Chain-of-thought (CoT) reasoning in vision language models (VLMs) is crucial for improving interpretability and trustworthiness. However, current training recipes lack robust CoT reasoning data, relying on datasets dominated by short annotations with minimal rationales. In this work, we show that training VLMs on short answers does not generalize well to reasoning tasks that require more detailed responses. To address this, we propose a two-fold approach. First, we distill rationales from the GPT-4o model to enrich the training data and fine-tune VLMs, boosting their CoT performance. Second, we apply reinforcement learning to further calibrate reasoning quality. Specifically, we construct positive (correct) and negative (incorrect) pairs of model-generated reasoning chains by comparing their predictions with annotated short answers. Using this pairwise data, we apply the Direct Preference Optimization algorithm to refine the model's reasoning abilities. Our experiments demonstrate significant improvements in CoT reasoning on benchmark datasets and better generalization to direct answer prediction as well. This work emphasizes the importance of incorporating detailed rationales in training and of leveraging reinforcement learning to strengthen the reasoning capabilities of VLMs.
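The preference-pair construction can be sketched directly: sample several reasoning chains per question, mark a chain correct if its final answer matches the annotated short answer, and pair correct chains with incorrect ones for DPO. A minimal sketch (the "Answer:" extraction convention and the sample format are assumptions for illustration):

    # Hypothetical sketch of building DPO pairs from self-generated CoT chains.
    def extract_answer(chain: str) -> str:
        """Pull the final short answer out of a chain, assuming an 'Answer:' suffix."""
        return chain.rsplit("Answer:", 1)[-1].strip().lower()

    def build_preference_pairs(samples):
        """samples: dicts with keys 'question', 'gold' (short answer), and 'chains'."""
        pairs = []
        for s in samples:
            gold = s["gold"].strip().lower()
            correct = [c for c in s["chains"] if extract_answer(c) == gold]
            wrong = [c for c in s["chains"] if extract_answer(c) != gold]
            pairs += [{"prompt": s["question"], "chosen": good, "rejected": bad}
                      for good in correct for bad in wrong]
        return pairs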


Step-by-Step Reasoning for Math Problems via Twisted Sequential Monte Carlo

October 2024 · 17 Reads

Augmenting the multi-step reasoning abilities of Large Language Models (LLMs) has been a persistent challenge. Recently, verification has shown promise in improving solution consistency by evaluating generated outputs. However, current verification approaches suffer from sampling inefficiencies, requiring a large number of samples to achieve satisfactory performance. Additionally, training an effective verifier often depends on extensive process supervision, which is costly to acquire. In this paper, we address these limitations by introducing a novel verification method based on Twisted Sequential Monte Carlo (TSMC). TSMC sequentially refines its sampling effort to focus exploration on promising candidates, resulting in more efficient generation of high-quality solutions. We apply TSMC to LLMs by estimating the expected future rewards at partial solutions. This approach results in a more straightforward training target that eliminates the need for step-wise human annotations. We empirically demonstrate the advantages of our method across multiple math benchmarks, and also validate our theoretical analysis of both our approach and existing verification methods.
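One way to picture the sampling scheme: keep a population of partial solutions and, after each reasoning step, resample them in proportion to an estimated expected future reward, so that compute concentrates on promising prefixes. The sketch below is a simplified illustration under assumed extend_fn and value_fn interfaces, not the paper's implementation:

    # Hypothetical sketch of value-guided sequential resampling over partial solutions.
    import random

    def smc_generate(prompt, extend_fn, value_fn, n_particles=8, n_steps=6):
        particles = [prompt] * n_particles
        for _ in range(n_steps):
            particles = [extend_fn(p) for p in particles]           # one reasoning step each
            weights = [max(value_fn(p), 1e-6) for p in particles]   # estimated future reward
            total = sum(weights)
            particles = random.choices(particles,
                                       weights=[w / total for w in weights],
                                       k=n_particles)               # resample toward promising prefixes
        return max(particles, key=value_fn)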


EC-DIT: Scaling Diffusion Transformers with Adaptive Expert-Choice Routing

October 2024 · 10 Reads

Diffusion transformers have been widely adopted for text-to-image synthesis. While scaling these models up to billions of parameters shows promise, the effectiveness of scaling beyond current sizes remains underexplored and challenging. By explicitly exploiting the computational heterogeneity of image generation, we develop a new family of Mixture-of-Experts (MoE) models (EC-DIT) for diffusion transformers with expert-choice routing. EC-DIT learns to adaptively optimize the compute allocated to understanding the input text and generating the respective image patches, enabling heterogeneous computation aligned with varying text-image complexities. This heterogeneity provides an efficient way of scaling EC-DIT up to 97 billion parameters, achieving significant improvements in training convergence, text-to-image alignment, and overall generation quality over dense models and conventional MoE models. Through extensive ablations, we show that EC-DIT demonstrates superior scalability and adaptive compute allocation by recognizing varying textual importance through end-to-end training. Notably, in text-to-image alignment evaluation, our largest models achieve a state-of-the-art GenEval score of 71.68% while maintaining competitive inference speed and intuitive interpretability.
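Expert-choice routing inverts the more common token-choice scheme: each expert selects the tokens it will process, so harder text-image regions can attract more compute. A minimal PyTorch sketch of the routing step (dimensions, capacity, and module layout are illustrative assumptions, not the EC-DIT implementation):

    # Hypothetical sketch of expert-choice routing for a MoE layer.
    import torch
    import torch.nn as nn

    class ExpertChoiceRouter(nn.Module):
        def __init__(self, d_model: int, n_experts: int, capacity: int):
            super().__init__()
            self.gate = nn.Linear(d_model, n_experts)
            self.capacity = capacity  # number of tokens processed by each expert

        def forward(self, tokens: torch.Tensor):
            # tokens: (n_tokens, d_model); affinity: (n_tokens, n_experts)
            affinity = self.gate(tokens).softmax(dim=-1)
            # Each expert picks its top-`capacity` tokens (expert-choice, not token-choice).
            weights, idx = affinity.t().topk(self.capacity, dim=-1)
            # Expert e then processes tokens[idx[e]], scaled by weights[e].
            return weights, idx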


ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities

August 2024 · 21 Reads · 1 Citation

Recent advancements in large language models (LLMs) have sparked growing research interest in tool-assisted LLMs solving real-world challenges, which calls for comprehensive evaluation of tool-use capabilities. Whereas previous works focused on evaluating either stateless web services (RESTful APIs) from a single-turn user prompt or off-policy dialog trajectories, ToolSandbox includes stateful tool execution, implicit state dependencies between tools, a built-in user simulator supporting on-policy conversational evaluation, and a dynamic evaluation strategy for intermediate and final milestones over an arbitrary trajectory. We show that open-source and proprietary models have a significant performance gap, and that complex tasks defined in ToolSandbox, such as State Dependency, Canonicalization, and Insufficient Information, challenge even the most capable SOTA LLMs, providing brand-new insights into tool-use LLM capabilities. The ToolSandbox evaluation framework is released at https://github.com/apple/ToolSandbox.
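The stateful, milestone-based setup can be illustrated with a toy world state, a tool whose success implicitly depends on another tool having been called first, and a milestone predicate checked over the trajectory. Everything below (tool names, state layout, milestone) is hypothetical and not ToolSandbox's actual API:

    # Toy illustration of stateful tools with an implicit dependency and a milestone check.
    state = {"wifi_enabled": False, "messages_sent": []}

    def enable_wifi() -> str:
        state["wifi_enabled"] = True
        return "wifi on"

    def send_message(to: str, body: str) -> str:
        if not state["wifi_enabled"]:  # implicit state dependency on enable_wifi()
            raise RuntimeError("no connectivity: enable wifi first")
        state["messages_sent"].append((to, body))
        return "sent"

    def milestone_reached() -> bool:
        """Final milestone: some message was delivered to 'alice'."""
        return any(to == "alice" for to, _ in state["messages_sent"])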


Figure and table previews for this report:
Figure 1: Modeling overview for the Apple foundation models.
AFM-on-device dimensions.
HELM MMLU-5s [Liang et al., 2023] v1.5.0 evaluation results.
Pre-training evaluation for AFM-server with an internal harness; unless otherwise noted, 0-shot prompts are used, and TriviaQA is evaluated on the larger and more challenging "Web" split.
Apple Intelligence Foundation Language Models

July 2024 · 453 Reads · 1 Citation

We present foundation language models developed to power Apple Intelligence features, including a ~3 billion parameter model designed to run efficiently on devices and a large server-based language model designed for Private Cloud Compute. These models are designed to perform a wide range of tasks efficiently, accurately, and responsibly. This report describes the model architecture, the data used to train the model, the training process, how the models are optimized for inference, and the evaluation results. We highlight our focus on Responsible AI and how the principles are applied throughout the model development.


MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains

July 2024 · 44 Reads

Recent advances in large language models (LLMs) have increased the demand for comprehensive benchmarks to evaluate their capabilities as human-like agents. Existing benchmarks, while useful, often focus on specific application scenarios, emphasizing task completion but failing to dissect the underlying skills that drive these outcomes. This lack of granularity makes it difficult to deeply discern where failures stem from. Additionally, setting up these environments requires considerable effort, and issues of unreliability and reproducibility sometimes arise, especially in interactive tasks. To address these limitations, we introduce the Massive Multitask Agent Understanding (MMAU) benchmark, featuring comprehensive offline tasks that eliminate the need for complex environment setups. It evaluates models across five domains, including Tool-use, Directed Acyclic Graph (DAG) QA, Data Science and Machine Learning coding, Contest-level programming, and Mathematics, and covers five essential capabilities: Understanding, Reasoning, Planning, Problem-solving, and Self-correction. With a total of 20 meticulously designed tasks encompassing over 3K distinct prompts, MMAU provides a comprehensive framework for evaluating the strengths and limitations of LLM agents. By testing 18 representative models on MMAU, we provide deep and insightful analyses. Ultimately, MMAU not only sheds light on the capabilities and limitations of LLM agents but also enhances the interpretability of their performance. Datasets and evaluation scripts of MMAU are released at https://github.com/apple/axlearn/docs/research/mmau.


Large Language Model-guided Document Selection

June 2024 · 8 Reads

Large Language Model (LLM) pre-training exhausts an ever-growing compute budget, yet recent research has demonstrated that careful document selection enables comparable model quality with only a fraction of the FLOPs. Inspired by efforts suggesting that domain-specific training document selection is in fact an interpretable process [Gunasekar et al., 2023], as well as research showing that instruction-finetuned LLMs are adept zero-shot data labelers [Gilardi et al., 2023], we explore a promising direction for scalable general-domain document selection: employing a prompted LLM as a document grader, we distill quality labels into a classifier model, which is then applied autonomously and at scale to a large, already heavily filtered, web-crawl-derived corpus. Following the guidance of this classifier, we drop 75% of the corpus and train LLMs on the remaining data. Results across multiple benchmarks show that: (1) filtering allows us to quality-match a model trained on the full corpus across diverse benchmarks with at most 70% of the FLOPs; (2) more capable LLM labelers and classifier models lead to better results that are less sensitive to the labeler's prompt; and (3) in-context learning helps to boost the performance of less-capable labeling models. In all cases we use open-source datasets, models, recipes, and evaluation frameworks, so that results can be reproduced by the community.
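The pipeline described above (an LLM grades a small sample of documents, a lightweight classifier is distilled from those grades, and the classifier then filters the full corpus) could look roughly like the following; the grader stub, TF-IDF features, and the 25% keep rate are assumptions for illustration only:

    # Hypothetical sketch of distilling LLM quality grades into a scalable document filter.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    def grade_with_llm(doc: str) -> int:
        """Prompted-LLM quality grade in {0, 1}; stubbed with a placeholder heuristic."""
        return int(len(doc.split()) > 50)

    def build_filter(sample_docs):
        labels = [grade_with_llm(d) for d in sample_docs]  # distillation targets
        vec = TfidfVectorizer(max_features=50_000)
        clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(sample_docs), labels)
        return vec, clf

    def keep_top_quarter(vec, clf, corpus):
        """Return indices of the highest-scoring ~25% of the corpus (the rest is dropped)."""
        probs = clf.predict_proba(vec.transform(corpus))[:, 1]
        k = max(1, len(corpus) // 4)
        return np.argsort(probs)[-k:]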


Citations (53)


... Recently, Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in multimodal understanding [3,33,26,44,21], paving the way for innovative applications in general-purpose foundation models [46]. By leveraging visual instruction tuning [27] with large-scale vision-text datasets spanning diverse domains (e.g., Visual Question Answering (VQA), Optical Character Recognition (OCR), and coding), MLLMs can generate coherent and contextually accurate responses to user queries involving both visual and textual inputs [1,41]. ...

Reference:

Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs
MM1: Methods, Analysis and Insights from Multimodal LLM Pre-training
  • Citing Chapter
  • November 2024

... This is evident in state-of-the-art text-image pretraining methods like BLIP and ALBEF, which rely on dense architectures. For multimodal learned sparse retrieval (MLSR), LexLIP (Zhao et al., 2023a) and STAIR (Chen et al., 2023a) are the only recent methods that exhibit competitive results on standard benchmarks. However, both models require complex multi-step training on extensive text-image pairs: LexLIP with up to 14.3 million pairs and STAIR with a massive 1 billion pairs, encompassing public and private data. ...

STAIR: Learning Sparse Text and Image Representation in Grounded Tokens
  • Citing Conference Paper
  • January 2023

... The symmetric 16 × 16 × 16 static configuration is chosen as the baseline because it has the highest bisection bandwidth among all possible static configurations of a 3D torus. The optimal lightwave fabric configuration, together with the model-specific parallelism configuration including model partitioning and pipelining, is determined automatically by a reinforcement-learning-based and hyperscale-hardware-optimized neural architecture search (NAS) system without human intervention [33]. Another important observation from Table 2 is that there is no "one-size-fits-all" optimal slice configuration for ML models. ...

Hyperscale Hardware Optimized Neural Architecture Search
  • Citing Conference Paper
  • March 2023

... Rare Words in ASR, MT, and direct ST In ASR, some representative approaches to handle rare words include language model rescoring or fusion (Raju et al., 2019; Yang et al., 2021; Huang et al., 2022; Weiran et al., 2022), data augmentation by text-to-speech (TTS) (Guo et al., 2019; Zheng et al., 2021; Qu et al., 2023), and context enhancement by an additional memory module (Bruguier et al., 2019; Jain et al., 2020; Chang et al., 2021; Huber et al., 2021; Qiu et al., 2022; Huber and Waibel, 2024). In MT, rare word translation has been tackled by, among other techniques, constrained decoding (Chatterjee et al., 2017; Hasler et al., 2018; Ailem et al., 2021), copying by source annotations (Dinu et al., 2019; Song et al., 2019; Bergmanis and Pinnis, 2021) or pointing mechanisms (Gulcehre et al., 2016; Pham et al., 2018; Gu et al., 2019; Zhang et al., 2021), and retrieval-augmented translation (Martins et al., 2023). ...

Sentence-Select: Large-Scale Language Model Data Selection for Rare-Word Speech Recognition

... First, we consider a task of multilingual ASR. The current trend in building multilingual end-to-end ASR models is to train a single network that can recognize multiple languages [17,18,19]. Another popular line of work is to build mixture-of-expert systems that use separate sub-networks specializing in different languages [20,21]. ...

A Language Agnostic Multilingual Streaming On-Device ASR System
  • Citing Conference Paper
  • September 2022

... In viticulture, data collection is challenging due to the need for fieldwork to check the stage, resulting in a small amount of annotation during the viticultural timeline and missing values on most days within the grapevine phase, significantly degrading the quality of the machine-learning solution's estimations (Palanivinayagam and Damaševičius, 2023). To address missing data, which is commonly encountered across various domains, including natural language processing (Yang et al., 2023), computer vision (Ciarfuglia et al., 2022), and speech recognition (Zhang et al., 2022), semi-supervised learning methods have emerged as effective solutions. ...

BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition
  • Citing Article
  • October 2022

IEEE Journal of Selected Topics in Signal Processing

... Model Architecture: We train streaming Conformer transducers [19] following the recipe in [37]. The encoder of the transducer consists of 7 causal Conformer blocks followed by 10 non-causal Conformer blocks [38,39]. There are two separate hybrid autoregressive transducer (HAT) decoders [31] for the causal and non-causal encoders. ...

Improving The Latency And Quality Of Cascaded Encoders
  • Citing Conference Paper
  • May 2022

... However, limited focus has been given to CL for MMASR. While some works have shown success in mitigating CF by incremental fine-tuning [20], training a mapping matrix for task-specific weights [21], and performing weight averaging [22], they are not language-agnostic. Another work [23] has provided a comprehensive study of CL baseline methods for MMASR, indicating potential areas for improvement. ...

Massively Multilingual ASR: A Lifelong Learning Solution
  • Citing Conference Paper
  • May 2022

... Experimental results show that S4M achieves comparable separation performance with significantly fewer trainable parameters in comparison with other mainstream methods. Furthermore, we analyze the model complexity using computing time and MACs, which shows that S4M provides a potential solution for streaming-based speech separation on mobile devices or streaming applications [39]. ...

Transducer-Based Streaming Deliberation for Cascaded Encoders
  • Citing Conference Paper
  • May 2022

... Neural acoustic models have shown robust performance in processing human speech information and have demonstrated remarkable capabilities in spoken language tasks (Radford et al., 2023; Peng et al., 2023b; Barrault et al., 2023a). Powered by large-scale training (Baevski et al., 2020; Zhang et al., 2023; Chen et al., 2024; 2022; Li et al., 2021), Transformer-based (Vaswani et al., 2017) Whisper (Radford et al., 2023) and Canary (Puvvada et al., 2024) are trained on undisclosed data, while OWSM and the presented OWLS use public data. ...

Scaling End-to-End Models for Large-Scale Multilingual ASR