Eric P. Xing’s research while affiliated with Carnegie Mellon University and other places


Publications (727)


Log-Linear Attention
  • Preprint

June 2025 · 12 Reads

Han Guo · Songlin Yang · Tarushii Goel · [...] · Yoon Kim

The attention mechanism in Transformers is an important primitive for accurate and scalable sequence modeling. Its quadratic compute and linear memory complexity, however, remain significant bottlenecks. Linear attention and state-space models enable linear-time, constant-memory sequence modeling and can moreover be trained efficiently through matmul-rich parallelization across sequence length. However, at their core these models are still RNNs, and their use of a fixed-size hidden state to model the context is a fundamental limitation. This paper develops log-linear attention, an attention mechanism that balances linear attention's efficiency with the expressiveness of softmax attention. Log-linear attention replaces the fixed-size hidden state with a logarithmically growing set of hidden states. We show that with a particular growth function, log-linear attention admits a similarly matmul-rich parallel form whose compute cost is log-linear in sequence length. Log-linear attention is a general framework and can be applied on top of existing linear attention variants. As case studies, we instantiate log-linear variants of two recent architectures -- Mamba-2 and Gated DeltaNet -- and find they perform well compared to their linear-time variants.
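
As a rough illustration of the core idea (a hidden state that grows logarithmically with sequence length), here is a minimal, hedged numpy sketch. The Fenwick-style bucketing, uniform mixing weights, and the naive O(T²) reference loop are illustrative assumptions, not the paper's growth function or its matmul-rich parallel form.

```python
import numpy as np

def fenwick_buckets(t):
    """Split positions 1..t into O(log t) contiguous buckets (Fenwick-style partition)."""
    buckets, hi = [], t
    while hi > 0:
        lo = hi - (hi & -hi) + 1   # lowest position covered by this node
        buckets.append((lo, hi))
        hi = lo - 1
    return buckets                 # e.g. t=6 -> [(5, 6), (1, 4)]

def log_linear_attention(Q, K, V, weights=None):
    """Naive O(T^2) reference; a real kernel would update bucket states incrementally."""
    T, d = Q.shape
    out = np.zeros_like(V)
    for t in range(1, T + 1):
        buckets = fenwick_buckets(t)
        # one scalar mixing weight per bucket (assumed uniform here; learned in practice)
        w = weights[:len(buckets)] if weights is not None else np.ones(len(buckets))
        acc = np.zeros(V.shape[1])
        for (lo, hi), wb in zip(buckets, w):
            S = K[lo - 1:hi].T @ V[lo - 1:hi]   # bucket state, shape (d, d_v)
            acc += wb * (Q[t - 1] @ S)          # query reads each bucket's state
        out[t - 1] = acc
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(8, 4)) for _ in range(3))
print(log_linear_attention(Q, K, V).shape)      # (8, 4)
```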


Fig. 6 shows the comparison of attention biases for sequential-phase training.
Figure 7: Comparison of attention biases for diffusion-phase training, before and after sorting the rows and columns by σ. Orange represents 0 (attention) and gray represents −∞ (no attention). The clean sequence is x = (A, B, C, D, E, F) and hence L = 6. After random masking, we obtain z_t = (A, M, C, M, M, F). The integers denote the position indices with M(z_t) = {2, 4, 5} and C(z_t) = {1, 3, 6}. The random ordering sampled is σ = (3, 1, 6, 4, 5, 2) ∼ P_6 with clean tokens before mask tokens. Red highlights the differences between Eso-LM (A) and Eso-LM (B).
Figure 8: Comparison of attention biases for sequential-phase training, before and after sorting the rows and columns of each of the four L × L blocks by σ. Orange represents 0 (attention) and gray represents −∞ (no attention). The clean sequence is x = (A, B, C, D, E, F) and hence L = 6. After random masking, we obtain z_0 = (A, M, C, M, M, F). The integers denote the position indices with M(z_0) = {2, 4, 5} and C(z_0) = {1, 3, 6}. The random ordering among C(z_0) is (3, 1, 6). Red highlights the differences between Eso-LM (A) and Eso-LM (B). Green highlights the extra connections added for clean tokens in z_0 so that the attention biases display useful patterns after sorting; they don't affect the transformer output because no other token attends to clean tokens in z_0.
Test perplexities (PPL; ↓) on OWT for models trained for 250K. ‡ Reported in [1]. † Denotes retrained models. ¶ Intermediate checkpoints were provided by Sahoo et al. [26].
Sampling time (↓) in seconds for sequence lengths L ∈ {2048, 8192} with NFEs set to L for all methods. Reported values are mean ± std over 5 runs.


Esoteric Language Models
  • Preprint
  • File available

June 2025

Diffusion-based language models offer a compelling alternative to autoregressive (AR) models by enabling parallel and controllable generation. Among this family of models, Masked Diffusion Models (MDMs) achieve the strongest performance but still underperform AR models in perplexity and lack key inference-time efficiency features -- most notably, KV caching. In this work, we introduce Eso-LMs, a new family of models that fuses AR and MDM paradigms, enabling smooth interpolation between their perplexities while overcoming their respective limitations. Eso-LMs set a new state of the art on standard language modeling benchmarks. Crucially, we are the first to introduce KV caching for MDMs while preserving parallel generation, significantly improving inference efficiency. Combined with an optimized sampling schedule, our method achieves up to 65x faster inference than standard MDMs and 4x faster inference than prior semi-autoregressive approaches. We provide the code and model checkpoints on the project page: http://s-sahoo.github.io/Eso-LMs


Learning to estimate sample-specific transcriptional networks for 7,000 tumors

May 2025 · 17 Reads · Proceedings of the National Academy of Sciences

Cancers are shaped by somatic mutations, microenvironment, and patient background, each altering gene expression and regulation in complex ways, resulting in heterogeneous cellular states and dynamics. Inferring gene regulatory networks (GRNs) from expression data can help characterize this regulation-driven heterogeneity, but network inference requires many statistical samples, limiting GRNs to cluster-level analyses that ignore intracluster heterogeneity. We propose to move beyond coarse analyses of predefined subgroups by using contextualized learning, a multitask learning paradigm that uses multiview contexts including phenotypic, molecular, and environmental information to infer personalized models. With sample-specific contexts, contextualization enables sample-specific models and even generalizes at test time to predict network models for entirely unseen contexts. We unify three network model classes (Correlation, Markov, and Neighborhood Selection) and estimate context-specific GRNs for 7,997 tumors across 25 tumor types, using copy number and driver mutation profiles, tumor microenvironment, and patient demographics as model context. Our generative modeling approach allows us to predict GRNs for unseen tumor types based on a pan-cancer model of how somatic mutations affect gene regulation. Finally, contextualized networks enable GRN-based precision oncology by providing a structured view of expression dynamics at sample-specific resolution, explaining known biomarkers in terms of network-mediated effects and leading to subtypings that improve survival prognosis. We provide a SKLearn-style Python package https://contextualized.ml for learning and analyzing contextualized models, as well as interactive plotting tools for pan-cancer data exploration at https://github.com/cnellington/CancerContextualized .
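
To make the contextualized-learning idea concrete, below is a minimal, hedged PyTorch sketch (not the contextualized.ml API): a small hypernetwork maps each sample's context vector to the coefficients of a sample-specific Neighborhood-Selection style network. The architecture, loss, and sparsity penalty are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ContextualizedNetwork(nn.Module):
    def __init__(self, context_dim, n_genes, hidden=64):
        super().__init__()
        self.n_genes = n_genes
        # Hypernetwork: context -> per-sample gene-gene coefficient matrix.
        self.hyper = nn.Sequential(
            nn.Linear(context_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_genes * n_genes),
        )

    def forward(self, c, x):
        W = self.hyper(c).view(-1, self.n_genes, self.n_genes)
        W = W * (1 - torch.eye(self.n_genes))              # no self-regulation
        x_hat = torch.bmm(W, x.unsqueeze(-1)).squeeze(-1)  # predict each gene from the rest
        return W, x_hat

# Toy usage: 128 samples, 10-dim context, 20 genes.
torch.manual_seed(0)
c, x = torch.randn(128, 10), torch.randn(128, 20)
model = ContextualizedNetwork(context_dim=10, n_genes=20)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    W, x_hat = model(c, x)
    loss = ((x_hat - x) ** 2).mean() + 1e-3 * W.abs().mean()  # fit + sparsity penalty
    opt.zero_grad()
    loss.backward()
    opt.step()

# Because the hypernetwork is shared, a network can be predicted for an unseen context:
W_new, _ = model(torch.randn(1, 10), torch.zeros(1, 20))
```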


lmgame-Bench: How Good are LLMs at Playing Games?

May 2025 · 2 Reads

Playing video games requires perception, memory, and planning, exactly the faculties modern large language model (LLM) agents are expected to master. We study the major challenges in using popular video games to evaluate modern LLMs and find that directly dropping LLMs into games does not yield an effective evaluation, for three reasons -- brittle vision perception, prompt sensitivity, and potential data contamination. We introduce lmgame-Bench to turn games into reliable evaluations. lmgame-Bench features a suite of platformer, puzzle, and narrative games delivered through a unified Gym-style API, paired with lightweight perception and memory scaffolds, and designed to stabilize prompt variance and remove contamination. Across 13 leading models, we show lmgame-Bench is challenging while still separating models well. Correlation analysis shows that every game probes a unique blend of capabilities often tested in isolation elsewhere. More interestingly, performing reinforcement learning on a single game from lmgame-Bench transfers both to unseen games and to external planning tasks. Our evaluation code is available at https://github.com/lmgame-org/GamingAgent/lmgame-bench.
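
The "unified Gym-style API" is easy to picture with a toy example. The environment, policy, and reward below are hypothetical stand-ins, not lmgame-Bench code; this is a minimal sketch of the evaluation loop, assuming a Gym-like reset/step interface.

```python
class ToyCorridorEnv:
    """Gym-style text game: reach position 3 by issuing 'right' moves."""
    def reset(self):
        self.pos = 0
        return f"You are at position {self.pos}.", {}

    def step(self, action: str):
        self.pos += 1 if action == "right" else -1
        done = self.pos >= 3
        reward = 1.0 if done else 0.0
        return f"You are at position {self.pos}.", reward, done, False, {}

def llm_policy(observation: str) -> str:
    # A real harness would prompt an LLM with the observation plus perception
    # and memory scaffolds; here the answer is hard-coded for illustration.
    return "right"

def run_episode(env, policy, max_steps=20):
    obs, _ = env.reset()
    total = 0.0
    for _ in range(max_steps):
        obs, reward, terminated, truncated, _ = env.step(policy(obs))
        total += reward
        if terminated or truncated:
            break
    return total

print(run_episode(ToyCorridorEnv(), llm_policy))  # 1.0
```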


Decentralized Arena: Towards Democratic and Scalable Automatic Evaluation of Language Models

May 2025 · 1 Read

The recent explosion of large language models (LLMs), each with its own general or specialized strengths, makes scalable, reliable benchmarking more urgent than ever. Standard practices today face fundamental trade-offs: closed-ended question-based benchmarks (e.g., MMLU) struggle with saturation as newer models emerge, while crowd-sourced leaderboards (e.g., Chatbot Arena) rely on costly and slow human judges. Recently, automated methods (e.g., LLM-as-a-judge) have shed light on scalability, but risk bias by relying on one or a few "authority" models. To tackle these issues, we propose Decentralized Arena (dearena), a fully automated framework leveraging collective intelligence from all LLMs to evaluate each other. It mitigates single-model judge bias through democratic, pairwise evaluation, and remains efficient at scale through two key components: (1) a coarse-to-fine ranking algorithm for fast incremental insertion of new models with sub-quadratic complexity, and (2) an automatic question selection strategy for the construction of new evaluation dimensions. In extensive experiments across 66 LLMs, dearena attains up to 97% correlation with human judgements, while significantly reducing cost. Our code and data will be publicly released at https://github.com/maitrix-org/de-arena.
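
The sub-quadratic insertion component can be illustrated with plain binary-search insertion into an existing ranking. The stand-in judge below is a hypothetical oracle; in dearena the comparison would aggregate votes from the other LLMs as judges, and the actual algorithm is coarse-to-fine rather than this bare sketch.

```python
import random
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    strength: float   # hidden "true" quality, observed only through pairwise judgements

def beats(a: Model, b: Model) -> bool:
    # Stand-in pairwise judgement; in dearena this would be a majority vote of
    # the remaining LLMs acting as judges on a shared question set.
    return a.strength > b.strength

def insert_ranked(ranking: list, new: Model) -> None:
    """Binary-search insertion: O(log n) pairwise comparisons per new model."""
    lo, hi = 0, len(ranking)
    while lo < hi:
        mid = (lo + hi) // 2
        if beats(new, ranking[mid]):
            hi = mid          # new model is better: search the upper half
        else:
            lo = mid + 1
    ranking.insert(lo, new)

random.seed(0)
models = [Model(f"m{i}", random.random()) for i in range(10)]
ranking = []
for m in models:
    insert_ranked(ranking, m)
print([m.name for m in ranking])  # best to worst under the stand-in judge
```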


Figure 2: VSA scaling experiments. (a): Video DiT trained with VSA achieves a similar loss curve to one trained with full attention. (b): VSA consistently produces a better Pareto frontier when scaling model size up to 1.4B. (c) & (d): The optimal Top-K value (dictating sparsity) depends on both sequence length and training compute. A larger K is needed for a larger training budget.
Figure 4: Kernel benchmarks. (a): Runtime breakdown of a single transformer block for Wan1.3B and Hunyuan. VSA reduces the attention latency by 6×. (b): Speed of VSA with a fixed 87.5% sparsity under various sequence lengths with head dim 64. VSA approaches the theoretical 8× speedup over FA3.
Figure 5: Visualization of the attention pattern of VSA. (a)-(f): VSA dynamically selects different cubes to attend to, where the blue cube indicates the query and red cubes indicate the selected keys and values. (e): VSA critical-token prediction accuracy.
Model configuration and training hyperparameters used for the ablation studies.
Coarse stage runtime (µs).
Faster Video Diffusion with Trainable Sparse Attention

May 2025 · 8 Reads

Scaling video diffusion transformers (DiTs) is limited by their quadratic 3D attention, even though most of the attention mass concentrates on a small subset of positions. We turn this observation into VSA, a trainable, hardware-efficient sparse attention that replaces full attention at both training and inference. In VSA, a lightweight coarse stage pools tokens into tiles and identifies high-weight critical tokens; a fine stage computes token-level attention only inside those tiles, subject to a block computing layout to ensure hardware efficiency. This leads to a single differentiable kernel that trains end-to-end, requires no post-hoc profiling, and sustains 85% of FlashAttention3 MFU. We perform a large sweep of ablation studies and scaling-law experiments by pretraining DiTs from 60M to 1.4B parameters. VSA reaches a Pareto point that cuts training FLOPS by 2.53× with no drop in diffusion loss. Retrofitting the open-source Wan-2.1 model speeds up attention time by 6× and lowers end-to-end generation time from 31s to 18s with comparable quality. These results establish trainable sparse attention as a practical alternative to full attention and a key enabler for further scaling of video diffusion models.
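
A minimal, hedged numpy sketch of the two-stage structure (coarse tile selection, then exact attention inside the selected tiles). Tile size, top-K, and mean pooling are illustrative assumptions; VSA itself is a fused, trainable kernel with a block computing layout, not this reference loop.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sparse_tile_attention(Q, K, V, tile=4, topk=2):
    T, d = Q.shape
    n_tiles = T // tile
    Qt = Q.reshape(n_tiles, tile, d).mean(axis=1)        # coarse stage: pooled query tiles
    Kt = K.reshape(n_tiles, tile, d).mean(axis=1)        # coarse stage: pooled key tiles
    coarse = Qt @ Kt.T                                    # tile-level relevance scores
    out = np.zeros_like(V)
    for qt in range(n_tiles):
        sel = np.argsort(coarse[qt])[-topk:]              # pick the critical key tiles
        idx = np.concatenate([np.arange(s * tile, (s + 1) * tile) for s in sel])
        q = Q[qt * tile:(qt + 1) * tile]
        attn = softmax(q @ K[idx].T / np.sqrt(d))         # fine stage: exact attention inside tiles
        out[qt * tile:(qt + 1) * tile] = attn @ V[idx]
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(16, 8)) for _ in range(3))
print(sparse_tile_attention(Q, K, V).shape)               # (16, 8)
```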


A Large-Scale Foundation Model for RNA Enables Diverse Function and Structure Prediction

April 2025 · 6 Reads

Accurately predicting RNA structures and functions from nucleotide sequences, or conversely, designing sequences to meet structural and functional requirements, remains a fundamental challenge in RNA biology, largely due to limited annotated data and the poor efficiency of ab initio modeling approaches. Here, we introduce AIDO.RNA, a large-scale RNA foundation model that leverages self-supervised pre-training to learn general and effective RNA representations, which can be transferred to tackle a wide range of RNA prediction and design tasks. AIDO.RNA is a 1.6-billion-parameter transformer-based language model, pre-trained on 42 million non-coding RNA (ncRNA) sequences at single-nucleotide resolution. It can be adapted to achieve state-of-the-art performance on 26 out of 28 diverse tasks, including RNA structure and function prediction, mRNA expression modeling, multi-modal RNA isoform expression prediction, and RNA inverse folding, demonstrating its effectiveness and versatility across the board. We find that beyond excelling in ncRNA-related tasks that directly reside in the pre-training data space, AIDO.RNA can be efficiently adapted to new domains with continued domain-specific pre-training to generalize toward untranslated regions and coding regions of mRNA, suggesting a promising pathway to continue to level up biological foundation models in general. We make AIDO.RNA open source and release the utility of the model in AIDO.ModelGenerator, a Python package enabling easy reproduction, application, and extension of our results.


Figure 1: System overview: the Hugging Face model is first converted to an ONNX model, with the hidden states of the last layer added as an output node in the ONNX computation graph; the ONNX model is deployed on a laptop and ONNX-mobile on a mobile phone. The edge device is then connected through a router to the SGLang backend on the server side. The router automatically routes low-confidence tokens to the server and sends the response back to the edge device.
Figure 2: Computation procedure: unlike conventional inference, the token routing system involves multiple rounds of prefill and decode within a single request, which prevents full utilization of inference acceleration engines such as SGLang and vLLM, as they only optimize kernels and KV caching for single-stage prefill and decode.
Figure 3: User interface of the token-level routing system. Users can set prompts, thresholds, and decoding modes. Tokens from the large model are highlighted in red for interpretability.
Figure 6: Latency comparisons under different thresholds.
Figure 7: Accuracy vs Threshold on CommonSense QA
Token Level Routing Inference System for Edge Devices

The computational complexity of large language model (LLM) inference significantly constrains their deployment efficiency on edge devices. In contrast, small language models offer faster decoding and lower resource consumption but often suffer from degraded response quality and heightened susceptibility to hallucinations. To address this trade-off, collaborative decoding, in which a large model assists in generating critical tokens, has emerged as a promising solution. This paradigm leverages the strengths of both model types by enabling high-quality inference through selective intervention of the large model, while maintaining the speed and efficiency of the smaller model. In this work, we present a novel collaborative decoding inference system that allows small models to perform on-device inference while selectively consulting a cloud-based large model for critical token generation. Remarkably, the system achieves a 60% performance gain on CommonsenseQA using only a 0.5B model on an M1 MacBook, with under 7% of generated tokens uploaded to the large model in the cloud.
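
A minimal, hedged sketch of the confidence-threshold routing loop described above. The toy "models" and the confidence source are placeholders (a real system would use the small model's top-token probability and call the cloud LLM through a serving backend such as SGLang); only the routing logic is the point here.

```python
import random

def small_model(prefix):
    # Returns (next_token, confidence); a real system would use the softmax
    # probability of the small on-device model's top token.
    random.seed(len(prefix))
    return f"s{len(prefix)}", random.random()

def large_model(prefix):
    # Stand-in for a cloud-hosted large model serving critical tokens.
    return f"L{len(prefix)}"

def generate(prompt, max_tokens=10, threshold=0.6):
    tokens, routed = list(prompt), 0
    for _ in range(max_tokens):
        tok, conf = small_model(tokens)
        if conf < threshold:                 # low confidence: consult the large model
            tok, routed = large_model(tokens), routed + 1
        tokens.append(tok)
    print(f"routed {routed}/{max_tokens} tokens to the large model")
    return tokens

generate(["Q:", "What", "is", "2+2?"])
```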


MegaMath: Pushing the Limits of Open Math Corpora

April 2025 · 2 Reads

Mathematical reasoning is a cornerstone of human intelligence and a key benchmark for advanced capabilities in large language models (LLMs). However, the research community still lacks an open, large-scale, high-quality corpus tailored to the demands of math-centric LLM pre-training. We present MegaMath, an open dataset curated from diverse, math-focused sources through the following practices: (1) Revisiting web data: we re-extracted mathematical documents from Common Crawl with math-oriented HTML optimizations, fastText-based filtering, and deduplication, all to acquire higher-quality data from the Internet. (2) Recalling math-related code data: we identified high-quality math-related code from a large code training corpus, Stack-V2, further enhancing data diversity. (3) Exploring synthetic data: we synthesized QA-style text, math-related code, and interleaved text-code blocks from web data or code data. By integrating these strategies and validating their effectiveness through extensive ablations, MegaMath delivers 371B tokens with the largest quantity and top quality among existing open math pre-training datasets.
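
As a toy illustration of the filter-then-deduplicate step in practice (1), here is a minimal, hedged sketch. The quality_score stub stands in for a trained fastText math classifier, and the dedup is exact hashing rather than the fuzzy deduplication a web-scale corpus would use; both are assumptions for illustration.

```python
import hashlib
import re

def quality_score(doc: str) -> float:
    # Stand-in for a trained fastText classifier: crude density of math symbols.
    math_chars = len(re.findall(r"[=+\-*/^\\$]", doc))
    return math_chars / max(len(doc), 1)

def normalize(doc: str) -> str:
    # Case-fold and collapse whitespace before hashing for deduplication.
    return re.sub(r"\s+", " ", doc.lower()).strip()

def filter_and_dedup(docs, threshold=0.02):
    seen, kept = set(), []
    for doc in docs:
        if quality_score(doc) < threshold:
            continue                                    # drop documents with little math
        h = hashlib.sha1(normalize(doc).encode()).hexdigest()
        if h in seen:
            continue                                    # drop exact duplicates
        seen.add(h)
        kept.append(doc)
    return kept

docs = ["Let x = 2 + 2, so x = 4.", "let X  =  2 + 2, so x = 4.", "A cat sat on a mat."]
print(filter_and_dedup(docs))   # keeps a single math document
```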


Causal Representation Learning from Multimodal Biomedical Observations

March 2025 · 3 Reads · 1 Citation

Prevalent in biomedical applications (e.g., human phenotype research), multi-modal datasets can provide valuable insights into the underlying physiological mechanisms. However, current machine learning (ML) models designed to analyze these datasets often lack interpretability and identifiability guarantees, which are essential for biomedical research. Recent advances in causal representation learning have shown promise in identifying interpretable latent causal variables with formal theoretical guarantees. Unfortunately, most current work on multi-modal distributions either relies on restrictive parametric assumptions or yields only coarse identification results, limiting their applicability to biomedical research that favors a detailed understanding of the mechanisms. In this work, we aim to develop flexible identification conditions for multimodal data and principled methods to facilitate the understanding of biomedical datasets. Theoretically, we consider a nonparametric latent distribution (cf. parametric assumptions in previous work) that allows for causal relationships across potentially different modalities. We establish identifiability guarantees for each latent component, extending the subspace identification results from previous work. Our key theoretical contribution is the structural sparsity of causal connections between modalities, which, as we will discuss, is natural for a large collection of biomedical systems. Empirically, we present a practical framework to instantiate our theoretical insights. We demonstrate the effectiveness of our approach through extensive experiments on both numerical and synthetic datasets. Results on a real-world human phenotype dataset are consistent with established biomedical research, validating our theoretical and methodological framework.


Citations (43)


... MetaAgents, while providing a new possibility for LLM-based agents, also raises ethical concerns. The first concern is trustworthiness, including fairness, transparency, and accountability [26]. LLMs may generate undesired output, such as gender stereotypes and harmful opinions, which may be amplified through interactions in multi-agent settings. ...

Reference:

MetaAgents: Large Language Model Based Agents for Decision-Making on Teaming
Position: TRUSTLLM: Trustworthiness in Large Language Models

... The term 'foundation model' (FM) is used to describe a machine-learning model trained on a large, diverse, and therefore general dataset, which demonstrates high accuracy and generalisation behaviour out-of-the-box. 1 Such models are currently transforming research and technologies in many fields, including natural language processing, computer vision, medicine, and the physical sciences. [2][3][4][5][6][7][8][9] For example, an image-recognition model was trained on large amounts of unlabelled medical images and then fine-tuned (adapted in a supervised setting) to predict the occurrence of diseases; 4 a transformer model was trained on heterogeneous data comprising weather forecasts, ocean-wave and hurricane dynamics to yield a foundation model for the Earth system as a whole. 9 In the domain of atomistic simulations of materials and molecular systems, the term 'foundation model' is typically used more specifically: for machine-learned interatomic potential (MLIP) models that have been trained on very large datasets of diverse chemical systems. ...

A foundation model of transcription across human cell types

Nature

... Earlier works in NLP demonstrated that larger models learn representations that can be fine-tuned to perform downstream tasks more effectively [30,75,100,101], and elucidated the compute-optimal ways to scale up the size of both models and datasets to reap these benefits [43,51]. Similarly, it has been shown that PLM embeddings implicitly capture notions of protein fitness that can be elicited in both supervised and unsupervised settings [15,65,69,70,90,109], and recent work has derived compute-optimal scaling laws for PLMs [19]. ...

Mixture of Experts Enable Efficient and Effective Protein Understanding and Design

... To test zero-shot performance of genomic foundation models on RNA directed evolution, we benchmarked 9 SOTA genomic foundation models. These models included RNA language models (AIDO.RNA 22, RiNALMo 23, RNAFM 20, RNAMSM 24), DNA language models (Evo 17, Nucleotide Transformer 25, GENA-LM 26 and GROVER 27) and the RNA inverse-folding model (RhoDesign 8). We collected 6 ncRNA DMS datasets including tRNA 28, RNA aptamer 10,11,29 and ribozyme 30 and evaluated the model's ability to perform zero-shot ncRNA fitness prediction using the results of experimental ncRNA DMS studies as the ground truth score. ...

A Large-Scale Foundation Model for RNA Function and Structure Prediction

... Exploration Algorithm: The informativeness optimization alone is insufficient to fully explore value difference-evoking questions x, since values are pluralistic [98,99] and one single topic cannot capture diverse values. Therefore, we combine the optimization with a search algorithm like [100,101], adaptively deciding whether to further exploit and refine a question x or shift to another, covering a spectrum of social issues, especially the controversial ones as discussed in Sec. 1. ...

Dynamic Rewarding with Prompt Optimization Enables Tuning-free Self-Alignment of Language Models
  • Citing Conference Paper
  • January 2024

... Our scheme supports arbitrary zero-points, which may not align with inference kernels such as ExLlamaV2 8 . Nonetheless, NeUQI is fully compatible with flexible backends like FLUTE [11], LUT-GEMM [24], and BitBLAS [30], which support mixed precision or lookup-based matrix multiplication. ...

Fast Matrix Multiplications for Lookup Table-Quantized LLMs
  • Citing Conference Paper
  • January 2024

... In test-time adaptation tasks, the memory bank often serves as an auxiliary structure to indirectly optimize the prediction results. RoTTA [39] maintains balanced class representations using dynamic filtering, while TDA [14] employs a lightweight key-value cache for efficient, backpropagation-free pseudo-label refinement. To enhance capability under extreme interference, our work introduces a memory bank storing features of high-confidence samples. ...

Efficient Test-Time Adaptation of Vision-Language Models
  • Citing Conference Paper
  • June 2024

... Different from existing methods, we analyse the densification process of 3DGS and propose a simple yet effective solution based on delayed Gaussian growth and scale-cascaded mask bootstrapping to reliably remove the effects of transient objects. Optimization in Densification and Regularization: There are prior works aiming to improve the densification and optimization process of 3DGS [3,3,10,14,60]. For example, several methods [55,58,62] have analyzed the gradient computation process and identified issues such as gradient collision or averaging, which lead to suboptimal reconstruction quality. RAIN-GS [15] investigates an alternative initialization strategy for 3DGS without relying on Colmap SFM. ...

FreGS: 3D Gaussian Splatting with Progressive Frequency Regularization
  • Citing Conference Paper
  • June 2024

... Training a state-of-the-art LLM, such as GPT-3, requires access to extensive hardware infrastructure, including thousands of GPUs or TPUs, large amounts of RAM, and high-speed data storage solutions [113,114]. According to estimates, the training cost of models, like GPT-3, can reach up to US $1.4 million, with operational costs amounting to several hundred thousand dollars per day when deployed at scale. ...

RedCoast: A Lightweight Tool to Automate Distributed Training of LLMs on Any GPU/TPUs
  • Citing Conference Paper
  • January 2024

... First, while the choice of knowledge source likely has a large impact on the predictive performance of these methods 15,27 , many proposed architectures are highly specific to their employed knowledge source 28 , posing practical limits to exploiting additional knowledge sources. These architectures range from simple linear regression models 22,29 through graph neural networks (GNNs) 15,16 to transformer architectures [19][20][21] and generative models using optimal transport (OT) 30,31 . Second, current evaluation methodologies and benchmarks have limited depth and practical relevance, such that prior work has questioned the efficacy of these models compared to uninformative baselines 32 . ...

AttentionPert: accurately modeling multiplexed genetic perturbations with multi-scale effects
  • Citing Article
  • June 2024

Bioinformatics