Shih-Cheng Huang’s research while affiliated with Stanford University and other places


Publications (37)


Figure 2: Preference Vector Scaling with Preference Model Evaluation. We evaluate the controllability of our method on LLAMA3-8B using preference models under varying scaling coefficients η_Helpful, η_Harmless ∈ {−1.0, −0.5, 0.0, +0.5, +1.0} for the preference vectors. Green indicates higher helpfulness or harmlessness, while red indicates lower values. The results show relatively smooth and interpretable trends, demonstrating fine-grained control over preference strength.
Figure 3: Safety, helpfulness, and commonsense performance at different scaling coefficients. The models maintain their knowledge base when the preference vector is added (η = η_Helpful = η_Harmless).
Figure 4: Eigenvalues of different preference vectors obtained from different random seeds. The largest eigenvalue (λ₁) dominates the others, indicating that preference vectors primarily align along a single, dominant direction.
Win rates based on human evaluation. Win rates represent the percentage of pairwise comparisons won by each model based on human annotator rankings. Higher values are better.
Hyper-parameters of SFT and DPO Model Training.
Adaptive Helpfulness-Harmlessness Alignment with Preference Vectors
  • Preprint
  • File available

April 2025 · 1 Read

Ren-Wei Liang · Chin-Ting Hsu · Chan-Hung Yu · [...] · Shao-Hua Sun

Ensuring that large language models (LLMs) are both helpful and harmless is a critical challenge, as overly strict constraints can lead to excessive refusals, while permissive models risk generating harmful content. Existing approaches, such as reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO), attempt to balance these trade-offs but suffer from performance conflicts, limited controllability, and poor extendability. To address these issues, we propose Preference Vector, a novel framework inspired by task arithmetic. Instead of optimizing multiple preferences within a single objective, we train separate models on individual preferences, extract behavior shifts as preference vectors, and dynamically merge them at test time. This modular approach enables fine-grained, user-controllable preference adjustments and facilitates seamless integration of new preferences without retraining. Experiments show that our proposed Preference Vector framework improves helpfulness without excessive conservatism, allows smooth control over preference trade-offs, and supports scalable multi-preference alignment.
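The test-time merging described in the abstract can be illustrated with a short, hedged sketch in the style of task arithmetic: a preference vector is the parameter delta between a preference-tuned model and the base model, and deltas are scaled and summed back onto the base weights. Function names, checkpoints, and coefficients below are illustrative, not the authors' released implementation.

```python
# Minimal sketch (not the paper's code): task-arithmetic-style preference merging.
import torch

def preference_vector(base_state, tuned_state):
    """A preference vector is the per-parameter difference tuned - base."""
    return {k: tuned_state[k] - base_state[k] for k in base_state}

def merge_preferences(base_state, vectors, coefficients):
    """Add scaled preference vectors (e.g. helpful, harmless) onto the base weights."""
    merged = {k: v.clone() for k, v in base_state.items()}
    for vec, eta in zip(vectors, coefficients):
        for k in merged:
            merged[k] += eta * vec[k]
    return merged

# Hypothetical usage with separately trained helpful / harmless checkpoints:
# v_help = preference_vector(base_sd, helpful_sd)
# v_harm = preference_vector(base_sd, harmless_sd)
# merged_sd = merge_preferences(base_sd, [v_help, v_harm], coefficients=[0.5, 1.0])
```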


LLaVA-Rad overview
a To train LLaVA-Rad, we assemble a large dataset with over 697 thousand chest X-ray image-text pairs; GPT-4 is used to synthesize reports from labels, translate reports from Spanish, and process and structure the corresponding radiology reports. b We adopt a modular three-stage approach to train LLaVA-Rad, comprised of pre-training, alignment and fine-tuning. c A qualitative visualization of the model’s attention during its generative process. d For evaluation, we also propose a novel factual error scoring approach using GPT-4 and demonstrate its parity with expert evaluation. e LLaVA-Rad outperforms much larger generalist and specialized models like GPT-4V and Med-PaLM M on prior standard report evaluation metrics. MLP multi-layer perceptron. The example chest X-ray image in b is obtained from ref. ²⁷ with permission for reproduction from the authors.
Quantitative and qualitative evaluation of LLaVA-Rad using existing report generation benchmarks on MIMIC-CXR
a Comparison between LLaVA-Rad and open-source models according to existing factual correctness (F1-CheXbert-14, F1-RadGraph) and lexical similarity (ROUGE-L) metrics. b Comparison between LLaVA-Rad and closed-source models according to existing factual correctness and lexical similarity metrics. c Comparison between model size and factual correctness shows that LLaVA-Rad is both smaller and more factually correct compared to existing approaches. d Illustration of a sample generated report from LLaVA-Rad compared with that of LLaVA and LLaVA-Med. LLaVA-Rad’s generations that match reference findings are highlighted. e Comparison of the performance on cross-modal retrieval demonstrated by LLaVA-Rad, LLaVA-Med and LLaVA. In a–e, values correspond to the mean statistic on the MIMIC-CXR test set (n = 2461 image-report pairs), with the exception of MAIRA-1 and Med-PaLM M, which are derived from their original publications. In a, b, error bars correspond to 95% bootstrap confidence intervals derived from 500 samples. Source data are provided as a Source Data file.
External validation results for LLaVA-Rad on held-out datasets
Open-I (a, b) CheXpert (c, d) and US-CXR (e, f). LLaVA-Rad outperforms baselines across all external validation datasets, as assessed by traditional factual correctness metrics (F1-CheXbert-14, F1-RadGraph) and lexical similarity (ROUGE-L). CheXprompt evaluation (b, d, f) further demonstrates that LLaVA-Rad produces fewer clinically significant and overall errors compared to baselines. Each dataset sample consists of image-report pairs (Open-I: n = 2163; CheXpert: n = 61; US-CXR: n = 1751). Values represent mean metric scores for each dataset, and error bars indicate 95% bootstrap confidence intervals derived from 500 resampling iterations. Source data are provided as a Source Data file.
Evaluating LLaVA-Rad using CheXprompt
a GPT-4 based CheXprompt is more similar to the average left-in radiologist in total error quantification, compared to the left-out radiologist (mean absolute difference 0.55 vs 0.71). b Comparison between CheXprompt and existing metrics in terms of agreement with radiologist error quantification. c Comparison between LLaVA-Rad and competing methods using CheXprompt on the MIMIC-CXR test set. d Illustration of how CheXprompt can be used to evaluate a report generated by LLaVA-Rad, with errors highlighted. GPT-4T stands for GPT-4 Turbo. In a, p values correspond to a two-sided paired t-test. In b, c, values represent mean metric scores and error bars correspond to 95% bootstrap confidence intervals. Source data are provided as a Source Data file.
Analyzing the performance of LLaVA-Rad using ablation studies and attention visualization
a Comparison of using different image encoders (BiomedCLIP-CXR from LLaVA-Rad, BiomedCLIP continually pre-trained on MIMIC-CXR, BiomedCLIP, and OpenAI CLIP) to start the alignment and fine-tuning stages. b Ablation study on only using rule-processed MIMIC-CXR training data or GPT-4 processed training data in the alignment and fine-tuning stages. c Attention visualization qualitatively demonstrates the appropriate grounding of LLaVA-Rad in specific image regions when generating a word (bold text) as part of a specific finding (bottom row). In a, b, values represent mean metric scores and error bars indicate 95% bootstrap confidence intervals derived from 500 resampling iterations. Source data are provided as a Source Data file.
A clinically accessible small multimodal radiology model and evaluation metric for chest X-ray findings

April 2025 · 54 Reads · 1 Citation

Large foundation models show promise in biomedicine but face challenges in clinical use due to performance gaps, accessibility, cost, and lack of scalable evaluation. Here we show that open-source small multimodal models can bridge these gaps in radiology by generating free-text findings from chest X-ray images. Our data-centric approach leverages 697K curated radiology image-text pairs to train a specialized, domain-adapted chest X-ray encoder. We integrate this encoder with pre-trained language models via a lightweight adapter that aligns image and text modalities. To enable robust, clinically relevant evaluation, we develop and validate CheXprompt, a GPT-4-based metric for assessing factual accuracy aligned with radiologists’ evaluations. Benchmarked with CheXprompt and other standard factuality metrics, LLaVA-Rad (7B) achieves state-of-the-art performance, outperforming much larger models like GPT-4V and Med-PaLM M (84B). While not immediately ready for real-time clinical deployment, LLaVA-Rad is a scalable, privacy-preserving and cost-effective step towards clinically adaptable multimodal AI for radiology.
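The "lightweight adapter that aligns image and text modalities" can be pictured as a small projection module in the spirit of LLaVA-style training. The sketch below is an assumption-laden illustration: dimensions and layer count are placeholders, not LLaVA-Rad's actual configuration.

```python
# Minimal sketch of a lightweight vision-to-language adapter: an MLP that maps
# image-encoder patch features into the language model's token-embedding space.
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    def __init__(self, vision_dim=768, llm_dim=4096):  # illustrative dimensions
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features):  # (batch, num_patches, vision_dim)
        # Projected image tokens are concatenated with text token embeddings
        # before being passed to the language model.
        return self.proj(patch_features)

# image_tokens = MLPProjector()(torch.randn(1, 196, 768))  # -> (1, 196, 4096)
```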


A Systematic Review and Implementation Guidelines of Multimodal Foundation Models in Medical Imaging

November 2024 · 3 Reads

Artificial Intelligence (AI) holds immense potential to transform healthcare, yet progress is often hindered by the reliance on large labeled datasets and unimodal data. Multimodal Foundation Models (FMs), particularly those leveraging Self-Supervised Learning (SSL) on multimodal data, offer a paradigm shift towards label-efficient, holistic patient modeling. However, the rapid emergence of these complex models has created a fragmented landscape. Here, we provide a systematic review of multimodal FMs for medical imaging applications. Through rigorous screening of 1,144 publications (2012–2024) and in-depth analysis of 48 studies, we establish a unified terminology and comprehensively assess the current state-of-the-art. Our review aggregates current knowledge, critically identifies key limitations and underexplored opportunities, and culminates in actionable guidelines for researchers, clinicians, developers, and policymakers. This work provides a crucial roadmap to navigate and accelerate the responsible development and clinical translation of next-generation multimodal AI in healthcare.


Figure 1: Timeline showing growth in publications on deep learning for medical imaging, based on search criteria applied to PubMed and Scopus. The figure illustrates that multimodal self-supervised learning represents a small but rapidly growing subset of medical deep learning literature. Publication counts were aggregated using keyword groups. For example, "Medical AI" combines the "Deep Learning" and "Medical Imaging" groups, while "Medical AI + Self-supervised Learning" includes the prior two groups plus the "Self-supervised Learning" group. Specific keywords for each group are detailed in the Methodology section and Appendix. Y-axis is in log scale.
Figure 2: Illustration of multimodal self-supervised learning pretraining strategies. During the pre-training stage of multimodal Foundation Models, one or more of the following self-supervised strategies are typically used: (a) Contrastive Learning forms positive pairs between matching data with shared semantic content, e.g. X-ray images and reports from the same medical examination, and minimizes the representational distance between positive samples in a common latent space; (b) Self-prediction masks out random parts of the inputs and seeks to reconstruct the masked-out regions by utilizing complementary information across the input modalities; (c) Generative SSL learns the [...]
Figure 4: PRISMA flowchart of the study selection process.
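The contrastive strategy in Figure 2(a) can be made concrete with a short sketch. The snippet below is a minimal, generic CLIP-style symmetric InfoNCE loss over matched image/report embeddings, not the exact objective of any specific model reviewed here; encoders, batch size, and embedding dimension are placeholders.

```python
# Minimal sketch of contrastive multimodal pre-training (Figure 2a):
# a symmetric InfoNCE loss where matched image/report pairs are positives.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (batch, batch) similarities
    targets = torch.arange(len(image_emb))            # positives sit on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```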
Multimodal Foundation Models for Medical Imaging - A Systematic Review and Implementation Guidelines

October 2024 · 388 Reads · 1 Citation

Advancements in artificial intelligence (AI) offer promising solutions for enhancing clinical workflows and patient care, potentially revolutionizing healthcare delivery. However, the traditional paradigm of AI integration in healthcare is limited by models that rely on single input modalities during training and require extensive labeled data, failing to capture the multimodal nature of medical practice. Multimodal foundation models, particularly Large Vision Language Models (VLMs), have the potential to overcome these limitations by processing diverse data types and learning from large-scale unlabeled datasets or natural pairs of different modalities, thereby significantly contributing to the development of more robust and versatile AI systems in healthcare. In this review, we establish a unified terminology for multimodal foundation models for medical imaging applications and provide a systematic analysis of papers published between 2012 and 2024. In total, we screened 1,144 papers from medical and AI domains and extracted data from 97 included studies. Our comprehensive effort aggregates the collective knowledge of prior work, evaluates the current state of multimodal AI in healthcare, and delineates both prevailing limitations and potential growth areas. We provide implementation guidelines and actionable recommendations for various stakeholders, including model developers, clinicians, policymakers, and dataset curators.


CheXpert Plus: Hundreds of Thousands of Aligned Radiology Texts, Images and Patients

May 2024 · 35 Reads

Since the release of the original CheXpert paper five years ago, CheXpert has become one of the most widely used and cited clinical AI datasets. The emergence of vision language models has sparked an increase in demands for sharing reports linked to CheXpert images, along with a growing interest among AI fairness researchers in obtaining demographic data. To address this, CheXpert Plus serves as a new collection of radiology data sources, made publicly available to enhance the scaling, performance, robustness, and fairness of models for all subsequent machine learning tasks in the field of radiology. CheXpert Plus is the largest text dataset publicly released in radiology, with a total of 36 million text tokens, including 13 million impression tokens. To the best of our knowledge, it represents the largest text de-identification effort in radiology, with almost 1 million PHI spans anonymized. It is only the second time that a large-scale English paired dataset has been released in radiology, thereby enabling, for the first time, cross-institution training at scale. All reports are paired with high-quality images in DICOM format, along with numerous image and patient metadata covering various clinical and socio-economic groups, as well as many pathology labels and RadGraph annotations. We hope this dataset will boost research for AI models that can further assist radiologists and help improve medical care. Data is available at the following URL: https://stanfordaimi.azurewebsites.net/datasets/5158c524-d3ab-4e02-96e9-6ee9efc110a1 Models are available at the following URL: https://github.com/Stanford-AIMI/chexpert-plus
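As a rough illustration of how the paired releases fit together, the snippet below pairs a report row with its DICOM image. File names and column names are hypothetical; consult the dataset and model pages linked above for the actual layout and schema.

```python
# Hedged sketch: pairing a CheXpert Plus report with its DICOM image.
import pandas as pd
import pydicom

reports = pd.read_csv("chexpert_plus_reports.csv")   # hypothetical file name
row = reports.iloc[0]
dicom = pydicom.dcmread(row["dicom_path"])           # hypothetical column name
pixels = dicom.pixel_array                           # image as a NumPy array
impression = row["impression"]                       # hypothetical column name
print(pixels.shape, impression[:80])
```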





INSPECT: A Multimodal Dataset for Pulmonary Embolism Diagnosis and Prognosis

December 2023 · 116 Reads · 1 Citation

Synthesizing information from various data sources plays a crucial role in the practice of modern medicine. Current applications of artificial intelligence in medicine often focus on single-modality data due to a lack of publicly available, multimodal medical datasets. To address this limitation, we introduce INSPECT, which contains de-identified longitudinal records from a large cohort of pulmonary embolism (PE) patients, along with ground truth labels for multiple outcomes. INSPECT contains data from 19,402 patients, including CT images, radiology report impression sections, and structured electronic health record (EHR) data (including demographics, diagnoses, procedures, and vitals). Using our provided dataset, we develop and release a benchmark for evaluating several baseline modeling approaches on a variety of important PE-related tasks. We evaluate image-only, EHR-only, and fused models. Trained models and the de-identified dataset are made available for non-commercial use under a data use agreement. To the best of our knowledge, INSPECT is the largest multimodal dataset for enabling reproducible research on strategies for integrating 3D medical imaging and EHR data.
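The image-only, EHR-only, and fused baselines mentioned above can be pictured with a minimal late-fusion sketch: concatenate a CT-derived embedding with structured EHR features and classify a PE-related outcome. The dimensions and architecture below are illustrative, not the benchmark's actual models.

```python
# Minimal sketch of a fused (late-fusion) baseline over imaging + EHR features.
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    def __init__(self, image_dim=512, ehr_dim=128, num_outcomes=1):  # illustrative sizes
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(image_dim + ehr_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_outcomes),
        )

    def forward(self, image_emb, ehr_feats):
        # Concatenate the two modalities and predict the outcome logit(s).
        return self.head(torch.cat([image_emb, ehr_feats], dim=-1))

# logits = LateFusionClassifier()(torch.randn(4, 512), torch.randn(4, 128))
```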


Figure 1: LOVM Motivation. Number of pre-trained VLMs released on openclip over time.
Figure 5: Analyzing Score Trends. Dependence of the average text-derived scores on the pre-training dataset and model architecture. (left) Scores quantifying inter-class similarity; (right) scores quantifying intra-class similarity. ResNet- and ConvNeXt-based models are grouped separately to evaluate their effect on the score trends.
Details on the different datasets used, including the number of classes, tasks, and domain.
Details on the different benchmarks used in the study, including the number of classes, tasks, and target domain.
Translation of OpenCLIP names to the model/pre-training dataset names used in the paper. When renaming the datasets, we tried to group models with similar optimization schemes to minimize the number of pre-training datasets without causing undue overlap.
LOVM: Language-Only Vision Model Selection

June 2023 · 38 Reads

Pre-trained multi-modal vision-language models (VLMs) are becoming increasingly popular due to their exceptional performance on downstream vision applications, particularly in the few- and zero-shot settings. However, selecting the best-performing VLM for some downstream applications is non-trivial, as it is dataset and task-dependent. Meanwhile, the exhaustive evaluation of all available VLMs on a novel application is not only time and computationally demanding but also necessitates the collection of a labeled dataset for evaluation. As the number of open-source VLM variants increases, there is a need for an efficient model selection strategy that does not require access to a curated evaluation dataset. This paper proposes a novel task and benchmark for efficiently evaluating VLMs' zero-shot performance on downstream applications without access to the downstream task dataset. Specifically, we introduce a new task, LOVM: Language-Only Vision Model Selection, where methods are expected to perform both model selection and performance prediction based solely on a text description of the desired downstream application. We then introduce an extensive LOVM benchmark consisting of ground-truth evaluations of 35 pre-trained VLMs and 23 datasets, where methods are expected to rank the pre-trained VLMs and predict their zero-shot performance.
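One way to picture a text-only score of the kind LOVM studies is sketched below: given only class names for the target task, embed prompt strings with a candidate VLM's text encoder and measure how confusable the classes are. This is a generic illustration using the open_clip package; the model/pretraining tags are examples, and the score is not the paper's exact metric.

```python
# Hedged sketch: an inter-class text-similarity score computed without any images.
import torch
import open_clip

model, _, _ = open_clip.create_model_and_transforms("ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

def interclass_similarity(class_names):
    prompts = [f"a photo of a {c}" for c in class_names]
    with torch.no_grad():
        emb = model.encode_text(tokenizer(prompts))
    emb = emb / emb.norm(dim=-1, keepdim=True)
    sim = emb @ emb.t()
    off_diag = sim[~torch.eye(len(class_names), dtype=torch.bool)]
    return off_diag.mean().item()   # higher -> classes harder to separate in text space

# score = interclass_similarity(["pneumonia", "atelectasis", "pleural effusion"])
```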


Citations (23)


... Social Bias Probing [50] introduces a large-scale dataset and perplexity-based fairness score to analyze LLMs' associations with societal categories and stereotypes. TWBias [29] focuses on biases in Traditional Chinese LLMs, incorporating chat templates to assess gender and ethnicity-related stereotypes within Taiwan's context. Similarly, BBQ (Bias Benchmark for QA) [57] provides question sets to reveal social biases against protected classes in U.S. English-speaking contexts. ...

Reference:

The Science of Evaluating Foundation Models
TWBias: A Benchmark for Assessing Social Bias in Traditional Chinese Large Language Models through a Taiwan Cultural Lens
  • Citing Conference Paper
  • January 2024

... This sophisticated dual-modality capability enables intuitive natural language interaction, allowing users to communicate with these systems through ordinary text instructions. This versatility of VLMs makes them invaluable across diverse medical applications, including disease diagnosis, interactive interpretation of medical images, generation of clinical reports, and phrase grounding (precisely locating specific anatomical or pathological features within images) (5). Additionally, VLMs excel at extracting imaging features for various downstream tasks, leveraging their broad knowledge acquired during pre-training. ...

Multimodal Foundation Models for Medical Imaging - A Systematic Review and Implementation Guidelines

... Among these methods, Task Arithmetic [27] is particularly appealing for the fact that it requires no training. Each domainor task-specific fine-tuning run is represented as a Task Vector, defined by the difference between the fine-tuned model's parameters and those of the pre-trained model; adding or subtracting these vectors transfers knowledge without further updates [6,18,20,26,43]. Unlike approaches requiring adapters [10,13,24,31] or gating modules [32,37,53], Task Arithmetic retains the original network architecture and remains cost-effective. ...

Chat Vector: A Simple Approach to Equip LLMs with Instruction Following and Model Alignment in New Languages
  • Citing Conference Paper
  • January 2024

... Disease outcomes and mortality rates vary significantly based on comorbidities, with increased risks observed in patients with cancer, chronic inflammatory disorders, or prior splenectomy.58,59 The INSPECT dataset, containing 23,248 CTPA studies from 19,402 patients, provides comprehensive longitudinal data for validating predictive models of these outcomes.20 The integration of multiple data sources reflects the natural clinical workflow in which radiologists routinely combine imaging findings with patient context for diagnosis and treatment planning. ...

INSPECT: A Multimodal Dataset for Pulmonary Embolism Diagnosis and Prognosis

... demonstrating superior performance compared to traditional methods [8][9][10][11] . Previous efforts to automate the measurement of radiographic parameters often relied on conventional image processing techniques such as thresholding, edge detection, or geometric fitting. ...

Self-supervised learning for medical image classification: a systematic review and implementation guidelines

npj Digital Medicine

... The comparison of the AI performance vs clinical experts is challenging due to the fact that the clinically preferred settings of the algorithm depend on the context. Efforts to create open-source datasets include the WILDS benchmark dataset [82], aiming to address naturally occurring distribution shifts (changes in imaging characteristics) in a diverse set of problems (e.g., in tumor identification tasks across different acquisition sites), BenchMD for variations across hospitals [83], and the DomainBed suite [84], consisting of multi-domain datasets, and focusing on assessing the generalizability of AI algorithms in real-world settings. Another great resource of publicly available datasets, along with their performance on a dataset of interest can be found in the papers with code website [85], and datasets focusing on medical imaging tasks in the GitHub repository of Adalca [86]. ...

BenchMD: A Benchmark for Modality-Agnostic Learning on Medical Images and Sensors
  • Citing Preprint
  • April 2023

... Keywords (MeSH): prostate cancer, Radiotherapy, Artificial Intelligence, Deep learning, Digital pathology, Biomarkers, Androgen deprivation therapy 9413, 9910, and 0126), with a total of 5,654 patients and a dataset of 16,204 histopathology slides. This model was shown to significantly outperform the NCCN classification with a 5-year distant metastasis AUC of 0.83 compared to 0.72 for NCCN, p < 0.001 [14]. The predictive model assesses the benefit of short-term, 4-6 months of ADT (ST-ADT) in intermediate-risk (IR) prostate cancer patients and has recently been validated. ...

Author Correction: Prostate cancer therapy personalization via multi-modal deep learning on randomized phase III clinical trials

npj Digital Medicine

... In order to implement multiple channels the MedicalNet weights had to be adjusted since they assume single channel inputs. To accomplish this adjustment, we followed a weight inflation method that duplicates weights along a desired axis, and averages them over the number of duplications (Zhang et al 2022). For example, for a 3-channel input, the weights of the first convolutional layer are duplicated 3 times along the channel axis, with those weights averaged over the 3 channels. ...

Adapting Pre-trained Vision Transformers from 2D to 3D through Weight Inflation Improves Medical Image Segmentation
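The weight-inflation step quoted above has a compact form: duplicate a single-input-channel convolution kernel along the channel axis and divide by the number of duplicates so activation magnitudes are preserved. The helper below is a generic sketch, not code from the cited paper.

```python
# Hedged sketch of weight inflation for a first conv layer that expects 1 input channel.
import torch

def inflate_channel_dim(weight, num_channels=3):
    """Duplicate along dim 1 and average: (out, 1, k, k[, k]) -> (out, num_channels, k, k[, k])."""
    repeats = [1] * weight.dim()
    repeats[1] = num_channels
    return weight.repeat(*repeats) / num_channels

# inflated = inflate_channel_dim(torch.randn(64, 1, 7, 7))   # -> (64, 3, 7, 7)
```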

... This failure is mainly part of not reproducing the actual clinical settings in the research labs. However, Federated Learning could be a blessing in disguise to enforce such clinical constraints in the research labs, especially in medical imaging [17]- [19]. Even though there are recent works in Federated Learning in the medical imaging domain, the exploration of the same is not done for the latest diffusion models. ...

Label-Efficient Self-Supervised Federated Learning for Tackling Data Heterogeneity in Medical Imaging
  • Citing Article
  • January 2023

IEEE Transactions on Medical Imaging

... Trustworthiness issues detection. Recent years have seen growing interest in the trustworthiness of ML across various areas [60], such as computer vision [58], natural language processing [9,56], and healthcare [22]. However, only few methods support detecting trustworthiness issues, most of which still rely on human validation [16,27,48]. ...

Developing medical imaging AI for emerging infectious diseases