Louis-Philippe Morency’s research while affiliated with Carnegie Mellon University and other places


Publications (459)


OpenFace 3.0: A Lightweight Multitask System for Comprehensive Facial Behavior Analysis
  • Preprint

June 2025 · 15 Reads

Jiewen Hu · Paul Pu Liang · Louis-Philippe Morency

In recent years, there has been increasing interest in automatic facial behavior analysis systems from computing communities such as vision, multimodal interaction, robotics, and affective computing. Building upon the widespread utility of prior open-source facial analysis systems, we introduce OpenFace 3.0, an open-source toolkit capable of facial landmark detection, facial action unit detection, eye-gaze estimation, and facial emotion recognition. OpenFace 3.0 contributes a lightweight unified model for facial analysis, trained with a multi-task architecture across diverse populations, head poses, lighting conditions, video resolutions, and facial analysis tasks. By leveraging the benefits of parameter sharing through a unified model and training paradigm, OpenFace 3.0 exhibits improvements in prediction performance, inference speed, and memory efficiency over similar toolkits and rivals state-of-the-art models. OpenFace 3.0 can be installed and run with a single line of code and operate in real-time without specialized hardware. OpenFace 3.0 code for training models and running the system is freely available for research purposes and supports contributions from the community.
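
The parameter sharing described above can be pictured as a single lightweight backbone feeding small per-task heads. The following is a minimal conceptual sketch of that multi-task pattern in PyTorch; it is not the actual OpenFace 3.0 code or API, and all layer sizes, head dimensions, and task output shapes are illustrative assumptions.

```python
# Conceptual sketch only (not OpenFace 3.0 code): one shared backbone with
# lightweight heads for landmarks, action units, gaze, and emotion.
import torch
import torch.nn as nn

class MultiTaskFaceModel(nn.Module):
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # Shared feature extractor (stand-in for a lightweight CNN/ViT backbone).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        # One small head per task; all heads share the backbone parameters.
        self.landmarks = nn.Linear(feat_dim, 68 * 2)   # 68 (x, y) landmark points (assumed)
        self.action_units = nn.Linear(feat_dim, 12)    # AU presence logits (assumed count)
        self.gaze = nn.Linear(feat_dim, 2)             # yaw / pitch of eye gaze
        self.emotion = nn.Linear(feat_dim, 8)          # categorical emotion logits (assumed count)

    def forward(self, images: torch.Tensor) -> dict:
        feats = self.backbone(images)
        return {
            "landmarks": self.landmarks(feats),
            "action_units": self.action_units(feats),
            "gaze": self.gaze(feats),
            "emotion": self.emotion(feats),
        }

model = MultiTaskFaceModel()
out = model(torch.randn(4, 3, 112, 112))   # batch of 4 face crops
print({k: v.shape for k, v in out.items()})
```

Because the heads share one backbone, a single forward pass serves all four tasks, which is where the speed and memory gains of a unified model come from.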


Aligning Dialogue Agents with Global Feedback via Large Language Model Reward Decomposition

May 2025

We propose a large language model based reward decomposition framework for aligning dialogue agents using only a single session-level feedback signal. We leverage the reasoning capabilities of a frozen, pretrained large language model (LLM) to infer fine-grained local implicit rewards by decomposing global, session-level feedback. Our first text-only variant prompts the LLM to perform reward decomposition using only the dialogue transcript. The second multimodal variant incorporates additional behavioral cues, such as pitch, gaze, and facial affect, expressed as natural language descriptions. These inferred turn-level rewards are distilled into a lightweight reward model, which we utilize for RL-based fine-tuning for dialogue generation. We evaluate both text-only and multimodal variants against state-of-the-art reward decomposition methods and demonstrate notable improvements in human evaluations of conversation quality, suggesting that LLMs are strong reward decomposers that obviate the need for manual reward shaping and granular human feedback.
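
A minimal sketch of the decomposition-then-distillation recipe described above, assuming a hypothetical `query_llm` helper that stands in for any frozen, pretrained LLM; the prompt wording, reward scale, and JSON output format are assumptions, not the paper's exact setup.

```python
# Sketch of LLM-based reward decomposition; `query_llm` is a hypothetical helper.
import json

def decompose_session_feedback(transcript: list[str], session_score: float, query_llm) -> list[float]:
    prompt = (
        "The following dialogue received an overall quality score of "
        f"{session_score} (scale 0-1). Assign each agent turn a reward in [0, 1] "
        "reflecting its contribution, and answer as a JSON list of numbers.\n\n"
        + "\n".join(transcript)
    )
    # One inferred reward per agent turn, decomposed from the global feedback.
    return json.loads(query_llm(prompt))

# The inferred turn-level rewards would then be distilled into a lightweight
# reward model (e.g., a small regressor over turn embeddings) and used as the
# reward signal for RL-based fine-tuning of the dialogue policy.
```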



Neural encoding of real world face perception
  • Preprint
  • File available

May 2025 · 32 Reads

Figure preview (Fig. S3): Neural prediction errors for facial expression increased with the intensity of expressions in the neuro-perceptual spaces that incorporated facial motion during fixations.

Social perception unfolds as we freely interact with people around us. We investigated the neural basis of real-world face perception using multi-electrode intracranial recordings in humans during spontaneous interactions with friends, family, and others. Computational models reconstructed the faces participants looked at during natural interactions, including facial expressions and motion, from brain activity alone. The results highlighted a critical role for the social vision pathway, a network of areas spanning parietal, temporal, and occipital cortex. This network was more sharply tuned to subtle expressions than to intense expressions, which was confirmed with controlled psychophysical experiments. These findings reveal that the human social vision pathway encodes facial expressions and motion as deviations from a neutral expression prototype during natural social interactions in real life.
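
As a rough illustration of the kind of reconstruction analysis described above (not the study's actual pipeline), the sketch below fits a linear decoder from simulated multi-electrode activity to face-feature vectors and measures per-fixation prediction error; the dimensions, the ridge-regression choice, and the error definition are assumptions.

```python
# Illustrative linear decoding sketch on simulated data.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
neural = rng.standard_normal((500, 64))      # 500 fixations x 64 electrodes (simulated)
face_feats = rng.standard_normal((500, 10))  # 10-dim face/expression features per fixation (simulated)

X_tr, X_te, y_tr, y_te = train_test_split(neural, face_feats, test_size=0.2, random_state=0)
decoder = Ridge(alpha=1.0).fit(X_tr, y_tr)
pred = decoder.predict(X_te)

# "Prediction error" here is the distance between reconstructed and true features,
# loosely analogous to the deviation-from-prototype errors discussed in the paper.
errors = np.linalg.norm(pred - y_te, axis=1)
print(errors.mean())
```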


The Human Robot Social Interaction (HSRI) Dataset: Benchmarking Foundational Models' Social Reasoning

April 2025 · 8 Reads

Our work aims to advance the social reasoning of embodied artificial intelligence (AI) agents in real-world social interactions. Recently, language models (LMs) and foundational models (FMs) have been used as automatic evaluators of human-AI interactions, with the goal of eventually being used to improve the policy of the AI agent. To enable further research in this direction, we introduce a large-scale real-world Human Robot Social Interaction (HSRI) Dataset to benchmark the capabilities of LMs and FMs to identify and reason about social interactions, specifically with regard to robot social errors and competencies. Our dataset consists of 400 real-world human social robot interaction videos and over 10K annotations, detailing the robot's social errors, competencies, rationale, and corrective actions, capturing unique aspects of human-AI interaction only present in real-world interactions. To further assess AI models' ability to reason about social interactions, we propose eight new benchmark tasks centered around whether AI models can (1) evaluate social interactions via detecting social errors and competencies, (2) identify the explanatory factors associated with errors and competencies, (3) understand the flow of real-world social interactions, and (4) provide reasons and corrective actions for social errors. Human studies and experiments with modern LMs and FMs reveal that current models struggle with these tasks, demonstrating that our dataset and benchmark provide a step forward toward socially intelligent AI.
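
As a concrete illustration of benchmark task (1), social error detection, the sketch below scores a model's per-video predictions against human annotations; the binary labels and the choice of precision/recall/F1 are illustrative assumptions rather than the dataset's official evaluation protocol.

```python
# Hedged sketch: scoring social-error detection against human annotations.
from sklearn.metrics import precision_recall_fscore_support

human_labels = [1, 0, 1, 1, 0, 0, 1]   # 1 = robot made a social error in this clip (toy labels)
model_preds  = [1, 0, 0, 1, 0, 1, 1]   # predictions from an LM/FM evaluator (toy outputs)

precision, recall, f1, _ = precision_recall_fscore_support(
    human_labels, model_preds, average="binary"
)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```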



Social Genome: Grounded Social Reasoning Abilities of Multimodal Models

February 2025 · 8 Reads

Social reasoning abilities are crucial for AI systems to effectively interpret and respond to multimodal human communication and interaction within social contexts. We introduce Social Genome, the first benchmark for fine-grained, grounded social reasoning abilities of multimodal models. Social Genome contains 272 videos of interactions and 1,486 human-annotated reasoning traces related to inferences about these interactions. These traces contain 5,777 reasoning steps that reference evidence from visual cues, verbal cues, vocal cues, and external knowledge (contextual knowledge external to videos). Social Genome is also the first modeling challenge to study external knowledge in social reasoning. Social Genome computes metrics to holistically evaluate semantic and structural qualities of model-generated social reasoning traces. We demonstrate the utility of Social Genome through experiments with state-of-the-art models, identifying performance gaps and opportunities for future research to improve the grounded social reasoning abilities of multimodal models.
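
One way to picture a structural metric over reasoning traces is to check how well a model-generated trace covers the evidence modalities used by human annotators. The sketch below is a hypothetical example of such a coverage measure; the tag names and formula are assumptions, not Social Genome's actual metrics.

```python
# Hypothetical structural metric: modality coverage of a model-generated trace.
MODALITIES = {"visual", "verbal", "vocal", "external_knowledge"}

def modality_coverage(human_steps: list[str], model_steps: list[str]) -> float:
    # Which modalities each trace grounds its reasoning steps in.
    human_used = {m for m in MODALITIES if any(m in s for s in human_steps)}
    model_used = {m for m in MODALITIES if any(m in s for s in model_steps)}
    # Fraction of human-used modalities that the model trace also references.
    return len(human_used & model_used) / max(len(human_used), 1)

human = ["visual: she averts her gaze", "vocal: flat tone", "external_knowledge: sarcasm norm"]
model = ["visual: she looks away", "verbal: says 'great job'"]
print(modality_coverage(human, model))   # 1 of 3 human-used modalities covered
```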


AV-Flow: Transforming Text to Audio-Visual Human-like Interactions

February 2025 · 3 Reads

We introduce AV-Flow, an audio-visual generative model that animates photo-realistic 4D talking avatars given only text input. In contrast to prior work that assumes an existing speech signal, we synthesize speech and vision jointly. We demonstrate human-like speech synthesis, synchronized lip motion, lively facial expressions, and head pose, all generated from just text characters. The core premise of our approach lies in the architecture of our two parallel diffusion transformers. Intermediate highway connections ensure communication between the audio and visual modalities, and thus synchronized speech intonation and facial dynamics (e.g., eyebrow motion). Our model is trained with flow matching, leading to expressive results and fast inference. For dyadic conversations, AV-Flow produces an always-on avatar that actively listens and reacts to the audio-visual input of a user. Through extensive experiments, we show that our method outperforms prior work, synthesizing natural-looking 4D talking avatars. Project page: https://aggelinacha.github.io/AV-Flow/
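
The two-tower design with intermediate connections can be sketched as follows; this is a conceptual illustration only, not the AV-Flow architecture, and the layer counts, dimensions, and linear "highway" exchange are assumptions (the real model is a pair of diffusion transformers trained with flow matching).

```python
# Conceptual sketch: two parallel transformer towers with cross-stream exchange.
import torch
import torch.nn as nn

class ParallelTowers(nn.Module):
    def __init__(self, dim: int = 128, depth: int = 4, heads: int = 4):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.audio_layers = nn.ModuleList(layer() for _ in range(depth))
        self.visual_layers = nn.ModuleList(layer() for _ in range(depth))
        # Simple "highway"-style exchange: project each stream into the other.
        self.a2v = nn.ModuleList(nn.Linear(dim, dim) for _ in range(depth))
        self.v2a = nn.ModuleList(nn.Linear(dim, dim) for _ in range(depth))

    def forward(self, audio, visual):
        for la, lv, a2v, v2a in zip(self.audio_layers, self.visual_layers, self.a2v, self.v2a):
            audio, visual = la(audio), lv(visual)
            # Exchange information so the two streams stay synchronized.
            audio, visual = audio + v2a(visual), visual + a2v(audio)
        return audio, visual

a, v = torch.randn(2, 50, 128), torch.randn(2, 50, 128)  # toy text-conditioned token streams
print([t.shape for t in ParallelTowers()(a, v)])
```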


Self-Correcting Decoding with Generative Feedback for Mitigating Hallucinations in Large Vision-Language Models

February 2025 · 3 Reads · 1 Citation

While recent Large Vision-Language Models (LVLMs) have shown remarkable performance in multi-modal tasks, they are prone to generating hallucinatory text responses that do not align with the given visual input, which restricts their practical applicability in real-world scenarios. In this work, inspired by the observation that the text-to-image generation process is the inverse of image-conditioned response generation in LVLMs, we explore the potential of leveraging text-to-image generative models to assist in mitigating hallucinations in LVLMs. We discover that generative models can offer valuable self-feedback for mitigating hallucinations at both the response and token levels. Building on this insight, we introduce self-correcting Decoding with Generative Feedback (DeGF), a novel training-free algorithm that incorporates feedback from text-to-image generative models into the decoding process to effectively mitigate hallucinations in LVLMs. Specifically, DeGF generates an image from the initial response produced by LVLMs, which acts as an auxiliary visual reference and provides self-feedback to verify and correct the initial response through complementary or contrastive decoding. Extensive experimental results validate the effectiveness of our approach in mitigating diverse types of hallucinations, consistently surpassing state-of-the-art methods across six benchmarks. Code is available at https://github.com/zhangce01/DeGF.
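
The contrastive branch of the decoding step can be sketched as a simple logit adjustment between two visual conditionings (the original image and an image generated from the initial response); the weighting `alpha` and the role assignment of the two distributions below are illustrative assumptions, not DeGF's exact rule.

```python
# Sketch of generic contrastive-decoding arithmetic between two conditionings.
import torch

def contrastive_step(logits_keep: torch.Tensor, logits_contrast: torch.Tensor,
                     alpha: float = 1.0) -> int:
    # Boost tokens supported by the `keep` conditioning relative to the
    # `contrast` conditioning, then pick the highest-scoring token.
    log_p_keep = torch.log_softmax(logits_keep, dim=-1)
    log_p_contrast = torch.log_softmax(logits_contrast, dim=-1)
    adjusted = (1 + alpha) * log_p_keep - alpha * log_p_contrast
    return int(adjusted.argmax())

# Toy vocabularies of 32,000 tokens; in practice the two logit vectors come from
# the LVLM conditioned on the original vs. the generated feedback image.
print(contrastive_step(torch.randn(32000), torch.randn(32000)))
```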


LlaMADRS: Prompting Large Language Models for Interview-Based Depression Assessment

January 2025 · 7 Reads

This study introduces LlaMADRS, a novel framework leveraging open-source Large Language Models (LLMs) to automate depression severity assessment using the Montgomery-Åsberg Depression Rating Scale (MADRS). We employ a zero-shot prompting strategy with carefully designed cues to guide the model in interpreting and scoring transcribed clinical interviews. Our approach, tested on 236 real-world interviews from the Context-Adaptive Multimodal Informatics (CAMI) dataset, demonstrates strong correlations with clinician assessments. The Qwen 2.5-72B model achieves near-human-level agreement across most MADRS items, with intraclass correlation coefficients (ICCs) closely approaching those between human raters. We provide a comprehensive analysis of model performance across different MADRS items, highlighting strengths and current limitations. Our findings suggest that LLMs, with appropriate prompting, can serve as efficient tools for mental health assessment, potentially increasing accessibility in resource-limited settings. However, challenges remain, particularly in assessing symptoms that rely on non-verbal cues, underscoring the need for multimodal approaches in future work.
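
For reference, agreement of the kind reported above can be quantified with an intraclass correlation such as ICC(2,1) (two-way random effects, absolute agreement, single rater). The sketch below implements that standard formula on toy ratings; the toy scores are fabricated for illustration, and the specific ICC variant is an assumption rather than a statement of the paper's setup.

```python
# ICC(2,1) on a (subjects x raters) matrix, following Shrout & Fleiss.
import numpy as np

def icc_2_1(scores: np.ndarray) -> float:
    """scores: (n_subjects, n_raters) matrix of ratings."""
    n, k = scores.shape
    grand = scores.mean()
    ms_rows = k * np.sum((scores.mean(axis=1) - grand) ** 2) / (n - 1)
    ms_cols = n * np.sum((scores.mean(axis=0) - grand) ** 2) / (k - 1)
    ss_err = (np.sum((scores - grand) ** 2)
              - k * np.sum((scores.mean(axis=1) - grand) ** 2)
              - n * np.sum((scores.mean(axis=0) - grand) ** 2))
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

# Columns: clinician score, model score for one MADRS item across 6 toy interviews.
ratings = np.array([[4, 4], [2, 3], [5, 5], [1, 1], [3, 2], [4, 4]], dtype=float)
print(round(icc_2_1(ratings), 3))
```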


Citations (50)


... Recently, a series of training-free approaches have directly reduced language priors using contrastive decoding [27; 45], achieving remarkable performance. These methods construct an alternative logit distribution on top of the original one through techniques such as masking the image [21], perturbing the instruction [22], augmenting the vision input [23], or performing cross-modal conversion [26]. During decoding, the two logit distributions are contrasted to eliminate language priors. ...

Reference:

Mitigate Language Priors in Large Vision-Language Models by Cross-Images Contrastive Decoding
Self-Correcting Decoding with Generative Feedback for Mitigating Hallucinations in Large Vision-Language Models

... Yu et al. [106] further proposes to approximate interaction types via prediction similarity between unimodal and multimodal models. Building on this insight, we generalize the approach by introducing a dataset-level interaction score, enabling principled comparisons and groupings of multimodal tasks based on their interaction characteristics. ...

MMoE: Enhancing Multimodal Models with Mixtures of Multimodal Interaction Experts
  • Citing Conference Paper
  • January 2024
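
A minimal sketch of the dataset-level interaction score idea quoted above: compare an additive combination of unimodal predictions with the multimodal model's predictions and average the disagreement over a dataset. The agreement measure and sign convention below are assumptions for illustration.

```python
# Illustrative dataset-level interaction score from prediction (dis)similarity.
import numpy as np

def interaction_score(unimodal_probs: list[np.ndarray], multimodal_probs: np.ndarray) -> float:
    # Additive unimodal baseline: average the unimodal predictive distributions.
    additive = np.mean(unimodal_probs, axis=0)
    # Per-example overlap between the additive baseline and the multimodal model;
    # low overlap suggests the task relies on non-additive cross-modal interactions.
    agreement = np.mean(np.sum(np.minimum(additive, multimodal_probs), axis=1))
    return 1.0 - agreement   # higher = more interaction-dependent (assumed convention)

uni = [np.array([[0.7, 0.3], [0.4, 0.6]]), np.array([[0.6, 0.4], [0.5, 0.5]])]
multi = np.array([[0.2, 0.8], [0.9, 0.1]])
print(round(interaction_score(uni, multi), 3))
```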

... Understanding communicative signals from faces is a critical ability driving human face-to-face social interactions [29]. A slight shift in a person's eye gaze or tilt of the head, for example, are subtle facial behaviors that can substantially influence the social meaning being conveyed and interpreted during interactions [48]. Among the computer vision community, there has been a growing interest in designing systems that can use fine-grained facial behaviors to interpret human affective, cognitive, and social signals [6], [2]. ...

Advancing Social Intelligence in AI Agents: Technical Challenges and Open Questions
  • Citing Conference Paper
  • January 2024

... Theory of Mind (ToM), the ability to understand the mental states of oneself and others, remains a challenging area for large language models (LLMs), which often fail to predict human mental states accurately. In this paper, we introduce UniToMBench, a unified benchmark that integrates the strengths of SimToM (Wilf et al., 2023) and TOMBENCH (Chen et al., 2024) to systematically improve and assess ToM capabilities in LLMs by integrating multi-interaction task designs and evolving story scenarios. Supported by a custom dataset of over 1,000 hand-written scenarios, UniToMBench combines perspective-taking techniques with diverse evaluation metrics to better stimulate social cognition in LLMs. ...

Think Twice: Perspective-Taking Improves Large Language Models’ Theory-of-Mind Capabilities
  • Citing Conference Paper
  • January 2024

... Presently, the community lacks a large-scale, open-source benchmark specifically architected for the rigorous training and comprehensive evaluation of these ambitious generalist models. Most benchmarks focus on narrow, very specific domains and tasks, and are typically closed-source (White et al., 2024; Hendrycks et al., 2021b; Liang et al., 2021; Gulcehre et al., 2021; Fu et al., 2021). This critical gap motivates our work. ...

MultiBench: Multiscale Benchmarks for Multimodal Representation Learning
  • Citing Article
  • December 2021

Advances in Neural Information Processing Systems

... Recent advances in language and multimodal reasoning [22] have enabled significant progress in step-by-step problem-solving [40,44], transparent reasoning [10,24], and enhanced human-AI collaboration [42,6]. However, existing reasoning benchmarks largely focus on narrow domains with fully specified tasks, such as math [23] or coding [18]. ...

Foundations & Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions
  • Citing Article
  • April 2024

ACM Computing Surveys

... Future work could explore its application in online continual learning, as well as in class-imbalanced or few-shot scenarios. While this study focuses on classification tasks, the HiProIBM learning paradigm could be extended to other domains, such as object detection [87], segmentation [88], and multi-modal learning [89,90]. ...

Continual Learning for Personalized Co-Speech Gesture Generation
  • Citing Conference Paper
  • October 2023

... To assess the generalizability of our framework, we randomly select 10 lecture videos from the Multimodal Lecture Presentations Dataset [39], covering subjects such as psychology, machine learning, dentistry, and biology, and we also include full-episode news broadcasts from NBC News. We then conduct an A/B study comparing our method against TeaserGen [11], asking participants which teaser they would prefer for each lecture video or news broadcast. ...

Lecture Presentations Multimodal Dataset: Towards Understanding Multimodality in Educational Videos
  • Citing Conference Paper
  • October 2023

... Although comparatively underexplored, LLMs have also shown impressive performance in causal effect estimation. These works can be mainly categorized into two branches: (1) Causal effect in data: LLMs estimate causal effects within data (Lin et al., 2023; Kıcıman et al., 2023) by leveraging their reasoning capabilities and large-scale training data. CLADDER (Jin et al., 2023a) benchmarks LLMs for causal effect estimation tasks (e.g., ATE in Rung 2, and ATT, NDE, NIE in Rung 3). ...

Text-Transport: Toward Learning Causal Effects of Natural Language
  • Citing Conference Paper
  • January 2023

... For nonstationary settings, recent works explore forecasting value functions (Chandak et al., 2020), perturbations (Yu et al., 2020a), and models (Lee et al., 2023a), showing promising empirical/theoretical results. Another direction is multimodal approach (see (Liang & Morency, 2023) for a recent survey), as certain modes (e.g., language) may exhibit more stationarity or support large-scale diversified data for better forecasting. ...

Tutorial on Multimodal Machine Learning: Principles, Challenges, and Open Questions
  • Citing Conference Paper
  • October 2023