Jiebo Luo’s research while affiliated with University of Rochester and other places


Publications (151)


Aligning, Autoencoding and Prompting Large Language Models for Novel Disease Reporting
  • Article

January 2025 · 11 Reads · 1 Citation

IEEE Transactions on Pattern Analysis and Machine Intelligence

Fenglin Liu · [...]

Given radiology images, automatic radiology report generation aims to produce informative text that reports diseases. It can benefit current clinical practice in diagnostic radiology. Existing methods typically rely on large-scale medical datasets annotated by clinicians to train desirable models. However, for novel diseases, sufficient training data are typically not available. We propose a prompt-based deep learning framework, PromptLLM, to align, autoencode, and prompt the (large) language model to generate reports for novel diseases accurately and efficiently. Our method includes three major steps: (1) aligning visual images and textual reports to learn general cross-modal knowledge from diseases where labeled data are sufficient; (2) autoencoding the LLM on unlabeled data of novel diseases to learn the specific knowledge and writing styles of the novel disease; and (3) prompting the LLM with the learned knowledge and writing styles to report the novel diseases contained in the radiology images. Through these three steps, with limited labels on novel diseases, PromptLLM can rapidly acquire the knowledge needed for accurate novel disease reporting. Experiments on COVID-19 and diverse thorax diseases show that our approach, using only 1% of the training data, achieves performance competitive with previous methods. This demonstrates that our approach relaxes the reliance on labeled data common to existing methods and could have real-world impact on data analysis during the early stages of novel diseases. Our code and data are available at https://github.com/ai-in-health/PromptLLM.
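
The three-step recipe in the abstract (align, then autoencode, then prompt) can be made concrete with a minimal sketch. Everything below is a hypothetical skeleton written for this page, not the authors' code from the linked repository; the function names, data structures, and toy feature strings are all assumptions.

```python
# Hypothetical skeleton of the three-stage PromptLLM schedule described in
# the abstract; not the authors' implementation (see the linked repository).
from dataclasses import dataclass, field

@dataclass
class PromptLLMState:
    """Knowledge accumulated across the three stages."""
    cross_modal: dict = field(default_factory=dict)      # stage 1: image -> report
    style_examples: list = field(default_factory=list)   # stage 2: novel-disease text

def align(state, labeled_pairs):
    """Stage 1: align image features with reports on data-rich diseases.
    The real model learns a cross-modal embedding; this stub just records
    which findings co-occur with which report text."""
    for image_feature, report in labeled_pairs:
        state.cross_modal[image_feature] = report

def autoencode(state, unlabeled_reports):
    """Stage 2: reconstruct unlabeled novel-disease reports so the LLM
    absorbs their terminology and writing style (stubbed as storage)."""
    state.style_examples.extend(unlabeled_reports)

def build_prompt(state, image_feature):
    """Stage 3: inject the learned knowledge and style into the prompt the
    LLM sees when reporting a novel disease."""
    nearest = state.cross_modal.get(image_feature, "no close match")
    style = " | ".join(state.style_examples[:2])
    return (f"Related findings: {nearest}\n"
            f"Write in the style of: {style}\n"
            "Generate a radiology report for the given image.")

state = PromptLLMState()
align(state, [("opacity_left_lower_lobe",
               "Consolidation in the left lower lobe.")])
autoencode(state, ["Ground-glass opacities consistent with viral pneumonia."])
print(build_prompt(state, "opacity_left_lower_lobe"))
```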




Figure 3: Overall framework of DanceCamAnimator. In the Camera Keyframe Detection stage, the model uses the music-dance context and the temporal keyframe history to generate subsequent temporal keyframe tags. Next, for each pair of adjacent keyframes, the Camera Keyframe Synthesis stage takes the music-dance context and camera history as input to synthesize camera keyframe motions. Given the camera keyframe motions, camera history, and music-dance context, the final stage predicts tween function values to calculate the in-between, non-keyframe camera movements. Encoders with the same name share structures across stages but are trained separately. Stages 2 and 3 are trained together and run alternately during inference.
Figure 4: Visualization comparison. We render the ground truth and the results generated by our method and the baselines given a 2-second music-dance condition. Compared to the baselines, our DanceCamAnimator synthesizes dance camera movements with more shot changes within a short period of time. The comparison also shows that the filtering used in the baseline DanceCamera3D is unstable and risks erroneous smoothing, causing the character to deviate from the center of the camera view, which validates the value of our post-processing-free design.
DanceCamAnimator: Keyframe-Based Controllable 3D Dance Camera Synthesis
  • Preprint
  • File available

September 2024 · 22 Reads

Synthesizing camera movements from music and dance is highly challenging due to the contradicting requirements and complexities of dance cinematography. Unlike human movements, which are always continuous, dance camera movements comprise both continuous sequences of variable length and sudden drastic changes that simulate switching between multiple cameras. In previous works, however, every camera frame is treated equally, which causes jittering and makes smoothing in post-processing unavoidable. To solve these problems, we propose to integrate animators' dance cinematography knowledge by formulating this task as a three-stage process: keyframe detection, keyframe synthesis, and tween function prediction. Following this formulation, we design a novel end-to-end dance camera synthesis framework, DanceCamAnimator, which imitates human animation procedures and offers powerful keyframe-based controllability with variable lengths. Extensive experiments on the DCM dataset demonstrate that our method surpasses previous baselines both quantitatively and qualitatively. Code will be available at https://github.com/Carmenw1203/DanceCamAnimator-Official.
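
A minimal sketch of the three-stage inference loop (keyframe detection, keyframe synthesis, tween prediction) may make the formulation concrete. The stage models are stubbed with toy functions and a 1-D camera pose; all names and the fixed smoothstep tween are assumptions for illustration, not the paper's networks.

```python
# Toy sketch of DanceCamAnimator's three-stage inference (see Figure 3);
# the three models are stubs and the camera pose is a single float.
# Stages 2 and 3 alternate at inference time, as the caption describes.

def detect_keyframes(context):
    """Stage 1 (stub): tag which frame indices are camera keyframes."""
    return [0, 4, 8]

def synthesize_keyframe(context, history):
    """Stage 2 (stub): produce a camera pose at a keyframe."""
    return float(len(history))  # placeholder pose value

def predict_tween(t):
    """Stage 3 (stub): tween-function value in [0, 1] for normalized time t;
    the real model predicts this per frame, here it is a fixed smoothstep."""
    return 3 * t**2 - 2 * t**3

def animate(context):
    keys = detect_keyframes(context)
    poses, history = {}, []
    for k in keys:
        poses[k] = synthesize_keyframe(context, history)
        history.append(poses[k])
    frames = []
    for a, b in zip(keys, keys[1:]):
        for f in range(a, b):
            w = predict_tween((f - a) / (b - a))
            # in-between, non-keyframe cameras come from tween values
            frames.append((1 - w) * poses[a] + w * poses[b])
    frames.append(poses[keys[-1]])
    return frames

print(animate("music+dance features"))
```

Because non-keyframe cameras are derived from keyframes plus tween values rather than predicted frame by frame, no smoothing filter is needed afterwards, which is the controllability argument the abstract makes.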


MtArtGPT: A Multi-task Art Generation System with Pre-Trained Transformer

August 2024 · 25 Reads · 3 Citations

IEEE Transactions on Circuits and Systems for Video Technology

Instruction-tuned large language models are making rapid advances in the field of artificial intelligence, and GPT-4-class models have exhibited impressive multi-modal perception capabilities. Such models have been used as the core assistant for many tasks, including art generation. However, high-quality art generation relies heavily on human prompt engineering, which is generally uncontrollable. To address this issue, we propose a multi-task AI-generated content (AIGC) system for art generation. Specifically, a dense representation manager is designed to process multi-modal user queries and generate dense, applicable prompts for GPT. To enhance the artistic sophistication of the whole system, we fine-tune the GPT model on a meticulously collected prompt-art dataset. Furthermore, we introduce artistic benchmarks for evaluating the system based on professional knowledge. Experiments demonstrate the advantages of our proposed MtArtGPT system.
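
The dense representation manager is described only at a high level, so the sketch below is an invented illustration of the idea: expand a sparse multi-modal query into a dense, structured prompt before it reaches the fine-tuned GPT backbone. The field names and defaults are assumptions, not the paper's components.

```python
# Illustrative-only sketch of a dense representation manager in the spirit
# of MtArtGPT: merge a sparse user query, optional image-derived tags, and
# stylistic defaults into one dense art-generation prompt.

ART_DEFAULTS = {
    "style": "impressionist",
    "lighting": "soft golden-hour light",
    "composition": "rule of thirds",
}

def densify(user_query, image_tags=None):
    """Return a dense prompt string suitable for an art-generation backbone."""
    parts = [user_query.strip()]
    if image_tags:
        parts.append("visual references: " + ", ".join(image_tags))
    parts.extend(f"{k}: {v}" for k, v in ART_DEFAULTS.items())
    return "; ".join(parts)

dense_prompt = densify("a harbor at dawn", image_tags=["fog", "fishing boats"])
print(dense_prompt)
# the dense prompt would then be passed to the fine-tuned GPT model
```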


Fig. 1: The organization of this survey, with the parts marked in red indicating content that is new compared to existing reviews.
Fig. 7: The Modified Splatting Strategy [190].
3D Gaussian Splatting: Survey, Technologies, Challenges, and Opportunities

July 2024 · 198 Reads

3D Gaussian Splatting (3DGS) has emerged as a prominent technique with the potential to become a mainstream method for 3D representation. It can effectively transform multi-view images into explicit 3D Gaussian representations through efficient training and achieve real-time rendering of novel views. This survey aims to analyze existing 3DGS-related works from multiple intersecting perspectives, including related tasks, technologies, challenges, and opportunities. The primary objective is to provide newcomers with a rapid understanding of the field and to assist researchers in methodically organizing existing technologies and challenges. Specifically, we delve into the optimization, application, and extension of 3DGS, categorizing works based on their focus or motivation. Additionally, we summarize and classify nine types of technical modules and corresponding improvements identified in existing works. Based on these analyses, we further examine the common challenges and technologies across various tasks and propose potential research opportunities.
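
The abstract does not reproduce any formulas, but the rendering step it refers to is the standard 3DGS alpha-blending composite, shown below as commonly formulated in the 3DGS literature (not an equation taken from this survey):

```latex
% Pixel color C is composited front-to-back from the N depth-sorted
% Gaussians overlapping pixel x:
C = \sum_{i \in N} c_i \, \alpha_i \prod_{j=1}^{i-1} \bigl(1 - \alpha_j\bigr),
\qquad
\alpha_i = o_i \, \exp\!\Bigl(-\tfrac{1}{2}
  (\mathbf{x} - \boldsymbol{\mu}_i)^{\top} \Sigma_i^{-1}
  (\mathbf{x} - \boldsymbol{\mu}_i)\Bigr)
% c_i: view-dependent color, o_i: learned opacity, \mu_i and \Sigma_i:
% the Gaussian's projected 2D mean and covariance.
```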



ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation

June 2024 · 27 Reads

We propose a novel text-to-video (T2V) generation benchmark, ChronoMagic-Bench, to evaluate the temporal and metamorphic capabilities of T2V models (e.g., Sora and Lumiere) in time-lapse video generation. In contrast to existing benchmarks that focus on the visual quality and textual relevance of generated videos, ChronoMagic-Bench focuses on a model's ability to generate time-lapse videos with significant metamorphic amplitude and temporal coherence. The benchmark probes T2V models for their physics, biology, and chemistry capabilities through free-form text queries. For these purposes, ChronoMagic-Bench introduces 1,649 prompts with real-world videos as references, categorized into four major types of time-lapse video: biological, human-created, meteorological, and physical phenomena, which are further divided into 75 subcategories. This categorization comprehensively evaluates a model's capacity to handle diverse and complex transformations. To accurately align the benchmark with human preference, we introduce two new automatic metrics, MTScore and CHScore, to evaluate the videos' metamorphic attributes and temporal coherence. MTScore measures the metamorphic amplitude, reflecting the degree of change over time, while CHScore assesses the temporal coherence, ensuring that generated videos maintain logical progression and continuity. Based on ChronoMagic-Bench, we conduct comprehensive manual evaluations of ten representative T2V models, revealing their strengths and weaknesses across different categories of prompts and providing a thorough evaluation framework that addresses current gaps in video generation research. Moreover, we create the large-scale ChronoMagic-Pro dataset, containing 460k high-quality pairs of 720p time-lapse videos and detailed captions, ensuring high physical pertinence and large metamorphic amplitude.
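
The abstract names MTScore and CHScore but does not define them, so the snippet below is deliberately not the paper's metric; it only illustrates the kind of quantity a temporal-coherence score measures, using mean cosine similarity between consecutive frame embeddings as a stand-in.

```python
# Illustrative stand-in for a temporal-coherence style score: average cosine
# similarity between embeddings of consecutive video frames. NOT the paper's
# CHScore, whose definition the abstract does not specify.

import numpy as np

def coherence_score(frame_embeddings: np.ndarray) -> float:
    """frame_embeddings: (T, D) array, one embedding per video frame."""
    a = frame_embeddings[:-1]
    b = frame_embeddings[1:]
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8
    )
    return float(cos.mean())

rng = np.random.default_rng(0)
# a smoothly drifting video scores high; a jumpy one scores near zero
smooth = np.cumsum(rng.normal(scale=0.01, size=(16, 32)), axis=0) + 1.0
jumpy = rng.normal(size=(16, 32))
print(coherence_score(smooth), ">", coherence_score(jumpy))
```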



Chain-of-Thought Prompting for Demographic Inference with Large Multimodal Models

May 2024 · 13 Reads

Conventional demographic inference methods have predominantly operated under the supervision of accurately labeled data, yet struggle to adapt to shifting social landscapes and diverse cultural contexts, leading to narrow specialization and limited accuracy in applications. Recently, the emergence of large multimodal models (LMMs) has shown transformative potential across various research tasks, such as visual comprehension and description. In this study, we explore the application of LMMs to demographic inference and introduce a benchmark for both quantitative and qualitative evaluation. Our findings indicate that LMMs possess advantages in zero-shot learning, interpretability, and handling uncurated 'in-the-wild' inputs, albeit with a propensity for off-target predictions. To enhance LMM performance and achieve comparability with supervised learning baselines, we propose a Chain-of-Thought augmented prompting approach, which effectively mitigates the off-target prediction issue.
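
Below is a hedged sketch of what a Chain-of-Thought augmented prompt for demographic inference might look like; the cue list, label set, and wording are invented for illustration and are not the prompts used in the paper.

```python
# Hypothetical Chain-of-Thought augmented prompt in the spirit of the
# abstract: reason step by step over visible cues, then force the final
# answer into a fixed label set to curb off-target predictions.

AGE_BUCKETS = ["13-17", "18-24", "25-34", "35-49", "50+"]

def build_cot_prompt(profile_text: str) -> str:
    return (
        "You are inferring demographics from a public profile.\n"
        f"Profile: {profile_text}\n"
        "Step 1: List cues relevant to age (language style, interests, ...).\n"
        "Step 2: Reason about which age range the cues support.\n"
        "Step 3: Answer with exactly one label from "
        f"{AGE_BUCKETS} and nothing else.\n"
    )

print(build_cot_prompt("Bio: college sophomore, esports club, posts memes."))
```

Constraining the final step to a closed label set is one simple way to address the off-target prediction issue the abstract mentions.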


Citations (42)


... The rapid advancement of Large Language Models (LLMs) has revolutionized artificial intelligence [1,2,3,4,5,6,7,8], enabling unprecedented generative capabilities across diverse applications, such as dialogue systems [9,10], code generation [11,12,13], and medical diagnosis [14,15,16,17]. Models like OpenAI-o1 [18] and DeepSeek-R1 [19] have demonstrated remarkable proficiency in understanding and generating human-like text, outperforming traditional language processing techniques [20]. ...

Reference:

A Survey of Direct Preference Optimization
Aligning, Autoencoding and Prompting Large Language Models for Novel Disease Reporting
  • Citing Article
  • January 2025

IEEE Transactions on Pattern Analysis and Machine Intelligence

... Advancing the affective computing domain, Chain-of-Sentiment [155] refines sentiment analysis in conversational contexts, with complementary contributions from concurrent studies [218,219]. Furthermore, Chain-of-Exemplar [83] expands MCoT into the field of education, X-Reflect [249] extends MCoT into multimodal recommendation systems, while Yu and Luo [250] employ zero-shot MCoT for demographic inference. These advancements highlight the potential of MCoT to tackle complex challenges in human-centric and social scientific contexts. ...

Chain-of-Thought Prompting for Demographic Inference with Large Multimodal Models
  • Citing Conference Paper
  • July 2024

... Dance generation. Music-to-dance generation has recently seen significant progress (Tang et al., 2018; Sun et al., 2020; Zhuang et al., 2022; Sun et al., 2022; Qi et al., 2023; Wang et al., 2024a;b). FACT (Li et al., 2021b) utilizes cross-modal transformer blocks with strong sequence modeling capabilities to generate single-person dance from given music. ...

DanceCamera3D: 3D Camera Movement Synthesis with Music and Dance
  • Citing Conference Paper
  • June 2024

... In particular, the segmentation of hands and surgical tools is essential for interpreting a surgeon's actions and intent, facilitating workflow optimization and AI-assisted surgery. [Flattened dataset-comparison table omitted; it contrasts MIS datasets (Endovis2017 [2], Endovis2018 [1], CholecSeg8k [17], AutoLaparo [31], ROBUSTMIS2019 [26], SAR-RARP50 [25]), none of which include hand annotations, with EgoSurgery-HTS (ours): open surgery, 15.4K frames, 14 tool classes, 4 hand classes, hand annotations included.] Surgical tool segmentation, which involves the precise identification and delineation of surgical instruments, has gained significant attention, particularly in minimally invasive surgeries (MIS), leading to the development of various advanced approaches [3,14,18,24,27,32,35]. A key factor driving this progress is the availability of large, well-annotated datasets [1,2,5,17,25,26,31]. ...

SurgicalSAM: Efficient Class Promptable Surgical Instrument Segmentation
  • Citing Article
  • March 2024

Proceedings of the AAAI Conference on Artificial Intelligence

... By leveraging generative models to produce high-quality image data, AIGC can effectively mitigate the issue of insufficient training samples. Specifically, AIGC employs advanced deep learning models to generate realistic images, thereby enriching the training dataset and enhancing the model's generalization capability [14][15][16][17]. This approach not only increases the diversity of the training data but also improves the robustness and reliability of object detection algorithms. ...

Recent advances in artificial intelligence generated content人工智能生成内容最新进展
  • Citing Article
  • February 2024

Frontiers of Information Technology & Electronic Engineering

... In addition, unlike platforms that primarily rely on short-form content like tweets, Reddit facilitates in-depth discussions with longer posts and comments. This allows for more detailed and nuanced analysis of user opinions, arguments, and perspectives [21], [22], [24]. Hence, we choose to conduct our study using Reddit data, leveraging its unique attributes to gain deeper insights into the multifaceted landscape of AI discussions. ...

A Fine-Grained Analysis of Public Opinion toward Chinese Technology Companies on Reddit
  • Citing Conference Paper
  • December 2023

... Ideology reflects the political orientations or biases of individuals, frequently characterized as left-wing or right-wing perspectives [15,12,66,46]. Nowadays, the ideological division has become significantly more pronounced [78,46,47,76,62,61], and it exerts a notable influence on daily communication, including on social media [83]. The process of ideology detection is designed to detect an author's political stance from their generated content. ...

Understanding Divergent Framing of the Supreme Court Controversies: Social Media vs. News Outlets
  • Citing Conference Paper
  • December 2023

... However, they may be identified through sophisticated techniques such as edge detectors, quality metrics, and frequency analysis [3,4,50,98]. Furthermore, invisible signatures such as camera CFA patterns have been exploited to differentiate camera-generated images from AI-generated images [63]. Additionally, anomalies such as improper alignment with the rest of the image [39,40], inconsistent lighting [77], and differences in image fidelity [35] also aid in recognition. ...

Investigating the Effectiveness of Deep Learning and CFA Interpolation Based Classifiers on Identifying AIGC
  • Citing Conference Paper
  • December 2023

... By incorporating diverse data sources such as videos and text documents, with advanced deep learning techniques [5,26], it enhances summarization quality and user experience. To further simplify and expedite information acquisition, extreme summarization was introduced into the MSMO task [24,25], proposing eXtreme Multimodal Summarization with Multimodal Output (XMSMO). XMSMO condenses a video-document pair into a single frame and a single sentence. ...

TopicCAT: Unsupervised Topic-Guided Co-Attention Transformer for Extreme Multimodal Summarisation
  • Citing Conference Paper
  • October 2023

... Effective prompt design is crucial, as converting user and item attributes into natural language prompts can be limited by context length. Controlling LLM outputs to meet specific constraints (e.g., price, color) is difficult, and ensuring consistent formatting is challenging [46,48]. Finally, privacy concerns arise because LLMs use large datasets that may contain sensitive user information, risking data exposure [12]. ...

User-Controllable Recommendation via Counterfactual Retrospective and Prospective Explanations