Available via license: CC BY-NC-ND 4.0
Content may be subject to copyright.
Artificial Intelligence for Biomedical Video Generation
Linyuan Li1, Jianing Qiu1, Anujit Saha1, Lin Li2, Poyuan Li1, Mengxian He1, Ziyu Guo1, and Wu Yuan1
1Department of Biomedical Engineering, The Chinese University of Hong Kong, Hong Kong SAR
2Department of Informatics, King’s College London, United Kingdom
*Address correspondence to: wyuan@cuhk.edu.hk
Abstract
As a prominent subfield of Artificial Intelligence Generated Content (AIGC), video generation has achieved notable
advancements in recent years. The introduction of Sora-alike models represents a pivotal breakthrough in video generation
technologies, significantly enhancing the quality of synthesized videos. Particularly in the realm of biomedicine, video
generation technology has shown immense potential such as medical concept explanation, disease simulation, and biomedical
data augmentation. In this article, we thoroughly examine the latest developments in video generation models and explore
their applications, challenges, and future opportunities in the biomedical sector. We have conducted an extensive review and
compiled a comprehensive list of datasets from various sources to facilitate the development and evaluation of video generative
models in biomedicine. Given the rapid progress in this field, we have also created a github repository to regularly update
the advances of biomedical video generation at: https://github.com/Lee728243228/Biomedical-Video-Generation
1 Introduction
Artificial Intelligence Generated Content (AIGC) stands as a cornerstone in contemporary computer vision research, bolstering
significant accomplishments across numerous sectors including economy [1], healthcare [2], and transportation [3]. Notably
within the biomedical field, image generation technology has been effectively applied to various imaging modalities, including
Computed Tomography (CT) [4, 5], Magnetic Resonance Imaging (MRI) [6–8], fundus photography [9, 10], and pathological
imaging [11, 12]. The fidelity of these generated images has seen progressive enhancements through technological evolution:
from early generative models such as generative adversarial network (GAN)-based models [6, 9], to auto-regressive (AR)
models [13, 14], then to cutting-edge diffusion models [5, 12], and now moving towards the combination of auto-regressive and
diffusion (AR + Diffusion) models [15, 16].
Compared to static images that contain solely spatial information, videos encompass additional temporal and motion
features. For instance, cardiac ultrasound videos present not only the cardiac structure and pathological conditions but also
the dynamics of cardiac valves and pumping functions. Hence, richer information content poses a more complex challenge for
video generation. Thus far, predominant techniques in video generation are also based on GANs [17–19], AR models [20, 21],
and diffusion models [22, 23]. Following the robust generative capabilities demonstrated by Sora [24] and Movie Gen [25],
diffusion models based on transformer architectures [26, 27] have been increasingly employed for video generation, achieving
state-of-the-art results. The AR + Diffusion models, by combining the strengths of AR models and diffusion models [15, 16],
and featuring a more lightweight design [15], have the potential to become the next-gen video generative models.
Despite the current video generation models’ capability to simulate medical scenarios [28–32], in order to achieve a more
realistic and reliable simulation, considerations should be taken from three aspects: understanding principles in physics,
effective evaluation metrics for generated medical content, and controllability and explainability of generation.
Understanding principles in physics is critical to enhancing the realism and precision of synthesizing biomedical videos.
Taking surgical operations as an example for physics understanding, surgical procedures involve manipulating deformable
tissues and organs using articulated instruments to achieve desired outcomes. While existing video generation models [28, 32]
can create surgical scenes, they fail to model the surgical operations cohesively. The recent Movie Gen [25] has demonstrated
1
arXiv:2411.07619v1 [cs.CV] 12 Nov 2024
certain abilities to simulate physical behaviors, but this ability has not been proven in the generation of medical videos. To
achieve a more precise understanding of the motion characteristics in biomedicine, knowledge from physiology and pathology
also needs to be further learned by generative models.
In addition, it is crucial to understand the significance of evaluation criteria for biomedical video synthesis. In addition
to considering the coherence and authenticity of the generated content [33–35], it is also necessary to ensure the medical
utility and applicability of the generated content and its value added to the existing biomedical data. Therefore, in the design
of evaluation criteria, it is necessary to additionally consider whether the biomedical knowledge contained in the generated
content is meaningful and meets the needs of medical practice.
Besides, generated videos could serve various medical purposes, such as aiding in diagnosis or education. Information to
be generated in videos should be able to be precisely controlled, necessitating both controllability and explainability.
Although controlling mechanisms such as proposed in ControlNet [36] has proved its applicability in medical generation [37,
38], many issues remain unresolved.
Despite facing enormous challenges, the field of biomedical video generation is poised for exciting advancements. Addressing
these challenges will open up numerous opportunities for technological innovation. With a foundation of existing video
generation models and a wealth of biomedical video datasets, this article will explore potential application scenarios. By
analyzing current technologies and challenges, we can pave the way for innovative developments in video generation that will
greatly benefit the biomedical and healthcare communities.
The main contributions of this work are summarized as follows:
•
We summarized the three main challenges in medical video generation, including learning physical laws, establishing
evaluation metrics and benchmarks, and enhancing controllability and explainability, and analyzed the corresponding
potential solutions.
•
We conducted a comprehensive investigation of existing video generation models in the general domain and surveyed
models in biomedical and healthcare domains.
•
We curated existing biomedical video generation datasets, including open-source datasets, video libraries, and biomedical
videos on various multimedia platforms.
•
We discussed the potential applications of video generation including medical education, patient-facing applications, and
public health promotion, and analyzed their feasibility.
Article Pipeline: Section 2 analyzes the challenges faced by biomedical video generation and the potential solutions to
address these challenges. Section 3 introduces video generation workflow including data pre-processing, model architecture,
training, inference, and evaluation. Section 4 focuses on medical video datasets and generation techniques applied in
biomedicine. Section 5 discusses future directions of biomedical video generation and analyzes their feasibility and challenges.
In section 6, we discuss noteworthy risks in biomedical video generation, and finally, we conclude in section 7.
2 Challenges
The challenges in video generation primarily lie in modeling physical principles in biomedicine, devising effective and meaningful
evaluation strategies, and enhancing generative controllability and explainability. This section reviews existing literature that
sheds light on potential solutions to addressing these challenges in biomedical video generation.
2.1 Understanding Physical Laws
The understanding of physical laws indicates to the process of simulating and learning the physical phenomena such as
the motion of objects, the effects of forces, and interactions between objects within the video content related to medicine.
This process aims to ensure that the generated videos are not only visually realistic but also adhere to the principles of
physical dynamics found in the real world. In addition to physical laws, other biomedical principles such as physiological and
pathological laws are equally important.
2
Video Generation in Medicine and Healthcare: A Review
General Video
Generation
Biomedical Video
Generation
Future Application
Scenarios
Image
Pre-training
Inference
Conditions
Evaluation Metrics and
Benchmarks
Acceleration
Medical Metrics and
Benchmarks
Medical Videos
Downstream AdaptVideo
Generation
Model
Biomedical
Video
Generation
Model
Deployment
Video
Pre-training
Pre-
processing
Challenges
Understanding Physical Laws,
Evaluation Metrics and Benchmark,
Controllability and Explanability
Risks
AI Hallucination, Lack of Explainability,
Bias, Lack of AI Accountability,
Model Jailbreaking, .......
Biomedical
Simulation
Diagnosis
Assistance
AI Model
Enhancement
Public Health
Promotion
Patient-Facing
Application
Medical
Education
Figure 1: The existing workflow of biomedical video generation, including large-scale pretraining, adaptation in the biomedical
domain, and deployment in biomedical scenarios. In the process of development and deployment of video generation technology,
we highlight prominent challenges, such as learning medical physical laws, as well as risks related to hallucinations and bias.
Taking laparoscopic surgeries as an example, the physical phenomena observed are complex and multifaceted. Key aspects
include: 1) Tissue Pulsation: The rhythmic pulsation of biological tissues, typically synchronized with the heartbeat, plays a
critical role in understanding the physiological state of the patient during surgery. 2) Instrument Displacement: The movement
of surgical instruments is crucial for performing precise manipulations. This displacement can be influenced by various factors,
including the surgeon’s technique and the properties of the tissues being operated on. 3) Deformation of Tissues: Surgical
manipulation results in various types of deformation, including elastic deformation due to stretching and plastic deformation
from cutting. These interactions can be complex and are influenced by the mechanical properties of both the instruments
and the tissues. Through a comprehensive review of the literature, the modeling of physical laws in video generation can be
categorized into four categories: 1)explicit physical law learning; 2)implicit physical law learning; 3)physics-informed neural
networks; and 4) online interaction with the medical environment.
Explicit Physical Law Learning The methodologies for explicit physical law learning involve the direct synthesis of
object motion using generative models, such as optical flow and frequency domain changes, followed by rendering on images or
videos to achieve effective physical simulation [39–42]. For instance, GIT [39] employs diffusion models to learn information in
the frequency domain and then integrates this information into one image for animation, thereby accomplishing the modeling
of simple physical phenomena like vibrations. PhysDreamer [41] offers a more direct and interpretable approach to physical
3
Evaluation Metrics & Benchmark
General Automated Metrics Doctor Evaluation Metrics Medical Automated Metrics
Image Smoothness
Image Quality
Image Realism
Image Rationality
Medical Utility
Image Rationality
Medical Utility
Image Smoothness
c) Evaluation Metrics
& Benchmark
Generator
General Control
Medical-Specific
Control
Imaging Intrisic
Parameters
General Control
Skeleton Optical Flow
Text/ Mask, ....
dMRI, CT, ....
parameters
b) Controllability
Explainability
Imaging Intrisic
Parameters
Vascular Structure
Lesion Location
Medical-Specific
Control
Generator
Video
Video in
Frequency
domain
Optical
Flow
Rendering
Conditions Physical Law
Content Latent
Motion Latent
Content
Motion
Generator
Fusion
Model
Noised Latent
Conditions
Explicit Physical Law Learning Implicit Physical Law Learning
Understanding
Physical Law
a): Understanding Physical Law
Physics-Informed Neural Network
Generator
General Loss
Term
Physical-related
Loss Term
Back-propagation
Generator
Online-Interaction with Medical Environment
Medical-
related
Physical
Control
Online Data
Feed-back
Fine-tuning
Physical Law
Figure 2: The challenges that biomedical video generation techniques face include a) understanding principles in biomedicine
such as medical physics; b) controllability and explainability; and c) robust evaluation metrics and benchmarks.
modeling by solving the differential equations of elastic materials using the Material Point Method (MPM) [43, 44] to model
their physical laws. In the context of surgical scenarios, the relationship between stress and strain represents a primary form
of physical deformation and the modeling of such a relationship can refer to PhyDreamer [41].
Implicit Physical Law Learning Implicit methods do not directly learn the physical phenomena, such as object motion
trajectories or elastic deformation as explicit methods do [39–42]. Instead, they model object motion by learning motion
features [45]. For example, generative models typically learn the temporal and spatial information of videos [22, 46], or
the information related to motion and content separately [19, 47]. This decoupled approach is an implicit learning method;
CMD [45] decomposes video content into content frames and motion latent representation during video encoding, which are
then processed separately by a diffusion model. This decomposition method is frequently used when GANs are employed for
video generation [19, 47].
Physics-Informed Neural Network In traditional machine learning approaches, the learning process is primarily
data-driven, with the model heavily reliant on large volumes of high-quality data. However, in practical applications, there is
often a scarcity of data or the presence of noise, making it challenging for data-driven models to yield accurate and reliable
results. To address this, incorporating physical knowledge as prior information aims to overcome the limitations of data
insufficiency and make generation or prediction results more aligned with physical intuition [48]. Particularly in the medical
4
field, introducing the constraints of physical laws can make the model’s decision outcomes more interpretable and controllable,
thereby achieving a credible effect.
To incorporate physical laws into machine learning, these laws are typically embedded within the model’s architecture [49–
51] and the design of the loss function [52]. In the visual generation domain, Phy-diff [49] improves synthesized MRI quality
by informing dMRI physical information (Diffusion coefficient map atlas) into the diffusion process with principled noise
management, conditioned on the XTRACT atlas for anatomical details. Tirindelli et al. [50] augment ultrasound data by
integrating physics-inspired transformations including deformations, reverberations, and signal-to-noise ratio adjustments,
aligned with the principles of ultrasound imaging, providing anatomically and physically consistent variability. Momeini et
al. [51] model cerebral microbleeds with a Gaussian shape to simulate the data’s characteristics including shape, intensity
volume, and location, guided by MRI properties. In the generation domain, the incorporation of physical knowledge such as
imaging properties helps boost the generation effect and performance of downstream tasks.
Online Interaction with Medical Environment Generated videos often include hallucinatory content and unrealistic
physical effects. While expanding datasets [22, 24] and optimizing models [25, 45] offer partial solutions, collecting external
feedback from the environment is crucial and one effective solution for improving video generation [53].
By employing reinforcement learning, generative models are enabled to interact with the external environment, adjusting
themselves based on a reward function to make the generated effects more consistent with the physical laws of the natural
environment alleviating the issue of hallucinations to a certain extent.
VideoAgent [53] is a representative of this approach. Based on the task description, it is necessary to capture a frame from
the environment as the initial frame, and then the generative model generates a video based on the initial frame and task
description. Subsequently, the content generated is judged against the standard of optical flow to determine if it complies
with the physical laws. Content that conforms is collected and used to fine-tune the video generation model. When the model
literally undergoes fine-tuning based on a variety of environments, it can then generate videos that adhere to the physical laws
across different environments.
2.2 Controllability and Explainability
Input conditions often contain rich medical information, which is a direct manifestation of medical utility. For example,
masks may include information about the location and size of lesions, which provide critical information for physicians in
making diagnoses. Therefore, enhancing the representation of control conditions in the generated content and ensuring the
explainability of the generated content is important in medical generation. This section discusses the existing control methods,
how to strengthen control, and the explainability of control, addressing the ways and difficulties in tackling these challenges.
Control in Medical Domain Text, mask, and depth information are common modalities used to control the generation
of medical videos, which provide guidance to the generation process through control mechanisms such as ControlNet [37,
38, 54], Clip [30, 32] and condition-specific encoder [55, 56] to the generation of medical images and videos. For instance,
FairDiff [54] generates point masks through a point-cloud diffusion model, which is then utilized to control the image diffusion
model in generating fundus images. ControlPolypNet [38] controls the generation of polyp images based on segmentation
masks.
Controllability Enhancement By integrating the inherent physical knowledge of medical imaging into the generation
process, or by employing more rational and comprehensive modal control for generation, the effect of controllability can be
enhanced. Phy-diff [49] introduces inherent physical information of dMRI into the diffusion process and makes it more suitable
for dMRI generation. This approach incorporates physical information to optimize the diffusion process also has the potential
to be applied to the generation of images or videos from CT, fundus fluorescein angiography (FFA), ultrasound, and other
modalities. Furthermore, in fundus2Video [31] work, a knowledge mask containing lesion and vessel information serves as an
additional condition to assist in the generation of FFA sequences. HeartBeat [55] utilizes a comprehensive set of modalities to
guide ultrasound video generation in both coarse and fine-grained levels. Such conditions typically need to be determined by
specific generation tasks and medical modalities.
Explainability In order to enhance the controllability of generative models, the explainability of the model has to be
taken into account. Although some existing works, such as ECM [56] and Heartbeat [55], can control and even explain the
generation of medical videos, there are still issues to be addressed to enhance overall explainability, including but not limited
5
to 1) the contribution of different modalities to the content of the generated video needs to be measured quantitatively; 2)
whether there is redundancy between modalities; and 3) whether excessive control modalities can lead to the degradation of
model performance.
2.3 Evaluation Metrics and Benchmarks
The introduction of benchmarks like vbench [33] has allowed for a more comprehensive evaluation of video generation, which
employs 16 different metrics to thoroughly assess video generation models. Given the unique characteristics of medical videos,
a comprehensive benchmark is essential and in particular, with specialized metrics tailored for medical videos.
General Automatic Metrics General automatic metrics refer to assessment standards that are applicable to both
natural video generation models and medical video generation models. Commonly used evaluation metrics include Fr´echet
Inception Distance (FID) [57], Inception Score (IS) [58], Fr´echet Video Distance (FVD) [59], Structural Similarity Index
(SSIM) [60], Peak Signal-to-Noise Ratio (PSNR), and CLIPSIM [61], among others. These metrics could assess multiple
dimensions of the quality of generated videos, such as realism, smoothness of imagery, and the alignment of conditions with
generation outcomes. However, there is a domain gap between medical videos and natural videos, hence the evaluation metrics
for natural video generation models cannot be directly applied to medical video generation models without consideration of
their discrepancy.
Doctor Evaluation Metrics Doctor evaluation refers to the direct involvement of a physician in assessing the synthesized
videos, and providing scores based on their individual preferences. Due to the absence of benchmarks designed for medical
image or video generation, this study refers to an ophthalmology report generation benchmark known as FFA-IR [62] that not
only takes into account the general evaluation metrics for report generation, such as Recall-Oriented Understudy for Gisting
Evaluation(ROUGE) and Bilingual Evaluation Understudy(BLEU) but also incorporates the assessment by ophthalmologists
of the generated reports. The assessment criteria include the fluency of the report, the rationality of the lesion description
within the report, and the accuracy of the described lesion location, among other professional medical issues. Similarly, in
video generation, it is also necessary to consider whether the generated content adheres to physiological laws and medical
utilities with the involvement of medical professionals.
Medical Automatic Metrics Physicians’ participation in evaluating synthesized videos is typically time-consuming
and inefficient. Hence, automatically integrating medical expertise into evaluation metrics is a promising alternative. On
one hand, using LLMs as substitutes for physicians in evaluations is an option. Medial expert LLMs [63] and multimodal
LLMs [64] possess the potential to accomplish this task. On the other hand, designing medical evaluation metrics is also
feasible. Since there are no such evaluation metrics in the field of medical visual generation, evaluation metrics in radiologist
report generation can be taken as references such as F1-RadGraph [65], CheXBert vector similarity [66], and RadCliQ [66]
used in MultiMedEval [67]. RadGraph constructs a radiology knowledge graph based on reports and then calculates the
similarity between the generated report and the reference report’s graphs to obtain a more specialized evaluation metric.
When assessing medical video generation models, this unique relationship can also be captured, such as building a graph
model that uses detection and classification techniques to model the relationships between surgical instruments and biological
tissues, achieving a specialized evaluation of surgical video generation.
3 Technical Background of Video Generation
This section outlines the essential technical aspects of video generation, including data preparation and pre-processing, the
design of neural network architectures, the training and inference of generative models, as well as evaluation metrics and
benchmarks.
3.1 Data Preparation and Pre-processing
Video generative models require extensive video data to develop. Recent surveys [68–70] have conducted a detailed investigation
of the datasets, highlighting commonly employed datasets in general domains such as UFC101 [71], WebViD [72], and LAION-
5b [73]. These datasets, however, are mainly for text-guided video generation and unconditional video generation. Specifically,
6
text-guided video generation aims to generate videos that represent text semantics, and the unconditional video generation
model synthesizes videos by sampling noises from Gaussian Distribution, which represents the distribution of training
data. For video generation conditioned on other modalities, such as depth-guided or pose-guided video generation, the
conditional modalities can either be manually curated or synthesized using specialized tools like a depth-generator [74] and
pose-generator [36, 75].
After gathering the datasets for training, the first step is to pre-process the videos along with their corresponding conditional
modalities. For unconditional generation, only videos have to be pre-processed. As the most common conditional modality is
text, the discussion below will focus on video-text data pre-processing.
•
Video Pre-processing Videos have to be pre-processed to facilitate training. Common strategies include cut detection,
filtering static scenes, resizing, and downsampling. SVD [22] introduces a series of comprehensive and effective video pre-
processing methods, including cascaded cut detection by PySceneDetect [76], Keyframe-aware clipping, and calculating
optical flow-based motion scores, aesthetic scores, clip scores [61], and OCR detection rates for clip filtering. These
processes remove clips that are motionless, unclear, unaesthetic, and mismatched with the text, as well as text-video
pairs with low-quality text.
•
Text Pre-processing Text diversity and richness are important for text-video generation [77]. In addition to text
resources from datasets, they can also be obtained from a captioning model [78–80] and video subtitles. Text enhancement
is commonly used to obtain more detailed and diverse descriptions. LLMs [63, 81–83] and multimodal LLMs [64, 84, 85]
can be used to create more diverse descriptions for videos.
3.2 Design of Neural Network Architecture
Conditions
Auto-regressive
Model
Generator Discriminator
Conditions
Diffusion
Process
T steps
De-noising
Process (U-net)
T steps
Conditions
a) GANs
b) AR Models
Diffusion
Process
T steps
De-noising
Process
(Transformer)
T steps
GPTs
Encoder
Decoder
Decoder
Encoder Encoder
Decoder
c) Diffusion Models
d) Towards Sora
e) AR + Diffusion Models
Conditions AR+Diffusion
Encoder
Video Conditions
Conditions
Video Conditions
Decoder
GANs
Fast Inference
High Controllabiliy
Mediocre Quality
Unstable Training
AR Models
High Diversity
High Quality
Error
Accumulation
Slow Inference
Diffusion Models
High Diversity
High Quality
Slow Inference
Towards Sora
High Diversity
High Quality
Physical Law
Learning
Slow Inference
AR+Diffusion
Higher
Controllabiliy
Lighter
Low Diversity
f) Pros
& Cons
Diversity & Quality
& Learning
Capability
Slow Inference
Figure 3: Architectures of video generation models, and their strengths and limitations. GANs, AR models, diffusion models,
and Sora-alike models have been widely used in video generation. The AR + Diffusion models, due to their capabilities in
understanding and generating content, holds promise for reliable biomedical video generation.
The architectural landscape of video generation models has evolved rapidly over the past few years. Early approaches
mainly adopted variational autoencoder (VAE) [86, 87] and flow-based [88, 89] video generation techniques. Later, studies
7
predominantly utilized GANs [17, 19, 90], followed by the rising popularity of AR models for video generation [86, 91].
Recently, research on diffusion models [22, 92] has significantly advanced video generation technology. The new generation
of diffusion models, exemplified by Sora [24] and Movie Gen [25], features a more diverse training dataset, a more efficient
compression model, and a more powerful diffusion backbone. The breakthrough in the diffusion backbone also lays the
foundation for subsequent work [24, 25, 93–96]. The AR + Diffusion models [15, 16] demonstrate strong capabilities in
image-text understanding and generation, showing potential in the field of medical visual generation. This section will primarily
focus on analyzing GANs, AR video generation models, diffusion-based video generation models, Sora-alike Transformer-based
video generation models, and AR+Diffusion models that show potential in medical video generation.
GANs [90, 97–100] comprise two principal components: a generator and a discriminator. The generator synthesizes
outputs by sampling a noise
z
from prior distribution
pg
(
z
), denoting
G
(
z
), and the discriminator, functioning as a classifier,
aims to discern whether the output is real or generated by the generator. In video generation, GANs often employ a decoupled
approach; for instance, VGAN [101] generates the foreground and background streams using 3D and 2D convolutional networks
respectively, which are subsequently amalgamated to produce a cohesive video. Furthermore, GAN framework frequently
utilizes decoupling strategies along various dimensions, such as spatial dimensions and temporal dimensions [17], as well as
content dimensions and motion dimensions [47, 102], to enhance the generation process.
GAN models generate videos in only one step with the generator, showing its high efficiency. Controllability is commendable
because the generation conditions can directly control the latent features, rather than controlling the latent noise as diffusion
models do [103]. However, in terms of generation quality, GANs typically do not match the performance of large-scale models
such as diffusion models and auto-regressive models [104]. Due to the inherent structure of GAN models, mode collapse can
occur, which also leads to generation diversity not being particularly competitive [105].
AR Models have been widely used in image generation [106–109] based on the Vision Transformer [110]. In auto-regressive
generation, each new token is produced based on preceding tokens, a principle that has been extended to video generation
as well. For example, CogVideo [21] encodes video frames into a latent sequence and is trained auto-regressively, where
the next frame is generated conditioned on the previous frames. VideoPoet [20] applies an LLM [82] as its backbone and
auto-regressively trains the input text, audio, and video tokens.
AR models are capable of generating high-quality and diverse visual content [111]. However, despite the model’s ability to
generate high-quality and diverse samples, the generative performance of these models will decline due to the drawback of
error accumulation [112], as AR models generate subsequent tokens based on previous ones. In addition, these models are
usually large-scale, so optimizing their training and inference speed is a challenge.
Diffusion Models [113, 114], inspired by non-equilibrium thermodynamics [115], begin by defining a Markov chain
that incrementally adds Gaussian noise to the data
x0
, a process known as diffusion. Given the properties of the Markov
process, the state of
xt
only depends on
xt−1
. Subsequently, the model learns to reverse this diffusion process to reconstruct
the desired data sample by removing the noise. In a diffusion model framework, the inputs are initially condensed using
compression models such as VAE [116] and VQ-VAE [117], and followed by the Denoising Diffusion Probabilistic Model
(DDPM) process [114] guided by conditions such as text prompt [118–121], images [121–124], audio[125, 126], or other
modalities [127–129], and finally reconstructed by a decoder. Furthermore, diffusion models incorporate mechanisms that
decouple video components, typically separating videos into temporal and spatial dimensions [46]. This approach enhances
the model’s ability to learn video features more effectively.
Diffusion-based generative models have become the mainstream approach for video generation due to their high generation
quality and richness of details. However, because of its multi-step inference during generation, their training and inference
speeds are relatively slow. Additionally, compared to the latent features in GANs, the noise latent poses a challenge in terms
of explainability [103].
Towards Sora Sora [24] is an advanced text-to-video generation model developed by OpenAI that is capable of generating
up to one minute of high-quality video content based on user input. The demonstration of Sora has accelerated the trend of
using Transformer [130] as backbones for diffusion models. Sora, according to OpenAI’s technique report [131] and open-source
projects [132, 133], comprises three key components akin to Latent Diffusion Model(LDM) [134], including a compression
model [116, 117], a conditioning mechanism, and a generation model. Diffusion Transformer (DiT) [135] is the core of Sora,
where noised latent from the diffusion process is patchified into a series of tokens and then added with positional embeddings.
8
DiT blocks utilize adaLN [136] as the conditioner to introduce timestep condition
t
and class label condition
c
. Notably,
Sora makes adaLN initialized to zero to expedite large-scale training in a supervised learning context [137], referred to as
adaLN-zero.
Movie Gen [25] is a media generator developed by Meta that integrates both video and audio generation capabilities.
Movie Gen has designed a diffusion backbone based on the transformer architecture, which structurally draws inspiration
from the design of LLama 3 [27], and trains in flow matching manner [138]. However, it does not incorporate structures such
as causal attention and Grouped-Query Attention (GQA) that are utilized for the auto-regressive training of LLama 3. It uses
a temporal autoencoder (TAE) as its compression model combining VAE [116] and a 1D convolution layer. During TAE
training, it adds outlier penalty loss (OPL) to eliminate the negative impact of ‘spot’ artifacts.
AR + Diffusion Models Auto-regressive models such as LLMs [27, 81–83] and multimodal LLMs [139, 140] have
demonstrated strong comprehension capabilities for text and multimodal information, while diffusion models like LDM [134,
141] have shown impressive visual generation abilities. Transfusion [16] and Show-O [15] effectively integrate these two
capabilities, exhibiting understanding and generation capabilities that are comparable to LLMs and diffusion models, along
with enhanced text-image alignment capabilities.
Transfusion [16] utilizes a transformer as a generative model, employing causal attention when training text tokens and
bidirectional attention when training image generation models. It integrates the next prediction loss
LLM
from LLMs and the
LDDP M
from diffusion models as its training objectives. Show-O [15] is consistent with Transfusion in terms of language
training, but in image training, it adopts a mask token prediction loss
LMT P
to reconstruct the masked image patches.
Comparatively, both models possess the capabilities of LLM diffusion models in text generation and image generation, and
they have also achieved a new SOTA in text-video alignment. Show-O performs slightly better than Transfusion in terms of
alignment. Although such models have not yet been applied to the generation of medical visual content, their demonstrated
stronger condition-alignment capabilities can provide better explainability and controllability for medical content generation,
holding potential in the generation of medical images and videos.
3.3 Training
3.3.1 Workflow
A general training strategy for developing video generation models involves large-scale pre-training on natural images, then
large-scale pre-training on natural videos, followed by fine-tuning with high-quality videos, and finally domain-specific
adaptations to downstream scenarios, e.g., medical video generation.
Large-scale Image Pre-training Image pre-training serves as the initial phase in training video generation models, with
the primary goal of enriching the models’ comprehension of visual content representations. Images are typically pre-trained on
image generation models, but some models, such as Movie Gen [25], treat images as a single-frame video and complete image
pre-training on a video generation model.
Large-scale Video Pre-training Video pre-training enables the model to learn motion representation based on the
model pre-trained on images with rich visual priors
Datasets such as ShareGPT4Video [142], FineVideo [143], WebVid10M [144], and Kinetics-600 [145] offer rich resources for
video training. ShareGPT4Video includes millions of videos featuring wild animals, cooking, sports, landscapes, and beyond,
coupled with abundant textual information. However, low-quality data can degrade model performance. Therefore, video
data and its modalities must undergo a series of pre-processing and filtering steps (see Section 3.1) to further enhance model
performance.
Following image pre-training, the generative model undergoes further training. However, there are two key differences: 1) It
is necessary to introduce a temporal dimension or motion features into the image generation model to model dynamic content,
such as Make-A-Video [146] and LVDM [125], which incorporate temporal attention, VDM [147] and Make-An-Animation [148],
which introduce pseudo 3D covolution, and StyleGan-V [47], which incorporates motion vectors; 2) This step will introduce
various conditions, such as text as a supervisory signal to guide video generation, promoting the model’s learning of the
consistency between conditions and generated content.
9
High-Quality Video Fine-tuning Following the previous two steps, training on higher-quality datasets can yield
enhanced and more realistic outcomes.
High-quality video datasets are normally high-resolution. For example, LSMDC [149] provides a video dataset with 1080p
resolution, while DAVIS [150] includes videos with a resolution of 1280p, and RDS [46] offers a substantial collection of video
clips with a dimension of 1024 ×512 pixels.
At this stage, it is essential to design fine-tuning tasks that help the model gain a deeper understanding of video content
and the consistency between video conditions. Models such as the diffusion model and the auto-regressive model, e.g., SVD [22]
and VideoPoet [20], have designed a series of self-supervised learning tasks to assist the model in further exploring the intrinsic
features of videos. These tasks include video interpolation, video prediction, and image-to-video generation tasks.
Downstream Adaptation High-quality fine-tuning establishes a robust foundational model for subsequent downstream
tasks. A notable application involves the transfer of generative models to the medical domain for medical video synthesis. A
suite of generative tasks can be accomplished based on existing medical video datasets (see Section 4 and Table 4).
3.3.2 Training Acceleration
Video generation models often face issues such as slow training speeds and excessive consumption of resources like memory
during training. Therefore, it is necessary to optimize the training process of these models to enhance training efficiency.
GAN Acceleration The training process of GANs is often characterized by instability due to its structural nature.
Techniques such as Earth Mover’s Distance [151], Gradient Penalty [152], and TTUR [153] help stabilize GAN training.
Besides, skip-layer excitation and self-supervised discriminator [154] reduce the number of parameters, thereby facilitating
faster and more stable training.
AR Model Acceleration Acceleration techniques for auto-regressive models often align with those developed for
transformers, especially LLMs, as most AR generators utilize LLM structure as their generation backbones [13, 14, 20].
Models can be distributed across multiple GPUs for efficient training such as DeepSpeed [155] and Zero [156]. A series
of parameter-efficient learning methods can also be applied including Lora [157, 158], Adapter [159], BitFit [160], and
P-tuning [161]. Other efficient acceleration methods include model quantization [162], and FlashAttention [163].
Diffusion Model Acceleration Diffusion models often entail numerous steps for image/video generation, emphasizing the
importance of optimizing this process for efficiency. SpeeD [164] separates time steps into distinct areas and accelerates training
process. P2 [165] and Min-SNR [166] adjust weights on the time steps based on heuristic rules. State Space Models(SSM) is
utilized in DiffuSSM [167] to increase training speed without attention mechanism.
3.4 Inference
3.4.1 Workflow
During the training phase, the specified condition serves as a directive for the generation, thereby guiding the creation process.
These conditions include modalities such as text, images, depth information, and sound (see Table 1). Beyond directly
inputting conditions for the model to generate video content, there are various inference techniques that can boost the inference
performance. These include but are not limited to: 1) prompt engineering; 2) video interpolation and super-resolution; 3)
self-condition consistency model; and 4) explicit generation.
Prompt Engineering Since existing video generation models are trained with long and detail-rich prompts [24, 25, 168],
a good prompt directly affects the quality of video generation, such as the richness of details. Prompt engineering is the
technique that helps unleash their potential. For example, Sora [24] utilizes GPT [82] to turn users’ short prompts into longer
captions to generate high-quality videos that better align with users’ input.
Video Interpolation and Super-resolution Videos directly generated by generative models may sometimes have a low
frame rate and low pixel resolution. Therefore, post-processing with video interpolation models and super-resolution models
can produce higher-quality videos. Make-A-Video [146], ImagenVideo [169], and VideoLDM [46] utilize a multi-stage process
including a series of interpolation models and super-resolution videos to generate longer and higher-resolution videos.
Vision-Language Model-guided Inference VideoAgent [53] proposed Vision-Language Model (VLM)-guided video
generation. VideoAgent first plans the video conditioned on the first frame and language. Based on the video plan and latent
10
noise from the previous iteration, VLM helps refine the model by making denoising adjustments. Besides VLM, humans can
also engage in the refining iteration to adjust based on their own preferences. VLM-guided inference can effectively eliminate
hallucinations and make the imagery more realistic.
Explicit Generation Explicit generation, proposed by EMUVideo [170] divides the video generation process into two
stages: 1) image generation conditioned on the text and 2) video generation conditioned on text and generated image. The
image generated after the first stage is regarded as the explicit representation of generation content. It initializes the video
generation model with stable diffusion model [141] weights and fine-tune the temporal layers on video datasets. This explicit
inference framework helps generate videos with more details.
Table 1: Current video generation methods categorized by conditional modalities.
Method Year Architecture Method Year Architecture
Text-guided Generation
TGANs-C[171] 2017 GAN T2V[18] 2018 GAN
StoryGAN[172] 2019 GAN LVT[173] 2020 Auto-regressive Model
Godiva[86] 2021 VAE LVDM[92] 2022 Diffusion Model(U-net)
VDM[147] 2022 Diffusion Model(U-net) ImagenVideo[169] 2022 Diffusion Model(U-net)
Stylegan-V[47] 2022 GAN CogVideo[21] 2022 Auto-regressiveModel
Make-a-Video[146] 2022 Diffusion Model(U-net) Show-1[174] 2023 Diffusion Model(U-net)
SVD[22] 2023 Diffusion Model(U-net) Stream2V[95] 2024 Diffusion Model(U-net)
Latte[93] 2024 Diffusion Model(Transformer) CogVideoX[168] 2024 Diffusion Model(Transformer)
Gentron[94] 2024 Diffusion Model(Transformer) VD3D[96] 2024 Diffusion Mo del(Transformer)
Sora[24] 2024 Diffusion Model(Transformer) MovieGen[25] 2024 Diffusion Model(Transformer)
Pose-guided Generation
DynamicGAN[175] 2022 GAN DreamPose[176] 2023 Diffusion Model(U-net)
MagicAnimate[177] 2024 Diffusion Model(U-net) MimicMotion[178] 2024 Diffusion Model(U-net)
Follow-your-pose[179] 2024 Diffusion Model(U-net) Disco[180] 2024 Diffusion Model(U-net)
Motion-guided Generation
DragNUWA[181] 2023 Diffusion mo del(U-net) MCDiff[182] 2023 Diffusion Model(U-net)
DreamVideo[183] 2024 Diffusion Model(U-net) VMC[184] 2024 Diffusion Model(U-net)
MotionClone[185] 2024 Diffusion Model(U-net) MotionCTRL[186] 2024 Diffusion Model(U-net)
Revideo[187] 2024 Diffusion Model(U-net) 360DVD[188] 2024 Diffusion Model(U-net)
Image-guided Generation
Imaginator[189] 2020 GAN LaMD[190] 2023 Diffusion Model(U-net)
GID[39] 2023 Diffusion Model(U-net) LFDM[191] 2023 Diffusion Model(U-net)
Depth-guided Generation
Animate-a-Stroy[192] 2023 Diffusion Model(U-net) Make-your-video[193] 2024 Diffusion Model(U-net)
Sound-guided Generation
TPOS[194] 2023 Diffusion Model(U-net) Aadiff[195] 2023 Diffusion Model(U-net)
Generative Disco[196] 2023 Diffusion Model(U-net) TA2V[197] 2024 Auto-regressive Model
Video-guided Generation
Video Editing
Videop2p[198] 2023 Diffusion Model(U-net) Dreamix[199] 2023 Diffusion Model(U-net)
DynVideo[200] 2023 Diffusion Model(U-net) Anyv2v[201] 2023 Diffusion Model(U-net)
MagicCrop[202] 2023 Diffusion Model(U-net) ControlAVideo[203] 2023 Diffusion Model(U-net)
CCedit[204] 2024 Diffusion Model(U-net)
Video Interpolation & Video Prediction
PhBI[205] 2015 Phase-based AdaConv[206] 2017 kernel-based
SRFI[207] 2017 kernel-based PhaseNet[208] 2018 Phase-based
CyclicGen[209] 2019 OpticalFlow VQI[210] 2019 Optical Flow
FIGAN[211] 2022 GAN VIDIM[212] 2024 Diffusion Model(U-net)
LDMVFI[213] 2024 Diffusion Model(U-net)
Brain-guided Generation
f-CVGAN[214] 2022 GAN CinimaticMindscapes[215] 2024 Diffusion Model(U-net)
DynamicVStimuli[216] 2024 Diffusion Model(U-net) Animate-your-thoughts[217] 2024 Diffusion Model(U-net)
Multi-modal-guided Generation
NUWA[218] 2022 VAE VideoPoet[20] 2023 Auto-regressive Model
MovieFactory[219] 2023 Diffusion Model(U-net) MovieComposer[220] 2024 Diffusion Model(U-net)
Lumieire[23] 2024 Diffusion Model(U-net) Sora[24] 2024 Diffusion Model(Transformer)
AV-DiT[221] 2024 Diffusion Model(Transformer)
Unconditional Generation
VGAN[101] 2016 GAN WGAN[17] 2017 GAN
WGAN[222] 2017 GAN MocoGAN[19] 2018 GAN
DVD-GAN[223] 2019 GAN DIGAN[102] 2022 GAN
3.4.2 Inference Acceleration
Inference speed and memory resource utilization are also crucial metrics for generative models. To address these challenges,
a range of optimization algorithms have been developed. Optimization efforts primarily target auto-regressive models and
diffusion models.
AR Model Acceleration Currently, the predominant architecture for mainstream auto-regressive generative models
is LLM structure, aligning the acceleration methods of LLMs accordingly. DeepSpeed [155] used in training acceleration,
11
can also be utilized in inference acceleration. In addition, packages like vLLM [224], lightLLM [225], and TensorRT [226]
have optimized cache management, attention, and quantization mechanism, which can significantly accelerate the inference of
LLMs.
Diffusion Model Acceleration Parallel inference techniques offer a promising avenue for expediting diffusion-based
generation. ParaDiGMS [227] and DistriFusion [228] splits denoising process and input patches to different GPUs respectively
to accelerate generation to reduce the computational cost. Time steps [229, 230] can also be optimized during inference time
to increase efficiency. In addition, DeepCache [231] and AT-EDM [232] demonstrate the advantageous utilization of model
cache and input data to enhance inference performance. To facilitate acceleration, specialized libraries such as Xformers [233],
AITemplate [234], TensorRT [226], and OneFlow [235] can be leveraged for streamlining the inference process.
3.5 Evaluation Metrics and Benchmarks
Evaluation metrics are essential for assessing generation performance. When evaluating natural video generation capabilities,
it is important to consider key factors such as generation quality, temporal continuity, and consistency between the condition
and the video, e.g., in text-to-video scenarios. To evaluate these aspects, IS [58], FID [57], FVD [59], SSIM [60], PSNR, and
CLIPSIM [61] are widely used for gauging the quality of generated videos.
The current mainstream benchmarks for video generation models include two categories: Text-to-Video (T2V) benchmarks
such as VBench [33], T2VBench [33], and T2V-Compbench [35], and Image-to-Video (I2V) benchmarks such as AIGCBench [34].
VBench [237] provides 16 evaluation metrics including object identity, motion smoothness, and space relationship. Each
dimension is designed as a set of around 100 prompts. VBench [237] has assessed open-source video generation models including
ModelScope [238], CogVideo [21], VideoCrafter-1 [239],and Show-1 [174] and closed-source models such as Gen-2 [240] and
Pika [241]. Among the open-source models, VideoCrafter-1 [239] and Show-1 [174] exhibited notable superiority. While
closed-source models excelled in video quality, including aesthetic and imaging quality, certain open-source models surpassed
their closed-source counterparts in terms of semantic consistency with user input prompts.
I2V benchmarks, unlike T2V benchmarks, have reference videos in the dataset, AIGCBench [34] contains 3928 samples
including real-world video-text and image-text pairs and generated image-text pairs. AIGCBench [34] evaluated both
open-source models including VideoCrafter [239], and SVD [22] and closed-source models including Pika [241] and Gen-2 [240]
on 11 metrics of 4 dimensions including control-video alignment, motion effect, temporal consistency, and video quality.
SVD [22] achieves the best on this benchmark among the open-source models and exhibits results that are comparable to
those of closed-source models.
4 Generative Video Research in Biomedicine
Video generation holds significant importance in the medical field, as it can enhance the quality of medical education and
improve clinical decision-making. However, it is essential to recognize the substantial differences between medical video data
and natural video data.
Medical imaging is typically multimodal. For example, Pathological images typically include multiple staining methods to
make various parts of the specimen clearly visible under the microscope, encompassing the structure, morphology, and abnormal
changes of cells and tissues. Moreover, distinct staining techniques can differentiate between various tissue components.
Medical imaging requires higher contrast than natural images to highlight diagnostic information effectively. Additionally,
some medical imaging modalities, such as ultrasound, do not provide depth information and these videos depend on modalities
rather than depth cues for control, in contrast to natural videos.
In addition, the generated medical videos usually require a high signal-to-noise ratio to have high diagnostic value [29],
unlike natural videos, which often focus more on continuity and realism, and do not emphasize the signal-to-noise ratio as
much as medical videos do. From existing medical examinations and datasets, such as surgical video recordings [242, 243] and
ultrasound imaging [244], it can be observed that the subject of imaging is usually located in the center of the video and does
not have significant motion, such as the heart only beating rhythmically within a small range. In contrast, natural videos
often have significant changes in perspective and object displacement. Therefore, in terms of understanding physical laws, the
12
Table 2: Commonly used metrics for assessing the quality of generated videos.
Metrics Formula Explanation Function Drawback
General Synthetic Video Metrics
IS[58] ISvideo =exp(E(1
T
T
P
t=1
E[KL(p(yt|xt)||p(yt))]))
IS measures K L divergence
between class probabilistic distribution
of one generated frame p(yt|xt)
and distribution of all
generated images p(y), where
Tis the number of frames
Evaluation of diversity of
generated samples and degree
of one sample belongs to a
certain class
Only consider generated
samples’ distribution.
FID[57] d2((m, C),(mw, Cw)) = ||m−mw||2
2+T r(C+Cw−2(CCw)1
2)
F ID measures Fr´echet distribution distance
between ground truth and
generated samples
whose mean and co-variance
are (m, C) and (mw, Cw) respectively.
Evaluation of Gaussian
distribution distance between
generated and real data
in the feature level.
Unable to assess the
overfitting situation of
generative models. Gaussian
distribution is insufficient
to represent feature
distribution.
FVD[59] MMD2(q, p) =
m
P
i=j
k(xi,xj)
m(m−1) −2
m
P
i=1
n
P
j=1
k(xi,yj)
mn +
n
P
i=j
k(yi,yj)
n(n−1)
F V D measures distance
between ground truth p(X)
and generated samples q(Y)
by Maximum Mean Discrepancy
Evaluation of distance
between generated and
real data using video
feature extractor
The same as FID
SSIM[59] SSIM =(2µxµy+c1)(2σxy +c2)
(µ2
x+µ2
y+c1)(σ2
x+σ2
y+c2)
SSIM measures luminance µ,
contrast σ, and structure similarity
between generated and
real samples
Evaluation of structural
similarity to represent human
perception.
Complex and large
computation cost
CLIPSIM[61] SI M =υtext·υvideo
||υtext||·||υvideo ||
CLI P SI M measures similarity
cos(·) between text and video
features extracted by CLIP
Evaluation of similarity
between text and generated
data using clip[61]
Simple calculation, but
it cannot completely represent
human perception
PSNR[236] MS E =1
mn
m−1
P
i=0
n−1
P
j=0
||I(i, j)−K(i, j )||2
P SN R = 10 ·log10( M AX2
I
MS E )
P SN R measures the ratio
of the peak signal energy
to the average noise
energy MS E
It compares the
differences in pixel values
between two images.
Simple calculation, but it cannot
completely represent human
perception
Medical Synthetic Video Metrics
BmU[32] BmU =cos(BE RT (Ti
new), BERT (Ti
org ))
BmU calculates BERT
similarity between original
text prompt Torg and generated
text Tnew of synthetic videos
Evaluating the degree of
adherence to prompts within the
latent space.
Effectiveness has not been
validated on other medical
video generation methods.
13
generation of medical videos is comparatively simpler.
4.1 Medical Video Dataset and Generation Techniques
Video generation methods have demonstrated significant promise in the medical field, with the curation of diverse video
datasets. These datasets encompass a range of categories, including 1) surgical video datasets; 2) medical imaging video
datasets; 3) microscopic video datasets; 4) medical observatory video datasets; 5) medical animation datasets; and 6) medical
websites and libraries (see Table 3). Leveraging these datasets, a variety of methods have been developed, including endoscopic
surgery video generation and microscopic video generation.
4.1.1 Surgical Video Generation
Surgical video datasets are often drawn from different surgical scenarios, including open surgery, minimally invasive surgery,
small-incision surgery, endoscopic surgery, robotic surgery, and interventional operation. For simple classification, the video
datasets from these 6 kinds of surgeries are categorized into 3 kinds of video datasets: open surgery (OS) video dataset,
interventional surgery (IS) video dataset, and minimally invasive surgery (MIS) video dataset.
OS Video Generation OS is a traditional type of surgery that exposes the surgical site through a large incision, allowing
the doctor to operate directly under direct vision. Therefore, it is more suitable for large-scale operations with complex and
difficult operations. Few open surgical videos are curated into structured datasets, most of which are included in online
websites (see Table 5). Nonetheless, there is currently no video generation technology developed for open surgeries.
IS Video Generation IS, guided by imaging equipment, introduced instruments such as guide wire and catheters into
the human body through minimal incisions and the body’s natural orifices to diagnose and treat diseases.
GenDSA [29] proposes a flow-mask-based model MoStNet to interpolate between frames based on DV-MuRC [29] containing
3 million digital subtraction angiography frames from 27117 patients. It extract motion-structure information by fusing
multi-scale features of two frames and predict flow between two of them. This method is capable of generating a whole
digital subtraction angiography video using only
1
3
of total frames, helping to reduce the radiation exposure for patients while
lowering the cost of detection.
MIS Video Generation MIS is a surgical approach conducted through smaller incisions or the body's natural orifices, aiming to achieve the best therapeutic outcomes with minimal trauma. It mainly uses modern medical instruments such as laparoscopes, thoracoscopes, and arthroscopes, which enter the human body through tiny incisions or natural cavities to perform fine surgical operations.
Endora [28], trained on Colonoscopic [274], Kvasir-Capsule [275], and CholecT50 [254], generates medical videos that simulate minimally invasive surgery. It introduces a transformer-based diffusion backbone with spatial and temporal blocks to handle endoscopic videos, and extracts a prior from real videos with a DINO encoder [303] to guide video synthesis through a Pearson-correlation-based conditioning mechanism. It shows decent performance with regard to video generation metrics such as FVD, FID, and IS. Surgen [30], trained on Cholec80 [255], builds its diffusion model on DiT [26], conditioned on text, to generate high-quality surgical videos.
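Both Endora and Surgen rely on transformer-based diffusion backbones that factorize attention over space and time. The block below is a minimal, generic sketch of this spatial-temporal factorization, not a reproduction of either model; all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class SpatioTemporalBlock(nn.Module):
    """Alternates self-attention over spatial tokens and over time for video tokens."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens, dim) grid of video tokens
        b, f, n, d = x.shape
        s = x.reshape(b * f, n, d)                     # attend within each frame
        h = self.norm1(s)
        s = s + self.spatial_attn(h, h, h)[0]
        t = s.reshape(b, f, n, d).permute(0, 2, 1, 3).reshape(b * n, f, d)
        h = self.norm2(t)
        t = t + self.temporal_attn(h, h, h)[0]          # attend across frames
        t = t + self.mlp(self.norm3(t))
        return t.reshape(b, n, f, d).permute(0, 2, 1, 3)
```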
Apart from the surgical datasets used in Endora and Surgen [254, 255, 274, 275], MIS also provides a large number of datasets for training surgical video generation models, including laparoscopic surgery [256, 257, 271–273, 304], ophthalmological surgery [246–251], and others [258–270].
4.1.2 Medical Imaging Video Generation
Medical imaging refers to the use of imaging techniques to obtain images of the internal structures and functions of the human body, such as CT, MRI, X-ray, and ultrasound. Given the scarcity of medical imaging data and the radiation associated with certain imaging examinations, generating the corresponding videos holds practical value and clinical benefit. Some efforts [31, 55, 56, 293–297] have been made in this direction.
Nguyen Van Phi et al. [293] proposed a conditional diffusion model for echocardiography video synthesis guided by semantic masks. It uses spatially-adaptive normalization [305] to introduce semantic conditions into the denoising process.
Table 3: Medical video datasets, including minimally invasive surgery (MIS) videos, interventional surgery (IS) videos, real-time MRI videos, videos under a
microscope, medical observatory videos, and medical animations.
Dataset Type Category Tasks Videos Label Classes Avg. Duration Resolution
MESAD-Real[242, 243] MIS prostate Action Cls., Det. F:23366 Class, Bbox 21 - 720 ×756
MESAD-Phantom[242, 243] MIS prostate Action Cls., Det. F:22609 Class, Bbox 14 - 720 ×756
SurgicalActions160[245] MIS gynecology Action Cls. 160 Class 16 - -
Cataract-21[246] MIS Ophtha. Action Cls. 21 Class 10 - -
Cataract-101[246] MIS Ophtha. Action Cls. 101 Class 10 - -
IrisPupilSeg[247] MIS Ophtha. Iris pupil Seg., 35 Mask - - 540 ×720
CatInstSeg-Manu[248] MIS Ophtha Inst. Seg. F:843 Class, Mask, bbox 11 - -
CatInstSeg-Auto[248] MIS Ophtha. Inst. Seg. F:4738 Class, Mask, bbox 15 - -
CatRelComp-1[249] MIS Ophtha Idle Rec. F:22000 Class 22 - -
CatRelComp-2[249] MIS Ophtha Cornea, Inst. Seg. F:478 Mask 11 - -
CatRelDet[250] MIS Ophtha Surgery Phase Cls. 2200 Class 4 3s -
LensID-1[251] MIS Ophtha Surgery Phase Cls. 100 Class 2 3s -
LensID-2[251] MIS Ophtha Lens Seg. 27 Mask - - -
PitVis[252] MIS Pituitary Seg. Cls. 33 Mask Act:14, Tool:18 72.8min 720 ×720
SurgToolLoc[253] MIS laparoscope Tool Local. Cls. 24695 Class, bbox 14 30s 1280 ×720
CholecT50[254] MIS laparoscope Triplet det. Rec. 50 Class, bbox - - -
Cholec80[255] MIS laparoscope Det. Rec. 80 Class, Bbox Actions:7, Tools:7 - -
CholecT40[256] MIS laparoscope Triplet Det. Rec. 40 Class, Bbox 6 - -
SAR-RARP50[257] MIS laparoscope Action Rec.; Inst. Seg. Cls. 50 Class, Mask Act:8, Inst:9 - 1920 ×1080
FetReg[258, 259] MIS - Placental Seg., Register 18 Mask 8s - -
PETRAW[260] MIS - Workflow Rec. - Class, Mask Phases:2; Stages:12; Actions:6 - 1920 ×1080
MISAW[261] MIS - Workflow Rec. - Class Phases:2; Stages:2; Actions:17 - 960 ×540
ROBUST-MIS[262] MIS rectal Surgery Cls.;Inst. Seg - Class, Mask 3 - -
HeiChole[263] MIS - Workflow Rec.; Action Cls.; Tool Cls. 30 Class Phases:7; Actions:4; Tools:20 - -
SWAS[264] MIS - Workflow Rec.; Inst. Cls. 42 Class Phases:14, Tools:12 - -
RSS[265] MIS Kidney Inst. Seg. 16 Mask 8 - -
CATARACT[266] MIS Ophtha. Tool Cls. 100 Class 21 - 1920 ×1080
RIS[267] MIS - Inst. Seg. 18 Mask - - -
KBD[268] MIS Kidney Kidney Det. 15 Mask - - -
ICT[269] MIS - Inst. Seg. 34 Mask,Bbox - -
AOD[270] MIS Intestine Polyp Seg. 49 Mask - - -
Endoscapes-CVS201[271] MIS Laparoscope Critical View of Safety 201 Grade 3 - -
Endoscapes-bbox201[271] MIS - Tissue Tool Det. 201 Bbox 6 - -
Endoscapes-Seg201[271] MIS - Tissue Tool Seg. 201 Mask - - -
SSG-VQA[272] MIS Laparoscope VQA 25k Text - - -
MultiBypass140[273] MIS Laparoscope Workflow Rec. 140 Class P:12,St:45 - -
Colonoscopic[274] MIS Colon Cls. 76 Class 3 - 340 ×256
Kvasir-Capsule[275] MIS Digestive Tract Cls. 117 Class 14 - 336 ×336
DV-MuRC[29] IS - Generative F:3m Mask - - 489 ×489,512 ×512, ...
CAMUS[276] Ultrasound Heart Seg. F:1000 Mask 4 - -
Echonet[244] Ultrasound Heart - 10036 Clinical Info - - 112 ×112
BUV dataset[277] Ultrasound Breast Lesion Det. - Class, Bbox 2 - -
Uliver[278] Ultrasound Liver Track 7 - - - 500 ×480
2dRT[279] RT-MRI - - - Audio - - 84 ×84
Cell Track[280] Micro Hela Cell Track 2 Mask - - 700 ×1100
Tryp[281] Micro Parasites Det. 114 Bbox - - 1360 ×1024
Yeast[282] Micro Yeast Cls. 2417 Class 14 -
Embryo[283] Micro Embryo Cls. 704 Class 16 - 500 ×500
CTMC-v1[284] Micro Cell Track 86 bbox - - 320 ×400
VISEM[285] Micro Spermatozoa Evaluation 85 Text - - 640 ×480
PURE[286] Medical Observatory Videos Face Video Pulse Rate Est. 60 Pulse Rate - 60s 640 ×480
IMVIA-NIR[287] Medical Observatory Videos Face Video Pulse Rate Est. 20 Pulse Rate - - 1280 ×1024
MERL-RICE[288] Medical Observatory Videos Driving Pulse rate Est. 18 Pulse Rate - - 640 ×640
TokyoTech-NIR[289] Medical Observatory Videos Face Video Pulse Rate Est. 9 Pulse Rate - 180s 640 ×480
Video-EEG[290] Medical Observatory Videos Movements Epilepsy Diag. 191025 Class 2 4s 1920 ×1080
SimSurgSkill[291] 3D Animation Surgery Inst. Cls. Det. - Class, Bbox - - 1280 ×720
SurgVisDom[292] 3D Animation Surgery Surgical Task Cls. 59 Class 3 - 1280 ×720
1Abbreviations: Ophtha.-Ophthalmology; Cls.-Classification; Seg.-Segmentation; Local.-Localization; Det.-Detection; Est.-Estimation; Inst.-Instrument; F.-Frames
Table 4: Video generation methods in medical domain.
Domain Category Model Architecture Conditions
Surgery Video Generation
IS GenDSA[29] Optical Flow Low FPS Video
MIS Endora[28] Diffusion Model Unconditional
MIS Surgen[30] Diffusion Model Surgical Description
Medical Imaging Video Generation
Ultrasound Nguyen Van Phi[293] Diffusion Model Cardiac Structure Mask
Ultrasound Hadrien[294] Diffusion Model Echocardiogram Image, Clinical Parameters
Ultrasound Echonet-Syn[295] Diffusion Model Privacy-preserving Heart Clinical Parameters
Ultrasound Pellicer AO[296] Diffusion Model Unconditional
Ultrasound Jiamin Liang[297] GAN Motion & Key Point
Ultrasound ECM[56] Diffusion Model Cardiac Motion
Ultrasound HeartBeat[55] Diffusion Model Sketch, Mask, Skeleton, Optical Flow, Echocardiogram Image
FFA Fundus2Video[31] GAN Fundus Image, Knowledge Mask
Microscopic Video Generation
Cell Track BVDM[298] Diffusion Model Cell Mask
Yeast & Embryo Pérez PC[299] Diffusion Model, GAN Unconditional
Embryo EmbryoTgan[300] GAN Unconditional
Medical Observatory Video Generation
Near-infrared Videos Yannick[301] Dichromatic Reflection Model rPPG Signal
Medical Animation
3D Animation SyntheticColon[302] 3D Model Unconditional
Multimodal Video Generation
MRI, MIS, Cell, Ultrasound Bora[32] Open Sora[132] Medical Description
Trained on the CAMUS dataset [276], this model is able to produce realistic echocardiography videos consistent with the given semantic segmentation maps.
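For readers unfamiliar with spatially-adaptive normalization [305], the sketch below condenses the core mechanism: the semantic mask is converted into per-pixel scale and shift parameters that modulate normalized features inside the denoising network. Channel widths and kernel sizes here are illustrative, not those of the original model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPADE(nn.Module):
    """Modulates normalized features with parameters predicted from a semantic mask."""

    def __init__(self, feat_channels: int, mask_channels: int, hidden: int = 128):
        super().__init__()
        self.norm = nn.InstanceNorm2d(feat_channels, affine=False)
        self.shared = nn.Sequential(nn.Conv2d(mask_channels, hidden, 3, padding=1), nn.ReLU())
        self.gamma = nn.Conv2d(hidden, feat_channels, 3, padding=1)
        self.beta = nn.Conv2d(hidden, feat_channels, 3, padding=1)

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # Resize the semantic mask to the feature resolution, then predict
        # per-pixel scale (gamma) and shift (beta) for the normalized features.
        mask = F.interpolate(mask, size=x.shape[-2:], mode="nearest")
        h = self.shared(mask)
        return self.norm(x) * (1 + self.gamma(h)) + self.beta(h)
```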
Hadrien et al. [294] generate ultrasound videos from a single image and a left ventricular ejection fraction (LVEF) score based on EchoNet [244]. To better synthesize video data from a single image and interpretable clinical data, the Elucidated Diffusion Model (EDM) [306] is applied. Further experiments showed that the proposed method offers fine-grained control over specific properties such as LVEF, enabling precise data generation. Based on EchoNet [244], Hadrien et al. also developed an LVDM-based framework to generate longer echocardiogram videos at near real-time speeds [295], and proposed the EchoNet-Synthetic dataset, whose quality is competitive with real data. Another method, proposed by Alexandre et al. [296], is capable of generating high-quality echocardiograms from four different views.
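One common way to inject a scalar clinical parameter such as LVEF into a diffusion denoiser is to embed it alongside the diffusion timestep. The sketch below illustrates this generic pattern; it is not the exact conditioning scheme used in [294, 295], and all layer sizes are assumptions.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_embedding(value: torch.Tensor, dim: int = 128) -> torch.Tensor:
    """Standard sinusoidal embedding for a batch of scalars (timesteps or LVEF)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = value[:, None].float() * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

class ScalarConditioning(nn.Module):
    """Fuses the diffusion timestep with a clinical scalar such as LVEF."""

    def __init__(self, dim: int = 128):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(2 * dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, timestep: torch.Tensor, lvef: torch.Tensor) -> torch.Tensor:
        # lvef is expected in percent, e.g. 55.0; the combined embedding would then
        # be injected into every residual block of the denoiser (not shown here).
        emb = torch.cat([sinusoidal_embedding(timestep), sinusoidal_embedding(lvef)], dim=-1)
        return self.proj(emb)
```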
In the field of ophthalmology, Fundus2Video [31] attempts to generate Fundus Fluorescein Angiography (FFA) from Color Fundus (CF) images. Based on an in-house CF-FFA dataset, it auto-regressively trains a GAN to generate FFA sequences from CF guided by a knowledge mask. To align the input CF, the knowledge mask, and the FFA sequence with one another, it designs knowledge-boosted attention and knowledge-aware discriminators that provide targeted supervision on lesion regions. Experiments show that it successfully addresses challenges in lesion generation and pixel misalignment.
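As a simplified illustration of mask-guided supervision on lesion regions (not the knowledge-boosted attention or knowledge-aware discriminators of Fundus2Video themselves), a reconstruction loss can be re-weighted by the knowledge mask as follows; the weighting factor is an assumption.

```python
import torch

def mask_weighted_l1(pred: torch.Tensor, target: torch.Tensor,
                     knowledge_mask: torch.Tensor, lesion_weight: float = 5.0) -> torch.Tensor:
    """L1 reconstruction loss that up-weights pixels flagged by the knowledge mask.

    pred, target:    (B, C, H, W) generated and real FFA frames.
    knowledge_mask:  (B, 1, H, W) mask in [0, 1] highlighting lesion regions.
    """
    weights = 1.0 + (lesion_weight - 1.0) * knowledge_mask
    return (weights * (pred - target).abs()).mean()
```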
Existing open-source medical imaging video datasets mainly fall into two categories: 1) ultrasound datasets [56, 276–278, 295, 297] and 2) real-time MRI datasets [279]. While these datasets form the basis for generating related imaging videos, publicly available data remain relatively scarce, especially for other imaging modalities such as CT and FFA.
4.1.3 Microscopic Video Generation
Microscopic videos record biomedical behaviors under the microscope, primarily the activity of microorganisms, including 1) microorganisms' morphology, i.e., morphological changes of microorganisms at different time points, such as division and migration; 2) microorganisms' behavior, i.e., the behavior of microorganisms in a specific environment, such as interactions between microorganisms and responses to stimuli; and 3) microorganisms' labeling, i.e., dynamic changes of specific molecules or structures within cells, captured through techniques such as fluorescent labeling. Cell Track [280], Tryp [281], Yeast [282], Embryo [283], CTMC-v1 [284], and VISEM [285] record various behaviors of different types of microorganisms such as HeLa cells and yeast.
Trained on the HeLa Cell Track dataset [280], BVDM [298] consists of two parts: a DDPM [114] and VoxelMorph [307] for flow-field prediction. The DDPM [114] is trained on living-cell video frames for image generation, and the flow prediction model [307] is trained on pairs of consecutive masks to find the flow field between two consecutive images. During inference, the diffusion model generates the cell texture from the first mask. For the remaining frames, the flow predicted by VoxelMorph [307] is applied to the output of the previous iteration, and the result is fed to the DDPM to generate the next frame. This framework tackles the scarcity of annotated real living-cell datasets, and training on the synthetic data generated by BVDM [298] has been shown to outperform training with a limited amount of real data.
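The inference loop described above can be summarized schematically as follows, with `ddpm_generate`, `predict_flow`, and `warp` standing in for the trained DDPM [114] and VoxelMorph [307] components; their real interfaces differ.

```python
import torch

def bvdm_inference(masks, ddpm_generate, predict_flow, warp):
    """Schematic BVDM-style inference loop.

    masks:         list of (1, 1, H, W) cell masks, one per frame.
    ddpm_generate: callable producing a textured frame from a conditioning image.
    predict_flow:  callable mapping two consecutive masks to a dense flow field.
    warp:          callable applying a flow field to an image.
    """
    frames = [ddpm_generate(masks[0])]                    # texture synthesized for the first mask
    for t in range(1, len(masks)):
        flow = predict_flow(masks[t - 1], masks[t])       # motion between consecutive masks
        propagated = warp(frames[-1], flow)               # move the previous texture along the flow
        frames.append(ddpm_generate(propagated))          # refine with the diffusion model
    return torch.cat(frames, dim=0)
```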
Pedro Celard Pérez et al. [299] compared video generation performance on the yeast image sequence dataset [282] and the embryo video dataset [283] using the video diffusion model (VDM) [92] and Temporal GANv2 (TGANv2) [17]. The results showed that for 64 ×64 and 128 ×128 image and video generation, TGANv2 [17] performs better than VDM [92] in terms of FID [57] and FVD [59].
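Both FID [57] and FVD [59] used in this comparison reduce to the Fréchet distance between Gaussians fitted to real and generated feature sets (Inception features for FID, I3D features for FVD). A minimal sketch of that final computation is given below, with feature extraction omitted.

```python
import numpy as np
from scipy import linalg

def frechet_distance(real_feats: np.ndarray, fake_feats: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to (N, D) feature arrays."""
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)
    covmean = covmean.real  # discard tiny imaginary parts from numerical error
    return float(np.sum((mu_r - mu_f) ** 2) + np.trace(cov_r + cov_f - 2.0 * covmean))
```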
4.1.4 Medical Observatory Video Generation
Medical observatory videos continuously document biological behaviors, including morbidity and behavioral records that reflect certain medical conditions, such as the stages of epilepsy onset [290] and cardiac dynamics [286–289]. Given the clinical value of such data, their paired vital signals, such as remote photoplethysmography (rPPG) and electroencephalogram (EEG), and the scarcity of paired video-signal data, the synthesis of such data is worth exploring.
Because capturing videos with infrared cameras to reflect rPPG signals is costly, employing synthetic methods to expand such datasets is a cost-effective alternative. Yannick et al. [287] proposed a method to generate videos from synthetic rPPG signals [301, 308]. They first generate the rPPG signals, then generate the spatial, channel, and motion dimensions of the video based on these signals, and finally integrate them into a complete video.
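As a toy illustration of the overall idea, far simpler than the dichromatic-reflection-based pipeline of [287, 301], the sketch below synthesizes a sinusoidal pulse signal and uses it to modulate the skin-region intensity of a static face image over time; all parameter values are illustrative.

```python
import numpy as np

def synthesize_rppg_video(base_frame: np.ndarray, skin_mask: np.ndarray,
                          fps: int = 30, seconds: int = 10,
                          heart_rate_bpm: float = 72.0, amplitude: float = 2.0) -> np.ndarray:
    """Toy rPPG video: modulate skin-pixel intensity with a synthetic pulse wave.

    base_frame: (H, W, 3) uint8 face image.
    skin_mask:  (H, W) mask in [0, 1] marking skin regions.
    """
    n = fps * seconds
    t = np.arange(n) / fps
    pulse = np.sin(2 * np.pi * heart_rate_bpm / 60.0 * t)    # synthetic rPPG signal
    frames = np.empty((n, *base_frame.shape), dtype=np.uint8)
    for i in range(n):
        delta = amplitude * pulse[i] * skin_mask[..., None]  # subtle periodic intensity change
        frames[i] = np.clip(base_frame.astype(np.float32) + delta, 0, 255).astype(np.uint8)
    return frames
```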
In addition to the datasets [286–289] used in the algorithm proposed by Yannick et al. [287], EEG signals can also guide video generation. The Video-EEG dataset proposed by VepiNet [290] contains 191,025 video-EEG segments from 484 patients, providing a foundation for generating epilepsy videos from EEG signals.
4.1.5 Medical Animation and Generation
Medical animation offers a compelling way for the public to learn about medicine, featuring animations that explain complex medical concepts and 3D simulators that demonstrate, e.g., surgical procedures.
SyntheticColon [302] was not trained on existing 3D animation video datasets. Instead, it renders a 3D model and generates videos by moving a camera through the intestinal model.
Both SimSurgSkill [291] and SurgVisDom [292] contain VR videos for training and surgical videos for testing, in order to tackle domain adaptation challenges. Taking SimSurgSkill as an example, it includes 157 VR videos at 30 frames per second with a resolution of 1920 ×1080. Generating 3D animated surgical videos can contribute to the dissemination of surgical knowledge. With the advances in video generation, it may become a fast and effective alternative that is comparable with current VR and 3D rendering approaches in medical animation.
4.1.6 Multimodal Video Generation
The generation of multi-modal medical videos implies that the model can simultaneously generate medical data from different
modalities, such as surgical videos, ultrasound videos, and microscopic videos. This type of generative model is commonly
referred to as an all-in-one generation model.
Bora [32] is the first diffusion model designed for text-guided multimodal biomedical video generation, fine-tuned on a new large-scale medical video corpus, including paired text-video data of endoscopy [254, 274, 275], cardiac ultrasound [244, 278], real-time MRI [279], and cellular visualization [281, 284, 285]. The video captions are generated by an LLM [81] using background information such as technical documents and research papers. Fine-tuned from Open Sora [132], Bora [32] not only understands caption details but also generates realistic videos, outperforming other video generation techniques [238, 240, 241, 309, 310].
Despite that, it is worth noting that, as mentioned in AMIR [311], due to the differences between medical and natural images, implementing an all-in-one model for medical applications faces greater difficulties and challenges, even though a medical all-in-one model increases analysis efficiency. For instance, the restoration of natural images involves reversing various image degradations back to the original RGB image distribution, whereas the restoration of medical images not only requires repairing image degradation but also restoring the image to different original medical distributions, which may lead to interference between tasks and significantly increase the difficulty of restoration. The same situation applies to Bora, which generates multiple modalities of videos simultaneously and may therefore face greatly increased complexity. Since there is no clear evidence that an all-in-one model [32] is a better design strategy than domain-specific models such as Endora, it is difficult to verify whether an all-in-one model for medical video generation will affect the quality of generation.
4.1.7 Medical Video Libraries and Websites
This section summarizes medical video sources from online platforms such as YouTube and various video libraries (see Table 5), including surgery videos (especially open surgeries) and videos explaining medical knowledge, complementing the curated medical video datasets found in the literature.
For example, Stryker [312] contains 23 complete surgical procedures, each lasting several tens of minutes, covering 19 surgical techniques across 5 categories: facial trauma, orthognathic surgery, facial reconstruction, neurosurgery, and the temporomandibular joint. Vit-bult [313] provides 5 kinds of ophthalmic surgery videos, including intraocular lens surgery, macular surgery, pediatric surgery, trauma surgery, and uveitis surgery.
Table 5: Online medical video libraries and websites
Library Type Library Type
Omnimedicalsearch [314] Surgery, Medical Knowledge SGS Library[315] Surgery, Medical Knowledge
Spinalsurgicalvideo[316] Surgery Coronary Bypass Surgery[317] Surgery
CarpalTunnel[318] Surgery TotalKneeReplacement[319] Surgery
TotalHipReplacement[320] Surgery Tracheostomy[321] Surgery
Thyroidectomy[322] Surgery Endoscopic Sinus Surgery[323] Surgery
Stryker Surgical Videos[312] Surgery MED EL Video library[324] Surgery
ACS Online Library[325] Surgery HIF library[326] Surgery
PlasmaJet Video Library[327] Surgery Eyes Surgical video[313] Surgery
Surgical video library[328] Surgery, Medical Knowledge YouTube Medcram[329] Medical Knowledge
YouTube Osmosis[330] Medical Knowledge lecturiomedical[331] Medical Knowledge
DoctorNajeeb[332] Medical Knowledge NinjaNerdOfficial[333] Medical Knowledge
armandohasudungan[334] Medical Knowledge TheMDJourney[335] Medical Knowledge
ZeroToFinals[336] Medical Knowledge StrongMed[337] Medical Knowledge
geekymedics[338] Surgery, Medical Knowledge nucleusmedicalmedia[339] Medical Knowledge
Anatomyzone[340] Medical Knowledge SpeedPharmacology[341] Medical Knowledge
Atrial-Fibrillation[342] Medical Knowledge CAB Graft[343] Medical Knowledge
4.2 Medical Video Metrics and Benchmarks
Metrics such as IS [58], FID [57], and FVD [59] have been widely used in evaluating the performance of generating natural
videos (see section 3.5). However, these traditional metrics may not fully capture the nuances and complexities inherent in
medical data, highlighting the need for tailored biomedical metrics. Biomedical Understanding (BmU), proposed by Bora [32], evaluates biomedical video generation performance by calculating the BERT similarity between the original biomedical prompt and the text generated from the synthetic videos, thereby measuring the degree of adherence to prompts within the latent space. However, the BERT used in this metric is trained on general-domain data, and its suitability for medical analysis would therefore be strengthened if it were fine-tuned with medical data.
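A minimal sketch of how a BmU-style score could be computed with a general-domain BERT encoder is shown below; mean pooling of token embeddings and the upstream captioning model that produces the generated text are assumptions, not details taken from [32].

```python
import torch
from transformers import AutoModel, AutoTokenizer

def bmu_style_score(original_prompt: str, generated_caption: str,
                    model_name: str = "bert-base-uncased") -> float:
    """Cosine similarity between BERT embeddings of the original prompt and
    the caption generated from the synthetic video (BmU-style score)."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    embeddings = []
    with torch.no_grad():
        for text in (original_prompt, generated_caption):
            tokens = tokenizer(text, return_tensors="pt", truncation=True)
            hidden = model(**tokens).last_hidden_state        # (1, L, D) token embeddings
            embeddings.append(hidden.mean(dim=1).squeeze(0))  # mean-pooled sentence embedding
    return torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0).item()
```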
5 Future Applications
Although many video generation technologies currently exist, only a few of them have been applied to medical scenarios, leaving the majority of scenarios unexplored. Therefore, this section discusses potential application scenarios for medical video generation and provides an analysis of the technical prerequisites and existing limitations. Specifically, it examines six key application areas: 1) medical education; 2) patient-facing applications; 3) public health; 4) integration with AI models; 5) diagnostic assistance; and 6) biomedical simulations.
5.1 Medical Education
Video is a more effective medium for imparting knowledge than static images and text. Given the limitations of online medical video resources, customizable synthetic medical videos have a broader range of practical applications and are particularly beneficial in fields like surgical and clinical education, where tailored content is essential for effective learning [344].
Figure 4: Video generation model for medical education, with surgical education as an example.
5.1.1 Surgical Education
Generative models hold the potential to enhance surgical education by producing synthetic videos that explain surgical concepts and simulate surgical procedures [345], providing a more diverse range of surgical guidance to students in a more dynamic and effective manner. As potential simulators of medical surgery, surgical video generation models should comprehensively understand and master the various stages of complicated surgical procedures and generate the corresponding surgical scenes. This requires establishing a robust mapping between this knowledge and the corresponding visual representations, and consequently possessing both text-to-video and video-to-text generation capabilities.
Training this surgical video generator requires a large-scale video dataset with rich visual and textual pairings, including
textual descriptions of the videos and relevant surgical parameters such as categories, locations, and angles of surgical
instruments.
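As a purely illustrative example of what such a visual-textual pairing could look like, the record below sketches one possible schema; every field name and value is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SurgicalVideoSample:
    """Hypothetical record for a text-video pair used to train a surgical video generator."""
    video_path: str                     # path to the surgical clip
    caption: str                        # dense textual description of the clip
    phase: str                          # surgical phase label
    instruments: List[str] = field(default_factory=list)               # instrument categories in view
    instrument_boxes: List[List[float]] = field(default_factory=list)  # [x, y, w, h] per instrument
    camera_angle_deg: float = 0.0       # approximate viewing angle of the endoscope

sample = SurgicalVideoSample(
    video_path="clips/example_case_phase3.mp4",
    caption="The surgeon grasps the gallbladder and dissects the cystic duct with a hook.",
    phase="calot triangle dissection",
    instruments=["grasper", "hook"],
    instrument_boxes=[[0.12, 0.40, 0.20, 0.18], [0.55, 0.35, 0.25, 0.22]],
    camera_angle_deg=30.0,
)
```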
When generating a video, a commentary for the surgical video can also be produced simultaneously for better education. Surgical video captioning techniques, including SwinMLP-TranCAP [346] and SCA-Net [79], are able to generate captions for surgical videos. Dense video captioning [78] is also a suitable method for this purpose.
5.1.2 Other Applications
In addition to its application in surgical education, video generation techniques can be effectively utilized in various medical contexts, such as generating educational materials and videos that explain medical concepts and knowledge.
Educational Video Materials Numerous videos on YouTube explain the content of medical textbooks and facilitate more efficient learning. However, these explanations often fail to cover every knowledge point. Video generation models can therefore be utilized to produce explanatory videos targeted at specific knowledge points, aiding the learning process. Unlike the intricate realism required for surgical education scenarios, visual educational materials do not necessitate highly detailed scenes, thereby lowering the complexity of generation while enhancing practical feasibility.
5.2 Patient-facing Application
Patient-facing applications are digital tools or platforms designed specifically for patients to interact with healthcare services.
Virtual video consultation is one of the key applications of video generation in this area.
5.2.1 Virtual Video Consultation
Virtual consultations provide an effective solution for urgent medical needs when immediate physician availability is lacking.
As a dynamic multimedia medium, video conveys medical information more intuitively and vividly than traditional text or
image formats [347, 348]. The consultation system is capable of generating videos to assist patients through dialogue, and
Figure 5: Video generation model for patient-facing application, with video-based virtual consultation as an example.
by comprehending the semantic context of preceding conversations, it can produce more accurate and appropriate video responses. This offers significant advantages in enhancing communication and understanding between healthcare providers and patients.
For effective dataset preparation, ensuring diversity is paramount in order to cater to the full spectrum of patient inquiries. Similar to the question-answer (QA) datasets used for LLMs, such as WikiQA [349], the dataset for training the video-based consultation system should adopt a video QA format, specifically a question-video framework. Thoughtfully crafted prompts must be curated to meet the system's requirements.
Similar to LLMs [81, 83], the video-based consultation system is expected to support multi-turn conversation, that is, videos can be generated not only from previous conversations but also from previously generated videos. For instance, patients should be able to critique previous explanations, prompting the system to make corrections. Current works such as VideoChatGPT [350] are capable of understanding videos based on multimodal LLMs. Previous works have also attempted to edit images with multimodal LLMs [351, 352], demonstrating the potential for video editing capabilities.
Video generation models present a promising solution for enhancing medical consultations. In particular, when patients are preparing to begin specific medication regimens, treatments, or surgical procedures, a virtual treatment guide can offer explanatory videos that surpass verbal or written instructions in efficacy and utility. While existing methodologies can generate videos from a well-curated dataset using current techniques [22, 24], the development of a comprehensive system for multi-turn consultation and treatment guidance necessitates further investigation and refinement.
5.3 Public Health
Medical video generation holds immense potential across various public health contexts. Its applications extend beyond enhancing the efficiency of disseminating health knowledge to bolstering public health awareness, such as promoting a healthy diet [353, 354].
5.3.1 Public Health Promotion
By creating medical videos focused on the