Available via license: CC BY-NC-SA 4.0
Content may be subject to copyright.
No Foundations without Foundations — Why semi-mechanistic models are
essential for regulatory biology
Luka Kovačević 1 2, Thomas Gaudelet 1, James Opzoomer 1, Hagen Triendl 1, John Whittaker 1 2,
Caroline Uhler 1 3 4, Lindsay Edwards 1, Jake P. Taylor-King 1
Abstract
Despite substantial efforts, deep learning has not
yet delivered a transformative impact on elucidat-
ing regulatory biology, particularly in the realm
of predicting gene expression profiles. Here, we
argue that genuine “foundation models” of reg-
ulatory biology will remain out of reach unless
guided by frameworks that integrate mechanistic
insight with principled experimental design. We
present one such ground-up, semi-mechanistic
framework that unifies perturbation-based exper-
imental designs across both in vitro and in vivo
CRISPR screens, accounting for differentiating
and non-differentiating cellular systems. By re-
vealing previously unrecognised assumptions in
published machine learning methods, our ap-
proach clarifies links with popular techniques
such as variational autoencoders and structural
causal models. In practice, this framework sug-
gests a modified loss function that we demonstrate
can improve predictive performance, and further
suggests an error analysis that informs batching
strategies. Ultimately, since cellular regulation
emerges from innumerable interactions amongst
largely uncharted molecular components, we con-
tend that systems-level understanding cannot be
achieved through structural biology alone. In-
stead, we argue that real progress will require a
first-principles perspective on how experiments
capture biological phenomena, how data are gen-
erated, and how these processes can be reflected
in more faithful modelling architectures.
1 Relation, London, UK. 2 MRC Biostatistics Unit, University of Cambridge, UK. 3 LIDS, Massachusetts Institute of Technology, USA. 4 Broad Institute of MIT and Harvard, USA. Correspondence to: Jake P. Taylor-King <jake@relationrx.com>.
1. Introduction
Three main themes presently dominate machine learning
(ML) research in biology: structural biology [see AlphaFold
(2024)], sequence modelling [including DNA (Avsec et al.,
2021), RNA (Sumi et al.,2024), and proteins (Zhou et al.,
2024)], and regulatory biology. Regulatory biology harbours
a key unsolved problem: understanding the mapping be-
tween the manipulation of genes (e.g., knockout, inhibition,
or overexpression) and a resulting complex downstream
phenotype (e.g., proliferation, cytotoxicity, or extracellu-
lar matrix production) — a longstanding Grand Challenge
known as the ‘genotype–phenotype relationship’ (Uhler,
2024). Understanding which gene manipulations lead to
changes in phenotype that are considered beneficial is a fun-
damental task in drug discovery since it opens the possibility
of seeking drugs that mimic those perturbations. Despite
billions of dollars spent on drug development, success rates
remain low, with the principal cause of failure being an ab-
sence of efficacy (meaning a drug fails to exert a beneficial
effect) (Taylor-King et al.,2024) — in other words, a failure
to accurately predict the effect of a perturbation.
Historically, biological assays were exclusively low through-
put and collapsed high-dimensional regulatory states into a
single value (e.g. a phenotypic measure). However, now we
have the ability to generate large amounts of perturbation
data suitable for ML through the use of pooled CRISPR
screens with single-cell readouts (Frangieh et al.,2021;Pa-
palexi et al.,2021;Mimitou et al.,2019;Datlinger et al.,
2017;Dixit et al.,2016) or arrayed screens (with appropri-
ate automation). Other imaging-based readouts have also
been scaled for genome-scale perturbations, for example,
optical pooled screens (Gentili et al.,2024) and cell painting
(Chandrasekaran et al.,2023). Recent computational models
have explored the prediction of transcriptomic states for un-
seen perturbations — with the aim to understand biological
pathways and improve downstream phenotype prediction
(Roohani et al.,2022;Hetzel et al.,2022;Lotfollahi et al.,
2019;2021;Inecik et al.,2022).
Despite extensive research efforts, simple statistical meth-
ods continue to outperform deep learning in predicting tran-
scriptomic profiles (Gaudelet et al.,2024;Ahlmann-Eltze
arXiv:2501.19178v1 [cs.LG] 31 Jan 2025
et al.,2024;Wu et al.,2024;Bendidi et al.,2024;Wenteler
et al.,2024). It is implausible that the underlying regulatory
mechanisms are genuinely this trivial, so these shortfalls
likely reflect two intertwined deficits: insufficient curated
data and an overreliance on purely data-driven architectures.
We argue that the dream of “foundation models” in regu-
latory biology, those capable of robust and generalisable
predictions, will remain elusive unless grounded in a
biologically informed, semi-mechanistic framework.
Building frameworks to model regulatory biology is a chal-
lenging task because of the complex nature of gene–gene
interactions, e.g., physical protein–protein interactions, epistasis, and pleiotropy. Furthermore, even with the latest
functional genomics techniques, it is not experimentally
tractable to exhaustively screen all genes in isolation when
using primary cells, and exhaustively screening combinations of genes is not
possible even when cell numbers are immaterial (for example, when using immortalized cell models) (Bertin et al.,
2023). Finally, the standard CRISPR-Cas9 toolbox is constantly evolving: we can perform knockouts (Lara-Astiaso
et al.,2023), but also activation (Norman et al.,2019) (via
CRISPRa), interference (Tian et al.,2019) (via CRISPRi or
CRISPR-Cas13), base editing, and prime editing (Przybyla
& Gilbert,2022). Foundation models typically draw upon
data from a range of sources; when we consider the range
of cell types, culture conditions, and emerging perturbation
technologies available, we must develop sophisticated ways
of describing experimental systems for integration purposes.
In this paper, we develop a semi-mechanistic mathemati-
cal model that captures interventions in pooled CRISPR
screens with single-cell readouts, and show how this
framework applies equally to other perturbation types
and experimental designs (including both differentiating
and non-differentiating cellular systems). This “ground-
up” approach highlights subtle assumptions—often unvali-
dated—that underlie widely used methods, thus motivating
generation of new datasets for rigorous testing. We also
propose modifications to generic loss functions that incor-
porate key biological intuitions and demonstrate, on a pub-
lished dataset, that such modifications achieve faster and
more robust performance than standard alternatives. Our
overarching position is that only by weaving mechanistic
understanding with rigorous mathematical underpinnings
can we scale foundation models to achieve the next gen-
eration of predictive, interpretable, and clinically valuable
models in regulatory biology.
In Section 2, we provide a biologically-grounded mathemat-
ical model of an in vitro pooled CRISPR screen with single-
cell readout, and show how this leads to different loss func-
tions. In Section 3, we consider how single-cell technologies
are views over a hidden cell state, which gives us insights
into the relationship between batch effects and learned func-
tions. In Section 4, we discuss other experimental systems
and in Section 5, we show how the proposed mathematical
framework connects many popular established ML models.
In Section 6, we give a proof of principle demonstration of
our approach using a neural ordinary differential equation
(NODE) model, before providing a discussion in Section 7.
2. Modelling cell perturbation dynamics
For foundation models to achieve genuine out-of-
distribution performance, we need to encode some con-
ceptualization of how cells behave. Here, we describe a
perturb-seq experiment and subsequently build a mathemat-
ical description to highlight the subtle assumptions made by
other ML models.
2.1. Typical in vitro perturb-seq experiment description
Functional genomic screens typically first rely on a tech-
nology to manipulate the function or expression of genes
followed by a downstream readout of cellular function. We
focus this initial exposition on a perturb-seq style system,
i.e., a pooled CRISPR based screen with single-cell readout.
However, this could easily apply to a phenotypic screen, an
arrayed screen, etc.
Figure 1: Illustration of abstracted phases within a perturb-seq experiment: application of a genetic perturbation, $P$; a change in a media condition, $M$; and the culturing of cells over time, $W$. In panel (A.) we provide a typical wet lab illustration, and in (B.) a branching process illustration with $(n_P + 1)(n_M + 1)$ total unique branches.
In perturb-seq style screens, a large number of cells are
simultaneously edited targeting a range of biological pro-
cesses in a manner that allows for identification of the origi-
nating perturbation (Dixit et al.,2016;Datlinger et al.,2017).
Perturbation technologies include knock outs (via CRISPR
nuclease; CRISPRn), knock down (via CRISPR interfer-
ence; CRISPRi), or overexpression (via CRISPR activation;
CRISPRa) applied to a specified set of genes. Gene targeting is achieved through delivery of a CRISPR protein that
localises to a region of the genome via a single guide
RNA (sgRNA).
In some screens, cells are separated and treated with additional stimuli (Dräger et al., 2021), typically using cytokines
chosen to induce a biological process of interest. We then
wish to understand how this induced process is altered by
the earlier genetic perturbation. Other stimuli also consid-
ered include small molecule drug screens (Srivatsan et al.,
2020), or even co-culture (with a second cell type) as a new
“media” condition (Frangieh et al.,2021).
After some period of time whereby cells are cultured and
maintained, cells are harvested and sequenced to under-
stand how the perturbation and application of media leads to
dysregulation of chromatin accessibility (Liscovitch-Brauer
et al.,2021;Rubin et al.,2019;Pierce et al.,2021), the
transcriptome (Lara-Astiaso et al.,2023), or select mem-
bers of the proteome via oligonucleotide-tagged antibodies
(Frangieh et al.,2021). See Figure 1A for a cartoon of this
experiment. Resources now exist that perform meta-analyses
across such experiments and provide easy access to standardised data (Peidli et al., 2024).
2.2. Mathematical description
We abstract the in vitro perturb-seq experiment in Figure 1A
to a sequence of three actions being performed: i.) the
instantaneous application of a functional genomic pertur-
bation; ii.) the instantaneous change of a cellular media
condition; and iii.) a waiting period whereby cells are cul-
tured and free to respond to changes induced by (i.) or (ii.).
Crucially, this order is important as these actions are not
commutative. Consider the transforming growth factor beta
(TGFβ) signalling pathway induced by specific molecules
called TGFβ cytokines. One example of such molecules is
TGFB1, which can be applied to cells through culture media.
If the TGFBR1 co-receptor is knocked out before applying TGFB1, the cascade cannot start. However, knocking
out TGFBR1 after stimulation with TGFB1 would have no
effect because the cascade has already begun: the
order of operations matters. We do not yet introduce the act
of measuring cell state, which is deferred to Section 3.
We want to describe the internal state of a cell. In the absence of a highly technical mathematical construction, we
describe a cell at rest (a ‘control’ cell) by random variable
$X$ (in some undefined space $\mathcal{X}$ of random variables). Without
being too specific, this cell could be in minimum essential
media to maintain cell growth (i.e., amino acids, carbohydrates, vitamins, minerals, growth factors, hormones, and
gases); we refer to this as the baseline media condition. We
annotate a gene $\gamma$ by perturbation status $p_\gamma$ driven by one
of the aforementioned CRISPR technologies: $p_\gamma = \times$ for
CRISPRn; $p_\gamma = \downarrow$ for CRISPRi; $p_\gamma = \uparrow$ for CRISPRa; and
for completeness $p_\gamma = \cdot$ for unperturbed.
Cells are then targeted and modified by CRISPR with associated apparatus, and the gene targeted by the relevant
sgRNA is perturbed. We represent this action by a function
$P$ that applies $p_\gamma$ to $X$, and we write
$$P(X, p_\gamma) = X^{p_\gamma}. \qquad (1)$$
Here, we present a few properties of $P$. We first note that
one cannot repeatedly knock out the same gene, therefore
$$P(P(X, p_\gamma = \times), p_\gamma = \times) = P(X, p_\gamma = \times).$$
Second, in this “instantaneous” framework, we specify that
genetic perturbations are commutative in the case where
perturbations occur at the same point of time
$$(X^{p_\gamma})^{p_\delta} = (X^{p_\delta})^{p_\gamma} = X^{p_\gamma p_\delta} \quad \text{for } \gamma \neq \delta.$$
Since one cannot apply multiple CRISPRi or CRISPRa to
the same gene, operations like $P(P(X, p_\gamma = \uparrow), p_\gamma = \downarrow)$ are
not well defined. We note that gene dosing effects can be
achieved through CRISPRi with semi-efficacious sgRNA
(Jost et al., 2020); however, we do not address these niche
experimental set-ups at this point¹. Finally, we note that we
model the application of a non-targeting CRISPR construct
as the identity function $P(X, \cdot) \equiv i(X) = X$.
For simplicity of exposition, we do not distinguish between
edited cells containing a non-targeting sgRNA, and unedited
cells. Non-targeting or “scrambled” controls are typically
used in perturb-seq experiments in place of untransfected
cells as one may wish to discount any stress response in-
duced by introduction of sgRNA. These effects are believed
to be less relevant for longer experiments. If these effects
are in fact material, then such non-targeting sgRNA infected
cells could be reclassified as a perturbed population in its
own right — meaning that the mathematical model need not
change.
For the next phase of the experiment, a change in the media
is made². We represent this as the function $M$, which applies
media $m$ to random variable $X$, and we write
$$M(X, m) = X_m. \qquad (2)$$
¹ Natural gene-dosing effects may also occur through the use of
knockouts whereby mixes of functional and non-functional genes
coexist within diploid or aneuploid cell models. However, robust
quality control can remove or account for such effects.
² Typical media changes include the addition of small molecules
As before, if no media change is made, the identity function
is used, $M(X, \cdot) \equiv i(X) = X$. Similar to the use of non-targeting sgRNAs, chemical experiments often use dimethyl
sulfoxide (DMSO) as a sham addition of media, but we do
not cover these effects here.
Finally, the cells are left for $t$ units of time, and the cell
state is modified by waiting function $W$, thus
$$W(X, t) = X_t, \qquad (3)$$
and $W(X, 0) = X_0 \equiv X$.
The whole in vitro perturb-seq experiment described in Section 2.1 can then be abstracted to become the application of
the function
$$
\begin{aligned}
F(X, p_\gamma, m, t) &:= (W \circ M \circ P)(X, p_\gamma, m, t) \\
&= W(M(P(X, p_\gamma), m), t) \\
&= [(X^{p_\gamma})_m]_t \\
&= X^{p_\gamma}_{m,t}.
\end{aligned} \qquad (4)
$$
Here, we will always assume that $F$ encodes this specific
order of operations and we write $[(X^{p_\gamma})_m]_t = X^{p_\gamma}_{m,t}$, i.e.
$W \circ M \circ P$ is non-commutative.
To reiterate, for certain experiments the order of operations
is crucial. For example, editing out genes that prevent cel-
lular differentiation would have no effect if the target cells
have been exposed to differentiation-inducing media prior
to the perturbation. Similarly, if a toxic genetic perturbation
is applied to a cell before being exposed to media, the effect
would be the same regardless of media applied. Under these
circumstances, a different order of operations would lead to
a different outcome³, even with the same $p_\gamma$, $m$, and $t$.
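The properties of $P$ and the order-dependence of $W \circ M \circ P$ can be illustrated with a deliberately small toy model. The sketch below is not part of the framework itself: the `CellState` class, its fields, and the rule that a cascade starts only when TGFB1 is present and TGFBR1 is intact are all hypothetical simplifications chosen to mirror the TGFβ example.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CellState:
    """Hypothetical stand-in for the random variable X (illustration only)."""
    knocked_out: frozenset = frozenset()
    media: str = "baseline"
    cascade_active: bool = False
    elapsed: float = 0.0


def P(x, gene):
    """Instantaneous knockout; gene=None plays the role of the identity."""
    if gene is None:
        return x
    return replace(x, knocked_out=x.knocked_out | {gene})


def M(x, media):
    """Instantaneous media change; media=None plays the role of the identity."""
    if media is None:
        return x
    return replace(x, media=media)


def W(x, t):
    """Culture for t units of time under a toy rule (an assumption):
    the cascade starts if TGFB1 is present and TGFBR1 is intact,
    and once started it stays on."""
    if t == 0:
        return x
    starts = x.media == "TGFB1" and "TGFBR1" not in x.knocked_out
    return replace(x, cascade_active=x.cascade_active or starts,
                   elapsed=x.elapsed + t)


def F(x, gene, media, t):
    """Fixed order of operations: perturb, change media, then wait."""
    return W(M(P(x, gene), media), t)


x = CellState()
# Knockout first, then stimulation: the cascade cannot start.
ko_first = F(x, "TGFBR1", "TGFB1", 1.0)
# Stimulation and waiting first, then knockout: the cascade has already begun.
media_first = F(F(x, None, "TGFB1", 1.0), "TGFBR1", None, 1.0)
print(ko_first.cascade_active, media_first.cascade_active)
```

In this toy, repeated knockouts of the same gene are idempotent and knockouts of distinct genes commute, because the perturbation state is stored as a set.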
Up to this point, we have not specified how one would
actually learn the function $F$. However, we show this
subtly depends on the underlying assumptions with regards
to the in vitro system in question. By explicitly stating such
assumptions, we find a number of novel formulations logi-
cally follow. For example, we show assumptions pertaining
to how cellular differentiation is induced can alter the loss
function, or how differentiating versus non-differentiating
cells are similar but somewhat distinct problems.
and cytokines. From an ML perspective, small molecules can be
represented via their structure, whereas cytokines may be characterised
by their amino acid sequence. Whilst cytokines are likely to be
somewhat characterised by a receptor they interact with, small
molecules may interact with a plethora of proteins (Gaudelet et al.,
2021).
³ If we wanted to flip the order such that media was added
before the genomic perturbation (followed by a waiting period),
then we would be trying to learn $F(F(X, \cdot, m, \cdot), p_\gamma, \cdot, t)$.
2.3. Non-differentiating cellular models
Many in vitro cellular models pertain to non-differentiating
systems, i.e., left to their own devices, the cells observed after
time $t$ are essentially the same as they were at the beginning of
the experiment. This observation allows us to make our key
necessary simplifying assumption.
In fact, most perturb-seq datasets relate to non-
differentiating cellular models (Peidli et al.,2024). This
is because oftentimes immortalized cancer cell lines are
easier to culture, easier to genetically edit, and one does
not need to characterize complex cytokine or transcription
factor combinations to induce differentiation. In reality,
perturb-seq style methods are still an emerging technology
and there has been a trend to showcase sequencing methods
on simple cell lines before progressing to advanced models.
To construct loss functions, we must first define some notation. We write $\mathcal{G}$ as the set of perturbed genes for $n_P = |\mathcal{G}|$
perturbed genes (or multi-gene perturbations). For each
gene $\gamma \in \mathcal{G}$, we write that $p_\gamma \in \mathcal{P}$ for one of the key perturbation types $\mathcal{P} = \{\uparrow, \downarrow, \times\}$. For completeness, we write
$\mathcal{P}_0 = \mathcal{P} \cup \{\cdot\}$, where $\cdot$ corresponds to the action of not
perturbing the gene in question. The set of all possible perturbation states is then defined as $\mathcal{P}_0^{\mathcal{G}} = \{p_\gamma \in \mathcal{P}_0 \mid \gamma \in \mathcal{G}\}$.
Analogously, $m \in \mathcal{M}$, where $\mathcal{M}$ is the set of non-baseline media conditions for $n_M = |\mathcal{M}|$ unique conditions,
and $\mathcal{M}_0 = \mathcal{M} \cup \{\cdot\}$ is the total set with the baseline media
condition included.
2.3.1. NECESSARY ASSUMPTION: UNEDITED CELLS DO NOT RESPOND TO BASELINE MEDIA
Published literature primarily includes experiments that do
not characterise their starting material $X$ to the same extent
as their typical measured states, see Figure 1A. In the case
where unedited cells do not differentiate in the baseline
media, we show that our problem simplifies to become
tractable using the observations illustrated in Figure 1B.
This assumption appears to be implicitly made in many key
pieces of work using ML to predict the outcome of genetic
perturbations (Roohani et al., 2022). For all $t > 0$, we write
$$
\begin{aligned}
F(X, p_\gamma = \cdot, m = \cdot, t) &= W(M(P(X, \cdot), \cdot), t) \\
&= W(i(i(X)), t) \\
&= W(X, t) \\
&= X_t \\
&= X.
\end{aligned} \qquad (5)
$$
We also note that softer conditions would likely suffice, e.g.,
the moments of $X$ and $X_t$ are identical; however, we have
not defined the space, $\mathcal{X}$, in which $X$ resides. Consequently,
path 4 in Figure 1 is the identity function and measurements
of $X_t$ are in fact identically distributed to measurements of
$X$. The function $F$ can then be fit through pairs of input-output data points by mapping $X_t$ in path 4 to the end states
in paths 1, 2, and 3 in Figure 1.
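For a loss, $L : \mathcal{X} \times \mathcal{X} \to \mathbb{R}_+$, applied to predicted-actual
pairs $(\hat{X}, X) \in \mathcal{X} \times \mathcal{X}$, we can calculate a total loss $L_T$
over all input–output pairs as
$$
\begin{aligned}
L_T ={}& \underbrace{\sum_{m \in \mathcal{M}} \sum_{\gamma \in \mathcal{G}} L\big(F(X, p_\gamma, m, t),\, X^{p_\gamma}_{m,t}\big)}_{\text{path 1: } n_P n_M \text{ data points}} \\
&+ \underbrace{\sum_{\gamma \in \mathcal{G}} L\big(F(X, p_\gamma, \cdot, t),\, X^{p_\gamma}_{t}\big)}_{\text{path 2: } n_P \text{ data points}} \\
&+ \underbrace{\sum_{m \in \mathcal{M}} L\big(F(X, \cdot, m, t),\, X_{m,t}\big)}_{\text{path 3: } n_M \text{ data points}} \\
&+ \underbrace{L\big(F(X, \cdot, \cdot, t),\, X_t\big)}_{\text{path 4: 1 data point}},
\end{aligned} \qquad (6)
$$
where $n_P$ is the number of perturbations and $n_M$ is the number
of (non-baseline) media conditions. Across paths 1 to 4, we
count a total of $(n_P + 1)(n_M + 1)$ pairs of data points.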
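The bookkeeping in equation (6) can be sketched as a loop over every (perturbation, media) branch, with `None` standing in for the unperturbed and baseline options. This is only an illustrative sketch: `total_loss`, `toy_F`, the additive shifts, and the mean squared error are hypothetical choices, and cell states are replaced by plain vectors.

```python
import itertools

import numpy as np


def total_loss(F, loss, x0, observed, genes, medias, t):
    """Sum a pairwise loss over all (n_P + 1)(n_M + 1) branches.

    `observed` maps (gene_or_None, media_or_None) to the measured end state;
    None denotes "no perturbation" or "baseline media".
    """
    branches = list(itertools.product([None, *genes], [None, *medias]))
    total = sum(loss(F(x0, g, m, t), observed[(g, m)]) for g, m in branches)
    return total, len(branches)


# Hypothetical toy system: perturbations and media shift a 3-dimensional state.
gene_shift = {"g1": np.array([1.0, 0.0, 0.0]), "g2": np.array([0.0, 1.0, 0.0])}
media_shift = {"m1": np.array([0.0, 0.0, 1.0]),
               "m2": np.array([0.5, 0.5, 0.0]),
               "m3": np.array([0.0, 0.5, 0.5])}


def toy_F(x, gene, media, t):
    """Toy model: apply additive shifts (t is unused in this sketch)."""
    out = x.copy()
    if gene is not None:
        out = out + gene_shift[gene]
    if media is not None:
        out = out + media_shift[media]
    return out


def mse(a, b):
    return float(np.mean((a - b) ** 2))


x0 = np.zeros(3)
observed = {(g, m): toy_F(x0, g, m, 1.0)
            for g in [None, *gene_shift] for m in [None, *media_shift]}
loss_value, n_pairs = total_loss(toy_F, mse, x0, observed,
                                 list(gene_shift), list(media_shift), 1.0)
print(loss_value, n_pairs)
```

With two perturbations and three media conditions, the loop visits (2 + 1)(3 + 1) = 12 branches, matching the count stated above.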
In order to learn $F$, assuming we only measure a single
time point as shown in Figure 1B and we have a non-differentiating cellular model, we must make this necessary
assumption that unedited cells do not respond in baseline
media.
If this necessary assumption cannot be made, the function
that learns the relationship between paired measurements
of $(X_t, X^{p_\gamma}_{m,t})$ then becomes a counterfactual prediction.
If we write that $X_t = F(X, \cdot, \cdot, t)$, then by stating the
existence of the inverse function, $X = F^{-1}(X_t, \cdot, \cdot, t)$,
we can define a counterfactual function as
$$
\begin{aligned}
C(X_t, p_\gamma, m) &= (F \circ F^{-1})(X_t, p_\gamma, m) \\
&= F(F^{-1}(X_t, \cdot, \cdot, t), p_\gamma, m, t).
\end{aligned} \qquad (7)
$$
2.3.2. OPTIONAL ASSUMPTION I: PERTURBED DISTRIBUTIONS ARE ATTRACTORS OF DYNAMICAL SYSTEMS
In early systems biology literature employing large systems
of ordinary differential equations (ODEs), various steady
state assumptions are typically made to simplify downstream analysis (Klipp et al., 2005). Attractors are stable
steady states or regions of state space within a dynamical
system that solutions converge towards. If we make the
assumption that $F$ determines a dynamical system and $X^{p_\gamma}_{m,t}$
is a steady state⁴ for some $(p_\gamma, m) \in \mathcal{P}_0^{\mathcal{G}} \times \mathcal{M}_0$ and $s, t \geq 0$,
then
$$\frac{d}{dt} F(X^{p_\gamma}_{m,s}, \cdot, \cdot, t) = 0, \qquad (8)$$
or in the non-infinitesimal case
$$F(X^{p_\gamma}_{m,s}, \cdot, \cdot, t) = X^{p_\gamma}_{m,s+t} = X^{p_\gamma}_{m,s}. \qquad (9)$$
In essence, this means that the duration of our experiment is
much longer than the time needed for cells to reach a steady
state (for example, in transcriptional space).
⁴ Note that our random variable, $X$, can still be at steady state
in aggregate, even though individual cells still progress through
the cell cycle.
To incorporate this into a loss function, we add
further terms to our loss function for these
regions of state space
$$\underbrace{\sum_{(\gamma, m) \in S} L\big(F(X^{p_\gamma}_{m,t}, \cdot, \cdot, s),\, X^{p_\gamma}_{m,t}\big)}_{\text{up to } (n_P + 1)(n_M + 1) \text{ data points}}, \qquad (10)$$
for subset $S \subseteq \mathcal{G} \times \mathcal{M}$ corresponding to perturbations and
media conditions where the dynamical system is believed
to have relaxed. As a trivial example, in the TGFβ example explained in the introduction to Section 2.2: within
a knockout screen for a non-differentiating cell model, $S$
could include the element $(\text{TGFBR1}, \text{TGFB1}) \in S$ because
the TGFB1 cytokine media condition cannot induce a response in TGFBR1 knocked out cells and thus the system is
at steady state.
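As a rough illustration of the extra term in equation (10), the sketch below penalises any change in a measured end state when it is evolved for a further time $s$ under the learned flow. Everything here is an assumption for demonstration: the flow is a hand-written linear relaxation integrated with explicit Euler steps rather than a trained model, and the squared error stands in for the generic loss $L$.

```python
import numpy as np


def flow(x, t, target, rate=1.0, n_steps=100):
    """Toy F(x, ., ., t): Euler integration of dx/dt = rate * (target - x)."""
    dt = t / n_steps
    for _ in range(n_steps):
        x = x + dt * rate * (target - x)
    return x


def steady_state_penalty(flow_fn, end_states, s):
    """Sum of squared changes after evolving each end state for extra time s."""
    return sum(float(np.sum((flow_fn(x, s) - x) ** 2)) for x in end_states)


target = np.array([1.0, 2.0])            # hypothetical attractor


def flow_fn(x, s):
    return flow(x, s, target)


relaxed = [target.copy()]                # end state already at the attractor
unrelaxed = [np.zeros(2)]                # end state still far from it

penalty_relaxed = steady_state_penalty(flow_fn, relaxed, s=1.0)
penalty_unrelaxed = steady_state_penalty(flow_fn, unrelaxed, s=1.0)
print(penalty_relaxed, penalty_unrelaxed)
```

The penalty vanishes only when the flow leaves the end state unchanged, so adding it to the training objective pushes the model to treat the selected end states as fixed points.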
If we have further time series data, we obtain further paired
data points along trajectories as the system approaches the
steady state. In Section 6, we demonstrate how enforcing
steady states in a NODE model of transcription dynamics using equation (10) leads to rapid convergence when
compared to a loss function without this additional
term. We discuss an alternative “softer” version of optional
assumption I in Appendix A.
EXPERIMENTAL RECOMMENDATIONS
In Section 2.3, we proposed a number of assumptions
that allow one to better leverage perturbation data. We recommend:
• Verification of the necessary assumption through measurement of $X$: unedited cells do not respond to baseline media⁵.
• Generation of time series data to validate optional
assumption I (or optional assumption II, see Appendix
A).
These assumptions will need validating for any experimental
system of interest. As these will typically require comparison of mRNA at different time points, we should note the
likely requirement of fixation methods or cascading experiment start times (De Jonghe et al., 2024a;b).
⁵ It is also worth characterising what exactly in the media is
driving the response such that only a minimal set of growth-factor
components are required.
3. Measurements of cell state using single-cell technology
One challenge when learning $F$ is that we never actually
measure random variables within $\mathcal{X}$; we typically measure
finite-dimensional count vectors as generated by single-cell
’omic technologies. Foundation models typically aggregate
large amounts of data originating from many sources. In a
perfect world where we have infinitely many perfect measurements of cell state, the mathematical setup presented in
Section 2 would suffice. However, with regards to training
foundation models, single-cell omics contains numerous
complexities that are not found in many other data types.
These relate to the modality used, the depth of sequencing,
and batch effects driven by biological and technical factors.
Therefore, the types of model structures that $F$ will incorporate will be limited by the types of measurement technology
and experimental design. We now move from an abstract
concept of cell state to specific single-cell omic readouts.
3.1. Single-cell technology description and resulting learned function
The central dogma of molecular biology states that bio-
logical sequential information is transferred from DNA to
RNA as it is transcribed, and from RNA to proteins as it is
translated. Modern molecular biology has advanced to
the point that single-cell technologies are able to measure omic modalities relevant to: chromatin accessibility (a
DNA and nuclear protein complex) relevant to describing
which areas of DNA are being transcribed; mRNA tran-
script abundance levels relevant to specifying which genes
are active; and specific protein levels illustrating which
mRNA were translated (De Jonghe et al.,2024a;b). Orig-
inally these biomolecules would be measured separately
through single-cell Assay for Transposase-Accessible Chro-
matin using sequencing (scATAC-seq) (Chen et al.,2018b),
single-cell Ribonucleic acid sequencing (scRNA-seq), and
Cellular Indexing of Transcriptomes and Epitopes by Se-
quencing (CITE-seq) (Stoeckius et al.,2017). However,
some of these modalities can now be measured simultane-
ously, for example DOGMA-seq (Mimitou et al.,2021) and
TEA-seq (Swanson et al.,2021) are able to measure all of
the aforementioned biomolecules. For an illustration of how
measurements of key biomolecules are transformed into
processed data, see Figure 2. Due to the expense of run-
ning such advanced assays, any framework that endeavours
to capture large aspects of biology will have to be able to
handle incomplete data with missing observations.
Returning to our mathematical construction, we can consider omic readouts as functions applied to $X$, which themselves become random variables that can be sampled from
to create a finite dimensional vector. Specifically, single-cell
ATAC-seq, RNA-seq and CITE-seq measurements can be
written as
$$x_G \sim V_G(X), \quad x_T \sim V_T(X), \quad x_P \sim V_P(X) \qquad (11)$$
respectively for $x_G \in \mathbb{N}_0^{n_W n_G}$ and $x_T, x_P \in \mathbb{N}_0^{n_G}$, and we
provide a visualisation of these measurements in Figure 2.
In equation (11), we assume that ATAC-seq reads have been
binned to $n_W$ windows per gene, and $\mathbb{N}_0$ is used to denote
the set of natural numbers including zero. We note that
(to date), no assay is able to measure a full state $x_{\text{TOTAL}} = (x_G, x_T, x_P)$, but only a noisy subset of the transcriptome
or proteome. To simplify exposition, we will use $V$ to
indicate some omic measurement has been made as many
of our conclusions are agnostic to measurement technology.
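From equation (4), we can apply $V$ to both sides to obtain
$$
\begin{aligned}
\underbrace{V(X^{p_\gamma}_{m,t})}_{\sim x^{p_\gamma}_{m,t}} &= V\big(F(X, p_\gamma, m, t)\big) \\
&= V\big(F(V^{-1}(\underbrace{V(X)}_{\sim x}), p_\gamma, m, t)\big).
\end{aligned} \qquad (12)
$$
Therefore, as we cannot learn $F$, the best we can do is learn
the projected function
$$\mathbb{F} := V \circ F \circ V^{-1} : \mathbb{X} \times \mathcal{P}_0^{\mathcal{G}} \times \mathcal{M}_0 \times (0, t) \to \mathbb{X} \qquad (13)$$
to the extent that $V$ has an inverse and $\mathbb{X} = \mathrm{supp}(V(X))$.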
3.2. Measurement artifacts
Focusing on the most commonly used single-cell omic
modality, scRNA-seq, there are two technical caveats that
one must consider when modelling the resulting data:
dropout and batch effects.
Briefly, dropout refers to the inability of many of the popular single-cell sequencing technologies to detect lowly expressed transcripts; in fact, only ∼5-30% of transcripts are actually measured in the cell, and these measurements may
be biased towards highly expressed genes. Various models have been chosen as the measurement function $V_T$ to
account for this, most commonly the Zero-Inflated Negative Binomial (ZINB) model⁶. There are similar technical
artefacts when considering scATAC-seq data. For reviews
pertaining to the modelling of single-cell count data, see
(Jiang et al., 2022; Choudhary & Satija, 2022).
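To make the role of a ZINB measurement function concrete, the following sketch draws counts from a negative binomial (written as a gamma-Poisson mixture) and then zeroes out entries at random. The function name `sample_zinb` and the parameter values are hypothetical, chosen only to show the mechanics; they are not estimates for any real assay.

```python
import numpy as np


def sample_zinb(mean, dispersion, dropout_prob, rng):
    """Draw zero-inflated negative binomial counts.

    The negative binomial is sampled as a gamma-Poisson mixture with the
    given mean and variance mean + mean**2 / dispersion; each entry is then
    set to zero independently with probability `dropout_prob`.
    """
    mean = np.asarray(mean, dtype=float)
    rate = rng.gamma(shape=dispersion, scale=mean / dispersion)
    counts = rng.poisson(rate)
    keep = rng.random(mean.shape) >= dropout_prob
    return counts * keep


rng = np.random.default_rng(0)
mean_expression = np.full((1000, 5), 2.0)   # 1000 cells, 5 genes (toy values)
counts = sample_zinb(mean_expression, dispersion=1.0, dropout_prob=0.3, rng=rng)
zero_fraction = float((counts == 0).mean())
print(counts.shape, zero_fraction)
```

The observed zero fraction exceeds the dropout probability because the negative binomial itself produces zeros for lowly expressed genes, which is one reason the two sources of zeros are hard to separate from count data alone.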
⁶ Whilst not the focus of this work, Powell et al. (2024) attribute
the presence of zero-inflated counts to heterozygosity.

Batch effects, or small differences between experimental
runs, can be much more pernicious and emerge for a number
of reasons; these are typically attributed to either biological variation or technical variation. Biological variation
corresponds to differences in the cellular model of interest,
e.g., an immortalised cell model has been allowed to undergo more rounds of cell division than another cell model
that is supposedly identical. Technical variation relates to
the imperfections in the manufacturing process for biological instrumentation and reagents. Of particular relevance to
single-cell technologies: most technologies sample ∼10,000
cells at a time⁷.

Figure 2: Diagrammatic overview of how an all-encompassing variable $X$ is transformed by single-cell technologies into a dataset by $V$ (with panels for $V_G$, $V_T$, and $V_P$). Adapted from Peidli et al. (2024).
For CRISPR-based genetic screens, cells are typically edited
to express the sgRNA constitutively; subject to a few nuances, this means that not only is the target gene edited, but
the identity of the genetic perturbation can be resolved from
RNA sequencing data. For cell populations maintained in
one of several media conditions of interest, one typically runs each cell
population through a separate single-cell reaction, leading
to irresolvable batch effects: one cannot be definitively sure
that differences in gene expression are driven by differences
in media or imperfections between single-cell sequencing
reactions. For an illustration of batching, see appendix Figure 7. For this reason, replicates should be performed⁸,
but often are not. We briefly discuss how batch effects fit
into our model framework.
⁷ For microfluidic systems, we typically refer to sequencing
∼10,000-20,000 cells as using a reaction within a “chip lane”.
New non-microfluidic technologies are now available with fewer
limitations on reaction sizes, but potentially with other technical
limitations; see review (De Jonghe et al., 2024a;b).
⁸ Note that one can use “hashing” (barcoded antibodies targeting ubiquitously expressed surface proteins) to remove potential
batch effects. Here, the antibody barcode is used to encode the
identity of the sample (for example, relevant to a media condition, cell model, or donor) in a unique sequence, and thus cell
populations are mixed and the identities of the samples can be
re-identified later.
One can assume that for each batch we largely measure the
true gene expression distribution, supplemented by a zero-mean noise term, $\eta_b$, such that when averaged over batches
$\mathbb{E}[\eta_b] = n_B^{-1} \sum_{b=0}^{n_B - 1} \eta_b = 0$, and we write
$$\underbrace{V_b(X)}_{\sim x_b} = \underbrace{V(X)}_{\sim x} + \eta_b(\ldots) \qquad (14)$$
for $b = 0, \ldots, n_B - 1$. The key issue is that $\eta_b$ is a function dependent on which cell states and perturbations are
contained within each batch. Therefore, our ability to learn
$\mathbb{F}$ in equation (13) depends on how $X_t$, $X^{p_\gamma}_t$, $X_{m,t}$ and
$X^{p_\gamma}_{m,t}$ become batched together. For a brief analysis of the
consequences of this, see Appendix B, where we investigate
how errors propagate across batches and interfere with our
ability to learn $\mathbb{F}$. Common to foundation models, there
are also careful considerations one must make with regards
to combining datasets from multiple laboratories, discussed
more in Appendix B.
EXPERIMENTAL RECOMMENDATIONS
From the analysis in Appendix B, we find that the total error is a function of the error in batches that contain unperturbed cells in the baseline media, $X$, and the batches containing perturbed cells in stimulated media, $X^{p_\gamma}_{m,t}$. Therefore, assuming that the errors from each batch are independent and identically distributed, the total error can be reduced by incorporating unperturbed cells in the baseline media into every batch (see footnote 9).
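As a toy illustration of why this recommendation helps (not a reproduction of the Appendix B analysis), the sketch below assumes that each batch adds independent, identically distributed zero-mean noise to the true control profile, in the spirit of equation (14). The function name `control_error`, the Gaussian noise model, and all numerical settings are illustrative assumptions; in particular, the noise here does not depend on batch contents, which the text notes it generally does.

```python
import numpy as np

def control_error(n_batches, n_genes=200, noise_sd=0.5,
                  control_in_every_batch=True, seed=0):
    """Mean squared error of the estimated control profile (toy model)."""
    rng = np.random.default_rng(seed)
    true_control = rng.normal(size=n_genes)  # stands in for V(X)
    # Assumed i.i.d. zero-mean batch noise eta_b, one row per batch.
    eta = rng.normal(scale=noise_sd, size=(n_batches, n_genes))
    measured = true_control + eta            # V_b(X) = V(X) + eta_b
    if control_in_every_batch:
        # Controls present in every batch: average over all batches.
        estimate = measured.mean(axis=0)
    else:
        # Controls present in a single batch only.
        estimate = measured[0]
    return float(np.mean((estimate - true_control) ** 2))

err_every = control_error(8, control_in_every_batch=True)
err_single = control_error(8, control_in_every_batch=False)
print(f"controls in every batch: {err_every:.4f}")
print(f"controls in one batch:   {err_single:.4f}")
```

Under these assumptions, the error of the averaged control estimate shrinks roughly in proportion to the number of batches, whereas the single-batch estimate retains the full noise variance.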
3.3. Loss functions
In Section 2.3.1, we referred to a generic loss function $\mathcal{L} : \mathcal{X} \times \mathcal{X} \to \mathbb{R}^+$ with $(\hat{X}, X) \in \mathcal{X} \times \mathcal{X}$ for illustrative purposes. Depending on the omic measurement(s) taken, we will need to define an appropriate loss function.
Footnote 9: One would need to achieve this through a barcoding strategy to combine media conditions into the same chip lane.
With single-cell technologies, one does not have control over exactly how many cells one will capture. Therefore, one is left with the challenge of comparing two distributions: the set of model predictions generated from applying $F$ to $I$ non-targeting cells in the baseline media, with $J$ actual perturbed cells. Put simply, we need to construct a loss function between two groups of cells with different numbers of cells contained within each group. More specifically,
$$\mathcal{L}_V\left( \{F(x_t[i], p_\gamma, m)\}_{i=1}^{I}, \{x^{p_\gamma}_{t,m}[j]\}_{j=1}^{J} \right), \qquad (15)$$
and therefore $\mathcal{L}_V : \mathcal{X}^I \times \mathcal{X}^J \to \mathbb{R}^+$.
With this challenge in mind, optimal transport has attracted increasing interest in the single-cell community (Bunne et al., 2023; 2024), but it suffers from the curse of dimensionality. Various methods from statistics are also appropriate, for example the use of E-distance (Peidli et al., 2024), or simpler techniques including minimising the mean squared error (MSE) between low order moments (i.e., mean, variance etc). Finally, a number of other heuristics have been tried, including random matching of cells between control and perturbed distributions (Roohani et al., 2022).
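Two of the options above can be sketched for groups of unequal size. The code below is a minimal illustration, assuming Euclidean distance and the standard form of the energy distance; the exact E-distance variant used by Peidli et al. (2024) may differ, and the function names and synthetic data are hypothetical.

```python
import numpy as np

def pairwise_dist(a, b):
    """Euclidean distances between rows of a (I x G) and rows of b (J x G)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def energy_distance(pred, obs):
    """Energy distance between two groups of cells; I and J may differ."""
    between = pairwise_dist(pred, obs).mean()
    within_pred = pairwise_dist(pred, pred).mean()
    within_obs = pairwise_dist(obs, obs).mean()
    return 2.0 * between - within_pred - within_obs

def moment_mse(pred, obs):
    """MSE between per-gene means plus MSE between per-gene variances."""
    mean_term = np.mean((pred.mean(axis=0) - obs.mean(axis=0)) ** 2)
    var_term = np.mean((pred.var(axis=0) - obs.var(axis=0)) ** 2)
    return mean_term + var_term

rng = np.random.default_rng(0)
pred = rng.normal(size=(40, 5))                 # I = 40 predicted cells
obs_same = rng.normal(size=(65, 5))             # J = 65 cells, same distribution
obs_shift = rng.normal(loc=2.0, size=(65, 5))   # J = 65 cells, shifted mean

print(energy_distance(pred, obs_same), energy_distance(pred, obs_shift))
print(moment_mse(pred, obs_same), moment_mse(pred, obs_shift))
```

Both quantities are defined for any $I$ and $J$, and both are larger for the shifted group than for the group drawn from the same distribution. The pairwise computation is quadratic in the number of cells, so this form is only practical for modest group sizes.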
4. Other experimental designs
Thus far, we have covered non-differentiating pooled screens in Section 2.3. Now that we understand how single-cell technology measures cell state, we highlight an exciting new area of inquiry: differentiating cellular models, particularly via the use of in vivo systems. In Appendix C, we briefly discuss arrayed screens and other modelling assumptions worthy of consideration.
4.1. Differentiating cell models and in vivo systems
By optimising the time point at which cells are harvested, one can capture a range of different differentiation states along a trajectory within a single experiment. To achieve such complex behaviour, cells require stimulation by a cocktail of cytokines in vitro, or stimulation that arises naturally in in vivo perturb-seq screens (where media changes are not possible, $M = \emptyset$). As shown in Figure 3, Lara-Astiaso et al. (2023) demonstrated an in vivo perturb-seq screen to investigate the differentiation of hematopoietic stem cells (HSCs) into myeloid, erythroid, and lymphoid lineages within an irradiated mouse model. After 14 days, exogenous CRISPR-edited cells are found in the bone marrow. Cells without edits (the non-targeting population) achieve all three lineages, but certain lineages no longer develop when specific proteins are knocked out, and these edited cells remain in an HSC state.
Figure 3: Illustration of haematopoietic stem cells differentiating into myeloid, erythroid, and lymphoid lineages. In panel (A.) we mark each cell with its corresponding pseudotime value, and in (B.) we label each point by estimated cell type. Inset, we see the distribution of different knockout populations along the trajectory.
This is a universal phenomenon in such screens: in experiments wherein cells are encouraged to differentiate, we observe an imperfect process leading to multiple subpopulations and retention of earlier undifferentiated states (Taylor-King et al., 2020b), i.e., there is a (stochastic) drift in the distribution of cell states to include further differentiated cell states. Or, in our mathematical notation, for $t > 0$, we find
$$\mathrm{supp}(X_0) \subset \mathrm{supp}(X_t), \qquad (16)$$
where $\mathrm{supp}(X) := \{x \in \mathcal{X} : p_X(x) > 0\}$ and $p_X(x)$ is the probability density function associated with the random variable $X$. In equation (16), we are specifying that the space of possible states increases over time as some of the cells achieve differentiation into terminal states.
For a single-cell transcriptomic readout with $n_S$ cells passing quality control pipelines, we write $\{x_i\}_{i=1}^{n_S} \sim V_T(\{X_t, X^{p_\gamma}_t\})$. Pseudotime methods attempt to derive a mapping $\sigma : \{1, \ldots, n_S\} \to (0, t)$ such that $\{x_{\sigma(i)}\}$ is ordered in time.
In Lara-Astiaso et al. (2023), a pseudotime method was applied to the non-targeting control population only, $\{x_s\} \sim V_T(X_s)$, to get time labels $s \in (0, t)$. Thereafter, perturbed cell populations $\{x_s\} \sim V_T(X^{p_\gamma}_s)$ were given a pseudotime value based on their nearest neighbour within the non-targeting control population. Thus, we have a pseudotime value for perturbed and non-targeting cell populations within the dataset.
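The nearest-neighbour label transfer described above can be sketched as follows. This is a minimal illustration, assuming Euclidean distance in some expression or embedding space; the function name `transfer_pseudotime` and the toy coordinates are hypothetical, and the original study may have used a different distance or embedding.

```python
import numpy as np

def transfer_pseudotime(control_cells, control_pseudotime, perturbed_cells):
    """Give each perturbed cell the pseudotime of its nearest control cell."""
    diff = perturbed_cells[:, None, :] - control_cells[None, :, :]
    sq_dist = (diff ** 2).sum(axis=-1)   # shape (n_perturbed, n_control)
    nearest = sq_dist.argmin(axis=1)     # index of the closest control cell
    return control_pseudotime[nearest]

# Toy data: three control cells along a line with known pseudotime labels.
control_cells = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
control_pseudotime = np.array([0.1, 0.5, 0.9])
perturbed_cells = np.array([[0.1, 0.2], [1.9, -0.1]])

perturbed_pseudotime = transfer_pseudotime(
    control_cells, control_pseudotime, perturbed_cells
)
print(perturbed_pseudotime)
```

One consequence of this design is that perturbed cells can only receive pseudotime values already present in the control population, so states that exist only under perturbation are projected onto the control trajectory.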
If we aim to learn a function $F$ that maps non-targeting cells with smaller pseudotime values to perturbed cells with larger pseudotime values, we have an opportunity to use modern ML methods.
Some developments within ML offer a natural framework to approach this phenomenon. From Section 3, we approximate $\mathcal{F}$ by the finite-dimensional approximation $F = V \circ \mathcal{F} \circ V^{-1}$. In the case where there is no branching in the pseudotime process, $x_s \in \mathcal{X}$ traces a continuous path for $s \in (0, t)$. We can then write $F$ as the solution to a neural ODE (NODE; Chen et al., 2018a; see footnote 10)
$$F(x, p_\gamma, t) = x_0 + \int_0^t G(x, p_\gamma, s)\, ds, \qquad (17)$$
for neural network $G$. When branching differentiation trajectories occur, natural extensions to NODEs can be employed, e.g., neural stochastic differential equations (Kidger, 2022).
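The forward pass implied by equation (17) can be sketched with a fixed-step Euler scheme. This is a minimal illustration only: the two-layer network with random, untrained weights, the one-hot perturbation encoding, and the names `make_vector_field` and `node_forward` are all assumptions, and practical NODE implementations typically rely on adaptive solvers and gradient-based training.

```python
import numpy as np

def make_vector_field(n_genes, n_perturbations, hidden=16, seed=0):
    """Build a small untrained network G(x, p, s) returning dx/ds."""
    rng = np.random.default_rng(seed)
    w1 = rng.normal(scale=0.1, size=(n_genes + n_perturbations + 1, hidden))
    w2 = rng.normal(scale=0.1, size=(hidden, n_genes))

    def G(x, p, s):
        inputs = np.concatenate([x, p, [s]])  # state, perturbation, time
        return np.tanh(inputs @ w1) @ w2

    return G

def node_forward(G, x0, p, t, n_steps=100):
    """Approximate x0 + integral_0^t G(x_s, p, s) ds with Euler steps."""
    x = np.array(x0, dtype=float)
    ds = t / n_steps
    for k in range(n_steps):
        x = x + ds * G(x, p, k * ds)
    return x

n_genes, n_perturbations = 4, 3
G = make_vector_field(n_genes, n_perturbations)
x0 = np.zeros(n_genes)            # initial (non-targeting) state
p = np.array([0.0, 1.0, 0.0])     # one-hot encoding of one perturbation
x_t = node_forward(G, x0, p, t=5.0)
print(x_t)
```

Here the integrand is evaluated at the current state along the path, which is the usual reading of a neural ODE; training would then adjust the weights of $G$ so that the predicted end state matches the observed perturbed cells.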
5. Connection to other areas of machine
learning literature
5.1. Variational autoencoders
We note that there have been many ML models utilising variational autoencoders (VAEs) to model cellular responses. We show that this is a special case of the mathematical construction presented thus far, by noting the assumption that recovery of $x^{p_\gamma}_{m,t}$ is only dependent on a latent variable $y$,
$$P(x^{p_\gamma}_{m,t} \mid x, p_\gamma, m, t) = \int \underbrace{P(x^{p_\gamma}_{m,t} \mid y, x, p_\gamma, m, t)}_{= P(x^{p_\gamma}_{m,t} \mid y)} P(y \mid x, p_\gamma, m, t)\, dy \qquad (18)$$
and that $p_\gamma$, $m$, and $t$ act via the function $L$, thus
$$P(y \mid x, p_\gamma, m, t) = \int \delta\big(y - L(z, p_\gamma, m, t)\big) P(z \mid x, p_\gamma, m, t)\, dz. \qquad (19)$$
Therefore, by enforcing $z$ to be normally distributed, we recover the VAE-style formulation
$$P(x^{p_\gamma}_{m,t} \mid x, p_\gamma, m, t) = \int P\big(x^{p_\gamma}_{m,t} \mid L(z, p_\gamma, m, t)\big) P(z \mid x, p_\gamma, m, t)\, dz. \qquad (20)$$
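The step from equations (18) and (19) to equation (20) follows from the sifting property of the Dirac delta. As a sketch, writing $c = (p_\gamma, m, t)$ as shorthand (our abbreviation, not notation used elsewhere in the text) and assuming the order of integration can be exchanged:

```latex
\begin{align*}
P(x^{p_\gamma}_{m,t} \mid x, c)
  &= \int P(x^{p_\gamma}_{m,t} \mid y)
     \left[ \int \delta\big(y - L(z, c)\big)\, P(z \mid x, c)\, dz \right] dy \\
  &= \int \left[ \int P(x^{p_\gamma}_{m,t} \mid y)\,
     \delta\big(y - L(z, c)\big)\, dy \right] P(z \mid x, c)\, dz \\
  &= \int P\big(x^{p_\gamma}_{m,t} \mid L(z, c)\big)\, P(z \mid x, c)\, dz .
\end{align*}
```

The first line substitutes equation (19) into equation (18), the second exchanges the integrals, and the third evaluates the inner integral at $y = L(z, c)$.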
We note that different VAE-based models have used different omic modalities. For example, in Inecik et al. (2022), transcriptomic ($x_T$) and proteomic data ($x_P$) are mapped into the same embedded space; Yang et al. (2021) use a similar architecture but tailored to transcriptomic and imaging data. If modalities share the same coordinate system, e.g. transcriptomic and proteomic (gene-based), data can be mapped into the same latent space in a simple manner. When data live in different coordinate spaces, such as transcriptomics and imaging, distributions have to be matched in the latent space.
Footnote 10: NODEs have recently been employed (Cui et al., 2022) in the development of RNA velocity models (La Manno et al., 2018), a related but distinct problem.
In addition to training on data where predictions are matched to empirical truth, when $L$ is the identity function every data point can also be mapped onto itself using a Kullback–Leibler divergence style loss function, generating additional data points equal to the total number of cells.
5.2. Causal modelling
A substantial body of work (Sussex et al., 2021; Uhler & Shivashankar, 2022; Lopez et al., 2023; Ke et al., 2023; Lagemann et al., 2023; Mao et al., 2024; Kovačević et al., 2024) has focused on using causally-inspired models to predict the effect of interventions while providing an element of interpretability. Here, the aim may be to learn a causal graph $G = (V, E)$ with vertices $V$ and directed edges $E$. Vertices that are d-separated in the graph correspond to conditionally independent variables in the data. The number of potential causal graphs that might explain any set of observations can scale super-exponentially with $|V|$, making causal structure learning for transcriptomics, where $|V| = n_G \approx 20{,}000$, very difficult even with interventional data (Uhler et al., 2013). Recent work in causal modelling has worked towards reconciling differences between the observed unperturbed and perturbed distributions by considering them as stationary diffusions (Lorch et al., 2024).
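To give a sense of this growth, the sketch below counts labelled directed acyclic graphs on $n$ vertices using Robinson's recurrence. Treating the candidate causal graphs as the set of all labelled DAGs is an assumption made for illustration; the function name `count_dags` is hypothetical.

```python
from math import comb

def count_dags(n):
    """Number of labelled DAGs on n vertices (Robinson's recurrence)."""
    counts = [1]  # a(0) = 1: the empty graph
    for m in range(1, n + 1):
        total = 0
        for k in range(1, m + 1):
            # Inclusion-exclusion over the k vertices with no incoming edges.
            sign = (-1) ** (k + 1)
            total += sign * comb(m, k) * 2 ** (k * (m - k)) * counts[m - k]
        counts.append(total)
    return counts[n]

for n in range(1, 8):
    print(n, count_dags(n))
```

The count is 25 for three vertices, 543 for four, and 29,281 for five, so exhaustive search over graphs is infeasible long before reaching gene-scale vertex sets.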
Nevertheless, causal approaches are conceptually attractive,
especially where one is interested in causal mechanisms of
disease. For example, given a particular desired healthy cell
state and an initial diseased cell state, a causal model of
perturbations would help identify a perturbation that would
take us from the initial state to the desired state. In this case,
learning the full input-output mapping across perturbations
is not necessary as we are only interested in a particular
outcome (Zhang et al.,2023). Experiments and modelling
must go hand-in-hand. Developing models that answer bio-
logically relevant questions rather than performing generic
prediction will help narrow the causal hypothesis space.
Our mathematical framework can be considered as an aug-
mented version of the standard model of causality that has
been applied to perturb-seq experiments (Yang et al.,2018;
Wang et al., 2017). We treat $p_\gamma$, $m$ and $t$ as auxiliary context variables or parameters (Magliacane et al., 2016) that act on gene-level elements of the causal structure. In Figure 4(A), $m$ and $t$ are contextual random variables that potentially act on every gene in $X$. Figure 4(B) represents a causal graph on the gene level, where CRISPR perturbations $p_\gamma$ act on individual genes. Perturbations are parameters instead of random variables. This adaptation arises naturally as $m$ and $t$ modify cellular context and $p_\gamma$ is a direct intervention on gene expression.
Figure 4: (A.) High level causal graph where each contextual variable potentially acts on all genes within $X$ or some subset. (B.) Gene-level causal graph where $x = (x_1, \ldots, x_{n_G})$ are gene counts and perturbations (e.g., $p_\gamma$) are parameters.
This is one of a number of possible approaches to adapting
a causal model to our framework. However, it has been
shown in previous work that without taking into account the
relevant contextual variables, it is impossible to distinguish
certain causal relationships (Mooij et al.,2020).
6. Proof of principle: Optional Assumption I
In Ishikawa et al. (2023), an iPSC model underwent a pooled
CRISPR screen with measurements taken on days 2, 3, 4,
and 5, but without inducing a terminally differentiated state.
In Appendix D, we confirm this from transcriptomic sig-
natures and identify 14 perturbations (including the non-
targeting control) that appear to converge on a steady state.
For the proof of principle demonstration, we pseudobulk single-cell data over $(p_\gamma, t)$ pairs giving 100 unique data points. To reduce the number of genes, only those that significantly varied over the time course were selected, leaving $n_G = 120$.
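Pseudobulking over (perturbation, time) pairs can be sketched as a simple group-wise aggregation. The example below is illustrative only: it assumes averaging of expression values within each group (summing raw counts is another common convention), and the function name `pseudobulk` and the toy labels are hypothetical.

```python
import numpy as np

def pseudobulk(counts, perturbation, timepoint):
    """Average cells sharing the same (perturbation, timepoint) pair."""
    groups = {}
    for row, p, t in zip(counts, perturbation, timepoint):
        groups.setdefault((p, t), []).append(row)
    # One profile per unique pair; each profile has one value per gene.
    return {key: np.mean(rows, axis=0) for key, rows in groups.items()}

# Toy data: three cells, two genes.
counts = np.array([[1.0, 3.0], [3.0, 5.0], [10.0, 0.0]])
perturbation = ["NT", "NT", "geneA"]   # "NT" denotes non-targeting control
timepoint = [2, 2, 5]

profiles = pseudobulk(counts, perturbation, timepoint)
for key, profile in profiles.items():
    print(key, profile)
```

Each unique pair yields a single data point, which is how a dataset of many cells collapses to a modest number of training examples.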
For the scenario without any changes to the media, that is, $M = \emptyset$, then $F = F(x, p_\gamma, t)$. Using the NODE model in equation (17), we predict unseen $(p_\gamma, t)$ pairs at time $t = 5$ for 11 non-steady state perturbations. We train the model using the remaining data via the original loss function given by equation (6), or a loss function with steady states enforced using a modification analogous to equation (10).
We report the test MSE curves in Figure 5 and find that the modified loss function leads to improved performance and stability of the underlying NODE model. Therefore, enforcing steady states in some regions of $\mathcal{X}$ improves predictions of other transcriptional states not at steady state by regularizing the overall space of possible functions attainable by the neural network.
Figure 5: Mean squared error curve on the test set for numerical proof of principle for the modified train loss function in Section 2.3.2. Experiment is repeated 100 times with different random seeds for each loss function.
7. Discussion
In this exposition, we have presented a unified framework that encompasses many of the published ML models developed for single-cell perturbation screens. Moreover, we have uncovered and clarified many hidden assumptions taken for granted by the ML community. By building gold-standard datasets using the experimental recommendations presented, we can systematically identify which assumptions and ML architectures work and which do not. From here, we will be in a position to build foundation models that have a robust capacity for extensive out-of-distribution generalization: to predict transcriptomic states, and phenotypes, for cells modified by novel perturbations in stimulated and unstimulated conditions across time.
Although we are a long way from this vision, we believe that building first-principles approaches is the most promising starting point for such foundation models. There are also substantial routes to strengthening our mathematical formalism: we have not yet considered cell-cell interactions or cell cycle effects, both of which will be the subject of future work. Other groups have also proposed mechanisms to combine biophysical modelling with deep learning frameworks (Carilli et al., 2024), suggesting we are not the only group thinking in this manner.
7.1. Alternative Views
As an alternative, one may consider building a suitably general ML model and then letting the model “figure out the rules” via active learning or reinforcement learning (Scherer et al., 2022; Bertin et al., 2023). Whilst this is promising, our view is that, despite the vast datasets now being generated, meaningful progress has yet to be realised, as demonstrated by the unreasonable effectiveness of linear models (see introduction).
Impact Statement
Large-scale perturbation experiments are seen as having
great potential to advance drug discovery, and, as such,
millions of dollars are being invested in such experiments
and the AI models based on them. The work presented here
will help maximize progress made through this investment.
References
Abramson, J., Adler, J., Dunger, J., Evans, R., Green, T.,
Pritzel, A., Ronneberger, O., Willmore, L., Ballard, A. J.,
Bambrick, J., et al. Accurate structure prediction of
biomolecular interactions with alphafold 3. Nature, pp.
1–3, 2024.
Ahlmann-Eltze, C., Huber, W., and Anders, S. Deep
learning-based predictions of gene perturbation effects
do not yet outperform simple linear methods. BioRxiv,
pp. 2024–09, 2024.
Avsec, Ž., Agarwal, V., Visentin, D., Ledsam, J. R., Grabska-Barwinska,
A., Taylor, K. R., Assael, Y., Jumper, J., Kohli,
P., and Kelley, D. R. Effective gene expression predic-
tion from sequence by integrating long-range interactions.
Nature methods, 18(10):1196–1203, 2021.
Bendidi, I., Whitfield, S., Kenyon-Dean, K., Yedder, H. B.,
Mesbahi, Y. E., Noutahi, E., and Denton, A. K. Bench-
marking transcriptomics foundation models for perturba-
tion analysis: one pca still rules them all. arXiv preprint
arXiv:2410.13956, 2024.
Bertin, P., Rector-Brooks, J., Sharma, D., Gaudelet, T.,
Anighoro, A., Gross, T., Martínez-Peña, F., Tang, E. L.,
Suraj, M., Regep, C., et al. Recover identifies synergis-
tic drug combinations in vitro through sequential model
optimization. Cell Reports Methods, 3(10), 2023.
Bunne, C., Stark, S. G., Gut, G., Del Castillo, J. S.,
Levesque, M., Lehmann, K.-V., Pelkmans, L., Krause,
A., and Rätsch, G. Learning single-cell perturbation responses
using neural optimal transport. Nature methods,
20(11):1759–1768, 2023.
Bunne, C., Schiebinger, G., Krause, A., Regev, A., and
Cuturi, M. Optimal transport for single-cell and spatial
omics. Nature Reviews Methods Primers, 4(1):58, 2024.
Carilli, M., Gorin, G., Choi, Y., Chari, T., and Pachter, L.
Biophysical modeling with variational autoencoders for
bimodal, single-cell rna sequencing data. Nature Methods,
21(8):1466–1469, 2024.
Chandrasekaran, S. N., Ackerman, J., Alix, E., Ando, D. M.,
Arevalo, J., Bennion, M., Boisseau, N., Borowa, A., Boyd,
J. D., Brino, L., et al. Jump cell painting dataset: morpho-
logical impact of 136,000 chemical and genetic perturba-
tions. BioRxiv, pp. 2023–03, 2023.
Chen, R. T., Rubanova, Y., Bettencourt, J., and Duvenaud,
D. K. Neural ordinary differential equations. Advances
in neural information processing systems, 31, 2018a.
Chen, X., Miragaia, R. J., Natarajan, K. N., and Teichmann,
S. A. A rapid and robust method for single cell chromatin
accessibility profiling. Nature communications, 9(1):1–9,
2018b.
Choudhary, S. and Satija, R. Comparison and evaluation of
statistical error models for scrna-seq. Genome biology,
23(1):27, 2022.
Cui, H., Maan, H., and Wang, B. Deepvelo: Deep learning
extends rna velocity to multi-lineage systems with cell-
specific kinetics. bioRxiv, 2022.
Datlinger, P., Rendeiro, A. F., Schmidl, C., Krausgruber,
T., Traxler, P., Klughammer, J., Schuster, L. C., Kuchler,
A., Alpar, D., and Bock, C. Pooled crispr screening with
single-cell transcriptome readout. Nature methods, 14(3):
297–301, 2017.
De Jonghe, J., Opzoomer, J. W., Vilas-Zornoza, A., Crane,
P., Nilges, B. S., Vicari, M., Lee, H., Lara-Astiaso, D.,
Gross, T., Morf, J., et al. A community effort to track
commercial single-cell and spatial’omic technologies and
business trends. Nature Biotechnology, 42(7):1017–1023,
2024a.
De Jonghe, J., Opzoomer, J. W., Vilas-Zornoza, A., Nilges,
B. S., Crane, P., Vicari, M., Lee, H., Lara-Astiaso, D.,
Gross, T., Morf, J., et al. sctrends: A living review
of commercial single-cell and spatial’omic technologies.
Cell Genomics, 4(12), 2024b.
Dixit, A., Parnas, O., Li, B., Chen, J., Fulco, C. P., Jerby-
Arnon, L., Marjanovic, N. D., Dionne, D., Burks, T., Ray-
chowdhury, R., et al. Perturb-seq: dissecting molecular
circuits with scalable single-cell rna profiling of pooled
genetic screens. Cell, 167(7):1853–1866, 2016.
Dräger, N. M., Sattler, S. M., Huang, C. T.-L., Teter, O. M.,
Leng, K., Hashemi, S. H., Hong, J., Clelland, C. D., Zhan,
L., Kodama, L., et al. A crispri/a platform in ipsc-derived
microglia uncovers regulators of disease states. bioRxiv,
2021.
Frangieh, C. J., Melms, J. C., Thakore, P. I., Geiger-Schuller,
K. R., Ho, P., Luoma, A. M., Cleary, B., Jerby-Arnon, L.,
Malu, S., Cuoco, M. S., et al. Multimodal pooled perturb-
cite-seq screens in patient models define mechanisms of
cancer immune evasion. Nature genetics, 53(3):332–341,
2021.
Gaudelet, T., Day, B., Jamasb, A. R., Soman, J., Regep,
C., Liu, G., Hayter, J. B., Vickers, R., Roberts, C., Tang,
J., et al. Utilizing graph machine learning within drug
discovery and development. Briefings in bioinformatics,
22(6):bbab159, 2021.
Gaudelet, T., Del Vecchio, A., Carrami, E. M., Cudini, J.,
Kapourani, C.-A., Uhler, C., and Edwards, L. Season
combinatorial intervention predictions with salt & peper.
arXiv preprint arXiv:2404.16907, 2024.
Gentili, M., Carlson, R. J., Liu, B., Hellier, Q., Andrews, J.,
Qin, Y., Blainey, P. C., and Hacohen, N. Classification and
functional characterization of regulators of intracellular
sting trafficking identified by genome-wide optical pooled
screening. Cell Systems, 15:1264–1277, 2024.
Hetzel, L., Böhm, S., Kilbertus, N., Günnemann, S., Lotfollahi,
M., and Theis, F. Predicting single-cell perturbation
responses for unseen drugs. arXiv preprint
arXiv:2204.13545, 2022.
Inecik, K., Uhlmann, A., Lotfollahi, M., and Theis, F. J.
Multicpa: Multimodal compositional perturbation autoen-
coder. bioRxiv, 2022.
Ishikawa, M., Sugino, S., Masuda, Y., Tarumoto, Y., Seto,
Y., Taniyama, N., Wagai, F., Yamauchi, Y., Kojima, Y.,
Kiryu, H., et al. Renge infers gene regulatory networks
using time-series single-cell rna-seq data with crispr per-
turbations. Communications Biology, 6(1):1290, 2023.
Jiang, L., Dalgarno, C., Papalexi, E., Mascio, I., Wessels,
H.-H., Yun, H., Iremadze, N., Lithwick-Yanai, G., Lipson,
D., and Satija, R. Systematic reconstruction of molecular
pathway signatures using scalable single-cell perturbation
screens. bioRxiv, pp. 2024–01, 2024.
Jiang, R., Sun, T., Song, D., and Li, J. J. Statistics or
biology: the zero-inflation controversy about scrna-seq
data. Genome biology, 23(1):31, 2022.
Jost, M., Santos, D. A., Saunders, R. A., Horlbeck, M. A.,
Hawkins, J. S., Scaria, S. M., Norman, T. M., Hussmann,
J. A., Liem, C. R., Gross, C. A., et al. Titrating gene
expression using libraries of systematically attenuated
crispr guide rnas. Nature biotechnology, 38(3):355–364,
2020.
Ke, N. R., Dunn, S.-J., Bornschein, J., Chiappa, S., Rey, M.,
Lespiau, J.-B., Cassirer, A., Wang, J., Weber, T., Barrett,
D., et al. Discogen: Learning to discover gene regulatory
networks. arXiv preprint arXiv:2304.05823, 2023.
Kidger, P. On neural differential equations. arXiv preprint
arXiv:2202.02435, 2022.
Klipp, E., Herwig, R., Kowald, A., Wierling, C., and
Lehrach, H. Systems biology in practice: concepts, im-
plementation and application. John Wiley & Sons, 2005.
Kovačević, L., Newsham, I., Mukherjee, S., and Whittaker,
J. Simulation-based benchmarking for causal structure
learning in gene perturbation experiments. arXiv preprint
arXiv:2407.06015, 2024.
La Manno, G., Soldatov, R., Zeisel, A., Braun, E.,
Hochgerner, H., Petukhov, V., Lidschreiber, K., Kastriti,
M. E., Lönnerberg, P., Furlan, A., et al. Rna velocity of
single cells. Nature, 560(7719):494–498, 2018.
Lagemann, K., Lagemann, C., Taschler, B., and Mukherjee,
S. Deep learning of causal structures in high dimensions
under data limitations. Nature Machine Intelligence, 5
(11):1306–1316, 2023.
Lara-Astiaso, D., Goñi-Salaverri, A., Mendieta-Esteban, J.,
Narayan, N., Del Valle, C., Gross, T., Giotopoulos, G.,
Beinortas, T., Navarro-Alonso, M., Aguado-Alvaro, L. P.,
et al. In vivo screening characterizes chromatin factor
functions during normal and malignant hematopoiesis.
Nature genetics, 55(9):1542–1554, 2023.
Leng, K., Rooney, B., Kim, H., Xia, W., Koontz, M.,
Krawczyk, M., Zhang, Y., Ullian, E. M., Fancy, S. P.,
Schrag, M. S., et al. Crispri screens in human astro-
cytes elucidate regulators of distinct inflammatory reac-
tive states. BioRxiv, 2021.
Liscovitch-Brauer, N., Montalbano, A., Deng, J., Méndez-Mancilla, A.,
Wessels, H.-H., Moss, N. G., Kung, C.-Y., Sookdeo, A., Guo, X., Geller, E., et al. Profiling
the genetic determinants of chromatin accessibility with
scalable single-cell crispr screens. Nature biotechnology,
39(10):1270–1277, 2021.
Lopez, R., Tagasovska, N., Ra, S., Cho, K., Pritchard, J.,
and Regev, A. Learning causal representations of single
cells via sparse mechanism shift modeling. In Conference
on Causal Learning and Reasoning, pp. 662–691. PMLR,
2023.
Lorch, L., Krause, A., and Schölkopf, B. Causal modeling
with stationary diffusions. In International Conference
on Artificial Intelligence and Statistics, pp. 1927–1935.
PMLR, 2024.
Lotfollahi, M., Wolf, F. A., and Theis, F. J. scgen predicts
single-cell perturbation responses. Nature methods, 16
(8):715–721, 2019.
Lotfollahi, M., Susmelj, A. K., De Donno, C., Ji, Y., Ibarra,
I. L., Wolf, F. A., Yakubova, N., Theis, F. J., and Lopez-
Paz, D. Compositional perturbation autoencoder for
single-cell response modeling. BioRxiv, 2021.
Magliacane, S., Claassen, T., and Mooij, J. M. Joint causal
inference on observational and experimental datasets.
arXiv preprint arXiv:1611.10351, 2016.
Mao, H., Lopez, R., Liu, K., Huetter, J.-C., Richmond,
D., Benos, P., and Qiu, L. Learning identifiable factor-
ized causal representations of cellular responses. arXiv
preprint arXiv:2410.22472, 2024.
Mimitou, E. P., Cheng, A., Montalbano, A., Hao, S., Stoeck-
ius, M., Legut, M., Roush, T., Herrera, A., Papalexi, E.,
Ouyang, Z., et al. Multiplexed detection of proteins, tran-
scriptomes, clonotypes and crispr perturbations in single
cells. Nature methods, 16(5):409–412, 2019.
Mimitou, E. P., Lareau, C. A., Chen, K. Y., Zorzetto-
Fernandes, A. L., Hao, Y., Takeshima, Y., Luo, W., Huang,
T.-S., Yeung, B. Z., Papalexi, E., et al. Scalable, multi-
modal profiling of chromatin accessibility, gene expres-
sion and protein levels in single cells. Nature Biotechnol-
ogy, pp. 1–13, 2021.
Mooij, J. M., Magliacane, S., and Claassen, T. Joint causal
inference from multiple contexts. Journal of machine
learning research, 21(99):1–108, 2020.
Norman, T. M., Horlbeck, M. A., Replogle, J. M., Ge, A. Y.,
Xu, A., Jost, M., Gilbert, L. A., and Weissman, J. S.
Exploring genetic interaction manifolds constructed from
rich single-cell phenotypes. Science, 365(6455):786–793,
2019.
Papalexi, E., Mimitou, E. P., Butler, A. W., Foster, S.,
Bracken, B., Mauck, W. M., Wessels, H.-H., Hao, Y.,
Yeung, B. Z., Smibert, P., et al. Characterizing the molec-
ular regulation of inhibitory immune checkpoints with
multimodal single-cell screens. Nature Genetics, 53(3):
322–331, 2021.
Peidli, S., Green, T. D., Shen, C., Gross, T., Min, J., Garda,
S., Yuan, B., Schumacher, L. J., Taylor-King, J. P., Marks,
D. S., et al. scperturb: harmonized single-cell perturba-
tion data. Nature Methods, 21(3):531–540, 2024.
Pierce, S. E., Granja, J. M., and Greenleaf, W. J. High-
throughput single-cell chromatin accessibility crispr
screens enable unbiased identification of regulatory net-
works in cancer. Nature communications, 12(1):1–8,
2021.
Powell, J. E., Xue, A., Yazar, S., Alquicira, J., Cuomo, A.,
Senabouth, A., Gordon, G., Kathail, P., Ye, J., and Hewitt,
A. Genetic variants associated with cell-type-specific
intra-individual gene expression variability reveal new
mechanisms of genome regulation. bioRxiv, pp. 2024–05,
2024.
Przybyla, L. and Gilbert, L. A. A new era in functional
genomics screens. Nature Reviews Genetics, 23(2):89–
103, 2022.
Roohani, Y., Huang, K., and Leskovec, J. Gears: Predicting
transcriptional outcomes of novel multi-gene perturba-
tions. bioRxiv, 2022.
Rubin, A. J., Parker, K. R., Satpathy, A. T., Qi, Y., Wu,
B., Ong, A. J., Mumbach, M. R., Ji, A. L., Kim, D. S.,
Cho, S. W., et al. Coupled single-cell crispr screening
and epigenomic profiling reveals causal gene regulatory
networks. Cell, 176(1-2):361–376, 2019.
Scherer, P., Pouplin, A., Del Vecchio, A., M S, S., Bolton,
O., Soman, J., Taylor-King, J. P., Edwards, L., and
Gaudelet, T. Pyrelational: a python library for ac-
tive learning research and development. arXiv preprint
arXiv:2205.11117, 2022.
Srivatsan, S. R., McFaline-Figueroa, J. L., Ramani, V., Saun-
ders, L., Cao, J., Packer, J., Pliner, H. A., Jackson, D. L.,
Daza, R. M., Christiansen, L., et al. Massively multi-
plex chemical transcriptomics at single-cell resolution.
Science, 367(6473):45–51, 2020.
Stoeckius, M., Hafemeister, C., Stephenson, W., Houck-
Loomis, B., Chattopadhyay, P. K., Swerdlow, H., Satija,
R., and Smibert, P. Simultaneous epitope and transcrip-
tome measurement in single cells. Nature methods, 14
(9):865–868, 2017.
Sumi, S., Hamada, M., and Saito, H. Deep generative design
of rna family sequences. Nature Methods, 21(3):435–443,
2024.
Sussex, S., Uhler, C., and Krause, A. Near-optimal multi-
perturbation experimental design for causal structure
learning. Advances in Neural Information Processing
Systems, 34:777–788, 2021.
Swanson, E., Lord, C., Reading, J., Heubeck, A. T., Genge,
P. C., Thomson, Z., Weiss, M. D., Li, X.-j., Savage, A. K.,
Green, R. R., et al. Simultaneous trimodal single-cell
measurement of transcripts, epitopes, and chromatin ac-
cessibility using tea-seq. Elife, 10:e63632, 2021.
Taylor-King, J. P., Buenzli, P. R., Chapman, S. J., Lynch,
C. C., and Basanta, D. Modeling osteocyte network for-
mation: healthy and cancerous environments. Frontiers
in bioengineering and biotechnology, 8:757, 2020a.
Taylor-King, J. P., Riseth, A. N., Macnair, W., and Claassen,
M. Dynamic distribution decomposition for single-cell
snapshot time series identifies subpopulations and trajec-
tories during ipsc reprogramming. PLoS computational
biology, 16(1):e1007491, 2020b.
Taylor-King, J. P., Bronstein, M., and Roblin, D. The future
of machine learning within target identification: causality,
reversibility, and druggability. Clinical Pharmacology &
Therapeutics, 2024.
Tian, R., Gachechiladze, M. A., Ludwig, C. H., Lau-
rie, M. T., Hong, J. Y., Nathaniel, D., Prabhu, A. V.,
Fernandopulle, M. S., Patel, R., Abshari, M., et al.
Crispr interference-based platform for multimodal ge-
netic screens in human ipsc-derived neurons. Neuron,
104(2):239–255, 2019.
Uhler, C. Building a two-way street between cell biology
and machine learning. Nature Cell Biology, 26:13–14,
2024.
Uhler, C. and Shivashankar, G. Machine learning ap-
proaches to single-cell data integration and translation.
In Proceedings of the IEEE, volume 110, pp. 557–576,
2022.
Uhler, C., Raskutti, G., Bühlmann, P., and Yu, B. Geometry
of faithfulness assumption in causal inference. Annals of
Statistics, 41:436–463, 2013.
Wang, Y., Solus, L., Yang, K. D., and Uhler, C. Permutation-
based causal inference algorithms with interventions. In
Advances in Neural Information Processing Systems, vol-
ume 30, 2017.
Wenteler, A., Occhetta, M., Branson, N., Huebner, M.,
Curean, V., Dee, W., Connell, W., Hawkins-Hooker, A.,
Chung, P., Ektefaie, Y., et al. Perteval-scfm: Benchmark-
ing single-cell foundation models for perturbation effect
prediction. bioRxiv, pp. 2024–10, 2024.
Wu, Y., Wershof, E., Schmon, S. M., Nassar, M., Osiński, B.,
Eksi, R., Zhang, K., and Graepel, T. Perturbench: Bench-
marking machine learning models for cellular perturba-
tion analysis. arXiv preprint arXiv:2408.10609, 2024.
Yang, K., Katcoff, A., and Uhler, C. Characterizing and
learning equivalence classes of causal dags under inter-
ventions. In International Conference on Machine Learn-
ing, pp. 5541–5550. PMLR, 2018.
Yang, K. D., Belyaeva, A., Venkatachalapathy, S.,
Damodaran, K., Katcoff, A., Radhakrishnan, A., Shiv-
ashankar, G., and Uhler, C. Multi-domain translation be-
tween single-cell imaging and sequencing data using au-
toencoders. Nature Communications, 12(1):1–10, 2021.
Zhang, J., Cammarata, L., Squires, C., Sapsis, T. P., and
Uhler, C. Active learning for optimal intervention design
in causal models. Nature Machine Intelligence, 5(10):
1066–1075, 2023.
Zhou, B., Zheng, L., Wu, B., Yi, K., Zhong, B., Tan, Y., Liu,
Q., Liò, P., and Hong, L. A conditional protein diffusion
model generates artificial programmable endonuclease
sequences with enhanced activity. Cell Discovery, 10(1):
95, 2024.
A. Optional assumption II: Genetic perturbations do not induce responses in baseline media
The steady state assumption detailed in Section 2.3.2 may be too extreme for many complex experimental systems. However,
there is an alternative to this assumption that may be more appropriate.
For knockout screens in particular, genetic perturbations often exhibit few differentially expressed genes (DEGs) in the baseline media condition. This is because the cell does not actively require the protein to respond to the nascent signalling cascades triggered by this baseline media condition. For example, it could be that the cell is slowly proliferating and the protein is not involved in cell cycle or background metabolic processes. In contrast, once the cell is stimulated by the addition of a component to the media, the effect of the perturbation can be profound because the cell now needs the protein to process the response.
This effect can be observed in a few papers, including Frangieh et al. (2021) and Jiang et al. (2024). For this reason, some experimental protocols elect not to use the baseline media condition for knockout screens to save on costs, see Papalexi et al. (2021). For an illustration of this effect, we show a Uniform Manifold Approximation and Projection (UMAP) in Figure 6. Here, we see enhanced effects of CRISPRi perturbations within an iPSC model of astrocytes (Leng et al., 2021) when exposed to a cocktail of IL-1α, TNF and C1q cytokines versus a baseline media background condition. In the baseline media condition, both the perturbed cells and non-targeting control cells appear to be drawn from the same distribution; in the stimulated condition the perturbations form clusters. When examining DEGs, regardless of the specific thresholds (log2 fold changes and p-values) used to call them, we see approximately twice as many DEGs in the stimulated media (i.e., $X^{p_\gamma}_{m,t}$ vs $X_{m,t}$) when compared to the baseline media condition (i.e., $X^{p_\gamma}_{t}$ vs $X_{t}$).
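The threshold-robustness check described above can be sketched as follows. This is a minimal illustration, not the analysis code used for the paper: the function names `count_degs` and `deg_ratio_grid`, the input layout (per-gene log2 fold changes and p-values), and the example numbers are all hypothetical.

```python
import numpy as np

def count_degs(lfc, pvals, lfc_threshold, p_threshold):
    """Count genes passing both an absolute log2 fold change and a p-value cut-off."""
    lfc = np.asarray(lfc, dtype=float)
    pvals = np.asarray(pvals, dtype=float)
    return int(np.sum((np.abs(lfc) >= lfc_threshold) & (pvals <= p_threshold)))

def deg_ratio_grid(stim, base, lfc_thresholds, p_thresholds):
    """Ratio of DEG counts (stimulated / baseline) for every threshold pair.

    `stim` and `base` are (lfc, pvals) tuples for the two comparisons
    (perturbed vs control in stimulated media, and in baseline media).
    """
    ratios = {}
    for lt in lfc_thresholds:
        for pt in p_thresholds:
            n_base = count_degs(*base, lt, pt)
            n_stim = count_degs(*stim, lt, pt)
            # Avoid division by zero when no baseline DEGs pass the thresholds.
            ratios[(lt, pt)] = n_stim / n_base if n_base else float("nan")
    return ratios

# Synthetic values for illustration only (not data from the paper).
stim_example = ([2.0, 2.0, 2.0, 2.0], [0.01, 0.01, 0.01, 0.01])
base_example = ([2.0, 2.0, 0.0, 0.0], [0.01, 0.01, 0.01, 0.01])
print(deg_ratio_grid(stim_example, base_example, [0.5, 1.0], [0.01, 0.05]))
```

A ratio that stays roughly constant across the grid would support the statement that the observation does not hinge on one particular choice of thresholds.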
Figure 6: UMAP embeddings of astrocyte Perturb-seq data. In (A), cells are coloured by media condition; in (B), select perturbations are shown.
In Section 2.3.1, we collapsed path 4 from Figure 1 into the identity function. If we make the further assumption that cells are not induced to differentiate by the perturbation in the baseline media, then for some $(\gamma, \cdot) \in S \subseteq G \times M$

$$
\begin{aligned}
F(X, p_\gamma, m = \cdot, t) &= W(M(P(X, p_\gamma), \cdot), t) \\
&= W(M(X^{p_\gamma}, \cdot), t) \\
&= W(X^{p_\gamma}, t) \\
&= X^{p_\gamma}_{t} \\
&= X^{p_\gamma}.
\end{aligned}
\tag{21}
$$

In words, we have collapsed path 2 into the identity function from $*$ in Figure 1 as $X$ becomes time and media invariant. We can then get additional input-output data pairs by inserting $X^{p_\gamma}$ at $*$ and predicting outputs $X^{p_\gamma}_{m,t}$ along path 1. This then generates an additional term within the loss function

$$
\underbrace{\sum_{(\gamma, \cdot) \in S} \mathcal{L}\big(F(X^{p_\gamma}, \cdot, \cdot, t),\, X^{p_\gamma}_{t}\big)}_{\text{path 1-1: up to } n_P n_M \text{ data points}}, \tag{22}
$$

for subset $S \subseteq G$ of the perturbations where this effect is observed.
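The additional loss term in equation (22) can be sketched in code as below. This is an illustrative sketch under stated assumptions: the `model(x, perturbation, media, t)` call signature, the use of `None` to stand for the "·" (no perturbation, baseline media) arguments, the mean squared error loss, and the dictionary layout are all hypothetical choices, not the paper's implementation.

```python
import numpy as np

def mse(pred, target):
    """Mean squared error, used here as a stand-in for the loss in Eq. (22)."""
    return float(np.mean((np.asarray(pred) - np.asarray(target)) ** 2))

def baseline_invariance_loss(model, perturbed_baseline, t, loss=mse):
    """Sum the loss over perturbations in S, following the form of Eq. (22).

    perturbed_baseline maps each perturbation label gamma in S to a pair
    (x_pg, x_pg_t): the perturbed state in baseline media and the same
    state observed at time t. `None` stands for the "." arguments.
    """
    total = 0.0
    for gamma, (x_pg, x_pg_t) in perturbed_baseline.items():
        pred = model(x_pg, None, None, t)  # no further perturbation, baseline media
        total += loss(pred, x_pg_t)
    return total

# Hypothetical toy data: two perturbations whose state does not change over time.
toy = {
    "g1": (np.array([1.0, 2.0]), np.array([1.0, 2.0])),
    "g2": (np.array([0.0, 3.0]), np.array([0.0, 3.0])),
}
identity_model = lambda x, perturbation, media, t: x
print(baseline_invariance_loss(identity_model, toy, t=1.0))
```

Under the assumption of this appendix, a model that acts as the identity on already-perturbed cells in baseline media incurs no penalty from this term, while any drift it predicts is penalised.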
B. Error analysis
We want to examine how far the "true" function $F = F(x, p_\gamma, m, t)$ is from a learnt function $F^* = F^*(x, p_\gamma, m, t)$. The true function has access to unbiased measurements of gene expression, whereas the learnt function relies on data from batched single-cell sequencing runs; each batch will contain different perturbations and media conditions. To combat this, we typically either preprocess the data to account for batch effects, or use an ML model that accounts for the batch identity. As preprocessing relies on some underlying statistical model, this can lead to the introduction of hidden confounding factors in the processed dataset. In contrast, including the batch identity is cleaner and allows for end-to-end training directly on the raw data.

[Figure 7 graph: measured states $X^{p_{\gamma_0}}_{m_0}$, $X^{p_{\gamma_0}}_{m_1}$, $X^{p_{\gamma_1}}_{m_0}$, $X^{p_{\gamma_1}}_{m_1}$, $X^{p_{\gamma_2}}_{m_1}$, $X^{p_{\gamma_3}}_{m_1}$, ... grouped into batches $V_0(\{X^{p_{\gamma_0}}_{m_0}, X^{p_{\gamma_1}}_{m_0}\})$, $V_1(\{X^{p_{\gamma_0}}_{m_1}, X^{p_{\gamma_1}}_{m_1}, X^{p_{\gamma_3}}_{m_1}\})$ and $V_2(\{X^{p_{\gamma_1}}_{m_1}, X^{p_{\gamma_2}}_{m_1}, X^{p_{\gamma_3}}_{m_1}\})$.]
Figure 7: The set of actions in Figure 1 is expanded to include the process of measuring cell state using single-cell technologies. Note that aspects of experimental design can impact how batch effects may emerge; the new arrows and measured states are specific to the description in the text.
Focusing on incorporating the batch identity into the learned function, we are essentially interested in learning some form of average over functions, $F_L$, which incorporate batch information. To introduce notation, we define the valid set $V$ as

$$
V := \{(i, j, b) : \text{perturbation } p_{\gamma_i} \text{ in media } m_j \text{ is contained within batch } b\}, \tag{23}
$$

which will allow us to selectively take sums over scenarios when $(p_{\gamma_i}, m_j)$ is contained within the batch of interest. We can visualise this as a graph, see Figure 7. Without loss of generality and to shorten the notation, we define $p_{\gamma_i} = \cdot$ when $i = 0$ and $m_j = \cdot$ when $j = 0$. We use $\chi_V(i, j, b)$ as the indicator function to selectively sum over perturbation-media pairs, $(p_{\gamma_i}, m_j)$, that are contained within batch $b = 0, \ldots, n_B$, and refer to $V_{0,0}$ as the set of batches where unperturbed cells in baseline media, $X$, have been measured.
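The bookkeeping behind the valid set $V$, the indicator $\chi_V$ and the control set $V_{0,0}$ can be sketched as follows. The function names and the dictionary layout are hypothetical; the batch contents mirror the three batches listed in Figure 7, with index 0 standing for "no perturbation" or "baseline media".

```python
def build_valid_set(batches):
    """Build V = {(i, j, b)} from a mapping batch id -> iterable of (i, j) pairs."""
    return {(i, j, b) for b, pairs in batches.items() for (i, j) in pairs}

def chi(valid, i, j, b):
    """Indicator function: 1 if perturbation i in media j was measured in batch b."""
    return 1 if (i, j, b) in valid else 0

def control_batches(valid):
    """V_{0,0}: batches containing unperturbed cells (i = 0) in baseline media (j = 0)."""
    return sorted({b for (i, j, b) in valid if i == 0 and j == 0})

# Batch layout read off Figure 7: (perturbation index, media index) pairs per batch.
figure7_batches = {
    0: [(0, 0), (1, 0)],
    1: [(0, 1), (1, 1), (3, 1)],
    2: [(1, 1), (2, 1), (3, 1)],
}
valid = build_valid_set(figure7_batches)
print(chi(valid, 3, 1, 1), chi(valid, 2, 1, 1), control_batches(valid))
```

In this layout only batch 0 contains unperturbed cells in baseline media, so $V_{0,0}$ has a single element.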
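To enable analytical progress, consider

$$
F^*(x, p_\gamma, m, t) = \frac{1}{n_B} \sum_{b=0}^{n_B-1} \frac{1}{|V_{0,0}|} \sum_{b'=0}^{n_B-1} \chi_{V_{0,0}}(b')\, F_L(x, p_\gamma, m, t, b' \to b), \tag{24}
$$

where $F_L = F_L(x, p_\gamma, m, t, b' \to b)$ is a function that takes in measurement $x$ made in batch $b'$, applies perturbation-media-time triplet, $(p_{\gamma_i}, m_j, t)$, to make a prediction $x^{b, p_{\gamma_i}}_{m_j,t}$ in batch $b$. The function $F_L$ satisfies

$$
F_L(x^{b'}, p_\gamma, m, t, b' \to b) = x^{b, p_\gamma}_{m,t}, \tag{25}
$$

whereby a measurement made in batch $b$ is modified by an order $\varepsilon$ error function $\eta$, written

$$
x^{b, p_{\gamma_i}}_{m_j,t} = x^{p_{\gamma_i}}_{m_j,t} + \eta_b\big(\{x^{p_{\gamma_i}}_{m_j,t} : (i, j, b) \in V\}\big). \tag{26}
$$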
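The double average in equation (24) can be sketched numerically as below. This is a toy illustration: the function `f_star`, the callable signature `f_l(x, p, m, t, b_src, b_tgt)`, and the purely additive batch shifts are assumptions made for the example and are not taken from the paper.

```python
import numpy as np

def f_star(f_l, x, p, m, t, n_batches, control_batches):
    """Average F_L over all target batches b and over source batches b' in V_{0,0}.

    Restricting the inner sum to `control_batches` plays the role of the
    indicator chi_{V_{0,0}}(b') in Eq. (24).
    """
    total = 0.0
    for b in range(n_batches):
        inner = sum(f_l(x, p, m, t, b_src, b) for b_src in control_batches)
        total = total + inner / len(control_batches)
    return total / n_batches

# Hypothetical additive batch effects: translating from batch b' to batch b
# removes the source shift and adds the target shift.
shifts = np.array([0.0, 1.0, 2.0])

def toy_f_l(x, p, m, t, b_src, b_tgt):
    return x + shifts[b_tgt] - shifts[b_src]

x = np.array([1.0, 2.0])
print(f_star(toy_f_l, x, p=None, m=None, t=0.0, n_batches=3, control_batches=[0, 1]))
```

In this toy setting the result is `x` plus the mean target shift minus the mean shift of the control batches, which shows how the choice of batches containing controls enters the learnt function.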
Equation (26) clarifies equation (14) by highlighting that the error depends on the set of perturbations and media conditions in the batch. For example, if a perturbation, $p_\gamma$, stresses a cell then it may become permeable, leading to loss of cytoplasmic RNA; this often leads to fewer RNA reads being captured, with said reads disproportionately originating from mitochondria.
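We would like to examine the error at point $x$ over all $(p_\gamma, m, b)$ triplets, written as

$$
\varepsilon(x, t) = \frac{1}{n_P n_M} \sum_{i=0}^{n_P} \sum_{j=0}^{n_M} \Big[ \underbrace{F(x, p_{\gamma_i}, m_j, t)}_{= x^{p_{\gamma_i}}_{m_j,t}} - F^*(x, p_{\gamma_i}, m_j, t) \Big]^2. \tag{27}
$$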
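Using the Taylor expansion of $F_L$ around $x$, we find

$$
\begin{aligned}
F_L(x, p_\gamma, m, t, b' \to b) &= F_L(x^{b'} - \eta_{b'}, p_\gamma, m, t, b' \to b) \\
&= F_L(x^{b'}, p_\gamma, m, t, b' \to b) - \eta_{b'} \cdot \nabla F_L(x^{b'}, p_\gamma, m, t, b' \to b) + \ldots \\
&= x^{b, p_{\gamma_i}}_{m_j,t} - \eta_{b'} \cdot \nabla F_L(x^{b'}, p_\gamma, m, t, b' \to b) + \ldots \\
&= x^{p_{\gamma_i}}_{m_j,t} + \eta_b - \eta_{b'} \cdot \nabla F_L(x^{b'}, p_\gamma, m, t, b' \to b) + \ldots
\end{aligned}
\tag{28}
$$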
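and therefore

$$
\begin{aligned}
\varepsilon(x, t) &= \frac{1}{n_P n_M} \sum_{i=0}^{n_P} \sum_{j=0}^{n_M} \Big[ F(x, p_{\gamma_i}, m_j, t) - F^*(x, p_{\gamma_i}, m_j, t) \Big]^2 \\
&= \frac{1}{n_P n_M} \sum_{i=0}^{n_P} \sum_{j=0}^{n_M} \left[ x^{p_{\gamma_i}}_{m_j,t} - \left( \frac{1}{n_B} \sum_{b=0}^{n_B-1} \frac{1}{|V_{0,0}|} \sum_{b'=0}^{n_B-1} \chi_{V_{0,0}}(b')\, F_L(x, p_\gamma, m, t, b' \to b) \right) \right]^2 \\
&= \frac{1}{n_P n_M n_B} \frac{1}{|V_{0,0}|} \sum_{i=0}^{n_P} \sum_{j=0}^{n_M} \sum_{b=0}^{n_B-1} \sum_{b'=0}^{n_B-1} \chi_{V_{0,0}}(b') \Big[ -\eta_b + \eta_{b'} \cdot \nabla F_L(x^{b'}, p_\gamma, m, t, b' \to b) + \ldots \Big]^2
\end{aligned}
\tag{29}
$$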
Examining the final line of equation (29), we find some interesting conclusions, namely that:
• Even sequencing all perturbations and all media conditions in the same batch does not strictly mean one can learn $F$ unless $\nabla F_L(x^{b'}, \ldots) \approx 1$.
• For $n_B > 1$, as the $\eta_b$ and $\eta_{b'}$ terms have opposite signs, the total error can be reduced by incorporating unperturbed cells in the baseline media into every batch.
• For $n_B > 1$, the first term in the square brackets suggests that some batch effects are irreducible; however, modelling $F_L$ via convex Lipschitz functions would be desirable if possible, because

$$
\|\nabla F_L(x^{b'}, \ldots) - \nabla F_L(x, \ldots)\| \le L \|x^{b'} - x\|.
$$
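To make the first conclusion concrete, consider a minimal worked case under simplifying assumptions introduced here for illustration only: a single batch ($n_B = 1$, so $b = b' = 0$ and $|V_{0,0}| = 1$), a scalar batch error $\eta_0$, and higher-order terms dropped. The square bracket in the final line of equation (29) then reduces to

$$
-\eta_0 + \eta_0 \, \nabla F_L(x^{0}, \ldots) = \eta_0 \big( \nabla F_L(x^{0}, \ldots) - 1 \big),
\qquad \text{so each term contributes} \qquad
\eta_0^2 \big( \nabla F_L(x^{0}, \ldots) - 1 \big)^2 .
$$

For any nonzero $\eta_0$, this contribution vanishes at leading order only when $\nabla F_L(x^{0}, \ldots) \approx 1$, which is the condition stated in the first bullet. In the vector-valued case one would presumably read the "1" as the identity matrix.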
Implications for Foundation Models: Foundation models for regulatory systems typically rely on data gathered by
different laboratories, making a deep understanding of batch effects essential. In principle, technical variation is more
tractable because its sources can often be pinpointed. For instance, in scRNA-seq workflows, differences in cell isolation
methods, reverse transcription efficiencies, and PCR amplification introduce noise and batch effects, as do variations in
capture efficiency and library preparation chemistries (e.g., 10x Genomics, Parse, etc.). Addressing these issues requires
careful experimental design, appropriate controls, and standardization or batch-correction methods during data analysis. In
theory, many of these technical factors might also be modeled with machine learning and probabilistic approaches.
However, biological variation is more difficult to manage. There is no universal standard for cell lines or culture conditions,
so it is challenging to obtain consistent signals across multiple systems. Although useful insights can still be gained from
heterogeneous data, explicitly accounting for these biological differences often requires simplistic approaches (e.g., one-hot
encoding) rather than richer parametric modeling. Together, the compounding effects of technical and biological variation
can distort or mask the signal of interest.
Consequently, validating foundation models on wholly independent test datasets — without extensive data harmonization
that risks introducing data leakage — should be a top priority. Looking ahead, automation protocols offer a promising route
to generate large-scale standardized datasets that can support robust, generalizable models. Yet any approach that integrates
data across multiple sources must do so with a clear awareness of how both technical artifacts and unstandardized biology
can impede real-world predictive performance.
C. Arrayed screens
In contrast to pooled screens, arrayed screens can be used to understand more complex phenotypes whereby cells are interacting with each other and their environment, e.g., bone formation (Taylor-King et al., 2020a). Practically, arrayed screens are both more complicated and simpler than pooled screens for a number of reasons. On one hand, each well on a plate (96-well, 384-well plates, etc.) becomes a batch with the potential for edge effects, whereby the outer rim of the plate may have slightly weaker or stronger phenotypes. On the other hand, some phenotypes emerge from cell-cell signalling and can even be triggered by nearby cells; because each well of an arrayed screen holds a single perturbation-media pair, in such set-ups the evolution of random variable $X^{p_{\gamma_i}}_{m_j}$ is entirely independent of all other perturbation-media pairs $X^{p_{\gamma_{i'}}}_{m_{j'}}$ with $i \neq i'$ and $j \neq j'$.
D. Analysis of iPSC time series dataset
To assess whether the iPSC cells in Ishikawa et al. (2023) reach steady state, we evaluate whether the mean log2 fold change
(LFC) decreases across days. We compute a baseline from the non-targeting guides to quantify background variation in the
dataset. In Figure 8, the mean LFC for each guide is shown in blue where each point represents a sequential day comparison.
Given that we have samples from days 2, 3, 4 and 5, the blue lines show the mean LFC for comparisons between days 2 and
3, 3 and 4, and 4 and 5. The baseline LFC is obtained by randomly splitting cells with the non-targeting guide for each day
into two groups and computing the LFC between the two splits. This is repeated five times to obtain the final baseline.
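The within-day baseline described above can be sketched as follows. This is a hedged illustration rather than the analysis code behind Figure 8: the function names, the pseudocount, the use of per-gene mean expression, the equal-sized halves, and averaging the repeats into one number are all assumptions, and the example data are synthetic.

```python
import numpy as np

def mean_abs_lfc(group_a, group_b, pseudocount=1.0):
    """Mean absolute log2 fold change between per-gene means of two cell groups.

    Each group is a (cells x genes) array; the pseudocount is an assumption
    added to keep the ratio finite for lowly expressed genes.
    """
    mean_a = group_a.mean(axis=0) + pseudocount
    mean_b = group_b.mean(axis=0) + pseudocount
    return float(np.mean(np.abs(np.log2(mean_a / mean_b))))

def within_day_baseline(cells, n_repeats=5, seed=0):
    """Randomly split non-targeting cells from one day in two and average the LFC."""
    rng = np.random.default_rng(seed)
    half = cells.shape[0] // 2
    values = []
    for _ in range(n_repeats):
        order = rng.permutation(cells.shape[0])  # fresh random split each repeat
        values.append(mean_abs_lfc(cells[order[:half]], cells[order[half:]]))
    return float(np.mean(values))

# Synthetic counts standing in for non-targeting cells on a single day.
rng = np.random.default_rng(1)
day_cells = rng.poisson(5.0, size=(200, 50)).astype(float)
print(within_day_baseline(day_cells))
```

The across-day curves would then use `mean_abs_lfc` on cells carrying the same guide from consecutive days, so that a guide whose across-day value falls to the within-day baseline shows no change beyond sampling noise.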
[Figure 8 panels, one per CRISPR guide: ZIC2, MYCN, ETS2, POU5F1, PRDM14, ZNF649, ID1, LIN28A, NANOG, TRIM25, non-targeting, JARID2, ETV4, FOXH1, ZNF90, NR5A2, PDLIM1, ZIC3, ZNF398, TRIM24, RUNX1T1, MYC, SOX2, PPP1R12C, VENTX. Title: LFC Comparison by CRISPR Guide. x-axis: Day (2 to 5); y-axis: Absolute LFC (0 to 5); legend: Across day, Within day.]
Figure 8: Absolute LFC between sequential days in the Ishikawa et al. (2023) dataset.