MIDAS: a deep generative model for mosaic integration and
knowledge transfer of single-cell multimodal data
Zhen He1,#, Yaowen Chen1,#, Shuofeng Hu1,#, Sijing An1, Junfeng Shi2, Runyan Liu1,
Jiahao Zhou3, Guohua Dong1, Jinhui Shi1, Jiaxin Zhao1, Jing Wang1, Yuan Zhu2, Le Ou-Yang3,
Xiaochen Bo4, Xiaomin Ying1
1Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing, China
2School of Automation, China University of Geosciences, Wuhan, China
3College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China
4Institute of Health Service and Transfusion Medicine, Beijing, China
#These authors contributed equally: Zhen He, Yaowen Chen, Shuofeng Hu
ABSTRACT
Rapidly developing single-cell multi-omics sequencing technologies generate increasingly large
bodies of multimodal data. Integrating multimodal data from different sequencing technologies, i.e.
mosaic data, permits larger-scale investigation with more modalities and can help to better reveal
cellular heterogeneity. However, mosaic integration involves major challenges, particularly regarding
modality alignment and batch effect removal. Here we present a deep probabilistic framework for
the mosaic integration and knowledge transfer (MIDAS) of single-cell multimodal data. MIDAS
simultaneously achieves dimensionality reduction, imputation, and batch correction of mosaic data
by employing self-supervised modality alignment and information-theoretic latent disentanglement.
We demonstrate its superiority to other methods and reliability by evaluating its performance in full
trimodal integration and various mosaic tasks. We also constructed a single-cell trimodal atlas of
human peripheral blood mononuclear cells (PBMCs), and tailored transfer learning and reciprocal
reference mapping schemes to enable flexible and accurate knowledge transfer from the atlas to new
data.
1 Introduction
Recently emerged single-cell multimodal omics (scMulti-omics) sequencing technologies enable the simultaneous
detection of multiple modalities, such as RNA expression, protein abundance, and chromatin accessibility, in the same
cell [1]. These technologies, including the trimodal DOGMA-seq [2] and TEA-seq [3] and bimodal CITE-seq [4] and ASAP-seq [2], among many others [5-10], reveal not only cellular heterogeneity at multiple molecular layers, enabling
more refined identification of cell characteristics, but also connections across omes, providing a systematic view of
ome interactions and regulation at single-cell resolution. The involvement of more measured modalities in analyses of
biological samples increases the potential for breakthroughs in the understanding of mechanisms underlying numerous
processes, including cell functioning, tissue development, and disease occurrence. The growing size of scMulti-omics
datasets necessitates the development of new computational tools to integrate massive high-dimensional data generated
from different sources, thereby facilitating more comprehensive and reliable downstream analysis for knowledge
mining [
1
,
11
]. Such “integrative analysis” also enables the construction of a large-scale single-cell multimodal atlas,
which is urgently needed to make full use of publicly available single-cell multimodal data. Such an atlas can serve as
an encyclopedia allowing researchers’ transfer of knowledge to their new data and in-house studies [1214].
Several methods for single-cell multimodal integration have been presented recently. Most of them have been
proposed for the integration of bimodal data [14-21]. Fewer trimodal integration methods have been developed; MOFA+ [22] has been proposed for trimodal integration with complete modalities, and GLUE [23] and uniPort [24] have been developed for the integration of unpaired trimodal data (i.e., datasets involving single specific modalities).
Correspondence to: Xiaomin Ying (email: yingxmbio@foxmail.com) and Xiaochen Bo (email: boxiaoc@163.com).
All of these current integration methods have difficulty handling flexible omics combinations. Due to the diversity of scMulti-omics technologies, datasets from different studies often include heterogeneous omics combinations with one or more missing modalities, resulting in mosaic-like data. Such mosaic data are accumulating rapidly and will predictably become prevalent. Mosaic integration methods are urgently needed to markedly expand the scale and modalities of integration,
breaking through the modality scalability and cost limitations of existing scMulti-omics sequencing technologies. Most
recently, scVAEIT [25], scMoMaT [26], and StabMap [27] have been proposed to tackle this problem. However, these methods are not capable of aligning modalities or correcting batches, which limits their functionality and performance. Therefore, flexible and general multimodal mosaic integration remains challenging [28, 29]. One
major challenge is the reconciliation of modality heterogeneity and technical variation across batches. Another is the
achievement of modality imputation and batch correction for downstream analysis.
To overcome these challenges, we developed a probabilistic framework, MIDAS, for the mosaic integration and
knowledge transfer of single-cell multimodal data. By employing self-supervised learning [30] and information-theoretic approaches [31], MIDAS simultaneously achieves modality alignment, imputation, and batch correction for
single-cell trimodal mosaic data. We further designed transfer learning and reciprocal reference mapping schemes
tailored to MIDAS to enable knowledge transfer. Systematic benchmarks and case studies demonstrate that MIDAS
can accurately and robustly integrate mosaic datasets. Through the atlas-level mosaic integration of trimodal human
peripheral blood mononuclear cell (PBMC) data, MIDAS achieved flexible and accurate knowledge transfer for various
types of unimodal and multimodal query datasets.
2 Results
2.1 MIDAS enables the mosaic integration and knowledge transfer of single-cell multimodal data
MIDAS is a deep generative model [32,33] that represents the joint distribution of incomplete single-cell multimodal
data with Assay for Transposase-Accessible Chromatin (ATAC), RNA, and Antibody-Derived Tags (ADT) measure-
ments. MIDAS assumes that each cell’s multimodal measurements are generated from two modality-agnostic and
disentangled latent variables—the biological state (i.e., cellular heterogeneity) and technical noise (i.e., unwanted
variation induced by single-cell experimentation)—through deep neural networks [34]. Its input consists of a mosaic feature-by-cell count matrix comprising different single-cell samples (batches), and a vector representing the cell batch IDs (Fig. 1a). The batches can derive from different experiments or be generated by the application of different sequencing techniques (e.g., scRNA-seq [35], CITE-seq [4], ASAP-seq [2], and TEA-seq [3]), and thus can have
different technical noise, modalities, and features. The MIDAS output comprises biological state and technical noise
matrices, which are the two low-dimensional representations of different cells, and an imputed and batch-corrected
count matrix in which modalities and features missing from the input data are interpolated and batch effects are
removed. These outputs can be used for downstream analyses such as clustering, differential expression analyses, and
cell typing [36].
MIDAS is based on a Variational Autoencoder (VAE) [37] architecture, with a modularized encoder network
designed to handle the mosaic input data and infer the latent variables, and a decoder network that uses the latent
variables to seed the generative process for the observed data (Fig. 1b). It uses self-supervised learning to align
different modalities in latent space, improving cross-modal inference in downstream tasks such as imputation and
translation (Fig. 1b). Information-theoretic approaches are applied to disentangle the biological state and technical noise,
enabling further batch correction (Fig. 1b). Combining these elements into our optimization objective, scalable learning
and inference of MIDAS are simultaneously achieved by Stochastic Gradient Variational Bayes (SGVB) [38], which also enables large-scale mosaic integration and atlas construction of single-cell multimodal data. For the
which also enables large-scale mosaic integration and atlas construction of single-cell multimodal data. For the
robust transfer of knowledge from the constructed atlas to query datasets with various modality combinations, transfer
learning and reciprocal reference mapping schemes were developed for the transfer of model parameters and cell labels,
respectively (Fig. 1c).
2.2 MIDAS shows superior performance in trimodal integration with complete modalities
To compare MIDAS with the state-of-the-art methods, we evaluated the performance of MIDAS in trimodal integration
with complete modalities, a simplified form of mosaic integration, as few methods are designed specifically for trimodal
mosaic integration. We named this task “rectangular integration”. We used two (DOGMA-seq [2] and TEA-seq [3],
Supplementary Table 1) published single-cell trimodal human PBMC datasets with simultaneous RNA, ADT, and ATAC
measurements for each cell to construct dogma-full and teadog-full datasets. Dogma-full took all four batches (LLL_Ctrl,
LLL_Stim, DIG_Ctrl, and DIG_Stim) from the DOGMA-seq dataset, and teadog-full took two batches (W1 and W6)
from the TEA-seq dataset and two batches (LLL_Ctrl and DIG_Stim) from the DOGMA-seq dataset (Supplementary
Table 2). The integration of each dataset requires the handling of batch effects and missing features and preservation
of biological signals, which is challenging, especially for the teadog-full dataset, as the involvement of more datasets
amplifies biological and technical variation.
Uniform manifold approximation and projection (UMAP) [39] visualization showed that the biological states of
different batches were well aligned and that their grouping was consistent with the ground-truth cell types (Fig. 2a left),
and that the technical noise was grouped by batch and exhibited little relevance to cell types (Fig. 2b). Thus, the two
inferred latent variables were disentangled well and independently represented biological and technical variation.
Taking the inferred biological states as low-dimensional representations of the integrated data, we compared the
performance of MIDAS with that of nine strategies derived from recently published methods (Methods) in the removal of
batch effects and preservation of biological signals. UMAP visualization of the integration results showed that MIDAS
ideally removed batch effects and meanwhile preserved cell type information on both dogma-full and teadog-full
datasets, whereas the performance of other strategies was not satisfactory. For example, BBKNN+average, MOFA+,
PCA+WNN, Scanorama-embed+WNN, and Scanorama-feat+WNN did not mix different batches well, and PCA+WNN
and Scanorama-feat+WNN produced cell clusters largely inconsistent with cell types (Fig. 2a).
In a quantitative evaluation of the low-dimensional representations of different strategies performed with the
widely used single-cell integration benchmarking (scIB) [40] tool, MIDAS had the highest batch correction, biological
conservation, and overall scores for the dogma-full and teadog-full datasets (Fig. 2c and Supplementary Fig. 1). In
addition, MIDAS preserved cell type-specific patterns in batch-corrected RNA, ADT, and ATAC data (Methods). For
each cell type, fold changes in gene/protein abundance and chromatin accessibility in raw and batch-corrected data
correlated strongly and positively (all r > 0.8; Fig. 2d).
Manual cell clustering and typing based on the integrated low-dimensional representations and batch-corrected
data from MIDAS led to the identification of 13 PBMC types, including B cells, T cells, dendritic cells (DCs),
natural killer (NK) cells, and monocytes (Fig. 2e). We identified a distinct T cell cluster that highly expresses CD4 and CD8 simultaneously. We labeled this cluster as double-positive (DP) CD4+/CD8+ T cells; this phenomenon was also reported in previous studies [41]. Another T cell cluster, containing mucosa-associated invariant T cells and
gamma-delta T cells, was distinct from conventional T cells and was labeled as unconventional T cells [42].
As is known, multiple omes regulate biological functions synergistically [1]. MIDAS integrates RNA, ADT, and ATAC single-cell data and hence facilitates discovery of the intrinsic nature of cell activities in a more comprehensive manner. We found that all omics contributed greatly to the identification of cell types and functions.
Systematic screening at the RNA and ADT levels for expression inconsistencies between proteins and their corresponding genes, which are expected to reflect ome irreplaceability, demonstrated that several markers in each cell type were
expressed strongly in one modality and weakly in the other (Fig. 2f, Supplementary Fig. 2). For instance, MS4A1, which
encodes a B cell-specific membrane protein, was expressed extremely specifically in B cells, but the CD20 protein
encoded by MS4A1 was rarely detected, confirming the irreplaceability of the RNA modality. We also found that ADT
could complement RNA-based clustering. For example, the simultaneous expression of T-cell markers (CD3 and CD4)
was unexpectedly observed in two subclusters of B cells (B2 and B3) expressing canonical B-cell markers (CD19, CD20,
and CD79; Fig. 2g). As this phenomenon could not be replicated using RNA data alone, this finding confirms the
irreplaceability of the ADT modality.
Investigation of the uniqueness of chromatin accessibility in multi-omics integration at the ATAC level showed that ATAC contributed more than did ADT and RNA to the integration of a subcluster of CD4+ naive T cells (Fig. 2h–j). We took the ratio of the peak number of a cell to that of all cells as the representation of the cell's accessibility level. RNA and ADT expression did not differ between these cells and their CD4+ naive T-cell siblings, but a surprisingly lower accessibility level was observed at the ATAC layer (<0.02, Supplementary Fig. 3). Gene ontology enrichment analysis [43] indicated that the inaccessible regions are related to T-cell activation, cell adhesion, and other immune functions. Therefore, we define this cluster as low chromatin accessible (LCA) naive CD4+ T cells. Although this discovery needs to be verified
further, it demonstrates the remarkable multi-omics integration capability of MIDAS.
2.3 MIDAS enables reliable trimodal mosaic integration
At present, trimodal sequencing techniques are still immature. Most of the existing datasets are unimodal or bimodal
with various modality combinations. MIDAS is designed to integrate these diverse multimodal datasets, i.e. mosaic
datasets. To evaluate the performance of MIDAS on mosaic integration, we further constructed 14 incomplete
datasets based on the previously generated rectangular datasets including dogma-full and teadog-full datasets (Methods,
Supplementary Table 2). Each mosaic dataset was generated by removing several modality-batch blocks from the
full-modality dataset. Then, we took the rectangular integration results as the baseline, and examined whether MIDAS
could obtain comparable results on mosaic integration tasks. We assessed MIDAS’s ability of batch correction, modality
3
MIDAS: A DEEP GENERATIVE MODEL FOR MOSAIC INTEGRATION AND KNOWLEDGE TRANSFER OF SINGLE-CEL L
MULTIMODAL DATA
alignment, and biological conservation. Here we also focused on modality alignment because it guarantees accurate
cross-modal inference for processes such as downstream imputation and knowledge transfer. For qualitative evaluation,
we use UMAP to visualize the biological states and technical noises inferred from the individual and the joint input
modalities (Fig. 3a, b, Supplementary Fig. 4, 5). Taking the dogma-paired-abc dataset as an example, for each modality, the biological states were consistently distributed across different batches (Fig. 3a), whereas the technical noises were grouped by batch (Fig. 3b), indicating that the batch effects were well disentangled from the biological states. Similarly,
the distributions of biological states and technical noises within batches were very similar across modalities (Fig. 3a,
b), suggesting that MIDAS internally aligns different modalities in latent space. Moreover, the biological states
of each cell type were grouped together and the cell type silhouettes were consistent across batches and modality
combinations (Fig. 3a), reflecting robust conservation of the biological signals after mosaic integration.
To quantitatively evaluate MIDAS on mosaic integration, we proposed single-cell mosaic integration benchmarking (scMIB). scMIB extends scIB with modality alignment metrics, and defines each type of metric on both
embedding (latent) space and feature (observation) space, resulting in 20 metrics in total (Methods, Supplementary
Table 3). The obtained batch correction, modality alignment, biological conservation, and overall scores for paired+full,
paired-abc, paired-ab, paired-ac, paired-bc, and diagonal+full tasks performed with the dogma and teadog datasets
were similar to those obtained with rectangular integration (Fig. 3c, Supplementary Fig. 6a). MIDAS showed moderate
performance in the dogma- and teadog-diagonal tasks, likely due to the lack of cell-to-cell correspondence across
modalities in these tasks, which can be remedied via knowledge transfer (shown in Result 2.5).
scIB benchmarking showed that MIDAS, when given incomplete datasets (paired+full, paired-abc, paired-ab,
paired-ac, and paired-bc for dogma and teadog), outperformed methods that rely on the full-modality datasets (dogma-
and teadog-full; Supplementary Fig. 6b, c). Even with the severely incomplete dogma- and teadog-diagonal+full
datasets, the performance of MIDAS surpassed that of most of the other methods.
We also compared MIDAS against scVAEIT, scMoMaT, and StabMap (Methods) that can handle mosaic datasets.
UMAP visualization of the low-dimensional cell embeddings showed that MIDAS removed batch effects and preserved
biological signals well on various tasks, while the other three methods did not integrate trimodal data well, especially when modalities were missing (Fig. 3d, e, Supplementary Fig. 7). Specifically, MIDAS aligned the cells of different
batches well and grouped them consistently with the ground-truth cell types, while the other methods did not mix
different batches well and produced cell clusters largely inconsistent with cell types. scIB benchmarking showed that
MIDAS had stable performance on different mosaic tasks, and its overall scores were much higher than those of the other methods (Fig. 3f, g, Supplementary Fig. 8).
The identification of cells’ nearest neighbors based on individual dimensionality reduction results and comparison
of neighborhood overlap among tasks showed that this overlap exceeded 0.75 for most tasks, except dogma-diagonal,
when the number of neighbors reached 10,000 (Fig. 3h). As imputed omics data have been reported to deteriorate the accuracy of gene regulatory inference in many cases [44], we evaluated the consistency of downstream analysis results obtained from different mosaic integration tasks performed with the dogma datasets. We validated
the conservation of gene regulatory networks in the imputed data. In the dogma-paired+full task, for example, the
regulatory network predicted from imputed data was consistent with that predicted from the ground-truth dogma-full
data (Fig. 3i). These results indicate that the modality inference performed by MIDAS is reliable.
The manual annotation of cell types for the mosaic integration tasks and computation of their confusion matrices
and micro-F1 scores with the dogma-full cell typing results serving as the ground truth showed that the cell type labels
generated from the incomplete datasets, except dogma-diagonal, were largely consistent with the ground truth, with all
micro F1-scores exceeding 0.885 (Fig. 3j, Supplementary Fig. 9). The separation of monocytes and DCs was difficult
in some mosaic experiments, mainly because the latter originate from the former [45] and likely also because the
monocyte population in the dogma dataset was small.
2.4 MIDAS enables the atlas-level mosaic integration of trimodal PBMC data
We used MIDAS for the large-scale mosaic integration of 18 PBMC batches from bimodal sequencing platforms (e.g.,
10X Multiome, ASAP-seq, and CITE-seq) and the 9 batches from the DOGMA-seq and TEA-seq trimodal datasets (total,
27 batches from 10 platforms comprising 185,518 cells; Methods, Supplementary Table 1, 4). Similar to the results
obtained with the dogma-full and teadog-full datasets, MIDAS achieved satisfactory batch removal and biological
conservation. UMAP visualization showed that the inferred biological states of different batches maintained a
consistent PBMC population structure and conserved batch-specific (due mainly to differences in experimental design)
biological information (Fig. 4a, Supplementary Fig. 10a). In addition, the technical noise was clearly grouped by
batch (Supplementary Fig. 10b). These results suggest that the biological states and technical noises were disentangled
well and the data could be used reliably in downstream analysis.
The manual labeling of cell types according to cluster markers achieved clearer separation and annotation than did
automatic labeling by Seurat [14] (Fig. 4b). Consistent with the rectangular integration results (Fig. 2e), we identified all
cell types known to be in the atlas, including B cells, conventional T-cell subsets, DP T cells, NK cells, unconventional
T cells, and hematopoietic stem cells (HSCs), demonstrating the robustness of MIDAS. Remarkably, the integration
of more datasets with MIDAS led to the identification of rare clusters and high-resolution cell typing. For example,
whereas platelets could not be easily identified by dogma-full rectangular integration due to their extremely limited
number (Fig. 2e), platelets from the DOGMA-seq dataset aggregated into a much larger cluster with recognizable
platelet markers in the PBMC atlas (Fig. 4c). In addition, the atlas contained more monocyte subclusters, including
CD14+, CD16+, and CD3+CD14+ monocytes, than obtained with rectangular integration (Fig. 4d). Other cell types present in more subclusters in the atlas included CD158e1+ NK cells, CD4+CD138+CD202b+ T cells, and RTKN2+CD8+ T cells (Supplementary Fig. 11a).
Most batches in the atlas contained considerable numbers of LCA cells (Fig. 4e, Supplementary Fig. 11b) with <0.02 accessibility levels, as did the DOGMA-seq dataset (Fig. 2i). The chromatin accessibility levels of cells in the atlas showed an obvious bimodal distribution, reflecting the existence of two ATAC patterns (Supplementary Fig. 11b). CD8+ T-cell, CD14+ monocyte, NK cell, B cell, and other clusters contained LCA cells (Fig. 4b, e), implying that LCA is common in various cell types.
2.5 MIDAS enables flexible and accurate knowledge transfer across modalities and datasets
To investigate the knowledge transfer capability of MIDAS, we re-partitioned the atlas dataset into reference (for atlas
construction) and query (knowledge transfer target) datasets (Supplementary Table 4). By removing DOGMA-seq
from the atlas, we obtained a reference dataset named atlas-no_dogma. To test the flexibility of knowledge transfer,
we used DOGMA-seq to construct 14 query datasets: 1 rectangular integration and 7 incomplete mosaic datasets
generated previously, and 6 rectangular integration datasets with fewer modalities (Methods, Supplementary Table 5).
In consideration of real applications, we defined model and label knowledge transfer scenarios (Methods). In the model
transfer scenario, knowledge was transferred implicitly through model parameters via transfer learning. In the label
transfer scenario, knowledge was transferred explicitly through cell labels via reference mapping.
We assessed the performance of MIDAS in the model transfer scenario. For the transfer learned models, we used
UMAP to visualize the inferred biological states and technical noises and scMIB and scIB for integration benchmarking,
and compared the results of different tasks with those generated by de novo trained models. Transfer learning greatly
improved performance on the dogma-diagonal, dogma-atac, dogma-rna, and dogma-paired-a tasks, with performance
levels on the other tasks maintained (Fig. 5a–c, Supplementary Fig. 12, 13). For example, the de novo trained model
failed to integrate well in the dogma-diagonal task due to lack of cell-to-cell correspondence across modalities (Fig. 5a),
whereas the transfer learned model with atlas knowledge successfully aligned the biological states across batches and
modalities and formed groups consistent with cell types (Fig. 5b). The results obtained by transfer learned models with
all 14 datasets were not only comparable (Supplementary Fig. 13a), but also better than those obtained with many other
methods, even with the complete dataset (Fig. 5c, Supplementary Fig. 13b).
To assess the performance of MIDAS in the label transfer scenario, we compared the widely used query-to-reference
mapping [46, 47], reference-to-query mapping [13, 48], and our proposed reciprocal reference mapping (Methods).
With each strategy, we aligned each query dataset to the reference dataset and transferred cell type labels through
k-nearest neighbors, with the ground-truth cell type labels taken from the trimodal PBMC atlas. Visualization of the
mapped biological states showed that reciprocal reference mapping with different query datasets yielded consistent
results, with strong agreement with the atlas integration results obtained with the dogma-full dataset (Fig. 5d). Micro
F1-scores indicated that reciprocal reference mapping outperformed the query-to-reference and reference-to-query
mapping strategies for various forms of query data, achieving robust and accurate label transfer and thereby avoiding
the need for de novo downstream analysis (Fig. 5e).
Thus, MIDAS can be used to transfer atlas-level knowledge to various forms of users’ datasets without expensive de
novo training or complex downstream analysis.
3 Discussion
By modeling the single-cell mosaic data generative process, MIDAS can precisely disentangle biological states and
technical noises from the input and robustly align modalities to support multi-source and heterogeneous integration
analysis. It provides accurate and robust results and outperforms other methods when performing various mosaic
integration tasks. It also powerfully and reliably integrates large datasets, as demonstrated with the atlas-scale integration
of publicly available PBMC multi-omics datasets. Moreover, MIDAS efficiently and flexibly transfers knowledge from
reference to query datasets, enabling convenient handling of new multi-omics data. With superior performance in
dimensionality reduction and batch correction, MIDAS supports accurate downstream biological analysis.
To our knowledge, MIDAS is the first model that supports simultaneous dimensionality reduction, modality
complementing, and batch correction in single-cell trimodal mosaic integration. MIDAS accurately integrates mosaic
data with missing modalities, achieving results comparable to the ground truth (rectangular integration) and superior to
those obtained from other methods. These distinct advantages of MIDAS derive from the deep generative modeling,
product of experts, information-theoretic disentanglement, and self-supervised modality alignment components of the
algorithm, which are specifically designed and inherently competent for the heterogeneous integration of data with
missing features and modalities. In addition, MIDAS is the first model that allows knowledge transfer across mosaic
data modalities and batches in a highly flexible and reliable manner, enabling researchers to conquer the vast bodies of
data produced with continuously emerging multi-omics techniques.
GLUE and uniPort, designed for the integration of unpaired single-cell trimodal data, align the global distributions
of different modalities with the use of prior information to improve cell-wise correspondence. They may face challenges
when little prior information about the inter-modality (e.g., ATAC vs. ADT) correspondence of single cells is available.
In addition, they are not designed to utilize paired data to enhance modality alignment. With the widespread adoption
and rapid development of scMulti-omics sequencing technologies, paired data are rapidly becoming more common
and will soon be ubiquitous. By leveraging paired cell data, MIDAS can learn to better align different modalities
in a self-supervised manner and can thus play a more important role than other available methods in the blooming
scMulti-omics era.
Most recently, scVAEIT, scMoMaT, and StabMap were proposed for mosaic integration. However, they do not
have the functionalities of modality alignment and batch correction. StabMap also requires users to manually select paired datasets as reference sets in advance, which can lead to biased results. A general mosaic integration method should allow input
of diverse mosaic combinations, and support modality alignment and batch correction both in embedding space and
feature space [29]. These functions are all essential and urgently needed in real-world scenarios. Among scVAEIT, scMoMaT, StabMap, and MIDAS, only MIDAS tackles the problem of general mosaic integration (Supplementary
Table 6).
Recently proposed methods for scalable reference-to-query knowledge transfer for single-cell multimodal data
have issues with generalization to unseen query data [46, 47] or the retention of information learned on the reference data [13, 48], which make the alignment of reference and query datasets difficult. In addition, they support limited
numbers of specific modalities. The MIDAS knowledge transfer scheme stands out from these methods because it
supports various types of mosaic query dataset and enables model transfer for sample-efficient mosaic integration and
label transfer for automatic cell typing. Moreover, the generalization and retention problems are mitigated through a
novel reciprocal reference mapping scheme.
We envision two major developmental directions for MIDAS. At present, MIDAS integrates only three modalities.
By fine tuning the model architecture, we can achieve the integration of four or more modalities, overcoming the
limitations of existing scMulti-omics sequencing technologies. In addition, the continuous incorporation of rapidly
increasing bodies of newly generated scMulti-omics data is needed to update the model and improve the quality of the
atlas. This process requires frequent model retraining, which is computationally expensive and time consuming. Thus,
employing incremental learning [49] is an inevitable trend to achieve continuous mosaic integration without model
retraining.
4 Methods
4.1 Deep generative modeling of mosaic single-cell multimodal data
For Cell $n \in \mathcal{N} = \{1, \ldots, N\}$ with batch ID $s_n \in \mathcal{S} = \{1, \ldots, S\}$, let $x_n^m \in \mathbb{N}^{D_n^m}$ be the count vector of size $D_n^m$ from Modality $m$, and $x_n = \{x_n^m\}_{m \in \mathcal{M}_n}$ the set of count vectors from the measured modalities $\mathcal{M}_n \subseteq \mathcal{M} = \{\text{ATAC}, \text{RNA}, \text{ADT}\}$. We define two modality-agnostic low-dimensional latent variables $c \in \mathbb{R}^{D_c}$ and $u \in \mathbb{R}^{D_u}$ to represent each cell's biological state and technical noise, respectively. To model the generative process of the observed variables $x$ and $s$ for each cell, we factorize the joint distribution of all variables as below:
$$p(x, s, c, u) = p(c)\,p(u)\,p(s \mid u)\,p(x \mid c, u) = p(c)\,p(u)\,p(s \mid u) \prod_{m \in \mathcal{M}_n} p(x^m \mid c, u) \quad (1)$$
where we assume that $c$ and $u$ are independent of each other and that the batch ID $s$ only depends on $u$, in order to facilitate the disentanglement of both latent variables, and that the count variables $\{x^m\}_{m \in \mathcal{M}_n}$ from different modalities are conditionally independent given $c$ and $u$.

Based on the above factorization, we define a generative model for $x$ and $s$ as follows:
$$p(c) = \text{Normal}(c \mid 0, I) \quad (2)$$
$$p(u) = \text{Normal}(u \mid 0, I) \quad (3)$$
$$\pi = g_s(u; \theta_s) \quad (4)$$
$$p_\theta(s \mid u) = \text{Categorical}(s \mid \pi) \quad (5)$$
$$\lambda^m = g_m(c, u; \theta_m) \quad \text{for } m \in \mathcal{M}_n \quad (6)$$
$$p_\theta(x^m \mid c, u) = \begin{cases} \text{Bernoulli}(x^m \mid \lambda^m) & \text{if } m = \text{ATAC} \\ \text{Poisson}(x^m \mid \lambda^m) & \text{if } m \in \{\text{RNA}, \text{ADT}\} \end{cases} \quad \text{for } m \in \mathcal{M}_n \quad (7)$$
where the priors $p(c)$ and $p(u)$ are set as standard normals. The likelihood $p_\theta(s \mid u)$ is set as a categorical distribution with probability vector $\pi \in \Delta^{S-1}$ generated through a batch-ID decoder $g_s$, which is a neural network with learnable parameters $\theta_s$. The likelihood $p_\theta(x^m \mid c, u)$ is set as a Bernoulli distribution with mean $\lambda^m \in [0,1]^{D_n^m}$ when $m = \text{ATAC}$, and as a Poisson distribution with mean $\lambda^m \in \mathbb{R}_+^{D_n^m}$ when $m \in \{\text{RNA}, \text{ADT}\}$, where $\lambda^m$ is generated through a modality decoder neural network $g_m$ parameterized by $\theta_m$. To mitigate overfitting and improve generalization, we share the parameters of the first few layers of different modality decoders $\{g_m\}_{m \in \mathcal{M}}$ (the gray parts of the decoders in Fig. 1b middle). The corresponding graphical model is shown in Fig. 1b (left).
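To make equations (2)-(7) concrete, below is a minimal PyTorch sketch of the decoding step; the class name, layer sizes, and feature dimensions are illustrative assumptions rather than the actual MIDAS implementation.

```python
# Minimal sketch (not the official MIDAS code) of the decoders in equations (4)-(7):
# a batch-ID decoder g_s maps u to categorical logits, and per-modality decoders g_m
# map (c, u) to Bernoulli means (ATAC) or Poisson rates (RNA/ADT), with shared first layers.
import torch
import torch.nn as nn

class ToyDecoders(nn.Module):
    def __init__(self, d_c=32, d_u=2, n_batches=4, dims=None):
        super().__init__()
        dims = dims or {"atac": 10000, "rna": 4000, "adt": 200}   # toy feature-union sizes
        self.g_s = nn.Linear(d_u, n_batches)                      # batch-ID decoder (logits for eq. 5)
        self.shared = nn.Sequential(nn.Linear(d_c + d_u, 256), nn.ReLU())  # shared first layers
        self.heads = nn.ModuleDict({m: nn.Linear(256, d) for m, d in dims.items()})

    def forward(self, c, u):
        s_logits = self.g_s(u)                                    # feeds Categorical(s | pi), eqs. 4-5
        h = self.shared(torch.cat([c, u], dim=-1))
        lam = {
            "atac": torch.sigmoid(self.heads["atac"](h)),         # Bernoulli mean in [0, 1], eq. 7
            "rna": torch.exp(self.heads["rna"](h)),               # Poisson rate > 0, eq. 7
            "adt": torch.exp(self.heads["adt"](h)),               # Poisson rate > 0, eq. 7
        }
        return s_logits, lam
```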
Given the observed data $\{x_n, s_n\}_{n \in \mathcal{N}}$, we aim to fit the model parameters $\theta = \{\theta_s, \{\theta_m\}_{m \in \mathcal{M}}\}$ and meanwhile infer the posteriors of the latent variables $\{c, u\}$ for each cell. This can be achieved by using the SGVB [38], which maximizes the expected Evidence Lower Bound (ELBO) for individual datapoints. The ELBO for each individual datapoint $\{x_n, s_n\}$ can be written as:
$$\begin{aligned}
\text{ELBO}(\theta, \phi; x_n, s_n) &\triangleq \mathbb{E}_{q_\phi(c,u \mid x_n,s_n)}\left[\log \frac{p_\theta(x_n, s_n, c, u)}{q_\phi(c, u \mid x_n, s_n)}\right] \\
&= \mathbb{E}_{q_\phi(c,u \mid x_n,s_n)}\left[\log p_\theta(x_n, s_n \mid c, u)\right] - \text{KL}\left(q_\phi(c, u \mid x_n, s_n) \,\|\, p(c, u)\right) \\
&= \mathbb{E}_{q_\phi(c,u \mid x_n,s_n)}\Big[\log p_\theta(s_n \mid u) + \sum_{m \in \mathcal{M}_n} \log p_\theta(x_n^m \mid c, u)\Big] - \text{KL}\left(q_\phi(c, u \mid x_n, s_n) \,\|\, p(c, u)\right)
\end{aligned} \quad (8)$$
where $q_\phi(c, u \mid x_n, s_n)$, with learnable parameters $\phi$, is the variational approximation of the true posterior $p(c, u \mid x_n, s_n)$ and is typically implemented by neural networks.
4.2 Scalable variational inference via the Product of Experts
Let $M = |\mathcal{M}|$ be the total modality number. Since there are $(2^M - 1)$ possible modality combinations for the count data $x_n = \{x_n^m\}_{m \in \mathcal{M}_n \subseteq \mathcal{M}}$, naively implementing $q_\phi(c, u \mid x_n, s_n)$ in equation (8) requires $(2^M - 1)$ different neural networks to handle the different cases of input $(x_n, s_n)$, making inference unscalable. Let $z = \{c, u\}$. Inspired by [50], which utilizes the Product of Experts (PoE) to implement variational inference in a combinatorial way, we factorize the
posterior $p(z \mid x_n, s_n)$ and define its variational approximation $q_\phi(z \mid x_n, s_n)$ as follows:
$$\begin{aligned}
p(z \mid x_n, s_n) &= \frac{p(z)\,p(s_n \mid z)\,p(x_n \mid z)}{p(x_n, s_n)} \\
&= \frac{p(z)}{p(x_n, s_n)}\, p(s_n \mid z) \prod_{m \in \mathcal{M}_n} p(x_n^m \mid z) \\
&= \frac{p(z)}{p(x_n, s_n)} \cdot \frac{p(s_n)\,p(z \mid s_n)}{p(z)} \prod_{m \in \mathcal{M}_n} \frac{p(x_n^m)\,p(z \mid x_n^m)}{p(z)} \\
&= \frac{p(s_n)}{p(x_n, s_n)} \Big(\prod_{m \in \mathcal{M}_n} p(x_n^m)\Big)\, p(z)\, \frac{p(z \mid s_n)}{p(z)} \prod_{m \in \mathcal{M}_n} \frac{p(z \mid x_n^m)}{p(z)} \\
&\approx \frac{p(s_n)}{p(x_n, s_n)} \Big(\prod_{m \in \mathcal{M}_n} p(x_n^m)\Big)\, p(z)\, \frac{q_\phi(z \mid s_n)}{p(z)} \prod_{m \in \mathcal{M}_n} \frac{q_\phi(z \mid x_n^m)}{p(z)} \\
&\propto q_\phi(z \mid x_n, s_n)
\end{aligned} \quad (9)$$
where $q_\phi(z \mid s_n)$ and $q_\phi(z \mid x_n^m)$ are the variational approximations of the true posteriors $p(z \mid s_n)$ and $p(z \mid x_n^m)$, respectively. From equation (9) we further get:
$$q_\phi(z \mid x_n, s_n) \propto p(z) \underbrace{\frac{q_\phi(z \mid s_n)}{p(z)}}_{C_s \cdot \tilde{q}_\phi(z \mid s_n)} \prod_{m \in \mathcal{M}_n} \underbrace{\frac{q_\phi(z \mid x_n^m)}{p(z)}}_{C_m \cdot \tilde{q}_\phi(z \mid x_n^m)} \propto p(z)\,\tilde{q}_\phi(z \mid s_n) \prod_{m \in \mathcal{M}_n} \tilde{q}_\phi(z \mid x_n^m) \quad (10)$$
where we set $q_\phi(z \mid s_n)$ and $q_\phi(z \mid x_n^m)$ to be Gaussians. We define $\tilde{q}_\phi(z \mid s_n) = \frac{1}{C_s} \cdot \frac{q_\phi(z \mid s_n)}{p(z)}$ and $\tilde{q}_\phi(z \mid x_n^m) = \frac{1}{C_m} \cdot \frac{q_\phi(z \mid x_n^m)}{p(z)}$ to be the normalized quotients of Gaussians with normalizing constants $C_s$ and $C_m$, respectively. Since the quotient of Gaussians is an unnormalized Gaussian, $\tilde{q}_\phi(z \mid s_n)$ and $\tilde{q}_\phi(z \mid x_n^m)$ are also Gaussians. Thus, in equation (10) the variational posterior $q_\phi(z \mid x_n, s_n)$ is proportional to a product of individual Gaussians (or “experts”), indicating that it is itself a Gaussian with mean $\mu$ and covariance $\Lambda$:
$$\mu = \Big(\sum_i \mu_i \Lambda_i^{-1}\Big)\Big(\sum_i \Lambda_i^{-1}\Big)^{-1}, \qquad \Lambda = \Big(\sum_i \Lambda_i^{-1}\Big)^{-1} \quad (11)$$
where $\mu_i$ and $\Lambda_i$ are respectively the mean and covariance of the $i$-th individual Gaussian. We further assume $\tilde{q}_\phi(z \mid x_n^m)$ and $\tilde{q}_\phi(z \mid s_n)$ are isotropic Gaussians and define them as follows:
$$(\mu_m, \nu_m) = f_m(x_n^m; \phi_m) \quad (12)$$
$$\tilde{q}_\phi(z \mid x_n^m) = \text{Normal}(z \mid \mu_m, \nu_m I) \quad (13)$$
$$(\mu_s, \nu_s) = f_s(s_n; \phi_s) \quad (14)$$
$$\tilde{q}_\phi(z \mid s_n) = \text{Normal}(z \mid \mu_s, \nu_s I) \quad (15)$$
where $f_m$ is the modality encoder parameterized by $\phi_m$, and $f_s$ the batch-ID encoder parameterized by $\phi_s$, both of which are neural networks for generating Gaussian mean-variance pairs. In doing this, $q_\phi(z \mid x_n, s_n)$ is modularized into $(M+1)$ neural networks to handle $(2^M - 1)$ different modality combinations, increasing the model's scalability. Similar to the modality decoders, we also share the parameters of the last few layers of different modality encoders $\{f_m\}_{m \in \mathcal{M}}$ (the gray parts of the encoders in Fig. 1b middle) to improve generalization.
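As an illustration of equation (11) in the diagonal-covariance case used by equations (12)-(15), the following numpy sketch fuses Gaussian experts by precision weighting; the function name and example values are assumptions for illustration.

```python
# Product-of-experts fusion of diagonal Gaussians (equation (11)):
# the joint precision is the sum of the expert precisions, and the joint mean is the
# precision-weighted average of the expert means. The standard normal prior p(z) is
# simply another expert with mean 0 and variance 1.
import numpy as np

def poe_fuse(means, variances):
    """means, variances: lists of arrays of shape (d,), one pair per available expert."""
    precisions = [1.0 / v for v in variances]
    joint_var = 1.0 / np.sum(precisions, axis=0)                 # diagonal of Lambda in eq. (11)
    joint_mean = joint_var * np.sum([m * p for m, p in zip(means, precisions)], axis=0)
    return joint_mean, joint_var

# Example: prior p(z) plus one batch-ID expert and one modality expert.
d = 3
experts = [(np.zeros(d), np.ones(d)),                            # prior
           (np.array([1.0, 0.5, -1.0]), np.full(d, 0.5)),        # batch-ID expert
           (np.array([0.8, 0.0, -0.5]), np.full(d, 2.0))]        # modality expert
mu, var = poe_fuse([m for m, _ in experts], [v for _, v in experts])
```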
4.3 Handling missing features via padding and masking
For each modality, as different batches can have different feature sets (e.g., genes for RNA modality), it is hard to use a
fixed-size neural network to handle these batches. To remedy this, we first convert $x_n^m$ of variable size into a fixed-size
vector for inference. For Modality $m$, let $\mathcal{F}_s^m$ be the features of batch $s$, and $\mathcal{F}^m = \bigcup_{s \in \mathcal{S}} \mathcal{F}_s^m$ the feature union of all batches. The missing features of batch $s$ can then be defined as $\bar{\mathcal{F}}_s^m = \mathcal{F}^m \setminus \mathcal{F}_s^m$. We pad $x_n^m$ of size $D_n^m$ with zeros corresponding to its missing features $\bar{\mathcal{F}}_{s_n}^m$ through a zero-padding function $h$:
$$\tilde{x}_n^m = h(x_n^m) \quad (16)$$
where $\tilde{x}_n^m$ is the zero-padded count vector of constant size $D^m = |\mathcal{F}^m|$. The modality encoding process is thus decomposed as:
$$(\mu_m, \nu_m) = f_m(x_n^m; \phi_m) = \hat{f}_m(h(x_n^m); \phi_m) = \hat{f}_m(\tilde{x}_n^m; \phi_m) \quad (17)$$
(17)
where
b
fm
is the latter part of the modality encoder to handle a fixed size input
e
xm
n
. On the other hand, to calculate the
likelihood
pθ(xm
n|cn,un)
we also need to generate a mean
λm
n
of variable size for
xm
n
. To achieve this, we decompose
the modality decoding process as follows:
λm
n=gm(cn,un;θm)
=h1(bgm(cn,un;θm))
=h1e
λm
n(18)
where
bgm
is the front part of the modality decoder to generate the mean
e
λm
n
of fixed size
Dm
, and
h1
, the inverse
function of
h
, is the mask function to remove the padded missing features
Fm
s
from
e
λm
n
to generate
λm
n
. Note that
e
λm
n
can also be taken as the imputed values for downstream analyses (Methods 4.8 and 4.9).
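A minimal sketch of the padding function $h$ and the mask function $h^{-1}$ from equations (16)-(18), assuming features are identified by name; the helper names and toy feature lists are hypothetical.

```python
# Zero-padding h and masking h^{-1} for a modality whose batches measure different feature sets.
import numpy as np

def pad_to_union(x_batch, batch_features, union_features):
    """h: place the observed counts into a fixed-size vector over the feature union (eq. 16)."""
    idx = [union_features.index(f) for f in batch_features]
    x_tilde = np.zeros(len(union_features), dtype=x_batch.dtype)
    x_tilde[idx] = x_batch
    return x_tilde

def mask_to_batch(lam_tilde, batch_features, union_features):
    """h^{-1}: drop the entries for features missing in this batch (eq. 18)."""
    idx = [union_features.index(f) for f in batch_features]
    return lam_tilde[idx]

union = ["geneA", "geneB", "geneC", "geneD"]
batch = ["geneB", "geneD"]                       # this batch measured only two of the genes
x_tilde = pad_to_union(np.array([5, 2]), batch, union)                    # -> [0, 5, 0, 2]
lam = mask_to_batch(np.array([0.1, 4.9, 0.3, 2.2]), batch, union)         # -> [4.9, 2.2]
```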
4.4 Self-supervised modality alignment
To achieve cross-modal inference in downstream tasks, we resort to aligning different modalities in the latent space. Leveraging self-supervised learning, we first use each cell's multimodal observation $\{\{x_n^m\}_{m \in \mathcal{M}_n}, s_n\}$ to construct unimodal observations $\{x_n^m, s_n\}_{m \in \mathcal{M}_n}$, each of which is associated with the latent variables $z^m = \{c^m, u^m\}$. Then, we construct a pretext task, which enforces modality alignment by regularizing on the joint space of unimodal variational posteriors with the dispersion of latent variables as a penalty (Fig. 1 upper right), corresponding to a modality alignment loss:
$$\ell_{\text{mod}}(\phi; x_n, s_n) \triangleq \int v(\tilde{z})\, q_\phi(\tilde{z} \mid x_n, s_n)\, d\tilde{z} = \mathbb{E}_{q_\phi(\tilde{z} \mid x_n, s_n)}\left[v(\tilde{z})\right] \quad (19)$$
where $\tilde{z} = \{z^m\}_{m \in \mathcal{M}_n}$ is the set of latent variables, and $q_\phi(\tilde{z} \mid x_n, s_n)$ represents the joint distribution of unimodal variational posteriors since:
$$q_\phi(\tilde{z} \mid x_n, s_n) = q_\phi(\{z^m\}_{m \in \mathcal{M}_n} \mid x_n, s_n) = \prod_{m \in \mathcal{M}_n} q_\phi(z^m \mid x_n, s_n) = \prod_{m \in \mathcal{M}_n} q_\phi(z^m \mid x_n^m, s_n) \quad (20)$$
In equation (19), $v(\tilde{z})$ is the Mean Absolute Deviation, which measures the dispersion among different elements in $\tilde{z}$ and is used to regularize $q_\phi(\tilde{z} \mid x_n, s_n)$. It is defined as:
$$v(\tilde{z}) \triangleq \frac{1}{|\mathcal{M}_n|} \sum_{m \in \mathcal{M}_n} \left\| z^m - \bar{z} \right\|_2 \quad (21)$$
where $\bar{z} = \frac{1}{|\mathcal{M}_n|} \sum_{m \in \mathcal{M}_n} z^m$ is the mean and $\|\cdot\|_2$ the Euclidean distance.

Note that the computation of $q_\phi(z^m \mid x_n^m, s_n)$ in equation (20) is efficient. Since $q_\phi(z^m \mid x_n^m, s_n) = q_\phi(z \mid x_n^m, s_n)\big|_{z = z^m}$, according to equation (10) we have:
$$q_\phi(z \mid x_n^m, s_n) \propto p(z)\, \tilde{q}_\phi(z \mid s_n)\, \tilde{q}_\phi(z \mid x_n^m) \quad (22)$$
As the mean and covariance of each Gaussian term on the right-hand side of equation (22) were already obtained when inferring $q_\phi(z \mid x_n, s_n)$ (equation (10)), the mean and covariance of $q_\phi(z \mid x_n^m, s_n)$ can be directly calculated using equation (11), avoiding the need to pass each constructed unimodal observation to the encoders.
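A short sketch of the dispersion penalty $v(\tilde{z})$ in equation (21), applied to per-modality latent samples; the tensor shapes are assumed for illustration and this is not the authors' code.

```python
# Mean Absolute Deviation penalty over unimodal latent variables (equation (21)):
# for each cell, the per-modality latents z^m are pulled toward their mean z_bar,
# which encourages the modalities to agree in latent space.
import torch

def modality_alignment_penalty(z_per_modality):
    """z_per_modality: tensor of shape (n_modalities, batch, latent_dim)."""
    z_bar = z_per_modality.mean(dim=0, keepdim=True)                 # mean over modalities
    dist = torch.linalg.vector_norm(z_per_modality - z_bar, dim=-1)  # Euclidean distance per modality
    return dist.mean(dim=0)                                          # average over modalities, per cell

z = torch.randn(3, 128, 32)          # e.g., ATAC/RNA/ADT latents for a mini-batch of 128 cells
loss_mod = modality_alignment_penalty(z).mean()
```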
4.5 Information-theoretic disentanglement of latent variables
To better disentangle the biological state $c$ and the technical noise $u$, we adopt an information-theoretic approach, the Information Bottleneck (IB) [31], to control the information flow during inference. We define two types of IB, where the technical IB prevents batch-specific information from being encoded into $c$ by minimizing the Mutual Information (MI) between $s$ and $c$, and the biological IB prevents biological information from being encoded into $u$ by minimizing the MI between $x$ and $u$ (Fig. 1 bottom right). Let $I(\cdot, \cdot)$ denote the MI between two variables. Considering both $I(s, c)$ and $I(x, u)$, we have:
$$\begin{aligned}
I(s, c) + I(x, u) &= \mathbb{E}_{p(s,c)}\left[\log \frac{p(s, c)}{p(s)\,p(c)}\right] + \mathbb{E}_{p(x,u)}\left[\log \frac{p(x, u)}{p(x)\,p(u)}\right] \\
&= \mathbb{E}_{p(s,c)}\big[\log \underbrace{p(s \mid c)}_{p_{\hat{\alpha}}(s \mid c)}\big] - \underbrace{\mathbb{E}_{p(s,c)}\left[\log p(s)\right]}_{\text{const.}} + \mathbb{E}_{p(x,u)}\left[\log \frac{p(u \mid x)}{p(u)}\right] \\
&\approx \mathbb{E}_{p(x,s)}\mathbb{E}_{p(c \mid x,s)}\left[\log p_{\hat{\alpha}}(s \mid c)\right] + \text{const.} + \mathbb{E}_{p(x,s)}\left[\mathbb{E}_{p(u \mid x,s)}\left[\log \frac{p(u \mid x)}{p(u)}\right]\right] \\
&= \mathbb{E}_{p(x,s)}\left[\mathbb{E}_{p(c \mid x,s)}\left[\log p_{\hat{\alpha}}(s \mid c)\right] + \mathbb{E}_{p(u \mid x,s)}\left[\log \frac{p(u \mid x)}{p(u)}\right]\right] + \text{const.} \\
&\approx \frac{1}{N}\sum_n \left[\mathbb{E}_{p(c \mid x_n,s_n)}\left[\log p_{\hat{\alpha}}(s_n \mid c)\right] + \mathbb{E}_{p(u \mid x_n,s_n)}\left[\log \frac{p(u \mid x_n)}{p(u)}\right]\right] + \text{const.} \\
&\approx \frac{1}{N}\sum_n \underbrace{\left[\mathbb{E}_{q_\phi(c \mid x_n,s_n)}\left[\log p_{\hat{\alpha}}(s_n \mid c)\right] + \mathbb{E}_{q_\phi(u \mid x_n,s_n)}\left[\log \frac{q_\phi(u \mid x_n)}{p(u)}\right]\right]}_{\hat{\ell}_{\text{IB}}(\phi;\, x_n, s_n, \hat{\alpha})} + \text{const.} \\
&= \frac{1}{N}\sum_n \hat{\ell}_{\text{IB}}(\phi; x_n, s_n, \hat{\alpha}) + \text{const.}
\end{aligned} \quad (23)$$
where $p_{\hat{\alpha}}(s \mid c)$ is a learned likelihood with parameters $\hat{\alpha}$, and $q_\phi(u \mid x_n) \propto p(u) \prod_{m \in \mathcal{M}_n} \tilde{q}_\phi(u \mid x_n^m)$ can be computed via equation (11). From equation (23), minimizing $(I(s, c) + I(x, u))$ is approximately equal to minimizing $\hat{\ell}_{\text{IB}}(\phi; x_n, s_n, \hat{\alpha})$ w.r.t. $\phi$ for all cells. For $p_{\hat{\alpha}}(s \mid c)$, we model it as:
$$p_{\hat{\alpha}}(s \mid c) = \text{Categorical}\left(s \mid r(c; \hat{\alpha})\right) \quad (24)$$
where $r$ is a classifier neural network parameterized by $\hat{\alpha}$. To learn the classifier, we minimize the following expected negative log-likelihood w.r.t. $\hat{\alpha}$:
$$\begin{aligned}
-\mathbb{E}_{p(c,s)}\left[\log p_{\hat{\alpha}}(s \mid c)\right] &= -\mathbb{E}_{p(x,s)}\mathbb{E}_{p(c \mid x,s)}\left[\log p_{\hat{\alpha}}(s \mid c)\right] \\
&\approx -\frac{1}{N}\sum_n \mathbb{E}_{p(c \mid x_n,s_n)}\left[\log p_{\hat{\alpha}}(s_n \mid c)\right] \\
&\approx \frac{1}{N}\sum_n \underbrace{\left(-\mathbb{E}_{q_\phi(c \mid x_n,s_n)}\left[\log p_{\hat{\alpha}}(s_n \mid c)\right]\right)}_{\hat{\ell}_r(\hat{\alpha};\, x_n, s_n, \phi)} \\
&= \frac{1}{N}\sum_n \hat{\ell}_r(\hat{\alpha}; x_n, s_n, \phi)
\end{aligned} \quad (25)$$
Thus, minimizing $-\mathbb{E}_{p(c,s)}\left[\log p_{\hat{\alpha}}(s \mid c)\right]$ is approximately equal to minimizing $\hat{\ell}_r(\hat{\alpha}; x_n, s_n, \phi)$ w.r.t. $\hat{\alpha}$ for all cells.
To further enhance latent disentanglement for cross-modal inference, for each modality $m$ we also minimize $(I(s, c^m) + I(x^m, u^m))$. Similar to equation (23), this can be achieved by minimizing $\hat{\ell}_{\text{IB}}(\phi; x_n^m, s_n, \alpha_m)$, where $\alpha_m$ denotes the parameters of the classifier neural network $r_m$ used to generate the likelihood $p_{\alpha_m}(s \mid c^m) = \text{Categorical}\left(s \mid r_m(c^m; \alpha_m)\right)$. Together with $\hat{\ell}_{\text{IB}}(\phi; x_n, s_n, \hat{\alpha})$ defined in equation (23), the total IB loss is defined as:
$$\ell_{\text{IB}}(\phi; x_n, s_n, \alpha) \triangleq \hat{\ell}_{\text{IB}}(\phi; x_n, s_n, \hat{\alpha}) + \sum_{m \in \mathcal{M}_n} \hat{\ell}_{\text{IB}}(\phi; x_n^m, s_n, \alpha_m) \quad (26)$$
where $\alpha = \{\hat{\alpha}, \{\alpha_m\}_{m \in \mathcal{M}}\}$. To learn $p_{\alpha_m}(s \mid c^m)$, we can also minimize $-\mathbb{E}_{p(c^m,s)}\left[\log p_{\alpha_m}(s \mid c^m)\right]$, which corresponds to minimizing $\hat{\ell}_r(\alpha_m; x_n^m, s_n, \phi)$ according to equation (25). Considering the classifier loss $\hat{\ell}_r(\hat{\alpha}; x_n, s_n, \phi)$ defined in equation (25), the total classifier loss is defined as:
$$\ell_r(\alpha; x_n, s_n, \phi) = \hat{\ell}_r(\hat{\alpha}; x_n, s_n, \phi) + \sum_{m \in \mathcal{M}_n} \hat{\ell}_r(\alpha_m; x_n^m, s_n, \phi) \quad (27)$$
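To illustrate how the technical IB interacts with the classifier loss (equations (23)-(25)), the sketch below shows, under simplifying assumptions and with hypothetical names, the two alternating terms: a classifier trained to predict the batch ID $s$ from the biological state $c$, and an encoder penalty given by that classifier's log-likelihood, which discourages batch information in $c$.

```python
# Sketch of the technical-IB term: a classifier r(c) is trained to predict the batch ID s
# from the biological state c (equation (25)); the encoder is then penalized with the
# classifier's log-likelihood (part of equation (23)), squeezing batch information out of c.
import torch
import torch.nn as nn
import torch.nn.functional as F

classifier = nn.Linear(32, 4)        # r(c; alpha_hat): biological state -> batch logits

def classifier_loss(c, s):
    """Used when updating the classifier only: c is detached so the encoder is untouched."""
    return F.cross_entropy(classifier(c.detach()), s)

def technical_ib_loss(c, s):
    """Used when updating the encoder: minimize E[log p_alpha(s|c)] to confuse the classifier."""
    log_probs = F.log_softmax(classifier(c), dim=-1)
    return log_probs.gather(1, s.unsqueeze(1)).squeeze(1).mean()

c = torch.randn(128, 32, requires_grad=True)
s = torch.randint(0, 4, (128,))
l_r = classifier_loss(c, s)          # corresponds to step 4 of Algorithm 1 (classifier update)
l_ib = technical_ib_loss(c, s)       # contributes to step 5 (encoder/decoder update)
```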
4.6 Training MIDAS
To train the encoders and decoders of MIDAS, considering the training objectives defined in equations (8), (19), and (26), we minimize the following objective w.r.t. $\{\theta, \phi\}$ for all observations $\{x_n, s_n\}_{n \in \mathcal{N}}$:
$$\ell_{f,g}(\theta, \phi; x_n, s_n, \alpha) = \ell_{\text{ELBO}}(\theta, \phi; x_n, s_n) + \ell_{\text{mod}}(\phi; x_n, s_n) + \ell_{\text{IB}}(\phi; x_n, s_n, \alpha) \quad (28)$$
where $\ell_{\text{ELBO}}(\theta, \phi; x_n, s_n) \triangleq -\text{ELBO}(\theta, \phi; x_n, s_n)$. Since $\alpha$ is unknown and the learning of $\alpha$ depends on $\phi$ as in equation (27), we iteratively minimize equations (28) and (27) with Stochastic Gradient Descent (SGD), forming the MIDAS training algorithm (Algorithm 1).
Algorithm 1 The MIDAS training algorithm.
Input: A single-cell multimodal mosaic dataset $\{x_n, s_n\}_{n \in \mathcal{N}}$
Output: Decoder parameters $\theta$, encoder parameters $\phi$, and classifier parameters $\alpha$
1: Randomly initialize parameters $\{\theta, \phi, \alpha\}$
2: for $t = 1, 2, \ldots, T$ do
3:   Sample a mini-batch $\{x_n, s_n\}_{n \in \mathcal{N}_t}$ from the dataset, where $\mathcal{N}_t \subseteq \mathcal{N}$
4:   Freeze $\phi$ and update $\alpha$ via SGD with loss $\frac{1}{|\mathcal{N}_t|} \sum_{n \in \mathcal{N}_t} \ell_r(\alpha; x_n, s_n, \phi)$   ▷ See equation (27)
5:   Freeze $\alpha$ and update $\{\theta, \phi\}$ via SGD with loss $\frac{1}{|\mathcal{N}_t|} \sum_{n \in \mathcal{N}_t} \ell_{f,g}(\theta, \phi; x_n, s_n, \alpha)$   ▷ See equation (28)
6: end for
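A schematic Python rendering of Algorithm 1's alternating updates; l_r and l_fg are placeholder callables standing in for equations (27) and (28), and the optimizer choice is an assumption.

```python
# Schematic of Algorithm 1: alternate SGD updates of the classifiers (loss l_r, eq. 27)
# and of the encoders/decoders (loss l_fg, eq. 28) over mini-batches.
import torch

def train_midas(model, classifiers, loader, l_r, l_fg, epochs=1, lr=1e-4):
    opt_alpha = torch.optim.AdamW(classifiers.parameters(), lr=lr)
    opt_theta_phi = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, s in loader:                      # mini-batch {x_n, s_n}, n in N_t
            # Step 4: update alpha only; l_r is expected to detach encoder outputs,
            # and only opt_alpha steps, so theta/phi stay fixed here.
            opt_alpha.zero_grad()
            l_r(classifiers, model, x, s).backward()
            opt_alpha.step()
            # Step 5: update theta and phi; only opt_theta_phi steps, so alpha stays fixed.
            opt_theta_phi.zero_grad()
            l_fg(model, classifiers, x, s).backward()
            opt_theta_phi.step()
    return model
```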
4.7 Mosaic integration on latent space
A key goal of single-cell mosaic integration is to extract biologically meaningful low-dimensional representations from the mosaic data for downstream analysis. To achieve this, for each cell we first use the trained MIDAS to infer the mean and variance of the latent posterior $q_\phi(c, u \mid x_n, s_n)$ through equation (10), and then take the maximum a posteriori (MAP) estimate of the biological state $c$ as the cell's low-dimensional representation. Since $q_\phi(c \mid x_n, s_n)$ is Gaussian, the MAP estimate of $c$ corresponds to the inferred mean of $q_\phi(c \mid x_n, s_n)$.
4.8 Imputation for missing modalities and features
Based on the latent posterior $q_\phi(c, u \mid x_n, s_n)$ inferred from the single-cell mosaic data (Methods 4.7), it is straightforward to impute missing modalities and features. We first pass the inferred posterior means of $c$ and $u$ to the decoders to generate the padded feature mean $\tilde{\lambda}_n^m$ for each modality $m \in \mathcal{M}$ via equation (18). Then, we sample from a Bernoulli distribution with mean $\tilde{\lambda}_n^{\text{ATAC}}$ to generate the imputed ATAC counts, and from two Poisson distributions with means $\tilde{\lambda}_n^{\text{RNA}}$ and $\tilde{\lambda}_n^{\text{ADT}}$ to generate the imputed RNA and ADT counts, respectively.
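A small sketch of the sampling step just described, given decoded means $\tilde{\lambda}$ (equation (18)); the distribution choices follow equation (7) and the variable names are illustrative.

```python
# Sample imputed counts from the decoded means: Bernoulli for ATAC, Poisson for RNA/ADT (eq. 7).
import torch

def sample_imputed(lam_tilde):
    """lam_tilde: dict with keys 'atac' (values in [0,1]), 'rna' and 'adt' (positive rates)."""
    return {
        "atac": torch.bernoulli(lam_tilde["atac"]),
        "rna": torch.poisson(lam_tilde["rna"]),
        "adt": torch.poisson(lam_tilde["adt"]),
    }

lam = {"atac": torch.rand(5, 100), "rna": torch.rand(5, 50) * 4, "adt": torch.rand(5, 20) * 10}
imputed = sample_imputed(lam)
```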
4.9 Batch correction via latent variable manipulation
Besides performing mosaic integration on the latent space (Methods 4.7), we can also perform it on the feature space, i.e., imputing missing values and correcting batch effects for the count data. Mosaic integration on the feature space is important since it is required by many downstream tasks such as differential expression analysis and cell typing.

Based on the inferred posterior means of the latent variables $c$ and $u$ (Methods 4.8) of each cell, we can perform imputation and batch correction simultaneously by manipulating the technical noise $u$. Concretely, let $c_n$ and $u_n$ be the posterior means of $c$ and $u$ for Cell $n$, respectively. We first calculate the mean of $u_n$ within each batch $s$:
$$\bar{u}_s = \frac{1}{|\mathcal{N}_s|} \sum_{n \in \mathcal{N}_s} u_n \quad (29)$$
where $\mathcal{N}_s \subseteq \mathcal{N}$ is the set of cell IDs belonging to batch $s$. Next, we calculate the mean of $\bar{u}_s$ over all batches:
$$\bar{u} = \frac{1}{S} \sum_s \bar{u}_s \quad (30)$$
Then, we find the batch mean $\bar{u}_{s^\star}$ that is closest to $\bar{u}$, and treat $\bar{u}_{s^\star}$ as a “standard” technical noise, where:
$$s^\star = \arg\min_s \left\| \bar{u}_s - \bar{u} \right\|_2 \quad (31)$$
Finally, for each cell we correct the batch effect by substituting $u_n$ with the common $\bar{u}_{s^\star}$, and pass $\{c_n, \bar{u}_{s^\star}\}$ to the decoders to generate imputed and batch-corrected data (similar to Methods 4.8, but here we use $\{c_n, \bar{u}_{s^\star}\}$ instead of $\{c_n, u_n\}$ to correct the batch effect).
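The following numpy sketch mirrors equations (29)-(31): average the technical-noise posteriors within each batch, pick the batch mean closest to the global mean as the "standard" noise, and decode each cell with its own $c_n$ but the shared standard $u$; the decoder call itself is omitted and the function name is assumed.

```python
# Batch correction by technical-noise substitution (equations (29)-(31)).
import numpy as np

def standard_technical_noise(u, batch_ids):
    """u: (n_cells, d_u) posterior means; batch_ids: (n_cells,) integer batch labels."""
    batches = np.unique(batch_ids)
    u_bar_s = np.stack([u[batch_ids == s].mean(axis=0) for s in batches])   # eq. (29)
    u_bar = u_bar_s.mean(axis=0)                                            # eq. (30)
    s_star = np.argmin(np.linalg.norm(u_bar_s - u_bar, axis=1))             # eq. (31)
    return u_bar_s[s_star]

# Usage sketch: decode(c_n, u_std) would then generate imputed, batch-corrected counts
# as in Methods 4.8, with u_std shared by all cells.
u = np.random.randn(1000, 2)
batch_ids = np.random.randint(0, 4, size=1000)
u_std = standard_technical_noise(u, batch_ids)
```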
4.10 Model transfer via transfer learning
When MIDAS has been pre-trained on a reference dataset, we can conduct model transfer to transfer the model’s learned
knowledge to a query dataset through transfer learning, i.e., on the query dataset we fine-tune the pre-trained model instead of training the model from scratch. Since, compared to the reference dataset, the query dataset can contain a different number of batches collected from different platforms, the batch-ID-related modules need to be redefined. Thus, during
transfer learning, we reparameterize and reinitialize the batch-ID encoder and decoder $\{f_s, g_s\}$ and the batch classifiers $\{r, \{r_m\}_{m \in \mathcal{M}}\}$, and only fine-tune the modality encoders and decoders $\{f_m, g_m\}_{m \in \mathcal{M}}$.
A core advantage of our model transfer scheme is that it can flexibly transfer the knowledge of multimodal data to
various types of query datasets, even to those with fewer modalities, improving the de novo integration of single-cell
data.
4.11 Label transfer via reciprocal reference mapping
Whilst model transfer implicitly transfers knowledge through model parameters, label transfer explicitly transfers
knowledge in the form of data labels. These labels can be different kinds of downstream analysis results such as
cell types, cell cycles, or pseudotime. Through accurate label transfer, we can not only avoid the expensive de novo
downstream analysis, but also improve the label quality.
Typically, the first step of label transfer is reference mapping, which aligns the query cells with the reference cells so
that labels can be transferred reliably. For MIDAS, we can naively achieve reference mapping in two ways: (1) mapping
the query data onto the reference space, i.e., applying the model pre-trained on the reference data to infer the biological
states for the query data [46, 47], and (2) mapping the reference data onto the query space, i.e., applying the model transfer-learned on the query data (Methods 4.10) to infer the biological states for the reference data [13, 48]. However,
the first way suffers from the ‘generalization problem’ since the pre-trained model is hard to generalize to the query data
which usually contains unseen technical variations, while the second way suffers from the ‘forgetting problem’ since
the transfer-learned model may lose information learned on the reference data, affecting the inferred biological states.
To tackle both problems, we propose a reciprocal reference mapping scheme, where we fine-tune the pre-trained
model on the query dataset to avoid the generalization problem, and meanwhile feed the model with the past data sampled
from the reference dataset to prevent forgetting. In doing this, the model can find a mapping suitable for both reference
and query datasets. We can then align the two datasets by inferring the biological states, through which the reference
labels can be transferred to the query data based on nearest neighbors. Similar to model transfer (Methods 4.10), in
label transfer knowledge can also be flexibly and accurately transferred to various types of query datasets.
4.12 Modality contribution to the integrated clustering
We assess the contribution of different modalities to clustering by measuring the agreement between single-modality clustering and multimodal cell clustering. For each cell, the normalized consistency ratio of the nearest neighbors in the single-modality clustering and the multimodal clustering is used to represent the contribution of that modality to the final integrated clustering.
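One possible reading of this consistency ratio, sketched with scikit-learn nearest neighbors, is the per-cell fraction of k nearest neighbors shared between the single-modality embedding and the joint embedding; this is an interpretation of the description above, not the authors' exact implementation, and all names are hypothetical.

```python
# Per-cell neighborhood consistency between a single-modality embedding and the joint embedding.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def neighborhood_consistency(emb_single, emb_joint, k=30):
    nn_single = NearestNeighbors(n_neighbors=k + 1).fit(emb_single)
    nn_joint = NearestNeighbors(n_neighbors=k + 1).fit(emb_joint)
    idx_s = nn_single.kneighbors(emb_single, return_distance=False)[:, 1:]  # drop self
    idx_j = nn_joint.kneighbors(emb_joint, return_distance=False)[:, 1:]
    return np.array([len(set(a) & set(b)) / k for a, b in zip(idx_s, idx_j)])

rna_only = np.random.randn(500, 32)
joint = np.random.randn(500, 32)
contribution = neighborhood_consistency(rna_only, joint).mean()
```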
4.13 Regulatory network inference from scRNA-seq datasets
GRNBoost2 [51], one of the regression-based methods for regulatory network inference, is used to infer the regulatory network from scRNA-seq datasets. Weighted regulatory links between genes and transcription factors are provided by GRNBoost2. The weights of links shared between different data are compared to indicate regulatory network retention.
4.14 Correlation of expression fold changes between raw and batch-corrected data
For each cell type, expression fold changes of genes and proteins are calculated against all other cells using the FoldChange function in Seurat. The Pearson correlation coefficient is used to measure linear correlations of fold changes between raw and batch-corrected data.
4.15 Generating ground-truth cell type labels
To generate ground-truth cell type labels for both qualitative and quantitative evaluation, we employed the third-party
tool, Seurat, to annotate cell types for different PBMC datasets through label transfer. We took the CITE-seq PBMC
atlas from [
14
] as the reference set, and utilized the FindTransferAnchors and TransferData functions in Seurat to
perform label transfer, where “cca” was used as the reduction method for reference mapping. For cells without raw
RNA expression, we first utilized ATAC data to create a gene activity matrix using GeneActivity function in Signac [
52
].
The gene activity matrix was subsequently used for label transfer.
4.16 Evaluation metrics
To evaluate the performance of MIDAS and the state-of-the-art tools on multimodal integration, we utilize metrics
from scIB on batch correction and biological conservation, and also propose our own metrics on modality alignment
to better evaluate mosaic integration, extending scIB to scMIB (Supplementary Table 3). Since mosaic integration
should generate both low-dimensional representations and imputed, batch-corrected data, scMIB is performed on both embedding space and feature space. To evaluate the batch correction and biological conservation metrics on the feature space, we convert the imputed and batch-corrected features into a similarity graph via the PCA+WNN
strategy (Methods 4.20), and then use this graph for evaluation. Our metrics for batch correction, modality alignment,
and biological conservation are defined as follows.
4.16.1 Batch correction metrics
The batch correction metrics comprise graph iLISI ($y^{\text{iLISI}}_{\text{embed}}$ and $y^{\text{iLISI}}_{\text{feat}}$), graph connectivity ($y^{\text{gc}}_{\text{embed}}$ and $y^{\text{gc}}_{\text{feat}}$), and kBET ($y^{\text{kBET}}_{\text{embed}}$ and $y^{\text{kBET}}_{\text{feat}}$), where $y^{\text{iLISI}}_{\text{embed}}$, $y^{\text{gc}}_{\text{embed}}$, and $y^{\text{kBET}}_{\text{embed}}$ are defined in embedding space and $y^{\text{iLISI}}_{\text{feat}}$, $y^{\text{gc}}_{\text{feat}}$, and $y^{\text{kBET}}_{\text{feat}}$ are defined in feature space.
Graph iLISI. The graph iLISI (local inverse Simpson's index [53]) metric is extended from the iLISI, which is used
to measure the batch mixing degree. The iLISI scores are computed based on kNN graphs by computing the inverse
Simpson’s index for diversity. The scores estimate the effective number of batches present in the neighborhood. iLISI
ranges from 1 to $N$, where $N$ equals the number of batches. Scores close to the real batch number denote good mixing. However, the typical iLISI score is not applicable to graph-based outputs. scIB proposed the Graph iLISI, which utilizes a
graph-based distance metric to determine the nearest neighbor list and avoids skews on graph-based integration outputs.
The graph iLISI scores are scaled to [0,1], where 0 indicates strong separation and 1 indicates perfect mixing.
Graph connectivity. The graph connectivity is proposed by scIB to inspect whether cells with the same label are connected in the kNN graph of all cells. For each label $c$, we get the largest connected graph component (LCC) of $c$-labeled cells and divide the LCC size by the $c$-labeled cell population size to represent the graph connectivity for cell label $c$. Then, we calculate the connectivity values for all labels and take the average as the total graph connectivity. The score ranges from 0 to 1. A score of 1 means that all cells with the same cell identity from different batches are connected in the integrated kNN graph, which also indicates perfect batch mixing, and vice versa.
kBET. The kBET (k-nearest neighbor batch-effect test [54]) is used to measure the batch mixing at the local level
of the k-nearest neighbors. Certain fractions of random cells are repeatedly selected to test whether the local label
distributions are statistically similar to the global label distributions (null hypothesis). The kBET value is the rejection
rate over all tested neighborhoods and values close to zero represent that the batches are well mixed. scIB adjusts the
kBET with a diffusion-based correction to enable unbiased comparison on graph- and non-graph-based integration
results. kBET values are first computed for each label, and then averaged and subtracted from 1 to get a final kBET
score.
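The sketch below illustrates a simplified kBET-style rejection rate; it omits scIB's diffusion-based correction and per-label averaging, and the inputs `nn_idx` (a cells-by-k matrix of neighbor indices) and `batches` are hypothetical.

```python
import numpy as np
from scipy.stats import chisquare

def kbet_score(nn_idx, batches, alpha=0.05, n_tests=1000, seed=0):
    rng = np.random.default_rng(seed)
    cats, counts = np.unique(batches, return_counts=True)
    global_freq = counts / counts.sum()
    n_tests = min(n_tests, nn_idx.shape[0])
    rejections = 0
    for i in rng.choice(nn_idx.shape[0], size=n_tests, replace=False):
        local = batches[nn_idx[i]]                         # batch labels in the neighborhood
        obs = np.array([(local == c).sum() for c in cats])
        exp = global_freq * obs.sum()                      # expected counts under good mixing
        _, p = chisquare(obs, exp)
        rejections += int(p < alpha)                       # reject the null of similar distributions
    return 1.0 - rejections / n_tests                      # close to 1 means well-mixed batches
```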
4.16.2 Modality alignment metrics
The modality alignment metrics comprise modality ASW ($y^{\mathrm{ASW}}$), FOSCTTM ($y^{\mathrm{FOSCTTM}}$), label transfer F1 ($y^{\mathrm{ltF1}}$), ATAC AUROC ($y^{\mathrm{AUROC}}$), RNA Pearson's $r$ ($y^{\mathrm{RNAr}}$), and ADT Pearson's $r$ ($y^{\mathrm{ADTr}}$), where $y^{\mathrm{ASW}}$, $y^{\mathrm{FOSCTTM}}$, and $y^{\mathrm{ltF1}}$ are defined in the embedding space and $y^{\mathrm{AUROC}}$, $y^{\mathrm{RNAr}}$, and $y^{\mathrm{ADTr}}$ are defined in the feature space.
Modality ASW. The modality ASW (averaged silhouette width) measures the alignment of distributions between different modality embeddings. The ASW [55] was originally used to measure the separation of clusters. In scIB, the ASW is modified to measure the performance of batch effect removal, resulting in a batch ASW that ranges from 0 to 1, where 1 denotes perfect batch mixing and 0 denotes strong batch separation. By replacing batch embeddings with modality embeddings, we define a modality ASW in the same manner as the batch ASW, where 1 denotes perfect modality alignment and 0 denotes strong modality separation. For MIDAS, the modality embeddings are generated by feeding the trained model with each modality individually.
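A minimal sketch of this score follows, using the scIB-style transformation 1 - |s| of per-cell silhouette values with modality identity as the label; the within-cell-type averaging used by scIB is omitted, and `emb_by_mod` (modality name mapped to its embedding matrix) is a hypothetical input.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

def modality_asw(emb_by_mod):
    X = np.vstack(list(emb_by_mod.values()))
    labels = np.concatenate([[m] * e.shape[0] for m, e in emb_by_mod.items()])
    s = silhouette_samples(X, labels)        # per-cell silhouette in [-1, 1]
    return float(np.mean(1.0 - np.abs(s)))   # 1 = modalities fully mixed (aligned)
```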
FOSCTTM. The FOSCTTM (fraction of samples closer than the true match [56]) measures the alignment of values between different modality embeddings. Let $y^{\mathrm{FOSCTTM}}_{m_1,m_2}$ be the FOSCTTM for a modality pair $\{m_1, m_2\}$; it is defined as:

$$
\begin{aligned}
y^{\mathrm{FOSCTTM}}_{m_1,m_2} &= \frac{1}{2N}\left(\sum_i \frac{N^{m_1}_i}{N} + \sum_i \frac{N^{m_2}_i}{N}\right) \\
N^{m_1}_i &= \left|\left\{ j \,\middle|\, \left\| e^{m_1}_i - e^{m_2}_j \right\|_2 < \left\| e^{m_1}_i - e^{m_2}_i \right\|_2 \right\}\right| \\
N^{m_2}_i &= \left|\left\{ j \,\middle|\, \left\| e^{m_1}_j - e^{m_2}_i \right\|_2 < \left\| e^{m_1}_i - e^{m_2}_i \right\|_2 \right\}\right|
\end{aligned}
\tag{32}
$$

where $N$ is the number of cells, $i$ and $j$ are cell indices, and $e^{m_1}_i$ and $e^{m_2}_i$ are the embeddings of cell $i$ in modalities $m_1$ and $m_2$, respectively. $N^{m_1}_i$ is the number of cells in modality $m_2$ that are closer to $e^{m_1}_i$ than $e^{m_2}_i$ is to $e^{m_1}_i$, and similarly for $N^{m_2}_i$. We first obtain the embeddings of individual modalities, then calculate the FOSCTTM values for each modality pair, and lastly average these values and subtract the average from 1 to obtain the final FOSCTTM score. Higher FOSCTTM scores indicate better modality alignment.
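For one modality pair, Eq. (32) can be sketched as below, assuming `e1` and `e2` are the N-by-d embeddings of the same N cells in modalities m1 and m2 (row i of both matrices corresponds to the same cell).

```python
import numpy as np
from scipy.spatial.distance import cdist

def foscttm(e1, e2):
    d = cdist(e1, e2)                       # d[i, j] = ||e1_i - e2_j||_2
    true = np.diag(d)                       # distance to the true match
    n1 = (d < true[:, None]).sum(axis=1)    # N_i^{m1}: cells in m2 closer than the true match
    n2 = (d < true[None, :]).sum(axis=0)    # N_i^{m2}: cells in m1 closer than the true match
    N = e1.shape[0]
    frac = (n1 / N + n2 / N).mean() / 2     # Eq. (32)
    return 1.0 - frac                       # reported score: higher = better alignment
```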
Label transfer F1. The label transfer F1 measures the alignment of cell types between different modality embeddings. This can be achieved by testing whether cell type labels can be transferred from one modality to another without any bias. For each pair of modalities, we first build a kNN graph between their embeddings, and then transfer labels from one modality to the other based on the nearest neighbors. The transferred labels are compared to the original labels using the micro F1-score, which is defined as the label transfer F1. We take the F1 score averaged over all comparison pairs as the final label transfer F1 score.
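A minimal sketch of the transfer for one modality pair is given below, using a kNN classifier as the transfer rule; the inputs `e_src`, `e_tgt` (embeddings of the same cells in two modalities) and `labels` are hypothetical, and the neighbor count is illustrative.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

def label_transfer_f1(e_src, e_tgt, labels, k=15):
    knn = KNeighborsClassifier(n_neighbors=k).fit(e_src, labels)
    transferred = knn.predict(e_tgt)     # transfer labels via nearest neighbors across modalities
    return f1_score(labels, transferred, average="micro")
```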
ATAC AUROC. The ATAC AUROC (area under the receiver operating characteristic curve) measures the alignment of different modalities in the ATAC feature space; it has previously been used to evaluate the quality of ATAC predictions [57]. For each method to be evaluated, we first use it to convert different modality combinations (excluding ATAC) into ATAC features, then calculate the AUROC of each converted result by taking the true ATAC features as the ground truth, and finally take the average of these AUROCs as the final score. Taking MIDAS as an example, if ATAC, RNA, and ADT are involved, the data of the three modality combinations {RNA}, {ADT}, and {RNA, ADT} can be input into the trained model to obtain three sets of ATAC predictions.
RNA Pearson's r. The RNA Pearson's $r$ measures the alignment of different modalities in the RNA feature space. For each method to be evaluated, we first use it to convert different modality combinations (excluding RNA) into RNA features, then calculate the Pearson's $r$ between each converted result and the true RNA features, and finally take the average of these Pearson's $r$ values as the final score.
ADT Pearson's r. The ADT Pearson's $r$ measures the alignment of different modalities in the ADT feature space. Its calculation is similar to that of the RNA Pearson's $r$.
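The feature-space alignment metrics can be sketched as below for a single modality-combination input; computing the AUROC over all flattened peak-by-cell entries and assuming dense NumPy matrices are both simplifying assumptions, not details taken from the original evaluation code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from scipy.stats import pearsonr

def atac_auroc(atac_true, atac_pred):
    # binarized true ATAC features as ground truth, predicted accessibilities as scores
    return roc_auc_score(atac_true.ravel(), atac_pred.ravel())

def feature_pearson_r(x_true, x_pred):
    # Pearson's r between imputed and measured features (used for RNA and ADT)
    return pearsonr(x_true.ravel(), x_pred.ravel())[0]
```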
4.16.3 Biological conservation metrics
The biological conservation metrics comprise NMI ($y^{\mathrm{NMI}}_{\mathrm{embed}}$ and $y^{\mathrm{NMI}}_{\mathrm{feat}}$), ARI ($y^{\mathrm{ARI}}_{\mathrm{embed}}$ and $y^{\mathrm{ARI}}_{\mathrm{feat}}$), isolated label F1 ($y^{\mathrm{ilF1}}_{\mathrm{embed}}$ and $y^{\mathrm{ilF1}}_{\mathrm{feat}}$), and graph cLISI ($y^{\mathrm{cLISI}}_{\mathrm{embed}}$ and $y^{\mathrm{cLISI}}_{\mathrm{feat}}$), where $y^{\mathrm{NMI}}_{\mathrm{embed}}$, $y^{\mathrm{ARI}}_{\mathrm{embed}}$, $y^{\mathrm{ilF1}}_{\mathrm{embed}}$, and $y^{\mathrm{cLISI}}_{\mathrm{embed}}$ are defined in the embedding space and $y^{\mathrm{NMI}}_{\mathrm{feat}}$, $y^{\mathrm{ARI}}_{\mathrm{feat}}$, $y^{\mathrm{ilF1}}_{\mathrm{feat}}$, and $y^{\mathrm{cLISI}}_{\mathrm{feat}}$ are defined in the feature space.
NMI. The NMI (Normalized Mutual Information) is used to measure the similarity between two clustering results, namely the predefined cell type labels and the clustering result obtained from the embeddings or the graph. Optimized Louvain clustering is used here according to scIB. The NMI scores are scaled to [0, 1], where 0 and 1 correspond to uncorrelated clustering and a perfect match, respectively.
ARI. The ARI (Adjusted Rand Index) also measures the overlap of two clustering results. The RI (Rand Index [58]) considers not only cell pairs assigned to the same clusters but also those assigned to different clusters in the predicted (Louvain clustering) and true (cell type) clusterings. The ARI corrects the RI for randomly correct labels. An ARI of 1 represents a perfect match and 0 represents random labeling.
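Given the predefined cell type labels and a Louvain clustering of the integrated result, both scores can be computed directly with scikit-learn, as in the sketch below (the clustering-resolution optimization performed by scIB is not shown).

```python
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

def nmi_ari(cell_types, clusters):
    nmi = normalized_mutual_info_score(cell_types, clusters)  # 0 = uncorrelated, 1 = perfect match
    ari = adjusted_rand_score(cell_types, clusters)           # 0 = random labeling, 1 = perfect match
    return nmi, ari
```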
Isolated label F1. scIB proposes the isolated label F1 score to evaluate integration performance, specifically focusing on cells whose label is shared by few batches. Cell labels present in the smallest number of batches are identified as isolated labels. The F1 score measuring the clustering performance on isolated labels is defined as the isolated label F1 score. It reflects how well the isolated labels separate from other cell identities, ranging from 0 to 1, where 1 means that all the isolated-label cells, and no others, are grouped into one cluster.
Graph cLISI. The graph cLISI is similar to the graph iLISI but focuses on cell type labels rather than batch labels. Unlike the iLISI, which rewards the mixing of groups, the cLISI values the separation of groups. The graph-adjusted cLISI is scaled to [0, 1], with 0 corresponding to low cell-type separation and 1 corresponding to strong cell-type separation.
4.16.4 Overall scores
scIB. We compute the scIB overall score using the batch correction and biological conservation metrics defined either on the embedding space (for algorithms generating embeddings or graphs) or on the feature space (for algorithms generating batch-corrected features). Following [40], the overall score $y$ is the sum of the averaged batch correction metric $y_{\mathrm{batch}}$ weighted by 0.4 and the averaged biological conservation metric $y_{\mathrm{bio}}$ weighted by 0.6:

$$
\begin{aligned}
y_{\mathrm{batch}} &= \left(y^{\mathrm{iLISI}}_{\omega} + y^{\mathrm{gc}}_{\omega} + y^{\mathrm{kBET}}_{\omega}\right)/3 \\
y_{\mathrm{bio}} &= \left(y^{\mathrm{NMI}}_{\omega} + y^{\mathrm{ARI}}_{\omega} + y^{\mathrm{ilF1}}_{\omega} + y^{\mathrm{cLISI}}_{\omega}\right)/4 \\
y &= 0.4 \cdot y_{\mathrm{batch}} + 0.6 \cdot y_{\mathrm{bio}}
\end{aligned}
\tag{33}
$$

where $\omega = \mathrm{embed}$ for embedding or graph outputs, and $\omega = \mathrm{feat}$ for feature outputs.
scMIB. As an extension of scIB, the scMIB overall score $y$ is computed from the batch correction, modality alignment, and biological conservation metrics defined on both the embedding and feature spaces. It is the sum of the averaged batch correction metric $y_{\mathrm{batch}}$ weighted by 0.3, the averaged modality alignment metric $y_{\mathrm{mod}}$ weighted by 0.3, and the averaged biological conservation metric $y_{\mathrm{bio}}$ weighted by 0.4:

$$
\begin{aligned}
y_{\mathrm{batch}} &= \left(y^{\mathrm{iLISI}}_{\mathrm{embed}} + y^{\mathrm{gc}}_{\mathrm{embed}} + y^{\mathrm{kBET}}_{\mathrm{embed}} + y^{\mathrm{iLISI}}_{\mathrm{feat}} + y^{\mathrm{gc}}_{\mathrm{feat}} + y^{\mathrm{kBET}}_{\mathrm{feat}}\right)/6 \\
y_{\mathrm{mod}} &= \left(y^{\mathrm{ASW}} + y^{\mathrm{FOSCTTM}} + y^{\mathrm{ltF1}} + y^{\mathrm{AUROC}} + y^{\mathrm{RNAr}} + y^{\mathrm{ADTr}}\right)/6 \\
y_{\mathrm{bio}} &= \left(y^{\mathrm{NMI}}_{\mathrm{embed}} + y^{\mathrm{ARI}}_{\mathrm{embed}} + y^{\mathrm{ilF1}}_{\mathrm{embed}} + y^{\mathrm{cLISI}}_{\mathrm{embed}} + y^{\mathrm{NMI}}_{\mathrm{feat}} + y^{\mathrm{ARI}}_{\mathrm{feat}} + y^{\mathrm{ilF1}}_{\mathrm{feat}} + y^{\mathrm{cLISI}}_{\mathrm{feat}}\right)/8 \\
y &= 0.3 \cdot y_{\mathrm{batch}} + 0.3 \cdot y_{\mathrm{mod}} + 0.4 \cdot y_{\mathrm{bio}}
\end{aligned}
\tag{34}
$$
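Both overall scores are plain weighted averages of the individual metrics, as sketched below; the dictionary keys are hypothetical names for the metric values defined above.

```python
import numpy as np

def scib_overall(m, w="embed"):  # w = "embed" for embedding/graph outputs, "feat" for feature outputs
    y_batch = np.mean([m[f"iLISI_{w}"], m[f"gc_{w}"], m[f"kBET_{w}"]])
    y_bio = np.mean([m[f"NMI_{w}"], m[f"ARI_{w}"], m[f"ilF1_{w}"], m[f"cLISI_{w}"]])
    return 0.4 * y_batch + 0.6 * y_bio                        # Eq. (33)

def scmib_overall(m):
    y_batch = np.mean([m[f"{k}_{w}"] for k in ("iLISI", "gc", "kBET") for w in ("embed", "feat")])
    y_mod = np.mean([m[k] for k in ("ASW", "FOSCTTM", "ltF1", "AUROC", "RNAr", "ADTr")])
    y_bio = np.mean([m[f"{k}_{w}"] for k in ("NMI", "ARI", "ilF1", "cLISI") for w in ("embed", "feat")])
    return 0.3 * y_batch + 0.3 * y_mod + 0.4 * y_bio          # Eq. (34)
```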
4.17 Data availability
All datasets of human PBMCs were publicly available (Supplementary Table 1). Count matrices of gene UMIs, ATAC
fragments and antibody-derived tags were downloaded for data analysis.
DOGMA-seq dataset. The DOGMA-seq dataset contains four batches profiled by DOGMA-seq, which measures
RNA, ATAC and ADT data simultaneously. Trimodal data of this dataset were obtained from Gene Expression
Omnibus (GEO) [59] under accession ID GSE166188 [2].
TEA-seq dataset. The TEA-seq dataset contains five batches profiled by TEA-seq, which measures RNA, ATAC and ADT data simultaneously. Trimodal data of these batches were obtained from GEO under accession ID GSE158013 [3].
TEA Multiome dataset. The TEA Multiome dataset measuring paired RNA and ATAC data was obtained from GEO under accession ID GSE158013 [3]. This dataset contains two batches profiled by 10x Chromium Single Cell Multiome ATAC + Gene Expression.
10X Multiome dataset. The 10X Multiome dataset measuring paired RNA and ATAC data was collected from 10x Genomics (https://www.10xgenomics.com/resources/datasets/) [60–63].
ASAP dataset. The ASAP dataset was obtained from GEO under accession ID GSE156473 [2]. Two batches profiled by ASAP-seq include ATAC and ADT data, and the other two batches profiled by CITE-seq measure RNA and ADT data.
WNN dataset. The WNN dataset measuring paired RNA and ADT data was obtained from https://atlas.fredhutch.org/nygc/multimodal-pbmc [14]. This dataset was profiled by CITE-seq. We selected the eight PBMC batches generated before the administration of HIV vaccine for integration.
4.18 Data preprocessing
The count matrices of RNA and ADT were processed via the Seurat package (v4.1.0) [14]. The ATAC fragment files were processed using the Signac package (v1.6.0) [52], and peaks were called via MACS2 [64]. We performed quality control separately for each batch. Briefly, metrics of detected gene number per cell, total UMI number, percentage of mtRNA reads, total protein tag number, total fragment number, TSS score, and nucleosome signal were evaluated. We manually checked the distributions of these metrics and set customized criteria to filter low-quality cells in each batch. The number of cells that passed quality control in each batch is shown in Supplementary Table 1.
For each batch, we adopted common normalization strategies for RNA, ADT and ATAC, respectively. Specifically, for RNA data, UMI count matrices were normalized and log-transformed using the NormalizeData function in Seurat; for ADT data, tag count matrices were centered log-ratio (CLR)-normalized using the NormalizeData function in Seurat; and for ATAC data, fragment matrices were term frequency-inverse document frequency (TF-IDF) normalized using the RunTFIDF function in Signac.
To integrate batches profiled by various technologies, we needed to create a union of features for the RNA, ADT and ATAC data, respectively. For RNA data, low-frequency genes were first removed based on gene occurrence frequency across all batches; we then selected 4000 highly variable genes (HVGs) using the FindVariableFeatures function with default parameters in each batch; the union of these HVGs was ranked using the SelectIntegrationFeatures function and the top 4000 genes were selected. In addition, we also retained genes that encode the proteins targeted by the antibodies. For ADT data, the union of antibodies across all batches was retained for data integration. For ATAC data, we used the reduce function in Signac to merge all intersecting peaks across batches and then re-calculated the fragment counts in the merged peaks. The merged peaks were used for data integration.
The input data for MIDAS are UMI counts for RNA, tag counts for ADT, and binarized fragment counts for ATAC. For each modality, the union of features from all batches is used, and counts of missing features are set to 0. Binary feature masks are generated accordingly, in which 1 and 0 denote present and missing features, respectively.
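A minimal sketch of this input assembly for one batch is given below; the dictionary structure and names (`counts`, `batch_features`, `union_features`, the modality key "atac") are hypothetical and do not reflect the actual MIDAS data-loading code.

```python
import numpy as np

def build_inputs(counts, batch_features, union_features):
    inputs, masks = {}, {}
    for mod, x in counts.items():
        union = list(union_features[mod])
        col = {f: j for j, f in enumerate(union)}
        full = np.zeros((x.shape[0], len(union)), dtype=np.float32)  # counts of missing features stay 0
        mask = np.zeros(len(union), dtype=np.int8)                   # 1 = present, 0 = missing
        for j, f in enumerate(batch_features[mod]):
            full[:, col[f]] = x[:, j]
            mask[col[f]] = 1
        if mod == "atac":
            full = (full > 0).astype(np.float32)                     # binarize fragment counts
        inputs[mod], masks[mod] = full, mask
    return inputs, masks
```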
4.19 Implementation of MIDAS
We implement the architecture of MIDAS with PyTorch [65]. We set the sizes of the shared hidden layers of the different modality encoders to 1024-128, the sizes of the shared hidden layers of the different modality decoders to 128-1024, and the sizes of the biological state and technical noise hidden variables to 32 and 2, respectively. Each hidden layer is implemented by four functions: Linear, LayerNorm, Mish, and Dropout. For all tasks, we split the training data into a proportion of 95/5 for training/validation. To train the model, we set the mini-batch size to 256 and use the AdamW [66] optimizer with a learning rate of $10^{-4}$ for stochastic gradient-based optimization. Early stopping is used to terminate training.
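A minimal PyTorch sketch of such a hidden-layer block and of the shared encoder hidden layers (1024-128) is shown below; the input dimensionality and dropout rate are illustrative placeholders, not the exact training configuration.

```python
import torch.nn as nn

def hidden_block(d_in, d_out, p_drop=0.2):
    # Linear -> LayerNorm -> Mish -> Dropout, as described above
    return nn.Sequential(
        nn.Linear(d_in, d_out),
        nn.LayerNorm(d_out),
        nn.Mish(),
        nn.Dropout(p_drop),
    )

d_in = 4000  # hypothetical input dimensionality for one modality
encoder_hidden = nn.Sequential(hidden_block(d_in, 1024), hidden_block(1024, 128))
```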
4.20 Implementation of comparing methods
We compare MIDAS to recent methods in both rectangular and mosaic integration of trimodal data. Rectangular integration is a simpler case of mosaic integration, where all modalities are complete. Here, all generated low-dimensional representations have the same size as the biological states inferred by MIDAS, i.e., 32 dimensions.
4.20.1 Rectangular integration methods
Since few methods are directly applicable to rectangular integration tasks involving ATAC, RNA, and ADT, we decompose rectangular integration into two steps, i.e., batch correction for each modality independently, and modality fusion