Available via license: CC BY-NC-ND 4.0

MIDAS: a deep generative model for mosaic integration and

knowledge transfer of single-cell multimodal data

Zhen He1,#, Yaowen Chen1,#, Shuofeng Hu1,#, Sijing An1, Junfeng Shi2, Runyan Liu1, Jiahao Zhou3, Guohua Dong1, Jinhui Shi1, Jiaxin Zhao1, Jing Wang1, Yuan Zhu2, Le Ou-Yang3, Xiaochen Bo4,∗, Xiaomin Ying1,∗

1Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing, China

2School of Automation, China University of Geosciences, Wuhan, China

3College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China

4Institute of Health Service and Transfusion Medicine, Beijing, China

#These authors contributed equally: Zhen He, Yaowen Chen, Shuofeng Hu

ABSTRACT

Rapidly developing single-cell multi-omics sequencing technologies generate increasingly large

bodies of multimodal data. Integrating multimodal data from different sequencing technologies, i.e.

mosaic data, permits larger-scale investigation with more modalities and can help to better reveal

cellular heterogeneity. However, mosaic integration involves major challenges, particularly regarding

modality alignment and batch effect removal. Here we present a deep probabilistic framework for

the mosaic integration and knowledge transfer (MIDAS) of single-cell multimodal data. MIDAS

simultaneously achieves dimensionality reduction, imputation, and batch correction of mosaic data

by employing self-supervised modality alignment and information-theoretic latent disentanglement.

We demonstrate its superiority over other methods and its reliability by evaluating its performance in full

trimodal integration and various mosaic tasks. We also constructed a single-cell trimodal atlas of

human peripheral blood mononuclear cells (PBMCs), and tailored transfer learning and reciprocal

reference mapping schemes to enable flexible and accurate knowledge transfer from the atlas to new

data.

1 Introduction

Recently emerged single-cell multimodal omics (scMulti-omics) sequencing technologies enable the simultaneous

detection of multiple modalities, such as RNA expression, protein abundance, and chromatin accessibility, in the same

cell [1]. These technologies, including the trimodal DOGMA-seq [2] and TEA-seq [3] and bimodal CITE-seq [4] and ASAP-seq [2], among many others [5–10], reveal not only cellular heterogeneity at multiple molecular layers, enabling

more refined identification of cell characteristics, but also connections across omes, providing a systematic view of

ome interactions and regulation at single-cell resolution. The involvement of more measured modalities in analyses of

biological samples increases the potential for breakthroughs in the understanding of mechanisms underlying numerous

processes, including cell functioning, tissue development, and disease occurrence. The growing size of scMulti-omics

datasets necessitates the development of new computational tools to integrate massive high-dimensional data generated

from different sources, thereby facilitating more comprehensive and reliable downstream analysis for knowledge

mining [1, 11]. Such “integrative analysis” also enables the construction of a large-scale single-cell multimodal atlas,

which is urgently needed to make full use of publicly available single-cell multimodal data. Such an atlas can serve as

an encyclopedia allowing researchers to transfer knowledge to their new data and in-house studies [12–14].

Several methods for single-cell multimodal integration have been presented recently. Most of them have been

proposed for the integration of bimodal data [14–21]. Fewer trimodal integration methods have been developed; MOFA+ [22] has been proposed for trimodal integration with complete modalities, and GLUE [23] and uniPort [24] have been developed for the integration of unpaired trimodal data (i.e., datasets involving single specific modalities).

∗Correspondence to: Xiaomin Ying (email: yingxmbio@foxmail.com) and Xiaochen Bo (email: boxiaoc@163.com).


All of these current integration methods have difficulty in handling flexible omics combinations. Due to the diversity of scMulti-omics technologies, datasets from different studies often include heterogeneous omics combinations with one or more missing modalities, resulting in mosaic-like data. Such mosaic data are accumulating rapidly and will predictably become prevalent. Mosaic integration methods are urgently needed to markedly expand the scale and modalities of integration, breaking through the modality scalability and cost limitations of existing scMulti-omics sequencing technologies. Most recently, scVAEIT [25], scMoMaT [26], and StabMap [27] have been proposed to tackle this problem. However, these methods are not capable of aligning modalities or correcting batches, which limits their functionality and performance. Therefore, flexible and general multimodal mosaic integration remains challenging [28, 29]. One

major challenge is the reconciliation of modality heterogeneity and technical variation across batches. Another is the

achievement of modality imputation and batch correction for downstream analysis.

To overcome these challenges, we developed a probabilistic framework, MIDAS, for the mosaic integration and

knowledge transfer of single-cell multimodal data. By employing self-supervised learning [30] and information-theoretic approaches [31], MIDAS simultaneously achieves modality alignment, imputation, and batch correction for

single-cell trimodal mosaic data. We further designed transfer learning and reciprocal reference mapping schemes

tailored to MIDAS to enable knowledge transfer. Systematic benchmarks and case studies demonstrate that MIDAS

can accurately and robustly integrate mosaic datasets. Through the atlas-level mosaic integration of trimodal human

peripheral blood mononuclear cell (PBMC) data, MIDAS achieved flexible and accurate knowledge transfer for various

types of unimodal and multimodal query datasets.

2 Results

2.1 MIDAS enables the mosaic integration and knowledge transfer of single-cell multimodal data

MIDAS is a deep generative model [32,33] that represents the joint distribution of incomplete single-cell multimodal

data with Assay for Transposase-Accessible Chromatin (ATAC), RNA, and Antibody-Derived Tags (ADT) measurements. MIDAS assumes that each cell’s multimodal measurements are generated from two modality-agnostic and

disentangled latent variables—the biological state (i.e., cellular heterogeneity) and technical noise (i.e., unwanted

variation induced by single-cell experimentation)—through deep neural networks [34]. Its input consists of a mosaic feature-by-cell count matrix comprising different single-cell samples (batches), and a vector representing the cell batch IDs (Fig. 1a). The batches can derive from different experiments or be generated by the application of different sequencing techniques (e.g., scRNA-seq [35], CITE-seq [4], ASAP-seq [2], and TEA-seq [3]), and thus can have

different technical noise, modalities, and features. The MIDAS output comprises biological state and technical noise

matrices, which are the two low-dimensional representations of different cells, and an imputed and batch-corrected

count matrix in which modalities and features missing from the input data are interpolated and batch effects are

removed. These outputs can be used for downstream analyses such as clustering, differential expression analyses, and

cell typing [36].
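This input structure can be illustrated with a toy sketch; the container layout, batch names, and shapes below are hypothetical (not the actual MIDAS API), simulating a mosaic dataset of three batches with heterogeneous modality combinations, a batch-ID vector, and a modality mask:

```python
import numpy as np

rng = np.random.default_rng(0)
FEATS = {"atac": 50, "rna": 30, "adt": 10}   # toy feature counts per modality

def make_batch(n_cells, modalities):
    """Simulate one batch that measures only a subset of modalities."""
    counts = {}
    for m in modalities:
        if m == "atac":   # binarized chromatin accessibility
            counts[m] = rng.binomial(1, 0.1, size=(n_cells, FEATS[m]))
        else:             # RNA/ADT count data
            counts[m] = rng.poisson(2.0, size=(n_cells, FEATS[m]))
    return counts

# Mosaic dataset: batches with heterogeneous modality combinations.
batches = {
    "b0": make_batch(100, ["rna", "adt"]),          # CITE-seq-like
    "b1": make_batch(80,  ["atac", "rna"]),         # 10X Multiome-like
    "b2": make_batch(60,  ["atac", "rna", "adt"]),  # DOGMA-seq-like
}

# Batch-ID vector and modality mask, the auxiliary inputs described above.
batch_ids = np.concatenate([np.full(len(next(iter(c.values()))), i)
                            for i, c in enumerate(batches.values())])
mask = {b: {m: (m in c) for m in FEATS} for b, c in batches.items()}
```

A model consuming such input must tolerate absent modality blocks per batch, which is exactly the mosaic setting described in the text.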

MIDAS is based on a Variational Autoencoder (VAE) [37] architecture, with a modularized encoder network

designed to handle the mosaic input data and infer the latent variables, and a decoder network that uses the latent

variables to seed the generative process for the observed data (Fig. 1b). It uses self-supervised learning to align

different modalities in latent space, improving cross-modal inference in downstream tasks such as imputation and

translation (Fig. 1b). Information-theoretic approaches are applied to disentangle the biological state and technical noise,

enabling further batch correction (Fig. 1b). Combining these elements into our optimization objective, scalable learning

and inference of MIDAS are simultaneously achieved via Stochastic Gradient Variational Bayes (SGVB) [38],

which also enables large-scale mosaic integration and atlas construction of single-cell multimodal data. For the

robust transfer of knowledge from the constructed atlas to query datasets with various modality combinations, transfer

learning and reciprocal reference mapping schemes were developed for the transfer of model parameters and cell labels,

respectively (Fig. 1c).
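The SGVB estimator underlying this training scheme can be illustrated on a one-dimensional toy model; the numpy sketch below is a didactic stand-in (a toy Gaussian likelihood with hand-picked variational parameters), not the MIDAS objective itself:

```python
import numpy as np

rng = np.random.default_rng(1)

def log_normal(x, mu, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def elbo_estimate(x, enc_mu, enc_logvar, dec):
    """Single-sample SGVB estimate of the ELBO for one observation x.

    enc_mu/enc_logvar parameterize the variational posterior q(z|x);
    dec maps z to the mean of a unit-variance Gaussian likelihood p(x|z).
    """
    # Reparameterization trick: z = mu + sigma * eps keeps the sample
    # differentiable w.r.t. the variational parameters.
    eps = rng.standard_normal()
    z = enc_mu + np.exp(0.5 * enc_logvar) * eps
    log_px_z = log_normal(x, dec(z), 1.0)            # reconstruction term
    log_pz = log_normal(z, 0.0, 1.0)                 # standard-normal prior
    log_qz_x = log_normal(z, enc_mu, np.exp(enc_logvar))
    return log_px_z + log_pz - log_qz_x              # one ELBO sample

# Averaging many single-sample estimates approaches the true ELBO,
# which lower-bounds log p(x).
est = np.mean([elbo_estimate(0.5, 0.4, -1.0, lambda z: z) for _ in range(5000)])
```

In MIDAS the same estimator is applied to the multimodal likelihoods and the two latent variables, with neural networks in place of the identity decoder used here.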

2.2 MIDAS shows superior performance in trimodal integration with complete modalities

To compare MIDAS with state-of-the-art methods, we evaluated its performance in trimodal integration with complete modalities, a simplified form of mosaic integration, as few methods are designed specifically for trimodal mosaic integration. We named this task “rectangular integration”. We used two published single-cell trimodal human PBMC datasets (DOGMA-seq [2] and TEA-seq [3]; Supplementary Table 1) with simultaneous RNA, ADT, and ATAC

Supplementary Table 1) published single-cell trimodal human PBMC datasets with simultaneous RNA, ADT, and ATAC

measurements for each cell to construct dogma-full and teadog-full datasets. Dogma-full took all four batches (LLL_Ctrl,

LLL_Stim, DIG_Ctrl, and DIG_Stim) from the DOGMA-seq dataset, and teadog-full took two batches (W1 and W6)

from the TEA-seq dataset and two batches (LLL_Ctrl and DIG_Stim) from the DOGMA-seq dataset (Supplementary

Table 2). The integration of each dataset requires the handling of batch effects and missing features and preservation


of biological signals, which is challenging, especially for the teadog-full dataset, as the involvement of more datasets

ampliﬁes biological and technical variation.

Uniform manifold approximation and projection (UMAP) [39] visualization showed that the biological states of

different batches were well aligned and that their grouping was consistent with the ground-truth cell types (Fig. 2a left),

and that the technical noise was grouped by batch and exhibited little relevance to cell types (Fig. 2b). Thus, the two

inferred latent variables were disentangled well and independently represented biological and technical variation.

Taking the inferred biological states as low-dimensional representations of the integrated data, we compared the

performance of MIDAS with that of nine strategies derived from recently published methods (Methods) in the removal of

batch effects and preservation of biological signals. UMAP visualization of the integration results showed that MIDAS

effectively removed batch effects while preserving cell type information on both the dogma-full and teadog-full

datasets, whereas the performance of other strategies was not satisfactory. For example, BBKNN+average, MOFA+,

PCA+WNN, Scanorama-embed+WNN, and Scanorama-feat+WNN did not mix different batches well, and PCA+WNN

and Scanorama-feat+WNN produced cell clusters largely inconsistent with cell types (Fig. 2a).

In a quantitative evaluation of the low-dimensional representations of different strategies performed with the

widely used single-cell integration benchmarking (scIB) [40] tool, MIDAS had the highest batch correction, biological

conservation, and overall scores for the dogma-full and teadog-full datasets (Fig. 2c and Supplementary Fig. 1). In

addition, MIDAS preserved cell type-specific patterns in batch-corrected RNA, ADT, and ATAC data (Methods). For

each cell type, fold changes in gene/protein abundance and chromatin accessibility in raw and batch-corrected data

correlated strongly and positively (all r > 0.8; Fig. 2d).
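The fold-change consistency check behind Fig. 2d can be sketched as follows, with simulated matrices standing in for the real raw and batch-corrected data:

```python
import numpy as np

rng = np.random.default_rng(2)

def log_fold_changes(X, is_type, eps=1e-8):
    """Per-feature log2 fold change: cells of one type vs. all other cells."""
    return np.log2((X[is_type].mean(axis=0) + eps) /
                   (X[~is_type].mean(axis=0) + eps))

# Simulated data: 200 cells x 50 features, with one cell type upregulating
# the first 10 features; "corrected" perturbs "raw" slightly, mimicking a
# batch-correction step that preserves cell type-specific signals.
raw = rng.poisson(2.0, size=(200, 50)).astype(float)
is_type = np.zeros(200, dtype=bool)
is_type[:60] = True
raw[is_type, :10] += rng.poisson(4.0, size=(60, 10))
corrected = raw + rng.normal(0, 0.3, size=raw.shape)

fc_raw = log_fold_changes(raw, is_type)
fc_cor = log_fold_changes(corrected, is_type)
r = np.corrcoef(fc_raw, fc_cor)[0, 1]   # Pearson correlation, as in Fig. 2d
```

A strong positive correlation indicates that the correction step preserved the cell type-specific expression structure, which is the criterion applied in the text.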

Manual cell clustering and typing based on the integrated low-dimensional representations and batch-corrected

data from MIDAS led to the identification of 13 PBMC types, including B cells, T cells, dendritic cells (DCs), natural killer (NK) cells, and monocytes (Fig. 2e). We identified a distinct T cell cluster that highly expresses CD4 and CD8 simultaneously, which we labeled as double-positive (DP) CD4+/CD8+ T cells; this phenomenon was also reported in previous studies [41]. Another T cell cluster, containing mucosa-associated invariant T cells and gamma-delta T cells, was distinct from conventional T cells and was labeled as unconventional T cells [42].

Multiple omes are known to regulate biological functions synergistically [1]. By integrating RNA, ADT, and ATAC single-cell data, MIDAS facilitates the discovery of the intrinsic nature of cell activities in a more comprehensive manner. We found that all omics contributed greatly to the identification of cell types and functions.

We systematically screened for expression inconsistencies between proteins and their corresponding genes at the RNA and ADT levels, which are expected to reflect ome irreplaceability. Several markers in each cell type were expressed strongly in one modality and weakly in the other (Fig. 2f, Supplementary Fig. 2). For instance, MS4A1, which encodes a B cell-specific membrane protein, was expressed extremely specifically in B cells, but the CD20 protein encoded by MS4A1 was rarely detected, confirming the irreplaceability of the RNA modality. We also found that ADT could complement RNA-based clustering. For example, the simultaneous expression of T-cell markers (CD3 and CD4) was unexpectedly observed in two subclusters of B cells (B2 and B3) expressing canonical B-cell markers (CD19, CD20, and CD79; Fig. 2g). As this phenomenon could not be replicated using RNA data alone, this finding confirms the irreplaceability of the ADT modality.

At the ATAC level, we investigated the uniqueness of chromatin accessibility in multi-omics integration and found that ATAC contributed more than ADT and RNA to the integration of a subcluster of CD4+ naive T cells (Fig. 2h–j). We took the ratio of the peak number of a cell to that of all cells as the representation of the cell’s accessibility level. RNA and ADT expression did not differ between these cells and their CD4+ naive T-cell siblings, but a surprisingly lower accessibility level was observed at the ATAC layer (<0.02, Supplementary Fig. 3). Gene Ontology enrichment analysis [43] indicated that the inaccessible regions are related to T-cell activation, cell adhesion, and other immune functions. We therefore define this cluster as low chromatin accessibility (LCA) naive CD4+ T cells. Although this discovery needs further verification, it demonstrates the remarkable multi-omics integration capability of MIDAS.
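The accessibility-level computation described above can be sketched as follows; we interpret “the peak number of all cells” as the size of the full measured peak set, the data are simulated, and only the 0.02 threshold comes from the text:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated binarized cell-by-peak ATAC matrix: 300 cells x 5000 peaks.
# Most cells detect ~5% of peaks; a low-accessibility subset detects ~1%.
n_cells, n_peaks = 300, 5000
peaks = rng.binomial(1, 0.05, size=(n_cells, n_peaks))
peaks[:40] = rng.binomial(1, 0.01, size=(40, n_peaks))  # LCA-like cells

# Accessibility level: fraction of the measured peak set detected per cell.
acc_level = peaks.sum(axis=1) / n_peaks

# LCA call with the threshold stated in the text.
is_lca = acc_level < 0.02
```

On real data the bimodal distribution of `acc_level` (as reported for the atlas in Section 2.4) is what justifies a hard threshold of this kind.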

2.3 MIDAS enables reliable trimodal mosaic integration

At present, trimodal sequencing techniques are still immature. Most existing datasets are unimodal or bimodal with various modality combinations. MIDAS is designed to integrate these diverse multimodal datasets, i.e., mosaic datasets. To evaluate the performance of MIDAS on mosaic integration, we further constructed 14 incomplete

datasets based on the previously generated rectangular datasets including dogma-full and teadog-full datasets (Methods,

Supplementary Table 2). Each mosaic dataset was generated by removing several modality-batch blocks from the

full-modality dataset. We then took the rectangular integration results as the baseline and examined whether MIDAS could obtain comparable results on mosaic integration tasks. We assessed MIDAS’s ability in batch correction, modality


alignment, and biological conservation. Here we also focused on modality alignment because it guarantees accurate

cross-modal inference for processes such as downstream imputation and knowledge transfer. For qualitative evaluation, we used UMAP to visualize the biological states and technical noises inferred from the individual and joint input modalities (Fig. 3a, b, Supplementary Fig. 4, 5). Taking the dogma-paired-abc dataset as an example, for each modality, the biological states were consistently distributed across different batches (Fig. 3a) whereas the technical noises were grouped by batch (Fig. 3b), indicating that batch effects were well disentangled from the biological states. Similarly, the distributions of biological states and technical noises within batches were very similar across modalities (Fig. 3a, b), suggesting that MIDAS internally aligns different modalities in latent space. Moreover, the biological states of each cell type were grouped together and the cell type silhouettes were consistent across batches and modality combinations (Fig. 3a), reflecting robust conservation of the biological signals after mosaic integration.

To quantitatively evaluate MIDAS on mosaic integration, we proposed single-cell mosaic integration benchmarking (scMIB). scMIB extends scIB with modality alignment metrics and defines each type of metric in both embedding (latent) space and feature (observation) space, resulting in 20 metrics in total (Methods, Supplementary Table 3).

Table 3). The obtained batch correction, modality alignment, biological conservation, and overall scores for paired+full,

paired-abc, paired-ab, paired-ac, paired-bc, and diagonal+full tasks performed with the dogma and teadog datasets

were similar to those obtained with rectangular integration (Fig. 3c, Supplementary Fig. 6a). MIDAS showed moderate

performance in the dogma- and teadog-diagonal tasks, likely due to the lack of cell-to-cell correspondence across modalities in these tasks, which can be remedied via knowledge transfer (see Section 2.5).

scIB benchmarking showed that MIDAS, when given incomplete datasets (paired+full, paired-abc, paired-ab,

paired-ac, and paired-bc for dogma and teadog), outperformed methods that rely on the full-modality datasets (dogma-

and teadog-full; Supplementary Fig. 6b, c). Even with the severely incomplete dogma- and teadog-diagonal+full

datasets, the performance of MIDAS surpassed that of most of the other methods.

We also compared MIDAS against scVAEIT, scMoMaT, and StabMap (Methods), which can handle mosaic datasets. UMAP visualization of the low-dimensional cell embeddings showed that MIDAS removed batch effects and preserved biological signals well on various tasks, while the other three methods did not integrate trimodal data well, especially when modalities were missing (Fig. 3d, e, Supplementary Fig. 7). Specifically, MIDAS aligned the cells of different batches well and grouped them consistently with the ground-truth cell types, while the other methods did not mix different batches well and produced cell clusters largely inconsistent with cell types. scIB benchmarking showed that MIDAS had stable performance on different mosaic tasks, and its overall scores were much higher than those of the other methods (Fig. 3f, g, Supplementary Fig. 8).

We identified each cell’s nearest neighbors from the individual dimensionality reduction results and compared neighborhood overlap among tasks; this overlap exceeded 0.75 for most tasks, except dogma-diagonal, when the number of neighbors reached 10,000 (Fig. 3h). As imputed omics data have been reported to deteriorate the accuracy of gene regulatory inference in many cases [44], we evaluated the consistency of downstream analysis results across the different mosaic integration tasks on the dogma datasets. We validated

the conservation of gene regulatory networks in the imputed data. In the dogma-paired+full task, for example, the

regulatory network predicted from imputed data was consistent with that predicted from the ground-truth dogma-full

data (Fig. 3i). These results indicate that the modality inference performed by MIDAS is reliable.
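The neighborhood-overlap comparison can be sketched as follows, on simulated embeddings (the real analysis used the task-specific dimensionality reduction results and up to 10,000 neighbors):

```python
import numpy as np

rng = np.random.default_rng(4)

def knn_indices(X, k):
    """Indices of each cell's k nearest neighbors (Euclidean, self excluded)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    return np.argsort(d2, axis=1)[:, :k]

def neighborhood_overlap(X_a, X_b, k):
    """Mean fraction of shared k-NN between two embeddings of the same cells."""
    nn_a, nn_b = knn_indices(X_a, k), knn_indices(X_b, k)
    shared = [len(set(a) & set(b)) / k for a, b in zip(nn_a, nn_b)]
    return float(np.mean(shared))

# Two embeddings of the same 200 cells: the second is a rotated, slightly
# noisy copy of the first, mimicking results from two integration tasks.
X = rng.standard_normal((200, 8))
Q, _ = np.linalg.qr(rng.standard_normal((8, 8)))   # random orthogonal map
X_other = X @ Q + rng.normal(0, 0.05, size=X.shape)

ov = neighborhood_overlap(X, X_other, k=20)
```

Because the orthogonal map preserves distances, the overlap stays high; two genuinely discordant integration results would push it toward the chance level k/(n-1).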

We manually annotated cell types for the mosaic integration tasks and computed confusion matrices and micro-F1 scores with the dogma-full cell typing results serving as the ground truth. The cell type labels generated from the incomplete datasets, except dogma-diagonal, were largely consistent with the ground truth, with all micro-F1 scores exceeding 0.885 (Fig. 3j, Supplementary Fig. 9). The separation of monocytes and DCs was difficult in some mosaic experiments, mainly because the latter originate from the former [45] and likely also because the monocyte population in the dogma dataset was small.
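For single-label cell typing, the micro-F1 score pools true positives, false positives, and false negatives over all classes, which reduces to overall accuracy; a minimal sketch with made-up labels:

```python
import numpy as np

def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pools TP/FP/FN over classes. For single-label
    multiclass predictions this equals plain accuracy, because every
    misclassification counts once as a FP and once as a FN."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = (y_true == y_pred).sum()
    fp = fn = (y_true != y_pred).sum()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy comparison of mosaic-task labels against full-data "ground truth",
# with one monocyte/DC confusion of the kind discussed above.
truth = ["B", "T", "T", "NK", "Mono", "DC", "T", "B"]
mosaic = ["B", "T", "T", "NK", "DC", "DC", "T", "B"]
score = micro_f1(truth, mosaic)   # 7 of 8 correct
```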

2.4 MIDAS enables the atlas-level mosaic integration of trimodal PBMC data

We used MIDAS for the large-scale mosaic integration of 18 PBMC batches from bimodal sequencing platforms (e.g., 10X Multiome, ASAP-seq, and CITE-seq) and 9 batches from the DOGMA-seq and TEA-seq trimodal datasets (in total, 27 batches from 10 platforms comprising 185,518 cells; Methods, Supplementary Tables 1 and 4). Similar to the results

obtained with the dogma-full and teadog-full datasets, MIDAS achieved satisfactory batch removal and biological

conservation. UMAP visualization showed that the inferred biological states of different batches maintained a

consistent PBMC population structure and conserved batch-specific (due mainly to differences in experimental design)

biological information (Fig. 4a, Supplementary Fig. 10a). In addition, the technical noise was clearly grouped by

batch (Supplementary Fig. 10b). These results suggest that the biological states and technical noises were disentangled

well and the data could be used reliably in downstream analysis.


The manual labeling of cell types according to cluster markers achieved clearer separation and annotation than did automatic labeling by Seurat [14] (Fig. 4b). Consistent with the rectangular integration results (Fig. 2e), we identified all

cell types known to be in the atlas, including B cells, conventional T-cell subsets, DP T cells, NK cells, unconventional

T cells, and hematopoietic stem cells (HSCs), demonstrating the robustness of MIDAS. Remarkably, the integration

of more datasets with MIDAS led to the identification of rare clusters and high-resolution cell typing. For example, whereas platelets could not be easily identified by dogma-full rectangular integration due to their extremely limited number (Fig. 2e), platelets from the DOGMA-seq dataset aggregated into a much larger cluster with recognizable platelet markers in the PBMC atlas (Fig. 4c). In addition, the atlas contained more monocyte subclusters, including CD14+, CD16+, and CD3+CD14+ monocytes, than obtained with rectangular integration (Fig. 4d). Other cell types present in more subclusters in the atlas included CD158e1+ NK cells, CD4+CD138+CD202b+ T cells, and RTKN2+CD8+ T cells (Supplementary Fig. 11a).

Most batches in the atlas contained considerable numbers of LCA cells (Fig. 4e, Supplementary Fig. 11b) with <0.02 accessibility levels, as did the DOGMA-seq dataset (Fig. 2i). The chromatin accessibility levels of cells in the atlas showed an obvious bimodal distribution, reflecting the existence of two ATAC patterns (Supplementary Fig. 11b). CD8+ T-cell, CD14+ monocyte, NK cell, B cell, and other clusters contained LCA cells (Fig. 4b, e), implying that LCA is common in various cell types.

2.5 MIDAS enables flexible and accurate knowledge transfer across modalities and datasets

To investigate the knowledge transfer capability of MIDAS, we re-partitioned the atlas dataset into reference (for atlas

construction) and query (knowledge transfer target) datasets (Supplementary Table 4). By removing DOGMA-seq

from the atlas, we obtained a reference dataset named atlas-no_dogma. To test the flexibility of knowledge transfer, we used DOGMA-seq to construct 14 query datasets: 1 rectangular integration and 7 incomplete mosaic datasets generated previously, and 6 rectangular integration datasets with fewer modalities (Methods, Supplementary Table 5). In consideration of real applications, we defined model and label knowledge transfer scenarios (Methods). In the model

transfer scenario, knowledge was transferred implicitly through model parameters via transfer learning. In the label

transfer scenario, knowledge was transferred explicitly through cell labels via reference mapping.

We assessed the performance of MIDAS in the model transfer scenario. For the transfer learned models, we used

UMAP to visualize the inferred biological states and technical noises and scMIB and scIB for integration benchmarking,

and compared the results of different tasks with those generated by de novo trained models. Transfer learning greatly

improved performance on the dogma-diagonal, dogma-atac, dogma-rna, and dogma-paired-a tasks, with performance

levels on the other tasks maintained (Fig. 5a–c, Supplementary Fig. 12, 13). For example, the de novo trained model

failed to integrate well in the dogma-diagonal task due to the lack of cell-to-cell correspondence across modalities (Fig. 5a),

whereas the transfer learned model with atlas knowledge successfully aligned the biological states across batches and

modalities and formed groups consistent with cell types (Fig. 5b). The results obtained by transfer learned models with

all 14 datasets were not only comparable (Supplementary Fig. 13a), but also better than those obtained with many other

methods, even with the complete dataset (Fig. 5c, Supplementary Fig. 13b).

To assess the performance of MIDAS in the label transfer scenario, we compared the widely used query-to-reference mapping [46, 47], reference-to-query mapping [13, 48], and our proposed reciprocal reference mapping (Methods).

With each strategy, we aligned each query dataset to the reference dataset and transferred cell type labels through

k-nearest neighbors, with the ground-truth cell type labels taken from the trimodal PBMC atlas. Visualization of the

mapped biological states showed that reciprocal reference mapping with different query datasets yielded consistent

results, with strong agreement with the atlas integration results obtained with the dogma-full dataset (Fig. 5d). Micro

F1-scores indicated that reciprocal reference mapping outperformed the query-to-reference and reference-to-query

mapping strategies for various forms of query data, achieving robust and accurate label transfer and thereby avoiding

the need for de novo downstream analysis (Fig. 5e).
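The k-nearest-neighbor label transfer step shared by all three mapping strategies can be sketched as a majority vote in the aligned latent space (simulated embeddings and labels, not MIDAS outputs):

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(5)

def transfer_labels(ref_emb, ref_labels, query_emb, k=15):
    """Assign each query cell the majority label of its k nearest
    reference cells in the shared embedding space."""
    out = []
    for q in query_emb:
        d2 = ((ref_emb - q) ** 2).sum(axis=1)
        nn = np.argsort(d2)[:k]
        out.append(Counter(ref_labels[nn]).most_common(1)[0][0])
    return np.array(out)

# Reference: two well-separated cell-type clusters in latent space.
ref_emb = np.vstack([rng.normal(0, 0.5, (100, 4)),
                     rng.normal(5, 0.5, (100, 4))])
ref_labels = np.array(["T"] * 100 + ["B"] * 100)

# Query cells mapped near the second cluster should receive its label.
query = rng.normal(5, 0.5, (20, 4))
pred = transfer_labels(ref_emb, ref_labels, query)
```

The quality of this vote depends entirely on how well the mapping aligns query and reference embeddings, which is why the three mapping strategies above yield different micro-F1 scores.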

Thus, MIDAS can be used to transfer atlas-level knowledge to various forms of users’ datasets without expensive de

novo training or complex downstream analysis.

3 Discussion

By modeling the single-cell mosaic data generative process, MIDAS can precisely disentangle biological states and

technical noises from the input and robustly align modalities to support multi-source and heterogeneous integration

analysis. It provides accurate and robust results and outperforms other methods when performing various mosaic

integration tasks. It also powerfully and reliably integrates large datasets, as demonstrated with the atlas-scale integration

of publicly available PBMC multi-omics datasets. Moreover, MIDAS efficiently and flexibly transfers knowledge from


reference to query datasets, enabling convenient handling of new multi-omics data. With superior performance in

dimensionality reduction and batch correction, MIDAS supports accurate downstream biological analysis.

To our knowledge, MIDAS is the first model that supports simultaneous dimensionality reduction, modality complementing, and batch correction in single-cell trimodal mosaic integration. MIDAS accurately integrates mosaic data with missing modalities, achieving results comparable to the ground truth (rectangular integration) and superior to those obtained from other methods. These distinct advantages of MIDAS derive from the deep generative modeling, product-of-experts, information-theoretic disentanglement, and self-supervised modality alignment components of the algorithm, which are specifically designed and inherently competent for the heterogeneous integration of data with missing features and modalities. In addition, MIDAS is the first model that allows knowledge transfer across mosaic data modalities and batches in a highly flexible and reliable manner, enabling researchers to conquer the vast bodies of data produced with continuously emerging multi-omics techniques.

GLUE and uniPort, designed for the integration of unpaired single-cell trimodal data, align the global distributions

of different modalities with the use of prior information to improve cell-wise correspondence. They may face challenges

when little prior information about the inter-modality (e.g., ATAC vs. ADT) correspondence of single cells is available.

In addition, they are not designed to utilize paired data to enhance modality alignment. With the widespread adoption

and rapid development of scMulti-omics sequencing technologies, paired data are rapidly becoming more common

and will soon be ubiquitous. By leveraging paired cell data, MIDAS can learn to better align different modalities

in a self-supervised manner and can thus play a more important role than other available methods in the blooming

scMulti-omics era.

Most recently, scVAEIT, scMoMaT, and StabMap were proposed for mosaic integration. However, they lack the functionalities of modality alignment and batch correction. StabMap also requires paired datasets to be manually selected as reference sets in advance, which can bias results. A general mosaic integration method should allow input of diverse mosaic combinations and support modality alignment and batch correction in both embedding space and feature space [29]. These functions are all essential and urgently needed in real-world scenarios. Among scVAEIT, scMoMaT, StabMap, and MIDAS, only MIDAS tackles the problem of general mosaic integration (Supplementary Table 6).

Recently proposed methods for scalable reference-to-query knowledge transfer for single-cell multimodal data have issues with generalization to unseen query data [46, 47] or the retention of information learned on the reference data [13, 48], which make the alignment of reference and query datasets difficult. In addition, they support limited numbers of specific modalities. The MIDAS knowledge transfer scheme stands out from these methods because it

supports various types of mosaic query dataset and enables model transfer for sample-efﬁcient mosaic integration and

label transfer for automatic cell typing. Moreover, the generalization and retention problems are mitigated through a

novel reciprocal reference mapping scheme.

We envision two major developmental directions for MIDAS. At present, MIDAS integrates only three modalities. By fine-tuning the model architecture, we can achieve the integration of four or more modalities, overcoming the limitations of existing scMulti-omics sequencing technologies. In addition, the continuous incorporation of rapidly increasing bodies of newly generated scMulti-omics data is needed to update the model and improve the quality of the atlas. This process requires frequent model retraining, which is computationally expensive and time consuming. Thus, employing incremental learning [49] is a natural next step toward continuous mosaic integration without model retraining.

4 Methods

4.1 Deep generative modeling of mosaic single-cell multimodal data

For Cell $n \in \mathcal{N} = \{1, \ldots, N\}$ with batch ID $s_n \in \mathcal{S} = \{1, \ldots, S\}$, let $x_n^m \in \mathbb{N}^{D_n^m}$ be the count vector of size $D_n^m$ from Modality $m$, and $x_n = \{x_n^m\}_{m \in \mathcal{M}_n}$ the set of count vectors from the measured modalities $\mathcal{M}_n \subseteq \mathcal{M} = \{\text{ATAC}, \text{RNA}, \text{ADT}\}$. We define two modality-agnostic low-dimensional latent variables $c \in \mathbb{R}^{D_c}$ and $u \in \mathbb{R}^{D_u}$ to represent each cell's biological state and technical noise, respectively. To model the generative process of the observed variables $x$ and $s$ for each cell, we factorize the joint distribution of all variables as below:

$$p(x, s, c, u) = p(c)\,p(u)\,p(s|u)\,p(x|c, u) = p(c)\,p(u)\,p(s|u) \prod_{m \in \mathcal{M}_n} p(x^m|c, u) \tag{1}$$


where we assume that $c$ and $u$ are independent of each other and that the batch ID $s$ only depends on $u$, in order to facilitate the disentanglement of the two latent variables, and that the count variables $\{x^m\}_{m \in \mathcal{M}_n}$ from different modalities are conditionally independent given $c$ and $u$.

Based on the above factorization, we define a generative model for $x$ and $s$ as follows:

$$p(c) = \text{Normal}(c \,|\, 0, I) \tag{2}$$
$$p(u) = \text{Normal}(u \,|\, 0, I) \tag{3}$$
$$\pi = g_s(u; \theta_s) \tag{4}$$
$$p_\theta(s|u) = \text{Categorical}(s \,|\, \pi) \tag{5}$$
$$\lambda^m = g_m(c, u; \theta_m) \quad \text{for } m \in \mathcal{M}_n \tag{6}$$
$$p_\theta(x^m|c, u) = \begin{cases} \text{Bernoulli}(x^m \,|\, \lambda^m) & \text{if } m = \text{ATAC} \\ \text{Poisson}(x^m \,|\, \lambda^m) & \text{if } m \in \{\text{RNA}, \text{ADT}\} \end{cases} \quad \text{for } m \in \mathcal{M}_n \tag{7}$$

where the priors $p(c)$ and $p(u)$ are set as standard normals. The likelihood $p_\theta(s|u)$ is set as a categorical distribution with probability vector $\pi \in \Delta^{S-1}$ generated through a batch-ID decoder $g_s$, which is a neural network with learnable parameters $\theta_s$. The likelihood $p_\theta(x^m|c, u)$ is set as a Bernoulli distribution with mean $\lambda^m \in [0, 1]^{D_n^m}$ when $m = \text{ATAC}$, and as a Poisson distribution with mean $\lambda^m \in \mathbb{R}_+^{D_n^m}$ when $m \in \{\text{RNA}, \text{ADT}\}$, where $\lambda^m$ is generated through a modality decoder neural network $g_m$ parameterized by $\theta_m$. To mitigate overfitting and improve generalization, we share the parameters of the first few layers of the different modality decoders $\{g_m\}_{m \in \mathcal{M}}$ (the gray parts of the decoders in Fig. 1b, middle). The corresponding graphical model is shown in Fig. 1b (left).
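The decoder structure of equations (6)-(7) can be sketched in a few lines of framework-agnostic numpy. This is a minimal illustration, not the authors' implementation: the layer sizes, the single shared layer, and the sigmoid/softplus link functions are assumptions chosen only to show how one trunk can emit a Bernoulli mean for ATAC and Poisson rates for RNA/ADT.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not the paper's settings).
D_c, D_u, D_hid = 32, 2, 64
D_out = {"atac": 100, "rna": 200, "adt": 50}

# One shared first layer across modality decoders (the gray parts in Fig. 1b),
# followed by modality-specific output heads.
w_shared = rng.normal(0, 0.1, (D_c + D_u, D_hid))
b_shared = np.zeros(D_hid)
heads = {m: (rng.normal(0, 0.1, (D_hid, d)), np.zeros(d)) for m, d in D_out.items()}

def decode(c, u):
    # Shared trunk on the concatenated latents, ReLU nonlinearity.
    h = np.maximum(np.concatenate([c, u]) @ w_shared + b_shared, 0.0)
    lam = {}
    for m, (w, b) in heads.items():
        z = h @ w + b
        # ATAC: Bernoulli mean in (0, 1) via sigmoid; RNA/ADT: Poisson rate > 0 via softplus.
        lam[m] = 1.0 / (1.0 + np.exp(-z)) if m == "atac" else np.log1p(np.exp(z))
    return lam

lam = decode(rng.normal(size=D_c), rng.normal(size=D_u))
```

The link functions guarantee that each head's output lies in the support required by its likelihood in equation (7).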

Given the observed data $\{x_n, s_n\}_{n \in \mathcal{N}}$, we aim to fit the model parameters $\theta = \{\theta_s, \{\theta_m\}_{m \in \mathcal{M}}\}$ and meanwhile infer the posteriors of the latent variables $\{c, u\}$ for each cell. This can be achieved using SGVB [38], which maximizes the expected Evidence Lower Bound (ELBO) for individual datapoints. The ELBO for each individual datapoint $\{x_n, s_n\}$ can be written as:

$$\begin{aligned}
\text{ELBO}(\theta, \phi; x_n, s_n) &\triangleq \mathbb{E}_{q_\phi(c,u|x_n,s_n)}\left[\log \frac{p_\theta(x_n, s_n, c, u)}{q_\phi(c, u|x_n, s_n)}\right] \\
&= \mathbb{E}_{q_\phi(c,u|x_n,s_n)}[\log p_\theta(x_n, s_n|c, u)] - \text{KL}\left(q_\phi(c, u|x_n, s_n) \,\|\, p(c, u)\right) \\
&= \mathbb{E}_{q_\phi(c,u|x_n,s_n)}\Big[\log p_\theta(s_n|u) + \sum_{m \in \mathcal{M}_n} \log p_\theta(x_n^m|c, u)\Big] - \text{KL}\left(q_\phi(c, u|x_n, s_n) \,\|\, p(c, u)\right)
\end{aligned} \tag{8}$$

where $q_\phi(c, u|x_n, s_n)$, with learnable parameters $\phi$, is the variational approximation of the true posterior $p(c, u|x_n, s_n)$ and is typically implemented by neural networks.
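The single-datapoint ELBO of equation (8) can be estimated by the standard recipe: sample $z = (c, u)$ from the diagonal-Gaussian posterior via reparameterization, average the reconstruction log-likelihood, and subtract the analytic KL to the standard normal prior. The sketch below uses a toy Poisson likelihood with placeholder weights and drops the log-factorial constants; it illustrates the estimator, not the MIDAS codebase.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_diag_gaussian_to_std_normal(mu, var):
    # Analytic KL( N(mu, diag(var)) || N(0, I) ), the KL term of equation (8).
    return 0.5 * np.sum(var + mu**2 - 1.0 - np.log(var))

def elbo_estimate(x, mu, var, log_lik, n_samples=32):
    # Monte-Carlo ELBO for one datapoint: reparameterized samples z = mu + sqrt(var)*eps,
    # averaged reconstruction term minus the analytic KL.
    eps = rng.normal(size=(n_samples, mu.size))
    z = mu + np.sqrt(var) * eps
    recon = np.mean([log_lik(x, z_i) for z_i in z])
    return recon - kl_diag_gaussian_to_std_normal(mu, var)

# Toy Poisson likelihood: rates decoded as softplus of a fixed linear map
# (placeholder weights; constant log-factorial terms omitted).
W = rng.normal(0, 0.1, (4, 6))
def log_lik(x, z):
    rate = np.log1p(np.exp(z @ W))
    return np.sum(x * np.log(rate) - rate)

x = rng.poisson(1.0, size=6).astype(float)
value = elbo_estimate(x, mu=np.zeros(4), var=np.ones(4), log_lik=log_lik)
```

When the posterior equals the prior, the KL term vanishes and only the reconstruction term remains.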

4.2 Scalable variational inference via the Product of Experts

Let $M = |\mathcal{M}|$ be the total number of modalities. Since there are $(2^M - 1)$ possible modality combinations for the count data $x_n = \{x_n^m\}_{m \in \mathcal{M}_n \subseteq \mathcal{M}}$, naively implementing $q_\phi(c, u|x_n, s_n)$ in equation (8) requires $(2^M - 1)$ different neural networks to handle the different cases of input $(x_n, s_n)$, making inference unscalable. Let $z = \{c, u\}$. Inspired by [50], which utilizes the Product of Experts (PoE) to implement variational inference in a combinatorial way, we factorize the posterior $p(z|x_n, s_n)$ and define its variational approximation $q_\phi(z|x_n, s_n)$ as follows:

$$\begin{aligned}
p(z|x_n, s_n) &= \frac{p(z)\,p(s_n|z)\,p(x_n|z)}{p(x_n, s_n)} \\
&= \frac{p(z)}{p(x_n, s_n)}\,p(s_n|z) \prod_{m \in \mathcal{M}_n} p(x_n^m|z) \\
&= \frac{p(z)}{p(x_n, s_n)} \cdot \frac{p(s_n)\,p(z|s_n)}{p(z)} \prod_{m \in \mathcal{M}_n} \frac{p(x_n^m)\,p(z|x_n^m)}{p(z)} \\
&= \frac{p(s_n)}{p(x_n, s_n)} \Big(\prod_{m \in \mathcal{M}_n} p(x_n^m)\Big)\, p(z)\, \frac{p(z|s_n)}{p(z)} \prod_{m \in \mathcal{M}_n} \frac{p(z|x_n^m)}{p(z)} \\
&\approx \frac{p(s_n)}{p(x_n, s_n)} \Big(\prod_{m \in \mathcal{M}_n} p(x_n^m)\Big)\, p(z)\, \frac{q_\phi(z|s_n)}{p(z)} \prod_{m \in \mathcal{M}_n} \frac{q_\phi(z|x_n^m)}{p(z)} \\
&\triangleq q_\phi(z|x_n, s_n)
\end{aligned} \tag{9}$$

where $q_\phi(z|s_n)$ and $q_\phi(z|x_n^m)$ are the variational approximations of the true posteriors $p(z|s_n)$ and $p(z|x_n^m)$, respectively. From equation (9) we further get:

$$q_\phi(z|x_n, s_n) \propto p(z) \underbrace{\frac{q_\phi(z|s_n)}{p(z)}}_{\triangleq\, C_s \cdot \tilde{q}_\phi(z|s_n)} \prod_{m \in \mathcal{M}_n} \underbrace{\frac{q_\phi(z|x_n^m)}{p(z)}}_{\triangleq\, C_m \cdot \tilde{q}_\phi(z|x_n^m)} \propto p(z)\, \tilde{q}_\phi(z|s_n) \prod_{m \in \mathcal{M}_n} \tilde{q}_\phi(z|x_n^m) \tag{10}$$

where we set $q_\phi(z|s_n)$ and $q_\phi(z|x_n^m)$ to be Gaussians. We define $\tilde{q}_\phi(z|s_n) = \frac{1}{C_s} \cdot \frac{q_\phi(z|s_n)}{p(z)}$ and $\tilde{q}_\phi(z|x_n^m) = \frac{1}{C_m} \cdot \frac{q_\phi(z|x_n^m)}{p(z)}$ to be the normalized quotients of Gaussians with normalizing constants $C_s$ and $C_m$, respectively. Since the quotient of Gaussians is an unnormalized Gaussian, $\tilde{q}_\phi(z|s_n)$ and $\tilde{q}_\phi(z|x_n^m)$ are also Gaussians. Thus, in equation (10) the variational posterior $q_\phi(z|x_n, s_n)$ is proportional to a product of individual Gaussians (or "experts"), indicating that it is itself a Gaussian with mean $\mu$ and covariance $\Lambda$:

$$\mu = \Big(\sum_i \mu_i \Lambda_i^{-1}\Big)\Big(\sum_i \Lambda_i^{-1}\Big)^{-1}, \qquad \Lambda = \Big(\sum_i \Lambda_i^{-1}\Big)^{-1} \tag{11}$$

where $\mu_i$ and $\Lambda_i$ are respectively the mean and covariance of the $i$-th individual Gaussian. We further assume $\tilde{q}_\phi(z|x_n^m)$ and $\tilde{q}_\phi(z|s_n)$ are isotropic Gaussians and define them as follows:

$$(\mu_m, \nu_m) = f_m(x_n^m; \phi_m) \tag{12}$$
$$\tilde{q}_\phi(z|x_n^m) = \text{Normal}(z \,|\, \mu_m, \nu_m I) \tag{13}$$
$$(\mu_s, \nu_s) = f_s(s_n; \phi_s) \tag{14}$$
$$\tilde{q}_\phi(z|s_n) = \text{Normal}(z \,|\, \mu_s, \nu_s I) \tag{15}$$

where $f_m$ is the modality encoder parameterized by $\phi_m$, and $f_s$ the batch-ID encoder parameterized by $\phi_s$, both of which are neural networks that generate Gaussian mean-variance pairs. In doing so, $q_\phi(z|x_n, s_n)$ is modularized into $(M + 1)$ neural networks that handle $(2^M - 1)$ different modality combinations, increasing the model's scalability. Similar to the modality decoders, we also share the parameters of the last few layers of the different modality encoders $\{f_m\}_{m \in \mathcal{M}}$ (the gray parts of the encoders in Fig. 1b, middle) to improve generalization.
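For diagonal (isotropic) experts, the PoE combination of equation (11) reduces to adding precisions and taking a precision-weighted mean. The sketch below also folds the standard normal prior $p(z)$ into the product as one extra expert with mean 0 and variance 1, as equation (10) suggests; it is a direct transcription of the formula, not the MIDAS code.

```python
import numpy as np

def poe_fuse(mus, variances):
    """Product-of-Experts fusion of diagonal Gaussian experts, equation (11):
    precisions add, and the fused mean is the precision-weighted average.
    The N(0, I) prior of equation (10) enters as one extra expert."""
    mus = np.asarray(mus, dtype=float)
    precisions = 1.0 / np.asarray(variances, dtype=float)
    total_precision = precisions.sum(axis=0) + 1.0  # +1 for the N(0, I) prior
    fused_var = 1.0 / total_precision
    fused_mu = (mus * precisions).sum(axis=0) * fused_var
    return fused_mu, fused_var

# Two one-dimensional experts of equal confidence: the fused mean is their
# average, shrunk toward 0 by the prior expert.
mu, var = poe_fuse([[1.0], [3.0]], [[1.0], [1.0]])
```

Any subset of modality experts (plus the batch-ID expert) can be fused with the same function, which is exactly what makes the $(M+1)$-network scheme cover all $(2^M - 1)$ input combinations.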

4.3 Handling missing features via padding and masking

For each modality, as different batches can have different feature sets (e.g., genes for the RNA modality), it is hard to use a fixed-size neural network to handle these batches. To remedy this, we first convert $x_n^m$ of variable size into a fixed-size vector for inference. For Modality $m$, let $\mathcal{F}_s^m$ be the features of batch $s$, and $\mathcal{F}^m = \bigcup_{s \in \mathcal{S}} \mathcal{F}_s^m$ the feature union of all batches. The missing features of batch $s$ can then be defined as $\bar{\mathcal{F}}_s^m = \mathcal{F}^m \setminus \mathcal{F}_s^m$. We pad $x_n^m$ of size $D_n^m$ with zeros corresponding to its missing features $\bar{\mathcal{F}}_{s_n}^m$ through a zero-padding function $h$:

$$\tilde{x}_n^m = h(x_n^m) \tag{16}$$

where $\tilde{x}_n^m$ is the zero-padded count vector of constant size $D^m = |\mathcal{F}^m|$. The modality encoding process is thus decomposed as:

$$(\mu_m, \nu_m) = f_m(x_n^m; \phi_m) = \hat{f}_m(h(x_n^m); \phi_m) = \hat{f}_m(\tilde{x}_n^m; \phi_m) \tag{17}$$

. On the other hand, to calculate the

likelihood

pθ(xm

n|cn,un)

we also need to generate a mean

λm

n

of variable size for

xm

n

. To achieve this, we decompose

the modality decoding process as follows:

λm

n=gm(cn,un;θm)

=h−1(bgm(cn,un;θm))

=h−1e

λm

n(18)

where

bgm

is the front part of the modality decoder to generate the mean

e

λm

n

of ﬁxed size

Dm

, and

h−1

, the inverse

function of

h

, is the mask function to remove the padded missing features

Fm

s

from

e

λm

n

to generate

λm

n

. Note that

e

λm

n

can also be taken as the imputed values for downstream analyses (Methods 4.8 and 4.9).
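The pair $h$ / $h^{-1}$ of equations (16) and (18) is just an index-based scatter and gather. A minimal sketch with hypothetical feature names (the gene identifiers below are placeholders, not from the paper):

```python
import numpy as np

# Hypothetical feature sets: the union F^m over all batches fixes the
# encoder/decoder size D^m, while batch s only measures a subset F^m_s.
union_features = ["g1", "g2", "g3", "g4", "g5"]
batch_features = ["g2", "g4", "g5"]

idx = np.array([union_features.index(f) for f in batch_features])

def h(x_m):
    # Zero-padding function of equation (16): scatter the observed counts
    # into the fixed-size union vector, leaving zeros at missing features.
    x_pad = np.zeros(len(union_features))
    x_pad[idx] = x_m
    return x_pad

def h_inv(lam_pad):
    # Masking function of equation (18): drop the padded missing features
    # to recover a variable-size vector matching x^m_n.
    return lam_pad[idx]

x = np.array([5.0, 1.0, 2.0])
x_padded = h(x)
```

Gather after scatter recovers the original counts, so the two functions are inverses on the measured features, as the notation $h^{-1}$ implies.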

4.4 Self-supervised modality alignment

To achieve cross-modal inference in downstream tasks, we resort to aligning the different modalities in the latent space. Leveraging self-supervised learning, we first use each cell's multimodal observation $\{\{x_n^m\}_{m \in \mathcal{M}_n}, s_n\}$ to construct unimodal observations $\{x_n^m, s_n\}_{m \in \mathcal{M}_n}$, each of which is associated with the latent variables $z^m = \{c^m, u^m\}$. Then, we construct a pretext task, which enforces modality alignment by regularizing on the joint space of unimodal variational posteriors with the dispersion of latent variables as a penalty (Fig. 1, upper right), corresponding to a modality alignment loss:

$$\ell_{\text{mod}}(\phi; x_n, s_n) \triangleq \int v(\tilde{z})\, q_\phi(\tilde{z}|x_n, s_n)\, d\tilde{z} = \mathbb{E}_{q_\phi(\tilde{z}|x_n, s_n)}\left[v(\tilde{z})\right] \tag{19}$$

where $\tilde{z} = \{z^m\}_{m \in \mathcal{M}_n}$ is the set of latent variables, and $q_\phi(\tilde{z}|x_n, s_n)$ represents the joint distribution of unimodal variational posteriors since:

$$q_\phi(\tilde{z}|x_n, s_n) = q_\phi(\{z^m\}_{m \in \mathcal{M}_n}|x_n, s_n) = \prod_{m \in \mathcal{M}_n} q_\phi(z^m|x_n, s_n) = \prod_{m \in \mathcal{M}_n} q_\phi(z^m|x_n^m, s_n) \tag{20}$$

In equation (19), $v(\tilde{z})$ is the Mean Absolute Deviation, which measures the dispersion among the different elements of $\tilde{z}$ and is used to regularize $q_\phi(\tilde{z}|x_n, s_n)$. It is defined as:

$$v(\tilde{z}) \triangleq \frac{1}{|\mathcal{M}_n|} \sum_{m \in \mathcal{M}_n} \|z^m - \bar{z}\|_2 \tag{21}$$

where $\bar{z} = \frac{1}{|\mathcal{M}_n|} \sum_{m \in \mathcal{M}_n} z^m$ is the mean and $\|\cdot\|_2$ the Euclidean distance.

Note that the computation of $q_\phi(z^m|x_n^m, s_n)$ in equation (20) is efficient. Since $q_\phi(z^m|x_n^m, s_n) = q_\phi(z|x_n^m, s_n)|_{z = z^m}$, according to equation (10) we have:

$$q_\phi(z|x_n^m, s_n) \propto p(z)\, \tilde{q}_\phi(z|s_n)\, \tilde{q}_\phi(z|x_n^m) \tag{22}$$

As the mean and covariance of each Gaussian term on the right-hand side of equation (22) were already obtained when inferring $q_\phi(z|x_n, s_n)$ (equation (10)), the mean and covariance of $q_\phi(z|x_n^m, s_n)$ can be directly calculated using equation (11), avoiding the need to pass each constructed unimodal observation to the encoders.


4.5 Information-theoretic disentanglement of latent variables

To better disentangle the biological state $c$ and the technical noise $u$, we adopt an information-theoretic approach, the Information Bottleneck (IB) [31], to control the information flow during inference. We define two types of IB: the technical IB prevents batch-specific information from being encoded into $c$ by minimizing the Mutual Information (MI) between $s$ and $c$, and the biological IB prevents biological information from being encoded into $u$ by minimizing the MI between $x$ and $u$ (Fig. 1, bottom right). Let $I(\cdot, \cdot)$ denote the MI between two variables. Considering both $I(s, c)$ and $I(x, u)$, we have:

$$\begin{aligned}
I(s, c) + I(x, u) &= \mathbb{E}_{p(s,c)}\left[\log \frac{p(s, c)}{p(s)\,p(c)}\right] + \mathbb{E}_{p(x,u)}\left[\log \frac{p(x, u)}{p(x)\,p(u)}\right] \\
&= \mathbb{E}_{p(s,c)}[\underbrace{\log p(s|c)}_{\approx\, \log p_{\hat{\alpha}}(s|c)}] - \underbrace{\mathbb{E}_{p(s,c)}[\log p(s)]}_{\text{const.}} + \mathbb{E}_{p(x,u)}\left[\log \frac{p(u|x)}{p(u)}\right] \\
&\approx \mathbb{E}_{p(x,s)}\mathbb{E}_{p(c|x,s)}[\log p_{\hat{\alpha}}(s|c)] + \text{const.} + \mathbb{E}_{p(x,s)}\left[\mathbb{E}_{p(u|x,s)} \log \frac{p(u|x)}{p(u)}\right] \\
&= \mathbb{E}_{p(x,s)}\left[\mathbb{E}_{p(c|x,s)}[\log p_{\hat{\alpha}}(s|c)] + \mathbb{E}_{p(u|x,s)} \log \frac{p(u|x)}{p(u)}\right] + \text{const.} \\
&\approx \frac{1}{N}\sum_n \left[\mathbb{E}_{p(c|x_n,s_n)}[\log p_{\hat{\alpha}}(s_n|c)] + \mathbb{E}_{p(u|x_n,s_n)} \log \frac{p(u|x_n)}{p(u)}\right] + \text{const.} \\
&\approx \frac{1}{N}\sum_n \underbrace{\left[\mathbb{E}_{q_\phi(c|x_n,s_n)}[\log p_{\hat{\alpha}}(s_n|c)] + \mathbb{E}_{q_\phi(u|x_n,s_n)} \log \frac{q_\phi(u|x_n)}{p(u)}\right]}_{\triangleq\, \hat{\ell}_{\text{IB}}(\phi;\, x_n, s_n, \hat{\alpha})} + \text{const.} \\
&= \frac{1}{N}\sum_n \hat{\ell}_{\text{IB}}(\phi; x_n, s_n, \hat{\alpha}) + \text{const.}
\end{aligned} \tag{23}$$

where $p_{\hat{\alpha}}(s|c)$ is a learned likelihood with parameters $\hat{\alpha}$, and $\frac{q_\phi(u|x_n)}{p(u)} \propto \prod_{m \in \mathcal{M}_n} \tilde{q}_\phi(u|x_n^m)$ can be computed via equation (11). From equation (23), minimizing $(I(s, c) + I(x, u))$ approximately equals minimizing $\hat{\ell}_{\text{IB}}(\phi; x_n, s_n, \hat{\alpha})$ w.r.t. $\phi$ for all cells. We model $p_{\hat{\alpha}}(s|c)$ as:

$$p_{\hat{\alpha}}(s|c) = \text{Categorical}\left(s \,|\, r(c; \hat{\alpha})\right) \tag{24}$$

where $r$ is a classifier neural network parameterized by $\hat{\alpha}$. To learn the classifier, we minimize the following expected negative log-likelihood w.r.t. $\hat{\alpha}$:

$$\begin{aligned}
\mathbb{E}_{p(c,s)}[-\log p_{\hat{\alpha}}(s|c)] &= \mathbb{E}_{p(x,s)}\mathbb{E}_{p(c|x,s)}[-\log p_{\hat{\alpha}}(s|c)] \\
&\approx \frac{1}{N}\sum_n \mathbb{E}_{p(c|x_n,s_n)}[-\log p_{\hat{\alpha}}(s_n|c)] \\
&\approx \frac{1}{N}\sum_n \underbrace{\mathbb{E}_{q_\phi(c|x_n,s_n)}[-\log p_{\hat{\alpha}}(s_n|c)]}_{\triangleq\, \hat{\ell}_r(\hat{\alpha};\, x_n, s_n, \phi)} \\
&= \frac{1}{N}\sum_n \hat{\ell}_r(\hat{\alpha}; x_n, s_n, \phi)
\end{aligned} \tag{25}$$

Thus, minimizing $\mathbb{E}_{p(c,s)}[-\log p_{\hat{\alpha}}(s|c)]$ approximately equals minimizing $\hat{\ell}_r(\hat{\alpha}; x_n, s_n, \phi)$ w.r.t. $\hat{\alpha}$ for all cells.

To further enhance latent disentanglement for cross-modal inference, for each modality $m$ we also minimize $(I(s, c^m) + I(x^m, u^m))$. Similar to equation (23), this can be achieved by minimizing $\hat{\ell}_{\text{IB}}(\phi; x_n^m, s_n, \alpha_m)$, where $\alpha_m$ denotes the parameters of the classifier neural network $r_m$ that generates the likelihood $p_{\alpha_m}(s|c^m) = \text{Categorical}\left(s \,|\, r_m(c^m; \alpha_m)\right)$. Together with $\hat{\ell}_{\text{IB}}(\phi; x_n, s_n, \hat{\alpha})$ defined in equation (23), the total IB loss is defined as:

$$\ell_{\text{IB}}(\phi; x_n, s_n, \alpha) \triangleq \hat{\ell}_{\text{IB}}(\phi; x_n, s_n, \hat{\alpha}) + \sum_{m \in \mathcal{M}_n} \hat{\ell}_{\text{IB}}(\phi; x_n^m, s_n, \alpha_m) \tag{26}$$


where $\alpha = \{\hat{\alpha}, \{\alpha_m\}_{m \in \mathcal{M}}\}$. To learn $p_{\alpha_m}(s|c^m)$, we can also minimize $\mathbb{E}_{p(c^m,s)}[-\log p_{\alpha_m}(s|c^m)]$, which corresponds to minimizing $\hat{\ell}_r(\alpha_m; x_n^m, s_n, \phi)$ according to equation (25). Considering the classifier loss $\hat{\ell}_r(\hat{\alpha}; x_n, s_n, \phi)$ defined in equation (25), the total classifier loss is defined as:

$$\ell_r(\alpha; x_n, s_n, \phi) = \hat{\ell}_r(\hat{\alpha}; x_n, s_n, \phi) + \sum_{m \in \mathcal{M}_n} \hat{\ell}_r(\alpha_m; x_n^m, s_n, \phi) \tag{27}$$

4.6 Training MIDAS

To train the encoders and decoders of MIDAS, considering the training objectives defined in equations (8), (19), and (26), we minimize the following objective w.r.t. $\{\theta, \phi\}$ for all observations $\{x_n, s_n\}_{n \in \mathcal{N}}$:

$$\ell_{f,g}(\theta, \phi; x_n, s_n, \alpha) = \ell_{\text{ELBO}}(\theta, \phi; x_n, s_n) + \ell_{\text{mod}}(\phi; x_n, s_n) + \ell_{\text{IB}}(\phi; x_n, s_n, \alpha) \tag{28}$$

where $\ell_{\text{ELBO}}(\theta, \phi; x_n, s_n) \triangleq -\text{ELBO}(\theta, \phi; x_n, s_n)$. Since $\alpha$ is unknown and the learning of $\alpha$ depends on $\phi$ as in equation (27), we iteratively minimize equations (28) and (27) with Stochastic Gradient Descent (SGD), forming the MIDAS training algorithm (Algorithm 1).

Algorithm 1 The MIDAS training algorithm.

Input: a single-cell multimodal mosaic dataset $\{x_n, s_n\}_{n \in \mathcal{N}}$
Output: decoder parameters $\theta$, encoder parameters $\phi$, and classifier parameters $\alpha$
1: Randomly initialize the parameters $\{\theta, \phi, \alpha\}$
2: for $t = 1, 2, \ldots, T$ do
3:   Sample a mini-batch $\{x_n, s_n\}_{n \in \mathcal{N}_t}$ from the dataset, where $\mathcal{N}_t \subset \mathcal{N}$
4:   Freeze $\phi$ and update $\alpha$ via SGD with loss $\frac{1}{|\mathcal{N}_t|}\sum_{n \in \mathcal{N}_t} \ell_r(\alpha; x_n, s_n, \phi)$  ▷ see equation (27)
5:   Freeze $\alpha$ and update $\{\theta, \phi\}$ via SGD with loss $\frac{1}{|\mathcal{N}_t|}\sum_{n \in \mathcal{N}_t} \ell_{f,g}(\theta, \phi; x_n, s_n, \alpha)$  ▷ see equation (28)
6: end for
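The alternating freeze-and-update structure of Algorithm 1 can be sketched with toy quadratic losses standing in for equations (27) and (28); the data, gradient functions, learning rate, and iteration count below are all placeholders chosen only to make the loop runnable, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
data = [(rng.normal(size=4), int(rng.integers(0, 2))) for _ in range(32)]

phi_theta = rng.normal(size=4)   # stands in for the encoder/decoder parameters {theta, phi}
alpha = rng.normal(size=4)       # stands in for the classifier parameters alpha

def grad_l_r(alpha, batch, phi_theta):
    # Toy stand-in for the gradient of the classifier loss l_r w.r.t. alpha.
    return alpha - np.mean([x for x, _ in batch], axis=0)

def grad_l_fg(phi_theta, batch, alpha):
    # Toy stand-in for the gradient of the main loss l_{f,g} w.r.t. {theta, phi}.
    return phi_theta - alpha

lr = 0.1
for t in range(50):                                  # Algorithm 1's outer loop
    batch = [data[i] for i in rng.choice(len(data), size=8, replace=False)]
    # Step 4: freeze {theta, phi}, update alpha with the classifier loss.
    alpha -= lr * grad_l_r(alpha, batch, phi_theta)
    # Step 5: freeze alpha, update {theta, phi} with the main loss.
    phi_theta -= lr * grad_l_fg(phi_theta, batch, alpha)
```

The point of the alternation is that each update sees the other parameter group as a constant, exactly as in adversarial-style min-min schemes.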

4.7 Mosaic integration on latent space

A key goal of single-cell mosaic integration is to extract biologically meaningful low-dimensional representations from the mosaic data for downstream analysis. To achieve this, for each cell we first use the trained MIDAS to infer the mean and variance of the latent posterior $q_\phi(c, u|x_n, s_n)$ through equation (10), and then take the maximum a posteriori (MAP) estimate of the biological state $c$ as the cell's low-dimensional representation. Since $q_\phi(c|x_n, s_n)$ is Gaussian, the MAP estimate of $c$ corresponds to the inferred mean of $q_\phi(c|x_n, s_n)$.

4.8 Imputation for missing modalities and features

Based on the latent posterior $q_\phi(c, u|x_n, s_n)$ inferred from the single-cell mosaic data (Methods 4.7), it is straightforward to impute missing modalities and features. We first pass the inferred posterior means of $c$ and $u$ to the decoders to generate the padded feature mean $\tilde{\lambda}_n^m$ for each modality $m \in \mathcal{M}$ via equation (18). Then, we sample from a Bernoulli distribution with mean $\tilde{\lambda}_n^{\text{ATAC}}$ to generate the imputed ATAC counts, and from two Poisson distributions with means $\tilde{\lambda}_n^{\text{RNA}}$ and $\tilde{\lambda}_n^{\text{ADT}}$ to generate the imputed RNA and ADT counts, respectively.
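Given decoded likelihood means, the sampling step is a one-liner per modality. The sketch below fabricates hypothetical $\tilde{\lambda}$ vectors (the sizes and gamma-distributed rates are arbitrary placeholders) just to show the Bernoulli-vs-Poisson split described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical decoder outputs: padded likelihood means (tilde lambda) per modality.
lam = {
    "atac": rng.uniform(0.0, 1.0, size=200),   # Bernoulli means in [0, 1)
    "rna":  rng.gamma(2.0, 1.0, size=500),     # Poisson rates > 0
    "adt":  rng.gamma(2.0, 1.0, size=50),
}

def impute_counts(lam, rng):
    # Sample imputed counts as in Methods 4.8: Bernoulli for ATAC, Poisson otherwise.
    return {
        m: (rng.binomial(1, p) if m == "atac" else rng.poisson(p))
        for m, p in lam.items()
    }

counts = impute_counts(lam, rng)
```

Sampling (rather than thresholding the means) preserves count-valued outputs with the noise model assumed by equation (7).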

4.9 Batch correction via latent variable manipulation

Besides performing mosaic integration in the latent space (Methods 4.7), we can also perform it in the feature space, i.e., imputing missing values and correcting batch effects in the count data. Mosaic integration in the feature space is important since it is required by many downstream tasks such as differential expression analysis and cell typing.

Based on the inferred posterior means of the latent variables $c$ and $u$ (Methods 4.8) of each cell, we can perform imputation and batch correction simultaneously by manipulating the technical noise $u$. Concretely, let $c_n$ and $u_n$ be the posterior means of $c$ and $u$ for Cell $n$, respectively. We first calculate the mean of $u_n$ within each batch $s$:

$$\bar{u}_s = \frac{1}{|\mathcal{N}_s|} \sum_{n \in \mathcal{N}_s} u_n \tag{29}$$


where $\mathcal{N}_s \subseteq \mathcal{N}$ is the set of cell IDs belonging to batch $s$. Next, we calculate the mean of $\bar{u}_s$ over all batches:

$$\bar{u} = \frac{1}{S} \sum_s \bar{u}_s \tag{30}$$

Then, we find the batch mean $\bar{u}_{s^*}$ that is closest to $\bar{u}$ and treat it as a "standard" technical noise, where:

$$s^* = \arg\min_s \|\bar{u}_s - \bar{u}\|_2 \tag{31}$$

Finally, for each cell we correct the batch effect by substituting $u_n$ with the common $\bar{u}_{s^*}$, and pass $\{c_n, \bar{u}_{s^*}\}$ to the decoders to generate the imputed and batch-corrected data (similar to Methods 4.8, but here we use $\{c_n, \bar{u}_{s^*}\}$ instead of $\{c_n, u_n\}$ to correct the batch effect).
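The latent manipulation of equations (29)-(31) amounts to three numpy reductions followed by a substitution; the toy batch means below are placeholders.

```python
import numpy as np

def standardize_technical_noise(u, batch_ids):
    """Equations (29)-(31): per-batch means of the technical-noise posterior
    means u_n, their across-batch mean, and the closest batch mean taken as
    the 'standard' noise substituted into every cell."""
    batches = np.unique(batch_ids)
    u_bar_s = np.stack([u[batch_ids == s].mean(axis=0) for s in batches])  # eq. (29)
    u_bar = u_bar_s.mean(axis=0)                                           # eq. (30)
    s_star = np.argmin(np.linalg.norm(u_bar_s - u_bar, axis=1))            # eq. (31)
    return np.broadcast_to(u_bar_s[s_star], u.shape).copy()

u = np.array([[0.0, 0.0], [0.0, 0.0],
              [2.0, 2.0], [2.0, 2.0],
              [10.0, 10.0], [10.0, 10.0]])
batch_ids = np.array([0, 0, 1, 1, 2, 2])
u_std = standardize_technical_noise(u, batch_ids)
```

Here the three batch means are [0, 0], [2, 2], and [10, 10]; their average [4, 4] is closest to [2, 2], so every cell receives that batch's noise before decoding.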

4.10 Model transfer via transfer learning

When MIDAS has been pre-trained on a reference dataset, we can conduct model transfer to transfer the model's learned knowledge to a query dataset through transfer learning, i.e., on the query dataset we fine-tune the pre-trained model instead of training it from scratch. Since the query dataset can contain a different number of batches, collected from different platforms, than the reference dataset, the batch-ID-related modules need to be redefined. Thus, during transfer learning, we reparameterize and reinitialize the batch-ID encoder and decoder $\{f_s, g_s\}$ and the batch classifiers $\{r, \{r_m\}_{m \in \mathcal{M}}\}$, and only fine-tune the modality encoders and decoders $\{f_m, g_m\}_{m \in \mathcal{M}}$.

A core advantage of our model transfer scheme is that it can flexibly transfer the knowledge of multimodal data to various types of query datasets, even to those with fewer modalities, improving the de novo integration of single-cell data.
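The parameter-group bookkeeping described above can be sketched as follows; the group names mirror Methods 4.10, but the arrays and their shapes are purely illustrative placeholders for the real network weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretrained parameter groups (illustrative stand-ins for network weights).
params = {
    "f_m": rng.normal(size=8),   # modality encoders  -> fine-tuned on the query data
    "g_m": rng.normal(size=8),   # modality decoders  -> fine-tuned on the query data
    "f_s": rng.normal(size=4),   # batch-ID encoder   -> reparameterized/reinitialized
    "g_s": rng.normal(size=4),   # batch-ID decoder   -> reparameterized/reinitialized
    "r":   rng.normal(size=4),   # batch classifiers  -> reparameterized/reinitialized
}

def prepare_for_transfer(params, n_query_batches, rng):
    out = dict(params)
    # Batch-ID-related modules depend on the number and identity of batches,
    # so they are resized for the query data and trained from scratch.
    for name in ("f_s", "g_s", "r"):
        out[name] = rng.normal(0.0, 0.01, size=n_query_batches)
    return out

query_params = prepare_for_transfer(params, n_query_batches=2, rng=rng)
```

Only the modality encoders and decoders carry knowledge over; everything batch-specific starts fresh.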

4.11 Label transfer via reciprocal reference mapping

Whilst model transfer implicitly transfers knowledge through model parameters, label transfer explicitly transfers knowledge in the form of data labels. These labels can be various kinds of downstream analysis results, such as cell types, cell cycles, or pseudotime. Through accurate label transfer, we can not only avoid expensive de novo downstream analysis but also improve label quality.

Typically, the first step of label transfer is reference mapping, which aligns the query cells with the reference cells so that labels can be transferred reliably. For MIDAS, we can naively achieve reference mapping in two ways: (1) mapping the query data onto the reference space, i.e., applying the model pre-trained on the reference data to infer the biological states for the query data [46, 47], and (2) mapping the reference data onto the query space, i.e., applying the model transfer-learned on the query data (Methods 4.10) to infer the biological states for the reference data [13, 48]. However, the first way suffers from the 'generalization problem', since the pre-trained model is hard to generalize to the query data, which usually contains unseen technical variations, while the second way suffers from the 'forgetting problem', since the transfer-learned model may lose information learned on the reference data, affecting the inferred biological states.

To tackle both problems, we propose a reciprocal reference mapping scheme, in which we fine-tune the pre-trained model on the query dataset to avoid the generalization problem and meanwhile feed the model with past data sampled from the reference dataset to prevent forgetting. In doing so, the model can find a mapping suitable for both the reference and query datasets. We can then align the two datasets by inferring the biological states, through which the reference labels can be transferred to the query data based on nearest neighbors. Similar to model transfer (Methods 4.10), in label transfer knowledge can also be flexibly and accurately transferred to various types of query datasets.

4.12 Modality contribution to the integrated clustering

We assess the contribution of different modalities to clustering by measuring the agreement between single-modality clustering and multimodal clustering. For each cell, the normalized ratio of nearest neighbors shared between the single-modality clustering and the multimodal clustering is used to represent the contribution of that modality to the final integrated clustering.
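The consistency ratio described above can be sketched as a kNN-set overlap between a single-modality embedding and the integrated embedding. This is one plausible reading of the metric under the assumption that "nearest neighbors in the clustering" means kNN in the corresponding embedding; the brute-force distance computation is for illustration only.

```python
import numpy as np

def knn_indices(emb, k):
    # Brute-force kNN on a small embedding matrix, self excluded.
    d = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1)[:, :k]

def modality_contribution(single_emb, joint_emb, k=2):
    """Per-cell normalized consistency ratio (Methods 4.12): the fraction of
    a cell's k nearest neighbors shared between the single-modality and the
    integrated (multimodal) embeddings."""
    nn_s = knn_indices(single_emb, k)
    nn_j = knn_indices(joint_emb, k)
    return np.array([len(set(a) & set(b)) / k for a, b in zip(nn_s, nn_j)])

emb = np.array([[0.0], [1.0], [10.0], [11.0]])
scores = modality_contribution(emb, emb, k=2)
```

A modality whose embedding reproduces the integrated neighborhoods scores near 1; one that is ignored by the integration scores near 0.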

4.13 Regulatory network inference from scRNA-seq datasets

GRNBoost2 [51], one of the regression-based methods for regulatory network inference, is used to infer regulatory networks from scRNA-seq datasets. GRNBoost2 provides weighted regulatory links between genes and transcription factors. The weights of the links shared between different datasets are compared to indicate regulatory network retention.


4.14 Correlation of expression fold changes between raw and batch-corrected data

For each cell type, the expression fold changes of genes and proteins are calculated against all other cells using the FoldChange function in Seurat. The Pearson correlation coefficient is used to measure the linear correlation of fold changes between the raw and batch-corrected data.

4.15 Generating ground-truth cell type labels

To generate ground-truth cell type labels for both qualitative and quantitative evaluation, we employed the third-party tool Seurat to annotate cell types for the different PBMC datasets through label transfer. We took the CITE-seq PBMC atlas from [14] as the reference set, and utilized the FindTransferAnchors and TransferData functions in Seurat to perform label transfer, where "cca" was used as the reduction method for reference mapping. For cells without raw RNA expression, we first utilized the ATAC data to create a gene activity matrix using the GeneActivity function in Signac [52]. The gene activity matrix was subsequently used for label transfer.

4.16 Evaluation metrics

To evaluate the performance of MIDAS and the state-of-the-art tools on multimodal integration, we utilize metrics from scIB on batch correction and biological conservation, and also propose our own metrics on modality alignment to better evaluate mosaic integration, extending scIB to scMIB (Supplementary Table 3). Since mosaic integration should generate both low-dimensional representations and imputed, batch-corrected data, scMIB is performed in both the embedding space and the feature space. To evaluate the batch correction and biological conservation metrics in the feature space, we convert the imputed and batch-corrected features into a similarity graph via the PCA+WNN strategy (Methods 4.20), and then use this graph for evaluation. Our metrics for batch correction, modality alignment, and biological conservation are defined as follows.

4.16.1 Batch correction metrics

The batch correction metrics comprise graph iLISI ($y_{\text{embed}}^{\text{iLISI}}$ and $y_{\text{feat}}^{\text{iLISI}}$), graph connectivity ($y_{\text{embed}}^{\text{gc}}$ and $y_{\text{feat}}^{\text{gc}}$), and kBET ($y_{\text{embed}}^{\text{kBET}}$ and $y_{\text{feat}}^{\text{kBET}}$), where $y_{\text{embed}}^{\text{iLISI}}$, $y_{\text{embed}}^{\text{gc}}$, and $y_{\text{embed}}^{\text{kBET}}$ are defined in the embedding space and $y_{\text{feat}}^{\text{iLISI}}$, $y_{\text{feat}}^{\text{gc}}$, and $y_{\text{feat}}^{\text{kBET}}$ are defined in the feature space.

Graph iLISI. The graph iLISI (local inverse Simpson's index [53]) metric is extended from the iLISI, which is used to measure the degree of batch mixing. The iLISI scores are computed on kNN graphs by computing the inverse Simpson's index for diversity; they estimate the effective number of batches present in a cell's neighborhood. The iLISI ranges from 1 to the number of batches, and scores close to the real number of batches denote good mixing. However, the typical iLISI score is not applicable to graph-based outputs. scIB proposed the graph iLISI, which utilizes a graph-based distance metric to determine the nearest-neighbor list and avoids skews on graph-based integration outputs. The graph iLISI scores are scaled to [0, 1], where 0 indicates strong separation and 1 indicates perfect mixing.

Graph connectivity. The graph connectivity is proposed by scIB to inspect whether cells with the same label are connected in the kNN graph of all cells. For each label $c$, we take the largest connected component (LCC) of the subgraph of $c$-labeled cells and divide the LCC size by the population size of $c$-labeled cells to represent the graph connectivity for cell label $c$. Then, we calculate the connectivity values for all labels and take the average as the total graph connectivity. The score ranges from 0 to 1, where 1 means that all cells with the same cell identity from different batches are connected in the integrated kNN graph, indicating perfect batch mixing, and vice versa.

kBET. The kBET (k-nearest-neighbor batch-effect test [54]) is used to measure batch mixing at the local level of the k-nearest neighbors. Certain fractions of random cells are repeatedly selected to test whether the local label distributions are statistically similar to the global label distributions (null hypothesis). The kBET value is the rejection rate over all tested neighborhoods, and values close to zero indicate that the batches are well mixed. scIB adjusts the kBET with a diffusion-based correction to enable unbiased comparison on graph- and non-graph-based integration results. kBET values are first computed for each label, then averaged and subtracted from 1 to get the final kBET score.


4.16.2 Modality alignment metrics

The modality alignment metrics comprise modality ASW ($y^{\text{ASW}}$), FOSCTTM ($y^{\text{FOSCTTM}}$), label transfer F1 ($y^{\text{ltF1}}$), ATAC AUROC ($y^{\text{AUROC}}$), RNA Pearson's $r$ ($y^{\text{RNAr}}$), and ADT Pearson's $r$ ($y^{\text{ADTr}}$), where $y^{\text{ASW}}$, $y^{\text{FOSCTTM}}$, and $y^{\text{ltF1}}$ are defined in the embedding space and $y^{\text{AUROC}}$, $y^{\text{RNAr}}$, and $y^{\text{ADTr}}$ are defined in the feature space.

Modality ASW. The modality ASW (averaged silhouette width) is used to measure the alignment of distributions between different modality embeddings. The ASW [55] was originally used to measure the separation of clusters. In scIB, the ASW is also modified to measure the performance of batch effect removal, resulting in a batch ASW that ranges from 0 to 1, where 1 denotes perfect batch mixing and 0 denotes strong batch separation. By replacing batch embeddings with modality embeddings, we define a modality ASW in the same manner as the batch ASW, where 1 denotes perfect modality alignment and 0 denotes strong modality separation. For MIDAS, the modality embeddings are generated by feeding the trained model with each modality individually.

FOSCTTM. The FOSCTTM (fraction of samples closer than the true match [56]) is used to measure the alignment of values between different modality embeddings. Let $y_{m_1, m_2}^{\text{FOSCTTM}}$ be the FOSCTTM for a modality pair $\{m_1, m_2\}$; it is defined as:

$$\begin{aligned}
y_{m_1, m_2}^{\text{FOSCTTM}} &= \frac{1}{2N}\left(\sum_i \frac{N_i^{m_1}}{N} + \sum_i \frac{N_i^{m_2}}{N}\right) \\
N_i^{m_1} &= \left|\left\{j \,\middle|\, \|e_i^{m_1} - e_j^{m_2}\|_2 < \|e_i^{m_1} - e_i^{m_2}\|_2\right\}\right| \\
N_i^{m_2} &= \left|\left\{j \,\middle|\, \|e_j^{m_1} - e_i^{m_2}\|_2 < \|e_i^{m_1} - e_i^{m_2}\|_2\right\}\right|
\end{aligned} \tag{32}$$

where $N$ is the number of cells, $i$ and $j$ are cell indices, and $e_i^{m_1}$ and $e_i^{m_2}$ are the embeddings of cell $i$ in modalities $m_1$ and $m_2$, respectively. $N_i^{m_1}$ is the number of cells in modality $m_2$ that are closer to $e_i^{m_1}$ than $e_i^{m_2}$ is to $e_i^{m_1}$, and similarly for $N_i^{m_2}$. We first obtain the embeddings of the individual modalities, then calculate the FOSCTTM values for each modality pair, and lastly average these values and subtract the average from 1 to obtain the final FOSCTTM score. Higher FOSCTTM scores indicate better modality alignment.
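Equation (32) for one modality pair can be implemented with a single pairwise-distance matrix; the sketch below is a direct transcription of the formula (before the final averaging-and-subtracting-from-1 step described above).

```python
import numpy as np

def foscttm(e1, e2):
    """FOSCTTM of equation (32) for one modality pair: for each cell, the
    fraction of cells in the other modality strictly closer than its true match."""
    d = np.linalg.norm(e1[:, None, :] - e2[None, :, :], axis=-1)  # d[i, j] = ||e1_i - e2_j||
    true_match = np.diag(d)
    n = e1.shape[0]
    frac1 = (d < true_match[:, None]).sum(axis=1) / n   # N^{m1}_i / N, rows
    frac2 = (d < true_match[None, :]).sum(axis=0) / n   # N^{m2}_i / N, columns
    return 0.5 * (frac1.mean() + frac2.mean())

# Perfectly aligned embeddings: each true match is at distance 0,
# so no cell is strictly closer and the raw FOSCTTM is 0.
e = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
aligned = foscttm(e, e)
```

A raw value of 0 corresponds to a final score of 1 after the subtract-from-1 step, i.e., perfect alignment.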

Label transfer F1. The label transfer F1 is used to measure the alignment of cell types between different modality embeddings. This can be achieved by testing whether cell type labels can be transferred from one modality to another without any bias. For each pair of modalities, we first build a kNN graph between their embeddings, and then transfer labels from one modality to the other based on the nearest neighbors. The transferred labels are compared to the original labels via the micro F1-score, which is defined as the label transfer F1. We take the F1 score averaged over all comparison pairs as the final label transfer F1 score.

ATAC AUROC. The ATAC AUROC (area under the receiver operating characteristic curve) is used to measure the alignment of different modalities in the ATAC feature space. It has previously been used to evaluate the quality of ATAC predictions [57]. For each method to be evaluated, we first use it to convert different modality combinations (excluding ATAC) into ATAC features, then calculate the AUROC of each converted result by taking the true ATAC features as the ground truth, and finally take the average of these AUROCs as the final score. Taking MIDAS as an example, if ATAC, RNA, and ADT are involved, the data of the three modality combinations {RNA}, {ADT}, and {RNA, ADT} can be input into the trained model to obtain three sets of ATAC predictions.

RNA Pearson's $r$. The RNA Pearson's $r$ is used to measure the alignment of different modalities in the RNA feature space. For each method to be evaluated, we first use it to convert different modality combinations (excluding RNA) into RNA features, then calculate the Pearson's $r$ between each converted result and the true RNA features, and finally take the average of these Pearson's $r$ values as the final score.

ADT Pearson's $r$. The ADT Pearson's $r$ is used to measure the alignment of different modalities in the ADT feature space. Its calculation is similar to that of the RNA Pearson's $r$.

4.16.3 Biological conservation metrics

The biological conservation metrics comprise NMI ($y_{\text{embed}}^{\text{NMI}}$ and $y_{\text{feat}}^{\text{NMI}}$), ARI ($y_{\text{embed}}^{\text{ARI}}$ and $y_{\text{feat}}^{\text{ARI}}$), isolated label F1 ($y_{\text{embed}}^{\text{ilF1}}$ and $y_{\text{feat}}^{\text{ilF1}}$), and graph cLISI ($y_{\text{embed}}^{\text{cLISI}}$ and $y_{\text{feat}}^{\text{cLISI}}$), where $y_{\text{embed}}^{\text{NMI}}$, $y_{\text{embed}}^{\text{ARI}}$, $y_{\text{embed}}^{\text{ilF1}}$, and $y_{\text{embed}}^{\text{cLISI}}$ are defined in the embedding space and $y_{\text{feat}}^{\text{NMI}}$, $y_{\text{feat}}^{\text{ARI}}$, $y_{\text{feat}}^{\text{ilF1}}$, and $y_{\text{feat}}^{\text{cLISI}}$ are defined in the feature space.


NMI. The NMI (normalized mutual information) is used to measure the similarity between two clustering results, namely the predefined cell type labels and the clustering result obtained from the embeddings or the graph. Optimized Louvain clustering is used here, following scIB. The NMI scores are scaled to [0, 1], where 0 and 1 correspond to uncorrelated clustering and a perfect match, respectively.

ARI. The ARI (adjusted Rand index) also measures the overlap of two clustering results. The RI (Rand index [58]) considers not only cell pairs that are assigned to the same clusters but also those assigned to different clusters in the predicted (Louvain clustering) and true (cell type) clusterings. The ARI corrects the RI for randomly correct labels. An ARI of 1 represents a perfect match and 0 represents random labeling.

Isolated label F1. scIB proposes the isolated label F1 score to evaluate integration performance, specifically focusing on cells whose label is shared by few batches. Cell labels present in the smallest number of batches are identified as isolated labels. The F1 score measuring the clustering performance on isolated labels is defined as the isolated label F1 score. It reflects how well the isolated labels separate from other cell identities, ranging from 0 to 1, where 1 means that all cells with the isolated label, and no others, are grouped into one cluster.

Graph cLISI. The graph cLISI is similar to the graph iLISI but focuses on cell type labels rather than batch labels. Unlike iLISI, which rewards the mixing of groups, cLISI rewards the separation of groups. The graph-adjusted cLISI is scaled to [0, 1], with 0 corresponding to low cell-type separation and 1 corresponding to strong cell-type separation.

4.16.4 Overall scores

scIB. We compute the scIB overall score using the batch correction and biological conservation metrics defined either on the embedding space (for algorithms generating embeddings or graphs) or on the feature space (for algorithms generating batch-corrected features). Following [40], the overall score y is the sum of the averaged batch correction metric y_batch weighted by 0.4 and the averaged biological conservation metric y_bio weighted by 0.6:

\begin{aligned}
y_{\text{batch}} &= \left( y^{\text{iLISI}}_{\omega} + y^{\text{gc}}_{\omega} + y^{\text{kBET}}_{\omega} \right) / 3 \\
y_{\text{bio}} &= \left( y^{\text{NMI}}_{\omega} + y^{\text{ARI}}_{\omega} + y^{\text{ilF1}}_{\omega} + y^{\text{cLISI}}_{\omega} \right) / 4 \\
y &= 0.4 \cdot y_{\text{batch}} + 0.6 \cdot y_{\text{bio}}
\end{aligned}
\tag{33}

where ω = embed for embedding or graph outputs, and ω = feat for feature outputs.
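Equation (33) is simply a weighted combination of two per-category averages; a minimal sketch (the dictionary keys here are illustrative, not an scIB API):

```python
def scib_overall(m):
    """scIB overall score: 0.4 x batch-correction mean + 0.6 x bio-conservation mean.
    `m` maps metric names to scores in [0, 1], all computed on one space
    (embedding or feature)."""
    y_batch = (m["iLISI"] + m["gc"] + m["kBET"]) / 3
    y_bio = (m["NMI"] + m["ARI"] + m["ilF1"] + m["cLISI"]) / 4
    return 0.4 * y_batch + 0.6 * y_bio
```

The 0.4/0.6 weighting makes biological conservation count slightly more than batch correction, so a method cannot top the ranking by over-mixing batches at the expense of cell-type structure.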

scMIB. As an extension of scIB, the scMIB overall score y is computed from the batch correction, modality alignment, and biological conservation metrics defined on both the embedding and feature spaces. It is the sum of the averaged batch correction metric y_batch weighted by 0.3, the averaged modality alignment metric y_mod weighted by 0.3, and the averaged biological conservation metric y_bio weighted by 0.4:

\begin{aligned}
y_{\text{batch}} &= \left( y^{\text{iLISI}}_{\text{embed}} + y^{\text{gc}}_{\text{embed}} + y^{\text{kBET}}_{\text{embed}} + y^{\text{iLISI}}_{\text{feat}} + y^{\text{gc}}_{\text{feat}} + y^{\text{kBET}}_{\text{feat}} \right) / 6 \\
y_{\text{mod}} &= \left( y^{\text{ASW}} + y^{\text{FOSCTTM}} + y^{\text{ltF1}} + y^{\text{AUROC}} + y^{\text{RNA}\,r} + y^{\text{ADT}\,r} \right) / 6 \\
y_{\text{bio}} &= \left( y^{\text{NMI}}_{\text{embed}} + y^{\text{ARI}}_{\text{embed}} + y^{\text{ilF1}}_{\text{embed}} + y^{\text{cLISI}}_{\text{embed}} + y^{\text{NMI}}_{\text{feat}} + y^{\text{ARI}}_{\text{feat}} + y^{\text{ilF1}}_{\text{feat}} + y^{\text{cLISI}}_{\text{feat}} \right) / 8 \\
y &= 0.3 \cdot y_{\text{batch}} + 0.3 \cdot y_{\text{mod}} + 0.4 \cdot y_{\text{bio}}
\end{aligned}
\tag{34}

4.17 Data availability

All datasets of human PBMCs are publicly available (Supplementary Table 1). Count matrices of gene UMIs, ATAC fragments, and antibody-derived tags were downloaded for data analysis.

DOGMA-seq dataset. The DOGMA-seq dataset contains four batches profiled by DOGMA-seq, which measures RNA, ATAC and ADT data simultaneously. Trimodal data of this dataset were obtained from Gene Expression Omnibus (GEO) [59] under accession ID GSE166188 [2].

TEA-seq dataset. The TEA-seq dataset contains five batches profiled by TEA-seq, which measures RNA, ATAC and ADT data simultaneously. Trimodal data of these batches were obtained from GEO under accession ID GSE158013 [3].

TEA Multiome dataset. The TEA Multiome dataset measuring paired RNA and ATAC data was obtained from GEO under accession ID GSE158013 [3]. This dataset contains two batches profiled by 10x Chromium Single Cell Multiome ATAC + Gene Expression.


10X Multiome dataset. The 10X Multiome dataset measuring paired RNA and ATAC data was collected from 10x

Genomics (https://www.10xgenomics.com/resources/datasets/) [60–63].

ASAP dataset. The ASAP dataset was obtained from GEO under accession ID GSE156473 [2]. Two batches profiled by ASAP-seq include ATAC and ADT data, and the other two batches profiled by CITE-seq measure RNA and ADT data.

WNN dataset. The WNN dataset measuring paired RNA and ADT data was obtained from https://atlas.fredhutch.org/nygc/multimodal-pbmc [14]. This dataset was profiled by CITE-seq. We selected the eight PBMC batches generated before the administration of the HIV vaccine for integration.

4.18 Data preprocessing

The count matrices of RNA and ADT were processed via the Seurat package (v4.1.0) [14]. The ATAC fragment files were processed using the Signac package (v1.6.0) [52] and peaks were called via MACS2 [64]. We performed quality control separately for each batch. Briefly, metrics of detected gene number per cell, total UMI number, percentage of mtRNA reads, total protein tag number, total fragment number, TSS score, and nucleosome signal were evaluated. We manually checked the distributions of these metrics and set customized criteria to filter low-quality cells in each batch. The number of cells that passed quality control in each batch is shown in Supplementary Table 1.

For each batch, we adopted common normalization strategies for RNA, ADT and ATAC respectively. Specifically, for RNA data, UMI count matrices are normalized and log-transformed using the NormalizeData function in Seurat; for ADT data, tag count matrices are centered log-ratio (CLR)-normalized using the NormalizeData function in Seurat; for ATAC data, fragment matrices are term frequency-inverse document frequency (TF-IDF) normalized using the RunTFIDF function in Signac.
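To illustrate the ADT normalization, here is a simplified numpy sketch of a per-cell centered log-ratio transform. Seurat's exact CLR implementation handles zeros and the geometric mean slightly differently; this is an approximation for intuition only:

```python
import numpy as np

def clr_normalize(counts):
    """Per-cell centered log-ratio (CLR) transform for ADT tag counts.
    Simplified approximation of Seurat's NormalizeData(method = "CLR"):
    log1p-transform, then center each cell (row) by its mean log value."""
    log_x = np.log1p(counts)  # log(1 + x) handles zero counts
    return log_x - log_x.mean(axis=1, keepdims=True)  # center within each cell
```

Centering within each cell removes cell-specific sequencing-depth effects, which is why CLR is the standard choice for compositional ADT data.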

To integrate batches profiled by various technologies, we need to create a union of features for the RNA, ADT and ATAC data, respectively. For RNA data, low-frequency genes are first removed based on gene occurrence frequency across all batches; we then selected 4000 highly variable genes (HVGs) in each batch using the FindVariableFeatures function with default parameters; the union of these HVGs is ranked using the SelectIntegrationFeatures function and the top 4000 genes are selected. In addition, we also retained genes that encode the proteins targeted by the antibodies. For ADT data, the union of antibodies in all batches is retained for data integration. For ATAC data, we used the reduce function in Signac to merge all intersecting peaks across batches and then re-calculated the fragment counts in the merged peaks. The merged peaks are used for data integration.

The input data for MIDAS are UMI counts for RNA data, tag counts for ADT data, and binarized fragment counts for ATAC data. For each modality, the union of features from all batches is used. Counts of missing features are set to 0. Binary feature masks are generated accordingly, where 1 and 0 denote present and missing features, respectively.
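The feature-union and mask construction can be sketched as follows (the antibody names are made up for illustration):

```python
import numpy as np

# Hypothetical ADT panels measured in two batches
batch_features = {
    "batch1": ["CD3", "CD4", "CD8"],
    "batch2": ["CD4", "CD8", "CD19"],
}

# Union of features across all batches, in a fixed order
union = sorted(set().union(*batch_features.values()))

# Binary mask per batch: 1 = feature measured, 0 = missing (its count is set to 0)
masks = {batch: np.array([int(f in feats) for f in union])
         for batch, feats in batch_features.items()}
```

The masks let the model distinguish a true zero count from a feature that was simply not measured in that batch.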

4.19 Implementation of MIDAS

We implement the architecture of MIDAS with PyTorch [65]. We set the sizes of the shared hidden layers of the different modality encoders to 1024-128, the sizes of the shared hidden layers of the different modality decoders to 128-1024, and the sizes of the biological state and technical noise hidden variables to 32 and 2, respectively. Each hidden layer is implemented by four functions: Linear, LayerNorm, Mish, and Dropout. For all tasks, we split the training data into proportions of 95/5 for training/validation. To train the model, we set the mini-batch size to 256 and use the AdamW [66] optimizer with a learning rate of 10^-4 to implement SGD. Early stopping is used to terminate training.
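The hidden-layer composition described above can be sketched in PyTorch as follows. The dropout rate and the 4000-gene input dimension are assumptions for illustration, not values stated in the text:

```python
import torch
import torch.nn as nn

def hidden_block(in_dim, out_dim, dropout=0.2):
    """One MIDAS-style hidden layer: Linear -> LayerNorm -> Mish -> Dropout.
    The dropout rate is an assumption; the text does not specify it."""
    return nn.Sequential(
        nn.Linear(in_dim, out_dim),
        nn.LayerNorm(out_dim),
        nn.Mish(),
        nn.Dropout(dropout),
    )

# Shared encoder hidden sizes 1024 -> 128 (4000 input genes is an assumed example);
# the 128-dim output would then parameterize the 32-dim biological state
# and 2-dim technical noise variables.
encoder = nn.Sequential(hidden_block(4000, 1024), hidden_block(1024, 128))
```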

4.20 Implementation of comparing methods

We compare MIDAS to recent methods for both rectangular and mosaic integration of trimodal data. Rectangular integration is a simpler case of mosaic integration in which all modalities are complete. Here, all generated low-dimensional representations have the same size as the biological states inferred by MIDAS, i.e., 32 dimensions.

4.20.1 Rectangular integration methods

Since few methods are directly applicable to rectangular integration tasks involving ATAC, RNA, and ADT, we decompose rectangular integration into two steps, i.e., batch correction for each modality independently, followed by modality fusion