Identification of transcriptional programs using dense vector representations defined by mutual information with GeneVector

Springer Nature
Nature Communications

Deciphering individual cell phenotypes from cell-specific transcriptional processes requires high dimensional single cell RNA sequencing. However, current dimensionality reduction methods aggregate sparse gene information across cells, without directly measuring the relationships that exist between genes. By performing dimensionality reduction with respect to gene co-expression, low-dimensional features can model these gene-specific relationships and leverage shared signal to overcome sparsity. We describe GeneVector, a scalable framework for dimensionality reduction implemented as a vector space model using mutual information between gene expression. Unlike other methods, including principal component analysis and variational autoencoders, GeneVector uses latent space arithmetic in a lower dimensional gene embedding to identify transcriptional programs and classify cell types. In this work, we show in four single cell RNA-seq datasets that GeneVector was able to capture phenotype-specific pathways, perform batch effect correction, interactively annotate cell types, and identify pathway variation with treatment over time.
Metagenes Associated with Directional Difference in HGSOC Cancer Cells from Adnexa to Bowel A UMAP of HGSOC cells with classified by GeneVector. B Uncorrected UMAP of cancer cells from patients with adnexa and bowel samples. C GeneVector batch corrected UMAP with patient labels on site labels on batch corrected UMAP. D Confusion matrix of accuracy comparing SPECTRUM annotated cell types with GeneVector classification. E Hallmark pathway enrichment for top 30 metagenes by cosine similarity to Vadnexa – Vbowel. F Hallmark pathway enrichment for top 30 metagenes by cosine similarity to Vbowel – Vadnexa. G Pseudo-probabilities for metagenes associated with up-regulation in bowel to adnexa (Epithelial-to-Mesenchymal Transition) and down-regulation (Major Histocompatibility Class I). H EMT (Epithelial-Mesenchymal Transition) metagene significantly up regulated in four of six patients by gene module score. I MHC Class I (MHCI) metagene significantly downregulated in the metastatic site (bowel) in three of six patients by gene module score. Source data provided as a Source Data file. The center of the box plot is denoted by the median, a horizontal line dividing the box into two equal halves. The bounds of the box are defined by the lower quartile (25th percentile) and the upper quartile (75th percentile). The whiskers extend from the box and represent the data points that fall within 1.5 times the interquartile range (IQR) from the lower and upper quartiles. Any data point outside this range is considered an outlier and plotted separately. Significance assessed using Mann-Whitney-Wilcoxon two-sided test.
Article https://doi.org/10.1038/s41467-023-39985-2
Nicholas Ceglia1, Zachary Sethna1,2,3,9, Samuel S. Freeman1,9, Florian Uhlitz1, Viktoria Bojilova1, Nicole Rusk1, Bharat Burman4, Andrew Chow5, Sohrab Salehi1, Farhia Kabeer6,7, Samuel Aparicio6,7, Benjamin D. Greenbaum1,8, Sohrab P. Shah1 & Andrew McPherson1
Maintenance of cell state and execution of cellular function are based on coordinated activity within networks of related genes. To approximate these connections, transcriptomic studies have conceptually organized the transcriptome into sets of co-regulated genes, termed gene programs1 or metagenes2. The first intuitive step to identify such co-regulated genes is the reduction of dimensionality for sparse expression measurements: high dimensional gene expression data is compressed into a minimal set of explanatory features that highlight similarities in cellular function. However, to map existing biological knowledge to each cell, the derived features must be interpretable at the gene level.
To find similarities in lower dimensions, biology can borrow from the field of natural language processing (NLP). NLP commonly uses dimensionality reduction to identify word associations within a body of text3,4. To find contextually similar words, NLP methods make use of vector space models to represent similarities in a lower dimensional space. Similar methodology has been applied to bulk RNA-seq expression for finding co-expression patterns5. Inspired by such work, we developed a tool that generates gene vectors based on single cell RNA (scRNA)-seq expression data. While current methods reduce dimensionality with respect to sparse expression across each cell, our tool produces a lower dimensional embedding with respect to each gene. The vectors derived from GeneVector provide a framework for identifying metagenes within a gene co-expression graph and relating these metagenes back to each cell using latent space arithmetic.
Received: 8 July 2022
Accepted: 4 July 2023
1 Computational Oncology, Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, NY, USA.
2 Immuno-Oncology Service, Human Oncology and Pathogenesis Program, Memorial Sloan Kettering Cancer Center, New York, NY, USA.
3 Hepatopancreatobiliary Service, Department of Surgery, Memorial Sloan Kettering Cancer Center, New York, NY, USA.
4 Department of Medicine, Memorial Sloan Kettering Cancer Center, New York, NY, USA.
5 Department of Medicine, Thoracic Oncology Service, Memorial Sloan Kettering Cancer Center, New York, NY, USA.
6 Department of Molecular Oncology, British Columbia Cancer Research Centre, Vancouver, British Columbia, Canada.
7 Department of Pathology and Laboratory Medicine, University of British Columbia, Vancouver, British Columbia, Canada.
8 Physiology, Biophysics & Systems Biology, Weill Cornell Medicine, Weill Cornell Medical College, New York, NY, USA.
9 These authors contributed equally: Zachary Sethna, Samuel S. Freeman. e-mail: ceglian@mskcc.org
Nature Communications | (2023) 14:4400
The most pervasive method for identifying the sources of variation in scRNA-seq studies is principal component analysis (PCA)6–8. The relationship of principal components to gene expression is linear, allowing lower dimensional structure to be directly related to variation in expression. A PCA embedding is an ideal input for building a nearest neighbor graph for unsupervised clustering algorithms9 and visualization methods including t-SNE10 and UMAP11. However, the assumption of a continuous multivariate Gaussian distribution creates distortion in modeling read counts generated by a true distribution that is over-dispersed, possibly zero-inflated12, with positive support and mean close to zero2. Despite such issues, gene programs generated from PCA loadings have been used to generate metagenes that explain each principal component13. While these loadings highlight sets of genes that explain each orthogonal axis of variation, pathways and cell type signatures can be conflated within a single axis.
In addition to PCA, more sophisticated methods have been developed to better handle the specific challenges of scRNA data. The single cell variational inference (scVI) framework14 generates an embedding using non-linear autoencoders that can be used in a range of analyses including normalization, batch correction, gene-dropout correction, and visualization. While scVI embeddings show improved performance over traditional PCA-based analysis in these tasks, they have a non-linear relationship to the original count matrix that may distort the link between structure in the generated embedding and potentially identifiable gene programs2. A subsequent method uses a linearly decoded variational autoencoder (LDVAE), which combines a variational autoencoder with a factor model of negative binomial distributed read counts to learn an interpretable linear embedding of cell expression profiles2. However, the relationship between gene expression and cell representation is still tied to correlated variation across cells, which may confound co-varying pathway and phenotypic signatures.
Recognizing the importance of modeling the non-linearity of gene expression and the complexity of statistical dependencies between genes, several methods have adopted information theoretical approaches. Many of these methods use mutual information (MI), an information theoretic measure of the statistical dependence between two variables. ARACNE15 uses MI to prune independent and indirectly interacting genes during construction of a gene regulatory network (GRN) from microarray expression data. PIDC16 identifies regulatory relationships using partial information decomposition (PID), a measure of the dependence between triples of variables. The authors apply PIDC to relatively high-depth single-cell qPCR datasets and restrict their analysis to on the order of hundreds of genes. More recently, IQCELL17 uses MAGIC18 to impute missing gene expression, builds a GRN from pairwise MI between genes, and applies a series of filters to produce a GRN composed of only functional relationships. The authors use IQCELL to identify known causal gene interactions in scRNA-seq data from mouse T-cell and red blood cell development experiments. Because of the success of these methods, we hypothesized that MI could be combined with vector space models to produce a meaningful low dimensional representation of genes from scRNA data.
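As a rough illustration of the quantity these methods share, the sketch below estimates mutual information between two genes from the empirical joint distribution of their integer read counts across cells. It is a textbook plug-in estimator written for this example and is not taken from the GeneVector code; the function name `mutual_information` and the toy count vectors are assumptions.

```python
import numpy as np

def mutual_information(x, y):
    """Plug-in mutual information (in nats) between two non-negative integer count vectors."""
    x = np.asarray(x, dtype=int)
    y = np.asarray(y, dtype=int)
    # Joint histogram: entry (a, b) counts cells where gene X has count a and gene Y has count b.
    joint = np.zeros((x.max() + 1, y.max() + 1))
    np.add.at(joint, (x, y), 1.0)
    joint /= joint.sum()
    # Marginal distributions of each gene.
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    # Sum p(a, b) * log(p(a, b) / (p(a) p(b))) over observed count pairs only.
    nonzero = joint > 0
    return float(np.sum(joint[nonzero] * np.log(joint[nonzero] / (px @ py)[nonzero])))

# Toy count vectors over four cells (hypothetical data).
co_expressed = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
exclusive = mutual_information([0, 0, 2, 2], [3, 3, 0, 0])
```

In this toy data the mutually exclusive pair carries as much information (log 2 nats) as the perfectly co-expressed pair, while the independent pair carries none. That symmetry is the kind of dependence a correlation coefficient can understate.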
In this work, we present GeneVector (Fig. 1) as a framework for generating low dimensional embeddings constructed from the mutual information between genes. GeneVector summarizes co-expression of genes as mutual information between the probability distribution of read counts across cells. We showcase GeneVector on four scRNA datasets produced from a diverse set of experiments: peripheral blood mononuclear cells (PBMCs) subjected to interferon beta stimulation19, the Tumor Immune Cell Atlas (TICA)20, treatment naive multi-site samples from High Grade Serous Ovarian Cancers (HGSOC)21 and a time series of cisplatin treatment in patient-derived xenografts (PDX) of triple negative breast cancer (TNBC)22. We first confirm GeneVector's ability to identify putatively co-regulated gene pairs from sparse single cell expression measurements using the TICA and PBMC datasets. We demonstrate that latent space arithmetic can be used to accurately label cell types in the TICA dataset and validate our cell type predictions against published annotations. Next, we show that GeneVector can identify metagenes corresponding to cell-specific transcriptional processes in PBMCs, including interferon activated gene expression (ISG). We use vector space arithmetic to directly map metagenes to site specific changes in primary and metastatic sites in the HGSOC dataset, capturing changes in MHC class I expression and epithelial-mesenchymal transition (EMT). Finally, we show GeneVector can identify cisplatin treatment dependent transcriptional programs related to TGF-beta in TNBC PDXs.
[Fig. 1 panel labels: Counts Matrix (n genes by k cells); Joint Probability Distribution; Mutual Information I(G1, G2) (n genes by n genes); GeneVector Model (gene pair G1, G2; weights w₁, w₂; i hidden units); Model Outputs: Gene Embedding (n genes × i dimensions), Cell Embedding (k cells × i dimensions), Co-expression Graph (G1·G2 / (‖G1‖ ‖G2‖)); Downstream Analyses: Metagenes, Batch Effect Correction, cell types (B/Plasma, Myeloid, T Cell).]
Fig. 1 | GeneVector Framework. Overview of GeneVector framework starting from single cell read counts. Mutual information is computed on the joint probability distribution of read counts for each gene pair. Each pair is used to train a single layer neural network where the MSE loss is evaluated from the model output (w₁ᵀw₂) with the mutual information between genes. From the resulting weight matrix, a gene embedding, cell embedding, and co-expression similarity graph are constructed. Using vector space arithmetic, downstream analyses include identification of cell-specific metagenes, batch effect correction, and cell type classification.
Results
Defining the GeneVector framework
We trained a single layer neural network over all gene pairs to generate low dimensional gene embeddings and identify metagenes from a co-expression similarity graph. The input weights (w₁) and output weights (w₂) are updated with adaptive gradient descent (ADADELTA)23. Gene co-expression relationships are defined using mutual information (Methods: Mutual Information) computed from a joint probability distribution of expression counts. Training loss is evaluated as the mean squared error of mutual information with the model output, defined as w₁ᵀw₂. The final latent space is a matrix defined as a series of vectors for each gene.
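The training objective described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the authors' implementation: it fits all gene pairs at once (diagonal included) with plain full-batch gradient descent in place of ADADELTA, and the function name `train_gene_vectors`, the hyperparameters, and the toy target matrix are all assumptions. How the two weight matrices are combined into the final gene vectors is not stated in this section, so both are returned.

```python
import numpy as np

def train_gene_vectors(mi, dim=8, lr=0.5, epochs=500, seed=0):
    """Fit w1 @ w2.T to a gene-by-gene target matrix under a mean squared error loss."""
    rng = np.random.default_rng(seed)
    n_genes = mi.shape[0]
    # One row per gene in each weight matrix (input and output weights).
    w1 = rng.normal(scale=0.1, size=(n_genes, dim))
    w2 = rng.normal(scale=0.1, size=(n_genes, dim))
    losses = []
    for _ in range(epochs):
        # Model output for pair (i, j) is the dot product w1[i] . w2[j].
        err = w1 @ w2.T - mi
        losses.append(float(np.mean(err ** 2)))
        # Gradients of the mean squared error with respect to each weight matrix.
        grad_w1 = 2.0 * err @ w2 / err.size
        grad_w2 = 2.0 * err.T @ w1 / err.size
        w1 -= lr * grad_w1
        w2 -= lr * grad_w2
    return w1, w2, losses

# Hypothetical target: two pairs of strongly dependent genes, weak dependence across pairs.
toy_mi = np.array([
    [1.0, 0.8, 0.1, 0.1],
    [0.8, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.7],
    [0.1, 0.1, 0.7, 1.0],
])
w1, w2, losses = train_gene_vectors(toy_mi)
```

Tracking `losses` gives a quick check that the dot products are moving toward the mutual information targets. A faithful reproduction would swap in the ADADELTA update and the pairwise training loop described in the Methods.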
Gene vectors produced by the framework are useful for several fundamental gene expression analyses. Gene vectors weighted by expression in each cell are combined to generate the cell embedding for analysis of cell populations and their relationships to experimental covariates. The cell embedding can be batch corrected by using vector arithmetic to identify vectors that represent batch effects and then shift cells in the opposite direction (Methods: Batch correction). A co-expression graph is constructed in which each node is defined as a gene and each edge is weighted by cosine similarity. After generating the co-expression graph, we use Leiden clustering9 to identify metagene clusters. Further downstream analysis of the cell embedding includes phenotype assignment based on sets of marker genes and computation of the distances between cells and metagenes to highlight changes related to experimental covariates (Fig. 1).
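A minimal sketch of these three operations follows, assuming the cell embedding is an expression-weighted average of gene vectors and that a batch effect can be summarized as the difference between batch means. Both are simplifying assumptions made for illustration, since the exact procedures are given in the Methods rather than here. The function names and toy arrays are hypothetical, and Leiden clustering of the resulting graph is not shown.

```python
import numpy as np

def cell_embedding(expr, gene_vectors):
    """Expression-weighted average of gene vectors for each cell (cells x dim)."""
    expr = np.asarray(expr, dtype=float)
    totals = np.clip(expr.sum(axis=1, keepdims=True), 1e-12, None)
    return (expr / totals) @ gene_vectors

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity, usable as edge weights of a gene co-expression graph."""
    norms = np.clip(np.linalg.norm(vectors, axis=1, keepdims=True), 1e-12, None)
    unit = vectors / norms
    return unit @ unit.T

def shift_batch(cells, batch_mask, reference_mask):
    """Move one batch opposite to its mean offset from a reference batch."""
    batch_vector = cells[batch_mask].mean(axis=0) - cells[reference_mask].mean(axis=0)
    corrected = cells.copy()
    corrected[batch_mask] -= batch_vector
    return corrected

# Toy inputs: three genes embedded in two dimensions, three cells.
gene_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
expr = np.array([[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [1.0, 1.0, 2.0]])
cells = cell_embedding(expr, gene_vectors)
edges = cosine_similarity_matrix(gene_vectors)
corrected = shift_batch(cells, np.array([True, False, False]), np.array([False, True, True]))
```

Because every step is a linear operation on gene vectors, the cell embedding stays directly interpretable in terms of the genes that contribute to it.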
To perform cell type assignment, a set of known marker genes is used to generate a representative vector for each cell type, where each gene vector is weighted by the normalized and log-transformed gene expression. The cosine similarity of each possible phenotype is computed between the cell vector and the marker gene vector. SoftMax is applied to cosine distances to obtain a pseudo-probability over each phenotype (Methods: Cell type assignment). Discrete labels can be assigned to cells by selecting the phenotype corresponding to the maximum pseudo-probability. More generally, gene vectors can be composed together to describe interesting gene expression features. Cell or gene vectors can then be compared against these feature vectors to evaluate the relevance of that feature to a given cell or gene (Methods: Generation of Predictive Genes).
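The assignment procedure can be sketched as follows. This is an illustrative reading of the paragraph above rather than the published implementation: the text uses "cosine similarity" and "cosine distance" interchangeably, and the sketch applies SoftMax to similarities so that the maximum marks the closest phenotype. The helper names, the identity matrix standing in for learned gene vectors, and the toy expression values are assumptions.

```python
import numpy as np

def weighted_average(weights, vectors):
    """Average the rows of `vectors` using per-cell weights (cells x genes)."""
    totals = np.clip(weights.sum(axis=1, keepdims=True), 1e-12, None)
    return (weights / totals) @ vectors

def row_cosine(a, b):
    """Cosine similarity between matching rows of a and b."""
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return num / np.clip(den, 1e-12, None)

def assign_cell_types(expr, gene_vectors, gene_names, markers):
    """Return pseudo-probabilities (cells x phenotypes) and the top label per cell."""
    expr = np.asarray(expr, dtype=float)
    index = {gene: i for i, gene in enumerate(gene_names)}
    cell_vectors = weighted_average(expr, gene_vectors)
    labels = list(markers)
    sims = np.empty((expr.shape[0], len(labels)))
    for k, label in enumerate(labels):
        cols = [index[gene] for gene in markers[label]]
        # Marker vector: expression-weighted average of that phenotype's marker gene vectors.
        marker_vectors = weighted_average(expr[:, cols], gene_vectors[cols])
        sims[:, k] = row_cosine(cell_vectors, marker_vectors)
    # Numerically stable SoftMax across phenotypes.
    shifted = sims - sims.max(axis=1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    return probs, [labels[i] for i in probs.argmax(axis=1)]

gene_names = ["CD3D", "CD3E", "CD79A", "CD19"]
gene_vectors = np.eye(4)  # placeholder for learned gene vectors
markers = {"T cell": ["CD3D", "CD3E"], "B/Plasma": ["CD79A", "CD19"]}
expr = np.array([[3.0, 2.0, 0.0, 0.0], [0.0, 0.0, 1.0, 4.0], [1.0, 1.0, 1.0, 1.0]])
probs, labels = assign_cell_types(expr, gene_vectors, gene_names, markers)
```

The third toy cell expresses both marker sets equally and receives equal pseudo-probabilities, which is how a continuous score can flag cells with overlapping signatures that a hard label would hide.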
Robust inference of gene co-regulation with GeneVector
Our model relies on the advantages of mutual information to define relationships between genes, as opposed to other distance metrics. To evaluate how MI contributes to the observed performance of the model, we first validated that the vectors inferred by GeneVector capture semantic qualities of genes including pathway memberships and regulatory relationships. Specifically, we assessed the extent to which pairs of genes within the same pathway, or expressed within the same cell types, produced similar vector representations in the PBMC dataset. As a ground truth we computed, for each gene pair, the number of combined pathways from Reactome24 and MSigDB25,26 cell type signatures (C8) for which both genes were members. In addition to training GeneVector using raw read counts with an MI target, we trained GeneVector using normalized and log-transformed read counts with a Pearson correlation coefficient target to evaluate the relative benefit of MI on the accuracy of the model output. For both the correlation and MI based models, cosine similarities between gene vectors showed a stronger relationship with the number of shared pathways and cell type signatures than randomly shuffled gene pairs (Fig. 2A, C). Comparing the MI model and the correlation model directly, the MI model produced a much stronger relationship than the correlation based model (r² = 0.233 vs. r² = 0.093, Fig. 2B, D). To provide a pathway specific example, we found that the most similar genes by cosine similarity to IFIT1 (a known interferon stimulated gene) using the correlation objective were less coherent in terms of ISG pathway membership signal (6 of 16 genes were in the Interferon Signaling Reactome pathway R-HSA-913531) (Fig. 2E) than with mutual information (12 of 16 genes) (Fig. 2F). The greater proportion of interferon stimulated pathway genes with high similarity to IFIT1 using a mutual information objective function (Fig. 2F vs. 2E) is consistent with the improved correlation between pathway co-membership and cosine similarity over the Pearson correlation coefficient objective (Fig. 2D vs. 2B).
Next, we sought to understand whether GeneVector was able to capture relationships between genes with known interactions such as transcription factors (TF) and their targets. GeneVector cosine similarities of TF-target pairs annotated based on ChIP-Seq data27 were significantly increased relative to un-annotated gene pairs in the TICA dataset (Fig. 2G). We also considered literature annotated activator-target and repressor-target TF-target gene pairs28. As expected, activator-target pairs showed increased cosine similarity above unannotated pairs. Importantly, repressor-target pairs showed an equally strong increase in cosine similarity above unannotated pairs, highlighting GeneVector's ability to identify a diversity of statistical dependencies between co-regulated genes. By comparison, Pearson correlation coefficients of ChIP-seq annotated TF-target pairs were not significantly different from unannotated gene pairs (Fig. 2H). Both activator-target and repressor-target pairs were significantly different from unannotated pairs, though correlation was on-average positive for both repressor-target and activator-target pairs (Fig. 2H). In fact, while many repressor-target pairs had high cosine similarity indicative of a meaningful regulatory relationship, their correlation coefficients, computed with normalized and log-transformed counts, were always positive (Fig. 2I). For example, SOCS3-STAT4 exhibited the lowest correlation of annotated repressor-target pairs (r² = 0.017) and aggregating normalized and log-transformed expression across cell types showed an absence of any relationship between these genes (Fig. 2J). In contrast, analysis of activator-target pair KLF4-THBD revealed a positive correlation driven by co-expression in myeloid cells and T cells (Fig. 2K). This relationship is further evidenced when looking at the expression in more detailed cell type annotations (Fig. 2L). Identification of mutually exclusive expression is a theoretical benefit of correlation-based similarity measures; however, the sparsity of scRNA likely results in positive or low correlation even for known examples of mutual exclusivity. In summary, the vector space produced by GeneVector successfully recovers the latent similarities between functionally related genes, including negative regulators and their targets, overcoming the sparsity of scRNA data that confounds simpler approaches.
Fast and accurate cell type classification using GeneVector
Comparative analysis of gene expression programs across large cohorts of patients can potentially identify transcriptional patterns in common cell types shared between many cancers. However, classification of cell types using methods such as CellAssign29 is computationally expensive. Furthermore, the large number of covariates in these datasets makes disentangling patient-specific signals from disease and therapy difficult. GeneVector provides a fast and accurate method of cell type classification. We perform cell type classification on a subset of 23,764 cells from the Tumor Immune Cell Atlas (TICA) composed of 181 patients and 18 cancer types20. The dataset was subset to 2000 highly variable genes and the unnormalized read counts were used to train GeneVector. Cell vectors were generated by weighting each gene vector by the normalized and log-transformed expression per cell. Cell types were summarized into three main immune cell types: T cells, B/Plasma, and Myeloid cells (Fig. 3A) from the original annotations (Fig. 3B). We selected a set of gene markers for each cell type (T cells: CD3D, CD3G, CD3E, TRAC, IL32, CD2; B cells: CD79A, CD79B, MZB1, CD19, BANK1; and Myeloid cells: LYZ, CST3, AIF1, CD68, C1QA, C1QB, C1QC) based on signatures obtained from CellTypist30. For each phenotype and each cell, we computed the cosine distance to the log-normalized expression weighted average of the marker gene vectors. The pseudo-probabilities for the three cell types are generated by applying a SoftMax function to the set of cosine distances. The maximum pseudo-probability is used to classify each cell into T cell, B/Plasma, or Myeloid (Fig. 3C).
To assess performance, we computed accuracy against the coarse labels in a confusion matrix as the percentage of correctly classified cells over the total number of cells for each summarized cell type. We found 97.1% of T cells, 95% of myeloid cells, and 94.3% of B/Plasma cells were correctly classified with respect to the original annotations (Fig. 3D). Additionally, we classified cells using the same marker genes with CellAssign and found significantly decreased performance in the classification of myeloid cells (85.6%) (Fig. 3E). Overall, the percentage of cells misclassified using GeneVector (3.977%) showed improvement over CellAssign (7.54%). Using the pseudo-probabilities, GeneVector can highlight cells that share gene signatures including plasmacytoid dendritic cells (pDCs), where cell type definition is difficult31 (Fig. 3F). For each cell type, we validated that the classified cells are indeed expressing the supplied markers by showing the normalized and log-transformed expression for each marker grouped by classified cell type (Fig. 3G). For those cells GeneVector reassigned from the original annotations (Fig. 3H), we examined the mean normalized and log-transformed expression per marker gene and found that many of these cells appear misclassified in the original annotations (Fig. 3I). Cells originally annotated as T cells that were reassigned as B/Plasma by GeneVector show high expression for only B/Plasma markers (CD19, BANK1, CD79A, and CD79B). Additionally, there is evidence that many of these reassigned cells may be doublets. B/Plasma cells reassigned to T or myeloid cells show simultaneous expression of both gene markers. While no computational cell type classification can be considered ground truth, cell type assignment with GeneVector is an improvement over CellAssign and demonstrates sensitivity to cells that express overlapping cell type transcriptional signatures.
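The accuracy figures quoted above can be recomputed from the cell counts printed in the Fig. 3D and 3E confusion matrices. The short sketch below does that arithmetic. The counts are taken from the figure panels; the assumption that rows hold the TICA summarized cell type and columns the assigned type is inferred from each group of counts summing to the per-type total, and the function names are hypothetical.

```python
import numpy as np

cell_types = ["B/Plasma", "Myeloid", "T Cell"]

# Rows: TICA summarized cell type; columns: assigned cell type (same order).
genevector = np.array([
    [2640, 52, 108],
    [89, 7785, 321],
    [112, 252, 12122],
])
cellassign = np.array([
    [2710, 18, 72],
    [810, 7015, 370],
    [383, 117, 11986],
])

def per_class_accuracy(confusion):
    """Fraction of each annotated cell type that received the matching label."""
    return np.diag(confusion) / confusion.sum(axis=1)

def misclassified_fraction(confusion):
    """Fraction of all cells whose assigned label differs from the annotation."""
    return 1.0 - np.trace(confusion) / confusion.sum()

genevector_accuracy = dict(zip(cell_types, per_class_accuracy(genevector)))
cellassign_accuracy = dict(zip(cell_types, per_class_accuracy(cellassign)))
genevector_error = misclassified_fraction(genevector)
cellassign_error = misclassified_fraction(cellassign)
```

With these counts the per-class accuracies round to the 94.3%, 95.0%, and 97.1% reported for GeneVector and to 85.6% for CellAssign on myeloid cells, and the overall misclassification fractions agree with the quoted values of roughly 3.98% and 7.54%.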
Fig. 2 | Comparison of Results using Mutual Information. A–D Pathway co-membership vs. cosine similarity between gene vectors for all gene pairs in PBMCs. Each point represents one gene pair, and plots show the number of pathways (combined Reactome and MSigDB cell type signatures [C8]) that contain both genes (y-axis) and the cosine distance between the two genes (x-axis). The results show both correlation (A, B) and MI (Mutual Information) (C, D) based GeneVector. In addition to a standard set of results (B, D), a baseline relationship between pathway co-membership and cosine similarity is established by performing an identical analysis over randomly shuffled gene pairs (A, C). E Top 16 most similar genes by cosine similarity to IFIT1 using correlation coefficient. Genes in the interferon signaling pathway are colored orange. F Top 16 most similar genes to IFIT1 after training GeneVector using mutual information shows a higher number of interferon signaling pathway genes. G, H Cosine similarity and Pearson correlation coefficient for un-annotated gene pairs (n = 314090), ChIP-Seq annotated TF-targets pairs (n = 1275), and literature annotated activator (n = 26) or repressor (n = 26) TF-target pairs. The center of the box plot is denoted by the median, a horizontal line dividing the box into two equal halves. The bounds of the box are defined by the lower quartile (25th percentile) and the upper quartile (75th percentile). The whiskers extend from the box and represent the data points that fall within 1.5 times the interquartile range (IQR) from the lower and upper quartiles. Any data point outside this range is considered an outlier and plotted individually. Significance assessed using Mann-Whitney-Wilcoxon two-sided test. I Cosine similarity versus correlation coefficient for gene pairs in the TICA (Tumor Immune Cell Atlas) dataset with TF-target gene pairs highlighted (blue) and colored by activator/repressor status (green/orange respectively). J, K Linear regression of mean log-normalized expression per cell type ± 95% confidence interval for repressor TF-target pair SOCS3-STAT4 and activator TF-target pair KLF4-THBD, respectively. L Mean log-normalized expression for SOCS3-STAT4 and KLF4-THBD across annotated cell types. Source data provided as a Source Data file.
Next, we benchmarked GeneVector's cell type assignment performance with respect to four input types: raw read counts, log-transformed raw counts, normalized total counts per cell, and normalized total counts per cell with log-transformation. We evaluated each approach on the original TICA dataset, in addition to a series of datasets for which we artificially generated a library size batch effect by down sampling reads in subsets of cells (Supplementary Fig. 1A–E). Performance was evaluated by calculating cell type assignment accuracy and by ranking gene pairs by cosine similarity for annotated co-expressed and mutually exclusive gene pairs using CellTypist (Methods: Coexpressed and Mutually Exclusive Markers)30. Evaluating gene pair rankings on repeat trained models of the full TICA dataset, we find that all four preprocessing types show a similar ranking of co-expressed gene pairs (Supplementary Fig. 1A, left). Interestingly, raw counts produce significantly lower rankings for mutually exclusive gene pairs (Supplementary Fig. 1A, right). When introducing an artificial batch effect, we find that raw counts generate significantly lower rankings for mutually exclusive marker pairs (Supplementary Fig. 2B). Next, we compared the cell type prediction accuracy between normalization procedures using the same gene markers used to perform cell type prediction in the main results section on both the full and subset datasets with artificial batch effect. Raw counts significantly outperformed all other preprocessing inputs by a large margin, with both normalized and log-normalized showing very poor performance (Supplementary Fig. 2C, D). Our results suggest that normalization increases association between pairs of genes with mutually exclusive expression, resulting in negative downstream performance of cell type prediction. Cell typing appears to be sensitive to mutual exclusivity of cell type markers, and normalization produces slight increases in cosine similarity between gene vectors of mutually exclusive genes resulting in poor cell typing accuracy.
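For concreteness, the four input types and the library size perturbation could be produced along the following lines. This is a sketch under stated assumptions, not the benchmark code: the per-cell target of 10,000 counts, the use of `log1p`, and binomial thinning as the down sampling mechanism are all choices made for this example, and the function names are hypothetical.

```python
import numpy as np

def preprocessing_variants(counts, target_sum=1e4):
    """Return the four input types compared in the benchmark (cells x genes)."""
    counts = np.asarray(counts, dtype=float)
    totals = np.clip(counts.sum(axis=1, keepdims=True), 1e-12, None)
    normalized = counts / totals * target_sum  # assumed target of 10,000 counts per cell
    return {
        "raw": counts,
        "log_raw": np.log1p(counts),
        "normalized": normalized,
        "log_normalized": np.log1p(normalized),
    }

def downsample_cells(counts, cell_mask, fraction, rng):
    """Thin reads in the selected cells to mimic a library size batch effect."""
    counts = np.asarray(counts, dtype=int).copy()
    counts[cell_mask] = rng.binomial(counts[cell_mask], fraction)
    return counts

rng = np.random.default_rng(0)
toy_counts = rng.poisson(2.0, size=(6, 5))
toy_counts[:, 0] += 1  # ensure every toy cell has at least one read
variants = preprocessing_variants(toy_counts)
batch_mask = np.array([True, True, True, False, False, False])
thinned = downsample_cells(toy_counts, batch_mask, fraction=0.5, rng=rng)
```

Each variant can then be passed to the same training and cell typing steps, so that any difference in downstream accuracy is attributable to preprocessing alone.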
After learning the gene embedding, GeneVector allows rapid testing of different marker genes and phenotypes in exploratory analysis settings. Increased performance in classification is important given the large variation of markers used to define the same phenotypes across different studies. Cell type prediction can be recomputed interactively within a Jupyter notebook within twenty seconds for even large datasets on most machines (Supplementary Fig. 2). An additional advantage of having a probability is the ability to map genes from known pathways to a continuous value in each cell. In both phenotype and pathway, demonstration of continuous gradients across cells provides a measure of change and activation that cannot be seen from unsupervised clustering.
Comparison of Methods in Identification of Interferon Metagenes
in 10k Human PBMCs
To identify cell-specific metagenes related to interferon beta stimulation
and compare with transcriptional programs identified by PCA and
LDVAE loadings, we trained GeneVector using peripheral blood
mononuclear cells (PBMCs) scRNA-seq data from 6855 quality control
filtered cells composed of an interferon beta stimulated sample and a
control sample19. The raw count matrix was subset to 1000 highly
variable genes using the Seurat V3 method32 as implemented in
Scanpy7. We used previously annotated cell types generated from
unsupervised clustering as ground truth labels33. The cell embedding
used for batch correction was generated by weighting gene vectors by
the normalized, log-transformed expression in each cell. A comparison
of the uncorrected UMAP embedding (Fig. 4A) and subsequent
[Fig. 3 graphic: UMAP panels (UMAP1 vs. UMAP2) colored by coarse cell type (B/Plasma, Myeloid, T Cell), by original TICA cell type, and by pseudo-probability for each coarse cell type; confusion matrices of accuracy (%) against TICA summarized cell type; marker gene expression panels (scaled log-normalized expression) for T cell markers (CD3D, CD3G, CD3E, TRAC, IL32, CD2), B/Plasma markers (CD79A, CD79B, MZB1, CD19, BANK1), and myeloid markers (LYZ, CST3, AIF1, CD68, C1QA, C1QB, C1QC), including cells reassigned by GeneVector.]
Fig. 3 | GeneVector Accurately Classifies Cells in TICA Cell Atlas. A Cells annotated
by cell type summarized from the original TICA provided annotations.
B Original annotations provided in TICA (Tumor Immune Cell Atlas). C GeneVector
classification results for each cell type. D Confusion matrix comparing GeneVector
classification with summarized cell types. E Confusion matrix using the same gene
markers with CellAssign shows decreased performance in myeloid cells. F Pseudo-probability
values for each summarized cell type. G Marker gene log-normalized
expression over all cells grouped by GeneVector classification. H Original annotations
reassigned by GeneVector. I Marker gene mean log-normalized expression for
cell type reassignments highlights misclassifications in B/Plasma cells and doublet
transcriptional signatures in T and myeloid cells. Source data provided as a Source
Data file.
Article https://doi.org/10.1038/s41467-023-39985-2
Nature Communications | (2023) 14:4400 5
Content courtesy of Springer Nature, terms of use apply. Rights reserved
GeneVector-based batch correction (Fig. 4B) demonstrates correction
in the alignment of cell types between the two conditions. However, in
contrast to batch correction using Harmony34 (Fig. 4C), not all variation
is lost between the interferon beta stimulated and control cells.
Specifically, GeneVector does not align myeloid cell types, suggesting
a larger effect of the interferon beta stimulation treatment in these
cells. Finally, we explored the impact of cell type composition on batch
correction performance and found that CD14+ Monocytes had the
largest batch silhouette coefficient, indicating that stimulated and
control Monocytes differed the most across cell types (Supplementary
Fig. 6G).
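The construction of a cell embedding by weighting gene vectors with expression can be sketched as follows. This is a minimal illustration that assumes a simple expression-weighted average; the helper name `cell_embedding` and the toy values are hypothetical, and the actual implementation may differ in its details.

```python
def cell_embedding(expression, gene_vectors):
    """Expression-weighted average of gene vectors for a single cell.

    expression: dict mapping gene name -> normalized, log-transformed expression
    gene_vectors: dict mapping gene name -> list of floats (the gene embedding)
    """
    dim = len(next(iter(gene_vectors.values())))
    total = sum(expression.get(gene, 0.0) for gene in gene_vectors)
    if total == 0:
        # A cell with no expression of any embedded gene maps to the origin.
        return [0.0] * dim
    vec = [0.0] * dim
    for gene, weight in expression.items():
        if gene in gene_vectors:
            for i, x in enumerate(gene_vectors[gene]):
                vec[i] += weight * x
    return [x / total for x in vec]

# Toy two-dimensional gene vectors (hypothetical values).
gene_vectors = {"IFIT1": [1.0, 0.0], "LYZ": [0.0, 1.0]}

only_ifit1 = cell_embedding({"IFIT1": 3.0}, gene_vectors)
balanced = cell_embedding({"IFIT1": 2.0, "LYZ": 2.0}, gene_vectors)
empty = cell_embedding({}, gene_vectors)
```

Under this assumption a cell lies close to the vectors of the genes it expresses most strongly, so cells and genes share one space and can be compared directly by cosine similarity.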
As a method of both validation and exploration, GeneVector
provides the ability to query similarity in genes. For a given target
gene, a list of the closest genes sorted by cosine similarity can be
generated. This is useful in both validating known markers and identifying
the function of unfamiliar genes by context. The genes most
similar to IFIT1 (Fig. 4D) include a large proportion of genes found in
the Reactome pathway Interferon Signaling (R-HSA-913531) (Gillespie et al.
2022). After clustering gene vectors, we identify a single metagene that
includes these genes (IFIT1, IFIT2, IFIT3, ISG15, ISG20, TNFSF10, RSAD2,
LY6E, OAS1, and MX1). The ISG metagene can be visualized on a UMAP
generated from the gene embedding, like the familiar cell-based
visualizations common in scRNA-seq studies (Fig. 4E). The mean and
scaled log-normalized expression of each gene identified in the ISG
metagene is significantly higher in interferon beta stimulated cells over
control cells (Fig. 4F). Importantly, the increased expression is found in
each cell type, indicating a global relation to treatment.
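A similarity query of this kind amounts to ranking all other gene vectors by cosine similarity to the target. The sketch below uses plain Python lists and hypothetical two-dimensional vectors; `most_similar` is an illustrative name, not a function of the GeneVector package.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar(target, gene_vectors, top_n=3):
    """Return the top_n genes closest to `target`, sorted by cosine similarity."""
    query = gene_vectors[target]
    scored = [
        (gene, cosine(query, vec))
        for gene, vec in gene_vectors.items()
        if gene != target
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

# Toy gene embedding (hypothetical values, for illustration only).
gene_vectors = {
    "IFIT1": [0.9, 0.1],
    "IFIT3": [0.8, 0.2],
    "ISG15": [0.85, 0.05],
    "CD3D": [0.1, 0.9],
    "LYZ": [-0.2, 0.7],
}

top = most_similar("IFIT1", gene_vectors, top_n=3)
top_names = [gene for gene, _ in top]
```

In this toy embedding the two interferon stimulated genes rank ahead of the cell type markers, mirroring the way the IFIT1 query is used above to validate known pathway members.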
To compare the ISG metagene with results generated from PCA
loadings, we performed PCA on the normalized and log-transformed
gene-by-cell expression matrix (Fig. 4G) and colored the embedding by
cell type and treatment. After computing the PCA loadings using
Scanpy7, we identified the top genes by contribution score to variation
in the first and second principal components. The first principal
component (PC1) explains variation related to cell type and the differences
between myeloid (TYROBP, FCER1G, FTL, CST3) and T cells
(LTB, CCR7) (Fig. 4H). The second principal component (PC2) highlights the variation
related to interferon beta stimulation and includes the genes found in
the ISG metagene generated by GeneVector (Fig. 4I). However, the
increased effect of interferon stimulation in myeloid cells conflates
myeloid specific ISGs with the interferon signature. One such gene is
CXCL10, which shows cell type specificity to myeloid cells (Fig. 4J) and
is not found in the interferon signaling Reactome pathways. Additionally,
IFITM3 shows increased expression only in myeloid cells
within these PBMCs. In contrast, GeneVector produces a metagene
that groups myeloid specific genes into a single metagene including
CXCL10 and IFITM3. A full list of metagenes produced by GeneVector is
presented in Supplementary Fig. 3. Among these metagenes, we
identify transcriptional programs specific to each cell type and treatment
condition, including those found in the least represented cell
type Megakaryocytes (132 of 14,038 cells).
To compare the GeneVector ISG metagene with LDVAE, we
trained an LDVAE model using 10 latent dimensions for 250 epochs
with control and stimulated batch labels in the SCVI framework on the
unnormalized read counts. In contrast to the specificity of the GeneVector
ISG metagene that includes only interferon stimulated genes,
the nearest LDVAE loading mixes interferon-related genes with markers
of T cell activation (PRF1) and T cell dysfunction (LAG3) (Fig. 4K).
With respect to PCA and LDVAE loadings, GeneVector identified an ISG
[Fig. 4 graphic: UMAP panels (UMAP1 vs. UMAP2) for the GeneVector cell embedding, GeneVector batch correction, and Harmony batch correction, colored by condition (Control, Stimulated) and cell type; genes ranked by cosine similarity to IFIT1 with Reactome Interferon Signaling (R-HSA-913531) genes highlighted; ISG metagene expression (mean scaled log-normalized expression) by cell type and condition; PCA embedding (PC1 vs. PC2) with gene rankings by contribution to variance for PC1 and PC2; PCA and LDVAE ISG loadings by cell type.]
Fig. 4 | Comparing Methods with Interferon Beta Stimulated 10k PBMC.
A Uncorrected GeneVector UMAPs showing stimulated condition (left) and cell
type annotation (right) on 10k PBMCs with control and interferon beta stimulated
cells. B GeneVector batch corrected UMAPs showing stimulated condition (left)
and cell type annotation (right) indicate a stronger interferon-beta stimulated
response in myeloid cells. C Harmony batch correction applied to normalized
expression eliminates all variation related to interferon beta stimulation. D Most
similar genes by cosine similarity to IFIT1 include genes found in Reactome
interferon related pathways. E Gene embedding UMAP highlighting an ISG (interferon
stimulated genes) metagene that includes genes most similar to IFIT1.
F Scaled log-normalized expression of ISG (interferon stimulated gene) metagene
shows increased expression across stimulated cells without cell type specific
effects. G PCA embedding colored by cell type (left) and stimulation (right). H Top
genes by contribution to variance indicate PC1 defines cell type. I Top genes by
contribution to PC2 defined by ISG stimulation colored by scaled normalized, log-transformed
gene expression scaled by variable. J, K Scaled normalized, log-transformed
expression scaled by variable of ISG related loadings from PCA (left)
and LDVAE (right) includes cell type specific markers intermixed with ISGs. Source
data provided as a Source Data file.
metagene that is not confounded by cell type and includes only
interferon pathway related genes.
Metagene changes between primary and metastatic sites
in HGSOC
Studies with scRNA-seq data sampled from multiple tumor sites in
the same patient provide a wide picture of cancer progression and
spread. As these datasets grow larger and more complex, understanding
the transcriptional changes that occur from primary to
metastatic sites can help identify mechanisms that aid in the process
of the invasion-metastasis cascade. GeneVector provides a framework
for asking such questions in the form of latent space arithmetic.
By defining the difference between two sites as a vector, where the
direction defines transcriptional change, we identify metagenes
associated with expression loss and gain between primary and
metastasis sites from six patients in the Memorial Sloan Kettering
Cancer Center SPECTRUM cohort of patients with high-grade serous
ovarian cancer (HGSOC)21.
A set of 270,833 quality control filtered cells from adnexa
(primary) and bowel (metastasis) samples was processed with GeneVector
(Fig. 5A). The unnormalized counts, subset to 2000 highly
variable genes, were used as input to GeneVector, and cells were classified
to one of six cell types using gene markers curated for HGSOC
and two markers for cancer cells (EPCAM and CD24)21. We performed
cell type classification (Methods: Cell Type Assignment) and compared
GeneVector accuracy to the original annotations generated from CellAssign
for each cell type in a confusion matrix (Fig. 5B). Accuracy
reached 99.7% in three cell types with a minimum classification rate of
94.1%. In cells where GeneVector annotated differently than previous
annotations, there is evidence from the differentially expressed genes
that these cells may have been initially mislabeled. GeneVector reassigned
a subset of cancer cells to fibroblast, and the differentially
expressed genes between these cells and the cells annotated as cancer
by GeneVector highlighted fibroblast cell type markers including
COL1A1 (Supplementary Fig. 4A). In B/Plasma cells reassigned as
T cells, the differentially expressed genes highlight B cell receptor
[Fig. 5 graphic: UMAP panels (UMAP1 vs. UMAP2) showing GeneVector classification (B/Plasma, Endothelial Cell, Fibroblast, Myeloid, Ovarian Cancer Cell, T Cell) and the uncorrected and batch corrected GeneVector cell embeddings by patient (SPECTRUM-OV-007, -022, -026, -051, -082, -107) and site (ADNEXA, BOWEL); confusion matrix of GeneVector accuracy (%) against SPECTRUM annotations; Hallmark pathway combined enrichment scores for metagenes most similar to $V_{adnexa} - V_{bowel}$ and $V_{bowel} - V_{adnexa}$; pseudo-probabilities and per-patient gene module scores for EMT metagene 198 (CLDN1, RIC3, TAGLN, CLDN6) and MHCI metagene 50 (CD74, HLA-A, HLA-E, HLA-C, HLA-B, RARRES3, B2M).]
Fig. 5 | Metagenes Associated with Directional Difference in HGSOC Cancer
Cells from Adnexa to Bowel. A UMAP of HGSOC cells classified by GeneVector.
B Uncorrected UMAP of cancer cells from patients with adnexa and bowel
samples. C GeneVector batch corrected UMAP with patient labels and site labels.
D Confusion matrix of accuracy comparing SPECTRUM
annotated cell types with GeneVector classification. E Hallmark pathway enrichment
for top 30 metagenes by cosine similarity to $V_{adnexa} - V_{bowel}$. F Hallmark
pathway enrichment for top 30 metagenes by cosine similarity to $V_{bowel} - V_{adnexa}$.
G Pseudo-probabilities for metagenes associated with up-regulation in bowel to
adnexa (Epithelial-to-Mesenchymal Transition) and down-regulation (Major Histocompatibility
Class I). H EMT (Epithelial-Mesenchymal Transition) metagene
significantly upregulated in four of six patients by gene module score. I MHC Class I
(MHCI) metagene significantly downregulated in the metastatic site (bowel) in
three of six patients by gene module score. Source data provided as a Source Data
file. The center of the box plot is denoted by the median, a horizontal line dividing
the box into two equal halves. The bounds of the box are defined by the lower
quartile (25th percentile) and the upper quartile (75th percentile). The whiskers
extend from the box and represent the data points that fall within 1.5 times the
interquartile range (IQR) from the lower and upper quartiles. Any data point outside
this range is considered an outlier and plotted separately. Significance assessed
using Mann-Whitney-Wilcoxon two-sided test.
genes IGK and IGLC2 (Supplementary Fig. 4B). Finally, in B/Plasma cells
classified as myeloid, the top differentially expressed genes include
known canonical markers for macrophage/monocyte cells (TYROBP,
LYZ, CD4, and AIF1)30 (Supplementary Fig. 4C).
We recomputed the gene embedding and metagenes on only
those cells classified as cancer by GeneVector with both adnexa and
bowel samples. We found these cells exhibited large patient specific
batch effects (Fig. 5C) and applied GeneVector batch correction
(Fig. 5D, E). To understand the changes between primary and metastatic
sites, we computed an average vector for all cells from the
adnexa ($V_{adnexa}$) as the primary site and bowel ($V_{bowel}$) as the site of
metastasis. We mapped the top 30 most similar metagenes to vectors
representing expression gain in metastasis ($V_{adnexa} - V_{bowel}$) (Fig. 5F)
and expression loss ($V_{bowel} - V_{adnexa}$) (Fig. 5G). Gene Set Enrichment
Analysis (GSEA) using GSEAPY with Hallmark gene set annotations
from Enrichr35 was performed to assess whether metagenes were
enriched for genes from known pathways. Metagenes enriched for E2F
targets and Epithelial-to-Mesenchymal Transition (EMT) pathway
genes were found gained in metastasis. Conversely, the set of metagenes
representative of loss from adnexa to bowel included MHC Class
I (HLA-A, HLA-B, HLA-C, HLA-E, and HLA-F) and the transcriptional
regulator B2M, suggesting a means of immune escape via loss of MHC
Class I expression; higher immune pressure in metastatic sites may
increase the potential fitness benefit of MHC Class I loss. For both the
EMT and MHCI metagenes, pseudo-probabilities computed using
GeneVector highlight pathway activity in either site in the UMAP
embedding (Fig. 5H). Computing the gene module scoring on the
normalized, log-transformed expression36, we examined the change
between sites in each patient for the EMT and MHCI metagenes. We
found that the MHCI metagene is significantly downregulated in
metastatic sites in four of six patients (Fig. 5I). Conversely, the EMT
metagene was significantly upregulated in metastatic sites for three of
six patients (Fig. 5J). The ability to phrase questions about transcriptional
change as vector arithmetic provides a powerful platform for
more complex queries than can be performed with differential
expression analysis alone.
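The latent space arithmetic described above can be sketched in a few lines: average the cell vectors per site, subtract one mean from the other, and rank metagene vectors by cosine similarity to the resulting direction. The example below is a simplified illustration with hypothetical two-dimensional vectors and neutral site and metagene names; how metagene vectors are summarized (for example as the mean of their gene vectors) is an assumption here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_vector(vectors):
    """Element-wise mean of a list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def rank_by_direction(direction, metagene_vectors):
    """Sort metagenes by cosine similarity to a direction vector, highest first."""
    scored = [(name, cosine(direction, vec)) for name, vec in metagene_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical cell embeddings for two sites.
site_a_cells = [[1.0, 0.2], [0.8, 0.0]]
site_b_cells = [[0.2, 0.9], [0.0, 1.1]]

v_a = mean_vector(site_a_cells)
v_b = mean_vector(site_b_cells)
a_minus_b = [x - y for x, y in zip(v_a, v_b)]
b_minus_a = [-x for x in a_minus_b]

# Hypothetical metagene vectors.
metagene_vectors = {
    "metagene_1": [1.0, -1.0],
    "metagene_2": [-1.0, 1.0],
    "metagene_3": [1.0, 1.0],
}

ranked_ab = rank_by_direction(a_minus_b, metagene_vectors)
ranked_ba = rank_by_direction(b_minus_a, metagene_vectors)
```

Reversing the subtraction reverses the ranking, which is why the two difference vectors yield complementary sets of metagenes.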
It is possible to perform latent space arithmetic operations on any
embedding that is computed from a linear transformation of the gene
space, including PCA. To assess the performance of latent space
arithmetic using GeneVector and the principal components of a PCA
decomposition of the gene expression matrix, we performed the same
analysis on the cell-by-gene matrix. After clustering the gene embedding
using the Leiden algorithm, we generated a list of candidate
metagenes. We recomputed a representative vector for changes from
bowel to adnexa by computing $V_{adnexa} - V_{bowel}$ and selected the most
similar metagenes. Like the GeneVector analysis, we recovered a
metagene related to MHC Class I using the PCA embedding. However,
several of the genes within this metagene were not present in the
Reactome MHC Class I pathway (HSA-983169), in contrast to the
GeneVector results which found a metagene containing only HLA-A,
HLA-B, HLA-C, and B2M. To test if the unannotated genes defined by the
PCA embedding were members of any gene signatures which contained
HLA genes, we calculated the percentage of Reactome pathways
that include individual gene pairs and plotted this as a heatmap over all
genes in the metagene (Supplementary Fig. 5A). The genes TMEM59,
SERINC2, FOLR1, and WFDC2 were not found as a pair within any
Reactome pathway. We concluded that the PCA embedding identified
an MHC Class I metagene that was less consistent with previously
annotated pathways than GeneVector.
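The pair-level pathway check can be illustrated with a short sketch that computes, for every gene pair in a metagene, the percentage of pathways containing both genes. The pathway names and memberships below are invented purely for illustration and do not reflect actual Reactome content; the choice of denominator (all pathways supplied) is also an assumption.

```python
from itertools import combinations

def pair_comembership(genes, pathways):
    """Percentage of pathways that contain both genes, for every gene pair.

    genes: iterable of gene names in the metagene
    pathways: dict mapping pathway name -> set of member genes
    """
    n_pathways = len(pathways)
    result = {}
    for gene_a, gene_b in combinations(sorted(genes), 2):
        shared = sum(
            1 for members in pathways.values()
            if gene_a in members and gene_b in members
        )
        result[(gene_a, gene_b)] = 100.0 * shared / n_pathways
    return result

# Hypothetical pathway memberships (NOT real Reactome annotations).
pathways = {
    "pathway_1": {"HLA-A", "HLA-B", "B2M"},
    "pathway_2": {"HLA-A", "B2M"},
    "pathway_3": {"FOLR1"},
    "pathway_4": {"HLA-B"},
}
metagene = ["HLA-A", "HLA-B", "B2M", "TMEM59"]

comembership = pair_comembership(metagene, pathways)
unpaired = sorted(pair for pair, pct in comembership.items() if pct == 0.0)
```

Arranging these percentages in a gene-by-gene matrix gives the kind of heatmap referred to above, with rows of zeros marking genes that share no annotated pathway with the rest of the metagene.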
Given that the metagenes are computed using Leiden clustering,
the resolution parameter affects the coarseness of the PCA embedding.
We performed a parameter sweep over resolution values and
found that the genes identified by the PCA analysis are robustly
clustered together over a wide range of resolutions (Supplementary
Fig. 5B). In comparison, GeneVector identifies a metagene containing
HLA-A, HLA-B, HLA-C, and B2M over a wide range of resolution
parameters (Supplementary Fig. 5C). The interval of values used for
the parameter sweep was selected to keep metagene membership
between three and 50 genes and represents the most reasonable
range of values for generating metagenes of this size. To understand
why non MHCI-annotated genes appear in the context of MHC genes
in the PCA embedding, we looked at differentially expressed genes
up-regulated in adnexa over bowel. We found that while these
genes do not appear in Reactome pathways together, they do appear
significantly differentially expressed between adnexa and bowel.
Here, we draw the conclusion that PCA combines multiple pathways
to explain as much variance as possible when constructing
successive orthogonal PCA components. As a result, PCA will
combine multiple underlying sources of variance such as
bowel/adnexa variation, and MHC I variation between cells. By contrast,
GeneVector has no orthogonality constraints and is not constructed
to maximize variance explained by individual vectors,
allowing GeneVector to decompose pathway specific metagenes in a
more flexible manner.
Metagenes associated with cisplatin treatment resistance
Understanding the transcriptional processes that generate resistance
to chemotherapies has immense clinical value. However, the transcriptional
organization of resistance is complex with many parallel
mechanisms contributing to cancer cell survival37. We analyzed longitudinal
single cell RNA-seq collected from a triple negative breast
cancer patient-derived xenograft model (SA609 PDX) along a treated
and untreated time series22. Using a total of 19,799 cancer cells with
treatment and timepoint labels (Fig. 6A), GeneVector was trained using
unnormalized read counts to generate metagenes that identify programs
potentially related to cisplatin resistance. For each metagene,
we computed gene module scores over normalized and log-transformed
expression36 over the four timepoints (X1, X2, X3, and
X4) within the treated and untreated cells. Using these scores, we
calculated linear regression coefficients over the four time points and
selected candidate chemotherapy resistant and untreated metagenes
($\beta_{treated}$ and $\beta_{untreated}$) over a coefficient threshold ($\beta_{treated} > 0.1$ and
Bonferroni adjusted p = 0.001) with an untreated coefficient less than
the treated coefficient ($\beta_{treated} > \beta_{untreated}$) (Fig. 6B). Mean log-normalized
expression per gene was computed for each timepoint in
five metagenes that were identified as treatment specific (Fig. 6C).
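The selection rule can be sketched as an ordinary least-squares slope per metagene and condition, followed by the two coefficient filters. The example below assumes one summary score per timepoint and omits the Bonferroni-adjusted significance test for brevity; the metagene names and scores are hypothetical.

```python
def slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

def select_candidates(scores, timepoints, threshold=0.1):
    """Keep metagenes with beta_treated > threshold and beta_treated > beta_untreated."""
    selected = []
    for name, by_condition in scores.items():
        beta_treated = slope(timepoints, by_condition["treated"])
        beta_untreated = slope(timepoints, by_condition["untreated"])
        if beta_treated > threshold and beta_treated > beta_untreated:
            selected.append(name)
    return selected

timepoints = [1, 2, 3, 4]  # four ordered timepoints

# Hypothetical gene module scores per timepoint.
scores = {
    "metagene_a": {"treated": [0.1, 0.3, 0.5, 0.7], "untreated": [0.1, 0.1, 0.1, 0.1]},
    "metagene_b": {"treated": [0.5, 0.5, 0.5, 0.5], "untreated": [0.5, 0.5, 0.5, 0.5]},
    "metagene_c": {"treated": [0.1, 0.3, 0.5, 0.7], "untreated": [0.1, 0.4, 0.7, 1.0]},
}

beta_a = slope(timepoints, scores["metagene_a"]["treated"])
candidates = select_candidates(scores, timepoints)
```

Only the metagene whose score rises under treatment and stays flat without it passes both filters; a metagene that rises faster in untreated cells is excluded even though its treated slope exceeds the threshold.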
Pathway enrichment on each of the five metagenes using GSEAPY
with Hallmark gene set annotations from Enrichr35 showed that metagene
24 was enriched for TGF-beta signaling (Fig. 6D), a pathway frequently
up-regulated during chemo-resistance38. This metagene
includes genes ID1, ID2, ID3, ID4, EPCAM, and FOXP1, all of which have
been found to be overexpressed in chemotherapy-resistant samples in
several cancer types39–41. GeneVector also identified these genes in
global expression differences between treated and untreated cells
from the set of most similar genes to vectors $V_{treated}$ and $V_{untreated}$
(Fig. 6E). Several studies have implicated multiple resistance
mechanisms involving FOXP1 including transcriptional regulation,
immune response, and MAPK signaling42–44. Additionally, GeneVector
identifies EPCAM, whose high expression has been associated with
increased viability of cancer cells in diverse cancer types45. EPCAM has
been shown to have a role in resistance to chemotherapy in both breast
and ovarian cancers through WNT signaling and Epithelial-Mesenchymal
Transition (EMT)46,47.
Discussion
In this paper we propose GeneVector, a method for building a latent
vector representation of scRNA expression capturing the relevant
statistical dependencies between genes. By borrowing expression
signal across genes, GeneVector overcomes sparsity and produces an
information dense representation of each gene. The resulting vectors
can be used to generate a gene co-expression graph, and can be
clustered to predict transcriptional programs, or metagenes, in an
unsupervised fashion. Metagenes can be related to a cell embedding to
identify transcriptional changes related to conditional labels or time
points. We show that gene vectors can be used to annotate cells with a
pseudo-probability, and that these labels are accurate with respect to
previously defined cell types.
We show that a single GeneVector embedding can be used for
many important downstream analyses. In interferon-stimulated
PBMCs, we identify a cell type independent ISG metagene that summarizes
interferon-stimulation across cell types and is not conflated
with cell type signature. We demonstrate accurate cell type assignment
across 18 different cancers in 181 patients described in the Tumor
Immune Cell Atlas (TICA). In high grade serous ovarian cancer, we
identify metagenes that describe transcriptional changes from primary
to metastatic sites. Our results implicate the loss of MHC class I gene
expression as a potential immune escape mechanism in ovarian cancer
metastasis. In cisplatin treated TNBC PDXs, GeneVector uncovers
transcriptional signatures active in drug resistance, most notably
metagenes enriched in TGF-beta signaling. This signaling pathway is a
cornerstone in cancer progression since it promotes EMT
and invasion in advanced cancers; it is the target of various therapies,
but success has been mixed48, making it even more important to
employ tools that identify the multitude of players contributing to
therapy response.
GeneVector can produce an interpretable batch correction by
decomposing the derived correction vectors into transcriptional signatures.
Using both the full PBMC dataset and simulated mixtures of
CD14+ monocytes and CD4 T cells, we systematically evaluated the
quality of batch correction using a series of benchmarking metrics
(ARI, kBET, cLISI, and silhouette score)49,50 (Supplementary Fig. 6A–I).
For the interferon beta stimulated PBMC dataset, we expect the proper
batch correction to include ISGs, and indeed the batch correction
vector had high cosine similarity to an ISG metagene. In simulated
mixtures of cell types, we found that the similarity of the batch correction
vector to ISGs was highest when the ratio of cell types was
balanced in the unstimulated and stimulated batches, and we obtained the
best batch correction performance for balanced batches (Supplementary
Fig. 6J). By contrast, when cell types were imbalanced, batch
correction vectors had high similarity to cell type specific vectors
(Supplementary Fig. 6K, L). Batch correction is challenging in the
presence of batch-specific differences in cell type composition. As
opposed to other batch correction methods, the GeneVector implementation
is highly interpretable, so users can verify whether batch
correction vectors are similar to cell type signatures or other biologically
relevant metagenes.
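The interpretability argument can be made concrete with a simplified sketch. It assumes that the correction vector is the difference between the mean cell vector of a batch and that of a reference batch, and that correction subtracts this vector from every cell in the batch; this mean-shift formulation and all names and values are assumptions for illustration and may differ from the procedure given in Methods.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_vector(vectors):
    """Element-wise mean of a list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def correct_batch(batch_cells, reference_cells):
    """Shift a batch so that its mean cell vector matches the reference mean."""
    shift = [
        b - r
        for b, r in zip(mean_vector(batch_cells), mean_vector(reference_cells))
    ]
    corrected = [[x - s for x, s in zip(cell, shift)] for cell in batch_cells]
    return corrected, shift

# Hypothetical cell embeddings: axis 0 loosely stands for an interferon response.
control = [[0.0, 1.0], [0.2, 0.8]]
stimulated = [[1.0, 1.1], [1.2, 0.9]]

corrected, shift = correct_batch(stimulated, control)

# Interpret the correction vector by comparing it with a metagene vector.
isg_metagene = [1.0, 0.0]  # hypothetical ISG metagene vector
similarity = cosine(shift, isg_metagene)
```

Because the correction vector lives in the same space as gene and metagene vectors, its cosine similarity to an ISG metagene or to a cell type signature indicates what the correction is removing.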
Identifying correlations across scRNA data is a fundamental analysis
task, necessary for identifying cells with similar phenotype or
activity, or genes with similar pathways or functional relationships. As
has been shown by us and in previous work15,16,51, the sparsity and non-linearity
of scRNA data impact the performance of both standard
measures of correlations between variables and global analysis of
assumed linear correlations using PCA. While some methods tailor
complex custom probabilistic models to the specific properties of
scRNA data, GeneVector instead builds upon MI, a simple yet powerful
tool for calculating the amount of information shared between two
variables. We show that MI and the vector space trained from the MI
matrix both capture relevant gene pair relationships including
between TF activators and repressors and their targets. Nevertheless,
GeneVector is unable to discern repression from activation, as it builds
MI that is agnostic to the direction of the statistical dependency.
Pearson correlation, theoretically sensitive to the direction of a
dependency, also performs poorly. Due to the high level of sparsity,
absence of expression is not a significant event and repressed
expression could just as easily be explained by under-sampling an
expressed gene. We suggest that identifying negative regulation and
[Fig. 6 graphic: UMAP panels (UMAP1 vs. UMAP2) of the PDX time series by timepoint (X4, X5, X6, X7) and treatment status (Treated, Untreated); treated versus untreated regression coefficients per metagene with treatment specific metagenes labeled; Hallmark pathway combined enrichment scores for treatment specific metagenes 179, 59, 24, 135, and 188; mean expression per gene in each treated and untreated timepoint; most similar genes by cosine similarity to $V_{treated}$ and $V_{untreated}$.]
Fig. 6 | Analysis of Metagenes in Cisplatin-treated PDX Time Series. A UMAP of
cisplatin treated PDX (patient derived xenograft) cells annotated by time point
(left) and treatment status (right). BMetagenes plotted with respect to regression
coefcients (β
treated
and β
untreated
) over four timepoints in either treated or
untreated cells with Bonferroni adjusted p-values= 0.001. CHallmark combined
enrichment scores for candidate chemo resistant metagenes. DGene expression
proles for each timepoint for the ve metagenes associated with increase in
treatment. Metagene 24 is associated with TGF-beta signaling includes EPCAM,
FOXP1,andID family genes with known cisplatin resistancefunction. EMost similar
genes for treated and untreated cells computed from cosine similarity to global
vectors. Source data provided as a Source Data le.
Article https://doi.org/10.1038/s41467-023-39985-2
Nature Communications | (2023) 14:4400 9
Content courtesy of Springer Nature, terms of use apply. Rights reserved
mutually exclusive expression is one of the more difcult problems in
scRNA-seq analysis.
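The direction-agnostic property of MI discussed above can be illustrated with a minimal sketch. It assumes idealized, already-binned toy data with no dropout, so it shows only why MI cannot separate activation from repression; it does not reproduce the poor behavior of Pearson correlation on sparse counts. The variable names and values are hypothetical, and the natural logarithm is an arbitrary choice.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical MI (in nats) between two equally long discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical binned expression levels (0 = low, 1 = mid, 2 = high).
regulator = [0, 0, 1, 1, 2, 2, 0, 1, 2, 2]
activated = list(regulator)                     # target rises with the regulator
repressed = [2 - level for level in regulator]  # target falls as the regulator rises

mi_act = mutual_information(regulator, activated)
mi_rep = mutual_information(regulator, repressed)
r_act = pearson(regulator, activated)
r_rep = pearson(regulator, repressed)

print(f"MI activated={mi_act:.4f}, MI repressed={mi_rep:.4f}")
print(f"Pearson activated={r_act:.2f}, Pearson repressed={r_rep:.2f}")
```

Relabeling the bins of the target leaves the joint distribution unchanged up to a permutation, so both MI values coincide while the sign of the correlation flips.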
As shown with correlation, the objective function employed in
training GeneVector has a significant effect on the resulting gene
embedding. Mutual information calculated empirically from the
histogram of binned expression counts for gene pairs is limited by the
available number of cells, fidelity of the counts, and discretization
strategy. By modeling the underlying distribution for each gene more
accurately, the joint probability distribution between genes can more
accurately reflect expression-based relationships and improve model
results. Additionally, while only a one-time cost, the MI calculation is
computationally expensive. By improving the calculation of mutual
information, others have achieved improved performance in related
tasks including the identification of GRNs⁵².
The high dimensionality of scRNA and the vast complexity of
biological systems to which it is applied necessitate analytical tools
that facilitate intuitive and efficient data exploration and produce
easily interpretable results. GeneVector performs upfront computation
of a meaningful low dimensional representation, transforming
sparse and correlated expression measurements into a concise vector
space summarizing the underlying structure in the data. The resulting
vector space is amenable to intuitive vector arithmetic operations that
can be composed into higher level analyses including cell type
classification, treatment related gene signature discovery, and identification
of functionally related genes. Importantly, the vector arithmetic
operations and higher-level analyses can be performed interactively,
allowing for faster iteration in developing cell type and context specific
gene signatures or testing hypotheses related to experimental
covariates. GeneVector is implemented as a Python package available on
GitHub (https://github.com/nceglia/genevector) and installable
via pip.
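The kind of vector arithmetic described above can be sketched in a few lines. This is an illustration under stated assumptions, not the GeneVector API: the gene names, the three-dimensional vectors, and the choice of averaging marker vectors to form a signature are all hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_vector(vectors):
    """Component-wise average of a list of vectors."""
    return [sum(components) / len(vectors) for components in zip(*vectors)]

def most_similar(query, gene_vectors, top_n=3):
    """Rank genes by cosine similarity to a query vector, highest first."""
    scored = [(gene, cosine_similarity(query, vec))
              for gene, vec in gene_vectors.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_n]

# Hypothetical gene embedding (real embeddings have many more dimensions).
gene_vectors = {
    "GENE_A": [0.9, 0.1, 0.0],
    "GENE_B": [0.8, 0.2, 0.1],
    "GENE_C": [0.0, 1.0, 0.1],
    "GENE_D": [0.1, 0.9, 0.0],
    "GENE_E": [0.0, 0.1, 1.0],
}

# Assumed signature: the average of two marker gene vectors.
signature = mean_vector([gene_vectors["GENE_A"], gene_vectors["GENE_B"]])
ranked = most_similar(signature, gene_vectors, top_n=3)

for gene, score in ranked:
    print(f"{gene}: {score:.3f}")
```

Because the embedding is computed once up front, queries of this form are cheap, which is what makes interactive iteration on signatures practical.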
Methods
Dataset preprocessing
Each single cell RNA-seq dataset in this study was processed using the
Scanpy Python library⁷. All highly variable gene selection was
performed using the Seurat V3⁶ method implemented in Scanpy. We
performed cell type classification on 23,764 cells from the Tumor
Immune Cell Atlas (TICA)²⁰ using raw read counts from 2,000 highly
variable genes after removal of non-protein coding genes. We performed
analysis of 6,855 cells composed of an interferon beta stimulated
sample and a control sample using raw counts and cell type annotations
obtained from the SeuratData R package³³ and subset to 1,000
highly variable genes. The fitness PDX dataset²² consisted of raw
counts from 19,799 cancer cells with treatment and timepoint labels
subset to 2,000 highly variable genes. Coefficients and p-values were
computed over gene module scores using the statsmodels Python
package. Cell type classification and vector arithmetic for the changes
between metastatic and primary sites were performed on 270,833 cells
from adnexa (primary) and bowel (metastasis) samples using 2,000
highly variable genes. Normalization comparisons were made using
the Scanpy normalize_total and log1p functions. All subsampling of
datasets for simulated batch effects was performed using the Scanpy
subsample function.
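For readers unfamiliar with the normalization being compared against raw counts, the arithmetic behind total-count scaling followed by a log(1 + x) transform can be sketched as follows. This is a plain-Python illustration of the two steps, not Scanpy's implementation; the helper names mirror the Scanpy functions only for readability, and the count matrix and the target sum of 100 are hypothetical.

```python
import math

def normalize_total(counts, target_sum):
    """Scale each cell (row) so that its counts sum to target_sum."""
    normalized = []
    for cell in counts:
        total = sum(cell)
        scale = target_sum / total if total > 0 else 0.0
        normalized.append([value * scale for value in cell])
    return normalized

def log1p(matrix):
    """Apply the natural-log transform log(1 + x) to every entry."""
    return [[math.log1p(value) for value in cell] for cell in matrix]

# Hypothetical raw counts: rows are cells, columns are genes.
raw = [
    [10, 0, 30],  # deeply sequenced cell
    [1, 0, 3],    # same proportions at one tenth of the depth
    [0, 5, 5],
]

normalized = normalize_total(raw, target_sum=100)
logged = log1p(normalized)

for cell in logged:
    print([round(value, 3) for value in cell])
```

After scaling, the first two cells become identical even though their raw counts differ tenfold, which is the depth effect that normalization removes and that raw-count analyses must tolerate.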
Gene expression mutual information
In NLP applications, vector space models are trained by defining an
association between words that appear in the same context. In single
cell RNA sequencing data, we can redefine this textual context as
co-expression within a given cell and mutual information across cells. The
simplest metric to define association is the overall number of
co-expression events between genes. However, the expression profiles
over cells may differ due to both technical and biological factors. To
summarize the variability in this relationship, we generate a joint
probability distribution on the co-occurrence of read counts. The
ranges of each bin are defined separately for each gene based on a
user-defined number of quantiles. By defining the bin ranges separately, the
lowest counts in one gene can be compared directly to the lowest
counts in another gene without need for further normalization. Using
the joint probability distribution, we compute the mutual information
between genes defined in Eq. 1. The mutual information value is
subsequently used as the target in training the model, allowing us to
highlight the relationship between genes as a single-valued quantity.
$$
I(G_i, G_j) = \sum_{i}^{n} \sum_{j}^{n} p(G_i, G_j) \log\left( \frac{p(G_i, G_j)}{p(G_i)\, p(G_j)} \right) \qquad (1)
$$
Equation 1: Mutual information between G_i and G_j computed on
the empirical joint probability distribution over expression bins for G_i
and G_j.
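The binning and mutual information computation described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the GeneVector implementation: the function names, the way quantile edges are chosen, the handling of tied edges, and the use of the natural logarithm are all assumptions, and the count vectors are hypothetical.

```python
import math
from bisect import bisect_right
from collections import Counter

def quantile_edges(counts, n_bins):
    """Interior bin edges for one gene, taken at its own count quantiles."""
    ordered = sorted(counts)
    edges = {ordered[(k * len(ordered)) // n_bins] for k in range(1, n_bins)}
    return sorted(edges)  # tied edges collapse when many cells share a count

def discretize(counts, n_bins):
    """Map each cell's count to a bin index using gene-specific edges."""
    edges = quantile_edges(counts, n_bins)
    return [bisect_right(edges, count) for count in counts]

def mutual_information(bins_i, bins_j):
    """MI (in nats) from the empirical joint distribution over bin pairs."""
    n_cells = len(bins_i)
    joint = Counter(zip(bins_i, bins_j))
    marg_i, marg_j = Counter(bins_i), Counter(bins_j)
    mi = 0.0
    for (b_i, b_j), count in joint.items():
        p_ij = count / n_cells
        p_i, p_j = marg_i[b_i] / n_cells, marg_j[b_j] / n_cells
        mi += p_ij * math.log(p_ij / (p_i * p_j))
    return mi

# Hypothetical raw counts for three genes across ten cells.
gene_a = [0, 0, 1, 2, 5, 8, 0, 3, 9, 4]
gene_b = [10 * count for count in gene_a]  # same profile at ten times the scale
gene_c = [3, 0, 7, 1, 0, 2, 9, 5, 0, 4]    # unrelated profile

n_bins = 3
bins_a, bins_b, bins_c = (discretize(g, n_bins) for g in (gene_a, gene_b, gene_c))

mi_aa = mutual_information(bins_a, bins_a)
mi_ab = mutual_information(bins_a, bins_b)
mi_ac = mutual_information(bins_a, bins_c)
print(f"MI(a,a)={mi_aa:.4f}  MI(a,b)={mi_ab:.4f}  MI(a,c)={mi_ac:.4f}")
```

Because the edges are derived per gene, gene_b receives exactly the same bin assignments as gene_a despite its tenfold larger counts, so no additional normalization is needed before comparing the two; the unrelated gene_c shares far less information with gene_a.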