Figure - available from: Genes
Validation of SCINA on real data. (a) SCINA identified the cell types of CD45+ single cells enriched from Renal Cell Carcinomas (RCCs). Dendritic cells were left out of this analysis, as they could be of either lymphoid or myeloid lineage. (b) SCINA identified the cell types in a pool of cells composed of B cells, monocytes, NK cells, and a “pseudo” unknown cell type. (c) SCINA was used to analyze the mouse CyTOF data collected each day following gland injury, which profiled an average of 389,777 cells at each time point. (d) t-SNE was used to analyze the same mouse CyTOF dataset. The cells were colored by cell types assigned by SCINA.
Source publication
Advances in single-cell RNA sequencing (scRNA-Seq) have allowed for comprehensive analyses of single cell data. However, current analyses of scRNA-Seq data usually start from unsupervised clustering or visualization. These methods ignore prior knowledge of transcriptomes and the probable structures of the data. Moreover, cell identification heavily...
Citations
... The manual cell-type annotation process is labour-intensive, expert-dependent, and not scalable to large datasets [6]. In contrast, automated cell-type annotation methods are scalable to large datasets and are less susceptible to human error [7][8][9]. Automated cell-type annotation methods can be broadly classified into two categories: marker-based and reference-based [10]. Marker-based methods extract cell-type-specific markers from literature studies or cell-type marker databases such as PanglaoDB [11], ACT database [12], and CellMarker database [13], and then classify cells based on the expression levels of these markers [14]. ...
... Most of these rely on either marker-based or reference-based approaches. Marker-based methods, including SCINA [8], ScType [7], Garnett [21], and scSorter [22], as well as reference-based tools, such as SingleR [17] and Seurat [19], have been developed to effectively annotate cell types in scRNA-seq datasets. Methods such as AtacAnnoR [23] and CellCano [24] were explicitly designed for annotation of scATAC-seq datasets. ...
... Most of the marker-based methods assume that marker gene sets should exhibit higher expression in the corresponding cell type. Among the marker-based tools, SCINA uses a Gaussian mixture model, assuming that marker gene sets should exhibit higher expression in the corresponding cell type [8]. ScType utilizes positive and negative marker sets to categorize the user-defined clusters. ...
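The assumption quoted above, that a marker gene set should be more highly expressed in its own cell type, can be illustrated with a minimal Python sketch. This is not SCINA's Gaussian mixture model or ScType's scoring: it is a deliberately simple stand-in that averages marker expression per cell and picks the best-scoring type. The function name `assign_cell_types`, the `min_score` cutoff, and the toy expression values are all hypothetical and chosen only for illustration.

```python
from statistics import mean


def assign_cell_types(expression, markers, min_score=0.0):
    """Label each cell with the cell type whose markers have the highest mean expression.

    expression: dict mapping cell id -> dict of gene -> expression value
    markers:    dict mapping cell type -> non-empty list of marker genes
    min_score:  hypothetical cutoff; cells whose best score is not above it
                are labelled "unknown"
    """
    labels = {}
    for cell, profile in expression.items():
        # Mean expression of each type's marker genes (missing genes count as 0).
        scores = {
            cell_type: mean(profile.get(gene, 0.0) for gene in genes)
            for cell_type, genes in markers.items()
        }
        best = max(scores, key=scores.get)
        labels[cell] = best if scores[best] > min_score else "unknown"
    return labels


# Toy data: values are invented for illustration only.
markers = {"B cell": ["CD19", "MS4A1"], "NK cell": ["NKG7", "GNLY"]}
expression = {
    "cell_1": {"CD19": 2.1, "MS4A1": 1.8, "NKG7": 0.0, "GNLY": 0.1},
    "cell_2": {"CD19": 0.0, "MS4A1": 0.1, "NKG7": 3.0, "GNLY": 2.5},
    "cell_3": {"CD19": 0.0, "MS4A1": 0.0, "NKG7": 0.0, "GNLY": 0.0},
}
labels = assign_cell_types(expression, markers, min_score=0.5)
print(labels)
```

A probabilistic model such as the one SCINA is described as using would replace the fixed cutoff with fitted expression distributions; the sketch only shows the shape of the marker-based idea.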
Cell-type annotation remains a major challenge in single-cell and spatial omics analysis. Most existing methods rely on single-cell RNA sequencing (scRNA-seq) references or predefined marker sets. However, the scarcity of high-quality scRNA-seq references and marker sets makes relying on a single approach prone to bias and limits usability. Furthermore, available methods for cell-type annotation in single-cell ATAC-sequencing (scATAC-seq) and spatial transcriptomics datasets perform poorly. Here, we present ScInfeR, a graph-based cell-type annotation method that combines information from both scRNA-seq references and marker sets. By integrating these two data sources, ScInfeR can accurately annotate a broad range of cell types. It employs a hierarchical framework inspired by message-passing layers in graph neural networks to accurately identify cell subtypes. ScInfeR is highly versatile, supporting cell annotation across scRNA-seq, scATAC-seq, and spatial omics datasets. For scATAC-seq, it effectively utilizes chromatin accessibility data, while for spatial transcriptomics, it incorporates spatial coordinate information. Additionally, ScInfeR supports weighted positive and negative markers, allowing users to define marker importance in cell-type classification. Our extensive benchmarking across multiple atlas-scale scRNA-seq, scATAC-seq, and spatial datasets, evaluating 10 existing tools in over 100 cell-type prediction tasks, demonstrated ScInfeR’s superior performance. Notably, it exhibits robustness against batch effects arising in these datasets. To facilitate seamless annotation, we developed ScInfeRDB, an interactive database containing manually curated scRNA-seq references and marker sets for 329 cell types, covering 2497 gene markers in 28 tissue types from human and plant. ScInfeR is available as an R package, with both the tool and database publicly accessible at https://www.swainasish.in/scinfer.
... Samples with greater than 500 cells were carried forward for downstream analyses. Semi-supervised clustering was performed using SCINA [119] (https://github.com/jcao89757/SCINA) with default parameters, rm_overlap = FALSE, allow_unknown = FALSE, and reference signatures were informed by markers from [4] and [15] (Supplemental Table S1). ...
Lung vasculature arises from both pulmonary and systemic (bronchial) circulations. Remodeling and structural changes in lung vasculature have been recognized in end‐stage fibrotic lung diseases such as idiopathic pulmonary fibrosis (IPF) but have not been well characterized. The vasculature that expands and supplies lung cancers is better described, with the recent recognition that systemic bronchial circulation expands to be the main blood supply to primary lung tumors. Here, we use publicly available single‐cell RNA‐sequencing (scRNA‐seq) data to compare vascular endothelial cell (EC) populations in multiple progressive interstitial lung diseases (ILD) and non‐small cell lung cancer (NSCLC) to identify common and distinct features. Lung tissue specimens were collected from healthy lung tissue (n = 59), ILD (n = 97), chronic obstructive pulmonary disease (n = 22), and NSCLC (n = 8). We identify two subtypes of expanded EC populations in both ILD and NSCLC, “Bronch‐1” and “Bronch‐2”, expressing transcripts associated with venules and angiogenic tip/stalk cells, respectively. Relative to pulmonary capillary and arterial ECs, bronchial ECs show low expression of transcripts associated with vascular barrier integrity. The pan‐bronchial EC marker COL15A1 showed positive staining in lung parenchyma from patients with IPF, SSc‐ILD, and NSCLC, whereas positive staining was limited to subpleural and peri‐bronchial regions in non‐fibrotic controls. In conclusion, expansion of a subset of ECs expressing markers of the bronchial circulation is one of the most pronounced changes in vascular cell composition across multiple ILDs and NSCLC. These data support additional studies to determine the role of the bronchial vasculature in ILD progression.
... These developments continue to expand the capabilities of scRNA-seq data analysis, enabling more precise insights into cellular diversity and function. Tools such as cellMeSH [17], CellAssign [18], scCATCH [19], SCINA [20], SCSA [21], scSorter [22], and scType [23] perform clustering first and then assign a cell type identity to each cluster. ...
Single-cell ribonucleic acid (RNA) sequencing (scRNA-seq) produces vast amounts of individual cell profiling data. Its analysis presents a significant challenge in accurately annotating cell types and their associated biomarkers. Different pipelines based on deep neural network (DNN) methods have been employed to tackle these issues. These pipelines have arisen as a promising resource and can extract meaningful and concise features from noisy, diverse, and high-dimensional data to enhance annotations and subsequent analysis. Existing tools require high computational resources to execute large sample datasets. We have developed a cutting-edge platform known as scaLR (Single-cell analysis using low resource) that efficiently processes data into feature subsets, samples in batches to reduce the required memory for processing large datasets, and runs DNN models on multiple central processing units. scaLR is equipped with data processing, feature extraction, training, evaluation, and downstream analysis. Its novel feature extraction algorithm first trains the model on a feature subset and stores the importance of all the features in that subset. At the end of the training of all subsets, the top-K features are selected based on their importance. The final model is trained on top-K features; its performance evaluation and associated downstream analysis provide significant biomarkers for different cell types and diseases/traits. Our findings indicate that scaLR offers comparable prediction accuracy and requires less model training time and computational resources than existing Python-based pipelines. We present scaLR, a Python-based platform, engineered to utilize minimal computational resources while maintaining comparable execution times and analysis costs to existing frameworks.
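The subset-wise feature extraction described in this abstract (score features one subset at a time, store their importance, then keep the top-K overall) can be sketched as follows. This is not scaLR's code or API. As a labelled assumption, the per-subset "model" is replaced by a trivial importance measure, the absolute difference in class means, so the example runs with the standard library only. All names (`chunked`, `subset_importance`, `select_top_k`) and the toy data are hypothetical.

```python
from statistics import mean


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def subset_importance(rows, labels, features):
    """Stand-in for training a model on one feature subset.

    Assumption: importance is the absolute difference between the mean of a
    feature in class 1 and in class 0. A real pipeline would train a model
    here and read importances from it.
    """
    scores = {}
    for feature in features:
        positives = [row[feature] for row, label in zip(rows, labels) if label == 1]
        negatives = [row[feature] for row, label in zip(rows, labels) if label == 0]
        scores[feature] = abs(mean(positives) - mean(negatives))
    return scores


def select_top_k(rows, labels, feature_names, subset_size, k):
    """Score features subset by subset, then return the k most important ones."""
    importance = {}
    for subset in chunked(feature_names, subset_size):
        # Only one subset of features is scored at a time, which bounds memory use.
        importance.update(subset_importance(rows, labels, subset))
    ranked = sorted(importance, key=importance.get, reverse=True)
    return ranked[:k]


# Toy data: four cells, four genes, binary labels (values invented for illustration).
rows = [
    {"g1": 5.0, "g2": 1.0, "g3": 0.2, "g4": 3.0},
    {"g1": 4.8, "g2": 1.1, "g3": 0.1, "g4": 2.0},
    {"g1": 0.2, "g2": 0.9, "g3": 0.3, "g4": 0.5},
    {"g1": 0.1, "g2": 1.0, "g3": 0.2, "g4": 0.5},
]
labels = [1, 1, 0, 0]
top_features = select_top_k(rows, labels, ["g1", "g2", "g3", "g4"], subset_size=2, k=2)
print(top_features)
```

In the workflow the abstract describes, the final model would then be trained on the selected features only.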
... The differentially expressed marker genes between healthy and disease samples, as well as among different cell types, were identified through FindAllMarkers algorithm with default parameters [42]. SCINA method was used for cell type annotation [43,44], which is a marker-based automated annotation tool that uses predefined marker gene expression to label cell clusters [44]. To this end, we screened specific markers for different cell types from the CellMarker (2.0) database, including markers for endocrine cells (such as α, β, δ cells) and non-endocrine cells (such as stellate cells, endothelial cells, immune cells, and acinar cells) [45,46]. ...
Diabetes is a complex disease that involves multiple molecular mechanisms. Recent advances in multi-omics sequencing techniques have significantly enhanced the understanding of the pathogenesis of diabetes. To address the critical need for molecular resources in diabetes research, we present DiabetesOmic (https://bio.liclab.net/diabetesOmicdb/), a comprehensive multi-omics database designed to collect and analyze transcriptional regulatory information across five high-throughput sequencing modalities, including ChIP-seq, RNA-seq, ATAC-seq, scATAC-seq, and scRNA-seq. Currently, DiabetesOmic contains 487 samples, encompassing type 1 and type 2 diabetes spanning multiple tissues. These data underwent stringent quality assessment to ensure high-quality molecular profiles. Notably, we manually curated clinical complication annotations including diabetic nephropathy, retinopathy, and atherosclerosis to enhance translational relevance. For each type of sequencing data, we implemented specific analytical pipelines to generate multi-dimensional transcriptional regulatory information, including regulatory network identification, differential gene expression analysis, chromatin accessibility analysis, and transcription factor enrichment analysis. This comprehensive analysis enables the identification of disease-associated regulatory elements, epigenetic modifications, and cell type-specific molecular signatures, providing valuable insights into the molecular mechanisms of diabetes and its complications. This resource represents a significant advancement in diabetes research, facilitating deeper investigations into the disease's pathology and progression.
... This process typically involves two main approaches: manual annotation and automated annotation. Automated annotation generally follows two strategies: one classifies cells based on the expression patterns of cell-type-specific marker genes [10,11], while the other leverages machine learning algorithms to transfer cell-type labels from reference datasets to the target dataset [12][13][14][15]. Automated methods offer significant advantages when a reliable library of marker genes is available [16] and high-quality reference datasets can be obtained. ...
The groundbreaking development of scRNA-seq has significantly improved cellular resolution. However, accurate cell-type annotation remains a major challenge. Existing annotation tools are often limited by their reliance on reference datasets, the heterogeneity of marker genes, and subjective biases introduced through manual intervention, all of which impact annotation accuracy and reliability. To address these limitations, we developed FPCAM, a fully automated pulmonary fibrosis cell-type annotation model. Built on the R Shiny platform, FPCAM utilizes a matrix of up-regulated marker genes and a manually curated gene–cell association dictionary specific to pulmonary fibrosis. It achieves accurate and efficient cell-type annotation through similarity matrix construction and optimized matching algorithms. To evaluate its performance, we compared FPCAM with state-of-the-art annotation models, including SCSA, SingleR, and SciBet. The results showed that FPCAM and SCSA both achieved an accuracy of 89.7%, outperforming SingleR and SciBet. Furthermore, FPCAM demonstrated high accuracy in annotating the external validation dataset GSE135893, successfully identifying multiple cell subtypes. In summary, FPCAM provides an efficient, flexible, and accurate solution for cell-type identification and serves as a powerful tool for scRNA-seq research in pulmonary fibrosis and other related diseases.
... Traditional unsupervised clustering methods commonly used in scRNA-seq analysis operate by grouping cells based on the overall similarity of their marker profiles across the entire panel [13][14][15][16][17]. Their efficacy heavily relies on the presence of abundant markers that distinguish cell populations, a characteristic commonly found in single-cell sequencing data [18]. ...
... We downloaded two publicly available human datasets: Colorectal Cancer (PCF-CRC; n = 140-TMAs; n = 235,519-cells; n = 56-antibodies) and Healthy Intestine (PCF-HI; n = 64-samples; n = 2,603,217-cells; n = 56-antibodies); both were generated using the Akoya Phenocycler-Fusion (PCF; formerly CODEX) 1.0 system for spatial proteomics [26,27]. We compared TACIT's performance in cell type annotation against three alternative cell type annotation approaches (CELESTA, SCINA, and Louvain) in both datasets, using original annotations as reference [13,28,29]. ...
Identifying cell types and states remains a time-consuming, error-prone challenge for spatial biology. While deep learning increasingly plays a role, it is difficult to generalize due to variability at the level of cells, neighborhoods, and niches in health and disease. To address this, we develop TACIT, an unsupervised algorithm for cell annotation using predefined signatures that operates without training data. TACIT uses unbiased thresholding to distinguish positive cells from background, focusing on relevant markers to identify ambiguous cells in multiomic assays. Using five datasets (5,000,000 cells; 51 cell types) from three niches (brain, intestine, gland), TACIT outperforms existing unsupervised methods in accuracy and scalability. Integrating TACIT-identified cell types reveals new phenotypes in two inflammatory gland diseases. Finally, using combined spatial transcriptomics and proteomics, we discover under- and overrepresented immune cell types and states in regions of interest, suggesting multimodality is essential for translating spatial biology to clinical applications.
... Databases like CellMarker [15], PanglaoDB [16], and CancerSEA [17] provide curated lists of such marker genes. Tools like scCATCH [18], SCSA [19], CellAssign [20], and SCINA [21] use statistical models to fit marker gene distributions and assign labels accordingly. For instance, scCATCH builds tissue-specific reference datasets and matches gene expression profiles to a predefined taxonomy, while CellAssign applies a Bayesian probabilistic model to label single cells. ...
Accurate cell type annotation is a critical step in single-cell RNA sequencing (scRNA-seq) analysis, enabling deeper insights into cellular heterogeneity and biological processes. In this study, we conducted a comprehensive comparative evaluation of various machine learning techniques, including support vector machine (SVM), decision tree, random forest, logistic regression, gradient boosting, k-nearest neighbour, transformer, and naive Bayes, to determine their effectiveness for single-cell annotation. These methods were evaluated using four diverse datasets comprising hundreds of cell types across several tissues. Our results revealed that SVM consistently outperformed other techniques, emerging as the top performer in three out of the four datasets, followed closely by logistic regression. Most methods demonstrated robust capabilities in annotating major cell types and identifying rare cell populations, though naive Bayes was the least effective due to its inherent limitations in handling high-dimensional and interdependent data. This study provides valuable insights into the relative strengths and weaknesses of machine learning methods for single-cell annotation, offering guidance for selecting appropriate techniques in scRNA-seq analyses.
... For each cluster, we used the Wilcoxon Rank-Sum Test to find significantly differentially expressed genes compared with the remaining clusters. SCINA and known marker genes were used to identify cell type [22]. ...
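The one-cluster-versus-rest comparison mentioned in this excerpt can be sketched for a single gene as below. This is a minimal, standard-library illustration of a two-sided Wilcoxon rank-sum test using the normal approximation, without a tie correction for the variance. It is not the cited study's pipeline, and the function name `rank_sum_test` and the expression values are hypothetical.

```python
import math


def rank_sum_test(group_a, group_b):
    """Two-sided Wilcoxon rank-sum test (normal approximation, no tie correction).

    group_a, group_b: lists of expression values for one gene, e.g. cells in a
    cluster versus cells in all remaining clusters.
    Returns (z, p_value); z > 0 means group_a tends to have larger values.
    """
    pooled = sorted((value, idx) for idx, value in enumerate(group_a + group_b))
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based ranks; ties share the average
        for k in range(i, j + 1):
            ranks[pooled[k][1]] = average_rank
        i = j + 1

    n_a, n_b = len(group_a), len(group_b)
    rank_sum_a = sum(ranks[:n_a])  # the first n_a original indices belong to group_a
    expected = n_a * (n_a + n_b + 1) / 2
    std = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (rank_sum_a - expected) / std
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return z, p_value


# Toy data for one gene (values invented for illustration).
cluster_cells = [5.1, 4.8, 6.0, 5.5, 4.9]
remaining_cells = [0.2, 0.0, 1.1, 0.4, 0.3, 0.9]
z, p_value = rank_sum_test(cluster_cells, remaining_cells)
print(z, p_value)
```

In practice the test would be repeated for every gene and every cluster, and the resulting p-values adjusted for multiple testing before calling a gene differentially expressed.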
Myelodysplastic syndrome (MDS) is a malignant hematologic disorder with limited curative options, primarily reliant on hematopoietic stem cell transplantation. Anemia, a prevalent symptom of MDS, has few effective treatment strategies. Realgar, though known for its therapeutic effects on MDS, remains poorly understood in terms of its mechanism of action. In this study, both in vivo and in vitro experiments were conducted using Realgar and its primary active component, As2S2, to examine their impact on mouse erythroblasts at the single-cell level. Realgar treatment significantly altered the transcriptional profiles and cellular composition of bone marrow in mice, both in vivo and in vitro. Differentially expressed genes in erythroblasts regulated by Realgar were identified, unveiling potential regulatory functions and signaling pathways, such as heme biosynthesis, hemoglobin production, oxygen binding, IL-17 signaling, and MAPK pathways. These findings suggest that Realgar enhances the differentiation of erythroblasts in mouse bone marrow and improves overall blood cell counts. This work offers preliminary insights into Realgar’s mechanisms, expands the understanding of this mineral medicine, and may inform strategies to optimize its therapeutic potential in hematologic diseases.
... Alternatively, the incorporated default references that encompass Parts 2 and 3 can be replaced by an external reference. A user can upload a marker-cell type reference table, such as the ones used by some semi-supervised methods [25,31]. Another option is to use a standardized panel for the reference, such as one that has been developed and used across experiments in large consortia [5]. ...
Advances in cytometry have led to increases in the number of cellular markers that are routinely measured. The resulting complexity of the data has prompted a shift from manual to automated analysis methods. Currently, numerous unsupervised methods are available to cluster cells based on marker expression values. However, phenotyping the resulting clusters is typically not part of the automated process. Manually identifying both marker definitions (e.g. CD4+, CCR7+, CD45RA+, CD19-) and descriptive cell type names (e.g. naive CD4+ T cells) based on marker expression values can be time-consuming, subjective, and error-prone.
In this work we propose an algorithm that addresses these problems through the creation of an automated tool, CytoPheno, that assigns marker definitions and cell type names to unidentified clusters. First, post-clustered expression data undergoes per-marker calculations to assign markers as positive or negative. Next, marker names undergo a standardization process to match to Protein Ontology identifier terms. Finally, marker descriptions are matched to cell type names within the Cell Ontology. Each part of the tool was tested with benchmark data to demonstrate performance. Additionally, the tool is encompassed in a graphical user interface (R Shiny) to increase user accessibility and interpretability. Overall, CytoPheno can aid researchers in timely and unbiased phenotyping of post-clustered cytometry data.
... Automated annotation in scRNA-seq has evolved into three primary strategies: marker-, correlation-, and model-based methodologies [8]. Marker-based techniques, such as scTyper [9], Digital Cell Sorter [10], SCINA [11], SCSA [12], CellAssign [13], scCATCH [14], MarkerCount [15], scClassifR [16], Garnett [17], and scSorter [18], use established cell markers for identification. However, their effectiveness is limited by the availability and specificity of these markers, often failing to identify cell types without known markers or in the presence of novel cells. ...
Background
Recent advancements in single-cell RNA sequencing have greatly expanded our knowledge of the heterogeneous nature of tissues. However, robust and accurate cell type annotation continues to be a major challenge, hindered by issues such as marker specificity, batch effects, and a lack of comprehensive spatial and interaction data. Traditional annotation methods often fail to adequately address the complexity of cellular interactions and gene regulatory networks.
Results
We proposed scMCGraph, a comprehensive computational framework that integrates gene expression with pathway activity to accurately annotate cell types within diverse scRNA-seq datasets. Initially, our model constructs multiple pathway-specific views using various pathway databases, which reflect both gene expression and pathway activities. These pathway-specific views are then integrated into a consensus graph. The consensus graph is subsequently utilized to reconstruct the multiple pathway views. Our model demonstrated exceptional robustness and accuracy across various analyses, including cross-platform, cross-time, cross-sample, and clinical dataset evaluations.
Conclusions
scMCGraph represents a significant advance in cell type annotation. The experiments have demonstrated that introducing pathway information significantly improves the learning of cell–cell graphs, with their resulting consensus graph enhancing the predictive performance of cell type prediction. Different pathway databases provide complementary data, and an increase in the number of pathways can also boost model performance. Extensive testing shows that in various cross-dataset application scenarios, scMCGraph consistently exhibits both accuracy and robustness.