Nature Methods

Published by Springer Nature
Online ISSN: 1548-7105
Recent publications
Overview of DAQ
DAQ is a residue-wise local quality estimate for protein models built from cryo-EM maps, based on an upgraded Emap2sec+. The example used here is the rNLRP1–rDPP9 complex (PDB accession 7CRW, chain A) and the EM map from which the structure was built (EMD-30458). a, DAQ protocol: Emap2sec+ scans an EM map with an 11 × 11 × 11 Å³ box at a stride of 1 Å and outputs the probabilities of amino acid type, atom type, and secondary structure type for the center position of the box. Next, the probabilities at Cα positions of the structure model are gathered. Then, DAQ(AA), DAQ(Cα), and DAQ(SS) are calculated as log-odds scores using the average probability for the corresponding property across the entire model. In this figure, blue indicates higher DAQ scores (higher local quality) and red indicates lower DAQ scores (lower local quality). b, A detailed network architecture of the upgraded Emap2sec+, which was used to compute the probability values. It has six residual blocks and outputs the amino acid, atom, and secondary structure probabilities for an input box. conv3d is a 3D convolutional layer that slides convolution kernels along the input features; batch norm is a normalization layer that re-centers and re-scales its input according to batch statistics; ReLU is an activation function of the form f(x)=max(0,x); FC is a fully connected layer.
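The log-odds step in panel a can be illustrated with a minimal sketch. It assumes the score for a residue is the log of the probability of its assigned label at its Cα position divided by the model-wide average probability of that same label; the function name `daq_log_odds`, the dictionary input layout, and the small `eps` floor are hypothetical choices for illustration, not the authors' implementation.

```python
import math

def daq_log_odds(probs, assigned, eps=1e-12):
    """Per-residue log-odds of the assigned label (illustrative sketch).

    probs    : list of dicts {label: probability}, one per Cα position
    assigned : label assigned to each residue in the model
    eps      : assumed floor to avoid log(0); not taken from the source
    """
    n = len(probs)
    # Average probability of each assigned label across the entire model.
    mean_p = {lab: sum(p[lab] for p in probs) / n for lab in set(assigned)}
    scores = []
    for p, lab in zip(probs, assigned):
        scores.append(math.log(max(p[lab], eps) / max(mean_p[lab], eps)))
    return scores

# Toy example with two labels and two residues.
probs = [{"A": 0.8, "G": 0.2}, {"A": 0.2, "G": 0.8}]
good = daq_log_odds(probs, ["A", "G"])   # assignment agrees with the map
bad = daq_log_odds(probs, ["G", "A"])    # assignment swapped
```

In the toy example, the agreeing assignment gives positive scores and the swapped assignment gives negative ones, which matches the blue/red reading of the figure.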
Comparison of DAQ scores between first and revised protein models in the same PDB entry
a, Comparison of DAQ(AA) scores between the first and revised protein models averaged over the entire chain. b, DAQ(AA) scores of the first and revised protein models averaged over inconsistent residues. The symbols denote PDB IDs of the protein chains. c, Average DAQ(AA) score of inconsistent residues relative to the scores of consistent residues in the first-version models. d, Average DAQ(AA) score of inconsistent residues relative to the scores of consistent residues in the revised models. a–d, A window size of 1 was used. e, AUC-ROC of different combinations of terms used to compute the DAQ score.
Analysis of the DAQ score distribution for PDB entry 7JSN-B (EMD-22458)
a, Validation of the two versions of chain B deposited for entry 7JSN. Left, first version colored according to the deviation of the same Cα atom position in the revised version. Colors are scaled from green (deviation < 1.0 Å) to magenta (deviation > 4.0 Å). Middle and right, structures of the first and revised versions, respectively. DAQ(AA) scores along the chain are shown in a color scale from red (DAQ(AA) < -2.0) to blue (DAQ(AA) > 2.0) with the width of the ribbon representation proportional to the absolute value of the DAQ(AA) score when it is negative. b, Four regions that exhibit large deviations between the two models are detailed. The first and revised versions of 7JSN chain B are shown in cyan and green, respectively. (i) Residues 11–73. Three residues (Leu-27, Asn-34, and Arg-48) are highlighted with stick side chains to highlight their misplacement in the first version. (ii) Residues 198–233. Three residues (Asp-205, Phe-215, and Lys-221) are shown with stick side chains as reference points to highlight misalignment. (iii) Residues 282–391. Two residues (Leu-312 and Val-329) are shown with stick side chains to highlight misalignment. (iv) Residues 431–483. Two residues (Arg-444 and Glu-461) are shown with stick side chains to highlight misalignment. c, DAQ scores and other validation metrics are shown as a function of sequence position. Left, three DAQ component scores are shown: DAQ(AA) (blue); DAQ(SS) (orange); and DAQ(Cα) (green). Misaligned and mispositioned residues are shaded gray and pink, respectively. The right three plots show results from three different validation scores: Q-score with the horizontal line of the expected Q-score (orange) for maps of this resolution, EMRinger, and CaBLAM with the outlier cutoff at the bottom 1% (orange) and the dis-favored cutoff at bottom 5% (green), using a 19-residue sliding window.
DAQ score analysis of misaligned residues in the PDBNR90 dataset
a, Comparison of the average DAQ(AA) score of inconsistent regions within 399 model pairs in the PDBNR90 dataset. A 19-residue-long window was used. b, DAQ(AA) score distribution of inconsistent (red) and consistent residues (blue) in the first-version models in the PDB2Ver dataset. The inconsistent residues are residues that were modified in the revised model in PDB entries and thus more likely to be incorrect in one of the models. The two curves show the fraction of inconsistent residues with a score at a negative score cutoff or below (red) and the fraction of consistent residues with a positive score at the score cutoff or higher (blue) for the data in PDBNR90. c, Two examples of protein model pairs that have a large score difference. The left column shows the superposition of the pairs (cyan and pink). Inconsistent Cα atom positions (deviation > 4.0 Å) between the two models are indicated in blue and magenta in the cyan and pink models, respectively. Models in the middle and right columns correspond to the lower- and higher-scored PDB models, respectively. Surface meshes represent the EM map at author-recommended contour levels. In models in the right column, DAQ(AA) score is indicated in colors from red (DAQ(AA) < −2.0) to blue (DAQ(AA) > 2.0) and by the radius of the tube (thicker being more negative).
Analysis of 4,485 non-redundant PDB chain models in PDBNR1Å by DAQ score
a, Distribution of DAQ(AA) score with a 19-residue sliding window. Green bars represent the distribution of the PDBNR1Å dataset. For comparison, the score distributions of consistent (blue) and inconsistent (red) residues in the PDBVer2 dataset are also shown. b, The fraction of residue positions with a low DAQ(AA) score in structure models (y-axis) was plotted relative to the cross-correlation between the models and the corresponding EM maps (x-axis). The fraction (y-axis) is defined as the number of residues that have a DAQ score below one of three cutoff values, red = 0.0, green = −0.5, and blue = −1.0, relative to the length of the protein chain. c, DAQ(AA) score mapped on PDB entry 6S8B chain L, associated with EM map (EMD-10117) determined at a 2.41-Å resolution. The chain is colored according to the DAQ(AA) score from red (DAQ(AA) score < −2.0) to blue (DAQ(AA) score > 2.0). The width of the ribbon is proportional to the absolute value of the DAQ(AA) score when it is negative. Middle, model from the PDB entry. The main chain is shown as a tube with color showing the DAQ(AA) score. Side chains discussed are in cyan. Right, AlphaFold2-predicted model in magenta. d, DAQ(AA) score mapped on PDB entry 7NNU chain A, associated with EM map (EMD-12487) determined at a 2.7-Å resolution. Middle, PDB model. This region has a strong negative DAQ(AA) score as shown in red. Right, AlphaFold2-predicted model. e, DAQ(AA) score mapped on PDB entry 6TNI chain A. Its associated EM map (EMD-10534) is shown as a transparent envelope at the recommended contour level. Right, two entangled regions, Val-661 to Gln-709 and Leu-758 to Lys-778, are highlighted.
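The two quantities used in panels a and b, a sliding-window score and the fraction of residues below a cutoff, can be sketched as follows. This is an illustration only: the function names are hypothetical, and truncating the window at chain ends is an assumption that the caption does not specify.

```python
def window_average(scores, window=19):
    """Average each residue's score over a centered window.

    Assumption: windows are truncated at the chain termini rather than padded.
    """
    half = window // 2
    averaged = []
    for i in range(len(scores)):
        lo = max(0, i - half)
        hi = min(len(scores), i + half + 1)
        averaged.append(sum(scores[lo:hi]) / (hi - lo))
    return averaged

def fraction_below(scores, cutoff):
    """Number of residues scoring below the cutoff, relative to chain length."""
    return sum(s < cutoff for s in scores) / len(scores)

smoothed = window_average([1.0, 2.0, 3.0], window=3)
low_fraction = fraction_below([-1.0, -0.6, 0.2, 1.0], cutoff=-0.5)
```

Calling `fraction_below` with the cutoffs 0.0, −0.5 and −1.0 would give the three series that panel b plots in red, green and blue.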
An increasing number of protein structures are being determined by cryogenic electron microscopy (cryo-EM). Although the resolution of determined cryo-EM density maps is improving in general, there are still many cases where amino acids of a protein are assigned with different levels of confidence. Here we developed a method that identifies potential misassignment of residues in the map, including residue shifts along an otherwise correct main-chain trace. The score, named DAQ, computes the likelihood that the local density corresponds to different amino acids, atoms, and secondary structures, estimated via deep learning, and assesses the consistency of the amino acid assignment in the protein structure model with that likelihood. When DAQ was applied to different versions of model structures in the Protein Data Bank that were derived from the same density maps, a clear improvement in the DAQ score was observed in the newer versions of the models. DAQ also found potential misassignment errors in a substantial number of deposited protein structure models built into cryo-EM maps. The DAQ score assesses the consistency of amino acid assignment in protein structure models with local density from cryo-EM maps. The method complements existing quality metrics and is a versatile tool for highlighting problematic regions of model structures.
scBasset architecture
a, scBasset is a deep CNN to predict single-cell chromatin accessibility from the DNA sequence underlying peak calls. The input to the model is a 1,344-bp DNA sequence from each peak’s center and the output is accessibility per cell (corresponding to one row of the peak × cell matrix). Conv, convolution. b, scBasset prediction performance on held-out peaks evaluated by auROC per cell (left) and auROC per peak (right) for the Buenrostro2018 dataset. c, t-SNE visualization of cell embeddings learned by scBasset as the weights of the final dense layer, colored by cell type (left). Hematopoietic stem cell differentiation lineage diagram in the Buenrostro2018 study (right). The cell type labels refer to hematopoietic stem cell (HSC), multipotent progenitor (MPP), lymphoid primed MPP (LMPP), common lymphoid progenitor (CLP), plasmacytoid dendritic cell (pDC), common myeloid progenitor (CMP), granulocyte macrophage progenitor (GMP), megakaryocyte-erythroid progenitor (MEP) cell, monocyte (Mono) and unknown (UNK). Source data for this figure are provided.
scBasset cell representation performance
a, Performance comparison of different cell-embedding methods evaluated by clustering metrics (ARI, cell type ASW and AMI) on the Buenrostro2018 dataset. b, Performance comparison of different cell-embedding methods evaluated by label score—the proportion of cells’ nearest neighbors that share its cell type label (Methods)—on Buenrostro2018 dataset. c, Performance comparison of different cell-embedding methods evaluated by clustering metrics on 10x multiome PBMC dataset. d, Performance comparison of different cell-embedding methods evaluated by neighbor score—the proportion of cells’ nearest neighbors that are also nearest neighbors in an independent scRNA analysis (Methods)—on 10x multiome PBMC dataset. e, Performance comparison of different cell-embedding methods evaluated by clustering metrics on 10x multiome mouse brain dataset. f, Performance comparison of different cell-embedding methods evaluated by neighbor score (Methods) on 10x multiome mouse brain dataset. Source data for this figure are provided.
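The label score in panel b is described as the proportion of a cell's nearest neighbors that share its cell type label. A minimal sketch follows, assuming Euclidean distance in the embedding space and a simple average over cells; both choices, and the function name `label_score`, are assumptions made for illustration, since the Methods section is not reproduced here.

```python
import math

def label_score(embeddings, labels, k):
    """Mean fraction of each cell's k nearest neighbors sharing its label.

    Assumptions: Euclidean distance, brute-force neighbor search,
    and an unweighted average over all cells.
    """
    n = len(embeddings)
    total = 0.0
    for i in range(n):
        # Distances from cell i to every other cell, nearest first.
        dists = sorted(
            (math.dist(embeddings[i], embeddings[j]), j)
            for j in range(n) if j != i
        )
        neighbors = [j for _, j in dists[:k]]
        total += sum(labels[j] == labels[i] for j in neighbors) / k
    return total / n

cells = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
separated = label_score(cells, ["a", "a", "b", "b"], k=1)
mixed = label_score(cells, ["a", "b", "a", "b"], k=1)
```

The neighbor score in panels d and f has the same structure, with the label comparison replaced by membership in the neighbor set from an independent scRNA analysis.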
scBasset batch correction
a, Cell embeddings learned by scBasset without batch correction on a mixture of PBMC scATAC from 10x multiome and 10x next GEM chemistries (top). Cells are colored by chemistry. Cell embeddings learned by scBasset with batch correction (scBasset-BC) on the same data (bottom). b, Buenrostro2018 cell embeddings learned by scBasset, colored by cell type (left) or batch (right). c, Buenrostro2018 cell embeddings learned by scBasset-BC, colored by cell type (left) or batch (right). d, Performance comparison of different batch correction methods on chemistry-mixed PBMC data. Harmony is applied on either PCA, named Harmony (PCA), or scBasset embeddings, named Harmony (scBasset), and performance was evaluated by kBET and label score (with a neighborhood of 100). e, Similar performance comparison of different batch correction methods on Buenrostro2018 data. Source data for this figure are provided.
scBasset denoising performance evaluation
a, Binary count matrix of 200 cells and 500 peaks sampled from Buenrostro2018 dataset, hierarchically clustered by both cells and peaks (left). Cell type labels annotate the rows. The same matrix and procedure after scBasset denoising (right). b, Correlation between gene accessibility score and gene expression across genes for each cell before (x axis) and after scBasset denoising (y axis) for the multiome PBMC dataset. A one-sided Wilcoxon signed-rank test was performed. Cells are colored by sequencing depth. c, Comparison of different denoising methods in multiome PBMC dataset as evaluated by label score and cell type ASW (left). Comparison of different denoising methods in multiome PBMC dataset as evaluated by correlation between scVI-denoised RNA and denoised ATAC profiles across genes per cell (correlation per cell), and correlation between scVI-denoised RNA and denoised ATAC profiles across cells per gene (correlation per gene) (right). d, UMAPs of RNA and ATAC co-embeddings after integration for multiome PBMC dataset. Integration performed on scVI-denoised RNA (blue) and raw ATAC (red) (left). Integration performed on scVI-denoised RNA (blue) and scBasset-denoised ATAC (red) (right). e, Comparison of integration performance on multiome PBMC dataset. Performance is measured by the relative distances between each cell’s RNA and ATAC embeddings (Methods) when integrating the scVI-denoised RNA profiles with ATAC profiles denoised with different methods; n = 2,714 cells for each box plot on the left, and n = 4,881 cells for each box plot on the right. The box plot shows min and max as whiskers (excluding outliers), first and third quartiles as boxes and median in the center. Outliers (>1.5× interquartile range away from the box) are not shown. Source data for this figure are provided.
scBasset infers single-cell TF activity
a, UMAP showing annotated PBMC cell types. b, Pearson correlation between TF expression and scBasset or chromVAR-predicted TF activity for 203 differentially expressed TFs. A one-sided Wilcoxon signed-rank test was performed. The example TFs that we examined in c are highlighted in red. c, UMAP visualization of TF expression (left), scBasset TF activity (middle) and chromVAR TF activity (right) for key PBMC regulators. Pearson correlation between inferred TF activity and expression are shown in the title. d, Precision-recall (PR) curves of scBasset and chromVAR for distinguishing sgGATA1 cells from sgNT cells in the spear-ATAC dataset (top). PR curves of scBasset and chromVAR for distinguishing sgGATA2 cells from sgNT cells (bottom). e, HSC differentiation lineage diagram in the Buenrostro2018 study. f, ISM scores for β-globin enhancer at chr11:5297158-5297258 for HSC, MPP, CMP and MEP cell types. Sequences that match GATA1 and KLF1 motifs are highlighted in red boxes. g, Distributions of per-cell TF PWM-ISM scores for GATA1 and KLF1 for cells in HSC, MPP, CMP and MEP cell types. n = 502, 344, 142, 138 cells for each of CMP, HSC, MPP and MEP. The PWM-ISM score is the dot product of the PWM and ISM measurements at sites of motif matches (GATA1 at chr11:5297906 and KLF1 at chr11:5297940). A one-sided Wilcoxon rank-sum test was performed to test for significance. *P < 0.01; NS, not significant. Exact P values are P = 2.06 × 10⁻⁹ for MPP versus HSC, P = 2.46 × 10⁻¹¹ for CMP versus MPP, and P = 3.83 × 10⁻³⁹ for MEP versus CMP for GATA1; P = 0.10 for MPP versus HSC, P = 0.38 for CMP versus MPP and P = 4.95 × 10⁻⁴¹ for MEP versus CMP for KLF1. Source data for this figure are provided.
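Panel g defines the PWM-ISM score as the dot product of the PWM and the ISM measurements at a motif match. A minimal sketch of that arithmetic is shown below; the matrix layout (one row per motif position, columns ordered A, C, G, T), the toy values, and the function name are assumptions for illustration.

```python
def pwm_ism_score(pwm, ism):
    """Dot product of a PWM and ISM scores over the same motif window.

    Both inputs are lists of rows, one row per motif position, with one
    value per nucleotide (assumed column order: A, C, G, T).
    """
    if len(pwm) != len(ism) or any(len(p) != len(s) for p, s in zip(pwm, ism)):
        raise ValueError("PWM and ISM windows must have the same shape")
    return sum(
        p * s
        for pwm_row, ism_row in zip(pwm, ism)
        for p, s in zip(pwm_row, ism_row)
    )

# Toy two-position motif that prefers A, then G.
toy_pwm = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
toy_ism = [[0.5, -0.1, 0.0, 0.0], [0.2, 0.0, 0.3, 0.0]]
score = pwm_ism_score(toy_pwm, toy_ism)
```

A high score means the nucleotides the motif favors are also the ones the model considers important for accessibility in that cell.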
Single-cell assay for transposase-accessible chromatin using sequencing (scATAC) shows great promise for studying cellular heterogeneity in epigenetic landscapes, but there remain important challenges in the analysis of scATAC data due to the inherent high dimensionality and sparsity. Here we introduce scBasset, a sequence-based convolutional neural network method to model scATAC data. We show that by leveraging the DNA sequence information underlying accessibility peaks and the expressiveness of a neural network model, scBasset achieves state-of-the-art performance across a variety of tasks on scATAC and single-cell multiome datasets, including cell clustering, scATAC profile denoising, data integration across assays and transcription factor activity inference. Using a sequence-based deep neural network, scBasset facilitates various tasks of single-cell ATAC-seq analysis in a unified framework.
  • Mikhail Goncharov
  • Dmitry Bagaev
  • Dmitrii Shcherbinin
  • [...]
  • Mikhail Shugay
New computational method uses convolutional neural networks for cis-regulatory sequence analysis to analyze and cluster scATAC-seq data.
Tardigrades are everywhere. They’re tiny — usually under a millimeter long — and they’re mostly transparent, so they’re easy to miss. But you probably walk by them every day. We’ve been grooming them as emerging models for studying how body forms evolve and how biological materials can survive extreme conditions.
Researchers have discovered two naturally occurring channelrhodopsins for potassium ion transport that can be used in optogenetic applications.
VASA-seq offers a single cell sequencing tool for capturing full-length coverage of coding sequences and supplementing non-coding RNA molecules.
DNA–protein interactions mediate physiologic gene regulation and may be altered by DNA variants linked to polygenic disease. To enhance the speed and signal-to-noise ratio (SNR) in the identification and quantification of proteins associated with specific DNA sequences in living cells, we developed proximal biotinylation by episomal recruitment (PROBER). PROBER uses high-copy episomes to amplify SNR, and proximity proteomics (BioID) to identify the transcription factors and additional gene regulators associated with short DNA sequences of interest. PROBER quantified both constitutive and inducible association of transcription factors and corresponding chromatin regulators to target DNA sequences and binding quantitative trait loci due to single-nucleotide variants. PROBER identified alterations in regulator associations due to cancer hotspot mutations in the hTERT promoter, indicating that these mutations increase promoter association with specific gene activators. PROBER provides an approach to rapidly identify proteins associated with specific DNA sequences and their variants in living cells.
Stem cell-derived endothelial cell subtypes enable study of viral tropism.
Love’s the only engine of survival. —Leonard Cohen
Imputing missing parental genotypes sharpens estimates of direct genetic effects.
Extensive noise in spatially resolved transcriptomics data
a, Spatial expression levels of CD45 (IF), PTPRC (targeted panel sequencing) and PTPRC (whole transcriptomic sequencing) of the 10X Visium Ovarian Cancer dataset. b, Severe drop-outs in PTPRC gene expression for both targeted sequencing and whole transcriptomic sequencing. The x axis shows the PTPRC RNA expression level (unit = log2(CPM+1)). c, Severe drop-outs in scRNA-seq, Slide-Seq, Visium and bulk RNA-Seq. The x axis shows average RNA expression levels for each gene profiled by each technique/dataset, and the y axis shows the percentages of counts of exactly 0 for each gene. d, Example mantle zone structure with poor agreement with IgD expression from the 10X Visium human lymph node dataset. The red lines mark the borders of the mantle zone. e, Cartoon describing the calculation of the percentage of sequencing spots with expression larger than a given cutoff in the subset of spots with non-zero expression (adjusted exp.%). Prob, probability. f, Adjusted exp.% of beads in the four bins of different sequencing qualities in the mouse Slide-Seq dataset and with the cutoffs to define the adjusted exp.% shown on the x axis.
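The drop-out percentage in panel c and the adjusted exp.% in panel e are both simple ratios, sketched below. The function names are hypothetical, and returning 0.0 when a gene has no non-zero spots is an assumption the caption does not address.

```python
def dropout_percent(values):
    """Percentage of spots whose count for a gene is exactly 0."""
    return 100.0 * sum(v == 0 for v in values) / len(values)

def adjusted_exp_percent(values, cutoff):
    """Percentage of spots above the cutoff, among spots with non-zero expression."""
    nonzero = [v for v in values if v > 0]
    if not nonzero:
        return 0.0  # assumed convention for genes with no expressing spots
    return 100.0 * sum(v > cutoff for v in nonzero) / len(nonzero)

counts = [0, 0, 1, 2, 3, 4, 0, 0]
dropout = dropout_percent(counts)
adjusted = adjusted_exp_percent(counts, cutoff=2)
```

Restricting the denominator to non-zero spots separates the question of how strongly a gene is expressed where it is detected from the question of how often it is dropped out.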
Sprod for de-noising of spatially resolved transcriptomics data
a, Cartoon describing the Sprod model, from data preparation and graph building to expressional de-noising. b, Simulated dataset: the figure shows the location of the simulated spots and the cell types to which they are assigned. The light gray bars show the graph built by Sprod. A bar connects two dots as long as the graph connects two spots, regardless of edge weights. Dots have been subsampled to avoid overcluttering of the presentation. c, De-noising% of all parameter combinations tested. The x axis shows all parameter combinations, ordered from lower to higher de-noising%. d, Visualizing the de-noising% with respect to specific choices of parameters. K = 10. The x and y axes show all choices tried for each R and Lambda. The z axis shows the de-noising%, which is the average of the de-noising% over all the choices of all other parameters. Noise level = 0.5. e, Example of the two diagnostic plots generated by Sprod from the 10X breast cancer Visium dataset. Left, spots and edges of the detected graph on the x–y coordinates. Right, spots and edges of the detected graph in the UMAP space of the image features. The coloring denotes confidence of the edges, with blue referring to high confidence and gray to low confidence.
Validation of Sprod on real Visium and Slide-Seq datasets
a, Sprod-corrected PTPRC expression. Darker colors refer to higher expression. b, Scatterplots showing the correlation between CD45 IF and PTPRC gene expression, corrected by Sprod. c, Spearman and Pearson correlations between CD45 IF and PTPRC expression, from the original expression data, Sprod-corrected expression data, expression data with drop-out removal performed by SAVER and scImpute and the Sprod ‘scrambling’ control. d, Spatial IgD expression of the mantle zone marked in Extended Data Figs. 2 and 5, for both the original Slide-Seq expression data and the Sprod-adjusted expression data. e, Pearson correlations between IgD and CD3/CD20/CD1c, for each analysis group. f, Adjusted exp.% of beads in the four bins of different sequencing qualities, in the Sprod-corrected expression data, with the cutoffs to define the adjusted exp.% shown on the x axis.
Detection of spatially differentially expressed genes is more accurate after de-noising
a, Slide-Seq beads defined as being in basal neuropils, soma and proximal neuropils of the CA1 region (following the definition of Stickels et al.¹). b, Expression of Camk2a and Hpca in the CA1 region Slide-Seq beads, ordered along the soma–proximal axis. The x axis shows the location of the beads with respect to the soma. Results for the raw expression matrix, SAVER-adjusted matrix and Sprod-adjusted matrix are shown. Data are presented as mean values ± s.d. c, Venn diagrams showing the overlap of the genes with differential expression detected from the raw expression data, the Sprod-adjusted data or the SAVER-adjusted data, with the genes that show dendritic enrichment (from Tushev et al. and Ainsley et al.¹⁴,¹⁵). d, Enrichment of GO pathways in the genes with stronger expression in the proximal neuropil regions, detected from the raw, the SAVER-corrected and the Sprod-corrected expression data. Circle color refers to the P values of the GO analyses (one-sided hypergeometric test, multiple-test adjusted P values (Padj) reported); circle size is proportional to the number of genes found in the pathway.
Inference of cell-to-cell communication is more accurate with Sprod-corrected expression data
a, Tumor and stroma/immune (SI) regions were both split into subregions (A, B, C and D) that are either close or not close to the tumor-stroma/immune boundaries. b, Numbers of CellChat-inferred significantly interacting pathways in the raw and Sprod-corrected expression matrices. c, Expression of PD-1 in the SI regions and PD-L1 in the tumor regions, for the raw and Sprod-corrected expression matrices, respectively. d, Close pairs of spots with one spot in the tumor region and another in the tumor/stroma region. e, Expression of PD-1 in the SI-side spots in the pairs of spots from (d) dichotomized by the expression of PD-L1 in the corresponding tumor-side spots and vice versa. Dichotomization was performed at the 75th percentile of PD-L1 or PD-1. Box boundaries represent interquartile ranges, whiskers extend to the most extreme data point (no more than 1.5 times the interquartile range) and the bold line in the middle of the box represents the median (Med.). N = 3,282.
Spatially resolved transcriptomics (SRT) provide gene expression close to, or even superior to, single-cell resolution while retaining the physical locations of sequencing and often also providing matched pathology images. However, SRT expression data suffer from high noise levels, due to the shallow coverage in each sequencing unit and the extra experimental steps required to preserve the locations of sequencing. Fortunately, such noise can be removed by leveraging information from the physical locations of sequencing, and the tissue organization reflected in corresponding pathology images. In this work, we developed Sprod, based on latent graph learning of matched location and imaging data, to impute accurate SRT gene expression. We validated Sprod comprehensively and demonstrated its advantages over previous methods for removing drop-outs in single-cell RNA-sequencing data. We showed that, after imputation by Sprod, differential expression analyses, pathway enrichment and cell-to-cell interaction inferences are more accurate. Overall, we envision de-noising by Sprod to become a key first step towards empowering SRT technologies for biomedical discoveries. Sprod accurately denoises spatially resolved transcriptomics data and improves downstream analysis results.
Functional ULM reveals brain-wide hyperemia at a microscopic scale during brain activation
a, Schematic of the experimental setup for ULM brain imaging through a coronal plane during whisker or visual stimulations in an anesthetized rat receiving a continuous intravenous injection of MBs. b, Blood velocity (left) and MB count ULM maps (right) of the rat brain vasculature at 6.5-μm resolution (n = 7 experiments). c, Temporal rasterization scheme to create dynamic ULM data. d, Time courses of MB count for a pixel in a large (pial vessel, β) and smaller blood vessel (first-order branching after descending arteriole, δ), illustrated in g. e, Pearson correlation coefficient computed between stimulation pattern and MB flux signal, in the whole-brain slice imaged (left) and zoomed in the activated barrel cortex (right). The map is overlaid with rat brain atlas⁵² at Bregma = −3.12 mm; n = 4 experiments. VPL, ventro-postero-lateral thalamic nuclei. f, Time courses of MB count for the same pixels after spatial registration. g, Time courses of MB flux for the same pixels after pattern summation and division by window length. The time courses are given for an increasing number of pattern repetitions. h, Functional correlation map as obtained in e, but using conventional fUS imaging (ultrafast Doppler imaging without any MB injection), n = 4 experiments.
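The correlation maps in panels e and h assign each pixel the Pearson correlation coefficient between the stimulation pattern and that pixel's signal. A minimal sketch of this per-pixel computation follows; the dictionary layout, the function names, and the choice to return 0.0 for a constant signal are assumptions for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # assumed convention for constant signals
    return cov / (sd_x * sd_y)

def correlation_map(stim_pattern, pixel_signals):
    """Correlate the stimulation pattern with the time course of every pixel.

    pixel_signals maps a pixel coordinate to its MB flux time course.
    """
    return {pix: pearson(stim_pattern, ts) for pix, ts in pixel_signals.items()}

stim = [0, 1, 0, 1]  # off/on stimulation pattern
signals = {(0, 0): [1.0, 3.0, 1.0, 3.0],   # follows the stimulus
           (0, 1): [3.0, 1.0, 3.0, 1.0],   # opposes the stimulus
           (1, 0): [2.0, 2.0, 2.0, 2.0]}   # unresponsive
cmap = correlation_map(stim, signals)
```

Pixels whose flux rises and falls with the stimulus approach a coefficient of 1, which is what the activated barrel cortex shows in the zoomed map.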
Super-resolved quantification applied to MB trajectories reveals increased MB flux, speed and vessel diameter in arterioles and venules of the activated barrel cortex during functional hyperemia
a, Subdivision of the barrel cortex into penetrating arterioles, venules, pial vessels and intraparenchymal vessels based on the super-resolved ULM maps. b, Dynamic histograms of the MB velocity distribution in the compartments defined in a during whisker stimulations (stim, n = 40 stimuli). c, Mean MB flow and speed (±s.e.m.) from n = 4 time courses obtained on ten stimulations each, either expressed as absolute value for each compartment (left), or as relative to baseline (right). d–g, Longitudinal profiles of MB count, speed and diameter during rest and stimulation periods, along an arteriole (d,e) or venule (f,g), activated site (e,g) or control site (d,f). Labeling of all MB trajectories passing through the chosen white segment at the entry of the penetrating arteriole or venule. Quantification of this perfusion or drainage area was performed using this selective set of MBs. For each blood vessel, these areas for rest and stimulation periods are displayed on ULM MB count maps. h, Changes during whisker stimulation relative to rest in the MB count and speed for the blood vessels analyzed in d–g. i,j, Analysis on n = 20 arterioles (20 in the activated barrels and 20 controls) and n = 18 venules (18 in the activated barrels and 18 controls) transversal sections from n = 4 rats at a depth <400 μm from pial vessels. MB count (i) and speed (j) transversal profiles shown as mean ± s.e.m. Percentage variation relative to rest and the results from a two-sided Wilcoxon signed-rank test on this variation (null hypothesis, distribution with median equal to zero; NS, not significant, *P ≤ 0.05, **P ≤ 0.01, ***P ≤ 0.001) are given for the max MB count (activated arterioles, P = 9 × 10⁻⁵; control arterioles, P = 0.91; activated venules, P = 9 × 10⁻⁴; control venules, P = 0.98) and speed (P = 9 × 10⁻⁵; P = 0.55; P = 3 × 10⁻⁴; P = 0.012); n = 4 experiments (a–j).
Functional ULM reveals activation in subcortical structures such as VPM and VPL after whisker stimulation and SC after visual stimulation
a, Selection of two blood vessels (orange, Th1 and Th2) within the activated thalamus displayed on an ULM MB count map. MB count, speed and diameter quantifications for rest and stimulation periods for the two blood vessels; n = 4 experiments. b, MB flux correlation map after visual stimulations overlaid with rat brain atlas⁵² at Bregma = −6.12 mm. c, Selection of three blood vessels (red, SC1, SC2, and SC3) within the activated SC displayed on an ULM MB count map. MB count, speed and diameter quantifications for rest and stimulation periods for the three blood vessels; n = 3 experiments (b,c).
Singular value decomposition of sparse dynamic ULM data extracts the spatial and temporal profile of multiple parameters during activation
a,b, SVD analysis applied to pattern-averaged data: MB count variation map during stimulation (a) and associated temporal singular vectors (b). c–e, Zooming in on the somatosensory barrel cortex, absolute MB count variation (c), speed variation (d) and relative MB count variation (e). f,g, SVD analysis applied to the raw temporal ULM matrix (without any pattern summation, sparse signal): stimulation spatial singular vector (f) and corresponding temporal singular vectors (g). h–m, SVD analysis in continuous versus bolus injections experiments. These experiments were performed in the same rat, using either a continuous injection (i) or bolus injections every 35 s (l). Results show the spatial singular vector corresponding to stimulation (h,k) and the first SVD temporal singular vectors (j,m). n = 4 experiments (a–e), single micrograph (f–m).
Temporal resolution
a–c, SVD applied to MB flux. Spatial singular vector corresponding to functional hyperemia (second singular mode for a–c) for different temporal resolutions: 5-s window with 1-s step (a); 1-s window with a 0.5-s step (b); and 0.5-s with a 0.5-s step (c). d–f, Corresponding temporal singular vector. g–i, Results for the same experiment and the same temporal resolutions as in a–c but using correlation analysis (Pearson correlation coefficient computed between stimulation pattern and MB flux signal); n = 7 experiments (a–i).
The advent of neuroimaging has increased our understanding of brain function. While most brain-wide functional imaging modalities exploit neurovascular coupling to map brain activity at millimeter resolutions, the recording of functional responses at microscopic scale in mammals remains the privilege of invasive electrophysiological or optical approaches, but is mostly restricted to either the cortical surface or the vicinity of implanted sensors. Ultrasound localization microscopy (ULM) has achieved transcranial imaging of cerebrovascular flow, up to micrometre scales, by localizing intravenously injected microbubbles; however, the long acquisition time required to detect microbubbles within microscopic vessels has so far restricted ULM application mainly to microvasculature structural imaging. Here we show how ULM can be modified to quantify functional hyperemia dynamically during brain activation reaching a 6.5-µm spatial and 1-s temporal resolution in deep regions of the rat brain. Functional ultrasound localization microscopy monitors cerebrovascular blood flow by detecting the flow of injected microbubbles, providing access to brain activity at high spatiotemporal resolution.
To diversify science, some labs open summer doors wide to reach out to under-represented groups.
Dipole–dipole crosstalk between fluorophores separated by a distance of less than 10 nm induces changes in their photophysics, which adds a challenge to localization microscopy in the sub-10-nm regime.
dSTORM and DNA-PAINT imaging of DNA origami
a, Scheme of DNA origami labeled with four Cy5 at interfluorophore distances of 18 nm, 9 nm, 6 nm and 3 nm. b,c, Selected dSTORM (b) and DNA-PAINT (c) images of DNA origami. Samples were measured 3–5 times independently. Scale bars, 40 nm. d, Analysis of fluorescence trajectories recorded from individual DNA origami imaged using 640 nm excitation at an intensity of 5 kW cm⁻². e,f, Relative occurrence of fluorescence intensity ms⁻¹ in the on-state (Intensity), lifetime of the on-state (On-time), lifetime of the off-state (Off-time) and number of on-states (On-events) detected for DNA origami with different interfluorophore distances in dSTORM (e) calculated from n = 3–5 and DNA-PAINT (f) calculated from n = 2–3 individual experiments. Color code: singly labeled reference (gray), 18 nm (dark blue), 9 nm (light blue), 6 nm (red) and 3 nm (orange). g, Number of on-events (cumulative localizations, cum. locs.) detected per frame as a function of time during 10 min dSTORM videos (Supplementary Videos 6–11) of DNA origami with different interfluorophore distances (n = 3–5). h, Histogram of the times after which 80% of all localizations were detected per individual DNA origami (n = 3–5).
Various energy transfer pathways are responsible for fast blinking observed in the sub-10-nm range
a–e, Fluorescence trajectories recorded for a singly labeled reference (a), 18 nm (b), 9 nm (c), 6 nm (d) and 3 nm (e) DNA origamis in dSTORM photoswitching buffer. Color code, singly labeled reference (gray), 18 nm (dark blue), 9 nm (light blue), 6 nm (red) and 3 nm (orange). Zoomed-in trajectories of the first seconds show fast blinking observed for the 6- and 3-nm DNA origamis. Time bins, 1 ms. f, Fluorescence trajectory recorded for a 3-nm DNA origami in trolox buffer and zoomed-in fluorescence signal of the first 2 s. Time bin, 1 ms. g, Average fluorescence decays from n = 7–10 individual fluorescence trajectories of singly labeled reference (gray) and 3-nm DNA origamis measured in trolox (black) and photoswitching buffer (orange), respectively, revealing different energy transfer pathways between the Cy5 fluorophores. h, Average intensity autocorrelation functions (G(τ)) calculated from n = 7–10 individual fluorescence trajectories of singly labeled reference and 3-nm DNA origamis measured in trolox and photoswitching buffer, respectively, normalized to 1 ms. i, Histogram of average FLIMs measured from n = 7–15 fluorescence trajectories of individual DNA origami with different interfluorophore distances of 18, 9, 6 and 3 nm in photoswitching buffer. j, Fluorescence trajectory of a 3 nm DNA origami in photoswitching and corresponding fluorescence decays with average fluorescence lifetimes (τAV) of 0.66, 1.25 and 1.77 ns recorded during the gray marked areas. k, Typical FLIM images of the 18-, 9-, 6- and 3-nm DNA origami measured in trolox buffer emphasize the increased blinking and shorter fluorescence lifetime of Cy5 fluorophores in the sub-10-nm range (moving from top left to the bottom right). The samples were measured 5–10 times and excited at 640 nm with 2.5 kW cm⁻² at an integration time of 5 µs pixel⁻¹. Scale bar, 1 µm.
Time-resolved photoswitching fingerprint analysis in cells
a–c, Molecular structures of the pentameric GABA-A (PDB 6HUG) and tetrameric GluK2 receptor (PDB 5KUF) with incorporation sites of ncAAs shown as black circles (blue, γ2 subunit GABA-A (a); red, dimeric α2 GABA-A (b); orange, homotetrameric GluK2 (c)) and corresponding dSTORM images of HEK293T membrane sections showing fluorescence signals of individual receptors (5 nm pixel⁻¹). The ncAAs were labeled by click chemistry with Met-Tet-Cy5. In the GABA-AS181TAG mutant the distance between the two fluorophores in the α2 subunits is roughly 5 nm. In the GluK2S398TAG mutant the distance between the four Cy5 molecules is roughly 7 nm (refs. 41,42). The samples were measured 3–5 times independently. Scale bars, 500 nm. d, Relative occurrence of lifetimes of the off-state (Off-time), and number of on-states (On-events) detected from individual receptors in dSTORM experiments (n = 3–5). e, Number of on-events (localizations) detected per frame as a function of time during 10 min dSTORM experiments of membrane receptors (n = 3–5). f, FLIM images of HEK293T cells expressing monomeric γ2 subunit of GABA-A (left, blue), dimeric α2 subunit of GABA-A (middle, red), and homotetrameric GluK2 receptors (right, orange) click-labeled with Met-Tet-Cy5 measured by confocal TCSPC imaging in photoswitching buffer at an irradiation intensity of 2.5 kW cm⁻². To minimize photobleaching of fluorophores FLIM images were recorded at 5 µs of integration time per pixel. No intensity threshold was applied. Scale bars, 2 µm. g, Average fluorescence decays from n = 8–13 FLIM images of HEK293T cells expressing receptors labeled with one, two and four Cy5 fluorophores.
Advances in super-resolution microscopy have demonstrated single-molecule localization precisions of a few nanometers. However, translation of such high localization precisions into sub-10-nm spatial resolution in biological samples remains challenging. Here we show that resonance energy transfer between fluorophores separated by less than 10 nm results in accelerated fluorescence blinking and consequently lower localization probabilities impeding sub-10-nm fluorescence imaging. We demonstrate that time-resolved fluorescence detection in combination with photoswitching fingerprint analysis can be used to determine the number and distance even of spatially unresolvable fluorophores in the sub-10-nm range. In combination with genetic code expansion with unnatural amino acids and bioorthogonal click labeling with small fluorophores, photoswitching fingerprint analysis can be used advantageously to reveal information about the number of fluorophores present and their distances in the sub-10-nm range in cells.
As the resident immune cells in the central nervous system (CNS), microglia orchestrate immune responses and dynamically sculpt neural circuits in the CNS. Microglial dysfunction and mutations of microglia-specific genes have been implicated in many diseases of the CNS. Developing effective and safe vehicles for transgene delivery into microglia will facilitate the studies of microglia biology and microglia-associated disease mechanisms. Here, we report the discovery of adeno-associated virus (AAV) variants that mediate efficient in vitro and in vivo microglial transduction via directed evolution of the AAV capsid protein. These AAV-cMG and AAV-MG variants are capable of delivering various genetic payloads into microglia with high efficiency, and enable sufficient transgene expression to support fluorescent labeling, Ca2+ and neurotransmitter imaging and genome editing in microglia in vivo. Furthermore, single-cell RNA sequencing shows that the AAV-MG variants mediate in vivo transgene delivery without inducing microglia immune activation. These AAV variants should facilitate the use of various genetically encoded sensors and effectors in the study of microglia-related biology. Recombinant adeno-associated virus tools for enhanced microglial transduction in mice are reported. These viruses can be used to express functional reporters or genome editing tools with high microglial specificity, with the help of microglia-specific Cre lines.
Self-supervised deep learning of protein subcellular localization with cytoself
a, Workflow of the learning process. Only images and the protein identifiers are required as input. We trained our model with a second fiducial channel for the cell nuclei, but its presence is optional as its performance contribution is negligible (Fig. 4). The protein identification pretext task ensures that images corresponding to the same or similar proteins have similar representations. b, Architecture of our VQ-VAE-2 (ref. ³⁷)-based deep-learning model featuring our two innovations: split quantization and protein identification pretext task. Numbers in the encoders and decoders indicate encoder1, encoder2, decoder1 or decoder2 (Supplementary File 1). Global representation and local representation use different codebooks. c, The level of use of the codebook (that is, perplexity) increases and then saturates during training and is enhanced by applying split quantization.
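Codebook perplexity, mentioned in panel c, is commonly defined as the exponential of the Shannon entropy of the code-usage distribution. The sketch below assumes that definition applies here; the function name `codebook_perplexity` is hypothetical and not part of the cytoself code.

```python
from collections import Counter
from math import exp, log


def codebook_perplexity(indices):
    """Perplexity of codebook usage: exp of the entropy of code frequencies.

    `indices` is any iterable of codebook entry ids chosen during quantization.
    A value of k means usage is as spread out as k equally used codes.
    """
    counts = Counter(indices)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no codebook indices given")
    entropy = -sum((c / total) * log(c / total) for c in counts.values())
    return exp(entropy)


uniform = codebook_perplexity([0, 1, 2, 3])     # four codes used equally
collapsed = codebook_perplexity([5, 5, 5, 5])   # a single code used
```

Under this definition a rising and then saturating perplexity means that more codebook entries come into regular use as training proceeds.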
High-resolution protein localization atlas
Each point corresponds to a single image from our test dataset of 109,751 images. To reveal the underlying structure of our map, each point in the central UMAP is colored according to nine distinct protein localization categories (mitochondria, vesicles, nucleoplasm, cytoplasm, nuclear membrane, ER, nucleolus, Golgi, chromatin domain). These categories are expanded in the surrounding circles. Tight clusters corresponding to functionally defined protein complexes can be identified within each localization category. Only proteins with a clear and exclusive localization pattern are colored; gray points correspond to proteins with other or mixed localizations. Within each localization category, the resolution of cytoself representations is further illustrated by labeling the images corresponding to individual proteins in different colors (dashed circular inserts). Note that while the colors in the central UMAP represent different cellular territories, colors in the inserts are only used to delineate individual proteins, and do not correspond to the colors used in the main UMAP. The list of annotated proteins and the subunits of each complex are indicated in Supplementary Files 2 and 5, respectively.
Exploring the protein localization atlas
a, Representative images of proteins localized along an exemplary path across the nuclear-cytoplasmic transition and over the ‘gray’ space of mixed localizations. b, The subunits of well-known and stable protein complexes tightly cluster together. Moreover, the complexes themselves are placed in their correct cellular contexts. Different proteins have different expression levels, hence we adjusted the brightness of each panel so as to make all localizations present in each image more visible (only minimum–maximum intensities are adjusted, no gamma adjustment used). All representative images were randomly selected. Protein localization is displayed in grayscale in both panels; the nuclei in b are displayed in blue. The list of the subunits of each complex is indicated in Supplementary File 5. Scale bars, 10 μm.
Clustering performance comparison
For each model variation, we trained five model instances, computed UMAPs for ten random seeds, computed clustering scores using organelle- and protein-complex-level ground truth and then reported the mean and standard error of the mean.
Feature spectral analysis
a, Features in the local representation are reordered by hierarchical clustering to form a feature spectrum (Extended Data Fig. 6). The color bar indicates the strength of correlation. Negative values indicate anti-correlation. On the basis of the feature clustering, we manually identified 11 primary top-level clusters, which are illustrated with representative images (Supplementary Fig. 3). Those images have the highest occurrence of the corresponding features. b, Average feature spectrum for each unique localization family. Occurrence indicates how many times a quantized vector is found in the local representation of an image. All spectra, as well as the heatmap, are vertically aligned. c, The feature spectrum of FAM241A, a poorly characterized orphan protein. d, Correlation between FAM241A and other unique localization categories. The highest correlation is 0.777 with ER; the next is 0.08 with cytoplasm. e, Experimental confirmation of the ER localization of FAM241A. The localization of FAM241A to the ER is experimentally confirmed by coexpression of a classical ER marker (mCherry fused to the SEC61B transmembrane domain, left) in FAM241A-mNeonGreen endogenously tagged cells (right). The ER marker is expressed using transient transfection. As a consequence, not all cells are transfected and levels of expression may vary. Scale bars, 10 μm.
Explaining the diversity and complexity of protein localization is essential to fully understand cellular architecture. Here we present cytoself, a deep-learning approach for fully self-supervised protein localization profiling and clustering. Cytoself leverages a self-supervised training scheme that does not require preexisting knowledge, categories or annotations. Training cytoself on images of 1,311 endogenously labeled proteins from the OpenCell database reveals a highly resolved protein localization atlas that recapitulates major scales of cellular organization, from coarse classes, such as nuclear and cytoplasmic, to the subtle localization signatures of individual protein complexes. We quantitatively validate cytoself’s ability to cluster proteins into organelles and protein complexes, showing that cytoself outperforms previous self-supervised approaches. Moreover, to better understand the inner workings of our model, we dissect the emergent features from which our clustering is derived, interpret them in the context of the fluorescence images, and analyze the performance contributions of each component of our approach. Cytoself is a self-supervised deep learning-based approach for profiling and clustering protein localization from fluorescence images. Cytoself outperforms established approaches and can accurately predict protein subcellular localization.
A multitude of sequencing-based and microscopy technologies provide the means to unravel the relationship between the three-dimensional organization of genomes and key regulatory processes of genome function. Here, we develop a multimodal data integration approach to produce populations of single-cell genome structures that are highly predictive for nuclear locations of genes and nuclear bodies, local chromatin compaction and spatial segregation of functionally related chromatin. We demonstrate that multimodal data integration can compensate for systematic errors in some of the data and can greatly increase accuracy and coverage of genome structure models. We also show that alternative combinations of different orthogonal data sources can converge to models with similar predictive power. Moreover, our study reveals the key contributions of low-frequency (‘rare’) interchromosomal contacts to accurately predicting the global nuclear architecture, including the positioning of genes and chromosomes. Overall, our results highlight the benefits of multimodal data integration for genome structure analysis, available through the Integrative Genome Modeling software package. The Integrative Genome Modeling platform is a tool for population-based three-dimensional genome structure modeling and analysis by integrating various experimental data sources.
Estimating age from the transcriptome using RAPToR
a, Cartoon of individuals sampled at identical chronological times in two conditions, resulting in different developmental ages between groups owing to the condition impacting developmental speed. b,c, Cartoons of differential expression analysis situations where hidden developmental variation is either misinterpreted as an effect of the condition due to development and condition being confounded (b), or masking an effect of the condition owing to developmental spread (c). d–f, RAPToR staging exploits existing reference time-series expression data (d). This data is first decomposed into principal/independent components, which are interpolated with respect to time (e). Comp., component. The interpolated reference is then reconstructed at gene level with interpolated components and gene loadings (f). g,h, For each sample, a correlation profile is built by computing genome-wide Spearman correlation with every time point of the reference (g). Corr., correlation. The reference time with maximal correlation becomes the estimate, and bootstrapping on random gene subsets defines a confidence interval (see Methods) (h).
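The staging step in panels g,h can be sketched as follows. This is a simplified illustration that assumes the reference has already been interpolated into one expression vector per time point; the function names, the subset fraction and the percentile-based interval are assumptions for illustration and do not reproduce the RAPToR software's interface or its exact bootstrap procedure.

```python
import random
from math import sqrt


def ranks(values):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def stage_sample(sample, reference, times, genes=None):
    """Return (time of maximal correlation, correlation profile)."""
    genes = list(range(len(sample))) if genes is None else list(genes)
    s = [sample[g] for g in genes]
    profile = [spearman(s, [ref[g] for g in genes]) for ref in reference]
    best = max(range(len(times)), key=lambda i: profile[i])
    return times[best], profile


def bootstrap_interval(sample, reference, times, n_boot=100, frac=1 / 3, seed=0):
    """Percentile interval of estimates obtained on random gene subsets."""
    rng = random.Random(seed)
    n = len(sample)
    k = max(2, int(n * frac))
    estimates = sorted(
        stage_sample(sample, reference, times, rng.sample(range(n), k))[0]
        for _ in range(n_boot)
    )
    return estimates[int(0.025 * (n_boot - 1))], estimates[int(0.975 * (n_boot - 1))]


# Toy reference: three time points (hours), six genes each (hypothetical values).
times = [0, 10, 20]
reference = [[1, 2, 3, 4, 5, 6], [2, 1, 4, 3, 6, 5], [6, 5, 4, 3, 2, 1]]
sample = [2.1, 1.0, 4.2, 3.1, 6.3, 5.2]
estimate, profile = stage_sample(sample, reference, times)
```

With real data the reference would contain many interpolated time points, so the estimate can fall between the original sampling times.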
RAPToR precisely stages development and ageing, and works from whole-organism to single-cell data
a,b, Chronological age versus RAPToR estimates of C. elegans late-larval samples²⁶ (linear model is y = −4.7 + 1.6x; a) and D. rerio embryo samples²⁷ (linear model is y = 0.7x; b). c, Somite number versus RAPToR estimates of M. musculus embryo samples²⁹ (linear model is y = 9.2 + 0.05x). d, Chronological age versus RAPToR estimates of D. melanogaster embryo samples²⁷ (linear model is y = 1.3 + 0.77x). e,f, Selected principal components of the data staged in d, plotted in black along chronological age (e) and in red along RAPToR estimates (f). g–j, Chronological age versus RAPToR estimates of adult C. elegans bulk samples³³ (g) and single-worms³⁴ (h), of adult D. melanogaster³⁵ (i), and of human brain tissue³⁶ (j). k, Chronological age versus RAPToR estimates of dissected samples of upper jaw first molars from M. musculus embryos staged using the lower jaw samples as reference³⁷,³⁸. l, Inferred pseudo-time versus RAPToR estimates of H. sapiens embryo single cells³⁹. a–d,g,h, Staged samples²⁶,²⁷,²⁹,³³,³⁴ and references²⁰,²³–²⁵ are from independent time-series experiments. Original time points of the references within the plot area are shown to the right (blue), but the references can span much longer coverage.
Tissue-specific staging
a,b, Selected independent components from ICA on joint C. elegans RILs¹¹ (dots) and reference data²¹ they were staged on (orange line). c–f, As in (a,b), with RILs plotted in red along soma age (c,d), and in blue along germline age (e,f). g, Root mean square error between RILs and reference for independent components 2–8 when using soma, global or germline age estimates.
Staging samples cross-species
a, Chronological age versus RAPToR estimates for time series of embryo development of six Drosophila species⁴¹ staged on a D. melanogaster reference²⁵ (Extended Data Fig. 6). b, Spearman correlation between samples from a and the reference at age estimate, along RAPToR estimates. c, Chronological age versus RAPToR estimates for single cells of M. musculus embryos⁴², staged on a H. sapiens single-cell reference³⁹ using orthologs (Extended Data Fig. 7). d, Chronological age versus RAPToR estimates for C. elegans embryo samples²⁷, staged on a D. melanogaster reference²⁵ using orthologs.
Quantifying and correcting for developmental effects using RAPToR age estimates
a, Effect of increasing drug dose exposure⁴⁶ on RAPToR estimates of C. elegans germline age (Methods; Supplementary Fig. 12). P values are derived from two-sided t-tests on linear model coefficients for each drug. From top to bottom, respectively: mefloquine, P = 6.32 × 10⁻⁵, P = 0.0252, P = 0.8248; dichlorvos, P = 1.54 × 10⁻⁴, P = 0.0067, P = 0.0620; fenamiphos, P = 0.0040, P = 0.0946, P = 0.7702. n.s., not significant. b, RAPToR estimates versus reported chronological age highlight large developmental spread within time points of WT C. elegans and pash-1ts time series⁴⁷ (Supplementary Note 2 and Supplementary Fig. 13). c, R² per gene of identical models with chronological age, or RAPToR estimates. Genes and gene counts above and below dashed line (x = y) are indicated in red and black, respectively. d, Germline age estimates of control and post-dauer (PD) C. elegans adults⁴⁸, P = 0.025, derived from a two-tailed t-test. e, Germline gene logFCs between control and PD from d in comparison to logFCs expected from developmental time difference only (Extended Data Fig. 10). f, Chronological age versus RAPToR estimates of a time series of WT C. elegans and xrn-2 late larval development⁴⁹. Sample subsets defining a gold standard of truly DE genes and shifted WT sets used in subsequent panels are color-coded. g, Correlation of observed logFCs and expected developmental logFCs computed from the interpolated reference between the xrn-2 subset and increasingly shifted WT sets from f (Supplementary Note 2). h, PR curves showing the performance of a standard differential-expression model P value for each shifted WT subset in detecting gold-standard DE genes. i, AUPRC of standard differential expression model P value from h, or of the age-corrected classifier for each shifted WT subset in detecting gold-standard DE genes (Supplementary Note 2). DE, differentially expressed. In a,b,d, central bar denotes group mean.
Transcriptomic data is often affected by uncontrolled variation among samples that can obscure and confound the effects of interest. This variation is frequently due to unintended differences in developmental stages between samples. The transcriptome itself can be used to estimate developmental progression, but existing methods require many samples and do not estimate a specimen’s real age. Here we present real-age prediction from transcriptome staging on reference (RAPToR), a computational method that precisely estimates the real age of a sample from its transcriptome, exploiting existing time-series data as reference. RAPToR works with whole animal, dissected tissue and single-cell data for the most common animal models, humans and even for non-model organisms lacking reference data. We show that RAPToR can be used to remove age as a confounding factor and allow recovery of a signal of interest in differential expression analysis. RAPToR will be especially useful in large-scale single-organism profiling because it eliminates the need for accurate staging or synchronisation before profiling. Real age prediction from transcriptome staging on reference (RAPToR) precisely estimates the real age of a specimen on the basis of transcriptomic data. RAPToR is broadly applicable and can be used to remove age as a confounding variable.
Expeditions are delivering data wealth about our planet’s oceans, including its microbes. Some labs are now diving deep into the ocean virome.
This is a research briefing on our article published the same day.
Even without a stint on an ocean-faring vessel, scientists can trawl through data to explore marine viruses and address new puzzles and cultural shifts.
The study of human–animal chimeras is fraught with technical and ethical challenges. In this Comment, we discuss the importance and future of human–monkey chimera research within the context of current scientific and regulatory obstacles.
FlipFlop mice harbor functionally reversed T cells due to a switch in CD4 and CD8 expression.
Researchers repurpose cytosine deaminase toxin DddA to measure DNA–protein interaction sites in bacteria.
Deep Visual Proteomics combines the power of deep-learning-based image analysis with microdissection and ultrasensitive mass spectrometry to provide insights into the spatial proteome.
Machine learning decodes the hidden rules of enhancers.
A blend of science, food and fun can empower well-being and new connections.
Sequencing and assembly statistics for the Zymo mock bacterial species (n = 7)
a, Observed raw read accuracies measured through read-mapping. b, Observed homopolymer length of raw reads compared with the reference genomes (see Supplementary Figs. 2 and 3 for a complete overview). c, Observed indels of de novo assemblies per 100 kbp at different coverage levels, with and without Illumina polishing. Note that the reference genomes available for the Zymo mock are not identical to the sequenced strains (Supplementary Table 3). d, IDEEL²⁸ score, calculated as the proportion of predicted proteins that are ≥95% the length of their best-matching known protein in a database¹⁹. The dotted line represents the IDEEL score for the reference genome, while the dashed lines mark a 40-fold coverage cut-off.
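The score in panel d, as defined in the caption, is a simple proportion. The sketch below assumes each predicted protein has already been paired with the length of its best database hit; the function name `ideel_score` and the input layout are hypothetical, and the search step that finds the best hit is not shown.

```python
def ideel_score(length_pairs, min_ratio=0.95):
    """Fraction of predicted proteins at least `min_ratio` the length of their best hit.

    `length_pairs` holds (predicted_length, best_hit_length) tuples, one per
    predicted protein that has a database match.
    """
    pairs = list(length_pairs)
    if not pairs:
        raise ValueError("no proteins with a database match")
    full_length = sum(1 for query, hit in pairs if hit > 0 and query / hit >= min_ratio)
    return full_length / len(pairs)


# Hypothetical lengths in amino acids: two near full-length, two truncated.
score = ideel_score([(100, 100), (96, 100), (94, 100), (50, 200)])
```

Indels in an assembly tend to introduce frameshifts that truncate predicted proteins, so a lower proportion suggests more residual indel errors.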
Long-read Oxford Nanopore sequencing has democratized microbial genome sequencing and enables the recovery of highly contiguous microbial genomes from isolates or metagenomes. However, to obtain near-finished genomes it has been necessary to include short-read polishing to correct insertions and deletions derived from homopolymer regions. Here, we show that Oxford Nanopore R10.4 can be used to generate near-finished microbial genomes from isolates or metagenomes without short-read or reference polishing. This study demonstrates the feasibility of generating near-finished microbial genomes using only Oxford Nanopore R10.4 data from pure cultures or metagenomes.
The Emu algorithm
The Emu algorithm begins by generating alignments between input reads (R) and database sequences (S). The probability of each non-matching character alignment type (mismatch (X), insertion (I), deletion (D), softclip (S)) is calculated based on the number of occurrences of each character alignment type in all of the primary alignments from the read mapping. The probability of each alignment in the read mapping is then generated as P(r|t) from the counts of each character alignment type and their corresponding established probabilities. The EM phase is then entered: for each read, the likelihood that it is derived from each possible species in the database, P(t|r), is computed, and the overall composition estimate F(t) is deduced from these likelihoods. This cycle repeats as the composition estimate influences read-taxonomy probabilities to give more weight to taxa with higher abundances, then the composition estimate is updated accordingly. Once minimal changes are detected between cycle iterations, the EM loop is exited. The composition estimate is then trimmed based on the specified minimum abundance probability threshold, one final EM iteration is completed, and the final composition estimate is produced.
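The EM phase described above can be sketched in a few lines. This is a generic expectation–maximization loop written under stated assumptions, not the Emu source code: the alignment likelihoods P(r|t) are taken as given, convergence is judged by the largest change in F(t) (an assumed criterion), and the names `em_abundance`, `min_abundance` and `tol` are hypothetical.

```python
def em_abundance(read_likelihoods, taxa, min_abundance=1e-4, tol=1e-6, max_iter=1000):
    """Estimate the composition F(t) from per-read alignment likelihoods P(r|t).

    `read_likelihoods` is a list with one dict per read, mapping taxon -> P(r|t).
    """

    def em_step(f):
        totals = dict.fromkeys(f, 0.0)
        for lik in read_likelihoods:
            # E-step: P(t|r) is proportional to P(r|t) * F(t).
            weights = {t: lik.get(t, 0.0) * f[t] for t in f}
            norm = sum(weights.values())
            if norm == 0:
                continue  # read has no support among the remaining taxa
            for t, w in weights.items():
                totals[t] += w / norm
        # M-step: F(t) is the average of P(t|r) over assigned reads.
        assigned = sum(totals.values())
        if assigned == 0:
            raise ValueError("no read has a non-zero likelihood")
        return {t: v / assigned for t, v in totals.items()}

    f = {t: 1.0 / len(taxa) for t in taxa}  # uniform starting composition
    for _ in range(max_iter):
        updated = em_step(f)
        change = max(abs(updated[t] - f[t]) for t in f)
        f = updated
        if change < tol:
            break

    # Trim taxa below the threshold, renormalize, and run one final iteration.
    kept = {t: v for t, v in f.items() if v >= min_abundance}
    kept_total = sum(kept.values())
    kept = {t: v / kept_total for t, v in kept.items()}
    return em_step(kept)


# Toy input: three reads matching only A, one matching only B, one ambiguous.
reads = [{"A": 0.9}, {"A": 0.9}, {"A": 0.9}, {"B": 0.9}, {"A": 0.5, "B": 0.5}]
composition = em_abundance(reads, taxa=["A", "B", "C"])
```

In the toy input the ambiguous read is split three to one in favor of A at convergence, which illustrates how the composition estimate gives more weight to taxa with higher abundances.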
Performance on simulated ONT reads
a, Quantitative result statistics for our MBARC-26 simulated dataset. A heatmap is shown of species-level error between expected and inferred relative abundances, where darker blue denotes an underestimate by the software, darker red denotes an overestimate, and white represents no error. The color scheme is capped at ±10, meaning that errors greater than ±10% will be shown in the maximum error colors. Displayed are the 20 species claiming the largest abundance in any of the included results. ‘Other’ represents the sum of all species not shown in the figure for the respective column. Species-level L1-norm, L2-norm, precision, recall and F-score are also plotted for the methods evaluated. b, The same statistics shown in a for our CAMI2 simulated dataset.
Performance on our ZymoBIOMICS community standard dataset
Heatmap of species-level error between the calculated ground truth and estimated relative abundances, where darker blue denotes an underestimate by the software, darker red denotes an overestimate, and white represents no error. All ONT errors are measured in relation to the ground truth of the ONT dataset, while Illumina errors are measured in relation to the ground truth for the Illumina dataset. The color scheme is capped at ±10, meaning that errors greater than ±10% will be shown in the maximum error colors. Displayed are the 20 species claiming the largest abundance in any of the ONT or Illumina results. ‘Other’ represents the sum of all species not shown in the figure for the respective column. Species-level L1-norm, L2-norm, precision, recall and F-score are also plotted for the methods evaluated. The true- and false-positive counts used to calculate precision, recall and F-score are restricted to species with relative abundance ≥0.01% to align with guidance from ZymoBIOMICS on maximum levels of contamination.
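The summary statistics named in this caption can be computed as in the following sketch. It assumes abundances are expressed as fractions (so 0.01% is 1e-4) and that the abundance cut-off is applied to both profiles before counting true and false positives; that reading of the caption, and the function name `profile_metrics`, are assumptions.

```python
from math import sqrt


def profile_metrics(truth, estimate, min_abundance=1e-4):
    """L1-norm, L2-norm, precision, recall and F-score between two profiles.

    `truth` and `estimate` map species name -> relative abundance (fraction).
    """
    species = set(truth) | set(estimate)
    diffs = [estimate.get(s, 0.0) - truth.get(s, 0.0) for s in species]
    l1 = sum(abs(d) for d in diffs)
    l2 = sqrt(sum(d * d for d in diffs))

    # Presence/absence calls use only species at or above the cut-off.
    true_set = {s for s, v in truth.items() if v >= min_abundance}
    est_set = {s for s, v in estimate.items() if v >= min_abundance}
    tp = len(true_set & est_set)
    fp = len(est_set - true_set)
    fn = len(true_set - est_set)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return {"l1": l1, "l2": l2, "precision": precision, "recall": recall, "f_score": f_score}


# Hypothetical two-species truth and a three-species estimate.
metrics = profile_metrics({"A": 0.5, "B": 0.5}, {"A": 0.5, "B": 0.25, "C": 0.25})
```

The norms measure how far the abundances are from the ground truth, while precision and recall only ask whether each species was called present.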
Relative error after consecutive EM iterations within Emu on ZymoBIOMICS ONT reads
The relative error of the Emu algorithm after 1, 2, 3, 4, 5, 10, 15 and 20 EM iterations is shown, as well as the final Emu output (out) on our ZymoBIOMICS sample sequenced by an ONT device. The 15 most abundant species in the computational estimate are displayed. The final Emu output includes threshold trimming and final re-estimation after 22 EM iterations. Darker blue represents an underestimate by the method, while darker red represents an overestimate. The color scheme is capped at ±5, meaning that errors greater than ±5% will be shown in the maximum error colors. The false-positive count and L1-norm are reported for each iteration with the ZymoBIOMICS guaranteed minimum abundance threshold of 0.01% applied.
16S ribosomal RNA-based analysis is the established standard for elucidating the composition of microbial communities. While short-read 16S rRNA analyses are largely confined to genus-level resolution at best, given that only a portion of the gene is sequenced, full-length 16S rRNA gene amplicon sequences have the potential to provide species-level accuracy. However, existing taxonomic identification algorithms are not optimized for the increased read length and error rate often observed in long-read data. Here we present Emu, an approach that uses an expectation–maximization algorithm to generate taxonomic abundance profiles from full-length 16S rRNA reads. Results produced from simulated datasets and mock communities show that Emu is capable of accurate microbial community profiling while obtaining fewer false positives and false negatives than alternative methods. Additionally, we illustrate a real-world application of Emu by comparing clinical sample composition estimates generated by an established whole-genome shotgun sequencing workflow with those returned by full-length 16S rRNA gene sequences processed with Emu. Emu accurately estimates microbial abundance using full-length Nanopore 16S rRNA gene sequencing data.
Bioluminescent phasor imaging
a, Unique luciferase–luciferin pairs produce distinct phasor signatures. Both the enzyme and substrate influence phasor location. b, A four-channel detection scheme. c, Common luciferase–luciferin pairs (Nluc/FRZ and Fluc/d-luc) produce distinguishable fingerprints on the phasor plot. Varying amounts of Fluc (0−500 nM) and Nluc (0−10 nM) were treated with d-luc and FRZ, then imaged. d, Structurally similar luciferin analogs registered distinct locations on the phasor plot.
Facile multicomponent imaging with bioluminescent phasor
a, Bioluminescent phasor records the entire emission spectrum of a probe, enabling facile assignment of BRET efficiency. b, Seven BRET reporters were readily distinguished via bioluminescent phasor. c, BRET efficiencies were measured with single-cell resolution. The distribution of BRET efficiency for each ReNL-expressing cell is shown in the boxplot, where the central mark denotes the median and the edges define the 25th and 75th percentiles. Black dots indicate the value measured for each pixel within an individual cell. The whiskers extend to the most extreme data points and the outliers are plotted individually as red crosses. d, Cell mixtures were readily analyzed via the phasor fingerprints of individual reporters. Image is representative of n > 3 independent experiments. e, ReNL- and LumiScarlet-expressing cells could not be distinguished via conventional spectral imaging with a filter (LP560). The cells were readily discerned via bioluminescent phasor analysis. This experiment was repeated n = 3 times with similar results. For c–e, scale bars, 25 µm.
Continuous, excitation-free imaging of tumor spheroids
a, Total photon output from the heterogeneous spheroid (left) or unmixed channels (right) corresponding to CeNL (cyan), YeNL (yellow) and LumiScarlet (red) upon luciferin administration. Two imaging planes were used, with the upper plane shown in the top images and the lower plane shown in the bottom images for the individual channels. b, Maximum projection of the unmixed images from a. c, Magnified views of the boxed region from b. d, Bioluminescent phasor imaging over time, using a single bolus of luciferin. Composite images of the three unmixed channels corresponding to CeNL (cyan), YeNL (yellow) and LumiScarlet (red) are shown. Scale bars, 100 µm. This experiment was repeated two times with similar results.
Bioluminescence imaging with luciferase–luciferin pairs is a well-established technique for visualizing biological processes across tissues and whole organisms. Applications at the microscale, by contrast, have been hindered by a lack of detection platforms and easily resolved probes. We addressed this limitation by combining bioluminescence with phasor analysis, a method commonly used to distinguish spectrally similar fluorophores. We built a camera-based microscope equipped with special optical filters to directly assign phasor locations to unique luciferase–luciferin pairs. Six bioluminescent reporters were easily resolved in live cells, and the readouts were quantitative and instantaneous. Multiplexed imaging was also performed over extended time periods. Bioluminescent phasor further provided direct measures of resonance energy transfer in single cells, setting the stage for dynamic measures of cellular and molecular features. The merger of bioluminescence with phasor analysis fills a long-standing void in imaging capabilities, and will bolster future efforts to visualize biological events in real time and over multiple length scales. The combination of engineered probes and spectral phasor analysis overcomes long-standing challenges associated with bioluminescence detection at the microscale, enabling multiplexed, real-time imaging of cellular features without the need for excitation light.
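As an illustration of the phasor idea mentioned above, the following minimal sketch maps a sampled emission spectrum to a point (G, S) using the first harmonic of its cosine and sine transforms. It is not the authors' implementation: the microscope described here assigns phasor locations optically with filters, whereas this sketch computes them numerically. The function name, the uniform sampling of the spectrum and the example spectra are all assumptions made for illustration.

```python
import math


def spectral_phasor(intensities, harmonic=1):
    """Return (G, S) phasor coordinates for a uniformly sampled spectrum.

    `intensities` is a hypothetical list of non-negative emission values,
    one per spectral bin. The coordinates are the intensity-weighted
    averages of cosine and sine over one period spanning the spectrum.
    """
    total = sum(intensities)
    if total <= 0:
        raise ValueError("spectrum must contain positive signal")
    n = len(intensities)
    g = sum(v * math.cos(2 * math.pi * harmonic * k / n)
            for k, v in enumerate(intensities)) / total
    s = sum(v * math.sin(2 * math.pi * harmonic * k / n)
            for k, v in enumerate(intensities)) / total
    return g, s


# Two made-up spectra with different peak positions land at different points.
blue_like = [0.0, 1.0, 4.0, 1.0, 0.0, 0.0, 0.0, 0.0]
red_like = [0.0, 0.0, 0.0, 0.0, 1.0, 4.0, 1.0, 0.0]
phasor_blue = spectral_phasor(blue_like)
phasor_red = spectral_phasor(red_like)
```

In this representation a narrow spectrum sits near the unit circle and a flat spectrum sits near the origin, which is why spectrally similar emitters can still occupy separable positions.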
Initially widespread detection of pHis peptides in tryptic digests of human cells reduces to just a handful of genuine ones after stringent filtering
Whereas a stringent procedure to detect and filter out genuine pHis sites when analyzing E. coli samples predicts ~16% of the initially detected sites to be authentic, the same procedure marks only ~0.4% of the initially detected pHis sites in human cells as genuine.
It has been suggested that in mammalian cells histidine residues in proteins may become as frequently phosphorylated as serine, threonine and tyrosine, and may play a key role in mammalian signaling. Here we applied a robust workflow that earlier allowed us to detect histidine phosphorylation in bacteria unambiguously, to probe for histidine phosphorylation in four human cell lines. Initially, seemingly hundreds of protein histidine phosphorylations were picked up in all studied human cell lines. However, careful examination of the data, and several control experiments, led us to the conclusion that >99% of these initially assigned pHis sites were not genuine, and should be site localized to neighboring Ser/Thr residues. Nevertheless, our methods are selective enough to detect just a handful of genuine pHis sites in mammalian cells, representing well-known enzymatic intermediates. Consequently, we do not find any evidence in our data supporting that protein histidine phosphorylation plays a role in mammalian signaling. Extensive analyses of mammalian phosphoproteomics datasets show that protein histidine phosphorylation in human cells may not be as prevalent as previously thought.
Identifying transcription factors that reprogram starting cells to target cell types
a, Two common starting cell types, fibroblasts (skin cells) and pluripotent stem cells can be reprogrammed to eight cell types through over-expression of reprogramming transcription factors. b, Transcription factors that have been previously implicated in reprogramming protocols. c, Methods for identifying reprogramming transcription factors (TFs) from gene expression (RNA-seq) or chromatin accessibility (ATAC-seq). Output of these methods is either a ranked list of transcription factors or a rank representation of DNA binding domain of a transcription factor by a PWM. Methods operate under different assumptions of what are important features for ranking reprogramming factors.
Selection of genomic regions affects traditional DNA sequence-based methods for identification of transcription factors from chromatin accessibility
a, The three axes for selection of genomic regions are choice of top regions or regions that are target cell type-specific relative to starting cell type, choice of background sequences for discriminative motif discovery and number of regions. Values in parentheses represent the number of choices tested at each decision axis. b, For each traditional method, the best set of genomic regions is chosen based on transcription factor recovery of all eight cell types within the top ten ranked transcription factor motifs. c–e, Normalized AURC for top ten ranked factors averaged over eight cell types for each choice of negative sequences for discriminative motif discovery (genome-sampled GC-content matched sequences, universal enhancer sequences and random dinucleotide-preserving shuffled positive sequences), marginalized over number of regions and discriminative region selection axes (n = 150) (c), for each choice of differentially accessible regions using stem cells or fibroblasts as starting cell type as well as top accessible regions in the target cell type, marginalized over number of regions and negative background sequence selection (n = 150) (d), and stratified by number of regions, marginalizing over negative background sequence selection and discriminative region selection (n = 150) (e). Box plots show median and quartile values. Whiskers extend to represent the rest of the data distribution with the exception of outliers that are defined as values greater than 1.5 times the inter-quartile range and are plotted as individual points. f–i, Linear regression models predicting normalized recovery were used to estimate weights and 95% confidence intervals of each decision axis on reprogramming factor recovery at rank less than ten for AME (n = 345) (f), DREME (n = 345) (g), HOMER (n = 230) (h) and KMAC (n = 230) (i). Data are presented as parameter weights and 95% confidence intervals.
Feature P values are reported if significantly nonzero (P < 0.05) after Bonferroni correction for multiple hypothesis testing of parameters for each method regression model.
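The outlier convention used for the box plots in this caption can be sketched as follows. The caption states only the 1.5 × inter-quartile range rule; this sketch assumes the common Tukey reading, in which a point is an outlier when it lies more than 1.5 × IQR below the first quartile or above the third quartile. The function name and the sample values are hypothetical.

```python
import statistics


def tukey_outliers(values, k=1.5):
    """Return the values lying beyond k * IQR outside the quartiles.

    Assumes the Tukey box-plot convention; quartiles are computed with
    the 'inclusive' method of statistics.quantiles.
    """
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Made-up AURC-like scores with one extreme value.
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
outliers = tukey_outliers(scores)
```

Plotting libraries differ in how they compute quartiles, so the exact set of flagged points can vary slightly between tools.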
Use of histone mark and EP300 annotation does not significantly affect transcription factor recovery in liver cells
a, Normalized AURC within the top ten motifs on hepatocyte transcription factor recovery for accessible regions with addition of five epigenomic markers, each dot represents a unique combination of one of the four methods (KMAC, AME, HOMER and DREME), selection of differential regions, selection of background sequences and number of sequences. AURC for ATAC-seq and ATAC-seq overlapping active enhancer marks tested against ATAC-seq overlapping promoter mark H3K4me3 (n = 150) for ATAC (n = 150; rank sum statistic 2.896; P = 0.00378), ATAC + H3K27ac (n = 150; rank sum statistic 2.831; P = 0.00463), ATAC + EP300 (n = 150; rank sum statistic 3.221; P = 0.00128), ATAC + EP300_H3K27ac_H3K4me1 (n = 150; rank sum statistic 3.612; P = 0.000394) and ATAC + H3K4me1 (n = 150; rank sum statistic 2.500; P = 0.0124). Significance reported as P values by Rank sum test under Bonferroni correction with star indicating significant under adjusted threshold P < 0.0083 for multiple hypotheses (n = 6). b, Normalized AURC for four methods (KMAC, AME, HOMER and DREME) on hepatocyte transcription factor recovery (theoretical maximum AURC 1.0; range 0–0.953; n = 900). c, AURC within the top ten motifs for four methods (KMAC, AME, HOMER and DREME) on hepatocyte transcription factor recovery (theoretical maximum AURC 1.0; range 0–0.942; n = 900). b,c, All box plots show median and quartile values. Whiskers extend to represent the rest of the data distribution with the exception of outliers that are defined as values greater than 1.5 times the inter-quartile range and are plotted as individual points. d, Rank of each reprogramming transcription factor for each method and each genomic signal shows some trends that may relate to promoter (Hnf1) or enhancer (Fox) bias of transcription factors, and some highly consistent transcription factors (Gata).
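The adjusted threshold quoted in this caption can be reproduced with a short calculation. The sketch below assumes a family-wise significance level of 0.05 (the caption gives only the resulting threshold, P < 0.0083, and n = 6 hypotheses) and applies it to the five P values listed in panel a. The function and variable names are illustrative.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


# P values as listed in the caption for panel a.
p_values = {
    "ATAC": 0.00378,
    "ATAC + H3K27ac": 0.00463,
    "ATAC + EP300": 0.00128,
    "ATAC + EP300_H3K27ac_H3K4me1": 0.000394,
    "ATAC + H3K4me1": 0.0124,
}

# Assumed family-wise alpha of 0.05 divided across six hypotheses.
threshold = bonferroni_threshold(0.05, 6)
significant = {name: p < threshold for name, p in p_values.items()}
```

Under this assumption, every comparison except ATAC + H3K4me1 (P = 0.0124) falls below the adjusted threshold.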
Complex chromatin methods are top performers for transcription factor recovery and significance ranking
a, Evaluation of nine methods for each cell type (n = 8; CellNet n = 4) using normalized AURC for top 100 ranked factors. b, AURC for the top ten ranked motifs for each method in each cell type (n = 8; CellNet n = 4). Box plots show median and quartile values. Whiskers extend to represent the rest of the data distribution with the exception of outliers that are defined as values greater than 1.5 times the inter-quartile range and are plotted as individual points. c,d, Linear regression models used to estimate effect size and 95% confidence intervals of method on normalized AURC for top 100 ranked factors (n = 68) (c) and top ten ranked factors (n = 68) (d). Data are presented as parameter weights and 95% confidence intervals. Feature adjusted P values are reported if significantly nonzero (P < 0.05) after Benjamini–Hochberg correction for multiple hypothesis testing of parameters within the regression model. e, Reprogramming factor recall plots for each of the eight cell types for all nine methods. f, Rank of reprogramming transcription factors for each method and each cell type. g, Correlation between reprogramming factors ranked by methods and ranked by significance based on literature.
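For readers unfamiliar with the Benjamini–Hochberg correction named in this caption, a minimal sketch of the standard step-up adjustment follows. It is a generic illustration, not the authors' analysis code, and the example P values are invented.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order.

    Each P value is scaled by m / rank (rank 1 = smallest), then a
    running minimum from the largest rank downward enforces monotonicity.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw P values for four regression parameters.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
significant = [p < 0.05 for p in adjusted]
```

Unlike the Bonferroni correction, this procedure controls the false discovery rate rather than the family-wise error rate, so it is less conservative when many parameters are tested.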
Transcription factor over-expression is a proven method for reprogramming cells to a desired cell type for regenerative medicine and therapeutic discovery. However, a general method for the identification of reprogramming factors to create an arbitrary cell type is an open problem. Here we examine the success rate of methods and data for differentiation by testing the ability of nine computational methods (CellNet, GarNet, EBseq, AME, DREME, HOMER, KMAC, diffTF and DeepAccess) to discover and rank candidate factors for eight target cell types with known reprogramming solutions. We compare methods that use gene expression, biological networks and chromatin accessibility data, and comprehensively test parameter and preprocessing of input data to optimize performance. We find the best factor identification methods can identify an average of 50–60% of reprogramming factors within the top ten candidates, and methods that use chromatin accessibility perform the best. Among the chromatin accessibility methods, complex methods DeepAccess and diffTF have higher correlation with the ranked significance of transcription factor candidates within reprogramming protocols for differentiation. We provide evidence that AME and diffTF are optimal methods for transcription factor recovery that will allow for systematic prioritization of transcription factor candidates to aid in the design of new reprogramming protocols. A comparison of nine computational methods for identification of reprogramming factors for cell differentiation.
Proteomic map of mouse tissues
a, Illustration of the 41 tissues (covering 15 systems) and 66 PDAC cell lines subjected to proteome analysis. Each organ system is represented by a unique color code, and each tissue has a unique abbreviation, both are kept consistent throughout the figures. b, Number and overlap of identified protein-coding genes in the proteome and phosphoproteome datasets compared to the UniProt database. c,d, The number of protein and class I p-site (localization probability >0.75) identifications for each tissue (c) and cell line (d) is displayed by heatmap bars. The color gradient within each bar reflects the number of samples each protein or p-site was identified in, where the darkest color regions represent the ubiquitous proteomes and phosphoproteomes. Dashed lines indicate proteins and p-sites identified and quantified in all tissues or cell lines. e, Schematic representation of the data and analysis workflows available in ProteomicsDB and PACiFIC.
Consolidation of the mouse proteome
a, Pie charts showing the percentage of proteins identified by one or multiple peptides and grouped by UniProt protein evidence annotations (PE1–5). Numbers in brackets refer to the number of identified proteins, along with the number of unique genes they represent. b, Spectrum validation of four protein products for the gene Ahcyl2. In the left panel, the amino acid sequence of the canonical protein (Q68FL4) is shown, along with the three alternative products. Portions of the sequences identified in our dataset and which discriminate between the four isoforms are highlighted. In the right panel, a mirror plot of the experimental (E, top) and predicted (P, bottom) tandem mass spectra are shown for a representative peptide. Red and blue signals indicate y- and b-type fragment ions, respectively. Calculated SA of 0.9 indicates near identical spectra. c, Number of observed SEPs as a function of the SA comparing measured and predicted reference spectra. SA values of >0.7 (dotted line) indicate near perfect agreement. At this cutoff, our dataset retains 719 SEPs, mapping to 712 unique sORFs (blue area). The inserted pie chart shows the proportion of sORFs with or without MS-based supporting evidence in the database. d, Classification and characterization of the validated (SA > 0.7) sORFs, in terms of genetic coordinates (top), initiation codon usage (bottom-left) and intensity distribution (bottom-right). The box indicates the interquartile range (IQR), the black vertical line indicate median value and whiskers extend to the maximum and minimum values. e, Identification frequency of the validated SEPs across all tissues and all cell lines. Bottom panel, mirror plot of the experimental (E, top) and predicted peptide (P, bottom) tandem mass spectra of an identified SEP (EDNPFAGSR) without previous MS-based supporting evidence, representing the Rbakdn gene.
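The caption does not define the spectral angle (SA) score. The sketch below assumes the normalized spectral contrast angle that is commonly used to compare measured and predicted fragment spectra, SA = 1 − 2·arccos(cos θ)/π, where cos θ is the cosine similarity of the two intensity vectors. The function name and the matched-fragment intensity vectors are hypothetical.

```python
import math


def spectral_angle(observed, predicted):
    """Normalized spectral contrast angle between two intensity vectors.

    Assumed definition: 1 - 2 * arccos(cosine similarity) / pi, so that
    identical spectra score 1 and orthogonal spectra score 0. Both
    vectors must list intensities for the same fragment ions in order.
    """
    if len(observed) != len(predicted):
        raise ValueError("vectors must cover the same fragment ions")
    dot = sum(o * p for o, p in zip(observed, predicted))
    norm_o = math.sqrt(sum(o * o for o in observed))
    norm_p = math.sqrt(sum(p * p for p in predicted))
    if norm_o == 0 or norm_p == 0:
        return 0.0
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_o * norm_p)))
    return 1 - 2 * math.acos(cosine) / math.pi


# Made-up intensities for five matched fragment ions.
experimental = [0.9, 0.1, 0.5, 0.0, 0.3]
prediction = [1.0, 0.2, 0.4, 0.0, 0.3]
sa = spectral_angle(experimental, prediction)
passes_cutoff = sa > 0.7  # cutoff used in panel c of the caption
```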
Proteomic expression landscapes in the mouse
a, Dynamic range of protein abundance (blue) and p-sites (red). Protein abundance spans roughly seven orders of magnitude (OM), whereas p-sites abundance only spans roughly five. In both cases, around 90% of the proteome or phosphoproteome is confined to within around three OM around the median value. b, Cumulative protein (top) and p-site (bottom) intensities (ranked by abundance, x axis) and their contribution to total proteome and phosphoproteome mass (y axis), respectively, across all tissues or PDAC cell lines. The black solid line indicates the median, the filled area corresponds to the minimum and maximum across tissues or cell lines. c, Unsupervised clustering of mouse tissues and mPDAC proteomes, showing that strong qualitative and quantitative expression differences exist between the different proteomes. The clustering separates tissues from mPDACs, but also distinguishes the nervous system tissues, the female reproductive system tissues, the immune system tissues and, to a lesser extent, the digestive system tissues. d, Dynamic range of the intensity-ranked proteomes of three representative tissues: frontal lobe (FRL), tongue (TNG), and eyes (EYS). Five of the most abundant genes that relate to the functional specialization of the respective tissue are listed in descending order.
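The dynamic-range figures in panel a can be made concrete with a small calculation. This sketch assumes that "orders of magnitude" means the base-10 logarithm of the ratio between the highest and lowest intensity, and reads "within three OM around the median" as a window of total width three log10 units centered on the median; both readings, and the intensity values, are assumptions for illustration.

```python
import math
import statistics


def orders_of_magnitude(intensities):
    """log10 ratio between the largest and smallest positive intensity."""
    positive = [x for x in intensities if x > 0]
    if not positive:
        raise ValueError("need at least one positive intensity")
    return math.log10(max(positive) / min(positive))


def fraction_within(intensities, span_om=3.0):
    """Fraction of values within span_om / 2 log10 units of the median."""
    logs = [math.log10(x) for x in intensities if x > 0]
    center = statistics.median(logs)
    inside = [v for v in logs if abs(v - center) <= span_om / 2]
    return len(inside) / len(logs)


# Hypothetical protein intensities spanning seven orders of magnitude.
example = [1e3, 1e5, 1e6, 1e7, 1e10]
span = orders_of_magnitude(example)
core_fraction = fraction_within(example)
```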
Proteome comparative analysis across tissues and species
a, Violin plots (n = 29 tissues) depicting the spread in relative contribution of the selected molecular features that can predict gene-level protein abundance using our model across tissues and species. The white dot denotes the median, while box borders indicate the first and third quartiles. Whiskers extend to the maximum and minimum values. PPI, protein–protein interactions. b, Venn diagram of the relationship between orthologs and identified genes in the two species. c, Scatter plot of Pearson correlation coefficients as a measure for coexpression conservation. Each dot represents a gene annotation category (molecular functions, biological processes or cellular components). Across each tissue pair, when restricted to only the members of a given category, the proteome expression is highly correlated between mouse and human for most of the tested ontologies. However, for a small fraction of functional categories, their members are far less well conserved (higher variability of the Pearson correlation across tissues, x axis), suggesting different functional remodeling of the mouse and human proteomes during evolution. The dashed line marks the diagonal. d, PCA of the 21 mouse and human matching tissues showing a predominant clustering of the proteomes by species. Each tissue is represented by a color matching the ones used in Fig. 1 to represent the different anatomical systems. e, Proportion of gene expression variance explained by tissues (x axis) and by species (y axis) for each orthologous mouse–human gene pair (n = 7,459). The proteome abundance variations between mouse and human can be modeled considering two contributing factors: the species of origin and the type of tissues. Variance decomposition identified a large set of SVOs and TVOs. The density estimation is calculated independently for each of the three sections of the plot, denoted by the dashed lines. f, NACC between mouse and human matching tissues at the proteome and transcriptome level.
The distribution of NACC distances for each gene is shown, which represents the tendency of a gene to be coexpressed with the same set of orthologs in both species. The boxes indicate the IQR, the black horizontal lines indicate median values and whiskers extend to ±1.5× IQR; no outliers are shown. g, Percentage of orthologs having a certain fold change when comparing each tissue pair. Between the two species, orthologs can differ as much as 100-fold. The colored lines indicate the different tissues. h,i, Scatter plot depicting proteome-based expression levels of mouse and human genes with 1:1 orthologs, highlighting differentially expressed genes in heart (h) and liver (i). The solid black line indicates the linear model estimated by reduced major‐axis regression, other lines indicate absolute fold changes from the regression line of log2(10) and log2(100).
Linking large proteomic data collection with phenotypic drug and radiation response data
a, Schematic representation of the multilevel integrative analysis workflow performed in this study to identify protein or p-site signatures associated with sensitivity or resistance. b, General selection at protein level by the partitioning tree method of the mPDACs panel in the radiation response dataset. The inset shows the prediction accuracy (Pearson correlation, n = 100 predictive models) between the predicted and measured radiation activity of random forest models combining the selected 20 proteins (Methods). The median value and the IQR are indicated in purple. T, V and H indicate the training, the validation and the hold-out data, respectively. Markers for resistance and sensitivity are colored in orange and blue, respectively. This color scheme is consistently used throughout the other panels of the figure. c, Lrrfip1 is a sensitive marker for radiation response (n = 66 cell lines, Pearson correlation, two-sided Pearson correlation test P < 0.05). The filled area indicates the 95% confidence interval, in blue is the regression line. d, Same as Fig. 5b, but for p-sites. e, STRING-based interaction networks as before. DNA damage and chromatin modifying enzyme networks are highly enriched in p-sites positively correlated with radiation activity. f, Scatter plot from elastic net regression analysis showing that Sirt6 is a sensitivity marker for multiple inhibitors targeting Mek1/2. g, Scatter plot showing that Shroom2 is a sensitivity marker for five drugs targeting tubulin. ΔAUC indicates the difference between the maximum and minimum value of the standardized AUC across the tested cell lines, plotted against the P values of the Pearson correlation between Shroom2 abundance and drug sensitivity. h, Scatter plot showing that Mical2 Ser515 is a resistant marker for multiple inhibitors targeting CDK, CHK1 or ATR.
The laboratory mouse ranks among the most important experimental systems for biomedical research, and molecular reference maps of such models are essential informational tools. Here, we present a quantitative draft of the mouse proteome and phosphoproteome constructed from 41 healthy tissues, and several lines of analyses exemplify which insights can be gleaned from the data. For instance, tissue- and cell-type resolved profiles provide protein evidence for the expression of 17,000 genes, thousands of isoforms and 50,000 phosphorylation sites in vivo. Proteogenomic comparison of mouse, human and Arabidopsis reveals common and distinct mechanisms of gene expression regulation and, despite many similarities, numerous differentially abundant orthologs that likely serve species-specific functions. We leverage the mouse proteome by integrating phenotypic drug (n > 400) and radiation response data with the proteomes of 66 pancreatic ductal adenocarcinoma (PDAC) cell lines to reveal molecular markers for sensitivity and resistance. This unique atlas complements other molecular resources for the mouse and can be explored online via ProteomicsDB and PACiFIC. This work presents a quantitative draft of the mouse proteome and phosphoproteome constructed from 41 healthy tissues covering 15 major anatomical systems and 66 cell lines.
Current imaging approaches limit the ability to perform multi-scale characterization of three-dimensional (3D) organotypic cultures (organoids) in large numbers. Here, we present an automated multi-scale 3D imaging platform synergizing high-density organoid cultures with rapid and live 3D single-objective light-sheet imaging. It is composed of disposable microfabricated organoid culture chips, termed JeWells, with embedded optical components and a laser beam-steering unit coupled to a commercial inverted microscope. It permits streamlining organoid culture and high-content 3D imaging on a single user-friendly instrument with minimal manipulations and a throughput of 300 organoids per hour. We demonstrate that the large number of 3D stacks that can be collected via our platform allows training deep learning-based algorithms to quantify morphogenetic organizations of organoids at multi-scales, ranging from the subcellular scale to the whole organoid level. We validated the versatility and robustness of our approach on intestine, hepatic, neuroectoderm organoids and oncospheres. A method for high-content 3D imaging of organoids.
Inosine is a prevalent RNA modification in animals and is formed when an adenosine is deaminated by the ADAR family of enzymes. Traditionally, inosines are identified indirectly as variants from Illumina RNA-sequencing data because they are interpreted as guanosines by cellular machineries. However, this indirect method performs poorly in protein-coding regions where exons are typically short, in non-model organisms with sparsely annotated single-nucleotide polymorphisms, or in disease contexts where unknown DNA mutations are pervasive. Here, we show that Oxford Nanopore direct RNA sequencing can be used to identify inosine-containing sites in native transcriptomes with high accuracy. We trained convolutional neural network models to distinguish inosine from adenosine and guanosine, and to estimate the modification rate at each editing site. Furthermore, we demonstrated their utility on the transcriptomes of human, mouse and Xenopus. Our approach expands the toolkit for studying adenosine-to-inosine editing and can be further extended to investigate other RNA modifications. This work combines nanopore native RNA sequencing with machine learning models for identifying inosine-containing sites in transcriptomes.
Regulation of receptor tyrosine kinase (RTK) activity is necessary for studying cell signaling pathways in health and disease. We developed a generalized approach for engineering RTKs optically controlled with far-red light. We targeted the bacterial phytochrome DrBphP to the cell surface and allowed its light-induced conformational changes to be transmitted across the plasma membrane via transmembrane helices to intracellular RTK domains. Systematic optimization of these constructs has resulted in optically regulated epidermal growth factor receptor, HER2, TrkA, TrkB, FGFR1, IR1, cKIT and cMet, named eDrRTKs. eDrRTKs induced downstream signaling in mammalian cells in tens of seconds. The ability to activate eDrRTKs with far-red light enabled spectral multiplexing with fluorescent probes operating in a shorter spectral range, allowing for all-optical assays. We validated eDrTrkB performance in mice and found that minimally invasive stimulation in the neocortex with far-red light penetrating through the skull induced neural activity and early immediate gene expression, and affected sleep patterns.
The new capabilities of TrackMate
TrackMate can now create, use, analyze, and store object contours segmented from 2D images. These contours enable TrackMate to extract morphological features of the tracked objects over time. We also wrote a new application programming interface (API) to allow the integration of external components in TrackMate. We use this API to incorporate popular segmentation tools including ilastik, the Weka Trainable-Segmentation Fiji plugin, cellpose, StarDist, and the morphological segmentation tool MorphoLibJ within TrackMate. TrackMate can also import segmentation results as masks or label images and use them for tracking, making it compatible with any segmentation algorithm. B&W, Black and White.
TrackMate can be used to track objects from a wide variety of bio-imaging experiments
a, Migration of cells, labeled with SiR-DNA, recorded using a spinning disk confocal microscope and automatically tracked using a custom StarDist model loaded in TrackMate (see Supplementary Video 1). Detected cells and their local tracks (colors indicate track ID) are displayed. Scale bar, 250 µm. b, The migration of activated T cells plated on ICAM-1 was recorded using a brightfield microscope and automatically tracked using a custom StarDist model loaded in TrackMate (see Supplementary Video 2). Detected cells (colors indicate the mean track speed: blue, slow-moving cells; red, fast-moving cells) and their local tracks (colors indicate track ID) are displayed. Scale bar, 250 µm. c, MDA-MB-231 cells stably expressing an ERK activity reporter (ERK-KTR-Clover) and labeled using SiR-DNA were recorded live using a widefield fluorescence microscope over 17 hours. Cell nuclei were automatically tracked over time using a StarDist model available in TrackMate (see Supplementary Video 3). For each tracked cell, the average intensity of the ERK reporter was measured in their nucleus over time (directly in TrackMate). Changes in ERK activity and in instant velocity are displayed as heatmaps (blue, high; yellow, low). d, The growth of Neisseria meningitidis expressing PilQ-mCherry was recorded using a spinning-disk confocal microscope. An ilastik pixel classifier, trained to segment individual bacteria, was loaded into TrackMate to follow bacteria growth. Representative fields of view and the lineage tree of the bacteria highlighted in green are displayed (see Supplementary Video 5). Changes in area and circularity of a bacterium over the tracking period are also highlighted (green track). Cell division events translate into sharp decreases in area, followed by a quasi-linear increase. The circularity roughly plateaus during cell growth then decreases before cell division. Scale bar, 25 µm. 
e, Glioblastoma cells migrating on a polyacrylamide gel were automatically segmented using a custom cellpose model trained in the ZeroCostDL4Mic platform. The resulting label images were automatically tracked using TrackMate (see Supplementary Fig. 1 and Supplementary Video 6). Example raw and label images, as well as cell tracks, are displayed. f, 3D spheroids were stained for DAPI and imaged using a spinning disk confocal microscope. Across the Z volume, nuclei were detected at each Z plane using StarDist and tracked (all performed in TrackMate). Tracked nuclei were then exported as a label image to create 3D labels (see Supplementary Video 9).
TrackMate is an automated tracking software used to analyze bioimages and is distributed as a Fiji plugin. Here, we introduce a new version of TrackMate. TrackMate 7 is built to address the broad spectrum of modern challenges researchers face by integrating state-of-the-art segmentation algorithms into tracking pipelines. We illustrate qualitatively and quantitatively that these new capabilities function effectively across a wide range of bio-imaging experiments. TrackMate 7 combines the benefits of machine and deep learning-based image segmentation with accurate object tracking to enable improved 2D and 3D tracking of diverse objects in biological research.
Conceptual overview of MSNovelist
Using the existing SIRIUS and CSI:FingerID approach, a molecular fingerprint and a molecular formula were predicted. These data were used as input to an encoder–decoder RNN model with LSTM architecture to predict a SMILES sequence. Finally, candidate structures were ranked by modified Platt score, that is, according to the match to the predicted molecular fingerprint.
Validation of MSNovelist with GNPS dataset
a, Rank of correct structure in results for MSNovelist (blue), and naïve generation (orange), with ranking by modified Platt score (solid line) or by RNN score (dashed line), and comparison to database search (CSI:FingerID on PubChem; green) for the GNPS dataset (n = 3,863). b, Rank of correct structure in results for MSNovelist and naïve generation, with ranking by modified Platt score or ordered by model probability, and comparison to database search for GNPS-OK dataset (n = 1,507). c, Tanimoto similarity of best incorrect candidate to correct structure for MSNovelist, naïve generation, database search, best candidate from training set and random candidate from training set. d, Modified Platt score of top candidates, for MSNovelist, naïve generation, database search, best candidate from training set (light blue) and random candidate from training set (red). e, Three randomly chosen examples of incorrect predictions (top candidate) from GNPS dataset. Structures 1a, 2a and 3a represent de novo prediction; structures 1b, 2b and 3b represent a correct result. Red marks sites predicted incorrectly by the model (or the entire molecule if the prediction was completely wrong), and blue marks the corresponding correct alternative.
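The Tanimoto similarity used in panel c can be illustrated with a minimal sketch over binary fingerprints, represented here as sets of "on" bit positions. The fingerprints below are invented; the source does not specify which fingerprint type was used for this comparison, and the function name is illustrative.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of on-bits.

    Defined as |A intersect B| / |A union B|. Two empty fingerprints
    are treated as identical by convention in this sketch.
    """
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


# Hypothetical fingerprints for a candidate and the correct structure.
candidate = {1, 2, 3}
correct = {2, 3, 4}
similarity = tanimoto(candidate, correct)
```

A value of 1 means the two fingerprints share every set bit, so an incorrect candidate with a high score is still structurally close to the true compound.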
De novo annotation of bryophyte metabolites
a, Scores of best MSNovelist candidates versus best database scores for 232 spectra; the solid line represents a 1:1, and the dashed line represents $\mathrm{ModPlatt}_{\mathrm{MSNovelist}} = \mathrm{ModPlatt}_{\mathrm{DB}} + 50$; labels indicate spectrum ID. b, MS² spectrum of feature 377. c, Proposed spectrum interpretation for structure 377a (MSNovelist) and 377b (database).
Current methods for structure elucidation of small molecules rely on finding similarity with spectra of known compounds, but do not predict structures de novo for unknown compound classes. We present MSNovelist, which combines fingerprint prediction with an encoder–decoder neural network to generate structures de novo solely from tandem mass spectrometry (MS²) spectra. In an evaluation with 3,863 MS² spectra from the Global Natural Product Social Molecular Networking site, MSNovelist predicted 25% of structures correctly on first rank, retrieved 45% of structures overall and reproduced 61% of correct database annotations, without having ever seen the structure in the training phase. Similarly, for the CASMI 2016 challenge, MSNovelist correctly predicted 26% and retrieved 57% of structures, recovering 64% of correct database annotations. Finally, we illustrate the application of MSNovelist in a bryophyte MS² dataset, in which de novo structure prediction substantially outscored the best database candidate for seven spectra. MSNovelist is ideally suited to complement library-based annotation in the case of poorly represented analyte classes and novel compounds.
Lactylation was initially discovered on human histones. Given its nascence, its occurrence on nonhistone proteins and downstream functional consequences remain elusive. Here we report a cyclic immonium ion of lactyllysine formed during tandem mass spectrometry that enables confident protein lactylation assignment. We validated the sensitivity and specificity of this ion for lactylation through affinity-enriched lactylproteome analysis and large-scale informatic assessment of nonlactylated spectral libraries. With this diagnostic ion-based strategy, we confidently determined new lactylation, unveiling a wide landscape beyond histones from not only the enriched lactylproteome but also existing unenriched human proteome resources. Specifically, by mining the public human Meltome Atlas, we found that lactylation is common on glycolytic enzymes and conserved on ALDOA. We also discovered prevalent lactylation on DHRS7 in the draft of the human tissue proteome. We partially demonstrated the functional importance of lactylation: site-specific engineering of lactylation into ALDOA caused enzyme inhibition, suggesting a lactylation-dependent feedback loop in glycolysis.
A diagnostic fragment ion in tandem mass spectrometry enables confident protein lactylation assignment and the discovery of broad lysine modification beyond histones.
Cameras are a crucial part of microscopes and are also built into many kinds of instruments. To make their output comparable takes standards.
Imaging and microscopy technology advances in leaps and bounds. To address accumulated pain points, academics and companies are making headway on standards.
Evidence for at least one protein product from 80% of all mouse genes is reported in a comprehensive proteomic analysis of 41 adult mouse tissues. Comparison of tissue profiles between mouse and human suggests that the fundamental biology of this important model organism is even more different from our own than we thought.
A newly described fluorescent protein, StayGold, is bright and extremely photostable, enabling extended time-lapse imaging.
Top-cited authors
Pavel Tomancak
  • Max Planck Institute of Molecular Cell Biology and Genetics
Stephan Preibisch
  • Max-Delbrück-Centrum für Molekulare Medizin
Kevin Eliceiri
  • University of Wisconsin–Madison
Stephan Saalfeld
  • HHMI Janelia Research Campus
Johannes Schindelin