An increasing number of protein structures are being determined by cryogenic electron microscopy (cryo-EM). Although the resolution of determined cryo-EM density maps is improving in general, there are still many cases where amino acids of a protein are assigned with different levels of confidence. Here we developed a method that identifies potential misassignment of residues in the map, including residue shifts along an otherwise correct main-chain trace. The score, named DAQ, computes the likelihood that the local density corresponds to different amino acids, atoms, and secondary structures, estimated via deep learning, and assesses the consistency of the amino acid assignment in the protein structure model with that likelihood. When DAQ was applied to different versions of model structures in the Protein Data Bank that were derived from the same density maps, a clear improvement in the DAQ score was observed in the newer versions of the models. DAQ also found potential misassignment errors in a substantial number of deposited protein structure models built into cryo-EM maps. The DAQ score assesses the consistency of amino acid assignment in protein structure models with local density from cryo-EM maps. The method complements existing quality metrics and is a versatile tool for highlighting problematic regions of model structures.
Single-cell assay for transposase-accessible chromatin using sequencing (scATAC) shows great promise for studying cellular heterogeneity in epigenetic landscapes, but there remain important challenges in the analysis of scATAC data due to the inherent high dimensionality and sparsity. Here we introduce scBasset, a sequence-based convolutional neural network method to model scATAC data. We show that by leveraging the DNA sequence information underlying accessibility peaks and the expressiveness of a neural network model, scBasset achieves state-of-the-art performance across a variety of tasks on scATAC and single-cell multiome datasets, including cell clustering, scATAC profile denoising, data integration across assays and transcription factor activity inference. Using a sequence-based deep neural network, scBasset facilitates various tasks of single-cell ATAC-seq analysis in a unified framework.
Tardigrades are everywhere. They’re tiny — usually under a millimeter long — and they’re mostly transparent, so they’re easy to miss. But you probably walk by them every day. We’ve been grooming them as emerging models for studying how body forms evolve and how biological materials can survive extreme conditions.
DNA–protein interactions mediate physiologic gene regulation and may be altered by DNA variants linked to polygenic disease. To enhance the speed and signal-to-noise ratio (SNR) in the identification and quantification of proteins associated with specific DNA sequences in living cells, we developed proximal biotinylation by episomal recruitment (PROBER). PROBER uses high-copy episomes to amplify SNR, and proximity proteomics (BioID) to identify the transcription factors and additional gene regulators associated with short DNA sequences of interest. PROBER quantified both constitutive and inducible association of transcription factors and corresponding chromatin regulators to target DNA sequences and binding quantitative trait loci due to single-nucleotide variants. PROBER identified alterations in regulator associations due to cancer hotspot mutations in the hTERT promoter, indicating that these mutations increase promoter association with specific gene activators. PROBER provides an approach to rapidly identify proteins associated with specific DNA sequences and their variants in living cells.
Spatially resolved transcriptomics (SRT) provide gene expression close to, or even superior to, single-cell resolution while retaining the physical locations of sequencing and often also providing matched pathology images. However, SRT expression data suffer from high noise levels, due to the shallow coverage in each sequencing unit and the extra experimental steps required to preserve the locations of sequencing. Fortunately, such noise can be removed by leveraging information from the physical locations of sequencing, and the tissue organization reflected in corresponding pathology images. In this work, we developed Sprod, based on latent graph learning of matched location and imaging data, to impute accurate SRT gene expression. We validated Sprod comprehensively and demonstrated its advantages over previous methods for removing drop-outs in single-cell RNA-sequencing data. We showed that, after imputation by Sprod, differential expression analyses, pathway enrichment and cell-to-cell interaction inferences are more accurate. Overall, we envision de-noising by Sprod to become a key first step towards empowering SRT technologies for biomedical discoveries. Sprod accurately denoises spatially resolved transcriptomics data and improves downstream analysis results.
The advent of neuroimaging has increased our understanding of brain function. While most brain-wide functional imaging modalities exploit neurovascular coupling to map brain activity at millimeter resolutions, the recording of functional responses at microscopic scale in mammals remains the privilege of invasive electrophysiological or optical approaches, but is mostly restricted to either the cortical surface or the vicinity of implanted sensors. Ultrasound localization microscopy (ULM) has achieved transcranial imaging of cerebrovascular flow, up to micrometre scales, by localizing intravenously injected microbubbles; however, the long acquisition time required to detect microbubbles within microscopic vessels has so far restricted ULM application mainly to microvasculature structural imaging. Here we show how ULM can be modified to quantify functional hyperemia dynamically during brain activation reaching a 6.5-µm spatial and 1-s temporal resolution in deep regions of the rat brain. Functional ultrasound localization microscopy monitors cerebrovascular blood flow by detecting the flow of injected microbubbles, providing access to brain activity at high spatiotemporal resolution.
Dipole–dipole crosstalk between fluorophores separated by a distance of less than 10 nm induces changes in their photophysics, which adds a challenge to localization microscopy in the sub-10-nm regime.
Advances in super-resolution microscopy have demonstrated single-molecule localization precisions of a few nanometers. However, translation of such high localization precisions into sub-10-nm spatial resolution in biological samples remains challenging. Here we show that resonance energy transfer between fluorophores separated by less than 10 nm results in accelerated fluorescence blinking and consequently lower localization probabilities impeding sub-10-nm fluorescence imaging. We demonstrate that time-resolved fluorescence detection in combination with photoswitching fingerprint analysis can be used to determine the number and distance even of spatially unresolvable fluorophores in the sub-10-nm range. In combination with genetic code expansion with unnatural amino acids and bioorthogonal click labeling with small fluorophores, photoswitching fingerprint analysis can be used advantageously to reveal information about the number of fluorophores present and their distances in the sub-10-nm range in cells.
As the resident immune cells in the central nervous system (CNS), microglia orchestrate immune responses and dynamically sculpt neural circuits in the CNS. Microglial dysfunction and mutations of microglia-specific genes have been implicated in many diseases of the CNS. Developing effective and safe vehicles for transgene delivery into microglia will facilitate the studies of microglia biology and microglia-associated disease mechanisms. Here, we report the discovery of adeno-associated virus (AAV) variants that mediate efficient in vitro and in vivo microglial transduction via directed evolution of the AAV capsid protein. These AAV-cMG and AAV-MG variants are capable of delivering various genetic payloads into microglia with high efficiency, and enable sufficient transgene expression to support fluorescent labeling, Ca2+ and neurotransmitter imaging and genome editing in microglia in vivo. Furthermore, single-cell RNA sequencing shows that the AAV-MG variants mediate in vivo transgene delivery without inducing microglia immune activation. These AAV variants should facilitate the use of various genetically encoded sensors and effectors in the study of microglia-related biology. Recombinant adeno-associated virus tools for enhanced microglial transduction in mice are reported. These viruses can be used to express functional reporters or genome editing tools with high microglial specificity, with the help of microglia-specific Cre lines.
Explaining the diversity and complexity of protein localization is essential to fully understand cellular architecture. Here we present cytoself, a deep-learning approach for fully self-supervised protein localization profiling and clustering. Cytoself leverages a self-supervised training scheme that does not require preexisting knowledge, categories or annotations. Training cytoself on images of 1,311 endogenously labeled proteins from the OpenCell database reveals a highly resolved protein localization atlas that recapitulates major scales of cellular organization, from coarse classes, such as nuclear and cytoplasmic, to the subtle localization signatures of individual protein complexes. We quantitatively validate cytoself’s ability to cluster proteins into organelles and protein complexes, showing that cytoself outperforms previous self-supervised approaches. Moreover, to better understand the inner workings of our model, we dissect the emergent features from which our clustering is derived, interpret them in the context of the fluorescence images, and analyze the performance contributions of each component of our approach. Cytoself is a self-supervised deep learning-based approach for profiling and clustering protein localization from fluorescence images. Cytoself outperforms established approaches and can accurately predict protein subcellular localization.
A multitude of sequencing-based and microscopy technologies provide the means to unravel the relationship between the three-dimensional organization of genomes and key regulatory processes of genome function. Here, we develop a multimodal data integration approach to produce populations of single-cell genome structures that are highly predictive for nuclear locations of genes and nuclear bodies, local chromatin compaction and spatial segregation of functionally related chromatin. We demonstrate that multimodal data integration can compensate for systematic errors in some of the data and can greatly increase accuracy and coverage of genome structure models. We also show that alternative combinations of different orthogonal data sources can converge to models with similar predictive power. Moreover, our study reveals the key contributions of low-frequency (‘rare’) interchromosomal contacts to accurately predicting the global nuclear architecture, including the positioning of genes and chromosomes. Overall, our results highlight the benefits of multimodal data integration for genome structure analysis, available through the Integrative Genome Modeling software package. The Integrative Genome Modeling platform is a tool for population-based three-dimensional genome structure modeling and analysis by integrating various experimental data sources.
Transcriptomic data is often affected by uncontrolled variation among samples that can obscure and confound the effects of interest. This variation is frequently due to unintended differences in developmental stages between samples. The transcriptome itself can be used to estimate developmental progression, but existing methods require many samples and do not estimate a specimen’s real age. Here we present real-age prediction from transcriptome staging on reference (RAPToR), a computational method that precisely estimates the real age of a sample from its transcriptome, exploiting existing time-series data as reference. RAPToR works with whole animal, dissected tissue and single-cell data for the most common animal models, humans and even for non-model organisms lacking reference data. We show that RAPToR can be used to remove age as a confounding factor and allow recovery of a signal of interest in differential expression analysis. RAPToR will be especially useful in large-scale single-organism profiling because it eliminates the need for accurate staging or synchronisation before profiling. Real age prediction from transcriptome staging on reference (RAPToR) precisely estimates the real age of a specimen on the basis of transcriptomic data. RAPToR is broadly applicable and can be used to remove age as a confounding variable.
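One way to picture staging a sample against a reference time series is sketched below. It is only an illustration under stated assumptions, not the RAPToR implementation: the abstract does not specify the interpolation or similarity measure, so linear interpolation and Pearson correlation are assumed here, and the function names `pearson`, `interpolate_profile` and `estimate_age` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def interpolate_profile(times, profiles, t):
    """Linearly interpolate the reference profile at time t (assumed scheme)."""
    for k in range(len(times) - 1):
        t0, t1 = times[k], times[k + 1]
        if t0 <= t <= t1:
            w = (t - t0) / (t1 - t0)
            return [(1 - w) * a + w * b
                    for a, b in zip(profiles[k], profiles[k + 1])]
    raise ValueError("t is outside the reference time range")


def estimate_age(sample, times, profiles, steps=100):
    """Return (age, correlation) for the grid time that best matches the sample."""
    t_min, t_max = times[0], times[-1]
    best_t, best_r = None, -2.0
    for i in range(steps + 1):
        # Clamp to guard against floating-point overshoot at the last grid point.
        t = min(t_min + (t_max - t_min) * i / steps, t_max)
        r = pearson(sample, interpolate_profile(times, profiles, t))
        if r > best_r:
            best_t, best_r = t, r
    return best_t, best_r


# Toy reference: three time points, three genes (made-up numbers).
times = [0.0, 10.0, 20.0]
profiles = [[1.0, 5.0, 2.0], [3.0, 4.0, 6.0], [8.0, 1.0, 7.0]]
sample = interpolate_profile(times, profiles, 14.0)
age, corr = estimate_age(sample, times, profiles)
```

In this toy case the sample is drawn from the reference itself, so the best-matching grid time coincides with the time it was generated at; real data would need many more genes and a noise-tolerant similarity measure.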
The study of human–animal chimeras is fraught with technical and ethical challenges. In this Comment, we discuss the importance and future of human–monkey chimera research within the context of current scientific and regulatory obstacles.
Long-read Oxford Nanopore sequencing has democratized microbial genome sequencing and enables the recovery of highly contiguous microbial genomes from isolates or metagenomes. However, to obtain near-finished genomes it has been necessary to include short-read polishing to correct insertions and deletions derived from homopolymer regions. Here, we show that Oxford Nanopore R10.4 can be used to generate near-finished microbial genomes from isolates or metagenomes without short-read or reference polishing. This study demonstrates the feasibility of generating near-finished microbial genomes using only Oxford Nanopore R10.4 data from pure cultures or metagenomes.
16S ribosomal RNA-based analysis is the established standard for elucidating the composition of microbial communities. While short-read 16S rRNA analyses are largely confined to genus-level resolution at best, given that only a portion of the gene is sequenced, full-length 16S rRNA gene amplicon sequences have the potential to provide species-level accuracy. However, existing taxonomic identification algorithms are not optimized for the increased read length and error rate often observed in long-read data. Here we present Emu, an approach that uses an expectation–maximization algorithm to generate taxonomic abundance profiles from full-length 16S rRNA reads. Results produced from simulated datasets and mock communities show that Emu is capable of accurate microbial community profiling while obtaining fewer false positives and false negatives than alternative methods. Additionally, we illustrate a real-world application of Emu by comparing clinical sample composition estimates generated by an established whole-genome shotgun sequencing workflow with those returned by full-length 16S rRNA gene sequences processed with Emu. Emu accurately estimates microbial abundance using full-length Nanopore 16S rRNA gene sequencing data.
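The expectation–maximization idea behind abundance profiling can be sketched as a simple mixture-proportion estimate. This is a minimal illustration, not Emu's actual code: the input matrix of per-read likelihoods, the function name `em_abundance` and the convergence settings are all assumptions made for the example.

```python
def em_abundance(read_likelihoods, n_iter=200, tol=1e-9):
    """Estimate taxon proportions from per-read likelihoods with EM.

    read_likelihoods[r][t] is a hypothetical P(read r | taxon t), for
    example derived from alignment scores. Every row needs at least one
    positive entry.
    """
    n_reads = len(read_likelihoods)
    n_taxa = len(read_likelihoods[0])
    abundance = [1.0 / n_taxa] * n_taxa  # uniform starting guess
    for _ in range(n_iter):
        # E-step: posterior probability that each read came from each taxon.
        totals = [0.0] * n_taxa
        for row in read_likelihoods:
            weighted = [a * p for a, p in zip(abundance, row)]
            norm = sum(weighted)
            for t, w in enumerate(weighted):
                totals[t] += w / norm
        # M-step: the new abundance is the mean posterior over all reads.
        updated = [x / n_reads for x in totals]
        change = max(abs(u - a) for u, a in zip(updated, abundance))
        abundance = updated
        if change < tol:
            break
    return abundance


# Three reads match only taxon 0 and one read matches only taxon 1.
unambiguous = em_abundance([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

# An ambiguous read is pulled toward the taxon supported by the other reads.
ambiguous = em_abundance([[1.0, 0.0], [1.0, 0.0], [0.5, 0.5]])
```

In the second call the read that matches both taxa equally well ends up attributed almost entirely to taxon 0, which shows how iterating lets well-supported taxa absorb ambiguous reads and can suppress spurious low-abundance calls.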
Bioluminescence imaging with luciferase–luciferin pairs is a well-established technique for visualizing biological processes across tissues and whole organisms. Applications at the microscale, by contrast, have been hindered by a lack of detection platforms and easily resolved probes. We addressed this limitation by combining bioluminescence with phasor analysis, a method commonly used to distinguish spectrally similar fluorophores. We built a camera-based microscope equipped with special optical filters to directly assign phasor locations to unique luciferase–luciferin pairs. Six bioluminescent reporters were easily resolved in live cells, and the readouts were quantitative and instantaneous. Multiplexed imaging was also performed over extended time periods. Bioluminescent phasor further provided direct measures of resonance energy transfer in single cells, setting the stage for dynamic measures of cellular and molecular features. The merger of bioluminescence with phasor analysis fills a long-standing void in imaging capabilities, and will bolster future efforts to visualize biological events in real time and over multiple length scales. The combination of engineered probes and spectral phasor analysis overcomes long-standing challenges associated with bioluminescence detection at the microscale, enabling multiplexed, real-time imaging of cellular features without the need for excitation light.
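For readers unfamiliar with phasor analysis, the generic spectral phasor transform can be sketched as follows. This is the textbook first-harmonic Fourier transform of a sampled emission spectrum, offered as an assumption-laden illustration; it is not the filter-based acquisition described above, and the function name `spectral_phasor` and the four-channel toy spectra are hypothetical.

```python
import math


def spectral_phasor(intensities, harmonic=1):
    """Map a sampled emission spectrum to phasor coordinates (G, S).

    intensities[k] is the signal in spectral channel k; channels are assumed
    to be evenly spaced across the detection window.
    """
    total = sum(intensities)
    n = len(intensities)
    g = sum(i * math.cos(2 * math.pi * harmonic * k / n)
            for k, i in enumerate(intensities)) / total
    s = sum(i * math.sin(2 * math.pi * harmonic * k / n)
            for k, i in enumerate(intensities)) / total
    return g, s


# Two made-up reporters that each emit in a single channel.
reporter_a = [1.0, 0.0, 0.0, 0.0]
reporter_b = [0.0, 1.0, 0.0, 0.0]
# An equal-intensity mixture of the two reporters.
mixture = [a + b for a, b in zip(reporter_a, reporter_b)]

g_a, s_a = spectral_phasor(reporter_a)
g_b, s_b = spectral_phasor(reporter_b)
g_mix, s_mix = spectral_phasor(mixture)
```

Because the transform is linear in intensity, the mixture lands on the straight line between the two pure reporters at a position set by their intensity fractions, which is the property that makes phasor plots convenient for unmixing spectrally similar emitters.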
It has been suggested that in mammalian cells histidine residues in proteins may become as frequently phosphorylated as serine, threonine and tyrosine, and may play a key role in mammalian signaling. Here we applied a robust workflow that earlier allowed us to detect histidine phosphorylation in bacteria unambiguously, to probe for histidine phosphorylation in four human cell lines. Initially, seemingly hundreds of protein histidine phosphorylations were picked up in all studied human cell lines. However, careful examination of the data, and several control experiments, led us to the conclusion that >99% of these initially assigned pHis sites were not genuine, and should be site localized to neighboring Ser/Thr residues. Nevertheless, our methods are selective enough to detect just a handful of genuine pHis sites in mammalian cells, representing well-known enzymatic intermediates. Consequently, we do not find any evidence in our data supporting that protein histidine phosphorylation plays a role in mammalian signaling. Extensive analyses of mammalian phosphoproteomics datasets show that protein histidine phosphorylation in human cells may not be as prevalent as previously thought.
Transcription factor over-expression is a proven method for reprogramming cells to a desired cell type for regenerative medicine and therapeutic discovery. However, a general method for identifying the reprogramming factors needed to create an arbitrary cell type remains an open problem. Here we examine the success rate of methods and data for differentiation by testing the ability of nine computational methods (CellNet, GarNet, EBseq, AME, DREME, HOMER, KMAC, diffTF and DeepAccess) to discover and rank candidate factors for eight target cell types with known reprogramming solutions. We compare methods that use gene expression, biological networks and chromatin accessibility data, and comprehensively test parameters and preprocessing of input data to optimize performance. We find that the best factor identification methods can identify an average of 50–60% of reprogramming factors within the top ten candidates, and that methods using chromatin accessibility perform the best. Among the chromatin accessibility methods, the complex methods DeepAccess and diffTF have higher correlation with the ranked significance of transcription factor candidates within reprogramming protocols for differentiation. We provide evidence that AME and diffTF are optimal methods for transcription factor recovery that will allow for systematic prioritization of transcription factor candidates to aid in the design of new reprogramming protocols. A comparison of nine computational methods for identification of reprogramming factors for cell differentiation.
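The "fraction of reprogramming factors within the top ten candidates" statistic can be made concrete with a short sketch. The function name `recovery_at_k` and the placeholder factor names are hypothetical and carry no data from the study; the exact scoring used by the authors may differ.

```python
def recovery_at_k(ranked_candidates, known_factors, k=10):
    """Fraction of known reprogramming factors found in the top-k ranking."""
    top = set(ranked_candidates[:k])
    known = set(known_factors)
    return len(top & known) / len(known)


# Placeholder ranking from some method, best candidate first (made-up names).
ranking = ["TF_C", "TF_X", "TF_A", "TF_Y", "TF_Z", "TF_Q",
           "TF_R", "TF_S", "TF_T", "TF_U", "TF_B"]
# Placeholder factors of a known reprogramming protocol.
protocol = ["TF_A", "TF_B", "TF_C"]

score = recovery_at_k(ranking, protocol, k=10)
```

Here two of the three protocol factors fall inside the top ten and the third is ranked eleventh, so the recovery is two thirds. Averaging this value over target cell types gives a summary comparable in spirit to the 50–60% figure quoted above.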
The laboratory mouse ranks among the most important experimental systems for biomedical research, and molecular reference maps of such models are essential informational tools. Here, we present a quantitative draft of the mouse proteome and phosphoproteome constructed from 41 healthy tissues. Several lines of analyses exemplify which insights can be gleaned from the data. For instance, tissue- and cell-type resolved profiles provide protein evidence for the expression of 17,000 genes, thousands of isoforms and 50,000 phosphorylation sites in vivo. Proteogenomic comparison of mouse, human and Arabidopsis reveals common and distinct mechanisms of gene expression regulation and, despite many similarities, numerous differentially abundant orthologs that likely serve species-specific functions. We leverage the mouse proteome by integrating phenotypic drug (n > 400) and radiation response data with the proteomes of 66 pancreatic ductal adenocarcinoma (PDAC) cell lines to reveal molecular markers for sensitivity and resistance. This unique atlas complements other molecular resources for the mouse and can be explored online via ProteomicsDB and PACiFIC. This work presents a quantitative draft of the mouse proteome and phosphoproteome constructed from 41 healthy tissues covering 15 major anatomical systems and 66 cell lines.
Current imaging approaches limit the ability to perform multi-scale characterization of three-dimensional (3D) organotypic cultures (organoids) in large numbers. Here, we present an automated multi-scale 3D imaging platform synergizing high-density organoid cultures with rapid and live 3D single-objective light-sheet imaging. It is composed of disposable microfabricated organoid culture chips, termed JeWells, with embedded optical components and a laser beam-steering unit coupled to a commercial inverted microscope. It permits streamlining organoid culture and high-content 3D imaging on a single user-friendly instrument with minimal manipulations and a throughput of 300 organoids per hour. We demonstrate that the large number of 3D stacks that can be collected via our platform allows training deep learning-based algorithms to quantify morphogenetic organizations of organoids at multi-scales, ranging from the subcellular scale to the whole organoid level. We validated the versatility and robustness of our approach on intestine, hepatic, neuroectoderm organoids and oncospheres. A method for high-content 3D imaging of organoids.
Inosine is a prevalent RNA modification in animals and is formed when an adenosine is deaminated by the ADAR family of enzymes. Traditionally, inosines are identified indirectly as variants from Illumina RNA-sequencing data because they are interpreted as guanosines by cellular machineries. However, this indirect method performs poorly in protein-coding regions where exons are typically short, in non-model organisms with sparsely annotated single-nucleotide polymorphisms, or in disease contexts where unknown DNA mutations are pervasive. Here, we show that Oxford Nanopore direct RNA sequencing can be used to identify inosine-containing sites in native transcriptomes with high accuracy. We trained convolutional neural network models to distinguish inosine from adenosine and guanosine, and to estimate the modification rate at each editing site. Furthermore, we demonstrated their utility on the transcriptomes of human, mouse and Xenopus. Our approach expands the toolkit for studying adenosine-to-inosine editing and can be further extended to investigate other RNA modifications. This work combines nanopore native RNA sequencing with machine learning models for identifying inosine-containing sites in transcriptomes.
Regulation of receptor tyrosine kinase (RTK) activity is necessary for studying cell signaling pathways in health and disease. We developed a generalized approach for engineering RTKs optically controlled with far-red light. We targeted the bacterial phytochrome DrBphP to the cell surface and allowed its light-induced conformational changes to be transmitted across the plasma membrane via transmembrane helices to intracellular RTK domains. Systematic optimization of these constructs has resulted in optically regulated epidermal growth factor receptor, HER2, TrkA, TrkB, FGFR1, IR1, cKIT and cMet, named eDrRTKs. eDrRTKs induced downstream signaling in mammalian cells in tens of seconds. The ability to activate eDrRTKs with far-red light enabled spectral multiplexing with fluorescent probes operating in a shorter spectral range, allowing for all-optical assays. We validated eDrTrkB performance in mice and found that minimally invasive stimulation in the neocortex with far-red light penetrating through the skull induced neural activity and immediate early gene expression, and affected sleep patterns.
TrackMate is an automated tracking software used to analyze bioimages and is distributed as a Fiji plugin. Here, we introduce a new version of TrackMate. TrackMate 7 is built to address the broad spectrum of modern challenges researchers face by integrating state-of-the-art segmentation algorithms into tracking pipelines. We illustrate qualitatively and quantitatively that these new capabilities function effectively across a wide range of bio-imaging experiments. TrackMate 7 combines the benefits of machine and deep learning-based image segmentation with accurate object tracking to enable improved 2D and 3D tracking of diverse objects in biological research.
Current methods for structure elucidation of small molecules rely on finding similarity with spectra of known compounds, but do not predict structures de novo for unknown compound classes. We present MSNovelist, which combines fingerprint prediction with an encoder–decoder neural network to generate structures de novo solely from tandem mass spectrometry (MS²) spectra. In an evaluation with 3,863 MS² spectra from the Global Natural Product Social Molecular Networking site, MSNovelist predicted 25% of structures correctly on first rank, retrieved 45% of structures overall and reproduced 61% of correct database annotations, without having ever seen the structure in the training phase. Similarly, for the CASMI 2016 challenge, MSNovelist correctly predicted 26% and retrieved 57% of structures, recovering 64% of correct database annotations. Finally, we illustrate the application of MSNovelist in a bryophyte MS² dataset, in which de novo structure prediction substantially outscored the best database candidate for seven spectra. MSNovelist is ideally suited to complement library-based annotation in the case of poorly represented analyte classes and novel compounds.
Lactylation was initially discovered on human histones. Given its nascence, its occurrence on nonhistone proteins and downstream functional consequences remain elusive. Here we report a cyclic immonium ion of lactyllysine formed during tandem mass spectrometry that enables confident protein lactylation assignment. We validated the sensitivity and specificity of this ion for lactylation through affinity-enriched lactylproteome analysis and large-scale informatic assessment of nonlactylated spectral libraries. With this diagnostic ion-based strategy, we confidently determined new lactylation, unveiling a wide landscape beyond histones from not only the enriched lactylproteome but also existing unenriched human proteome resources. Specifically, by mining the public human Meltome Atlas, we found that lactylation is common on glycolytic enzymes and conserved on ALDOA. We also discovered prevalent lactylation on DHRS7 in the draft of the human tissue proteome. We partially demonstrated the functional importance of lactylation: site-specific engineering of lactylation into ALDOA caused enzyme inhibition, suggesting a lactylation-dependent feedback loop in glycolysis.
Evidence for at least one protein product from 80% of all mouse genes is reported in a comprehensive proteomic analysis of 41 adult mouse tissues. Comparison of tissue profiles between mouse and human suggests that the fundamental biology of this important model organism is even more different from our own than we thought.