Large-scale genome sequencing has gained general importance for the life sciences because the theory of biomolecular sequence homology makes it possible to annotate the function of otherwise experimentally uncharacterized sequences. Historically, the paradigm that similarity of protein sequences implies common structure, function and ancestry was generalized from studies of globular domains. Having the same fold imposes strict conditions on the packing of the hydrophobic core, requiring similarity of hydrophobic patterns. The implications of sequence similarity among non-globular protein segments have not been studied to the same extent; nevertheless, homology considerations are silently extended to them. This is especially detrimental in the case of transmembrane helices (TMs) and signal peptides (SPs), where sequence similarity is necessarily a consequence of physical requirements rather than of common ancestry. Matching of SPs/TMs thus creates the illusion of matching hydrophobic cores, so the inclusion of SPs/TMs in domain models can give rise to wrong annotations. More than 1001 of the 10,340 models in Pfam release 23, and 18 of the 809 domains in SMART version 6, contain SP/TM regions. As expected, fragment-mode HMM searches generate promiscuous hits among clearly unrelated proteins that are limited solely to the SP/TM part. More worryingly, we show explicit examples in which the scores of clearly false-positive hits, even in global-mode searches, are elevated into the significance range merely by matching the hydrophobic runs. Using conservative criteria, we find that between 2.1% and 13.6% of the annotated Pfam hits in the PIR iProClass database v3.74 appear unjustified for a set of validated domain models. Thus, false-positive domain hits enforced by SP/TM regions can lead to dramatic annotation errors in which the hit has nothing in common with the problematic domain model except the SP/TM region itself.
We suggest a workflow of flagging problematic hits arising from SP/TM-containing models for critical reconsideration by annotation users.
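The hydrophobic-run matching at the heart of this problem can be made concrete with a sliding-window hydropathy scan. The sketch below uses the standard Kyte–Doolittle scale; the window length (19 residues, a typical TM-helix span) and the threshold are illustrative choices, not parameters from the study:

```python
# Kyte-Doolittle hydropathy values for the 20 amino acids
KD = {'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5,
      'Q': -3.5, 'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5,
      'L': 3.8, 'K': -3.9, 'M': 1.9, 'F': 2.8, 'P': -1.6,
      'S': -0.8, 'T': -0.7, 'W': -0.9, 'Y': -1.3, 'V': 4.2}

def hydrophobic_windows(seq, window=19, threshold=1.6):
    """Return start positions of windows whose mean KD hydropathy
    exceeds the threshold -- candidate SP/TM-like hydrophobic runs
    that could enforce spurious matches between unrelated proteins."""
    hits = []
    for i in range(len(seq) - window + 1):
        if sum(KD[a] for a in seq[i:i + window]) / window > threshold:
            hits.append(i)
    return hits
```

Flagging such windows in a domain model is one way to mark hits for the critical reconsideration suggested above.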
The 1000 Genomes Project data provides a natural background dataset for amino acid germline mutations in humans. Since the direction of mutation is known, the amino acid exchange matrix generated from the observed nucleotide variants is asymmetric and the mutabilities of the different amino acids are very different. These differences predominantly reflect preferences for nucleotide mutations in the DNA (especially the high mutation rate of the CpG dinucleotide, which makes arginine mutability much higher than that of other amino acids) rather than selection imposed by protein structure constraints, although there is evidence for the latter as well. The variants occur predominantly on the surface of proteins (82%), with a slight preference for sites that are more exposed and less well conserved than random. Mutations to functional residues occur about half as often as expected by chance. The disease-associated amino acid variant distributions in OMIM are radically different from those expected on the basis of the 1000 Genomes dataset. The disease-associated variants preferentially occur in more conserved sites, compared to 1000 Genomes mutations. Many of the amino acid exchange profiles appear to exhibit an anti-correlation, with common exchanges in one dataset being rare in the other. Disease-associated variants exhibit more extreme differences in amino acid size and hydrophobicity. More modelling of the mutational processes at the nucleotide level is needed, but these observations should contribute to an improved prediction of the effects of specific variants in humans.
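Because the direction of each mutation is known, the exchange matrix is built from ordered (reference → alternate) amino acid pairs and need not be symmetric. A minimal sketch of this bookkeeping (the variant tuples in the test are invented for illustration):

```python
from collections import Counter

def exchange_matrix(variants):
    """Count directed amino-acid exchanges (ref -> alt).
    Because the direction of mutation is known, the matrix is
    asymmetric: counts[(x, y)] need not equal counts[(y, x)]."""
    return Counter(variants)

def mutability(counts, aa):
    """Total observed mutations away from a given amino acid."""
    return sum(n for (ref, alt), n in counts.items() if ref == aa)
```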
The application of whole-genome shotgun sequencing to microbial communities represents a major development in metagenomics, the study of uncultured microbes via the tools of modern genomic analysis. In the past year, whole-genome shotgun sequencing projects of prokaryotic communities from an acid mine biofilm, the Sargasso Sea, Minnesota farm soil, three deep-sea whale falls, and deep-sea sediments have been reported, adding to previously published work on viral communities from marine and fecal samples. The interpretation of this new kind of data poses a wide variety of exciting and difficult bioinformatics problems. The aim of this review is to introduce the bioinformatics community to this emerging field by surveying existing techniques and promising new approaches for several of the most interesting of these computational problems.
Synopsis
Just as protein sequences change over time, so do gene structures. Over comparatively short evolutionary timescales, introns lengthen and shorten; and over longer timescales the number and positions of introns in homologous genes can change. These facts suggest that the intron–exon structures of genes may provide a source of evolutionary information. The utility of gene structures as materials for phylogenetic analyses, however, depends upon their independence from the forces driving protein evolution. If, for example, intron–exon structures are strongly influenced by selection at the amino acid level, then using them for phylogenetic investigations is largely pointless, as the same information could have been more easily gained from protein analyses. Using 11 animal genomes, Yandell et al. show that evolution of intron lengths and positions is largely—though not completely—independent of protein sequence evolution. This means that gene structures provide a source of information about the evolutionary past independent of protein sequence similarities—a finding the authors employ to investigate the accuracy of the protein clock and to explore the utility of gene structures as a means to resolve deep phylogenetic relationships within the animals.
Recombination is an important evolutionary force in bacteria, but it remains challenging to reconstruct the imports that occurred in the ancestry of a genomic sample. Here we present ClonalFrameML, which uses maximum likelihood inference to simultaneously detect recombination in bacterial genomes and account for it in phylogenetic reconstruction. ClonalFrameML can analyse hundreds of genomes in a matter of hours, and we demonstrate its usefulness on simulated and real datasets. We find evidence for recombination hotspots associated with mobile elements in Clostridium difficile ST6 and a previously undescribed 310 kb chromosomal replacement in Staphylococcus aureus ST582. ClonalFrameML is freely available at http://clonalframeml.googlecode.com/.
Genome-wide association studies (GWAS) have evolved over the last ten years into a powerful tool for investigating the genetic architecture of human disease. In this work, we review the key concepts underlying GWAS, including the architecture of common diseases, the structure of common human genetic variation, technologies for capturing genetic information, study designs, and the statistical methods used for data analysis. We also look forward to the future beyond GWAS.
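The core statistical test in a basic GWAS is a per-SNP association test on allele counts in cases versus controls. A minimal sketch of the standard 1-d.f. Pearson chi-square on a 2×2 allele-count table (the counts in the test are invented; compare the statistic against 3.84 for nominal p < 0.05, before any multiple-testing correction):

```python
def allelic_chi2(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-square statistic for a 2x2 allele-count table
    (alternate vs reference allele in cases vs controls, 1 d.f.)."""
    n = case_alt + case_ref + ctrl_alt + ctrl_ref
    obs = [case_alt, case_ref, ctrl_alt, ctrl_ref]
    row = [case_alt + case_ref, ctrl_alt + ctrl_ref]   # cases, controls
    col = [case_alt + ctrl_alt, case_ref + ctrl_ref]   # alt, ref totals
    exp = [row[i // 2] * col[i % 2] / n for i in range(4)]
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))
```

In a real study this test is applied at every genotyped SNP and the resulting p-values are corrected for the number of tests.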
The combination of improved genomic analysis methods, decreasing genotyping costs, and increasing computing resources has led to an explosion of clinical genomic knowledge in the last decade. Similarly, healthcare systems are increasingly adopting robust electronic health record (EHR) systems that not only can improve health care, but also contain a vast repository of disease and treatment data that could be mined for genomic research. Indeed, institutions are creating EHR-linked DNA biobanks to enable genomic and pharmacogenomic research, using EHR data for phenotypic information. However, EHRs are designed primarily for clinical care, not research, so reuse of clinical EHR data for research purposes can be challenging. Difficulties in use of EHR data include: data availability, missing data, incorrect data, and vast quantities of unstructured narrative text data. Structured information includes billing codes, most laboratory reports, and other variables such as physiologic measurements and demographic information. Significant information, however, remains locked within EHR narrative text documents, including clinical notes and certain categories of test results, such as pathology and radiology reports. For relatively rare observations, combinations of simple free-text searches and billing codes may prove adequate when followed by manual chart review. However, to extract the large cohorts necessary for genome-wide association studies, natural language processing methods to process narrative text data may be needed. Combinations of structured and unstructured textual data can be mined to generate high-validity collections of cases and controls for a given condition. Once high-quality cases and controls are identified, EHR-derived cases can be used for genomic discovery and validation. Since EHR data includes a broad sampling of clinically-relevant phenotypic information, it may enable multiple genomic investigations upon a single set of genotyped individuals. 
This chapter reviews several examples of phenotype extraction and their application to genetic research, demonstrating a viable future for genomic discovery using EHR-linked data.
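The case-selection strategy described above, requiring agreement between structured billing codes and unstructured narrative text, can be sketched as follows. The codes, patterns, and negation handling are deliberately toy examples; real phenotyping algorithms use far richer natural language processing:

```python
import re

# Hypothetical type-2-diabetes phenotype: a billing code AND a
# supporting mention in the note, excluding negated mentions.
CASE_CODES = {'250.00', 'E11.9'}   # illustrative billing codes
MENTION = re.compile(r'type 2 diabetes|T2DM', re.IGNORECASE)
NEGATED = re.compile(r'\bno (history of )?(type 2 diabetes|T2DM)',
                     re.IGNORECASE)

def is_case(patient):
    """Structured and unstructured evidence must agree before the
    patient is accepted as a case for genomic analysis."""
    has_code = bool(CASE_CODES & set(patient['codes']))
    note = patient['note']
    has_mention = bool(MENTION.search(note)) and not NEGATED.search(note)
    return has_code and has_mention
```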
Author Summary
The advances in nanotechnology have allowed for the manipulation of single biomolecules and the determination of their elastic properties. Titin was among the first proteins studied in this way. Its unravelling by stretching requires a 204 pN force. The resistance to stretching comes mostly from a localized region known as a force clamp. In titin, the force clamp is simple, as it is formed by two parallel β-strands that are sheared on pulling. Studies of a set of under a hundred proteins over the last decade have revealed a variety of force clamps that lead to forces ranging from under 20 pN to about 500 pN. This set comprises only a tiny fraction of known proteins. Thus one needs guidance as to which proteins should be considered for specific mechanical properties. Such guidance is provided here through simulations within simplified coarse-grained models on 17 134 proteins that are stretched at constant speed. We correlate their unravelling forces with two structure classification schemes. We identify proteins with large resistance to unravelling and characterize their force clamps. Quite a few top-strength proteins owe their sturdiness to a new type of force clamp: the cysteine slipknot, in which the force peak is due to dragging of a piece of the backbone through a closed ring formed by two other pieces of the backbone and two connecting disulphide bonds.
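The constant-speed stretching protocol used in such simulations can be caricatured in one dimension: a harmonic pulling spring moves at constant velocity until a stiff "bond" ruptures at a set extension, producing a force peak. All parameters below are arbitrary illustrative values in reduced units, not those of the coarse-grained model:

```python
def pulling_trace(k_spring=1.0, v=0.1, k_bond=50.0, x_break=0.02,
                  dt=0.01, t_max=20.0):
    """Minimal constant-speed pulling sketch: a harmonic cantilever
    (stiffness k_spring) moving at speed v stretches a bond modelled
    as a stiff spring that ruptures at extension x_break. Returns the
    peak pulling force (overdamped dynamics, friction set to 1)."""
    x = 0.0          # bead position
    broken = False
    f_peak = 0.0
    t = 0.0
    while t < t_max:
        f_pull = k_spring * (v * t - x)
        f_bond = 0.0 if broken else -k_bond * x
        x += dt * (f_pull + f_bond)   # overdamped: dx/dt = F / gamma
        if not broken and x > x_break:
            broken = True
        f_peak = max(f_peak, f_pull)
        t += dt
    return f_peak
```

A bond that survives to a larger extension yields a larger force peak, the one-dimensional analogue of a stronger force clamp.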
Changes in the synaptic connection strengths between neurons are believed to play a role in memory formation. An important mechanism for changing synaptic strength is through movement of neurotransmitter receptors and regulatory proteins to and from the synapse. Several activity-triggered biochemical events control these movements. Here we use computer models to explore how these putative memory-related changes can be stabilised long after the initial trigger, and beyond the lifetime of synaptic molecules. We base our models on published biochemical data and experiments on the activity-dependent movement of a glutamate receptor, AMPAR, and a calcium-dependent kinase, CaMKII. We find that both of these molecules participate in distinct bistable switches. These simulated switches are effective for long periods despite molecular turnover and biochemical fluctuations arising from the small numbers of molecules in the synapse. The AMPAR switch arises from a novel self-recruitment process where the presence of sufficient receptors biases the receptor movement cycle to insert still more receptors into the synapse. The CaMKII switch arises from autophosphorylation of the kinase. The switches may function in a tightly coupled manner, or relatively independently. The latter case leads to multiple stable states of the synapse. We propose that similar self-recruitment cycles may be important for maintaining levels of many molecules that undergo regulated movement, and that these may lead to a combinatorial number of possible stable states of systems like the synapse.
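The essence of a molecular bistable switch such as the CaMKII autophosphorylation loop is cooperative positive feedback balanced against turnover. A deterministic one-variable caricature (parameters chosen only so that two stable states exist; they are not fitted to the published model, and stochastic effects from small molecule numbers are ignored):

```python
def simulate(x0, a=1.0, K=0.5, d=0.9, dt=0.01, steps=5000):
    """Euler-integrate x' = a*x^2/(K^2 + x^2) - d*x.
    The Hill-type positive feedback (cooperative autoactivation)
    plus first-order turnover yields two stable states: an 'off'
    state at x = 0 and an 'on' state at x ~ 0.8."""
    x = x0
    for _ in range(steps):
        x += dt * (a * x * x / (K * K + x * x) - d * x)
    return x
```

Starting below the unstable threshold (here about 0.31) the system relaxes to "off"; starting above it, to "on".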
Although there is great promise in the benefits to be obtained by analyzing cancer genomes, numerous challenges hinder different stages of the process, from the problem of sample preparation and the validation of the experimental techniques, to the interpretation of the results. This chapter specifically focuses on the technical issues associated with the bioinformatics analysis of cancer genome data. The main issues addressed are the use of database and software resources, the use of analysis workflows and the presentation of clinically relevant action items. We attempt to aid new developers in the field by describing the different stages of analysis and discussing current approaches, as well as by providing practical advice on how to access and use resources, and how to implement recommendations. Real cases from cancer genome projects are used as examples.
This year we celebrate the 150th anniversary of the law of mass action. This law is often assumed to have been "there" forever, but it has its own history, background, and a definite starting point. The law has had an impact on chemistry, biochemistry, biomathematics, and systems biology that is difficult to overestimate. It is easily recognized that it is the direct basis for computational enzyme kinetics, ecological systems models, and models for the spread of diseases. The article reviews the explicit and implicit role of the law of mass action in systems biology and reveals how the original, more general formulation of the law emerged one hundred years later ab initio as a very general, canonical representation of biological processes.
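The law of mass action states that the rate of an elementary reaction is proportional to the product of the concentrations of its reactants. For A + B → C this gives v = k[A][B], which can be integrated directly (rate constant and initial concentrations below are arbitrary illustrative values):

```python
def mass_action(a0, b0, k=1.0, dt=1e-4, steps=100000):
    """Euler-integrate the elementary reaction A + B -> C with
    mass-action rate v = k*[A]*[B]. Returns final (A, B, C).
    Note the conservation laws A + C = A0 and B + C = B0."""
    a, b, c = a0, b0, 0.0
    for _ in range(steps):
        v = k * a * b
        a -= dt * v
        b -= dt * v
        c += dt * v
    return a, b, c
```

The same rate-product form underlies computational enzyme kinetics, ecological models, and epidemic models mentioned above.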
Text mining for translational bioinformatics is a new field with tremendous research potential. It is a subfield of biomedical natural language processing that concerns itself directly with the problem of relating basic biomedical research to clinical practice, and vice versa. Applications of text mining fall both into the category of T1 translational research-translating basic science results into new interventions-and T2 translational research, or translational research for public health. Potential use cases include better phenotyping of research subjects, and pharmacogenomic research. A variety of methods for evaluating text mining applications exist, including corpora, structured test suites, and post hoc judging. Two basic principles of linguistic structure are relevant for building text mining applications. One is that linguistic structure consists of multiple levels. The other is that every level of linguistic structure is characterized by ambiguity. There are two basic approaches to text mining: rule-based, also known as knowledge-based; and machine-learning-based, also known as statistical. Many systems are hybrids of the two approaches. Shared tasks have had a strong effect on the direction of the field. Like all translational bioinformatics software, text mining software for translational bioinformatics can be considered health-critical and should be subject to the strictest standards of quality assurance and software testing.
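A toy example of the rule-based approach, and of the ambiguity that pervades every level of linguistic structure, is a pattern-based gene-mention tagger that must explicitly filter abbreviations that merely look like gene symbols. The pattern and stop-list are invented for illustration and are far simpler than any production system:

```python
import re

# Flag token sequences that look like gene symbols, e.g. "BRCA1".
GENE = re.compile(r'\b[A-Z][A-Z0-9]{2,5}\b')
AMBIGUOUS = {'DNA', 'RNA', 'PCR'}   # trivial stop-list

def tag_genes(text):
    """Return candidate gene mentions, filtering a few known
    non-gene abbreviations -- a toy illustration of why purely
    rule-based tagging must confront lexical ambiguity."""
    return [m for m in GENE.findall(text) if m not in AMBIGUOUS]
```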
Development of high throughput analytical methods has given physicians the potential access to extensive and patient-specific data sets, such as gene sequences, gene expression profiles or metabolite footprints. This opens for a new approach in health care, which is both personalized and based on system-level analysis. Genome-scale metabolic networks provide a mechanistic description of the relationships between different genes, which is valuable for the analysis and interpretation of large experimental data-sets. Here we describe the generation of genome-scale active metabolic networks for 69 different cell types and 16 cancer types using the INIT (Integrative Network Inference for Tissues) algorithm. The INIT algorithm uses cell type specific information about protein abundances contained in the Human Proteome Atlas as the main source of evidence. The generated models constitute the first step towards establishing a Human Metabolic Atlas, which will be a comprehensive description (accessible online) of the metabolism of different human cell types, and will allow for tissue-level and organism-level simulations in order to achieve a better understanding of complex diseases. A comparative analysis between the active metabolic networks of cancer types and healthy cell types allowed for identification of cancer-specific metabolic features that constitute generic potential drug targets for cancer treatment.
The abundance of different SSU rRNA ("16S") gene sequences in environmental samples is widely used in studies of microbial ecology as a measure of microbial community structure and diversity. However, the genomic copy number of the 16S gene varies greatly - from one in many species to up to 15 in some bacteria and to hundreds in some microbial eukaryotes. As a result of this variation the relative abundance of 16S genes in environmental samples can be attributed both to variation in the relative abundance of different organisms, and to variation in genomic 16S copy number among those organisms. Despite this fact, many studies assume that the abundance of 16S gene sequences is a surrogate measure of the relative abundance of the organisms containing those sequences. Here we present a method that uses data on sequences and genomic copy number of 16S genes along with phylogenetic placement and ancestral state estimation to estimate organismal abundances from environmental DNA sequence data. We use theory and simulations to demonstrate that 16S genomic copy number can be accurately estimated from the short reads typically obtained from high-throughput environmental sequencing of the 16S gene, and that organismal abundances in microbial communities are more strongly correlated with estimated abundances obtained from our method than with gene abundances. We re-analyze several published empirical data sets and demonstrate that the use of gene abundance versus estimated organismal abundance can lead to different inferences about community diversity and structure and the identity of the dominant taxa in microbial communities. Our approach will allow microbial ecologists to make more accurate inferences about microbial diversity and abundance based on 16S sequence data.
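Once a 16S copy number has been estimated for each taxon, the correction itself is simple: divide gene counts by copy number and renormalize. A minimal sketch (the hard part, estimating copy numbers via phylogenetic placement and ancestral state estimation, is not shown; the counts in the test are invented):

```python
def correct_abundances(gene_counts, copy_number):
    """Convert 16S gene counts into estimated relative organismal
    abundances by dividing each taxon's count by its (estimated)
    genomic 16S copy number, then renormalising."""
    est = {t: n / copy_number[t] for t, n in gene_counts.items()}
    total = sum(est.values())
    return {t: v / total for t, v in est.items()}
```

The test shows how equal gene abundances can hide a ten-fold difference in organismal abundance when copy numbers differ.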
Pyrosequencing of PCR-amplified fragments that target variable regions within the 16S rRNA gene has quickly become a powerful method for analyzing the membership and structure of microbial communities. This approach has revealed and introduced questions that were not fully appreciated by those carrying out traditional Sanger sequencing-based methods. These include the effects of alignment quality, the best method of calculating pairwise genetic distances for 16S rRNA genes, whether it is appropriate to filter variable regions, and how the choice of variable region relates to the genetic diversity observed in full-length sequences. I used a diverse collection of 13,501 high-quality full-length sequences to assess each of these questions. First, alignment quality had a significant impact on distance values and downstream analyses. Specifically, the greengenes alignment, which does a poor job of aligning variable regions, predicted higher genetic diversity, richness, and phylogenetic diversity than the SILVA and RDP-based alignments. Second, the effect of different gap treatments in determining pairwise genetic distances was strongly affected by the variation in sequence length for a region; however, the effect of different calculation methods was subtle when determining the sample's richness or phylogenetic diversity for a region. Third, applying a sequence mask to remove variable positions had a profound impact on genetic distances by muting the observed richness and phylogenetic diversity. Finally, the genetic distances calculated for each of the variable regions did a poor job of correlating with the full-length gene. Thus, while it is tempting to apply traditional cutoff levels derived for full-length sequences to these shorter sequences, it is not advisable. Analysis of beta-diversity metrics showed that each of these factors can have a significant impact on the comparison of community membership and structure. 
Taken together, these results urge caution in the design and interpretation of analyses using pyrosequencing data.
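The effect of different gap treatments on pairwise genetic distances can be seen in a minimal implementation that either drops gapped columns or counts a gap opposite a base as a difference:

```python
def pairwise_distance(s1, s2, gaps='ignore'):
    """Uncorrected pairwise distance between two aligned sequences.
    gaps='ignore' drops columns where either sequence has a gap;
    gaps='count' treats a gap opposite a base as a difference.
    Columns that are gaps in both sequences are always skipped."""
    diffs = length = 0
    for a, b in zip(s1, s2):
        if a == '-' and b == '-':
            continue
        if '-' in (a, b):
            if gaps == 'count':
                diffs += 1
                length += 1
            continue
        length += 1
        diffs += a != b
    return diffs / length
```

As the abstract notes, the gap treatment matters most when sequence lengths for a region vary, since that inflates the number of gapped columns.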
Recent advances in automated high-resolution fluorescence microscopy and robotic handling have made the systematic and cost-effective study of diverse morphological changes within a large population of cells possible under a variety of perturbations, e.g., drugs, compounds, metal catalysts, RNA interference (RNAi). Cell population-based studies deviate from conventional microscopy studies on a few cells, and could provide stronger statistical power for drawing experimental observations and conclusions. However, it is challenging to manually extract and quantify phenotypic changes from the large amounts of complex image data generated. Thus, bioimage informatics approaches are needed to rapidly and objectively quantify and analyze the image data. This paper provides an overview of the bioimage informatics challenges and approaches in image-based studies for drug and target discovery. The concepts and capabilities of image-based screening are first illustrated by a few practical examples investigating different kinds of phenotypic changes caused by drugs, compounds, or RNAi. The bioimage analysis approaches, including object detection, segmentation, and tracking, are then described. Subsequently, the quantitative features, phenotype identification, and multidimensional profile analysis for profiling the effects of drugs and targets are summarized. Moreover, a number of publicly available software packages for bioimage informatics are listed for further reference. It is expected that this review will help readers, including those without bioimage informatics expertise, understand the capabilities, approaches, and tools of bioimage informatics and apply them to advance their own studies.
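The object-detection and segmentation step of such pipelines can be reduced to its simplest form: threshold the intensity image and count connected foreground components. A dependency-free sketch (real pipelines use watershed methods, machine-learned detectors, and much more):

```python
def label_objects(image, threshold):
    """Threshold a 2-D intensity grid (list of lists) and count
    connected foreground objects (4-connectivity) -- a minimal
    stand-in for the detection/segmentation step of a screen."""
    rows, cols = len(image), len(image[0])
    mask = [[v > threshold for v in row] for row in image]
    seen = [[False] * cols for _ in range(rows)]
    objects = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                objects += 1
                stack = [(r, c)]          # iterative flood fill
                while stack:
                    y, x = stack.pop()
                    if (0 <= y < rows and 0 <= x < cols
                            and mask[y][x] and not seen[y][x]):
                        seen[y][x] = True
                        stack += [(y + 1, x), (y - 1, x),
                                  (y, x + 1), (y, x - 1)]
    return objects
```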
The reciprocal differentiation of T helper 17 (T(H)17) cells and induced regulatory T (iT(reg)) cells plays a critical role in both the pathogenesis and resolution of diverse human inflammatory diseases. Although initial studies suggested a stable commitment to either the T(H)17 or the iT(reg) lineage, recent results reveal remarkable plasticity and heterogeneity, reflected in the capacity of differentiated effector cells to be reprogrammed among T(H)17 and iT(reg) lineages and the intriguing phenomenon that a group of naïve precursor CD4(+) T cells can be programmed into phenotypically diverse populations by the same differentiation signal, transforming growth factor beta. To reconcile these observations, we have built a mathematical model of T(H)17/iT(reg) differentiation that exhibits four different stable steady states, governed by pitchfork bifurcations with certain degrees of broken symmetry. According to the model, a group of precursor cells with some small cell-to-cell variability can differentiate into phenotypically distinct subsets of cells, which exhibit distinct levels of the master transcription-factor regulators for the two T cell lineages. A dynamical control system with these properties is flexible enough to be steered down alternative pathways by polarizing signals, such as interleukin-6 and retinoic acid, and it may be used by the immune system to generate functionally distinct effector cells in desired fractions in response to a range of differentiation signals. Additionally, the model suggests a quantitative explanation for the phenotype with high expression levels of both master regulators. This phenotype corresponds to a re-stabilized co-expressing state, appearing at a late stage of differentiation, rather than a bipotent precursor state observed under some other circumstances.
Our simulations reconcile most published experimental observations and predict novel differentiation states as well as transitions among different phenotypes that have not yet been observed experimentally.
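The pitchfork behaviour described above can be illustrated with the simplest mutual-inhibition toggle between two master regulators: the symmetric state is unstable, so small cell-to-cell biases commit the system to one of two asymmetric fates. Parameters are chosen only so that the symmetric state loses stability; they are not taken from the published model:

```python
def differentiate(x0, y0, a=3.0, n=2, dt=0.01, steps=5000):
    """Euler-integrate a mutual-inhibition toggle between two master
    regulators: x' = a/(1 + y^n) - x, y' = a/(1 + x^n) - y.
    For a = 3, n = 2 the symmetric fixed point is unstable, so a
    small initial bias commits the cell to one of two asymmetric
    high/low expression states."""
    x, y = x0, y0
    for _ in range(steps):
        x, y = (x + dt * (a / (1 + y ** n) - x),
                y + dt * (a / (1 + x ** n) - y))
    return x, y
```

Two nearly identical initial conditions, differing only in which regulator starts slightly higher, end in mirror-image fates.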
The connection between chromatin nuclear organization and gene activity is vividly illustrated by the observation that transcriptional coregulation of certain genes appears to be directly influenced by their spatial proximity. This fact poses the more general question of whether it is at all feasible that the numerous genes that are coregulated on a given chromosome, especially those at large genomic distances, might become proximate inside the nucleus. This problem is studied here using steered molecular dynamics simulations in order to enforce the colocalization of thousands of knowledge-based gene sequences on a model for the gene-rich human chromosome 19. Remarkably, it is found that most gene pairs can be brought simultaneously into contact. This is made possible by the low degree of intra-chromosome entanglement and the large number of cliques in the gene coregulatory network. A clique is a set of genes coregulated all together as a group. The constrained conformations for the model chromosome 19 are further shown to be organized in spatial macrodomains that are similar to those inferred from recent HiC measurements. The findings indicate that gene coregulation and colocalization are largely compatible and that this relationship can be exploited to draft the overall spatial organization of the chromosome in vivo. The more general validity and implications of these findings could be investigated by applying to other eukaryotic chromosomes the general and transferable computational strategy introduced here.
Relatively little is known about the viral factors contributing to the lethality of the 1918 pandemic, although its unparalleled virulence was likely due in part to the newly discovered PB1-F2 protein. This protein, while unnecessary for replication, increases apoptosis in monocytes, alters viral polymerase activity in vitro, enhances inflammation and increases secondary pneumonia in vivo. However, the effects the PB1-F2 protein has in vivo remain unclear. To address the mechanisms involved, we intranasally infected groups of mice with either influenza A virus PR8 or a genetically engineered virus that expresses the 1918 PB1-F2 protein on a PR8 background, PR8-PB1-F2(1918). Mice inoculated with PR8 had viral concentrations peaking at 72 hours, while those infected with PR8-PB1-F2(1918) reached peak concentrations earlier, at 48 hours. Mice given PR8-PB1-F2(1918) also showed a faster decline in viral loads. We fit a mathematical model to these data to estimate parameter values. The model supports a higher viral production rate per cell and a higher infected cell death rate with the PR8-PB1-F2(1918) virus. We discuss the implications of these mechanisms for infection with a virus expressing a virulent PB1-F2, for the possibility of a pandemic, and for the importance of antiviral treatments.
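Such viral-load data are typically fitted with the target-cell-limited model T' = −βTV, I' = βTV − δI, V' = pI − cV. The sketch below integrates this model with illustrative (not fitted) parameters and reproduces the qualitative point above: a higher per-cell production rate shifts the viral-load peak earlier:

```python
def viral_peak_time(p=1.0, beta=1e-5, delta=2.0, c=3.0,
                    T0=1e6, V0=1.0, dt=0.001, t_max=30.0):
    """Euler-integrate the target-cell-limited model
        T' = -beta*T*V,  I' = beta*T*V - delta*I,  V' = p*I - c*V
    and return the time at which viral load V peaks.
    All parameter values are illustrative, in arbitrary units."""
    T, I, V = T0, 0.0, V0
    t, t_peak, v_peak = 0.0, 0.0, V0
    while t < t_max:
        dT = -beta * T * V
        dI = beta * T * V - delta * I
        dV = p * I - c * V
        T += dt * dT
        I += dt * dI
        V += dt * dV
        t += dt
        if V > v_peak:
            v_peak, t_peak = V, t
    return t_peak
```

Raising p (production per infected cell) moves the peak earlier, as observed for the PR8-PB1-F2(1918) virus relative to PR8.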
Author Summary
We developed a general-purpose biochemical simulation program, called Smoldyn. It represents proteins and other molecules of interest with point-like particles that diffuse, interact with surfaces, and react, all in continuous space. This high level of detail allows users to investigate spatial organization within cells and natural stochastic variability. Although similar to the MCell and ChemCell programs, Smoldyn is more accurate and runs faster. Smoldyn also supports many unique features, such as commands that a “virtual experimenter” can execute during simulations and automatic reaction network expansion for simulating protein complexes. We illustrate Smoldyn's capabilities with a model of signaling between yeast cells of opposite mating type. It investigates the role of the secreted protease Bar1, which inactivates mating pheromone. Intuitively, it might seem that inactivating most of the pheromone would make a cell less able to detect the local pheromone concentration gradient. In contrast, we found that Bar1 secretion improves pheromone gradient detectability: the local gradient is sharpened because pheromone is progressively inactivated as it diffuses through a cloud of Bar1. This result helps interpret experiments that showed that Bar1 secretion helped cells distinguish between potential mates, and suggests that Bar1 helps yeast cells identify the fittest mating partners.
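The particle-based method underlying such simulators propagates each molecule with Gaussian displacement steps of standard deviation √(2DΔt). A minimal one-dimensional sketch (not Smoldyn itself), which recovers the diffusive scaling ⟨x²⟩ = 2Dt:

```python
import random

def brownian_msd(n=2000, D=1.0, dt=0.01, steps=100, seed=1):
    """Propagate n point particles with Gaussian displacement steps
    of standard deviation sqrt(2*D*dt) (Brownian dynamics in 1-D)
    and return the mean squared displacement, which should approach
    2*D*t where t = steps*dt."""
    rng = random.Random(seed)
    sigma = (2 * D * dt) ** 0.5
    xs = [0.0] * n
    for _ in range(steps):
        xs = [x + rng.gauss(0.0, sigma) for x in xs]
    return sum(x * x for x in xs) / n
```

With D = 1 and total time t = 1, the expected mean squared displacement is 2; the estimate fluctuates around that value with finite n.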
A considerable fraction of all the major scholars in computational biology frequently participate in ISMB. Consistent with this leading role in representation, ISMB has become a major outlet for increasing the visibility of this extremely dynamic new discipline, and for maintaining and elevating its scientific standards. It has become the vehicle for the education of scholars at all stages of their careers, for the integration of students, and for the support of young leaders in the field. ISMB has also become a forum for reviewing the state of the art in the many fields of this growing discipline, for introducing new directions, and for announcing technological breakthroughs. ISMB and ISCB are contributing to the advance of biology, and to helping to build bridges and understanding between dedicated and passionate groups of scholars from an unusual variety of backgrounds.
ISMB 1993–2008
The ISMB conference series began in 1993, the result of the vision of David Searls (GlaxoSmithKline), Jude Shavlik (University of Wisconsin Madison), and Larry Hunter (University of Colorado). A few years later, ISMB had established itself as a primary event for the computational biology community and triggered the founding of ISCB, the International Society for Computational Biology (http://www.iscb.org/). ISCB has been organizing the ISMB conference series since 1998. While ISCB evolved into the only society representing computational biology globally, its flagship conference has become the largest annual worldwide forum focused on computational biology. In January 2007, the ISCB came to an agreement with the European Conference on Computational Biology (ECCB) to organize a joint meeting in Europe every other year. This led to the ISMB/ECCB in Vienna in 2007 that set the standard for a large-scale integrative forum for all those with interest in subjects related to computational biology.
ISCB is now focusing on expanding participation beyond North America and Europe, which has accounted for the majority of participants during the history of ISMB. One meeting in South Asia (InCoB; http://incob.binfo.org.tw/) has already been sponsored by ISCB, and another one in North Asia is going to follow. ISMB itself has also been held in Australia (2003) and Brazil (2006).