PLoS Computational Biology

Published by PLOS

Online ISSN: 1553-7358


Print ISSN: 1553-734X


Figure 1. Cumulative plots of SMART version 6 and Pfam release 23 problematic domains. In SMART version 6, the total number of domains with predicted SP/TM segments peaks at 18, which makes up 2.2% of the 809 SMART domains (see top). Red triangles mark time points for the years 1998, 2002 and 2009, when the total number of domain models was 86, 600 and 809 respectively. In Pfam, the total number of problematic domains peaks at 1214, which makes up 11.8% of the 10340 Pfam domains (see bottom). Likewise, red triangles mark the years 1999, 2002 and 2008, with 1465, 3360 and 10340 Pfam entries respectively. doi:10.1371/journal.pcbi.1000867.g001
Table 1. Summary of predicted/validated non-globular segments and supporting evidence for the 18 SMART version 6 domains.
Figure 2. Histograms of average log probability per predicted transmembrane helix and per predicted signal peptide in Pfam release 23. The top part shows the histogram of average log probability per predicted transmembrane helix; the bottom part shows the same per predicted signal peptide. The log probability provided on the x-axis is calculated with equations 5 and 6. At the TMcutoff of ≥ −12 (false-positive rate 4.67%) and SPcutoff of ≥ −1 (false-positive rate 4.02%), the numbers of predicted TM helices and signal peptides are 3849 and 164 respectively. doi:10.1371/journal.pcbi.1000867.g002
Figure 3. Average log probability plot of transmembrane helix and signal peptide predictions per domain. The top part shows the average log probability per predicted transmembrane helix calculated per domain; the bottom part shows the same per predicted signal peptide. Whereas the y-axis shows the log probability in accordance with equation 6 applied over all predicted segments for a given domain, the x-axis represents their cumulative length. At the TMcutoff of ≥ −12 and SPcutoff of ≥ −1 (horizontal dashed lines), the numbers of problematic TM and SP domains are 1079 and 164 respectively. The total number of problematic domains is 1214 (1050 TM, 135 SP and 29 concurrent TM and SP). doi:10.1371/journal.pcbi.1000867.g003
Table 3. FP and FN rates of TM predictions based on different TM cutoffs.


More Than 1,001 Problems with Protein Domain Databases: Transmembrane Regions, Signal Peptides and the Issue of Sequence Homology

July 2010





Large-scale genome sequencing gained general importance for life science because functional annotation of otherwise experimentally uncharacterized sequences is made possible by the theory of biomolecular sequence homology. Historically, the paradigm of similarity of protein sequences implying common structure, function and ancestry was generalized based on studies of globular domains. Having the same fold imposes strict conditions over the packing in the hydrophobic core requiring similarity of hydrophobic patterns. The implications of sequence similarity among non-globular protein segments have not been studied to the same extent; nevertheless, homology considerations are silently extended for them. This appears especially detrimental in the case of transmembrane helices (TMs) and signal peptides (SPs) where sequence similarity is necessarily a consequence of physical requirements rather than common ancestry. Thus, matching of SPs/TMs creates the illusion of matching hydrophobic cores. Therefore, inclusion of SPs/TMs into domain models can give rise to wrong annotations. More than 1001 domains among the 10,340 models of Pfam release 23 and 18 domains of SMART version 6 (out of 809) contain SP/TM regions. As expected, fragment-mode HMM searches generate promiscuous hits limited to solely the SP/TM part among clearly unrelated proteins. More worryingly, we show explicit examples that the scores of clearly false-positive hits, even in global-mode searches, can be elevated into the significance range just by matching the hydrophobic runs. In the PIR iProClass database v3.74 using conservative criteria, we find that at least between 2.1% and 13.6% of its annotated Pfam hits appear unjustified for a set of validated domain models. Thus, false-positive domain hits enforced by SP/TM regions can lead to dramatic annotation errors where the hit has nothing in common with the problematic domain model except the SP/TM region itself. 
We suggest a workflow of flagging problematic hits arising from SP/TM-containing models for critical reconsideration by annotation users.
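The domain counts quoted in the Figure 1 and Figure 3 captions for this article can be cross-checked with simple arithmetic. The sketch below only re-adds the numbers given in those captions; the variable and function names are illustrative and do not come from the paper.

```python
# Counts quoted in the Figure 1 and Figure 3 captions.
tm_only, sp_only, tm_and_sp = 1050, 135, 29
pfam_total = 10340
smart_problematic, smart_total = 18, 809

# Domains flagged for TM helices or signal peptides include the overlap.
tm_domains = tm_only + tm_and_sp
sp_domains = sp_only + tm_and_sp

# The overall total counts each problematic domain exactly once.
problematic = tm_only + sp_only + tm_and_sp


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


smart_share = percent(smart_problematic, smart_total)
pfam_share = percent(problematic, pfam_total)
```

The three categories add up to the 1079 TM domains, 164 SP domains and 1214 problematic domains stated in the captions. The SMART share comes out at the quoted 2.2%, and the Pfam share computed this way is about 11.7%, close to the 11.8% given in the caption.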

Amino Acid Changes in Disease-Associated Variants Differ Radically from Variants Observed in the 1000 Genomes Project Dataset

December 2013



Tjaart A P de Beer






The 1000 Genomes Project data provides a natural background dataset for amino acid germline mutations in humans. Since the direction of mutation is known, the amino acid exchange matrix generated from the observed nucleotide variants is asymmetric and the mutabilities of the different amino acids are very different. These differences predominantly reflect preferences for nucleotide mutations in the DNA (especially the high mutation rate of the CpG dinucleotide, which makes arginine mutability very much higher than other amino acids) rather than selection imposed by protein structure constraints, although there is evidence for the latter as well. The variants occur predominantly on the surface of proteins (82%), with a slight preference for sites which are more exposed and less well conserved than random. Mutations to functional residues occur about half as often as expected by chance. The disease-associated amino acid variant distributions in OMIM are radically different from those expected on the basis of the 1000 Genomes dataset. The disease-associated variants preferentially occur in more conserved sites, compared to 1000 Genomes mutations. Many of the amino acid exchange profiles appear to exhibit an anti-correlation, with common exchanges in one dataset being rare in the other. Disease-associated variants exhibit more extreme differences in amino acid size and hydrophobicity. More modelling of the mutational processes at the nucleotide level is needed, but these observations should contribute to an improved prediction of the effects of specific variants in humans.

Figure 1. Blue-Yellow Microarray Figure Applied to KEGG Vectors for Four Metagenomics Projects. The whale-fall and Sargasso Sea data are partitioned into three different samples each. The rows correspond to the different datasets and the columns to the 137 KEGG categories. Blue corresponds to underrepresentation and yellow to overrepresentation. Note that some branch lengths have been adjusted for visualization purposes and do not correspond to an actual meaningful distance.
Table 1. Published Microbial Community Shotgun Sequencing Projects
Table 2. Bounds on Amount of Sequence Needed to Assemble Genomes (in Mbp)
Figure 2. Projection of the KEGG Vectors on the First Two Principal Components
Table 3. Examples of Ongoing Community WGS Sequencing Projects
Chen, K. & Pachter, L. Bioinformatics for whole-genome shotgun sequencing of microbial communities. PLoS Comput. Biol. 1, 106-112

August 2005



The application of whole-genome shotgun sequencing to microbial communities represents a major development in metagenomics, the study of uncultured microbes via the tools of modern genomic analysis. In the past year, whole-genome shotgun sequencing projects of prokaryotic communities from an acid mine biofilm, the Sargasso Sea, Minnesota farm soil, three deep-sea whale falls, and deep-sea sediments have been reported, adding to previously published work on viral communities from marine and fecal samples. The interpretation of this new kind of data poses a wide variety of exciting and difficult bioinformatics problems. The aim of this review is to introduce the bioinformatics community to this emerging field by surveying existing techniques and promising new approaches for several of the most interesting of these computational problems.

Large-Scale Trends in the Evolution of Gene Structures within 11 Animal Genomes

April 2006



Synopsis: Just as protein sequences change over time, so do gene structures. Over comparatively short evolutionary timescales, introns lengthen and shorten; and over longer timescales the number and positions of introns in homologous genes can change. These facts suggest that the intron–exon structures of genes may provide a source of evolutionary information. The utility of gene structures as materials for phylogenetic analyses, however, depends upon their independence from the forces driving protein evolution. If, for example, intron–exon structures are strongly influenced by selection at the amino acid level, then using them for phylogenetic investigations is largely pointless, as the same information could have been more easily gained from protein analyses. Using 11 animal genomes, Yandell et al. show that evolution of intron lengths and positions is largely (though not completely) independent of protein sequence evolution. This means that gene structures provide a source of information about the evolutionary past independent of protein sequence similarities, a finding the authors employ to investigate the accuracy of the protein clock and to explore the utility of gene structures as a means to resolve deep phylogenetic relationships within the animals.

Figure 1.  An example application of ClonalFrameML to a simulated dataset.
(A) The clonal genealogy produced by simulation. (B) Maximum-likelihood reconstructed phylogeny. (C) ClonalFrameML reconstructed phylogeny. (D) Representation of recombination events along the genome for each branch of the genealogy in (A). True events are shown in blue and events detected by ClonalFrameML are shown in red. Three branches of interest and their associated recombination events are highlighted by red boxes.
Figure 2.  Comparison of correct parameter values with estimates from ClonalFrameML for a hundred datasets simulated under the ClonalFrame model.
Dots represent the point estimates and bars the 95% confidence intervals. Colours represent the correct value of the compound parameter δR ranging from 10^-3 (black) to 10^2 (red).
Figure 3.  Comparison of correct parameter values with estimates from ClonalFrameML for a hundred datasets simulated under the coalescent with gene conversion model of intra-population recombination.
Dots represent the point estimates and bars the 95% confidence intervals. Colours represent the correct value of the parameter δ ranging from 10^2 (black) to 10^4 (red).
Figure 4.  Application of ClonalFrameML to 86 genomes of C. difficile ST6.
For any branch of the genealogy and any position along the genome, inferred recombination is marked in blue.
Figure 5.  ClonalFrameML analysis of recombination in S. aureus based on 110 genomes representing carriage and reference isolates mapped to MRSA252.
Reconstructed substitutions (white vertical bars) are shown for each branch of the ML tree. Grey areas represent non-core regions of the MRSA252 genome. Dark blue horizontal bars indicate recombination events detected by the analysis.
Didelot X, Wilson DJ. ClonalFrameML: efficient inference of recombination in whole bacterial genomes. PLOS Comput Biol 11: e1004041

February 2015



Recombination is an important evolutionary force in bacteria, but it remains challenging to reconstruct the imports that occurred in the ancestry of a genomic sample. Here we present ClonalFrameML, which uses maximum likelihood inference to simultaneously detect recombination in bacterial genomes and account for it in phylogenetic reconstruction. ClonalFrameML can analyse hundreds of genomes in a matter of hours, and we demonstrate its usefulness on simulated and real datasets. We find evidence for recombination hotspots associated with mobile elements in Clostridium difficile ST6 and a previously undescribed 310kb chromosomal replacement in Staphylococcus aureus ST582. ClonalFrameML is freely available at

Chapter 11: Genome-Wide Association Studies

December 2012



Genome-wide association studies (GWAS) have evolved over the last ten years into a powerful tool for investigating the genetic architecture of human disease. In this work, we review the key concepts underlying GWAS, including the architecture of common diseases, the structure of common human genetic variation, technologies for capturing genetic information, study designs, and the statistical methods used for data analysis. We also look forward to the future beyond GWAS.

Chapter 13: Mining Electronic Health Records in the Genomics Era

December 2012



The combination of improved genomic analysis methods, decreasing genotyping costs, and increasing computing resources has led to an explosion of clinical genomic knowledge in the last decade. Similarly, healthcare systems are increasingly adopting robust electronic health record (EHR) systems that not only can improve health care, but also contain a vast repository of disease and treatment data that could be mined for genomic research. Indeed, institutions are creating EHR-linked DNA biobanks to enable genomic and pharmacogenomic research, using EHR data for phenotypic information. However, EHRs are designed primarily for clinical care, not research, so reuse of clinical EHR data for research purposes can be challenging. Difficulties in use of EHR data include: data availability, missing data, incorrect data, and vast quantities of unstructured narrative text data. Structured information includes billing codes, most laboratory reports, and other variables such as physiologic measurements and demographic information. Significant information, however, remains locked within EHR narrative text documents, including clinical notes and certain categories of test results, such as pathology and radiology reports. For relatively rare observations, combinations of simple free-text searches and billing codes may prove adequate when followed by manual chart review. However, to extract the large cohorts necessary for genome-wide association studies, natural language processing methods to process narrative text data may be needed. Combinations of structured and unstructured textual data can be mined to generate high-validity collections of cases and controls for a given condition. Once high-quality cases and controls are identified, EHR-derived cases can be used for genomic discovery and validation. Since EHR data includes a broad sampling of clinically-relevant phenotypic information, it may enable multiple genomic investigations upon a single set of genotyped individuals. 
This chapter reviews several examples of phenotype extraction and their application to genetic research, demonstrating a viable future for genomic discovery using EHR-linked data.

Mechanical Strength of 17 134 Model Proteins and Cysteine Slipknots

October 2009



Author Summary: Advances in nanotechnology have allowed for manipulation of single biomolecules and determination of their elastic properties. Titin was among the first proteins studied in this way. Its unravelling by stretching requires a 204 pN force. The resistance to stretching comes mostly from a localized region known as a force clamp. In titin, the force clamp is simple, as it is formed by two parallel β-strands that are sheared on pulling. Studies of a set of under a hundred proteins accomplished in the last decade have revealed a variety of force clamps that lead to forces ranging from under 20 pN to about 500 pN. This set comprises only a tiny fraction of the proteins known. Thus one needs guidance as to what proteins should be considered for specific mechanical properties. Such guidance is provided here through simulations within simplified coarse-grained models on 17 134 proteins that are stretched at constant speed. We correlate their unravelling forces with two structure classification schemes. We identify proteins with large resistance to unravelling and characterize their force clamps. Quite a few top-strength proteins owe their sturdiness to a new type of force clamp: the cysteine slipknot, in which the force peak is due to dragging of a piece of the backbone through a closed ring formed by two other pieces of the backbone and two connecting disulphide bonds.

Hayer, A. & Bhalla, U. S. Molecular switches at the synapse emerge from receptor and kinase traffic. PLoS Comput. Biol. 1, 137-154

August 2005



Changes in the synaptic connection strengths between neurons are believed to play a role in memory formation. An important mechanism for changing synaptic strength is through movement of neurotransmitter receptors and regulatory proteins to and from the synapse. Several activity-triggered biochemical events control these movements. Here we use computer models to explore how these putative memory-related changes can be stabilised long after the initial trigger, and beyond the lifetime of synaptic molecules. We base our models on published biochemical data and experiments on the activity-dependent movement of a glutamate receptor, AMPAR, and a calcium-dependent kinase, CaMKII. We find that both of these molecules participate in distinct bistable switches. These simulated switches are effective for long periods despite molecular turnover and biochemical fluctuations arising from the small numbers of molecules in the synapse. The AMPAR switch arises from a novel self-recruitment process where the presence of sufficient receptors biases the receptor movement cycle to insert still more receptors into the synapse. The CaMKII switch arises from autophosphorylation of the kinase. The switches may function in a tightly coupled manner, or relatively independently. The latter case leads to multiple stable states of the synapse. We propose that similar self-recruitment cycles may be important for maintaining levels of many molecules that undergo regulated movement, and that these may lead to combinatorial possible stable states of systems like the synapse.

Chapter 14: Cancer Genome Analysis

December 2012



Although there is great promise in the benefits to be obtained by analyzing cancer genomes, numerous challenges hinder different stages of the process, from the problem of sample preparation and the validation of the experimental techniques, to the interpretation of the results. This chapter specifically focuses on the technical issues associated with the bioinformatics analysis of cancer genome data. The main issues addressed are the use of database and software resources, the use of analysis workflows and the presentation of clinically relevant action items. We attempt to aid new developers in the field by describing the different stages of analysis and discussing current approaches, as well as by providing practical advice on how to access and use resources, and how to implement recommendations. Real cases from cancer genome projects are used as examples.

150 Years of the Mass Action Law

January 2015



This year we celebrate the 150th anniversary of the law of mass action. This law is often assumed to have been "there" forever, but it has its own history, background, and a definite starting point. The law has had an impact on chemistry, biochemistry, biomathematics, and systems biology that is difficult to overestimate. It is easily recognized that it is the direct basis for computational enzyme kinetics, ecological systems models, and models for the spread of diseases. The article reviews the explicit and implicit role of the law of mass action in systems biology and reveals how the original, more general formulation of the law emerged one hundred years later ab initio as a very general, canonical representation of biological processes.
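For readers unfamiliar with the law, a minimal sketch of its textbook form may help: for an elementary reaction A + B -> C, the reaction rate is taken to be proportional to the product of the reactant concentrations, rate = k[A][B]. The code below integrates that rate law with a simple forward-Euler scheme. The function name, rate constant, step size and initial concentrations are all illustrative assumptions and are not taken from the article.

```python
def simulate_mass_action(a0, b0, k, dt, steps):
    """Forward-Euler integration of A + B -> C with rate k*[A]*[B].

    All parameter values are hypothetical; this is a textbook
    illustration, not a model from the reviewed article.
    """
    a, b, c = a0, b0, 0.0
    for _ in range(steps):
        rate = k * a * b      # law of mass action for one elementary step
        a -= rate * dt        # each reaction event consumes one A ...
        b -= rate * dt        # ... and one B ...
        c += rate * dt        # ... and produces one C
    return a, b, c


# Integrate from t = 0 to t = 1 with equal starting concentrations.
a, b, c = simulate_mass_action(a0=1.0, b0=1.0, k=1.0, dt=0.01, steps=100)
```

With equal initial concentrations the exact solution is [A](t) = a0 / (1 + a0·k·t), so the simulated value of `a` at t = 1 should lie close to 0.5, and `a + c` should stay at the initial amount of A because every unit of A consumed becomes a unit of C.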

Table 1.  Some knowledge sources for biomedical natural language processing.
Chapter 16: Text Mining for Translational Bioinformatics

April 2013



Text mining for translational bioinformatics is a new field with tremendous research potential. It is a subfield of biomedical natural language processing that concerns itself directly with the problem of relating basic biomedical research to clinical practice, and vice versa. Applications of text mining fall both into the category of T1 translational research (translating basic science results into new interventions) and T2 translational research, or translational research for public health. Potential use cases include better phenotyping of research subjects, and pharmacogenomic research. A variety of methods for evaluating text mining applications exist, including corpora, structured test suites, and post hoc judging. Two basic principles of linguistic structure are relevant for building text mining applications. One is that linguistic structure consists of multiple levels. The other is that every level of linguistic structure is characterized by ambiguity. There are two basic approaches to text mining: rule-based, also known as knowledge-based; and machine-learning-based, also known as statistical. Many systems are hybrids of the two approaches. Shared tasks have had a strong effect on the direction of the field. Like all translational bioinformatics software, text mining software for translational bioinformatics can be considered health-critical and should be subject to the strictest standards of quality assurance and software testing.

Figure 1. General pipeline used in the reconstruction of cell specific genome-scale metabolic networks. Biological information at the genome, transcriptome, proteome and metabolome levels contained in publicly available databases and generic human GEMs (Recon1, EHMN, HumanCyc) is integrated to form a generic human metabolic network, which is processed in order to obtain the connected iHuman1512 network. Subsequently, the cell type specific evidence is used to generate cell type specific subnetworks using the INIT algorithm. doi:10.1371/journal.pcbi.1002518.g001 
Figure 2. Illustration of the principles of the INIT algorithm. The hierarchical structure of GEMs is characterized by its gene-transcript-protein-reaction (GTPR) associations. In GEMs, each metabolic reaction is associated with one or more enzymes, which in turn are associated with transcripts and genes. Depending on the evidence for presence/absence of a given enzyme/gene in a cell type, a score can be calculated for the reaction(s) catalyzed by that enzyme. The HPA evidence scores are illustrated as red, light, medium and dark green, representing negative, weak, moderate and strong evidence, respectively. The transcriptome evidence scores (GeneX) are illustrated as red, light, medium and dark blue, representing low, medium and high expression, respectively. No evidence is shown as a white object. For some metabolites (yellow filled circle), metabolomic data are available to prove that they are present in the considered cell type. The aim of the algorithm is to find a sub-network in which the involved genes/proteins have strong evidence supporting their presence in the cell type under consideration. This is done by maximizing the sum of evidence scores. All the included reactions should be able to carry a flux and all the metabolites observed experimentally should be synthesized from precursors that the cell is known to take up. The bold lines represent the resulting network after optimization. doi:10.1371/journal.pcbi.1002518.g002
Figure 3. Gene content comparison between our hepatocyte model and HepatoNet1. The Venn diagram shows the overlap in terms of included genes between three models. The blue, green and red squares represent iHuman1512, our hepatocyte model iHepatocyte1154 and HepatoNet1, respectively. The distribution of evidence scores of each section of the Venn diagram is plotted. The HPA evidence scores are illustrated as red, light, medium and dark green, representing negative, weak, moderate and strong expression, respectively. The transcriptome evidence scores (GeneX) are illustrated as red, light, medium and dark blue, representing low, medium and high expression, respectively. No evidence (NE) is illustrated in grey. doi:10.1371/journal.pcbi.1002518.g003
Figure 4. Example of a metabolic sub-network that was identified as being significantly more present in cancer tissues compared to their corresponding healthy tissues. Aminoacetone, which is a toxic by-product of amino acid catabolism, is converted to toxic methylglyoxal in a reaction that also results in hydrogen peroxide. The toxicity of methylglyoxal is relieved by two reaction steps involving ligation to glutathione and resulting in lactic acid. The generated hydrogen peroxide is taken care of by the enzyme biliverdin reductase. This is an example of how network-based analysis can lead to a more mechanistic interpretation of data. doi:10.1371/journal.pcbi.1002518.g004
Reconstruction of Genome-Scale Active Metabolic Networks for 69 Human Cell Types and 16 Cancer Types Using INIT

May 2012



Development of high-throughput analytical methods has given physicians potential access to extensive and patient-specific data sets, such as gene sequences, gene expression profiles or metabolite footprints. This opens the way for a new approach in health care, which is both personalized and based on system-level analysis. Genome-scale metabolic networks provide a mechanistic description of the relationships between different genes, which is valuable for the analysis and interpretation of large experimental data sets. Here we describe the generation of genome-scale active metabolic networks for 69 different cell types and 16 cancer types using the INIT (Integrative Network Inference for Tissues) algorithm. The INIT algorithm uses cell type specific information about protein abundances contained in the Human Proteome Atlas as the main source of evidence. The generated models constitute the first step towards establishing a Human Metabolic Atlas, which will be a comprehensive description (accessible online) of the metabolism of different human cell types, and will allow for tissue-level and organism-level simulations in order to achieve a better understanding of complex diseases. A comparative analysis between the active metabolic networks of cancer types and healthy cell types allowed for identification of cancer-specific metabolic features that constitute generic potential drug targets for cancer treatment.

Kembel SW, Wu M, Eisen JA, Green JL. Incorporating 16S gene copy number information improves estimates of microbial diversity and abundance. PLOS Comput Biol 8: e1002743

October 2012



The abundance of different SSU rRNA ("16S") gene sequences in environmental samples is widely used in studies of microbial ecology as a measure of microbial community structure and diversity. However, the genomic copy number of the 16S gene varies greatly - from one in many species to up to 15 in some bacteria and to hundreds in some microbial eukaryotes. As a result of this variation the relative abundance of 16S genes in environmental samples can be attributed both to variation in the relative abundance of different organisms, and to variation in genomic 16S copy number among those organisms. Despite this fact, many studies assume that the abundance of 16S gene sequences is a surrogate measure of the relative abundance of the organisms containing those sequences. Here we present a method that uses data on sequences and genomic copy number of 16S genes along with phylogenetic placement and ancestral state estimation to estimate organismal abundances from environmental DNA sequence data. We use theory and simulations to demonstrate that 16S genomic copy number can be accurately estimated from the short reads typically obtained from high-throughput environmental sequencing of the 16S gene, and that organismal abundances in microbial communities are more strongly correlated with estimated abundances obtained from our method than with gene abundances. We re-analyze several published empirical data sets and demonstrate that the use of gene abundance versus estimated organismal abundance can lead to different inferences about community diversity and structure and the identity of the dominant taxa in microbial communities. Our approach will allow microbial ecologists to make more accurate inferences about microbial diversity and abundance based on 16S sequence data.
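The core idea in this abstract, that 16S read counts overstate organisms carrying many gene copies, can be illustrated with a small sketch. It assumes the per-taxon copy numbers are already known, whereas the paper estimates them through phylogenetic placement and ancestral state estimation. The function name, taxon labels and numbers are hypothetical, and this is not the authors' implementation.

```python
def estimate_organism_abundance(gene_counts, copy_numbers):
    """Convert 16S gene counts to relative organismal abundances.

    gene_counts:  dict of taxon -> number of 16S sequences observed
    copy_numbers: dict of taxon -> assumed genomic 16S copy number
    """
    # Dividing by copy number gives a quantity proportional to cell count.
    cells = {taxon: gene_counts[taxon] / copy_numbers[taxon]
             for taxon in gene_counts}
    total = sum(cells.values())
    # Normalise so the abundances sum to one.
    return {taxon: count / total for taxon, count in cells.items()}


# Hypothetical sample: equal read counts, different copy numbers.
reads = {"taxon_a": 300, "taxon_b": 300}
copies = {"taxon_a": 1, "taxon_b": 3}
abundance = estimate_organism_abundance(reads, copies)
```

In this made-up sample both taxa contribute half of the reads, yet after correction taxon_a accounts for three quarters of the organisms, which is the kind of shift in inferred community structure the abstract describes.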

The Effects of Alignment Quality, Distance Calculation Method, Sequence Filtering, and Region on the Analysis of 16S rRNA Gene-Based Studies

July 2010



Pyrosequencing of PCR-amplified fragments that target variable regions within the 16S rRNA gene has quickly become a powerful method for analyzing the membership and structure of microbial communities. This approach has revealed and introduced questions that were not fully appreciated by those carrying out traditional Sanger sequencing-based methods. These include the effects of alignment quality, the best method of calculating pairwise genetic distances for 16S rRNA genes, whether it is appropriate to filter variable regions, and how the choice of variable region relates to the genetic diversity observed in full-length sequences. I used a diverse collection of 13,501 high-quality full-length sequences to assess each of these questions. First, alignment quality had a significant impact on distance values and downstream analyses. Specifically, the greengenes alignment, which does a poor job of aligning variable regions, predicted higher genetic diversity, richness, and phylogenetic diversity than the SILVA and RDP-based alignments. Second, the effect of different gap treatments in determining pairwise genetic distances was strongly affected by the variation in sequence length for a region; however, the effect of different calculation methods was subtle when determining the sample's richness or phylogenetic diversity for a region. Third, applying a sequence mask to remove variable positions had a profound impact on genetic distances by muting the observed richness and phylogenetic diversity. Finally, the genetic distances calculated for each of the variable regions did a poor job of correlating with the full-length gene. Thus, while it is tempting to apply traditional cutoff levels derived for full-length sequences to these shorter sequences, it is not advisable. Analysis of beta-diversity metrics showed that each of these factors can have a significant impact on the comparison of community membership and structure. 
Taken together, these results urge caution in the design and interpretation of analyses using pyrosequencing data.

Chapter 17: Bioimage Informatics for Systems Pharmacology
Recent advances in automated high-resolution fluorescence microscopy and robotic handling have made the systematic and cost-effective study of diverse morphological changes within a large population of cells possible under a variety of perturbations, e.g., drugs, compounds, metal catalysts, RNA interference (RNAi). Cell population-based studies deviate from conventional microscopy studies on a few cells, and could provide stronger statistical power for drawing experimental observations and conclusions. However, it is challenging to manually extract and quantify phenotypic changes from the large amounts of complex image data generated. Thus, bioimage informatics approaches are needed to rapidly and objectively quantify and analyze the image data. This paper provides an overview of the bioimage informatics challenges and approaches in image-based studies for drug and target discovery. The concepts and capabilities of image-based screening are first illustrated by a few practical examples investigating different kinds of phenotypic changes caused by drugs, compounds, or RNAi. The bioimage analysis approaches, including object detection, segmentation, and tracking, are then described. Subsequently, the quantitative features, phenotype identification, and multidimensional profile analysis for profiling the effects of drugs and targets are summarized. Moreover, a number of publicly available software packages for bioimage informatics are listed for further reference. It is expected that this review will help readers, including those without bioimage informatics expertise, understand the capabilities, approaches, and tools of bioimage informatics and apply them to advance their own studies.

A Mathematical Model for the Reciprocal Differentiation of T Helper 17 Cells and Induced Regulatory T Cells

July 2011


The reciprocal differentiation of T helper 17 (T(H)17) cells and induced regulatory T (iT(reg)) cells plays a critical role in both the pathogenesis and resolution of diverse human inflammatory diseases. Although initial studies suggested a stable commitment to either the T(H)17 or the iT(reg) lineage, recent results reveal remarkable plasticity and heterogeneity, reflected in the capacity of differentiated effectors cells to be reprogrammed among T(H)17 and iT(reg) lineages and the intriguing phenomenon that a group of naïve precursor CD4(+) T cells can be programmed into phenotypically diverse populations by the same differentiation signal, transforming growth factor beta. To reconcile these observations, we have built a mathematical model of T(H)17/iT(reg) differentiation that exhibits four different stable steady states, governed by pitchfork bifurcations with certain degrees of broken symmetry. According to the model, a group of precursor cells with some small cell-to-cell variability can differentiate into phenotypically distinct subsets of cells, which exhibit distinct levels of the master transcription-factor regulators for the two T cell lineages. A dynamical control system with these properties is flexible enough to be steered down alternative pathways by polarizing signals, such as interleukin-6 and retinoic acid and it may be used by the immune system to generate functionally distinct effector cells in desired fractions in response to a range of differentiation signals. Additionally, the model suggests a quantitative explanation for the phenotype with high expression levels of both master regulators. This phenotype corresponds to a re-stabilized co-expressing state, appearing at a late stage of differentiation, rather than a bipotent precursor state observed under some other circumstances. 
Our simulations reconcile most published experimental observations and predict novel differentiation states as well as transitions among different phenotypes that have not yet been observed experimentally.

Colocalization of Coregulated Genes: A Steered Molecular Dynamics Study of Human Chromosome 19

March 2013


The connection between chromatin nuclear organization and gene activity is vividly illustrated by the observation that transcriptional coregulation of certain genes appears to be directly influenced by their spatial proximity. This fact poses the more general question of whether it is at all feasible that the numerous genes that are coregulated on a given chromosome, especially those at large genomic distances, might become proximate inside the nucleus. This problem is studied here using steered molecular dynamics simulations in order to enforce the colocalization of thousands of knowledge-based gene sequences on a model for the gene-rich human chromosome 19. Remarkably, it is found that most ([Formula: see text]) gene pairs can be brought simultaneously into contact. This is made possible by the low degree of intra-chromosome entanglement and the large number of cliques in the gene coregulatory network. A clique is a set of genes coregulated all together as a group. The constrained conformations for the model chromosome 19 are further shown to be organized in spatial macrodomains that are similar to those inferred from recent HiC measurements. The findings indicate that gene coregulation and colocalization are largely compatible and that this relationship can be exploited to draft the overall spatial organization of the chromosome in vivo. The more general validity and implications of these findings could be investigated by applying to other eukaryotic chromosomes the general and transferable computational strategy introduced here.

Effect of 1918 PB1-F2 Expression on Influenza A Virus Infection Kinetics

February 2011


Relatively little is known about the viral factors contributing to the lethality of the 1918 pandemic, although its unparalleled virulence was likely due in part to the newly discovered PB1-F2 protein. This protein, while unnecessary for replication, increases apoptosis in monocytes, alters viral polymerase activity in vitro, enhances inflammation and increases secondary pneumonia in vivo. However, the effects the PB1-F2 protein has in vivo remain unclear. To address the mechanisms involved, we intranasally infected groups of mice with either influenza A virus PR8 or a genetically engineered virus that expresses the 1918 PB1-F2 protein on a PR8 background, PR8-PB1-F2(1918). Mice inoculated with PR8 had viral concentrations peaking at 72 hours, while those infected with PR8-PB1-F2(1918) reached peak concentrations earlier, at 48 hours. Mice given PR8-PB1-F2(1918) also showed a faster decline in viral loads. We fit a mathematical model to these data to estimate parameter values. The model supports a higher viral production rate per cell and a higher infected cell death rate with the PR8-PB1-F2(1918) virus. We discuss the implications of these mechanisms, during an infection with a virus expressing a virulent PB1-F2, for the possibility of a pandemic and for the importance of antiviral treatments.
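
How a higher production rate and a higher infected-cell death rate could shift the timing of the viral peak can be sketched with a standard target-cell-limited model (target cells T, infected cells I, virus V). This is an assumption for illustration: the abstract does not specify the model's equations, and every parameter value below is hypothetical and dimensionless rather than fitted to the mouse data.

```python
def simulate(beta, delta, p, c, target0=1.0, virus0=1e-3, days=15.0, dt=0.01):
    """Integrate dT/dt = -beta*T*V, dI/dt = beta*T*V - delta*I,
    dV/dt = p*I - c*V with a fixed-step fourth-order Runge-Kutta scheme.

    Returns (times, virus) as lists. All parameters are illustrative.
    """
    def deriv(state):
        t_cells, infected, virus = state
        infection = beta * t_cells * virus
        return (-infection, infection - delta * infected, p * infected - c * virus)

    def shifted(state, slope, h):
        return tuple(s + h * d for s, d in zip(state, slope))

    state = (target0, 0.0, virus0)
    times, virus = [0.0], [virus0]
    steps = int(round(days / dt))
    for k in range(1, steps + 1):
        k1 = deriv(state)
        k2 = deriv(shifted(state, k1, 0.5 * dt))
        k3 = deriv(shifted(state, k2, 0.5 * dt))
        k4 = deriv(shifted(state, k3, dt))
        state = tuple(
            s + dt * (a + 2.0 * b + 2.0 * g + d) / 6.0
            for s, a, b, g, d in zip(state, k1, k2, k3, k4)
        )
        times.append(k * dt)
        virus.append(state[2])
    return times, virus


def peak_time(times, virus):
    """Time at which the simulated viral load is largest."""
    i_max = max(range(len(virus)), key=virus.__getitem__)
    return times[i_max]


if __name__ == "__main__":
    # Hypothetical baseline versus doubled production (p) and death (delta) rates.
    t_a, v_a = simulate(beta=2.0, delta=1.0, p=10.0, c=5.0)
    t_b, v_b = simulate(beta=2.0, delta=2.0, p=20.0, c=5.0)
    print("baseline peak (days):", peak_time(t_a, v_a))
    print("higher p and delta peak (days):", peak_time(t_b, v_b))
```

In this sketch, raising both rates speeds up early viral growth and shortens the lifetime of infected cells, so the simulated peak arrives sooner; it is meant only to show that the two inferred mechanisms are qualitatively compatible with an earlier peak and a faster decline.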

Detailed Simulations of Cell Biology with Smoldyn 2.1

March 2010


Author Summary: We developed a general-purpose biochemical simulation program, called Smoldyn. It represents proteins and other molecules of interest with point-like particles that diffuse, interact with surfaces, and react, all in continuous space. This high level of detail allows users to investigate spatial organization within cells and natural stochastic variability. Although similar to the MCell and ChemCell programs, Smoldyn is more accurate and runs faster. Smoldyn also supports many unique features, such as commands that a “virtual experimenter” can execute during simulations and automatic reaction network expansion for simulating protein complexes. We illustrate Smoldyn's capabilities with a model of signaling between yeast cells of opposite mating type. It investigates the role of the secreted protease Bar1, which inactivates mating pheromone. Intuitively, it might seem that inactivating most of the pheromone would make a cell less able to detect the local pheromone concentration gradient. In contrast, we found that Bar1 secretion improves pheromone gradient detectability: the local gradient is sharpened because pheromone is progressively inactivated as it diffuses through a cloud of Bar1. This result helps interpret experiments that showed that Bar1 secretion helped cells distinguish between potential mates, and suggests that Bar1 helps yeast cells identify the fittest mating partners.
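
The notion of point-like particles diffusing in continuous space can be sketched as a plain Brownian-dynamics step, in which each coordinate receives a Gaussian displacement with standard deviation sqrt(2*D*dt). The code below is a generic illustration and not Smoldyn's code or interface; the function names and parameter values are hypothetical.

```python
import math
import random


def diffuse(positions, diff_coeff, dt, rng):
    """Advance every particle by one Brownian step.

    positions  : list of coordinate tuples (any dimension)
    diff_coeff : diffusion coefficient D (length^2 / time)
    dt         : time step
    rng        : a random.Random instance, passed in for reproducibility
    """
    sigma = math.sqrt(2.0 * diff_coeff * dt)
    return [tuple(x + rng.gauss(0.0, sigma) for x in pos) for pos in positions]


def mean_squared_displacement(start, end):
    """Average squared distance travelled between two snapshots."""
    total = sum(
        sum((b - a) ** 2 for a, b in zip(p, q)) for p, q in zip(start, end)
    )
    return total / len(start)


if __name__ == "__main__":
    rng = random.Random(0)
    start = [(0.0, 0.0, 0.0)] * 2000
    current = start
    for _ in range(50):
        current = diffuse(current, diff_coeff=1.0, dt=0.01, rng=rng)
    # Free diffusion in three dimensions predicts MSD = 6 * D * t.
    print("simulated MSD:", mean_squared_displacement(start, current))
    print("expected  MSD:", 6 * 1.0 * 50 * 0.01)
```

Because every particle is tracked individually, repeated runs with different seeds differ slightly from one another, which is the kind of stochastic variability the summary refers to. Surfaces and reactions would be layered on top of a step like this one.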

Table 1.  ISMB/ECCB History.
Figure 1. Sketch of Program for ISMB 2008 in Toronto. doi:10.1371/journal.pcbi.1000094.g001 
Figure 2.  Increasing Breadth of ISMB.
ISMB 2008 Toronto

July 2008


A considerable fraction of all the major scholars in computational biology frequently participate in ISMB. Consistent with this leading role in representation, ISMB has become a major outlet for increasing the visibility of this extremely dynamic new discipline, and for maintaining and elevating its scientific standards. It has become the vehicle for the education of scholars at all stages of their careers, for the integration of students, and for the support of young leaders in the field. ISMB has also become a forum for reviewing the state of the art in the many fields of this growing discipline, for introducing new directions, and for announcing technological breakthroughs. ISMB and ISCB are contributing to the advance of biology, and to helping to build bridges and understanding between dedicated and passionate groups of scholars from an unusual variety of backgrounds.

ISMB 1993–2008

The ISMB conference series began in 1993, the result of the vision of David Searls (GlaxoSmithKline), Jude Shavlik (University of Wisconsin Madison), and Larry Hunter (University of Colorado). A few years later, ISMB had established itself as a primary event for the computational biology community and triggered the founding of ISCB, the International Society for Computational Biology. ISCB has been organizing the ISMB conference series since 1998. While ISCB evolved into the only society representing computational biology globally, its flagship conference has become the largest annual worldwide forum focused on computational biology. In January 2007, the ISCB came to an agreement with the European Conference on Computational Biology (ECCB) to organize a joint meeting in Europe every other year. This led to the ISMB/ECCB in Vienna in 2007 that set the standard for a large-scale integrative forum for all those with interest in subjects related to computational biology.
ISCB is now focusing on expanding participation beyond North America and Europe, which have accounted for the majority of participants during the history of ISMB. One meeting in South Asia (InCoB) has already been sponsored by ISCB, and another one in North Asia is going to follow. ISMB itself has also been held in Australia (2003) and Brazil (2006).

Top-cited authors