PLoS Computational Biology

Published by Public Library of Science
Online ISSN: 1553-7358
Large-scale genome sequencing has gained general importance for the life sciences because functional annotation of otherwise experimentally uncharacterized sequences is made possible by the theory of biomolecular sequence homology. Historically, the paradigm that similarity of protein sequences implies common structure, function and ancestry was generalized from studies of globular domains. Sharing the same fold imposes strict conditions on packing in the hydrophobic core, requiring similarity of hydrophobic patterns. The implications of sequence similarity among non-globular protein segments have not been studied to the same extent; nevertheless, homology considerations are silently extended to them. This is especially detrimental in the case of transmembrane helices (TMs) and signal peptides (SPs), where sequence similarity is necessarily a consequence of physical requirements rather than common ancestry. Thus, matching of SPs/TMs creates the illusion of matching hydrophobic cores, and inclusion of SPs/TMs in domain models can give rise to wrong annotations. More than 1001 domains among the 10,340 models of Pfam release 23, and 18 of the 809 domains of SMART version 6, contain SP/TM regions. As expected, fragment-mode HMM searches generate promiscuous hits limited solely to the SP/TM part among clearly unrelated proteins. More worryingly, we show explicit examples in which the scores of clearly false-positive hits, even in global-mode searches, are elevated into the significance range just by matching the hydrophobic runs. In the PIR iProClass database v3.74, using conservative criteria, we find that at least 2.1% (and possibly as many as 13.6%) of its annotated Pfam hits appear unjustified for a set of validated domain models. Thus, false-positive domain hits enforced by SP/TM regions can lead to dramatic annotation errors in which the hit has nothing in common with the problematic domain model except the SP/TM region itself.
We suggest a workflow of flagging problematic hits arising from SP/TM-containing models for critical reconsideration by annotation users.
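A simple hydrophobicity scan illustrates why unrelated SP/TM segments look alike to a profile search: any sufficiently long stretch of strongly hydrophobic residues scores highly, regardless of ancestry. The sketch below (not from the paper; the 19-residue window and the cutoff are illustrative choices) flags such stretches using the standard Kyte-Doolittle scale:

```python
# Kyte-Doolittle hydropathy values (standard scale).
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}

def hydrophobic_windows(seq, width=19, threshold=1.6):
    """Return start positions of windows whose mean hydropathy exceeds
    the threshold -- a crude flag for putative TM/SP-like segments."""
    hits = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        if sum(KD[aa] for aa in window) / width >= threshold:
            hits.append(i)
    return hits
```

Two unrelated proteins that each contain such a window will share a strong local similarity signal that reflects physics, not homology.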
The 1000 Genomes Project data provides a natural background dataset for amino acid germline mutations in humans. Since the direction of mutation is known, the amino acid exchange matrix generated from the observed nucleotide variants is asymmetric, and the mutabilities of the different amino acids are very different. These differences predominantly reflect preferences for nucleotide mutations in the DNA (especially the high mutation rate of the CpG dinucleotide, which makes arginine mutability much higher than that of other amino acids) rather than selection imposed by protein structure constraints, although there is evidence for the latter as well. The variants occur predominantly on the surface of proteins (82%), with a slight preference for sites which are more exposed and less well conserved than random. Mutations to functional residues occur about half as often as expected by chance. The disease-associated amino acid variant distributions in OMIM are radically different from those expected on the basis of the 1000 Genomes dataset. The disease-associated variants preferentially occur in more conserved sites, compared to 1000 Genomes mutations. Many of the amino acid exchange profiles appear to exhibit an anti-correlation, with common exchanges in one dataset being rare in the other. Disease-associated variants exhibit more extreme differences in amino acid size and hydrophobicity. More modelling of the mutational processes at the nucleotide level is needed, but these observations should contribute to an improved prediction of the effects of specific variants in humans.
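As a toy illustration of the asymmetry described above, one can tally directed amino-acid exchanges from observed (reference, alternate) pairs; because the direction is known, counts for R→H and H→R are kept separate. The function names and example data here are hypothetical, and, as the abstract notes, a real analysis would model the mutational process at the nucleotide level:

```python
from collections import Counter

def exchange_matrix(variants):
    """Directed amino-acid exchange counts from (ref_aa, alt_aa) pairs.
    The matrix is asymmetric because the direction of mutation is known."""
    return Counter(variants)

def mutability(variants):
    """Fraction of observed exchanges originating from each amino acid
    (an empirical per-residue mutability, ignoring background frequency)."""
    total = len(variants)
    by_ref = Counter(ref for ref, _ in variants)
    return {aa: n / total for aa, n in by_ref.items()}
```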
Blue-Yellow Microarray Figure Applied to KEGG Vectors for Four Metagenomics Projects
The whale-fall and Sargasso Sea data are partitioned into three different samples each. The rows correspond to the different datasets and the columns to the 137 KEGG categories. Blue corresponds to underrepresentation and yellow to overrepresentation. Note that some branch lengths have been adjusted for visualization purposes and do not correspond to an actual meaningful distance.
Published Microbial Community Shotgun Sequencing Projects
Bounds on Amount of Sequence Needed to Assemble Genomes (in Mbp)
Projection of the KEGG Vectors on the First Two Principal Components
Examples of Ongoing Community WGS Sequencing Projects
The application of whole-genome shotgun sequencing to microbial communities represents a major development in metagenomics, the study of uncultured microbes via the tools of modern genomic analysis. In the past year, whole-genome shotgun sequencing projects of prokaryotic communities from an acid mine biofilm, the Sargasso Sea, Minnesota farm soil, three deep-sea whale falls, and deep-sea sediments have been reported, adding to previously published work on viral communities from marine and fecal samples. The interpretation of this new kind of data poses a wide variety of exciting and difficult bioinformatics problems. The aim of this review is to introduce the bioinformatics community to this emerging field by surveying existing techniques and promising new approaches for several of the most interesting of these computational problems.
An example application of ClonalFrameML to a simulated dataset.
(A) The clonal genealogy produced by simulation. (B) Maximum-likelihood reconstructed phylogeny. (C) ClonalFrameML reconstructed phylogeny. (D) Representation of recombination events along the genome for each branch of the genealogy in (A). True events are shown in blue and events detected by ClonalFrameML are shown in red. Three branches of interest and their associated recombination events are highlighted by red boxes.
Comparison of correct parameter values with estimates from ClonalFrameML for a hundred datasets simulated under the ClonalFrame model.
Dots represent the point estimates and bars the 95% confidence intervals. Colours represent the correct value of the compound parameter δR, ranging from 10⁻³ (black) to 10² (red).
Comparison of correct parameter values with estimates from ClonalFrameML for a hundred datasets simulated under the coalescent with gene conversion model of intra-population recombination.
Dots represent the point estimates and bars the 95% confidence intervals. Colours represent the correct value of the parameter δ, ranging from 10² (black) to 10⁴ (red).
Application of ClonalFrameML to 86 genomes of C. difficile ST6.
For any branch of the genealogy and any position along the genome, inferred recombination is marked in blue.
ClonalFrameML analysis of recombination in S. aureus based on 110 genomes representing carriage and reference isolates mapped to MRSA252.
Reconstructed substitutions (white vertical bars) are shown for each branch of the ML tree. Grey areas represent non-core regions of the MRSA252 genome. Dark blue horizontal bars indicate recombination events detected by the analysis.
Recombination is an important evolutionary force in bacteria, but it remains challenging to reconstruct the imports that occurred in the ancestry of a genomic sample. Here we present ClonalFrameML, which uses maximum likelihood inference to simultaneously detect recombination in bacterial genomes and account for it in phylogenetic reconstruction. ClonalFrameML can analyse hundreds of genomes in a matter of hours, and we demonstrate its usefulness on simulated and real datasets. We find evidence for recombination hotspots associated with mobile elements in Clostridium difficile ST6 and a previously undescribed 310 kb chromosomal replacement in Staphylococcus aureus ST582. ClonalFrameML is freely available online.
Synopsis Just as protein sequences change over time, so do gene structures. Over comparatively short evolutionary timescales, introns lengthen and shorten; and over longer timescales the number and positions of introns in homologous genes can change. These facts suggest that the intron–exon structures of genes may provide a source of evolutionary information. The utility of gene structures as materials for phylogenetic analyses, however, depends upon their independence from the forces driving protein evolution. If, for example, intron–exon structures are strongly influenced by selection at the amino acid level, then using them for phylogenetic investigations is largely pointless, as the same information could have been more easily gained from protein analyses. Using 11 animal genomes, Yandell et al. show that evolution of intron lengths and positions is largely—though not completely—independent of protein sequence evolution. This means that gene structures provide a source of information about the evolutionary past independent of protein sequence similarities—a finding the authors employ to investigate the accuracy of the protein clock and to explore the utility of gene structures as a means to resolve deep phylogenetic relationships within the animals.
Genome-wide association studies (GWAS) have evolved over the last ten years into a powerful tool for investigating the genetic architecture of human disease. In this work, we review the key concepts underlying GWAS, including the architecture of common diseases, the structure of common human genetic variation, technologies for capturing genetic information, study designs, and the statistical methods used for data analysis. We also look forward to the future beyond GWAS.
The combination of improved genomic analysis methods, decreasing genotyping costs, and increasing computing resources has led to an explosion of clinical genomic knowledge in the last decade. Similarly, healthcare systems are increasingly adopting robust electronic health record (EHR) systems that not only can improve health care, but also contain a vast repository of disease and treatment data that could be mined for genomic research. Indeed, institutions are creating EHR-linked DNA biobanks to enable genomic and pharmacogenomic research, using EHR data for phenotypic information. However, EHRs are designed primarily for clinical care, not research, so reuse of clinical EHR data for research purposes can be challenging. Difficulties in use of EHR data include: data availability, missing data, incorrect data, and vast quantities of unstructured narrative text data. Structured information includes billing codes, most laboratory reports, and other variables such as physiologic measurements and demographic information. Significant information, however, remains locked within EHR narrative text documents, including clinical notes and certain categories of test results, such as pathology and radiology reports. For relatively rare observations, combinations of simple free-text searches and billing codes may prove adequate when followed by manual chart review. However, to extract the large cohorts necessary for genome-wide association studies, natural language processing methods to process narrative text data may be needed. Combinations of structured and unstructured textual data can be mined to generate high-validity collections of cases and controls for a given condition. Once high-quality cases and controls are identified, EHR-derived cases can be used for genomic discovery and validation. Since EHR data includes a broad sampling of clinically-relevant phenotypic information, it may enable multiple genomic investigations upon a single set of genotyped individuals. 
This chapter reviews several examples of phenotype extraction and their application to genetic research, demonstrating a viable future for genomic discovery using EHR-linked data.
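The case/control selection strategy described above can be sketched in a few lines: a patient qualifies as a candidate case only when a structured billing code and an unstructured-text mention agree. This is a deliberately minimal illustration with hypothetical field names ("codes", "notes"); real EHR phenotyping algorithms add natural language processing, exclusion logic, and validation by manual chart review:

```python
def select_cases(patients, codes, keywords):
    """Toy case-selection: a patient is a candidate case if they carry a
    qualifying billing code AND a clinical note mentions a qualifying
    keyword. `patients` is a list of dicts with hypothetical 'id',
    'codes', and 'notes' fields."""
    cases = []
    for p in patients:
        has_code = any(c in codes for c in p["codes"])
        has_text = any(k in note.lower() for note in p["notes"] for k in keywords)
        if has_code and has_text:
            cases.append(p["id"])
    return cases
```

Requiring both sources of evidence is what gives such combinations their higher validity compared with either billing codes or free-text search alone.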
Author Summary Advances in nanotechnology have allowed for manipulation of single biomolecules and determination of their elastic properties. Titin was among the first proteins studied in this way. Its unravelling by stretching requires a 204 pN force. The resistance to stretching comes mostly from a localized region known as a force clamp. In titin, the force clamp is simple, as it is formed by two parallel β-strands that are sheared on pulling. Studies of fewer than a hundred proteins, accomplished in the last decade, have revealed a variety of force clamps that lead to forces ranging from under 20 pN to about 500 pN. This set comprises only a tiny fraction of known proteins. Thus one needs guidance as to which proteins should be considered for specific mechanical properties. Such guidance is provided here through simulations, within simplified coarse-grained models, of 17,134 proteins that are stretched at constant speed. We correlate their unravelling forces with two structure classification schemes. We identify proteins with large resistance to unravelling and characterize their force clamps. Quite a few top-strength proteins owe their sturdiness to a new type of force clamp: the cysteine slipknot, in which the force peak is due to dragging of a piece of the backbone through a closed ring formed by two other pieces of the backbone and two connecting disulphide bonds.
Changes in the synaptic connection strengths between neurons are believed to play a role in memory formation. An important mechanism for changing synaptic strength is through movement of neurotransmitter receptors and regulatory proteins to and from the synapse. Several activity-triggered biochemical events control these movements. Here we use computer models to explore how these putative memory-related changes can be stabilised long after the initial trigger, and beyond the lifetime of synaptic molecules. We base our models on published biochemical data and experiments on the activity-dependent movement of a glutamate receptor, AMPAR, and a calcium-dependent kinase, CaMKII. We find that both of these molecules participate in distinct bistable switches. These simulated switches are effective for long periods despite molecular turnover and biochemical fluctuations arising from the small numbers of molecules in the synapse. The AMPAR switch arises from a novel self-recruitment process where the presence of sufficient receptors biases the receptor movement cycle to insert still more receptors into the synapse. The CaMKII switch arises from autophosphorylation of the kinase. The switches may function in a tightly coupled manner, or relatively independently. The latter case leads to multiple stable states of the synapse. We propose that similar self-recruitment cycles may be important for maintaining levels of many molecules that undergo regulated movement, and that these may lead to combinatorial possible stable states of systems like the synapse.
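The self-recruitment mechanism described above can be caricatured as a one-variable positive-feedback loop: cooperative, receptor-dependent insertion plus first-order removal yields two stable receptor levels. This is a generic bistability sketch, not the authors' published model; the equation form and all parameter values are illustrative assumptions:

```python
def simulate(n0, steps=20000, dt=0.01, b=1.0, v=20.0, K=10.0, k=1.0):
    """Euler integration of a toy self-recruitment switch:
        dN/dt = b + v*N^4/(K^4 + N^4) - k*N
    Basal insertion b, cooperative self-recruited insertion (Hill term),
    and first-order removal k*N give two stable receptor levels."""
    n = n0
    for _ in range(steps):
        n += dt * (b + v * n**4 / (K**4 + n**4) - k * n)
    return n
```

Starting below the unstable threshold the synapse relaxes to the low-receptor state; starting above it, to the high-receptor state, so the switch retains its setting despite turnover of individual molecules.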
Although there is great promise in the benefits to be obtained by analyzing cancer genomes, numerous challenges hinder different stages of the process, from the problem of sample preparation and the validation of the experimental techniques, to the interpretation of the results. This chapter specifically focuses on the technical issues associated with the bioinformatics analysis of cancer genome data. The main issues addressed are the use of database and software resources, the use of analysis workflows and the presentation of clinically relevant action items. We attempt to aid new developers in the field by describing the different stages of analysis and discussing current approaches, as well as by providing practical advice on how to access and use resources, and how to implement recommendations. Real cases from cancer genome projects are used as examples.
This year we celebrate the 150th anniversary of the law of mass action. This law is often assumed to have been "there" forever, but it has its own history, background, and a definite starting point. The law has had an impact on chemistry, biochemistry, biomathematics, and systems biology that is difficult to overestimate. It is easily recognized that it is the direct basis for computational enzyme kinetics, ecological systems models, and models for the spread of diseases. The article reviews the explicit and implicit role of the law of mass action in systems biology and reveals how the original, more general formulation of the law emerged one hundred years later ab initio as a very general, canonical representation of biological processes.
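The law's role as the direct basis for computational kinetics is easy to make concrete: for an elementary reaction A + B → C, the rate is proportional to the product of the reactant concentrations. A minimal sketch (illustrative rate constant and concentrations, simple Euler integration):

```python
def mass_action(a0, b0, kf, t_end=10.0, dt=1e-3):
    """Law of mass action for the elementary reaction A + B -> C:
    rate = kf*[A]*[B]. Euler integration of the resulting ODEs."""
    a, b, c = a0, b0, 0.0
    for _ in range(int(t_end / dt)):
        rate = kf * a * b
        a -= dt * rate
        b -= dt * rate
        c += dt * rate
    return a, b, c
```

The same product-of-concentrations construction underlies enzyme kinetics schemes, ecological interaction terms, and the contact terms in epidemic models mentioned above.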
Some knowledge sources for biomedical natural language processing.
Text mining for translational bioinformatics is a new field with tremendous research potential. It is a subfield of biomedical natural language processing that concerns itself directly with the problem of relating basic biomedical research to clinical practice, and vice versa. Applications of text mining fall both into the category of T1 translational research-translating basic science results into new interventions-and T2 translational research, or translational research for public health. Potential use cases include better phenotyping of research subjects, and pharmacogenomic research. A variety of methods for evaluating text mining applications exist, including corpora, structured test suites, and post hoc judging. Two basic principles of linguistic structure are relevant for building text mining applications. One is that linguistic structure consists of multiple levels. The other is that every level of linguistic structure is characterized by ambiguity. There are two basic approaches to text mining: rule-based, also known as knowledge-based; and machine-learning-based, also known as statistical. Many systems are hybrids of the two approaches. Shared tasks have had a strong effect on the direction of the field. Like all translational bioinformatics software, text mining software for translational bioinformatics can be considered health-critical and should be subject to the strictest standards of quality assurance and software testing.
General pipeline used in the reconstruction of cell specific genome-scale metabolic networks. Biological information at the genome, transcriptome, proteome and metabolome levels contained in publicly available databases and generic human GEMs (Recon1, EHMN, HumanCyc) is integrated to form a generic human metabolic network, which is processed in order to obtain the connected iHuman1512 network. Subsequently, the cell type specific evidence is used to generate cell type specific subnetworks using the INIT algorithm. doi:10.1371/journal.pcbi.1002518.g001 
Illustration of the principles of the INIT algorithm. The hierarchical structure of GEMs is characterized by its gene-transcript-protein-reaction (GTPR) associations. In GEMs, each metabolic reaction is associated with one or more enzymes, which in turn are associated with transcripts and genes. Depending on the evidence for presence/absence of a given enzyme/gene in a cell type, a score can be calculated for the reaction(s) catalyzed by that enzyme. The HPA evidence scores are illustrated as red, light, medium and dark green, representing negative, weak, moderate and strong evidence, respectively. The transcriptome evidence scores (GeneX) are illustrated as red, light, medium and dark blue, representing low, medium and high expression, respectively. No evidence is shown as a white object. For some metabolites (yellow filled circles), metabolomic data are available to prove that they are present in the considered cell type. The aim of the algorithm is to find a sub-network in which the involved genes/proteins have strong evidence supporting their presence in the cell type under consideration. This is done by maximizing the sum of evidence scores. All the included reactions should be able to carry a flux, and all the metabolites observed experimentally should be synthesized from precursors that the cell is known to take up. The bold lines represent the resulting network after optimization. doi:10.1371/journal.pcbi.1002518.g002
Gene content comparison between our hepatocyte model and HepatoNet1. The Venn diagram shows the overlap in terms of included genes between the three models. The blue, green and red squares represent iHuman1512, our hepatocyte model iHepatocyte1154 and HepatoNet1, respectively. The distribution of evidence scores of each section of the Venn diagram is plotted. The HPA evidence scores are illustrated as red, light, medium and dark green, representing negative, weak, moderate and strong evidence, respectively. The transcriptome evidence scores (GeneX) are illustrated as red, light, medium and dark blue, representing low, medium and high expression, respectively. No evidence (NE) is illustrated in grey. doi:10.1371/journal.pcbi.1002518.g003
Example of a metabolic sub-network that was identified as being significantly more present in cancer tissues compared to their corresponding healthy tissues. Aminoacetone, a toxic by-product of amino acid catabolism, is converted to toxic methylglyoxal in a reaction that also produces hydrogen peroxide. The toxicity of methylglyoxal is relieved by two reaction steps involving ligation to glutathione and resulting in lactic acid. The generated hydrogen peroxide is handled by the enzyme biliverdin reductase. This is an example of how network-based analysis can lead to a more mechanistic interpretation of data. doi:10.1371/journal.pcbi.1002518.g004
Development of high throughput analytical methods has given physicians the potential access to extensive and patient-specific data sets, such as gene sequences, gene expression profiles or metabolite footprints. This opens for a new approach in health care, which is both personalized and based on system-level analysis. Genome-scale metabolic networks provide a mechanistic description of the relationships between different genes, which is valuable for the analysis and interpretation of large experimental data-sets. Here we describe the generation of genome-scale active metabolic networks for 69 different cell types and 16 cancer types using the INIT (Integrative Network Inference for Tissues) algorithm. The INIT algorithm uses cell type specific information about protein abundances contained in the Human Proteome Atlas as the main source of evidence. The generated models constitute the first step towards establishing a Human Metabolic Atlas, which will be a comprehensive description (accessible online) of the metabolism of different human cell types, and will allow for tissue-level and organism-level simulations in order to achieve a better understanding of complex diseases. A comparative analysis between the active metabolic networks of cancer types and healthy cell types allowed for identification of cancer-specific metabolic features that constitute generic potential drug targets for cancer treatment.
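The evidence-mapping step that feeds INIT can be sketched simply: per-gene evidence scores (e.g., from protein abundance data) are propagated onto reactions through the GTPR associations, with a reaction catalysed by several isoenzymes taking the best available evidence. This is a simplified illustration with hypothetical identifiers; the actual INIT algorithm then solves a mixed-integer optimization over these scores subject to flux-consistency constraints, which is not reproduced here:

```python
def reaction_scores(gtpr, evidence):
    """Map per-gene evidence scores onto reactions via gene->reaction
    (GTPR) associations. `gtpr` maps reaction id -> list of gene ids;
    `evidence` maps gene id -> score (negative = evidence of absence).
    A reaction with several isoenzymes takes the maximum score; genes
    without data contribute a neutral score of 0.0."""
    return {rxn: max(evidence.get(g, 0.0) for g in genes)
            for rxn, genes in gtpr.items()}
```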
The abundance of different SSU rRNA ("16S") gene sequences in environmental samples is widely used in studies of microbial ecology as a measure of microbial community structure and diversity. However, the genomic copy number of the 16S gene varies greatly - from one in many species to up to 15 in some bacteria and to hundreds in some microbial eukaryotes. As a result of this variation the relative abundance of 16S genes in environmental samples can be attributed both to variation in the relative abundance of different organisms, and to variation in genomic 16S copy number among those organisms. Despite this fact, many studies assume that the abundance of 16S gene sequences is a surrogate measure of the relative abundance of the organisms containing those sequences. Here we present a method that uses data on sequences and genomic copy number of 16S genes along with phylogenetic placement and ancestral state estimation to estimate organismal abundances from environmental DNA sequence data. We use theory and simulations to demonstrate that 16S genomic copy number can be accurately estimated from the short reads typically obtained from high-throughput environmental sequencing of the 16S gene, and that organismal abundances in microbial communities are more strongly correlated with estimated abundances obtained from our method than with gene abundances. We re-analyze several published empirical data sets and demonstrate that the use of gene abundance versus estimated organismal abundance can lead to different inferences about community diversity and structure and the identity of the dominant taxa in microbial communities. Our approach will allow microbial ecologists to make more accurate inferences about microbial diversity and abundance based on 16S sequence data.
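The core correction described above is arithmetically simple once a 16S copy number has been estimated for each taxon: divide each taxon's gene count by its copy number and renormalise. A minimal sketch (the hard part of the paper, estimating copy numbers by phylogenetic placement and ancestral state estimation, is assumed to have been done already):

```python
def organismal_abundance(gene_counts, copy_number):
    """Convert 16S read counts per taxon to estimated relative organismal
    abundances by dividing each count by that taxon's (estimated) 16S
    genomic copy number, then renormalising."""
    corrected = {t: n / copy_number[t] for t, n in gene_counts.items()}
    total = sum(corrected.values())
    return {t: v / total for t, v in corrected.items()}
```

With equal read counts, a taxon carrying ten 16S copies per genome is estimated to be ten-fold less abundant than a single-copy taxon, which is exactly the distortion the method removes.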
Pyrosequencing of PCR-amplified fragments that target variable regions within the 16S rRNA gene has quickly become a powerful method for analyzing the membership and structure of microbial communities. This approach has revealed and introduced questions that were not fully appreciated by those carrying out traditional Sanger sequencing-based methods. These include the effects of alignment quality, the best method of calculating pairwise genetic distances for 16S rRNA genes, whether it is appropriate to filter variable regions, and how the choice of variable region relates to the genetic diversity observed in full-length sequences. I used a diverse collection of 13,501 high-quality full-length sequences to assess each of these questions. First, alignment quality had a significant impact on distance values and downstream analyses. Specifically, the greengenes alignment, which does a poor job of aligning variable regions, predicted higher genetic diversity, richness, and phylogenetic diversity than the SILVA and RDP-based alignments. Second, the effect of different gap treatments in determining pairwise genetic distances was strongly affected by the variation in sequence length for a region; however, the effect of different calculation methods was subtle when determining the sample's richness or phylogenetic diversity for a region. Third, applying a sequence mask to remove variable positions had a profound impact on genetic distances by muting the observed richness and phylogenetic diversity. Finally, the genetic distances calculated for each of the variable regions did a poor job of correlating with the full-length gene. Thus, while it is tempting to apply traditional cutoff levels derived for full-length sequences to these shorter sequences, it is not advisable. Analysis of beta-diversity metrics showed that each of these factors can have a significant impact on the comparison of community membership and structure. 
Taken together, these results urge caution in the design and interpretation of analyses using pyrosequencing data.
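The sensitivity to gap treatment discussed above is easy to demonstrate: the same aligned pair of sequences yields different distances depending on whether columns containing gaps are skipped or counted as differences. A minimal uncorrected-distance sketch (one of several possible gap conventions, not a reimplementation of any particular tool):

```python
def pairwise_distance(s1, s2, count_gaps=False):
    """Uncorrected pairwise distance between two aligned sequences.
    Columns where either sequence has a gap ('-') are skipped unless
    count_gaps is True, in which case a gap opposite a base counts as
    one difference; shared gap columns are always skipped."""
    diffs = compared = 0
    for a, b in zip(s1, s2):
        if a == "-" or b == "-":
            if count_gaps and a != b:
                diffs += 1
                compared += 1
            continue
        compared += 1
        if a != b:
            diffs += 1
    return diffs / compared
```

Because variable regions differ in length across taxa, such gap-convention choices affect short-read distances far more than full-length distances, which is one reason fixed full-length cutoffs transfer poorly.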
Recent advances in automated high-resolution fluorescence microscopy and robotic handling have made the systematic and cost-effective study of diverse morphological changes within a large population of cells possible under a variety of perturbations, e.g., drugs, compounds, metal catalysts, RNA interference (RNAi). Cell population-based studies deviate from conventional microscopy studies on a few cells, and could provide stronger statistical power for drawing experimental observations and conclusions. However, it is challenging to manually extract and quantify phenotypic changes from the large amounts of complex image data generated. Thus, bioimage informatics approaches are needed to rapidly and objectively quantify and analyze the image data. This paper provides an overview of the bioimage informatics challenges and approaches in image-based studies for drug and target discovery. The concepts and capabilities of image-based screening are first illustrated by a few practical examples investigating different kinds of phenotypic changes caused by drugs, compounds, or RNAi. The bioimage analysis approaches, including object detection, segmentation, and tracking, are then described. Subsequently, the quantitative features, phenotype identification, and multidimensional profile analysis for profiling the effects of drugs and targets are summarized. Moreover, a number of publicly available software packages for bioimage informatics are listed for further reference. It is expected that this review will help readers, including those without bioimage informatics expertise, understand the capabilities, approaches, and tools of bioimage informatics and apply them to advance their own studies.
The reciprocal differentiation of T helper 17 (T(H)17) cells and induced regulatory T (iT(reg)) cells plays a critical role in both the pathogenesis and resolution of diverse human inflammatory diseases. Although initial studies suggested a stable commitment to either the T(H)17 or the iT(reg) lineage, recent results reveal remarkable plasticity and heterogeneity, reflected in the capacity of differentiated effector cells to be reprogrammed between the T(H)17 and iT(reg) lineages and in the intriguing phenomenon that a group of naïve precursor CD4(+) T cells can be programmed into phenotypically diverse populations by the same differentiation signal, transforming growth factor beta. To reconcile these observations, we have built a mathematical model of T(H)17/iT(reg) differentiation that exhibits four different stable steady states, governed by pitchfork bifurcations with certain degrees of broken symmetry. According to the model, a group of precursor cells with some small cell-to-cell variability can differentiate into phenotypically distinct subsets of cells, which exhibit distinct levels of the master transcription-factor regulators for the two T cell lineages. A dynamical control system with these properties is flexible enough to be steered down alternative pathways by polarizing signals, such as interleukin-6 and retinoic acid, and it may be used by the immune system to generate functionally distinct effector cells in desired fractions in response to a range of differentiation signals. Additionally, the model suggests a quantitative explanation for the phenotype with high expression levels of both master regulators. This phenotype corresponds to a re-stabilized co-expressing state, appearing at a late stage of differentiation, rather than a bipotent precursor state observed under some other circumstances.
Our simulations reconcile most published experimental observations and predict novel differentiation states as well as transitions among different phenotypes that have not yet been observed experimentally.
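The qualitative behaviour described above, small cell-to-cell variability resolving into distinct committed states, can be illustrated with a generic two-regulator toy model: each master regulator activates itself and represses the other. This is not the authors' published model; the equation form and all parameters are illustrative assumptions chosen to produce mirror-image committed states:

```python
def differentiate(x0, y0, steps=40000, dt=0.01):
    """Toy mutual-antagonism model of two master regulators X and Y
    (stand-ins for the TH17 and iTreg lineage factors):
        dX/dt = a + b*X^2/(1 + X^2 + g*Y^2) - X   (symmetrically for Y).
    Euler integration; small initial asymmetries are amplified into
    commitment to a high-X/low-Y or low-X/high-Y state."""
    a, b, g = 0.05, 4.0, 4.0
    x, y = x0, y0
    for _ in range(steps):
        dx = a + b * x**2 / (1 + x**2 + g * y**2) - x
        dy = a + b * y**2 / (1 + y**2 + g * x**2) - y
        x += dt * dx
        y += dt * dy
    return x, y
```

Running the model from two near-symmetric starting points shows the broken-symmetry outcome: the initially favoured regulator wins and suppresses the other.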
The connection between chromatin nuclear organization and gene activity is vividly illustrated by the observation that transcriptional coregulation of certain genes appears to be directly influenced by their spatial proximity. This fact poses the more general question of whether it is at all feasible that the numerous genes that are coregulated on a given chromosome, especially those at large genomic distances, might become proximate inside the nucleus. This problem is studied here using steered molecular dynamics simulations in order to enforce the colocalization of thousands of knowledge-based gene sequences on a model for the gene-rich human chromosome 19. Remarkably, it is found that most gene pairs can be brought simultaneously into contact. This is made possible by the low degree of intra-chromosome entanglement and the large number of cliques in the gene coregulatory network. A clique is a set of genes coregulated all together as a group. The constrained conformations for the model chromosome 19 are further shown to be organized in spatial macrodomains that are similar to those inferred from recent HiC measurements. The findings indicate that gene coregulation and colocalization are largely compatible and that this relationship can be exploited to draft the overall spatial organization of the chromosome in vivo. The more general validity and implications of these findings could be investigated by applying to other eukaryotic chromosomes the general and transferable computational strategy introduced here.
Relatively little is known about the viral factors contributing to the lethality of the 1918 pandemic, although its unparalleled virulence was likely due in part to the newly discovered PB1-F2 protein. This protein, while unnecessary for replication, increases apoptosis in monocytes, alters viral polymerase activity in vitro, enhances inflammation and increases secondary pneumonia in vivo. However, the effects the PB1-F2 protein has in vivo remain unclear. To address the mechanisms involved, we intranasally infected groups of mice with either influenza A virus PR8 or a genetically engineered virus that expresses the 1918 PB1-F2 protein on a PR8 background, PR8-PB1-F2(1918). Mice inoculated with PR8 had viral concentrations peaking at 72 hours, while those infected with PR8-PB1-F2(1918) reached peak concentrations earlier, at 48 hours. Mice given PR8-PB1-F2(1918) also showed a faster decline in viral loads. We fit a mathematical model to these data to estimate parameter values. The model supports a higher viral production rate per cell and a higher infected cell death rate with the PR8-PB1-F2(1918) virus. We discuss the implications of these mechanisms, during an infection with a virus expressing a virulent PB1-F2, for the possibility of a pandemic and for the importance of antiviral treatments.
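Model fits of this kind are commonly based on the target-cell-limited model of viral kinetics; whether the authors used exactly this formulation is an assumption here, and all parameter values below are illustrative rather than fitted. The sketch returns the time and height of the viral-load peak, the quantities the abstract compares between the two viruses:

```python
def viral_kinetics(beta=2e-5, delta=2.0, p=100.0, c=5.0,
                   T0=1e5, V0=1.0, t_end=10.0, dt=1e-3):
    """Euler integration of the standard target-cell-limited model:
        dT/dt = -beta*T*V
        dI/dt =  beta*T*V - delta*I
        dV/dt =  p*I - c*V
    where T are target cells, I infected cells, V free virus. Returns
    (time of peak viral load, peak viral load). Raising the per-cell
    production rate p shifts the peak earlier; raising the infected-cell
    death rate delta speeds the subsequent decline."""
    T, I, V = T0, 0.0, V0
    t, t_peak, v_peak = 0.0, 0.0, V0
    while t < t_end:
        dT = -beta * T * V
        dI = beta * T * V - delta * I
        dV = p * I - c * V
        T += dt * dT
        I += dt * dI
        V += dt * dV
        t += dt
        if V > v_peak:
            v_peak, t_peak = V, t
    return t_peak, v_peak
```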
Author Summary We developed a general-purpose biochemical simulation program, called Smoldyn. It represents proteins and other molecules of interest with point-like particles that diffuse, interact with surfaces, and react, all in continuous space. This high level of detail allows users to investigate spatial organization within cells and natural stochastic variability. Although similar to the MCell and ChemCell programs, Smoldyn is more accurate and runs faster. Smoldyn also supports many unique features, such as commands that a “virtual experimenter” can execute during simulations and automatic reaction network expansion for simulating protein complexes. We illustrate Smoldyn's capabilities with a model of signaling between yeast cells of opposite mating type. It investigates the role of the secreted protease Bar1, which inactivates mating pheromone. Intuitively, it might seem that inactivating most of the pheromone would make a cell less able to detect the local pheromone concentration gradient. In contrast, we found that Bar1 secretion improves pheromone gradient detectability: the local gradient is sharpened because pheromone is progressively inactivated as it diffuses through a cloud of Bar1. This result helps interpret experiments that showed that Bar1 secretion helped cells distinguish between potential mates, and suggests that Bar1 helps yeast cells identify the fittest mating partners.
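The particle-based approach that Smoldyn takes can be sketched in a few lines: each molecule is a point whose position takes an independent Gaussian step per time step. This toy (illustrative D and dt, no surfaces or reactions) only checks the diffusive scaling that such simulators build on.

```python
import random, math

# Minimal illustration of a Smoldyn-style particle representation:
# each molecule is a point that takes a Brownian step per time step.
# D and dt are illustrative assumptions, not Smoldyn defaults.

def diffuse(positions, D, dt):
    """One Brownian step: per-axis displacement ~ Normal(0, sqrt(2*D*dt))."""
    s = math.sqrt(2 * D * dt)
    return [(x + random.gauss(0, s), y + random.gauss(0, s))
            for x, y in positions]

random.seed(0)
mols = [(0.0, 0.0)] * 1000          # 1000 molecules at the origin
for _ in range(100):                # 100 steps of dt = 0.01, total t = 1
    mols = diffuse(mols, D=1.0, dt=0.01)

# In two dimensions the mean squared displacement should be close to
# 2 * d * D * t = 2 * 2 * 1 * 1 = 4.
msd = sum(x * x + y * y for x, y in mols) / len(mols)
print(round(msd, 2))
```

Reaction handling in real simulators adds binding radii and surface interactions on top of this diffusive core.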
ISMB/ECCB History.
Sketch of Program for ISMB 2008 in Toronto. doi:10.1371/journal.pcbi.1000094.g001 
Increasing Breadth of ISMB.
A considerable fraction of all the major scholars in computational biology frequently participate in ISMB. Consistent with this leading role in representation, ISMB has become a major outlet for increasing the visibility of this extremely dynamic new discipline, and for maintaining and elevating its scientific standards. It has become the vehicle for the education of scholars at all stages of their careers, for the integration of students, and for the support of young leaders in the field. ISMB has also become a forum for reviewing the state of the art in the many fields of this growing discipline, for introducing new directions, and for announcing technological breakthroughs. ISMB and ISCB are contributing to the advance of biology, and to building bridges and understanding between dedicated and passionate groups of scholars from an unusual variety of backgrounds.
ISMB 1993–2008
The ISMB conference series began in 1993, the result of the vision of David Searls (GlaxoSmithKline), Jude Shavlik (University of Wisconsin Madison), and Larry Hunter (University of Colorado). A few years later, ISMB had established itself as a primary event for the computational biology community and triggered the founding of ISCB, the International Society for Computational Biology. ISCB has been organizing the ISMB conference series since 1998. While ISCB evolved into the only society representing computational biology globally, its flagship conference has become the largest annual worldwide forum focused on computational biology. In January 2007, the ISCB came to an agreement with the European Conference on Computational Biology (ECCB) to organize a joint meeting in Europe every other year. This led to the ISMB/ECCB in Vienna in 2007, which set the standard for a large-scale integrative forum for all those with an interest in subjects related to computational biology.
ISCB is now focusing on expanding participation beyond North America and Europe, which have accounted for the majority of participants during the history of ISMB. One meeting in South Asia (InCoB) has already been sponsored by ISCB, and another in North Asia is to follow. ISMB itself has also been held in Australia (2003) and Brazil (2006).
Reported cases at US military installations during the 2009 influenza pandemic.
(a) Number of reported cases per week of: ILI-large (green); ILI-small (blue); the top 50 military installations' contribution to ILI-small (magenta); and the CDC's ILI weekly surveillance (red). Profiles overlap because of the independent y-axis scaling. (b) Heat map representation of ILI-small profiles for each of the top 50 military installations by zip code (MPZ), ordered by total number of ILI-small cases reported (largest at top). (c) As (b), but with each profile renormalized to its own maximum value, thus highlighting relative variations. (d–f) Incidence curves for: (d) Fort Carson, just outside of Colorado Springs in El Paso County, Colorado (MPZ 80913), home to over 21,000 soldiers; (e) Bob Wilson Naval Hospital in San Diego, which serves as a clinic for several military installations primarily within San Diego County, including MCAS Miramar (MPZ 92134); and (f) the Marine base at Quantico, Virginia (MPZ 22134), a major training facility for both Marines and federal law enforcement agencies. The timing of individual MPZ peaks is marked by the red vertical line. A complete set of the profiles for the largest 50 MPZs is given in Figure S1.
The timing of the pandemic peaks for military installations by zip code (MPZ) and their relationship to civilian profiles.
(a) Distribution of the timing of the peaks at each installation during the interval between April 1, 2009 and January 1, 2010. A number of installations showed evidence for two waves, one in the summer and one in October; here, only the highest peak from the entire interval is shown. (b–d) Comparison of military and civilian population profiles for three locations: (b) incidence profiles for San Diego County, together with the MCAS Miramar (MPZ 92134) and Camp Pendleton (MPZ 92055) bases; (c) El Paso County and Fort Carson Army Base (MPZ 80913); and (d) Alaska State (data at the Borough/County level were not available) and Elmendorf Air Force Base (MPZ 99506). (e) Comparison of the timing of the peaks within MPZs and the nearest civilian populations for installations for which relatively localized civilian data could be obtained. The legend summarizes the type of civilian data obtained (confirmed/antigen, PCR, or culture) and the installation to which it was compared. The solid line is a linear regression to the data, with a Pearson correlation coefficient of 0.9. Points lying above and to the left of the dashed line represent cases where the military peak lagged the civilian peak.
Model fits for the top 50 installations during the 2009 pandemic.
(a–d) Comparison of model fits with data for a selection of installations: (a) Portsmouth Naval Medical Hospital, Portsmouth, Virginia (MPZ-23708). This location produced the largest number of ILI-small cases. The hospital employs 4,300 active duty military and civilians but is also located near several Navy and Army facilities. The profile shows a clean epidemic curve, and the model fit closely matches the observed profile. (b) Camp Pendleton Marine base (MPZ-92055). The installation has five schools on the base, three of which fall under the Oceanside school district and two of which are managed by Fallbrook. (c) Fort Sam Houston Army base, located in San Antonio, Texas (MPZ-78234). This large installation has over 70,000 family members, 15,000 retirees, and trains more than 25,000 students each year. An independent school district is located on the base. (d) Quantico Marine base (MPZ-22134). See Figure 2 for more details. In each of panels a–d, the red line indicates data, the blue line indicates the model fit, and the green line shows the time evolution of . (e) Comparison of  and  for the top 50 installations. The solid line marks a slope of one, while the dashed circular curves mark boundaries at , 2, and 3, serving to separate the outliers from the main cluster. (f) Distribution of , the maximum of  or , and the inferred value of  during the pandemic. The basic reproduction number clusters around a median value of 1.39 (mean 1.57); however, there are some notable exceptions. A complete set of model parameters is provided in Table S3 and histograms of , ,  and  are shown in Figure S4. (g) The relationship between  and the model-determined time of initial infection, . A linear regression to all fits (left) shows a modest increase in  from the early summer to late fall. When the outliers (that is, ) are removed from the analysis, the general rise in  still persists.
Moreover, when only the top 30 bases are included in the analysis (red points), the trend persists.
Rapidly characterizing the amplitude and variability in transmissibility of novel human influenza strains as they emerge is a key public health priority. However, comparison of early estimates of the basic reproduction number during the 2009 pandemic was challenging because of inconsistent data sources and methods. Here, we define and analyze influenza-like-illness (ILI) case data from 2009-2010 for the 50 largest spatially distinct US military installations (military population defined by zip code, MPZ). We used publicly available data from non-military sources to show that patterns of ILI incidence in many of these MPZs closely followed the pattern of their enclosing civilian population. After characterizing the broad patterns of incidence (e.g. single-peak, double-peak), we defined a parsimonious SIR-like model with two possible values for intrinsic transmissibility across three epochs. We fitted the parameters of this model to data from all 50 MPZs, finding them to be reasonably well clustered with a median (mean) value of 1.39 (1.57) and standard deviation of 0.41. An increasing temporal trend in transmissibility ([Formula: see text], p-value: 0.013) during the period of our study was robust to the removal of high transmissibility outliers and to the removal of the smaller 20 MPZs. Our results demonstrate the utility of rapidly available, consistent data from multiple populations.
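The "two transmissibility values across three epochs" model described above can be sketched as a deterministic SIR system with a piecewise-constant transmission rate. The epoch boundaries and all parameter values below are illustrative assumptions, not the fitted MPZ values.

```python
# Sketch of an SIR model whose transmission rate beta switches between
# two values across three epochs (e.g. school terms). Illustrative
# parameters only; R0 in each epoch is beta / gamma.

def sir_epochs(beta_schedule, gamma=0.5, N=20000, I0=1, dt=0.1):
    """beta_schedule: list of (epoch_end_day, beta) pairs."""
    S, I, R = N - I0, float(I0), 0.0
    t, series = 0.0, []
    for t_end, beta in beta_schedule:
        while t < t_end:
            new_inf = beta * S * I / N * dt   # S -> I flow this step
            new_rec = gamma * I * dt          # I -> R flow this step
            S -= new_inf
            I += new_inf - new_rec
            R += new_rec
            t += dt
            series.append((t, I))
    return series

# Low transmissibility in summer, high in autumn, low again:
series = sir_epochs([(60, 0.55), (120, 0.9), (200, 0.55)])
peak_t = max(series, key=lambda ti: ti[1])[0]
print(round(peak_t))
```

With these numbers the epidemic smolders through the low-R0 first epoch and peaks shortly after transmissibility rises, reproducing the qualitative single-autumn-peak pattern.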
Influenza pandemics in the last century were characterized by successive waves and differences in impact and timing between different regions, for reasons not clearly understood. The 2009 H1N1 pandemic showed rapid global spread, but with substantial heterogeneity in timing within each hemisphere. Even within Europe substantial variation was observed, with the UK being unique in experiencing a major first wave of transmission in early summer and all other countries having a single major epidemic in the autumn/winter, with a West to East pattern of spread. Here we show that a microsimulation model, parameterised using data about H1N1pdm collected by the beginning of June 2009, explains the occurrence of two waves in the UK and a single wave in the rest of Europe as a consequence of the timing of H1N1pdm spread, travel fluxes from the US and Mexico, and the timing of school vacations. The model provides a description of pandemic spread through Europe, depending on intra-European mobility patterns and the socio-demographic structure of the European populations, which is in broad agreement with the observed timing of the pandemic in different countries. Attack rates are predicted to depend on the socio-demographic structure, with age-dependent attack rates broadly agreeing with available serological data. Results suggest that the observed heterogeneity can be partly explained by between-country differences in Europe: marked differences in school calendars, mobility patterns and sociodemographic structures. Moreover, higher susceptibility of children to infection played a key role in determining the epidemiology of the 2009 pandemic. Our work shows that it would have been possible to obtain a broad-brush prediction of the timing of the European pandemic well before the autumn of 2009, something much more difficult to achieve with simpler models or pre-pandemic parameterisation.
This supports the use of models accounting for the structure of complex modern societies to give insight to policy makers.
Summary of commented talks at ISMB/ECCB 2009.
(A) Keynotes were the most commented talks at ISMB/ECCB 2009. Here, keynotes are listed in chronological order, showing that the number of comments per keynote was higher at the end of the conference than at the beginning. Nonetheless, the absolute number of comments does not necessarily reflect the quality of the coverage. (B) Besides the keynotes, in the main session the highlights and proceedings track received the most attention. However, the total number of covered talks in the SIGs was higher than in any other session. For simplicity, talks from the special sessions and SIGs are summarized across all special sessions and all SIGs, as the commenting method for those sessions was not as uniform as for the main sessions. More detailed statistics are available at the Web sites in Box 1.
In last year's report on microblogging ISMB 2008 [1], the authors anticipated that new methods of using the Web and of reporting the conference would make live blogging even easier. This year, the ISMB/ECCB 2009 Web site contained all of the features of the mock-up, far more live bloggers participated than last year, and there was increased coverage of talks and special sessions. We, in turn, look forward to the new technologies appearing on the horizon (such as Google Wave), and to how both tools and bloggers will make next year's conference an even greater success. In summary, conference organizers found that microblogging added value for all conference attendees, and allowed attendees to follow the thoughts of others as well as to follow presentations that conflicted with others they wished to see. The usefulness of live blogging extends beyond the duration of the conference, as the coverage remains accessible long after the conference has closed.
The annual international conference on Intelligent Systems for Molecular Biology (ISMB) is the largest meeting of the International Society for Computational Biology (ISCB). In 2010 it was held in Boston, United States, July 11–13. What follows are four conference postcards that reflect different activities considered exciting and important by younger attendees. Postcards, as the name suggests, are brief reports on the talks and other events that interested attendees. You can read more about the idea of conference postcards, and if you are a graduate student or postdoctoral fellow, please consider contributing postcards at any future meetings of interest to the PLoS Computational Biology readership. We want to hear your view of the science being presented.
Each year the International Society for Computational Biology (ISCB) honors a young scientist who has already achieved a significant and lasting impact on our field. The ISCB Awards Committee, comprised of current and former directors of the society and chaired by Soren Brunak, director of the Center for Biological Sequence Analysis at the Technical University of Denmark, has announced that the recipient of the 2010 ISCB Overton Prize is Steven E. Brenner of the University of California, Berkeley, California, United States (Image 1). Image 1. Steven E. Brenner. Additionally, the 2010 ISCB Accomplishment by a Senior Scientist Award goes to Chris Sander of the Memorial Sloan-Kettering Cancer Center, New York, US. Each of these awards is recognized well beyond the borders of the discipline of bioinformatics or computational biology as honoring excellence in science. Brunak points out some interesting connections between this year's award winners: “Both Sander and Brenner started out in structural bioinformatics and made distinguished contributions to the analysis of protein structure before moving into genomics-based research and a more translational approach to bioinformatics.” Both honorees will be presented with their awards at the ISCB's 18th annual international conference, Intelligent Systems for Molecular Biology (ISMB), where Brenner will give the opening keynote lecture and Sander will speak on the second day. ISMB 2010 will take place in Boston, Massachusetts, US, July 11–13, 2010. This article features Steven E. Brenner as recipient of the Overton Prize; a future article will highlight Chris Sander's accomplishments that have earned him ISCB's Senior Scientist Award.
Each year, the International Society for Computational Biology (ISCB) makes awards for exceptional achievement to two scientists. The first is presented to a scientist who has made distinguished contributions over many years in research, teaching, service, or any combination of the three. This year, the ISCB Accomplishment by a Senior Scientist Award goes to Michael Ashburner in the department of genetics at the University of Cambridge. The second, known as the Overton Prize, honours a young scientist in the early to mid-stage of his or her career who has already achieved significant and lasting impact in the field of computational biology. In 2011, the Overton Prize is awarded to Olga Troyanskaya of Princeton University in New Jersey. The recipients were chosen by the ISCB's awards committee chaired by Alfonso Valencia at the CNIO (Spanish National Cancer Research Centre) in Madrid. The winners will receive their awards at the ISCB's annual meeting, where they will also deliver keynote talks. This meeting, ISMB/ECCB 2011, will take place in Vienna, Austria, 17–19 July 2011.
Image 2. Ziv Bar-Joseph of Carnegie Mellon University.
Photo courtesy of Carnegie Mellon University.
Image 1. Gunnar von Heijne of Stockholm University.
Photo by Max Brouwers.
Each year, the International Society for Computational Biology (ISCB) makes awards for exceptional achievement to two scientists. The ISCB Accomplishment by a Senior Scientist Award honours career achievement in recognition of distinguished contributions over many years in research, teaching, service, or any combination of the three. In 2012 this award is going to Gunnar von Heijne of Stockholm University in Sweden. The Overton Prize recognizes a young scientist in the early to mid-stage of his or her career who has already achieved a significant and lasting impact in the field of computational biology. In 2012, the Overton Prize is being awarded to Ziv Bar-Joseph of Carnegie Mellon University in Pittsburgh, Pennsylvania, United States. The recipients were chosen by the ISCB's awards committee chaired by Alfonso Valencia at the CNIO (Spanish National Cancer Research Centre) in Madrid. The winners will receive their awards at the ISCB's annual Intelligent Systems for Molecular Biology (ISMB) meeting, where they will deliver keynote talks. ISMB 2012 marks the 20th anniversary of the conference, and will take place July 15–17 in Long Beach, California, United States.
Combining genome-scale predictive strategies to predict and prioritize candidate microRNAs in HNSCC. (A) Enriched gene targets of 46 microRNAs among inheritable cancer genes in OMIM significantly overlap with 34 predictions of deregulated microRNAs based on HNSCC expression arrays (GSE6631, GSE2379; Figure S3; Table S2 and Table S3), yielding ten prioritized microRNAs (P = 2.33×10⁻⁴). P: cumulative hypergeometric statistics. 1: miR-204 and let-7g are located in chromosomal regions with known increased genetic risk of HNSCC (9q21.1–22.1 for
miR-204 gene targets exhibit significant topological properties in a predicted protein interaction network of HNSCC based on single protein network modeling. (A–B) A 56-gene "prioritized HNSCC PPIN" was predicted from single protein network modeling and was significantly enriched with bottleneck (P = 7.3×10⁻⁷), hub (P = 8.7×10⁻⁸) and hub-bottleneck genes (P = 1.6×10⁻⁸). P-values were calculated using one-tailed cumulative hypergeometric tests. Genes colored in red: miR-204 gene targets. (C) Gene Ontology enrichment analysis of the "biological processes" (BP) and "molecular functions" (MF) identified two EGFR-dependent sub-networks in the "prioritized HNSCC PPIN" (adjusted P<0.05). Different BPs and MFs are color-coded as indicated. Every gene analyzed in the network is represented as a circle; the majority do not reach statistical significance and remain as unnamed grey dots at the bottom of the figure (statistical details and names are provided in Table S11, and their interactions in Table S12). doi:10.1371/journal.pcbi.1000730.g003
miR-204 suppressed HNSCC cell migration, adhesion and invasion in vitro and lung colonization in vivo. (A–C) Ectopic enhancement of miR-204 function inhibited JSQ3 and SQ38 adhesion to laminin I or basement membrane complex (BMC) (A), migration through the porous membrane in Transwell (B), and invasion through Matrigel (C). Triplicate repeats were conducted at each experimental point for A (Methods). For
Expression pattern of miR-204 targets identified a subtype of HNSCC tumors exhibiting an EGFR-pathway signature, and miR-204 was deregulated in other squamous and epithelial tumors. miR-204 functional targets classified 60 HNSCC tumors in the GSE686 microarray dataset [41] based on their intrinsic properties (Methods). P-values were obtained using a Fisher's exact test; *: censored data. doi:10.1371/journal.pcbi.1000730.g005
Due to the large number of putative microRNA gene targets predicted by sequence-alignment databases, and the relatively low accuracy of such predictions, which by design are made independently of biological context, systematic experimental identification and validation of every functional microRNA target is currently challenging. Consequently, biological studies have yet to identify, on a genome scale, key regulatory networks perturbed by altered microRNA functions in the context of cancer. In this report, we demonstrate for the first time how phenotypic knowledge of inheritable cancer traits and of risk factor loci can be utilized jointly with gene expression analysis to efficiently prioritize deregulated microRNAs for biological characterization. Using this approach we characterize miR-204 as a tumor suppressor microRNA and uncover previously unknown connections between microRNA regulation, network topology, and expression dynamics. Specifically, we validate 18 gene targets of miR-204 that show elevated mRNA expression and are enriched in biological processes associated with tumor progression in squamous cell carcinoma of the head and neck (HNSCC). We further demonstrate the enrichment of bottleneckness, a key molecular network topology, among miR-204 gene targets. Restoration of miR-204 function in HNSCC cell lines inhibits the expression of its functionally related gene targets, leads to reduced adhesion, migration and invasion in vitro, and attenuates experimental lung metastasis in vivo. As importantly, our investigation also provides experimental evidence linking the function of microRNAs that are located in the cancer-associated genomic regions (CAGRs) to the observed predisposition to human cancers. Specifically, we show miR-204 may serve as a tumor suppressor gene at the 9q21.1-22.3 CAGR locus, a well-established risk factor locus in head and neck cancers for which tumor suppressor genes have not been identified.
This new strategy, which integrates expression profiling, genetics and novel computational biology approaches, improves the efficiency of characterizing and modeling microRNA functions in cancer compared to the state of the art, and is applicable to the investigation of microRNA functions in other biological processes and diseases.
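The enrichment P-values used throughout this study (for example, the overlap between OMIM-derived and expression-derived candidate microRNAs) are one-tailed cumulative hypergeometric probabilities, which can be computed directly. The counts below are illustrative, not the study's.

```python
from math import comb

# One-tailed cumulative hypergeometric test: probability of drawing at
# least k annotated genes when sampling n genes from a universe of N
# genes, of which K are annotated. Counts below are made-up examples.

def hypergeom_p(N, K, n, k):
    """P(X >= k) for X ~ Hypergeometric(N, K, n)."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)

# Example: overlap of 10 out of 34 picks with 46 flagged items in a
# universe of 500 (expected overlap ~3.1), so the tail is small.
p = hypergeom_p(N=500, K=46, n=34, k=10)
print(p)
```

A small P indicates more overlap than expected by chance; exact binomial coefficients avoid the numerical issues of normal approximations at these small counts.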
Author Summary MHC class I molecules present antigenic peptides derived from endogenously expressed foreign or aberrant protein molecules to the outside world so that they can be specifically recognised by cytotoxic T lymphocytes (CTLs) at the cell surface. Responsible for the generation of these peptides is the 20S proteasome, the major proteolytic enzyme of the cell. These peptides were until now believed to exhibit a linear sequence identical to that found in the unprocessed parental protein. Using patient-derived CTLs, it was previously shown that by proteasome-catalyzed peptide splicing, i.e., the fusion of two proteasome-generated peptide fragments in a reversed proteolysis reaction, novel spliced antigenic peptides can be generated. To resolve the CTL dependence of spliced-peptide identification, we performed experiments that combined mass spectrometric analysis of proteasome-generated peptides with a computer-based algorithm that predicts the masses of all theoretically possible spliced peptides from a given substrate molecule (SpliceMet). Using this unrestricted approach, we identified several new spliced peptides, some of which were derived from two distinct substrate molecules. Our data reveal that peptide splicing is an intrinsic additional catalytic property of the proteasome, which may provide a qualitatively new peptide pool for immune selection.
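The combinatorial core of a SpliceMet-like predictor can be sketched by enumerating the masses of all peptides formed by ligating two non-overlapping fragments of one substrate; the trans-splicing case (fragments from two distinct substrates) is omitted here. The substrate sequence and the small residue-mass table are illustrative, not data from the study.

```python
# Sketch: enumerate masses of all cis-spliced peptides from a single
# substrate, so observed MS masses could be matched against the list.
# Monoisotopic residue masses; the substrate is a made-up example.

RESIDUE = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
           "V": 99.06841, "K": 128.09496, "L": 113.08406, "E": 129.04259}
WATER = 18.01056  # a free peptide carries one extra H2O

def peptide_mass(seq):
    return sum(RESIDUE[a] for a in seq) + WATER

def spliced_masses(substrate, max_len=12):
    """Map each splice product (fragment1 + fragment2, either order,
    non-overlapping substrings) to its monoisotopic mass."""
    n = len(substrate)
    frags = [(i, j, substrate[i:j])
             for i in range(n) for j in range(i + 1, n + 1)]
    out = {}
    for i1, j1, f1 in frags:
        for i2, j2, f2 in frags:
            if j1 <= i2 or j2 <= i1:          # fragments must not overlap
                sp = f1 + f2
                if len(sp) <= max_len:
                    out[sp] = peptide_mass(sp)
    return out

masses = spliced_masses("GASPVKLE")
print(len(masses), round(masses["GAKL"], 3))
```

Because both splice orders are allowed, the list also covers products in which the C-terminal fragment precedes the N-terminal one, one of the surprising possibilities such unrestricted searches leave open.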
Scheme of the systemic IL-21 mathematical model.
A model of IL-21 PD effects on immune regulation of tumor growth [39] is combined with a new IL-21 PK model based on data in mice [41]. Under SC/IP administration, IL-21 is introduced at site (A) and is transported through 3 compartments to the plasma (P). Under IV administration, the drug is injected directly into the plasma. The drug is degraded via 3 additional compartments. IL-21 concentrations in the target tissue (T) are correlated with the plasma levels by parameter s. In the target site, IL-21 inhibits NK survival and promotes CTL expansion, while enhancing CPs of both cells and facilitating their tumor cell targeting. Abbreviations: PK- pharmacokinetics; PD- pharmacodynamics; IV- intravenous; SC- subcutaneous; IP-intraperitoneal; NK- natural killer cell; CTL- cytotoxic T lymphocyte; CPs- cytotoxic proteins.
Estimation and sensitivity analysis of model parameter s.
Curve-fits produced during estimation of parameter s, using experimental training data of B16 dynamics under an early-onset (day 3) IL-21 treatment (50 µg/day) [41]. The parameter was evaluated per route of administration (see Table S1 in Text S1), and model-data approximation is indicated for both SC and IP treatment (“Model fit”). Model predictions under 2-fold increased or decreased s values (“Model prediction s×2” and “Model prediction s/2”, respectively) are plotted against these experimental data (Exp). Simulations (lines) are shown with respect to data (circles), given as means±SEM.
IL-21-induced antitumor effects: model simulations retrospectively verified in experimental murine tumors.
Model predictions (lines) retrieve experimental validation data (circles, triangles) of tumor dynamics from a preclinical study [41], where (A) B16-bearing mice were treated by a 50 µg/day IL-21 treatment applied SC or IP, starting on day 8 after tumor inoculation; (B) RenCa-challenged mice were treated by IL-21, 50 µg 3×/week, SC or IP, commencing either early (day 7) or late (day 12) after tumor inoculation; (C) RenCa-bearing mice were SC-administered various IL-21 doses between 1–20 µg (3×/week), or given a 30 µg (1×/week) regimen. Data are given as means±SEM.
Model-improved IL-21 therapies with modified onset and fractionation.
(A) Predicted outcomes (final B16 volumes; squares) of 20-day regimens (50 µg/day given SC) initiated on different days. (B) Predicted outcomes of regimens with the same total IL-21 dose (800 µg/treatment given SC) yet with different fractionations (i.e. number of injections, inter-dosing intervals and dose intensities). (C) Prospective validation of the model predictions (lines) in B16-bearing mice treated by a standard (std) 50 µg/day regimen vs. a fractionated (frac) 25 µg/twice daily schedule, both administered SC between days 3–20 (data in circles). Tumor growth in PBS controls is indicated as well. Means±SEM of data are given (n = 10; *p<0.001 for 25 µg-treated mice vs. PBS-treated mice, and for 50 µg-treated mice vs. PBS-treated mice; **p<0.05 for 25 µg-treated mice vs. 50 µg-treated mice).
Model-improved alternative-dosing IL-21 regimens.
(A) Predicted outcomes (final B16 volumes) of various 20-day treatments (initialized at day 3, and given SC), where different daily doses are applied (squares). (B) Experimental B16 dynamics following prospective treatments under the standard (std) 50 µg/day regimen, or under a model-based reduced-dosing (low) schedule (12 µg/day). Data (circles) are shown vs. model simulations (lines). Tumor growth in PBS controls appears as well. Means±SEM of data are indicated (n = 10; *p<0.05 for 12 µg-treated mice vs. PBS-treated mice; **p<0.001 for 50 µg-treated mice vs. PBS-treated mice; ns-not significant for 12 µg-treated mice vs. 50 µg-treated mice).
Interleukin (IL)-21 is an attractive antitumor agent with potent immunomodulatory functions. Yet thus far, the cytokine has yielded only partial responses in solid cancer patients, and conditions for beneficial IL-21 immunotherapy remain elusive. The current work aims to identify clinically-relevant IL-21 regimens with enhanced efficacy, based on mathematical modeling of long-term antitumor responses. For this purpose, pharmacokinetic (PK) and pharmacodynamic (PD) data were acquired from a preclinical study applying systemic IL-21 therapy in murine solid cancers. We developed an integrated disease/PK/PD model for the IL-21 anticancer response, and calibrated it using selected "training" data. The accuracy of the model was verified retrospectively under diverse IL-21 treatment settings, by comparing its predictions to independent "validation" data in melanoma and renal cell carcinoma-challenged mice (R²>0.90). Simulations of the verified model surfaced important therapeutic insights: (1) Fractionating the standard daily regimen (50 µg/dose) into a twice daily schedule (25 µg/dose) is advantageous, yielding a significantly lower tumor mass (45% decrease); (2) A low-dose (12 µg/day) regimen exerts a response similar to that obtained under the 50 µg/day treatment, suggestive of an equally efficacious dose with potentially reduced toxicity. Subsequent experiments in melanoma-bearing mice corroborated both of these predictions with high precision (R²>0.89), thus validating the model also prospectively in vivo. Thus, the confirmed PK/PD model rationalizes IL-21 therapy, and pinpoints improved clinically-feasible treatment schedules. Our analysis demonstrates the value of employing mathematical modeling and in silico-guided design of solid tumor immunotherapy in the clinic.
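The PK part of such an integrated model (drug moving from the SC/IP injection site through transit compartments to plasma, with first-order elimination, as in the scheme of the figure above) can be sketched as a linear compartment chain. All rate constants below are illustrative assumptions, not the fitted IL-21 values.

```python
# Sketch of a linear-chain absorption PK model: a single SC/IP dose at
# the injection site passes through transit compartments into plasma,
# which eliminates drug at first order. Illustrative rate constants.

def pk_chain(dose, ka, ke, n_transit=3, dt=0.01, t_end=24.0):
    """Return (time, plasma amount) samples for a single dose at t=0."""
    comps = [dose] + [0.0] * n_transit + [0.0]   # site, transits, plasma
    series, t = [], 0.0
    while t < t_end:
        flows = [ka * c for c in comps[:-1]]     # first-order transfers
        for i, f in enumerate(flows):
            comps[i] -= f * dt
            comps[i + 1] += f * dt
        comps[-1] -= ke * comps[-1] * dt         # elimination from plasma
        t += dt
        series.append((t, comps[-1]))
    return series

series = pk_chain(dose=50.0, ka=1.0, ke=0.4)
t_peak = max(series, key=lambda tc: tc[1])[0]
print(round(t_peak, 2))
```

The transit chain produces the delayed, rounded plasma peak characteristic of extravascular dosing; an IV dose would instead be placed directly into the last compartment, as in the model scheme.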
An important determinant of a pathogen's success is the rate at which it is transmitted from infected to susceptible hosts. Although there are anecdotal reports that methicillin-resistant Staphylococcus aureus (MRSA) clones vary in their transmissibility in hospital settings, attempts to quantify such variation are lacking for common subtypes, as are methods for addressing this question using routinely-collected MRSA screening data in endemic settings. Here we present a method to quantify the time-varying transmissibility of different subtypes of common bacterial nosocomial pathogens using routine surveillance data. The method adapts approaches for estimating reproduction numbers based on the probabilistic reconstruction of epidemic trees, but uses relative hazards rather than serial intervals to assign probabilities to different sources for observed transmission events. The method is applied to data collected as part of a retrospective observational study of a concurrent MRSA outbreak in the United Kingdom with dominant endemic MRSA clones (ST22 and ST36) and an Asian ST239 MRSA strain (ST239-TW) in two linked adult intensive care units, and compared with an approach based on a fully parametric transmission model. The results provide support for the hypothesis that the clones responded differently to an infection control measure based on the use of topical antiseptics, which was more effective at reducing transmission of endemic clones. They also suggest that in one of the two ICUs patients colonized or infected with the ST239-TW MRSA clone had consistently higher risks of transmitting MRSA to patients free of MRSA. These findings represent some of the first quantitative evidence of enhanced transmissibility of a pandemic MRSA lineage, and highlight the potential value of tailoring hospital infection control measures to specific pathogen subtypes.
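The tree-reconstruction idea can be sketched as follows: each acquisition is probabilistically attributed to candidate sources in proportion to their relative hazards, and a carrier's contribution to the reproduction number is the expected number of acquisitions attributed to it. The hazard used below (days colonized before the event) and the five-patient dataset are deliberately simplistic stand-ins for the model-based relative hazards and surveillance data used in the paper.

```python
from collections import defaultdict

# Toy source attribution for nosocomial transmission events.
# Each record: (case_id, subtype, admit_day, acquisition_day or None
# if the patient was already positive on admission).
cases = [
    ("p1", "ST239", 0, None), ("p2", "ST22", 0, None),
    ("p3", "ST239", 1, 5), ("p4", "ST239", 2, 7), ("p5", "ST22", 3, 8),
]

def source_probs(case, cases):
    """P(source) proportional to a crude hazard: days the candidate was
    positive (same subtype) before the acquisition event."""
    cid0, subtype, _, t_acq = case
    hazards = {}
    for cid, stype, admit, acq in cases:
        if stype != subtype or cid == cid0:
            continue
        pos_from = admit if acq is None else acq
        if pos_from < t_acq:
            hazards[cid] = t_acq - pos_from
    total = sum(hazards.values())
    return {cid: h / total for cid, h in hazards.items()} if total else {}

# Expected secondary cases credited to each carrier:
R = defaultdict(float)
for case in cases:
    if case[3] is None:
        continue                      # importations have no in-unit source
    for cid, p in source_probs(case, cases).items():
        R[cid] += p
print(dict(R))
```

Averaging these per-carrier expectations within each subtype (and within time windows around an intervention) yields the subtype-specific, time-varying reproduction-number comparisons the study is after.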
Signaling pathways mediate the effect of external stimuli on gene expression in cells. The signaling proteins in these pathways interact with each other, and their phosphorylation levels often serve as indicators of signaling pathway activity. Several signaling pathways have been identified in mammalian cells, but the crosstalk between them is not well understood. The Alliance for Cellular Signaling (AfCS) has measured time-course data in RAW 264.7 macrophage cells on important phosphoproteins, such as the mitogen-activated protein kinases (MAPKs) and signal transducers and activators of transcription (STATs), in single- and double-ligand stimulation experiments for 22 ligands. In the present work, we have used a data-driven approach to analyze the AfCS data to decipher the interactions and crosstalk between signaling pathways in stimulated macrophage cells. We have used dynamic mapping to develop a predictive model using a partial least squares approach. Significant interactions were selected through statistical hypothesis testing and were used to reconstruct the phosphoprotein signaling network. The proposed data-driven approach identifies most of the known signaling interactions, such as protein kinase B (Akt) --> glycogen synthase kinase 3alpha/beta (GSK3alpha/beta), and predicts potential novel interactions such as P38 --> RSK and GSK --> ezrin/radixin/moesin. We have also shown that the model has good predictive power for extrapolation. Our novel approach captures the temporal causality and directionality in intracellular signaling pathways. Further, case-specific analysis of the phosphoproteins in the network has led us to propose a hypothesis about inhibitory phosphorylation of GSK3alpha/beta via P38.
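The partial least squares mapping at the heart of such data-driven approaches can be illustrated with a single NIPALS component, which iteratively finds the predictor direction whose scores best covary with the responses. This is a generic PLS sketch under my own variable names, not the authors' implementation:

```python
import numpy as np

def pls1_component(X, Y, n_iter=100, tol=1e-10):
    """One NIPALS PLS component: alternate between X-weights w and
    Y-loadings q so that the scores t = Xw and u = Yq have maximal
    covariance. Returns (w, t, q)."""
    u = Y[:, 0].copy()                   # initialize with first response
    for _ in range(n_iter):
        w = X.T @ u
        w = w / np.linalg.norm(w)        # unit X-weight vector
        t = X @ w                        # X scores
        q = Y.T @ t
        q = q / np.linalg.norm(q)        # unit Y-loading vector
        u_new = Y @ q                    # Y scores
        if np.linalg.norm(u_new - u) < tol * np.linalg.norm(u_new):
            u = u_new
            break
        u = u_new
    return w, t, q

# Synthetic check: both responses depend (noisily) on the first predictor only,
# so the recovered weight vector should point along the first axis.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
Y = np.column_stack([X[:, 0], X[:, 0]]) + 0.1 * rng.standard_normal((50, 2))
w, t, q = pls1_component(X, Y)
```

In a network-reconstruction setting, the magnitudes of the fitted weights (filtered by hypothesis testing) indicate which upstream phosphoproteins plausibly drive each downstream one.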
Author Summary Group B Streptococcus (GBS) is the leading cause of neonatal invasive diseases, and pili, the long filamentous fibers protruding from the bacterial surface, have been identified as important virulence factors and potential vaccine candidates. The bacterial surface is the main interface between host and pathogen, and the host's ability to identify molecular determinants that are unique to pathogens plays a crucial role in microbial clearance. Here, we describe a strategy to investigate the immunological and structural properties of a protective pilus protein by elucidating the molecular mechanisms, in terms of single-residue contributions, by which functional epitopes guide bacterial clearance. We generated neutralizing monoclonal antibodies raised against the protein and identified the epitope region in the antigen. Then, we performed computational docking analysis of the antibodies in complex with the target antigen and identified specific residues on the target protein that mediate hydrophobic interactions at the binding interface. Our results suggest that a precise balance of shape and charge at the binding interface is crucial for an antibody/antigen complex to drive a successful neutralizing response. Knowing the native molecular architecture of protective determinants might be useful for selectively engineering antigens for effective vaccine formulations.
The serotonin 2C receptor (5-HT(2C)R), a key regulator of diverse neurological processes, exhibits functional variability derived from editing of its pre-mRNA by site-specific adenosine deamination (A-to-I pre-mRNA editing) at five distinct sites. Here we describe a statistical technique that was developed for analysis of the dependencies among the editing states of the five sites. The statistical significance of the observed correlations was estimated by comparing editing patterns in multiple individuals. For both human and rat 5-HT(2C)R, the editing states of the physically proximal sites A and B were found to be strongly dependent. In contrast, the editing states of sites C and D, which are also physically close, seem not to be directly dependent but instead are linked through the dependencies on sites A and B, respectively. We observed pronounced differences between the editing patterns in humans and rats: in humans site A is the key determinant of the editing state of the other sites, whereas in rats this role belongs to site B. The structure of the dependencies among the editing sites is notably simpler in rats than it is in humans, implying more complex regulation of 5-HT(2C)R editing and, by inference, function in the human brain. Thus, exhaustive statistical analysis of the 5-HT(2C)R editing patterns indicates that the editing state of sites A and B is the primary determinant of the editing states of the other three sites, and hence the overall editing pattern. Taken together, these findings allow us to propose a mechanistic model of concerted action of ADAR1 and ADAR2 in 5-HT(2C)R editing. The statistical approach developed here can be applied to other cases of interdependencies among modification sites in RNA and proteins.
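The basic ingredient of such a dependency analysis — testing whether the binary editing states of two sites co-occur more often than expected by chance — can be sketched as a permutation test. This is a generic illustration of the idea, not the authors' exact statistical technique; the `editing_dependence` function and the toy data are my own assumptions:

```python
import numpy as np

def editing_dependence(a, b, n_perm=2000, seed=0):
    """Permutation test for dependence between the binary editing states
    of two sites across sequenced transcripts: compare the observed
    co-editing count against a null distribution obtained by shuffling
    one site's states. Returns (observed count, two-sided p-value)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=int)
    b = np.asarray(b, dtype=int)
    obs = int(np.sum(a & b))
    null = np.array([np.sum(a & rng.permutation(b)) for _ in range(n_perm)])
    # Two-sided empirical p-value with add-one correction.
    dev = np.abs(null - null.mean())
    p = (np.sum(dev >= abs(obs - null.mean())) + 1) / (n_perm + 1)
    return obs, p

# Illustrative transcripts in which sites A and B are always co-edited.
a = np.array([1] * 20 + [0] * 20)
obs, p = editing_dependence(a, a)
```

Extending this pairwise test to conditional dependencies (e.g., whether C and D remain associated after conditioning on A and B) is what distinguishes direct from indirect links in the abstract above.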
The microsomal, membrane-bound human cytochrome P450 (CYP) 2C9 is a liver-specific monooxygenase essential for drug metabolism. CYPs require electron transfer from the membrane-bound CYP reductase (CPR) for catalysis. The structural details and functional relevance of the CYP-membrane interaction are not understood. From multiple coarse-grained molecular simulations started with arbitrary configurations of protein-membrane complexes, we found two predominant orientations of CYP2C9 in the membrane, both consistent with experiments and conserved in atomic-resolution simulations. The dynamics of membrane-bound and soluble CYP2C9 revealed correlations between the opening and closing of different tunnels from the enzyme's buried active site. The membrane facilitated the opening of a tunnel leading into it by stabilizing the open state of an internal aromatic gate. Other tunnels opened selectively in the simulations of product-bound CYP2C9. We propose that the membrane promotes binding of liposoluble substrates by stabilizing protein conformations with an open access tunnel, and provide evidence for selective substrate access and product release routes in mammalian CYPs. The models derived here are suitable for extension to incorporate other CYPs for oligomerization studies or the CYP reductase for studies of the electron transfer mechanism, whereas the modeling procedure is generally applicable to the study of proteins anchored in the bilayer by a single transmembrane helix.
Two-dimensional polyacrylamide gel electrophoresis (2D PAGE) is used to identify differentially expressed proteins and may be applied to biomarker discovery. A limitation of this approach is the inability to detect a protein when its concentration falls below the limit of detection. Consequently, differential expression of proteins may be missed when the level of a protein in the cases or controls is below the limit of detection for 2D PAGE. Standard statistical techniques have difficulty dealing with undetected proteins. To address this issue, we propose a mixture model that takes into account both detected and non-detected proteins. Non-detected proteins are classified either as (a) proteins that are not expressed in at least one replicate, or (b) proteins that are expressed but are below the limit of detection. We obtain maximum likelihood estimates of the parameters of the mixture model, including the group-specific probability of expression and mean expression intensities. Differentially expressed proteins can be detected by using a likelihood ratio test (LRT). Our simulation results, using data generated from biological experiments, show that the likelihood model has higher statistical power than standard statistical approaches to detect differentially expressed proteins. An R package, Slider (Statistical Likelihood model for Identifying Differential Expression in R), is freely available at
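The likelihood described above can be written down directly from the two-part classification of non-detected proteins: a non-detection contributes a mixed term (not expressed, or expressed but censored below the detection limit), while a detected spot contributes a normal density scaled by the expression probability. A toy maximum-likelihood fit by grid search, assuming a known intensity standard deviation — the `fit` helper, grids, and example data are my own illustrative assumptions, not the Slider package:

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def log_lik(p, mu, sigma, detected, n_undetected, lod):
    """Log-likelihood of the two-part model: each protein is expressed
    with probability p (intensity ~ Normal(mu, sigma)); a non-detection
    is either non-expression or an intensity censored below lod."""
    below = norm_cdf((lod - mu) / sigma)
    ll = n_undetected * math.log((1.0 - p) + p * below)
    for x in detected:
        dens = math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
        ll += math.log(p * dens)
    return ll

def fit(detected, n_undetected, lod, sigma=1.0):
    """Crude grid-search MLE over (p, mu) with sigma held fixed."""
    best = None
    for p in (i / 100 for i in range(1, 100)):
        for j in range(-20, 41):
            mu = lod + j / 10
            ll = log_lik(p, mu, sigma, detected, n_undetected, lod)
            if best is None or ll > best[0]:
                best = (ll, p, mu)
    return best[1], best[2]

# 8 detected intensities above the detection limit, 2 non-detections.
detected = [2.5, 3.0, 3.2, 2.8, 3.5, 3.1, 2.9, 3.3]
p_hat, mu_hat = fit(detected, n_undetected=2, lod=1.0)
```

The LRT for differential expression then compares the maximized log-likelihood of a model with group-specific (p, mu) against one with shared parameters.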
Top-cited authors
Alexei J Drummond
  • University of Auckland
Denise Kühnert
  • University of Zurich
Timothy Glenn Vaughan
  • University of Auckland
Dong Xie
  • University of Auckland
Giulio Tononi
  • University of Wisconsin–Madison