Bioinformatics (BIOINFORMATICS)

Publisher: Oxford University Press (OUP)

Journal description

The journal aims to publish high-quality, peer-reviewed original scientific papers and excellent review articles in the fields of computational molecular biology, biological databases and genome bioinformatics.

Current impact factor: 4.62

Impact Factor Rankings

2015 Impact Factor Available summer 2015
2013 / 2014 Impact Factor 4.621
2012 Impact Factor 5.323
2011 Impact Factor 5.468
2010 Impact Factor 4.877
2009 Impact Factor 4.926
2008 Impact Factor 4.328
2007 Impact Factor 5.039
2006 Impact Factor 4.894
2005 Impact Factor 6.019
2004 Impact Factor 5.742
2003 Impact Factor 6.701
2002 Impact Factor 4.615
2001 Impact Factor 3.421
2000 Impact Factor 3.409
1999 Impact Factor 2.259

Additional details

5-year impact 6.05
Cited half-life 6.20
Immediacy index 0.67
Eigenfactor 0.16
Article influence 2.61
Website Bioinformatics website
Other titles Bioinformatics (Oxford, England: Online)
ISSN 1367-4811
OCLC 39184474
Material type Document, Periodical, Internet resource
Document type Internet Resource, Computer File, Journal / Magazine / Newspaper

Publisher details

Oxford University Press (OUP)

  • Pre-print
    • Author can archive a pre-print version
  • Post-print
    • Author cannot archive a post-print version
  • Restrictions
    • 12 months embargo
  • Conditions
    • Pre-print can only be posted prior to acceptance
    • Pre-print must be accompanied by set statement (see link)
    • Pre-print must not be replaced with post-print, instead a link to published version with amended set statement should be made
    • Pre-print on author's personal website, employer website, free public server or pre-prints in subject area
    • Post-print in Institutional repositories or Central repositories
    • Publisher's version/PDF cannot be used
    • Published source must be acknowledged
    • Must link to publisher version
    • Set phrase to accompany archived copy (see policy)
    • Eligible authors may deposit in OpenDepot
    • The publisher will deposit in PubMed Central on behalf of NIH authors
    • Publisher last contacted on 19/02/2015
    • This policy is an exception to the default policies of 'Oxford University Press (OUP)'
  • Classification
    • yellow

Publications in this journal

  • ABSTRACT: Computational approaches that can predict protein functions are essential to bridge the widening function annotation gap, especially since <1.0% of all proteins in UniProtKB have been experimentally characterised. We present a domain-based method for protein function classification and prediction of functional sites that exploits functional subclassification of CATH superfamilies. The superfamilies are subclassified into functional families (FunFams) using a hierarchical clustering algorithm supervised by a new classification method, FunFHMMer. FunFHMMer generates more functionally coherent groupings of protein sequences than other domain-based protein classifications, as validated using known functional information. The conserved positions predicted by the FunFams are also enriched in known functional residues, and the functional annotations provided by the FunFams are more precise than those of other domain-based resources. FunFHMMer currently identifies 110,439 FunFams in 2735 superfamilies, which can be used to functionally annotate >16 million domain sequences. All FunFam annotation data are made available through the CATH webpages ( The FunFHMMer webserver ( allows users to submit query sequences for assignment to a CATH FunFam. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. © The Author(s) 2015. Published by Oxford University Press.
    Bioinformatics 07/2015; DOI:10.1093/bioinformatics/btv398
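The FunFams approach above leans on identifying conserved alignment positions. As a rough illustration of that general idea (not FunFHMMer's actual scoring; the function name and toy alignment are invented for this sketch), a column can be scored as one minus its normalised Shannon entropy, so fully conserved columns score 1.0:

```python
import math

def column_conservation(alignment):
    """Score each column of a protein alignment by 1 - normalised
    Shannon entropy; values near 1 indicate conserved positions."""
    n_cols = len(alignment[0])
    scores = []
    for i in range(n_cols):
        column = [seq[i] for seq in alignment if seq[i] != '-']
        counts = {}
        for aa in column:
            counts[aa] = counts.get(aa, 0) + 1
        total = len(column)
        entropy = -sum((c / total) * math.log2(c / total)
                       for c in counts.values())
        max_entropy = math.log2(20)  # 20 amino-acid alphabet
        scores.append(1.0 - entropy / max_entropy)
    return scores

# Toy 3-sequence alignment: columns 0 and 1 are fully conserved.
scores = column_conservation(["ACDE", "ACDF", "ACEG"])
```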
  • ABSTRACT: Post-translational modification by the Small Ubiquitin-like Modifier (SUMO) proteins, a process termed SUMOylation, is involved in many fundamental cellular processes. SUMO proteins are conjugated to a protein substrate, creating an interface for the recruitment of cofactors harboring SUMO-interacting motifs (SIMs). Mapping both SUMO-conjugation sites and SIMs is required to study the functional consequences of SUMOylation. To define the best candidate sites for experimental validation we designed JASSA, a Joint Analyzer of SUMOylation sites and SIMs. JASSA is a predictor that uses a scoring system based on a Position Frequency Matrix derived from the alignment of experimental SUMOylation sites or SIMs. Compared to existing web tools, JASSA performs on par or better. Novel features were implemented towards a better evaluation of the prediction, including identification of database hits matching the query sequence and representation of candidate sites within the secondary structural elements and/or the 3D fold of the protein of interest, retrievable from deposited PDB files. JASSA is freely accessible at {{}}. The website is implemented in PHP and MySQL, with all major browsers supported. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. © The Author (2015). Published by Oxford University Press. All rights reserved. For Permissions, please email:
    Bioinformatics 07/2015; DOI:10.1093/bioinformatics/btv403
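Position Frequency Matrix scoring of the kind JASSA describes can be sketched in a few lines. The matrix below is purely illustrative (loosely shaped like the psi-K-x-E SUMOylation consensus, not JASSA's trained frequencies), and the function name is an invention for this example:

```python
def score_with_pfm(seq, pfm):
    """Slide a position-frequency matrix over `seq` and return the
    best-scoring window as (start, score); the score is the sum of
    per-position residue frequencies."""
    width = len(pfm)
    best = (None, float('-inf'))
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        s = sum(pfm[i].get(aa, 0.0) for i, aa in enumerate(window))
        if s > best[1]:
            best = (start, s)
    return best

# Toy 4-column PFM: hydrophobic, the modified lysine, weakly
# constrained, acidic. Frequencies are made up for illustration.
pfm = [
    {'V': 0.4, 'I': 0.3, 'L': 0.3},
    {'K': 1.0},
    {'S': 0.2, 'T': 0.2, 'A': 0.2},
    {'E': 0.8, 'D': 0.2},
]
start, score = score_with_pfm("MAVKSEQ", pfm)
```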
  • ABSTRACT: Gene transcription is mostly conducted through interactions of various transcription factors and their binding sites on DNA (regulatory elements, REs). Today, we are still far from understanding the real regulatory content of promoter regions. Computer methods for identification of REs remain a widely used tool for studying and understanding transcriptional regulation mechanisms. The NSITE, NSITEH and NSITEM programs perform searches for statistically significant (non-random) motifs of known human, animal and plant one-box and composite REs in a single genomic sequence, in a pair of aligned homologous sequences and in a set of functionally related sequences, respectively. Pre-compiled executables built under commonly used operating systems are available for download by visiting and Supplementary data are available at Bioinformatics online. © The Author(s) 2015. Published by Oxford University Press.
    Bioinformatics 07/2015; DOI:10.1093/bioinformatics/btv404
  • ABSTRACT: Haplotype models enjoy a wide range of applications in population inference and disease gene discovery. The hidden Markov models (HMM) traditionally used for haplotypes are hindered by the dubious assumption that dependencies occur only between consecutive pairs of variants. In the current paper, we apply the multivariate Bernoulli (MVB) distribution to model haplotype data. The MVB distribution relies on interactions among all sets of variants, thus allowing for the detection and exploitation of long-range and higher-order interactions. We discuss penalized estimation and present an efficient algorithm for fitting sparse versions of the MVB distribution to haplotype data. Finally, we showcase the benefits of the MVB model in predicting DNaseI hypersensitivity (DH) status (an epigenetic mark describing chromatin accessibility) from population-scale haplotype data. We fit the MVB model to real data from 59 individuals on whom both haplotypes and DH status in lymphoblastoid cell lines are publicly available. The model allows prediction of DH status from genetic data (prediction R^2 = 0.12 in cross-validations). Comparisons of prediction under the MVB model with prediction under linear regression (best linear unbiased prediction, or BLUP) and logistic regression demonstrate that the MVB model achieves about 10% higher prediction R^2 than the two competing methods in empirical data. Software implementing the method described can be downloaded at; © The Author (2015). Published by Oxford University Press. All rights reserved. For Permissions, please email:
    Bioinformatics 07/2015; DOI:10.1093/bioinformatics/btv397
  • Alexander H Stram, Paul Marjoram, Gary K Chen
    ABSTRACT: The development of Approximate Bayesian Computation (ABC) algorithms for parameter inference that are both computationally efficient and scalable in parallel computing environments is an important area of research. Monte Carlo rejection sampling, a fundamental component of ABC algorithms, is trivial to distribute over multiple processors but is inherently inefficient. The development of algorithms such as ABC Sequential Monte Carlo (ABC-SMC) helps address the inherent inefficiencies of rejection sampling, but such approaches are not as easily scaled on multiple processors. As a result, current Bayesian inference software offerings that use ABC-SMC lack the ability to scale in parallel computing environments. We present al3c, a C++ framework for implementing ABC-SMC in parallel. By requiring only that users define essential functions such as the simulation model and prior distribution function, al3c abstracts the user from both the complexities of parallel programming and the details of the ABC-SMC algorithm. By using the al3c framework, the user is able to scale the ABC-SMC algorithm in parallel computing environments for the specific application, with minimal programming overhead. al3c is offered as a static binary for Linux and OS X computing environments. The user completes an XML configuration file and C++ plug-in template for the specific application, which are used by al3c to obtain the desired results. Users can download the static binaries, source code, reference documentation, and examples (including those in this paper) by visiting © The Author (2015). Published by Oxford University Press. All rights reserved. For Permissions, please email:
    Bioinformatics 07/2015; DOI:10.1093/bioinformatics/btv393
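The rejection-sampling building block that ABC-SMC improves upon (and that the abstract contrasts with) is simple to sketch. This is a generic, self-contained illustration in Python, not the al3c C++ API; the coin-bias example and all names are invented for the sketch:

```python
import random

def abc_rejection(observed_stat, simulate, prior_sample, epsilon, n_accept):
    """Basic ABC rejection sampling: draw theta from the prior,
    simulate data, keep theta when the simulated summary statistic
    lands within epsilon of the observed one."""
    accepted = []
    while len(accepted) < n_accept:
        theta = prior_sample()
        if abs(simulate(theta) - observed_stat) <= epsilon:
            accepted.append(theta)
    return accepted

random.seed(0)
# Toy inference of a coin's bias from 100 flips showing 70 heads.
posterior = abc_rejection(
    observed_stat=70,
    simulate=lambda p: sum(random.random() < p for _ in range(100)),
    prior_sample=lambda: random.random(),  # Uniform(0, 1) prior
    epsilon=5,
    n_accept=200,
)
mean_p = sum(posterior) / len(posterior)
```

The inefficiency the abstract mentions is visible here: most prior draws are rejected, which is exactly what ABC-SMC's sequence of intermediate distributions mitigates.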
  • ABSTRACT: Next-generation sequencing produces vast amounts of data with errors that are difficult to distinguish from true biological variation when coverage is low. We demonstrate large reductions in error frequencies, especially for high-error-rate reads, by three independent means: (i) filtering reads according to their expected number of errors, (ii) assembling overlapping read pairs, and (iii) for amplicon reads, exploiting unique sequence abundances to perform error correction. We also show that most published paired read assemblers calculate incorrect posterior quality scores. Contact: robert@drive5.com. Availability: These methods are implemented in the USEARCH package. Binaries are freely available at © The Author (2015). Published by Oxford University Press. All rights reserved. For Permissions, please email:
    Bioinformatics 07/2015; DOI:10.1093/bioinformatics/btv401
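The expected-error filter described in (i) follows directly from Phred quality scores: each base's error probability is 10^(-Q/10), and the read's expected error count is the sum. A minimal sketch (illustrative, not the USEARCH implementation; function names are invented):

```python
def expected_errors(quality_string, offset=33):
    """Expected number of errors in a read, summed from Phred scores
    encoded with the Sanger/Illumina-1.8 ASCII offset of 33:
    E = sum over bases of 10^(-Q/10)."""
    return sum(10 ** (-(ord(c) - offset) / 10) for c in quality_string)

def filter_reads(reads, max_ee=1.0):
    """Keep (sequence, quality) pairs whose expected error count does
    not exceed max_ee."""
    return [(seq, q) for seq, q in reads if expected_errors(q) <= max_ee]

ee = expected_errors("IIII")  # four Q40 bases: 4 * 1e-4
kept = filter_reads([("ACGT", "IIII"), ("ACGT", "####")], max_ee=1.0)
```

Note how the filter discards the low-quality read ("####" is four Q2 bases, expected errors ≈ 2.5) while keeping the high-quality one.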
  • Andrei-Alin Popescu, Katharina T Huber
    ABSTRACT: Genome-Wide Association Studies are an invaluable tool for identifying genotypic loci linked with agriculturally important traits or certain diseases. The signal on which such studies rely can, however, be obscured by population stratification, making it necessary to account for it in some way. Population stratification depends on when admixture happened and thus can occur at various levels. To aid in its inference at the genome level, we recently introduced PSIKO, and comparison with leading methods indicates that it has attractive properties. However, until now it could not be used for local ancestry inference (LAI), which is preferable in cases of recent admixture, as the genome level tends to be too coarse to properly account for processes acting on small segments of a genome. To also bring the powerful ideas underpinning PSIKO to bear in such studies, we extended it to PSIKO2, which we introduce here. Source code, binaries, and user manual are freely available at, © The Author (2015). Published by Oxford University Press. All rights reserved. For Permissions, please email:
    Bioinformatics 07/2015; DOI:10.1093/bioinformatics/btv396
  • ABSTRACT: Assessing linkage disequilibrium (LD) across ancestral populations is a powerful approach for investigating population-specific genetic structure as well as functionally mapping regions of disease susceptibility. Here we present LDlink, a web-based collection of bioinformatic modules that query single nucleotide polymorphisms (SNPs) in population groups of interest to generate haplotype tables and interactive plots. Modules are designed with an emphasis on ease of use, query flexibility, and interactive visualization of results. Phase 3 haplotype data from the 1000 Genomes Project are referenced for calculating pairwise metrics of LD, searching for proxies in high LD, and enumerating all observed haplotypes. LDlink is tailored for investigators interested in mapping common and uncommon disease susceptibility loci by focusing on output linking correlated alleles and highlighting putative functional variants. LDlink is a free and publicly available web tool which can be accessed at Published by Oxford University Press 2015. This work is written by US Government employees and is in the public domain in the US.
    Bioinformatics 07/2015; DOI:10.1093/bioinformatics/btv402
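The pairwise LD metrics such a tool computes from phased haplotypes are standard population-genetics quantities. A self-contained sketch of D, D' and r^2 from two-locus haplotype counts (illustrative only; this is not LDlink's code, and the function name is invented):

```python
def ld_metrics(hap_counts):
    """Compute D, D' and r^2 from counts of the four two-locus
    haplotypes, given as {'AB': n, 'Ab': n, 'aB': n, 'ab': n}."""
    total = sum(hap_counts.values())
    pAB = hap_counts['AB'] / total
    pA = (hap_counts['AB'] + hap_counts['Ab']) / total
    pB = (hap_counts['AB'] + hap_counts['aB']) / total
    D = pAB - pA * pB
    if D >= 0:
        Dmax = min(pA * (1 - pB), (1 - pA) * pB)
    else:
        Dmax = min(pA * pB, (1 - pA) * (1 - pB))
    Dprime = D / Dmax if Dmax else 0.0
    r2 = (D * D / (pA * (1 - pA) * pB * (1 - pB))
          if 0 < pA < 1 and 0 < pB < 1 else 0.0)
    return D, Dprime, r2

# Perfect LD: only AB and ab haplotypes are observed.
d, d_prime, r2 = ld_metrics({'AB': 50, 'Ab': 0, 'aB': 0, 'ab': 50})
```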
  • ABSTRACT: Genome sequencing has become faster and more affordable. Consequently, the number of available complete genomic sequences is increasing rapidly, and the cost to store, process, analyze, and transmit the data is becoming a bottleneck for research and future medical applications. The need for efficient data compression and data reduction techniques for biological sequencing data is therefore growing by the day. Although a number of standard data compression algorithms exist, they are not efficient in compressing biological data, because these generic algorithms do not exploit inherent properties of the sequencing data. To exploit statistical and information-theoretic properties of genomic sequences, we need specialized compression algorithms. Five different Next Generation Sequencing (NGS) data compression problems have been identified and studied in the literature. We propose a novel algorithm for one of these problems, known as reference-based genome compression. We have done extensive experiments using 5 real sequencing datasets. The results on real genomes show that our proposed algorithm is competitive and achieves better compression ratios than the currently best performing algorithms for this problem. The time to compress and decompress the whole genome is also very promising. The implementations are freely available for non-commercial purposes. They can be downloaded from: Sanguthevar Rajasekaran. © The Author (2015). Published by Oxford University Press. All rights reserved. For Permissions, please email:
    Bioinformatics 07/2015; DOI:10.1093/bioinformatics/btv399
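The core idea of reference-based compression is to store only where a target genome differs from a reference. A deliberately naive sketch of that idea (substitutions only, equal-length sequences; real tools also encode insertions, deletions and rearrangements, and this is not the authors' algorithm):

```python
def compress_vs_reference(target, reference):
    """Encode `target` as a list of (position, base) substitutions
    against an equal-length reference."""
    assert len(target) == len(reference)
    return [(i, b) for i, (a, b) in enumerate(zip(reference, target))
            if a != b]

def decompress(diffs, reference):
    """Rebuild the target by applying the substitutions to the
    reference."""
    seq = list(reference)
    for i, b in diffs:
        seq[i] = b
    return ''.join(seq)

ref = "ACGTACGTACGT"
tgt = "ACGTACCTACGA"
diffs = compress_vs_reference(tgt, ref)  # two substitutions suffice
```

Since related genomes differ at a small fraction of positions, the diff list is far smaller than the sequence itself, which is where the compression gain comes from.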
  • ABSTRACT: The position-weight matrix (PWM) is a useful representation of a transcription factor binding site (TFBS) sequence pattern because the PWM can be estimated from a small number of representative TFBS sequences. However, because the PWM probability model assumes independence between individual nucleotide positions, the PWMs for some TFs poorly discriminate binding sites from non-binding sites that have similar sequence content. Since the local three-dimensional DNA structure ("shape") is a determinant of TF binding specificity and since DNA shape has a significant sequence dependence, we combined DNA shape-derived features into a TF-generalized regulatory score and tested whether the score could improve PWM-based discrimination of TFBS from non-binding sites. We compared a traditional PWM model to a model that combines the PWM with a DNA shape feature-based regulatory potential score, for accuracy in detecting binding sites for 75 vertebrate transcription factors. The PWM + shape model was more accurate than the PWM-only model for 45% of TFs tested, with no significant loss of accuracy for the remaining TFs. The shape-based model is available as an open-source R package at that is archived on the GitHub software repository at: SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. © The Author (2015). Published by Oxford University Press. All rights reserved. For Permissions, please email:
    Bioinformatics 06/2015; DOI:10.1093/bioinformatics/btv391
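The "traditional PWM model" the abstract starts from scores a candidate site by summed log-odds of each base against a background distribution. A minimal sketch (the toy matrix is invented, not from a real TF, and this is not the authors' R package):

```python
import math

def pwm_log_odds(site, pwm, background=0.25):
    """Log-odds score of a candidate site under a position-weight
    matrix (per-position base probabilities) versus a uniform
    background; higher scores favour 'binding site'.
    Note the position-independence assumption the abstract criticises:
    the score is a plain sum over positions."""
    return sum(math.log2(pwm[i][base] / background)
               for i, base in enumerate(site))

# Toy 3-column PWM preferring the consensus ATC.
pwm = [
    {'A': 0.7, 'C': 0.1, 'G': 0.1, 'T': 0.1},
    {'A': 0.1, 'C': 0.1, 'G': 0.1, 'T': 0.7},
    {'A': 0.1, 'C': 0.7, 'G': 0.1, 'T': 0.1},
]
good = pwm_log_odds("ATC", pwm)  # matches the consensus
bad = pwm_log_odds("GGG", pwm)   # matches nothing
```

The shape-augmented model in the abstract adds a second, structure-derived score on top of this per-position sum.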
  • ABSTRACT: MicroRNAs (miRNAs) play important roles in general biological processes and disease pathogenesis. Identifying miRNA target genes is an essential step to fully understand the regulatory effects of miRNAs. Many computational methods based on sequence complementarity rules and on miRNA and mRNA expression profiles have been developed for this purpose. Many sequence features of miRNA targets are available, including the context features of the target sites, the thermodynamic stability and the accessibility energy for miRNA-mRNA interaction. However, most current computational methods that combine sequence and expression information do not effectively integrate the full spectrum of these features; instead, they treat putative miRNA-mRNA interactions from sequence-based prediction as equally meaningful, so these sequence features have not been fully utilized for improving miRNA target prediction. We propose a novel regularized regression approach, based on the adaptive Lasso procedure, for detecting functional miRNA-mRNA interactions. Our method takes full account of the gene sequence features and the miRNA and mRNA expression profiles. Given a set of sequence features for each putative miRNA-mRNA interaction and their expression values, our model quantifies the down-regulation effect of each miRNA on its targets while simultaneously estimating the contribution of each sequence feature to predicting functional miRNA-mRNA interactions. By applying our model to expression datasets from two cancer studies, we demonstrate that our predictions achieve better sensitivity and specificity and are more biologically meaningful than those based on other methods. The source code is available at: SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. © The Author (2015). Published by Oxford University Press. All rights reserved. For Permissions, please email:
    Bioinformatics 06/2015; DOI:10.1093/bioinformatics/btv392
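The adaptive Lasso underlying the method above penalizes each coefficient with its own weight, and is typically fit by coordinate descent with soft-thresholding. A didactic sketch (toy data and function names invented; not the authors' estimator):

```python
def soft_threshold(z, t):
    """Soft-thresholding operator, the core update in (adaptive)
    Lasso coordinate descent."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def adaptive_lasso(X, y, lam, weights, n_iter=200):
    """Coordinate descent for min_b (1/2n)||y - Xb||^2 +
    lam * sum_j weights[j] * |b_j|. Larger weights shrink the
    corresponding coefficients harder, possibly to exactly zero."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # partial residual excluding feature j
            r = [y[i] - sum(X[i][k] * b[k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n)) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            b[j] = soft_threshold(rho, lam * weights[j]) / z
    return b

# y = 2 * x0 exactly; x1 is irrelevant and carries a heavy weight,
# so the adaptive penalty zeroes it out.
b = adaptive_lasso([[1, 1], [2, -1], [3, 1], [4, -1]],
                   [2, 4, 6, 8], lam=0.05, weights=[1.0, 5.0])
```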
  • ABSTRACT: The amount of sequenced genomes and proteins is growing at an unprecedented pace. Unfortunately, manual curation and functional knowledge lag behind. Homology-based inference often fails at labeling proteins with diverse functions and broad classes; thus, identifying high-level protein functionality remains challenging. We hypothesize that a universal feature engineering approach can yield classification of high-level functions and unified properties when combined with machine learning (ML) approaches, without requiring external databases or alignment. In this study, we present a novel bioinformatics toolkit called ProFET (Protein Feature Engineering Toolkit). ProFET extracts hundreds of features covering elementary biophysical and sequence-derived attributes. Most features capture statistically informative patterns. In addition, different representations of sequences and the amino acid (AA) alphabet provide a compact, compressed set of features. The results from ProFET were incorporated in data analysis pipelines, implemented in Python, and adapted for multi-genome scale analysis. ProFET was applied on 17 established and novel protein benchmark datasets involving classification for a variety of binary and multi-class tasks. The results show state-of-the-art performance, and the extracted features show excellent biological interpretability. The success of ProFET applies to a wide range of high-level functions such as subcellular localization, structural classes and proteins with unique functional properties (e.g., neuropeptide precursors, thermophilic and nucleic acid binding). ProFET allows easy, universal discovery of new target proteins, as well as understanding the features underlying different high-level protein functions. ProFET source code and the datasets used are freely available at CONTACT: © The Author (2015). Published by Oxford University Press. All rights reserved. For Permissions, please email:
    Bioinformatics 06/2015; DOI:10.1093/bioinformatics/btv345
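Alignment-free feature engineering of the ProFET flavour starts from simple sequence-derived statistics. A minimal sketch of a few such features (amino-acid composition plus two biophysical fractions; ProFET itself extracts hundreds of features, and this function is invented for illustration):

```python
def sequence_features(seq):
    """A small, alignment-free feature dict for a protein sequence:
    per-residue composition plus simple biophysical summaries."""
    aas = "ACDEFGHIKLMNPQRSTVWY"
    n = len(seq)
    feats = {f"comp_{aa}": seq.count(aa) / n for aa in aas}
    hydrophobic = set("AILMFVWY")
    charged = set("DEKR")
    feats["frac_hydrophobic"] = sum(seq.count(a) for a in hydrophobic) / n
    feats["frac_charged"] = sum(seq.count(a) for a in charged) / n
    feats["length"] = n
    return feats

f = sequence_features("MKKLLVV")
```

Vectors like this can be fed straight to any standard classifier, which is the "universal, feature engineering + ML" recipe the abstract describes.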
  • ABSTRACT: In profiling the composition and structure of complex microbial communities via high-throughput amplicon sequencing, a very low proportion of community members is typically sampled. As a result of this incomplete sampling, estimates of dissimilarity between communities are often inflated, an issue we term pseudo β-diversity. We present a set of tools to identify and correct for the presence of pseudo β-diversity in contrasts between microbial communities. The variably weighted Odum dissimilarity (DwOdum) allows for down-weighting the influence of either abundant or rare taxa in calculating a measure of similarity between two communities. We show that down-weighting the influence of rare taxa can be used to minimize pseudo β-diversity arising from incomplete sampling, while down-weighting the influence of abundant taxa can increase the sensitivity of hypothesis testing. OTUshuff is an associated test for identifying the presence of pseudo β-diversity in pairwise community contrasts. A Perl script for calculating the DwOdum score from a taxon abundance table and performing pairwise contrasts with OTUshuff can be obtained at: SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. Published by Oxford University Press 2015. This work is written by US Government employees and is in the public domain in the US.
    Bioinformatics 06/2015; DOI:10.1093/bioinformatics/btv394
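The unweighted Odum (percentage-difference) dissimilarity between two taxon abundance vectors is sum|x - y| / sum(x + y). One plausible way to make the weighting "variable", sketched here purely for illustration, is to raise abundances to an exponent before computing it; this is an assumption about the idea, not the published DwOdum formula:

```python
def weighted_odum(x, y, w=1.0):
    """Odum (percentage-difference) dissimilarity computed on
    abundances raised to the power w. With w < 1 abundant taxa are
    damped; with w > 1 rare taxa contribute relatively less.
    Sketch only -- not the published DwOdum definition."""
    xw = [v ** w for v in x]
    yw = [v ** w for v in y]
    num = sum(abs(a - b) for a, b in zip(xw, yw))
    den = sum(a + b for a, b in zip(xw, yw))
    return num / den if den else 0.0

same = weighted_odum([5, 3], [5, 3])      # identical communities
disjoint = weighted_odum([5, 0], [0, 3])  # no shared taxa
```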
  • ABSTRACT: Insertion sequences (ISs) are transposable elements present in most bacterial and archaeal genomes that play an important role in genomic evolution. The increasing availability of sequenced prokaryotic genomes offers the opportunity to study ISs comprehensively, but efficient and accurate tools are required for their discovery and annotation. Additionally, prokaryotic genomes are frequently deposited at an incomplete, draft stage because of the substantial cost and effort required to finish genome assembly projects. Methods to identify ISs directly from raw sequence reads or draft genomes are therefore desirable. Software tools such as OASIS and IScan currently identify IS elements in completely assembled and annotated genomes; however, to our knowledge no methods have been developed to identify ISs from raw fragment data or partially assembled genomes. We have developed novel methods to solve this computationally challenging problem, and implemented them in the software package ISQuest. This software identifies bacterial ISs and their sequence elements (inverted and direct repeats) in raw read data or contigs using flexible search parameters. ISQuest is capable of finding ISs in hundreds of partially assembled genomes within hours, making it a valuable high-throughput tool for a global search of IS elements. We tested ISQuest on simulated read libraries of 3810 complete bacterial genomes and plasmids in GenBank and detected 82% of the ISs and transposases annotated in GenBank with 80% sequence identity. © The Author (2015). Published by Oxford University Press. All rights reserved. For Permissions, please email:
    Bioinformatics 06/2015; DOI:10.1093/bioinformatics/btv388
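The terminal inverted repeats the abstract mentions are flanking motifs where the element's start is the reverse complement of its end. A small sketch of detecting that pattern with exact matching (illustrative only; ISQuest's actual search is more flexible, and the toy element here is invented):

```python
def revcomp(s):
    """Reverse complement of a DNA string."""
    comp = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}
    return ''.join(comp[b] for b in reversed(s))

def find_terminal_inverted_repeat(element, min_len=4, max_len=30):
    """Return the longest prefix of `element` whose reverse
    complement ends the element -- the terminal inverted repeat (TIR)
    pattern flanking many IS elements. Empty string if none."""
    best = ""
    for k in range(min_len, min(max_len, len(element) // 2) + 1):
        if revcomp(element[:k]) == element[-k:]:
            best = element[:k]
    return best

# The element begins with GGGTAC and ends with its reverse
# complement GTACCC.
tir = find_terminal_inverted_repeat("GGGTACAAATTAGTACCC")
```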
  • ABSTRACT: Seasonal influenza viruses evolve rapidly, allowing them to evade immunity in their human hosts and reinfect previously infected individuals. Similarly, vaccines against seasonal influenza need to be updated frequently to protect against an evolving virus population. We have thus developed a processing pipeline and browser-based visualization that allows convenient exploration and analysis of the most recent influenza virus sequence data. This web application displays a phylogenetic tree that can be decorated with additional information such as the viral genotype at specific sites, sampling location, and derived statistics that have been shown to be predictive of future virus dynamics. Additionally, mutation, genotype and clade frequency trajectories are calculated and displayed. Python and JavaScript source code is freely available from, while the web application is live at © The Author(s) 2015. Published by Oxford University Press.
    Bioinformatics 06/2015; DOI:10.1093/bioinformatics/btv381
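A mutation frequency trajectory of the kind displayed above is, at its simplest, the fraction of sampled sequences carrying an allele in each time bin. A toy sketch (the function name, binning scheme and data are invented for illustration; not the pipeline's code):

```python
from collections import defaultdict

def mutation_frequency_trajectory(samples, site, allele, bin_size=1.0):
    """Frequency of `allele` at position `site` among sequences
    collected in each time bin. `samples` is a list of
    (collection_date_in_decimal_years, sequence) pairs."""
    counts = defaultdict(lambda: [0, 0])  # bin -> [allele hits, total]
    for date, seq in samples:
        b = int(date // bin_size)
        counts[b][1] += 1
        if seq[site] == allele:
            counts[b][0] += 1
    return {b * bin_size: hits / total
            for b, (hits, total) in sorted(counts.items())}

# Four toy two-residue sequences sampled over two seasons; the 'N'
# variant at site 1 rises from 50% to 100%.
traj = mutation_frequency_trajectory(
    [(2013.2, "AK"), (2013.7, "AN"), (2014.1, "AN"), (2014.5, "AN")],
    site=1, allele="N")
```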
  • ABSTRACT: The data that put the 'evidence' into 'evidence-based medicine' are central to developments in public health, primary and hospital care. A fundamental challenge is to site such data in repositories that can easily be accessed under appropriate technical and governance controls which are effectively audited and are viewed as trustworthy by diverse stakeholders. This demands socio-technical solutions that may easily become enmeshed in protracted debate and controversy as they encounter the norms, values, expectations and concerns of diverse stakeholders. In this context, the development of what are called 'Data Safe Havens' has been crucial. Unfortunately, the origins and evolution of the term have led to a range of different definitions being assumed by different groups. There is, however, an intuitively meaningful interpretation that is often assumed by those who have not previously encountered the term: a repository in which useful but potentially sensitive data may be kept securely under governance and informatics systems that are fit-for-purpose and appropriately tailored to the nature of the data being maintained, and may be accessed and utilized by legitimate users undertaking work and research contributing to biomedicine, health and/or to ongoing development of healthcare systems. This review explores a fundamental question: what are the specific criteria that ought reasonably to be met by a data repository if it is to be seen as consistent with this interpretation and viewed as worthy of being accorded the status of 'Data Safe Haven' by key stakeholders? We propose 12 such criteria. © The Author 2015. Published by Oxford University Press.
    Bioinformatics 06/2015; DOI:10.1093/bioinformatics/btv279
  • ABSTRACT: DamID is a powerful technique for identifying regions of the genome bound by a DNA-binding (or DNA-associated) protein. Currently no method exists for automatically processing next-generation sequencing DamID (DamID-seq) data, and the use of DamID-seq datasets with normalisation based on read counts alone can lead to high background and the loss of bound signal. DamID-seq thus presents novel challenges in terms of normalisation and background minimisation. We describe here damidseq_pipeline, a software pipeline that performs automatic normalisation and background reduction on multiple DamID-seq FASTQ datasets. It is open-source and freely available from. The damidseq_pipeline is implemented in Perl and is compatible with any Unix-based operating system (e.g. Linux, Mac OS X). © The Author(s) 2015. Published by Oxford University Press.
    Bioinformatics 06/2015; DOI:10.1093/bioinformatics/btv386
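DamID-seq signal is conventionally expressed per genomic bin as a log2 ratio of the Dam-fusion sample over the Dam-only control, after putting both samples on a common scale. A simplified sketch of that step (illustrative only; not the damidseq_pipeline's normalisation, which the abstract notes goes beyond plain read-count scaling):

```python
import math

def log2_ratio_profile(dam_fusion_counts, dam_only_counts, pseudocount=1):
    """Per-bin log2(Dam-fusion / Dam-only) after scaling the fusion
    sample to the control's total read count; the pseudocount keeps
    empty bins finite."""
    t1, t2 = sum(dam_fusion_counts), sum(dam_only_counts)
    scale = t2 / t1
    return [math.log2((f * scale + pseudocount) / (d + pseudocount))
            for f, d in zip(dam_fusion_counts, dam_only_counts)]

# Two toy bins: the first is enriched in the fusion sample
# (candidate bound region), the second is depleted.
ratios = log2_ratio_profile([30, 10], [10, 30])
```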