iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition

Department of Physics, School of Sciences, Center for Genomics and Computational Biology, Hebei United University, Tangshan, China, Bioinformatics and Computer-Aided Drug Discovery, Gordon Life Science Institute, San Diego, CA, USA, School of Public Health, Hebei United University, Tangshan 063000, China and Key Laboratory for Neuro-Information of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China.
Nucleic Acids Research (Impact Factor: 9.11). 01/2013; 41(6). DOI: 10.1093/nar/gks1450
Source: PubMed


Meiotic recombination is an important biological process. As a main driving force of evolution, recombination provides natural new combinations of genetic variations. Rather than randomly occurring across a genome, meiotic recombination takes place in some genomic regions (the so-called 'hotspots') with higher frequencies, and in the other regions (the so-called 'coldspots') with lower frequencies. Therefore, the information of the hotspots and coldspots would provide useful insights for in-depth studying of the mechanism of recombination and the genome evolution process as well. So far, the recombination regions have been mainly determined by experiments, which are both expensive and time-consuming. With the avalanche of genome sequences generated in the postgenomic age, it is highly desired to develop automated methods for rapidly and effectively identifying the recombination regions. In this study, a predictor, called 'iRSpot-PseDNC', was developed for identifying the recombination hotspots and coldspots. In the new predictor, the samples of DNA sequences are formulated by a novel feature vector, the so-called 'pseudo dinucleotide composition' (PseDNC), into which six local DNA structural properties, i.e. three angular parameters (twist, tilt and roll) and three translational parameters (shift, slide and rise), are incorporated. It was observed by the rigorous jackknife test that the overall success rate achieved by iRSpot-PseDNC was >82% in identifying recombination spots in Saccharomyces cerevisiae, indicating the new predictor is promising or at least may become a complementary tool to the existing methods in this area. Although the benchmark data set used to train and test the current method was from S. cerevisiae, the basic approaches can also be extended to deal with all the other genomes. Particularly, it has not escaped our notice that the PseDNC approach can be also used to study many other DNA-related problems. 
As a user-friendly web-server, iRSpot-PseDNC is freely accessible at
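
The feature vector described in the abstract can be illustrated with a minimal sketch. It assumes the usual pseudo-composition pattern: 16 normalized dinucleotide frequencies followed by a small number of correlation terms computed from per-dinucleotide structural properties. The names `pse_dnc`, `props`, `lam` and `weight` are hypothetical, the exact formula should be checked against the paper, and the property table below is a placeholder rather than the published twist, tilt, roll, shift, slide and rise values.

```python
from itertools import product

# The 16 dinucleotides in a fixed order: AA, AC, AG, AT, CA, ...
DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]


def pse_dnc(seq, props, lam=3, weight=0.05):
    """Return a (16 + lam)-dimensional pseudo dinucleotide composition vector.

    seq    : DNA string over A, C, G, T
    props  : dict mapping each dinucleotide to a tuple of property values
             (ideally standardized beforehand)
    lam    : number of correlation tiers (assumed parameter name)
    weight : weight of the correlation terms (assumed parameter name)
    """
    seq = seq.upper()
    dinucs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    if lam >= len(dinucs):
        raise ValueError("lam must be smaller than the number of dinucleotides")

    # Normalized occurrence frequency of each dinucleotide.
    freqs = [dinucs.count(d) / len(dinucs) for d in DINUCLEOTIDES]

    # Tier-j correlation: mean squared property difference between
    # dinucleotides that are j positions apart along the sequence.
    thetas = []
    for j in range(1, lam + 1):
        diffs = [
            sum((a - b) ** 2 for a, b in zip(props[x], props[y])) / len(props[x])
            for x, y in zip(dinucs, dinucs[j:])
        ]
        thetas.append(sum(diffs) / len(diffs))

    # Joint normalization so that all components sum to 1.
    denom = sum(freqs) + weight * sum(thetas)
    return [f / denom for f in freqs] + [weight * t / denom for t in thetas]


# Placeholder property table: NOT the published values, illustration only.
toy_props = {d: (float(i % 4), float(i // 4)) for i, d in enumerate(DINUCLEOTIDES)}
vec = pse_dnc("ACGTACGTTGCA", toy_props, lam=2, weight=0.1)
print(len(vec), round(sum(vec), 6))
```

A vector of this kind would then be passed to a classifier and evaluated, for example by a leave-one-out (jackknife) test as the abstract reports. With real structural parameters, each property is usually standardized across the 16 dinucleotides before use.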


Available from: Hao Lin, Jun 17, 2014
  • Source
    • "We consider these properties owing to the observation that hotspots centers are characterized by a depletion of nucleosomes (Pan et al., 2011) and DNA flexibility plays an important role in nucleosome positioning (Richmond and Davey, 2003; Tolstorukov et al., 2007). The performance of the dinucleotide structure parameters in hot/cold spots prediction was previously demonstrated (Chen et al., 2013). The third one is thermodynamic properties including dinucleotide free energy, entropy and enthalpy (Ignatova et al., 2008). "
    ABSTRACT: Characterization and accurate prediction of recombination hotspots and coldspots have crucial implications for the mechanism of recombination. Several models have predicted recombination hot/cold spots successfully, but there is still much room for improvement. We present a novel classifier in which k-mer frequency, physical and thermodynamic properties of DNA sequences are incorporated in the form of weighted features. Applying the classifier to recombination hot/cold ORFs in Saccharomyces cerevisiae, we achieved an accuracy of 90%, which is ∼5% higher than existing methods, such as iRSpot-PseDNC, IDQD and Random Forest. The model also predicted non-ORF recombination hot/cold spots sequences in Saccharomyces cerevisiae with high accuracy. A broad applicability of the model in the field of classification is expected. Copyright © 2015. Published by Elsevier Ltd.
    Journal of Theoretical Biology 06/2015; 382. DOI:10.1016/j.jtbi.2015.06.030 · 2.12 Impact Factor
  • Source
    • "d their physical properties, secondary structure components and range of ASA values. Finally, we applied the predicted ASA values to improve the accuracy of the energy function, 3DIGARS, which actually resulted in outperforming all the state-of-the-art energy functions significantly. As demonstrated by a series of recent publications (Chen et al., 2013); (Lin et al., 2014, 2015); (Ding et al., 2014); (Xu et al., 2014); (Jia et al., 2015), to establish a really useful sequence-based statistical predictor for a biological system, we aligned the outline of our paper accordingly towards the steps of Chou's 5-step rule (Chou, 2011) for the two different par"
    ABSTRACT: An accurate prediction of real value accessible surface area (ASA) from protein sequence alone has wide application in the field of bioinformatics and computational biology. ASA has been helpful in understanding the 3-dimensional structure and function of a protein, acting as high impact feature in secondary structure prediction, disorder prediction, binding region identification and fold recognition applications. To enhance and support broad applications of ASA, we have made an attempt to improve the prediction accuracy of absolute accessible surface area by developing a new predictor paradigm, namely REGAd3p, for real value prediction through classical Exact Regression with Regularization and polynomial kernel of degree 3 which was further optimized using Genetic Algorithm. ASA assisting effective energy function, motivated us to enhance the accuracy of predicted ASA for better energy function application. Our ASA prediction paradigm was trained and tested using a new benchmark dataset, proposed in this work, consisting of 1001 and 298 protein chains, respectively. We achieved maximum Pearson Correlation Coefficient (PCC) of 0.76 and 1.45% improved PCC when compared with existing top performing predictor, SPINE-X, in ASA prediction on independent test set. Furthermore, we modeled the error between actual and predicted ASA in terms of energy and combined this energy linearly with the energy function 3DIGARS which resulted in an effective energy function, namely 3DIGARS2.0, outperforming all the state-of-the-art energy functions. Based on Rosetta and Tasser decoy-sets 3DIGARS2.0 resulted 80.78%, 73.77%, 141.24%, 16.52%, and 32.32% improvement over DFIRE, RWplus, dDFIRE, GOAP and 3DIGARS respectively.
    Journal of Theoretical Biology 06/2015; 380:380-391. DOI:10.1016/j.jtbi.2015.06.012 · 2.12 Impact Factor
  • Source
    • "As shown by a series of recent publications [13] [14] [15] [16] [17] [18] [19] [20] [21] [22] in response to the call from a comprehensive review [23] to develop and present a really useful statistical predictor for a biological system, one should make the following procedures crystal clear: (i) how to construct or select a valid benchmark dataset to train and test the predictor, (ii) how to formulate the statistical samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the target to be predicted, (iii) how to introduce or develop a powerful algorithm (or engine) to operate the prediction, (iv) how to properly perform the cross-validation tests to objectively evaluate the anticipated accuracy of the predictor, and (v) how to establish a user-friendly web-server for the predictor that is accessible to the public. Below, we address these five procedures one by one."
    ABSTRACT: Predominantly occurring on cytosine, DNA methylation is a process by which cells can modify their DNAs to change the expression of gene products. It plays very important roles in life development but also in forming nearly all types of cancer. Therefore, knowledge of DNA methylation sites is significant for both basic research and drug development. Given an uncharacterized DNA sequence containing many cytosine residues, which one can be methylated, and which one cannot? With the avalanche of DNA sequences generated in the postgenomic age, it is highly desired to develop computational methods for accurately identifying the methylation sites in DNA. Using the trinucleotide composition, pseudo amino acid components, as well as dataset-optimizing technique, we developed a new predictor called "iDNA-Methyl" that has achieved remarkably higher success rates in identifying the DNA methylation sites than the existing predictors. A user-friendly web-server for the new predictor has been established at, by which users can easily get their desired results. We anticipate that the web-server predictor will become a very useful high throughput tool for basic research and drug development, and that the novel approach and technique can also be used to investigate many other DNA-related problems and genome analysis. Copyright © 2014 Elsevier Inc. All rights reserved.
    Analytical Biochemistry 01/2015; 474. DOI:10.1016/j.ab.2014.12.009 · 2.22 Impact Factor