Figure 1 - uploaded by Wei Chen
The schematic diagram of origin of replication of human. The process of DNA replication requires two DNA  

Source publication
Article
Full-text available
The initiation of replication is an extremely important process in DNA life cycle. Given an uncharacterized DNA sequence, can we identify where its origin of replication (ORI) is located? It is no doubt a fundamental problem in genome analysis. Particularly, with the rapid development of genome sequencing technology that results in a huge amount of...

Context in source publication

Context 1
... the detailed replication process in human DNA, see [11][12][13] as well as Figure 1. ...

Similar publications

Article
Full-text available
Microbial taxonomy is increasingly influenced by genome-based computational methods. Yet such analyses can be complex and require expert knowledge. Here we introduce TYGS, the Type (Strain) Genome Server, a user-friendly high-throughput web server for genome-based prokaryote taxonomy, connected to a large, continuously growing database of genomic,...

Citations

... Gao and Zhang [25] proposed a webserver, named Ori-finder, to find ORIs in the genomes of bacteria and archaea; it was further enhanced by Luo et al. [26] using the Z-curve method and gene prediction pipelines for further exploration of ORIs and gene identification. A classifier called iOri-Human [27] used physicochemical properties along with pseudo nucleotide composition features of DNA sequences to identify human ORIs. Xiao et al. [28] proposed a webserver named iROS-gPseKNC, which used random forest (RF) to classify ORIs from non-ORIs in the Saccharomyces cerevisiae genome. ...
Article
Replication of DNA is an important process for the cell division cycle, gene expression regulation and other biological evolution processes. It also has a crucial role in a living organism’s physical growth and structure. Replication of DNA comprises three stages, known as initiation, elongation and termination, and the origin of replication site (ORI) is the location where the DNA replication process is initiated. Various methodologies exist to identify ORIs in genomic sequences; however, these methods either require extensive computation or are poorly optimized for large datasets. Herein, a model called ORI-Deep is proposed to identify ORIs from multiple cell type genomic sequence benchmark data. An efficient method is proposed using a deep neural network to identify ORIs for four different eukaryotic species. For better representation of the data, a feature vector is constructed using statistical moments for training and testing, and is then fed to a long short-term memory (LSTM) network. To demonstrate the effectiveness of the proposed model, we applied several validation techniques at different levels to obtain seven accuracy metrics; the accuracy scores for self-consistency, 10-fold cross-validation, jackknife and the independent set test are 0.977, 0.948, 0.976 and 0.977, respectively. Based on the results, it can be concluded that ORI-Deep can efficiently predict origin of replication sites in DNA sequences with high accuracy. The webserver for ORI-Deep is available at (https://share.streamlit.io/waqarhusain/orideep/main/app.py), and the source code is available at (https://github.com/WaqarHusain/OriDeep).
... Usually, in various prediction methods, accuracy is used to examine the strength of hypothesis learners, although accuracy alone is not sufficient to evaluate the efficiency of a prediction model [92,93]. However, a set of four metrics was introduced on the basis of Chou's symbols, originally used for studying protein signal peptides, and these metrics were subsequently adopted by a series of publications [94][95][96][97][98][99][100][101][102] ...
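The remark that accuracy alone is insufficient can be checked numerically. The following is a minimal Python sketch, not taken from any of the cited papers, that computes sensitivity, specificity, accuracy and the Matthews correlation coefficient (MCC) from confusion-matrix counts using their standard definitions. The function name `binary_metrics` and the example counts are illustrative assumptions.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return Sn, Sp, Acc and MCC for a binary classifier.

    tp, fp, tn, fn are the confusion-matrix counts. Ratios with a zero
    denominator are reported as 0.0 (a convention chosen for this sketch).
    """
    sn = tp / (tp + fn) if (tp + fn) else 0.0      # sensitivity (recall)
    sp = tn / (tn + fp) if (tn + fp) else 0.0      # specificity
    acc = (tp + tn) / (tp + fp + tn + fn)          # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}


# Hypothetical imbalanced case: the model predicts "negative" for everything.
print(binary_metrics(tp=0, fp=0, tn=95, fn=5))
```

In this hypothetical case the accuracy is 0.95 although the sensitivity and the MCC are both 0, which is why the additional metrics are usually reported together with accuracy.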
Article
Please cite this article as: S. Akbar, A.U. Rahman, M. Hayat, M. Sohail, cACP: Classifying anticancer peptides using discriminative intelligent model via Chou's 5-step rules and general pseudo components, Chemometrics and Intelligent Laboratory Systems (2020), doi: https://doi.
Abstract: Worldwide, cancer is considered a fatal disease and remains the major cause of death. Conventional medication approaches using therapies and anticancer drugs are deemed ineffective due to their high cost and harmful impacts on normal cells. However, the innovation of anticancer peptides (ACPs) provides an effective way to deal with cancer-affected cells. Due to the rapid increase in peptide sequences, accurate characterization of ACPs has become a challenging task for investigators. In this paper, an effort has been made to develop a reliable and intelligent computational method for the accurate discrimination of anticancer peptides. Three statistical feature representation schemes, namely Quasi-sequence order (QSO), conjoint triad feature, and Geary autocorrelation descriptor, are applied to express the motifs of the target class. To eliminate irrelevant and noisy features while selecting salient, high-variance features, principal component analysis is employed. Furthermore, a diverse set of learning algorithms is evaluated in order to select the best operational engine for the proposed model. After examining the empirical outcomes, the support vector machine obtained quite encouraging results in combination with the QSO feature space. It achieved an accuracy of 96.91% and 89.54% using the main dataset and the alternative dataset, respectively. It is observed that our proposed model shows an outstanding improvement compared to methods in the literature. It is expected that the developed model may play a useful role in academic research as well as in proteomics and drug development.
... Protein functions are mainly predicted by using protein features, including amino acid sequence [2,3], 3-D protein structure [4], protein-protein interaction (PPI) network [5] and the other molecular and functions [6,7]. Machine learning, which is widely used to predict protein function, uses features extracted from protein properties to train classification models, such as Artificial Neural Networks (ANNs) [8][9][10]. Deep Neural Network (DNN) is a subclass of ANNs which builds more advanced features in each subsequent layer with the input of initial features. ...
... The average values ΔTP, ΔFP, ΔTN and ΔFN of TP, FP, TN and FN are calculated to measure the performance of IGP-DNN. Formulas (7), (8) and (9) define the evaluation measures. ...
Article
Protein function prediction is vital for the annotation of uncharacterized proteins. At present, Deep Neural Network based protein function prediction is mainly carried out on small-scale protein datasets or Gene Ontology, and usually explores the relationship between a single protein feature and function tags. Practical methods for large-scale, multi-feature protein prediction still need to be studied in depth. This paper proposes IGP-DNN, a DNN-based protein function prediction approach. The method uses a protein function module extraction algorithm based on the Grasshopper Optimization Algorithm (GOA) and Intuitionistic Fuzzy c-Means clustering (IFCM) to extract the features of protein modules, uses Kernel Principal Component Analysis (KPCA) to reduce the dimensionality of the protein attribute information, and integrates the module features with the attribute features. The integrated data are fed into a DNN with multiple hidden layers to classify proteins and predict protein functions. In the experiments, the F-measure of IGP-DNN on the DIP dataset reaches 0.4436, which shows better performance.
... In 2012, Chen et al. [4] studied the replication initiation site of Saccharomyces cerevisiae by calculating the bending degree and cleavage intensity of the DNA sequence, which is highly effective for identifying positive samples. In 2016, Zhang et al. [5] first attempted to study the origin of human DNA replication and constructed a predictor based on random forest. In 2016, Wang et al. [6] studied H. sapiens, M. musculus, E. coli and came up with a method "MaloPred". ...
... To study the origin of DNA replication in various eukaryotes, sample datasets were collected for seven eukaryotes: H. sapiens, M. musculus, D. melanogaster, A. thaliana, P. pastoris, S. pombe and K. lactis [5,7,8]. All the sequences are 300 bp in length, and the positive and negative sample sets are balanced overall. ...
... PseKNC-II, also known as the series correlation PseKNC [5,23], not only considers the frequency information of k-tuple nucleotides but also calculates the physical and chemical properties of pseudo-nucleotides. In this work, we extracted three pseudonucleotide feature sets with k = 1, 2, 3, 4, 5 and 6. ...
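To make the "frequency information of k-tuple nucleotides" concrete, here is a minimal Python sketch of the k-tuple composition part only. It deliberately omits the physicochemical correlation terms that PseKNC-II adds on top, and it is not the cited authors' implementation. The function name `kmer_frequencies` and the choice to skip windows containing non-ACGT characters are assumptions made for illustration.

```python
from itertools import product


def kmer_frequencies(seq, k):
    """Return the normalized frequency of every k-tuple over the alphabet ACGT.

    The output has 4**k entries, ordered lexicographically (AA, AC, AG, ...).
    Windows containing characters other than A, C, G, T are skipped
    (an assumption of this sketch).
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:
            counts[word] += 1
            total += 1
    return [counts[w] / total if total else 0.0 for w in kmers]


# Concatenating the vectors for k = 1..3 gives 4 + 16 + 64 = 84 features.
features = []
for k in (1, 2, 3):
    features.extend(kmer_frequencies("ACGTACGGTTAACC", k))
print(len(features))
```

The feature dimension grows as 4^k, so k = 6 alone contributes 4096 composition features, which is one reason such pipelines are usually followed by a feature selection step.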
Article
Full-text available
Background: The origin is the starting site of DNA replication, an extremely vital part of the informational inheritance between parents and children. More importantly, accurately identifying the origin of replication has great application value in the diagnosis and treatment of diseases related to genetic information errors, while the traditional biological experimental methods are time-consuming and laborious. Results: We carried out research on the origin of replication in a variety of eukaryotes and proposed a unique prediction method for each species. Throughout the experiment, we collected data from 7 species: Homo sapiens, Mus musculus, Drosophila melanogaster, Arabidopsis thaliana, Kluyveromyces lactis, Pichia pastoris and Schizosaccharomyces pombe. In addition to the commonly used sequence feature extraction methods PseKNC-II and Base-content, we designed a feature extraction method based on TF-IDF. A two-step method was then used for feature selection. After comparing a variety of traditional machine learning classification models, the multi-layer perceptron was employed as the classification algorithm. The data and code used in the experiment are available at https://github.com/Sarahyouzi/EukOriginPredict. Conclusions: The prediction accuracies on the training sets of the above-mentioned seven species after 100 rounds of fivefold cross-validation reach 92.60%, 90.80%, 91.22%, 96.15%, 96.72%, 99.86% and 96.72%, respectively. This indicates that, compared with other methods, the methods we designed achieve superior performance. In addition, our experiments reveal that the models of multiple species can predict each other with high accuracy, and the results of STREME show that they share a certain common motif.
... In recent publications (Lin et al., 2014;Xu et al., 2014;Chen et al., 2016c;Zhang et al., 2016;Ehsan et al., 2018), these set of metrics were utilized for research in state-of-the-art methods. According to Eq. (18), SUMOk sites or non-SUMOk sites are applicable only for the binary classification data. ...
Article
Full-text available
Sumoylation is a post-translational modification that is involved in the adaptation of cells and in the functional properties of a large number of proteins. Sumoylation has key importance in subcellular concentration, transcriptional synchronization, chromatin remodeling, response to stress, and regulation of mitosis. Sumoylation is associated with developmental defects in many human diseases such as cancer, Huntington’s, Alzheimer’s, Parkinson’s, spinocerebellar ataxia 1, and amyotrophic lateral sclerosis. The covalent bonding of Sumoylation is essential to inheriting part of the operative characteristics of some other proteins. For that reason, the prediction of Sumoylation sites is significant to the scientific community. A novel and efficient technique is proposed to predict the Sumoylation sites in proteins by incorporating Chou’s Pseudo Amino Acid Composition (PseAAC) with statistical moments-based features. The outcomes of the proposed system using 10-fold cross-validation testing are 94.51% accuracy, 94.24% sensitivity, 94.79% specificity and an MCC of 0.8903. The performance of the proposed system is so far the best in comparison to other state-of-the-art methods. The code for the current study is available in the GitHub repository at https://github.com/csbioinfopk/iSumoK-PseAAC.
... This study defines a careful methodology for a new prediction model for computational identification of cancer driver genes. The work adapts approaches widely used in bioinformatics and computational science for the recognition of cancer driver genes [7][8][9]. A useful and systematic sequence-based methodology for a biological system can be planned by following these steps: (1) construct or select a valid benchmark dataset for training and testing the prediction model; (2) formulate the biological sequence samples with an effective mathematical expression that reflects their intrinsic relationship with the targets concerned; (3) develop an effective computational algorithm for prediction; (4) validate the outcomes so as to objectively evaluate the expected accuracy; (5) provide a framework for public use based on the resulting robust model. ...
Article
Full-text available
Cancer is driven by distinctive sorts of changes and basic variations in genes. Recognizing cancer driver genes is basic for accurate oncological analysis. Numerous methodologies to distinguish and identify drivers presently exist, but efficient tools to combine and optimize them on huge datasets are few. Most strategies for prioritizing transformations depend basically on frequency-based criteria. Strategies are required to dependably prioritize organically dynamic driver changes over inert passengers in high-throughput sequencing cancer information sets. This study proposes a model, namely PCDG-Pred, which works as a utility capable of distinguishing cancer driver and passenger attributes of genes based on sequencing data. Keeping in view the significance of cancer driver genes, an efficient method is proposed to identify them. Further, various validation techniques are applied at different levels to establish the effectiveness of the model and to obtain metrics such as accuracy, Matthews correlation coefficient, sensitivity, and specificity. The results of the study strongly indicate that the proposed strategy provides a fundamental functional advantage over other existing strategies for cancer driver gene identification. Careful experiments show that the accuracy metrics obtained for self-consistency, independent set, and cross-validation tests are 91.08%, 87.26%, and 92.48%, respectively.
... This section explains the last step of Chou's 5-step rule (Liu et al. 2016c;Zhang et al. 2016;Chou 2011), which is the development of a web server for the convenience of users. As shown by different researchers in some recent publications, easy-to-understand and freely available web servers represent the future direction for developing more useful prediction methods and computational analysis tools. They have significantly increased the impact of computational science on medical science (Chou 2015), driving medicinal science into an extraordinary revolution (Cheng et al. 2017). ...
Article
Full-text available
DNA replication is one of the specific processes to be considered in all living organisms, specifically eukaryotes. The prevalence of DNA replication is significant for an evolutionary transition at the beginning of life. DNA replication proteins are the proteins that support the process of replication and are also reported to be important in drug design and discovery. This indicates that DNA replication proteins have a very important role in the human body; however, to study their mechanism, their identification is necessary. This is a very important task, but experimental identification is time-consuming, costly and laborious. To cope with this issue, a computational methodology is required for the prediction of these proteins; however, no prior method exists. This study covers the construction of a novel prediction model for this purpose. The prediction model is developed based on an artificial neural network by integrating position relative features and sequence statistical moments in PseAAC for training neural networks. The highest overall accuracy, achieved through tenfold cross-validation and jackknife testing, was 96.22% and 98.56%, respectively. Our experimental results demonstrate that the proposed predictor surpasses the existing models and can serve as a time- and cost-effective strategy for designing novel drugs against contemporary bacterial infections.
... Xiao et al. successfully incorporated dinucleotide location-specific propensity into PseKNC and used a Random Forest [29] classifier to form a predictor called 'iROS-gPseKNC', which has a high accuracy of 98.03%, with other indexes also close to 100% [30]. Zhang et al. demonstrated that the integration of dinucleotide physicochemical properties with the pseudo nucleotide composition is an effective way to improve the prediction performance for human ORIs, and the Random Forest classifier was used to form the predictor, called 'iOri-Human' [31]. The latest method, proposed by Do et al. [32], is a hybrid identification system incorporating fusion features extracted by FastText [33] and PseKNC with XGBoost [34], which achieved an accuracy of 89.51% in Saccharomyces cerevisiae. ...
Article
Full-text available
DNA replication influences the inheritance of genetic information in the DNA life cycle. As the distribution of replication origins (ORIs) is the major determinant in precisely regulating the replication process, the correct identification of ORIs is significant for understanding DNA replication mechanisms and the regulatory mechanisms of gene expression. For eukaryotes in particular, multiple ORIs exist in each of their gene sequences to complete the replication in a reasonable period of time. To simplify the identification of eukaryotic ORIs, most existing methods are built on traditional machine learning algorithms and target gene sequences of a fixed length. Consequently, the identification results are not satisfying, i.e. there is still great room for improvement. To overcome the limitations of previous studies, this paper develops sequence segmentation methods and employs the word embedding technique ‘Word2vec’ to convert gene sequences into word vectors, thereby capturing the inner correlations of gene sequences of different lengths. Then, a deep learning framework for the ORI identification task is constructed from a convolutional neural network with an embedding layer. Analysis of the dimensionality-reduction similarity diagram shows that Word2vec can effectively transform the inner relationships among words into numerical features. For the four species in this study, the best models are obtained with overall accuracies of 0.975, 0.765, 0.885 and 0.967, Matthews correlation coefficients of 0.940, 0.530, 0.771 and 0.934, and AUCs of 0.975, 0.800, 0.888 and 0.981, which indicate that the proposed predictor is stable and provides a high confidence coefficient for classifying both ORIs and non-ORIs. Compared with state-of-the-art methods, the proposed predictor achieves ORI identification with significant improvement. It is therefore reasonable to anticipate that the proposed method will make a useful high-throughput tool for genome analysis.
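The idea of segmenting a sequence into "words" before embedding can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the segmentation scheme of the paper above, whose details are not given in this abstract. The names `segment`, `build_vocab` and `encode`, the word length of 3, and the padding convention are all hypothetical.

```python
def segment(seq, k, stride=1):
    """Split a DNA sequence into k-letter words taken every `stride` bases.

    stride=1 gives overlapping words; stride=k gives non-overlapping words.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def build_vocab(corpus):
    """Map each distinct word to an integer id, reserving 0 for padding."""
    vocab = {}
    for words in corpus:
        for w in words:
            vocab.setdefault(w, len(vocab) + 1)
    return vocab


def encode(words, vocab, max_len):
    """Turn words into a fixed-length id list, padded or truncated to max_len.

    Unknown words are also mapped to 0 here, a simplification of this sketch.
    """
    ids = [vocab.get(w, 0) for w in words][:max_len]
    return ids + [0] * (max_len - len(ids))


# Two sequences of different lengths end up as id lists of equal length,
# which is the form an embedding layer typically expects.
corpus = [segment("ATGCGT", 3), segment("ATGAT", 3)]
vocab = build_vocab(corpus)
print([encode(words, vocab, max_len=5) for words in corpus])
```

Padding to a common length is one simple way to handle sequences of different lengths; a word embedding such as Word2vec would then be trained on the word lists, or an embedding layer would be learned from the integer ids.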
... Three testing protocols, namely jackknife testing, 10-fold cross-validation testing, and independent dataset testing, are employed to analyze the consistency and reliability of the proposed technique [16], [31], [32]. The jackknife testing protocol always gives a unique result, which makes it a widely accepted testing technique in the bioinformatics community for assessing the performance of proposed models [32]-[37]. ...
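The statement that the jackknife protocol gives a unique result follows from how the splits are formed: each sample is held out exactly once, so no random partition is involved, whereas k-fold cross-validation normally depends on how the data are shuffled into folds. The sketch below illustrates this in plain Python. The function names and the `seed` parameter are illustrative assumptions, not part of any cited method.

```python
import random


def jackknife_splits(n):
    """Yield (train_indices, test_indices) with each sample held out once."""
    for i in range(n):
        yield [j for j in range(n) if j != i], [i]


def kfold_splits(n, k, seed=0):
    """Yield (train_indices, test_indices) for k folds after a seeded shuffle."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]   # k roughly equal-sized folds
    for i, test in enumerate(folds):
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, test


# The jackknife splits are fixed by the data alone, while the k-fold
# splits depend on the seed used for shuffling.
print(list(jackknife_splits(3)))
print([test for _, test in kfold_splits(10, 5, seed=1)])
```

The price of this determinism is cost: the jackknife trains one model per sample, while 10-fold cross-validation trains only ten.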
Article
Full-text available
Many organelles inside and outside a living cell depend on the perfect behavior of the Golgi apparatus for smooth and normal functioning. Its poor performance may lead to many inheritable diseases like diabetes and cancer. Therefore, it is highly crucial to detect any strange behavior of the Golgi apparatus in advance. Accurate discrimination of cis-Golgi from trans-Golgi proteins helps researchers identify the role of Golgi proteins in various diseases and assists pharmacists in drug development. In this work, various hybrid models of Bi-Profile Bayes, Bigram PSSM, Di-Peptide Composition, and Split Amphiphilic Pseudo Amino Acid Composition with the SMOTE oversampling technique have been employed to discriminate Golgi protein types. Multiple linear Support Vector Machines have been used to exploit the discrimination power of these models. The proposed prediction system, Golgi-predictor, has shown significant performance and achieved promising results compared to other existing state-of-the-art techniques. Through 10-fold cross-validation, the proposed system achieved an accuracy of 97.6%, a sensitivity of 98.8%, a specificity of 96.5%, a G-mean of 97.6%, an MCC of 0.95, and an F-score of 0.97. Similarly, through jackknife cross-validation, the achieved values for accuracy, sensitivity, specificity, G-mean, MCC, and F-score are, respectively, 96.5%, 97.8%, 95.2%, 96.4%, 0.93, and 0.96. Moreover, in independent dataset testing, Golgi-predictor demonstrated significant enhancement in performance over other techniques. The proposed methodology aims at supporting drug designers in the pharmaceutical industry and assisting researchers in bioinformatics and computational biology towards better innovation in predicting the behavior of Golgi proteins.
... More recently, Singh et al. [11] utilized the content-based and context-based computation analysis to learn a classification model. Zhang et al. [12] constructed 'iOri-Human', a classifier to identify human origin of replication by leveraging the physicochemical properties as well as the pseudo nucleotide composition feature. Among the existing approaches, iORI-Euk is the latest and the first method containing cell-specific multiple eukaryotic species predictors, which was developed using the large benchmark/training datasets. ...
Article
Full-text available
Origins of replication sites (ORIs), which refer to the initiation locations of genomic DNA replication, play essential roles in the DNA replication process. Detecting the distribution of ORIs at genome scale is one of the key steps towards an in-depth understanding of their regulation mechanisms. In this study, we present a novel machine learning-based approach called Stack-ORI, encompassing 10 cell-specific prediction models for identifying ORIs from four different eukaryotic species (Homo sapiens, Mus musculus, Drosophila melanogaster and Arabidopsis thaliana). For each cell-specific model, we employed 12 feature encoding schemes that cover nucleic acid composition, position-specific and physicochemical properties information. The optimal feature set was identified from each encoding individually and used to develop the respective baseline model with the eXtreme Gradient Boosting (XGBoost) classifier. Subsequently, the predicted scores of the 12 baseline models are integrated as a novel feature vector to train XGBoost and develop the final model. Extensive experimental results show that Stack-ORI achieves significantly better performance than its baseline models on both training and independent datasets. Interestingly, Stack-ORI consistently outperforms the existing predictor in all cell-specific models, not only on training but also on independent tests. Moreover, our novel approach provides the necessary interpretations that help in understanding the model's success by leveraging the powerful SHapley Additive exPlanation algorithm, thus underlining the most important feature encoding schemes for predicting cell-specific ORIs.