csDMA: an improved bioinformatics tool for identifying DNA 6 mA modifications via Chou’s 5-step rule

Springer Nature
Scientific Reports
DNA N6-methyldeoxyadenosine (6 mA) modifications were first found more than 60 years ago but were thought to be widespread only in prokaryotes and unicellular eukaryotes. With the development of high-throughput sequencing technology, 6 mA modifications have been found in various multicellular eukaryotes by experimental methods. However, these experimental methods are time-consuming and costly, which makes it necessary to develop computational alternatives. In this study, a machine learning-based prediction tool, named csDMA, was developed for predicting 6 mA modifications. First, three feature encoding schemes, Motif, Kmer, and Binary, were used to generate the feature matrix. Second, different algorithms were evaluated for the prediction model, and the ExtraTrees model achieved the best AUC of 0.878 in 5-fold cross-validation on the training dataset. The ExtraTrees model also achieved the best AUC of 0.893 on the independent testing dataset. Finally, we compared our method with state-of-the-art predictors, and the results showed that our model achieved better performance than existing tools.
SCIENTIFIC REPORTS | (2019) 9:13109 | https://doi.org/10.1038/s41598-019-49430-4
www.nature.com/scientificreports
Ze Liu1,2, Wei Dong1,2, Wei Jiang1,2 & Zili He1,2
1 College of Water Resources and Architectural Engineering, Northwest A&F University, Yangling, 712100, Shaanxi, China. 2 Key Laboratory of Agricultural Soil and Water Engineering in Arid and Semiarid Areas, Ministry of Education, Northwest A&F University, Yangling, 712100, Shaanxi, China. Correspondence and requests for materials should be addressed to W.D. (email: dongw@nwafu.edu.cn)
Received: 26 April 2019
Accepted: 24 August 2019
Published: xx xx xxxx
DNA N6-methyldeoxyadenosine (6 mA) modifications were first discovered in bacteria in 1955 [1]. However, they have not received as much attention as 5-methylcytosine (5mC). One important reason is that 6 mA modifications were thought to be widespread only in prokaryotes and unicellular eukaryotes, and rare in multicellular eukaryotes [2,3]. Researchers have proposed several experimental methods to identify 6 mA modifications in the past few decades. The first method, developed by Dunn et al. in 1955, combines ultraviolet absorption spectra, electrophoretic mobility, and paper chromatographic movement, but it is relatively insensitive and cannot be used to detect 6 mA modifications in animals [1]. A restriction enzyme method was then used to discover 6 mA modifications in 1978. However, this method can only find modified adenosines that occur in the restriction enzyme target motifs [4]. With the development of high-throughput sequencing technology, thousands of 6 mA modifications were found in different multicellular eukaryotes. In 2015, Fu et al. found 6 mA modifications in 84% of the genes of Chlamydomonas by using 6 mA immunoprecipitation sequencing (6mA-IP-Seq) [5]. In 2016, Koziol et al. used dot blots, HPLC, and methyl DNA immunoprecipitation followed by sequencing (MeDIP-seq) to detect 6 mA modifications in vertebrates including Xenopus laevis, mouse and human [6]. In 2017, Mondo et al. observed that up to 2.8% of all adenines were methylated in early-diverging fungi by using single-molecule real-time (SMRT) sequencing [7]. In 2018, Zhou et al. found that about 0.2% of adenines in the rice genome were 6 mA methylated by using mass spectrometry, immunoprecipitation, and SMRT, and Zhang et al. observed that the 6 mA distributions in the rice and Arabidopsis genomes were very similar by using 6mA-IP-seq [8,9]. As the experimental methods are time-consuming and costly, researchers are trying to predict DNA 6 mA modifications by using computational methods. Two prediction tools have been reported so far, i.e., iDNA6mA-PseKNC [10] and i6mA-Pred [11]. iDNA6mA-PseKNC is the first tool for predicting 6 mA modifications in the Mus musculus genome, and i6mA-Pred is the first identification method for the rice genome.
Predicting DNA 6 mA modifications with computational algorithms is still in its infancy. However, in the parallel field of post-translational modification (PTM) site prediction, many papers have been published [12–22]. Although the details differ for each individual PTM, the basic approach is about the same. Thus, the feature extraction and classification methods proposed in these studies provide a valuable basis for the prediction of DNA 6 mA modifications. In this research, we aim to develop a prediction tool that can be used to predict DNA 6 mA modifications across species. The benchmark datasets created for the iDNA6mA-PseKNC and i6mA-Pred predictors were used, and different algorithms were implemented to generate the final optimized model. 5-fold cross-validation was performed, and the prediction results demonstrated that our model achieved better performance than existing 6 mA prediction tools.
As demonstrated by a series of recent publications [10,13–19] and summarized in two comprehensive review papers [23,24], to develop a really useful predictor for a biological system, one needs to follow Chou’s 5-step rule (a more detailed description can be found at https://en.wikipedia.org/wiki/5-step_rules) and go through the following five steps: (1) construct a gold standard dataset to train and test the model; (2) encode samples with effective formulations; (3) build the prediction model with a powerful classifier; (4) evaluate model performance by using cross-validation tests and standard measures; (5) establish a user-friendly web-server for the predictor that is accessible to the public. Below, we address these points one by one.
Method
Dataset generation. Feng et al. created a DNA 6 mA benchmark dataset of the M. musculus genome in 2018 [10]. This benchmark dataset includes 1,934 positive samples and 1,934 negative samples. Chen et al. released a 6 mA benchmark dataset of the rice genome in 2019 [11]. It consists of 880 positive samples and 880 negative samples. These two benchmark datasets were combined to create the cross-species dataset, and the CD-HIT-EST software [25] was run with different sequence identity thresholds to reduce sequence redundancy in the original datasets (Table 1). The final cross-species dataset consists of 2,768 positive samples and 2,716 negative samples at the most rigorous threshold of 0.80, and each sample is 41 nt long. To build a cross-species 6 mA prediction model, we used stratified selection and randomly selected 80% of the samples for model training and the remaining 20% for independent testing. The resulting training dataset consists of 2,214 positive samples and 2,214 negative samples, while the independent testing dataset includes 554 positive samples and 502 negative samples.
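The stratified 80/20 selection described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the authors' code: the helper name `stratified_split`, the random seed, and the toy sequences are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(samples, labels, test_fraction=0.2, seed=0):
    """Split samples into train/test parts while keeping class proportions.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    train = [(samples[i], labels[i]) for i in sorted(train_idx)]
    test = [(samples[i], labels[i]) for i in sorted(test_idx)]
    return train, test


# Toy data: 10 positive and 10 negative 41-nt placeholder sequences.
toy_samples = ["A" * 41] * 10 + ["C" * 41] * 10
toy_labels = [1] * 10 + [0] * 10
train_set, test_set = stratified_split(toy_samples, toy_labels)
```

Splitting each class separately is what keeps the positive-to-negative ratio of the full dataset in both parts.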
Feature encoding scheme. To construct a DNA 6 mA predictor, one of the most important but also most difficult issues is how to encode a feature vector for each sequence while retaining most of the key patterns. The pseudo amino acid composition (PseAAC) was proposed by Chou et al. and has been widely used in nearly all areas of computational proteomics [26,27]. Based on PseAAC, four powerful software tools, ‘PseAAC’ [28], ‘PseAAC-Builder’ [29], ‘propy’ [30], and ‘PseAAC-General’ [31], were established: the first three generate various modes of Chou’s special PseAAC [32], while the fourth generates those of Chou’s general PseAAC [23]. Encouraged by the success of PseAAC in dealing with protein/peptide sequences, the concept of Pseudo K-tuple Nucleotide Composition (PseKNC) [33] was developed for encoding features of DNA/RNA sequences [34–36] and has proved very useful as well. More recently, a powerful web-server called ‘Pse-in-One’ [37] and its updated version ‘Pse-in-One2.0’ [38] have been established, which can generate any desired feature vectors for protein/peptide and DNA/RNA sequences according to users’ needs.
K-mer pattern. K monomeric units (k-mers) are simply patterns of k consecutive nucleic acids [37], and there are $4^k$ possible k-mer patterns for DNA/RNA. For example, there are 4 possible 1-mer patterns and 16 possible 2-mer patterns. To calculate the frequencies of k-mer nucleotide patterns, the length L of the scanning region must be determined first; the absolute frequencies of the k-mer patterns are then counted over every window starting at positions 1 to $L - k + 1$. Finally, the relative frequencies of the k-mer patterns are calculated for each region. In this study, we set k to 2, 3, and 4 and extracted $4^2 + 4^3 + 4^4 = 336$ k-mer nucleotide patterns for feature encoding.
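As a minimal sketch of the k-mer encoding just described, the following plain-Python functions count every k-mer over the alphabet A, C, G, T and divide by the number of windows. The function names are hypothetical, and the code assumes the sequence is at least k nucleotides long.

```python
from itertools import product


def kmer_frequencies(sequence, k):
    """Relative frequency of every k-mer over the alphabet ACGT."""
    patterns = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(patterns, 0)
    n_windows = len(sequence) - k + 1  # a length-L sequence has L - k + 1 windows
    for start in range(n_windows):
        window = sequence[start:start + k]
        if window in counts:  # skip windows containing ambiguous bases such as N
            counts[window] += 1
    return [counts[p] / n_windows for p in patterns]


def kmer_features(sequence, ks=(2, 3, 4)):
    """Concatenate the k-mer frequency vectors for k = 2, 3 and 4."""
    features = []
    for k in ks:
        features.extend(kmer_frequencies(sequence, k))
    return features


vector = kmer_features("CAGCTG")
```

With k = 2, 3, 4 the concatenated vector has 16 + 64 + 256 = 336 entries, matching the dimension stated in the text.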
KSNPF frequency. The KSNPF features are the frequencies of nucleotide pairs separated by k arbitrary nucleotides, and they have been successfully employed for the prediction of mucin-type O-glycosylation sites [39] and phosphatase-specific dephosphorylation sites [40]. The KSNPF can be calculated using the following equation:

$$f(n_1\,\mathrm{Gap}(k)\,n_2) = \frac{S(n_1\,\mathrm{Gap}(k)\,n_2)}{L - k - 1} \tag{1}$$

where $n_1$ and $n_2$ represent a pair of sequence elements. For nucleotides, n stands for any one of A, C, G, T/U. Thus, there are $4^2 = 16$ combinations for each pair. Gap(k) stands for k arbitrary elements between the pair, and $S(n_1\,\mathrm{Gap}(k)\,n_2)$ indicates the number of occurrences of the element pair. In this study, L represents the length of the nucleotide sequence, k was set to 1, 2, 3, and 4, and the dimension of the KSNPF is therefore $4^2 \times 4 = 64$.
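Equation (1) can be illustrated with a short plain-Python sketch. The function name `ksnpf_features` and the example sequence are hypothetical; the code assumes the sequence is longer than the largest gap plus one.

```python
from itertools import product


def ksnpf_features(sequence, gaps=(1, 2, 3, 4)):
    """Frequencies of nucleotide pairs separated by k arbitrary nucleotides."""
    pairs = ["".join(p) for p in product("ACGT", repeat=2)]
    length = len(sequence)
    features = []
    for k in gaps:
        counts = dict.fromkeys(pairs, 0)
        n_positions = length - k - 1  # denominator L - k - 1 of Eq. (1)
        for i in range(n_positions):
            pair = sequence[i] + sequence[i + k + 1]
            if pair in counts:  # ignore pairs with ambiguous bases
                counts[pair] += 1
        features.extend(counts[p] / n_positions for p in pairs)
    return features


example_features = ksnpf_features("ACGTACGTAC")
```

Each gap contributes 16 pair frequencies, so four gaps give the 64-dimensional vector mentioned above.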
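Table 1. Number of sequences retained in each dataset after reducing sequence redundancy with the CD-HIT-EST software at different sequence identity thresholds.

| Species | Dataset | 0.95 | 0.90 | 0.85 | 0.80 |
|---|---|---|---|---|---|
| Mouse | Positive | 1,931 | 1,924 | 1,914 | 1,892 |
| Mouse | Negative | 1,885 | 1,866 | 1,844 | 1,836 |
| Rice | Positive | 880 | 879 | 878 | 876 |
| Rice | Negative | 880 | 880 | 880 | 880 |
| Cross-species | Positive | 2,811 | 2,803 | 2,792 | 2,768 |
| Cross-species | Negative | 2,767 | 2,746 | 2,724 | 2,716 |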
Nucleic shift density. Nucleic shift density encoding gives the density of the nucleotide at the current position within its prefix string, and it has been used to encode nucleotide sequences in the iDNA6mA-PseKNC predictor [10]. The nucleic shift density feature at position i is defined as follows:

$$d_i = \frac{1}{N_i}\sum_{j=1}^{i} F(n_j), \qquad F(n_j) = \begin{cases} 1 & \text{if } n_j = q \\ 0 & \text{otherwise} \end{cases} \tag{2}$$

where q represents the nucleotide at the current position i, $N_i$ is the length of the ith prefix string in the sequence, and $n_j$ is the nucleotide at position j. For example, in the DNA sequence “CAGCTG”, the density of ‘C’ in the prefixes ending at positions 1, 2, 3, 4, 5 and 6 is 1/1 = 1, 1/2 = 0.5, 1/3 ≈ 0.33, 2/4 = 0.5, 2/5 = 0.4 and 2/6 ≈ 0.33, respectively. In this study, each sample is 41 nt long. Thus, 41 nucleic shift density features were generated for each sample.
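The definition and the "CAGCTG" example can be checked with a small plain-Python sketch. Both function names are hypothetical: `nucleic_shift_density` applies Eq. (2) with q taken as the nucleotide at each position, while `prefix_density` fixes one nucleotide (here 'C') to reproduce the worked example.

```python
def nucleic_shift_density(sequence):
    """Density of the nucleotide at each position within its prefix (Eq. 2)."""
    counts = {}
    densities = []
    for i, base in enumerate(sequence, start=1):
        counts[base] = counts.get(base, 0) + 1
        densities.append(counts[base] / i)
    return densities


def prefix_density(sequence, base):
    """Density of one chosen base in every prefix of the sequence."""
    seen = 0
    result = []
    for i, current in enumerate(sequence, start=1):
        seen += current == base
        result.append(seen / i)
    return result


c_density = prefix_density("CAGCTG", "C")
features = nucleic_shift_density("CAGCTG")
```

The two functions agree at every position where the current nucleotide is 'C'; for a 41-nt sample, `nucleic_shift_density` returns 41 values.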
Binary code. The Binary encoding scheme is used to predict 6 mA modifications in the iDNA6mA-PseKNC predictor [10]. For the nucleotide $n_i$ at position i, the Binary features are defined as follows:

$$x_i = \begin{cases} 1 & \text{if } n_i \in \{A, G\} \\ 0 & \text{if } n_i \in \{C, T\} \end{cases}, \quad y_i = \begin{cases} 1 & \text{if } n_i \in \{A, T\} \\ 0 & \text{if } n_i \in \{C, G\} \end{cases}, \quad z_i = \begin{cases} 1 & \text{if } n_i \in \{A, C\} \\ 0 & \text{if } n_i \in \{G, T\} \end{cases} \tag{3}$$

In this research, the Binary encoding scheme generates a vector with 3 × 41 = 123 elements by characterizing each nucleotide, “A”, “C”, “G”, or “T”, with (1, 1, 1), (0, 0, 1), (1, 0, 0), or (0, 1, 0), respectively.
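A short plain-Python sketch of the Binary encoding follows. The lookup table reproduces the four triplets listed in the text, and the hypothetical helper `binary_from_rules` derives the same triplets from the three membership rules of Eq. (3); the names are illustrative only.

```python
# Triplets (x, y, z) for each nucleotide, as listed in the text.
BINARY_CODE = {
    "A": (1, 1, 1),
    "C": (0, 0, 1),
    "G": (1, 0, 0),
    "T": (0, 1, 0),
}


def binary_from_rules(base):
    """Derive (x, y, z) from the membership rules of Eq. (3)."""
    return (int(base in "AG"), int(base in "AT"), int(base in "AC"))


def binary_features(sequence):
    """Concatenate the three-bit code of every nucleotide in the sequence."""
    features = []
    for base in sequence:
        features.extend(BINARY_CODE[base])
    return features


encoded = binary_features("CAGCTG")
```

For a 41-nt sample this yields 3 × 41 = 123 values.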
Motif score matrix. The MEME Suite (http://meme-suite.org/) consists of several motif-based sequence analysis tools. In this study, the MEME tool was run in differential enrichment mode with the maximum number of motifs set to 10. The most enriched motifs were selected based on E-value, and the motif matrices were used to generate motif scores for each sample.
Performance evaluation. Five different classifiers, Random Forest, GradientBoosting, AdaBoost, ExtraTrees and SVM, were implemented in Python. For the Random Forest, GradientBoosting, AdaBoost, and ExtraTrees classifiers, 1,000 trees were used for each. For SVM, grid search was used to find the best combination of the C and gamma parameters. 5-fold cross-validation was used to evaluate the performance of our model. In each fold of cross-validation, one subset was selected in turn as the testing set, while the remaining 4 subsets were used to train the model. The mean results of the five experiments were used as the performance estimates of the algorithms.
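The 5-fold procedure can be sketched in plain Python as follows. This is an illustration rather than the authors' implementation: the function names, the seed, and the toy majority-class model are assumptions, and `fit` and `predict` stand in for whichever classifier is being evaluated.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) for each of the n_folds folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for held_out in range(n_folds):
        test_idx = folds[held_out]
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train_idx, test_idx


def cross_validate(samples, labels, fit, predict, n_folds=5, seed=0):
    """Mean accuracy over the folds; fit and predict are caller-supplied."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(samples), n_folds, seed):
        model = fit([samples[i] for i in train_idx], [labels[i] for i in train_idx])
        predicted = [predict(model, samples[i]) for i in test_idx]
        correct = sum(p == labels[i] for p, i in zip(predicted, test_idx))
        scores.append(correct / len(test_idx))
    return mean(scores)


def fit_majority(train_samples, train_labels):
    """Toy model: remember the most frequent training label."""
    return max(set(train_labels), key=train_labels.count)


def predict_majority(model, sample):
    return model


toy_samples = list(range(20))
toy_labels = [1] * 15 + [0] * 5
accuracy = cross_validate(toy_samples, toy_labels, fit_majority, predict_majority)
```

Fixing the seed keeps the fold assignment identical across classifiers, which is what allows a comparison on the same folds.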
Based on the symbols Chou introduced for studying signal peptides [41,42], four standard measures were derived and have been adopted by several recent publications [43–45]. The measures are defined as follows:

$$
\begin{cases}
Sn = 1 - \dfrac{N_-^+}{N^+} \\[2ex]
Sp = 1 - \dfrac{N_+^-}{N^-} \\[2ex]
ACC = 1 - \dfrac{N_-^+ + N_+^-}{N^+ + N^-} \\[2ex]
MCC = \dfrac{1 - \left(\dfrac{N_-^+}{N^+} + \dfrac{N_+^-}{N^-}\right)}{\sqrt{\left(1 + \dfrac{N_+^- - N_-^+}{N^+}\right)\left(1 + \dfrac{N_-^+ - N_+^-}{N^-}\right)}}
\end{cases} \tag{4}
$$

where $N^+$ and $N^-$ refer to the numbers of positive and negative samples, respectively, $N_-^+$ stands for the number of positive samples that were predicted to be negative, and $N_+^-$ refers to the number of negative samples that were predicted to be positive. However, these measures are valid only for single-label learning problems. For multi-label learning problems, which are more common in systems biology [46], systems medicine [47] and biomedicine [16], a completely different set of standard measures is needed [48]. In addition, the receiver operating characteristic (ROC) curve combined with the area under the ROC curve (AUC), the Precision-Recall curve combined with the average precision (AP), and the F1 score [49] were also used to evaluate the performance of the different classifiers.
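The four measures of Eq. (4) can be computed directly from the four counts. The sketch below uses hypothetical function and argument names and made-up counts; it also compares the MCC with the familiar confusion-matrix form, to which it is algebraically equivalent.

```python
import math


def chou_metrics(n_pos, n_neg, pos_as_neg, neg_as_pos):
    """Sn, Sp, ACC and MCC from Eq. (4).

    n_pos / n_neg: numbers of positive / negative samples.
    pos_as_neg: positives predicted as negative.
    neg_as_pos: negatives predicted as positive.
    """
    sn = 1 - pos_as_neg / n_pos
    sp = 1 - neg_as_pos / n_neg
    acc = 1 - (pos_as_neg + neg_as_pos) / (n_pos + n_neg)
    mcc = (1 - (pos_as_neg / n_pos + neg_as_pos / n_neg)) / math.sqrt(
        (1 + (neg_as_pos - pos_as_neg) / n_pos)
        * (1 + (pos_as_neg - neg_as_pos) / n_neg)
    )
    return {"Sn": sn, "Sp": sp, "ACC": acc, "MCC": mcc}


def confusion_mcc(tp, fn, tn, fp):
    """MCC in the usual confusion-matrix form, for comparison."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator


# Made-up counts: 100 positives (10 missed) and 100 negatives (20 false alarms).
metrics = chou_metrics(n_pos=100, n_neg=100, pos_as_neg=10, neg_as_pos=20)
```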
Using graphic approaches to study biological and medical systems can provide an intuitive vision and useful insights for analyzing complicated relations, as shown for enzyme fast reaction systems [50], graphical rules in molecular biology [51], and low-frequency internal motion in biomacromolecules such as protein and DNA [52]. This kind of insightful implication has also been demonstrated in ref. 53 and many follow-up publications [54–56]. The framework of csDMA is shown in Fig. 1.
As pointed outby Chou et al.57 and demonstrated in a series of recent publications1618, publicly accessible
web-servers or online bioinformatics tools have signicantly increased the impacts of bioinformatics on medical
science58, driving medicinal chemistry into an unprecedented revolution59. Accordingly, the datasets and online
tool involved in this paper are all available at https://github.com/liuze-nwafu/csDMA.
Results
Differential enrichment motifs discovery. To find the enriched motifs flanking 6 mA sites, the MEME tool was run in differential enrichment mode with the maximum number of motifs set to 10. We used the positive samples in the cross-species dataset as the input and treated the negative samples as the control sequences. Detailed information on the enriched motifs can be found in the supplementary materials. To account for statistical significance, only motifs with an E-value lower than 0.05 were kept, and two motifs were selected. The first motif, NNNNNNNHHNHHNHWNTNTNWNNNWNYNNNNNNNNNNNNNN, with an E-value of 3.3e-18, was the most statistically significant. The third motif, ACCGATCSA, with an E-value of 2.9e-2, was also selected. The probability matrices can also be downloaded from the MEME website and used to build motif score matrices in the training process.
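The motifs above are written with IUPAC nucleotide codes (for example, S stands for C or G and W for A or T). As a rough illustration of what such a consensus string means, the sketch below turns it into a regular expression. This is a simplification and not the scoring used by csDMA, which relies on the MEME probability matrices; the helper name `motif_to_regex` and the test sequence are hypothetical.

```python
import re

# Standard IUPAC nucleotide codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def motif_to_regex(consensus):
    """Compile an IUPAC consensus string into a regular expression."""
    return re.compile("".join(IUPAC[symbol] for symbol in consensus))


selected_motif = motif_to_regex("ACCGATCSA")
hit = selected_motif.search("TTACCGATCGATT")
```

A consensus match is all-or-nothing, whereas a probability matrix assigns a graded score to every window, which is why the matrices are preferable as features.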
Figure 1. The framework of csDMA.
Model training with different feature subsets. To find the best combination of feature subsets, different feature subsets were fed into the Random Forest classifier, and 5-fold cross-validation on the training dataset was used to evaluate the performance of our model. As shown in Fig. 2, the classifier achieved an AUC of 0.866 using the Binary code features alone, which indicates that the Binary code features were the most informative for distinguishing positive samples from negative samples. Interestingly, this result was even slightly higher than that of combined feature subsets such as Motif and Binary or Ksnpf and Binary, which achieved AUC values of 0.861 and 0.862, respectively. The model achieved the best AUC of 0.871 when the three feature subsets Motif, Kmer, and Binary were selected. This result was even slightly better than the performance obtained with all feature subsets. Thus, we used the Motif, Kmer, and Binary encoding schemes to generate the optimized feature matrix.
Performance evaluation with different classifiers. Five different algorithms were implemented in this research. For the Random Forest, GradientBoosting, AdaBoost, and ExtraTrees classifiers, 1,000 trees were used for each. For the SVM classifier, grid search was used to find the best combination of the C and gamma parameters, and the SVM classifier achieved its best performance with C of 0.98 and gamma of 0.01. To compare the performance of the different classifiers, 5-fold cross-validation was used and each classifier was trained with the same folds. As shown in Fig. 3, the ExtraTrees classifier achieved the best ACC of 0.799 and Sn of 0.864, while AdaBoost had the lowest ACC of 0.715, Sn of 0.713, and Sp of 0.718. The ExtraTrees classifier was weaker at predicting negative samples, with an Sp of 0.735, but this is only slightly lower than that of the other methods. A more detailed comparison of the different classifiers is shown in Table 2. Moreover, the ExtraTrees classifier also achieved the highest MCC of 0.603, AUC of 0.878 and F1 of 0.811. Thus, we used the ExtraTrees algorithm to train the optimized model.
The independent testing dataset was also used to further evaluate the performance of each classifier. Each classifier was trained on the whole training dataset and evaluated on the independent testing dataset. As shown in Table 3, the ExtraTrees classifier achieved the best Sn of 0.888, AUC of 0.893 and F1 of 0.832, while the SVM model had the highest Sp of 0.761. Interestingly, the performance of each classifier on the independent testing dataset was even slightly higher than that on the training dataset, which suggests that the classifiers may perform better with a larger training dataset.
Figure 2. Model performance based on the different feature subsets. 1,000 decision trees were used in the Random Forest classifier and 5-fold cross-validation was used to evaluate the performance of csDMA.
Figure 3. The model performance of different classifiers. The Motif, Kmer, and Binary feature subsets were fed into each classifier and the optimized parameters were used for model training. To evaluate the performance of each classifier, 5-fold cross-validation was used with standard measures such as ACC, Sn and Sp.
Table 2. Model performance of each algorithm on the training dataset. The highest value of each column is marked with an asterisk.

| Algorithm | Sn | Sp | ACC | MCC | AUC | F1 |
|---|---|---|---|---|---|---|
| RandomForest | 0.853 | 0.735 | 0.794 | 0.593 | 0.871 | 0.806 |
| GradientBoosting | 0.743 | 0.762 | 0.752 | 0.506 | 0.818 | 0.750 |
| AdaBoost | 0.713 | 0.718 | 0.715 | 0.431 | 0.777 | 0.715 |
| ExtraTrees | 0.864* | 0.735 | 0.799* | 0.603* | 0.878* | 0.811* |
| SVM | 0.807 | 0.764* | 0.785 | 0.572 | 0.858 | 0.790 |
Comparison with existing 6 mA predictors. The SVM-based tool iDNA6mA-PseKNC was also implemented in this research. Grid search was used to find the best C and gamma, and iDNA6mA-PseKNC achieved its best performance with C of 0.336 and gamma of 0.02. The same folds used for training csDMA were also used for training iDNA6mA-PseKNC. The iDNA6mA-PseKNC predictor achieved Sn of 0.767, Sp of 0.769, ACC of 0.767, MCC of 0.536, and F1 of 0.767. All of these measures except Sp were lower than those of our model with the ExtraTrees classifier. To further compare the performance of the two algorithms, the ROC and Precision-Recall curves were plotted in Fig. 4. Our model achieved an AUC of 0.893, while iDNA6mA-PseKNC achieved an AUC of 0.840, which also demonstrates that our model performed better than the iDNA6mA-PseKNC predictor.
To test the performance of our model across species, we compared csDMA and iDNA6mA-PseKNC on three datasets: cross-species, rice, and M. musculus. For each dataset, 5-fold cross-validation was performed and the previously optimized parameters were used. We used the same folds for training on the different datasets. The results of the five rounds were averaged for each measure and are shown in Table 4. For the cross-species dataset, iDNA6mA-PseKNC achieved an AUC of 0.844, while our model achieved a higher AUC of 0.879. For the rice dataset, iDNA6mA-PseKNC achieved an AUC of 0.896, while our model achieved a higher AUC of 0.923. For the M. musculus dataset, both models had the same AUC, but our model achieved higher MCC and F1 than iDNA6mA-PseKNC. These results show that the proposed method is accurate and can be used to predict 6 mA sites in different species.
Table 3. Model performance of the different algorithms on the independent testing dataset. The highest value of each column is marked with an asterisk.

| Algorithm | Sn | Sp | ACC | MCC | AUC | F1 |
|---|---|---|---|---|---|---|
| RandomForest | 0.875 | 0.747 | 0.814* | 0.630* | 0.884 | 0.832* |
| GradientBoosting | 0.765 | 0.757 | 0.761 | 0.522 | 0.854 | 0.771 |
| AdaBoost | 0.776 | 0.719 | 0.749 | 0.496 | 0.814 | 0.764 |
| ExtraTrees | 0.888* | 0.729 | 0.813 | 0.628 | 0.893* | 0.832* |
| SVM | 0.843 | 0.761* | 0.804 | 0.607 | 0.875 | 0.819 |

Figure 4. Performance comparison of csDMA and iDNA6mA-PseKNC. (A) The ROC curves of csDMA and iDNA6mA-PseKNC. (B) The Precision-Recall curves of csDMA and iDNA6mA-PseKNC.
Discussion
Unlike the prediction of m6A modifications in mRNA, the identification of 6 mA modifications in DNA is still in its early stages. In this study, we developed an improved tool, called csDMA, for predicting 6 mA modifications in different species. Three feature encoding strategies were used to generate the feature matrix, and different algorithms were evaluated for the model. For performance evaluation, 5-fold cross-validation and an independent test were used, and the ExtraTrees classifier achieved the best performance on both the training and independent test datasets. We also compared the performance of our tool with that of iDNA6mA-PseKNC, and the results showed that our model effectively improved the recognition of DNA 6 mA modifications.
The i6mA-Pred predictor is the other of the two existing tools for DNA 6 mA prediction. However, its research paper was still in the corrected proof phase, and the method was not accessible before our work was finished. Fortunately, we learned from the online abstract that the method achieved an ACC of 0.831 in a jackknife test. Because a jackknife test yields a fixed ACC on a given dataset, and their dataset is the one used as the rice dataset in this study, we also evaluated our model on the rice dataset with a jackknife test. Our model achieved an ACC of 0.859, which is higher than that of i6mA-Pred.
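A jackknife test is leave-one-out cross-validation: each sample is held out once and predicted by a model trained on all the others. For a deterministic classifier there is no random partition, which is why it gives a fixed ACC on a given dataset. The sketch below is a plain-Python illustration under stated assumptions; the function names and the toy nearest-neighbour model are hypothetical and unrelated to the classifiers used in the paper.

```python
def jackknife_accuracy(samples, labels, fit, predict):
    """Leave-one-out accuracy; fit and predict are caller-supplied."""
    correct = 0
    for held_out in range(len(samples)):
        train_x = samples[:held_out] + samples[held_out + 1:]
        train_y = labels[:held_out] + labels[held_out + 1:]
        model = fit(train_x, train_y)
        correct += predict(model, samples[held_out]) == labels[held_out]
    return correct / len(samples)


def fit_nearest(train_x, train_y):
    """Toy 1-nearest-neighbour model: store the training pairs."""
    return list(zip(train_x, train_y))


def predict_nearest(model, x):
    """Return the label of the closest stored training value."""
    return min(model, key=lambda pair: abs(pair[0] - x))[1]


toy_x = [0.0, 0.1, 0.2, 1.0, 1.1, 1.2]
toy_y = [0, 0, 0, 1, 1, 1]
first_run = jackknife_accuracy(toy_x, toy_y, fit_nearest, predict_nearest)
second_run = jackknife_accuracy(toy_x, toy_y, fit_nearest, predict_nearest)
```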
Although our model performed well on the M. musculus dataset, its performance on the rice and cross-species datasets was relatively low. In the future, more feature encoding schemes, such as genomic and structural features, will be used to improve the performance of csDMA. We will also extend csDMA to other species, such as human and Arabidopsis thaliana.
Table 4. Model performance of each algorithm across species.

| Algorithm | Species | Sn | Sp | ACC | MCC | AUC | F1 |
|---|---|---|---|---|---|---|---|
| csDMA | Cross-species | 0.863 | 0.735 | 0.799 | 0.603 | 0.879 | 0.811 |
| csDMA | Rice | 0.842 | 0.880 | 0.861 | 0.723 | 0.923 | 0.858 |
| csDMA | M. musculus | 0.932 | 1 | 0.966 | 0.935 | 0.974 | 0.965 |
| iDNA6mA-PseKNC | Cross-species | 0.762 | 0.769 | 0.765 | 0.531 | 0.844 | 0.764 |
| iDNA6mA-PseKNC | Rice | 0.569 | 0.721 | 0.641 | 0.394 | 0.896 | 0.543 |
| iDNA6mA-PseKNC | M. musculus | 0.869 | 1 | 0.935 | 0.877 | 0.974 | 0.930 |

References
1. Dunn, D. B. & Smith, J. D. Occurrence of a new base in the deoxyribonucleic acid of a strain of Bacterium coli. Nature. 175, 336–337 (1955).
2. Vanyushin, B. F., Belozersky, A. N., Kokurina, N. A. & Kadirova, D. X. 5-Methylcytosine and 6-Methylaminopurine in Bacterial DNA. Nature. 218, 1066–1067 (1968).
3. Casadesus, J. & Low, D. Epigenetic gene regulation in the bacterial world. Microbiology and Molecular Biology Reviews. 70, 830 (2006).
4. Bird, A. Use of restriction enzymes to study eukaryotic DNA methylation: II. The symmetry of methylated sites supports semi-conservative copying of the methylation pattern. Journal of Molecular Biology. 118, 49–60 (1978).
5. Fu, Y. et al. N6-Methyldeoxyadenosine marks active transcription start sites in Chlamydomonas. Cell. 161, 879–892 (2015).
6. Koziol, M. J. et al. Identification of methylated deoxyadenosines in vertebrates reveals diversity in DNA modifications. Nature Structural & Molecular Biology. 23, 24–30 (2016).
7. Mondo, S. et al. Widespread adenine N6-methylation of active genes in fungi. Nature Genetics. 49 (2017).
8. Zhou, C. et al. Identification and analysis of adenine N6-methylation sites in the rice genome. Nature Plants. 4, 554–563 (2018).
9. Zhang, Q. et al. N(6)-Methyladenine DNA methylation in Japonica and Indica rice genomes and its association with gene expression, Plant Development, and Stress Responses. Molecular Plant. 11, 1492–1508 (2018).
10. Feng, P. M. et al. iDNA6mA-PseKNC: Identifying DNA N6-methyladenosine sites by incorporating nucleotide physicochemical properties into PseKNC. Genomics. 111, 96–102 (2018).
11. Chen, W., Lv, H., Nie, F. & Lin, H. i6mA-Pred: identifying DNA N6-methyladenine sites in the rice genome. Bioinformatics. btz015 (2019).
12. Xu, Y. et al. iNitro-Tyr: Prediction of nitrotyrosine sites in proteins with general pseudo amino acid composition. Plos One. 9, e105018 (2014).
13. Chen, W., Feng, P., Ding, H., Lin, H. & Chou, K. C. iRNA-Methyl: Identifying N6-methyladenosine sites using pseudo nucleotide composition. Analytical Biochemistry. 490, 26–33 (2015).
14. Chen, W., Tang, H., Ye, J., Lin, H. & Chou, K. C. iRNA-PseU: Identifying RNA pseudouridine sites. Molecular Therapy-Nucleic Acids. 5, e332 (2016).
15. Jia, J., Zhang, L. X., Liu, Z., Xiao, X. & Chou, K. C. pSumo-CD: Predicting sumoylation sites in proteins with covariance discriminant algorithm by incorporating sequence-coupled effects into general PseAAC. Bioinformatics. 32, 3133–3141 (2016).
16. Qiu, W. R., Sun, B. Q., Xiao, X., Xu, Z. C. & Chou, K. C. iPTM-mLys: identifying multiple lysine PTM sites and their different types. Bioinformatics. 32, 3116–3123 (2016).
17. Feng, P. et al. iRNA-PseColl: Identifying the occurrence sites of different RNA modifications by incorporating collective effects of nucleotides into PseKNC. Molecular Therapy-Nucleic Acids. 7, 155–163 (2017).
18. Chen, W. et al. iRNA-3typeA: identifying 3-types of modification at RNA’s adenosine sites. Molecular Therapy-Nucleic Acids. 11, 468–474 (2018).
19. Qiu, W. R. et al. iKcr-PseEns: Identify lysine crotonylation sites in histone proteins with pseudo components and ensemble classifier. Genomics. 110, 239–246 (2018).
20. Li, F. et al. Positive-unlabelled learning of glycosylation sites in the human proteome. BMC Bioinformatics. 20, 112 (2019).
21. Zhang, Y. et al. Computational analysis and prediction of lysine malonylation sites by exploiting informative features in an integrative machine-learning framework. Briefings in Bioinformatics. https://doi.org/10.1093/bib/bby079 (2018).
22. Chen, Z. et al. Large-scale comparative assessment of computational predictors for lysine post-translational modification sites. Briefings in Bioinformatics. https://doi.org/10.1093/bib/bby089 (2018).
23. Chou, K. C. Some remarks on protein attribute prediction and pseudo amino acid composition. Journal of Theoretical Biology. 273, 236–247 (2011).
24. Chou, K. C. Advance in predicting subcellular localization of multi-label proteins and its implication for developing multi-target drugs. Current Medicinal Chemistry. https://doi.org/10.2174/0929867326666190507082559 (2019).
25. Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 28, 3150–3152 (2012).
26. Chou, K. C. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins. 43, 246–255 (2001).
27. Chou, K. C. Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics. 21, 10–19 (2005).
28. Shen, H. B. & Chou, K. C. PseAAC: a flexible web-server for generating various kinds of protein pseudo amino acid composition. Analytical Biochemistry. 373, 386–388 (2008).
29. Du, P., Wang, X., Xu, C. & Gao, Y. PseAAC-Builder: A cross-platform stand-alone program for generating various special Chou’s pseudo amino acid compositions. Analytical Biochemistry. 425, 117–119 (2012).
30. Cao, D. S., Xu, Q. S. & Liang, Y. Z. propy: a tool to generate various modes of Chou’s PseAAC. Bioinformatics. 29, 960–962 (2013).
31. Du, P., Gu, S. & Jiao, Y. PseAAC-General: Fast building various modes of general form of Chou’s pseudo amino acid composition for large-scale protein datasets. International Journal of Molecular Sciences. 15, 3495–3506 (2014).
32. Chou, K. C. Pseudo amino acid composition and its applications in bioinformatics, proteomics and system biology. Current Proteomics. 6, 262–274 (2009).
33. Chen, W., Lei, T. Y., Jin, D. C., Lin, H. & Chou, K. C. PseKNC: a flexible web-server for generating pseudo K-tuple nucleotide composition. Analytical Biochemistry. 456, 53–60 (2014).
34. Chen, W. & Lin, H. Pseudo nucleotide composition or PseKNC: an effective formulation for analyzing genomic sequences. Molecular BioSystems. 11, 2620–2634 (2015).
35. Liu, B., Yang, F., Huang, D. S. & Chou, K. C. iPromoter-2L: a two-layer predictor for identifying promoters and their types by multi-window-based PseKNC. Bioinformatics. 34, 33–40 (2018).
36. Tahir, M., Tayara, H. & Chong, K. T. iRNA-PseKNC(2methyl): Identify RNA 2-O-methylation sites by convolution neural network and Chou’s pseudo components. Journal of Theoretical Biology. 465, 1–6 (2019).
37. Liu, B. et al. Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic Acids Research. 43, W65–W71 (2015).
38. Liu, B. & Wu, H. Pse-in-One 2.0: An improved package of web servers for generating various modes of pseudo components of DNA, RNA, and protein sequences. Natural Science. 9, 67–91 (2017).
39. Chen, Y., Tang, Y., Sheng, Z. & Zhang, Z. Prediction of mucin-type O-glycosylation sites in mammalian proteins using the composition of k-spaced amino acid pairs. BMC Bioinformatics. 9, 101 (2008).
40. Wang, X., Yan, R. & Song, J. DephosSite: a machine learning approach for discovering phosphotase-specific dephosphorylation sites. Scientific Reports. 6, 23510 (2016).
41. Chou, K. C. Using subsite coupling to predict signal peptides. Protein Engineering. 14, 75–79 (2001).
42. Chou, K. C. Prediction of signal peptides using scaled window. Peptides. 22, 1973–1979 (2001).
43. Liu, B., Wang, S., Long, R. & Chou, K. C. iSpot-EL: identify recombination spots with an ensemble learning approach. Bioinformatics. 33, 35–41 (2017).
44. Cheng, X., Lin, W. Z., Xiao, X. & Chou, K. C. pLoc_bal-mAnimal: predict subcellular localization of animal proteins by balancing training dataset and PseAAC. Bioinformatics. 35, 398–406 (2019).
45. Song, J., Wang, Y. & Li, F. iProt-Sub: a comprehensive package for accurately mapping and predicting protease-specific substrates and cleavage sites. Briefings in Bioinformatics. 20, 638–658 (2018).
46. Cheng, X., Zhao, S. G., Lin, W. Z., Xiao, X. & Chou, K. C. pLoc-mAnimal: predict subcellular localization of animal proteins with both single and multiple sites. Bioinformatics. 33, 3524–3531 (2017).
47. Cheng, X., Zhao, S. G., Xiao, X. & Chou, K. C. iATC-mISF: a multi-label classifier for predicting the classes of anatomical therapeutic chemicals. Bioinformatics. 33, 341–346 (2017).
48. Chou, K. C. Some remarks on predicting multi-label attributes in molecular biosystems. Molecular Biosystems. 9, 1092–1100 (2013).
49. Song, J. et al. Transcriptome-wide annotation of m5C RNA modifications using machine learning. Frontiers in Plant Science. 9, 519 (2018).
50. Chou, . C. & Forsén, S. Diusion-controlled eects in reversible enzymatic fast reaction system: Critical spherical shell and
proximity rate constants. Biophysical Chemistry. 12, 255–263 (1980).
51. Carter, . E. & Forsén, S. A new graphical method for deriving rate equations for complicated mechanisms. Chemica Scripta. 18,
82–86 (1981).
52. Chou, ., Chen, N. & Forsén, S. e biological functions of low-frequency phonons: 2. Cooperative eects. Chemica Scripta. 18,
126–132 (1981).
53. Jiang, S. P., Liu, W. M. & Fee, C. H. Graph theory of enzyme inetics: 1. Steady-state reaction system. Scientia Sinica. 22, 341–358
(1979).
54. Shen, H. B., Song, J. N. & Chou, . C. Prediction of protein folding rates from primary sequence by fusing multiple sequential
features. Journal of Biomedical Science and Engineering. 2, 136–143 (2009).
55. Chou, . C. Graphic rule for drug metabolism systems. Current Drug Metabolism. 11, 369–378 (2010).
56. Zhou, G. P. e disposition of the LZCC protein residues in wenxiang diagram provides new insights into the protein-protein
interaction mechanism. Journal of eoretical Biology. 284, 142–148 (2011).
57. Chou, . C. & Shen, H. B. ecent advances in developing web-servers for predicting protein attributes. Natural Science. 1, 63–92
(2009).
58. Chou, . C. Impacts of bioinformatics to medicinal chemistry. Medicinal Chemistry. 11, 218–234 (2015).
59. Chou, . C. An unprecedented revolution in medicinal chemistry driven by the progress of biological science. Current Topics in
Medicinal Chemistry. 17, 2337–2358 (2017).
Acknowledgements
This work was supported by the Start-up foundation of Northwest A&F University (Z109021809), the National
Natural Science Foundation of China (51809218), and the Postdoctoral Research Foundation of China
(2018M643744).
Author Contributions
Z.L. participated in conceiving and performing the experiments. W.D. and W.J. participated in analyzing the data.
All authors contributed to the writing of the manuscript.
Additional Information
Supplementary information accompanies this paper at https://doi.org/10.1038/s41598-019-49430-4.
Competing Interests: The authors declare no competing interests.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
© The Author(s) 2019
... In order to more precisely distinguish modification sites, for 4mc modification prediction, 24 different classifiers have been proposed, out of which 11 are based on machine learning (ML) [61,49,36,5,103,14,88,35,93,59,24] and 13 are based on deep learning (DL) [54,107,89,98,94,4,78,83,44,22,1,73,99]. Similarly, for 6ma modification prediction, out of 15 different classifiers, 8 are based on DL [10,77,100,80,97,2,39,72] and 7 are based on ML [9,7,34,55,45,70,84]. Two generalized approaches iDNA-Ms [58] and iDNA-MT [94] which are capable to predict multiple types of DNA modifications, are based on DL [94] and ML classifiers [58] respectively. ...
... To summarise, prior mentioned deep learning classifiers make use of different neural architectures i.e., Multi-Layer perceptron models (MLPs) [102], convolutional neural networks (CNNs) [77], recurrent neural networks (RNNs) [94], hybrid neural network (CNNs+RNNs) [10], and language models like transformers [99]. Whereas, ML based approaches make use of traditional ML classifiers i.e., random forest [5], decision trees (DT) [5], adaboost [55], random forest (RF) [35], extra tree classifier (EXT) [55], gradient boosting (GB) [84], naive bayes (NB) [49,24], logistic regression (LR) [24], and support vector machine (SVM) [99] classifier. ...
... To summarise, prior mentioned deep learning classifiers make use of different neural architectures i.e., Multi-Layer perceptron models (MLPs) [102], convolutional neural networks (CNNs) [77], recurrent neural networks (RNNs) [94], hybrid neural network (CNNs+RNNs) [10], and language models like transformers [99]. Whereas, ML based approaches make use of traditional ML classifiers i.e., random forest [5], decision trees (DT) [5], adaboost [55], random forest (RF) [35], extra tree classifier (EXT) [55], gradient boosting (GB) [84], naive bayes (NB) [49,24], logistic regression (LR) [24], and support vector machine (SVM) [99] classifier. ...
Article
Full-text available
Accurate prediction of deoxyribonucleic acid (DNA) modifications is essential to explore and discern the process of cell differentiation, gene expression and epigenetic regulation. Several computational approaches have been proposed for particular type-specific DNA modification prediction. Two recent generalized computational predictors are capable of detecting three different types of DNA modifications; however, type-specific and generalized modifications predictors produce limited performance across multiple species mainly due to the use of ineffective sequence encoding methods. The paper in hand presents a generalized computational approach “DNA-MP” that is competent to more precisely predict three different DNA modifications across multiple species. Proposed DNA-MP approach makes use of a powerful encoding method “position specific nucleotides occurrence based 117 on modification and non-modification class densities normalized difference” (POCD-ND) to generate the statistical representations of DNA sequences and a deep forest classifier for modifications prediction. POCD-ND encoder generates statistical representations by extracting position specific distributional information of nucleotides in the DNA sequences. We perform a comprehensive intrinsic and extrinsic evaluation of the proposed encoder and compare its performance with 32 most widely used encoding methods on 17 benchmark DNA modifications prediction datasets of 12 different species using 10 different machine learning classifiers. Overall, with all classifiers, the proposed POCD-ND encoder outperforms existing 32 different encoders. Furthermore, combinedly over 5-fold cross validation benchmark datasets and independent test sets, proposed DNA-MP predictor outperforms state-of-the-art type-specific and generalized modifications predictors by an average accuracy of 7% across 4mc datasets, 1.35% across 5hmc datasets and 10% for 6ma datasets. 
To facilitate the scientific community, the DNA-MP web application is available at https://sds_genetic_analysis.opendfki.de/DNA_Modifications/.
... We also provide the ROC curves for the 5-fold cross-validation and independent tests as shown in Figure 5A,B. For a fair comparison with IDNA6mA-PseKNC [22,28], csDMA [56], and ilM-CNN [28], we used all the samples as the training dataset. The performance over the 5-fold cross-validation was shown in Table 8. ...
... Obviously, the SoftVoting6mA outperformed the iDNA6mA-PseKNC, the ilM-CNN, and the csDMA in terms of ACC. The SoftVoting6mA reached the SP of 0.804, the ACC of 0.828, the MCC of 0.656, and the AUC of 0.900, elevating the ACC by 0.063 over iDNA6mA-PseKNC [22], by 0.029 over csDMA [56], and by 0.004 over ilM-CNN [28]. ...
Article
Full-text available
The DNA N6-methyladenine (6mA) is an epigenetic modification, which plays a pivotal role in biological processes encompassing gene expression, DNA replication, repair, and recombination. Therefore, the precise identification of 6mA sites is fundamental for better understanding its function, but challenging. We proposed an improved ensemble-based method for predicting DNA N6-methyladenine sites in cross-species genomes called SoftVoting6mA. The SoftVoting6mA selected four (electron–ion-interaction pseudo potential, One-hot encoding, Kmer, and pseudo dinucleotide composition) codes from 15 types of encoding to represent DNA sequences by comparing their performances. Similarly, the SoftVoting6mA combined four learning algorithms using the soft voting strategy. The 5-fold cross-validation and the independent tests showed that SoftVoting6mA reached the state-of-the-art performance. To enhance accessibility, a user-friendly web server is provided at http://www.biolscience.cn/SoftVoting6mA/.
... In addition to software tools that rely on SMRT CCS data for identifying 6mA sites, another category of tools has been developed to predict 6mA sites without requiring sequencing, such as 6mA-Finder, csDMA, and PSAC-6mA [38][39][40]. These tools utilize deep learning models trained on existing databases, enabling them to predict 6mA sites with reasonable accuracy based solely on sequence features of the target species. ...
Article
Full-text available
DNA modifications, such as N6-methyladenine (6mA), play important roles in various processes in eukaryotes. Single-molecule, real-time (SMRT) sequencing enables the direct detection of DNA modifications without requiring special sample preparation. However, most SMRT-based studies of 6mA rely on ensemble-level consensus by combining multiple reads covering the same genomic position, which misses the single-molecule heterogeneity. While recent methods have aimed at single-molecule level detection of 6mA, limitations in sequencing platforms, resolution, accuracy, and usability restrict their application in comprehensive epigenetic studies. Here, we present SMAC (single-molecule 6mA analysis of CCS reads), a novel framework for accurately detecting 6mA at the single-molecule level using SMRT circular consensus sequencing (CCS) data from the Sequel II system. It is an automated method that streamlines the entire workflow by packaging both existing softwares and built-in scripts, with user-defined parameters to allow easy adaptation for various studies. By utilizing the statistical distribution characteristics of enzyme kinetic indicators on single DNA molecules rather than a fixed cutoff, SMAC significantly improves 6mA detection accuracy at the single-nucleotide and single-molecule levels. It simplifies analysis by providing comprehensive information, including quality control, statistical analysis, and site visualization, directly from raw sequencing data. SMAC is a powerful new tool that enables de novo detection of 6mA and empowers investigation of its functions in modulating physiological processes.
... To overcome this problem, we have applied the PseAAC strategy, because PseAAC has been broadly applied in different studies and has provided Receiver operating characteristic (ROC) curves of HormoNet before and after feature selection techniques on our benchmark datasets. Where A depicts the prediction performance of HormoNet for HDI before using feature selection, B illustrates the prediction performance of HormoNet for HDI after using RF, C shows the prediction performance of HormoNet for HDI after using lsvc, D is the prediction performance of HormoNet for HDI after using XGBoost.E is the prediction performance of HormoNet for risk level before applying feature selection methods, F is the prediction performance of HormoNet for risk level after RF, G is F is the prediction performance of HormoNet for risk level after lsvc, and H is F is the prediction performance of HormoNet for risk level after XGBoost sufficient performances in the field of protein interaction predictions [11][12][13][14][15][16][17][18][19]. Thus, in this study we used this technique to encode protein sequences. ...
Article
Full-text available
Several experimental evidences have shown that the human endogenous hormones can interact with drugs in many ways and affect drug efficacy. The hormone drug interactions (HDI) are essential for drug treatment and precision medicine; therefore, it is essential to understand the hormone-drug associations. Here, we present HormoNet to predict the HDI pairs and their risk level by integrating features derived from hormone and drug target proteins. To the best of our knowledge, this is one of the first attempts to employ deep learning approach for prediction of HDI prediction. Amino acid composition and pseudo amino acid composition were applied to represent target information using 30 physicochemical and conformational properties of the proteins. To handle the imbalance problem in the data, we applied synthetic minority over-sampling technique technique. Additionally, we constructed novel datasets for HDI prediction and the risk level of their interaction. HormoNet achieved high performance on our constructed hormone-drug benchmark datasets. The results provide insights into the understanding of the relationship between hormone and a drug, and indicate the potential benefit of reducing risk levels of interactions in designing more effective therapies for patients in drug treatments. Our benchmark datasets and the source codes for HormoNet are available in: https://github.com/EmamiNeda/HormoNet. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-024-05708-7.
... The 6mA-RicePred fused multiple features including Markov feature for rice 6mA prediction [53]. The csDMA utilized three representations and explored performances of different algorithms on 6mA prediction, and finally chose the algorithm which performed best over experiments to construct the predictor [54]. The 6mA-Finder combined seven sequencebased coding schemes such as accumulated nucleotide frequency, the composition of K-spaced nucleic acid pairs, three types of physicochemical features, and seven conventional learning algorithms [55]. ...
Article
DNA N6-methyladenine (6mA) is a key DNA modification, which plays versatile roles in the cellular processes, including regulation of gene expression, DNA repair, and DNA replication. DNA 6mA is closely associated with many diseases in the mammals and with growth as well as development of plants. Precisely detecting DNA 6mA sites is of great importance to exploration of 6mA functions. Although many computational methods have been presented for DNA 6mA prediction, there is still a wide gap in the practical application. We presented a convolution neural network (CNN) and bi-directional long-short term memory (Bi-LSTM)-based deep learning method (Deep6mAPred) for predicting DNA 6mA sites across plant species. The Deep6mAPred stacked the CNNs and the Bi-LSTMs in a paralleling manner instead of a series-connection manner. The Deep6mAPred also employed the attention mechanism for improving the representations of sequences. The Deep6mAPred reached an accuracy of 0.9556 over the independent rice dataset, far outperforming the state-of-the-art methods. The tests across plant species showed that the Deep6mAPred is of a remarkable advantage over the state of the art methods. We developed a user-friendly web application for DNA 6mA prediction, which is freely available at http://106.13.196.152:7001/ for all the scientific researchers. The Deep6mAPred would enrich tools to predict DNA 6mA sites and speed up the exploration of DNA modification.
Article
Full-text available
DNA methylation is of crucial importance for biological genetic expression, such as biological cell differentiation and cellular tumours. The identification of DNA-6mA sites using traditional biological experimental methods requires more cumbersome steps and a large amount of time. The advent of neural network technology has facilitated the identification of 6 mA sites on cross-species DNA with enhanced efficacy. Nevertheless, the majority of contemporary neural network models for identifying 6 mA sites prioritize the design of the identification model, with comparatively limited research conducted on the statistically significant DNA sequence itself. Consequently, this paper will focus on the statistical strategy of DNA double-stranded features, utilising the multi-head self-attention mechanism in neural networks applied to DNA position probabilistic relationships. Furthermore, a new recognition model, PSATF-6 mA, will be constructed by continually adjusting the attentional tendency of feature fusion through an integrated learning framework. The experimental results, obtained through cross-validation with cross-species data, demonstrate that the PSATF-6 mA model outperforms the baseline model. The in-Matthews correlation coefficient (MCC) for the cross-species dataset of rice and m. musus genomes can reach a score of 0.982. The present model is expected to assist biologists in more accurately identifying 6 mA locus and in formulating new testable biological hypotheses.
Article
Identifying DNA N6-methyladenine (6mA) sites is significantly important to understanding the function of DNA. Many deep learning-based methods have been developed to improve the performance of 6mA site prediction. In this study, to further improve the performance of 6mA site prediction, we propose a new meta method, called Co6mA, to integrate bidirectional long short-term memory (BiLSTM), convolutional neural networks (CNNs), and self-attention mechanisms (SAM) via assembling two different deep learning-based models. The first model developed in this study is called CBi6mA, which is composed of CNN, BiLSTM, and fully connected modules. The second model is borrowed from LA6mA, which is an existing 6mA prediction method based on BiLSTM and SAM modules. Experimental results on two independent testing sets of different model organisms, i.e., Arabidopsis thaliana and Drosophila melanogaster, demonstrate that Co6mA can achieve an average accuracy of 91.8%, covering 89% of all 6mA samples while achieving an average Matthews correlation coefficient value (0.839), which is higher than the second-best method DeepM6A.
Article
Full-text available
With the rapid development of biotechnology, the number of biological sequences has grown exponentially. The continuous expansion of biological sequence data promotes the application of machine learning in biological sequences to construct predictive models for mining biological sequence information. There are many branches of biological sequence classification research. In this review, we mainly focus on the function and modification classification of biological sequences based on machine learning. Sequence-based prediction and analysis are the basic tasks to understand the biological functions of DNA, RNA, proteins, and peptides. However, there are hundreds of classification models developed for biological sequences, and the quite varied specific methods seem dizzying at first glance. Here, we aim to establish a long-term support website ( http://lab.malab.cn/~acy/BioseqData/home.html ), which provides readers with detailed information on the classification method and download links to relevant datasets. We briefly introduce the steps to build an effective model framework for biological sequence data. In addition, a brief introduction to single-cell sequencing data analysis methods and applications in biology is also included. Finally, we discuss the current challenges and future perspectives of biological sequence classification research.
Article
Deoxyribonucleic acid(DNA) N6-methyladenine plays a vital role in various biological processes, and the accurate identification of its site can provide a more comprehensive understanding of its biological effects. There are several methods for 6mA site prediction. With the continuous development of technology, traditional techniques with the high costs and low efficiencies are gradually being replaced by computer methods. Computer methods that are widely used can be divided into two categories: traditional machine learning and deep learning methods. We first list some existing experimental methods for predicting the 6mA site, then analyze the general process from sequence input to results in computer methods and review existing model architectures. Finally, the results were summarized and compared to facilitate subsequent researchers in choosing the most suitable method for their work.
Article
Full-text available
Background: As an important type of post-translational modification (PTM), protein glycosylation plays a crucial role in protein stability and protein function. The abundance and ubiquity of protein glycosylation across three domains of life involving Eukarya, Bacteria and Archaea demonstrate its roles in regulating a variety of signalling and metabolic pathways. Mutations on and in the proximity of glycosylation sites are highly associated with human diseases. Accordingly, accurate prediction of glycosylation can complement laboratory-based methods and greatly benefit experimental efforts for characterization and understanding of functional roles of glycosylation. For this purpose, a number of supervised-learning approaches have been proposed to identify glycosylation sites, demonstrating a promising predictive performance. To train a conventional supervised-learning model, both reliable positive and negative samples are required. However, in practice, a large portion of negative samples (i.e. non-glycosylation sites) are mislabelled due to the limitation of current experimental technologies. Moreover, supervised algorithms often fail to take advantage of large volumes of unlabelled data, which can aid in model learning in conjunction with positive samples (i.e. experimentally verified glycosylation sites). Results: In this study, we propose a positive unlabelled (PU) learning-based method, PA2DE (V2.0), based on the AlphaMax algorithm for protein glycosylation site prediction. The predictive performance of this proposed method was evaluated by a range of glycosylation data collected over a ten-year period based on an interval of three years. Experiments using both benchmarking and independent tests show that our method outperformed the representative supervised-learning algorithms (including support vector machines and random forests) and one-class learners, as well as currently available prediction methods in terms of F1 score, accuracy and AUC measures. 
In addition, we developed an online web server as an implementation of the optimized model (available at http://glycomine.erc.monash.edu/Lab/GlycoMine_PU/) to facilitate community-wide efforts for accurate prediction of protein glycosylation sites. Conclusion: The proposed PU learning approach achieved a competitive predictive performance compared with currently available methods. This PU learning schema may also be effectively employed and applied to address the prediction problems of other important types of protein PTM site and functional sites.
Article
Full-text available
N6-methyladenine (6mA) DNA methylation has recently been implicated as a potential new epigenetic marker in eukaryotes, including the dicot model Arabidopsis thaliana. However, conservation and divergence of 6mA distribution patterns and the relevant functions in plants remain elusive. Here we report high-quality 6mA methylomes at single-nucleotide resolution in rice based on substantially improved genome sequences of two rice cultivars, Nipponbare (Nip; Japonica) and 93-11 (Indica). Analysis of 6mA genomic distribution and its association with transcription suggest that 6mA distribution and function is rather conserved between rice and Arabidopsis. 6mA levels are positively correlated with the expression of key stress-related genes, which may be responsible for the difference in stress tolerance between Nip and 93-11. Moreover, mutations in DDM1 display defects in plant growth and decreased 6mA level. Our results reveal that 6mA is a conserved DNA modification that is positively associated with gene expression and contributes to key agronomic traits in plants.
Article
Full-text available
Lysine post-translational modifications (PTMs) play a crucial role in regulating diverse functions and biological processes of proteins. However, due to the large volumes of sequencing data generated from genome sequencing projects, systematic identification of different types of lysine PTM substrates and PTM sites in the entire proteome remains a major challenge. In recent years, a number of computational methods for lysine PTM identification have been developed. These methods show high diversity in their core algorithms, features extracted and feature selection techniques, and evaluation strategies. There is therefore an urgent need to revisit these methods and summarize their methodologies, to improve and further develop computational techniques to identify and characterize lysine PTMs from the large amounts of sequence data. With this goal in mind, we first provide a comprehensive survey on a large collection of 49 state-of-the-art approaches for lysine PTM prediction. We cover a variety of important aspects that are crucial for the development of successful predictors, including operating algorithms, sequence and structural features, feature selection, model performance evaluation, and software utility. We further provide our thoughts on potential strategies to improve the model performance. Second, in order to examine the feasibility of using deep learning for lysine PTM prediction, we propose a novel computational framework, termed MUscADEL (MUltiple scalable Accurate DEep Learner for lysine PTMs), using deep, bidirectional, long short-term memory recurrent neural networks for accurate and systematic mapping of eight major types of lysine PTMs in the human and mouse proteomes. Extensive benchmarking tests show that MUscADEL outperforms current methods for lysine PTM characterisation, demonstrating the potential and power of deep learning techniques in protein PTM prediction. 
The webserver of MUscADEL, together with all the datasets assembled in this study, is freely available at http://muscadel.erc.monash.edu/. We anticipate that this comprehensive review and the application of deep learning will provide a practical guide and useful insights into PTM prediction and inspire future bioinformatics studies in related fields.
Article
As a newly discovered post-translational modification, lysine malonylation regulates a myriad of cellular processes from prokaryotes to eukaryotes and has important implications in human diseases. Despite its functional significance, computational methods to accurately identify malonylation sites are still lacking and urgently needed. In particular, there is currently no comprehensive analysis and assessment of different features and machine learning methods that are required for constructing the necessary prediction models. Here, we review, analyze and compare 11 different feature encoding methods, with the goal of extracting key patterns and characteristics from residue sequences of lysine malonylation sites. We identify optimized feature sets, with which four commonly used machine learning methods (Random Forest (RF), Support Vector Machines (SVM) and K-Nearest Neighbor (KNN)) and one recently proposed (Light Gradient Boosting Machine, LightGBM) are trained on data from three species, namely Escherichia coli, Mus musculus and Homo sapiens, and compared using randomized 10-fold cross-validation tests. We show that integration of the single method-based models through ensemble learning further improves the prediction performance and model robustness on the independent test. When compared to the existing state-of-the-art predictor, MaloPred, the optimal ensemble models were more accurate for all three species (ACC 0.930, 0.923 and 0.944 for E. coli, M. musculus and H. sapiens, respectively). Using the ensemble models, we developed an accessible online predictor, kmal-sp, available at http://kmalsp.erc.monash.edu/. 
We hope that this comprehensive survey and the proposed strategy for building more accurate models can serve as a useful guide for inspiring future developments of computational methods for PTM site prediction, expedite the discovery of new malonylation and other PTM types, and facilitate hypothesis-driven experimental validation of novel malonylated substrates and malonylation sites.
Article
DNA N6-methyladenine (6mA) is a non-canonical DNA modification that is present at low levels in different eukaryotes, but its prevalence and genomic function in higher plants are unclear. Using mass spectrometry, immunoprecipitation and validation with analysis of single-molecule real-time sequencing, we observed that about 0.2% of all adenines are 6mA methylated in the rice genome. 6mA occurs most frequently at GAGG motifs and is mapped to about 20% of genes and 14% of transposable elements. In promoters, 6mA marks silent genes, but in gene bodies it correlates with gene activity. 6mA overlaps with 5-methylcytosine (5mC) at CG sites in gene bodies and is complementary to 5mC at CHH sites in transposable elements. We show that OsALKBH1 may be involved in 6mA demethylation in rice. The results suggest that 6mA is complementary to 5mC as an epigenomic mark in rice and reinforce a distinct role for 6mA as a gene expression-associated epigenomic mark in eukaryotes.
Article
The emergence of the epitranscriptome opened a new chapter in gene regulation. 5-methylcytosine (m⁵C), an important post-transcriptional modification, has been found to be involved in a variety of biological processes such as subcellular localization and translational fidelity. Though high-throughput experimental technologies have been developed and applied to profile m⁵C modifications under certain conditions, transcriptome-wide studies of m⁵C modifications are still hindered by the dynamic nature of m⁵C and the lack of computational prediction methods. In this study, we introduced PEA-m5C, a machine learning-based m⁵C predictor trained with features extracted from the flanking sequence of m⁵C modifications. PEA-m5C yielded an average AUC (area under the receiver operating characteristic curve) of 0.939 in 10-fold cross-validation experiments based on known Arabidopsis m⁵C modifications. Rigorous independent testing showed that PEA-m5C (Accuracy [Acc] = 0.835, Matthews correlation coefficient [MCC] = 0.688) is remarkably superior to the recently developed m⁵C predictor iRNAm5C-PseDNC (Acc = 0.665, MCC = 0.332). PEA-m5C has been applied to predict candidate m⁵C modifications in annotated Arabidopsis transcripts. Further analysis of these m⁵C candidates showed that the position 4 nt downstream of the translational start site is the most frequently methylated. PEA-m5C is freely available to academic users at: https://github.com/cma2015/PEA-m5C.
Article
Motivation: DNA N6-methyladenine (6mA) is associated with a wide range of biological processes. Since the distribution of 6mA sites in the genome is non-random, accurate identification of 6mA sites is crucial for understanding its biological functions. Although experimental methods have been proposed for this purpose, they are still not cost-effective for detecting 6mA sites on a genome-wide scale. Therefore, it is desirable to develop computational methods to facilitate the identification of 6mA sites. Results: In this study, a computational method called i6mA-Pred was developed to identify 6mA sites in the rice genome, in which the optimal nucleotide chemical properties obtained by using a feature selection technique were used to encode the DNA sequences. i6mA-Pred yielded an accuracy of 83.13% in the jackknife test. Meanwhile, the performance of i6mA-Pred was also superior to that of other methods. Availability and implementation: A user-friendly web-server, i6mA-Pred, is freely accessible at http://lin-group.cn/server/i6mA-Pred.
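The encoding idea mentioned in this abstract can be illustrated with a minimal sketch. It assumes the commonly used three-property nucleotide chemical property scheme (ring structure, functional group, hydrogen-bond strength); the abstract does not state the exact mapping or which properties the feature selection retained, so the table `NCP` and the function `encode_ncp` are illustrative names and not the authors' implementation.

```python
# Assumed convention: (ring structure, functional group, hydrogen bonding)
#   ring structure:   purine = 1, pyrimidine = 0
#   functional group: amino = 1, keto = 0
#   hydrogen bonding: weak (A/T) = 1, strong (C/G) = 0
NCP = {
    "A": (1, 1, 1),
    "C": (0, 1, 0),
    "G": (1, 0, 0),
    "T": (0, 0, 1),
}


def encode_ncp(seq):
    """Return a flat list with three values per nucleotide.

    Unknown symbols (for example N) are encoded as (0, 0, 0).
    """
    vec = []
    for base in seq.upper():
        vec.extend(NCP.get(base, (0, 0, 0)))
    return vec


features = encode_ncp("ACGT")
```

A sequence of length n yields a vector of length 3n, which could then be passed to a feature selection step and a classifier.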
Article
The smallest unit of life is a cell, which contains numerous protein molecules. Most of the functions critical to the cell’s survival are performed by these proteins located in its different organelles, usually called “subcellular locations”. Information on the subcellular localization of a protein can provide useful clues about its function. To reveal the intricate pathways at the cellular level, knowledge of the subcellular localization of proteins in a cell is a prerequisite. Therefore, one of the fundamental goals in molecular cell biology and proteomics is to determine the subcellular locations of proteins in an entire cell. It is also indispensable for prioritizing and selecting the right targets for drug development. Unfortunately, it is both time-consuming and costly to determine the subcellular locations of proteins purely based on experiments. With the avalanche of protein sequences generated in the post-genomic age, it is highly desirable to develop computational methods for rapidly and effectively identifying the subcellular locations of uncharacterized proteins based on their sequence information alone. Considerable progress has been achieved in this regard. This review focuses on those methods that can deal with multi-label proteins, which may simultaneously exist in two or more subcellular locations. Protein molecules with this characteristic are vitally important for finding multi-target drugs, a current hot trend in drug development. This review also focuses on those methods that have user-friendly web-servers, so that the majority of experimental scientists can use them to get the desired results without the need to go through the detailed mathematics involved.
Article
The 2’-O-methylation transferase is involved in the process of 2’-O-methylation. In the catalytic process, the 2’-hydroxy group of the ribose moiety of a nucleotide accepts a methyl group. This methylation is a post-transcriptional modification that occurs in various cellular RNAs and plays a vital role in regulating gene expression at the post-transcriptional level. Biochemical experiments can identify 2’-O-methylation sites reliably, but these experimental techniques are very expensive. Thus, a computational method to identify 2’-O-methylation sites is required. In this work, we proposed a simple and precise convolutional neural network method, iRNA-PseKNC(2methyl), to identify 2’-O-methylation sites. Existing techniques use handcrafted features, while the proposed method automatically extracts the features of 2’-O-methylation using the proposed convolutional neural network model. The proposed iRNA-PseKNC(2methyl) method obtained 98.27% accuracy, 96.29% sensitivity, 100% specificity, and an MCC of 0.965 on a Homo sapiens dataset. These results indicate that the proposed method outperformed the existing method on all evaluation parameters and that iRNA-PseKNC(2methyl) might be beneficial for academic research and drug design.
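The four evaluation measures quoted in this abstract (accuracy, sensitivity, specificity and MCC) all follow from the counts of a binary confusion matrix. The sketch below uses the standard definitions; the function name `binary_metrics` and the example counts are hypothetical and are not taken from any of the papers listed here.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Compute Acc, Sn, Sp and MCC from confusion-matrix counts."""
    total = tp + tn + fp + fn
    acc = (tp + tn) / total if total else 0.0
    sn = tp / (tp + fn) if (tp + fn) else 0.0   # sensitivity (recall)
    sp = tn / (tn + fp) if (tn + fp) else 0.0   # specificity
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally reported as 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"acc": acc, "sn": sn, "sp": sp, "mcc": mcc}


# Hypothetical counts, for illustration only.
m = binary_metrics(tp=90, tn=80, fp=20, fn=10)
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which is why it is often reported alongside accuracy when the positive and negative classes differ in size.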
Article
Motivation: A cell contains numerous protein molecules. One of the fundamental goals in cell biology is to determine their subcellular locations, which can provide useful clues about their functions. Knowledge of protein subcellular localization is also indispensable for prioritizing and selecting the right targets for drug development. With the avalanche of protein sequences emerging in the post-genomic age, it is highly desirable to develop computational tools for timely and effective identification of their subcellular localization based on the sequence information alone. Recently, a predictor called “pLoc-mAnimal” was developed for identifying the subcellular localization of animal proteins. Its performance is overwhelmingly better than that of the other predictors for the same purpose, particularly in dealing with multi-label systems in which some proteins, called “multiplex proteins”, may simultaneously occur in two or more subcellular locations. Although it is a very powerful predictor, more effort is needed to further improve it. This is because pLoc-mAnimal was trained on an extremely skewed dataset in which some subset (subcellular location) was about 128 times the size of the other subsets. Such an uneven training dataset will inevitably bias the predictor. Results: To alleviate this bias, we have developed a new, bias-reducing predictor called pLoc_bal-mAnimal by quasi-balancing the training dataset. Cross-validation tests on exactly the same experiment-confirmed dataset have indicated that the proposed new predictor is remarkably superior to pLoc-mAnimal, the existing state-of-the-art predictor, in identifying the subcellular localization of animal proteins.
Availability: To maximize convenience for the vast majority of experimental scientists, a user-friendly web-server for the new predictor has been established at http://www.jci-bioinfo.cn/pLoc_bal-mAnimal/, by which users can easily get their desired results without the need to go through the complicated mathematics. Supplementary information: Supplementary data are available at Bioinformatics online.