SCIENTIFIC REPORTS | (2019) 9:13109 | https://doi.org/10.1038/s41598-019-49430-4
www.nature.com/scientificreports
csDMA: an improved bioinformatics
tool for identifying DNA 6mA
modifications via Chou’s 5-step rule
Ze Liu1,2, Wei Dong1,2, Wei Jiang1,2 & Zili He1,2
DNA N6-methyldeoxyadenosine (6mA) modifications were first found more than 60 years ago but
were thought to be widespread only in prokaryotes and unicellular eukaryotes. With the development
of high-throughput sequencing technology, 6mA modifications were found in different multicellular
eukaryotes by using experimental methods. However, the experimental methods are time-consuming
and costly, which makes it necessary to develop computational methods instead.
In this study, a machine learning-based prediction tool, named csDMA, was developed for predicting
6mA modifications. Firstly, three feature encoding schemes, Motif, Kmer, and Binary, were used to
generate the feature matrix. Secondly, different algorithms were tested in the prediction model
and the ExtraTrees model achieved the best AUC of 0.878 by using 5-fold cross-validation on the training
dataset. Besides, the ExtraTrees model also achieved the best AUC of 0.893 on the independent testing
dataset. Finally, we compared our method with state-of-the-art predictors and the results showed that
our model achieved better performance than existing tools.
DNA N6-methyldeoxyadenosine (6mA) modifications were first discovered in bacteria in 1955 [1]. However, they
have not received as much attention as 5-methylcytosine (5mC). One important reason is that 6mA modifications
were thought to be widespread only in prokaryotes and unicellular eukaryotes, but rare in multicellular
eukaryotes [2,3]. Researchers have proposed several experimental methods to identify 6mA modifications in the
past few decades. The first method, developed by Dunn et al. in 1955, is a combination of ultraviolet absorption
spectra, electrophoretic mobility, and paper chromatographic movement, but this method is relatively insensitive
and cannot be used to detect 6mA modifications in animals [1]. Then a restriction enzyme method was used to
discover 6mA modifications in 1978. However, this method can only find modified adenosines that occur in the
restriction enzyme target motifs [4]. With the development of high-throughput sequencing technology, thousands
of 6mA modifications were found in different multicellular eukaryotes. In 2015, Fu et al. found 6mA modifications
in 84% of the genes of Chlamydomonas by using 6mA immunoprecipitation sequencing (6mA-IP-Seq) [5]. In 2016,
Koziol et al. used dot blots, HPLC, and methyl DNA immunoprecipitation followed by sequencing (MeDIP-seq)
to detect 6mA modifications in vertebrates including Xenopus laevis, mouse and human [6]. In 2017, Mondo et
al. observed that up to 2.8% of all adenines were methylated in early-diverging fungi by using single-molecule
real-time (SMRT) sequencing [7]. In 2018, Zhou et al. found that about 0.2% of adenines in the rice genome were
6mA methylated by using mass spectrometry, immunoprecipitation, and SMRT, and Zhang et al. observed that
the 6mA distributions in the rice and Arabidopsis genomes were very similar by using 6mA-IP-seq [8,9]. As the
experimental methods are time-consuming and costly, researchers are trying to predict DNA 6mA modifications
by using computational methods. Two prediction tools have been reported so far, i.e., iDNA6mA-PseKNC [10] and
i6mA-Pred [11]. iDNA6mA-PseKNC is the first prediction tool for predicting 6mA modifications in the Mus
musculus genome and i6mA-Pred is the first identification method for the rice genome.
Predicting DNA 6mA modifications with computational algorithms is still in its infancy. However, in
the parallel field of post-translational modification (PTM) site prediction, many PTM-predicting
papers have been published [12–22]. Although the details differ for each individual PTM,
the basic core is about the same. Thus, the feature extraction and classification methods proposed
in these studies provide a valuable basis for the prediction of DNA 6mA modifications. In this research, we aim
to develop a prediction tool that can be used to predict DNA 6mA modifications across species. The benchmark
datasets created for the iDNA6mA-PseKNC and i6mA-Pred predictors were used and different algorithms were
implemented to generate the final optimized model. 5-fold cross-validation was performed and the prediction
results demonstrated that our model achieved better performance than existing 6mA prediction tools.
1College of Water Resources and Architectural Engineering, Northwest A&F University, Yangling, 712100, Shaanxi,
China. 2Key Laboratory of Agricultural Soil and Water Engineering in Arid and Semiarid Areas, Ministry of Education,
Northwest A&F University, Yangling, 712100, Shaanxi, China. Correspondence and requests for materials should be
addressed to W.D. (email: dongw@nwafu.edu.cn)
Received: 26 April 2019
Accepted: 24 August 2019
Published: xx xx xxxx
As demonstrated by a series of recent publications [10,13–19] and summarized in two comprehensive review
papers [23,24], to develop a really useful predictor for a biological system, one needs to follow Chou’s 5-step rule
(a more detailed description can be found at https://en.wikipedia.org/wiki/5-step_rules) and go through the
following five steps: (1) construct a gold standard dataset to train and test the model; (2) encode samples with effective
formulations; (3) build the prediction model with a powerful classifier; (4) evaluate model performance by
using cross-validation tests and standard measures; (5) establish a user-friendly web-server for the predictor that
is accessible to the public. Below, we address these steps one by one.
Method
Dataset generation. Feng et al. created a DNA 6mA benchmark dataset of the M. musculus genome in
2018 [10]. The benchmark dataset includes 1,934 positive samples and 1,934 negative samples. Chen et al. released
a 6mA benchmark dataset of the rice genome in 2019 [11]. The benchmark dataset consists of 880 positive samples
and 880 negative samples. These two benchmark datasets were used to create the cross-species dataset, and
the CD-HIT-EST software [25] with different thresholds was used to reduce sequence redundancy in the original
datasets (Table 1). The final cross-species dataset consists of 2,768 positive samples and 2,716 negative samples
at the most rigorous threshold of 0.80, and the length of each sample is 41 nt. To build a cross-species 6mA
prediction model, the stratified selection method was used: we randomly selected 80% of the samples for model training
and kept the remaining 20% for independent testing. The resulting training dataset consists of 2,214 positive samples and 2,214
negative samples, while the independent testing dataset includes 554 positive samples and 502 negative samples.
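The stratified selection step can be illustrated with a minimal sketch. It is not the authors' code: the function name `stratified_split`, the rounding rule, and the seed are assumptions made for illustration, so the sketch is not expected to reproduce the exact sample counts reported above.

```python
import random

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split sample indices into train/test so that each class is split separately.

    Hypothetical helper: the rounding rule and the seed are illustrative choices.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy example: 10 positive and 5 negative samples.
toy_labels = [1] * 10 + [0] * 5
train_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(test_idx))
```

Splitting each class separately keeps the positive/negative ratio of the test set close to that of the full dataset, which is the point of a stratified selection.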
Feature encoding scheme. To construct a DNA 6mA predictor, one of the most important but also most
difficult issues is how to encode each sequence as a feature vector while still retaining most of its key patterns. The
pseudo amino acid composition (PseAAC) was proposed by Chou et al. and has been widely used in nearly all
the areas of computational proteomics [26,27]. Based on the PseAAC, four powerful software tools, ‘PseAAC’ [28],
‘PseAAC-Builder’ [29], ‘propy’ [30], and ‘PseAAC-General’ [31], were established: the former three generate
various modes of Chou’s special PseAAC [32], while the fourth generates those of Chou’s general PseAAC [23]. Encouraged by
the successes of using PseAAC to deal with protein/peptide sequences, the concept of Pseudo K-tuple Nucleotide
Composition (PseKNC) [33] was developed for encoding features of DNA/RNA sequences [34–36], which has proved
very useful as well. In particular, a powerful web-server called ‘Pse-in-One’ [37] and its updated version
‘Pse-in-One2.0’ [38] have been established that can generate any desired feature vectors for protein/peptide
and DNA/RNA sequences according to users’ needs.
K-mer pattern. K monomeric units (k-mers) are simply patterns of k consecutive nucleic acids [37], and there are
$4^k$ possible k-mer patterns for DNA/RNA. For example, there are 4 possible 1-mers and 16 possible 2-mers.
To calculate the frequencies of k-mer nucleotide patterns, the length L of the scanning region
must be determined first, and then the absolute frequencies of the k-mer nucleotide patterns are counted
at every start position from 1 to L − k + 1. Finally, the relative frequencies of k-mer patterns are calculated for
each region. In this study, we set k to 2, 3, and 4, and extracted $4^2 + 4^3 + 4^4 = 336$ k-mer nucleotide patterns
for feature encoding.
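The k-mer encoding described here can be sketched in a few lines of Python. This is an illustrative sketch rather than the csDMA implementation; the function names `kmer_frequencies` and `kmer_features` are assumptions, and windows containing characters other than A, C, G, T are simply not counted.

```python
from itertools import product

def kmer_frequencies(seq, k):
    """Relative frequency of every DNA k-mer (alphabet ACGT) in seq."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    n_windows = len(seq) - k + 1  # start positions 1 .. L - k + 1
    if n_windows <= 0:
        raise ValueError("sequence is shorter than k")
    for start in range(n_windows):
        window = seq[start:start + k]
        if window in counts:  # windows with other characters are skipped
            counts[window] += 1
    return [counts[m] / n_windows for m in kmers]

def kmer_features(seq, ks=(2, 3, 4)):
    """Concatenate the k-mer frequency vectors for each k in ks."""
    features = []
    for k in ks:
        features.extend(kmer_frequencies(seq, k))
    return features

sample = "ACGT" * 10 + "A"  # a toy 41 nt sequence
print(len(kmer_features(sample)))  # 16 + 64 + 256 entries
```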
KSNPF frequency. The KSNPF features are the frequencies of nucleotide pairs separated by k arbitrary nucleotides and have
been successfully employed for the prediction of mucin-type O-glycosylation sites [39] and phosphatase-specific
dephosphorylation sites [40]. The KSNPF can be calculated using the following equation:
$$f(n_1\,\mathrm{Gap}(k)\,n_2) = \frac{S(n_1\,\mathrm{Gap}(k)\,n_2)}{L - k - 1} \qquad (1)$$
where $n_1$ and $n_2$ represent a pair of sequence elements. For nucleotides, n stands for any one of A, C, G, T/U. Thus,
there are $4^2 = 16$ combinations in each pair. Gap(k) stands for k arbitrary elements at intervals and $S(n_1\,\mathrm{Gap}(k)\,n_2)$
indicates the number of occurrences of the element pair. In this study, L represents the length of the nucleotide
sequence, k was set to 1, 2, 3, and 4, and the dimension of the KSNPF is therefore $4^2 \times 4 = 64$.
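Equation (1) can be made concrete with a short sketch. It is an illustration under stated assumptions, not the authors' code: the function name `ksnpf` and the ordering of the 16 pairs (lexicographic over A, C, G, T) are choices made here.

```python
from itertools import product

def ksnpf(seq, gaps=(1, 2, 3, 4)):
    """Frequencies of nucleotide pairs separated by k arbitrary nucleotides.

    For each gap k, a pair (seq[i], seq[i + k + 1]) spans k + 2 positions,
    so a sequence of length L contains L - k - 1 such pairs (Eq. 1).
    """
    pairs = ["".join(p) for p in product("ACGT", repeat=2)]
    features = []
    for k in gaps:
        counts = dict.fromkeys(pairs, 0)
        n_windows = len(seq) - k - 1
        for i in range(n_windows):
            pair = seq[i] + seq[i + k + 1]
            if pair in counts:
                counts[pair] += 1
        features.extend(counts[p] / n_windows for p in pairs)
    return features

sample = "ACGT" * 10 + "A"  # a toy 41 nt sequence
print(len(ksnpf(sample)))   # 16 pairs x 4 gap sizes
```

For the toy sequence "ACGTA" with k = 1, the three pairs are A_G, C_T, and G_A, so each has frequency 1/3 and all other pairs have frequency 0.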
Species        Dataset    Sequence identity threshold
                          0.95     0.90     0.85     0.80
Mouse          Positive   1,931    1,924    1,914    1,892
               Negative   1,885    1,866    1,844    1,836
Rice           Positive   880      879      878      876
               Negative   880      880      880      880
Cross-species  Positive   2,811    2,803    2,792    2,768
               Negative   2,767    2,746    2,724    2,716
Table 1. Reduce sequence redundancy in the different datasets by using the CD-HIT-EST software.
Nucleic shift density. Nucleic shift density encoding can be used to calculate the density of any nucleotide
at the current position in its prefix string and has been used to encode nucleotide sequences in the
iDNA6mA-PseKNC predictor [10]. A nucleic shift density feature at any position can be defined as follows:
$$d_i = \frac{1}{N_i}\sum_{j=1}^{i} F(n_j), \qquad F(n_j) = \begin{cases} 1, & \text{if } n_j = q \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$
where q represents the nucleotide at the current position i, and $N_i$ is the length of the i-th prefix string in the sequence.
For example, in the DNA sequence “CAGCTG”, the density of ‘C’ in the prefix strings ending at positions 1, 2, 3, 4, 5, and 6 is
1/1 = 1, 1/2 = 0.5, 1/3 ≈ 0.33, 2/4 = 0.5, 2/5 = 0.4, and 2/6 ≈ 0.33, respectively. In this study, the length of each
sample is 41 nt. Thus, 41 nucleic shift density features were generated for each sample.
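The following sketch illustrates the density calculation. It is not taken from the csDMA code base; the two function names are assumptions. `prefix_density` tracks a fixed nucleotide across all prefixes, as in the ‘C’ example, while `nucleic_shift_density` follows the definition in which q is the nucleotide found at position i.

```python
def prefix_density(seq, nucleotide):
    """Density of `nucleotide` in every prefix seq[:i], for i = 1 .. len(seq)."""
    densities, count = [], 0
    for i, base in enumerate(seq, start=1):
        if base == nucleotide:
            count += 1
        densities.append(count / i)
    return densities

def nucleic_shift_density(seq):
    """One feature per position: density of the base at position i within seq[:i]."""
    seen = {}
    features = []
    for i, base in enumerate(seq, start=1):
        seen[base] = seen.get(base, 0) + 1
        features.append(seen[base] / i)
    return features

print([round(d, 2) for d in prefix_density("CAGCTG", "C")])
print([round(d, 2) for d in nucleic_shift_density("CAGCTG")])
```

The two outputs differ wherever the base at position i is not ‘C’: at position 5, for instance, the per-position feature is the density of ‘T’ (1/5), not of ‘C’ (2/5).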
Binary code. The Binary encoding scheme was used to predict 6mA modifications in the iDNA6mA-PseKNC
predictor [10]. For the nucleotide $n_i$ at position i, the Binary features can be defined as follows:
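$$x_i = \begin{cases} 1, & \text{if } n_i \in \{A, G\} \\ 0, & \text{if } n_i \in \{C, T\} \end{cases} \qquad y_i = \begin{cases} 1, & \text{if } n_i \in \{A, T\} \\ 0, & \text{if } n_i \in \{C, G\} \end{cases} \qquad z_i = \begin{cases} 1, & \text{if } n_i \in \{A, C\} \\ 0, & \text{if } n_i \in \{G, T\} \end{cases} \qquad (3)$$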
In this research, the Binary encoding scheme generates a vector with 3 × 41 = 123 elements by characterizing
each nucleotide, “A”, “C”, “G”, or “T”, with (1, 1, 1), (0, 0, 1), (1, 0, 0), or (0, 1, 0), respectively.
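A minimal sketch of the Binary encoding is given below. The function name `binary_encode` is an assumption, and the sketch supposes that every character of the input is one of A, C, G, T.

```python
def binary_encode(seq):
    """Encode each nucleotide as (x, y, z) following Eq. (3).

    x = 1 for A or G, y = 1 for A or T, z = 1 for A or C; otherwise 0.
    """
    features = []
    for base in seq:
        if base not in "ACGT":
            raise ValueError(f"unexpected nucleotide: {base!r}")
        features.extend((int(base in "AG"), int(base in "AT"), int(base in "AC")))
    return features

print(binary_encode("ACGT"))
```

Applied to a 41 nt sample, the function returns a flat list of 3 × 41 = 123 values.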
Motif score matrix. The MEME Suite (http://meme-suite.org/) consists of several motif-based sequence
analysis tools. In this study, the MEME tool with differential enrichment mode was used and the maximum
number of motifs was set to 10. The most enriched motifs were selected based on their E-values, and the motif matrices were
used for generating the motif scores of each sample.
Performance evaluation. Five different classifiers, Random Forest, GradientBoosting, AdaBoost,
ExtraTrees and SVM, were implemented by using Python. For the Random Forest, GradientBoosting, AdaBoost, and
ExtraTrees classifiers, 1,000 trees were used for each of them. For SVM, grid search was used to find the
best combination of the C and gamma parameters. 5-fold cross-validation was used to evaluate the performance of
our model. In each fold of cross-validation, one subset was selected in turn as the testing set, while the remaining
4 subsets were used to train the model. The mean results of the five experiments were finally used as the
performance estimates of the algorithms.
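The 5-fold procedure can be sketched as follows. This is an illustration only: the helper names `k_fold_indices` and `cross_validate` are assumptions, the folds here are formed by a plain shuffle (the paper does not state whether folds were stratified), and `evaluate` stands for any user-supplied function that trains on one index set and returns a score on the other.

```python
import random

def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for held_out in range(n_folds):
        test_idx = folds[held_out]
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train_idx, test_idx

def cross_validate(n_samples, evaluate, n_folds=5, seed=0):
    """Average the score returned by evaluate(train_idx, test_idx) over all folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_samples, n_folds, seed)]
    return sum(scores) / len(scores)

# Toy usage: the "score" is just the size of the held-out fold.
print(cross_validate(10, lambda train, test: len(test)))
```

Fixing the seed makes the folds reproducible, so that different classifiers can be compared on identical partitions.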
Based on Chou’s symbols introduced for studying signal peptides [41,42], four standard measures were
derived and have been adopted by several recent publications [43–45]. The measures can be defined as follows:
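$$
\begin{cases}
Sn = 1 - \dfrac{N^{+}_{-}}{N^{+}} \\[2ex]
Sp = 1 - \dfrac{N^{-}_{+}}{N^{-}} \\[2ex]
ACC = 1 - \dfrac{N^{+}_{-} + N^{-}_{+}}{N^{+} + N^{-}} \\[2ex]
MCC = \dfrac{1 - \left(\dfrac{N^{+}_{-}}{N^{+}} + \dfrac{N^{-}_{+}}{N^{-}}\right)}{\sqrt{\left(1 + \dfrac{N^{-}_{+} - N^{+}_{-}}{N^{+}}\right)\left(1 + \dfrac{N^{+}_{-} - N^{-}_{+}}{N^{-}}\right)}}
\end{cases}
\qquad (4)
$$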
where $N^{+}$ and $N^{-}$ refer to the number of positive samples and negative samples, respectively, $N^{+}_{-}$ stands for the
number of positive samples that were predicted to be negatives, and $N^{-}_{+}$ refers to the number of negative samples that
were predicted to be positives. However, these measures are valid only for single-label learning problems. For
multi-label learning problems, which are more common in system biology [46], system medicine [47] and
biomedicine [16], a completely different set of standard measures is needed [48]. Besides, the receiver operating
characteristic (ROC) curve combined with the area under the ROC curve (AUC), the Precision-Recall curve
combined with the average precision (AP), and the F1 score [49] were also used to evaluate the performance of different
classifiers.
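The four measures of Eq. (4) translate directly into code. The sketch below is illustrative; the function name `chou_metrics` and its argument names are assumptions. Note that MCC is undefined (division by zero) when every sample is predicted to belong to the same class.

```python
from math import sqrt

def chou_metrics(n_pos, n_neg, false_neg, false_pos):
    """Sn, Sp, ACC and MCC in the notation of Eq. (4).

    n_pos, n_neg : numbers of positive and negative samples (N+ and N-)
    false_neg    : positives predicted as negatives
    false_pos    : negatives predicted as positives
    """
    sn = 1 - false_neg / n_pos
    sp = 1 - false_pos / n_neg
    acc = 1 - (false_neg + false_pos) / (n_pos + n_neg)
    numerator = 1 - (false_neg / n_pos + false_pos / n_neg)
    denominator = sqrt(
        (1 + (false_pos - false_neg) / n_pos)
        * (1 + (false_neg - false_pos) / n_neg)
    )
    return {"Sn": sn, "Sp": sp, "ACC": acc, "MCC": numerator / denominator}

print(chou_metrics(n_pos=100, n_neg=100, false_neg=10, false_pos=20))
```

These expressions are algebraically equivalent to the usual confusion-matrix definitions, with TP = N+ minus the false negatives and TN = N- minus the false positives.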
Using graphic approaches to study biological and medical systems can provide an intuitive vision and useful
insights for analyzing the complicated relations therein, as shown in the systems of enzyme fast reaction [50],
graphical rules in molecular biology [51], and low-frequency internal motion in biomacromolecules (such as protein
and DNA) [52]. This kind of insightful implication has also been demonstrated
in [53] and many follow-up publications [54–56]. The framework of csDMA is shown in Fig. 1.
As pointed out by Chou et al. [57] and demonstrated in a series of recent publications [16–18], publicly accessible
web-servers or online bioinformatics tools have significantly increased the impact of bioinformatics on medical
science [58], driving medicinal chemistry into an unprecedented revolution [59]. Accordingly, the datasets and online
tool involved in this paper are all available at https://github.com/liuze-nwafu/csDMA.
Results
Differential enrichment motif discovery. To find the enriched motifs flanking 6mA
sites, the MEME tool with differential enrichment mode was used and the maximum number of motifs
was set to 10. We used the positive samples in the cross-species dataset as the input and treated the
negative samples as the control sequences. The detailed information of the enriched motifs can be found in the
supplementary materials. Considering the statistical significance of the motifs, an E-value threshold of 0.05
was used and two motifs were selected. The first motif,
NNNNNNNHHNHHNHWNTNTNWNNNWNYNNNNNNNNNNNNNN, with an E-value of 3.3e-18, was
the most statistically significant. The third motif, ACCGATCSA, with an E-value of 2.9e-2, was also selected.
The probability matrices can also be downloaded from the MEME website and can be used to build motif score
matrices in the training process.
Model training with different feature subsets. To find the best combination of feature subsets, different
feature subsets were fed into the Random Forest classifier and 5-fold cross-validation was used on the training
dataset to evaluate the performance of our model. As shown in Fig. 2, the classifier achieved an AUC value of
0.866 by using only the Binary code features, which indicates that the Binary code features were the most informative
features for distinguishing positive samples from negative samples. Interestingly, this result was even
slightly higher than those obtained with some combined feature subsets, such as Motif and Binary or KSNPF and Binary, which achieved
AUC values of 0.861 and 0.862, respectively. Besides, the model achieved the best AUC value of 0.871 when
the three feature subsets Motif, Kmer, and Binary were fed into the classifier. This result was even
a little better than the model performance obtained by using all feature subsets. Thus, we used the Motif, Kmer, and Binary
encoding schemes to generate the optimized feature matrix.
Figure 1. The framework of csDMA.
Performance evaluation with different classifiers. Five different algorithms were implemented in this
research. For the Random Forest, GradientBoosting, AdaBoost, and ExtraTrees classifiers, 1,000 trees were used
for each of them. For the SVM classifier, grid search was used to find the best combination of the C and gamma
parameters, and the SVM classifier achieved the best performance with C of 0.98 and gamma of 0.01. To compare
the performance of different classifiers, 5-fold cross-validation was used and each classifier was trained with the
same folds. As shown in Fig. 3, the ExtraTrees classifier achieved the best ACC of 0.799 and Sn of 0.864, while
AdaBoost got the lowest ACC of 0.715, Sn of 0.713, and Sp of 0.718. The ExtraTrees classifier did not perform
as well in predicting negative samples, with an Sp of 0.735, but this is only a little lower than those
of other methods. A more detailed comparison of different classifiers is shown in Table 2. Moreover, the
ExtraTrees classifier also achieved the highest MCC of 0.603, AUC of 0.878 and F1 of 0.811. Thus, we used the
ExtraTrees algorithm to train the optimized model.
Figure 2. Model performance based on the different feature subsets. 1,000 decision trees were used in the
Random Forest classifier and 5-fold cross-validation was used to evaluate the performance of csDMA.
Figure 3. The model performance of different classifiers. The Motif, Kmer, and Binary feature subsets were
fed into each classifier and the optimized parameters were used for model training. To evaluate the
performance of each classifier, 5-fold cross-validation was used with standard measures such as ACC, Sn and Sp.
Algorithm         Sn      Sp      ACC     MCC     AUC     F1
RandomForest      0.853   0.735   0.794   0.593   0.871   0.806
GradientBoosting  0.743   0.762   0.752   0.506   0.818   0.750
AdaBoost          0.713   0.718   0.715   0.431   0.777   0.715
ExtraTrees        0.864*  0.735   0.799*  0.603*  0.878*  0.811*
SVM               0.807   0.764*  0.785   0.572   0.858   0.790
Table 2. Model performance of each algorithm on the training dataset. The highest value of each column is
marked with an asterisk.
The independent testing dataset was also used to further evaluate the performance of each classifier. Each
classifier was trained on the whole training dataset and evaluated on the independent testing dataset. As shown
in Table 3, the ExtraTrees classifier achieved the best Sn of 0.888, AUC of 0.893 and F1 of 0.832, while the SVM
model got the highest Sp of 0.761. Interestingly, the performance of each classifier on the independent testing
dataset was even a little higher than that on the training dataset, which suggests that the classifiers may achieve
better performance with a larger training dataset.
Comparison with existing 6mA predictors. The SVM-based tool iDNA6mA-PseKNC was also implemented
in this research. Grid search was used to find the best C and gamma, and iDNA6mA-PseKNC
achieved its best performance with C of 0.336 and gamma of 0.02. The same folds used for training csDMA
were also used for training iDNA6mA-PseKNC. The iDNA6mA-PseKNC predictor achieved an Sn of 0.767, Sp of
0.769, ACC of 0.767, MCC of 0.536, and F1 of 0.767. All of these measures except Sp were lower than those of
our model with the ExtraTrees classifier. To further compare the performance of the two algorithms, the ROC
and Precision-Recall curves were plotted in Fig. 4. Our model achieved an AUC of 0.893, while
iDNA6mA-PseKNC got an AUC of 0.840, which also demonstrates that our model achieved better performance than
the iDNA6mA-PseKNC predictor.
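For readers less familiar with the AUC, the following sketch computes it from predicted scores through its rank interpretation: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name `roc_auc` is an assumption, and this pairwise version is meant for illustration rather than efficiency.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy example: three of the four positive/negative pairs are ranked correctly.
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```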
To test the performance of our model across species, we compared the performance of csDMA and
iDNA6mA-PseKNC on the different datasets, i.e., the cross-species, rice, and M. musculus datasets. For each dataset, 5-fold
cross-validation was performed and the previously optimized parameters were used. We used the same folds for
training on the different datasets. The five-round results of each measure were averaged and are shown in Table 4. For the
cross-species dataset, iDNA6mA-PseKNC got an AUC of 0.844, while our model achieved a higher AUC of 0.879.
For the rice dataset, iDNA6mA-PseKNC achieved an AUC of 0.896, while our model achieved a higher AUC of
0.923. For the M. musculus dataset, both models got the same AUC value, but our model achieved higher
MCC and F1 than iDNA6mA-PseKNC. These results indicate that the proposed method performs well
and can be used to predict 6mA sites in different species.
Algorithm         Sn      Sp      ACC     MCC     AUC     F1
RandomForest      0.875   0.747   0.814*  0.630*  0.884   0.832*
GradientBoosting  0.765   0.757   0.761   0.522   0.854   0.771
AdaBoost          0.776   0.719   0.749   0.496   0.814   0.764
ExtraTrees        0.888*  0.729   0.813   0.628   0.893*  0.832*
SVM               0.843   0.761*  0.804   0.607   0.875   0.819
Table 3. Model performance of the different algorithms on the independent testing dataset. The highest value
of each column is marked with an asterisk.
Figure 4. Performance comparison of csDMA and iDNA6mA-PseKNC. (A) The ROC curves of csDMA and
iDNA6mA-PseKNC. (B) The Precision-Recall curves of csDMA and iDNA6mA-PseKNC.
Discussion
Unlike the prediction of m6A modifications in mRNA, the identification of 6mA modifications in DNA is still
at an early stage. In this study, we developed an improved tool, called csDMA, for predicting 6mA modifications
in different species. Three feature encoding strategies were used to generate the feature matrix and different
algorithms were tested in the model. For performance evaluation, 5-fold cross-validation and an independent
test were used, and the ExtraTrees classifier achieved the best performance on the training and independent test
datasets. We also compared the performance of our tool with that of iDNA6mA-PseKNC, and the results showed
that our model effectively improved the recognition performance for DNA 6mA modifications.
The i6mA-Pred predictor is the other of the two existing tools for DNA 6mA prediction. However, the
research paper was still in the corrected proof phase and the method was not accessible when our work was finished.
Fortunately, we learned from its online abstract that the method achieved an ACC of 0.831 by using a jackknife
test. Because a jackknife test generates a fixed ACC on the same dataset, and their dataset is the
rice dataset used in this study, we also evaluated the performance of our model on the rice dataset by using
a jackknife test. Our model achieved an ACC of 0.859, which is higher than that of i6mA-Pred.
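The reason a jackknife (leave-one-out) test yields a fixed ACC is that it involves no random partition: each sample is held out exactly once. The sketch below illustrates this with a toy 1-nearest-neighbour classifier based on Hamming distance. That classifier is only a deterministic stand-in chosen for brevity and is not the method used by csDMA or i6mA-Pred; all function names and the toy sequences are assumptions.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def nearest_neighbour_predict(train_seqs, train_labels, query):
    """Toy deterministic classifier: label of the closest training sequence."""
    best = min(range(len(train_seqs)), key=lambda i: hamming(train_seqs[i], query))
    return train_labels[best]

def jackknife_accuracy(seqs, labels, predict):
    """Hold out each sample once, train on the rest, and report the accuracy."""
    correct = 0
    for held_out in range(len(seqs)):
        train_seqs = seqs[:held_out] + seqs[held_out + 1:]
        train_labels = labels[:held_out] + labels[held_out + 1:]
        if predict(train_seqs, train_labels, seqs[held_out]) == labels[held_out]:
            correct += 1
    return correct / len(seqs)

toy_seqs = ["AAAA", "AAAT", "AATA", "GGGG", "GGGC", "GGCG"]
toy_labels = [1, 1, 1, 0, 0, 0]
print(jackknife_accuracy(toy_seqs, toy_labels, nearest_neighbour_predict))
```

With a learner that itself uses randomness, such as a tree ensemble, the result is fixed only if the random seed is fixed as well.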
Although our model achieved high performance on the M. musculus dataset, its performance on the rice
and cross-species datasets was relatively low. In the future, more feature encoding schemes, such as genomic and
structural features, will be used to improve the performance of csDMA. We will also extend csDMA to other
species, such as human and Arabidopsis thaliana.
References
1. Dunn, D. B. & Smith, J. D. Occurrence of a new base in the deoxyribonucleic acid of a strain of bacterium coli. Nature. 175, 336–337 (1955).
2. Vanyushin, B. F., Belozersky, A. N., Kokurina, N. A. & Kadirova, D. X. 5-Methylcytosine and 6-Methylaminopurine in Bacterial DNA. Nature. 218, 1066–1067 (1968).
3. Casadesus, J. & Low, D. Epigenetic gene regulation in the bacterial world. Microbiology and Molecular Biology Reviews. 70, 830 (2006).
4. Bird, A. Use of restriction enzymes to study eukaryotic DNA methylation: II. The symmetry of methylated sites supports semi-conservative copying of the methylation pattern. Journal of Molecular Biology. 118, 49–60 (1978).
5. Fu, Y. et al. N6-Methyldeoxyadenosine marks active transcription start sites in Chlamydomonas. Cell. 161, 879–892 (2015).
6. Koziol, M. J. et al. Identification of methylated deoxyadenosines in vertebrates reveals diversity in DNA modifications. Nature Structural & Molecular Biology. 23, 24–30 (2016).
7. Mondo, S. et al. Widespread adenine N6-methylation of active genes in fungi. Nature Genetics. 49 (2017).
8. Zhou, C. et al. Identification and analysis of adenine N6-methylation sites in the rice genome. Nature Plants. 4, 554–563 (2018).
9. Zhang, Q. et al. N(6)-Methyladenine DNA methylation in Japonica and Indica rice genomes and its association with gene expression, Plant Development, and Stress Responses. Molecular Plant. 11, 1492–1508 (2018).
10. Feng, P. M. et al. iDNA6mA-PseKNC: Identifying DNA N6-methyladenosine sites by incorporating nucleotide physicochemical properties into PseKNC. Genomics. 111, 96–102 (2018).
11. Chen, W., Lv, H., Nie, F. & Lin, H. i6mA-Pred: identifying DNA N6-methyladenine sites in the rice genome. Bioinformatics. btz015 (2019).
12. Xu, Y. et al. iNitro-Tyr: Prediction of nitrotyrosine sites in proteins with general pseudo amino acid composition. Plos One. 9, e105018 (2014).
13. Chen, W., Feng, P., Ding, H., Lin, H. & Chou, K. C. iRNA-Methyl: Identifying N6-methyladenosine sites using pseudo nucleotide composition. Analytical Biochemistry. 490, 26–33 (2015).
14. Chen, W., Tang, H., Ye, J., Lin, H. & Chou, K. C. iRNA-PseU: Identifying RNA pseudouridine sites. Molecular Therapy-Nucleic Acids. 5, e332 (2016).
15. Jia, J., Zhang, L. X., Liu, Z., Xiao, X. & Chou, K. C. pSumo-CD: Predicting sumoylation sites in proteins with covariance discriminant algorithm by incorporating sequence-coupled effects into general PseAAC. Bioinformatics. 32, 3133–3141 (2016).
16. Qiu, W. R., Sun, B. Q., Xiao, X., Xu, Z. C. & Chou, K. C. iPTM-mLys: identifying multiple lysine PTM sites and their different types. Bioinformatics. 32, 3116–3123 (2016).
17. Feng, P. et al. iRNA-PseColl: Identifying the occurrence sites of different RNA modifications by incorporating collective effects of nucleotides into PseKNC. Molecular Therapy-Nucleic Acids. 7, 155–163 (2017).
18. Chen, W. et al. iRNA-3typeA: identifying 3-types of modification at RNA’s adenosine sites. Molecular Therapy-Nucleic Acids. 11, 468–474 (2018).
19. Qiu, W. R. et al. iKcr-PseEns: Identify lysine crotonylation sites in histone proteins with pseudo components and ensemble classifier. Genomics. 110, 239–246 (2018).
20. Li, F. et al. Positive-unlabelled learning of glycosylation sites in the human proteome. BMC Bioinformatics. 20, 112 (2019).
21. Zhang, Y. et al. Computational analysis and prediction of lysine malonylation sites by exploiting informative features in an integrative machine-learning framework. Briefings in Bioinformatics. https://doi.org/10.1093/bib/bby079 (2018).
22. Chen, Z. et al. Large-scale comparative assessment of computational predictors for lysine post-translational modification sites. Briefings in Bioinformatics. https://doi.org/10.1093/bib/bby089 (2018).
23. Chou, K. C. Some remarks on protein attribute prediction and pseudo amino acid composition. Journal of Theoretical Biology. 273, 236–247 (2011).
24. Chou, K. C. Advance in predicting subcellular localization of multi-label proteins and its implication for developing multi-target drugs. Current Medicinal Chemistry, https://doi.org/10.2174/0929867326666190507082559 (2019).
25. Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 28, 3150–3152 (2012).
26. Chou, K. C. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins. 43, 246–255 (2001).
27. Chou, K. C. Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics. 21, 10–19 (2005).
28. Shen, H. B. & Chou, K. C. PseAAC: a flexible web-server for generating various kinds of protein pseudo amino acid composition. Analytical Biochemistry. 373, 386–388 (2008).
Algorithm        Species        Sn     Sp     ACC    MCC    AUC    F1
csDMA            Cross-species  0.863  0.735  0.799  0.603  0.879  0.811
                 Rice           0.842  0.880  0.861  0.723  0.923  0.858
                 M. musculus    0.932  1      0.966  0.935  0.974  0.965
iDNA6mA-PseKNC   Cross-species  0.762  0.769  0.765  0.531  0.844  0.764
                 Rice           0.569  0.721  0.641  0.394  0.896  0.543
                 M. musculus    0.869  1      0.935  0.877  0.974  0.930
Table 4. Model performance of each algorithm across species.
29. Du, P., Wang, X., Xu, C. & Gao, Y. PseAAC-Builder: A cross-platform stand-alone program for generating various special Chou’s pseudo amino acid compositions. Analytical Biochemistry. 425, 117–119 (2012).
30. Cao, D. S., Xu, Q. S. & Liang, Y. Z. propy: a tool to generate various modes of Chou’s PseAAC. Bioinformatics. 29, 960–962 (2013).
31. Du, P., Gu, S. & Jiao, Y. PseAAC-General: Fast building various modes of general form of Chou’s pseudo amino acid composition for large-scale protein datasets. International Journal of Molecular Sciences. 15, 3495–3506 (2014).
32. Chou, K. C. Pseudo amino acid composition and its applications in bioinformatics, proteomics and system biology. Current Proteomics. 6, 262–274 (2009).
33. Chen, W., Lei, T. Y., Jin, D. C., Lin, H. & Chou, K. C. PseKNC: a flexible web-server for generating pseudo K-tuple nucleotide composition. Analytical Biochemistry. 456, 53–60 (2014).
34. Chen, W. & Lin, H. Pseudo nucleotide composition or PseKNC: an effective formulation for analyzing genomic sequences. Molecular BioSystems. 11, 2620–2634 (2015).
35. Liu, B., Yang, F., Huang, D. S. & Chou, K. C. iPromoter-2L: a two-layer predictor for identifying promoters and their types by multi-window-based PseKNC. Bioinformatics. 34, 33–40 (2018).
36. Tahir, M., Tayara, H. & Chong, K. T. iRNA-PseKNC(2methyl): Identify RNA 2′-O-methylation sites by convolution neural network and Chou’s pseudo components. Journal of Theoretical Biology. 465, 1–6 (2019).
37. Liu, B. et al. Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic Acids Research. 43, W65–W71 (2015).
38. Liu, B. & Wu, H. Pse-in-One 2.0: An improved package of web servers for generating various modes of pseudo components of DNA, RNA, and protein sequences. Natural Science. 9, 67–91 (2017).
39. Chen, Y., Tang, Y., Sheng, Z. & Zhang, Z. Prediction of mucin-type O-glycosylation sites in mammalian proteins using the composition of k-spaced amino acid pairs. BMC Bioinformatics. 9, 101 (2008).
40. Wang, X., Yan, R. & Song, J. DephosSite: a machine learning approach for discovering phosphotase-specific dephosphorylation sites. Scientific Reports. 6, 23510 (2016).
41. Chou, K. C. Using subsite coupling to predict signal peptides. Protein Engineering. 14, 75–79 (2001).
42. Chou, K. C. Prediction of signal peptides using scaled window. Peptides. 22, 1973–1979 (2001).
43. Liu, B., Wang, S., Long, R. & Chou, K. C. iRSpot-EL: identify recombination spots with an ensemble learning approach. Bioinformatics. 33, 35–41 (2017).
44. Cheng, X., Lin, W. Z., Xiao, X. & Chou, K. C. pLoc_bal-mAnimal: predict subcellular localization of animal proteins by balancing training dataset and PseAAC. Bioinformatics. 35, 398–406 (2019).
45. Song, J., Wang, Y. & Li, F. iProt-Sub: a comprehensive package for accurately mapping and predicting protease-specific substrates and cleavage sites. Briefings in Bioinformatics. 20, 638–658 (2018).
46. Cheng, X., Zhao, S. G., Lin, W. Z., Xiao, X. & Chou, K. C. pLoc-mAnimal: predict subcellular localization of animal proteins with both single and multiple sites. Bioinformatics. 33, 3524–3531 (2017).
47. Cheng, X., Zhao, S. G., Xiao, X. & Chou, K. C. iATC-mISF: a multi-label classifier for predicting the classes of anatomical therapeutic chemicals. Bioinformatics. 33, 341–346 (2017).
48. Chou, K. C. Some remarks on predicting multi-label attributes in molecular biosystems. Molecular Biosystems. 9, 1092–1100 (2013).
49. Song, J. et al. Transcriptome-wide annotation of m5C RNA modifications using machine learning. Frontiers in Plant Science. 9, 519 (2018).
50. Chou, K. C. & Forsén, S. Diffusion-controlled effects in reversible enzymatic fast reaction system: Critical spherical shell and proximity rate constants. Biophysical Chemistry. 12, 255–263 (1980).
51. Carter, R. E. & Forsén, S. A new graphical method for deriving rate equations for complicated mechanisms. Chemica Scripta. 18, 82–86 (1981).
52. Chou, K., Chen, N. & Forsén, S. The biological functions of low-frequency phonons: 2. Cooperative effects. Chemica Scripta. 18,
126–132 (1981).
53. Jiang, S. P., Liu, W. M. & Fee, C. H. Graph theory of enzyme inetics: 1. Steady-state reaction system. Scientia Sinica. 22, 341–358
(1979).
54. Shen, H. B., Song, J. N. & Chou, . C. Prediction of protein folding rates from primary sequence by fusing multiple sequential
features. Journal of Biomedical Science and Engineering. 2, 136–143 (2009).
55. Chou, . C. Graphic rule for drug metabolism systems. Current Drug Metabolism. 11, 369–378 (2010).
56. Zhou, G. P. e disposition of the LZCC protein residues in wenxiang diagram provides new insights into the protein-protein
interaction mechanism. Journal of eoretical Biology. 284, 142–148 (2011).
57. Chou, . C. & Shen, H. B. ecent advances in developing web-servers for predicting protein attributes. Natural Science. 1, 63–92
(2009).
58. Chou, . C. Impacts of bioinformatics to medicinal chemistry. Medicinal Chemistry. 11, 218–234 (2015).
59. Chou, . C. An unprecedented revolution in medicinal chemistry driven by the progress of biological science. Current Topics in
Medicinal Chemistry. 17, 2337–2358 (2017).
Acknowledgements
This work was supported by the Start-up foundation of Northwest A&F University (Z109021809), the National
Natural Science Foundation of China (51809218), and the Postdoctoral Research Foundation of China
(2018M643744).
Author Contributions
Z.L. participated in conceiving and performing the experiments. W.D. and W.J. participated in analyzing the data.
All authors contributed to the writing of the manuscript.
Additional Information
Supplementary information accompanies this paper at https://doi.org/10.1038/s41598-019-49430-4.
Competing Interests: The authors declare no competing interests.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International
License, which permits use, sharing, adaptation, distribution and reproduction in any medium or
format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the
Creative Commons license, and indicate if changes were made. The images or other third party material in this
article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the
material. If material is not included in the article’s Creative Commons license and your intended use is not
permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the
copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
© The Author(s) 2019