Entropy 2015,17, 1673-1689; doi:10.3390/e17041673 OPEN ACCESS
Analysis of Data Complexity in Human DNA for
Gene-Containing Zone Prediction †
Ricardo E. Monge 1,* and Juan L. Crespo 2
1Escuela de Ciencias de la Computación y de la Informática, Universidad de Costa Rica, San Pedro de
Montes de Oca, San José, Código Postal 2060-San José, Costa Rica
2Escuela de Ingeniería Eléctrica, Universidad de Costa Rica, San Pedro de Montes de Oca, San José,
Código Postal 2060-San José, Costa Rica; E-Mail: email@example.com
†This paper is an extended version of our paper published in Proceedings of International Work
Conference on Bio-inspired Intelligence (IWOBI), Costa Rica, 16–18 July 2014; pp. 71–75,
*Author to whom correspondence should be addressed; E-Mail: firstname.lastname@example.org;
Academic Editors: Carlos M. Travieso-González and Jesús B. Alonso-Hernández
Received: 27 November 2014 / Accepted: 17 March 2015 / Published: 27 March 2015
Abstract: This study delves further into the analysis of genomic data by computing a variety
of complexity measures. We analyze the effect of window size and evaluate the precision and
recall of the prediction of gene zones, aided with a much larger dataset (full chromosomes).
A technique based on the separation of two cases (gene-containing and non-gene-containing)
has been developed as a basic gene predictor for automated DNA analysis. This predictor
was tested on various sequences of human DNA obtained from public databases, in a set
of three experiments. The ﬁrst one covers window size and other parameters; the second
one corresponds to an analysis of a full human chromosome (198 million nucleic acids);
and the last one tests subject variability (with ﬁve different individual subjects). All three
experiments have high-quality results, in terms of recall and precision, thus indicating the
effectiveness of the predictor.
Keywords: information complexity; DNA; genomic variability; gene prediction; nucleic
Entropy 2015,17 1674
The analysis of complexity measures of genomic sequences is one of the pre-processing techniques
that can lead to better pattern recognition and pattern inference in DNA (a type of nucleic acid called
deoxyribonucleic acid) sequences. The authors are engaged in active research regarding the development
of new techniques based on bioinspired intelligence (that is, artiﬁcial intelligence techniques that emulate
up to a certain point, or are inspired from, the processes behind living organisms), to analyze genomic
data and their different relationships with other types of biological data (as found in medical diagnosis
and medical imagery).
The objective of the paper is to show that gene-prediction location (i.e., determining whether a
subsequence of genomic data corresponds to a particular gene) is possible by means of measures based
on information (or data) complexity. Entropy has been used recently as input for data mining , and
this study explores techniques based not only on entropy, but on a number of data complexity measures.
The paper discusses previous work, outlines some complexity measures and proposes a predictor based
on clustering the complexity results. This predictor is then tested on a variety of datasets to show that,
in effect, prediction is possible. This research does not delve into the biological and biochemical details
regarding different types of genes, since our analysis is performed over the genomic sequence. Thus,
predicting based on genomic characteristics (such as codons, exons and introns) is not within the scope
of this research. As mentioned below, this research addresses the genomic sequence as such, without the
use of genetic structures.
In previous work , presented at the International Workshop and Conference on Bioinspired
Intelligence, held in Costa Rica, it was shown that complexity metrics along a sequence can be used
as an indicator of the presence (or absence) of patterns that correspond to genes. That initial study of
the usage of complexity metrics showed that certain statistical properties of the sequence of complexity
measures were signiﬁcantly different for the subsequences that contained genes than for subsequences
that did not contain genes, in spite of being tested on a relatively small dataset. Therefore, these results
suggest that it is worth pursuing these types of transformations even further, to convey information
more precisely for the computational intelligence algorithms to be developed as part of future research.
Nekturenko and Makova have shown that it is possible to compute the potential of a genomic region
with a comparative interspecies technique , whereas the present study shows that that potential can be
approximated without a comparative technique.
This paper is divided into six sections, starting with an exploration of the complexity analysis of
nucleic acids over the years. It is worth noting that this has not been, however, a very active research topic
within entropy and complexity analysis itself. The second section summarizes recent work on the topic.
The following section describes, from a mathematical point of view, the complexity measures selected
and discusses the reasons for the selection of measures. In Section 4, certain basic concepts of human
DNA and details about the databases used in the study are described, and in Section 5, the predictor
model is outlined. Finally, Section 6 contemplates an experimental process in which the hypothesis that
prediction is possible is tested with a variety of datasets from the human genome, and the reference gene
locations that are tested were taken from the RefSeq database (http://www.ncbi.nlm.nih.gov/RefSeq/).
Entropy 2015,17 1675
As a general framework, the present study uses the notion of data complexity derived from complex
adaptive systems (due to the fact that genomic architecture deﬁnes complex behavior), as mentioned by
Neil , in which a complex adaptive system has some (or all) of the following attributes: (1) the number
of parts in the system and the number of relations between the parts is non-trivial (even though there is no
general rule to separate trivial from non-trivial); (2) the system has memory or feedback; (3) the system
can adapt itself according to its history; (4) the relations between the system and its environment are
non-trivial or non-linear; (5) the system can be inﬂuenced by, or can adapt itself to, its environment; and
(6) the system is highly sensitive to initial conditions. As a complex system, biological life, encoded by
DNA, fulﬁlls all six attributes in the following way: (1) DNA contains large amounts of genes, and the
relationship between genes is not yet fully understood; (2) most DNA and genetic expressions depend
on previous events; (3) genes can adapt very easily to different conditions; (4) the relationship between
genes, DNA and the environment is just being understood; (5) genes and DNA can be inﬂuenced by, or
can adapt themselves to, their environment; and (6) gene expressions are sensitive to initial conditions
and environmental factors.
2. Previous Work in DNA Complexity Study
Here, DNA complexity is a term used to refer to all research regarding the computation and
application of complexity measures to sequences of DNA (or other nucleic acids) carried out to explain
biological processes or to transform data for later use. Furthermore, the computation of DNA complexity
can be used for general research analysis.
DNA complexity has been studied throughout the years, even when computational tools were not yet
mainstream in genetics research. In 1974 , one of the ﬁrst techniques for nucleic acid complexity
was developed by counting the amount of hybrid DNA-RNA sequences produced by an embryo. In
1982, Hough-Evans proposed a technique based on chromatography to reveal repetitive patterns in
single-celled bacterial organisms , thus providing a fairly visual technique for DNA and organism
The discovery (and use) of repetitive patterns in DNA and progress in computational and numerical
tools, as well as the availability of sequenced genomes triggered a set of computational calculations .
Gusev and others evaluated genetic complexity by ﬁnding the amount of repetitive sequences (commonly
interpreted as regulatory DNA) with a Lempel–Ziv measure . The role of protein-coding DNA and
regulatory DNA has been understood only recently and provides insight into the belief that organism
complexity is related to the amount of “extra” DNA . Entropy is a common measure applied to
DNA, when referring to the randomness and structure of the data. Schmitt  explores the statistical
estimation of the information content of DNA, while Crochemore and Vérin show that there are zones
with low entropy (structured data) within DNA . Additional research proved that, in effect, natural
DNA has lower entropy than laboratory-synthesized DNA . Lanctot  estimates DNA entropy
by the use of an algorithm based on language parsers, treating DNA as a ﬁnite state machine. Koslicki
works out the geometric version of entropy and applies it to DNA .
The use of proﬁles based on complexity measures is not new. Proﬁle analysis of genomic sequences
using linguistic complexity (a type of Lempel–Ziv) was proposed by Troyanskaya . Lempel–Ziv
Entropy 2015,17 1676
complexity and compressibility have also been used outside genomic studies. Xiao and others use
Lempel–Ziv to predict a protein sequence location within a cell . Ferenets and others  compared
different types of entropy, Lempel–Ziv complexity and fractal dimension to electroencephalograms
of patients under anesthesia with the standard clinical depth of sedation score. Aboy, in 2006,
used Lempel–Ziv to estimate the bandwidth and harmonic variability of diverse quasi-periodic signals
recorded for biomedical purposes .
Other studies have approached genomic complexity (complexity based not only on DNA, but also
on other nucleic acids) by interpreting the nucleic acid sequence as a fractal. Berthelsen analyzed
genomic sequences as a four-dimensional random walk and computed the resulting fractal dimension,
which turned out to be signiﬁcantly lower than sequences of genomic data generated randomly .
Concerning the physical structure (rather than the speciﬁc sequence), Ercolini and others computed the
fractal dimension of DNA imaged under an atomic force microscope . Complexity analysis of the
genomic data obtained has been done on RNA and certain types of bacterial DNA, but not on human
chromosomes (which is the central point in this study).
A less-known complexity measure, known as the regularity index , appears more promising,
but will not be covered here. The regularity index is a generic variant of the index of maximum
regularity, described as a new tool for computational analysis of DNA . The regularity index
measures how regular a symbol sequence is and will be tested by the authors and reported on in a
paper devoted to analyzing the relationship between regularity and complexity in DNA subsequences.
This technique based on regularity could have better results, because the original version (the index
of maximum regularity) has been used to detect chromosome telomeres [23,24] and genetic regularity
in simple yeasts  with high conﬁdence. The role of information theory in molecular biology ,
proteomics  and genomics has currently been an active topic.
More recently, research has been oriented towards the usage of next-generation sequencers, and
complexity analyses have not been done yet for these types of new data. This is due to: (1) the
amounts of data produced, which pose difﬁculties for processing; and (2) the greater interest in more
applied research regarding genomic data. As such, studies that use data from next-generation sequencers
tend to be directed toward well-known computational and statistical techniques already in use by
3. Measures for Data Complexity and Entropy
To be able to perform analyses of the values of complexity for a sequence of DNA, it was necessary to
deﬁne certain measures (computable values) for data complexity and entropy (which can be considered
a type of data complexity). Entropy is a measure of the order (or absence of order, to be precise) of a
physical system (in the case of data, of bits); and data complexity is a generic term referring here to the
variety of models, measures and techniques to indicate how “complex” a set of information is.
Lloyd has compiled lists of diverse techniques to measure and compare information complexity, and
we have based our selection on his non-exhaustive list , complemented by bibliographic research in
medical, physics and mathematics journal archives. Table 1 shows the complexity measures analyzed
and indicates whether they are considered in the study or not. The selection criteria for the complexity
Entropy 2015,17 1677
measures were concise, with algebraic, arithmetic or statistical computation that would not require
extensive programming and testing, to make it possible to perform the analysis on large datasets and
with a variety of parameters. Thus, techniques based on integrals, derivatives and differential equations
were not included here.
Table 1. List of possible complexity measures considered.
Measure Type Included in the proposed predictor?
Entropy Thermodynamic No
Shannon entropy Algebraic Yes
Renyi continuous entropy  Algebraic No
Kolmogorov Complexity Not computable No
Lempel–Ziv Algorithmic Yes
Statistical complexity (CLMC ) Algorithmic Yes
Fractal dimension Algebraic No (see the text)
Information ﬂuctuation Statistical No
Randomness Physical No
Thermodynamic depth Physical No
Predictive complexity Artiﬁcial intelligence No
Fractal dimension is not included in this study because of difﬁculties in computing precise values for
representations of DNA as a fractal (in spite of studies [19,20] indicating that fractal dimension would
have good results).
3.1. Shannon Entropy
Information entropy, or Shannon entropy , corresponds to how much a given set of information
is unpredictable. If one can predict the next set of events, given the actual information, one has a
low-entropy information set. Shannon entropy, denoted by H(S), is computed by:
H(S) = −
where P(Si)is the relative frequency of character iwithin string S, with length n.
As such, Shannon entropy is a basic measure of complexity, based on the frequency of event
3.2. Statistical Complexity
Regarding statistical complexity, of which several measures were studied and analyzed by Feldman
and Crutchﬁeld , and given the computation time constraints, the authors of this paper decided to
opt for a measure that would require only statistical manipulations. Most statistical complexity measures
require the computation of approximations of differential equations. The CLMC measure is one which
Entropy 2015,17 1678
requires only statistical and algebraic manipulations to offer a measure for statistical complexity .
It considers both the intrinsic entropy of the data and the departure of the probability of each symbol
from uniformity (referred to as disequilibrium by the authors of the measure). Therefore, the CLM C for
a given string Sis deﬁned by:
CLM C (Y) = H(S)D(S),(2)
where His the Shannon entropy (deﬁned above), and:
is the disequilibrium, where nis the length of the sequence.
Since this measure is a derivation of Shannon entropy, it is an easily computable measure for large
datasets (such as raw DNA data) and also returns measure values between zero and one.
3.3. Kolmogorov Complexity
Kolmogorov complexity is based on the concept that if a given sequence Scan be generated by
an algorithm smaller than the length of the given sequence, then its complexity corresponds to the
size of the algorithm. If a sequence is truly complex and random, the shortest representation is the
sequence itself . These models were later reﬁned by Chaitin when algorithmic information content
was proposed . The issue is that Kolmogorov complexity itself is not computable. In the present
study, the Lempel–Ziv approximation to Kolmogorov complexity was used. It quantiﬁes the amount
that a sequence (or a string of characters or symbols) can be compressed using the Lempel–Ziv–Welch
algorithm , which is the complexity measure used by Gusev .
4. Characteristics of Human DNA
Human DNA is composed of 23 chromosomes (of diverse lengths) with an average of 30 to 200
million base pairs per chromosome (thus giving it a total of three billion base pairs). A base pair
encodes a Watson–Crick complement (when a double-stranded molecular structure, such as DNA, is
being described) with a letter, representing adenine, cytosine, guanine and thymine (and its respective
complement), which are the molecules that form DNA in all species. “Base pair” can also refer to a unit
of length that covers exactly one nucleic acid (and its complement). In Figure 1, the chemical structure
of DNA and the corresponding nucleotides (also called nucleic acids) are illustrated.
When a genome is sequenced, it means that either experts or machines have identiﬁed each base
pair in the molecule of DNA, and that sequence of symbols (a letter for each nucleotide) is saved into
databases as a sequence. In Figure 2, a fragment that has been sequenced is illustrated.
It is over these types of sequences that the predictor operates, when performing the complexity
measurements to determine whether it contains a gene or not.
Entropy 2015,17 1679
Figure 1. Double helix structure of DNA. Illustration by Madeleine Price Ball.
Figure 2. Transcript of DNA. Subsequence of 1224 nucleotides.
5. Predictor Proposal
The present research has been based on the fact that there is some insight into genomic data
complexity and information content to discern between gene-containing regions and those that do not
contain genes, based solely on quantiﬁable information from complexity measures.
The intention of the study is to provide an automated technique to indicate, on a given sequence
of base pairs corresponding to nucleic acids, whether subsequences correspond to genes or do not
correspond to genes. Thus, the paper outlines a predictor of gene-coding zones by analyzing the different
Entropy 2015,17 1680
values of complexity measures that provide insight on whether the zone might correspond to a gene
Since the calculation of complexity measures requires a set of data, and not one single value, a
technique based on windows was selected to produce a sequence of measures for the DNA dataset.
The computation of the complexity measures is done by a shifting window, called W, of size Ws.
The measure is computed for the DNA subsequence, starting at position p= 0 and with a length of
Ws. The resulting value is stored, for later processing, and the starting position is shifted Tw, that is,
pnext =p+Tw; then, the computation continues until an end-of-string indicator has been triggered.
As explained above, the predictor has two parameters (window size and window shift) and is evaluated
for each complexity measure deﬁned in Section 3. Figure 3 illustrates the window process and how it
manages to cover the entire sequence in less computational time.
Figure 3. Window shift and window size illustration.
In the earlier version of this work , the classiﬁer had its ranges for “gene” and “non-gene”
established manually based on the standard deviation and mean; and it was used for the initial
experimentation. However, an automated process for the establishment of the limit is necessary, to be
able to function on all types of genomic data. Therefore, a technique to compute those ranges from the
data obtained was deemed required, and among one-dimensional clustering and grouping techniques,
k-means clustering was found to be the simplest to compute for the given datasets. It was used to
determine intervals for “coding” and “non-coding” zones, for each complexity measure being used,
taking into account the knowledge of the measures developed for the preliminary version of this paper.
Once the intervals were deﬁned (varying, depending on window size), they were used to determine
Entropy 2015,17 1681
whether a given section of DNA is, in effect, “coding” or “non-coding”. As such, the threshold, which
indicates whether a zone is “gene” or “non-gene,” is computed and updated as data are processed.
Finally, to merge results among the different complexity measures, a simple majority rule is used. If,
for a given subsection, a majority of measures indicates that it is “coding,” it will be considered as such.
Otherwise, to play it safe, a non-match condition will be found and reported as “non-coding.”
6. Experimental Work
The veriﬁcation that entropy measures can accurately locate gene zones in human DNA requires
developing a set of statistical and repetitive tests for the mechanism proposed in Section 3. It is divided
into two sections: one devoted to a description of each of the three tests performed and another discussing
the results and the corresponding analysis for each of the three experimental processes.
Predictor testing was developed by extracting known gene locations, comparing whether the gene
locations match between the predicted set and the reference set, and computing statistical parameters,
such as precision and recall. In this context, recall, indicated by R, is the fraction of returned zones that
really correspond to known genes, where the whole is the set of the reference gene locations. Another
term, precision, indicated by P, is the ratio, within returned gene zones, that correspond to known
For all of the experimental processes, a curated and tested dataset (veriﬁed by researchers and in
wide use) was selected. It was produced by the Genome Reference Consortium, tagged GRCh38,
corresponding to Release 38, dated December 24, 2013, which is the most recently updated assembly
and used by both NCBI and UCSC. Each one of the 23 human chromosomes was annotated with known
gene data and saved in the GenBank database with accessions NC_000001 to NC_000024.
6.1. Description of the Experimental Processes
To verify the feasibility of the proposed gene predictor based on DNA entropy variability, we
conducted three experiments, using two different types of datasets to convey more precise and detailed
results. We have also tested a variety of different moving window conﬁgurations to analyze the effect of
It must be noted that the processes described here to verify the predictor are not classical experiments
(i.e., there is no statistical testing with a given signiﬁcance and no control groups). These experimental
processes are used exclusively to show the feasibility of the predictor described in this paper.
6.1.1. Experiment 1 (Parameter Selection)
The ﬁrst experiment addressed the effect of window size and window shift for gene-zone detection
in human DNA. We used a portion of 1,000,000 base pairs from chromosome Y (NCBI RefSeq ID
NC_000024 ), containing 150 genes (thus covering more than 100 genes to obtain precise values
for recall and precision). Recall and precision were computed for each test case, taking into account
the selection of window sizes and shift parameters. Previous research  suggests that window sizes
(Ws) cannot be too large, because that would extend the computation time, and it would not cover
Entropy 2015,17 1682
small genes. However, they cannot be too small, because large genes take a long time to be analyzed.
Thus, a set of four options for window sizes was included in the experimental setup (250, 500, 1000,
2000). Regarding window shift (Tw), a small window shift is more effective, but it requires larger
computation times. The window shift has been set for 1
4of the window size, and ﬁve more ﬁxed values
were selected. A window shift of 1, which results in a very slow computation, should provide exact
matches, instead of approximations limited by the shift. Chromosome Y was selected because it was
the one used for illustrative purposes in the conference version of this paper, with the objective of
offering a comparable experiment, and it also offers certain characteristics in regard to entropy that were
described by Koslicki  in his study of the entropy of exons and introns (start and end nucleotide
sequences found in identiﬁed genes). The proponents of the predictor discussed here are well aware of
the issues regarding sex chromosomes and their differences with remaining chromosomes (such as their
evolution [37,38], their variation in humans  and their bias towards a speciﬁc sex ). This then
does not invalidate the experimental process of determining the best parameters for the computation.
For a match to be considered valid, to compute recall and precision, the difference between the
reference and the predicted locations must be less than the deﬁned window shift. Table 2 illustrates
how the matching process is performed for some sample cases, using a Ws=1000 and Tw= 250.
Table 2. Partial match processing.
Gene Real Location Predicted Location Start Range Finish Range Decision
SRY 5375–6261 5500–6000 5125–5625 6011–6511 Yes
SRY 5375–6261 5000–6000 5125–5625 6011–6511 No
RPS4Y1 60,103–85,477 60,000-85,500 59,853–60,353 85,227–85,727 Yes
RPS4Y1 60,103–85,477 60,000-82,500 59,853–60,353 85,227–85,727 No
To ﬁnd these speciﬁc values, a number of preliminary trials were carried out to have a basis to deﬁne
the representative values. The experiment also contemplates feedback discussed during the conference
version of this publication.
The most effective combination (highest recall and precision) of window shift and size parameters has
been used for Experiments 2 and 3, with the intention of having comparable data using the same factors.
6.1.2. Experiment 2 (Full Chromosome Prediction)
In the second case, a sequence of human DNA corresponding to chromosome 3 (NCBI RefSeq ID
NC_000003 ) has been chosen. It consists of 198 million base pairs (with a few gaps) and with
known gene positions in a separate tag ﬁle. Window size (Ws) and window shift (Tw) are set to the
values determined in Experiment 1. Chromosome 3 covers 2203 genes in a variety of lengths, making
it appropriate for testing the predictor. The predictor, as it determines the classiﬁcation ranges by an
automated method, is aware of the variation in complexity among different chromosomes .
Entropy 2015,17 1683
6.1.3. Experiment 3 (Prediction for a Variety of Subjects)
For the third experiment, we used a subset of 100,000 base pairs from the start of chromosome Y
for ﬁve different individuals, taken from the 1000 Genome Project ; and we used the predictor to
locate the gene-coding zones of all ﬁve individuals, and those zones were compared to the known gene
locations taken from the reference sequence. This is the same region used in the conference version of
this paper .
6.2. Analysis of Results
Once the computational tasks are performed, it is essential to outline, describe and analyze the results
for each of the three experimental processes designed to test the feasibility of the prediction of genes
using complexity measures. For each case, at least one table of results is displayed, with a discussion of
the results and how they are related to one another.
6.2.1. Experiment 1: Regarding Parameter Selection
The computation process for all twenty-four trials of the parameter selection experiment took less
that an hour on a standard, four-core processor, using the given subset of 1,000,000 base pairs taken
from human chromosome 3 for all trials. Table 3 shows the computational results (in terms of recall and
precision in matching against the reference locations) for window sizes Ws= 250 and Ws= 500; and
Table 4 describes the results for window sizes Ws=1000 and Ws=2000.
Table 3. Experiment 1: The effect of small window size and window shift parameters in the
estimation of gene-coding zones in human DNA.
Ws= 250 Ws= 500
4R= 0.65 P= 0.62 R= 0.68 P= 0.70
Tw= 1 R= 0.98 P= 0.99 R= 0.98 P= 0.99
Tw= 50 R= 0.65 P= 0.61 R= 0.65 P= 0.67
Tw= 100 R= 0.61 P= 0.51 R= 0.67 P= 0.65
Tw= 250 R= 0.50 P= 0.41 R= 0.55 P= 0.60
Tw= 500 R= 0.43 P= 0.23 R= 0.51 P= 0.61
Table 4. Experiment 1: The effect of medium window size and window shift parameters in
the estimation of gene-coding zones in human DNA.
4R= 0.89 P= 0.89 R= 0.86 P= 0.89
Tw= 1 R= 0.97 P= 0.98 R= 0.98 P= 0.97
Tw= 50 R= 0.80 P= 0.88 R= 0.85 P= 0.88
Tw= 100 R= 0.85 P= 0.91 R= 0.84 P= 0.87
Tw= 250 R= 0.89 P= 0.89 R= 0.89 P= 0.85
Tw= 500 R= 0.71 P= 0.80 R= 0.86 P= 0.89
Entropy 2015,17 1684
These results indicate that the window-size parameter has to be large enough to encompass a variety
of features (for example, Ws≤500 results in poor precision and recall, because most DNA subsections
that code a gene are larger than 500 base pairs). In addition, the window-shift parameter has to permit
certain overlapping (a large Twtends to skip sections completely). The most effective combination, at
least for human DNA, is Ws=1000 and Tw= 250; with a large relative overlap between individual
analysis and a medium-sized window, which covers small genes with no problem. As the data above
suggest, using an elemental shift (Tw= 1) leads to exact matches to the genes.
Considering reverse recall (1−R) as the rate of false-positives for the predictor (i.e., gene zones that
do not have a corresponding gene in the reference location list), it can be observed that small window
sizes tend to misclassify genomic data (due, perhaps, to the fact that in such small sections, there is no
“pattern” that can be observed). As the data displayed in Tables 3 and 4 show, when the window size
parameter increases, this rate of false positives has a tendency to decrease.
6.2.2. Experiment 2: Regarding Full Chromosome Prediction
For the second experiment, which was performed over a full human chromosome (approximately 6%
of the total genetic information in the human species) in blocks of 20,000,000 nucleotides (base pairs
of genomic data), the results are also very satisfactory, taking into account that the processing time was
under three hours. The prediction parameters were set to the values indicated by the experimental process
and the results discussed in Section 6.2.1, which were Ws=1000 and Tw= 250.
To compare results and to have a clear overview of the positive and negative matching, with the
same parameters (window size and window shift), precision and recall were computed, for every
section composed of twenty million base pairs, and tabulated in Table 5. The data tested (i.e., human
chromosome 3) contain an approximation of 2000 genes, of different lengths and functions.
Table 5. Experiment 2: Recall and precision results for the entire third chromosome.
DNA Section Recall Precision
1–20,000,000 0.89 0.89
20,000,001–40,000,000 0.87 0.89
40,000,001–60,000,000 0.88 0.88
60,000,001–80,000,000 0.87 0.88
80,000,001–100,000,000 0.85 0.89
100,000,001–120,000,000 0.91 0.87
120,000,001–140,000,000 0.83 0.89
140,000,001–160,000,000 0.91 0.85
160,000,001–180,000,000 0.90 0.89
180,000,001–198,295,559 0.89 0.82
For a full chromosome, both recall and precision have low variability, and they have values close to
one, thus indicating high-quality prediction and detection.
Entropy 2015,17 1685
6.2.3. Experiment 3: Regarding Prediction for a Variety of Subjects
Short strands of genomic data from chromosome Y, taken from ﬁve subjects, were used to evaluate
the prediction process for a variety of subjects. The region selected contains three known genes, and the
predictor was expected to indicate the start and end locations of “potential genes.” Both the start location
and the ending location for each “potential gene” among the ﬁve subjects correspond quite closely to the
reference locations, indicating a good match. In Table 6, a summary of the computation and comparison
process is shown for these three genes and ﬁve subjects, with a total of ﬁfteen matched zones.
Table 6. Experiment 3: Matching locations for known genes, for a variety of subjects.
Gene Real Location Individual Matched Location
SRY 5375–6261 A 5500–6000
RNASEH2CP1 8347–8918 A 8250–9000
RPS4Y1 60,103–85,477 A 60,000–85,500
SRY 5375–6261 B 5250–6250
RNASEH2CP1 8347–8918 B 8500–9000
RPS4Y1 60,103–85,477 B 61,000–84,000
SRY 5375–6261 C 5250–6000
RNASEH2CP1 8347–8918 C 8500–8750
RPS4Y1 60,103–85,477 C 60,000–85,250
SRY 5375–6261 D 5000–6000
RNASEH2CP1 8347–8918 D 8000–9000
RPS4Y1 60,103–85,477 D 60,000–82,000
SRY 5375–6261 E 5250–6500
RNASEH2CP1 8347–8918 E 8250–9000
RPS4Y1 60,103–85,477 E 60,000–85,000
Small variations can be observed among the ﬁve subjects tested and are within reasonable tolerances.
A smaller window shift parameter (Tw) would have permitted greater precision in the establishment of
the limits corresponding to each gene. However, what is outstanding here is that the predictor was able
to match the locations without previous knowledge about the genes. Observe that the matched location
is a multiple of the window shift chosen. As mentioned above, it was found that the most effective
combination, at least for human DNA, is Ws=1000 and Tw= 250, with a reasonable computation time.
Entropy 2015,17 1686
Using the experimental results and, in particular, the recall and precision ratios for the cases
considered, it can be shown that data complexity actually does offer a high-quality prediction for gene
coding zones, for the case of human DNA. This result agrees with similar work in DNA complexity for
simple organisms mentioned above. Automated analysis permits decision-making and grouping based
solely on the characteristics of the data and results in different complexity measure values for coding and
non-coding DNA zones.
Finally, these results suggest that future work should be directed toward analyzing other complexity
measures (such as fractal dimension) and toward researching the potential of regularity measures, which
may be better suited for DNA than classic information complexity. Furthermore, future studies can
provide insight into more detailed genomic structures (parts of genes, parts of chromosomes) and
determine whether a mutation is present. An interesting point to research is whether the existing
principles behind genes (their structure, their known sequences, the existence of non-coding DNA) are
compatible with the technique described in this paper. The use of data complexity algorithms speciﬁcally
adapted for genomic data (such as Renyi continuous entropy  and topological entropy ) is also
worth including in future versions of the predictor developed.
The authors would like to thank the University of Costa Rica for providing an encouraging research
environment and the reviewers for providing valuable feedback.
Ricardo Monge formulated the problem, outlined the idea and performed the computational
processing. Juan Crespo provided important theoretical foundations and established the research topic.
Both authors worked jointly on interpreting the results and wrote the paper. Both authors have read and
approved the ﬁnal manuscript.
Conﬂicts of Interest
The authors declare no conﬂict of interest.
1. Holzinger, A.; Hörtenhuber, M.; Mayer, C.; Bachler, M.; Wassertheurer, S.; Pinho, A.J.;
Koslicki, D. On entropy-based data mining. In Interactive Knowledge Discovery and Data Mining
in Biomedical Informatics; Springer: Berlin/Heidelberg, Germany, 2014; pp. 209–226.
2. Monge, R.E.; Crespo, J.L. Comparison of complexity measures for DNA sequence analysis.
In Proceedings of International Work Conference on Bio-inspired Intelligence (IWOBI), Liberia,
Costa Rica, 16–18 July 2014; pp. 71–75.
3. Nekrutenko, A.; Makova, K.D.; Li, W.H. The KA/KS ratio test for assessing the protein-coding
potential of genomic regions: an empirical and simulation study. Genome Res. 2002,12, 198–202.
Entropy 2015,17 1687
4. Johnson, N. Simply Complexity: A Clear Guide to Complexity Theory; Oneworld Publications:
London, UK, 2009.
5. Galau, G.A.; Britten, R.J.; Davidson, E.H. A measurement of the sequence complexity of
polysomal messenger RNA in sea urchin embryos. Cell 1974,2, 9–21.
6. Hough-Evans, B.R.; Howard, J. Genome size and DNA complexity of Plasmodium falciparum.
Biochim. Biophys. Acta Gene Struct. Expr. 1982,698, 56–61.
7. Farach, M.; Noordewier, M.; Savari, S.; Shepp, L.; Wyner, A.; Ziv, J. On the entropy of DNA:
Algorithms and measurements based on memory and rapid convergence. In Proceedings of the
Sixth Annual ACM-SIAM Symposium on Discrete algorithms, San Francisco, CA, USA, 22–24
January 1995; Volume 95, pp. 48–57.
8. Gusev, V.D.; Nemytikova, L.A.; Chuzhanova, N.A. On the complexity measures of genetic
sequences. Bioinformatics 1999,15, 994–999.
9. Taft, R.J.; Pheasant, M.; Mattick, J.S. The relationship between non-protein-coding DNA and
eukaryotic complexity. Bioessays 2007,29, 288–299.
10. Schmitt, A.O.; Herzel, H. Estimating the entropy of DNA sequences. J. Theor. Biol. 1997,
11. Crochemore, M.; Vérin, R. Zones of low entropy in genomic sequences. Comput. Chem. 1999,
12. Loewenstern, D.; Yianilos, P.N. Signiﬁcantly lower entropy estimates for natural DNA sequences.
J. Comput. Biol. 1999,6, 125–142.
13. Lanctot, J.K.; Li, M.; Yang, E.H. Estimating DNA sequence entropy. In Proceedings of the
Eleventh Annual ACM-SIAM Symposium on Discrete Algorithms, San Francisco, CA, USA, 9–11
January 2000; Volume 9, pp. 409–418.
14. Koslicki, D. Topological entropy of DNA sequences. Bioinformatics 2011,27, 1061–1067.
15. Troyanskaya, O.G.; Arbell, O.; Koren, Y.; Landau, G.M.; Bolshoy, A. Sequence complexity
proﬁles of prokaryotic genomic sequencs: A fast algorithm for calculating linguistic complexity.
Bioinformatics 2002,18, 679–688.
16. Xiao, X.; Shao, S.; Ding, Y.; Huang, Z.; Huang, Y.; Chou, K.C. Using complexity measure factor
to predict protein subcellular location. Amino Acids 2005,28, 57–61.
17. Ferenets, R.; Lipping, T.; Anier, A.; Jantti, V.; Melto, S.; Hovilehto, S. Comparison of entropy
and complexity measures for the assessment of depth of sedation. IEEE Trans. Biomed. 2006,
18. Aboy, M.; Hornero, R.; Abásolo, D.; Álvarez, D. Interpretation of the Lempel-Ziv complexity
measure in the context of biomedical signal analysis. IEEE Trans. Biomed. 2006,53, 2282–2288.
19. Berthelsen, C.L.; Glazier, J.A.; Skolnick, M.H. Global fractal dimension of human DNA sequences
treated as pseudorandom walks. Phys. Rev. A 1992,45, doi:10.1103/PhysRevA.45.8902.
20. Ercolini, E.; Valle, F.; Adamcik, J.; Witz, G.; Metzler, R.; De Los Rios, P.; Roca, J.; Dietler, G.
Fractal dimension and localization of DNA knots. Phys. Rev. Lett. 2007,98, doi:10.1103/
21. Skliar, O.; Monge, R.E.; Oviedo, G.; Medina, V. Indices of Regularity and Indices of Randomness
for m-ary Strings. Revista de Matemática Teoría y Aplicaciones 2009,16, 43–59.
Entropy 2015,17 1688
22. Láscaris-Comneno, T.; Skliar, O.; Medina, V. Determinación de valores del índice de
máxima regularidad correspondientes a diversas secuencias de bases de ADN: un nuevo método
computacional en genética. In Proceedings of the IX Congreso Internacional de Biomatemática,
Concepción, Chile, 1999; pp. 81–97.
23. Morales, Y.; Ugalde, A.; Láscaris-Comneno, T. Análisis de regularidad de genomas para detección
de telómeros y secuencias autónomamente replicativas. UNICIENCIA 2009,24, 103–110.
24. Láscaris-Comneno, T.; Ugalde, A.; Morales, Y. Análisis de regularidad para el reconocimiento
de telómeros en Candida parapsilosis mitochondrion y el cromosoma XVI de Saccharomyces
cerevisae. Tecnología en Marcha 2011,24, Available online: http://tecdigital.tec.ac.cr/servicios/
ojs/index.php/tec_marcha/article/view/157/155 (accessed on 20 March 2015). (In Spanish)
25. Ugalde, A.; Láscaris-Comneno, T. Mathematics Applied to the Detection of Genetic Regularities
in the Yeast Yarrowia lipolytica. In Proccedings of the XVII International Conference on
Mathematical Methods Applied to the Sciences, San Jose, Costa Rica, 16–19 February 2010.
26. Adami, C. Information theory in molecular biology. Phys. Life Rev. 2004,1, 3–22.
27. Weiss, O.; Jimenez-Montano, M.A.; Herzel, H. Information content of protein sequences. J. Theor.
Biol. 2000,206, 379–386.
28. Lloyd, S. Measures of complexity: A nonexhaustive list. IEEE Control Syst. 2001,21, 7–8.
29. Vinga, S.; Almeida, J.S. Renyi continuous entropy of DNA sequences. J. Comput. Biol. 2004,
30. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948,27, 379–423.
31. Feldman, D.P.; Crutchﬁeld, J.P. Measures of statistical complexity: Why? Phys. Lett. A 1998,238,
32. Lopez-Ruiz, R.; Mancini, H.L.; Calbet, X. A statistical measure of complexity. Phys. Lett. A 1995,
33. Kolmogorov, A.N. Three approaches to the quantitative deﬁnition of information. Probl. Inf.
Transm. 1965,1, 3–11.
34. Chaitin, G.J. On the length of programs for computing ﬁnite binary sequences. JACM 1966,13,
35. Lempel, A.; Ziv, J. On the complexity of ﬁnite sequences. IEEE Trans. Inf. Theory 1976,22,
36. Pruitt, K.D.; Tatusova, T.; Maglott, D.R. NCBI Reference Sequence (RefSeq): A curated
non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2005,
37. Wilson, M.A.; Makova, K.D. Genomic analyses of sex chromosome evolution. Annu. Rev.
Genomics Hum. Genet. 2009,10, 333–354.
38. Bull, J.J. Evolution of Sex Determining Mechanisms; Benjamin Cummings: San Francisco, CA,
39. Underhill, P.A.; Shen, P.; Lin, A.A.; Jin, L.; Passarino, G.; Yang, W.H.; Kauffman, E.;
Bonné-Tamir, B.; Bertranpetit, J.; Francalacci, P.; et al. Y chromosome sequence variation and
the history of human populations. Nat. Genet. 2000,26, 358–361.
Entropy 2015,17 1689
40. Makova, K.D.; Yang, S.; Chiaromonte, F. Insertions and deletions are male biased too: A
whole-genome analysis in rodents. Genome Res. 2004,14, 567–573.
41. Siva, N. 1000 Genomes project. Nat. Biotech. 2008,26, doi:10.1038/nbt0308-256b.
2015 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article
distributed under the terms and conditions of the Creative Commons Attribution license