Identification of homologous microRNAs in 56 animal genomes
Sung-Chou Li a,b,c, Wen-Ching Chan a,b,d, Ling-Yueh Hu c, Chun-Hung Lai c, Chun-Nan Hsu d, Wen-chang Lin a,c,⁎
a Institute of Biomedical Informatics, National Yang-Ming University, Taipei, Taiwan
b Bioinformatics Program, Taiwan International Graduate Program, Academia Sinica, Taipei, Taiwan
c Institute of Biomedical Sciences, Academia Sinica, Taipei, Taiwan
d Institute of Information Sciences, Academia Sinica, Taipei, Taiwan
Article info
Received 16 December 2009
Accepted 17 March 2010
Available online 27 March 2010
Abstract
MicroRNAs (miRNAs) are endogenous non-protein-coding RNAs of approximately 22 nucleotides. Thousands
of miRNA genes have been identified (computationally and/or experimentally) in a variety of organisms,
which suggests that miRNA genes have been widely shared and distributed among species. Here, we first used
unique miRNA sequence patterns to scan the genome sequences of 56 bilaterian animal species to locate
candidate miRNAs. The regions surrounding these candidate miRNAs were then extracted, folded, and
used to calculate features of their secondary structure. Using a support vector machine (SVM) as a
classifier combined with these features, we identified an additional 13,091 orthologous or paralogous
candidate pre-miRNAs, as well as their corresponding candidate mature miRNAs. Stem-loop RT-PCR and
deep sequencing methods were used to experimentally validate the prediction results in human, medaka
and rabbit. Our prediction pipeline allows the rapid and effective discovery of homologous miRNAs in a large
number of genomes.
© 2010 Elsevier Inc. All rights reserved.
MicroRNAs (miRNAs) are short, endogenous, non-protein-coding
RNAs that negatively regulate gene expression by complementary
binding to the 3'-UTR regions of their target genes. The identification
of miRNAs in a wide variety of organisms suggests that the
miRNA regulation mechanism is evolutionarily conserved. This
conservation serves as an important criterion for identifying
homologous miRNAs in closely related species [3–6], although
not all miRNAs are evolutionarily conserved, especially in
phylogenetically distinct species. miRNAs function in several important
physiological pathways; for example, plant miRNAs regulate development
in embryos, leaves and floral meristems [7,8], and mammalian
miRNAs participate in developmental regulation but may
contribute to pathogenesis if they are dysfunctional.
Establishing a comprehensive miRNA resource would benefit sub-
sequent experimental research on miRNA expression and functional
analysis. However, discovery of miRNAs by experimental approaches is
not an easy task due to their relatively small size and low abundance.
Therefore, several bioinformatics approaches combined with experi-
mental validations have been applied to identify miRNAs from several
species [11–16]. Some researchers have also investigated the
conservation of miRNAs across metazoans or bilaterians to study miRNA
distribution patterns in several metazoan species [6,17–21]. According
to their reports, miRNAs are excellent phylogenetic markers and some
miRNAs are conserved throughout bilaterian evolution.
Previous miRNA identification studies have focused on only one
or a few selected organisms. For example, Grad et al. identified
Caenorhabditis elegans miRNAs, Wang et al. predicted miRNAs
in Arabidopsis thaliana, and Berezikov and colleagues predicted
miRNAs in human. Dezulian et al. identified plant miRNAs,
and Artzi recently performed similar research on seven
mammalian genomes. It would be interesting to examine the overall
distribution patterns of miRNAs across the bilaterian animal genomes
to learn more about the expansion of miRNAs. In previous bioinfor-
matic prediction pipelines, including those from our laboratory [3,25],
the first step is to predict miRNA hairpin precursors. In these schemes,
candidate hairpins are located in the genomic sequences, and then
the values of many quantifiable features are calculated to serve as
criteria to discriminate positive candidates from false ones. Because
of the large number of hairpins predicted from genomic sequences,
this hairpin-based discovery scheme is inefficient for large
genomes and needs to be improved.
Recently, many mature miRNA molecules have been cloned and
sequenced using new experimental methodologies. Thus, there are
more than 7,628 identified animal miRNA entries in the current
version (13.0) of miRBase . Because the sequence conservation
feature of miRNAs in phylogenetically related species is used for all
bioinformatics predictions, we modified our miRNA discovery pipe-
line by implementing an initial search for homologous mature miRNA
Genomics 96 (2010) 1–9
⁎ Corresponding author. Institute of Biomedical Sciences, Academia Sinica, Taipei
115, Taiwan, R. O. C. Fax: +886 2 2782 7654.
E-mail addresses: email@example.com (S.-C. Li), firstname.lastname@example.org
(W.-C. Chan), email@example.com (L.-Y. Hu), firstname.lastname@example.org (C.-H. Lai),
email@example.com (C.-N. Hsu), firstname.lastname@example.org (W. Lin).
0888-7543/$ – see front matter © 2010 Elsevier Inc. All rights reserved.
sequences in available genomes first to increase the prediction
efficiency and accuracy. This procedure was used mainly to promptly
locate homologous candidate miRNAs in genome sequences. Subse-
quently, we checked whether or not the sequences adjacent to the
candidate miRNAs could fold into hairpins. We have implemented this
modified discovery pipeline to detect mature miRNAs and pre-
miRNAs in 56 bilaterian animal genomes obtained from the UCSC
Genome Bioinformatics web site [27,28]. We then calculated the
values of selected quantifiable features from the resulting pool of
potential miRNAs and used them as discriminating indices in a
support vector machine (SVM) algorithm . Using SVM as a
classifier, we identified an additional 13,091 new orthologous or
paralogous pre-miRNAs, as well as their putative corresponding
mature miRNAs. Our results suggest that miRNA genes are widely
distributed in bilaterian animal species, including planarians, nema-
todes, insects, urchins, sea squirt and vertebrates. In addition, our
results indicate that different miRNAs have distinct distribution pat-
terns among various species.
We obtained 7,628 animal pre-miRNA entries from miRBase 13.0
and grouped them into 1,534 family classes based on their ID and
sequence similarity. These 7,628 pre-miRNAs encode a total of 7,939
mature miRNAs, which can be further classified into 3,861 unique
mature miRNA sequence patterns (see Materials and methods). To
avoid redundant findings among miRNA classes, we only used these
3,861 unique sequence patterns in the subsequent scan to identify
potential conserved candidate miRNAs in repeat-masked animal ge-
nomes from UCSC.
Using this revised search criterion, we collected 227,349 candidate
miRNAs from the genomes of 56 bilaterian animals after the initial
sequence conservation screening procedure. These candidate miRNAs
were highly similar to the 3,861 unique mature miRNA sequence
patterns, either with a completely identical pattern or with, at most, a
two-nucleotide variation. To examine whether the sequences adjacent
to the candidate miRNAs had the potential to form hairpin structures,
we extracted 200-nucleotide regions centered on these 227,349
candidate miRNAs as potential miRNA precursors. After evaluating
folding potentials using the Srnaloop program, we identified 177,002
potential hairpins (see Materials and methods).
Discrimination of candidate miRNAs by SVM
Following the identification of potential pre-miRNA hairpins, we
applied an SVM algorithm to discriminate positive candidates from
negative ones. All of the features used as discriminating indices in
SVM are listed in Table 1. We first trained our prediction model with
known animal miRNAs as the positive training set and random ge-
nomic sequences as the negative data set. The resulting 177,002 pairs
of candidate miRNAs and candidate pre-miRNAs were input into the
SVM classifier to discover putative authentic miRNAs based on the
acquired SVM model.
As a result, 22,613 candidate pre-miRNAs were classified as
positive ones based on SVM scores. As shown in Fig. 1, some candidate
pre-miRNAs may overlap with each other. After further inspecting
their genomic positions, we identified a total of 13,091 non-over-
lapping candidate pre-miRNAs as our final result set, excluding the
known miRNAs. In addition, the potential mature miRNA sequences
present in these candidate pre-miRNAs were also determined. The
sequences of these potential mature miRNAs are homologous to
known miRNAs: each is either completely identical to a known miRNA
or differs from it by at most two nucleotides.
They can therefore be viewed as either paralogous or orthologous
miRNAs, where a paralogous miRNA is homologous in sequence to a
known miRNA from its own species and an orthologous miRNA is
homologous in sequence to a known miRNA from other species.
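The paralogous/orthologous labelling described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, species are identified by the miRBase-style prefix of the known miRNA ID, and the rule that a same-species match takes precedence when both kinds of match exist is an assumption.

```python
def species_prefix(mirna_id):
    """Return the species prefix of a miRBase-style ID, e.g. 'tca' for 'tca-mir-276'."""
    return mirna_id.split("-", 1)[0]


def evo_source(candidate_species, homologous_known_ids):
    """Label a candidate as 'paralogous' or 'orthologous'.

    candidate_species: species prefix of the genome the candidate was found in.
    homologous_known_ids: IDs of known miRNAs whose sequence the candidate matches.
    Assumption: a match to a known miRNA of the same species takes precedence.
    """
    prefixes = {species_prefix(known_id) for known_id in homologous_known_ids}
    if candidate_species in prefixes:
        return "paralogous"
    return "orthologous"
```

For a candidate found in Tribolium castaneum that matches tca-mir-276, the sketch returns "paralogous", in line with the triCas2-26 example.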
These 13,091 candidate pre-miRNAs were assigned unique IDs
according to their genomic positions plus their UCSC Genome Bio-
informatics release version as prefixes. These candidates can be ac-
cessed via the web (http://insr.ibms.sinica.edu.tw), and users can
query all information from the species classification table or pre-
miRNA class classification table. Taking triCas2-26 in Fig. 1 as an
Table 1. Features used as discriminating indices to distinguish positive candidates from negative ones. These
features belong to either candidate miRNAs themselves or to their precursor hairpins.
A content of the candidate miRNA
U content of the candidate miRNA
G content of the candidate miRNA
C content of the candidate miRNA
Score calculated by Srnaloop
Core mfe of the core hairpin calculated by RNAfold (a)
Hairpin mfe of the whole hairpin calculated by RNAfold (a)
Ch_ratio: the ratio of core mfe to hairpin mfe (a)
Coverage rate: degree of overlap between the mature miRNA and the 26-nt putative miRNA (b)
Length of terminal loop
(a) Core mfe, hairpin mfe and ch_ratio. According to Zeng's report, we divided the
hairpin structure into two parts, core region and whole hairpin. We calculated the mfes
of the core alone and of the entire hairpin (which includes both the core and the stem
extension) and named these variables core mfe and hairpin mfe, respectively. We
divided core mfe by hairpin mfe and named this quotient ch_ratio, which reflects the
fact that the core contributes more to hairpin mfe than the stem extension does.
(b) Coverage. Mature miRNAs are usually located on either or both of the two arms of the
hairpin structure, close to the junction of the stem and terminal loop. We
inferred 26-nt fragments close to the junction as putative mature miRNAs and
concluded in our previous report that most mature miRNAs overlapped highly with their
corresponding putative ones. Therefore, we defined this degree of
overlap as a new feature, coverage rate. A higher coverage value implies a more
precise prediction of secondary structure.
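Several of the features in Table 1 are simple to compute once the folding results are available. The sketch below is illustrative only: the function names are hypothetical, the mfe values are assumed to come from an external folding run, and the exact formula for coverage rate is not given in the text, so it is assumed here to be the fraction of the mature miRNA covered by the 26-nt putative miRNA.

```python
def base_content(mirna):
    """Fraction of A, U, G and C in a candidate miRNA (DNA input is read as RNA)."""
    seq = mirna.upper().replace("T", "U")
    return {base: seq.count(base) / len(seq) for base in "AUGC"}


def ch_ratio(core_mfe, hairpin_mfe):
    """Ratio of the core mfe to the whole-hairpin mfe (both in kcal/mol)."""
    return core_mfe / hairpin_mfe


def coverage_rate(mature, putative):
    """Assumed definition: fraction of the mature miRNA covered by the putative one.

    Both arguments are (start, end) positions on the hairpin, 0-based, half-open.
    """
    overlap = min(mature[1], putative[1]) - max(mature[0], putative[0])
    return max(0, overlap) / (mature[1] - mature[0])
```

A mature miRNA lying entirely inside the 26-nt putative fragment gets a coverage rate of 1.0 under this assumed definition.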
Fig. 1. Presentation interface for information regarding candidate pre-miRNAs. All information on candidate pre-miRNAs is listed in our presentation interface. Users may query
this information based on the classification tables of Class or Species.
example, triCas2 denotes the UCSC genome release version used for Tribolium
castaneum. There are 17 animal species encoding 37 mature miR-276s,
and these 37 miR-276s can be classified into nine unique patterns
26 hairpin with high homology (Fig. 1). This result demonstrates the
usefulness of our prediction strategy. Moreover, the “Evo_source”
column indicates that triCas2-26 is a paralogous pre-miRNA, a homol-
ogous miRNA of the known gene tca-mir-276 from the same Tribolium
castaneum species. In the “Species_No” column, the value “17/17”
indicates that there are 17 species carrying the known mir-276 and the
same 17 species encode known and candidate mir-276 hairpins. In the
miRNAs are found in Tribolium castaneum and a total of 53 mir-276
pre-miRNAs are distributed among the 86 species examined. The detailed
genomic context (intergenic, intronic or exonic distribution) can be
further accessed by linking to the UCSC genome browser via the
hyperlink at the “precursor start” column. The “mature range” column
We also calculated the numbers of known pre-miRNAs, candidate
orthologous pre-miRNAs and candidate paralogous pre-miRNAs in
all species. As shown in Table 2, Caenorhabditis briggsae had a total
of 171 pre-miRNAs, including 95 known ones, 56 paralogous and 20
orthologous ones.
Table 2. Summary of pre-miRNA distribution in 56 animal species. Known pre-miRNAs are reported by miRBase 13.0; predicted paralogous and orthologous pre-miRNAs are the pre-miRNAs
predicted in the 56 animals and homologous to the miRNAs from their own species or other species, respectively.
Species Known pre-miRNA
Distribution of miRNAs in 56 bilaterian animal genomes
To provide a comprehensive comparison of pre-miRNAs among
different species, we examined the pre-miRNA distribution densities
in 56 species. As shown in Table 2, the numbers of miRNAs identi-
fied in worm and fly were approximately 150 and 100, respectively.
There was a large number of miRNAs (approximately 300) in fish.
Interestingly, Strongylocentrotus purpuratus and Ciona intestinalis had
only 146 and 49 pre-miRNAs identified, respectively. Such low
numbers of miRNAs in some species may also reflect the
incompleteness of scaffold assembly in these species. There were
fewer miRNAs identified in amphibians, birds and older mammals.
The number of miRNAs expanded in rodents to approximately 500 to
600. In higher mammals, the number of miRNAs was maintained at
600 or greater. This observation correlates to some extent with the gene
numbers in genomes, implying the possibility of miRNA-mediated
regulation of gene expression. Because genome assembly is not yet
complete for several species, these numbers may be revised in the
future.
To understand miRNA distribution among various species, we
examined the distribution patterns of different miRNA classes. As
shown in Fig. 2, the pixels in the lsy-6 class row are represented by
four distinctive colors: blue, green, red and white. The blue pixels
indicate that a known lsy-6 class member exists in ce6 (Species index: 02,
Caenorhabditis elegans), and the green pixels indicate that a candidate
lsy-6 class member exists in caePb1 (Species index: 03, Caenorhabditis
brenneri). The red pixels show that both known and candidate lsy-6
class members exist in cb3 (Species index: 01, Caenorhabditis briggsae), while
the white pixels show that neither known nor candidate lsy-6 class
members exist in the remaining species. Fig. 2 also illustrates another important
observation, namely that certain miRNA classes are conserved from
worm to human; these include mir-1, mir-7, mir-9, mir-10, mir-31,
mir-33, mir-34, mir-92, mir-100, mir-124, mir-125, mir-133, mir-137,
mir-184, mir-190, mir-210, mir-219 and mir-297. However, some
classes, such as mir-3, 4, 5 and 6, are found only in Drosophila. Likewise, lin-4, lsy-6,
miR-35 to 70 and a few others are specific to worm species.
Fig. 2. The distribution patterns of predicted and known pre-miRNAs in 56 different species. The blue pixels (known pre-miRNAs), red pixels (both known and candidate pre-
miRNAs), green pixels (only candidate pre-miRNAs) and white pixels (neither known nor candidate pre-miRNAs) show the occurrence of pre-miRNAs in the corresponding species.
Use of this comprehensive presentation facilitates the study of miRNA evolution patterns.
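The four-color encoding used in Fig. 2 amounts to a lookup on two yes/no facts per class and species. A minimal sketch, with hypothetical function names and an assumed input layout (two sets of genome identifiers per class), might look like this:

```python
def pixel_color(has_known, has_candidate):
    """Map the presence of known/candidate pre-miRNAs to the Fig. 2 colors."""
    if has_known and has_candidate:
        return "red"
    if has_known:
        return "blue"
    if has_candidate:
        return "green"
    return "white"


def class_row(species_order, known_in, candidate_in):
    """Build one row of the distribution map for a single miRNA class.

    species_order: genome identifiers in display order.
    known_in / candidate_in: sets of genomes with known / candidate pre-miRNAs.
    """
    return [pixel_color(s in known_in, s in candidate_in) for s in species_order]
```

Applied to the lsy-6 example in the text, cb3 maps to red, ce6 to blue, caePb1 to green, and every other genome to white.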
Most of the miRNAs were shared across species, but of the initial
1,534 classes, 494 miRNA classes were found in only one species
among all 86 metazoan species. These could represent lineage-specific
miRNA classes. Alternatively, some of these miRNA sequences could
have arisen from false-positive experimental or bioinformatics
results in previous studies. Our data could support more rigorous
future examination of miRNA discovery and expression.
Validation of candidate miRNAs
After classification with SVM, we identified a total of 13,091 non-
overlapping positive candidate pre-miRNAs. To assess the discovery
sensitivity of our pipeline, we examined how many known pre-
miRNAs were detected in this set and excluded them to construct the
final result set, which included 13,091 candidate pre-miRNAs. To
make the comparison more precise, we selected 15 species with
identical genome assembly versions in miRBase 13.0 and our dataset.
We excluded the small number of known pre-miRNA records that lack
genomic locations in miRBase, and counted a prediction as a positive hit
only when it matched the exact genome coordinates of a known pre-miRNA. As a
result, our pipeline detected 3,900 known miRNAs from raw genomic
sequences, and the overall sensitivity was 84.3% (3,900 out
of 4,624 reported in miRBase). We believe that our pipeline
performed well in detecting miRNAs reported in miRBase from raw
genome sequences, even considering the variation in genome
sequence quality and contig assembly completeness among
species.
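The sensitivity figure quoted above is a simple ratio of recovered known loci to all known loci. As a hedged sketch (the locus tuple layout and the function name are assumptions, and the precise coordinate-matching rule is not spelled out in the text), it could be computed like this:

```python
def sensitivity(known_loci, predicted_loci):
    """Fraction of known pre-miRNA loci that also appear among the predictions.

    Each locus is assumed to be a hashable tuple such as
    (assembly, chromosome, start, end, strand); a hit requires identical tuples.
    """
    known = set(known_loci)
    return len(known & set(predicted_loci)) / len(known)


# Reported counts from the text: 3,900 of 4,624 known miRNAs were detected.
reported = 3900 / 4624
print(f"{reported:.1%}")
```

With the counts reported in the text, the ratio rounds to the stated 84.3%.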
In this study, 78 candidate orthologous human miRNAs, including
37 miRNA classes, were discovered using known unique miRNA
sequence patterns from other species. Recently, stem-loop RT-PCR
technology  has been applied not only in detecting the expression
of known miRNAs but also in validating novel miRNAs [11,12,14,16].
Therefore, we selected nine of them for experimental validation using
stem-loop RT-PCR technology. Following small RNA extraction and
stem-loop RT-PCR confirmation, six miRNAs (miR-265,
miR-250, miR-343, miR-461, miR-670, miR-683, miR-764) were
detected and the RT-PCR product sequence was confirmed in human
cell lines (Fig. 3 and Supplementary Table 1). Since some miRNAs are
expressed at certain developmental stages or tissues, the three
predicted miRNA candidates that failed to be detected in our cancer
cell line samples might be validated using additional samples or
a new orthologous miR-705 class miRNA (hg18-78) from the human
ES cell MPSS dataset . This observation indicated that our cross-
evolution prediction pipeline is useful for discovering conserved
Recently, deep sequencing methods have been applied in large-
scale miRNA identification research projects [31,32]. In this study, we
selected rabbit (Oryctolagus cuniculus) and medaka (Oryzias latipes)
for small RNA library construction and sequencing validation with
the SOLiD platform. Our pipeline predicted 273 pre-miRNAs in medaka
and 354 in rabbit. Using the BLAST program, we then used the predicted
mature miRNA sequences to search against deep sequencing reads to
confirm their expression in medaka and rabbit. To ensure match
reliability, we defined a matched candidate mature miRNA as one that is 100%
identical to the SOLiD reads. Under this stringent criterion, most of
our predicted miRNAs were still confirmed to be expressed. In terms
of pre-miRNAs, 79.9% (218 out of 273) of
medaka candidate pre-miRNAs have mature miRNAs confirmed with
SOLiD reads; in rabbit, 59.3% (210 out of 354) of candidate pre-miRNAs
do so. These results show that most of our predicted conserved
medaka and rabbit candidate miRNAs can be validated using
sequencing data even under a stringent validation criterion.
Fig. 3. Expression analysis of predicted orthologous human miRNAs in human cell lines. (A) Selected predicted orthologous human miRNAs (miR-265, 343, 461, 670, 683, 764-5p)
were amplified by the stem-loop RT-PCR procedure, and the known human miR-16 was used as an internal control for PCR reaction. The respective orthologous miRNA class is
indicated in parentheses. Cell line legends are as follows: Huh7 (human hepatoma cell line), HR (human gastric cancer cell line), HeLa (human cervical cancer cell line) and 293T
(human embryonic fibroblast). ntc: negative control of the PCR reaction without the addition of cDNA templates. RT-c: negative control of the PCR reaction without the addition of
specific stem-loop RT primers. (B) Sequencing results of miRNA PCR amplicons. The predicted mature miRNA sequences are indicated as underlined sequences.
After obtaining the number of reads matching each mature miRNA,
we can infer the read abundance of each mature miRNA and
pre-miRNA. Here, read abundance serves as a measure of the
expression level of each miRNA gene. In Supplementary Tables 2 and 3,
we list the expression level for each medaka and rabbit candidate
pre-miRNA, using expression evidence from SOLiD reads. From the
read abundance information, we found that the expression
levels of different miRNA classes differ dramatically. In
Supplementary Table 4, we ranked miRNA classes by SOLiD
read abundance and listed the 30 most abundant classes. In medaka,
the 30 most abundant classes account for 91.79% of total medaka reads;
is consistent with previous reports that indicate the expression of
miRNA genes is usually dominated by several miRNAs .
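The read-based validation and abundance ranking described above can be illustrated with a short sketch. It is not the authors' BLAST-based procedure: the function names are hypothetical, reads are assumed to be already decoded into base space, and a "100% identical" match is assumed to mean that the full predicted mature sequence occurs within the read.

```python
from collections import Counter


def count_exact_matches(mature_by_id, reads):
    """Count, for each candidate, the reads that contain its mature sequence exactly."""
    counts = Counter({cand_id: 0 for cand_id in mature_by_id})
    for read in reads:
        for cand_id, mature in mature_by_id.items():
            if mature in read:  # assumed meaning of a 100% identical match
                counts[cand_id] += 1
    return dict(counts)


def validated_fraction(counts):
    """Fraction of candidates supported by at least one read."""
    return sum(1 for n in counts.values() if n > 0) / len(counts)


def top_class_share(class_counts, n):
    """Share of all reads that fall in the n most abundant miRNA classes."""
    total = sum(class_counts.values())
    return sum(sorted(class_counts.values(), reverse=True)[:n]) / total
```

The nested loop is quadratic and only suitable for small inputs; for real read sets an indexed lookup or an aligner would be the natural choice.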
From the recent efforts in large-scale miRNA sequencing and dis-
approximately 9,500. miRBase serves as a reliable and accurate foun-
dation for searching miRNAs conserved among species. Using these
reported miRNA sequences, we established a comprehensive set of
homologous miRNAs in 56 bilaterian animal genomes. This set of
miRNAs is valuable information for scientists interested in the distri-
bution patterns of miRNAs and discovering new miRNAs in many
different species. Further functional experiments are still required to
validate these previously unidentified miRNAs in various species.
Before our study, many researchers had also developed methods to
identify miRNA genes in animals or plants [23,24,34–36]. In general,
most of them required the use of known pre-miRNA hairpin sequences
to search against genomic sequences to locate candidate pre-miRNAs.
However, miRNA genes are conserved mainly in the mature miRNA
regions. Sequence variations among miRNA classes are often found
between remotely related species. Our approach described here can
greatly improve the detection efficiency in many animal genomes by
using only the mature miRNA sequence patterns instead of the entire
hairpin sequences. Because the use of precursor hairpin sequences
would decrease the detection sensitivity by including the more
diversified regions (Supplementary Fig. 2), using the unique miRNA
sequence patterns would greatly increase the detection coverage in
addition to reducing redundancy. In our present study, we effectively
discovered large numbers of miRNAs in 56 animal genomes across
animal development using the unique sequence patterns of mature
Grad et al. first developed their computational method, relying on
sequence conservation and structural similarity, to predict 214
candidate miRNAs from the C. elegans genome. Like many prediction
studies [23,24,34–36], they adopted stepwise filters based on many
features to exclude false candidates step by step, which assumes that
these filters and features act independently and contribute equally to
the prediction result. Biologically, some features contribute more
and some less. Our method, classification with SVM, considers these
features jointly and weights them differently. Moreover, in the study
by Grad et al., four out of 20 selected candidates could be detected
with a Northern blot assay, and ten out of 54 candidates could be
confirmed with a PCR approach. In our study, six out of nine selected
human candidates were confirmed with stem-loop RT-PCR and more than
50% of rabbit and medaka candidates were detected by sequencing, even
under the stringent perfect sequence match criterion. This comparison
suggests that our results are comparable with those of Grad et al.
However, whole-hairpin sequences, such as that of let-7, can
still identify a few highly conserved miRNAs. Hertel et al. identified
about 1,000 new miRNA genes in animal genomes by using the
sequences of known pre-miRNAs as queries and searching for sequence
homologs in other genomic sequences. They reported
certain miRNAs specific to earthworms that might be related to the
advent of bilaterian species genomes. Two additional expansions of
miRNA classes were identified in vertebrates and placental mammals.
Our study here has greatly improved the coverage and genome sizes
by using mature miRNA core regions instead of entire hairpins.
Using the conservation feature as an initial screen greatly in-
creased the accuracy of miRNA prediction. In addition, the use of
unique miRNA sequences decreased the redundancy for the predicted
candidate hairpins. Interestingly, the boundaries of mature miRNAs
usually vary among different species, but the central region is highly
conserved. Using miR-276 as an example, we could not definitively
determine what type of miRNA, aga-miR-276-5p, dgr-miR-276*, tca-
miR-276, tca-miR-276* or dme-miR-276*, may be encoded by the
predicted triCas2-26 precursor hairpin. The mature miRNA encoded
by triCas2-26 could be validated experimentally.
Although we used repeat-masked genomes for candidate miRNA
identification, we still found some frequently occurring miRNA
classes. For example, mir-669, mir-680 and mir-1187 have 22, 214 and
70 occurrences in mouse, respectively; mir-574, mir-297 and mir-1187
have 457, 128 and 107 occurrences in Tupaia belangeri, respectively.
These high-frequency pre-miRNA classes give these two species high
distribution densities, but raise doubt as to whether they
are repeat elements rather than functional miRNAs. Considering
the fact that some known pre-miRNAs can also be recognized as repeat
elements by RepeatMasker (Supplementary Table 5), we cannot
definitively conclude that all of these high-frequency candidate pre-
colleagues discovered that human miRNAs are interspersed among Alu
elements and other repeat elements and could be transcribed by RNA
polymerase III. It is possible that some of the repeat elements have the
potential to generate miRNAs. Alternatively, these high-frequency candidate
pre-miRNAs could be repeat elements that are not included in the Repbase
library and therefore cannot be recognized by RepeatMasker [38,39].
Whether these repeat-like known pre-miRNAs and candidate pre-
miRNAs are non-functional repeat elements or functional miRNA
precursors requires further experimental validation.
The comprehensive discovery of miRNAs and a user-friendly
display interface will allow researchers to examine miRNA classes
with increased confidence. Our presentation interface enables users to
compare the miRNA distribution patterns among several specific
species, such as six fish species (Supplementary Figs. 3 and 4). As
illustrated in Supplementary Fig. 4, many known pre-miRNAs, in-
cluding let-7, mir-1, mir-7 and mir-9, exist in the three fish Danio
rerio, Fugu rubripes and Tetraodon nigroviridis. They are also predicted
to distribute in three other fish, Petromyzon marinus, Oryzias latipes
and Gasterosteus aculeatus. Because of the high specificity (98.2%) and
the close phylogeny of fish, we are confident that P. marinus, O. latipes
and G. aculeatus can also encode these miRNAs. By referring to the
sequences of these miRNAs in the fish genome, researchers may
design northern probes or stem-loop RT-PCR primers to detect these
miRNAs in these fish. In fact, as shown in Supplementary Table 3, we
found strong expression evidence of let-7, mir-1, mir-7 and mir-9
classes from SOLiD reads in medaka (Oryzias latipes).
In this report, a bioinformatics pipeline was applied to discover the
existence of miRNAs in 56 bilaterian animal genome sequences,
including many species without known miRNAs. By using the se-
quences of known animal miRNAs, we identified 13,091 new miRNAs
with high confidence and many of them can be experimentally
confirmed with deep sequencing methods or stem-loop RT-PCR.
Materials and methods
Classifying animal miRNAs into family classes based on sequence
We downloaded all animal miRNAs from miRBase 13.0 and
assigned miRNAs with the same integer ID to the same class, without
considering the species prefix, the letter suffix, or the 5p, 3p or * suffix. For
example, as shown in Supplementary Fig. 1, there are 37 mature
miRNAs with the same integer ID 276 and they belong to 17 species.
We classified them into the mir-276 class according to the rule
described above. Further checking their sequences, we found nine
unique sequence patterns. For each unique sequence pattern, we
selected the underlined one as representative. As a result, all miRNAs
were classified into 1,534 classes and we identified 3,861 unique
miRNA sequences as query sequences to identify the conserved
candidate miRNAs in animal genomes.
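The class assignment rule described above can be sketched with a regular expression. This is an illustrative reading of the rule, not the authors' code: the function names are hypothetical, the name stem (mir, let, lin, lsy) is kept together with the integer so that let-7 and mir-7 remain separate classes, and the fallback for names without an integer ID is an assumption.

```python
import re
from collections import defaultdict

# species prefix, then a name stem such as miR/let/lin/lsy, then the integer ID
_NAME = re.compile(r"^[A-Za-z0-9]+-([A-Za-z]+)-(\d+)")


def class_of(mirna_id):
    """Class of a miRBase-style ID, ignoring species prefix and letter/5p/3p/* suffixes."""
    match = _NAME.match(mirna_id)
    if match is None:
        # Assumed fallback for names without an integer ID: drop the species prefix.
        return mirna_id.split("-", 1)[-1].lower()
    return f"{match.group(1).lower()}-{match.group(2)}"


def unique_patterns(records):
    """Group (mirna_id, sequence) records by class and keep each distinct sequence once."""
    patterns = defaultdict(set)
    for mirna_id, seq in records:
        patterns[class_of(mirna_id)].add(seq.upper().replace("T", "U"))
    return patterns
```

Summing the sizes of the per-class sets gives the number of unique query patterns; whether identical sequences shared between two classes are counted once or twice is not stated in the text.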
Genomic sequences from UCSC Genome Bioinformatics web site
We downloaded the repeat-masked genomic sequences from the
UCSC Genome Bioinformatics web site. To date, genomic sequences
for 56 animal species are available from UCSC Genome Bioinformatics.
The released versions of the genomic sequences for all analyzed
species are listed in Table 2. According to miRBase 13.0, there are 64
reported animal species encoding miRNAs. In our analysis, 34 of these
species were also included. In total, the miRNA information from 86
different animal species was used in this study. Among these, 30
species have only the known miRNAs, 34 species have both known
miRNAs and predicted miRNAs (either orthologous or paralogous),
and 22 species have only predicted miRNAs.
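The species counts in this paragraph follow from simple set arithmetic, which can be checked as below. The variable names are illustrative; the numbers are the ones given in the text.

```python
known_species = 64    # species with miRNAs reported in miRBase 13.0
genome_species = 56   # species with genomes available from UCSC
both = 34             # species present in both lists

known_only = known_species - both        # species with known miRNAs only
predicted_only = genome_species - both   # species with predicted miRNAs only
total_species = known_only + both + predicted_only

print(known_only, predicted_only, total_species)
```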
Conservation filter: using mature miRNAs to locate candidate miRNAs
We used 3,861 unique miRNA sequences to search against the
repeat-masked genomic sequences of 56 bilaterian animal species
downloaded from UCSC Genome Bioinformatics. Based on the
classification used by miRBase, the best-known public database of
miRNA information, the criterion for a BLAST hit was as follows. For
query sequences of miRNAs without a letter suffix (for
example, let-7, miR-106), a BLAST hit with at most a 2-nt difference
from the query sequence was regarded as a candidate miRNA. For a
miRNA with a letter suffix (for example, let-7a, let-7b, miR-106a
or miR-106b), we required a candidate miRNA to be completely
identical to the query sequence. Following this
criterion, we ensured that our candidate miRNAs were homologous in
sequence to known animal miRNAs. In total, we identified 227,349
candidate miRNAs, highly similar to the sequences of the 3,861
miRNAs, in 56 bilaterian animal species genomes.
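The hit criterion above can be expressed as a small predicate. The sketch below is an interpretation under stated assumptions: the function names are hypothetical, the "2-nt variance" is treated as up to two substitutions in an ungapped, equal-length alignment, and the letter suffix is detected from the miRNA name after removing a trailing *, -5p or -3p.

```python
import re


def has_letter_suffix(mirna_id):
    """True for names like 'let-7a' or 'miR-106b', False for 'let-7' or 'miR-106'."""
    name = re.sub(r"(\*|-5p|-3p)$", "", mirna_id)
    return re.search(r"-\d+[a-z]", name) is not None


def mismatches(query, hit):
    """Number of substitutions between two equal-length sequences."""
    if len(query) != len(hit):
        raise ValueError("this sketch only compares ungapped, equal-length sequences")
    return sum(a != b for a, b in zip(query, hit))


def is_candidate(mirna_id, query, hit):
    """Apply the criterion: exact match if the name has a letter suffix, else up to 2 nt."""
    allowed = 0 if has_letter_suffix(mirna_id) else 2
    return mismatches(query, hit) <= allowed
```

Under this reading, a genomic hit that differs from the let-7 query at two positions qualifies, while the same hit would be rejected for a let-7a query.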
Folding the sequences adjacent to candidate miRNAs using Srnaloop
During miRNA maturation, the full-length miRNA gene transcript
needs to form a hairpin (stem-loop) structure. The secondary
structure is folded via intra-molecular base pairing and has been the
most significant criterion for computational identification of miRNAs
[4,40–42]. We fetched the 200-nt sequence regions centered
on the 227,349 candidate miRNAs as potential miRNA
precursors, which were further folded using the Srnaloop program.
The optimized parameters for Srnaloop were the default values
with the exception of the “-l 110” and “-lml 10” parameters adjusted
for miRNAs. These parameters are specific for identifying hairpins for
which the entire length and the terminal loops are up to 110 bases and
not shorter than 10 bases, respectively. In addition, the “-t 20”
parameter set a score cutoff of 20 for these identified hairpins.
As a result, we located 177,002 potential hairpins from 227,349
regions. These located hairpins, together with their corresponding
candidate miRNAs, were further investigated to determine if they
were authentic miRNAs.
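As a rough illustration of the window extraction and hairpin filtering described above, the Python sketch below computes a 200-nt window centered on a hit and applies the three thresholds mentioned in the text. The 0-based half-open coordinates, the shifting of windows at chromosome ends, and the `Hairpin` record are assumptions made for this example; they do not reflect Srnaloop's actual input or output format.

```python
from dataclasses import dataclass


def centered_window(hit_start: int, hit_end: int, chrom_len: int, width: int = 200):
    """Return (start, end) of a width-nt window centered on a hit.

    Coordinates are 0-based, half-open. Near chromosome ends the window is
    shifted inward so that it keeps the full width whenever possible.
    """
    center = (hit_start + hit_end) // 2
    start = max(0, center - width // 2)
    end = min(chrom_len, start + width)
    start = max(0, end - width)
    return start, end


@dataclass
class Hairpin:
    """Hypothetical record for one predicted hairpin."""
    length: int       # total hairpin length in bases
    loop_length: int  # terminal loop length in bases
    score: float      # folding score reported by the tool


def passes_thresholds(h: Hairpin, max_len: int = 110, loop_bound: int = 10,
                      min_score: float = 20.0) -> bool:
    """Mirror the stated settings: length <= 110, loop >= 10, score >= 20."""
    return (h.length <= max_len
            and h.loop_length >= loop_bound
            and h.score >= min_score)
```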
Support vector machine classifier and model training
The support vector machine (SVM) is a machine learning technique for
classification and regression, originally developed by Vapnik on the
basis of statistical learning theory. In the present study, we
formulated our problem as a binary classification problem, positive or
negative, and used LIBSVM [43,44] as the core algorithm. Here,
according to the characteristics of our training features, we chose to
use the radial basis function (RBF) kernel implemented in LIBSVM as
shown in Supplementary Fig. 5. We used 10 features, listed in Table 1,
as SVM indices to discriminate positive from negative candidates in
this study. These features are based on our previous studies [3,25]
and many of them have also been adopted by other reports [6,16,24,
45–48]. In miRBase 13.0, there are 7,628 animal pre-miRNAs and
7,939 mature miRNAs, comprising 8,844 pairs of miRNAs and
pre-miRNAs. To avoid noise from possibly mispredicted miRNAs, we
selected only 4,144 experimentally confirmed pairs of pre-miRNAs
and miRNAs as the positive set and 4,144 random sequence pairs as the
negative set to train the SVM. To identify an optimal parameter
setting for the RBF kernel, we used 10-fold cross-validation on our
training data; with C = 2048 and gamma = 0.03125, the accuracy was
94.98% and the area under the curve (AUC) was 99.05% (Supplementary
Fig. 6).
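For readers unfamiliar with the RBF kernel and k-fold cross-validation, the following Python sketch shows both using only the standard library. It is not LIBSVM's implementation; the function names and the round-robin stratified split are assumptions made for illustration. The reported parameter values are both powers of two (C = 2^11, gamma = 2^-5), which would be consistent with a search over a power-of-two grid, although the text does not say how the grid was defined.

```python
import math
import random

C = 2.0 ** 11      # 2048, the reported penalty parameter
GAMMA = 2.0 ** -5  # 0.03125, the reported RBF width parameter


def rbf_kernel(x, y, gamma=GAMMA):
    """RBF kernel K(x, y) = exp(-gamma * ||x - y||^2) for two feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds, balancing each class across folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin into the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds
```

In each of the 10 rounds, one fold would be held out for testing while a classifier is trained on the other nine; the reported accuracy is then the average over held-out folds.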
However, it is more difficult to assess the true specificity of the
system because no true negative dataset is available for miRNA
prediction. For this reason, some studies, including our previous
one, have used large numbers of random genomic sequences as negatives.
With low probability, such a random sequence dataset could still carry
miRNAs, but we have no means to verify this. To generate a more
reliable negative dataset here, we used randomly selected
protein-coding sequences (CDSs), since there are no reports of
miRNAs located in CDS regions. We first downloaded all animal
reference sequences of RefSeq 35 from NCBI. We then extracted the
CDSs of the reference genes with NM accession numbers. Finally, we
collected 227,349 random 200-nt sequences from the CDSs as a
negative control dataset. After folding with Srnaloop and classification
with SVM under the same parameters as previously described, we
achieved an average specificity of 98.2% from three independent
We also used the regions immediately upstream of known miRNAs
as the negative dataset for testing specificity. Since miRNAs some-
times exist as gene clusters, we selected 10,000 base pairs upstream
of 4,624 known miRNAs with identical genome versions in miRBase
and our UCSC dataset as the negative dataset. As a result, a specificity
of 96.9% was obtained after folding with Srnaloop and
classification with the SVM classifier. This value is similar to that
obtained with the CDS negative dataset described above.
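The specificity estimates above follow from a simple count: every sequence in a negative dataset that the pipeline still classifies as a miRNA is a false positive. The Python sketch below illustrates the random sampling of 200-nt windows and the specificity calculation. The function names, the uniform choice among sufficiently long sequences, and the fixed seed are assumptions for this example and are not taken from the text.

```python
import random


def sample_windows(sequences, n, width=200, seed=0):
    """Draw n random width-nt windows from sequences at least width nt long."""
    rng = random.Random(seed)
    eligible = [s for s in sequences if len(s) >= width]
    if not eligible:
        raise ValueError("no sequence is long enough for the requested width")
    windows = []
    for _ in range(n):
        seq = rng.choice(eligible)
        offset = rng.randrange(len(seq) - width + 1)
        windows.append(seq[offset:offset + width])
    return windows


def specificity(n_negatives: int, n_false_positives: int) -> float:
    """Specificity = TN / (TN + FP), where every input is a known negative."""
    true_negatives = n_negatives - n_false_positives
    return true_negatives / n_negatives
```

For example, 18 false positives among 1,000 negative sequences would correspond to a specificity of 98.2%.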
RNA extraction from cell lines and tissues
Total RNA was extracted from cultured cells and tissues using
Trizol (Invitrogen) according to the manufacturer's protocol. Human
tumor cell lines (Huh7, HR, HeLa and 293T) were cultured in D-MEM
medium containing 10% fetal bovine serum and collected for total
RNA extraction. Whole-body RNA was extracted from a male/female
pair of medaka (Oryzias latipes). Eleven different tissues and organs,
including heart, brain, lung, liver, spleen, stomach, intestine, colon,
pancreas, ovary and skeletal muscle, were collected from rabbit
(Oryctolagus cuniculus) and used for RNA extraction.
S.-C. Li et al. / Genomics 96 (2010) 1–9
miRNA validation with the next generation sequencing platform
Pooled total RNA from the whole body of medaka or rabbit was
used for small RNA library construction and direct sequence analysis
following the manufacturer's guidelines (ABI SOLiD system). As a
result, we collected 46 M and 35 M initial sequence reads from the
medaka and rabbit small RNA libraries, respectively. Following the
mapping of reads to their genomic locations, we searched the predicted
medaka and rabbit mature miRNA sequences against the medaka and
rabbit SOLiD reads, respectively, using the BLAST program. For a
positive BLAST hit, we required the mature miRNA to be completely
identical to the SOLiD read, beginning at the first nucleotide of both
miRNA and read, as suggested in a previous report. In this way, we
could confidently confirm the expression of the predicted miRNAs and
further measure the expression levels of candidate mature miRNAs by
read counts.
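The read-matching criterion can be illustrated with a short Python sketch. It assumes that reads are already available as nucleotide strings (SOLiD data are natively in color space) and that a read may extend beyond the 3' end of the miRNA, so a match means that the read begins with the full miRNA sequence. The function name and data layout are hypothetical, and BLAST is replaced here by a plain prefix lookup.

```python
from collections import Counter, defaultdict


def count_supporting_reads(mirnas, reads):
    """Count reads whose 5' end matches a mature miRNA exactly.

    mirnas: dict mapping miRNA name -> mature sequence
    reads:  iterable of read sequences
    """
    # Index miRNA names by sequence so that each read needs only a few lookups.
    index = defaultdict(list)
    for name, seq in mirnas.items():
        index[seq].append(name)
    lengths = sorted({len(seq) for seq in index})

    counts = Counter()
    for read in reads:
        for length in lengths:
            if len(read) < length:
                break  # read too short to contain any longer miRNA
            for name in index.get(read[:length], ()):
                counts[name] += 1
    return counts
```

The resulting counts serve as a simple proxy for expression level, in the same spirit as the read counts used in the text.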
miRNA validation with the stem-loop RT-PCR method
For the stem-loop RT-PCR assay, small RNA was further enriched
from total RNA using the mirVana miRNA isolation kit (Ambion).
Small RNA pellets were dissolved in RNase-free water. RNA quality
was analyzed on a 1% (v/v) agarose gel, and the RNA was quantified
with a NanoDrop spectrophotometer (ND-1000). Small RNA (0.5 µg) from
each cell line was reverse transcribed with 20 nM miRNA-specific
stem-loop RT primers for each miRNA and SuperScript III Reverse
Transcriptase (Invitrogen). The reaction was performed with the
following incubation conditions: 16 °C/30 s, 42 °C/30 s and 50 °C/1 s
for 50 cycles. The enzyme was subsequently inactivated by incubation
at 85 °C for 5 min. The resulting cDNA was used as a template in
subsequent PCR reactions. Specific stem-loop RT primers and PCR
primers were designed for each of the selected human mature
miRNAs (Supplementary Tables 1 and 6). PCR reactions were
performed as follows: 94 °C for 2 min, followed by 36 or 40 cycles
of 94 °C/30 s, 60 °C/30 s, 70 °C/5 s. PCR products were analyzed using
gel electrophoresis (4.0% agarose/TAE buffer). The sequences of the
miRNA amplicons were determined by cloning the PCR product into
TA cloning vectors (Invitrogen) according to the manufacturer's
protocol. The expression of human mir-16 miRNA was used as an
Acknowledgments
We thank Dr. Chun-Houh Chen of the Institute of Statistical
Science, Academia Sinica, for helpful suggestions regarding the
statistical evaluation of SVM performance; Drs. Shu-Jen Chen and
Hua-Chien Chen at Chang Gung University for the stem-loop RT-PCR
protocols; Dr. Pung-Pung Hwang of the Institute of Cellular and
Organismic Biology for the medaka fish; and Dr. Chi-Chin Kao of Taipei
Medical University for the rabbit tissues. This study was supported in
part by the Academia Sinica thematic project and by research grants
from the National Science Council of Taiwan.
Appendix A. Supplementary data
Supplementary data associated with this article can be found, in
the online version, at doi:10.1016/j.ygeno.2010.03.009.
References
 D.P. Bartel, MicroRNAs: genomics, biogenesis, mechanism, and function, Cell 116
 S. Griffiths-Jones, R.J. Grocock, S. van Dongen, A. Bateman, A.J. Enright, miRBase:
microRNA sequences, targets and gene nomenclature, Nucleic Acids Res 34
 S.C. Li, C.Y. Pan, W.C. Lin, Bioinformatic discovery of microRNA precursors from
human ESTs and introns, BMC Genomics 7 (2006) 164.
 E. Berezikov, V. Guryev, J. van de Belt, E. Wienholds, R.H. Plasterk, E. Cuppen,
Phylogenetic shadowing and computational identification of human microRNA
genes, Cell 120 (2005) 21–24.
 Y. Grad, J. Aach, G.D. Hayes, B.J. Reinhart, G.M. Church, G. Ruvkun, J. Kim,
Computational and experimental identification of C. elegans microRNAs, Mol Cell
11 (2003) 1253–1263.
 A. Stark, P. Kheradpour, L. Parts, J. Brennecke, E. Hodges, G.J. Hannon, M. Kellis,
Systematic discovery and characterization of fly microRNAs using 12 Drosophila
genomes, Genome Res 17 (2007) 1865–1879.
 B.J. Reinhart, E.G. Weinstein, M.W. Rhoades, B. Bartel, D.P. Bartel, MicroRNAs in
plants, Genes Dev 16 (2002) 1616–1626.
 X. Chen, A microRNA as a translational repressor of APETALA2 in Arabidopsis
flower development, Science 303 (2004) 2022–2025.
 P.H. Olsen, V. Ambros, The lin-4 regulatory RNA controls developmental timing in
Caenorhabditis elegans by blocking LIN-14 protein synthesis after the initiation of
translation, Dev Biol 216 (1999) 671–680.
 S.M. Johnson, H. Grosshans, J. Shingara, M. Byrom, R. Jarvis, A. Cheng, E. Labourier,
K.L. Reinert, D. Brown, F.J. Slack, RAS is regulated by the let-7 microRNA family,
Cell 120 (2005) 635–647.
 D.B. Weaver, J.M. Anzola, J.D. Evans, J.G. Reid, J.T. Reese, K.L. Childs, E.M. Zdobnov,
M.P. Samanta, J. Miller, C.G. Elsik, Computational and transcriptional evidence for
microRNAs in the honey bee genome, Genome Biol 8 (2007) R97.
 F.L. Xie, S.Q. Huang, K. Guo, A.L. Xiang, Y.Y. Zhu, L. Nie, Z.M. Yang, Computational
identification of novel microRNAs and targets in Brassica napus, FEBS Lett 581
 X. Xue, J. Sun, Q. Zhang, Z. Wang, Y. Huang, W. Pan, Identification and
characterization of novel microRNAs from Schistosoma japonicum, PLoS One 3
 J. Cao, C. Tong, X. Wu, J. Lv, Z. Yang, Y. Jin, Identification of conserved microRNAs in
in heterologous system, Insect Biochem Mol Biol 38 (2008) 1066–1071.
 W.C. Lin, S.C. Li, J.W. Shin, S.N. Hu, X.M. Yu, T.Y. Huang, S.C. Chen, H.C. Chen, S.J.
Chen, P.J. Huang, R.R. Gan, C.H. Chiu, P. Tang, Identification of microRNA in the
protist Trichomonas vaginalis, Genomics 93 (2009) 487–493.
 X. Yu, Q. Zhou, S.C. Li, Q. Luo, Y. Cai, W.C. Lin, H. Chen, Y. Yang, S. Hu, J. Yu, The
silkworm (Bombyx mori) microRNAs and their expressions in multiple
developmental stages, PLoS ONE 3 (2008) e2997.
 B.M. Wheeler, A.M. Heimberg, V.N. Moy, E.A. Sperling, T.W. Holstein, S. Heber, K.J.
Peterson, The deep evolution of metazoan microRNAs, Evol Dev 11 (2009) 50–68.
 S.E. Prochnik, D.S. Rokhsar, A.A. Aboobaker, Evidence for a microRNA expansion in
the bilaterian ancestor, Dev Genes Evol 217 (2007) 73–77.
 R. Niwa, F.J. Slack, The evolution of animal microRNA function, Curr Opin Genet
Dev 17 (2007) 145–150.
 D. Gerlach, E.V. Kriventseva, N. Rahman, C.E. Vejnar, E.M. Zdobnov, miROrtho:
computational survey of microRNA genes, Nucleic Acids Res 37 (2009) D111–D117.
 A. Grimson, M. Srivastava, B. Fahey, B.J. Woodcroft, H.R. Chiang, N. King, B.M.
Degnan, D.S. Rokhsar, D.P. Bartel, Early origins and evolution of microRNAs and
Piwi-interacting RNAs in animals, Nature 455 (2008) 1193–1197.
 X.J. Wang, J.L. Reyes, N.H. Chua, T. Gaasterland, Prediction and identification of
Arabidopsis thaliana microRNAs and their mRNA targets, Genome Biol 5 (2004)
 T. Dezulian, M. Remmert, J.F. Palatnik, D. Weigel, D.H. Huson, Identification of
plant microRNA homologs, Bioinformatics 22 (2006) 359–360.
 S. Artzi, A. Kiezun, N. Shomron, miRNAminer: a tool for homologous microRNA
gene search, BMC Bioinformatics 9 (2008) 39.
 S.C. Li, C.K. Shiau, W.C. Lin, Vir-Mir db: prediction of viral microRNA candidate
hairpins, Nucleic Acids Res (2007).
  miRBase, http://microrna.sanger.ac.uk/
  UCSC-Bioinformatics, http://www.genome.ucsc.edu/
 R.M. Kuhn, D. Karolchik, A.S. Zweig, H. Trumbower, D.J. Thomas, A. Thakkapallayil,
C.W. Sugnet, M. Stanke, K.E. Smith, A. Siepel, K.R. Rosenbloom, B. Rhead, B.J. Raney,
A. Pohl, J.S. Pedersen, F. Hsu, A.S. Hinrichs, R.A. Harte, M. Diekhans, H. Clawson, G.
Bejerano, G.P. Barber, R. Baertsch, D. Haussler, W.J. Kent, The UCSC genome
browser database: update 2007, Nucleic Acids Res 35 (2007) D668–D673.
 V.N. Vapnik, Statistical Learning Theory, Wiley, New York, USA, 1998.
 C. Chen, D.A. Ridzon, A.J. Broomer, Z. Zhou, D.H. Lee, J.T. Nguyen, M. Barbisin, N.L.
Xu, V.R. Mahuvakar, M.R. Andersen, K.Q. Lao, K.J. Livak, K.J. Guegler, Real-time
quantification of microRNAs by stem-loop RT-PCR, Nucleic Acids Res 33 (2005)
 R.D. Morin, M.D. O'Connor, M. Griffith, F. Kuchenbauer, A. Delaney, A.L. Prabhu, Y.
Zhao, H. McDonald, T. Zeng, M. Hirst, C.J. Eaves, M.A. Marra, Application of
massively parallel sequencing to microRNA profiling and discovery in human
embryonic stem cells, Genome Res 18 (2008) 610–621.
 E.A. Glazov, P.A. Cottee, W.C. Barris, R.J. Moore, B.P. Dalrymple, M.L. Tizard, A
microRNA catalog of the developing chicken embryo identified by a deep
sequencing approach, Genome Res 18 (2008) 957–964.
 M. Lagos-Quintana, R. Rauhut, W. Lendeckel, T. Tuschl, Identification of novel
genes coding for small expressed RNAs, Science 294 (2001) 853–858.
 J. Hertel, M. Lindemeyer, K. Missal, C. Fried, A. Tanzer, C. Flamm, I.L. Hofacker, P.F.
Stadler, The expansion of the metazoan microRNA repertoire, BMC Genomics 7
 M. Legendre, A. Lambert, D. Gautheret, Profile-based detection of microRNA
precursors in animal genomes, Bioinformatics 21 (2005) 841–845.
 J.W. Nam, K.R. Shin, J. Han, Y. Lee, V.N. Kim, B.T. Zhang, Human microRNA
prediction through a probabilistic co-learning model of sequence and structure,
Nucleic Acids Res 33 (2005) 3570–3581.
 G.M. Borchert, W. Lanier, B.L. Davidson, RNA polymerase III transcribes human
microRNAs, Nat Struct Mol Biol (2006).
  RepeatMasker, http://www.repeatmasker.org/.
 D. Zhi, B.J. Raphael, A.L. Price, H. Tang, P.A. Pevzner, Identifying repeat domains in
large genomes, Genome Biol 7 (2006) R7.
 T. Barrett, D.B. Troup, S.E. Wilhite, P. Ledoux, D. Rudnev, C. Evangelista, I.F. Kim, A.
Soboleva, M. Tomashevsky, R. Edgar, NCBI GEO: mining tens of millions of
expression profiles–database and tools update, Nucleic Acids Res 35 (2007)
 A.A. Aravin, M. Lagos-Quintana, A. Yalcin, M. Zavolan, D. Marks, B. Snyder, T.
Gaasterland, J. Meyer, T. Tuschl, The small RNA profile during Drosophila
melanogaster development, Dev Cell 5 (2003) 337–350.
 E.C. Lai, P. Tomancak, R.W. Williams, G.M. Rubin, Computational identification of
Drosophila microRNA genes, Genome Biol 4 (2003) R42.
 C.-C. Chang, C.-J. Lin, LIBSVM: a library for support vector machines, 2001.
  LIBSVM, http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
 T.H. Huang, B. Fan, M.F. Rothschild, Z.L. Hu, K. Li, S.H. Zhao, MiRFinder: an
improved approach and software implementation for genome-wide fast micro-
RNA precursor scans, BMC Bioinformatics 8 (2007) 341.
 M.Y. Khan Barozai, M. Irfan, R. Yousaf, I. Ali, U. Qaisar, A. Maqbool, M. Zahoor, B.
Rashid, T. Hussnain, S. Riazuddin, Identification of micro-RNAs in cotton, Plant
Physiol Biochem 46 (2008) 739–751.
 X. Wang, J. Zhang, F. Li, J. Gu, T. He, X. Zhang, Y. Li, MicroRNA identification based
on sequence and structure alignment, Bioinformatics 21 (2005) 3610–3614.
 C. Xue, F. Li, T. He, G.P. Liu, Y. Li, X. Zhang, Classification of real and pseudo
microRNA precursors using local structure-sequence features and support vector
machine, BMC Bioinformatics 6 (2005) 310.
 A. Sewer, N. Paul, P. Landgraf, A. Aravin, S. Pfeffer, M.J. Brownstein, T. Tuschl, E. van
Nimwegen, M. Zavolan, Identification of clustered microRNAs using an ab initio
prediction method, BMC Bioinformatics 6 (2005) 267.