Defining diversity, specialization, and gene specificity in transcriptomes through information theory

Article in Proceedings of the National Academy of Sciences 105(28):9709-14 · August 2008
DOI: 10.1073/pnas.0803479105 · Source: PubMed
Abstract
The transcriptome is a set of genes transcribed in a given tissue under specific conditions and can be characterized by a list of genes with their corresponding frequencies of transcription. Transcriptome changes can be measured by counting gene tags from mRNA libraries or by measuring light signals in DNA microarrays. In any case, it is difficult to completely comprehend the global changes that occur in the transcriptome, given that thousands of gene expression measurements are involved. We propose an approach to define and estimate the diversity and specialization of transcriptomes and gene specificity. We define transcriptome diversity as the Shannon entropy of its frequency distribution. Gene specificity is defined as the mutual information between the tissues and the corresponding transcript, allowing detection of either housekeeping or highly specific genes and clarifying the meaning of these concepts in the literature. Tissue specialization is measured by average gene specificity. We introduce the formulae using a simple example and show their application in two datasets of gene expression in human tissues. Visualization of the positions of transcriptomes in a system of diversity and specialization coordinates makes it possible to understand at a glance their interrelations, summarizing in a powerful way which transcriptomes are richer in diversity of expressed genes, or which are relatively more specialized. The framework presented enlightens the relation among transcriptomes, allowing a better understanding of their changes through the development of the organism or in response to environmental stimuli.
Octavio Martínez* and M. Humberto Reyes-Valdés
*Laboratorio Nacional de Genómica para la Biodiversidad (Langebio), Cinvestav, Campus Guanajuato, Apartado Postal 629, C.P. 36500 Irapuato, Guanajuato, Mexico; and Department of Plant Breeding, Universidad Autónoma Agraria Antonio Narro, Buenavista, C.P. 25315 Saltillo, Coahuila, Mexico
Communicated by Luis Herrera-Estrella, Center for Research and Advanced Studies, Guanajuato, Mexico, April 10, 2008 (received for review March 10, 2008)
Keywords: biological complexity | gene expression | microarrays | serial analysis of gene expression (SAGE) | Shannon entropy
The transcriptome is highly dynamic; the relative transcription frequencies of the genes change in response to environmental and internal stimuli, redirecting the functional and structural landscape of living organisms. Currently, we can measure transcriptome changes by counting gene tags with technologies such as serial analysis of gene expression (SAGE) (1), massively parallel signature sequencing (MPSS) (2), or pyrosequencing of cDNA libraries obtained from mRNA (3), or alternatively by measuring light signals in DNA microarrays (4). In any case, it is difficult to completely understand the global changes that occur in the transcriptome, given that thousands of gene frequency measurements are involved. Here, we present a set of indexes that allow the calculation of transcriptome diversity and context specialization and the degree of gene specificity. These indexes are based on the adaptation of Shannon's information theory (5) to the transcriptome framework. Our approach is exemplified by the analysis of a dataset of the transcriptome of 32 human tissues, from which 32 million gene tags were obtained (6), and a comparable dataset for the expression of human genes in 36 human tissues using the Affymetrix GeneChip for the human genome (7). The main conclusion of our study is that this conceptualization allows elucidation of aspects of the transcriptome previously uncharacterized due to the quantity and complexity of the data.
Information theory was pioneered by Claude E. Shannon in a seminal paper in 1948 (5), and it has been generalized and applied to many scientific fields (8). In particular, it has been repeatedly applied to genetics in distinct contexts (9–12). Our approach consists of considering as symbols, in the sense of information theory, the distinct transcripts found in a tissue and counting their abundance to calculate information parameters.
Results and Discussion
Theoretical Framework. Consider the division of an organism into tissues; the transcriptome of each tissue can then be described simply as the set of relative frequencies, $p_{ij}$, for the $i$th gene ($i = 1, 2, \ldots, g$) in the $j$th tissue ($j = 1, 2, \ldots, t$). The diversity of the transcriptome of each tissue can then be quantified by an adaptation of Shannon's entropy formula,

$$H_j = -\sum_{i=1}^{g} p_{ij} \log_2(p_{ij}). \qquad [1]$$

$H_j$ will vary from zero, when only one gene is transcribed, up to $\log_2(g)$, when all $g$ genes are transcribed at the same frequency, $1/g$. We can consider the average frequency of the $i$th gene among tissues, say,

$$p_i = \frac{1}{t} \sum_{j=1}^{t} p_{ij}, \qquad [2]$$

and define gene specificity as the information that its expression provides about the identity of the source tissue as

$$S_i = \frac{1}{t} \sum_{j=1}^{t} \frac{p_{ij}}{p_i} \log_2\!\left(\frac{p_{ij}}{p_i}\right). \qquad [3]$$

$S_i$ will give a value of zero if the gene is transcribed at the same frequency in all tissues and a maximum value of $\log_2(t)$ if the gene is exclusively expressed in a single tissue. To quantify tissue specialization, we can obtain for each $j$th tissue the average of the gene specificities, say,

$$\delta_j = \sum_{i=1}^{g} p_{ij} S_i. \qquad [4]$$

$\delta_j$ varies from zero, if all genes expressed in the tissue are completely unspecific ($S_i = 0$ for all $i$), up to a maximum of $\log_2(t)$, when all genes expressed in the tissue are not expressed anywhere else. If we substitute the values of $p_{ij}$ by $p_i$ in Eq. 1 and ignore the subindex $j$, we obtain a measure, say $H$, of the diversity of the whole system.

Author contributions: O.M. and M.H.R.-V. designed research, performed research, contributed new reagents/analytic tools, analyzed data, and wrote the paper.
The authors declare no conflict of interest.
Freely available online through the PNAS open access option.
To whom correspondence should be addressed. E-mail: omartine@ira.cinvestav.mx.
This article contains supporting information online at www.pnas.org/cgi/content/full/0803479105/DCSupplemental.
© 2008 by The National Academy of Sciences of the USA
www.pnas.org/cgi/doi/10.1073/pnas.0803479105 | PNAS | July 15, 2008 | vol. 105 | no. 28 | 9709–9714

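The indexes of Eqs. 1–4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' R program: the matrix `P` (rows are genes, columns are tissues), its values, and all function names are hypothetical, and terms with $p_{ij} = 0$ are taken to contribute zero.

```python
import math

def diversity(P):
    """H_j (Eq. 1) for every tissue j; P[i][j] is the frequency of gene i in tissue j."""
    t = len(P[0])
    return [-sum(row[j] * math.log2(row[j]) for row in P if row[j] > 0)
            for j in range(t)]

def mean_frequencies(P):
    """p_i (Eq. 2): average frequency of each gene among tissues."""
    return [sum(row) / len(row) for row in P]

def gene_specificity(P):
    """S_i (Eq. 3); zero frequencies are skipped (0 * log 0 taken as 0)."""
    S = []
    for row, p_i in zip(P, mean_frequencies(P)):
        terms = ((p / p_i) * math.log2(p / p_i) for p in row if p > 0)
        S.append(sum(terms) / len(row))
    return S

def specialization(P):
    """delta_j (Eq. 4): average of the gene specificities weighted by p_ij."""
    S = gene_specificity(P)
    t = len(P[0])
    return [sum(row[j] * s for row, s in zip(P, S)) for j in range(t)]

# Hypothetical toy system: 3 genes x 2 tissues; each column sums to 1.
P = [
    [0.5, 0.5],  # transcribed at the same frequency in both tissues
    [0.5, 0.0],  # exclusive to the first tissue
    [0.0, 0.5],  # exclusive to the second tissue
]
H = diversity(P)           # 1 bit for each tissue
S = gene_specificity(P)    # 0 for the shared gene, log2(2) = 1 for the exclusive genes
delta = specialization(P)  # 0.5 for each tissue
```

In this toy system the shared gene reaches the lower bound $S_i = 0$ and the two exclusive genes reach the upper bound $\log_2(t) = 1$, in agreement with the limits stated for Eq. 3.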
To define a measure of divergence with respect to the whole average transcriptome, let us define the average $\log_2$ of the global transcript frequencies in a given tissue, say

$$H_{Rj} = -\sum_{i=1}^{g} p_{ij} \log_2(p_i). \qquad [5]$$

$H_{Rj}$ will be equal to or larger than the corresponding $H_j$, reaching equality if and only if $p_i = p_{ij}$ for all values of $i$. Now we can define the Kullback–Leibler divergence of tissue $j$ as

$$D_j = H_{Rj} - H_j. \qquad [6]$$

$D_j$ measures how much a given tissue $j$ departs from the corresponding transcriptome distribution of the whole system. Notice that $H_j$, the measure of diversity, depends only on the relative transcription frequencies of tissue $j$; thus, it is independent of the context. However, the measures of tissue specialization and divergence, $\delta_j$ and $D_j$, respectively, depend not only on these frequencies but also on those of the remaining tissues; thus, these parameters are sensitive to the context in which they are measured [see supporting information (SI) Text].
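A minimal sketch of Eqs. 5 and 6 follows, again with a hypothetical frequency matrix and function name rather than the authors' code. It returns $H_{Rj}$, $H_j$, and $D_j$ for every tissue, and the second call illustrates that $D_j$ vanishes when every tissue has the same distribution.

```python
import math

def divergence(P):
    """Return (H_R, H, D) per tissue, following Eqs. 5, 1, and 6.

    P[i][j] is the relative frequency of gene i in tissue j.
    """
    g, t = len(P), len(P[0])
    p_mean = [sum(row) / t for row in P]  # p_i, Eq. 2
    H_R, H, D = [], [], []
    for j in range(t):
        col = [P[i][j] for i in range(g)]
        # Eq. 1: entropy of the tissue's own distribution.
        h = -sum(p * math.log2(p) for p in col if p > 0)
        # Eq. 5: same weights p_ij, but the log is taken on the global p_i.
        h_r = -sum(p * math.log2(q) for p, q in zip(col, p_mean) if p > 0)
        H.append(h)
        H_R.append(h_r)
        D.append(h_r - h)  # Eq. 6
    return H_R, H, D

# Hypothetical 3 genes x 2 tissues; each column sums to 1.
P = [[0.7, 0.1], [0.2, 0.3], [0.1, 0.6]]
H_R, H, D = divergence(P)

# Two identical tissues: p_i equals p_ij, so the divergence is zero.
_, _, D_same = divergence([[0.7, 0.7], [0.2, 0.2], [0.1, 0.1]])
```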
So far, we have been assuming the subdivision of an organism into a set of tissues, but the transcriptome can also be analyzed at the individual cell level or at higher hierarchic levels, such as sets of tissues (organs) or collections of organs (systems), etc. Transcriptome analysis can also be approached by analyzing the same organ or tissue under distinct developmental or environmental conditions. For example, we can monitor the changes in the transcriptome from normal tissue to a malignant tumor or the effect of environmental stresses on plant transcriptomes. The framework presented here is completely general and can be used to study transcriptome changes in complex experiments.
Simple Example. Fig. 1 presents a simple and unrealistic example to illustrate the numerical results of the indexes presented here. From Fig. 1, we can see that tissue a, which transcribes only the most specific gene ($S_1 = 2$), is the least diverse and most specialized of the tissues, whereas d, which transcribes three genes at the same relative frequency, is the most diverse and the second least specialized. Tissue c, transcribing two genes with low specificities at distinct frequencies, is the least specialized, with an approximately intermediate diversity, whereas tissue b, transcribing three genes, one with relatively high specificity ($S_4 = 1.05$), is the second most diverse and the second most specialized. The diversity of the whole system, $H = 1.9965$, almost reaches the maximum diversity for a system with four genes, $\log_2(4) = 2$, and the mean specialization of the tissues, mean($\delta_j$) = 1.0424, is almost in the center of the range of possible specializations, 0 to $\log_2(4) = 2$ (diamond in Fig. 1B). In this example, the properties of the transcriptome can be easily understood by inspection of the transcription frequencies of the four genes, but in any real case, thousands of genes are involved, and the appreciation of the transcriptome properties becomes impossible without the tools described here.
Analysis of Human Data: The Tissue Perspective. To exemplify our approach with real cases, we analyzed two comparable datasets. The first consists of 31 million MPSS tags for 22,935 genes measured in 32 human tissues (6), and the second is a microarray expression profiling of 36 human tissues (7). These two datasets share 28 human tissues and thus present the possibility of comparing the results of our approach with two highly dissimilar methodologies.
Fig. 2 shows a scatter plot of the values of diversity, $H_j$, vs. the values of specialization, given by the average gene specificity of the tissues, $\delta_j$.
From the results of the MPSS dataset (Fig. 2 A and C), note that the least diverse and most specialized organ is the pancreas, followed by the salivary gland and stomach. As noted in ref. 6, much of the transcriptional output in the pancreas is directed toward the manufacture of a limited repertoire of secreted enzymes and, to some extent, the same can be said about the salivary gland and the stomach. We can also note how the organs of the digestive system cover almost the entire specialization spectrum, from the highest specialization of the pancreas to the relatively low specialization of the small intestine. When comparing the values of $H_j$ for the constitutive organs of the digestive system in the MPSS dataset, we see a high degree of variation, from 5.2 for the pancreas up to more than double that quantity, 10.7, for the small intestine. The organs of the CNS are scattered in a region of high diversity but relatively low specialization (Fig. 2 B and D). The testis is the organ with the most diverse transcriptome; this is reasonable, because, as noted in ref. 6, in the testis, no abundant tissue-specific transcripts dominate the total population, which is derived from a large number of cell types of both germ-line and somatic origin. Among the organs sampled from the reproductive system, the placenta is more specialized than the testis but less diverse. Within the lymphatic system, the bone marrow is the most specialized and the least diverse. Within the endocrine system, which includes the hypothalamus given its mainly hormonal role (MPSS dataset only, Fig. 2 A and B), the pituitary gland is the most specialized and diverse. For the two organs representing the respiratory system, lung and trachea, the lung presents a more diverse and less specialized transcriptome than the trachea, whereas for the two organs representing the urinary system, the kidney has a more specialized transcriptome than the bladder. All these observations are consistent in both datasets.

Fig. 1. Example of values of information parameters as functions of the relative frequencies of transcription of genes in a system with four tissues (a–d) and four genes. (A) Bar plot of the relative frequencies of transcription, $p_{ij}$, of each gene in the four tissues. Values of gene specificities, $S_i$, are presented: gene 1, S = 2.0000; gene 2, S = 0.7341; gene 3, S = 0.4151; gene 4, S = 1.0501. (B) Scatter plot of $H_j$ (diversity) vs. $\delta_j$ (specialization, given by the average of the gene specificities) for each tissue. The value of $H$ in the whole system and the mean of $\delta_j$ are plotted as a diamond.

When comparing the analyses resulting from the two distinct datasets (Fig. 2 C and D), we notice a difference in the ranges covered by $H_j$ and $\delta_j$ in the two panels. The ranges of $H_j$ and $\delta_j$ are narrower when estimated from the microarrays than when estimated from the MPSS data, because fewer genes with less average variation are represented in the microarray dataset than in the MPSS dataset. Despite these scale differences, the scatter plots in Fig. 2 C and D are remarkably similar, taking into account that they arose from two completely distinct methods and used different biological samples that surely present individual noise in the estimation of the gene frequencies. In both graphs, the most specialized tissue is the pancreas, followed by the salivary gland, and the most diverse is the testis. The Pearson correlation between the paired estimates of $H_j$ from both datasets was r = 0.68, whereas the corresponding coefficient for $\delta_j$ was r = 0.90. Figs. S1, S2, and S3 show scatter plots for the values of $H_j$, $\delta_j$, and $D_j$, respectively, estimated from each dataset. Fig. S4 shows the scatter plot of $H_j$ vs. $\delta_j$ in the microarray dataset, including all 36 human tissues.

Fig. 2. Scatter plot of estimated values of $H_j$ (diversity) vs. $\delta_j$ (specialization, given by the average gene specificity) for tissues of the human systems. Tissues are colored by system of origin (central nervous system, circulatory, digestive, endocrine, lymphatic, reproductive, respiratory, and urinary). (A and B) Results from the MPSS dataset (32 tissues); B is an amplification of the box in A. The point of $H$ for the whole system and the mean of $\delta_j$ is shown by a diamond in A. (C) Results from the MPSS dataset in the 28 tissues shared with the microarray dataset. (D) Results from the microarray dataset in the shared 28 human tissues.

The visualization of the positions of transcriptomes in a system of diversity and specialization coordinates, such as the one presented in Fig. 2 and Fig. S4, permits a full and comprehensible appreciation of these transcriptome properties that is unfeasible by other means. In ref. 6, the authors show the distribution of transcript abundance classes in various tissues, plotting the proportion of the transcriptome contributed by the n most abundant transcripts. Using this approach, they conclude that the pancreas, salivary gland, and stomach are examples of highly specialized tissues, whereas the fetal brain and testis are presented as examples of tissues with complex and diversified transcriptomes. More subtle and detailed conclusions are reached by observing Fig. 2, which presents a complete and easy-to-interpret panorama of the diversity and specialization in all sampled tissues.
Fig. S5 presents a scatter plot of estimated values of $D_j$ (divergence) vs. $\delta_j$ (specialization) for tissues of the human systems resulting from the MPSS dataset. In Fig. S5, we can appreciate distinct strategies of specialization of the tissues' transcriptomes. Tissues with $\delta_j > D_j$ are above the red line that marks $D_j = \delta_j$, whereas tissues with $\delta_j < D_j$ are below that line. Tissues with $\delta_j > D_j$ have a specialization strategy that consists mainly of expressing highly specialized genes, whereas tissues with $\delta_j < D_j$ achieve their specialization by expressing at higher or lower rates genes that are, on average, expressed in the whole system. The distance of each point (tissue) to the line $D_j = \delta_j$ denotes how extreme the specialization strategy is. From Fig. S5, we notice, for example, that the tissues of the reproductive system are in general very close to the line $D_j = \delta_j$, indicating an almost neutral specialization strategy. In contrast, all of the tissues of the digestive system have large deviations from the neutral specialization strategy and all, except the colon, have values of $\delta_j > D_j$, denoting a specialization strategy consisting of the expression of mainly specialized genes. Plots of $D_j$ (divergence) vs. $\delta_j$ (specialization), such as the one shown in Fig. S5, offer immediate and easy-to-interpret insights into the specialization strategies of the transcriptomes, which would be very difficult, if not impossible, to attain without the information tools presented. Fig. S6 presents the scatter plot of $H_j$ vs. $\delta_j$ in the microarray dataset, including the 28 tissues shared with the MPSS dataset.
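The reading of a divergence vs. specialization plot described above can be sketched as a small helper. It assumes that a tissue whose specialization exceeds its divergence lies on the "mainly specialized genes" side of the line $D_j = \delta_j$; the function name, the labels, the tolerance, and the example values are hypothetical and are not taken from the paper's data.

```python
import math

def specialization_strategy(delta_j, D_j, tol=1e-9):
    """Label a tissue by the sign of delta_j - D_j and report its
    perpendicular distance to the line D_j = delta_j."""
    diff = delta_j - D_j
    distance = abs(diff) / math.sqrt(2)
    if diff > tol:
        label = "mainly specific genes"
    elif diff < -tol:
        label = "rate shifts of widely expressed genes"
    else:
        label = "neutral"
    return label, distance

# Hypothetical (delta_j, D_j) pairs, for illustration only.
examples = {"tissue_x": (2.0, 1.0), "tissue_y": (0.8, 1.1), "tissue_z": (0.9, 0.9)}
strategies = {name: specialization_strategy(d, D) for name, (d, D) in examples.items()}
```

The distance is reported so that tissues can be ranked by how far they depart from the neutral strategy, which is the quantity the text associates with how extreme the strategy is.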
In the case of the human dataset, the information analysis can also be performed at the system level by grouping the sets of tissues into their corresponding systems. Fig. S7 presents the graphical result of that analysis performed over the MPSS dataset, showing a scatter plot of estimated values of $H_j$ (diversity) vs. $\delta_j$ (specialization) that also includes the estimated values of $D_j$ (divergence) for each system. These results are consistent with the analysis performed at the finer level of tissues presented earlier.
Analysis of Human Data: The Gene Perspective. The values of $S_i$ calculated for each of the 22,935 genes studied in the MPSS dataset allow the quantitative classification of gene specificity, a concept regularly used in the literature but seldom quantified (13). In this case, the maximum value of $S_i$ for the human genes, $S_i = \log_2(32) = 5$, reached by 2,555 genes (11.14%), indicates that the gene is transcribed exclusively in one of the 32 tissues studied, whereas the minimum possible value of $S_i$, zero, unattained in the MPSS dataset, would indicate a gene with exactly the same frequency in all tissues. Housekeeping genes have small values of $S_i$, indicating an even distribution across tissues. An index for gene specificity that depends basically on its maximum expression was applied to the human dataset (6). That index does not have definite maximum or minimum bounds and misses many exclusive genes, giving values that, in contrast with our index $S_i$, depend not only on the specificity of the gene but also on its frequency of expression. Although the index $S_i$ easily selects all 2,555 specific genes, the index proposed in ref. 6 can induce a misleading inference about specificity, because many specific genes with a value of $S_i = 5$ give very low values of the index proposed in ref. 6 and thus will not be classified as specific with that index. Fig. S8 presents a scatter plot for the values of both coefficients. Despite these differences, all genes presented in table 3 of ref. 6 with a value of their coefficient 9 also have a high value of $S_i$ that ranges from 4.99 to 5. Table 1 presents examples of genes with extreme values of $S_i$ and the $S_i$ values attained in this dataset by some genes classified as housekeeping in the literature.
Table 1 presents five examples of the 2,555 completely specific genes ($S_i = 5$) from the MPSS dataset, selected because they have the five highest average expression levels (highest $p_i$) and are also presented as examples in ref. 6. As mentioned above, no gene attained the minimum possible value of $S_i = 0$, which would indicate exactly the same level of transcription in all 32 tissues; however, Table 1 presents the five genes with the lowest values of $S_i$, which range from 0.09 to 0.10 and can be classified as housekeeping, because they present the most even distribution of transcription among the human tissues sampled. Two of the genes with the lowest $S_i$ in Table 1, PSMB6 and PSMC5, belong to the proteasome, which is responsible for the degradation of abnormal intracellular proteins (14). The gene CHMP4A, which has the second smallest value of $S_i$, 0.09 (Table 1), is a member of the family of small coiled-coil proteins named CHMP, implicated in multivesicular body sorting (15), whereas COMMD3, with a value of $S_i = 0.09$, is a member of a gene family defined by the presence of a conserved and unique motif termed the COMM (copper metabolism gene MURR1) domain, which functions as an interface for protein–protein interactions. In particular, COMMD3 has been independently shown to be expressed at relatively even levels in 13 human tissues (16). The CTNND1 gene, with a value of $S_i = 0.10$, corresponds to catenin, a protein linked to the cytoplasmic domain of transmembrane cadherins (17). These examples show that measuring the specificity of genes by $S_i$ can lead to the detection of new housekeeping genes.
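The gene-level screening described above can be sketched as follows: compute $S_i$ per gene, collect the genes that reach the maximum $\log_2(t)$, and list those with the lowest values as housekeeping candidates. The gene names, frequencies, tolerance, and function name are hypothetical; the sketch assumes every gene is expressed in at least one tissue.

```python
import math

def classify_genes(freqs, n_lowest=2, tol=1e-9):
    """freqs maps gene name -> list of relative frequencies, one per tissue.

    Returns (S, exclusive, lowest): the S_i of each gene (Eq. 3), the genes
    with S_i = log2(t), and the n_lowest genes ranked by increasing S_i.
    """
    S = {}
    for gene, row in freqs.items():
        t = len(row)
        p_i = sum(row) / t  # Eq. 2
        S[gene] = sum((p / p_i) * math.log2(p / p_i) for p in row if p > 0) / t
    t = len(next(iter(freqs.values())))
    exclusive = sorted(g for g, s in S.items() if abs(s - math.log2(t)) < tol)
    lowest = sorted(S, key=S.get)[:n_lowest]
    return S, exclusive, lowest

# Hypothetical frequencies of four genes in four tissues.
freqs = {
    "gene_a": [0.02, 0.0, 0.0, 0.0],        # expressed in a single tissue
    "gene_b": [0.01, 0.01, 0.01, 0.01],     # identical in all tissues
    "gene_c": [0.03, 0.01, 0.0, 0.0],       # expressed in two tissues
    "gene_d": [0.011, 0.010, 0.009, 0.010], # nearly even
}
S, exclusive, lowest = classify_genes(freqs)
```

Because $S_i$ depends only on the ratios $p_{ij}/p_i$, a weakly expressed exclusive gene reaches the same maximum as a strongly expressed one, which is the property the text contrasts with the index of ref. 6.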
Table 1 also presents the values of $S_i$ for genes repeatedly reported in the literature as housekeeping for human studies (18). Four of these genes (PPIA, ACG1, PGK1, and TAF11) have values of $S_i$ between 0.22 and 0.28, which are more than double the values for the genes with the smallest values of $S_i$ but can still be considered to have an even distribution and thus to be housekeeping genes. In contrast, the gene for GAPDH, the most popular housekeeping gene (18), presents a value of $S_i = 2.3$, which is almost in the center of the possible range of $S_i$ (0–5), and cannot be considered as housekeeping, at least for the tissues studied.

Table 1. Examples of genes with distinct values of specificity ($S_i$)

Completely specific ($S_i = 5$), highly expressed genes
S_i    HUGO      Description (tissue where expressed)
5.00   LIPF      Lipase (stomach)
5.00   ELA3B     Elastase 3B (pancreas)
5.00   RHO       Rhodopsin; opsin 2, rod pigment (retina)
5.00   AZU1      Azurocidin 1 (bone marrow)
5.00   MYL2      Myosin, light polypeptide 2 (heart)

Genes with the lowest values of $S_i$ (expressed in all sampled tissues)
S_i    HUGO      Description
0.09   PSMB6     Prosome, macropain; subunit, beta type, 6
0.09   CHMP4A    Chromatin modifying protein 4A
0.09   COMMD3    B lymphoma Mo-MLV insertion region
0.10   CTNND1    Catenin (cadherin-associated protein), delta 1
0.10   PSMC5     Prosome, macropain; 26S subunit, ATPase, 5

Genes reported as housekeeping in the literature
S_i    HUGO      Description
0.22   PPIA      Cyclophilin A
0.25   ACG1      Actin, gamma 1
0.28   PGK1      Phosphoglycerate kinase 1
0.28   TAF11     TAF11; TATA box-binding protein
2.30   GAPDH     Glyceraldehyde-3-phosphate dehydrogenase

HUGO, Human Genome Organization.

Fig. S9 presents bar plots for the distribution of 10 specific genes ($S_i = 5$; Fig. S9A) and the 10 genes with the lowest values of $S_i$ (Fig. S9B) from the MPSS dataset, where one can appreciate how specific genes are expressed in only one organ, whereas nonspecific or housekeeping genes have an approximately even distribution among tissues.
Comparing the distributions of $S_i$ in the systems and tissues (Fig. S10), one can appreciate in both cases an approximately U-shaped distribution, with a larger number of genes having values closer to the limits of the $S_i$ range. The largest difference between the distributions of $S_i$ is observed in the first class, which in both cases groups the first fifth of the $S_i$ range and which represents 36% of the genes for the system distribution but 25% of the genes for tissues. This shows that, when grouping the data by systems, more genes can be classified as housekeeping or ubiquitously distributed among the systems than when grouping by tissues. The difference in the last class of the distributions, which groups the fifth of the range with the most extreme or specialized genes, is only 3%, showing that the relative $S_i$ value of this class of genes is less affected by the grouping than that of genes with a low value of $S_i$.
General Considerations. The transcriptome is vastly dynamic; frequencies of gene expression in tissues change during the development of the organism and, at the same developmental stage, in response to internal or external stimuli, modifying the landscape of the proteome and the functional and structural roles of the cells. In many instances in the recent literature on transcriptomes (19–23), the concept of complexity is mentioned in relation to the number of genes expressed and the changes of expression patterns in distinct situations; however, problems of quantitative evaluation of transcriptome diversity or specialization and gene specificity are not addressed. The analytical tools presented herein ($H_j$ for measuring diversity, $\delta_j$ for assessing context specialization, $D_j$ for transcriptome divergence, and $S_i$ for estimating gene specificity) allow the understanding of these global changes, giving insights into the complex changes occurring during these phenomena. A decrease in $H_j$ will indicate that fewer genes are being transcribed or that the transcription frequencies are less uniform, whereas an increase in $\delta_j$ will signal that, on average, more specific genes are transcribed, and an increase in $D_j$ indicates departures from the average transcriptome. With the help of the $S_i$ index, it is possible to detect genes specific to a given condition or that, on the contrary, are maintained at approximately the same rate of transcription under different situations.
In the examples presented here, we consider tissues of a given organism; however, the information framework presented is general, and we can speak about "subsystems" of a given organism, where each subsystem can represent an order of morphological classification, i.e., individual cells, tissues, organs, systems, etc.; a state of development, for example, plantlet, flowering plant, senescent plant; normal or malignant tissue, etc.; a particular experimental treatment, such as "optimal condition" vs. "stress condition" in a model organism; and so forth.
There are statistical issues about the estimation of the information properties that are not detailed here. A goodness-of-fit statistic can be readily obtained by transforming the Kullback–Leibler distance $D_j$ to test the null hypothesis that the transcriptome of a given tissue is statistically equal to a given distribution (24). Another issue is the estimation of confidence intervals for $H_j$, $D_j$, $\delta_j$, and $S_i$, which can be obtained by the bootstrap method and will be presented elsewhere. Another important statistical issue related to the estimation of the information parameters is the sample size, or depth of sampling, of the transcriptome. Because many genes are transcribed at very low frequencies, small sample sizes, usually used in EST studies, are likely to miss many low-expressed genes, probably underestimating the value of $H_j$ and distorting the true values of $D_j$ and $\delta_j$.
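The text mentions bootstrap confidence intervals without giving the procedure, so the following is only one plausible sketch, not the authors' method: a percentile bootstrap for $H_j$ that resamples the tags of one tissue with replacement. The tag counts, the number of replicates, the seed, and the function names are assumptions made for illustration.

```python
import math
import random
from collections import Counter

def entropy_from_counts(counts):
    """Plug-in estimate of H_j (Eq. 1) from raw tag counts of one tissue."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def bootstrap_entropy_ci(counts, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap interval for H_j (assumed procedure)."""
    rng = random.Random(seed)
    genes = list(range(len(counts)))
    n = sum(counts)
    estimates = []
    for _ in range(n_boot):
        # Draw n tags with replacement, with probabilities given by the observed counts.
        resample = Counter(rng.choices(genes, weights=counts, k=n))
        estimates.append(entropy_from_counts(list(resample.values())))
    estimates.sort()
    lo = estimates[int((alpha / 2) * n_boot)]
    hi = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical tag counts for four genes in one tissue.
counts = [500, 300, 150, 50]
h_hat = entropy_from_counts(counts)
lo, hi = bootstrap_entropy_ci(counts)
```

Genes with few tags can drop out of a resample entirely, which lowers the resampled entropy; this is the same mechanism by which shallow sampling tends to underestimate $H_j$, so an interval of this kind should be read with that bias in mind.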
When we have a snapshot of the relative frequencies of the transcribed genes, as in the case of SAGE, MPSS, or microarray experiments, the estimation of $H_j$ makes it possible to objectively quantify the diversity of a transcriptome, capturing this aspect of its complexity. Because $H_j$ depends only on the relative frequencies of the expressed genes, it can be used to compare transcription diversity not only between subsystems of the same organism but also between transcriptomes of various distinct organisms, allowing comparison among taxa.
The index $S_i$, defined as the specificity of a gene, permits the quantification of the relative spread of the genes across subsystems, giving a quantitative basis to define concepts such as housekeeping or specialized genes, which are recurrently used in the literature, in many cases without a quantitative assessment of their degree of variability (13).
We have shown the method in the framework of protein-coding genes; however, it is applicable to any kind of transcript tag available, including the precursors or mature forms of noncoding RNA such as iRNA, sRNA, and so forth, and to collections of anonymous tags from tissues in an organism for which no genome sequences are available.
Materials and Methods
The MPSS dataset, consisting of 31 million tags for 22,935 genes measured
in 32 human tissues (6), was kindly made available to
us by Jongeneel et al. The data consist of the number of tags obtained for each
gene in each tissue. To obtain the relative frequency of expression of each
gene in each tissue, $p_{ij}$, the original number of tags per gene was divided by
the corresponding total number of tags in the tissue.
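The normalization just described can be sketched in a few lines. The nested-dictionary layout, the function name, and the toy counts are hypothetical; the analysis itself was carried out in R.

```python
def relative_frequencies(tag_counts):
    """Convert {tissue: {gene: tag count}} into {tissue: {gene: p_ij}}."""
    freqs = {}
    for tissue, counts in tag_counts.items():
        total = sum(counts.values())  # total number of tags in this tissue
        if total == 0:
            raise ValueError(f"no tags recorded for tissue {tissue!r}")
        freqs[tissue] = {gene: n / total for gene, n in counts.items()}
    return freqs

# Hypothetical tag counts for two genes in two tissues.
tag_counts = {
    "liver": {"geneA": 30, "geneB": 10},
    "brain": {"geneA": 5, "geneB": 15},
}
p = relative_frequencies(tag_counts)
print(p["liver"]["geneA"], p["brain"]["geneA"])
```

After this step the frequencies within each tissue sum to 1, so tissues sequenced to different depths become directly comparable.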
For the analysis of the MPSS dataset at the system level, the data from the
32 tissues were grouped into eight systems in accordance with the main
functional classification of the tissues (Table S1). To obtain the relative fre-
quencies of expression of a given gene in a specific system, we took the
average of the relative frequencies of expression of that gene in the organs
considered to form part of the corresponding system. From this matrix of
relative frequencies, $\{p_{ij}\}$, all information parameters were calculated. The
number of gene tags and the information parameters calculated from the
data are presented in Table S1.
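The grouping of tissues into systems can be sketched as below. The tissue names, the system assignment, and the function name are hypothetical placeholders and do not reproduce the classification in Table S1.

```python
def system_frequencies(tissue_freqs, systems):
    """Average p_ij over the tissues assigned to each system."""
    result = {}
    for system, tissues in systems.items():
        genes = tissue_freqs[tissues[0]].keys()
        result[system] = {
            gene: sum(tissue_freqs[t][gene] for t in tissues) / len(tissues)
            for gene in genes
        }
    return result

# Hypothetical relative frequencies of two genes in three tissues.
tissue_freqs = {
    "stomach": {"geneA": 0.8, "geneB": 0.2},
    "colon": {"geneA": 0.6, "geneB": 0.4},
    "heart": {"geneA": 0.1, "geneB": 0.9},
}
# Hypothetical assignment of tissues to systems.
systems = {"digestive": ["stomach", "colon"], "circulatory": ["heart"]}

by_system = system_frequencies(tissue_freqs, systems)
print(by_system["digestive"])
```

Because each tissue profile sums to 1, the unweighted average over the tissues of a system also sums to 1, so the system-level values can be treated as relative frequencies without renormalization.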
The microarray dataset, obtained with the Affymetrix GeneChip
(Gene Expression Omnibus accession no. GDS1096), was downloaded from the
National Center for Biotechnology Information site (www.ncbi.nlm.nih.gov).
The file containing the normalized measurements for all identifiers
in all tissues was processed to obtain the average expression per gene in
each tissue. To obtain the relative frequency of expression of each gene
in each tissue, $p_{ij}$, the estimated average expression of each gene in a given
tissue was divided by the sum of the average expression of all genes in that
tissue. Two analyses were performed, one including only the 28 tissues
shared with the MPSS dataset (Fig. 2D) and the other including all 36 tissues
(Fig. S4).
The analyses were performed within the R statistical language (25). The
program designed for the analyses and the full table of results for all analyses
are available on request. The program will also be deposited to form part of
the R Bioconductor software.
Note Added in Proof. The value of gene specificity (Eq. 3) is a linear function of
the entropy of a gene’s expression distribution, which was applied in ref. 26 to
evaluate tissue specificity.
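The linear relation stated in the note can be checked with a short derivation. It assumes that Eq. 3 has the form shown on the first line, with $t$ tissues and $p_i$ the mean frequency of gene $i$ across tissues; the symbol $q_{j|i}$ is introduced here only for illustration.

```latex
% Assumed form of Eq. 3, with p_i = (1/t) \sum_j p_{ij}:
S_i = \frac{1}{t}\sum_{j=1}^{t}\frac{p_{ij}}{p_i}\log_2\frac{p_{ij}}{p_i}

% Expression distribution of gene i over tissues:
% q_{j|i} = p_{ij} / \sum_k p_{ik} = p_{ij} / (t\,p_i), so p_{ij}/p_i = t\,q_{j|i}.
S_i = \sum_{j=1}^{t} q_{j|i}\,\log_2\!\left(t\,q_{j|i}\right)
    = \log_2 t + \sum_{j=1}^{t} q_{j|i}\log_2 q_{j|i}
    = \log_2 t - H\!\left(q_{\cdot|i}\right)
```

Under this assumption, $S_i$ equals $\log_2 t$ minus the Shannon entropy of the gene's expression distribution, so the two measures rank genes in exactly opposite order.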
ACKNOWLEDGMENTS. We are grateful to Luis Herrera-Estrella and two anonymous
referees for suggestions and critical review of the manuscript and to
Lourdes Martínez de la Vega for help with the classification of human organs
into systems. We are also grateful to Victor Jongeneel et al. (6), who kindly
sent their original data files to us. We acknowledge financial support from
Conacyt, Concyteg, Cinvestav, and Universidad Autónoma Agraria Antonio
Narro.
1. Velculescu VE, Zhang L, Vogelstein B, Kinzler KW (1995) Serial analysis of gene
expression. Science 270:368–369.
2. Brenner S, et al. (2000) Gene expression analysis by massively parallel signature
sequencing (MPSS) on microbead arrays. Nat Biotechnol 18:630–634.
3. Agaton C, et al. (2002) Gene expression analysis by signature pyrosequencing. Gene
289:31–39.
4. Meyers BC, Galbraith DW, Nelson T, Agrawal V (2004) Methods for transcriptional
profiling in plants. Be fruitful and replicate. Plant Physiol 135:637–652.
5. Shannon CE (1948) A mathematical theory of communication. Bell Syst Tech J
27:379–423.
6. Jongeneel CV, et al. (2005) An atlas of human gene expression from massively parallel
signature sequencing (MPSS). Genome Res 15:1007–1014.
7. Ge X, et al. (2005) Interpreting expression profiles of cancers by genome-wide survey
of breadth of expression in normal tissues. Genomics 86:127–141.
8. Taneja IJ (2001) Generalized Information Measures and Their Applications (Departamento
de Matemática, Universidade Federal de Santa Catarina, Florianópolis, Brazil).
9. Román-Roldán R, Bernaola-Galván P, Oliver JL (1996) Application of information
theory to DNA sequence analysis: A review. Pattern Recog 29:1187–1194.
10. Schneider TD (1997) Information content of individual genetic sequences. J Theor Biol
189:427–441.
11. Reyes-Valdés MH, Williams CG (2005) An entropy-based measure of founder
informativeness. Genet Res 85:81–88.
12. Chao A, Shen T-J (2003) Nonparametric estimation of Shannon’s index of diversity
when there are unseen species in sample. Environ Ecol Stat 10:429–443.
13. Thellin O, et al. (1999) Housekeeping genes as internal standards: Use and limits.
J Biotechnol 75:291–295.
14. Kwak M-K, Kensler TW (2006) Induction of 26S proteasome subunit PSMB5 by the
bifunctional inducer 3-methylcholanthrene through the Nrf2-ARE, but not the AhR/
Arnt-XRE, pathway. Biochem Biophys Res Commun 345:1350–1357.
15. Katoh K, Shibata H, Hatta K, Maki M (2004) CHMP4b is a major binding partner of the
ALG-2-interacting protein Alix among the three CHMP4 isoforms. Arch Biochem
Biophys 421:159–165.
16. Burstein E, et al. (2005) COMMD proteins, a novel family of structural and functional
homologs of MURR1. J Biol Chem 280:22222–22232.
17. Keirsebilck A, et al. (1998) Molecular cloning of the human p120ctn catenin gene
(CTNND1): Expression of multiple alternatively spliced isoforms. Genomics
50:129–146.
18. Tricarico C, et al. (2002) Quantitative real-time reverse transcription polymerase chain
reaction: Normalization to rRNA or single housekeeping genes is inappropriate for
human tissue biopsies. Anal Biochem 309:293–300.
19. Frith MC, Pheasant M, Mattick JS (2005) Genomics: The amazing complexity of the
human transcriptome. Eur J Hum Genet 13:894–897.
20. Kapranov P, et al. (2005) Examples of the complex architecture of the human
transcriptome revealed by RACE and high-density tiling arrays. Genome Res 15:987–
997.
21. Gustincich S, et al. (2006) The complexity of the mammalian transcriptome. J Physiol
575:321–332.
22. Dix TI (2007) Comparative analysis of long DNA sequences by per element information
content using different contexts. BMC Bioinformatics 8:S10.
23. Sayyed-Ahmad A (2007) Transcriptional regulatory network refinement and quanti-
fication through kinetic modeling, gene expression microarray data and information
theory. BMC Bioinformatics 8:20.
24. Senoglu B, Surucu B (2004) Goodness-of-fit tests based on Kullback-Leibler informa-
tion. IEEE Tran Rel 53:357–361.
25. R-Development Core Team (2005) R: A Language and Environment for Statistical
Computing (R Foundation for Statistical Computing, Vienna).
26. Schug J, et al. (2005) Promoter features related to tissue specificity as measured by
Shannon entropy. Genome Biol 6:R33.