Comparison of threshold selection methods for microarray gene co-expression matrices

BioMed Central
BMC Research Notes
Open Access
Short Report
Bhavesh R Borate (1), Elissa J Chesler (3), Michael A Langston (2), Arnold M Saxton* (4) and Brynn H Voy (3)
Address: (1) Genome Science and Technology Program, University of Tennessee, Knoxville, Tennessee, USA; (2) Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, Tennessee, USA; (3) Oak Ridge National Laboratory, Systems Genetics Group, Biosciences Division, Oak Ridge, Tennessee, USA; and (4) Department of Animal Science, University of Tennessee, Knoxville, Tennessee, USA
Email: Bhavesh R Borate - boratebr@mail.nih.gov; Elissa J Chesler - elissa.chesler@jax.org; Michael A Langston - langston@eecs.utk.edu;
Arnold M Saxton* - asaxton@utk.edu; Brynn H Voy - bhvoy@utk.edu
* Corresponding author
Abstract
Background: Network and clustering analyses of microarray co-expression correlation data often
require application of a threshold to discard small correlations, thus reducing computational
demands and decreasing the number of uninformative correlations. This study investigated
threshold selection in the context of combinatorial network analysis of transcriptome data.
Findings: Six conceptually diverse methods - based on number of maximal cliques, correlation of
control spots with expressed genes, top 1% of correlations, spectral graph clustering, Bonferroni
correction of p-values, and statistical power - were used to estimate a correlation threshold for
three time-series microarray datasets. The validity of thresholds was tested by comparison to
thresholds derived from Gene Ontology information. Stability and reliability of the best methods
were evaluated with block bootstrapping.
Two threshold methods, number of maximal cliques and spectral graph, used information in the
correlation matrix structure and performed well in terms of stability. Comparison to Gene
Ontology found thresholds from number of maximal cliques extracted from a co-expression matrix
were the most biologically valid. Approaches to improve both methods were suggested.
Conclusion: Threshold selection approaches based on network structure of gene relationships
gave thresholds with greater relevance to curated biological relationships than approaches based
on statistical pair-wise relationships.
Published: 2 December 2009
BMC Research Notes 2009, 2:240 doi:10.1186/1756-0500-2-240
Received: 27 August 2009
Accepted: 2 December 2009
This article is available from: http://www.biomedcentral.com/1756-0500/2/240
© 2009 Saxton et al; licensee BioMed Central Ltd.
This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Introduction
To extract gene networks from microarray data, correlations are often used as a measure of gene co-expression. A typical microarray with 20,000 gene probes will produce 200 million correlations. Correlations below a threshold value, closer to zero, will be less meaningful. Hard and soft threshold approaches have been applied to biological data. Hard thresholds discard gene pairs with correlation below the threshold, while soft thresholds use the correlation value to weight gene network relationships. Zhang and Horvath [1] concluded that soft thresholds based on aggregate, modular relationships between genes gave more robust results, but data reduction by a hard threshold is often essential for computational tractability of graph algorithms.
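As a rough illustration of the distinction drawn above, the sketch below counts gene pairs and contrasts the two kinds of threshold for a single correlation. It is a minimal sketch, not the procedure used in this study: the function names are hypothetical, the use of the absolute value in the hard threshold is an assumption, and the power form |r|^beta with beta = 6 is only one commonly used soft-threshold weighting, chosen here for illustration.

```python
from math import comb


def pair_count(n_genes: int) -> int:
    """Number of distinct gene pairs, i.e. correlations to be computed."""
    return comb(n_genes, 2)


def hard_threshold(r: float, tau: float) -> int:
    """Keep the pair (1) if |r| reaches tau, otherwise discard it (0).
    Using |r| rather than r is an assumption of this sketch."""
    return 1 if abs(r) >= tau else 0


def soft_threshold(r: float, beta: float = 6.0) -> float:
    """Weight the pair by |r|**beta instead of discarding it.
    beta = 6 is an illustrative value only."""
    return abs(r) ** beta


print(pair_count(20_000))  # 199990000, i.e. about 200 million
for r in (0.95, 0.60, -0.90):
    print(r, hard_threshold(r, tau=0.80), round(soft_threshold(r), 3))
```

The hard threshold removes most pairs outright, which is what makes graph algorithms tractable; the soft weighting keeps every pair but shrinks weak correlations toward zero.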
We focus on relevance networks, created by applying a
hard threshold to the gene expression correlation matrix
[2], then extracting gene networks. The resulting networks
have been well documented in recent literature to yield
sets of co-expressed genes [3-5]. Relevance networks are
easily converted to graphs, with genes as vertices, only
connected by an edge if their correlation is above the
threshold. A clique is a sub-graph in which all nodes are
connected to each other [6]. A disadvantage of using
cliques is the computational requirements, which grow
exponentially with number of genes. Thus hard threshold
selection is required when performing clique extraction
on microarray data.
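The construction described above can be sketched on a toy correlation matrix. This is an illustrative sketch only: the clique enumeration software used in the study [17] is not shown in the text, so the standard Bron-Kerbosch algorithm with pivoting stands in for it, and the function names and the five-gene matrix are hypothetical.

```python
from itertools import combinations


def relevance_graph(corr, tau):
    """Adjacency sets: genes i and j share an edge when corr[i][j] >= tau."""
    n = len(corr)
    adj = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if corr[i][j] >= tau:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def maximal_cliques(adj):
    """Enumerate maximal cliques with Bron-Kerbosch (pivoting variant)."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(sorted(r))  # r cannot be extended: it is maximal
            return
        # Pivot on the vertex with most neighbours in p to prune branches.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return sorted(cliques)


# Hypothetical 5-gene correlation matrix (symmetric, unit diagonal).
corr = [
    [1.00, 0.95, 0.92, 0.10, 0.05],
    [0.95, 1.00, 0.91, 0.20, 0.00],
    [0.92, 0.91, 1.00, 0.85, 0.10],
    [0.10, 0.20, 0.85, 1.00, 0.30],
    [0.05, 0.00, 0.10, 0.30, 1.00],
]
cliques = maximal_cliques(relevance_graph(corr, tau=0.80))
print(cliques)  # [[0, 1, 2], [2, 3], [4]]
```

Note that an isolated gene is reported as a clique of size 1; whether such trivial cliques are counted is a choice left open here, and they can be filtered by size before counting.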
Current approaches to threshold selection are typically statistically based, and do not fully reflect the connectivity of the data [7]. Methods based on statistical arguments may not necessarily yield biologically significant relationships [3,8].
Some studies used an arbitrary threshold correlation such as 0.80 [9]. Moriyama et al. [10] obtained random correlation distributions for gene pairs by permuting their expression values and defended their choice of threshold based on statistical significance. Lee et al. [11] used the top 1% of correlations (absolute value) to build a co-expression network. Voy et al. [3] used distribution of correlations of genes with buffer spots on the arrays to select a threshold correlation value of 0.875.
However, using connectivity of the data to derive thresholds has been suggested. Langston et al. [12] recommended use of ontological distance, statistical significance and various graph structural attributes to arrive at a correlation threshold. Palla et al. [13] found that a threshold based on clique size was effective at separating networks.
Here two threshold selection methods based on correlation graph structure are compared with common statistically based methods. The graph based methods used spectral properties [14] or number of cliques to select a threshold. Objectives were to compare the various hard threshold methods for validity (retention of biological information), stability, and reliability.
Methods
Datasets
Three yeast S. cerevisiae time-series datasets were chosen for this study: 31 arrays for Anoxia state [15], 21 arrays for Reoxygenation state [15] and 18 arrays from yeast cultures synchronized using Alpha-factor arrest [16]. Data are available on Gene Expression Omnibus under GSE2246, GSE2267 and GSE22. Extensive GO annotation for S. cerevisiae genes influenced the selection. Exploratory data analyses within each dataset using PCA, box plots and pair-wise correlations between arrays found no outlier arrays. Quantile plots showed data were normally distributed, and distribution of correlations among gene expression profiles had the expected bell-shaped curve, so all data were used.
Software
Software written by Langston and colleagues (University of Tennessee) was used, including Datagen version 1.4a for computing correlations, maximal clique enumeration code version 2.0.1 [17], spectral analysis code [14], and GO Pairwise Similarity analysis code version 1.0. Matrix calculations for spectral graph analysis were carried out in MATLAB 7.0. P-values were calculated in SAS version 9.1 (Cary, NC). Statistical power was calculated using PASS statistical software (http://www.ncss.com/pass.html).
Threshold Estimation
Six conceptually different approaches were evaluated:
1) Numbers of maximal cliques were calculated at each
potential correlation threshold, starting at r = 0.99. The
threshold was lowered, in steps of 0.01, and number of
maximal cliques increased due to greater connections
among genes. The first correlation at which the clique number reached two times
(Maximal Clique-2) or three times (Maximal Clique-3)
its value at the previous step was chosen as the
threshold.
2) For each potential threshold correlation value, spectral
graph theory [18] was used to decompose the resulting
graph into eigenvalues and eigenvectors, which were used
to enumerate spectral clusters [19]. As the potential
threshold was incrementally lowered in steps of 0.01, a
peak in the number of clusters occurs, and the threshold
is chosen to maximize cluster number. Details are in [14].
3) Correlations of control spots with all other genes on
the array were calculated, creating a null distribution. The
99th percentile correlation value (absolute value) of this
distribution gave the threshold.
4) The top 1% of all correlations (absolute value) among
genes was used to estimate a threshold [11]. Correlations
were ranked, and the correlation at the 99th percentile
was the threshold estimate. Note that the control spot
method uses a different subset of correlations (only with
control spots), whereas this method uses all correlations
among genes.
5) A p-value for every correlation was computed, testing if the correlation was zero (Fisher's z-transformation). The threshold estimate was the correlation value corresponding to the critical Bonferroni p-value, 0.05/number of correlations. This threshold removes any correlations that are not statistically different from zero.
6) Statistical power calculations were used to find the correlation value that gave an 80% chance of rejecting the null hypothesis, Ho: correlation = 0. Type I error rate in these calculations was Bonferroni-adjusted to correct for multiple testing.
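Methods 5 and 6 can be approximated in closed form with Fisher's z-transformation. The sketch below is a hedged illustration under the normal approximation, not a reproduction of the SAS or PASS calculations used here, and it is not expected to match the values reported in this study. The function names are hypothetical, and the gene count of 6,000 in the usage lines is an assumed figure chosen only to make the example concrete.

```python
from math import comb, sqrt, tanh
from statistics import NormalDist


def bonferroni_r(n_arrays, n_genes, alpha=0.05):
    """Smallest |r| that is significant at the two-sided, Bonferroni-adjusted
    level alpha / (number of correlations), via Fisher's z approximation."""
    m = comb(n_genes, 2)                              # correlations tested
    z_crit = -NormalDist().inv_cdf(alpha / (2 * m))   # upper-tail critical z
    return tanh(z_crit / sqrt(n_arrays - 3))          # back-transform z to r


def power_r(n_arrays, n_genes, alpha=0.05, power=0.80):
    """Population |r| that would be declared non-zero with the requested
    power at the same adjusted level (one-tail approximation)."""
    m = comb(n_genes, 2)
    z_crit = -NormalDist().inv_cdf(alpha / (2 * m))
    z_power = NormalDist().inv_cdf(power)
    return tanh((z_crit + z_power) / sqrt(n_arrays - 3))


# Assumed gene count; array counts are those of the three datasets.
for n_arrays in (31, 21, 18):
    print(n_arrays,
          round(bonferroni_r(n_arrays, 6000), 3),
          round(power_r(n_arrays, 6000), 3))
```

Both quantities depend only on the number of arrays and the number of tests, so under this approximation fewer arrays always give a higher threshold regardless of the biology.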
Further details on computing these threshold estimation
methods are in the Additional file 1.
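The selection rules in methods 1, 3 and 4 reduce to two small operations: finding the first jump in a count while scanning thresholds downward, and reading a percentile from a set of absolute correlations. The following is a minimal sketch under stated assumptions: the function names, the toy clique counts and the nearest-rank percentile definition are all hypothetical choices, and the reading of "two times the previous value" as a ratio between consecutive 0.01 steps is one interpretation of the rule.

```python
import math


def jump_threshold(counts, factor=2.0):
    """Scan thresholds from high to low and return the first one whose count
    is at least `factor` times the count at the previous (higher) threshold.
    Steps where the previous count is zero are skipped (an assumption)."""
    ordered = sorted(counts.items(), reverse=True)    # highest threshold first
    for (_, prev), (tau, cur) in zip(ordered, ordered[1:]):
        if prev > 0 and cur >= factor * prev:
            return tau
    return None


def percentile_threshold(correlations, pct=99.0):
    """Absolute correlation at the given percentile (nearest-rank method)."""
    ranked = sorted(abs(r) for r in correlations)
    rank = max(1, math.ceil(pct / 100 * len(ranked)))
    return ranked[rank - 1]


# Hypothetical numbers of maximal cliques at each candidate threshold.
clique_counts = {0.99: 3, 0.98: 4, 0.97: 5, 0.96: 7, 0.95: 15, 0.94: 50}
print(jump_threshold(clique_counts, factor=2))   # 0.95 (15 >= 2 * 7)
print(jump_threshold(clique_counts, factor=3))   # 0.94 (50 >= 3 * 15)
print(percentile_threshold([0.1, -0.5, 0.9, 0.3], pct=50))  # 0.3
```

The same percentile function serves both the control-spot method and the top 1% method; only the set of correlations passed in differs.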
Performance Evaluation
Performance of the threshold estimation methods was evaluated by comparison to a biologically based Gene Ontology threshold. GO data used was gene_ontology_edit.obo.2008-05-01.gz. The biological meaning for each correlation bin (in 0.01 increments) was the average of functional similarity scores for all gene pairs within that correlation bin. Functional similarity for a pair of genes was defined as log(n/N)/log(2/N), where n is the number of genes in the lowest GO category that contained both genes, and N is the total number of genes annotated for the organism. The formula normalizes Functional similarity to a 0 to 1 range, and a value of 1 means the GO category contained only the two genes being considered (perfect similarity). GO threshold estimate was defined as the correlation at which change in average functional similarity exceeded median change plus half its standard deviation, thus identifying where biological information begins to accumulate.
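The score and the threshold rule defined above can be written out as follows. This is a sketch under assumptions rather than the GO Pairwise Similarity code used in the study: the function names and toy bin scores are hypothetical, the changes are taken between adjacent bins in ascending order, and the first bin whose increase exceeds the cutoff is returned, which is one possible reading of the rule.

```python
from math import log
from statistics import median, stdev


def functional_similarity(n, N):
    """log(n/N) / log(2/N): equals 1 when the shared GO category holds only
    the two genes (n = 2) and 0 when it holds every annotated gene (n = N)."""
    return log(n / N) / log(2 / N)


def go_threshold(bin_scores):
    """bin_scores maps each correlation bin to its mean functional similarity.
    Returns the lowest bin whose increase over the previous bin exceeds
    median(change) + 0.5 * standard deviation(change)."""
    bins = sorted(bin_scores)
    changes = [bin_scores[b] - bin_scores[a] for a, b in zip(bins, bins[1:])]
    cutoff = median(changes) + 0.5 * stdev(changes)
    for b, change in zip(bins[1:], changes):
        if change > cutoff:
            return b
    return None


print(functional_similarity(2, 6000))     # 1.0
print(functional_similarity(6000, 6000))  # 0.0 (printed as -0.0)

# Hypothetical mean scores per correlation bin (coarser than 0.01 for brevity).
scores = {0.70: 0.10, 0.75: 0.10, 0.80: 0.11, 0.85: 0.12, 0.90: 0.20, 0.95: 0.35}
print(go_threshold(scores))               # 0.9
```

Because the cutoff is built from the median and spread of the changes themselves, the rule adapts to each dataset's curve instead of relying on a fixed score.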
To study stability of the methods, 10,000 block bootstrap samples were created by sampling arrays with replacement from each block. Blocks were defined to be 2 or 3 adjacent time periods, such that each block contained 3 or 4 arrays. Block bootstrapping was necessary to preserve as much as possible the time-course dependency structure of the experiments [20]. For each of the 10,000 samples, a threshold estimate was calculated by each method, and the distribution of these thresholds was used to compare threshold methods for stability.
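The resampling scheme described above can be sketched as below. It is an illustration under assumptions, not the code used for the study: the function names, the block layout and the placeholder estimator are hypothetical, and in practice the estimator would recompute the correlation matrix from the resampled arrays and apply one of the threshold methods.

```python
import random
from statistics import mean, stdev


def block_bootstrap(blocks, rng):
    """Resample array indices with replacement inside each block, so every
    block (group of adjacent time periods) stays represented in order."""
    sample = []
    for block in blocks:
        sample.extend(rng.choices(block, k=len(block)))
    return sample


def bootstrap_thresholds(blocks, estimate, n_boot=1000, seed=0):
    """Apply a threshold estimator to n_boot block bootstrap samples."""
    rng = random.Random(seed)
    return [estimate(block_bootstrap(blocks, rng)) for _ in range(n_boot)]


# Hypothetical layout: 10 arrays grouped into blocks of 3 or 4.
blocks = [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]


def toy_estimate(array_indices):
    """Placeholder estimator: fraction of distinct arrays in the sample."""
    return len(set(array_indices)) / len(array_indices)


values = bootstrap_thresholds(blocks, toy_estimate, n_boot=200)
print(round(mean(values), 3), round(stdev(values), 3))
```

The mean and standard deviation of the returned list correspond to the bootstrap mean and bootstrap standard deviation used to judge bias and stability.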
Results
Functional similarity scores for the three datasets are displayed in Figure 1. Changes in scores across correlation values were similar for all datasets, and the lack of GO term relationship for negative correlations is striking. Because of this, the GO threshold was defined by the curve for positive correlations. Biological relationship begins to increase sharply above a correlation value of 0.80, and this produced the GO thresholds in Table 1.
Estimated thresholds obtained by each method are listed in Table 1 for the three datasets. If estimated threshold is higher than the biological threshold, false negatives will occur, because data reduction by the higher threshold will remove real relationships. Conversely, using a threshold below the biological threshold will create false positives, and relationships that are not real would be included in the network. In discovery-based settings, false positives are more acceptable, as they can be removed with further validation. Thus methods that estimate a lower threshold are preferred. Maximal Clique-2 and Spectral Clustering performed better than the other methods, based on summed absolute deviations from GO threshold (Table 1). Maximal Clique-2 was further from the GO threshold, but might be preferred since it never exceeded that threshold.
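The ranking criterion can be recomputed directly from the thresholds reported in Table 1. The sketch below uses those published values; only the variable and function names are introduced here for illustration.

```python
# GO functional similarity thresholds (Anoxia, Reoxygenation, Alpha), Table 1.
GO = (0.97, 0.92, 0.85)

# Estimated thresholds per method, in the same dataset order, from Table 1.
ESTIMATES = {
    "Spectral Clustering": (0.93, 0.97, 0.89),
    "Maximal Clique-2": (0.90, 0.91, 0.74),
    "Power": (0.88, 0.94, 0.96),
    "Bonferroni adjustment": (0.85, 0.93, 0.95),
    "Control-Spot": (0.93, 0.83, 0.70),
    "Maximal Clique-3": (0.87, 0.89, 0.60),
    "Top 1 Percent": (0.81, 0.81, 0.72),
}


def summed_abs_deviation(values, reference=GO):
    """Sum of |estimate - GO threshold| over the three datasets."""
    return round(sum(abs(v - g) for v, g in zip(values, reference)), 2)


def times_above(values, reference=GO):
    """Number of datasets where the estimate exceeds the GO threshold,
    i.e. where false negatives would be expected."""
    return sum(v > g for v, g in zip(values, reference))


for method in sorted(ESTIMATES, key=lambda m: summed_abs_deviation(ESTIMATES[m])):
    vals = ESTIMATES[method]
    print(f"{method:22s} {summed_abs_deviation(vals):.2f} {times_above(vals)}")
```

Reporting the number of exceedances next to the summed deviation makes the trade-off explicit: Spectral Clustering has the smallest deviation (0.13) but exceeds the GO threshold in two datasets, whereas Maximal Clique-2 (0.19) never does.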
The estimated threshold derived for selected methods for each dataset is compared to bootstrap distributions in Table 2. The best methods from above, Maximal Clique-2 and Spectral Clustering, and two other methods for comparative purposes were chosen for this analysis. The bootstrap mean was never less than the estimated threshold, and occasionally was two standard deviations above. This upward bias in correlation is expected, as each time period had a limited number of arrays, making it likely that the identical array would be resampled. However, Maximal Clique and Spectral Clustering methods showed more resistance to this bias. The bootstrap standard deviation measures ability of the methods to produce similar threshold estimates from randomized arrays. Again the network-based methods showed the lowest standard deviations, and highest stability. All methods showed poorest performance with the Alpha dataset, possibly due to its unreplicated design. This makes it less likely that all time levels would be represented in the bootstrap samples, whereas the other datasets had glucose and galactose biological replicates.
Figure 1: Change in GO functional similarity score across correlation values. Lines represent Anoxia dataset (solid line), Reoxygenation dataset (dashed line) and Alpha dataset (dotted line).
Discussion
The two network-based methods, Maximal Clique-2 and Spectral Clustering, performed very well in terms of bootstrap stability and biological validity. Though Maximal Clique-2 method gave thresholds close to the biological threshold, and always below, the method had slightly higher bootstrap standard deviations. The robustness of the Maximal Clique-2 algorithm could be enhanced by exclusion of smaller cliques in the graph, for example cliques of size 3. Spectral Clustering thresholds were on average closer to biological thresholds, but too often exceeded them. However, if all thresholds for Spectral Clustering were lowered by 0.05, it would have been clearly the best method. Further fine-tuning of the parameters in the algorithm (size of sliding window, different tolerance levels for cluster formation) may improve the method's validity. In a recent paper, Almendral and Díaz-Guilera [21] documented the sensitivity of the non-zero eigenvalue to network changes. All methods had subjective settings, and further work on many more species and experiments would be needed to establish best choices.
Table 1: Estimated threshold for each method by dataset, with methods sorted by the sum of absolute deviations from the GO functional similarity threshold.
Method                     Anoxia   Reoxygenation   Alpha    Absolute deviations from GO threshold
GO Functional Similarity   0.97     0.92            0.85
Spectral Clustering        0.93     0.97*           0.89*    0.04+0.05+0.04 = 0.13
Maximal Clique-2           0.90     0.91            0.74     0.07+0.01+0.11 = 0.19
Power                      0.88     0.94*           0.96*    0.09+0.02+0.11 = 0.22
Bonferroni adjustment      0.85     0.93*           0.95*    0.12+0.01+0.10 = 0.23
Control-Spot               0.93     0.83            0.70     0.04+0.09+0.15 = 0.28
Maximal Clique-3           0.87     0.89            0.60     0.10+0.03+0.25 = 0.38
Top 1 Percent              0.81     0.81            0.72     0.16+0.11+0.13 = 0.40
* Thresholds above the GO functional similarity threshold (set in bold in the original table).
Table 2: Summary of bootstrap results compares the estimated threshold with the bootstrap distribution for the four selected methods.
Method                Dataset   Estimated Threshold   Bootstrap Mean   Difference (a)   Bootstrap Standard Deviation
Maximal Clique-2      Anoxia    0.90                  0.91             -0.01            0.015
                      Reoxy     0.91                  0.93             -0.02            0.009
                      Alpha     0.74                  0.78             -0.04            0.057
Spectral Clustering   Anoxia    0.93                  0.95             -0.02            0.012
                      Reoxy     0.97                  0.97              0.00            0.011
                      Alpha     0.89 **               0.95             -0.06            0.017
Top 1 Percent         Anoxia    0.81                  0.83             -0.02            0.011
                      Reoxy     0.81                  0.84             -0.03            0.016
                      Alpha     0.72 **               0.79             -0.07            0.027
Control Spot          Anoxia    0.93                  0.95             -0.02            0.015
                      Reoxy     0.83 **               0.90             -0.07            0.034
                      Alpha     0.70 **               0.82             -0.12            0.043
(a) Estimated threshold minus bootstrap mean.
** Estimated threshold is more than 2 std. deviations from bootstrap mean.
The results from this study complement the work of Zhang and Horvath [1] which concluded that thresholds based on the scale-free topology - the formation of hubs and densely-connected sub-graphs - produced more robust results. The statistically-based methods studied here are directly dependent on the correlation distribution and thus were unable to capture biological relationships. Although the Control-Spot method is based on logical reasoning, the high correlation of control spots with other genes on the arrays weakened the method's validity. The Top 1% Correlations method is arbitrary, and failed to capture biological relationships. Statistical considerations used for the Power and Bonferroni methods were also not able to identify biological relationships, reflecting the well-known discrepancy between biological and statistical significance. Experiments that are small will produce thresholds that are too high, while large experiments will give excessively low thresholds, even though the biological relationships are the same.
The GO similarity measure of biological validity we have used, however, is by no means perfect and is just one way of quantifying biological information. Khatri and Draghici [22] have listed limitations of GO in detail. We also found low GO scores at high negative correlations as compared to the high GO score associated with high positive correlations for all three datasets. The drop in GO score at high negative correlations could be due to several reasons, for example experimental and analytical limitations to detect biologically negative correlations among genes, and limited gene annotations [11]. As the quantification of biological information in data gets more precise, the selection of thresholds should become easier. In fact, note that a method like the GO threshold used here would be a logical choice if GO information were complete and accurate.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
BRB wrote code for the analyses, summarized results, and
drafted the paper. All authors were involved in study
design, and read and approved the final manuscript.
Additional material
Additional file 1: Methodology for Threshold Estimation. Details on the six threshold estimation methods are presented in a computationally oriented manner. [http://www.biomedcentral.com/content/supplementary/1756-0500-2-240-S1.PDF]
Acknowledgements
This research has been supported in part by the National Institutes of Health under grants P01DA015027-01, R01HD052472-02, R01MH074460-01, U01AA013512 and U01AA013641-04 and by the UT-ORNL Science Alliance. Dr. E.J. Chesler was supported by NIAAA Integrative Neuroscience Initiative on Alcoholism under grants U01AA13499 and U24AA13513. This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. Additional support was provided by the University of Tennessee Genome Science and Technology program. John Eblen, Andy Perkins, Gary Rogers and Yun Zhang helped with basic issues of algorithm synthesis. Drs. Bing Zhang and Roumyana Yordanova provided valuable comments on certain aspects of this study.
References
1. Zhang B, Horvath S: A general framework for weighted gene co-expression network analysis. Stat Appl Genet Mol Biol 2005, 4:.
2. Butte AJ, Tamayo P, Slonim D, Golub TR, Kohane IS: Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks. Proc Natl Acad Sci USA 2000, 97(22):12182-12186.
3. Voy BH, Scharff JA, Perkins AD, Saxton AM, Borate B, Chesler EJ, Branstetter LK, Langston MA: Extracting gene networks for low-dose radiation using graph theoretical algorithms. PLoS Comput Biol 2006, 2(7):e89.
4. Yan X, Mehan MR, Huang Y, Waterman MS, Yu PS, Zhou XJ: A graph-based approach to systematically reconstruct human transcriptional regulatory modules. Bioinformatics 2007, 23(13):i577-586.
5. Freeman TC, Goldovsky L, Brosch M, van Dongen S, Maziere P, Grocock RJ, Freilich S, Thornton J, Enright AJ: Construction, visualisation, and clustering of transcription networks from microarray expression data. PLoS Comput Biol 2007, 3(10):2032-2042.
6. Baldwin NE, Chesler EJ, Kirov S, Langston MA, Snoddy JR, Williams RW, Zhang B: Computational, integrative, and comparative methods for the elucidation of genetic coexpression networks. J Biomed Biotechnol 2005, 2005(2):172-180.
7. Butte AJ, Kohane IS: Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. Pac Symp Biocomput 2000:418-429.
8. Quackenbush J: Genomics. Microarrays--guilt by association. Science 2003, 302(5643):240-241.
9. Sanoudou D, Haslett JN, Kho AT, Guo S, Gazda HT, Greenberg SA, Lidov HG, Kohane IS, Kunkel LM, Beggs AH: Expression profiling reveals altered satellite cell numbers and glycolytic enzyme transcription in nemaline myopathy muscle. Proc Natl Acad Sci USA 2003, 100(8):4666-4671.
10. Moriyama M, Hoshida Y, Otsuka M, Nishimura S, Kato N, Goto T, Taniguchi H, Shiratori Y, Seki N, Omata M: Relevance network between chemosensitivity and transcriptome in human hepatoma cells. Mol Cancer Ther 2003, 2(2):199-205.
11. Lee HK, Hsu AK, Sajdak J, Qin J, Pavlidis P: Coexpression analysis of human genes across many microarray data sets. Genome Res 2004, 14(6):1085-1094.
12. Langston MA, Perkins AD, Saxton AM, Scharff JA, Voy BH: Innovative Computational Methods For Transcriptomic Data Analysis: A Case Study in the Use Of FPT For Practical Algorithm Design and Implementation. ACM symposium on Applied Computing: 2006; Dijon, France 2006.
13. Palla G, Derenyi I, Farkas I, Vicsek T: Uncovering the overlapping community structure of complex networks in nature and society. Nature 2005, 435(7043):814-818.
14. Perkins AD, Langston MA: Threshold selection in gene co-expression networks using spectral graph theory techniques. BMC Bioinformatics 2009, 10(Suppl 11):S4.
15. Lai LC, Kosorukoff AL, Burke PV, Kwast KE: Metabolic-state-dependent remodeling of the transcriptome in response to anoxia and subsequent reoxygenation in Saccharomyces cerevisiae. Eukaryot Cell 2006, 5(9):1468-1489.
16. Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell 1998, 9(12):3273-3297.
17. Zhang Y, Abu-Khzam FN, Baldwin NE, Chesler EJ, Langston MA, Samatova NF: Genome-scale computational approaches to memory-intensive applications in systems biology. Supercomputing 2005 Proceedings of the ACM/IEEE SC Conference: 2005 2005:12.
18. Chung FRK: Spectral Graph Theory. American Mathematical Society; 1994.
19. Ding CHQ, He X, Zha H: A spectral method to separate disconnected and nearly disconnected Web graph components. Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Francisco, California 2001:275-280 [http://ranger.uta.edu/~chqding/papers/kdd3a.ps].
20. Politis DN: The impact of bootstrap methods on time series analysis. Statistical Science 2003, 18(2):219-230.
21. Almendral JA, Diaz-Guilera A: Dynamical and spectral properties of complex networks. New Journal of Physics 2007, 9:187.
22. Khatri P, Draghici S: Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics 2005, 21(18):3587-3595.

Supplementary resource (1)

... According to Osterman and Overbeek [48], the gene context technique ranks candidate genes through multiple assignments. Thus, candidate genes with highly similar contexts are measured as high-confidence genes with potential functional association with known genes [212][213][214][215]. ...
... There is no rule of thumb applied for setting the threshold values. Although a soft threshold value (nearing zero) is considered less significant, it compensates for the robustness of a weighted GCN [215]. On the flip side, important genes might be missed out from the network with a highly stringent threshold selection [216]. ...
... A hard threshold (r = 0.8 to 1.0) has been shown to be more relevant in studies inferring biological relationships. The validity of the biological information computed based on the GO functional similarity measure increases at r > 0.8 [215]. GCN has been widely applied to Arabidopsis for the identification of genes corresponding to cell wall biosynthetic [90], fatty acid chain [217], photorespiration [218], immune response [219] and other metabolic pathways. ...
Article
Full-text available
In higher plants, the complexity of a system and the components within and among species are rapidly dissected by omics technologies. Multi-omics datasets are integrated to infer and enable a comprehensive understanding of the life processes of organisms of interest. Further, growing open-source datasets coupled with the emergence of high-performance computing and development of computational tools for biological sciences have assisted in silico functional prediction of unknown genes, proteins and metabolites, otherwise known as uncharacterized. The systems biology approach includes data collection and filtration, system modelling, experimentation and the establishment of new hypotheses for experimental validation. Informatics technologies add meaningful sense to the output generated by complex bioinformatics algorithms, which are now freely available in a user-friendly graphical user interface. These resources accentuate gene function prediction at a relatively minimal cost and effort. Herein, we present a comprehensive view of relevant approaches available for system-level gene function prediction in the plant kingdom. Together, the most recent applications and sought-after principles for gene mining are discussed to benefit the plant research community. A realistic tabulation of plant genomic resources is included for a less laborious and accurate candidate gene discovery in basic plant research and improvement strategies.
Chapter
The thresholding problem is considered in the context of high-throughput biological data. Several approaches are reviewed, implemented, and tested over an assortment of transcriptomic data.
Article
COVID-19 exerts systemic effects that can compromise various organs and systems. Although retrospective and in-silico studies and prospective preliminary analysis have assessed the possibility of direct infection of the endometrium, there is a lack of in-depth and prospective studies on the impact of systemic disease on key endometrial genes and functions across the menstrual cycle and window of implantation. Gene expression data has been obtained from (i) healthy secretory endometrium collected from 42 women without endometrial pathologies and (ii) nasopharyngeal swabs from 231 women with COVID-19 and 30 negative controls. To predict how COVID-19-related gene expression changes impact key endometrial genes and functions, an in-silico model was developed by integrating the endometrial and COVID-19 datasets in an affected mid-secretory endometrium gene co-expression network. An endometrial validation set comprising 16 women (8 confirmed to have COVID-19 and 8 negative test controls) was prospectively collected to validate the expression of key genes. We predicted that five genes important for embryo implantation were affected by COVID-19 (downregulation of COBL, GPX3 and SOCS3, and upregulation of DOCK2 and SLC2A3). We experimentally validated these genes in COVID19 patients using endometrial biopsies during the secretory phase of the menstrual cycle. The results generally support the in-silico model predictions, suggesting that the transcriptomic landscape changes mediated by COVID-19 affect endometrial receptivity genes and key processes necessary for fertility, such as immune system function, protection against oxidative damage and development vital for embryo implantation and early development.
Article
Environmental factors, including different stresses, can have an impact on the expression of genes and subsequently the phenotype and development of plants. Since a large number of genes are involved in response to the perturbation of the environment, identifying groups of co-expressed genes is meaningful. The gene co-expression network models can be used for the exploration, interpretation, and identification of genes responding to environmental changes. Once a gene co-expression network is constructed, one can determine gene modules and the association of gene modules to the phenotypic response. To link modules to phenotype, one approach is to find the correlated eigengenes of given modules or to integrate all eigengenes in regularized linear model. This manuscript describes the method from construction of co-expression network, module discovery, association between modules and phenotypic data, and finally to annotation/visualization.
Chapter
High-throughput phenotyping platforms for growth chamber and greenhouse-grown plants enable nondestructive, automated measurements of plant traits including shape, aboveground architecture, length, and biomass over time. However, to establish these platforms, many of these methods require expensive equipment or phenotyping expertise. Here we present a relatively inexpensive and simple phenotyping method for imaging hundreds of small- to medium-sized growth chamber or greenhouse-grown plants with a digital camera. Using this method, we image hundreds of tomato plants in 1 day.
Article
High-throughput phenotyping enables the temporal detection of subtle changes in plant plasticity and adaptation to different conditions, such as nitrogen deficiency, in an accurate, nondestructive, and unbiased way. Here, we describe a protocol to assess the contribution of nitrogen addition or deprival using an image-based system to analyze plant phenotype. Thousands of images can be captured throughout the life cycle of Arabidopsis, and those images can be used to quantify parameters such as plant growth (area, caliper length, diameter, etc.), in planta chlorophyll fluorescence, and in planta relative water content.
Article
High-throughput phenotyping (HTP) allows automation of fast and precise acquisition and analysis of digital images for the detection of key traits in real time. HTP improves characterization of the growth and development of plants in controlled environments in a nondestructive fashion. Marchantia polymorpha has emerged as a very attractive model for studying the evolution of the physiological, cellular, molecular, and developmental adaptations that enabled plants to conquer their terrestrial environments. The availability of the M. polymorpha genome in combination with a full set of functional genomic tools including genetic transformation, homologous recombination, and genome editing has allowed the inspection of its genome through forward and reverse genetics approaches. The increasing number of mutants has made it possible to perform informative genome-wide analyses to study the phenotypic consequences of gene inactivation. Here we present an HTP protocol for M. polymorpha that will aid current efforts to quantify numerous morphological parameters that can potentially reveal genotype-to-phenotype relationships and relevant connections between individual traits.
Article
The development of RGB (red, green, blue) sensors has opened the way for plant phenotyping. This is relevant because plant phenotyping allows us to visualize the product of the interaction between the plant ontogeny, anatomy, physiology, and biochemistry. Better yet, this can be achieved at any stage of plant development, i.e., from seedling to maturity. Here, we describe the use of phenotyping, based on the stay-green trait, of common bean (Phaseolus vulgaris L.) plant, as a model, stressed by water deficit, to elucidate the result of that interaction. Description is based on interpretation of RGB digital images acquired using a phenomic platform and a specific software. These images allow us to obtain a data group related to the color parameters that quantify the changes and alterations in each plant growth and development.
Article
Tools of molecular biology and the evolving tools of genomics can now be exploited to study the genetic regulatory mechanisms that control cellular responses to a wide variety of stimuli. These responses are highly complex, and involve many genes and gene products. The main objectives of this paper are to describe a novel research program centered on understanding these responses by developing powerful graph algorithms that exploit the innovative principles of fixed parameter tractability in order to generate distilled gene sets; producing scalable, high performance parallel and distributed implementations of these algorithms utilizing cutting-edge computing platforms and auxiliary resources; employing these implementations to identify gene sets suggestive of co-regulation; and performing sequence analysis and genomic data mining to examine, winnow and highlight the most promising gene sets for more detailed investigation. As a case study, we describe our work aimed at elucidating genetic regulatory mechanisms that control cellular responses to low-dose ionizing radiation (IR). A low-dose exposure, as defined here, is an exposure of at most 10 cGy (rads). While the consequences of high doses of radiation are well known, the net outcome of low-dose exposures continues to be debated, with support in the literature for both detrimental and beneficial effects. We use genome-scale gene expression data collected in response to low-dose IR exposure in vivo to identify the pathways that are activated or repressed as a tissue responds to the radiation insult. The driving motivation is that knowledge of these pathways will help clarify and interpret physiological responses to IR, which will advance our understanding of the health consequences of low-dose radiation exposures.
Conference Paper
Graph-theoretical approaches to biological network analysis have proven to be effective for small networks but are computationally infeasible for comprehensive genome-scale systems-level elucidation of these networks. The difficulty lies in the NP-hard nature of many global systems biology problems that, in practice, translates to exponential (or worse) run times for finding exact optimal solutions. Moreover, these problems, especially those of an enumerative flavor, are often memory-intensive and must share very large sets of data effectively across many processors. For example, the enumeration of maximal cliques - a core component in gene expression network analysis, cis regulatory motif finding, and the study of quantitative trait loci for high-throughput molecular phenotypes - can result in as many as 3^(n/3) maximal cliques for a graph with n vertices. Memory requirements to store those cliques reach terabyte scales even on modest-sized genomes. Emerging hardware architectures with ultra-large globally addressable memory such as the SGI Altix and Cray X1 seem to be well suited for addressing these types of data-intensive problems in systems biology. This paper presents a novel framework that provides exact, parallel and scalable solutions to various graph-theoretical approaches to genome-scale elucidation of biological networks. This framework takes advantage of these large-memory architectures by creating globally addressable bitmap memory indices with potentially high compression rates, fast bitwise-logical operations, and reduced search space. Augmented with recent theoretical advancements based on fixed-parameter tractability, this framework produces computationally feasible performance for genome-scale combinatorial problems of systems biology.
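The worst-case clique count quoted in this abstract can be checked on small graphs. The following is a minimal sketch, not the framework the abstract describes: it uses a plain Bron-Kerbosch enumeration with pivoting (a standard textbook algorithm, swapped in here for illustration) on complete multipartite graphs with parts of size 3, which are the known extremal case. The function names `maximal_cliques` and `moon_moser_graph` are hypothetical.

```python
from itertools import combinations


def maximal_cliques(adj):
    """Enumerate all maximal cliques with Bron-Kerbosch (pivoting variant).

    adj maps each vertex to the set of its neighbours.
    """
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidate vertices, x: already-explored vertices
        if not p and not x:
            cliques.append(sorted(r))
            return
        # Pivot on the vertex with the most neighbours in p to prune branches.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def moon_moser_graph(n):
    """Complete multipartite graph with n/3 parts of size 3 (illustrative)."""
    if n % 3:
        raise ValueError("n must be divisible by 3")
    adj = {v: set() for v in range(n)}
    for u, v in combinations(range(n), 2):
        if u // 3 != v // 3:  # connect vertices only across different parts
            adj[u].add(v)
            adj[v].add(u)
    return adj


if __name__ == "__main__":
    for n in (3, 6, 9, 12):
        count = len(maximal_cliques(moon_moser_graph(n)))
        print(n, count, 3 ** (n // 3))
```

On these graphs every maximal clique picks exactly one vertex from each part, so the enumerated count equals 3^(n/3), which is why storing all cliques becomes infeasible quickly as n grows.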
Article
Gene co-expression networks are often constructed by computing some measure of similarity between expression levels of gene transcripts and subsequently applying a high-pass filter to remove all but the most likely biologically-significant relationships. The selection of this expression threshold necessarily has a significant effect on any conclusions derived from the resulting network. Many approaches have been taken to choose an appropriate threshold, among them computing levels of statistical significance, accepting only the top one percent of relationships, and selecting an arbitrary expression cutoff. We apply spectral graph theory methods to develop a systematic method for threshold selection. Eigenvalues and eigenvectors are computed for a transformation of the adjacency matrix of the network constructed at various threshold values. From these, we use a basic spectral clustering method to examine the set of gene-gene relationships and select a threshold dependent upon the community structure of the data. This approach is applied to two well-studied microarray data sets from Homo sapiens and Saccharomyces cerevisiae. This method presents a systematic, data-based alternative to using more artificial cutoff values and results in a more conservative approach to threshold selection than some other popular techniques such as retaining only statistically-significant relationships or setting a cutoff to include a percentage of the highest correlations.
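The generic construction step mentioned at the start of this abstract (compute a similarity, then keep only values above a cutoff) can be sketched as follows. This is an illustration of hard thresholding and of a "top fraction of correlations" cutoff only; it does not implement the spectral method the abstract proposes. The gene names, expression values, and function names are all hypothetical toy inputs.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    denom = sqrt(sxx * syy)
    return sxy / denom if denom else 0.0  # constant profile: treat as uncorrelated


def coexpression_edges(profiles, threshold):
    """Keep gene pairs whose absolute correlation is at least `threshold`."""
    edges = {}
    for g, h in combinations(sorted(profiles), 2):
        r = pearson(profiles[g], profiles[h])
        if abs(r) >= threshold:
            edges[(g, h)] = r
    return edges


def top_fraction_threshold(profiles, fraction=0.01):
    """Smallest |r| among the top `fraction` of all gene pairs."""
    values = sorted(
        (abs(pearson(profiles[g], profiles[h]))
         for g, h in combinations(sorted(profiles), 2)),
        reverse=True,
    )
    keep = max(1, int(len(values) * fraction))
    return values[keep - 1]


# Hypothetical toy data: four genes measured at five time points.
profiles = {
    "geneA": [1, 2, 3, 4, 5],
    "geneB": [2, 4, 6, 8, 10],
    "geneC": [5, 3, 4, 1, 2],
    "geneD": [2, 1, 4, 3, 5],
}

if __name__ == "__main__":
    for t in (0.5, 0.75, 0.9):
        print(t, sorted(coexpression_edges(profiles, t)))
```

Raising the cutoff removes edges, so any downstream network statistic depends directly on the chosen value; that dependence is the motivation for a systematic selection method.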
Article
We sought to create a comprehensive catalog of yeast genes whose transcript levels vary periodically within the cell cycle. To this end, we used DNA microarrays and samples from yeast cultures synchronized by three independent methods: alpha factor arrest, elutriation, and arrest of a cdc15 temperature-sensitive mutant. Using periodicity and correlation algorithms, we identified 800 genes that meet an objective minimum criterion for cell cycle regulation. In separate experiments, designed to examine the effects of inducing either the G1 cyclin Cln3p or the B-type cyclin Clb2p, we found that the mRNA levels of more than half of these 800 genes respond to one or both of these cyclins. Furthermore, we analyzed our set of cell cycle-regulated genes for known and new promoter elements and show that several known elements (or variations thereof) contain information predictive of cell cycle regulation. A full description and complete data sets are available at http://cellcycle-www.stanford.edu
Article
Many complex systems in nature and society can be described in terms of networks capturing the intricate web of connections among the units they are made of [1, 2, 3, 4]. A key question is how to interpret the global organization of such networks as the coexistence of their structural subunits (communities) associated with more highly interconnected parts. Identifying these a priori unknown building blocks (such as functionally related proteins [5, 6], industrial sectors [7] and groups of people [8, 9]) is crucial to the understanding of the structural and functional properties of networks. The existing deterministic methods used for large networks find separated communities, whereas most of the actual networks are made of highly overlapping cohesive groups of nodes. Here we introduce an approach to analysing the main statistical features of the interwoven sets of overlapping communities that makes a step towards uncovering the modular structure of complex systems. After defining a set of new characteristic quantities for the statistics of communities, we apply an efficient technique for exploring overlapping communities on a large scale. We find that overlaps are significant, and the distributions we introduce reveal universal features of networks. Our studies of collaboration, word-association and protein interaction graphs show that the web of communities has non-trivial correlations and specific scaling properties.
Conference Paper
Separating the connected components of a graph that has disconnected components is mostly done with breadth-first search (BFS) or depth-first search (DFS) graph algorithms. Here we propose a new algebraic method to separate disconnected and nearly-disconnected components. This method is based on spectral graph partitioning, following a key observation that disconnected components will show up, after being properly sorted, as a step-function-like curve in the lowest eigenvectors of the Laplacian matrix of the graph. Following a perturbative analysis framework, we systematically analyzed the graph structures, first for the disconnected subgraph case, and second for the effects of adding edges sparsely connecting different subgraphs as a perturbation. Several new results are derived, providing insights into spectral methods and related clustering objective functions. Examples are given illustrating the concepts and results of our methods. Compared to the standard graph algorithms, this method has the same O(|E| + |V| log(|V|)) complexity, but is easier to implement (using readily available eigensolvers). Furthermore, the method can easily identify articulation points and bridges on nearly-disconnected graphs. Segmentation of a real example of a Web graph for the query "amazon" is given. We found that each disconnected or nearly-disconnected component forms a cluster on a clear topic.
Article
Sparked by Efron's seminal paper, the decade of the 1980s was a period of active research on bootstrap methods for independent data--mainly i.i.d. or regression set-ups. By contrast, in the 1990s much research was directed towards resampling dependent data, for example, time series and random fields. Consequently, the availability of valid nonparametric inference procedures based on resampling and/or subsampling has freed practitioners from the necessity of resorting to simplifying assumptions such as normality or linearity that may be misleading.
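As an illustration of resampling dependent data, the sketch below shows one common scheme, the moving block bootstrap, in which contiguous blocks of a time series are drawn with replacement so that short-range dependence inside each block is preserved. It is an assumption that this variant is representative; the abstract does not specify a particular procedure, and the function name and parameters are hypothetical.

```python
import random


def moving_block_bootstrap(series, block_length, rng=None):
    """Return one bootstrap resample built from overlapping contiguous blocks.

    series: sequence of observations in time order.
    block_length: number of consecutive observations kept together.
    rng: optional random.Random instance, for reproducibility.
    """
    rng = rng or random.Random()
    n = len(series)
    if not 1 <= block_length <= n:
        raise ValueError("block_length must be between 1 and len(series)")
    # Every admissible block start; blocks may overlap each other.
    starts = range(n - block_length + 1)
    resample = []
    while len(resample) < n:
        s = rng.choice(starts)
        resample.extend(series[s:s + block_length])
    return resample[:n]  # trim the last block so the length matches the input


if __name__ == "__main__":
    series = list(range(12))  # stand-in for 12 ordered time points
    print(moving_block_bootstrap(series, 3, random.Random(0)))
```

With `block_length=1` this reduces to the ordinary i.i.d. bootstrap; longer blocks retain more of the serial dependence at the cost of fewer distinct blocks to draw from.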