
Comparison of threshold selection methods for microarray gene co-expression matrices

December 2009
BMC Research Notes 2(1):240


DOI:10.1186/1756-0500-2-240

Source
PubMed

License
CC BY 2.0

Authors:

Elissa J Chesler

The Jackson Laboratory

Michael A. Langston

University of Tennessee

Arnold Saxton

University of Tennessee




BioMed Central


BMC Research Notes

Open Access

Short Report


Bhavesh R Borate (1), Elissa J Chesler (3), Michael A Langston (2), Arnold M Saxton* (4) and Brynn H Voy (3)

Address: (1) Genome Science and Technology Program, University of Tennessee, Knoxville, Tennessee, USA; (2) Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, Tennessee, USA; (3) Oak Ridge National Laboratory, Systems Genetics Group, Biosciences Division, Oak Ridge, Tennessee, USA; (4) Department of Animal Science, University of Tennessee, Knoxville, Tennessee, USA

Email: Bhavesh R Borate - boratebr@mail.nih.gov; Elissa J Chesler - elissa.chesler@jax.org; Michael A Langston - langston@eecs.utk.edu; Arnold M Saxton* - asaxton@utk.edu; Brynn H Voy - bhvoy@utk.edu

* Corresponding author

Abstract

Background: Network and clustering analyses of microarray co-expression correlation data often require application of a threshold to discard small correlations, thus reducing computational demands and decreasing the number of uninformative correlations. This study investigated threshold selection in the context of combinatorial network analysis of transcriptome data.

Findings: Six conceptually diverse methods - based on number of maximal cliques, correlation of control spots with expressed genes, top 1% of correlations, spectral graph clustering, Bonferroni correction of p-values, and statistical power - were used to estimate a correlation threshold for three time-series microarray datasets. The validity of thresholds was tested by comparison to thresholds derived from Gene Ontology information. Stability and reliability of the best methods were evaluated with block bootstrapping.

Two threshold methods, number of maximal cliques and spectral graph, used information in the correlation matrix structure and performed well in terms of stability. Comparison to Gene Ontology found thresholds from number of maximal cliques extracted from a co-expression matrix were the most biologically valid. Approaches to improve both methods were suggested.

Conclusion: Threshold selection approaches based on network structure of gene relationships gave thresholds with greater relevance to curated biological relationships than approaches based on statistical pair-wise relationships.

Published: 2 December 2009

Received: 27 August 2009

Accepted: 2 December 2009

BMC Research Notes 2009, 2:240 doi:10.1186/1756-0500-2-240

This article is available from: http://www.biomedcentral.com/1756-0500/2/240

This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Introduction

To extract gene networks from microarray data, correlations are often used as a measure of gene co-expression. A typical microarray with 20,000 gene probes will produce 200 million correlations. Correlations below a threshold value, closer to zero, will be less meaningful. Hard and soft threshold approaches have been applied to biological data. Hard thresholds discard gene pairs with correlation below the threshold, while soft thresholds use the correlation value to weight gene network relationships. Zhang and Horvath [1] concluded that soft thresholds based on aggregate, modular relationships between genes gave more robust results, but data reduction by a hard threshold is often essential for computational tractability of graph algorithms.
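The arithmetic and the hard/soft distinction in this paragraph can be illustrated with a minimal Python sketch. The function names are hypothetical, the use of the absolute correlation is an assumption, and the power form of the soft threshold is the one commonly associated with Zhang and Horvath [1]; the exponent is left as a free parameter.

```python
from math import comb


def n_pairs(n_genes: int) -> int:
    """Number of distinct gene pairs (one correlation per pair)."""
    return comb(n_genes, 2)


def hard_threshold(r: float, tau: float) -> int:
    """Hard threshold: keep the pair (1) only if |r| reaches tau, else discard it (0)."""
    return 1 if abs(r) >= tau else 0


def soft_threshold(r: float, beta: float) -> float:
    """Soft threshold: keep every pair but weight it by |r| raised to the power beta."""
    return abs(r) ** beta


print(n_pairs(20_000))  # 199990000, i.e. roughly 200 million
```

A hard threshold shrinks the graph that later algorithms must process, whereas a soft threshold keeps all pairs and only down-weights the weak ones.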

We focus on relevance networks, created by applying a hard threshold to the gene expression correlation matrix [2], then extracting gene networks. The resulting networks have been well documented in recent literature to yield sets of co-expressed genes [3-5]. Relevance networks are easily converted to graphs, with genes as vertices that are connected by an edge only if their correlation is above the threshold. A clique is a sub-graph in which all nodes are connected to each other [6]. A disadvantage of using cliques is the computational requirement, which grows exponentially with the number of genes. Thus hard threshold selection is required when performing clique extraction on microarray data.
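As an illustration of the graph construction described above, the following sketch builds a thresholded graph from a small, made-up correlation matrix and lists its maximal cliques with the basic Bron-Kerbosch recursion. This is not the maximal clique enumeration code used in the study; the function names, the toy matrix and the use of "greater than or equal to" at the threshold are assumptions.

```python
def threshold_graph(corr, tau):
    """Adjacency sets: genes i and j share an edge if corr[i][j] >= tau."""
    n = len(corr)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if corr[i][j] >= tau:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion (no pivoting)."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates that extend it, x: already explored vertices
        if not p and not x:
            cliques.append(sorted(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return sorted(cliques)


# Toy 4-gene correlation matrix (illustrative values only).
corr = [
    [1.00, 0.90, 0.85, 0.10],
    [0.90, 1.00, 0.95, 0.20],
    [0.85, 0.95, 1.00, 0.88],
    [0.10, 0.20, 0.88, 1.00],
]
print(maximal_cliques(threshold_graph(corr, 0.80)))  # [[0, 1, 2], [2, 3]]
```

The recursion visits an exponential number of candidate sets in the worst case, which is why a hard threshold that sparsifies the graph matters in practice.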

Current approaches to threshold selection are typically statistically based, and do not fully reflect the connectivity of the data [7]. Methods based on statistical arguments may not necessarily yield biologically significant relationships [3,8].

Some studies used an arbitrary threshold correlation such as 0.80 [9]. Moriyama et al. [10] obtained random correlation distributions for gene pairs by permuting their expression values and defended their choice of threshold based on statistical significance. Lee et al. [11] used the top 1% of correlations (absolute value) to build a co-expression network. Voy et al. [3] used distribution of correlations of genes with buffer spots on the arrays to select a threshold correlation value of 0.875.
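A permutation-based null distribution of the kind attributed to Moriyama et al. [10] can be sketched as follows. The exact procedure of that study is not described here, so this is a generic scheme under stated assumptions: one gene's expression values are shuffled across arrays, the absolute Pearson correlation is recorded, and a high quantile of the resulting null is taken as the threshold. All names, the quantile and the example data are hypothetical.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def permutation_threshold(x, y, n_perm=1000, quantile=0.99, seed=0):
    """Quantile of |r| under a null built by shuffling y against x."""
    rng = random.Random(seed)
    y_perm = list(y)
    null = []
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        null.append(abs(pearson(x, y_perm)))
    null.sort()
    return null[min(int(quantile * n_perm), n_perm - 1)]


# Two made-up expression profiles over 12 arrays.
x = [float(i) for i in range(12)]
y = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0, 10.0, 9.0, 12.0, 11.0]
print(permutation_threshold(x, y))
```

Because the null depends only on the number of arrays and the value distributions, such a threshold reflects statistical chance rather than graph structure.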

However, using connectivity of the data to derive thresholds has been suggested. Langston et al. [12] recommended use of ontological distance, statistical significance and various graph structural attributes to arrive at a correlation threshold. Palla et al. [13] found that a threshold based on clique size was effective at separating networks.

Here two threshold selection methods based on correlation graph structure are compared with common statistically based methods. The graph based methods used spectral properties [14] or number of cliques to select a threshold. Objectives were to compare the various hard threshold methods for validity (retention of biological information), stability, and reliability.

Methods

Datasets

Three yeast S. cerevisiae time-series datasets were chosen for this study: 31 arrays for Anoxia state [15], 21 arrays for Reoxygenation state [15] and 18 arrays from yeast cultures synchronized using Alpha-factor arrest [16]. Data are available on Gene Expression Omnibus under GSE2246, GSE2267 and GSE22. Extensive GO annotation for S. cerevisiae genes influenced the selection. Exploratory data analyses within each dataset using PCA, box plots and pair-wise correlations between arrays found no outlier arrays. Quantile plots showed data were normally distributed, and distribution of correlations among gene expression profiles had the expected bell-shaped curve, so all data were used.

Software

Software written by Langston and colleagues (University of Tennessee) was used, including Datagen version 1.4a for computing correlations, maximal clique enumeration code version 2.0.1 [17], spectral analysis code [14], and GO Pairwise Similarity analysis code version 1.0. Matrix calculations for spectral graph analysis were carried out in MATLAB 7.0. P-values were calculated in SAS version 9.1 (Cary, NC). Statistical power was calculated using PASS statistical software (http://www.ncss.com/pass.html).

Threshold Estimation

Six conceptually different approaches were evaluated:

1) Numbers of maximal cliques were calculated at each potential correlation threshold, starting at r = 0.99. As the threshold was lowered in steps of 0.01, the number of maximal cliques increased because of greater connections among genes. When the clique number increased to two times (Maximal Clique-2) or three times (Maximal Clique-3) the previous value, that correlation was chosen as the threshold.

2) For each potential threshold correlation value, spectral graph theory [18] was used to decompose the resulting graph into eigenvalues and eigenvectors, which were used to enumerate spectral clusters [19]. As the potential threshold was incrementally lowered in steps of 0.01, a peak in the number of clusters occurred, and the threshold was chosen to maximize cluster number. Details are in [14].

3) Correlations of control spots with all other genes on the array were calculated, creating a null distribution. The 99th percentile correlation value (absolute value) of this distribution gave the threshold.

4) The top 1% of all correlations (absolute value) among genes was used to estimate a threshold [11]. Correlations were ranked, and the correlation at the 99th percentile was the threshold estimate. Note that the control spot method uses a different subset of correlations (only with control spots), whereas this method uses all correlations among genes.


5) A p-value for every correlation was computed, testing whether the correlation was zero (Fisher's z-transformation). The threshold estimate was the correlation value corresponding to the critical Bonferroni p-value, 0.05/number of correlations. This threshold removes any correlations that are not statistically different from zero.

6) Statistical power calculations were used to find the correlation value that gave an 80% chance of rejecting the null hypothesis, H0: correlation = 0. Type I error rate in these calculations was Bonferroni-adjusted to correct for multiple testing.
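The selection rule of the first approach in the list above (Maximal Clique-2 and Maximal Clique-3) can be sketched in Python, assuming the number of maximal cliques at each candidate threshold has already been computed. The reading of the rule (the first threshold, scanning downward, at which the count is at least a given multiple of the count one step higher), the function name and the example counts are assumptions for illustration.

```python
def clique_ratio_threshold(counts, factor=2.0):
    """Return the first threshold, scanning from high to low, whose clique count
    is at least `factor` times the count at the previous (higher) threshold.

    counts: dict mapping threshold -> number of maximal cliques (precomputed).
    Returns None if no step reaches the required ratio.
    """
    previous = None
    for tau in sorted(counts, reverse=True):
        current = counts[tau]
        if previous is not None and previous > 0 and current >= factor * previous:
            return tau
        previous = current
    return None


# Hypothetical clique counts for thresholds 0.99 down to 0.95.
example_counts = {0.99: 3, 0.98: 4, 0.97: 5, 0.96: 11, 0.95: 40}
print(clique_ratio_threshold(example_counts, factor=2.0))  # 0.96
print(clique_ratio_threshold(example_counts, factor=3.0))  # 0.95
```

A larger factor waits for a sharper jump in clique number and therefore tends to select a lower threshold, consistent with Maximal Clique-3 sitting below Maximal Clique-2 in Table 1.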

Further details on computing these threshold estimation methods are in Additional file 1.
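The pair-wise statistical approaches (methods 3 to 6) can be approximated with the Python standard library. This is a sketch under stated assumptions, not the SAS or PASS calculations used in the study: the Bonferroni threshold uses a two-sided test on Fisher's z with standard error 1/sqrt(n - 3), and the power threshold uses the usual normal approximation. Function names and the gene count in the example are hypothetical.

```python
from math import comb, sqrt, tanh
from statistics import NormalDist


def percentile_threshold(correlations, pct=0.99):
    """|r| at a given percentile; methods 3 and 4 differ only in which
    correlations are supplied (control spots only, or all gene pairs)."""
    ranked = sorted(abs(r) for r in correlations)
    return ranked[min(int(pct * len(ranked)), len(ranked) - 1)]


def bonferroni_threshold(n_arrays, n_genes, alpha=0.05):
    """Smallest |r| significant at alpha / (number of correlations)."""
    m = comb(n_genes, 2)
    z_crit = NormalDist().inv_cdf(1 - alpha / (2 * m))  # two-sided critical value
    return tanh(z_crit / sqrt(n_arrays - 3))  # back-transform from Fisher z


def power_threshold(n_arrays, n_genes, alpha=0.05, power=0.80):
    """|r| detected with the requested power at the Bonferroni-adjusted alpha."""
    m = comb(n_genes, 2)
    z_alpha = NormalDist().inv_cdf(1 - alpha / (2 * m))
    z_beta = NormalDist().inv_cdf(power)
    return tanh((z_alpha + z_beta) / sqrt(n_arrays - 3))


# 31 arrays as in the Anoxia dataset; 6000 genes is an illustrative figure.
print(bonferroni_threshold(31, 6000), power_threshold(31, 6000))
```

Both statistical thresholds depend only on the number of arrays and the number of tests, not on the structure of the correlation matrix itself.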

Performance Evaluation

Performance of the threshold estimation methods was evaluated by comparison to a biologically based Gene Ontology threshold. The GO data file used was gene_ontology_edit.obo.2008-05-01.gz. The biological meaning for each correlation bin (in 0.01 increments) was the average of functional similarity scores for all gene pairs within that correlation bin. Functional similarity for a pair of genes was defined as log(n/N)/log(2/N), where n is the number of genes in the lowest GO category that contained both genes, and N is the total number of genes annotated for the organism. The formula normalizes functional similarity to a 0 to 1 range, and a value of 1 means the GO category contained only the two genes being considered (perfect similarity). The GO threshold estimate was defined as the correlation at which the change in average functional similarity exceeded the median change plus half its standard deviation, thus identifying where biological information begins to accumulate.
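The functional similarity score and the GO threshold rule can be written out as a short Python sketch. The interpretation of "change" as the difference between adjacent 0.01 bins, the use of the sample standard deviation of those changes, the function names and the example scores are all assumptions.

```python
from math import log
from statistics import median, stdev


def functional_similarity(n, N):
    """log(n/N) / log(2/N): equals 1 when the lowest shared GO category holds
    only the two genes, and 0 when it holds every annotated gene."""
    return log(n / N) / log(2 / N)


def go_threshold(bins, mean_scores):
    """First correlation bin whose increase in average similarity over the
    previous bin exceeds median(change) + 0.5 * SD(change).

    bins: ascending correlation values; mean_scores: average score per bin.
    """
    changes = [b - a for a, b in zip(mean_scores, mean_scores[1:])]
    cutoff = median(changes) + 0.5 * stdev(changes)
    for tau, change in zip(bins[1:], changes):
        if change > cutoff:
            return tau
    return None


# Made-up average scores for six correlation bins.
bins = [0.80, 0.81, 0.82, 0.83, 0.84, 0.85]
scores = [0.10, 0.10, 0.11, 0.11, 0.20, 0.35]
print(go_threshold(bins, scores))  # 0.84
```

In this toy series the score is nearly flat up to 0.83 and jumps at 0.84, which is the bin the rule reports.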

To study stability of the methods, 10,000 block bootstrap samples were created by sampling arrays with replacement from each block. Blocks were defined to be 2 or 3 adjacent time periods, such that each block contained 3 or 4 arrays. Block bootstrapping was necessary to preserve as much as possible the time-course dependency structure of the experiments [20]. For each of the 10,000 samples, a threshold estimate was calculated by each method, and the distribution of these thresholds was used to compare threshold methods for stability.
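The resampling scheme can be sketched as follows, assuming arrays are identified by index and already grouped into blocks of adjacent time periods. The `estimate` callable stands in for any of the threshold methods and is hypothetical, as are the function names and the example blocks.

```python
import random


def block_bootstrap(blocks, rng):
    """Resample arrays with replacement inside each block, keeping block order
    so that the coarse time-course structure is preserved."""
    sample = []
    for block in blocks:
        sample.extend(rng.choice(block) for _ in block)
    return sample


def bootstrap_thresholds(blocks, estimate, n_boot=10_000, seed=0):
    """Apply a threshold estimator to each bootstrap sample of array indices."""
    rng = random.Random(seed)
    return [estimate(block_bootstrap(blocks, rng)) for _ in range(n_boot)]


# Two illustrative blocks holding 3 and 4 arrays.
blocks = [[0, 1, 2], [3, 4, 5, 6]]
print(block_bootstrap(blocks, random.Random(1)))
```

The mean and standard deviation of the values returned by `bootstrap_thresholds` correspond to the bootstrap summaries used to compare methods for stability.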

Results

Functional similarity scores for the three datasets are displayed in Figure 1. Changes in scores across correlation values were similar for all datasets, and the lack of GO term relationship for negative correlations is striking. Because of this, the GO threshold was defined by the curve for positive correlations. Biological relationship begins to increase sharply above a correlation value of 0.80, and this produced the GO thresholds in Table 1.

Estimated thresholds obtained by each method are listed in Table 1 for the three datasets. If estimated threshold is higher than the biological threshold, false negatives will occur, because data reduction by the higher threshold will remove real relationships. Conversely, using a threshold below the biological threshold will create false positives, and relationships that are not real would be included in the network. In discovery-based settings, false positives are more acceptable, as they can be removed with further validation. Thus methods that estimate a lower threshold are preferred. Maximal Clique-2 and Spectral Clustering performed better than the other methods, based on summed absolute deviations from GO threshold (Table 1). Maximal Clique-2 was further from the GO threshold, but might be preferred since it never exceeded that threshold.
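The ranking criterion used here, the sum of absolute deviations from the GO threshold, can be recomputed from the values reported in Table 1. The Python below is a worked check of that arithmetic; the variable and function names are illustrative.

```python
GO = (0.97, 0.92, 0.85)  # GO thresholds: Anoxia, Reoxygenation, Alpha

TABLE1 = {
    "Spectral Clustering": (0.93, 0.97, 0.89),
    "Maximal Clique-2": (0.90, 0.91, 0.74),
    "Power": (0.88, 0.94, 0.96),
    "Bonferroni adjustment": (0.85, 0.93, 0.95),
    "Control-Spot": (0.93, 0.83, 0.70),
    "Maximal Clique-3": (0.87, 0.89, 0.60),
    "Top 1 Percent": (0.81, 0.81, 0.72),
}


def summed_abs_deviation(estimates, reference=GO):
    """Sum of |estimate - GO threshold| over the three datasets."""
    return round(sum(abs(e - g) for e, g in zip(estimates, reference)), 2)


def exceeds_go(estimates, reference=GO):
    """True if any dataset's estimate lies above its GO threshold."""
    return any(e > g for e, g in zip(estimates, reference))


for method in sorted(TABLE1, key=lambda m: summed_abs_deviation(TABLE1[m])):
    print(method, summed_abs_deviation(TABLE1[method]), exceeds_go(TABLE1[method]))
```

Sorting by the summed deviation reproduces the row order of Table 1, and the second flag shows which methods would produce false negatives on at least one dataset.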

Figure 1. Change in GO functional similarity score across correlation values. Lines represent Anoxia dataset (solid line), Reoxygenation dataset (dashed line) and Alpha dataset (dotted line).

The estimated threshold derived for selected methods for each dataset is compared to bootstrap distributions in Table 2. The best methods from above, Maximal Clique-2 and Spectral Clustering, and two other methods for comparative purposes were chosen for this analysis. The bootstrap mean was never less than the estimated threshold, and occasionally was more than two standard deviations above. This upward bias in correlation is expected, as each time period had a limited number of arrays, making it likely that the identical array would be resampled. However, Maximal Clique and Spectral Clustering methods showed more resistance to this bias. The bootstrap standard deviation measures ability of the methods to produce similar threshold estimates from randomized arrays. Again the network-based methods showed the lowest standard deviations, and highest stability. All methods showed poorest performance with the Alpha dataset, possibly due to its unreplicated design. This makes it less likely that all time levels would be represented in the bootstrap samples, whereas the other datasets had glucose and galactose biological replicates.
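The Difference column of Table 2 (estimated threshold minus bootstrap mean) and the statement that the bootstrap mean was never below the estimate can be checked directly from the reported values. The Python below is a worked check with illustrative names; the entries are the rounded figures printed in Table 2.

```python
TABLE2 = [
    # (method, dataset, estimated threshold, bootstrap mean)
    ("Maximal Clique-2", "Anoxia", 0.90, 0.91),
    ("Maximal Clique-2", "Reoxy", 0.91, 0.93),
    ("Maximal Clique-2", "Alpha", 0.74, 0.78),
    ("Spectral Clustering", "Anoxia", 0.93, 0.95),
    ("Spectral Clustering", "Reoxy", 0.97, 0.97),
    ("Spectral Clustering", "Alpha", 0.89, 0.95),
    ("Top 1 Percent", "Anoxia", 0.81, 0.83),
    ("Top 1 Percent", "Reoxy", 0.81, 0.84),
    ("Top 1 Percent", "Alpha", 0.72, 0.79),
    ("Control Spot", "Anoxia", 0.93, 0.95),
    ("Control Spot", "Reoxy", 0.83, 0.90),
    ("Control Spot", "Alpha", 0.70, 0.82),
]


def difference(estimate, boot_mean):
    """Estimated threshold minus bootstrap mean, rounded as in Table 2."""
    return round(estimate - boot_mean, 2)


differences = [difference(est, mean) for _, _, est, mean in TABLE2]
print(differences)
```

Every difference is zero or negative, which is the upward shift of the bootstrap mean described in the text; the largest shifts occur for the Alpha dataset.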

Table 1: Estimated threshold for each method by dataset, with methods sorted by the sum of absolute deviations from the GO functional similarity threshold.

Method                     Anoxia   Reoxygenation   Alpha    Absolute deviations from GO threshold
GO Functional Similarity   0.97     0.92            0.85
Spectral Clustering        0.93     0.97a           0.89a    0.04+0.05+0.04 = 0.13
Maximal Clique-2           0.90     0.91            0.74     0.07+0.01+0.11 = 0.19
Power                      0.88     0.94a           0.96a    0.09+0.02+0.11 = 0.22
Bonferroni adjustment      0.85     0.93a           0.95a    0.12+0.01+0.10 = 0.23
Control-Spot               0.93     0.83            0.70     0.04+0.09+0.15 = 0.28
Maximal Clique-3           0.87     0.89            0.60     0.10+0.03+0.25 = 0.38
Top 1 Percent              0.81     0.81            0.72     0.16+0.11+0.13 = 0.40

a Thresholds above the GO functional similarity threshold (set in bold in the published table).

Table 2: Summary of bootstrap results compares the estimated threshold with the bootstrap distribution for the four selected methods.

Method                Dataset   Estimated Threshold   Bootstrap Mean   Difference(a)   Bootstrap Standard Deviation
Maximal Clique-2      Anoxia    0.90                  0.91             -0.01           0.015
                      Reoxy     0.91                  0.93             -0.02           0.009
                      Alpha     0.74                  0.78             -0.04           0.057
Spectral Clustering   Anoxia    0.93                  0.95             -0.02           0.012
                      Reoxy     0.97                  0.97             0.00            0.011
                      Alpha     0.89                  **0.95           -0.06           0.017
Top 1 Percent         Anoxia    0.81                  0.83             -0.02           0.011
                      Reoxy     0.81                  0.84             -0.03           0.016
                      Alpha     0.72                  **0.79           -0.07           0.027
Control Spot          Anoxia    0.93                  0.95             -0.02           0.015
                      Reoxy     0.83                  **0.90           -0.07           0.034
                      Alpha     0.70                  **0.82           -0.12           0.043

(a) Estimated threshold minus bootstrap mean.

** Estimated threshold is more than 2 std. deviations from bootstrap mean.

Discussion

The two network-based methods, Maximal Clique-2 and Spectral Clustering, performed very well in terms of bootstrap stability and biological validity. Though the Maximal Clique-2 method gave thresholds close to the biological threshold, and always below, the method had slightly higher bootstrap standard deviations. The robustness of the Maximal Clique-2 algorithm could be enhanced by exclusion of smaller cliques in the graph, for example cliques of size 3. Spectral Clustering thresholds were on average closer to biological thresholds, but too often exceeded them. However, if all thresholds for Spectral Clustering were lowered by 0.05, it would have been clearly the best method. Further fine-tuning of the parameters in the algorithm (size of sliding window, different tolerance levels for cluster formation) may improve the method's validity. In a recent paper, Almendral and Díaz-Guilera [21] documented the sensitivity of the non-zero eigenvalue to network changes. All methods had subjective settings, and further work on many more species and experiments would be needed to establish best choices.
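The remark that Spectral Clustering would have been the best method had all of its thresholds been 0.05 lower can be checked against the Table 1 values. The Python below is a worked check with illustrative names, using only the thresholds reported in Table 1.

```python
GO = (0.97, 0.92, 0.85)        # GO thresholds: Anoxia, Reoxygenation, Alpha
SPECTRAL = (0.93, 0.97, 0.89)  # Spectral Clustering thresholds from Table 1


def shift(thresholds, delta):
    """Lower every threshold by delta, rounded to two decimals."""
    return tuple(round(t - delta, 2) for t in thresholds)


def summed_abs_deviation(estimates, reference=GO):
    """Sum of |estimate - GO threshold| over the three datasets."""
    return round(sum(abs(e - g) for e, g in zip(estimates, reference)), 2)


shifted = shift(SPECTRAL, 0.05)
print(shifted)                        # (0.88, 0.92, 0.84)
print(summed_abs_deviation(shifted))  # 0.1
```

After the shift the summed deviation falls from 0.13 to 0.10, and no shifted threshold lies above its GO threshold, which removes the method's only drawback relative to Maximal Clique-2 in Table 1.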

The results from this study complement the work of Zhang and Horvath [1] which concluded that thresholds based on the scale-free topology - the formation of hubs and densely-connected sub-graphs - produced more robust results. The statistically-based methods studied here are directly dependent on the correlation distribution and thus were unable to capture biological relationships. Although the Control-Spot method is based on logical reasoning, the high correlation of control spots with other genes on the arrays weakened the method's validity. The Top 1% Correlations method is arbitrary, and failed to capture biological relationships. Statistical considerations used for the Power and Bonferroni methods were also not able to identify biological relationships, reflecting the well-known discrepancy between biological and statistical significance. Experiments that are small will produce thresholds that are too high, while large experiments will give excessively low thresholds, even though the biological relationships are the same.
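The dependence of a significance-based threshold on experiment size can be illustrated numerically. The sketch below assumes a two-sided Fisher z test with Bonferroni adjustment and holds the number of gene pairs fixed at a made-up value; it is not the SAS calculation used in the study, and the function name is hypothetical.

```python
from math import sqrt, tanh
from statistics import NormalDist


def critical_r(n_arrays, n_tests, alpha=0.05):
    """Smallest |r| declared non-zero by a two-sided Fisher z test
    at the Bonferroni-adjusted level alpha / n_tests."""
    z_crit = NormalDist().inv_cdf(1 - alpha / (2 * n_tests))
    return tanh(z_crit / sqrt(n_arrays - 3))


N_TESTS = 18_000_000  # hypothetical number of gene pairs, held fixed
for n_arrays in (10, 18, 31, 100):
    print(n_arrays, round(critical_r(n_arrays, N_TESTS), 2))
```

The critical correlation falls steadily as arrays are added, so the same biological relationship can pass or fail the threshold depending only on how many arrays were run.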

The GO similarity measure of biological validity we have used, however, is by no means perfect and is just one way of quantifying biological information. Khatri and Draghici [22] have listed limitations of GO in detail. We also found low GO scores at high negative correlations as compared to the high GO score associated with high positive correlations for all three datasets. The drop in GO score at high negative correlations could be due to several reasons, for example experimental and analytical limitations to detect biologically negative correlations among genes, and limited gene annotations [11]. As the quantification of biological information in data gets more precise, the selection of thresholds should become easier. In fact, note that a method like the GO threshold used here would be a logical choice if GO information were complete and accurate.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

BRB wrote code for the analyses, summarized results, and drafted the paper. All authors were involved in study design, and read and approved the final manuscript.

Additional material

Additional file 1: Methodology for Threshold Estimation. Details on the six threshold estimation methods are presented in a computationally oriented manner. [http://www.biomedcentral.com/content/supplementary/1756-0500-2-240-S1.PDF]

Acknowledgements

This research has been supported in part by the National Institutes of Health under grants P01DA015027-01, R01HD052472-02, R01MH074460-01, U01AA013512 and U01AA013641-04 and by the UT-ORNL Science Alliance. Dr. E.J. Chesler was supported by NIAAA Integrative Neuroscience Initiative on Alcoholism under grants U01AA13499 and U24AA13513. This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. Additional support was provided by the University of Tennessee Genome Science and Technology program. John Eblen, Andy Perkins, Gary Rogers and Yun Zhang helped with basic issues of algorithm synthesis. Drs. Bing Zhang and Roumyana Yordanova provided valuable comments on certain aspects of this study.

References

1. Zhang B, Horvath S: A general framework for weighted gene co-expression network analysis. Stat Appl Genet Mol Biol 2005, 4:.

2. Butte AJ, Tamayo P, Slonim D, Golub TR, Kohane IS: Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks. Proc Natl Acad Sci USA 2000, 97(22):12182-12186.

3. Voy BH, Scharff JA, Perkins AD, Saxton AM, Borate B, Chesler EJ, Branstetter LK, Langston MA: Extracting gene networks for low-dose radiation using graph theoretical algorithms. PLoS Comput Biol 2006, 2(7):e89.

4. Yan X, Mehan MR, Huang Y, Waterman MS, Yu PS, Zhou XJ: A graph-based approach to systematically reconstruct human transcriptional regulatory modules. Bioinformatics 2007, 23(13):i577-586.

5. Freeman TC, Goldovsky L, Brosch M, van Dongen S, Maziere P, Grocock RJ, Freilich S, Thornton J, Enright AJ: Construction, visualisation, and clustering of transcription networks from microarray expression data. PLoS Comput Biol 2007, 3(10):2032-2042.

6. Baldwin NE, Chesler EJ, Kirov S, Langston MA, Snoddy JR, Williams RW, Zhang B: Computational, integrative, and comparative methods for the elucidation of genetic coexpression networks. J Biomed Biotechnol 2005, 2005(2):172-180.

7. Butte AJ, Kohane IS: Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. Pac Symp Biocomput 2000:418-429.

8. Quackenbush J: Genomics. Microarrays--guilt by association. Science 2003, 302(5643):240-241.

9. Sanoudou D, Haslett JN, Kho AT, Guo S, Gazda HT, Greenberg SA, Lidov HG, Kohane IS, Kunkel LM, Beggs AH: Expression profiling reveals altered satellite cell numbers and glycolytic enzyme transcription in nemaline myopathy muscle. Proc Natl Acad Sci USA 2003, 100(8):4666-4671.

10. Moriyama M, Hoshida Y, Otsuka M, Nishimura S, Kato N, Goto T, Taniguchi H, Shiratori Y, Seki N, Omata M: Relevance network between chemosensitivity and transcriptome in human hepatoma cells. Mol Cancer Ther 2003, 2(2):199-205.

11. Lee HK, Hsu AK, Sajdak J, Qin J, Pavlidis P: Coexpression analysis of human genes across many microarray data sets. Genome Res 2004, 14(6):1085-1094.

12. Langston MA, Perkins AD, Saxton AM, Scharff JA, Voy BH: Innovative Computational Methods For Transcriptomic Data Analysis: A Case Study in the Use Of FPT For Practical Algorithm Design and Implementation. ACM symposium on Applied Computing: 2006; Dijon, France 2006.

13. Palla G, Derenyi I, Farkas I, Vicsek T: Uncovering the overlapping community structure of complex networks in nature and society. Nature 2005, 435(7043):814-818.

14. Perkins AD, Langston MA: Threshold selection in gene co-expression networks using spectral graph theory techniques. BMC Bioinformatics 2009, 10(Suppl 11):S4.

15. Lai LC, Kosorukoff AL, Burke PV, Kwast KE: Metabolic-state-dependent remodeling of the transcriptome in response to anoxia and subsequent reoxygenation in Saccharomyces cerevisiae. Eukaryot Cell 2006, 5(9):1468-1489.

16. Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell 1998, 9(12):3273-3297.

17. Zhang Y, Abu-Khzam FN, Baldwin NE, Chesler EJ, Langston MA, Samatova NF: Genome-scale computational approaches to memory-intensive applications in systems biology. Supercomputing 2005 Proceedings of the ACM/IEEE SC Conference: 2005 2005:12.

18. Chung FRK: Spectral Graph Theory. American Mathematical Society; 1994.

19. Ding CHQ, He X, Zha H: A spectral method to separate disconnected and nearly disconnected Web graph components. Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Francisco, California 2001:275-280 [http://ranger.uta.edu/~chqding/papers/kdd3a.ps].

20. Politis DN: The impact of bootstrap methods on time series analysis. Statistical Science 2003, 18(2):219-230.

21. Almendral JA, Diaz-Guilera A: Dynamical and spectral properties of complex networks. New Journal of Physics 2007, 9:187.

22. Khatri P, Draghici S: Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics 2005, 21(18):3587-3595.

Additional file 1

Data

December 2009

Bhavesh R Borate · Elissa J Chesler · Michael A. Langston · Arnold Saxton · Brynn Voy

Multi-Omics Approaches and Resources for Systems-Level Gene Function Prediction in the Plant Kingdom

Article

Full-text available

Oct 2022

In higher plants, the complexity of a system and the components within and among species are rapidly dissected by omics technologies. Multi-omics datasets are integrated to infer and enable a comprehensive understanding of the life processes of organisms of interest. Further, growing open-source datasets coupled with the emergence of high-performance computing and development of computational tools for biological sciences have assisted in silico functional prediction of unknown genes, proteins and metabolites, otherwise known as uncharacterized. The systems biology approach includes data collection and filtration, system modelling, experimentation and the establishment of new hypotheses for experimental validation. Informatics technologies add meaningful sense to the output generated by complex bioinformatics algorithms, which are now freely available in a user-friendly graphical user interface. These resources accentuate gene function prediction at a relatively minimal cost and effort. Herein, we present a comprehensive view of relevant approaches available for system-level gene function prediction in the plant kingdom. Together, the most recent applications and sought-after principles for gene mining are discussed to benefit the plant research community. A realistic tabulation of plant genomic resources is included for a less laborious and accurate candidate gene discovery in basic plant research and improvement strategies.

A Comparative Study of Gene Co-Expression Thresholding Algorithms

Article

May 2024
J COMPUT BIOL

Inferring Independent Sets of Gaussian Variables after Thresholding Correlations

Article

Apr 2024

A Brief Study of Gene Co-expression Thresholding Algorithms

Chapter

Oct 2023

The thresholding problem is considered in the context of high-throughput biological data. Several approaches are reviewed, implemented, and tested over an assortment of transcriptomic data.

Predicted COVID-19 molecular effects on endometrium reveal key dysregulated genes and functions

Article

Oct 2022

COVID-19 exerts systemic effects that can compromise various organs and systems. Although retrospective and in-silico studies and prospective preliminary analysis have assessed the possibility of direct infection of the endometrium, there is a lack of in-depth and prospective studies on the impact of systemic disease on key endometrial genes and functions across the menstrual cycle and window of implantation. Gene expression data has been obtained from (i) healthy secretory endometrium collected from 42 women without endometrial pathologies and (ii) nasopharyngeal swabs from 231 women with COVID-19 and 30 negative controls. To predict how COVID-19-related gene expression changes impact key endometrial genes and functions, an in-silico model was developed by integrating the endometrial and COVID-19 datasets in an affected mid-secretory endometrium gene co-expression network. An endometrial validation set comprising 16 women (8 confirmed to have COVID-19 and 8 negative test controls) was prospectively collected to validate the expression of key genes. We predicted that five genes important for embryo implantation were affected by COVID-19 (downregulation of COBL, GPX3 and SOCS3, and upregulation of DOCK2 and SLC2A3). We experimentally validated these genes in COVID19 patients using endometrial biopsies during the secretory phase of the menstrual cycle. The results generally support the in-silico model predictions, suggesting that the transcriptomic landscape changes mediated by COVID-19 affect endometrial receptivity genes and key processes necessary for fertility, such as immune system function, protection against oxidative damage and development vital for embryo implantation and early development.

Gene Co-expression Network Analysis and Linking Modules to Phenotyping Response in Plants

Article

Jul 2022

Environmental factors, including different stresses, can have an impact on the expression of genes and subsequently the phenotype and development of plants. Since a large number of genes are involved in response to the perturbation of the environment, identifying groups of co-expressed genes is meaningful. The gene co-expression network models can be used for the exploration, interpretation, and identification of genes responding to environmental changes. Once a gene co-expression network is constructed, one can determine gene modules and the association of gene modules to the phenotypic response. To link modules to phenotype, one approach is to find the correlated eigengenes of given modules or to integrate all eigengenes in regularized linear model. This manuscript describes the method from construction of co-expression network, module discovery, association between modules and phenotypic data, and finally to annotation/visualization.

A Straightforward High-Throughput Aboveground Phenotyping Platform for Small- to Medium-Sized Plants

Chapter

Full-text available

Jul 2022

High-throughput phenotyping platforms for growth chamber and greenhouse-grown plants enable nondestructive, automated measurements of plant traits including shape, aboveground architecture, length, and biomass over time. However, to establish these platforms, many of these methods require expensive equipment or phenotyping expertise. Here we present a relatively inexpensive and simple phenotyping method for imaging hundreds of small- to medium-sized growth chamber or greenhouse-grown plants with a digital camera. Using this method, we image hundreds of tomato plants in 1 day.

A Novel High-Throughput Phenotyping Hydroponic System for Nitrogen Deficiency Studies in Arabidopsis thaliana

Article

Jul 2022

High-throughput phenotyping enables the temporal detection of subtle changes in plant plasticity and adaptation to different conditions, such as nitrogen deficiency, in an accurate, nondestructive, and unbiased way. Here, we describe a protocol to assess the contribution of nitrogen addition or deprivation using an image-based system to analyze plant phenotype. Thousands of images can be captured throughout the life cycle of Arabidopsis, and those images can be used to quantify parameters such as plant growth (area, caliper length, diameter, etc.), in planta chlorophyll fluorescence, and in planta relative water content.

An Automated High-Throughput Phenotyping System for Marchantia polymorpha

Article

Jul 2022

High-throughput phenotyping (HTP) allows automation of fast and precise acquisition and analysis of digital images for the detection of key traits in real time. HTP improves characterization of the growth and development of plants in controlled environments in a nondestructive fashion. Marchantia polymorpha has emerged as a very attractive model for studying the evolution of the physiological, cellular, molecular, and developmental adaptations that enabled plants to conquer their terrestrial environments. The availability of the M. polymorpha genome in combination with a full set of functional genomic tools including genetic transformation, homologous recombination, and genome editing has allowed the inspection of its genome through forward and reverse genetics approaches. The increasing number of mutants has made it possible to perform informative genome-wide analyses to study the phenotypic consequences of gene inactivation. Here we present an HTP protocol for M. polymorpha that will aid current efforts to quantify numerous morphological parameters that can potentially reveal genotype-to-phenotype relationships and relevant connections between individual traits.

High-Throughput Screening to Examine the Dynamic of Stay-Green by an Imaging System

Article

Jul 2022

The development of RGB (red, green, blue) sensors has opened the way for plant phenotyping. This is relevant because plant phenotyping allows us to visualize the product of the interaction between plant ontogeny, anatomy, physiology, and biochemistry. Better yet, this can be achieved at any stage of plant development, i.e., from seedling to maturity. Here, we describe the use of phenotyping, based on the stay-green trait, of the common bean (Phaseolus vulgaris L.) as a model plant stressed by water deficit, to elucidate the result of that interaction. The description is based on the interpretation of RGB digital images acquired using a phenomic platform and specific software. These images allow us to obtain a set of color parameters that quantify the changes and alterations in each plant's growth and development.

Innovative computational methods for transcriptomic data analysis: A case study in the use of FPT for practical algorithm design and implementation

Article

Full-text available

Jan 2001

Tools of molecular biology and the evolving tools of genomics can now be exploited to study the genetic regulatory mechanisms that control cellular responses to a wide variety of stimuli. These responses are highly complex, and involve many genes and gene products. The main objectives of this paper are to describe a novel research program centered on understanding these responses by (i) developing powerful graph algorithms that exploit the innovative principles of fixed parameter tractability in order to generate distilled gene sets; (ii) producing scalable, high performance parallel and distributed implementations of these algorithms utilizing cutting-edge computing platforms and auxiliary resources; (iii) employing these implementations to identify gene sets suggestive of co-regulation; and (iv) performing sequence analysis and genomic data mining to examine, winnow and highlight the most promising gene sets for more detailed investigation. As a case study, we describe our work aimed at elucidating genetic regulatory mechanisms that control cellular responses to low-dose ionizing radiation (IR). A low-dose exposure, as defined here, is an exposure of at most 10 cGy (rads). While the consequences of high doses of radiation are well known, the net outcome of low-dose exposures continues to be debated, with support in the literature for both detrimental and beneficial effects. We use genome-scale gene expression data collected in response to low-dose IR exposure in vivo to identify the pathways that are activated or repressed as a tissue responds to the radiation insult. The driving motivation is that knowledge of these pathways will help clarify and interpret physiological responses to IR, which will advance our understanding of the health consequences of low-dose radiation exposures.

Genome-Scale Computational Approaches to Memory-Intensive Applications in Systems Biology.

Conference Paper

Full-text available

Jan 2005

Graph-theoretical approaches to biological network analysis have proven to be effective for small networks but are computationally infeasible for comprehensive genome-scale systems-level elucidation of these networks. The difficulty lies in the NP-hard nature of many global systems biology problems that, in practice, translates to exponential (or worse) run times for finding exact optimal solutions. Moreover, these problems, especially those of an enumerative flavor, are often memory-intensive and must share very large sets of data effectively across many processors. For example, the enumeration of maximal cliques - a core component in gene expression network analysis, cis regulatory motif finding, and the study of quantitative trait loci for high-throughput molecular phenotypes - can result in as many as 3^(n/3) maximal cliques for a graph with n vertices. Memory requirements to store those cliques reach terabyte scales even on modest-sized genomes. Emerging hardware architectures with ultra-large globally addressable memory such as the SGI Altix and Cray X1 seem to be well suited for addressing these types of data-intensive problems in systems biology. This paper presents a novel framework that provides exact, parallel and scalable solutions to various graph-theoretical approaches to genome-scale elucidation of biological networks. This framework takes advantage of these large-memory architectures by creating globally addressable bitmap memory indices with potentially high compression rates, fast bitwise-logical operations, and reduced search space. Augmented with recent theoretical advancements based on fixed-parameter tractability, this framework produces computationally feasible performance for genome-scale combinatorial problems of systems biology.
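
The worst-case bound of 3^(n/3) maximal cliques can be checked on a small graph. The sketch below is illustrative only and is not the framework described in this abstract: it uses the textbook Bron-Kerbosch algorithm with pivoting, and the helper names `maximal_cliques` and `moon_moser_graph` are hypothetical. The test graph is the complete 3-partite graph with parts of size 3 (n = 9), which is known to attain the bound.

```python
def maximal_cliques(adj):
    """Yield every maximal clique of an undirected graph.

    adj maps each vertex to the set of its neighbours (no self-loops).
    """
    def expand(r, p, x):
        # r: current clique, p: candidates to extend it, x: already-explored vertices
        if not p and not x:
            yield frozenset(r)
            return
        # Pivot on the vertex covering the most candidates to prune branches.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    yield from expand(set(), set(adj), set())


def moon_moser_graph(k):
    """Complete k-partite graph with every part of size 3 (n = 3k vertices)."""
    vertices = range(3 * k)
    return {v: {u for u in vertices if u // 3 != v // 3} for v in vertices}


n = 9
graph = moon_moser_graph(n // 3)
cliques = list(maximal_cliques(graph))
print(len(cliques), 3 ** (n // 3))
```

Each maximal clique picks one vertex from each of the three parts, so the count grows by a factor of 3 for every three vertices added, which is why storage becomes the bottleneck at genome scale.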

Threshold Selection in Gene Co-Expression Networks Using Spectral Graph Theory Techniques

Article

Full-text available

Oct 2009
BMC BIOINFORMATICS

Gene co-expression networks are often constructed by computing some measure of similarity between expression levels of gene transcripts and subsequently applying a high-pass filter to remove all but the most likely biologically-significant relationships. The selection of this expression threshold necessarily has a significant effect on any conclusions derived from the resulting network. Many approaches have been taken to choose an appropriate threshold, among them computing levels of statistical significance, accepting only the top one percent of relationships, and selecting an arbitrary expression cutoff. We apply spectral graph theory methods to develop a systematic method for threshold selection. Eigenvalues and eigenvectors are computed for a transformation of the adjacency matrix of the network constructed at various threshold values. From these, we use a basic spectral clustering method to examine the set of gene-gene relationships and select a threshold dependent upon the community structure of the data. This approach is applied to two well-studied microarray data sets from Homo sapiens and Saccharomyces cerevisiae. This method presents a systematic, data-based alternative to using more artificial cutoff values and results in a more conservative approach to threshold selection than some other popular techniques such as retaining only statistically-significant relationships or setting a cutoff to include a percentage of the highest correlations.
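
As a rough illustration of how the Laplacian spectrum changes as the threshold moves, the sketch below sweeps a few cutoffs over a toy correlation matrix and counts near-zero Laplacian eigenvalues, which equals the number of connected components. This is a simplified stand-in, not the authors' procedure (which applies spectral clustering to a transformation of the adjacency matrix); the toy matrix, the function names, and the tolerance `tol` are assumptions made for the example, and NumPy is assumed to be available.

```python
import numpy as np


def laplacian_spectrum(corr, threshold):
    """Eigenvalues of the unnormalised Laplacian of the thresholded network."""
    adj = (np.abs(corr) >= threshold).astype(float)
    np.fill_diagonal(adj, 0.0)                 # drop self-correlations
    lap = np.diag(adj.sum(axis=1)) - adj       # L = D - A
    return np.linalg.eigvalsh(lap)             # ascending, real (L is symmetric)


def count_components(corr, threshold, tol=1e-8):
    """Number of connected components = multiplicity of eigenvalue 0."""
    return int(np.sum(laplacian_spectrum(corr, threshold) < tol))


# Toy matrix: two groups of three genes, 0.9 within a group, 0.3 between groups.
block = np.full((3, 3), 0.9)
corr = np.full((6, 6), 0.3)
corr[:3, :3] = block
corr[3:, 3:] = block
np.fill_diagonal(corr, 1.0)

components = {t: count_components(corr, t) for t in (0.2, 0.5, 0.95)}
print(components)
```

On this toy input the network is one connected block at the lowest cutoff, splits into the two gene groups at the middle cutoff, and falls apart into isolated genes at the highest one; a data-driven method looks for the cutoff where such community structure is clearest.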

Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization

Article

Full-text available

Jan 1999

We sought to create a comprehensive catalog of yeast genes whose transcript levels vary periodically within the cell cycle. To this end, we used DNA microarrays and samples from yeast cultures synchronized by three independent methods: alpha factor arrest, elutriation, and arrest of a cdc15 temperature-sensitive mutant. Using periodicity and correlation algorithms, we identified 800 genes that meet an objective minimum criterion for cell cycle regulation. In separate experiments, designed to examine the effects of inducing either the G1 cyclin Cln3p or the B-type cyclin Clb2p, we found that the mRNA levels of more than half of these 800 genes respond to one or both of these cyclins. Furthermore, we analyzed our set of cell cycle-regulated genes for known and new promoter elements and show that several known elements (or variations thereof) contain information predictive of cell cycle regulation. A full description and complete data sets are available at http://cellcycle-www.stanford.edu

Spectral graph theory

Article

Jan 2004

Spectral Graph Theory CBMS Series

Article

F. R. K. Chung

Spectral Graph Theory

Article

Jan 1997

Fan R. K. Chung

Uncovering the overlapping community structure of complex networks in nature and society

Article

Jun 2005

Many complex systems in nature and society can be described in terms of networks capturing the intricate web of connections among the units they are made of [1-4]. A key question is how to interpret the global organization of such networks as the coexistence of their structural subunits (communities) associated with more highly interconnected parts. Identifying these a priori unknown building blocks (such as functionally related proteins [5, 6], industrial sectors [7] and groups of people [8, 9]) is crucial to the understanding of the structural and functional properties of networks. The existing deterministic methods used for large networks find separated communities, whereas most of the actual networks are made of highly overlapping cohesive groups of nodes. Here we introduce an approach to analysing the main statistical features of the interwoven sets of overlapping communities that makes a step towards uncovering the modular structure of complex systems. After defining a set of new characteristic quantities for the statistics of communities, we apply an efficient technique for exploring overlapping communities on a large scale. We find that overlaps are significant, and the distributions we introduce reveal universal features of networks. Our studies of collaboration, word-association and protein interaction graphs show that the web of communities has non-trivial correlations and specific scaling properties.

A spectral method to separate disconnected and nearly-disconnected Web graph components

Conference Paper

Aug 2001

Separation of connected components from a graph with disconnected graph components mostly uses breadth-first search (BFS) or depth-first search (DFS) graph algorithms. Here we propose a new algebraic method to separate disconnected and nearly-disconnected components. This method is based on spectral graph partitioning, following a key observation that disconnected components will show up, after being properly sorted, as a step-function-like curve in the lowest eigenvectors of the Laplacian matrix of the graph. Following a perturbative analysis framework, we systematically analyzed the graph structures, first on the disconnected subgraph case, and second on the effects of adding edges sparsely connecting different subgraphs as a perturbation. Several new results are derived, providing insights into spectral methods and related clustering objective functions. Examples are given illustrating the concepts and results of our methods. Compared to the standard graph algorithms, this method has the same O(|E| + |V| log(|V|)) complexity, but is easier to implement (using readily available eigensolvers). Furthermore, the method can easily identify articulation points and bridges on nearly-disconnected graphs. Segmentation of a real example of a Web graph for the query "amazon" is given. We found that each disconnected or nearly-disconnected component forms a cluster on a clear topic.
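
The step-function observation can be seen on a six-vertex example. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation: it builds two triangles joined by a single bridge edge, computes the eigenvector of the second-smallest Laplacian eigenvalue (the Fiedler vector) with a dense NumPy solver, and splits vertices by sign. The function name `fiedler_split` and the toy edge list are hypothetical, and a dense solver does not achieve the complexity quoted in the abstract.

```python
import numpy as np


def fiedler_split(n, edges):
    """Split vertices 0..n-1 into two groups by the sign of the Fiedler vector."""
    adj = np.zeros((n, n))
    for u, v in edges:
        adj[u, v] = adj[v, u] = 1.0
    lap = np.diag(adj.sum(axis=1)) - adj       # unnormalised Laplacian L = D - A
    _, vecs = np.linalg.eigh(lap)              # eigenvalues returned in ascending order
    fiedler = vecs[:, 1]                       # eigenvector of the second-smallest eigenvalue
    positive = frozenset(i for i in range(n) if fiedler[i] >= 0)
    negative = frozenset(range(n)) - positive
    return {positive, negative}, fiedler


# Two triangles (0,1,2) and (3,4,5) joined by the bridge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
groups, fiedler = fiedler_split(6, edges)
print(sorted(sorted(g) for g in groups))
```

Sorting the entries of `fiedler` gives two plateaus, one per triangle, with the bridge endpoints (vertices 2 and 3) sitting closest to zero; that jump between plateaus is the step the abstract refers to.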

The Impact of Bootstrap Methods on Time Series Analysis

Article

May 2003
STAT SCI

Dimitris N. Politis

Sparked by Efron's seminal paper, the decade of the 1980s was a period of active research on bootstrap methods for independent data--mainly i.i.d. or regression set-ups. By contrast, in the 1990s much research was directed towards resampling dependent data, for example, time series and random fields. Consequently, the availability of valid nonparametric inference procedures based on resampling and/or subsampling has freed practitioners from the necessity of resorting to simplifying assumptions such as normality or linearity that may be misleading.
