Filtering Microarray Correlations by Statistical
Literature Analysis Yields Potential Hypotheses for
Lactation Research
Maurice HT Ling1,2 (mauriceling@acm.org)
Christophe Lefevre 1,3,4 (Chris.Lefevre@med.monash.edu.au)
Kevin R Nicholas1,4 (kevin.nicholas@deakin.edu.au)
1CRC for Innovative Dairy Products, Department of Zoology, The University of
Melbourne, Australia
2School of Chemical and Life Sciences, Singapore Polytechnic, Singapore
3Victorian Bioinformatics Consortium, Monash University, Australia
4Institute of Technology Research and Innovation, Deakin University, Australia
Abstract
Background
Recent studies have demonstrated that the cyclical nature of mouse lactation1 is
mirrored at the transcriptome2 level of the mammary glands, but making sense of
microarray3 results requires analysis of large amounts of biological information, which
is increasingly difficult to access as the amount of literature increases. Extraction of
protein-protein interactions from text by statistical and natural language processing has
been shown to be useful in managing the literature. Correlation of gene expression
across a series of samples is a simple method to analyze microarray data, as
genes that are related in function have been found to exhibit similar expression profiles4.
Microarrays have been used to examine the transcriptome of mouse lactation, and it was found
that the cyclic nature of the lactation cycle as observed histologically is reflected at
the transcription level. However, there has been no study to date using text mining to
filter microarray analysis to generate new hypotheses for further research in the field
of lactational biology.
Results
Our results demonstrated that a previously reported protein name co-occurrence
method (5-mention PubGene), which was not based on a hypothesis testing
framework, is generally more stringent than the 99th percentile of the Poisson
distribution-based method of calculating co-occurrence. It agrees with previous methods using
natural language processing to extract protein-protein interactions from text, as more
than 96% of the interactions found by natural language processing methods
coincide with the results from the 5-mention PubGene method. However, less than 2% of
1 Lactation is the process of milk production.
2 Transcriptome is the set of genes that are active in a given cell at any one time.
3 Microarray is a multiplex technology used in molecular biology to measure the activity of a set of
genes at any one time.
4 A gene expression profile is the trend of activity for all the genes across different time points or
conditions.
the gene co-expressions analyzed by microarray were found by direct co-occurrence
or interaction information extraction from the literature. At the same time,
by combining microarray and literature analyses, we derived a novel set of 7 potential
functional protein-protein interactions that had not been previously described in the
literature.
Conclusions
We conclude that the 5-mention PubGene method is more stringent than the 99th
percentile of the Poisson distribution method for extracting protein-protein interactions by
co-occurrence of entity names, and that literature analysis may serve as a filter for
microarray analysis to isolate potentially novel hypotheses for further research.
1. Background
Microarray technology is a transcriptome analysis tool that has been used in the
study of the mouse lactation cycle (Clarkson and Watson, 2003; Rudolph et al., 2007).
A number of advances in microarray analysis have been made recently. For example,
the underlying genetic network can be inferred from microarray results (Rawool and
Venkatesh, 2007; Maraziotis et al., 2007) by statistical correlation of gene expression
across a series of samples (Reverter et al., 2005), and functional network
clusters can then be derived by mapping onto Gene Ontology (Beissbarth, 2006). It has been shown that
functionally related genes demonstrate similar expression profiles (Reverter et al.,
2005). These methods have been used to study functional gene sets for basal cell
carcinoma (O'Driscoll et al., 2006). The amount of information in published form is
increasing exponentially, making it difficult for researchers to keep abreast of the
relevant literature (Hunter and Cohen, 2006). At the same time, no
study has demonstrated that the current knowledge of protein-protein
interactions in the literature is useful for understanding microarray data.
The two major streams for biomedical protein-protein information extraction are
natural language processing (NLP) and co-occurrence statistics (Cohen and Hersh,
2005; Jensen et al., 2006). The main reason for the concurrent existence of these two
methods is their complementary effect in terms of information extraction (Jensen et
al., 2006). NLP has a lower recall (sensitivity) than co-occurrence but tends to be
more precise than co-occurrence statistical methods (Wren and Garner,
2004; Jensen et al., 2006). Mathematically, precision is the number of true positives
divided by the total number of items labeled by the system as positive (number of true
positives divided by the sum of true and false positives), whereas recall is the number
of true positives identified by the system divided by the number of actual positives
(number of true positives divided by the sum of true positives and false negatives). A
number of tools have approached protein-protein interaction extraction from the NLP
perspective; these include GENIES (Friedman et al., 2001), MedScan (Novichkova et
al., 2003), PreBIND (Donaldson et al., 2003), BioRAT (David et al., 2004), GIS
(Chiang et al., 2004), CONAN (Malik et al., 2006), and Muscorian (Ling et al., 2007).
Muscorian (Ling et al., 2007) achieved at least 82% precision and 30% recall
(sensitivity). NLP methods make use of the grammatical forms of words and the structure
of a valid sentence to identify the grammatical role of each word in a sentence, parse
the sentence into phrases, and extract information such as subject-verb-object
structures from these phrases. Co-occurrence, a statistical method, is based on the
thesis that multiple occurrences of the same pair of entities suggest that the pair of
entities is related in some way and that the likelihood of such relatedness increases with
higher co-occurrence. In other words, co-occurrence methods tend to view the text
as a bag of un-sequenced words. Hence, depending on the threshold allowed, which
will translate to the precision of the entire system, recall could be total, as implied in
PubGene (Jenssen et al., 2001).
PubGene (Jenssen et al., 2001) defined interaction by co-occurrence in the simplest
and widest possible form, assigning an interaction between 2 proteins if these 2
proteins appear in the same article just once in the entire library of 10 million articles,
and found that this criterion has 60% precision (1-Mention PubGene method).
Although it was not stated in the article (Jenssen et al., 2001), such a
criterion would yield 100% recall (sensitivity), giving an F-score of 0.75. The F-score is
defined as the harmonic mean of precision and recall, attributing equal weight to
both. However, 60% precision is unsatisfactory for most
applications. PubGene (Jenssen et al., 2001) also defined a “5-Mention” method,
which requires 5 or more articles containing both protein names to assign an interaction, with
72% precision. It is generally accepted that precision and recall are inversely related;
hence, it can be expected that the “5-Mention” method will not be 100% sensitive.
However, PubGene was benchmarked against the Database of Interacting Proteins and
OMIM, making it more difficult to appreciate the statistical basis of the “1-Mention” and
“5-Mention” methods than that of the hypothesis testing framework used by Chen et
al. (2008). In addition, PubGene is unable to extract the nature of interactions, for
example, binding or inhibiting interactions. On the other hand, NLP is designed to
extract the nature of interactions (Malik et al., 2006; Ling et al., 2007); hence, it can
be expected that NLP results may be used to annotate co-occurrence results.
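The precision, recall and F-score definitions used above can be checked with a few lines of Python. This is only a minimal sketch: the function names and the confusion counts are illustrative assumptions, chosen so that precision is 60% and recall is 100%, as argued for the 1-Mention criterion.

```python
def precision(tp, fp):
    """True positives divided by all items labeled positive by the system."""
    return tp / (tp + fp)


def recall(tp, fn):
    """True positives divided by all actual positives."""
    return tp / (tp + fn)


def f_score(p, r):
    """Harmonic mean of precision and recall, with equal weight on both."""
    return 2 * p * r / (p + r)


# Hypothetical counts (not from the paper): 60 true positives,
# 40 false positives and no false negatives.
p = precision(tp=60, fp=40)   # 0.6
r = recall(tp=60, fn=0)       # 1.0
print(round(f_score(p, r), 2))  # prints 0.75
```

With any recall below 100%, as expected for the 5-Mention criterion, the same function shows how the F-score trades the gain in precision against the loss in recall.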
CoPub Mapper used a more sophisticated information measure which took into
account the distribution of entity names in the text database (Alako et al., 2005).
Although Alako et al. (2005) demonstrated that CoPub Mapper's information measure
correlates well with microarray co-expression, the information measure was not used as a
decision criterion for deciding which pairs of co-occurrences were positive results
(personal communication, Guido Jenster, 2006). This is unlike the 1-Mention PubGene
method, where all co-occurrences were taken as positive results, and the 5-Mention PubGene
method, which requires a co-occurrence count of at least 5 before attributing the co-occurrence
as a positive result. Chen et al. (2008) used the chi-square test to test co-occurrence
statistically to mine disease-drug interactions from clinical notes and published
literature. Another possible way to calculate co-occurrence is a direct use of the Poisson
distribution, on the assumption that co-occurrence of 2 protein names is a rare event
with respect to the entire library. The Poisson distribution is a discrete distribution similar
to the Binomial distribution but is used for rare events, for example, to estimate the
probability of accidents in a given stretch of road in a day. The Poisson distribution is
easier to use than the Binomial distribution as it requires only the mean and does not
require a separate standard deviation (its variance equals its mean). Based on PubGene, the statistical assumption of Poisson
distribution-based statistics requiring rare events (in this case, that the co-occurrence of 2
protein names in a collection of text is statistically rare) can generally be held
(Jenssen et al., 2001).
Although either NLP or co-occurrence has been combined with microarray analysis
(Li et al., 2007; Gajendran et al., 2007; Hsu et al., 2007), neither method
has been used in microarray analysis for advancing lactational biology. This study
attempts to examine the relation between the PubGene and Poisson distribution
methods of calculating co-occurrence and to explore the use of NLP-based
protein-protein interaction extraction results to annotate co-occurrence results. This study also
examines the use of co-occurrence analysis on 4 publicly available microarray data
sets on the mouse lactation cycle (Master et al., 2002; Clarkson and Watson, 2003; Stein
et al., 2004; Rudolph et al., 2007) as a novel hypothesis discovery tool. Master et al.
(2002) used 13 microarrays to discover the presence of brown adipose tissue in the mouse
mammary fat pad and its role in thermoregulation; Clarkson and Watson (2003) used
24 microarrays and characterized inflammation response genes during involution;
Stein et al. (2004) used 51 microarrays and discovered a set of 145 genes that are
up-regulated in early involution, of which 49 encoded immunoglobulins; and Rudolph et
al. (2007) used 29 microarrays to study lipid synthesis in the mouse mammary gland
following diets of various fat content and found that genes encoding nutrient
transporters into the cell are up-regulated following increased food intake. More
importantly, each of the 4 studies independently demonstrated that the cyclical nature
of mammary gland development, as observed histologically and biochemically, is
reflected at the transcriptome level, suggesting that microarray is a suitable tool to
study the regulation of mouse lactation. It should be noted that even though each of
these microarray experiments was designed for a different purpose, the principle that
functionally related genes are more strongly co-expressed than functionally unrelated genes
remains, as demonstrated by Reverter et al. (2005).
Our results demonstrate that the 5-mention PubGene method is generally statistically
more significant than the 99th percentile of the Poisson distribution method of calculating
co-occurrence. Our results also show that 96% of the interactions extracted by NLP
methods (Ling et al., 2007) overlapped with the results from the 5-mention PubGene
method. However, less than 2% of the microarray correlations were found in the
co-occurrence graph extracted by the 1-mention PubGene method. Using co-occurrence
results to filter microarray co-expression correlations, we have discovered a
potentially novel set of 7 protein-protein interactions that had not been previously
described in the literature.
2. Methods
2.1. Microarray Datasets
The 4 microarray datasets are from Master et al. (2002) using Affymetrix Mouse Chip
Mu6500 and FVB mice, Clarkson and Watson (2003) using Affymetrix U74Av2 chip
and C57/BL6 mice, Rudolph et al. (2007) using Affymetrix U74Av2 chip and FVB
mice, and Stein et al. (2004) using Affymetrix U74Av2 chip and Balb/C mice.
2.2. Co-Occurrence Calculations
Co-occurrence was calculated using a pre-defined list of 3653 protein names, which was
derived by Ling et al. (2007) from the Affymetrix Mouse Chip Mu6500 microarray probeset. PubGene
established 2 measures of binary co-occurrence (Jenssen et al., 2001): the 1-mention
method and the 5-mention method. In the 1-mention method, the appearance of 2 entity
names in the same abstract is deemed a positive outcome, whereas the 5-mention
method requires the appearance of 2 entity names in at least 5 abstracts
before the pair is considered positive.
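The two PubGene criteria can be illustrated with a short Python sketch. It is not the implementation used in this study: the function names, the toy corpus, and the assumption that each abstract has already been reduced to the set of protein names recognized in it are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(abstracts):
    """Count, for each unordered pair of names, the abstracts that mention both.

    `abstracts` is an iterable of sets of entity names, one set per abstract
    (assumed to come from an earlier name-recognition step).
    """
    counts = Counter()
    for names in abstracts:
        # Sorting gives each unordered pair a single canonical tuple.
        for pair in combinations(sorted(set(names)), 2):
            counts[pair] += 1
    return counts


def pubgene_edges(counts, min_mentions=1):
    """Pairs co-occurring in at least `min_mentions` abstracts."""
    return {pair for pair, n in counts.items() if n >= min_mentions}


# Toy corpus: proteinA/proteinB co-occur in 5 abstracts, proteinA/proteinC in 1.
abstracts = [{"proteinA", "proteinB"}] * 5 + [{"proteinA", "proteinC"}]
counts = cooccurrence_counts(abstracts)
one_mention = pubgene_edges(counts, min_mentions=1)
five_mention = pubgene_edges(counts, min_mentions=5)
print(len(one_mention), len(five_mention))  # prints 2 1
```

In this toy corpus, both pairs pass the 1-mention criterion but only the proteinA/proteinB pair passes the 5-mention criterion.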
For co-occurrence modelled on the Poisson distribution (Poisson co-occurrence), the
number of abstracts in which both entity names appear is assumed to be rare, as it
only requires the appearance of 2 entity names within 5 articles in a collection of 10
million articles to give a precision of 0.72 (Jenssen et al., 2001). The relative
occurrence frequency of each of the 2 entities was calculated separately as the
quotient of the number of abstracts in which the entity name appeared and the total
number of abstracts in the corpus. The product of the relative occurrence frequencies of
the 2 entities can be taken as the expected probability of the 2 entities
appearing in the same abstract if they are not related, which, when multiplied by the
total number of abstracts, can be taken as the mean number of occurrences (lambda) of
the Poisson distribution. For example, if proteinA and proteinB are found in 1000
abstracts each and there are 1 million abstracts, the relative occurrence frequency will
be 0.001 each and the mean number of occurrences will be 1 ($0.001^2 \times 1000000 = 1$). This
means that we expect 1 abstract in a collection of 1 million to contain both proteinA and
proteinB if they are not related (n = 1, p = 0.5).
A positive result is one where the number of abstracts in which both entities in
question appear is at or above the 95th (one-tail P < 0.05) or 99th (one-tail P < 0.01)
percentile of the Poisson distribution. In both co-occurrence calculations, entity
(protein) names in text are recognized by pattern matching, as used in Ling et al.
(2007).
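The Poisson co-occurrence test described above can be sketched in Python as follows. This is a hedged illustration rather than the code used in this study: the function names are hypothetical, and the decision rule is one reading of the percentile criterion, namely that a pair is positive when the one-tail probability P(X >= observed) falls below the chosen level.

```python
import math


def expected_cooccurrence(n_a, n_b, n_total):
    """Mean number of abstracts (lambda) expected to mention both names by chance.

    Algebraically equal to (n_a / n_total) * (n_b / n_total) * n_total.
    """
    return n_a * n_b / n_total


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), via the cumulative sum of P(X = 0..k-1)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) computed from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def is_positive(observed, n_a, n_b, n_total, alpha=0.01):
    """Assumed decision rule: positive if the one-tail P-value is below alpha."""
    lam = expected_cooccurrence(n_a, n_b, n_total)
    return poisson_upper_tail(observed, lam) < alpha


# Worked example from the text: 1000 abstracts each, 1 million abstracts in total.
lam = expected_cooccurrence(1000, 1000, 1_000_000)
print(lam, is_positive(5, 1000, 1000, 1_000_000))  # prints 1.0 True
```

Under this reading and with lambda = 1, a pair needs to co-occur in at least 4 abstracts to be positive at P < 0.05 and in at least 5 abstracts at P < 0.01. For very large lambda, `math.exp(-lam)` underflows to zero, so a log-space calculation would be needed in that case.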
2.3. Comparing Co-Occurrence and Text Processing
Two sets of comparisons were performed: within the different forms of co-occurrence,
and between co-occurrence and text processing methods. The first set of comparisons
aims to evaluate the differences between the 3 co-occurrence methods described
above. PubGene's 1-mention and 5-mention methods were correlated singly and in
combination with Poisson co-occurrence methods.
Given that the nodes (N) of a co-occurrence network represent the entities and the
links or edges (E) between nodes represent co-occurrences under the method
used, the entire co-occurrence graph is G = {N, E}, that is, a set of nodes and a set of
edges. In addition, given that the same set of entities was used (same set of nodes),
the difference between the 2 graphs resulting from 2 co-occurrence methods can then
be simply denoted as the number of differences between the 2 sets of edges
(subtraction of one set of edges from another set of edges). In practice, a total space
model is used. A graph of total possible co-occurrence is where each node is “linked”
or co-occurred with every node, including loops (edge to itself). Thus, a graph of total
possible co-occurrence has 3653 nodes and 12694969 ($3563^2$) edges. We define a
graph, G*, as the undirected graph of total possible co-occurrence without parallel
edges or loops. G* has 3653 nodes and 6345703 [3563 x (3563 – 1) / 2]
edges. The output graph of each co-occurrence method is reduced to the number of
edges it contains. As it can be assumed that the graph from the 1-mention PubGene method
represents the most liberal co-occurrence graph (G_PG1), the resulting graph from any
other, more sophisticated method (G_i, where i denotes the co-occurrence method) will
be a proper subset of G_PG1 and certainly of G*.
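The edge-set comparison described above can be sketched in Python by representing each co-occurrence graph as a set of undirected edges over a shared node set. The helper names and the toy graphs are assumptions made for illustration, and loops are excluded here, following the n x (n – 1) / 2 count.

```python
def total_possible_edges(n_nodes):
    """Edges in an undirected graph on n_nodes without parallel edges or loops."""
    return n_nodes * (n_nodes - 1) // 2


def edge(a, b):
    """Undirected edge: a frozenset, so that (a, b) and (b, a) are the same edge."""
    return frozenset((a, b))


def edge_difference(edges_a, edges_b):
    """Edges present under method A but absent under method B."""
    return edges_a - edges_b


# Toy graphs standing in for the 1-mention and 5-mention outputs.
g_pg1 = {
    edge("proteinA", "proteinB"),
    edge("proteinA", "proteinC"),
    edge("proteinB", "proteinC"),
}
g_pg5 = {edge("proteinA", "proteinB")}

only_in_pg1 = edge_difference(g_pg1, g_pg5)
# Number of differing edges, proper-subset check, and size of the total space
# for a hypothetical 4-node graph.
print(len(only_in_pg1), g_pg5 < g_pg1, total_possible_edges(4))  # prints 2 True 6
```

Because every graph shares the same nodes, comparing two methods reduces to set operations on their edges, and the `<` operator on Python sets tests the proper-subset relation directly.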