Network Motif Families for Lung Cancer Diagnostics:
A World Community Grid Approach
Anne-Christin Hauschild, Christian A. Cumbaa, Mike Tsay, Igor Jurisica
Princess Margaret Cancer Centre, Toronto, Ontario
Despite smoking cessation efforts and advances in detection and treatment, lung cancer remains the
leading cause of cancer-related death[1, 2, 3], and non-small cell lung cancer (NSCLC) accounts
for about 80-85% of all cases. The overall survival rate for lung cancer has improved only marginally
in the past decades, from 13% to 16%. Because the disease is often asymptomatic in its early stages,
it is frequently not diagnosed until it is advanced. Over recent decades, numerous studies have
analysed NSCLC using diverse “omic” platforms to identify a large pool of signatures for
detection and prognosis. However, translating these into clinical practice remains challenging;
signatures often do not validate in other cohorts or by different biological assays, and there are
thousands of possible combinations to consider.
Making use of a unique computational resource, the Mapping Cancer Markers (MCM) project
aims to systematically survey the landscape of useful cancer gene signatures for multiple cancers
(diagnosis and prognosis), and thereby establish a benchmark for cancer gene signature
identification and validation. MCM is powered by IBM’s World Community Grid (WCG),
a massive grid of 3.3 million devices (http://www.worldcommunitygrid.org). WCG members
contribute spare compute cycles to problems in health, poverty, and ecology.
Using the WCG, MCM’s lung cancer evaluation sampled 9.8 trillion combinations of fixed-length
gene expression patterns, and evaluated these signatures against an NSCLC diagnostic
gene expression dataset. Using a performance threshold based on the Matthews correlation
coefficient (MCC), approximately 45 million high-performing signatures were identified.
We characterized the distribution of the high-performing signatures in terms of the frequency
of individual genes and network patterns, and by comprehensive pathway enrichment analysis.
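To make the MCC-based filtering step concrete, the following is a minimal sketch rather than the MCM implementation. The function names, the representation of a signature as a tuple of gene identifiers, the toy predictions, and the threshold value are all assumptions made for illustration; the abstract does not state the threshold or the classifier used.

```python
import math

def confusion(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = tumour, 0 = normal)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column sum is zero
    return (tp * tn - fp * fn) / denom

def high_performing(predictions_by_signature, y_true, threshold):
    """Keep signatures whose MCC meets a (hypothetical) threshold."""
    kept = {}
    for signature, y_pred in predictions_by_signature.items():
        score = mcc(*confusion(y_true, y_pred))
        if score >= threshold:
            kept[signature] = score
    return kept

# Toy example with placeholder gene identifiers (not results from the study).
y_true = [1, 1, 0, 0]
predictions = {
    ("G1", "G2"): [1, 1, 0, 0],  # perfect separation
    ("G3", "G4"): [1, 0, 1, 0],  # no better than chance
}
selected = high_performing(predictions, y_true, threshold=0.5)
```

MCC ranges from -1 to 1 and accounts for all four cells of the confusion matrix, which makes it a reasonable single-number criterion when the two classes are of unequal size.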
Our overall goal is to utilize network patterns to identify generalized motif families that
give deeper insights into the molecular background of cancers, and give rise to more reliable
signatures for cancer detection and prognosis. Using state-of-the-art unsupervised learning
technologies, we first partition the gene features into clusters of high connectivity. We then
apply an established frequent-itemset mining algorithm to identify co-occurring terms among these
patterns. The most frequent motif families were then further evaluated with frequentist and
Bayesian methods in combination with performance measures such as MCC and AUC. Given
the broad representation of the pattern space, the result of this extensive processing pipeline is
a set of highly informative gene clusters and gene motif families of high predictive power.
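The frequent-itemset step can be illustrated with a small sketch. This is a brute-force support count over gene subsets of a fixed size, not a full Apriori or FP-growth implementation and not necessarily the algorithm used in MCM. The function name, the `min_support` parameter, and the placeholder gene identifiers are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def frequent_gene_sets(signatures, size, min_support):
    """Return gene subsets of a given size and their support.

    Support is the fraction of signatures that contain the subset.
    Only subsets with support >= min_support are returned.
    """
    counts = Counter()
    for signature in signatures:
        genes = sorted(set(signature))  # canonical order, no duplicates
        counts.update(combinations(genes, size))
    total = len(signatures)
    return {
        subset: count / total
        for subset, count in counts.items()
        if count / total >= min_support
    }

# Toy high-performing signatures with placeholder gene identifiers.
signatures = [
    ("G1", "G2", "G3"),
    ("G1", "G2", "G4"),
    ("G1", "G3", "G5"),
    ("G2", "G1", "G5"),
]
pairs = frequent_gene_sets(signatures, size=2, min_support=0.5)
# pairs == {("G1", "G2"): 0.75, ("G1", "G3"): 0.5, ("G1", "G5"): 0.5}
```

Enumerating all subsets is only feasible for short signatures and small subset sizes; at the scale of millions of signatures, an algorithm that prunes infrequent candidates early would be the more practical choice.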
Finally, we demonstrate how the discovered clusters and motif families group genes with
similar function and localization, as well as shared interaction and pathway networks.
In summary, we demonstrate a “big data” pattern discovery system that can produce more
robust and reliable clinical diagnostics. The presented computational framework carries the
potential for applications in precision oncology.
[1] Jeffrey P Kanne. Screening for lung cancer: what have we learned? American Journal of Roentgenology,
[2] Rebecca Siegel, Deepa Naishadham, and Ahmedin Jemal. Cancer statistics, 2012. CA: A Cancer Journal for
Clinicians, 62(1):10–29, 2012.
[3] Rebecca L Siegel, Kimberly D Miller, and Ahmedin Jemal. Cancer statistics, 2015. CA: A Cancer Journal
for Clinicians, 65(1):5–29, 2015.