Thesis

Semalytics: a semantic analytics platform for the exploration of distributed and heterogeneous cancer data in translational research

Abstract and Figures

The aim of translational cancer research is to transfer biomedical discoveries from the bench to the bedside. This is a challenging goal, since each cancer is a complex system with individual molecular features that determine the actual dynamics of the disease, such as its prognosis and response to therapies. Understanding those biological traits is fundamental in order to harness them in advance to improve clinical care and precision medicine. To achieve this, pre-clinical research procedures have been developed, which deploy large-scale experiments involving serial propagation of patients’ samples through in vivo and in vitro cultures [1, 2]. By preserving the fundamental biological properties of the collected material (e.g., sensitivity to specific therapies), such approaches allow challenging the tumor material with different perturbations (e.g., drugs) and measuring how it responds (e.g., tumor shrinkage). These processes generate massive collections of hierarchical data (i.e., experimental trees) which may be annotated with heterogeneous notes based on experimental results and observations, thus creating huge datasets that are extremely difficult to analyze both by humans and by machines. To address such issues in data analysis, we created the Semalytics data framework, the core of an analytical platform that processes experimental information through Semantic Web technologies. The platform enables users to bind experimental data to knowledge items (i.e., metadata describing biological properties) and to investigate such annotations. Semalytics allows (i) the efficient exploration of experimental trees of undefined depth together with their annotations and (ii) the linking of its data to a wider open knowledge base (i.e., Wikidata), which adds an extended knowledge layer without the need to manage and curate those data locally. Altogether, Semalytics provides an augmented perspective on experimental data, allowing the generation of new hypotheses that were not anticipated by the user a priori. In this thesis, we present our research on the data framework of Semalytics, focusing on its semantic nucleus and on how it exploits semantic reasoning to tackle the issues of this kind of analysis. Finally, we describe a proof-of-concept study based on the examination of several dozen cases of metastatic colorectal cancer in order to illustrate how Semalytics can help researchers generate hypotheses about the role of gene alterations in causing resistance or sensitivity of cancer cells to specific drugs.
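The idea of exploring an experimental tree of undefined depth together with its annotations can be illustrated with a minimal sketch. This is not the Semalytics implementation, which the abstract describes as built on Semantic Web technologies; it is a plain Python traversal under stated assumptions. The node names, the annotation strings, and the functions `descendants` and `annotations_in_subtree` are all hypothetical and exist only for illustration.

```python
from collections import defaultdict, deque

# Hypothetical experimental tree: each (parent, child) pair is one propagation
# step, e.g. a patient sample giving rise to a culture that is passaged further.
EDGES = [
    ("patient_sample", "xenograft_1"),
    ("xenograft_1", "xenograft_1a"),
    ("xenograft_1", "xenograft_1b"),
    ("xenograft_1a", "cell_line_1a"),
]

# Hypothetical annotations bound to individual nodes of the tree.
ANNOTATIONS = {
    "patient_sample": ["gene_A altered"],
    "xenograft_1b": ["resistant to drug_X"],
}


def descendants(root, edges):
    """Return every node below `root`, at any depth, in breadth-first order."""
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)

    found = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children[node]:
            found.append(child)
            queue.append(child)
    return found


def annotations_in_subtree(root, edges, annotations):
    """Collect the annotations attached to `root` or to any of its descendants."""
    nodes = [root] + descendants(root, edges)
    return {node: annotations[node] for node in nodes if node in annotations}


if __name__ == "__main__":
    print(descendants("patient_sample", EDGES))
    print(annotations_in_subtree("xenograft_1", EDGES, ANNOTATIONS))
```

The traversal does not need to know the depth of the tree in advance, which is the property the abstract highlights. In an RDF setting, a comparable transitive lookup is typically expressed as a query over the graph rather than as hand-written traversal code.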
Article
In recent years, Laboratory Information Management Systems (LIMS) have grown from mere inventory systems into increasingly comprehensive software platforms, spanning functionalities as diverse as data search, annotation and analysis. In 2011, our institution started a LIMS project named the Laboratory Assistant Suite with the purpose of assisting researchers throughout all of their laboratory activities, providing graphical tools to support decision-making tasks and building complex analyses on integrated data. The modular architecture of the system exploits multiple databases with different technologies. To provide an efficient and easy tool for retrieving information of interest, we developed the Multi-Dimensional Data Manager (MDDM). By means of intuitive interfaces, scientists can execute complex queries without any knowledge of query languages or database structures, and easily integrate heterogeneous data stored in multiple databases. Together with the other software modules making up the platform, the MDDM has helped improve the overall quality of the data, substantially reduced the time spent on manual data entry and retrieval, and ultimately broadened the spectrum of interconnections among the data, offering novel perspectives to biomedical analysts.
Chapter
Handling large knowledge bases of information from different domains such as the World Wide Web is a complex problem addressed in the Resource Description Framework (RDF) by adding semantic meaning to the data itself. The amount of linked data has brought with it a number of specialized databases that are capable of storing and processing RDF data, called RDF stores. We explore the RDF store landscape with the aim of finding an RDF store that sufficiently meets the storage needs of an enhanced living environment, more concretely the requirements of a Smart Space platform aimed at running on a cluster setup of low-power hardware that can be run entirely locally at home with the purpose of logging data for a reactive assistive system involving, e.g., activity recognition or domotics. We present a literature analysis of RDF stores and identify promising candidates for implementation of consumer Smart Spaces. Based on the insights provided by our study, we conclude by suggesting different relevant aspects of RDF storage systems that need to be considered in Ambient Assisted Living environments and a comparison of available solutions.
Article
Patient-derived tumor xenograft (PDX) mouse models are a versatile oncology research platform for studying tumor biology and for testing chemotherapeutic approaches tailored to genomic characteristics of individual patients’ tumors. PDX models are generated and distributed by a diverse group of academic labs, multi-institution consortia and contract research organizations. The distributed nature of PDX repositories and the use of different metadata standards for describing model characteristics present a significant challenge to identifying PDX models relevant to specific cancer research questions. The Jackson Laboratory and EMBL-EBI are addressing these challenges by co-developing PDX Finder, a comprehensive open global catalog of PDX models and their associated datasets. Within PDX Finder, model attributes are harmonized and integrated using a previously developed community minimal information standard to support consistent searching across the originating resources. Links to repositories are provided from the PDX Finder search results to facilitate model acquisition and/or collaboration. The PDX Finder resource currently contains information for 1985 PDX models of diverse cancers, including those from large resources such as the Patient-Derived Models Repository, PDXNet and EurOPDX. Individuals or organizations that generate and distribute PDXs are invited to increase the ‘findability’ of their models by participating in the PDX Finder initiative at www.pdxfinder.org.
Article
Introduction: Heart disease has been the leading cause of death in the United States since 1910 and cancer the second leading cause of death since 1933. However, cancer emerged recently as the leading cause of death in many US states. The objective of this study was to provide an in-depth analysis of age-standardized annual state-specific mortality rates for heart disease and cancer. Methods: We used population-based mortality data from 1999 through 2016 to compare 2 underlying cause-of-death categories: diseases of heart (International Classification of Diseases, 10th Revision [ICD-10] codes I00-I09, I11, I13, and I20-I51) and malignant neoplasms (ICD-10 codes C00-C97). We calculated age-standardized annual state-specific mortality rate ratios (MRRs) as heart disease mortality rate divided by cancer mortality rate. Results: In 1999, age-standardized heart disease mortality exceeded that for cancer in all 50 states. Median state-specific MRR in 1999 was 1.26 (interquartile range [IQR], 1.17-1.34; range, 1.03-1.56), indicating predominance of heart disease mortality nationwide. Median state-specific MRR decreased annually through 2010, reaching a low of 1.00 (IQR, 0.95-1.07; range, 0.71-1.25), indicating that predominance of heart disease mortality prevailed in approximately half of states. Median state-specific MRR increased to 1.03 (IQR, 0.97-1.12; range, 0.77-1.31) in 2016. In 2016, age-standardized cancer mortality exceeded that for heart disease in 19 states. State-level transitions were most apparent for people aged 65 to 84 and affected men, women, and all racial/ethnic groups. Conclusion: State-level data indicated heterogeneity across US states in the predominance of heart disease mortality relative to cancer mortality. Timing and magnitude of transitions toward cancer mortality predominance varied by state.
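The mortality rate ratio defined in the Methods above (heart disease mortality rate divided by cancer mortality rate) can be made concrete with a short sketch. The state names and rates below are invented purely for illustration and are not taken from the study; the function name `mortality_rate_ratio` is likewise an assumption.

```python
from statistics import median


def mortality_rate_ratio(heart_rate, cancer_rate):
    """MRR = heart disease mortality rate / cancer mortality rate."""
    if cancer_rate <= 0:
        raise ValueError("cancer mortality rate must be positive")
    return heart_rate / cancer_rate


# Made-up age-standardized rates per 100,000 as (heart disease, cancer).
# These values are illustrative only and do not come from the study.
RATES = {
    "State A": (200.0, 160.0),
    "State B": (150.0, 150.0),
    "State C": (135.0, 150.0),
}

mrrs = {state: mortality_rate_ratio(h, c) for state, (h, c) in RATES.items()}
median_mrr = median(mrrs.values())

# An MRR below 1 means cancer mortality exceeds heart disease mortality.
cancer_leading = [state for state, ratio in mrrs.items() if ratio < 1.0]

if __name__ == "__main__":
    print(mrrs)
    print(median_mrr)
    print(cancer_leading)
```

Read this way, a median MRR near 1.00 means heart disease still predominates in roughly half of the states, which is how the abstract interprets the 2010 figure.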
Preprint
Precision oncology relies on the accurate discovery and interpretation of genomic variants to enable individualized therapy selection, diagnosis, or prognosis. However, knowledgebases containing clinical interpretations of somatic cancer variants are highly disparate in interpretation content, structure, and supporting primary literature, reducing consistency and impeding consensus when evaluating variants and their relevance in a clinical setting. With the cooperation of experts of the Global Alliance for Genomics and Health (GA4GH) and of six prominent cancer variant knowledgebases, we developed a framework for aggregating and harmonizing variant interpretations to produce a meta-knowledgebase of 12,856 aggregate interpretations covering 3,437 unique variants in 415 genes, 357 diseases, and 791 drugs. We demonstrated large gains in overlapping terms between resources across variants, diseases, and drugs as a result of this harmonization. We subsequently demonstrated improved matching between patients of the GENIE cohort and harmonized interpretations of potential clinical significance, observing an increase from an average of 34% to 57% in aggregate. We developed an open and freely available web interface for exploring the harmonized interpretations from these six knowledgebases at search.cancervariants.org.
Article
In order to achieve more accurate disease prevention, diagnosis, and treatment, clinical and genetic data need extensive and systematically associated study. As one way to achieve precision medicine, a laboratory information management system (LIMS) can effectively associate clinical data in a macrocosmic aspect and genomic data in a microcosmic aspect. This chapter summarizes the application of the LIMS in a clinical data management and implementation mode. It also discusses the principles of a LIMS in clinical data management, as well as the opportunities and challenges in the context of medical informatics.
Article
While tumor genome sequencing has become widely available in clinical and research settings, the interpretation of tumor somatic variants remains an important bottleneck. Here we present the Cancer Genome Interpreter, a versatile platform that automates the interpretation of newly sequenced cancer genomes, annotating the potential of alterations detected in tumors to act as drivers and their possible effect on treatment response. The results are organized in different levels of evidence according to current knowledge, which we envision can support a broad range of oncology use cases. The resource is publicly available at http://www.cancergenomeinterpreter.org.
Article
Purpose: With prospective clinical sequencing of tumors emerging as a mainstay in cancer care, an urgent need exists for a clinical support tool that distills the clinical implications associated with specific mutation events into a standardized and easily interpretable format. To this end, we developed OncoKB, an expert-guided precision oncology knowledge base. Methods: OncoKB annotates the biologic and oncogenic effects and prognostic and predictive significance of somatic molecular alterations. Potential treatment implications are stratified by the level of evidence that a specific molecular alteration is predictive of drug response on the basis of US Food and Drug Administration labeling, National Comprehensive Cancer Network guidelines, disease-focused expert group recommendations, and scientific literature. Results: To date, >3,000 unique mutations, fusions, and copy number alterations in 418 cancer-associated genes have been annotated. To test the utility of OncoKB, we annotated all genomic events in 5,983 primary tumor samples in 19 cancer types. Forty-one percent of samples harbored at least one potentially actionable alteration, of which 7.5% were predictive of clinical benefit from a standard treatment. OncoKB annotations are available through a public Web resource (http://oncokb.org) and are incorporated into the cBioPortal for Cancer Genomics to facilitate the interpretation of genomic alterations by physicians and researchers. Conclusion: OncoKB, a comprehensive and curated precision oncology knowledge base, offers oncologists detailed, evidence-based information about individual somatic mutations and structural alterations present in patient tumors with the goal of supporting optimal treatment decisions.
Article
This article provides a status report on the global burden of cancer using the GLOBOCAN 2018 estimates of cancer incidence and mortality produced by the International Agency for Research on Cancer, with a focus on geographic variability across 20 world regions. There will be an estimated 18.1 million new cancer cases (17.0 million excluding nonmelanoma skin cancer) and 9.6 million cancer deaths (9.5 million excluding nonmelanoma skin cancer) in 2018. In both sexes combined, lung cancer is the most commonly diagnosed cancer (11.6% of the total cases) and the leading cause of cancer death (18.4% of the total cancer deaths), closely followed by female breast cancer (11.6%), prostate cancer (7.1%), and colorectal cancer (6.1%) for incidence and colorectal cancer (9.2%), stomach cancer (8.2%), and liver cancer (8.2%) for mortality. Lung cancer is the most frequent cancer and the leading cause of cancer death among males, followed by prostate and colorectal cancer (for incidence) and liver and stomach cancer (for mortality). Among females, breast cancer is the most commonly diagnosed cancer and the leading cause of cancer death, followed by colorectal and lung cancer (for incidence), and vice versa (for mortality); cervical cancer ranks fourth for both incidence and mortality. The most frequently diagnosed cancer and the leading cause of cancer death, however, vary substantially across countries and within each country depending on the degree of economic development and associated social and lifestyle factors. It is noteworthy that high‐quality cancer registry data, the basis for planning and implementing evidence‐based cancer control programs, are not available in most low‐ and middle‐income countries. The Global Initiative for Cancer Registry Development is an international partnership that supports better estimation, as well as the collection and use of local data, to prioritize and evaluate national cancer control efforts.
CA: A Cancer Journal for Clinicians 2018;0:1‐31. © 2018 American Cancer Society