Conference Paper

Text mining workflow for extraction of paragraphs from full articles describing drug-gene interactions to support Onco KEM software platform for personalized treatments

Abstract

Introduction: Presenting key information about drug-gene interactions in a synthesized form to medical oncologists, in support of personalized treatment decisions, is a challenge and requires databases created by curation of full scientific articles. Databases such as the Comparative Toxicogenomics Database (CTD) or Drug-Gene Interaction (DGI) provide a brief sentence for this purpose, supported by the article's abstract and the associated PubMed® ID, which in many cases does not allow quick evaluation of the relevant content. We describe here a text-mining workflow (Onco KEM® Builder) for automatic screening of full-text articles, aimed at extracting complete paragraphs of scientific evidence on drug-gene interactions and creating a database to be integrated into a software platform for personalized treatments in oncology.

Materials and methods: The corpus consisted of 56 PubMed articles downloaded with PaperToolBox (http://www.papertoolbox.com/), from which paragraphs were extracted for 115 sentences from CTD describing drug-gene interactions for 34 cancer-related drugs. A customized PDFMiner Python library (https://pypi.python.org/pypi/pdfminer/20140328) extracted the text from the PDF files of the full articles. A KNIME workflow based on a dictionary tagger and a bag-of-words model scored each paragraph according to the number of keyword occurrences. The keywords were enriched with synonyms for both drugs and genes. A human curator evaluated the appropriateness of the paragraphs for each interaction.

Results: For 93 of the 115 sentences, the most appropriate paragraph was ranked number one. For 53 sentences a single paragraph was ranked highest; in the remaining 40 cases up to three paragraphs were equal in ranking. For the other 22 sentences, the most appropriate paragraph was ranked no lower than fourth. For 59 sentences the extracted paragraphs came from the article abstracts; for the remaining 56 they came from the body text, showing that databases relying only on article abstracts may not provide sufficient information. Using an R script to assemble paragraphs made it possible to retrieve 19 complete paragraphs ranked first in the database. The workflow reduced the time needed to select content for the database by presenting the most appropriate paragraph with an accuracy of 80.9%.

Conclusion: Our text-mining workflow can generate a database of paragraphs describing drug-gene interactions to help clinicians decide on treatment allocation for patients.
Fanny Perraudeau, David Morley, Mohammad Afshar, Mariana Guergova-Kuras
Ariana Pharmaceuticals, 28 Rue du Docteur Finlay, 75015 Paris France, +33(0) 1 44 37 17 00, d.morley@arianapharma.com
Methods - Workflow steps
Preliminary work
The corpus consisted of 56 PubMed articles downloaded with PaperToolBox [3], from which paragraphs were extracted for 115 sentences from CTD describing drug-gene interactions for 34 cancer-related drugs. A customized PDFMiner Python library [4] extracted the text from the PDF files of the full articles. The figures, tables and references of the articles were not extracted, to avoid false negatives in paragraph extraction. The keywords making up each sentence were enriched with synonyms for both drugs and genes. Drug synonyms were taken from the CTD database; gene synonyms were taken from the gene_info file downloaded from the NCBI FTP site.
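The gene-synonym enrichment can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the tab-separated layout of the NCBI gene_info file (gene symbol in the third column, pipe-separated synonyms in the fifth, "-" when there are none). The function names and the abridged sample rows are hypothetical.

```python
import csv
import io

# Abridged, illustrative rows in the gene_info column layout
# (tax_id, GeneID, Symbol, LocusTag, Synonyms, ...). Not real file content.
SAMPLE_GENE_INFO = (
    "#tax_id\tGeneID\tSymbol\tLocusTag\tSynonyms\n"
    "9606\t931\tMS4A1\t-\tB1|Bp35|CD20\n"
    "9606\t1956\tEGFR\t-\tERBB|ERBB1|HER1\n"
)


def load_gene_synonyms(handle):
    """Map each gene symbol to the set holding the symbol and its synonyms."""
    synonyms = {}
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip the header line and blank lines
        symbol, raw = row[2], row[4]
        names = {symbol}
        if raw != "-":
            names.update(raw.split("|"))
        synonyms[symbol] = names
    return synonyms


def enrich_keywords(keywords, synonyms):
    """Add all names of every gene that one of the keywords refers to."""
    enriched = set(keywords)
    for names in synonyms.values():
        if enriched & names:
            enriched |= names
    return enriched


gene_synonyms = load_gene_synonyms(io.StringIO(SAMPLE_GENE_INFO))
print(sorted(enrich_keywords({"ofatumumab", "CD20"}, gene_synonyms)))
```

With the real file, the `io.StringIO` object would be replaced by an open file handle. Drug synonyms from CTD could be merged into the same keyword set in the same way.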
Text Mining Tool
A KNIME workflow based on a dictionary tagger and a bag-of-words model scored each paragraph according to the number of keyword occurrences. A custom R script reassembled the highest-scoring paragraphs when the PDF-to-text conversion had left them incomplete. The assembly relies on a sentence boundary detection algorithm: if the end (respectively, start) of a paragraph was not recognized as the end (respectively, start) of a sentence, the algorithm looked for the nearest paragraph not starting (respectively, ending) with a complex regular expression representing the start (respectively, end) of a sentence.
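The assembly rule can be sketched in Python as below. The original used a custom R script with more complex regular expressions. The two patterns here are simplified stand-ins, and the function name and sample fragments are hypothetical.

```python
import re

# Simplified stand-ins for the sentence-boundary regular expressions.
SENTENCE_END = re.compile(r'[.!?][")\]»]*\s*$')
SENTENCE_START = re.compile(r'^\s*["(\[«]*[A-Z]')


def assemble(paragraphs, index):
    """Return paragraph `index`, extended with neighbouring fragments."""
    text = paragraphs[index].strip()

    # Forward: while the text stops mid-sentence, append the nearest
    # following fragment that does not itself open a new sentence.
    nxt = index + 1
    while not SENTENCE_END.search(text):
        while nxt < len(paragraphs) and SENTENCE_START.search(paragraphs[nxt]):
            nxt += 1
        if nxt >= len(paragraphs):
            break
        text = f"{text} {paragraphs[nxt].strip()}"
        nxt += 1

    # Backward: while the text begins mid-sentence, prepend the nearest
    # preceding fragment that does not itself close a sentence.
    prev = index - 1
    while not SENTENCE_START.search(text):
        while prev >= 0 and SENTENCE_END.search(paragraphs[prev]):
            prev -= 1
        if prev < 0:
            break
        text = f"{paragraphs[prev].strip()} {text}"
        prev -= 1

    return text


fragments = [
    "Ofatumumab is a human IgG antibody that binds to a unique,",
    "Running header of the journal.",
    "more membrane proximal epitope of the CD20 antigen.",
]
print(assemble(fragments, 0))
```

In this example the running header is skipped in both directions, so starting from either the first or the last fragment yields the same complete sentence. A capitalized continuation, such as a line starting with a gene symbol, would defeat the simplified start pattern, which is presumably one reason the original expressions were more elaborate.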
Database building
The database was completed with the paragraphs that the curator selected as the most representative. If none of the ranked paragraphs was considered relevant, the abstract of the article was chosen instead.
Results
For 93 of the 115 sentences, the paragraph of the article that best described the drug-gene interaction was ranked number one by the workflow. For 53 sentences a single paragraph was ranked highest; in the remaining 40 cases up to three paragraphs were equal in ranking. For the other 22 sentences, the most appropriate paragraph was ranked no lower than fourth.
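Tied ranks arise naturally from count-based scoring. The sketch below is a plain-Python stand-in for the KNIME dictionary tagger and bag-of-words scoring, not the workflow itself. It assumes single-token keywords and a dense ranking in which equal scores share a rank. The function names and sample paragraphs are hypothetical.

```python
import re
from collections import Counter


def score(paragraph, keywords):
    """Count the keyword occurrences in a paragraph (case-insensitive)."""
    tokens = Counter(re.findall(r"[a-z0-9]+", paragraph.lower()))
    return sum(tokens[k] for k in {k.lower() for k in keywords})


def rank(paragraphs, keywords):
    """Return (rank, score, paragraph) tuples; equal scores share a rank."""
    scored = sorted(((score(p, keywords), p) for p in paragraphs),
                    key=lambda pair: pair[0], reverse=True)
    ranked, current_rank, previous = [], 0, None
    for value, paragraph in scored:
        if value != previous:
            current_rank += 1
            previous = value
        ranked.append((current_rank, value, paragraph))
    return ranked


paragraphs = [
    "Ofatumumab binds to a membrane proximal epitope of the CD20 antigen.",
    "Rituximab also targets CD20.",
    "Ofatumumab induces lysis of CD20-positive cells.",
    "Patients were followed for two years.",
]
keywords = {"ofatumumab", "CD20", "MS4A1"}

for position, value, paragraph in rank(paragraphs, keywords):
    print(position, value, paragraph)
```

Here two paragraphs each contain two keyword occurrences and therefore share rank one, which is the situation the curator resolves by hand.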
In summary, our text-mining workflow reduces knowledge base curation time by selecting the most appropriate paragraph with an accuracy of 80.9% (93 of 115 sentences). On average, articles can be curated five times faster with the workflow than by manual curation of full articles.
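The reported accuracy follows directly from the counts above, as this short check shows (variable names are illustrative):

```python
# Counts reported in the results.
total_sentences = 115
ranked_first = 93    # most appropriate paragraph ranked number one
single_top = 53      # a single paragraph ranked highest
tied_top = 40        # up to three paragraphs equal in ranking
ranked_lower = 22    # most appropriate paragraph ranked second to fourth

# The sub-counts are consistent with the totals.
assert single_top + tied_top == ranked_first
assert ranked_first + ranked_lower == total_sentences

accuracy = ranked_first / total_sentences
print(f"{accuracy:.1%}")  # 80.9%
```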
The need to use full articles
For 56 sentences, the extracted paragraphs came not from the article abstracts but from the body text, showing that databases relying only on article abstracts may not provide sufficient information for clinical decision making.
Conclusion
Our Onco KEM® Builder text-mining workflow can generate a knowledge base of complete paragraphs describing drug-gene interactions to support clinicians in personalized treatment decisions for patients.
References
1. Davis AP, Murphy CG, Johnson R, Lay JM, Lennon-Hopkins K, Saraceni-Richards C, Sciaky D, King BL, Rosenstein MC, Wiegers TC, Mattingly CJ. The Comparative Toxicogenomics Database: update 2013. Nucleic Acids Res. 2013 Jan 1;41(D1):D1104-14.
2. Griffith M, Griffith OL, Coffman AC, Weible JV, McMichael JF, Spies NC, Koval J, Das I, Callaway MB, Eldred JM, Miller CA, Subramanian J, Govindan R, Kumar RD, Bose R, Ding L, Walker JR, Larson DE, Dooling DJ, Smith SM, Ley TJ, Mardis ER, Wilson RK. DGIdb - mining the druggable genome. Nature Methods (2013) doi:10.1038/nmeth.2689.
3. http://www.papertoolbox.com/
4. https://pypi.python.org/pypi/pdfminer/20140328
Result Table

Steps                      | Most appropriate paragraph ranked no. 1 | Most appropriate paragraph ranked no. 2, 3 or 4
Before paragraph assembly  | 74                                      | 41
After paragraph assembly   | 93 (accuracy = 80.9%)                   | 22

Breakdown after paragraph assembly:

Ranked no. 1 (93)  | A single paragraph ranked as highest | 53
Ranked no. 1 (93)  | Up to three paragraphs ranked highest | 40
Ranked lower (22)  | Most appropriate paragraph ranked no. 2 | 12
Ranked lower (22)  | Most appropriate paragraph ranked no. 3 or 4 | 10

Introduction
Presenting key information about drug-gene interactions in a synthesized form to medical oncologists, in support of personalized treatment decisions, is a challenge and requires databases created by curation of full scientific articles. Databases such as the Comparative Toxicogenomics Database [1] (CTD) or Drug-Gene Interaction [2] (DGI) provide a brief sentence for this purpose, supported by the associated PubMed® ID, which in many cases does not allow quick evaluation of the relevant content. We describe here a text-mining workflow (Onco KEM® Builder) for automatic screening of full-text articles, aimed at extracting complete paragraphs of scientific evidence on drug-gene interactions and creating a database to be integrated into a software platform for personalized treatments in oncology.
Pipeline for data processing
1. Corpus of scientific literature (currently papers from PubMed®)
2. Onco KEM® Builder extracts the relevant knowledge: the most appropriate paragraphs explaining the drug-gene interactions
3. The medical oncologists use the Onco KEM® database to support personalized treatment decisions
[Figure: pipeline diagram with the labels Information, Onco KEM® Knowledge Base and Clinical Decision Support]
Query Example
Query: « Ofatumumab binds to the CD20 antigen where upon it induces cell lysis » + PubMedID: 22150234
Extracted paragraph: « Ofatumumab is a human IgG antibody that binds to a unique, more membrane proximal epitope of the CD20 antigen. Pre-clinical studies have shown ofatumumab to have similar antibody-dependent cellular cytotoxicity (ADCC) and improved CMC when compared to rituximab. Ofatumumab induces prolonged B-cell depletion when compared to rituximab and has been shown to slow lymphoma tumour cell growth in xenograft models. Ofatumumab was recently approved by the Food and Drug Administration (FDA) for the treatment of fludarabine and alemtuzumab refractory chronic lymphocytic leukaemia. »