caTIES: A Grid Based System for Coding and Retrieval of Surgical Pathology Reports and Tissue Specimens in Support of Translational Research

Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania 15232, USA.
Journal of the American Medical Informatics Association (Impact Factor: 3.5). 05/2010; 17(3):253-64. DOI: 10.1136/jamia.2009.002295
Source: PubMed


The authors report on the development of the Cancer Tissue Information Extraction System (caTIES), an application that supports collaborative tissue banking and text mining by leveraging existing natural language processing methods and algorithms, grid communication and security frameworks, and query visualization methods. The system fills an important need for text-derived clinical data in translational research such as tissue banking and clinical trials. The design of caTIES addresses three critical issues for informatics support of translational research: (1) federation of research data sources derived from clinical systems; (2) expressive graphical interfaces for concept-based text mining; and (3) a regulatory and security model for supporting multi-center collaborative research. Implementation of the system at several Cancer Centers across the country is creating a potential network of caTIES repositories that could provide millions of de-identified clinical reports to users. The system provides an end-to-end application of medical natural language processing to support multi-institutional translational research programs.

    • "There is an extensive amount of existing work in creating clinical natural language processing (NLP) systems to extract information from free text in specific disease domains. Two salient examples are the Cancer Text Information System (caTIES) [1] and SymText [2]. caTIES has been developed at the University of Pittsburgh to extract coded information from surgical pathology reports using terms from the National Cancer Institute (NCI) Thesaurus."
    ABSTRACT: Epilepsy is a common serious neurological disorder with a complex set of possible phenotypes ranging from pathologic abnormalities to variations in electroencephalogram. This paper presents a system called Phenotype Extraction in Epilepsy (PEEP) for extracting complex epilepsy phenotypes and their correlated anatomical locations from clinical discharge summaries, a primary data source for this purpose. PEEP generates candidate phenotype and anatomical location pairs by embedding a named entity recognition method, based on the Epilepsy and Seizure Ontology, into the National Library of Medicine's MetaMap program. Such candidate pairs are further processed using a correlation algorithm. The derived phenotypes and correlated locations have been used for cohort identification with an integrated ontology-driven visual query interface. To evaluate the performance of PEEP, 400 de-identified discharge summaries were used for development and an additional 262 were used as test data. PEEP achieved a micro-averaged precision of 0.924, recall of 0.931, and F1-measure of 0.927 for extracting epilepsy phenotypes. The performance on the extraction of correlated phenotypes and anatomical locations shows a micro-averaged F1-measure of 0.856 (Precision: 0.852, Recall: 0.859). The evaluation demonstrates that PEEP is an effective approach to extracting complex epilepsy phenotypes for cohort identification.
    Journal of Biomedical Informatics 06/2014; 51. DOI:10.1016/j.jbi.2014.06.006 · 2.19 Impact Factor
    • "Here we note that we are interested in the primary tumor generic site (that is, the top level two digit ICD-O-3 main site code) and not all mentions of anatomical sites in pathology reports. Identifying all tumor site mentions can be addressed by state-of-the-art cancer information extraction systems such as caTIES [7], which in addition can perform complex queries to retrieve specific pathology reports. Although many mentions of anatomical sites occur in pathology reports, in the majority of cases there is only one primary tumor site, and other non-primary sites referred to within the report provide the context of patient’s history or other metastatic progressions of the primary tumor."
    ABSTRACT: Although registry specific requirements exist, cancer registries primarily identify reportable cases using a combination of particular ICD-O-3 topography and morphology codes assigned to cancer case abstracts of which free text pathology reports form a main component. The codes are generally extracted from pathology reports by trained human coders, sometimes with the help of software programs. Here we present results that improve on the state-of-the-art in automatic extraction of 57 generic sites from pathology reports using three representative machine learning algorithms in text classification. We use a dataset of 56,426 reports arising from 35 labs that report to the Kentucky Cancer Registry. Employing unigrams, bigrams, and named entities as features, our methods achieve a class-based micro F-score of 0.9 and macro F-score of 0.72. To our knowledge, this is the best result on extracting ICD-O-3 codes from pathology reports using a large number of possible codes. Given the large dataset we use (compared to other similar efforts) with reports from 35 different labs, we also expect our final models to generalize better when extracting primary sites from previously unseen reports.
    03/2013; 2013:112-116.
    • "Paralleling the growth in CRI prominence, JAMIA has received an increasing number of CRI submissions. In 2010, five published articles were completely focused on CRI [10–14], while in 2011 this number rose to 23 [15–37], accounting for 11.5% of all JAMIA articles for that year. There was a special section focused on CRI papers in the December 2011 supplement issue."
    ABSTRACT: Clinical research informatics is the rapidly evolving sub-discipline within biomedical informatics that focuses on developing new informatics theories, tools, and solutions to accelerate the full translational continuum: basic research to clinical trials (T1), clinical trials to academic health center practice (T2), diffusion and implementation to community practice (T3), and 'real world' outcomes (T4). We present a conceptual model based on an informatics-enabled clinical research workflow, integration across heterogeneous data sources, and core informatics tools and platforms. We use this conceptual model to highlight 18 new articles in the JAMIA special issue on clinical research informatics.
    Journal of the American Medical Informatics Association 04/2012; 19(e1):e36-e42. DOI:10.1136/amiajnl-2012-000968 · 3.50 Impact Factor