Recruit - An Ontology Based Information Retrieval System for
Clinical Trials Recruitment
Diogo F. C. Patrãoᵃ, Michel Oleynikᵃ, Felipe Massicanoᵃ, Ariane Morassi Sassoᵃ
ᵃ International Center for Research (CIPE) - A. C. Camargo Cancer Center
Abstract
Clinical trials are studies designed to assess whether a new
intervention is better than the current alternatives. However,
most of them fail to recruit participants on schedule. It is hard
to use Electronic Health Record (EHR) data to find eligible
patients; therefore, studies rely on manual assessment, which
is time-consuming, inefficient and requires specialized
training. In this work we describe the design and development
of an information retrieval system with the objective of finding
eligible patients for cancer trials. The Recruit system has been
in use at A. C. Camargo Cancer Center since August/2014
and contains data from more than 500,000 patients and 9
databases. It uses ontologies to integrate data from several
sources and represent medical knowledge, which helps
enhance results. One can search both structured data and
free-text reports. The preliminary quality assessments
show excellent recall rates. Recruit proved to be a useful
tool for researchers, and its modular design could be applied
to other clinical conditions and hospitals.
Keywords:
Information Retrieval, Ontology, Clinical Trial, Patient
Selection
Introduction
A clinical trial is designed every time a new intervention
needs to be tested and compared with its current alternatives
in an effective way. This type of study usually specifies two
groups of research subjects: control (patients receiving regular
treatment or placebo) and treatment (patients receiving new
treatment). To statistically back up the conclusions, there
should be enough participants (the recruiting target) and all of
them should have similar clinical characteristics (in order to
meet the inclusion and exclusion criteria). Currently, only half
of all clinical trials reach their recruiting target, and half of
those reach the target in a timely manner [1]. The use of
existing search tools (such as data warehouses) over
Electronic Health Record (EHR) data can lead to significant
increases in recruitment rates [2–5]; however, important
challenges remain.
First, clinical data are notoriously difficult to represent
formally [6–8]. The complex nature of medical practice, with its
specialties and sub-specialties, led the industry to develop
specialized software. Thus, a hospital commonly uses several
systems for different aspects of clinical practice, and patient
information is scattered across different databases, usually
with different data models. Structured forms and free-text
reports may coexist in one database. Therefore, in order to use EHR
data to find patients based on clinical features, it is necessary
to apply data integration techniques [9].
In Computer Science, ontologies represent knowledge
formally and have been used to derive new knowledge from
facts by a process called logical inference [10]. Ontology-based
data integration employs the descriptive power of
ontologies to harmonize semantic mismatches between source
databases, particularly in the biomedical domain [11].
Additionally, the field of Information Retrieval has
developed at a fast pace [12], with continuous improvements
to algebraic models (such as latent semantic indexing) as
well as to probabilistic models, many of them empowering
natural language processing. However, language models trained on
the medical subdomain are still scarce, mainly due to ethical
concerns about publishing corpora, which hinders the development
of the area.
Several systems have tried to overcome these difficulties with
customized search engines. Most of them [13] map free-text
data onto concepts taken from terminologies such as
SNOMED CT, upon which a structured search is then
performed. The translation of eligibility criteria into
computable definitions is usually done by the searcher, but
can be aided through machine language [14,15], which is especially
useful in cross-language environments.
In this paper, we present Recruit, an information retrieval
system aimed at finding patients based on clinical
criteria using EHR data. It uses ontologies to reconcile
heterogeneous databases and to merge data from
structured forms with free text from pathology and image
reports. We describe its requirements, design, deployment
and preliminary quality assessments.
Materials and Methods
Requirements
We classified the requirements for Recruit into two types: data and
functional. Data requirements are the clinical information that the
system must be able to find, and functional requirements are tasks
that the user must be able to perform when using the system. The first
source for data requirements was the Medical Informatics
Laboratory ticket system: we analyzed the patient reports
requested during the past 12 months and identified the most
common clinical criteria. We also interviewed the clinical trial
support team, who described the most important concepts.
In particular, the system should implement the ethics
regulations regarding patient data access. System usage
should be monitored by analytics tools, to assess usage
patterns.
Design
Based on the verified requirements, we elaborated a UML
use case diagram to describe the interactions between the user
and the system, and a mock-up of the user interface. File
formats and software used should be open source.
MEDINFO 2015: eHealth-enabled Health
I.N. Sarkar et al. (Eds.)
© 2015 IMIA and IOS Press.
This article is published online with Open Access by IOS Press and distributed under the terms
of the Creative Commons Attribution Non-Commercial License.
doi:10.3233/978-1-61499-564-7-534
We chose to use ontologies to represent both patient
information and knowledge about the domain. The data model
should be as simple as needed for creating an index of
concepts for patients. Existing ontologies can be reused, as
long as they are provided with a free license and are within
the desired expressive power. Classes and properties are
annotated in Portuguese so that both the technical team and
Recruit users can understand them easily.
Quality Assessment
In order to evaluate the quality of Recruit, we created a
methodology for building gold standards that can be replicated
and that help us measure the impact of interventions in
the system. The standards are constructed by classifying the
results of predetermined queries as relevant or non-relevant in
a chosen scenario. From each standard we obtain the precision
(the fraction of retrieved patients that are relevant) and the
recall (the fraction of relevant patients that were retrieved),
which are combined into the weighted harmonic means F1
and F2. These scores allow us to assess the quality of the
result sets produced by Recruit.
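The measures above can be written down compactly. The following is a minimal sketch, not the authors' code: it assumes the standard F-beta definition, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), and the function name `f_beta` is a hypothetical helper. The input values are the precision and recall reported in Table 3.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (standard F-beta)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported in Table 3 of this paper.
P, R = 0.5150, 0.9716

f1 = f_beta(P, R, beta=1)  # precision and recall weighted equally
f2 = f_beta(P, R, beta=2)  # recall weighted more heavily than precision

print(f"F1 = {f1:.2%}, F2 = {f2:.2%}")
```

With beta = 2 the score moves toward recall, which is why F2 is the more favourable of the two measures for a system whose recall is much higher than its precision.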
First, we realized that it is important to choose queries that
are of scientific interest to professionals at the institution
and that return a reasonable number of cases.
In this way, it is possible to collaborate with researchers in the
construction of the gold standards, and the results can be
assessed in a short period of time.
Second, each result from the chosen queries had to be
classified on a 3-point scale in which (2) means relevant,
(1) unsure and (0) non-relevant. If there is certainty about the
presence of the clinical condition, usually through confirmation
by a pathology report, we classify the subject as relevant (2).
If there is doubt, either because there are no electronic
clinical documents to support the presence of the condition or
because the health professionals themselves are unsure, we
classify the case as unsure (1). Finally, if the hypothesis of the
patient meeting the clinical criteria is rejected by the
physicians, or if it is impossible to find the information even
in non-electronic format (e.g. the case is too old), we
classify the case as non-relevant (0).
The next step is to convert the 3-point scale into a binary one:
we consider only cases graded (2) as relevant (1) and all
others, including unsure cases, as non-relevant (0). Finally, it is
recommended to look into documents not searched by the system in
order to recover and analyze other cases related to the queries,
since relevant cases missed by the system are needed to estimate
recall. This allows the generation of the described set of
measures for each standard, namely precision (P), recall (R), F1
and F2, which further support the evaluation of the system’s quality.
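The conversion from the 3-point scale to binary relevance, and the resulting precision and recall, can be sketched as follows. This is an illustrative sketch only: the function names, the patient identifiers and the toy grades are invented for the example and do not come from the Recruit gold standard.

```python
from typing import Dict, Set, Tuple


def binarize(grades: Dict[str, int]) -> Set[str]:
    """Keep only cases graded 2 (relevant); 1 (unsure) and 0 become non-relevant."""
    return {patient_id for patient_id, grade in grades.items() if grade == 2}


def precision_recall(retrieved: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Precision and recall of a retrieved set against a binary gold standard."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy gold standard (invented): p5 was found only by manual review of
# documents that the system does not search.
grades = {"p1": 2, "p2": 1, "p3": 0, "p4": 2, "p5": 2}
retrieved = {"p1", "p2", "p3", "p4"}

relevant = binarize(grades)
p, r = precision_recall(retrieved, relevant)
print(f"precision = {p:.2f}, recall = {r:.2f}")
```

Treating unsure cases as non-relevant is the conservative choice described above: it can only lower the measured precision, never raise it.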
Results
Requirements
After the interview with the clinical trials support team and
the review of requests in the ticket system, we compiled the data
requirements (Table 1). We then identified the database in which
each piece of information could be found; Table 2
summarizes the characteristics of each database.
Table 1 – Data Requirements
% Tickets   Description
33.07%      Diagnosis (ICD-10), date and age at diagnosis
33.07%      Tumor Topography (ICD-O)
33.07%      Tumor Morphology (ICD-O)
13.39%      Treatment type (Surgery, Chemotherapy, Radiotherapy)
13.39%      Sample collected in biobank (by type: frozen tissue, blood, RNA, DNA)
11.02%      Chemotherapy drugs used
9.45%       Demographics (Date of birth, current age, gender)
8.66%       Department
6.30%       Cancer status (under diagnosis, not cancer, cancer, metastasis, remission)
6.30%       Alive/deceased, date of last information, death by cancer or other reasons
4.72%       TNM Clinical Stage (e.g. CS III)
3.94%       TNM Variables (e.g. T0, N2, M1)
Table 2 – Source Databases
Database                Technology   Information                                         Time span   Structured / free text
EHR                     Oracle 10i   TNM, ICD-10, Chemotherapy, Surgery, Radiotherapy    2007-2014   Structured
AP Reports              Interbase    Pathology Reports                                   2006-2014   Free text
AP Reports-legacy       Oracle 8i    Pathology Reports                                   2000-2006   Free text
AP Reports-legacy-2     DBase III    Pathology Reports                                   1992-2000   Free text
Image Reports           Oracle 10i   Image Reports                                       2006-2014   Free text
Image Reports-legacy    Oracle 8i    Image Reports                                       2002-2006   Free text
Registry                Excel        ICD-10, ICD-O, TNM, follow up, treatment            2000-2012   Structured
Biobank                 MySQL 5.1    Frozen tissue, blood, DNA/RNA sample                1999-2014   Structured
Index                   SPSS         ICD-10, ICD-O                                       1953-2000   Structured
Design
The Recruit software is divided into two parts: the backend,
responsible for extracting, transforming and indexing data;
and the frontend, the web interface with which the end user
interacts. We designed a single use case (depicted in Figure
1) to represent user authentication, search, and retrieval of
results on screen or as a CSV file.
The backend is a workflow that produces an index combining
the unstructured and structured data contained in the original
databases. This index is served by the search engine
Apache Solr, which is queried by the frontend. We
adapted the structure described in [16]: instead of view
integration, we created an ontology-based data warehouse
and then loaded its data as structured metadata into an
indexing server, along with the unstructured report texts.
Figure 1 - UML use case diagram
As the first step, we integrated data from several structured
databases into a single triple store endpoint, namely OpenLink
Virtuoso (Figure 2). The mapping from relational data to RDF
graphs was done with the help of Ontop, generating
several RDF files, which were loaded into memory using ARQ
(configured with the inference profile OWL Micro Reasoner).
Then, all inferred triples were materialized into a single
RDF file, which was later loaded into the triple store endpoint.
Finally, a set of handcrafted SPARQL queries was executed at the
endpoint in order to enrich it with information that the
inference process cannot derive due to the absence of closed world
reasoning. The resulting dataset therefore contained
consolidated information, such as the ICD-10 and ICD-O codes, the
date of the diagnosis, and the age of the patient at
diagnosis.
Figure 2 - Integration process data flow
In a second step, we published the inference-rich facts
together with unstructured data into a search engine (Figure
3). We extracted the textual content of pathology and image
reports and then associated it with the triple store endpoint
data using the patient identifier. The results of this process
were published in the search engine using its API.
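The association step described above, joining structured facts and report texts on the patient identifier, can be sketched as follows. This is a hedged illustration rather than the Recruit implementation: the function name and the document field names (`id`, `classes`, `report_text`) are hypothetical, the sample data is invented, and the actual submission of the documents to the search engine API is deliberately left out.

```python
from typing import Dict, Iterable, List, Set, Tuple


def build_documents(
    structured: Dict[str, Set[str]],
    reports: Iterable[Tuple[str, str]],
) -> List[dict]:
    """Merge inferred classes and report texts into one document per patient."""
    docs: Dict[str, dict] = {}
    for patient_id, classes in structured.items():
        docs[patient_id] = {
            "id": patient_id,
            "classes": sorted(classes),
            "report_text": [],
        }
    for patient_id, text in reports:
        # A patient may have free-text reports but no structured data.
        doc = docs.setdefault(
            patient_id, {"id": patient_id, "classes": [], "report_text": []}
        )
        doc["report_text"].append(text)
    return [docs[pid] for pid in sorted(docs)]


# Invented sample data: class names per patient and (patient, text) pairs.
structured = {"1001": {"C50", "Surgery"}, "1002": {"C18"}}
reports = [
    ("1001", "Pathology report text ..."),
    ("1003", "Image report text ..."),
]

documents = build_documents(structured, reports)
for doc in documents:
    print(doc["id"], doc["classes"], len(doc["report_text"]))
```

Each resulting dictionary is JSON-serializable, so it could be handed to an indexing server as one document per patient, with the structured classes acting as filterable metadata next to the searchable text.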
Figure 3 - Search engine publishing data flow
Knowledge representation and structured data mapping
We decided to represent structured data as an ontology because
ontologies have been used for database integration and semantic
harmonization, and because of their inference capabilities. This
consolidated data is used to feed the search engine and to
display result details in the user interface. Due to the large
data volume, we decided to use a very simple model that
allows reasonably fast inference (fast enough for weekly updates)
while remaining powerful enough to perform the required reasoning
tasks.
In our ontology, patients are represented as instances of
NamedIndividual (as defined in OWL 2 [17]) and structured
search criteria as classes directly attributed to instances. For
instance, a patient with breast cancer belongs to class
C50 (the ICD-10 [18] code for this disease). The patient URI
is based on the patient identifier, which is used in all databases.
The ICD-10 ontology was reused from NCBO [19]; the ICD-O ontology
was obtained semi-automatically from an existing relational
database. Other ontologies (treatments, chemotherapy, cancer
staging [20]) were created manually. Data properties were
created for identification data and diagnosis dates. Properties
with unique values for a patient were marked as
owl:functional to allow sorting on them. All properties and
classes were created with URIs in Portuguese and annotated
with the skos:prefLabel property. Besides the class hierarchy,
inference axioms were created to enhance results, especially for
clinical staging data. Excerpts of the ontology axioms are shown in
Figures 4 and 5.
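The effect of the class hierarchy on search can be illustrated with a small sketch. It assumes a simplification of what an OWL reasoner does: only subclass links are followed, each class has a single parent, and the class names other than C50 are invented for the example. It is not the inference procedure used by Recruit.

```python
from typing import Dict, List, Set


def superclasses(cls: str, parent_of: Dict[str, str]) -> List[str]:
    """Return cls followed by all its ancestors, following subclass links."""
    chain: List[str] = []
    current = cls
    while current is not None and current not in chain:
        chain.append(current)
        current = parent_of.get(current)
    return chain


def materialize(
    asserted: Dict[str, Set[str]], parent_of: Dict[str, str]
) -> Dict[str, Set[str]]:
    """Attach to every patient the asserted classes plus all inferred ancestors."""
    inferred: Dict[str, Set[str]] = {}
    for patient, classes in asserted.items():
        closure: Set[str] = set()
        for cls in classes:
            closure.update(superclasses(cls, parent_of))
        inferred[patient] = closure
    return inferred


# Toy hierarchy: "C50.1" and "MalignantNeoplasm" are illustrative names.
parent_of = {"C50.1": "C50", "C50": "MalignantNeoplasm"}
asserted = {"patient/1001": {"C50.1"}}

index = materialize(asserted, parent_of)
print(sorted(index["patient/1001"]))
```

Once the ancestors are materialized, a search for class C50 also returns patients recorded only with a more specific subclass, which is the kind of result enhancement the hierarchy is meant to provide.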
Figure 4 - Class hierarchy for chemotherapic drugs
SQL Mapping
In order to translate data from the relational databases into an
ontology, we created one SQL query for each mapping from a
production database to instances of the corresponding class. We
used the Ontop software to test the mappings and extract data into
N3 format. Similar information represented in different ways in the
same database was normalized in this step using regular
expressions (such as '^IV|IV$|^4$' for clinical stage IV).
Moreover, we associated a patient's ICD-10 code with his/her
diagnostic status, a structured field with three possible values:
patient under diagnosis, patient without cancer, or patient with a
confirmed cancer diagnosis.
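The behaviour of the normalization pattern quoted above can be checked with a few inputs. The sketch below uses Python's `re` module purely for illustration (in Recruit the expression is applied inside the SQL mappings), and the sample values and the helper name `is_stage_iv` are invented.

```python
import re

# Pattern quoted in the text: starts with "IV", or ends with "IV",
# or is exactly "4".
STAGE_IV = re.compile(r"^IV|IV$|^4$")


def is_stage_iv(raw: str) -> bool:
    """True if a raw stage value is recognized as clinical stage IV."""
    return STAGE_IV.search(raw.strip()) is not None


# Invented examples of how the same stage may be recorded.
for value in ["IV", "EC IV", "4", "IVB", "III", "14"]:
    print(f"{value!r}: {is_stage_iv(value)}")
```

Because the alternation has three independent branches, values such as "IVB" are also matched by the first branch, while "14" is rejected because the last branch requires the whole value to be "4".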
Figure 5 - Rules for inference of TNM clinical stage for specific topographies
Quality Assessment
The first query to which we applied our methodology was
related to the phyllodes tumor, a rare breast neoplasm. With
the help of a research nurse, we built a list of synonyms in
Portuguese for that clinical term and expressed them in
the following query: “filode OR filoides OR phyllodes OR
phylloides OR phylodes OR phyloides”. We submitted it
to Recruit, which retrieved a set of 266 patients. To this set
we added 91 more patients that were found by looking into documents
not searched by Recruit but belonging to the patients’
electronic health records, in order to reproduce
a more realistic search universe. Finally, we divided the cases
between two authors of this paper and one research nurse,
who classified them as relevant (2), non-relevant (0) or unsure
(1). A final meeting ensured that everyone had
the same understanding of each classification option,
and the measures shown in Table 3 were obtained. The
measured recall is high (97.16%), which fulfills the health
professionals' expectations. However, the precision
was 51.50%, probably because the search engine is not
tailored to the healthcare scenario. F2 (82.53%) is higher than
F1 (67.32%), since the former gives more weight to recall
while the latter weights both measures
equally.
Table 3 – Quality Evaluation
Variable (%)
Precision 51.50
Recall 97.16
F1 67.32
F2 82.53
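The synonym query used for the phyllodes gold standard can be assembled mechanically from a term list, which makes it easier to rebuild the same query when a gold standard is replicated. This is a small sketch; the helper name `or_query` is hypothetical, and only the plain OR syntax shown in the text is assumed.

```python
from typing import Iterable


def or_query(terms: Iterable[str]) -> str:
    """Join search terms with OR, dropping blanks and duplicates (order kept)."""
    cleaned = (term.strip() for term in terms)
    unique = list(dict.fromkeys(term for term in cleaned if term))
    return " OR ".join(unique)


# Synonym list taken from the query quoted in the text.
synonyms = ["filode", "filoides", "phyllodes", "phylloides", "phylodes", "phyloides"]

query = or_query(synonyms)
print(query)
```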
We noted some signs that may indicate whether a case is
relevant or not. For example, if the patient has an unknown
name or if the case is too old, it is usually less relevant due to
the lack of electronic records and the possibility that the patient
has died or been lost to follow-up. Other signs of
non-relevance are the presence of terms that negate the searched
condition, such as 'absent', and the existence of relevant
documents only on paper or microfilm. If the patient has
a considerable number of electronic records, or if the clinical
criterion appears in a recent anatomic pathology report,
the chance of the case being relevant is higher.
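One of the signs above, a negating term following the searched condition, could be approximated with a simple window check. The sketch below is a hypothetical heuristic and not part of Recruit: the cue list, the 40-character window and the function name are assumptions made for illustration ('ausente' is included as the assumed Portuguese counterpart of 'absent'), and a real solution would need a proper clinical negation detector.

```python
import re

# Hypothetical cue list; 'absent' is the example given in the text.
NEGATION_CUES = ("absent", "ausente")


def mentions_without_negation(text: str, term: str, window: int = 40) -> bool:
    """True if some mention of term is not followed by a negation cue nearby."""
    lowered = text.lower()
    for match in re.finditer(re.escape(term.lower()), lowered):
        following = lowered[match.end(): match.end() + window]
        if not any(cue in following for cue in NEGATION_CUES):
            return True
    return False


# Invented report snippets.
print(mentions_without_negation(
    "Biopsy confirms phyllodes tumor of the left breast.", "phyllodes"))
print(mentions_without_negation(
    "Phyllodes tumor absent in the sample.", "phyllodes"))
```

A check of this kind only looks at text after the term, so negations that precede the condition would be missed; it is meant to show the idea, not to measure how much precision it would recover.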
Discussion
We designed Recruit to be easily maintainable by modularizing
the data extraction into a large number of queries, each mapping
to specific classes, or groups of classes within the same
category or property, in a particular database. Adding another
database or removing a problematic query simply means
adding or removing a mapping file. We verified this in
practice while developing the mapping files incrementally.
Our database is large (more than 500,000 patients), and our
mappings generated a reasonable number of RDF triples,
which took a significant time (>6 hours) to extract. Most
of this time was spent on a couple of queries that used regular
expressions to detect patients with specific molecular test
results. The inference materialization step took more than 2
hours and required almost 20 GB of RAM in a single process.
This could be optimized by using a customized program to
extract specific axioms, such as class instances, instead of
using a general SPARQL query to extract all possible
inferences. Parallel inference is a current topic of study and
could greatly enhance inference performance. Overall, the
preparation step took more than 8 hours, which prevents the
application from being updated on a daily basis. The addition of
concepts, axioms or mappings should be carefully planned, as
the running time and memory required can grow exponentially.
Our model also connects the diagnosis directly to the
patient. Although inference takes less time this way, it
is impossible to correctly assign the date of a
diagnosis when a patient has multiple tumours. Creating
a new entity for the diagnosis would allow this, but
inference would need more resources, and it would require a
different usage of the indexing engine as well.
From its release in August 2014 until December 2014, the
system was accessed by at least 15 different
users per week. It is therefore becoming an established tool
for researchers at A. C. Camargo Cancer Center.
Recruit achieved a high recall rate, considered good by the
research nurses, since they usually look into all the results
and thus do not give much weight to precision. Although we
understand the importance of recall, precision
also has a significant role, since it can reduce the time a
researcher needs to look for relevant patients. Also, the search
engine used by Recruit is capable of dealing with
inflection in Portuguese but not with the specificities of the
healthcare field. We believe that Recruit's
performance would be enhanced if it incorporated a clinical
terminology in Portuguese, which would allow users to
discover the synonyms for terms such as “phyllodes”.
Despite having just one gold standard, we believe that it gave
us some insight into the quality of the search and into the
signs needed to enhance it. If we build more
standards to evaluate the searches made by Recruit, we can
improve our list of signs and use it to make interventions in
the system, whose results can then be measured and
compared against the standards. Each intervention has a
different difficulty level: in order to increase the weight given
to terms found in anatomic pathology reports, for example,
one would need to detect whether the searched condition is followed
by a word or phrase that negates it. All of this must be taken into
consideration when planning the next maintenance projects
and, consequently, improving Recruit’s quality.
We could have used extraction and classification techniques
to improve free-text data with structured concepts and thus
improve recall rates [21]. However, due to unavailability of
public language models and annotated corpora in the medical
domain in Portuguese, we chose not to apply experimental
results at this moment while further research is ongoing.
Other works [3–5] rely on search and integration tools already
implemented for performing the search; in particular, these
tools were designed with other objectives than searching for
patients for clinical trials, therefore they may not include all
needed search criteria, or may not be customized to do so and
keep their main objective. Recruit, being designed exclusively
for patient selection, can be customized as needed for the sole
purpose of finding potentially recruitable patients. Also, as it
relies on open source software and file formats, it does not
impose artificial limits or require expensive licenses, which is
particularly relevant for research projects. Recruit integrates
data from structured and unstructured sources, which, at least
to our knowledge, has not been done in this field.
Recruit is now a production level tool being used at A. C.
Camargo Cancer Center, but further improvements are
needed. A comprehensive set of gold standards should be
created in order to guide further enhancements to the mappings
and to evaluate the general quality of results. Data extraction and
inference should be optimized so that the index can be updated more
frequently. The ontology model should be improved by creating
an instance for each diagnosis and relating the diagnosis
classes and date to it. This would allow specifying the date of
diagnosis for patients with multiple cancers. Also, free text from
other types of documents, such as outpatient clinical notes,
should be indexed. A preliminary evaluation showed
that doctors employ a large number of acronyms and do not
write full sentences, instead listing diagnoses,
therapies and other applicable concepts, which will pose
a major challenge.
Conclusion
We have successfully implemented an ontology-based
information retrieval system for patient selection based on
clinical criteria. It uses EHR data, represents medical knowledge as
ontologies, integrates several databases and allows searching both
structured data and free text. The preliminary quality
assessments show excellent recall rates. It is not only an
important asset for A. C. Camargo Cancer Center researchers,
but the principles presented here can also be applied to a wider
range of information retrieval problems.
Acknowledgments
Marcelo Sagayama contributed to the prototype construction,
data mapping and other technical tasks. Fábio Rampazzo
Mathias provided technical assistance. Helano Carioca Freitas
and Tatiana Iafuso provided system requirements. Silvana
Soares dos Santos assisted with the gold standard construction.
References
[1] Fletcher B, Gheorghe A, Moore D, Wilson S, Damery S.
Improving the recruitment activity of clinicians in
randomised controlled trials: a systematic review. BMJ
Open. 2012; 2(1):e000496.
[2] Fink E, Kokku PK, Nikiforou S, Hall LO, Goldgof DB,
Krischer JP. Selection of patients for clinical trials: an
interactive web-based system. Artif Intell Med. 2004;
31(3):241–54.
[3] Dugas M, Lange M, Müller-Tidow C, Kirchhof P,
Prokosch H-U. Routine data from hospital information
systems can support patient recruitment for clinical
studies. Clin Trials. 2010; 7(2):183–9.
[4] Kamal J, Pasuparthi K, Rogers P, Buskirk J, Mekhjian H.
Using an information warehouse to screen patients for
clinical trials: a prototype. AMIA Annu Symp Proc.
2005; 5(6):1004.
[5] Köpcke F, Kraus S, Scholler A, Nau C, Schüttler J,
Prokosch H-U, et al. Secondary use of routinely collected
patient data in a clinical trial: an evaluation of the effects
on patient recruitment and data acquisition. Int J Med
Inform. 2013; 82(3):185–92.
[6] Beale T. The Health Record - why is it so hard? In: Haux
R, Kulikowski C, editors. IMIA Yearbook of Medical
Informatics 2005. Stuttgart: Schattauer; 2005. p. 301–4.
[7] Jaspers MWM, Knaup P, Schmidt D. The Computerized
Patient Record: Where Do We Stand? 2006; 29–39.
[8] Weng C, Tu SW, Sim I, Richesson R. Formal
representation of eligibility criteria: A literature review.
Journal of Biomedical Informatics. Elsevier; 2010. p.
451–67.
[9] Anjum A, Bloodsworth P, Branson A, Hauer T,
McClatchey R, Munir K, et al. The Requirements for
Ontologies in Medical Data Integration: A Case Study.
11th International Database Engineering and Applications
Symposium (IDEAS 2007). IEEE; 2007. p. 308–14.
[10] Brachman RJ, Levesque HJ. Knowledge representation
and reasoning. New York. Morgan Kaufmann; 2004.
[11] Sujansky W. Heterogeneous database integration in
biomedicine. J Biomed Inform. 2001; 34(4):285–98.
[12] Manning CD, Raghavan P, Schütze H. An Introduction to
Information Retrieval. Cambridge University Press; 2008.
[13] SEMCARE [Internet]. 2015 [accessed 2015 Apr 10].
Available from: http://semcare.eu/
[14] Markó K, Schulz S, Hahn U. MorphoSaurus - design and
evaluation of an interlingua-based, cross-language
document retrieval engine for the medical domain.
Methods Inf Med. 2005; 44(4):537–45.
[15] Luz FF. Querying ontologies using controlled natural
language. Universidade de São Paulo; 2013.
[16] Patrão DFDC. Desenvolvimento e Avaliação de
Ferramentas Computacionais para Triagem Automática
de Sujeitos de Pesquisa. A. C. Camargo Cancer Center;
2014.
[17] W3C. OWL 2 Web Ontology Language Primer
[Internet]. 2009 [accessed 2014 Aug 21]. Available from:
http://www.w3.org/TR/owl2-primer/
[18] World Health Organization. International Statistical
Classification of Diseases and Related Health Problems
(ICD-10). World Health Organization; 2004.
[19] Musen MA, Noy NF, Shah NH, Whetzel PL, Chute CG,
Story M-A, et al. The National Center for Biomedical
Ontology. J Am Med Inform Assoc. 2012; 19(2):190–5.
[20] Sobin L, Gospodarowicz M, Wittekind C, editors. TNM
Classification of Malignant Tumours. John Wiley &
Sons; 2011.
[21] Oleynik M. Clinical reports information retrieval.
Universidade de São Paulo; 2014.
Address for correspondence
Diogo Ferreira da Costa Patrao
International Center for Research (CIPE)
A. C. Camargo Cancer Center
djogo@cipe.accamargo.
D.F.C. Patrão et al. / Recruit – An Ontology Based Information Retrieval System for Clinical Trials Recruitment538
Article
Objectives: The primary goal of this review is to summarize significant developments in the field of Clinical Research Informatics (CRI) over the years 2015-2016. The secondary goal is to contribute to a deeper understanding of CRI as a field, through the development of a strategy for searching and classifying CRI publications. Methods: A search strategy was developed to query the PubMed database, using medical subject headings to both select and exclude articles, and filtering publications by date and other characteristics. A manual review classified publications using stages in the “research study lifecycle”, with key stages that include study definition, participant enrollment, data management, data analysis, and results dissemination. Results: The search strategy generated 510 publications. The manual classification identified 125 publications as relevant to CRI, which were classified into seven different stages of the research lifecycle, and one additional class that pertained to multiple stages, referring to general infrastructure or standards. Important cross-cutting themes included new applications of electronic media (Internet, social media, mobile devices), standardization of data and procedures, and increased automation through the use of data mining and big data methods. Conclusions: The review revealed increased interest and support for CRI in large-scale projects across institutions, regionally, nationally, and internationally. A search strategy based on medical subject headings can find many relevant papers, but a large number of non-relevant papers need to be detected using text words which pertain to closely related fields such as computational statistics and clinical informatics. The research lifecycle was useful as a classification scheme by highlighting the relevance to the users of clinical research informatics solutions.
Chapter
The development of information technology has resulted in its penetration into every area of clinical research. Various clinical systems have been developed, which produce increasing volumes of clinical data. However, saving, exchanging, querying, and exploiting these data are challenging issues. The development of Extensible Markup Language (XML) has allowed the generation of flexible information formats to facilitate the electronic sharing of structured data via networks, and it has been used widely for clinical data processing. In particular, XML is very useful in the fields of data standardization, data exchange, and data integration. Moreover, ontologies have been attracting increased attention in various clinical fields in recent years. An ontology is the basic level of a knowledge representation scheme, and various ontology repositories have been developed, such as Gene Ontology and BioPortal. The creation of these standardized repositories greatly facilitates clinical research in related fields. In this chapter, we discuss the basic concepts of XML and ontologies, as well as their clinical applications.
Presentation
Full-text available
This work aims at developing an automated classifier of pathology reports, which should be able to infer the localization (topography) and the histological type (morphology) of a tumor in the International Classification of Diseases for Oncology (ICD-O). We used data provided by the A.C. Camargo Cancer Center located in São Paulo for training and validation and assessed the information retrieval quality using a Naive Bayes classifier evaluated with F2-score. We report measures of over 73% in the topographic group and 60% in the morphologic group, which agree with similar studies.
Article
Background: Poor recruitment to randomised controlled trials (RCTs) is a widespread problem. Provision of interventions aimed at supporting or incentivising clinicians may improve recruitment to RCTs. Objectives: To quantify the effects of strategies aimed at improving the recruitment activity of clinicians in RCTs, complemented with a synthesis of qualitative evidence related to clinicians' attitudes towards recruiting to RCTs. Data sources: A systematic review of English and non-English articles identified from: The Cochrane Library, Ovid MEDLINE, Ovid EMBASE, Ovid PsycINFO, Ebsco CINAHL, Index to Theses and Open SIGLE from 2001 to March 2011. Additional reports were identified through citation searches of included articles. Study eligibility criteria: Quantitative studies were included if they evaluated interventions aimed at improving the recruitment activity of clinicians or compared recruitment by different groups of clinicians. Information about host trial, study design, participants, interventions, outcomes and host RCT was extracted by one researcher and checked by another. Studies that met the inclusion criteria were assessed for quality using a standardised tool, the Effective Public Health Practice Project tool. Qualitative studies were included if they investigated clinicians' attitudes to recruiting patients to RCTs. All results/findings were extracted, and content analysis was carried out. Overarching themes were abstracted, followed by a metasummary analysis. Studies that met the inclusion criteria were assessed for quality using the Critical Appraisal Skills Programme qualitative checklist. Data extraction: Data extraction was carried out by one researcher using predefined data fields, including study quality indicators, and verified by another. Results: Eight quantitative studies were included describing four interventions and a comparison of recruiting clinicians. One study was rated as strong, one as moderate and the remaining six as weak when assessed for quality using the Effective Public Health Practice Project tool. Effective interventions included the use of qualitative research to identify and overcome barriers to recruitment, reduction of the clinical workload associated with participation in RCTs and the provision of extra training and protected research time. Eleven qualitative studies were identified, and eight themes were abstracted from the data: understanding of research, communication, perceived patient barriers, patient–clinician relationship, effect on patients, effect on clinical practice, individual benefits for clinicians and methods associated with successful recruitment. Metasummary analysis identified the most frequently reported subthemes to be: difficulty communicating trial methods, poor understanding of research and priority given to patient well-being. Overall, the qualitative studies were found to be of good quality when assessed using the Critical Appraisal Skills Programme checklist. Conclusions: There were few high-quality trials that tested interventions to improve clinicians' recruitment activity in RCTs. The most promising intervention was the use of qualitative methods to identify and overcome barriers to clinician recruitment activity. More good quality studies of interventions are needed to add to the evidence base. The metasummary of qualitative findings identified understanding and communicating RCT methods as a key target for future interventions to improve recruitment. Reinforcement of the potential benefits, both for clinicians and for their patients, could also be a successful factor in improving recruitment. A bias was found towards investigating barriers to recruitment, so future work should also encompass a focus on successfully recruiting trials.
Conference Paper
Evidence-based medicine is critically dependent on three sources of information: a medical knowledge base, the patient's medical record and knowledge of available resources, including where appropriate, clinical protocols. Patient data is often scattered in a variety of databases and may, in a distributed model, be held across several disparate repositories. Consequently addressing the needs of an evidence-based medicine community presents issues of biomedical data integration, clinical interpretation and knowledge management. This paper outlines how the Health-e-Child project has approached the challenge of requirements specification for (bio-) medical data integration, from the level of cellular data, through disease to that of patient and population. The approach is illuminated through the requirements elicitation and analysis of Juvenile Idiopathic Arthritis (JIA), one of three diseases being studied in the EC-funded Health-e-Child project.
Article
The National Center for Biomedical Ontology is now in its seventh year. The goals of this National Center for Biomedical Computing are to: create and maintain a repository of biomedical ontologies and terminologies; build tools and web services to enable the use of ontologies and terminologies in clinical and translational research; educate their trainees and the scientific community broadly about biomedical ontology and ontology-based technology and best practices; and collaborate with a variety of groups who develop and use ontologies and terminologies in biomedicine. The centerpiece of the National Center for Biomedical Ontology is a web-based resource known as BioPortal. BioPortal makes available for research in computationally useful forms more than 270 of the world's biomedical ontologies and terminologies, and supports a wide range of web services that enable investigators to use the ontologies to annotate and retrieve data, to generate value sets and special-purpose lexicons, and to perform advanced analytics on a wide range of biomedical data.
Article
Evidence-based medicine is critically dependent on three sources of information: a medical knowledge base, the patient's medical record and knowledge of available resources, including where appropriate, clinical protocols. Patient data is often scattered in a variety of databases and may, in a distributed model, be held across several disparate repositories. Consequently addressing the needs of an evidence-based medicine community presents issues of biomedical data integration, clinical interpretation and knowledge management. This paper outlines how the Health-e-Child project has approached the challenge of requirements specification for (bio-) medical data integration, from the level of cellular data, through disease to that of patient and population. The approach is illuminated through the requirements elicitation and analysis of Juvenile Idiopathic Arthritis (JIA), one of three diseases being studied in the EC-funded Health-e-Child project.
Article
Purpose: Clinical trials are time-consuming and require constant focus on data quality. Finding sufficient time for a trial is a challenging task for the physicians involved, especially when it is conducted in parallel with patient care. From the point of view of medical informatics, the growing amount of electronically available patient data makes it possible to support two key activities: the recruitment of patients into the study and the documentation of trial data. Methods: The project was carried out at one site of a European multicenter study. The study protocol required eligibility assessment for 510 patients in one week and the documentation of 46-186 data elements per patient. A database query based on routine data from patient care was set up to identify eligible patients and its results were compared to those of manual recruitment. Additionally, routine data was used to pre-populate the paper-based case report forms and the time necessary to fill in the remaining data elements was compared to completely manual data collection. Results: Even though manual recruitment of 327 patients already achieved high sensitivity (88%) and specificity (87%), the subsequent electronic report helped to include 42 (14%) additional patients and identified 21 (7%) patients who were incorrectly included. Pre-populating the case report forms decreased the time required for documentation from a median of 255 s to 30 s. Conclusions: Reuse of routine data can help to improve the quality of patient recruitment and may reduce the time needed for data acquisition. These benefits can exceed the efforts required for development and implementation of the corresponding electronic support systems.
Book
Knowledge representation is at the very core of a radical idea for understanding intelligence. Instead of trying to understand or build brains from the bottom up, its goal is to understand and build intelligent behavior from the top down, putting the focus on what an agent needs to know in order to behave intelligently, how this knowledge can be represented symbolically, and how automated reasoning procedures can make this knowledge available as needed. This landmark text takes the central concepts of knowledge representation developed over the last 50 years and illustrates them in a lucid and compelling way. Each of the various styles of representation is presented in a simple and intuitive form, and the basics of reasoning with that representation are explained in detail. This approach gives readers a solid foundation for understanding the more advanced work found in the research literature. The presentation is clear enough to be accessible to a broad audience, including researchers and practitioners in database management, information retrieval, and object-oriented systems as well as artificial intelligence. This book provides the foundation in knowledge representation and reasoning that every AI practitioner needs.