Assisted annotation of medical free text
using RapTAT
Glenn T Gobbel,1,2,3 Jennifer Garvin,4,5,6,7 Ruth Reeves,1,2 Robert M Cronin,2,3 Julia Heavirland,4 Jenifer Williams,4 Allison Weaver,4 Shrimalini Jayaramaraja,3 Dario Giuse,2 Theodore Speroff,1,3,8 Steven H Brown,1,2 Hua Xu,9 Michael E Matheny1,2,3,8
Additional material is published online only. To view please visit the journal online (http://dx.doi.org/10.1136/amiajnl-2013-002255).
For numbered affiliations see end of article.
Correspondence to Dr Glenn T Gobbel, Department of Veterans Affairs, Tennessee Valley Healthcare, 1310 24th Ave South, 4th Floor GRECC, Nashville, TN 37212, USA; glenn.t.gobbel@vanderbilt.edu
Received 6 August 2013
Revised 13 December 2013
Accepted 17 December 2013
Published Online First 14 January 2014
To cite: Gobbel GT, Garvin J, Reeves R, et al. J Am Med Inform Assoc 2014;21:833-841.
ABSTRACT
Objective To determine whether assisted annotation
using interactive training can reduce the time required to
annotate a clinical document corpus without introducing
bias.
Materials and methods A tool, RapTAT, was
designed to assist annotation by iteratively
pre-annotating probable phrases of interest within a
document, presenting the annotations to a reviewer for
correction, and then using the corrected annotations for
further machine learning-based training before pre-
annotating subsequent documents. Annotators reviewed
404 clinical notes either manually or using RapTAT
assistance for concepts related to quality of care during
heart failure treatment. Notes were divided into 20 batches of 19-21 documents for iterative annotation and training.
Results The number of correct RapTAT pre-annotations increased significantly and annotation time per batch decreased by ~50% over the course of annotation. Annotation rate increased from batch to batch for assisted but not manual reviewers. Pre-annotation F-measure increased from 0.5-0.6 to >0.80 (relative to both assisted reviewer and reference annotations) over the first three batches and more slowly thereafter. Overall inter-annotator agreement was significantly higher between RapTAT-assisted reviewers (0.89) than between manual reviewers (0.85).
Discussion The tool reduced workload by decreasing
the number of annotations needing to be added and
helping reviewers to annotate at an increased rate.
Agreement between the pre-annotations and reference
standard, and agreement between the pre-annotations
and assisted annotations, were similar throughout the
annotation process, which suggests that pre-annotation
did not introduce bias.
Conclusions Pre-annotations generated by a tool
capable of interactive training can reduce the time
required to create an annotated document corpus by up
to 50%.
INTRODUCTION
Natural language processing (NLP) systems can
help to monitor patient care by automated process-
ing of medical records and extraction of
quality-of-care indicators [1-5].
Such systems are often
designed to replace manual review but may still
require a manually annotated corpus for initial
training or formal evaluation because the structure
of documents, syntax, and terminology used for
expression can vary among domains and personnel;
an existing system optimized for one medical spe-
cialty or organization may not work well for
another. As a result, NLP systems that rely on
machine learning may have to be trained using
annotated documents from the intended medical
domain, and both rule-based and machine learning-
based NLP tools require testing and validation
before deployment within a new medical field.
Because NLP can reduce the cost and increase
the efficiency of data extraction relative to manual
annotation, a recent Veterans Affairs
(VA)-supported project has focused on developing
an NLP system to support automated monitoring
of care for patients with congestive heart failure
(CHF). That system aims to detect clinical signs
and treatments that can show the consistency with
which providers adhere to American Heart
Association (AHA) guidelines for CHF care.
Following AHA guidelines has been shown to
reduce hospital admissions, improve quality of life,
and decrease mortality of patients with CHF [6, 7], so
rapidly identifying discrepancies between guidelines
and care can help to mitigate decreases in care
quality. To achieve this aim, the NLP system will
need to identify seven concepts within clinical
notes: (1) mentions of ACE inhibitor administra-
tion; (2) mentions of angiotensin II receptor
blocker (ARB) administration; (3) mentions of ejec-
tion fraction; (4) quantitative measures of ejection
fraction; (5) mentions of left ventricular systolic
function; (6) qualitative measures of left ventricular
systolic function; and (7) documented reasons for
not administering ACE inhibitors or ARBs when
otherwise indicated. This study describes the design
and evaluation of an assisted annotation tool
designed to support the development of an anno-
tated reference corpus, which will be used to train
and test the machine learning-based NLP system.
BACKGROUND
The lack of annotated datasets can substantially
hinder the development and use of NLP on clinical
text [8].
Annotating clinical documents to create an
annotated corpus is laborious and expensive. High
cost and labor requirements are incurred owing to
the requirement for reviewers with sufficient domain expertise to identify relevant text [9].
Annotation commonly employs two independent
annotators for review and a third person to adjudi-
cate disagreements [10, 11].
Furthermore, the docu-
ment corpus must be large enough to allow for
accurate training and testing.
Given the importance of annotation to NLP system develop-
ment, studies have focused largely on two primary methods of
reducing the burden and cost of generating annotated corpora:
active learning and pre-annotation. Active learning can decrease the cost of annotating text by actively involving the learning algorithm in the document selection process [12], and its goal is to train the system while requiring as few samples as possible. It has been applied in a wide variety of language processing tasks [13]; for example, part-of-speech tagging [14, 15], text categorization [16, 17], named entity recognition [18, 19], and classification of assertions [20]. Active learning has been reported to reduce the number of training samples required by 38-63% [18, 20].
Because the focus is on reducing training sample size, active
learning does not reduce the burden when annotating a set
number of documents. In contrast, the goal of pre-annotation is
to reduce the time and/or effort required to annotate a docu-
ment by reducing the number of annotations a reviewer must
add. Pre-annotation is generally carried out using a dictionary
generated for the annotation task or an existing NLP system.
Recently, Lingren et al [21] created a dictionary to generate pre-annotations in clinical trial announcements, focusing on the impact of pre-annotation on the ability of reviewers to label disease and symptom-related concepts. Pre-annotation reduced the time needed for review by 14-21% compared with fully manual annotation. Investigations using existing NLP systems for pre-annotation of non-medical documents reported reductions in annotation time of 50-58% for named entity recognition, part-of-speech tagging, and parsing [22-24].
Despite the reported benefits of pre-annotation, there are some potentially important considerations for its use. Inaccurate pre-annotations may require deletion or correction, and evidence indicates that time-savings correlate with pre-annotation accuracy [25, 26]. For some tasks, pre-annotation may not alter annotation time [27], and the presence of multiple, inaccurate pre-annotations may instead increase annotation time [25, 28]. Also, pretrained systems capable of pre-annotating for a specific task or medical realm either may not exist or may not be sufficiently accurate when used within a new domain. Although it is possible to create task-specific pre-annotation systems [21], doing so may require substantial effort and offset the time-savings afforded by pre-annotation. Furthermore, although some studies have found no evidence to suggest that pre-annotation induces bias or reduces the quality of annotating text for biomedical concepts or part-of-speech [21, 29, 30], Fort and Sagot suggest that pre-annotations can induce bias, leading to decreases in random errors but increases in systematic errors by reviewers [25].
This study describes the design and evaluation of an assisted
annotation tool that may serve as an alternative approach to pre-
viously described methods of pre-annotation. In it, we assess the impact on annotation burden of generating pre-annotations interactively through iterative machine learning, as implemented in the Rapid Text Annotation Tool (RapTAT). Specifically, the study evaluates whether RapTAT can support interactive, assisted annotation and reduce the time required for annotation without negatively affecting inter-annotator consistency or inducing annotation bias relative to manual review.
METHODS
Sampling and population
The study corpus consisted of notes on patients with CHF,
including discharge summaries, emergency department triage
and nursing notes, internal medicine attending notes, neurology
resident notes, physician discharge notes, physician history and
physical notes, and primary care outpatient notes. Documents
were selected from a larger corpus consisting of a random
sample of documents generated between September 2007 and
September 2008 by six independent VA medical centers from the western USA. Patients were excluded if they (a) had participated in trials related to ACE inhibitors or ARBs; (b) had comfort measure advanced directives; (c) were fitted with heart assist devices (except pacemakers or defibrillators); or (d) had had a heart or heart/lung transplant. The final study corpus contained 404 documents from 171 patients. The Tennessee Valley
and Salt Lake City Health System VA and University of Utah
institutional review boards and research and development com-
mittees approved the study and granted a waiver regarding the
need to obtain informed consent and Health Insurance
Portability and Accountability Act authorization.
Schema development
A cardiology expert and three experienced annotators designed
the annotation schema using an iterative process involving
schema generation, annotation of a document sample, review of
the annotations, and schema revision. The schema development
process defined the key concepts that occur within the medical record and that relate to clinical care guidelines for patients with CHF. According to the guidelines, patients in systolic heart failure with an ejection fraction of ≤40% should be treated with ACE inhibitors or, alternatively, ARBs [31]. The schema was designed to provide annotations so that the NLP tool could identify (1) evidence of heart failure, (2) whether the patient was receiving ACE inhibitors or ARBs, and (3) whether a reason was provided for not prescribing ACE inhibitors or ARBs to patients with heart failure. The final schema contained seven concepts (table 1), and the task of annotators was to identify phrases in the text that express those concepts.
Annotator training
Four reviewers, all experienced in clinical note annotation, were
responsible for annotation. All annotators were provided with
annotation guidelines specific to the schema. Two were respon-
sible for manual annotations only, and the other two carried out
only RapTAT-assisted annotation. To train all reviewers with
respect to the annotation schema, the creators of the schema
used consensus annotation to generate a training set of 30 docu-
ments distinct from the study corpus. Reviewers annotated the
training set in batches of 10 using the Knowtator annotation
tool (figure 1) [32]. They were required to achieve an agreement score exceeding 80% between their annotations and the adjudicated training set before proceeding with review of documents in the study corpus, where

Agreement = Matches / (Matches + Non-Matches)   (1)
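For illustration, the agreement score in equation 1 can be computed as in the sketch below. This is a minimal sketch rather than the Knowtator or RapTAT implementation, and it assumes a match is a pair of annotations from the two reviewers that map to the same concept and overlap in their character offsets; the exact matching criterion used for training agreement is not spelled out above.

```python
# Minimal sketch (not the Knowtator/RapTAT implementation) of the agreement
# score in equation 1. Annotations are modeled as (start, end, concept) tuples
# with character offsets; a "match" is assumed to require the same concept and
# overlapping offsets.

def is_match(a, b):
    """True if two (start, end, concept) annotations share a concept and overlap."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]

def agreement(annotations_a, annotations_b):
    """Agreement = Matches / (Matches + Non-Matches) for two annotation sets."""
    matched_b = set()
    matches = 0
    for a in annotations_a:
        for i, b in enumerate(annotations_b):
            if i not in matched_b and is_match(a, b):
                matches += 1
                matched_b.add(i)
                break
    # Every annotation in either set that found no partner counts as a non-match.
    non_matches = (len(annotations_a) - matches) + (len(annotations_b) - matches)
    total = matches + non_matches
    return matches / total if total else 1.0

# Example: two reviewers annotating the same note.
reviewer_1 = [(10, 25, "ACEI"), (40, 52, "EF")]
reviewer_2 = [(10, 25, "ACEI"), (40, 50, "EF"), (60, 70, "ARB")]
print(round(agreement(reviewer_1, reviewer_2), 2))  # 0.67 (2 matches, 1 non-match)
```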
Annotation of the study corpus
Each document in the corpus was randomly assigned to one of
20 batches, and each batch contained 19-21 documents (figure 2).
The batches were used as units of analysis for statistical purposes
and to identify document sets for training RapTAT during
assisted annotation. Assisted reviewers annotated the first docu-
ment batch without any pre-annotation to provide the initial
training of the machine learning algorithms within RapTAT. The
next batch was pre-annotated by RapTAT based on this training,
displayed within Knowtator for review and correction by the
assisted annotators, and the corrected annotations were entered
into RapTAT to update its training before pre-annotating the
subsequent batch. This iterative process of pre-annotation, cor-
rection, and updating of RapTAT training was carried out by
separate instances of RapTAT for each of the two assisted
reviewers, and it continued until the final batch had been cor-
rected following pre-annotation. Manual annotators also used
Knowtator for annotating each batch, but the documents were
not pre-annotated. An adjudicator who was neither a manual
nor assisted annotator reviewed the manual annotations to
produce the reference standard. Inter-annotator agreement
(IAA) was calculated using equation 1.
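The batch-wise workflow just described (manual annotation of the first batch, then repeated pre-annotation, correction, and retraining) can be sketched as follows. The sketch is a hypothetical outline, not RapTAT's actual interface: a deliberately simple dictionary-based annotator stands in for RapTAT's probabilistic models, and review_and_correct stands in for the human correction step performed in Knowtator (here it simply accepts the suggestions unchanged).

```python
# Hypothetical outline of the iterative assisted-annotation loop. The
# DictionaryAnnotator is a deliberately simple stand-in for RapTAT's
# probabilistic models: it memorizes phrase -> concept pairs from corrected
# batches and suggests them in later batches.

class DictionaryAnnotator:
    def __init__(self):
        self.phrase_to_concept = {}

    def pre_annotate(self, documents):
        """Suggest (phrase, concept) pairs for each document from what has been learned."""
        return [[(p, c) for p, c in self.phrase_to_concept.items() if p in doc]
                for doc in documents]

    def update(self, corrected_annotations):
        """Incorporate reviewer-corrected annotations as additional training data."""
        for doc_annotations in corrected_annotations:
            for phrase, concept in doc_annotations:
                self.phrase_to_concept[phrase] = concept


def review_and_correct(documents, pre_annotations):
    # Stand-in for the human review step performed in Knowtator; a real reviewer
    # would add missing annotations and edit or delete incorrect ones. Here the
    # suggestions are simply accepted.
    return pre_annotations if pre_annotations is not None else [[] for _ in documents]


batches = [["patient on lisinopril", "ef 35%"], ["started lisinopril today"]]
model = DictionaryAnnotator()

# Batch 1 is annotated fully manually and provides the initial training data.
manual_batch_1 = [[("lisinopril", "ACEI")], [("ef 35%", "EF quantitation")]]
model.update(manual_batch_1)

# Each later batch is pre-annotated, corrected, and fed back before the next batch.
for batch in batches[1:]:
    suggestions = model.pre_annotate(batch)
    corrected = review_and_correct(batch, suggestions)
    model.update(corrected)

print(model.pre_annotate(["continue lisinopril"]))  # [[('lisinopril', 'ACEI')]]
```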
Text processing
RapTAT learns to pre-annotate documents with the likely anno-
tations of a reviewer based on iterative feedback from that same
reviewer. The tool used two different probabilistic models to
estimate the likelihood of a reviewer (1) annotating a particular
phrase and (2) mapping that phrase to a particular schema
concept (table 1). For both models, we defined a token as a con-
tiguous group of characters that corresponded to a word, value,
or unit of measure, and a phrase as a contiguous sequence of
one or more tokens that is representative of one of the schema
concepts. Considering only token sequences (S) in a phrase
without regard to context, the probability of annotation (A) of a
given sequence is
Table 1 Schema for the seven concepts annotated within the corpus and text samples demonstrating phrases that should be annotated

Concept | Number of documents containing concept | Number of patients with concept | Sample text*
ACEI | 272 | 132 | "ACEI," "ACE inhibitor," "Altace," "Vaseretic," "Captopril," "Lisinopril"
Angiotensin II receptor blocker | 107 | 53 | "ARB," "Angiotensin receptor blocker," "Sartans," "Losartan"
EF | 201 | 118 | "Estimated ejection fraction," "EF," "LVEF," "Ejection fraction"
EF quantitation | 197 | 116 | "EF=60-70%," "EF is about 30%," "Ejection fraction in the range of 40 to 50%"
LV systolic function/dysfunction | 79 | 51 | "LV systolic function," "Systolic dysfunction," "LV function," "Normal LV size and function"
LV systolic function value | 76 | 48 | "Mild systolic dysfunction," "Systolic function is borderline normal"
Reason not on ACEI/ARB | 40 | 26 | "Elevated creatinine levels," "Developing sepsis," "Patient refuses to take ACEI," "Renal disease"

*Annotated phrases in bold. Examples corresponding to each concept were provided for reviewers as part of the annotation guidelines, but they were not meant to comprehensively represent all phrases that might refer to a given concept. For the concept "Reason not on ACE inhibitor/ARB," reviewers were instructed to annotate a phrase only when it was provided as an explicit reason for not prescribing one of the drugs.
ACE, angiotensin converting enzyme; ACEI, ACE inhibitor; ARB, angiotensin II receptor blocker; EF, ejection fraction; LV, left ventricular.
Figure 1 Screen capture of the Knowtator annotation plug-in within the Protégé application. The displayed document is synthetic but contains text
representative of that found within the study corpus. Schema concepts are listed on the left. For each corpus document, reviewers use the input
device of the computer to highlight all phrases mapping to one of the schema concepts and to select the concept associated with each highlighted
phrase.
P(A|S) = Number of Annotations of S / Number of Occurrences of S   (2)

We modified this equation for use in RapTAT because, if this simple phrase identification model is used, subsequences shorter than the complete annotated phrase do not enter into probability calculations. For example, if "high fever of unknown origin" was annotated, the probability of annotating the subsequence "high fever" would not increase. Such a model could reduce recall by underestimating the probability of annotating token sequences that occur infrequently as complete annotated phrases even though they might occur frequently as subsequences. We therefore adjusted RapTAT to give partial credit to subsequences (table 2). Each subsequence of length i within an annotated phrase of length j was credited with an annotation count of i/j (numerator, equation 2). Thus, the credited count was lower for subsequences that were particularly short relative to the length of the complete annotated phrase. All token sequences whose first token was not the first token in an annotated phrase were considered unlabeled and contributed equally to the number of sequence occurrences (denominator, equation 2).
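The partial-credit counting scheme can be made concrete with a short sketch. It is a simplified reconstruction based on equation 2 and table 2, not the RapTAT source: each prefix subsequence of an annotated phrase is credited i/j of an annotation, every contiguous token sequence in the document contributes to the occurrence count, and P(A|S) is the ratio of the two.

```python
from collections import defaultdict

# Simplified reconstruction of the phrase-annotation probability model
# (equation 2) with the partial-credit scheme from table 2. A subsequence of
# length i taken from the start of an annotated phrase of length j is credited
# i/j annotations; P(A|S) is credited annotations divided by total occurrences.

annotation_credit = defaultdict(float)   # numerator of equation 2
occurrence_count = defaultdict(int)      # denominator of equation 2

def credit_annotated_phrase(tokens):
    """Credit every prefix of an annotated phrase with a partial annotation count."""
    j = len(tokens)
    for i in range(1, j + 1):
        prefix = tuple(tokens[:i])
        annotation_credit[prefix] += i / j

def count_occurrences(document_tokens, max_len=7):
    """Count every contiguous token sequence (up to max_len tokens) in a document."""
    n = len(document_tokens)
    for start in range(n):
        for end in range(start + 1, min(start + max_len, n) + 1):
            occurrence_count[tuple(document_tokens[start:end])] += 1

def p_annotation(sequence):
    seq = tuple(sequence)
    if occurrence_count[seq] == 0:
        return 0.0
    return annotation_credit[seq] / occurrence_count[seq]

# Training example mirroring table 2: "LV systolic function" is annotated once.
credit_annotated_phrase(["LV", "systolic", "function"])
count_occurrences(["LV", "systolic", "function", "is", "normal"])
print(round(annotation_credit[("LV", "systolic")], 2))        # 0.67
print(round(p_annotation(["LV", "systolic", "function"]), 2))  # 1.0
```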
Estimating the likelihood of mapping a phrase to a concept was accomplished using a multinomial naïve Bayes classifier. The classifier calculated the most probable concept for a given phrase, using the equation

P(Ci|T1,...,Tk) = P(Ci) × P(T1|Ci) × ... × P(Tk|Ci) / P(T1,...,Tk)   (3)

where P(Ci) refers to the probability of occurrence of the ith concept, k is the number of tokens in the phrase, and Tk refers to the token at the kth position in the sequence. The value of P(Tk|Ci) is provided by the equation

P(Tk|Ci) = (Occurrences of Token T at position k when a Phrase Maps to Concept Ci) / (Occurrences of Concept Ci)   (4)

Because the denominator in equation 3, P(T1,...,Tk), is constant when mapping a given phrase, finding the most probable concept for mapping is reduced to identifying the one that maximizes the numerator. Laplace smoothing adjusted for the occurrence of tokens missing from the training data [33]. Multiple studies have used multinomial naïve Bayes models for text classification [34], although, to the best of our knowledge, the use of token position as a feature for medical concept mapping is unique to RapTAT [35].
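A compact sketch of the position-aware multinomial naïve Bayes classifier of equations 3 and 4 is shown below. It is a reconstruction under stated assumptions rather than the RapTAT code: concept priors are estimated from concept counts, token likelihoods are conditioned on the token's position within the phrase, and add-one (Laplace) smoothing handles tokens unseen during training.

```python
import math
from collections import defaultdict

# Reconstruction of the concept-mapping classifier (equations 3 and 4):
# multinomial naive Bayes in which each feature is a (position, token) pair,
# with add-one (Laplace) smoothing for unseen tokens.

class PositionalNaiveBayes:
    def __init__(self):
        self.concept_counts = defaultdict(int)                     # occurrences of Ci
        self.token_counts = defaultdict(lambda: defaultdict(int))  # (Ci, k) -> token -> count
        self.vocabulary = set()

    def train(self, phrase_tokens, concept):
        self.concept_counts[concept] += 1
        for k, token in enumerate(phrase_tokens):
            self.token_counts[(concept, k)][token] += 1
            self.vocabulary.add(token)

    def classify(self, phrase_tokens):
        total = sum(self.concept_counts.values())
        vocab_size = len(self.vocabulary)
        best_concept, best_score = None, float("-inf")
        for concept, c_count in self.concept_counts.items():
            # log P(Ci): prior from concept frequency
            score = math.log(c_count / total)
            for k, token in enumerate(phrase_tokens):
                counts = self.token_counts[(concept, k)]
                # Laplace-smoothed log P(Tk | Ci), following equation 4
                score += math.log((counts[token] + 1) / (c_count + vocab_size))
            if score > best_score:
                best_concept, best_score = concept, score
        return best_concept

nb = PositionalNaiveBayes()
nb.train(["ace", "inhibitor"], "ACEI")
nb.train(["lisinopril"], "ACEI")
nb.train(["ejection", "fraction"], "EF")
print(nb.classify(["ejection", "fraction"]))  # EF
```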
RapTAT system design
The RapTAT system was programmed in Java and consisted of one module that determined the likelihood of phrase annotation and a second that determined the likelihood of a given phrase mapping to a particular concept (figure 3). Phrases analyzed by the system were limited to contiguous sequences of no more than 7 tokens. Before analysis by the two RapTAT modules, the text was preprocessed, which consisted of detecting sentence boundaries, dividing each sentence into tokens, removing stop word tokens ("and," "by," "for," "in," "nos," "of," "on," "the," "to," and "with"), and identifying and adding the appropriate part of speech to the token as a suffix. The preprocessing steps were carried out using the OpenNLP libraries (Apache Software Foundation). All versions of RapTAT are available at http://code.google.com/p/raptat/, and V.0.6a was used for this study.
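The preprocessing steps can be approximated as in the following sketch. RapTAT used the OpenNLP libraries for these steps; here, sentence splitting and tokenization are naive regular expressions and the part-of-speech tagger is reduced to a placeholder, so the sketch only illustrates the order of operations (sentence detection, tokenization, stop word removal, POS suffixing).

```python
import re

# Rough approximation of the preprocessing pipeline. RapTAT used the OpenNLP
# libraries; here sentence splitting and tokenization are naive regular
# expressions, and the POS tagger is a placeholder that labels every token NN.

STOP_WORDS = {"and", "by", "for", "in", "nos", "of", "on", "the", "to", "with"}

def split_sentences(text):
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    # Keep word characters plus a few symbols common in clinical text (%, =, /).
    return re.findall(r"[A-Za-z0-9%=/]+", sentence)

def pos_tag(token):
    return "NN"  # placeholder; a real tagger would assign the actual part of speech

def preprocess(text):
    processed = []
    for sentence in split_sentences(text):
        tokens = [t for t in tokenize(sentence) if t.lower() not in STOP_WORDS]
        processed.append([f"{t}_{pos_tag(t)}" for t in tokens])
    return processed

print(preprocess("The EF is about 30%. Patient on lisinopril."))
# [['EF_NN', 'is_NN', 'about_NN', '30%_NN'], ['Patient_NN', 'lisinopril_NN']]
```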
Evaluation measures
RapTAT was evaluated based on the number of true positives
(TPs), false negatives (FNs), and false positives (FPs) within the
pre-annotations. Precision, recall, and F-measure provided mea-
sures of performance of the RapTAT tool and were calculated
for both the corrected annotations from the RapTAT-assisted
annotators and the reference standard described above. A TP
was defined as an overlap of one or more tokens between a RapTAT-generated annotation and a reference standard annotation that mapped to the
same concept. RapTAT automatically scored TPs, FPs, and FNs
and calculated precision, recall, and F-measure according to the equations

Precision = TP/(TP + FP)   (5)

Recall = TP/(TP + FN)   (6)

F-Measure = 2 × Precision × Recall/(Precision + Recall)   (7)

Figure 2 Document flow for generation of the annotated study corpus using manual review and adjudication or RapTAT-assisted review.

Table 2 Examples demonstrating how annotated phrases and their subsequences are counted during training, where n represents the number of tokens in the phrase

Sequence length | Phrase | Tokens | Number of annotations credited to sequence
Full, annotated phrase | "LV systolic function" | 3 | 1.0
n-1 subsequence | "LV systolic" | 2 | 0.67
n-2 subsequence | "LV" | 1 | 0.33
Full, annotated phrase | "Renal disease" | 2 | 1.0
n-1 subsequence | "Renal" | 1 | 0.5

LV, left ventricular.
We used leave-one-out cross-validation to estimate the perform-
ance of RapTAT with respect to each of the schema concepts.
Cross-validation consisted of training RapTAT using all but one
of the annotated documents from a given reviewer; RapTAT
then generated annotations for the "left-out" document, which
were compared with those of the reference standard. This
process was repeated for each document and reviewer.
Precision, recall, and F-measure for a given concept were calcu-
lated by combining the TPs, FPs, and FNs for that concept.
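The scoring just described can be restated in code. The sketch below is an illustrative reimplementation of the stated definitions rather than RapTAT's built-in scorer: a TP is a system annotation that overlaps a reference annotation by at least one token and maps to the same concept, and precision, recall, and F-measure follow equations 5-7.

```python
# Illustrative scoring of pre-annotations against a reference standard, using
# the definitions above: a true positive is an overlap of one or more tokens
# between a system annotation and a reference annotation with the same concept.
# Annotations are modeled as (start_token, end_token, concept) with an
# exclusive end index.

def score(system, reference):
    tp = 0
    unmatched_reference = list(reference)
    for s_start, s_end, s_concept in system:
        match = next((r for r in unmatched_reference
                      if r[2] == s_concept and s_start < r[1] and r[0] < s_end), None)
        if match is not None:
            tp += 1
            unmatched_reference.remove(match)
    fp = len(system) - tp
    fn = len(unmatched_reference)
    return tp, fp, fn

def precision_recall_f(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f

system_annotations = [(0, 3, "EF quantitation"), (10, 12, "ACEI"), (20, 22, "ARB")]
reference_annotations = [(1, 3, "EF quantitation"), (10, 12, "ACEI"), (30, 32, "EF")]
tp, fp, fn = score(system_annotations, reference_annotations)
print(precision_recall_f(tp, fp, fn))  # precision, recall, and F of 0.67 each
```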
Reviewer annotation time and rate
To assess batch-to-batch changes in annotation time, each
RapTAT-assisted reviewer recorded the time required to review
each document. Time for each batch was normalized to batch
size in kilobytes. Because correct pre-annotations might decrease annotation time, and incorrect pre-annotations might increase it, we also calculated the annotation rate of both manual and assisted reviewers with respect to only those annotations that were added or corrected. Correction was defined as either modifying the beginning or end offsets of the annotation or changing the concept to which the phrase mapped. We defined the anno-
tation rate as the number of annotations added or corrected per
minute based on timestamps generated by Knowtator for each
annotation. Knowtator did not create timestamps for FP pre-
annotations that were removed during assisted review, so the
occurrence of, and the time taken for, such corrections were not
explicitly included in the calculations of annotation rate.
Because annotation rates were not normally distributed, we
determined the median rates for each reviewer and batch, and
those data were used for statistical evaluations of the change in
annotation rate as a function of batch number.
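A sketch of the annotation-rate calculation is shown below. It assumes the annotation timestamps are available as a list of times (in seconds) at which a reviewer added or corrected annotations, treats the rate for each pair of consecutive events as the inverse of the interval between them, and summarizes a batch by the median rate; the exact aggregation used in the study may differ from this assumption.

```python
from statistics import median

# Sketch of the annotation-rate calculation. Timestamps (in seconds) are the
# times at which a reviewer added or corrected annotations, as recorded by the
# annotation tool. Each pair of consecutive events yields a rate equal to the
# inverse of the interval between them (in annotations per minute); the batch
# is summarized by the median rate because rates were not normally distributed.

def annotation_rates(timestamps):
    """Per-event rates in annotations per minute from timestamps in seconds."""
    times = sorted(timestamps)
    intervals = [t2 - t1 for t1, t2 in zip(times, times[1:]) if t2 > t1]
    return [60.0 / interval for interval in intervals]

def median_batch_rate(timestamps):
    rates = annotation_rates(timestamps)
    return median(rates) if rates else 0.0

# Example: a reviewer adds or corrects annotations at these times (seconds).
batch_timestamps = [0, 40, 70, 130, 150, 210]
print(round(median_batch_rate(batch_timestamps), 2))  # 1.5 annotations per minute
```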
RapTAT system training and annotation rates
To evaluate the training rate of the RapTAT system, we mea-
sured the time required to process the first 10 document
batches. To evaluate the annotation rate, the corpus was divided
into two independent training and test groups with 10 batches
of documents in each. After processing the training documents,
annotation rate of RapTAT was calculated based on the time
spent pre-annotating the test documents. Times were normal-
ized to document corpus size in kilobytes. Time required to
read the corpus from disk into computer memory and read and
write data structures before and after training was excluded
from all rate calculations. Heap size of the Java virtual machine
was 1 GB. Training and testing were carried out on the VA
informatics and computing infrastructure (VINCI) server, which
ran on an Intel Xeon quad-core processor running at 2.27 GHz
and supplied with 128 GB of RAM. The operating system was
Windows Server 2008 R2 Enterprise.
Statistical analysis
The study used simple linear regression to evaluate the statistical significance of changes in F-measure, annotation rate, and fraction of annotations correctly provided by RapTAT as a function of document batch. A "correct" RapTAT annotation was defined as a pre-annotation that was neither added nor corrected by the reviewer. To compare the similarity of RapTAT-generated pre-annotations with the assisted and reference standard annotations, we ran paired t tests on estimates of precision, recall, and F-measure across all batches. A Student's t test was used to compare the number of annotations added or corrected by assisted versus manual reviewers. A two-sample proportion test was employed to identify statistical differences for single measures of IAA. All statistical analyses were carried out using Stata/IC V.11.2 for Mac (Stata Corp, College Station, Texas, USA), and p values <0.05 were considered significant.
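For comparison, an equivalent simple linear regression of F-measure on batch number could be run in Python as shown below; the study used Stata/IC 11.2, and the values in the example are made-up placeholders rather than study data.

```python
from scipy import stats

# Equivalent sketch of the simple linear regression used to test whether a
# measure (here, F-measure) changes significantly as a function of batch
# number. The study used Stata/IC 11.2; the values below are placeholders,
# not the study data.

batch_numbers = list(range(1, 11))
f_measures = [0.55, 0.72, 0.81, 0.83, 0.84, 0.86, 0.85, 0.87, 0.88, 0.88]  # illustrative only

result = stats.linregress(batch_numbers, f_measures)
print(f"slope={result.slope:.4f}, p={result.pvalue:.4f}")
# A p value <0.05 for the slope would indicate a significant batch-to-batch trend.
```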
RESULTS
There was a notable decrease in annotation time from batch to
batch for the RapTAT-assisted reviewers, especially over the first six to seven batches, followed by a slower apparent decrease over batches 14-20 (figure 4, top). Annotation time decreased by about 50% from the first to the last batch. Part of this
decrease may be accounted for by the gradual decrease in the
number of annotations that had to be added or corrected by the
annotators over the course of annotation (figure 4, bottom).
Averaged over the entire corpus, the two manual annotators added 100±18 (mean±SD) annotations per batch. The assisted annotators added or corrected significantly fewer: 78±12 annotations per batch, and 21±9 annotations per batch were generated as pre-annotations by RapTAT during assisted annotation and did not require correction.

Figure 3 Data flow during training and pre-annotation by the RapTAT interactive machine learning system. Dotted lines and arrows represent optional parts of the system that are available but were not used in this study, such as Lexical Variant Generation (LVG) lemmatization. Stippled patterns represent RapTAT-specific modules (light stippling) and files (dense stippling).

To determine if the decrease
in annotation number alone accounted for the marked decrease in annotation time (figure 4, top), we evaluated the rate of adding or correcting annotations while excluding the correct pre-annotations generated by RapTAT. The annotation rate of the assisted reviewers significantly increased over the course of annotation (+0.145 added or corrected annotations per minute per batch; 95% CI 0.07 to 0.22) and approximately doubled from the first to the last batch (figure 5). In contrast, the batch-to-batch change in annotation rate for the manual reviewers was significantly lower than that of the assisted annotators and did not change significantly over the course of annotation (+0.022 annotations per minute per batch; 95% CI -0.004 to 0.048).
The F-measure of the RapTAT pre-annotations relative to the assisted reviewer annotations increased steeply over the initial five to six batches. After a single batch of training, the F-measure was 0.5-0.6 and increased to >0.80 after three batches. Precision and recall increased similarly. Linear regression analysis of the performance scores after the initial five batches showed a non-significant trend towards a continuing increase in F-measure (p=0.0623 for slope > 0 by linear regression analysis). There was no evidence that pre-annotation introduced bias. Precision, recall, and F-measure (data not shown) increased in a similar fashion through the course of annotation regardless of whether the pre-annotation performance measures were calculated relative to the reviewer annotations (figure 6, left) or the reference standard (figure 6, right).
Furthermore, although the RapTAT pre-annotations were more similar to the annotations of the assisted reviewers than to the reference standard, based on significantly increased precision, recall, and F-measure across all batches (paired t test; p<0.05), the average increases were generally slight (0.046) and consistent from batch to batch. This finding was expected because the tool was specifically trained using the annotations of each assisted reviewer. There was no evidence that pre-annotation adversely affected IAA, which was significantly greater for assisted than manual annotation for certain concepts as well as overall (table 3).
The performance of RapTAT with respect to its ability to
annotate phrases accurately was concept dependent (table 4).
The four highest F-measures ranged from 0.80 to 0.97 and cor-
responded to the most highly prevalent concepts in the corpus,
and the lowest F-measure was for the least prevalent concept, "Reason not on ACE inhibitor/ARB" (table 1).
The processing speed of the RapTAT tool during training was 132.0 ms/kb of text. "Preprocessing," which we define as sentence boundary detection, tokenization, part-of-speech detection, and stop word removal, took most of the time (123 ms); only 9 ms were required for training once the text was read into computer memory and preprocessed. The annotation rate of the tool was 116.6 ms/kb, which consisted of 116 ms for preprocessing and 0.55 ms for phrase identification and concept mapping.
DISCUSSION
This study demonstrates that pre-annotation based on inter-
active, iterative machine learning can reduce the burden asso-
ciated with creating an annotated corpus. Considering the
annotation time and rate of the two assisted reviewers compared
with the manual reviewers, we estimate that using assisted rather
than manual annotation would have saved each reviewer
roughly 16 h for annotation of the entire 404-document corpus.
Also, our study found no evidence to suggest that pre-
annotation introduces bias.

Figure 4 Time required to annotate one kilobyte of text as a function of the number of document batches reviewed (top), and the fraction of all annotations that were uncorrected by the reviewers and added only by RapTAT. For the annotation time plot (top), each symbol represents the time taken by a single RapTAT-assisted annotator for a particular batch of documents from the study corpus, and the dashed line represents the apparent, batch-to-batch trend in annotation time. For the plot of the fraction of annotations generated by RapTAT alone with no correction by the annotators (bottom), each symbol represents the total number of uncorrected annotations generated by RapTAT for each batch divided by the total number of annotated phrases in the batch; the least squares line of regression is also included, and the slope is significantly different from zero (p<0.01).

Figure 5 Annotation rate as a function of the number of document batches reviewed. Each symbol represents the rate for a single reviewer for a particular batch of documents from the study corpus. The rate represents the inverse of the time between adding or correcting annotations for both manual (open symbols) and RapTAT-assisted (closed symbols) reviewers.

Before the study, we were concerned that the closed feedback loop between RapTAT and each reviewer might induce a drift in the annotations, so that pre-
annotations might closely match annotations of each reviewer
but increasingly deviate from those of the reference or other
annotator over the course of annotation. However, the IAA for
the assisted annotators was equal to or higher than that of the
manual annotators. Also, the precision, recall, and F-measure
for the pre-annotations relative to the assisted annotations and
for the pre-annotations relative to the reference standard
remained similar throughout the course of annotation.
The F-measure of the pre-annotations relative to the assisted
reviewer annotations was <0.8 for the first few annotation
batches, so the tool may provide only slight assistance in the
early stages. This is a limitation of the iterative training needed
for RapTAT compared with prior approaches that initially pre-
annotate all documents using existing tools or ones created for
the task. As RapTAT learns and improves, the number of anno-
tations that must be added or corrected decreases and the anno-
tation time of the reviewers correspondingly decreases. Fort and
Sagot examined the impact of pre-annotation accuracy on anno-
tation time and found that increasing accuracy from 66.5% to
81.6% was associated with a ~50% decrease in annotation time [25]. In our study, the F-measure reached 81% after three
document batches, which suggests that about 60 documents may
be required for training RapTAT to a level of accuracy such that
its pre-annotations substantially reduce annotation time. The
impact of training on F-measure was concept dependent, which
may be partially related to concept prevalence, so the rate of
increase in annotation speed as a function of the number of
documents annotated may be slower for infrequent concepts.
In this study, while RapTAT was used to generate the pre-
annotations, annotators added and corrected pre-annotations
using the separate Knowtator tool. Our goal is to eventually
embed RapTAT within an annotation tool. This will allow anno-
tators to update the machine learning algorithms after each
document and obviate the import and export of data that was
required in this study. When designing RapTAT, we were con-
cerned that existing language models, such as maximum entropy
Markov models and conditional random fields, might not be sufficiently rapid to support iterative training and pre-annotation in a way that would avoid delays during annotation. We therefore selected language models and worked to implement algorithms that would be sufficiently fast to support the interactive annotation
process described in this study. Based on the annotation and
system training rate determined in this study, RapTAT should be
readily capable of supporting real-time, interactive annotation.
The rate-limiting factor is disk access. Since 1 kb of text equals
about half a page, the current RapTAT system should take about
1 s to train on four pages or annotate eight pages once the
documents are read from disk and stored in computer memory.
The impact of the interactive approach to pre-annotation
described here on annotation time appears to be within the
range reported in other similar studies, which decreased annota-
tion time by 14-58% [21-24, 30].
Interactive, assisted pre-
annotation in our study approximately doubled the annotation
rate relative to that of manual reviewers. Studies examining
changes in IAA due to pre-annotation have been less consistent,
with some studies reporting no change and another reporting
an increase of 11% [21, 28, 30].
Interactive assisted annotation in
this study improved IAA by 27%. Although some of the
decrease in annotation time in our study was expected and
probably due to the increased fraction of annotations correctly
labeled by RapTAT, there was an unexpected increase in annota-
tion rate unrelated to annotation number. Our calculation of
annotation rate did not explicitly include the time required for removal of FP pre-annotations.

Figure 6 Performance of the RapTAT tool as measured by precision and recall as a function of the number of document batches used for training. Pre-annotations provided by the RapTAT tool were scored for performance versus either the assisted reviewer annotations (left) or the reference standard annotations (right).

Table 3 Inter-annotator agreement (IAA) between the two manual and between the two RapTAT-assisted reviewers

Concept | Average IAA (95% CI), Manual | Average IAA (95% CI), Assisted
Angiotensin converting enzyme inhibitor | 0.89 (0.86 to 0.93) | 0.93* (0.91 to 0.96)
Angiotensin II receptor blocker | 0.81 (0.72 to 0.89) | 0.97* (0.95 to 1.00)
Ejection fraction | 0.86 (0.80 to 0.93) | 0.97* (0.95 to 1.00)
Ejection fraction quantitation | 0.90 (0.85 to 0.94) | 0.88 (0.83 to 0.92)
Left ventricular systolic function/dysfunction | 0.82 (0.73 to 0.91) | 0.76 (0.62 to 0.89)
Left ventricular systolic function value | 0.85 (0.78 to 0.93) | 0.77 (0.64 to 0.90)
Reason not on ACE inhibitor/ARB | 0.58 (0.46 to 0.70) | 0.54 (0.45 to 0.64)
Total (combined over all concepts) | 0.85 (0.81 to 0.88) | 0.89* (0.87 to 0.91)

*Indicates significant difference when comparing the IAA of the manual reviewers with that of the RapTAT-assisted reviewers.
ACE, angiotensin converting enzyme; ARB, angiotensin II receptor blocker.

When annotators review a single
document, our observation has been that they continuously
alternate between removing FPs, adding FNs, and correcting
inaccurate text spans or concept mapping. Based on this obser-
vation, the time for correcting FPs would have been added to
the time taken between adding or correcting annotations, and
thus reduced the annotation rate of the assisted annotators.
Therefore, exclusion of the time taken for correcting FPs does
not account for the increased annotation rate of the assisted
reviewers. One possible explanation is that correcting annota-
tions may take less time than adding missing annotations. The
existence of pre-annotations may also reduce the cognitive
burden by decreasing the number of annotations that have to be
identified in each document or helping to delineate document
sections. With respect to the increase in IAA for assisted annota-
tors, we theorize that pre-annotation by RapTAT may help
reviewers to identify and annotate phrases that they might other-
wise overlook, thus reducing inter-annotator discrepancies. A
potential benet of increased IAA is a decrease in the adjudica-
tion workload.
Although previous studies have suggested that pre-annotation
can reduce annotation burden, the iterative, machine learning-
based approach to pre-annotations described here has some
important advantages. First, there is no need to identify or
create a pre-annotation system because such a system is gener-
ated during the annotation process. RapTAT can be used
without the linguistic and computational experience that might
otherwise be required to implement a pre-annotation system.
Second, the system carrying out pre-annotation is automatically
optimized for the schema and intended domain via machine
learning during annotation. Considering that low pre-annotation
accuracy can slow the annotation process [25, 28], correctly tailoring the pre-annotation to the domain is important, and non-optimized pre-annotation tools, such as pre-existing systems or dictionaries developed for a task, may not be sufficient.
There have been previous reports on the use of machine
learning-based pre-annotations for assisted annotation. Kors et al [36] used the assisted annotation function of the BRAT tool to generate multilingual corpora of documents annotated for multiple biomedical semantic types [37]. BRAT is a web browser-based tool that can use external web services for text processing and generation of pre-annotations [37]. Culotta et al [9] described an iterative approach similar to the one described for RapTAT for training a named entity recognition system. Using simulations, they reported that their approach reduced the number of "actions" required by an annotator by 42%. The MIST tool has been used to annotate protected health information within medical documents, and it can be trained to identify other concepts [38]. Another annotation tool, BOEMIE, is reported to have the ability to use a similar interactive approach to assist with text annotation [39].
To the best of our knowledge, the impact of
using MIST or BOEMIE on annotation time and IAA and their
ability to support real-time interactive annotation have not been
reported.
CONCLUSION
This study demonstrates that interactive, iterative machine learn-
ing as provided by RapTAT can assist with the annotation of
text by gradually learning to produce accurate pre-annotations.
Doing so substantially reduces the annotation time by decreasing
the number of annotations that must be added by reviewers and
helping to accelerate the rate at which reviewers are able to add
missing annotations and correct inaccurate ones. RapTAT also
improves IAA, which should accelerate adjudication when using
multiple reviewers for annotation. Integration of RapTAT or a
similar system with an annotation tool could help to mitigate an
important barrier to implementing NLP systems in the medical
field.
Author affiliations
1 Department of Veterans Affairs Medical Center, Geriatric Research, Education and Clinical Center (GRECC), Nashville, Tennessee, USA
2 Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, Tennessee, USA
3 Division of General Internal Medicine & Public Health, Department of Medicine, Vanderbilt University School of Medicine, Nashville, Tennessee, USA
4 IDEAS Center SLC VA Healthcare System, Salt Lake City, Utah, USA
5 Division of Epidemiology, University of Utah School of Medicine, Salt Lake City, Utah, USA
6 Department of Biomedical Informatics, University of Utah School of Medicine, Salt Lake City, Utah, USA
7 Department of Veterans Affairs Medical Center, Geriatric Research, Education and Clinical Center (GRECC), Salt Lake City, Utah, USA
8 Department of Biostatistics, Vanderbilt University School of Medicine, Nashville, Tennessee, USA
9 School of Biomedical Informatics, University of Texas Health Science Center, Houston, Texas, USA
Acknowledgements We thank Vincent Messina for technological assistance and
Stephane Meystre for a careful review of the manuscript.
Contributors GTG developed the algorithms, conceived and designed the study,
acquired, analyzed, and interpreted the data, and wrote and revised the manuscript.
RR conceived and designed the study and also reviewed and revised the manuscript.
RMC, JH, JW, and AW designed the study, acquired the data, and reviewed the
manuscript. SJ developed the algorithms and acquired and analyzed the data. JG,
TS, and DG designed the study and reviewed and revised the manuscript. SHB
conceived and designed the study and reviewed and revised the manuscript. HX
analyzed and interpreted the data and reviewed and revised the manuscript. MEM
developed the algorithms, conceived and designed the study, analyzed and
interpreted the data, and reviewed and revised the manuscript.
Funding This material is based upon work funded by the Department of Veterans
Affairs (VA), Veterans Health Administration, Office of Research and Development,
Health Services Research and Development (HSR&D) program. The work was
supported with resources and the use of facilities at the VA Tennessee Valley
Healthcare System (TVHS). Funding for this study was provided through VA grant
SAF-03-223 and HSR&D IBE 09-069.
Competing interests The VA Consortium for Health Informatics Research (CHIR)
HIR 09-001 and HIR 09-003 also provided support to GTG, TS, and MEM. The
Department of Veterans Affairs Health Administration HSR&D Career Development
Award CDA-08020 provided additional support to MEM; GTG and RR were
supported by the Department of Veterans Affairs Medical Informatics Fellowship
Program (sponsored by Office of Academic Affiliations, Office of Health Information,
and HSR&D). GTG and RR performed the work in this study while serving as
medical informatics fellows within the Department of Veterans Affairs Medical
Center, Nashville, Tennessee. MEM is a physician researcher at the Geriatrics
Research Education and Clinical Center (GRECC) at the Department of Veterans
Affairs Medical Center, Nashville, Tennessee. SHB is a staff physician at the Department of Veterans Affairs Medical Center, Nashville, Tennessee and Director of Knowledge-Based System, Health Informatics, Office of Informatics and Analytics, Department of Veterans Affairs. TS is chief of TVHS Center for Health Services Research, GRECC, Department of Veterans Affairs Medical Center, Nashville, Tennessee.

Table 4 Performance of the RapTAT tool for the various schema concepts as measured by precision, recall, and F-measure

Concept | Precision | Recall | F-measure
Angiotensin converting enzyme inhibitor | 0.97 | 0.94 | 0.95
Angiotensin II receptor blocker | 0.99 | 0.96 | 0.97
Ejection fraction | 0.96 | 0.95 | 0.96
Ejection fraction quantitation | 0.77 | 0.82 | 0.80
Left ventricular systolic function/dysfunction | 0.61 | 0.82 | 0.70
Left ventricular systolic function value | 0.83 | 0.37 | 0.51
Reason not on ACE inhibitor/ARB | 0.36 | 0.12 | 0.18

ACE, angiotensin converting enzyme; ARB, angiotensin II receptor blocker.
Ethics approval Institutional review boards of the Tennessee Valley VA, Salt Lake
City VA, and University of Utah.
Provenance and peer review Not commissioned; externally peer reviewed.
REFERENCES
1 Matheny ME, Fitzhenry F, Speroff T, et al. Detection of infectious symptoms from VA emergency department and primary care clinical documentation. Int J Med Inform 2012;81:143-56.
2 Murff HJ, FitzHenry F, Matheny ME, et al. Automated identification of postoperative complications within an electronic medical record using natural language processing. JAMA 2011;306:848-55.
3 Chiang JH, Lin JW, Yang CW. Automated evaluation of electronic discharge notes to assess quality of care for cardiovascular diseases using Medical Language Extraction and Encoding System (MedLEE). J Am Med Inform Assoc 2010;17:245-52.
4 Harkema H, Chapman WW, Saul M, et al. Developing a natural language processing application for measuring the quality of colonoscopy procedures. J Am Med Inform Assoc 2011;18(Suppl 1):i150-6.
5 Greenberg JO, Vakharia N, Szent-Gyorgyi LE, et al. Meaningful measurement: developing a measurement system to improve blood pressure control in patients with chronic kidney disease. J Am Med Inform Assoc 2013;20:e97-101.
6 Bonow RO, Bennett S, Casey DE Jr, et al. ACC/AHA clinical performance measures for adults with chronic heart failure: a report of the American College of Cardiology/American Heart Association Task Force on Performance Measures (Writing Committee to Develop Heart Failure Clinical Performance Measures): endorsed by the Heart Failure Society of America. Circulation 2005;112:1853-87.
7 Juckett D. A method for determining the number of documents needed for a gold standard corpus. J Biomed Inform 2012;45:460-70.
8 Chapman WW, Nadkarni PM, Hirschman L, et al. Overcoming barriers to NLP for clinical text: the role of shared tasks and the need for additional creative solutions. J Am Med Inform Assoc 2011;18:540-3.
9 Culotta A, Kristjansson T, McCallum A, et al. Corrective feedback and persistent learning for information extraction. Artif Intell 2006;170:1101-22.
10 Roberts A, Gaizauskas R, Hepple M, et al. Building a semantically annotated corpus of clinical texts. J Biomed Inform 2009;42:950-66.
11 South BR, Shen S, Leng J, et al. A prototype tool set to support machine-assisted annotation. 2012 Workshop on Biomedical Natural Language Processing; Montreal, Canada: Association for Computational Linguistics, 2012.
12 Thompson CA, Califf ME, Mooney RJ. Active learning for natural language parsing and information extraction. Sixteenth International Conference on Machine Learning; Morgan Kaufmann Publishers Inc, 1999.
13 Olsson F. A literature survey of active machine learning in the context of natural language processing. Swedish Institute of Computer Science (SICS) Technical Report; 2009. Report No: T2009:06.
14 Dagan I, Engleson SP. Committee-based sampling for training probabilistic classifiers. Twelfth International Conference on Machine Learning; Tahoe City, California: Morgan Kaufmann, 1995.
15 Ringger E, McClanahan P, Haertel R, et al. Active learning for part-of-speech tagging: accelerating corpus annotation. Linguistic Annotation Workshop; Prague, Czech Republic: Association for Computational Linguistics, 2007.
16 McCallum A, Nigam K. Employing EM and pool-based active learning for text classification. Fifteenth International Conference on Machine Learning; Morgan Kaufmann Publishers Inc, 1998.
17 Lewis DD, Gale WA. A sequential algorithm for training text classifiers. 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval; Dublin, Ireland: Springer-Verlag New York, Inc, 1994.
18 Hachey B, Beatrice A, Becker M. Investigating the effects of selective sampling on the annotation task. Ninth Conference on Computational Natural Language Learning; Ann Arbor, Michigan: Association for Computational Linguistics, 2005.
19 Vlachos A. Active annotation. Workshop on Adaptive Text Extraction and Mining (ATEM 2006); Trento, Italy; 2006.
20 Chen Y, Mani S, Xu H. Applying active learning to assertion classification of concepts in clinical text. J Biomed Inform 2012;45:265-72.
21 Lingren T, Deleger L, Molnar K, et al. Evaluating the impact of pre-annotation on annotation speed and potential bias: natural language processing gold standard development for clinical named entity recognition in clinical trial announcements. J Am Med Inform Assoc 2014;21:406-13.
22 Chiou F-D, Chiang D, Palmer M. Facilitating treebank annotation using a statistical parser. First International Conference on Human Language Technology Research; San Diego: Association for Computational Linguistics, 2001.
23 Ganchev K, Pereira F, Mandel M, et al. Semi-automated named entity annotation. Linguistic Annotation Workshop; Prague, Czech Republic: Association for Computational Linguistics, 2007.
24 Marcus MP, Marcinkiewicz MA, Santorini B. Building a large annotated corpus of English: the Penn treebank. Comput Linguist 1993;19:313-30.
25 Fort K, Sagot B. Influence of pre-annotation on POS-tagged corpus development. Fourth Linguistic Annotation Workshop; Uppsala, Sweden: Association for Computational Linguistics, 2010.
26 Ringger E, Carmen M, Haertel R, et al. Assessing the costs of machine-assisted corpus annotation through a user study. Sixth International Conference on Language Resources and Evaluation (LREC'08); Marrakech, Morocco: European Language Resources Association (ELRA), 2008.
27 Rehbein I, Ruppenhofer J, Sporleder C. Assessing the benefits of partial automatic pre-labeling for frame-semantic annotation. Third Linguistic Annotation Workshop; Suntec, Singapore: Association for Computational Linguistics, 2009.
28 Ogren PV, Savova G, Chute C. Constructing evaluation corpora for automated clinical named entity recognition. Language Resources and Evaluation Conference (LREC); 2008.
29 Dandapat S, Biswas P, Choudhury M, et al. Complex linguistic annotation - no easy way out!: a case from Bangla and Hindi POS labeling tasks. Third Linguistic Annotation Workshop; Suntec, Singapore: Association for Computational Linguistics, 2009.
30 Névéol A, Islamaj Doğan R, Lu Z. Semi-automatic semantic annotation of PubMed queries: a study on quality, efficiency, satisfaction. J Biomed Inform 2011;44:310-18.
31 Yancy CW, Jessup M, Bozkurt B, et al. 2013 ACCF/AHA guideline for the management of heart failure: a report of the American College of Cardiology Foundation/American Heart Association Task Force on Practice Guidelines. Circulation 2013;128:e240-327.
32 Ogren PV. Knowtator: a Protégé plug-in for annotated corpus construction. North American Chapter of the Association for Computational Linguistics on Human Language Technology; New York, New York: Association for Computational Linguistics, 2006.
33 Manning CD, Raghavan P, Schütze H. Introduction to information retrieval. New York: Cambridge University Press, 2008.
34 Schneider K-M. Techniques for improving the performance of naive Bayes for text classification. 6th International Conference on Computational Linguistics and Intelligent Text Processing; Mexico City, Mexico: Springer-Verlag, 2005.
35 Gobbel GT, Reeves R, Jayaramaraja S, et al. Development and evaluation of RapTAT: a machine learning system for concept mapping of phrases from medical narratives. J Biomed Inform. http://dx.doi.org/10.1016/j.jbi.2013.11.008.
36 Kors JA, Clematide S, Akhondi SA, et al. Creating multilingual gold standard corpora for biomedical concept recognition. CLEF 2013 Conference; Valencia, Spain: 2013.
37 Stenetorp P, Pyysalo S, Topic G, et al. BRAT: a web-based tool for NLP-assisted text annotation. Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics; Association for Computational Linguistics, 2012.
38 Aberdeen J, Bayer S, Yeniterzi R, et al. The MITRE identification scrubber toolkit: design, training, and assessment. Int J Med Inform 2010;79:849-59.
39 Fragkou P, Petasis G, Theodorakos A, et al. Boemie ontology-based text annotation tool. 6th International Conference on Language Resources and Evaluation (LREC); Marrakech, Morocco, 2008.
... Another trend aspect of annotation research is a spectrum between controlled environments like pervasive computing and internet-of-things [20] to more controlled environments like medical image annotation [21]. In less controlled environments, annotators may not be trained, whereas in more controlled environments, like medical domains, the annotators are usually highly trained. ...
Preprint
Full-text available
Annotated data have traditionally been used to provide the input for training a supervised machine learning (ML) model. However, current pre-trained ML models for natural language processing (NLP) contain embedded linguistic information that can be used to inform the annotation process. We use the BERT neural language model to feed information back into an annotation task that involves semantic labelling of dialog behavior in a question-asking game called Emotion Twenty Questions (EMO20Q). First we describe the background of BERT, the EMO20Q data, and assisted annotation tasks. Then we describe the methods for fine-tuning BERT for the purpose of checking the annotated labels. To do this, we use the paraphrase task as a way to check that all utterances with the same annotation label are classified as paraphrases of each other. We show this method to be an effective way to assess and revise annotations of textual user data with complex, utterance-level semantic labels.
... Our results suggest that although time can be decreased with pre-annotations using automated rule-based deidentification systems, the quality of the corpus could decline when compared to serial annotations. Our results are congruent with previous findings, that automatically pre-annotating corpus can significantly save time while there is no significant difference of annotation quality between parallel and pre-annotations 30,33 . Comparison between Setting 1 and Setting 2 suggests that the former has better quality, contrary to what is observed in a previous study 28 . ...
Article
Full-text available
For research purposes, protected health information is often redacted from unstructured electronic health records to preserve patient privacy and confidentiality. The OpenDeID corpus is designed to assist development of automatic methods to redact sensitive information from unstructured electronic health records. We retrieved 4548 unstructured surgical pathology reports from four urban Australian hospitals. The corpus was developed by two annotators under three different experimental settings. The quality of the annotations was evaluated for each setting. Specifically, we employed serial annotations, parallel annotations, and pre-annotations. Our results suggest that the pre-annotations approach is not reliable in terms of quality when compared to the serial annotations but can drastically reduce annotation time. The OpenDeID corpus comprises 2,100 pathology reports from 1,833 cancer patients with an average of 737.49 tokens and 7.35 protected health information entities annotated per report. The overall inter annotator agreement and deviation scores are 0.9464 and 0.9726, respectively. Realistic surrogates are also generated to make the corpus suitable for distribution to other researchers.
... Recently, deep neural networks (DNNs), especially recurrent neural networks (RNNs), have achieved remarkable performance in clinical named entity recognition (CNER) tasks. Mostafiz and Ashraf [17] compared an RNN-based NER method with other information extraction tools, e.g., RapTAT [18] and MTI [19], in extracting pathological terms from chest X-ray radiology reports and demonstrated that the deep neural network outperformed generic tools by a large margin. Gridach [20] added a CRF layer after the RNN layer to process the CNER task and obtained remarkable results on both the JNLPBA and BioCreAtIvE II GM data sets. ...
Article
Full-text available
Background Computed tomography (CT) reports record a large volume of valuable information about patients’ conditions and the interpretations of radiology images from radiologists, which can be used for clinical decision-making and further academic study. However, the free-text nature of clinical reports is a critical barrier to using these data more effectively. In this study, we investigate a novel deep learning method to extract entities from Chinese CT reports for lung cancer screening and TNM staging. Methods The proposed approach presents a new named entity recognition algorithm, namely the BERT-based-BiLSTM-Transformer network (BERT-BTN) with pre-training, to extract clinical entities for lung cancer screening and staging. Specifically, instead of traditional word embedding methods, BERT is applied to learn the deep semantic representations of characters. Following the long short-term memory layer, a Transformer layer is added to capture the global dependencies between characters. In addition, a pre-training technique is employed to alleviate the problem of insufficient labeled data. Results We verify the effectiveness of the proposed approach on a clinical dataset containing 359 CT reports collected from the Department of Thoracic Surgery II of Peking University Cancer Hospital. The experimental results show that the proposed approach achieves an 85.96% macro-F1 score under the exact match scheme, which improves performance by 1.38%, 1.84%, 3.81%, 4.29%, 5.12%, 5.29%, and 8.84% compared to BERT-BTN, BERT-LSTM, BERT-fine-tune, BERT-Transformer, FastText-BTN, FastText-BiLSTM, and FastText-Transformer, respectively. Conclusions In this study, we developed a novel deep learning method, i.e., BERT-BTN with pre-training, to extract clinical entities from Chinese CT reports. The experimental results indicate that the proposed approach can efficiently recognize various clinical entities related to lung cancer screening and staging, which shows its potential for further clinical decision-making and academic research.
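The layer stacking described in this abstract (BERT representations, a BiLSTM layer, then a Transformer layer feeding a token classifier) can be sketched as follows; hidden sizes, head counts, and the classification head are assumptions and do not reproduce the published BERT-BTN configuration.

# Sketch of a BERT -> BiLSTM -> Transformer -> token-classification stack,
# loosely following the BERT-BTN description; all dimensions are illustrative.
import torch.nn as nn
from transformers import AutoModel

class BertBiLSTMTransformerNER(nn.Module):
    def __init__(self, num_tags: int, bert_name: str = "bert-base-chinese",
                 lstm_hidden: int = 256, n_heads: int = 8):
        super().__init__()
        self.bert = AutoModel.from_pretrained(bert_name)
        dim = self.bert.config.hidden_size
        self.bilstm = nn.LSTM(dim, lstm_hidden, batch_first=True,
                              bidirectional=True)
        encoder_layer = nn.TransformerEncoderLayer(d_model=2 * lstm_hidden,
                                                   nhead=n_heads,
                                                   batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=1)
        self.classifier = nn.Linear(2 * lstm_hidden, num_tags)

    def forward(self, input_ids, attention_mask):
        # Contextual character/token representations from BERT
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        lstm_out, _ = self.bilstm(hidden)       # local sequential context
        trans_out = self.transformer(lstm_out)  # global dependencies
        return self.classifier(trans_out)       # per-token tag logits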
Article
Objectives Active learning (AL) has rarely integrated diversity-based and uncertainty-based strategies into a dynamic sampling framework for clinical named entity recognition (NER). Machine-assisted annotation is becoming popular for creating gold-standard labels. This study investigated the effectiveness of dynamic AL strategies under simulated machine-assisted annotation scenarios for clinical NER. Materials and Methods We proposed 3 new AL strategies: a diversity-based strategy (CLUSTER) based on Sentence-BERT and 2 dynamic strategies (CLC and CNBSE) capable of switching from diversity-based to uncertainty-based strategies. Using BioClinicalBERT as the foundational NER model, we conducted simulation experiments on 3 medication-related clinical NER datasets independently: i2b2 2009, n2c2 2018 (Track 2), and MADE 1.0. We compared the proposed strategies with uncertainty-based (LC and NBSE) and passive-learning (RANDOM) strategies. Performance was primarily measured by the number of edits made by the annotators to achieve a desired target effectiveness evaluated on independent test sets. Results When aiming for 98% overall target effectiveness, on average, CLUSTER required the fewest edits. When aiming for 99% overall target effectiveness, CNBSE required 20.4% fewer edits than NBSE did. CLUSTER and RANDOM could not achieve such a high target under the pool-based simulation experiment. For high-difficulty entities, CNBSE required 22.5% fewer edits than NBSE to achieve 99% target effectiveness, whereas neither CLUSTER nor RANDOM achieved 93% target effectiveness. Discussion and Conclusion When the target effectiveness was set high, the proposed dynamic strategy CNBSE exhibited both strong learning capabilities and low annotation costs in machine-assisted annotation. CLUSTER required the fewest edits when the target effectiveness was set low.
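As a point of reference for the uncertainty-based side of such frameworks, the sketch below implements plain least-confidence (LC) sampling over token-level tag probabilities; it does not implement the dynamic CLC/CNBSE switching strategies proposed in the study, and the scoring function is an assumption.

# Sketch: least-confidence (LC) sampling for active learning in NER.
# Each unlabeled sentence is scored by the model's least confident token;
# the lowest-scoring sentences are sent to the annotator next.
import numpy as np

def least_confidence_scores(token_probs_per_sentence):
    """token_probs_per_sentence: list of arrays shaped (n_tokens, n_tags)
    holding per-token tag probabilities from the current NER model."""
    scores = []
    for probs in token_probs_per_sentence:
        max_per_token = probs.max(axis=1)           # confidence of best tag per token
        scores.append(float(max_per_token.min()))   # sentence score = weakest token
    return np.array(scores)

def select_batch(token_probs_per_sentence, batch_size=20):
    """Return indices of the sentences the model is least confident about."""
    scores = least_confidence_scores(token_probs_per_sentence)
    return np.argsort(scores)[:batch_size]

# Illustrative usage: pool_probs = [model_probs(s) for s in unlabeled_pool]
# next_batch = select_batch(pool_probs, batch_size=20)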
Conference Paper
Full-text available
Natural language processing (NLP) research combines the study of universal principles, through basic science, with applied science targeting specific use cases and settings. However, the process of exchange between basic NLP and applications is often assumed to emerge naturally, resulting in many innovations going unapplied and many important questions left unstudied. We describe a new paradigm of Translational NLP, which aims to structure and facilitate the processes by which basic and applied NLP research inform one another. Translational NLP thus presents a third research paradigm, focused on understanding the challenges posed by application needs and how these challenges can drive innovation in basic science and technology design. We show that many significant advances in NLP research have emerged from the intersection of basic principles with application needs, and present a conceptual framework outlining the stakeholders and key questions in translational research. Our framework provides a roadmap for developing Translational NLP as a dedicated research area, and identifies general translational principles to facilitate exchange between basic and applied research.
Article
Full-text available
Natural language processing for medical applications (medical NLP) requires high-quality annotated corpora. In this study, we designed a versatile annotation scheme for clinical-medical text and a set of associated guidelines, which address two common subtasks used in medical NLP: named entity recognition (NER) and relation extraction (RE). The annotation scheme integrates similar existing schemes and defines clinical-medical entities and relations to encode useful information for many medical NLP applications. The guidelines aim to increase the annotation feasibility by reducing the necessity of judgement based on medical knowledge so as to enable non-medical professionals to annotate the text. We adopted a recursive discussion procedure involving NLP researchers, medical professionals, and annotators to develop the scheme and guidelines based on real annotation examples while increasing the corpus size. Further, we obtained annotated corpora comprising 3,769 medical records and radiology reports of patients with serious lung diseases. For improved efficiency, preliminary NER and RE models were created after the first half was annotated; they were subsequently applied to the second half, which was then corrected manually. This two-step annotation also increased the inter-coder agreement. Finally, a joint NER + RE model trained on our corpora showed sufficiently promising performance to suggest its practical implementation.
Article
The omnipresence and deep impact of artificial intelligence (AI) in today's society are undeniable. While the technology has already established itself as a powerful tool in several industries, more recently it has also started to change the practice of medicine. The aim of this review is to provide healthcare providers working in the field of cardiovascular medicine with an overview of AI and machine learning (ML) algorithms that have passed the initial tests and made it into contemporary clinical practice. The following domains where AI/ML could revolutionize cardiology are covered: (i) signal processing, (ii) image processing, (iii) clinical risk stratification, (iv) natural language processing, and (v) fundamental clinical discoveries.
Article
Purpose of review: Artificial intelligence (AI) has changed virtually every aspect of modern life, and medicine is no exception. Pediatric cardiology is both a perceptual and a cognitive subspecialty that involves complex decision-making, so AI is a particularly attractive tool for this medical discipline. This review summarizes the foundational work and incremental progress made as AI applications have emerged in pediatric cardiology since 2020. Recent findings: AI-based algorithms can be useful for pediatric cardiology in many areas, including: (1) clinical examination and diagnosis, (2) image processing, (3) planning and management of cardiac interventions, (4) prognosis and risk stratification, (5) omics and precision medicine, and (6) fetal cardiology. Most AI initiatives showcased in medical journals seem to work well in silico, but progress toward implementation in actual clinical practice has been more limited. Several barriers to implementation are identified, some encountered throughout medicine generally, and others specific to pediatric cardiology. Summary: Despite barriers to acceptance in clinical practice, AI is already establishing a durable role in pediatric cardiology. Its potential remains great, but to fully realize its benefits, substantial investment to develop and refine AI for pediatric cardiology applications will be necessary to overcome the challenges of implementation.
Article
The artificial intelligence (AI) revolution is well underway, including in the medical field, and has dramatically transformed our lives. An understanding of the basics of AI applications, their development, and challenges to their clinical implementation is important for clinicians to fully appreciate the possibilities of AI. Such a foundation would ensure that clinicians have a good grasp and realistic expectations for AI in medicine and prevent discrepancies between the promised and real-world impact. When quantifying the track record for AI applications in cardiology, we found that a substantial number of AI-systems are never deployed in clinical practice, although there certainly are many success stories. Successful implementations shared the following: they came from clinical areas where large amount of training data was available; were deployable into a single diagnostic modality; prediction models generally had high performance on external validation; and most were developed as part of collaborations with medical device manufacturers who had substantial experience with implementation of new technology. When looking into the current processes used for developing AI-based systems, we suggest that expanding the analytic framework to address potential deployment and implementation issues at project outset will improve the rate of successful implementation, and will be a necessary next step for AI to achieve its full potential in cardiovascular medicine.
Article
Full-text available
We describe our approach to create gold standard corpora for biomedical concept recognition in multiple languages, including English, French, German, Spanish, and Dutch. The annotations are based on a subset of the Unified Medical Language System and cover a wide variety of semantic groups.
Conference Paper
Full-text available
We report on an active learning experiment for named entity recognition in the astronomy domain. Active learning has been shown to reduce the amount of labelled data required to train a supervised learner by selectively sampling more informative data points for human annotation. We inspect double annotation data from the same domain and quantify potential problems concerning annotators’ performance. For data selectively sampled according to different selection metrics, we find lower inter-annotator agreement and higher per-token annotation times. However, overall results confirm the utility of active learning.
Conference Paper
Naive Bayes is often used in text classification applications and experiments because of its simplicity and effectiveness. However, its performance often suffers because it does not model text well and because of inappropriate feature selection and the lack of reliable confidence scores. We address these problems and show that they can be solved by some simple corrections. We demonstrate that our simple modifications are able to improve the performance of Naive Bayes for text classification significantly.
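For orientation, the sketch below shows the kind of multinomial naive Bayes text classifier being discussed, with TF-IDF weighting as one simple, commonly used correction to raw term counts; it does not reproduce the specific modifications evaluated in the cited paper, and the example documents and labels are invented.

# Sketch: multinomial naive Bayes for text classification, with TF-IDF
# weighting as one simple correction to raw counts. Data are toy examples.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["patient denies chest pain", "ejection fraction is reduced",
        "no shortness of breath", "left ventricular function is impaired"]
labels = ["symptom", "lv_function", "symptom", "lv_function"]  # toy labels

baseline = make_pipeline(CountVectorizer(), MultinomialNB())
corrected = make_pipeline(TfidfVectorizer(sublinear_tf=True), MultinomialNB())

baseline.fit(docs, labels)
corrected.fit(docs, labels)
print(corrected.predict(["systolic function appears reduced"]))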
Conference Paper
Manually annotating clinical document corpora to generate reference standards for Natural Language Processing (NLP) systems or Machine Learning (ML) is a time-consuming and labor-intensive endeavor. Although a variety of open-source annotation tools currently exist, there is a clear opportunity to develop new tools and assess functionalities that introduce efficiencies into the process of generating reference standards. These features include: management of document corpora and batch assignment, integration of machine-assisted verification functions, semi-automated curation of annotated information, and support of machine-assisted pre-annotation. The goals of reducing annotator workload and improving the quality of reference standards are important considerations for the development of new tools. An infrastructure is also needed that will support large-scale but secure annotation of sensitive clinical data as well as crowdsourcing, which has proven successful for a variety of annotation tasks. We introduce the Extensible Human Oracle Suite of Tools (eHOST, http://code.google.com/p/ehost), which provides such functionalities and, when coupled with server integration, offers an end-to-end solution for carrying out small-scale, large-scale, or crowdsourced annotation projects.
Article
Rapid, automated determination of the mapping of free text phrases to pre-defined concepts could assist in the annotation of clinical notes and increase the speed of natural language processing systems. The aim of this study was to design and evaluate a token-order-specific naïve Bayes-based machine learning system (RapTAT) to predict associations between phrases and concepts. Performance was assessed using a reference standard generated from 2,860 VA discharge summaries containing 567,520 phrases that had been mapped to 12,056 distinct Systematized Nomenclature of Medicine - Clinical Terms (SNOMED CT) concepts by the MCVS natural language processing system. It was also assessed on the manually annotated, 2010 i2b2 challenge data. Performance was established with regard to precision, recall, and F-measure for each of the concepts within the VA documents using bootstrapping. Within that corpus, concepts identified by MCVS were broadly distributed throughout SNOMED CT, and the token-order-specific language model achieved better performance based on precision, recall, and F-measure (0.95±0.15, 0.96±0.16, and 0.95±0.16, respectively; mean ± SD) than the bag-of-words based, naïve Bayes model (0.64±0.45, 0.61±0.46, and 0.60±0.45, respectively) that has previously been used for concept mapping. Precision, recall, and F-measure on the i2b2 test set were 92.9%, 85.9%, and 89.2% respectively, using the token-order-specific model. RapTAT required just 7.2 milliseconds to map all phrases within a single discharge summary, and mapping rate did not decrease as the number of processed documents increased. The high performance attained by the tool in terms of both accuracy and speed was encouraging, and the mapping rate should be sufficient to support near-real-time, interactive annotation of medical narratives. These results demonstrate the feasibility of rapidly and accurately mapping phrases to a wide range of medical concepts based on a token-order-specific naïve Bayes model and machine learning.
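The core idea of a token-order-specific naive Bayes model, conditioning token likelihoods on both the candidate concept and the token's position within the phrase, can be sketched as below; smoothing, vocabulary handling, and scoring details are illustrative assumptions and do not reproduce the RapTAT implementation.

# Sketch: token-order-specific naive Bayes for mapping phrases to concepts.
# Scores P(concept | tokens) proportional to
# P(concept) * product_i P(token_i | concept, position i),
# i.e. token likelihoods are conditioned on position, unlike bag-of-words NB.
import math
from collections import defaultdict

class TokenOrderNaiveBayes:
    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha                              # Laplace smoothing
        self.concept_counts = defaultdict(int)
        self.token_counts = defaultdict(lambda: defaultdict(int))  # (concept, pos) -> token -> count
        self.vocab = set()

    def train(self, phrase_tokens, concept):
        self.concept_counts[concept] += 1
        for i, tok in enumerate(phrase_tokens):
            self.token_counts[(concept, i)][tok] += 1
            self.vocab.add(tok)

    def _log_likelihood(self, tok, concept, i):
        counts = self.token_counts[(concept, i)]
        total = sum(counts.values())
        return math.log((counts[tok] + self.alpha) /
                        (total + self.alpha * (len(self.vocab) + 1)))

    def predict(self, phrase_tokens):
        n = sum(self.concept_counts.values())
        best, best_score = None, float("-inf")
        for concept, count in self.concept_counts.items():
            score = math.log(count / n)                 # concept prior
            for i, tok in enumerate(phrase_tokens):
                score += self._log_likelihood(tok, concept, i)
            if score > best_score:
                best, best_score = concept, score
        return best

# Illustrative usage with invented concept labels:
# model = TokenOrderNaiveBayes()
# model.train(["ejection", "fraction"], "LVEF")
# model.train(["heart", "failure"], "HF_DIAGNOSIS")
# print(model.predict(["ejection", "fraction"]))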
Conference Paper
We introduce the brat rapid annotation tool (BRAT), an intuitive web-based tool for text annotation supported by Natural Language Processing (NLP) technology. BRAT has been developed for rich structured annotation for a variety of NLP tasks and aims to support manual curation efforts and increase annotator productivity using NLP techniques. We discuss several case studies of real-world annotation projects using pre-release versions of BRAT and present an evaluation of annotation assisted by semantic class disambiguation on a multicategory entity mention annotation task, showing a 15% decrease in total annotation time. BRAT is available under an open-source license from: http://brat.nlplab.org