
TAKELAB: Medical Information Extraction and Linking with MINERAL

Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 389–393,
Denver, Colorado, June 4-5, 2015. © 2015 Association for Computational Linguistics
Goran Glavaš
University of Zagreb
Faculty of Electrical Engineering and Computing
Text Analysis and Knowledge Engineering Lab
Unska 3, 10000 Zagreb, Croatia
Abstract

Medical texts are filled with mentions of dis-
eases, disorders, and other clinical conditions,
with many different surface forms relating to
the same condition. We describe MINERAL, a
system for extraction and normalization of dis-
ease mentions in clinical text, with which we
participated in the Task 14 of SemEval 2015
evaluation campaign. MINERAL relies on a
conditional random fields-based model with a
rich set of features for mention detection, and
a semantic textual similarity measure for entity
linking. MINERAL reaches joint extraction
and linking performance of 75.9% F1-score (strict
F1-score of 72.7%) and ranks fourth
among 16 participating teams.
1 Introduction
Clinical narratives contain numerous mentions of
diseases and disorders. Recognizing these mentions
in text and normalizing the different superficial forms
of a disorder to the same canonical form could enable
new types of analyses that would be beneficial for
both medical professionals and patients.
Detection and normalization of various concepts
such as named entities (McCallum and Li, 2003; Kr-
ishnan and Manning, 2006) or events (Bethard, 2013;
Glavaš and Šnajder, 2014) has long been a focus
of the NLP community. Disorder mentions in clini-
cal text, however, have some peculiarities not typical
for traditional information extraction tasks such as
discontinuity or distributivity of a single token to
multiple disorder mentions. For example, the snippet
“extremities turned in and clinched together as a
consequence of. . . ”
contains two mentions of medical conditions, “ex-
tremities turned in” and “extremities clinched to-
gether”, which share the token “extremities”, with
the latter mention being discontinuous.
In this paper we present the MINERAL (Medi-
cal INformation ExtRAction and Linking) system
for recognizing and normalizing mentions of clinical
conditions, with which we participated in Task 14
of SemEval 2015 evaluation campaign. The system
recognizes disorder mentions via the supervised con-
ditional random fields (CRF) model with a rich set of
lexical, gazetteer-based, and informativeness-based
features. We apply a set of post-processing rules to
construct disorder mentions from token-level anno-
tations which follow the BEGIN-INSIDE-OUTSIDE
scheme. We utilize a measure of semantic textual
similarity to link recognized disorder mentions to
entries in the SNOMED-CT medical database. Our
approach is resource-light in the sense that, except for
SNOMED-CT which is necessary for normalization,
it does not rely on medical NLP resources.
We ranked fourth (relaxed evaluation setting)
among 16 teams in the official evaluation, with 3%
lower performance than the best-performing system.
Such a result suggests that coupling sequence la-
belling for mention recognition with an STS measure
for concept normalization poses a viable solution for
entity recognition in the clinical domain. We make
the MINERAL system freely available.1
2 Clinical Information Extraction
Clinical concept extraction is an essential task in
medical natural language processing. While early
approaches heavily relied on domain-specific vocab-
ularies (Friedman et al., 1994; Aronson, 2001; Zeng
et al., 2006), more recent efforts leverage the human-
annotated corpora to develop machine learning mod-
els for the extraction of medical concepts (Tang et al.,
2013; Uzuner et al., 2010). The rise in the number
of data-driven efforts in the medical domain was par-
ticularly motivated by the shared tasks such as i2b2
challenges (Uzuner et al., 2010) and ShARe/CLEF
eHealth Evaluation Lab (Suominen et al., 2013).
The first subtask of the SemEval Task 14, in which
we participated, was essentially the same as the first
task in the ShARe/CLEF eHealth campaign. We did
not participate in the second subtask on extracting
arguments of disorder mentions. The best performing
system of the ShARe/CLEF eHealth task on disor-
der extraction and normalization (Tang et al., 2013)
employed CRF and structured SVM models for men-
tion extraction and the traditional vector-space model
from information retrieval (Salton et al., 1975) for
disorder normalization.
Similar to (Tang et al., 2013), we employ the CRF
model for extraction of disorder mentions, but we
leverage recent findings in word vector representa-
tions (Mikolov et al., 2013) for feature computation.
We make use of the state-of-the-art measure of se-
mantic similarity of short texts (Šarić et al., 2012) for
concept normalization.
3 MINERAL

MINERAL consists of two subsystems: one for ex-
tracting disorder mentions and the other for normal-
izing extracted mentions by assigning them a Con-
cept Unique Identifier (CUI) from the SNOMED-CT
database (Stearns et al., 2001).
3.1 Disorder Mention Extraction
At the core of the extraction subsystem is the
CRF model with lexical, gazetteer-based, and
informativeness-based features. We decided to use
the BEGIN-INSIDE-OUTSIDE annotation scheme for
the CRF model, although this scheme does not ac-
count for token-sharing disorder mentions. Thus,
we apply a set of postprocessing rules to derive dis-
order mentions from token-level outputs produced
by the CRF model and to handle most frequent
cases of token-sharing mentions (e.g., “abdomen non-
disturbed and non-distended”).
3.1.1 Features
We feed the CRF model with a rich set of features
that can be divided into (1) token-based features, (2)
gazetteer-based features, and (3) information content-
based features. All of the features are templated on
the symmetric window of size two, i.e., computed
for two preceding tokens, current token, and two
subsequent tokens.
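The window templating described above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the feature-name format and the padding value are our assumptions.

```python
def window_features(tokens, i, extractors, size=2):
    """Collect features for token i from a symmetric window of `size`
    tokens on each side; positions outside the sentence yield a padding value."""
    feats = {}
    for offset in range(-size, size + 1):
        j = i + offset
        for name, fn in extractors.items():
            key = "%s[%+d]" % (name, offset)  # e.g. "lower[-1]", "w[+0]"
            feats[key] = fn(tokens[j]) if 0 <= j < len(tokens) else "<PAD>"
    return feats

# Two toy token-level extractors (surface form and lowercased form).
extractors = {"w": lambda t: t, "lower": str.lower}
feats = window_features(["Atrial", "dilatation", "noted"], 1, extractors)
```

Each token-level feature is thus instantiated five times, once per window position, which is what lets the linear CRF exploit local context.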
Token-based features (TK).
Token-based fea-
tures group all features which can be computed just
from the token at hand. These include the surface
form, lemma, stem, POS-tag, and shape (encoding of
the capitalization of the word, e.g., “UL” for “Atrial”)
of the word. We also encode the first and the last char-
acter bigram and trigram of the word as features.
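A word-shape feature of the kind mentioned above ("UL" for "Atrial") can be sketched as follows; the paper gives only that one example, so collapsing repeated codes is our assumption.

```python
def word_shape(token: str) -> str:
    """Map each character to U (upper), L (lower), D (digit), or itself,
    collapsing runs of the same code, so 'Atrial' -> 'UL'."""
    shape = []
    for ch in token:
        if ch.isupper():
            c = "U"
        elif ch.islower():
            c = "L"
        elif ch.isdigit():
            c = "D"
        else:
            c = ch
        if not shape or shape[-1] != c:
            shape.append(c)
    return "".join(shape)
```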
Gazetteer-based features (GZ).
Features in this
group rely on comparison of tokens in text with en-
tries in the SNOMED-CT database and with disease
annotations on the training set. For each token we
compute: the maximum similarity with any of the
words (1) starting a SNOMED-CT entry, (2) inside
a SNOMED-CT entry, and (3) ending a SNOMED-
CT entry. We compute the same three features only
considering gold annotations in the training set as
gazetteer entries. We compute the semantic similarity
between two words as the cosine between their cor-
responding word embedding vectors. We trained the
embedding vectors with the word2vec tool (Mikolov
et al., 2013) on the large unlabeled corpus of clini-
cal texts (with over 400K documents) provided by
the task organizers. We also counted the number of
gazetteer entries that start with, contain, and end with
the token at hand.
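The maximum-similarity gazetteer features can be sketched as follows, assuming the embedding vectors are available as plain lists of floats (the paper uses word2vec vectors; the function names here are ours).

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def max_gazetteer_similarity(token_vec, gazetteer_vecs):
    """Maximum cosine similarity between a token's embedding and the
    embeddings of gazetteer words at one position (entry start/inside/end)."""
    return max((cosine(token_vec, v) for v in gazetteer_vecs), default=0.0)
```

Computed once against entry-starting, entry-internal, and entry-ending word sets, this yields the three SNOMED-CT features (and three more for the training-set gazetteer).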
Information content-based features (IC).
These
features compute the informativeness of ngrams
within the clinical domain and compare it to their gen-
eral informativeness. We use information content
as a measure of the informativeness of the word w
within a corpus C:

ic(w) = log ( |C| / (freq(w) + 1) )

where freq(w) is the frequency of the word w in C
and |C| is the size of the corpus in tokens.
We compute three different information
content-based features. First, we compute the infor-
mation content of the word within a large corpus of
clinical narratives. Secondly, we compute the ratio
of the information content of the word computed on
the clinical corpus and the information content of the
same word computed on a large general corpus. We
used Google Books ngrams (Michel et al., 2011) as
the general corpus. The rationale here is that the clin-
ical concepts such as diseases and disorders will have
a higher relative frequency and, consequently, lower
information content in the clinical corpus than in the
general corpus. Finally, the third feature we com-
pute is the mutual information of the bigrams in the
clinical corpus, which we define via the information
mi(w1, w2) = ic(w1w2)
is the information content of the bi-
. Mutual information score indicates pairs
of words that often appear together (e.g., “atrial di-
latation”). For each word
we compute the mutual
information of the bigrams it constitues with the pre-
vious word (i.e.,
) and the subsequent word
(i.e., wiwi+1).
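The information-content and mutual-information features can be sketched as follows. The formulas in our copy of the text are partly garbled, so the exact forms below (ic(w) = log(|C|/(freq(w)+1)) and the PMI-style mi = ic(w1) + ic(w2) − ic(w1 w2)) are one consistent reading, not a verbatim reproduction.

```python
from math import log

def ic(word, freq, total):
    """Information content of a word: log(|C| / (freq(w) + 1)),
    where `freq` maps words to corpus counts and `total` is |C|."""
    return log(total / (freq.get(word, 0) + 1))

def mi(w1, w2, unigram_freq, bigram_freq, total):
    """PMI-style mutual information expressed via information contents:
    mi(w1, w2) = ic(w1) + ic(w2) - ic(w1 w2)."""
    return (ic(w1, unigram_freq, total) + ic(w2, unigram_freq, total)
            - ic(w1 + " " + w2, bigram_freq, total))
```

Under this reading, frequent collocations such as "atrial dilatation" receive higher mi scores than chance co-occurrences, which is the behavior the feature is meant to capture.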
3.1.2 Postprocessing
The only reasonable postprocessing strategy with
the B-I-O scheme is to join each INSIDE token with
the closest preceding BEGIN token. However, this
strategy requires rule-based fixes for common situa-
tions in which two disorder mentions share a token.
We designed postprocessing rules by observing the
most frequent mistakes our CRF model made on the
development set provided by the organizers. This led
to three particular fixes: (1) mentions of abdomen
condition typically correspond to two disorder men-
tions sharing the token “abdomen” (e.g., processing
“abdomen non-tender and non-distended” results in
two disorder mentions – “abdomen non-tender” and
“abdomen non-distended”); (2) mentions of allergies
typically share the token “allergies” (e.g., process-
ing “Allergies: Roxicet / Penicillins / Aspirin” pro-
duces three mentions – “Allergies Roxicet”,“Aller-
gies Penicillins”, and “Allergies Aspirin”); and (3)
the CRF model rather frequently fails to recognize
the hepatitis type. We attach the type (e.g., “B”)
found in the proximity of the token “hepatitis” to the
mention when the CRF fails to recognize it.
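The B-I-O decoding and the first token-sharing rule can be sketched as follows; this is a simplified illustration (the "and"-based split is our guess at how the abdomen rule detects the two coordinated conditions).

```python
def decode_bio(tokens, labels):
    """Join each I token with the closest preceding B token (plain B-I-O
    decoding); token-sharing cases are handled by separate rules."""
    mentions, current = [], []
    for tok, lab in zip(tokens, labels):
        if lab == "B":
            if current:
                mentions.append(" ".join(current))
            current = [tok]
        elif lab == "I" and current:
            current.append(tok)
        else:  # O label (or a stray I with no open mention)
            if current:
                mentions.append(" ".join(current))
            current = []
    if current:
        mentions.append(" ".join(current))
    return mentions

def split_shared_head(mention, head="abdomen"):
    """Rule (1): split e.g. 'abdomen non-tender and non-distended' into two
    mentions sharing the head token (simplified version of the paper's rule)."""
    words = mention.split()
    if words and words[0] == head and "and" in words:
        k = words.index("and")
        return [head + " " + " ".join(words[1:k]),
                head + " " + " ".join(words[k + 1:])]
    return [mention]
```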
3.2 Mention Normalization
The normalization subsystem assigns a CUI to each
extracted disorder mention by comparing the seman-
tic similarity of the mention with the SNOMED-CT
entries. Given that SNOMED-CT has over 650K
entries, it is infeasible to compute the similarity
of the disorder mentions with all database entries.
Therefore, we first retain only the entries which
contain at least one lemma from the extracted men-
tion. E.g., for the mention “melena due to gastroin-
testinal haemorrhage” we would consider only the
SNOMED-CT entries containing either “melena”,
“gastrointestinal”, or “haemorrhage”.
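The candidate-filtering step amounts to an inverted index from lemmas to database entries, which can be sketched as follows (entry ids and lemmatization are assumed given; the function names are ours).

```python
from collections import defaultdict

def build_lemma_index(entries):
    """Inverted index from lemma to SNOMED-CT entry ids, so candidate
    retrieval only touches entries sharing a lemma with the mention."""
    index = defaultdict(set)
    for entry_id, lemmas in entries.items():
        for lemma in lemmas:
            index[lemma].add(entry_id)
    return index

def candidate_entries(mention_lemmas, index):
    """Union of all entries containing at least one lemma of the mention."""
    cands = set()
    for lemma in mention_lemmas:
        cands |= index.get(lemma, set())
    return cands
```

This keeps the expensive similarity computation restricted to a small candidate set instead of all 650K+ entries.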
We compute the similarity as the modified variant
of the greedy weighted alignment overlap (GWAO)
measure from (Šarić et al., 2012). To compute this
score, we iteratively pair the words – one from ex-
tracted mention and the other from the database entry
– according to their semantic similarity. In each iter-
ation we greedily select the pair of words with the
largest semantic similarity, and remove these words
from their corresponding text snippets. The similarity
between words is computed as the cosine between
their embedding vectors obtained with word2vec
(Mikolov et al., 2013) on the large unlabeled corpus
of clinical narratives. Let P(m, s) be the set of word
pairs obtained through the greedy alignment between
the extracted mention m and the SNOMED-CT entry
s, and let vec(w) be the embedding vector of the word
w. The GWAO score is then computed as follows:

gwao(m, s) = Σ_{(wm, ws) ∈ P(m, s)} α · cos(vec(wm), vec(ws))

where α = max(ic(wm), ic(ws)) is the larger of the
information contents of the two words. The
gwao(m, s) score is normalized with the sum of in-
formation contents of words from m and s, respec-
tively, and the harmonic mean of the two normalized
scores is the final similarity score. We assign to
the extracted mention the CUI of the most similar
SNOMED-CT entry, assuming the similarity is above
some threshold λ (otherwise, the label “CUI-less” is
assigned to the mention). The optimal value of λ is
determined by maximizing the CUI prediction accu-
racy on the training and development set. A useful
add-on to the normalization step is the memorization
of CUIs for all disorder mentions observed in the
training set. In other words, a memorized mention
observed in the test set will be assigned the CUI it
had in the training set.

              Strict              Relaxed
Model       P     R     F1     P     R     F1
TK          75.6  65.6  70.2   90.0  80.4  84.9
TK + GZ     75.1  66.1  70.3   89.6  80.9  85.0
TK + IC     76.4  66.3  71.0   90.2  80.4  85.1
All feat.   76.3  66.9  71.3   90.1  81.1  85.4
All + PPR   77.4  69.1  73.0   90.1  82.2  86.0

Table 1: Model selection results.
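The greedy weighted alignment overlap of Section 3.2 can be sketched as follows; this is our reconstruction under the definitions above (word vectors and information contents are passed in as dictionaries, which is an assumption of the sketch).

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu, nv = sqrt(sum(a * a for a in u)), sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def gwao(mention, entry, vec, ic):
    """Greedy weighted alignment overlap: repeatedly pair the most similar
    remaining (mention word, entry word), adding the pair's cosine weighted
    by the larger information content of the two words."""
    m, s = list(mention), list(entry)
    score = 0.0
    while m and s:
        sim, wm, ws = max(((cosine(vec[a], vec[b]), a, b)
                           for a in m for b in s), key=lambda t: t[0])
        score += max(ic[wm], ic[ws]) * sim
        m.remove(wm)
        s.remove(ws)
    return score

def normalized_gwao(mention, entry, vec, ic):
    """Harmonic mean of gwao normalized by each side's total information content."""
    raw = gwao(mention, entry, vec, ic)
    nm = raw / sum(ic[w] for w in mention)
    ns = raw / sum(ic[w] for w in entry)
    return 2 * nm * ns / (nm + ns) if nm + ns else 0.0
```

An identical mention and entry thus score 1.0, and the score decays as aligned word pairs become less similar or less informative.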
4 Evaluation
Participants were provided with a training set consist-
ing of 298 clinical documents and a development set
with 133 documents. We used the training and devel-
opment set to optimize the model (features, postpro-
cessing rules, and the similarity threshold λ). A test
set of 100 clinical documents was used for the official
evaluation.

4.1 Model Optimization
We trained the CRF model with different combina-
tions of feature groups (TK, GZ, and IC) and eval-
uated the performance of these models on the de-
velopment set. We also evaluated the contribution
of the postprocessing rules (PPR) on the develop-
ment set. The extraction performance of the differ-
ent models is shown in Table 1. The model using
token-based features alone (model TK) achieves
solid performance. Information content-based fea-
tures (model TK + IC) seem to have a more positive
impact on the performance than the gazetteer-based
features (model TK + GZ). Still, the model with all
features displays the best performance. Applying
postprocessing rules further boosts the performance
on the development set, which is expected, because
the rules were designed precisely to fix the most fre-
quent errors on that dataset. We submitted the model
All + PPR for official evaluation. We also optimized
the similarity threshold λ to maximize the normaliza-
tion accuracy on the development set, selecting the
optimal value of λ = 0.83.

              Strict              Relaxed
Team        P     R     F1     P     R     F1
ezDI        78.3  73.2  75.7   81.5  76.1  78.7
ULisboa     77.9  70.5  74.0   80.6  72.9  76.5
UTH-CCB     77.8  69.6  73.5   79.7  71.4  75.3
UWM         77.3  69.9  73.4   80.9  73.1  76.8
TakeLab     76.1  69.6  72.7   79.4  72.7  75.9
Bioinf.-UA  69.0  73.6  71.2   71.9  76.6  74.2

Table 2: Official SemEval Task 14 (subtask 1) evaluation.
4.2 Official Results
A subset of the official ranking on the test set is
shown in Table 2. MINERAL ranks fourth among
16 teams in relaxed evaluation and fifth in strict eval-
uation, with only 3% lower F1 performance than the
best performing system.
Like most other systems, MINERAL displays
higher precision than recall. This would suggest a
non-negligible number of obdurate disorder mentions
which appear rarely in clinical documents and which
are not semantically similar to more frequent dis-
order mentions.

5 Conclusion
We described MINERAL, a system for extraction
and normalization of disorder mentions in clinical
text, with which we participated in Task 14 of Se-
mEval 2015. At the core of the mention extraction
approach is the CRF model built on B-I-O annota-
tion scheme and a rich set of lexical, gazetteer-based,
and informativeness-based features. We link the dis-
ease mentions to the SNOMED-CT entries using a
measure of semantic textual similarity of short texts.
MINERAL achieved an F1 performance of almost
76% (relaxed evaluation setting), ranking fourth
out of 16 teams participating in the task, with 3%
lower performance than the best-performing team.
Such a result suggests that a resource-light approach
with sequence labeling (with semantic features) for
mention extraction and STS measures for concept
normalization offers competitive performance in the
clinical domain.
Alan R. Aronson. 2001. Effective mapping of biomed-
ical text to the UMLS metathesaurus: the MetaMap
program. In Proceedings of the AMIA Symposium,
page 17.
Steven Bethard. 2013. ClearTK-TimeML: A minimal-
ist approach to TempEval 2013. In Second Joint
Conference on Lexical and Computational Semantics
(*SEM), volume 2, pages 10–14.
Carol Friedman, Philip O. Alderson, John H.M. Austin,
James J. Cimino, and Stephen B. Johnson. 1994. A
general natural-language text processor for clinical ra-
diology. Journal of the American Medical Informatics
Association, 1(2):161–174.
Goran Glavaš and Jan Šnajder. 2014. Construction and
evaluation of event graphs. Natural Language Engi-
neering, pages 1–46.
Vijay Krishnan and Christopher D. Manning. 2006. An
effective two-stage model for exploiting non-local de-
pendencies in named entity recognition. In Proceedings
of the 21st International Conference on Computational
Linguistics and the 44th Annual Meeting of the Associ-
ation for Computational Linguistics, pages 1121–1128.
Andrew McCallum and Wei Li. 2003. Early results
for named entity recognition with conditional random
fields, feature induction and web-enhanced lexicons. In
Proceedings of the Seventh Conference on Natural Lan-
guage Learning at HLT-NAACL 2003, pages 188–191.
Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser
Aiden, Adrian Veres, Matthew K. Gray, Joseph P. Pick-
ett, Dale Hoiberg, Dan Clancy, Peter Norvig, and Jon
Orwant. 2011. Quantitative analysis of culture using
millions of digitized books. Science, 331(6014):176–182.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Cor-
rado, and Jeff Dean. 2013. Distributed representations
of words and phrases and their compositionality. In
Advances in Neural Information Processing Systems,
pages 3111–3119.
Gerard Salton, Anita Wong, and Chung-Shu Yang. 1975.
A vector space model for automatic indexing. Commu-
nications of the ACM, 18(11):613–620.
Frane Šarić, Goran Glavaš, Mladen Karan, Jan Šnajder,
and Bojana Dalbelo Bašić. 2012. TakeLab: Systems
for measuring semantic text similarity. In Proceed-
ings of the Sixth International Workshop on Semantic
Evaluation (SemEval), pages 441–448.
Michael Q. Stearns, Colin Price, Kent A. Spackman, and
Amy Y. Wang. 2001. SNOMED Clinical Terms:
Overview of the development process and project status.
In Proceedings of the AMIA Symposium, page 662.
Hanna Suominen, Sanna Salanterä, Sumithra Velupillai,
Wendy W. Chapman, Guergana Savova, Noemie El-
hadad, Sameer Pradhan, Brett R. South, Danielle L.
Mowery, and Gareth J.F. Jones. 2013. Overview of
the ShARe/CLEF eHealth Evaluation Lab 2013. In
Information Access Evaluation: Multilinguality, Multi-
modality, and Visualization, pages 212–231.
Buzhou Tang, Yonghui Wu, Min Jiang, Joshua C. Denny,
and Hua Xu. 2013. Recognizing and encoding dis-
order concepts in clinical text using machine learning
and vector space model. In Workshop of ShARe/CLEF
eHealth Evaluation Lab 2013.
Özlem Uzuner, Imre Solti, and Eithon Cadag. 2010.
Extracting medication information from clinical text.
Journal of the American Medical Informatics Associa-
tion, 17(5):514–518.
Qing T. Zeng, Sergey Goryachev, Scott Weiss, Margarita
Sordo, Shawn N. Murphy, and Ross Lazarus. 2006. Ex-
tracting principal diagnosis, co-morbidity and smoking
status for asthma research: Evaluation of a natural lan-
guage processing system. BMC Medical Informatics
and Decision Making, 6(1):30.