A Survey on Deep Learning for
Named Entity Recognition
Jing Li, Aixin Sun, Jianglei Han, and Chenliang Li
Abstract—Named entity recognition (NER) is the task of identifying text spans that mention named entities, and classifying them into
predefined categories such as person, location, and organization. NER serves as the basis for a variety of natural language applications
such as question answering, text summarization, and machine translation. Although early NER systems are successful in producing
decent recognition accuracy, they often require much human effort in carefully designing rules or features. In recent years, deep
learning, empowered by continuous real-valued vector representations and semantic composition through nonlinear processing, has
been employed in NER systems, yielding state-of-the-art performance. In this paper, we provide a comprehensive review of existing
deep learning techniques for NER. We first introduce NER resources, including tagged NER corpora and off-the-shelf NER tools. Then,
we systematically categorize existing works based on a taxonomy along three axes: distributed representations for input, context
encoder, and tag decoder. Next, we survey the most representative methods for recently applied deep learning techniques in new
NER problem settings and applications. Finally, we present readers with the challenges faced by NER systems and outline future
directions in this area.
Index Terms—Natural language processing, named entity recognition, deep learning, survey
1 INTRODUCTION
NAMED Entity Recognition (NER) aims to recognize
mentions of rigid designators from text belonging to
predefined semantic types such as person, location, orga-
nization, etc. [1]. NER not only acts as a standalone tool for
information extraction (IE), but also plays an essential role in
a variety of natural language processing (NLP) applications
such as information retrieval [2], [3], automatic text summa-
rization [4], question answering [5], machine translation [6],
and knowledge base construction [7] etc.
Evolution of NER. The term “Named Entity” (NE) was
first used at the sixth Message Understanding Conference
(MUC-6) [8], as the task of identifying names of organi-
zations, people and geographic locations in text, as well
as currency, time and percentage expressions. Since MUC-
6 there has been increasing interest in NER, and various
scientific events (e.g., CoNLL03 [9], ACE [10], IREX [11], and
TREC Entity Track [12]) devote much effort to this topic.
Regarding the problem definition, Petasis et al. [13]
restricted the definition of named entities: “A NE is a
proper noun, serving as a name for something or someone”.
This restriction is justified by the significant percentage of
proper nouns present in a corpus. Nadeau and Sekine [1]
claimed that the word “Named” restricted the task to only
those entities for which one or many rigid designators stands
for the referent. Rigid designators, defined in [14], include
J. Li and A. Sun are with School of Computer Science and
Engineering, Nanyang Technological University, Singapore. E-mail:
jli030@e.ntu.edu.sg; axsun@ntu.edu.sg.
J. Han is with SAP Innovation Center, Singapore. E-mail:
ray.han@sap.com.
C. Li is with School of Cyber Science and Engineering, Wuhan University,
China. E-mail: cllee@whu.edu.cn.
Manuscript received xx, 2018; revised xx, 2018.
proper names and natural kind terms like biological species
and substances. Despite the various definitions of NEs,
researchers have reached common consensus on the types
of NEs to recognize. We generally divide NEs into two
categories: generic NEs (e.g., person and location) and
domain-specific NEs (e.g., proteins, enzymes, and genes).
In this paper, we mainly focus on generic NEs in the English
language. We do not claim this article to be exhaustive or
representative of all NER works on all languages.
As to the techniques applied in NER, there are four
main streams: 1) Rule-based approaches, which do not
need annotated data as they rely on hand-crafted rules;
2) Unsupervised learning approaches, which rely on un-
supervised algorithms without hand-labeled training ex-
amples; 3) Feature-based supervised learning approaches,
which rely on supervised learning algorithms with careful
feature engineering; 4) Deep-learning based approaches,
which automatically discover representations needed for the
classification and/or detection from raw input in an end-to-
end manner. We briefly describe 1), 2), and 3), and review 4) in detail.
Motivations for conducting this survey. In recent years,
deep learning (DL, also known as deep neural networks) has
attracted significant attention due to its success in vari-
ous domains. Starting with Collobert et al. [15], DL-based
NER systems with minimal feature engineering have been
flourishing. Over the past few years, a considerable number
of studies have applied deep learning to NER and succes-
sively advanced the state-of-the-art performance [15]–[19].
This trend motivates us to conduct a survey to report the
current status of deep learning techniques in NER research.
By comparing the choices of DL architectures, we aim to
identify factors affecting NER performance as well as issues
and challenges.
On the other hand, although NER studies have been
thriving for a few decades, to the best of our knowledge,
there are few reviews in this field so far. Arguably the most
established one was published by Nadeau and Sekine [1]
in 2007. This survey presents an overview of the technique
trend from hand-crafted rules towards machine learning.
Marrero et al. [20] summarized NER works from the per-
spectives of fallacies, challenges and opportunities in 2013.
Then Patawar and Potey [21] provided a short review in
2015. The two recent short surveys are on new domains [22]
and complex entity mentions [23], respectively. In summary,
existing surveys mainly cover feature-based machine learn-
ing models, but not the modern DL-based NER systems.
More germane to this work are the two recent surveys [24],
[25] in 2018. Goyal et al. [25] surveyed developments and
progresses made in NER. However, they did not include
recent advances of deep learning techniques. Yadav and
Bethard [24] presented a short survey of recent advances
in NER based on representations of words in a sentence. This
survey focuses more on the distributed representations for
input (e.g., char- and word-level embeddings) and does not
review the context encoder and tag decoders. The recent
trend of applying deep learning to NER tasks (e.g., multi-
task learning, transfer learning, reinforcement learning and
adversarial learning) is not covered in their survey either.
Contributions of this survey. We intensively review applica-
tions of deep learning techniques in NER, to enlighten and
guide researchers and practitioners in this area. Specifically,
we consolidate NER corpora, off-the-shelf NER systems
(from both academia and industry) in a tabular form, to
provide useful resources for the NER research community. We
then present a comprehensive survey on deep learning tech-
niques for NER. To this end, we propose a new taxonomy,
which systematically organizes DL-based NER approaches
along three axes: distributed representations for input, con-
text encoder (for capturing contextual dependencies for tag
decoder), and tag decoder (for predicting labels of words in
the given sequence). In addition, we also survey the most
representative methods for recently applied deep learning
techniques in new NER problem settings and applications.
Finally, we present readers with the challenges faced by
NER systems and outline future directions in this area.
The remainder of this paper is organized as follows:
Section 2 introduces the background of NER, consisting of
definitions, resources, evaluation metrics, and traditional ap-
proaches. Section 3 presents deep learning techniques for
NER based on our taxonomy. Section 4 summarizes recent
applied deep learning techniques that are being explored
for NER. Section 5 lists the challenges and misconceptions,
as well as future directions. We conclude this survey in
Section 6.
2 BACKGROUND
Before examining how deep learning is applied in the NER field,
we first give a formal formulation of the NER problem. We
then introduce the widely used NER datasets and tools.
Next, we detail the evaluation metrics and summarize the
traditional approaches to NER.
Fig. 1. An illustration of the named entity recognition task. Given a sen-
tence, the NER model recognizes one Person entity and two Location
entities.
2.1 What is NER?
A named entity is a word or a phrase that clearly identi-
fies one item from a set of other items that have similar
attributes [26]. Examples of named entities are organiza-
tion, person, and location names in the general domain; gene,
protein, drug and disease names in the biomedical domain.
Named entity recognition (NER) is the process of locating
and classifying named entities in text into predefined entity
categories.
Formally, given a sequence of tokens $s = \langle w_1, w_2, ..., w_N \rangle$,
NER is to output a list of tuples $\langle I_s, I_e, t \rangle$, each of which
is a named entity mentioned in $s$. Here, $I_s \in [1, N]$ and $I_e \in [1, N]$
are the start and the end indexes of a named entity mention; $t$ is the entity type
from a predefined category set. Figure 1 shows an example
where NER recognizes three named entities from the given
sentence. When NER was first defined in MUC-6 [8],
the task is to recognize names of people, organizations,
locations, and time, currency, percentage expressions in
text. Note that the task focuses on a small set of coarse
entity types and one type per named entity. We call this
kind of NER tasks coarse-grained NER [8], [9]. Recently
fine-grained NER tasks [27]–[31] focus on a much larger set
of entity types where a mention may be assigned multiple
types.
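For illustration, the following minimal Python sketch prints this tuple representation using spaCy, one of the off-the-shelf tools listed in Table 2. The model name en_core_web_sm, the example sentence (consistent with Figure 1, one Person and two Location mentions), and the exact predictions are assumptions that depend on the installed pipeline rather than on any specific system surveyed here.

```python
# A minimal sketch of the tuple formulation above, using spaCy (listed in
# Table 2). Assumes the "en_core_web_sm" pretrained pipeline is installed;
# the entities actually returned depend on that model, not on this survey.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Michael Jeffrey Jordan was born in Brooklyn, New York.")

# Each recognized entity is reported as a tuple (I_s, I_e, t): start token
# index, end token index (inclusive), and the predicted entity type.
tuples = [(ent.start, ent.end - 1, ent.label_) for ent in doc.ents]
for (i_s, i_e, t), ent in zip(tuples, doc.ents):
    print((i_s, i_e, t), "->", ent.text)
# Expected output (model-dependent), e.g.:
# (0, 2, 'PERSON') -> Michael Jeffrey Jordan
# (6, 6, 'GPE')    -> Brooklyn
# (8, 9, 'GPE')    -> New York
```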
NER acts as an important pre-processing step for a
variety of downstream applications such as information re-
trieval, question answering, machine translation, etc. Here,
we use semantic search as an example to illustrate the
importance of NER in supporting various applications. Se-
mantic search refers to a collection of techniques, which
enable search engines to understand the concepts, meaning,
and intent behind the queries from users [32]. According
to [2], about 71% of search queries contain at least one
named entity. Recognizing named entities in search queries
would help us to better understand user intents, hence to
provide better search results. To incorporate named enti-
ties in search, entity-based language models [32], which
consider individual terms as well as term sequences that
have been annotated as entities (both in documents and in
queries), have been proposed by Raviv et al. [33]. There are
also studies utilizing named entities for an enhanced user
experience, such as query recommendation [34], query auto-
completion [35], [36] and entity cards [37], [38].
DRAFT IN PROGRESS, VOL. XX, NO. XX, 2018 3
TABLE 1
List of annotated datasets for English NER. Number of tags refers to the number of entity types.
Corpus Year Text Source #Tags URL
MUC-6 1995 Wall Street Journal texts 7 https://catalog.ldc.upenn.edu/LDC2003T13
MUC-6 Plus 1995 Additional news to MUC-6 7 https://catalog.ldc.upenn.edu/LDC96T10
MUC-7 1997 New York Times news 7 https://catalog.ldc.upenn.edu/LDC2001T02
CoNLL03 2003 Reuters news 4 https://www.clips.uantwerpen.be/conll2003/ner/
ACE 2000 - 2008 Transcripts, news 7 https://www.ldc.upenn.edu/collaborations/past-projects/ace
OntoNotes 2007 - 2012 Magazine, news, conversation, web 89 https://catalog.ldc.upenn.edu/LDC2013T19
W-NUT 2015 - 2018 User-generated text 18 http://noisy-text.github.io
BBN 2005 Wall Street Journal texts 64 https://catalog.ldc.upenn.edu/ldc2005t33
NYT 2008 New York Times texts 5 https://catalog.ldc.upenn.edu/LDC2008T19
WikiGold 2009 Wikipedia 4 https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500
WiNER 2012 Wikipedia 4 http://rali.iro.umontreal.ca/rali/en/winer-wikipedia-for-ner
WikiFiger 2012 Wikipedia 113 https://github.com/xiaoling/figer
N3 2014 News 3 http://aksw.org/Projects/N3NERNEDNIF.html
GENIA 2004 Biology and clinical texts 36 http://www.geniaproject.org/home
GENETAG 2005 MEDLINE 2 https://sourceforge.net/projects/bioc/files/
FSU-PRGE 2010 PubMed and MEDLINE 5 https://julielab.de/Resources/FSU_PRGE.html
NCBI-Disease 2014 PubMed 790 https://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/
BC5CDR 2015 PubMed 3 http://bioc.sourceforge.net/
DFKI 2018 Business news and social media 7 https://dfki-lt-re-group.bitbucket.io/product-corpus/
2.2 NER Resources: Datasets and Tools
High quality annotations are critical for both model learning
and evaluation. In the following, we summarize widely
used datasets and off-the-shelf tools for English NER.
A tagged corpus is a collection of documents that contain
annotations of one or more entity types. Table 1lists some
widely used datasets with their data sources and number
of entity types (also known as tag types). As summarized
in Table 1, before 2005, datasets were mainly developed
by annotating news articles with a small number of en-
tity types, suitable for coarse-grained NER tasks. After
that, more datasets were developed on various kinds of
text sources including Wikipedia articles, conversation, and
user-generated text (e.g., tweets, YouTube comments,
and StackExchange posts in W-NUT). The number of tag
types becomes significantly larger, e.g., 89 in OntoNotes. We
also list a number of domain specific datasets, particularly
developed on PubMed and MEDLINE texts. The number of
entity types ranges from 2 in GENETAG to 790 in NCBI-
Disease.
We note that many recent NER works report their perfor-
mance on CoNLL03 and OntoNotes datasets (see Table 3).
CoNLL03 contains annotations for Reuters news in two lan-
guages: English and German. The English dataset has a large
portion of sports news with annotations in four entity types
(Person, Location, Organization, and Miscellaneous) [9]. The
goal of the OntoNotes project was to annotate a large
corpus, comprising various genres (weblogs, news, talk
shows, broadcast, usenet newsgroups, and conversational
telephone speech) with structural information (syntax and
predicate argument structure) and shallow semantics (word
sense linked to an ontology and coreference).1 There are
5 versions, from Release 1.0 to Release 5.0. The texts are
annotated with 18 coarse entity types, consisting of 89
subtypes.
1. https://catalog.ldc.upenn.edu/LDC2013T19
TABLE 2
Off-the-shelf NER tools offered by academia and industry/opensource
projects.
NER System URL
StanfordCoreNLP https://stanfordnlp.github.io/CoreNLP/
OSU Twitter NLP https://github.com/aritter/twitter_nlp
Illinois NLP http://cogcomp.org/page/software/
NeuroNER http://neuroner.com/
NERsuite http://nersuite.nlplab.org/
Polyglot https://polyglot.readthedocs.io
Gimli http://bioinformatics.ua.pt/gimli
spaCy https://spacy.io/
NLTK https://www.nltk.org
OpenNLP https://opennlp.apache.org/
LingPipe http://alias-i.com/lingpipe-3.9.3/
AllenNLP https://allennlp.org/models
IBM Watson https://www.ibm.com/watson/
There are many NER tools available online with pre-
trained models. Table 2 summarizes popular ones for En-
glish NER. StanfordCoreNLP, OSU Twitter NLP, Illinois
NLP, NeuroNER, NERsuite, Polyglot, and Gimli are offered
by academia. spaCy, NLTK, OpenNLP, LingPipe, AllenNLP,
and IBM Watson are from industry or open source projects.
2.3 NER Evaluation Metrics
NER systems are usually evaluated by comparing their
outputs against human annotations. The comparison can be
quantified by either exact-match or relaxed match.
2.3.1 Exact-match Evaluation
NER involves identifying both entity boundaries and entity
types. With “exact-match evaluation”, a named entity is
considered correctly recognized only if both its boundaries
and type match the ground truth [9], [39]. Precision, Recall, and
F-score are computed on the number of true positives (TP),
false positives (FP), and false negatives (FN).
DRAFT IN PROGRESS, VOL. XX, NO. XX, 2018 4
True Positive (TP): entities that are recognized by
NER and match ground truth.
False Positive (FP): entities that are recognized by
NER but do not match ground truth.
False Negative (FN): entities annotated in the ground
truth that are not recognized by NER.
Precision measures the ability of a NER system to present
only correct entities, and Recall measures the ability of a
NER system to recognize all entities in a corpus.
$$\text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall} = \frac{TP}{TP + FN}$$
F-score is the harmonic mean of precision and recall, and
the balanced F-score is most commonly used:
$$\text{F-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
As most NER systems involve multiple entity types,
it is often required to assess the performance across all
entity classes. Two measures are commonly used for this
purpose: macro-averaged F-score and micro-averaged F-
score. Macro-averaged F-score computes the F-score inde-
pendently for each entity type, then takes the average (hence
treating all entity types equally). Micro-averaged F-score
aggregates the contributions of entities from all classes to
compute the average (treating all entities equally). The latter
can be heavily affected by the quality of recognizing entities
in large classes in the corpus.
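As a worked illustration of these measures (our own sketch, not code from any surveyed system), the Python snippet below computes exact-match Precision, Recall, and F-score over sets of (start, end, type) tuples and contrasts the micro- and macro-averaged F-scores; the helper names evaluate and prf and the toy entity sets are hypothetical.

```python
# A small sketch of exact-match evaluation: entities are (start, end, type)
# tuples, and a prediction counts as a true positive only if both its
# boundaries and its type match the ground truth.
from collections import defaultdict

def prf(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

def evaluate(gold, pred):
    """gold, pred: sets of (start, end, type) tuples for the whole corpus."""
    counts = defaultdict(lambda: [0, 0, 0])           # type -> [tp, fp, fn]
    for ent in pred:
        counts[ent[2]][0 if ent in gold else 1] += 1  # tp or fp
    for ent in gold:
        if ent not in pred:
            counts[ent[2]][2] += 1                    # fn
    # Micro-average: pool tp/fp/fn over all types (every entity weighted equally).
    micro = prf(*[sum(c[i] for c in counts.values()) for i in range(3)])
    # Macro-average: F-score per type, then averaged (every type weighted equally).
    per_type = {t: prf(*c) for t, c in counts.items()}
    macro_f = sum(f for _, _, f in per_type.values()) / len(per_type)
    return micro, macro_f

gold = {(0, 2, "PER"), (6, 6, "LOC"), (8, 9, "LOC")}
pred = {(0, 2, "PER"), (6, 6, "LOC"), (8, 8, "LOC")}   # one wrong boundary
(micro_p, micro_r, micro_f), macro_f = evaluate(gold, pred)
print(micro_p, micro_r, micro_f, macro_f)
# micro P = R = F = 2/3; macro F = 0.75 (PER scores 1.0, LOC scores 0.5)
```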
2.3.2 Relaxed-match Evaluation
MUC-6 [8] defines a relaxed-match evaluation: a correct type
is credited if an entity is assigned its correct type regardless
of its boundaries, as long as there is an overlap with ground
truth boundaries; a correct boundary is credited regardless of
an entity’s type assignment. Then ACE [10] proposes a more
complex evaluation procedure. It resolves a few issues like
partial match and wrong type, and considers subtypes of
named entities. However, it is problematic because the final
scores are comparable only when parameters are fixed [1],
[20], [21]. Complex evaluation methods are not intuitive
and make error analysis difficult. Thus, complex evaluation
methods are not widely used in recent NER studies.
2.4 Traditional Approaches to NER
Traditional approaches to NER are broadly classified into
three main streams: rule-based, unsupervised learning, and
feature-based supervised learning approaches [1], [24].
2.4.1 Rule-based Approaches
Rule-based NER systems rely on hand-crafted rules. Rules
can be designed based on domain-specific gazetteers [7],
[40] and syntactic-lexical patterns [41]. Kim [42] proposed
to use the Brill rule inference approach for speech input. This
system generates rules automatically based on Brill’s part-
of-speech tagger. In the biomedical domain, Hanisch et al. [43]
proposed ProMiner, which leverages a pre-processed syn-
onym dictionary to identify protein mentions and potential
genes in biomedical text. Quimbaya et al. [44] proposed
a dictionary-based approach for NER in electronic health
records. Experimental results show the approach improves
recall while having limited impact on precision.
Some other well-known rule-based NER systems in-
clude LaSIE-II [45], NetOwl [46], Facile [47], SAR [48],
FASTUS [49], and LTG [50]. These systems are
mainly based on hand-crafted semantic and syntactic rules
to recognize entities. Rule-based systems work very well
when the lexicon is exhaustive. Due to domain-specific rules
and incomplete dictionaries, high precision and low recall
are often observed from such systems, and the systems
cannot be transferred to other domains.
2.4.2 Unsupervised Learning Approaches
A typical approach of unsupervised learning is clustering
[1]. Clustering-based NER systems extract named entities
from the clustered groups based on context similarity. The
key idea is that lexical resources, lexical patterns, and statis-
tics computed on a large corpus can be used to infer men-
tions of named entities. Collins et al. [51] observed that the use
of unlabeled data reduces the requirements for supervision
to just 7 simple “seed” rules. The authors then presented
two unsupervised algorithms for named entity classifica-
tion. Similarly, the KNOWITALL [7] system leverages a set
of predicate names as input and bootstraps its recognition
process from a small set of generic extraction patterns.
Nadeau et al. [52] proposed an unsupervised system for
gazetteer building and named entity ambiguity resolution.
This system combines entity extraction and disambiguation
based on simple yet highly effective heuristics. In addi-
tion, Zhang and Elhadad [41] proposed an unsupervised
approach to extracting named entities from biomedical text.
Instead of supervision, their model resorts to terminolo-
gies, corpus statistics (e.g., inverse document frequency
and context vectors) and shallow syntactic knowledge (e.g.,
noun phrase chunking). Experiments on two mainstream
biomedical datasets demonstrate the effectiveness and gen-
eralizability of their unsupervised approach.
2.4.3 Feature-based Supervised Learning Approaches
Applying supervised learning, NER is cast to a multi-class
classification or sequence labeling task. Given annotated
data samples, features are carefully designed to represent
each training example. Machine learning algorithms are
then utilized to learn a model to recognize similar patterns
from unseen data.
Feature engineering is critical in supervised NER sys-
tems. Feature vector representation is an abstraction over
text where a word is represented by one or many Boolean,
numeric, or nominal values [1], [53]. Word-level features
(e.g., case, morphology, and part-of-speech tag) [54]–[56],
list lookup features (e.g., Wikipedia gazetteer and DBpedia
gazetteer) [57]–[60], and document and corpus features (e.g.,
local syntax and multiple occurrences) [61]–[64] have been
widely used in various supervised NER systems. More
feature designs are discussed in [1], [26], [65].
Based on these features, many machine learning algo-
rithms have been applied in supervised NER, including
Hidden Markov Models (HMM) [66], Decision Trees [67],
Maximum Entropy Models [68], Support Vector Machines
(SVM) [69], and Conditional Random Fields (CRF) [70].
DRAFT IN PROGRESS, VOL. XX, NO. XX, 2018 5
Bikel et al. [71], [72] proposed the first HMM-based NER
system, named IdentiFinder, to identify and classify names,
dates, time expressions, and numerical quantities. Zhou and
Su [54] extended IdentiFinder by using mutual informa-
tion. Specifically, the main difference is that Zhou’s model
assumes mutual information independence while HMM
assumes conditional probability independence. In addition,
Szarvas et al. [73] developed a multilingual NER system
by using C4.5 decision tree and AdaBoostM1 learning algo-
rithm. A major merit is that it provides an opportunity to
train several independent decision tree classifiers through
different subsets of features and then combine their decisions
through a majority voting scheme.
Given labeled samples, the principle of maximum en-
tropy can be applied to estimate a probability distribution
function that assigns an entity type to any word in a
given sentence in terms of its context. Borthwick et al. [74]
proposed “maximum entropy named entity” (MENE) by
applying the maximum entropy theory. MENE is able to
make use of an extraordinarily diverse range of knowledge
sources in making its tagging decisions. Other systems using
maximum entropy can be found in [75]–[77].
McNamee and Mayfield [78] used 1000 language-related
features and 258 orthography and punctuation features to
train SVM classifiers. Each classifier makes a binary decision
on whether the current token belongs to one of the eight classes,
i.e., B- (Beginning), I- (Inside) for PERSON, ORGANIZA-
TION, LOCATION, and MIS tags. Isozaki and Kazawa [79]
developed a method to make SVM classifiers substantially
faster on the NER task. Li et al. [80] proposed an SVM-based
system, which uses an uneven margins parameter, achieving
better performance than the original SVM on a few
datasets.
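To make the B- (Beginning) / I- (Inside) scheme above concrete, here is a short illustrative decoder (our own sketch, not part of McNamee and Mayfield's system) that converts a per-token tag sequence into (start, end, type) spans; for simplicity it treats a stray I- tag with no preceding B- as the start of a new entity.

```python
# Illustrative decoder for the B-/I- tagging scheme: converts a per-token
# tag sequence such as B-PER/I-PER/O into (start, end, type) entity spans.
def bio_to_spans(tags):
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):           # sentinel flushes the last span
        if tag.startswith("B-") or tag == "O":
            if start is not None:
                spans.append((start, i - 1, etype))  # close the open entity
                start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            start, etype = i, tag[2:]                # tolerate an I- with no B- (tagger noise)
    return spans

tags = ["B-PER", "I-PER", "I-PER", "O", "O", "O", "B-LOC", "O", "B-LOC", "I-LOC", "O"]
print(bio_to_spans(tags))
# [(0, 2, 'PER'), (6, 6, 'LOC'), (8, 9, 'LOC')]
```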
SVM does not consider “neighboring” words when pre-
dicting an entity label. CRFs take context into account. Mc-
Callum and Li [81] proposed a feature induction method for
CRFs in NER. Experiments were performed on CoNLL03,
and achieved an F-score of 84.04% for English. Krishnan and
Manning [64] proposed a two-stage approach based on two
coupled CRF classifiers. The second CRF makes use of the
latent representations derived from the output of the first
CRF. We note that CRF-based NER has been widely applied
to texts in various domains, including biomedical text [55],
tweets [82], [83] and chemical text [84].
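The following NumPy sketch illustrates why a linear-chain model “takes context into account”: labels are decoded jointly with the Viterbi algorithm, so a transition score can, for example, penalize an I-PER tag that follows O. The emission and transition scores are invented for illustration and are not tied to McCallum and Li's feature-induction method.

```python
# A NumPy sketch of linear-chain (CRF-style) decoding, illustrating how
# neighboring labels interact through transition scores; the emission and
# transition values below are made up for illustration only.
import numpy as np

def viterbi(emissions, transitions):
    """emissions: (T, K) per-token label scores; transitions: (K, K) scores
    for moving from label j to label k. Returns the best label sequence."""
    T, K = emissions.shape
    score = emissions[0].copy()                  # best score ending in each label
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # total[j, k] = score[j] + transitions[j, k] + emissions[t, k]
        total = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    best = [int(score.argmax())]                 # follow back-pointers from the end
    for t in range(T - 1, 0, -1):
        best.append(int(backptr[t, best[-1]]))
    return best[::-1]

labels = ["O", "B-PER", "I-PER"]
emissions = np.array([[0.1, 2.0, 0.0],     # "Michael"  looks like B-PER
                      [0.2, 0.3, 1.5],     # "Jordan"   looks like I-PER
                      [2.0, 0.1, 0.1]])    # "retired"  looks like O
transitions = np.array([[ 0.0,  0.5, -3.0],   # O     -> O / B-PER / I-PER
                        [-0.5, -2.0,  1.0],   # B-PER -> ...
                        [ 0.0, -1.0,  0.5]])  # I-PER -> ...
print([labels[i] for i in viterbi(emissions, transitions)])
# ['B-PER', 'I-PER', 'O']  -- the transition scores discourage O -> I-PER
```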
3 DEEP LEARNING TECHNIQUES FOR NER
In recent years, DL-based NER models have become dominant
and achieved state-of-the-art results. Compared to feature-
based approaches, deep learning is beneficial in discovering
hidden features automatically. Next, we first briefly intro-
duce what deep learning is, and why deep learning for NER.
We then survey DL-based NER approaches.
3.1 Why Deep Learning for NER?
Deep learning is a field of machine learning in which models
composed of multiple processing layers learn representations
of data with multiple levels of abstraction [85]. The typical
layers are artificial neural networks. Figure 2 illustrates a
multilayer neural network and backpropagation.
Fig. 2. An illustration of multilayer neural networks and backpropagation: (a) forward pass; (b) backward pass. In the forward pass, a non-linear function $f(\cdot)$ is applied to $z$ to get the output of the units. The backward pass uses an example cost function $0.5(y_l - t_l)^2$, where $t_l$ is the target value.
Fig. 3. The taxonomy of DL-based NER. From input sequence to predicted tags, a DL-based NER model consists of distributed representations for input, context encoder, and tag decoder.
The forward pass computes a weighted sum of their inputs from
the previous layer and passes the result through a non-linear
function. The backward pass is to compute the gradient
of an objective function with respect to the weights of a
multilayer stack of modules via the chain rule of deriva-
tives. The key advantage of deep learning is the capability
of representation learning and the semantic composition
empowered by both the vector representation and neural
processing. This allows a machine to be fed with raw data
and to automatically discover latent representations and
processing needed for classification or detection [85].
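As a toy illustration of the two passes described above (a sketch under arbitrary sizes and random weights, not a NER model), the NumPy snippet below runs one forward pass through a two-layer network, backpropagates the example cost 0.5(y_l − t_l)^2 from Fig. 2, and applies one gradient-descent update.

```python
# A NumPy sketch of the forward and backward passes described above,
# using the example cost 0.5*(y - t)^2; sizes and values are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4,))           # input units
t = np.array([1.0, 0.0])            # target values t_l
W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(2, 3))

def f(z):                           # non-linear function f(.)
    return np.tanh(z)

# Forward pass: each layer computes a weighted sum of its inputs from the
# previous layer and passes the result through the non-linearity.
z1 = W1 @ x;  h = f(z1)
z2 = W2 @ h;  y = f(z2)
cost = 0.5 * np.sum((y - t) ** 2)

# Backward pass: gradients of the cost w.r.t. the weights via the chain rule.
dy = y - t                          # dE/dy_l for the 0.5*(y_l - t_l)^2 cost
dz2 = dy * (1.0 - y ** 2)           # tanh'(z) = 1 - tanh(z)^2
dW2 = np.outer(dz2, h)
dh = W2.T @ dz2
dz1 = dh * (1.0 - h ** 2)
dW1 = np.outer(dz1, x)

# One step of gradient descent, the end-to-end training paradigm noted below.
lr = 0.1
W1 -= lr * dW1
W2 -= lr * dW2
print(float(cost))
```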
There are three core strengths of applying deep learning
techniques to NER. First, NER benefits from the non-linear
transformation, which generates non-linear mappings from
input to output. Compared with linear models (e.g., log-
linear HMM and linear chain CRF), deep-learning models
are able to learn complex and intricate features from data
via non-linear activation functions. Second, deep learning
saves significant effort on designing NER features. The
traditional feature-based approaches require a considerable
amount of engineering skill and domain expertise. Deep
learning models, on the other hand, are effective in au-
tomatically learning useful representations and underlying
factors from raw data. Third, deep neural NER models can
be trained in an end-to-end paradigm, by gradient descent.
This property enables us to design possibly complex NER
systems.
This research hasn't been cited in any other publications.
  • Conference Paper
    Full-text available
    Standard named entity recognizers can effectively recognize entity mentions that consist of contiguous tokens and do not overlap with each other. However, in practice, there are many domains, such as the biomedical domain, in which there are nested, overlapping, and discontinuous entity mentions. These complex mentions cannot be directly recognized by conventional sequence tagging models because they may break the assumptions based on which sequence tagging techniques are built. We review the existing methods which are revised to tackle complex entity mentions and categorize them as tokenlevel and sentence-level approaches. We then identify the research gap, and discuss some directions that we are exploring.
  • Article
    Full-text available
    Clinical named entity recognition aims to identify and classify clinical terms such as diseases, symptoms, treatments, exams, and body parts in electronic health records, which is a fundamental and crucial task for clinical and translational research. In recent years, deep neural networks have achieved significant success in named entity recognition and many other natural language processing tasks. Most of these algorithms are trained end to end, and can automatically learn features from large scale labeled datasets. However, these data-driven methods typically lack the capability of processing rare or unseen entities. Previous statistical methods and feature engineering practice have demonstrated that human knowledge can provide valuable information for handling rare and unseen cases. In this paper, we propose a new model which combines data-driven deep learning approaches and knowledge-driven dictionary approaches. Specifically, we incorporate dictionaries into deep neural networks. In addition, two different architectures that extend the bi-directional long short-term memory neural network and five different feature representation schemes are also proposed to handle the task. Computational results on the CCKS-2017 Task 2 benchmark dataset show that the proposed method achieves the highly competitive performance compared with the state-of-the-art deep learning methods.
  • Conference Paper
    Full-text available
    Recent advances in language modeling using recurrent neural networks have made it viable to model language as distributions over characters. By learning to predict the next character on the basis of previous characters, such models have been shown to automatically internalize linguistic concepts such as words, sentences, subclauses and even sentiment. In this paper, we propose to leverage the internal states of a trained character language model to produce a novel type of word embedding which we refer to as contextual string embeddings. Our proposed embeddings have the distinct properties that they (a) are trained without any explicit notion of words and thus fundamentally model words as sequences of characters, and (b) are contextualized by their surrounding text, meaning that the same word will have different embeddings depending on its contextual use. We conduct a comparative evaluation against previous embeddings and find that our embeddings are highly useful for downstream tasks: across four classic sequence labeling tasks we consistently outperform the previous state-of-the-art. In particular, we significantly outperform previous work on English and German named entity recognition (NER), allowing us to report new state-of-the-art F1-scores on the C O NLL03 shared task. We release all code and pre-trained language models in a simple-to-use framework to the research community, to enable reproduction of these experiments and application of our proposed embeddings to other tasks: https://github.com/zalandoresearch/flair
  • Book
    This open access book covers all facets of entity-oriented search—where “search” can be interpreted in the broadest sense of information access—from a unified point of view, and provides a coherent and comprehensive overview of the state of the art. It represents the first synthesis of research in this broad and rapidly developing area. Selected topics are discussed in-depth, the goal being to establish fundamental techniques and methods as a basis for future research and development. Additional topics are treated at a survey level only, containing numerous pointers to the relevant literature. A roadmap for future research, based on open issues and challenges identified along the way, rounds out the book.
  • Article
    Recommender systems are effective tools of information filtering that are prevalent due to increasing access to the Internet, personalization trends, and changing habits of computer users. Although existing recommender systems are successful in producing decent recommendations, they still suffer from challenges such as accuracy, scalability, and cold-start. In the last few years, deep learning, the state-of-the-art machine learning technique utilized in many complex tasks, has been employed in recommender systems to improve the quality of recommendations. In this study, we provide a comprehensive review of deep learning-based recommendation approaches to enlighten and guide newbie researchers interested in the subject. We analyze compiled studies within four dimensions which are deep learning models utilized in recommender systems, remedies for the challenges of recommender systems, awareness and prevalence over recommendation domains, and the purposive properties. We also provide a comprehensive quantitative assessment of publications in the field and conclude by discussing gained insights and possible future work on the subject.
  • Article
    Textual information is becoming available in abundance on the web, arising the requirement of techniques and tools to extract the meaningful information. One of such an important information extraction task is Named Entity Recognition and Classification. It is the problem of finding the members of various predetermined classes, such as person, organization, location, date/time, quantities, numbers etc. The concept of named entity extraction was first proposed in Sixth Message Understanding Conference in 1996. Since then, a number of techniques have been developed by many researchers for extracting diversity of entities from different languages and genres of text. Still, there is a growing interest among research community to develop more new approaches to extract diverse named entities which are helpful in various natural language applications. Here we present a survey of developments and progresses made in Named Entity Recognition and Classification research.
  • Chapter
    Full-text available
    Clinical named entity recognition (NER) is a foundational technology to acquire the knowledge within the electronic medical records. Conventional clinical NER methods suffer from heavily feature engineering. Besides, these methods treat NER as a sentence-level task and ignore the long-range contextual dependencies. In this paper, we propose an attention-based neural network architecture to leverage document-level global information to alleviate the problem. The global information is obtained from document represented by pre-trained bidirectional language model (Bi-LM) with neural attention. The parameters of pre-trained Bi-LM which makes use of unlabeled data can be transferred to NER model to further improve the performance. We evaluate our model on 2010 i2b2/VA datasets to verify the effectiveness of leveraging global information and transfer strategy. Our model outperforms previous state-of-the-art method with less labeled data and no feature engineering.
  • Article
    Full-text available
    Motivation: The explosive increase of biomedical literature has made information extraction an increasingly important tool for biomedical research. A fundamental task is the recognition of biomedical named entities in text (BNER) such as genes/proteins, diseases, and species. Recently, a domain-independent method based on deep learning and statistical word embeddings, called long short-term memory network-conditional random field (LSTM-CRF), has been shown to outperform state-of-the-art entity-specific BNER tools. However, this method is dependent on gold-standard corpora (GSCs) consisting of hand-labeled entities, which tend to be small but highly reliable. An alternative to GSCs are silver-standard corpora (SSCs), which are generated by harmonizing the annotations made by several automatic annotation systems. SSCs typically contain more noise than GSCs but have the advantage of containing many more training examples. Ideally, these corpora could be combined to achieve the benefits of both, which is an opportunity for transfer learning. In this work, we analyze to what extent transfer learning improves upon state-of-the-art results for BNER. Results: We demonstrate that transferring a deep neural network (DNN) trained on a large, noisy SSC to a smaller, but more reliable GSC significantly improves upon state-of-the-art results for BNER. Compared to a state-of-the-art baseline evaluated on 23 GSCs covering four different entity classes, transfer learning results in an average reduction in error of approximately 11%. We found transfer learning to be especially beneficial for target data sets with a small number of labels (approximately 6000 or less). Availability and implementation: Source code for the LSTM-CRF is available athttps://github.com/Franck-Dernoncourt/NeuroNER/ and links to the corpora are available athttps://github.com/BaderLab/Transfer-Learning-BNER-Bioinformatics-2018/. Contact: john.giorgi@utoronto.ca. Supplementary information: Supplementary data are available at Bioinformatics online.