DRAFT IN PROGRESS, VOL. XX, NO. XX, 2018 4
• True Positive (TP): entities that are recognized by NER and match the ground truth.
• False Positive (FP): entities that are recognized by NER but do not match the ground truth.
• False Negative (FN): entities annotated in the ground truth that are not recognized by NER.
Precision measures the ability of a NER system to present
only correct entities, and Recall measures the ability of a
NER system to recognize all entities in a corpus.
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$
F-score is the harmonic mean of precision and recall, and
the balanced F-score is most commonly used:
$$\text{F-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
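The definitions above can be illustrated with a minimal sketch. It assumes that an entity is represented as a (start, end, type) tuple and that "matching the ground truth" means the tuple is identical; both the representation and the function name `exact_match_scores` are choices made for this illustration only.

```python
from typing import Set, Tuple

# Assumed representation for this sketch: (start_token, end_token, type).
Entity = Tuple[int, int, str]


def exact_match_scores(gold: Set[Entity], pred: Set[Entity]) -> Tuple[float, float, float]:
    """Return (precision, recall, F-score) from TP, FP and FN counts."""
    tp = len(gold & pred)  # recognized and identical to a ground-truth entity
    fp = len(pred - gold)  # recognized but not in the ground truth
    fn = len(gold - pred)  # in the ground truth but not recognized
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score


gold = {(0, 1, "PER"), (5, 5, "LOC"), (9, 10, "ORG")}
pred = {(0, 1, "PER"), (5, 5, "ORG"), (12, 12, "LOC")}
precision, recall, f_score = exact_match_scores(gold, pred)
```

In this toy example only one of the three predictions matches, so TP = 1, FP = 2 and FN = 2, and precision, recall and F-score all equal 1/3.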
As most NER systems involve multiple entity types, it is often necessary to assess performance across all entity classes. Two measures are commonly used for this purpose: macro-averaged F-score and micro-averaged F-score. Macro-averaged F-score computes the F-score independently for each entity type, then takes the average (hence treating all entity types equally). Micro-averaged F-score aggregates the contributions of entities from all classes to compute the average (treating all entities equally). The latter can be heavily affected by the quality of recognizing entities in the large classes of the corpus.
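The difference between the two averages can be seen in a small sketch. As before, the (start, end, type) entity representation and the helper names are assumptions of this illustration; the F-score is computed through the equivalent form 2TP / (2TP + FP + FN).

```python
from collections import Counter
from typing import Set, Tuple

Entity = Tuple[int, int, str]  # assumed layout: (start, end, type)


def f_score(tp: int, fp: int, fn: int) -> float:
    # Algebraically equal to 2 * P * R / (P + R).
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f(gold: Set[Entity], pred: Set[Entity]) -> Tuple[float, float]:
    tp, fp, fn = Counter(), Counter(), Counter()
    for ent in pred:
        if ent in gold:
            tp[ent[2]] += 1
        else:
            fp[ent[2]] += 1
    for ent in gold - pred:
        fn[ent[2]] += 1
    types = sorted({ent[2] for ent in gold | pred})
    # Macro: one F-score per entity type, then an unweighted mean.
    macro = (sum(f_score(tp[t], fp[t], fn[t]) for t in types) / len(types)
             if types else 0.0)
    # Micro: pool the counts of all types, then a single F-score.
    micro = f_score(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# A large class (PER) recognized perfectly, a small class (LOC) missed entirely.
gold = {(i, i, "PER") for i in range(8)} | {(10, 10, "LOC"), (11, 11, "LOC")}
pred = {(i, i, "PER") for i in range(8)}
macro, micro = macro_micro_f(gold, pred)
```

Here the macro-averaged F-score is 0.5, because the missed LOC class counts as much as PER, while the micro-averaged F-score is 8/9, because the large PER class dominates the pooled counts.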
2.3.2 Relaxed-match Evaluation
MUC-6 defines a relaxed-match evaluation: a correct type is credited if an entity is assigned its correct type regardless of its boundaries, as long as there is an overlap with the ground truth boundaries; a correct boundary is credited regardless of an entity’s type assignment. ACE then proposed a more complex evaluation procedure. It resolves a few issues such as partial match and wrong type, and considers subtypes of named entities. However, it is problematic because the final scores are comparable only when the parameters are fixed. Complex evaluation methods are not intuitive and make error analysis difficult. Thus, they are not widely used in recent NER studies.
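The two relaxed credits described for MUC-6 can be sketched as follows. This is a simplified illustration rather than the official MUC scorer: credits are counted per predicted entity without a one-to-one alignment to ground-truth entities, and the (start, end, type) representation with inclusive end offsets is an assumption.

```python
from typing import List, Tuple

Entity = Tuple[int, int, str]  # assumed layout: (start, end_inclusive, type)


def overlaps(a: Entity, b: Entity) -> bool:
    """True if the two spans share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def relaxed_credits(gold: List[Entity], pred: List[Entity]) -> Tuple[int, int]:
    # Type credit: correct type and any overlap with a ground-truth span.
    type_credit = sum(
        any(p[2] == g[2] and overlaps(p, g) for g in gold) for p in pred
    )
    # Boundary credit: identical span, whatever type was assigned.
    boundary_credit = sum(
        any(p[:2] == g[:2] for g in gold) for p in pred
    )
    return type_credit, boundary_credit


gold = [(0, 1, "PER"), (5, 6, "ORG")]
pred = [(0, 2, "PER"),  # correct type, boundary one token too long
        (5, 6, "LOC")]  # correct boundary, wrong type
type_credit, boundary_credit = relaxed_credits(gold, pred)
```

Under exact-match evaluation both predictions would count as errors; under the relaxed scheme the first earns a type credit and the second a boundary credit.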
2.4 Traditional Approaches to NER
Traditional approaches to NER are broadly classified into three main streams: rule-based, unsupervised learning, and feature-based supervised learning approaches.
2.4.1 Rule-based Approaches
Rule-based NER systems rely on hand-crafted rules. Rules can be designed based on domain-specific gazetteers and syntactic-lexical patterns. Kim proposed using the Brill rule inference approach for speech input. This system generates rules automatically based on Brill’s part-of-speech tagger. In the biomedical domain, Hanisch et al. proposed ProMiner, which leverages a pre-processed synonym dictionary to identify protein mentions and potential genes in biomedical text. Quimbaya et al. proposed a dictionary-based approach for NER in electronic health records. Experimental results show that the approach improves recall while having limited impact on precision.
Some other well-known rule-based NER systems include LaSIE-II, NetOwl, Facile, SAR, FASTUS, and LTG systems. These systems are mainly based on hand-crafted semantic and syntactic rules to recognize entities. Rule-based systems work very well when the lexicon is exhaustive. Due to domain-specific rules and incomplete dictionaries, high precision and low recall are often observed from such systems, and the systems cannot be transferred to other domains.
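A minimal sketch of the rule-based idea, combining a gazetteer lookup with one hand-crafted lexical pattern, is given below. The toy gazetteer, the honorific pattern and the function name `rule_based_ner` are hypothetical and do not reproduce any of the systems cited above.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical toy gazetteer; real systems use far larger domain lexicons.
GAZETTEER: Dict[str, str] = {
    "new york": "LOC",
    "world health organization": "ORG",
}

# One hand-crafted lexical pattern: an honorific followed by a capitalized word.
TITLE_PATTERN = re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.\s+([A-Z][a-z]+)")


def rule_based_ner(text: str) -> List[Tuple[str, str]]:
    """Return (surface form, type) pairs in order of appearance."""
    found: List[Tuple[int, str, str]] = []
    for phrase, label in GAZETTEER.items():
        pattern = r"\b" + re.escape(phrase) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            found.append((m.start(), m.group(0), label))
    for m in TITLE_PATTERN.finditer(text):
        found.append((m.start(1), m.group(1), "PER"))
    return [(surface, label) for _, surface, label in sorted(found)]


entities = rule_based_ner("Dr. Smith moved from New York to Boston.")
```

The call returns "Smith" as PER and "New York" as LOC, but misses "Boston" because it is absent from the gazetteer, which mirrors the high-precision, low-recall behaviour of systems with incomplete dictionaries.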
2.4.2 Unsupervised Learning Approaches
A typical approach of unsupervised learning is clustering. Clustering-based NER systems extract named entities from the clustered groups based on context similarity. The key idea is that lexical resources, lexical patterns, and statistics computed on a large corpus can be used to infer mentions of named entities. Collins et al. observed that the use of unlabeled data reduces the requirements for supervision to just 7 simple “seed” rules. The authors then presented two unsupervised algorithms for named entity classification. Similarly, the KNOWITALL system leverages a set of predicate names as input and bootstraps its recognition process from a small set of generic extraction patterns.
Nadeau et al. proposed an unsupervised system for gazetteer building and named entity ambiguity resolution. This system combines entity extraction and disambiguation based on simple yet highly effective heuristics. In addition, Zhang and Elhadad proposed an unsupervised approach to extracting named entities from biomedical text. Instead of supervision, their model resorts to terminologies, corpus statistics (e.g., inverse document frequency and context vectors) and shallow syntactic knowledge (e.g., noun phrase chunking). Experiments on two mainstream biomedical datasets demonstrate the effectiveness and generalizability of their unsupervised approach.
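To make the two corpus statistics named above concrete, the following sketch computes inverse document frequency and simple context vectors on a toy corpus. It is only an illustration of these statistics, not the model of Zhang and Elhadad; the window size, the toy documents and all function names are assumptions.

```python
import math
from collections import Counter
from typing import Dict, List


def inverse_document_frequency(docs: List[List[str]]) -> Dict[str, float]:
    """idf(t) = log(N / df(t)), with df(t) the number of documents containing t."""
    n_docs = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log(n_docs / count) for tok, count in df.items()}


def context_vector(docs: List[List[str]], target: str, window: int = 1) -> Counter:
    """Count the words seen within `window` tokens of each occurrence of target."""
    vec: Counter = Counter()
    for doc in docs:
        for i, tok in enumerate(doc):
            if tok == target:
                left = doc[max(0, i - window):i]
                right = doc[i + 1:i + 1 + window]
                vec.update(left + right)
    return vec


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


docs = [
    ["patient", "received", "aspirin", "daily"],
    ["patient", "received", "ibuprofen", "daily"],
    ["patient", "was", "discharged"],
]
idf = inverse_document_frequency(docs)
similarity = cosine(context_vector(docs, "aspirin"),
                    context_vector(docs, "ibuprofen"))
```

In this toy corpus "patient" occurs in every document and receives an IDF of zero, while "aspirin" and "ibuprofen" are rare and share identical contexts, so their context vectors have a cosine similarity of approximately 1. Signals of this kind allow candidate mentions to be grouped without labeled data.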
2.4.3 Feature-based Supervised Learning Approaches
With supervised learning, NER is cast as a multi-class classification or sequence labeling task. Given annotated data samples, features are carefully designed to represent each training example. Machine learning algorithms are then utilized to learn a model that recognizes similar patterns in unseen data.
Feature engineering is critical in supervised NER systems. Feature vector representation is an abstraction over text where a word is represented by one or many Boolean, numeric, or nominal values. Word-level features (e.g., case, morphology, and part-of-speech tag), list lookup features (e.g., Wikipedia gazetteer and DBpedia gazetteer), and document and corpus features (e.g., local syntax and multiple occurrences) have been widely used in various supervised NER systems. More feature designs are discussed in , , 
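The three feature families mentioned above can be sketched as a function that maps one token to Boolean and nominal values. The specific feature names, the boundary markers `<BOS>` and `<EOS>`, and the toy gazetteer are assumptions of this illustration; part-of-speech tags are omitted because they would require an external tagger.

```python
from typing import Dict, List, Set, Union

FeatureValue = Union[str, bool]


def token_features(tokens: List[str], i: int, gazetteer: Set[str]) -> Dict[str, FeatureValue]:
    """Build a feature dictionary for tokens[i]."""
    word = tokens[i]
    return {
        # Word-level features: case and simple morphology.
        "lower": word.lower(),
        "is_capitalized": word[:1].isupper(),
        "is_all_caps": word.isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        "suffix3": word[-3:].lower(),
        # List lookup feature: membership in a (toy) gazetteer.
        "in_gazetteer": word.lower() in gazetteer,
        # Local context features.
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_lower": tokens[i + 1].lower() if i + 1 < len(tokens) else "<EOS>",
        # Document-level feature: the same token occurs more than once.
        "seen_elsewhere": tokens.count(word) > 1,
    }


tokens = ["Paris", "is", "lovely", ";", "Paris", "2018"]
features = token_features(tokens, 0, gazetteer={"paris"})
```

Feature dictionaries of this form, one per token, are a typical input to the classifiers and sequence models used in feature-based supervised NER.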
Based on these features, many machine learning algorithms have been applied in supervised NER, including Hidden Markov Models (HMM), Decision Trees, Maximum Entropy Models, Support Vector Machines (SVM), and Conditional Random Fields (CRF).