Meta-Learning Orthographic and Contextual Models for Language
Independent Named Entity Recognition
Robert Munro and Daren Ler and Jon Patrick
Language Technology Research Group
Capital Markets Co-operative Research Centre
University of Sydney
{rmunro,ler,jonpat}@it.usyd.edu.au
Abstract
This paper presents a named entity classification system that utilises both orthographic and contextual information. The random subspace method was employed to generate and refine attribute models. Supervised and unsupervised learning techniques were used in the recombination of models to produce the final results.
1 Introduction
There are commonly considered to be two main tasks in named entity recognition: recognition (NER) and classification (NEC). As the features that best classify words according to the two tasks are somewhat disparate, the two are often separated. Attribute sets may be further divided into subsets through sub-grouping of attributes, sub-grouping of instances and/or the use of multiple classifying processes. While the use of multiple subsets can increase overall accuracy, the recombination of models has been shown to propagate errors (Carreras et al., 2002; Patrick et al., 2002). More importantly, the decision regarding the separation of attributes into various subsets is often a manual task. As it is reasonable to assume that the same attributes will have different relative levels of significance in different languages, using the same division of attributes across languages will be less than optimal, while a manual redistribution across different languages is limited by the user's knowledge of those languages. In this paper, the division and subsequent recombination of subgroups is treated as a meta-learning task.
2 Feature Representation
It has been our intention to create a linguistically driven model of named entity composition, and to search for the attribute representations of these linguistic phenomena that best suit inference by a machine learning algorithm.
It is important to note that a named entity is a label that has been consciously granted by some person or persons, and as these names are chosen rather than assigned randomly or evolved gradually, there are generalisations that may be inferred about the words that may be used for naming certain entity types (Allan, 2001; Kripke, 1972). While generalisations relating to abstract connotations of a word may be difficult to infer, generalisations about the structure of the words are more emergent. As the use of a name stretches back in time, it stretches back to a different set of etymological constraints. It may also stem from another language, with a different orthographic structure, possibly representing a different underlying phonology. Foreign words are frequently named entities, especially in the domain of a newswire such as Reuters. In language in general it is also reasonable to assume that a foreign word is more likely to be an entity, as people are more likely to migrate between countries than prepositions. It is these generalisations of the structure of words that we have attempted to represent in the n-gram features.
Another emergent structural generalisation is that of capitalisation, as named entities are commonly expressed in title case in European languages. In this work it has been investigated as a preprocessing step.
The other features used were contextual features, such as observed trigger words, and the given part-of-speech and chunking tags.

In total, we selected 402 attributes from which the models were built.
2.1 Character N-Gram Modelling
The fundamental attribute of character n-gram modelling is the observed probability of a collocation of characters occurring as each of the category types. Individual n-grams, or aggregations of them, may be used as attributes in part of a larger data set for machine learning.
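As a concrete illustration of this attribute type, the following is a minimal sketch (ours, not the authors' released code; names such as train_ngram_model are hypothetical) of how per-category character bi-gram probabilities could be estimated from a tagged word list:

    from collections import Counter, defaultdict

    CATEGORIES = ["LOC", "PER", "ORG", "MISC", "O"]

    def char_ngrams(word, n=2):
        """All character n-grams of a word, e.g. 'Is', 'sr', 'ra', 'ae', 'el' for 'Israel'."""
        return [word[i:i + n] for i in range(len(word) - n + 1)]

    def train_ngram_model(tagged_words, n=2):
        """Estimate P(category | n-gram) from (word, category) training pairs."""
        counts = defaultdict(Counter)              # n-gram -> Counter over categories
        for word, category in tagged_words:
            for gram in char_ngrams(word, n):
                counts[gram][category] += 1
        return {gram: {c: cat_counts[c] / sum(cat_counts.values()) for c in CATEGORIES}
                for gram, cat_counts in counts.items()}

The resulting per-gram probabilities can then be used directly as attributes, or aggregated per word as described below.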
Modelling at the orthographic level has been shown to be a successful method of named entity recognition. Orthographic Tries (Cucerzan and Yarowsky, 1999; Whitelaw and Patrick, 2003; Whitelaw and Patrick, 2002) and character n-gram modelling (Patrick et al., 2002) are two methods for capturing orthographic features. While Tries give a rich representation of a word, they are fixed to one boundary of a word and cannot extend beyond unseen character sequences. As they are also a classifying tool in themselves, their integration with a machine learning algorithm is problematic, as evidenced by the reduction in overall accuracy when processing a Trie output through a machine learner in Patrick et al. (2002). As such, Tries have not been used here. Although n-gram modelling has not always been successful as a lone method of classification (Burger et al., 2002), for the reasons outlined above it is a more flexible modelling technique than Tries.
To capture affixal information, we used n-gram modelling to extract features for the suffixes and prefixes of all words for all categories.
For general orthographic information we used the average probability of all bi-grams occurring in a word for each category, and the value of the maximum and minimum probability of all bi-grams in a word for each category. To capture contextual information, these bi-gram attributes were also extracted across word boundaries, both pre/post and exclusive/inclusive of the current word, for different context windows.
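A minimal sketch of these aggregate attributes is given below, reusing the hypothetical char_ngrams helper and bi-gram model from the earlier sketch; the default probability assigned to unseen bi-grams is our assumption, not a detail taken from the paper.

    def aggregate_bigram_attributes(word, model, categories, default=0.0):
        """Mean, maximum and minimum probability of the word's bi-grams
        belonging to each category (the aggregate n-gram attributes)."""
        attributes = {}
        grams = char_ngrams(word, 2)
        for cat in categories:
            probs = [model.get(g, {}).get(cat, default) for g in grams] or [default]
            attributes["avg_bigram_" + cat] = sum(probs) / len(probs)
            attributes["max_bigram_" + cat] = max(probs)
            attributes["min_bigram_" + cat] = min(probs)
        return attributes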
All n-grams were extracted for the four entity types, location, person, organisation and miscellaneous, with the word-level n-grams also extracted for NE recognition attributes using an IOE2 model.
The aggregate n-gram attributes (for example, the average probability of all the n-grams in a word belonging to a category) act as a memory-based attribute, clustering forms with less than random variance. These most benefit agglutinative structures, such as the compound words common to German, as well as morphologically disparate forms, for example, 'Australia' and 'Australian'. Here, of all the n-grams, only the final one differs. While a stemming algorithm would also match the two words, stemming algorithms are usually based on language-specific affixal rules and are therefore inappropriate for a language independent task. Furthermore, the difference between the words may be significant. The second of the two words, used adjectivally, would most likely belong to the miscellaneous category, while the first is most likely to be a location.
2.2 Contextual Features
Other than the contextual n-gram attributes, the contextual features used were: a bag of words, both pre and post an entity; the relative sentence position of the word; commonly observed forms; and observed collocational trigger words for each category.
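The sketch below shows one plausible way of collecting such collocational trigger words from tagged training sentences; it is our reconstruction under assumptions (for example, the frequency threshold and the restriction to the immediately preceding word), not the authors' procedure.

    from collections import Counter, defaultdict

    def collect_trigger_words(tagged_sentences, min_count=5):
        """Words that frequently precede an entity of each category.
        Each sentence is a list of (word, category) pairs, with 'O' for non-entities."""
        triggers = defaultdict(Counter)
        for sentence in tagged_sentences:
            for i in range(1, len(sentence)):
                word, category = sentence[i]
                prev_word, prev_category = sentence[i - 1]
                if category != "O" and prev_category == "O":
                    triggers[category][prev_word.lower()] += 1
        return {cat: {w for w, c in counts.items() if c >= min_count}
                for cat, counts in triggers.items()}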
2.3 Other Features
The part-of-speech and chunking tags were used with a context window of three, both before and after each word.

For the German data, an attribute indicating whether the word matched its lemma form or was unknown was also included.

An attribute indicating both the individual and sequential existence of a word in the gazetteer was included for both sets.
No external sources were used.
3 Normalising Case Information
As well as indicating a named entity, capitalisation may indicate phenomena such as the start of a sentence, a title phrase or the start of reported speech. As orthographic measures such as n-grams are case sensitive, both in the building of the model and in classification, a preprocessing step to correctly reassign the case information was used to correct alternations caused by these phenomena. To the best knowledge of the authors, the only other attempt to use computational inference methods for this task is Whitelaw and Patrick (2003). Here we assumed all words in the training and raw data sets that were not sentence initial, did not occur in a title sentence, and did not immediately follow punctuation were in the correct case. This amounted to approximately 10,000,000 words. From these, we extracted the observed probability of a word occurring as lowercase, all capitals, initial capital, or internal capital; the bi-gram distribution across these four categories; and the part-of-speech and chunking tags of the word. Using a decision graph (Patrick and Goyal, 2001), all words from the test and training sets were then either recapitalised or decapitalised according to the output. The results were 97.8% accurate, as indicated by the number of elements in the training set that were correctly re-assigned their original case.
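A minimal sketch of the lexical statistics feeding this step follows; the four case classes track the description above, but the helper names are ours and the decision graph itself is only indicated by a comment, since its implementation is not reproduced here.

    from collections import Counter, defaultdict

    def case_class(word):
        """One of the four case categories used for case normalisation."""
        if not word or word.islower():
            return "lower"
        if word.isupper():
            return "allcaps"
        if word[0].isupper() and word[1:].islower():
            return "initial"
        return "internal"

    def case_distribution(reliable_tokens):
        """Observed probability of each word form occurring in each case class,
        estimated only from tokens in reliable positions (not sentence initial,
        not in a title sentence, not following punctuation)."""
        counts = defaultdict(Counter)
        for token in reliable_tokens:
            counts[token.lower()][case_class(token)] += 1
        return {w: {c: n / sum(cc.values()) for c, n in cc.items()}
                for w, cc in counts.items()}

    # These distributions, together with bi-gram and POS/chunk features, would then
    # train a classifier (a decision graph in the paper) that re-assigns case.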
The benefit of case-restoration for the English development set was Fβ=1 1.56. Case-restoration was not undertaken on the English test set or the German sets. For consistency, the English development results reported in Table 1 are for processing without case restoration. We leave a more thorough investigation of case restoration as future work.
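For reference, the Fβ=1 figures quoted throughout are the standard weighted harmonic mean of precision P and recall R with β = 1:

    F_{\beta} = \frac{(\beta^{2} + 1)\,P\,R}{\beta^{2}P + R},
    \qquad
    F_{\beta=1} = \frac{2PR}{P + R}.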
4 Processing
In order to make classifications, we employ a meta-learning strategy that is a variant of stacking (Wolpert, 1992) and cascading (Gama and Brazdil, 2000) over an ensemble of classifiers. This classifier is described in two phases.

In the first phase, an ensemble of classifiers is produced by combining both the random subspace method (Ho, 1998) and bootstrap aggregation, or bagging (Breiman, 1996).
In the random subspace method, subspaces of the feature space are formed, with each subspace trained to produce a classifier. Given that n features can generate 2^n different subsets of features, not all possible subsets are created. Ho (1998) suggests that the random subspace method is best suited to problems with high dimensionality. Furthermore, he finds that the method works well where there is a high degree of redundancy across attributes, and where little is known in advance about the significance of the various attributes. It is also a useful method for limiting the impact of attributes that may cause the learner to overfit the data. This is especially important in the domain of newswires, where the division between training and test sets is temporal, as topic shift is likely to occur.
From a different perspective, bagging produces different subsets, or bootstrap replicates, by randomly drawing, with replacement, m instances from the original training set. Once again, each bag is used to produce a different classifier.
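The sketch below shows one way the two sampling schemes could be combined when forming training subsets. It is our illustration rather than the released system; the defaults mirror the figures reported later in this section (65% of the instances, 50 of the 402 attributes, 150 subsets), while the sampling details are assumptions.

    import random

    def make_subsets(n_instances, n_features, n_subsets=150,
                     instance_fraction=0.65, n_subset_features=50, seed=0):
        """Combine bagging (instances drawn with replacement) with the random
        subspace method (a random subset of features) for each training subset."""
        rng = random.Random(seed)
        subsets = []
        k = int(instance_fraction * n_instances)
        for _ in range(n_subsets):
            instance_ids = [rng.randrange(n_instances) for _ in range(k)]    # with replacement
            feature_ids = rng.sample(range(n_features), n_subset_features)   # without replacement
            subsets.append((instance_ids, feature_ids))
        return subsets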
Both techniques share the same fundamental idea of forming multiple training sets from a single original training set. An unweighted or weighted voting scheme is then typically adopted to make the ultimate classification. However, in this paper, as the second phase of our classifier, an additional level of learning is performed. For each training instance, the class or category probability distributions produced by the underlying ensemble are used in conjunction with the correct classification to train a new final classifier. The category probability distributions may be seen as meta-data that is used to train a meta-learner.
Specifically, given n features A_1, A_2, ..., A_n and m training instances I_1, I_2, ..., I_m, we may then randomly form l different training subsets S_1, S_2, ..., S_l, with each S_i containing a random subset of both attributes and training instances. Given a learning algorithm L, each S_i is used to train L to produce l different classifiers C_1, C_2, ..., C_l. When tested, each C_i will produce a category probability distribution C_i(D_1), C_i(D_2), ..., C_i(D_g), where g is the total number of categories. Then, for each training instance h, the unified category probability distribution

    ( Σ_{r=1}^{l} C_r(D_1), Σ_{r=1}^{l} C_r(D_2), ..., Σ_{r=1}^{l} C_r(D_g) )

in conjunction with the correct category for that instance, CL_h, is used to train L to produce the final classifier C_0.
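A minimal sketch of this two-phase procedure is given below, using scikit-learn's DecisionTreeClassifier as a stand-in for the boosted decision graph actually used in the paper, and assuming integer class labels 0..g-1:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def train_meta_learner(X, y, subsets, n_categories):
        """Phase 1: one classifier per (instances, features) subset.
        Phase 2: a final classifier trained on the summed category
        probability distributions of the ensemble."""
        ensemble = []
        for instance_ids, feature_ids in subsets:
            clf = DecisionTreeClassifier()
            clf.fit(X[np.ix_(instance_ids, feature_ids)], y[instance_ids])
            ensemble.append((clf, feature_ids))

        meta = np.zeros((X.shape[0], n_categories))
        for clf, feature_ids in ensemble:
            # predict_proba columns follow clf.classes_, so align them explicitly
            meta[:, clf.classes_] += clf.predict_proba(X[:, feature_ids])

        final = DecisionTreeClassifier().fit(meta, y)
        return ensemble, final

    def meta_predict(ensemble, final, X, n_categories):
        meta = np.zeros((X.shape[0], n_categories))
        for clf, feature_ids in ensemble:
            meta[:, clf.classes_] += clf.predict_proba(X[:, feature_ids])
        return final.predict(meta)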
In our experimentation, we divided each data set into subsets containing approximately 65% of the original training set (with replication) and 50 of the total 402 attributes. In total, the meta-learner utilised data generated from the combined output of 150 sub-classifiers. The choices regarding the number of subsets and their respective attribute and instance populations were made in consideration of both processing constraints and the minimum amount of original data required. While increasing the number of subsets will generally increase the overall accuracy, obtaining an optimal subset size through automated experimentation would have been a preferable method, especially as the optimal size may differ between languages.
To eliminate subsets that were unlikely to produce accurate results across any language, we identified eight subtypes of attributes, and considered only those sets with at least one attribute from each (a sketch of this filtering step follows the list below). These were:
1. prefixal n-grams
2. suffixal n-grams
3. n-grams specifically modelling IOE2 categories
4. trigger forms occurring before a word
5. trigger forms occurring after a word
6. sentence and clausal positions
7. collocating common forms
8. the observed probability of the word and surround-
ing words belonging to each category type
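A minimal sketch of this constraint, assuming a hypothetical subtype_of mapping from each attribute index to one of the eight subtypes above:

    def covers_all_subtypes(feature_ids, subtype_of, n_subtypes=8):
        """True if the candidate feature subset contains at least one attribute
        from every subtype (prefixal n-grams, suffixal n-grams, and so on)."""
        return len({subtype_of[f] for f in feature_ids}) == n_subtypes

    # Hypothetical usage: keep only candidate subsets satisfying the constraint.
    # subsets = [s for s in make_subsets(m, 402) if covers_all_subtypes(s[1], subtype_of)]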
To classify the various subsets generated, as well as to train the final meta-learner, a boosted decision graph (Patrick and Goyal, 2001) is used.
5 Results
N-grams, in various instances, were able to capture information about various structural phenomena. For example, the bi-gram 'ae' occurred in an entity in approximately 96% of instances in the English training set and 91% in the German set, showing that the compulsion not to assimilate old name forms such as 'Israel' and 'Michael' to something like 'Israil' and 'Michal' is more emergent than the general constraint to maintain form. An example of a bi-gram indicating a word from a foreign language with a different phonology is 'cz', representing the voiced palatal fricative, which is not commonly used in English or German. The fact that two characters were needed to represent one underlying phoneme in itself suggests this. Within English, the suffix 'gg' always indicates a named entity, with the exception of the word 'egg', which has retained both g's in accordance with the English constraint of content words being three or more letters long. All other words with an etymological history of a 'gg' suffix, such as 'beg', have assimilated to the shorter form.
The meta-learning strategy improved the German test set results by Fβ=1 9.06 over a vote across the classifiers. For the English test set, this improvement was Fβ=1 0.40.
6 Discussion
The methodology employed was significantly more successful at identifying location and person entities (see Table 1). The recall figures for these categories in English are especially high considering that precision is typically the higher value in named entity recognition. Although the lower value for miscellaneous entities was expected, due to the relatively smaller number of items and the idiosyncrasies of the category membership, the significantly low values for organisations were surprising. There are three possible reasons for this: organisations are more likely than people or places to take their names from the contemporary lexicon, and are therefore less likely to contain orthographic structures able to be exploited by n-gram modelling; in the training set, organisations were relatively over-represented in the errors made in the normalising of case information, most likely due to the previous reason; and organisations may be represented metonymically, creating ambiguity about the entity class.
As the difference that meta-learning made to German was very large, but to English very small (see Results), it is reasonable to assume that the individual English classifiers were much more homogeneous, indicating that the attribute space for the individual English classifiers was very successful, while only certain classifiers, or combinations of them, were beneficial for German. The flexibility of the strategy as a whole was successful when generalising across languages.
References
K. Allan. 2001. Natural Language Semantics. Blackwell Publishers, Oxford, UK.
L. Breiman. 1996. Bagging predictors. Machine Learning, 24(2), pages 123–140.
J. D. Burger, J. C. Henderson, and W. T. Morgan. 2002. Statistical Named Entity Recognizer Adaptation. In Proceedings of CoNLL-2002. Taipei, Taiwan.
X. Carreras, L. Marques, and L. Padro. 2002. Named Entity Extraction using AdaBoost. In Proceedings of CoNLL-2002. Taipei, Taiwan.
S. Cucerzan and D. Yarowsky. 1999. Language independent named entity recognition combining morphological and contextual evidence.
J. Gama and P. Brazdil. 2000. Cascade generalization. Machine Learning, 41(3), pages 315–343.
T. K. Ho. 1998. The Random Subspace Method for Constructing Decision Forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8).
S. Kripke. 1972. Naming and necessity. In Semantics of Natural Language, pages 253–355.
English devel. precision recall Fβ=1
LOC 90.02% 92.76% 91.37
MISC 89.05% 78.52% 83.46
ORG 77.61% 82.48% 79.97
PER 88.48% 93.38% 90.86
Overall 86.49% 88.42% 87.44
English test precision recall Fβ=1
LOC 84.74% 88.25% 86.46
MISC 80.13% 70.66% 75.09
ORG 75.20% 79.77% 77.42
PER 82.98% 90.48% 86.57
Overall 80.87% 84.21% 82.50
German devel. precision recall Fβ=1
LOC 74.70% 68.76% 71.60
MISC 76.07% 59.80% 66.96
ORG 71.14% 63.17% 66.92
PER 76.87% 76.87% 76.87
Overall 74.75% 67.80% 71.11
German test precision recall Fβ=1
LOC 67.73% 69.76% 68.73
MISC 64.38% 57.46% 60.73
ORG 60.54% 54.98% 57.63
PER 78.95% 75.31% 77.09
Overall 69.37% 66.21% 67.75
Table 1: Results for English and German sets.
J. Patrick and I. Goyal. 2001. Boosted Decision Graphs for NLP Learning Tasks. In Walter Daelemans and Rémi Zajac, editors, Proceedings of CoNLL-2001, pages 58–60. Toulouse, France.
J. Patrick, C. Whitelaw, and R. Munro. 2002. SLINERC: The Sydney Language-Independent Named Entity Recogniser and Classifier. In Proceedings of CoNLL-2002, pages 199–202. Taipei, Taiwan.
C. Whitelaw and J. Patrick. 2002. Orthographic tries in language independent named entity recognition. In Proceedings of ANLP02, pages 1–8. Centre for Language Technology, Macquarie University.
C. Whitelaw and J. Patrick. 2003. Named Entity Recognition Using a Character-based Probabilistic Approach. In Proceedings of CoNLL-2003. Edmonton, Canada.
D. Wolpert. 1992. Stacked generalization. Neural Networks, 5(2), pages 241–260.