Definition extraction using a sequential combination of baseline grammars
and machine learning classifiers
Łukasz Degórski1, Michał Marcińczuk3, Adam Przepiórkowski1,2
1Institute of Computer Science, Polish Academy of Sciences, Warsaw
2Institute of Informatics, University of Warsaw
3Institute of Applied Informatics, Wrocław University of Technology
ldegorski@bach.ipipan.waw.pl, michal.marcinczuk@pwr.wroc.pl, adamp@ipipan.waw.pl
Abstract
The paper deals with the task of definition extraction from a small and noisy corpus of instructive texts. Three approaches are presented: partial parsing, machine learning, and a sequential combination of both. We show that applying ML methods with the support of a trivial grammar gives better results than a relatively complicated partial grammar, and much better results than a pure ML approach.
1. Introduction
The aim of this paper is to contrast two approaches
to the task of extracting definitions from relatively un-
structured instructive texts (textbooks, learning materials
in eLearning, etc.) in a morphologically rich, relatively free
word order, determinerless language (Polish). The task is
a part of a larger EU project, Language Technology for
eLearning (LT4eL; http://www.lt4el.eu/), focus-
ing on facilitating the construction and retrieval of learning
objects (LOs) in eLearning with the help of language tech-
nology; the results of definition extraction are presented
to the author or the maintainer of a LO as candidates for
the glossary of this LO. Since it is easier to reject wrong
definition candidates than to go back to the text and search
for missed definitions manually, recall is more important
than precision in the evaluation of the results.
Previous work (Przepiórkowski et al., 2007a;
Przepiórkowski et al., 2007b) approached this task
via the manual construction of partial (or shallow) gram-
mars finding fragments of definition sentences. In this
paper we attempt to quantify the extent to which the same
task may be accomplished with automatically trained
machine learning classifiers, without the need to construct
sophisticated manual grammars.
For the experiments described below, a corpus of instruc-
tive texts of over 300K tokens (with over 550 definitions)
was automatically annotated morphosyntactically and then
manually annotated for definitions. The corpus was split into two main parts: a training corpus (the combination of what Przepiórkowski et al. (2007b) call a training corpus and a held-out corpus) and a testing corpus. The quantitative characteristics of these corpora are given in Table 1.
                     training   testing    TOTAL
tokens                223 327    77 309   300 636
sentences               7 481     3 349    10 830
definitions               386       172       558
sentences with def.       364       182       546

Table 1: Corpora used in the experiments
2. Partial Parsing Experiments
Partial parsing experiments are most fully described in (Przepiórkowski et al., 2007b). Since the input texts were XML-encoded [1], the XML-aware and efficient lxtransduce tool (Tobin, 2005) was used to implement the grammar. The grammar, essentially a cascade of regular grammars, was developed within about 10 working days in over 100 iterations; in each iteration the grammar was improved and the results were evaluated (on portions of the training corpus only) both quantitatively (automatically) and qualitatively (manually). The final grammar, called PG [2] (partial grammar), contains 13 top-level rules (48 rules in total, in a 16K file, 12.5 lines per rule on average).
For the evaluation, three baseline grammars were constructed: from the trivial B1 grammar, which marks all sentences as definition sentences, through B2, which marks as definitory all sentences containing a possible Polish copula (jest, są, to), the abbreviation tj. 'i.e.', or the word czyli 'that is', 'namely', to B3, a very permissive grammar marking as definitions all sentences containing any of 27 very simple patterns (in most cases, single-token keywords) manually identified on the basis of the manually annotated definitions (these patterns include all patterns of B2, as well as various definitor verbs, apparent acronym specifications, the equal sign '=', etc.).
For all grammars, a sentence was classified as a definition sentence if the grammar found a match in the sentence (not necessarily spanning the whole sentence).
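As an illustration, the following is a minimal sketch (in Python, not the actual lxtransduce implementation) of how such a keyword-based baseline can be realised; the pattern list corresponds to B2, and the example sentence is a hypothetical Polish definition added here for illustration only.

```python
import re

# B2-style baseline (a sketch, not the actual lxtransduce grammar): a sentence
# is a definition candidate if it contains a possible Polish copula (jest, są,
# to), the abbreviation "tj." or the word "czyli".
B2_PATTERNS = [r"\bjest\b", r"\bsą\b", r"\bto\b", r"\btj\.", r"\bczyli\b"]

def is_definition_candidate(sentence, patterns=B2_PATTERNS):
    """True if any pattern matches anywhere in the sentence
    (the match need not span the whole sentence)."""
    return any(re.search(p, sentence, flags=re.IGNORECASE) for p in patterns)

# B3 would simply extend the pattern list to the 27 patterns mentioned above
# (further keywords, definitor verbs, acronym-like patterns, the "=" sign).
print(is_definition_candidate("Fonem to najmniejsza jednostka mowy."))  # True
```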
All grammars were applied to the testing corpus, unseen during the development of the grammars; the results are given in Table 2. Apart from precision (P) and recall (R), the usual F-measure (F1) is given, as well as F2, used by Przepiórkowski et al. (2007a), and F5, (apparently) used by Saggion (2004) [3]. Note that for the task at hand, where recall is more important than precision, the latter two measures seem appropriate, although whether recall is twice as important as precision (F2) or five times as important (F5) is ultimately an empirical issue that should be settled by user case evaluation experiments.

[1] More precisely, the input adheres to the XML Corpus Encoding Standard (Ide et al., 2000).
[2] This is the GR' grammar of Przepiórkowski et al. (2007b).
[3] It should, however, be noted that Saggion (2004) uses F5 to evaluate definition answers to particular questions.
        P       R       F1     F2     F5
B1     5.43  100.00   10.31  14.71  25.64
B2     9.69   61.54   16.74  22.11  32.53
B3    10.54   88.46   18.84  25.54  39.64
PG    18.69   59.34   28.42  34.39  43.55

Table 2: Partial grammar evaluation on the testing corpus
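The paper does not spell out the exact definition of F2 and F5. The sketch below assumes one plausible reading, namely a harmonic mean in which recall is weighted beta times as heavily as precision, F_beta = (1 + beta) * P * R / (beta * P + R); under this assumption the B1 row of Table 2 is reproduced up to rounding, but the formula should be taken as our reconstruction rather than the authors' stated definition.

```python
def f_weighted(p, r, beta):
    """Harmonic mean of precision and recall with recall weighted beta times
    as heavily as precision (our reading of F2/F5; beta = 1 gives plain F1)."""
    return (1 + beta) * p * r / (beta * p + r)

# Checking against the B1 row of Table 2 (P = 5.43, R = 100.00):
for beta, reported in [(1, 10.31), (2, 14.71), (5, 25.64)]:
    print(beta, round(f_weighted(5.43, 100.0, beta), 2), reported)
# -> 10.3 vs 10.31, 14.69 vs 14.71, 25.62 vs 25.64 (differences due to rounding of P)
```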
3. Machine Learning Experiments
Given the relatively small amount of data at our disposal
(cf. Table 1 above) and the inherent difficulty of the task,
we were skeptical about the applicability of machine learn-
ing approaches in this case, but, nevertheless, decided to
perform some experiments, starting with some traditional
well-known classifiers as implemented in the Weka toolset
(Witten and Frank, 2005): Naïve Bayes, decision trees (ID3
and C4.5), the lazy classifier IB1 and AdaBoostM1 with
Decision Stump, as well as the nu-SVC (EL-Manzalawy
and Honavar, 2005) implementation of Support Vector Ma-
chines.
In the case of the AdaBoost classifier, the number of iterations in the reported experiments was set to 1000. Other values (10, 100 and 10000) were also tested. Increasing the number of iterations improved the results, but also significantly increased the running time of the classifier; 10000 iterations took unacceptably long, and the results were not much better than for 1000 iterations.
The nu-SVC classifier was used with a radial basis (RBF) kernel; other kernels were also tested and proved worse. The nu parameter was set to 0.5 for 1:1 subsampling, 0.4 for 1:3, 0.2 for 1:5, 0.1 for 1:10 and 0.05 for no subsampling. In general, the higher the nu value, the better the results, but a value that is too high for a given subsampling ratio causes an error.
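For illustration, here is a hedged sketch of such a configuration using scikit-learn's NuSVC rather than the WLSVM/LibSVM wrapper inside Weka that the paper actually used (class names and defaults therefore differ); the nu values per subsampling ratio are those reported above.

```python
from sklearn.svm import NuSVC

# nu values used for the different subsampling ratios (from the text above);
# too high a nu for a given class balance makes the optimisation infeasible.
NU_BY_RATIO = {"1:1": 0.5, "1:3": 0.4, "1:5": 0.2, "1:10": 0.1, "1:all": 0.05}

def make_nu_svc(ratio):
    # RBF ("radial basis") kernel, as in the reported experiments.
    return NuSVC(nu=NU_BY_RATIO[ratio], kernel="rbf")

clf = make_nu_svc("1:5")
# clf.fit(X_train, y_train); clf.predict(X_test)  # X_*, y_* are placeholders
```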
The attribute set was constructed as follows: for selected n-gram types (see Table 3) [4], we took the most frequent n-grams of every type from all the sentences in the corpus. The maximum number was chosen arbitrarily for each n-gram type; the numbers actually used are smaller than this value when there are not enough possibilities (e.g., cases) in the corpus, and a little higher when several n-grams have exactly the same frequency at the threshold.
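A minimal sketch of this feature selection (helper names are ours; the paper does not describe its implementation): for one n-gram type, count all n-grams in the corpus, keep the most frequent ones up to the chosen maximum, and also keep any n-gram tied with the one exactly at the threshold.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def select_frequent_ngrams(sentences, n, max_allowed):
    """Keep the max_allowed most frequent n-grams of one type (e.g. over base
    forms, ctags or cases), plus any n-gram whose frequency ties with the one
    exactly at the threshold -- hence "really used" may exceed "max allowed"."""
    counts = Counter()
    for tokens in sentences:
        counts.update(ngrams(tokens, n))
    ranked = counts.most_common()
    if len(ranked) <= max_allowed:
        return [g for g, _ in ranked]          # fewer possibilities than allowed
    threshold = ranked[max_allowed - 1][1]
    return [g for g, c in ranked if c >= threshold]

def featurise(tokens, selected, n):
    """Turn a sentence into attribute values (here: binary presence flags,
    a simplification -- the paper does not state whether counts were used)."""
    present = set(ngrams(tokens, n))
    return [int(g in present) for g in selected]
```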
For these experiments we used the whole corpus (training and testing, cf. Table 1 and Przepiórkowski et al. (2007b)), and applied the usual 10-fold cross-validation for the preliminary evaluation.

[4] The number of ctags reported in the table is higher than the number of parts of speech in the IPI PAN Tagset used here, due to an error in the corpus annotation (in two files, full morphosyntactic information, instead of the part of speech only, was assigned as the ctag). This additionally increased the noise in the data, so the results reported in this paper should be treated as a lower bound on the actually attainable results.
n-gram type          max allowed   really used
base                      100           100
base-base                 100           100
base-base-base            100           115
ctag                      100           100
ctag-ctag                 100           100
ctag-ctag-ctag            100           100
case                      100             8
case-case                 100            59
case-case-case            100           100

Table 3: Feature set used in initial 10-fold cross-validation experiments
For the sake of reproducibility, the corpus was split into folds once, and this division was used in the subsequent cross-validation experiments. The positive and negative examples were randomly assigned to one of the 10 subsets; the ratio of positive to negative examples was balanced across the subsets.
The corpus has (and every corpus of this type will inherently have) a large majority of negative instances: the ratio of non-definitions to definitions in our training texts is about 19 to 1. We therefore decided to subsample the negative instances. For example, for the 1:5 subsampling ratio, all positive instances were used, and 5 times as many negative instances were chosen randomly from the whole set of negative instances. A side effect of this approach is that the results of experiments with subsampling are still not 100% reproducible, owing to the randomisation factor. Some classifiers are more influenced by this factor than others, but the absolute differences in precision and recall between the results of two independent tests we have conducted very rarely exceed 0.5%, and tend to balance out with regard to the F-measures.
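A sketch of the subsampling scheme under the assumptions just described (the helper name and signature are ours):

```python
import random

def subsample(positives, negatives, ratio=None, seed=None):
    """Keep all positive (definition) sentences; draw ratio * len(positives)
    negatives at random from the full negative pool (ratio=None corresponds
    to the "1:all" setting, i.e. no subsampling).  The random draw is the
    source of the small run-to-run differences mentioned above."""
    rng = random.Random(seed)
    if ratio is None or ratio * len(positives) >= len(negatives):
        chosen = list(negatives)
    else:
        chosen = rng.sample(negatives, ratio * len(positives))
    return positives + chosen

# e.g. 1:5 subsampling: train_set = subsample(pos_sentences, neg_sentences, 5)
```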
The experiments have shown that reducing the prevalence of negative examples noticeably increases recall — of course not without a loss of precision, but the change in terms of F2 is always positive or negligible. For this reason, in further research we focused on configurations with a high subsampling ratio of negative instances. One positive side effect of subsampling was a substantial (up to 13-fold in the case of AdaBoost) decrease in execution time, as fewer examples have to be analysed.
The results of ten-fold cross-validation on the whole corpus (using the balanced random split), for different subsampling ratios, as well as the results achieved on the same corpus by the grammars, are presented in Table 4. Even the best classifiers are significantly worse than the partial grammar PG. Note that some ML configurations achieve PG's precision, while others achieve PG's recall, but never both at the same time.
Classifier   Ratio     P      R      F1     F2     F5   Comments
NB           1:1      8.77  57.69  15.23  20.18  29.90
             1:5      9.94  51.65  16.67  21.53  30.39
             1:10    10.21  49.08  16.90  21.62  30.02
             1:all   10.16  46.70  16.69  21.24  29.20
C4.5         1:1      8.20  62.45  14.50  19.48  29.70
             1:5     13.90  28.21  18.62  21.00  24.08
             1:10    18.47  15.93  17.11  16.70  16.31
             1:all   35.94   8.42  13.65  11.31   9.66
ID3          1:1      8.26  64.29  14.65  19.72  30.18
             1:5     13.11  36.63  19.31  22.93  28.20
             1:10    15.52  26.56  19.59  21.47  23.74
             1:all   17.57  19.05  18.28  18.53  18.78
IB1          1:1      9.72  50.00  16.28  21.00  29.58
             1:5     15.86  25.09  19.43  21.01  22.87
             1:10    19.88  17.95  18.86  18.55  18.24
             1:all   22.19  14.47  17.52  16.37  15.36
nu-SVC       1:1      9.88  65.93  17.19  22.81  33.89  nu=0.5
             1:5     20.39  38.46  26.65  29.69  33.51  nu=0.2
             1:10    26.88  28.21  27.52  27.75  27.97  nu=0.1
             1:all   31.51  16.85  21.96  19.94  18.27  nu=0.05
AB+DS        1:1     10.59  54.95  17.75  22.92  32.35  10 iterations
             1:5     27.95  16.48  20.74  19.09  17.69  10 iterations
             1:1     11.89  66.48  20.17  26.27  37.66  100 iterations
             1:5     28.07  18.86  22.56  21.18  19.95  100 iterations
             1:1     11.67  68.32  19.94  26.10  37.77  1000 iterations
             1:5     27.49  20.70  23.62  22.55  21.59  1000 iterations
B3                    9.12  89.56  16.55  22.73  36.26
PG                   18.08  67.77  28.54  35.37  46.48

Table 4: Performance of the classifiers for different ratios of positive to negative examples, evaluated using 10-fold cross-validation on the whole corpus with a balanced random split, together with the evaluation of the grammars on the same (whole) corpus for comparison

4. Sequential combination of grammars with classifiers

In order to improve the precision, we applied the B3 grammar sequentially, before the classifiers came into play. In this approach the classifiers filter the results of the grammar: all sentences rejected by B3 are unconditionally marked as non-definitions, and only sentences accepted by B3 are passed to the ML stage, where their status is determined.
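In pseudocode terms, the combination is a simple two-stage pipeline; the sketch below uses hypothetical stand-ins b3_matches (the grammar matcher) and classifier (a trained model), since the actual system was built from lxtransduce and Weka components.

```python
def classify_with_b3_filter(sentences, b3_matches, classifier):
    """Stage 1: the B3 grammar; everything it rejects is a non-definition.
    Stage 2: only B3-accepted sentences are passed to the trained classifier,
    which makes the final decision."""
    decisions = []
    for sentence in sentences:
        if not b3_matches(sentence):
            decisions.append(False)                 # rejected by the grammar
        else:
            decisions.append(bool(classifier(sentence)))
    return decisions
```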
In these experiments the classifiers were trained on the training corpus and evaluated on the testing corpus. This effectively takes almost 10 times less time than cross-validation on the whole corpus, so an augmented set of features could be used. Apart from 1-, 2- and 3-grams of single features, mixed combinations of attributes were added — see Table 5 for details.
4.1. Single classifiers
As the grammar in the preliminary stage takes some
care of precision, classifier configurations with high re-
call turned out to be optimal, as they are complementary
to the grammar. Thus, for all types of classifiers, the 1:1
subsampling ratio ensured the best results. The evaluation
on the testing corpus for single classifiers with subsampling
1:1 is presented in Table 6.
Note that the best of these results, especially, B3 com-
bined with the AdaBoost classifier, approach the results
of the grammar PG, but still do not exceed them in terms
of F2.
n-gram type           max allowed   really taken
base                       400           404
base-base                  100           100
base-base-base             100           101
ctag                       100           106
ctag-ctag                  100           100
ctag-ctag-ctag             100           100
case                       100             8
case-case                  100            59
case-case-case             100           101
base-case                  100           100
case-base                  100           100
base-ctag                  100           100
ctag-base-ctag              50            50
ctag-base-base-ctag         50            50
case-base-case              50            50

Table 5: Feature set used in the filtering experiments
Classifier      P      R      F1     F2     F5
ID3           15.54  58.24  24.54  30.40  39.95
IB1           16.17  47.80  24.17  28.94  36.05
C4.5          15.97  56.59  24.91  30.62  39.74
NB            16.20  53.58  24.90  30.34  38.81
nu-SVC        17.44  62.09  27.23  33.50  43.52
AB+DS         18.27  60.44  28.06  34.16  43.65

Table 6: Filtering approach: results of single classifiers with subsampling 1:1

4.2. Ensembles of classifiers

In the next step we created homogeneous ensembles of classifiers. Every classifier in the ensemble was trained on all positive examples and a different subset of the negative ones (the size of the subset was determined by the subsampling ratio). Then, majority voting was used to determine the decision of the whole ensemble for each sentence. In this way, errors in classification made by one of the classifiers — especially those caused by an "unlucky" choice of negative examples — may be corrected by other classifiers in the ensemble.

#   Classifier     P      R      F1     F2     F5
7   ID3          19.94  69.23  30.96  37.95  49.03
3   IB1          16.98  45.05  24.66  29.04  35.32
7   C4.5         19.67  59.34  29.55  35.49  44.41
1   NB           16.20  53.58  24.90  30.34  38.81
3   nu-SVC       19.06  64.29  29.40  35.89  46.06
7   AB+DS        19.59  63.19  29.91  36.28  46.09
    B3           10.54  88.46  18.84  25.54  39.64
    PG           18.69  59.34  28.42  34.39  43.55

Table 7: Filtering approach: best results of ensembles of classifiers with subsampling 1:1, and evaluation of the grammars on the same (testing) corpus for comparison
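A sketch of the ensemble construction and voting under the assumptions just stated (train_one is a hypothetical stand-in for fitting a single Weka-style base learner; member predictions are assumed to be 0/1):

```python
import random

def train_ensemble(train_one, positives, negatives, n_members, ratio, seed=0):
    """Each member sees all positive examples plus its own random subset of
    negatives, whose size is fixed by the subsampling ratio."""
    rng = random.Random(seed)
    members = []
    for _ in range(n_members):
        k = min(ratio * len(positives), len(negatives))
        members.append(train_one(positives + rng.sample(negatives, k)))
    return members

def ensemble_decision(members, features):
    """Majority vote over the members' binary decisions for one sentence."""
    votes = sum(member(features) for member in members)
    return 2 * votes > len(members)
```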
The six classifiers were tested in ensembles of 3, 5, 7,
9 and 13, with variable subsampling ratios. The evalua-
tion results for combinations of B3 with the best ensem-
bles of classifiers with subsampling ratio 1:1 (as for single
classifiers, this ratio always rendered the best results) are
presented in Table 7, together with the analogous results
for the pure grammars B3 and PG, repeated from Table 2.
Note that four of these ensembles of classifiers, when com-
bined with the baseline grammar B3, exceeded the results
of the relatively sophisticated PG; only the lazy learner IB1
and the Naïve Bayes classifier resulted in F2 significantly
worse than that of PG.
The dependence of the results on the number of classifiers
in each ensemble was not as straightforward as the mono-
tonic dependence on subsampling ratio: some larger en-
sembles achieved worse results than smaller ensembles
of the same type of classifiers.
It is, however, worth noting that, in the case of some of the classifiers, the differences between the results of bigger and smaller ensembles are negligible; taking into account the randomisation factor mentioned before, they could be treated as of equal quality. For instance, while AB+DS achieves its best results in an ensemble of 7, all other ensembles of this classifier are nearly as good. nu-SVC behaves similarly: it scores best in an ensemble of 3 and almost as well in any larger configuration (but its results, when used on its own, are significantly worse). C4.5 achieves comparably good results in any ensemble of 7 or more. Other classifiers behave differently: ID3 is quite clearly the best in an ensemble of 7, IB1 performs equally well in an ensemble of any size (including 1), while combining a number of NB classifiers into an ensemble actually gives worse results than a single NB classifier.
For practical applications the execution time may also be important. Approximate measurements show that, among the ensembles in Table 7, 7xID3, 7xC4.5, 1xNB and 3xnu-SVC are equally fast (1-2 minutes in our environment), 3xIB1 is slower (6 minutes) [5], and 7xAB+DS with 1000 iterations is the slowest (25 minutes) [6].
4.3. Role of the B3 grammar

We conducted an additional experiment to verify the influence of the B3 grammar on the combination of classifiers. This was done by repeating some tests — single classifiers with the 1:1 subsampling ratio — without the initial filtering by the grammar; for details, see Table 8. This was in a way similar to the pure ML experiments described in the previous section, but this time the evaluation was performed on the testing corpus and with the augmented set of features, so the results may be compared directly to those in Table 6.
Classifier      P      R      F1     F2     F5
ID3            9.91  60.99  17.05  22.44  32.81
IB1            9.41  51.65  15.92  20.69  29.54
C4.5          10.90  60.44  18.47  24.03  34.39
NB            11.68  56.04  19.34  24.74  34.32
nu-SVC        10.76  64.29  18.44  24.19  35.15
AB+DS         13.47  63.19  22.20  28.33  39.12

Table 8: Pure ML approach: results of single classifiers with subsampling 1:1
The results — in terms of F2 — are much better when B3 is applied before the classifiers. Table 9 visualises the relative differences between the results with and without B3, computed using the following formula:

    value in Table 9 = (value in Table 6 / value in Table 8) - 1

For example, for the precision of ID3, 15.54 / 9.91 - 1 ≈ 0.568, i.e. a relative gain of 56.8%. Note that a significant increase in precision is accompanied by only a small decrease in recall. Even though the B3 grammar rejects less than 12% of the sentences, it greatly improves the final result, apparently by rejecting a significant part of the potential false positives.
Classifier      P       R      F1      F2      F5
ID3           56.8%   -4.5%   43.9%   35.5%   21.8%
IB1           71.8%   -7.5%   51.8%   39.9%   22.0%
C4.5          46.5%   -6.4%   34.9%   27.4%   15.6%
NB            38.7%   -4.4%   28.7%   22.6%   13.1%
nu-SVC        62.1%   -3.4%   47.7%   38.5%   23.8%
AB+DS         35.6%   -4.4%   26.4%   20.6%   11.6%

Table 9: Relative gain from applying B3 before the classifiers (for single classifiers, 1:1 subsampling ratio)

[5] As mentioned before, 1xIB1 is almost equally good, and two times faster.
[6] 1xAB+DS with 1000 iterations, and even 1xAB+DS with 100 iterations, achieve F2 exceeding 34%, the latter in 1.5 minutes.

5. Conclusion

The main result of this paper is that, for the task of definition extraction, a sequential combination of a very simple baseline partial grammar with machine learning algorithms gives results which are as good as — and sometimes significantly better than — those of manually constructed partial grammars, and much better than those of ML classifiers alone. Two corollaries of this result are: 1) even if only a small amount of noisy training data is available, automatic machine learning methods may exceed pure grammar-based approaches, but 2) the clear improvement is observed only when such ML algorithms are supported by some — relatively trivial — a priori linguistic knowledge.
6. References
Yasser EL-Manzalawy and Vasant Honavar, 2005. WLSVM: Integrating LibSVM into Weka Environment. http://www.cs.iastate.edu/~yasser/wlsvm.
Nancy Ide, Patrice Bonhomme, and Laurent Romary. 2000. XCES: An XML-based standard for linguistic corpora. In Proceedings of the Third International Conference on Language Resources and Evaluation, LREC 2000, pages 825–830, Athens. ELRA.
Adam Przepiórkowski, Łukasz Degórski, Miroslav Spousta, Kiril Simov, Petya Osenova, Lothar Lemnitzer, Vladislav Kuboň, and Beata Wójtowicz. 2007a. Towards the automatic extraction of definitions in Slavic. In Jakub Piskorski, Bruno Pouliquen, Ralf Steinberger, and Hristo Tanev, editors, Proceedings of the Workshop on Balto-Slavonic Natural Language Processing at ACL 2007, pages 43–50, Prague.
Adam Przepiórkowski, Łukasz Degórski, and Beata Wójtowicz. 2007b. On the evaluation of Polish definition extraction grammars. In Zygmunt Vetulani, editor, Proceedings of the 3rd Language & Technology Conference, pages 473–477, Poznań, Poland.
Horacio Saggion. 2004. Identifying definitions in text col-
lections for question answering. In Proceedings of the
Fourth International Conference on Language Resources
and Evaluation, LREC 2004, Lisbon. ELRA.
Richard Tobin, 2005. Lxtransduce, a replacement for fsgmatch. University of Edinburgh. http://www.cogsci.ed.ac.uk/~richard/ltxml2/lxtransduce-manual.html.
Ian H. Witten and Eibe Frank. 2005. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco, 2nd edition. http://www.cs.waikato.ac.nz/ml/weka/.