Z. Vetulani and H. Uszkoreit (Eds.): LTC 2007, LNAI 5603, pp. 382–391, 2009.
© Springer-Verlag Berlin Heidelberg 2009
ECODE: A Definition Extraction System
Rodrigo Alarcón1, Gerardo Sierra1, and Carme Bach2
1 Grupo de Ingeniería Lingüística, Universidad Nacional Autónoma de Mexico,
Ciudad Universitaria, Torre de Ingeniería, Basamento 3, 04510, Mexico City, Mexico
{ralarconm,gsierram}@iingen.unam.mx
2 Instituto Universitario de Lingüística Aplicada, Universidad Pompeu Fabra,
Pl. de la Mercè 10-12, 08002, Barcelona, Spain
carme.bach@upf.edu
Abstract. Terminological work aims to identify knowledge about terms in specialised
texts in order to compile dictionaries, glossaries or ontologies. Searching for
definitions of the terms that terminographers intend to define is therefore an
essential task. This search can be carried out in specialised corpora, where
definitions usually appear in definitional contexts, i.e. text fragments where an
author explicitly defines a term. We present research focused on the automatic
extraction of those definitional contexts. The methodology includes three
processes: the extraction of definitional patterns, the automatic filtering of
non-relevant contexts, and the automatic identification of constitutive elements,
i.e., terms and definitions.
Keywords: Definition extraction, definitional knowledge, definitional contexts,
information extraction, computational terminography.
1 Introduction
A common need in terminological work is the extraction of knowledge about terms in
specialised texts. Several efforts in the field of NLP have aimed to develop tools
that address this need, such as corpora, where a large quantity of technical
documents is digitally stored, and term extraction systems, which automatically
identify relevant terms in corpora.
Nowadays there is a growing interest in developing systems for the automatic
extraction of information that is useful for describing the meaning of terms. This
information commonly appears in structures called definitional contexts (DCs),
which are structured by a series of lexical and metalinguistic patterns that can be
automatically recognised [1], [2]. Following this idea, our work focuses on
developing a system for the automatic extraction of definitional contexts from
Spanish specialised texts. The system includes the extraction of occurrences of
definitional patterns, the filtering of non-relevant contexts, and the
identification of the constitutive elements of DCs, i.e., terms and definitions.
The system is being developed for Spanish, and it will be helpful in the
elaboration of ontologies, databases of lexical knowledge, glossaries or specialised
dictionaries.
In this paper we describe the structure of DCs, briefly review related work,
present the methodology we have followed for the automatic extraction of DCs
together with its evaluation, and finally outline future work.
2 Definitional Contexts
A definitional context is a textual fragment from a specialised text where a definition
of a term is given. Its basic structure consists of a term (T) and its definition (D),
with both elements connected by typographic or syntactic patterns. Typographic
patterns are mainly punctuation marks (commas, parentheses), while syntactic patterns
include definitional verbs, such as definir (to define) or significar (to signify), as
well as discursive markers, such as es decir (that is; lit. "(it) is to say") or o sea
(that is; lit. "or be-subjunctive"). In addition, DCs can include pragmatic patterns
(PP), which provide conditions for the use of the term or clarify its meaning, like
en términos generales (in general terms) or en este sentido (in this sense).
The following is an example of a definitional context:
“Desde un punto de vista práctico, los opioides se definen como compuestos de
acción directa, cuyos efectos se ven antagonizados estereoespecíficamente por la
naloxona.”
In this case, the term opioides is connected to its definition (compuestos de acción
directa […]) by the verbal pattern se definen como (are defined as), while the general
sense of the context is modified by the pragmatic pattern desde un punto de vista
práctico (from a practical point of view).
2.1 Related Work
The study of automatic extraction of definitional knowledge has been approached
from both theoretical-descriptive and applied perspectives.
One of the first theoretical-descriptive works is Pearson’s [1], in which the behav-
iour of the contexts where terms appear is described. Pearson mentions that, when
authors define a term, they usually employ typographic patterns to visually bring out
the presence of terms and/or definitions, as well as lexical and metalinguistic patterns
to connect DCs elements by means of syntactic structures.
Meyer [2] reinforced this idea and also stated that definitional patterns can provide
clues that allow the identification of the type of definition occurring in DCs, which
is helpful in the elaboration of ontologies. Other theoretical-descriptive works can
be found in [3] and [4].
Applied investigations, on the other hand, build on theoretical-descriptive studies
with the aim of developing methodologies for the automatic extraction of DCs, more
specifically for the extraction of definitions in medical texts [5], for the
extraction of definitions for question answering systems [6], for the automatic
elaboration of ontologies [7], for the extraction of semantic relations from
specialised texts [8], as well as for the extraction of relevant information for
eLearning purposes [9], [10].
In general, those studies employ definitional patterns as a common starting point
for the extraction of knowledge about terms. To develop our methodology, we start
from the analysis and integration of theoretical-descriptive and applied studies.
3 Definitional Contexts Extraction
As mentioned above, the main purpose of a definitional context extractor is to
simplify the search for relevant information about terms by searching for
occurrences of definitional patterns.
An extractor that only retrieves those occurrences would already be a useful system
for terminographical work. Nevertheless, the manual analysis of the occurrences
would still require an effort that could be reduced by an extractor that also
processes the obtained information automatically.
Therefore, we propose a methodology that includes not only the extraction of
occurrences of definitional patterns, but also the filtering of non-relevant contexts
(i.e. non-definitional contexts) and the automatic identification of the possible
constitutive elements of a DC: terms, definitions and pragmatic patterns. In the
following sections we explain each step of our methodology.
3.1 Corpus
We took as reference the IULA's Technical Corpus and its search engine bwanaNet1,
developed at the Instituto Universitario de Lingüística Aplicada (IULA, UPF). The
corpus comprises specialised documents in Law, Genome, Economy, Environment,
Medicine, Informatics and General Language. It contains a total of 1,378
documents in Spanish (December 2008). For the experiments we used all the areas
except General Language; the number of documents processed was 959, with a total
of 11,569,729 words.
3.2 Extracting Definitional Patterns
For the experiments we searched for definitional verbal patterns (DVPs). We worked
with 15 patterns that include simple definitional verbal patterns (SDVPs) and
compound definitional verbal patterns (CDVPs). As shown in Table 1, simple patterns
include only the definitional verb, while compound patterns include the definitional
verb plus a grammatical particle such as a preposition or an adverb.
Each pattern was searched in the IULA's Technical Corpus through the complex
search option, which allows users to obtain the occurrences with POS tags. We also
limited the search to no more than 300 occurrences for each verbal pattern, using the
random (and representative) recovery option.
The verbal patterns were searched taking into account the following restrictions:
Verbal forms: infinitive, participle and conjugated forms.
Verbal tenses: present and past for the simple forms, any tense for the compound
forms.
Person: 3rd singular and plural for the simple forms, any for the compound forms.
1 http://bwananet.iula.upf.edu/indexes.htm
Table 1. Simple and compound Definitional Verbal Patterns
Type      Verbs
Simple    concebir (to conceive), definir (to define), entender (to understand),
          identificar (to identify), significar (to signify)
Compound  consistir de (to consist of), consistir en (to consist in), constar de (to
          comprise), denominar también (also denominated), llamar también
          (also called), servir para (to serve for), usar como (to use as), usar
          para (to use for), utilizar como (to utilise as), utilizar para (to utilise
          for)
The obtained occurrences were automatically annotated with contextual tags. These
simple tags work as borders in the subsequent automatic processes. For each
occurrence, the definitional verbal pattern was annotated with “<dvp></dvp>”;
everything before the pattern with “<left></left>”; everything after the pattern with
“<right></right>”; and finally, in those cases where the verbal pattern includes a
nexus, like the adverb como (as), everything between the verbal pattern and the nexus,
together with the nexus itself, was annotated with “<nexus></nexus>”.
Here is an example of a DC with contextual tags:
<left>El metabolismo</left> <dvp>puede definir se </dvp> <nexus>en términos
generales como</nexus> <right>la suma de todos los procesos químicos (y físicos)
implicados.</right>
It is important to mention that, from this contextual annotation step onwards, all
the automatic processing was done with Perl scripts. We chose this programming
language mainly for its effectiveness in processing regular expressions.
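The contextual annotation described above can be illustrated with a minimal sketch. It is written in Python rather than Perl, it is not the authors' script, and it works on plain text instead of POS-tagged corpus output. The pattern DVP and the function annotate are hypothetical names, and the pattern covers only two surface forms of definir.

```python
import re

# Hypothetical, simplified pattern for the verb "definir"; the original
# system searched POS-tagged occurrences, which is not modelled here.
DVP = re.compile(
    r"(?P<dvp>\b(?:puede\s+definirse|se\s+defin\w+))"  # verbal pattern
    r"(?P<nexus>.*?\bcomo\b)?",                        # optional span up to the nexus "como"
    re.IGNORECASE,
)

def annotate(sentence):
    """Wrap the parts of one occurrence in contextual tags, or return None."""
    match = DVP.search(sentence)
    if match is None:
        return None
    left = sentence[:match.start()].strip()    # everything before the pattern
    right = sentence[match.end():].strip()     # everything after pattern and nexus
    nexus = (match.group("nexus") or "").strip()
    parts = [f"<left>{left}</left>", f"<dvp>{match.group('dvp')}</dvp>"]
    if nexus:
        parts.append(f"<nexus>{nexus}</nexus>")
    parts.append(f"<right>{right}</right>")
    return " ".join(parts)

print(annotate("El metabolismo puede definirse en términos generales como "
               "la suma de todos los procesos químicos (y físicos) implicados."))
```

Keeping the tags as plain delimiters means that later steps can treat each tagged span as a bounded search area.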
3.3 Filtering Non-relevant Contexts
Once we had extracted and annotated the occurrences of DVPs, the next process
was the filtering of non-relevant contexts. We apply this step because
definitional patterns are not used only in definitional sentences. In the case of DVPs,
some verbs tend to have a more metalinguistic meaning than others. That is the
case of definir (to define) or denominar (to denominate) vs. concebir (to conceive) or
identificar (to identify), where the last two can be used in a wide variety of
sentences. Moreover, even the verbs with a highly metalinguistic meaning are not
used only for defining terms.
In a previous work an analysis was done in order to determine which kind of
grammatical particles or syntactic sequences could appear in those cases when a DVP
is not used to define a term.
Those particles and sequences were found in some specific positions, for example:
some negation particles like no (not) or tampoco (either) were found in the first posi-
tion before or after the DVP; adverbs like tan (so), poco (few) as well as sequences
like poco más (not more than) were found between the definitional verb and the nexus
como; also, syntactic sequences like adjective + verb were found in the first position
after the definitional verb.
Thus, considering these and other frequent combinations, and helped by the contextual
tags previously annotated, we developed a script to filter non-relevant
contexts. The script can recognise contexts like the following examples:
Rule: NO <left>
<left>En segundo lugar, tras el tratamiento eficaz de los cambios patológicos en un
órgano pueden surgir problemas inesperados en tejidos que previamente no </left>
<dvp>se identificaron</dvp> <nexus> como </nexus> <right> implicados clínica-
mente, ya que los pacientes no sobreviven lo suficiente.</right>
Rule: <nexus> CONJUGATED VERB
<left>Ciertamente esta observación tiene una mayor fuerza cuando el número de
categorías </left> <dvp> definidas</dvp> <nexus> es pequeño como</nexus>
<right>en nuestro análisis.</right>
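A rough sketch of how such filtering rules could be checked against the tagged segments is given below. It is a Python illustration under simplifying assumptions, not the authors' Perl script. The word lists are tiny stand-ins, the conjugated-verb list replaces real POS information, and the names matching_rule, "<dvp> NO" and "<nexus> ADVERB" are hypothetical.

```python
import re

NEGATIONS = {"no", "tampoco"}
ADVERBS_BEFORE_COMO = {"tan", "poco"}
# Stand-in for POS tags: a tiny, hypothetical list of conjugated verb forms.
CONJUGATED_VERBS = {"es", "son", "fue", "era"}

def words(segment):
    """Lower-cased word tokens of a tagged segment."""
    return re.findall(r"\w+", segment.lower())

def matching_rule(left="", nexus="", right=""):
    """Return the name of the first filtering rule that fires, or None."""
    left_w, nexus_w, right_w = words(left), words(nexus), words(right)
    after_dvp = nexus_w or right_w  # first material following the verbal pattern
    if left_w and left_w[-1] in NEGATIONS:
        return "NO <left>"                # negation right before the DVP
    if after_dvp and after_dvp[0] in NEGATIONS:
        return "<dvp> NO"                 # negation right after the DVP
    if any(w in ADVERBS_BEFORE_COMO for w in nexus_w):
        return "<nexus> ADVERB"           # e.g. "tan", "poco" before "como"
    if any(w in CONJUGATED_VERBS for w in nexus_w):
        return "<nexus> CONJUGATED VERB"
    return None                           # context is kept as a DC candidate

print(matching_rule(left="tejidos que previamente no", nexus="como",
                    right="implicados clínicamente"))
```

A context for which no rule fires is kept as a candidate; the order of the checks here is arbitrary and would need tuning against real data.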
3.4 Identifying DCs Elements
Once the non-relevant contexts were filtered, the next process in the methodology is
the identification of main terms, definitions and pragmatic patterns. In Spanish
DCs, and depending on each DVP, the terms and definitions can appear in certain
specific positions. For example, in DCs with the verb definir (to define), the term can
appear in left, nexus or right position (T se define como D; se define T como D;
se define como T D), while in DCs with the verb significar (to signify), terms can
appear only in left position (T significa D). Therefore, in this phase the automatic
process consists mainly of deciding in which positions the constitutive elements
can appear.
We decided to use a decision tree [11] to solve this problem, i.e., to detect by
means of logic inferences the probable positions of terms, definitions and pragmatic
patterns. We established some simple regular expressions to represent each constitu-
tive element2:
T = BRD (Det) + N + Adj. {0,2} .* BRD
PP = BRD (sign) (Prep | Adv) .* (sign) BRD
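One way to read the T and PP expressions is as regular expressions over the sequence of POS tags of a segment. The Python sketch below follows that reading under stated assumptions: the tag names (Det, N, Adj, Prep, Adv, and Punct for "sign") are hypothetical, the segment edges play the role of the BRD borders, and the function matching_expressions is an illustrative name, not part of the original system.

```python
import re

# Tags are joined with single spaces; ^ and $ stand for the BRD borders.
# T = BRD (Det) + N + Adj.{0,2} .* BRD
TRE = re.compile(r"^(?:Det )?N(?: Adj){0,2}(?: \S+)*$")
# PP = BRD (sign) (Prep | Adv) .* (sign) BRD
# The optional closing sign is absorbed by the trailing wildcard.
PPRE = re.compile(r"^(?:Punct )?(?:Prep|Adv)(?: \S+)*$")

def matching_expressions(tagged_segment):
    """Return the labels of the expressions matched by a list of (word, tag) pairs."""
    tags = " ".join(tag for _, tag in tagged_segment)
    labels = set()
    if TRE.match(tags):
        labels.add("TRE")
    if PPRE.match(tags):
        labels.add("PPRE")
    return labels

print(matching_expressions([("la", "Det"), ("psicología", "N")]))
print(matching_expressions([("En", "Prep"), ("sus", "Det"), ("comienzos", "N")]))
```

Because both expressions end in a wildcard, they constrain only how a segment begins, which is why they are cheap to evaluate but can over-match.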
As in the filtering process, the contextual tags function as borders to
delimit the decision tree's instructions. In addition, each regular expression can
function as a border. At the first level, the branches of the tree are the different
positions in which constitutive elements can appear (left, nexus or right). At the
second level, the branches are the regular expressions of each DC element. The nodes
(branch junctions) correspond to decisions taken from the attributes of each branch;
they are horizontally related by If or If Not inferences, and vertically by Then
inferences. Finally, the leaves are the positions assigned to each constitutive element.
Hence, in figure 1 we present an example of the decision tree inferences to identify
left constitutive elements3:
2 Where: Det = determiner, N = noun, Adj = adjective, Prep = preposition, Adv = adverb,
BRD = border and “.*” = any word or group of words.
3 TRE = term regular expression | PPRE = pragmatic pattern regular expression | DRE = defini-
tion regular expression.
Fig. 1. Example of the identification of DCs elements
This tree should be interpreted as follows, where the definition regular expression is:
D = BRD Det. + N .* BRD
Given a series of DVP occurrences:
If verbal pattern = compound definitional verbal pattern, then:
1. If left position corresponds only to a term regular expression, then:
<left> = term | <right> = definition.
If Not:
2. If left position corresponds to a term regular expression and a pragmatic pattern
regular expression, then:
<left> = term & pragmatic pattern | <right> = definition.
If Not:
3. If left position only corresponds to a pragmatic pattern regular expression, then4:
<left> = pragmatic pattern | If nexus corresponds only to a term regular expression,
then <nexus> = term & <right> = definition; If Not <right> = term & definition.
4. If left position corresponds only to a definition regular expression, then:
<left> = definition | <right> = term.
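The four inferences above can be sketched as a chain of conditionals. The Python function below is an illustration only, not the authors' implementation: it assumes that each position has already been reduced to the set of expression labels it matches (TRE, PPRE, DRE), and the name assign_elements and the returned dictionary layout are hypothetical.

```python
def assign_elements(left, nexus=frozenset()):
    """Assign DC elements to positions from the expression labels matched
    by the left and nexus segments; return None if no inference applies."""
    if left == {"TRE"}:                       # 1. only a term
        return {"left": "term", "right": "definition"}
    if left == {"TRE", "PPRE"}:               # 2. term and pragmatic pattern
        return {"left": "term & pragmatic pattern", "right": "definition"}
    if left == {"PPRE"}:                      # 3. only a pragmatic pattern
        if nexus == {"TRE"}:                  #    look for the term in the nexus
            return {"left": "pragmatic pattern", "nexus": "term",
                    "right": "definition"}
        return {"left": "pragmatic pattern", "right": "term & definition"}
    if left == {"DRE"}:                       # 4. only a definition
        return {"left": "definition", "right": "term"}
    return None                               # context left unclassified

print(assign_elements({"PPRE"}, {"TRE"}))
```

Comparing whole sets is one way to express the "corresponds only to" conditions; a segment matching an unexpected combination of expressions falls through and stays unclassified.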
To exemplify, consider the following context:
4 In some cases the tree must resort to other position inferences to find terms and definitions.
“<left>En sus comienzos</left> <dvp>se definió</dvp> <nexus>la psicología como </nexus>
<right>"la descripción y la explicación de los estados de conciencia"
(Ladd, 1887).</right>”
Once the DVP was identified as a CDVP, definir como (to define as), the tree infers
that the left position:
1. Does not correspond only to a TRE.
2. Does not correspond to a TRE and a PPRE.
3. Does correspond only to a PPRE.
Then: the left position is a pragmatic pattern (En sus comienzos). To identify the term
and the definition, the tree moves to the inferences for the nexus position and finds that:
1. It does correspond only to a TRE.
Then: the nexus position corresponds to the term (la psicología) and the right position
corresponds to the definition (“la descripción y la explicación de los estados de
conciencia […]”).
As a result, the processed context was reorganised into a terminological entry, as in
the following example:
Table 2. Example of the results
Term               psicología
Definition         “la descripción y la explicación de los estados de la conciencia” (Ladd, 1887).
Verbal Pattern     se define
Pragmatic Pattern  En sus comienzos
To conclude this part, we note that the algorithms use simple regular expressions
and simple logical inferences to find, analyse and organise definitional knowledge.
Furthermore, the design of the algorithms allows them to be adapted to other
languages by replacing the corresponding regular expressions and logical inferences.
4 Evaluation
The evaluation of the methodology consists of two parts:
1. We evaluate the extraction of DVPs and the filtering of non-relevant contexts
using Precision & Recall. In general terms, Precision measures how much of the
extracted information is relevant, while Recall measures how much of the relevant
information in the input was extracted.
2. For the identification of constitutive elements, we manually assigned values
that helped us to statistically evaluate the accuracy of the decision tree.
4.1 Evaluation of DVP’s Extraction and Non-relevant Contexts Filtering
We determine Precision & Recall by means of the following formulas:
P = the number of filtered DCs automatically extracted, over the number of con-
texts automatically extracted.
R = the number of filtered DCs automatically extracted, over the number of non-
filtered DCs automatically extracted.
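As a worked illustration, the two ratios can be computed as below. The sketch assumes one reading of the definitions above: the numerator is the number of true DCs kept after filtering, Precision divides it by all contexts kept after filtering, and Recall divides it by the true DCs present before filtering. The function name, parameter names and counts are hypothetical and are not taken from the experiments.

```python
def precision_recall(kept_dcs, kept_contexts, dcs_before_filtering):
    """Precision and Recall of the extraction plus filtering step.

    kept_dcs: true DCs that survive the filter.
    kept_contexts: all contexts that survive the filter.
    dcs_before_filtering: true DCs among the extracted occurrences.
    """
    if kept_contexts == 0 or dcs_before_filtering == 0:
        raise ValueError("denominators must be greater than zero")
    return kept_dcs / kept_contexts, kept_dcs / dcs_before_filtering

# Made-up counts for a single verbal pattern.
p, r = precision_recall(kept_dcs=84, kept_contexts=100, dcs_before_filtering=85)
print(f"P = {p:.2f}, R = {r:.2f}")
```

Under this reading, a Recall below 1 means that the filter discarded some valid DCs.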
The results for each verbal pattern can be seen in Table 3. In the case of Precision,
results diverge according to how often a verb appears in metalinguistic sentences. The
best results were obtained with verbs like denominar (to denominate) or definir (to
define), while verbs like entender (to understand) or significar (to signify) yield
low Precision values. Those verbs with lower results can be used in a wide assortment
of sentences (i.e., not necessarily definitional contexts), and they tend to retrieve a
large amount of noise. In the case of Recall, low results indicate that valid DCs were
filtered out as non-relevant contexts. This misclassification is related to the
filtering rules, but in some cases it was also due to POS tagging errors in the
input corpus.
Table 3. Precision & Recall results
Verbal Pattern                          Precision  Recall
Concebir (como)     To conceive (as)    0.67       0.98
Definir (como)      To define (as)      0.84       0.99
Entender (como)     To understand (as)  0.34       0.94
Identificar (como)  To identify (as)    0.31       0.90
Consistir de        To consist of       0.62       1
Consistir en        To consist in       0.60       1
Constar de          To comprise         0.94       0.99
Denominar también   Also denominated    1          0.87
Llamar también      Also called         0.90       1
Servir para         To serve for        0.55       1
Significar          To signify          0.29       0.98
Usar como           To use as           0.41       0.95
Usar para           To use for          0.67       1
Utilizar como       To utilise as       0.45       0.92
Utilizar para       To utilise for      0.53       1
The challenge we faced at this stage is directly related to the elimination of noise.
We have noticed that the more precise the verbal pattern is, the better the results
(in terms of less noise) that can be obtained. Nevertheless, more specific verbal
patterns imply a probable loss of recall. In addition, the filtering rules must be
revised in order to improve the identification of non-relevant contexts and to avoid
the cases where some DCs were incorrectly filtered out.
4.2 Evaluation of DCs Elements Identification
To evaluate the identification of DC elements, we manually assigned the following
values to each DC processed by the decision tree:
3 for those contexts where the constitutive elements were correctly classified;
2 for those contexts where the constitutive elements were correctly classified, but
some extra information was also classified (for example, extra words or punctuation
marks in term position);
1 for those contexts where the constitutive elements were not correctly classified
(for example, when terms were classified as definitions or vice versa);
Ø for those contexts the system could not classify.
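The manual scores can be summarised as a percentage distribution per verbal pattern. The following Python sketch shows one way to do this; the function name score_distribution and the sample scores are hypothetical and do not reproduce the data of the experiments.

```python
from collections import Counter

SCORES = (3, 2, 1, "Ø")  # the four values assigned manually to each DC

def score_distribution(scores):
    """Percentage of contexts in each score group for one verbal pattern."""
    if not scores:
        raise ValueError("at least one scored context is required")
    counts = Counter(scores)
    total = len(scores)
    return {label: round(100 * counts[label] / total, 2) for label in SCORES}

# Made-up scores for five contexts of a single verbal pattern.
print(score_distribution([3, 3, 2, 1, "Ø"]))
```

Each row of percentages obtained this way should add up to roughly 100, apart from rounding.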
In Table 4 we present the results of the evaluation of DC element identification.
The values are expressed as percentages, and their sum represents the
total number of DCs found with each verbal pattern. From the DC evaluation we
highlight the following facts:
The average percentage of correctly classified elements (group “3”) is over 50
percent of the global classification. In these cases, the classified elements correspond
exactly to a term or a definition.
In a low percentage of cases (group “2”), the classified elements include extra
information or noise. Nevertheless, in these cases the elements were also correctly
classified, as in group “3”.
The incorrect classification of terms and definitions (group “1”), as well as the
unclassified elements (group “Ø”), correspond to a low percentage of the global
classification.
Table 4. Evaluation of DCs elements identification
Verbal Pattern                          3      2      1      Ø
Concebir (como)     To conceive (as)    68.57  15.71  11.42  4.28
Definir (como)      To define (as)      65.10  18.22  10.41  6.25
Entender (como)     To understand (as)  54.16  20.83  8.33   16.66
Identificar (como)  To identify (as)    51.72  5.17   34.48  8.62
Consistir de        To consist of       60     0      20     20
Consistir en        To consist in       60.81  8.10   15.54  15.54
Constar de          To comprise         58.29  22.97  2.97   15.74
Denominar también   Also denominated    21.42  28.57  7.14   42.85
Llamar también      Also called         30     40     0      30
Servir para         To serve for        53.78  27.27  0.007  18.18
Significar          To signify          41.26  44.44  3.17   11.11
Usar como           To use as           63.41  14.63  17.07  4.87
Usar para           To use for          36.26  32.96  4.39   26.37
Utilizar como       To utilise as       55.10  28.57  10.20  6.12
Utilizar para       To utilise for      51.51  19.69  10.60  18.18
Since the purpose of this process was the identification of DC elements, we can
argue that the results are generally satisfactory. However, there is still a lot of
work to do in order to improve the performance of the decision tree's inferences.
This work is related to the way the tree analyses the different DC elements of each
verbal pattern.
5 Conclusions and Future Work
We have presented the process of developing a definitional knowledge extraction
system. The aim of this system is to simplify the terminological practice of
searching for definitions of terms in specialised texts.
The methodology we have presented includes the search of definitional patterns,
the filtering of non-relevant contexts and the identification of DCs constitutive ele-
ments: terms, definitions, and pragmatic patterns.
So far we have worked with definitional verbs, and we know that there is still
a lot of work to do, which basically consists of the following points:
a) To explore other kinds of definitional patterns (mainly typographical patterns and
reformulation markers) that are capable of recovering definitional contexts.
b) To include those definitional patterns in each step of the methodology.
c) To improve the rules for the non-relevant context filtering process, as well as
the algorithm for the automatic identification of constitutive elements.
Acknowledgments. This research has been carried out with the sponsorship of the
Mexican National Council of Science and Technology (CONACYT), the DGAPA-UNAM,
as well as the Macro Project Tecnologías para la Universidad de la Información
y la Computación, UNAM. We also acknowledge the help of Bertha Lecumberri
in the translation of this paper.
References
1. Pearson, J.: Terms in Context. John Benjamins, Amsterdam (1998)
2. Meyer, I.: Extracting Knowledge-rich Contexts for Terminography. In: Bourigault, D.,
Jacquemin, C., L’Homme, M.C. (eds.), pp. 278–302. John Benjamins, Amsterdam (2001)
3. Péry-Woodley, M.-P., Rebeyrolle, J.: Domain and Genre in Sublanguage Text:
Definitional Microtexts in Three Corpora. In: First International Conference on
Language Resources and Evaluation, Granada, pp. 987–992 (1998)
4. Bach, C.: Los marcadores de reformulación como localizadores de zonas discursivas rele-
vantes en el discurso especializado. Debate Terminológico, Electronic Journal 1 (2005)
5. Klavans, J., Muresan, S.: Evaluation of the DEFINDER System for Fully Automatic Glos-
sary Construction. In: Proceedings of the American Medical Informatics Association Sym-
posium, pp. 252–262. ACM Press, New York (2001)
6. Saggion, H.: Identifying Definitions in Text Collections for Question Answering. In: Pro-
ceedings of the 4th International Conference on Language Resources and Evaluation, Lis-
bon, pp. 1927–1930 (2004)
7. Malaisé, V.: Méthodologie linguistique et terminologique pour la structuration
d’ontologies différentielles à partir de corpus textuels. PhD Thesis. UFR de Linguistique,
Université Paris 7 – Denis Diderot, Paris (2005)
8. Sierra, G., Alarcón, R., Aguilar, C., Bach, C.: Definitional Verbal Patterns for Semantic
Relation Extraction. Terminology 14(1), 74–98 (2008)
9. Del-Gaudio, R., Branco, A.: Automatic Extraction of Definitions in Portuguese: A Rule-
Based Approach. In: Proceedings of the 2nd Workshop on Text Mining and Applications,
Guimarães (2007)
10. Degórski, L., Marcinczuk, M., Przepiórkowski, A.: Definition Extraction Using a Sequen-
tial Combination of Baseline Grammars and Machine Learning Classifiers. In: Proceedings
of the 6th International Conference on Language Resources and Evaluation, Forth-
Coming, Marrakech (2008)
11. Alarcón, R., Bach, C., Sierra, G.: Extracción de contextos definitorios en corpus especiali-
zados. Hacia la elaboración de una herramienta de ayuda terminográfica. In: Revista de la
Sociedad Española de Lingüística 37, pp. 247–278. Madrid (2007)