Z. Vetulani and H. Uszkoreit (Eds.): LTC 2007, LNAI 5603, pp. 382–391, 2009.
© Springer-Verlag Berlin Heidelberg 2009
ECODE: A Definition Extraction System
Rodrigo Alarcón1, Gerardo Sierra1, and Carme Bach2
1 Grupo de Ingeniería Lingüística, Universidad Nacional Autónoma de Mexico,
Ciudad Universitaria, Torre de Ingeniería, Basamento 3, 04510, Mexico City, Mexico
{ralarconm,gsierram}@iingen.unam.mx
2 Instituto Universitario de Lingüística Aplicada, Universidad Pompeu Fabra,
Pl. de la Mercè 10-12, 08002, Barcelona, Spain
carme.bach@upf.edu
Abstract. Terminological work aims to identify knowledge about terms in specialised texts in order to compile dictionaries, glossaries or ontologies. Searching for definitions of the terms that terminographers intend to define is therefore an essential task. This search can be carried out in specialised corpora, where definitions usually appear in definitional contexts, i.e. text fragments where an author explicitly defines a term. We present research focused on the automatic extraction of such definitional contexts. The methodology includes three different processes: the extraction of definitional patterns, the automatic filtering of non-relevant contexts, and the automatic identification of constitutive elements, i.e., terms and definitions.
Keywords: Definition extraction, definitional knowledge, definitional contexts,
information extraction, computational terminography.
1 Introduction
A common need in terminological work is the extraction of knowledge about terms in specialised texts. Several efforts in the field of NLP have been made to address this need, such as corpora, in which large quantities of technical documents are digitally stored, and term extraction systems, which automatically identify relevant terms in corpora.
Nowadays there is growing interest in developing systems for the automatic extraction of information useful for describing the meaning of terms. This information commonly appears in structures called definitional contexts (DCs), which are built from a series of lexical and metalinguistic patterns that can be automatically recognised [1], [2]. Following this idea, our work focuses on developing a system for the automatic extraction of definitional contexts from Spanish-language specialised texts. The system covers the extraction of definitional-pattern occurrences, the filtering of non-relevant contexts, and the identification of the DCs' constitutive elements, i.e., terms and definitions.
The system is being developed for Spanish and will be helpful in the elaboration of ontologies, lexical knowledge databases, glossaries and specialised dictionaries.
In this paper we describe the structure of DCs, briefly review related work, present the methodology we have followed for the automatic extraction of DCs together with its evaluation, and finally outline future work.
2 Definitional Contexts
A definitional context is a textual fragment of a specialised text in which a definition of a term is given. It is basically structured as a term (T) and its definition (D), with both elements connected by typographic or syntactic patterns. Typographic patterns are mainly punctuation marks (commas, parentheses), while syntactic patterns include definitional verbs –such as definir (to define) or significar (to signify)– as well as discourse markers –such as es decir (that is, lit. (it) is to say) or o sea (that is, lit. or be-subjunctive)–. In addition, DCs can include pragmatic patterns (PPs), which provide conditions for the use of the term or clarify its meaning, like en términos generales (in general terms) or en este sentido (in this sense).
The following is an example of a definitional context:
“Desde un punto de vista práctico, los opioides se definen como compuestos de
acción directa, cuyos efectos se ven antagonizados estereoespecíficamente por la
naloxona.”
In this case, the term opioides is connected to its definition (compuestos de acción
directa […]) by the verbal pattern se definen como (are defined as), while the general
sense of the context is modified by the pragmatic pattern desde un punto de vista
práctico (from a practical point of view).
2.1 Related Work
The automatic extraction of definitional knowledge has been approached from both theoretical-descriptive and applied perspectives.
One of the first theoretical-descriptive works is Pearson's [1], which describes the behaviour of the contexts in which terms appear. Pearson notes that, when authors define a term, they usually employ typographic patterns to visually highlight the presence of terms and/or definitions, as well as lexical and metalinguistic patterns to connect DC elements by means of syntactic structures.
Meyer [2] reinforced this idea, stating that definitional patterns can also provide cues for identifying the type of definition occurring in a DC, which is helpful in the elaboration of ontologies. Other theoretical-descriptive works can be found in [3] and [4].
Applied investigations, on the other hand, build on theoretical-descriptive studies with the objective of elaborating methodologies for the automatic extraction of DCs, more specifically for the extraction of definitions in medical texts [5], the extraction of definitions for question answering systems [6], the automatic elaboration of ontologies [7], the extraction of semantic relations from specialised texts [8], and the extraction of relevant information for eLearning purposes [9], [10].
In general terms, those studies employ definitional patterns as a common starting point for the extraction of knowledge about terms. To develop our methodology, we started from the analysis and integration of theoretical-descriptive and applied studies.
3 Definitional Contexts Extraction
As mentioned before, the main purpose of a definitional context extractor is to simplify the search for relevant information about terms by searching for occurrences of definitional patterns.
An extractor that only retrieves occurrences of definitional patterns would already be useful for terminographical work. Nevertheless, the manual analysis of these occurrences would still demand an effort that can be reduced by an extractor that also processes the retrieved information automatically.
We therefore propose a methodology that includes not only the extraction of occurrences of definitional patterns, but also the filtering of non-relevant contexts (i.e. non-definitional contexts) and the automatic identification of the possible constitutive elements of a DC: terms, definitions and pragmatic patterns. In the following sections we explain each step of our methodology.
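A minimal sketch of the three-step pipeline in Python (the actual system was written in Perl; the function names, the toy corpus and the simplified patterns below are illustrative assumptions):

```python
import re

def extract_pattern_occurrences(corpus, dvp_regex):
    """Step 1: keep only the sentences containing a definitional verbal pattern."""
    return [s for s in corpus if re.search(dvp_regex, s)]

def filter_non_relevant(contexts, noise_regex=r"\bno\b|\btampoco\b"):
    """Step 2: discard contexts matching a simple noise rule (here: negation)."""
    return [c for c in contexts if not re.search(noise_regex, c)]

def identify_elements(context, dvp_regex):
    """Step 3: naive split into term (left of the DVP) and definition (right)."""
    m = re.search(dvp_regex, context)
    return {"term": context[:m.start()].strip(),
            "definition": context[m.end():].strip()}

corpus = ["El metabolismo se define como la suma de procesos químicos.",
          "Los cambios no se definen como relevantes.",
          "Hoy llueve en Barcelona."]
dvp = r"se definen? como"
kept = filter_non_relevant(extract_pattern_occurrences(corpus, dvp))
entry = identify_elements(kept[0], dvp)
```

In the full methodology, step 3 is replaced by the decision-tree inferences described in section 3.4 rather than this naive left/right split.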
3.1 Corpus
We took as reference IULA's Technical Corpus and its search engine bwanaNet1, developed at the Instituto Universitario de Lingüística Aplicada (IULA, UPF). The corpus comprises specialised documents on Law, Genome, Economy, Environment, Medicine and Informatics, plus General Language texts, with a total of 1,378 documents in Spanish (December 2008). For the experiments we used all the areas except General Language; the number of processed documents was 959, with a total of 11,569,729 words.
3.2 Extracting Definitional Patterns
For the experiments we searched for definitional verbal patterns (DVPs). We worked with 15 patterns, comprising simple definitional verbal patterns (SDVPs) and compound definitional verbal patterns (CDVPs). As can be seen in table 1, the simple patterns include only the definitional verb, while the compound patterns include the definitional verb plus a grammatical particle such as a preposition or an adverb.
Each pattern was searched for in IULA's Technical Corpus through the complex search option, which allows users to obtain occurrences with POS tags. We also limited the search to no more than 300 occurrences per verbal pattern, using the random (and representative) recovery option.
The verbal patterns were searched for under the following restrictions:
Verbal forms: infinitive, participle and conjugated forms.
Verbal tenses: present and past for the simple forms, any tense for the compound forms.
1 http://bwananet.iula.upf.edu/indexes.htm
Table 1. Simple and compound definitional verbal patterns

Simple: concebir (to conceive), definir (to define), entender (to understand), identificar (to identify), significar (to signify)

Compound: consistir de (to consist of), consistir en (to consist in), constar de (to comprise), denominar también (also denominated), llamar también (also called), servir para (to serve for), usar como (to use as), usar para (to use for), utilizar como (to utilise as), utilizar para (to utilise for)
Person: 3rd person singular and plural for the simple forms, any person for the compound forms.
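By way of illustration, the search forms licensed by these restrictions for a simple pattern such as definir can be enumerated from a small hand-written morphology table (a sketch; the real queries were issued through bwanaNet over the POS-tagged corpus):

```python
# Forms of "definir" allowed by the restrictions above: infinitive,
# participle, and 3rd-person singular/plural of present and past.
FORMS = {
    "definir": {
        "infinitive": ["definir"],
        "participle": ["definido", "definida", "definidos", "definidas"],
        "present_3rd": ["define", "definen"],
        "past_3rd": ["definió", "definieron"],
    }
}

def query_forms(verb):
    """Flatten the allowed forms of a verb into a single list of search strings."""
    return [form for group in FORMS[verb].values() for form in group]

forms = query_forms("definir")
```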
The obtained occurrences were automatically annotated with contextual tags. The function of these simple tags is to serve as borders in the subsequent automatic processes. For each occurrence, the definitional verbal pattern was annotated with "<dvp></dvp>"; everything before the pattern with "<left></left>"; everything after the pattern with "<right></right>"; and finally, in those cases where the verbal pattern includes a nexus, like the adverb como (as), everything between the verbal pattern and the nexus was annotated with "<nexus></nexus>".
Here is an example of a DC with contextual tags:
<left>El metabolismo</left> <dvp>puede definir se </dvp> <nexus>en términos
generales como</nexus> <right>la suma de todos los procesos químicos (y físicos)
implicados.</right>
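A minimal version of this annotation step can be sketched as follows (hypothetical Python code; the original scripts were written in Perl, and the DVP here is given directly as a regex):

```python
import re

def annotate(sentence, dvp, nexus_word="como"):
    """Wrap a context in contextual tags: <left> before the DVP, <dvp> for
    the pattern, <nexus> up to and including the nexus word (if any), and
    <right> for the remainder."""
    m = re.search(dvp, sentence)
    if m is None:
        return None
    left, rest = sentence[:m.start()].strip(), sentence[m.end():].strip()
    nexus = ""
    n = re.search(r"^(.*?\b%s\b)" % nexus_word, rest)
    if n:
        nexus, rest = n.group(1).strip(), rest[n.end():].strip()
    tagged = "<left>%s</left> <dvp>%s</dvp>" % (left, m.group(0))
    if nexus:
        tagged += " <nexus>%s</nexus>" % nexus
    return tagged + " <right>%s</right>" % rest

tagged = annotate("El metabolismo puede definirse en términos generales como "
                  "la suma de todos los procesos químicos.", r"puede definirse")
```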
It is important to mention that, from this contextual annotation onwards, all automatic processing was carried out with Perl scripts. We chose this programming language mainly for its strength in processing regular expressions.
3.3 Filtering Non-relevant Contexts
Once the occurrences of DVPs had been extracted and annotated, the next step was the filtering of non-relevant contexts. We apply this step because definitional patterns are not used only in definitional sentences. Among the DVPs, some verbs tend to have a stronger metalinguistic meaning than others. That is the case of definir (to define) or denominar (to denominate) versus concebir (to conceive) or identificar (to identify), where the latter two can be used in a wide variety of sentences. Moreover, even the verbs with a strong metalinguistic meaning are not used only for defining terms.
In previous work, an analysis was carried out to determine which grammatical particles or syntactic sequences tend to appear when a DVP is not used to define a term.
Those particles and sequences were found in specific positions. For example: negation particles like no (not) or tampoco (either/neither) were found in the first position before or after the DVP; adverbs like tan (so) or poco (few), as well as sequences
like poco más (little more), were found between the definitional verb and the nexus como; and syntactic sequences like adjective + verb were found in the first position after the definitional verb.
Thus, taking these and other frequent combinations into account, and helped by the previously annotated contextual tags, we developed a script to filter out non-relevant contexts. The script recognises contexts like the following examples:
Rule: NO <left>
<left>En segundo lugar, tras el tratamiento eficaz de los cambios patológicos en un órgano pueden surgir problemas inesperados en tejidos que previamente no </left> <dvp>se identificaron</dvp> <nexus> como </nexus> <right> implicados clínicamente, ya que los pacientes no sobreviven lo suficiente.</right>
Rule: <nexus> CONJUGATED VERB
<left>Ciertamente esta observación tiene una mayor fuerza cuando el número de categorías </left> <dvp> definidas</dvp> <nexus> es pequeño como</nexus> <right>en nuestro análisis.</right>
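Rules of this kind can be written as regular expressions over the contextual tags. A sketch, assuming contexts arrive already annotated as above (the two rules mirror the examples just shown):

```python
import re

# Each rule matches a configuration signalling a NON-definitional context:
# a negation particle immediately before the DVP, or a conjugated verb
# such as "es" at the start of the nexus slot.
FILTER_RULES = [
    re.compile(r"\b(no|tampoco)\s*</left>"),   # Rule: NO <left>
    re.compile(r"<nexus>\s*es\b"),             # Rule: <nexus> CONJUGATED VERB
]

def is_relevant(tagged_context):
    """Keep a context only if no filtering rule fires."""
    return not any(rule.search(tagged_context) for rule in FILTER_RULES)

noisy = ("<left>tejidos que previamente no </left><dvp>se identificaron</dvp>"
         "<nexus> como </nexus><right>implicados clínicamente</right>")
clean = ("<left>El metabolismo</left> <dvp>se define</dvp> "
         "<nexus>como</nexus> <right>la suma de procesos químicos</right>")
```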
3.4 Identifying DCs Elements
Once the non-relevant contexts have been filtered out, the next step in the methodology is the identification of the main terms, definitions and pragmatic patterns. In Spanish DCs, and depending on the DVP, the terms and definitions can appear in specific positions. For example, in DCs with the verb definir (to define), the term can appear in the left, nexus or right position (T se define como D; se define T como D; se define como T D), while in DCs with the verb significar (to signify), terms can appear only in the left position (T significa D). Therefore, in this phase the automatic process essentially amounts to deciding in which positions the constitutive elements can appear.
We decided to use a decision tree [11] to solve this problem, i.e., to detect by means of logical inferences the probable positions of terms, definitions and pragmatic patterns. We established simple regular expressions to represent each constitutive element2:
T = BRD (Det) + N + Adj. {0,2} .* BRD
PP = BRD (sign) (Prep | Adv) .* (sign) BRD
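Assuming POS-tagged text in a simple word/TAG format (an assumption of this sketch; the actual corpus tagset differs), the two expressions could be approximated as Python regexes:

```python
import re

# T: optional determiner, a noun, then up to two adjectives.
TRE = re.compile(r"^(\w+/DET\s+)?\w+/N(\s+\w+/ADJ){0,2}\s*$")
# PP: a span opening with a preposition or an adverb.
PPRE = re.compile(r"^\w+/(PREP|ADV)\b")

term_candidate = "la/DET psicología/N"
pp_candidate = "En/PREP sus/DET comienzos/N"
```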
As in the filtering process, the contextual tags function as borders to demarcate the decision tree's instructions. In addition, each regular expression can itself function as a border. At the first level, the branches of the tree are the different positions in which constitutive elements can appear (left, nexus or right). At the second level, the branches are the regular expressions of each DC element. The nodes (branch junctions) correspond to decisions taken on the attributes of each branch and are related horizontally by If or If Not inferences, and vertically by Then inferences. Finally, the leaves are the positions assigned to the constitutive elements.
In figure 1 we present an example of the decision tree inferences used to identify left-position constitutive elements3:
2 Where: Det = determiner, N = noun, Adj = adjective, Prep = preposition, Adv = adverb, BRD = border and ".*" = any word or group of words.
3 TRE = term regular expression | PPRE = pragmatic pattern regular expression | DRE = definition regular expression.
Fig. 1. Example of the identification of DCs elements
This tree should be interpreted in the following way. Given a series of DVP occurrences, and with the definition regular expression D = BRD Det. + N .* BRD:
If verbal pattern = compound definitional verbal pattern, then:
1. If the left position corresponds only to a term regular expression, then:
<left> = term | <right> = definition.
If Not:
2. If the left position corresponds to a term regular expression and a pragmatic pattern regular expression, then:
<left> = term & pragmatic pattern | <right> = definition.
If Not:
3. If the left position corresponds only to a pragmatic pattern regular expression, then4:
<left> = pragmatic pattern | If the nexus corresponds only to a term regular expression, then <nexus> = term & <right> = definition; If Not, <right> = term & definition.
4. If the left position corresponds only to a definition regular expression, then:
<left> = definition | <right> = term.
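These inferences can be sketched as nested conditionals (a simplified, hypothetical re-implementation; the regexes are toy stand-ins for TRE, PPRE and DRE over word/TAG-formatted text):

```python
import re

TRE = re.compile(r"^(\w+/DET\s+)?\w+/N(\s+\w+/ADJ){0,2}\s*$")  # term
PPRE = re.compile(r"^\w+/(PREP|ADV)\b")                        # pragmatic pattern
DRE = re.compile(r"^\w+/DET\s+\w+/N\b")                        # definition

def classify_cdvp(left, nexus, right):
    """Mirror the four inferences of figure 1 for a compound DVP."""
    is_term, is_pp = bool(TRE.match(left)), bool(PPRE.match(left))
    if is_term and not is_pp:                     # inference 1
        return {"term": left, "definition": right}
    if is_term and is_pp:                         # inference 2 (cannot fire with
        return {"term": left, "pragmatic": left,  # these toy regexes; kept to
                "definition": right}              # mirror the figure)
    if is_pp:                                     # inference 3
        if nexus and TRE.match(nexus):
            return {"pragmatic": left, "term": nexus, "definition": right}
        return {"pragmatic": left, "term+definition": right}
    if DRE.match(left):                           # inference 4
        return {"definition": left, "term": right}
    return {}

result = classify_cdvp("En/PREP sus/DET comienzos/N",
                       "la/DET psicología/N",
                       "la/DET descripción/N")
```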
As an example, consider the following context:
4 In some cases the tree must resort to inferences over other positions to find terms and definitions.
“<left>En sus comienzos</left> <dvp>se definió</dvp> <nexus>la psicología como </nexus><right>"la descripción y la explicación de los estados de conciencia" (Ladd, 1887).</right>”
Once the DVP has been identified as a CDVP – definir como (to define as) – the tree infers that the left position:
1. Does not correspond only to a TRE.
2. Does not correspond to a TRE and a PPRE.
3. Does correspond only to a PPRE.
Then: the left position is a pragmatic pattern (En sus comienzos). To identify the term and definition, the tree moves to the nexus inferences and finds that:
1. It does correspond only to a TRE.
Then: the nexus position corresponds to the term (la psicología) and the right position corresponds to the definition ("la descripción y la explicación de los estados de conciencia […]").
As a result, the processed context is reorganised into a terminological entry, as in the following example:
Table 2. Example of the results
Term: psicología
Definition: "la descripción y la explicación de los estados de la conciencia" (Ladd, 1887).
Verbal Pattern: se define
Pragmatic Pattern: En sus comienzos
To conclude this section, we note that the algorithms implement simple regular expressions and simple logical inferences to find, analyse and organise definitional knowledge. Furthermore, their design allows implementation for other languages by replacing the corresponding regular expressions and logical inferences.
4 Evaluation
The evaluation of the methodology consists of two parts:
1. We evaluated the extraction of DVPs and the filtering of non-relevant contexts using Precision & Recall. In general terms, Precision measures how much of the extracted information is relevant, while Recall measures how much of the relevant information was extracted from the input.
2. For the identification of constitutive elements, we manually assigned values that helped us to statistically evaluate the accuracy of the decision tree.
4.1 Evaluation of DVP’s Extraction and Non-relevant Contexts Filtering
We determined Precision & Recall by means of the following formulas:
P = the number of automatically extracted DCs retained after filtering, over the total number of automatically extracted contexts.
R = the number of automatically extracted DCs retained after filtering, over the number of valid DCs extracted before filtering.
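In code, both measures reduce to ratios over three counts (the variable names and the example counts below are hypothetical, not the paper's actual data):

```python
def precision(retained_dcs, contexts_extracted):
    """P: DCs retained after filtering over all automatically extracted contexts."""
    return retained_dcs / contexts_extracted

def recall(retained_dcs, dcs_before_filtering):
    """R: DCs retained after filtering over all valid DCs found before filtering."""
    return retained_dcs / dcs_before_filtering

# Hypothetical counts for one verbal pattern.
p = precision(252, 300)
r = recall(252, 255)
```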
The results for each verbal pattern can be seen in table 3. In the case of Precision, there is a divergence for verbs that usually appear in metalinguistic sentences. The best results were obtained with verbs like denominar (to denominate) or definir (to define), while verbs like entender (to understand) or significar (to signify) yield low Precision values. The verbs with lower results can be used in a wide assortment of sentences (i.e., not necessarily definitional contexts), and they tend to bring in a large quantity of noise. In the case of Recall, low results indicate that valid DCs were filtered out as non-relevant contexts. These misclassifications are related to the filtering rules, but in some cases they were due to POS tagging errors in the input corpus.
Table 3. Precision & Recall results
Verbal Pattern Precision Recall
Concebir (como) To conceive (as) 0.67 0.98
Definir (como) To define (as) 0.84 0.99
Entender (como) To understand (as) 0.34 0.94
Identificar (como) To identify (as) 0.31 0.90
Consistir de To consist of 0.62 1
Consistir en To consist in 0.60 1
Constar de To comprise 0.94 0.99
Denominar también Also denominated 1 0.87
Llamar también Also called 0.90 1
Servir para To serve for 0.55 1
Significar To signify 0.29 0.98
Usar como To use as 0.41 0.95
Usar para To use for 0.67 1
Utilizar como To utilise as 0.45 0.92
Utilizar para To utilise for 0.53 1
The challenge we faced at this stage is directly related to the elimination of noise. We have noticed that the more precise the verbal pattern is, the better the results (in terms of less noise). Nevertheless, making the verbal patterns more specific is likely to lower recall. In addition, the filtering rules must be revised in order to improve the identification of non-relevant contexts and avoid the cases in which valid DCs were incorrectly filtered out.
4.2 Evaluation of DCs Elements Identification
To evaluate the identification of DC elements, we manually assigned the following values to each DC processed by the decision tree:
3 for contexts whose constitutive elements were correctly classified;
2 for contexts whose constitutive elements were correctly classified, but
some extra information was also classified (for example extra words or punctuation marks in the term position);
1 for contexts whose constitutive elements were incorrectly classified (for example terms classified as definitions or vice versa);
Ø for contexts the system could not classify.
In table 4 we present the results of the evaluation of DC elements identification. The values are expressed as percentages, and their sum for each row represents the total number of DCs found with each verbal pattern. From this evaluation we highlight the following facts:
The average percentage of correctly classified elements (group "3") is above 50 percent of the global classification. In these cases, the classified elements correspond exactly to a term or a definition.
In a lower percentage of cases (group "2"), the classified elements include extra information or noise. Nevertheless, in these cases the elements were still correctly identified, as in group "3".
The incorrect classification of terms and definitions (group "1"), as well as the unclassified elements (group "Ø"), correspond to a low percentage of the global classification.
Table 4. Evaluation of DCs elements identification
Verbal Pattern 3 2 1 Ø
Concebir (como) To conceive (as) 68.57 15.71 11.42 04.28
Definir (como) To define (as) 65.10 18.22 10.41 06.25
Entender (como) To understand (as) 54.16 20.83 8.33 16.66
Identificar (como) To identify (as) 51.72 5.17 34.48 08.62
Consistir de To consist of 60 0 20 20
Consistir en To consist in 60.81 8.10 15.54 15.54
Constar de To comprise 58.29 22.97 2.97 15.74
Denominar también Also denominated 21.42 28.57 7.14 42.85
Llamar también Also called 30 40 0 30
Servir para To serve for 53.78 27.27 0.007 18.18
Significar To signify 41.26 44.44 3.17 11.11
Usar como To use as 63.41 14.63 17.07 4.87
Usar para To use for 36.26 32.96 4.39 26.37
Utilizar como To utilise as 55.10 28.57 10.20 6.12
Utilizar para To utilise for 51.51 19.69 10.60 18.18
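The percentage rows of table 4 can be reproduced from the raw manual scores with a simple frequency count (a sketch over made-up scores; Ø is represented as None):

```python
from collections import Counter

def score_distribution(scores):
    """Percentage of contexts per manual score (3, 2, 1, or None for Ø)."""
    counts = Counter(scores)
    total = len(scores)
    return {score: round(100 * n / total, 2) for score, n in counts.items()}

# Hypothetical scores for ten contexts of one verbal pattern.
dist = score_distribution([3, 3, 3, 3, 3, 3, 2, 2, 1, None])
```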
Since the purpose of this process was the identification of DC elements, we can argue that the results are generally satisfactory. However, there is still much work to do to improve the performance of the decision tree's inferences, in particular the way the tree analyses the different DC elements of each verbal pattern.
5 Conclusions and Future Work
We have presented the development of a definitional knowledge extraction system. The aim of this system is to simplify the terminological practice of searching for term definitions in specialised texts.
The methodology we have presented includes the search for definitional patterns, the filtering of non-relevant contexts and the identification of the DCs' constitutive elements: terms, definitions and pragmatic patterns.
So far we have worked with definitional verbs, and we know there is still much to do, which basically consists of the following points:
a) To explore other kinds of definitional patterns (mainly typographic patterns and reformulation markers) that are capable of recovering definitional contexts.
b) To include those definitional patterns in each step of the methodology.
c) To improve the rules of the non-relevant context filtering process, as well as the algorithm for the automatic identification of constitutive elements.
Acknowledgments. This research has been developed under the sponsorship of the Mexican National Council of Science and Technology (CONACYT), the DGAPA-UNAM, and the Macro Project Tecnologías para la Universidad de la Información y la Computación, UNAM. We also acknowledge the help of Bertha Lecumberri in the translation of this paper.
References
1. Pearson, J.: Terms in Context. John Benjamins, Amsterdam (1998)
2. Meyer, I.: Extracting Knowledge-rich Contexts for Terminography. In: Bourigault, D., Jacquemin, C., L'Homme, M.C. (eds.), pp. 278–302. John Benjamins, Amsterdam (2001)
3. Péry-Woodley, M.-P., Rebeyrolle, J.: Domain and Genre in Sublanguage Text: Definitional Microtexts in Three Corpora. In: First International Conference on Language Resources and Evaluation, Granada, pp. 987–992 (1998)
4. Bach, C.: Los marcadores de reformulación como localizadores de zonas discursivas relevantes en el discurso especializado. Debate Terminológico, Electronic Journal 1 (2005)
5. Klavans, J., Muresan, S.: Evaluation of the DEFINDER System for Fully Automatic Glossary Construction. In: Proceedings of the American Medical Informatics Association Symposium, pp. 252–262. ACM Press, New York (2001)
6. Saggion, H.: Identifying Definitions in Text Collections for Question Answering. In: Proceedings of the 4th International Conference on Language Resources and Evaluation, Lisbon, pp. 1927–1930 (2004)
7. Malaisé, V.: Méthodologie linguistique et terminologique pour la structuration d'ontologies différentielles à partir de corpus textuels. PhD Thesis, UFR de Linguistique, Université Paris 7 – Denis Diderot, Paris (2005)
8. Sierra, G., Alarcón, R., Aguilar, C., Bach, C.: Definitional Verbal Patterns for Semantic Relation Extraction. Terminology 14(1), 74–98 (2008)
9. Del-Gaudio, R., Branco, A.: Automatic Extraction of Definitions in Portuguese: A Rule-Based Approach. In: Proceedings of the 2nd Workshop on Text Mining and Applications, Guimarães (2007)
10. Degórski, L., Marcinczuk, M., Przepiórkowski, A.: Definition Extraction Using a Sequential Combination of Baseline Grammars and Machine Learning Classifiers. In: Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech (2008)
11. Alarcón, R., Bach, C., Sierra, G.: Extracción de contextos definitorios en corpus especializados. Hacia la elaboración de una herramienta de ayuda terminográfica. Revista de la Sociedad Española de Lingüística 37, 247–278 (2007)
... otherwise (1) where c = i + log 2 |a|. As real terms with length = 1 used to appear too far in the bottom of the output list, after a lot of bad longer candidates, then it was defined i = 1, which produces better rankings. ...
Article
The paper presents LEXIK, an intelligent terminological architecture that is able to efficiently obtain specialized lexical resources for elaborating dictionaries and providing lexical support for different expert tasks. LEXIK is designed as a powerful tool to create a rich knowledge base for lexicography. It will process big amounts of data in a modular system, that combines several applications and techniques for terminology extraction, definition generation, example extraction and term banks, that have been partially developed so far. Such integration is a challenge for the area, which lacks an integrated system for extracting and defining terms from a non-preprocessed corpus.
Conference Paper
In order to avoid ambiguity and to ensure, as far as possible, a strict interpretation of law, legal texts usually define the specific lexical terms used within their discourse by means of normative rules. With an often large amount of rules in effect in a given domain, extracting these definitions manually would be a costly undertaking. This paper presents an approach to cope with this problem based in a variation of an automated technique of natural language processing of Brazilian Portuguese texts. For the sake of generality, the proposed solution was developed to address the more general problem of building a glossary from domain specific texts that contain definitions amongst their content. This solution was applied to a corpus of texts on the telecommunications regulations domain and the results are reported. The usual pipeline of natural language processing has been followed: preprocessing, segmentation, and part-of-speech tagging. A set of feature extraction functions is specified and used along with reference glossary information on whether or not a text fragment is a definition, to train a SVM classifier. At last, the definitions are extracted from the texts and evaluated upon a testing corpus, which also contains the reference glossary annotations on definitions. The results are then discussed in light of other definition extraction techniques.
Article
Full-text available
This paper addresses the task of automatic extraction of definitions by thoroughly exploring an approach that solely relies on machine learning techniques, and by focusing on the issue of the imbalance of relevant datasets. We obtained a breakthrough in terms of the automatic extraction of definitions, by extensively and systematically experimenting with different sampling techniques and their combination, as well as a range of different types of classifiers. Performance consistently scored in the range of 0.95–0.99 of area under the receiver operating characteristics, with a notorious improvement between 17 and 22 percentage points regarding the baseline of 0.73–0.77, for datasets with different rates of imbalance. Thus, the present paper also represents a contribution to the seminal work in natural language processing that points toward the importance of exploring the research path of applying sampling techniques to mitigate the bias induced by highly imbalanced datasets, and thus greatly improving the performance of a large range of tools that rely on them.
Article
Full-text available
Uno de los objetivos principales del trabajo terminográfico es la identificación de conocimiento sobre los términos que aparecen en textos especializados. Para confeccionar diccionarios, glosarios u ontologías, los terminógrafos suelen buscar definiciones sobre los términos que pretenden definir. La búsqueda de definiciones se puede hacer a partir de corpus especializados, donde normalmente aparecen en contextos definitorios, es decir, en fragmentos de texto donde un autor explícitamente define el término en cuestión. Hoy en día hay un interés creciente por automatizar este proceso, basado en la búsqueda de patrones definitorios sobre corpus especializados anotados morfosintácticamente. En este artículo presentamos una investigación centrada en la extracción automática de contextos definitorios. Presentamos una metodología que incluye tres procesos automáticos diferentes: la extracción de ocurrencias de patrones definitorios, el filtrado de contextos no relevantes, y la identificación de elementos constitutivos, es decir, términos, definiciones y patrones pragmáticos. http://repositori.upf.edu/handle/10230/16965
Article
Full-text available
One particular type of question which was made the focus of its own subtask within the TREC2003 QA track was the definition question ("What is X?" or "Who is X?"). One of the main problems with this type of question is how to discriminate in vast text collections between definitional and non-definitional text passages about a particular definiendum (i.e., the term to be defined). A method will be presented that uses definition patterns and terms that co-occurr with the definiendum in on-line sources for both passage selection and definition extraction.
Conference Paper
Full-text available
In this paper we present a rule-based system for the automatic extraction of definitions from Portuguese texts. As input, this system takes text that has previously been annotated with morpho-syntactic information, namely POS and inflection features. It handles three types of definitions, in which the connector between definiendum and definiens is the so-called copula verb "to be", a verb other than that one, or punctuation marks. The primary goal of this system is to act as a tool for supporting glossary construction in e-learning management systems. It was tested on a collection of texts that can be taken as learning objects, in three different domains: information society, computer science for non-experts, and e-learning. Evaluation results are presented for each of these domains and for each type in the definition typology. On average, we obtain 14% precision, 86% recall and an F2 score of 0.33.
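The copula type of definition handled by such systems can be illustrated with a minimal, hypothetical sketch. This is not the authors' rule-based grammar, which operates over POS-annotated Portuguese text; here a plain regular expression over English sentences stands in for the "X is a Y" connector pattern linking definiendum and definiens.

```python
import re

# Hypothetical stand-in for a copula ("to be") definitional pattern:
# definiendum + "is"/"are" + optional article + definiens.
COPULA = re.compile(
    r"^(?P<definiendum>[A-Z][\w\s-]*?)\s+(?:is|are)\s+"
    r"(?:a|an|the)?\s*(?P<definiens>.+?)\.$"
)

def extract_definition(sentence):
    """Return (definiendum, definiens) if the sentence matches the
    copula pattern, otherwise None."""
    m = COPULA.match(sentence.strip())
    if m:
        return m.group("definiendum").strip(), m.group("definiens").strip()
    return None

print(extract_definition(
    "An ontology is a formal specification of a shared conceptualisation."
))
print(extract_definition("This approach works well in practice."))  # no match
```

A real system of this kind matches POS tags rather than surface strings, which is also what makes the non-copula and punctuation connector types tractable.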
Conference Paper
Full-text available
The paper deals with the task of definition extraction from a small and noisy corpus of instructive texts. Three approaches are presented: partial parsing, machine learning, and a sequential combination of both. We show that applying ML methods with the support of a trivial grammar gives better results than a relatively complicated partial grammar, and much better results than a pure ML approach.
Article
Full-text available
In this paper we present a description of the role of definitional verbal patterns for the extraction of semantic relations. Several studies show that semantic relations can be extracted from analytic definitions contained in machine-readable dictionaries (MRDs). In addition, definitions found in specialised texts are a good starting point to search for different types of definitions where other semantic relations occur. The extraction of definitional knowledge from specialised corpora represents another interesting approach for the extraction of semantic relations. Here, we present a descriptive analysis of definitional verbal patterns in Spanish and the first steps towards the development of a system for the automatic extraction of definitional knowledge. http://repositori.upf.edu/handle/10230/6177
Article
Full-text available
Drawing on relevance theory, in this article we analyse the role of reformulation markers (RMs) in locating relevant discourse zones in specialised discourse. The study presented shows that the use of RMs can be helpful for building tools aimed at the automatic recognition of terminology with minimal processing effort, and we point to the need to continue this work by studying the reformulation density of specialised texts and the possible interrelation between terminological density and the density of treatment of terminology. http://repositori.upf.edu/handle/10230/6164
Chapter
This first collection of selected articles from researchers in automatic analysis, storage, and use of terminology, and specialists in applied linguistics, computational linguistics, information retrieval, and artificial intelligence offers new insights on computational terminology. The recent needs for intelligent information access, automatic query translation, cross-lingual information retrieval, knowledge management, and document handling have led practitioners and engineers to focus on automated term handling. This book offers new perspectives on their expectations. It will be of interest to terminologists, translators, language or knowledge engineers, librarians and all others dependent on the automation of terminology processing in professional practices. The articles cover themes such as automatic thesaurus construction, automatic term acquisition, automatic term translation, automatic indexing and abstracting, and computer-aided knowledge acquisition. The high academic standing of the contributors together with their experience in terminology management results in a set of contributions that tackle original and unique scientific issues in correlation with genuine applications of terminology processing.
Article
In this paper we outline a shallow grammar of definitions as they occur in technical and scientific texts. We show how visual clues, typography and layout, interact with lexical and syntactic clues to signal a definitional text. Whilst starting from the hypothesis that domain and genre have an impact on the grammar of definitions, we also expect to find stable features, which would make the retrieval of definitions possible in new domains. We illustrate this approach with a comparative study on three texts which differ in terms of domain and genre and we show how it is possible to identify some constraints on variations in the formulation of definitional texts.
Article
Resources like terminologies or ontologies are used in a number of applications, including documentary description and information retrieval. Different methodologies have been proposed to build such resources, on the basis of experts' interviews or of textual corpora. This thesis focuses on the use of existing Natural Language Processing methodologies, meant to help the building of ontologies from textual corpora, to build a particular type of resource: differential ontologies. These ontologies are structured according to a system of semantic identities and differences between their constituents: terms of the domain and categorisation items called "top level categories". We present different experiments that we have carried out to elicit, structure, define and "interdefine" the terminological items relevant for a given task. Our first use case was the OPALES project, in which we had to provide a group of anthropologists with the conceptual vocabulary that they needed to annotate audiovisual documents about childhood. We used the textual corpus built in this project to test linguistic tools and methodologies for building ontologies from textual data, and we defined our own programs. The resulting suite of programs, called SODA, focuses on the extraction and use of defining contexts in corpora to spot terminological items, to structure them, and to provide semantic similarity information that enables their comparison.
Article
In this paper we present a quantitative and qualitative evaluation of DEFINDER, a rule-based system that mines consumer-oriented full-text articles in order to extract definitions and the terms they define. The quantitative evaluation shows that, in terms of precision and recall as measured against human performance, DEFINDER obtained 87% and 75% respectively, thereby revealing the incompleteness of existing resources and the ability of DEFINDER to address these gaps. Our basis for comparison is definitions from on-line dictionaries, including the UMLS Metathesaurus. The qualitative evaluation shows that the definitions extracted by our system are ranked higher, in terms of the user-centred criteria of usability and readability, than definitions from on-line specialised dictionaries. The output of DEFINDER can be used to enhance these dictionaries, and is being incorporated into a system that clarifies technical terms for non-specialist users in understandable, non-technical language.