An Improved Automatic Term Recognition
Method for Spanish
Alberto Barrón-Cedeño¹,², Gerardo Sierra¹,
Patrick Drouin³, and Sophia Ananiadou⁴
¹ Engineering Institute,
Universidad Nacional Autónoma de México, Mexico
² Department of Information Systems and Computation,
Universidad Politécnica de Valencia, Spain
³ Observatoire de Linguistique Sense-Texte,
Université de Montréal, Canada
⁴ University of Manchester and National Centre for Text Mining, UK
alberto@pumas.ii.unam.mx, gsierram@ii.unam.mx,
patrick.drouin@umontreal.ca, sophia.ananiadou@manchester.ac.uk
Abstract. The C-value/NC-value algorithm, a hybrid approach to automatic term recognition, was originally developed to extract multiword term candidates from specialised documents written in English. Here, we present three main modifications to this algorithm that affect how the obtained output is refined. The first modification aims to maximise the number of real terms in the list of candidates with a new approach to the stop-list application process. The second modification adapts the C-value calculation formula in order to consider single-word terms. The third modification changes how the term candidates are grouped, exploiting a lemmatised version of the input corpus. Additionally, the size of each candidate's context window is variable. We also show the linguistic modifications necessary to apply this algorithm to the recognition of term candidates in Spanish.
1 Introduction
The C-value/NC-value algorithm [3] is the basis of the TerMine suite for automatic multiword term recognition in specialised documents in English¹. TerMine is a text mining service developed by the UK National Centre for Text Mining for the automatic extraction of terms in a variety of domains. This algorithm has been applied to Automatic Term Recognition (ATR) in different languages such as English [3] and Japanese [10]. Additionally, it is the basis of an algorithm designed for term extraction in Chinese [4]. A first attempt to adapt it to handle documents in Spanish has also been made [1].
In this paper, we describe the improvements made to different stages of the algorithm. Additionally, we show the adaptations necessary for applying this algorithm to ATR in Spanish texts. The rest of the paper is organised as follows. Section 2 gives a brief description of the original algorithm. Section 3 describes the corpora exploited during the design and test of our method. Section 4 describes the modifications we have made to the algorithm, including a description of the resources we have exploited. Section 5 contains the evaluation. Finally, Section 6 draws some conclusions and outlines future work.
¹ http://www.nactem.ac.uk/software/termine/
A. Gelbukh (Ed.): CICLing 2009, LNCS 5449, pp. 125–136, 2009.
© Springer-Verlag Berlin Heidelberg 2009
2 The C-value/NC-value Algorithm
The C-value/NC-value algorithm was originally developed by Frantzi et al. [3] for multiword ATR on English texts. This hybrid (linguistic-statistical) algorithm is divided into two main stages: C-value and NC-value (Subsections 2.1 and 2.2, respectively). We summarise below the main ideas developed in that paper.
2.1 C-value: the Hybrid Stage
The task of the C-value algorithm is to process an input corpus (composed of a set of specialised texts) in order to generate a list of candidate terms. This list is ranked according to the potential of each candidate to be a real term: its termhood.
Figure 1 shows a simplified version of the C-value algorithm. The full version can be found in [3]. The linguistic steps that we have modified are described in Section 4. In our current approach, we have not defined any frequency threshold for removing strings beforehand (fourth line of the algorithm). Nor do we use the threshold conditions that decide whether the C-value of a candidate is high enough to consider it a good term candidate. The reason is that we do not want to discard candidates based on their frequency or C-value because, so far, our corpus is not big enough to set a meaningful threshold.
The Linguistic Stage. The linguistic filter recognises noun phrases (combinations of nouns with prepositions and adjectives), which are potential terms. Section 5.1 contains a comparison of the results obtained by using an open versus a closed filter.
The stop-list is composed of a set of words which are not expected to occur inside a term in the studied domain (for this reason the stop-list is domain-dependent). Candidates containing words from the stop-list are removed from the list (Section 4.2 describes our improvement to this filtering process). The list of candidates obtained by this linguistic process is then ranked by the statistical process described next.
The Statistical Stage. The purpose of this part of the algorithm is to measure
the termhood of every candidate string. The list of candidate terms is ranked
based on this value (C-value). The C-value calculation considers four aspects:
1. The frequency of the candidate in the entire corpus.
2. The frequency of the candidate when it appears nested in longer candidates.
3. The number of those longer candidates.
4. The length of the candidate (in words).
Algorithm 1: Given the analysis corpus:
outputList = []
tag the corpus
extract strings using linguistic filter
remove strings below frequency threshold
filter rest of strings through stop-list
For each string a given that length(a) = max
    C-value(a) = log2|a| · f(a)
    If C-value(a) ≥ Threshold
        add a to outputList
        For each substring b of a
            revise t(b) and c(b)
For each string a given that length(a) < max
    If a appears for the first time
        C-value(a) = log2|a| · f(a)
    Else
        C-value(a) = log2|a| · (f(a) − (1/c(a)) · t(a))
    If C-value(a) ≥ Threshold
        add a to outputList
        For each substring b of a
            revise t(b) and c(b)
Fig. 1. Simplified C-value algorithm
The absolute frequency of a term candidate in a corpus is an initial indicator of whether it is a real term. However, it is not enough. In fact, long terms commonly appear just once, even in large corpora. After this first appearance, references to such terms often appear only as truncated versions of themselves. For example, in a sample of 139,027 words from the Computer Science Corpus in Spanish [7], the term computadora (computer) appeared 122 times, while computadora personal (personal computer) appeared only 15 times. In this case, we could guess that, at least in some cases, computadora is a simplified version of computadora personal, so the nested appearance of the former in the latter decreases the probability that computadora is a real term.
Nevertheless, in the same sample corpus there are other candidates built with the string computadora as the root, such as computadora portátil (laptop) and computadora de uso general (general purpose computer), appearing 15 and 2 times, respectively. These three different term candidates with the same root reflect the possibility that computadora is by itself a real term. The three longer strings could be variants, each with its own associated concept in the subject area (as is indeed the case). In such a case, all four candidates are considered potential real terms. For these reasons, four considerations are made:
1. A high frequency of a candidate in a corpus is beneficial for its termhood.
2. The length of a string is also beneficial (because the probability that a string appears in a corpus decreases as the string gets longer).
3. The appearance of a candidate nested in another is detrimental to its termhood.
4. If a candidate appears nested in multiple candidate strings, the detrimental effect becomes weaker.
The C-value is calculated as in Eq. 1.
\[
C\text{-}value(a) =
\begin{cases}
\log_2|a| \cdot f(a) & \text{if } a \notin nested \\
\log_2|a| \cdot \left( f(a) - \dfrac{1}{P(T_a)} \sum_{b \in T_a} f(b) \right) & \text{otherwise,}
\end{cases}
\tag{1}
\]
where a is the candidate string, f(·) is the frequency of the string · in the corpus, T_a is the set of extracted candidates containing a, and P(T_a) is the number of those candidates. The nested set is composed of all those candidates appearing inside longer candidates.
This process generates the list T1 of term candidates. T1 contains the set of candidates ranked by their termhood (C-value).
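As an illustration only, Eq. 1 can be sketched in a few lines of Python. The function names and the toy frequencies below are hypothetical, and the sketch assumes that a candidate counts as nested when its words occur contiguously inside a longer candidate of the same list.

```python
import math


def contains(longer, shorter):
    """True if the word tuple `shorter` occurs contiguously inside `longer`."""
    n, m = len(longer), len(shorter)
    return m < n and any(longer[k:k + m] == shorter for k in range(n - m + 1))


def c_value(freq):
    """Apply Eq. 1 to a dict {candidate string: corpus frequency}."""
    words = {cand: tuple(cand.split()) for cand in freq}
    scores = {}
    for a, wa in words.items():
        # T_a: the longer candidates that contain a
        t_a = [b for b, wb in words.items() if contains(wb, wa)]
        length_factor = math.log2(len(wa))
        if not t_a:
            scores[a] = length_factor * freq[a]
        else:
            nested_mean = sum(freq[b] for b in t_a) / len(t_a)
            scores[a] = length_factor * (freq[a] - nested_mean)
    return scores


# Hypothetical frequencies, chosen only to exercise both branches of Eq. 1.
freq = {"hard disk": 10, "external hard disk": 4, "hard disk drive": 2}
scores = c_value(freq)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Here hard disk is nested in the two longer candidates, so its frequency is reduced by their mean frequency before being multiplied by the length factor. A single-word candidate would always score 0 under this formula, which is the limitation addressed in Section 4.2.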
2.2 NC-value: Considering the Terms Context
It is hard to think of a word without relating it to others that interact with it. Sager [11] has stated that terms tend to be accompanied by a restricted set of other words (including other terms). To illustrate, consider the term hard disk. This term will hardly ever appear with words such as cook in its neighbourhood, but it will frequently appear with words such as GB, format, install or capacity, which are related to it.
If a term appears with a “closed” set of neighbouring words, the presence of these words in the context of a candidate is a positive clue to its termhood. The NC-value method extends C-value by considering the candidates' context through the so-called context weighting factor. Term context words are those appearing in the neighbourhood of the candidates. However, not all the words in the neighbourhood are considered context words: only nouns, adjectives and verbs are (other words do not add significant information about a term).
A list of context words is obtained and ranked according to their “relevance” to the terms. This relevance is based on the number of terms in whose contexts they appear. The higher this number, the higher the probability that the word is related to real terms. It is expected that these words appear with other terms in the same corpus. The context weighting factor (which expresses the probability that a word w is a term context word) is calculated as in Eq. 2.
\[
weight(w) = \frac{t(w)}{n},
\tag{2}
\]
where w is a term context word, weight(w) is the weight assigned to the word w (expressed as a probability), t(w) is the number of terms w appears with, and n is the total number of terms considered.
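A minimal sketch of Eq. 2 follows. The function name, the terms and their context words are hypothetical examples; the only assumption is that each top-ranked term comes with the list of context words observed around it.

```python
from collections import Counter


def context_weights(top_term_contexts):
    """Eq. 2: weight(w) = t(w) / n for every context word w.

    top_term_contexts maps each top-ranked term to the context words
    (nouns, adjectives, verbs) seen in its neighbourhood.
    """
    n = len(top_term_contexts)
    t = Counter()
    for context in top_term_contexts.values():
        # A word counts once per term, however often it occurs with that term.
        t.update(set(context))
    return {w: count / n for w, count in t.items()}


# Hypothetical top terms and context words.
top_term_contexts = {
    "hard disk": ["format", "capacity", "install", "capacity"],
    "memory card": ["capacity", "format"],
    "graphics card": ["install"],
    "power supply": ["capacity"],
}
weights = context_weights(top_term_contexts)
```

With four terms considered, capacity appears with three of them and receives a weight of 0.75, while format and install each appear with two and receive 0.5.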
The weight assigned to a context word must be calculated after obtaining the list T1 which, as we have said, is ordered by C-value. In order to extract the term context words, we use the top candidate terms in T1, which present a high Precision (that is, they contain a good proportion of real terms). These top terms produce a list of term context words weighted on the basis of Eq. 2.
The rest of the context words may or may not have an associated context weight. If a context word w has been seen earlier, it retains its associated weight. Otherwise, weight(w) = 0. The NC-value for the term candidates is calculated as in Eq. 3, which considers the previously calculated C-value as well as the context word weights:
\[
NC\text{-}value(a) = 0.8 \cdot C\text{-}value(a) + 0.2 \sum_{b \in C_a} f_a(b) \cdot weight(b),
\tag{3}
\]
where a is the current term candidate, C_a is the set of context words associated with a, b is each one of those context words, and f_a(b) is the frequency of b as a context word of a.
The context information is exploited in order to concentrate the real terms at the top of the list. The new list T2 of term candidates is ranked on the basis of the NC-value.
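Eq. 3 can be sketched as follows. The function name and all numbers are hypothetical; the weights are assumed to have been computed beforehand with Eq. 2, and unseen context words default to a weight of 0 as stated above.

```python
def nc_value(c_val, context_counts, weights):
    """Eq. 3: combine a candidate's C-value with its weighted context.

    context_counts maps each context word b of the candidate to f_a(b);
    weights maps context words to weight(b).
    """
    context_score = sum(f * weights.get(b, 0.0) for b, f in context_counts.items())
    return 0.8 * c_val + 0.2 * context_score


# Hypothetical candidate: C-value 5.0, seen twice with "capacity", once with "cook".
weights = {"capacity": 0.75, "format": 0.5}
score = nc_value(5.0, {"capacity": 2, "cook": 1}, weights)
```

The word cook has no weight, so only capacity contributes: the context score is 2 × 0.75 = 1.5 and the NC-value is 0.8 × 5.0 + 0.2 × 1.5 = 4.3.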
3 The Corpora
We have used two corpora during the design and evaluation of our prototype: The
Linguistic Corpus of Engineering and the Computer Science Corpus in Spanish,
described in Subsections 3.1 and 3.2, respectively.
3.1 The Linguistic Corpus of Engineering
The Linguistic Corpus of Engineering (CLI) [9] was created by the Language Engineering Group at the Universidad Nacional Autónoma de México (UNAM). It is composed of a set of specialised texts on Engineering (mechanics, civil and electronics, among others). Most of the documents are written in Mexican Spanish, although some are written in peninsular Spanish. This corpus, consisting of 23 files with 274,672 words, includes postgraduate and undergraduate theses, papers and reports in this area.
Because Engineering is a large subject area, we have opted to focus only on the CLI Mechanical Engineering section. This section includes 5 files with a total of 10,191 words.
The CLI corpus has been used to define the rules of the linguistic filter for the candidate extraction subtask (Subsection 4.1).
3.2 The Computer Science Corpus in Spanish
The Computer Science Corpus in Spanish (CSCS) [7] was compiled at the Observatoire de Linguistique Sense-Texte (University of Montreal). The original objective of this corpus is the development of a Spanish version of DicoInfo [5], “the fundamental computer science and Internet dictionary”².
² http://olst.ling.umontreal.ca/dicoinfo/
Table 1. Statistics of the CLI and CSCS corpora sections used in our experiments

Corpus  Feature                         Value
CLI     Number of files                 5
CLI     Total number of tokens          10,191
CLI     Avg. number of tokens per file  2,038
CSCS    Number of files                 200
CSCS    Total number of tokens          150,000
CSCS    Avg. number of tokens per file  750
The CSCS contains around 550 documents with more than 500,000 words. It
mainly contains texts written in peninsular Spanish. For our experiment, we have
chosen the Hardware section, with around 200 documents and almost 150,000
words.
This corpus has been used in order to define the open linguistic filter, corre-
sponding to the candidate extraction task (Subsection 4.1), as well as for evalua-
tion (Subsection 5.1). Some statistics for both corpora are included in Table 1.
4 Improvements to the Algorithm
After explaining the original C-value/NC-value algorithm and describing the corpora used, we discuss the adaptations made to both the linguistic and statistical stages of the C-value method.
4.1 Creating the Linguistic Filter for Spanish
The modified prototype has been designed for ATR over documents written in Spanish. For our experiments we have considered two filters: closed and open. The former is strict and tries to retrieve only real terms, reducing the number of false positives. The latter is flexible and tries to retrieve all the terms in the corpus, regardless of the number of false positives obtained.
The most frequent term patterns in Spanish are Noun, e.g. amplificador, protocolo (amplifier, protocol); Noun Prep Noun, e.g. estación de trabajo, lenguaje de programación (work station, programming language); and Noun Adjective, e.g. computadora personal, red neuronal (personal computer, neural network) [2]. These patterns compose our closed filter (this set of rules, as well as the one corresponding to the open filter, is in NLTK format [8]):
Noun Adj
Noun PrepDE Adj
Noun
In the second rule we do not accept every preposition, but only de (of). That is the meaning of the tag PrepDE.
Additionally, we have carried out a manual term extraction based on the method described in [6]. The objective was to find more flexible patterns in order to retrieve more terms (while trying to limit the noise generated in the output). The manual extraction, carried out over a section of both the CLI and CSCS corpora, resulted in the following set of rules, which compose the open filter:
(Noun|ProperNoun|ForeignWord)+
(Noun Adj)(PrepDE (Noun|ProperNoun))
Noun PrepDE (Noun|ProperNoun)
Noun? Acrnm
Noun PrepDE ((Noun Adj)|(Adj Noun))
Note that some terms contained foreign words (most of them in English). Other parts of speech, such as acronyms and proper nouns, also appeared. The choice between the closed and the open filter depends on whether Precision or Recall is to be favoured in the output (Section 5).
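The following sketch shows one possible way of applying the open filter to a POS-tagged sentence. It is not the NLTK rule set used by the prototype: the five rules are transcribed literally, as printed above, into regular expressions from the Python standard library, and the tagged sentence and the function name are hypothetical.

```python
import re

# Open-filter rules written as regular expressions over a string of POS tags
# in which every tag is followed by exactly one space.
OPEN_FILTER = [
    re.compile(r"(?:(?:Noun|ProperNoun|ForeignWord) )+"),
    re.compile(r"Noun Adj PrepDE (?:Noun|ProperNoun) "),
    re.compile(r"Noun PrepDE (?:Noun|ProperNoun) "),
    re.compile(r"(?:Noun )?Acrnm "),
    re.compile(r"Noun PrepDE (?:Noun Adj|Adj Noun) "),
]


def extract_candidates(tagged):
    """Return every word span whose tag sequence fully matches some rule."""
    found = []
    for start in range(len(tagged)):
        for end in range(start + 1, len(tagged) + 1):
            span = tagged[start:end]
            tags = "".join(tag + " " for _, tag in span)
            if any(rule.fullmatch(tags) for rule in OPEN_FILTER):
                found.append(" ".join(word for word, _ in span))
    return found


# Hypothetical tagged sentence: "la estación de trabajo usa memoria RAM".
tagged = [
    ("la", "Det"), ("estación", "Noun"), ("de", "PrepDE"), ("trabajo", "Noun"),
    ("usa", "Verb"), ("memoria", "Noun"), ("RAM", "Acrnm"),
]
candidates = extract_candidates(tagged)
```

For this sentence the sketch yields estación, estación de trabajo, trabajo, memoria, memoria RAM and RAM. Nested candidates are kept on purpose, since the C-value stage is the one that weighs nested against longer candidates.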
4.2 Modifications to the C-value Algorithm
We have detected some weaknesses in the C-value/NC-value algorithm. With the aim of reducing them, we have made four main modifications.
Selective Stop-Words Deletion. As pointed out in Subsection 2.1, a stop-list is applied during the C-value stage in order to reduce noise. The original method deletes an entire candidate if it contains at least one stop-word.
A stop-list in ATR is a list of words which are not expected to occur as term words in the treated domain. Our stop-list contains 223 words. It is composed of nouns and adjectives that have a high frequency in the CSCS but are not expected to occur inside real terms. Some examples of these stop-words are característica, compañía and tamaño (feature, company and size, respectively).
Our strategy proposes the deletion of stop-words instead of entire candidates. We call this strategy selective stop-words deletion. The reason for this selective deletion is that many candidate terms contain stop-words together with other kinds of words. For instance, consider the candidate computadora grande (big computer). If we delete only the substring grande instead of the entire candidate, keeping computadora, the pattern of the resulting candidate is characteristic of terms in Spanish and, as is the case in Computer Science, it becomes a potential real term.
However, the stop-words could be linked to function words. To clarify this point, consider another example. The candidate desarrollo de LCD (LCD’s development) contains the stop-word desarrollo. The POS pattern of this candidate is Noun PrepDE Noun. Again, the basic option would be to discard this candidate completely, but LCD is a real term. Under the selective stop-words deletion strategy, we delete only the stop-word (desarrollo). In this way, we obtain de LCD with POS pattern PrepDE Noun. However, this is not a characteristic pattern of terms in Spanish. If the stop-word is a noun, it is necessary to check the words before and after it in order to decide whether they should also be deleted. In this case, the preposition de must be deleted. The result is LCD, which is a real term.
The selective stop-words deletion strategy is described in Algorithm 2. This algorithm has been designed for Spanish terms. However, after a brief linguistic adaptation, it can be applied to other languages.
Algorithm 2: Given a candidate s split into words s_i:
D = {} // The set of words that will be deleted from s
If P(s_i) = Adjective and s_i ∈ stop-list
    add s_i to D
Elif P(s_i) = Noun and s_i ∈ stop-list
    add s_i to D
    If P(s_{i−1}) = Preposition
        add s_{i−1} to D
    If P(s_{i+1}) = Preposition or P(s_{i+1}) = Adjective
        add s_{i+1} to D
        If P(s_{i+2}) = Preposition or P(s_{i+2}) = Adjective
            add s_{i+2} to D
delete words in D from s
Return s
Fig. 2. Selective deletion of stop and related words. P(·) = part-of-speech of ·
In order to clarify how Algorithm 2 works, we give another example. Consider the candidate pantalla de manera lateral (screen in lateral way). In this case, s = {pantalla<Noun>, de<PrepDE>, manera<Noun>, lateral<Adj>}. s_2 (manera) is a stop-word, so D = {s_2}. manera is a noun and for this reason it is necessary to check s_{2−1}, which is a preposition. In this step D = {s_1, s_2}. The word s_{2+1} (an adjective) must also be deleted. Now D = {s_1, s_2, s_3}. The resulting candidate after deleting D from s is pantalla (screen).
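A possible Python rendering of Algorithm 2 is sketched below. It rests on two assumptions that the figure does not state explicitly: the rules are applied to every word of the candidate, and the check on the word at position i+2 is made only when the word at position i+1 has been deleted. The three-word stop-list and the function name are hypothetical.

```python
def selective_deletion(candidate, stop_list):
    """Delete stop-words and the function words attached to them.

    candidate is a list of (word, pos) pairs using the tags Noun, Adj and PrepDE.
    """
    n = len(candidate)

    def pos(k):
        # Part of speech at position k, or None outside the candidate.
        return candidate[k][1] if 0 <= k < n else None

    delete = set()
    for i, (word, tag) in enumerate(candidate):
        if word not in stop_list:
            continue
        if tag == "Adj":
            delete.add(i)
        elif tag == "Noun":
            delete.add(i)
            if pos(i - 1) == "PrepDE":
                delete.add(i - 1)
            if pos(i + 1) in ("PrepDE", "Adj"):
                delete.add(i + 1)
                if pos(i + 2) in ("PrepDE", "Adj"):
                    delete.add(i + 2)
    return [word for k, (word, _) in enumerate(candidate) if k not in delete]


# Hypothetical stop-list containing only the stop-words of the three examples.
stop_list = {"grande", "desarrollo", "manera"}
big = selective_deletion([("computadora", "Noun"), ("grande", "Adj")], stop_list)
lcd = selective_deletion(
    [("desarrollo", "Noun"), ("de", "PrepDE"), ("LCD", "Noun")], stop_list)
screen = selective_deletion(
    [("pantalla", "Noun"), ("de", "PrepDE"), ("manera", "Noun"), ("lateral", "Adj")],
    stop_list)
```

The three calls reproduce the examples discussed in this section: computadora grande becomes computadora, desarrollo de LCD becomes LCD, and pantalla de manera lateral becomes pantalla.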
Modifying the C-value Calculation Formula. The C-value/NC-value algorithm was originally designed for the extraction of multiword terms. For this reason, the C-value calculation formula was not designed to handle terms composed of one word.
Fortunately, this limitation is only mathematical. The C-value formula cannot calculate termhood for candidates with length(a) = 1 because, in order to normalise the relevance of the length, it uses its logarithm (note that log(1) = 0, so C-value(a) = 0). In order to avoid this limitation, we add a constant i to the logarithm of the length of a:
\[
C\text{-}value(a) =
\begin{cases}
c \cdot f(a) & \text{if } a \notin nested \\
c \cdot \left( f(a) - \dfrac{1}{P(T_a)} \sum_{b \in T_a} f(b) \right) & \text{otherwise,}
\end{cases}
\tag{4}
\]
where c = i + \log_2|a|.
In the initial experiments, we tried i = 0.1 in order to modify the essence of the formula as little as possible. However, real terms with length = 1 tended to appear too close to the bottom of the output list, after many bad longer candidates. For this reason we define i = 1, which (experimentally) produces better rankings.
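The effect of the constant i can be illustrated with the computadora frequencies quoted in Section 2.1. This is only a sketch: the function name is hypothetical, and it assumes that the 122 occurrences of computadora are its total frequency and that the three longer candidates (15, 15 and 2 occurrences) form its set T_a.

```python
import math


def c_value(length, f_a, longer_freqs=(), i=1.0):
    """Eq. 4 for one candidate of `length` words and frequency f_a.

    longer_freqs holds f(b) for every longer candidate b containing it.
    """
    c = i + math.log2(length)
    if not longer_freqs:
        return c * f_a
    return c * (f_a - sum(longer_freqs) / len(longer_freqs))


longer = [15, 15, 2]  # computadora personal, computadora portátil, computadora de uso general
for i in (0.0, 0.1, 1.0):
    single = c_value(1, 122, longer, i=i)  # computadora
    double = c_value(2, 15, i=i)           # computadora personal
    print(f"i={i}: computadora={single:.2f}  computadora personal={double:.2f}")
```

Under these assumptions, i = 0 gives computadora a score of 0, i = 0.1 still ranks it below computadora personal (about 11.1 against 16.5), and i = 1 ranks it above (about 111.3 against 30), in line with the behaviour described in the text.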
Searching on a Lemmatised Corpus. The first step in the C-value/NC-value algorithm is POS tagging the corpus. However, the example output included in [3] contains the strings B cell and B cells in different rows of the NC-value ranked list. This shows that no lemmatisation process is involved. We consider that including such a process is important for term extraction. When the lemmatised corpus is used to build the list of candidate terms, the different variants of the same candidate term are considered as one, and its total frequency is the sum of the frequencies of all its variants.
In order to merge the different variants of a candidate, we lemmatise the corpus before processing it. We carry out this subtask with TreeTagger [12], a tool that is both a POS tagger and a lemmatiser.
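The grouping step can be sketched as follows. The function name is hypothetical, and the (surface form, lemma) pairs are hand-written stand-ins for the output of a lemmatiser; no call to TreeTagger is shown.

```python
from collections import Counter


def merge_variants(occurrences):
    """Count candidate occurrences by lemma sequence instead of surface form.

    occurrences is a list of candidates, each a list of (surface form, lemma) pairs.
    """
    counts = Counter()
    for occurrence in occurrences:
        counts[" ".join(lemma for _, lemma in occurrence)] += 1
    return counts


# Hypothetical occurrences found in a lemmatised corpus.
occurrences = [
    [("computadora", "computadora"), ("personal", "personal")],
    [("computadoras", "computadora"), ("personales", "personal")],
    [("computadora", "computadora"), ("personal", "personal")],
    [("computadoras", "computadora")],
]
merged = merge_variants(occurrences)
```

The singular and plural variants of computadora personal collapse into a single candidate with frequency 3, which is the total frequency that the C-value stage then uses.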
4.3 Modifying the NC-value Algorithm
The NC-value stage is based on considering the candidates' context. A word appearing frequently in the neighbourhood of a term at the top of the list ranked by C-value (which has a high probability of being a real term) has a good probability of appearing with other real terms (even if they are in a lower position in the list).
The context of the candidates was originally defined as a fixed window of length 5. However, we have opted for flexible boundaries to define the context windows. Punctuation marks (full stop, colon, semicolon, parentheses) break phrases. For this reason, a context window is cut if it contains one of these marks (regardless of the length of the resulting window).
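A minimal sketch of such a flexible window is given below. It assumes that the window spans up to five tokens on each side of the candidate, which the text does not specify; the function name and the token sequence are hypothetical, and the selection of nouns, adjectives and verbs among the context words is left out.

```python
BREAKS = {".", ":", ";", "(", ")"}


def context_window(tokens, start, end, size=5):
    """Context of the candidate tokens[start:end], cut at punctuation marks."""
    left = []
    for token in reversed(tokens[max(0, start - size):start]):
        if token in BREAKS:
            break
        left.append(token)
    left.reverse()
    right = []
    for token in tokens[end:end + size]:
        if token in BREAKS:
            break
        right.append(token)
    return left, right


# Hypothetical tokenised text; the candidate "disco duro" spans positions 6 to 8.
tokens = "Es un equipo portátil . El disco duro tiene 500 GB ; se instala".split()
left, right = context_window(tokens, 6, 8)
```

The left window stops at the full stop and keeps only El, and the right window stops at the semicolon and keeps tiene, 500 and GB, so both are shorter than the nominal five tokens.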
5 Evaluation
Our version of the C-value/NC-value algorithm for ATR has been evaluated in terms of Precision and Recall. In Section 5.1 we evaluate the extractor with different configuration parameters. Section 5.2 compares our adaptation with another one previously designed for Chinese [4].
5.1 Varying the Parameters for the Extraction
We have randomly selected a set of documents from the CSCS [7]. The test
corpus contains 15,992 words on the Hardware subject. In order to evaluate the
obtained results we have carried out a manual term extraction over the same
test corpus.
We have carried out four experiments in order to compare different parameter combinations. These combinations are the following:
A. Open linguistic filter without stop-list
B. Open linguistic filter with stop-list
C. Closed linguistic filter without stop-list
D. Closed linguistic filter with stop-list
The open and closed filters are described in Section 4.1. An open linguistic filter is flexible with respect to term patterns; it therefore increases Recall at the cost of Precision. A closed linguistic filter is strict with respect to the accepted patterns; it therefore increases Precision at the cost of Recall.
A total of 520 terms were found during the manual extraction process. The
results obtained by the different automatic extractions are shown in Table 2.
Table 2. Results of the automatic extractions with different parameters

Case  Candidates  Real terms  P      R
A     1,867       430         0.230  0.826
B     1,554       413         0.265  0.794
C     1,000       241         0.240  0.463
D     850         262         0.308  0.503
As expected, using an open filter benefits Recall but harms Precision, while using a closed filter benefits Precision but harms Recall. Looking more closely at the results obtained in experiments C and D, we can see that the Recall obtained by the latter is higher. This improvement is due to the fact that after the selective deletion carried out in experiment D, more real terms (mainly of length 1) that originally appeared combined with stop-words are discovered. The original approach discards those candidates (Section 4.2).
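The figures in Table 2 follow from the candidate and real-term counts together with the 520 manually extracted terms. The sketch below recomputes them, assuming the usual definitions (Precision = real terms / candidates, Recall = real terms / manually extracted terms); the variable and function names are hypothetical.

```python
MANUAL_TERMS = 520  # terms found by the manual extraction

# Case: (number of candidates, number of real terms), as listed in Table 2.
runs = {"A": (1867, 430), "B": (1554, 413), "C": (1000, 241), "D": (850, 262)}


def precision_recall(candidates, real_terms, gold=MANUAL_TERMS):
    """Precision and Recall of one extraction run."""
    return real_terms / candidates, real_terms / gold


results = {case: precision_recall(*counts) for case, counts in runs.items()}
for case, (p, r) in results.items():
    print(f"{case}: P={p:.3f} R={r:.3f}")
```

The recomputed values agree with the table to within about 0.001, and they make the comparison between C and D explicit: D returns fewer candidates but more real terms, so both its Precision and its Recall are higher.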
5.2 Comparing our Adaptation with a Chinese Version
The C-value/NC-value method has been implemented and modified previously. Ji et al. [4] have developed a term extractor for texts on IT written in Chinese. In this case, the NC-value stage is replaced by a semantic and syntactic analysis stage. The objective of this stage is to improve the ranking of the obtained output.
The reported experiments on a sample corpus of 16 papers (1,500,000 Chinese characters) obtain Precision = 0.67 and Recall = 0.42. Our experiment B obtains Precision = 0.265 and Recall = 0.794. Although their Precision is better than ours, we must consider that they use a previously obtained list of 288,000 terms of length = 1. This list acts as a filter that separates good candidates from bad ones.
Unlike them, we have opted to preserve the philosophy of the C-value/NC-value method: our approach needs only a POS tagger in order to carry out the extraction process.
We must say that we have not compared our algorithm with the one originally described in [3], because our experimental conditions are quite different, mainly with respect to the stop-list and the features of the corpus.
6 Conclusions
In this paper we have described a linguistic and functional adaptation of the C-value/NC-value algorithm for automatic term recognition. The main functional adaptations carried out are the following:
- A new algorithm for the selective elimination of stop-words in the term candidates has been designed.
- The C-value calculation formula has been adapted in order to handle candidates of one word.
- The length of the candidates' context windows is no longer fixed. Unlike the default length = 5, it is dynamically re-sized when the window includes punctuation marks.
Regarding the linguistic adaptations, we have analysed the patterns of the terms in Spanish in order to build an open and a closed filter for candidate detection. The open filter favours Recall, while the closed filter favours Precision. Additionally, a stop-list composed of around 200 nouns and adjectives has been created.
With respect to other versions of the C-value/NC-value method, the Precision we obtain has decreased. The main reason for this behaviour is that we consider candidates of one word. Moreover, we have not defined any threshold in order to eliminate candidates with low frequency or C-value. We have opted to tolerate the noise for the sake of a minimum loss of information, resulting in a good Recall.
Finally, we have designed a selective stop-words deletion method. Our method
discovers good term candidates that are ignored when considering the original
stop-word deletion method.
Acknowledgements. This paper has been partially supported by the National Council for Science and Technology (CONACYT); the DGAPA-UNAM; the General Direction of Postgraduate Studies (DGEP), UNAM; and the Macro-Project Tecnologías para la Universidad de la Información y la Computación, UNAM.
References
1. Barrón, A., Sierra, G., Villaseñor, E.: C-value aplicado a la extracción de términos multipalabra en documentos técnicos y científicos en español. In: 7th Mexican International Conference on Computer Science (ENC 2006). IEEE Computer Press, Los Alamitos (2006)
2. Cardero, A.M.: Terminología y Procesamiento. Universidad Nacional Autónoma de México, Mexico (2003)
3. Frantzi, K., Ananiadou, S., Mima, H.: Automatic recognition of multi-word terms: the C-value/NC-value method. International Journal on Digital Libraries 3(2), 115–130 (2000)
4. Ji, L., Sum, M., Lu, Q., Li, W., Chen, Y.-R.: Chinese terminology extraction using window-based contextual information. In: Gelbukh, A. (ed.) CICLing 2007. LNCS, vol. 4394, pp. 62–74. Springer, Heidelberg (2007)
5. L’Homme, M.C.: Conception d’un dictionnaire fondamental de l’informatique et de l’Internet: sélection des entrées. Le langage et l’homme 40(1), 137–154 (2005)
6. L’Homme, M.C., Bae, H.S.: A Methodology for Developing Multilingual Resources for Terminology. In: Language Resources and Evaluation Conference (LREC 2006), pp. 22–27 (2006)
7. L’Homme, M.C., Drouin, P.: Corpus de Informática en Español. Groupe Éclectik, Université de Montréal, http://www.olst.umontreal.ca/
8. Loper, E., Bird, S.: NLTK: The natural language toolkit. In: ACL Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, pp. 62–69 (2002)
9. Medina, A., Sierra, G., Garduño, G., Méndez, C., Saldaña, R.: CLI. An Open Linguistic Corpus for Engineering. In: IX Congreso Iberoamericano de Inteligencia Artificial (IBERAMIA), pp. 203–208 (2004)
10. Mima, H., Ananiadou, S.: An application and evaluation of the C/NC-value approach for the automatic term recognition of multi-word units in Japanese. International Journal on Terminology 6(2), 175–194 (2001)
11. Sager, J.C.: Commentary by Prof. Juan Carlos Sager. In: Rondeau, G. (ed.) Actes Table Ronde sur les Problèmes du Découpage du Terme, pp. 39–74. AILA-Comterm, Office de la Langue Française, Montréal (1978)
12. Schmid, H.: Improvements in Part-of-Speech Tagging with an Application to German. In: ACL SIGDAT-Workshop (1995)
... The knowledge resources can be linguistic (rules, patterns, structures), statistical (frequency, probability), external (dictio- naries, ontologies, folksonomies). Some examples of these systems are: NEURAL [17], Termext [5], YATE [30]. As well, in the last years, systems that include, for example, pre-validation of terms [31] or the use of Machine Learning [10] have been developed. ...
... Termex is an hybrid term recognition system that involves both a statistical and a strong linguistic process. But some linguistic and functional adaptations were carried out over different stages of the algorithm [5]. ...
Article
The paper presents LEXIK, an intelligent terminological architecture that is able to efficiently obtain specialized lexical resources for elaborating dictionaries and providing lexical support for different expert tasks. LEXIK is designed as a powerful tool to create a rich knowledge base for lexicography. It will process big amounts of data in a modular system, that combines several applications and techniques for terminology extraction, definition generation, example extraction and term banks, that have been partially developed so far. Such integration is a challenge for the area, which lacks an integrated system for extracting and defining terms from a non-preprocessed corpus.
... Description of features per group and subgroup 1. Shape features (SHAP) length number of characters & number of tokens alphanumeric whether the CT is alphabetic, numeric, alphanumeric, etc. & the number of digits and non-alphabetic characters capitalisation out of all occurrences of the CT, how often (%) is it all lowercase, all uppercase, title case, etc. NER whether the CT was tagged (completely, partially, etc.) as a Named Entity during preprocessing chunk which chunk tag(s) were assigned to the CT in preprocessing stopword whether the CT contains a stopword or is a stopword *3. Frequency features (FREQ)metrics to calculate termhood/unithood without comparing to a reference corpus: C-Value (Barrón-Cedeño et al. 2009), TF-IDF(Astrakhantsev, Fedorenko, and Turdakov 2015), Lexical Cohesion and Basic(Bordea, Buitelaar, and Polajnar 2013) metrics to calculate termhood/unithood by comparing frequencies to a reference corpus: Domain Pertinence(Meijer, Frasincar, and Hogenboom 2014), Domain Relevance(Bordea, Buitelaar, and Polajnar 2013), Weirdness ...
Article
Automatic term extraction (ATE) is an important task within natural language processing, both separately, and as a preprocessing step for other tasks. In recent years, research has moved far beyond the traditional hybrid approach where candidate terms are extracted based on part-of-speech patterns and filtered and sorted with statistical termhood and unithood measures. While there has been an explosion of different types of features and algorithms, including machine learning methodologies, some of the fundamental problems remain unsolved, such as the ambiguous nature of the concept "term". This has been a hurdle in the creation of data for ATE, meaning that datasets for both training and testing are scarce, and system evaluations are often limited and rarely cover multiple languages and domains. The ACTER Annotated Corpora for Term Extraction Research contain manual term annotations in four domains and three languages and have been used to investigate a supervised machine learning approach for ATE, using a binary random forest classifier with multiple types of features. The resulting system (HAMLET: Hybrid Adaptable Machine Learning approach to Extract Terminology) provides detailed insights into its strengths and weaknesses. It highlights a certain unpredictability as an important drawback of machine learning methodologies, but also shows how the system appears to have learnt a robust definition of terms, producing results that are state-of-the-art, and contain few errors that are not (part of) terms in any way. Both the amount and the relevance of the training data have a substantial effect on results, and by varying the training data, it appears to be possible to adapt the system to various desired outputs, e.g., different types of terms. While certain issues remain difficult, such as the extraction of rare terms and multiword terms, this study shows how supervised machine learning is a promising methodology for ATE.
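The traditional hybrid approach mentioned in this abstract (part-of-speech patterns for candidate extraction, then a statistical ranking) can be illustrated with a minimal sketch. It assumes the input is already tagged as (token, tag) pairs, uses a hypothetical coarse tagset (`ADJ`, `NOUN`) and a single illustrative pattern, and ranks by raw frequency as a stand-in for a real termhood measure. None of these choices come from the cited work.

```python
from collections import Counter

# Hypothetical coarse tagset; a real tagger's labels would differ.
ALLOWED = {"ADJ", "NOUN"}

def extract_candidates(tagged, max_len=4):
    """Count spans matching the illustrative pattern (ADJ|NOUN)* NOUN."""
    counts = Counter()
    n = len(tagged)
    for start in range(n):
        for end in range(start + 1, min(start + max_len, n) + 1):
            span = tagged[start:end]
            # Every token in the span must be an adjective or a noun.
            if any(tag not in ALLOWED for _, tag in span):
                break  # any longer span from this start fails as well
            # The span must end in a noun to count as a candidate.
            if span[-1][1] == "NOUN":
                counts[" ".join(tok.lower() for tok, _ in span)] += 1
    return counts

tagged = [("Automatic", "ADJ"), ("term", "NOUN"), ("extraction", "NOUN"),
          ("is", "VERB"), ("a", "DET"), ("task", "NOUN")]
# Raw frequency ranking; a termhood measure would replace this step.
ranked = extract_candidates(tagged).most_common()
```

In a full pipeline the frequency ranking would be replaced by measures such as C-value, and the pattern set would be language-specific.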
... As a result of these factors, we implemented three methods of term extraction, all of which are applied to input that the tool has already lemmatized. The first method is C-value (Frantzi et al., 2000), modified to include single-word terms (Barrón-Cedeño et al., 2009). This is likely one of the best-known term search methods in existence. ...
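To see why the original C-value needs a modification for single-word terms, consider its length factor, log2 of the number of words, which is zero for a one-word candidate. The sketch below shows one possible adjustment, adding an offset inside the logarithm. The function names and the offset of 1 are assumptions made for illustration; the exact formula used in the cited work is not reproduced here.

```python
import math

def length_weight(n_words, offset=0):
    """Length factor log2(|a| + offset); offset=0 gives the original form."""
    return math.log2(n_words + offset)

def c_value_non_nested(n_words, freq, offset=1):
    """Score of a candidate that is not nested in any longer candidate.

    The default offset of 1 is an assumed value, chosen only so that
    single-word candidates receive a non-zero weight.
    """
    return length_weight(n_words, offset) * freq

# With the original factor, a single word is weighted by log2(1) = 0,
# so its score vanishes regardless of how frequent it is.
original_single = length_weight(1) * 7
adjusted_single = c_value_non_nested(1, 7)
```

With the offset, longer candidates are still weighted more heavily than shorter ones, so the ordering by length is preserved.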
Conference Paper
Automatic term extraction (ATE) from texts is critical for effective terminology work in small speech communities. We present TermPortal, a workbench for terminology work in Iceland, featuring the first ATE system for Icelandic. The tool facilitates standardization in terminology work in Iceland, as it exports data in standard formats in order to streamline gathering and distribution of the material. In the project we focus on the domain of finance in order to be able to fulfill the needs of an important and large field. We present a comprehensive survey amongst the most prominent organizations in that field, the results of which emphasize the need for a good, up-to-date and accessible termbank and the willingness to use terms in Icelandic. Furthermore, we present the ATE tool for Icelandic, which uses a variety of methods and shows great potential with a recall rate of up to 95% and a high C-value, indicating that it competently finds term candidates that are important to the input text.
... The work in [4] presents improvements to the C-value/NC-value algorithm proposed by [9], which is a hybrid (linguistic-statistical) approach to ATR. The authors create two linguistic filters for Spanish and an algorithm for the selective removal of stop-words, because many term candidates contain stop-words as well as other kinds of words. ...
Chapter
This work presents a proposal for automatically acquiring and representing knowledge about a specific domain, using inference to generate new facts. It also describes the main techniques for knowledge acquisition and representation, related work, and the methodology proposed for carrying out this work.
... Hybrid techniques combine two or more techniques mentioned above. The most usual case uses a linguistic approach (dictionaries and rules of term formation) and a statistical metric, a hybrid method already developed for Spanish [34]. ...
Article
Artificial Intelligence (AI), and its branch Natural Language Processing (NLP) in particular, are main contributors to recent advances in classifying documentation and extracting information from assorted fields. Medicine is one field that has gathered a lot of attention due to the amount of information generated in public professional journals and other means of communication within the medical profession. The typical information extraction task from technical texts is performed via an automatic term recognition extractor. Automatic Term Recognition (ATR) from technical texts is applied for the identification of key concepts for information retrieval and, secondarily, for machine translation. Term recognition depends on the subject domain and the lexical patterns of a given language, in our case, Spanish, Arabic and Japanese. In this article, we present the methods and techniques for creating a biomedical corpus of validated terms, with several tools for optimal exploitation of the information contained in that corpus. This paper also shows how these techniques and tools have been used in a prototype.
... Normally, they choose a linguistic approach (either dictionaries or formation rules) and a statistical metric. An algorithm has already been developed for the Spanish language [28]. ...
Article
Automatic Term Recognition from technical texts has two main applications: on the one hand, assistance to lexicographers and documentalists; on the other hand, identification of key concepts for machine translation and information retrieval. Term recognition depends on the subject domain and the lexical patterns of a given language, in our case, Spanish. In this article, we present an evaluation of three different strategies for selecting term candidates. In a second step, each list is filtered by a set of medical affixes to provide a final proposal of terms. This paper also discusses the problems of recall and precision of each strategy and shows how these techniques have been used to compile a lexicon of medical terms and an ATR system for Spanish.
... Zhang et al. (2008) showed that C-value obtains the best results compared to other measures. Besides English, C-value has also been applied to other languages such as Japanese, Serbian, Slovenian, Polish, Chinese (Ji et al., 2007), Spanish (Barrón-Cedeño et al., 2009), and Arabic. For this reason, in our first work (Lossio-Ventura et al., 2013), we modified and adapted it for French. ...
Article
In this paper, we propose the first method for automatic Vietnamese medical term discovery and extraction from clinical texts. The method combines linguistic filtering based on our defined open patterns with nested term extraction and statistical ranking using C-value. It does not require annotated corpora, external data resources, parameter settings, or term length restriction. Besides its specialisation in handling Vietnamese medical terms, another novelty is that it uses Pointwise Mutual Information to split nested terms and the disjunctive acceptance condition to extract them. Evaluated on real Vietnamese electronic medical records, it achieves a precision of about 74% and a recall of about 92%, and it proves to be stably effective with small datasets. It outperforms previous works in the same category, that is, those not using annotated corpora and external data resources. Our method and empirical evaluation analysis can lay a foundation for further research and development in Vietnamese medical term discovery and extraction.
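Pointwise Mutual Information, which this abstract names as the signal for splitting nested terms, compares how often two words occur together with how often they would co-occur if they were independent. The sketch below estimates PMI for adjacent word pairs from simple counts. The function name and the count-based estimates are illustrative assumptions; how the cited method applies PMI to decide on a split is not described here.

```python
import math
from collections import Counter

def pmi(tokens, x, y):
    """PMI of the adjacent pair (x, y), estimated from raw counts.

    PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) )
    """
    if len(tokens) < 2:
        return float("-inf")
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    p_xy = bigrams[(x, y)] / (len(tokens) - 1)
    if p_xy == 0:
        # The pair never occurs adjacently in this sample.
        return float("-inf")
    p_x = unigrams[x] / len(tokens)
    p_y = unigrams[y] / len(tokens)
    return math.log2(p_xy / (p_x * p_y))

tokens = ["a", "b", "a", "b", "c"]
score = pmi(tokens, "a", "b")
```

A high score suggests the two words form a unit, whereas a low score marks a weaker boundary where a longer candidate might plausibly be split.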
Thesis
Automatic term extraction is a task in the field of natural language processing that aims to automatically identify terminology in collections of specialised, domain-specific texts. Terminology is defined as domain-specific vocabulary and consists of both single-word terms (e.g., corpus in the field of linguistics, referring to a large collection of texts) and multi-word terms (e.g., automatic term extraction). Terminology is a crucial part of specialised communication since terms can concisely express very specific and essential information. Therefore, quickly and automatically identifying terms is useful in a wide range of contexts. Automatic term extraction can be used by language professionals to find which terms are used in a domain and how, based on a relevant corpus. It is also useful for other tasks in natural language processing, including machine translation. One of the main difficulties with term extraction, both manual and automatic, is the vague boundary between general language and terminology. When different people identify terms in the same text, it will invariably produce different results. Consequently, creating manually annotated datasets for term extraction is a costly, time- and effort-consuming task. This can hinder research on automatic term extraction, which requires gold standard data for evaluation, preferably even in multiple languages and domains, since terms are language- and domain-dependent. Moreover, supervised machine learning methodologies rely on annotated training data to automatically deduce the characteristics of terms, so this knowledge can be used to detect terms in other corpora as well. Consequently, the first part of this PhD project was dedicated to the construction and validation of a new dataset for automatic term extraction, called ACTER – Annotated Corpora for Term Extraction Research. Terms and Named Entities were manually identified with four different labels in twelve specialised corpora.
The dataset contains corpora in three languages and four domains, leading to a total of more than 100k annotations, made over almost 600k tokens. It was made publicly available during a shared task we organised, in which five international teams competed to automatically extract terms from the same test data. This illustrated how ACTER can contribute towards advancing the state-of-the-art. It also revealed that there is still a lot of room for improvement, with moderate scores even for the best teams. Therefore, the second part of this dissertation was devoted to researching how supervised machine learning techniques might contribute. The traditional, hybrid approach to automatic term extraction relies on a combination of linguistic and statistical clues to detect terms. An initial list of unique candidate terms is extracted based on linguistic information (e.g., part-of-speech patterns) and this list is filtered based on statistical metrics that use frequencies to measure whether a candidate term might be relevant. The result is a ranked list of candidate terms. HAMLET – Hybrid, Adaptable Machine Learning Approach to Extract Terminology – was developed based on this traditional approach and applies machine learning to efficiently combine more information than could be used with a rule-based approach. This makes HAMLET less susceptible to typical issues like low recall on rare terms. While domain and language have a large impact on results, robust performance was reached even without domain-specific training data, and HAMLET compared favourably to a state-of-the-art rule-based system. Building on these findings, the third and final part of the project was dedicated to investigating methodologies that are even further removed from the traditional approach. Instead of starting from an initial list of unique candidate terms, potential terms were labelled immediately in the running text, in their original context.
Two sequential labelling approaches were developed, evaluated and compared: a feature-based conditional random fields classifier, and a recurrent neural network with word embeddings. The latter outperformed the feature-based approach and was compared to HAMLET as well, obtaining comparable and even better results. In conclusion, this research resulted in an extensive, reusable dataset and three distinct new methodologies for automatic term extraction. The elaborate evaluations went beyond reporting scores and revealed the strengths and weaknesses of the different approaches. This identified challenges for future research, since some terms, especially ambiguous ones, remain problematic for all systems. However, overall, results were promising and the approaches were complementary, revealing great potential for new methodologies that combine multiple strategies.
Article
Abstract: The automatic extraction of terminology from specialised documents has been widely explored in languages such as English, French and German, but much less so in Spanish. The C-value/NC-value algorithm has been successfully implemented for the automatic extraction of biological terms from documents in English. This work describes the adaptation of the C-value stage, a linguistic and statistical method for the extraction of multiword terms in English, to the extraction of multiword terms in Spanish, in particular from the engineering domain.
Conference Paper
This paper presents a project that aims at building lexical resources for terminology. By lexical resources, we mean dictionaries that provide detailed lexico-semantic information on terms, i.e. lexical units the sense of which can be related to a special subject field. In terminology, there is a lack of such resources. The specific dictionaries we are currently developing describe basic French and Korean terms that belong to the fields of computer science and the Internet (e.g. computer, configure, user-friendly, Web, browse, spam). This paper presents the structure of the French and Korean articles: each component is examined and illustrated with examples. We then describe the corpus-based methodology and the different computer applications used for developing the articles. Our methodology comprises five steps: design of the corpora; selection of terms; sense distinction; definition of actantial structures; and listing of semantic relations. Details on the current state of each database are also given.
Article
Technical terms are important for knowledge mining, especially as vast amounts of multi-lingual documents are available over the Internet. Thus, a domain and language-independent method for term recognition is necessary to automatically recognize terms from Internet documents. The C-/NC-value method is an efficient domain-independent multi-word term recognition method which combines linguistic and statistical knowledge. Although the C-value/NC-value method is originally based on the recognition of nested terms in English, our aim is to evaluate the application of the method to other languages and to show its feasibility for multi-language environments. In this article, we describe the application of the C/NC-value method to Japanese texts. Several experiments analysing the performance of the method using the NACSIS Japanese AI-domain corpus demonstrate that the method can be utilized to realize a practical domain- and language-independent term recognition system.
Article
The Natural Language Toolkit is a suite of program modules, data sets and tutorials supporting research and teaching in computational linguistics and natural language processing. NLTK is written in Python and distributed under the GPL open source license. Over the past year the toolkit has been rewritten, simplifying many linguistic data structures and taking advantage of recent enhancements in the Python language. This paper reports on the simplified toolkit and explains how it is used in teaching NLP.
Conference Paper
Technical terms (henceforth called simply terms) are important elements for digital libraries. In this paper we present a domain-independent method for the automatic extraction of multi-word terms from machine-readable special language corpora. The method (C-value/NC-value) combines linguistic and statistical information. The first part, C-value, enhances the common statistical measure of frequency of occurrence for term extraction, making it sensitive to a particular type of multi-word terms, the nested terms. The second part, NC-value, gives: 1) a method for the extraction of term context words (words that tend to appear with terms), 2) the incorporation of information from term context words into the extraction of terms.
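The sensitivity to nested terms described in this abstract can be sketched as follows. The code follows the commonly cited form of C-value: a candidate's frequency is multiplied by log2 of its length in words, after subtracting the mean frequency of the longer candidates that contain it. The data layout (candidates as word tuples mapped to corpus frequencies) and the function names are assumptions made for illustration.

```python
import math

def is_nested(short, longer_term):
    """True if `short` occurs as a contiguous sub-sequence of a longer candidate."""
    n = len(short)
    if len(longer_term) <= n:
        return False
    return any(longer_term[i:i + n] == short
               for i in range(len(longer_term) - n + 1))

def c_values(freq):
    """Compute C-value for each candidate (tuple of words -> frequency)."""
    scores = {}
    for a, f_a in freq.items():
        # Longer candidates in which `a` appears nested.
        containers = [b for b in freq if is_nested(a, b)]
        if containers:
            # Discount by the mean frequency of the containing candidates.
            f_a = f_a - sum(freq[b] for b in containers) / len(containers)
        scores[a] = math.log2(len(a)) * f_a
    return scores

freq = {
    ("soft", "contact", "lens"): 5,
    ("contact", "lens"): 8,
}
scores = c_values(freq)
```

Here "contact lens" occurs 8 times, 5 of them inside "soft contact lens", so its score is log2(2) * (8 - 5) = 3, whereas the longer candidate keeps its full frequency. In this original form, single-word candidates always score zero because log2(1) = 0.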
Article
This article describes a method for selecting terms in a French dictionary on computing and the Internet. First, the general objectives and the foreseen content of the dictionary are presented. Then, we propose a series of lexico-semantic criteria that are designed to help terminologists validate intuitions they have on the value of lexical units from the point of view of their incorporation in a specialized dictionary. The lexico-semantic criteria are applied on a list of specific lexical units generated automatically using corpus comparison methods. We show how an automatic selection of units supported by lexico-semantic criteria helps terminographers select relevant entries and increases systematicity.
Article
This paper presents a couple of extensions to a basic Markov Model tagger (called TreeTagger) which improve its accuracy when trained on small corpora. The basic tagger was originally developed for English [Schmid, 1994]. The extensions together reduced error rates on a German test corpus by more than a third.
Conference Paper
Terminology extraction is an important task for the automatic update of domain-specific knowledge. Contextual information helps to decide whether the extracted new terms are terminology or not. As extraction based on fixed patterns is of very limited use for handling natural language text, we need both syntactic and semantic information in the context of a term to determine its termhood. In this paper, we investigate two window-based context word extraction methods that take syntactic and semantic information into account. Based on the performance of each method individually, a hybrid method which combines both syntactic and semantic information is proposed. Experiments show that the hybrid method can achieve significant improvement.
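The basic mechanics of window-based context word extraction can be sketched as below: for every occurrence of a term, the words within a fixed number of tokens on either side are collected and counted. This is a purely positional sketch with an assumed function name and a configurable `window` parameter. It does not include the syntactic or semantic filtering that the abstract describes.

```python
from collections import Counter

def context_words(tokens, term, window=2):
    """Count words within `window` tokens on either side of each `term` occurrence."""
    term = tuple(term)
    n = len(term)
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) == term:
            left = tokens[max(0, i - window):i]
            right = tokens[i + n:i + n + window]
            counts.update(left + right)
    return counts

tokens = ("the patient wears a soft contact lens every day "
          "and cleans the contact lens nightly").split()
ctx = context_words(tokens, ("contact", "lens"), window=2)
```

Making `window` a parameter allows the context size to vary per experiment. Words that recur around many known terms can then be weighted as context indicators, which is the kind of information NC-value builds on.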