Domain-independent term extraction through domain modelling
Georgeta Bordea
National University
of Ireland, Galway
Paul Buitelaar
National University
of Ireland, Galway
Tamara Polajnar
Computer Laboratory
University of Cambridge
Abstract
Extracting general or intermediate level terms is a relevant problem that has not received much attention in literature. Current approaches for term extraction rely on contrastive corpora to identify domain-specific terms, which makes them better suited for specialised terms that are rarely used outside of the domain. In this work, we propose an alternative measure of domain specificity based on term coherence with an automatically constructed domain model. Although previous systems make use of domain-independent features, their performance varies across domains, while our approach displays a more stable behaviour, with results comparable to, or better than, state-of-the-art methods.
Introduction
Term extraction plays an important role in a
wide range of applications including information
retrieval (Yang et al., 2005), keyphrase extrac-
tion (Lopez and Romary, 2010), information ex-
traction (Yangarber et al., 2000), domain ontol-
ogy construction (Kietz et al., 2000), text classi-
fication (Basili et al., 2002), and knowledge min-
ing (Mima et al., 2006). In many of these ap-
plications the specificity level of a term is a rel-
evant characteristic, but despite the large body of
work in term extraction there are few methods that
are able to identify general terms or intermediate
level terms. Take for example the following structure
from the AGROVOC vocabulary: resources →
natural resources → mineral resources → lignite,
where resources is an upper level term, natural
resources and mineral resources are intermediate
level terms, and lignite is a leaf. Intermediate
level terms are specific to a domain but are broad
enough to be usable for summarisation and clas-
sification. Methods that make use of contrastive
corpora to select domain specific terms favour the
leaves of the hierarchy, and are less sensitive to
generic terms that can be used in other domains.
Instead, we construct a domain model by iden-
tifying upper level terms from a domain corpus.
This domain model is further used to measure the
coherence of a candidate term within a domain.
The underlying assumption is that top level terms
(e.g., resource) can be used to extract intermedi-
ate level terms, in our example natural resources
and mineral resources. Our method for construct-
ing a domain model is evaluated directly through
an expert survey as well as indirectly based on its
contribution to intermediate level term extraction.
While domain modelling is tested and exemplified
with English, the ideas presented here are not lan-
guage dependent and can be applied to other lan-
guages, but this is outside the scope of this work.
We start by giving an overview of related work
in term extraction in Section 1. Then, an approach
to construct a domain model based on domain co-
herence is proposed in Section 2, followed by a
method to apply domain models for term extrac-
tion. The experimental part of the paper starts with
a direct evaluation of a domain model through a
user survey (Section 3). A first set of experiments
is carried out in a standard setting for term evaluation,
while the second set of experiments is application-
driven, using corpora annotated for keyphrase ex-
traction, information extraction, and information
retrieval. We conclude this paper in Section 4, giv-
ing a few directions for future work.
1 Related work
Methods for term extraction that use corpus statis-
tics alone are faced with the challenge of distin-
guishing general language expressions (e.g., last
week) from terminological expressions. A solu-
tion to this problem is to use contrastive corpora
(Huizhong, 1986). Several contrastive measures
are proposed including domain relevance (Park
et al., 2002), domain consensus (Velardi et al.,
2001), and word impurity (Liu et al., 2005). In
this work we propose an approach to compute do-
main specificity based on a domain model, that is
less sensitive to leaf terms and is better suited for
intermediate level terms.
The domain model proposed in this work is de-
rived from the corpus itself, without the need for
external corpora. An automatic method for iden-
tifying the upper level terms of a domain has ap-
plications beyond the task of term extraction. Al-
though not named as such, upper level terms were
previously used for text summarisation (Teufel
and Moens, 2002). The authors manually identified
a set of 37 nouns including theory, method,
prototype and algorithm, without considering a
principled approach to extract them. The work
presented here is similar to (Barrière, 2007), but
instead of re-ranking terms based on their similar-
ity to each other we make use of domain model
terms, reducing data sparsity issues.
In our experiments we employ two state-of-the-art
methods for term extraction, the NC-value approach
(Frantzi et al., 2000) and TermExtractor [2]
(Velardi et al., 2001). The former is a hybrid
method that ranks terms using only corpus statis-
tics, while the latter exploits contrastive corpora.
NC-value is based on raw frequency counts and
considers nested multi-word terms by penalising
frequency counts of shorter embedded terms. Ad-
ditionally, it incorporates context information in
a re-ranking step using top ranked terms. Con-
text words (nouns, verbs and adjectives) are identi-
fied based on their occurrence with top candidates.
Our method is an extension of this approach that
uses domain models instead of selecting context
words based on frequency alone.
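The nested-term penalty mentioned above can be illustrated with a minimal sketch of the C-value part of the method, following the formula published by Frantzi et al. (2000). The function names and the toy data layout (terms as tuples of words) are assumptions made for illustration, and the context re-ranking step of NC-value is left out.

```python
import math

def contains(longer, shorter):
    """True if the word tuple `shorter` occurs contiguously inside `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))

def c_value(freq):
    """freq maps each candidate term (a tuple of words) to its corpus frequency."""
    scores = {}
    for term, f in freq.items():
        # Longer candidates that embed this term.
        embedding = [t for t in freq if len(t) > len(term) and contains(t, term)]
        if embedding:
            # Penalise the frequency by the mean frequency of the embedding terms.
            f = f - sum(freq[t] for t in embedding) / len(embedding)
        # log2 of the length in words means single-word candidates score zero.
        scores[term] = math.log2(len(term)) * f
    return scores
```

With two candidates, "natural resources" (frequency 10) and "renewable natural resources" (frequency 4), the shorter term is scored on a reduced frequency of 6 because some of its occurrences are explained by the longer term.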
[2] TermExtractor demo: http://lcl.
TermExtractor is a popular approach that combines
different term extraction techniques including
domain relevance, domain consensus and lexical
cohesion. Domain Relevance (DR) compares
the probability of a term $t$ in a given domain $D_i$
with the maximum probability of the term in other
domains $D_j$ used for contrast, and is measured as:
$$DR_{D_i}(t) = \frac{P(t \mid D_i)}{\max_{j \neq i} P(t \mid D_j)} \tag{1}$$
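As a rough illustration of Equation 1, the sketch below estimates each probability as a relative frequency in a bag of term counts. The estimator, the function name and the handling of terms that never occur in the contrastive corpora are assumptions, not details given in the text.

```python
def domain_relevance(term, target_counts, contrast_counts_list):
    """Ratio of the term's probability in the target domain to its
    highest probability in any contrastive domain."""
    def prob(counts):
        total = sum(counts.values())
        return counts.get(term, 0) / total if total else 0.0

    p_target = prob(target_counts)
    p_contrast = max(prob(counts) for counts in contrast_counts_list)
    if p_contrast == 0.0:
        # Assumption: a term unseen in every contrastive corpus is treated
        # as maximally relevant if it occurs in the target domain at all.
        return float("inf") if p_target > 0 else 0.0
    return p_target / p_contrast
```

A word that is equally probable in the target domain and in one contrastive corpus gets a score of 1, which is how a broad term such as system ends up ranked low by this measure.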
Domain Consensus (DC) identifies terms that
have an even probability distribution across the
corpus that represents a domain of interest, and is
estimated through entropy as follows:
$$DC_{D_i}(t) = \sum_{d \in D_i} P(t \mid d) \cdot \log P(t \mid d) \tag{2}$$
where $d$ is a document in the domain $D_i$. Finally,
the degree of cohesion among the words $w_j$
that compose the term $t$ is computed through a
measure called Lexical Cohesion (LC). Let $|t|$ be
the length of $t$ in number of words, and $f(t, D_i)$
the frequency of $t$ in the domain $D_i$, then Lexical
Cohesion is defined as:
$$LC_{D_i}(t) = \frac{|t| \cdot f(t, D_i) \cdot \log f(t, D_i)}{\sum_{w_j} f(w_j, D_i)} \tag{3}$$
The weight $TE$ used for ranking terms by TermExtractor
is a linear combination of the three
methods described above:
$$TE(t, D_i) = \alpha \cdot DR + \beta \cdot DC + \gamma \cdot LC \tag{4}$$
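Equations 2 to 4 can be sketched in a few lines. Several choices here are assumptions rather than details from the text: P(t/d) is taken as the share of the term's occurrences that fall in document d, the consensus score is returned with the conventional negative sign of entropy so that an even distribution scores highest, the logarithm is natural, and the equal default weights mirror the default setting mentioned for TermExtractor.

```python
import math

def domain_consensus(term_counts_per_doc):
    """Entropy of a term's distribution over the documents of a domain.
    term_counts_per_doc lists the term's frequency in each document."""
    total = sum(term_counts_per_doc)
    if total == 0:
        return 0.0
    probs = [c / total for c in term_counts_per_doc if c > 0]
    # Assumption: negated sum, as in the usual definition of entropy.
    return -sum(p * math.log(p) for p in probs)

def lexical_cohesion(term, term_freq, word_freq):
    """|t| * f(t) * log f(t) divided by the summed frequency of its words."""
    f_t = term_freq[term]
    return len(term) * f_t * math.log(f_t) / sum(word_freq[w] for w in term)

def te_weight(dr, dc, lc, alpha=1 / 3, beta=1 / 3, gamma=1 / 3):
    """Linear combination of the three scores (Equation 4)."""
    return alpha * dr + beta * dc + gamma * lc
```

A term spread evenly over four documents reaches the maximum consensus log 4, while a term confined to a single document scores zero.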
While general terms typically have a high do-
main consensus, the domain relevance measure
boosts narrow terms that have limited usage out-
side of the domain. For example the term system is
not identified as relevant for Computer Science be-
cause it is frequently used in general language and
in other specific domains such as biology. In this work
we take a different approach to compute domain
specificity that can be applied for general terms by
using a domain coherence measure that does not
use external corpora. Two general purpose corpora,
the Open American National Corpus [3] and
a corpus of books from Project Gutenberg [4], are
used as contrastive corpora for our implementation
of TermExtractor. The books selected from
Project Gutenberg include the Bible, the complete
works of William Shakespeare, James Joyce’s
Ulysses and Tolstoy’s War and Peace. We consider
only the default setting of TermExtractor, assigning
equal weights to each measure in Equation 4.
[3] Open American National Corpus: http://www.
[4] Project Gutenberg: http://www.gutenberg.
2 Constructing a domain model based on domain coherence
We begin this section by describing an approach
for domain modelling based on domain coherence
in Section 2.1. Then, we discuss a modification
of the NC-value approach which makes it better
suited for intermediate level terms (Section 2.2).
We conclude this section by describing a novel
method for term extraction using a domain model
in Section 2.3.
2.1 Domain modelling
A domain model is represented as a vector of
words which contribute to determine the domain
of the whole corpus. Let $\Delta$ be the domain model,
and $w_1$ to $w_n$ a set of generic words, specific to
the domain, then:
$$\Delta = \{w_1, \ldots, w_n\} \tag{5}$$
The number of words $n$ can be empirically set
according to a cutoff on the associated weight. Previous
work on using domain information for word sense
disambiguation (Magnini et al., 2002) has shown
that only about 21% of the words in a text actu-
ally carry information about the prevalent domain
of the whole text, and that nouns have the most
significant contribution (79.4%). Several assump-
tions are made to identify words that are used to
construct a domain model from a domain corpus:
1. Distribution: Generic words should appear
in at least one quarter of the documents in the corpus;
2. Length: Only single-word candidates are
considered, as longer terms are more specific;
3. Content: Only content-bearing words are of
interest (i.e., nouns, verbs, adjectives);
4. Semantic Relatedness: A term is more gen-
eral if it is semantically related to many spe-
cific terms.
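The first three assumptions, which drive candidate selection, can be sketched as a simple filter. The input format (documents as lists of word and part-of-speech pairs), the tag names and the function name are assumptions for illustration; the tagger that produces the pairs is outside the sketch.

```python
# Assumed coarse tag set for content-bearing words (nouns, verbs, adjectives).
CONTENT_TAGS = {"NOUN", "VERB", "ADJ"}

def select_candidates(tagged_docs, min_doc_share=0.25):
    """tagged_docs: list of documents, each a list of (word, pos) pairs.
    Returns single content words found in at least `min_doc_share` of the documents."""
    doc_freq = {}
    for doc in tagged_docs:
        # Each token is a single word, so the length assumption holds by construction.
        seen = {word.lower() for word, pos in doc if pos in CONTENT_TAGS}
        for word in seen:
            doc_freq[word] = doc_freq.get(word, 0) + 1
    threshold = min_doc_share * len(tagged_docs)
    return {word for word, df in doc_freq.items() if df >= threshold}
```

The fourth assumption, semantic relatedness, is not applied here because it is used to filter the selected candidates in a later step.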
The distribution assumption implies that rare
terms are more specific, similar to the
frequency-based measure previously used for
measuring tag generality (Benz et al., 2011). This
might not always be the case, for example a sim-
ple search with a search engine shows that arte-
fact or silverware are more rarely used than the
term spoon, although the first two concepts are
more generic. However, in this work we are in-
terested in extracting basic-level categories as the-
orised in psychology (Hajibayova, 2013). A basic-
level category is the preferred level of naming, that
is the taxonomical level at which categories are
most cognitively efficient. A counter example can
be found for the length assumption as well, as the
longer term inorganic matter is more general than
the single word knife, but in this case we would
simply consider as a candidate the single word
matter which is more generic than the compound
term. Both length and frequency of occurrence are
proposed as general criteria for identifying basic-
level categories (Green, 2005).
The first three assumptions are used for can-
didate selection, while the fourth assumption is
used to filter the candidates. A possible solution
for building a domain model is to use a standard
termhood measure for single-word terms. Most
approaches for extracting single-word terms make
use of contrastive corpora, ranking higher specific
words that are rarely used outside of the domain.
But our domain model is further used for term
extraction, therefore it is important that we use
generic words to ensure a high recall.
We interpret coherence as semantic relatedness
to quantify the coherence of a term in a do-
main. The measure used for semantic relatedness
is Pointwise Mutual Information (PMI). First, we
extract multi-word terms using a standard term
extraction technique, then we use the top ranked
terms to filter candidate words using the following
scoring function for domain coherence:
$$s(\theta) = \sum_{\sigma \in \Omega} PMI(\theta, \sigma) = \sum_{\sigma \in \Omega} \log \frac{P(\theta, \sigma)}{P(\theta) \cdot P(\sigma)} \tag{6}$$
where $\theta$ is the domain model candidate, $\sigma$ is a top
ranked multi-word term, $\Omega$ is the set of top ranked
multi-word terms and $P(\theta, \sigma)$ is the probability
that the word $\theta$ appears in the context of the term
$\sigma$. In our implementation, the set $\Omega$ contains the
best terms extracted by our baseline term extraction
method described in Section 2.2, but any other
term extraction method can be applied in this step.
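The scoring function for domain model candidates can be sketched as follows. This is only one possible reading: probabilities are estimated as counts divided by the total number of tokens, context is a window of five tokens on either side of a term occurrence, and pairs that never co-occur contribute zero instead of an undefined logarithm. All function names are hypothetical.

```python
import math
from collections import Counter

def find_term(tokens, term):
    """Start positions where the word tuple `term` occurs in `tokens`."""
    n = len(term)
    return [i for i in range(len(tokens) - n + 1) if tuple(tokens[i:i + n]) == term]

def cooccurrence_stats(docs, words, terms, window=5):
    """Count candidate words, top terms, and words seen near each term."""
    word_count, term_count, joint = Counter(), Counter(), Counter()
    total = 0
    for tokens in docs:
        total += len(tokens)
        word_count.update(tok for tok in tokens if tok in words)
        for term in terms:
            for start in find_term(tokens, term):
                term_count[term] += 1
                end = start + len(term)
                context = tokens[max(0, start - window):start] + tokens[end:end + window]
                joint.update((tok, term) for tok in context if tok in words)
    return word_count, term_count, joint, total

def pmi(word, term, stats):
    word_count, term_count, joint, total = stats
    if joint[(word, term)] == 0:
        return 0.0  # assumption: unseen pairs add nothing to the score
    p_joint = joint[(word, term)] / total
    return math.log(p_joint / ((word_count[word] / total) * (term_count[term] / total)))

def domain_model_score(word, top_terms, stats):
    """Sum of PMI between a candidate word and every top ranked term."""
    return sum(pmi(word, term, stats) for term in top_terms)
```

Candidates would then be sorted by this score and cut off at the chosen domain model size.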
A small sample from domain models extracted using
our domain coherence method for Computer
Science, Food and Agriculture, and the Biomedical
Domain, is shown in Table 1.

Computer Science   Biomed      Food and Agriculture
development        mechanism   control
software           evidence    farm
framework          antibody    supply
information        molecule    food
system             system      forest
Table 1: Example words from domain models extracted for different domains
2.2 Baseline term extraction method
Our baseline approach for intermediate level term
extraction is frequency-based, similar to the C-
value method (Ananiadou, 1994), but we mod-
ify its ranking function. The main difference
is the way we take into consideration embedded
terms. In previous work, this information is used
to decrease frequency counts, as shorter terms are
counted both when they appear by themselves and
when they are embedded in a longer term. We ar-
gue that the number of longer terms that embed a
term can be used as a termhood measure. In our
experiments, this measure only works for embed-
ded multi-word terms, as single-word terms are
too ambiguous. The baseline scoring method $b$ is
defined as:
$$b(\tau) = |\tau| \log f(\tau) + \alpha e_{\tau} \tag{7}$$
where $\tau$ is the candidate string, $|\tau|$ is the length
of $\tau$, $f$ is its frequency in the corpus, and $e_{\tau}$ is the
number of terms that embed the candidate string
$\tau$. The parameter $\alpha$ is used to linearly combine
the embeddedness weight and is empirically set to
3.5 in our experiments.
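A minimal sketch of the baseline score in Equation 7 is given below. It assumes that the length of a candidate is its number of words, that the logarithm is natural, and that the embeddedness count is the number of distinct longer candidates containing the term; single-word candidates get no embeddedness bonus, in line with the remark above. The function names are hypothetical.

```python
import math

def embeddedness(term, candidates):
    """Number of longer candidates that contain `term` as a contiguous sub-sequence."""
    n = len(term)
    return sum(
        1
        for other in candidates
        if len(other) > n
        and any(other[i:i + n] == term for i in range(len(other) - n + 1))
    )

def baseline_score(term, freq, candidates, alpha=3.5):
    """b(tau) = |tau| * log f(tau) + alpha * e_tau, with terms as word tuples."""
    # Embeddedness is only used for multi-word candidates.
    e = embeddedness(term, candidates) if len(term) > 1 else 0
    return len(term) * math.log(freq[term]) + alpha * e
```

In contrast to the nested-term penalty of C-value, a term that is embedded in many longer candidates is rewarded here rather than discounted.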
2.3 Using domain coherence for term extraction
Although we proposed a method to build a do-
main model in Section 2.1, the question of how
to use this domain model in a termhood measure
remains unanswered. Again, the solution is to
rely on the notion of domain coherence, which
is defined in this case as the semantic relatedness
between a candidate term and the domain model
described above. The assumption is that a cor-
rect term should have a high semantic relatedness
with representative words from the domain. This
method favours more generic candidates than con-
trastive corpora approaches, therefore it is better
suited for extracting intermediate level terms.
The same measure of semantic relatedness is
used as for the domain model, the PMI measure.
The domain coherence DC of a candidate string $\tau$
is defined as follows:
$$DC(\tau) = \sum_{\theta \in \Delta} PMI(\tau, \theta) \tag{8}$$
where $\theta$ is a word from the domain model, and
$\Delta$ is the domain model constructed using Equation
6. Using generic terms to build the domain model
is crucial for ensuring a high recall as these words
are more frequently used across the corpus. In our
implementation context is defined as a window of
5 words.
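Equation 8 can be sketched on top of precomputed co-occurrence counts. The count dictionaries, the probability estimates (counts divided by a total) and the zero contribution of unseen pairs are assumptions made for this illustration, and the window-based counting itself is taken as given.

```python
import math

def pmi_from_counts(joint, term_count, word_count, total):
    """Build a pmi(term, word) function from raw counts.
    joint[(term, word)] counts how often `word` occurs within the context window of `term`."""
    def pmi(term, word):
        c = joint.get((term, word), 0)
        if c == 0:
            return 0.0  # assumption: unseen pairs add nothing
        return math.log((c / total) / ((term_count[term] / total) * (word_count[word] / total)))
    return pmi

def domain_coherence(term, domain_model, pmi):
    """DC(tau): sum of PMI between the candidate and every domain model word."""
    return sum(pmi(term, word) for word in domain_model)

def rank_by_coherence(candidates, domain_model, pmi):
    return sorted(candidates, key=lambda t: domain_coherence(t, domain_model, pmi), reverse=True)
```

A general language expression such as last week rarely appears near domain model words, so under these assumptions it receives a low coherence score and drops in the ranking.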
3 Experiments and Results
Evaluating term extraction results across domains
is a challenge, because finding domain experts
is difficult for more than one domain. An al-
ternative is to reuse datasets annotated for appli-
cations where term extraction plays an important
role, for example, keyphrase extraction or index
term assignment. Three technical domain cor-
pora are used in our experiments: Krapivin, a cor-
pus of scientific publications in Computer Science
(Krapivin et al., 2009); GENIA, a corpus of ab-
stracts from the biomedical domain (Ohta et al.,
2001); and FAO, a corpus of reports about Food
and Agriculture (Medelyan and Witten, 2008) collected
from the website of the Food and Agriculture
Organization of the United Nations [5]. The
Krapivin corpus provides author and reviewer as-
signed keyphrases for each publication. The GE-
NIA corpus is exhaustively annotated with biomed
terms, with about 35% of all noun phrases anno-
tated as biomed terms. The FAO dataset provides
index terms assigned to each document by profes-
sional indexers. It is not only the document size
that varies considerably across these three cor-
pora, but also the number of annotations assigned
to each document as can be seen in Table 2.
We evaluate our measure for building a domain
model in Computer Science, by identifying a list
of general words with the help of a domain expert
in Section 3.1. We envision two sets of experiments:
a standard term extraction evaluation
where the top ranked terms are evaluated against
the list of unique annotations provided in the evaluation
datasets (Section 3.2.1), and a second set of
experiments where each term extraction approach
is used to assign candidates to documents in combination
with a document relevance measure
(Section 3.2.2).
[5] Food and Agriculture Organization of the United Nations

Corpus     Documents   Tokens       Avg. Annotations
Krapivin   2304        22 · 10^6    5
GENIA      1999        0.5 · 10^6   37
FAO        780         28 · 10^6    8
Table 2: Corpora statistics
3.1 Intrinsic evaluation of a domain model
A domain expert was asked to investigate nouns
used in the ACM Computing Classification System [6].
The expert was provided with the list of
nouns and their frequency in the taxonomy and
was required to identify nouns that refer to generic
concepts. A set of 80 nouns was selected in this
manner, including system, information, and software.
Only one annotator was involved because of
the complexity of the task, which involves the analysis
and filtering of several hundred words. We estimate
the inter-annotator agreement by analysing a
subset of the selected words through a survey with
27 participants. A quarter of the selected words
are combined with the same number of randomly
selected rejected words and the resulting list is
sorted alphabetically. The Fleiss kappa statistic
for interrater agreement is 0.34, lying in the fair
agreement range. 80% of the words from our gold
standard domain model were selected by at least
half of the participants.
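For readers unfamiliar with the statistic, the sketch below computes Fleiss' kappa from a table of category counts per item. It is a generic implementation of the standard formula, not the script used for the survey, and the two-category layout suggested in the comments (generic versus not generic) is an assumption about how the survey answers would be encoded.

```python
def fleiss_kappa(ratings):
    """ratings[i][j] is the number of raters who put item i in category j.
    Every item must be rated by the same number of raters (at least two)."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])
    # Share of all assignments that went to each category,
    # e.g. "generic" and "not generic" in a two-category survey.
    p_cat = [sum(row[j] for row in ratings) / (n_items * n_raters) for j in range(n_cats)]
    # Observed agreement among rater pairs, per item.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    # Undefined (division by zero) when every rating falls in a single category.
    return (p_observed - p_expected) / (1 - p_expected)
```

Values close to 1 indicate near-perfect agreement, and values at or below 0 indicate agreement no better than chance.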
We compare our method (DC) with two other
benchmarks, the contrastive termhood measure
used in TermExtractor, and the frequency-based
method used by NC-value to select context words
(NCV weight). Again, context is defined as a
window of 5 words. A domain model has many
similarities with probabilistic topic modelling, al-
though it provides less structure. We compare our
approach with a popular approach to topic mod-
elling, Latent Dirichlet Allocation (LDA) (Blei et
al., 2003). We experimented with different numbers
of topics but we report only the best results
achieved for 75 topics (LDA75).
[6] ACM Computing Classification System: http://
Figure 1: Methods for extracting a domain model (x-axis: portion of ranked list, in number of candidates; curves: DC, TermExtractor, NCV weight)
The results of this experiment are shown in Fig-
ure 1, in terms of F-score. Several conclusions
can be drawn from this experiment. First, the
methods that analyse the context of top ranked
terms (i.e., our domain coherence measure, DC,
and the weight used for context words in the NC-value,
$w_{NCV}$) perform better than the contrastive
measure used in TermExtractor, with statistically
significant gains. Also, our domain coherence
method outperforms the simpler frequency-based
weight used in NC-value, although this result
is not statistically significant. As expected,
the words ranked high by TermExtractor are too
specific for a generic domain model. The topic
modelling approach identifies several words from
the gold standard, but far fewer than our approach,
and these are evenly distributed across latent topics.
These conclusions will be further investigated
across two other domains, using gold standard
terms annotated for three different applications in
Section 3.
3.2 Term extraction evaluation results
We implement and compare the baseline method
presented in Section 2.2 and the method based
on domain coherence described in Section 2.3,
against the NC-value and TermExtractor methods,
which are used as benchmarks. The same candi-
date selection method is used for all the evaluated
approaches. Candidate terms are selected through
syntactic analysis by defining a syntactic pattern
for noun phrases. To ensure the results are comparable,
the same number of context words is used
in our implementation of the NC-value approach
as the size of the domain model. Two general purpose
corpora, the Open American National Corpus [7]
and a corpus of books from Project Gutenberg [8],
are used as contrastive corpora for our implementation
of TermExtractor. We considered
only the default setting for TermExtractor, assigning
equal weights to each measure.
Figure 2: Precision for top 10k terms from the Krapivin corpus (x-axis: portion of ranked list, in %; y-axis: precision; curves: Baseline + DC, NC-value, TermExtractor)
Figure 3: Precision for top 10k terms from the FAO corpus (x-axis: portion of ranked list, in %; y-axis: precision; curves: Baseline + DC, NC-value, TermExtractor)
Figure 4: Precision for top 10k terms from the GENIA corpus (x-axis: portion of ranked list, in %; y-axis: precision; curves: Baseline + DC, NC-value, TermExtractor)
3.2.1 Standard term extraction evaluation
While keyphrases and index terms suit our
purposes well, as they are terms of an intermediate
level of specificity, meant to summarise or classify
documents, many of the terms annotated in
GENIA are too specific. We discard the annotated
terms that are mentioned in fewer than 1% of the
documents in the corpus, based on our distribution
assumption. For each of the three datasets, the top
ten thousand ranked terms were evaluated. We in-
crementally analysed portions of the ranked lists
computed using the baseline approach (Baseline),
the baseline approach linearly combined with the
domain coherence measure (Baseline+DC), and
the two benchmarks, NC-value and TermExtrac-
tor. The precision value for a portion of the list
is scaled against the overall number of candidates
considered. First, we observe that all methods
perform better on the GENIA (Figure 4) and the
Krapivin corpus (Figure 2), with the best methods
achieving a maximum precision close to 60% at
the top of the ranked list.
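The incremental analysis of ranked lists can be sketched as precision measured at growing portions of the list. The sketch uses plain precision at each cutoff, that is hits divided by the number of candidates in the portion; the exact scaling used for the figures is not spelled out in the text, so this normalisation and the function name are assumptions.

```python
def precision_at_portions(ranked_terms, gold_terms, portions=(0.2, 0.4, 0.6, 0.8, 1.0)):
    """Precision of the top p share of a ranked list, for each p in `portions`."""
    gold = set(gold_terms)
    result = {}
    for p in portions:
        # Number of candidates in this portion, at least one.
        k = max(1, round(p * len(ranked_terms)))
        hits = sum(1 for term in ranked_terms[:k] if term in gold)
        result[p] = hits / k
    return result
```

For the experiments described here, `ranked_terms` would hold the top ten thousand candidates of one method and `gold_terms` the unique annotations of the corresponding dataset.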
[7] Open American National Corpus: http://www.
[8] Project Gutenberg: http://www.gutenberg.
The Food and Agriculture use case is more challenging,
as the best method achieves a precision
of less than 20%, as can be seen in Figure 3.
Also, the contrastive corpora measure employed in
TermExtractor yields considerably worse results
on all three domains, because the extracted terms
are too specific. The baseline method, which rewards
embedded terms, outperforms the NC-value
method on the Computer Science domain, and in
the biomedical domain, but it performs slightly
worse on the Agriculture domain. The combina-
tion of our baseline method with the domain co-
herence measure (referred to as Baseline + DC in
the legend) yields the most stable behaviour, out-
performing all other measures across the three do-
mains, considerably so in the biomedical domain
(Figure 4) and at the top of the ranked list in Com-
puter Science (Figure 2). In Computer Science,
domain coherence significantly outperforms the
best performing state-of-the-art method, NC-value
(Figure 2). In Biomedicine, the improvement is
statistically significant, with a gain of 106% at the top
20% of the list (Figure 4).
3.2.2 Application-based evaluation
An important reason for developing termhood
measures is that they are needed in specific ap-
plications, for example keyphrase extraction and
index term extraction. Typically, a termhood mea-
sure is combined with different measures of docu-
ment relevance in such applications, as the candi-
dates are assigned at the document level. We make
use of the standard information retrieval measure
TF-IDF in combination with the considered term
extraction scoring functions to assign terms to
documents. The best results are obtained by using
domain coherence as a post-processing step.
In this experiment, the PostRankDC approach was
computed by re-ranking the top 30 candidates selected
using our baseline approach described in
Equation 7, based on their domain coherence.

Method          F@5     F@10    F@15    F@20
Baseline        12.24   12.81   12.14   11.32
PostRankDC      13.42   14.55   13.72   12.51
NC-value        6.77    7.32    7.18    6.75
TermExtractor   1.41    1.77    1.95    1.97
Table 3: Keyphrase extraction evaluation on the Krapivin corpus
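The document-level assignment with post-ranking can be sketched as two sorting steps. The text does not say how TF-IDF and the termhood score are combined, so the product used below is an assumption, as are the function names, the plain TF-IDF variant and the dictionary inputs; only the pool of 30 candidates comes from the description above.

```python
import math

def tf_idf(term, doc_terms, doc_freq, n_docs):
    """Raw term frequency in the document times log inverse document frequency."""
    tf = doc_terms.count(term)
    return tf * math.log(n_docs / doc_freq[term]) if tf else 0.0

def assign_terms(doc_terms, termhood, doc_freq, n_docs, coherence, pool=30, k=5):
    """Pick k terms for one document.
    doc_terms: candidate occurrences in the document (with repeats)."""
    candidates = set(doc_terms)
    # Step 1: rank by termhood combined with document relevance (assumed product).
    scored = sorted(
        candidates,
        key=lambda t: termhood[t] * tf_idf(t, doc_terms, doc_freq, n_docs),
        reverse=True,
    )
    # Step 2: re-rank only the top `pool` candidates by domain coherence.
    reranked = sorted(scored[:pool], key=lambda t: coherence[t], reverse=True)
    return reranked[:k]
```

Restricting the re-ranking to a small pool means that domain coherence reorders candidates already judged relevant to the document, without promoting terms that the first step ranked low.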
The application-based evaluation proposed in
this work allows us to evaluate both precision and
recall, and consequently F-score can be used as
an evaluation metric. The results for keyphrase
extraction in Computer Science are presented in
Table 3, while the results for index term extrac-
tion in the Agriculture domain are shown in Ta-
ble 4. The results for document level term extrac-
tion from the Biomed corpus appear in Table 5.
All three methods yield a higher performance on
the GENIA corpus. The results on the Agricul-
ture corpus are again the lowest, because a larger
number of candidates has to be analysed.
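The F@k figures reported in the tables can be read as the F-score of the top k terms assigned to a document against its gold annotations. The sketch below computes that score per document and then averages over documents; the averaging scheme and the function names are assumptions, since the text does not state how scores are aggregated.

```python
def f_at_k(assigned, gold, k):
    """F-score of the top k assigned terms against a document's gold annotations."""
    top = assigned[:k]
    gold = set(gold)
    hits = sum(1 for term in top if term in gold)
    if hits == 0:
        return 0.0
    precision = hits / len(top)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)

def mean_f_at_k(assignments, gold_sets, k):
    """Average F@k over documents (assumed macro-average)."""
    scores = [f_at_k(a, g, k) for a, g in zip(assignments, gold_sets)]
    return sum(scores) / len(scores)
```

Because recall is bounded by k divided by the number of gold annotations, corpora with many annotations per document and corpora with few will behave differently as k grows.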
Our Baseline method outperforms the NC-
value approach on the Krapivin corpus and on the
GENIA corpus, but not on the FAO corpus. We
can observe that the domain coherence approach
(PostRankDC) improves over our baseline approach
(Baseline) on all three domains. The improvement
is statistically significant compared to
the best state-of-the-art method in Computer Sci-
ence, NC-value. NC-value outperforms TermEx-
tractor in Computer Science and Agriculture, but
TermExtractor performs better in Biomedicine.
Although both NC-value and TermExtractor make
use of domain-independent features for ranking,
their performance varies across domains and ap-
plications. At the same time, combining our
domain coherence approach (PostRankDC) with
our baseline method in a post-ranking step dis-
plays a more stable behaviour, achieving the best
performance on the Computer Science domain
(Krapivin) and results similar to those of
the best method in Biomedicine (GENIA) and
Agriculture (FAO).
Method          F@5     F@10    F@15    F@20
Baseline        3.17    3.76    4.03    4.20
PostRankDC      5       5.8     5.62    5.29
NC-value        4.65    5.88    6.09    5.94
TermExtractor   0.2     0.31    0.34    0.35
Table 4: Index term evaluation on the FAO corpus

Method          F@5     F@10    F@15    F@20
Baseline        9.67    15.71   20.17   23.19
PostRankDC      11.36   17.63   21.52   23.55
NC-value        7.79    11.97   14.01   14.6
TermExtractor   10.77   17.75   22.14   24.63
Table 5: Term extraction at the document level on the GENIA corpus

4 Conclusions
In this study, we proposed an approach to identify
intermediate level terms through domain modelling
and a novel domain coherence measure, arguing
that approaches that make use of contrastive
corpora are only suitable for updating existing terminology
resources with more specific terms and
not for summarisation or classification tasks. The
contributions described in this work are three-fold:
i) A method for extracting top level terms from a domain corpus
ii) A novel domain coherence metric based on semantic relatedness with a domain model
iii) A novel application-based evaluation for term extraction systems
Experiments discussed in this paper show that
term extraction performance depends on the do-
main, although systems make use of domain-
independent features. Our domain coherence ap-
proach based on a domain model performs well
across domains, while the performance of the
NC-value and TermExtractor benchmarks is more
domain-dependent. The results lead to the conclu-
sion that using a domain model is more appropri-
ate than using statistical approaches based on con-
trastive corpora, for extracting intermediate level
terms. Future work will include an unsupervised
learning-to-rank approach for term extraction, that
will allow a more principled integration of domain
coherence measures with standard term extraction
features. The method proposed here can be used
as a specificity measure, and we are currently investigating
this in the context of constructing generalisation
hierarchies of concepts.
Acknowledgements
This work has been funded in part by the European
Union under Grant No. 258191 for the PROMISE
project, as well as by a research grant from Sci-
ence Foundation Ireland (SFI) under Grant Num-
ber SFI/12/RC/2289.
References
Sophia Ananiadou. 1994. A methodology for automatic
term recognition. In Proceedings of the
15th International Conference on Computational
Linguistics (COLING ’94), pages 1034–1038, Kyoto,
Caroline Barrière. 2007. Une perspective interactive à
l’extraction de termes. In 7ème Conférence “Terminologie
et intelligence artificielle”, pages 95–104.
Roberto Basili, Alessandro Moschitti, and
Maria Teresa Pazienza. 2002. Empirical in-
vestigation of fast text classification over linguistic
features. In Frank van Harmelen, editor, ECAI,
pages 485–489. IOS Press.
Dominik Benz, Christian Körner, Andreas Hotho, Gerd
Stumme, and Markus Strohmaier. 2011. One tag to
bind them all: Measuring term abstractness in social
metadata. In Grigoris Antoniou, Marko Grobelnik,
Elena Simperl, Bijan Parsia, Dimitris Plexousakis,
Pieter Leenheer, and Jeff Pan, editors, The Semanic
Web: Research and Applications, volume 6644 of
Lecture Notes in Computer Science, pages 360–374.
Springer Berlin Heidelberg.
David M. Blei, Andrew Y. Ng, and Michael I. Jordan.
2003. Latent dirichlet allocation. J. Mach. Learn.
Res., 3:993–1022, March.
Katerina Frantzi, Sophia Ananiadou, and Hideki
Mima. 2000. Automatic recognition of multi-word
terms : the C-value / NC-value method. Journal on
Digital Libraries, Natural language processing for
digital libraries, 3 (2):115–130.
Rebecca Green. 2005. Vocabulary alignment via basic
level concepts. In Final Report, 2003 OCLC/ALISE
Library and Information Science Research Grant
Project, Dublin, OH: OCLC.
Lala Hajibayova. 2013. Basic-level categories: A re-
view. Journal of Information Science.
Y Huizhong. 1986. A new technique for identify-
ing scientific/technical terms and describing science
texts. Lit. Linguist. Comput., 1:93–103, April.
Jörg-Uwe Kietz, Raphael Volz, and Alexander Maed-
che. 2000. Extracting a domain-specific ontology
from a corporate intranet. In Proceedings of the
2nd workshop on Learning language in logic and
the 4th conference on Computational natural lan-
guage learning - Volume 7, ConLL ’00, pages 167–
175, Stroudsburg, PA, USA. Association for Com-
putational Linguistics.
Mikalai Krapivin, Aliaksandr Autayeu, and Maurizio
Marchese. 2009. Large dataset for keyphrases ex-
traction. In Technical Report DISI-09-055, DISI.
University of Trento, Italy.
Tao Liu, X Wang, Guan Yi, Zhi-Ming Xu, and Qiang
Wang, 2005. Domain-Specific Term Extraction and
Its Application in Text Classification, volume 1481,
pages 1481–1484.
Patrice Lopez and Laurent Romary. 2010. HUMB
: Automatic Key Term Extraction from Scientific
Articles in GROBID. In Proceedings of the ACL
2010 Workshop on Evaluation Exercises on Seman-
tic Evaluation (SemEval 2010), number July, pages
Bernardo Magnini, Giovanni Pezzulo, and Alfio
Gliozzo. 2002. The role of domain information in
word sense disambiguation. Natural Language En-
gineering, 8:359–373.
Olena Medelyan and Ian H. Witten. 2008. Domain in-
dependent automatic keyphrase indexing with small
training sets. J. Am. Soc. Information Science and
Hideki Mima, Sophia Ananiadou, and Katsumori Mat-
sushima. 2006. Terminology-based knowledge
mining for new knowledge discovery. ACM Trans.
Asian Lang. Inf. Process., 5(1):74–88.
Tomoko Ohta, Yuka Tateisi, Jin-Dong Kim, Sang-Zoo
Lee, and Jun’ichi Tsujii. 2001. Genia corpus: A
semantically annotated corpus in molecular biology
domain. In Proceedings of the ninth International
Conference on Intelligent Systems for Molecular Bi-
ology (ISMB 2001) poster session, page 68, July.
Youngja Park, Roy J. Byrd, and Branimir Boguraev.
2002. Automatic glossary extraction: Beyond ter-
minology identification. In 19th International Con-
ference on Computational Linguistics - COLING 02,
Taipei, Taiwan, August-September. Howard Interna-
tional House and Academia Sinica.
Simone Teufel and Marc Moens. 2002. Summariz-
ing scientific articles - experiments with relevance
and rhetorical status. Computational Linguistics,
Paola Velardi, Michele Missikoff, and Roberto Basili.
2001. Identification of relevant terms to support the
construction of domain ontologies. In Proceedings
of the ACL 2001 Workshop on Human Language
Technology and Knowledge Management, Toulouse,
Lingpeng Yang, Dong-Hong Ji, Guodong Zhou, and
Nie Yu. 2005. Improving retrieval effectiveness
by using key terms in top retrieved documents. In
David E. Losada and Juan M. Fernández-Luna, editors,
ECIR, volume 3408 of Lecture Notes in Computer
Science, pages 169–184. Springer.
Roman Yangarber, Ralph Grishman, Pasi Tapanainen,
and Silja Huttunen. 2000. Automatic acquisition
of domain knowledge for information extraction. In
Proceedings of the 18th conference on Computa-
tional linguistics - Volume 2, COLING ’00, pages
940–946, Stroudsburg, PA, USA. Association for
Computational Linguistics.
... Description of features per group and subgroup
1. Shape features (SHAP)
length: number of characters & number of tokens
alphanumeric: whether the CT is alphabetic, numeric, alphanumeric, etc. & the number of digits and non-alphabetic characters
capitalisation: out of all occurrences of the CT, how often (%) is it all lowercase, all uppercase, title case, etc.
NER: whether the CT was tagged (completely, partially, etc.) as a Named Entity during preprocessing
chunk: which chunk tag(s) were assigned to the CT in preprocessing
stopword: whether the CT contains a stopword or is a stopword
*3. Frequency features (FREQ)
metrics to calculate termhood/unithood without comparing to a reference corpus: C-Value (Barrón-Cedeño et al. 2009), TF-IDF (Astrakhantsev, Fedorenko, and Turdakov 2015), Lexical Cohesion and Basic (Bordea, Buitelaar, and Polajnar 2013)
metrics to calculate termhood/unithood by comparing frequencies to a reference corpus: Domain Pertinence (Meijer, Frasincar, and Hogenboom 2014), Domain Relevance (Bordea, Buitelaar, and Polajnar 2013), Weirdness ...
Automatic term extraction (ATE) is an important task within natural language processing, both separately, and as a preprocessing step for other tasks. In recent years, research has moved far beyond the traditional hybrid approach where candidate terms are extracted based on part-of-speech patterns and filtered and sorted with statistical termhood and unithood measures. While there has been an explosion of different types of features and algorithms, including machine learning methodologies, some of the fundamental problems remain unsolved, such as the ambiguous nature of the concept "term". This has been a hurdle in the creation of data for ATE, meaning that datasets for both training and testing are scarce, and system evaluations are often limited and rarely cover multiple languages and domains. The ACTER Annotated Corpora for Term Extraction Research contain manual term annotations in four domains and three languages and have been used to investigate a supervised machine learning approach for ATE, using a binary random forest classifier with multiple types of features. The resulting system (HAMLET: Hybrid Adaptable Machine Learning approach to Extract Terminology) provides detailed insights into its strengths and weaknesses. It highlights a certain unpredictability as an important drawback of machine learning methodologies, but also shows how the system appears to have learnt a robust definition of terms, producing results that are state-of-the-art, and contain few errors that are not (part of) terms in any way. Both the amount and the relevance of the training data have a substantial effect on results, and by varying the training data, it appears to be possible to adapt the system to various desired outputs, e.g., different types of terms. While certain issues remain difficult, such as the extraction of rare terms and multiword terms, this study shows how supervised machine learning is a promising methodology for ATE.
... First we extract the terms that are most relevant to the domain, a task referred to as automatic term recognition (ATR). Current approaches to this task have employed a varied suite of methods for extracting terms from text based on parts of speech and metrics for assessing 'termhood' [15,29], domain modelling [11], and the composition of multiple metrics in an unsupervised manner [5]. More recently, these methods have been combined into off-the-shelf tools such as ATR4S [7] and JATE [31], and our system is a similar implementation to ATR4S. ...
...
- Frequency of occurrences: scoring functions that consider only frequencies of candidate terms within the corpus and/or frequency of words occurring within candidate terms (TF-IDF, Residual IDF [13], C Value [4], ComboBasic [6]).
- Context of occurrences: scoring functions that follow the distributional hypothesis [18] to distinguish terms from non-terms by considering the distribution of words in their contexts (PostRankDC [11]).
- Reference corpora: scoring functions based on the assumption that terms can be distinguished from other words and collocations by comparing occurrence statistics in the dataset against statistics from a reference corpus, usually of general language/non specific domain (Weirdness [2], Relevance [24]).
...
Customer service agents play an important role in bridging the gap between customers’ vocabulary and business terms. In a scenario where organisations are moving into semi-automatic customer service, semantic technologies with capacity to bridge this gap become a necessity. In this paper we explore the use of automatic taxonomy extraction from text as a means to reconstruct a customer-agent taxonomic vocabulary. We evaluate our proposed solution in an industry use case scenario in the financial domain and show that our approaches for automated term extraction and using in-domain training for taxonomy construction can improve the quality of automatically constructed taxonomic knowledge bases.
... As future work we propose to extend the ontology structure by allowing the same word to appear for multiple concepts by conditioning its presence on a certain meaning. In addition, we plan to use domain modelling [5] that complements well our contrastive corpus-based solution by providing domain-specific terms that are more generic than the current ones. Furthermore, the clustering of terms can be improved. ...
For aspect-based sentiment analysis (ABSA), hybrid models combining ontology reasoning and machine learning approaches have achieved state-of-the-art results. In this paper, we introduce WEB-SOBA: a methodology to build a domain sentiment ontology in a semi-automatic manner from a domain-specific corpus using word embeddings. We evaluate the performance of a resulting ontology with a state-of-the-art hybrid ABSA framework, HAABSA, on the SemEval-2016 restaurant dataset. The performance is compared to a manually constructed ontology, and two other recent semi-automatically built ontologies. We show that WEB-SOBA is able to produce an ontology that achieves higher accuracy whilst requiring less than half of user time, compared to the previous approaches.
... Term Extraction. In the topic extraction phase, intermediate level terms of the domain are sought (as defined in [4]). It involves two approaches: one looking for domain model words in the context of the candidate terms (within a defined span size), and the second using the domain model as a base to measure the lexical coherence of terms by PMI calculation. ...
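The second approach mentioned in this excerpt, scoring a candidate by its PMI with domain model words, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `pmi_coherence`, the single-token candidate, the token-window definition of co-occurrence, and the choice to average over the domain model (with never-co-occurring pairs contributing zero) are all hypothetical simplifications.

```python
import math
from collections import Counter


def pmi_coherence(term, domain_model, sentences, window=5):
    """Average PMI between `term` and the words of `domain_model`.

    Hypothetical helper for illustration: `term` is a single token,
    `sentences` is a list of token lists, and a co-occurrence means a
    domain-model word appears within `window` tokens of `term`.
    """
    model = set(domain_model)
    if not model:
        return 0.0
    total_tokens = 0
    term_count = 0
    word_counts = Counter()   # occurrences of each domain-model word
    joint_counts = Counter()  # term occurrences with the word in context

    for tokens in sentences:
        for i, tok in enumerate(tokens):
            total_tokens += 1
            if tok in model:
                word_counts[tok] += 1
            if tok == term:
                term_count += 1
                context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
                for word in model.intersection(context):
                    joint_counts[word] += 1

    if term_count == 0:
        return 0.0

    score = 0.0
    for word in model:
        joint = joint_counts[word]
        if joint == 0:
            continue  # assumption: pairs that never co-occur contribute 0
        # PMI = log( P(term, word) / (P(term) * P(word)) )
        score += math.log(joint * total_tokens / (term_count * word_counts[word]))
    return score / len(model)
```

Candidates would then be ranked by this score, so that terms appearing near many domain model words rank above terms that rarely share a context with them. Multi-word candidates would need an n-gram match in place of the single-token comparison.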
A key challenge in the legal domain is the adaptation and representation of the legal knowledge expressed through texts, in order for legal practitioners and researchers to access this information more easily and faster to help with compliance related issues. One way to approach this goal is in the form of a taxonomy of legal concepts. While this task usually requires a manual construction of terms and their relations by domain experts, this paper describes a methodology to automatically generate a taxonomy of legal noun concepts. We apply and compare two approaches on a corpus consisting of statutory instruments for UK, Wales, Scotland and Northern Ireland laws.
... Some of the important domains are as follows: document classification & clustering [58], digital libraries management [5], query expansion [49], text summarization [24], Machine translation and question answering [42], automated sentiment analysis [31], web page and document retrieval [25], patent analysis [18] and image tag refinement [52,53]. Providing keywords for these large-sized text collections thus allows users to grab the essence of the lengthy contents quickly and helps to locate information with high efficiency. ...
The internet changed the way that people communicate, and this has led to a vast amount of text that is available in electronic format. It includes things like e-mail, technical and scientific reports, tweets, physician notes and military field reports. Providing key-phrases for these extensive text collections thus allows users to grab the essence of the lengthy contents quickly and helps to locate information with high efficiency. While designing a Keyword Extraction and Indexing system, it is essential to pick unique properties, called features. In this article, we propose different unsupervised keyword extraction approaches, which are independent of the structure, size and domain of the documents. The proposed method relies on a novel and cognitively inspired set of standard, phrase, word embedding and external knowledge source features. The individual and selected feature results are reported through experimentation on four different datasets, viz. SemEval, KDD, Inspec, and DUC. The selected (feature selection) and word embedding based features are the best feature sets to use for keyword extraction and indexing across all the mentioned datasets. That is, the proposed distributed word vector with additional knowledge improves the results significantly over the use of individual features, combined features after feature selection, and the state of the art. After achieving the objective of developing various keyphrase extraction methods, we also applied them to a document classification task.
... [30]) measuring occurrences frequencies (including word association), assessing occurrences contexts, using reference corpora, e.g. Wikipedia (PU-ATR measure) [36], domain and topic modelling [37,38]. ...
Assessing the completeness of a document collection, regarding terminological coverage of a domain of interest, is a complicated task that requires substantial computational resource and human effort. Automated term extraction (ATE) is an important step within this task in our OntoElect approach. It outputs the bags of terms extracted from incrementally enlarged partial document collections for measuring terminological saturation. Saturation is measured iteratively, using our \( thd \) measure of terminological distance between the two bags of terms. The bags of retained significant terms \( T_{i} \) and \( T_{i + 1} \) extracted at i-th and i + 1-st iterations are compared \( (thd(T_{i} ,T_{i + 1} )) \) until it is detected that \( thd \) went below the individual term significance threshold. The flaw of our conventional approach is that the sequence of input datasets is built by adding an increment of several documents to the previous dataset. Hence, the major part of the documents undergoes term extraction repeatedly, which is counter-productive. In this paper, we propose and prove the validity of the optimized pipeline based on the modified C-value method. It processes the disjoint partitions of a collection but not the incrementally enlarged datasets. It computes partial C-values and then merges these in the resulting bags of terms. We prove that the results of extraction are statistically the same for the conventional and optimized pipelines. We support this formal result by evaluation experiments to prove document collection and domain independence. By comparing the run times, we prove the efficiency of the optimized pipeline. We also prove experimentally that the optimized pipeline effectively scales up to process document collections of industrial size.
... [12]) measuring occurrences frequencies (including word association), assessing occurrences contexts, using reference corpora, e.g. Wikipedia [18], topic modelling [19,20]. ...
Assessing the completeness of a document collection, within a domain of interest, is a complicated task that requires substantial effort. Even if an automated technique is used, for example, terminology saturation measurement based on automated term extraction, run times grow quite quickly with the size of the input text. In this paper, we address this issue and propose an optimized approach based on partitioning the collection of documents in disjoint constituents and computing the required term candidate ranks (using the c-value method) independently with subsequent merge of the partial bags of extracted terms. It is proven in the paper that such an approach is formally correct – the total c-values can be represented as the sums of the partial c-values. The approach is also validated experimentally and yields encouraging results in terms of the decrease of the necessary run time and straightforward parallelization without any loss in quality.
... There are a few annotated corpora available. One of the most popular resources is the GENIA corpus (Kim et al. 2003), which has been used in multiple ATE evaluations (Zhang et al. 2018; Zhang et al. 2008; Nenadić and Ananiadou 2006; Nenadic et al. 2004; Bordea et al. 2013; Fedorenko et al. 2013). GENIA is a collection of 2000 abstracts from the MEDLINE database in the domain of biomedicine, specifically about "transcription factors in human blood cells" (Kim et al. 2003, 180). ...
Automatic term extraction is a productive field of research within natural language processing, but it still faces significant obstacles regarding datasets and evaluation, which require manual term annotation. This is an arduous task, made even more difficult by the lack of a clear distinction between terms and general language, which results in low inter-annotator agreement. There is a large need for well-documented, manually validated datasets, especially in the rising field of multilingual term extraction from comparable corpora, which presents a unique new set of challenges. In this paper, a new approach is presented for both monolingual and multilingual term annotation in comparable corpora. The detailed guidelines with different term labels, the domain- and language-independent methodology and the large volumes annotated in three different languages and four different domains make this a rich resource. The resulting datasets are not just suited for evaluation purposes but can also serve as a general source of information about terms and even as training data for supervised methods. Moreover, the gold standard for multilingual term extraction from comparable corpora contains information about term variants and translation equivalents, which allows an in-depth, nuanced evaluation.
Domain-specific terminologies play a central role in many language technology solutions. Substantial manual effort is still involved in the creation of such resources, and many of them are published in proprietary formats that cannot be easily reused in other applications. Automatic term extraction tools help alleviate this cumbersome task. However, their results are usually in the form of plain lists of terms or as unstructured data with limited linguistic information. Initiatives such as the Linguistic Linked Open Data cloud (LLOD) foster the publication of language resources in open structured formats, specifically RDF, and their linking to other resources on the Web of Data. In order to leverage the wealth of linguistic data in the LLOD and speed up the creation of linked terminological resources, we propose TermitUp, a service that generates enriched domain specific terminologies directly from corpora, and publishes them in open and structured formats. TermitUp is composed of five modules performing terminology extraction, terminology post-processing, terminology enrichment, term relation validation and RDF publication. As part of the pipeline implemented by this service, existing resources in the LLOD are linked with the resulting terminologies, contributing in this way to the population of the LLOD cloud. TermitUp has been used in the framework of European projects tackling different fields, such as the legal domain, with promising results. Different alternatives on how to model enriched terminologies are considered and good practices illustrated with examples are proposed.
The problem of extracting saturated term-sets for learning domain ontologies from professional texts, describing a subject domain of interest, appeared to be under-researched. Therefore, a broader scale systematic review of the related work has been undertaken for collecting different bits of relevant knowledge about the State-of-the-Art in various fields. These ranged from Ontology Engineering, through Ontology Learning from Texts and Information Science, to Qualitative Research in Social and Medical Sciences. The analysis of these bits of knowledge helped us better understand the research gaps in our field of study. With an aim to narrow these research gaps, a vision of the approach for terminology saturation measurement and detection was proposed. This vision allowed us to formulate the research questions that needed to be answered in order to transform this visionary approach into the method, further implement it in a software pipeline, systematically evaluate the method, and make it ready for technology transfer.
In the context of an automatic term extraction system, our study examines the impact of using the good terms indicated by a user to re-rank a list of candidate terms. To establish an inter-term relation between the chosen terms and the proposed candidates, we explore distributional similarity, which expresses the tendency of lexical units to appear together in a corpus. Distributional similarity can first serve to direct the terminologist's attention towards sub-topics by grouping strongly interrelated candidate terms. Beyond this, distributional similarity has a more global impact by increasing the precision of the candidates at the top of the candidate term list. We demonstrate this through an evaluation in the machine translation domain using a gold standard established by ten domain experts. In this experiment, the precision of subsets of top-ranked candidates increases by 3 to 11% depending on the size of these subsets.
This paper analyses selected literature on basic-level categories, explores related theories and discusses theoretical explanations of the phenomenon of basic-level categories. A substantial body of research has proposed that basic-level categories are the first categories formed during perception of the environment, the first learned by children and those most used in language. Experimental studies suggest that high-level (or superordinate) categories lack informativeness because they are represented by only a few attributes and low-level (or subordinate) categories lack cognitive economy because they are represented by too many attributes. Studies in library and information science have demonstrated the prevalence of basic-level categories in knowledge organization and representation systems such as thesauri and in image indexing and retrieval; and it has been suggested that the universality of basic-level categories could be used for building crosswalks between classificatory systems and user-centred indexing. However, while there is evidence of the pervasiveness of basic-level categories, they may actually be unstable across individuals, domains or cultures and thus unable to support broad generalizations. This paper discusses application of Heidegger’s notion of handiness as a framework for understanding the relational nature of basic-level categories.
A statistical method is proposed for domain-specific term extraction from domain comparative corpora. It takes distribution of a candidate word among domains and within a domain into account. Entropy impurity is used to measure distribution of a word among domains and within a domain. Normalization step is added into the extraction process to cope with unbalanced corpora. So it characterizes attributes of domain-specific term more precisely and more effectively than previous term extraction approaches. Domain-specific terms are applied in text classification as the feature space. Experiments show that it achieves better performance than traditional methods for feature selection.
Though the utility of domain Ontologies is now widely acknowledged in the IT (Information Technology) community, several barriers must be overcome before Ontologies become practical and useful tools. One important achievement would be to reduce the cost of identifying and manually entering several thousand-concept descriptions. This paper describes a text mining technique to aid an Ontology Engineer to identify the important concepts in a Domain Ontology.
This paper explores the role of domain information in word sense disambiguation. The underlying hypothesis is that domain labels, such as Medicine, Architecture and Sport, provide a useful way to establish semantic relations among word senses, which can be profitably used during the disambiguation process. Results obtained at the Senseval-2 initiative confirm that for a significant subset of words domain information can be used to disambiguate with a very high level of precision.
In this paper, we propose a method to improve the precision of top retrieved documents in Chinese information retrieval where the query is a short description by re-ordering retrieved documents in the initial retrieval. To re-order the documents, we firstly find out terms in query and their importance scales by making use of the information derived from top N(N<=30) retrieved documents in the initial retrieval; secondly, we re-order retrieved K(N<<K) documents by what kinds of terms of query they contain. That is, we first automatically extract key terms from top N retrieved documents, then we collect key terms that occur in query and their document frequencies in the N retrieved documents, finally we use these collected terms to re-order the initially retrieved documents. Each collected term is assigned a weight by its length and its document frequency in top N retrieved documents. Each document is re-ranked by the sum of weights of collected terms it contains. In our experiments on 42 query topics in NTCIR3 Cross Lingual Information Retrieval (CLIR) dataset, an average 17.8%-27.5% improvement can be made for top 10 documents and an average 6.6%-26.9% improvement can be made for top 100 documents at relax/rigid relevance judgment and different parameter setting.
We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
We discuss a method for using automated corpus analysis to acquire word sense information for multilingual text interpretation. Our system, SHOGUN, extracts data from news stories with broad coverage in Japanese and English. Our approach focuses on tying ...
Recent research has demonstrated how the widespread adoption of collaborative tagging systems yields emergent semantics. In recent years, much has been learned about how to harvest the data produced by taggers for engineering light-weight ontologies. For example, existing measures of tag similarity and tag relatedness have proven crucial step stones for making latent semantic relations in tagging systems explicit. However, little progress has been made on other issues, such as understanding the different levels of tag generality (or tag abstractness), which is essential for, among others, identifying hierarchical relationships between concepts. In this paper we aim to address this gap. Starting from a review of linguistic definitions of word abstractness, we first use several large-scale ontologies and taxonomies as grounded measures of word generality, including Yago, Wordnet, DMOZ and WikiTaxonomy. Then, we introduce and apply several folksonomy-based methods to measure the level of generality of given tags. We evaluate these methods by comparing them with the grounded measures. Our results suggest that the generality of tags in social tagging systems can be approximated with simple measures. Our work has implications for a number of problems related to social tagging systems, including search, tag recommendation, and the acquisition of light-weight ontologies from tagging data.