ArticlePDF Available

Abstract and Figures

For any information to be organized, taxonomy is essential. Taxonomy plays a very important role for information and content management. Also it helps in searching of content. The most common method for constructing taxonomy was the manual construction. As the information available today is huge, constructing taxonomy for such information manually was time consuming and maintenance was difficult. This paper presents an overview of various taxonomy construction techniques available for easier construction of taxonomy or generating taxonomy automatically. Also this paper describes the advantages and disadvantages of each technique used.
Content may be subject to copyright.
TAXONOMY CONSTRUCTION
TECHNIQUES – ISSUES AND
CHALLENGES
Sujatha R
School of Information Technology and Engineering
VIT University, Vellore-632014
r.sujatha@vit.ac.in
Bandaru Rama krishna Rao
School of Information Technology and Engineering
VIT University, Vellore-632014
bandaru@vit.ac.in
Abstract
For any information to be organized, taxonomy is essential. Taxonomy plays a very important role for
information and content management. Also it helps in searching of content. The most common method for
constructing taxonomy was the manual construction. As the information available today is huge, constructing
taxonomy for such information manually was time consuming and maintenance was difficult. This paper
presents an overview of various taxonomy construction techniques available for easier construction of taxonomy
or generating taxonomy automatically. Also this paper describes the advantages and disadvantages of each
technique used.
Keywords: Taxonomy, Clustering, Tags, WordNet, Similarity Measure, Semantic Analysis
I. INTRODUCTION
Taxonomy is a process of classifying content and organizing. It is an organized set of words used for
organizing information and intended for browsing. For faster information retrieval and a better classification of
knowledge, taxonomy is very much essential. The term “Taxonomy” comes from terms “Taxos”, ordering and
“nomos”, rule. [1] Taxonomy was first used as a field in biology where it was necessary for classification of
biological specimens. Example of taxonomy includes Bloom’s taxonomy, Plant taxonomy, and Animal
taxonomy etc. which have been used today for easier classification of biological specimens. Nowadays the
concept of taxonomy is being used in other areas such as “Psychology” and “Information Technology”.
Particularly in Information Technology, it is very much useful for content management and information
architecture. This has been widely used in websites for categorization of web pages or resources (audio, video,
content etc). Taxonomy is always rigid and conservative. Taxonomies also provide “serendipitous guidance”
[33] since it helps to get additional information from viewing where a topic resides in the taxonomy’s context.
Many advantages are there in using taxonomy. Some of them include easy navigation and searching. However
updating or maintaining taxonomy is very much difficult since incorporation of new resources or categories
involves more time. There are three ways of constructing taxonomy: a manual approach, a semi-automated [15]
approach, an automated approach. From an organization perspective, taxonomy construction can be classified
into three types namely buying pre-built taxonomy, building a taxonomy using several techniques and automatic
approach.[18] According to survey made by Gartner that taxonomy construction is vital and 70% of
organizations who invested do not achieve their return on investment because of lack of proper taxonomy
construction.
The following activities are Supported by Taxonomy:[21]
Searching
Re-purposing the content
Unifying language across enterprise
Future-proofing knowledge
Sujatha R et al./ Indian Journal of Computer Science and Engineering (IJCSE)
ISSN : 0976-5166
Vol. 2 No. 5 Oct-Nov 2011
661
This paper presents an overview of various approaches for construction of taxonomy. The widely used
approach is the automatic construction of Taxonomy by incorporating user generated metadata called Tags
which can be used along with various other techniques.
II. OUTLINE
In this paper, we present an overview of the techniques available for construction of Taxonomy. In section III,
a general description about constructing Taxonomy, it’s approaches and its importance. It tells us about the
manual construction of taxonomy with its pros and cons. In section IV – VIII, a list of available taxonomy
construction techniques has been discussed with its issues. We present an overview of each technique and list
both advantages and disadvantages of each technique. Nowadays the construction of taxonomy is enhanced due
to the use of tags. Since tags are used by people for sharing content, constructing taxonomy for classifying
content based on tags is presented in section VIII of the paper. Also it discusses the problems related to tagging
of content/ resources. This paper also discusses other approaches that can be used to construct taxonomy in the
final section.
III. CONSTRUCTING TAXONOMY
Reference [1] shows that construction of taxonomy is limited to a particular domain. For example, taxonomy
for a domain “Sports” can be constructed by specifying the categories “Football”, “Cricket”, “Hockey” etc
under “Sports”. For extraction of categories and terms that can be used for each category, careful detailed
analysis and study should be performed and this is defined by the domain experts. [31] After a thorough
analysis, the categories and content in each category are represented in an organizational structure. [2] As
mentioned above taxonomies built using existing taxonomy templates (pre-built taxonomy) from vendors can
speed up the construction of taxonomy and help an enterprise deliver quick results. Existing taxonomies can be
optimized for the organization’s specific requirements. However pre-built taxonomies have some disadvantages
since it has less applicability and also time spent on user training.
An in-house constructed taxonomy is more particular to an organization and its intention. The selection of
terminology in taxonomy is fully controlled by the developer. Sometimes it is only possible to construct an in-
house taxonomy since existing taxonomies may not exist for a particular domain. The only disadvantage for
constructing taxonomy is time consumption and also expensive.
Irrespective of whatever approach used to construct taxonomy, there are four phases in general for taxonomy
construction:
Planning and Analysis: Detailed study needs to be done by the domain experts to identify the
categories, resources to be allocated, cost involved in the construction.
Design, Development and Testing: Detailed design of hierarchical structure is done by the software
development team.
Implementation: In this paper, various approaches of implementing taxonomy are discussed.
Maintenance: Maintenance of taxonomy is a taxing job and time consuming for manual construction as
mentioned above. However maintenance can be simpler if automatic construction approaches is used.
For constructing taxonomy, two techniques are widely used: Top-Down approach and Bottom-Up approach
[17]
The top-down approach involves selection of few numbers of higher categories reaching more specific
levels of lower subcategories based on the context. Usually taxonomy is developed manually and it
provides control over the concepts present in higher taxonomy levels.
The bottom-up approach involves selection of specific levels of categories and reaching the higher
categories. To extract concepts from content and to make generalizations the automatic techniques are
used in this approach.
The above two approaches have both advantages and disadvantages still vital for taxonomy construction.
Sujatha R et al./ Indian Journal of Computer Science and Engineering (IJCSE)
ISSN : 0976-5166
Vol. 2 No. 5 Oct-Nov 2011
662
A. Manual Approach
Usually the most common method of constructing taxonomy is the manual method. This method has been by
the domain experts who are experienced in a particular domain can construct taxonomy. It provides major
control over the synonyms and order of concepts. The choice of terminology is left to domain experts for using
in taxonomy. Because of human judgment, manual classification of documents to the concepts in taxonomy is
less accurate. Due to this misunderstanding of the terminology is possible for an end user who wants to view a
particular resource of a domain. Also maintenance of taxonomy using such approach is a time consuming task.
Nowadays it is very rare to construct taxonomy using manual approach. [1]
Advantages: Human decision, High precision, Disambiguation.
Disadvantages: Labor exhaustive, Unable to scale, Costly resources
IV. AUTOMATIC APPROACH
In recent years research is being out for generating taxonomy using various techniques. Some of the
approaches used for automatic Taxonomy generation include:
Using WordNet (Lexical Database Dictionary) and NLP (Natural Language Processing) techniques.
Using large text corpus.
Clustering algorithms.
Using the combination of tags (Annotations/Keywords) and Wikipedia to generate taxonomy.
The above approaches can be used in any combination for enhancing the construction of taxonomy. Also the
above approaches are used at lexical and semantic level where the concepts of taxonomy are extracted and
semantic relationships are used to construct the taxonomy.
Several automatic classification tools are available for classifying the content for a prevailing taxonomy or to
generate taxonomy structure. Various algorithms (Statistical Analysis, Bayesian Probability, Clustering) [32] are
applied to tools that create taxonomy structure to a set of documents using bottom-up strategy since this strategy
involves incorporating automatic techniques. However automatic construction provides least control over the
synonyms and order of concepts. Also refinement of the concepts is required for the user to understand. It can
save time however human judgment will be there to check if the concept should be there in taxonomy or not.
Advantages: Handles large volumes, Measures easily, Cheap resources
Disadvantages: Rule/ algorithm weakness, Inaccuracies, Not easy to train
A. Natural Language Processing: Definition and Areas
A Natural Language refers to language spoken by people and the applications that deal with the natural
language are called Natural Language Processing*. NLP is an area of research carried out by software giants like
Microsoft, Google, and Yahoo etc. From Fig.1 the NLP comes under Artificial Intelligence which is again a
category of areas under “Computers”. NLP is carried out at various levels namely linguistic level, syntactic
level, semantic level, information retrieval and extraction and machine translation. However there are both
advantages and issues at all levels. NLP is being used for various applications such as classifying text into
categories, index and search large texts, automatic translation, speech understanding, information extraction,
knowledge acquisition and text generations.
Sujatha R et al./ Indian Journal of Computer Science and Engineering (IJCSE)
ISSN : 0976-5166
Vol. 2 No. 5 Oct-Nov 2011
663
Fig.1. Computer Science Taxonomy
B. WordNet-Lexical Database for English
For NLP (Natural Language Processing) based applications, a lexical database dictionary WordNet is
primarily used. The most commonly used WordNet is the English WordNet.It groups English words into set of
synonyms called synsets, provide short, general definitions and records various semantic relations between these
synonym sets. Using WordNet3 (Princeton) it is possible to generate a hierarchical structure by defining IS-A
relationship between nouns and verbs.[34] However WordNet is also used in other languages and can also be
constructed for other languages using the English WordNet which is used as a skeleton structure [3]. Reference
[3] shows the construction of WordNets for Spanish and Catalan languages. Using WordNet one can generate
the parts of speech form of a particular word and also finds the similarity between two given words in a
dictionary. The WordNet also uses morphology functions to generate the root form of a word. A lot of
advantages and limitations for using WordNet are described in 2.
V. CONSTRUCTING TAXONOMY USING WordNet
Several approaches where WordNet plays an important role have been used. This section describes some of
the approaches that use WordNet for taxonomy construction. A semi supervised approach is used to construct
taxonomy from scratch using the web hyponym-hypernym pairs [6]. It automatically learns from hyponym-
hypernym using the root concept, a basic level concept and recursive surface patterns. This approach is very
much useful for reconstructing WordNet taxonomy. Another approach is to build noun hierarchy of WordNet
automatically from a text corpus [7]. Calculates the cosine of the angle between two vectors of the constructed
nouns set, as ||||/,),cos( wvwvwv
Also, similarity between two nouns can be calculated as
wv BsizeAsizewvBASim ,)()(/),cos(),(
where v ranges over all vectors for nouns in group A, w ranges over the vectors in group B, and size(x) denotes
the number of nouns that are descendants of node x.
The most common approach of deducing taxonomic relations is following a bottom-up strategy [3]. The
following steps are performed when using a bottom-up strategy
Parsing each definition for obtaining the genus.
Performing a genus disambiguation procedure.
Building a natural classification of concepts as concept taxonomy.
There other two approaches that uses WordNet for Taxonomy construction: “MERGE” approach and
“EXPAND” approach.
In “MERGE” approach, there are two steps:
Selection of main top beginners for a semantic primitive.
Computers
Databases AI Algorithms Networking
Robotics NLP Search
Information
Retrieval Machine
Translation Language
Analysis
Semantics Parsing
Sujatha R et al./ Indian Journal of Computer Science and Engineering (IJCSE)
ISSN : 0976-5166
Vol. 2 No. 5 Oct-Nov 2011
664
Two processes are carried out which include: Attaching Diccionario General Ilustrado de la Lengua
Espanola (DGILE) senses to semantic primitives and filtering process. In the first process two steps are
carried to calculate conceptual distance and salient words are extracted in the following ways.[35]
1. First labeling: Conceptual Distance is calculated between two words with the help of WordNet
))(/1min)2,1( kcdepthwwDist
where, ck ε path(c1i,c2i) where c1i ε w1 and c2i ε w2
2. Salient words are extracted using the formula
)Pr( )|Pr(log)|Pr(
),( 2
wSCwSCw
SCwAR where, w denotes word and SC means semantic
class
In the filtering process, genus terms are removed.
Exploiting genus, construction of taxonomies for each semantic primitive.
Genus Sense Identification and Genus Sense Disambiguation are performed and a taxonomy structure
is generated.
A. Word Sense Disambiguation
The process of finding relevant documents and ignoring the irrelevant ones are carried out by an
information retrieval system. The search results are optimized by disambiguating terms and omitting the
documents that holds the terms used in incorrect sense. There are various ways word sense disambiguation can
be performed.
The problems with word sense disambiguation are Homonymy and Synonymy. Homonymy is one word
which can be used in two or more different senses. This will results in irrelevant documents based on query
being retrieved. By adding words to the query that help to get the user intended documents. One attempt at
solving this adds additional words to the query that can help disambiguate the terms used in it to the concept that
the user intended. Consider the example of the word “bat”. We cannot differentiate “cricket bats” or “flying
animals” without providing additional information. By using additional words like “cricket bats” or “flying
animals” the search request will be less ambiguous and relevant documents will be obtained. But this method
affects the precisions of an information system.
Synonymy means that more than one word refers to the same concept or sense. There could be a problem
where documents which are relevant but could not be retrieved since the query doesn’t contain specific words.
Latent Semantic Indexing can be used to address this problem.
Various approaches and methods for disambiguation can be broadly classified into supervised disambiguation
methods, Unsupervised disambiguation methods, Semi-supervised methods, Dictionary and knowledge based
method. Based on the previous knowledge about the sense of particular instance of a word the corresponding
method is used.
VI. CLASSIFICATION FROM TEXT
Generating Taxonomy from text involves extracting concept maps from texts. Concept map is a graph that
contains nodes as concepts and arc as relations. [4] A concept map of a particular domain can be constructed by
giving a text file called TextStorm. It is used to extract binary predicates from a given file. TextStorm uses
WordNet (lexical database dictionary, Princeton) to extract concepts from a particular sentence using tagging.
For example, in the following sentence “John drinks Milk”, the predicate ‘drink’ where ‘john’ and ‘Milk’ are
concepts. The concepts are extracted based on IS-A relationship that can be found out with the help of WordNet.
The relationships are extracted which result in a concept map. The predicates extracted uses “Clouds” which is a
machine based learning tool which constructs a concept hierarchy by inferring knowledge. The architecture of
the methodology provides a clear insight about generating a concept hierarchy or taxonomy. According to the
architecture, a text file (TextStorm) is parsed and each sentence is tagged with the help of WordNet. Parsing is
done with the help of augmented grammar. Without using the training data, another approach was proposed by
[2] for automatically deriving hierarchical organization of concepts from a set of documents. The approach was
based on the following principles:
Terms for the hierarchy are to be extracted from documents.
Sujatha R et al./ Indian Journal of Computer Science and Engineering (IJCSE)
ISSN : 0976-5166
Vol. 2 No. 5 Oct-Nov 2011
665
The organization of the terms is such that a parent term refers to a more general concept than a child
term.
The child term covers a related a related sub topic of the parent.
Extracting concepts from simple sentences is quite simpler. However in a complex sentence there is always
an ambiguity which can be solved by Anaphora Resolution. For example, consider the following sentence,
“Lions eat both gazelles and zebras. These are the preys”. In the following sentence the keyword “These” refers
in the previous context which can be “gazelles, zebras, lions”. Such ambiguities can be resolved by pronominal
Anaphora Resolution. However this approach has many limitations which have been discussed in [4].
Another approach proposed by [19] where semantic relationships are extracted from textual documents.
Based on co-occurrence of terms in the text relationships are discovered. In this approach, terms are selected
from related documents to represent categories and to select best subset of features; χ2 measure is taken to select
an appropriate number of features for text classification
)(*)(*)(*)( )(*
),(
2srsqqprp qrpsn
tc
where, p denotes frequency of documents in which t and c co-occur, q and r the frequency when either t or c
occurs, frequency when neither c nor t occurs is denoted by s and the total number of documents is n. Based on
the c and t independence the χ2 value will be zero or positive.
The Taxonomy can be constructed with the help of fuzzy relations and the relation between the terms can be
determined by Document Frequency (DF). For any two terms, the relation between them is called term
subsumption relation which is characterized by the following measure:
)(),( titjji
TSR subsetDDPtt
Where Dt denotes the set of documents the term t occurs and P represents the probability that Dtj is contained in
Dti and finally a fuzzy relation between two categories is determined known as Category Subsumption Relation
(CSR)
Another approach proposed by [20] where terms are extracted from set of documents after preprocessing and
terms is used to construct a conceptual hierarchy. The approach is divided into three modules:
Term Extraction Module: It is responsible for labeling every document with a set of terms derived from
document.
Term Generation Module: On the basis of relevant morpho-syntactic patterns potential candidates are
selected from the sequences of tagged lemma
Term Filtering Module: By applying statistical scoring scheme the number of candidate terms produced
from previous module is reduced and that is the goal.
After performing these modules, Taxonomy is constructed. Taxonomy constructed in this approach is semi-
automatic.
A. Anaphora Resolution: Types and Issues
Anaphora Resolution is known as pronoun resolution is the problem of resolving references to earlier or later
items in the context. It can be either in noun phrases or in verb phrases representing concepts. Noun phrases are
called referents. It is considered as a serious problem in NLP. Three types of anaphora are:[5]
Pronominal: In this general type where a referent is referred by a pronoun. For example, consider the
following sentence “Suresh is a Doctor. His friend is also a Doctor”. In the following sentence, the word
“His” refers to “Suresh”. This can be solved by an approach called “CBR (Case Based Reasoning)”.
Case Based Reasoning [10] basically extracts the syntactic and part-of-speech classification for main
elements in the two sentences of a new case. Then, it searches for a similar case that was resolved in the
past. The solution of this similar case is adapted to this new situation finding the word that has the same
syntactic function.
Definite noun phrase: The antecedent is referred by the phrase of the form “<the><noun phrase>”. For
example, consider the sentence “The relationship did not last long”. In the following sentence, the word
“relationship” refers to “the love”.
Sujatha R et al./ Indian Journal of Computer Science and Engineering (IJCSE)
ISSN : 0976-5166
Vol. 2 No. 5 Oct-Nov 2011
666
Quantifier/Ordinal: The anaphor is a quantifier such as “one” or an ordinal such as “first”. For
example, in the sentence “He started a new one” where “one” refers to “the relationship”.
There are some traditional techniques for resolution which includes:
Eliminative Constraints: An anaphor and a referent must agree in certain attributes to generate a match.
These include gender (male/female) and number (singular/plural).
Weighting Preferences: these factors are used to assign likelihood of match to the competing referents.
They include proximity, centering and syntactic/semantic parallelism.
Reference [5] describes the techniques that can be used for resolution which include:
Multi-Sentential Resolution
Attributes as Semantic clues
Misclassifying “it” as Pleonastic
Verb Phases as Referents
Generating taxonomy from text involves some serious limitations such as depending on WordNet totally and
extensive study of verb types. Also finding relationships or conceptual maps from text require so many special
cases to be resolved. However techniques which use clustering algorithms have produced some good results [8].
VII. CLUSTERING APPROACHES
An approach has been proposed by [9] which present a conceptual clustering method based on FCA (Formal
Concept Analysis) and compare with other clustering techniques. The clustering techniques are compared based
on effectiveness, efficiency and traceability of taxonomy construction.
According to [9], taxonomy generation via clustering can be categorized into two classes namely the similarity
based methods and set-theoretical approaches on the other. These two methods follow a vector-space model and
represent a word or term as a vector. The similarity based clustering algorithms are further classified into
agglomerative and divisive. So an FCA based theoretical clustering approach is compared with these similarity
based clustering algorithms and results are compared.
A. Cluster Analysis
It is also known as data segmentation is the grouping or segmenting a group of objects into clusters, so that
those within the same cluster are closely related to each other. The main aim is to find the similarity between
objects being clustered. Hierarchical clustering and partitioning clustering are the two methods for clustering:
B. Formal Concept Analysis
Formal Concept Analysis is an ethical way of automatically deriving ontology from a group of objects and
their properties. 3Rudolf Wille first introduced this analysis and later developed by Birkhoff. FCA is an
unsupervised learning method which is used to analyze relations between objects, G and their features, M. FCA
identifies from data description called formal context K, its set of features B subset M being correlated with its
set of objects A subset G.
C. Hierarchical Agglomerative Clustering
Hierarchical Agglomerative Clustering is a similarity based bottom-up clustering technique in which at the
beginning every term forms a cluster of its own. Three different strategies are used to estimate the similarity
between clusters: single-, complete- and average- linkage. Single linkage is defined as the similarity between
two clusters P and Q to equal Max pεP, qεQ Sim (p, q), considering the chosen pair between two clusters. The
other name for this is nearest neighbor technique. It’s defining feature is that the distance between groups is
defined as the distance between the closest pair of objects. D(r, s) is calculated as Min (d(i, j)) where object “i”
Sujatha R et al./ Indian Journal of Computer Science and Engineering (IJCSE)
ISSN : 0976-5166
Vol. 2 No. 5 Oct-Nov 2011
667
is in cluster “r” and object “j” in cluster “s”. Complete linkage considers the two most dissimilar terms Min pεP,
qεQ Sim (p, q). The other name for this is farthest neighbor technique. The distance between objects is calculated
as the distance between the most distant pair of objects. D(r, s) is calculated as Max (d(i, j)) where object “i” is
in cluster “r” and object “j” in cluster “s”. Finally average-linkage computes the average similarity of the terms
of two clusters. In this method the distance between clusters is the average of distances between all pairs of
objects where object is taken from each group.
The distance D(r, s) is computed as )*(
),( sr
rs
NN
T
srD
where, “Trs” is the sum of all pair wise distances between cluster “r” and cluster “s”. Nr and Ns are the sizes of
clusters “r” and “s” respectively.
To create a hierarchy of clusters grouping similar items i.e. documents the algorithm can be applied.
Clustering begins with a set of singleton clusters, each containing a single document Di, i=1, 2…N where D is
the entire set of documents and N is number of documents. The two most similar clusters over the entire set D
are merged to create a new cluster that covers both. This procedure is iterated for each of the remaining N-1
documents.
Merging of document clusters is completed until a single, all-inclusive cluster remains. At the end, a uniform,
binary hierarchy of document clusters is generated.
Fig.2 depicts the hierarchical clustering of 8 documents.
Fig.2. Document Classification using Hierarchical Clustering
The time complexity of naïve implementations of hierarchical agglomerative clustering algorithms is O (n3)
where n is the number of terms and when using single linkage its time complexity is O (n2).
D. Bi-Section K-MEANS Clustering
According to Bi-Section-K-Means – a variant of K-Means – is a good and fast divisive clustering
algorithm.[36] It frequently outperforms standard K-Means as well as agglomerative clustering techniques. The
time complexity of Bi-Section K-Means algorithm is O (nk) where n is the number of terms and k is the number
of clusters.
On comparison, the FCA based approach produces slightly better results than the other clustering
approaches. It is not only producing cluster – but also provides an intentional description for the clusters which
contribute to better understanding. On contrasting with similarity based methods, it provides higher level of
traceability. A drawback of using the FCA is that the size of the lattice becomes exponential with the size of the
context resulting in exponential time complexity compared to O (n2 log n) and O (n2) for agglomerative and Bi-
Section K-Means clustering.
Reference [9] provides a clear comparison of various clustering approaches used to generate Taxonomy based
on effectiveness, efficiency and traceability in the following table.
Sujatha R et al./ Indian Journal of Computer Science and Engineering (IJCSE)
ISSN : 0976-5166
Vol. 2 No. 5 Oct-Nov 2011
668
TABLE 1
TABLE SHOWING COMPARISON OF VARIOUS CLUSTERING APPROACHES
Effectiveness Efficiency Traceability
FCA
Good
O (2n)
Near Linear
Good
Agglomerative
clustering
Good
O (n2 log n)
(complete/Avg)
O (n2) (single)
Fair
Bi-Section K-
Means
Good
O (n2)
Weak-Fair
Another approach was proposed by [8] where a conceptual taxonomy is constructed which is a hierarchical
construction of keywords also known as Keyword Hierarchy. The hierarchical construction is performed using
Ward Hierarchical clustering algorithm guided by keyword proximity measure. This is carried out in similar
way in which PageRank determines the authority of web pages. For cluster evaluation measure Goodman-
Kruskal is used. One of the greedy, agglomerative clustering methods that record a fusion of clusters into large
clusters is Ward’s hierarchical method. PageRank is used to rank the keywords used in a cluster. This algorithm
is also used in Google Search.
Another approach proposed by [23] which uses a clustering framework called DIVA to generate Taxonomy
automatically. DIVA is a multi phase clustering algorithm which can be separated into two steps: Divisive and
Agglomerative. In first step a divisive approach is performed for a given dataset D and a cluster set Ck is
generated. In the second step, an agglomerative approach is performed for the cluster set Ck to generate a
dendrogram T
Another approach proposed by [22] where a query Taxonomy is generated by means clustering. It uses a new
clustering approach called HAC+P and it is an extension of Hierarchical Agglomerative Clustering algorithm
(HAC). To generate a cluster hierarchy it is combined with hierarchical cluster partitioning technique.
VIII. CONSTRUCTING TAXONOMY USING TAGS
Social Tagging is a present trend now. People tag a resource which can be used for better sharing and
searching. [11] Tagging helps in discovering items which are not found and helps in improving search. Several
websites are available where people tag content or resources for effective communication. Some of them include
§Flickr, **Delicious, ††Bibsonomy, ‡‡Technorati which are widely used portals for tagging. Basically tagging can
be represented as Documents, Users and Tags. [14] This is sometimes known Collaborative Tagging. Since tags
are used to describe the resources, to categorize the resources into a structured hierarchy tags play an important
role for generating Taxonomy. This section describes the approaches used to create taxonomy from user
generated tags. Also it gives an overview about the problems that can occur from user generated tags.
A. Tagging Approaches
Some of the approaches include [12, 13, 14, 15, 16, 22] are used to construct Taxonomy. Reference [12]
provides a framework to classify web pages based on social annotation. In this approach both web page and
category are described based on tags and assign the resource to the category based on cosine similarity.
Reference [13] describes a hierarchical classifier that can be used to classify documents into categories based on
the tags that are used to describe the documents. This approach requires the document to be preprocessed before
applying the document in the hierarchical classifier. Reference [14] describes about the document classification
categorized using Open Directory9. Reference [16] provides a novel approach for generating Taxonomy using
§ Flickr is online picture sharing service from Yahoo!. People tag photos to share the content. (www.flickr.com)
** Delicious is a very popular social bookmarking service from Yahoo!. People can tag bookmarks to share content.(www.delicious.com)
†† Bibsonomy is an online publication management service.(www.bibsonomy.org)
‡‡ Technorati is an online news aggregator for various domains (www.technorati.com)
Sujatha R et al./ Indian Journal of Computer Science and Engineering (IJCSE)
ISSN : 0976-5166
Vol. 2 No. 5 Oct-Nov 2011
669
tags. In this approach tags are collected from Delicious database and heuristic rule analysis is performed. Valid
documents are extracted using tags with the help of Wikipedia. Each document is parsed and concept-
relationship acquisition and inference approach is performed for generating Taxonomy.
Another approach proposed by [22] where tags are extracted from repositories and clustering techniques are
performed. Similarity between tags can be calculated by using the distance metric which depends on the factors
namely: Co-occurrence for tags and Semantic similarity for tags.
Presently research is being carried out for enhancing taxonomy construction with the help of tags and also
improving the navigation of Taxonomy.
Tagging provides an easier approach for classification of content and constructing Taxonomy. However tags
can be misused since it is user generated data. The vocabulary of tag terms may not be accurate. Also spams are
generated using tags which are being addressed as a serious issue [11].
IX. OTHER APPROACHES
Some of the approaches [24, 25, 26, 27, 28, 29, 30] are used to automatically generate Taxonomy. These
approaches are incorporated with clustering techniques. In [24] existing distance measure which is used to
calculate the similarity is modified to enhance Taxonomy construction. In [25], taxonomy can be constructed
based on frequency in which the terms occur which can be useful for generating a natural hierarchy. In [26], a
compound similarity measure between two terms is used based on neural network model. In [29], Taxonomy is
generated automatically using the Heymann algorithm. It determines the generality of terms and inserts the
terms into Taxonomy.
The combination of manual and automatic lead to technique called hybrid where lots of debate is going
on. It is having advantages like large volume + precision, human-guided rule sets and incremental learning.
Disadvantages are management challenge, extraordinary shills needed and maintenance endeavor required.
X. CONCLUSION
This paper presents an overview of different approaches used to generate Taxonomy. The approaches
discussed above can be used in any combination to construct Taxonomy. The approach which uses tags for
content classification and for construction of Taxonomy is a relatively new area where a lot of enhanced
techniques for easier construction is being researched. Also constructing Taxonomy for a domain presents a
challenging task since the Web is a heterogeneous repository of information.
REFERENCES
[1] Miquel Centelles. Taxonomies for categorization and organization in Web Sites, num. 3, 2005
[2] Mark Sanderson, Bruce Croft. Deriving concept hierarchies from text, 1998
[3] Xavier Farreres, German Rigau, Horacio Rodriguez. Using WordNet for building WordNets, 1997
[4] Ana Oliveira, Francisco Camara Pereira, Amilcar Cardoso. Automatic Reading and Learning from text, 2000
[5] Imran Q. Sayed. Issues in Anaphora Resolution, 2001
[6] Zornista Kozareva, Eduard Hovy. A Semi-Supervised Method to Learn and Construct Taxonomies using the Web, 2010. EMNLP’10
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing.
[7] Sharon A. Caraballo. Automatic construction of hypernym-labeled noun hierarchy from text, 1999. Proceedings of the 37th Annual
Meeting of the Association for Computational Linguistics on Computational Linguistics.
[8] Dino Lenco, Rosa Meo. Towards an Automatic Construction of Conceptual Taxonomies, 2008. DaWaK’08 Proceedings of the 10th
International Conference on Data Warehousing and Knowledge Discovery
[9] Philipp Cimiano, Andreas Hotho, Steffen Staab. Comparing Conceptual, Divisive and Agglomerative Clustering for Learning
Taxonomies from Text, 2004. Proceedings of the European Conference on Artificial Intelligence.
[10] Agnar Aamodt, Enric Plaza. Case-Based Reasoning: Foundational Issues, Methodological Variations and System Approaches
[11] Manish Gupta, Rui Li, Zhijun Yin, Jiawei Han. Survey on Social Tagging Techniques, 2010
[12] Sadegh Aliakbary, Hassan Abolhassani, Hossein Rahmani, Behrooz Nobakht. Web Page Classification Using Social tags, 2009.
CSE’09 Proceedings of the 2009 International Conference on Computational Science and Engineering
[13] Robert Wetzker, Tansu Alpcan, Christian Bauckhage. An Unsupervised Hierarchical Approach to Document Categorization, 2007.
WI’07 Proceedings of the ACM International Conference on Web Intelligence.
[14] Michael G. Noll, Christoph Meinel. Exploring Social Annotations for Web Document Classification, 2008. SAC’08 Proceedings of the
2008 ACM Symposium on Applied Computing.
[15] Davide Picca, Adrain Popescu. Using Wikipedia and Supersense Tagging for Semi-Automatic Complex Taxonomy Construction,
2007
[16] Eric Tsui. A Concept-Relationship Acquisition and Inference Approach for Hierarchical Taxonomy Construction, 2010
[17] Laura Ramos, Daniel Rasmus. Best Practices in Taxonomy Development and Management.
[18] Cisco, Susan L, Jackson, Wanda K. Creating order out of chaos with Taxonomies.
[19] Han-joon-kim, Sang-goo Lee. Discovering Taxonomic Relationships from Textual Documents.
[20] Ronen Feldman, Moshe Fresko, Kinar. Text Mining at the Term Level, 1998. Proceddings of 2nd European Symposium on Principles
of DM and KD
[21] SchemaLogic Whitepaper. The Business Benefits of Taxonomy, 2005.
Sujatha R et al./ Indian Journal of Computer Science and Engineering (IJCSE)
ISSN : 0976-5166
Vol. 2 No. 5 Oct-Nov 2011
670
[22] Shui-Lung Chuang, Lee-Feng Chion. Automatic Query Taxonomy Generation for Information Retrieval Applications.
[23] Tao Li, Sarabjot S. Anand. Automated Taxonomy Generation for Summarizing Multi-type Relational Datasets.
[24] Wei Lee Woon, Stuart E. Madnick. Asymmetric Information Distances for Automated Taxonomy Construction, 2007.Knowledge and
Information Systems, Volume 21, Issue1
[25] Karin Murthy, Tanveer A Faruquie, L Venkata Subramaniam. Automatically Generating Term-frequency-induced
Taxonomies.Proceedings of the ACL 2010 Conference Short Papers, 2010.
[26] Mahmood Neshati, Leila Sharif Hassanabadi. Taxonomy Construction Using Compound Similarity Measure. Lecture Notes in
Computer Science,2007, Volume 4803/2007, P 915-932.
[27] Nicola Guarino, Christopher Welty. Ontological Analysis of Taxonomic Relationships. Proceedings of the 19th International
Conference on Conceptual Computing, 2000.
[28] Rob Shearer, Ian Horrocks. Exploiting Partial Information in Taxonomy Construction. ISWC ’09. Proceedings of the 8th International
Semantic Web Conference, 2009.
[29] Andreas Henschel, Wei Lee Woon, Thomas Wachter ,Stuart Madnick. Comparison of Generality Based Algorithm Variants for
Automatic Taxonomy Generation. IIT’09. Proceedings of the 6th International Conference on Innovations in Information Technology,
2009.
[30] Kunal Punera, Suju Rajan, Joydeep Ghosh. Automatic Construction of N-ary Tree Based Taxonomies. ICDMW ’06. Proceedings of
the 6th IEEE International Conference on Data Mining Workshop.
[31] H. C. J Godfray, B. R Clark, I, J Kitching, S. J Mayo, M. J Scoble. The Web and Structure of Taxonomy, 2007
[32] Delphi Research Report. Content Classification and the Enterprise Taxonomy practice, 2004
[33] Bruno,Denise; MLS; & Richamond, Heather; CRM. The Truth about Taxonomies 2003, Information Management Journal , 37,2;
ABI/INFORM Global pp. 44.
[34] George A.Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine Miller , Introduction to WordNet : An On-Line
Lexical Database
[35] German Rigau and Horacio Rodriguez, Eneko Agirre. Building Accurate Semantic Taxonomies from Monolingual MRDs.
[36] Michael Steinbach, George Karypis, Vipin Kumar.Technical Report 2000 A comparison of Document Clustering Techniques.
Sujatha R et al./ Indian Journal of Computer Science and Engineering (IJCSE)
ISSN : 0976-5166
Vol. 2 No. 5 Oct-Nov 2011
671
... A plethora of operations in the publishing domain rely on the utilization of "semantic" taxonomies, which are hierarchical and directed structures of descriptive concepts. Those concepts are assigned to content items for facilitating their organization and discoverability [1], as well as for further improving recommendation tasks [2]. These semantic taxonomies are utilized by hosting platform providers of scientific content for satisfying the academic publishers' needs in terms of semantic classification of content items (e.g., publications, videos, and digital objects). ...
... accessed on 6 January 2022) controlled vocabulary, which may not always sufficiently cover all content aspects, or proprietary ones tailored to the publishers' needs and content, as happens in the majority of the cases (more details are provided in Section 5). These proprietary taxonomies are created by third-party vendors with the aid of taxonomists, subject matter, and domain experts [2], which is an intensive, costly, and time-consuming process [1,2]. Eventually, once the taxonomy is available then the automatic or manual classification takes place and concepts are assigned to the content. ...
... accessed on 6 January 2022) controlled vocabulary, which may not always sufficiently cover all content aspects, or proprietary ones tailored to the publishers' needs and content, as happens in the majority of the cases (more details are provided in Section 5). These proprietary taxonomies are created by third-party vendors with the aid of taxonomists, subject matter, and domain experts [2], which is an intensive, costly, and time-consuming process [1,2]. Eventually, once the taxonomy is available then the automatic or manual classification takes place and concepts are assigned to the content. ...
Article
Full-text available
The descriptive concepts of “semantic” taxonomies are assigned to content items of the publishing domain for supporting a plethora of operations, mostly regarding the organization and discoverability of the content, as well as for recommendation tasks. However, either not all publishers rely on such structures, or in many cases employ their own proprietary taxonomies, thus the content is either difficult to be retrieved by the end users or stored in publisher-specific fragmented “data-silos”, respectively. To address these issues, the modular and scalable “Dominance Metric” methodology is proposed for rating the dominance and importance of concepts in semantic taxonomies. Our proposed metric is applied both on the vast multidisciplinary Microsoft Academic Graph Fields of Study taxonomy and the MeSH controlled vocabulary in order for their enhanced and refined versions to be produced. Moreover, we describe the cleansing process of the resulting taxonomy from Microsoft’s structure by deduplicating concepts and refining the hierarchical relations towards the increase of its representation quality. Our evaluation procedure provided valuable insights by showcasing that high volume, namely the number of publications a concept is assigned to, does not necessarily imply high influence, but the latter is also affected by the structural and topological properties of the individual entities
... A taxonomy is an essential tool for fast information retrieval and classification of knowledge [1]. A taxonomy provides an efficient navigating and browsing mechanism by organizing huge volumes of data into a relatively small number of hierarchical clusters [2]. ...
... A bottom-up approach is generally applied for automatic taxonomy construction. This involves the selection of specific levels of categories to further move up to higher categories [1]. Manual taxonomy construction is a tedious process, and the resulting taxonomy is generally highly subjective [13]. ...
... Furthermore, automatic approaches have the potential to enable humans or machines to easily understand a highly focused and potentially fast changing domain [13]. Details related to the advantages and limitations of these two approaches are presented in [1]. Semi-automated approaches generally share the advantages and drawbacks of manual and automatic approaches. ...
Article
Full-text available
Taxonomies are essential tools for fast information retrieval and classification of knowledge. Many existing techniques for automatic taxonomy generation strongly depend on the specific properties of a particular domain and are consequently hard to apply to other domains. Some attempts have been made to design taxonomies for multiple domains. Unfortunately, they induce high hierarchical classification error rates for some datasets. The automatic design of a taxonomy requires the capability of measuring the similarity between classes. More precisely, the fact that two classes are near intuitively implies that some elements of one class are scattered in the neighborhood of some elements of the other class. This observation is used in this paper to propose a new generic technique for automatic taxonomy generation. A topological analysis of the neighborhood of each instance is first performed. The results of this analysis are used to initialize and train a hidden Markov model for each class. The model of a given class c captures the frequencies of the classes found in the neighborhood of the instances of c, from the most dominant class to the least dominant. The similarities between these models are finally used to derive a taxonomy. Hierarchical classification experiments realized on 20 datasets from various domains showed an average accuracy of \(97.22\%\) and a standard deviation of \(4.11\%\). Comparison results revealed that the proposed approach outperforms existing work with accuracy gains reaching \(38.62\%\) for one dataset.
... A taxonomy is defined as a structure that organizes knowledge according to the hierarchy of concepts that underlie it (Paukkeri, García-Plaza, Fresno, Unanue, & Honkela, 2012). Among the applications that have been recognized for taxonomies, we highlight their use in information management, in the organization and categorization of data, and in the search for content (Sujatha, Bandaru, & Rao, 2011). A taxonomy allows the organization of content based on the standardization of its descriptors, provided that it has a defined content and its related metadata (Engel, Pryde, & Sappington, 2010). ...
... The production of the new taxonomy is aligned with reflections on the usefulness of this type of controlled vocabulary, because of its effectiveness in organizing, managing and searching for information from tags that characterize knowledge in a field (Sujatha et al., 2011). Our proposal starts from conceptual references typical of Mathematics Education to establish the categories that organize the key terms. ...
Article
Full-text available
We present the process of developing a taxonomy of key terms for Mathematics Education. We build on the existing taxonomy of key terms that has been used in an open access document repository. Additionally, we took into account terms that have been established in encyclopedias of the discipline and the frequency of use of keywords in specialized journals that were indexed in Scopus and Web of Science. We made a review of synonymy between these terms and the terms of the existing taxonomy. We included in our proposal the terms that are relevant given their frequency of use in the journals. We removed from the existing taxonomy the terms that are little used in practice. The new taxonomy is organized in six main categories: approach, educational level, foundations of Mathematics Education, research in Mathematics Education, pedagogical notions and mathematical content. This proposal was validated in three phases by researchers, innovators in Mathematics Education, and editors of specialized journals and experts who lead associations and events in the discipline.
... Taxonomy construction approach discussion. There are mainly two kinds of taxonomy construction approaches (i.e., automatic construction and manual construction) [170,238]. Automatic construction approaches, including topic modeling, are generally applicable to datasets with large sample sizes [170,225]. In our study, the number of selected research papers is only 164, which is not sufficient to produce reliable clustering and modeling results. ...
Article
Serverless computing is an emerging cloud computing paradigm, being adopted to develop a wide range of software applications. It allows developers to focus on the application logic in the granularity of function, thereby freeing developers from tedious and error-prone infrastructure management. Meanwhile, its unique characteristic poses new challenges to the development and deployment of serverless-based applications. To tackle these challenges, enormous research efforts have been devoted. This paper provides a comprehensive literature review to characterize the current research state of serverless computing. Specifically, this paper covers 164 papers on 17 research directions of serverless computing, including performance optimization, programming framework, application migration, multi-cloud development, testing and debugging, etc. It also derives research trends, focus, and commonly-used platforms for serverless computing, as well as promising research opportunities.
... As every single concept represents a more or less generalized semantic excerpt of the entire domain, modifying the taxonomy happens rarely [7]. Removing, adding, but also modifying (label or level) a single concept would effect that the information gain changes and the entire semantic would be misleading [54]. This intentional formality of taxonomies results in inflexibility. ...
Article
Full-text available
Taxonomies are used as product categories to facilitate users navigating through an e-commerce portal with the help of hierarchically structured concepts. However, the identical taxonomy is shown to each customer regardless of the channel used. This is challenging for the customers in terms of user experience, as the screen size is rigid, and has not a flexible format like a printed catalog. Simply reducing the taxonomy as suggested in existing works is not sufficient, as it leads to semantic misrepresentation of the product domain. To overcome the inflexibility of product taxonomies, the rule-based expert system TaxoMulti is presented in this paper. The main objectives of our descriptive research are the formulation of the taxonomy over- and undersize problem in multi-channel context, before different types of flexible mediator concepts are discussed that allow overcoming these challenges. Using our novel method, marketing experts can now provide different taxonomies including the same semantics to be shown on different channels. The method is implemented using logic programming, allowing the integration of an inference engine utilizing background knowledge without changing the underlying logic of the used information (management) system. The comprehensive experiments on three public and private databases highlight the improvement when adding different types of mediator concepts for the adaption process. Compared to existing best performing works in related fields, TaxoMulti has achieved an improvement of + 26.31 % for the reduction of the taxonomy, + 60 % for the enlargement of the taxonomy, and + 21.21 % in terms of flexibility.
... Without CS, more time would be devoted to understanding how other researchers define and measure variables, and then to making these definitions appropriate to the research at hand (Bagstad, 2018;Delphi Group, 2002;Hlava, 2018;IDC, 2003;Sujatha et al., 2011;Vernau, 2005). It has been estimated that "knowledge workers," who manage large amounts of data, spend 20-35 percent of their time searching for information, with a 50 percent success rate (Bagstad, 2018). ...
Article
Ecosystem services (ES) practitioners (e.g., researchers, policy makers) have been working to better define, measure, and value the ways that nature contributes to society. Because measurement techniques follow the labeling or identification of ES, precise identification is critical. This article reviews literature and consults experts in classification science and ES to determine the expected benefits of using ES classification knowledge (classification knowledge); ecosystem services classification systems (ES-CS) and their principles. An informal analysis of the costs of transitioning from the current ad-hoc approach—based on various ES lists—to using classification knowledge was conducted. 18 benefits of using classification knowledge were found, including allowing ES to be defined more easily and precisely, easing the transfer of knowledge among studies, and avoiding the need to recreate ES identification systems. Collectively, these 18 benefits should allow for more accurate and consistent definition of ES, thereby serving to improve communication and measurement of ES. Moreover, the expected benefits of using ES-CS outweigh expected costs of the transition. Practitioners can use ES-CS in whole, or in parts, as their research or their institutions warrant. Finally, a case study was conducted that shows how ES measures can be organized using ES-CS, delivering benefits to practitioners.
... Figure 2 is a picture of the IT entrepreneurship taxonomy. This taxonomy was built based on a combination of computer science taxonomy [18][19][20][21] and the concept of https://doi.org/10.31436/iiumej.v20i2.1096 entrepreneurship [22,23]. ...
Article
Full-text available
Higher education has great potential in producing new startups in the IT (Information Technology) field. Many choices influence students to become IT- entrepreneurs. Association Rule can be used to obtain a model by analysing data so that it can be used to make a rule to the IT entrepreneurship-student model, but the association algorithm has disadvantages in handling large datasets. We propose reducing candidate itemsets using degrees of fuzzy similarity. The membership function in fuzzy sets can be used to measure the quality of rules obtained. The purpose of this study is to improve the algorithm by evaluating the similarity of candidate itemsets to get a good quality rule. This research method has 2 phases, namely (1) calculating the membership function with similarity itemset and (2) applying fuzzy mining association rule. Phase 1 has several steps, including: preparation of a transaction database, the taxonomy process, and identification of similar itemset. Phase 2 has several steps as well. The first is defining membership functions, and the last is a fuzzy mining fuzzy association rule. In this study, a questionnaire was distributed to 1225 students who were members of the IT entrepreneurship program. The results of this study were reduced into 823 itemsets and produced an IT entrepreneurship rule model. ABSTRAK: Pendidikan tinggi mempunyai potensi besar dalam menghasilkan permulaan baru dalam bidang IT. Banyak pilihan mempengaruhi pelajar bagi menjadi usahawan-IT. Kaedah Bersekutu boleh digunakan bagi mendapatkan model dengan menganalisa data supaya ianya dapat digunakan menjadi model kepada pelajar keusahawanan-IT, namun algoritma bersekutu mempunyai kelemahan dalam mengendalikan dataset yang besar. Kami mencadangkan pengurangan bilangan set item menggunakan tahapan persamaan kabur. Fungsi ahli dalam set kabur dapat digunakan bagi mengukur kualiti aturan yang diperoleh. Tujuan kajian ini adalah bagi meningkatkan algoritma dengan menilai persamaan set item calon bagi mendapatkan aturan kualiti yang baik. Kaedah penyelidikan ini mempunyai 2 peringkat, iaitu (1) mengira fungsi ahli dengan set item persamaan dan (2) menerapkan aturan perlombongan bersekutu kabur. Peringkat 1 mempunyai beberapa langkah, iaitu: urus niaga pangkalan data, proses taksonomi, identifikasi set item yang sama. Tahap 2 mempunyai beberapa langkah, iaitu: menentukan fungsi keahlian, dan akhirnya, aturan perlombongan bersekutu. Dalam kajian ini, soal selidik telah diedarkan kepada 1225 pelajar yang menjadi ahli program keusahawanan IT. Dapatan kajian menunjukkan pengurangan nombor dataset kepada 823 set item dan menghasilkan model aturan teknologi keusahawanan IT.
Chapter
Full-text available
A key challenge in the legal domain is the adaptation and representation of the legal knowledge expressed through texts, in order for legal practitioners and researchers to access this information more easily and faster to help with compliance related issues. One way to approach this goal is in the form of a taxonomy of legal concepts. While this task usually requires a manual construction of terms and their relations by domain experts, this paper describes a methodology to automatically generate a taxonomy of legal noun concepts. We apply and compare two approaches on a corpus consisting of statutory instruments for UK, Wales, Scotland and Northern Ireland laws.
Article
Engineering management has been historically focused on integrating scientific, engineering, and management know-how to contribute effectively to the functioning of organizations and industries. Today, an array of disruptive socio-technological transformations is bringing new challenges for education and research institutions engaged to advance the knowledge needed to design more performing socio-technical systems. Framed within the larger industrial engineering domain, management engineering has emerged as a new perspective to integrate technological and managerial knowledge, although a shared understanding of the field is yet to be introduced. In particular, the meaning and building blocks of management engineering can be discussed in the light of the evolving nature of engineering management. This article presents an extensive work of search and analysis of academic and practitioner evidence about management engineering, which allowed to derive a taxonomy of 467 concepts and 32 aggregating topics. The proposed framework can support academic discussion on the relevance of integrating management and engineering knowledge in the current socio-technical scenario. For practitioners, the identification of management engineering topics can be used to design global education initiatives as well as competence development and professional certification processes.
Article
Full-text available
This paper presents a method that combines a set of unsupervised algorithms in order to accurately build large taxonomies from any machine-readable dictionary (MRD). Our aim is to profit from conventional MRDs, with no explicit semantic coding. We propose a system that 1) performs fully automatic exraction of taxonomic links from MRD entries and 2) ranks the extracted relations in a way that selective manual refinement is allowed. Tested accuracy can reach around 100% depending on the degree of coverage selected, showing that taxonomy building is not limited to structured dictionaries such as LDOCE.
Article
Full-text available
Article
Full-text available
This paper proposes a novel approach to automatically dis-covering the hierarchical topic structure of a large doc-ument set without any linguistic analysis. The method is based on term subsumption relations using term co-occurrence, and attempts to discover hierarchical topic structures that are expressed by subsumption relations. De-spite its simplicity, results of experiments on well-known document collections such as the Reuters-21578 collection and Yahoo! directory data demonstrate the high quality of the resulting hierarchies.
Article
Full-text available
In this paper we propose an unsupervised ap-proach for acquiring domain related conceptual hierarchies from open-domain text. Super Sense Tagging (SST) is used to extract up-level terms and Wikipedia categories and WordNet are em-ployed to construct the rest of taxonomic hierar-chy. The result is a complete top-bottom taxon-omy for every formal context. We describe both the method we implemented and some encoruag-ing initial experimental results.
Article
Full-text available
Case-based reasoning is a recent approach to problem solving and learning that has got a lot of attention over the last few years. Originating in the US, the basic idea and underlying theories have spread to other continents, and we are now within a period of highly active research in case-based reasoning in Europe as well. This paper gives an overview of the foundational issues related to case-based reasoning, describes some of the leading methodological approaches within the field, and exemplifies the current state through pointers to some systems. Initially, a general framework is defined, to which the subsequent descriptions and discussions will refer. The framework is influenced by recent methodologies for knowledge level descriptions of intelligent systems. The methods for case retrieval, reuse, solution testing, and learning are summarized, and their actual realization is discussed in the light of a few example systems that represent different CBR approaches. We also discuss the role of case-based methods as one type of reasoning and learning method within an integrated system architecture.
Conference Paper
Full-text available
We compare a family of algorithms for the automatic generation of taxonomies by adapting the Heymann-algorithm in various ways. The core algorithm determines the generality of terms and iteratively inserts them in a growing taxonomy. Variants of the algorithm are created by altering the way and the frequency, generality of terms is calculated. We analyse the performance and the complexity of the variants combined with a systematic threshold evaluation on a set of seven manually created benchmark sets. As a result, betweenness centrality calculated on unweighted similarity graphs often performs best but requires threshold fine-tuning and is computationally more expensive than closeness centrality. Finally, we show how an entropy-based filter can lead to more precise taxonomies.
Conference Paper
Full-text available
Although many algorithms have been developed to harvest lexical resources, few organize the mined terms into taxonomies. We propose (1) a semi-supervised algorithm that uses a root concept, a basic level concept, and recursive surface patterns to learn automatically from the Web hyponym-hypernym pairs subordinated to the root; (2) a Web based concept positioning procedure to validate the learned pairs' is-a relations; and (3) a graph algorithm that derives from scratch the integrated taxonomy structure of all the terms. Comparing results with WordNet, we find that the algorithm misses some concepts and links, but also that it discovers many additional ones lacking in WordNet. We evaluate the taxonomization power of our method on reconstructing parts of the WordNet taxonomy. Experiments show that starting from scratch, the algorithm can reconstruct 62% of the WordNet taxonomy for the regions tested.
Article
Taxonomy construction is a resource-demanding, top–down, and time consuming effort. It does not always cater for the prevailing context of the captured information. This paper proposes a novel approach to automatically convert tags into a hierarchical taxonomy. Folksonomy describes the process by which many users add metadata in the form of keywords or tags to shared content. Using folksonomy as a knowledge source for nominating tags, the proposed method first converts the tags into a hierarchy. This serves to harness a core set of taxonomy terms; the generated hierarchical structure facilitates users’ information navigation behavior and permits personalizations. Newly acquired tags are then progressively integrated into a taxonomy in a largely automated way to complete the taxonomy creation process. Common taxonomy construction techniques are based on 3 main approaches: clustering, lexico-syntactic pattern matching, and automatic acquisition from machine-readable dictionaries. In contrast to these prevailing approaches, this paper proposes a taxonomy construction analysis based on heuristic rules and deep syntactic analysis. The proposed method requires only a relatively small corpus to create a preliminary taxonomy. The approach has been evaluated using an expert-defined taxonomy in the environmental protection domain and encouraging results were yielded.
Conference Paper
Social annotation via so-called collaborative tagging describes the process by which many users add metadata in the form of unstructured keywords to shared content. In this paper, we explore and study social annotations and tagging with regard to their usefulness for web document classication by an analysis of large sets of real-world data. We are interested in nding out which kinds of documents are annotated more by end users than others, how users tend to annotate these documents, and in particular how this user-generated folk- sonomy compares with a top-down taxonomy maintained by classication experts for the same set of documents. We describe what can be deduced from the results for further research and development in the areas of document classi- cation and information retrieval.