Building a Framework for Arabic Ontology Learning
Nada Ghneim, Informatics Department, HIAST, Damascus, Syria, nada.ghneim@hiast.edu.sy
Waseem Safi, Informatics Department, HIAST, Damascus, Syria, waseemm2005@hotmail.com
Moayad Al Said Ali, AI Department, IT Faculty, Damascus University, Damascus, Syria, moayad@live.com
Abstract
This paper presents ArOntoLearn, a framework for Arabic ontology learning from textual
resources. The main features of our framework are its support for the Arabic language and its
use of domain or previous knowledge in the learning process. In addition, the framework
represents the learned ontology in a Probabilistic Ontology Model (POM), which can be
translated into any knowledge representation formalism, and implements data-driven change
discovery: it updates the POM according to corpus changes only, and allows the user to trace
the evolution of the ontology with respect to the changes in the underlying corpus. Our
framework analyses Arabic textual resources and matches them against Arabic Lexico-syntactic
patterns in order to learn new concepts and relations.
Keywords: Ontologies, Ontology Learning, Knowledge Acquisition, Arabic Natural
Language Processing.
Introduction
Since ontologies provide a shared understanding of a domain of interest, they have become the
key technology of modern knowledge-based systems: natural language processing, information
retrieval and the Semantic Web. Since building an ontology for a huge amount of data is a
difficult and time-consuming task, several ontology learning frameworks have been designed
and implemented in the last decade.
The Mo’K workbench [1], for instance, basically relies on unsupervised machine learning
methods to induce concept hierarchies from text collections.
OntoLT [2] is an ontology learning plug-in for the Protégé ontology editor. It is targeted more
at end users and relies heavily on linguistic analysis: it basically makes use of the internal
structure of noun phrases to derive ontology knowledge from texts.
The OntoLearn [3] framework by Velardi et al. focuses on the word sense disambiguation
problem and presents a novel algorithm called SSI that relies on the structure of a general
ontology for this purpose.
The TextToOnto [4] framework implements a variety of algorithms for diverse ontology
learning subtasks. In particular, it implements diverse relevance measures for term extraction,
different algorithms for taxonomy construction, as well as techniques for learning relations
between concepts.
Text2Onto [5] is a new version of TextToOnto. It introduced the Probabilistic Ontology Model,
in which instantiated primitives are provided with probabilities, and it relies on data-driven
change discovery and NLP techniques.
None of the mentioned frameworks supports the Arabic language, and none uses domain or
previous knowledge such as an assistant ontology. In this paper, we present our framework,
which overcomes these deficiencies by using Arabic Lexico-syntactic patterns and a semantic
annotator, based on the novel idea of using domain or previous knowledge in a traceable,
incremental ontology learning process. Supporting Arabic is not an easy task, because current
linguistic analysis tools are not efficient enough to process unvocalized Arabic corpora, which
rarely contain appropriate punctuation; we therefore tried to build a flexible and freely
configurable framework in which any linguistic analysis tool can be replaced by a more
sophisticated one whenever it becomes available.
This paper is structured as follows: in the first section we describe the system architecture,
which explains the usage scenario and the interaction of the system units. These units are
detailed in the next three sections: the Probabilistic Ontology Model, Arabic Natural
Language Processing, and the Algorithms Controller. Some experimental results are presented
in the Results and Discussion section. Conclusions and future work appear in the final section.
System Architecture
The architecture of ArOntoLearn is divided into three units: the Probabilistic Ontology Model
(POM), the Algorithms Controller, and the Arabic NLP tools (see figure 1).

Fig 1. Architecture of ArOntoLearn (corpus → NLP unit → Algorithm Controller with evidence and reference stores → POM → visualization and OWL/RDF ontology export)
The POM stores the results of the different ontology learning algorithms, which are initialized
by the Algorithms Controller, whose purpose is to trigger the linguistic preprocessing of the
data, execute the ontology learning algorithms in the appropriate order, and apply the
algorithms' change requests to the POM. The fact that no algorithm has permission to
manipulate the POM directly guarantees maximum transparency and allows for the flexible
composition of arbitrarily complex algorithms.
The execution of each algorithm consists of three phases: (i) in the notification phase, the
algorithm learns about recent changes to the corpus; (ii) in the computation phase, these
changes are mapped to changes with respect to the reference repository, which contains the
pointers to all occurrences of the concepts; (iii) in the result generation phase, requests for
POM changes are generated from the updated content of the reference repository.
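To make this three-phase contract concrete, the following Java sketch shows one possible
interface for a learning algorithm. The type and method names (LearningAlgorithm,
CorpusChange, PomChangeRequest) are our own illustration, not the actual ArOntoLearn classes.

    import java.util.List;

    // Illustrative three-phase contract of a learning algorithm
    // (hypothetical names, not the actual ArOntoLearn API).
    interface LearningAlgorithm {

        // Phase (i): the algorithm is notified of recent corpus changes.
        void notifyCorpusChanges(List<CorpusChange> changes);

        // Phase (ii): corpus changes are mapped to updates of the reference
        // repository, which holds pointers to all occurrences of each concept.
        void computeReferenceUpdates();

        // Phase (iii): POM change requests are generated from the updated
        // reference repository; only the controller may apply them to the POM.
        List<PomChangeRequest> generateResults();
    }

    // Minimal placeholder types so that the sketch is self-contained.
    class CorpusChange { String documentId; String kind; } // kind: added, removed or modified
    class PomChangeRequest { String primitive; double probability; }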
The Arabic Natural Language Processing (NLP) unit is based on the GATE framework [6]. Our
choice of GATE is based on two facts: its flexibility with respect to the set of linguistic
algorithms used, and its Java Annotation Patterns Engine (JAPE), which provides finite state
transduction over annotations based on regular expressions and thus allows flexible modeling
and development of Lexico-syntactic patterns.
The Probabilistic Ontology Model
The Probabilistic Ontology Model (POM) was introduced by the Text2Onto [5] framework as a
collection of instantiated modeling primitives which are independent of a concrete ontology
representation language. The benefits of the POM are twofold. On the one hand, adding new
primitives does not imply changing the underlying framework, which makes it flexible and
extensible. On the other hand, the POM can be translated into various ontology representation
languages such as OWL and RDF, as sketched after the following list. The modeling primitives
we use in ArOntoLearn are:
i. Concepts.
ii. Instances.
iii. Concept inheritance (subclass-of).
iv. Concept instantiation (instance-of).
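As a minimal illustration of this language independence, the following sketch maps each of the
four primitives to a Turtle-style OWL/RDF statement. The helper class PomToOwl is hypothetical;
a real translator would build the statements with an RDF library.

    // Illustrative mapping of the four POM primitives to Turtle-style
    // OWL/RDF statements (hypothetical helper class).
    class PomToOwl {
        static String concept(String c) { return "<" + c + "> rdf:type owl:Class ."; }
        static String instance(String i) { return "<" + i + "> rdf:type owl:NamedIndividual ."; }
        static String subclassOf(String sub, String sup) { return "<" + sub + "> rdfs:subClassOf <" + sup + "> ."; }
        static String instanceOf(String i, String c) { return "<" + i + "> rdf:type <" + c + "> ."; }

        public static void main(String[] args) {
            System.out.println(concept("Country"));             // <Country> rdf:type owl:Class .
            System.out.println(instanceOf("Syria", "Country")); // <Syria> rdf:type <Country> .
        }
    }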
The POM is not probabilistic in a strict mathematical sense; rather, every instantiated modeling
primitive is assigned a value indicating how certain the algorithm in question is about the
existence of the corresponding instance. The purpose of these 'probabilities' is to facilitate
user interaction by allowing the user to filter the POM and thereby select only a number of
relevant instances of modeling primitives. Each POM change has references to the underlying
corpus changes, which is why the user can trace the evolution of the ontology with respect to
the changes in the underlying corpus.
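The following sketch illustrates such a POM entry and the threshold-based filtering; the class
and field names are our own, assuming certainty values in [0..1] and simple string pointers to
corpus changes.

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative POM entry: an instantiated modeling primitive with its
    // certainty value and back-references to the corpus changes that
    // produced it (hypothetical structure).
    class PomEntry {
        String primitive;        // e.g. "instance-of(Syria, Country)"
        double probability;      // certainty value in [0..1]
        List<String> corpusChangeRefs = new ArrayList<>(); // traceability links

        PomEntry(String primitive, double probability) {
            this.primitive = primitive;
            this.probability = probability;
        }
    }

    class Pom {
        List<PomEntry> entries = new ArrayList<>();

        // User-side filtering: keep only the primitives whose certainty is
        // at least the chosen threshold (0.5 in our experiments).
        List<PomEntry> filter(double threshold) {
            List<PomEntry> selected = new ArrayList<>();
            for (PomEntry e : entries)
                if (e.probability >= threshold) selected.add(e);
            return selected;
        }
    }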
Arabic Natural Language Processing
Our Natural Language Processing unit is based on the GATE framework, which is very flexible
and can be freely configured by replacing existing algorithms or adding new ones, such as a
deep parser, if required. Another advantage of using GATE is the ability to reuse existing
Arabic linguistic analyzers.
The GATE pipeline application we used (figure 2) starts with Arabic tokenization and sentence
splitting. The resulting annotation set serves as input for an Arabic Morphological Analyzer and
POS-Tagger [7], which assigns appropriate morphological and POS tags to all tokens, such as
(token=‘’, POS-tag=‘common_noun’, root=‘’, pattern=‘’, stem=‘’, stem pattern=‘’). The
resulting annotation set then serves as input for an Arabic Syntactical Analyzer [8], which
assigns appropriate syntactic categories to all tokens and constructs the syntactical tree of each
sentence. For example, the sentence ‘’ produces the annotations in figure 3, where ‘’ and ‘’
are proper nouns (NNP), and together they construct a Noun Phrase (NP).
Fig 3. Syntactical Analyzer annotations (syntax tree fragment: one NP covering DTJJ and DTNN tokens, and one NP covering two NNP tokens)
Then, using an assistant ontology such as Arabic WordNet (AWN) [11], which represents
previous or domain knowledge, a Semantic Annotator finds and annotates all known concepts
in the texts, in order to keep the results in the same domain, obtain more accuracy, and benefit
from the knowledge we already have. For example, ‘’ will be annotated with the concept
‘Syria’ or the super-concept ‘country’.
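The following sketch illustrates this lookup step under simplifying assumptions: the assistant
ontology is reduced to an in-memory map from surface forms to concepts, and the class and
method names are our own.

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative semantic annotator: noun phrases already known to the
    // assistant ontology (e.g. loaded from Arabic WordNet) are annotated
    // with their concept so that later patterns can reuse this knowledge.
    class SemanticAnnotator {
        private final Map<String, String> knownConcepts = new HashMap<>();

        // Populate the lookup table from the assistant ontology.
        void addKnownConcept(String surfaceForm, String concept) {
            knownConcepts.put(surfaceForm, concept);
        }

        // Annotate a noun phrase with its known concept, or null if unknown.
        String annotate(String nounPhrase) {
            return knownConcepts.get(nounPhrase);
        }

        public static void main(String[] args) {
            SemanticAnnotator annotator = new SemanticAnnotator();
            annotator.addKnownConcept("Syria", "country"); // super-concept from AWN
            System.out.println(annotator.annotate("Syria")); // prints: country
        }
    }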
Fig 2. GATE pipeline application (Tokenizer → Sentence Splitter → Morphological Analyzer and POS-tagger → Syntactical Analyzer → Semantic Annotator → Lexico-syntactic Patterns Matcher)
After the basic linguistic preprocessing is done, a JAPE transducer is run over the annotated
corpus in order to match a set of particular Arabic Lexico-syntactic patterns required by the
ontology learning algorithms. Whereas the left hand side of each JAPE pattern defines a
regular expression over existing annotations, the right hand side describes the new annotations
to be created. We developed JAPE patterns to identify modeling primitives, i.e. concepts,
instances and different types of relations such as ‘has-a’ and ‘is-a’. In the following we
explain some of our Lexico-syntactic patterns written in JAPE:
Pattern 1: ‘is-a’ relation
({syntaxnode.type == NP}):instance
({Token.category == is-a})    // is-a = { }
({syntaxnode.type == NP}):concept
This pattern aims to match sentences, for example (‘’, ‘’, ‘’, ‘’, ‘’), which all express an
is-a relation between the instance and the concept.
Pattern 2: ‘has-a’ relation
({Token.root == ‘’} | {Token.root == ‘’})
({syntaxnode.type == NP}):sub_concept
({Token.string == ‘’})
({Token.string == ‘’})
({syntaxnode.type == NP}):super_concept
This pattern aims to match sentences, for example (‘’, ‘’), which all express a has-a relation
between the sub_concept and the super_concept.
We have developed a core of 15 Lexico-syntactic patterns to capture ‘is-a’ and ‘has-a’
relations in different Arabic sentence forms. Once a pattern captures a relation, it adds
annotation tags that refer to the relation elements, such as instance, concept and sub_concept.
Algorithms Controller
The Algorithms Controller drives the whole learning process. First, it runs the NLP unit and
gets the produced annotations; it then fires the learning algorithms corresponding to these
annotation types, in the appropriate order. In fact, algorithms can be classified according to the
modeling primitives they produce. In order to obtain a more reliable probability for each
instantiated primitive, the Algorithms Controller allows several algorithms of different types to
be performed and then combines their results. Because algorithms cannot directly change the
POM, they send POM change requests to the Algorithms Controller, which applies them to the POM.
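The combination rule for the probabilities delivered by the different algorithms is not fixed by
the framework description; the following sketch shows simple averaging as one plausible choice
(the class name and the averaging rule are our assumptions).

    import java.util.List;

    // Illustrative combination of the probabilities delivered by several
    // algorithms for the same instantiated primitive; averaging is one
    // simple choice (the actual combination rule may differ).
    class ProbabilityCombiner {
        static double combine(List<Double> probabilities) {
            if (probabilities.isEmpty()) return 0.0;
            double sum = 0.0;
            for (double p : probabilities) sum += p;
            return sum / probabilities.size();
        }

        public static void main(String[] args) {
            // e.g. RTF, TFIDF and Entropy each produced a value for one term
            System.out.println(combine(List.of(0.8, 0.6, 0.7))); // approximately 0.7
        }
    }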
Hereafter, we briefly describe the algorithms we used in the learning process. In particular, we
describe the approach used to calculate the probability of an instantiated modeling primitive:
i. Concept and instance algorithms:
We implemented different algorithms calculating the following measures: Relative Term
Frequency (RTF), Term Frequency Inverse Document Frequency (TFIDF) [9], and
Entropy [10]. For each term, the values of these measures are normalized into the interval
[0..1] and used as the corresponding probability in the POM (see the first sketch after this list).
ii. ‘Is-a’ and ‘has-a’ relation algorithms:
In order to assign a confidence to extracted relations, we implemented different algorithms.
We used the Google search engine to calculate the confidence according to the count of
results returned when searching for the relation, where the search query is formed according
to the relation type and elements; and we used the AWN [11] lexical ontology to search for
the relation: if it is found, the confidence is 1, otherwise it is 0 (see the second sketch after
this list).
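As a sketch of the measures in (i), the following code computes RTF and TFIDF and normalizes
raw scores into [0..1] by dividing by the maximum score; this normalization rule and the class
name are our assumptions, and the Entropy measure [10] would be plugged in analogously.

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative term-relevance measures; normalization by the maximum
    // score is one common way to map raw values into [0..1] (assumption).
    class TermMeasures {

        // Relative term frequency: occurrences of the term / total tokens.
        static double rtf(int termCount, int totalTokens) {
            return (double) termCount / totalTokens;
        }

        // TFIDF: term frequency weighted by inverse document frequency.
        static double tfidf(int termCount, int docsWithTerm, int totalDocs) {
            return termCount * Math.log((double) totalDocs / docsWithTerm);
        }

        // Normalize raw scores into [0..1] by dividing by the maximum score.
        static Map<String, Double> normalize(Map<String, Double> scores) {
            double max = 0.0;
            for (double v : scores.values()) max = Math.max(max, v);
            Map<String, Double> normalized = new HashMap<>();
            for (Map.Entry<String, Double> e : scores.entrySet())
                normalized.put(e.getKey(), max == 0.0 ? 0.0 : e.getValue() / max);
            return normalized;
        }
    }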
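For the relation algorithms in (ii), the following sketch shows the two confidence strategies
side by side. The scaling of web hit counts into [0..1] is our assumption (the text only states
that the confidence is derived from the result count), while the AWN rule (1 if the relation is
found, 0 otherwise) is exactly the one described above.

    // Illustrative relation-confidence strategies. The hit counts are
    // assumed to come from a web-search call whose query is built from
    // the relation type and elements.
    class RelationConfidence {

        // Web-based confidence: hits for the full relation phrase relative
        // to hits for its elements alone, capped at 1.0 (assumed scaling).
        static double fromWebCounts(long hitsForRelation, long hitsForElements) {
            if (hitsForElements == 0) return 0.0;
            return Math.min(1.0, (double) hitsForRelation / hitsForElements);
        }

        // Assistant-ontology confidence: binary, as described in the text.
        static double fromAwn(boolean relationFoundInAwn) {
            return relationFoundInAwn ? 1.0 : 0.0;
        }
    }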
Results and Discussion
To evaluate the quality of the ontology learning results, we tested our system on texts selected
from the Arabic Wikipedia. We selected about 125 documents in the domain of countries and
cities, used Arabic WordNet as the assistant ontology, and established 15 Lexico-syntactic
patterns (5 for instance-of relations, 10 for subclass-of relations). Our test focused on the
instance-of relation. Since applying the Stanford Syntactic Parser to Arabic sentences requires
that these sentences be well formed, which is not the case for Arabic Wikipedia documents, we
performed two tests: (i) using the Stanford Syntactic Parser on a limited number of sentences
(34 sentences), and (ii) using only the Arabic Morphological Analyzer, without the Stanford
Syntactic Parser, on the whole corpus (12779 sentences) (see Table 1). We divided the results
according to a relation confidence threshold, which we set to 0.5. The results were as follows:
i. Using the Morphological Analyzer:
The number of relations with confidence greater than or equal to 0.5 is 68. After manual
verification of these relations, we found 26 true instance-of relations and 42 false ones.
The number of relations with confidence less than 0.5 is 40. After manual verification of
these relations, we found 28 true instance-of relations and 12 false ones.
The precision of the results of this test is 0.5 (54 of the 108 extracted relations are true
instance-of relations).
ii. Using the Stanford Syntactic Parser:
The number of relations with confidence greater than or equal to 0.5 is 10. After manual
verification of these relations, we found 8 true instance-of relations and 2 false ones.
The number of relations with confidence less than 0.5 is 2. After manual verification of
these relations, we found 2 true instance-of relations and no false ones.
The precision of the results of this test is 0.83 (10 of the 12 extracted relations are true
instance-of relations).
Table 1: Results

Analysis        Sentences   Confidence >= 0.5        Confidence < 0.5
                            Total   True   False     Total   True   False
Morphological   12779          68     26      42        40     28      12
Syntactical        34          10      8       2         2      2       0
We observed that using the Stanford Syntactic Parser increased the precision of the results,
because we could use the syntactic annotations of the sentences in the Lexico-syntactic
patterns. Unfortunately, we could not apply the Stanford Syntactic Parser to the whole corpus,
because it requires that the sentences be well formed and have appropriate punctuation.
Conclusion and Future Work
We have developed a framework for incremental ontology learning that uses Arabic natural
language processing, machine learning and text mining techniques in order to extract an
ontology from Arabic textual resources. The novel aspects of our framework are: (i) the
flexibility to use other Arabic linguistic analyzers and to add new Lexico-syntactic patterns to
reach higher accuracy, (ii) the independence of a concrete ontology representation
language, (iii) the exploitation of previous knowledge by using an assistant ontology, (iv) the
use of probability for capturing uncertainty and enhancing user interaction, and (v) the
integration of data-driven change discovery strategies, which increases the efficiency of the
system as well as the traceability of the learned ontology with respect to changes in the corpus,
making the whole process more transparent.
In the future, we intend to develop more Arabic Lexico-syntactic patterns to cover more
relations and sentence forms. We can extend the POM with more taxonomic and non-taxonomic
relations. To increase the precision of the system, we can replace the existing Arabic
morphological and syntactical analyzers with more accurate ones, and add an Arabic
lemmatizer. We can also use machine learning and text mining techniques in our algorithms for
extracting modeling primitives.
References
[1] G. Bisson, C. Nedellec, and L. Canamero. Designing clustering methods for ontology
building: The Mo'K workbench. In Proceedings of the ECAI Ontology Learning Workshop,
pages 13-19, 2000.
[2] P. Buitelaar, D. Olejnik, and M. Sintek. OntoLT: A Protégé plug-in for ontology extraction
from text. In Proceedings of the International Semantic Web Conference (ISWC), 2003.
[3] P. Velardi, R. Navigli, A. Cucchiarelli, and F. Neri. Evaluation of OntoLearn, a methodology
for automatic population of domain ontologies. In P. Buitelaar, P. Cimiano, and B. Magnini,
editors, Ontology Learning from Text: Methods, Evaluation and Applications. IOS Press, 2005.
[4] A. Maedche and S. Staab. Ontology learning. In S. Staab and R. Studer, editors,
Handbook on Ontologies, pages 173-189. Springer, 2004.
[5] P. Cimiano and J. Völker. Text2Onto: A framework for ontology learning and data-driven
change discovery. In Proceedings of NLDB 2005, Lecture Notes in Computer Science, vol. 3513,
pages 227-238. Springer, Alicante, 2005.
[6] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan. GATE: A framework and
graphical development environment for robust NLP tools and applications. In Proceedings of
the 40th Annual Meeting of the ACL, 2002.
[7] R. Sonbol, N. Ghneim, and M. Desouki. Arabic morphological analysis: A new approach.
In Proceedings of ICTTA 2008, Damascus, Syria, 2008.
[8] Arabic Stanford Parser. Retrieved October 2, 2008, from
http://nlp.stanford.edu/software/parser-faq.shtml.
[9] G. Salton. Developments in automatic text retrieval. Science, 253:974-979, 1991.
[10] R. M. Gray. Entropy and Information Theory. Springer, 1990.
[11] R. Navigli, P. Velardi, A. Cucchiarelli, and F. Neri. Extending and enriching WordNet
with OntoLearn. In Proceedings of the Second Global WordNet Conference (GWC 2004),
Brno, Czech Republic, 2004.