Conference Paper

Arabic Entity Graph Extraction Using Morphology, Finite State Machines, and Graph Transformations

Abstract

Research on automatic recognition of named entities in Arabic text uses techniques that work well for Latin-based languages, such as local grammars, statistical learning models, pattern matching, and rule-based techniques. These techniques boost their results by using application-specific corpora, parallel language corpora, and morphological stemming analysis. We propose a method for extracting entities, events, and the relations among them from Arabic text using a hierarchy of finite state machines driven by morphological features, such as part-of-speech and gloss tags, and graph transformation algorithms. We evaluated our method on two natural language processing applications. We automated the extraction of narrators and narrator relations from several corpora of Islamic narration books (hadith). We automated the extraction of genealogical family trees from Biblical texts. In all applications, our method reports high precision and recall and learns lemmas about phrases that improve results.
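The abstract does not spell out the machines or the graph transformations, so the following is only a minimal sketch of the general idea, under stated assumptions: tokens arrive already tagged with one of three hypothetical tags (`NAME`, `CONN` for a narration connector, `OTHER`), a single small state machine collects a narrator chain, and a merge step joins several chains into one graph. The tag names, the function names `extract_chain` and `merge_chains`, and the transliterated sample words are all illustrative and are not taken from the paper.

```python
from collections import defaultdict

def extract_chain(tagged):
    """Collect narrator names from (word, tag) pairs with a three-state machine.

    Hypothetical tags: "NAME", "CONN" (narration connector), "OTHER".
    The chain ends at the first OTHER token seen after a narrator was collected.
    """
    chain, name = [], []
    state = "OUTSIDE"
    for word, tag in tagged:
        if tag == "NAME":
            name.append(word)              # extend a possibly multi-word name
            state = "IN_NAME"
        elif tag == "CONN":
            if state == "IN_NAME":         # a connector closes the current name
                chain.append(" ".join(name))
                name = []
            state = "AFTER_CONN"
        else:                              # OTHER
            if state == "IN_NAME":
                chain.append(" ".join(name))
                name = []
            if chain:                      # leaving the chain region
                break
            state = "OUTSIDE"
    if name:                               # input ended inside a name
        chain.append(" ".join(name))
    return chain

def merge_chains(chains):
    """Merge chains into one graph: each narrator maps to the sources it cites."""
    graph = defaultdict(set)
    for chain in chains:
        for receiver, source in zip(chain, chain[1:]):
            graph[receiver].add(source)
    return {node: sorted(sources) for node, sources in graph.items()}
```

For example, `extract_chain` applied to a tagged sequence of connector, name, connector, name returns the two names in order, and `merge_chains` turns narrators that appear in several chains into shared nodes.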

... Knowledge-based techniques such as [17,18] propose local grammars with morphological stemming to perform NER. [19] presents a method for extracting entities, events, and relations amongst them from Arabic text using a hierarchy of manually built finite state machines driven by morphological features and graph transformation algorithms. Such techniques require advanced linguistic and programming expertise. ...
... In this section we compare MERF implementations of the narrator chain, temporal entity, and genealogy entity extraction tasks to the task-specific techniques proposed to solve them in ANGE [40], ATEEMA [41], and GENTREE [19], respectively. We also compare a MERF number normalization task to a task-specific implementation. ...
... meaning "and Haran became the father of Lot". GENTREE [19] automatically extracts the genealogical family trees using morphology, finite state machines, and graph transformations. Table 5 shows that MERF detected MBF matches with 99% recall, and 75% precision, and extracted user-defined relations with 81% recall and 96% precision. ...
Article
Rule-based techniques and tools to extract entities and relational entities from documents allow users to specify desired entities using natural language questions, finite state automata, regular expressions, structured query language statements, or proprietary scripts. These techniques and tools require expertise in linguistics and programming and lack support for Arabic morphological analysis, which is key to processing Arabic text. In this work, we present MERF, a morphology-based entity and relational entity extraction framework for Arabic text. MERF provides a user-friendly interface where the user, with basic knowledge of linguistic features and regular expressions, defines tag types and interactively associates them with regular expressions defined over Boolean formulae. Boolean formulae range over matches of Arabic morphological features and synonymity features. Users define user-defined relations with tuples of subexpression matches and can associate code actions with subexpressions. MERF computes feature matches and regular expression matches, and constructs entities and relational entities from user-defined relations. We evaluated our work with several case studies and compared it with existing application-specific techniques. The results show that MERF requires shorter development time and effort compared to existing techniques and produces reasonably accurate results within a reasonable overhead in run time.
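The pipeline described in this abstract (Boolean formulae over morphological features, tag types, regular expressions over tags, relations from subexpression matches) can be illustrated with a small sketch. This is not MERF's interface: the `Token` fields, the feature values such as `NOUN_PROP`, the one-letter tag symbols, and the function names are assumptions made for illustration, and English stand-ins replace Arabic words.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    text: str
    pos: str      # hypothetical part-of-speech feature
    gloss: str    # hypothetical English gloss feature

# Each tag type pairs a one-letter symbol with a Boolean formula over features.
TAG_TYPES = {
    "N": lambda t: t.pos == "NOUN_PROP",                            # proper name
    "S": lambda t: t.pos == "NOUN" and t.gloss in {"son", "daughter"},  # kinship cue
}

def tag_symbols(tokens):
    """Map every token to the first tag type whose formula holds, else 'O'."""
    return "".join(
        next((sym for sym, formula in TAG_TYPES.items() if formula(t)), "O")
        for t in tokens
    )

# Regular expression over tag symbols: a name followed by one or more
# (kinship cue, name) pairs.
LINEAGE = re.compile(r"N(?:SN)+")

def extract_lineages(tokens):
    """Return, for each regex match, the list of names it covers."""
    symbols = tag_symbols(tokens)
    return [
        [t.text for t in tokens[m.start():m.end()] if TAG_TYPES["N"](t)]
        for m in LINEAGE.finditer(symbols)
    ]

def child_parent_relations(names):
    """A user-defined relation built from adjacent names in one match."""
    return list(zip(names, names[1:]))
```

Because each token maps to exactly one symbol, match offsets in the symbol string are also token offsets, which keeps the step from regex match back to tokens trivial. A real system would need multi-character tags and ambiguity handling that this sketch leaves out.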
... However, even if some works (e.g. Ben Guirat et al., 2016;Darwish & Oard, 2002) compared different tools and indexing units, lemma-based IR is not yet well investigated except in Abdelali, Darwish, Durrani and Mubarak, (2014), Makhlouta, Zaraket and Harkous (2012) and Soudani, Bounhas, and Retrieval (2016). That's why, other studies proposed to get a morphology-based IR in order to reach better results. ...
Article
In this paper, we propose to build a morpho-semantic knowledge graph from Arabic vocalized corpora. Our work focuses on classical Arabic, as it has not been deeply investigated in related works. We use a tool suite that analyzes and disambiguates Arabic texts, taking into account short diacritics to reduce ambiguities. At the morphological level, we combine the Ghwanmeh stemmer and MADAMIRA, which are adapted to extract a multi-level lexicon from Arabic vocalized corpora. At the semantic level, we infer semantic dependencies between tokens by exploiting contextual knowledge extracted by a concordancer. Both morphological and semantic links are represented through compressed graphs, which are accessed through lazy methods. These graphs are mined using a measure inspired by BM25 to compute one-to-many similarity. We propose to evaluate the morpho-semantic knowledge graph in the context of Arabic Information Retrieval (IR). Several scenarios of document indexing and query expansion are assessed. That is, we vary indexing units for Arabic IR based on different levels of morphological knowledge, a challenging issue that has not been resolved in previous research. We also experiment with several combinations of morpho-semantic query expansion. This allows us to validate our resource and to study its impact on IR based on state-of-the-art evaluation metrics.
... Association Rules [97], [98], [99], [100]: 4 (4%); Named Entity Recognition [101], [102], [103], [104]: 4 (4%) ...
Article
Text mining is a set of techniques that analyze large masses of data, extract relations that are unknown beforehand, and provide solutions to help decision-making. Text mining has been used extensively to analyze English text, but it has only recently been applied to Arabic text. Accordingly, the objective of this paper is to present the current state of Arabic text mining. A systematic review was performed to collect the papers published on Arabic text mining. More than one hundred papers from different reliable sources were used in our review; they were classified according to their specific domain, and classified again according to the specific techniques used. This paper also provides a quantitative analysis of publications according to publication type, year, category, and contributors.
Chapter
Educational Data Mining (EDM) is a multidisciplinary field that covers the analysis of educational data using data mining techniques. The first annual Educational Data Mining conference was established in 2008. Many articles have been published in the field of EDM owing to the keen interest in improving teaching practices for both the learning process and the learners. This paper presents a systematic review of the EDM literature published during 2006-2013, based on the highly cited papers in this domain. More than three hundred papers were collected through the Google Scholar index and then classified according to application domain. The paper also provides a quantitative analysis of publications according to publication type, year, venue, category, tasks, and contributors.
Article
The Holy Quran and Hadith are the two main sources of legislation and guidelines for Muslims to shape their lives. The daily activities, sayings, and deeds of the Holy Prophet Muhammad (PBUH) are called Hadiths. Hadiths are the optimal practical descriptions of the Holy Quran. Advancements in information and communication technologies (ICT) have revolutionized every field of daily life, including the digitization of the Holy Quran and Hadith. Hadith content available online is obtained from different sources, so alteration of Hadiths and fabrication of fake ones are feasible. Authenticating this online Hadith content is a complex and challenging task and a crucial area of study in Islam. Few Hadith authentication techniques and systems have been proposed in the literature. In this study, we survey and classify all techniques and systems that have been proposed for Hadith authentication, and we identify open challenges and future research directions.
Article
This article presents a literature review of computer-science-related research applied to hadith, a kind of Arabic narration that appeared in the 7th century. We study and compare existing works in several fields of Natural Language Processing (NLP), Information Retrieval (IR), and Knowledge Extraction (KE). We elicit their main drawbacks and identify some perspectives that may be considered by the research community. We also study the characteristics of these types of documents by enumerating the advantages and limits of using hadith as a language resource. Moreover, our study shows that previous studies used different collections of hadiths, making it hard to compare their results objectively. Besides, many preprocessing steps recur across these applications, which wastes a lot of time. Consequently, the key issues for building generic language resources from hadiths are discussed, taking into account the relevance of related literature and the wide community of researchers interested in these narrations. The ultimate goal is to structure hadith books for multiple usages, thus building common collections that may be exploited in future applications.
Conference Paper
Computational linguistics and natural language processing automation tasks require text annotated with tags that represent the desired output of the task. The annotation tags serve for training, validation, and evaluation. Arabic morphological analysis, together with its associated tags such as part-of-speech and gloss tags, is key to Arabic computational linguistics and natural language processing. Several manual and automated tagging tools exist for text, but very few are based on Arabic morphological analysis. In this paper, we present an open source tagging tool with a visual interface that enables the construction of annotated Arabic text corpora with automatic morphology-based tags. The tool allows the specification of tags with Boolean formulae where the atomic predicates are match and contain relations between the morphological solution of part of the text and the value of a morphological feature. The tool allows the user to directly enter manual tags, to edit existing tags through a tag-sensitive coloring interface, to compare tag sets, and to compute accuracy results.
Conference Paper
We tackle the problem of automatic, or at least assisted, vocalization, a problem that arises from the almost universal absence of vowels in Arabic texts. We show that the problem of vocalization resides in the fact that the majority of Arabic words accept several potential vocalizations and are therefore ambiguous. In essence, the problem reduces to choosing, in context, the correct vocalization from among several. We focus here on the results obtained by starting with morphological analysis and proceeding to a grammatical (part-of-speech) tagging. In the proposed system, the vocalic ambiguity is detected by means of a double dictionary of voweled and non-voweled forms. The process of resolution is set in motion starting with morphological analysis and continuing through subsequent steps. The experiments described here concern the treatment as far as grammatical (part-of-speech) tagging.
Article
Name identification has been worked on quite intensively for the past few years, and has been incorporated into several products revolving around natural language processing tasks. Many researchers have attacked the name identification problem in a variety of languages, but only a few limited research efforts have focused on named entity recognition for Arabic script. This is due to the lack of resources for Arabic named entities and the limited amount of progress made in Arabic natural language processing in general. In this article, we present the results of our attempt at the recognition and extraction of the 10 most important categories of named entities in Arabic script: the person name, location, company, date, time, price, measurement, phone number, ISBN, and file name. We developed the system Named Entity Recognition for Arabic (NERA) using a rule-based approach. The resources created are a Whitelist, representing a dictionary of names, and a grammar, in the form of regular expressions, which is responsible for recognizing the named entities. A filtration mechanism is used that serves two different purposes: (a) revision of the results from a named entity extractor by using metadata, in terms of a Blacklist or rejecter, about ill-formed named entities and (b) disambiguation of identical or overlapping textual matches returned by different named entity extractors to get the correct choice. In NERA, we addressed major challenges posed by NER in the Arabic language arising from the complexity of the language, peculiarities in the Arabic orthographic system, nonstandardization of the written text, ambiguity, and lack of resources. NERA has been evaluated using our own tagged corpus; it achieved satisfactory results in terms of precision, recall, and F-measure.
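The three resources named in this abstract (a Whitelist, a regular-expression grammar, and a filter that applies a Blacklist and resolves overlapping matches) can be sketched as follows. This is a minimal illustration and not NERA's implementation: the entries, the patterns, the function names, and the longest-match tie-breaking rule are all assumptions, and English stand-ins replace Arabic text.

```python
import re

# Hypothetical resources; a real system would hold Arabic entries.
WHITELIST = {"Ahmad": "PERSON", "Cairo": "LOCATION"}
GRAMMAR = [
    ("PERSON", re.compile(r"Dr\. [A-Z][a-z]+")),           # honorific + name
    ("DATE", re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")),    # e.g. 12/05/2009
]
BLACKLIST = {"Dr. Pepper"}  # ill-formed matches to reject

def candidates(text):
    """Collect (start, end, label) spans from the whitelist and the grammar."""
    found = []
    for name, label in WHITELIST.items():
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.append((m.start(), m.end(), label))
    for label, pattern in GRAMMAR:
        for m in pattern.finditer(text):
            found.append((m.start(), m.end(), label))
    return found

def filter_matches(text, found):
    """Drop blacklisted spans, then keep the longest of any overlapping spans."""
    kept = [c for c in found if text[c[0]:c[1]] not in BLACKLIST]
    kept.sort(key=lambda c: (-(c[1] - c[0]), c[0]))   # longest first, then leftmost
    chosen = []
    for start, end, label in kept:
        if all(end <= s or start >= e for s, e, _ in chosen):
            chosen.append((start, end, label))
    return sorted(chosen)

def recognize(text):
    """Return (surface string, label) pairs in text order."""
    return [(text[s:e], label)
            for s, e, label in filter_matches(text, candidates(text))]
```

Preferring the longest span is only one possible disambiguation policy; the abstract says overlapping matches are disambiguated but does not say how.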
Article
For Muslims, Hadiths are the second source of Islamic jurisprudence after the Holy Qur’an. Hadiths are narrations originating from the words and deeds of Prophet Muhammad. There are two main components in each Hadith, the narration chain and the narrative text. A hadith scholar judges a Hadith based on the narration chain and the individuals involved in the chain. In this paper, we report on e-Narrator, an application that parses a plain Hadith text and automatically generates the full narration tree. Our proposed solution involves parsing and annotating the Hadith text and recognizing the narrators’ names. We use shallow parsing along with a domain specific grammar to parse the Hadith content. Moreover, we use a transformation mechanism based on semantic web ontology to represent the narration chain in a standard format and then graphically render its complete tree. Experiments on sample Hadiths show our approach to have a very good success rate.
Conference Paper
The task of Named Entity Recognition (NER) identifies proper names as well as temporal and numeric expressions in open-domain text. NER systems have proved to be very important for many tasks in Natural Language Processing (NLP) such as Information Retrieval and Question Answering. Unfortunately, the main efforts to build reliable NER systems for the Arabic language have been made in a commercial setting, and neither the approach used nor the accuracy achieved is known. In this paper, we present ANERsys: a NER system built exclusively for Arabic texts based on n-grams and maximum entropy. Furthermore, we present both the Arabic-specific heuristic and the gazetteers we used to boost our system. We developed our own training and test corpora (ANERcorp) and gazetteers (ANERgazet) to train, evaluate and boost the implemented technique. A major effort was made to ensure that all the experiments are carried out in the same framework as the CONLL 2002 conference. We carried out several experiments, and the preliminary results show that this approach successfully tackles the problem of NER for the Arabic language.
Conference Paper
The two fundamental sources of Islamic legislation are the Qur'an and the Hadith. The Hadiths, or Prophetic Traditions, are narrations originating from the sayings and conduct of Prophet Muhammad. Each Hadith starts with a list of narrators involved in transmitting it, followed by the transmitted text. The Hadith corpus is extremely large and runs into hundreds of volumes. Due to their legislative importance, Hadiths have been carefully scrutinized by hadith scholars. One way a scholar may grade a Hadith is by its narration chain and the individual narrators in the chain. In this paper we report on a system that automatically generates the transmission chains of a Hadith and displays them graphically. Computationally, this is a challenging problem. The text of a Hadith is in Arabic, a morphologically rich language, and each Hadith has its own peculiar way of listing narrators. Our solution involves parsing and annotating the Hadith text and identifying the narrators' names. We use shallow parsing along with a domain-specific grammar to parse the Hadith content. Experiments on sample Hadiths show our approach to have a very good success rate.
Conference Paper
The local grammar approach was first used to discuss recursive phrases that are commonly found in specialist literature like biochemistry and then extended to extract time, date and address expressions from letters. It has recently been applied to extract person names from English, Chinese, French, Korean, Portuguese, and Turkish news texts. This paper shows that this approach can also be used to extract person names from Arabic counterparts.
Conference Paper
Tagging and extracting proper names is important for improving the effectiveness of question-answering systems. The valuable information in a text is usually located around proper names, so the names must be found first in order to collect this information. By extracting proper names from the text, we provide question-answering systems with the proper name found in the text, some information about it, and where it was found. Proper names in Arabic do not start with a capital letter as in many other languages, so special treatment is needed to find them in a text. Little research has been conducted in this area; most efforts have been based on a number of heuristic rules used to find names in the text. In this paper we present a new technique to extract names from text by building a database and graphs to represent the words that might form a name and the relationships between them. First, we mark the phrases that might include names; second, we build graphs to represent the words in these phrases and the relationships between them; third, we apply rules to find the names.
Conference Paper
The Named Entity Recognition (NER) task has been garnering significant attention in NLP as it helps improve the performance of many natural language processing applications. In this paper, we investigate the impact of using different sets of features in two discriminative machine learning frameworks, namely, Support Vector Machines and Conditional Random Fields, using Arabic data. We explore lexical, contextual and morphological features on eight standardized data-sets of different genres. We measure the impact of the different features in isolation, rank them according to their impact for each named entity class and incrementally combine them in order to infer the optimal machine learning approach and feature set. Our system yields an F(β=1)-measure of 83.5 on ACE 2003 Broadcast News data.
Conference Paper
Building an accurate Named Entity Recognition (NER) system for languages with complex morphology is a challenging task. In this paper, we present research that explores the feature space using both gold and bootstrapped noisy features to build an improved highly accurate Arabic NER system. We bootstrap noisy features by projection from an Arabic-English parallel corpus that is automatically tagged with a baseline NER system. The feature space covers lexical, morphological, and syntactic features. The proposed approach yields an improvement of up to 1.64 F-measure (absolute).
Conference Paper
We present a working Arabic information extraction (IE) system that is used to analyze large volumes of news texts every day to extract the named entity (NE) types person, organization, location, date and number, as well as quotations (direct reported speech) by and about people. The Named Entity Recognition (NER) system was not developed for Arabic; instead, a highly multilingual, almost language-independent NER system was adapted to also cover Arabic. The Semitic language Arabic substantially differs from the Indo-European and Finno-Ugric languages currently covered. This paper thus describes what Arabic language-specific resources had to be developed and what changes needed to be made to the otherwise language-independent rule set in order to be applicable to the Arabic language. The achieved evaluation results are generally satisfactory, but could be improved for certain entity types.
Conference Paper
Article
Arabic is the most widely spoken language in the Arab World. Most people of the Islamic World understand the Classic Arabic language because it is the language of the Qur’an. Despite the fact that in the last decade the number of Arabic Internet users (Middle East and North and East of Africa) has increased considerably, systems to analyze Arabic digital resources automatically are not as easily available as they are for English. Therefore, in this work, an attempt is made to build a real time Named Entity Recognition system that can be used in web applications to detect the appearance of specific named entities and events in news written in Arabic. Arabic is a highly inflectional language, thus we will try to minimize the impact of Arabic affixes on the quality of the pattern recognition model applied to identify named entities. These patterns are built up by processing and integrating different gazetteers, from DBPedia (http://dbpedia.org/About, 2009) to GATE (A general architecture for text engineering, 2009) and ANERGazet (http://users.dsic.upv.es/grupos/nle/?file=kop4.php). Keywords: Arabic language, Text mining, Named Entity Recognition, Event detection, Morphological analysis, Root extraction
Article
We describe a fast, high-performance name recognizer for Arabic texts. It combines a pattern-matching engine and supporting data with a morphological analysis component. The role of the morphological analysis in accurate name recognition is discussed. We also provide evaluations of both morphological analysis and name recognition. 1 Introduction. 1.1 Roadmap. Arabic named entity recognition in texts in Arabic script is, to our knowledge, a little researched topic. In this paper we describe a system, TAGARAB, that uses a generic pattern-matching engine, SRA's NetOwl TurboTag™, combined with an integrated morphological analysis process, which recognizes names at a high level of accuracy. We first discuss the factors involved in recognizing names in Arabic. We then present a system description, focussing on the morphological analysis and the name recognition components. We also report the results of our evaluations of each component's performance. Finally, we discuss the cont...
BBN IdentiFinder Text Suite, http://www.bbn.com/technology/speech/identifinder
  • B Technologies
Arabic named entity extraction: A local grammar-based approach. In: International Multi Conference on Computer Science and Information Technology (2009)
  • H Traboulsi
Arabic text mining framework (2009), http://code.google.com/p/atmine/
Sakhr inc. (September 2009), http://www.sakhr.com/products/Mining
Entity extraction enables “discovery
  • S Cohen
ANEE: Arabic named entity extraction
  • Coltec
Mapping God’s bloodline
  • R Rouse
Platform for automated authentication of Islamic traditions and hadiths
  • M Zeineddine
Bible Genealogies with Notes on Bible Kinship and Family Systems (2008), http://www.d.umn.edu/~jbelote/biblegenealogy.html
  • J Belote
iTree-automating the construction of the narration tree of hadiths
  • A Azmi
  • N Bin Badia