Chapter

Keyphrase Extraction from Modern Standard Arabic Texts Based on Association Rules

Abstract

Keywords or keyphrases constitute a very important kind of concept that can be extracted from texts. They reflect the semantics of these texts and are useful in many tasks of Information Retrieval, Text Mining and Natural Language Processing. Their extraction is a challenging problem in which researchers take an active interest. In this paper, an approach based on the Association Rules model is described for extracting keyphrases from Modern Standard Arabic texts. The experimental results are promising: the performance values of the proposed system (in terms of precision, recall and F-score) are higher than 60% and can exceed 70%.
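The abstract reports precision, recall and F-score without defining them. The sketch below shows the usual set-based definitions, assuming exact string matching between extracted and manually assigned keyphrases; the function name and the matching rule are illustrative assumptions, not the evaluation protocol of the chapter.

```python
def precision_recall_f1(extracted, reference):
    """Score extracted keyphrases against a manually assigned reference set.

    Assumption: a keyphrase counts as correct only on an exact string match.
    """
    extracted, reference = set(extracted), set(reference)
    matched = len(extracted & reference)
    # Precision: share of extracted phrases that are correct.
    precision = matched / len(extracted) if extracted else 0.0
    # Recall: share of reference phrases that were found.
    recall = matched / len(reference) if reference else 0.0
    # F-score: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example: 4 extracted phrases, 3 reference phrases, 2 in common.
p, r, f = precision_recall_f1(["a", "b", "c", "d"], ["a", "b", "e"])
```

In practice, Arabic keyphrase evaluations often normalise or stem phrases before matching, which would change the scores.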

... Below we introduce some of these studies. Loukam, Hammouche, Mezzoudj, and Belkredim (2019) propose an association rule mining based approach for keyword extraction. The authors represent a pre-processed text as a matrix where rows indicate sentences, columns indicate the words of the text, and each cell indicates whether a word appears in a sentence or not. ...
Chapter
Keyword indexing is the problem of assigning keywords to text documents. It is an important task as keywords play crucial roles in several information retrieval tasks. The problem is also challenging as the number of text documents is increasing, and such documents come in different forms (i.e., scientific papers, online news articles, and microblog posts). This chapter provides an overview of keyword indexing and elaborates on keyword extraction techniques. The authors provide the general motivations behind the supervised and the unsupervised keyword extraction and enumerate several pioneering and state-of-the-art techniques. Feature engineering, evaluation metrics, and benchmark datasets used to evaluate the performance of keyword extraction systems are also discussed.
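The citing excerpt above describes a sentence-by-word binary matrix as the input to association rule mining. A minimal sketch of that representation follows, assuming the text is already tokenised and pre-processed; the support and confidence thresholds, the restriction to single-word rules, and all function names are illustrative assumptions rather than the authors' settings.

```python
from itertools import permutations


def sentence_word_matrix(sentences):
    """Build a binary matrix: rows are sentences, columns are vocabulary words."""
    vocab = sorted({w for s in sentences for w in s})
    matrix = [[1 if w in set(s) else 0 for w in vocab] for s in sentences]
    return vocab, matrix


def pair_rules(vocab, matrix, min_support=0.5, min_confidence=0.8):
    """Return rules a -> b between single words, with support and confidence."""
    n = len(matrix)
    # For each word, the set of sentence indices in which it occurs.
    rows = {w: {i for i, row in enumerate(matrix) if row[j]}
            for j, w in enumerate(vocab)}
    rules = []
    for a, b in permutations(vocab, 2):
        both = len(rows[a] & rows[b])
        support = both / n                # share of sentences containing a and b
        confidence = both / len(rows[a])  # share of a-sentences also containing b
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return rules


# Hypothetical toy input (English tokens stand in for pre-processed Arabic).
sentences = [["data", "mining", "rules"], ["data", "mining"],
             ["text", "rules"], ["data", "mining", "text"]]
vocab, matrix = sentence_word_matrix(sentences)
rules = pair_rules(vocab, matrix)
```

Word pairs linked by strong rules in both directions are natural candidates for multi-word keyphrases, although how the chapter turns rules into keyphrases is not stated in the text available here.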
... In the literature, several key-phrase extraction techniques were presented, which can be summarized into five categories [13]: Rule Based Linguistic approaches [9], Statistical approaches [11], Machine Learning approaches [7,8,10], Graph-based approaches [6], and Hybrid approaches [22]. We employed rule-based approaches, where a key phrase is extracted based on word frequency and syntax analysis. ...
Conference Paper
Open innovation is a new paradigm embraced by companies to introduce transformations. It assumes that firms can and should use external and internal ideas to innovate. Recently, commercial and research projects have undergone exponential growth, leading to the open challenge of identifying possible insights on interesting aspects to work on. The existing literature has focused on the identification of goals, topics, and keywords in a single piece of text. However, insights do not have a clear structure and cannot be validated by comparing them with a straightforward ground truth, thus making their identification particularly challenging. Besides the extraction of insights from previously existing initiatives, the issue of how to present them to a company in a ranking also emerges. To overcome these two issues, we present an approach that extracts insights from a large number of projects belonging to distinct domains, by analyzing their abstracts. Our method then ranks these results to support project preparation, presenting the most relevant and most recent insights first. Our evaluation on real data from all the Horizon 2020 European projects shows the effectiveness of our approach in a concrete case study.
... In a study specifically related to a Schulz MSA learning program, Jaeni (2015) finds that his instructional book contains adequate ideologies and cultures of Arabic since it provides an appropriate portion of Islamic religious teachings by citing Quran and Hadist verses in each page. In terms of language skill contents, the study by Taufikurrahman (2015) finds that the Schulz package has contained the four language skills in a brief time and with recent topics and materials. ...
Article
An E-textbook of Modern Standard Arabic (MSA) is an important device that supports the online practical aspects of teaching Arabic as a foreign language (TAFL). The present study aimed to describe the characteristics of the design and discourse of Schulz's E-textbooks of MSA. The study focused on the Schulz MSA E-textbook as an instructional material package of standard Arabic at the intermediate levels (B1 and B2). The study used a qualitative research approach with the discourse-analysis method of data analysis. The results showed the following findings. The design of the E-package is web-based, with multimedia as resources. The content prioritizes grammar material in each initial lesson unit of the textbook. Although Modern Standard Arabic is used as the language variant for the language skills, colloquial language variants are given sufficient attention, which can be accessed only in the E-edition version of the package. In terms of the contents of the discourse, a wide selection of texts is used with varied themes in the fields of education, social affairs, economy, culture, politics, religion, environment, and technology. Citations embedded in header texts do not always match the contents of the discourse. The religious discourse pieces in the materials seem to be positioned more as socio-cultural facts than as theological facts. Keywords: E-textbook, Modern Standard Arabic, discourse, learning material
Indonesian title: DESAIN DAN WACANA BUKU PELAJARAN BAHASA ARAB BAKU (Design and Discourse of a Standard Arabic Textbook). Abstract (translated from Indonesian): An electronic textbook of standard Arabic is an important device that supports the online practical aspects of learning Arabic as a foreign language. This study aims to describe the characteristics of the design and discourse content of a standard Arabic electronic textbook. The study focuses on the e-text by Eckehard Schulz, which is widely used in Indonesia, and is limited to the intermediate level (B1-B2). The study uses a descriptive qualitative approach with discourse analysis. The results show the following findings. The design of Schulz's MSA e-textbook is web-based with multimedia as resources. Its content prioritizes grammar material at the start of each lesson. Although Arabic language skills are emphasized through the standard or modern variety, colloquial varieties receive fairly adequate attention, which can be accessed only in the e-edition version. In terms of discourse content, it covers many themes such as education, social affairs, economy, culture, politics, religion, environment, and technology. The quotations embedded in each page header do not always match the content of the discourse. Interestingly, the religious discourse in it appears to be positioned more as socio-cultural fact than as theological fact. Keywords: electronic book, standard Arabic, discourse, learning activities
Article
Arabic keyphrase extraction is a crucial task due to the significant and growing amount of Arabic text on the web generated by a huge population. It is becoming a challenge for the Arabic natural language processing community because of the severe shortage of resources and published processing systems. In this paper we propose a deep learning based approach for Arabic keyphrase extraction that achieves better performance than competing approaches. We also provide the community with a large-scale annotated dataset of about 6000 scientific abstracts, which can be used for training, validating and evaluating deep learning approaches for Arabic keyphrase extraction.
Chapter
Nowadays, text mining has become one of the most widespread fields in the analysis of natural language documents. The present study gives a comprehensive overview of text mining and its current research status. As indicated in the literature, there is a limitation in addressing Information Extraction from research articles using Data Mining techniques. The synergy between them helps to discover different interesting text patterns in the retrieved articles. In our study, we collected, and textually analyzed through various text mining techniques, three hundred refereed journal articles in the field of mobile learning from six scientific databases, namely: Springer, Wiley, Science Direct, SAGE, IEEE, and Cambridge. The selection of the collected articles was based on the criterion that all these articles should incorporate mobile learning as the main component in the higher educational context. Experimental results indicated that the Springer database represents the main source for research articles in the field of mobile education for the medical domain. Moreover, results where the similarity among topics could not be detected were due to either their interrelations or ambiguity in their meaning. Furthermore, findings showed that there was a booming increase in the number of published articles during the years 2015 through 2016. In addition, other implications and future perspectives are presented in the study.
Chapter
Keyphrase extraction is a critical step in many natural language processing and information retrieval applications. In this paper, we introduce AKEA, a keyphrase extraction algorithm for single Arabic documents. AKEA is an unsupervised algorithm, as it does not need any type of training in order to achieve its task. We rely on heuristics that combine linguistic patterns based on Part-Of-Speech (POS) tags, statistical knowledge, and the internal structural pattern of terms (i.e. word-occurrence). We use Arabic Wikipedia to improve the ranking (or significance) of candidate keyphrases by adding a confidence score if the candidate exists as an indexed Wikipedia concept. Experimental results show that, on average, AKEA has the highest precision and the highest F-measure, which indicates that it produces more accurate results than the other algorithms it was compared with.
Conference Paper
Keywords are considered an abridged version of the text which indicates the important information implied within the document. The availability of a huge amount of information on the WWW makes the process of analyzing document information and finding the proper keywords manually very difficult. Therefore, automatic keyword extraction (AKE) techniques are needed. In this paper, we tackle the problem of automatic keyword extraction from Arabic documents based on an unsupervised learning method. The main objective of this research is to propose an automatic Arabic keyword extraction (AAKE) technique for single documents using full-text based indexing. The proper feature set that improves AAKE performance is specified. A self-organizing map (SOM) neural network is used as the unsupervised learning method. The performance of the proposed technique is evaluated using recall, precision, and F-measure. Encouraging results are obtained compared with the Sakhr keyword extractor.
Conference Paper
Keyphrases are important in capturing the content of a document and thus useful for many natural language processing tasks such as Information Retrieval, Document Classification, and Text Summarization. Keyphrase extraction aims to identify multi-word sequences from a collection of documents that more or less correspond to keyphrases. In this paper, we propose a new method for keyphrase extraction based on association rule mining. Redundant multi-word sequences or synonymous phrases inevitably make up a big part of the keyphrases extracted. With association rules, we can also reduce the redundancy by grouping the related keyphrases that have strong co-occurrence frequencies. We further apply our keyphrase extraction and grouping solution to Information Retrieval. By both distinguishing and grouping keyphrases, we are able to achieve improved performance for Information Retrieval.
Article
A keyphrase is a sentence or a part of a sentence that contains a sequence of words expressing the meaning and the purpose of a given paragraph. Keyphrase extraction is the task of identifying the possible keyphrases in a given document. Many applications, including text summarization, indexing, and characterization, use keyphrase extraction. It is also an essential task for improving the performance of any information retrieval system. The internet contains a massive number of documents that may or may not have been manually assigned keyphrases. Arabic is an important world language. The number of online Arabic documents is growing rapidly, and most of them have no manually assigned keyphrases, so the user has to scan the whole of each retrieved web document. To avoid scanning the entire retrieved document, we need keyphrases assigned to each web document manually or automatically. This paper addresses the problem of identifying keyphrases in Arabic documents automatically. In this work, we provide a novel algorithm that identifies keyphrases in Arabic text. The new algorithm, Automatic Keyphrases Extraction from Arabic (AKEA), extracts keyphrases from Arabic documents automatically. In order to test the algorithm, we collected a dataset containing 100 documents from Arabic wiki; we also downloaded another 56 agricultural documents from the Food and Agricultural Organization of the United Nations (F.A.O.). The evaluation results show that the system achieves 83% precision in identifying 2-word and 3-word keyphrases from agricultural domains.
Article
The paper surveys methods and approaches for the task of keyword extraction. The systematic review of methods was gathered which resulted in a comprehensive review of existing approaches. Work related to keyword extraction is elaborated for supervised and unsupervised methods, with a special emphasis on graph-based methods. Various graph-based methods are analyzed and compared. The paper provides guidelines for future research plans and encourages the development of new graph-based approaches for keyword extraction.
Article
The rapid growth of the Internet and other computing facilities in recent years has resulted in the creation of a large amount of text in electronic form, which has increased the interest in and importance of different automatic text processing applications, including keyword extraction and term indexing. Although keywords are very useful for many applications, most documents available online are not provided with keywords. We describe a method for extracting keywords from Arabic documents. This method identifies the keywords by combining linguistics and statistical analysis of the text without using prior knowledge from its domain or information from any related corpus. The text is preprocessed to extract the main linguistic information, such as the roots and morphological patterns of derivative words. A cleaning phase is then applied to eliminate the meaningless words from the text. The most frequent terms are clustered into equivalence classes in which the derivative words generated from the same root and the non-derivative words generated from the same stem are placed together, and their count is accumulated. A vector space model is then used to capture the most frequent N-gram in the text. Experiments carried out using a real-world dataset show that the proposed method achieves good results with an average precision of 31% and average recall of 53% when tested against manually assigned keywords.
Article
In this paper we present a survey of various techniques available in text mining for keyword and keyphrase extraction. Keywords and keyphrases are very useful for analyzing large amounts of textual material quickly and for searching the internet efficiently, besides being useful for many other purposes. Keywords and keyphrases are sets of representative words of a document that give a high-level specification of the content for interested readers. They are used widely in Computer Science, especially in Information Retrieval and Natural Language Processing, and can be used for index generation, query refinement, text summarization, author assistance, etc. We also discuss some important feature selection metrics generally employed by researchers to rank candidate keywords and keyphrases according to their importance.
Conference Paper
Many academic journals and conferences require that each article include a list of keyphrases. These keyphrases should provide general information about the contents and the topics of the article. Keyphrases may save precious time for tasks such as filtering, summarization, and categorization. In this paper, we investigate automatic extraction and learning of keyphrases from scientific articles written in English. Firstly, we introduce various baseline extraction methods. Some of them, formalized by us, are very successful for academic papers. Then, we integrate these methods using different machine learning methods. The best results have been achieved by J48, an improved variant of C4.5. These results are significantly better than those achieved by previous extraction systems, regarded as the state of the art.
Conference Paper
Keywords characterize the topics discussed in a document. Extracting a small set of keywords from a single document is an important problem in text mining. We propose a hybrid structural and statistical approach to extract keywords. We represent the given document as an undirected graph, whose vertices are words in the document and the edges are labeled with a dissimilarity measure between two words, derived from the frequency of their co-occurrence in the document. We propose that central vertices in this graph are candidates as keywords. We model importance of a word in terms of its centrality in this graph. Using graph-theoretical notions of vertex centrality, we suggest several algorithms to extract keywords from the given document. We demonstrate the effectiveness of the proposed algorithms on real-life documents.
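The graph-based idea in the abstract above can be sketched as follows. The abstract does not give the dissimilarity measure or name the centrality notions used, so this sketch assumes a dissimilarity of 1 / (sentence-level co-occurrence count) and uses closeness centrality; both choices, and all function names, are illustrative assumptions.

```python
import heapq
from collections import Counter
from itertools import combinations


def cooccurrence_graph(sentences):
    """Undirected graph; edge weight = 1 / co-occurrence count (assumed measure)."""
    counts = Counter()
    for s in sentences:
        for a, b in combinations(sorted(set(s)), 2):
            counts[(a, b)] += 1
    graph = {}
    for (a, b), c in counts.items():
        graph.setdefault(a, {})[b] = 1.0 / c
        graph.setdefault(b, {})[a] = 1.0 / c
    return graph


def closeness(graph, source):
    """Closeness centrality of `source`, using Dijkstra shortest paths."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    total = sum(dist.values())
    # Reachable vertices divided by total distance; 0 for an isolated vertex.
    return (len(dist) - 1) / total if total > 0 else 0.0


def top_keywords(sentences, k=3):
    """Rank words by closeness; ties are broken alphabetically."""
    graph = cooccurrence_graph(sentences)
    scores = {w: closeness(graph, w) for w in graph}
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


# Hypothetical toy document, already tokenised into sentences.
doc = [["graph", "keyword", "centrality"], ["graph", "keyword"],
       ["graph", "document"], ["keyword", "extraction"]]
best = top_keywords(doc, k=2)
```

Words that co-occur often end up close together in the graph, so frequently co-occurring, well-connected words receive the highest centrality scores.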
Article
Many academic journals ask their authors to provide a list of about five to fifteen keywords, to appear on the first page of each article. Since these key words are often phrases of two or more words, we prefer to call them keyphrases. There is a wide variety of tasks for which keyphrases are useful, as we discuss in this paper. We approach the problem of automatically extracting keyphrases from text as a supervised learning task. We treat a document as a set of phrases, which the learning algorithm must learn to classify as positive or negative examples of keyphrases. Our first set of experiments applies the C4.5 decision tree induction algorithm to this learning task. We evaluate the performance of nine different configurations of C4.5. The second set of experiments applies the GenEx algorithm to the task. We developed the GenEx algorithm specifically for automatically extracting keyphrases from text. The experimental results support the claim that a custom-designed algorithm (GenEx), incorporating specialized procedural domain knowledge, can generate better keyphrases than a general-purpose algorithm (C4.5). Subjective human evaluation of the keyphrases generated by GenEx suggests that about 80% of the keyphrases are acceptable to human readers. This level of performance should be satisfactory for a wide variety of applications.
Conference Paper
Abstract: This paper proposes a new keyword extraction method that uses bag-of-concepts to extract keywords from Arabic text. The proposed algorithm uses a semantic vector space model instead of the traditional vector space model to group words into classes. The new method builds a word-context matrix in which synonymous words are grouped into the same class. The new approach was evaluated using a dataset consisting of three documents and compared with the Keyword Extraction from Arabic Documents using Term Equivalence Classes method; experimental results showed that the proposed method provides significant results.
Conference Paper
Keyphrases are defined as phrases that capture the main topic(s) discussed in a document; they offer a brief summary of the document's content. They have been integrated into many Text Mining applications, such as Document Summarization, Document Indexing, and Feature Extraction for Document Clustering and Classification, as a new Text Representation instead of the Full-Text Representation. This paper presents a new improvement of our previous Arabic keyphrase extraction system, which is based on the Suffix Tree Data Structure and named the KpST system, by adding linguistic patterns specialized in extracting Arabic noun phrases and using an adapted C-value multi-term extraction method to calculate keyphrase relevance and solve the sub-keyphrase problem.
Article
Automatic keyword extraction is an important research direction in text mining, natural language processing and information retrieval. Keyword extraction enables us to represent text documents in a condensed way. The compact representation of documents can be helpful in several applications, such as automatic indexing, automatic summarization, automatic classification, clustering and filtering. For instance, text classification is a domain with high dimensional feature space challenge. Hence, extracting the most important/relevant words about the content of the document and using these keywords as the features can be extremely useful. In this regard, this study examines the predictive performance of five statistical keyword extraction methods (most frequent measure based keyword extraction, term frequency-inverse sentence frequency based keyword extraction, co-occurrence statistical information based keyword extraction, eccentricity-based keyword extraction and TextRank algorithm) on classification algorithms and ensemble methods for scientific text document classification (categorization). In the study, a comprehensive study of comparing base learning algorithms (Naïve Bayes, support vector machines, logistic regression and Random Forest) with five widely utilized ensemble methods (AdaBoost, Bagging, Dagging, Random Subspace and Majority Voting) is conducted. To the best of our knowledge, this is the first empirical analysis, which evaluates the effectiveness of statistical keyword extraction methods in conjunction with ensemble learning algorithms. The classification schemes are compared in terms of classification accuracy, F-measure and area under curve values. To validate the empirical analysis, two-way ANOVA test is employed. The experimental analysis indicates that Bagging ensemble of Random Forest with the most-frequent based keyword extraction method yields promising results for text classification. 
For ACM document collection, the highest average predictive performance (93.80%) is obtained with the utilization of the most frequent based keyword extraction method with Bagging ensemble of Random Forest algorithm. In general, Bagging and Random Subspace ensembles of Random Forest yield promising results. The empirical analysis indicates that the utilization of keyword-based representation of text documents in conjunction with ensemble learning can enhance the predictive performance and scalability of text classification schemes, which is of practical importance in the application fields of text classification.
Article
A keyphrase is a sequence of words that play an important role in the identification of the topics that are embedded in a given document. Keyphrase extraction is a process which extracts such phrases. This has many important applications such as document indexing, document retrieval, search engines, and document summarization. This paper presents a framework for extracting keyphrases from Arabic news documents which is based on the KEA system. It relies on supervised learning, Naive Bayes in particular, to extract keyphrases. Two probabilities are computed: the probability of being a keyphrase and the probability of not being a keyphrase. The final set of keyphrases is chosen from the set of phrases that have high probabilities of being keyphrases. The novel contributions of the current work are that it provides insights on keyphrase extraction for news documents written in Arabic. It also presents an annotated dataset that was used in the experimentation. Finally, it uses Naive Bayes as a medium for extracting keyphrases.
Book
Text Mining: Applications and Theory presents the state-of-the-art algorithms for text mining from both the academic and industrial perspectives. The contributors span several countries and scientific domains: universities, industrial corporations, and government laboratories, and demonstrate the use of techniques from machine learning, knowledge discovery, natural language processing and information retrieval to design computational models for automated text analysis and mining. This volume demonstrates how advancements in the fields of applied mathematics, computer science, machine learning, and natural language processing can collectively capture, classify, and interpret words and their contexts. As suggested in the preface, text mining is needed when "words are not enough." This book:
  • Provides state-of-the-art algorithms and techniques for critical tasks in text mining applications, such as clustering, classification, anomaly and trend detection, and stream analysis.
  • Presents a survey of text visualization techniques and looks at the multilingual text classification problem.
  • Discusses the issue of cybercrime associated with chatrooms.
  • Features advances in visual analytics and machine learning along with illustrative examples.
  • Is accompanied by a supporting website featuring datasets.
Applied mathematicians, statisticians, practitioners and students in computer science, bioinformatics and engineering will find this book extremely useful.
Article
Purpose – The purpose of this paper is to apply local grammar (LG) to develop an indexing system which automatically extracts keywords from titles of Lebanese official journals. Design/methodology/approach – To build LG for our system, the first word that plays the determinant role in understanding the meaning of a title is analyzed and grouped as the initial state. These steps are repeated recursively for the whole words. As a new title is introduced, the first word determines which LG should be applied to suggest or generate further potential keywords based on a set of features calculated for each node of a title. Findings – The overall performance of our system is 67 per cent, which means that 67 per cent of the keywords extracted manually have been extracted by our system. This empirical result shows the validity of this study’s approach after taking into consideration the below-mentioned limitations. Research limitations/implications – The system has two limitations. First, it is applied to a sample of 5,747 titles and it can be developed to generate all finite state automata for all titles. The other limitation is that named entities are not processed due to their varieties that require specific ontology. Originality/value – Almost all keyword extraction systems apply statistical, linguistic or hybrid approaches to extract keywords from texts. This paper contributes to the development of an automatic indexing system to replace the expensive human indexing by taking advantages of LG, which is mainly applied to extract time, date and proper names from texts.
Article
This paper addresses the problem of keyword extraction from conversations, with the goal of using these keywords to retrieve, for each short conversation fragment, a small number of potentially relevant documents, which can be recommended to participants. However, even a short fragment contains a variety of words, which are potentially related to several topics; moreover, using an automatic speech recognition (ASR) system introduces errors among them. Therefore, it is difficult to infer precisely the information needs of the conversation participants. We first propose an algorithm to extract keywords from the output of an ASR system (or a manual transcript for testing), which makes use of topic modeling techniques and of a submodular reward function which favors diversity in the keyword set, to match the potential diversity of topics and reduce ASR noise. Then, we propose a method to derive multiple topically separated queries from this keyword set, in order to maximize the chances of making at least one relevant recommendation when using these queries to search over the English Wikipedia. The proposed methods are evaluated in terms of relevance with respect to conversation fragments from the Fisher, AMI, and ELEA conversational corpora, rated by several human judges. The scores show that our proposal improves over previous methods that consider only word frequency or topic similarity, and represents a promising solution for a document recommender system to be used in conversations.
Conference Paper
While automatic keyphrase extraction has been examined extensively, state-of-the-art performance on this task is still much lower than that on many core natural language processing tasks. We present a survey of the state of the art in automatic keyphrase extraction, examining the major sources of errors made by existing systems and discussing the challenges ahead.
Article
We consider the problem of discovering association rules between items in a large database of sales transactions. We present two new algorithms for solving this problem that are fundamentally different from the known algorithms. Experiments with synthetic as well as real-life data show that these algorithms outperform the known algorithms by factors ranging from three for small problems to more than an order of magnitude for large problems. We also show how the best features of the two proposed algorithms can be combined into a hybrid algorithm, called AprioriHybrid. Scale-up experiments show that AprioriHybrid scales linearly with the number of transactions. AprioriHybrid also has excellent scale-up properties with respect to the transaction size and the number of items in the database. 1 Introduction: Database mining is motivated by the decision support problem faced by most large retail organizations [S+93]. Progress in bar-code technology has made it possible for retail ...
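For readers unfamiliar with association rule mining, the sketch below shows level-wise frequent itemset discovery in the style of the basic Apriori procedure: candidates of size k are generated from frequent itemsets of size k-1 and pruned if any subset is infrequent. It is a simplified illustration, not the AprioriHybrid algorithm from the abstract, and the function name and toy transactions are assumptions.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for all itemsets meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(itemset <= t for t in transactions) / n

    items = {i for t in transactions for i in t}
    current = {frozenset([i]) for i in items}
    current = {c for c in current if support(c) >= min_support}
    frequent = {}
    k = 1
    while current:
        for c in current:
            frequent[c] = support(c)
        k += 1
        # Join step: unite pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent.
        current = {
            c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
            and support(c) >= min_support
        }
    return frequent


# Hypothetical toy sales transactions.
baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"}]
frequent = apriori(baskets, min_support=0.5)
```

Rules such as bread -> milk are then derived from the frequent itemsets by dividing the support of the pair by the support of the antecedent to obtain the confidence.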