Tunisian Arabic Corpus: A written corpus of an "unwritten" language
Abstract
Because of the many varieties of Arabic, there can never be “one” authoritative corpus of the language. To achieve the best results for language-learning resources and natural language processing, corpora for both the standard language and the spoken varieties need to be available. To this end, Tunisiya.org is a project, led by Karen McNeil and Miled Faiza, seeking to build a four-million-word corpus of Tunisian Spoken Arabic.
... By analyzing the various published works, we found that the research efforts dealing with TA mainly address basic language analyses. In this paper, we itemized 16 resources (2 lexicons [2,4], 1 ontology [5], 2 wordnets [7,8], 12 corpora [6,7,10,12,13,14,15,16,17,18,20,22] and 1 treebank [19]), 13 of which are available. Among them, we find 4 resources [7,8,10,13] that are morphologically annotated and POS tagged. ...
... Mekki et al. [19] followed these steps to syntactically annotate 1,072 sentences of the STAC corpus. Multi-dialect type. McNeil et al. [22] created a corpus for TA collected from different sources (e.g. the Web, TV drama, internet forums, literature, conversation, etc.). ...
This paper presents a critical description of natural language processing for Tunisian Arabic. Indeed, several linguistic resources were proposed for the three types of Tunisian Arabic (intellectualized dialect, spontaneous dialect and electronic dialect). We present different linguistic resources (corpora, lexicons and linguistic analysis tools). This study can be used as a quick reference for the scientific community working on natural language processing in general and more precisely those studying Tunisian Arabic.
... In [14], the author proposes a different approach to building a Wordnet for the TAD. The development of this Wordnet is based on two resources: the ANG/TD "Peace Corps Lexicon" dictionary and the corpus of [15]. ...
Social networks are the most widely used means for people to express themselves freely and give their opinion about a subject, an event, or an object. These networks present rich content that is today of interest for sentiment analysis in many fields such as politics, social sciences, marketing, and economics. However, social network users express themselves using their dialect. Thus, to help decision-makers analyse users' opinions, it is necessary to perform sentiment analysis on this dialect. This paper presents a hybrid model combining a lexicon-based approach with a modified and adapted version of a rule-based sentiment engine named VADER. The hybrid model was tested and evaluated on the Tunisian Arabic Dialect and showed good performance, reaching 85% classification accuracy.
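The abstract above combines a dialect lexicon with rule-based adjustments in the style of VADER. As a rough illustration only (this is not the authors' implementation, and the lexicon entries and negation words below are hypothetical placeholders), a minimal lexicon-plus-rule classifier might look like this:

```python
# Minimal sketch of a hybrid lexicon + rule sentiment classifier.
# LEXICON and NEGATORS are illustrative placeholders, not real
# Tunisian Arabic resources from the cited work.
LEXICON = {"behi": 1.0, "mizyen": 1.0, "khayeb": -1.0}  # hypothetical TD polarity entries
NEGATORS = {"mish", "ma"}                               # hypothetical negation markers

def classify(tokens):
    """Sum word polarities; flip the polarity of a word following a negator."""
    total, negate = 0.0, False
    for tok in tokens:
        if tok in NEGATORS:
            negate = True
            continue
        val = LEXICON.get(tok, 0.0)
        if negate and val != 0.0:
            val = -val  # VADER-style rule: negation inverts polarity
            negate = False
        total += val
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"
```

A real engine such as VADER layers many more rules (intensifiers, punctuation, capitalization) on top of this basic lookup-and-adjust loop.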
... Various Arabic corpora are being used for Arabic SA, including the Quranic Arabic Corpus [43], arabiCorpus [44], Tunisian Arabic Corpus [45], International Corpus of Arabic [46], King Abdulaziz City for Science and Technology (KACST) Arabic Corpus, KALIMAT, and Arabic Corpus. These corpora have different sizes and dialects. ...
Text classification is a prominent research area, gaining increasing interest in academia, industry and social media. Arabic is one of the world's most widely spoken languages and played a significant role in science, mathematics and philosophy in Europe in the Middle Ages. During the Arab Spring, social media platforms such as Facebook, Twitter and Instagram played an essential role in establishing, running, and spreading these movements. Arabic Sentiment Analysis (ASA) and Arabic Text Classification (ATC) for these social media tools are hot topics, aiming to obtain valuable insights from Arabic text. Although some surveys are available on this topic, the studies and research on Arabic Tweets need to be classified on the basis of machine learning algorithms. Machine learning algorithms and lexicon-based classifications are considered essential tools for text processing. In this paper, a comparison of previous surveys is presented, elaborating the need for a comprehensive study on Arabic Tweets. Research studies are classified according to machine learning algorithms, supervised learning, unsupervised learning, hybrid, and lexicon-based classifications, and their advantages and disadvantages are discussed comprehensively. We also pose different challenges and future research directions.
... Some of them cover Arabic dialects in general, including some Maghrebi dialects (Callan et al. 2009; Almeman and Lee 2013; Suwaileh et al. 2016). Others relate exclusively to the MADs, covering only one particular dialect such as the Tunisian dialect (McNeil and Faiza 2011; McNeil 2015; Younes and Souissi 2014; Younes et al. 2015; Bouchlaghem et al. 2014; Masmoudi et al. 2017; Torjmen and Haddar 2018a, b), the Algerian dialect (Abidi and Menacer 2017; Smaili 2017; Guellil et al. 2018a, b, c; Soumeur et al. 2018) and the Libyan dialect (Alhammi and Alfards 2018), or covering a subset of MADs (Adouane et al. 2016a). In Guellil et al. (2018a, b, c), two raw corpora were built from Algerian Facebook pages. ...
Diglossia is one of the main characteristics of Arabic language. In Arab countries, there are three forms of Arabic that co-exist: Classical Arabic (CA) which is mainly used in the Quran and in several classical literary texts, Modern Standard Arabic (MSA) that descends from CA and used as official language, and various regional colloquial varieties of Arabic that are usually referred to as Arabic dialects (AD). Deemed to be amongst low-resource languages, these dialects have aroused increased interest among the NLP community in recent years. Indeed, the various Arabic dialects are increasingly used on the social web and may be transcribed in both the Arabic and the Latin script. The latter is known as Arabizi and seems to be more frequently used for some of them. The AD NLP raises many challenges and requires the availability of large and appropriate language resources. In this study, we focus, in particular, on the Maghrebi Arabic dialects (MADs). We propose a thorough review of the language resources (LRs) that have been generated by the various work carried out on the MAD language processing. A survey of the currently online available MAD NLP dedicated-LRs is also compiled and discussed. LRs investigated in this work are essentially data-resources such as primary and annotated corpora, lexica, dictionaries, ontologies, etc.
... We can cite in this context, McNeil and Faiza (2011) who built a TD corpus as part of a project to create a TD-English dictionary. The corpus was then organised in a web application allowing basic linguistic processing (McNeil, 2015). Younes and Souissi (2014) and Younes et al. (2015) also used the social web to build various resources for TD. ...
... The social web has been used by several researchers to build linguistic resources for TD, given its increasing use by Tunisians and its richness in varied dialectal language productions. We cite among these corpora those exclusively constructed from the web [14], [26] and those representing a mixture of content from the web and from other sources [27,28,29,30]. ...
The language study and automatic processing require the availability of large raw and annotated corpora. Collecting data and constructing such language resources are non-trivial tasks in the NLP field, especially when it comes to deal with low-resource languages. In this paper, we are concerned with the Tunisian dialect (TD) and propose to survey the availability of corpora for its automatic processing. From the study of the main works that have been carried out in TD language processing, we were able to identify and categorize the different types of corpora that were constructed as part of these works. We present, in this paper, a summary of the identified TD corpora characteristics as well as an inventory of those which are accessible online.
The authors examine the application of electronically searchable corpora, from their own experience, in addressing questions pertinent to linguistics as a whole and to matters internal to Arabic, while lamenting that the field of Arabic linguistics, in both its theoretical and applied orientations, has not made use of the rich data source that searchable electronic corpora represent. They show how corpora can easily be used to falsify common assumptions and assertions about the human language capacity in general, just as they can be used efficiently to query assumptions and assertions about Arabic itself. So, too, do they hold implications for applied uses such as teaching Arabic as a foreign language and translation between Arabic and other languages. In all of these applications, the use of corpora in the analysis of all varieties of Arabic remains underdeveloped compared to their use in the analysis of other languages, especially English.
The wide usage of multiple spoken Arabic dialects on social networking sites stimulates increasing interest in Natural Language Processing (NLP) for dialectal Arabic (DA). Arabic dialects represent true linguistic diversity and differ from modern standard Arabic (MSA). In fact, the complexity and variety of these dialects make it insufficient to build one NLP system that is suitable for all of them. In comparison with MSA, the available datasets for various dialects are generally limited in terms of size, genre and scope. In this article, we present a novel approach that automatically develops an annotated country-level dialectal Arabic corpus and builds lists of words that encompass 15 Arabic dialects. The algorithm uses an iterative procedure consisting of two main components: automatic creation of lists for dialectal words and automatic creation of annotated Arabic dialect identification corpus. To our knowledge, our study is the first of its kind to examine and analyse the poor performance of the MSA part-of-speech tagger on dialectal Arabic contents and to exploit that in order to extract the dialectal words. The pointwise mutual information association measure and the geographical frequency of word occurrence online are used to classify dialectal words. The annotated dialectal Arabic corpus (Twt15DA), built using our algorithm, is collected from Twitter and consists of 311,785 tweets containing 3,858,459 words in total. We randomly selected a sample of 75 tweets per country, 1125 tweets in total, and conducted a manual dialect identification task by native speakers. The results show an average inter-annotator agreement score equal to 64%, which reflects satisfactory agreement considering the overlapping features of the 15 Arabic dialects.
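The abstract above classifies dialectal words using, among other signals, the pointwise mutual information (PMI) between a word and a dialect. As a minimal sketch of that association measure (the toy counts below are hypothetical; the cited work derives such counts from geographically tagged tweets), PMI can be computed from co-occurrence frequencies:

```python
import math
from collections import Counter

# Hypothetical word counts per dialect corpus (illustrative only).
counts = {
    "TN": Counter({"barsha": 40, "kitab": 10}),  # Tunisian corpus
    "EG": Counter({"awi": 35, "kitab": 15}),     # Egyptian corpus
}

def pmi(word, dialect, counts):
    """PMI(word, dialect) = log2( p(word, dialect) / (p(word) * p(dialect)) )."""
    total = sum(sum(c.values()) for c in counts.values())
    p_w = sum(c[word] for c in counts.values()) / total
    p_d = sum(counts[dialect].values()) / total
    p_wd = counts[dialect][word] / total
    if p_wd == 0:
        return float("-inf")  # word never seen in this dialect
    return math.log2(p_wd / (p_w * p_d))
```

A word concentrated in one dialect (here "barsha" in TN) gets a high positive PMI with that dialect, while a word spread evenly across dialects (here "kitab") scores near or below zero, which is the intuition behind using PMI to flag dialect-specific vocabulary.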
In this work I will be discussing the preposition fī and its use as an aspectual marker in Tunisian Arabic. Fī, as a preposition, describes a containment relationship and is roughly equivalent to the English prepositions ‘in’ and ‘into’. In addition to its use as a marker of spatial relationship, fī has developed an aspectual use in Tunisian Arabic as a marker of the progressive aspect, e.g. nušrub fī al-tāy ‘I’m drinking tea’. This feature has been sparsely attested in other varieties of Arabic (see Woidich 2006), but only in Tunisian has it developed into an integral, obligatory part of the aspectual system.