Kareem Darwish

Qatar Foundation · Qatar Computing Research Institute (QCRI)

About

145
Publications
42,609
Reads
3,873
Citations
Publications (145)
Patent
Full-text available
The presently disclosed method and system automatically diacritize written Arabic text for use with applications that require verbalizing Arabic text. A method may comprise converting a written sentence into a word sequence and identifying a target source word. The method then may comprise repeatedly overlaying and translating a context window at a...
Preprint
Full-text available
NatiQ is an end-to-end text-to-speech system for Arabic. Our speech synthesizer uses an encoder-decoder architecture with attention. We used both tacotron-based models (tacotron-1 and tacotron-2) and the faster transformer model for generating mel-spectrograms from characters. We concatenated Tacotron1 with the WaveRNN vocoder, Tacotron2 with the Wave...
Preprint
Full-text available
Rampant use of offensive language on social media led to recent efforts on automatic identification of such language. Though offensive language has general characteristics, attacks on specific entities may exhibit distinct phenomena such as malicious alterations in the spelling of names. In this paper, we present a method for identifying entity spe...
Preprint
Full-text available
This paper examines news consumption in response to a major polarizing event, and we use the May 2021 Israeli-Palestinian conflict as an example. We conduct a detailed analysis of the news consumption of more than eight thousand Twitter users who are either pro-Palestinian or pro-Israeli and authored more than 29 million tweets between January 1 an...
Preprint
Full-text available
Emotion detection is of great importance for understanding humans. Constructing annotated datasets to train automated models can be expensive. We explore the efficacy of cross-lingual approaches that would use data from a source language to build models for emotion detection in a target language. We compare three approaches, namely: i) using inhere...
Article
With the outbreak of the COVID-19 pandemic, people turned to social media to read and to share timely information including statistics, warnings, advice, and inspirational stories. Unfortunately, alongside all this useful information, there was also a new blending of medical and political misinformation and disinformation, which gave rise to the fi...
Article
On June 24, 2018, Turkey conducted a highly consequential election in which the Turkish people elected their president and parliament in the first election under a new presidential system. During the election period, the Turkish people extensively shared their political opinions on Twitter. One aspect of polarization among the electorate was suppor...
Conference Paper
Full-text available
This system demonstration paper describes ASAD: Arabic Social media Analysis and unDerstanding, a suite of seven individual modules that allows users to determine dialects, sentiment, news category, offensiveness, hate speech, adult content, and spam in Arabic tweets. The suite is made available through a web API and a web interface where users can...
Conference Paper
Full-text available
Stance detection entails ascertaining the position of a user towards a target, such as an entity, topic, or claim. Recent work that employs unsupervised classification has shown that performing stance detection on vocal Twitter users, who have many tweets on a target, can yield very high accuracy (+98%). However, such methods perform poorly or fail...
Article
Diacritics (short vowels) are typically omitted when writing Arabic text, and readers have to reintroduce them to correctly pronounce words. There are two types of Arabic diacritics: The first are core-word diacritics (CW), which specify the lexical selection, and the second are case endings (CE), which typically appear at the end of word stems and...
Article
The term natural language refers to any system of symbolic communication (spoken, signed, or written) that has evolved naturally in humans without intentional human planning and design. This distinguishes natural languages such as Arabic and Japanese from artificially constructed languages such as Esperanto or Python. Natural language processing (N...
Preprint
Full-text available
Pretraining Bidirectional Encoder Representations from Transformers (BERT) for downstream NLP tasks is a non-trivial task. We pretrained 5 BERT models that differ in the size of their training sets, mixture of formal and informal Arabic, and linguistic preprocessing. All are intended to support Arabic dialects and social media. The experiments highl...
Conference Paper
Full-text available
During the last two decades, we have progressively turned to the Internet and social media to find news, entertain conversations and share opinion. Recently, OpenAI has developed a machine learning system called GPT-2 for Generative Pre-trained Transformer-2, which can produce deepfake texts. It can generate blocks of text based on brief writing pr...
Preprint
Full-text available
During the last two decades, we have progressively turned to the Internet and social media to find news, entertain conversations and share opinion. Recently, OpenAI has developed a machine learning system called GPT-2 for Generative Pre-trained Transformer-2, which can produce deepfake texts. It can generate blocks of text based on brief writing...
Conference Paper
Full-text available
Automatic categorization of short texts, such as news headlines and social media posts, has many applications ranging from content analysis to recommendation systems. In this paper, we use such text categorization i.e., labeling the social media posts to categories like 'sports', 'politics', 'human-rights' among others, to showcase the efficacy of...
Preprint
Full-text available
The term natural language refers to any system of symbolic communication (spoken, signed or written) that has evolved naturally in humans without intentional human planning and design. This distinguishes natural languages such as Arabic and Japanese from artificially constructed languages such as Esperanto or Python. Natural language processing (NL...
Chapter
Through the use of Twitter, framing has become a prominent presidential campaign tool for politically active users. Framing is used to influence thoughts by evoking a particular perspective on an event. In this paper, we show that the COVID19 pandemic rather than being viewed as a public health issue, political rhetoric surrounding it is mostly sha...
Chapter
Twitter has become a popular social media platform in the Arab region. Some users exploit this popularity by posting unwanted advertisements for their own interest. In this paper, we present a large manually annotated dataset of advertisement (Spam) tweets in Arabic. We analyze the characteristics of these tweets that distinguish them from other tw...
Preprint
Through the use of Twitter, framing has become a prominent presidential campaign tool for politically active users. Framing is used to influence thoughts by evoking a particular perspective on an event. In this paper, we show that the COVID19 pandemic rather than being viewed as a public health issue, political rhetoric surrounding it is mostly sha...
Preprint
Full-text available
With the outbreak of the COVID-19 pandemic, people turned to social media to read and to share timely information including statistics, warnings, advice, and inspirational stories. Unfortunately, alongside all this useful information, there was also a new blending of medical and political misinformation and disinformation, which gave rise to the fi...
Article
We present a highly effective unsupervised framework for detecting the stance of prolific Twitter users with respect to controversial topics. In particular, we use dimensionality reduction to project users onto a low-dimensional space, followed by clustering, which allows us to find core users that are representative of the different stances. Our f...
Preprint
Full-text available
We present QADI, an automatically collected dataset of tweets belonging to a wide range of country-level Arabic dialects, covering 18 different countries in the MENA (Middle East and North Africa) region. Our method for building this dataset relies on applying multiple filters to identify users who belong to different countries based on their accou...
Preprint
Full-text available
Disinformation, i.e., information that is both false and means harm, thrives in social media. Most often, it is used for political purposes, e.g., to influence elections or simply to cause distrust in society. It can also target medical issues, most notably the use of vaccines. With the emergence of the COVID-19 pandemic, the political and the medi...
Preprint
Full-text available
Disinformation, i.e., information that is both false and means harm, thrives in social media. Most often, it is used for political purposes, e.g., to influence elections or simply to cause distrust in society. It can also target medical issues, most notably the use of vaccines. With the emergence of the COVID-19 pandemic, the political and the med...
Data
Disinformation, i.e., information that is both false and means harm, thrives in social media. Most often, it is used for political purposes, e.g., to influence elections or simply to cause distrust in society. It can also target medical issues, most notably the use of vaccines. With the emergence of the COVID-19 pandemic, the political and the med...
Preprint
Stance detection entails ascertaining the position of a user towards a target, such as an entity, topic, or claim. Recent work that employs unsupervised classification has shown that performing stance detection on vocal Twitter users, who have many tweets on a target, can yield very high accuracy (+98%). However, such methods perform poorly or fail...
Preprint
Full-text available
Detecting offensive language on Twitter has many applications ranging from detecting/predicting bullying to measuring polarization. In this paper, we focus on building effective Arabic offensive tweet detection. We introduce a method for building an offensive dataset that is not biased by topic, dialect, or target. We produce the largest Arabic dat...
Preprint
Full-text available
Diacritics (short vowels) are typically omitted when writing Arabic text, and readers have to reintroduce them to correctly pronounce words. There are two types of Arabic diacritics: the first are core-word diacritics (CW), which specify the lexical selection, and the second are case endings (CE), which typically appear at the end of the word stem...
Preprint
This paper addresses polarization quantification, particularly as it pertains to the nomination of Brett Kavanaugh to the US Supreme Court and his subsequent confirmation with the narrowest margin since 1881. Republican (GOP) and Democratic (DNC) senators voted overwhelmingly along party lines. In this paper, we examine political polarization conce...
Chapter
This paper addresses polarization quantification, particularly as it pertains to the nomination of Brett Kavanaugh to the US Supreme Court and his subsequent confirmation with the narrowest margin since 1881. Republican (GOP) and Democratic (DNC) senators voted overwhelmingly along party lines. In this paper, we examine political polarization conce...
Chapter
Full-text available
Social media users often employ offensive language in their communication. Detecting offensive language on Twitter has many applications ranging from detecting/predicting conflict to measuring polarization. In this paper, we focus on building effective offensive tweet detection. We show that we can rapidly build a training set using a seed list of...
Preprint
Full-text available
We introduce Tanbih, a news aggregator with intelligent analysis tools to help readers understand what's behind a news story. Our system displays news grouped into events and generates media profiles that show the general factuality of reporting, the degree of propagandistic content, hyper-partisanship, leading political ideology, general frame...
Article
Full-text available
This work introduces robust multi-dialectal part of speech tagging trained on an annotated dataset of Arabic tweets in four major dialect groups: Egyptian, Levantine, Gulf, and Maghrebi. We implement two different sequence tagging approaches. The first uses Conditional Random Fields (CRF), while the second combines word and character-based represen...
Preprint
Full-text available
On June 24, 2018, Turkey conducted a highly consequential election in which the Turkish people elected their president and parliament in the first election under a new presidential system. During the election period, the Turkish people extensively shared their political opinions on Twitter. One aspect of polarization among the electorate was suppor...
Conference Paper
Short vowels, aka diacritics, are more often omitted when writing different varieties of Arabic including Modern Standard Arabic (MSA), Classical Arabic (CA), and Dialectal Arabic (DA). However, diacritics are required to properly pronounce words, which makes diacritic restoration (a.k.a. diacritization) essential for language learning and text-to-...
Article
Full-text available
When speakers code-switch between their native language and a second language or language variant, they follow a syntactic pattern where words and phrases from the embedded language are inserted into the matrix language. This paper explores the possibility of utilizing this pattern in improving code-switching identification between Modern Standard...
Article
Full-text available
This paper describes the QC-GO team submission to the MADAR Shared Task Subtask 1 (travel domain dialect identification) and Sub-task 2 (Twitter user location identification). In our participation in both subtasks, we explored a number of approaches and system combinations to obtain the best performance for both tasks. These include deep neural net...
Preprint
Full-text available
Controversial social and political issues of the day spur people to express their opinion on social networks, often sharing links to online media articles and reposting statements from prominent members of the platforms. Discovering the stances of people and entire media on current, debatable topics is important for social statisticians and policy...
Conference Paper
Full-text available
Arabic text is typically written without short vowels (or diacritics). However, their presence is required for properly verbalizing Arabic and is hence essential for applications such as text to speech. There are two types of diacritics, namely core-word diacritics and case-endings. Most previous works on automatic Arabic diacritic recovery...
Preprint
Full-text available
We present a highly effective unsupervised method for detecting the stance of Twitter users with respect to controversial topics. In particular, we use dimensionality reduction to project users onto a low-dimensional space, followed by clustering, which allows us to find core users that are representative of the different stances. Our method has th...
Preprint
Full-text available
The diacritization process attempts to restore the short vowels in written Arabic text, which are typically omitted. This process is essential for applications such as Text-to-Speech (TTS). While diacritization of Modern Standard Arabic (MSA) still holds the lion's share, research on dialectal Arabic (DA) diacritization is very limited. In this paper, we...
Conference Paper
Full-text available
This paper introduces a new dataset of POS-tagged Arabic tweets in four major dialects along with tagging guidelines. The data, which we are releasing publicly, includes tweets in Egyptian, Levantine, Gulf, and Maghrebi, with 350 tweets for each dialect with appropriate train/test/development splits for 5-fold cross validation. We use a Conditional...
Conference Paper
Full-text available
Arabic is written as a sequence of consonants and long vowels, with short vowels normally omitted. Diacritization attempts to recover short vowels and is an essential step for Text-to-Speech (TTS) systems. Though automatic diacritization of Modern Standard Arabic (MSA) has received significant attention, limited research has been conducted on diale...
Conference Paper
Full-text available
We introduce the notion of “seminar users”, who are social media users engaged in propaganda in support of a political entity. We develop a framework that can identify such users with 84.4% precision and 76.1% recall. While our dataset is from the Arab region, omitting language-specific features has only a minor impact on classification performance...
Conference Paper
In this paper, we present quantitative and qualitative analysis of the top retweeted tweets (viral tweets) pertaining to the US presidential elections from September 1, 2016 to Election Day on November 8, 2016. For every day, we tagged the top 50 most retweeted tweets as supporting or attacking either candidate or as neutral/irrelevant. Then we anal...
Article
Full-text available
Arabic word segmentation is essential for a variety of NLP applications such as machine translation and information retrieval. Segmentation entails breaking words into their constituent stems, affixes and clitics. In this paper, we compare two approaches for segmenting four major Arabic dialects using only several thousand training examples for eac...
Article
Full-text available
Arabic dialects do not just share a common Koiné, but there are shared pan-dialectal linguistic phenomena that allow computational models for dialects to learn from each other. In this paper we build a unified segmentation model where the training data for different dialects are combined and a single model is trained. The model yields higher accura...
Article
Full-text available
In this paper we focus on the problem of question ranking in community question answering (cQA) forums in Arabic. We address the task with machine learning algorithms using advanced Arabic text representations. The latter are obtained by applying tree kernels to constituency parse trees combined with textual similarities, including word embeddings....
Conference Paper
Full-text available
Predicting the stance of social media users on a topic can be challenging, particularly for users who never express explicit stances. Earlier work has shown that using users' historical or non-relevant tweets can be used to predict stance. We build on prior work by making use of users' interaction elements, such as retweeted accounts and mentioned...
Article
Full-text available
In this paper, we present quantitative and qualitative analysis of the top retweeted tweets (viral tweets) pertaining to the US presidential elections from September 1, 2016 to Election Day on November 8, 2016. For every day, we tagged the top 50 most retweeted tweets as supporting or attacking either candidate or as neutral/irrelevant. Then we anal...
Conference Paper
Full-text available
The automated processing of Arabic Dialects is challenging due to the lack of spelling standards and to the scarcity of annotated data and resources in general. Segmentation of words into their constituent parts is an important processing building block. In this paper, we show how a segmenter can be trained using only 350 annotated tweets using neura...
Article
Full-text available
In this paper, we provide quantitative and qualitative analyses of the viral tweets related to the US presidential election. In our study, we focus on analyzing the 50 most retweeted tweets for every day during September 2016. The resulting set is composed of 1,500 viral tweets, and they were retweeted over 6.7 million times. We manually annotated th...
Conference Paper
Full-text available
In this paper, we present Farasa, a fast and accurate Arabic segmenter. Our approach is based on SVM-rank using linear kernels. We measure the performance of the segmenter in terms of accuracy and efficiency, in two NLP tasks, namely Machine Translation (MT) and Information Retrieval (IR). Farasa outperforms or is at par with the state-of-the-art...
Conference Paper
This paper examines the effect of online social network interactions on future attitudes. Specifically, we focus on how a person's online content and network dynamics can be used to predict future attitudes and stances in the aftermath of a major event. In this study, we focus on the attitudes of US Twitter users towards Islam and Muslims subsequen...
Article
Full-text available
The Paris terrorist attacks that occurred on November 13, 2015 prompted a massive response on social media including Twitter, with millions of posted tweets in the first few hours after the attacks. Most of the tweets were condemning the attacks and showing support to Parisians. One of the trending debates related to the attacks concerned possible assoc...
Article
Full-text available
The Paris attacks prompted a massive response on social media including Twitter. This paper explores the immediate response of English speakers on Twitter towards Middle Eastern refugees in Europe. We show that antagonism towards refugees is mostly coming from the United States and is mostly partisan.
Conference Paper
Full-text available
AraPlagDet is the first shared task that addresses the evaluation of plagiarism detection methods for Arabic texts. It has two sub-tasks, namely external plagiarism detection and intrinsic plagiarism detection. A total of 8 runs have been submitted and tested on the standardized corpora developed for the track. This overview paper describes these e...
Conference Paper
The recent rise of the "Islamic State of Iraq and Syria" (ISIS) has sparked significant interest in the group. We explored the tweets of a large number of Twitter users who frequently comment on this subject by either showing support or opposition. ISIS supporters dedicate on average 20% of their tweets to ISIS related content, compared to 4.5% for...
Article
Full-text available
Within a fairly short amount of time, the Islamic State of Iraq and Syria (ISIS) has managed to put large swaths of land in Syria and Iraq under their control. To many observers, the sheer speed at which this "state" was established was dumbfounding. To better understand the roots of this organization and its supporters we present a study using dat...
Article
Full-text available
This paper describes QCRI’s participation in SemEval-2015 Task 3, “Answer Selection in Community Question Answering”, which targeted real-life Web forums, and was offered in both Arabic and English. We apply a supervised machine learning approach considering a manifold of features including among others word n-grams, text similarity, sentiment analys...
Conference Paper
Full-text available
Different names may be popular in different countries. Hence, person names may give a clue to a person’s country of origin. Along with other features, mapping names to countries can be helpful in a variety of applications such as country tagging twitter users. This paper describes the collection of Arabic Twitter user names that are either written...
Conference Paper
Full-text available
This paper describes the error correction model that we used for the QALB2015 Automatic Correction of Arabic Text shared task. We employed a case-specific correction approach that handles specific error types such as dialectal word substitution and word splits and merges with the aid of a language model. We also applied corrections that are specifi...
Article
In this paper, we introduce a new approach for joint segmentation, POS tagging and dependency parsing. While joint modeling of these tasks addresses the issue of error propagation inherent in traditional pipeline architectures, it also complicates the inference task. Past research has addressed this challenge by placing constraints on the scoring...
Article
Full-text available
There is little doubt about whether social networks play a role in modern protests. This agreement has triggered an entire research avenue, in which social structure and content analysis have been central - but are typically exploited separately. Here, we combine these two approaches to shed light on the opinion evolution dynamics in Egypt during t...
Article
In the past several years, Arabic Information Retrieval (IR) has garnered significant attention. The main research interests have focused on retrieval of formal language, mostly in the news domain, with ad hoc retrieval, OCR document retrieval, and cross-language retrieval. The literature on other aspects of retrieval continues to be sparse or non-...
Chapter
We present a brief introduction to statistical machine translation for Semitic languages along with an overview of machine translation approaches. We discuss the special considerations that should be taken into account when developing SMT systems for Semitic languages. We discuss segmentation techniques for Semitic SMT; and finally we introduce a...
Article
Retrieval in many languages would benefit from language-specific processing, such as stemming or morphological analysis. However, many languages lack such processing tools, or they may be inadequate for retrieval due to language evolution. In this paper, we explore the use of Wikipedia redirects to automatically learn morphological equivalence patte...
Conference Paper
Some languages lack large knowledge bases and good discriminative features for Named Entity Recognition (NER) that can generalize to previously unseen named entities. One such language is Arabic, which: a) lacks a capitalization feature; and b) has relatively small knowledge bases, such as Wikipedia. In this work we address both problems by incorpor...
Article
Arabizi is Arabic text that is written using Latin characters. Arabizi is used to present both Modern Standard Arabic (MSA) and Arabic dialects. It is commonly used in informal settings such as social networking sites and is often mixed with English. In this paper we address the problems of: identifying Arabizi in text and converting it to Arab...
Article
A reader of a news article would often be interested in the comments of other readers on an article, because comments give insight into popular opinions or feelings toward a given piece of news. In recent years, social media platforms, such as Twitter, have become a social hub for users to communicate and express their thoughts. This includes sharin...
Conference Paper
The use of social media has profoundly affected social and political dynamics in the Arab world. In this paper, we explore Arabic microblog retrieval. We illustrate some of the challenges associated with Arabic microblog retrieval, which mainly stem from the use of different Arabic dialects that vary in lexical selection, morphology, and phone...
Conference Paper
Searching social content in general and microblogs (aka tweets) in particular has been basic and limited, especially for time-sensitive topics. The currently implemented microblog search on sites such as Twitter is based on simple word matching and retrieves the most recent microblogs that match a given query. Furthermore, a user may obtain hundred...
Conference Paper
Full-text available
Social media streams such as Twitter are regarded as faster first-hand sources of information generated by massive numbers of users. The content diffused through this channel, although noisy, provides an important complement, and sometimes even a substitute, to traditional news media reporting. In this paper, we propose a novel unsupervised approach based on t...
Conference Paper
Full-text available
Due to Arabic's morphological complexity, Arabic retrieval benefits greatly from morphological analysis -- particularly stemming. However, the best known stemming does not handle linguistic phenomena such as broken plurals and malformed stems. In this paper we propose a model of character-level morphological transformation that is trained using Wik...
Conference Paper
Full-text available
Much previous work on Transliteration Mining (TM) was conducted on short parallel snippets using limited training data, and successful methods tended to favor recall. For such methods, increasing training data may impact precision and application on large comparable texts may impact precision and recall. We adapt a state-of-the-art TM technique wit...
Conference Paper
Full-text available
Users in many regions of the world are multilingual and they issue similar queries in different languages. Given a source language query, we propose query picking which involves finding equivalent target language queries in a large query log. Query picking treats translation as a search problem, and can serve as a translation method in the context...
Conference Paper
The wide use of abbreviations in modern texts poses interesting challenges and opportunities in the field of NLP. In addition to their dynamic nature, abbreviations are highly polysemous with respect to regular words. Technologies that exhibit some level of language understanding may be adversely impacted by the presence of abbreviations. This pape...
