Kareem Darwish
Qatar Foundation · Qatar Computing Research Institute (QCRI)

About

145

Publications

42,609

Reads

3,873

Citations

Skills and Expertise

Natural Language Processing

Information Retrieval

Publications

Evaluating Multilingual Speech Translation under Realistic Conditions with Resegmentation and Terminology

Conference Paper

Jan 2023

METHOD AND SYSTEM FOR DIACRITIZING ARABIC TEXT (US20220188515A1 Patent)

Patent

Full-text available

Jun 2022

The presently disclosed method and system automatically diacritize written Arabic text for use with applications that require verbalizing Arabic text. A method may comprise converting a written sentence into a word sequence and identifying a target source word. The method then may comprise repeatedly overlaying and translating a context window at a...

NatiQ: An End-to-end Text-to-Speech System for Arabic

Preprint

Full-text available

Jun 2022

NatiQ is an end-to-end text-to-speech system for Arabic. Our speech synthesizer uses an encoder-decoder architecture with attention. We used both Tacotron-based models (Tacotron-1 and Tacotron-2) and the faster transformer model for generating mel-spectrograms from characters. We concatenated Tacotron-1 with the WaveRNN vocoder, Tacotron-2 with the Wave...

Gulf Arabic Diacritization: Guidelines, Initial Dataset, and Results

Conference Paper

Jan 2022

NatiQ: An End-to-end Text-to-Speech System for Arabic

Conference Paper

Jan 2022

Automatic Expansion and Retargeting of Arabic Offensive Language Training

Preprint

Full-text available

Nov 2021

Rampant use of offensive language on social media led to recent efforts on automatic identification of such language. Though offensive language has general characteristics, attacks on specific entities may exhibit distinct phenomena such as malicious alterations in the spelling of names. In this paper, we present a method for identifying entity spe...

News Consumption in Time of Conflict: 2021 Palestinian-Israel War as an Example

Preprint

Full-text available

Sep 2021

Kareem Darwish

This paper examines news consumption in response to a major polarizing event, and we use the May 2021 Israeli-Palestinian conflict as an example. We conduct a detailed analysis of the news consumption of more than eight thousand Twitter users who are either pro-Palestinian or pro-Israeli and authored more than 29 million tweets between January 1 an...

Cross-lingual Emotion Detection

Preprint

Full-text available

Jun 2021

Emotion detection is of great importance for understanding humans. Constructing annotated datasets to train automated models can be expensive. We explore the efficacy of cross-lingual approaches that would use data from a source language to build models for emotion detection in a target language. We compare three approaches, namely: i) using inhere...

Fighting the COVID-19 Infodemic in Social Media: A Holistic Perspective and a Call to Arms

Article

May 2021

With the outbreak of the COVID-19 pandemic, people turned to social media to read and to share timely information including statistics, warnings, advice, and inspirational stories. Unfortunately, alongside all this useful information, there was also a new blending of medical and political misinformation and disinformation, which gave rise to the fi...

Embeddings-Based Clustering for Target Specific Stances: The Case of a Polarized Turkey

Article

May 2021

On June 24, 2018, Turkey conducted a highly consequential election in which the Turkish people elected their president and parliament in the first election under a new presidential system. During the election period, the Turkish people extensively shared their political opinions on Twitter. One aspect of polarization among the electorate was suppor...

ASAD: Arabic Social media Analytics and unDerstanding

Conference Paper

Full-text available

Apr 2021

This system demonstration paper describes ASAD: Arabic Social media Analysis and unDerstanding, a suite of seven individual modules that allows users to determine dialects, sentiment, news category, offensiveness, hate speech, adult content, and spam in Arabic tweets. The suite is made available through a web API and a web interface where users can...

A Few Topical Tweets are Enough for Effective User-Level Stance Detection

Conference Paper

Full-text available

Apr 2021

Stance detection entails ascertaining the position of a user towards a target, such as an entity, topic, or claim. Recent work that employs unsupervised classification has shown that performing stance detection on vocal Twitter users, who have many tweets on a target, can yield very high accuracy (+98%). However, such methods perform poorly or fail...

Arabic Diacritic Recovery Using a Feature-rich biLSTM Model

Article

Apr 2021

Diacritics (short vowels) are typically omitted when writing Arabic text, and readers have to reintroduce them to correctly pronounce words. There are two types of Arabic diacritics: The first are core-word diacritics (CW), which specify the lexical selection, and the second are case endings (CE), which typically appear at the end of word stems and...

A Panoramic Survey of Natural Language Processing in the Arab World

Article

Mar 2021

The term natural language refers to any system of symbolic communication (spoken, signed, or written) that has evolved naturally in humans without intentional human planning and design. This distinguishes natural languages such as Arabic and Japanese from artificially constructed languages such as Esperanto or Python. Natural language processing (N...

Pre-Training BERT on Arabic Tweets: Practical Considerations

Preprint

Full-text available

Feb 2021

Pretraining Bidirectional Encoder Representations from Transformers (BERT) for downstream NLP tasks is a non-trivial task. We pretrained 5 BERT models that differ in the size of their training sets, mixture of formal and informal Arabic, and linguistic preprocessing. All are intended to support Arabic dialects and social media. The experiments highl...

BERT Transformer model for Detecting Arabic GPT2 Auto-Generated Tweets

Conference Paper

Full-text available

Jan 2021

During the last two decades, we have progressively turned to the Internet and social media to find news, entertain conversations and share opinion. Recently, OpenAI has developed a machine learning system called GPT-2 for Generative Pre-trained Transformer-2, which can produce deepfake texts. It can generate blocks of text based on brief writing pr...

BERT Transformer model for Detecting Arabic GPT2 Auto-Generated Tweets

Preprint

Full-text available

Jan 2021

During the last two decades, we have progressively turned to the Internet and social media to find news, entertain conversations and share opinion. Recently, OpenAI has developed a machine learning system called GPT-2 for Generative Pre-trained Transformer-2, which can produce deepfake texts. It can generate blocks of text based on brief writing...

A Few Topical Tweets are Enough for Effective User Stance Detection

Conference Paper

Full-text available

Jan 2021

Fighting the COVID-19 Infodemic: Modeling the Perspective of Journalists, Fact-Checkers, Social Media Platforms, Policy Makers, and the Society

Conference Paper

Jan 2021

Improving Arabic Text Categorization Using Transformer Training Diversification

Conference Paper

Full-text available

Dec 2020

Automatic categorization of short texts, such as news headlines and social media posts, has many applications ranging from content analysis to recommendation systems. In this paper, we use such text categorization, i.e., labeling social media posts with categories such as 'sports', 'politics', and 'human-rights', among others, to showcase the efficacy of...

A Panoramic Survey of Natural Language Processing in the Arab World

Preprint

Full-text available

Nov 2020

The term natural language refers to any system of symbolic communication (spoken, signed or written) that has evolved naturally in humans without intentional human planning and design. This distinguishes natural languages such as Arabic and Japanese from artificially constructed languages such as Esperanto or Python. Natural language processing (NL...

Political Framing: US COVID19 Blame Game

Chapter

Oct 2020

Through the use of Twitter, framing has become a prominent presidential campaign tool for politically active users. Framing is used to influence thoughts by evoking a particular perspective on an event. In this paper, we show that, rather than the COVID19 pandemic being viewed as a public health issue, the political rhetoric surrounding it is mostly sha...

Spam Detection on Arabic Twitter

Chapter

Oct 2020

Twitter has become a popular social media platform in the Arab region. Some users exploit this popularity by posting unwanted advertisements for their own interest. In this paper, we present a large manually annotated dataset of advertisement (Spam) tweets in Arabic. We analyze the characteristics of these tweets that distinguish them from other tw...

Political Framing: US COVID19 Blame Game

Preprint

Jul 2020

Fighting the COVID-19 Infodemic in Social Media: A Holistic Perspective and a Call to Arms

Preprint

Full-text available

Jul 2020

Unsupervised User Stance Detection on Twitter

Article

May 2020

We present a highly effective unsupervised framework for detecting the stance of prolific Twitter users with respect to controversial topics. In particular, we use dimensionality reduction to project users onto a low-dimensional space, followed by clustering, which allows us to find core users that are representative of the different stances. Our f...

Arabic Dialect Identification in the Wild

Preprint

Full-text available

May 2020

We present QADI, an automatically collected dataset of tweets belonging to a wide range of country-level Arabic dialects, covering 18 different countries in the MENA (Middle East and North Africa) region. Our method for building this dataset relies on applying multiple filters to identify users who belong to different countries based on their accou...

Fighting the COVID-19 Infodemic: Modeling the Perspective of Journalists, Fact-Checkers, Social Media Platforms, Policy Makers, and the Society

Preprint

Full-text available

Apr 2020

Disinformation, i.e., information that is both false and means harm, thrives in social media. Most often, it is used for political purposes, e.g., to influence elections or simply to cause distrust in society. It can also target medical issues, most notably the use of vaccines. With the emergence of the COVID-19 pandemic, the political and the medi...

2005.00033

Preprint

Full-text available

Apr 2020

Disinformation, i.e., information that is both false and means harm, thrives in social media. Most often, it is used for political purposes, e.g., to influence elections or simply to cause distrust in society. It can also target medical issues, most notably the use of vaccines. With the emergence of the COVID-19 pandemic, the political and the med...

COVID-19 Infodemic Twitter dataset https://github.com/firojalam/COVID-19-tweets-for-check-worthiness

Data

Apr 2020

Disinformation, i.e., information that is both false and means harm, thrives in social media. Most often, it is used for political purposes, e.g., to influence elections or simply to cause distrust in society. It can also target medical issues, most notably the use of vaccines. With the emergence of the COVID-19 pandemic, the political and the med...

A Few Topical Tweets are Enough for Effective User-Level Stance Detection

Preprint

Apr 2020

Arabic Offensive Language on Twitter: Analysis and Experiments

Preprint

Full-text available

Apr 2020

Detecting offensive language on Twitter has many applications ranging from detecting/predicting bullying to measuring polarization. In this paper, we focus on building effective Arabic offensive tweet detection. We introduce a method for building an offensive dataset that is not biased by topic, dialect, or target. We produce the largest Arabic dat...

Arabic Diacritic Recovery Using a Feature-Rich biLSTM Model

Preprint

Full-text available

Feb 2020

Diacritics (short vowels) are typically omitted when writing Arabic text, and readers have to reintroduce them to correctly pronounce words. There are two types of Arabic diacritics: the first are core-word diacritics (CW), which specify the lexical selection, and the second are case endings (CE), which typically appear at the end of the word stem...

Quantifying Polarization on Twitter: the Kavanaugh Nomination

Preprint

Jan 2020

Kareem Darwish

This paper addresses polarization quantification, particularly as it pertains to the nomination of Brett Kavanaugh to the US Supreme Court and his subsequent confirmation with the narrowest margin since 1881. Republican (GOP) and Democratic (DNC) senators voted overwhelmingly along party lines. In this paper, we examine political polarization conce...

Predicting the Topical Stance and Political Leaning of Media using Tweets

Conference Paper

Full-text available

Jan 2020

Arabic Curriculum Analysis

Conference Paper

Jan 2020

Quantifying Polarization on Twitter: The Kavanaugh Nomination

Chapter

Nov 2019

Kareem Darwish

Arabic Offensive Language Classification on Twitter

Chapter

Full-text available

Nov 2019

Social media users often employ offensive language in their communication. Detecting offensive language on Twitter has many applications ranging from detecting/predicting conflict to measuring polarization. In this paper, we focus on building effective offensive tweet detection. We show that we can rapidly build a training set using a seed list of...

Tanbih: Get To Know What You Are Reading

Preprint

Full-text available

Oct 2019

We introduce Tanbih, a news aggregator with intelligent analysis tools to help readers understand what's behind a news story. Our system displays news grouped into events and generates media profiles that show the general factuality of reporting, the degree of propagandistic content, hyper-partisanship, leading political ideology, general frame...

Effective Multi Dialectal Arabic POS Tagging

Article

Full-text available

Oct 2019

This work introduces robust multi-dialectal part-of-speech tagging trained on an annotated dataset of Arabic tweets in four major dialect groups: Egyptian, Levantine, Gulf, and Maghrebi. We implement two different sequence tagging approaches. The first uses Conditional Random Fields (CRF), while the second combines word and character-based represen...

Embedding-based Qualitative Analysis of Polarization in Turkey

Preprint

Full-text available

Sep 2019

On June 24, 2018, Turkey conducted a highly consequential election in which the Turkish people elected their president and parliament in the first election under a new presidential system. During the election period, the Turkish people extensively shared their political opinions on Twitter. One aspect of polarization among the electorate was suppor...

A System for Diacritizing Four Varieties of Arabic

Conference Paper

Aug 2019

Short vowels, a.k.a. diacritics, are often omitted when writing different varieties of Arabic including Modern Standard Arabic (MSA), Classical Arabic (CA), and Dialectal Arabic (DA). However, diacritics are required to properly pronounce words, which makes diacritic restoration (a.k.a. diacritization) essential for language learning and text-to-...

POS Tagging for Improving Code-Switching Identification in Arabic

Article

Full-text available

Aug 2019

When speakers code-switch between their native language and a second language or language variant, they follow a syntactic pattern where words and phrases from the embedded language are inserted into the matrix language. This paper explores the possibility of utilizing this pattern in improving code-switching identification between Modern Standard...

QC-GO Submission for MADAR Shared Task: Arabic Fine-Grained Dialect Identification

Article

Full-text available

Aug 2019

This paper describes the QC-GO team submission to the MADAR Shared Task Subtask 1 (travel domain dialect identification) and Subtask 2 (Twitter user location identification). In our participation in both subtasks, we explored a number of approaches and system combinations to obtain the best performance for both tasks. These include deep neural net...

Predicting the Topical Stance of Media and Popular Twitter Users

Preprint

Full-text available

Jul 2019

Controversial social and political issues of the day spur people to express their opinion on social networks, often sharing links to online media articles and reposting statements from prominent members of the platforms. Discovering the stances of people and entire media on current, debatable topics is important for social statisticians and policy...

Highly Effective Arabic Diacritization using Sequence to Sequence Modeling

Conference Paper

Full-text available

Jun 2019

Arabic text is typically written without short vowels (or diacritics). However, their presence is required for properly verbalizing Arabic and is hence essential for applications such as text to speech. There are two types of diacritics, namely core-word diacritics and case endings. Most previous works on automatic Arabic diacritic recovery...

Unsupervised User Stance Detection on Twitter

Preprint

Full-text available

Apr 2019

We present a highly effective unsupervised method for detecting the stance of Twitter users with respect to controversial topics. In particular, we use dimensionality reduction to project users onto a low-dimensional space, followed by clustering, which allows us to find core users that are representative of the different stances. Our method has th...

QC-GO Submission for MADAR Shared Task: Arabic Fine-Grained Dialect Identification

Conference Paper

Full-text available

Jan 2019

POS Tagging for Improving Code-Switching Identification in Arabic

Conference Paper

Jan 2019

Tanbih: Get To Know What You Are Reading

Conference Paper

Full-text available

Jan 2019

Diacritization of Maghrebi Arabic Sub-Dialects

Preprint

Full-text available

Oct 2018

The diacritization process attempts to restore the short vowels in written Arabic text, which are typically omitted. This process is essential for applications such as Text-to-Speech (TTS). While diacritization of Modern Standard Arabic (MSA) still holds the lion's share, research on dialectal Arabic (DA) diacritization is very limited. In this paper, we...

Multi-Dialect Arabic POS Tagging: A CRF Approach

Conference Paper

Full-text available

May 2018

This paper introduces a new dataset of POS-tagged Arabic tweets in four major dialects along with tagging guidelines. The data, which we are releasing publicly, includes tweets in Egyptian, Levantine, Gulf, and Maghrebi, with 350 tweets for each dialect with appropriate train/test/development splits for 5-fold cross validation. We use a Conditional...

Diacritization of Moroccan and Tunisian Arabic Dialects: A CRF Approach

Conference Paper

Full-text available

May 2018

Arabic is written as a sequence of consonants and long vowels, with short vowels normally omitted. Diacritization attempts to recover short vowels and is an essential step for Text-to-Speech (TTS) systems. Though automatic diacritization of Modern Standard Arabic (MSA) has received significant attention, limited research has been conducted on diale...

Seminar Users in the Arabic Twitter Sphere

Conference Paper

Full-text available

Sep 2017

We introduce the notion of “seminar users”, who are social media users engaged in propaganda in support of a political entity. We develop a framework that can identify such users with 84.4% precision and 76.1% recall. While our dataset is from the Arab region, omitting language-specific features has only a minor impact on classification performance...

Trump vs. Hillary: What Went Viral During the 2016 US Presidential Election

Conference Paper

Sep 2017

In this paper, we present a quantitative and qualitative analysis of the top retweeted tweets (viral tweets) pertaining to the US presidential elections from September 1, 2016 to Election Day on November 8, 2016. For every day, we tagged the top 50 most retweeted tweets as supporting or attacking either candidate or as neutral/irrelevant. Then we anal...

Arabic Multi-Dialect Segmentation: bi-LSTM-CRF vs. SVM

Article

Full-text available

Aug 2017

Arabic word segmentation is essential for a variety of NLP applications such as machine translation and information retrieval. Segmentation entails breaking words into their constituent stems, affixes and clitics. In this paper, we compare two approaches for segmenting four major Arabic dialects using only several thousand training examples for eac...

seg-guidelines

Data

Aug 2017

Learning from Relatives: Unified Dialectal Arabic Segmentation

Article

Full-text available

Aug 2017

Arabic dialects do not just share a common Koiné, but there are shared pan-dialectal linguistic phenomena that allow computational models for dialects to learn from each other. In this paper we build a unified segmentation model where the training data for different dialects are combined and a single model is trained. The model yields higher accura...

Language processing and learning models for community question answering in Arabic

Article

Full-text available

Aug 2017

In this paper we focus on the problem of question ranking in community question answering (cQA) forums in Arabic. We address the task with machine learning algorithms using advanced Arabic text representations. The latter are obtained by applying tree kernels to constituency parse trees combined with textual similarities, including word embeddings....

Improved Stance Prediction in a User Similarity Feature Space

Conference Paper

Full-text available

Jul 2017

Predicting the stance of social media users on a topic can be challenging, particularly for users who never express explicit stances. Earlier work has shown that users' historical or non-relevant tweets can be used to predict stance. We build on prior work by making use of users' interaction elements, such as retweeted accounts and mentioned...

Trump vs. Hillary: What went Viral during the 2016 US Presidential Election

Article

Full-text available

Jul 2017

A Neural Architecture for Dialectal Arabic Segmentation

Conference Paper

Full-text available

Apr 2017

The automated processing of Arabic Dialects is challenging due to the lack of spelling standards and to the scarcity of annotated data and resources in general. Segmentation of words into their constituent parts is an important processing building block. In this paper, we show how a segmenter can be trained using only 350 annotated tweets using neura...

Arabic Diacritization: Stats, Rules, and Hacks

Conference Paper

Jan 2017

Abusive Language Detection on Arabic Social Media

Conference Paper

Full-text available

Jan 2017

Learning from Relatives: Unified Dialectal Arabic Segmentation

Conference Paper

Jan 2017

Arabic POS Tagging: Don't Abandon Feature Engineering Just Yet

Conference Paper

Full-text available

Jan 2017

Trump vs. Hillary Analyzing Viral Tweets during US Presidential Elections 2016

Article

Full-text available

Oct 2016

In this paper, we provide quantitative and qualitative analyses of the viral tweets related to the US presidential election. In our study, we focus on analyzing the 50 most retweeted tweets for every day during September 2016. The resulting set is composed of 1,500 viral tweets, and they were retweeted over 6.7 million times. We manually annotated th...

Farasa: A Fast and Furious Segmenter for Arabic

Conference Paper

Full-text available

Jun 2016

In this paper, we present Farasa, a fast and accurate Arabic segmenter. Our approach is based on SVM-rank using linear kernels. We measure the performance of the segmenter in terms of accuracy and efficiency, in two NLP tasks, namely Machine Translation (MT) and Information Retrieval (IR). Farasa outperforms or is on par with the state-of-the-art...

#ISISisNotIslam or #DeportAllMuslims?: predicting unspoken views

Conference Paper

May 2016

This paper examines the effect of online social network interactions on future attitudes. Specifically, we focus on how a person's online content and network dynamics can be used to predict future attitudes and stances in the aftermath of a major event. In this study, we focus on the attitudes of US Twitter users towards Islam and Muslims subsequen...

Quantifying Public Response towards Islam on Twitter after Paris Attacks

Article

Full-text available

Dec 2015

The Paris terrorist attacks that occurred on November 13, 2015 prompted a massive response on social media including Twitter, with millions of posted tweets in the first few hours after the attacks. Most of the tweets were condemning the attacks and showing support to Parisians. One of the trending debates related to the attacks concerned possible assoc...

Attitudes towards Refugees in Light of the Paris Attacks

Article

Full-text available

Dec 2015

The Paris attacks prompted a massive response on social media including Twitter. This paper explores the immediate response of English speakers on Twitter towards Middle Eastern refugees in Europe. We show that antagonism towards refugees is mostly coming from the United States and is mostly partisan.

Overview of the AraPlagDet PAN@FIRE2015 Shared Task on Arabic Plagiarism Detection

Conference Paper

Full-text available

Dec 2015

AraPlagDet is the first shared task that addresses the evaluation of plagiarism detection methods for Arabic texts. It has two sub-tasks, namely external plagiarism detection and intrinsic plagiarism detection. A total of 8 runs have been submitted and tested on the standardized corpora developed for the track. This overview paper describes these e...

"I like ISIS, but I want to watch Chris Nolan's new movie"

Conference Paper

Aug 2015

The recent rise of the "Islamic State of Iraq and Syria" (ISIS) has sparked significant interest in the group. We explored the tweets of a large number of Twitter users who frequently comment on this subject by either showing support or opposition. ISIS supporters dedicate on average 20% of their tweets to ISIS-related content, compared to 4.5% for...

#FailedRevolutions: Using Twitter to Study the Antecedents of ISIS Support

Article

Full-text available

Mar 2015

Within a fairly short amount of time, the Islamic State of Iraq and Syria (ISIS) has managed to put large swaths of land in Syria and Iraq under their control. To many observers, the sheer speed at which this "state" was established was dumbfounding. To better understand the roots of this organization and its supporters we present a study using dat...

QCRI: Answer Selection for Community Question Answering – Experiments for Arabic and English

Article

Full-text available

Jan 2015

This paper describes QCRI’s participation in SemEval-2015 Task 3 “Answer Selection in Community Question Answering”, which targeted real-life Web forums, and was offered in both Arabic and English. We apply a supervised machine learning approach considering a manifold of features including, among others, word n-grams, text similarity, sentiment analys...

Randomized Greedy Inference for Joint Segmentation, POS Tagging and Dependency Parsing

Conference Paper

Jan 2015

Classifying Arab Names Geographically

Conference Paper

Full-text available

Jan 2015

Different names may be popular in different countries. Hence, person names may give a clue to a person’s country of origin. Along with other features, mapping names to countries can be helpful in a variety of applications such as country tagging Twitter users. This paper describes the collection of Arabic Twitter user names that are either written...

QCRI@QALB-2015 Shared Task: Correction of Arabic Text for Native and Non-Native Speakers’ Errors

Conference Paper

Full-text available

Jan 2015

This paper describes the error correction model that we used for the QALB2015 Automatic Correction of Arabic Text shared task. We employed a case-specific correction approach that handles specific error types such as dialectal word substitution and word splits and merges with the aid of a language model. We also applied corrections that are specifi...

Randomized greedy inference for joint segmentation, POS tagging and dependency parsing

Article

Jan 2015

In this paper, we introduce a new approach for joint segmentation, POS tagging and dependency parsing. While joint modeling of these tasks addresses the issue of error propagation inherent in traditional pipeline architectures, it also complicates the inference task. Past research has addressed this challenge by placing constraints on the scoring...

Content and Network Dynamics Behind Egyptian Political Polarization on Twitter

Article

Full-text available

Oct 2014

There is little doubt that social networks play a role in modern protests. This agreement has triggered an entire research avenue, in which social structure and content analysis have been central but are typically exploited separately. Here, we combine these two approaches to shed light on the opinion evolution dynamics in Egypt during t...

Simple Effective Named Entity Recognition for Microblogs

Conference Paper

Full-text available

May 2014

Arabic Information Retrieval

Article

Mar 2014

Kareem Darwish

In the past several years, Arabic Information Retrieval (IR) has garnered significant attention. The main research interests have focused on retrieval of formal language, mostly in the news domain, with ad hoc retrieval, OCR document retrieval, and cross-language retrieval. The literature on other aspects of retrieval continues to be sparse or non-...

Statistical Machine Translation

Chapter

Mar 2014

We present a brief introduction to statistical machine translation for Semitic languages along with an overview of machine translation approaches. We discuss the special considerations that should be taken into account when developing SMT systems for Semitic languages. We discuss segmentation techniques for Semitic SMT; and finally we introduce a...

Query term expansion by automatic learning of morphological equivalence patterns from Wikipedia

Article

Jan 2014

Retrieval in many languages would benefit from language-specific processing, such as stemming or morphological analysis. However, many languages lack such processing tools, or they may be inadequate for retrieval due to language evolution. In this paper, we explore the use of Wikipedia redirects to automatically learn morphological equivalence patte...

Automatic Correction of Arabic Text: a Cascaded Approach

Conference Paper

Full-text available

Jan 2014

Using Twitter to Collect a Multi-Dialectal Corpus of Arabic

Conference Paper

Full-text available

Jan 2014

Verifiably Effective Arabic Dialect Identification

Conference Paper

Full-text available

Jan 2014

Structural And Semantic Evolution Of Egyptian Political Polarization On Twitter

Conference Paper

Full-text available

Jan 2014

Named Entity Recognition using Cross-lingual Resources: Arabic as an Example

Conference Paper

Aug 2013

Kareem Darwish

Some languages lack large knowledge bases and good discriminative features for Named Entity Recognition (NER) that can generalize to previously unseen named entities. One such language is Arabic, which: a) lacks a capitalization feature; and b) has relatively small knowledge bases, such as Wikipedia. In this work we address both problems by incorpor...

Translating Dialectal Arabic to English

Conference Paper

Aug 2013

Arabizi Detection and Conversion to Arabic

Article

Jun 2013

Kareem Darwish

Arabizi is Arabic text that is written using Latin characters. Arabizi is used to present both Modern Standard Arabic (MSA) and Arabic dialects. It is commonly used in informal settings such as social networking sites and is often mixed with English. In this paper we address the problems of: identifying Arabizi in text and converting it to Arab...

Detecting Comments on News Articles in Microblogs

Article

Jan 2013

A reader of a news article would often be interested in the comments of other readers on an article, because comments give insight into popular opinions or feelings toward a given piece of news. In recent years, social media platforms, such as Twitter, have become a social hub for users to communicate and express their thoughts. This includes sharin...

Language processing for Arabic microblog retrieval

Conference Paper

Oct 2012

The use of social media has profoundly affected social and political dynamics in the Arab world. In this paper, we explore Arabic microblog retrieval. We illustrate some of the challenges associated with Arabic microblog retrieval, which mainly stem from the use of different Arabic dialects that vary in lexical selection, morphology, and phone...

A summarization tool for time-sensitive social media

Conference Paper

Oct 2012

Searching social content in general and microblogs (aka tweets) in particular has been basic and limited, especially for time-sensitive topics. The currently implemented microblog search on sites such as Twitter is based on simple word matching and retrieves the most recent microblogs that match a given query. Furthermore, a user may obtain hundred...

Joint topic modeling for event summarization across news and social media streams

Conference Paper

Full-text available

Oct 2012

Social media streams such as Twitter are regarded as faster first-hand sources of information generated by massive numbers of users. The content diffused through this channel, although noisy, provides an important complement, and sometimes even a substitute, to traditional news media reporting. In this paper, we propose a novel unsupervised approach based on t...

Stemming techniques of Arabic Language Comparative Study from the Information Retrieval Perspective 2009

Data

Full-text available

Sep 2012

Arabic retrieval revisited: Morphological hole filling

Conference Paper

Full-text available

Jul 2012

Due to Arabic's morphological complexity, Arabic retrieval benefits greatly from morphological analysis -- particularly stemming. However, the best known stemming does not handle linguistic phenomena such as broken plurals and malformed stems. In this paper we propose a model of character-level morphological transformation that is trained using Wik...

Transliteration Mining Using Large Training and Test Sets

Conference Paper

Full-text available

Jan 2012

Much previous work on Transliteration Mining (TM) was conducted on short parallel snippets using limited training data, and successful methods tended to favor recall. For such methods, increasing training data may impact precision and application on large comparable texts may impact precision and recall. We adapt a state-of-the-art TM technique wit...

Is a Query Worth Translating: Ask the Users!

Conference Paper

Full-text available

Apr 2011

Users in many regions of the world are multilingual and they issue similar queries in different languages. Given a source language query, we propose query picking which involves finding equivalent target language queries in a large query log. Query picking treats translation as a search problem, and can serve as a translation method in the context...

ICE-TEA: In-Context Expansion and Translation of English Abbreviations

Conference Paper

Feb 2011

The wide use of abbreviations in modern texts poses interesting challenges and opportunities in the field of NLP. In addition to their dynamic nature, abbreviations are highly polysemous with respect to regular words. Technologies that exhibit some level of language understanding may be adversely impacted by the presence of abbreviations. This pape...

Network

Bruno Gonçalves
New York University
Svetlana Kiritchenko
National Research Council Canada
Christopher D. Manning
Stanford University
Stephan Vogel
Ahmed Rafea
The American University in Cairo

Ibrahim Bounhas
Carthage University
Khaled Shaalan
British University in Dubai
Giovanni Da San Martino
University of Padova
Preslav Nakov
Qatar Computing Research Institute
Houda Bouamor
Computer Sciences Laboratory for Mechanics and Engineering Sciences