Ahmed Abdelali’s research while affiliated with Qatar Computing Research Institute and other places


Publications (94)


ALLaM: Large Language Models for Arabic and English
  • Preprint

July 2024 · 85 Reads · 2 Citations

M Saiful Bari · Yazeed Alnumay · Norah A. Alzahrani · [...] · Haidar Khan

We present ALLaM: Arabic Large Language Model, a series of large language models to support the ecosystem of Arabic Language Technologies (ALT). ALLaM is carefully trained considering the values of language alignment and knowledge transfer at scale. Our autoregressive decoder-only models demonstrate how second-language acquisition via vocabulary expansion and pretraining on a mixture of Arabic and English text can steer a model towards a new language (Arabic) without any catastrophic forgetting in the original language (English). Furthermore, we highlight the effectiveness of using parallel/translated data to aid the process of knowledge alignment between languages. Finally, we show that extensive alignment with human preferences can significantly enhance the performance of a language model compared to models of a larger scale with lower quality alignment. ALLaM achieves state-of-the-art performance in various Arabic benchmarks, including MMLU Arabic, ACVA, and Arabic Exams. Our aligned models improve in both Arabic and English over their base models.
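The vocabulary-expansion idea the abstract mentions can be illustrated with a toy sketch. This is not ALLaM's actual procedure or code; it only shows one common heuristic for growing a tokenizer vocabulary with new-language tokens before continued pretraining: append the unseen tokens and initialize each new embedding row to the mean of the existing embeddings.

```python
import random

def expand_vocab(vocab, embeddings, new_tokens):
    """Append unseen tokens; initialize their vectors to the mean embedding."""
    dim = len(embeddings[0])
    mean_vec = [sum(e[i] for e in embeddings) / len(embeddings)
                for i in range(dim)]
    for tok in new_tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
            embeddings.append(list(mean_vec))
    return vocab, embeddings

random.seed(0)
vocab = {"the": 0, "cat": 1}
emb = [[random.random() for _ in range(4)] for _ in vocab]
# Add hypothetical new Arabic tokens to an English-only vocabulary.
vocab, emb = expand_vocab(vocab, emb, ["كتاب", "قلم"])
print(len(vocab), len(emb))  # 4 4
```

In practice the new rows would then be trained on the mixed Arabic/English corpus; mean-initialization simply gives them a reasonable starting point inside the existing embedding distribution.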


LAraBench: Benchmarking Arabic AI with Large Language Models
  • Conference Paper
  • Full-text available

July 2024 · 77 Reads · 22 Citations

Recent advancements in Large Language Models (LLMs) have significantly influenced the landscape of language and speech research. Despite this progress, these models lack specific benchmarking against state-of-the-art (SOTA) models tailored to particular languages and tasks. LAraBench addresses this gap for Arabic Natural Language Processing (NLP) and Speech Processing tasks, including sequence tagging and content classification across different domains. We utilized models such as GPT-3.5-turbo, GPT-4, BLOOMZ, Jais-13b-chat, Whisper, and USM, employing zero and few-shot learning techniques to tackle 33 distinct tasks across 61 publicly available datasets. This involved 98 experimental setups, encompassing ∼296K data points, ∼46 hours of speech, and 30 sentences for Text-to-Speech (TTS). This effort resulted in 330+ sets of experiments. Our analysis focused on measuring the performance gap between SOTA models and LLMs. The overarching trend observed was that SOTA models generally outperformed LLMs in zero-shot learning, with a few exceptions. Notably, larger computational models with few-shot learning techniques managed to reduce these performance gaps. Our findings provide valuable insights into the applicability of LLMs for Arabic NLP and speech processing tasks.


Figure previews: quantifying concept alignment CALIGN (%) in German-English concepts across multilingual models (base vs. fine-tuned); concept alignment in mBERT (fine-tuned German-English NER vs. zero-shot French and Spanish); sample overlapping concepts, learned concepts, and concept pairs in the German-English mT5 model.
Exploring Alignment in Shared Cross-lingual Spaces

May 2024 · 21 Reads

Despite their remarkable ability to capture linguistic nuances across diverse languages, questions persist regarding the degree of alignment between languages in multilingual embeddings. Drawing inspiration from research on high-dimensional representations in neural language models, we employ clustering to uncover latent concepts within multilingual models. Our analysis focuses on quantifying the alignment and overlap of these concepts across various languages within the latent space. To this end, we introduce two metrics aimed at quantifying these aspects, enabling a deeper exploration of multilingual embeddings. Our study encompasses three multilingual models (mT5, mBERT, and XLM-R) and three downstream tasks (Machine Translation, Named Entity Recognition, and Sentiment Analysis). Key findings from our analysis include: i) deeper layers in the network demonstrate increased cross-lingual alignment due to the presence of language-agnostic concepts, ii) fine-tuning of the models enhances alignment within the latent space, and iii) such task-specific calibration helps in explaining the emergence of zero-shot capabilities in the models. The code is available at https://github.com/baselmousi/multilingual-latent-concepts
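A toy sketch can make the clustering-based alignment idea concrete. This is hypothetical code, not the paper's actual metric implementation: given cluster assignments for tokens drawn from two languages, it scores a cluster as "aligned" when every contributing language holds at least a minimum share of the cluster's tokens, and reports the percentage of aligned clusters.

```python
from collections import Counter

def alignment_score(clusters, min_share=0.2):
    """clusters: list of lists of (token, lang) pairs -> % of aligned clusters."""
    aligned = 0
    for cluster in clusters:
        langs = Counter(lang for _, lang in cluster)
        total = sum(langs.values())
        # A cluster is "aligned" when more than one language is present
        # and no language is a negligible minority.
        if len(langs) > 1 and all(c / total >= min_share for c in langs.values()):
            aligned += 1
    return 100.0 * aligned / len(clusters)

clusters = [
    [("dog", "en"), ("Hund", "de"), ("cat", "en"), ("Katze", "de")],  # mixed
    [("run", "en"), ("walk", "en"), ("jump", "en")],                  # English-only
]
print(alignment_score(clusters))  # 50.0
```

In the paper's setting, the clusters would come from clustering contextual representations at a given layer, which is how layer-wise alignment trends can be compared.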


Ara-CANINE: Character-Based Pre-Trained Language Model for Arabic Language Understanding

March 2024 · 127 Reads

International Journal on Cybernetics & Informatics

Recent advancements in the field of natural language processing have markedly enhanced the capability of machines to comprehend human language. However, as language models progress, they require continuous architectural enhancements and different approaches to text processing. One significant challenge stems from the rich diversity of languages, each characterized by its distinctive grammar, resulting in decreased accuracy of language models for specific languages, especially low-resource ones. This limitation is exacerbated by the reliance of existing NLP models on rigid tokenization methods, rendering them susceptible to issues with previously unseen or infrequent words. Additionally, models based on word and subword tokenization are vulnerable to minor typographical errors, whether they occur naturally or result from adversarial misspellings. To address these challenges, this paper utilizes a recently proposed tokenization-free method, CANINE, to enhance natural language understanding. Specifically, we employ this method to develop a tokenization-free Arabic language model. We evaluate our model's performance across eight tasks from the Arabic Language Understanding Evaluation (ALUE) benchmark, and conduct a comparative analysis pitting our tokenization-free model against existing Arabic language models that rely on subword tokenization. By making our pre-training and fine-tuning models accessible to the Arabic NLP community, we aim to facilitate the replication of our experiments and contribute to the advancement of Arabic language processing capabilities. To further support reproducibility and open-source collaboration, the complete source code and model checkpoints will be made publicly available on our Hugging Face page.
In conclusion, the results of our study demonstrate that the tokenization-free approach exhibits comparable performance to established Arabic language models that utilize subword tokenization techniques. Notably, in certain tasks, our model surpasses the performance of some of these existing models. This evidence underscores the efficacy of tokenization-free processing for the Arabic language, particularly in specific linguistic contexts.
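The robustness argument behind character-level models can be shown with a minimal sketch. This is illustrative only, not Ara-CANINE's actual pipeline: a CANINE-style model consumes Unicode code points directly, so any string, including misspellings and unseen words, maps to well-defined inputs, whereas a subword tokenizer (here a toy greedy longest-match with a tiny hypothetical vocabulary) must fall back to unknown-token markers.

```python
def char_encode(text):
    """Character-level encoding: every string maps to code points, no OOV."""
    return [ord(ch) for ch in text]

# Toy subword vocabulary (hypothetical), for contrast.
subword_vocab = {"الع", "ربية", "اللغة"}

def subword_encode(text, vocab):
    """Greedy longest-match; unseen spans fall back to an <unk> marker."""
    out, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                out.append(text[i:j]); i = j; break
        else:
            out.append("<unk>"); i += 1
    return out

word = "العربيه"  # common spelling variant ("ه" in place of "ة")
print(subword_encode(word, subword_vocab))  # falls back to <unk> markers
print(char_encode(word))                    # always well-defined integers
```

The same contrast applies to adversarial misspellings: character encoding degrades gracefully, while rigid subword segmentations can change drastically from a single swapped letter.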




Figure 1: The architecture of the LLMeBench framework. The dotted boxes represent the core implemented modules of the architecture. Customization for new tasks, datasets and models can be done on Dataset, Model, Evaluation, and Asset modules.
LLMeBench: A Flexible Framework for Accelerating LLMs Benchmarking

August 2023 · 97 Reads

The recent development and success of Large Language Models (LLMs) necessitate an evaluation of their performance across diverse NLP tasks in different languages. Although several frameworks have been developed and made publicly available, their customization capabilities for specific tasks and datasets are often complex for different users. In this study, we introduce the LLMeBench framework. Initially developed to evaluate Arabic NLP tasks using OpenAI's GPT and BLOOM models, it can be seamlessly customized for any NLP task and model, regardless of language. The framework also features zero- and few-shot learning settings. A new custom dataset can be added in less than 10 minutes, and users can use their own model API keys to evaluate the task at hand. The framework has already been tested on 31 unique NLP tasks using 53 publicly available datasets within 90 experimental setups, involving approximately 296K data points. We plan to open-source the framework for the community (https://github.com/qcri/LLMeBench/). A video demonstrating the framework is available online (https://youtu.be/FkQn4UjYA0s).
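The dataset/model/evaluation decomposition described above can be sketched generically. This is not LLMeBench's actual API (see the GitHub repository for the real interface); it only illustrates the design idea that a new task plugs in by supplying three small pieces: a dataset, a model callable, and a metric.

```python
def run_benchmark(dataset, model_fn, metric_fn):
    """Run model_fn over the dataset and score predictions with metric_fn."""
    preds = [model_fn(example["input"]) for example in dataset]
    golds = [example["label"] for example in dataset]
    return metric_fn(preds, golds)

# Toy components standing in for a real dataset loader, a model API call,
# and an evaluation metric (all hypothetical).
dataset = [{"input": "رائع", "label": "pos"}, {"input": "سيئ", "label": "neg"}]
model_fn = lambda text: "pos" if text == "رائع" else "neg"
accuracy = lambda p, g: sum(a == b for a, b in zip(p, g)) / len(g)

print(run_benchmark(dataset, model_fn, accuracy))  # 1.0
```

Because the three pieces are independent, swapping the model callable for a different provider's API, or the metric for a task-specific one, leaves the rest of the harness untouched, which is what makes this style of framework quick to extend.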


Towards Generalization of Machine Learning Models: A Case Study of Arabic Sentiment Analysis

June 2023 · 19 Reads · 3 Citations

Proceedings of the International AAAI Conference on Web and Social Media

The abundance of social media data in the Arab world, specifically on Twitter, enabled companies and entities to exploit such rich and beneficial data that could be mined and used to extract important information, including sentiments and opinions of people towards a topic or a merchandise. However, with this plenitude comes the issue of producing models that are able to deliver consistent outcomes when tested within various contexts. Although model generalization has been thoroughly investigated in many fields, it has not been heavily investigated in the Arabic context. To address this gap, we investigate the generalization of models and data in Arabic with application to sentiment analysis, by performing a battery of experiments and building different models that are tested on five independent test sets to understand their performance when presented with unseen data. In doing so, we detail different techniques that improve the generalization of machine learning models in Arabic sentiment analysis, and share a large versatile dataset consisting of approximately 1.64M Arabic tweets and their corresponding sentiment to be used for future research. Our experiments concluded that the most consistent model is trained using a dataset labelled by a cascaded approach of two models, one that labels neutral tweets and another that identifies positive/negative tweets based on the Arabic emoji lexicon after class balancing. Both the BERT and the SVM models trained using the refined data achieve an average F-1 score of 0.62 and 0.60, and standard deviation of 0.06 and 0.04 respectively, when evaluated on five diverse test sets, outperforming other models by at least 17% relative gain in F-1. Based on our experiments, we share recommendations to improve model generalization for classification tasks.
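The cascaded labelling scheme described in the abstract can be sketched as follows. All helper names and lexicon entries here are hypothetical stand-ins, not the authors' actual models: a first stage filters out neutral tweets, and a second stage assigns positive/negative polarity based on an emoji lexicon.

```python
# Tiny stand-in emoji lexicon (illustrative only).
POS_EMOJIS = {"😂", "❤", "😍"}
NEG_EMOJIS = {"😡", "💔", "😢"}

def is_neutral(tweet):
    # Stand-in for the first model: no sentiment-bearing emoji -> neutral.
    return not any(ch in POS_EMOJIS | NEG_EMOJIS for ch in tweet)

def emoji_polarity(tweet):
    # Stand-in for the second model: majority emoji polarity.
    pos = sum(ch in POS_EMOJIS for ch in tweet)
    neg = sum(ch in NEG_EMOJIS for ch in tweet)
    return "positive" if pos >= neg else "negative"

def cascade_label(tweet):
    """Two-stage cascade: neutral filter first, then polarity."""
    return "neutral" if is_neutral(tweet) else emoji_polarity(tweet)

print(cascade_label("الخدمة ممتازة ❤"))  # positive
print(cascade_label("تجربة سيئة 💔"))    # negative
print(cascade_label("الاجتماع غدا"))     # neutral
```

In the paper the two stages are learned models and the lexicon-derived labels are further refined with class balancing; the sketch only shows the cascade's control flow.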


Summary of the test sets and their sizes used in evaluating the speech processing tasks.
Benchmarking Arabic AI with Large Language Models

May 2023 · 897 Reads · 2 Citations

With large Foundation Models (FMs), language technologies (AI in general) are entering a new paradigm: eliminating the need for developing large-scale task-specific datasets and supporting a variety of tasks through set-ups ranging from zero-shot to few-shot learning. However, understanding FMs' capabilities requires a systematic benchmarking effort that compares FM performance with state-of-the-art (SOTA) task-specific models. With that goal, past work focused on the English language and included a few efforts with multiple languages. Our study contributes to ongoing research by evaluating FM performance for standard Arabic NLP and speech processing, including a range of tasks from sequence tagging to content classification across diverse domains. We start with zero-shot learning using GPT-3.5-turbo, Whisper, and USM, addressing 33 unique tasks using 59 publicly available datasets, resulting in 96 test setups. For a few tasks, FMs perform on par with or exceed the SOTA models, but for the majority they underperform. Given the importance of prompts for FM performance, we discuss our prompt strategies in detail and elaborate on our findings. Our future work on Arabic AI will explore few-shot prompting, expand the range of tasks, and investigate additional open-source models.



Citations (64)


... To address this issue, several Arabic LLMs have been developed with various architectures. AraBERT [15], QARiB [16], JABER and SABER [17], CAMeLBERT [18], and AraELECTRA [19] are examples of encoder-only models, while ARABART [20], ARAGPT2 [21], Jais [22], and ALLaM [23] are categorized as decoder-only models [22]. Recently, the AceGPT model has gained prominence for its emphasis on cultural sensitivity and the incorporation of local values, setting it apart from other Arabic LLMs. ...

Reference:

InfectA-Chat, an Arabic Large Language Model for Infectious Diseases: Comparative Analysis
ALLaM: Large Language Models for Arabic and English
  • Citing Preprint
  • July 2024

... Although many different LLMs have been developed, to the best of our knowledge, most Arabic NLP studies have primarily focused on GPT-based models. In [5], the performance of four LLMs (GPT3.5-turbo, GPT4, Jais-13bchat, and BLOOMZ) was evaluated on different Arabic NLP tasks, and their performance was compared with State-Of-The-Art (SOTA) models. ...

LAraBench: Benchmarking Arabic AI with Large Language Models

... In Arabic NLP, algorithmic bias can occur when models are optimised using techniques that favour the syntactic structures of English, resulting in poor handling of the more complex morphology and syntax found in Arabic. For example, Arabic's rich inflectional system and root-based morphology can lead to incorrect tokenisation or parsing by models not explicitly designed for the language's intricacies (Abdelali et al., 2022;H. Al-Khalifa et al., 2024). ...

Post-hoc analysis of Arabic transformer models
  • Citing Conference Paper
  • January 2022

... Since millions of people throughout the world place a high value on the Quran, accurately classifying recitations is both a technological advancement and a way to preserve and value the various artistic and spiritual facets of Quranic recitation. By offering a deeper comprehension of the distinctive qualities ingrained in each reciter's version, a novel technique has the potential to transform the field [8, 9]. A revolutionary advancement in the automated analysis of Quranic recitations is made possible by combining audio recordings from several reciters with sophisticated classification models into a more complex and culturally sensitive voice classification system [10, 11]. The associated works have limitations and challenges because of the possible limitation of a small dataset size, which can affect the model's capacity to generalize, especially when it comes to holy Quranic recitations. ...

NatiQ: An End-to-end Text-to-Speech System for Arabic
  • Citing Conference Paper
  • January 2022

... The goal of sentiment analysis [4,5] is to ascertain the opposite spectrum of feelings such as joy, sorrow, depression, anger, enmity, and love as well as opinions from the material, reviews, and posts that are posted publicly on these websites. As a result, reading through social media posts could help people comprehend what the general population perceives [6][7][8]. For instance, business managers can learn what aspects of their goods should be changed by examining the tone of feedback from customers. ...

Towards Generalization of Machine Learning Models: A Case Study of Arabic Sentiment Analysis
  • Citing Article
  • June 2023

Proceedings of the International AAAI Conference on Web and Social Media

... Leveraging existing datasets to establish benchmarks. It is noteworthy that a few datasets have already been integrated into benchmarking efforts like ORCA (Elmadany et al., 2023), ALUE (Seelawi et al., 2021) and LAraBench (Abdelali et al., 2023). However, the vast majority of the datasets have been exploited in experiments conducted solely by their authors. ...

Benchmarking Arabic AI with Large Language Models

... One approach involves modifying mono-lingual sentences into CS sentences, allowing the generator to use information from monolingual text by using generative adversarial networks (GAN) with reinforcement learning (RL) to automatically generate CS data from monolingual sentences [9]. Other techniques applied to neural machine translation have been explored to automate the generation process [10,11,12]. Recently, [13] assessed the translation ability of state-of-theart LLMs for CS translation tasks, demonstrating high performance. ...

Textual Data Augmentation for Arabic-English Code-Switching Speech Recognition

... The notable emphasis on health topics can be attributed to many datasets being developed in the aftermath of the COVID-19 pandemic. Consequently, many of these datasets concentrate on subjects associated with COVID-19, such as vaccinations, as exemplified by Elhadad et al. (2020) and Alam et al. (2021). Additionally, these datasets and others included general rumors about COVID-19. ...

Fighting the COVID-19 Infodemic in Social Media: A Holistic Perspective and a Call to Arms
  • Citing Article
  • May 2021

Proceedings of the International AAAI Conference on Web and Social Media

... Fact-checking: The problem of misleading information spreading online has been a major one, and there has been significant research effort in the past years, including multilingual resource development to support fact-checkers and journalists [1, 25], evidence retrieval [27, 57, 9], and automated fact-checking pipelines [18, 30]. The advancement of fact-checking has focused not only on the textual modality but also on visual and multimodal inputs [7], which significantly benefit fact-checking and content moderation tools [56, 23]. ...

Fighting the COVID-19 Infodemic: Modeling the Perspective of Journalists, Fact-Checkers, Social Media Platforms, Policy Makers, and the Society
  • Citing Conference Paper
  • January 2021

... Research on Arabic dialects, including Moroccan Darija, has emphasized the importance of large and diverse datasets [2, 7-10]; these studies stressed the need for dialect-specific adaptations to enhance ASR accuracy. The Whisper ASR models by OpenAI leverage large-scale pre-training on multilingual datasets [2], achieving high accuracy in transcribing various languages, although specific studies on Moroccan Darija have been limited. ...

Towards One Model to Rule All: Multilingual Strategy for Dialectal Code-Switching Arabic ASR