Md Mahfuz Ibn Alam’s scientific contributions


Publications (7)


Figure 2: Map of Italy with COMET scores for different regions.
Figure 3: Map of Switzerland with COMET scores for different regions.
Figure 4: Map of Bangladesh with COMET scores for different regions.
Figure 5: Average COMET score of all the dialects of languages with more than one variety.
Comparable COMET score of different Arabic dialects on a subset of 2000 sentences.
CODET: A Benchmark for Contrastive Dialectal Evaluation of Machine Translation
  • Preprint
  • File available

May 2023 · 89 Reads

Md Mahfuz Ibn Alam

·

·

Antonios Anastasopoulos

Neural machine translation (NMT) systems exhibit limited robustness in handling source-side linguistic variations. Their performance tends to degrade when faced with even slight deviations in language usage, such as different domains or variations introduced by second-language speakers. It is intuitive to extend this observation to encompass dialectal variations as well, but the work allowing the community to evaluate MT systems on this dimension is limited. To alleviate this issue, we compile and release CODET, a contrastive dialectal benchmark encompassing 882 different variations from nine different languages. We also quantitatively demonstrate the challenges large MT models face in effectively translating dialectal variants. We are releasing all code and data.


BIG-C: a Multimodal Multi-Purpose Dataset for Bemba

May 2023 · 43 Reads

We present BIG-C (Bemba Image Grounded Conversations), a large multimodal dataset for Bemba. While Bemba is the most populous language of Zambia, it exhibits a dearth of resources, which renders the development of language technologies or language processing research almost impossible. The dataset comprises multi-turn dialogues between Bemba speakers based on images, transcribed and translated into English. There are more than 92,000 utterances/sentences, amounting to more than 180 hours of audio data with corresponding transcriptions and English translations. We also provide baselines on speech recognition (ASR), machine translation (MT) and speech translation (ST) tasks, and sketch out other potential future multimodal uses of our dataset. We hope that by making the dataset available to the research community, this work will foster research and encourage collaboration across the language, speech, and vision communities, especially for languages outside the "traditionally" used high-resourced ones. All data and code are publicly available: https://github.com/csikasote/bigc.


Figure 2: Subset of the multilingual root model's (Franc) confusion matrix (6 languages). Using the confusion matrix, clusters of highly confused languages are identified and confusion-resolution units trained according to the tree shown on the right. The tree, for demonstration purposes, is a subset of the entire tree which has 9 confusion-resolution units
LIMIT: Language Identification, Misidentification, and Translation using Hierarchical Models in 350+ Languages

May 2023 · 32 Reads

Knowing the language of an input text/audio is a necessary first step for using almost every natural language processing (NLP) tool such as taggers, parsers, or translation systems. Language identification is a well-studied problem, sometimes even considered solved; in reality, most of the world's 7000 languages are not supported by current systems. This lack of representation affects large-scale data mining efforts and further exacerbates data shortage for low-resource languages. We take a step towards tackling the data bottleneck by compiling a corpus of over 50K parallel children's stories in 350+ languages and dialects, and the computation bottleneck by building lightweight hierarchical models for language identification. Our data can serve as benchmark data for language identification of short texts and for understudied translation directions such as those between Indian or African languages. Our proposed method, Hierarchical LIMIT, uses limited computation to expand coverage into excluded languages while maintaining prediction quality.


GMNLP at SemEval-2023 Task 12: Sentiment Analysis with Phylogeny-Based Adapters

April 2023 · 2 Reads

This report describes GMU's sentiment analysis system for the SemEval-2023 shared task AfriSenti-SemEval. We participated in all three sub-tasks: Monolingual, Multilingual, and Zero-Shot. Our approach uses models initialized with AfroXLMR-large, a pre-trained multilingual language model trained on African languages and fine-tuned correspondingly. We also introduce augmented training data along with original training data. Alongside fine-tuning, we perform phylogeny-based adapter tuning to create several models and ensemble the best models for the final submission. Our system achieves the best F1-score on track 5: Amharic, with 6.2 points higher F1-score than the second-best performing system on this track. Overall, our system ranks 5th among the 10 systems participating in all 15 tracks.




Citations (2)


... BIG-C (Sikasote et al., 2023). This dataset is a multi-modal one in Bemba, which can be applied to many NLP tasks. ...

Reference:

A Survey on Multi-modal Machine Translation: Tasks, Methods and Challenges
BIG-C: a Multimodal Multi-Purpose Dataset for Bemba

... While several datasets have been introduced to evaluate the robustness of LLMs to varieties [7,14,54,33], they remain limited in size, variety coverage, and task diversity, making them inadequate for comprehensive evaluation. To rigorously assess the linguistic robustness of LLMs, it is necessary to evaluate on existing benchmark datasets across a diverse range of English varieties. ...

SD-QA: Spoken Dialectal Question Answering for the Real World
  • Citing Conference Paper
  • January 2021