Fahim Faisal
George Mason University | GMU · Department of Computer Science
Bachelor of Science
About
26
Publications
1,072
Reads
92
Citations
Introduction
Additional affiliations
December 2012 - December 2016
Education
December 2012 - December 2016
Publications (26)
There has been little systematic study on how dialectal differences affect toxicity detection by modern LLMs. Furthermore, although using LLMs as evaluators ("LLM-as-a-judge") is a growing research area, their sensitivity to dialectal nuances is still underexplored and requires more focused attention. In this paper, we address these gaps through a...
Choosing an appropriate tokenization scheme is often a bottleneck in low-resource cross-lingual transfer. To understand the downstream implications of text representation choices, we perform a comparative analysis on language models having diverse text representation modalities including 2 segmentation-based models (BERT, mBERT), 1 image-based mode...
Modern NLP breakthroughs include large multilingual models capable of performing tasks across more than 100 languages. State-of-the-art language models have come a long way, starting from the simple one-hot representation of words capable of performing tasks like natural language understanding, common-sense reasoning, or question-answering, thus capturi...
Despite the major advances in NLP, significant disparities in NLP system performance across languages still exist. Arguably, these are due to uneven resource allocation and sub-optimal incentives to work on less resourced languages. To track and further incentivize the global development of equitable language technology, we introduce GlobalBench. P...
This report describes GMU's sentiment analysis system for the SemEval-2023 shared task AfriSenti-SemEval. We participated in all three sub-tasks: Monolingual, Multilingual, and Zero-Shot. Our approach uses models initialized with AfroXLMR-large, a pre-trained multilingual language model trained on African languages and fine-tuned correspondingly. W...
Pretrained language models (PLMs) often fail to fairly represent target users from certain world regions because of the under-representation of those regions in training datasets. With recent PLMs trained on enormous data sources, quantifying their potential biases is difficult, due to their black-box nature and the sheer scale of the data sources....
This report presents the results of the shared tasks organized as part of the VarDial Evaluation Campaign 2022. The campaign is part of the ninth workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with COLING 2022. Three separate shared tasks were included this year: Identification of L...
Large pretrained multilingual models, trained on dozens of languages, have delivered promising results due to cross-lingual learning capabilities on a variety of language tasks. Further adapting these models to specific languages, especially ones unseen during pre-training, is an important goal towards expanding the coverage of language technologies....
As language technologies become more ubiquitous, there are increasing efforts towards expanding the language diversity and coverage of natural language processing (NLP) systems. Arguably, the most important factor influencing the quality of modern NLP systems is data availability. In this work, we study the geographical representativeness of NLP da...
Human knowledge is collectively encoded in the roughly 6500 languages spoken around the world, but it is not distributed equally across languages. Hence, for information-seeking question answering (QA) systems to adequately serve speakers of all languages, they need to operate cross-lingually. In this work we investigate the capabilities of multili...
Questions (4)
Hello, I am trying to find a dataset of canonical suffix forms for part-of-speech tags. For example, -ed -> VBD (past tense), -ing -> VBG. Is there any available dataset like this?
Thanks!
Hello,
Hope everyone is safe.
I am working on a project on how COVID-19 and the associated social distancing are affecting the lives of individuals with Down syndrome. Things I am particularly looking for:
- What types of data are currently being collected that include and analyze this group of people?
- The mitigating steps taken by the associated stakeholders.
- The effect of ongoing social distancing on this specific group of people.
If anyone can point to any related publications, documents, studies, reports, news, or data, it would be much appreciated.
Thanks
My undergraduate thesis is on online disease prediction from health forum posts. For this, I need a dataset of online healthcare user posts. Is there any available dataset of online health forum posts, or a dataset containing patient narratives and the related diseases?
From recent survey articles on sentiment analysis and opinion mining, I came to know about concept-based sentiment analysis using a commonsense knowledge base. Can anyone give any insight into the research scope of this specific part of opinion mining?