Amir Hossein Kargaran

Verified
Amir Hossein verified their affiliation via an institutional email.
  • PhD Student at Ludwig-Maximilians-Universität in Munich

About

Publications: 21
Reads: 1,530
Citations: 57
Introduction
Amir Hossein Kargaran is a Software Engineer and Machine Learning Practitioner with 5 years of experience (2019-present) in industry and academia. He is currently a PhD student and Research Associate at the Center for Information and Language Processing (CIS) of the University of Munich (LMU Munich). More on: https://kargaranamir.github.io/
Current institution
Ludwig-Maximilians-Universität in Munich
Current position
  • PhD Student
Additional affiliations
September 2015 - September 2020
Isfahan University of Technology
Position
  • Bachelor Student
September 2020 - September 2022
Sharif University of Technology
Position
  • Master Student
June 2022 - August 2022
Aalto University
Position
  • Research Assistant

Publications (21)
Preprint
Full-text available
Several recent papers have published good solutions for language identification (LID) for about 300 high-resource and medium-resource languages. However, there is no LID available that (i) covers a wide range of low-resource languages, (ii) is rigorously evaluated and reliable, and (iii) is efficient and easy to use. Here, we publish GlotLID-M, an LID...
Conference Paper
Full-text available
Websites use third-party ads and tracking services to deliver targeted ads and collect information about users that visit them. These services put users’ privacy at risk, and that is why users’ demand for blocking these services is growing. Most of the blocking solutions rely on crowd-sourced filter lists manually maintained by a large community of...
Preprint
In large language models (LLMs), certain neurons can store distinct pieces of knowledge learned during pretraining. While knowledge typically appears as a combination of relations and entities, it remains unclear whether some neurons focus on a relation itself -- independent of any entity. We hypothesize such neurons detect a relation in the input...
Preprint
Full-text available
The need for large text corpora has increased with the advent of pretrained language models and, in particular, the discovery of scaling laws for these models. Most available corpora have sufficient data only for languages with large dominant communities. However, there is no corpus available that (i) covers a wide range of minority languages; (ii)...
Preprint
Full-text available
English-centric large language models (LLMs) often show strong multilingual capabilities. However, the multilingual performance of these models remains unclear and is not thoroughly evaluated for many languages. Most benchmarks for multilinguality focus on classic NLP tasks, or cover a minimal number of languages. We introduce MEXA, a method for as...
Preprint
Full-text available
Recent studies have shown that post-aligning multilingual pretrained language models (mPLMs) using alignment objectives on both original and transliterated data can improve crosslingual alignment. This improvement further leads to better crosslingual transfer performance. However, it remains unclear how and why a better crosslingual alignment is ac...
Preprint
Full-text available
We present MaskLID, a simple, yet effective, code-switching (CS) language identification (LID) method. MaskLID does not require any training and is designed to complement current high-performance sentence-level LIDs. Sentence-level LIDs are classifiers trained on monolingual texts to provide single labels, typically using a softmax layer to turn sc...
Conference Paper
Full-text available
We present GlotScript, an open resource and tool for low resource writing system identification. GlotScript-R is a resource that provides the attested writing systems for more than 7,000 languages. It is compiled by aggregating information from existing writing system resources. GlotScript-T is a writing system identification tool that covers all 1...
Preprint
Full-text available
The NLP community has mainly focused on scaling Large Language Models (LLMs) vertically, i.e., making them better for about 100 languages. We instead scale LLMs horizontally: we create, through continued pretraining, Glot500-m, an LLM that covers 511 languages, almost all of them low-resource. An important part of this effort is to collect and clea...
Preprint
Full-text available
GitHub's issue reports provide developers with valuable information that is essential to the evolution of a software development project. Contributors can use these reports to perform software engineering tasks like submitting bugs, requesting features, and collaborating on ideas. In the initial versions of issue reports, there was no standard way...
Preprint
Full-text available
Menu system design is a challenging task involving many design options and various human factors. For example, one crucial factor that designers need to consider is the semantic and systematic relation of menu commands. However, capturing these relations can be challenging due to limited available resources. With the advancement of neural language...
Conference Paper
Full-text available
Many main NLP tasks, e.g., text summarization, question answering, and information retrieval, benefit from an accurate understanding of temporal expressions. This paper introduces Hengam, an adversarially trained transformer for Persian temporal tagging that outperforms state-of-the-art approaches on a diverse, manually created dataset. We create He...
Article
Full-text available
An industrial process includes many devices, variables, and sub-processes that are physically or electronically interconnected. These interconnections imply some level of correlation between different process variables. Since most of the alarms in a process plant are defined on process variables, alarms are also correlated. However, this can be a n...
Preprint
Full-text available
Websites use third-party ads and tracking services to deliver targeted ads and collect information about users that visit them. These services put users' privacy at risk, and that is why users' demand for blocking these services is growing. Most of the blocking solutions rely on crowd-sourced filter lists manually maintained by a large community of...
Preprint
Full-text available
Websites use third-party ads and tracking services to deliver targeted ads and collect information about users that visit them. These services put users' privacy at risk, and that is why users' demand to block these services is growing. Most of the blocking solutions rely on crowd-sourced filter lists that are built and maintained manually by a large c...
Preprint
Full-text available
An industrial process includes many devices, variables, and sub-processes that are physically or electronically interconnected. These interconnections imply some level of correlation between different process variables. Since most of the alarms in a process plant are defined on process variables, alarms are also correlated. However, this can be a n...
