Milind Agarwal’s scientific contributions


Publications (7)


Figure 2: Example: Sentence 0's SHAP visualization for gold TAM sentence and weights when predicted class is MAL. Red indicates positive signal for MAL (unwanted) and blue indicates negative signal for MAL (wanted).
Script-Agnostic Language Identification
  • Preprint
  • File available

June 2024 · 38 Reads

Milind Agarwal · Joshua Otten · Antonios Anastasopoulos

Language identification is used as the first step in many data collection and crawling efforts because it allows us to sort online text into language-specific buckets. However, many modern languages, such as Konkani, Kashmiri, and Punjabi, are synchronically written in several scripts. Moreover, languages with different writing systems do not share significant lexical, semantic, and syntactic properties in neural representation spaces, which is a disadvantage for closely related languages and low-resource languages, especially those from the Indian Subcontinent. To counter this, we propose learning script-agnostic representations using several different experimental strategies (upscaling, flattening, and script mixing), focusing on four major Dravidian languages (Tamil, Telugu, Kannada, and Malayalam). We find that word-level script randomization and exposure to a language written in multiple scripts are extremely valuable for downstream script-agnostic language identification, while also maintaining competitive performance on naturally occurring text.
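The word-level script randomization the abstract mentions can be sketched with a naive codepoint-offset mapping, exploiting the fact that the four Dravidian Unicode blocks are laid out roughly in parallel following the ISCII ordering. This is an illustrative approximation, not the paper's actual pipeline, and offset-shifting is not a linguistically faithful transliteration:

```python
import random

# Base codepoints of the four Dravidian Unicode blocks (128 codepoints each,
# laid out roughly in parallel following the ISCII ordering).
BLOCKS = {"tamil": 0x0B80, "telugu": 0x0C00, "kannada": 0x0C80, "malayalam": 0x0D00}

def block_of(ch):
    """Return the script block a character belongs to, or None."""
    cp = ord(ch)
    for name, base in BLOCKS.items():
        if base <= cp < base + 0x80:
            return name
    return None

def shift_word(word, target):
    """Naively map a word into the target script by codepoint offset."""
    out = []
    for ch in word:
        src = block_of(ch)
        if src is None:
            out.append(ch)  # spaces, punctuation, digits pass through
        else:
            out.append(chr(ord(ch) - BLOCKS[src] + BLOCKS[target]))
    return "".join(out)

def randomize_scripts(sentence, rng=random):
    """Word-level script randomization: each word gets a random script."""
    return " ".join(shift_word(w, rng.choice(list(BLOCKS))) for w in sentence.split())
```

A model trained on `randomize_scripts` output sees the same language rendered in all four scripts, which is the kind of exposure the abstract credits for script-agnostic behavior.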



Figure 2: Subset of the multilingual root model's (Franc) confusion matrix (6 languages). Using the confusion matrix, clusters of highly confused languages are identified and confusion-resolution units are trained according to the tree shown on the right. The tree, for demonstration purposes, is a subset of the entire tree, which has 9 confusion-resolution units.
LIMIT: Language Identification, Misidentification, and Translation using Hierarchical Models in 350+ Languages

May 2023 · 30 Reads

Knowing the language of an input text/audio is a necessary first step for using almost every natural language processing (NLP) tool such as taggers, parsers, or translation systems. Language identification is a well-studied problem, sometimes even considered solved; in reality, most of the world's 7000 languages are not supported by current systems. This lack of representation affects large-scale data mining efforts and further exacerbates data shortage for low-resource languages. We take a step towards tackling the data bottleneck by compiling a corpus of over 50K parallel children's stories in 350+ languages and dialects, and the computation bottleneck by building lightweight hierarchical models for language identification. Our data can serve as benchmark data for language identification of short texts and for understudied translation directions such as those between Indian or African languages. Our proposed method, Hierarchical LIMIT, uses limited computation to expand coverage into excluded languages while maintaining prediction quality.
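The hierarchical dispatch described above (a root model over all languages plus confusion-resolution units for clusters the root tends to confuse) can be sketched as follows; the cluster contents and stub models are hypothetical placeholders, not the trained LIMIT models:

```python
# Schematic sketch of hierarchical language identification: a root classifier
# covers all languages; whenever its prediction falls inside a cluster of
# mutually confusable languages, a specialist model re-predicts.

CLUSTERS = [
    {"bos", "hrv", "srp"},   # hypothetical confusion cluster
    {"hin", "urd"},          # hypothetical confusion cluster
]

def hierarchical_predict(text, root_model, cluster_models):
    """Route to a confusion-resolution unit when the root's guess is confusable."""
    lang = root_model(text)
    for cluster, specialist in zip(CLUSTERS, cluster_models):
        if lang in cluster:
            return specialist(text)  # confusion-resolution unit
    return lang

# Stub functions standing in for trained classifiers:
root = lambda text: "hin"
specialists = [lambda text: "bos", lambda text: "urd"]
print(hierarchical_predict("...", root, specialists))  # prints "urd"
```

Only the small specialist models need to cover the hard cases, which is how the method keeps computation limited while expanding coverage.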


Comparison of all language identification models' precision, recall, and F1 scores across noise settings. Our hierarchical (Hier) and Root models perform as the best two models for all noise levels. fastText, Multinomial Naive Bayes (MNB), and Multilayer Perceptron (MLP) take third place for different noise levels. Precision, recall, and F1 scores are reported for all methods to provide benchmarks. For two values that are the same up to the hundredth decimal place, boldfaced entries indicate strictly better performance.
PALI: A Language Identification Benchmark for Perso-Arabic Scripts

April 2023 · 30 Reads

The Perso-Arabic scripts are a family of scripts that are widely adopted and used by various linguistic communities around the globe. Identifying the various languages written in such scripts is crucial to language technologies and challenging in low-resource setups. As such, this paper sheds light on the challenges of detecting languages using Perso-Arabic scripts, especially in bilingual communities where "unconventional" writing is practiced. To address this, we use a set of supervised techniques to classify sentences into their languages. Building on these, we also propose a hierarchical model that targets clusters of languages that are more often confused by the classifiers. Our experimental results indicate the effectiveness of our solutions.
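As one illustration of the kind of supervised sentence classifier such benchmarks evaluate, here is a minimal character n-gram language identifier in the Cavnar–Trenkle spirit; the toy training data and overlap scoring are assumptions for the sketch, not PALI's actual models:

```python
from collections import Counter

def ngrams(text, n=3):
    """Overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def train(examples):
    """examples: list of (sentence, language) pairs -> per-language n-gram counts."""
    profiles = {}
    for sent, lang in examples:
        profiles.setdefault(lang, Counter()).update(ngrams(sent))
    return profiles

def predict(sentence, profiles):
    """Pick the language whose profile overlaps most with the sentence's n-grams."""
    grams = Counter(ngrams(sentence))
    def overlap(lang):
        return sum(min(c, profiles[lang][g]) for g, c in grams.items())
    return max(profiles, key=overlap)
```

Character n-grams are a standard LID baseline because they need no tokenization, which matters for scripts with "unconventional" orthography.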



FINDINGS OF THE IWSLT 2023 EVALUATION CAMPAIGN

January 2023 · 65 Reads · 44 Citations

Milind Agarwal · Sweta Agrawal · Antonios Anastasopoulos · [...]


Citations (3)


... Following previous work (Neudecker et al., 2021; Agarwal and Anastasopoulos, 2024), we used the character error rate (CER) and word error rate (WER) evaluation metrics. Specifically, we calculated collection-level CER and WER (concatenating lines, with a space to separate them for WER) with Jiwer. ...

Reference:

Comparative analysis of optical character recognition methods for Sámi texts from the National Library of Norway
A Concise Survey of OCR for Low-Resource Languages
  • Citing Conference Paper
  • January 2024
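The CER and WER metrics mentioned in the snippet above both reduce to Levenshtein (edit) distance, computed over characters and over words respectively. A minimal stdlib sketch of the underlying computation (this is the metric definition, not Jiwer's API):

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: edits per reference character."""
    return levenshtein(list(ref), list(hyp)) / len(ref)

def wer(ref, hyp):
    """Word error rate: edits per reference word."""
    return levenshtein(ref.split(), hyp.split()) / len(ref.split())
```

For collection-level scores, concatenate all lines first (with a space separator for WER), as the snippet describes, rather than averaging per-line rates.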

... Third, most S2ST systems rely heavily on the cascading of several subsystems; for example, automatic speech recognition (ASR) + T2TT + text-to-speech (TTS). Although direct systems exist [1,4,5], they do not match the performance of their cascaded counterparts [7]. See Supplementary Information section I.2 for more details on the current technical landscape. ...

FINDINGS OF THE IWSLT 2023 EVALUATION CAMPAIGN