Figure 4 - uploaded by Anish Acharya
ROUGE-1 (upper triangular) and ROUGE-4 (lower triangular) scores for summaries generated by each pair of models from English conversations.


Contexts in source publication

Context 1
... this end, we measured the ROUGE-1 and ROUGE-4 scores between the summaries from each pair of models. The results are summarized in Figure 5. We observe that the ROUGE-1 scores are significantly higher than the ROUGE-4 scores for both En and Hi-En conversations, suggesting that the models choose a similar set of tokens for summarization. ...
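The gap between ROUGE-1 and ROUGE-4 described above reflects how the two metrics count overlap: ROUGE-1 compares unigrams, while ROUGE-4 requires matching spans of four consecutive tokens. As a minimal sketch (not the paper's evaluation code; a whitespace tokenizer and F1 aggregation are assumed here), pairwise ROUGE-N between two model summaries can be computed as:

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams from a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_f1(candidate, reference, n):
    """F1 overlap of n-grams between two summaries (ROUGE-N style)."""
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    if not cand or not ref:
        return 0.0
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Two hypothetical model summaries of the same conversation:
a = "amanda baked cookies and will bring jerry some tomorrow"
b = "amanda baked cookies and will bring some to jerry tomorrow"
print(round(rouge_n_f1(a, b, 1), 2))  # 0.95 -- near-identical word choice
print(round(rouge_n_f1(a, b, 4), 2))  # 0.46 -- far fewer shared 4-grams
```

The example illustrates the pattern the snippet reports: two summaries can share almost all of their vocabulary (high ROUGE-1) while ordering it differently enough that long n-gram overlap (ROUGE-4) stays much lower.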

Similar publications

Article
Full-text available
More than 30 inflationary models are confronted with the recently improved limit on the tensor-to-scalar ratio presented by the Planck team. I show that a few more models are falsified due to this sharper restriction. Additionally, I discuss possible consequences of CMB-S4 observations for these inflationary models. The results are summarized in a...
Article
Full-text available
Our recent works discuss the meaning of an arbitrary-order SIR model. We claim that arbitrary-order derivatives can be obtained through special power-laws in the infectivity and removal functions. This work intends to summarize previous ideas and show new results on a meaningful model constructed with Mittag-Leffler functions. We emphasize the tric...

Citations

... Humans, on the other hand, may sometimes code-switch between languages, especially when there is no appropriate translation, or when readers are more familiar with foreign entities. There have been summarization resources addressing the code-switching phenomenon (Mehnaz et al., 2021), but they focus on summarizing from already code-switched source texts. ...
... Besides sequence tagging, other works also touch on short-form generation (Mondal et al., 2022) and speech recognition tasks with audio data (Li et al., 2012; Lovenia et al., 2022). GupShup (Mehnaz et al., 2021) is, to the best of our knowledge, the only collection dedicated to studying code-switching in summarization (Dogruöz et al., 2023). Different from CroCoSum, it introduces code-switched source texts by translating the SAMSum (Gliwa et al., 2019) dataset into Hinglish instead of studying this phenomenon in organically occurring target summaries. ...
... Because there is no existing code-switched resource for CLS, we extend our comparison to the loosely related summarization dataset GupShup (Mehnaz et al., 2021), which focuses on summarizing Hindi-English code-switched source texts. For a more comprehensive analysis, we also select datasets that contain Chinese-English code-switched texts but across different tasks, such as language identification (LID) in tweets (Solorio et al., 2014) and speech recognition (Lyu et al., 2010; Lovenia et al., 2022). ...
Preprint
Cross-lingual summarization (CLS) has attracted increasing interest in recent years due to the availability of large-scale web-mined datasets and the advancements of multilingual language models. However, given the rarity of naturally occurring CLS resources, the majority of datasets are forced to rely on translation, which can contain overly literal artifacts. This restricts our ability to observe naturally occurring CLS pairs that capture organic diction, including instances of code-switching. This alternation between languages mid-message is a common phenomenon in multilingual settings, yet it has been largely overlooked in cross-lingual contexts due to data scarcity. To address this gap, we introduce CroCoSum, a dataset of cross-lingual code-switched summarization of technology news. It consists of over 24,000 English source articles and 18,000 human-curated Chinese news summaries, with more than 92% of the summaries containing code-switched phrases. For reference, we evaluate the performance of existing approaches, including pipeline, end-to-end, and zero-shot methods. We show that leveraging existing resources as a pretraining step does not improve performance on CroCoSum, indicating the limited generalizability of existing resources. Finally, we discuss the challenges of evaluating cross-lingual summarizers on code-switched generation through qualitative error analyses. Our collection and code can be accessed at https://github.com/RosenZhang/CroCoSum.
... Eventually, we build a novel multi-sentential dataset for the Hinglish language with 85k MCTs identified from 67k articles. In Table 1, we compare MUTANT with four other Hinglish datasets (Srivastava and Singh, 2020; Khanuja et al., 2020; Mehnaz et al., 2021; Srivastava and Singh, 2021b) proposed for a variety of tasks such as machine translation, natural language inference, generation, and evaluation. The MUTANT dataset has a significantly higher average number of sentences along with longer MCTs (a higher average number of tokens). ...
Preprint
Full-text available
Multi-sentential long-sequence textual data opens up several interesting research directions in natural language processing and generation. Though we observe several high-quality long-sequence datasets for English and other monolingual languages, there has been no significant effort in building such resources for code-mixed languages such as Hinglish (code-mixing of Hindi and English). In this paper, we propose the novel task of identifying multi-sentential code-mixed text (MCT) in multilingual articles. As a use case, we leverage multilingual articles from two different data sources and build a first-of-its-kind multi-sentential code-mixed Hinglish dataset, i.e., MUTANT. We propose a token-level language-aware pipeline, extend the existing metrics measuring the degree of code-mixing to a multi-sentential framework, and automatically identify MCTs in the multilingual articles. The MUTANT dataset comprises 67k articles with 85k identified Hinglish MCTs. To facilitate future research, we make the dataset publicly available.
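The "degree of code-mixing" metrics the abstract refers to are commonly instantiated as the Code-Mixing Index (CMI) of Das and Gambäck (2014), which scores an utterance by how far its token distribution departs from being purely one language. As a hedged sketch (the MUTANT paper's exact multi-sentential extension is not reproduced here; token-level language tags and the label `"univ"` for language-independent tokens are assumptions of this illustration):

```python
def code_mixing_index(lang_tags):
    """Code-Mixing Index (Das & Gambäck, 2014) for one utterance.

    lang_tags: token-level language labels, where "univ" marks
    language-independent tokens (named entities, punctuation, etc.).
    Returns 0 for monolingual text, approaching higher values as the
    token mass spreads more evenly across languages.
    """
    n = len(lang_tags)
    u = sum(1 for t in lang_tags if t == "univ")
    if n == u:  # only language-independent tokens: no mixing measurable
        return 0.0
    counts = {}
    for t in lang_tags:
        if t != "univ":
            counts[t] = counts.get(t, 0) + 1
    # 1 - (share of the dominant language among language-specific tokens)
    return 100.0 * (1 - max(counts.values()) / (n - u))

# "main office jaa raha hoon" -> tags: HI EN HI HI HI (illustrative)
print(code_mixing_index(["hi", "en", "hi", "hi", "hi"]))  # 20.0
print(code_mixing_index(["hi"] * 5))                      # 0.0 (monolingual)
```

A monolingual sentence scores 0, while a sentence split evenly between two languages approaches 50, which is why averaging CMI over sentences gives a usable per-document degree of code-mixing.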
... • We create a new dataset applying this annotation scheme to code-switched utterances in the Spanish-English Bangor Miami Corpus (Deuchar, 2010) and a Hindi-English code-switching dataset consisting of sentences from the GupShup dataset (Mehnaz et al., 2021) and sentences we have written. ...
... In searching for an appropriate Hindi-English code-switched dataset for this task, we considered the need for a conversational dataset in order to be consistent with the system we had created. We found that the GupShup dataset (Mehnaz et al., 2021), which contains over 6,800 Hindi-English conversations, was in a similar domain as the Bangor Miami corpus. One downside was that the dataset was synthetic: the sentences had been translated from the SAMSum corpus (written in monolingual English) into a Hindi-English code-switched format (Mehnaz et al., 2021). ...
... We found that the GupShup dataset (Mehnaz et al., 2021), which contains over 6,800 Hindi-English conversations, was in a similar domain as the Bangor Miami corpus. One downside was that the dataset was synthetic: the sentences had been translated from the SAMSum corpus (written in monolingual English) into a Hindi-English code-switched format (Mehnaz et al., 2021). However, we decided to use the dataset on the basis of its close replication of natural Hindi-English code-switching. ...
Preprint
Code-switching, or switching between languages, occurs for many reasons and has important linguistic, sociological, and cultural implications. Multilingual speakers code-switch for a variety of purposes, such as expressing emotions, borrowing terms, making jokes, or introducing a new topic. The reason for code-switching may be quite useful for analysis, but it is not readily apparent. To remedy this situation, we annotate a new dataset of motivations for code-switching in Spanish-English. We build the first system (to our knowledge) to automatically identify a wide range of motivations for which speakers code-switch in everyday speech, achieving an accuracy of 75% across all motivations. Additionally, we show that the system can be adapted to new language pairs, achieving 66% accuracy on a new language pair (Hindi-English), demonstrating the cross-lingual applicability of our annotation scheme.
Preprint
Full-text available
We present ClidSum, a benchmark dataset for building cross-lingual summarization systems on dialogue documents. It consists of 67k+ dialogue documents from two subsets (i.e., SAMSum and MediaSum) and 112k+ annotated summaries in different target languages. Based on the proposed ClidSum, we introduce two benchmark settings for supervised and semi-supervised scenarios, respectively. We then build various baseline systems in different paradigms (pipeline and end-to-end) and conduct extensive experiments on ClidSum to provide deeper analyses. Furthermore, we propose mDialBART, which extends mBART-50 (a multilingual BART) via further pre-training. The multiple objectives used in the further pre-training stage help the pre-trained model capture the structural characteristics and important content of dialogues, as well as the transformation from the source to the target language. Experimental results show that mDialBART, as an end-to-end model, outperforms strong pipeline models on ClidSum. Finally, we discuss specific challenges that current approaches face on this task and give multiple promising directions for future research. We have released the dataset and code at https://github.com/krystalan/ClidSum.