Conference Paper

Synthetic QA Corpora Generation with Roundtrip Consistency

... This process effectively performs data synthesis to create new target language utterances, serving as data augmentation. Finally, we filter these synthesized utterances to discard low-quality ones using a filtering mechanism inspired by roundtrip consistency (Alberti et al., 2019), thereby enhancing the quality of the augmented dataset. ...
... Then, we remove low-quality data from the synthesized utterances using the filtering mechanism. By re-parsing the generated utterances, we measure round-trip consistency (Alberti et al., 2019) to determine whether it accurately maps back to the input meaning representation used during generation. This data filtration process improves the quality of the synthesized data. ...
... To filter out low-quality synthesized utterances, we propose a filtering mechanism inspired by roundtrip consistency (Alberti et al., 2019). We fine-tune the same backbone model for the utterance generator for the SP task using only labeled data from the source language. ...
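As a concrete illustration of this filtering step, the check can be sketched as follows; `parser` stands in for the semantic parser fine-tuned on source-language labeled data, and the equality test is a placeholder for whatever meaning-representation match criterion is used (both are assumptions, not the authors' exact implementation).

```python
def roundtrip_filter(synthesized, parser, match=lambda pred, gold: pred == gold):
    """Keep only generated utterances whose re-parsed meaning representation
    maps back to the meaning representation they were generated from."""
    kept = []
    for utterance, source_mr in synthesized:
        predicted_mr = parser(utterance)      # re-parse the generated utterance
        if match(predicted_mr, source_mr):    # roundtrip-consistent?
            kept.append((utterance, source_mr))
    return kept
```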
Preprint
Full-text available
Recent efforts have aimed to utilize multilingual pretrained language models (mPLMs) to extend semantic parsing (SP) across multiple languages without requiring extensive annotations. However, achieving zero-shot cross-lingual transfer for SP remains challenging, leading to a performance gap between source and target languages. In this study, we propose Cross-Lingual Back-Parsing (CBP), a novel data augmentation methodology designed to enhance cross-lingual transfer for SP. Leveraging the representation geometry of the mPLMs, CBP synthesizes target language utterances from source meaning representations. Our methodology effectively performs cross-lingual data augmentation in challenging zero-resource settings, by utilizing only labeled data in the source language and monolingual corpora. Extensive experiments on two cross-language SP benchmarks (Mschema2QA and Xspider) demonstrate that CBP brings substantial gains in the target language. Further analysis of the synthesized utterances shows that our method successfully generates target language utterances with high slot value alignment rates while preserving semantic integrity. Our codes and data are publicly available at https://github.com/deokhk/CBP.
... vision problems [29], question answering corpora for pre-training neural models [3], query logs for evaluating query auto-completion systems [17], and dialogue datasets for improving conversational systems [16,19,20]. Synthetic data is particularly useful in the realm of user-generated content (UGC), where research and development critically depends on the availability of large-scale datasets that allow for the modeling and analysis of the dynamics of users and content. ...
... We also consider metrics evaluating within-thread user-posting behavior: (a) number of posts posted by the user, (b) mean depth at which the user posted, (c) mean number of direct replies to the user, and (d) mean number of (in)direct replies to the user (as these are not key to our conclusions, the results are reported in Appendix A.3). ...
... We ground our evaluation at the level of a discussion path between two posts. For any given set of threads T, we sample 100 threads and 5 paths from each thread, uniformly at random over paths of up to length 4. We then convert each thread into a structured discussion string, and prompt an LLM with instructions to evaluate the "coherence" of the discussion. We provide ten few-shot examples (five coherent and five incoherent discussions). ...
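To make the setup concrete, a minimal sketch of flattening a sampled path into a structured discussion string and wrapping it in a few-shot coherence prompt might look like the following; the indentation scheme, labels, and prompt wording are illustrative assumptions, not the authors' exact template.

```python
def path_to_string(path):
    """Flatten a sampled path of (author, text) posts into a structured
    discussion string, indenting each reply by its depth along the path."""
    return "\n".join(f"{'  ' * depth}{author}: {text}"
                     for depth, (author, text) in enumerate(path))

def coherence_prompt(path, few_shot):
    """Build a few-shot prompt asking an LLM to judge discussion coherence.
    `few_shot` is a list of (discussion_string, 'coherent'/'incoherent') pairs."""
    shots = "\n\n".join(f"Discussion:\n{d}\nJudgment: {label}" for d, label in few_shot)
    query = f"Discussion:\n{path_to_string(path)}\nJudgment:"
    return f"Evaluate the coherence of each discussion.\n\n{shots}\n\n{query}"
```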
Preprint
Full-text available
The emergence of synthetic data represents a pivotal shift in modern machine learning, offering a solution to satisfy the need for large volumes of data in domains where real data is scarce, highly private, or difficult to obtain. We investigate the feasibility of creating realistic, large-scale synthetic datasets of user-generated content, noting that such content is increasingly prevalent and a source of frequently sought information. Large language models (LLMs) offer a starting point for generating synthetic social media discussion threads, due to their ability to produce diverse responses that typify online interactions. However, as we demonstrate, straightforward application of LLMs yields limited success in capturing the complex structure of online discussions, and standard prompting mechanisms lack sufficient control. We therefore propose a multi-step generation process, predicated on the idea of creating compact representations of discussion threads, referred to as scaffolds. Our framework is generic yet adaptable to the unique characteristics of specific social media platforms. We demonstrate its feasibility using data from two distinct online discussion platforms. To address the fundamental challenge of ensuring the representativeness and realism of synthetic data, we propose a portfolio of evaluation measures to compare various instantiations of our framework.
... Many studies have been conducted on generating corpora using natural texts for various natural language processing tasks, such as [2][3][4][5][6][7]. ...
... Such a strong dependence of generation speed not only on the amount of input data but also on its other parameters can be explained by the authors not attempting to evaluate or measure it. However, it can be argued that the methods that work with textual data (reported in [2][3][4][5][6]) are more effective in terms of speed than the one proposed by the authors, as they do not require working with images. ...
... In work [4], the authors consider the task of generating a synthetic "question-answer" corpus. To this end, the authors trained three models, each of which is responsible for a certain stage. ...
Article
Full-text available
The object of research is the process of generating text data corpora using the CorDeGen method. The problem addressed in this study is the insufficient efficiency, in terms of speed, of generating text data corpora with the CorDeGen method. Based on an analysis of the abstract CorDeGen method – the steps it consists of and the algorithm that implements it – the possibilities for its parallelization were determined. As a result, two new modifications of the base CorDeGen method were developed: a “naive” parallel method and a parallel method. These methods differ in whether they preserve the order of terms in the generated texts compared to the texts generated by the base method (the “naive” parallel method does not preserve it, the parallel method does). Using the .NET platform and the C# programming language, both proposed methods were implemented in this work; a property-based testing methodology was used to validate both implementations. Efficiency testing showed that for sufficiently large corpora, the parallel CorDeGen methods speed up generation by a factor of two compared to the base method. The acceleration is explained precisely by parallelizing the generation of each term – its creation, the calculation of its number of occurrences in the texts, and its recording – which takes most of the time in the base method. This means that if sufficiently large corpora must be generated in limited time, it is reasonable in practice to use the developed parallel CorDeGen methods instead of the base one. The choice of a particular parallel method (naive or not) for a practical application depends on whether the ability to predict the order of terms in the generated texts is important.
... However, the inability of RNNs to capture semantic information in long sequences has pushed work towards the use of transformer-based architectures (Vaswani et al., 2017). Alberti et al. (2019) and Chan and Fan (2019) have proven the effectiveness of these models in generating synthetic QA data that can supplement existing data to train more robust and accurate QA models. ...
... These QG models differ only in certain factors like answer encoding (for answer-aware question generation), question word generation, and paragraph-level contexts. Recent works have solved the problem of answer encoding by either treating the answer's position as an input feature (Zhao et al., 2018), by encoding the answer with a separate RNN (Duan et al., 2017;Kim et al., 2018), or a mixture of both via transformer-based architectures (Lee et al., 2020;Alberti et al., 2019;Chan and Fan, 2019). ...
... BERT models have been used effectively by Alberti et al. (2019) to generate synthetic QA pairs. The authors use three separate BERT models for the auxiliary tasks of answer extraction, question generation, and question answering. ...
Conference Paper
Full-text available
The scarcity of comprehensive, high-quality Question-Answering (QA) datasets in low-resource languages has greatly limited the progress of research on QA for these languages. This has inspired research on Question-Answer Generation (QAG) which seeks to synthetically generate QA pairs and minimize the human effort required to compile labeled datasets. In this paper, we present the first QAG pipeline for the Bengali language, which consists of an answer span extraction model, a question generation model, and roundtrip consistency filtering to discard inconsistent QA pairs. To train our QAG pipeline, we translate SQuAD1.1 and SQuAD2.0 using the state-of-the-art NLLB machine translation model and accurately mark the answer spans using a novel embedding-based answer alignment algorithm to construct two Bengali QA datasets that we show are superior to the only two existing machine-translated datasets in terms of quality and quantity. We use our QAG pipeline to generate more than 170,000 QA pairs to build BanglaQA, a synthetic QA dataset from 16,000 Bengali news articles spanning 5 different news categories. We demonstrate the quality of BanglaQA by human evaluation on a variety of metrics. The best-performing model among several baselines on our dataset achieves an F1 score of 86.14 falling behind human performance of 95.72 F1.
... Using such synthetic samples to improve the performance of question answering models has been explored by Puri et al. (2020), Alberti et al. (2019), and Shakeri et al. (2020), who show that reading comprehension (RC) models can be improved by generating large-scale synthetic training data. These promising results combined with the recent surge in the development of powerful generative models such as GPT-3 (Brown et al., 2020), BART (Lewis et al., 2020a), and T5 (Raffel et al., 2020) suggest that the need for large manually labeled datasets can be reduced. ...
... Using the F1 score of a trained RC model to perform filtering, a.k.a. roundtrip filtering, has been previously explored by Puri et al. (2020) and Alberti et al. (2019). For a generated QA sample (q, a, p), where q, a, and p indicate question, answer, and passage, the following steps are performed: 1) a trained RC model is applied to (q, p), predicting a′, and 2) the F1 score of a′ and a is calculated, and if above a certain threshold, (q, a, p) is kept, otherwise dropped. ...
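Written out, the two steps above amount to the following sketch; `rc_model` is a placeholder for the trained reading-comprehension model, the token-level F1 follows the usual SQuAD convention (without answer normalization), and the threshold value is an assumption.

```python
def token_f1(pred, gold):
    """Token-level F1 between a predicted and a gold answer string."""
    pred_toks, gold_toks = pred.split(), gold.split()
    if not pred_toks or not gold_toks:
        return float(pred_toks == gold_toks)
    common = sum(min(pred_toks.count(t), gold_toks.count(t)) for t in set(pred_toks))
    if common == 0:
        return 0.0
    precision, recall = common / len(pred_toks), common / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

def roundtrip_keep(q, a, p, rc_model, threshold=0.8):
    """Step 1: apply the trained RC model to (q, p) to predict a'.
    Step 2: keep (q, a, p) only if F1(a', a) clears the threshold."""
    a_pred = rc_model(question=q, passage=p)
    return token_f1(a_pred, a) >= threshold
```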
... Related Work: Recent work has explored question-answer generation (Alberti et al., 2019; Puri et al., 2020; Shakeri et al., 2020), but limited in scope to English. We leverage the modeling and filtering approaches proposed by Shakeri et al. (2020) due to their simplicity and effectiveness. Kumar et al. (2019) explores cross-lingual question generation. ...
... Previous methods have struggled to achieve both of these goals. Question generation has relied on templates (Pampari et al., 2018), which tend to produce formulaic questions, or has depended on extensive data annotation (Du et al., 2017; Alberti et al., 2019; Puri et al., 2020). Similarly, unanswerable questions are often generated by artificially tweaking gold answerable questions (Gautam et al., 2023), limiting the scope of these examples. ...
... QA Corpora Generation: Numerous prior studies have explored the generation of question-answer pairs (Du et al., 2017; Alberti et al., 2019; Puri et al., 2020; Lewis et al., 2021; Ushio et al., 2022, 2023; Yoon and Bak, 2023). These methods typically follow a two-stage supervised process: first predicting a candidate answer from an input context, and then generating questions based on these answers. ...
Preprint
Full-text available
Clinical Question Answering (QA) systems enable doctors to quickly access patient information from electronic health records (EHRs). However, training these systems requires significant annotated data, which is limited due to the expertise needed and the privacy concerns associated with clinical data. This paper explores generating Clinical QA data using large language models (LLMs) in a zero-shot setting. We find that naive prompting often results in easy questions that do not reflect the complexity of clinical scenarios. To address this, we propose two prompting strategies: 1) instructing the model to generate questions that do not overlap with the input context, and 2) summarizing the input record using a predefined schema to scaffold question generation. Experiments on two Clinical QA datasets demonstrate that our method generates more challenging questions, significantly improving fine-tuning performance over baselines. We compare synthetic and gold data and find a gap between their training efficacy resulting from the quality of synthetically generated answers.
... Generation. This approach utilizes two sequential components, known as Answer Extraction and Question Generation, which are based on the pipeline method for question-answering (QA) generation introduced by Alberti et al. [4]. Typically, the first component extracts a text span that could potentially serve as an answer, and the second component generates a question corresponding to the extracted answer. ...
... Documentdriven methods predominantly include a factuality checking step to assess the performance of the utterance generator components. A notable technique is roundtrip consistency, initially introduced for QA generation [4] and later adapted for conversation data generation [44,108]. This method uses a model to verify whether the answer span remains consistent with the original span that prompted the question. ...
Preprint
Full-text available
Recent advancements in conversational systems have significantly enhanced human-machine interactions across various domains. However, training these systems is challenging due to the scarcity of specialized dialogue data. Traditionally, conversational datasets were created through crowdsourcing, but this method has proven costly, limited in scale, and labor-intensive. As a solution, the development of synthetic dialogue data has emerged, utilizing techniques to augment existing datasets or convert textual resources into conversational formats, providing a more efficient and scalable approach to dataset creation. In this survey, we offer a systematic and comprehensive review of multi-turn conversational data generation, focusing on three types of dialogue systems: open domain, task-oriented, and information-seeking. We categorize the existing research based on key components like seed data creation, utterance generation, and quality filtering methods, and introduce a general framework that outlines the main principles of conversation data generation systems. Additionally, we examine the evaluation metrics and methods for assessing synthetic conversational data, address current challenges in the field, and explore potential directions for future research. Our goal is to accelerate progress for researchers and practitioners by presenting an overview of state-of-the-art methods and highlighting opportunities to further research in this area.
... Recently, impressive achievements have been made in neural retrieval models through adopting large-scale pre-trained models (Devlin et al., 2019). Dual-encoders typically serve as the backbone architecture, which enables retrieving relevant knowledge from collections with millions or billions of passages in a fraction of time (Karpukhin et al., 2020). ...
... Consistency Filtering We take the teacher as a consistency filter (Alberti et al., 2019) to remove noises contained in synthetic data. More specifically, for a given synthetic query-passage pair (q ′ , p ′+ ), if p ′+ can be retrieved by the teacher in the top-1 position, this pair is kept; otherwise, it will be discarded. ...
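A sketch of this top-1 consistency check, assuming the teacher exposes a `retrieve(query, k)` call that returns ranked passage ids (an interface assumed here purely for illustration):

```python
def consistency_filter(synthetic_pairs, teacher_retrieve):
    """Keep a synthetic (query, positive_passage_id) pair only if the teacher
    retriever ranks that passage at the top-1 position for the query."""
    return [(q, pid) for q, pid in synthetic_pairs
            if teacher_retrieve(q, k=1)[0] == pid]
```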
... Roundtrip (Alberti et al., 2019;Bartolo et al., 2021). For each question candidate and respective context pair, we check if the candidate answer is the same as the answer given by the 6-way QA ensemble model with the same question and context. ...
... Since question-answering models are typically trained on data different from the narrative domain, such as Wikipedia passages, we fine-tune a RoBERTa-large encoder (Liu et al., 2019) using QA pairs from our dataset. We discard low-quality questions through round-trip filtering (Alberti et al., 2019), i.e., we check whether the generated questions can indeed be answered using the reference description. We employ exact match and F1 (Rajpurkar et al., 2016) to evaluate all QA models. ...
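The exact-match metric referred to here follows the SQuAD convention of comparing normalized answer strings; a minimal version (paired with a token-level F1 like the one sketched earlier) could read as follows.

```python
import re
import string

def normalize(text):
    """SQuAD-style normalization: lowercase, drop punctuation and articles,
    collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    """1.0 if the normalized prediction equals the normalized reference."""
    return float(normalize(prediction) == normalize(reference))
```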
Preprint
Full-text available
Characters are at the heart of every story, driving the plot and engaging readers. In this study, we explore the understanding of characters in full-length books, which contain complex narratives and numerous interacting characters. We define two tasks: character description, which generates a brief factual profile, and character analysis, which offers an in-depth interpretation, including character development, personality, and social context. We introduce the BookWorm dataset, pairing books from the Gutenberg Project with human-written descriptions and analyses. Using this dataset, we evaluate state-of-the-art long-context models in zero-shot and fine-tuning settings, utilizing both retrieval-based and hierarchical processing for book-length inputs. Our findings show that retrieval-based approaches outperform hierarchical ones in both tasks. Additionally, fine-tuned models using coreference-based retrieval produce the most factual descriptions, as measured by fact- and entailment-based metrics. We hope our dataset, experiments, and analysis will inspire further research in character-based narrative understanding.
... In natural language processing (NLP), conditionally perturbed and generated data are used to tackle data sparsity in areas such as grammatical error correction (Foster and Andersen 2009;Sakaguchi, Post, and Van Durme 2017), dependency parsing (Wang and Eisner 2016), and question answering (Hermann et al. 2015;Alberti et al. 2019). Synthetic data also benefit task performance and fairness, particularly in low-resource scenarios (Xia et al. 2019;Zmigrod et al. 2019;Tan et al. 2020), and have been proposed as an avenue towards privacy-preserving data sharing (Shetty, Schiele, and Fritz 2018;Mattern et al. 2022;Igamberdiev and Habernal 2023). ...
Article
Full-text available
User-generated content provides a rich resource to study social and behavioral phenomena. Although its application potential is currently limited by the paucity of expert labels and the privacy risks inherent in personal data, synthetic data can help mitigate this bottleneck. In this work, we introduce an evaluation framework to facilitate research on synthetic language data generation for user-generated text. We define a set of aspects for assessing data quality, namely, style preservation, meaning preservation, and divergence, as a proxy for privacy. We introduce metrics corresponding to each aspect. Moreover, through a set of generation strategies and representative tasks and baselines across domains, we demonstrate the relation between the quality aspects of synthetic user generated content, generation strategies, metrics, and downstream performance. To our knowledge, our work is the first unified evaluation framework for user-generated text in relation to the specified aspects, offering both intrinsic and extrinsic evaluation. We envisage it will facilitate developments towards shareable, high-quality synthetic language data.
... Recent research studies have demonstrated the versatile capabilities of LLMs in enriching QA datasets across diverse domains. Alberti et al. (2019) pioneered methods for generating synthetic QA corpora, using roundtrip consistency for validation and achieving state-of-the-art performance on tasks such as SQuAD2 and NQ. Maatouk et al. (2023) introduced a zero-shot learning approach for neural passage retrieval, using synthetic question generation and hybrid term-neural models to enhance retrieval performance without extensive domain-specific data. ...
Preprint
Full-text available
Regulatory documents, issued by governmental regulatory bodies, establish rules, guidelines, and standards that organizations must adhere to for legal compliance. These documents, characterized by their length, complexity and frequent updates, are challenging to interpret, requiring significant allocation of time and expertise on the part of organizations to ensure ongoing compliance. Regulatory Natural Language Processing (RegNLP) is a multidisciplinary subfield aimed at simplifying access to and interpretation of regulatory rules and obligations. We define an Automated Question-Passage Generation task for RegNLP, create the ObliQA dataset containing 27,869 questions derived from the Abu Dhabi Global Markets (ADGM) financial regulation document collection, design a baseline Regulatory Information Retrieval and Answer Generation system, and evaluate it with RePASs, a novel evaluation metric that tests whether generated answers accurately capture all relevant obligations and avoid contradictions.
... Minervini and Riedel, 2018;Wang, Sun, and Xing, 2019;Li et al., 2019;Hosseini et al., 2021) and question answering (e.g. Kassner and Schütze, 2020;Alberti et al., 2019;Mitchell et al., 2022;Chen, Choi, and Durrett, 2021;Elazar et al., 2021;Kassner et al., 2021;Asai and Hajishirzi, 2020;Hosseini et al., 2021). For example, Kassner et al. (2021) created a dataset of sentence pairs that are subject to certain constraints (e.g. if X is a dog is true, X has a tail must also be true). ...
Article
Full-text available
The staggering pace with which the capabilities of large language models (LLMs) are increasing, as measured by a range of commonly used natural language understanding (NLU) benchmarks, raises many questions regarding what “understanding” means for a language model and how it compares to human understanding. This is especially true since many LLMs are exclusively trained on text, casting doubt on whether their stellar benchmark performances are reflective of a true understanding of the problems represented by these benchmarks, or whether LLMs simply excel at uttering textual forms that correlate with what someone who understands the problem would say. In this philosophically inspired work, we aim to create some separation between form and meaning, with a series of tests that leverage the idea that world understanding should be consistent across presentational modes—inspired by Fregean senses—of the same meaning. Specifically, we focus on consistency across languages as well as paraphrases. Taking GPT-3.5 as our object of study, we evaluate multisense consistency across five different languages and various tasks. We start the evaluation in a controlled setting, asking the model for simple facts, and then proceed with an evaluation on four popular NLU benchmarks. We find that the model’s multisense consistency is lacking and run several follow-up analyses to verify that this lack of consistency is due to a sense-dependent task understanding. We conclude that, in this aspect, the understanding of LLMs is still quite far from being consistent and human-like, and deliberate on how this impacts their utility in the context of learning about human language and understanding.
... • Pseudo Labeled This approach fine-tunes the model, initially trained on the source domain, utilizing samples from the target domain that are augmented with pseudo questions generated by an off-the-shelf QG tool (Alberti et al., 2019). ...
... After generating the synthetic queries, a round-trip consistency filter (Alberti et al., 2019) is applied to remove noisy queries. For example, Promptagator uses a retriever trained on the synthetic data to check whether the generated query is relevant to the document from which it was generated, and finds that including the round-trip filtering is important for improving query quality. ...
... Some work uses retrieval and conditions a pretrained language model on both the query and the retrieved documents to generate the answer (Lewis et al., 2020;Guu et al., 2020;Khattab et al., 2021). Other work has investigated the use of synthetically-generated data with round trip filtering techniques and shown improved QA performance (Alberti et al., 2019;Puri et al., 2020;Kwiatkowski et al., 2019). Similarly, we used data augmentation and round trip filtering to improve generalisation; however, we do this for QG and QA for both text and KG. ...
... Consistency in NLP Consistency has been a long-standing topic in NLP research, in previous works, consistency of an NLP mode is defined the invariance of its behavior under meaningpreserving alternations (Ribeiro et al., 2020;Elazar et al., 2021;Goel et al., 2021;Wang et al., 2022b), and several works have explored the consistency in various tasks (Du et al., 2019;Ribeiro et al., 2019;Alberti et al., 2019;Camburu et al., 2020;Asai & Hajishirzi, 2020;Kassner et al., 2021;Chen et al., 2021a;Elazar et al., 2021;Mitchell et al., 2022). In the context of reward modeling, we study the consistency with respect to human preference instead. ...
Preprint
Full-text available
Standard practice within Reinforcement Learning from Human Feedback (RLHF) involves optimizing against a Reward Model (RM), which itself is trained to reflect human preferences for desirable generations. A notable subject that is understudied is the (in-)consistency of RMs, i.e., whether they can recognize semantic changes to different prompts and appropriately adapt their reward assignments, and their impact on the downstream RLHF model. In this paper, we visit a series of research questions relevant to RM inconsistency: (1) How can we measure the consistency of reward models? (2) How consistent are the existing RMs and how can we improve them? (3) In what ways does reward inconsistency influence the chatbots resulting from the RLHF model training? We propose CONTRAST INSTRUCTIONS, a benchmarking strategy for the consistency of RM. Each example in CONTRAST INSTRUCTIONS features a pair of lexically similar instructions with different ground truth responses. A consistent RM is expected to rank the corresponding instruction and response higher than other combinations. We observe that current RMs trained with the standard ranking objective fail miserably on CONTRAST INSTRUCTIONS compared to average humans. To show that RM consistency can be improved efficiently without using extra training budget, we propose two techniques, CONVEXDA and REWARDFUSION, which enhance reward consistency through extrapolation during the RM training and inference stages, respectively. We show that RLHF models trained with a more consistent RM yield more useful responses, suggesting that reward inconsistency exhibits a trickle-down effect on the downstream RLHF process.
... Synthetic data Other work has investigated the use of synthetically-generated data with round trip filtering techniques and shown improved textual QA performance (Alberti et al., 2019;Puri et al., 2020;Kwiatkowski et al., 2019). (Riabi et al., 2021;Agrawal et al., 2023) have also examined multilingual synthetic QA/QG data generation. ...
... Papers focusing on data augmentation either generate data for adversarial evaluation (Jia and Liang, 2017;Wang and Bansal, 2018) or for training. Most work on training data generation for QA is limited to generating answerable questions, e.g., Alberti et al. (2019) and Bartolo et al. (2020, 2021), but some generate both answerable and unanswerable questions (Liu et al., 2020) or, like us, just unanswerable questions (Clark and Gardner, 2018;Zhu et al., 2019). Unanswerable questions have been shown to be particularly hard for contemporary QA models when they contain false presuppositions (Kim et al., 2023), when they are fluent and related (Zhu et al., 2019), when the context contains a candidate answer of the expected type (e.g., a date for a "When" question; Weissenborn et al., 2017;Sulem et al., 2021), and in datasets beyond SQuAD (Sulem et al., 2021). ...
... We thus remove any question-answer pairs where the answer is already present in the question. We also employ a round-trip consistency check (Alberti et al., 2019) which discards questions if they yield answers different from those used to generate them. ...
... Question Generation (QG) aims to generate questions from a given answer and a grounding paragraph. As a dual task of Question Answering (QA), QG can potentially be used for the automatic construction of QA datasets, thereby improving QA with little annotation effort (Shakeri et al., 2020;Alberti et al., 2019;Cui et al., 2021). Furthermore, QG can be utilized for educational purposes (Yao et al., 2022;Qu et al., 2021), dialog systems , and conversational recommendation systems (Montazeralghaem and Allan, 2022). ...
... The generation process sometimes generates queries that are nonsensical, degenerated, ambiguous, or not grounded by the given passage. We adopt a filtering mechanism via ensuring roundtrip consistency (Alberti et al., 2019). We follow the procedure in Dai et al. (2023), where an initial retriever is trained on all synthetic query-passage pairs. ...
... Our corpus also contains a collection of QA pairs for the conversations, which could be useful for training such systems. In our work, we utilize an automated transformer-based QA generation approach (Alberti et al., 2019;Chan and Fan, 2019;Lopez et al., 2020) to generate the QA from the dialogues. ...
... Services like Google Cloud AutoML and AWS SageMaker provide user-friendly interfaces for implementing transfer learning and fine-tuning. The democratization of NLP technology, coupled with its remarkable adaptability, signifies a transformative shift in how businesses harness the potential of unstructured textual data, thereby shaping the future of text analytics and business insights generation [102]. ...
Article
Full-text available
In today's fast-paced business era, data reigns supreme. From emails and social media to reviews and articles, we've amassed a treasure trove of textual information that unveils customer sentiments, market trends, and brand perceptions. However, the real challenge lies in extracting valuable insights from this textual abundance. With that, we present an intensive and thorough review of the existing methods over the past six years, from 2018 to 2023. We found two game-changers: text analytics, the detective of text patterns, and Natural Language Processing (NLP), the language expert for computers. Together, they bring order to the chaotic world of words. Our review explores the rapid development of NLP and offers suggestions for open problems. Businesses can make educated decisions, outperform rivals, and make data their greatest asset with the help of these cutting-edge solutions. In order to ensure they find gold in the sea of text data, our study serves as the compass that directs them on this revolutionary journey.
... We use the open-source implementation for this model. Additionally, we also filter out QA pairs based on round-trip consistency (Alberti et al., 2019). ...
Preprint
Full-text available
Robustness in Natural Language Processing continues to be a pertinent issue, where state-of-the-art models underperform under naturally shifted distributions. In the context of Question Answering, work on domain adaptation methods continues to be a growing body of research. However, very little attention has been given to the notion of domain generalization under natural distribution shifts, where the target domain is unknown. With drastic improvements in the quality and access to generative models, we answer the question: How do generated datasets influence the performance of QA models under natural distribution shifts? We perform experiments on 4 different datasets under varying amounts of distribution shift, and analyze how "in-the-wild" generation can help achieve domain generalization. We take a two-step generation approach, generating both contexts and QA pairs to augment existing datasets. Through our experiments, we demonstrate how augmenting reading comprehension datasets with generated data leads to better robustness towards natural distribution shifts.
... In this part of our tutorial, we focus on the methods available for generating dialogue samples for an open-domain dialogue (ODD) system. The pipeline approach, initially introduced for synthetic QA pair generation [1], is one way to generate ODD samples. This method consists of four sequential stages: passage selection, answer extraction, question generation, and a subsequent filtering process to maintain the quality of the generated QA pairs. ...
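The four-stage pipeline can be summarized as a simple composition; all four callables below are placeholders for whatever models implement passage selection, answer extraction, question generation, and quality filtering in a given system.

```python
def pipeline_qag(corpus, select_passages, extract_answers, generate_question, keep):
    """Pipeline QA generation: select passages, extract candidate answers,
    generate one question per answer, then filter low-quality pairs."""
    qa_pairs = []
    for passage in select_passages(corpus):
        for answer in extract_answers(passage):
            question = generate_question(passage, answer)
            if keep(question, answer, passage):   # e.g. a roundtrip-consistency check
                qa_pairs.append((question, answer, passage))
    return qa_pairs
```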
Preprint
Full-text available
Advancements in conversational systems have revolutionized information access, surpassing the limitations of single queries. However, developing dialogue systems requires a large amount of training data, which is a challenge in low-resource domains and languages. Traditional data collection methods like crowd-sourcing are labor-intensive and time-consuming, making them ineffective in this context. Data augmentation (DA) is an effective approach to alleviate the data scarcity problem in conversational systems. This tutorial provides a comprehensive and up-to-date overview of DA approaches in the context of conversational systems. It highlights recent advances in conversation augmentation, open domain and task-oriented conversation generation, and different paradigms of evaluating these models. We also discuss current challenges and future directions in order to help researchers and practitioners to further advance the field in this area.
... The generated questions are also subjected to a filter (F3) based on the "roundtrip consistency" methodology proposed by Alberti et al. (2019). This filter involves retaining only the synthetic examples where a QA model is able to retrieve a portion of the target answer from the generated question. ...
... Further, can this QG framework be leveraged to generate effective synthetic data that can improve the closed-book QA task? Data augmentation is one of the main directions that question generation has been used for previously, with several studies finding improvements on the QA task (Lewis et al., 2019b; Alberti et al., 2019). Here, we show how to leverage the proposed QG framework to improve closed-book QA tasks on seen data (WikiCQA) and unseen data (GooAQ and ELI5). ...
Article
Full-text available
Language serves as a strategic tool for managing legal and emotional pressures during trials. This study focuses on the use of the phrases "tidak tahu" and "saya rasa" in Richard Eliezer's testimony in the Brigadier J case, reflecting defensive communication strategies. The issues examined are how these phrases are used to evade legal responsibility and how emotional pressure influences speech patterns. Using a descriptive quantitative and qualitative approach, the results indicate that the phrase “tidak tahu" (used 41 times) is employed to avoid accountability, while "saya rasa" (used 4 times) reflects subjective opinions aimed at risk mitigation. These findings reveal a close relationship between the emotional pressure experienced by the witness and their language choices, as well as the importance of linguistic analysis in understanding legal communication.
Article
Question-answer generation (QAG) is a challenging task that generates both questions and answers from a given input paragraph context. The QAG task has recently achieved promising results thanks to the appearance of large pre-trained language models; yet, QAG models are mainly implemented in common languages, e.g., English. There still remains a gap in the domain and language adaptation of these QAG models to low-resource languages like Vietnamese. To address the gap, this paper presents a large-scale and systematic study of QAG in Vietnamese. To do that, we first implement several QAG models by using common fine-tuning techniques based on powerful pre-trained language models. We next introduce a set of instructions designed for the QAG task. These instructions are used to fine-tune the pre-trained language models and large language models. Extensive experimental results of both automatic and human evaluation on five benchmark machine reading comprehension datasets show two important points. First, the instruction-tuning method has the potential to enhance the performance of QAG models. Second, large language models trained in English need more data for fine-tuning to work well on the downstream QAG tasks of low-resource languages. We also provide a prototype system to demonstrate how our QAG models actually work. The code for fine-tuning QAG models and the instructions are also made available.
Article
Full-text available
The process of manually generating question and answer (QA) pairs for assessments is known to be a time-consuming and energy-intensive task for teachers, specifically in higher education. Several studies have proposed various methods utilising pre-trained large language models for the generation of QA pairs. However, it is worth noting that these methods have primarily been evaluated on datasets that are not specifically educational in nature. Furthermore, the evaluation metrics and strategies employed in these studies differ significantly from those typically used in educational contexts. The present discourse fails to present a compelling case regarding the efficacy and practicality of the stated methods within the context of higher education. This study aimed to examine multiple QA pair generation approaches in relation to their performance, efficacy, and constraints within the context of higher education. The approaches encompassed in this study comprise the pipeline, joint, and multi-task approaches. The performance of the approaches under consideration was assessed on three datasets related to distinct courses. The evaluation integrates three automated methods, teacher assessments, and real-world educational evaluations to provide a comprehensive analysis. The comparison of the various approaches was conducted by directly assessing their performance using the average scores of different automatic metrics on the three datasets. The results of the teacher assessments and the real-world educational evaluation indicate that the generated assessments were beneficial in enhancing students' understanding of concepts and overall performance. The implications of the findings from this study hold significant importance in enhancing the efficacy of QA pair generation tools within the context of higher education.
Preprint
Full-text available
Although language models (LMs) have boosted the performance of Question Answering, they still need plenty of data. Data annotation, in contrast, is a time-consuming process. This especially applies to Question Answering, where possibly large documents have to be parsed and annotated with questions and their corresponding answers. Furthermore, Question Answering models often only work well for the domain they were trained on. Since annotation is costly, we argue that domain-agnostic knowledge from LMs, such as linguistic understanding, is sufficient to create a well-curated dataset. With this motivation, we show that using large language models can improve Question Answering performance on various datasets in the few-shot setting compared to state-of-the-art approaches. For this, we perform data generation leveraging the Prompting framework, suggesting that language models contain valuable task-agnostic knowledge that can be used beyond the common pre-training/fine-tuning scheme. As a result, we consistently outperform previous approaches on few-shot Question Answering.
Article
Full-text available
The availability of large, high-quality datasets has been a major driver of recent progress in question answering (QA). Such annotated datasets, however, are difficult and costly to collect, and rarely exist in languages other than English, rendering QA technology inaccessible to underrepresented languages. An alternative to building large monolingual training datasets is to leverage pre-trained language models (PLMs) under a few-shot learning setting. Our approach, QAmeleon, uses a PLM to automatically generate multilingual data upon which QA models are fine-tuned, thus avoiding costly annotation. Prompt tuning the PLM with only five examples per language delivers accuracy superior to translation-based baselines; it bridges nearly 60% of the gap between an English-only baseline and a fully-supervised upper bound fine-tuned on almost 50,000 hand-labeled examples; and consistently leads to improvements compared to directly fine-tuning a QA model on labeled examples in low resource settings. Experiments on the TyDiQA-GoldP and MLQA benchmarks show that few-shot prompt tuning for data synthesis scales across languages and is a viable alternative to large-scale annotation.
Article
Text retrieval is a long-standing research topic on information seeking, where a system is required to return relevant information resources in response to users' queries in natural language. From heuristic-based retrieval methods to learning-based ranking functions, the underlying retrieval models have continually evolved with ongoing technical innovation. To design effective retrieval models, a key point lies in how to learn text representations and model relevance matching. The recent success of pretrained language models (PLMs) sheds light on developing more capable text retrieval approaches by leveraging the excellent modeling capacity of PLMs. With powerful PLMs, we can effectively learn the semantic representations of queries and texts in the latent representation space, and further construct the semantic matching function between the dense vectors for relevance modeling. Such a retrieval approach is called dense retrieval, since it employs dense vectors to represent the texts. Considering the rapid progress on dense retrieval, this survey systematically reviews the recent progress on PLM-based dense retrieval. Different from previous surveys on dense retrieval, we take a new perspective to organize the related studies by four major aspects, including architecture, training, indexing and integration, and thoroughly summarize the mainstream techniques for each aspect. We extensively collect the recent advances on this topic, and include 300+ reference papers. To support our survey, we create a website for providing useful resources, and release a code repository for dense retrieval. This survey aims to provide a comprehensive, practical reference focused on the major progress for dense text retrieval.
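As a minimal sketch of the dual-encoder relevance matching described above (assuming queries and passages have already been embedded into vectors by separate encoders, which is an assumption of this snippet rather than part of the survey):

```python
import numpy as np

def dense_scores(query_vec, passage_matrix):
    """Dual-encoder relevance: inner product between one query embedding and a
    matrix of passage embeddings (one row per passage)."""
    return passage_matrix @ query_vec

def top_k(query_vec, passage_matrix, k=10):
    """Indices of the k highest-scoring passages for the query."""
    return np.argsort(-dense_scores(query_vec, passage_matrix))[:k]
```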
Chapter
In the real world, obtaining question-answer pairs for a target-domain text is often an expensive process; one approach to tackle the problem is to use question-answer pairs generated automatically from the problem context and large amounts of unstructured text (e.g. Wikipedia). However, current approaches to generating question-answer pairs typically require fine-tuning on tens of thousands of examples to achieve good results. Obtaining these instances involves high labor costs, and once the model lacks sufficient training data, its few-shot performance (< 100 examples) drops dramatically. To address this problem, we propose a method for generating question-answer pairs from only a few examples, generating answers and questions from the input context and then filtering the results with a model. We improve few-shot performance by tuning the fine-tuning structure of the model, and we also feed different levels of features into the model through granularity decomposition, addressing an important issue when data is limited: the inability to perform answer-span detection (or answer generation). We evaluate our method in three aspects: the diversity of generated answers, the quality of generated questions, and the training of a new model entirely on the generated QA pairs. Our experimental results demonstrate that our proposed method can effectively generate question-answer pairs with low resources.
Preprint
Full-text available
Large Language Models (LLMs) have demonstrated impressive zero-shot performance on a wide range of NLP tasks, demonstrating the ability to reason and apply commonsense. A relevant application is to use them for creating high-quality synthetic datasets for downstream tasks. In this work, we probe whether GPT-4 can be used to augment existing extractive reading comprehension datasets. Automating data annotation processes has the potential to save large amounts of time, money and effort that go into manually labelling datasets. In this paper, we evaluate the performance of GPT-4 as a replacement for human annotators for low-resource reading comprehension tasks, by comparing performance after fine-tuning, and the cost associated with annotation. This work serves as the first analysis of LLMs as synthetic data augmenters for QA systems, highlighting the unique opportunities and challenges. Additionally, we release augmented versions of low-resource datasets, which will allow the research community to create further benchmarks for evaluation of generated datasets.
Chapter
Multi-hop Question Answering (QA) requires the machine to answer complex questions by finding scattered clues and reasoning over multiple documents. Graph Network (GN) and Question Decomposition (QD) are two common approaches at present. The former uses a “black-box” reasoning process to capture the potential relationships between entities and sentences, thus achieving good performance, while the latter provides a clear reasoning logical route by decomposing multi-hop questions into simple single-hop sub-questions. In this paper, we propose a novel method to complete multi-hop QA from the perspective of Question Generation (QG). Specifically, we carefully design an end-to-end QG module on the basis of a classical QA module, which helps the model understand the context by asking inherently logical sub-questions, thus inheriting interpretability from the QD-based method while showing superior performance. Experiments on the HotpotQA dataset demonstrate the effectiveness of our proposed QG module, human evaluation further clarifies its interpretability quantitatively, and thorough analysis shows that the QG module can generate better sub-questions than QD methods in terms of fluency, consistency, and diversity.
Article
While neural machine translation (NMT) has been making good progress in the past two years, tens of millions of bilingual sentence pairs are needed for its training. However, human labeling is very costly. To tackle this training data bottleneck, we develop a dual-learning mechanism, which can enable an NMT system to automatically learn from unlabeled data through a dual-learning game. This mechanism is inspired by the following observation: any machine translation task has a dual task, e.g., English-to-French translation (primal) versus French-to-English translation (dual); the primal and dual tasks can form a closed loop, and generate informative feedback signals to train the translation models, even without the involvement of a human labeler. In the dual-learning mechanism, we use one agent to represent the model for the primal task and another agent to represent the model for the dual task, then ask them to teach each other through a reinforcement learning process. Based on the feedback signals generated during this process (e.g., the language-model likelihood of the output of a model, and the reconstruction error of the original sentence after the primal and dual translations), we can iteratively update the two models until convergence (e.g., using policy gradient methods). We call the corresponding approach to neural machine translation dual-NMT. Experiments show that dual-NMT works very well on English↔French translation; in particular, by learning from monolingual data (with 10% bilingual data for warm start), it achieves accuracy comparable to NMT trained on the full bilingual data for the French-to-English translation task.
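A rough sketch of how the two feedback signals could be combined for one monolingual sentence is given below; the callables, the log-probability interface, and the weighting are illustrative assumptions, and the policy-gradient update itself is omitted.

```python
def dual_feedback(sentence, primal_translate, dual_log_prob, lm_score, alpha=0.5):
    """Feedback for one unlabeled sentence in the dual-learning game:
    a language-model score for the primal translation plus a reconstruction
    score from the dual model translating it back, combined as a weighted reward."""
    mid = primal_translate(sentence)              # primal model, e.g. En -> Fr
    reward_lm = lm_score(mid)                     # fluency of the intermediate output
    reward_rec = dual_log_prob(sentence, mid)     # log P(original | translation)
    return alpha * reward_lm + (1 - alpha) * reward_rec
```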
Conference Paper
There has been growing interest in practice in using unlabeled data together with labeled data in machine learning, and a number of different approaches have been developed. However, the assumptions these methods are based on are often quite distinct and not captured by standard theoretical models. In this paper we describe a PAC-style framework that can be used to model many of these assumptions, and analyze sample-complexity issues in this setting: that is, how much of each type of data one should expect to need in order to learn well, and what are the basic quantities that these numbers depend on. Our model can be viewed as an extension of the standard PAC model, where in addition to a concept class C, one also proposes a type of compatibility that one believes the target concept should have with the underlying distribution. In this view, unlabeled data can be helpful because it allows one to estimate compatibility over the space of hypotheses, and reduce the size of the search space to those that, according to one's assumptions, are a priori reasonable with respect to the distribution. We discuss a number of technical issues that arise in this context, and provide sample-complexity bounds both for uniform convergence and ε-cover based algorithms. We also consider algorithmic issues, and give an efficient algorithm for a special case of co-training.
Conference Paper
We address the challenge of automatically generating questions from reading materials for educational practice and assessment. Our approach is to overgenerate questions, then rank them. We use manually written rules to perform a sequence of general purpose syntactic transformations (e.g., subject-auxiliary inversion) to turn declarative sentences into questions. These questions are then ranked by a logistic regression model trained on a small, tailored dataset consisting of labeled output from our system. Experimental results show that ranking nearly doubles the percentage of questions rated as acceptable by annotators, from 27% of all questions to 52% of the top ranked 20% of questions.
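An overgenerate-and-rank loop of this kind can be sketched with scikit-learn's logistic regression; the feature extraction function and the labeled training set of system outputs are assumptions supplied by the caller, not the authors' exact features or data.

```python
from sklearn.linear_model import LogisticRegression

def rank_questions(candidates, featurize, ranker):
    """Score overgenerated questions with a trained logistic regression model
    and return them best-first; `featurize` maps a question to a feature vector."""
    X = [featurize(q) for q in candidates]
    scores = ranker.predict_proba(X)[:, 1]        # probability of "acceptable"
    order = sorted(range(len(candidates)), key=lambda i: -scores[i])
    return [candidates[i] for i in order]

# Training the ranker on a small labeled set of system outputs:
# ranker = LogisticRegression(max_iter=1000).fit(X_train, y_train)
```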
Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. 2019. ERNIE: Enhanced representation through knowledge integration. CoRR, abs/1904.09223.

Chris Alberti, Kenton Lee, and Michael Collins. 2019. A BERT baseline for the Natural Questions. arXiv preprint arXiv:1901.08634.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.