Conference Paper

In-context Examples Selection for Machine Translation

... Unlike the tasks above, machine translation (MT) involves multiple languages and requires a more sophisticated design of in-context example selection. Recently, there have been some attempts at in-context example selection specifically for MT, which leverage word-level matching (Agrawal et al., 2023), embedding-based scoring (Moslem et al., 2023; Ji et al., 2024; Zhu et al., 2024), or combinations of superficial features (Kumar et al., 2023). ...
... There are some example selection strategies customized for MT. Agrawal et al. (2023) select examples based on n-gram overlap. Moslem et al. (2023) select examples based on sentence embedding similarity. ...
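As a concrete illustration of the embedding-similarity selection these snippets describe, here is a minimal sketch (not any cited paper's implementation); the encode function stands in for whatever sentence encoder one assumes:

import numpy as np

def select_by_embedding_similarity(src, pool, encode, k=4):
    """Pick the k pool sentences whose embeddings are closest to `src`.

    `encode` is any sentence encoder mapping list[str] -> np.ndarray of
    shape (n, d); the choice of encoder is left open in this sketch.
    """
    vecs = encode([src] + pool)
    q, cands = vecs[0], vecs[1:]
    # Cosine similarity between the test source and every candidate example.
    sims = cands @ q / (np.linalg.norm(cands, axis=1) * np.linalg.norm(q) + 1e-9)
    return [pool[i] for i in np.argsort(-sims)[:k]]

The selected sentences (with their reference translations) would then be formatted as few-shot demonstrations ahead of the test source.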
... Following Agrawal et al. (2023) and Kumar et al. (2023), all the compared methods below re-rank the top-100 examples retrieved by BM25 for each test input. ...
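To make the retrieve-then-re-rank recipe concrete, the following sketch scores each BM25-retrieved candidate by n-gram recall against the test source; this approximates the re-ranking idea but is not guaranteed to match the cited works' exact weighting:

from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_recall(test_src, cand_src, max_n=4):
    """Average recall of the test source's n-grams in one candidate (n = 1..max_n)."""
    total = 0.0
    for n in range(1, max_n + 1):
        test_ng = ngram_counts(test_src, n)
        if not test_ng:
            continue
        # Counter intersection keeps the minimum count of each shared n-gram.
        overlap = sum((test_ng & ngram_counts(cand_src, n)).values())
        total += overlap / sum(test_ng.values())
    return total / max_n

def rerank_bm25_candidates(test_src, candidates, top_k=4):
    """Re-rank pre-retrieved (e.g., top-100 BM25) candidates; keep the best top_k."""
    return sorted(candidates, key=lambda c: -ngram_recall(test_src, c))[:top_k]

Here test_src and each candidate are token lists; the BM25 retrieval stage itself is assumed to have happened upstream.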
... A prevalent approach for unsupervised selection methods involves extracting the top-k examples that are most similar to the target test instance. Such methods often use similarity metrics based on embedded tokens or surface similarity (Agrawal et al., 2023; Liu et al., 2022). However, numerical data-to-text proves challenging due to the absence of token embeddings. ...
... The existing studies on demonstration selection were conducted on tasks other than data-to-text (Agrawal et al., 2023; Chang and Jia, 2023; Nguyen and Wong, 2023; Yang et al., 2023; Peng et al., 2024). As an exception, Liu et al. (2022) explored a k-nearest-neighbour-based approach for a data-to-text setting, i.e., Wikipedia table-to-text on the ToTTo dataset (Parikh et al., 2020). ...
... In contrast to this work, we focus on numerical data-to-text. Existing studies take one of three approaches: token-similarity-based (Liu et al., 2022), surface-similarity-based (Agrawal et al., 2023), or learning-based approaches (Chang and Jia, 2023; Nguyen and Wong, 2023; Yang et al., 2023). These studies all use texts, while our focus is on numerical time-series input. ...
... As a complex task distinct from sentence-level translation, one major challenge of DOCMT with LLMs is that the length of demonstrations for in-context learning is limited. For sentence-level MT, Agrawal et al. (2023) show that using 32 or more randomly sampled bilingual parallel sentence pairs as prompts can effectively enhance the translation abilities of LLMs. However, for DOCMT, the length of the text segments to be translated or to be used as demonstrations inherently increases (Zeng et al., 2024). ...
... To assess the effectiveness of our proposed method on LLMs, we employed WMT22 newstest as the test set. Following Agrawal et al. (2023), we normalized punctuation using Moses and removed sentence pairs with a source/target length ratio exceeding 1.5. To assess the accuracy of our method in ZPT, we utilized the GuoFeng (Xu et al., 2022) dataset, which covers five domains: movie subtitles, Q&A forums, government news, web fiction, and personal profiles. ...
... Additionally, in the same few-shot experimental setup, the translation quality of selecting demonstrations based on sentence embedding similarity surpasses that of the random approach. This is also corroborated by Agrawal et al. (2023). We further observe experimental results consistent with Radford et al. (2019), indicating that the quality of generative outputs guided in a "Zero-shot" manner is generally superior to those generated through "Random" or "Similarity-based" selection in a few-shot setting. ...
... In entertainment content, where dialogues often depend on prior interactions to convey a scene's meaning and emotion effectively, context-aware translation plays a vital role (Vu, Kamigaito, and Watanabe 2024; Maruf, Saleh, and Haffari 2021; Vincent et al. 2024b; Agrawal et al. 2023). Incorporating the broader dialogue or narrative context, rather than translating sentences in isolation, is crucial to ensure ...
... LLMs for Creative Translations and Style Transfer: The use of LLMs to induce creativity can be accomplished to a certain extent using prompt engineering techniques (Zhang, Haddow, and Birch 2023). In addition, advanced retrieval-based techniques (Agrawal et al. 2023; Reheman et al. 2023; Glass et al. 2022) can be used to generate context from a given text and provide necessary information for the desired translations. On the other hand, recent work on style transfer introduces a Domain Adaptation Module to copy the style of the input text, which is then used to modify the LLM-based translations. ...
Preprint
We address the challenging task of neural machine translation (NMT) in the entertainment domain, where the objective is to automatically translate a given dialogue from the source language to a target language. This task has various applications, particularly in automatic dubbing, subtitling, and other content localization tasks, enabling source content to reach a wider audience. Traditional NMT systems typically translate individual sentences in isolation, without facilitating knowledge transfer of crucial elements such as the context and style from previously encountered sentences. In this work, we emphasize the significance of these fundamental aspects in producing pertinent and captivating translations. We demonstrate their significance through several examples and propose a novel framework for entertainment translation, which, to our knowledge, is the first of its kind. Furthermore, we introduce an algorithm to estimate the context and style of the current session and use these estimations to generate a prompt that guides a Large Language Model (LLM) to generate high-quality translations. Our method is both language and LLM-agnostic, making it a general-purpose tool. We demonstrate the effectiveness of our algorithm through various numerical studies and observe significant improvement in the COMET scores over various state-of-the-art LLMs. Moreover, our proposed method consistently outperforms baseline LLMs in terms of win-ratio.
... Recently, LLMs have shown the potential to use contextual information to perform many NLP tasks, including sentence- and document-level translation (Karpinska and Iyyer, 2023; Wang et al., 2023). For instance, Agrawal et al. (2023), Zhang et al. (2023a), and Mu et al. (2023) retrieve relevant examples during inference and supply them as context for the current source sentence. Other approaches integrate bilingual dictionaries or domain-specific terminologies (Ghazvininejad et al., 2023; Moslem et al., 2023) or use prompts to guide LLMs in resolving ambiguity either from the given context (Pilault et al., 2023) or based on pre-existing knowledge (He et al., 2024). ...
... For instance, Agrawal et al. (2023), Zhang et al. (2023a), and Mu et al. (2023) retrieve relevant examples during inference and supply them as context for the current source sentence. Other approaches integrate bilingual dictionaries or domain-specific terminologies (Ghazvininejad et al., 2023; Moslem et al., 2023) or use prompts to guide LLMs in resolving ambiguity either from the given context (Pilault et al., 2023) or based on pre-existing knowledge (He et al., 2024). Additionally, Treviso et al. (2024) propose improving output quality through post-editing of initial drafts with error explanations, while Wang et al. (2023) use context-aware prompts to model document-level translations during inference. ...
Preprint
Full-text available
Effective communication is fundamental to any interaction, yet challenges arise when participants do not share a common language. Automatic translation systems offer a powerful solution to bridge language barriers in such scenarios, but they introduce errors that can lead to misunderstandings and conversation breakdown. A key issue is that current systems fail to incorporate the rich contextual information necessary to resolve ambiguities and omitted details, resulting in literal, inappropriate, or misaligned translations. In this work, we present a framework to improve large language model-based translation systems by incorporating contextual information in bilingual conversational settings. During training, we leverage context-augmented parallel data, which allows the model to generate translations sensitive to conversational history. During inference, we perform quality-aware decoding with context-aware metrics to select the optimal translation from a pool of candidates. We validate both components of our framework on two task-oriented domains: customer chat and user-assistant interaction. Across both settings, our framework consistently results in better translations than state-of-the-art systems like GPT-4o and TowerInstruct, as measured by multiple automatic translation quality metrics on several language pairs. We also show that the resulting model leverages context in an intended and interpretable way, improving consistency between the conveyed message and the generated translations.
... shared task. We utilize online commercial general-purpose LLMs, DeepSeek (DeepSeek-AI et al., 2024) and GPT-4o (OpenAI et al., 2024), to perform the translation with the help of techniques including Document-level Multi-Aspect Prompting and Selection (d-MAPS), an LLM-generated terminology table, and dynamic retrieval of in-context learning examples using Reranked BM25 (R-BM25; Agrawal et al. 2023). We also explore the potential of post-correction of punctuation errors in LLMs' translation results. ...
... Re-ranked BM25 (R-BM25; Agrawal et al. 2023) is an in-context example retriever that ensures both sample quality and retrieval speed. After 100 sentences are retrieved by a standard BM25 retriever, a score is computed for each sentence using the following formula, in which S and Q denote the n-grams of the source and the retrieved sentence, respectively. ...
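The formula itself did not survive extraction. Assuming the standard n-gram-recall reading of R-BM25, with S_n and Q_n denoting the order-n n-gram sets of the test source and the retrieved sentence, the score plausibly takes the form:

\[ \mathrm{score}(S, Q) = \frac{1}{N} \sum_{n=1}^{N} \frac{\lvert S_n \cap Q_n \rvert}{\lvert S_n \rvert}, \qquad N = 4 \]

so candidates covering more of the test source's n-grams rank higher; treat this as a reconstruction rather than a verbatim quote of the cited formula.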
... Finally, considering the wide application of ICL in generation tasks (Agrawal et al., 2022; Sia and Duh, 2023; Garcia et al., 2023), we extend our analyses beyond classification tasks by conducting a thorough case study on a machine translation task. This study demonstrates that our coordinate system can also effectively capture the behavior of ICL in generation tasks. ...
... Given the recent success of ICL in generation tasks (Agrawal et al., 2022; Sia and Duh, 2023; Garcia et al., 2023), we aim for our two-dimensional coordinate system to enhance the understanding of ICL behavior not only in classification tasks but also in generation tasks. This is non-trivial, as almost no prior work has conducted an in-depth analysis of in-context generation tasks. ...
... (3) We retrieve the 20 most similar translation examples from the FLORES+ dev sets using the BM25 algorithm and employ the Claude 3.5 Sonnet model for 20-shot translations (Agrawal et al., 2022). ...
... Previous research suggests that providing similar parallel translation pairs as guidance can improve translation quality with large language models (Agrawal et al., 2022). To leverage this, we use the BM25 algorithm to retrieve several of the most similar translation examples from the dev test set based on the source sentences. ...
... Their work also underlined the effectiveness of cross-lingual transfer for improving few-shot learning across more than 20 languages. Agrawal et al. [27] investigated the impact of in-context example selection on machine translation quality using LLMs. They found that the selection and ordering of these examples are critical for maximizing translation accuracy. ...
... Machine Translation with LLMs Reading List.
In-context Learning: Jiao et al. [15], Brown et al. [17], Lin et al. [26], Agrawal et al. [27], Vilar et al. [28], Zhang et al. [29], Reheman et al. [30], Moslem et al. [31], Garcia et al. [32], Ghazvininejad et al. [33], Sarti et al. [34], Liu et al. [35], Chen et al. [36], Reinauer et al. [37], Iyer et al. [38], Alves et al. [39], Raunak et al. [40], Tan et al. [41], Sia et al. [42], Li et al. [43], Li et al. [44], Zheng et al. [45].
Chain-of-Thought Prompting: He et al. [46], Peng et al. [47], Lu et al. [48], Liang et al. [49], Huang et al. [50], Zhao et al. [51], Ding et al. [52], Wei et al. [53].
OpenAI's GPT-4 [7], Chowdhery et al. [18], Le Scao et al. [19], Zhang et al. [20], Touvron et al. [21][22], Costa-jussà et al. [23], Schioppa et al. [54], Wei et al. [55], Li et al. [56], Anil et al. [57], Almazrouei et al. [58].
Translation Finetuning: Hu et al. [59], Gao et al. [60], Mao et al. [61], Xu et al. [62], Jiao et al. [63], Li et al. [64], Yang et al. [65], Zhang et al. [66], Zeng et al. [67], Zhu et al. [68], Moslem et al. [69], Moslem et al. [70], Wu et al. [71], Xu et al. [72], Meng et al. [73], Xu et al. [74], Lyu et al. [75], Wu and Hu [76], Bao et al. [77], Li et al. [78], Chen et al. [79], Ji et al. [80].
Model Decoding: Zeng et al. [100], Hoang et al. [101], Xu et al. [102].
Assessment: Hendy et al. [12], Zhu et al. [13], Robinson et al. [14], Bang et al. [103], Wang et al. [104], Gao et al. [105], Bawden et al. [106], Karpinska et al. [107], Raunak et al. [108], Huang et al. [109], Lou et al. [110], Sanz Valdivieso et al. [111], Lyu et al. [75], Li et al. [112], Ishibashi et al. [113] ...
Article
Full-text available
This paper explores the role of Large Language Models (LLMs) in revolutionizing interactive Machine Translation (MT), providing a comprehensive analysis across nine innovative research directions. LLMs demonstrate exceptional capabilities in handling complex tasks through advanced text generation and interactive human-machine collaboration, significantly enhancing translation accuracy and efficiency, especially in low-resource language scenarios. This study also outlines potential advancements in LLM applications, emphasizing the integration of domain-specific knowledge and the exploration of model combinations to optimize performance. Future research is suggested to focus on enhancing model adaptability to diverse linguistic environments and refining human-machine interaction frameworks to better serve practical translation needs. The findings contribute to the ongoing discourse on the strategic deployment of MT with LLMs, aiming to direct future developments towards more robust and nuanced language processing solutions.
... Sample selection is guided by factors such as domain information [He et al., 2023], demonstration style [Agrawal et al., 2023], and token distance [Liu et al., 2022a]. Specifically, we systematically examine samples from both in-domain and out-of-domain collections. ...
... VLLMs learn semantic representations instead of token pattern representations for MM-ICL. As described by Agrawal et al. [2023], textual ICL primarily learns token patterns (e.g., similar output formats, reasoning paths) among demonstration outputs. To investigate whether VLLMs rely on repetitive token patterns, we utilize the average BLEU score across demonstration outputs as a representation of token repetition. ...
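A sketch of how such a token-repetition proxy can be computed with sacrebleu; scoring each demonstration output against all the others, as done here, is an assumption of this sketch rather than a detail confirmed by the snippet:

import sacrebleu

def avg_demo_bleu(demo_outputs):
    """Mean sentence-BLEU of each demonstration output against the rest.

    Higher values indicate more shared surface token patterns among the
    demonstration outputs; assumes at least two outputs.
    """
    scores = []
    for i, hyp in enumerate(demo_outputs):
        refs = [o for j, o in enumerate(demo_outputs) if j != i]
        scores.append(sacrebleu.sentence_bleu(hyp, refs).score)
    return sum(scores) / len(scores)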
Preprint
Full-text available
Recently, rapid advancements in Multi-Modal In-Context Learning (MM-ICL) have achieved notable success, attaining superior performance across various tasks without requiring additional parameter tuning. However, the underlying rules for the effectiveness of MM-ICL remain under-explored. To fill this gap, this work aims to investigate the research question: "What factors affect the performance of MM-ICL?" To this end, we conduct extensive experiments on the three core steps of MM-ICL, including demonstration retrieval, demonstration ordering, and prompt construction, using 6 vision large language models and 20 strategies. Our findings highlight (1) the necessity of a multi-modal retriever for demonstration retrieval, (2) the importance of intra-demonstration ordering over inter-demonstration ordering, and (3) the enhancement of task comprehension through introductory instructions in prompts. We hope this study can serve as a foundational guide for optimizing MM-ICL strategies in future research.
... Current text-based embedding selection methods primarily focus on capturing semantic-level similarity, demonstrating their utility in tasks such as sentiment analysis (Liu et al., 2022) and machine translation (Agrawal et al., 2023). However, these approaches encounter significant limitations in multi-step mathematical and logical reasoning tasks, such as GSM8K (Cobbe et al., 2021) and ProofWriter (Tafjord et al., 2021). ...
... Training-free approaches are generally divided into two types: (i) those that use heuristic criteria such as similarity (Liu et al., 2022; Hu et al., 2022), diversity (Cho et al., 2023; Zhang et al., 2022b; Levy et al., 2023; Hongjin et al., 2022), complexity (Fu et al., 2022), or combinations of these (Agrawal et al., 2023; Tonglet et al., 2023; Gupta et al., 2023) to select in-context examples (ICEs); (ii) those that leverage feedback from LLMs, such as probability distributions (Nguyen & Wong, 2023; Li & Qiu, 2023), perplexity (Gonen et al., 2023), or the model's generated output, to guide the selection process. While training-free approaches avoid the computational and time overhead associated with model training, their relatively simplistic architecture often results in sub-optimal performance compared to training-based methods. ...
Preprint
In-context learning (ICL) enables large language models (LLMs) to generalize to new tasks by incorporating a few in-context examples (ICEs) directly in the input, without updating parameters. However, the effectiveness of ICL heavily relies on the selection of ICEs, and conventional text-based embedding methods are often inadequate for tasks that require multi-step reasoning, such as mathematical and logical problem solving. This is due to the bias introduced by shallow semantic similarities that fail to capture the deeper reasoning structures required for these tasks. We present GraphIC, a novel approach that leverages graph-based representations of reasoning processes, coupled with Bayesian Networks (BNs), to select ICEs. Graph structures inherently filter out shallow semantics while preserving the core reasoning structure. Importantly, BNs capture the dependency of a node's attributes on its parent nodes, closely mirroring the hierarchical nature of human cognition, where each thought is shaped by preceding ones. This makes BNs particularly well-suited for multi-step reasoning tasks, aligning the process more closely with human-like reasoning. Extensive experiments across three types of reasoning tasks (mathematical reasoning, code generation, and logical reasoning) demonstrate that GraphIC outperforms both training-free and training-based models in selecting ICEs, excelling in terms of both effectiveness and efficiency. We show that GraphIC enhances ICL's performance and interpretability, significantly advancing ICE selection for multi-step reasoning tasks.
... Optimal selection of context examples is pivotal, as it can activate the intrinsic mechanisms of prompting-based LTMs to produce the anticipated outputs, as evidenced by Brown et al. (2020). Consequently, there has been considerable research focused on optimizing prompting strategies for LTMs in MT, encompassing the development and evaluation of prompt templates (Zhang et al., 2023a; Hendy et al., 2023), the curation of demonstration sets (Agrawal et al., 2022), and the in-depth exploration of the models' capacity to learn from such demonstrations (Tan et al., 2023; Peng et al., 2024). Further investigations have explored the method of using a pre-trained neural retriever to retrieve knowledge from databases and integrate external knowledge sources into LTMs to elevate translation accuracy (Lu et al., 2023; He et al., 2024). ...
... Even with ICL, it might not yield satisfactory results and could even have negative effects. We use the R-BM25 (Agrawal et al., 2022) method to select translation pairs with high domain similarity to the sentence to be translated as our distractions. R-BM25, a method based on n-gram matching for linguistic similarity, enables the selection of superior data as few-shot examples. ...
... HintInstruct, as a generative variant of AlignInstruct, consisted of instructions containing word alignment hints. It was inspired by Ghazvininejad et al. (2023), where dictionary hints were shown to improve few-shot in-context learning. Instead of relying on additional dictionaries, we used the same word alignments described in Sec. ...
... for MT, LLMs have shown good performance in multilingual MT through few-shot in-context learning (ICL) (Jiao et al., 2023). Agrawal et al. (2023) and Zhang et al. (2023a) explored strategies to compose better examples for ICL for XGLM-7.5B (Lin et al., 2022) and GLM-130B (Zeng et al., 2023). Ghazvininejad et al. (2023), Peng et al. (2023), and Moslem et al. (2023) claimed that dictionary-based hints and domain-specific style information can improve prompting for OPT (Zhang et al., 2022), ...
... Some approaches attempt to integrate additional information relevant to the translation task to enhance the performance of LLMs (Lu et al., 2023; He et al., 2024; Peng et al., 2023). Studies in In-Context Learning (ICL; Brown et al., 2020) seek to provide LLMs with more relevant and high-quality translation exemplars, which assists LLMs in retrieving bilingual knowledge and facilitates the generation of translations of the highest possible quality (Vilar et al., 2023; Agrawal et al., 2023). However, assessments of LLMs reveal that, in most translation directions, their performance falls short of that exhibited by robust supervised baselines. ...
... Studies in ICL (Brown et al., 2020) aim to provide LLMs with more relevant and high-quality translation exemplars. This approach assists LLMs in retrieving bilingual knowledge, facilitating the generation of translations of the highest possible quality (Vilar et al., 2023; Agrawal et al., 2023). Instruction tuning represents an efficient method to enhance the ability of LLMs to follow natural language instructions and yield outputs that align more closely with human preferences in downstream zero-shot tasks (Wei et al., 2022a; Ouyang et al., 2022; Chung et al., 2024). ...
... However, ICL has been shown to have notable reliability issues, such as strong dependence on the selection of examples [3], the order sensitivity of the demonstrations [29, 62], and vulnerabilities against adversarial attacks [52, 46, 37]. To mitigate these issues, a series of works have been proposed to automatically organize demonstrations [29, 3] or design intrinsically robust ICL mechanisms [38, 59, 16]. While these works mainly focus on improving the robustness of ICL, how to select high-quality test suites for evaluating ICL systems remains an open research problem. ...
Preprint
In-context Learning (ICL) has achieved notable success in the applications of large language models (LLMs). By adding only a few input-output pairs that demonstrate a new task, the LLM can efficiently learn the task during inference without modifying the model parameters. This mysterious ability of LLMs has attracted great research interest in understanding, formatting, and improving in-context demonstrations, while still suffering from drawbacks like black-box mechanisms and sensitivity to the selection of examples. In this work, inspired by the foundations of adopting testing techniques in machine learning (ML) systems, we propose a mutation testing framework designed to characterize the quality and effectiveness of test data for ICL systems. First, we propose several mutation operators specialized for ICL demonstrations, as well as corresponding mutation scores for ICL test sets. With comprehensive experiments, we showcase the effectiveness of our framework in evaluating the reliability and quality of ICL test suites. Our code is available at https://github.com/weizeming/MILE.
... Large language models (LLMs) are now capable across various types of tasks and show abilities that approach or even exceed human performance. Among the various tasks they can carry out, translation by LLMs shows satisfactory results (Agrawal et al., 2023; He et al., 2024; Jiao et al., 2023). Research on artificial intelligence (AI) has produced generative AI, which can create new content after learning from trained models. ...
Article
Full-text available
The growth in the retail industry means that retailers must have a competitive advantage to compete. One source of competitive advantage is customer experience, and one factor that positively influences customer experience is the service provided by frontline employees. Nowadays, customers can easily share their experiences and information in online reviews, so a good understanding of online reviews is necessary to maintain customer satisfaction. This paper proposes a new method for obtaining information from online reviews available on platforms such as Google Maps. Reviews are scraped from the website and translated into English using a large language model (LLM). The translated reviews are then analyzed to extract aspects, sentiments, and opinions using an aspect-based sentiment analysis (ABSA) model previously trained on an English dataset. The findings are visualized as Pareto diagrams and word clouds to identify the human-resource-related aspects that most influence the negative or positive ratings given by customers through online reviews.
... VKNN-SF queries combine both similarity measurements and structured data filters to retrieve the top-k elements most similar to a given query vector. These queries are particularly valuable in applications where precision and contextual relevance are critical, such as recommendation systems [16, 17, 19], retrieval-augmented generation (RAG) [20], and machine translation [1]. By integrating structured data with vector similarity, VKNN-SF queries enhance precision and contextual relevance, providing more accurate results in domains like product recommendations, content retrieval, and language translation [24, 36-38, 41, 44]. ...
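A minimal pre-filtering sketch of a VKNN-SF-style query, for illustration only; production engines typically push the structured predicate into the ANN index instead of scanning:

import numpy as np

def vknn_sf(query_vec, vectors, rows, predicate, k=10):
    """Top-k indices most similar to `query_vec` among rows passing the filter.

    `rows` holds structured attributes aligned with `vectors`; `predicate`
    implements the structured-data filter of the query.
    """
    keep = [i for i, row in enumerate(rows) if predicate(row)]
    if not keep:
        return []
    sub = vectors[keep]
    # Cosine similarity over the filtered subset only.
    sims = sub @ query_vec / (
        np.linalg.norm(sub, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )
    return [keep[i] for i in np.argsort(-sims)[:k]]

For example, vknn_sf(q, V, products, lambda r: r["price"] < 50) restricts the nearest-neighbor search to products under a price threshold.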
Preprint
Full-text available
Querying both structured and unstructured data has become a new paradigm in data analytics and recommendation. Unstructured data, such as text and videos, is converted to high-dimensional vectors and queried with approximate nearest neighbor search (ANNS). State-of-the-art database systems implement vector search as a plugin in the relational query engine, which tries to utilize the ANN index to enhance performance. After investigating a broad range of hybrid queries, we find that such designs may miss potential optimization opportunities and achieve suboptimal performance for certain queries. In this paper, we propose CHASE, a query engine natively designed to support efficient hybrid queries on structured and unstructured data. CHASE performs specific designs and optimizations at multiple stages of query processing. First, semantic analysis is performed to categorize queries and optimize query plans dynamically. Second, new physical operators are implemented to avoid redundant computations, which is the case with existing operators. Third, compilation-based techniques are adopted for efficient machine code generation. Extensive evaluations using real-world datasets demonstrate that CHASE achieves substantial performance improvements, with speedups ranging from 13% to an extraordinary 7500 times compared to existing systems. These results highlight CHASE's potential as a robust solution for executing hybrid queries.
... Since LLMs struggle with translating large units, especially considering their prevalence among various numerical translation types, we present three commonly employed strategies aimed at enhancing numerical translation: (1) In-context Learning (ICL) [15]–[17], which prompts LLMs with the specific unit-conversion principles, aiming to make LLMs translate the large units based on those principles. ...
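As an illustration of strategy (1), a hypothetical prompt builder that states unit-conversion principles before the sentence to translate; the wording and the principle list are invented for this sketch, not taken from the cited paper:

UNIT_PRINCIPLES = (
    "1 wan = 10 thousand; 1 yi = 100 million; "
    "convert Chinese large units into English units before translating."
)

def build_numeric_prompt(source_zh):
    """Prepend unit-conversion principles so the LLM applies them in-context."""
    return (
        f"Unit conversion principles: {UNIT_PRINCIPLES}\n"
        "Translate the following Chinese sentence into English, applying the "
        "principles above to all numbers.\n"
        f"Chinese: {source_zh}\nEnglish:"
    )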
Preprint
The inaccurate translation of numbers can lead to significant security issues, ranging from financial setbacks to medical inaccuracies. While large language models (LLMs) have made significant advancements in machine translation, their capacity for translating numbers has not been thoroughly explored. This study focuses on evaluating the reliability of LLM-based machine translation systems when handling numerical data. In order to systematically test the numerical translation capabilities of current open-source LLMs, we have constructed a numerical translation dataset between Chinese and English based on real business data, encompassing ten types of numerical translation. Experiments on the dataset indicate that errors in numerical translation are a common issue, with most open-source LLMs faltering when faced with our test scenarios. Especially when it comes to numerical types involving large units like "million", "billion", and "yi", even the latest Llama 3.1 8B model can have error rates as high as 20%. Finally, we introduce three potential strategies to mitigate the numerical mistranslations for large units.
... These methods rely on manually constructed examples, limiting their generalizability to diverse tasks. To address this, retrieval-based approaches select examples based on lexical features (Rubin et al., 2021; Agrawal et al., 2022; Luo et al., 2023), semantic similarity (Liu et al., 2021a), structural patterns (Levy et al., 2022), or other factors (Fu et al., 2022; Gonen et al., 2022; Drozdov et al., 2022). While these approaches show promising performance, they significantly slow down the inference process of LLMs due to the increased input length caused by the additional examples. ...
Preprint
Full-text available
Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of tasks by understanding input information and predicting corresponding outputs. However, the internal mechanisms by which LLMs comprehend input and make effective predictions remain poorly understood. In this paper, we explore the working mechanism of LLMs in information processing from the perspective of Information Bottleneck Theory. We propose a non-training construction strategy to define a task space and identify the following key findings: (1) LLMs compress input information into specific task spaces (e.g., sentiment space, topic space) to facilitate task understanding; (2) they then extract and utilize relevant information from the task space at critical moments to generate accurate predictions. Based on these insights, we introduce two novel approaches: an Information Compression-based Context Learning (IC-ICL) and a Task-Space-guided Fine-Tuning (TS-FT). IC-ICL enhances reasoning performance and inference efficiency by compressing retrieved example information into the task space. TS-FT employs a space-guided loss to fine-tune LLMs, encouraging the learning of more effective compression and selection mechanisms. Experiments across multiple datasets validate the effectiveness of task space construction. Additionally, IC-ICL not only improves performance but also accelerates inference speed by over 40%, while TS-FT achieves superior results with a minimal strategy adjustment.
... As its name implies, in-context learning allows an LLM to make use of extra context at inference time to improve its output (Brown et al., 2020). When it comes to MT, researchers investigated incorporating all sorts of context, such as translation pairs similar to the new source text (Agrawal et al., 2023; Moslem et al., 2023a; Vilar et al., 2023), terminology (Moslem et al., 2023c), and dictionary words. Usually, prompting the model, i.e. writing instructions in a natural language, is enough without extra model fine-tuning. ...
Preprint
Full-text available
In this work, we compare the domain-specific translation performance of open-source autoregressive decoder-only large language models (LLMs) with task-oriented machine translation (MT) models. Our experiments focus on the medical domain and cover four language pairs with varied resource availability: English-to-French, English-to-Portuguese, English-to-Swahili, and Swahili-to-English. Despite recent advancements, LLMs exhibit a clear gap in specialized translation quality compared to multilingual encoder-decoder MT models such as NLLB-200. In three out of four language directions in our study, NLLB-200 3.3B outperforms all LLMs in the size range of 8B parameters in medical translation. While fine-tuning LLMs such as Mistral and Llama improves their performance at medical translation, these models still fall short compared to fine-tuned NLLB-200 3.3B models. Our findings highlight the ongoing need for specialized MT models to achieve higher-quality domain-specific translation, especially in medium-resource and low-resource settings. As larger LLMs outperform their 8B variants, this also encourages pre-training domain-specific medium-sized LMs to improve quality and efficiency in specialized translation tasks.
... Machine translation (MT) has made significant advances with the advent of neural machine translation (NMT); however, human post-editing continues to be necessary to correct errors and enhance translation quality (Sennrich 2015; Weng et al. 2023; Yamada 2019). Highly inflected and agglutinative languages present challenges for MT systems, as their complex morphological processes are difficult to process and generate accurately (Agrawal et al. 2023; Lee et al. 2023; Oflazer 1994). It has been found that the difficulty involved in NMT post-editing is influenced by both the complexity of the source text (ST) and the quality of the machine translation output (Castilho et al. 2017; Jia & Zheng 2022; Krings 2001; Yamada 2019). ...
Article
This study examines how morphological complexity affects cognitive effort in neural machine translation (NMT) post-editing across six languages. Analysis of the DivEMT dataset shows that morphologically richer target languages like Ukrainian and Turkish require more editing time, keystrokes, and frequent pauses, indicating higher cognitive demands. Vietnamese, despite simpler morphology, also showed high cognitive effort, suggesting other factors like syntax influence processing load. Mean Size of Paradigm (MSP) analysis confirmed Ukrainian and Turkish’s high morphological complexity compared to isolating languages like Vietnamese. Higher error rates in morphologically rich languages demonstrate increased editing needs. While user perceptions varied, the data reveals that greater linguistic distance correlates with higher cognitive effort in NMT post-editing, showing typological divergence impacts beyond morphology alone.
... We limit our study of the prompt bank to a basic set of seed prompts and GPT-written paraphrases. Notably, we do not study the impact of prompt formats (e.g., passage:{}\n answer{} vs. Passage::{} Answer::{}, Sclar et al., 2023), in-context example ordering (Lu et al., 2022) or example selection (Agrawal et al., 2023) on multi-prompt performance, although multi-prompt may extend to such methods. We leave the question of exhaustively constructing a prompt bank to future work. ...
... Prompting LLMs for translation output has been successfully employed since the early years of LLMs (Brown et al., 2020), with the few-shot enhanced-context approach showing good results (Vilar et al., 2023). Later approaches suggested that an adaptive method of few-shot prompting may be even more beneficial (Agrawal et al., 2023; Zhang et al., 2023; Soudi et al., 2024). Enis and Hopkins (2024) evaluate Claude 3 Opus, as compared to other LLMs, with regard to machine translation of low-resource languages. ...
... In-context Examples Retrieval in NLP and CV: The field of natural language processing has identified that the choice of in-context examples significantly influences performance, as evidenced by Agrawal et al. (2022) and Min et al. (2022b). Furthermore, the construction of these in-context examples, often referred to as prompts, including aspects such as the relevance and diversity of retrieved examples, has been reported to impact performance as well. ...
... There are two potential approaches to achieving this goal. The first is the prompt-based method, which involves developing effective prompting strategies to better stimulate LLMs' translation capabilities, such as using in-context translation examples, as outlined in prior works (Agrawal et al., 2023; Garcia et al., 2023; Peng et al., 2023; Feng et al., 2024). However, Zhang et al. (2023a) indicate that prompting methods overly rely on the language model, often under-translating the input and generating hallucinations. ...
... Related to these approaches, nearest-neighbor machine translation (Khandelwal et al., 2021) uses distance measures between examples to select examples closer to the sentence to translate, in an additional module of a translation system. Agrawal et al. (2023) use similar approaches to construct prompts for LLMs. ...
... Garcia et al. (2023) examined the unreasonable effectiveness of few-shot learning in MT, emphasizing the role of prompt engineering in achieving high-quality translations even in low-resource scenarios. Moreover, Lin et al. (2022) and Agrawal et al. (2023) have focused on selecting effective in-context examples to enhance translation performance. Lu et al. (2023) introduced dictionary-based phrase-level prompting, showcasing how bilingual dictionaries can be leveraged within prompts to guide LLMs in translating rare words and phrases more accurately. ...
Preprint
Large language models (LLMs) have demonstrated remarkable proficiency in machine translation (MT), even without specific training on the languages in question. However, translating rare words in low-resource or domain-specific contexts remains challenging for LLMs. To address this issue, we propose a multi-step prompt chain that enhances translation faithfulness by prioritizing key terms crucial for semantic accuracy. Our method first identifies these keywords and retrieves their translations from a bilingual dictionary, integrating them into the LLM's context using Retrieval-Augmented Generation (RAG). We further mitigate potential output hallucinations caused by long prompts through an iterative self-checking mechanism, where the LLM refines its translations based on lexical and semantic constraints. Experiments using Llama and Qwen as base models on the FLORES-200 and WMT datasets demonstrate significant improvements over baselines, highlighting the effectiveness of our approach in enhancing translation faithfulness and robustness, particularly in low-resource scenarios.
... In-context learning (Brown et al., 2020a) prompts LLMs with a few handcrafted demonstrations that are understandable to the LLMs. More elaborately, Retrieval-Augmented Generation (RAG) (Chen et al., 2024a) complements LLMs by retrieving relevant knowledge from external databases (Li et al., 2023; Shen et al., 2023) or constructing demonstrations for in-context learning (ICL) (Poesia et al., 2022; Agrawal et al., 2023), showing promise in tasks like OpenQA (Borgeaud et al., 2022; Guu et al., 2020) and games (Zhu et al., 2023a; Hu et al., 2024). Knowledge graphs are also a welcome format for external knowledge, especially in structured tasks like relation extraction and entity recognition (Shu et al., 2024), improving task-specific decisions. ...
Preprint
In-context learning (ICL) and Retrieval-Augmented Generation (RAG) have gained attention for their ability to enhance LLMs' reasoning by incorporating external knowledge but suffer from limited contextual window size, leading to insufficient information injection. To this end, we propose a novel framework, RuAG, to automatically distill large volumes of offline data into interpretable first-order logic rules, which are injected into LLMs to boost their reasoning capabilities. Our method begins by formulating the search process relying on LLMs' commonsense, where LLMs automatically define head and body predicates. Then, RuAG applies Monte Carlo Tree Search (MCTS) to address the combinatorial search space and efficiently discover logic rules from data. The resulting logic rules are translated into natural language, allowing targeted knowledge injection and seamless integration into LLM prompts for LLMs' downstream task reasoning. We evaluate our framework on public and private industrial tasks, including natural language processing, time-series, decision-making, and industrial tasks, demonstrating its effectiveness in enhancing LLMs' capability over diverse tasks.
... Aligned with contemporary research findings [1, 33], which highlight the enhancement of In-Context Learning capabilities in LLMs through diversity-based methods and the performance improvement brought by more diverse datasets in IT [41], our initial phase employs a semantic diversity-oriented strategy for data ...
(Fig. 2: The overview of Bread. Stage 1 involves assembling datasets characterized by high diversity, followed by iterative dynamic sampling to retain the most representative samples while preserving diversity within the dataset in Stage 2.)
Conference Paper
Full-text available
Recent advancements in Instruction Tuning (IT) have shown promise for aligning Large Language Models (LLMs) with users' intentions, yet its efficacy is often compromised by dependence on high-quality datasets. Previous works have concentrated on the aggregation or production of huge IT datasets through human labor or cost-intensive LLM APIs, which lack adequate mechanisms to guarantee the quality of the resulting data. Moreover, training on such an amount of IT data is both time-consuming and costly. To address these issues, we present Bread (Instruction Mining through Balanced REtrieval And Dynamic Data Sampling), a novel approach designed to minimize the requisite volume of IT data. Bread uses a two-stage strategy combining balanced retrieval and dynamic sampling to focus on data diversity and quality, offering a cost-saving solution without relying on any specific LLMs. Experimental results suggest that Bread outperforms baselines and shows great flexibility across various IT datasets and LLMs, thereby marking a step forward in efficient Instruction Tuning. Our code is available at https://github.com/mihara-bot/Bread.
... (Random, Retrieved) shows the second-best overall performance, and generally outperforms (Random, Random), suggesting that retrieving examples during evaluation is advantageous even when the model is trained with randomly paired in-context examples. Our findings align with prior work in in-context learning: the incorporation of semantically similar examples is beneficial (Agrawal et al., 2022; Rubin et al., 2022). Does having a semantically relevant in-context example help? For some test examples, augmented in-context examples are very relevant, and for others, much less so. In this section, we group the evaluation examples by the maximum similarity of the in-context query and the test query, measured by an off-the-shelf sentence embedding model (Score@Top-1). ...
Preprint
Full-text available
We investigate whether in-context examples, widely used in decoder-only language models (LLMs), can improve embedding model performance in retrieval tasks. Unlike in LLMs, naively prepending in-context examples (query-document pairs) to the target query at inference time does not work out of the box. We introduce a simple approach to enable retrievers to use in-context examples. Our approach, RARe, finetunes a pre-trained model with in-context examples whose query is semantically similar to the target query. This can be applied to adapt various base architectures (i.e., decoder-only language models, retriever models) and consistently achieves performance gains of up to +2.72% nDCG across various open-domain retrieval datasets (BeIR, RAR-b). In particular, we find RARe exhibits stronger out-of-domain generalization compared to models using queries without in-context examples, similar to what is seen for in-context learning in LLMs. We further provide analysis on the design choices of in-context example augmentation and lay the foundation for future work in this space.
... This setup has been shown to enhance performance in LLMs on a wide range of NLP tasks. Additionally, prior research indicates that LLMs can be sensitive to the selection of in-context exemplars (Nguyen & Wong, 2023; Zhang et al., 2022; Agrawal et al., 2023; Chen et al., 2023c). To explore this, we employ three different strategies for exemplar selection: (1) Randomly select a specified number of exemplars. ...
Preprint
Full-text available
While vision-language models (VLMs) have demonstrated remarkable performance across various tasks combining textual and visual information, they continue to struggle with fine-grained visual perception tasks that require detailed pixel-level analysis. Effectively eliciting comprehensive reasoning from VLMs on such intricate visual elements remains an open challenge. In this paper, we present VipAct, an agent framework that enhances VLMs by integrating multi-agent collaboration and vision expert models, enabling more precise visual understanding and comprehensive reasoning. VipAct consists of an orchestrator agent, which manages task requirement analysis, planning, and coordination, along with specialized agents that handle specific tasks such as image captioning and vision expert models that provide high-precision perceptual information. This multi-agent approach allows VLMs to better perform fine-grained visual perception tasks by synergizing planning, reasoning, and tool use. We evaluate VipAct on benchmarks featuring a diverse set of visual perception tasks, with experimental results demonstrating significant performance improvements over state-of-the-art baselines across all tasks. Furthermore, comprehensive ablation studies reveal the critical role of multi-agent collaboration in eliciting more detailed System-2 reasoning and highlight the importance of image input for task planning. Additionally, our error analysis identifies patterns of VLMs' inherent limitations in visual perception, providing insights into potential future improvements. VipAct offers a flexible and extensible framework, paving the way for more advanced visual perception systems across various real-world applications.
... A primary strategy involves leveraging LLMs' ability to learn from demonstrations or descriptions (Brown et al., 2020; Wei et al., 2022). Studies have explored selecting appropriate exemplars for few-shot learning and demonstrating linguistic knowledge (Agrawal et al., 2022; Vilar et al., 2023), or augmenting LLMs with chains of multilingual dictionaries. Besides providing a demonstration or description, choosing the right temperature or prompting strategy has also been examined (Peng et al., 2023). ...
Preprint
Full-text available
Recent Large Language Models (LLMs) have demonstrated strong performance in translation without needing to be finetuned on additional parallel corpora. However, they still underperform for low-resource language pairs. Previous works have focused on mitigating this issue by leveraging relevant few-shot examples or external resources such as dictionaries or grammar books, making models heavily reliant on these nonparametric sources of information. In this paper, we propose a novel method named IntGrad MT that focuses on fully exploiting an LLM's inherent translation capability. IntGrad MT achieves this by constructing a chain of few-shot examples, each consisting of a source sentence and the model's own translation, that rise incrementally in difficulty. IntGrad MT employs two techniques: Sentence Interpolation, which generates a sequence of sentences that gradually change from an easy sentence to translate to a difficult one, and Gradual MT, which sequentially translates this chain using translations of earlier sentences as few-shot examples for the translation of subsequent ones. With this approach, we observe a substantial enhancement in the xCOMET scores of various LLMs for multiple languages, especially in low-resource languages such as Hindi (8.26), Swahili (7.10), Bengali (6.97), and Marathi (13.03). Our approach presents a practical way of enhancing LLMs' performance without extra training.
... The literature shows that the choice of samples for few-shot learning significantly influences its outcomes (i.e., sensitivity to sample selection) (Zhang et al., 2022; Köksal et al., 2023; Agrawal et al., 2023). For example, recent studies have investigated the effects of such sample selection strategies on in-context learning (Zhang et al., 2022; Li and Qiu, 2023). ...
Preprint
Full-text available
The generative large language models (LLMs) are increasingly used for data augmentation tasks, where text samples are paraphrased (or generated anew) and then used for classifier fine-tuning. Existing works on augmentation leverage few-shot scenarios, where samples are given to LLMs as part of prompts, leading to better augmentations. Yet, the samples are mostly selected randomly, and a comprehensive overview of the effects of other (more "informed") sample selection strategies is lacking. In this work, we compare sample selection strategies existing in the few-shot learning literature and investigate their effects in LLM-based textual augmentation. We evaluate this on in-distribution and out-of-distribution classifier performance. Results indicate that while some "informed" selection strategies increase the performance of models, especially for out-of-distribution data, it happens only seldom and with marginal performance increases. Unless further advances are made, a default of random sample selection remains a good option for augmentation practitioners.
... As depicted in Table 2, the zero-shot translation results indicate the model's limitations in effectively translating this language in the absence of reference data, reflected in BLEU scores nearing zero. However, introducing 20-shot reference data prompts the model to engage in in-context learning (Agrawal et al., 2022), resulting in a marginal improvement in BLEU scores. This highlights the potential of few-shot learning. ...
... Machine Translation with LLMs: Large Language Models (LLMs), notably ChatGPT (Ouyang et al., 2022) and GPT-4 (Achiam et al., 2023), have demonstrated substantial potential in the sphere of Neural Machine Translation (NMT). These models have delivered remarkable improvements in translation accuracy and fluency compared to traditional machine translation systems, especially in the context of high-resource bilingual translation tasks (Agrawal et al., 2022; Hendy et al., 2023). Moreover, their strong generality is believed to address traditional challenges in NMT, such as multilingual and domain-specific translation (Yang et al., 2023; Reinauer et al., 2023). ...
... In-Context Learning: In-context learning (ICL) integrates a small number of training examples as prompts before the test input (Brown et al., 2020), demonstrating a remarkable ability to enhance the performance of large language models (LLMs) in a wide range of downstream tasks, such as machine translation (Agrawal et al., 2022; Sia and Duh, 2023), data generation, and others (Wang et al., 2021b; He et al., 2023; Panda et al., 2023). Furthermore, the advent of advanced strategies such as chain-of-thought prompting has significantly refined the efficacy of ICL, offering deeper insights and more nuanced understanding within this innovative paradigm (Kim et al., 2022; Chan et al., 2022; Srivastava et al., 2022; Bansal et al., 2022). ...
... Garcia et al. (2023) showed performances of ICL comparable to those of large, supervised models. Vilar et al. (2023) and Agrawal et al. (2023) evaluated various strategies for selecting translation examples for ICL, emphasizing the importance of example quality. ...
... Due to their strong in-context learning and instruction-following abilities, powerful LLMs like GPT-4 have achieved remarkable progress in machine translation, with performance comparable to the top systems on the WMT translation task (Zhu et al., 2023; He et al., 2023; Raunak et al., 2023). To fully leverage LLMs' translation ability, various methods have been proposed, including in-context translation exemplar selection (Garcia et al., 2023; Lin et al., 2022; Zhang et al., 2023a; Agrawal et al., 2022) and prompt optimization and decoding strategies (Zeng et al., 2023a). ...
... We use both GPT-3.5 and GPT-4 from the Microsoft Azure OpenAI Service. Unless otherwise noted, the number of few-shot samples in the LLM and SCALE is set to 10, and the sample selection strategy follows Agrawal et al. (2022). The prompt we use can be found in Appendix A. ...
... In summary, two primary objectives emerge for demonstration selection: similarity-based and diversity-based methods. The former entails choosing demonstrations akin to the test instance, facilitating learning through analogy for LLMs (Liu et al., 2022; Lu et al., 2022; Rubin et al., 2022; Shi et al., 2022; Zhang et al., 2022b; Agrawal et al., 2022; Dalvi Mishra et al., 2022; Li and Qiu, 2023b; Luo et al., 2023; Wang et al., 2024). The latter emphasizes maximizing demonstration diversity with respect to the given test instance to diminish redundancy and enrich the information conveyed to LLMs (Sorensen et al., 2022; Levy et al., 2023; Ye et al., 2023; Naik et al., 2023; Ma et al., 2023). ...
... (2) BM25 (Robertson et al., 2009) assesses relevance through keyword overlap and sentence length, as used by Agrawal et al. (2023). ...
... Recent work has highlighted a range of qualitative advantages that large language models (LLMs) hold over Neural Machine Translation (NMT) models. One significant advantage is the controllability of style and language variety, which can be achieved through prompting and in-context learning (Brown et al., 2020; Garcia et al., 2023; Agrawal et al., 2023). LLMs also exhibit inherent document-level translation abilities (Wang et al., 2023; Karpinska and Iyyer, 2023). ...
... It prepends few-shot training examples before the test input as a prompt, enabling large language models to find patterns and "learn" to predict. There have been successful applications of ICL in downstream tasks such as machine translation (Lin et al., 2021; Agrawal et al., 2022) and data generation. Despite its success in few-shot learning, a major drawback of ICL is instability. ...
... Against this backdrop, in-context learning (ICL) [15], [16] has emerged as a promising approach in natural language processing (NLP), allowing models to leverage relevant examples embedded within the input, enabling adaptation without finetuning. The success of ICL hinges on the quality of selected demonstrations [17], which poses unique challenges in speech processing due to the complexity of audio data. Existing approaches [18], [19] often rely on random selection, leading to suboptimal results. ...
Preprint
State-of-the-art models like OpenAI's Whisper exhibit strong performance in multilingual automatic speech recognition (ASR), but they still face challenges in accurately recognizing diverse subdialects. In this paper, we propose M2R-whisper, a novel multi-stage and multi-scale retrieval augmentation approach designed to enhance ASR performance in low-resource settings. Building on the principles of in-context learning (ICL) and retrieval-augmented techniques, our method employs sentence-level ICL in the pre-processing stage to harness contextual information, while integrating token-level k-Nearest Neighbors (kNN) retrieval as a post-processing step to further refine the final output distribution. By synergistically combining sentence-level and token-level retrieval strategies, M2R-whisper effectively mitigates various types of recognition errors. Experiments conducted on Mandarin and subdialect datasets, including AISHELL-1 and KeSpeech, demonstrate substantial improvements in ASR accuracy, all achieved without any parameter updates.
Article
Learning with limited labelled data, such as prompting, in-context learning, fine-tuning, meta-learning or few-shot learning, aims to effectively train a model using only a small number of labelled samples. However, these approaches have been observed to be excessively sensitive to uncontrolled randomness caused by non-determinism in the training process. This randomness negatively affects the stability of the models, leading to large variance in results across training runs. When such sensitivity is disregarded, it can unintentionally, but unfortunately also intentionally, create a false impression of research progress. Recently, this area started to attract research attention and the number of relevant studies is continuously growing. In this survey, we provide a comprehensive overview of 415 papers addressing the effects of randomness on the stability of learning with limited labelled data. We distinguish between four main tasks addressed in the papers (investigate/evaluate; determine; mitigate; benchmark/compare/report randomness effects), providing findings for each one. Furthermore, we identify and discuss seven challenges and open problems together with possible directions to facilitate further research. The ultimate goal of this survey is to emphasise the importance of this growing research area, which so far has not received an appropriate level of attention, and to reveal impactful directions for future research.
Article
Full-text available
Although research suggests that using a TM (translation memory) can increase productivity by 10% to 70%, the actual gain depends on the TM content. If the target renditions stored in the TM database are relatively free (less literal), this may adversely affect the translator's productivity. This paper examines how productivity is affected by different kinds of TM databases. A pilot experiment was undertaken to investigate the impact of two versions of a TM database: free vs. literal TMs. All participants translated the same source text but used different TMs. The results show that in the higher fuzzy-match categories, translators using the less literal TM did not gain as much speed as they did when using a more literal TM.
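For readers unfamiliar with fuzzy-match categories: TM tools bucket new segments by their string similarity to stored segments. A toy sketch of such scoring follows; real CAT tools use more elaborate, tokenization-aware measures, so the function and thresholds here are illustrative only.

```python
# Toy fuzzy-match score between a new segment and a TM segment,
# expressed as a percentage, as TM tools commonly report it.
from difflib import SequenceMatcher

def fuzzy_match(new_segment, tm_segment):
    return round(100 * SequenceMatcher(None, new_segment, tm_segment).ratio())

print(fuzzy_match("Press the red button to start.",
                  "Press the green button to start."))  # a high-band fuzzy match
```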
Article
Full-text available
The Probabilistic Relevance Framework (PRF) is a formal framework for document retrieval, grounded in work done in the 1970s and 1980s, which led to the development of one of the most successful text-retrieval algorithms, BM25. In recent years, research in the PRF has yielded new retrieval models capable of taking into account document meta-data (especially structure and link-graph information). Again, this has led to one of the most successful Web-search and corporate-search algorithms, BM25F. This work presents the PRF from a conceptual point of view, describing the probabilistic modelling assumptions behind the framework and the different ranking algorithms that result from its application: the binary independence model, relevance feedback models, BM25 and BM25F. It also discusses the relation between the PRF and other statistical models for IR, and covers some related topics, such as the use of non-textual features, and parameter optimisation for models with free parameters.
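For reference, the BM25 ranking function that this framework culminates in can be written as follows (one common presentation; parameter conventions vary slightly across sources):

```latex
% BM25 score of document D for query Q
\mathrm{score}(D, Q) = \sum_{q_i \in Q} \mathrm{IDF}(q_i)\,
  \frac{f(q_i, D)\,(k_1 + 1)}
       {f(q_i, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}
```

Here f(q_i, D) is the frequency of term q_i in document D, |D| is the document length, avgdl is the average document length in the collection, and k_1 and b are the framework's free parameters.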
Article
Full-text available
Human evaluations of machine translation are extensive but expensive. Human evaluations can take months to finish and involve human labor that cannot be reused.
Conference Paper
When primed with only a handful of training samples, very large, pretrained language models such as GPT-3 have shown competitive results when compared to fully-supervised, fine-tuned, large, pretrained language models. We demonstrate that the order in which the samples are provided can make the difference between near state-of-the-art and random guess performance: essentially some permutations are “fantastic” and some not. We analyse this phenomenon in detail, establishing that: it is present across model sizes (even for the largest current models), it is not related to a specific subset of samples, and that a given good permutation for one model is not transferable to another. While one could use a development set to determine which permutations are performant, this would deviate from the true few-shot setting as it requires additional annotated data. Instead, we use the generative nature of language models to construct an artificial development set and based on entropy statistics of the candidate permutations on this set, we identify performant prompts. Our method yields a 13% relative improvement for GPT-family models across eleven different established text classification tasks.
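A simplified sketch of the entropy heuristic this paper describes: rank candidate demonstration orderings by the entropy of the label distribution they induce on a probe set. Here `predict_label` and the probe set construction are hypothetical stand-ins for querying the LLM and for the paper's artificially generated development set.

```python
# Rank permutations of demonstrations by the entropy of the labels
# they produce on a probe set; orderings that collapse onto a single
# label (low entropy) tend to be degenerate. Exhaustive enumeration
# is only feasible for small demonstration sets.
import itertools
import math
from collections import Counter

def global_entropy(permutation, probe_inputs, predict_label):
    labels = [predict_label(permutation, x) for x in probe_inputs]
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def best_ordering(demos, probe_inputs, predict_label):
    return max(itertools.permutations(demos),
               key=lambda p: global_entropy(p, probe_inputs, predict_label))
```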
Conference Paper
This paper presents the results of the premier shared task organized alongside the Conference on Machine Translation (WMT) 2019. Participants were asked to build machine translation systems for any of 18 language pairs, to be evaluated on a test set of news stories. The main metric for this task is human judgment of translation quality. The task was also opened up to additional test suites to probe specific aspects of translation.
Article
In the last ten years there has been a significant amount of research in Machine Translation within a "new" paradigm of empirical approaches, often labelled collectively as "Example-based" approaches. The first manifestation of this approach caused some surprise and hostility among observers more used to different ways of working, but the techniques were quickly adopted and adapted by many researchers, often creating hybrid systems. This paper reviews the various research efforts within this paradigm reported to date, and attempts a categorisation of different manifestations of the general approach.
Stanford neural machine translation systems for spoken language domains
Minh-Thang Luong and Christopher Manning. 2015. Stanford neural machine translation systems for spoken language domains. In Proceedings of the 12th International Workshop on Spoken Language Translation: Evaluation Campaign, pages 76-79, Da Nang, Vietnam.
Rethinking the role of demonstrations: What makes in-context learning work?
Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2022. Rethinking the role of demonstrations: What makes in-context learning work? arXiv preprint arXiv:2202.12837.
Searching for COMETINHO: The little metric that could
Ricardo Rei, Ana C Farinha, José G.C. de Souza, Pedro G. Ramos, André F.T. Martins, Luisa Coheur, and Alon Lavie. 2022. Searching for COMETINHO: The little metric that could. In Proceedings of the 23rd Annual Conference of the European Association for Machine Translation, pages 61-70, Ghent, Belgium. European Association for Machine Translation.
BLOOM: A 176B-parameter open-access multilingual language model
Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. 2022. BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.
mGPT: Few-shot learners go multilingual
Oleh Shliazhko, Alena Fenogenova, Maria Tikhonova, Vladislav Mikhailov, Anastasia Kozlova, and Tatiana Shavrina. 2022. mGPT: Few-shot learners go multilingual. arXiv preprint arXiv:2204.07580.
AlexaTM 20B: Few-shot learning using a large-scale multilingual seq2seq model
Saleh Soltan, Shankar Ananthakrishnan, Jack FitzGerald, Rahul Gupta, Wael Hamza, Haidar Khan, Charith Peris, Stephen Rawls, Andy Rosenbaum, Anna Rumshisky, et al. 2022. AlexaTM 20B: Few-shot learning using a large-scale multilingual seq2seq model. arXiv preprint arXiv:2208.01448.
Generate-and-retrieve: Use your predictions to improve retrieval for semantic parsing
Yury Zemlyanskiy, Michiel de Jong, Joshua Ainslie, Panupong Pasupat, Peter Shaw, Linlu Qiu, Sumit Sanghai, and Fei Sha. 2022. Generate-and-retrieve: Use your predictions to improve retrieval for semantic parsing. In Proceedings of the 29th International Conference on Computational Linguistics, pages 4946-4951.
Prompting large language model for machine translation: A case study
Biao Zhang, Barry Haddow, and Alexandra Birch. 2023. Prompting large language model for machine translation: A case study. arXiv preprint arXiv:2301.07069.