Conference Paper

A Survey on In-context Learning

Authors:
To read the full-text of this research, you can request a copy directly from the authors.

No full-text available

Request Full-text Paper PDF

To read the full-text of this research,
you can request a copy directly from the authors.

... LLMs utilize deep learning techniques and become promising tools for supporting various human tasks. LLMs have also been explored in the realm of OR applications, with two primary approaches: (1) LLMs as Optimizers utilize in-context learning [11,24] to conduct optimization through conversational interfaces and leverage chain of thought (CoT) prompts [20] to harness reasoning capabilities. In this way, LLMs as optimizers do not require specialized OR knowledge. ...
... Constraint (10) ensures that if a parent node is inactive, its children must also be inactive; but if any child is active, then the parent must also be active. Constraint (11) guarantees that within each tree h, only one leaf can be active. Constraint (12) defines the binary variables. ...
Preprint
Taxi pricing and pre-allocation problems are central to urban traffic efficiency and the convenience of residents' travel. However, previous approaches face challenges in uncertain practical scenarios: (i) unpredictable ride demands due to dynamic factors such as weather and workday variations, and (ii) diverse management objectives of non-expert operators, such as minimizing dispatch costs and enhancing customer satisfaction. This paper introduces RideAgent, a solution tailored for non-expert users, which combines a Large Language Model (LLM) with a feature-driven optimization modeling approach. Experimental results show that RideAgent improves computational efficiency by 53.15% compared to traditional solvers while maintaining an optimality gap of only 2.42%. Furthermore, its variable fixing strategy outperforms five conventional cutting methods, with minimal compromise in solution quality but a significant reduction in computation time of 42.3%. RideAgent effectively caters to personalized operational needs and enables more efficient urban management.
... To address these gaps, we propose a novel framework that combines the generative capabilities of LLMs with the OpenSeesPy package [42], and we assess its performance on a curated dataset of 20 structural analysis word problems (SAWPs). Specifically, we employ multiple base models within the framework-including GPT-4 [22], GPT-4o [43], Llama 3 [44], and Gemini 1.5 [45]-and compare their baseline performance with versions enhanced by techniques such as few-shot learning [13] and in-context learning (ICL) [46]. LLMs are capable of extracting critical information from textual problem descriptions and generating Finite Element Analysis (FEA) Python scripts for 2D frame structures. ...
... Our input includes an example problem, its corresponding code, and specific constraints to help the LLM generate runnable and stable Python scripts. To structure the system instructions effectively, we introduce an ICL prompt template in Figure 3. ICL has been shown to improve LLM performance in coding and reasoning [46], and the style and format of our template are adapted from reference [48]. General Template includes preliminary settings to help the LLM understand the problem in a structural engineering context. ...
Preprint
Automated analysis for engineering structures offers considerable potential for boosting efficiency by minimizing repetitive tasks. Although AI-driven methods are increasingly common, no systematic framework yet leverages Large Language Models (LLMs) for automatic structural analysis. To address this gap, we propose a novel framework that integrates LLMs with structural analysis software. LLMs serve as the core engine: they parse structural descriptions from text and translate them into executable Python scripts. Moreover, the framework integrates the generative capabilities of LLMs with code-based finite element (FE) tools like OpenSeesPy. It employs domain-specific prompt design and in-context learning strategies to enhance the LLM's problem-solving capabilities and generative stability, enabling fully automated structural analysis from descriptive text to model outputs. In our experiments, we introduce a well-curated small-scale benchmark dataset of 20 structural analysis word problems (SAWPs) with ground-truth solutions and evaluate the performance of different LLMs within our framework in solving these SAWPs. The role of system instructions, crafted by structural engineers, is also investigated to understand their impact on LLM-driven structural analysis. Additionally, the generative stability of our framework is examined. Through multiple validation experiments on the benchmark, our results demonstrate that the proposed framework can substantially increase the level of automation in solving SAWPs compared to traditional methods. Quantitatively, the framework, built on GPT-4o, achieved 100% accuracy, surpassing GPT-4 (85%), Gemini 1.5 Pro (80%), and Llama-3.3 (30%) on the test examples. Furthermore, integrating domain-specific instructions enhanced performance by 30% on problems with asymmetrical structural configurations.
... Prompt engineering has become a critical technique for enhancing the capabilities of LLMs [27,63,69,74,88,116,118,127,129,139,144]. It facilitates the integration of LLMs into decision-making tasks by eliciting task-relevant knowledge and supporting complex behaviors, all without updating model parameters. ...
... Specifically, in-context learning [13,27,125] provides LLMs with a few question-answer examples to induce an understanding of a given decision-making task. Chain-of-Thought (CoT) prompting [52,119,134,145] instructs LLMs to "Let's think step-by-step" and then generate intermediate steps between inputs and outputs to enhance problem-solving. ...
Preprint
Large language models (LLMs) have shown potential in supporting decision-making applications, particularly as personal conversational assistants in the financial, healthcare, and legal domains. While prompt engineering strategies have enhanced the capabilities of LLMs in decision-making, cognitive biases inherent to LLMs present significant challenges. Cognitive biases are systematic patterns of deviation from norms or rationality in decision-making that can lead to the production of inaccurate outputs. Existing cognitive bias mitigation strategies assume that input prompts contain (exactly) one type of cognitive bias and therefore fail to perform well in realistic settings where there maybe any number of biases. To fill this gap, we propose a cognitive debiasing approach, called self-debiasing, that enhances the reliability of LLMs by iteratively refining prompts. Our method follows three sequential steps -- bias determination, bias analysis, and cognitive debiasing -- to iteratively mitigate potential cognitive biases in prompts. Experimental results on finance, healthcare, and legal decision-making tasks, using both closed-source and open-source LLMs, demonstrate that the proposed self-debiasing method outperforms both advanced prompt engineering methods and existing cognitive debiasing techniques in average accuracy under no-bias, single-bias, and multi-bias settings.
... Aspect-based sentiment analysis (ABSA) subtasks are no exception [17,18,19,20,21]. Additionally, to incorporate knowledge from training data into the model, supervised fine-tuning (SFT) and in-context learning (ICL) [22] are two widely adopted strategies. SFT involves updating all model parameters through back-propagation, while ICL allows models to adapt to new data and tasks by leveraging contextual information, without modifying model parameters. ...
... In-context learning [22] allows large language models to acquire knowledge on particular tasks using a limited set of labeled examples without updating the model's weights. We provide examples from the training set to the input prompt. ...
Article
Full-text available
Implicit sentiment expressions convey emotions indirectly, through context or factual statements, rather than explicit opinion words. Recent research on implicit sentiment analysis overlooks the fact that various individuals can interpret the same implicit expressions in different manners and experience different sentiments. Additionally, most previous research mainly focuses on implicit sentiment classification, neglecting the reasons behind the results. It hinders the deep understanding of the complexities involved in human emotions and limits the application of sentiment analysis. In this work, we introduce a new task, Abisa-Ex, which aims at both sentiment classification and explanation generation.We re-labeled the previous aspect-based implicit sentiment analysis dataset, incorporating new (sentiment, explanation) pair labels provided by various annotators. Based on the new dataset, we design frameworks to allow models to learn to predict sentiments from different perspectives and provide reasonable explanations jointly. Notably, our work shows that learning explanations from various viewpoints not only allows the model to generate the logical process behind sentiment analysis, but also significantly improves the model’s sentiment classification performance.
... In such a scenario, this paper explores the direction of in-context-learning (ICL) aided with limited (and task mis-matched) fine-tuning to enable novel SLU tasks with speech-text LLMs. The approach of ICL [8,9,10] allows models to learn new tasks with the aid of only a few examples [10,11]. ...
Preprint
Full-text available
Spoken language understanding (SLU) tasks involve diverse skills that probe the information extraction, classification and/or generation capabilities of models. In this setting, task-specific training data may not always be available. While traditional task-specific SLU models are unable to cater to such requirements, the speech-text large language models (LLMs) offer a promising alternative with emergent abilities. However, out of-the-box, our evaluations indicate that the zero/few-shot performance of prominent open-source speech-text LLMs on SLU tasks are not up to the mark. In this paper, we introduce a novel approach to robust task-agnostic fine-tuning using randomized class labels. With this proposed fine-tuning, we illustrate that the performance of the speech-text LLMs on an unseen task is significantly improved over standard approaches. Critically, the proposed approach avoids the requirement of task-specific data annotations for enabling new tasks in speech-text LLMs.
... Recently, Large Language Models [29] (LLMs) have demonstrated remarkable capabilities across a vast range of domains, driven by advancements in natural language processing, reasoning, and particularly in-context learning [2]. LLMs can perform specific tasks based on instructions and knowledge provided within a prompt, often achieving reasonable performance in zero-shot [19] or fewshot [21] settings without task-specific fine-tuning. ...
Preprint
Non-intrusive Load Monitoring (NILM) aims to disaggregate aggregate household electricity consumption into individual appliance usage, enabling more effective energy management. While deep learning has advanced NILM, it remains limited by its dependence on labeled data, restricted generalization, and lack of interpretability. In this paper, we introduce the first prompt-based NILM framework that leverages Large Language Models (LLMs) with in-context learning. We design and evaluate prompt strategies that integrate appliance features, timestamps and contextual information, as well as representative time-series examples, using the REDD dataset. With optimized prompts, LLMs achieve competitive state detection accuracy, reaching an average F1-score of 0.676 on unseen households, and demonstrate robust generalization without the need for fine-tuning. LLMs also enhance interpretability by providing clear, human-readable explanations for their predictions. Our results show that LLMs can reduce data requirements, improve adaptability, and provide transparent energy disaggregation in NILM applications.
... Therefore, selecting high-quality KGs for model training is complex and demanding. To validate whether the KGs identified by the proposed evaluation framework is "good" or "bad" examples, we mainly employ in-context learning [16] to validate how the selected support examples influence the quality of graph-to-text generation outcomes. In summary, the contributions of this paper are as follows: ...
... Methods such as reinforcement learning from human feedback (RLHF) 27 and direct preference optimization (DPO) 28 also allow LLMs to learn directly from their users in order to be better aligned with user preferences, and recent work has extended these approaches to the "online learning" setting, allowing for continuous updating of deployed models 29, 30 . The behavior of LLMs can also be substantially changed during deployment through interactions with users, without updating any of the model parameters: for example, in-context learning allows LLMs to learn from new training data presented in their prompts 31 , and chain-of-thought prompting enables LLMs to more effectively reason through complex problems 32 . For all these reasons, the line between model development and model deployment is becoming increasingly blurred. ...
Article
Full-text available
There is a growing recognition of the need for clinical trials to safely and effectively deploy artificial intelligence (AI) in clinical settings. We introduce dynamic deployment as a framework for AI clinical trials tailored for the dynamic nature of large language models, making possible complex medical AI systems which continuously learn and adapt in situ from new data and interactions with users while enabling continuous real-time monitoring and clinical validation.
... One of the most fascinating properties of Large Language Models (LLMs) is its In-Context Learning capability [1], [2]. It refers to the ability of a pre-trained LLM to achieve competitive results on downstream tasks given only a few prompt examples during the prediction phase, without updating the model weights through fine-tuning approaches. ...
Preprint
Graph In-Context Learning, with the ability to adapt pre-trained graph models to novel and diverse downstream graphs without updating any parameters, has gained much attention in the community. The key to graph in-context learning is to perform downstream graphs conditioned on chosen prompt examples. Existing methods randomly select subgraphs or edges as prompts, leading to noisy graph prompts and inferior model performance. Additionally, due to the gap between pre-training and testing graphs, when the number of classes in the testing graphs is much greater than that in the training, the in-context learning ability will also significantly deteriorate. To tackle the aforementioned challenges, we develop a multi-stage adaptive prompt optimization method GraphPrompter, which optimizes the entire process of generating, selecting, and using graph prompts for better in-context learning capabilities. Firstly, Prompt Generator introduces a reconstruction layer to highlight the most informative edges and reduce irrelevant noise for graph prompt construction. Furthermore, in the selection stage, Prompt Selector employs the k-nearest neighbors algorithm and pre-trained selection layers to dynamically choose appropriate samples and minimize the influence of irrelevant prompts. Finally, we leverage a Prompt Augmenter with a cache replacement strategy to enhance the generalization capability of the pre-trained model on new datasets. Extensive experiments show that GraphPrompter effectively enhances the in-context learning ability of graph models. On average across all the settings, our approach surpasses the state-of-the-art baselines by over 8%. Our code is released at https://github.com/karin0018/GraphPrompter.
... For this role, the LLMs in the agent depend on the knowledge given and the included examples to perform the in-context learning and optimize the response with domain-specific knowledge. In the report by [60], the performance of LLMs based on in-context learning was comparable to fine-tuning, which usually requires a higher cost and more data to optimize the model itself. As discussed, the agent for knowledge retrieval will check if the input data include any samples for in-context learning during the knowledge-acquisition tasks. ...
Article
Full-text available
Featured Application The proposed framework can be integrated into learning platforms to enhance personalized adaptive learning. By leveraging knowledge-driven agents and RAG pipelines, this framework improves the accuracy and effectiveness of AI assistants, while expanding their capabilities through the incorporation of customized knowledge. Continuous updates to the knowledge base enable AI models to dynamically adapt to individual learners, delivering context-aware and precise responses tailored to their needs. This approach is particularly valuable for integrated interdisciplinary learning such as digital transformation, where multidisciplinary knowledge integration plays a crucial role in fostering deeper understanding and knowledge retention. Abstract As Large Language Models (LLMs) incorporate generative Artificial Intelligence (AI) and complex machine learning algorithms, they have proven to be highly effective in assisting human users with complex professional tasks through natural language interaction. However, in addition to their current capabilities, LLMs occasionally generate responses that contain factual inaccuracies, stemming from their dependence on the parametric knowledge they encapsulate. To avoid such inaccuracies, also known as hallucinations, people use domain-specific knowledge (expertise) to support LLMs in the corresponding task, but the necessary knowledge engineering process usually requires considerable manual effort from experts. In this paper, we developed an approach to leverage the collective strengths of multiple agents to automatically facilitate the knowledge engineering process and then use the learned knowledge and Retrieval Augmented Generation (RAG) pipelines to optimize the performance of LLMs in domain-specific tasks. Through this approach, we effectively build AI assistants based on particular customized knowledge to help students better carry out personalized adaptive learning in digital transformation. Our initial tests demonstrated that integrating a Knowledge Graph (KG) within a RAG framework significantly improved the quality of domain-specific outputs generated by the LLMs. The results also revealed performance fluctuations for LLMs across varying contexts, underscoring the critical need for domain-specific knowledge support to enhance AI-driven adaptive learning systems.
... Few-shot prompts were seen to be an effective way to contextualize model output's across a variety of tasks in NLP by providing examples of the task within the context of the LLM prompt. Factors like the diversity of examples and quality of examples shown for the in-context demonstrations may improve the quality of model outputs Dong et al., 2022). Within the context of multi-modal foundation models, models like Flamingo and BLIP-2 (Alayrac et al., 2022;Li et al., 2023c) have been shown to be effective at a variety of visual understanding tasks when given only given a small number of examples. ...
... ing (ICL) [6], especially with the advent of foundational models [3,26,27]. Recent work with Large Language Models (LLMs) [11,33] and Multimodal LLMs (MLLMs) [13,38,40] has demonstrated that prompting models using task-specific examples can efficiently adapt downstream tasks, achieving comparable efficacy to in-weight learning [4]. ...
Preprint
Full-text available
Visual In-Context Learning (VICL) enables adaptively solving vision tasks by leveraging pixel demonstrations, mimicking human-like task completion through analogy. Prompt selection is critical in VICL, but current methods assume the existence of a single "ideal" prompt in a pool of candidates, which in practice may not hold true. Multiple suitable prompts may exist, but individually they often fall short, leading to difficulties in selection and the exclusion of useful context. To address this, we propose a new perspective: prompt condensation. Rather than relying on a single prompt, candidate prompts collaborate to efficiently integrate informative contexts without sacrificing resolution. We devise Condenser, a lightweight external plugin that compresses relevant fine-grained context across multiple prompts. Optimized end-to-end with the backbone, Condenser ensures accurate integration of contextual cues. Experiments demonstrate Condenser outperforms state-of-the-arts across benchmark tasks, showing superior context compression, scalability with more prompts, and enhanced computational efficiency compared to ensemble methods, positioning it as a highly competitive solution for VICL. Code is open-sourced at https://github.com/gimpong/CVPR25-Condenser.
... In our experimental case study, we observe that not all models consistently follow the guidelines and maintain a structured response format, which can significantly hinder the filtering process and damage the model utility (See Section 4.4.2). To address this issue, we introduce two in-context learning [10] examples to reinforce adherence to the guidelines, improving the consistency and reliability of the generated responses. The splitting process and guidelines work as the defense function Def(·), formulating a prompt P = Def(I ori , T inj ) with an example shown in Appendix D. Then the response is obtained as R = M(P ). ...
Preprint
Large language models (LLMs) have demonstrated impressive performance and have come to dominate the field of natural language processing (NLP) across various tasks. However, due to their strong instruction-following capabilities and inability to distinguish between instructions and data content, LLMs are vulnerable to prompt injection attacks. These attacks manipulate LLMs into deviating from the original input instructions and executing maliciously injected instructions within data content, such as web documents retrieved from search engines. Existing defense methods, including prompt-engineering and fine-tuning approaches, typically instruct models to follow the original input instructions while suppressing their tendencies to execute injected instructions. However, our experiments reveal that suppressing instruction-following tendencies is challenging. Through analyzing failure cases, we observe that although LLMs tend to respond to any recognized instructions, they are aware of which specific instructions they are executing and can correctly reference them within the original prompt. Motivated by these findings, we propose a novel defense method that leverages, rather than suppresses, the instruction-following abilities of LLMs. Our approach prompts LLMs to generate responses that include both answers and their corresponding instruction references. Based on these references, we filter out answers not associated with the original input instructions. Comprehensive experiments demonstrate that our method outperforms prompt-engineering baselines and achieves performance comparable to fine-tuning methods, reducing the attack success rate (ASR) to 0 percent in some scenarios. Moreover, our approach has minimal impact on overall utility.
... The goal was to assess whether LLMs could generalize the visual reasoning approach demonstrated in Task E1 and apply it to fresh data. The decision to explore one-shot prompting exclusively in Task E2, rather than in the other tasks, was based on the understanding that prior research has already established the strong performance of LLMs in zero-shot and few-shot contexts for textual data 17 . However, visual data, particularly in the context of diabetes, has been less explored in these prompting strategies. ...
Preprint
Full-text available
This study explores the potential of state-of-the-art large language models (LLM) to scaffold type 1 diabetes management by automating the analysis of multimodal diabetes device data, including blood glucose, carbohydrate, and insulin. By conducting a series of empirically grounded data analysis tasks, such as detecting glycemic episodes, clustering similar episodes into patterns, identifying counterfactual days, and performing visual data analysis, we assess whether models like ChatGPT 4o, Claude 3.5 Sonnet, and Gemini Advanced can offer meaningful insights from diabetes data. Our findings show that ChatGPT 4o demonstrates strong potential in accurately interpreting data in the context of specific glycemic episodes, identifying glycemic patterns, and analyzing patterns. However, limitations in handling edge cases and visual reasoning tasks highlight areas for future development. Using LLMs to automate data analysis tasks and generate narrative summaries could scaffold clinical decision-making in diabetes management, which could make frequent data review feasible for improved patient outcomes.
... We further augment the context with few-shot examples of natural language questions and corresponding Cypher pairs, drawing parallels to in-context learning techniques that improve semantic parsing robustness [61]. To minimize stochasticity, we set the LLM's temperature hyperparameter to zero (0) to generate the most deterministic response [62]. ...
Article
This paper introduces a novel framework enabling natural language question answering on Piping and Instrumentation Diagrams (P&IDs), addressing a critical gap between engineering design documentation and intuitive information retrieval. Our approach transforms static P&IDs into queryable knowledge bases through a three-stage pipeline. First, we recognize entities in a P&ID image and organize their relationships to form a base entity graph. Second, this entity graph is converted into a Labeled Property Graph (LPG), enriched with semantic attributes for nodes and edges. Third, a Large Language Model (LLM)-based information retrieval system translates a user query into a graph query language (Cypher) and retrieves the answer by executing it on LPG. For our experiments, we augmented a publicly available P&ID image dataset with our novel PIDQA dataset, which comprises 64,000 question–answer pairs spanning four categories: (I) simple counting, (II) spatial counting, (III) spatial connections, and (IV) value-based questions. Our experiments (using gpt-3.5-turbo) demonstrate that grounding the LLM with dynamic few-shot sampling robustly elevates accuracy by 10.6–43.5% over schema contextualization alone, even under high lexical diversity conditions (e.g., paraphrasing, ambiguity). By reducing barriers in retrieving P&ID data, this work advances human–AI collaboration for industrial workflows in design validation and safety audits.
... Large language models (LLMs) are effective tools for incontext learning [36], [37]. Zero-shot LLMs combine input data and prompt design to form context and then use this context to generate answers directly. ...
Preprint
Measuring scientific paper innovation is both important and challenging. Existing content-based methods often overlook the full-paper context, fail to capture the full scope of innovation, and lack generalization. We propose HSPIM, a hierarchical and training-free framework based on large language models (LLMs). It introduces a Paper-to-Sections-to-QAs decomposition to assess innovation. We segment the text by section titles and use zero-shot LLM prompting to implement section classification, question-answering (QA) augmentation, and weighted novelty scoring. The generated QA pair focuses on section-level innovation and serves as additional context to improve the LLM scoring. For each chunk, the LLM outputs a novelty score and a confidence score. We use confidence scores as weights to aggregate novelty scores into a paper-level innovation score. To further improve performance, we propose a two-layer question structure consisting of common and section-specific questions, and apply a genetic algorithm to optimize the question-prompt combinations. Comprehensive experiments on scientific conference paper datasets show that HSPIM outperforms baseline methods in effectiveness, generalization, and interpretability.
... Each record contains a task signature and an associated step-by-step plan, stored in an application-specific example database. When a similar task is encountered in the future, AppAgent uses In-Context Learning (ICL) [35][36][37][38] to retrieve relevant demonstrations and improve execution fidelity. This dynamic reinforcement pipeline transforms the system into a long-lived agent that improves with use, without introducing the brittleness or operational cost of fine-tuning [39]. ...
Preprint
Recent Computer-Using Agents (CUAs), powered by multimodal large language models (LLMs), offer a promising direction for automating complex desktop workflows through natural language. However, most existing CUAs remain conceptual prototypes, hindered by shallow OS integration, fragile screenshot-based interaction, and disruptive execution. We present UFO2, a multiagent AgentOS for Windows desktops that elevates CUAs into practical, system-level automation. UFO2 features a centralized HostAgent for task decomposition and coordination, alongside a collection of application-specialized AppAgent equipped with native APIs, domain-specific knowledge, and a unified GUI--API action layer. This architecture enables robust task execution while preserving modularity and extensibility. A hybrid control detection pipeline fuses Windows UI Automation (UIA) with vision-based parsing to support diverse interface styles. Runtime efficiency is further enhanced through speculative multi-action planning, reducing per-step LLM overhead. Finally, a Picture-in-Picture (PiP) interface enables automation within an isolated virtual desktop, allowing agents and users to operate concurrently without interference. We evaluate UFO2 across over 20 real-world Windows applications, demonstrating substantial improvements in robustness and execution accuracy over prior CUAs. Our results show that deep OS integration unlocks a scalable path toward reliable, user-aligned desktop automation.
... In fact, one perspective on why ICL can derive effective prompts from shots is that the model possesses a certain degree of analogical reasoning capability 25 . The effectiveness of AL further illustrates the soundness of this perspective. ...
Preprint
Existing learning models often exhibit poor generalization when deployed across diverse scenarios. It is mainly due to that the underlying reference frame of the data varies with the deployment environment and settings. However, despite the data of each scenario has its distinct reference frame, its generation generally follows the same underlying physical rule. Based on these findings, this article proposes a brand-new universal deep learning framework named analogical learning (AL), which provides a highly efficient way to implicitly retrieve the reference frame information associated with a scenario and then to make accurate prediction by relative analogy across scenarios. Specifically, an elegant bipartite neural network architecture called Mateformer is designed, the first part of which calculates the relativity within multiple feature spaces between the input data and a small amount of embedded data from the current scenario, while the second part uses these relativity to guide the nonlinear analogy. We apply AL to the typical multi-scenario learning problem of intelligent wireless localization in cellular networks. Extensive experiments show that AL achieves state-of-the-art accuracy, stable transferability and robust adaptation to new scenarios without any tuning, and outperforming conventional methods with a precision improvement of nearly two orders of magnitude. All data and code are available at https://github.com/ziruichen-research/ALLoc.
... In-context learning (ICL) (Dong et al. 2022) refers to strategies that optimize input for LLMs (M) to generate practical outputs with a task-specific instruction (I) and a few output examples (E). We introduce distinct reasoning methods to fully assess the reasoning capabilities of LLMs Textual Chain-of-Thought (TCoT) TCoT (Wei et al. 2022) refers to a reasoning process in which LLMs incrementally derive a series of intermediate steps or sub-goals through textual prompts before generating the final answer. ...
Article
Recent advancements in Large Language Models (LLMs) have markedly enhanced the interpretation and processing of tabular data, introducing previously unimaginable capabilities. Despite these achievements, LLMs still encounter significant challenges when applied in industrial scenarios, particularly due to the increased complexity of reasoning required with real-world tabular data, underscoring a notable disparity between academic benchmarks and practical applications. To address this discrepancy, we conduct a detailed investigation into the application of tabular data in industrial scenarios and propose a comprehensive and complex benchmark TableBench, including 18 fields within four major categories of table question answering (TableQA) capabilities. Furthermore, we introduce TableLLM, trained on our meticulously constructed training set TableInstruct, achieving comparable performance with GPT-3.5. Massive experiments conducted on TableBench indicate that both open-source and proprietary LLMs still have significant room for improvement to meet real-world demands, where the most advanced model, GPT-4, achieves only a modest score compared to humans.
... The number of training epochs is set to 2 (5 for API generation), the learning rate is set to 2 −4, and the batch size is set 2} to identify the optimal setting for the task. In our ablation analysis, we replace fine-tuning with in-context learning [14] by preparing a prompt that includes the 5 problem-specific examples in the training set with the highest relevance scores. ...
Preprint
Full-text available
Large language models (LLMs) excel in question-answering (QA) tasks, and retrieval-augmented generation (RAG) enhances their precision by incorporating external evidence from diverse sources like web pages, databases, and knowledge graphs. However, current RAG methods rely on agent-specific strategies for individual data sources, posing challenges low-resource or black-box environments and complicates operations when evidence is fragmented across sources. To address these limitations, we propose ER-RAG, a framework that unifies evidence integration across heterogeneous data sources using the Entity-Relationship (ER) model. ER-RAG standardizes entity retrieval and relationship querying through ER-based APIs with GET and JOIN operations. It employs a two-stage generation process: first, a preference optimization module selects optimal sources; second, another module constructs API chains based on source schemas. This unified approach allows efficient fine-tuning and seamless integration across diverse data sources. ER-RAG demonstrated its effectiveness by winning all three tracks of the 2024 KDDCup CRAG Challenge, achieving performance on par with commercial RAG pipelines using an 8B LLM backbone. It outperformed hybrid competitors by 3.1% in LLM score and accelerated retrieval by 5.5X.
... Various algorithms have been developed to improve ICL performance by optimizing demonstration selection [26,150,188,221], ordering [105,109], and formatting [77,108,180]. While research has observed, context scaling in ICL, where model performance improves as the number of in-context examples increases [1,12,116,125], traditional ICL methods remain constrained by the maximum input context length, limiting them to a few-shot setting [38]. Although some works, such as SAICL [13], modify the attention structure to scale ICL to hundreds of demonstrations [55,92,93], they do not fully explore the potential benefits and challenges of utilizing a significantly larger number of examples. ...
Preprint
Full-text available
The rapid advancements in large Language models (LLMs) have significantly enhanced their reasoning capabilities, driven by various strategies such as multi-agent collaboration. However, unlike the well-established performance improvements achieved through scaling data and model size, the scaling of reasoning in LLMs is more complex and can even negatively impact reasoning performance, introducing new challenges in model alignment and robustness. In this survey, we provide a comprehensive examination of scaling in LLM reasoning, categorizing it into multiple dimensions and analyzing how and to what extent different scaling strategies contribute to improving reasoning capabilities. We begin by exploring scaling in input size, which enables LLMs to process and utilize more extensive context for improved reasoning. Next, we analyze scaling in reasoning steps that improves multi-step inference and logical consistency. We then examine scaling in reasoning rounds, where iterative interactions refine reasoning outcomes. Furthermore, we discuss scaling in training-enabled reasoning, focusing on optimization through iterative model improvement. Finally, we review applications of scaling across domains and outline future directions for further advancing LLM reasoning. By synthesizing these diverse perspectives, this survey aims to provide insights into how scaling strategies fundamentally enhance the reasoning capabilities of LLMs and further guide the development of next-generation AI systems.
Conference Paper
Full-text available
Large language models (LLMs) are advanced AI systems applied across various domains, including NLP, information retrieval, and recommendation systems. Despite their adaptability and efficiency, LLMs have not been extensively explored for signal processing tasks, particularly in the domain of global navigation satellite system (GNSS) interference monitoring. GNSS interference monitoring is essential to ensure the reliability of vehicle localization on roads, a critical requirement for numerous applications. However, GNSS-based positioning is vulnerable to interference from jamming devices, which can compromise its accuracy. The primary objective is to identify, classify, and mitigate these interferences. Interpreting GNSS snapshots and the associated interferences presents significant challenges due to the inherent complexity, including multipath effects, diverse interference types, varying sensor characteristics, and satellite constellations. In this paper, we extract features from a large GNSS dataset and employ LLaVA to retrieve relevant information from an extensive knowledge base. We employ prompt engineering to interpret the interferences and environmental factors, and utilize t-SNE to analyze the feature embeddings. Our findings demonstrate that the proposed method is capable of visual and logical reasoning within the GNSS context. Furthermore, our pipeline outperforms state-of-the-art machine learning models in interference classification tasks. Github: https://gitlab.cc-asp.fraunhofer.de/darcy_gnss
Article
Full-text available
With the proliferation of the internet and social media, the spread of fake news has become a global issue, posing serious challenges to the research of Fake News Detection (FND) methods. With advancements in Artificial Intelligence (AI), large language models (LLMs) have become increasingly evident across various industries, especially in natural language processing (NLP). LLM-based FND approaches, including Chain-of-Thought (CoT), self-reflection, and in-context learning (ICL) prompting paradigms, has shown promise but still faces challenges in effectively handling complex and nuanced content. For example, CoT paradigm faces error propagation issues, self-reflection methods suffer from the Degeneration-of-Thought (DoT) problem, and ICL paradigm is highly dependent on the quality of the provided context. To address these issues, we propose a multi-role detection method based on courtroom debates. This method involves two attorneys, representing the prosecution and the defense, as well as a judge, simulating a debate process on the authenticity of the news. First, the prosecution attempts to prove that the news is fake, while the defense tries to prove that the news is genuine. The judge evaluates the evidence presented by both sides to reach a conclusion. Next, the prosecution and defense switch roles, with each attempting to argue from the opposite standpoint, and the judge evaluates the arguments again. Finally, the judge synthesizes all arguments to issue a verdict. Extensive experiments across multiple challenging scenarios (e.g., controversial news and misleading media posts) show that this debate-based framework achieves up to 9%-11% higher accuracy than advanced LLM baselines, revealing how role switching significantly enhances detection performance. Moreover, our findings indicate that incorporating diverse perspectives reduces cognitive bias, but also highlight that LLM-based judges remain susceptible to inherent biases-especially if pretrained data include skewed narratives-underscoring the need for fairness adjustments in real-world applications. Overall, the proposed courtroom debate-based FND framework not only improves accuracy and reliability in identifying fake news but also provides an interpretable decision-making process by exposing key arguments on both sides. This underscores its potential to serve as a robust, transparent, and adaptable solution in the evolving domain of misinformation detection.
Article
Full-text available
Magnetic resonance imaging (MRI) has played a crucial role in the diagnosis, monitoring and treatment optimization of multiple sclerosis (MS). It is an essential component of current diagnostic criteria for its ability to non-invasively visualize both lesional and non-lesional pathology. Nevertheless, modern day usage of MRI in the clinic is limited by lengthy protocols, error-prone procedures for identifying disease markers (e.g., lesions), and the limited predictive value of existing imaging biomarkers for key disability outcomes. Recent advances in artificial intelligence (AI) have underscored the potential for AI to not only improve, but also transform how MRI is being used in MS. In this short review, we explore the role of AI in MS applications that span the entire life-cycle of an MRI image, from data collection, to lesion segmentation, detection, and volumetry, and finally to downstream clinical and scientific tasks. We conclude with a discussion on promising future directions.
Article
Large language models (LLMs) have demonstrated remarkable reasoning capabilities. However, in real-world robotic tasks, LLMs face grounding issues and lack precise feedback, resulting in the generated solutions deviating from the actual situation. In this paper, we propose Double-Feedback, a method that enhances LLMs reasoning by Knowledge graphs (KGs). The KGs play three key roles in Double-Feedback: prompting the LLMs to generate solutions, representing the task scenes, and verifying the solutions to provide feedback. We design structured knowledge prompts that convey the task knowledge background, example solutions, revision principles, and robotic tasks to the LLMs. We also introduce the distributed representation to quantify the task scene with interpretability. Based on the structured knowledge prompts and the distributed representation, we employ the KGs to evaluate the feasibility of each step before execution and verify the effects of the solutions after completing the tasks. The LLMs can adjust and replan the solutions based on the feedback from the KGs. Extensive experiments demonstrate that Double-Feedback outperforms prior works in the ALFRED benchmark. In addition, ablation studies show that Double-Feedback guides LLMs in generating solutions aligned with robotic tasks in the real world.
Article
Pre-trained language models (PLMs) have demonstrated significant proficiency in solving a wide range of general natural language processing (NLP) tasks. Researchers have observed a direct correlation between the performance of these models and their sizes. As a result, the sizes of these models have notably expanded in recent years, persuading researchers to adopt the term large language models (LLMs) to characterize the larger-sized PLMs. The size expansion comes with a distinct capability called in-context learning (ICL), which represents a special form of prompting and allows the models to be utilized through the presentation of demonstration examples without modifications to the model parameters. Although interesting, privacy concerns have become a major obstacle in its widespread usage. Multiple studies have examined the privacy risks linked to ICL and prompting in general, and have devised techniques to alleviate these risks. Thus, there is a necessity to organize these mitigation techniques for the benefit of the community. In this survey, we provide a systematic overview of the privacy protection methods employed during ICL and prompting in general. We review, analyze, and compare different methods under this paradigm. Furthermore, we provide a summary of the resources accessible for the development of these frameworks. Finally, we discuss the limitations of these frameworks and offer a detailed examination of the promising areas that necessitate further exploration.
Preprint
Full-text available
Reasoning is a fundamental cognitive process that enables logical inference, problem-solving, and decision-making. With the rapid advancement of large language models (LLMs), reasoning has emerged as a key capability that distinguishes advanced AI systems from conventional models that empower chatbots. In this survey, we categorize existing methods along two orthogonal dimensions: (1) Regimes, which define the stage at which reasoning is achieved (either at inference time or through dedicated training); and (2) Architectures, which determine the components involved in the reasoning process, distinguishing between standalone LLMs and agentic compound systems that incorporate external tools, and multi-agent collaborations. Within each dimension, we analyze two key perspectives: (1) Input level, which focuses on techniques that construct high-quality prompts that the LLM condition on; and (2) Output level, which methods that refine multiple sampled candidates to enhance reasoning quality. This categorization provides a systematic understanding of the evolving landscape of LLM reasoning, highlighting emerging trends such as the shift from inference-scaling to learning-to-reason (e.g., DeepSeek-R1), and the transition to agentic workflows (e.g., OpenAI Deep Research, Manus Agent). Additionally, we cover a broad spectrum of learning algorithms, from supervised fine-tuning to reinforcement learning such as PPO and GRPO, and the training of reasoners and verifiers. We also examine key designs of agentic workflows, from established patterns like generator-evaluator and LLM debate to recent innovations. ...
Article
The In-Context Learning (ICL) is to understand a new task via a few demonstrations (aka. prompt) and predict new inputs without tuning the models. While it has been widely studied in NLP, it is still a relatively new area of research in computer vision. To reveal the factors influencing the performance of visual in-context learning, this paper shows that Prompt Selection and Prompt Fusion are two major factors that have a direct impact on the inference performance of visual in-context learning. Prompt selection is the process of selecting the most suitable prompt for query image. This is crucial because high-quality prompts assist large-scale visual models in rapidly and accurately comprehending new tasks. Prompt fusion involves combining prompts and query images to activate knowledge within large-scale visual models. However, altering the prompt fusion method significantly impacts its performance on new tasks. Based on these findings, we propose a simple framework prompt-SelF to improve visual in-context learning. Specifically, we first use the pixel-level retrieval method to select a suitable prompt, and then use different prompt fusion methods to activate diverse knowledge stored in the large-scale vision model, and finally, ensemble the prediction results obtained from different prompt fusion methods to obtain the final prediction results. We conducted extensive experiments on single-object segmentation and detection tasks to demonstrate the effectiveness of prompt-SelF. Remarkably, prompt-SelF has outperformed OSLSM method-based meta-learning in 1-shot segmentation for the first time. This indicated the great potential of visual in-context learning. The source code and models will be available at https://github.com/syp2ysy/prompt-SelF.
Conference Paper
Full-text available
Large Language Models (LLMs) have shown significant performance in numerous NLP tasks, including summarization and controlled text generation. A notable capability of LLMs is in-context learning (ICL), where the model learns new tasks using input-output pairs in the prompt without any parameter update. However, the performance of LLMs in the context of few-shot abstractive dialogue summarization remains underexplored. This study evaluates various state-of-the-art LLMs on the SAMSum dataset within a few-shot framework. We assess these models in both controlled (entity control, length control, and person-focused planning) and uncontrolled settings, establishing a comprehensive benchmark in few-shot dialogue summarization. Our findings provide insights into summary quality and model controllability, offering a crucial reference for future research in dialogue summarization.
Preprint
Full-text available
We study how in-context learning (ICL) in language models is affected by semantic priors versus input-label mappings. We investigate two setups-ICL with flipped labels and ICL with semantically-unrelated labels-across various model families (GPT-3, InstructGPT, Codex, PaLM, and Flan-PaLM). First, experiments on ICL with flipped labels show that overriding semantic priors is an emergent ability of model scale. While small language models ignore flipped labels presented in-context and thus rely primarily on semantic priors from pretraining, large models can override semantic priors when presented with in-context exemplars that contradict priors, despite the stronger semantic priors that larger models may hold. We next study semantically-unrelated label ICL (SUL-ICL), in which labels are semantically unrelated to their inputs (e.g., foo/bar instead of negative/positive), thereby forcing language models to learn the input-label mappings shown in in-context exemplars in order to perform the task. The ability to do SUL-ICL also emerges primarily with scale, and large-enough language models can even perform linear classification in a SUL-ICL setting. Finally, we evaluate instruction-tuned models and find that instruction tuning strengthens both the use of semantic priors and the capacity to learn input-label mappings, but more of the former.
Preprint
Full-text available
"Induction heads" are attention heads that implement a simple algorithm to complete token sequences like [A][B] ... [A] -> [B]. In this work, we present preliminary and indirect evidence for a hypothesis that induction heads might constitute the mechanism for the majority of all "in-context learning" in large transformer models (i.e. decreasing loss at increasing token indices). We find that induction heads develop at precisely the same point as a sudden sharp increase in in-context learning ability, visible as a bump in the training loss. We present six complementary lines of evidence, arguing that induction heads may be the mechanistic source of general in-context learning in transformer models of any size. For small attention-only models, we present strong, causal evidence; for larger models with MLPs, we present correlational evidence.
Conference Paper
Full-text available
This article offers an empirical exploration on the use of character-level convolutional networks (ConvNets) for text classification. We constructed several large-scale datasets to show that character-level convolutional networks could achieve state-of-the-art or competitive results. Comparisons are offered against traditional models such as bag of words, n-grams and their TFIDF variants, and deep learning models such as word-based ConvNets and recurrent neural networks.
Article
We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called VALL-E ) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 50k hours of English speech which is hundreds of times larger than existing systems. VALL-E emerges in-context learning capability and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as a prompt. Experiment results show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find VALL-E could preserve the speaker's emotion and acoustic environment from the prompt in synthesis.
Preprint
Large language models (LLMs) have shown increasing in-context learning capabilities through scaling up model and data size. Despite this progress, LLMs are still unable to solve algorithmic reasoning problems. While providing a rationale with the final answer has led to further improvements in multi-step reasoning problems, Anil et al. 2022 showed that even simple algorithmic reasoning tasks such as parity are far from solved. In this work, we identify and study four key stages for successfully teaching algorithmic reasoning to LLMs: (1) formulating algorithms as skills, (2) teaching multiple skills simultaneously (skill accumulation), (3) teaching how to combine skills (skill composition) and (4) teaching how to use skills as tools. We show that it is possible to teach algorithmic reasoning to LLMs via in-context learning, which we refer to as algorithmic prompting. We evaluate our approach on a variety of arithmetic and quantitative reasoning tasks, and demonstrate significant boosts in performance over existing prompting techniques. In particular, for long parity, addition, multiplication and subtraction, we achieve an error reduction of approximately 10x, 9x, 5x and 2x respectively compared to the best available baselines.
Article
Semantic word spaces have been very useful but cannot express the meaning of longer phrases in a principled way. Further progress towards understanding compositionality in tasks such as sentiment detection requires richer supervised training and evaluation resources and more powerful models of composition. To remedy this, we introduce a Sentiment Treebank. It includes fine grained sentiment labels for 215,154 phrases in the parse trees of 11,855 sentences and presents new challenges for sentiment composition-ality. To address them, we introduce the Recursive Neural Tensor Network. When trained on the new treebank, this model outperforms all previous methods on several metrics. It pushes the state of the art in single sentence positive/negative classification from 80% up to 85.4%. The accuracy of predicting fine-grained sentiment labels for all phrases reaches 80.7%, an improvement of 9.7% over bag of features baselines. Lastly, it is the only model that can accurately capture the effects of negation and its scope at various tree levels for both positive and negative phrases.
Article
We use analogy when we say something is a Cinderella story and when we learn about resistors by thinking about water pipes. We also use analogy when we learn subjects like economics, medicine, and law. This paper presents a theory of analogy and describes an implemented system that embodies the theory. The specific competence to be understood is that of using analogies to do certain kinds of learning and reasoning. Learning takes place when analogy is used to generate a constraint description in one domain, given a constraint description in another, as when we learn Ohm's law by way of knowledge about water pipes. Reasoning takes place when analogy is used to answer questions about one situation, given another situation that is supposed to be a precedent, as when we answer questions about Hamlet by way of knowledge about Macbeth.
Language models are general-purpose interfaces
  • Yaru Hao
  • Haoyu Song
  • Li Dong
  • Shaohan Huang
  • Zewen Chi
  • Wenhui Wang
  • Shuming Ma
  • Furu Wei
Yaru Hao, Haoyu Song, Li Dong, Shaohan Huang, Zewen Chi, Wenhui Wang, Shuming Ma, and Furu Wei. 2022a. Language models are general-purpose interfaces. arXiv preprint arXiv:2206.06336.
Self-demos: Eliciting out-of-demonstration generalizability in large language models
  • Wei He
  • Shichun Liu
  • Jun Zhao
  • Yiwen Ding
  • Yi Lu
  • Zhiheng Xi
  • Tao Gui
  • Qi Zhang
  • Xuanjing Huang
Wei He, Shichun Liu, Jun Zhao, Yiwen Ding, Yi Lu, Zhiheng Xi, Tao Gui, Qi Zhang, and Xuanjing Huang. 2024. Self-demos: Eliciting out-of-demonstration generalizability in large language models. CoRR, abs/2404.00884.
In-context learning in large language models: A comprehensive survey
  • Clyde Highmore
Clyde Highmore. 2024. In-context learning in large language models: A comprehensive survey.
PRODIGY: enabling in-context learning over graphs
  • Qian Huang
  • Hongyu Ren
  • Peng Chen
  • Gregor Krzmanc
  • Daniel Zeng
  • Percy Liang
  • Jure Leskovec
Qian Huang, Hongyu Ren, Peng Chen, Gregor Krzmanc, Daniel Zeng, Percy Liang, and Jure Leskovec. 2023a. PRODIGY: enabling in-context learning over graphs. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 -16, 2023.
Language is not all you need: Aligning perception with language models
  • Shaohan Huang
  • Li Dong
  • Wenhui Wang
  • Yaru Hao
  • Saksham Singhal
  • Shuming Ma
  • Tengchao Lv
  • Lei Cui
  • Barun Owais Khan Mohammed
  • Qiang Patra
  • Kriti Liu
  • Zewen Aggarwal
  • Chi
Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Barun Patra, Qiang Liu, Kriti Aggarwal, Zewen Chi, Nils Johan Bertil Bjorck, Vishrav Chaudhary, Subhojit Som, Xia Song, and Furu Wei. 2023b. Language is not all you need: Aligning perception with language models. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 -16, 2023.
The dual form of neural networks revisited: Connecting test time predictions to training patterns via spotlights of attention
  • Kazuki Irie
  • Róbert Csordás
  • Jürgen Schmidhuber
Kazuki Irie, Róbert Csordás, and Jürgen Schmidhuber. 2022. The dual form of neural networks revisited: Connecting test time predictions to training patterns via spotlights of attention. In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 9639-9659. PMLR.
Opt-iml: Scaling language model instruction meta learning through the lens of generalization
  • Srinivasan Iyer
  • Xi Victoria Lin
  • Ramakanth Pasunuru
  • Todor Mihaylov
  • Daniel Simig
  • Ping Yu
  • Kurt Shuster
  • Tianlu Wang
  • Qing Liu
  • Punit Singh Koura
  • Xian Li
  • O' Brian
  • Gabriel Horo
  • Jeff Pereyra
  • Christopher Wang
  • Asli Dewan
  • Luke Celikyilmaz
  • Ves Zettlemoyer
  • Stoyanov
Srinivasan Iyer, Xi Victoria Lin, Ramakanth Pasunuru, Todor Mihaylov, Daniel Simig, Ping Yu, Kurt Shuster, Tianlu Wang, Qing Liu, Punit Singh Koura, Xian Li, Brian O'Horo, Gabriel Pereyra, Jeff Wang, Christopher Dewan, Asli Celikyilmaz, Luke Zettlemoyer, and Ves Stoyanov. 2022. Opt-iml: Scaling language model instruction meta learning through the lens of generalization.
A latent space theory for emergent abilities in large language models
  • Hui Jiang
Hui Jiang. 2023. A latent space theory for emergent abilities in large language models. CoRR, abs/2304.09960.
Exploring in-context learning capabilities of foundation models for generating knowledge graphs from text
  • Hanieh Khorashadizadeh
  • Nandana Mihindukulasooriya
  • Sanju Tiwari
  • Jinghua Groppe
  • Sven Groppe
Hanieh Khorashadizadeh, Nandana Mihindukulasooriya, Sanju Tiwari, Jinghua Groppe, and Sven Groppe. 2023. Exploring in-context learning capabilities of foundation models for generating knowledge graphs from text. arXiv preprint arXiv:2305.08804.
Self-generated in-context learning: Leveraging autoregressive language models as a demonstration generator
  • Hyunsoo Hyuhng Joon Kim
  • Junyeob Cho
  • Taeuk Kim
  • Kim
  • Sang-Goo Kang Min Yoo
  • Lee
Hyuhng Joon Kim, Hyunsoo Cho, Junyeob Kim, Taeuk Kim, Kang Min Yoo, and Sang-goo Lee. 2022. Self-generated in-context learning: Leveraging autoregressive language models as a demonstration generator. ArXiv preprint, abs/2206.08082.
  • Jannik Kossen
  • Tom Rainforth
  • Yarin Gal
Jannik Kossen, Tom Rainforth, and Yarin Gal. 2023. In-context learning in large language models learns label relationships but is not conventional learning. CoRR, abs/2307.12375.
Otter: A multi-modal model with in-context instruction tuning
  • Bo Li
  • Yuanhan Zhang
  • Liangyu Chen
  • Jinghao Wang
  • Jingkang Yang
  • Ziwei Liu
Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. 2023a. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726.
Towards enhancing in-context learning for code generation
  • Jia Li
  • Yunfei Zhao
  • Yongmin Li
  • Ge Li
  • Zhi Jin
Jia Li, Yunfei Zhao, Yongmin Li, Ge Li, and Zhi Jin. 2023b. Towards enhancing in-context learning for code generation. arXiv preprint arXiv:2303.17780.
GPT-4 technical report
  • Openai
OpenAI. 2023. GPT-4 technical report. CoRR, abs/2303.08774.
  • Ashwinee Panda
  • Tong Wu
  • Jiachen T Wang
  • Prateek Mittal
Ashwinee Panda, Tong Wu, Jiachen T. Wang, and Prateek Mittal. 2023. Differentially private in-context learning. CoRR, abs/2305.01639.