Praveen Venkateswaran’s research while affiliated with IBM Research - Thomas J. Watson Research Center and other places


Publications (22)


Figures from the preprint below: (3) Variation in the average accuracy across all permutations of three examples for five datasets; (4) Variation in API recall for different numbers of in-context examples on the ToolBench dataset; (7) Comparison of 3-, 4-, and 5-shot random/Top-K selection with 3-shot OptiSeq; (8) In-distribution vs. out-of-distribution performance, where ToolBench uses in-context examples from RestGPT and vice versa, shown for two models.
OptiSeq: Optimizing Example Ordering for In-Context Learning
  • Preprint
  • File available

January 2025 · 3 Reads · Praveen Venkateswaran · [...]

Developers using LLMs in their applications and agents have provided plenty of anecdotal evidence that in-context learning (ICL) is fragile. In addition to the quantity and quality of examples, we show that the order in which the in-context examples are listed in the prompt affects the output of the LLM and, consequently, their performance. In this paper, we present OptiSeq, which introduces a score based on log probabilities of LLM outputs to prune the universe of possible example orderings in few-shot ICL and recommend the best order(s) by distinguishing between correct and incorrect outputs resulting from different order permutations. Through a detailed empirical evaluation on multiple LLMs, datasets, and prompts, we demonstrate that OptiSeq improves accuracy by 6 to 10.5 percentage points across multiple tasks.
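The abstract describes scoring example orderings by the log probabilities the model assigns to its outputs. The exact OptiSeq score and its pruning strategy are defined in the paper; the sketch below only illustrates the general idea, assuming a hypothetical `llm_logprob(prompt, completion)` helper that returns the log probability of a completion given a prompt.

```python
# Minimal sketch of ranking in-context example orderings by output log
# probability. The real OptiSeq score and pruning strategy are defined in the
# paper; `llm_logprob` is a hypothetical interface, not a real API.
from itertools import permutations
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (input, expected output)

def build_prompt(examples: List[Example], query: str) -> str:
    """Concatenate in-context examples in the given order, then the query."""
    shots = "\n\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples)
    return f"{shots}\n\nInput: {query}\nOutput:"

def score_ordering(order: Tuple[Example, ...],
                   val_set: List[Example],
                   llm_logprob: Callable[[str, str], float]) -> float:
    """Average log probability the model assigns to gold outputs under this ordering."""
    return sum(llm_logprob(build_prompt(list(order), q), gold)
               for q, gold in val_set) / len(val_set)

def best_orderings(examples: List[Example],
                   val_set: List[Example],
                   llm_logprob: Callable[[str, str], float],
                   top_k: int = 1) -> List[Tuple[Example, ...]]:
    """Rank all permutations of the examples and return the top-k orderings."""
    ranked = sorted(permutations(examples),
                    key=lambda p: score_ordering(p, val_set, llm_logprob),
                    reverse=True)
    return ranked[:top_k]
```

Since the number of permutations grows factorially with the number of examples, exhaustive scoring as above is only feasible for small shot counts, which is why the paper prunes the space of candidate orderings rather than enumerating it.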


Evaluating the Instruction-following Abilities of Language Models using Knowledge Tasks

October 2024 · 20 Reads

In this work, we focus on developing a benchmark for instruction-following where it is easy to verify both task performance and instruction-following capabilities. We adapt existing knowledge benchmarks and augment them with instructions that are (a) conditional on correctly answering the knowledge task or (b) defined over the space of candidate options in multiple-choice knowledge-answering tasks. This allows us to study model characteristics, such as the change in their performance on the knowledge tasks in the presence of answer-modifying instructions and distractor instructions. In contrast to existing benchmarks for instruction following, we not only measure instruction-following capabilities but also use LLM-free methods to study task performance. We study a series of openly available large language models of varying parameter sizes (1B-405B) and closed-source models, namely GPT-4o-mini and GPT-4o. We find that even large-scale instruction-tuned LLMs fail to follow simple instructions in zero-shot settings. We release our dataset, benchmark, code, and results for future work.
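As an illustration of the augmentation described above, the sketch below shows how a multiple-choice knowledge item might be combined with an answer-modifying or distractor instruction. The item and instruction templates are hypothetical; the benchmark's actual templates are defined in the paper.

```python
# Hypothetical illustration of augmenting an MCQ knowledge item with
# instructions; the concrete templates used in the benchmark are in the paper.

def augment_mcq(question: str, options: dict, instruction: str) -> str:
    """Render a multiple-choice item with an additional instruction appended."""
    opts = "\n".join(f"{label}. {text}" for label, text in options.items())
    return f"{question}\n{opts}\n\nInstruction: {instruction}\nAnswer:"

item = {
    "question": "Which planet is known as the Red Planet?",
    "options": {"A": "Venus", "B": "Mars", "C": "Jupiter", "D": "Saturn"},
}

# (a) an instruction conditional on answering the knowledge task correctly
answer_modifying = "Respond with the text of the correct option in uppercase."
# (b) an instruction defined over the space of candidate options
distractor = "Before answering, list the option labels you are ruling out."

print(augment_mcq(item["question"], item["options"], answer_modifying))
```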


Figures and table from the paper below: (2) Evaluation of GRANITE-20B-FUNCTIONCALLING against the best open function calling models (according to BFCL); (3) Performance vs. hallucination rates for out-of-domain function calling. Table: Academic function calling benchmarks, full function calling; best performance in bold, second best underlined; all evaluations are zero-shot.
Granite-Function Calling Model: Introducing Function Calling Abilities via Multi-task Learning of Granular Tasks

June 2024 · 87 Reads

Large language models (LLMs) have recently shown tremendous promise in serving as the backbone to agentic systems, as demonstrated by their performance in multi-faceted, challenging benchmarks like SWE-Bench and Agent-Bench. However, to realize the true potential of LLMs as autonomous agents, they must learn to identify, call, and interact with external tools and application programming interfaces (APIs) to complete complex tasks. Together, these tasks are termed function calling. Endowing LLMs with function calling abilities leads to a myriad of advantages, such as access to current and domain-specific information in databases and knowledge sources, and the ability to outsource tasks that can be reliably performed by tools, e.g., a Python interpreter or calculator. While there has been significant progress in function calling with LLMs, there is still a dearth of open models that perform on par with proprietary LLMs like GPT, Claude, and Gemini. Therefore, in this work, we introduce the GRANITE-20B-FUNCTIONCALLING model under an Apache 2.0 license. The model is trained using a multi-task approach on seven fundamental tasks encompassed in function calling: Nested Function Calling, Function Chaining, Parallel Functions, Function Name Detection, Parameter-Value Pair Detection, Next-Best Function, and Response Generation. We present a comprehensive evaluation on multiple out-of-domain datasets, comparing GRANITE-20B-FUNCTIONCALLING to more than 15 of the best proprietary and open models. GRANITE-20B-FUNCTIONCALLING provides the best performance among all open models on the Berkeley Function Calling Leaderboard and ranks fourth overall. As a result of the diverse tasks and datasets used for training, we show that GRANITE-20B-FUNCTIONCALLING generalizes better across seven different evaluation datasets.
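To make the task concrete, the sketch below shows the general shape of a function-calling interaction: the model is given tool specifications and is expected to emit a call as a JSON object with a name and an arguments mapping (the same representation described in the citing excerpt further down this page). The tool, query, and schema here are illustrative; the exact prompt template used to train GRANITE-20B-FUNCTIONCALLING is defined in the paper.

```python
# Illustrative shape of a function-calling task: given tool specs and a user
# query, the model should emit a JSON call with "name" and "arguments".
# The tool, query, and schema below are hypothetical examples, not the exact
# format used to train GRANITE-20B-FUNCTIONCALLING.
import json

tools = [{
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

user_query = "What's the weather like in Zurich right now?"

# The output the model would be expected to generate for this query:
expected_call = {"name": "get_weather", "arguments": {"city": "Zurich"}}
print(json.dumps(expected_call))
```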


Who Knows the Answer? Finding the Best Model and Prompt for Each Query Using Confidence-Based Search

March 2024 · 10 Reads · 2 Citations

Proceedings of the AAAI Conference on Artificial Intelligence

There are increasingly many large language models (LLMs) available to the public. While these LLMs have exhibited impressive abilities on a variety of tasks, any individual LLM may do well on some tasks and poorly on others. Additionally, the performance of these models is heavily dependent on the choice of prompt template used. For instance, they exhibit sensitivity to the few-shot examples chosen or brittleness to the wording of instructions. Moreover, a prompt template that makes a model perform well for one input may not be the optimal template for another input. This necessitates an approach for adaptively selecting LLM and prompt template pairs for each input. Recent work has shown that the accuracy of an LLM's responses is correlated with the LLM's confidence in the response. Thus, a natural choice for selecting which model and prompt template to use is to select the pair that is most confident in its response. However, existing confidence metrics are expensive to calculate, necessitating multiple calls to each LLM and prompt pair. We thus propose an approach to predict the confidence of each pair using an auxiliary regression model that is inexpensive to run. Using this auxiliary model, we select the LLM and prompt template with the highest predicted confidence for a given input. Results on a range of benchmark datasets show that our confidence-based instance-level prompt search method consistently improves the performance of LLMs.
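A rough sketch of the selection mechanism described above: one inexpensive auxiliary regressor per (LLM, prompt template) pair predicts that pair's confidence on a given input, and the pair with the highest prediction is used. The TF-IDF features and ridge regressors below are stand-ins; the paper's actual auxiliary model and features may differ.

```python
# Sketch of confidence-based (LLM, prompt template) selection with a cheap
# auxiliary regressor per pair. Features and regressor here are illustrative
# stand-ins, not the paper's exact auxiliary model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

class PairConfidencePredictor:
    def __init__(self, pairs):
        self.pairs = pairs                       # e.g. [("llm-a", "template-1"), ...]
        self.vec = TfidfVectorizer()
        self.reg = {p: Ridge() for p in pairs}   # one regressor per pair

    def fit(self, inputs, confidences):
        """confidences[p] lists the observed confidence of pair p on each training input."""
        X = self.vec.fit_transform(inputs)
        for p in self.pairs:
            self.reg[p].fit(X, confidences[p])

    def select(self, query: str):
        """Return the pair with the highest predicted confidence for this query."""
        x = self.vec.transform([query])
        return max(self.pairs, key=lambda p: float(self.reg[p].predict(x)[0]))
```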


A Case for Business Process-Specific Foundation Models

January 2024 · 21 Reads · 3 Citations

Lecture Notes in Business Information Processing

The inception of large language models has helped advance the state of the art on numerous natural language tasks. This has also opened the door for the development of foundation models for other domains and data modalities (e.g., images and code). In this paper, we argue that business process data has unique characteristics that warrant the creation of a new class of foundation models to handle tasks like activity prediction, process optimization, and decision making. These models should also tackle the challenges of applying AI to business processes, which include data scarcity, multi-modal representations, domain-specific terminology, and privacy concerns. To support our claim, we show the effectiveness of few-shot learning and transfer learning in next-activity prediction, both crucial properties for the success of foundation models.
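As a toy illustration of the few-shot setup mentioned at the end of the abstract, next-activity prediction can be cast as completing a prompt built from a handful of labeled trace prefixes. The traces, prompt format, and model interface below are all hypothetical, not the paper's actual setup.

```python
# Hypothetical few-shot prompt for next-activity prediction on process traces;
# the event log and prompt format are illustrative only.
few_shot_traces = [
    (["Create Order", "Check Credit", "Approve Order"], "Ship Goods"),
    (["Create Order", "Check Credit", "Reject Order"], "Notify Customer"),
]

def next_activity_prompt(prefix, examples=few_shot_traces):
    """Build a prompt whose completion is the predicted next activity."""
    shots = "\n".join(
        f"Trace: {' -> '.join(trace)}\nNext activity: {nxt}"
        for trace, nxt in examples
    )
    return f"{shots}\nTrace: {' -> '.join(prefix)}\nNext activity:"

print(next_activity_prompt(["Create Order", "Check Credit"]))
```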

Citations (13)


... Our prompt library represents actions, i.e., tool calls, following the JSON tool calling schema (Abdelaziz et al. 2024). An action is represented as a JSON object with name and arguments mapping. ...

Reference:

AutoPDL: Automatic Prompt Optimization for LLM Agents
Granite-Function Calling Model: Introducing Function Calling Abilities via Multi-task Learning of Granular Tasks
  • Citing Conference Paper
  • January 2024

... Beheshti et al. propose opportunities for training an LLM on business process data to achieve various BPM tasks [2]. Similarly, Rizk et al. suggest a process-specific Foundational Model and the opportunities it could offer [24]. These position papers offer first intuitions and recaps of the state of the art, but feasibility assessments remain vague. ...

A Case for Business Process-Specific Foundation Models
  • Citing Chapter
  • January 2024

Lecture Notes in Business Information Processing

... By moving towards natural language state descriptions, we aim to create DST models that are not only more accurate but also more adaptable to open domains and easier to interpret and debug. Recent works like S3-DST [9], DISTRICT [10], and FnCTOD [11] are also exploring the use of LLMs in DST, but often still rely on structured output formats or specific function calls. Our approach distinguishes itself by focusing on free-form natural language generation for dialogue state representation, offering a different perspective on leveraging LLMs for DST. ...

DiSTRICT: Dialogue State Tracking with Retriever Driven In-Context Tuning
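To illustrate the contrast drawn in the excerpt above, the toy example below places a structured slot-value dialogue state next to a free-form natural-language description of the same state; both representations are hypothetical.

```python
# Toy contrast between a structured slot-value dialogue state and a free-form
# natural-language description of the same state (both hypothetical).
structured_state = {
    "restaurant": {"food": "italian", "area": "centre", "pricerange": "moderate"}
}

natural_language_state = (
    "The user wants a moderately priced Italian restaurant in the city centre."
)
```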

... These models leverage extensive pre-training on diverse textual datasets to develop sophisticated language comprehension and generative capabilities, enabling tasks such as context-aware reasoning and human-like text production. Modern LLM-driven systems [17,20,22] integrate large language models to analyze problems, formulate actionable strategies, and deploy solutions through tool-assisted execution. For example, LLM-based infrastructure management systems can dynamically analyze server logs, diagnose anomalies, and autonomously execute corrective actions or escalate alerts to human operators. ...

Towards large language model-based personal agents in the enterprise: Current trends and open problems

... These methods are based on the data distribution and require additional models or calculations to align the distribution. Distributed masking function and invariant learning are introduced into FL in FedGen [29], which outperforms current FL in efficiency, accuracy, and generalization in various contexts. However, the FedGen model focuses on sequential data, which differs from our approach. ...

FedGen: Generalizable Federated Learning for Sequential Data
  • Citing Conference Paper
  • July 2023

... 30 Despite being relatively new, FL has already demonstrated its value in addressing the generalizability challenge. 31,32 Due to the high sensitivity of the data, many other works have been performed to explore FL in the biomedical field. 23,33,34 While the effectiveness of FL in increasing the model generalizability has already been proven, finding the best way to aggregate the contributions from all the data owners remains an open challenge. ...

FedGen: Generalizable Federated Learning

... Out-of-the-Box (OOTB). This category encompasses integrations where ML capabilities are either built directly into the RPA software or can be added later via a robot store controlled by the software provider [33]. Crucially, the availability of ML capabilities depends entirely on the provider's offering. ...

Can You Teach Robotic Process Automation Bots New Tricks?
  • Citing Chapter
  • September 2022

Lecture Notes in Business Information Processing

... Prediction models have also been trained with augmented data or adversarial samples, relying on model-agnostic trace augmentations (Käppel et al. 2023), trying Generative Adversarial Networks (GANs) (Taymouri et al. 2020) or training with adversarial samples (Pasquadibisceglie et al. 2024; Stevens et al. 2023). Others try to improve generalization by adapting the loss function to balance the performance of the predictive model across multiple environments instead of performing best in only one (Venkateswaran et al. 2021). Transfer learning from one event log to another has been shown to be beneficial when training next activity prediction models (Jiralerspong et al. 2024). ...

Robust and Generalizable Predictive Models for Business Processes
  • Citing Chapter
  • August 2021

Lecture Notes in Computer Science
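The excerpt above mentions adapting the loss to balance a predictive model's performance across multiple environments rather than excelling in only one. One common way to realize that idea is to optimize the worst-case environment loss, sketched below; the cited paper's actual objective may differ.

```python
# One way to balance performance across environments: optimize the worst-case
# environment loss instead of the average. This illustrates the idea in the
# excerpt above, not the cited paper's exact formulation.
import torch
import torch.nn.functional as F

def worst_case_environment_loss(model, env_batches):
    """env_batches: list of (inputs, targets) tensors, one batch per environment."""
    env_losses = [F.cross_entropy(model(x), y) for x, y in env_batches]
    return torch.stack(env_losses).max()  # backprop through the worst environment
```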

... Game theory was applied to IRM to achieve Nash equilibrium among domains [33]. Other interesting subsequent works include the non-linear IRM [34], Bayesian IRM [35], meta IRM [36], globally sparse IRM [37], and domain agnostic IRM [38]. In general, two major gaps remain unfilled by existing studies [15], [31]- [38]. ...

Environment Agnostic Invariant Risk Minimization for Classification of Sequential Datasets
  • Citing Conference Paper
  • August 2021
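For reference, the standard IRMv1 objective that the variants surveyed in the excerpt above build on penalizes the gradient of each environment's risk with respect to a frozen dummy classifier scale. The sketch below follows that standard formulation (Arjovsky et al.), not the environment-agnostic variant proposed in the cited paper.

```python
# Standard IRMv1 gradient penalty (Arjovsky et al.): squared gradient norm of
# the per-environment risk w.r.t. a frozen scalar classifier w = 1.0.
# Shown for reference; the cited paper's environment-agnostic variant differs.
import torch
import torch.nn.functional as F

def irm_penalty(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    scale = torch.ones(1, requires_grad=True, device=logits.device)
    loss = F.cross_entropy(logits * scale, targets)
    grad = torch.autograd.grad(loss, [scale], create_graph=True)[0]
    return (grad ** 2).sum()

def irm_objective(model, env_batches, lam: float = 1.0) -> torch.Tensor:
    """Sum over environments of risk plus lambda times the invariance penalty."""
    total = 0.0
    for x, y in env_batches:
        logits = model(x)
        total = total + F.cross_entropy(logits, y) + lam * irm_penalty(logits, y)
    return total
```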

... Two theoretical extensions further explore: (i) the complexity analysis in Theorem 1 to determine approximation ratios of SmartParcels for performance guarantees and (ii) providing formal guarantees in placement [33]. Finally, we are exploring the dynamicity of the system in the following topics: (i) how Software Defined Network (SDN) can help for the effective management in the multi-network data flows [28] and (ii) the ability to dynamically choose edge analytics [10,34]. ...

REAM: Resource Efficient Adaptive Monitoring of Community Spaces at the Edge Using Reinforcement Learning
  • Citing Conference Paper
  • September 2020