December 2024 · 1 Read
December 2024 · 1 Read
September 2024 · 45 Reads
Although language model (LM) agents are demonstrating growing potential in many domains, their success in cybersecurity has been limited due to simplistic design and the lack of fundamental features for this domain. We present EnIGMA, an LM agent for autonomously solving Capture The Flag (CTF) challenges. EnIGMA introduces new Agent-Computer Interfaces (ACIs) to improve the success rate on CTF challenges. We establish the novel Interactive Agent Tool concept, which enables LM agents to run interactive command-line utilities essential for these challenges. Empirical analysis of EnIGMA on over 350 CTF challenges from three different benchmarks indicates that providing a robust set of new tools with demonstration of their usage helps the LM solve complex problems and achieves state-of-the-art results on the NYU CTF and Intercode-CTF benchmarks. Finally, we discuss insights on ACI design and agent behavior on cybersecurity tasks that highlight the need to adapt real-world tools for LM agents.
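As a rough illustration of the Interactive Agent Tool idea described above, the sketch below wraps a long-running command-line utility (gdb here) so that an agent can issue commands turn by turn and read the output between prompts. The class and method names are illustrative placeholders, not EnIGMA's actual interface; it assumes a pexpect-style wrapper.

```python
# Minimal sketch of an "interactive agent tool": a wrapper that lets an LM agent
# drive a long-running command-line utility (here, gdb) instead of one-shot calls.
# Names (InteractiveTool, send) are illustrative, not EnIGMA's API.
import pexpect

class InteractiveTool:
    def __init__(self, command: str, prompt: str, timeout: int = 10):
        self.prompt = prompt
        self.proc = pexpect.spawn(command, encoding="utf-8", timeout=timeout)
        self.proc.expect(prompt)  # wait for the tool's first prompt

    def send(self, line: str) -> str:
        """Send one command and return everything printed before the next
        prompt, so the agent can condition its next step on it."""
        self.proc.sendline(line)
        self.proc.expect(self.prompt)
        return self.proc.before

    def close(self):
        self.proc.terminate(force=True)

# Example: an agent stepping through a binary in gdb across multiple turns.
if __name__ == "__main__":
    gdb = InteractiveTool("gdb ./challenge", prompt=r"\(gdb\) ")
    print(gdb.send("break main"))
    print(gdb.send("run"))
    print(gdb.send("info registers"))
    gdb.close()
```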
August 2024 · 14 Reads
High-quality datasets of real-world vulnerabilities are enormously valuable for downstream research in software security, but existing datasets are typically small, require extensive manual effort to update, and are missing crucial features that such research needs. In this paper, we introduce ARVO: an Atlas of Reproducible Vulnerabilities in Open-source software. By sourcing vulnerabilities from C/C++ projects that Google's OSS-Fuzz discovered and implementing a reliable re-compilation system, we successfully reproduce more than 5,000 memory vulnerabilities across over 250 projects, each with a triggering input, the canonical developer-written patch for fixing the vulnerability, and the ability to automatically rebuild the project from source and run it at its vulnerable and patched revisions. Moreover, our dataset can be automatically updated as OSS-Fuzz finds new vulnerabilities, allowing it to grow over time. We provide a thorough characterization of the ARVO dataset, show that it can locate fixes more accurately than Google's own OSV reproduction effort, and demonstrate its value for future research through two case studies: firstly evaluating real-world LLM-based vulnerability repair, and secondly identifying over 300 falsely patched (still-active) zero-day vulnerabilities from projects improperly labeled by OSS-Fuzz.
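The core reproducibility property the abstract describes can be summarized in a short sketch: a vulnerability counts as reproduced if its triggering input crashes the build at the vulnerable revision but not at the patched one. The build_at_revision helper below is a hypothetical stand-in for ARVO's re-compilation system, not its actual tooling.

```python
# Sketch of an ARVO-style reproduction check. build_at_revision() is a
# hypothetical placeholder for the dataset's rebuild tooling; it should return
# the path to the rebuilt fuzz target binary.
import subprocess

def build_at_revision(project: str, revision: str) -> str:
    """Check out `revision` of `project`, rebuild the target, return binary path."""
    raise NotImplementedError("replace with the dataset's rebuild tooling")

def crashes(binary: str, triggering_input: str) -> bool:
    """Run the target on the triggering input; a negative return code means the
    process was killed by a signal, which we treat as a crash."""
    result = subprocess.run([binary, triggering_input], capture_output=True)
    return result.returncode < 0

def reproduce(project: str, vuln_rev: str, fix_rev: str, triggering_input: str) -> bool:
    vuln_bin = build_at_revision(project, vuln_rev)
    fixed_bin = build_at_revision(project, fix_rev)
    # Reproducible: crashes before the developer patch, not after it.
    return crashes(vuln_bin, triggering_input) and not crashes(fixed_bin, triggering_input)
```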
June 2024 · 48 Reads · 1 Citation
Large Language Models (LLMs) are being deployed across various domains today. However, their capacity to solve Capture the Flag (CTF) challenges in cybersecurity has not been thoroughly evaluated. To address this, we develop a novel method to assess LLMs in solving CTF challenges by creating a scalable, open-source benchmark database specifically designed for these applications. This database includes metadata for LLM testing and adaptive learning, compiling a diverse range of CTF challenges from popular competitions. Utilizing the advanced function calling capabilities of LLMs, we build a fully automated system with an enhanced workflow and support for external tool calls. Our benchmark dataset and automated framework allow us to evaluate the performance of five LLMs, encompassing both black-box and open-source models. This work lays the foundation for future research into improving the efficiency of LLMs in interactive cybersecurity tasks and automated task planning. By providing a specialized dataset, our project offers an ideal platform for developing, testing, and refining LLM-based approaches to vulnerability detection and resolution. Evaluating LLMs on these challenges and comparing them with human performance yields insights into their potential to power AI-driven cybersecurity solutions for real-world threat management. We release our dataset publicly at https://github.com/NYU-LLM-CTF/LLM_CTF_Database, along with our automated playground framework at https://github.com/NYU-LLM-CTF/llm_ctf_automation.
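To make the automation concrete, the sketch below shows the kind of tool-calling loop such a framework needs: the model either requests a shell command to run inside the challenge environment or submits a flag guess. The query_llm function and the message format are placeholders, not the released framework's actual API.

```python
# Minimal sketch of a function-calling loop for automated CTF solving.
# query_llm() and the reply format are illustrative placeholders.
import subprocess

def run_command(cmd: str, timeout: int = 30) -> str:
    """Tool exposed to the model: run a shell command in the challenge
    environment and return its (truncated) combined output."""
    proc = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=timeout)
    return (proc.stdout + proc.stderr)[:4000]

def query_llm(messages: list[dict]) -> dict:
    """Placeholder for a function-calling LLM client. Expected to return either
    {"tool": "run_command", "arguments": {"cmd": ...}} or {"flag": "..."}."""
    raise NotImplementedError("wire up your model's function-calling API here")

def solve(challenge_description: str, real_flag: str, max_turns: int = 20) -> bool:
    messages = [{"role": "user", "content": challenge_description}]
    for _ in range(max_turns):
        reply = query_llm(messages)
        if "flag" in reply:                       # model submits its answer
            return reply["flag"] == real_flag
        output = run_command(**reply["arguments"])  # model asked for a tool call
        messages.append({"role": "tool", "content": output})
    return False
```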
May 2024 · 44 Reads
The rapid development of large language models (LLMs) has led to significant advancements in code completion tasks. While larger models have higher accuracy, they also cost much more to run. Meanwhile, model cascading has proven effective at conserving computational resources while improving accuracy in LLMs on natural language generation tasks. It generates output with the smallest model in a set, and only queries the larger models when the output fails to meet predefined quality criteria. However, this strategy has not been used in code completion tasks, primarily because assessing the quality of code completions differs substantially from assessing natural language: the former relies heavily on functional correctness. To address this, we propose letting each model generate and execute a set of test cases for its solutions, and using the test results as the cascading threshold. We show that our model cascading strategy reduces computational costs while increasing accuracy compared to generating the output with a single model. We also introduce a heuristic to determine the optimal combination of the number of solutions, test cases, and test lines each model should generate, based on the budget. Compared to speculative decoding, our method works on black-box models and offers a comparable cost-accuracy trade-off while providing many more choices based on the server's budget. Ours is the first work to optimize the cost-accuracy trade-off for LLM code generation with model cascading.
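The cascading rule described above can be sketched in a few lines: try models from cheapest to most expensive, have each model produce a solution plus its own test cases, and only escalate when the solution's pass rate falls below a threshold. The generate_* helpers and the threshold value are hypothetical placeholders, not the paper's released code, and a real system would sandbox test execution.

```python
# Sketch of test-driven model cascading. generate_solution/generate_tests are
# hypothetical stand-ins for calls to the code LLMs.

def generate_solution(model: str, problem: str) -> str:
    raise NotImplementedError("call the code LLM here")

def generate_tests(model: str, problem: str, n_tests: int) -> list[str]:
    raise NotImplementedError("ask the same LLM for executable test snippets")

def passes(solution: str, test: str) -> bool:
    """Execute one generated test against the candidate solution.
    (A real harness would do this in a sandbox with a timeout.)"""
    namespace: dict = {}
    try:
        exec(solution, namespace)
        exec(test, namespace)
        return True
    except Exception:
        return False

def cascade(problem: str, models: list[str], n_tests: int = 5, threshold: float = 0.8) -> str:
    """`models` is ordered cheapest-first and assumed non-empty."""
    for model in models:
        solution = generate_solution(model, problem)
        tests = generate_tests(model, problem, n_tests)
        pass_rate = sum(passes(solution, t) for t in tests) / max(len(tests), 1)
        if pass_rate >= threshold:        # good enough: stop escalating
            return solution
    return solution                       # fall back to the largest model's answer
```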
February 2024 · 75 Reads · 87 Citations · ACM Transactions on Design Automation of Electronic Systems
In this study, we explore the capability of Large Language Models (LLMs) to automate hardware design by automatically completing partial Verilog code, a common language for designing and modeling digital systems. We fine-tune pre-existing LLMs on Verilog datasets compiled from GitHub and Verilog textbooks. We evaluate the functional correctness of the generated Verilog code using a specially designed test suite, featuring a custom problem set and testing benches. Here, our fine-tuned open-source CodeGen-16B model outperforms the commercial state-of-the-art GPT-3.5-turbo model with a 1.1% overall increase. Upon testing with a more diverse and complex problem set, we find that the fine-tuned model shows competitive performance against state-of-the-art GPT-3.5-turbo, excelling in certain scenarios. Notably, it demonstrates a 41% improvement in generating syntactically correct Verilog code across various problem categories compared to its pre-trained counterpart, highlighting the potential of smaller, in-house LLMs in hardware design automation. We release our training/evaluation scripts and LLM checkpoints as open-source contributions.
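The functional-correctness evaluation mentioned above amounts to compiling each completed module against a testbench and inspecting the simulation result. The sketch below assumes Icarus Verilog and a testbench that prints "PASS" on success; the file names and pass convention are assumptions, not the paper's released harness.

```python
# Sketch of a Verilog functional-correctness check using Icarus Verilog.
import os
import subprocess
import tempfile

def functionally_correct(generated_verilog: str, testbench: str) -> bool:
    with tempfile.TemporaryDirectory() as d:
        dut = os.path.join(d, "dut.v")
        tb = os.path.join(d, "tb.v")
        out = os.path.join(d, "sim.out")
        with open(dut, "w") as f:
            f.write(generated_verilog)
        with open(tb, "w") as f:
            f.write(testbench)
        # Compile the completion together with the testbench;
        # syntactically incorrect completions already fail here.
        if subprocess.run(["iverilog", "-o", out, dut, tb], capture_output=True).returncode != 0:
            return False
        # Simulate; assume the testbench prints "PASS" when all checks succeed.
        sim = subprocess.run(["vvp", out], capture_output=True, text=True, timeout=60)
        return "PASS" in sim.stdout
```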
January 2024 · 64 Reads · 15 Citations · IEEE Transactions on Information Forensics and Security
The security of computer systems typically relies on a hardware root of trust. As vulnerabilities in hardware can have severe implications on a system, there is a need for techniques to support security verification activities. Assertion-based verification is a popular verification technique that involves capturing design intent in a set of assertions that can be used in formal verification or testing-based checking. However, writing security-centric assertions is a challenging task. In this work, we investigate the use of emerging large language models (LLMs) for code generation in hardware assertion generation for security, where primarily natural language prompts, such as those one would see as code comments in assertion files, are used to produce SystemVerilog assertions. We focus our attention on a popular LLM and characterize its ability to write assertions out of the box, given varying levels of detail in the prompt. We design an evaluation framework that generates a variety of prompts, and we create a benchmark suite comprising real-world hardware designs and corresponding golden reference assertions that we want to generate with the LLM.
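As a rough illustration of the prompting setup this abstract describes, the sketch below gives the model RTL context plus a natural-language comment stating the security property, then asks it to complete a SystemVerilog assertion. The prompt wording is illustrative, not the paper's exact template, and query_llm is a placeholder for a completion API.

```python
# Sketch of comment-style prompting for SystemVerilog assertion generation.
# The template and query_llm() are illustrative placeholders.

ASSERTION_PROMPT = """// Module context:
{rtl_snippet}

// Security property: {property_description}
// Write a SystemVerilog assertion that checks this property.
assert property (@(posedge clk)"""

def query_llm(prompt: str) -> str:
    raise NotImplementedError("call the LLM completion API here")

def generate_assertion(rtl_snippet: str, property_description: str) -> str:
    prompt = ASSERTION_PROMPT.format(rtl_snippet=rtl_snippet,
                                     property_description=property_description)
    completion = query_llm(prompt)
    # Re-attach the prompt's prefix so the result is a complete assertion statement.
    return "assert property (@(posedge clk)" + completion
```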
July 2023 · 1,704 Reads · 2 Citations
In this study, we explore the capability of Large Language Models (LLMs) to automate hardware design by generating high-quality Verilog code, a common language for designing and modeling digital systems. We fine-tune pre-existing LLMs on Verilog datasets compiled from GitHub and Verilog textbooks. We evaluate the functional correctness of the generated Verilog code using a specially designed test suite, featuring a custom problem set and testing benches. Here, our fine-tuned open-source CodeGen-16B model outperforms the commercial state-of-the-art GPT-3.5-turbo model with a 1.1% overall increase. Upon testing with a more diverse and complex problem set, we find that the fine-tuned model shows competitive performance against state-of-the-art GPT-3.5-turbo, excelling in certain scenarios. Notably, it demonstrates a 41% improvement in generating syntactically correct Verilog code across various problem categories compared to its pre-trained counterpart, highlighting the potential of smaller, in-house LLMs in hardware design automation.
June 2023 · 547 Reads · 1 Citation
The security of computer systems typically relies on a hardware root of trust. As vulnerabilities in hardware can have severe implications on a system, there is a need for techniques to support security verification activities. Assertion-based verification is a popular verification technique that involves capturing design intent in a set of assertions that can be used in formal verification or testing-based checking. However, writing security-centric assertions is a challenging task. In this work, we investigate the use of emerging large language models (LLMs) for code generation in hardware assertion generation for security, where primarily natural language prompts, such as those one would see as code comments in assertion files, are used to produce SystemVerilog assertions. We focus our attention on a popular LLM and characterize its ability to write assertions out of the box, given varying levels of detail in the prompt. We design an evaluation framework that generates a variety of prompts, and we create a benchmark suite comprising real-world hardware designs and corresponding golden reference assertions that we want to generate with the LLM.
May 2023 · 643 Reads · 14 Citations
The BigCode community, an open-scientific collaboration working on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder and StarCoderBase: 15.5B parameter models with 8K context length, infilling capabilities and fast large-batch inference enabled by multi-query attention. StarCoderBase is trained on 1 trillion tokens sourced from The Stack, a large collection of permissively licensed GitHub repositories with inspection tools and an opt-out process. We fine-tuned StarCoderBase on 35B Python tokens, resulting in the creation of StarCoder. We perform the most comprehensive evaluation of Code LLMs to date and show that StarCoderBase outperforms every open Code LLM that supports multiple programming languages and matches or outperforms the OpenAI code-cushman-001 model. Furthermore, StarCoder outperforms every model that is fine-tuned on Python, can be prompted to achieve 40% pass@1 on HumanEval, and still retains its performance on other programming languages. We take several important steps towards a safe open-access model release, including an improved PII redaction pipeline and a novel attribution tracing tool, and make the StarCoder models publicly available under a more commercially viable version of the Open Responsible AI Model license.
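To show what the infilling capability looks like in practice, the sketch below queries the model through Hugging Face transformers using fill-in-the-middle tokens. It assumes access to the gated "bigcode/starcoder" checkpoint and the published FIM tokens (<fim_prefix>, <fim_suffix>, <fim_middle>); the example snippet and generation settings are illustrative.

```python
# Sketch of StarCoder fill-in-the-middle (infilling) via Hugging Face transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder"  # gated model; requires accepting the license
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")

# The model fills in the span between the prefix and the suffix.
prefix = "def fibonacci(n):\n    "
suffix = "\n    return a\n"
prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
# Keep only the newly generated middle span, dropping the prompt tokens.
middle = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                          skip_special_tokens=True)
print(prefix + middle + suffix)
```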
... These agents are capable of performing complex tasks involving decision-making and multi-step planning. Building upon this advancement, a new and growing approach has emerged: the use of autonomous generative agents to automate pentesting processes [11][12][13][14][15][16][17]. ...
June 2024
... This paper [143] explores the use of LLMs to generate security assertions for hardware designs. Security assertions are critical for verifying that hardware operates as intended without vulnerabilities. ...
January 2024 · IEEE Transactions on Information Forensics and Security
... Verilog [19]. The fine-tuned LLM, though, performed marginally better than ChatGPT3.5-turbo ...
February 2024 · ACM Transactions on Design Automation of Electronic Systems
... This new problem-solving paradigm (Saha et al. 2024; Yao et al. 2024) has also emerged in the field of cybersecurity, where LLMs show promising potential. In academia, substantial research is focused on utilizing LLMs for cybersecurity tasks (Pearce et al. 2023; Sun et al. 2024b). In industry, alongside powerful general-purpose LLMs like GPT (OpenAI 2024; Anthropic), specialized models such as SecGPT (Clouditera 2024) are being developed to tackle security-specific challenges. ...
May 2023
... These advanced AI models offer significant potential in enhancing cybersecurity resilience by generating informed and dynamic policies. Pearce et al. [192] investigated the use of LLMs like OpenAI's Codex and AI21's Jurassic J-1 for zero-shot vulnerability repair. Authors found that while LLMs can effectively repair synthetically generated and handcrafted security bugs, they struggle with real-world scenarios due to context limitations and reliability issues. ...
May 2023
... Recent advancements in Large Language Models (LLMs) for natural language understanding and generation [16] have inspired efforts to extend their ability to facilitate hardware chip designs. Prior works have demonstrated LLMs' potential in generating HDL code from natural language descriptions or high-level specifications [1], [3], [8], [19], [21], [29]. Some studies [6], [22] have further explored the design space with the assistance of LLMs, through their ability to learn by imitation. ...
April 2023
... A code repository often contains multiple code files. Previous studies [2], [4], [5] typically randomly sample files for training, failing to leverage the relationships and contextual information between files. We propose four new sampling strategies: sampling based on file content similarity, sampling based on file path similarity, sampling based on inter-file dependency, and random sampling. ...
May 2023
... Binary analysis is of fundamental importance in the field of software security and software engineering, encompassing a range of critical downstream applications such as malware analysis [4,7,25,28,39,40,81], vulnerability detection [23,79], software fingerprinting [15], APT attack forensics [2,22,43,69,77,82], and software reuse [21,68]. LmPa is intrinsically connected to decompilation [27,33,47], a foundational task in binary analysis. In addition to the related works discussed in Section 2.1, substantial research has been conducted in the area of decompilation, addressing topics such as type inference [11,31,49,66], binary-level data-flow analysis [3,88], function signature inference [11,12,31,66], and binary similarity [20,52,68,84,86,89]. ...
January 2022
... Rokon et al. [88] identified over 7,500 instances of malware hosted on GitHub. Attackers may also employ more covert methods, such as spreading malware through forks [9]. • Compromising Legitimate Libraries. ...
January 2022
... There has been an enormous effort to train code-generating large language models (LLMs) (Chen et al., 2021; Austin et al., 2021; Li et al., 2023; Rozière et al., 2024; Team, 2024), leading to LLM-powered agents that can perform tasks ranging from fixing bugs in software repositories to solving Olympiad-level algorithmic problems (Jimenez et al., 2023; Li et al., 2022b). Despite these successes, multiple studies have identified disturbing mistakes in LLM-produced code, including subtle bugs and serious security vulnerabilities (Hendler, 2023; Pearce et al., 2021; Jesse et al., 2023; Zhong & Wang, 2023; Perry et al., 2023; Elgedawy et al., 2024). Ultimately these mistakes stem from a fundamental property of LLMs: language models can generate any string of code, without regard to correctness. ...
May 2022