Sumit Gulwani's research while affiliated with Microsoft and other places

Publications (200)

Article
Most users of low-code platforms, such as Excel and PowerApps, write programs in domain-specific formula languages to carry out nontrivial tasks. Often users can write most of the program they want, but introduce small mistakes that yield broken formulas. These mistakes, which can be both syntactic and semantic, are hard for low-code users to ident...
Article
Integrated Development Environments (IDEs) provide tool support to automate many source code editing tasks. Traditionally, IDEs use only the spatial context, i.e., the location where the developer is editing, to generate candidate edit recommendations. However, spatial context alone is often not sufficient to confidently predict the developer’s nex...
Article
Incorporating storytelling into organizational culture.
Preprint
Students often make mistakes on their introductory programming assignments as part of their learning process. Unfortunately, providing custom repairs for these mistakes can require a substantial amount of time and effort from class instructors. Automated program repair (APR) techniques can be used to synthesize such fixes. Prior work has explored t...
Preprint
Full-text available
Most programmers make mistakes when writing code. Some of these mistakes are small and require few edits to the original program - a class of errors recently termed last mile mistakes. These errors break the flow for experienced developers and can stump novice programmers. Existing automated repair techniques targeting this class of errors are doma...
Preprint
Full-text available
Spreadsheets are widely used for table manipulation and presentation. Stylistic formatting of these tables is an important property for both presentation and analysis. As a result, popular spreadsheet software, such as Excel, supports automatically formatting tables based on data-dependent rules. Unfortunately, writing these formatting rules can be...
Preprint
Integrated Development Environments (IDEs) provide tool support to automate many source code editing tasks. Traditionally, IDEs use only the spatial context, i.e., the location where the developer is editing, to generate candidate edit recommendations. However, spatial context alone is often not sufficient to confidently predict the developer's nex...
Preprint
Full-text available
Most users of low-code platforms, such as Excel and PowerApps, write programs in domain-specific formula languages to carry out nontrivial tasks. Often users can write most of the program they want, but introduce small mistakes that yield broken formulas. These mistakes, which can be both syntactic and semantic, are hard for low-code users to ident...
Preprint
Full-text available
Large pre-trained language models have been used to generate code,providing a flexible interface for synthesizing programs from natural language specifications. However, they often violate syntactic and semantic rules of their output language, limiting their practical usability. In this paper, we propose Synchromesh: a framework for substantially i...
Article
Multi-modal program synthesis refers to the task of synthesizing programs (code) from their specification given in different forms, such as a combination of natural language and examples. Examples provide a precise but incomplete specification, and natural language provides an ambiguous but more "complete" task description. Machine-learned pre-trai...
Article
Use of third-party libraries is extremely common in application software. The libraries evolve to accommodate new features or mitigate security vulnerabilities, thereby breaking the Application Programming Interface(API) used by the software. Such breaking changes in the libraries may discourage client code from using the new library versions there...
Article
The ability to learn programs from few examples is a powerful technology with disruptive applications in many domains, as it allows users to automate repetitive tasks in an intuitive way. Existing frameworks on inductive synthesis only perform syntactic manipulations, where they rely on the syntactic structure of the given examples and not their me...
Preprint
Full-text available
Multi-modal program synthesis refers to the task of synthesizing programs (code) from their specification given in different forms, such as a combination of natural language and examples. Examples provide a precise but incomplete specification, and natural language provides an ambiguous but more "complete" task description. Machine-learned pre-trai...
Preprint
Forking structure is widespread in the open-source repositories and that causes a significant number of merge conflicts. In this paper, we study the problem of textual merge conflicts from the perspective of Microsoft Edge, a large, highly collaborative fork off the main Chromium branch with significant merge conflicts. Broadly, this study is divid...
Article
While editing code, it is common for developers to make multiple related repeated edits that are all instances of a more general program transformation. Since this process can be tedious and error-prone, we study the problem of automatically learning program transformations from past edits, which can then be used to predict future edits. We take a...
Article
Data repositories often consist of text files in a wide variety of standard formats, ad-hoc formats, as well as mixtures of formats where data in one format is embedded into a different format. It is therefore a significant challenge to parse these files into a structured tabular form, which is important to enable any downstream data processing. We...
Preprint
We formalize and study ``programming by rewards'' (PBR), a new approach for specifying and synthesizing subroutines for optimizing some quantitative metric such as performance, resource utilization, or correctness over a benchmark. A PBR specification consists of (1) input features $x$, and (2) a reward function $r$, modeled as a black-box componen...
Preprint
The shortage of people trained in STEM fields is becoming acute, and universities and colleges are straining to satisfy this demand. In the case of computer science, for instance, the number of US students taking introductory courses has grown three-fold in the past decade. Recently, massive open online courses (MOOCs) have been promoted as a way t...
Preprint
Programming-by-example technologies are being deployed in industrial products for real-time synthesis of various kinds of data transformations. These technologies rely on the user to provide few representative examples of the transformation task. Motivated by the need to find the most pertinent question to ask the user, in this paper, we introduce...
Preprint
The reliability and proper function of data-driven applications hinge on the data's continued conformance to the applications' initial design. When data deviates from this initial profile, system behavior becomes unpredictable. Data profiling techniques such as functional dependencies and denial constraints encode patterns in the data that can be u...
Article
Full-text available
When working with a document, users often perform context-specific repetitive edits – changes to the document that are similar but specific to the contexts at their locations. Programming by demonstration/examples (PBD/PBE) systems automate these tasks by learning programs to perform the repetitive edits from demonstration or examples. However, PBD...
Preprint
Programming-by-Example (PBE) systems synthesize an intended program in some (relatively constrained) domain-specific language from a small number of input-output examples provided by the user. In this paper, we motivate and define the problem of quantitative PBE (qPBE) that relates to synthesizing an intended program over an underlying (real world)...
Article
Full-text available
We address the problem of learning a syntactic profile for a collection of strings, i.e. a set of regex-like patterns that succinctly describe the syntactic variations in the strings. Real-world datasets, typically curated from multiple sources, often contain data in various syntactic formats. Thus, any data processing task is preceded by the criti...
Conference Paper
Providing feedback on programming assignments is a tedious task for the instructor, and even impossible in large Massive Open Online Courses with thousands of students. Previous research has suggested that program repair techniques can be used to generate feedback in programming education. In this paper, we present a novel fully automated program r...
Conference Paper
Full-text available
Compile-time errors pose a major learning hurdle for students of introductory programming courses. Compiler error messages, while accurate, are targeted at seasoned programmers, and seem cryptic to beginners. In this work, we address this problem of pedagogically-inspired program repair and report TRACER (Targeted RepAir of Compilation ERrors), a s...
Chapter
Until recently, model checking and data-flow analysis—two traditional approaches to software verification—were used independently and in isolation for solving similar problems. Theoretically, the two different approaches are equivalent; they are two different ways to compute the same solution to a problem. In recent years, new practical approaches...
Article
Programming by example (PBE) systems allow end users to easily create programs by providing a few input-output examples to specify their intended task. The system attempts to generate a program in a domain specific language (DSL) that satisfies the given examples. However, a key challenge faced by existing PBE techniques is to ensure the robustness...
Conference Paper
K-8 mathematics students must learn many procedures, such as addition and subtraction. Students frequently learn "buggy' variations of these procedures, which we ideally could identify automatically. This is challenging because there are many possible variations that reflect deep compositions of procedural thought. Existing approaches for K-8 math...
Article
Synthesizing user-intended programs from a small number of input-output examples is a challenging problem with several important applications like spreadsheet manipulation, data wrangling and code refactoring. Existing synthesis systems either completely rely on deductive logic techniques that are extensively hand-engineered or on purely statistica...
Conference Paper
Full-text available
In recent years there has been rising interest in the use of programming-by-example techniques to assist users in data manipulation tasks. Such techniques rely on an explicit input-output examples specification from the user to automatically synthesize programs. However, in a wide range of data extraction tasks it is easy for a human observer to pr...
Conference Paper
Programming by Examples (PBE) involves synthesizing intended programs in an underlying domain-specific language from example-based specifications. PBE systems are already revolutionizing the application domain of data wrangling and are set to significantly impact several other domains including code refactoring. There are three key components in a...
Conference Paper
99% of computer users do not know programming and hence struggle with repetitive tasks. Programming by Examples (PBE) can revolutionize this landscape by enabling users to synthesize intended programs from example based specifications. A key technical challenge in PBE is to search for programs that are consistent with the examples provided by the u...
Article
We address the problem of learning comprehensive syntactic profiles for a set of strings. Real-world datasets, typically curated from multiple sources, often contain data in various formats. Thus any data processing task is preceded by the critical step of data format identification. However, manual inspection of data to identify various formats is...
Conference Paper
Programming-by-example technologies let end users construct and run new programs by providing examples of the intended program behavior. But, the few provided examples seldom uniquely determine the intended program. Previous approaches to picking a program used a bias toward shorter or more naturally structured programs. Our work here gives a machi...
Conference Paper
A shaded area problem in high school geometry consists of a figure annotated with facts such as lengths of line segments or angle measures, and asks to compute the area of a shaded portion of the figure. We describe a technique to generate fresh figures for these problems. Given a figure, we describe a technique to automatically synthesize shaded a...
Article
Program synthesis from incomplete specifications (e.g. input-output examples) has gained popularity and found real-world applications, primarily due to its ease-of-use. Since this technology is often used in an interactive setting, efficiency and correctness are often the key user expectations from a system based on such technologies. Ensuring effi...
Article
In recent years there has been rising interest in the use of programming-by-example techniques to assist users in data manipulation tasks. Such techniques rely on an explicit input-output examples specification from the user to automatically synthesize programs. However, in a wide range of data extraction tasks it is easy for a human observer to pr...
Book
Program synthesis is the task of automatically finding a program in the underlying programming language that satisfies the user intent expressed in the form of some specification. Since the inception of artificial intelligence in the 1950s, this problem has been considered the holy grail of Computer Science. Despite inherent challenges in the probl...
Article
Program synthesis is the task of automatically finding a program in the underlying programming language that satisfies the user intent expressed in the form of some specification. Since the inception of AI in the 1950s, this problem has been considered the holy grail of Computer Science. Despite inherent challenges in the problem such as ambiguity...
Conference Paper
With increasing amounts of data available on the web and a diverse range of users interested in programmatically accessing that data, web automation must become easier. Automation helps users complete many tedious interactions, such as scraping data, completing forms, or transferring data between websites. However, writing web automation scripts ty...
Article
Data filtering in spreadsheets is a common problem faced by millions of end-users. The task of data filtering requires a computational model that can separate intended positive and negative string instances. We present a system, FIDEX, that can efficiently learn desired data filtering expressions from a small set of positive and negative string exa...
Conference Paper
Data filtering in spreadsheets is a common problem faced by millions of end-users. The task of data filtering requires a computational model that can separate intended positive and negative string instances. We present a system, FIDEX, that can efficiently learn desired data filtering expressions from a small set of positive and negative string exa...
Article
With increasing amounts of data available on the web and a diverse range of users interested in programmatically accessing that data, web automation must become easier. Automation helps users complete many tedious interactions, such as scraping data, completing forms, or transferring data between websites. However, writing web automation scripts ty...
Article
Full-text available
IDEs, such as Visual Studio, automate common transformations, such as Rename and Extract Method refactorings. However, extending these catalogs of transformations is complex and time-consuming. A similar phenomenon appears in intelligent tutoring systems where instructors have to write cumbersome code transformations that describe "common faults" t...
Article
Full-text available
An introductory programming course (CS1) is an integral part of any undergraduate curriculum. Due to large number and diverse programming background of students, providing timely and personalised feedback to individual students is a challenging task for any CS1 instructor. The help provided by teaching assistants (typically senior students) is not...
Conference Paper
99 % of computer end users do not know programming, and struggle with repetitive tasks. Programming by Examples (PBE) can revolutionize this landscape by enabling users to synthesize intended programs from example based specifications. A key technical challenge in PBE is to search for programs that are consistent with the examples provided by the u...
Conference Paper
Interacting with computers is a ubiquitous activity for millions of people. Repetitive or specialized tasks often require creation of small, often one-off, programs. End-users struggle with learning and using the myriad of domain-specific languages (DSLs) to effectively accomplish these tasks. We present a general framework for constructing program...
Conference Paper
Cleaning spreadsheet data types is a common problem faced by millions of spreadsheet users. Data types such as date, time, name, and units are ubiquitous in spreadsheets, and cleaning transformations on these data types involve parsing and pretty printing their string representations. This presents many challenges to users because cleaning such dat...
Conference Paper
Students have enthusiastically taken to online programming lessons and contests. Unfortunately, they tend to struggle due to lack of personalized feedback. There is an urgent need of program analysis and repair techniques capable of handling both the scale and variations in student submissions, while ensuring quality of feedback. Towards this goal,...
Article
Providing feedback on programming assignments is a tedious task for the instructor, and even impossible in large MOOCs with thousands of students. In this paper, we present a novel technique for automatic feedback generation: (1) For a given programming assignment, we automatically cluster the correct student attempts using a dynamic program analys...
Article
Cleaning spreadsheet data types is a common problem faced by millions of spreadsheet users. Data types such as date, time, name, and units are ubiquitous in spreadsheets, and cleaning transformations on these data types involve parsing and pretty printing their string representations. This presents many challenges to users because cleaning such dat...
Article
Programming by Examples (PBE) has the potential to revolutionize end-user programming by enabling end users, most of whom are non-programmers, to create small scripts for automating repetitive tasks. However, examples, though often easy to provide, are an ambiguous specification of the user's intent. Because of that, a key impedance in adoption of...
Article
This paper presents an intelligent tutoring system, GeoTutor, for Euclidean Geometry that is automatically able to synthesize proof problems and their respective solutions given a geometric figure together with a set of properties true of it. GeoTutor can provide personalized practice problems that address student deficiencies in the subject matter...
Conference Paper
Full-text available
Inductive synthesis, or programming-by-examples (PBE) is gaining prominence with disruptive applications for automating repetitive tasks in end-user programming. However, designing , developing, and maintaining an effective industrial-quality inductive synthesizer is an intellectual and engineering challenge, requiring 1-2 man-years of effort. Our...
Article
Full-text available
We consider from a practical perspective the problem of checking equivalence of context-free grammars. We present techniques for proving equivalence, as well as techniques for finding counter-examples that establish non-equivalence. Among the key building blocks of our approach is a novel algorithm for efficiently enumerating and sampling words and...
Article
Full-text available
MUCH OF THE world's population use computers for everyday tasks, but most fail to benefit from the power of computation due to their inability to program. Most crucially, users often have to perform repetitive actions manually because they are not able to use the macro languages available for many application programs. Recently, a first mass-market...
Article
Full-text available
Interacting with computers is a ubiquitous activity for millions of people. Repetitive or specialized tasks often require creation of small, often one-off, programs. End-users struggle with learning and using the myriad of domain-specific languages (DSLs) to effectively accomplish these tasks. We present a general framework for constructing program...
Conference Paper
Full-text available
Word problems are an established technique for teaching mathematical modeling skills in K-12 education. However, many students find word problems unconnected to their lives, artificial, and uninteresting. Most students find them much more difficult than the corresponding symbolic representations. To account for this phenomenon, an ideal pedagogy mi...
Conference Paper
We study the problem of efficiently predicting a correct program from a large set of programs induced from few input-output examples in Programming-by-Example (PBE) systems.
Article
Full-text available
With hundreds of millions of users, spreadsheets are one of the most important end-user applications. Spreadsheets are easy to use and allow users great flexibility in storing data. This flexibility comes at a price: users often treat spreadsheets as a poor man's database, leading to creative solutions for storing high-dimensional data. The trouble...
Conference Paper
To build a programming by demonstration (PBD) web scraping tool for end users, one needs two central components: a list finder, and a record and replay tool. A list finder extracts logical tables from a webpage. A record and replay (R+R) system records a user's interactions with a webpage, and replays them programmatically. The research community h...
Patent
Full-text available
Systems and methods for generating a tuple of structured data files are described herein. In one example, a method includes detecting an expression that describes a structure of a structured image using a constructor. The method can also include using an inference-rule based search strategy to identify a hierarchical arrangement of bounding boxes i...
Conference Paper
Interactive learning environments such as intelligent tutoring systems and software tutorials often teach procedures with step-by-step demonstrations. This instructional scaffolding is typically authored by hand, and little can be reused across problem domains. In this work, we present a framework for generating interactive tutorials from an algori...
Conference Paper
Good alignment and repetition of objects across presentation slides can facilitate visual processing and contribute to audience understanding. However, creating and maintaining such consistency during slide design is difficult. To solve this problem, we present two complementary tools: (1) StyleSnap, which increases the alignment and repetition of...
Conference Paper
A long-term goal of game design research is to achieve end-to-end automation of much of the design process, one aspect of which is creating effective level progressions. A key difficulty is getting the player to practice with interesting combinations of learned skills while maintaining their engagement. Although recent work in task generation and s...
Patent
Ranking technique embodiments are presented that use statistical and machine learning techniques to learn the desired ranking function for use in inductive program synthesis for the domain of string transformations. This generally involves automatically creating a training dataset of positive and negative examples from a given set of training tasks...
Article
Full-text available
In computer-aided education, the goal of automatic feedback is to provide a meaningful explanation of stu-dents' mistakes. We focus on providing feedback for constructing a deterministic finite automaton that accepts strings that match a described pattern. Natural choices for feedback are binary feedback (correct/wrong) and a counterexample of a st...
Conference Paper
The programming languages (PL) research community has traditionally catered to the needs of professional programmers in the continuously evolving technical industry. However, there is a new opportunity that knocks our doors. The recent IT revolution has resulted in the masses having access to personal computing devices. More than 99% of these compu...
Article
The programming languages (PL) research community has traditionally catered to the needs of professional programmers in the continuously evolving technical industry. However, there is a new opportunity that knocks our doors. The recent IT revolution has resulted in the masses having access to personal computing devices. More than 99% of these compu...
Article
Full-text available
Simple board games, like Tic-Tac-Toe and CONNECT-4, play an important role not only in the development of mathematical and logical skills, but also in the emotional and social development. In this paper, we address the problem of generating targeted starting positions for such games. This can facilitate new approaches for bringing novice players to...
Patent
Full-text available
A quantified belief propagation (QBP) algorithm receives as input an existentially quantified boolean formula (QBF) of existentially quantified boolean variables, universally quantified variables, and boolean operators. A tripartite graph is constructed, and includes (i) there-exists nodes that correspond to and represent the existentially quantifi...
Article
We show how to programmatically model processes that humans use when extracting answers to queries (e.g., "Who invented typewriter?", "List of Washington national parks") from semi-structured Web pages returned by a search engine. This modeling enables various applications including automating repetitive search tasks, and helping search engine deve...
Patent
Semantic entity manipulation technique embodiments are presented that generate a probabilistic program capable of manipulating character strings representing semantic entities based on input-output examples. The program can then be used to produce a desired output consistent with the input-output examples from inputs of a type included in the examp...
Article
Example-based reasoning techniques developed in the programming languages community can also help automate certain repetitive and structured tasks in education, including problem generation, solution generation, and feedback generation. This can make standard and online classrooms more efficient and enable new pedagogies involving personalized work...
Article
Computing devices have become widely available to billions of end users, yet a handful of experts have the needed expertise to program these devices. Automated program synthesis has the potential to revolutionize this landscape, when targeted for the right set of problems and when allowing the right interaction model. The first part of this talk di...
Article
Recent advances in Programming by Example (PBE) have supported new applications to text editing, but existing approaches are limited to simple text strings. In this paper we address transformations in richly formatted documents, using an approach based on the idea of least general generalizations from inductive inference, which avoids the scalabili...
Article
This paper presents a semi-automated methodology for generating geometric proof problems of the kind found in a high-school curriculum. We formalize the notion of a geometry proof problem and describe an algorithm for generating such problems over a user-provided figure. Our experimental results indicate that our problem generation algorithm can ef...
Article
Millions of computer end users need to perform tasks over tabular spreadsheet data, yet lack the programming knowledge to do such tasks automatically. This paper describes the design and implementation of a robust natural language based interface to spreadsheet programming. Our methodology involves designing a typed domain-specific language (DSL) t...
Patent
Full-text available
A system that facilitates computing a symbolic bound with respect to a procedure that is executable by a processor on a computing device is described herein. The system includes a transition system generator component that receives the procedure and computes a disjunctive transition system for a control location in the procedure. A compute bound co...
Article
Programming-by-example technologies empower end-users to create simple programs merely by providing input/output examples. Existing systems are designed around solvers specialized for a specific set of data types or domain-specific language (DSL). We present a program synthesizer which can be parameterized by an arbitrary DSL that may contain condi...
Article
Various document types that combine model and view (e.g., text files, webpages, spreadsheets) make it easy to organize (possibly hierarchical) data, but make it difficult to extract raw data for any further manipulation or querying. We present a general framework FlashExtract to extract relevant data from semi-structured documents using examples. I...