Collin McMillan's research while affiliated with University of Notre Dame and other places

Publications (82)

Preprint
Source code summarization involves creating brief descriptions of source code in natural language. These descriptions are a key component of software documentation such as JavaDocs. Automatic code summarization is a prized target of software engineering research, due to the high value summaries have to programmers and the simultaneously high cost o...
Preprint
In source code search, a common information-seeking strategy involves providing a short initial query with a broad meaning, and then iteratively refining the query using terms gleaned from the results of subsequent searches. This strategy requires programmers to spend time reading search results that are irrelevant to their development needs. In co...
Preprint
API search involves finding components in an API that are relevant to a programming task. For example, a programmer may need a function in a C library that opens a new network connection, then another function that sends data across that connection. Unfortunately, programmers often have trouble finding the API components that they need. A strong sc...
Preprint
Full-text available
A source code summary of a subroutine is a brief description of that subroutine. Summaries underpin a majority of documentation consumed by programmers, such as the method summaries in JavaDocs. Source code summarization is the task of writing these summaries. At present, most state-of-the-art approaches for code summarization are neural network-ba...
Preprint
Virtual Assistant technology is rapidly proliferating to improve productivity in a variety of tasks. While several virtual assistants for everyday tasks are well-known (e.g., Siri, Cortana, Alexa), assistants for specialty tasks such as software engineering are rarer. One key reason software engineering assistants are rare is that very few experime...
Preprint
Source code summarization of a subroutine is the task of writing a short, natural language description of that subroutine. The description usually serves in documentation aimed at programmers, where even brief phrase (e.g. "compresses data to a zip file") can help readers rapidly comprehend what a subroutine does without resorting to reading the co...
Preprint
A question answering (QA) system is a type of conversational AI that generates natural language answers to questions posed by human users. QA systems often form the backbone of interactive dialogue systems, and have been studied extensively for a wide variety of tasks ranging from restaurant recommendations to medical diagnostics. Dramatic progress...
Preprint
Source code summarization is the task of creating short, natural language descriptions of source code. Code summarization is the backbone of much software documentation such as JavaDocs, in which very brief comments such as "adds the customer object" help programmers quickly understand a snippet of code. In recent years, automatic code summarizatio...
Article
Virtual Assistant technology is rapidly proliferating to improve productivity in a variety of tasks. While several virtual assistants for everyday tasks are well-known (e.g., Siri, Cortana, Alexa), assistants for specialty tasks such as software engineering are rarer. One key reason software engineering assistants are rare is that very few experime...
Preprint
Software documentation largely consists of short, natural language summaries of the subroutines in the software. These summaries help programmers quickly understand what a subroutine does without having to read the source code him or herself. The task of writing these descriptions is called "source code summarization" and has been a target of resea...
Preprint
Full-text available
Automatic source code summarization is the task of generating natural language descriptions for source code. Automatic code summarization is a rapidly expanding research area, especially as the community has taken greater advantage of advances in neural network and AI technologies. In general, source code summariza-tion techniques use the source co...
Preprint
Automatic source code summarization is the task of generating natural language descriptions for source code. Automatic code summarization is a rapidly expanding research area, especially as the community has taken greater advantage of advances in neural network and AI technologies. In general, source code summarization techniques use the source cod...
Preprint
Virtual assistant technology has the potential to make a significant impact in the field of software engineering. However, few SE-related datasets exist that would be suitable for the design or training of a virtual assistant. To help lay the groundwork for a hypothetical virtual assistant for API usage, we designed and conducted a Wizard-of-Oz stu...
Preprint
descriptions of subroutines are short (usually one-sentence) natural language explanations of a subroutine's behavior and purpose in a program. These summaries are ubiquitous in documentation, and many tools such as JavaDocs and Doxygen generate documentation built around them. And yet, extracting summaries from unstructured source code repositorie...
Conference Paper
Full-text available
Source Code Summarization is the task of writing short, natural language descriptions of source code. The main use for these descriptions is in software documentation e.g. the one-sentence Java method descriptions in JavaDocs. Code summarization is rapidly becoming a popular research problem, but progress is restrained due to a lack of suitable dat...
Preprint
Full-text available
Source Code Summarization is the task of writing short, natural language descriptions of source code. The main use for these descriptions is in software documentation e.g. the one-sentence Java method descriptions in JavaDocs. Code summarization is rapidly becoming a popular research problem, but progress is restrained due to a lack of suitable dat...
Preprint
Full-text available
Source code summarization -- creating natural language descriptions of source code behavior -- is a rapidly-growing research topic with applications to automatic documentation generation, program comprehension, and software maintenance. Traditional techniques relied on heuristics and templates built manually by human experts. Recently, data-driven...
Conference Paper
This paper targets the problem of speech act detection in conversations about bug repair. We conduct a ``Wizard of Oz'' experiment with 30 professional programmers, in which the programmers fix bugs for two hours, and use a simulated virtual assistant for help. Then, we use an open coding manual annotation procedure to identify the speech act types...
Conference Paper
Full-text available
Software Categorization is the task of organizing software into groups that broadly describe the behavior of the software, such as “editors” or “science.” Categorization plays an important role in several maintenance tasks, such as repository navigation and feature elicitation. Current approaches attempt to cast the problem as text classification,...
Preprint
Full-text available
This paper targets the problem of speech act detection in conversations about bug repair. We conduct a "Wizard of Oz" experiment with 30 professional programmers, in which the programmers fix bugs for two hours, and use a simulated virtual assistant for help. Then, we use an open coding manual annotation procedure to identify the speech act types i...
Preprint
Full-text available
Software Categorization is the task of organizing software into groups that broadly describe the behavior of the software, such as "editors" or "science." Categorization plays an important role in several maintenance tasks, such as repository navigation and feature elicitation. Current approaches attempt to cast the problem as text classification,...
Conference Paper
Programmers who are blind use a screen reader to speak source code one word at a time, as though the code were text. For example, "float f = 5.23;" can be read as "float f equals five point two three semicolon". This process of reading is in stark contrast to sighted programmers, who skim source code rapidly with their eyes. At present, it is not k...
Article
Commit messages are a valuable resource in comprehension of software evolution, since they provide a record of changes such as feature additions and bug repairs. Unfortunately, programmers often neglect to write good commit messages. Different techniques have been proposed to help programmers by automatically writing these messages. These technique...
Article
Programmers who are blind use a screen reader to speak source code one word at a time, as though the code were text. This process of reading is in stark contrast to sighted programmers, who skim source code rapidly with their eyes. At present, it is not known whether the difference in these processes has effects on the program comprehension gained...
Article
Full-text available
Programmers need documentation to comprehend software, but they often lack the time to write it. Thus, programmers must prioritize their documentation effort to ensure that sections of code important to program comprehension are thoroughly explained. In this paper, we explore the possibility of automatically prioritizing documentation effort. We pe...
Article
Full-text available
“Change Impact Analysis” is the process of determining the consequences of a modification to software. In theory, change impact analysis should be done during software maintenance, to make sure changes do not introduce new bugs. Many approaches and techniques are proposed to help programmers do change impact analysis automatically. However, it is s...
Article
Committing to a version control system means submitting a software change to the system. Each commit can have a message to describe the submission. Several approaches have been proposed to automatically generate the content of such messages. However, the quality of the automatically generated messages falls far short of what humans write. In studyi...
Article
Full-text available
When learning to use an Application Programming Interface (API), programmers need to understand the inputs and outputs (I/O) of the API functions. Current documentation tools automatically document the static information of I/O, such as parameter types and names. What is missing from these tools is dynamic information, such as I/O examples---actual...
Article
A key problem during copy–paste source code reuse is that, to reuse even a small section of code from a program as opposed to an API, a programmer must include a huge amount of additional source code from elsewhere in the same program. This additional code is notoriously large and complex, and portions can only be identified at runtime. In this pap...
Article
Bug reports are crucial software artifacts for both software maintenance researchers and practitioners. A typical use of bug reports by researchers is to evaluate automated software maintenance tools: a large repository of reports is used as input for a tool, and metrics are calculated from the tool's output. But this process is quite different fro...
Conference Paper
Blind programmers typically use a screen reader when reading code whereas sighted programmers are able to skim the code with their eyes. This difference has the potential to impact the generalizability of software engineering studies and approaches. We present a summary of a paper which will soon be under review at TSE that investigates how code co...
Conference Paper
Terms in source code have become extremely important in Software Engineering research. These "important" terms are typically used as input to research tools. Therefore, the quality of the output of these tools will depend on the quality of the term extraction technique. Currently, there is no definitive best technique for predicting the importance...
Article
Programs are, in essence, a collection of implemented features. Feature discovery in software engineering is the task of identifying key functionalities that a program implements. Manual feature discovery can be time consuming and expensive, leading to automatic feature discovery tools being developed. However, these approaches typically only descr...
Article
Source Code Summarization is an emerging technology for automatically generating brief descriptions of code. Current summarization techniques work by selecting a subset of the statements and keywords from the code, and then including information from those statements and keywords in the summary. The quality of the summary depends heavily on the pro...
Article
Source code summarization is the task of creating readable summaries that describe the functionality of software. Source code summarization is a critical component of documentation generation, for example as Javadocs formed from short paragraphs attached to each method in a Java program. At present, a majority of source code summarization is manual...
Article
Bug reports are crucial software artifacts for both software maintenance researchers and practitioners. A typical use of bug reports by researchers is to evaluate automated software maintenance tools: a large repository of reports is used as input for a tool, and metrics are calculated from the tool's output. But this process is quite different fro...
Article
Source code documentation often contains summaries of source code written by authors. Recently, automatic source code summarization tools have emerged that generate summaries without requiring author intervention. These summaries are designed for readers to be able to understand the high-level concepts of the source code. Unfortunately, there is no...
Conference Paper
Full-text available
Some previous work began studying the relationship between application domains and quality, in particular through the prevalence of code and design smells (e.g., anti-patterns). Indeed, it is generally believed that the presence of these smells degrades quality but also that their prevalence varies across domains. Though anecdotal experiences and e...
Article
In the past decade, there have been many well-publicized cases of source code leaking from different well-known companies. These leaks pose a serious problem when the source code contains sensitive information encoded in its identifier names and comments. Unfortunately, redacting the sensitive information requires obfuscating the identifiers, which...
Article
A documentation generator is a programming tool that creates documentation for software by analyzing the statements and comments in the software's source code. While many of these tools are manual, in that they require specially-formatted metadata written by programmers, new research has made inroads towards automatic generation of documentation. T...
Article
In this paper, we present an emerging source code summarization technique that uses topic modeling to select keywords and topics as summaries for source code. Our approach organizes the topics in source code into a hierarchy, with more general topics near the top of the hierarchy. In this way, we present the software's highest-level functionality f...
Article
A key problem during source code reuse is that, to reuse even a small section of code from a program, a programmer must include a huge amount of dependency source code from elsewhere in the same program. These dependencies are no- toriously large and complex, and many can only be known at runtime. In this paper, we propose execution record/replay a...
Article
Source Code Summarization is an emerging technology for automatically generating brief descriptions of code. Current summarization techniques work by selecting a subset of the statements and keywords from the code, and then including information from those statements and keywords in the summary. The quality of the summary depends heavily on the pro...
Conference Paper
This paper presents a technique for automatically mining and visualizing API usage examples. In contrast to previous approaches, our technique is capable of finding examples of API usage that occur across an arbitrary number of functions in a program. This distinction is important because of a gap between what current API learning tools provide and...
Article
Different studies show that programmers are more interested in finding definitions of functions and their uses than variables, statements, or ordinary code fragments. Therefore, developers require support in finding relevant functions and determining how these functions are used. Unfortunately, existing code search engines do not provide enough of...
Conference Paper
Full-text available
Traceability recovery is a key software mainte-nance activity in which software engineers extract the relation-ships among software artifacts. Information Retrieval (IR) has been widely accepted as a method for automated traceability recovery based on the textual similarity among the software artifacts. However, a notorious difficulty for IR-based...
Article
Full-text available
Software repositories hold applications that are often categorized to improve the effectiveness of various maintenance tasks. Properly categorized applications allow stakeholders to identify requirements related to their applications and predict maintenance problems in software projects. Manual categorization is expensive, tedious, and laborious –...
Article
Full-text available
Although popular text search engines allow users to retrieve similar web pages, source code search engines do not have this feature. Detecting similar applications is a notoriously difficult problem, since it implies that similar highlevel requirements and their low-level implementations can be detected and matched automatically for different appli...
Article
Full-text available
Rapid prototypes are often developed early in the software development process in order to help project stakeholders explore ideas for possible features, and to discover, analyze, and specify requirements for the project. As prototypes are typically thrown-away following the initial analysis phase, it is imperative for them to be created quickly wi...
Conference Paper
Full-text available
Software repositories hold applications that are often categorized to improve the effectiveness of various maintenance tasks. Properly categorized applications allow stakeholders to identify requirements related to their applications and predict maintenance problems in software projects. Unfortunately, for different legal and organizational reasons...
Conference Paper
As a programmer writes new software, he or she may instinctively sense that certain functionality is generally or widely-enough applicable to have been implemented before. Unfortunately, programmers face major challenges when attempting to reuse this functionality: First, developers must search for source code relevant to the high-level task at han...
Conference Paper
Full-text available
In this demonstration, we present a code search system called Portfolio that retrieves and visualizes relevant functions and their usages. We will show how chains of relevant functions and their usages can be visualized to users in response to their queries.
Conference Paper
Source code search engines locate and display fragments of code relevant to user queries. These fragments are often isolated and detached from one another. Programmers need to see how source code interacts in order to understand the concepts implemented in that code, however. In this paper, we present Portfolio, a source code search engine that ret...
Conference Paper
Full-text available
Different studies show that programmers are more interested in finding definitions of functions and their uses than variables, statements, or arbitrary code fragments [30, 29, 31]. Therefore, programmers require support in finding relevant functions and determining how those functions are used. Unfortunately, existing code search engines do not pro...
Article
Full-text available
A fundamental problem of finding software applications that are highly relevant to development tasks is the mismatch between the high-level intent reflected in the descriptions of these tasks and low-level implementation details of applications. To reduce this mismatch we created an approach called Exemplar (EXEcutable exaMPLes ARchive) for finding...
Article
Full-text available
Online source code repositories contain software projects that already implement certain requirements that developers must fulfill. Programmers can reuse code from these existing projects if they can find relevant code without significant effort. We propose a new method to recommend source code examples to developers by querying against Application...
Conference Paper
Full-text available
Searching for applications that are highly relevant to development tasks is challenging because the high-level intent reflected in the descriptions of these tasks doesn't usually match the low-level implementation details of applications. In this demo we show a novel code search engine called Exemplar (EXEcutable exaMPLes ARchive) to bridge this mi...
Conference Paper
Full-text available
A fundamental problem of finding applications that are highly rel- evant to development tasks is the mismatch between the high-level intent reflected in the descriptions of these tasks and low-level implementation details of applications. To reduce this mismatch we created an approach called Exemplar (EXEcutable exaMPLes ARchive) for finding highly...
Conference Paper
Full-text available
Getting insight into different aspects of source code artifacts is increasingly important -- yet there is little empirical research using large bodies of source code, and subsequently there are not much statistically significant evidence of common patterns and facts of how programmers write source code. We pose 32 research questions, explain ration...
Article
Existing methods for recovering traceability links among software documentation artifacts analyze textual similarities among these artifacts. It may be the case, however, that related documentation elements share little terminology or phrasing. This paper presents a technique for indirectly recovering these traceability links in requirements docume...