Collin McMillan's research while affiliated with University of Notre Dame and other places
What is this page?
This page lists the scientific contributions of an author, who either does not have a ResearchGate profile, or has not yet added these contributions to their profile.
It was automatically created by ResearchGate to create a record of this author's body of work. We create such pages to advance our goal of creating and maintaining the most comprehensive scientific repository possible. In doing so, we process publicly available (personal) data relating to the author as a member of the scientific community.
If you're a ResearchGate member, you can follow this page to keep up with this author's work.
If you are this author, and you don't want us to display this page anymore, please let us know.
It was automatically created by ResearchGate to create a record of this author's body of work. We create such pages to advance our goal of creating and maintaining the most comprehensive scientific repository possible. In doing so, we process publicly available (personal) data relating to the author as a member of the scientific community.
If you're a ResearchGate member, you can follow this page to keep up with this author's work.
If you are this author, and you don't want us to display this page anymore, please let us know.
Publications (98)
Source code summarization is the task of writing natural language descriptions of source code. A typical use case is generating short summaries of subroutines for use in API documentation. The heart of almost all current research into code summarization is the encoder-decoder neural architecture, and the encoder input is almost always a single subr...
Source code summarization is the task of writing natural language descriptions of source code. The primary use of these descriptions is in documentation for programmers. Automatic generation of these descriptions is a high value research target due to the time cost to programmers of writing these descriptions themselves. In recent years, a confluen...
A code summary is a brief natural language description of source code. Summaries are usually only a single sentence long, and yet form the backbone of developer documentation. A short descriptions such as "changes all visible polygons to the color blue" can give a programmer a high-level idea of what code does without the effort of reading the code...
This paper launches a new effort at modeling programmer attention by predicting eye movement scanpaths. Programmer attention refers to what information people intake when performing programming tasks. Models of programmer attention refer to machine prediction of what information is important to people. Models of programmer attention are important b...
This paper presents an improved loss function for neural source code summariza-tion. Code summarization is the task of writing natural language descriptions ofsource code. Neural code summarization refers to automated techniques for gen-erating these descriptions using neural networks. Almost all current approachesinvolve neural networks as either...
Source code summarization is the task of writing natural language descriptions of source code behavior. Code summarization underpins software documentation for programmers. Short descriptions of code help programmers understand the program quickly without having to read the code itself. Lately, neural source code summarization has emerged as the fr...
Mobile applications are required to give privacy notices to users when they collect or share personal information. Creating consistent and concise privacy notices can be a challenging task for developers. Previous work has attempted to help developers create privacy notices through a questionnaire or predefined templates. In this paper, we propose...
Neural source code summarization is the task of generating natural language descriptions of source code behavior using neural networks. A fundamental component of most neural models is an attention mechanism. The attention mechanism learns to connect features in source code to specific words to use when generating natural language descriptions. Hum...
This tool demonstration presents a research toolkit for a language model of Java source code. The target audience includes researchers studying problems at the granularity level of subroutines, statements, or variables in Java. In contrast to many existing language models, we prioritize features for researchers including an open and easily-searchab...
Label smoothing is a regularization technique for neural networks. Normally neural models are trained to an output distribution that is a vector with a single 1 for the correct prediction, and 0 for all other elements. Label smoothing converts the correct prediction location to something slightly less than 1, then distributes the remainder to the o...
Source code summarization involves creating brief descriptions of source code in natural language. These descriptions are a key component of software documentation such as JavaDocs. Automatic code summarization is a prized target of software engineering research, due to the high value summaries have to programmers and the simultaneously high cost o...
In source code search, a common information-seeking strategy involves providing a short initial query with a broad meaning, and then iteratively refining the query using terms gleaned from the results of subsequent searches. This strategy requires programmers to spend time reading search results that are irrelevant to their development needs. In co...
API search involves finding components in an API that are relevant to a programming task. For example, a programmer may need a function in a C library that opens a new network connection, then another function that sends data across that connection. Unfortunately, programmers often have trouble finding the API components that they need. A strong sc...
A source code summary of a subroutine is a brief description of that subroutine. Summaries underpin a majority of documentation consumed by programmers, such as the method summaries in JavaDocs. Source code summarization is the task of writing these summaries. At present, most state-of-the-art approaches for code summarization are neural network-ba...
Virtual Assistant technology is rapidly proliferating to improve productivity in a variety of tasks. While several virtual assistants for everyday tasks are well-known (e.g., Siri, Cortana, Alexa), assistants for specialty tasks such as software engineering are rarer. One key reason software engineering assistants are rare is that very few experime...
Source code summarization of a subroutine is the task of writing a short, natural language description of that subroutine. The description usually serves in documentation aimed at programmers, where even brief phrase (e.g. "compresses data to a zip file") can help readers rapidly comprehend what a subroutine does without resorting to reading the co...
A question answering (QA) system is a type of conversational AI that generates natural language answers to questions posed by human users. QA systems often form the backbone of interactive dialogue systems, and have been studied extensively for a wide variety of tasks ranging from restaurant recommendations to medical diagnostics. Dramatic progress...
Source code summarization is the task of creating short, natural language descriptions of source code. Code summarization is the backbone of much software documentation such as JavaDocs, in which very brief comments such as "adds the customer object" help programmers quickly understand a snippet of code. In recent years, automatic code summarizatio...
Virtual Assistant technology is rapidly proliferating to improve productivity in a variety of tasks. While several virtual assistants for everyday tasks are well-known (e.g., Siri, Cortana, Alexa), assistants for specialty tasks such as software engineering are rarer. One key reason software engineering assistants are rare is that very few experime...
Software documentation largely consists of short, natural language summaries of the subroutines in the software. These summaries help programmers quickly understand what a subroutine does without having to read the source code him or herself. The task of writing these descriptions is called "source code summarization" and has been a target of resea...
Automatic source code summarization is the task of generating natural language descriptions for source code. Automatic code summarization is a rapidly expanding research area, especially as the community has taken greater advantage of advances in neural network and AI technologies. In general, source code summariza-tion techniques use the source co...
Automatic source code summarization is the task of generating natural language descriptions for source code. Automatic code summarization is a rapidly expanding research area, especially as the community has taken greater advantage of advances in neural network and AI technologies. In general, source code summarization techniques use the source cod...
Virtual assistant technology has the potential to make a significant impact in the field of software engineering. However, few SE-related datasets exist that would be suitable for the design or training of a virtual assistant. To help lay the groundwork for a hypothetical virtual assistant for API usage, we designed and conducted a Wizard-of-Oz stu...
descriptions of subroutines are short (usually one-sentence) natural language explanations of a subroutine's behavior and purpose in a program. These summaries are ubiquitous in documentation, and many tools such as JavaDocs and Doxygen generate documentation built around them. And yet, extracting summaries from unstructured source code repositorie...
Source Code Summarization is the task of writing short, natural language descriptions of source code. The main use for these descriptions is in software documentation e.g. the one-sentence Java method descriptions in JavaDocs. Code summarization is rapidly becoming a popular research problem, but progress is restrained due to a lack of suitable dat...
Source Code Summarization is the task of writing short, natural language descriptions of source code. The main use for these descriptions is in software documentation e.g. the one-sentence Java method descriptions in JavaDocs. Code summarization is rapidly becoming a popular research problem, but progress is restrained due to a lack of suitable dat...
Source code summarization -- creating natural language descriptions of source code behavior -- is a rapidly-growing research topic with applications to automatic documentation generation, program comprehension, and software maintenance. Traditional techniques relied on heuristics and templates built manually by human experts. Recently, data-driven...
This paper targets the problem of speech act detection in conversations about bug repair. We conduct a ``Wizard of Oz'' experiment with 30 professional programmers, in which the programmers fix bugs for two hours, and use a simulated virtual assistant for help. Then, we use an open coding manual annotation procedure to identify the speech act types...
Software Categorization is the task of organizing software into groups that broadly describe the behavior of the software, such as “editors” or “science.” Categorization plays an important role in several maintenance tasks, such as repository navigation and feature elicitation. Current approaches attempt to cast the problem as text classification,...
This paper targets the problem of speech act detection in conversations about bug repair. We conduct a "Wizard of Oz" experiment with 30 professional programmers, in which the programmers fix bugs for two hours, and use a simulated virtual assistant for help. Then, we use an open coding manual annotation procedure to identify the speech act types i...
Software Categorization is the task of organizing software into groups that broadly describe the behavior of the software, such as "editors" or "science." Categorization plays an important role in several maintenance tasks, such as repository navigation and feature elicitation. Current approaches attempt to cast the problem as text classification,...
Programmers who are blind use a screen reader to speak source code one word at a time, as though the code were text. For example, "float f = 5.23;" can be read as "float f equals five point two three semicolon". This process of reading is in stark contrast to sighted programmers, who skim source code rapidly with their eyes. At present, it is not k...
Commit messages are a valuable resource in comprehension of software evolution, since they provide a record of changes such as feature additions and bug repairs. Unfortunately, programmers often neglect to write good commit messages. Different techniques have been proposed to help programmers by automatically writing these messages. These technique...
Programmers who are blind use a screen reader to speak source code one word at a time, as though the code were text. This process of reading is in stark contrast to sighted programmers, who skim source code rapidly with their eyes. At present, it is not known whether the difference in these processes has effects on the program comprehension gained...
Programmers need documentation to comprehend software, but they often lack the time to write it. Thus, programmers must prioritize their documentation effort to ensure that sections of code important to program comprehension are thoroughly explained. In this paper, we explore the possibility of automatically prioritizing documentation effort. We pe...
“Change Impact Analysis” is the process of determining the consequences of a modification to software. In theory, change impact analysis should be done during software maintenance, to make sure changes do not introduce new bugs. Many approaches and techniques are proposed to help programmers do change impact analysis automatically. However, it is s...
Committing to a version control system means submitting a software change to the system. Each commit can have a message to describe the submission. Several approaches have been proposed to automatically generate the content of such messages. However, the quality of the automatically generated messages falls far short of what humans write. In studyi...
When learning to use an Application Programming Interface (API), programmers need to understand the inputs and outputs (I/O) of the API functions. Current documentation tools automatically document the static information of I/O, such as parameter types and names. What is missing from these tools is dynamic information, such as I/O examples---actual...
A key problem during copy–paste source code reuse is that, to reuse even a small section of code from a program as opposed to an API, a programmer must include a huge amount of additional source code from elsewhere in the same program. This additional code is notoriously large and complex, and portions can only be identified at runtime. In this pap...
Bug reports are crucial software artifacts for both software maintenance researchers and practitioners. A typical use of bug reports by researchers is to evaluate automated software maintenance tools: a large repository of reports is used as input for a tool, and metrics are calculated from the tool's output. But this process is quite different fro...
Blind programmers typically use a screen reader when reading code whereas sighted programmers are able to skim the code with their eyes. This difference has the potential to impact the generalizability of software engineering studies and approaches. We present a summary of a paper which will soon be under review at TSE that investigates how code co...
Terms in source code have become extremely important in Software Engineering research. These "important" terms are typically used as input to research tools. Therefore, the quality of the output of these tools will depend on the quality of the term extraction technique. Currently, there is no definitive best technique for predicting the importance...
Programs are, in essence, a collection of implemented features. Feature discovery in software engineering is the task of identifying key functionalities that a program implements. Manual feature discovery can be time consuming and expensive, leading to automatic feature discovery tools being developed. However, these approaches typically only descr...
Source Code Summarization is an emerging technology for automatically generating brief descriptions of code. Current summarization techniques work by selecting a subset of the statements and keywords from the code, and then including information from those statements and keywords in the summary. The quality of the summary depends heavily on the pro...
Source code summarization is the task of creating readable summaries that describe the functionality of software. Source code summarization is a critical component of documentation generation, for example as Javadocs formed from short paragraphs attached to each method in a Java program. At present, a majority of source code summarization is manual...
Bug reports are crucial software artifacts for both software maintenance researchers and practitioners. A typical use of bug reports by researchers is to evaluate automated software maintenance tools: a large repository of reports is used as input for a tool, and metrics are calculated from the tool's output. But this process is quite different fro...
Source code documentation often contains summaries of source code written by authors. Recently, automatic source code summarization tools have emerged that generate summaries without requiring author intervention. These summaries are designed for readers to be able to understand the high-level concepts of the source code. Unfortunately, there is no...
Some previous work began studying the relationship between application domains and quality, in particular through the prevalence of code and design smells (e.g., anti-patterns). Indeed, it is generally believed that the presence of these smells degrades quality but also that their prevalence varies across domains. Though anecdotal experiences and e...
In this paper, we present an emerging source code summarization technique that uses topic modeling to select keywords and topics as summaries for source code. Our approach organizes the topics in source code into a hierarchy, with more general topics near the top of the hierarchy. In this way, we present the software's highest-level functionality f...
A documentation generator is a programming tool that creates documentation for software by analyzing the statements and comments in the software's source code. While many of these tools are manual, in that they require specially-formatted metadata written by programmers, new research has made inroads towards automatic generation of documentation. T...
In the past decade, there have been many well-publicized cases of source code leaking from different well-known companies. These leaks pose a serious problem when the source code contains sensitive information encoded in its identifier names and comments. Unfortunately, redacting the sensitive information requires obfuscating the identifiers, which...
A key problem during source code reuse is that, to reuse even a small section of code from a program, a programmer must include a huge amount of dependency source code from elsewhere in the same program. These dependencies are no- toriously large and complex, and many can only be known at runtime. In this paper, we propose execution record/replay a...
Source Code Summarization is an emerging technology for automatically generating brief descriptions of code. Current summarization techniques work by selecting a subset of the statements and keywords from the code, and then including information from those statements and keywords in the summary. The quality of the summary depends heavily on the pro...
This paper presents a technique for automatically mining and visualizing API usage examples. In contrast to previous approaches, our technique is capable of finding examples of API usage that occur across an arbitrary number of functions in a program. This distinction is important because of a gap between what current API learning tools provide and...
Different studies show that programmers are more interested in finding definitions of functions and their uses than variables, statements, or ordinary code fragments. Therefore, developers require support in finding relevant functions and determining how these functions are used. Unfortunately, existing code search engines do not provide enough of...
Traceability recovery is a key software mainte-nance activity in which software engineers extract the relation-ships among software artifacts. Information Retrieval (IR) has been widely accepted as a method for automated traceability recovery based on the textual similarity among the software artifacts. However, a notorious difficulty for IR-based...
Software repositories hold applications that are often categorized to improve the effectiveness of various maintenance tasks. Properly categorized applications allow stakeholders to identify requirements related to their applications and predict maintenance problems in software projects. Manual categorization is expensive, tedious, and laborious –...
Although popular text search engines allow users to retrieve similar web pages, source code search engines do not have this feature. Detecting similar applications is a notoriously difficult problem, since it implies that similar highlevel requirements and their low-level implementations can be detected and matched automatically for different appli...
Rapid prototypes are often developed early in the software development process in order to help project stakeholders explore ideas for possible features, and to discover, analyze, and specify requirements for the project. As prototypes are typically thrown-away following the initial analysis phase, it is imperative for them to be created quickly wi...
Software repositories hold applications that are often categorized to improve the effectiveness of various maintenance tasks. Properly categorized applications allow stakeholders to identify requirements related to their applications and predict maintenance problems in software projects. Unfortunately, for different legal and organizational reasons...
As a programmer writes new software, he or she may instinctively sense that certain functionality is generally or widely-enough applicable to have been implemented before. Unfortunately, programmers face major challenges when attempting to reuse this functionality: First, developers must search for source code relevant to the high-level task at han...
Different studies show that programmers are more interested in finding definitions of functions and their uses than variables, statements, or arbitrary code fragments [30, 29, 31]. Therefore, programmers require support in finding relevant functions and determining how those functions are used. Unfortunately, existing code search engines do not pro...
Source code search engines locate and display fragments of code relevant to user queries. These fragments are often isolated and detached from one another. Programmers need to see how source code interacts in order to understand the concepts implemented in that code, however. In this paper, we present Portfolio, a source code search engine that ret...
In this demonstration, we present a code search system called Portfolio that retrieves and visualizes relevant functions and their usages. We will show how chains of relevant functions and their usages can be visualized to users in response to their queries.
A fundamental problem of finding software applications that are highly relevant to development tasks is the mismatch between the high-level intent reflected in the descriptions of these tasks and low-level implementation details of applications. To reduce this mismatch we created an approach called Exemplar (EXEcutable exaMPLes ARchive) for finding...
Getting insight into different aspects of source code artifacts is increasingly important -- yet there is little empirical research using large bodies of source code, and subsequently there are not much statistically significant evidence of common patterns and facts of how programmers write source code. We pose 32 research questions, explain ration...