Conference Paper

Making sense of online code snippets

Authors: Subramanian and Holmes

Abstract

Stack Overflow contains a large number of high-quality source code snippets. The quality of these snippets has been verified by users marking them as solving a specific problem. Stack Overflow treats source code snippets as plain text, and searches surface snippets as they would any other text. Unfortunately, plain text does not capture the structural qualities of these snippets; for example, snippets frequently refer to specific APIs (e.g., the Android API), but because the snippets are treated as text, the linkage to the Android API is not always apparent. We perform snippet analysis to extract structural information from the short plain-text snippets that are often found on Stack Overflow. This analysis identifies 253,137 method calls and type references from 21,250 Stack Overflow code snippets. We show how identifying these structural relationships in snippets could outperform lexical search over code blocks in practice.
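
A minimal sketch (not the paper's implementation) of what such structural extraction can look like for a free-standing snippet, assuming the open-source JavaParser library (com.github.javaparser) is available; the example snippet and class name are illustrative:

// Recover method calls and type references from a free-standing snippet.
import com.github.javaparser.StaticJavaParser;
import com.github.javaparser.ast.expr.MethodCallExpr;
import com.github.javaparser.ast.stmt.BlockStmt;
import com.github.javaparser.ast.type.ClassOrInterfaceType;

public class SnippetStructure {
    public static void main(String[] args) {
        // A typical Stack Overflow snippet: bare statements, no class or method header.
        String snippet =
            "Calendar c = Calendar.getInstance();\n" +
            "c.set(2013, 0, 1);\n" +
            "Date d = c.getTime();";

        // Wrap the statements in a block so a standard Java parser accepts them.
        BlockStmt block = StaticJavaParser.parseBlock("{" + snippet + "}");

        // Method calls (getInstance, set, getTime) hint at API usage.
        block.findAll(MethodCallExpr.class)
             .forEach(call -> System.out.println("call: " + call.getNameAsString()));

        // Type references (Calendar, Date) are candidates for linking to API types.
        block.findAll(ClassOrInterfaceType.class)
             .forEach(type -> System.out.println("type: " + type.getNameAsString()));
    }
}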


... An API element refers to 'a named entity belonging to an API, such as a class, interface, or method' [9]. With a large volume of data, SO is a good source for data mining and analytics related to APIs [9][10][11][12]. ...
... The massive volume of crowd-generated data in SO makes it a suitable repository for data mining and analytics on crowd documentation of APIs [9][10][11][12]. One supporting reason is that SO posts contain a huge number of code snippets of good quality [11]. ...
Article
Full-text available
Abstract To address the lexical gaps between natural language (NL) queries and Application Programming Interface (API) documentation, and between NL queries and program code, this study developed a novel approach for recommending Java API classes that are relevant to the programming tasks described in NL queries. A Doc2Vec model was trained using question titles mined from Stack Overflow. The model was used to find question titles that are semantically similar to a query. Latent Dirichlet Allocation (LDA) topic modelling was applied to the Java API classes (extracted from code snippets found in the accepted answers of these similar questions) to extract a single topic comprising the Top-10 Java API classes relevant to the query. Benchmarking the proposed approach against the state-of-the-art approaches RACK and NLP2API using four performance metrics shows that it is possible to produce comparable API recommendation results with a less complex approach that uses basic machine learning models, in particular Doc2Vec and LDA. The approach was implemented in a Java API class recommender with an Eclipse IDE plug-in serving as the front-end.
... When reusing SO code snippets, developers and researchers face a major obstacle: most SO code snippets do not compile [5], [6], [7]. It mainly occurs because they are written for illustrative purposes, to convey solutions at a high level, without implementation details [8]. ...
... Terragni et al. have shown that ≈92 % of 491,906 analyzed SO code snippets are uncompilable [5]. A common missing implementation detail is the type declaration [5], [6]. For instance, the JAVA SO code snippet in Fig. 1 (left side) misses the declarations of the types Calendar and Date. ...
... The SO code snippet in Fig. 1 (left side) is an example of dangling statements. One could automatically wrap the code snippet inside a generic method declaration [5], [6] (e.g., the main function). It would resolve compilation errors but would not recover the proper method declaration that exposes the intended input and output of the code snippet. ...
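
As a concrete illustration of the generic wrapping mentioned above, here is a minimal sketch (not the cited tools' implementation) that embeds a snippet in a throwaway class and main method so that it parses and compiles, without recovering the intended method signature; the class name and blanket import are assumptions:

public class SnippetWrapper {
    // Wrap free-standing statements in a placeholder class and main method.
    public static String wrap(String snippet) {
        return "import java.util.*;\n"                       // coarse placeholder import
             + "public class Wrapped {\n"
             + "  public static void main(String[] args) throws Exception {\n"
             + snippet + "\n"
             + "  }\n"
             + "}\n";
    }

    public static void main(String[] args) {
        String snippet = "Calendar c = Calendar.getInstance();\nDate d = c.getTime();";
        System.out.println(wrap(snippet));
    }
}
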
Preprint
Developer forums like StackOverflow have become essential resources to modern software development practices. However, many code snippets lack a well-defined method declaration, and thus they are often incomplete for immediate reuse. Developers must adapt the retrieved code snippets by parameterizing the variables involved and identifying the return value. This activity, which we call APIzation of a code snippet, can be tedious and time-consuming. In this paper, we present APIzator to perform APIzations of Java code snippets automatically. APIzator is grounded in four common patterns that we extracted by studying real APIzations in GitHub. APIzator implements a static analysis algorithm that automatically extracts the method parameters and return statements. We evaluated APIzator with a ground-truth of 200 APIzations collected from 20 developers. For 113 (56.50 %) and 115 (57.50 %) APIzations, APIzator and the developers extracted identical parameters and return statements, respectively. For 163 (81.50 %) APIzations, either the parameters or the return statements were identical.
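
A simplified heuristic in the spirit of the APIzation idea described above, not APIzator's actual patterns or algorithm: names used but never declared in the snippet become candidate parameters, and the last declared variable becomes the candidate return value (assuming the JavaParser library; the snippet and filtering rule are illustrative assumptions):

import com.github.javaparser.StaticJavaParser;
import com.github.javaparser.ast.body.VariableDeclarator;
import com.github.javaparser.ast.expr.NameExpr;
import com.github.javaparser.ast.stmt.BlockStmt;

import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class ApizationSketch {
    public static void main(String[] args) {
        String snippet =
            "Calendar c = Calendar.getInstance();\n" +
            "c.setTime(input);\n" +            // 'input' is never declared -> parameter candidate
            "Date result = c.getTime();";      // last declared variable -> return candidate

        BlockStmt block = StaticJavaParser.parseBlock("{" + snippet + "}");

        // Names declared inside the snippet.
        List<VariableDeclarator> decls = block.findAll(VariableDeclarator.class);
        Set<String> declared = new LinkedHashSet<>();
        decls.forEach(v -> declared.add(v.getNameAsString()));

        // Lower-case names used but not declared are treated as candidate parameters
        // (upper-case names are assumed to be type references such as Calendar).
        Set<String> parameters = new LinkedHashSet<>();
        block.findAll(NameExpr.class).forEach(n -> {
            String name = n.getNameAsString();
            if (!declared.contains(name) && !Character.isUpperCase(name.charAt(0))) {
                parameters.add(name);
            }
        });

        String returned = decls.isEmpty() ? "void" : decls.get(decls.size() - 1).getNameAsString();
        System.out.println("candidate parameters: " + parameters);
        System.out.println("candidate return:     " + returned);
    }
}
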
... However, there are questions surrounding the quality of answers on Stack Overflow [25,64]. In addition, given that developers use many of the code snippets in posts on Stack Overflow during development [62], it is important to evaluate the quality of these artefacts. In particular, while the Stack Overflow community's collective surveillance may help to identify and improve errors in code snippets on this platform and users may appropriately use code snippets by adapting them to their specific problem/task, this is not always the case. ...
... Given the preceding concerns, it is important for research work to help the software development community with quality validations of Stack Overflow to fill this gap. In addition, given that developers use many of the code snippets in posts on Stack Overflow during development [62], it is important to evaluate the quality of these artefacts. Some investigations have been conducted looking at specific aspects of code snippets, such as the work of Squire and Funkhouser [60], who recommended a 1:3 ratio of code to text in answers on Stack Overflow. ...
... It was thus not within the scope of this research project to assess code snippets in answers that were not Java-related, which could be deemed a limitation. However, given Java's popularity and its preference in previous research [22,62], we believe that by studying this language we provide needed insights for the software engineering community. That said, we have analysed 151,954 Stack Overflow Java code snippets in this work (refer to Section 3.2). ...
Article
Full-text available
Community Question and Answer (CQA) platforms use the power of online groups to solve problems, or gain information. While these websites host useful information, it is critical that the details provided on these platforms are of high quality, and that users can trust the information. This is particularly necessary for software development, given the ubiquitous use of software across all sections of contemporary society. Stack Overflow is the leading CQA platform for programmers, with a community comprising over 10 million contributors. While research confirms the popularity of Stack Overflow, concerns have been raised about the quality of answers that are provided to questions on Stack Overflow. Code snippets often contained in these answers have been investigated; however, the quality of these artefacts remains unclear. This could be problematic for the software engineering community, as evidence has shown that Stack Overflow snippets are frequently used in both open source and commercial software. This research fills this gap by evaluating the quality of code snippets on Stack Overflow. We explored various aspects of code snippet quality, including reliability and conformance to programming rules, readability, performance and security. Outcomes show variation in the quality of Stack Overflow code snippets for the different dimensions; however, overall, quality issues in Stack Overflow snippets were not always severe. Vigilance is encouraged for those reusing Stack Overflow code snippets.
... We choose SCC because it has high precision and recall and also scales to a large code corpus. Since SO snippets are often free-standing statements [23,24], we parse and tokenize them using a customized Java parser [25]. Prior work finds that larger SO snippets have more meaningful clones in GitHub [26]. ...
... Quality assessment of SO examples. Our work is inspired by previous studies that find SO examples are incomplete and inadequate [7]-[9], [12], [23], [24], [29]. Subramanian and Holmes find that the majority of SO snippets are free-standing statements with no class or method headers [23]. Zhou et al. find that 86 of 200 accepted SO posts use deprecated APIs but only 3 of them are reported by other programmers [9]. ...
Preprint
Developers often resort to online Q&A forums such as Stack Overflow (SO) for filling their programming needs. Although code examples on those forums are good starting points, they are often incomplete and inadequate for developers' local program contexts; adaptation of those examples is necessary to integrate them to production code. As a consequence, the process of adapting online code examples is done over and over again, by multiple developers independently. Our work extensively studies these adaptations and variations, serving as the basis for a tool that helps integrate these online code examples in a target context in an interactive manner. We perform a large-scale empirical study about the nature and extent of adaptations and variations of SO snippets. We construct a comprehensive dataset linking SO posts to GitHub counterparts based on clone detection, time stamp analysis, and explicit URL references. We then qualitatively inspect 400 SO examples and their GitHub counterparts and develop a taxonomy of 24 adaptation types. Using this taxonomy, we build an automated adaptation analysis technique on top of GumTree to classify the entire dataset into these types. We build a Chrome extension called ExampleStack that automatically lifts an adaptation-aware template from each SO example and its GitHub counterparts to identify hot spots where most changes happen. A user study with sixteen programmers shows that seeing the commonalities and variations in similar GitHub counterparts increases their confidence about the given SO example, and helps them grasp a more comprehensive view about how to reuse the example differently and avoid common pitfalls.
... Recently, Fischer et al. find that 29% of security-related snippets in Stack Overflow are insecure and these snippets could have been reused by over 1 million Android apps on Google play, which raises a big security concern [9]. Previous studies have also investigated the quality of online code examples in terms of compilability [23,37], unchecked obsolete usage [39], and comprehension issues [29]. However, none of these studies have investigated the reliability of online code examples in terms of API usage correctness. ...
... Previous studies have shown that online code snippets are often unparsable [23,37] and contain ambiguous API elements [5] due to the incompleteness of these snippets. ExampleCheck leverages a state-of-the-art partial program parsing and type resolution technique to handle these incomplete snippets, whose accuracy of API resolution is reported to be 97% [24]. ...
... While the precision is rather low, ExampleCheck could still be useful in the case of false positives, since the goal of ExampleCheck is not to discard SO posts with potential API violations, but rather to suggest desirable or alternative API usage details to the users. Previous studies have shown that online code snippets are often unparsable [23,37]. Due to the incompleteness of code snippets, 89% of API names in code snippets from online forums are ambiguous and cannot be easily resolved [5]. ...
Conference Paper
Full-text available
Programmers often consult an online Q&A forum such as Stack Overflow to learn new APIs. This paper presents an empirical study on the prevalence and severity of API misuse on Stack Overflow. To reduce manual assessment effort, we design ExampleCheck, an API usage mining framework that extracts patterns from over 380K Java repositories on GitHub and subsequently reports potential API usage violations in Stack Overflow posts. We analyze 217,818 Stack Overflow posts using ExampleCheck and find that 31% may have potential API usage violations that could produce unexpected behavior such as program crashes and resource leaks. Such API misuse is caused by three main reasons---missing control constructs, missing or incorrect order of API calls, and incorrect guard conditions. Even the posts that are accepted as correct answers or upvoted by other programmers are not necessarily more reliable than other posts in terms of API misuse. This study result calls for a new approach to augment Stack Overflow with alternative API usage details that are not typically shown in curated examples.
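
A toy illustration of the violation-reporting idea described above, not ExampleCheck itself: a mined pattern is treated as a required in-order call sequence, and a snippet's extracted call sequence is flagged when the pattern is not present. The pattern and call names below are made-up assumptions.

import java.util.Arrays;
import java.util.List;

public class ApiMisuseSketch {
    // Returns true if 'pattern' appears as an in-order subsequence of 'calls'.
    static boolean followsPattern(List<String> calls, List<String> pattern) {
        int i = 0;
        for (String call : calls) {
            if (i < pattern.size() && call.equals(pattern.get(i))) {
                i++;
            }
        }
        return i == pattern.size();
    }

    public static void main(String[] args) {
        List<String> minedPattern = Arrays.asList("moveToFirst", "getString", "close");
        List<String> snippetCalls = Arrays.asList("query", "getString", "close"); // missing moveToFirst

        if (!followsPattern(snippetCalls, minedPattern)) {
            System.out.println("Potential API misuse: snippet does not follow mined pattern " + minedPattern);
        }
    }
}
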
... Text preceding code segments in Stack Overflow can be extracted as potential comments for similar code segments in an application [6]. Similarly, method descriptions can be extracted using clues in the text [15], [16]. ...
... Subramanian et al. [16] performed analyses of source code snippets found in Stack Overflow, constructing an Abstract Syntax Tree (AST) for each code snippet and then parsing it to effectively identify specific API usage. Building on their previous work [16], Subramanian et al. [6] developed an iterative, deductive method of linking source code examples to API documentation. ...
Preprint
Full-text available
The availability of large corpora of online software-related documents today presents an opportunity to use machine learning to improve integrated development environments by first automatically collecting code examples along with associated descriptions. Digital libraries of computer science research and education conference and journal articles can be a rich source for code examples that are used to motivate or explain particular concepts or issues. Because they are used as examples in an article, these code examples are accompanied by descriptions of their functionality, properties, or other associated information expressed in natural language text. Identifying code segments in these documents is relatively straightforward, thus this paper tackles the problem of extracting the natural language text that is associated with each code segment in an article. We present and evaluate a set of heuristics that address the challenges of the text often not being colocated with the code segment as in developer communications such as online forums.
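
A minimal sketch of the baseline idea mentioned in the earlier excerpt (the text immediately preceding a code segment serves as its candidate description), not the paper's heuristics for non-colocated text; the markers and post text are illustrative assumptions.

public class CodeContextSketch {
    public static void main(String[] args) {
        String post =
            "You can format the current date as follows. " +
            "The pattern string controls the output.\n" +
            "<code>new SimpleDateFormat(\"yyyy-MM-dd\").format(new Date());</code>\n" +
            "This prints the ISO date.";

        int codeStart = post.indexOf("<code>");
        if (codeStart >= 0) {
            // Text before the code block, trimmed to the last couple of sentences.
            String before = post.substring(0, codeStart).trim();
            String[] sentences = before.split("(?<=[.!?])\\s+");
            int keep = Math.min(2, sentences.length);
            StringBuilder description = new StringBuilder();
            for (int i = sentences.length - keep; i < sentences.length; i++) {
                description.append(sentences[i]).append(' ');
            }
            System.out.println("candidate description: " + description.toString().trim());
        }
    }
}
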
... However, there is a major obstacle in reusing or analyzing Q&A code snippets. The majority of Q&A posts (91.59% of the 491,906 posts we collected from StackOverflow) contain uncompilable code snippets [15,45], indicating that they are non-executable and semantically incomplete for precise static analysis [26]. This phenomenon occurs because code snippets on Q&A sites are written for illustrative purposes, where compilability is not a concern. ...
... From the 491,906 posts, we observed that 35.71% (175,653) contain multiple code snippets like the example in Figure 1. Following Subramanian et al.'s work [45], which analysed Android code snippets in Stack Overflow, for the baseline synthesis we considered each code snippet in a post as an individual c-unit. For each c-unit, we also include all import declarations that are present in the corresponding code snippet. ...
... Second, it is often difficult to automatically identify which libraries a Q&A post refers to [46,38,45]. This is because the majority of code snippets refer to library classes using simple names. ...
Conference Paper
Full-text available
Popular Q&A sites like StackOverflow have collected numerous code snippets. However, many of them do not have complete type information, making them uncompilable and inapplicable to various software engineering tasks. This paper analyzes this problem, and proposes a technique CSNIPPEX to automatically convert code snippets into compilable Java source code files by resolving external dependencies, generating import declarations, and fixing syntactic errors. We implemented CSNIPPEX as a plug-in for Eclipse and evaluated it with 242,175 StackOverflow posts that contain code snippets. CSNIPPEX successfully synthesized compilable Java files for 40,410 of them. It was also able to effectively recover import declarations for each post with a precision of 91.04% in a couple of seconds.
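
A minimal sketch of the import-recovery idea, not CSNIPPEX's algorithm: simple type names found in a snippet are looked up in a precomputed map of fully qualified names (in practice such a map would be built from candidate libraries; the map and snippet below are hand-written stand-ins).

import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImportRecoverySketch {
    public static void main(String[] args) {
        Map<String, String> knownTypes = Map.of(
            "Calendar", "java.util.Calendar",
            "Date", "java.util.Date",
            "SimpleDateFormat", "java.text.SimpleDateFormat");

        String snippet = "Calendar c = Calendar.getInstance();\nDate d = c.getTime();";

        // Capitalized identifiers are treated as candidate type references.
        Set<String> imports = new LinkedHashSet<>();
        Matcher m = Pattern.compile("\\b[A-Z][A-Za-z0-9_]*\\b").matcher(snippet);
        while (m.find()) {
            String fqn = knownTypes.get(m.group());
            if (fqn != null) {
                imports.add("import " + fqn + ";");
            }
        }
        imports.forEach(System.out::println);
    }
}
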
... Algorithms dedicated to API type resolution of code snippets have been proposed. Baker [53,3] traverses a best-effort Abstract Syntax Tree (AST) constructed by the Eclipse JDT parser or the Esprima parser for JavaScript to collect type information at variable declaration nodes. It then associates a list of candidate FQNs for these nodes by consulting a database populated from the JAR files of candidate libraries in the case of Java code snippets. ...
... Resolutions learned by combining the language and mapping model are refined based on the local context surrounding the name to be resolved. StatType achieved higher accuracy than Baker [53,3]; however, training its models may be computationally expensive. In contrast to StatType, RESICO handles the incompleteness problem as a classification procedure where a learned context will influence the label to be predicted (i.e., the FQN of the API reference in analysis). ...
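
A toy sketch of the deductive narrowing behind Baker-style type resolution: start from all candidate fully qualified names for a simple name and keep only those whose API declares the methods invoked on the variable. The candidate database below is a hand-written assumption, not Baker's oracle.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CandidateNarrowingSketch {
    public static void main(String[] args) {
        // Simple name -> candidate FQNs, and FQN -> (subset of) methods it declares.
        Map<String, List<String>> candidates = Map.of(
            "Date", Arrays.asList("java.util.Date", "java.sql.Date"));
        Map<String, Set<String>> declaredMethods = Map.of(
            "java.util.Date", Set.of("getTime", "after", "before"),
            "java.sql.Date", Set.of("getTime", "toLocalDate"));

        // Methods observed on a variable declared with the simple type 'Date' in the snippet.
        List<String> observedCalls = Arrays.asList("getTime", "toLocalDate");

        List<String> remaining = new ArrayList<>(candidates.get("Date"));
        remaining.removeIf(fqn -> !declaredMethods.get(fqn).containsAll(observedCalls));

        System.out.println("resolved candidates for 'Date': " + remaining);
    }
}
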
... For instance, Tavakoli et al. (2020) showed that more than half of the code snippets they investigated were incomplete. Similarly, Subramanian & Holmes (2013) showed that 82% of code snippets investigated were incomplete. Therefore, collaborative community platforms like Stack ...
... consider important when judging Stack Overflow code quality. Consequently, several studies have investigated Reusability (Ahmad & Cinnéide, 2019; AlOmar et al., 2020; Verdi et al., 2020), Security (Meng et al., 2018; Zhang et al., 2021), Readability (Campos et al., 2019; Meldrum et al., 2020a), Completeness (Tavakoli et al., 2020; Subramanian & Holmes, 2013), and Reliability (Zhang et al., 2018; Abdalkareem et al., 2017). For example, Ahmad & Cinnéide (2019) investigated the reusability aspect of code snippet quality by studying how the quality of the program evolves over the time of the project. Meng et al. (2018) pointed out that metadata from Stack Overflow, like accepted answers, responders' reputation scores, and high vote counts, can further mislead developers into taking insecure advice from wrongly promoted posts with vulnerable code because they are upvoted and highly viewed. ...
Article
Context Over the years, there has been debate about what constitutes software quality and how it should be measured. This controversy has caused uncertainty across the software engineering community, affecting levels of commitment to the many potential determinants of quality among developers. An up-to-date catalogue of software quality views could provide developers with contemporary guidelines and templates. In fact, it is necessary to learn about views on the quality of code on frequently used online collaboration platforms (e.g., Stack Overflow), given that the quality of code snippets can affect the quality of software products developed. If quality models are unsuitable for aiding developers because they lack relevance, developers will hold relaxed or inappropriate views of software quality, thereby lacking awareness and commitment to such practices. Objective We aim to explore differences in interest in quality characteristics across research and practice. We also seek to identify quality characteristics practitioners consider important when judging code snippet quality. First, we examine the literature for quality characteristics used frequently for judging software quality, followed by the quality characteristics commonly used by researchers to study code snippet quality. Finally, we investigate quality characteristics used by practitioners to judge the quality of code snippets. Methods We conducted two systematic literature reviews followed by semi-structured interviews of 50 practitioners to address this gap. Results The outcomes of the semi-structured interviews revealed that most practitioners judged the quality of code snippets using five quality dimensions: Functionality, Readability, Efficiency, Security and Reliability. However, other dimensions were also considered (i.e., Reusability, Maintainability, Usability, Compatibility and Completeness). This outcome differed from how the researchers judged code snippet quality. Conclusion Practitioners today mainly rely on code snippets from online code resources, and specific models or quality characteristics are emphasised based on their need to address distinct concerns (e.g., mobile vs web vs standalone applications, regular vs machine learning applications, or open vs closed source applications). Consequently, software quality models should be adapted for the domain of consideration and not seen as one-size-fits-all. This study will lead to targeted support for various clusters of the software development community.
... Rigby & Robillard developed a traceability retrieval method known as Automatic Code Element Extractor (ACE), aimed at retrieving the code components present in several documents [180]. Similarly, Subramanian & Holmes analyzed code snippets to extract valuable information from the plain-text pieces in SO posts [181]. Later on, Subramanian et al. extended the work of [181] in the same line by developing a method called Baker [183]. ...
Thesis
Recent years have witnessed enormous growth in the use of social media as a way to communicate and share knowledge. Social media mechanisms such as wikis, blogs, Q&A websites/forums, and feeds have tremendously shaped the way we communicate, work and play online. Many of these social media technologies have successfully made their way into collaborative software engineering. As a result, software developers too have started utilizing social Q&A websites for effective software development. Software developers around the globe actively ask questions and share solutions to software development problems on Stack Overflow - a social question and answer (Q&A) website. The knowledge shared by software developers on Stack Overflow contains useful information related to software development, such as feature requests (functional/non-functional), code snippets, bug reports and sentiments.
... Due to the effectiveness of code snippets, several works have tried to mine these from various sources to answer questions pertinent to software engineering. These questions deal with aspects like code snippet collection, best use, evolution, maintainability, legal implications, etc. Subramanian et al. did an analysis of Stack Overflow code snippets to identify structural relationships from snippets on questions relevant to Android [10]. XSnippet was proposed as a context-aware framework to facilitate developers in querying relevant code snippets from a sample repository [11]. ...
... We calculated the bootstrap difference between the executability of samples with and without GitHub reference. We used 10000 iterations resulting in a 95% CI of [3.69, 10.77] and an average of 7.2 percent. This shows us that code snippets with GitHub references tend to have a higher executability. ...
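
A minimal sketch of the bootstrap estimate described above, with made-up group sizes and executability rates: resample each group with replacement many times and take the 2.5th and 97.5th percentiles of the difference in executability.

import java.util.Arrays;
import java.util.Random;

public class BootstrapSketch {
    // Resample one group with replacement and return its executability rate in percent.
    static double resampleRate(boolean[] group, Random rnd) {
        int hits = 0;
        for (int i = 0; i < group.length; i++) {
            if (group[rnd.nextInt(group.length)]) hits++;
        }
        return 100.0 * hits / group.length;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        // Illustrative data: executability flags for snippets with / without a GitHub reference.
        boolean[] withRef = new boolean[500];
        boolean[] withoutRef = new boolean[5000];
        for (int i = 0; i < withRef.length; i++) withRef[i] = i < 170;        // ~34% executable
        for (int i = 0; i < withoutRef.length; i++) withoutRef[i] = i < 1350; // ~27% executable

        int iterations = 10_000;
        double[] diffs = new double[iterations];
        for (int i = 0; i < iterations; i++) {
            diffs[i] = resampleRate(withRef, rnd) - resampleRate(withoutRef, rnd);
        }
        Arrays.sort(diffs);
        System.out.printf("95%% CI of difference: [%.2f, %.2f]%n",
                diffs[(int) (0.025 * iterations)], diffs[(int) (0.975 * iterations)]);
    }
}
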
Preprint
Full-text available
Online resources today contain an abundant amount of code snippets for documentation, collaboration, learning, and problem-solving purposes. Their executability in a "plug and play" manner enables us to confirm their quality and use them directly in projects. But, in practice that is often not the case due to several requirements violations or incompleteness. However, it is a difficult task to investigate the executability on a large scale due to different possible errors during the execution. We have developed a scalable framework to investigate this for SOTorrent Python snippets. We found that with minor adjustments, 27.92% of snippets are executable. The executability has not changed significantly over time. The code snippets referenced in GitHub are more likely to be directly executable. But executability does not affect the chances of the answer to be selected as the accepted answer significantly. These properties help us understand and improve the interaction of users with online resources that include code snippets.
... One study found that the greatest obstacle to learning an API in practice is "insufficient or inadequate examples" [30]. Subramanian et al. [31] analyzed 39,000 code snippets given in response to SO questions and found that only 6,766 (17%) were complete files with class and method declarations, 6,302 (16%) code snippets were just method bodies devoid of class declarations, and the remaining 66% contained standalone source code statements (i.e., the majority are not compilable code fragments with complete class and method body declarations). Since the code is usually not complete, information present in the code is often not sufficient to resolve API method accesses. ...
... Since the code is usually not complete, information present in the code is often not sufficient to resolve API method accesses. Moreover, they observed that most answers extend on details provided in the question; because of this, certain aspects of the snippet, like variable declarations are often skipped [31]. ...
Article
Full-text available
Stack Overflow (SO) is a question and answer service directed to issues related to software development. In SO, developers post questions related to a programming topic and other members of the site can provide answers to help them. The information available on this type of service is also known as "crowd knowledge" and currently is one important trend in supporting activities related to software development. We present an approach that makes use of "crowd knowledge" available in SO to recommend information that can assist developers in their activities. This strategy recommends a ranked list of question-answer pairs from SO based on a query. The criteria for ranking are based on three main aspects: the textual similarity of the pairs with respect to the query related to the developer's problem, the quality of the pairs, and a filtering mechanism that considers only "how-to" posts. We conducted an experiment considering programming problems on three different topics (Swing, Boost and LINQ) widely used by the software development community to evaluate the proposed recommendation strategy. The results have shown that for Lucene+Score+How-to approach, 77.14% of the assessed activities have at least one recommended pair proved to be useful concerning the target programming problem.
... It allows programmers to ask questions and give answers to programming problems. The website has been found to be useful for software development [16], [33], [49], [52], [53], [69], [70], [74] and also valuable for educational purposes [47]. On Stack Overflow, each conversation contains a question and a list of answers. ...
... The code snippets on Stack Overflow are mostly examples or solutions to programming problems. Hence, several code search systems use whole or partial data from Stack Overflow as their code search databases [16], [33], [49], [69], [70]. Furthermore, Treude et al. [74] use machine learning techniques to extract insight sentences from Stack Overflow and use them to improve API documentation. ...
Preprint
Full-text available
Online code clones are code fragments that are copied from software projects or online sources to Stack Overflow as examples. Due to an absence of a checking mechanism after the code has been copied to Stack Overflow, they can become toxic code snippets, i.e., they suffer from being outdated or violating the original software license. We present a study of online code clones on Stack Overflow and their toxicity by incorporating two developer surveys and a large-scale code clone detection. A survey of 201 high-reputation Stack Overflow answerers (33% response rate) showed that 131 participants (65%) have ever been notified of outdated code and 26 of them (20%) rarely or never fix the code. 138 answerers (69%) never check for licensing conflicts between their copied code snippets and Stack Overflow's CC BY-SA 3.0. A survey of 87 Stack Overflow visitors shows that they experienced several issues from Stack Overflow answers: mismatched solutions, outdated solutions, incorrect solutions, and buggy code. 85% of them are not aware of CC BY-SA 3.0 license enforced by Stack Overflow, and 66% never check for license conflicts when reusing code snippets. Our clone detection found online clone pairs between 72,365 Java code snippets on Stack Overflow and 111 open source projects in the curated Qualitas corpus. We analysed 2,289 non-trivial online clone candidates. Our investigation revealed strong evidence that 153 clones have been copied from a Qualitas project to Stack Overflow. We found 100 of them (66%) to be outdated and potentially harmful for reuse. Furthermore, we found 214 code snippets that could potentially violate the license of their original software and appear 7,112 times in 2,427 GitHub projects.
... and recall value of 0.30-0.65 dependent on the theme of the document. Similarly, Subramanian and Holmes (2013) analyzed code snippets to extract valuable information from the plain-text pieces in SO posts. Their analysis identified 253,137 method calls and type references in the available SO code snippets. ...
... Later on, Subramanian et al. (2014) extended the work of Subramanian and Holmes (2013) in the same line by developing a method called Baker. They use a constraint-based method to uniquely detect fine-grained type references, method calls, and field references present in source code fragments with high accuracy. ...
Article
Purpose Software developers extensively use Stack Overflow (SO) for knowledge sharing about software development. Thus, software engineering researchers have started mining the structured/unstructured data present in software repositories, including the Q&A software developer community SO, with the aim of improving software development. The purpose of this paper is to show how academics and practitioners can benefit from the valuable user-generated content shared on online social networks, specifically the Q&A community SO, for software development. Design/methodology/approach A comprehensive literature review was conducted, and 166 research papers on SO related to software development, published from the inception of SO until June 2016, were categorized. Findings Most of the studies revolve around a limited number of software development tasks; approximately 70 percent of the papers used data from millions of posts, applied basic machine learning methods, and conducted semi-automated, quantitative investigations. Thus, future research should focus on overcoming the identified challenges and gaps. Practical implications The work on SO is classified into two main categories: "SO design and usage" and "SO content applications." These categories not only give Q&A forum providers insights into the shortcomings in the design and usage of such forums but also provide ways to overcome them in the future. They also enable software developers to exploit such forums for the identified under-utilized software development tasks. Originality/value The study is the first of its kind to explore the work on SO about software development, and it makes an original contribution by presenting a comprehensive review, design/usage shortcomings of Q&A sites, and future research challenges.
... One study found that the greatest obstacle to learning an API in practice is "insufficient or inadequate examples" [30]. Subramanian et al. [31] analyzed 39,000 code snippets given in response to SO questions and found that only 6,766 (17%) were complete files with class and method declarations, 6,302 (16%) code snippets were just method bodies devoid of class declarations, and the remaining 66% contained standalone source code statements (i.e., the majority are not compilable code fragments with complete class and method body declarations). Since the code is usually not complete, information present in the code is often not sufficient to resolve API method accesses. ...
... Since the code is usually not complete, information present in the code is often not sufficient to resolve API method accesses. Moreover, they observed that most answers extend on details provided in the question; because of this, certain aspects of the snippet, like variable declarations, are often skipped [31]. Concerning the Reprod criterion, we also evaluated code snippets that are present in external sources because several sites contain detailed information on how to solve a particular programming task. ...
Article
Stack Overflow (SO) is a question and answer service directed to issues related to software development. In SO, developers post questions related to a programming topic and other members of the site can provide answers to help them. The information available on this type of service is also known as ‘crowd knowledge’ and currently is one important trend in supporting activities related to software development. We present an approach that makes use of ‘crowd knowledge’ in SO to recommend information that can assist developer activities. This strategy recommends a ranked list of question‐answer pairs from SO based on a query. The criteria for ranking are based on three main aspects: the textual similarity of the pairs with respect to the query related to the developer's problem, the quality of the pairs, and a filtering mechanism that considers only ‘how‐to’ posts. We conducted an experiment considering programming problems on three different topics (Swing, Boost and LINQ) widely used by the software development community to evaluate the proposed recommendation strategy. The results have shown that for Lucene + Score + How-to approach, 77.14% of the assessed activities have at least one recommended pair proved to be useful concerning the target programming problem. Copyright © 2016 John Wiley & Sons, Ltd.
... In reality, an ongoing discussion has echoed disputes over the quality of Stack Overflow questions and answers. In light of these studies, it is evident that the software development community recognizes different components of code quality [19]. ...
Article
Full-text available
Community Question and Answer (CQA) platforms use the power of online groups to solve problems or gain information. Since these websites contain valuable information, the data given on these pages must be of high quality so that consumers can trust it. Given the widespread use of software in all parts of contemporary society, this is particularly important for software creation. Stack Overflow is the leading CQA platform for programmers, with a community of over 10 million contributors. Although research supports Stack Overflow's popularity, doubts have been raised about the quality of the answers provided to Stack Overflow questions. Code fragments often found in these answers have been examined; nevertheless, the quality of these artefacts remains uncertain. This may present a challenge for the software development world, as data suggests that snippets from Stack Overflow are widely included in both open source and commercial applications. This work fills the void by assessing the quality of code snippets on Stack Overflow. I have discussed various facets of code snippet quality, including usability and accordance with programming standards, readability, efficiency, and security. Outcomes indicate variability in the quality of Stack Overflow code snippets across the various dimensions; but generally, quality problems in Stack Overflow snippets were not necessarily severe. Vigilance is urged for anyone who duplicates fragments of Stack Overflow code.
... Several other studies have focused on identifying various information about code snippets. Subramanian et al. used code snippet analysis to extract their structural information in order to effectively identify API usage in the snippets [12]. Chatterjee et al. also mined the natural language text associated with code snippets in software-related documents, e.g., API documentation and code reviews [2]. ...
Conference Paper
Full-text available
Code review is a mature practice for software quality assurance in software development with which reviewers check the code that has been committed by developers, and verify the quality of code. During the code review discussions, reviewers and developers might use code snippets to provide necessary information (e.g., suggestions or explanations). However, little is known about the intentions and impacts of code snippets in code reviews. To this end, we conducted a preliminary study to investigate the nature of code snippets and their purposes in code reviews. We manually collected and checked 10,790 review comments from the Nova and Neutron projects of the OpenStack community, and finally obtained 626 review comments that contain code snippets for further analysis. The results show that: (1) code snippets are not prevalently used in code reviews, and most of the code snippets are provided by reviewers. (2) We identified two high-level purposes of code snippets provided by reviewers (i.e., Suggestion and Citation) with six detailed purposes, among which, Improving Code Implementation is the most common purpose. (3) For the code snippets in code reviews with the aim of suggestion, around 68.1% was accepted by developers. The results highlight promising research directions on using code snippets in code reviews.
... Stack Overflow has been used to support software developers by mining API descriptions and examples (Keivanloo et al. 2014), generating source code comments (Vassallo et al. 2014), extracting code snippets (Subramanian and Holmes 2013), helping with bug triaging (Sajedi Badashian et al. 2016), summarizing answers to technical questions (Xu et al. 2017), etc. The popularity and vast amount of communication among developers on Stack Overflow has also led to research into using the information present in these forums to build a software-specific word similarity database similar to WordNet (Tian et al. 2014). ...
Article
Full-text available
Software developers are often using instant messaging platforms to communicate with each other and other stakeholders. Among these platforms, Gitter has emerged as a popular choice and the messages it contains can reveal important information to researchers studying open source software systems. Uncovering what developers are communicating about through Gitter is an essential first step towards successfully understanding and leveraging this information. In this paper, we first describe the largest manually labeled and curated dataset of Gitter developer messages, named GitterCom, obtained by manually analyzing and labeling 10,000 Gitter messages in 10 software projects. We then present a qualitative study to understand the extent to which the categories identified in previous work by Lin et al. (2016) found on Slack through surveys are applicable to developer messages exchanged on Gitter. Further, in an effort to automate the labeling process, we investigate the accuracy of 9 traditional machine learning and deep learning algorithms in predicting the intent of Gitter messages. We found that Decision Trees and Random Forest performed the best, achieving an accuracy of 88%, which is very promising for this multi-class classification task. Finally, we discuss the potential directions for future research enabled by labeled Gitter datasets such as GitterCom.
... Therefore, we can automatically obtain API tags from Q&A pairs without manual annotation effort. Besides, 65% of the accepted answers of SO Q&A pairs contain code snippets (Subramanian and Holmes 2013). That is, we can extract abundant API tags from the code snippets of the accepted answers of Q&A pairs. ...
Article
Full-text available
API tutorials are important learning resources as they explain how to use certain APIs in a given programming context. An API tutorial can be split into a number of units. Consecutive units that describe the same topic are often called a tutorial fragment. We consider the API explained by a tutorial fragment as an API tag. Generating API tags for a tutorial fragment can help understand, navigate, and retrieve the fragment. Existing approaches often do not perform well on API tag generation due to high manual effort and low accuracy. Like API tutorials, Stack Overflow (SO) is also an important learning resource that provides the explanations of APIs. Thus, SO posts also contain API tags. Besides, API tags of SO posts are abundant and can be extracted easily. In this paper, we propose a novel approach ATTACK (API Tag for Tutorial frAgments using Crowd Knowledge), which can automatically generate API tags for tutorial fragments from SO posts. ATTACK first constructs ⟨Q&A pair, tag set⟩ pairs by extracting API tags of SO posts. Then, it trains a deep neural network with the attention mechanism to learn the semantic relatedness between Q&A pairs and the associated API tags, taking into consideration both textual descriptions and code in a Q&A pair. Finally, the trained model is used to generate API tags for tutorial fragments. We evaluate ATTACK on public Java and Android datasets containing 43,132 ⟨Q&A pair, tag set⟩ pairs. Experimental results show that ATTACK is effective and outperforms the state-of-the-art approaches in terms of F-Measure. Our user study further confirms the effectiveness of ATTACK in generating API tags for tutorial fragments. We also apply ATTACK to document linking and the results confirm the usefulness of API tags generated by ATTACK.
... There is growing research interest in the quality of content on Stack Overflow [34]. This is particularly fitting given that this platform is increasingly replacing formal programming language tutorials and documentation [45]. ...
Preprint
Full-text available
Platforms such as Stack Overflow are available for software practitioners to solicit solutions to their challenges and knowledge needs. The practices therein have, however, recently triggered quality-related concerns. This is a noteworthy issue when considering that the Stack Overflow platform is used by numerous software developers. Academic research tends to provide validation for the practices and processes employed by Stack Overflow and other such forums. However, previous work did not review the scale of scientific attention that is given to this cause. Continuing from our preliminary work, we conducted a Systematic Mapping study involving 265 papers from six relevant databases to address this gap. In this work, we explored the level of academic interest Stack Overflow has generated, the publication venues that are targeted, the topics that are studied, approaches used, types of contributions and the quality of the publications that are written about Stack Overflow. Outcomes show that Stack Overflow has attracted increasing research interest over the years, with topics relating to both community dynamics and human factors, and technical issues. In addition, research studies have been largely evaluative or have proposed solutions; however, the latter approach tends to lack validation. The contributions of these studies are often techniques or answers to a specific problem. Evaluating the quality of all studies that were dedicated to software programming (58 papers), our outcomes show that on average only 58% of the developed quality criteria were met. Notwithstanding that research is continually aiming to understand Stack Overflow and other similar communities, further investigations are required to validate such studies and the solutions they propose.
... Besides, the runnability of the code inside the discussion is of no importance to the authors of the posts. As a result, snippets may contain syntax errors and hence may not be parsable (Subramanian and Holmes, 2013; Dagenais and Robillard, 2012). ...
Article
Software development nowadays is heavily based on libraries, frameworks and their proposed Application Programming Interfaces (APIs). However, due to challenges such as complexity and the lack of documentation, these APIs may introduce various obstacles for developers and common defects in software systems. To resolve these issues, developers usually utilize Question and Answer (Q&A) websites such as Stack Overflow by asking their questions and finding proper solutions for their problems with APIs. Therefore, these websites have become inevitable sources of knowledge for developers, which is also known as crowd knowledge. However, the relation of this knowledge to software quality has never been adequately explored before. In this paper, we study whether using APIs which are challenging according to the discussions on Stack Overflow is related to code quality defined in terms of post-release defects. To this purpose, we define the concept of the challenge of an API, which denotes how much the API is discussed in high-quality posts on Stack Overflow. Then, using this concept, we propose a set of product and process metrics. We empirically study the statistical correlation between our metrics and post-release defects, as well as the added explanatory and predictive power to traditional models, through a case study on five open source projects including Spring, Elastic Search, Jenkins, K-9 Mail Android client, and OwnCloud Android client. Our findings reveal that our metrics have a positive correlation with post-release defects which is comparable to known high-performance traditional process metrics, such as code churn and number of pre-release defects. Furthermore, our proposed metrics can provide additional explanatory and predictive power for software quality when added to models based on existing product and process metrics. Our results suggest that software developers should consider allocating more resources to reviewing and improving external API usages to prevent further defects.
... The code wrapping technique has been used in some other work to make source code more comprehensive. Subramanian and Holmes [42] developed an approach that can parse short code snippets to effectively identify API usage. ...
Article
Full-text available
Context – During the development of complex software systems, programmers look for external resources to understand better how to use specific APIs and to get advice related to their current tasks. Stack Overflow provides developers with a broader insight into API usage as well as useful code examples. Given the circumstances, tools and techniques for mining Stack Overflow are highly desirable. Objective – In this paper, we introduce PostFinder, an approach that analyzes the project under development to extract suitable context, and allows developers to retrieve messages from Stack Overflow being relevant to the API function calls that have already been invoked. Method – PostFinder augments posts with additional data to make them more exposed to queries. On the client side, it boosts the context code with various factors to construct a query containing information needed for matching against the stored indexes. Multiple facets of the data available are used to optimize the search process, with the ultimate aim of recommending highly relevant SO posts. Results – The approach has been validated utilizing a user study involving a group of 12 developers to evaluate 500 posts for 50 contexts. Experimental results indicate the suitability of PostFinder to recommend relevant Stack Overflow posts and concurrently show that the tool outperforms a well-established baseline. Conclusions – We conclude that PostFinder can be deployed to assist developers in selecting relevant Stack Overflow posts while they are programming as well as to replace the module for searching posts in a code-to-code search engine.
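
A toy sketch of the underlying retrieval idea, not PostFinder's Lucene-based implementation: the developer's context is represented by the API calls already invoked, and stored posts are ranked by how many of those tokens they share. The posts and tokens below are made-up placeholders.

import java.util.Comparator;
import java.util.Map;
import java.util.Set;

public class PostRankingSketch {
    public static void main(String[] args) {
        // API calls already invoked in the code the developer is writing.
        Set<String> contextCalls = Set.of("getInstance", "setTime", "format");

        // Toy "index": post title -> API tokens extracted from the post.
        Map<String, Set<String>> postIndex = Map.of(
            "How to format a Date in Java?", Set.of("format", "parse", "SimpleDateFormat"),
            "Calendar arithmetic example",   Set.of("getInstance", "add", "setTime"),
            "Read a file line by line",      Set.of("readLine", "close"));

        // Rank posts by the number of shared tokens, highest overlap first.
        postIndex.entrySet().stream()
            .sorted(Comparator.comparingLong(
                (Map.Entry<String, Set<String>> e) ->
                    e.getValue().stream().filter(contextCalls::contains).count())
                .reversed())
            .forEach(e -> System.out.println(e.getKey()));
    }
}
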
... investigated the usage and attribution of SO code snippets in GH projects and found that at most a quarter of the usages are attributed as required by SO's license. Moreover, they point to possible licensing issues, similar to what we described in Section 8. Other studies aimed at identifying API usage in SO code snippets (Subramanian and Holmes, 2013), describing characteristics of effective code examples (Nasehi et al., 2012), investigating whether SO code snippets are self-explanatory (Treude and Robillard, 2017), or analyzing the impact of copied SO code snippets on application security (Acar et al., 2016;Fischer et al., 2017). There has also been work on the interplay between user activity on SO and GH (Vasilescu et al., 2013;Silvestri et al., 2015;Badashian et al., 2014). ...
Preprint
Stack Overflow (SO) is the most popular question-and-answer website for software developers, providing a large amount of code snippets and free-form text on a wide variety of topics. Like other software artifacts, questions and answers on SO evolve over time, for example when bugs in code snippets are fixed, code is updated to work with a more recent library version, or text surrounding a code snippet is edited for clarity. To be able to analyze how content on SO evolves, we built SOTorrent, an open dataset based on the official SO data dump. SOTorrent provides access to the version history of SO content at the level of whole posts and individual text or code blocks. It connects SO posts to other platforms by aggregating URLs from text blocks and comments, and by collecting references from GitHub files to SO posts. In this paper, we describe how we built SOTorrent, and in particular how we evaluated 134 different string similarity metrics regarding their applicability for reconstructing the version history of text and code blocks. Based on different analyses using the dataset, we present: (1) insights into the evolution of SO posts, e.g., that post edits are usually small, happen soon after the initial creation of the post, and that code is rarely changed without also updating the surrounding text; (2) a qualitative study investigating the close relationship between post edits and comments, (3) a first analysis of code clones on SO together with an investigation of possible licensing risks. Finally, since the initial presentation of the dataset, we improved the post block extraction and our predecessor matching strategy.
... The code snippets on Stack Overflow are mostly examples or solutions to programming problems. Hence, several code search systems use whole or partial data from Stack Overflow as their code search databases (Keivanloo et al., 2014; Park et al., 2014; Stolee et al., 2014; Subramanian and Holmes, 2013; Diamantopoulos and Symeonidis, 2015). Furthermore, Treude et al. ...
Preprint
Full-text available
We performed two online surveys of Stack Overflow answerers and visitors to assess their awareness of outdated code and software licenses in Stack Overflow answers. The answerer survey targeted 607 highly reputed Stack Overflow users and received a high response rate of 33%. Our findings are as follows. Although most of the code snippets in the answers are written from scratch, there are code snippets cloned from the corresponding questions, from personal or company projects, or from open source projects. Stack Overflow answerers are aware that some of their snippets are outdated. However, 19% of the participants report that they rarely or never fix their outdated code. At least 98% of the answerers never include software licenses in their snippets, and 69% never check for licensing conflicts with Stack Overflow's CC BY-SA 3.0 if they copy code from other sources to Stack Overflow answers. The visitor survey used convenience sampling and received 89 responses. We found that 66% of the participants experienced a problem from cloning and reusing Stack Overflow snippets. Fifty-six percent of the visitors never reported the problems back to the Stack Overflow post. Eighty-five percent of the participants are not aware that Stack Overflow applies the CC BY-SA 3.0 license, and sixty-two percent never give attribution to the Stack Overflow posts or answers they copied code from. Moreover, 66% of the participants do not check for licensing conflicts between the copied Stack Overflow code and their software. With these findings, we suggest that Stack Overflow raise the awareness of its users, both answerers and visitors, to the problem of outdated and license-violating code snippets.
... Yang et al. [61] analyzed code clones between Python snippets from SO and Python projects on GH and found a considerable number of non-trivial clones, which may have a negative impact on code quality [1]. Other studies aimed at identifying API usage in SO code snippets [51], describing characteristics of effective code examples [40], investigating whether SO code snippets are self-explanatory [54], or analyzing the impact of copied SO code snippets on application security [2,25]. There has also been work on the interplay between user activity on SO and GH [5,46,56]. ...
Article
Stack Overflow (SO) is the most popular question-and-answer website for software developers, providing a large amount of code snippets and free-form text on a wide variety of topics. Like other software artifacts, questions and answers on SO evolve over time, for example when bugs in code snippets are fixed, code is updated to work with a more recent library version, or text surrounding a code snippet is edited for clarity. To be able to analyze how content on SO evolves, we built SOTorrent, an open dataset based on the official SO data dump. SOTorrent provides access to the version history of SO content at the level of whole posts and individual text or code blocks. It connects SO posts to other platforms by aggregating URLs from text blocks and by collecting references from GitHub files to SO posts. In this paper, we describe how we built SOTorrent, and in particular how we evaluated 134 different string similarity metrics regarding their applicability for reconstructing the version history of text and code blocks. Based on a first analysis using the dataset, we present insights into the evolution of SO posts, e.g., that post edits are usually small, happen soon after the initial creation of the post, and that code is rarely changed without also updating the surrounding text. Further, our analysis revealed a close relationship between post edits and comments. Our vision is that researchers will use SOTorrent to investigate and understand the evolution of SO posts and their relation to other platforms such as GitHub.
... Abdalkareem et al. found that reusing code from SO may have a negative impact on code quality (Abdalkareem et al. 2017). Other studies aimed at identifying API usage in SO code snippets (Subramanian and Holmes 2013), describing characteristics of effective code examples (Nasehi et al. 2012), investigating whether SO code snippets are self-explanatory (Treude and Robillard 2017), or analyzing the impact of copied SO code snippets on application security (Acar et al. 2016; Fischer et al. 2017). Recently, Zhang et al. analyzed potential API usage violations in SO posts and found that, of the 217,818 analyzed Java and Android SO posts, 31% may contain potential API usage violations, which could lead to program crashes or resource leaks (Zhang et al. 2018). ...
Article
Full-text available
Stack Overflow (SO) is the largest Q&A website for software developers, providing a huge amount of copyable code snippets. Using those snippets raises maintenance and legal issues. SO's license (CC BY-SA 3.0) requires attribution, i.e., referencing the original question or answer, and requires derived work to adopt a compatible license. While there is a heated debate on SO's license model for code snippets and the required attribution, little is known about the extent to which snippets are copied from SO without proper attribution. We present results of a large-scale empirical study analyzing the usage and attribution of non-trivial Java code snippets from SO answers in public GitHub (GH) projects. We followed three different approaches to triangulate an estimate for the ratio of unattributed usages and conducted two online surveys with software developers to complement our results. For the different sets of projects that we analyzed, the ratio of projects containing files with a reference to SO varied between 3.3% and 11.9%. We found that at most 1.8% of all analyzed repositories containing code from SO used the code in a way compatible with CC BY-SA 3.0. Moreover, we estimate that at most a quarter of the copied code snippets from SO are attributed as required. Of the surveyed developers, almost one half admitted copying code from SO without attribution and about two thirds were not aware of the license of SO code snippets and its implications.
... Zhong and Su [40] detect errors in code samples of API documentation; such samples are partial programs. Subramanian and Holmes [33] extract API calls from code samples in Stack Overflow. Our approach presents a general way to leverage tools built for complete code so that they can analyze partial programs, complementing existing approaches. ...
... Unlike NLP2Code, Example Overflow is not integrated into an IDE, but is its own website. In their work on making sense of online code snippets, Subramanian and Holmes [13] extracted structural information from short plain-text snippets on Stack Overflow and showed that these structural relationships could improve code snippet search. ...
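A deliberately simplified illustration of pulling structural facts out of a plain-text snippet is shown below; the regular expressions are our own approximation, not the actual analysis used by Subramanian and Holmes, and the example snippet is invented.

    # Rough approximation of extracting structural facts from a plain-text snippet:
    # capitalized identifiers are treated as type references, ".name(" patterns as calls.
    import re

    SNIPPET = """
    Calendar cal = Calendar.getInstance();
    cal.setTime(new Date());
    String s = new SimpleDateFormat("yyyyMMdd").format(cal.getTime());
    """

    TYPE_REF = re.compile(r"\b([A-Z][A-Za-z0-9_]*)\b")
    CALL = re.compile(r"\.([a-z][A-Za-z0-9_]*)\s*\(")

    types = sorted(set(TYPE_REF.findall(SNIPPET)))
    calls = sorted(set(CALL.findall(SNIPPET)))
    print("type references:", types)   # Calendar, Date, SimpleDateFormat, String
    print("method calls:", calls)      # format, getInstance, getTime, setTime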
Article
Developers increasingly take to the Internet for code snippets to integrate into their programs. To save developers the time required to switch from their development environments to a web browser in the quest for a suitable code snippet, we introduce NLP2Code, a content assist for code snippets. Unlike related tools, NLP2Code integrates directly into the source code editor and provides developers with a content assist feature to close the vocabulary gap between developers' needs and code snippet metadata. Our preliminary evaluation of NLP2Code shows that the majority of invocations lead to code snippets rated as helpful by users and that the tool is able to support a wide range of tasks.
... Developer Forums or Stack Overflow. Discussions on Stack Overflow have been used in various empirical research studies, such as understanding the behavior of users [2,10,15,18,25], extracting documentation [17], analyzing prominent topics of general discussions [1,4,29], analyzing software code [23], and assigning tags to discussion posts [21]. Mining fine-grained knowledge for particular types of APIs or platforms is becoming popular, such as extracting Java API usage in mobile applications [11] and analyzing the interesting topics of general discussions among mobile developers [20]. ...
Article
Full-text available
The Transformer architecture and transfer learning have marked a quantum leap in natural language processing, improving the state of the art across a range of text-based tasks. This paper examines how these advancements can be applied to and improve code search. To this end, we pre-train a BERT-based model on combinations of natural language and source code data and fine-tune it on pairs of Stack Overflow question titles and code answers. Our results show that the pre-trained models consistently outperform the models that were not pre-trained. In cases where the model was pre-trained on natural language and source code data, it also outperforms an information retrieval baseline based on Lucene. We also demonstrate that the combined use of an information retrieval-based approach followed by a Transformer leads to the best results overall, especially when searching in a large search pool. Transfer learning is particularly effective when much pre-training data is available and fine-tuning data is limited. We show that natural language processing models based on the Transformer architecture can be directly applied to source code analysis tasks, such as code search. With the development of Transformer models designed more specifically for dealing with source code data, we believe the results of source code analysis tasks can be further improved.
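The retrieve-then-rerank pipeline reported as most effective can be outlined roughly as below. This is an assumption-laden sketch: the rank_bm25 package stands in for Lucene, the corpus and query are invented, and score_with_transformer is a placeholder for the fine-tuned BERT-based re-ranker, which is not reproduced here.

    # Sketch of retrieve-then-rerank code search (BM25 candidate pool, then re-ranking).
    import re
    from rank_bm25 import BM25Okapi

    corpus = [
        "public static String readFile(String path) { ... }",
        "Pattern p = Pattern.compile(regex); Matcher m = p.matcher(s);",
        "ObjectMapper mapper = new ObjectMapper(); User u = mapper.readValue(json, User.class);",
    ]

    def tokenize(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    def score_with_transformer(query, candidate):
        # Placeholder: a fine-tuned cross-encoder would score (query, code) pairs here.
        return float(len(set(tokenize(query)) & set(tokenize(candidate))))

    def search(query, corpus, pool_size=2):
        bm25 = BM25Okapi([tokenize(doc) for doc in corpus])
        scores = bm25.get_scores(tokenize(query))
        pool = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)[:pool_size]
        return sorted(pool, key=lambda i: score_with_transformer(query, corpus[i]),
                      reverse=True)

    print(search("read json into object", corpus))   # index of the Jackson snippet comes first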
Article
Past studies have proposed solutions that analyze Stack Overflow content to help users find desired information or aid various downstream software engineering tasks. A common step performed by those solutions is to extract suitable representations of posts; typically, in the form of meaningful vectors. These vectors are then used for different tasks, for example, tag recommendation, relatedness prediction, post classification, and API recommendation. Intuitively, the quality of the vector representations of posts determines the effectiveness of the solutions in performing the respective tasks. In this work, to aid existing studies that analyze Stack Overflow posts, we propose a specialized deep learning architecture Post2Vec which extracts distributed representations of Stack Overflow posts. Post2Vec is aware of different types of content present in Stack Overflow posts, i.e., title, description, and code snippets, and integrates them seamlessly to learn post representations. Tags provided by Stack Overflow users that serve as a common vocabulary that captures the semantics of posts are used to guide Post2Vec in its task. To evaluate the quality of Post2Vec's deep learning architecture, we first investigate its end-to-end effectiveness in tag recommendation task. The results are compared to those of state-of-the-art tag recommendation approaches that also employ deep neural networks. We observe that Post2Vec achieves 15-25 percent improvement in terms of F1-score@5 at a lower computational cost. Moreover, to evaluate the value of representations learned by Post2Vec, we use them for three other tasks, i.e., relatedness prediction, post classification, and API recommendation. We demonstrate that the representations can be used to boost the effectiveness of state-of-the-art solutions for the three tasks by substantial margins (by 10, 7, and 10 percent in terms of F1-score, F1-score, and correctness, respectively). We release our replication package at https://github.com/maxxbw/Post2Vec .
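A minimal stand-in for the multi-component idea (encode title, description, and code separately, combine them, and supervise with tags) might look like the PyTorch sketch below. The layer choices, sizes, and class name are ours and do not reflect the published Post2Vec architecture.

    # Minimal stand-in for a Post2Vec-style model: separate encoders for title,
    # description, and code, concatenated and trained for multi-label tag prediction.
    import torch
    import torch.nn as nn

    class PostEncoder(nn.Module):
        def __init__(self, vocab_size=5000, emb_dim=64, hidden=64, num_tags=100):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            # One small encoder per post component; the real architecture differs.
            self.title_enc = nn.GRU(emb_dim, hidden, batch_first=True)
            self.desc_enc = nn.GRU(emb_dim, hidden, batch_first=True)
            self.code_enc = nn.GRU(emb_dim, hidden, batch_first=True)
            self.tag_head = nn.Linear(3 * hidden, num_tags)

        def forward(self, title_ids, desc_ids, code_ids):
            parts = []
            for enc, ids in [(self.title_enc, title_ids),
                             (self.desc_enc, desc_ids),
                             (self.code_enc, code_ids)]:
                _, h = enc(self.embed(ids))      # final hidden state per component
                parts.append(h[-1])
            post_vec = torch.cat(parts, dim=-1)  # the learned post representation
            return post_vec, torch.sigmoid(self.tag_head(post_vec))

    model = PostEncoder()
    title = torch.randint(0, 5000, (2, 8))   # batch of 2 posts, 8 title tokens each
    desc = torch.randint(0, 5000, (2, 30))
    code = torch.randint(0, 5000, (2, 50))
    vec, tag_probs = model(title, desc, code)
    print(vec.shape, tag_probs.shape)        # torch.Size([2, 192]) torch.Size([2, 100])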
Chapter
The introduction of question–answering services, such as Stack Overflow, has given rise to a new problem-solving paradigm in software development. Using these services, developers can post their programming questions online and get useful solutions from the community. In this chapter we propose a methodology that allows searching for solutions in Stack Overflow, using the main elements of a question post, including its title, tags, body, and source code snippets. We design a similarity scheme for these elements that can be used for finding similar question posts. Text elements are compared using Information Retrieval metrics, while snippet similarity is computed by first converting snippets into sequences using a representation that extracts structural information. The results of the evaluation of our methodology indicate that it can be effective for recommending similar question posts, and thus can be used to search for solutions without fully forming a question.
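The element-wise similarity scheme can be illustrated by combining per-element scores; the weights and the individual measures below are placeholders of our own choosing, not the ones proposed in the chapter, and the example posts are invented.

    # Simplified post similarity: Jaccard over tags, token overlap for title/body,
    # and a character-level ratio for snippets, combined with arbitrary weights.
    from difflib import SequenceMatcher

    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if a | b else 0.0

    def token_overlap(a, b):
        return jaccard(a.lower().split(), b.lower().split())

    def post_similarity(p, q, weights=(0.3, 0.2, 0.2, 0.3)):
        w_title, w_tags, w_body, w_code = weights
        return (w_title * token_overlap(p["title"], q["title"])
                + w_tags * jaccard(p["tags"], q["tags"])
                + w_body * token_overlap(p["body"], q["body"])
                + w_code * SequenceMatcher(None, p["code"], q["code"]).ratio())

    p1 = {"title": "Parse a date string in Java", "tags": ["java", "date"],
          "body": "How can I parse yyyy-MM-dd?", "code": "new SimpleDateFormat(pattern)"}
    p2 = {"title": "Java date parsing", "tags": ["java", "simpledateformat"],
          "body": "Parsing dates from strings", "code": "new SimpleDateFormat(fmt)"}
    print(round(post_similarity(p1, p2), 3))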
Article
Online code clones are code fragments that are copied from software projects or online sources to Stack Overflow as examples. Due to an absence of a checking mechanism after the code has been copied to Stack Overflow, they can become toxic code snippets, e.g., they suffer from being outdated or violating the original software license. We present a study of online code clones on Stack Overflow and their toxicity by incorporating two developer surveys and a large-scale code clone detection. A survey of 201 high-reputation Stack Overflow answerers (33% response rate) showed that 131 participants (65%) have been notified of outdated code at least once, and 26 of them (20%) rarely or never fix the code. 138 answerers (69%) never check for licensing conflicts between their copied code snippets and Stack Overflow's CC BY-SA 3.0. A survey of 87 Stack Overflow visitors shows that they experienced several issues with Stack Overflow answers: mismatched solutions, outdated solutions, incorrect solutions, and buggy code. 85% of them are not aware of the CC BY-SA 3.0 license enforced by Stack Overflow, and 66% never check for license conflicts when reusing code snippets. Our clone detection found online clone pairs between 72,365 Java code snippets on Stack Overflow and 111 open source projects in the curated Qualitas corpus. We analysed 2,289 non-trivial online clone candidates. Our investigation revealed strong evidence that 153 clones had been copied from a Qualitas project to Stack Overflow. We found 100 of them (66%) to be outdated, of which 10 were buggy and harmful for reuse. Furthermore, we found 214 code snippets that could potentially violate the license of their original software and that appear 7,112 times in 2,427 GitHub projects.
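For illustration only, the clone detection step can be approximated by comparing normalized token windows between a Stack Overflow snippet and project code; the study itself relied on an established clone detector rather than the toy scoring and invented snippets below.

    # Toy token-based clone detection: abstract identifiers and literals, hash
    # overlapping token windows, and score pairs by how many windows they share.
    import re

    def normalize(code):
        tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*|\S", code)
        out = []
        for t in tokens:
            if re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", t):
                out.append("ID")          # abstract away identifiers and keywords
            elif t.isdigit():
                out.append("NUM")
            else:
                out.append(t)
        return out

    def windows(tokens, n=8):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def clone_score(a, b, n=8):
        wa, wb = windows(normalize(a), n), windows(normalize(b), n)
        return len(wa & wb) / max(1, min(len(wa), len(wb)))

    so_snippet = "for (int i = 0; i < list.size(); i++) { sum += list.get(i); }"
    project_code = "for (int k = 0; k < items.size(); k++) { total += items.get(k); }"
    print(round(clone_score(so_snippet, project_code), 2))   # 1.0: identical after normalization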
Article
Researchers have shown that related functions can be mined from groupings of functions found in the version history of a system. Our first contribution is to expand this approach to a community of applications and set of similar applications. Android developers use a set of application programming interface (API) calls when creating apps. These API calls are used in similar ways across multiple applications. By clustering co-changing API calls used by 230 Android apps across 12k versions, we are able to predict the API calls that individual app developers will use with an average precision of 75% and recall of 22%. When we make predictions from the same category of app, such as Finance, we attain precision and recall of 81% and 28%, respectively. Our second contribution can be characterized as “programmers who discussed these functions were also interested in these functions.” Informal discussions on Stack Overflow provide a rich source of information about related API calls as developers provide solutions to common problems. By grouping API calls contained in each positively voted answer posts, we are able to create rules that predict the calls that app developers will use in their own apps with an average precision of 66% and recall of 13%. For comparison purposes, we developed a baseline by clustering co-changing API calls for each individual app and generated association rules from them. The baseline predicts API calls used by app developers with a precision and recall of 36% and 23%, respectively.
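The "answers as baskets" intuition behind the second contribution can be sketched by mining simple association rules over the sets of API calls found in each positively voted answer; the API names and the support and confidence thresholds below are invented for illustration.

    # Toy association-rule mining over API calls per Stack Overflow answer:
    # a rule A -> B holds if A and B co-occur often enough among the "baskets".
    from itertools import permutations

    answers = [
        {"Camera.open", "Camera.setPreviewDisplay", "Camera.startPreview"},
        {"Camera.open", "Camera.startPreview", "Camera.release"},
        {"Cursor.moveToFirst", "Cursor.getString", "Cursor.close"},
        {"Camera.open", "Camera.setPreviewDisplay", "Camera.startPreview", "Camera.release"},
    ]

    def mine_rules(baskets, min_support=0.5, min_confidence=0.7):
        n = len(baskets)
        items = {i for b in baskets for i in b}
        rules = []
        for a, b in permutations(items, 2):
            support_a = sum(1 for s in baskets if a in s) / n
            support_ab = sum(1 for s in baskets if a in s and b in s) / n
            if support_a and support_ab >= min_support:
                confidence = support_ab / support_a
                if confidence >= min_confidence:
                    rules.append((a, b, round(support_ab, 2), round(confidence, 2)))
        return rules

    for a, b, sup, conf in mine_rules(answers):
        print(f"{a} -> {b}  support={sup} confidence={conf}")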
Conference Paper
JavaScript frameworks, such as jQuery, are widely used for developing web applications. To facilitate using these JavaScript frameworks to implement a feature (e.g., functionality), a large number of programmers often search for code snippets that implement the same or similar feature. However, existing code search approaches tend to be ineffective, without taking into account the fact that JavaScript code snippets often implement a feature based on various relationships (e.g., sequencing, condition, and callback relationships) among the invoked framework API methods. To address this issue, we present a novel Relationship-Aware Code Search (RACS) approach for finding code snippets that use JavaScript frameworks to implement a specific feature. In advance, RACS collects a large number of code snippets that use some JavaScript frameworks, mines API usage patterns from the collected code snippets, and represents the mined patterns with method call relationship (MCR) graphs, which capture framework API methods’ signatures and their relationships. Given a natural language (NL) search query issued by a programmer, RACS conducts NL processing to automatically extract an action relationship (AR) graph, which consists of actions and their relationships inferred from the query. In this way, RACS reduces code search to the problem of graph search: finding similar MCR graphs for a given AR graph. We conduct evaluations against representative real-world jQuery questions posted on Stack Overflow, based on 308,294 code snippets collected from over 81,540 files on the Internet. The evaluation results show the effectiveness of RACS: the top 1 snippet produced by RACS matches the target code snippet for 46% questions, compared to only 4% achieved by a relationship-oblivious approach.
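The reduction of code search to graph search can be illustrated with a crude node-and-edge overlap between a query's action-relationship graph and each snippet's method-call-relationship graph; real AR and MCR graphs are considerably richer than the made-up dictionaries used here.

    # Crude stand-in for matching an action-relationship (AR) graph against
    # method-call-relationship (MCR) graphs: compare node and edge overlap.
    def graph_similarity(query_graph, snippet_graph):
        qn, sn = set(query_graph["nodes"]), set(snippet_graph["nodes"])
        qe, se = set(query_graph["edges"]), set(snippet_graph["edges"])
        node_sim = len(qn & sn) / len(qn) if qn else 0.0
        edge_sim = len(qe & se) / len(qe) if qe else 0.0
        return 0.5 * node_sim + 0.5 * edge_sim

    # Query: "when the button is clicked, send an ajax request" (made-up graphs).
    ar_graph = {"nodes": ["click", "ajax"], "edges": [("click", "ajax", "callback")]}
    mcr_graphs = {
        "snippet_1": {"nodes": ["click", "ajax", "done"],
                      "edges": [("click", "ajax", "callback"), ("ajax", "done", "callback")]},
        "snippet_2": {"nodes": ["hide", "show"], "edges": [("hide", "show", "sequence")]},
    }
    ranked = sorted(mcr_graphs, key=lambda k: graph_similarity(ar_graph, mcr_graphs[k]),
                    reverse=True)
    print(ranked)   # snippet_1 ranks first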
Conference Paper
Full-text available
Software engineering tools often deal with the source code of programs retrieved from the web or source code repositories. Typically, these tools only have access to a subset of a program's source code (one file or a subset of files), which makes it difficult to build a complete and typed intermediate representation (IR). Indeed, for incomplete object-oriented programs, it is not always possible to completely disambiguate the syntactic constructs and to recover the declared type of certain expressions because the declarations of many types and class members are not accessible. We present a framework that performs partial type inference and uses heuristics to recover the declared type of expressions and resolve ambiguities in partial Java programs. Our framework produces a complete and typed IR suitable for further static analysis. We have implemented this framework and used it in an empirical study on four large open source systems, which shows that our system recovers most declared types with a low error rate, even when only one class is accessible.
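A toy version of such heuristics is sketched below: declared types are recovered from local declarations, and simple names are resolved against a small index of known API types. Both the regular expression and the index are drastic simplifications of what the framework actually does, and the names used are ours.

    # Heuristic type recovery for a partial snippet: use local declarations first,
    # then fall back to a simple-name index of known API types (both simplified).
    import re

    KNOWN_SIMPLE_NAMES = {
        "Calendar": "java.util.Calendar",
        "Date": "java.util.Date",
        "BufferedReader": "java.io.BufferedReader",
    }

    DECLARATION = re.compile(r"\b([A-Z][A-Za-z0-9_]*)\s+([a-z][A-Za-z0-9_]*)\s*[=;]")

    def recover_types(snippet):
        env = {}
        for type_name, var_name in DECLARATION.findall(snippet):
            env[var_name] = KNOWN_SIMPLE_NAMES.get(type_name, type_name)
        return env

    snippet = """
    Calendar cal = Calendar.getInstance();
    Date now = new Date();
    Foo helper = makeHelper();
    """
    print(recover_types(snippet))
    # {'cal': 'java.util.Calendar', 'now': 'java.util.Date', 'helper': 'Foo'}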
Conference Paper
Full-text available
Programmers commonly reuse existing frameworks or libraries to reduce software development efforts. One common problem in reusing the existing frameworks or libraries is that the programmers know what type of object they need, but do not know how to get that object with a specific method sequence. To help programmers address this issue, we have developed an approach that takes queries of the form "Source object type → Destination object type" as input, and suggests relevant method-invocation sequences that can serve as solutions that yield the destination object from the source object given in the query. Our approach interacts with a code search engine (CSE) to gather relevant code samples and performs static analysis over the gathered samples to extract required sequences. As code samples are collected on demand through the CSE, our approach is not limited to queries about any specific set of frameworks or libraries. We have implemented our approach in a tool called PARSEWeb, and conducted four different evaluations to show that our approach is effective in addressing programmers' queries. We also show that PARSEWeb performs better than existing related tools: Prospector and Strathcona.
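In a toy setting, the "source object type → destination object type" query form can be answered by a breadth-first search over a table of method signatures that records which type each call produces; the tiny signature table below is invented for illustration and far smaller than what PARSEWeb gathers from a code search engine.

    # Toy version of answering "source type -> destination type" queries:
    # BFS over a table of (receiver type, call, result type) signatures.
    from collections import deque

    SIGNATURES = [
        ("String", "new FileInputStream(String)", "FileInputStream"),
        ("FileInputStream", "new InputStreamReader(this)", "InputStreamReader"),
        ("InputStreamReader", "new BufferedReader(this)", "BufferedReader"),
        ("BufferedReader", "readLine()", "String"),
    ]

    def suggest_sequence(source, destination):
        queue = deque([(source, [])])
        seen = {source}
        while queue:
            current, path = queue.popleft()
            if current == destination:
                return path
            for receiver, call, result in SIGNATURES:
                if receiver == current and result not in seen:
                    seen.add(result)
                    queue.append((result, path + [call]))
        return None

    print(suggest_sequence("String", "BufferedReader"))
    # ['new FileInputStream(String)', 'new InputStreamReader(this)', 'new BufferedReader(this)']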
Article
Full-text available
This paper describes the results of an interview study conducted at ten industrial sites. The interview focused on the work practices of software engineers engaged in maintaining large-scale systems. Five 'truths' emerged from this study. First, software maintenance engineers are experts in the systems they are maintaining. Second, source code is the primary source of information about systems. Third, the documentation is used, but not necessarily trusted. Fourth, maintenance control systems are important repositories of information about systems. Finally, reproduction of problems and/or problem scenarios is essential to problem solutions. These truths confirm much of the conventional wisdom in the field. However, in fleshing them out, details were elaborated and, additionally, new knowledge was acquired. These results are discussed with respect to tool design.
Technical Report
Traditionally, many types of software documentation, such as API documentation, require a process where a few people write for many potential users. The resulting documentation, when it exists, is often of poor quality and lacks sufficient examples and explanations. In this paper, we report on an empirical study to investigate how Question and Answer (Q&A) websites, such as Stack Overflow, facilitate crowd documentation: knowledge that is written by many and read by many. We examine the crowd documentation for three popular APIs: Android, GWT, and the Java programming language. We collect usage data using Google Code Search, and analyze the coverage, quality, and dynamics of the Stack Overflow documentation for these APIs. We find that the crowd is capable of generating a rich source of content with code examples and discussion that is actively viewed and used by many more developers. For example, over 35,000 developers contributed questions and answers about the Android API, covering 87% of the classes. This content has been viewed over 70 million times to date. However, there are shortcomings with crowd documentation, which we identify. In addition to our empirical study, we present future directions and tools that can be leveraged by other researchers and software designers for performing API analytics and mining of crowd documentation.
Conference Paper
Reuse of existing code from class libraries and frameworks is often difficult because APIs are complex and the client code required to use the APIs can be hard to write. We observed that a common scenario is that the programmer knows what type of object he needs, but does not know how to write the code to get the object. In order to help programmers write API client code more easily, we developed techniques for synthesizing jungloid code fragments automatically given a simple query that describes the desired code in terms of input and output types. A jungloid is simply a unary expression; jungloids are simple, enabling synthesis, but are also versatile, covering many coding problems, and composable, combining to form more complex code fragments. We synthesize jungloids using both API method signatures and jungloids mined from a corpus of sample client programs. We implemented a tool, PROSPECTOR, based on these techniques. PROSPECTOR is integrated with the Eclipse IDE code assistance feature, and it infers queries from context so there is no need for the programmer to write queries. We tested PROSPECTOR on a set of real programming problems involving APIs; PROSPECTOR found the desired solution for 18 of 20 problems. We also evaluated PROSPECTOR in a user study, finding that programmers solved programming problems more quickly and with more reuse when using PROSPECTOR than without PROSPECTOR.
Conference Paper
Software development blogs, developer forums and Q&A websites are changing the way software is documented. With these tools, developers can create and communicate knowledge and experiences without relying on a central authority to provide official documentation. Instead, any content created by a developer is just a web search away. To understand whether documentation via social media can replace or augment more traditional forms of documentation, we study the extent to which the methods of one particular API - jQuery - are documented on the Web. We analyze 1,730 search results and show that software development blogs in particular cover 87.9% of the API methods, mainly featuring tutorials and personal experiences about using the methods. Further, this effort is shared by a large group of developers contributing just a few blog posts. Our findings indicate that social media is more than a niche in software documentation, that it can provide high levels of coverage and that it gives readers a chance to engage with authors.
Article
"Imagine hypothetically, just for a moment, that programmers are humans," writes Steven Pemberton in a July 1997 magazine devoted to human-computer interaction design and development. "Now suppose for a moment, also for the sake of the argument, that their chief method of communicating and interacting with computers was with programming languages. What would we, as HCI people, then do? Run screaming in the other direction...." It is a good question and, unfortunately, an all too common response. It's hard enough for us to ensure that product interfaces, like those for Excel or Word, are easy to use and learn. But programmers are users, too. They need application and system libraries that are just as easy to learn and use as the products they build from these libraries. Listen to this customer: "I think it would be worthwhile if all developers would spend maybe a couple of hours a year seeing how the[ir] product is used by... customers. Just watching them. And while they're watching... the customer would say, 'I don't like the way this works....' You need to see how they use it." Now ask yourself: why is it easier to visualize the customer who's purchased a financial accounting package from a neighborhood computer outlet, rather than a programmer whose company has just purchased a new Java class library? Wouldn't the developer of this library find it worthwhile to watch programmers work with it?
Mining challenge 2013: Stack overflow
  • A. Bacchelli
A. Bacchelli, "Mining challenge 2013: Stack Overflow," in The Working Conference on Mining Software Repositories, 2013, to appear. [Online]. Available: http://2013.msrconf.org/challenge.php