Conference Paper

Reusing Code from StackOverflow: The Effect on Technical Debt

Abstract

Software reuse is a well-established software engineering process that aims at improving development productivity. Although reuse can be performed in a very systematic way (e.g., through product lines), in practice, reuse is performed in many cases opportunistically, i.e., copying small code chunks either from the web or in-house developed projects. Knowledge sharing communities and especially StackOverflow constitute the primary source of code-related information for amateur and professional software developers. Despite the obvious benefit of increased productivity, reuse can have a mixed effect on the quality of the resulting code depending on the properties of the reused solutions. An efficient concept for capturing a wide range of internal software qualities is the metaphor of Technical Debt, which expresses the impact of shortcuts in software development on its maintenance costs. In this paper, we present the results from an empirical study on the effect of code retrieved from StackOverflow on the technical debt of the target system. In particular, we study several open-source projects and identify non-trivial pieces of code that exhibit a perfect or near-perfect match with code provided in the context of answers in StackOverflow. Then, we compare the technical debt density of the reused fragments (obtained as the ratio of inefficiencies identified by SonarQube over the lines of reused code) to the technical debt density of the target codebase. The results provide insight into the potential impact of code reuse on technical debt and highlight the benefits of assessing code quality before committing changes to a repository.
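The density comparison described in the abstract can be illustrated with a minimal sketch. It assumes only what the abstract states: density is the number of identified inefficiencies divided by the lines of code in the region. The names `CodeRegion`, `td_density` and `compare_density` are hypothetical, and the numbers are made up for illustration; they are not data from the study, and the study's actual tooling is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodeRegion:
    """A measured region of code (hypothetical helper type)."""
    issues: int         # inefficiencies reported by a static analyzer
    lines_of_code: int  # size of the region in lines


def td_density(region: CodeRegion) -> float:
    """Return issues per line of code for the given region."""
    if region.lines_of_code <= 0:
        raise ValueError("lines_of_code must be positive")
    return region.issues / region.lines_of_code


def compare_density(reused: CodeRegion, target: CodeRegion) -> str:
    """Say whether the reused fragment is denser in issues than the target."""
    reused_density = td_density(reused)
    target_density = td_density(target)
    if reused_density < target_density:
        return "lower"
    if reused_density > target_density:
        return "higher"
    return "equal"


# Illustrative values only (assumed, not taken from the paper).
reused_fragment = CodeRegion(issues=3, lines_of_code=120)
target_codebase = CodeRegion(issues=450, lines_of_code=15000)
verdict = compare_density(reused_fragment, target_codebase)
```

Normalizing by lines of code is what makes a small reused fragment comparable to a much larger codebase; raw issue counts alone would almost always favor the fragment.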

... The practitioners who participated in this study also held this same view of Reusability regarding code snippets. Similarly, some researchers have studied the Reusability of code snippets along this same line (e.g., Digkas et al. (2019) studied the effect of snippet code reuse on the quality of open-source projects on Github, and Yang et al. (2017) investigated how programmers used code snippets in their real projects). However, other researchers have also viewed Reusability as compilable code snippets. ...
... Abdalkareem et al., 2017; Ahmad & Cinnéide, 2019; Digkas et al., 2019). Geremia et al. (2019) analysed some code snippet characteristics such as lines of code, identifiers length, and the number of loops. They found that they were relevant in predicting code snippets of high quality worthy of reusing. ...
... In contrast, a few studies believe that reusing code snippets rather improves the quality of software projects. For instance, Digkas et al. (2019) ...
Article
Context: Over the years, there has been debate about what constitutes software quality and how it should be measured. This controversy has caused uncertainty across the software engineering community, affecting levels of commitment to the many potential determinants of quality among developers. An up-to-date catalogue of software quality views could provide developers with contemporary guidelines and templates. In fact, it is necessary to learn about views on the quality of code on frequently used online collaboration platforms (e.g., Stack Overflow), given that the quality of code snippets can affect the quality of software products developed. If quality models are unsuitable for aiding developers because they lack relevance, developers will hold relaxed or inappropriate views of software quality, thereby lacking awareness and commitment to such practices. Objective: We aim to explore differences in interest in quality characteristics across research and practice. We also seek to identify quality characteristics practitioners consider important when judging code snippet quality. First, we examine the literature for quality characteristics used frequently for judging software quality, followed by the quality characteristics commonly used by researchers to study code snippet quality. Finally, we investigate quality characteristics used by practitioners to judge the quality of code snippets. Methods: We conducted two systematic literature reviews followed by semi-structured interviews of 50 practitioners to address this gap. Results: The outcomes of the semi-structured interviews revealed that most practitioners judged the quality of code snippets using five quality dimensions: Functionality, Readability, Efficiency, Security and Reliability. However, other dimensions were also considered (i.e., Reusability, Maintainability, Usability, Compatibility and Completeness). This outcome differed from how the researchers judged code snippet quality.
Conclusion: Practitioners today mainly rely on code snippets from online code resources, and specific models or quality characteristics are emphasised based on their need to address distinct concerns (e.g., mobile vs web vs standalone applications, regular vs machine learning applications, or open vs closed source applications). Consequently, software quality models should be adapted for the domain of consideration and not seen as one-size-fits-all. This study will lead to targeted support for various clusters of the software development community.
... Reuse is expected to bring important benefits to software development, especially with respect to time to market and the quality of software [3]. In the literature, there are two mainstream processes to reuse: systematic reuse, e.g., through product lines, model-driven engineering, etc. [4]; and opportunistic reuse, e.g., by searching development forums like StackOverflow for pieces of code [5], or OSS repositories for classes, libraries, or products [6]. As a first step towards reuse, the practitioner needs to perform reusable asset identification [7]. ...
... The results are summarized in Fig. 7. Based on the findings (continuous line), we can observe that the accuracy of the model is increasing. On the one hand, for units of analysis when the annotators declared an averagely high confidence (i.e., ' (4)(5)'), the accuracy is 100% (15 services classified correctly). On the other hand, for the four services that the experts declared an averagely low confidence (i.e., '[1.0-2.5]'), the model showcased an accuracy of 50%. ...
Conference Paper
Developing software based on services is one of the most rapidly emerging programming paradigms in software development. Service-based software development relies on the composition of services (i.e., pieces of code already built and deployed in the cloud) through orchestrated API calls. Black-box reuse can play a prominent role when using this programming paradigm, in the sense that identifying and reusing already existing/deployed services can save substantial development effort. According to the literature, identifying reusable assets (i.e., components, classes, or services) is more successful and efficient when the discovery process is domain-specific. To facilitate domain-specific service discovery, we propose a service classification approach that can categorize services to an application domain, given only the service description. To validate the accuracy of our classification approach, we have trained a machine-learning model on thousands of open-source services and tested it on 67 services developed within two companies employing service-based software development. The study results suggest that the classification algorithm can perform adequately in a test set that does not overlap with the training set, and is thus (with some confidence) transferable to other industrial cases. Additionally, we expand the body of knowledge on software categorization by highlighting sets of domains that constitute 'grey-zones' in service classification.
... A small number of the code snippets associated with violations were found to be used in GitHub projects. Nikolaidis, et al. [49] measured quality in terms of technical debt (the effort required to fix code inefficiencies), determining that the Java code snippets were actually associated with an overall lower technical debt density than the project code. However, there were a number of cases when the code snippets were associated with a much larger technical debt density than the project code. ...
... That said, previous works have attempted to consider the measures that may influence or reduce Stack Overflow answer quality [25,64]. Measures considered include: the degree to which Stack Overflow code compiles [69], code cohesion over time [3], the currency of snippet solutions [53], the effort required to fix code inefficiencies [49], if code snippets are self-explanatory [65], if code snippets were reused directly [68], and the proportion of text and code in answers [60]. There has been less interest in investigating Stack Overflow snippet quality holistically. ...
Article
Community Question and Answer (CQA) platforms use the power of online groups to solve problems, or gain information. While these websites host useful information, it is critical that the details provided on these platforms are of high quality, and that users can trust the information. This is particularly necessary for software development, given the ubiquitous use of software across all sections of contemporary society. Stack Overflow is the leading CQA platform for programmers, with a community comprising over 10 million contributors. While research confirms the popularity of Stack Overflow, concerns have been raised about the quality of answers that are provided to questions on Stack Overflow. Code snippets often contained in these answers have been investigated; however, the quality of these artefacts remains unclear. This could be problematic for the software engineering community, as evidence has shown that Stack Overflow snippets are frequently used in both open source and commercial software. This research fills this gap by evaluating the quality of code snippets on Stack Overflow. We explored various aspects of code snippet quality, including reliability and conformance to programming rules, readability, performance and security. Outcomes show variation in the quality of Stack Overflow code snippets for the different dimensions; however, overall, quality issues in Stack Overflow snippets were not always severe. Vigilance is encouraged for those reusing Stack Overflow code snippets.
... In a study by Digkas et al. [25], the authors used technical debt to measure the quality of code snippets and found that there were cases where code snippets had a much larger total technical debt density than other project code. Hence, this indicates that the effort required to fix project code inefficiencies will be more when code snippets are reused. ...
Article
Software developers make use of crowdsourcing during development. Beyond learning from others, developers use online portals such as Stack Overflow as a vehicle for collaboration. However, little is known about developers' experiences on such platforms, particularly around problems that are encountered online. Such insights could benefit software developers in terms of recommendations for pitfalls to avoid, ways to exploit crowdsourced knowledge, and the provision of insights to improve online code sharing communities. We interviewed 50 practitioners to fill this gap, where outcomes show that software developers' use of online portals is targeted, and such portals are a lifeline to modern software development. Practitioners are facilitated with code solutions and debugging, often in a very timely fashion. While these experiences are largely positive, practitioners also encounter negative experiences online, some of which could be significantly deleterious to the community. We discuss the implications of these findings, such as creating awareness of the quality and reliability of code snippets, improving code searches, code validation and outdated code detection and attribution of code snippets.
... Digkas et al. [22] study the relation between reusing code from SO and TD. While both our study and the one of Digkas et al. focus on TD, and consider SO as a data source, the two studies differ drastically. ...
Conference Paper
Background: Q&A sites make it possible to study how users reference and request support on technical debt. To date, only a few studies, focusing on narrow aspects, investigate technical debt on Stack Overflow. Aims: We aim at gaining an in-depth understanding of the characteristics of technical debt questions on Stack Overflow. In addition, we assess if identification strategies based on machine learning can be used to automatically identify and classify technical debt questions. Method: We use automated and manual processes to identify technical debt questions on Stack Overflow. The final set of 415 questions is analyzed to study (i) technical debt types, (ii) question length, (iii) perceived urgency, (iv) sentiment, and (v) themes. Natural language processing and machine learning techniques are used to assess if questions can be identified and classified automatically. Results: Architecture debt is the most recurring debt type, followed by code and design debt. Most questions display mild urgency, with frequency of higher urgency steadily declining as urgency rises. Question length varies across debt types. Sentiment is mostly neutral. 29 recurrent themes emerge. Machine learning can be used to identify technical debt questions and binary urgency, but not debt types. Conclusions: Different patterns emerge from the analysis of technical debt questions on Stack Overflow. The results provide further insights on the phenomenon, and support the adoption of a more comprehensive strategy to identify technical debt questions.
... SonarQube was combined with Arcan to study Architectural Debt and opportunistic code reuse in Java projects, determining that 'cyclic dependency' is a typical smell [32]. SonarQube was also applied to StackOverflow's Java code snippets, determining that reused code tends to exhibit "a substantially lower TD density" [33]. A large-scale analysis based on GitHub's annual report analyzed SonarQube's coding violations and mapped them to the most common code smells by Fowler [34]; this enabled estimating developers' profiles according to coding maturity and TD tolerance, among other points. ...
Article
Automated Static Analysis Tools (ASATs) analyze source-code to capture defects and ensure higher quality. SonarQube is a renowned ASAT that supports mainstream programming languages. However, R programming is not included. R is an increasingly popular multi-paradigm and package-based programming environment for scientific programming. Nevertheless, R’s Object-Oriented (OO) functionalities are implemented through three different systems: S3, S4, and R6, and seldom used by developers. We present analyzeR, an advanced SonarQube plugin to examine R packages built in any of the current OO models. It implements widely-used, commonly-accepted OO metrics and displays the results using SonarQube’s graphical interface for increased usability, implementing an array of metrics.
... Furthermore, it was stated that the approach might be inadequate in recommending comments for the code segment from proprietary or legacy projects [4]. The value of comments revolves around many important aspects such as improving the source code, analyzing the code further, and facilitating code reuse [4] [17] [18] [19]. During the exploration of the means by which comments have an effect on answer updates, only comments and answer updates which involve code segments were taken into consideration. ...
Article
Stack Overflow is a public platform for developers to share their knowledge on programming with an engaged community. Crowdsourced programming knowledge is not only generated through questions and answers but also through comments which are commonly known as developer discussions. Despite the availability of standard commenting guidelines on Stack Overflow, some users tend to post comments not adhering to those guidelines. This practice affects the quality of the developer discussion, thus adversely affecting the knowledge-sharing process. Literature reveals that analyzing the comments could facilitate the process of learning and knowledge sharing. Therefore, this study intends to extract and classify useful comments into three categories: request clarification, constructive criticism, and relevant information. In this study, the classification of useful comments was performed using the Support Vector Machine (SVM) algorithm with five different kernels. Feature engineering was conducted to identify the possibility of concatenating ten external features with textual features. During the feature evaluation, it was identified that only TF-IDF and N-grams scores help classify useful comments. The evaluation results confirm Radial Basis Function (RBF) kernel of the SVM classification algorithm performs best in classifying useful comments in Stack Overflow regardless of the usage of the optimal combinations of hyperparameters.
... Digkas et al. [11] performed an investigation about the effect of software reuse on technical debt. However, their focus was on reusing code chunks copy-pasted from the StackOverflow website. ...
... Focusing on the Energy toolbox, it analyses projects available in an online repository (e.g., GitHub) on the machine running the Docker container with regard to its energy efficiency. This means it finds the energy hotspots, estimates the energy consumption through static or dynamic analysis [93,94], and inspects possible solutions by suggesting specific code refactoring. This is a valuable approach, in particular, for software reusing [95]. The project ended at the end of 2020. ...
Article
Energy consumption is one of the major issues in today’s computer science, and an increasing number of scientific communities are interested in evaluating the tradeoff between time-to-solution and energy-to-solution. Although computing in the last two decades revolved around centralized computing infrastructures, such as supercomputing and data centers, the wide adoption of the Internet of Things (IoT) paradigm is currently inverting this trend due to the huge amount of data it generates, pushing computing power back to the places where the data are generated (the so-called fog/edge computing). This shift towards a decentralized model requires an equivalent change in the software engineering paradigms, development environments, hardware tools, languages, and computation models for scientific programming because the local computational capabilities are typically limited and require a careful evaluation of power consumption. This paper aims to present how these concepts can be actually implemented in scientific software by presenting the state of the art of powerful, less power-hungry processors on one side and energy-aware tools and techniques on the other.
... non-manifold meshes) and consultations (e.g. modeling tutorials) could direct clients to online resources where those common questions and concerns are addressed, similar to how StackOverflow functions for programmers [14,54]. For more in-depth consultations, where a single text-response is likely not sufficient (e.g. a design task), collaborators should communicate with media more-similar to proximal interactions to better afford tightly-coupled collaborations. ...
Conference Paper
Broader participation in 3D printing may be facilitated through printing services that insulate clients from the costs and detailed technical knowledge necessary to operate and maintain printers. However, newcomers to 3D printing encounter barriers and challenges even before gaining access to printing facilities. This paper explores the challenges and barriers newcomers encounter when identifying printing opportunities and when learning how to specify 3D printing ideas through observations of stakeholders (n=20) in two university 3D printing shops, and through a focused lab study investigating how to introduce newcomers individually to 3D printing (n=21). We adopt Olson and Olson's framework for remote collaborations, proposed in "Distance Matters", to analyze the sociotechnical requirements for initiating collaborations with 3D printing services. We found that newcomers often require prior guidance towards 3D printing procedures and websites before establishing what to print in collaboration with 3D printing services. Finally, we discuss how future printing processes and computational systems may empower a future where Anyone Can Print.
... A small number of the code snippets associated with violations were found to be used in GitHub projects. Nikolaidis, et al. [64] measured quality in terms of technical debt (the effort required to fix code inefficiencies), determining that the Java code snippets were actually associated with an overall lower technical debt density than the project code. However, there were a number of cases when the code snippets were associated with a much larger technical debt density than the project code. ...
Preprint
Platforms such as Stack Overflow are available for software practitioners to solicit solutions to their challenges and knowledge needs. The practices therein have in recent times however triggered quality related concerns. This is a noteworthy issue when considering that the Stack Overflow platform is used by numerous software developers. Academic research tends to provide validation for the practices and processes employed by Stack Overflow and other such forums. However, previous work did not review the scale of scientific attention that is given to this cause. Continuing from our preliminary work, we conducted a Systematic Mapping study involving 265 papers from six relevant databases to address this gap. In this work, we explored the level of academic interest Stack Overflow has generated, the publication venues that are targeted, the topics that are studied, approaches used, types of contributions and the quality of the publications that are written about Stack Overflow. Outcomes show that Stack Overflow has attracted increasing research interest over the years, with topics relating to both community dynamics and human factors, and technical issues. In addition, research studies have been largely evaluative or proposed solutions; however, the latter approach tends to lack validation. The contributions of these studies are often techniques or answers to a specific problem. Evaluating the quality of all studies that were dedicated to software programming (58 papers), our outcomes show that on average only 58% of the developed quality criteria were met. Notwithstanding that research is continually aiming to understand Stack Overflow and other similar communities, further investigations are required to validate such studies and the solutions they propose.
Article
The Conference on Energy Consumption, Quality of Service, Reliability, Security, and Maintainability of Computer Systems and Networks (EQSEM) was held as a virtual conference on October 20–21, 2020. This paper summarises the objectives and proceedings of this conference. It then briefly presents the keynotes and other papers which were presented. Then, in the context of the EU-funded research project SDK4ED which motivated this conference, we outline several solutions that are being developed for managing the potential inter-dependencies and corresponding trade-offs between design-time and runtime qualities in software applications, and review the key functionalities that have been implemented within the SDK4ED integrated platform as of this writing. Then, we briefly introduce four papers among those presented at the EQSEM Conference that are included in this issue of the journal SN Computer Science, presenting relevant research achievements of the SDK4ED Project on software maintainability, reliability, and energy efficiency.
Article
In recent years, the TD community has been striving to offer methods and tools for reducing the amount of TD, and also to understand the underlying concepts. One popular practice that has not yet been investigated in the context of TD is software reuse. The aim of this paper is to investigate the relation between white-box code reuse and TD principal and interest. In particular, we aim to unveil whether the reuse of code can lead to software with better levels of TD. To achieve this goal, we performed a case study on approximately 400 OSS systems, comprising 897 thousand classes, and compared the levels of TD for reused and natively written classes. The results of the study suggest that reused code usually has less TD interest; however, the amount of principal in it is higher. A synthesized view of the aforementioned results suggests that software engineers should opt to reuse code when necessary, since, apart from the established reuse benefits (i.e., cost savings, increased productivity, etc.), they also gain benefits in terms of maintenance. Apart from understanding the phenomenon per se, the results of this study provide various implications for research and practice.
Article
Stack Overflow (SO) is the largest Q&A website for software developers, providing a huge amount of copyable code snippets. Using those snippets raises maintenance and legal issues. SO's license (CC BY-SA 3.0) requires attribution, i.e., referencing the original question or answer, and requires derived work to adopt a compatible license. While there is a heated debate on SO's license model for code snippets and the required attribution, little is known about the extent to which snippets are copied from SO without proper attribution. We present results of a large-scale empirical study analyzing the usage and attribution of non-trivial Java code snippets from SO answers in public GitHub (GH) projects. We followed three different approaches to triangulate an estimate for the ratio of unattributed usages and conducted two online surveys with software developers to complement our results. For the different sets of projects that we analyzed, the ratio of projects containing files with a reference to SO varied between 3.3% and 11.9%. We found that at most 1.8% of all analyzed repositories containing code from SO used the code in a way compatible with CC BY-SA 3.0. Moreover, we estimate that at most a quarter of the copied code snippets from SO are attributed as required. Of the surveyed developers, almost one half admitted copying code from SO without attribution and about two thirds were not aware of the license of SO code snippets and its implications.
Article
When programmers look for how to achieve certain programming tasks, Stack Overflow is a popular destination in search engine results. Over the years, Stack Overflow has accumulated an impressive knowledge base of snippets of code that are amply documented. We are interested in studying how programmers use these snippets of code in their projects. Can we find Stack Overflow snippets in real projects? When snippets are used, is this copy literal or does it undergo adaptations? And are these adaptations specializations required by the idiosyncrasies of the target artifact, or are they motivated by specific requirements of the programmer? The large-scale study presented in this paper analyzes 909k non-fork Python projects hosted on GitHub, which contain 290M function definitions, and 1.9M Python snippets captured in Stack Overflow. Results are presented as quantitative analysis of block-level code cloning intra and inter Stack Overflow and GitHub, and as an analysis of programming behaviors through the qualitative analysis of our findings.
Conference Paper
Code smells are a well-known metaphor to describe symptoms of code decay or other issues with code quality which can lead to a variety of maintenance problems. Even though code smell detection and removal has been well-researched over the last decade, it remains open to debate whether or not code smells should be considered meaningful conceptualizations of code quality issues from the developer's perspective. To some extent, this question applies as well to the results provided by current code smell detection tools. Are code smells really important for developers? If they are not, is this due to the lack of relevance of the underlying concepts, due to the lack of awareness about code smells on the developers' side, or due to the lack of appropriate tools for code smell analysis or removal? In order to align and direct research efforts to address actual needs and problems of professional developers, we need to better understand the knowledge about, and interest in code smells, together with their perceived criticality. This paper reports on the results obtained from an exploratory survey involving 85 professional software developers.
Conference Paper
Stack Overflow is a popular on-line programming question and answer community providing its participants with rapid access to knowledge and expertise of their peers, especially benefitting coders. Despite the popularity of Stack Overflow, its role in the work cycle of open-source developers is yet to be understood: on the one hand, participation in it has the potential to increase the knowledge of individual developers thus improving and speeding up the development process. On the other hand, participation in Stack Overflow may interrupt the regular working rhythm of the developer, hence also possibly slow down the development process. In this paper we investigate the interplay between Stack Overflow activities and the development process, reflected by code changes committed to the largest social coding repository, GitHub. Our study shows that active GitHub committers ask fewer questions and provide more answers than others. Moreover, we observe that active Stack Overflow askers distribute their work in a less uniform way than developers that do not ask questions. Finally, we show that despite the interruptions incurred, the Stack Overflow activity rate correlates with the code changing activity in GitHub.
Conference Paper
Delivering increasingly complex software-reliant systems demands better ways to manage the long-term effects of short-term expedients. The technical debt metaphor is gaining significant traction in the agile development community as a way to understand and communicate such issues. The idea is that developers sometimes accept compromises in a system in one dimension (e.g., modularity) to meet an urgent demand in some other dimension (e.g., a deadline), and that such compromises incur a "debt", on which "interest" has to be paid and whose "principal" should be repaid at some point for the long-term health of the project. We argue that the software engineering research community has an opportunity to study and improve this concept. We can offer software engineers a foundation for managing such trade-offs based on models of their economic impacts. Therefore, we propose managing technical debt as a part of the future research agenda for the software engineering field.
Article
Background. This paper describes a case study on the benefits of software reuse in a large telecom product. The reused components were developed in-house and shared in a product-family approach. Methods. Quantitative data mined from company repositories are combined with other quantitative data and qualitative observations. Results. We observed significantly lower fault density and less modified code between successive releases of reused components. Reuse and standardization of software architecture and processes allowed easier transfer of development when organizational changes happened. Conclusions. The study adds to the evidence of quality benefits of large-scale reuse programs and explores organizational motivations and outcomes. This paper describes results of an empirical investigation of data collected from a large industrial telecom product to evaluate and explore software reuse benefits. A product family was initiated across two Ericsson organizations in Norway and Sweden based on extracting reusable components in a first product and developing a reusable, layered architecture to achieve benefits in productivity and lead time. Apart from the above incentives, two benefits are discussed in this paper: a) reuse led to benefits in quality, and b) reuse of software architecture, processes, infrastructure and domain knowledge proved beneficial for facing organizational changes.
Article
Online code clones are code fragments that are copied from software projects or online sources to Stack Overflow as examples. Due to an absence of a checking mechanism after the code has been copied to Stack Overflow, they can become toxic code snippets, e.g., they suffer from being outdated or violating the original software license. We present a study of online code clones on Stack Overflow and their toxicity by incorporating two developer surveys and a large-scale code clone detection. A survey of 201 high-reputation Stack Overflow answerers (33% response rate) showed that 131 participants (65%) have ever been notified of outdated code and 26 of them (20%) rarely or never fix the code. 138 answerers (69%) never check for licensing conflicts between their copied code snippets and Stack Overflow's CC BY-SA 3.0. A survey of 87 Stack Overflow visitors shows that they experienced several issues from Stack Overflow answers: mismatched solutions, outdated solutions, incorrect solutions, and buggy code. 85% of them are not aware of CC BY-SA 3.0 license enforced by Stack Overflow, and 66% never check for license conflicts when reusing code snippets. Our clone detection found online clone pairs between 72,365 Java code snippets on Stack Overflow and 111 open source projects in the curated Qualitas corpus. We analysed 2,289 non-trivial online clone candidates. Our investigation revealed strong evidence that 153 clones have been copied from a Qualitas project to Stack Overflow. We found 100 of them (66%) to be outdated, of which 10 were buggy and harmful for reuse. Furthermore, we found 214 code snippets that could potentially violate the license of their original software and appear 7,112 times in 2,427 GitHub projects.
Conference Paper
Stack Overflow (SO) is the largest Q&A website for developers, providing a huge amount of copyable code snippets. Using these snippets raises various maintenance and legal issues. The SO license requires attribution, i.e., referencing the original question or answer, and requires derived work to adopt a compatible license. While there is a heated debate on SO's license model for code snippets and the required attribution, little is known about the extent to which snippets are copied from SO without proper attribution. In this paper, we present the research design and summarized results of an empirical study analyzing attributed and unattributed usages of SO code snippets in GitHub projects. On average, 3.22% of all analyzed repositories and 7.33% of the popular ones contained a reference to SO. Further, we found that developers rather refer to the whole thread on SO than to a specific answer. For Java, at least two thirds of the copied snippets were not attributed.
Article
Context: Source code reuse has been widely accepted as a fundamental activity in software development. Recent studies showed that StackOverflow has emerged as one of the most popular resources for code reuse. Therefore, a plethora of work proposed ways to optimally ask questions, search for answers and find relevant code on StackOverflow. However, little work studies the impact of code reuse from StackOverflow. Objective: To better understand the impact of code reuse from StackOverflow, we perform an exploratory study focusing on code reuse from StackOverflow in the context of mobile apps. Specifically, we investigate how much, why, when, and who reuses code. Moreover, to understand the potential implications of code reuse, we examine the percentage of bugs in files that reuse StackOverflow code. Method: We perform our study on 22 open source Android apps. For each project, we mine their source code and use clone detection techniques to identify code that is reused from StackOverflow. We then apply different quantitative and qualitative methods to answer our research questions. Results: Our findings indicate that 1) the amount of reused StackOverflow code varies for different mobile apps, 2) feature additions and enhancements in apps are the main reasons for code reuse from StackOverflow, 3) mid-age and older apps reuse StackOverflow code mostly later on in their project lifetime and 4) that in smaller teams/apps, more experienced developers reuse code, whereas in larger teams/apps, the less experienced developers reuse code the most. Additionally, we found that the percentage of bugs is higher in files after reusing code from StackOverflow. Conclusion: Our results provide insights on the potential impact of code reuse from StackOverflow on mobile apps. Furthermore, these results can benefit the research community in developing new techniques and tools to facilitate and improve code reuse from StackOverflow.
Article
Reuse is an established software development practice, whose benefits have attracted the attention of researchers and practitioners. In order for software reuse to advance from an opportunistic activity to a well-defined, systematic state of practice, the reuse phenomenon should be empirically studied in a real-world environment. OSS projects constitute a fitting context for this purpose. In this paper, we aim at assessing the: (a) strategy and intensity of reuse activities in OSS development, (b) effect of reuse activities on design quality, (c) modification of reuse decisions from a chronological viewpoint and (d) effect of these modifications on software design quality. In order to achieve these goals, we performed a large-scale embedded multi-case study on 1,111 Java projects, extracted from the Google Code repository. The results of the case study provide valuable insight into reuse processes in OSS development that can be exploited by both researchers and practitioners.
Book
Introduction; Design of the Case Study; Data Collection; Data Analysis; Reporting and Dissemination; Lessons Learned.
Article
Context: Whilst technical debt is considered to be detrimental to the long term success of software development, it appears to be poorly understood in academic literature. The absence of a clear definition and model for technical debt exacerbates the challenge of its identification and adequate management, thus preventing the realisation of technical debt's utility as a conceptual and technical communication device. Objective: To make a critical examination of technical debt and consolidate understanding of the nature of technical debt and its implications for software development. Method: An exploratory case study technique that involves multivocal literature review, supplemented by interviews with software practitioners and academics to establish the boundaries of the technical debt phenomenon. Result: A key outcome of this research is the creation of a theoretical framework that provides a holistic view of technical debt comprising a set of technical debt dimensions, attributes, precedents and outcomes, as well as the phenomenon itself and a taxonomy that describes and encompasses different forms of the technical debt phenomenon. Conclusion: The proposed framework provides a useful approach to understanding the overall phenomenon of technical debt for practical purposes. Future research should incorporate empirical studies to validate heuristics and techniques that will assist practitioners in their management of technical debt.
Article
Web scraping is the set of techniques used to automatically get some information from a website instead of manually copying it. The goal of a Web scraper is to look for certain kinds of information, extract it, and aggregate it into new Web pages. In particular, scrapers focus on transforming unstructured data and saving it in structured databases. In this paper, among other kinds of scraping, we focus on those techniques that extract the content of a Web page. In particular, we adopt scraping techniques in the Web advertising field. To this end, we propose a collaborative filtering-based Web advertising system aimed at finding the most relevant ads for a generic Web page by exploiting Web scraping. To illustrate how the system works in practice, a case study is presented.
Article
Over the last decade many techniques and tools for software clone detection have been proposed. In this paper, we provide a qualitative comparison and evaluation of the current state-of-the-art in clone detection techniques and tools, and organize the large amount of information into a coherent conceptual framework. We begin with background concepts, a generic clone detection process and an overall taxonomy of current techniques and tools. We then classify, compare and evaluate the techniques and tools in two different dimensions. First, we classify and compare approaches based on a number of facets, each of which has a set of (possibly overlapping) attributes. Second, we qualitatively evaluate the classified techniques and tools with respect to a taxonomy of editing scenarios designed to model the creation of Type-1, Type-2, Type-3 and Type-4 clones. Finally, we provide examples of how one might use the results of this study to choose the most appropriate clone detection tool or technique in the context of a particular set of goals and constraints. The primary contributions of this paper are: (1) a schema for classifying clone detection techniques and tools and a classification of current clone detectors based on this schema, and (2) a taxonomy of editing scenarios that produce different clone types and a qualitative evaluation of current clone detectors based on this taxonomy.
Article
Code reuse is a form of knowledge reuse in software development, which is fundamental to innovation in many fields. To date, there has been no systematic investigation of code reuse in open source software projects. This study uses quantitative and qualitative data gathered from a sample of six open source software projects to evaluate two sets of propositions derived from the literature on software reuse in firms and open source software development. We find that code reuse is extensive across the sample and that open source software developers, much like developers in firms, apply tools that lower their search costs for knowledge and code, assess the quality of software components, and have incentives to reuse code. Open source software developers reuse code because they want to integrate functionality quickly, because they want to write preferred code, because they operate under limited resources in terms of time and skills, and because they can mitigate development costs through code reuse.
Conference Paper
The benefits of reusing software components have been studied for many years. Several previous studies have concluded that reused components have fewer defects in general than non-reusable components. However, few of these studies have gone a step further, i.e., investigating which types of defects have been reduced because of reuse. Thus, it is suspected that making a software component reusable will automatically improve its quality. This paper presents an on-going industrial empirical study on the quality benefits of reuse. We are going to compare the defect types, classified by ODC (orthogonal defect classification), of reusable components vs. non-reusable components in several large and medium software systems. The intention is to figure out which defects have been reduced because of reuse and the reasons for the reduction.
Conference Paper
Much attention has been paid to software reuse in recent years because it is recognized as a key means for obtaining higher productivity in the development of new software systems (Gaffney and Durek 1988; Gaffney and Durek 1991; and Gaffney 1989). Also, software reuse has provided the technical benefit of reduced error content and thus higher quality. The primary economic benefit of software reuse is cost avoidance. Reuse of an existent software object generally costs much less than creating a new software object.
Article
The paper presents a failure modes model of parts-based software reuse, and shows how this model can be used to evaluate and improve software reuse processes. The model and the technique are illustrated using survey data about software reuse gathered from 113 people from 29 organizations.
C. Ragkhitwetsagul, J. Krinke, M. Paixao, G. Bianco, and R. Oliveto, "Toxic Code Snippets on Stack Overflow," arXiv:1806.07659 [cs], Jun. 2018.