Conference Paper

Challenges in migrating imperative deep learning programs to graph execution: an empirical study

... Though promising, hybridization necessitates non-trivial metadata [13] and exhibits limitations and known issues [17] with native program constructs. Subtle considerations are required to make code amenable to safe, accurate, and efficient graph execution [18], [19]. Alternative approaches [13] impose custom Python interpreters, which may be impractical for industry, and support only specific Python constructs. ...
... It will also discover possible side-effects in Python functions to safely transform imperative DL code to either execute eagerly or as a graph at run-time. While LLMs [21] and big data-driven refactorings [22] have emerged, obtaining a (correct) dataset large enough to automatically extract the proposed refactorings is challenging, as developers struggle with (manually) migrating DL code to graph execution [18]. Also, while developers generally underuse automated refactorings [23], [24], since data scientists and engineers may not be classically trained software engineers, they may be more open to using automated (refactoring) tools. ...
... Our in-progress work aims to automate client-side analyses and transformations to use hybridization APIs correctly and optimally. In the future, we plan to evaluate our approach by curating a DL Python project dataset from Castro Vélez et al. [18] that is ready to be analyzed, with dependency and build infrastructure. We will also open-source our refactoring tools. ...
Preprint
Full-text available
Efficiency is essential to support responsiveness w.r.t. ever-growing datasets, especially for Deep Learning (DL) systems. DL frameworks have traditionally embraced deferred execution-style DL code -- supporting symbolic, graph-based Deep Neural Network (DNN) computation. While scalable, such development tends to produce code that is error-prone, non-intuitive, and difficult to debug. Consequently, more natural, less error-prone imperative DL frameworks encouraging eager execution have emerged at the expense of run-time performance. Though hybrid approaches aim for the "best of both worlds," using them effectively requires subtle considerations to make code amenable to safe, accurate, and efficient graph execution -- avoiding performance bottlenecks and semantically inequivalent results. We present our ongoing work on an automated refactoring approach that assists developers in specifying whether and how their otherwise eagerly-executed imperative DL code could be reliably and efficiently executed as graphs at run-time in a semantics-preserving fashion. The approach, based on a novel tensor analysis specifically for imperative DL code, consists of refactoring preconditions for automatically determining when it is safe and potentially advantageous to migrate imperative DL code to graph execution and modifying decorator parameters or eagerly executing code already running as graphs. The approach is being implemented as a PyDev Eclipse IDE plug-in and uses the WALA Ariadne analysis framework. We discuss our ongoing work towards optimizing imperative DL code to its full potential.
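The eager-versus-graph semantics gap described above can be illustrated with a minimal TensorFlow sketch (the function names below are hypothetical and this is not the paper's tool): decorating a Python function with tf.function traces it into a dataflow graph, so Python side effects such as print() run only during tracing rather than on every call.

import tensorflow as tf

def eager_double(x):
    print("running eagerly")  # a Python side effect; executes on every call
    return x * 2.0

@tf.function  # hybridized: traced into a dataflow graph on first call
def graph_double(x):
    print("tracing")  # executes only while tracing, not on subsequent calls
    return x * 2.0

x = tf.constant(3.0)
eager_double(x)  # prints every time
eager_double(x)
graph_double(x)  # prints once during tracing, then the cached graph is reused
graph_double(x)  # no print: the Python side effect disappears under graph execution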
... Unlike the work of , they exclude all generic (i.e., non-DL-specific) bugs in DL applications before building the taxonomy, and identify API misuse as a leaf category of the whole taxonomy. Vélez et al. (2022) examined code patches and bug reports of tf.function, and found that the hybridization approach could lead to API misuses. The work of Vélez et al. (2022) is not a study targeting API misuse exclusively. ...
... Vélez et al. (2022) examined code patches and bug reports of tf.function, and found that the hybridization approach could lead to API misuses. The work of Vélez et al. (2022) is not a study targeting API misuse exclusively. It focuses on covering all types of bugs of tf.function, including API misuse, API incompatibility, numerical errors, etc. ...
Article
Full-text available
Deep Learning (DL) is achieving staggering performance on an increasing number of applications in various areas. Meanwhile, its associated data-driven programming paradigm comes with a set of challenges for the software engineering community, including the debugging activities for DL applications. Recent empirical studies on bugs in DL applications have flagged API (i.e., Application Programming Interface) misuse as an important category of DL programming bugs. By exploring the literature on API misuse bugs in DL applications, we identified three barriers that are hindering progress in this research direction: misclassification of API misuse bugs, lack of a relevant dataset, and limited depth of analysis. Our work unlocks these barriers by providing an in-depth analysis of a frequent yet poorly understood bug type. Concretely, we first offer a new perspective on a significant misclassification issue in the literature that hinders understanding of API misuses in DL applications. Subsequently, we curate MisuAPI, the first dataset of 143 API misuses sampled from real-world DL applications. Finally, we perform systematic analyses to dissect API misuses and enumerate their symptoms in DL applications, as well as investigate the possibility of detecting them with state-of-the-art static analyzers. Overall, the insights summarized in this work are important for the community: 1) 18-35% of real API misuses are mislabelled in existing DL bug studies; 2) the widely adopted API misuse taxonomy, namely MUC, does not cover 1 out of 3 encountered API misuses; 3) DL library API misuses show significant differences from general third-party library API misuses in terms of the API-usage element issue and symptoms; 4) most (92.3%) API misuses lead to program crashes; and 5) 95.8% of API misuses remain undetectable by state-of-the-art static analyzers.
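As a purely illustrative example (not drawn from the MisuAPI dataset), one well-known TensorFlow API misuse of this kind is applying tf.nn.softmax before a *_with_logits loss: the loss applies softmax internally, so the mistake silently degrades training instead of crashing.

import tensorflow as tf

labels = tf.constant([[0.0, 1.0], [1.0, 0.0]])
scores = tf.constant([[1.0, 3.0], [2.0, 0.5]])

# Misuse: the loss expects raw logits but receives probabilities (double softmax).
probs = tf.nn.softmax(scores)
bad_loss = tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=probs)

# Fix: pass the unnormalized scores directly.
good_loss = tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=scores)

print(bad_loss.numpy(), good_loss.numpy())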
... Column id represents the unique GitHub issue identifier, column comment line the preprocessed, tokenized comment text, and column category the classification category as produced using Arya et al. [2]'s pre-built model. The results were obtained for an empirical study on the challenges facing developers in improving the run-time performance of imperative Deep Learning (DL) code using hybridization [15]. The query "tf.function" was used (period included) to uncover unsolved GitHub issues mentioning the TensorFlow [1] hybridization API keyword. ...
Preprint
Full-text available
Issue tracking systems enable users and developers to comment on problems plaguing a software system. Empirical Software Engineering (ESE) researchers study (open-source) project issues and the comments and threads within to discover -- among others -- challenges developers face when, e.g., incorporating new technologies, platforms, and programming language constructs. However, issue discussion threads accumulate over time and thus can become unwieldy, hindering any insight that researchers may gain. While existing approaches alleviate this burden by classifying issue thread comments, there is a gap between searching popular open-source software repositories (e.g., those on GitHub) for issues containing particular keywords and feeding the results into a classification model. In this paper, we demonstrate a research infrastructure tool called QuerTCI that bridges this gap by integrating the GitHub issue comment search API with the classification models found in existing approaches. Using queries, ESE researchers can retrieve GitHub issues containing particular keywords, e.g., those related to a certain programming language construct, and subsequently classify the kinds of discussions occurring in those issues. Using our tool, our hope is that ESE researchers can uncover challenges related to particular technologies using certain keywords through popular open-source repositories more seamlessly than previously possible. A tool demonstration video may be found at: https://youtu.be/fADKSxn0QUk.
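A rough sketch of the gap QuerTCI bridges, using only the public GitHub Search REST API (this is not the tool's implementation; authentication, pagination, and the downstream classification step are omitted):

import requests

def search_issues(keyword: str, per_page: int = 10):
    # Query GitHub's issue search endpoint for open issues mentioning the keyword.
    resp = requests.get(
        "https://api.github.com/search/issues",
        params={"q": f'"{keyword}" is:issue is:open', "per_page": per_page},
        headers={"Accept": "application/vnd.github+json"},
        timeout=30,
    )
    resp.raise_for_status()
    return [(item["html_url"], item["title"]) for item in resp.json()["items"]]

for url, title in search_issues("tf.function"):
    print(url, "-", title)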
Article
Recently, software systems powered by deep learning (DL) techniques have significantly facilitated people’s lives in many aspects. As the backbone of these DL systems, various DL libraries undertake the underlying optimization and computation. However, like traditional software, DL libraries are not immune to bugs. These bugs may be propagated to programs and software developed based on DL libraries, thereby posing serious threats to users’ personal property and safety. Studying the characteristics of DL libraries, their associated bugs, and the corresponding testing methods is crucial for enhancing the security of DL systems and advancing the widespread application of DL technology. This paper provides an overview of the testing research on various DL libraries, discusses the strengths and weaknesses of existing methods, and provides guidance and reference for the application of DL library testing methods. This paper first introduces the workflow of DL underlying libraries and the characteristics of three kinds of DL libraries involved, namely DL framework, DL compiler, and DL hardware library. Subsequently, this paper constructs a literature collection pipeline and comprehensively summarizes existing testing methods on these DL libraries to analyze their effectiveness and limitations. It also reports findings and the challenges of existing DL library testing in real-world applications for future research.
Conference Paper
Full-text available
Because of the naturalness of software and the rapid evolution of Machine Learning (ML) techniques, frequently repeated code change patterns (CPATs) occur often. They range from simple API migrations to changes involving several complex control structures such as for loops. While manually performing CPATs is tedious, the current state-of-the-art techniques for inferring transformation rules are not advanced enough to handle unseen variants of complex CPATs, resulting in a low recall rate. In this paper we present a novel, automated workflow that mines CPATs, infers the transformation rules, and then transplants them automatically to new target sites. We designed, implemented, evaluated and released this in a tool, PYEVOLVE. At its core is a novel data-flow, control-flow aware transformation rule inference engine. Our technique allows us to advance the state-of-the-art for transformation-by-example tools; without it, 70% of the code changes that PYEVOLVE transforms would not be possible to automate. Our thorough empirical evaluation of over 40,000 transformations shows 97% precision and 94% recall. By accepting 90% of CPATs generated by PYEVOLVE in famous open-source projects, developers confirmed its changes are useful.
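For illustration only (this example is hypothetical, not taken from the paper's evaluation), a CPAT of the kind described might rewrite an explicit accumulation loop into a comprehension; once such a rule is inferred from many observed edits, it can be transplanted to unseen variants of the same pattern.

# Before: a common loop-based variant that developers repeatedly rewrite.
def squares_before(values):
    result = []
    for v in values:
        if v >= 0:
            result.append(v * v)
    return result

# After: the target form that an inferred, data-flow-aware transformation rule
# could apply automatically at new, unseen call sites.
def squares_after(values):
    return [v * v for v in values if v >= 0]

assert squares_before([1, -2, 3]) == squares_after([1, -2, 3])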
Conference Paper
Full-text available
The growing application of DL makes detecting and fixing defective DL programs of paramount importance. Recent studies on DL defects report that TensorFlow API misuses represent a common class of DL defects. However, effectively detecting, fixing, and verifying them remains an understudied problem. This paper presents the TensorFlow API misuses Detector And Fixer (TADAF) technique, which relies on 11 common API misuse patterns and corresponding fixes that we extracted from StackOverflow. TADAF statically analyses a TensorFlow program to identify matches of any of the 11 patterns. If it finds a match, it automatically generates a fixed version of the program. To verify that the misuse brings a tangible negative effect, TADAF reports functional, accuracy, or efficiency differences when training and testing (with the same data) the original and fixed versions of the program. Our preliminary evaluation on five GitHub projects shows that TADAF detected and fixed all the API misuses.
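A simplified sketch of how such static pattern matching can work (this is not TADAF's implementation; the single pattern below, softmax applied before a *_with_logits loss, is used here only as an example of a detectable misuse):

import ast

SOURCE = '''
loss = tf.nn.softmax_cross_entropy_with_logits(
    labels=y, logits=tf.nn.softmax(scores))
'''

def dotted_name(node):
    # Return "tf.nn.softmax"-style names for attribute chains, else None.
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None

class MisuseFinder(ast.NodeVisitor):
    def visit_Call(self, call):
        if dotted_name(call.func) == "tf.nn.softmax_cross_entropy_with_logits":
            for kw in call.keywords:
                if (kw.arg == "logits" and isinstance(kw.value, ast.Call)
                        and dotted_name(kw.value.func) == "tf.nn.softmax"):
                    print(f"line {call.lineno}: softmax applied before "
                          "softmax_cross_entropy_with_logits")
        self.generic_visit(call)

MisuseFinder().visit(ast.parse(SOURCE))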
Preprint
Full-text available
Enabled by a rich ecosystem of Machine Learning (ML) libraries, programming using learned models, i.e., Software-2.0, has gained substantial adoption. However, we do not know what challenges developers encounter when they use ML libraries. With this knowledge gap, researchers miss opportunities to contribute to new research directions, tool builders do not invest resources where automation is most needed, library designers cannot make informed decisions when releasing ML library versions, and developers fail to use common practices when using ML libraries. We present the first large-scale quantitative and qualitative empirical study to shed light on how developers in Software-2.0 use ML libraries, and how this evolution affects their code. Particularly, using static analysis we perform a longitudinal study of 3,340 top-rated open-source projects with 46,110 contributors. To further understand the challenges of ML library evolution, we survey 109 developers who introduce and evolve ML libraries. Using this rich dataset we reveal several novel findings. Among others, we found an increasing trend of using ML libraries: the ratio of new Python projects that use ML libraries increased from 2% in 2013 to 50% in 2018. We identify several usage patterns including the following: (i) 36% of the projects use multiple ML libraries to implement various stages of the ML workflows, (ii) developers update ML libraries more often than the traditional libraries, (iii) strict upgrades are the most popular for ML libraries among other update kinds, (iv) ML library updates often result in cascading library updates, and (v) ML libraries are often downgraded (22.04% of cases). We also observed unique challenges when evolving and maintaining Software-2.0, such as (i) binary incompatibility of trained ML models and (ii) benchmarking ML models. Finally, we present actionable implications of our findings for researchers, tool builders, developers, educators, library vendors, and hardware vendors.
Article
Full-text available
Refactoring detection is crucial for a variety of applications and tasks: (i) empirical studies about code evolution, (ii) tools for library API migration, (iii) code reviews and change comprehension. However, recent research has questioned the accuracy of the state-of-the-art refactoring mining tools, which poses threats to the reliability of the detected refactorings. Moreover, the majority of refactoring mining tools depend on code similarity thresholds. Finding universal threshold values that can work well for all projects, regardless of their architectural style, application domain, and development practices is extremely challenging. Therefore, in a previous work [1], we introduced the first refactoring mining tool that does not require any code similarity thresholds to operate. In this work, we extend our tool to support low-level refactorings that take place within the body of methods. To evaluate our tool, we created one of the most accurate, complete, and representative refactoring oracles to date, including 7,226 true instances for 40 different refactoring types detected by one (minimum) up to six (maximum) different tools, and validated by one up to four refactoring experts. Our evaluation showed that our approach achieves the highest average precision (99.6%) and recall (94%) among all competing tools, and at the median is 2.6 times faster than the second-fastest competing tool.
Conference Paper
Full-text available
In open-source software (OSS), the design of usability is often influenced by the discussions among community members on platforms such as issue tracking systems (ITSs). However, digesting the rich information embedded in issue discussions can be a major challenge due to the vast number and diversity of the comments. We propose and evaluate ArguLens, a conceptual framework and automated technique leveraging an argumentation model to support effective understanding and consolidation of community opinions in ITSs. Through content analysis, we anatomized highly discussed usability issues from a large, active OSS project into their argumentation components and standpoints. We then experimented with supervised machine learning techniques for automated argument extraction. Finally, through a study with experienced ITS users, we show that the information provided by ArguLens supported the digestion of usability-related opinions and facilitated the review of lengthy issues. ArguLens provides the direction of designing valuable tools for high-level reasoning and effective discussion about usability.
Chapter
Full-text available
Streaming APIs allow for big data processing of native data structures by providing MapReduce-like operations over these structures. However, unlike traditional big data systems, these data structures typically reside in shared memory accessed by multiple cores. Although popular, this emerging hybrid paradigm opens the door to possibly detrimental behavior, such as thread contention and bugs related to non-execution and non-determinism. This study explores the use and misuse of a popular streaming API, namely, Java 8 Streams. The focus is on how developers decide whether or not to run these operations sequentially or in parallel, and on bugs both specific and tangential to this paradigm. Our study involved analyzing 34 Java projects and 5.53 million lines of code, along with 719 manually examined code patches. Both automated methodologies, including interprocedural static analysis, and manual methodologies were employed. The results indicate that streams are pervasive, parallelization is not widely used, and performance is a crosscutting concern that accounted for the majority of fixes. We also present coincidences that both confirm and contradict the results of related studies. The study advances our understanding of streams, as well as benefits practitioners, programming language and API designers, tool developers, and educators alike.
Article
Full-text available
Due to the voluntary nature of open source software, it can be hard to find a developer to work on a particular task. For example, some issue reports may be too cumbersome and unexciting for someone to volunteer to do them, yet these issue reports may be of high priority to the success of a project. To provide an incentive for implementing such issue reports, one can propose a monetary reward, i.e., a bounty, to the developer who completes that particular task. In this paper, we study bounties in open source projects on GitHub to better understand how bounties can be leveraged to evolve such projects in terms of addressing issue reports. We investigated 5,445 bounties for GitHub projects. These bounties were proposed through the Bountysource platform with a total bounty value of 406,425 USD. We find that 1) in general, the timing of proposing bounties is the most important factor that is associated with the likelihood of an issue being addressed. More specifically, issue reports are more likely to be addressed if they are for projects in which bounties are used more frequently and if they are proposed earlier. 2) The bounty value of an issue report is the most important factor that is associated with the issue-addressing likelihood in the projects in which no bounties were used before. 3) There is a risk of wasting money for backers who invest money on long-standing issue reports.
Chapter
Full-text available
Artificial intelligence enabled systems have become an inevitable part of everyday life. However, efficient software engineering principles and processes need to be considered and extended when developing AI-enabled systems. The objective of this study is to identify and classify the software engineering challenges that are faced by different companies when developing software-intensive systems that incorporate machine learning components. Using a case study approach, we explored the development of machine learning systems from six different companies across various domains and identified the main software engineering challenges. The challenges are mapped into a proposed taxonomy that depicts the evolution of the use of ML components in software-intensive systems in industrial settings. Our study provides insights to the software engineering community and researchers to guide discussions and future research into applied machine learning.
Conference Paper
Full-text available
Surprisingly promising results have been achieved by deep learning (DL) systems in recent years. Many of these achievements have been reached in academic settings, or by large technology companies with highly skilled research groups and advanced supporting infrastructure. For companies without large research groups or advanced infrastructure, building high-quality production-ready systems with DL components has proven challenging. There is a clear lack of well-functioning tools and best practices for building DL systems. It is the goal of this research to identify the main challenges by applying an interpretive research approach in close collaboration with companies of varying size and type. A set of seven projects was selected to describe the potential of this new technology and to identify the associated main challenges. A set of 12 main challenges has been identified and categorized into the three areas of development, production, and organizational challenges. Furthermore, a mapping between the challenges and the projects is defined, together with selected motivating descriptions of how and why the challenges apply to specific projects. Compared to other areas such as software engineering or database technologies, it is clear that DL is still rather immature and in need of further work to facilitate the development of high-quality systems. The challenges identified in this paper can be used to guide future research by the software engineering and DL communities. Together, we could enable a large number of companies to start taking advantage of the high potential of DL technology.
Conference Paper
Full-text available
Sites such as GitHub have created a vast collection of software artifacts that researchers interested in understanding and improving software systems can use. Current tools for processing such GitHub data tend to target project metadata and avoid source code processing, or process source code in a manner that requires significant effort for each language supported. This paper presents GitcProc, a lightweight tool based on regular expressions and source code blocks, which downloads projects and extracts their project history, including fine-grained source code information and development-time bug fixes. GitcProc can track changes to both single-line and block source code structures and associate these changes with the surrounding function context with minimal setup required from users. We demonstrate GitcProc's ability to capture changes in multiple languages by evaluating it on C, C++, Java, and Python projects, and show it finds bug fixes and the context of source code changes effectively with few false positives.
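A lightweight sketch in the spirit of GitcProc (not the tool itself), assuming an already-cloned repository at a hypothetical path, combines git log with a regular expression to surface likely bug-fix commits:

import re
import subprocess

BUGFIX_RE = re.compile(r"\b(fix(es|ed)?|bug|defect|patch)\b", re.IGNORECASE)

def bugfix_commits(repo_path: str):
    # List commit hashes and subjects, then keep those whose subject looks like a fix.
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "--pretty=format:%H|%s"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in log.splitlines():
        sha, _, subject = line.partition("|")
        if BUGFIX_RE.search(subject):
            yield sha, subject

for sha, subject in bugfix_commits("./some-project"):  # hypothetical local clone
    print(sha[:10], subject)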
Conference Paper
Full-text available
Bugs severely hurt blockchain system dependability. A thorough understanding of blockchain bug characteristics is required to design effective tools for preventing, detecting, and mitigating bugs. We perform an empirical study on bug characteristics in eight representative open source blockchain systems. First, we manually examine 1,108 bug reports to understand the nature of the reported bugs. Second, we leverage card sorting to label the bug reports, and obtain ten bug categories in blockchain systems. We further investigate the frequency distribution of bug categories across projects and programming languages. Finally, we study the relationship between bug categories and bug fixing time. The findings include: (1) semantic bugs are the dominant runtime bug category; (2) frequency distributions of bug types show similar trends across different projects and programming languages; (3) security bugs take the longest median time to be fixed; (4) 35.71% of performance bugs are fixed in more than one year; performance bugs take the longest average time to be fixed.
Article
Full-text available
TensorFlow is a machine learning system that operates at large scale and in heterogeneous environments. TensorFlow uses dataflow graphs to represent computation, shared state, and the operations that mutate that state. It maps the nodes of a dataflow graph across many machines in a cluster, and within a machine across multiple computational devices, including multicore CPUs, general-purpose GPUs, and custom designed ASICs known as Tensor Processing Units (TPUs). This architecture gives flexibility to the application developer: whereas in previous "parameter server" designs the management of shared state is built into the system, TensorFlow enables developers to experiment with novel optimizations and training algorithms. TensorFlow supports a variety of applications, with particularly strong support for training and inference on deep neural networks. Several Google services use TensorFlow in production, we have released it as an open-source project, and it has become widely used for machine learning research. In this paper, we describe the TensorFlow dataflow model in contrast to existing systems, and demonstrate the compelling performance that TensorFlow achieves for several real-world applications.
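A small illustration using the current TensorFlow 2.x API (not code from the paper) shows the dataflow-graph representation the abstract describes: tracing a Python function yields a concrete function whose underlying graph of operations can be inspected.

import tensorflow as tf

@tf.function
def affine(x, w, b):
    return tf.matmul(x, w) + b

concrete = affine.get_concrete_function(
    tf.TensorSpec([None, 3], tf.float32),
    tf.TensorSpec([3, 2], tf.float32),
    tf.TensorSpec([2], tf.float32),
)

# Each entry below is a node (operation) in the traced dataflow graph, e.g. MatMul, AddV2.
for op in concrete.graph.get_operations():
    print(op.type, [t.shape for t in op.outputs])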
Article
Full-text available
MXNet is a multi-language machine learning (ML) library to ease the development of ML algorithms, especially for deep neural networks. Embedded in the host language, it blends declarative symbolic expression with imperative tensor computation. It offers auto differentiation to derive gradients. MXNet is computation and memory efficient and runs on various heterogeneous systems, ranging from mobile devices to distributed GPU clusters. This paper describes both the API design and the system implementation of MXNet, and explains how embedding of both symbolic expression and tensor operation is handled in a unified fashion. Our preliminary experiments reveal promising results on large scale deep neural network applications using multiple GPU machines.
Article
Full-text available
Many machine learning systems that include various data mining, information retrieval, and natural language processing code and libraries have been used in real-world applications. Search engines, internet advertising systems, and product recommendation systems are sample users of these algorithm-intensive code and libraries. Machine learning code and toolkits have also been used in many recent studies on software mining and analytics that aim to automate various software engineering tasks. With the increasing number of important applications of machine learning systems, the reliability of such systems is also becoming increasingly important. A necessary step for ensuring the reliability of such systems is to understand the features and characteristics of bugs occurring in these systems. A number of studies have investigated bugs and fixes in various software systems, but none focuses on machine learning systems. Machine learning systems are unique due to their algorithm-intensive nature and application to potentially large-scale data, and thus deserve special consideration. In this study, we fill the research gap by performing an empirical study on the bugs that appear in machine learning systems. We analyze three systems, namely Apache Mahout, Lucene, and OpenNLP, which are data mining, information retrieval, and natural language processing tools, respectively. We look into their bug databases and code repositories, analyze existing bugs and corresponding fixes, and label the bugs into various categories. Our study finds that 22.6% of the bugs belong to the algorithm/method category, 15.6% belong to the non-functional category, and 13% belong to the assignment/initialization category. We also report the relationship between the categories of bugs and their severity, the time and effort needed to fix the bugs, and their impact. We highlight several categories of bugs that deserve attention in future research.
Article
Full-text available
Support for generic programming was added to the Java language in 2004, representing perhaps the most significant change to one of the most widely used programming languages today. Researchers and language designers anticipated this addition would relieve many long-standing problems plaguing developers, but surprisingly, no one has yet measured how generics have been adopted and used in practice. In this paper, we report on the first empirical investigation into how Java generics have been integrated into open source software by automatically mining the history of 40 popular open source Java programs, traversing more than 650 million lines of code in the process. We evaluate five hypotheses and research questions about how Java developers use generics. For example, our results suggest that generics sometimes reduce the number of type casts and that generics are usually adopted by a single champion in a project, rather than all committers. We also offer insights into why some features may be adopted sooner and other features may be held back.
Conference Paper
Full-text available
The MapReduce programming paradigm has been gaining more and more popularity in recent years due to its ability to support easy programming, data distribution, and fault tolerance. Failure is an unwanted but inevitable fact that all large-scale parallel computing systems have to face. MapReduce introduces a novel data replication and task re-execution strategy for fault tolerance. This study intends to provide a better understanding of such fault tolerance mechanisms. In particular, we build a stochastic performance model to quantify the impact of failures on MapReduce applications and to investigate its effectiveness under different computing environments. Simulations have also been carried out to verify the accuracy of the proposed model. Our results show that data replication is an effective approach even when the failure rate is high, and that the task migration mechanism of MapReduce works well in balancing the reliability differences among individual nodes. This work provides a theoretical foundation for optimizing large-scale MapReduce applications, especially when fault tolerance is a concern.
Conference Paper
Machine learning has been widely adopted in diverse science and engineering domains, aided by reusable libraries and quick development patterns. The TensorFlow library is probably the best-known representative of this trend and most users employ the Python API to its powerful back-end. TensorFlow programs are susceptible to several systematic errors, especially in the dynamic typing setting of Python. We present Pythia, a static analysis that tracks the shapes of tensors across Python library calls and warns of several possible mismatches. The key technical aspects are a close modeling of library semantics with respect to tensor shape, and an identification of violations and error-prone patterns. Pythia is powerful enough to statically detect (with 84.62% precision) 11 of the 14 shape-related TensorFlow bugs in the recent Zhang et al. empirical study, an independent slice of real-world bugs.
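For illustration (this example is not from the Pythia study), the following shows the kind of shape mismatch such an analysis targets; the error surfaces only at run time under eager execution, whereas a static shape analysis could flag it before the program runs.

import tensorflow as tf

a = tf.ones([2, 3])  # shape (2, 3)
b = tf.ones([5, 4])  # shape (5, 4): inner dimensions 3 and 5 do not match

try:
    tf.matmul(a, b)  # fails only when executed
except tf.errors.InvalidArgumentError as err:
    print("shape mismatch:", err)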
Article
Logging—used for system events and security breaches to describe more informational yet essential aspects of software features—is pervasive. Given the high transactionality of today's software, logging effectiveness can be reduced by information overload. Log levels help alleviate this problem by correlating a priority to logs that can be later filtered. As software evolves, however, levels of logs documenting surrounding feature implementations may also require modification as features once deemed important may have decreased in urgency and vice-versa. We present an automated approach that assists developers in evolving levels of such (feature) logs. The approach, based on mining Git histories and manipulating a degree of interest (DOI) model, transforms source code to revitalize feature log levels based on the "interestingness" of the surrounding code. Built upon JGit and Mylyn, the approach is implemented as an Eclipse IDE plug-in and evaluated on 18 Java projects with ∼3 million lines of code and ∼4K log statements. Our tool successfully analyzes 99.22% of logging statements, increases log level distributions by ∼20%, and increases the focus of logs in bug fix contexts ∼83% of the time. Moreover, pull (patch) requests were integrated into large and popular open-source projects. The results indicate that the approach is promising in assisting developers in evolving feature log levels.
Article
Actor concurrency is becoming increasingly important in the development of real-world software systems. Although actor concurrency may be less susceptible to some multithreaded concurrency bugs, such as low-level data races and deadlocks, it comes with its own bugs that may be different. However, the fundamental characteristics of actor concurrency bugs, including their symptoms, root causes, API usages, examples, and differences when they come from different sources are still largely unknown. Actor software development can significantly benefit from a comprehensive qualitative and quantitative understanding of these characteristics, which is the focus of this work, to foster better API documentation, development practices, testing, debugging, repairing, and verification frameworks. To conduct this study, we take the following major steps. First, we construct a set of 186 real-world Akka actor bugs from Stack Overflow and GitHub via manual analysis of 3,924 Stack Overflow questions, answers, and comments and 3,315 GitHub commits, messages, original and modified code snippets, issues, and pull requests. Second, we manually study these actor bugs and their fixes to understand and classify their symptoms, root causes, and API usages. Third, we study the differences between the commonalities and distributions of symptoms, root causes, and API usages of our Stack Overflow and GitHub actor bugs. Fourth, we discuss real-world examples of our actor bugs with these symptoms and root causes. Finally, we investigate the relation of our findings with those of previous work and discuss their implications. A few findings of our study are: (1) symptoms of our actor bugs can be classified into five categories, with Error as the most common symptom and Incorrect Exceptions as the least common, (2) root causes of our actor bugs can be classified into ten categories, with Logic as the most common root cause and Untyped Communication as the least common, (3) a small number of Akka API packages are responsible for most of API usages by our actor bugs, and (4) our Stack Overflow and GitHub actor bugs can differ significantly in commonalities and distributions of their symptoms, root causes, and API usages. While some of our findings agree with those of previous work, others sharply contrast.
Conference Paper
Software developers are increasingly required to write big data code. However, they find big data software development challenging. To help these developers it is necessary to understand big data topics that they are interested in and the difficulty of finding answers for questions in these topics. In this work, we conduct a large-scale study on Stackoverflow to understand the interest and difficulties of big data developers. To conduct the study, we develop a set of big data tags to extract big data posts from Stackoverflow; use topic modeling to group these posts into big data topics; group similar topics into categories to construct a topic hierarchy; analyze popularity and difficulty of topics and their correlations; and discuss implications of our findings for practice, research and education of big data software development and investigate their coincidence with the findings of previous work.
Conference Paper
Deep learning has gained substantial popularity in recent years. Developers mainly rely on libraries and tools to add deep learning capabilities to their software. What kinds of bugs are frequently found in such software? What are the root causes of such bugs? What impacts do such bugs have? Which stages of the deep learning pipeline are more bug prone? Are there any antipatterns? Understanding such characteristics of bugs in deep learning software has the potential to foster the development of better deep learning platforms, debugging mechanisms, development practices, and analysis and verification frameworks. Therefore, we study 2716 high-quality posts from Stack Overflow and 500 bug fix commits from GitHub about five popular deep learning libraries, namely Caffe, Keras, Tensorflow, Theano, and Torch, to understand the types of bugs, their root causes and impacts, the bug-prone stages of the deep learning pipeline, and whether there are common antipatterns in this buggy software. The key findings of our study include: data bugs and logic bugs are the most severe bug types in deep learning software, appearing more than 48% of the time; the major root causes of these bugs are Incorrect Model Parameter or Structure (IPS) and Structural Inefficiency (SI), showing up more than 43% of the time. We have also found that the bugs in the usage of deep learning libraries have some common antipatterns.
Article
The rapid evolution of deep neural networks is demanding deep learning (DL) frameworks not only to satisfy the requirement of quickly executing large computations, but also to support straightforward programming models for quickly implementing and experimenting with complex network structures. However, existing frameworks fail to excel in both departments simultaneously, leading to diverged efforts for optimizing performance and improving usability. This paper presents JANUS, a system that combines the advantages from both sides by transparently converting an imperative DL program written in Python, a de-facto scripting language for DL, into an efficiently executable symbolic dataflow graph. JANUS can convert various dynamic features of Python, including dynamic control flow, dynamic types, and impure functions, into elements of a symbolic dataflow graph. Our experiments show that JANUS can achieve fast DL training by exploiting the techniques imposed by symbolic graph-based DL frameworks, while maintaining the simple and flexible programmability of imperative DL frameworks at the same time.
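JANUS itself is a separate system, but TensorFlow's AutoGraph offers an analogous, publicly available illustration of the idea: dynamic Python control flow over tensor values is rewritten into graph control flow when the function is traced.

import tensorflow as tf

@tf.function
def relu_like(x):
    if x > 0:  # dynamic Python control flow on a tensor value
        return x
    return tf.zeros_like(x)

print(relu_like(tf.constant(3.0)).numpy())   # 3.0
print(relu_like(tf.constant(-1.0)).numpy())  # 0.0

# The generated source shows the rewrite into graph-compatible control flow.
print(tf.autograph.to_code(relu_like.python_function))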
Conference Paper
Recent advances in computing technologies and the availability of huge volumes of data have sparked a new machine learning (ML) revolution, where almost every day a new headline touts the demise of human experts by ML models on some task. Open source software development is rumoured to play a significant role in this revolution, with both academics and large corporations such as Google and Microsoft releasing their ML frameworks under an open source license. This paper takes a step back to examine and understand the role of open source development in modern ML, by examining the growth of the open source ML ecosystem on GitHub, its actors, and the adoption of frameworks over time. By mining LinkedIn and Google Scholar profiles, we also examine driving factors behind this growth (paid vs. voluntary contributors), as well as the major players who promote its democratization (companies vs. communities), and the composition of ML development teams (engineers vs. scientists). According to the technology adoption lifecycle, we find that ML is in between the stages of early adoption and early majority. Furthermore, companies are the main drivers behind open source ML, while the majority of development teams are hybrid teams comprising both engineers and professional scientists. The latter correspond to scientists employed by a company, and by far represent the most active profiles in the development of ML applications, which reflects the importance of a scientific background for the development of ML frameworks to complement coding skills. The large influence of cloud computing companies on the development of open source ML frameworks raises the risk of vendor lock-in. These frameworks, while open source, could be optimized for specific commercial cloud offerings.
Conference Paper
Deep learning applications become increasingly popular in important domains such as self-driving systems and facial identity systems. Defective deep learning applications may lead to catastrophic consequences. Although recent research efforts were made on testing and debugging deep learning applications, the characteristics of deep learning defects have never been studied. To fill this gap, we studied deep learning applications built on top of TensorFlow and collected program bugs related to TensorFlow from StackOverflow QA pages and Github projects. We extracted information from QA pages, commit messages, pull request messages, and issue discussions to examine the root causes and symptoms of these bugs. We also studied the strategies deployed by TensorFlow users for bug detection and localization. These findings help researchers and TensorFlow users to gain a better understanding of coding defects in TensorFlow programs and point out a new direction for future research.
Article
Programming languages and platforms improve over time, sometimes resulting in new language features that offer many benefits. However, despite these benefits, developers may not always be willing to adopt them in their projects for various reasons. In this paper, we describe an empirical study where we assess the adoption of a particular new language feature. Studying how developers use (or do not use) new language features is important in programming language research and engineering because it gives designers insight into the usability of the language to create meaningful programs in that language. This knowledge, in turn, can drive future innovations in the area. Here, we explore Java 8 default methods, which allow interfaces to contain (instance) method implementations. Default methods can ease interface evolution, make certain ubiquitous design patterns redundant, and improve both modularity and maintainability. A focus of this work is to discover, through a scientific approach and a novel technique, situations where developers found these constructs useful and where they did not, and the reasons for each. Although several studies center around assessing new language features, to the best of our knowledge, this kind of construct has not been previously considered. Despite their benefits, we found that developers did not adopt default methods in all situations. Our study consisted of submitting pull requests introducing the language feature to 19 real-world, open source Java projects without altering original program semantics. This novel assessment technique is proactive in that the adoption was driven by an automatic refactoring approach rather than waiting for developers to discover and integrate the feature themselves. In this way, we set forth best practices and patterns of using the language feature effectively earlier rather than later and are able to possibly guide (near) future language evolution. We foresee this technique to be useful in assessing other new language features, design patterns, and other programming idioms.
Conference Paper
Correct error handling is essential for building reliable and secure systems. Unfortunately, low-level languages like C often do not support any error handling primitives and leave it up to the developers to create their own mechanisms for error propagation and handling. However, in practice, the developers often make mistakes while writing the repetitive and tedious error handling code and inadvertently introduce bugs. Such error handling bugs often have severe consequences undermining the security and reliability of the affected systems. Fixing these bugs is also tiring—they are repetitive and cumbersome to implement. Therefore, it is crucial to develop tool supports for automatically detecting and fixing error handling bugs. To understand the nature of error handling bugs that occur in widely used C programs, we conduct a comprehensive study of real world error handling bugs and their fixes. Leveraging the knowledge, we then design, implement, and evaluate ErrDoc, a tool that not only detects and characterizes different types of error handling bugs but also automatically fixes them. Our evaluation on five open-source projects shows that ErrDoc can detect error handling bugs with 100% to 84% precision and around 95% recall, and categorize them with 83% to 96% precision and above 90% recall. Thus, ErrDoc improves precision up to 5 percentage points, and recall up to 44 percentage points w.r.t. the state-of-the-art. We also demonstrate that ErrDoc can fix the bugs with high accuracy.
Conference Paper
Modern computer systems are highly-configurable, complicating the testing and debugging process. The sheer size of the configuration space makes the quality of software even harder to achieve. Performance is one of the key aspects of non-functional qualities, where performance bugs can cause significant performance degradation and lead to poor user experience. However, performance bugs are difficult to expose, primarily because detecting them requires specific inputs, as well as a specific execution environment (e.g., configurations). While researchers have developed techniques to analyze, quantify, detect, and fix performance bugs, we conjecture that many of these techniques may not be effective in highly-configurable systems. In this paper, we study the challenges that configurability creates for handling performance bugs. We study 113 real-world performance bugs, randomly sampled from three highly-configurable open-source projects: Apache, MySQL and Firefox. The findings of this study provide a set of lessons learned and guidance to aid practitioners and researchers to better handle performance bugs in highly-configurable software systems.
Conference Paper
To protect sensitive data processed by current applications, developers, whether security experts or not, have to rely on cryptography. While cryptography algorithms have become increasingly advanced, many data breaches occur because developers do not correctly use the corresponding APIs. To guide future research into practical solutions to this problem, we perform an empirical investigation into the obstacles developers face while using the Java cryptography APIs, the tasks they use the APIs for, and the kind of (tool) support they desire. We triangulate data from four separate studies that include the analysis of 100 StackOverflow posts, 100 GitHub repositories, and survey input from 48 developers. We find that while developers find it difficult to use certain cryptographic algorithms correctly, they feel surprisingly confident in selecting the right cryptography concepts (e.g., encryption vs. signatures). We also find that the APIs are generally perceived to be too low-level and that developers prefer more task-based solutions.
Article
Programming languages evolve over time, adding additional language features to simplify common tasks and make the language easier to use. For example, the Java Language Specification has four editions and is currently drafting a fifth. While the addition of language features is driven by an assumed need by the community (often with direct requests for such features), there is little empirical evidence demonstrating how these new features are adopted by developers once released. In this paper, we analyze over 31k open-source Java projects representing over 9 million Java files, which when parsed contain over 18 billion AST nodes. We analyze this corpus to find uses of new Java language features over time. Our study gives interesting insights, such as: there are millions of places features could potentially be used but weren't; developers convert existing code to use new features; and we found thousands of instances of potential resource handling bugs.
Article
Items such as physical exam findings, radiographic interpretations, or other diagnostic tests often rely on some degree of subjective interpretation by observers. Studies that measure the agreement between two or more observers should include a statistic that takes into account the fact that observers will sometimes agree or disagree simply by chance. The kappa statistic (or kappa coefficient) is the most commonly used statistic for this purpose. A kappa of 1 indicates perfect agreement, whereas a kappa of 0 indicates agreement equivalent to chance. A limitation of kappa is that it is affected by the prevalence of the finding under observation. Methods to overcome this limitation have been described.
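A small worked example (the ratings below are hypothetical) computes the statistic directly as observed agreement corrected for chance agreement, kappa = (p_o - p_e) / (1 - p_e); scikit-learn's cohen_kappa_score offers an equivalent off-the-shelf implementation.

from collections import Counter

def cohens_kappa(rater_a, rater_b):
    n = len(rater_a)
    categories = set(rater_a) | set(rater_b)
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n  # raw agreement
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)  # chance agreement
    return (p_observed - p_expected) / (1 - p_expected)

# Two observers labeling ten findings as positive/negative.
a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "pos", "neg", "neg"]
b = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos", "pos", "neg"]
print(cohens_kappa(a, b))  # 1 = perfect agreement, 0 = agreement expected by chance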
Article
A major obstacle to finding program errors in a real system is knowing what correctness rules the system must obey. These rules are often undocumented or specified in an ad hoc manner. This paper demonstrates techniques that automatically extract such checking information from the source code itself, rather than the programmer, thereby avoiding the need for a priori knowledge of system rules. The cornerstone of our approach is inferring programmer "beliefs" that we then cross-check for contradictions. Beliefs are facts implied by code: a dereference of a pointer, p, implies a belief that p is non-null, a call to "unlock(l)" implies that l was locked, etc. For beliefs we know the programmer must hold, such as the pointer dereference above, we immediately flag contradictions as errors. For beliefs that the programmer may hold, we can assume these beliefs hold and use a statistical analysis to rank the resulting errors from most to least likely. For example, a call to "spin_lock" followed once by a call to "spin_unlock" implies that the programmer may have paired these calls by coincidence. If the pairing happens 999 out of 1000 times, though, then it is probably a valid belief and the sole deviation a probable error. The key feature of this approach is that it requires no a priori knowledge of truth: if two beliefs contradict, we know that one is an error without knowing what the correct belief is. Conceptually, our checkers extract beliefs by tailoring rule "templates" to a system -- for example, finding all functions that fit the rule template "<a> must be paired with <b>". We have developed six checkers that follow this conceptual framework. They find hundreds of bugs in real systems such as Linux and OpenBSD. From our experience, they give a dramatic reduction in the manua...
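A toy sketch of the cross-checking idea (not the authors' checkers): if observed call sites pair lock() with unlock() in the vast majority of cases, treat the pairing as a probable rule and rank the rare deviations as likely errors.

def rank_deviations(sites, threshold=0.9):
    # sites: list of booleans, True where lock() was followed by unlock().
    paired = sum(sites)
    support = paired / len(sites)
    if support >= threshold:
        deviations = len(sites) - paired
        return f"rule holds at {support:.1%} of sites; {deviations} probable error(s)"
    return "insufficient support; the pairing may be coincidental"

# 999 of 1,000 sites pair the calls, so the lone deviation is ranked as a probable bug.
print(rank_deviations([True] * 999 + [False]))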