Conference Paper

An Empirical Study of Build Failures in the Docker Context

Abstract

Docker containers have become the de facto industry standard. Docker builds often break, and a large amount of effort is put into troubleshooting broken builds. Prior studies have evaluated the rate at which builds in large organizations fail. However, little is known about the frequency and fix effort of failures that occur in Docker builds of open-source projects. This paper presents a first, preliminary study of 857,086 Docker builds from 3,828 open-source projects hosted on GitHub. Using the Docker build data, we measure the frequency of broken builds and report their fix time. Furthermore, we explore the evolution of Docker build failures over time. Our findings help to characterize and understand Docker build failures and motivate the need for collecting more empirical evidence.
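
To make the abstract's two measurements concrete, the sketch below computes a breakage rate and an approximate fix time from a handful of hypothetical build records; the record layout and the streak-based pairing of failures with the next passing build are illustrative assumptions, not the paper's actual methodology.

    from datetime import datetime

    # Hypothetical build records, ordered by finish time (illustrative only).
    builds = [
        {"status": "passed", "finished_at": datetime(2020, 1, 1, 10, 0)},
        {"status": "failed", "finished_at": datetime(2020, 1, 1, 12, 0)},
        {"status": "failed", "finished_at": datetime(2020, 1, 1, 14, 0)},
        {"status": "passed", "finished_at": datetime(2020, 1, 1, 18, 0)},
    ]

    # Breakage rate: share of builds that did not succeed.
    failed = [b for b in builds if b["status"] == "failed"]
    breakage_rate = len(failed) / len(builds)

    # Fix time (one common approximation): elapsed time from the first
    # failed build of a broken streak to the next passing build.
    fix_times, streak_start = [], None
    for b in builds:
        if b["status"] == "failed" and streak_start is None:
            streak_start = b["finished_at"]
        elif b["status"] == "passed" and streak_start is not None:
            fix_times.append(b["finished_at"] - streak_start)
            streak_start = None

    print(f"breakage rate: {breakage_rate:.1%}")          # 50.0%
    print(f"fix time of the broken streak: {fix_times[0]}")  # 6:00:00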

... Despite the prevalence of build issues from non-contributors, existing research has largely overlooked these challenges, focusing primarily on the perspective of contributors, such as the developers of the project being built. For instance, extensive studies have been conducted on the effort required to maintain build scripts [10,11], the categorization and distribution of bugs in build scripts [3,16,22], the patterns of fixing build scripts [9,23], and how build failures occur in new integration scenarios such as Continuous Integration (CI) chains [24] or Docker environments [21]. It is important to note that build issues experienced by non-contributors may have ...
... Wu et al. [21] studied the characteristics of build failures in the context of docker environments. These existing studies focus on build failure logs and build failure fixes in the commit history. ...
Preprint
Full-text available
Open-source software (OSS) often needs to be built by people who are not contributors. Despite the prevalence of build issues experienced by non-contributors, there is a lack of studies on this topic. This paper presents a study aimed at understanding the symptoms and causes of build issues experienced by non-contributors. The findings highlight certain build issues that are challenging to resolve and underscore the importance of understanding non-contributors' behavior. This work lays the foundation for further research aimed at enhancing the non-contributors' experience in dealing with build issues.
... Rausch et al. (2017) analyzed build failures from 14 open-source Java projects and identified error categories and noisy build data, while Zolfagharinia et al. (2017) studied the impact of OS changes and environments on build failures. Finally, Wu et al. (2020) presented insights on failure frequencies and fix times in Docker build data and studied the evolution of Docker build failures over time. ...
Article
Full-text available
Context Continuous Integration (CI) is a resource-intensive, widely used industry practice. The two most commonly used heuristics to reduce the number of builds are either grouping multiple builds together or skipping builds predicted to be safe. Yet, both techniques have their disadvantages in terms of missed build failures and, respectively, higher build turn-around time (delays). Objective We aim to bring together these two lines of research, empirically comparing their advantages and disadvantages over time, and proposing and evaluating two ways in which these build avoidance heuristics can be combined more effectively, i.e., the ML-CI model based on machine learning and the Timeout Rule. Method We empirically study the trade-off between the reduction in the number of builds required and the speed of recognizing failing builds on a dataset of 79,482 builds from 20 open-source projects. Results We find that both of our hybrid heuristics provide a significant improvement in terms of fewer missed build failures and lower delays than the baseline heuristics. They substantially reduce the turn-around time of commits by 96% in comparison to skipping heuristics; the Timeout Rule also enables a median of 26.10% fewer builds to be scheduled than grouping heuristics. Conclusions Our hybrid approaches offer build engineers greater flexibility in scheduling builds during CI without compromising the quality of the resulting software.
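
As a rough illustration of what a timeout-style rule could look like (an assumption for illustration, not the authors' exact definition), the sketch below skips a commit only when a predictor marks it safe and the last real build is recent enough:

    from datetime import datetime, timedelta

    def should_build(predicted_safe: bool, last_build: datetime, now: datetime,
                     timeout: timedelta = timedelta(hours=4)) -> bool:
        """Timeout-style build-avoidance rule (illustrative sketch)."""
        if not predicted_safe:
            return True                     # never skip a commit predicted risky
        return now - last_build >= timeout  # force a periodic build even if "safe"

    # A commit predicted safe still triggers a build 5 hours after the last one,
    # bounding how long a missed failure can stay hidden.
    print(should_build(True, datetime(2020, 1, 1, 8, 0), datetime(2020, 1, 1, 13, 0)))
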
... Several studies focus on the quality of Dockerfiles. Wu et al. [36] found that up to 85.2% of Docker projects have at least one broken build. They also found that the more frequently developers build projects, the less likely the projects are to fail to build, while the more often builds fail, the longer it takes to fix the issues. ...
Preprint
Full-text available
Docker allows for the packaging of applications and dependencies, and its instructions are described in Dockerfiles. Nowadays, version pinning is recommended to avoid unexpected changes in the latest version of a package. However, version pinning in Dockerfiles is not yet widely practiced (only 17k of the 141k Dockerfiles we analyzed pin versions), because of the difficulties it entails. To maintain Dockerfiles with version-pinned packages, it is important to update package versions, not only for improved functionality, but also for software supply chain security, as packages change to address vulnerabilities and bug fixes. However, when updating multiple version-pinned packages, it is necessary to understand the dependencies between packages and ensure version compatibility, which is not easy. To address this issue, we explore the applicability of the meta-maintenance approach, which aims to distribute successful updates made by part of a group that independently maintains a common artifact. We conduct an exploratory analysis of 7,914 repositories on GitHub that hold Dockerfiles which retrieve packages on GitHub by URLs. There were 385 repository groups with the same combinations of multiple packages, and 208 groups had Dockerfiles with newer version combinations than others, which makes them candidates for meta-maintenance. Our findings support the potential of meta-maintenance for updating multiple version-pinned packages and also reveal future challenges.
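
A minimal sketch of what detecting unpinned versions in a Dockerfile might look like is given below; the heuristics (missing tag or digest in FROM, pip install without ==) and the sample file are illustrative assumptions, not the tooling used in the study.

    def unpinned_lines(dockerfile_text: str) -> list[str]:
        """Flag Dockerfile lines that look unpinned (illustrative heuristics only)."""
        findings = []
        for line in dockerfile_text.splitlines():
            stripped = line.strip()
            parts = stripped.split()
            if not parts:
                continue
            if parts[0].upper() == "FROM" and len(parts) > 1:
                image = parts[1]
                # No tag/digest, or the floating "latest" tag, counts as unpinned.
                if (":" not in image and "@" not in image) or image.endswith(":latest"):
                    findings.append(stripped)
            elif "pip install" in stripped and "==" not in stripped:
                findings.append(stripped)  # package installed without an exact version
        return findings

    example = "FROM python:latest\nRUN pip install flask\nRUN pip install requests==2.25.1"
    print(unpinned_lines(example))  # ['FROM python:latest', 'RUN pip install flask']
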
... It is also a trendy research area, with more than 7,770 research papers published on Docker since the beginning of 2020. In software engineering, a wide variety of Docker studies have investigated specific aspects, including the ecosystem of Docker containers (Cito et al. 2017), Docker build failures (Yiwen et al. 2020), and other Docker topics (Henkel et al. 2020a, b). These studies aimed to examine practical findings and lessons learned by mining a large number of published Dockerfiles. ...
Article
Full-text available
In software development, ad hoc solutions that are intentionally implemented by developers are called self-admitted technical debt (SATD). Because the existence of SATD spreads poor implementations, it is necessary to remove it as soon as possible. Meanwhile, container virtualization has been attracting attention in recent years as a technology to support infrastructure such as servers. Currently, Docker is the de facto standard for container virtualization. In Docker, a file describing how to build a container (Dockerfile) is a set of procedural instructions; thus, it can be considered a kind of source code. Moreover, because Docker is a relatively new technology, few developers have accumulated good or bad practices for building Docker containers. Hence, it is likely that Dockerfiles contain many SATDs, as is the case with the general programming language source code analyzed in previous SATD studies. The goal of this paper is to categorize SATDs in Dockerfiles and to share knowledge with developers and researchers. To achieve this goal, we conducted a manual classification of SATDs in Dockerfiles. We found that about 3.0% of the comments in Dockerfiles are SATD. In addition, we classified SATDs into five classes and eleven subclasses. Among them, there are some SATDs specific to Docker, such as SATDs for version fixing and for integrity checks. The three most common classes of SATD were related to lowering maintainability, testing, and defects.
... Similar to the concept of inheritance in object-oriented programming, a Docker image can use the FROM instruction to inherit the image definition of another base image [15]. The new image will inherit all the attributes and files encapsulated in the base image [13]. In practice, to find a suitable image, developers often need to manually search for a base image in Docker registries, e.g., Docker Hub. ...
Preprint
Full-text available
Docker containers are being widely used in large-scale industrial environments. In practice, developers must manually specify the base image in the dockerfile when creating a container. However, finding a proper base image is a nontrivial task because manual searching is time-consuming and easily leads to the use of unsuitable base images, especially for newcomers. There is still a lack of automatic approaches for recommending a related base image to developers based on the dockerfile configuration. To tackle this problem, this paper makes the first attempt to propose a neural network approach named DCCimagerec, which is based on deep configuration comprehension. It aims to use the structural configuration features of the dockerfile, extracted by an AST and a path-attention model, to recommend a potentially suitable base image. The evaluation experiments based on about 83,000 dockerfiles show that DCCimagerec outperforms multiple baselines, improving Precision by 7.5%-67.5%, Recall by 6.2%-106.6%, and F1 by 7.5%-150.2%.
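
The following toy sketch extracts the kind of structural input such a recommender might start from, namely the declared base image and the sequence of instruction keywords in a dockerfile; it is a simplified stand-in for the AST and path-attention features described above, not the DCCimagerec pipeline itself.

    def dockerfile_features(text: str) -> dict:
        """Toy structural features of a Dockerfile: base image and instruction order."""
        base_image, instructions = None, []
        for line in text.splitlines():
            stripped = line.strip()
            if not stripped or stripped.startswith("#"):
                continue  # skip blanks and comments
            keyword = stripped.split()[0].upper()
            instructions.append(keyword)
            if keyword == "FROM" and base_image is None:
                base_image = stripped.split()[1]
        return {"base_image": base_image, "instruction_sequence": instructions}

    example = "FROM node:14-alpine\nWORKDIR /app\nCOPY package.json .\nRUN npm install\nCMD [\"npm\", \"start\"]"
    print(dockerfile_features(example))
    # {'base_image': 'node:14-alpine', 'instruction_sequence': ['FROM', 'WORKDIR', 'COPY', 'RUN', 'CMD']}
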
... Today, source code is essentially required to have Continuous Integration and Continuous Delivery/Deployment (CI/CD) to target environments because automation has become a norm in software development practices [5]. To carry out CI/CD, GitHub, as one of the most widely used source code repository providers, has integrated many third-party CI/CD tools, such as Travis CI [34], Jenkins [1] and Docker [28]. It is becoming increasingly clear that making a choice of CI/CD tools and infrastructures is a tough task [21]. ...
Preprint
Automation has become a norm in software development practices, especially in CI/CD. Recently, GitHub introduced GitHub Actions (GA) to provide automated workflows for software maintainers. However, little research has evaluated its capability and impact, even though many GA workflows have been built by practitioners. In this paper, we conduct a large-scale empirical study of GitHub projects to help practitioners gain deep insights into GA. We quantitatively investigate the basic adoption of GA and its potential correlation with project properties. We also analyze the usage details of GA, including its component scale and action sequences. Finally, using regression modeling, we investigate the impact of GA on commit frequency and on the resolution efficiency of pull requests and issues. Our findings suggest a nuanced picture of how practitioners are adapting to, and benefiting from, GA.
... However, the CI/CD pipeline execution is often blocked by Docker build failures due to the discrepancy between the local environment and that inside a Docker image. Previous work found that 17.8% of historical Docker builds failed in the sampled open-source projects [5]. In the deep learning community, it was reported that 39.4% of the Docker-based jobs failed because of accessing a non-existent file or directory [6]. ...
Preprint
Continuous Integration (CI) and Continuous Deployment (CD) are widely adopted in software engineering practice. In reality, the CI/CD pipeline execution is not yet reliably continuous because it is often interrupted by Docker build failures. However, the existing trial-and-error practice of detecting faults is time-consuming. To detect Dockerfile faults in a timely manner, we propose a context-based pre-build analysis approach, named DockerMock, that mocks the execution of common Dockerfile instructions. A Dockerfile fault is declared when an instruction conflicts with the approximated and accumulated running context. By explicitly keeping track of whether the context is fuzzy, DockerMock strikes a good balance between detection precision and recall. We evaluated DockerMock with 53 faults in 41 Dockerfiles from open-source projects on GitHub and 130 faults in 105 Dockerfiles from student course projects. On average, DockerMock detected 68.0% of the Dockerfile faults in these two datasets, while the baseline hadolint detected 6.5% and the baseline BuildKit detected 60.5%, without instruction execution. In the GitHub dataset, DockerMock reduces the number of builds to 47, outperforming hadolint (73) and BuildKit (74).
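
The gist of such a context-based pre-build check can be sketched as follows; this toy version only mocks WORKDIR and COPY and checks sources against an approximated build context, and it is not DockerMock's actual implementation.

    def prebuild_check(dockerfile_lines: list[str], context_files: set[str]) -> list[str]:
        """Mock a few Dockerfile instructions and report context conflicts (toy sketch)."""
        faults, workdir = [], "/"
        for line in dockerfile_lines:
            parts = line.split()
            if not parts:
                continue
            if parts[0] == "WORKDIR":
                workdir = parts[1]            # accumulate the running context
            elif parts[0] == "COPY":
                src = parts[1]
                if src not in context_files:  # conflict with the approximated context
                    faults.append(f"COPY into {workdir}: source '{src}' not in build context")
        return faults

    # requirements.txt is missing from the (approximated) build context, so the
    # fault is reported before any real, slow `docker build` is attempted.
    lines = ["FROM python:3.9", "WORKDIR /app", "COPY requirements.txt .", "COPY app.py ."]
    print(prebuild_check(lines, context_files={"app.py"}))
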
... We found a comparable breakage rate, but have also developed methods aimed at making repairs instead of just analyzing quality. More recently, Wu et al. [34] conducted a comprehensive study of build failures in Dockerfiles. They analyzed a total of 3,828 GitHub projects containing Dockerfiles and a total of 857,086 Docker builds. ...
Preprint
Docker is a tool for lightweight OS-level virtualization. Docker images are created by performing a build, controlled by a source-level artifact called a Dockerfile. We studied Dockerfiles on GitHub, and -- to our great surprise -- found that over a quarter of the examined Dockerfiles failed to build (and thus to produce images). To address this problem, we propose SHIPWRIGHT, a human-in-the-loop system for finding repairs to broken Dockerfiles. SHIPWRIGHT uses a modified version of the BERT language model to embed build logs and to cluster broken Dockerfiles. Using these clusters and a search-based procedure, we were able to design 13 rules for making automated repairs to Dockerfiles. With the aid of SHIPWRIGHT, we submitted 45 pull requests (with a 42.2% acceptance rate) to GitHub projects with broken Dockerfiles. Furthermore, in a "time-travel" analysis of broken Dockerfiles that were later fixed, we found that SHIPWRIGHT proposed repairs that were equivalent to human-authored patches in 22.77% of the cases we studied. Finally, we compared our work with recent, state-of-the-art, static Dockerfile analyses, and found that, while static tools detected possible build-failure-inducing issues in 20.6--33.8% of the files we examined, SHIPWRIGHT was able to detect possible issues in 73.25% of the files and, additionally, provide automated repairs for 18.9% of the files.
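
A heavily simplified sketch of the clustering step is given below; it substitutes TF-IDF vectors and k-means for SHIPWRIGHT's modified BERT embeddings, so it only illustrates the general idea of grouping similar failure logs, not the system's actual pipeline.

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Toy build-failure logs; real logs are far longer and noisier.
    logs = [
        "E: Unable to locate package libfoo-dev",
        "E: Unable to locate package libbar-dev",
        "COPY failed: stat /src/app.jar: no such file or directory",
        "COPY failed: stat /src/config.yml: no such file or directory",
    ]

    # Embed the logs (TF-IDF here instead of a modified BERT) and cluster them,
    # so that logs with the same failure pattern end up in the same group.
    vectors = TfidfVectorizer().fit_transform(logs)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
    print(labels)  # e.g. [0 0 1 1]: one cluster per failure pattern
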
Article
Fuzzing is an automated software testing technique used to find software vulnerabilities; it works by sending large numbers of inputs to a software system to trigger bad behaviors. In recent years, the open-source software ecosystem has seen a significant increase in the adoption of fuzzing to avoid spreading vulnerabilities throughout the ecosystem. While fuzzing can uncover vulnerabilities, there is currently a lack of knowledge regarding the challenges of conducting fuzzing activities over time. Specifically, fuzzers are very complex tools to set up and build before they can be used. We set out to empirically find out how challenging build maintenance is in the context of fuzzing. We mine over 1.2 million build logs from Google's OSS-Fuzz service to investigate fuzzing build failures. We first conduct a quantitative analysis to quantify the prevalence of fuzzing build failures. We then manually investigate 677 failing fuzzing build logs and establish a taxonomy of 25 root causes of build failures. We finally train a machine learning model to recognize common failure patterns in failing build logs. Our taxonomy can serve as a reference for practitioners conducting fuzzing build maintenance. Our modeling experiment shows the potential of using automation to simplify the process of fuzzing.
Article
Docker is a containerization technology that allows developers to ship software applications along with their dependencies in Docker images. Developers can extend existing images, using them as base images when writing Dockerfiles. However, many alternative, functionally equivalent base images are available. While many studies define and evaluate quality features that can be extracted from Docker artifacts, it is still unclear on what criteria developers choose one base image over another. In this paper, we aim to fill this gap. First, we conduct a literature review through which we define a taxonomy of quality features, identifying two main groups: configuration-related features (i.e., mainly related to the Dockerfile and image build process) and externally observable features (i.e., what the Docker image users can observe). Second, we ran an empirical study considering developers' preference for 2,441 Docker images in 1,911 open-source software projects. We want to understand (i) how the externally observable features influence developers' preferences, and (ii) how they are related to the configuration-related features. Our results pave the way to the definition of a reliable quality measure for Docker artifacts, along with tools that support developers in quality-aware development of such artifacts.
Article
Full-text available
Many software engineering research papers rely on time-based data (e.g., commit timestamps, issue report creation/update/close dates, release dates). Like most real-world data, however, time-based data is often dirty. To date, there are no studies that quantify how frequently such data is used by the software engineering research community, or that investigate sources of dirty data and quantify how often such data is dirty. Depending on the research task and method used, including such dirty data could affect the research results. This paper presents an extended survey of papers that utilize time-based data, published in the Mining Software Repositories (MSR) conference series. Out of the 754 technical track and data papers published in MSR 2004–2021, at least 290 (38%) utilized time-based data. We also observed that most time-based data used in research papers comes in the form of Git commits, often from GitHub. Based on those results, we then used the Boa and Software Heritage infrastructures to help identify and quantify several sources of dirty Git timestamp data. Finally, we provide guidelines and best practices for researchers utilizing time-based data from Git repositories.
Conference Paper
Docker building is a critical component of containerized workflows, automating the process by which sources are packaged and transformed into container images. If not run properly, Docker builds can incur long durations (i.e., slow builds), which increase the cost in human and computing resources and thus inevitably affect software development. However, the current status of, and remedies for, the duration cost of Docker builds remain unclear and need an in-depth study. To fill this gap, this paper provides the first empirical investigation of 171,439 Docker builds from 5,833 open-source software (OSS) projects. In an exploratory study, we characterize Docker build durations in real-world projects and obtain developers' perceptions of slow builds via a comprehensive survey. Driven by the results of the exploratory study, we propose a prediction model of Docker build duration, leveraging 27 handcrafted features from build-related context and configuration and 8 regression algorithms for the prediction task. Our results demonstrate that the Random Forest model provides the best performance, with a Spearman's correlation of 0.781, outperforming the baseline random model by 82.9% in RMSE, 90.6% in MAE, and 94.4% in MAPE, respectively. The implications of this study will facilitate research and assist practitioners in improving the Docker build process.
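
A minimal sketch of the modeling step, using synthetic data in place of the study's 27 handcrafted features, could look as follows; the feature set, sample size, and coefficients are illustrative assumptions rather than the study's setup.

    import numpy as np
    from scipy.stats import spearmanr
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)

    # Synthetic stand-ins for build-related features (e.g., Dockerfile size,
    # number of instructions, base image size); the study uses 27 such features.
    X = rng.normal(size=(500, 5))
    y = 30 + 10 * X[:, 0] - 5 * X[:, 1] + rng.normal(scale=3, size=500)  # duration (s)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

    rho, _ = spearmanr(y_test, model.predict(X_test))
    print(f"Spearman correlation on held-out builds: {rho:.3f}")
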
Article
Full-text available
The Dockerfile plays an important role in the Docker-based software development process, but in practice many Dockerfiles are affected by smells. Understanding the occurrence of Dockerfile smells in open-source software can benefit Dockerfile practice and enhance project maintenance. In this paper, we perform an empirical study on a large dataset of 6,334 projects to help developers gain insights into the occurrence of Dockerfile smells, including their coverage, distribution, co-occurrence, and correlation with project characteristics. Our results show that smells are very common in Dockerfiles and that different types of Dockerfile smells co-occur. Further, using linear regression analysis and controlling for various variables, we statistically identify and quantify the relationships between the occurrence of Dockerfile smells and project characteristics. We also provide a rich set of implications for software practitioners.
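
The kind of controlled linear regression described above can be sketched on synthetic data as follows; the covariates and coefficients are invented for illustration and do not reflect the study's actual variables or estimates.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(1)
    n = 300

    # Synthetic project-level data: smell count vs. illustrative characteristics.
    df = pd.DataFrame({
        "dockerfile_lines": rng.integers(5, 120, n),
        "contributors": rng.integers(1, 50, n),
        "project_age_years": rng.uniform(0.5, 10, n),
    })
    df["smell_count"] = (0.05 * df["dockerfile_lines"]
                         + 0.02 * df["contributors"]
                         + rng.normal(scale=1.0, size=n)).clip(lower=0)

    # OLS with multiple covariates: each coefficient is estimated while
    # controlling for the other variables in the model.
    model = smf.ols("smell_count ~ dockerfile_lines + contributors + project_age_years",
                    data=df).fit()
    print(model.params)
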
Conference Paper
Full-text available
Continuous Integration (CI) and Continuous Delivery (CD) are widespread in both industrial and open-source software (OSS) projects. Recent research characterized build failures in CI and identified factors potentially correlated to them. However, most observations and findings of previous work are exclusively based on OSS projects or data from a single industrial organization. This paper provides a first attempt to compare the CI processes and occurrences of build failures in 349 Java OSS projects and 418 projects from a financial organization, ING Nederland. Through the analysis of 34,182 failing builds (26% of the total number of observed builds), we derived a taxonomy of failures that affect the observed CI processes. Using cluster analysis, we observed that in some cases OSS and ING projects share similar build failure patterns (e.g., few compilation failures as compared to frequent testing failures), while in other cases completely different patterns emerge. In short, we explain how OSS and ING CI processes exhibit commonalities, yet are substantially different in their design and in the failures they report.
Conference Paper
Full-text available
Software development has always inherently required multitasking: developers switch between coding, reviewing, testing, designing, and meeting with colleagues. The advent of software ecosystems like GitHub has enabled something new: the ability to easily switch between projects. Developers also have social incentives to contribute to many projects; prolific contributors gain social recognition and (eventually) economic rewards. Multitasking, however, comes at a cognitive cost: frequent context switches can lead to distraction, sub-standard work, and even greater stress. In this paper, we gather ecosystem-level data on a group of programmers working on a large collection of projects. We develop models and methods for measuring the rate and breadth of a developer's context-switching behavior, and we study how context switching affects their productivity. We also survey developers to understand the reasons for and perceptions of multitasking. We find that the most common reason for multitasking is interrelationships and dependencies between projects. Notably, we find that both the rate of switching and the breadth (number of projects) of a developer's work matter. Developers who work on many projects have higher productivity if they focus on few projects per day. Developers who switch projects too much during the course of a day have lower productivity as they work on more projects overall. Despite these findings, developers' perceptions of the benefits of multitasking are varied.
Conference Paper
Continuous Integration (CI) is a widely used software development practice to reduce risks. CI builds often break, and a large amount of effort is put into troubleshooting broken builds. Although compiler errors have been recognized as one of the most frequent types of build failures, little is known about the common types, fix effort, and fix patterns of compiler errors that occur in CI builds of open-source projects. To fill this gap, we present a large-scale empirical study on 6,854,271 CI builds from 3,799 open-source Java projects hosted on GitHub. Using the build data, we measured the frequency of broken builds caused by compiler errors, investigated the ten most common compiler error types, and reported their fix time. We manually analyzed 325 broken builds to summarize fix patterns of the ten most common compiler error types. Our findings help to characterize and understand compiler errors during CI and provide practical implications to developers, tool builders, and researchers.
Article
Docker, as a de facto industry standard, enables the packaging of an application with all its dependencies and execution environment in a lightweight, self-contained unit, i.e., a container. By launching a container from a Docker image, developers can easily share the same operating system, libraries, and binaries. As the configuration file, the dockerfile plays an important role because it defines the specific Docker image architecture and build order. As a project progresses through its development stages, the content of the dockerfile may be revised many times. This dockerfile evolution is indicative of how the project infrastructure varies over time, and different projects can exhibit different evolutionary trajectories. However, projects with similar goals and needs may converge to more similar trajectories than more disparate projects. Identifying software projects that have undergone similar changes can be very important for the discovery and implementation of best practices when adopting new tools and pipelines, especially in the DevOps software development paradigm. The potential to implement best practices through the analysis of dockerfile evolutionary trajectories motivated this work. This research studied dockerfile longitudinal changes at a large scale and presented a clustering-based approach for mining convergent evolutionary trajectories. An empirical study of 2,840 projects was conducted, and six distinct clusters of dockerfile evolutionary trajectories were found. Furthermore, each cluster was summarized, accompanied by case studies, and the differences between clusters were discussed. The proposed approach quantifies distinct dockerfile evolution modes and reflects the learning curves of project maintainers, which benefits future project maintenance. Also, the proposed approach is generic and can be used to study the evolution of general infrastructure configuration files.
Conference Paper
Continuous deployment (CD) is a software development practice aimed at automating delivery and deployment of a software product, following any changes to its code. If properly implemented, CD together with other automation in the development process can bring numerous benefits, including higher control and flexibility over release schedules, lower risks, fewer defects, and easier on-boarding of new developers. Here we focus on the (r)evolution in CD workflows caused by containerization, the virtualization technology that enables packaging an application together with all its dependencies and execution environment in a light-weight, self-contained unit, of which Docker has become the de-facto industry standard. There are many available choices for containerized CD workflows, some more appropriate than others for a given project. Owing to cross-listing of GitHub projects on Docker Hub, in this paper we report on a mixed-methods study to shed light on developers' experiences and expectations with containerized CD workflows. Starting from a survey, we explore the motivations, specific workflows, needs, and barriers with containerized CD. We find two prominent workflows, based on the automated builds feature on Docker Hub or continuous integration services, with different trade-offs. We then propose hypotheses and test them in a large-scale quantitative study.
Conference Paper
Docker containers are standardized, self-contained units of applications, packaged with their dependencies and execution environment. The environment is defined in a Dockerfile that specifies the steps to reach a certain system state as infrastructure code, with the aim of enabling reproducible builds of the container. To lay the groundwork for research on infrastructure code, we collected structured information about the state and the evolution of Dockerfiles on GitHub and release it as a PostgreSQL database archive (over 100,000 unique Dockerfiles in over 15,000 GitHub projects). Our dataset enables answering a multitude of interesting research questions related to different kinds of software evolution behavior in the Docker ecosystem.
Conference Paper
Build systems translate sources into deliverables. Developers execute builds on a regular basis in order to integrate their personal code changes into testable deliverables. Prior studies have evaluated the rate at which builds in large organizations fail. A recent study at Google has analyzed (among other things) the rate at which builds in developer workspaces fail. In this paper, we replicate the Google study in the Visual Studio context of the MSR challenge. We extract and analyze 13,300 build events, observing that builds are failing 67%–76% less frequently and are fixed 46%–78% faster in our study context. Our results suggest that build failure rates are highly sensitive to contextual factors. Given the large number of factors by which our study contexts differ (e.g., system size, team size, IDE tooling, programming languages), it is not possible to trace the root cause for the large differences in our results. Additional data is needed to arrive at more complete conclusions.
Conference Paper
Continuous Integration (CI) is a cornerstone of modern quality assurance, providing on-demand builds (compilation and tests) of code changes or software releases. Despite the myriad of CI tools and frameworks, the basic activity of interpreting build results is not straightforward, not only because of the number of builds being performed but also, and especially, because of the phenomenon of build inflation, whereby one code change can be built on dozens of different operating systems, run-time environments, and hardware architectures. As existing work has mostly ignored this inflation, this paper performs a large-scale empirical study of the impact of OS and run-time environment on build failures across 30 million builds of the CPAN ecosystem's CI environment. We observe the evolution of build failures over time and investigate the impact of OSes and environments on build failures. We show that distributions may fail differently on different OSes and environments and, thus, that the results of CI require careful filtering and selection to identify reliable failure data.
Article
In episode 217 of Software Engineering Radio, host Charles Anderson talks with James Turnbull, a software developer and security specialist who's vice president of services at Docker. Lightweight Docker containers are rapidly becoming a tool for deploying microservice-based architectures.
Article
Building is an integral part of the software development process. However, little is known about the compiler errors that occur in this process. In this paper, we present an empirical study of 26.6 million builds produced during a period of nine months by thousands of developers. We describe the workflow through which those builds are generated, and we analyze failure frequency, compiler error types, and resolution efforts to fix those compiler errors. The results provide insights into how a large organization's build process works and pinpoint the errors for which further developer support would be most effective.