
Managing Larger Data on a GitHub Repository

Authors:
Carl Boettiger¹
¹University of California, Berkeley
DOI: 10.21105/joss.00971
Submitted: 21 September 2018
Published: 24 September 2018
License: Authors of papers retain copyright and release the work under a Creative Commons Attribution 4.0 International License (CC-BY).
Piggyback: Working with larger data in GitHub
GitHub has become a central component for preserving and sharing software-driven analysis in academic research (Ram, 2013). As scientists adopt this workflow, a desire to manage data associated with the analysis in the same manner soon emerges. While small data can easily be committed to GitHub repositories alongside source code and analysis scripts, files larger than 50 MB cannot. Existing workarounds introduce significant complexity and break the ease of sharing (Boettiger, 2018a).
This package provides a simple workaround by allowing larger (up to 2 GB) data files to piggyback on a repository as assets attached to individual GitHub releases. piggyback provides a workflow similar to Git LFS (“Git LFS,” 2018), in which data files can be tracked by type and pushed and pulled to GitHub with dedicated commands. These files are not handled by git in any way, but instead are uploaded, downloaded, or edited directly by calls through the GitHub API (“GitHub API version 3,” 2018). These data files can be versioned manually by creating different releases. This approach works equally well with public or private repositories. Data can be uploaded and downloaded programmatically from scripts. No authentication is required to download data from public repositories.
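For instance, a minimal sketch of fetching data attached to a public repository's latest release (the repository name here is a placeholder, not a real example repository):

```r
library(piggyback)

# Download every data file attached to the latest release of a public
# repository; no GITHUB_TOKEN is needed because the repository is public.
# "user/repo" is a placeholder name.
pb_download(repo = "user/repo", tag = "latest", dest = ".")
```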
Examples
As long as a repository has at least one release, users can upload a set of specified files from the current repository to that release by simply passing the file names to pb_upload(). Specify individual files to download using pb_download(), or use no arguments to download all data files attached to the latest release. Alternatively, users can track files by a given pattern: for instance, pb_track("*.csv") will track all *.csv files in the repository. Then use pb_upload(pb_track()) to upload all currently tracked files. piggyback compares timestamps to avoid unnecessary transfer. The piggyback package looks for the same GITHUB_TOKEN environment variable for authentication that is used across GitHub APIs. Details are provided in an introductory vignette (Boettiger, 2018b).
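The calls described above can be combined into a short script. The following sketch assumes it is run from within a local clone of a repository that already has at least one GitHub release; the file names are illustrative only:

```r
library(piggyback)

# Upload a single data file to the latest release of the current repository.
pb_upload("data/observations.csv.gz")

# Track all csv files in the repository, then upload whatever is tracked.
pb_track("*.csv")
pb_upload(pb_track())

# Download one specific file, or call pb_download() with no arguments to
# fetch every data file attached to the latest release.
pb_download("data/observations.csv.gz")
pb_download()
```

Uploads, and any access to private repositories, authenticate through the GITHUB_TOKEN environment variable noted above.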
References
Boettiger, C. (2018a). Piggyback comparison to alternatives. Retrieved from https://ropensci.github.io/piggyback/articles/alternatives.html
Boettiger, C. (2018b). Piggyback Data atop your GitHub Repository! Retrieved from https://ropensci.github.io/piggyback/articles/intro.html
Git LFS. (2018). Retrieved from https://git-lfs.github.com/
GitHub API version 3. (2018). Retrieved from https://developer.github.com/v3/
Ram, K. (2013). Git can facilitate greater reproducibility and increased transparency in science. Source Code for Biology and Medicine, 8(1), 7. doi:10.1186/1751-0473-8-7