Managing Larger Data on a GitHub Repository
Carl Boettiger1
1University of California, Berkeley
DOI: 10.21105/joss.00971
Software
•Review
•Repository
•Archive
Submitted: 21 September 2018
Published: 24 September 2018
License
Authors of papers retain copyright and release the work under a Creative Commons Attribution 4.0 International License (CC-BY).
Piggyback: Working with larger data in GitHub
GitHub has become a central component for preserving and sharing software-driven analysis in academic research (Ram, 2013). As scientists adopt this workflow, a desire to manage data associated with the analysis in the same manner soon emerges. While small data can easily be committed to GitHub repositories alongside source code and analysis scripts, files larger than 50 MB cannot. Existing work-arounds introduce significant complexity and break the ease of sharing (Boettiger, 2018a).
This package provides a simple work-around by allowing larger (up to 2 GB) data files to piggyback on a repository as assets attached to individual GitHub releases. piggyback provides a workflow similar to Git LFS (“Git LFS,” 2018), in which data files can be tracked by type and pushed and pulled to GitHub with dedicated commands. These files are not handled by git in any way, but are instead uploaded, downloaded, or edited directly through calls to the GitHub API (“GitHub API version 3,” 2018). These data files can be versioned manually by creating different releases. This approach works equally well with public or private repositories. Data can be uploaded and downloaded programmatically from scripts. No authentication is required to download data from public repositories.
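As a brief sketch of this release-based approach (the `repo` and `tag` arguments follow the package's documented interface; the repository name below is a hypothetical placeholder):

```r
library(piggyback)

# Download all data files attached to the latest release of a public
# repository; no GITHUB_TOKEN is needed for public data:
pb_download(repo = "user/repo")

# Pin a specific, manually versioned release by its tag:
pb_download(repo = "user/repo", tag = "v0.0.1")
```

Because each release tag names an independent set of attached assets, creating a new release for each data revision gives a coarse but simple form of data versioning.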
Examples
As long as a repository has at least one release, users can upload a set of specified files from the current repository to that release simply by passing the file names to pb_upload(). Specify individual files to download using pb_download(), or call it with no arguments to download all data files attached to the latest release. Alternatively, users can track files matching a given pattern: for instance, pb_track("*.csv") will track all *.csv files in the repository. Then use pb_upload(pb_track()) to upload all currently tracked files. piggyback compares timestamps to avoid unnecessary transfers. The piggyback package looks for the same GITHUB_TOKEN environment variable for authentication that is used across GitHub APIs. Details are provided in an introductory vignette (Boettiger, 2018b).
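The workflow above might look like the following in an interactive R session (the file name is hypothetical; pb_upload() and pb_download() operate on the latest release of the current repository by default):

```r
library(piggyback)

# Upload an individual file to the latest release:
pb_upload("data/observations.csv")

# Track all csv files in the repository, then upload everything
# currently tracked (timestamps are compared, so unchanged files
# are not re-transferred):
pb_track("*.csv")
pb_upload(pb_track())

# Later, or on another machine: download all attached data files
pb_download()
```

For private repositories, authentication relies on the GITHUB_TOKEN environment variable, which can be set for a session with Sys.setenv() or persistently in a .Renviron file.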
References
Boettiger, C. (2018a). Piggyback comparison to alternatives. Retrieved from https://
ropensci.github.io/piggyback/articles/alternatives.html
Boettiger, C. (2018b). Piggyback Data atop your GitHub Repository! Retrieved from
https://ropensci.github.io/piggyback/articles/intro.html
Git LFS. (2018). Retrieved from https://git-lfs.github.com/
GitHub API version 3. (2018). Retrieved from https://developer.github.com/v3/
Ram, K. (2013). Git can facilitate greater reproducibility and increased transparency in science. Source Code for Biology and Medicine, 8(1), 7. doi:10.1186/1751-0473-8-7

Boettiger, (2018). Managing Larger Data on a GitHub Repository. Journal of Open Source Software, 3(29), 971. https://doi.org/10.21105/joss.00971