Ram Source Code for Biology and Medicine 2013, 8:7
http://www.scfbm.org/content/8/1/7
BRIEF REPORTS Open Access
Git can facilitate greater reproducibility and
increased transparency in science
Karthik Ram
Correspondence: karthik.ram@berkeley.edu
Environmental Science, Policy, and Management, University of California, Berkeley, Berkeley, CA 94720, USA
Abstract
Background: Reproducibility is the hallmark of good science. Maintaining a high degree of transparency in scientific
reporting is essential not just for gaining trust and credibility within the scientific community but also for facilitating
the development of new ideas. Sharing data and computer code associated with publications is becoming
increasingly common, motivated partly in response to data deposition requirements from journals and mandates
from funders. Despite this increase in transparency, it is still difficult to reproduce or build upon the findings of most
scientific publications without access to a more complete workflow.
Findings: Version control systems (VCS), which have long been used to maintain code repositories in the software
industry, are now finding new applications in science. One such open source VCS, Git, provides a lightweight yet
robust framework that is ideal for managing the full suite of research outputs such as datasets, statistical code, figures,
lab notes, and manuscripts. For individual researchers, Git provides a powerful way to track and compare versions,
retrace errors, explore new approaches in a structured manner, while maintaining a full audit trail. For larger
collaborative efforts, Git and Git hosting services make it possible for everyone to work asynchronously and merge
their contributions at any time, all the while maintaining a complete authorship trail. In this paper I provide an
overview of Git along with use-cases that highlight how this tool can be leveraged to make science more
reproducible and transparent, foster new collaborations, and support novel uses.
Keywords: Reproducible research, Version control, Open science
Findings
Introduction
Reproducible science provides the critical standard by
which published results are judged and central findings
are either validated or refuted [1]. Reproducibility also
allows others to build upon existing work and use it to test
new ideas and develop methods. Advances over the years
have resulted in the development of complex methodologies that allow us to collect ever increasing amounts of data. While repeating expensive studies to validate findings is often difficult, a whole host of other reasons have
contributed to the problem of reproducibility [2,3]. One
such reason has been the lack of detailed access to underlying data and statistical code used for analysis, which
can provide opportunities for others to verify findings
[4,5]. In an era rife with costly retractions, scientists have an increasing burden to be more transparent in order
to maintain their credibility [6]. While post-publication
sharing of data and code is on the rise, driven in part by
funder mandates and journal requirements [7], access to
such research outputs is still not very common [8,9]. By
sharing detailed and versioned copies of one’s data and
code researchers can not only ensure that reviewers can
make well-informed decisions, but also provide opportunities for such artifacts to be repurposed and brought to
bear on new research questions.
Opening up access to the data and software, not just the final publication, is one of the goals of the open science movement. Such sharing can lower barriers and serve as
a powerful catalyst to accelerate progress. In the era of
limited funding, there is a need to leverage existing data
and code to the fullest extent to solve both applied and
basic problems. This requires that scientists share their
research artifacts more openly, with reasonable licenses
that encourage fair use while providing credit to original authors [10]. Besides overcoming social challenges to
these issues, existing technologies can also be leveraged to
increase reproducibility.
All scientists use version control in one form or another
at various stages of their research projects, from the data
collection all the way to manuscript preparation. This
process is often informal and haphazard, where multiple
revisions of papers, code, and datasets are saved as duplicate copies with uninformative file names (e.g. draft_1.doc, draft_2.doc). As authors receive new data and feedback from peers and collaborators, maintaining those versions and merging changes can result in an unmanageable proliferation of files. One solution to these problems would be to use a formal Version Control System (VCS); such systems have long been used in the software industry to manage code. A key feature common to all types of VCS is the ability to save versions of files during development along with informative comments, which are referred to as commit messages. Every change and accompanying notes are stored independently of the files, which obviates the need for duplicate copies. Commits serve as checkpoints where individual files or an entire project can be safely reverted to when necessary. Most traditional VCS are centralized, which means that they require a connection to a central server which maintains the master copy. Users with appropriate privileges can check out copies, make changes, and upload them back to the server.
Among the suite of version control systems currently
available, Git stands out in particular because it offers
features that make it desirable for managing artifacts of
scientific research. The most compelling feature of Git
is its decentralized and distributed nature. Every copy
of a Git repository can serve either as the server (a
central point for synchronizing changes) or as a client.
This ensures that there is no single point of failure.
Authors can work asynchronously without being connected to a central server and synchronize their changes when possible. This is particularly useful when working from remote field sites where internet connections
are often slow or non-existent. Unlike other VCS, every
copy of a Git repository carries a complete history of
all changes, including authorship, that can be viewed
and searched by anyone. This feature allows new authors
to build from any stage of a versioned project. Git
also has a small footprint and nearly all operations
occur locally.
By using a formal VCS, researchers can not only increase their own productivity but also make it easier for others to fully understand, use, and build upon their contributions. In the rest of the paper I describe how Git can be
tions. In the rest of the paper I describe how Git can be
used to manage common science outputs and move on to
describing larger use-cases and benefits of this workflow.
Readers should note that I do not aim to provide a comprehensive review of version control systems or even Git itself. There are also other comparable alternatives such as Mercurial and Bazaar which provide many of the features described below. My goal here is to broadly outline some of the advantages of using one such system and how it can benefit individual researchers, collaborative efforts,
and the wider research community.
How Git can track various artifacts of a research effort
Before delving into common use-cases, I first describe
how Git can be used to manage familiar research outputs
such as data, code used for statistical analyses, and documents. Git can be used to manage them not just separately
but also in various combinations for different use cases
such as maintaining lab notebooks, lectures, datasets, and
manuscripts.
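As a concrete starting point, the following is a minimal sketch run from a shell, assuming Git is installed; the file names are hypothetical and stand in for any project files:

    # create a repository in the project directory
    git init
    # stage and commit the initial data and analysis script
    git add field_data.csv analysis.R
    git commit -m "Add raw field data and first draft of analysis script"
    # review the history of commits (checkpoints) at any time
    git log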
Manuscripts and notes
Version control can operate on any file type including
ones most commonly used in academia such as Microsoft
Word. However, since these file types are binary, Git cannot examine the contents and highlight sections that have changed between revisions. In such cases, one would have to rely solely on commit messages or scan through file contents. The full power of Git can best be leveraged when working with plain-text files. These include data stored in non-proprietary spreadsheet formats (e.g. comma separated files versus xls), scripts from programming languages, and manuscripts stored in plain text formats
(LaTeX and markdown versus Word documents). With
such formats, Git not only tracks versions but can also
highlight which sections of a file have changed.
In Microsoft Word documents the track changes feature
is often used to solicit comments and feedback. Once
those comments and changes have either been accepted
or rejected, any record of their existence also disappears
forever. When changes are submitted using Git, a permanent record of author contributions remains in the version history and is available in every copy of the repository.
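For example, when a manuscript is kept in a plain-text format, the exact lines that changed between versions can be recalled at any time (a sketch; manuscript.md is a hypothetical file name):

    # show edits made since the last commit
    git diff manuscript.md
    # show the most recent commit to the manuscript, with author, message, and line-by-line changes
    git log -p -1 -- manuscript.md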
Datasets
Data are ideal for managing with Git. These include data
manually entered via spreadsheets, recorded as part of
observational studies, or ones retrieved from sensors (see
also section on Managing large data). With each significant change or addition, commits can record a log of those activities (e.g. “Entered data collected between 12/10/2012 and 12/20/2012”, or “Updated data from temperature loggers for December 2012”). Over time this process avoids
proliferation of files, while the Git history maintains a
complete provenance that can be reviewed at any time.
When errors are discovered, earlier versions of a file can
be reverted without affecting other assets in the project.
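A brief sketch of this workflow for a hypothetical data file (temperature.csv); the abbreviated commit hash is a placeholder that would be taken from the project history:

    # record a new batch of observations
    git add temperature.csv
    git commit -m "Updated data from temperature loggers for December 2012"
    # later, restore just this file to the state it had at an earlier commit
    git checkout a1b2c3d -- temperature.csv
    git commit -m "Revert temperature data to the version before the logger malfunction"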
Statistical code and figures
When data are analyzed programmatically using soft-
ware such as R and Python, code files start out small
and often become more complex over time. Somewhere along the process, inadvertent errors such as
misplaced subscripts and incorrectly applied functions
can lead to serious errors down the line. When such
errors are discovered well into a project, comparing
versions of statistical scripts can provide a way to
quickly trace the source of the problem and recover
from them.
Similarly, figures that appear in a paper often undergo multiple revisions before resulting in a final version that gets published. Without version control,
one would have to deal with multiple copies and
use imperfect information such as file creation dates
to determine the sequence in which they were generated. Without additional information, figuring out why certain versions were created (e.g. in response to comments from coauthors) also becomes more difficult. When figures are managed with Git, the commit messages (e.g. “Updated figure in response to Ethan's comments regarding use of normalized data.”) provide an unambiguous way to track various versions.
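For instance, earlier versions of an analysis script can be listed and compared against the current one to locate where results began to diverge (a sketch; the commit identifier and file name are placeholders):

    # list previous versions of the script along with their commit messages
    git log --oneline -- model_fit.R
    # compare the script at an earlier commit against the current version
    git diff 9fceb02 HEAD -- model_fit.R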
Complete manuscripts
When all of the above artifacts are used in a single effort,
such as writing a manuscript, Git can collectively manage
versions in a powerful way for both individual authors and
groups of collaborators. This process avoids rapid multiplication of unmanageable files with uninformative names (e.g. final_1.doc, final_2.doc, final_final.doc, final_KR_1.doc, etc.) as illustrated by the popular cartoon strip PhD Comics (http://www.phdcomics.com/comics/archive.php?comicid=1531).
Use cases for Git in science
1. Lab notebook
Day to day decisions made over the course of a study
are often logged for review and reference in lab
notebooks. Such notebooks contain important
information useful both to future readers attempting to replicate a study and to thorough reviewers seeking additional clarification. However, lab notebooks are rarely shared along with publications or made public, although there are some exceptions
[11]. Git commit logs can serve as proxies for lab notebooks if clear yet concise messages are recorded
over the course of a project. One of the fundamental
features of Git that make it so useful to science is that
every copy of a repository carries a complete history
of changes available for anyone to review. These logs can be easily searched to retrieve versions of
artifacts like data and code. Third party tools can also
be leveraged to mine Git histories from one or more
projects for other types of analyses.
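As an illustration, the history can be queried much like a notebook index using standard Git commands (the search terms and file names here are examples only):

    # find commits whose messages mention the temperature loggers
    git log --grep="temperature logger"
    # find commits that added or removed a particular function call in the analysis
    git log -S "glmer" -- analysis.R
    # list every change ever made to a given data file
    git log --follow -- temperature.csv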
2. Facilitating collaboration
In collaborative efforts, authors contribute to one or
more stages of the manuscript preparation such as
collecting data, analyzing them, and/or writing up
the results. Such information is extremely useful for
both readers and reviewers when assessing relative
author contributions to a body of work. With high
profile journals now discouraging the practice of
honorary authorship [12], Git commit logs can
provide a highly granular way to track and assess
individual author contributions to a project.
When projects are tracked using Git, every single
action (such as additions, deletions, and changes) is
attributed to an author. Multiple authors can choose
to work on a single branch of a repository (the master branch), or in separate branches and work
asynchronously. In other words, authors do not have
to wait on coauthors before contributing. As each
author adds their contribution, they can sync those
to the master branch and update their copies at any
time. Over time, all of the decisions that go into the
production of a manuscript, from entering data and checking for errors, to choosing appropriate
statistical models and creating figures, can be traced
back to specific authors.
With the help of a remote Git hosting service, maintaining various copies in sync with each other becomes effortless. While most changes are merged automatically, conflicts will need to be resolved manually, which would also be the case with most
other workflows (e.g. using Microsoft Word with
track changes). By syncing changes back and forth
with a remote repository, every author can update
their local copies as well as push their changes to the
remote version at any time, all the while maintaining
a complete audit trail. Mistakes or unnecessary changes can easily be undone by reverting either the entire repository or individual files to earlier commits.
Since commits are attributed to specific authors, errors or clarifications can also be appropriately
directed. Perhaps most importantly this workflow
ensures that revisions do not have to be emailed back
and forth. While cloud storage providers like
Dropbox alleviate some of these annoyances and also
provide versioning, the process is not controlled, making it hard to discern what and how many
changes have occurred between two time intervals.
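A minimal sketch of such a workflow, assuming a shared repository hosted on a service such as GitHub (the URL and branch name are hypothetical):

    # each coauthor starts from a copy of the shared repository
    git clone https://github.com/lead-author/manuscript.git
    cd manuscript
    # work on a separate branch and commit contributions locally
    git checkout -b results-section
    git commit -a -m "Draft results section and add Figure 2"
    # merge coauthors' changes from the remote, then share your own
    git pull origin master
    git push origin results-section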
In a recent paper led by Philippe Desjardins-Proulx
[13] all of the authors successfully collaborated using
only Git and GitHub (https://github.com/). In this
particular Git workflow, each of us cloned a copy of
the main repository and contributed our changes
back to the lead author. Figures 1 and 2 show the list
of collaborators and a network diagram of how and
when changes were contributed back to the master branch.
Figure 1 A list of contributions to a project on GitHub.
3. Backup and failsafe against data loss
Collecting new data and developing methods for
analysis are often expensive endeavors requiring
significant amounts of grant funding. Therefore
protecting such valuable products from loss or theft
is paramount. A recent study found that a vast
majority of data and code are stored on lab computers or web servers, both of which are prone to failure and often become inaccessible after a certain length of time. One survey found that only 72% of the 1000 studies surveyed still had data that were accessible [14,15]. Hosting data and code publicly not
only ensures protection against loss but also
increases visibility for research efforts and provides
opportunities for collaboration and early review [16].
While Git provides powerful features that can be leveraged by individual scientists, Git hosting
services open up a whole new set of possibilities. Any
local Git repository can be linked to one or more Git
remotes, which are copies hosted on remote cloud servers. Git remotes serve as hubs for collaboration
where authors with write privileges can contribute
anytime while others can download up-to-date
versions or submit revisions with author approval.
Figure 2 Git makes it easy to track individual contributions through time ensuring appropriate attribution and accountability. This
screenshot shows a subset of commits (colored dots) by four authors over a period spanning November 17th, 2012 - January 26th, 2013.
There are currently several Git hosting services such
as SourceForge, Google Code, GitHub, and BitBucket
that provide free Git hosting. Among them, GitHub
has surpassed other source code hosts like Google
Code and SourceForge in popularity and hosts over
4.6 million repositories from 2.8 million users as of
December 2012 [17-19]. While these services are
usually free for publicly open projects, some research
efforts, especially those containing embargoed or
sensitive data, will need to be kept private. There are
multiple ways to deal with such situations. For
example, certain files can be excluded from Git’s
history, others maintained as private sub-modules, or
entire repositories can be made private and opened
to the public at a future time. Some Git hosts like
BitBucket offer unlimited public and private accounts
for academic use.
Managing a research project with Git provides several safeguards against short-term loss. Frequent
commits synced to remote repositories ensure that
multiple versioned copies are accessible from
anywhere. In projects involving multiple
collaborators, the presence of additional copies makes it even more difficult to lose work. While Git
hosting services protect against short-term data loss,
they are not a solution for more permanent archiving
since none of them offer any such guarantees. For
long-term archiving, researchers should submit their
Git-managed projects to academic repositories that
are members of CLOCKSS (http://www.clockss.org/). Outputs stored on such repositories (e.g. figshare) are archived over a network of redundant nodes and ensure indefinite availability
across geographic and geopolitical regions.
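For example, linking a local repository to a remote copy so that the full history is mirrored off-site takes only two commands (a sketch; the URL is a placeholder for any Git hosting service):

    # register a remote copy of the repository
    git remote add origin https://github.com/username/project.git
    # upload the complete history; subsequent backups are a single 'git push'
    git push -u origin master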
4. Freedom to explore new ideas and methods
Git tracks development of projects along timelines referred to as branches. By default, there is always a
referred to as branches. By default, there is always a
master branch (line with blue dots in Figure 3). For
most authors, working with this single branch is
sufficient. However, Git provides a powerful
branching mechanism that makes it easy for
exploring alternate ideas in a structured and
documented way without disrupting the central flow
of a project. For example, one might want to try an
improved simulation algorithm, a novel statistical
method, or plot figures in a more compelling way. If
these changes don’t work out, one could revert
changes back to an earlier commit when working on
a single master branch. Frequent reverts on a master
branch can be disruptive, especially when projects
involve multiple collaborators. Branching provides a
risk-free way to test new algorithms, explore better
data visualization techniques, or develop new
analytical models. When branches yield desired
outcomes, they can easily be merged into the master
copy while unsuccessful efforts can be deleted or left
as-is to serve as a historical record (illustrated in
Figure 3).
Branches can prove extremely useful when
responding to reviewer questions about the rationale
for choosing one method over another since the Git
history contains a record of failed, unsuitable, or
abandoned attempts. This is particularly helpful
given that the time between submission and response
can be fairly long. Additionally, future users can mine
Git histories to avoid repeating approaches that were
never fruitful in earlier studies.
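A sketch of this branching workflow, with illustrative branch and commit names:

    # start an experimental branch for an alternative analysis
    git checkout -b bayesian-model
    # ...edit and commit on this branch as the idea develops...
    git commit -a -m "Refit core analysis with a hierarchical Bayesian model"
    # if the approach works, merge it back into the main line of development
    git checkout master
    git merge bayesian-model
    # an unsuccessful branch can instead be deleted, or left as-is as a record
    git branch -D bayesian-model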
5. Mechanism to solicit feedback and reviews
While it is possible to leverage most of the core functionality in
Git at the local level, Git hosting services offer
additional services such as issue trackers,
collaboration graphs, and wikis. These can easily be
used to assign tasks, manage milestones, and
maintain lab protocols. Issue trackers can be
repurposed as a mechanism for soliciting both
feedback and review, especially since the comments
can easily be linked to particular lines of code or
blocks of text. Early comments and reviews for this
article were also solicited via GitHub Issues (https://github.com/karthikram/smb_git/issues/).
6. Increase transparency and verifiability
Methods sections in papers are often succinct to adhere to
strict word limits imposed by journal guidelines. This
practice is especially common when describing
well-known methods where authors assume a certain
degree of familiarity among informed readers. One
unfortunate consequence of this practice is that any
modifications to the standard protocol (typically
noted in internal lab notebooks) implemented in a study may not be available to the reviewers and readers.
However, seemingly small decisions, such as choosing
an appropriate distribution to use in a statistical
method, can have a disproportionately strong
influence on the central finding of a paper. Without
access to a detailed history, a reviewer competent in
statistical methods has to trust that authors carefully
met necessary assumptions, or engage in a long back
and forth discussion thereby delaying the review
process. Sharing a Git repository can alleviate these
kinds of ambiguities and allow authors to point out
commits where certain key decisions were made
before choosing certain approaches. Journals could
facilitate this process by allowing authors to submit
links to their Git repository alongside manuscripts
and sharing them with reviewers.
7. Managing large data
Git is extremely efficient with managing small data files such as ones routinely
collected in experimental and observational studies.
Figure 3 A hypothetical Git workflow for a scientific collaboration involving three authors. Each circle represents a commit and colors
denote author specific commits. Two way arrows indicate a sync (a push and pull in Git terminology). One way arrows indicate an update to one
branch from another. Horizontal arrows indicate development along a particular branch.
However, when the data are particularly large such as
those in bioinformatics studies (in the order of tens
of megabytes to gigabytes), managing them with Git
can degrade efficiency and slow down the
performance of Git operations. With large data files,
the best practice would be to exclude them from the
repository and only track changes in metadata. This
protocol is especially ideal when large datasets do not
change often over the course of a study. In situations
where the data are large and undergo frequent updates, one could leverage third-party tools such as Git-annex (http://git-annex.branchable.com/) and still seamlessly use Git to manage a project.
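One way to implement this, sketched below with hypothetical file names, is to exclude the large raw files from the repository while versioning a small metadata record:

    # tell Git to ignore the raw sequence files
    echo "*.fastq" >> .gitignore
    # version the ignore rule together with a metadata file describing the runs
    git add .gitignore sample_metadata.csv
    git commit -m "Track metadata for sequencing runs; exclude raw fastq files"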
8. Lowering barriers to reuse
A common barrier that prevents someone from reproducing or building
upon an existing method is lack of sufficient details
about a method. Even in cases where methods are
adequately described, the use of expensive proprietary software with restrictive licenses makes them difficult to use [20]. Sharing code with licenses that
encourage fair use with appropriate attribution
removes such artificial barriers and encourages
readers to modify methods to suit their research
needs, improve upon them, or find new applications
[10]. With open source software, analysis pipelines
can be easily forked or branched from public Git repositories and modified to answer other questions.
Although this process of depositing code somewhere
public with appropriate licenses involves additional
work for the authors, the overall benefits outweigh
the costs. Making all research products publicly
available not only increases citation rates [21-23] but
can also increase opportunities for collaboration by
increasing overall visibility. For example,
Niedermeyer & Strohalm [24] describe their struggle
with finding appropriate software for comprehensive
mass spectrum annotation, and eventually found an open source software which they were able to extend. In particular, the authors cite availability of
complete source code along with an open license as
the motivation for their choice. Examples of such
collaboration and extensions are likely to become
more common with increased availability of fully
versioned projects with permissive licenses. A similar
argument can be made for data as well. Even
publications that deposit data in persistent
repositories rarely share the original raw data. The
versions submitted to persistent repositories are
often cleaned and finalized versions of datasets. In
cases where no datasets are deposited, the only data
accessible are likely mean values reported in the main
text or appendix of a paper. Raw data can be
leveraged to answer questions not originally intended
by the authors. For example, research areas that
address questions about uncertainty often require
messy raw data to test competing methods. Thus,
versioned data provide opportunities to retrieve
copies before they have been modified for use in
different contexts and have lost some of their utility.
Conclusions
Wider use of Git has the potential to revolutionize scholarly communication and increase opportunities for reuse,
novel synthesis, and new collaborative efforts. Since Git
is a standard tool that is widely used and backed by a
large developer community, there are numerous resources
for learning (official tutorial at http://git-scm.com/) and
seeking help. With disciplined use of Git, individual scientists and labs can ensure that the entire timeline of events that occur over the development of a research project is securely logged in a system that provides security against data loss and encourages risk-free exploration
of new ideas and approaches. In an era with shrinking
research budgets, scientists are under increasing pressure
to produce more with less. If more granular sharing via Git
reduces time spent developing new software, or repeating
expensive data collection efforts, then everyone stands to
benefit. Scientists should note that these efforts don’t have
to viewed as entirely altruistic. In a recent mandate the
National Science Foundation [25] has expanded its merit
guidelines to include a range of academic products, such as software and data, in addition to peer-reviewed publications. With the rise in the use of altmetric tools that track and credit such efforts, everyone can benefit [26].
Although I have laid out various arguments for why
more scientists should be using Git, one should be careful
not to view Git as a one stop solution to all the problems
facing reproducibility in science. Git can be readily used
without any knowledge of command-line tools due to the availability of many fully featured Git graphical user interfaces (http://git-scm.com/downloads/guis). However, leveraging
its full potential, especially when working on complex
projects where one might encounter unwieldy merge conflicts, comes at a significant learning cost. There are also
comparable alternatives to Git (e.g. Mercurial) which offer
less granularity but are more user-friendly. While time
invested in becoming proficient in Git would be valuable
in the long-term, most scientists do not have the lux-
ury of learning software skills that do not address more
immediate problems. Despite the fact that scientists spend considerable time using and creating their own software to
address domain specific needs, good programming prac-
tices are rarely taught [27]. Therefore wider adoption of useful tools like Git will require greater software development literacy among scientists. On a more optimistic note, such literacy is slowly becoming common in the new generation of academics, driven in part by efforts such as Software Carpentry (http://software-carpentry.org/) and newer
courses taught in graduate curricula (e.g. Programming
for biologists http://www.programmingforbiologists.org/
taught at Utah State University).
Abbreviations
VCS: Version Control System; NSF: National Science Foundation; CSV: Comma
Separated Values.
Competing interests
The author(s) declared they have no competing interests.
Acknowledgements
Comments from Carl Boettiger, Yoav Ram, David Jones, and Scott
Chamberlain on earlier drafts greatly improved the final version of this article. I
also thank the rOpenSci project (http://ropensci.org) for helping me gain a
greater appreciation for Git as a tool for advancing science. This manuscript is
available both as a Git repository (with a full history of changes) at https://github.com/karthikram/smb_git.git and as a permanently archived copy on figshare (http://dx.doi.org/10.6084/m9.figshare.155613).
Funding support
The author did not receive any specific funding for this work.
Received: 25 January 2013 Accepted: 6 February 2013
Published: 28 February 2013
References
1. Vink CJ, Paquin P, Cruickshank RH: Taxonomy and irreproducible
biological science. BioScience 2012, 62:451–452. Available: [http://www.
bioone.org/doi/abs/10.1525/bio.2012.62.5.3]
2. Peng RD: Reproducible research in computational science. Science
2011, 334:1226–1227. Available: [http://www.sciencemag.org/cgi/doi/10.
1126/science.1213847]
3. Begley CG, Ellis LM: Drug development: Raise standards for preclinical
cancer research. Nature 2012, 483:531–533. Available: [http://dx.doi.org/
10.1038/483531a]
4. Schwab M, Karrenbach M, Claerbout J: Making scientific computations
reproducible. Comput Sci Eng 2000, 2:61–67. Available: [http://
ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=881708]
5. Ince DC, Hatton L, Graham-Cumming J: The case for open computer
programs. Nature 2012, 482:485–488. Available: [http://dx.doi.org/10.
1038/nature10836]
6. Van Noorden R: The trouble with retractions. Nature 2011,
478(7367):6–8.
7. Whitlock MC, McPeek MA, Rausher MD, Rieseberg L, Moore AJ: Data
archiving. Am Nat 2010, 175:145–146. Available: [http://www.jstor.org/
stable/10.1086/650340]
8. Vines TH, Andrew RL, Bock DG, Franklin MT, Gilbert KJ, et al.: Mandated
data archiving greatly improves access to research data. FASEB J
2013 doi:10.1096/fj.12-218164.
9. Wolkovich EM, Regetz J, O’Connor MI: Advances in global change
research require open science by individual researchers. Glob
Change Biol 2012, 18:2102–2110. Available: [http://apps.webofknowledge.com/full_record.do?product=UA&search_mode=GeneralSearch&qid=1&SID=1CfaPnJ9gbl5bo171Jc&page=1&doc=4]
10. Neylon C: Open access must enable open use. Nature 2012, 492:8–9.
11. Wald C: Issues & Perspectives Scientists Embrace openness. 2010.
Available: [http://sciencecareers.sciencemag.org/career_magazine/previous_issues/articles/2010_04_09/caredit.a1000036] Accessed 16 Jan 2013.
12. Greenland P, Fontanarosa PB: Ending honorary authorship. Science
(New York, NY) 2012, 337:1019. Available: [http://www.sciencemag.org/
content/337/6098/1019.short]
13. Desjardins-Proulx P, White EP, Adamson JJ, Ram K, Poisot T, Gravel D: The
case for open preprints in biology. PLoS Biol. Accepted.
14. Schultheiss SJ, Münch M-C, Andreeva GD, Rätsch G: Persistence and
availability of Web services in computational biology. PloS One 2011,
6:e24914. Available: [http://dx.plos.org/10.1371/journal.pone.0024914]
15. Wren JD: 404 not found: the stability and persistence of URLs
published in MEDLINE. Bioinformatics (Oxford, England) 2004,
20:668–72. Available: [http://bioinformatics.oxfordjournals.org/content/
20/5/668.abstract]
16. Prlić A, Procter JB: Ten simple rules for the open development of
scientific software. PLoS Comput Biol 2012, 8:e1002802. Available:
[http://dx.plos.org/10.1371/journal.pcbi.1002802]
17. Pearson DP: GitHub sees 3 millionth member account. 2013. Available:
[http://www.gamesindustry.biz/articles/2013-01-17-Github-sees-3-
millionth-member-account] Accessed 18 Jan 2013.
18. Finley K: Github Has surpassed sourceforge and Google code in
popularity. 2011. Available: [http://readwrite.com/2011/06/02/github-
has-passed-sourceforge] Accessed 15 Jan 2013.
19. The Octoverse in 2012 · GitHub Blog. Available: [https://github.com/blog/1359-the-octoverse-in-2012]. Accessed 1 Feb 2013.
20. Morin A, Urban J, Sliz P: A quick guide to software licensing for the
scientist-programmer. PLoS Comput Biol 2012, 8:e1002598. [http://dx.
plos.org/10.1371/journal.pcbi.1002598].
21. Piwowar HA, Day RS, Fridsma B: Sharing detailed research data is
associated with increased citation rate. PLOS One 2007, 2(3):e308.
22. Piwowar HA: Who shares? Who doesn’t? Factors associated with
openly archiving raw research data. PloS One 2011, 6:e18657.
Available: [http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=
3135593&tool=pmcentrez&rendertype=abstract]
23. Alsheikh-Ali AA, Qureshi W, Al-Mallah MH, Ioannidis JPA: Public
availability of published research data in high-impact journals. PloS
One 2011, 6:e24357. Available: [http://www.pubmedcentral.nih.gov/
articlerender.fcgi?artid=3168487&tool=pmcentrez&rendertype=abstract]
24. Niedermeyer THJ, Strohalm M: mMass as a software tool for the
annotation of cyclic peptide tandem mass spectra. PloS one 2012,
7:e44913. Available: [http://www.pubmedcentral.nih.gov/articlerender.
fcgi?artid=3441486&tool=pmcentrez&rendertype=abstract]
25. US NSF - Dear Colleague Letter - Issuance of a new NSF Proposal &
Award Policies and Procedures Guide (NSF13004). 2012. Available:
[http://www.nsf.gov/pubs/2013/nsf13004/nsf13004.jsp?WT.mc_id=USNSF_109] Accessed 11 Nov 2012.
26. Piwowar H: Altmetrics: Value all research products. Nature 2013,
493:159–159. Available: [http://www.nature.com/doifinder/10.1038/
493159a]
27. Wilson G, Aruliah DA, Brown CT, Hong NPC, Davis M, et al.: Best practices for scientific computing. arXiv. [http://arxiv.org/abs/1210.0530]
doi:10.1186/1751-0473-8-7
Cite this article as: Ram: Git can facilitate greater reproducibility and
increased transparency in science. Source Code for Biology and Medicine 2013
8:7.