Bioconda: A sustainable and comprehensive software distribution
for the life sciences
Björn Grüning1, Ryan Dale2, Andreas Sjödin3,4, Brad A. Chapman5, Jillian Rowe6,
Christopher H. Tomkins-Tinch7,8, Renan Valieris9, Adam Caprez10, Bérénice Batut1,
Mathias Haudgaard11, Thomas Cokelaer12, Kyle A. Beauchamp13, Brent S Pedersen14,
Youri Hoogstrate15, Anthony Bretaudeau16, Devon Ryan17, Gildas Le Corguillé18, Dilmurat
Yusuf1, Sebastian Luna-Valero19, Rory Kirchner20, Karel Brinda21, Thomas Wollmann22,
Martin Raden1, Simon J. van Heeringen23, Nicola Soranzo24, Lorena Pantano5, Zachary
Charlop-Powers25, Per Unneberg26, Matthias De Smet27, Marcel Martin28, Greg Von
Kuster29, Tiago Antao30, Milad Miladi1, Kevin Thornton31, Christian Brueffer32, Marius
van den Beek33, Daniel Maticzka1, Clemens Blank1, Sebastian Will34, Kévin Gravouil35,
Joachim Wolff1, Manuel Holtgrewe36,37, Jörg Fallmann38, Vitor C. Piro39,40, Ilya
Shlyakhter8, Ayman Yousif41, Philip Mabon42, Xiao-Ou Zhang43, Wei Shen44, Jennifer
Cabral42, Cristel Thomas45, Eric Enns42, Joseph Brown46, Jorrit Boekel47, Mattias de
Hollander48, Jerome Kelleher49, Nitesh Turaga50, Julian R. de Ruiter51, Dave Bouvier52,
Simon Gladman53, Saket Choudhary54, Nicholas Harding49, Florian Eggenhofer1, Arne
Kratz11, Zhuoqing Fang55, Robert Kleinkauf56, Henning Timm57, Peter J. A. Cock58, Enrico
Seiler39, Colin Brislawn59, Hai Nguyen60, Endre Bakken Stovner61, Philip Ewels62, Matt
Chambers63, James E. Johnson64, Emil Hägglund65, Simon Ye66, Roman Valls Guimera67,
Elmar Pruesse68, W. Augustine Dunn69, Lance Parsons70, Rob Patro71, David Koppstein72,
Elena Grassi73, Inken Wohlers74, Alex Reynolds75, MacIntosh Cornwell76, Nicholas Stoler77,
Daniel Blankenberg78, Guowei He79, Marcel Bargull57, Alexander Junge80, Rick Farouni81,
Mallory Freeberg82, Sourav Singh83, Daniel R. Bogema84, Fabio Cumbo85,86,77,87, Liang-Bo
Wang88,89, David E Larson90, Matthew L. Workentine91, Upendra Kumar Devisetty92,
Sacha Laurent93, Pierrick Roger94, Xavier Garnier16,95, Rasmus Agren96, Aziz Khan97, John
M Eppley98, Wei Li99, Bianca Katharina Stöcker57, Tobias Rausch100, James Taylor101,
Patrick R. Wright1, Adam P. Taranto102, Davide Chicco103, Bengt Sennblad26, Jasmijn A.
Baaijens104, Matthew Gopez42, Nezar Abdennur66, Iain Milne58, Jens Preussner105, Luca
Pinello81, Avi Srivastava71, Aroon T. Chande106, Philip Reiner Kensche107, Yuri Pirola108,
Michael Knudsen109, Ino de Bruijn110, Kai Blin111, Giorgio Gonnella112, Oana M. Enache8,
Vivek Rai113, Nicholas R. Waters114, Saskia Hiltemann115, Matthew L. Bendall116,117,
Christoph Stahl118, Alistair Miles49, Yannick Boursin119, Yasset Perez-Riverol120, Sebastian
Schmeier121, Erik Clarke122, Kevin Arvai123, Matthieu Jung124, Tomás Di Domenico125,
Julien Seiler124, Eric Rasche1, Etienne Kornobis126, Daniela Beisser127, Sven Rahmann128,
Alexander S Mikheyev129,130, Camy Tran42, Jordi Capellades131, Christopher Schröder132,
Adrian Emanuel Salatino133, Simon Dirmeier134, Timothy H. Webster135, Oleksandr
Moskalenko136, Gordon Stephen58, and Johannes Köster137,138
1Bioinformatics Group, Department of Computer Science, University of Freiburg, Freiburg,
Germany
bioRxiv preprint first posted online Oct. 21, 2017; doi: http://dx.doi.org/10.1101/207092. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
2Laboratory of Cellular and Developmental Biology, National Institute of Diabetes and
Digestive and Kidney Diseases, National Institutes of Health, Bethesda, United States
3Division of CBRN Security and Defence, FOI - Swedish Defence Research Agency, Umeå,
Sweden
4Department of Chemistry, Computational Life Science Cluster (CLiC), Umeå University,
Umeå, Sweden
5Harvard T.H. Chan School of Public Health, Boston, United States
6NYU Abu Dhabi, Abu Dhabi, United Arab Emirates
7Department of Organismic and Evolutionary Biology, Harvard University, Cambridge,
United States
8Broad Institute of MIT and Harvard, Cambridge, United States
9Laboratory of Bioinformatics and Computational Biology, A. C. Camargo Cancer Center,
São Paulo, Brazil
10Holland Computing Center, University of Nebraska, Lincoln, United States
11Bioinformatics Research Centre, Aarhus University, Aarhus, Denmark
12Bioinformatics and Biostatistics Hub - C3BI, USR IP CNRS, Institut Pasteur, Paris,
France
13Counsyl, South San Francisco, United States
14Department of Human Genetics, University of Utah, Eccles Institute of Human Genetics,
Salt Lake City
15Erasmus Medical Center, Department of Urology, Rotterdam, The Netherlands
16INRA, UMR IGEPP, BioInformatics Platform for Agroecosystems Arthropods (BIPAA),
Campus Beaulieu, Rennes, France
17Bioinformatics core facility, Max Planck Institute for Immunobiology and Epigenetics,
Freiburg, Germany
18UPMC, CNRS, FR2424, ABiMS, Station Biologique, Roscoff, France
19MRC Weatherall Institute of Molecular Medicine, University of Oxford, Oxford, United
Kingdom
20Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, United
States
21Center for Communicable Disease Dynamics, Department of Epidemiology, Harvard TH
Chan School of Public Health, Boston, United States
22University of Heidelberg and DKFZ, Heidelberg, Germany
23Radboud University, Faculty of Science, Department of Molecular Developmental
Biology, Radboud Institute for Molecular Life Sciences, Nijmegen, The Netherlands
24Earlham Institute, Norwich Research Park, Norwich, United Kingdom
25The Laboratory for Genetically Encoded Small Molecules, The Rockefeller University,
New York, United States
26Department of Cell and Molecular Biology, National Bioinformatics Infrastructure
Sweden, Science for Life Laboratory, Uppsala University, Uppsala, Sweden
27Ghent University Hospital, Ghent University, Belgium
28Department of Biochemistry and Biophysics, National Bioinformatics Infrastructure
Sweden, Science for Life Laboratory, Stockholm University, Sweden
29Institute for CyberScience, Pennsylvania State University, University Park, United States
30Division of Biological Sciences, University of Montana, Missoula, United States of
America
31Department of Ecology and Evolutionary Biology, University of California Irvine, Irvine,
United States
32Division of Oncology and Pathology, Department of Clinical Sciences, Lund University,
Lund, Sweden
33Stem Cells and Tissue Homeostasis, Institut Curie, Paris, France
34Institute for Theoretical Chemistry, University of Vienna, Vienna, Austria
35Université Clermont Auvergne, INRA, MEDIS, Clermont-Ferrand, France
36Core Unit Bioinformatics, Berlin Institute of Health, Berlin, Germany
37Charité Universitätsmedizin Berlin, Berlin, Germany
38Bioinformatics Group, Department of Computer Science and Interdisciplinary Center for
Bioinformatics, University of Leipzig, Leipzig, Germany
39Bioinformatics Unit, Robert Koch Institute, Berlin, Germany
40CAPES Foundation, Ministry of Education of Brazil, Brasília, Brazil
41Center for Genomics and System Biology, New York University, Abu Dhabi, United Arab
Emirates
42National Microbiology Laboratory, Public Health Agency of Canada, Winnipeg, Canada
43Program in Bioinformatics and Integrative Biology, University of Massachusetts Medical
School, Worcester, United States
44Department of Clinical Laboratory, Chengdu Military General Hospital, Chengdu, China
45Northrop Grumman Corporation, Technology Services, Rockville, United States
46Biological Sciences Division, Pacific Northwest National Laboratory, Richland, United
States
47Department of Oncology-Pathology, National Bioinformatics Infrastructure Sweden,
Science for Life Laboratory, Karolinska Institutet, Solna, Sweden
48Netherlands Institute of Ecology (NIOO-KNAW), Wageningen, The Netherlands
49Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University
of Oxford, Oxford, United Kingdom
50Department of Biology, Johns Hopkins University, Baltimore, United States
51Divisions of Molecular Pathology and Molecular Carcinogenesis, The Netherlands Cancer
Institute, Amsterdam, The Netherlands
52Department of Biochemistry and Molecular Biology, Pennsylvania State University,
University Park, United States
53Melbourne Bioinformatics, University of Melbourne, Melbourne, Australia
54Computational Biology and Bioinformatics, University of Southern California, Los
Angeles, United States
55Shanghai Institutes for Biological Sciences, University of Chinese Academy of Sciences,
Chinese Academy of Sciences, Shanghai, China
56-
57Genome Informatics, Institute of Human Genetics, University Hospital Essen, University
of Duisburg-Essen, Essen, Germany
58The James Hutton Institute, Dundee, United Kingdom
59Earth and Biological Sciences Directorate, Pacific Northwest National Laboratory,
Richland, United States
60Department of Chemistry and Chemical Biology, Rutgers University, Piscataway, United
States
61Department of Computer Science, Norwegian University of Science and Technology
62Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm
University, Stockholm, Sweden
63Department of Biochemistry, Molecular Biology and Biophysics (as contractor, not
employee), University of Minnesota, Minneapolis, United States
64Minnesota Supercomputing Institute, University of Minnesota, Minneapolis, United
States
65Department of Molecular Evolution, Cell and Molecular Biology, Science for Life
Laboratory, Biomedical Centre, Uppsala University, Uppsala, Sweden
66Massachusetts Institute of Technology, Cambridge, United States
67Center for Cancer Research, University of Melbourne, Melbourne, Australia
68University of Colorado, Denver, United States
69Boston Children’s Hospital, Boston, United States
70Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, United
States
71Department of Computer Science, Stony Brook University, Stony Brook, United States
72The Kirby Institute of Infection and Immunity, University of New South Wales, Sydney,
Australia
73Transcription and Chromatin Lab, Humanitas University, Rozzano, Italy
74Lübeck Interdisciplinary Platform for Genome Analytics (LIGA), Institutes of
Neurogenetics and Integrative Experimental Genomics, University of Lübeck, Lübeck,
Germany
75Altius Institute for Biomedical Sciences, Seattle, United States
76New York University School of Medicine, New York City, United States
77Department of Biochemistry and Molecular Biology, Pennsylvania State University,
University Park, United States
78Genomic Medicine Institute, Lerner Research Institute, Cleveland Clinic, Cleveland,
United States
79High Performance Computing, NYU Abu Dhabi, Abu Dhabi, United Arab Emirates
80Disease Systems Biology Program, Novo Nordisk Foundation Center for Protein
Research, University of Copenhagen, Copenhagen, Denmark
81Massachusetts General Hospital and Harvard Medical School, Boston, United States
82EMBL-EBI, Wellcome Genome Campus, Hinxton, Cambridgeshire, United Kingdom
83Savitribai Phule Pune University, Pune, Maharashtra, India
84NSW Department of Primary Industries, Elizabeth Macarthur Agricultural Institute,
Menangle, Australia
85Department of Engineering, Roma Tre University, Rome, Italy
86Institute for Systems Analysis and Computer Science “Antonio Ruberti”, National
Research Council of Italy, Rome, Italy
87SYSBIO.IT Center for Systems Biology, Milan, Italy
88Oncology Division, Department of Medicine, Washington University School of Medicine,
St. Louis, United States
89McDonnell Genome Institute, Washington University School of Medicine, St. Louis,
United States
90The McDonnell Genome Institute, Washington University, St. Louis, United States
91Faculty of Veterinary Medicine, University of Calgary, Calgary, Canada
92CyVerse, Bio5 institute, University of Arizona, Tucson, United States
93Institute of Microbiology, Universitary Hospital of Lausanne, Switzerland
94CEA, LIST, Laboratory for data analysis and systems’ intelligence, MetaboHUB, France
95Dyliss - Dynamics, Logics and Inference for biological Systems and Sequences,
Inria/IRISA, Campus Beaulieu, Rennes, France
96Department of Biology and Biological Engineering, National Bioinformatics Infrastructure
Sweden, Science for Life Laboratory, Chalmers University of Technology, Sweden
97Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University
of Oslo, Oslo, Norway
98Daniel K. Inouye Center for Microbial Oceanography: Research and Education,
Department of Oceanography, University of Hawaii, Honolulu, United States
99Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute,
Harvard T.H. Chan School of Public Health, Boston, United States
100European Molecular Biology Laboratory (EMBL), Genomics Core Facility, Heidelberg,
Germany
101Departments of Biology and Computer Science, Johns Hopkins University, Baltimore,
United States
102Plant Sciences Division, Research School of Biology, The Australian National University,
Canberra, Australia
103Princess Margaret Cancer Centre, Toronto, Canada
104Centrum Wiskunde and Informatica, Amsterdam, Netherlands
105ECCPS Bioinformatics Core Unit, Max Planck Institute for Heart and Lung Research,
Bad Nauheim, Germany
106Applied Bioinformatics Laboratory, 2 Ravinia Drive, Suite 1200 Atlanta, GA 30346,
United States
107German Cancer Research Center (DKFZ), Foundation under Public Law, Heidelberg,
Germany
108Dip. di Informatica Sistemistica e Comunicazione, Univ. degli Studi di Milano-Bicocca,
Milan, Italy
109Department of Molecular Medicine, Aarhus University Hospital, Aarhus, Denmark
110Department of Pathology, Memorial Sloan Kettering Cancer Center, New York, United
States
111The Novo Nordisk Foundation Center for Biosustainability, Technical University of
Denmark, Lyngby, Denmark
112ZBH - Center for Bioinformatics, MIN-Fakultät, Universität Hamburg, Hamburg,
Germany
113Department of Computational Medicine and Bioinformatics, University of Michigan,
Ann Arbor, United States
114Department of Microbiology, School of Natural Sciences, National University of Ireland,
Galway, Ireland; Information and Computational Sciences, James Hutton Institute,
Invergowrie, Scotland
115Erasmus Medical Center, Rotterdam, The Netherlands
116Computational Biology Institute, Milken Institute School of Public Health, The George
Washington University, Washington, D.C., United States
117Department of Microbiology, Immunology and Tropical Medicine, The George Washington
University School of Medicine and Health Sciences, Washington, D.C., United States
118Genome Informatics, Institute of Human Genetics, University Hospital Essen, University
of Duisburg-Essen, Essen, Germany
119Institut Gustave Roussy, Villejuif, France
120European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome
Trust Genome Campus, Hinxton, Cambridge, United Kingdom
121Massey University, Institute of Natural and Mathematical Sciences, North Shore City,
New Zealand
122Department of Microbiology, University of Pennsylvania, United States
123GeneDx, Gaithersburg, United States
124Institut de Génétique et de Biologie Moléculaire et Cellulaire, CNRS, Illkirch, France
125Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, United Kingdom
126Epigenetic Regulation Unit, Pasteur Institute, Paris, France
127Biodiversity, Faculty of Biology, University of Duisburg-Essen, Essen, Germany
128Genome Informatics, Institute of Human Genetics, University of Duisburg-Essen,
University Hospital Essen, Essen, Germany
129Evolutionary Genomics Lab, Research School of Biology, The Australian National
University, Canberra, Australia
130Ecology and Evolution Unit, Okinawa Institute of Science and Technology Graduate
University, Onna-son, Kunigami-gun, Okinawa, Japan
131Universitat Rovira i Virgili, Spanish Biomedical Research Center in Diabetes and
Associated Metabolic Disorders (CIBERDEM), Reus, Spain
132Genome Informatics, Institute of Human Genetics, University of Duisburg-Essen, Essen,
Germany
133Department of Molecular Genetics and Biology of Complex Diseases, Institute of
Medical Research A Lanari-IDIM, University of Buenos Aires, National Scientific and
Technical Research Council (CONICET), Ciudad Autónoma de Buenos Aires, Argentina
134Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
135School of Life Sciences, Arizona State University, Tempe, United States
136UFIT Research Computing, University of Florida, Gainesville, United States
137Algorithms for reproducible bioinformatics, Genome Informatics, Institute of Human
Genetics, University Hospital Essen, University of Duisburg-Essen
138Dana Farber Cancer Institute, Harvard Medical School, Boston, United States
October 27, 2017
Co-first author
To whom correspondence should be addressed.
Abstract
We present Bioconda (https://bioconda.github.io), a distribution of bioinformatics software for the lightweight,
multi-platform and language-agnostic package manager Conda. Currently, Bioconda offers a collection of over 3000
software packages, which is continuously maintained, updated, and extended by a growing global community of more
than 200 contributors. Bioconda improves analysis reproducibility by allowing users to define isolated environments
with defined software versions, all of which are easily installed and managed without administrative privileges.
Introduction
Thousands of new software tools have been released for bioinformatics in recent years, in a variety of
programming languages. Accompanying this diversity of construction is an array of installation methods. Often,
software has to be compiled manually for different hardware architectures and operating systems, with
management left to the user or system administrator. Scripting languages usually deliver their own package
management tools for installing, updating, and removing packages, though these are often limited in scope
to packages written in the same scripting language and cannot handle external dependencies (e.g., C
libraries). Published scientific software often consists of simple collections of custom scripts distributed with
textual descriptions of the manual steps required to install the software. New analyses often require novel
combinations of multiple tools, and the heterogeneity of scientific software makes management of a software
stack complicated and error-prone. Moreover, it inhibits reproducible science (Mesirov, 2010; Baker, 2016;
Munafò et al., 2017), because it is hard to reproduce a software stack on different machines. System-wide
deployment of software has traditionally been handled by administrators, but reproducibility often requires
that the researcher (who is often not an expert in administration) is able to maintain full control of the
software environment and rapidly modify it without administrative privileges.
The Conda package manager (https://conda.io) has become an increasingly popular approach to overcome
these challenges. Conda normalizes software installations across language ecosystems by describing each
software package with a recipe that defines meta-information and dependencies, as well as a build script
that performs the steps necessary to build and install the software. Conda prepares and builds software
packages within an isolated environment, transforming them into relocatable binaries. Conda packages can
be built for all three major operating systems: Linux, macOS, and Windows. Importantly, installation
and management of packages require no administrative privileges, such that a researcher can control the
available software tools regardless of the underlying infrastructure. Moreover, Conda obviates reliance on
system-wide installation by allowing users to generate isolated software environments, within which versions
and tools can be managed per project, without generating conflicts or incompatibilities (see online methods).
These environments support reproducibility, as they can be rapidly exchanged via files that describe
their installation state. Conda is tightly integrated into popular solutions for reproducible scientific data
analysis like Galaxy (Afgan et al., 2016), bcbio-nextgen (https://github.com/chapmanb/bcbio-nextgen),
and Snakemake (Köster and Rahmann, 2012). Finally, while Conda provides many commonly used packages
by default, it also allows users to optionally include additional repositories (termed channels) of packages
that can be installed.
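To make the idea of an exchangeable environment description concrete, the following minimal sketch renders such a file as text. It is an illustration only: the helper name render_environment, the environment name, the channel order, and the pinned versions are assumptions chosen for the example and are not taken from this paper.

```python
from typing import Dict, List


def render_environment(name: str, channels: List[str], pins: Dict[str, str]) -> str:
    """Render a minimal Conda environment file as YAML text.

    `pins` maps package names to exact versions, so that the file
    describes one well-defined software stack.
    """
    lines = [f"name: {name}", "channels:"]
    lines += [f"  - {channel}" for channel in channels]
    lines.append("dependencies:")
    # Sort the packages so the file is stable and easy to diff.
    lines += [f"  - {pkg}={version}" for pkg, version in sorted(pins.items())]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = render_environment(
        name="my-analysis",                         # hypothetical project name
        channels=["conda-forge", "bioconda"],       # assumed channel order
        pins={"samtools": "1.6", "bwa": "0.7.17"},  # illustrative versions
    )
    print(text)
```

A file written this way could then be handed to a collaborator and turned back into an environment with `conda env create -f environment.yml`, without administrative privileges.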
Results
In order to unlock the benefits of Conda for the life sciences, the Bioconda project was founded in 2015.
The mission of Bioconda is to make bioinformatics software easily installable and manageable via the Conda
package manager. Via its channel for the Conda package manager, Bioconda currently provides over 3000
software packages for Linux and macOS. Development is driven by an open community of over 200
international scientists. In the prior two years, package count and the number of contributors have increased
linearly, on average, with no sign of saturation (Fig. 1a,b). The barrier to entry is low, requiring a
willingness to participate and adherence to community guidelines. Many software developers contribute recipes for
their own tools, and many Bioconda contributors are invested in the project as they are also users of Conda
and Bioconda. Bioconda provides packages from various language ecosystems such as Python, R (CRAN and
Bioconductor), Perl, and Haskell, as well as a plethora of C/C++ programs (Fig. 1c). Many of these packages
have complex dependency structures that require various manual steps to install when not relying on a
package manager like Conda (Fig. 2a, Online Methods). With over 6.3 million downloads, the service has
become a backbone of bioinformatics infrastructure (Fig. 1d). Bioconda is complemented by the conda-forge
project (https://conda-forge.github.io), which hosts software not specifically related to the biological
sciences. The two projects collaborate closely, and the Bioconda team maintains over 500 packages hosted
by conda-forge. Among all currently available distributions of bioinformatics software, Bioconda is by far
the most comprehensive, while being among the youngest (Fig. 2d).
Figure 1: Bioconda development and usage since the beginning of the project. (a) contributing authors and
added recipes over time. (b) code line additions and deletions per week. (c) package count per language
ecosystem (saturated colors on bottom represent explicitly life science related packages). (d) total downloads
per language ecosystem. The term “other” entails all recipes that do not fall into one of the specific categories.
Note that a subset of packages that started in Bioconda have since been migrated to the more appropriate,
general-purpose conda-forge channel. Older versions of such packages still reside in the Bioconda channel,
and as such are included in the recipe count (a) and download count (d). Statistics obtained Oct. 25, 2017.
To ensure reliable maintenance of such numbers of packages, we use a semi-automatic, agent-assisted
development workflow (Fig. 2b). All Bioconda recipes are hosted in a GitHub repository
(https://github.com/bioconda/bioconda-recipes). Both the addition of new recipes and the update of existing
recipes in Bioconda are handled via pull requests, in which a modified version of one or more recipes is
compared against the current state of Bioconda. Once a pull request arrives, our infrastructure performs
several automatic checks. Problems discovered in any step are reported to the contributor and further
progress is blocked until they are resolved. First, the modified recipes are checked for syntactic
anti-patterns, i.e., formulations that are syntactically correct but bad style (termed linting). Second, the
modified recipes are built on Linux and macOS, via a cloud-based, free-of-charge service
(https://travis-ci.org). Successfully built recipes are tested (e.g., by running the generated executable).
Since Bioconda packages must be able to run on any supported system, it is important to check that the built
packages do not rely on particular elements from the build environment. Therefore, testing happens in two
stages: (a) test cases are executed in the build environment, and (b) test cases are executed in a minimal
Docker (https://docker.com) container which purposefully lacks all non-common system libraries (hence, a
dependency that is not explicitly defined will lead to a failure). Once the build and test steps have
succeeded, a member of the Bioconda team reviews the proposed changes and, if acceptable, merges the
modifications into the official repository. Upon merging, the recipes are built again and uploaded to the
hosted Bioconda channel (https://anaconda.org/bioconda), where they become available via the Conda package
manager. When a Bioconda package is updated to a new version, older builds are generally preserved, and
recipes for multiple older versions may be maintained
in the Bioconda repository. The usual turnaround time of the above workflow is short (Fig. 2c): 61% of the
pull requests are merged within 5 hours. Of those, 36% are even merged within 1 hour. Only 18% of the
pull requests need more than a day. Hence, publishing software in Bioconda or updating already existing
packages can typically be accomplished within minutes to a few hours.
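The order of the automatic checks described above (linting, building, testing in the build environment, testing in a minimal container) can be sketched as a small gate that stops at the first failure. This is a toy model, not the actual Bioconda infrastructure: recipes are represented as plain dictionaries, and all function names and dictionary keys are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# A check receives a recipe (modelled here as a plain dict) and returns an
# error message, or None when the check passes.
Recipe = dict
Check = Callable[[Recipe], Optional[str]]


def lint_recipe(recipe: Recipe) -> Optional[str]:
    # Stand-in for linting: flag one (hypothetical) style problem.
    return None if recipe.get("has_tests") else "lint: recipe defines no tests"


def build_recipe(recipe: Recipe) -> Optional[str]:
    return None if recipe.get("builds") else "build: build script failed"


def check_in_build_env(recipe: Recipe) -> Optional[str]:
    return None if recipe.get("tests_pass") else "test (build environment): failed"


def check_in_minimal_container(recipe: Recipe) -> Optional[str]:
    # In a minimal container only declared dependencies are present, so
    # anything needed at run time but not declared causes a failure.
    declared = set(recipe.get("declared_deps", []))
    needed = set(recipe.get("runtime_needs", []))
    missing = sorted(needed - declared)
    if missing:
        return "test (minimal container): undeclared dependency " + ", ".join(missing)
    return None


# Stages run in this order; human review would follow after all of them pass.
STAGES: List[Check] = [
    lint_recipe,
    build_recipe,
    check_in_build_env,
    check_in_minimal_container,
]


def run_checks(recipe: Recipe) -> Tuple[bool, Optional[str]]:
    """Run all stages in order and stop at the first problem."""
    for check in STAGES:
        error = check(recipe)
        if error is not None:
            return False, error  # progress is blocked until this is fixed
    return True, None


if __name__ == "__main__":
    recipe = {
        "has_tests": True,
        "builds": True,
        "tests_pass": True,
        "declared_deps": ["zlib"],
        "runtime_needs": ["zlib", "libcurl"],  # libcurl was never declared
    }
    print(run_checks(recipe))
```

The example recipe passes the first three stages and fails only in the container stage, which is the situation the second test stage is designed to catch.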
Figure 2: Dependency structure, workflow, comparison with other resources, and turnaround time. (a)
Largest connected component of the directed acyclic graph of Bioconda packages (nodes) and dependencies
(edges). Highlighted is the induced subgraph of the CNVkit (Talevich et al., 2016) package and
its dependencies (node coloring as defined in Fig. 1c; the square node represents CNVkit). (b) GitHub-based
development workflow: a contributor provides a pull request that undergoes several build and
test steps, followed by a human review. If any of these checks does not succeed, the contributor can
update the pull request accordingly. Once all steps have passed, the changes can be merged. (c)
Turnaround time from submission to merge of pull requests in Bioconda. (d) Comparison of explicitly
life science related packages in Bioconda with Debian Med (https://www.debian.org/devel/debian-med),
Gentoo Science Overlay (category sci-biology, https://github.com/gentoo/sci), EasyBuild (module
bio, https://easybuilders.github.io/easybuild), Biolinux (Field et al., 2006), Homebrew Science (tag
bioinformatics, https://brew.sh), GNU Guix (category bioinformatics, https://www.gnu.org/s/guix), and BioBuilds
(https://biobuilds.org). The lower panel shows the project age since the first release or commit.
Statistics obtained Oct. 23, 2017.
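The induced subgraph highlighted in panel (a) consists of one package together with everything it depends on, directly or indirectly. A minimal sketch of how such a subgraph could be computed from a dependency graph is given below. The graph is a toy example with hypothetical package names; it does not reflect the real dependencies of CNVkit or of any other Bioconda package.

```python
from typing import Dict, List, Set


def transitive_dependencies(graph: Dict[str, List[str]], package: str) -> Set[str]:
    """Return every package reachable from `package` via dependency edges."""
    seen: Set[str] = set()
    stack = list(graph.get(package, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited through another path
        seen.add(current)
        stack.extend(graph.get(current, []))
    return seen


def induced_subgraph(graph: Dict[str, List[str]], package: str) -> Dict[str, List[str]]:
    """Restrict the graph to `package` and all of its transitive dependencies."""
    nodes = transitive_dependencies(graph, package) | {package}
    # Every dependency of a kept node is itself a kept node, so the
    # edge lists can be copied unchanged.
    return {node: list(graph.get(node, [])) for node in sorted(nodes)}


if __name__ == "__main__":
    # Hypothetical packages: each key depends on the packages in its list.
    toy_graph = {
        "tool-a": ["lib-x", "lib-y"],
        "tool-b": ["lib-y"],
        "lib-x": ["lib-z"],
        "lib-y": [],
        "lib-z": [],
    }
    print(induced_subgraph(toy_graph, "tool-a"))
```

An iterative traversal with an explicit stack is used here rather than recursion, so that deep dependency chains do not run into the interpreter's recursion limit.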
Reproducible software management and distribution is enhanced by other current technologies. Conda
integrates well with environment modules (http://modules.sourceforge.net/), a technology used nearly
universally across HPC systems. An administrator can use Conda to easily define software stacks for multiple
labs and project-specific configurations. Popularized by Docker, containers provide another way to publish
an entire software stack, down to the operating system. They provide greater isolation and control over the
environment in which software is executed, at the expense of some customizability. Conda complements
container-based approaches. Where flexibility is needed, Conda packages can be used and combined directly. Where
the uniformity of containers is required, Conda can be used to build images without having to reproduce the
nuanced installation steps that would ordinarily be required to build and install software within an image.
In fact, for each Bioconda package, our build system automatically builds a minimal Docker image containing
that package and its dependencies, which is subsequently uploaded and made available via the Biocontainers
project (da Veiga Leprevost et al., 2017). As a consequence, every built Bioconda package is available not
only for installation via Conda, but also as a container via Docker, Rkt (https://coreos.com/rkt), and
Singularity (Kurtzer et al., 2017), such that the desired level of reproducibility can be chosen freely (Grüning
et al., 2017).
Discussion
By turning the arduous and error-prone process of installing bioinformatics software, previously repeated
endlessly by scientists around the globe, into a concerted community effort, Bioconda frees significant
resources to instead be invested into productive research. The new simplicity of deploying even complex
software stacks with strictly controlled software versions enables software authors to safely rely on existing
methods. Where previously the cost of depending on a third-party tool (requiring its installation and
maintaining compatibility with new versions) was often higher than the effort to re-implement its methods,
authors can now simply specify the tool and version required, incurring only negligible costs even for large
requirement sets.
For reproducible data science, it is crucial that software libraries and tools are provided via an
easy-to-use, unified interface, such that they can be easily deployed and sustainably managed. With its ability to
maintain isolated software environments, its integration into major workflow management systems, and the
fact that no administrative privileges are needed, the Conda package manager is the ideal tool to ensure
sustainable and reproducible software management. With Bioconda, we unlock Conda for the life sciences
while coordinating closely with other related projects such as conda-forge and Biocontainers. Bioconda
offers a comprehensive resource of thousands of software libraries and tools that is maintained by hundreds
of international contributors. Although it is among the youngest, it outperforms all competing projects by
far in the number of available packages. With almost six million downloads so far, Bioconda packages have
been well received by the community. We invite everybody to participate in reaching the goal of a central,
comprehensive, and language-agnostic collection of easily installable software by maintaining existing or
publishing new software in Bioconda.
Funding
The Bioconda project has received support from Anaconda, Inc., Austin, TX, USA, in the form of expanded
storage for Bioconda packages on their hosting service (https://anaconda.org). Further, the project has
been granted extended build times from Travis CI, GmbH (https://travis-ci.com). The Bioconda community
would also like to thank ELIXIR (https://www.elixir-europe.org) for its constant support and for donating
staff.
Acknowledgements
We thank the participants of various hackathons (e.g., the GalaxyP and IUC contribution fest, ELIXIR
BioContainers and NETTAB hackathon) for porting numerous packages to Bioconda.
Contributions
Kyle Beauchamp, Christian Brueffer, Brad Chapman, Ryan Dale, Florian Eggenhofer, Björn Grüning,
Johannes Köster, Elmar Pruesse, Martin Raden, Jillian Rowe, Devon Ryan, Ilya Shlyakhter, Andreas Sjödin,
Christopher Tomkins-Tinch, and Renan Valieris (in alphabetical order) have written the manuscript.
Johannes Köster and Ryan Dale have conducted the data analysis. Dan Ariel Sondergaard contributed by
supervising student programmers on contributing recipes and maintaining the connection with ELIXIR. All
other authors have contributed or maintained recipes.
Online Methods
Security Considerations
Using Bioconda as a service to obtain packages for local installation entails trusting that (a) the provided
software itself is not harmful and (b) it has not been modified in a harmful way. Ensuring (a) is up to the
user. In contrast, (b) is handled by our workflow. First, source code or binary files defined in recipes are
checked for integrity via MD5 or SHA256 hash values. Second, all review and testing steps are enforced via the
GitHub interface. This guarantees that all packages have been tested automatically and reviewed by a human
being. Third, all changes to the repository of recipes are publicly tracked, and all build and test steps are
transparently visible to the user. Finally, the automatic parts of the development workflow are implemented in
the open-source software bioconda-utils (https://github.com/bioconda/bioconda-utils). In the future,
we will further explore the possibility of signing packages cryptographically.
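The integrity check in the first step can be illustrated with a short Python sketch. It is not taken from
bioconda-utils; the function names sha256_of and matches_recorded_hash are hypothetical, and the sketch only
assumes that the expected SHA256 value of a source file is recorded as a hexadecimal string.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hexadecimal SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so that large archives need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded_hash(path, expected_hex):
    """Return True if the file's SHA256 digest equals the recorded value.

    Hypothetical helper: the recorded value is compared case-insensitively
    and surrounding whitespace is ignored.
    """
    return sha256_of(path) == expected_hex.strip().lower()
```

In such a scheme, a mismatch would indicate that the downloaded file differs from the one for which the hash
was recorded, and the build would be stopped at that point.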
Software management with Conda
Via the Conda package manager, installing software from Bioconda becomes very simple. In the following,
we describe the basic functionality assuming that the user has access to a Linux or macOS terminal. After
installing Conda, the first step is to set up the Bioconda channel via:
$ conda config --add channels conda-forge
$ conda config --add channels bioconda
Now, all Bioconda packages are visible to the Conda package manager. For example, the software
CNVkit (Talevich et al., 2016) can be searched for with
$ conda search cnvkit
in order to check if and in which versions it is available. It can be installed with:
$ conda install cnvkit
CNVkit needs various dependencies from Python and R, which would otherwise have to be installed in
separate manual steps (Fig. 2a). Furthermore, Conda enables updating and removing all these dependencies via
one unified interface. A key value of Conda is the ability to define isolated, shareable software environments.
This can happen ad hoc or via YAML (https://yaml.org) files. For example, the following defines an
environment consisting of Salmon (Patro et al., 2017) and DESeq2 (Love et al., 2014):
channels:
  - bioconda
  - conda-forge
  - defaults
dependencies:
  - bioconductor-deseq2 =1.16.1
  - salmon =0.8.2
  - r-base =3.4.1
Given that the above environment specification is stored in the file env.yaml, an environment my-env meeting
the specified requirements can be created via the command:
$ conda env create --name my-env --file env.yaml
To use the commands installed in this environment, it must first be “activated” by issuing the following
command:
$ source activate my-env
Within the environment, R, Salmon, and DESeq2 are available in exactly the defined versions. For example,
salmon can be executed with:
$ salmon --help
It is possible to modify an existing environment by using conda update, conda install, and conda remove.
For example, we could add a particular version of Kallisto (Bray et al., 2016) and update Salmon to the
latest available version with:
$ conda install kallisto=0.43.1
$ conda update salmon
Finally, the environment can be deactivated again with:
$ source deactivate
How isolated software environments enable reproducible research
With isolated software environments as shown above, it is possible to define an exact version for each package.
This increases reproducibility by eliminating differences due to implementation changes. Note that above
we also pin an R version, although the latest compatible one would also be automatically installed without
mentioning it. To further increase reproducibility, this pattern can be extended to all dependencies of DESeq2
and Salmon and recursively down to basic system libraries like zlib and boost (https://www.boost.org).
Environments are isolated from the rest of the system, while still allowing interaction with it: e.g., tools inside
the environment are preferred over system tools, while system tools that are not available from within the
environment can still be used. Conda also supports the automatic creation of environment definitions from
already existing environments. This allows the needed combination of packages to be explored rapidly before
it is finalized into an environment definition. When used with workflow management systems like Galaxy
(Afgan et al., 2016), bcbio-nextgen (https://github.com/chapmanb/bcbio-nextgen), and Snakemake (Köster and
Rahmann, 2012) that interact directly with Conda, a data analysis can be shipped and deployed in a fully
reproducible way, from description and automatic execution of every analysis step down to the description
and automatic installation of any required software.
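Whether an environment definition constrains every listed package can be checked mechanically. The following
Python sketch is an illustration only: split_spec and unpinned are hypothetical helpers, not part of Conda,
and they assume dependency entries of the simple form used above (a package name, optionally followed by a
version constraint).

```python
import re

# Hypothetical pattern: a package name followed by an optional version constraint,
# e.g. "salmon =0.8.2", "r-base=3.4.1", or just "kallisto".
SPEC = re.compile(r"^\s*([A-Za-z0-9_.\-]+)\s*(.*?)\s*$")


def split_spec(spec):
    """Split a dependency entry into (package name, version constraint)."""
    match = SPEC.match(spec)
    if match is None:
        raise ValueError(f"cannot parse dependency entry: {spec!r}")
    name, constraint = match.groups()
    return name, constraint


def unpinned(specs):
    """Return the names of all entries that carry no version constraint."""
    return [name for name, constraint in map(split_spec, specs) if not constraint]


# The environment defined above constrains every package it lists.
env = ["bioconductor-deseq2 =1.16.1", "salmon =0.8.2", "r-base =3.4.1"]
print(unpinned(env))
```

An entry such as "kallisto" without a version would be reported by unpinned, signalling that the installed
version may change between two creations of the environment.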
Data analysis
The presented figures and numbers have been generated via a fully automated, reproducible Snakemake
(Köster and Rahmann, 2012) workflow that is freely available under
https://github.com/bioconda/bioconda-paper.
References
E Afgan, D Baker, M van den Beek, D Blankenberg, D Bouvier, M Čech, J Chilton, D Clements, N Coraor,
C Eberhard, B Grüning, A Guerler, J Hillman-Jackson, G Von Kuster, E Rasche, N Soranzo, N Turaga,
J Taylor, A Nekrutenko, and J Goecks. The Galaxy platform for accessible, reproducible and collaborative
biomedical analyses: 2016 update. Nucleic Acids Res, 44:W3–W10, Jul 2016. doi: 10.1093/nar/gkw343.
URL https://doi.org/10.1093/nar/gkw343.
Monya Baker. 1,500 scientists lift the lid on reproducibility. Nature, 533(7604):452–454, May 2016. doi:
10.1038/533452a. URL https://doi.org/10.1038/533452a.
NL Bray, H Pimentel, P Melsted, and L Pachter. Near-optimal probabilistic RNA-seq quantification. Nat
Biotechnol, 34:525–7, May 2016. doi: 10.1038/nbt.3519. URL https://doi.org/10.1038/nbt.3519.
F da Veiga Leprevost, BA Grüning, S Alves Aflitos, HL Röst, J Uszkoreit, H Barsnes, M Vaudel, P Moreno,
L Gatto, J Weber, M Bai, RC Jimenez, T Sachsenberg, J Pfeuffer, R Vera Alvarez, J Griss, AI Nesvizhskii,
and Y Perez-Riverol. BioContainers: an open-source and community-driven framework for software
standardization. Bioinformatics, 33:2580–2582, Aug 2017. doi: 10.1093/bioinformatics/btx192. URL
https://doi.org/10.1093/bioinformatics/btx192.
Dawn Field, Bela Tiwari, Tim Booth, Stewart Houten, Dan Swan, Nicolas Bertrand, and Milo Thurston.
Open software for biologists: from famine to feast. Nature Biotechnology, 24(7):801–803, Jul 2006. doi:
10.1038/nbt0706-801. URL https://doi.org/10.1038/nbt0706-801.
Björn Grüning, John Chilton, Johannes Köster, Ryan Dale, Jeremy Goecks, Rolf Backofen, Anton
Nekrutenko, and James Taylor. Practical computational reproducibility in the life sciences. Oct 2017.
doi: 10.1101/200683. URL https://doi.org/10.1101/200683.
GM Kurtzer, V Sochat, and MW Bauer. Singularity: Scientific containers for mobility of compute. PLoS
One, 12:e0177459, 2017. doi: 10.1371/journal.pone.0177459. URL
https://doi.org/10.1371/journal.pone.0177459.
J Köster and S Rahmann. Snakemake–a scalable bioinformatics workflow engine. Bioinformatics,
28:2520–2, Oct 2012. doi: 10.1093/bioinformatics/bts480. URL
https://doi.org/10.1093/bioinformatics/bts480.
MI Love, W Huber, and S Anders. Moderated estimation of fold change and dispersion for RNA-seq data
with DESeq2. Genome Biol, 15:550, 2014. doi: 10.1186/s13059-014-0550-8. URL
https://doi.org/10.1186/s13059-014-0550-8.
J. P. Mesirov. Accessible Reproducible Research. Science, 327(5964):415–416, Jan 2010. doi:
10.1126/science.1179653. URL https://doi.org/10.1126/science.1179653.
Marcus R. Munafò, Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers,
Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware, and John P. A.
Ioannidis. A manifesto for reproducible science. Nature Human Behaviour, 1(1):0021, Jan 2017. doi:
10.1038/s41562-016-0021. URL https://doi.org/10.1038/s41562-016-0021.
R Patro, G Duggal, MI Love, RA Irizarry, and C Kingsford. Salmon provides fast and bias-aware
quantification of transcript expression. Nat Methods, 14:417–419, Apr 2017. doi: 10.1038/nmeth.4197. URL
https://doi.org/10.1038/nmeth.4197.
E Talevich, AH Shain, T Botton, and BC Bastian. CNVkit: Genome-Wide Copy Number Detection and
Visualization from Targeted DNA Sequencing. PLoS Comput Biol, 12:e1004873, Apr 2016. doi:
10.1371/journal.pcbi.1004873. URL https://doi.org/10.1371/journal.pcbi.1004873.
13
.CC-BY-NC-ND 4.0 International licensepeer-reviewed) is the author/funder. It is made available under a The copyright holder for this preprint (which was not. http://dx.doi.org/10.1101/207092doi: bioRxiv preprint first posted online Oct. 21, 2017;
... For overall adoption of software solutions it is important the tools and 310 documentation get packaged by software distributions, such as Bioconda [43], ...
Preprint
Full-text available
Since its introduction in 2011 the variant call format (VCF) has been widely adopted for processing DNA and RNA variants in practically all population studies — as well as in somatic and germline mutation studies. VCF can present single nucleotide variants, multi-nucleotide variants, insertions and deletions, and simple structural variants called against a reference genome. Here we present over 125 useful and much used free and open source software tools and libraries, part of vcflib tools and bio-vcf . We also highlight cyvcf2 , hts-nim and slivar tools. Application is typically in the comparison, filtering, normalisation, smoothing, annotation, statistics, visualisation and exporting of variants. Our tools run daily and invisibly in pipelines and countless shell scripts. Our tools are part of a wider bioinformatics ecosystem and we consider it very important to make these tools available as free and open source software to all bioinformaticians so they can be deployed through software distributions, such as Debian, GNU Guix and Bioconda. vcflib , for example, was installed over 40,000 times and bio-vcf was installed over 15,000 times through Bioconda by December 2020. We shortly discuss the design of VCF, lessons learnt, and how we can address more complex variation that can not easily be represented by the VCF format. All source code is published under free and open source software licenses and can be downloaded and installed from https://github.com/vcflib . Author summary Most bioinformatics workflows deal with DNA/RNA variations that are typically represented in the variant call format (VCF) — a file format that describes mutations (SNP and MNP), insertions and deletions (INDEL) against a reference genome. Here we present a wide range of free and open source software tools that are used in biomedical sequencing workflows around the world today.
... Data processing tasks were organized into a Snakemake (Koster and Rahmann, 2012) workflow with the help of the hundo package (Brown et al., 2018). Versioned executables were downloaded during runtime using Bioconda (Grüning et al., 2017). ...
Article
Full-text available
Riverbeds are hotspots for microbially-mediated reactions that exhibit pronounced variability in space and time. It is challenging to resolve biogeochemical mechanisms in natural riverbeds, as uncontrolled settings complicate data collection and interpretation. To overcome these challenges, laboratory flumes are often used as proxies for natural riverbed systems. Flumes capture spatiotemporal variability and thus allow for controlled investigations of riverbed biogeochemistry. These investigations implicitly rely on the assumption that the flume microbiome is similar to the microbiome of natural riverbeds. However, this assumption has not been tested and it is unknown how the microbiome of a flume compares to natural aquatic settings, including riverbeds. To evaluate the fundamental assumption that a flume hosts a microbiome similar to natural riverbed systems, we used 16s rRNA gene sequencing and publicly available data to compare the sediment microbiome of a single large laboratory flume to a wide variety of natural ecosystems including lake and marine sediments, river, lake, hyporheic, soil, and marine water, and bank and wetland soils. Richness and Shannon diversity metrics, analyses of variance, Bray-Curtis dissimilarity, and analysis of the common microbiomes between flume and river sediment all indicated that the flume microbiome more closely resembled natural riverbed sediments than other ecosystems, supporting the use of flume experiments for investigating natural microbially-mediated biogeochemical processes in riverbeds.
... Installation and dependencies NanoPack and individual scripts are available through the public software repositories PyPI using pip and bioconda through conda (Dale et al. 2017). The scripts build on a number of third party Python modules: matplotlib (Hunter 2007), pysam (Li et al. 2009;Heger 2009), pandas (McKinney 2011), numpy (Walt, Colbert, and Varoquaux 2011), seaborn (Waskom et al. 2017) and biopython (Cock et al. 2009). ...
Preprint
Full-text available
Here we describe NanoPack, a set of tools developed for visualization and processing of long read sequencing data from Oxford Nanopore Technologies and Pacific Biosciences. Availability and Implementation: The NanoPack tools are written in Python3 and released under the GNU GPL3.0 Licence. The source code can be found at https://github.com/wdecoster/nanopack , together with links to separate scripts and their documentation. The scripts are compatible with Linux, Mac OS and the MS Windows 10 subsystem for linux and are available as a graphical user interface, a web service at http://nanoplot.bioinf.be and command line tools. Contact: wouter.decoster@molgen.vib-ua.be Supplementary information: Supplementary tables and figures are available at Bioinformatics online.
... In addition to runtime, analysis times are also reduced with GROOT due to its ease of use. It runs as a self-contained binary, is packaged with bioconda [48] and requires only two commands to run a resistome profiling analysis, offering significant advantage over more complex workflows or those that require upload to remote servers halfway through the analysis [23]. Our implementation is targeted towards researchers who may not have access to high performance computing and wish to run metagenomics workflows on a laptop. ...
Preprint
Full-text available
Background Antimicrobial resistance remains a major threat to global health. Profiling the collective antimicrobial resistance genes within a metagenome (the “resistome”) facilitates greater understanding of antimicrobial resistance gene diversity and dynamics. In turn, this can allow for gene surveillance, individualised treatment of bacterial infections and more sustainable use of antimicrobials. However, resistome profiling can be complicated by high similarity between reference genes, as well as the sheer volume of sequencing data and the complexity of analysis workflows. We have developed an efficient and accurate method for resistome profiling that addresses these complications and improves upon currently available tools. Results Our method combines a variation graph representation of gene sets with an LSH Forest indexing scheme to allow for fast classification of metagenomic sequence reads using similarity-search queries. Subsequent hierarchical local alignment of classified reads against graph traversals enables accurate reconstruction of full-length gene sequences using a scoring scheme. We provide our implementation, GROOT, and show it to be both faster and more accurate than a current reference-dependent tool for resistome profiling. GROOT runs on a laptop and can process a typical 2 gigabyte metagenome in 2 minutes using a single CPU. Conclusion We present a method for resistome profiling that utilises a novel index and search strategy to accurately type resistance genes in metagenomic samples. The use of variation graphs yields several advantages over other methods using linear reference sequences. Our method is not restricted to resistome profiling and has the potential to improve current metagenomic workflows. The implementation is written in Go and is available at https://github.com/will-rowe/groot (MIT license).
... MetaWRAP is hosted on github (https://github.com/bxlab/metaWRAP), distributed through Anaconda [33], and can be easily installed locally and on remote clusters. The metawrap-mg conda package (https:// anaconda.org/ursky/metawrap-mg) ...
Article
Full-text available
Background: The study of microbiomes using whole-metagenome shotgun sequencing enables the analysis of uncultivated microbial populations that may have important roles in their environments. Extracting individual draft genomes (bins) facilitates metagenomic analysis at the single genome level. Software and pipelines for such analysis have become diverse and sophisticated, resulting in a significant burden for biologists to access and use them. Furthermore, while bin extraction algorithms are rapidly improving, there is still a lack of tools for their evaluation and visualization. Results: To address these challenges, we present metaWRAP, a modular pipeline software for shotgun metagenomic data analysis. MetaWRAP deploys state-of-the-art software to handle metagenomic data processing starting from raw sequencing reads and ending in metagenomic bins and their analysis. MetaWRAP is flexible enough to give investigators control over the analysis, while still being easy-to-install and easy-to-use. It includes hybrid algorithms that leverage the strengths of a variety of software to extract and refine high-quality bins from metagenomic data through bin consolidation and reassembly. MetaWRAP's hybrid bin extraction algorithm outperforms individual binning approaches and other bin consolidation programs in both synthetic and real data sets. Finally, metaWRAP comes with numerous modules for the analysis of metagenomic bins, including taxonomy assignment, abundance estimation, functional annotation, and visualization. Conclusions: MetaWRAP is an easy-to-use modular pipeline that automates the core tasks in metagenomic analysis, while contributing significant improvements to the extraction and interpretation of high-quality metagenomic bins. The bin refinement and reassembly modules of metaWRAP consistently outperform other binning approaches. 
Each module of metaWRAP is also a standalone component, making it a flexible and versatile tool for tackling metagenomic shotgun sequencing data. MetaWRAP is open-source software available at https://github.com/bxlab/metaWRAP .
Article
Full-text available
Quantification of gene expression and characterization of gene transcript structures are central problems in molecular biology. RNA sequencing (RNA-Seq) and chromatin immunoprecipitation sequencing (ChIP-Seq) are important methods, but can be cumbersome and difficult for beginners to learn. To teach interested students and scientists how to analyze RNA-Seq and ChIP-Seq data, we present a start-to-finish tutorial for analyzing RNA-Seq and ChIP-Seq data: SeqAcademy ( source code: https://github.com/NCBI-Hackathons/seqacademy , webpage: http://www.seqacademy.org/ ). This user-friendly pipeline, fully written in markdown language, emphasizes the use of publicly available RNA-Seq and ChIP-Seq data and strings together popular tools that bridge that gap between raw sequencing reads and biological insight. We demonstrate practical and conceptual considerations for various RNA-Seq and ChIP-Seq analysis steps with a biological use case - a previously published yeast experiment. This work complements existing sophisticated RNA-Seq and ChIP-Seq pipelines designed for advanced users by gently introducing the critical components of RNA-Seq and ChIP-Seq analysis to the novice bioinformatician. In conclusion, this well-documented pipeline will introduce state-of-the-art RNA-Seq and ChIP-Seq analysis tools to beginning bioinformaticians and help facilitate the analysis of the burgeoning amounts of public RNA-Seq and ChIP-Seq data.
Chapter
Full-text available
Biological, clinical, and pharmacological research now often involves analyses of genomes, transcriptomes, proteomes, and interactomes, within and between individuals and across species. Due to large volumes, the analysis and integration of data generated by such high-throughput technologies have become computationally intensive, and analysis can no longer happen on a typical desktop computer.In this chapter we show how to describe and execute the same analysis using a number of workflow systems and how these follow different approaches to tackle execution and reproducibility issues. We show how any researcher can create a reusable and reproducible bioinformatics pipeline that can be deployed and run anywhere. We show how to create a scalable, reusable, and shareable workflow using four different workflow engines: the Common Workflow Language (CWL), Guix Workflow Language (GWL), Snakemake, and Nextflow. Each of which can be run in parallel.We show how to bundle a number of tools used in evolutionary biology by using Debian, GNU Guix, and Bioconda software distributions, along with the use of container systems, such as Docker, GNU Guix, and Singularity. Together these distributions represent the overall majority of software packages relevant for biology, including PAML, Muscle, MAFFT, MrBayes, and BLAST. By bundling software in lightweight containers, they can be deployed on a desktop, in the cloud, and, increasingly, on compute clusters.By bundling software through these public software distributions, and by creating reproducible and shareable pipelines using these workflow engines, not only do bioinformaticians have to spend less time reinventing the wheel but also do we get closer to the ideal of making science reproducible. The examples in this chapter allow a quick comparison of different solutions.
Article
Full-text available
Quantification of gene expression and characterization of gene transcript structures are central problems in molecular biology. RNA sequencing (RNA-Seq) and chromatin immunoprecipitation sequencing (ChIP-Seq) are important methods, but can be cumbersome and difficult for beginners to learn. To teach interested students and scientists how to analyze RNA-Seq and ChIP-Seq data, we present a start-to-finish tutorial for analyzing RNA-Seq and ChIP-Seq data: SeqAcademy ( source code: https://github.com/NCBI-Hackathons/seqacademy , webpage:http://www.seqacademy.org/ ). This user-friendly pipeline, fully written in markdown language, emphasizes the use of publicly available RNA-Seq and ChIP-Seq data and strings together popular tools that bridge that gap between raw sequencing reads and biological insight. We demonstrate practical and conceptual considerations for various RNA-Seq and ChIP-Seq analysis steps with a biological use case - a previously published yeast experiment. This work complements existing sophisticated RNA-Seq and ChIP-Seq pipelines designed for advanced users by gently introducing the critical components of RNA-Seq and ChIP-Seq analysis to the novice bioinformatician. In conclusion, this well-documented pipeline will introduce state-of-the-art RNA-Seq and ChIP-Seq analysis tools to beginning bioinformaticians and help facilitate the analysis of the burgeoning amounts of public RNA-Seq and ChIP-Seq data.
Article
Full-text available
Software Containers are changing the way scientists and researchers develop, deploy and exchange scientific software. They allow labs of all sizes to easily install bioinformatics software, maintain multiple versions of the same software and combine tools into powerful analysis pipelines. However, containers and software packages should be produced under certain rules and standards in order to be reusable, compatible and easy to integrate into pipelines and analysis workflows. Here, we presented a set of recommendations developed by the BioContainers Community to produce standardized bioinformatics packages and containers. These recommendations provide practical guidelines to make bioinformatics software more discoverable, reusable and transparent. They are aimed to guide developers, organisations, journals and funders to increase the quality and sustainability of research software.
Article
Full-text available
Quantification of gene expression and characterization of gene transcript structures are central problems in molecular biology. RNA sequencing (RNA-Seq) and chromatin immunoprecipitation sequencing (ChIP-Seq) are important methods, but can be cumbersome and difficult for beginners to learn. To teach interested students and scientists how to analyze RNA-Seq and ChIP-Seq data, we present a start-to-finish tutorial for analyzing RNA-Seq and ChIP-Seq data: SeqAcademy ( source code: https://github.com/NCBI-Hackathons/seqacademy , webpage:http://www.seqacademy.org/ ). This user-friendly pipeline, fully written in Jupyter Notebook, emphasizes the use of publicly available RNA-Seq and ChIP-Seq data and strings together popular tools that bridge that gap between raw sequencing reads and biological insight. We demonstrate practical and conceptual considerations for various RNA-Seq and ChIP-Seq analysis steps with a biological use case - a previously published yeast experiment. This work complements existing sophisticated RNA-Seq and ChIP-Seq pipelines designed for advanced users by gently introducing the critical components of RNA-Seq and ChIP-Seq analysis to the novice bioinformatician. In conclusion, this well-documented pipeline will introduce state-of-the-art RNA-Seq and ChIP-Seq analysis tools to beginning bioinformaticians and help facilitate the analysis of the burgeoning amounts of public RNA-Seq and ChIP-Seq data.
Preprint
Full-text available
Many areas of research suffer from poor reproducibility. This problem is particularly acute in computationally intensive domains where results rely on a series of complex methodological decisions that are not well captured by traditional publication approaches. Various guidelines have emerged for achieving reproducibility, but practical implementation of these practices remains difficult. This is because reproducing published computational analyses requires installing many software tools plus associated libraries, connecting tools together into the complete pipeline, and specifying parameters. Here we present a suite of recently emerged technologies which make computational reproducibility not just possible, but, finally, practical in both time and effort. By combining a system for building highly portable packages of bioinformatics software, containerization and virtualization technologies for isolating reusable execution environments for these packages, and an integrated workflow system that automatically orchestrates the composition of these packages for entire pipelines, an unprecedented level of computational reproducibility can be achieved.
Article
Full-text available
Here we present Singularity, software developed to bring containers and reproducibility to scientific computing. Using Singularity containers, developers can work in reproducible environments of their choosing and design, and these complete environments can easily be copied and executed on other platforms. Singularity is an open source initiative that harnesses the expertise of system and software engineers and researchers alike, and integrates seamlessly into common workflows for both of these groups. As its primary use case, Singularity brings mobility of computing to both users and HPC centers, providing a secure means to capture and distribute software and compute environments. This ability to create and deploy reproducible environments across these centers, a previously unmet need, makes Singularity a game changing development for computational science.
Article
Full-text available
Motivation: BioContainers (biocontainers.pro) is an open-source and community-driven framework which provides platform independent executable environments for bioinformatics software. BioContainers allows labs of all sizes to easily install bioinformatics software, maintain multiple versions of the same software and combine tools into powerful analysis pipelines. BioContainers is based on popular open-source projects Docker and rkt frameworks, that allow software to be installed and executed under an isolated and controlled environment. Also, it provides infrastructure and basic guidelines to create, manage and distribute bioinformatics containers with a special focus on omics technologies. These containers can be integrated into more comprehensive bioinformatics pipelines and different architectures (local desktop, cloud environments or HPC clusters). Availability: The software is freely available at github.com/BioContainers/. Contact: yperez@ebi.ac.uk , European Molecular Biology Laboratory, EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK, Tel: +44-1223-492686, Fax: +44-1223-494468.
Article
Full-text available
Improving the reliability and efficiency of scientific research will increase the credibility of the published scientific literature and accelerate discovery. Here we argue for the adoption of measures to optimize key elements of the scientific process: methods, reporting and dissemination, reproducibility, evaluation and incentives. There is some evidence from both simulations and empirical studies supporting the likely effectiveness of these measures, but their broad adoption by researchers, institutions, funders and journals will require iterative evaluation and improvement. We discuss the goals of these measures, and how they can be implemented, in the hope that this will facilitate action toward improving the transparency, reproducibility and efficiency of scientific research.
High-throughput data production technologies, particularly ‘next-generation’ DNA sequencing, have ushered in widespread and disruptive changes to biomedical research. Making sense of the large datasets produced by these technologies requires sophisticated statistical and computational methods, as well as substantial computational power. This has led to an acute crisis in life sciences, as researchers without informatics training attempt to perform computation-dependent analyses. Since 2005, the Galaxy project has worked to address this problem by providing a framework that makes advanced computational tools usable by non-experts. Galaxy seeks to make data-intensive research more accessible, transparent and reproducible by providing a Web-based environment in which users can perform computational analyses and have all of the details automatically tracked for later inspection, publication, or reuse. In this report we highlight recently added features enabling biomedical analyses on a large scale.
Germline copy number variants (CNVs) and somatic copy number alterations (SCNAs) are of significant importance in syndromic conditions and cancer. Massively parallel sequencing is increasingly used to infer copy number information from variations in the read depth in sequencing data. However, this approach has limitations in the case of targeted re-sequencing, which leaves gaps in coverage between the regions chosen for enrichment and introduces biases related to the efficiency of target capture and library preparation. We present a method for copy number detection, implemented in the software package CNVkit, that uses both the targeted reads and the nonspecifically captured off-target reads to infer copy number evenly across the genome. This combination achieves both exon-level resolution in targeted regions and sufficient resolution in the larger intronic and intergenic regions to identify copy number changes. In particular, we successfully inferred copy number at equivalent to 100-kilobase resolution genome-wide from a platform targeting as few as 293 genes. After normalizing read counts to a pooled reference, we evaluated and corrected for three sources of bias that explain most of the extraneous variability in the sequencing read depth: GC content, target footprint size and spacing, and repetitive sequences. We compared the performance of CNVkit to copy number changes identified by array comparative genomic hybridization. We packaged the components of CNVkit so that it is straightforward to use and provides visualizations, detailed reporting of significant features, and export options for integration into existing analysis pipelines. CNVkit is freely available from https://github.com/etal/cnvkit.
In comparative high-throughput sequencing assays, a fundamental task is the analysis of count data, such as read counts per gene in RNA-seq, for evidence of systematic changes across experimental conditions. Small replicate numbers, discreteness, large dynamic range and the presence of outliers require a suitable statistical approach. We present DESeq2, a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates. This enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression. The DESeq2 package is available at http://www.bioconductor.org/packages/release/bioc/html/DESeq2.html. Electronic supplementary material: The online version of this article (doi:10.1186/s13059-014-0550-8) contains supplementary material, which is available to authorized users.
We introduce Salmon, a lightweight method for quantifying transcript abundance from RNA-seq reads. Salmon combines a new dual-phase parallel inference algorithm and feature-rich bias models with an ultra-fast read mapping procedure. It is the first transcriptome-wide quantifier to correct for fragment GC-content bias, which, as we demonstrate here, substantially improves the accuracy of abundance estimates and the sensitivity of subsequent differential expression analysis.
We present a novel approach to RNA-Seq quantification that is near optimal in speed and accuracy. Software implementing the approach, called kallisto, can be used to analyze 30 million unaligned RNA-Seq reads in less than 5 minutes on a standard laptop computer while providing results as accurate as those of the best existing tools. This removes a major computational bottleneck in RNA-Seq analysis.