Costanza et al. BMC Bioinformatics (2019) 20:301
https://doi.org/10.1186/s12859-019-2903-5
METHODOLOGY ARTICLE Open Access
A comparison of three programming
languages for a full-fledged next-generation
sequencing tool
Pascal Costanza*†, Charlotte Herzeel† and Wilfried Verachtert
Abstract
Background: elPrep is an established multi-threaded framework for preparing SAM and BAM files in sequencing
pipelines. To achieve good performance, its software architecture makes only a single pass through a SAM/BAM file
for multiple preparation steps, and keeps sequencing data as much as possible in main memory. Similar to other
SAM/BAM tools, management of heap memory is a complex task in elPrep, and it became a serious productivity
bottleneck in its original implementation language during recent further development of elPrep. We therefore
investigated three alternative programming languages: Go and Java using a concurrent, parallel garbage collector on
the one hand, and C++17 using reference counting on the other hand for handling large amounts of heap objects.
We reimplemented elPrep in all three languages and benchmarked their runtime performance and memory use.
Results: The Go implementation performs best, yielding the best balance between runtime performance and
memory use. While the Java benchmarks report a somewhat faster runtime than the Go benchmarks, the memory use
of the Java runs is significantly higher. The C++17 benchmarks run significantly slower than both Go and Java, while
using somewhat more memory than the Go runs. Our analysis shows that concurrent, parallel garbage collection is
better at managing a large heap of objects than reference counting in our case.
Conclusions: Based on our benchmark results, we selected Go as our new implementation language for elPrep, and
recommend considering Go as a good candidate for developing other bioinformatics tools for processing SAM/BAM
data as well.
Keywords: Next-generation sequencing, Sequence analysis, SAM/BAM files, C++, Go, Java, Runtime performance,
Memory usage, Garbage collection, Reference counting
Background
The sequence alignment/map format (SAM/BAM) [1] is
the de facto standard in the bioinformatics community for
storing mapped sequencing data. There exists a large body
of work on tools for processing SAM/BAM files for
analysis [1–15]. The SAMtools [1], Picard [2], and Genome
Analysis Toolkit (GATK) [3] software packages devel-
oped by the Broad and Sanger institutes are considered
to be reference implementations for many operations on
SAM/BAM files, examples of which include sorting reads,
marking polymerase chain reaction (PCR) and optical
duplicates, recalibrating base quality scores, indel realignment,
and various filtering options, which typically precede
variant calling. Many alternative software packages
[4–10, 12, 14, 15] focus on optimizing the computations
of these operations, either by providing alternative
algorithms, or by using parallelization, distribution, or other
optimization techniques specific to their implementation
language, which is often C, C++, or Java.
*Correspondence: pascal.costanza@imec.be
†Pascal Costanza and Charlotte Herzeel contributed equally to this work.
imec, ExaScience Lab, Kapeldreef 75, 3001 Leuven, Belgium
We have developed elPrep [8,16], an open-source,
multi-threaded framework for processing SAM/BAM
files in sequencing pipelines, especially designed for
optimizing computational performance. It can be used
as a drop-in replacement for many operations imple-
mented by SAMtools, Picard, and GATK, while pro-
ducing identical results [8,16]. elPrep allows users to
specify arbitrary combinations of SAM/BAM opera-
tions as a single pipeline in one command line. elPrep’s
unique software architecture then ensures that running
such a pipeline requires only a single pass through
the SAM/BAM file, no matter how many operations
are specified. The framework takes care of merg-
ing and parallelizing the execution of the operations,
which significantly speeds up the overall execution of
a pipeline.
In contrast, related work focuses on optimizing
individual SAM/BAM operations, but we have shown
that our approach of merging operations outperforms this
strategy [8]. For example, compared to using GATK4,
elPrep executes the 4-step Broad Best Practices pipeline
[17] (consisting of sorting, marking PCR and optical dupli-
cates, and base quality score recalibration and application)
up to 13x faster on whole-exome data, and up to 7.4x
faster on whole-genome data, while utilizing fewer com-
pute resources [8].
All SAM/BAM tools have in common that they need
to manipulate large amounts of data, as SAM/BAM files
easily take up 10–100 gigabytes (GB) in compressed
form. Some tools implement data structures that spill
to disk when reaching a certain threshold on random
access memory (RAM) use, but elPrep uses a strategy
where data is split upfront into chunks that are processed
entirely in memory to avoid repeated file input/output
[16]. Our benchmarks show that elPrep’s representation
of SAM/BAM data is more efficient than, for example,
GATK version 4 (GATK4), as elPrep uses less memory
for loading the same number of reads from a SAM/BAM
file in memory [8]. However, since elPrep does not pro-
vide data structures that spill to disk, elPrep currently
requires a fixed minimum amount of RAM to process a
whole-exome or whole-genome file, whereas other tools
sometimes allow putting a cap on the RAM use by using
disk space instead. Nonetheless, for efficiency, it is rec-
ommended to use as much RAM as available, even when
spilling to disk [8,18]. This means that, in general, tools
for processing SAM/BAM data need to be able to manip-
ulate large amounts of allocated memory.
In most programming languages, there exist more or
less similar ways to explicitly or implicitly allocate mem-
ory for heap objects which, unlike stack values, are not
bound to the lifetimes of function or method invocations.
However, programming languages strongly differ in how
memory for heap objects is subsequently deallocated. A
detailed discussion can be found in “The Garbage Collec-
tion Handbook” by Jones, Hosking, and Moss [19]. There
are mainly three approaches:
Manual memory management Memory has to be
explicitly deallocated in the program source code
(for example by calling free in C [20]).
Garbage collection Memory is automatically managed
by a separate component of the runtime library
called the garbage collector. At arbitrary points in
time, it traverses the object graph to determine
which objects are still directly or indirectly accessible
by the running program, and deallocates inacces-
sible objects. This ensures that object lifetimes do
not have to be explicitly modelled, and that point-
ers can be more freely passed around in a program.
Most garbage collector implementations interrupt
the running program and only allow it to continue
executing after garbage collection – they “stop the
world” [19] – and perform object graph traversal
using a sequential algorithm. However, advanced
implementation techniques, as employed by Java [21]
and Go [22], include traversing the object graph
concurrently with the running program while limit-
ing its interruption as far as possible; and using a
multi-threaded parallel algorithm that significantly
speeds up garbage collection on modern multicore
processors.
Reference counting Memory is managed by maintain-
ing a reference count with each heap object. When
pointers are assigned to each other, these reference
counts are increased or decreased to keep track of
how many pointers refer to each object. Whenever
a reference count drops to zero, the corresponding
object can be deallocated.1
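To make the garbage-collected approach concrete, consider a minimal Go sketch (ours, not code from elPrep): heap objects are allocated freely, and once the program drops its references, the collector reclaims them without any explicit deallocation appearing in the source code.

```go
package main

import (
	"fmt"
	"runtime"
)

// churn allocates n heap objects, drops every reference to them, and
// returns the live heap size after a forced collection. No free() or
// delete appears anywhere: the garbage collector discovers the now
// unreachable objects by traversing the object graph.
func churn(n int) uint64 {
	objs := make([][]byte, n)
	for i := range objs {
		objs[i] = make([]byte, 1024)
	}
	objs = nil // drop all references; the objects become unreachable
	runtime.GC()
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapAlloc
}

func main() {
	live := churn(100_000)
	// Roughly 100 MB were allocated, but the live heap afterwards is
	// back down to a few MB of runtime baseline.
	fmt.Printf("live heap after GC: %.1f MB\n", float64(live)/(1<<20))
}
```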
elPrep was originally, up to version 2.6, implemented
in the Common Lisp programming language [23]. Most
existing Common Lisp implementations use stop-the-
world, sequential garbage collectors. To achieve good per-
formance, it was therefore necessary to explicitly control
how often and when the garbage collector would run to
avoid needless interruptions of the main program, espe-
cially during parallel phases. As a consequence, we also
had to avoid unnecessary memory allocations, and reuse
already allocated memory as far as possible, to reduce
the number of garbage collector runs. However, our more
recent attempts to add more functionality to elPrep (like
optical duplicate marking, base quality score recalibra-
tion, and so on) required allocating additional memory for
these new steps, and it became an even more complex task
and a serious productivity bottleneck to keep memory
allocation and garbage collection in check. We there-
fore started to look for a different programming language
using an alternative memory management approach to
continue developing elPrep and still achieve good perfor-
mance.
Existing literature on comparing programming languages
and their implementations for performance typically
focuses on specific algorithms or kernels in isolation,
no matter whether they cover specific domains like
bioinformatics [24], economics [25], or numerical computing [26],
or are about programming languages in general [27–31].
Except for one of those articles [31], none of
them consider parallel algorithms. Online resources that
compare programming language performance also focus
on algorithms and kernels in isolation [32]. elPrep's performance
stems not only from efficient parallel algorithms
for steps like parallel sorting or concurrent duplicate
marking, but also from the overall software architecture that
organizes these steps into a single-pass, multi-threaded
pipeline. Since such software-architectural aspects are not
covered by the existing literature, it became necessary
to perform the study described in this article.
elPrep is an open-ended software framework that allows
for arbitrary combinations of different functional steps in
a pipeline, like duplicate marking, sorting reads, replac-
ing read groups, and so on; additionally, elPrep also
accommodates functional steps provided by third-party
tool writers. This openness makes it difficult to precisely
determine the lifetime of allocated objects during a pro-
gram run. It is known that manual memory management
can contribute to extremely low productivity when devel-
oping such software frameworks. See for example the
IBM San Francisco project, where a transition from C++
with manual memory management to Java with garbage
collection led to an estimated 300% productivity increase
[33]. Other open-ended software frameworks for process-
ing SAM/BAM files include GATK4 [3], Picard [2], and
htsjdk [34].
Therefore, manual memory management is not a practi-
cal candidate for elPrep, and concurrent, parallel garbage
collection and reference counting are the only remain-
ing alternatives. By restricting ourselves to mature pro-
gramming languages where we can expect long-term
community support, we identified Java and Go as the
only candidates with support for concurrent, parallel
garbage collection2, and C++17 [35] as the only candi-
date with support for reference counting (through the
std::shared_ptr library feature).3
The study consisted of reimplementations of elPrep in
C++17, Go, and Java, and benchmarking their runtime
performance and memory usage. These are full-fledged
applications in the sense that they fully support a typical
preparation pipeline for variant calling consisting of sort-
ing reads, duplicate marking, and a few other commonly
used steps. While these three reimplementations of elPrep
only support a limited set of functionality, in each case
the software architecture could be completed with addi-
tional effort to support all features of elPrep version 2.6
and beyond.
Results
Running a typical preparation pipeline using elPrep’s soft-
ware architecture in the three selected programming lan-
guages shows that the Go implementation performs best,
followed by the Java implementation, and then the C++17
implementation.4
To determine this result, we used a five-step prepara-
tion pipeline, as defined in our previous article [16], on a
whole-exome data set (NA12878 [36]). This preparation
pipeline consists of the following steps:
Sorting reads for coordinate order.
Removing unmapped reads.
Marking duplicate reads.
Replacing read groups.
Reordering and filtering the sequence dictionary.
We ran this pipeline 30 times for each implementation,
and recorded the elapsed wall-clock time and maximum
memory use for each run using the Unix time com-
mand. We then determined the standard deviation and
confidence intervals for each set of runs [37].
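As an illustration of this bookkeeping (the function names are ours, not from the benchmark scripts), the mean, sample standard deviation, and a normal-approximation 95% confidence interval over a set of wall-clock measurements can be computed in Go as follows:

```go
package main

import (
	"fmt"
	"math"
)

// meanStddev returns the sample mean and sample standard deviation.
func meanStddev(xs []float64) (mean, sd float64) {
	for _, x := range xs {
		mean += x
	}
	mean /= float64(len(xs))
	for _, x := range xs {
		sd += (x - mean) * (x - mean)
	}
	sd = math.Sqrt(sd / float64(len(xs)-1))
	return mean, sd
}

// ci95 returns the half-width of a normal-approximation 95% confidence
// interval for the mean of n samples with sample standard deviation sd.
func ci95(sd float64, n int) float64 {
	return 1.96 * sd / math.Sqrt(float64(n))
}

func main() {
	// Hypothetical wall-clock times in seconds for a handful of runs.
	times := []float64{476.1, 480.3, 471.9, 478.4, 474.8}
	m, sd := meanStddev(times)
	fmt.Printf("mean %.1fs, stddev %.1fs, 95%% CI ±%.1fs\n", m, sd, ci95(sd, len(times)))
}
```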
C++17 and Java allow for fine-grained tuning of their
memory management, leading to four variations each.
For the final ranking in this section, we have chosen the
best result from each set of variations, one for C++17
and one for Java. The other results are presented in the
Discussion” section below. The Go benchmarks were
executed with default settings.
The benchmark results for the runtime performance of
the three selected implementations are shown in Fig. 1. Go
needs on average 7 mins 56.152 secs with a standard devi-
ation of 8.571 secs; Java needs on average 6 mins 54.546
secs with a standard deviation of 5.376 secs; and C++17
needs on average 10 mins 23.603 secs with a standard
deviation of 22.914 secs. The confidence intervals for Go
and Java are very tight, with a slightly looser confidence
interval for C++17.
The benchmark results for the maximum memory use
are shown in Fig. 2. Go needs on average ca. 221.73 GB
with a standard deviation of ca. 6.15 GB; Java needs on
average ca. 335.46 GB with a standard deviation of ca. 0.13
GB; and C++17 needs on average ca. 255.48 GB with a
standard deviation of ca. 2.93 GB. Confidence intervals are
very tight.
The goal of elPrep is to simultaneously keep both the
runtime and the memory use low. To determine the final
ranking, we therefore multiply the average elapsed wall-
clock time in hours (h) with the average maximum mem-
ory use in gigabytes (GB), with lower values in gigabyte
hours (GBh) being better. This yields the following values
(cf. Fig. 3):
29.33 GBh for Go
38.63 GBh for Java
44.26 GBh for C++17
This appropriately reflects the results of the benchmarks:
While the Java benchmarks report a somewhat faster runtime
than the Go benchmarks, the memory use of the
Java runs is significantly higher, leading to a higher GBh
value than for the Go runs. The C++17 runs are significantly
slower than both Go and Java, explaining the
highest reported GBh value. We therefore consider Go to
be the best choice, yielding the best balance between runtime
performance and memory use, followed by Java and
then C++17.
Fig. 1 Runtime performance. Average elapsed wall-clock times in minutes for the best Go, Java, and C++17 implementations, with confidence intervals
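These GBh values follow directly from the reported averages; the following Go snippet (ours, for illustration) reproduces them from the wall-clock times in seconds and the memory figures in GB:

```go
package main

import "fmt"

// gbHours multiplies elapsed wall-clock time (in seconds, converted to
// hours) by maximum memory use (in GB), yielding gigabyte hours (GBh).
func gbHours(seconds, gb float64) float64 {
	return seconds / 3600 * gb
}

func main() {
	// Averages reported in the Results section.
	fmt.Printf("Go:    %.2f GBh\n", gbHours(7*60+56.152, 221.73))  // Go:    29.33 GBh
	fmt.Printf("Java:  %.2f GBh\n", gbHours(6*60+54.546, 335.46))  // Java:  38.63 GBh
	fmt.Printf("C++17: %.2f GBh\n", gbHours(10*60+23.603, 255.48)) // C++17: 44.26 GBh
}
```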
Discussion
Memory management issues in elPrep in more detail
The most common use case for elPrep is that it performs
sorting of reads and duplicate marking, among other steps
[17]. Such a pipeline executes in two phases: In the first
phase, elPrep reads a BAM input file, parses the read
entries into objects, and performs duplicate marking and
some filtering steps on the fly. Once all reads are stored as
heap objects in RAM, they are sorted using a parallel sort-
ing algorithm. Finally, in the second phase, the modified
reads are converted back into entries for a BAM output
file and written back. elPrep splits the processing of reads
into these two phases because writing the reads back to an
output file can only commence once duplicates are fully
known and reads are fully sorted in RAM.
Phase 1 allocates various data structures while pars-
ing the read representations from BAM files into heap
objects. A subset of these objects become obsolete after
phase 1. The different memory management approaches
outlined in the “Background” section above deal with
these temporary objects in different ways.
A garbage collector needs to spend time to classify these
obsolete objects as inaccessible and deallocate them. A
stop-the-world, sequential garbage collector creates a significant
pause in which the main program cannot make
progress. This was the case with the previous elPrep versions
(up to version 2.6), which is why we provided an
option to users to disable garbage collection altogether
in those versions [38]. In contrast, a concurrent, parallel
garbage collector can perform its job concurrently with
phase 2, which can therefore commence immediately.
Fig. 2 Maximum memory use. Average maximum memory use in GB for the best Go, Java, and C++17 implementations, with confidence intervals
Fig. 3 Final ranking of programming languages. Average elapsed wall-clock times multiplied by average maximum memory use in GBh
With reference counting, objects are recognized as
obsolete due to their reference counts dropping to zero.
Deallocation of these objects leads to transitive dealloca-
tions of other objects because of their reference counts
transitively dropping to zero. Since this is an inherently
sequential process, this leads to a similar significant pause
as with a stop-the-world garbage collector.
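The two-phase structure described in this subsection can be sketched as follows. This is a deliberately simplified, sequential Go illustration with a toy record format and a toy duplicate criterion; it is not elPrep's actual parsing, duplicate-marking, or parallel sorting code:

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// read is a simplified stand-in for a parsed SAM/BAM read entry.
type read struct {
	name      string
	pos       int
	duplicate bool
}

// phase1 parses raw "name:pos" entries into heap objects, marks
// duplicates on the fly (toy criterion: same coordinate seen before),
// and sorts by coordinate. Output can only start once this phase is
// done, because duplicates must be fully known and the sort complete.
func phase1(raw []string) []read {
	reads := make([]read, 0, len(raw))
	seen := make(map[int]bool)
	for _, entry := range raw {
		parts := strings.SplitN(entry, ":", 2)
		pos, _ := strconv.Atoi(parts[1])
		reads = append(reads, read{name: parts[0], pos: pos, duplicate: seen[pos]})
		seen[pos] = true
	}
	sort.SliceStable(reads, func(i, j int) bool { return reads[i].pos < reads[j].pos })
	return reads
}

// phase2 converts the sorted, marked reads back into output entries.
func phase2(reads []read) []string {
	out := make([]string, len(reads))
	for i, r := range reads {
		out[i] = fmt.Sprintf("%s@%d dup=%v", r.name, r.pos, r.duplicate)
	}
	return out
}

func main() {
	for _, line := range phase2(phase1([]string{"a:300", "b:100", "c:300"})) {
		fmt.Println(line)
	}
}
```

In this sketch, the intermediate `read` objects built in phase 1 are exactly the kind of heap data whose lifetime the different memory management schemes handle so differently.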
C++17 performance in more detail
C and C++ typically perform much better than other programming
languages in most benchmarks that focus on
isolated algorithms or kernels [24–26, 28–30]. Since our
C++17 implementation of elPrep uses reference counting,
the performance gap observed in our benchmarks may be
explained by the deallocation pause it causes, as described
in the previous subsection.
To verify this theory, we timed each phase and the deal-
location pause in the C++17 implementation of elPrep
separately, and repeated the benchmark another 30 times
to determine the timings, standard deviations, and confi-
dence intervals. The results are shown in Fig. 4. The first
phase needs on average 4 mins 26.657 secs, with a stan-
dard deviation of 6.648 secs; the deallocation pause needs
on average 2 mins 18.633 secs, with a standard deviation
of 4.77 secs; and the second phase needs on average 3 mins
33.832 secs, with a standard deviation of 17.376 secs.
The average total sum of the 30 C++17 runtimes is 10
mins 19.122 secs with a standard deviation of 22.782 secs.
If we subtract the timings of the deallocation pause from
the average total runtime, we get 8 mins 0.489 secs with a
standard deviation of 20.605 secs. This is indeed very close
to the Go benchmarks which, as reported above, need
on average 7 mins 56.152 secs. We therefore conclude
that the performance gap between the C++17 version
and the Go and Java versions can indeed be explained by
the deallocation pause caused by the reference counting
mechanism in C++17.
C++ provides many features for more explicit memory
management than is possible with reference counting. For
example, it provides allocators [35] to decouple memory
management from handling of objects in containers. In
principle, this may make it possible to use such an allo-
cator to allocate temporary objects that are known to
become obsolete during the deallocation pause described
above. Such an allocator could then be freed instantly,
removing the described pause from the runtime. However,
this approach would require a very detailed, error-prone
analysis of which objects must and must not be managed
by such an allocator, and would not translate well to
other kinds of pipelines beyond this particular use case.
Since elPrep’s focus is on being an open-ended software
framework, this approach is therefore not practical.
Tuning of memory management in C++17
The performance of parallel C/C++ programs often suf-
fers from the low-level memory allocator provided by the
C/C++ standard libraries. This can be mitigated by linking
a high-level memory allocator into a program that reduces
Fig. 4 Runtimes of phases in the C++17 implementation. Average elapsed wall-clock times in minutes for the two main phases of an elPrep pipeline
in the C++17 implementation, and the deallocation pause between phases 1 and 2 caused by the reference counting mechanism, with
confidence intervals. The second row depicts the same averages as in the first row, but without the deallocation pause. The sum of the two phases
in the second row is very close to the Go runtimes shown in Fig. 1
synchronization, false sharing, and memory consumption,
among other things [39]. Such a memory allocator also
groups objects of similar sizes into separate groups that
can be allocated from the operating system and freed
again in larger blocks, to efficiently handle even large
numbers of small-scale heap allocations in programs.
These are techniques which are also commonly found in
garbage-collected programming languages, but are largely
independent from whether memory management is auto-
matic or manual [19]. In our study, we have benchmarked
the C++17 implementation using the default unmodi-
fied memory allocator, the tbbmalloc allocator from Intel
Threading Building Blocks [40], the tcmalloc allocator
from gperftools [41], and the jemalloc allocator [42]. The
measurements are shown in Table 1. According to the
listed GBh values, jemalloc performs best.
Tuning of memory management in Java
Java provides a number of tuning options for its mem-
ory management [43]. Since our Java implementation of
elPrep suffers from a significantly higher average maxi-
mum memory use than the C++17 and Go implementa-
tions, we have investigated two of these options in more
detail:
The string deduplication option identifies strings
with the same contents during garbage collection,
and subsequently removes the redundancy by letting
these strings share the same underlying character
arrays. Since a significant portion of read data in
SAM/BAM files is represented by strings, it seemed
potentially beneficial to use this option.
The minimum and maximum allowed percentage of
free heap space after garbage collection can be
configured using the "MinHeapFreeRatio" and
"MaxHeapFreeRatio" options to minimize the heap size.
We ran the Java benchmark 30 times each with the
following configurations: with the default options; with
just the string deduplication option; with just the free-
heap options; and with both the string deduplication
and the free-heap options. For the free-heap options, we
followed the recommendation of the Java documenta-
tion to reduce the heap size as far as possible without
Table 1 Performance results for the different memory allocators
used in the C++17 benchmarks: (1) default allocator,
(2) tbbmalloc, (3) tcmalloc, (4) jemalloc
Average runtime Average memory Product
(1) 16 mins 57.467 secs 233.63 GB 66.03 GBh
(2) 16 mins 26.450 secs 233.51 GB 63.96 GBh
(3) 11 mins 24.809 secs 246.78 GB 46.94 GBh
(4) 10 mins 23.603 secs 255.48 GB 44.26 GBh
causing too much performance regression. The mea-
surements are shown in Table 2: The free-heap options
show no observable impact on the runtime perfor-
mance or the memory use, and the string deduplica-
tion option increases the average elapsed wall-clock time
with a minor additional increase in memory use. Accord-
ing to the listed GBh values, Java with default options
performs best.
Conclusions
Due to the concurrency and parallelism of Go’s and
Java’s garbage collectors, the elPrep reimplementations in
these programming languages perform significantly faster
than the C++17 implementation, which relies on reference
counting. Since the Go implementation uses significantly
less heap memory than the Java implementation, we
decided to base the official elPrep implementation on Go
as of version 3.0.
Based on our positive experiences, we recommend
authors of other bioinformatics tools for processing
SAM/BAM data, and potentially also other sequencing
data formats, to also consider Go as an implementation
language. Previous bioinformatics tools that are imple-
mented in Go include bíogo [44], Fastcov [45], SeqKit [46],
and Vcfanno [47], among others.
Methods
Existing literature on comparing programming languages
for performance strives to replicate algorithm or kernel
implementations as closely as possible to each other
across different programming languages, to ensure
fair comparisons of the underlying compiler and run-
time implementations. We focused on taking advantage
of the respective strengths of the different programming
languages and their libraries instead. Eventually, a reim-
plementation of elPrep would have to do this anyway to
achieve optimal performance, so this approach results in a
more appropriate assessment for our purpose. For exam-
ple, in C++17 we have used Intel’s Threading Building
Blocks as an advanced library for parallel programming,
and benchmarked different memory allocators optimized
for multi-threaded programs; in Go, we have relied on
Table 2 Performance results for the different memory
management options used in the Java benchmarks: (1) default
options, (2) with string deduplication, (3) with free-heap options,
(4) with string deduplication and free-heap options
Average runtime Average memory Product
(1) 6 mins 54.546 secs 335.46 GB 38.63 GBh
(2) 7 mins 30.815 secs 338.74 GB 42.42 GBh
(3) 6 mins 55.842 secs 335.45 GB 38.75 GBh
(4) 7 mins 25.415 secs 338.74 GB 41.91 GBh
its concurrency support through goroutines and channels
for communicating between them; and in Java, we have
based elPrep on the standard framework for functional-style
operations on streams of elements in the package
java.util.stream introduced in Java 8.
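To give a flavor of this style, the following sketch (ours, not elPrep code) fans records out over several worker goroutines and merges the results back onto a single channel:

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// produce feeds the input records into a channel and closes it.
func produce(lines []string) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		for _, l := range lines {
			out <- l
		}
	}()
	return out
}

// transform runs n worker goroutines that apply f to records from in,
// merging their results onto a single output channel that is closed
// once all workers are done.
func transform(in <-chan string, n int, f func(string) string) <-chan string {
	out := make(chan string)
	var wg sync.WaitGroup
	wg.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer wg.Done()
			for rec := range in {
				out <- f(rec)
			}
		}()
	}
	go func() { wg.Wait(); close(out) }()
	return out
}

func main() {
	// Note: with several workers, output order is not guaranteed.
	for rec := range transform(produce([]string{"read1", "read2"}), 4, strings.ToUpper) {
		fmt.Println(rec)
	}
}
```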
The benchmarks have all been performed on a Supermi-
cro SuperServer 1029U-TR4T node with two Intel Xeon
Gold 6126 processors consisting of 12 processor cores
each, clocked at 2.6 gigahertz (GHz), with 384 GB RAM.
The operating system used for the benchmarks is the
CentOS 7 distribution of Linux.
We have used the following compilers and libraries:
C++17: GNU g++ version 7.2.1
Threading Building Blocks 2018 Update 2
gperftools version 2.6.3
jemalloc version 5.0.1
Go: Official Go distribution version 1.9.5
Java: Java Platform, Standard Edition (JDK) 10
For C++17, we additionally used the Intel Threading
Building Blocks, gperftools, and jemalloc libraries. The Go
and Java versions do not require additional libraries.
We verified that all implementations produce exactly
the same results by using the method described in our pre-
vious paper on elPrep [16]. This method consists of the
following steps:
1 We verify that the resulting BAM file is properly
sorted by coordinate order with samtools index.
2 We remove the program record identifier tag (PG)
and alphabetically sort the optional fields in each
read with biobambam.
3 We sort the BAM file by read name and store it in
SAM format with samtools sort.
4 Finally, we verify that the contents are identical with
the result of the original elPrep version with the Unix
diff command.
Endnotes
1 Object graphs with cycles cannot be easily reclaimed
using reference counting alone. However, such cyclic data
structures have not occurred yet in elPrep, which is why
we do not discuss this issue further in this paper.
2 Specifically, Java uses concurrent, parallel Garbage-First
garbage collection [48], whereas Go uses a more
traditional concurrent, parallel mark-and-sweep collector
[49].
3 Other mature programming languages with support
for reference counting include Objective-C, Swift, and
Rust [50]. However, in its algorithm for duplicate marking,
elPrep requires an atomic compare-and-swap operation
on reference-counted pointers, which does not exist in
those languages, but exists in C++17.
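In a garbage-collected language, the corresponding operation needs no reference counts across the swap. The following Go sketch (ours, not elPrep's actual duplicate-marking code; it assumes Go 1.19+ for atomic.Pointer) shows the general compare-and-swap retry pattern, here used to keep the highest-scoring candidate:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type read struct{ score int }

// keepBest atomically installs r as the current best candidate if its
// score is higher, using a compare-and-swap retry loop. Under garbage
// collection no reference counts are involved: a superseded read is
// simply reclaimed once it becomes unreachable.
func keepBest(best *atomic.Pointer[read], r *read) {
	for {
		cur := best.Load()
		if cur != nil && cur.score >= r.score {
			return // an equal or better candidate is already installed
		}
		if best.CompareAndSwap(cur, r) {
			return // we won the race
		}
		// another goroutine swapped in between Load and CAS; retry
	}
}

// maxScore races one goroutine per score against the shared pointer.
func maxScore(scores []int) int {
	var best atomic.Pointer[read]
	var wg sync.WaitGroup
	for _, s := range scores {
		wg.Add(1)
		go func(s int) { defer wg.Done(); keepBest(&best, &read{score: s}) }(s)
	}
	wg.Wait()
	return best.Load().score
}

func main() {
	fmt.Println(maxScore([]int{3, 1, 4, 1, 5, 9, 2, 6})) // 9
}
```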
4 We have not performed a detailed comparison against
the original version of elPrep implemented in Common
Lisp, but based on previous performance benchmarks,
the Go implementation seems to perform close to the
Common Lisp implementation.
Abbreviations
BAM: The binary file format equivalent of SAM files; biobambam: A program
for processing BAM files; CentOS 7: A distribution of Linux; diff: A Unix tool for
reporting the differences between two text files; GATK: Genome Analysis
Toolkit: A software framework for variant discovery, genotyping, and other
kinds of analysis of sequencing data; GATK4: Genome Analysis Toolkit version
4; GBh: Gigabyte hours: A measure of memory size multiplied by time; GHz:
Gigahertz: A measure of computer processor speed; GNU g++: A C++
compiler; gperftools: Google Performance Tools: A C/C++ library that contains
tcmalloc and additional memory analysis tools; htsjdk: A Java library for
processing DNA sequencing data; jemalloc: A C/C++ library for managing
heap memory; NA12878: A publicly available DNA sequencing data set; PCR:
Polymerase chain reaction: A method for copying DNA segments; PG:
Program record identifier: Entries used in SAM/BAM files for recording which
software programs were used to create them; RAM: Random access memory:
The main memory in a computer; SAM: Sequence Alignment/Map format: A
text file format for representing aligned DNA sequencing data; samtools: A
software tool for processing SAM/BAM files; Superserver 1029U-TR4T: A
particular server computer system sold by Supermicro; tbbmalloc: A C/C++
library for managing heap memory, part of Intel Threading Building Blocks;
tcmalloc: A C/C++ library for managing heap memory, part of gperftools;
Xeon Gold 6126: A particular server computer processor sold by Intel
Acknowledgements
The authors are grateful to the imec.icon GAP project members, and especially
Western Digital for providing the compute infrastructure for performing the
benchmarks described in this paper. The authors also thank Thomas J. Ashby
and Tom Haber for in-depth discussions about memory management
techniques in various programming languages.
Funding
No funding was received for this study.
Availability of data and materials
The source code for the different elPrep implementations is available at the
following locations:
Common Lisp: https://github.com/exascience/cl-elprep
C++17, Java: https://github.com/exascience/elprep-bench
Go: https://github.com/exascience/elprep/tree/v3.04
The five-step preparation pipeline benchmarked in this paper corresponds to
the pipeline implemented in the script run-wes-gatk.sh, which is
available at https://github.com/exascience/elprep/tree/v3.04/demo together
with its input files.
Authors’ contributions
PC designed and performed the study, participated in the Common Lisp and
Go implementations of elPrep, implemented the C++17 and Java versions of
elPrep, and drafted the manuscript. CH designed the elPrep software
architecture and the benchmarked preparation pipeline, participated in the
Common Lisp and Go implementations of elPrep, and drafted the manuscript.
PC, CH and WV contributed to the final manuscript. All authors read and
approved the final manuscript.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Costanza et al. BMC Bioinformatics (2019) 20:301 Page 9 of 10
Competing interests
The authors are employees of imec, Belgium, and declare that they have no
competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in
published maps and institutional affiliations.
Received: 21 February 2019 Accepted: 15 May 2019
References
1. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Abecasis G, Durbin R. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25(16). https://doi.org/10.1093/bioinformatics/btp352.
2. Broad Institute. Picard. http://broadinstitute.github.io/picard. Accessed 19
Sept 2018.
3. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A,
Garimella K, Altshuler D, Gabriel S, Daly M, DePristo M. The Genome
Analysis Toolkit: A MapReduce framework for analyzing next-generation
DNA sequencing data. Genome Res. 2010;20:1297–303. https://doi.org/
10.1101/gr.107524.110.
4. Tarasov A, Vilella A, Cuppen E, Nijman I, Prins P. Sambamba: fast
processing of NGS alignment formats. Bioinformatics. 2015;31(12):
2032–4. https://doi.org/10.1093/bioinformatics/btv098.
5. Tischler G, Leonard S. biobambam: tools for read pair collation based
algorithms on BAM files. Source Code Biol Med. 2014;9(13). https://doi.org/10.1186/1751-0473-9-13.
6. Jun G, Wing M, Abecasis G, Kang H. An efficient and scalable analysis framework for variant extraction and refinement from population scale DNA sequence data. Genome Res. 2015. https://doi.org/10.1101/gr.176552.114.
7. Faust GG, Hall IM. SAMBLASTER: fast duplicate marking and structural
variant read extraction. Bioinformatics. 2014;30(17):2503–5. https://doi.
org/10.1093/bioinformatics/btu314.
8. Herzeel C, Costanza P, Decap D, Fostier J, Verachtert W. elPrep 4: A
multithreaded framework for sequence analysis. PLoS ONE. 2019;14(2):.
https://doi.org/10.1371/journal.pone.0209523.
9. Decap D, Reumers J, Herzeel C, Costanza P, Fostier J. Halvade: scalable
sequence analysis with MapReduce. Bioinformatics. 2015;31(15):2482–8.
https://doi.org/10.1093/bioinformatics/btv179.
10. Nothaft FA, Massie M, Danford T, Zhang Z, Laserson U, Yeksigian C,
Kottalam J, Ahuja A, Hammerbacher J, Linderman M, Franklin M,
Joseph AD, Patterson DA. Rethinking data-intensive science using
scalable analytics systems. In: Proceedings of the 2015 International
Conference on Management of Data (SIGMOD ’15). New York: ACM; 2015.
https://doi.org/10.1145/2723372.2742787.
11. Guimera R. bcbio-nextgen: Automated, distributed next-gen sequencing
pipeline. EMBnet.journal. 2012;17:30. https://doi.org/10.14806/ej.17.B.286.
12. Niemenmaa M, Kallio A, Schumacher A, Klemela P, Korpelainen E,
Heljanko K. Hadoop-BAM: directly manipulating next generation
sequencing data in the cloud. Bioinformatics. 2012;28(6):876–7. https://
doi.org/10.1093/bioinformatics/bts054.
13. Deng L, Huang G, Zhuang Y, Wei J, Yan Y. Higene: A high-performance platform for genomic data analysis. IEEE; 2017. https://doi.org/10.1109/BIBM.2016.7822584.
14. Decap D, Reumers J, Herzeel C, Costanza P, Fostier J. Halvade-RNA:
Parallel variant calling from transcriptomic data using MapReduce. PLOS
ONE. 2017;12(3):. https://doi.org/10.1371/journal.pone.0174575.
15. Weeks N, Luecke G. Optimization of SAMtools sorting using OpenMP tasks. Cluster Computing - The Journal of Networks, Software Tools and Applications. 2017;20(3):1869–80. https://doi.org/10.1007/s10586-017-0874-8.
16. Herzeel C, Costanza P, Decap D, Fostier J, Reumers J. elPrep:
High-performance preparation of sequence alignment/map files for
variant calling. PLoS ONE. 2015;10(7):. https://doi.org/10.1371/journal.
pone.0132868.
17. Van der Auwera GA, Carneiro MO, Hartl C, Poplin R, del Angel G, Levy-Moonshine A, Jordan T, Shakir K, Roazen D, Thibault J, Banks E, Garimella KV, Altshuler D, Gabriel S, DePristo MA. From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Curr Protoc Bioinform. 2013;43(1):11.10.1–11.10.33. https://doi.org/10.1002/0471250953.bi1110s43.
18. Decap D, Reumers J, Herzeel C, Costanza P, Fostier J. Performance
analysis of a parallel, multi-node pipeline for DNA sequencing. In:
Proceedings of the 11th International Conference on Parallel Processing
and Applied Mathematics (PPAM):6-9 September 2015. Krakow: LNCS,
Springer; 2015. p. 233–42. https://doi.org/10.1007/978-3-319-32152-3_22.
19. Jones R, Hosking A, Moss E. The Garbage Collection Handbook. Boca
Raton: CRC Press; 2012.
20. Harbison III SP, Steele Jr GL. C — A Reference Manual, Fifth Edition. Upper
Saddle River: Prentice Hall; 2002.
21. Gosling J, Joy B, Steele Jr GL, Bracha G, Buckley A. The Java Language
Specification, Java SE 8 Edition. Upper Saddle River: Addison-Wesley
Professional; 2014.
22. Donovan AAA, Kernighan BW. The Go Programming Language. Upper
Saddle River: Addison-Wesley Professional; 2015.
23. Steele Jr GL. Common Lisp, The Language, Second Edition. Boston: Digital
Press; 1990.
24. Fourment M, Gillings MR. A comparison of common programming
languages used in bioinformatics. BMC Bioinformatics. 2008;9(1). https://doi.org/10.1186/1471-2105-9-82.
25. Borağan Aruoba S, Fernández-Villaverde J. A comparison of programming languages in economics. J Econ Dyn Control. 2015;58:265–73.
26. Moreira JE, Midkiff SP, Gupta M. A comparison of Java, C/C++, and FORTRAN
for numerical computing. IEEE Antennas Propag Mag. 1998;40(5):102–5.
27. Biswa K, Jamatia B, Choudhury D, Borah P. Comparative analysis of C,
FORTRAN, C# and Java programming languages. Int J Comput Sci Inf
Technol. 2016;7(2):1004–7.
28. Hundt R. Loop recognition in C++/Java/Go/Scala. In: Proceedings of Scala
Days 2011; 2011. https://days2011.scala-lang.org/sites/days2011/files/ws3-1-Hundt.pdf. Accessed 19 Sept 2018.
29. Nanz S, Furia CA. A comparative study of programming languages in
Rosetta Code. In: 2015 IEEE/ACM 37th IEEE International Conference on
Software Engineering. Los Alamitos: IEEE; 2015. p. 778–88. https://doi.org/10.1109/ICSE.2015.90.
30. Prechelt L. An empirical comparison of seven programming languages.
Computer. 2000;33(10):23–9.
31. Togashi N, Klyuev V. Concurrency in Go and Java: Performance analysis. In: 2014 4th IEEE International Conference on Information Science and Technology. Beijing: IEEE; 2014. https://doi.org/10.1109/ICIST.2014.6920368.
32. Gouy I. The Computer Language Benchmarks Game. https://benchmarksgame-team.pages.debian.net/benchmarksgame/. Accessed 19 Sept 2018.
33. Rubin BS, Christ AR, Bohrer KA. Java and the IBM San Francisco project.
IBM Syst J. 1998;37(3):365–71.
34. Samtools organisation. Htsjdk. https://github.com/samtools/htsjdk.
Accessed 19 Sept 2018.
35. Stroustrup B. A Tour of C++, Second Edition. Upper Saddle River:
Addison-Wesley Professional; 2018.
36. Icahn School of Medicine at Mount Sinai. High-coverage Whole Exome
Sequencing of CEPH/UTAH Female Individual (HapMap: NA12878).
https://www.ncbi.nlm.nih.gov/sra/SRX731649. Accessed 19 Sept 2018.
37. Georges A, Buytaert D, Eeckhout L. Adding rigorous statistics to the Java
benchmarker’s toolbox. In: Companion to the 22nd ACM SIGPLAN
Conference on Object-oriented Programming Systems and Applications.
New York: ACM; 2007. p. 793–4. https://doi.org/10.1145/1297846.
1297891.
38. Herzeel C. elPrep – Execution Command Options. https://github.com/
ExaScience/elprep/tree/2.61#execution-command-options. Accessed 19
Sept 2018.
39. Berger ED, McKinley KS, Blumofe RD, Wilson PR. Hoard: a scalable
memory allocator for multithreaded applications. In: Proceedings of the
Ninth International Conference on Architectural Support for
Programming Languages and Operating Systems. New York: ACM; 2000.
https://doi.org/10.1145/378993.379232.
40. Reinders J. Intel Threading Building Blocks. Sebastopol: O’Reilly; 2007.
41. gperftools. https://github.com/gperftools/gperftools. Accessed 19 Sept 2018.
42. jemalloc. http://jemalloc.net. Accessed 19 Sept 2018.
43. Java Platform, Standard Edition Tools Reference. https://docs.oracle.com/
javase/10/tools/java.htm. Accessed 19 Sept 2018.
44. Kortschak RD, Snyder JB, Maragkakis M, Adelson DL. bíogo: a simple
high-performance bioinformatics toolkit for the Go language. J Open
Source Softw. 2017;2(10):167. https://doi.org/10.21105/joss.00167.
45. Shen W, Li Y. A novel algorithm for detecting multiple covariance and
clustering of biological sequences. Sci Rep. 2016;6:. https://doi.org/10.
1038/srep30425.
46. Shen W, Le S, Li Y, Hu F. Seqkit: A cross-platform and ultrafast toolkit for
FASTA/Q file manipulation. PLoS ONE. 2016;11(10):. https://doi.org/10.
1371/journal.pone.0163962.
47. Pedersen BS, Layer RM, Quinlan AR. Vcfanno: fast, flexible annotation of
genetic variants. Genome Biol. 2016;17(1):118. https://doi.org/10.1186/
s13059-016-0973-5.
48. Detlefs D, Flood C, Heller S, Printezis T. Garbage-first garbage collection.
In: Proceedings of the 4th International Symposium on Memory
Management. New York: ACM; 2004. https://doi.org/10.1145/1029873.
1029879.
49. Hudson RL. Getting to Go. https://blog.golang.org/ismmkeynote.
Accessed 19 Sept 2018.
50. Klabnik S, Nichols C. The Rust Programming Language. San Francisco: No
Starch Press; 2018.
... Most of the high-throughput analyzing tools were established in scripting languages, which are not able to provide efficient and timely analysis for the large-scale datasets. Tools developed in compiling languages exhibited much faster speed and lower memory and hardware requirement than scripting languages [24][25][26][27]. C++, as a compiling language, has been shown with the best performance among the programming languages commonly used in bioinformatics field [24][25][26][27]. ...
... Tools developed in compiling languages exhibited much faster speed and lower memory and hardware requirement than scripting languages [24][25][26][27]. C++, as a compiling language, has been shown with the best performance among the programming languages commonly used in bioinformatics field [24][25][26][27]. However, most of the current tools were not written in C++, instead, in Java or scripting languages. ...
... Thus, programming languages have become necessary to assist the analysis to avoid the errors created by manual operation, to maintain the robustness, and to accelerate the analyzing speed. C++ is one of the fastest common programming languages used in bioinformatics, comparing to Java, Perl, and Python [24][25][26][27]. Besides, C++ requires the least memory while performing the computation [27]. ...
Article
Full-text available
Background Yeast one-hybrid (Y1H) is a common technique for identifying DNA-protein interactions, and robotic platforms have been developed for high-throughput analyses to unravel the gene regulatory networks in many organisms. Use of these high-throughput techniques has led to the generation of increasingly large datasets, and several software packages have been developed to analyze such data. We previously established the currently most efficient Y1H system, meiosis-directed Y1H; however, the available software tools were not designed for processing the additional parameters suggested by meiosis-directed Y1H to avoid false positives and required programming skills for operation. Results We developed a new tool named GateMultiplex with high computing performance using C++. GateMultiplex incorporated a graphical user interface (GUI), which allows the operation without any programming skills. Flexible parameter options were designed for multiple experimental purposes to enable the application of GateMultiplex even beyond Y1H platforms. We further demonstrated the data analysis from other three fields using GateMultiplex, the identification of lead compounds in preclinical cancer drug discovery, the crop line selection in precision agriculture, and the ocean pollution detection from deep-sea fishery. Conclusions The user-friendly GUI, fast C++ computing speed, flexible parameter setting, and applicability of GateMultiplex facilitate the feasibility of large-scale data analysis in life science fields.
... The main difference between elPrep and other tools for processing this kind of data such as Picard, SAMtools [3], and GATK4 [6] lies in its software architecture that parallelizes and merges the execution of the pipeline steps while minimizing the number of data accesses to files. Our previous work [1,2,7] shows that this design greatly speeds up the runtimes of both whole-genome and whole-exome pipelines. ...
... We additionally always guarantee that the output elPrep produces for any step is identical to the output of the reference tool, for example GATK4, generates. This creates additional complexity from the implementation side, leading us to develop multiple new algorithms [1,2,7]. From a user's perspective, however, it makes elPrep a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 a drop-in replacement for other tools, resulting in its adoption by different bioinformatics projects [8][9][10][11][12][13][14][15]. ...
... While the GATK approach to optimise the single variant calling step does reduce the overall runtime (by less than 2x), elPrep shows a much better overall speedup (up to 16x). This confirms the effectiveness of elPrep's architecture, verifying again our claims made in earlier work [1,2,7]. ...
Article
Full-text available
We present elPrep 5, which updates the elPrep framework for processing sequencing alignment/map files with variant calling. elPrep 5 can now execute the full pipeline described by the GATK Best Practices for variant calling, which consists of PCR and optical duplicate marking, sorting by coordinate order, base quality score recalibration, and variant calling using the haplotype caller algorithm. elPrep 5 produces identical BAM and VCF output as GATK4 while significantly reducing the runtime by parallelizing and merging the execution of the pipeline steps. Our benchmarks show that elPrep 5 speeds up the runtime of the variant calling pipeline by a factor 8-16x on both whole-exome and whole-genome data while using the same hardware resources as GATK4. This makes elPrep 5 a suitable drop-in replacement for GATK4 when faster execution times are needed.
... In [2], analysis is performed to conduct an empirical study to analyze the productivity variations across different programming languages. In bioinformatics, three programming languages for a full-fledged next-generation sequencing tool are compared [3]. The memory usage and speed of execution for three standard bioinformatics methods for six different programming languages are compared [4]. ...
Preprint
Full-text available
An in-house developed 2D ultrasound computerized Tomography system is fully automated. Performance analysis of instrument and software interfacing soft tools, namely the LabVIEW, MATLAB, C, and Python, is presented. The instrument interfacing algorithms, hardware control algorithms, signal processing, and analysis codes are written using above mentioned soft tool platforms. Total of eight performance indices are used to compare the ease of (a) realtime control of electromechanical assembly, (b) sensors, instruments integration, (c) synchronized data acquisition, and (d) simultaneous raw data processing. It is found that C utilizes the least processing power and performs a lower number of processes to perform the same task. In runtime analysis (data acquisition and realtime control), LabVIEW performs best, taking 365.69s in comparison to MATLAB (623.83s), Python ( 1505.54s), and C (1252.03s) to complete the experiment. Python performs better in establishing faster interfacing and minimum RAM usage. LabVIEW is recommended for its fast process execution. C is recommended for the most economical implementation. Python is recommended for complex system automation having a very large number of components involved. This article provides a methodology to select optimal soft tools for instrument automation-related aspects.
... Performance of 99 tools from 80 packages on 92 BED12 test cases(Alneberg et al., 2014;Ay et al., 2014;Bentsen et al., 2020;Bioconvert Developers, 2017;Bollen et al., 2019;Boyle et al., 2008;Breese and Liu, 2013;Broad Institute, 2019;Buske et al., 2011;Chen et al., 2016;Cingolani et al., 2012a;2012b;Cooke et al., 2021;Costanza et al., 2019;Cotto et al., 2021;Cretu Stancu et al., 2017; T.Curk et al., in preparation;Dale et al., 2011;Daley and Smith, 2014;Dunn and Weissman, 2016;Fang et al., 2015;Farek, 2017;Feng et al., 2011;Garrison, 2012;Gremme et al., 2013;Hanghøj et al., 2019;Heger et al., 2013;Heinz et al., 2010;Hensly et al., 2015;Herzeel et al., 2015;Heuer, 2022;Huddleston et al., 2021;Karunanithi et al., 2019;Kaul, 2018;Kaul et al., 2020;Kent et al., 2002;Khan and Mathelier, 2017;Kodali, 2020;Langenberger et al., 2009;Leonardi, 2019;Li, 2012;Li et al., 2009;Lopez et al., 2019;Mahony et al., 2014;Mapleson et al., 2018;Mikheenko et al., 2018;Narzisi et al., 2014;Neph et al., 2012;Neumann et al., 2019;Okonechnikov et al., 2016;Orchard et al., 2020;Pedersen, 2018;Pedersen et al., 2012;Pedersen and Quinlan, 2018;Pertea and Pertea, 2020;Pongor et al., 2020;Quinlan and Hall, 2010;Ram ırez et al., 2016;Ramskö ld et al., 2009;Rausch et al., 2019;Robinson et al., 2011;Sadedin and Oshlack, 2019;Schiller, 2013;Shen et al., 2016;Sims et al., 2014;Song and Smith, 2011; Stovner and Saetrom, 2019;Sturm et al., 2018;Talevich et al., 2016;Thorvaldsdó ttir et al., 2013;Uren et al., 2012;van Heeringen and Veenstra, 2011;van't Hof et al., 2017;Vorderman et al., 2019;Wala et al., 2016;Wang et al., 2012;Webster et al., 2019;Willems et al., 2017;Xu et al., 2010;Zerbino et al., 2014; ...
Article
Full-text available
Motivation Bioinformatics software tools operate largely through the use of specialized genomics file formats. Often these formats lack formal specification, making it difficult or impossible for the creators of these tools to robustly test them for correct handling of input and output. This causes problems in interoperability between different tools that, at best, wastes time and frustrates users. At worst, interoperability issues could lead to undetected errors in scientific results. Results We developed a new verification system, Acidbio, which tests for correct behavior in bioinformatics software packages. We crafted tests to unify correct behavior when tools encounter various edge cases—potentially unexpected inputs that exemplify the limits of the format. To analyze the performance of existing software, we tested the input validation of 80 Bioconda packages that parsed the Browser Extensible Data (BED) format. We also used a fuzzing approach to automatically perform additional testing. Of 80 software packages examined, 75 achieved less than 70% correctness on our test suite. We categorized multiple root causes for the poor performance of different types of software. Fuzzing detected other errors that the manually designed test suite could not. We also created a badge system that developers can use to indicate more precisely which BED variants their software accepts and to advertise the software’s performance on the test suite. Availability Acidbio is available at https://github.com/hoffmangroup/acidbio. Supplementary information Supplementary data are available at Bioinformatics online.
... While languages like Python and Perl were never designed for this domain, other languages like Java, C#, Go and (more recently) Rust also advertise a high performance. Most empiric comparisons still conclude that C ++ is superior to the others or at least among the best (Prechelt, 2000;Fourment and Gillings, 2008;Aruoba and Fernández-Villaverde, 2014), 3 although under very specific circumstances other programming languages also take the lead (Costanza et al., 2019). ...
Thesis
Full-text available
This thesis introduces SeqAn3, a new software library built with Modern C++ to solve problems from the domain of sequence analysis in bioinformatics. It discusses previous versions of the library in detail and explains the importance of highly performing programming languages like C++. Complexity in the design of the library and of the programming language itself are identified as the major obstacles to user satisfaction, widespread adoption and long-term viability of the project. Therefore, based on very fundamental changes in the C++ programming language, a new library design is formulated and implemented. Its impact is showcased by porting the local aligner called Lambda from SeqAn2 to SeqAn3. Both, the library and the application are highly relevant in practice and prove that simpler and more compact solutions are possible. This thesis documents the process of creating said software, contributing vital information to the fields of research software engineering, library design and to a certain degree also applied programming language research. As one of the first larger projects to be designed fully around C++20 features, it has instructive value beyond bioinformatics.
... Costanza et al. [8] performed performance testing to analyze and compare the performance of three programming languages (Go, Java, and C ++ ). Based on their benchmark results, the authors selected Go as their implementation tool and recommended considering Go as a valid candidate for developing other bioinformatics applications. ...
Article
Full-text available
Context Software development is a continuous decision-making process that mainly relies on the software engineer’s experience and intuition. One of the essential decisions in the early stages of the process is selecting the best fitting programming language ecosystem based on the project requirements. A significant number of criteria, such as developer availability and consistent documentation, in addition to the number of available options in the market, lead to a challenging decision-making process. As the selection of programming language ecosystems depends on the application to be developed and its environment, a decision model is required to analyze the selection problem using systematic identification and evaluation of potential alternatives for a development project. Method Recently, we introduced a framework to build decision models for technology selection problems in software production. Furthermore, we designed and implemented a decision support system that uses such decision models to support software engineers with their decision-making problems. This study presents a decision model based on the framework for the programming language ecosystem selection problem. Results The decision model has been evaluated through seven real-world case studies at seven software development companies. The case study participants declared that the approach provides significantly more insight into the programming language ecosystem selection process and decreases the decision-making process’s time and cost. Conclusion With the decision model, software engineers can more rapidly evaluate and select programming language ecosystems. Having the knowledge in the decision model readily available supports software engineers in making more efficient and effective decisions that meet their requirements and priorities. Furthermore, such reusable knowledge can be employed by other researchers to develop new concepts and solutions for future challenges.
... At first, let us consider research devoted to the productivity of the algorithms and improving the analysis results quality. In [6], Go, C++, and Java programming languages were assessed with respect to the ease of implementation, memory consumption and overall computation performance, Go was chosen. The main requirements were addressed to big data string processing. ...
Article
Full-text available
Considering the large number of optimisation techniques that have been integrated into the design of the Java Virtual Machine (JVM) over the last three decades, the Java interpreter continues to persist as a significant bottleneck in the performance of bytecode execution. This paper examines the relationship between Java Runtime Environment (JRE) performance concerning the interpreted execution of Java bytecode and the effect modern compiler selection and integration within the JRE build toolchain has on that performance. We undertook this evaluation relative to a contemporary benchmark suite of application workloads, the Renaissance Benchmark Suite. Our results show that the choice of GNU GCC compiler version used within the JRE build toolchain statistically significantly affects runtime performance. More importantly, not all OpenJDK releases and JRE JVM interpreters are equal. Our results show that OpenJDK JVM interpreter performance is associated with benchmark workload. In addition, in some cases, rolling back to an earlier OpenJDK version and using a more recent GNU GCC compiler within the build toolchain of the JRE can significantly positively impact JRE performance.
Preprint
Background: Bioinformatics software tools operate largely through the use of specialized genomics file formats. Often these formats lack formal specification, and only rarely do the creators of these tools robustly test them for correct handling of input and output. This causes problems in interoperability between different tools that, at best, wastes time and frustrates users. At worst, interoperability issues could lead to undetected errors in scientific results. Methods: We sought (1) to assess the interoperability of a wide range of bioinformatics software using a shared genomics file format and (2) to provide a simple, reproducible method for enhancing interoperability. As a focus, we selected the popular BED file format for genomic interval data. Based on the file format's original documentation, we created a formal specification. We developed a new verification system, Acidbio (https://github.com/hoffmangroup/acidbio), which tests for correct behavior in bioinformatics software packages. We crafted tests to unify correct behavior when tools encounter various edge cases—potentially unexpected inputs that exemplify the limits of the format. To analyze the performance of existing software, we tested the input validation of 80 Bioconda packages that parsed the BED format. We also used a fuzzing approach to automatically perform additional testing. Results: Of 80 software packages examined, 75 achieved less than 70% correctness on our test suite. We categorized multiple root causes for the poor performance of different types of software. Fuzzing detected other errors that the manually designed test suite could not. We also created a badge system that developers can use to indicate more precisely which BED variants their software accepts and to advertise the software's performance on the test suite. 
Discussion: Acidbio makes it easy to assess interoperability of software using the BED format, and therefore to identify areas for improvement in individual software packages. Applying our approach to other file formats would increase the reliability of bioinformatics software and data.
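The kind of edge case such a test suite exercises can be illustrated with a minimal, hypothetical validator for three-column BED lines. The function below is an illustration only, not Acidbio's code, and it assumes a strict BED3 variant (tab-separated fields, non-negative integer coordinates, start strictly less than end):

```python
def validate_bed3(line):
    """Check one line against basic BED3 rules: tab-separated fields,
    non-negative integer coordinates, start strictly less than end
    (BED intervals are half-open)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        return False  # BED requires at least chrom, chromStart, chromEnd
    chrom, start, end = fields[0], fields[1], fields[2]
    if not chrom:
        return False
    if not (start.isdigit() and end.isdigit()):
        return False  # coordinates must be non-negative integers
    return int(start) < int(end)

# Edge cases of the kind a BED test suite probes:
print(validate_bed3("chr1\t10\t20"))  # well-formed line
print(validate_bed3("chr1 10 20"))    # spaces instead of tabs
print(validate_bed3("chr1\t20\t10"))  # start >= end
print(validate_bed3("chr1\t-5\t10"))  # negative coordinate
```

Tools disagree on exactly these cases; a shared specification plus a test suite like this is what pins the expected behavior down.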
Article
Full-text available
We present elPrep 4, a reimplementation from scratch of the elPrep framework for processing sequence alignment map files in the Go programming language. elPrep 4 includes multiple new features allowing us to process all of the preparation steps defined by the GATK Best Practice pipelines for variant calling. This includes new and improved functionality for sorting, (optical) duplicate marking, base quality score recalibration, BED and VCF parsing, and various filtering options. The implementations of these options in elPrep 4 faithfully reproduce the outcomes of their counterparts in GATK 4, SAMtools, and Picard, even though the underlying algorithms are redesigned to take advantage of elPrep’s parallel execution framework to vastly improve the runtime and resource use compared to these tools. Our benchmarks show that elPrep executes the preparation steps of the GATK Best Practices up to 13x faster on WES data, and up to 7.4x faster for WGS data compared to running the same pipeline with GATK 4, while utilizing fewer compute resources.
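The single-pass architecture described above can be sketched abstractly: rather than writing an intermediate file per preparation step, steps are composed as record-level transformations and applied in one traversal. The sketch below illustrates that idea only, with invented step names and dict-based "records"; it is not elPrep's actual API:

```python
def compose(*steps):
    """Combine per-record steps into one function; each step may
    transform the record or return None to drop it (filtering)."""
    def pipeline(record):
        for step in steps:
            record = step(record)
            if record is None:
                return None
        return record
    return pipeline

# Hypothetical preparation steps operating on dict-like records.
drop_unmapped = lambda r: None if r.get("unmapped") else r
clean_quals   = lambda r: {**r, "qual": max(r["qual"], 0)}

run = compose(drop_unmapped, clean_quals)
records = [{"qual": 30}, {"qual": -1}, {"qual": 20, "unmapped": True}]
prepared = [out for r in records if (out := run(r)) is not None]
print(prepared)  # [{'qual': 30}, {'qual': 0}]
```

Composing steps this way means the (potentially very large) file is decompressed, parsed, and re-serialized once, regardless of how many preparation steps run.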
Article
Full-text available
SAMtools is a widely-used genomics application for post-processing high-throughput sequence alignment data. Such sequence alignment data are commonly sorted to make downstream analysis more efficient. However, this sorting process itself can be computationally- and I/O-intensive: high-throughput sequence alignment files in the de facto standard binary alignment/map (BAM) format can be many gigabytes in size, and may need to be decompressed before sorting and compressed afterwards. As a result, BAM-file sorting can be a bottleneck in genomics workflows. This paper describes a case study on the performance analysis and optimization of SAMtools for sorting large BAM files. OpenMP task parallelism and memory optimization techniques resulted in a speedup of 5.9X versus the upstream SAMtools 1.3.1 for an internal (in-memory) sort of 24.6 GiB of compressed BAM data (102.6 GiB uncompressed) with 32 processor cores, while a 1.98X speedup was achieved for an external (out-of-core) sort of a 271.4 GiB BAM file.
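The external (out-of-core) sort mentioned above follows a classic pattern: sort chunks that fit in memory, spill each sorted run to disk, then k-way merge the runs. The sketch below illustrates that algorithm generically on integers; it is not SAMtools' C implementation:

```python
import heapq
import tempfile

def external_sort(values, chunk_size):
    """Sort more data than fits in 'memory' (chunk_size) by writing
    sorted runs to temporary files and k-way merging them."""
    runs = []
    for i in range(0, len(values), chunk_size):
        run = sorted(values[i:i + chunk_size])  # in-memory sort of one chunk
        f = tempfile.TemporaryFile(mode="w+")
        f.writelines(f"{v}\n" for v in run)
        f.seek(0)
        runs.append(f)
    # heapq.merge lazily merges the sorted runs without loading them fully.
    merged = heapq.merge(*((int(line) for line in f) for f in runs))
    return list(merged)

print(external_sort([9, 1, 7, 3, 8, 2, 6], chunk_size=3))
# [1, 2, 3, 6, 7, 8, 9]
```

For BAM data the per-chunk sort and the merge are both candidates for the task parallelism the paper describes, since runs can be sorted and compressed independently.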
Article
Full-text available
Given the current cost-effectiveness of next-generation sequencing, the amount of DNA-seq and RNA-seq data generated is ever increasing. One of the primary objectives of NGS experiments is calling genetic variants. While highly accurate, most variant calling pipelines are not optimized to run efficiently on large data sets. However, as variant calling in genomic data has become common practice, several methods have been proposed to reduce runtime for DNA-seq analysis through the use of parallel computing. Determining the effectively expressed variants from transcriptomics (RNA-seq) data has only recently become possible, and as such does not yet benefit from efficiently parallelized workflows. We introduce Halvade-RNA, a parallel, multi-node RNA-seq variant calling pipeline based on the GATK Best Practices recommendations. Halvade-RNA makes use of the MapReduce programming model to create and manage parallel data streams on which multiple instances of existing tools such as STAR and GATK operate concurrently. Whereas the single-threaded processing of a typical RNA-seq sample requires ∼28h, Halvade-RNA reduces this runtime to ∼2h using a small cluster with two 20-core machines. Even on a single, multi-core workstation, Halvade-RNA can significantly reduce runtime compared to using multi-threading, thus providing for a more cost-effective processing of RNA-seq data. Halvade-RNA is written in Java and uses the Hadoop MapReduce 2.0 API. It supports a wide range of distributions of Hadoop, including Cloudera and Amazon EMR.
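The MapReduce pattern that Halvade-RNA builds on can be sketched in miniature: map each record to keyed values, group by key (the shuffle), and reduce each group independently. The genomic-bin keys below are invented for illustration; this is not Halvade's Hadoop code:

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Minimal in-process MapReduce: map each record to (key, value)
    pairs, group by key (shuffle), then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # "shuffle" phase
    return {key: reducer(key, values) for key, values in sorted(groups.items())}

# Toy mapper: key each read by a fixed-size genomic bin of its position,
# so reads from the same region land in the same reduce group.
reads = [("r1", 150), ("r2", 1200), ("r3", 180), ("r4", 1900)]
mapper = lambda read: [(read[1] // 1000, read[0])]
reducer = lambda key, values: sorted(values)  # stand-in for per-region work

print(map_reduce(reads, mapper, reducer))
# {0: ['r1', 'r3'], 1: ['r2', 'r4']}
```

In a real pipeline each reduce group would be processed by tools such as STAR or GATK, with groups distributed across cluster nodes rather than a single process.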
Article
Full-text available
FASTA and FASTQ are basic and ubiquitous formats for storing nucleotide and protein sequences. Common manipulations of FASTA/Q files include converting, searching, filtering, deduplication, splitting, shuffling, and sampling. Existing tools implement only some of these manipulations, not always efficiently, and some are available only for certain operating systems. Furthermore, the complicated installation process of required packages and running environments can render these programs less user friendly. This paper describes a cross-platform ultrafast comprehensive toolkit for FASTA/Q processing. SeqKit provides executable binary files for all major operating systems, including Windows, Linux, and Mac OSX, and can be directly used without any dependencies or pre-configurations. SeqKit demonstrates competitive performance in execution time and memory usage compared to similar tools. The efficiency and usability of SeqKit enable researchers to rapidly accomplish common FASTA/Q file manipulations. SeqKit is open source and available on Github at https://github.com/shenwei356/seqkit.
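One of the manipulations listed, deduplication by sequence, can be sketched in a few lines. This is a toy sketch of the operation, not SeqKit's Go implementation:

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines,
    joining sequences that are wrapped over multiple lines."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        else:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)

def dedup_by_sequence(records):
    """Keep the first record seen for each distinct sequence."""
    seen = set()
    for header, seq in records:
        if seq not in seen:
            seen.add(seq)
            yield header, seq

fasta = ">a\nACGT\n>b\nACG\nT\n>c\nTTTT\n".splitlines()
print(list(dedup_by_sequence(parse_fasta(fasta))))
# [('a', 'ACGT'), ('c', 'TTTT')]
```

Record "b" is dropped because its wrapped sequence joins to the same string as "a"; a production tool would typically hash sequences instead of storing them to bound memory use.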
Article
Full-text available
Single genetic mutations are always followed by a set of compensatory mutations. Thus, multiple changes commonly occur in biological sequences and play crucial roles in maintaining conformational and functional stability. Although many methods are available to detect single mutations or covariant pairs, detecting non-synchronous multiple changes at different sites in sequences remains challenging. Here, we develop a novel algorithm, named Fastcov, to identify multiple correlated changes in biological sequences using an independent pair model followed by a tandem model of site-residue elements based on inter-restriction thinking. Fastcov performed exceptionally well at harvesting co-pairs and detecting multiple covariant patterns. By 10-fold cross-validation using datasets of different scales, the characteristic patterns successfully classified the sequences into target groups with an accuracy of greater than 98%. Moreover, we demonstrated that the multiple covariant patterns represent co-evolutionary modes corresponding to the phylogenetic tree, and provide a new understanding of protein structural stability. In contrast to other methods, Fastcov provides not only a reliable and effective approach to identify covariant pairs but also more powerful functions, including multiple covariance detection and sequence classification, that are most useful for studying the point and compensatory mutations caused by natural selection, drug induction, environmental pressure, etc.
Article
Full-text available
The integration of genome annotations is critical to the identification of genetic variants that are relevant to studies of disease or other traits. However, comprehensive variant annotation with diverse file formats is difficult with existing methods. Here we describe vcfanno, which flexibly extracts and summarizes attributes from multiple annotation files and integrates the annotations within the INFO column of the original VCF file. By leveraging a parallel “chromosome sweeping” algorithm, we demonstrate substantial performance gains by annotating ~85,000 variants per second with 50 attributes from 17 commonly used genome annotation resources. Vcfanno is available at https://github.com/brentp/vcfanno under the MIT license. Electronic supplementary material: The online version of this article (doi:10.1186/s13059-016-0973-5) contains supplementary material, which is available to authorized users.
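The sweeping idea can be sketched for one chromosome: with variants and annotation intervals both sorted by position, a single linear pass maintains the set of "active" intervals and never re-scans the annotation files. This is an illustration of the general sweep technique, not vcfanno's Go implementation:

```python
def sweep_annotate(variants, intervals):
    """Annotate sorted variant positions with sorted (start, end, label)
    intervals in one linear sweep. Intervals are half-open [start, end)
    and sorted by start; variants are sorted positions."""
    active, i, result = [], 0, []
    for pos in variants:
        # Activate every interval that starts at or before this variant.
        while i < len(intervals) and intervals[i][0] <= pos:
            active.append(intervals[i])
            i += 1
        # Retire intervals the sweep has passed.
        active = [iv for iv in active if iv[1] > pos]
        result.append((pos, [iv[2] for iv in active]))
    return result

variants = [5, 12, 25]
intervals = [(0, 10, "promoter"), (8, 20, "gene"), (30, 40, "enhancer")]
print(sweep_annotate(variants, intervals))
# [(5, ['promoter']), (12, ['gene']), (25, [])]
```

Because each interval is activated and retired once, the pass is linear in the input sizes, which is what makes per-chromosome parallel sweeps attractive.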
Technical Report
In this experience report we encode a well-specified, compact benchmark in four programming languages, namely C++, Java, Go, and Scala. The implementations each use the languages' idiomatic container classes, looping constructs, and memory/object allocation schemes. They do not attempt to exploit specific language and run-time features to achieve maximum performance. This approach allows an almost fair comparison of language features, code complexity, compilers and compile time, binary sizes, run-times, and memory footprint. While the benchmark itself is simple and compact, it employs many language features, in particular, higher-level data structures (lists, maps, lists and arrays of sets and lists), a few algorithms (union/find, dfs / deep recursion, and loop recognition based on Tarjan), iterations over collection types, some object-oriented features, and interesting memory allocation patterns. We do not explore any aspects of multi-threading, or higher-level type mechanisms, which vary greatly between the languages. The benchmark points to very large differences in all examined dimensions of the language implementations. After publication of the benchmark internally at Google, several engineers produced highly optimized versions of the benchmark. We describe many of the performed optimizations, which were mostly targeting runtime performance and code complexity. While this effort is an anecdotal comparison only, the benchmark, and the subsequent tuning efforts, are indicative of typical performance pain points in the respective languages.
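One of the benchmark's core data structures, union/find, can be sketched compactly. The version below is a generic disjoint-set forest with path compression and union by size, not the benchmark's own code:

```python
class UnionFind:
    """Disjoint-set forest with path compression and union by size."""
    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        # Path halving: point nodes past their parent toward the root.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra  # attach the smaller tree under the larger
        self.size[ra] += self.size[rb]

uf = UnionFind(5)
uf.union(0, 1)
uf.union(1, 2)
print(uf.find(0) == uf.find(2))  # True
print(uf.find(0) == uf.find(3))  # False
```

Benchmarking a structure like this stresses exactly the dimensions the report compares: many small allocations, array indexing, and tight loops.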
Chapter
Post-sequencing DNA analysis typically consists of read mapping followed by variant calling and is very time-consuming, even on a multi-core machine. Recently, we proposed Halvade, a parallel, multi-node implementation of a DNA sequencing pipeline according to the GATK Best Practices recommendations. The MapReduce programming model is used to distribute the workload among different workers. In this paper, we study the impact of different hardware configurations on the performance of Halvade. Benchmarks indicate that especially the lack of good multithreading capabilities in the existing tools (BWA, SAMtools, Picard, GATK) causes suboptimal scaling behavior. We demonstrate that it is possible to circumvent this bottleneck by using multiprocessing on high-memory machines rather than using multithreading. Using a 15-node cluster with 360 CPU cores in total, this results in a runtime of 1 h 31 min. Compared to a single-threaded runtime of ~12 days, this corresponds to an overall parallel efficiency of 53%.