Genome analysis
genozip: a fast and efficient compression tool for VCF files
Divon Lan*, Raymond Tobler, Yassine Souilmi and Bastien Llamas*,†
School of Biological Sciences, The Environment Institute, Faculty of Sciences, The University of Adelaide, Adelaide, SA 5005, Australia
*To whom correspondence should be addressed.
†The authors wish it to be known that, in their opinion, the last two authors should be regarded as Joint Authors.
Associate Editor: Peter Robinson
Received on January 15, 2020; revised on April 1, 2020; editorial decision on April 24, 2020; accepted on April 27, 2020
Abstract
Motivation: genozip is a new lossless compression tool for Variant Call Format (VCF) files. By applying
field-specific algorithms and fully utilizing the available computational hardware, genozip achieves the highest
compression ratios amongst existing lossless compression tools known to the authors, at speeds comparable with
the fastest multi-threaded compressors.
Availability and implementation: genozip is freely available to non-commercial users. It can be installed via
conda-forge, Docker Hub, or downloaded from github.com/divonlan/genozip.
Contact: divon.lan@adelaide.edu.au or bastien.llamas@adelaide.edu.au
Supplementary information: Supplementary data are available at Bioinformatics online.
1 Introduction
Large genomic projects are becoming increasingly common, resulting
in Variant Call Format (VCF; Danecek et al., 2011) files comprising
thousands of individual genomic datasets. Even in their compressed
form, such files are very large (typically several GB), rapidly driving
up the cost of long-term data storage and file transfer and spurring
the development of more efficient compression algorithms.
While a handful of new compression algorithms have recently
emerged that work by compressing genotypes within VCF files (e.g.
Deorowicz and Danek, 2019; Durbin, 2014; Kelleher et al., 2019),
genotypes are only one data type represented in a VCF file, and are
often only a minor contributor to the total data content. For example,
in the file used as the real-world example by Durbin (2014), which is
File1 in our benchmarks, the genotypes represent only 7.1% of the
uncompressed VCF file data. Thus, compressing the genotypes alone
is not a sufficient compression strategy for VCF files.
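A share such as the 7.1% quoted above can be estimated by counting how many bytes of the VCF data lines sit in the GT subfield of the sample columns. The sketch below is one way to do this; the function name `genotype_fraction` is hypothetical, the input is assumed to be ASCII text, and this is not necessarily how the authors obtained their figure.

```python
def genotype_fraction(vcf_text: str) -> float:
    """Return the share of data-line bytes occupied by GT subfields.

    Assumption: ASCII input, so one character equals one byte.
    """
    gt_bytes = 0
    total_bytes = 0
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip empty lines and header lines
        total_bytes += len(line) + 1  # +1 for the newline
        fields = line.split("\t")
        if len(fields) < 10:
            continue  # no FORMAT/sample columns on this line
        keys = fields[8].split(":")
        if "GT" not in keys:
            continue
        gt_index = keys.index("GT")
        for sample in fields[9:]:
            parts = sample.split(":")
            if gt_index < len(parts):
                gt_bytes += len(parts[gt_index])
    return gt_bytes / total_bytes if total_bytes else 0.0
```

On a file rich in FORMAT subfields, the value returned would be expected to be small, which is the situation the paragraph above describes.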
Here, we present genozip, a lossless compression tool that
greatly improves genomic data compression by utilizing algorithms
specific to the data types common to VCF files. genozip can handle
VCF files of any ploidy, phasing structure or variant type with
up to 99 alternate alleles per variant, along with any FORMAT and
INFO data. While the primary objective of genozip is optimal
packaging of genomic data for efficient and secure storage and
distribution, it also includes capabilities for pipeline analyses.
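The general idea of treating each data type separately can be illustrated by splitting a VCF into one stream per column, so that similar values are compressed together. The sketch below is only an illustration under stated assumptions: it uses zlib as a stand-in codec, the function names are hypothetical, and it does not reproduce genozip's actual field-specific algorithms, which this section does not describe.

```python
import zlib
from typing import List


def compress_by_column(vcf_lines: List[str]) -> List[bytes]:
    """Compress each tab-separated column of the data lines as its own stream."""
    if not vcf_lines:
        return []
    rows = [line.split("\t") for line in vcf_lines]
    n_cols = len(rows[0])
    if any(len(row) != n_cols for row in rows):
        raise ValueError("all lines must have the same number of columns")
    streams = []
    for c in range(n_cols):
        # Values of one column are stored contiguously, one per line.
        column = "\n".join(row[c] for row in rows)
        streams.append(zlib.compress(column.encode("utf-8"), 9))
    return streams


def decompress_by_column(streams: List[bytes]) -> List[str]:
    """Rebuild the original lines from the per-column streams."""
    columns = [zlib.decompress(s).decode("utf-8").split("\n") for s in streams]
    return ["\t".join(fields) for fields in zip(*columns)]
```

Because every column is restored exactly, the round trip is lossless; whether per-column streams are smaller than compressing the whole text depends on the data.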
2 Software description
The genozip package runs on all popular operating systems and
includes four command line tools: genozip, genounzip,
genocat and genols. genozip receives one or more .vcf,
.vcf.gz, .vcf.bz2, .vcf.xz or .bcf files or URLs (FTP or HTTP) as
input, and outputs one or more .vcf.genozip files, while genounzip
decompresses .vcf.genozip files back to .vcf or .vcf.gz format
and genols provides statistics regarding the contents of .genozip
files.
To support seamless integration into analytical pipelines, the
genocat command is provided to access data within .vcf.genozip
files, and includes options like --regions and --samples that allow
random access to data. Indexing is done as part of the compression
and there is no separate indexing step or index file. In addition, the
toolset is designed to enable use of standard input/output streams.
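Random access without a separate index file can be pictured as follows: the compressor writes the data in blocks and records, for each block, the chromosome and position range it covers, so a region query only decompresses the blocks that overlap it. The sketch below assumes data lines sorted by position within each chromosome; the block layout, function names and the `lines_per_block` parameter are hypothetical and are not genozip's file format.

```python
import zlib
from typing import List, Tuple

# One block: (chromosome, first position, last position, compressed lines)
Block = Tuple[str, int, int, bytes]


def _chrom_pos(line: str) -> Tuple[str, int]:
    chrom, pos = line.split("\t", 2)[:2]
    return chrom, int(pos)


def _pack(lines: List[str]) -> Block:
    chrom, first = _chrom_pos(lines[0])
    _, last = _chrom_pos(lines[-1])
    return chrom, first, last, zlib.compress("\n".join(lines).encode("utf-8"))


def build_blocks(vcf_lines: List[str], lines_per_block: int = 2) -> List[Block]:
    """Compress sorted data lines in blocks; the index is built on the fly."""
    blocks: List[Block] = []
    current: List[str] = []
    for line in vcf_lines:
        if current:
            full = len(current) >= lines_per_block
            new_chrom = _chrom_pos(current[0])[0] != _chrom_pos(line)[0]
            if full or new_chrom:
                blocks.append(_pack(current))
                current = []
        current.append(line)
    if current:
        blocks.append(_pack(current))
    return blocks


def query(blocks: List[Block], chrom: str, start: int, end: int) -> List[str]:
    """Return lines on `chrom` with start <= POS <= end."""
    hits = []
    for b_chrom, first, last, payload in blocks:
        if b_chrom != chrom or last < start or first > end:
            continue  # block cannot overlap the region: skip without decompressing
        for line in zlib.decompress(payload).decode("utf-8").split("\n"):
            if start <= _chrom_pos(line)[1] <= end:
                hits.append(line)
    return hits
```

Smaller blocks mean less data is decompressed per query but usually cost some compression ratio, which is the kind of tradeoff a block-size setting controls.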
By encrypting the data with --password (using 256-bit AES), genozip
enables efficient and secure distribution of genomic files that
comply with stringent privacy requirements. Data integrity is further
ensured by generating an MD5 signature with --md5.
Additionally, the --output option concatenates VCF files with identical
samples and the original components can be regenerated using
--split.
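The integrity check mentioned above follows a common pattern: store a digest of the original data next to the compressed payload, then recompute and compare it after decompression. The sketch below shows that pattern with Python's standard `hashlib` and `zlib`; the function names are hypothetical and the code makes no claim about how genozip stores or checks its MD5 signature.

```python
import hashlib
import zlib
from typing import Tuple


def compress_with_digest(data: bytes) -> Tuple[str, bytes]:
    """Return the MD5 hex digest of the original data and the compressed payload."""
    return hashlib.md5(data).hexdigest(), zlib.compress(data)


def decompress_and_verify(digest: str, payload: bytes) -> bytes:
    """Decompress and raise ValueError if the data do not match the stored digest."""
    data = zlib.decompress(payload)
    if hashlib.md5(data).hexdigest() != digest:
        raise ValueError("MD5 mismatch: decompressed data differ from the original")
    return data
```

A matching digest shows that the reconstructed file is byte-for-byte identical to the input, which is what a lossless round trip requires.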
© The Author(s) 2020. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Bioinformatics, 36(13), 2020, 4091–4092
doi: 10.1093/bioinformatics/btaa290
Advance Access Publication Date: 14 May 2020
Applications Note
We have included several additional options that allow the user to
optimize compression according to their needs. First, the --optimize
option improves compression by modifying data in some INFO and
FORMAT subfields by rounding floating point numbers to 2 significant
digits and capping Phred values. Note that in this case the VCF
data are modified, and therefore the compression is not lossless, but
this does not impact downstream analytical results. Second, the
--gtshark option applies the GTShark algorithm (Deorowicz
and Danek, 2019), which improves compression ratios compared with
using either genozip or GTShark alone (see Supplementary
Material). Finally, the --vblock and --sblock options allow the user
to control the tradeoff between compression and speed related to
subsetting regions and samples.
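The two lossy transformations named for the optimize option, rounding to 2 significant digits and capping Phred values, can be sketched as below. The function names are hypothetical, and the cap of 60 is an assumed placeholder, since the text does not state the value genozip uses or which subfields it applies to.

```python
from math import floor, log10


def round_sig(x: float, digits: int = 2) -> float:
    """Round x to the given number of significant digits."""
    if x == 0:
        return 0.0
    # Position of the leading digit decides how many decimals to keep.
    return round(x, digits - 1 - floor(log10(abs(x))))


def cap_phred(q: int, cap: int = 60) -> int:
    """Limit a Phred-scaled value to `cap` (60 is an assumed example value)."""
    return min(q, cap)
```

Both steps reduce the number of distinct values in a field, which tends to make the field more compressible, at the cost of no longer being able to restore the original values exactly.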
Note that some options require the appropriate tools to be
installed: compressing .bcf files into .genozip format requires
bcftools, compressing .xz files requires XZ Utils (Collin, 2011),
decompressing into .vcf.gz requires bgzip, using --gtshark
requires GTShark, and compressing from a URL requires cURL
(Hostetter et al., 1997).
3 Benchmark
To evaluate genozip’s performance, we compared its compression
ratios and speeds against a wide range of tools on two different VCF
files from The 1000 Genomes Project Consortium (2012): ‘File1’, which
is rich in FORMAT subfields, and ‘File2’, which is rich in genotype
data (see Supplementary Table S1). All benchmarks were conducted on
the same machine, which has 56 physical cores (4 × Intel® Xeon® Gold
6132 CPU @ 2.60 GHz) and 755 GB of usable memory. More details,
including benchmarks against genotype compression tools such as BGT
(Li, 2016) and GTC (Danek and Deorowicz, 2018) that are not capable
of compressing arbitrary VCF files losslessly, are available in
Supplementary Material.
For both tested VCF files, the compression ratios achieved by
genozip are considerably higher than those of the other tested tools
(Fig. 1a). Further, genozip offers one of the fastest
compression/decompression speeds amongst the tested tools (Fig. 1b),
indicating that the gains in compression ratio are achieved without
negatively impacting run times. To achieve high processing speeds,
genozip implements an advanced memory and thread management strategy
that scales across tens of cores (Fig. 1c).
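Scaling across cores in a block-based compressor generally relies on compressing independent blocks concurrently. The sketch below illustrates that idea with a thread pool and zlib; the function names, block size and worker count are assumptions for illustration, and the code does not represent genozip's memory and thread management strategy.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor
from typing import List


def compress_parallel(data: bytes, block_size: int = 1 << 20, workers: int = 4) -> List[bytes]:
    """Split data into fixed-size blocks and compress them concurrently."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, so block order is preserved.
        return list(pool.map(zlib.compress, blocks))


def decompress_parallel(blocks: List[bytes], workers: int = 4) -> bytes:
    """Decompress the blocks concurrently and join them in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(zlib.decompress, blocks))
```

Larger blocks usually compress better, while more, smaller blocks expose more parallelism, so the block size is a tuning parameter in designs of this kind.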
4 Conclusion
genozip is user-friendly, fully featured compression software
that readily integrates into any standard bioinformatics pipeline.
genozip achieves compression ratios significantly better than other
standard tools, by exploiting redundancies in the data that are
specific to biological data and that are not evident by textual analysis
alone. Moreover, genozip achieves significant gains in compression
speed relative to other tools by taking full advantage of modern
computational hardware, including multi-core processors and
multi-gigabyte RAM, whenever available. By default, genozip dynamically
balances its internal execution pipelines to maximize utilization
of all the available resources.
Acknowledgements
The authors thank Christian Huber, Heng Li and two anonymous reviewers
for comments on the manuscript.
Funding
D.L. is supported by a scholarship from the University of Adelaide. Y.S. is
supported by the Australian Research Council (ARC DP190103705). R.T. is
an ARC DECRA fellow (DE190101069). B.L. is an ARC Future Fellow
(FT170100448).
Conflict of Interest: D.L. intends to receive royalties from commercial users of
genozip.
References
Collin, L. (2011) XZ Utils. https://tukaani.org/xz/ (1 April 2020, date last accessed).
Danecek, P. et al.; 1000 Genomes Project Analysis Group. (2011) The variant call format and VCFtools. Bioinformatics, 27, 2156–2158.
Danek, A. and Deorowicz, S. (2018) GTC: how to maintain huge genotype collections in a compressed form. Bioinformatics, 34, 1834–1840.
Deorowicz, S. and Danek, A. (2019) GTShark: genotype compression in large projects. Bioinformatics, 35, 4791–4793.
Durbin, R. (2014) Efficient haplotype matching and storage using the positional Burrows–Wheeler transform (PBWT). Bioinformatics, 30, 1266–1272.
Hostetter, M. et al. (1997) Curl: a gentle slope language for the Web. World Wide Web J. Biol., 2, 121–134.
Kelleher, J. et al. (2019) Inferring whole-genome histories in large population datasets. Nat. Genet., 51, 1330–1338.
Li, H. (2016) BGT: efficient and flexible genotype query across many samples. Bioinformatics, 32, 590–592.
The 1000 Genomes Project Consortium (2012) An integrated map of genetic variation from 1,092 human genomes. Nature, 491, 56–65.
Fig. 1. Benchmarking genozip performance. (a) Compression ratios for genozip using three different options relative to five other commonly used compression tools (see labels) for two VCF files, the FORMAT-subfields-rich data (x-axis) and the genotype-rich data (y-axis). (b) Compression (x-axis) and decompression (y-axis) rates for genozip and five other tools on the two VCF files (see inset key). (c) genozip execution scalability with the number of CPU cores used (see Supplementary Material)