Vcflib and tools for processing the VCF variant call format
Erik Garrison1, Zev N. Kronenberg2, Eric T. Dawson3, Brent S. Pedersen4, and
Pjotr Prins1,*
1Dept. Genetics, Genomics and Informatics, University of Tennessee Health Science
Center, Memphis, TN, USA
2Pacific Biosciences, San Diego, CA, USA
3NVIDIA Corporation, Santa Clara, CA, USA
4Center for Molecular Medicine, University Medical Center, Utrecht, The Netherlands
* jprins@uthsc.edu
Abstract
Since its introduction in 2011 the variant call format (VCF) has been widely adopted
for processing DNA and RNA variants in practically all population studies — as well
as in somatic and germline mutation studies. VCF can present single nucleotide
variants, multi-nucleotide variants, insertions and deletions, and simple structural
variants called against a reference genome. Here we present over 125 useful and widely
used free and open source software tools and libraries that are part of vcflib and
bio-vcf. We also highlight the cyvcf2, hts-nim and slivar tools. Application is
typically in the comparison, filtering, normalisation, smoothing, annotation, statistics,
visualisation and exporting of variants. Our tools run daily and invisibly in pipelines
and countless shell scripts. Our tools are part of a wider bioinformatics ecosystem and
we consider it very important to make these tools available as free and open source
software to all bioinformaticians so they can be deployed through software
distributions, such as Debian, GNU Guix and Bioconda. vcflib, for example, was
installed over 40,000 times and bio-vcf was installed over 15,000 times through
Bioconda by December 2020. We briefly discuss the design of VCF, lessons learnt,
and how we can address more complex variation that cannot easily be represented by
the VCF format. All source code is published under free and open source software
licenses and can be downloaded and installed from https://github.com/vcflib.
Keywords: VCF, variant calling, SNP, MNP, INDEL, SV, GFA, C++, Ruby,
Python, JSON
Author summary
Most bioinformatics workflows deal with DNA/RNA variations that are typically
represented in the variant call format (VCF) — a file format that describes mutations
(SNP and MNP), insertions and deletions (INDEL) against a reference genome. Here
we present a wide range of free and open source software tools that are used in
biomedical sequencing workflows around the world today.
Introduction
From its introduction in 2011 the VCF variant call format has become pervasive in
bioinformatics sequencing workflows [1–3]. VCF is one of the important file formats in
May 21, 2021 1/15
bioRxiv preprint doi: https://doi.org/10.1101/2021.05.21.445151; this version posted May 23, 2021. The copyright holder for this preprint (which
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
available under a CC-BY 4.0 International license.
bioinformatics workflows because of its critical role in describing variants that come
out of sequencing of DNA and RNA. VCF can describe single- and multi-nucleotide
polymorphisms (SNP & MNP), insertions and deletions (INDEL), and simple
structural variants (SV) against a reference genome [1]. Practically all important
variant callers, such as GATK [4] and freebayes [5], produce files in the VCF format.
The VCF file format is used in population studies as well as somatic mutation and
germline mutation studies. In this paper we discuss the tools we wrote to process VCF
and we briefly discuss strengths and shortcomings of the VCF format. We discuss
how we can improve future variant calling in its contribution to population genetics.
An important part of the success of VCF is that it is a relatively simple and flexible
standard that is easy to read, understand and parse. This ‘feature’ has resulted in
wide adoption by bioinformatics software developers. VCF typically scales well in
bioinformatics workflows because files can be indexed [6], compressed [1, 7, 8] and
trivially parallelized in workflows by splitting files and processing them independently,
e.g. [5].
Here we present and discuss important tools and libraries for processing VCF in
sequencing workflows: vcflib, bio-vcf, cyvcf2, slivar and hts-nim. These tools
were created by the authors for the demands of large VCF file processing and data
analysis following the Unix philosophy of small utilities as explained in the ‘small tools
manifesto’ [9]. Development of these tools was often driven by the need to transform
VCF into other formats, to digest information, to address quality control, and to
compute statistics. The vcflib toolkit contains both a library and a collection of
executable programs for transforming VCF files, consisting of over 30,000 lines of
source code written in the C++ programming language for performance. vcflib also
comes with a toolkit for population genetics: the Genotype Phenotype Association
Toolkit (GPAT). Even though vcflib was not previously published, Google Scholar
shows 132 citations for 2020 alone, pointing to the GitHub source code repository.
Next to vcflib, we present bio-vcf, a parallelized domain-specific language (DSL)
for convenient querying and transforming of VCF; and we discuss cyvcf2 [10],
slivar [11], and hts-nim [12] as useful tools and libraries for VCF processing.
Tools and libraries
2.1 vcflib C++ tools and libraries
At its core, vcflib provides C++ tools and a library application programming
interface (API) to plain text and compressed VCF files. A collection of 83 command
line utilities is provided, as well as 44 scripts (Table 1 lists a selection). Most of these
tools are designed to be strung together: piping the output of one program into the
next, thereby avoiding the creation of intermediate files, parallelizing processing, and
reducing the number of I/O operations. For example, the vcflib vcfjoincalls script
includes the following pipeline (where vt is a variant normalization tool [13]; see
Table 1 for the other individual steps by vcflib):
    vcfintersect -r $ref \
      -u <( zcat $base_vcf \
            | vcfallelicprimitives -kg \
            | vt normalize -r $ref -q - 2>/dev/null ) \
      $aux_vcf \
      | vcfaddinfo - $aux_vcf \
      | vcfstreamsort -w 10000000000 \
      | vcfuniq \
      | vcfgeno2haplo -r $ref -w 0 \
      | vcffixup - \
      | grep -v "^#"
Such transformations and statistical functions provided in the toolkit were written for
the requirements of projects such as the 1000 Genomes Project [14] and NIST’s
Genome in a Bottle [15].
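The streaming design of the pipeline above can be illustrated with a minimal Python sketch of one filter stage. It is not the vcflib implementation: the function name uniq_records is hypothetical, and the assumption is that, like vcfuniq, the stage drops a record when it has the same CHROM, POS, REF and ALT as the record directly before it, so the input must already be sorted.

```python
from typing import Iterable, Iterator


def uniq_records(lines: Iterable[str]) -> Iterator[str]:
    """Pass header lines through and drop records that repeat the previous site.

    Assumption: two records are duplicates when CHROM, POS, REF and ALT match.
    Only one record is held in memory, so the stage works on a stream.
    """
    previous_key = None
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        key = (fields[0], fields[1], fields[3], fields[4])
        if key != previous_key:
            yield line
        previous_key = key


# Small in-memory demonstration; in a shell pipeline the input would be sys.stdin.
demo = [
    "##fileformat=VCFv4.2\n",
    "1\t100\t.\tA\tT\t50\t.\t.\n",
    "1\t100\t.\tA\tT\t60\t.\t.\n",  # same site and alleles as the previous record
    "1\t100\t.\tA\tG\t50\t.\t.\n",
]
for kept in uniq_records(demo):
    print(kept, end="")
```

Because the function is a generator, several such stages can be chained without intermediate files, which mirrors the way the shell pipes connect the vcflib tools.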
The VCF format is a textual file format: each line describes a variant, i.e., a
single nucleotide variant (SNV), an insertion, a deletion or a structural variant with
rich annotation [1]. In a VCF line, fields are separated by the TAB character. Fields
for chromosome, position, the reference sequence, the ALT alleles, and fields for
quality, filter, INFO, FORMAT and calls for multiple samples are expected (see
Fig. 1). To split fields, for example for ‘ALT T,CT’, another separator is used; in this
case a comma. VCF makes use of many separators by splitting fields into subfields,
subsubfields and so on: effectively projecting a ‘tree’ data structure onto a single line.
The advantage is that a VCF file is easy to view, a basic VCF parser is almost trivial
to write, and information is easy to add, although the latter sometimes leads to
unwieldy nested annotations. An evolving VCF ‘standard’ is tracked by the
samtools/htslib project [17] and later amendments are particularly focused on more
complex structural arrangements of DNA/RNA with ALT fields taking somewhat ad
hoc creative forms, such as ‘A[3:67656[’ combined with an INFO field containing
‘SVTYPE=BND’, meaning that, at the reference position of the record on a different
chromosome, an ALT A nucleotide is followed by the sequence starting at chromosome
3 and position 67656. These SV annotations do away with some of the original
simplicity of VCF. Many such extensions have been introduced since the first
publication of the VCF ‘standard’; they are used by specific SV tools and largely
ignored by most VCF processing tools.
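To make the layered separators concrete, the following Python sketch unpacks one record into a nested dictionary. It is a simplified illustration, not a complete VCF parser: the function names are hypothetical, only the TAB, semicolon, equals, colon and comma separators are handled, and the record is a shortened version of the one in Fig. 1.

```python
import json


def parse_value(text):
    """Convert a VCF value to int or float when possible, else keep the string."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text


def parse_list(text):
    """Split a comma-separated subfield; a single value is returned unwrapped."""
    values = [parse_value(v) for v in text.split(",")]
    return values if len(values) > 1 else values[0]


def parse_record(header_line, record_line):
    """Project one tab-delimited VCF line onto a tree of dicts and lists."""
    names = header_line.lstrip("#").rstrip("\n").split("\t")
    fields = record_line.rstrip("\n").split("\t")
    record = {
        "CHROM": fields[0],
        "POS": int(fields[1]),
        "REF": fields[3],
        "ALT": fields[4].split(","),       # level 2: comma
        "QUAL": parse_value(fields[5]),
        "INFO": {},
        "samples": {},
    }
    for item in fields[7].split(";"):      # level 2: semicolon
        key, _, value = item.partition("=")
        record["INFO"][key] = parse_list(value) if value else True
    format_keys = fields[8].split(":")     # level 2: colon
    for name, call in zip(names[9:], fields[9:]):
        record["samples"][name] = {
            key: (value if key == "GT" else parse_list(value))  # level 3: comma
            for key, value in zip(format_keys, call.split(":"))
        }
    return record


header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tOriginal\ts1t1"
line = ("1\t10257\t.\tA\tT,CT\t77.69\t.\tAC=1;AF=0.071;AN=14;DP=1518\t"
        "GT:AD:DP:GQ:PL\t0/0:151,8:159:99:0,195,2282\t0/0:219,22:242:99:0,197,2445")
print(json.dumps(parse_record(header, line), indent=2))
```

Each separator corresponds to one level of nesting, which is why the same record can be written as JSON without loss of structure.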
The vcflib API describes a class vcflib::VariantCallFile to manage the
reading of VCF files, and vcflib::Variant to describe the information contained in a
single VCF record. The API provides iterators that are used in every included tool.
For every record the tree-type hierarchy (Fig. 1) of information can be navigated
through interfaces to the fixed fields (CHROM, POS, ID, QUAL, FILTER,
INFO) and sample-related fields (FORMAT, and samples). vcflib implements
functions for accessing and modifying data in these fields; interpreting the alleles and
genotypes in a record; filtering sites, alleles, and genotypes via a domain-specific
boolean filtering language; and reading and writing VCF streams.
In addition to the tools listed in Table 1, vcflib includes tools for genotype
detection. For example, abba-baba calculates the tree pattern for four individuals with
an ancestral reference. vcflib includes a wide range of tools for transformation, e.g.
vcf2dag modifies the VCF file so that homozygous regions are included as ‘REF/.’
calls. For each REF and ALT allele we can assign an index. These steps enable use of
the VCF as a partially ordered directed acyclic graph (DAG). vcfannotate will
intersect the records in the VCF file with targets provided in a BED file and
vcfannotategenotypes annotates genotypes in the first file with genotypes in the
second. vcfclassify generates a new VCF where each variant is tagged by allele class:
SNP, Ts/Tv, INDEL, and MNP. vcfglxgt sets genotypes using the maximum genotype
(a) VCF record
    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Original s1t1 s2t1 s3t1 s1t2 s2t2 s3t2
    1 10257 . A T,CT 77.69 . AC=1;AF=0.071;AN=14;BaseQRankSum=-0.066;DP=1518;Dels=0.00;FS=24.214;HaplotypeScore=226.5209;MLEAC=1;MLEAF=0.071;MQ=25.00;MQ0=329;MQRankSum=-1.744;QD=0.37;ReadPosRankSum=1.551 GT:AD:DP:GQ:PL 0/0:151,8:159:99:0,195,2282 0/0:219,22:242:99:0,197,2445 0/0:227,22:249:90:0,90,2339 0/0:226,22:249:99:0,159,2695 0/0:166,18:186:99:0,182,1989 0/1:185,27:212:99:111,0,2387 0/0:201,15:218:24:0,24,1972
(b) JSON representation of VCF record
    {
      "CHR": "1",
      "POS": 10257,
      "REF": "A",
      "ALT": ["T", "CT"],
      "QUAL": 77.69,
      "DP": 1518,
      "AF": 0.071,
      "AN": 14,
      "MQ": 25,
      "QD": 0.37,
      "BaseQRankSum": -0.066,
      "HaplotypeScore": 226.5209,
      "samples": {
        "Original": { "GT": "0/0", "AD": [151, 8], "DP": 159 },
        "s1t1": { "GT": "0/0", "AD": [219, 22], "DP": 242 },
        "s2t1": { "GT": "0/0", "AD": [227, 22], "DP": 249 },
        "s3t1": { "GT": "0/0", "AD": [226, 22], "DP": 249 },
        "s1t2": { "GT": "0/0", "AD": [166, 18], "DP": 186 },
        "s2t2": { "GT": "0/1", "AD": [185, 27], "DP": 212 },
        "s3t2": { "GT": "0/0", "AD": [201, 15], "DP": 218 }
      }
    }
(c) VCF to JSON transformation
    cat gatk_exome.vcf | bio-vcf --template vcf2json.erb --json
Fig 1. Example of the VCF format and a VCF transformation to JavaScript
Object Notation (JSON) using bio-vcf: (a) the line-based VCF record makes use of
separators to split tab-delimited fields into subfields. Subfields are split with the characters , = : ; /
and so on. This splitting effectively projects a ‘tree-like’ data structure that can also be
represented as (b) a JSON record. JSON is used as a common data exchange format for
databases and web services. This example was generated with (c) the bio-vcf tool using a
template [16]. bio-vcf can transform data to any textual format, including RDF, HTML, XML
etc. See also the bio-vcf section.
Table 1. A selection of VCF processing tools included with vcflib (a full list of over
125 tools with descriptions and options can be found in the online vcflib
documentation)

Name                   Description

Tools for filtering
vcfaddinfo             add info fields from a second VCF file for records missing in the first file
vcffixup               insert AC and NS using sample genotypes
vcffilter              filter on fields, e.g. DP > 10
vcfuniq                filter out duplicate entries
vcfuniqalleles         filter unique alleles only

Tools for transformation
vcfintersect           set operations: intersect, union, complement
vcfstreamsort          sort
vcfoverlay             merge files by overlaying
vcfcombine             combine samples on identical sites
vcffixup               update fields
vcfannotate            annotate records from BED file
vcfflatten             flatten multi-allelic sites with common ALT genotype
vcfgeno2haplo          transform phased alleles into haplotypes
vcfsamplediff          compare VCF files and add annotations to INFO
vcfprimers             design primers
vcf2tsv                convert to tab separated table
vcf2fasta              convert to FASTA
vcf2bed                convert to BED
vcf2sqlite             convert to SQLite
smoother               averages a set of scores over a sliding genomic window

Tools for metrics
vcfdistance            compute distance between positions and add field
vcfentropy             annotates and adds field for sequence entropy for a window

Tools for genotyping
vcfallelicprimitives   split records if multiple allelic primitives (gaps or mismatches) are specified in a single VCF record
genotypeSummary        summarizes genotype counts for bi-allelic SNVs and INDEL
hapLRT                 likelihood ratio test for haplotype lengths

Tools for statistics
iHS                    integrated ratio of haplotype decay between reference and non-reference allele
vFst                   compute vFst as a measure of CNV stratification
vcfroc                 compute a pseudo-ROC curve
meltEHH                plot extended haplotype homozygosity (EHH) curves
plotHaps               provide output for haplotype plots
popStat                population genetic statistics for each SNP
vcfrandomsample        random sampling

Tools for validation
vcfcheck               check integrity and identity against reference genome
likelihood for each sample. vcfinfosummarize and vcfsample2info take annotations
given in the per-sample fields and add the mean, median, min, or max to the
site-level INFO. vcfleftalign left-aligns INDELs and complex variants in the input
using a pairwise REF/ALT alignment followed by a heuristic, iterative left realignment
Fig 2. Smoothed pFst (−log10) statistic with color coded number of variants in a
window — as computed by vcflib’s pFst and smoother tools [18].
process that shifts INDEL representations to their absolute leftmost (5’) extent.
vcflib includes tools for phenotype association based on VCF files. We have
developed a flexible and robust genotype-phenotype software library designed
specifically for large and noisy NGS datasets. Wright’s F-statistics, and especially Fst,
provide important insights into the evolutionary processes that influence the structure
of genetic variation within and among populations, and they are among the most
widely used descriptive statistics in population and evolutionary genetics (Fig 2). Fst
is defined as the correlation between randomly sampled gametes relative to the total
drawn from the same subpopulation [19]. wcFst is Weir & Cockerham’s Fst for two
populations [20]. pFst is a probabilistic approach for detecting differences in allele
frequencies between two populations and bFst is a Bayesian approach. bFst accounts
for genotype uncertainty in the model using genotype likelihoods. For a more detailed
description see [21]. The likelihood function has been modified to use genotype
likelihoods provided by variant callers. There are five free parameters estimated in the
model: each subpopulation’s allele frequency and Fis (fixation index, within each
subpopulation), a free parameter for the total population’s allele frequency, and Fst.
pVst calculates Vst to test the difference in copy numbers at each SV between two
groups: $V_{st} = (V_t - V_s)/V_t$, where $V_t$ is the overall variance of copy number and $V_s$
the average variance within populations. sequenceDiversity calculates two popular
metrics of haplotype diversity: Pi and extended haplotype homozygosity (eHH). Pi is
calculated using the Nei and Li formulation [22]. eHH is a convenient way to think
about haplotype diversity. When eHH=0 all haplotypes in the window are unique and
when eHH=1 all haplotypes in the window are identical. The vcfremap tool attempts
to realign each alternate allele against the reference genome with a lowered gap
open penalty and adjusts the CIGAR and REF/ALT alleles accordingly. These
traditional and novel population genetic methods are implemented in the Genotype
Phenotype Association Toolkit (GPAT++), part of the vcflib API. For example,
permuteGPAT++ adds empirical p-values to a GPAT++ score, and vcfld computes
linkage disequilibrium (LD). GPAT++ includes basic population stats (Af, Pi, eHH,
oHet, genotypeCounts) and several flavors of Fst and tools for linkage, association
testing (genotypic and pooled data), haplotype methods (hapLrt), smoothing,
permutation, and plotting.
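The Vst and eHH definitions above can be worked through in a few lines of Python. This is a sketch under stated assumptions and not the GPAT++ code: the function names are hypothetical, population variances are used, the within-group variance is averaged weighted by group size, and haplotype homozygosity is taken as the fraction of haplotype pairs in the window that are identical.

```python
from collections import Counter
from statistics import pvariance


def vst(group_a, group_b):
    """Vst = (Vt - Vs) / Vt for copy numbers observed in two groups.

    Vt is the variance over all samples; Vs is the size-weighted average
    of the two within-group variances (an assumption of this sketch).
    """
    pooled = list(group_a) + list(group_b)
    vt = pvariance(pooled)
    if vt == 0:
        return 0.0  # no copy number variation at all
    vs = (len(group_a) * pvariance(group_a)
          + len(group_b) * pvariance(group_b)) / len(pooled)
    return (vt - vs) / vt


def haplotype_homozygosity(haplotypes):
    """Fraction of haplotype pairs (drawn without replacement) that are identical."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two haplotypes")
    identical_pairs = sum(c * (c - 1) for c in Counter(haplotypes).values())
    return identical_pairs / (n * (n - 1))


# Copy number fully separates the groups, so all variance is between groups.
print(vst([2, 2, 2], [4, 4, 4]))
# All haplotypes unique versus all haplotypes identical.
print(haplotype_homozygosity(["ACG", "ATG", "TCG"]))
print(haplotype_homozygosity(["ACG", "ACG", "ACG"]))
```

With these definitions Vst approaches 1 when the groups differ in copy number and 0 when they share the same distribution, and the homozygosity value reproduces the two boundary cases given in the text.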
vcflib includes tools for genotype statistics. vcfgenosummarize adds summary
statistics to each record summarizing qualities reported in called genotypes. It uses
RO (reference observation count), QR (quality sum reference observations), AO (alternate
observation count), and QA (quality sum alternate observations). The normalizeHS tool is used
for iHS and XP-EHH scores [23].
A full list of over 125 commands and functionality can be found on the website, as
well as documentation and examples of application [18].
2.2 Bio-vcf and Slivar flexible command-line DSL filters and transformers
2.2.1 bio-vcf
Compared to vcflib with its many dedicated command line tools, bio-vcf takes a
different approach by providing a single command line tool that uses a domain specific
language (DSL) for processing the VCF format. Thanks to a dynamic interpretation
of the VCF tree representation (see Fig. 1) all data elements in a VCF header or
record can be reached using field names and their sub names. For example the
following is a valid select filter:

    s.dp > 20 and r.filter !~ /^LowQD/ and r.tumor.bcount[r.alt] > 4

which selects all variants where the sample depth field s.dp is larger than 20, where
the FILTER field of a record r does not start with the letters LowQD (note it uses a
Perl/Ruby-style regular expression or regex [24]), and where the tumor bcount of the
ALT allele is larger than 4. The letter ‘r’ represents a record or line in a VCF file and
the letter ‘s’ stands for each sample in a record.
The naming of variables, such as s.dp and r.tumor.bcount, is inferred from the
VCF file itself, so if a VCF has different naming conventions they are picked up
automatically.
bio-vcf typically reads from the terminal STDIN and writes to STDOUT.
The following full command line invocation reads VCF files and filters for
chromosomes 1–9 where the quality (r.qual) is larger than 50. It also checks for
non-empty samples where the sample read depth is larger than 20. For each selected
record with --eval it outputs a BED record (the default output is the VCF record
itself, useful for filtering):

    bio-vcf --filter 'r.chrom.to_i>0 and r.chrom.to_i<=9 and r.qual>50' \
      --sfilter '!s.empty? and s.dp>20' --eval '[r.chrom,r.pos,r.pos+1]'

For comparisons and for output, fields can be converted to integers, floats and strings
with to_i, to_f and to_s respectively. Note that these are Ruby functions and, in fact,
all such Ruby functionality is available in bio-vcf statements. For extreme flexibility
bio-vcf even supports lambdas, which makes for very powerful queries and
transformations. For example, to output the count of valid genotype fields in samples
one could use

    bio-vcf --eval 'r.samples.count {|s| s.gt!="./."}'

where count is a function that invokes the lambda s.gt!="./.", i.e., where the
genotype gt of sample s is not equal to "./.". Sample ‘s’ is passed as a parameter.
Because of the flexibility of bio-vcf almost all imaginable data queries can be
executed. bio-vcf was implemented for processing large VCF files and is fast because
it is designed to make use of multi-core processors (using Linux parallelized
copy-on-write, i.e. a technique where RAM is shared between processes). bio-vcf is
also ‘lazy’, which means that it only parses fields when they are used. For example, in
the above query, only the sample GT field is unpacked and parsed to get a result. All
other data in the record is ignored by the query and not evaluated. This contrasts
with most VCF parsers in use today.
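The idea of lazy parsing can be sketched in Python as follows. This is an illustration of the technique only and does not reflect how bio-vcf, which is written in Ruby, is implemented; the class names LazyRecord and LazySample are hypothetical.

```python
from functools import cached_property


class LazySample:
    """One sample column; the colon-separated values are split on first use."""

    def __init__(self, format_keys, raw):
        self._keys = format_keys
        self._raw = raw

    @cached_property
    def _values(self):
        return dict(zip(self._keys, self._raw.split(":")))

    @property
    def gt(self):
        return self._values["GT"]


class LazyRecord:
    """One VCF line; only the TAB split is done eagerly."""

    def __init__(self, line):
        self._fields = line.rstrip("\n").split("\t")

    @cached_property
    def info(self):
        # Parsed only if a query touches INFO.
        pairs = (item.partition("=") for item in self._fields[7].split(";"))
        return {key: value for key, _, value in pairs}

    @cached_property
    def samples(self):
        keys = self._fields[8].split(":")
        return [LazySample(keys, raw) for raw in self._fields[9:]]


line = "1\t10257\t.\tA\tT\t77.69\t.\tDP=1518;AN=14\tGT:DP\t0/0:159\t./.:0\t0/1:212"
record = LazyRecord(line)
# Equivalent in spirit to: r.samples.count {|s| s.gt != "./."}
print(sum(1 for s in record.samples if s.gt != "./."))
# INFO was never requested, so it was never parsed.
print("info" in vars(record))
```

Here cached_property stores a parsed value on the instance the first time it is requested, so a query that only inspects GT never pays for splitting the INFO column.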
Finally, bio-vcf comes with a full parser and lexer that can tokenize the VCF file
header and transform that into some other format. For example, the command

    bio-vcf --eval-once 'header.meta' --json < gatk_exome.vcf

will turn the metadata information passed by GATK [4] into a JSON document. To
get a full JSON document of the VCF file use a template that looks like

    =HEADER
    {
      "HEADER": {
        "options": <%= options.to_h.to_json %>,
        "files": <%= ARGV %>
      },
      "BODY": [
    =BODY
        {
          "seq:chr": "<%= rec.chrom %>",
          "seq:pos": <%= rec.pos %>,
          "seq:ref": "<%= rec.ref %>",
          "seq:alt": "<%= rec.alt[0] %>",
          "dp": <%= rec.info.dp %>
        }
    =FOOTER
      ]
    }

and run

    bio-vcf --template vcf2json.erb --json < gatk_exome.vcf

The high expressiveness and adaptable parsing make bio-vcf a very powerful tool for
searching, filtering and rewriting VCF files. See the bio-vcf website for full
information on record and sample inclusion/exclusion filters, generators, multicore
performance, field computations, statistics, genotype processing, set analysis and
templates for user definable output, including templates for output of VCF header
information and records for RDF, JSON, LaTeX, HTML and BED formats [16].
2.2.2 Slivar
Similar to bio-vcf, slivar allows users to specify simple expressions for filtering and
annotation [11]. Whereas bio-vcf uses Ruby to supply the DSL, slivar uses
JavaScript. slivar has built-in pedigree support for the samples so that, for example,
a single expression can be applied to every trio (mother, father, child) to identify de
novo variants:

    slivar expr --ped $pedigree_file --trio "denovo:kid.het && mom.hom_ref
      && dad.hom_ref && kid.GQ > 10 && mom.GQ > 10 && dad.GQ > 10 &&
      INFO.gnomad_af < 0.001"

The expression above checks the genotype pattern along with genotype quality and
limits the results to rare variants with INFO.gnomad_af < 0.001.
Expressions on families (including multigenerational) and arbitrary groups are
supported so that, for example, expressions can be applied to tumor-normal pairs
using tumor and normal labels.
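The logic of the de novo expression can be spelled out in plain Python. This sketch does not use slivar: the Call class and the is_denovo function are hypothetical, and the genotype tests are simplified (a call counts as heterozygous when it carries two different, non-missing alleles).

```python
from dataclasses import dataclass


@dataclass
class Call:
    """Genotype call of one sample (hypothetical helper for this sketch)."""
    gt: str  # genotype string, e.g. "0/1"
    gq: int  # genotype quality

    @property
    def alleles(self):
        return self.gt.replace("|", "/").split("/")

    @property
    def hom_ref(self):
        return all(allele == "0" for allele in self.alleles)

    @property
    def het(self):
        return "." not in self.alleles and len(set(self.alleles)) > 1


def is_denovo(kid, mom, dad, gnomad_af, min_gq=10, max_af=0.001):
    """Same conditions as the slivar expression, written out one by one."""
    return (
        kid.het
        and mom.hom_ref
        and dad.hom_ref
        and kid.gq > min_gq
        and mom.gq > min_gq
        and dad.gq > min_gq
        and gnomad_af < max_af
    )


# Child heterozygous, both parents homozygous reference, variant rare in gnomAD.
print(is_denovo(Call("0/1", 40), Call("0/0", 35), Call("0/0", 50), gnomad_af=0.0001))
```

The value of the DSL is that the same conditions fit on one command line and are applied to every trio found in the pedigree file.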
2.3 VCF programming libraries
VCF programming libraries are mainly useful when direct calls to vcflib and
bio-vcf command line tools prove too limited. The Bio* libraries, e.g.,
biopython [25], bioperl [26], bioruby [27] and R’s CRAN [28], contain VCF parsers
that may be useful. But a first port of call may be vcflib itself as it is also a C++
programming library and, in addition to being an integral part of the vcflib tools
mentioned here, is used by, for example, the freebayes variant caller [5].
Of particular interest is the fast cyvcf2 library that was started in 2016 with
htslib [17] bindings and is actively maintained by co-author Pedersen today [10].
Similar to bio-vcf it presents a DSL-type language that can be used in Python
programming. Meanwhile hts-nim [12] contains bindings for the Nim programming
language with similar syntax and functionality. For example, a Nim bin counter (part
of hts-nim-tools/vcf-check [12]) can be written as:

    # nim code to count variants in bins
    for variant in v.query(chrom):
      n += 1
      if variant.info.get("AF", afs) != Status.OK: continue
      if max(afs) < maf: continue
      var bin = which_bin(variant.start.int, chunk_size)
      counts.extend(bin)
      inc(counts[bin])

Nim is a statically typed compiled language that looks very similar to Python and,
because it compiles to C, Nim has a much faster runtime than Python and can link without
overhead against C libraries such as vcflib and htslib.
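For readers more familiar with Python, the same bin-counting logic can be written without any VCF library. The sketch makes several assumptions that are not taken from hts-nim: variants are supplied as (start, allele frequencies) pairs, which_bin is taken to be integer division of the start position by the chunk size, and extending the counts means growing the list with zeros up to the requested bin.

```python
def count_variants_in_bins(variants, maf, chunk_size):
    """Count variants per genomic bin, skipping those below the frequency cutoff.

    variants: iterable of (start, afs) where afs is a list of allele
    frequencies or None when the AF field is missing.
    Returns (number of variants seen, list of counts per bin).
    """
    n = 0
    counts = []
    for start, afs in variants:
        n += 1
        if not afs:           # AF missing: mirrors the Status.OK check
            continue
        if max(afs) < maf:    # below the minor allele frequency cutoff
            continue
        bin_index = start // chunk_size
        if bin_index >= len(counts):
            # Grow the list so that counts[bin_index] exists.
            counts.extend([0] * (bin_index + 1 - len(counts)))
        counts[bin_index] += 1
    return n, counts


demo = [(100, [0.2]), (900, [0.01]), (1500, [0.3, 0.05]), (1700, None), (2500, [0.5])]
print(count_variants_in_bins(demo, maf=0.1, chunk_size=1000))
```

The structure is line for line comparable to the Nim version, which illustrates how close the two languages are in syntax.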
Finally, another tool of interest, by the same author, is vcfanno. It is written in the Go
language and allows annotation of a VCF with any number of INFO fields from any
number of VCFs or BED files. vcfanno uses a simple conf file to allow the user to
specify the source annotation files and fields and how they will be added to the INFO of
the query VCF [29].
Discussion
Ten years is a long time in bioinformatics and the VCF file format is starting to show
its age. Not only is the VCF format redundant and bloated with duplication of data, a
more important concern is that the VCF format does not accommodate interesting
complex genomic variation, for example complex and nested variants such as
superbubbles, ultrabubbles, and cacti [30–32]. An even more important shortcoming of
VCF is that it always depends on a single reference genome, resulting in variant calling
bias and missing out on variation not represented in the reference [32]. One solution is
to work with multiple reference genomes, but comparing VCF files from different
reference genomes is challenging, even for different versions of one reference genome.
To address such challenges the authors are actively working on pangenome
approaches that store variation in a pangenome graph format, e.g. [32–37].
Pangenomes can incorporate multiple individuals and multiple reference genomes.
Pangenomes can cater for very complex structural variation. Pangenomes are also
efficient in storing information, including metadata, without redundancy. In effect,
pangenomes cater for a ‘lossless’ view of all data at the population level. This differs
markedly from VCF-type data because, despite the data size mentioned above, generating a VCF
implies a data reduction step (that is, data loss) that effectively disconnects variants
from each other and from related features, such as quality metrics. This means that
rebuilding the original data from VCF files is virtually impossible. In contrast, with
the newer pangenome formats it is possible to rebuild sequences independent of the
underlying complexity of features. Having a full view of the data makes downstream
analysis, such as population genotyping, more powerful with improved results, e.g. [32].
The VCF file format has become a crucial part of almost all sequencing workflows
today. The design and presentation of the VCF file format can set the norm for
designing future file formats [2, 3], but we can also learn from its mistakes. In this
paper we wrote VCF ‘standard’ consistently between quotes because, even though
there exists a standardization effort (now at VCFv4.3 [3]), VCF is flexible by
design, alternative VCF standards are introduced (e.g. [38]) and most tools take
liberties when it comes to producing VCF files. Therefore all VCF parsers have to
take a flexible approach towards digesting input data and ignore input data that is not
understood.
We recognise that the success of a file format requires a crucial focus on having an
early ‘standard’ that is both easy to understand and flexible enough to grow, in line
with the success of other bioinformatics file formats, such as SAM/BAM [3] and
GFF [39]. Biology is a fast moving field and it is impossible to predict how a file
format is going to be used in the (near) future. The downside of such flexibility is that
older software may not support features that were added later. One of the weakest
aspects of the VCF format is its metadata: next to ad hoc metadata in records (see
Fig. 1), the header record requires specialized parsing and ignores existing ontologies.
Also for the VCF records, robust validation, error checking and correctness 268
checking is virtually impossible. Great attention should therefore be paid to any 269
amendments to an earlier standard to keep backward compatibility when possible. 270
VCF and many other formats in bioinformatics use layered character separators as a
grammar for defining a tree structure of data (see Fig. 1). This type of format requires
a specialized ad hoc parser for every format. When designing new formats in the future,
we strongly suggest basing them on existing standards, such as the JSON,
JSON-LD and RDF web formats for storing hierarchical data and graph data.
Each of these formats has efficient storage implementations. A future
format should also reuse existing ontologies or, where none exists, create and
champion a new one, so that data becomes easily shareable, comparable and
queryable and lives up to FAIR requirements [40]. Not only are JSON, JSON-LD and
RDF natively and efficiently supported by most computer languages, they are also
more easily embedded in existing infrastructure, such as NoSQL databases.
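The layered separators mentioned above can be made explicit with a small Python
sketch that decodes one record into a tree and serializes it as JSON. The function
name `record_to_tree`, the chosen output keys and the example record are assumptions
made for illustration; the sketch assumes a well-formed line with at least eight
columns and is not a proposal for an actual JSON schema.

```python
import json
from typing import Any, Dict


def record_to_tree(line: str) -> Dict[str, Any]:
    """Decode the separator layers of one VCF record into nested Python data."""
    cols = line.rstrip("\r\n").split("\t")              # layer 1: tabs separate columns
    info: Dict[str, Any] = {}
    for entry in cols[7].split(";"):                     # layer 2: ';' separates INFO entries
        key, sep, value = entry.partition("=")           # layer 3: '=' splits key from value
        info[key] = value.split(",") if sep else True    # layer 4: ',' separates list items
    keys = cols[8].split(":") if len(cols) > 8 else []   # FORMAT keys are ':'-separated
    samples = [dict(zip(keys, col.split(":"))) for col in cols[9:]]
    return {
        "chrom": cols[0],
        "pos": int(cols[1]),
        "ref": cols[3],
        "alt": cols[4].split(","),
        "info": info,
        "samples": samples,
    }


# Hypothetical record with two ALT alleles and two samples.
line = "20\t1110696\t.\tA\tG,T\t67\tPASS\tNS=2;AF=0.333,0.667;DB\tGT:DP\t1|2:6\t2|1:0"
tree = record_to_tree(line)
print(json.dumps(tree, indent=2))
```

Once the data is in this form, any JSON library can read it back without a
format-specific parser, which is the property argued for in the text.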
Software development and distribution practices: In this paper we present
three types of tools that mirror three common approaches in bioinformatics towards
large data parsing. First are the vcflib Unix-style command line tools, where each tool
does a small job [9]. Second are the extremely flexible command line DSLs in the style
of bio-vcf and slivar. Third is programming against libraries that can be called from
programming languages, such as cyvcf2 and hts-nim, as well as vcflib and bio-vcf.
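The first approach, small tools that each do one job and are chained together, can be
sketched in Python with generators that consume and produce VCF lines. The filters
`min_qual` and `snps_only` are hypothetical names invented for this example and do not
correspond to vcflib tools; the input lines are made up.

```python
from typing import Iterable, Iterator

BASES = {"A", "C", "G", "T"}


def min_qual(lines: Iterable[str], threshold: float) -> Iterator[str]:
    """Pass header lines through; keep records whose QUAL is at least threshold."""
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.split("\t")
        try:
            if float(fields[5]) >= threshold:
                yield line
        except (IndexError, ValueError):
            continue  # missing or non-numeric QUAL such as '.': drop the record


def snps_only(lines: Iterable[str]) -> Iterator[str]:
    """Pass header lines through; keep records where REF and every ALT is one base."""
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.split("\t")
        if len(fields) < 5:
            continue  # too short to be a record
        ref, alts = fields[3], fields[4].split(",")
        if ref in BASES and all(alt in BASES for alt in alts):
            yield line


vcf = [
    "##fileformat=VCFv4.3\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "20\t14370\t.\tG\tA\t29\tPASS\t.\n",
    "20\t17330\t.\tT\tA\t3\tq10\t.\n",
    "20\t1234567\t.\tGTC\tG\t50\tPASS\t.\n",
]

# Each stage does one small job; stages compose like a shell pipeline.
kept = list(snps_only(min_qual(vcf, 20.0)))
print("".join(kept), end="")
```

Passing `sys.stdin` as `lines` and writing the result with `sys.stdout.writelines`
would turn the same functions into command line filters.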
A wide range of solutions exists for VCF processing that make use of these three
approaches, and functional overlap is found between vcflib, bio-vcf, cyvcf2, the
original vcftools [2], bcftools [7] and the existing Bio* programming libraries, such as
biopython [25], bioruby [27] and biojava [41]. vcftools and bcftools provide annotation,
merging, normalization and filtering capabilities that complement the functionality of
vcflib and bio-vcf and can be combined with them in workflows. These solutions together
provide a comprehensive and scalable way of dealing with VCF data, and every single
tool represents a significant investment in research and software development.
Therefore, before writing a new parser from scratch, we strongly suggest first studying
the existing solutions. In the rare case that a new tool is required, it may be worth
merging it into an existing project so everyone can benefit.
Once software is written, it is important that software development and maintenance
continue. In the biomedical sciences there is a clear risk of projects being abandoned
once the original author moves on to another job or other interests, partly due to a
lack of scientific recognition, attribution and reward [42]. We note that with the pyvcf
project, for example, this has happened twice and the GitHub contribution tracker
shows no more contributions by a project owner. This means no one is merging
changes back into the main code repository and the code is essentially unmaintained.
vcflib, bio-vcf and cyvcf2, in contrast, show a continuous adoption of code
contributions thanks to the original authors encouraging others to take ownership and
even release versions of the software. We also recognise the importance of creating
small tools that can interact with each other following the Unix philosophy.
For overall adoption of software solutions it is important that the tools and
documentation get packaged by software distributions, such as Bioconda [43],
Debian [44] and GNU Guix [45, 46]. Bioconda downloads are a good estimate of
relative popularity because they tend to represent actual installations. vcflib, for
example, was installed over 40,000 times and bio-vcf was installed over 15,000 times
through Bioconda by December 2020. vcflib is also an integral part of the freebayes
variant caller with an additional 110,000 downloads through Bioconda.
Future work: Software development never stands still. With new requirements
tools get updated. With the evolving VCF ‘standard’ and associated tooling for
pangenomes and reference-based approaches we keep updating our tools and libraries.
We intend to add more documentation and regression tests. One recent innovation in
vcflib is the generation of Unix ‘man’ pages from markdown pages and from the --help
output of every individual tool and script; these pages also double as regression tests.
This documentation is always up to date because when it goes out of sync the embedded
tests fail. We think it will also be useful to add tool descriptions in the Common
Workflow Language (CWL) format [46–48]. CWL definitions allow easy sharing of tool
components between sequencing workflows. The scenario will be to write a CWL tool
definition that can be converted to documentation and runnable tests.
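The idea of documentation that doubles as a regression test can be sketched with
Python's standard `doctest` module, used here as a stand-in for the same principle;
it is not a description of how the vcflib man pages are generated. The function
`allele_count` and the helper `run_embedded_examples` are hypothetical names
introduced for this example.

```python
import doctest


def allele_count(genotype: str) -> int:
    """Count non-reference alleles in a VCF GT string.

    The examples below are documentation and tests at the same time:

    >>> allele_count("0/1")
    1
    >>> allele_count("1|1")
    2
    >>> allele_count("./.")
    0
    """
    alleles = genotype.replace("|", "/").split("/")
    return sum(1 for allele in alleles if allele not in ("0", "."))


def run_embedded_examples(func):
    """Run the examples embedded in func's docstring; return (failed, attempted)."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(
        func.__doc__, {func.__name__: func}, func.__name__, None, 0
    )
    runner = doctest.DocTestRunner(verbose=False)
    return runner.run(test)


results = run_embedded_examples(allele_count)
print(results.failed, results.attempted)
```

If the behaviour of `allele_count` changes without the docstring being updated, the
run reports a failure, which is how out-of-sync documentation gets noticed.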
Despite our criticism of VCF, VCF as a file format is likely to remain in use. To
replace VCF, most existing tools and workflows used in sequencing would have to be
rewritten. Pangenome tools, in principle, are capable of producing reference-guided
VCF files from GFA graph structures. These tools guarantee compatibility with
upstream and downstream analysis workflows. We predict that pangenome approaches
will play an increasingly important role in sequence analysis and, at the same time,
VCF processing tools will remain in sequencing workflows for the foreseeable future.
Acknowledgments
These free software tools are a community effort and benefit from ideas by hundreds of
people. We thank all collaborators, supervisors and other contributors. We
particularly want to thank software packagers, such as Andreas Tille and
Michael R. Crusoe, for providing these tools in Debian. We also want to thank
Brad Chapman and Torsten Seemann for the Bioconda packages and Efraim Flashner
for the GNU Guix packages.
References
1. Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, et al.
The variant call format and VCFtools. Bioinformatics. 2011;27(15):2156–2158.
doi:10.1093/bioinformatics/btr330.
2. Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, et al.
The variant call format and VCFtools. Bioinformatics. 2011;27(15):2156–2158.
doi:10.1093/bioinformatics/btr330.
3. HTS-Specs: specifications of SAM/BAM and related high-throughput
sequencing file formats; 2011 (accessed April 2021).
https://samtools.github.io/hts-specs/. GitHub Repository.
4. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, et al.
The Genome Analysis Toolkit: a MapReduce framework for analyzing
next-generation DNA sequencing data. Genome Res. 2010;20(9):1297–1303.
doi:10.1101/gr.107524.110.
5. Garrison E, Marth G. Haplotype-Based Variant Detection from Short-Read
Sequencing. 2012;arXiv:abs/1207.3907.
6. Li H. Tabix: fast retrieval of sequence features from generic TAB-delimited files.
Bioinformatics. 2011;27(5):718–719. doi:10.1093/bioinformatics/btq671.
7. Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, et al.
Twelve years of SAMtools and BCFtools. Gigascience. 2021;10(2).
doi:10.1093/gigascience/giab008.
8. Lan D, Tobler R, Souilmi Y, Llamas B. genozip: a fast and efficient
compression tool for VCF files. Bioinformatics. 2020;36(13):4091–4092.
doi:10.1093/bioinformatics/btaa290.
9. Prins P, Strozzi F, Tarasov A, de Ligt J, Githinji G, et al. Small tools
MANIFESTO for Bioinformatics; 2014. doi:10.5281/zenodo.11321. Available
from: https://dx.doi.org/10.5281/zenodo.11321.
10. Pedersen BS, Quinlan AR. cyvcf2: fast, flexible variant analysis with Python.
Bioinformatics. 2017;33(12):1867–1869. doi:10.1093/bioinformatics/btx057.
11. Pedersen BS, Brown JM, Dashnow H, Wallace AD, Velinder M, Tvrdik T, et al.
Effective variant filtering and expected candidate variant yield in studies of rare
human disease. bioRxiv. 2020.
12. Pedersen BS, Quinlan AR. hts-nim: scripting high-performance genomic
analyses. Bioinformatics. 2018;34(19):3387–3389.
doi:10.1093/bioinformatics/bty358.
13. Tan A, Abecasis GR, Kang HM. Unified representation of genetic variants.
Bioinformatics. 2015;31(13):2202–2204. doi:10.1093/bioinformatics/btv112.
14. Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, et al. A
global reference for human genetic variation. Nature. 2015;526(7571):68–74.
doi:10.1038/nature15393.
15. Zook JM, McDaniel J, Parikh H, Heaton H, Irvine SA, Trigg L, et al.
Reproducible integration of multiple sequencing datasets to form
high-confidence SNP, indel, and reference calls for five human genome reference
materials. bioRxiv. 2018;doi:10.1101/281006.
16. bio-vcf: smart VCF parser; 2021 (accessed Feb 2021).
https://github.com/vcflib/bio-vcf. GitHub Repository.
17. Bonfield JK, Marshall J, Danecek P, Li H, Ohan V, Whitwham A, et al.
HTSlib: C library for reading/writing high-throughput sequencing data.
Gigascience. 2021;10(2). doi:10.1093/gigascience/giab007.
18. vcflib for working with VCF files; 2021 (accessed Feb 2021).
https://github.com/vcflib/vcflib. GitHub Repository.
19. Holsinger KE, Weir BS. Genetics in geographically structured populations:
defining, estimating and interpreting F(ST). Nat Rev Genet.
2009;10(9):639–650. doi:10.1038/nrg2611.
20. Cockerham CC, Weir BS. Estimation of gene flow from F-statistics. Evolution.
1993;47(3):855–863. doi:10.1111/j.1558-5646.1993.tb01239.x.
21. Holsinger KE, Lewis PO, Dey DK. A Bayesian approach to inferring population
structure from dominant markers. Mol Ecol. 2002;11(7):1157–1164.
doi:10.1046/j.1365-294x.2002.01512.x.
22. Nei M, Li WH. Mathematical model for studying genetic variation in terms of
restriction endonucleases. Proc Natl Acad Sci U S A. 1979;76(10):5269–5273.
doi:10.1073/pnas.76.10.5269.
23. Sabeti PC, Varilly P, Fry B, Lohmueller J, Hostetter E, Cotsapas C, et al.
Genome-wide detection and characterization of positive selection in human
populations. Nature. 2007;449(7164):913–918. doi:10.1038/nature06250.
24. Friedl JEF, Oram A. Mastering Regular Expressions: Powerful Techniques for
Perl and Other Tools. In a Nutshell Series. O’Reilly; 1997. Available from:
https://books.google.nl/books?id=qEsPAQAAMAAJ.
25. Cock PJ, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, et al.
Biopython: freely available Python tools for computational molecular biology
and bioinformatics. Bioinformatics. 2009;25(11):1422–1423.
26. Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, et al.
The Bioperl toolkit: Perl modules for the life sciences. Genome Res.
2002;12(10):1611–1618.
27. Goto N, Prins P, Nakao M, Bonnal R, Aerts J, Katayama T. BioRuby:
bioinformatics software for the Ruby programming language. Bioinformatics.
2010;26(20):2617–2619. doi:10.1093/bioinformatics/btq475. Open access.
28. Knaus BJ, Grünwald NJ. VCFR: a package to manipulate and visualize variant
call format data in R. Molecular Ecology Resources. 2017;17(1):44–53.
29. Pedersen BS, Layer RM, Quinlan AR. Vcfanno: fast, flexible annotation of
genetic variants. Genome Biol. 2016;17(1):118. doi:10.1186/s13059-016-0973-5.
30. Paten B, Eizenga JM, Rosen YM, Novak AM, Garrison E, Hickey G.
Superbubbles, Ultrabubbles, and Cacti. Journal of Computational Biology.
2018;25(7):649–663. doi:10.1089/cmb.2017.0251.
31. Paten B, Novak AM, Eizenga JM, Garrison E. Genome Graphs and the
Evolution of Genome Inference. Genome Research. 2017;27(5):665–676.
doi:10.1101/gr.214155.116.
32. Garrison E, Siren J, Novak AM, Hickey G, Eizenga JM, Dawson ET, et al.
Variation Graph Toolkit Improves Read Mapping by Representing Genetic
Variation in the Reference. Nature Biotechnology. 2018;36(9):875–879.
doi:10.1038/nbt.4227.
33. Graphical Fragment Assembly (GFA) Format Specification; 2015 (accessed Jan 2021).
https://github.com/GFA-spec/GFA-spec. GitHub Repository.
34. vgtools for Working with Genome Variation Graphs; 2014 (accessed Jan 2021).
https://github.com/vgteam/. GitHub Repository.
35. Pangenome Tools; 2020 (accessed Jan 2021).
https://github.com/pangenome/. GitHub Repository.
36. Pangenome Tools; 2020 (accessed Jan 2021).
https://pangenome.github.io/. GitHub Repository.
37. pggb: pangenome graph builder; 2020 (accessed Jan 2021).
https://github.com/pangenome/pggb. GitHub Repository.
38. Lin MF, Bai X, Salerno WJ, Reid JG. Sparse Project VCF: efficient encoding of
population genotype matrices. bioRxiv. 2020;doi:10.1101/611954.
39. GFF-Spec: Generic Feature Format Version 3 (GFF3); 2016 (accessed April 2021).
GFF3 Specification. GitHub Repository.
40. Wilkinson MD, Dumontier M, Aalbersberg IJ, Appleton G, Axton M, Baak A,
et al. The FAIR Guiding Principles for scientific data management and
stewardship. Sci Data. 2016;3:160018. doi:10.1038/sdata.2016.18.
41. Holland RC, Down TA, Pocock M, Prlic A, Huen D, James K, et al. BioJava:
an open-source framework for bioinformatics. Bioinformatics.
2008;24(18):2096–2097.
42. Prins P, de Ligt J, Tarasov A, Jansen RC, Cuppen E, Bourne PE. Toward
effective software solutions for big biology. Nat Biotechnol. 2015;33(7):686–687.
doi:10.1038/nbt.3240.
43. Grüning B, Dale R, Sjödin A, Rowe J, Chapman B, Tomkins-Tinch CH,
Valieris R, Köster J. Bioconda: A sustainable and
comprehensive software distribution for the life sciences. bioRxiv.
2017;doi:10.1101/207092.
44. Debian Linux Software Distribution; 1993 (accessed April 2021).
https://debian.org/. Online Webpage.
45. Bavier E, Courtès L, Garlick P, Prins P, Wurmus R. Guix-HPC Activity Report
2017–2018. Inria Bordeaux Sud-Ouest; Max Delbrück Center for Molecular
Medicine; Cray, Inc.; Tourbillion Technology; 2019. Available from:
https://hal.inria.fr/hal-02056461.
46. Prins P. Creating a reproducible workflow with CWL; 2019. Online.
GNU GUIX HPC BLOG.
47. Amstutz P, Crusoe MR, Tijanić N, Chapman B, Chilton J,
Heuer M, Kartashov A, Kern J, Leehr D, Ménager H,
Nedeljkovich M, Scales M, Soiland-Reyes S, Stojanovic L. Common
Workflow Language, v1.0; 2016. Available from:
http://dx.doi.org/10.6084/m9.figshare.3115156.v2.
48. Strozzi F, Janssen R, Wurmus R, Crusoe MR, Githinji G, Di Tommaso P, et al.
Scalable Workflows and Reproducible Data Analysis for Genomics. Methods
Mol Biol. 2019;1910:723–745. doi:10.1007/978-1-4939-9074-0_24.
bioRxiv preprint doi: https://doi.org/10.1101/2021.05.21.445151; this version posted May 23, 2021. The copyright
holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a
license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.