SOFTWARE Open Access
GCViT: a method for interactive, genome-wide visualization of resequencing and SNP array data
Andrew P. Wilkey¹, Anne V. Brown², Steven B. Cannon² and Ethalinda K. S. Cannon²*
Abstract
Background: Large genotyping datasets have become commonplace due to efficient, cheap methods for SNP
identification. Typical genotyping datasets may have thousands to millions of data points per accession, across tens
to thousands of accessions. There is a need for tools to help rapidly explore such datasets, to assess characteristics
such as overall differences between accessions and regional anomalies across the genome.
Results: We present GCViT (Genotype Comparison Visualization Tool), for visualizing and exploring large
genotyping datasets. GCViT can be used to identify introgressions, conserved or divergent genomic regions,
pedigrees, and other features for more detailed exploration. The program can be used online or as a local instance
for whole genome visualization of resequencing or SNP array data. The program performs comparisons of variants
among user-selected accessions to identify allele differences and similarities between accessions and a user-
selected reference, providing visualizations through histogram, heatmap, or haplotype views. The resulting analyses
and images can be exported in various formats.
Conclusions: GCViT provides methods for interactively visualizing SNP data on a whole genome scale, and can
produce publication-ready figures. It can be used in online or local installations. GCViT enables users to confirm or
identify genomic regions of interest associated with particular traits.
GCViT is freely available at https://github.com/LegumeFederation/gcvit. The 1.0 version described here is available
at https://doi.org/10.5281/zenodo.4008713.
Keywords: GCViT, CViT, SNP, Resequencing, Genotype, Visualization, UI, Web service
Background
As high throughput genotyping costs have dropped, the
dense genotyping of large germplasm collections has be-
come commonplace. Re-sequencing and SNP-array pro-
jects are used to identify sequence variants between
multiple lines, and may be used to perform genome wide
association studies (GWAS) to find variants that are as-
sociated with phenotypes. These studies can produce
millions of SNPs. For example, Torkamaneh et al. [1]
identified 15 million variants among 1007 accessions of
soybean, which has relatively low diversity compared
with a crop such as maize. Often these data sets are used
for a single genome wide association study (GWAS), but
such data sets are rich and may be repurposed for other
studies. Reuse of this valuable data requires tools for
visualization and analysis.
Several tools exist for exploring these data. The command
line tool Genotype Query Tools (GQT) [2] and its
web form, webGQT [3], provide a means of indexing and
querying VCF files. However, they lack visualization options.
Many tools are available for genomic and genotypic
data visualization [4]. Some of these tools include:
© The Author(s). 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License,
which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give
appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if
changes were made. The images or other third party material in this article are included in the article's Creative Commons
licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons
licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain
permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the
data made available in this article, unless otherwise stated in a credit line to the data.
* Correspondence: ethy.cannon@usda.gov
2
USDA-ARS Corn Insects and Crop Genetics Research Unit, Ames, IA 50011,
USA
Full list of author information is available at the end of the article
Wilkey et al. BMC Genomics (2020) 21:822
https://doi.org/10.1186/s12864-020-07217-2
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Flapjack [5], Integrative Genomics Viewer (IGV) [6],
TASSEL-GBS [7], JBrowse [8–10], Xena [11], and SNPversity
[12]. These tools provide visualization of selected
genomic regions or genes in a single view, but lack a whole
genome overview. Tools that provide a whole genome
scale visualization in a single view include CViT
(Chromosome Visualization Tool) [13], Synteny Explorer
[14], MizBee [15] and the SNP & Variation Suite (SVS)
from Golden Helix®, yet these tools do not automate the
comparison of accessions using SNP data. CViT displays
features on backbones, including complete genetic and
cytogenetic maps, and whole genome views of genomic
features. However, although CViT can be integrated into
online resources, it is a standalone Perl application that
generates static images of predefined comparisons, limiting
its utility for interactive exploration.
In this paper we describe a new tool, GCViT (Genotype
Comparison Visualization Tool), for dynamic, whole
genome visualization of resequencing and SNP array
data through histogram, heatmap or haplotype views of
two or more accessions selected from a genotyping data
set. The visualization enables identification of regions of
similarity and difference across the genome. GCViT is
built on top of CViTjs, the JavaScript rewrite of CViT
(https://github.com/LegumeFederation/gcvit).
Implementation
GCViT operates on Variant Call Format (VCF) files that have
been mapped to a single reference genome assembly.
Users select a reference genotype and one or more comparison
genotypes, then GCViT performs pairwise comparisons
between each comparison genotype and the reference
genotype and displays the results on a whole genome
view of the reference assembly. GCViT is written in
Golang (https://golang.org) and JavaScript. The JavaScript
application CViTjs (https://github.com/LegumeFederation/cvitjs)
is used to display the comparison
results. Resulting images can be downloaded in PNG
or SVG format.
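The pairwise comparison described above can be pictured with a minimal Python sketch. It is an illustration only and not GCViT's Go implementation: the function name, the genotype strings, and the choice to skip sites with a missing call are all assumptions made for the example.

```python
def compare_to_reference(ref_calls, cmp_calls):
    """Count sites where a comparison genotype matches or differs from the reference.

    Both arguments are equal-length lists of genotype strings such as "0/0"
    or "1/1". None marks a missing call; skipping such sites is an assumption
    of this sketch, not documented GCViT behavior.
    """
    if len(ref_calls) != len(cmp_calls):
        raise ValueError("genotype lists must cover the same sites")
    counts = {"same": 0, "different": 0, "total": 0}
    for ref, cmp in zip(ref_calls, cmp_calls):
        if ref is None or cmp is None:
            continue  # no call in one of the two accessions
        counts["total"] += 1
        if ref == cmp:
            counts["same"] += 1
        else:
            counts["different"] += 1
    return counts


reference = ["0/0", "1/1", None, "0/1"]
comparison = ["0/0", "0/0", "1/1", "0/1"]
print(compare_to_reference(reference, comparison))
```

The three counters correspond to the three comparison types offered in the interface (same as the reference, different from the reference, and total SNP count).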
Overview
GCViT, CViTjs, and CViT display features across the
full genome and use similar glyphs to represent features
and similar configuration files. CViT is a generic Perl ap-
plication for displaying features on any sort of backbone
(e.g. linkage group, pseudomolecule) which uses the GD
graphics library and produces a static image. CViTjs is a
JavaScript rewrite of CViT that makes use of Canvas for
drawing images and enables interactive data views.
GCViT, a Golang wrapper around CViTjs, provides add-
itional functionality for handling genotype data and per-
mits users to interactively choose reference and
comparison genotypes and modify display options.
Although GCViT is able to handle fairly large data
sets, data sets of millions of SNPs and/or hundreds of
genotypes may need to be subsampled. A utility for
accomplishing this, subsample_vcf.pl, is included in the
distribution. This script can be used to filter SNPs by
quality, and will select representative SNPs within a spe-
cified genomic window size.
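The idea behind subsampling can be sketched in Python as follows. This is not the logic of subsample_vcf.pl, which is not described in detail here: the tuple layout, the default thresholds, and the rule of keeping the highest-quality SNP per window are assumptions chosen for illustration.

```python
def subsample_snps(snps, min_qual=30.0, window=100_000):
    """Keep at most one SNP per (chromosome, window): the highest-quality one.

    `snps` is an iterable of (chrom, pos, qual) tuples with 1-based positions.
    SNPs below `min_qual` are dropped before a representative is chosen.
    """
    best = {}
    for chrom, pos, qual in snps:
        if qual < min_qual:
            continue  # quality filter
        key = (chrom, (pos - 1) // window)  # window index on this chromosome
        if key not in best or qual > best[key][2]:
            best[key] = (chrom, pos, qual)
    return sorted(best.values(), key=lambda snp: (snp[0], snp[1]))


example = [
    ("Gm01", 100, 50.0),
    ("Gm01", 5_000, 90.0),
    ("Gm01", 150_000, 20.0),
    ("Gm01", 180_000, 40.0),
    ("Gm02", 10, 35.0),
]
for snp in subsample_snps(example):
    print(snp)
```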
Availability
GCViT is available at https://github.com/LegumeFederation/gcvit
under an MIT license. CViTjs is available at
https://github.com/LegumeFederation/cvitjs, also under
an MIT license. The 1.0 version of GCViT that is described
here is available at https://doi.org/10.5281/zenodo.4008713.
Results
Server-side configuration
An instance of GCViT can be set up in a Docker container
or installed as a Go and Node.js application. Setup
consists of editing the service configuration file, preparing
the data sets and connecting them to GCViT, and configuring
the user interface (UI). Instructions for
deploying an instance of GCViT are provided in the
GitHub repository (https://github.com/LegumeFederation/gcvit).
The GCViT display
Binning and scaling
To handle the display of dense data, the chromosomes
are divided into bins and a count is shown for
each bin. The default bin size is 500 kb, but this can be
changed in the server-side configuration file and interactively
by the user. Bin sizes should be set according to
SNP density and the degree of scaling: very high-density
genotype data may be suited to smaller bin sizes, but a
large genome will require larger bins because data
cannot be displayed at a scale of less than one
pixel.
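A short Python sketch of the binning step is given here. Only the 500 kb default comes from the text; the function names, the 1-based coordinate convention, and the pixel calculation are assumptions made for illustration.

```python
from collections import defaultdict

DEFAULT_BIN_SIZE = 500_000  # 500 kb, the default bin size stated in the text


def bin_counts(positions, bin_size=DEFAULT_BIN_SIZE):
    """Return {chromosome: {bin_index: count}} for (chrom, pos) pairs.

    Positions are treated as 1-based, so positions 1..bin_size fall in bin 0.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for chrom, pos in positions:
        counts[chrom][(pos - 1) // bin_size] += 1
    return {chrom: dict(bins) for chrom, bins in counts.items()}


def min_bin_size(chrom_length_bp, chrom_height_px):
    """Smallest bin size (bp) that still covers at least one pixel."""
    return -(-chrom_length_bp // chrom_height_px)  # ceiling division


sites = [("Gm01", 1), ("Gm01", 500_000), ("Gm01", 500_001), ("Gm02", 1_200_000)]
print(bin_counts(sites))
print(min_bin_size(50_000_000, 800))
```

The second function illustrates the pixel constraint: a chromosome of 50 Mb drawn 800 pixels tall cannot usefully show bins much smaller than about 62.5 kb.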
Display types
There are three data displays: histogram, heatmap,
and haplotype. The histogram view shows SNP counts
in each bin, where the size of each bar represents the
count scaled between the minimum and maximum
values across the entire genome. The heatmap view
shows SNP counts within each bin using color ranges
that are scaled between the minimum and maximum
values across the genome. The haplotype view marks a
bin as present if the SNP count in the bin
matches or exceeds a given threshold, and as absent otherwise.
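The three display types can be summarized as three ways of mapping a bin count to a glyph. The Python sketch below is an illustration under stated assumptions: linear scaling, clamping to user-set limits, a white-to-color blend for the heatmap, and all function names are choices made for the example and are not taken from the GCViT or CViTjs source.

```python
def scale(counts, vmin=None, vmax=None):
    """Map counts to fractions in [0, 1] between the minimum and maximum.

    vmin and vmax default to the observed extremes; user-set limits clamp
    values that fall outside them.
    """
    lo = min(counts) if vmin is None else vmin
    hi = max(counts) if vmax is None else vmax
    if hi <= lo:
        return [0.0 for _ in counts]
    return [(min(max(c, lo), hi) - lo) / (hi - lo) for c in counts]


def histogram_bars(counts, max_px=40, vmin=None, vmax=None):
    """Histogram view: bar length in pixels, proportional to the scaled count."""
    return [round(f * max_px) for f in scale(counts, vmin, vmax)]


def heatmap_colors(counts, rgb=(255, 0, 0), vmin=None, vmax=None):
    """Heatmap view: blend from white (lowest) to the genotype color (highest)."""
    return [
        tuple(round(255 + (channel - 255) * f) for channel in rgb)
        for f in scale(counts, vmin, vmax)
    ]


def haplotype_marks(counts, threshold):
    """Haplotype view: a bin is marked when its count meets the threshold."""
    return [c >= threshold for c in counts]


bins = [0, 5, 10]
print(histogram_bars(bins))
print(heatmap_colors(bins))
print(haplotype_marks(bins, threshold=5))
```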
User Interface
The user interface (UI) controls are grouped into sections:
"Configure View", where the data set and genotypes
are selected; "Display", where the image and its
interactive controls are displayed; and "View Controls",
which contains controls for turning on and off portions
of the image. Detailed instructions for the UI are provided
in GCViT itself, through the Help button.
Selecting the reference genotype
The first step is to select a data set and reference geno-
type. Data set availability is established in the configur-
ation, along with file paths and data set name. In
addition, availability of a particular data set may be con-
trolled via simple authentication. Comparisons can be
made only within a single data set.
Selecting the comparison genotypes
After selecting the data set and reference genotype, one
or more comparison genotypes can be selected and each
assigned a distinct color. A full color palette is provided
to help distinguish the selected genotypes.
SNP comparisons
Comparisons can be displayed on the left or right side of
the chromosome backbones. For each comparison, the
user selects a display type (histogram, heatmap, or
haplotype), and the type of comparison (alleles are dif-
ferent from the reference, same as the reference, or the
total SNP count). Depending on the display type, the
user has the option of setting specific minimum or max-
imum values rather than leaving GCViT to calculate
them across the genome (histogram and heatmap), or of
setting a threshold value (haplotype).
Fig. 1 GCViT UI with an example comparison from the SoySNP50K dataset. a "Configure view" section of the UI, in which the figure's constraints
are generated. b Interactive "display" section of the selected comparison. c "View controls" for further filtering of the displayed view
Fig. 2 Identifying introgression from the Mesoamerican population into the Andean line Heirloom using the Histogram view with default settings.
Differences between Heirloom and Andean lines Dolly (purple), Majesty (green), and Bonus (blue) are displayed on the left-hand side of the
chromosome. Differences between Heirloom and Mesoamerican lines Avalanche (yellow), Maverick (orange) and Zorro (red) are displayed on the
right-hand side of the chromosome. All of the chromosomes (Chr01 to Chr11) are displayed
Optional settings
In the Configure View Options section, the image can
be given a title, the bin size can be changed, the ruler
placement can be modified, and the ruler interval (fre-
quency of tic marks and how often coordinate counts
are displayed) can be changed.
Control buttons
There are three main buttons: Display, Download, and
Help. The Display button generates the image. The
image may be larger than the viewport, in which case it
can be moved by clicking and dragging the image. The
Download button gives the option of downloading the
results in SVG or PNG format. The two options differ:
the SVG download contains the whole image (which may be larger
than what is displayed on the screen), while the PNG
download contains only what is currently visible in the
viewport. The GFF file that was created and used to
draw the visualization can also be downloaded. The Help
button provides information about GCViT and instructions
for using the interface.
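The downloadable GFF file can be pictured as one feature line per bin. The Python sketch below writes GFF3-style lines with the standard nine tab-separated columns (seqid, source, type, start, end, score, strand, phase, attributes). The source and type values, the attribute keys, and the function name are illustrative assumptions and do not describe the exact file that GCViT writes.

```python
def bins_to_gff(chrom, bin_counts, bin_size, source="example", feature_type="bin"):
    """Render per-bin counts as GFF3-style lines (hypothetical field choices)."""
    lines = []
    for index in sorted(bin_counts):
        start = index * bin_size + 1   # GFF coordinates are 1-based, inclusive
        end = (index + 1) * bin_size
        count = bin_counts[index]
        attributes = f"ID={chrom}_bin{index};count={count}"
        fields = [chrom, source, feature_type, str(start), str(end),
                  str(count), ".", ".", attributes]
        lines.append("\t".join(fields))
    return lines


for line in bins_to_gff("Gm01", {1: 3, 0: 2}, bin_size=500_000):
    print(line)
```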
Pop up box
Clicking on a glyph in the image will pop up a box that
identifies the bin number, chromosome coordinates, the
value for each accession and the total value for the bin.
The pop up box can be customized by modifying the
CViTjs pop up template. Examples of potential customizations
include link-outs to other resources, such as the
Germplasm Resources Information Network (GRIN) accession
page, or to a genome browser. In our example
on SoyBase there are link-outs to the SoyBase GBrowse
instance, for exploration of genic features in the bin, and
to the Legume Information System Context Viewer,
which enables examination of synteny among similar
genomic regions.
Key
Above the image, a key is displayed with the currently
displayed genotypes and their respective colors. This key
will update only after the Display button has been
pressed to update the view.
Image controls
On the left side of the image is a toolbox that provides
zoom controls and a set of drawing options that permit
drawing free-hand lines or rectangles, an eraser, and a
color palette. The image can be moved within the view-
port by clicking and dragging with the mouse (Fig. 1).
Note that the bin size does not change when zooming in
or out.
Fig. 3 Inheritance of soybean line Blackhawk using the Haplotype view with a threshold of 5. SNP differences between Blackhawk and its sibling
Hawkeye (PI 548577) are displayed on the left side of the chromosome in red. SNP differences between Blackhawk and its parents are displayed
on the right, with Mukden in green and Richland in blue. The first 14 chromosomes (Gm01 to Gm14) are displayed
View control
At the bottom of the page, the "View Control" section
permits the user to toggle off and on individual chromosomes
and other display elements (Fig. 1).
Data sets
Online instances of GCViT at the time of writing in-
clude soybean (https://soybase.org/gcvit/), common
bean (https://gcvit.phaseolus.legumeinfo.org), chickpea
(https://gcvit.cicer.legumeinfo.org), and peanut
(https://peanutbase.org/germplasm/gcvit/). Data sets
available for soybean include: the whole U.S. germ-
plasm collection genotyped with the SoySNP50K array
[16], resequencing of 481 soybean accessions [17],
resequencing of 102 Canadian accessions [18], the
soybean Nested Association Mapping (SoyNAM) par-
ents and progeny [19,20], 222 Korean accessions ge-
notyped using the Axiom® SoyaSNP array [21], 4234
Korean accessions using the Axiom® SoyaSNP array
[22], GmHapMap data consisting of 1007 resequenced
accessions [1], and genotyping of 374 U.S. and Brazil-
ian accessions [23].
The chickpea data set contains genotype information
from 279 chickpea accessions [24]. For common
bean, diversity data are available for two diverse collections
of Phaseolus vulgaris: the Mesoamerican Diversity
Panel (MDP) and the Andean Diversity Panel (ADP)
[25]. The peanut data set contains the U.S. Peanut Mini
Core Collection genotyped using the 58 K Affymetrix
SNP array, Axiom Arachis [26].
Discussion
Case studies
There are many potential uses for GCViT. Here we de-
scribe four use cases.
Use case 1: identify introgressions between two common
bean populations
Using GCViT on the Bean CAP diversity panels [25],
the Andean and Mesoamerican populations can be
compared to identify introgressions. In Fig. 2, the
Andean line "Heirloom" is the reference genotype,
which is compared with three other Andean lines:
Dolly, Majesty and Bonus; and three Mesoamerican
lines: Avalanche, Maverick and Zorro. Regions that
were introgressed from the Mesoamerican population
can be seen on chromosomes 3, 4 and 9. Although
Fig. 4 Elite, Landrace and Wild soybean lines were compared against reference accession Williams 82 (Wm82) using the Heatmap view and a
Max threshold of 30. In order from left to right are Elite accessions: PI 548655 (green, common name Forrest), PI 548656 (turquoise, common
name Lee) and PI 546487 (blue, common name Archer); Landrace accessions: PI 567155A (red), PI 361109 (yellow) and PI 407802 (orange); Wild
accessions: PI 407040 (pink), PI 163453 (purple), and PI 407247 (gray). A subset of chromosomes is displayed (Gm05, Gm10, Gm12, Gm19 and Gm20);
Gm05 and Gm20 show conserved regions within the cultivated soybeans (Elite and Landrace), highlighted in orange boxes. Centromeric regions are
indicated by dark gray boxes on the chromosomes
there are differences between Heirloom and the
other three Andean lines in these regions, there are few
differences between Heirloom and the Mesoamerican lines,
suggesting that these regions were introgressed from the
Mesoamerican population.
Use case 2: inheritance analysis
Using the SoySNP50K data [16], pedigree relation-
ships were plotted, comparing soybean line Blackhawk
(PI 548516) to sibling Hawkeye (PI 548577), and par-
ents Mukden (PI 548391) and Richland (PI 548406)
Fig. 5 Differences between two lines with the same name using the Histogram view. a Using a merged dataset of the SoySNP50K and 180K
Axiom array from Lee et al. [21], we compared PI 424032 from the SoySNP50K to PI 424032 from Lee et al. SNPs in gray (left) represent the total
number of SNPs used in the comparison while SNPs in orange (right) indicate differences between the two lines. b Using the SoySNP50K dataset,
two Dwight lines were compared: Dwight from Dr. Song's lab and Dwight directly from the soybean germplasm collection. SNPs in gray (left)
represent the total number of SNPs used in the comparison while SNPs in pink (right) indicate differences between the two lines. Centromeric
regions are indicated by dark gray boxes on the chromosomes. A subset of chromosomes is displayed
(Fig. 3). In this example, every region with a difference
between Blackhawk and its sibling also shows a
difference between Blackhawk and one of its parents,
indicating that the two siblings inherited that region
from different parents. From this information, it
is apparent that most of Gm04, Gm08, and Gm17 of
the siblings were inherited from Mukden, while most
of Gm05 was inherited from Richland.
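The reasoning in this use case can be written down as a small rule applied to each bin. The Python sketch below is a hypothetical illustration, not a GCViT feature: it takes the per-bin haplotype marks (True where the progeny differs from a parent) and assigns a bin to one parent when the progeny differs only from the other parent. Bins that differ from both or neither parent are left unassigned.

```python
def parent_of_origin(differs_from_a, differs_from_b, names=("A", "B")):
    """Assign each bin to the parent that the progeny does NOT differ from.

    Inputs are equal-length lists of booleans, one entry per bin. Bins that
    are uninformative (same as both parents, or different from both) get None.
    """
    calls = []
    for diff_a, diff_b in zip(differs_from_a, differs_from_b):
        if diff_a and not diff_b:
            calls.append(names[1])  # matches parent B only
        elif diff_b and not diff_a:
            calls.append(names[0])  # matches parent A only
        else:
            calls.append(None)
    return calls


vs_mukden = [True, False, False, True]
vs_richland = [False, True, False, True]
print(parent_of_origin(vs_mukden, vs_richland, names=("Mukden", "Richland")))
```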
Use case 3: identify conserved genomic regions and/or
regions of interest
Using the SoySNP50K data, we can identify regions
that are conserved among cultivated soybean lines
(Glycine max) but divergent in the wild relative (Glycine soja). In
this example we used soybean cultivar Williams 82
(Wm82) as the reference and compared it to six other
cultivated soybean lines and three wild (G. soja) lines. On
chromosomes Gm05 and Gm20 there are regions that
show no differences between Wm82 and the other
cultivated soybeans, but clear differences between the
cultivated soybeans and the wild soybeans (Fig. 4).
These regions may have been selected
during domestication.
Use case 4: identify if two soybean accessions labeled the
same are indeed the same
It is known that there can be genomic variation between
soybean cultivars with the same name due to differential
segregation of polymorphic regions during the breeding
process [27]. One cultivar where we see this variation is
the representative soybean genome, Williams 82 (Wm82).
In these two examples we show two different situations
where two accessions with the same name are not 100%
genetically similar.
Example 1
In this example we use a soybean accession that
was genotyped in two different studies. The two VCF
files were merged using BCFtools and a similarity
matrix was created using SNPRelate. Accessions present
in both data sets were identified, and all of them matched
their counterparts except for two. One of these accessions
was PI 424032, which had a similarity
score of 0.715. The differences between the two
versions of this accession were then plotted using GCViT
(Fig. 5a). In this example we can see that PI
424032 as genotyped with the SoySNP50K array and by Lee
et al. [21] are clearly different (Fig. 5a). It was
later confirmed by the author/PI (personal comm. Dr.
Soon-Chun Jeong) that the wrong seed was received
from the Soybean Germplasm Repository.
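To make the notion of a similarity score concrete, the Python sketch below computes the fraction of identical genotype calls at sites shared by two accessions. This is a deliberately simple stand-in and an assumption of this example; it is not the calculation performed by SNPRelate, and the dictionary layout and function name are hypothetical.

```python
def similarity(calls_a, calls_b):
    """Fraction of identical calls at sites genotyped in both accessions.

    Each argument maps a site key, here (chrom, pos), to a genotype string,
    or to None when the call is missing.
    """
    shared = [
        site for site in calls_a
        if site in calls_b and calls_a[site] is not None and calls_b[site] is not None
    ]
    if not shared:
        raise ValueError("no shared genotyped sites")
    matches = sum(calls_a[site] == calls_b[site] for site in shared)
    return matches / len(shared)


study_1 = {("Gm01", 100): "A/A", ("Gm01", 200): "C/C", ("Gm01", 300): "G/G", ("Gm02", 50): None}
study_2 = {("Gm01", 100): "A/A", ("Gm01", 200): "T/T", ("Gm01", 300): "G/G",
           ("Gm02", 50): "A/A", ("Gm03", 1): "C/C"}
print(round(similarity(study_1, study_2), 3))
```

A pair of accessions that should be identical but scores well below 1, as PI 424032 does here, is a natural candidate for plotting in GCViT to see where the differences fall.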
Example 2
Using two different lines of soybean accession Dwight,
we are able to identify regions of selection. Using the
SoySNP50K data set, the similarity score between
Dwight and PI 597386 was calculated as 0.977. Using
GCViT we can plot where these two lines differ (Fig. 5b).
This information is consistent with the two lines having
started out the same but then having been grown out for
multiple generations in different labs (personal comm. Dr. Qijian Song).
Resources
A video tutorial can be found here: https://www.youtube.com/watch?v=B2gPVUipWo0
and a blog post on GCViT can be found here:
https://www.legumefederation.org/en/blog/2020/03/26/genotype-comparison-visualization-tool-gcvit/
Further development
GCViT remains in active development. As it is in the
process of being adopted for additional organisms and
research communities, we are receiving requests for
enhancements and new features. These and future re-
quests will be considered for inclusion in subsequent
releases.
Conclusions
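GCViT provides interactive visualization of SNP data on a
whole genome scale. As the use cases show, this visualization
can reveal introgressions, inheritance patterns, conserved
genomic regions, and mislabeled or divergent accessions.
Images can be downloaded as publication-ready figures.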
Availability and requirements
Project name: GCViT (Genotype Comparison Visualization Tool)
Project home page: https://github.com/LegumeFederation/gcvit
Operating system(s): Platform Independent
Programming language: Golang, JavaScript
Other requirements: NodeJS >= 10.13.0 and Go >= 1.10, or Docker to build
License: MIT
Any restrictions to use by non-academics: No restrictions
Abbreviations
CViT: Chromosome Visualization Tool; GCViT: Genotype Comparison
Visualization Tool; GRIN: Germplasm Resources Information Network;
GWAS: Genome Wide Association Study; LIS: Legume Information System;
SoySNP50K: Illumina Infinium BeadChip containing 50,000 SNPs from
soybean, used to genotype the full U.S. soybean germplasm collection;
UI: User Interface; VCF: Variant Call Format; Wm82: Williams 82
Acknowledgements
We thank Nathan Weeks for help in testing, containerization, and
deployment of the GCViT package, and all the members of the USDA
SoyBase and Legume Database group who have provided suggestions
during the development process. This research was supported in part by the
US Department of Agriculture, Agricultural Research Service, project 5030-
21000-062-00D. USDA is an equal opportunity provider and employer.
Authors' contributions
APW: design and all code, AVB: design, SBC and EKSC: design and
development of original CViT, and design guidance for GCViT. AVB and
APW wrote the manuscript. All authors edited and approved the final
manuscript.
Funding
This research was supported in part by the NSF project "Federated Plant
Database Initiative for the Legumes", award #1444806, and by the U.S.
Department of Agriculture, Agricultural Research Service, project
5030-21000-069-00D. Mention of trade names or commercial products in this
publication is solely for the purpose of providing specific information and
does not imply recommendation or endorsement by the U.S. Department of
Agriculture. USDA is an equal opportunity provider and employer.
Availability of data and materials
GCViT is freely available here: https://github.com/LegumeFederation/gcvit.
The 1.0 version described here is available at https://doi.org/10.5281/zenodo.4008713.
All data used in online versions of GCViT can be found at the Legume
Federation Datastore.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors report no competing interests.
Author details
¹ORISE Fellow, USDA-ARS Corn Insects and Crop Genetics Research Unit,
Ames, IA 50011, USA. ²USDA-ARS Corn Insects and Crop Genetics Research
Unit, Ames, IA 50011, USA.
Received: 26 June 2020 Accepted: 9 November 2020
References
1. Torkamaneh D, Laroche J, Valliyodan B, O'donoughue L, Cober E, Rajcan
I, et al. Soybean haplotype map (GmHapMap): a universal resource for
soybean translational and functional genomics. BioRxiv. 2019:534578.
2. Layer RM, Kindlon N, Karczewski KJ, Quinlan AR, Exome Aggregation
Consortium. Efficient genotype compression and analysis of large genetic-
variation data sets. Nature Methods. 2016;13(1):63.
3. Arumilli M, Layer RM, Hytönen MK, Lohi H. webGQT: A Shiny Server for
Genotype Query Tools for Model-Based Variant Filtering. Front Genetics.
2020;11:152.
4. Nusrat S, Harbig T, Gehlenborg N. Tasks, techniques, and tools for genomic
data visualization. Computer Graphics Forum. 2019;38(3):781–805.
5. Milne I, Shaw P, Stephen G, Bayer M, Cardle L, Thomas WT, et al.
Flapjack: graphical genotype visualization. Bioinformatics. 2010;26(24):
3133–4.
6. Thorvaldsdóttir H, Robinson JT, Mesirov JP. Integrative genomics viewer
(IGV): high-performance genomics data visualization and exploration. Brief
Bioinform. 2013;14(2):178–92.
7. Glaubitz JC, Casstevens TM, Lu F, Harriman J, Elshire RJ, Sun Q, Buckler ES.
TASSEL-GBS: a high capacity genotyping by sequencing analysis pipeline.
PLoS One. 2014;9(2):e90346.
8. Skinner ME, Uzilov AV, Stein LD, Mungall CJ, Holmes IH. JBrowse: a next-
generation genome browser. Genome Res. 2009;19(9):1630–8.
9. Westesson O, Skinner M, Holmes I. Visualizing next-generation sequencing
data with JBrowse. Brief Bioinform. 2013;14(2):172–7.
10. Buels R, Yao E, Diesh CM, Hayes RD, Munoz-Torres M, Helt G, et al. JBrowse:
a dynamic web platform for genome visualization and analysis. Genome
Biol. 2016;17(1):66.
11. Goldman M, Craft B, Brooks A, Zhu J, Haussler D. The UCSC Xena platform
for cancer genomics data visualization and interpretation. BioRxiv. 2018:
326470.
12. Schott DA, Vinnakota AG, Portwood JL, Andorf CM, Sen TZ. SNPversity: a
web-based tool for visualizing diversity. Database. 2018;1:2018.
13. Cannon EK, Cannon SB. Chromosome visualization tool: a whole genome
viewer. Int J Plant Genomics. 2011;2011:373875.
14. Bryan C, Guterman G, Ma KL, Lewin H, Larkin D, Kim J, Ma J, Farre
M. Synteny explorer: an interactive visualization application for
teaching genome evolution. IEEE Trans Vis Comput Graph. 2016;
23(1):711–20.
15. Meyer M, Munzner T, Pfister H. MizBee: a multiscale synteny browser. IEEE
Trans Vis Comput Graph. 2009;15(6):897–904.
16. Song Q, Hyten DL, Jia G, Quigley CV, Fickus EW, Nelson RL, Cregan PB.
Fingerprinting soybean germplasm and its utility in genomic research. G3:
Genes Genomes Genetics. 2015;5(10):1999–2006.
17. Valliyodan B, Brown AV, Cannon SB, Nguyen H. Data from: Genetic variation
among 481 diverse soybean accessions. 2020. Ag Data Commons. https://
doi.org/10.15482/USDA.ADC/1518301.
18. Torkamaneh D, Laroche J, Tardivel A, O'Donoughue L, Cober E, Rajcan I,
Belzile F. Comprehensive description of genomewide nucleotide and
structural variation in short-season soya bean. Plant Biotechnol J. 2018;16(3):
749–59.
19. Song Q, Yan L, Quigley C, Jordan BD, Fickus E, Schroeder S, et al. Genetic
characterization of the soybean nested association mapping population.
Plant Genome. 2017;10(2):114.
20. Diers BW, Specht J, Rainey KM, Cregan P, Song Q, Ramasubramanian V, et al.
Genetic architecture of soybean yield and agronomic traits. G3: Genes
Genomes Genetics. 2018;8(10):3367–75.
21. Lee YG, Jeong N, Kim JH, Lee K, Kim KH, Pirani A, et al. Development,
validation and genetic analysis of a large soybean SNP genotyping array.
Plant J. 2015;81(4):625–36.
22. Jeong SC, Moon JK, Park SK, Kim MS, Lee K, Lee SR, et al. Genetic diversity
patterns and domestication origin of soybean. Theor Appl Genet. 2019;
132(4):1179–93.
23. Wei W, Mesquita AC, Figueiró AD, Wu X, Manjunatha S, Wickland DP, et al.
Genome-wide association mapping of resistance to a Brazilian isolate of
Sclerotinia sclerotiorum in soybean genotypes mostly from Brazil. BMC
Genomics. 2017;18(1):849.
24. von Wettberg EJ, Chang PL, Başdemir F, Carrasquila-Garcia N, Korbu
LB, Moenga SM, et al. Ecology and genomics of an important crop
wild relative as a prelude to agricultural innovation. Nat Commun.
2018;9(1):13.
25. Moghaddam SM, Mamidi S, Osorno JM, Lee R, Brick M, Kelly J, et al.
Genome-wide association study identifies candidate loci underlying
agronomic traits in a middle American diversity panel of common bean.
Plant Genome. 2016;9(3):121.
26. Otyama PI, Wilkey A, Kulkarni R, Assefa T, Chu Y, Clevenger J, et al.
Evaluation of linkage disequilibrium, population structure, and genetic
diversity in the US peanut mini core collection. BMC Genomics. 2019;
20(1):481.
27. Haun WJ, Hyten DL, Xu WW, Gerhardt DJ, Albert TJ, Richmond T, et al.
The composition and origins of genomic variation among individuals
of the soybean reference cultivar Williams 82. Plant Physiol. 2011;
155(2):645–55.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in
published maps and institutional affiliations.
Wilkey et al. BMC Genomics (2020) 21:822 Page 9 of 9