
Genomic analyses of rice bean landraces reveal adaptation and yield related loci to accelerate breeding


Rice bean (Vigna umbellata) is an underexploited domesticated legume crop consumed for dietary protein in Asia, yet little is known about the genetic diversity of this species. Here, we present a high-quality reference genome for a rice bean landrace (FF25) built using PacBio long-read data and a Hi-C chromatin interaction map, and assess the phylogenetic position and speciation time of rice bean within the Vigna genus. We sequence 440 landraces (two core collections), and GWAS based on data for growth sites at three widely divergent latitudes reveal loci associated with flowering and yield. Loci harboring orthologs of FUL (FRUITFULL), FT (FLOWERING LOCUS T), and PRR3 (PSEUDO-RESPONSE REGULATOR 3) contribute to the adaptation of rice bean from its low latitude center of origin towards higher latitudes, and the landraces which pyramid early-flowering alleles for these loci display maximally short flowering times. We also demonstrate that copy-number-variation for VumCYP78A6 can regulate seed-yield traits. Intriguingly, 32 landraces collected from a mountainous region in South-Central China harbor a recently acquired InDel in TFL1 (TERMINAL FLOWER1) affecting stem determinacy; these materials also have exceptionally high values for multiple human-desired traits and could therefore substantially advance breeding efforts to improve rice bean.
The genetic architecture underlying flowering time control from low to high latitudes. a–c Manhattan plots of GWAS for flowering-time data measured in Sanya (18°N) in 2021 (a), Nanning (22°N) in 2020 (b), and Beijing (40°N) in 2021 (c). The red horizontal dashed line indicates the Bonferroni-corrected significance threshold of the GWAS (α = 1). Pie charts represent allelic frequencies of the major associated loci. The bar plots display the flowering time of landraces carrying each allele of the identified major loci. DAE, days after emergence. The number (n) of landraces carrying each allele is shown below. d The geographical distributions of landraces carrying the early-flowering alleles of FT and PRR3, and the late-flowering allele of FUL, respectively. The map was created using the map_data() function in the R package ggplot2. The violin plot shows significant differences in latitude between landraces carrying the early-flowering alleles of FT (n = 40) and PRR3 (n = 26). In the box plots, central line: median values; bounds of the box: 25th and 75th percentiles; whiskers: 1.5*IQR (IQR: the interquartile range between the 25th and 75th percentiles). e The bar plots show the flowering-time-shortening effects of the early-flowering alleles of FT and PRR3 at each of the three measurement sites. NA indicates landraces carrying neither of these two early-flowering alleles. The number (n) of landraces carrying each allele at each of the three measurement sites is shown below. f The dot plots show the flowering-time-shortening effects of early-flowering allelic combinations at each of the three measurement sites. Blue dots represent the landraces categorized according to all of the different allelic combinations found in the 440 sequenced landraces. Red lines indicate the average value of each category. NA indicates landraces carrying no early-flowering alleles. The number of landraces in each category is shown below.
Significance was tested with two-sided Wilcoxon tests in (a–e). The data in a–c and e are shown as mean ± SE, and the error bars represent SE. Source data are provided as a Source Data file.
The molecular basis and selection history of stem determinacy in cultivated rice bean. a InDel-based GWAS result from the analysis of data for stem determinacy measured at the Nanning site in 2020. The peak InDel is indicated by the red circle. The red horizontal dashed line indicates the Bonferroni-corrected significance threshold of the GWAS (α = 1). b A 2-bp causative deletion (the peak InDel) introduces a premature termination codon in the first exon of the TFL1 gene. Ref reference, Del deletion. c The frequency distributions of three types of stem growth habit (indeterminate (Indet), semi-determinate (Semi-det), and determinate (Det)) among three groups comprising landraces carrying homozygous reference alleles (Ref), heterozygous alleles (Ref/Del), or homozygous mutant (2-bp deletion) alleles (Del), respectively. Two-sided Fisher’s exact tests were used to assess the significance of the differences in the proportion of the determinate type of stem growth habit between landraces carrying Ref and Del alleles and between landraces carrying Ref/Del and Del alleles. d The geographical distributions of the 32 landraces carrying Del alleles from Southern-Central China. The map was created using the map_data() function in the R package ggplot2. HB Hubei province, HN Hunan province, CQ Chongqing province, GZ Guizhou province, GX Guangxi province. e The 32 landraces carrying Del alleles showed a significant improvement over the landraces carrying Ref alleles in the SC group for multiple human-desired traits, including flowering time (DAE, days after emergence), branch number, seed length, and HGW (hundred seed weight). Significance was tested with two-sided Wilcoxon tests. In the box plots, central line: median values; bounds of the box: 25th and 75th percentiles; whiskers: 1.5*IQR (IQR: the interquartile range between the 25th and 75th percentiles).
f Divergence time between the 32 landraces carrying Del alleles and the landraces carrying Ref alleles in the SC group, inferred using the SMC++ program⁶⁶ under a mutation rate of μ = 1.5 × 10⁻⁸ per site per generation¹⁴⁰ and a generation time of one year. Source data are provided as a Source Data file.
Article https://doi.org/10.1038/s41467-022-33515-2
Jiantao Guan1,2,3,10, Jintao Zhang1,4,10, Dan Gong1,4,10, Zhengquan Zhang2,
Yang Yu2, Gaoling Luo5, Prakit Somta6, Zheng Hu1, Suhua Wang1,
Xingxing Yuan7, Yaowen Zhang8, Yanlan Wang9, Yanhua Chen5, Kularb Laosatit6,
Xin Chen7, Honglin Chen1, Aihua Sha4, Xuzhen Cheng1, Hua Xie2 &
Lixia Wang1
The genus Vigna is a pan-tropical genus in the family Fabaceae, comprising
more than 100 wild species and 10 domesticated species such as cowpea
(Vigna unguiculata), mung bean (V. radiata), and rice bean (V. umbellata)1.
As one of the representative species in the genus Vigna, the rice bean is a
multipurpose legume and is widely cultivated in South, Southeast, and East
Asia2. The seeds of rice beans have been consumed for thousands of years as a
good source of dietary protein and micronutrients, and they are used as a
diuretic in traditional medicine practices3,4. Rice bean has also been widely
used as a donor parent for interspecific hybridization with other species in
the genus Vigna5–7 due to its notable agronomic characteristics, including
high grain yield and large biomass potential2,8, as well as strong resistance
to pests9–14, diseases15, drought2,16,17, and water logging18, and its
capacity to grow in poor-fertility soils19. Thus, with a continually growing
population and exacerbated climate change, rice bean has received increased
attention in recent years and has been proposed as one of the potential
future smart foods to help fight hunger and malnutrition in Asia2,18,20.
However, the lack of a high-quality reference genome for rice bean has
hindered the exploration of the genetic basis of these excellent agronomic
characteristics and its further genetic improvement.
Received: 4 April 2022
Accepted: 21 September 2022
A full list of affiliations appears at the end of the paper. e-mail: xiehua@baafs.net.cn; wanglixia03@caas.cn
Nature Communications | (2022) 13:5707
Content courtesy of Springer Nature, terms of use apply. Rights reserved
Current thinking holds that the rice bean originated and was domesticated in
tropical regions of South & Southeast Asia, after which it spread to higher
latitude regions including China, Japan, and Korea2,21,22. There are many
rice bean landraces that have, through long-term human and natural selection,
become locally adapted to diverse environments. However, as rice bean is a
short-day plant, its yield potential and agricultural utility can be strongly
affected by photoperiod and temperature conditions2,23,24. Moreover, few
cultivated rice bean varieties have a determinate stem growth habit, which
influences the potential grain yield and is also required to support
mechanical harvest6,25. Landraces have been demonstrated to be useful
resources for the improvement of diverse crop species26, and there are
presently two rice bean core collections, one comprising mainly landraces
from South & Southeast Asia and the other with a preponderance of Chinese
rice bean landraces22,27,28. Thus, there are rich germplasm panels available
representing the high diversity and broad adaptation of rice bean to both
tropical and temperate environments.
Previous studies have reported several QTLs for adaptation and yield
component-related traits using linkage mapping based on biparental
populations in rice bean11,29. However, the resolution and sensitivity have
been limited by the small numbers of markers and genetic recombination
events, thus making it difficult to reveal the genetic mechanisms of these
traits and/or to develop breeding markers30,31. Genome-wide association
studies (GWAS) have been successfully applied in crops for the efficient
identification of favorable alleles/haplotypes or causal variants/genes
underlying complex traits, as this strategy can simultaneously detect many
natural allelic variations using a diverse germplasm panel32,33.
Here, we present a high-quality reference genome assembly of rice bean based
on the integration of Illumina short-read, PacBio long-read, and Hi-C
sequencing data. We also construct a genome variation map based on sequencing
of 440 diverse rice bean landraces covering two core collections. Subsequent
population genomic analyses support the previously proposed origin of rice
bean in South & Southeast Asia and reveal genetic bottlenecks that occurred
along the northward dispersal of rice bean. GWAS based on phenotypic data for
a germplasm diversity panel grown at three sites with widely divergent
latitudes helps decipher the genetic basis of traits including flowering
time, seed yield, and stem determinacy. Our study also identifies candidate
genes and landraces with strong potential as elite germplasm lines that could
be used to generate excellent rice bean varieties that simultaneously display
geographically suitable flowering times, stem determinacy to support
mechanized cultivation, and high yields.
Results
Sequencing and assembly of a reference genome for rice bean
The rice bean landrace FF25, which has red seeds, an erect habit, and wide
environmental adaptability, was selected for genome sequencing and de novo
assembly of a rice bean reference genome (Fig. 1a). We integrated three
sequencing technologies: PacBio single molecule real-time (SMRT) long-read
sequencing, Illumina short-read sequencing, and chromosome conformation
capture sequencing (Hi-C) (Supplementary Table 1). The estimated size of the
FF25 genome was ~525.60 Mb based on the 17-kmer depth distribution of
Illumina short reads (~106.65×; Supplementary Fig. 1). The PacBio reads
(~300.54×) were used to assemble contigs using Canu v1.9 (ref. 34) and the
highly efficient repeat assembly (HERA) algorithm35, which resulted in a
475.64 Mb genome (90.49% of the estimated size) containing 351 contigs, with
an N50 of 18.26 Mb (Table 1), thus representing the highest-quality genome
among species of the Vigna genus36–40. To assign the contigs to chromosomes,
66 contigs (~465.19 Mb, 97.80% of the original assembly) were anchored to
eleven pseudo-chromosomes based on a Hi-C interaction map (Table 1;
Supplementary Fig. 2).
Fig. 1 | FF25 genome assembly. a FF25 plant (top); FF25 pod (bottom left);
FF25 seeds (bottom right). b Genomic features of the FF25 reference genome.
The outer gray track represents the chromosomes (Chr1–Chr11) of the FF25
genome assembly (with units in Mb). Tracks I–VIII show: I, gene density;
II, pseudogene density; III, repeat density; IV, TE density; V, Gypsy
density; VI, Copia density; VII, GC content (%); VIII, gene synteny. The
densities of features were calculated based on a 100-kb window size, with a
step size of 10 kb. The inner green and orange links represent the intra- and
inter-chromosomal collinear genes, respectively. Photograph credit: LXW (a).
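As a rough illustration of how the contig N50 and the percentages quoted for the assembly are defined, the following Python sketch computes an N50 from a list of contig lengths and recomputes the two percentages from the sizes stated in the text. The contig lengths are hypothetical toy values (not the FF25 contigs), and the function and variable names are ours, not part of any assembler.

```python
def n50(lengths):
    """Return the N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("empty contig list")
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length


# Hypothetical toy contig lengths, for illustration only.
toy_contigs = [50, 40, 30, 20, 10]
toy_n50 = n50(toy_contigs)

# Sizes in Mb as stated in the text.
estimated_size_mb = 525.60   # k-mer based genome size estimate
assembled_mb = 475.64        # total contig assembly
anchored_mb = 465.19         # contigs placed on pseudo-chromosomes

pct_of_estimate = 100 * assembled_mb / estimated_size_mb
pct_anchored = 100 * anchored_mb / assembled_mb

print(toy_n50)
print(round(pct_of_estimate, 2), round(pct_anchored, 2))
```

The second print statement recovers, to two decimals, the 90.49% and 97.80% figures given above.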
We used multiple methods to evaluate the quality of the assembled genome. The
mapping and coverage rates of the Illumina short-read data were 99.67% and
99.33%, respectively. We further performed a benchmarking universal
single-copy orthologs (BUSCO) analysis41 based on the eudicotyledons_odb10
dataset, and the result showed that 97.3% of the BUSCO sequences were
completely present in the genome assembly, while 0.5% and 2.2% were partially
present or missing, respectively (Table 1; Supplementary Table 2). The genome
assembly had a high LTR Assembly Index (LAI) score (20.30) (Supplementary
Table 2; Supplementary Fig. 3), reaching the level of a gold-standard genome
assembly according to previously proposed criteria42. All of these lines of
evidence indicate that our de novo FF25 genome assembly is of high quality.
We used an integrated strategy including evidence-based methods and ab initio
gene prediction to annotate the protein-coding gene content of the FF25
genome assembly. A final set of 26,736 protein-coding genes was predicted, of
which 26,430 genes (~98.86%) could be assigned to the eleven pseudomolecules
(Supplementary Table 3). For these genes, the average lengths of coding
sequences, exons, and introns were 1232, 238, and 570 base pairs (bp),
respectively (Table 1). The average gene density was one gene per 17.79 kb,
and the genes were unevenly distributed, being more abundant towards the
chromosomal ends (Fig. 1b). We also specifically annotated 2202 transcription
factor genes, 9635 pseudogenes, and 3318 noncoding RNA genes comprising 764
transfer RNA genes, 558 ribosomal RNA genes, 714 small nucleolar RNA genes,
and 1282 microRNA genes (Fig. 1b; Supplementary Table 4).
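The gene-density figure and the noncoding RNA total in this paragraph can be checked with simple arithmetic, and the uneven distribution along chromosomes is the kind of pattern a sliding-window count reveals. The sketch below is a minimal Python illustration: the 100-kb window and 10-kb step follow the Fig. 1 legend, while the gene coordinates are hypothetical and the function names are ours. It is not the authors' pipeline.

```python
def mean_gene_spacing_kb(assembly_mb, n_genes):
    """Average assembly length per gene, in kb."""
    return assembly_mb * 1_000_000 / n_genes / 1_000


def window_gene_counts(gene_starts, chrom_length, window=100_000, step=10_000):
    """Count gene start positions in sliding windows along one chromosome.

    Returns a list of (window_start, window_end, count) tuples, with
    half-open windows [start, end).
    """
    counts = []
    last_start = max(chrom_length - window, 0)
    for start in range(0, last_start + 1, step):
        end = start + window
        n = sum(1 for pos in gene_starts if start <= pos < end)
        counts.append((start, end, n))
    return counts


# Values stated in the text.
spacing_kb = mean_gene_spacing_kb(475.64, 26_736)
ncrna_total = 764 + 558 + 714 + 1282  # tRNA + rRNA + snoRNA + miRNA

# Hypothetical gene start coordinates on a 120-kb toy chromosome.
toy_counts = window_gene_counts([5_000, 15_000, 105_000], 120_000)

print(round(spacing_kb, 2), ncrna_total)
print(toy_counts)
```

Dividing the 475.64 Mb assembly by 26,736 genes reproduces the stated spacing of 17.79 kb per gene, and the four noncoding RNA classes sum to the stated 3318.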
For these predicted protein-coding genes, we found that 96.90% of the BUSCO
sequences were completely present (Table 1; Supplementary Table 2). Moreover,
the tissue-specific RNA-Seq data confirmed that 85.86% of the predicted
protein-coding genes were expressed (FPKM > 1) in at least one of the 6
examined tissues (Supplementary Table 5). In addition, 97.48% of the
protein-coding genes were assigned a functional annotation based on five
public databases (Supplementary Table 6). These evaluations collectively
support the high accuracy and completeness of our rice bean genome assembly
and annotation.
Phylogenetic position and comparative genomics analyses
To explore the genome evolution of rice bean, genes from five Vigna species
(Vigna stipulacea, V. radiata, V. angularis, V. umbellata, and
V. unguiculata), four other legumes (Phaseolus vulgaris, Glycine max, Lotus
japonicus, and Arachis duranensis), five other eudicots (Arabidopsis
thaliana, Citrus sinensis, Populus trichocarpa, Vitis vinifera, and Solanum
lycopersicum), as well as one monocot (Oryza sativa) were clustered into
20,736 gene families. Of these, 334 single-copy gene families were used to
construct a maximum-likelihood phylogenetic tree (Fig. 2a). This indicated
that rice bean is a sister species to adzuki bean (V. angularis); they
apparently diverged about 1.75 million years ago (MYA), a finding in accord
with a previous study based on transcriptome data37.
This view was also supported by a gene synteny analysis between rice bean and
its closely related species in the Vigna genus based on protein sequences
using the MCScanX program43, which revealed that (as expected) rice bean had
higher conservation with V. angularis in terms of gene structure and order
than with other Vigna species (Supplementary Fig. 4; Supplementary Table 7).
Based on the tree, we found that 230 rice bean gene families (comprising 1396
genes) exhibited significant expansions (P < 0.01) relative to the MRCA (most
recent common ancestor) of rice bean and adzuki bean (Supplementary Data 1).
KEGG pathway analysis indicated that these expanded genes were significantly
enriched for metabolic pathways such as phenylpropanoid, sesquiterpenoid, and
triterpenoid biosynthesis (P < 0.05, Fisher's exact test; Supplementary
Fig. 5).
Whole-genome duplication (WGD) provides additional genetic material that can
subsequently be subjected to divergence, subfunctionalization, and
neofunctionalization44,45. To investigate WGD events in rice bean, we
identified 332 syntenic blocks within its genome (including 8052 homologous
genes accounting for ~30.12% of all genes) (Fig. 1b) and estimated the number
of nucleotide substitutions per synonymous site (Ks) for homologs. The Ks
distribution of collinear gene pairs indicated no recent WGD in rice bean; we
also observed the expected signals for the ECH (eudicot-common hexaploidy;
Ks = 1.72) event and the LCT (legume-common tetraploid; Ks = 0.64) event
(Fig. 2b). We estimated the relative time of evolutionary divergence between
rice bean and closely related Vigna species using the Ks distributions of
orthologs, calibrated on the known evolutionary time (~13 MYA) of the SST
(soybean-specific tetraploid) event in soybean37,46. Similar to the very
recent speciation time estimated from the maximum-likelihood phylogenetic
tree (Fig. 2a), the Ks distribution of rice bean and adzuki bean showed the
smallest peak value, at 0.019 (Fig. 2b), corresponding to a divergence time
of 1.72 MYA.
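The conversion from a Ks peak to a divergence time can be sketched under the usual molecular-clock assumption that two lineages accumulate synonymous substitutions at a constant rate r, so that T = Ks / (2r), with r calibrated from an event of known age. The Python sketch below illustrates this relation only. The text gives the calibration age (~13 MYA for the SST event) but not the Ks peak of that event, so the calibration peak used here is back-calculated from the reported numbers and is an assumption, not a value reported by the authors.

```python
def divergence_time_mya(ks_peak, ks_calibration, t_calibration_mya):
    """Convert a Ks peak into a divergence time (MYA).

    Assumes a constant synonymous substitution rate calibrated from an
    event with a known age: rate = Ks_cal / (2 * T_cal), T = Ks / (2 * rate).
    """
    if ks_calibration <= 0 or t_calibration_mya <= 0:
        raise ValueError("calibration values must be positive")
    rate = ks_calibration / (2 * t_calibration_mya)  # substitutions/site/Myr
    return ks_peak / (2 * rate)


# Reported in the text: rice bean vs. adzuki bean peak and the SST age.
ks_rice_bean_adzuki = 0.019
sst_age_mya = 13

# ASSUMPTION: SST Ks peak implied by the reported 1.72 MYA estimate
# (back-calculated here for illustration; not stated in the text).
implied_sst_ks = ks_rice_bean_adzuki * sst_age_mya / 1.72

estimate = divergence_time_mya(ks_rice_bean_adzuki, implied_sst_ks, sst_age_mya)
print(round(implied_sst_ks, 3), round(estimate, 2))
```

Because the relation is linear, halving the Ks peak halves the estimated divergence time for a fixed calibration.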
Beyond comparisons of orthologous genes, we annotated the repetitive content
in the rice bean genome using an integrated pipeline, including de novo
repeat identification and homology search methods (see the "Methods"
section). We found that 38.40% of the rice bean genome comprises transposable
elements (TEs; Supplementary Data 2). Among the distinct classes of TEs, LTR
elements including Gypsy and Copia elements were the predominant classes, and
compared to Copia elements (10.41%), Gypsy elements (19.85%) occupied a
larger proportion of the genomic sequence in rice bean, which is consistent
with earlier reports on other Vigna species37,40. In addition, we identified
full-length LTR elements and performed an insertion time analysis for rice
bean as well as four additional Vigna species with sequenced genome
assemblies. Except for Vigna radiata, more than half of the LTR elements in
the other four examined Vigna species proliferated at 0–0.5 MYA, suggesting
that the amplification of LTR elements has largely occurred after speciation
(Fig. 2c).
Table 1 | Summary statistics for the rice bean genome assembly
Genomic feature                                    Value
Total assembly size (Mb/%)                         475.64/90.49%
Number of contigs                                  351
Largest contig (Mb)                                32.05
Contig N50 (Mb)                                    18.26
Sequences anchored to chromosomes (Mb/%)           465.19/97.80%
Genomic GC content (%)                             34.21
Genome complete BUSCOs (a) (%)                     97.3
Protein complete BUSCOs (a) (%)                    96.9
LTR assembly index, LAI                            20.30
Repetitive sequences (%)                           57.19
Protein-coding genes                               26,736
Mean gene length (bp)                              3,602
Mean coding sequence/exon/intron length (bp)       1232/238/570
(a) Analysis based on comparisons with the eudicotyledons_odb10 database.
Population structure and genetic divergence of rice bean
landraces
We performed whole-genome re-sequencing for a total of 440 rice bean
landraces from various geographic regions, including the landraces in the
Asia core collection (73) and Chinese core collection (230), using Illumina
sequencing technology (Fig. 3a), ultimately generating 5.32 Tb of
high-quality sequencing data, with an average depth of ~24.91× and an average
mapping rate of 99.12% based on the newly assembled reference genome
(Supplementary Data 3). A final set of 10,525,548 high-quality
single-nucleotide polymorphisms (SNPs) and 2,743,289 small insertions and
deletions (InDels) were identified. We found 5690 SNPs (0.054%) that caused
start codon changes, premature stop codons, or elongated transcripts, while
15,530 InDels (0.57%) led to frameshift mutations (Supplementary Table 8),
proportions similar to those in other species such as soybean47, cucumber48,
and watermelon49.
To infer the population structure, we constructed an SNP-based
neighbor-joining (NJ) phylogenetic tree and divided the 440 landraces into
three geographical groups: landraces from South & Southeast Asia (SSA), South
China (SC; coastline of South China to the Yangtze River), and North China
(NC; Yangtze River to North China) (Fig. 3a, b; Supplementary Data 3). This
classification was supported by a principal component analysis (Fig. 3c) as
well as a model-based clustering analysis (K = 4) conducted using STRUCTURE50
(Fig. 3b). Notably, the landraces collected from other geographical regions
(Japan, Korea, Europe, and America) were spread amongst the SC and NC groups,
indicating their close genetic relationship with Chinese landraces or their
probable introduction from China2. We excluded these landraces from the SC
and NC groups in our further analyses.
To investigate genetic diversity and divergence among the three geographical
groups, we calculated the nucleotide diversity (π) for each group and
conducted a pairwise analysis of genetic distances (fixation index values,
F_ST). The SSA group showed the highest nucleotide diversity (1.08 × 10⁻³),
consistent with previous results using SSR markers22 and further supporting
the hypothesis that rice bean originated in South & Southeast Asia2,22.
Compared with the SSA group, nucleotide diversity decreased gradually in the
SC group (0.78 × 10⁻³) and then the NC group (0.43 × 10⁻³), indicating that
sequential bottlenecks (π_SSA/π_SC = 1.38; π_SC/π_NC = 1.81) occurred during
the northward dispersal of rice bean from its center of origin (Fig. 3d).
When compared with the SSA group, the F_ST value of the SC group was 0.16,
whereas it was higher (0.29) for the NC group, indicating increased
population differentiation during the northward dispersal (Fig. 3d).
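The bottleneck ratios quoted above follow directly from the per-group nucleotide diversity values. The Python sketch below recomputes them and also shows one common way to compute π for biallelic SNPs (average pairwise differences per site). The estimator shown is a generic textbook form, assumed here for illustration and not necessarily the software or settings the authors used; the function name and toy inputs are ours.

```python
def nucleotide_diversity(alt_counts, n_chromosomes, n_sites):
    """Average number of pairwise differences per site for biallelic SNPs.

    alt_counts: alternate-allele count at each variable site.
    n_chromosomes: sampled chromosomes (2 x number of diploid individuals).
    n_sites: total sites surveyed, including invariant sites.
    """
    n = n_chromosomes
    if n < 2 or n_sites <= 0:
        raise ValueError("need at least two chromosomes and one site")
    n_pairs = n * (n - 1) / 2
    # At each site, k alt copies and n - k ref copies give k * (n - k)
    # differing pairs.
    differing_pairs = sum(k * (n - k) for k in alt_counts)
    return differing_pairs / n_pairs / n_sites


# Per-group values reported in the text.
pi = {"SSA": 1.08e-3, "SC": 0.78e-3, "NC": 0.43e-3}
ratio_ssa_sc = pi["SSA"] / pi["SC"]
ratio_sc_nc = pi["SC"] / pi["NC"]
print(round(ratio_ssa_sc, 2), round(ratio_sc_nc, 2))

# Hypothetical toy data: two SNPs among 10 sites, 4 sampled chromosomes.
print(nucleotide_diversity([2, 2], n_chromosomes=4, n_sites=10))
```

Rounded to two decimals, the two ratios match the 1.38 and 1.81 given in the text.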
Fig. 2 | Phylogenetic position and comparative genomics analyses. a Genome
evolution and gene family characteristics of Vigna umbellata (rice bean) and
13 other dicot species, using the monocot plant Oryza sativa (rice) as an
out-group. This tree was generated using 334 single-copy ortholog families.
Black numerical values beside each node show the estimated divergence time of
each node (MYA, million years ago) in the phylogenetic tree shown on the
left. Blue and orange backgrounds represent Leguminosae and non-Leguminosae
species, respectively. The number of gene families, genes in families, and
the total number of genes are shown on the right for each species. b Density
distribution of synonymous nucleotide substitution levels (Ks) of syntenic
orthologous (solid curves) and paralogous genes (dashed curves). Vum: Vigna
umbellata; Gma: Glycine max; Van: V. angularis; Pvu: Phaseolus vulgaris;
Vst: V. stipulacea; Vun: V. unguiculata; Vra: V. radiata. c Insertion bursts
of full-length LTR elements in the genomes of V. umbellata and four other
Vigna species. Source data are provided as a Source Data file.
We further examined linkage disequilibrium (LD) using the measure r²
(ref. 51) between pairwise SNP loci in the SSA, SC, and NC groups. For the
SSA group, the decay of LD with physical distance (i.e., a drop to half of
its maximum value) between SNPs occurred at only ~7.7 kb (r² = 0.39), whereas
it increased to ~34.8 kb (r² = 0.42) in the SC group and to ~73.0 kb
(r² = 0.44) in the NC group (Fig. 3e); these trends are in accord with the
observed gradual reduction in genetic diversity in the SC and NC groups. The
LD decay distance of rice bean landraces was similar to that of outcrossing
species such as maize (30 kb)52 but shorter than those of inbreeding crops
like soybean (83 kb)53, rice (123 kb and 167 kb in indica and japonica,
respectively)54, and foxtail millet (~100 kb)55. This finding is consistent
with a previous report that rice bean has a fairly high outcrossing rate22.
Notably, the relatively rapid LD decay in the rice bean landraces may be
useful for enhancing the resolution of association studies to map a narrow
candidate QTL interval56.
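The LD decay distance used here is defined as the physical distance at which r² drops to half of its maximum value. A minimal Python sketch of that definition is shown below, assuming mean r² has already been summarized in distance bins sorted by increasing distance. The binned values are hypothetical, the linear interpolation between bins is our choice, and the function name is ours.

```python
def half_decay_distance(distances_kb, mean_r2):
    """Distance at which mean r2 first falls to half of its maximum.

    distances_kb and mean_r2 are parallel lists sorted by increasing
    distance. The crossing point is located by linear interpolation
    between neighboring bins. Returns None if r2 never drops that far.
    """
    if len(distances_kb) != len(mean_r2) or not mean_r2:
        raise ValueError("inputs must be non-empty and of equal length")
    half = max(mean_r2) / 2
    for i in range(1, len(mean_r2)):
        d0, d1 = distances_kb[i - 1], distances_kb[i]
        r0, r1 = mean_r2[i - 1], mean_r2[i]
        if r0 > half >= r1:
            return d0 + (r0 - half) * (d1 - d0) / (r0 - r1)
    return None


# Hypothetical binned LD curve, for illustration only.
toy_distances = [0, 10, 20, 40]
toy_r2 = [0.8, 0.5, 0.3, 0.2]
print(half_decay_distance(toy_distances, toy_r2))
```

In this toy curve the maximum r² is 0.8, so the half-maximum of 0.4 is crossed between the 10-kb and 20-kb bins, at about 15 kb. The same logic applied to the reported maxima (for example 0.78 in the SSA group) gives the half-maximum r² values quoted in the text (0.39).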
We searched for putatively selected regions as outliers (top 5%) of F_ST over
20-kb windows for the three comparisons (SSA vs. SC, SSA vs. NC, and SC vs.
NC). We detected 473, 512, and 444 outlier regions for these three
comparisons, respectively occupying 5.67% (26.95 Mb), 5.92% (28.15 Mb), and
5.59% (26.57 Mb) of the genome and including 1894, 1950, and 1296
protein-coding genes (Supplementary Data 4). A MapMan analysis of all the
selected genes indicated that these genes were significantly enriched for
annotations related to biological processes such as "phytohormone action",
"nutrient uptake", and "circadian clock system" (Supplementary Fig. 6).
Notably, among the genes related to "circadian clock system", we found four
orthologs of reported flowering time genes in A. thaliana using the FLOweRing
Interactive Database (FLOR-ID; ref. 57), including TOC1 (ref. 58), PRR3
(ref. 59), and two LHY1 (ref. 60) genes between the SSA and SC groups, of
which one LHY1 gene apparently also underwent selection between the SSA and
NC groups (Supplementary Fig. 7). These flowering time genes could plausibly
have contributed to the adaptation of rice bean landraces to different
latitudes.
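The outlier scan described in this paragraph can be sketched as two steps: rank per-window F_ST values and keep the top 5%, then join adjacent outlier windows into regions. The Python sketch below illustrates this with hypothetical window values. Merging adjacent windows is our assumption about how windows become "regions" (the text does not spell this out), and the function names are ours.

```python
def top_fraction_windows(window_fst, fraction=0.05):
    """Return sorted indices of windows whose F_ST lies in the top fraction."""
    if not window_fst:
        return []
    n_top = max(1, int(len(window_fst) * fraction))
    ranked = sorted(range(len(window_fst)),
                    key=lambda i: window_fst[i], reverse=True)
    return sorted(ranked[:n_top])


def merge_adjacent(indices):
    """Merge runs of consecutive window indices into (first, last) regions.

    ASSUMPTION: adjacent outlier windows are treated as one region.
    """
    regions = []
    for i in indices:
        if regions and i == regions[-1][1] + 1:
            regions[-1][1] = i
        else:
            regions.append([i, i])
    return [tuple(region) for region in regions]


# Hypothetical per-window F_ST values for 40 consecutive 20-kb windows.
toy_fst = [0.05] * 40
toy_fst[10] = 0.60
toy_fst[11] = 0.55

outliers = top_fraction_windows(toy_fst, fraction=0.05)
regions = merge_adjacent(outliers)
window_kb = 20
print(outliers, regions)
print([(first * window_kb, (last + 1) * window_kb) for first, last in regions])
```

With 40 windows, the top 5% corresponds to two windows; here they are neighbors and collapse into a single 40-kb region. As a consistency check on the reported figures, 26.95 Mb out of the 475.64 Mb assembly is about 5.67%.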
The genetic architecture underlying control of flowering at different
latitudes
We observed flowering time variation across the 440 landraces as grown at
three sites with widely divergent latitudes: 22–106 days in Sanya (18°N) in
2020 and 2021, 25–122 days in Nanning (22°N) in 2020 and 2021, and 38–104
days in Beijing (40°N; where some landraces did not bloom before the first
frost in the autumn of 2020 and 2021). To explore the genetic basis of
flowering time in rice bean, we performed GWAS for the flowering phenotype
data measured in both years at the three sites, which revealed distinct
association signals for the different locations (Fig. 4a–c). The repeatedly
detected major signal from Sanya was an intergenic region (Chr11:
6,142,933–6,162,249) that was only ~5 kb away from a MADS-box gene that is
the closest rice bean homolog (Vum_11G00418) of Arabidopsis FRUITFULL (FUL)
(Fig. 4a; Supplementary Figs. 8, 9; Supplementary Table 9), a gene known to
control flowering time and reproductive transition61.
This GWAS signal explained 7.04–14.86% of the flowering time variation across
two years (Supplementary Data 5). All the significantly associated SNPs and
InDels in this GWAS signal were located in the upstream region of FUL
(>5 kb) (Supplementary Fig. 10), suggesting that these polymorphisms could
influence FUL expression to control flowering time. This was further
supported by the observation that the expression level of FUL (in newly
expanded leaves in a panel of 16 diverse rice bean landraces) was strongly
negatively correlated (R = −0.69, P = 2.95 × 10⁻³) with flowering time
(Supplementary Fig. 11).
For the Nanning site, the repeatedly detected major signal (Chr4:
35,931,10135,996,258) had a PVE (phenotypic variation explained)
value of 6.078.23% (Supplementary Fig. 8; Supplementary Data 5;
Supplementary Table 9). And the most likely candidate among the ve
protein-coding genes in this region is a FLOWERING LOCUS T (FT;
Vum_04G01668)ortholog(Fig.4b; Supplementary Data 6;Supple-
mentary Fig. 12); in many species, FT genes function as integrators of
diverse signals for controlling of owering time62. We found two sig-
nicantly associated SNPs around (<2kb) and within FT gene; one SNP
SSA
SC
(0.78 x 10-3)
(1.08 x 10-3)
(0.43 x 10-3)
NC
0.29
0.13
0.16
0 50 100 150 200 250 300
0.2 0.4 0.6 0.8 1.0
Distance (kb)
r2
SSA
SC
NC
0.44
73.0 (kb)
0.42
34.8 (kb)
0.39
17.7 (kb)
0.88
0.84
0.78
−0.10
−0.05
0.00
0.05
0.10
−0.05 0.00 0.05 0.10
PC1 (72.59%)
PC2 (21.88%)
SSA
SC
NC
Others
K = 2
K = 3
K = 4
a b
cde
SSA
SC
NC
Others
5
20
40
60
Sample size
Fig. 3 | Population structure and genetic divergence of rice bean landraces.
a The geographic distributions of 440 rice bean landraces. SSA South & Southeast
Asia, SC South China, NC North China. The size and color of each pie chart
represent the sample size in a specific geographic location. The map was created
using the map_data() function in the R package ggplot2. b Phylogenetic tree and
model-based clustering (K = 2–4) of 440 sequenced landraces. c Scores plot from a
principal component analysis, supporting the division of the landraces into three
geographical groups (SSA, SC, and NC). d Summary of nucleotide diversity (π) and
population divergence (FST) across the three geographical groups. Values in
parentheses represent measures of nucleotide diversity of each group, and values
between pairs indicate population divergence. e Decay of linkage disequilibrium
(LD), measured by r², in the three geographical groups. The upper and lower black
dots with numerical values in the lines represent maximum and median values of
r² and the corresponding physical distances. Source data are provided as a
Source Data file.
Article https://doi.org/10.1038/s41467-022-33515-2
Nature Communications | (2022) 13:5707 5
Content courtesy of Springer Nature, terms of use apply. Rights reserved
(Chr4: 35,950,445) was located upstream (<200 bp) of the transcription
start site and another was located in the first intron
(Chr4: 35,951,311) (Supplementary Fig. 13).
For the Beijing data, we repeatedly detected a peak SNP in the
PSEUDO-RESPONSE REGULATOR 3 (PRR3) gene (Vum_02G01965) at
Chr2: 38,647,190 (7.30–17.30% of PVE), encoding a nonsynonymous
variant (S to F) in the third CDS consisting of the functional PR
(pseudo-receiver) domain (Fig. 4c; Supplementary Data 5; Supplementary
Figs. 8 and 14; Supplementary Table 9). PRR3 is an ortholog of the
known soybean circadian clock gene GmTof12/GmPRR3b that has been
previously shown to function as a major flowering time regulatory
gene and has been linked to the expansion of soybean into higher
latitudes63,64. Notably, a similar effect from a single amino acid change
(S to L) on flowering time has also been reported for the GmPRR3b gene
in soybean63.
We next explored the potential flowering-time-related impacts of
the FUL, FT, and PRR3 orthologs in rice bean by classifying the
landraces according to their alleles at these three loci. There were two
alleles for FUL in the collection, and at the Sanya site, the set of 28
landraces carrying the minor allele (6.73%) displayed significantly
(P < 0.001) later flowering time (~33 days delayed, a 70.72% increase)
than the set of landraces carrying the major allele (Fig. 4a). Note that all
of the landraces carrying the late-flowering FUL allele were initially
collected from low latitude regions (South & Southeast Asia; Fig. 4d).
We also found that the landraces carrying the late-flowering FUL allele
exhibited a significantly higher number of branches than the
landraces carrying the early-flowering FUL allele (Supplementary
Fig. 15), suggesting that the late-flowering FUL allele probably
contributes to higher yield potential.
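The paired figures quoted above (an absolute delay and a relative increase) together imply the group means. The short sketch below back-calculates them, assuming the percentage is expressed relative to the mean of the major-allele group; the function name is illustrative, and because the day count is rounded in the text the results are approximate.

```python
def implied_reference_mean(shift_days, percent_change):
    """Reference-group mean implied by an absolute shift and its relative size."""
    if percent_change <= 0:
        raise ValueError("percent_change must be positive")
    return shift_days / (percent_change / 100.0)

# Figures quoted in the text for FUL at the Sanya site.
ful_major_mean = implied_reference_mean(33, 70.72)  # major-allele mean, about 46.7 DAE
ful_minor_mean = ful_major_mean + 33                # late-allele mean, about 79.7 DAE
```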
In contrast, landraces carrying the minor alleles for FT (12.31%) and
for PRR3 (7.24%) displayed earlier flowering times: the average
flowering time was ~19 days earlier (a 30.29% reduction) for the
landraces carrying the early-flowering FT allele and 18 days earlier
(a 23.36% reduction) for the landraces carrying the PRR3 minor allele
(Fig. 4b, c). There were notable geographical differences among the
landraces carrying the early-flowering alleles of the FT and PRR3 genes:
for FT there was a clear
[Figure 4, panels a–f: (a–c) Manhattan plots (−log10(P) by chromosome, Chr 1–11) for flowering time at the Sanya site (18°N; FUL), the Nanning site (22°N; FT), and the Beijing site (40°N; PRR3), each with a bar plot of flowering time (DAE) for the early- and late-flowering alleles; (d) map of landraces carrying the FT early-flowering, PRR3 early-flowering, and FUL late-flowering alleles, with the Yangtze and Yellow rivers marked, and a violin plot of latitude (°); (e) flowering time (DAE) and shortening effect for the FT, PRR3, and NA classes at each site; (f) flowering time (DAE) and shortening effect for the allelic combinations PRR3+FT+FUL, FT+FUL, PRR3+FUL, FUL, and NA at each site.]
Fig. 4 | The genetic architecture underlying flowering time control from low to
high latitudes. a–c Manhattan plots of GWAS for flowering-time data measured in
Sanya (18°N) in 2021 (a), Nanning (22°N) in 2020 (b), and Beijing (40°N) in 2021 (c).
Red horizontal dashed lines indicate the Bonferroni-corrected significance
thresholds of GWAS (α = 1). Pie charts represent allelic frequencies of the major
associated loci. The bar plots display the flowering time of landraces carrying each
allele of the identified major loci. DAE, days after emergence. The number (n) of
landraces carrying each allele is shown below. d The geographical distributions of
landraces carrying the early-flowering alleles of FT and PRR3, and the late-flowering
allele of FUL, respectively. The map was created using the map_data() function in
the R package ggplot2. The violin plot shows significant differences in latitudes
between landraces carrying the early-flowering allele of FT (n = 40) and PRR3
(n = 26), respectively. In the box plots, central line: median values; bounds of the
box: 25th and 75th percentiles; whiskers: 1.5*IQR (IQR: the interquartile range
between the 25th and 75th percentile). e The bar plots show the flowering
shortening effects of the early-flowering alleles of FT and PRR3 at each of the three
measurement sites. NA indicates landraces carrying neither of these two
early-flowering alleles. The number (n) of landraces carrying each allele at each of the
three measurement sites is shown below. f The dot plots show the flowering time
shortening effects of early-flowering allelic combinations at each of the three
measurement sites. Blue dots represent the landraces categorized according to all
of the different allelic combinations found in the 440 sequenced landraces. Red
lines indicate the average value of each category. NA indicates landraces carrying
no early-flowering alleles. The number of landraces for each category is shown
below. The significance was tested with two-sided Wilcoxon tests in (a–e). The data
in a–c and e are shown as mean ± SE, and the error bars represent SE. Source data
are provided as a Source Data file.
trend for collection from the region between the Yangtze and Yellow
rivers, whereas the landraces harboring the early-flowering PRR3 allele
tended to be from higher latitude regions north of the Yellow River
(including Northwest and Northeast China) (Fig. 4d). We also inferred
the model of inheritance for these alleles and found that the best
models for the FUL, FT, and PRR3 loci were additive, dominant, and
additive, respectively (Supplementary Table 10; see the Methods
section).
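To make the distinction between inheritance models concrete, the sketch below recodes allele counts under additive, dominant, and recessive models and picks the coding that explains the most phenotypic variance. This is only an illustration under stated assumptions: the paper's actual procedure is described in its Methods and Supplementary Table 10, and the R²-based comparison, function names, and toy data here are hypothetical.

```python
def encode(genotypes, model):
    """Recode counts (0, 1, 2) of the early-flowering allele under a genetic model."""
    if model == "additive":
        return [float(g) for g in genotypes]
    if model == "dominant":
        return [1.0 if g >= 1 else 0.0 for g in genotypes]
    if model == "recessive":
        return [1.0 if g == 2 else 0.0 for g in genotypes]
    raise ValueError(f"unknown model: {model}")

def r_squared(x, y):
    """Share of variance in y explained by a simple linear regression on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)

def best_model(genotypes, phenotype):
    """Return the best-fitting coding and the R^2 of every coding."""
    scores = {m: r_squared(encode(genotypes, m), phenotype)
              for m in ("additive", "dominant", "recessive")}
    return max(scores, key=scores.get), scores

# Toy data (hypothetical): one copy of the allele already gives the full effect.
toy_genotypes = [0, 0, 0, 1, 1, 1, 2, 2, 2]
toy_flowering = [61, 62, 60, 45, 44, 46, 45, 43, 44]
toy_best, toy_scores = best_model(toy_genotypes, toy_flowering)
```

In this toy case the heterozygotes flower as early as the homozygotes, so the dominant coding fits best; a phenotype that shifts by a constant amount per allele copy would favor the additive coding instead.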
Beyond suggesting that early-flowering alleles for both of these
loci have contributed to the adaptation of rice beans to higher
latitudes (relative to the tropical origin center), these results
indicate potential discrete impacts of the two loci that are sensitive to
conditions found in different latitudinal ranges. Offering support
for this idea, analysis of phenotype data from the geographically
distinct test sites revealed differential impacts from the two alleles of
interest at the FT and PRR3 loci. That is, at the northernmost site of
our study (Beijing), the extent of the flowering time shortening
effect was significantly larger among the set of landraces carrying
the relevant PRR3 allele as compared to the set of landraces carrying
the relevant FT allele (Fig. 4e). Importantly, this trend was reversed
at the other two (more southerly) sites: at both Nanning and Sanya,
the set of landraces with the early-flowering FT allele had the shorter
flowering times (Fig. 4e).
We also evaluated the pyramiding effects of the alleles for the FUL,
FT, and PRR3 loci by comparing the flowering time data at the Sanya,
Nanning, and Beijing sites among landraces carrying different
combinations of early-flowering alleles. As expected, landraces carrying a
relatively higher number of early-flowering alleles invariably exhibited
relatively earlier flowering times (Fig. 4f): a total of three landraces
carried all three early-flowering alleles, and these showed the
earliest flowering times, with the average maximum shortening
effects for this set of three landraces being 51.25%, 54.73%, and 25.37%
for the Sanya, Nanning, and Beijing sites, respectively (Fig. 4f). It
should be noted that this apparently weaker shortening effect at the
Beijing site was almost certainly underestimated, as most of the
landraces harboring no early-flowering alleles failed to bloom before
the autumn frost. Collectively, these results highlight an opportunity
to improve rice bean adaptability for growth in distinct latitudes
through breeding efforts to combine the early-flowering alleles for
three flowering time-controlling genes.
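The shortening effects discussed above can be illustrated with a small sketch. It assumes the effect of each allelic combination is the percent reduction in mean flowering time relative to the NA class (landraces with no early-flowering alleles), which matches the 0% shown for NA in Fig. 4f but is not stated explicitly in this section; the records below are hypothetical.

```python
from collections import defaultdict

def shortening_effects(records, baseline="NA"):
    """Percent reduction in mean flowering time relative to the baseline class.

    records: iterable of (allele_combination, flowering_time_in_DAE) pairs.
    """
    groups = defaultdict(list)
    for combo, dae in records:
        groups[combo].append(dae)
    if baseline not in groups:
        raise ValueError("no landraces in the baseline class")
    base_mean = sum(groups[baseline]) / len(groups[baseline])
    return {combo: 100.0 * (base_mean - sum(v) / len(v)) / base_mean
            for combo, v in groups.items()}

# Hypothetical flowering times (DAE), not the paper's data.
toy_records = [
    ("NA", 100), ("NA", 110),
    ("FT+FUL", 60), ("FT+FUL", 66),
    ("PRR3+FT+FUL", 50), ("PRR3+FT+FUL", 55),
]
toy_effects = shortening_effects(toy_records)
```

Under this definition, any late-flowering plants missing from the baseline class (for example, those that never bloomed before frost) lower the baseline mean and therefore shrink every estimated effect, which is the underestimation the text describes for the Beijing site.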
The molecular basis and selection history of stem determinacy
in cultivated rice bean
The stem determinacy trait is known to strongly influence lodging in
legumes65. We collected data for stem determinacy traits in 2020 and
2021 for the 440 landraces at the Nanning site. The large majority
(>85%) of the landraces exhibited an indeterminate stem growth
phenotype (Supplementary Data 3). Notably, this distribution
emphasizes that most rice bean landraces do not have the determinate
stem growth phenotype that is amenable to mechanized cultivation
systems. We performed GWAS analysis of stem determinacy based on
the whole-genome SNP data for the germplasm panel and detected a
total of 29 and 22 significant signals for stem determinacy in 2020 and
2021, respectively, including 7 signals detected repeatedly in both
years (Supplementary Fig. 16; Supplementary Data 5; Supplementary Table 9).
Among the repeatedly detected signals, the strongest signal was at
[Figure 5, panels a–f: (a) InDel-based Manhattan plot (−log10(P) by chromosome, Chr 1–11); (b) TFL1 gene model on Chr4 (47,173–47,174.5 kb) with the Ref and Del sequences; (c) percentage (%) of Indet, Semi-det, and Det growth habits for the Ref, Ref/Del, and Del genotypes; (d) map of the Del landraces (HB, CQ, GZ, GX, HN); (e) box plots of flowering time (DAE), branch number, seed length (mm), and hundred seed weight (g) for Ref vs. Del; (f) effective population size (Ne) against generations, with the split marked at ~219 generations.]
Fig. 5 | The molecular basis and selection history of stem determinacy in
cultivated rice bean. a InDel-based GWAS result from the analysis of data for stem
determinacy measured at the Nanning site in 2020. The peak InDel is indicated by
the red circle. The red horizontal dashed line indicates the Bonferroni-corrected
significance threshold of GWAS (α = 1). b A 2-bp causative deletion (the peak InDel)
introduced a premature termination codon in the first exon of the TFL1 gene. Ref
reference, Del deletion. c The frequency distributions of three types of stem growth
habit (indeterminate (Indet), semi-determinate (Semi-det), and determinate (Det))
among three groups comprising landraces carrying the homozygous reference
alleles (designated as Ref), heterozygous alleles (Ref/Del), or homozygous mutation
(2-bp deletion) alleles (Del), respectively. Two-sided Fisher's exact tests were used to
assess the significance of the differences in the proportion of the determinate type
of stem growth habit between landraces carrying Ref and Del alleles and between
landraces carrying Ref/Del and Del alleles. d The geographical distributions of the
32 landraces carrying Del alleles from South-Central China. The map was
created using the map_data() function in the R package ggplot2. HB Hubei province,
HN Hunan province, CQ Chongqing province, GZ Guizhou province, GX Guangxi
province. e There was a significant improvement for the 32 landraces carrying Del
alleles compared with the landraces carrying Ref alleles in the SC group for multiple
human-desired traits including flowering time (DAE, days after emergence), branch
number, seed length, and hundred seed weight (HSW). Significance was tested with
two-sided Wilcoxon tests. In the box plots, central line: median values; bounds of
the box: 25th and 75th percentiles; whiskers: 1.5*IQR (IQR: the interquartile range
between the 25th and 75th percentile). f Divergence time of the 32 landraces
carrying Del alleles from the landraces carrying Ref alleles in the SC group, inferred
using the SMC++ program66, under a mutation rate μ = 1.5 × 10⁻⁸ per site per
generation140, and a generation time of one year. Source data are provided as a
Source Data file.
Chr4 with high PVE values (17.96–41.43%), but it spanned a genomic
region of up to ~12 Mb (Chr4: 42,022,544–53,749,059).
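The Manhattan plots in this section draw a Bonferroni-corrected significance threshold with α = 1. As a rough sketch of what that line represents, the cutoff is α divided by the number of tested markers, plotted on the −log10 scale; the marker count used below is a hypothetical placeholder, not the number of variants tested in the paper.

```python
import math

def bonferroni_threshold(alpha, n_tests):
    """Return the per-test P-value cutoff and its -log10 value."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    p_cutoff = alpha / n_tests
    return p_cutoff, -math.log10(p_cutoff)

# Hypothetical marker count, used only to illustrate the calculation.
p_cutoff, neg_log10_cutoff = bonferroni_threshold(alpha=1, n_tests=1_000_000)
```

With α = 1, the cutoff is simply 1/n, a more permissive line than the conventional α = 0.05, which would sit about 1.3 units higher on the −log10 scale.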
The InDel-based GWAS for the 2-year stem determinacy data
revealed a significantly associated InDel (2-bp deletion at Chr4:
47,174,187) with a PVE of 22.01–35.21% positioned within the strongest
SNP signal (Fig. 5a; Supplementary Data 5). Gene functional annotation
revealed that this InDel apparently leads to premature termination of
translation in the first exon of the gene Vum_04G02513, TFL1
(TERMINAL FLOWER1; Fig. 5b), for which the ortholog in soybean was
reported as the Dt1 locus (Gmtfl1 gene) controlling stem determinacy25.
We found that a total of 32 landraces carried the homozygous mutation
(2-nt deletion) alleles, which were identified using the sequencing
data and were confirmed using Sanger sequencing (Supplementary
Fig. 17). These landraces had a significantly higher proportion of
determinate growth habit type compared to landraces harboring the
reference alleles or the heterozygous alleles (Fig. 5c).
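The comparison of determinate proportions between genotype groups is reported (in the Fig. 5 legend) as a two-sided Fisher's exact test. A self-contained sketch of that test on a 2×2 table is given below; the counts are hypothetical, and the implementation uses the common rule of summing all tables that are no more probable than the observed one.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def table_prob(k):
        # Hypergeometric probability of observing k in the top-left cell
        return comb(row1, k) * comb(row2, col1 - k) / total_tables

    p_observed = table_prob(a)
    k_min, k_max = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are no more likely than the observed one
    p_value = sum(
        p for p in (table_prob(k) for k in range(k_min, k_max + 1))
        if p <= p_observed * (1 + 1e-9)
    )
    return min(1.0, p_value)

# Hypothetical counts (NOT the paper's data): rows are Del and Ref genotypes,
# columns are determinate and non-determinate growth habit.
p_toy = fisher_exact_two_sided(25, 7, 10, 370)
```

An exact test suits this comparison because one group is small (32 landraces), where large-sample approximations such as the chi-squared test can be unreliable.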
Notably, these 32 landraces were all in the SC group and were
originally collected from a contiguous mountainous area in
South-Central China spanning five provinces (Chongqing, Hunan, Hubei,
Guizhou, and Guangxi) (Fig. 5d). We also observed that these 32
landraces (represented by the bars with a predominant proportion of
beige color in Supplementary Fig. 18) were genetically distinct
from other landraces within the SC group using model-based
clustering (K = 4), an inference that was further supported by a
moderate level of differentiation (FST = 0.11). These landraces also
displayed desirable agronomic traits, including significantly earlier
flowering time and significantly increased pod width, seed length,
hundred seed weight, and branch number as compared to the other
landraces of the SC group (Fig. 5e).
We next estimated the divergence time for these 32 landraces
from the other landraces with distinct genetic admixture in the SC
group inferred by the model-based clustering analysis (Fig. 3b), and
obtained similar divergence times of ~219 and 249 years ago using the
SMC++66 and MSMC267 methods, respectively (Fig. 5f; Supplementary
Fig. 19). Our results collectively support that the 32 landraces carrying
the homozygous mutation alleles have been improved by producers in
certain mountainous regions in South-Central China for at least 200
years, and suggest that these materials have great potential for
utilization in modern breeding programs pursuing a variety of
improvement goals.
Tandem duplication of the VumCYP78A6 gene associated with
seed yield trait
Seed yield traits (including size and weight) have undergone strong
selection in the domestication histories and modern breeding
programs for legume crops53,68–70. We measured the hundred seed weight
[Figure 6, panels a–g: (a) Manhattan plots (−log10(P) by chromosome, Chr 1–11) for HSW and SL at the Nanning site in 2020 and 2021; (b) local Manhattan plot, gene models (VumCYP78A6-1, VumCYP78A6-2), and LD heat map; (c) HSW (g) by copy number of VumCYP78A6; (d) REL of VumCYP78A6 in S28 and S33; (e) pod wall cross-sections and cell number (× 10³) for S28 and S33; (f) silique and seed images for Col-0, OE1, and OE2 (scale bars 5 mm and 1 mm); (g) TSW (10⁻² g), silique length (cm), silique width (mm), and seed length (mm) for Col-0, OE1, and OE2.]
Fig. 6 | Tandem duplication of the VumCYP78A6 gene associated with seed yield
traits. a GWAS using the 2020 and 2021 Nanning datasets, indicating that the
strongest association signals for hundred seed weight (HSW) and seed length (SL)
traits are all located at Chr9: 29,030,437–29,126,729. b Local Manhattan plot of HSW
(top), the gene models (middle), and pairwise linkage disequilibrium heat map
(bottom) at Chr9: 29,030,437–29,174,247. The two tandemly duplicated
VumCYP78A6 genes (VumCYP78A6-1 and VumCYP78A6-2) are shown with the red dashed
triangles. In a and b, the red horizontal dashed lines indicate the
Bonferroni-corrected significance thresholds of GWAS (α = 1). c The HSW distributions of
landraces carrying distinct copy numbers of the VumCYP78A6 gene. The number
(n) of landraces carrying distinct copy numbers is shown below. d Bar plot showing
the relative expression levels of VumCYP78A6 in the pods at 16 DAP (days after
pollination) from the long-seed landrace S28 (carrying two gene copies) and the
short-seed landrace S33 (carrying one gene copy). e Light microscope images (top)
and cell number per square millimeter (bottom) of the cross-sections of the pod
wall for the S28 and S33 landraces at 16 DAP. Scale bar, 100 μm. In the box plots of
c and e, central line: median values; bounds of box: 25th and 75th percentiles;
whiskers: 1.5*IQR (IQR: the interquartile range between the 25th and 75th
percentile). f Silique (left) and seed (right) morphology of the wild type (Col-0) and two
independent Arabidopsis thaliana transformants overexpressing the
VumCYP78A6-2 gene (OE1 and OE2). Scale bar: 5 mm for silique and 1 mm for seed. g The bar plots
of thousand seed weight (TSW), silique length, silique width, and SL for Col-0, OE1,
and OE2. P values are 1.07 × 10⁻⁴, 1.1 × 10⁻², 2.3 × 10⁻⁸, 4.1 × 10⁻⁶, 1.44 × 10⁻²,
5.25 × 10⁻³, 1.8 × 10⁻⁷, and 1.94 × 10⁻⁶. The significance was tested using the
two-sided Student's t-test in c, d, e, and g. *P < 0.05, **P < 0.01, ***P < 0.001 in (g). The
data in d and g are shown as mean ± SE. In d, e, and g, the number (n) of each
independent experiment is shown below. Source data are provided as a Source
Data file.
(HSW) and seed length (SL) at the Nanning and Sanya sites in both
2020 and 2021. We next performed GWAS analysis to explore the
genetic basis of these two traits, and identified one QTL
significantly associated with the two traits at both examined sites in both
examined years (Fig. 6a; Supplementary Fig. 20; Supplementary
Table 9); this QTL is positioned at Chr9: 29,030,437–29,126,729,
contains six predicted ORFs (Fig. 6b; Supplementary Data 7), and explains
5.99–16.17% of the phenotypic variation (with the maximum value for SL
at the Nanning site in 2020; Supplementary Data 5).
We next conducted a qPCR analysis of seed tissues from one long-seed
landrace (S28) and one short-seed landrace (S33) at 16 DAP
(days after pollination) for the six candidate genes positioned within
the aforementioned significantly associated interval on Chr9. Two of
these genes showed significant differences in expression level between
the two landraces, but neither of them had an obviously relevant
functional annotation (Supplementary Fig. 21), which prompted us to
explore the potential candidate genes positioned adjacent to this QTL.
We found two tandemly repeated genes (Vum_09G01129 and
Vum_09G01130) ~10.45 kb downstream of the QTL (Fig. 6b). Using an
in silico detection approach based on read depth information71, a copy
number variation (CNV) analysis of this gene in the 440 landraces
showed that the 87 landraces carrying two copies exhibited
significantly higher values for the two examined phenotypes than the 310
landraces carrying only one copy (Fig. 6c; Supplementary Fig. 22).
These results suggested that the CNV may represent the causal variant
controlling these two seed yield component traits.
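The read-depth idea behind the CNV calls can be sketched as follows. This is a simplified illustration rather than the cited detection method: it assumes per-base depths are already available for the gene and for a set of single-copy background positions, and it ignores complications such as reads that map ambiguously between identical copies. All names and depth values are hypothetical.

```python
from statistics import median

def estimate_copy_number(region_depths, background_depths, baseline_copies=1):
    """Estimate copy number from mean region depth relative to background depth.

    background_depths should come from positions assumed to be single copy.
    Returns the rounded copy number and the raw depth ratio.
    """
    background = median(background_depths)
    if background <= 0:
        raise ValueError("background depth must be positive")
    ratio = (sum(region_depths) / len(region_depths)) / background
    return max(0, round(ratio * baseline_copies)), ratio

# Hypothetical per-base depths for two landraces.
background = [30, 31, 29, 30, 32]
two_copy_call, two_copy_ratio = estimate_copy_number([61, 59, 60, 62], background)
one_copy_call, one_copy_ratio = estimate_copy_number([30, 29, 31], background)
```

Using the median for the background keeps a few unusually deep or shallow positions from distorting the ratio.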
The two duplicated genes had identical CDS sequences and were
homologous to the AtCYP78A6 gene (64.65% amino acid sequence
identity; Supplementary Fig. 23), which encodes a cytochrome P450
monooxygenase known to function in maternally promoting seed
growth by increasing the cell number in the integument of developing
Arabidopsis seeds72. We therefore designated these rice bean genes
as VumCYP78A6-1 (Vum_09G01129) and VumCYP78A6-2
(Vum_09G01130). A qPCR analysis showed that the expression level
of VumCYP78A6 in pod wall tissue at 16 DAP was significantly higher
(~2-fold) in S28 than in S33 (Fig. 6d). The impact of this CNV on the
expression of VumCYP78A6 was also verified in a larger panel
comprising 20 landraces with one copy and 20 landraces with two copies.
Specifically, qPCR analysis of the VumCYP78A6 gene in the first fully
expanded trifoliate leaves at 14 days after sowing showed a
significantly (P < 0.01) higher expression level (~2-fold) in the 20 landraces
with two copies than in the 20 landraces with one copy
(Supplementary Fig. 24).
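As a point of reference for the roughly 2-fold differences reported above, the sketch below shows how a relative expression level is often derived from qPCR cycle thresholds. It assumes the widely used 2^−ΔΔCt method with a reference gene; this section does not state which quantification method the authors used, and the Ct values are hypothetical.

```python
def relative_expression(ct_target, ct_reference, ct_target_cal, ct_reference_cal):
    """Fold change of a target gene versus a calibrator sample (2^-ddCt).

    ct_target, ct_reference: Ct values in the sample of interest.
    ct_target_cal, ct_reference_cal: Ct values in the calibrator sample.
    """
    delta_sample = ct_target - ct_reference
    delta_calibrator = ct_target_cal - ct_reference_cal
    return 2.0 ** -(delta_sample - delta_calibrator)

# Hypothetical Ct values: the target crosses threshold one cycle earlier
# in the two-copy sample than in the one-copy calibrator.
fold_change = relative_expression(22.0, 18.0, 23.0, 18.0)
```

Under this method, a one-cycle difference in normalized Ct corresponds to a 2-fold difference in expression, assuming near-perfect amplification efficiency.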
We also examined the number of cells in the pod wall at 16 DAP
through cytological observation and detected a significantly increased
number of cells in S28 compared to S33 (Fig. 6e). Finally, we generated
two independent transgenic Arabidopsis lines by overexpressing the
VumCYP78A6-2 gene (Supplementary Fig. 25), both of which displayed
significantly increased values for silique length, silique width, seed
length, and seed weight (Fig. 6f, g; Supplementary Fig. 26). Viewed
collectively, these results support VumCYP78A6 as a highly probable
causal gene underlying seed yield component traits in rice bean.
Discussion
Rice bean has been proposed as a potential multipurpose legume crop
to promote sustainable agriculture and fight hunger in Asia18,73. In the
present study, we assembled a high-quality reference genome for the
landrace FF25 and developed a valuable genomics resource by
re-sequencing 440 rice bean landraces. By combining high-coverage
PacBio long reads and a Hi-C interaction map, our reference
genome reached high accuracy and high continuity; this genome
provides a valuable resource for future comparative genomics,
evolutionary studies, and molecular research. Our rice bean genome
assembly still contains 87 gaps, 78 of which have more than one
flanking region (100 bp) with a high proportion (>90%) of repeat
sequences, suggesting that most of the gaps were caused by the
incomplete assembly of repeat sequences, as has also been reported by
other studies74,75. We also predicted the candidate centromere regions
using a previously published method76 (see the Methods section) and
found that all 11 candidate centromere regions contained more
than one assembly gap, suggesting that none of the centromere sequences
was fully assembled (Supplementary Table 11); future efforts using
long