
Coupled analysis of Pawpaw (Asimina triloba) genetic markers and ancestry records


Richard Frost
Frost Concepts, Vista CA, USA
Subsets of 49 RAPD markers for 36 Asimina triloba specimens from U.S. NCGR repository sites are
examined for matches to ancestry records. Several known parent-progeny and sibling relationships are
verified, but a few specimens are also determined mislabeled due to excessive dissimilarities. An insight to
the debate of cultivar Overleese vs NC-1 is also presented.
Cultivar ancestry, Graph theory, RAPD markers
The Pawpaw is a deciduous tree native to eastern North America. It produces a potato-size fruit
which has been cultivated by native peoples since antiquity and more recently in home orchards
and small farms in the U.S. [1, 2]. Through the efforts of USDA agro-economist Neal Peterson
[3], the use of Pawpaw has increased in the past few decades, owing to his breeding of advanced
cultivars (see Figure 1) and his establishment of USDA satellite repositories for Asimina
specimens [4]. The fruit is also being considered as a crop in other parts of the world [5].
There have been four genomic studies of the specimens assembled by Peterson. The first three were
by Hongwen Huang, also known for his studies of chestnuts. The first, in 2000, was a
preliminary study to determine appropriate single-locus RAPD markers [6]. The second, in 2003,
applied 71 of these markers to 37 specimens [7]. One pair of the specimens has identical
marker values, bringing the usable total to 36. Also, 22 of the markers returned by the lab
contained missing values, and unfortunately the measurements could not be repeated. Regardless,
Huang processed the data with the NTSYS-pc biostatistical package and employed two
questionable practices: use of markers with missing values [8] and dissimilarity measurements
with a pseudo-metric [9]. On a positive note, Huang published all the marker data, which provides
an opportunity to revisit the study. A few years later Huang published his third study of the
specimens, this time using AFLP markers [10]. This study also used pseudo-metric analysis and
published only the resulting dissimilarity values. The fourth genomic study of the Pawpaws was
by Pomper et al. in 2010 [11]. Only 6 SSR markers were used, leading to a grossly underdetermined
data matrix. The data also contain many missing values and are thus of no use for further
analysis.
The present effort involves reconciling the usable data from Huang’s second study with known
ancestry data (Figure 2). Huang’s original 71 markers are considered a balanced set, which was
then arbitrarily reduced to 49 by the removal of markers with missing values. For this reason,
dissimilarity relations among the specimens are examined using subsets of the markers, with the
goal of identifying one or more marker groups that are meaningful with respect to ancestry
records at an acceptable genetic distance resolution.
International Journal on Computational Science & Applications vol 12:3 pp.1-8, 2022
Figure 1. Known ancestry of U.S. Pawpaw cultivars currently in circulation [3, 11-13].
Figure 2. Ancestry and origins of specimens in H. Huang's RAPD study [3, 11]. The origin of
Wells-PPF is unknown. BEF = Blandy Experimental Farm.
A candidate group of markers was identified by an exhaustive, automated search over subsets of
size 44 through 49 of Huang’s 49 error-free markers. Of the 2,138,410 such subsets, 17,393
produced one or more zero distances and were excluded, leaving 2,121,017 topological graphs to
examine. A complete distance graph 𝐺 determined by the selected marker set was constructed along
with a connected least bridges graph 𝐺𝐿𝐵 [14]. Four known parent-progeny pairs appeared as
nearest neighbours in 𝐺𝐿𝐵. The distribution of mismatches is shown in Figure 3, and minimum and
maximum distances are exhibited in Figure 4. Loci mismatches of ancestry relations are given in
Tables 1 and 2.
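The figure of 2,121,017 graphs can be checked with binomial coefficients. The sketch below
assumes the search enumerated every subset of sizes 44 through 49 drawn from the 49 error-free
markers; the variable names are illustrative and not taken from the original software.

```python
from math import comb

TOTAL_MARKERS = 49       # error-free markers retained from Huang's data
SIZES = range(44, 50)    # subset sizes 44 through 49
DISCARDED = 17_393       # subsets reported to produce a zero distance

# Number of distinct marker subsets of each size.
per_size = {k: comb(TOTAL_MARKERS, k) for k in SIZES}

all_subsets = sum(per_size.values())   # every subset of size 44..49
retained = all_subsets - DISCARDED     # subsets whose graphs were examined

if __name__ == "__main__":
    for k in sorted(per_size, reverse=True):
        print(f"L = {k}: {per_size[k]:>9,} subsets")
    print(f"total = {all_subsets:,}, retained = {retained:,}")
```

Under that assumption the total is 2,138,410 subsets, and removing the 17,393 discarded sets
leaves the 2,121,017 graphs quoted in the text.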
Figure 3. Distribution of loci mismatches in complete graph of selected marker set.
Figure 4. Combined graph of distance extrema shown with solid lines, plus selected
neighbouring specimens for orientation denoted by dotted lines. Black vertices denote members
of known sibling sets a-f while grey vertices are members of suspected sibling sets g and h (see
Table 2). Arrowheads specify parent-progeny relations, otherwise spatial orientation is arbitrary.
Table 1. Known and suspected parent-progeny relations. Distance units are loci mismatches.
1-7-1 Shenandoah
not nearest neighbour
NC-1 (suspected)
Sweet Alice
Sweet Alice
not nearest neighbour
Table 2. Known and suspected sibling relations.
Set #
14, 15, 11
8, 10, 10
g (suspected)
10, 9
h (suspected)
At the top of Figure 4, one observes the tight cluster of specimens NC-1, Potomac, Prolific,
Middletown, Shenandoah, and Rappahannock. The cultivars 2-10 and Potomac are close enough
to imply at least a sibling relationship. Off to the right, note the long distance to the
parent-sibling group of Sweet Alice and the SAA-Zimmermans, plus the adjacent group containing
Sunflower and Wabash. Two of C. Davis' first cultivars, Taylor and Taytwo, are found below, along
with Wilson, a possibly undocumented offspring of Taylor. The numbered cultivar 11-13, a sibling
of Potomac, appears there at great distance from Taylor, indicating the large dissimilarity of
these Davis breeds from the specimens above. Wells and Mitchell are found at the bottom, also
dissimilar from both the specimens above and the Davis breeds. The specimen Wells-PPF is
displaced by 5 mismatches from Wells, indicating that one could be the progeny of the other.
Peterson apparently believes the latter is the original.
The distance from Overleese to its progeny 1-7-1 appears excessive, and that from Taylor to its
progeny 1-23 even more so. Since both parents have suitable distances to their other relations,
this calls into question the validity of the labels on 1-7-1 and 1-23. In the case of 1-7-1, the
problem is further emphasized by its relatively large distance to sibling 1-68. For specimen 1-23,
the discrepancy appeared in all marker subsets during the selection process.
In the long-debated case of Overleese vs. its suspected sibling NC-1, a distance of 7 was found,
which is within the range of 3-8 found for other parent-progeny relations. However, it is also
close to the range of 8-15 found in sibling relations. Consequently, the debate appears
unresolved by any genomic measurements performed to date.
The results of the present study are limited by the relevance of the original marker set and by
the process of selection by ancestry records. Given the range of dissimilarities produced for
known relations, the measurements here should be considered a coarse approximation to the actual
displacements among specimens. Even so, the high correspondence (75%) between the measurements
and the known relations of Tables 1 and 2 indicates that H. Huang's markers have merit. Therefore,
the author believes a retesting of the specimens using all 71 markers at a lab capable of
producing error-free results would be beneficial.
The data from H. Huang's paper were extracted using Adobe Acrobat® and placed in CSV files.
Markers with missing data values were deleted entirely. Specimens 11-13-KSU and 11-13-PPF
were found to have identical marker values and were thus replaced by the single label 11-13. This
vetted set contains 36 specimens with 49 markers each.
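The vetting steps just described can be sketched as follows. This is a minimal illustration in
Python, not the Mathematica code used in the study; the specimen names and marker values are
invented toy data, and the helper names and the prefix-based relabelling rule are assumptions
made for the example.

```python
import os
from typing import Callable, Dict, List, Optional, Sequence

# Toy profiles (hypothetical): 1 = band present, 0 = absent, None = missing.
raw: Dict[str, List[Optional[int]]] = {
    "dup-1": [1, 0, None, 1, 0],
    "dup-2": [1, 0, 1, 1, 0],
    "toy-A": [0, 0, 1, 1, 1],
    "toy-B": [1, 1, 0, None, 0],
}

def drop_incomplete_markers(profiles):
    """Delete every marker (column) that is missing for at least one specimen."""
    n = len(next(iter(profiles.values())))
    keep = [j for j in range(n) if all(p[j] is not None for p in profiles.values())]
    return {name: [p[j] for j in keep] for name, p in profiles.items()}

def merge_identical(profiles, new_label: Callable[[List[str]], str]):
    """Collapse each group of specimens with identical profiles into one entry."""
    groups: Dict[tuple, List[str]] = {}
    for name, p in profiles.items():
        groups.setdefault(tuple(p), []).append(name)
    return {(names[0] if len(names) == 1 else new_label(names)): list(p)
            for p, names in groups.items()}

def mismatches(a: Sequence[int], b: Sequence[int]) -> int:
    """Distance between two profiles, counted as the number of loci mismatches."""
    return sum(x != y for x, y in zip(a, b))

complete = drop_incomplete_markers(raw)
# Assumed relabelling rule: shared name prefix, e.g. "dup-1", "dup-2" -> "dup".
vetted = merge_identical(complete, lambda names: os.path.commonprefix(names).rstrip("-"))

if __name__ == "__main__":
    print(vetted)
    print(mismatches(vetted["dup"], vetted["toy-A"]))
```

In this toy case two of the five markers are deleted for missing values, after which the two
"dup" specimens coincide and are merged under a single label.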
A software program was then constructed to iterate through progressively smaller subsets of the
markers, starting from the original size L = 49. For each subset, basic statistics such as
distances in known relations were extracted, along with parameters of the least bridges graph [14]
produced by the markers, including the component maximal and a list of component vertices. Marker
sets producing one or more zero distances were discarded for poor resolution. The proportion of
zero-producing marker sets increased from 0.085% at L = 47 to 0.85% at L = 44. At this latter
size the resulting graphs also suffered from too much cohesion, and thus no smaller sizes were
pursued. Elapsed execution times for this software program ranged from 0.2 seconds for L = 49 to
3.6 days for L = 44, including I/O.
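The outer loop of such a search can be sketched as below. This is an illustrative Python
outline under stated assumptions, not the author's program: it covers only the enumeration of
marker subsets and the zero-distance filter, omits the least bridges graph of [14], and uses
invented toy profiles and hypothetical function names.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

def mismatches(a: Sequence[int], b: Sequence[int], markers: Sequence[int]) -> int:
    """Loci mismatches between two profiles, restricted to the chosen markers."""
    return sum(a[j] != b[j] for j in markers)

def has_zero_distance(profiles: Dict[str, List[int]], markers: Sequence[int]) -> bool:
    """True if some pair of specimens is indistinguishable on this marker subset."""
    return any(mismatches(profiles[u], profiles[v], markers) == 0
               for u, v in combinations(profiles, 2))

def scan_subsets(profiles: Dict[str, List[int]], n_markers: int,
                 min_size: int) -> Dict[int, Tuple[int, int]]:
    """For each subset size, count (kept, discarded) marker subsets."""
    tally = {}
    for size in range(n_markers, min_size - 1, -1):
        kept = discarded = 0
        for markers in combinations(range(n_markers), size):
            if has_zero_distance(profiles, markers):
                discarded += 1   # poor resolution: two specimens coincide
            else:
                kept += 1        # candidate subset for further analysis
        tally[size] = (kept, discarded)
    return tally

# Toy data (hypothetical): three specimens, four markers.
toy = {"s1": [1, 0, 1, 0], "s2": [1, 0, 0, 0], "s3": [0, 1, 1, 1]}

if __name__ == "__main__":
    print(scan_subsets(toy, n_markers=4, min_size=2))
```

Specimens s1 and s2 differ at a single marker, so every subset that omits it is discarded. The
number of subsets grows combinatorially as the size drops, which is consistent with the steep
rise in execution time reported for L = 44.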
A second program was written to rate the results. For each subset, a specimen pair from a known
relation was considered "present" if both members of the pair occurred in the same component of
the least bridges graph limited by δopt. Two vectors were formed from this data: numbers of
known relations and numbers of suspected relations, with each value in the last columns of Tables
1, 2 representing a vector component. The 2-norm of the outer product of these vectors was then
used as a score. From the scores a group of 291 candidates was produced. A high degree of
duplication was noticed among the topological structures. The candidates were examined for
cohesion properties and a best-of-class with L = 45 and δopt = 8 was selected. A connected graph
of the selection is shown in Figure 5.
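The scoring step can be illustrated with a small sketch. It assumes the two vectors hold tallies
of "present" known and suspected relations; the numbers below are invented for illustration and
are not values from Tables 1 and 2. Because an outer product is a rank-one matrix, its spectral
2-norm and its Frobenius norm coincide and both equal the product of the two vector 2-norms, so
the score can be computed without forming the matrix.

```python
import math
from typing import List, Sequence

def norm2(v: Sequence[float]) -> float:
    """Euclidean (2-) norm of a vector."""
    return math.sqrt(sum(x * x for x in v))

def outer(u: Sequence[float], v: Sequence[float]) -> List[List[float]]:
    """Outer product u v^T as a list of rows."""
    return [[a * b for b in v] for a in u]

def frobenius(m: Sequence[Sequence[float]]) -> float:
    """Frobenius norm; equals the spectral 2-norm when m has rank one."""
    return math.sqrt(sum(x * x for row in m for x in row))

def score(known: Sequence[float], suspected: Sequence[float]) -> float:
    """2-norm of the outer product, via the rank-one identity ||u v^T|| = ||u|| ||v||."""
    return norm2(known) * norm2(suspected)

# Hypothetical tallies for one marker subset.
known_counts = [3, 4]
suspected_counts = [1, 2]

if __name__ == "__main__":
    direct = frobenius(outer(known_counts, suspected_counts))
    print(score(known_counts, suspected_counts), direct)
```

A consequence of this identity is that the score is zero whenever either vector is entirely
zero, so a subset must capture at least one known and one suspected relation to rank at all.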
All computation and visualizations for this study were performed with Mathematica® versions
12 and 13. The hardware platform was a deskside Intel® i9-10900KF PC with 32GB RAM and
1TB SSD running Windows® 11. No compatibility issues were detected within this environment.
Figure 5. Least genetic distances between 36 Pawpaw cultivars tested by Huang et al [7].
Distances represent # of loci mismatches between a rectified set of 45 markers from Huang's
original error-free set of 49. Orientation of Pawpaws is arbitrary except for solid arrows indicating
parent-progeny relations. Dashed arrow indicates suspected parent-progeny relation. Solid lines
(not arrows) are nearest-neighbour relations and dashed lines are least bridges. Names assigned
upon release of a breed to the nursery industry are specified by “≡”. Labels with superscripts a-f
are sets of known siblings, while g-h are sets of possible siblings.
[1] R. N. Peterson, "PAWPAW (ASIMINA)," Genetic Resources of Temperate Fruit and Nut Crops
290, pp. 569-602, 1991.
[2] C. Ferrer-Blanco, J. Hormaza, and J. Lora, "Phenological growth stages of “pawpaw” [Asimina
triloba (L.) Dunal, Annonaceae] according to the BBCH scale," Scientia Horticulturae, vol. 295,
p. 110853, 2022.
[3] R. N. Peterson, "Pawpaw variety development: a history and future prospects," HortTechnology,
vol. 13, no. 3, pp. 449-454, 2003.
[4] "NCGR Corvallis - Asimina Germplasm." USDA ARS. (accessed 2022).
[5] R. G. Brannan and M. N. Coyle, "Worldwide Introduction of North American Pawpaw (Asimina
triloba): Evidence Based on Scientific Reports," Sustainable Agriculture Research, vol. 10, no.
3, pp. 1-19, 2021.
[6] H. Huang, D. R. Layne, and T. L. Kubisiak, "RAPD inheritance and diversity in pawpaw
(Asimina triloba)," Journal of the American Society for Horticultural Science, vol. 125, no. 4,
pp. 454-459, 2000.
[7] H. Huang, D. R. Layne, and T. L. Kubisiak, "Molecular characterization of cultivated pawpaw
(Asimina triloba) using RAPD markers," Journal of the American Society for Horticultural
Science, vol. 128, no. 1, pp. 85-93, 2003.
[8] P. M. Schlueter and S. A. Harris, "Analysis of multilocus fingerprinting data sets containing
missing data," Molecular Ecology Notes, vol. 6, no. 2, pp. 569-572, 2006.
[9] R. Frost, "Re-evaluation of NCGR Davis Ficus carica and palmata SSR profiles," PLoS ONE,
vol. 17, no. 2, p. e0263715, 2022.
[10] Y. Wang, G. L. Reighard, D. R. Layne, A. G. Abbott, and H. Huang, "Inheritance of AFLP
markers and their use for genetic diversity analysis in wild and domesticated pawpaw [Asimina
triloba (L.) Dunal]," Journal of the American Society for Horticultural Science, vol. 130, no. 4,
pp. 561-568, 2005.
[11] K. W. Pomper et al., "Characterization and identification of pawpaw cultivars and advanced
selections by simple sequence repeat markers," Journal of the American Society for
Horticultural Science, vol. 135, no. 2, pp. 143-149, 2010.
[12] K. W. Pomper, S. B. Crabtree, and J. D. Lowe, "The North American Pawpaw Variety:
'KSU-Atwood (TM)'," Journal of the American Pomological Society, vol. 65, no. 4, pp. 218-221,
2011.
[13] K. Gasic, J. E. Preece, and D. Karp, "Register of new fruit and nut cultivars list 50,"
HortScience, vol. 55, no. 7, pp. 1164-1201, 2020.
[14] R. Frost. "Least Bridges Graphs." (accessed
Richard is an old-school numerical analyst with
academic and vocational experience in applied
mathematics, computer science, and
horticulture. He is currently pursuing research
in the genomics of lesser-studied fruits.