Content uploaded by Richard Frost

Author content

All content in this area was uploaded by Richard Frost on Aug 29, 2022

Content may be subject to copyright.

International Journal on Computational Science & Applications (IJCSA) Vol.12, No.4, August 2022

DOI:10.5121/ijcsa.2022.12401 1

DECADES OF MISCOMPUTATION IN

GENOMIC CLADES AND DISTANCES

Richard B. Frost

Frost Concepts, Vista CA, USA

ABSTRACT

Hardly a week seems to go by without encountering a new genetics study that contains a diagram of

specimen genetic similarities and clades. For these diagrams, biologists have long relied on university-

based and/or commercial computational packages which are not only prone to pilot errors but also contain

“analysis” methods which should never be used for genetic distance or clustering. Not that all the software

is poor – it appears there is a mixture of good and bad in each package. The troublesome methods,

however, have enjoyed acceptable use for so long that serious errors are published on a frequent basis.

What follows is a list of concerns that will hopefully be useful to authors and reviewers alike. The report

concludes with a graph-theoretical alternative to the current status quo in genomics.

KEYWORDS

Bayesian clustering, Graph partitioning, Missing values, Pair joining, Pseudo-metrics.

1. ITEMS OF CONCERN

1.1. Use of pseudo-metrics

A portion of the genomic literature utilizes pseudo-metrics to compute similarity or dissimilarity

among genetic profiles. But to be used as a distance, the values are not valid for comparison

unless the measure is a qualified metric [1]. This matter was contested 38 years ago by

Felsenstein [2] who insisted biologists were only making adjacency comparisons in nodes of

topological ancestry trees. However, the construction of those trees involves all-to-all

comparisons of dissimilarities[3].

The well-known 1979 coefficient of Nei & Li [4] (eqns. 8 and 26) is an example of a pseudo-

metric:

Consider the 8 specimens with 8 random 0,1 markers listed in Table 1. In the spatial domain Nei

& Li's 1979 coefficient produces 2 infinite distances due to zero denominators. Ignoring these,

the remaining subset violates the metric triangle axiom 16 times out of 270. In the marker

frequency domain (Nei's intended use) the results are equally poor with the measure

producing 49 triangle axiom errors out of 336 tests plus 1 zero distance value. A list of non-

metric measures offered as “distances” in a selection of commonly used software packages is

given in Supplemental Table S1. It is also worth noting that only one biostatistical software

package is careful with the term “metric” and refers to all genetic “distances” as dissimilarities

[5].

International Journal on Computational Science & Applications (IJCSA) Vol.12, No.4, August 2022

2

Table 1.

Part 1: Eight example specimens with randomly generated marker values.

Part 2: Marker value frequencies.

Part 3: Jaccard metric spatial distances, scaled to integers so that values represent number of

marker mismatches. In Part 3, nearest neighbours have the smallest value in any row or column.

For example, row #6 indicates that specimen 6 has nearest neighbours 1, 2, and 8.

marker spatial values

marker value frequencies

specimen spatial distances

#

A

B

C

D

E

F

G

H

A

B

C

D

E

F

G

H

1

2

3

4

5

6

7

8

1

1

1

1

0

1

1

0

1

3/8

7/8

1/2

5/8

1/4

1/2

3/8

5/8

4

5

5

4

3

6

4

2

0

1

1

0

0

0

1

1

5/8

7/8

1/2

5/8

3/4

1/2

5/8

5/8

4

5

1

2

3

4

4

3

0

1

0

1

0

1

0

0

5/8

7/8

1/2

3/8

3/4

1/2

3/8

3/8

5

5

4

3

4

5

3

4

0

1

0

0

0

0

1

1

5/8

7/8

1/2

5/8

3/4

1/2

5/8

5/8

5

1

4

3

4

3

3

5

0

1

1

1

0

0

0

1

5/8

7/8

1/2

3/8

3/4

1/2

3/8

5/8

4

2

3

3

5

6

4

6

1

1

1

0

0

1

1

0

3/8

7/8

1/2

5/8

3/4

1/2

5/8

3/8

3

3

4

4

5

5

3

7

0

0

0

0

1

0

1

0

5/8

1/8

1/2

5/8

1/4

1/2

5/8

3/8

6

4

5

3

6

5

6

8

1

1

0

1

0

1

1

1

3/8

7/8

1/2

3/8

3/4

1/2

5/8

5/8

4

4

3

3

4

3

6

1.2. Use of synonyms in cluster analysis

When a metric produces a zero distance between two or more profiles, they are termed synonyms

under that metric. Investigators should make a note of such occurrences and then pick one as an

ambassador to represent the synonymous group going forward. Synonyms are a violation of the

metric positive definite axiom and should not be present when dissimilarity values are being

compared since they skew the analysis of connectivity among profiles (see Figure 1).

Figure 1. The effects of synonymy on cluster analysis of genetic distances. In the above graphs the marker

data for specimen Trojano is identical to the data for specimen Kadota and thus there is zero distance

between them. The presence and absence of Trojano produces dendrograms that are structurally different in

both clustering and depth of branch points. Structural differences are also apparent in the nearest neighbour

graphs on the right. Vertical hierarchy does not imply ancestry. See Supplemental Table S2 for marker data

and computed distances.

1.3. Flattening multi-dimensional data into vectors

Except for pattern-matching metrics, most distance measures available in software packages are

vector-based e.g. List in Mathematica® [6], pdist and clustergram in MATLAB® [7, 8],

and single spreadsheet rows in SPSS [9]. As such investigators and software packages often

“flatten” their tensor marker data (multiple primers per marker) into vectors by the following

procedure:

International Journal on Computational Science & Applications (IJCSA) Vol.12, No.4, August 2022

3

where is tensor data, is the resulting vector, is the number of markers and the number of

primers per marker. This is sometimes done silently without the users knowledge e.g.

functionmat_gen_dist in R [10]. Doing so is equivalent to assuming all primers per marker

are independent. It also changes the problem under study when a metric with a non-trivial

normative expression (e.g. Euclidean) is used, since

except in rare instances of , . Further, any distance computed by such metrics cannot

be viably projected back into the original problem space because of the nature of the transform.

In particular, the scalars of the normed vector space would have to be inverted to scalars

of the normed tensor space which is generally infeasible when for non-trivial

norms due to the loss of dimensionality. Investigators desiring a normative metric for multi-allele

data should consider the spectral radius

Where is the ith eigenvalue of square matrix and denotes the modulus of the jth root

of .

1.4. Misuse of metrics on data with errors and omissions

Some investigators will use a pattern matching metric and incorporate all their markers into an

analysis of genetic distances – including those with missing values due to recording errors. This

malpractice has been previously discussed by Schlueter and Harris[11]. Investigators plus authors

of clustering and genetic distance software (e.g. [5, 8, 9, 12, 13]) should consider what happens

when profiles , , and where is missing a final

value due to recording error. Jaccard's metric [14] will produce , but without the

recording error the result would be . Likewise if any metric is used to compare only the

primers with all values intact the result will be the same. Hence both approaches introduce at

least as many errors as are removed.

1.5. Use of marker frequencies to distinguish individuals

Investigators who choose non-pattern-matching metrics on data composed of amino letter

sequences are forced to use marker frequencies due to the lack of numeric values. To do so, the

distribution of letters within a specimen marker profile is considered a population sample whose

positional frequencies (Table 1, Part 2) are compared by distance metric or pseudo-metric to the

letter distribution of another specimen profile. The first problem that occurs here is with multi-

dimensional data: investigators and software packages are flattening tensors as discussed above

instead of comparing multi-dimensional distribution samples. The second, more general problem

is that distances computed between specimen marker distributions cannot be considered valid

when the correlation matrix (or tensor) of all profile frequencies is singular – demonstrating that a

valid encompassing distribution has not been established for the positional frequencies. Singular

frequency correlation objects are typically the case with data from genetic profiles e.g., the

correlation matrices and tensors of Supplemental Tables S2, S3, S4 are singular as discussed in

the Table legends. Investigators desiring a dissimilarity measure for single-value marker

frequencies should consider Mahalanobis’ metric [15]

International Journal on Computational Science & Applications (IJCSA) Vol.12, No.4, August 2022

4

Mahalanobis

where contains frequency vectors of the specimen spatial vectors and is the

correlation matrix of . Be sure to check the numerical condition [16] of before computing

distances.

1.6. Structures derived from profile enrichment

To classify genetic profiles into clades or distance clusters, some software packages use the

sample distribution of primer values per marker to generate additional profiles for an “enriched”

population, sometimes referred to as “burn-in”, “bootstrap”, or “Bayesian clustering”. Partitions

of the enriched set are then used to decide cluster memberships for the original sample e.g., [17-

22]. Some implementations provide a questionable confidence interval calculated from within the

enriched profile set.

It is unclear whether the enriched sets are relevant to the original data [23] or whether the

resulting partition should be considered as anything more than one of several possibilities [24].

The enrichments are commonly of magnitude 1k to 100k. However, a non-trivial set of biased

profiles with markers and primers per marker will imply a full population of at least

magnitude to , requiring a statistical sample size to [25]. This is an

intractable situation in terms of computing order distances or order metric tests for

nontrivial size . Implementers of enrichment methods claim that smaller magnitudes are

sufficient. To verify this claim they need to provide an analytic function of the distance metric

and the marker distributions of all specimen profiles that for a given profile and radial

displacement yields an unbiased measurable set of enriched profiles that fill (topologically cover)

the enclosed hypersphere. This will enable computation of an accurate confidence interval for the

profile enrichment results.

1.7. False nearest neighbours

Nearest neighbour analysis is preferred for cluster determination among genetic distances due to

the complex topology of multi-dimensional genetic distance spaces (the alternative is to establish

an eigenbasis with a Lie algebra). It is important for investigators to realize that a genetic profile

can have multiple nearest neighbours (n.n.) of the same distance – a condition termed multiplicity

(Table 1, Part 3). Unfortunately, some of the n.n. algorithms used internally in biostatistical

software only pick the first element returned by sorting instead of the entire multiplicity group – a

behaviour inherited from graph traversals (see Figure 2). When multiplicity exists and is ignored

by the n.n. algorithm the distance to the selected neighbour is effectively shortened and the

problem under study is changed. Consequently, many published sets of genetic distance clusters

are erroneous.

1.8. Misuse of pair group analysis

The concordance correlation of pair group analysis forces elements into pairs by design [3]. As a

result, pair group analysis cannot accurately partition distance sets containing odd-ranked

multiplicity – a condition due to the nature of genetic profiles (see Figure 2). Another concern is

the interpretation of branching points in pair group dendrograms as mutation or procreation

points. This might be true in a system with only binary branching and no re-entrant breeding, but

statements of these qualifications are not found in the genomic literature even though ample

evidence is available for both.

International Journal on Computational Science & Applications (IJCSA) Vol.12, No.4, August 2022

5

1.9. Misapplication of graph partitioning software

Among the many graph partitioning algorithms available from computer science and graph

theory, only a subset is applicable to distances, and only those that do not ignore multiplicity are

viable for genetic profiles with significant number of markers. Further, just because an

investigator picks a good algorithm does not mean the results will have any relevance as

population subgroups or clades. To check this, examine a distance-limited nearest neighbour

graph using either the component maximal or an empirically known upper bound of distance

separation for generations. The result will be one of 3 outcomes: viable clusters, lack of

cohesion, or too much cohesion (Figure 3). These conditions can be due to the choice of metric,

choice of markers, or the reality of the specimens.

Figure 2. Three renderings of the example specimen data from Table 1 using Jaccard’s metric on spatial

values. In the first graph, notice the nearest-neighbor multiplicities of length 3 for specimens 6, 8, and 3,

while specimen 4’s n.n. is at length 1. The second graph is from an algorithm that ignores multiplicity and

leads to false structural conclusions. The dendrogram on the right illustrates the inability of Sokal’s pair-

joining algorithm to parse odd-ranked multiplicity. Vertical hierarchy does not imply ancestry.

International Journal on Computational Science & Applications (IJCSA) Vol.12, No.4, August 2022

6

Figure 3. Cohesion extremes in distance limited nearest neighbour graphs. Distance limits were empirically

determined by comparing specimen ancestry records to 2nd generation distances computed with Jaccard’s

metric. Vertical hierarchy does not imply ancestry. The left graph illustrates lack of cohesion in SSR data

(see Supplemental Table S3. The right graph shows excess cohesion in SSR data (see Supplemental Table

S5). Thick black edges are length 1/9, thin grey edges are length 2/9. Note that distance partitioning cannot

improve the right-hand graph because to cut any grey line one must cut them all.

2. A TOPOLOGICAL APPROACH

If is a collection of genetic profiles plus a set of some or all the distances between them and

these qualify as a metric, then is termed a distance graph with profiles for vertices and

distances for edges. The traditional nearest neighbor graph is the subset of containing only

nearest neighbor edges and their vertices. (Note: some computer programs ignore multiplicity so

it is recommended to check automated results.) A least bridges graph of will have overlap

with but offers more insight to components. The construction method is hierarchical. Vertices

are first added as disconnected components. The shortest available edge connections are then

added incrementally. Edges are only added between disconnected components and thus termed

"bridges" [26]. A new component is created each time an edge is added, replacing the prior two.

If there are multiple edges of the same distance that qualify then the entire set is added, possibly

engulfing multiple components. The distances among components must be re-evaluated after an

edge or edge set is added. Inter-component distances are determined by selecting the shortest

vertex-to-vertex distance between them. This is often but not always a nearest neighbor edge. The

process is continued incrementally until a prescribed limit is reached (e.g. a maximum distance)

or a connected graph is achieved.

When or is distance-limited by a non-trivial amount , the number of edges will be

reduced and hence the number of components can increase. In any such graph, consider the

function of the product of the # of components with the # of vertices : . The

value which produces a graph that maximizes is termed the component maximal (Figure

4). Since this value maximizes the number of graph components with respect to vertices, it

produces elemental clusters of original graph for its given distance metric. An example is

shown in Figure 5.

International Journal on Computational Science & Applications (IJCSA) Vol.12, No.4, August 2022

7

Figure 4. Variation of c of components (isolated clusters), v of vertices (specimens), and fcv

in least bridges graphs of 28 Moraceae specimens limited to genetic distance . The component maximal

occurs at f where is used for illustration purposes (see Figure 5).

Figure 5. Distance-limited least bridges graph showing elemental clusters of 28 Moraceae specimens [27].

Solid lines denote nearest neighbors and dashed lines are least bridges. 26 specimens are present in the

graph with 2 remaining cladeless. In the bottom left component Seizuro and Sabbawala-2 are mutual

nearest neighbors, while Seizuro is the single nearest neighbor of Mysore Local and SRDC-1. A complete

set of distance values are available in Supplemental Table S6 and the referenced publication. Distance limit

is the component maximal . Vertical hierarchy does not imply ancestry.

International Journal on Computational Science & Applications (IJCSA) Vol.12, No.4, August 2022

8

REFERENCES

[1] J. K. Hunter and B. Nachtergaele, Applied analysis. World Scientific Publishing Company, 2001, p.

438, doi: https://doi.org/10.1142/4319.

[2] J. Felsenstein, "Distance methods for inferring phylogenies: a justification," Evolution, pp. 16-24,

1984. https://www.jstor.org/stable/2408542.

[3] J. H. Camin and R. R. Sokal, "A method for deducing branching sequences in phylogeny," Evolution,

pp. 311-326, 1965. https://www.jstor.org/stable/2406441.

[4] M. Nei and W.-H. Li, "Mathematical model for studying genetic variation in terms of restriction

endonucleases," Proceedings of the National Academy of Sciences, vol. 76, no. 10, pp. 5269-5273,

1979, doi: https://doi.org/10.1073/pnas.76.10.5269.

[5] X. J.-C. Perrier, Jean-Pierre. "DARwin - Dissimilarity Analysis and Representation for Windows."

CIRAD. https://darwin.cirad.fr/.

[6] W. Research. "Mathematica." https://www.wolfram.com/mathematica.

[7] MATLAB. "Pairwise distance between pairs of observations - MATLAB pdist." MathWorks.

https://www.mathworks.com/help/stats/pdist.html.

[8] MATLAB. "Object containing hierarchical clustering analysis data - MATLAB." MathWorks.

https://www.mathworks.com/help/bioinfo/ref/clustergram.html.

[9] IBM. "SPSS Statistics | IBM." IBM. https://www.ibm.com/products/spss-statistics.

[10] P. Savary. "Landscape and genetic data processing with graph4lg." The R Project. https://cran.r-

project.org/web/packages/graph4lg/vignettes/input_data_processing_1.html.

[11] P. M. Schlueter and S. A. Harris, "Analysis of multilocus fingerprinting data sets containing missing

data," Molecular Ecology Notes, vol. 6, no. 2, pp. 569-572, 2006, doi: https://doi.org/10.1111/j.1471-

8286.2006.01225.x.

[12] Biostat. "NTSYSpc." Applied Biostat LLC. http://www.appliedbiostat.com/ntsyspc/ntsyspc.html.

[13] R. "The R Project for Statistical Computing." The R Foundation. https://www.r-project.org/.

[14] S. Kosub, "A note on the triangle inequality for the Jaccard distance," Pattern Recognition Letters,

vol. 120, pp. 36-38, 2019, doi: https://doi.org/10.1016/j.patrec.2018.12.007.

[15] P. C. Mahalanobis, "On the generalized distance in statistics," 1936.

http://library.isical.ac.in:8080/jspui/bitstream/10263/6765/1/Vol02_1936_1_Art05-pcm.pdf.

[16] G. W. Stewart, Afternotes on numerical analysis. SIAM, 1996.

https://doi.org/10.1137/1.9781611971491.

[17] M. J. Hubisz, D. Falush, M. Stephens, and J. K. Pritchard, "Inferring weak population structure with

the assistance of sample group information," Molecular ecology resources, vol. 9, no. 5, pp. 1322-

1332, 2009, doi: https://doi.org/10.1111/j.1755-0998.2009.02591.x.

[18] C. C. Chang, C. C. Chow, L. C. Tellier, S. Vattikuti, S. M. Purcell, and J. J. Lee, "Second-generation

PLINK: rising to the challenge of larger and richer datasets," Gigascience, vol. 4, no. 1, pp. s13742-

015-0047-8, 2015, doi: https://doi.org/10.1186/s13742-015-0047-8.

[19] G. Guillot, S. Renaud, R. Ledevin, J. Michaux, and J. Claude, "A unifying model for the analysis of

phenotypic, genetic, and geographic data," Systematic biology, vol. 61, no. 6, pp. 897-911, 2012, doi:

https://doi.org/10.1093/sysbio/sys038.

[20] L. Excoffier and H. E. Lischer, "Arlequin suite ver 3.5: a new series of programs to perform

population genetics analyses under Linux and Windows," Molecular ecology resources, vol. 10, no.

3, pp. 564-567, 2010, doi: https://doi.org/10.1111/j.1755-0998.2010.02847.x.

[21] O. François, S. Ancelet, and G. Guillot, "Bayesian clustering using hidden Markov random fields in

spatial population genetics," Genetics, vol. 174, no. 2, pp. 805-816, 2006, doi:

https://doi.org/10.1534/genetics.106.059923.

[22] C. Chen, E. Durand, F. Forbes, and O. François, "Bayesian clustering algorithms ascertaining spatial

population structure: a new computer program and a comparison study," Molecular Ecology Notes,

vol. 7, no. 5, pp. 747-756, 2007, doi: https://doi.org/10.1111/j.1471-8286.2007.01769.x.

[23] D. J. Witherspoon et al., "Genetic similarities within and between human populations," Genetics, vol.

176, no. 1, pp. 351-359, 2007, doi: https://doi.org/10.1534/genetics.106.067355.

[24] J. Novembre, "Pritchard, Stephens, and Donnelly on population structure," Genetics, vol. 204, no. 2,

pp. 391-393, 2016, doi: https://doi.org/10.1534/genetics.116.195164.

[25] M. F. Triola, Elementary Statistics, 8th ed. Addison-Wesley, 2001.

https://books.google.com/books?id=G6u8PwAACAAJ.

International Journal on Computational Science & Applications (IJCSA) Vol.12, No.4, August 2022

9

[26] C. Godsil and G. F. Royle, Algebraic graph theory. Springer Science & Business Media, 2013.

https://link.springer.com/book/10.1007/978-1-4613-0163-9.

[27] B. Mathi Thumilan, R. Sajeevan, J. Biradar, T. Madhuri, K. N. Nataraja, and S. M. Sreeman,

"Development and characterization of genic SSR markers from Indian mulberry transcriptome and

their transferability to related species of Moraceae," PloS ONE, vol. 11, no. 9, p. e0162909, 2016,

doi: https://doi.org/10.1371/journal.pone.0162909.

[28] MATLAB. "Pairwise distance between pairs of observations - MATLAB pdist - Distance metric."

MathWorks. https://www.mathworks.com/help/stats/pdist.html#mw_39296772-30a1-45f3-a296-

653c38875df7.

[29] Wolfram. "Distance and Similarity Measures - Wolfram Language Documentation." Wolfram

Research, Inc. https://reference.wolfram.com/language/guide/DistanceAndSimilarityMeasures.html.

[30] IBM. "Distances - IBM Documentation." IBM Corporation. https://www.ibm.com/docs/en/spss-

statistics/28.0.0?topic=features-distances.

[31] USDA. "Ficus carica L. GRIN-Global." USDA ARS. https://npgsweb.ars-

grin.gov/gringlobal/taxon/taxonomydetail?id=16801.

[32] K. W. Pomper et al., "Characterization and identification of pawpaw cultivars and advanced

selections by simple sequence repeat markers," Journal of the American Society for Horticultural

Science, vol. 135, no. 2, pp. 143-149, 2010, doi: https://doi.org/10.21273/JASHS.135.2.143.

[33] K. Vinod, "Structured association mapping using STRUCTURE and TASSEL," Bioinformatics Tools

for Genomics Research, p. 103, 2011.

https://www.academia.edu/706699/Structured_Association_Mapping_using_STRUCTURE_and_TA

SSEL.

[34] A. Wünsch and J. Hormaza, "Molecular characterisation of sweet cherry (Prunus avium L.)

genotypes using peach [Prunus persica (L.) Batsch] SSR sequences," Heredity, vol. 89, no. 1, pp. 56-

63, 2002, doi: https://doi.org/10.1038/sj.hdy.6800101.

AUTHOR

R.B. Frost is an old-school numerical analyst with academic and vocational experience

in applied mathematics, computer science, and horticulture. He is currently pursuing

research in the genomics of lesser-studied fruits.

SUPPLEMENTARY INFORMATION

Supplemental Table S1. Non-metric dissimilarity measures erroneously presented as distances in software

packages commonly used for biostatistics research [12, 28-30]. Symmetric pairs of zero distances were

removed prior to analysis.

Metric Tests

Software

Measure

Name

Data Type

Domain

Dataset

Reflexive

Com-

mutative

Triangle

Inequality

MATLAB R2022a

Correlation

Numerical

Spatial

Table x

Passed

Passed

Failed

MATLAB R2022a

Correlation

Numerical

Frequency

Table x

Passed

Passed

Failed

MATLAB R2022a

Cosine

Numerical

Spatial

Table x

Passed

Passed

Failed

MATLAB R2022a

Cosine

Numerical

Frequency

Table x

Passed

Passed

Failed

Mathematica v13.0

Bray-Curtis

Numerical

Spatial

Table x

Passed

Passed

Failed

Mathematica v13.0

Bray-Curtis

Numerical

Frequency

Table x

Passed

Passed

Failed

Mathematica v13.0

Correlation

Numerical

Spatial

Table x

Passed

Passed

Failed

Mathematica v13.0

Correlation

Numerical

Frequency

Table x

Passed

Passed

Failed

Mathematica v13.0

Cosine

Numerical

Spatial

Table x

Passed

Passed

Failed

Mathematica v13.0

Cosine

Numerical

Frequency

Table x

Passed

Passed

Failed

International Journal on Computational Science & Applications (IJCSA) Vol.12, No.4, August 2022

10

NTSYSpc v2.21w

Hillis

Multi-allele

Counts

Spatial

Table S2

Passed

Passed

Failed

NTSYSpc v2.21w

Hillis

Multi-allele

Counts

Frequency

Table S2

Passed

Passed

Failed

NTSYSpc v2.21w

Nei 1972

Multi-allele

Counts

Spatial

Table S2

Passed

Failed

Failed

NTSYSpc v2.21w

Nei 1972

Multi-allele

Counts

Frequency

Table S2

Passed

Failed

Failed

NTSYSpc v2.21w

Nei 1978

Multi-allele

Counts

Spatial

Table S2

Passed

Passed

Failed

NTSYSpc v2.21w

Nei 1978

Multi-allele

Counts

Frequency

Table S2

Passed

Passed

Failed

SPSS v28.0.1.1

Size

Difference

0,1 Binary

Spatial

Table 1

Passed

Passed

Failed

SPSS v28.0.1.1

Pattern

Difference

0,1 Binary

Spatial

Table 1

Passed

Passed

Failed

SPSS v28.0.1.1

Binary

Shape

0,1 Binary

Spatial

Table 1

Passed

Passed

Failed

SPSS v28.0.1.1

Lance and

Williams

0,1 Binary

Spatial

Table 1

Passed

Passed

Failed

Supplemental Table S2. Multi-dimensional SSR and genetic distance data from 8 Ficus carica specimens

at NCGR Davis [31]. Distances computed with spectral radius of , where are the tensors

below each specimen name. The correlation tensor of frequencies for this data is singular due to marker

frequencies of M8N1.2 all having value 1/2, and several other markers having no variation e.g., C22F1.1.

specimens

Adriatic

Archipel

Calimyrna

Kadota

Mission

Panachee

Trojano

Vernino

SSR markers

C22F1

283

283

283

283

283

283

283

283

283

283

283

283

285

283

283

283

C24H1

272

270

272

272

270

272

272

270

272

272

272

272

272

272

272

272

C26N1

234

234

234

234

234

234

234

234

234

236

234

234

236

234

234

234

C31F1

224

224

239

224

224

224

224

224

239

224

239

239

239

239

239

239

C35H1

252

254

254

254

254

252

254

254

254

254

256

254

254

254

254

254

C37N1

204

204

204

204

204

204

204

204

204

208

204

208

204

204

208

204

LM12H1

214

214

214

214

214

233

214

233

243

243

243

243

243

233

243

243

LM14H1

200

200

200

200

200

200

200

198

200

200

200

200

200

200

200

200

LM30N1

245

245

243

243

245

237

243

231

251

251

245

245

247

251

245

251

LM36N1

248

248

248

248

248

248

248

248

248

248

250

248

250

248

248

250

M1F1

172

189

172

172

189

155

172

172

184

189

184

189

189

184

189

188

M2H1

155

161

153

153

161

153

153

155

161

167

161

167

167

153

167

155

M3N1

124

120

132

120

122

124

120

132

132

132

132

132

122

132

132

132

M4F1

194

194

194

194

194

194

194

214

International Journal on Computational Science & Applications (IJCSA) Vol.12, No.4, August 2022

11

214

214

214

218

218

214

218

214

M8N1

171

171

171

171

171

171

171

171

171

175

171

175

175

171

175

171

distances

Adriatic

21.383

17.378

12.042

20.772

27.715

12.042

32.14

Archipel

21.383

30.989

19.046

19.546

40.948

19.046

37.917

Calimyrna

17.378

30.989

19.209

27.995

32.223

19.209

33.892

Kadota

12.042

19.046

19.209

19.134

27.599

0.

33.119

Mission

20.772

19.546

27.995

19.134

40.773

19.134

37.264

Panachee

27.715

40.948

32.223

27.599

40.773

27.599

28.505

Trojano

12.042

19.046

19.209

0.

19.134

27.599

33.119

Vernino

32.14

37.917

33.892

33.119

37.264

28.505

33.119

Adriatic

Archipel

Calimyrna

Kadota

Mission

Panachee

Trojano

Vernino

Supplemental Table S3. SSR data values for 5 loci from 2010 Asimina triloba (Pawpaw) study of

Pomper et al [32]. Suffix _.F refers to forward SSR primer, _.R to reverse. Note that the zeros are missing

values which skew the analysis as discussed in the main text. The correlation tensor of frequencies for this

data is singular due to the interdependence of primers C104.F and C104.R.

Genotype

B3.F

B3.R

B103.F

B103.R

B129.F

B129.R

C104.F

C104.R

G119.F

G119.R

10-35

183

191

266

339

166

172

184

0

158

164

11-13

191

0

264

305

166

172

184

0

158

0

1-23

185

189

290

310

158

0

184

0

158

176

1-68

185

187

268

341

158

179

175

184

158

164

2-10

191

0

264

270

170

172

184

0

161

164

2-54

191

0

264

270

162

166

184

0

161

0

3-11

191

0

272

288

158

172

184

0

158

161

3-21

189

191

266

305

166

170

184

0

161

164

5-5

183

189

270

305

166

168

184

0

161

0

7-90

185

191

305

342

170

176

184

0

161

164

8-20

189

191

264

270

162

0

184

0

158

167

9-47

183

0

272

274

158

166

184

0

158

161

9-58

183

191

264

339

170

176

184

0

158

164

BH10

189

0

319

321

162

170

184

0

144

161

Cales Creek

175

183

266

274

156

158

184

0

158

164

Davis

185

189

264

268

158

164

175

184

158

164

Greenriver Belle

183

189

264

266

162

172

184

0

158

161

IXL

187

189

274

309

158

162

175

184

158

164

M. Gordon

185

195

270

312

164

170

184

0

161

164

Middletown

183

193

270

321

170

0

184

0

158

161

Mitchell

0

0

266

321

158

172

184

0

158

167

NC-1

185

193

266

0

158

162

184

0

158

161

Overleese

185

189

264

0

158

164

175

184

158

0

PA-Golden#1

191

193

336

343

172

176

184

0

158

164

PA-Golden#3

189

191

336

343

158

172

184

0

161

170

PA-Golden#4

175

183

319

326

164

0

184

0

158

161

Potomac

183

191

264

324

170

0

184

0

158

164

Prolific

189

191

309

323

158

162

184

0

158

0

Rappahannock

183

191

266

0

166

0

184

0

164

170

Rebeccas Gold

185

193

266

0

158

162

184

0

158

161

Shenandoah

185

187

264

274

162

164

184

0

158

164

Sue

175

189

266

329

166

180

184

0

161

164

Sunflower

187

0

274

341

162

180

175

184

164

0

Susquehanna

189

191

264

270

162

0

184

0

158

167

Sweet Alice

175

183

260

324

166

182

184

0

144

164

Taylor

183

185

268

322

173

193

184

0

167

170

Taytwo

175

185

252

290

158

0

184

0

164

176

Wabash

183

0

266

324

170

172

184

0

158

170

Wells

175

191

276

290

177

0

184

0

161

164

Wilson

183

185

268

321

173

193

184

0

167

170

Zimmerman

191

195

303

324

164

177

184

0

158

0

International Journal on Computational Science & Applications (IJCSA) Vol.12, No.4, August 2022

12

Supplemental Table S4. Example SSR data table from Vinod 2011 [33]. Row “GENO6” has been deleted

due to a missing data value in marker SSR2. The correlation matrix of frequencies for this data is singular

due to the lack of variation in columns 4, 6, and 8.

SSR1

SSR2

SSR3

SSR4

SSR5

SSR6

SSR7

SSR8

SSR9

GENO1

110

330

190

140

220

140

240

160

200

GENO2

110

330

190

140

230

140

240

160

190

GENO3

110

320

190

140

220

140

240

160

200

GENO4

110

320

190

140

220

140

240

160

200

GENO5

110

330

180

140

220

140

240

160

200

GENO7

110

330

190

140

220

140

240

160

200

GENO8

110

320

180

140

220

140

240

160

200

GENO9

110

330

190

140

220

140

250

160

200

GENO10

120

320

180

140

220

140

240

160

200

Supplemental Table S5. SSR data values for 9 loci from 2002 Prunus avium (Sweet Cherry) study of

Wünsch & Hormaza [34]. Dual values are treated as rational numbers. Note that single values are

instances of missing scores which skew the analysis as discussed in the main text. The correlation matrix

of frequencies for this data is not singular, with determinant 0.280843. This would not be the case if the

missing scores were recorded as zeros (see Table S3).

Cultivar

Index

Pchcms1

Pchcms3

Pchcms5

UDP96-

005

UDP98-

409

UPD98-

021

UPD98-

022

UPD97-

402

UPD98-

412

Ambrunes

1

140

180/160

290

150/125

160/130

110/100

110/90

125

130

Arcina

2

140

180

260

150/130

130

110/100

100/90

140/125

130

Beige

3

190/140

180

260

150/120

130

110/100

110/90

130

130

Bing

4

190/140

180

260

150/120

130

110/100

110/90

130

130

Blanca de

Provenza

5

140

180/160

290/260

120

130

100

105/90

145/125

130/100

Brooks

6

140

180

290

150/120

130

110/100

90

135/125

130

Burlat

7

140

180

290

150/130

130

110/100

100/90

130/120

130

Burlat C-1

8

140

180

290

150/130

130

110/100

100/90

130/120

130

Celeste

9

140

180

290

150/130

130

110/100

100

135/125

130

Chinook

10

190/140

180

260

150/120

130

110/100

110

145/130

130

Compact

Stella

11

190/140

180

290/260

150/120

130

110/100

110/100

145/125

130

Coralise

12

140

180

260

150/130

130

110/100

100/90

135/125

130

Corum

13

190/140

180

290/260

150/120

130

110/100

110/100

135/125

140/130

Cristalina

14

190/140

180/160

260

150/120

130

110

110/100

135/125

130

Cristo-

balina

15

190/140

180

290/260

150/115

130

110/100

110/90

135/125

130

Duroni 3

16

140

180

290/260

150/115

130

110/100

100/90

125

130

Earlise

17

140

180

290/260

150/130

130

100

90

130/120

130

Earlystar

18

190/140

180

290

150/120

130

100

100/90

145/125

130

Early Van

Compact

19

190/140

180

290/260

150/120

130

110

100/90

135/125

130/120

Ferrovia

20

140

180

290/260

150/120

130

110

100

145/130

140/130

Garnet

21

140

180

260

120

130

110/100

110/90

130

130

Gil Peck

22

190/140

180/160

260

150/120

130

110

110

145/130

130

Giorgia

23

140

180

290/260

150/120

130

110

100/90

125

130/120

Hartland

24

140

180

260

150/120

130

110

100/90

140/130

130

Hedel-

finger

25

140

180

290/260

150/120

130

110/100

110/100

135/125

130/100

Lambert

26

190/140

180/160

290/260

150/120

130

110/100

110

125

130

Lamida

27

140

180

260

150/120

130

110

110

145/125

130

International Journal on Computational Science & Applications (IJCSA) Vol.12, No.4, August 2022

13

Lapins

28

190/140

180

290

150/120

130

110/100

100/90

135/125

130

Larian

29

190/140

180

260

150/120

130

110

110/100

145/130

130

Marmotte

30

190/140

180

290/260

150/120

130

110/100

110/100

135/125

130/100

Marvin

31

140

180

290/260

120

130

110/100

100/90

130/125

130

Moreau

32

140

180

290/260

150/115

130

100

100/90

145/125

130

Napoleon

33

190/140

180

260

150/120

130

110/100

110/100

135/125

130

Newstar

34

140

180

290/260

150/120

130

110/100

100

135/125

130/120

Pico

Colorado

35

140

180/160

290/260

150/130

160/130

110/100

110

140/125

130

Pico Negro

36

140

180/160

290/260

150/115

160/130

100

110/90

140/120

130

Precoce

Bernard

37

140

180

290/260

150/130

130

100

100/90

145/130

130

Rainier

38

190/140

180

290/260

120

130

110/100

90

130

130

Ramon

Oliva

39

140

180/160

290

150/130

130

100

90

120

130

Reverchon

40

140

180

290/260

150/115

130

110/100

100

125

130

Royalton

41

190/140

180

260

150/130

130

100

110/100

135/125

130

Ruby

42

190/140

180

260

120

130

110/100

110/90

130

130

Sam

43

140

180/160

290/260

150/120

130

100

110/100

145/125

130

Samba

44

190/140

180

290/260

150/120

130

110

110/90

135/130

130

Santina

45

190/140

180/160

290/260

150/130

130

100

110/100

135/125

130

Skeena

46

190

180

290/260

150/120

130

100

90

145/125

130

Somerset

47

190/140

180

260

150/120

130

110/100

100

135/125

130/120

Sonata

48

190

180

290

150/120

130

100

100/90

145/125

130/120

Spalding

49

190/140

180

260

150/130

130

110/100

110/100

135/125

140/130

Star

50

190/140

180/160

260

150/120

130

110/100

110/100

135/125

130

Starky

Hardy

Giant

51

140

180

260

150/120

130

110/100

100/90

135/125

130

Sue

52

140

180/160

260

150/120

130

110

110/100

135/125

140/130

Sumesi

53

140

180

290/260

150/120

130

110/100

110/100

145/130

130

Summit

54

140

180/160

290/260

150/120

130

110/100

100

125

130

Sunburst

55

140

180

290

150/120

130

110/100

100/90

145/130

130/120

Sweetheart

56

190/140

180

290/260

150/120

130

110/100

100/90

135/125

130/120

Sylvia

57

140

180/160

290

150/120

130

110/100

100

145/125

130

Taleguera

Brilliante

58

140

180/160

290/260

150/115

130

100

100/90

135

130

Tigre

59

140

180

290

150/115

130

110

110/90

120

130

Van

60

190/140

180

290/260

150/120

130

110

100/90

135/125

130/120

Van Spur

61

190/140

180

290/260

150/120

130

110

100/90

135/125

130/120

Vega

62

190/140

180

260

150/130

130

110

100/90

135/125

140/130

Vic

63

140

180

260

150/120

130

110/100

100/90

135/125

140/130

Vignola

64

140

180

260

150/130

130

110

100

140/125

130

Vittoria

65

140

180

290/260

150/130

130

110

100/90

145/130

130/120

13N.7.19

66

140

180

290/260

150/120

130

110/100

100/90

135/125

130/120

13S.17.20

67

190/140

180

290/260

150/120

130

110

110/90

130

130

13S.18.10

68

190/140

180

290/260

150/120

130

110/100

110/100

135/125

130

13S.18.15

69

190/140

180

290/260

150/120

130

110

110/100

135/125

130/120

13S.21.7

70

140

180

290

150/130

130

110/100

100

135/125

130

13S.27.17

71

140

180

260

150/120

130

110

100/90

135/125

140/130

13S.3.13

72

190/140

180

290

150/120

130

100

110/100

145/130

130

44W.11.8

73

190/140

180

260

150/120

130

110

110/100

145/130

130

83703007

74

190/140

180

290

150/120

130

100

100

145/120

130

84703002

75

190/140

180

290

150/120

130

110/100

100/90

130/120

130

84704006

76

140

180

290

150/130

130

110/100

100/90

130/120

130

International Journal on Computational Science & Applications (IJCSA) Vol.12, No.4, August 2022

14

Supplemental Table S6. Distance matrix from 2016 Moraceae SSR study by Mathi Thumilan et al [27].

The investigators computed distances with one of the metrics in the Darwin v.5.0 program [5]. It is

unknown whether the original SSR data was vector-valued or flattened from a tensor.

Genot

ype

M.

macro

ura

M.

nigra

ME-

107

T-12

C-

1725

T-21

Moul

ai

T-08

C-776

ACC-

118

ACC-

115

Gajap

athipu

ra

SRDC

-1

M.

nigra

0.286

ME-

107

0.302

0.298

T-12

0.322

0.319

0.308

C-

1725

0.353

0.349

0.339

0.277

T-21

0.36

0.357

0.346

0.285

0.313

Moula

i

0.355

0.351

0.341

0.279

0.307

0.259

T-08

0.354

0.35

0.339

0.278

0.306

0.258

0.235

C-776

0.341

0.338

0.327

0.266

0.294

0.246

0.234

0.232

ACC-

118

0.364

0.36

0.35

0.288

0.316

0.288

0.283

0.282

0.27

ACC-

115

0.357

0.354

0.343

0.282

0.31

0.282

0.276

0.275

0.263

0.243

Gajap

athipu

ra

0.36

0.356

0.346

0.285

0.313

0.303

0.297

0.296

0.284

0.306

0.3

SRDC

-1

0.394

0.39

0.38

0.318

0.346

0.336

0.331

0.33

0.318

0.34

0.333

0.319

Sabba

wala-2

0.397

0.393

0.383

0.321

0.349

0.339

0.334

0.333

0.32

0.343

0.336

0.322

0.288

Seizur

o

0.388

0.384

0.374

0.313

0.34

0.331

0.325

0.324

0.312

0.334

0.328

0.313

0.279

Mysor

e

Local

0.404

0.4

0.39

0.328

0.356

0.346

0.341

0.34

0.327

0.35

0.343

0.329

0.295

Kanva

-2

0.398

0.394

0.384

0.323

0.35

0.341

0.335

0.334

0.322

0.344

0.338

0.323

0.309

S-1

0.397

0.393

0.383

0.322

0.349

0.34

0.334

0.333

0.321

0.343

0.337

0.322

0.308

RF-

175

0.407

0.403

0.393

0.331

0.359

0.349

0.344

0.343

0.33

0.353

0.346

0.332

0.317

S-54

0.417

0.414

0.403

0.342

0.37

0.36

0.354

0.353

0.341

0.363

0.357

0.342

0.328

S-13

0.399

0.395

0.385

0.324

0.352

0.342

0.336

0.335

0.323

0.345

0.339

0.324

0.31

DD-1

0.398

0.394

0.384

0.322

0.35

0.34

0.335

0.334

0.321

0.344

0.337

0.323

0.308

V-1

0.406

0.403

0.392

0.331

0.359

0.349

0.343

0.342

0.33

0.353

0.346

0.332

0.317

S-36

0.413

0.41

0.399

0.338

0.366

0.356

0.35

0.349

0.337

0.36

0.353

0.339

0.324

F.

bengh

alensis

0.613

0.609

0.599

0.538

0.565

0.556

0.55

0.549

0.537

0.559

0.553

0.538

0.547

F.

carica

0.612

0.608

0.598

0.536

0.564

0.555

0.549

0.548

0.536

0.558

0.551

0.537

0.546

Jackfr

uit

0.612

0.608

0.597

0.536

0.564

0.554

0.549

0.547

0.535

0.558

0.551

0.537

0.546

Dudia

white

0.866

0.863

0.852

0.791

0.819

0.809

0.803

0.802

0.79

0.813

0.806

0.792

0.801

Genot

ype

Sabba

wala-2

Seizur

o

Mysor

e

Local

Kanv

a-2

S-1

RF-

175

S-54

S-13

DD-1

V-1

S-36

F.

bengh

alensis

F.

carica

Jackfr

uit

Seizur

o

0.214

International Journal on Computational Science & Applications (IJCSA) Vol.12, No.4, August 2022

15

Mysor

e

Local

0.264

0.255

Kanva

-2

0.312

0.303

0.318

S-1

0.311

0.302

0.318

0.223

RF-

175

0.32

0.312

0.327

0.257

0.256

S-54

0.331

0.322

0.338

0.268

0.267

0.228

S-13

0.313

0.304

0.32

0.29

0.289

0.298

0.309

DD-1

0.311

0.303

0.318

0.288

0.287

0.297

0.307

0.229

V-1

0.32

0.311

0.327

0.297

0.296

0.305

0.316

0.238

0.217

S-36

0.327

0.318

0.334

0.304

0.303

0.312

0.323

0.245

0.224

0.192

F.

bengh

alensis

0.55

0.542

0.557

0.552

0.551

0.56

0.571

0.553

0.551

0.56

0.567

F.

carica

0.549

0.541

0.556

0.55

0.549

0.559

0.57

0.552

0.55

0.559

0.566

0.231

Jackfr

uit

0.549

0.54

0.556

0.55

0.549

0.559

0.569

0.551

0.55

0.559

0.566

0.246

0.245

Dudia

white

0.804

0.795

0.811

0.805

0.804

0.814

0.824

0.806

0.805

0.813

0.821

0.861

0.859

0.859