# Robert R. Sokal's research while affiliated with Stony Brook University and other places

## Publications (233)

We have newly constructed an ethnohistorical database consisting of 3460 records of ethnic locations and movements in Europe since 2200 B.C. Using this database, we computed vectors of proportions that peoples speaking various language families contributed to the gene pools of 2216 1° x 1° land-based quadrats of Europe. From these vectors we comput...

Classifications of European populations were produced based on 59 gene frequencies and 10 cranial measurements, but recorded for different population samples. A map-quadrat approach circumvented the problem of noncorrespondence of sampling localities. Clustering by standard numerical taxonomic procedures shows that these data are represented poorly...

Synthetic maps of human gene frequencies, which are maps of principal component scores based on correlations of interpolated surfaces, have been popularized widely by L. Cavalli-Sforza, P. Menozzi, and A. Piazza. Such maps are used to make ethnohistorical inferences or to support various demographic or historical hypotheses. We show from first prin...

In this study we relate language differences on a global scale with genetic distances for the same populations. The analysis is carried out on more populations (130) but fewer genetic systems (11) than earlier studies. We constructed an overall genetic distance matrix that allowed for missing values. A separate genetic distance matrix was also comp...

The premier text in the field, Biometry provides both an elementary introduction to basic biostatistics and coverage of more advanced methods used in biological research. Students are shown how to think through research problems and understand the logic behind the different experimental situations. This book is designed to serve not only as...

The connectivity and edge density properties of Gabriel graphs are investigated. The shortest spanning (nearest neighbor) tree is shown to be a subgraph of the Gabriel graph, which, in turn, has been shown to be a subgraph of the dual of the Thiessen diagram. The Gabriel graph is a planar graph and so can never be more than three times as edge-dens...
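The subgraph relation stated in this abstract can be checked on a small point set. The following is a minimal sketch, not the authors' code: it assumes the usual definition of a Gabriel graph (two points are joined when no third point lies strictly inside the circle whose diameter is the segment between them) and uses a simple Prim-style routine for the shortest spanning tree. The function names `gabriel_edges` and `mst_edges` are illustrative.

```python
from itertools import combinations


def sq_dist(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def gabriel_edges(points):
    """Return Gabriel-graph edges as sorted index pairs (brute force, O(n^3)).

    A point r lies strictly inside the circle with diameter pq exactly when
    d(p, r)^2 + d(q, r)^2 < d(p, q)^2, so an edge is kept when no r does.
    """
    edges = []
    for i, j in combinations(range(len(points)), 2):
        d_ij = sq_dist(points[i], points[j])
        blocked = any(
            sq_dist(points[i], points[k]) + sq_dist(points[j], points[k]) < d_ij
            for k in range(len(points))
            if k not in (i, j)
        )
        if not blocked:
            edges.append((i, j))
    return edges


def mst_edges(points):
    """Return shortest-spanning-tree edges using a simple Prim-style search."""
    n = len(points)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # Pick the shortest edge leaving the current tree.
        i, j = min(
            ((i, j) for i in in_tree for j in range(n) if j not in in_tree),
            key=lambda e: sq_dist(points[e[0]], points[e[1]]),
        )
        in_tree.add(j)
        edges.append((min(i, j), max(i, j)))
    return edges


if __name__ == "__main__":
    pts = [(0, 0), (4, 0), (2, 1), (6, 4), (0, 6)]
    gabriel = gabriel_edges(pts)
    tree = mst_edges(pts)
    print("Gabriel edges:", gabriel)
    print("Spanning tree is a subgraph:", set(tree) <= set(gabriel))
```

The brute-force test is cubic in the number of localities, which is acceptable for the sample sizes typical of geographic variation studies; larger sets would normally start from a Delaunay triangulation instead.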

We review the recently developed local spatial autocorrelation statistics Ii, ci, Gi, and Gi*. We discuss two alternative randomization assumptions, total and conditional, and then newly derive expectations and variances under conditional randomization for Ii and ci, as well as under total randomization for ci. The four statistics are tested by a b...

Wombling is a method for discovering boundaries in a collection of continuous variables observed at the same geographic localities. We extend this method to categorical data. A categorical wombling statistic Ci, which identifies areas of rapid change, is defined for every pair i = 1,…, n of adjacent localities, and is equal to the average number of...

We develop tests of whether a pattern of geographic variation departs significantly from random variation over an area. Localities are vertices in a graph whose edges are connections based on criteria of geographic contiguity. Ranked variables are assigned to each locality. Distributions of absolute differences in rank along edges between vertices...

Tests for differences among regional means are typically carried out by analysis of variance (ANOVA). When such data are spatially autocorrelated (SA), the assumptions of ANOVA are not met, giving rise to excessive type I error rates. Two spatially adjusted ANOVA methods, Griffith's and COCOPAN, have been proposed to overcome this problem. In this...

Spatial patterns are described and analysed for the 84 most common surnames in England and Wales, as well as 16 others selected for various reasons. At least three-quarters of the surname frequencies show spatial structure and are heterogeneous over the area of study. While they do not exhibit clines extending over the entire area of study, they do...

The geographic covariation of the eastern cottonwood Populus deltoides with three different gall-forming aphids in the genus Pemphigus is studied over eastern North America. Ten vegetative Populus characters were analyzed together with 32 stem mother, alate, and gall dimension characters in Pemphigus populicaulis and in two morphs of P. populitransve...

The geographic variation of 33 morphological characters of the gall-forming aphid Pemphigus populicaulis is studied for 118 localities east of 100°W longitude. Variation can be partitioned into within-gall, among-gall and among-locality components. Among-locality variation ranges from 26 to 54%, being significant for all characters. Variation amo...

The geographic variation of 33 morphological characters of two morphs of the gall-forming aphid Pemphigus populitransversus is studied in 214 locality samples. Among-locality variation ranges from 1 to 69% in the elongate morph and from 0 to 44% in the globular morph. The design of the study permits separation of interlocality correlations from int...

Spatial autocorrelation (SA) methods were recently extended to detect local spatial autocorrelation (LSA) at individual localities. LSA statistics serve as useful indicators of local genetic population structure. We applied this method to 15 allele frequencies from 43 villages of a South American tribe, the Yanomama. Based on a network of links <or...

During much of his academic career, Raymond Pearl advocated the employment of quantitative and statistical methods in biology and applied them in his own voluminous research output, much of it in various aspects of human biology. If he were to return today, he would be pleased to note the fruits of his advocacy: the widespread quantification of biol...

We introduce a statistical protocol for analyzing spatially varying data, including putative explanatory variables. The procedures comprise preliminary spatial autocorrelation analysis (from an earlier study), path analysis, clustering of the resulting set of path diagrams, ordination of these diagrams, and confirmatory tests against extrinsic info...

We have previously shown that geographic differences in cancer mortalities in Europe are related to (in order of importance): geographic distances (reflecting environmental differences), ethnohistoric distances (encompassing cultural and genetic attributes), and genetic distances of the populations in the areas studied. In this study, we analyzed t...

We applied the techniques of spatial autocorrelation (SA) analysis to 40 cancer mortality distributions in Western Europe. One of the aims of these methods is to describe the scale over which spatial patterns of mortalities occur, which may provide suggestions concerning the agents bringing about the patterns. We analyzed 355 registration areas, ap...

Inhabitants of the Croatian islands of Brac, Hvar, Korcula, and the Peljesac Peninsula have been the subject of extensive previous studies of local population differentiation. Most of these studies used biological and ecological variables, but some also considered historical and sociological factors. In this study we use genetic, morphological, kin...

Spatial autocorrelation (SA) methods have recently been extended to include the detection of local spatial autocorrelation at individual sampling stations. We review the formulas for these statistics and report on the results of an extensive population-genetic simulation study we have published elsewhere to test the applicability of these methods i...

We studied spatial patterns for 24 allele frequencies representing 15 systems (blood antigens, enzymes, serum proteins, color blindness, and cerumen) in Japan. The total number of samples over all systems and localities is 1125. We investigated patterns of genetic variation graphically as interpolated allele frequency surfaces, as one-dimensional a...

Geographic variation in cancer rates is thought to be the result of two major factors: environmental agents varying spatially and the attributes, genetic or cultural, of the populations inhabiting the areas studied. These attributes in turn result from the history of the populations in question. We had previously constructed an ethnohistorical data...

This paper examines competing theories for cases in which both the data and the hypotheses can be represented as distance matrices. A test due to Dow & Cheverud has been used for such comparisons in anthropology, but when data are spatially, temporally, or phylogenetically autocorrelated, this test may be far too liberal. We examine a classificatio...

To explore the extent to which microevolutionary inference can be made using spatial autocorrelation analysis of gene frequency surfaces, we simulated sets of surfaces for nine evolutionary scenarios, and subjected spatially-based summary statistics of these to linear discriminant analysis. Scenarios varied the amounts of dispersion, selection, mig...

Population movements of 891 ethnic units in Europe over the past 4,200 years, and the correlations of these movements with modern genetic distances were investigated on a one-degree-square grid of the continent. There is significant spatial pattern in movements from sources, to targets, and overall. Patterns change significantly over time. Patterns...

We describe the geographic variation patterns of 236 dermatoglyphic variables (118 for each sex) for 74 samples in Europe. Using principal components analysis and rotating to simple structure, we simplified these patterns to the first 20 axes, representing 74.2% of covariation. We then used heterogeneity tests, interpolated surfaces, one-dimensiona...

A series of tests was undertaken to relate lexicostatistical dissimilarities (LAN) among 48 Indo-European languages to distances representing various causal hypotheses. The comparison is limited to languages currently spoken in Europe. The putative causal distance matrices include (1) geographic (GEO) distances between the languages, (2) distances...

Allele frequency distributions were generated by computer simulation of five models of microevolution in European populations. Genetic distances calculated from these distributions were compared with observed genetic distances among Indo-European speakers. The simulated models differ in complexity, but all incorporate random genetic drift and short...

Spatial patterns were studied for 36 allele frequencies representing 14 genetic systems (blood antigens, enzymes and serum proteins) in the United Kingdom and Irish Republic. The total number of data points over all systems and localities is 331. Patterns of genetic variation in space are graphically represented by one-dimensional and directional c...

From 420 records of ethnic locations and movements since 2000 B.C., we computed vectors describing the proportions which peoples of the various European language families contributed to the gene pools within 85 land-based 5 x 5-degree quadrats in Europe. Using these language family vectors, we computed ethnohistorical affinities as arc distances be...

We studied the factors affecting the accuracy of the neighbor-joining (NJ) method for estimating phylogenies by simulating character change under different evolutionary models applied to twenty different 8-OTU tree topologies that varied widely with respect to tree imbalance and stemminess. The models incorporated three evolutionary rates: constant,...

We describe the geographic variation patterns of six dermatoglyphic traits from 144 samples in Eurasia. The methods of analysis include computation of interpolated surfaces, one-dimensional and directional correlograms, correlations between all pairs of surfaces, and distances between correlograms. There are at least two, probably three, distinct a...

Several methods have recently been introduced for investigating relations between three interpoint proximity matrices A, B, C, each of which furnishes a different type of distance between the same objects. Smouse, Long, and Sokal (1986) investigate the partial correlation between A and B conditional on C. Dow and Cheverud (1985) ask whether corr(A, C),...
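The partial correlation between A and B conditional on C mentioned here can be illustrated with a short sketch. It is not a reproduction of the published procedure: it assumes the standard first-order partial correlation formula applied to the off-diagonal elements of symmetric distance matrices, and it assesses significance by jointly permuting the rows and columns of A, which is only one of several permutation schemes in use. The names `partial_r` and `partial_mantel` are hypothetical.

```python
import math
import random


def upper(m):
    """Flatten the strict upper triangle of a square matrix."""
    n = len(m)
    return [m[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_r(a, b, c):
    """Matrix correlation of A and B with the linear effect of C removed."""
    r_ab = pearson(upper(a), upper(b))
    r_ac = pearson(upper(a), upper(c))
    r_bc = pearson(upper(b), upper(c))
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))


def permute(m, order):
    """Reorder rows and columns of a matrix by the same permutation."""
    return [[m[i][j] for j in order] for i in order]


def partial_mantel(a, b, c, n_perm=999, seed=0):
    """Return (observed partial r, one-tailed permutation p-value).

    Assumption: objects of A are permuted while B and C are held fixed.
    """
    rng = random.Random(seed)
    observed = partial_r(a, b, c)
    order = list(range(len(a)))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(order)
        if partial_r(permute(a, order), b, c) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


def abs_diff_matrix(values):
    """Build a distance matrix from absolute differences of scalar values."""
    return [[abs(u - v) for v in values] for u in values]


if __name__ == "__main__":
    a = abs_diff_matrix([2, 7, 1, 8, 3])    # hypothetical genetic distances
    b = abs_diff_matrix([1, 6, 2, 9, 4])    # hypothetical linguistic distances
    c = abs_diff_matrix([0, 1, 3, 6, 10])   # hypothetical geographic distances
    print(partial_mantel(a, b, c, n_perm=199))
```

Because the elements of a distance matrix are not independent, the permutation step rather than a tabled correlation test supplies the significance level; the choice of which matrix or residuals to permute affects the error rates and should be treated as an assumption of this sketch.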

Two theories of the origins of the Indo-Europeans currently compete. M. Gimbutas believes that early Indo-Europeans entered southeastern Europe from the Pontic Steppes starting ca. 4500 B.C. and spread from there. C. Renfrew equates early Indo-Europeans with early farmers who entered southeastern Europe from Asia Minor ca. 7000 B.C. and spread throug...

The character and OTU stability of classifications based on UPGMA clustering and maximum parsimony (MP) trees were compared for 5 datasets (families of angiosperms, families of orthopteroid insects, species of the fish genus Ictalurus, genera of the salamander family Salamandridae, and genera of the frog family Myobatrachidae). Stability was investi...

A summary ethnohistory database on population movements in Europe between 2000 B.C. and A.D. 1970 was related to genetic variances and distances based on 26 genetic systems. For the purposes of these analyses, Europe was divided into 85 terrestrial quadrats measuring 5 degrees x 5 degrees. Counts, stratified by time, were taken of the number of mov...

Samples of the gall-forming aphids Pemphigus populicaulis and P. populitransversus (both elongate and globular morphs) were re-collected at sites in eastern North America after 13 to 16 years. Twenty-three morphometric characters of the galls, stem mothers, and alate fundatrigeniae were analyzed by univariate and multivariate methods. Varying propo...

Genetic relations between various Jewish (J) and non-Jewish (NJ) populations were assessed using two sets of data. The first set contained 12 pairs of matched J and NJ populations from Europe, the Middle East, and North Africa, for which 10 common polymorphic genetic systems (13 loci) were available. The second set included 22 polymorphic genetic s...

The diversity of spatial patterns of 61 allele frequencies for 20 genetic systems (15 loci) in Italy is presented. Blood antigens, enzymes, and proteins were analyzed. The total number of data points over all systems and localities was 1119. We used homogeneity tests, one-dimensional and directional spatial correlograms, and SYMAP interpolated surf...

European agriculture originated in the Near East about 9,000 years ago. The Neolithic reached almost all areas suitable for agriculture by 5,000 yr BP (before present). The routes and times of the spread of agriculture through Europe are relatively well established, but not its manner of spreading. This could have been by cultural diffusion with fe...

Three approaches were employed to evaluate the relative importance of geographic and linguistic factors in maintaining genetic differentiation of Italian populations as shown by blood groups and erythrocyte and serum markers. Genetic distances are closer to linguistic than to geographic distances. Gene-frequency change across 12 linguistic boundari...

We generated numerous simulated gene-frequency surfaces subjected to 200 generations of isolation by distance with, in some cases, added migration or selection. From these surfaces we assembled six data sets comprising from 12 to 15 independent allele-frequency surfaces, to simulate biologically plausible population samples. The purpose of the stud...

A simulation study was carried out to investigate the relative importance of tree topology (both balance and stemminess), evolutionary rates (constant, varying among characters, and varying among lineages), and evolutionary models in determining the accuracy with which phylogenetic trees can be estimated. The three evolutionary context models were...

A newly elaborated method, "Wombling," for detecting regions of abrupt change in biological variables was applied to 63 human allele frequencies in Europe. Of the 33 gene-frequency boundaries discovered in this way, 31 are coincident with linguistic boundaries marking contiguous regions of different language families, languages, or dialects. The re...

Migration, selection, and spatial differentiation determine the patterns of geographic variation in the gene frequencies of human populations. Inferences about past processes must be made from current patterns. The use of language differences as a variable concomitant to gene frequencies allows such inferences despite the complex relationship betwe...

The classical method for analysis of variance of data divided in geographic regions is impaired if the data are spatially autocorrelated within regions, because the condition of independence of the observations is not met. Positive autocorrelation reduces within-group variability, thus artificially increasing the relative amount of among-group vari...

Geographic variation trends are often quite complex and consist of variation at different spatial scales. In such cases an analysis of spatial structure by spatial autocorrelation analysis is confounded by this intermixing of different scales. Trend surface analysis (TSA) or canonical trend surface analysis (CTSA) offer ways of overcoming this prob...

The areas where the rates of change of biological variables across space are particularly high may correspond to either steep ecological gradients or regions of limited admixture among demes. A method for detecting such biological boundaries was proposed by Womble (1951), who suggested averaging the absolute values of the derivatives of the functio...

The aims of this study of spatial patterns of human gene frequencies in Europe are twofold. One is to present new methodology developed for the analysis of such data. The other is to report on the diversity of spatial patterns observed in Europe and their interpretation as evidence of population processes. Spatial variation in 59 allele and haploty...

We investigated whether 59 allele frequencies and 10 cranial variables differed among speakers of the 12 modern language families in Europe. Although this is a classical analysis of variance design, special techniques had to be developed for the analysis because of spatial autocorrelation of both biological and language data. The method examines po...

We test various assumptions necessary for the interpretation of spatial autocorrelation analysis of gene frequency surfaces, using simulations of Wright's isolation-by-distance model with migration or selection superimposed. Increasing neighborhood size enhances spatial autocorrelation, which is reduced again for the largest neighborhood sizes. Spa...

The effect of six resemblance coefficients (taxonomic distance, Manhattan distance, correlations, cosines, and two new general dissimilarity coefficients) on the character stability of classifications based on six data sets was evaluated. The six data sets represent a variety of organisms, and of ratios of number of characters to number of OTUs, an...

Genetic distances among speakers of the European language families were computed by using gene-frequency data for human blood group antigens, enzymes, and proteins of 26 genetic systems. Each system was represented by a different subset of 3369 localities across Europe. By subjecting the matrix of distances to numerical taxonomic procedures, we obt...

By means of three different methods we investigated whether 59 allele frequencies and ten cranial variables show increased change at 29 language-family boundaries in Europe. The quadrat-variance method compares variances of map quadrats crossed by language-family boundaries to variances of quadrats that are not crossed. The rate-of-change method ex...

Genetic and taxonomic distances were computed for 3466 samples of human populations in Europe based on 97 allele frequencies and 10 cranial variables. Since the actual samples employed differed among the genetic systems studied, the genetic distances were computed separately for each system, as were matrices of geographic distances and of linguisti...

We analyze the taxonomic structure of European populations at three time periods, the Early Middle Ages, the Late Middle Ages and the Recent Period. The data consist of sample means for 10 cranial variables based on 137, 108, and 183 samples for the three periods. Clustering by standard numerical taxonomic procedures reveals that the data are repre...

This study reports on spatial variation of 10 cranial variables in European populations at 3 time periods. Means for these variables, based on 137, 108, and 183 samples from the Early Medieval, Late Medieval, and Recent periods, were subjected to one-dimensional and directional spatial autocorrelation analyses. Significant spatial structure was fou...

The methods of spatial autocorrelation analysis for both continuous and nominal variables are explained. Spatial correlograms depict autocorrelation as a function of geographic distance. They permit inferences from patterns to process. The Mantel test and its extensions are special ways of detecting autocorrelation in ecology. The methods are appli...
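A spatial correlogram of the kind described here can be sketched in a few lines. This is an illustrative example under stated assumptions rather than the procedure used in the paper: it computes Moran's I for a continuous variable with binary weights (1 when the distance between two localities falls in the class, 0 otherwise), and the function names and the transect data are hypothetical.

```python
import math


def morans_i(values, coords, lower, upper):
    """Moran's I for pairs of localities whose distance d has lower < d <= upper.

    Returns None when the distance class contains no pairs or the variable
    has zero variance.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    ss = sum(d * d for d in dev)
    cross = 0.0
    w_total = 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = math.dist(coords[i], coords[j])
            if lower < d <= upper:
                cross += dev[i] * dev[j]
                w_total += 1
    if w_total == 0 or ss == 0:
        return None
    return (n / w_total) * (cross / ss)


def correlogram(values, coords, bounds):
    """Moran's I for each consecutive pair of distance-class bounds."""
    return [
        morans_i(values, coords, lo, hi)
        for lo, hi in zip(bounds, bounds[1:])
    ]


if __name__ == "__main__":
    # Hypothetical transect: six localities with a steady cline in the variable.
    coords = [(x, 0) for x in range(6)]
    values = [1, 2, 3, 4, 5, 6]
    for k, coeff in enumerate(correlogram(values, coords, [0, 1, 2, 3, 4, 5]), 1):
        print(f"distance class {k}: I = {coeff:.3f}")
```

On a cline such as this one the coefficients decline from positive at short distances to negative at the largest distances, which is the kind of pattern-to-process reading the correlogram is meant to support. Significance testing of each coefficient is omitted from the sketch.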

The spatial structure of 12 allele frequencies was examined for 57 populations of the cactophilic fly Drosophila buzzatii from eastern Australia. Techniques include spatial-autocorrelation analysis and the newly developed directional spatial autocorrelation. Although 11 allele frequencies differ among localities, only 6 show spatial structure along...

Monte Carlo methods were used to examine the sampling distribution of eight consensus indices based on either of the following two models: all bifurcating trees are equally likely; or all trees (including both bifurcating and multifurcating trees) are equally likely. Ten different consensus-tree methods were applied before computing consensus indic...

Fifteen allele frequencies have previously been determined for 50 villages of the Yanomama, an Amerindian tribe from southern Venezuela and northern Brazil. These frequencies were subjected to spatial autocorrelation analysis to investigate their population structure. There are significant spatial patterns for most allele frequencies. Clinal patt...

Spatial autocorrelation analysis was used to study the patterns of geographic variation in Populus deltoides (Salicaceae). Ten characters reflecting the vegetative morphology in this species were analyzed for each of 522 individuals in 302 localities scattered throughout eastern North America. Factor analysis reduced the dimensionality of the matri...

Reviews the principles for forming biological classifications and summarizes recent findings concerning optimality criteria for classifications. Natural taxa are recognized as polythetic and related to concept formation in cognitive psychology. The 3 currently advocated schools of taxonomy are reviewed and their assumptions and purposes compared. T...

We developed a simulation model of phylogenesis with which we generated a large number of phylogenies and associated data matrices. We examined the characteristics of these and evaluated the success of three taxonomic methods (Wagner parsimony, character compatibility, and UPGMA clustering) as estimators of phylogeny, paying particular attention...

This study examines whether a satisfactory classification can be obtained from phenetic clustering of herbarium specimens based on vegetative characters alone. Twelve leaf and twig characters of 731 specimens in 13 taxa within the genus Populus L. were subjected to standard numerical phenetic techniques. A classification based on character means fo...

