Question
Asked 2nd Feb, 2014

What does the warning message "1: In is.euclid(d) : Zero distance(s)" mean?

** Package ade4, command dist.binary
I would like to estimate genetic distance using the simple matching coefficient of Sokal & Michener (1958).
My data object is named snp, and I ran the command as follows:
snp
for (i in 1:10) {
d <- dist.binary(snp, method = 2)
cat(attr(d, "2"), is.euclid(d), "\n")}
and the results were
> for (i in 1:10) {
+ d <- dist.binary(snp, method = 2)
+ cat(attr(d, "2"), is.euclid(d), "\n")}
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
Warning messages:
1: In is.euclid(d) : Zero distance(s)
2: In is.euclid(d) : Zero distance(s)
3: In is.euclid(d) : Zero distance(s)
4: In is.euclid(d) : Zero distance(s)
5: In is.euclid(d) : Zero distance(s)
6: In is.euclid(d) : Zero distance(s)
7: In is.euclid(d) : Zero distance(s)
8: In is.euclid(d) : Zero distance(s)
9: In is.euclid(d) : Zero distance(s)
10: In is.euclid(d) : Zero distance(s)
I would like to ask you,
1- what does the warning message mean?
I would like to be sure that I correctly deal with missing values.
2- How can I deal with missing values?

Popular answers (1)

5th Feb, 2014
Andrés Pérez-Figueroa
Industry
The warning message means that there is at least one zero distance in your distance matrix d. This happens when at least two rows of your snp matrix are identical (so the distance between them is 0). The warning is issued by the is.euclid() function and is only there to inform you that you have identical rows (which you may want to collapse into one).
This is an example:
# ALL ROWS DIFFERENT - No warning
> snp <- matrix(c(1,0,0,0,0,0,1,1,1,1,0,1), nrow=3)
> snp
     [,1] [,2] [,3] [,4]
[1,]    1    0    1    1
[2,]    0    0    1    0
[3,]    0    0    1    1
> dist.binary(snp, method=2)
          1         2
2 0.7071068
3 0.5000000 0.5000000
> is.euclid(dist.binary(snp, method=2))
[1] TRUE
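These values can be checked by hand. The ade4 documentation describes the dist.binary distances as being of the form d = sqrt(1 - s), and for method 2 the similarity s is the proportion of positions at which two rows agree. A minimal base R sketch under that assumption (the helper name sm_dist is made up for illustration and is not part of ade4):

```r
# Hypothetical helper, not part of ade4: simple matching distance
# between two 0/1 vectors, computed as sqrt(1 - proportion of matches).
sm_dist <- function(x, y) {
  s <- mean(x == y)   # simple matching coefficient: matches / positions
  sqrt(1 - s)
}

snp <- matrix(c(1,0,0,0,0,0,1,1,1,1,0,1), nrow = 3)

d12 <- sm_dist(snp[1, ], snp[2, ])  # rows 1 and 2 agree at 2 of 4 positions
d13 <- sm_dist(snp[1, ], snp[3, ])  # rows 1 and 3 agree at 3 of 4 positions
d23 <- sm_dist(snp[2, ], snp[3, ])  # rows 2 and 3 agree at 3 of 4 positions
print(c(d12, d13, d23))
```

The three numbers should agree with the dist.binary output above, and a pair of identical rows would give s = 1 and therefore a distance of exactly 0.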
# ROWS 1 AND 3 IDENTICAL - Warning
> snp <- matrix(c(0,0,0,0,0,0,1,1,1,1,0,1), nrow=3)
> snp
     [,1] [,2] [,3] [,4]
[1,]    0    0    1    1
[2,]    0    0    1    0
[3,]    0    0    1    1
> dist.binary(snp, method=2)
    1   2
2 0.5
3 0.0 0.5
> is.euclid(dist.binary(snp, method=2))
[1] TRUE
Warning message:
In is.euclid(dist.binary(snp, method = 2)) : Zero distance(s)
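To see which rows trigger the warning in a larger data set, the identical rows can be located directly with base R. This is only a sketch; whether duplicates should be collapsed depends on what the rows represent in your data.

```r
snp <- matrix(c(0,0,0,0,0,0,1,1,1,1,0,1), nrow = 3)

# Rows that repeat an earlier row (TRUE marks the second and later copies)
dup <- duplicated(snp)
print(which(dup))

# Rows involved in any duplication, counting the first copy as well
in_dup <- dup | duplicated(snp, fromLast = TRUE)
print(which(in_dup))

# One copy of each distinct row, in case you decide to collapse duplicates
snp_unique <- unique(snp)
print(snp_unique)
```

Here in_dup flags rows 1 and 3, the pair with distance 0.0 in the dist.binary output above.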
On the other hand, dist.binary does not allow missing values: it only accepts binary data, FALSE (0) / TRUE (any positive integer). If the data contain any NA (missing value), it yields an error:
> snp <- matrix(c(NA,0,0,0,0,0,1,1,1,1,0,1), nrow=3)
> snp
     [,1] [,2] [,3] [,4]
[1,]   NA    0    1    1
[2,]    0    0    1    0
[3,]    0    0    1    1
> dist.binary(snp, method=2)
Error in if (any(df < 0)) stop("non negative value expected in df") :
missing value where TRUE/FALSE needed
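Two possible workarounds are sketched below in base R. Neither is a feature of dist.binary, and both are assumptions about what might suit your data: option A drops every column that contains an NA before the distances are computed, and option B computes the simple matching distance for each pair of rows using only the positions observed in both (the helper name sm_dist_na is hypothetical).

```r
snp <- matrix(c(NA,0,0,0,0,0,1,1,1,1,0,1), nrow = 3)

# Option A: keep only the columns (markers) with no missing values;
# the result contains no NA and can be passed to dist.binary as usual.
snp_complete <- snp[, colSums(is.na(snp)) == 0, drop = FALSE]

# Option B (hypothetical helper, not part of ade4): simple matching distance
# sqrt(1 - s), with s computed over the positions observed in both rows.
sm_dist_na <- function(m) {
  n <- nrow(m)
  out <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      ok <- !is.na(m[i, ]) & !is.na(m[j, ])  # positions observed in both rows
      s <- mean(m[i, ok] == m[j, ok])        # NaN if the rows share no observed position
      out[i, j] <- sqrt(1 - s)
    }
  }
  as.dist(out)
}

d_na <- sm_dist_na(snp)
print(d_na)
```

With option B each pair of rows may be compared over a different set of positions, so the resulting matrix is not guaranteed to be Euclidean; it is worth checking it with is.euclid() before using it in an analysis that assumes so.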
Please note that your example code does the same thing in all 10 iterations of the for loop. I think the code is adapted from the example in the dist.binary documentation, where the loop is intended to show the outcome for each of the 10 methods. To do that, use method = i instead of method = 2 (if you only need the simple matching coefficient, a single call with method = 2 and no loop is enough):
d <- dist.binary(snp, method = i)
I hope this answers your question.

All Answers (2)

5th Feb, 2014
Lindsay Virginia Clark
University of Illinois, Urbana-Champaign
I am not sure why you are running a "for" loop since it looks like you are simply performing the same calculation ten times. I am guessing the error meant that something went wrong at the dist.binary stage and produced a dist object d that is empty. Have you looked at the contents of the R object "snp" to make sure your data was imported correctly? Have you looked at the contents of the object "d" to make sure that dist.binary did what was expected?

