# Ecological Statistics - Science topic

Questions related to Ecological Statistics
• asked a question related to Ecological Statistics
Question
Hi all,
I would like to perform a redundancy analysis with a response coded as a distance matrix (dissimilarity in species composition). That can be done using the #dbrda function in the #vegan package for R. However, the problem is that I have two matrices as predictors: one with "raw" data (soil variables) and the other a distance matrix calculated from "spatial" distances among sites. The #dbrda function accepts more than one explanatory matrix, but not if one of them is a distance matrix. Does anyone know if there is another function in R able to do this? Probably in #ade4 or #phytools?
Or did I understand something wrong?
Nevertheless, I completely disagree with the previous comment.
In my opinion it always makes sense to understand what happens in an analysis and to be able to access the intermediate steps.
Jan
• asked a question related to Ecological Statistics
Question
It is somewhat common in ecological analyses to see a principal components analysis (PCA) used as a variable reduction step, followed by the use of PCA axis loadings in linear regression (or related) analyses.
Do you know if non-metric multidimensional scaling (NMDS) site scores can be used in the same way?
Are there any reasons why they should not be?
The short answer is no. Why? Because nMDS plots don't technically have axes. Plots can be enlarged, reduced, rotated or flipped arbitrarily, and all that matters are the relative distances among objects in the chosen number of dimensions in which the ordination has been produced. OK, axes (and scores) are used to plot the result, but they have no interpretation. What you want, to do what you describe, is output from an ordination method that explicitly tries to fit axes in the full multivariate space. PCA does this, and the axes are interpretable. PCoA is another option, which allows a wider range of resemblance measures to be used. Finally, mMDS (metric MDS) might be an option, as the axes are interpretable, but mMDS often has high stress owing to the way it works, so something like PCA/PCoA might be better.
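The point about arbitrary rotation can be illustrated with a small sketch. It is written in plain Python with made-up 2-D scores for four hypothetical sites (none of these numbers come from the thread): rotating the configuration leaves every inter-site distance unchanged, yet the "axis 1" scores that a regression would use are different.

```python
import math


def pairwise_distances(points):
    """Euclidean distance between every pair of points, in a fixed order."""
    return [
        math.dist(points[i], points[j])
        for i in range(len(points))
        for j in range(i + 1, len(points))
    ]


def rotate(points, degrees):
    """Rotate 2-D points about the origin by the given angle."""
    theta = math.radians(degrees)
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


# Hypothetical ordination scores for four sites (illustrative values only).
scores = [(0.0, 0.0), (1.0, 0.2), (0.4, 1.1), (-0.8, 0.5)]
rotated = rotate(scores, 40)

# The inter-site distances, which are all nMDS tries to preserve, are unchanged ...
same_distances = all(
    math.isclose(a, b, abs_tol=1e-12)
    for a, b in zip(pairwise_distances(scores), pairwise_distances(rotated))
)

# ... but the first-axis scores are not the same numbers any more.
axis1_before = [x for x, _ in scores]
axis1_after = [x for x, _ in rotated]

print("distances preserved:", same_distances)
print("axis 1 before:", axis1_before)
print("axis 1 after: ", axis1_after)
```

Both configurations are equally valid representations of the same distances, which is why a regression on one particular set of nMDS axis scores has no stable interpretation.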
• asked a question related to Ecological Statistics
Question
I am testing several environmental variables' ability to predict fruit production in a tropical forest over 24 years, with an eye on how climate change may affect fruit production. I need to consider several different time scales (time lags) for each of these environmental variables. For example, I'm looking at rainfall 3 months, 6 months, and 1 and 2 years prior to the fruit production.
I'm attempting to determine the variables to include in my model by using Bayesian model averaging by including 5 different time frames of 4 different variables (so effectively 20 different covariates).
My first question is whether including the same variable, but at several different time lags, would confound the variable selection process.
My second question is: once I've run the analysis (Bayesian model averaging of GAM using a spike and slab prior with the gamclass::spikeSlabGAM function in R) and identified the covariates with high probability of inclusion (posterior inclusion probabilities P[gamma = 1]), is it then appropriate to re-analyze the model with only the selected covariates? I have done this, and re-running the model with only the selected covariates slightly alters the inclusion probabilities.
One last question: If anyone is familiar with gamclass and the spikeSlabGAM function, what would then be an appropriate way of determining how much variability is accounted for by my model (i.e., is there a Bayesian equivalent to GAM Rsquared values)? I'm new to Bayesian statistics, so I apologize if this is a very basic question!
One way to evaluate how a model works is to compare it with actuals. Perhaps, a very general comment, but still: you can try various options and then compare the performance of out-of-sample predictions. Here you can use metrics for accuracy and bias. Please let me recommend this work:
If you use time series from non-negative domain and transform them, I would recommend using transformed time series and then evaluate forecast performance using the workflows described in the above paper.
Log-transform seems appropriate in your case, but you need to evaluate the quality of prediction using the overestimation percentage (OPc), as suggested in the above paper.
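As a rough illustration of comparing out-of-sample predictions on accuracy and bias, here is a minimal sketch in plain Python. The held-out values and the two sets of predictions are invented for the example, and the metrics shown are the ordinary mean absolute error and mean error, not the OPc measure from the recommended paper (whose definition is not given in this thread).

```python
def mean_absolute_error(actual, predicted):
    """Accuracy: average size of the prediction errors, ignoring sign."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


def mean_error(actual, predicted):
    """Bias: average signed error; positive means over-prediction."""
    return sum(p - a for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical held-out observations and predictions from two candidate models.
actual = [12.0, 30.0, 8.0, 21.0]
model_a = [14.0, 27.0, 9.0, 25.0]
model_b = [13.0, 33.0, 10.0, 24.0]

for name, predicted in (("model A", model_a), ("model B", model_b)):
    print(
        name,
        "MAE =", mean_absolute_error(actual, predicted),
        "bias =", mean_error(actual, predicted),
    )
```

In this toy example model B is slightly more accurate but over-predicts every observation, which is the kind of trade-off that looking at accuracy alone would hide.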
• asked a question related to Ecological Statistics
Question
I designed a factorial experiment involving 2 qualitative explanatory variables (A and B). Because I couldn't meet the assumptions of a parametric model, I used kruskal.test on the response variable (VAR) for A and B separately: kruskal.test(VAR ~ A, data = data) and kruskal.test(VAR ~ B, data = data).
But I was also interested in the effect of the "A and B interaction" on VAR. So, does anybody know if it is right to perform a Kruskal-Wallis test on interactions? Here is what I did in R:
interAB <- interaction(data$A, data$B)
kruskal.test(VAR ~ interAB, data = data)
Moreover, to assess which levels of each variable differ significantly from each other, I used the following post-hoc test after kruskal.test: pairwise.wilcox.test(data$VAR, data$A, p.adjust.method = "holm", exact = FALSE, paired = FALSE). The pairwise test didn't work on the variable interAB, and I was wondering what method I should use as a post-hoc test for each of the variables A and B and for the interaction interAB.
Obviously this is an older thread, but it popped up for me, so I thought I would add a few thoughts.
First, there are methods to conduct a nonparametric analysis for a factorial design. One is the Scheirer–Ray–Hare Test. Another is the aligned ranks anova.
Second, I probably wouldn't recommend pairwise Wilcoxon-Mann-Whitney tests as a post-hoc test. For Kruskal-Wallis, I like the Dunn (1964) test as a post-hoc. For other options, see the tests with functions beginning with "kw" at the following: https://cran.r-project.org/web/packages/PMCMRplus/PMCMRplus.pdf .
• asked a question related to Ecological Statistics
Question
Hello all,
This is a real dbRDA plot using real invertebrate abundance data (taxa-station matrix) with environmental data (substrate characteristics-station matrix) as predictor variables. The plot is produced in PRIMER v.7. Invertebrate data is 4th root transformed, Bray-Curtis similarity was used. Environmental data is normalized, Euclidean distance was used.
My question is: why is the vector overlay not centered at 0,0 in the plot? Interpreting this plot, one would conclude that every sampling station within the study area has values below the mean for predictor variables 2 and 13, which is impossible. Why would the center of the vector overlay be displaced by -40 units? How can this be? Why is the overlay centered on the dbRDA2 axis but not on the dbRDA1 axis?
The analysis is fine. The position of the vector diagram relative to the ordination is arbitrary; it could just as easily be in a separate key. The diagram indicates the direction across the ordination plane in which values of the selected variables increase. The length of each line indicates how much of the total variation in that variable is explained in the chosen ordination plane. If all of the variation is explained, the line reaches the circle.
• asked a question related to Ecological Statistics
Question
I have a large dataset of fish abundance as well as some environmental variables covering around 795 sampling sites. I have tried to relate my environmental variables to the biological data with the RELATE function in PRIMER-E. Using Spearman rank correlation, the sample statistic (Rho) is 0.11. The significance level of the sample statistic is 0.1% (much less than 5%), so according to the manual this is a significant result! I used 999 permutations to get this result. I am unable to interpret it, as I would usually expect that if the p value is significant, the corresponding degree of association should also be high. So I would expect the sample statistic to be much higher than 0.11 (above 0.7 or so). The situation is similar with the DistLM procedure: the p values suggest that each of the variables has a significant effect in the model, but the overall R^2 of the fit is only 0.13! How is it possible that, with such a poor R^2, the individual variables are all significant?
I have used the square root transformation and Bray-Curtis similarity on the biological data and have normalized the environmental variable and Euclidean measure. I haven't transformed the environmental variables.
I would really appreciate it if someone can help me to interpret these results.
Yes, the null hypothesis in RELATE is that the correlation is zero. As you have a lot of samples you have a lot of statistical power to detect departure from the null. The significance tells you how likely it is that you would observe a value as high (or higher) than the value you have observed if the true correlation was actually zero.
There are many other tools in Primer you could also try (RELATE for 2-way designs, BIO-ENV, etc.), depending on your survey design. It might be that you get more meaningful results if you combine samples, maybe pooling nearby samples for example. It is likely, given that you have so many samples, that the signal (from the environment) is being lost in the noise (many samples each dominated by one or two species of fish).
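The distinction between significance and strength of association can be sketched numerically. RELATE uses a permutation test on a matrix correlation, so the sketch below is only an analogy: it uses the textbook t statistic for an ordinary correlation coefficient, in plain Python, to show how the same weak correlation of 0.11 moves from clearly non-significant to clearly significant as the number of samples grows, while the proportion of variation it represents stays tiny.

```python
import math


def t_statistic(r, n):
    """t statistic for testing 'true correlation = 0' given a sample r and n pairs."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


r = 0.11
for n in (20, 100, 795):
    # Values of |t| above roughly 2 correspond to p < 0.05 for moderate to large n.
    print(f"n = {n:4d}  t = {t_statistic(r, n):.2f}  r^2 = {r * r:.4f}")
```

With 795 samples the test has enough power to detect a correlation this small, but the effect size itself is unchanged: the p value says the association is unlikely to be zero, not that it is strong.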
• asked a question related to Ecological Statistics
Question
We have two nominal variables: diet and location. We want to know whether the diet of the species differs in various locations. I know it is possible to do it by chi-square, but I also have seen others have done it through the Kruskal-Wallis test. I applied chi-square, though I still believe Kruskal-Wallis is the right choice. Am I right? However, I have difficulties with performing this test on my data. Here is a view of a little part of my data.
Maybe I should re-order my data in another way?
I really appreciate it if you could help me.
If the data really are nominal then a chi-square test could be appropriate. But what is your location variable? If some of these locations are next to each other, diets are likely to be more similar in neighbouring locations than in distant ones. Or is location just something like desert, mountains, ocean?
• asked a question related to Ecological Statistics
Question
I have several metagenomic samples, from which I got the viromes through a bioinformatics pipeline. I made a PCoA of those samples, and some of the samples cluster together. I would like to know which families or species of virus in particular are making these samples similar and at the same time different to the other samples. Is there a way I can do this?
Yes, you can do it with the metagenomeSeq package in R.
• asked a question related to Ecological Statistics
Question
Hello there, I would like some advice on how to correctly perform a CCA, and on whether I need to transform my data. To that end I'll explain the aims of this work and what my data look like. (I'll try to be as specific as possible and to explain it correctly in English.)
I'm working on an ecological assessment using phytoplankton and environmental parameters in tropical eutrophic reservoirs. I collected weekly summer samples at each site and calculated monthly means for the biotic data (densities) and the abiotic parameters (temperature, pH, NO3, etc.). These means are the data I currently have for the CCA (60 taxa and 19 environmental variables). The taxa are expressed as densities (org/mL), and I'm thinking about performing a preliminary PCA on the abiotic and biotic parameters to exclude some variables. But I'm worried about whether my statistical approach is correct, or whether I need additional steps, such as a data transformation.
Thank you @ Malcolm Baptie !
That helps a lot :)
• asked a question related to Ecological Statistics
Question
Analysing morphology-habitat relationships in a montane plant species, I am thinking of using slope exposition (i.e., northern, southern slopes, etc.) as one of the habitat features, since directly measuring all the associated microclimatic factors appears problematic. I have plant samples from many sites within a montane area of ca. 1300 square kilometres, and for each site I have slope exposition data (cardinal and inter-cardinal directions). I need to correlate these data with leaf anatomical/morphological traits.
I would be grateful if someone could also recommend some papers reporting relationships between plant growth/occurrence and slope exposition in mountains.
Alternatively, you can break your directions into a north-south and an east-west aspect component. This requires assigning an angle (compass direction in degrees) to each direction. Depending on your study system, one of these slope aspect components might be of greatest interest (for example, if working at a temperate latitude, you would most likely expect the degree of N-S orientation to matter more biologically, due to the difference in solar incidence). If you take cos(angle), with the angle converted to radians, this gives you the N-S component in numeric form, ranging from 1 to -1, with 1 being N (0 or 360 degrees) and -1 being S (180 degrees); zero indicates a completely east or west exposition. sin(angle) is the E-W component, again ranging from 1 (East) to -1 (West). Then you can run correlations and linear regressions with your data.
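A minimal sketch of this decomposition, in plain Python, for the eight cardinal and inter-cardinal directions. The lookup table and the function name are illustrative choices, not part of any package; the only substantive step is converting degrees to radians before taking the cosine and sine.

```python
import math

# Compass bearing in degrees for each cardinal and inter-cardinal direction.
COMPASS_DEGREES = {
    "N": 0, "NE": 45, "E": 90, "SE": 135,
    "S": 180, "SW": 225, "W": 270, "NW": 315,
}


def aspect_components(direction):
    """Return (northness, eastness), each between -1 and 1, for a compass direction."""
    theta = math.radians(COMPASS_DEGREES[direction])
    return math.cos(theta), math.sin(theta)


for direction in COMPASS_DEGREES:
    northness, eastness = aspect_components(direction)
    print(f"{direction:>2}  northness = {northness:+.3f}  eastness = {eastness:+.3f}")
```

The two components can then be used as ordinary numeric predictors, which avoids the problem that 350 degrees and 10 degrees are numerically far apart but ecologically almost the same exposure.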
• asked a question related to Ecological Statistics
Question
Dear all,
I have a dataset of fish collected in different rivers over different years, each river sampled a different number of times during different projects. The difference in the number of observations among rivers can be large in some cases, e.g.:
River X = 1 project (1 observation=1 sampling x 1 year);
River Y = 5 projects (15 observations= 1 sampling x 3 years x 5 projects);
River Z= 15 projects (105 observations=1 sampling x 7 years x 15 projects);
I want to estimate, across the whole region (so I am not interested in specific rivers), how abundance is related to Years, Latitude, Altitude and anthropic pressures (APindex). I thought of using the following model:
lme: Abu ~ Years + Latitude + Altitude + APindex + (1 | River/Project) + corARMA(form = ~ time | River/Project).
- What is the influence of River Z, with its 105 observations, compared to the other rivers, which have fewer observations?
- Am I accounting for these unbalanced observations in the random structure (1 | River/Project)?
- Do I have to account in the model for the different number of observations with weights (weight = 1/number of observations for each project)?
Thank you
Pure ANOVA is not good for an unbalanced study, but GLM (regression) is okay. Or use the Brown-Forsythe test for skewed data. See: Analysis of Unbalanced Data by Mixed Linear Models using the mixed Procedure of the SAS System, DOI: 10.1111/j.1439-037X.2004.00120.x
• asked a question related to Ecological Statistics
Question
I'm having a hard time choosing the right statistical method for my study. The data that I have are summarized below.
- Waterbird count data (absolute counts). Non-standard, varying effort but standard effort is assumed.
- The counts are done yearly and the years are categorized into two groups: high water level years and low water level years. Sample sizes are different for these groups (e.g. there are 8 high level years and 6 low water level years)
- I want to see if there is any statistical difference between high water level years and low water level years in terms of mean/median number of birds counted.
Because the count data are not normally distributed, I used the Mann-Whitney U test directly. But I would like to know what other or better methods I could use for the same purpose. I also want to compare the two groups in terms of different biodiversity measures, like species richness.
Hi Kaan, a useful approach would be to run a GLM using the counts as the response variable and the groups as the independent variable. You should also run some goodness-of-fit tests to see which error distribution (Poisson, negative binomial, etc.) best fits your data.
Search for "count data and GLM" in google and you will find some more answers.
Good luck!
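To make the GLM suggestion concrete, here is a small sketch in plain Python with invented yearly counts (8 high-water years and 6 low-water years, matching the design described but not the real data). It relies on one standard fact: in a Poisson GLM with a log link and a single two-level factor, the fitted group means equal the observed group means, so the group coefficient is simply the log of their ratio. The variance-to-mean ratio is included as a quick, informal check for overdispersion.

```python
import math
from statistics import mean, variance

# Hypothetical yearly waterbird counts (illustrative values only).
high_water = [120, 95, 143, 110, 87, 132, 101, 125]  # 8 high water level years
low_water = [60, 82, 45, 71, 66, 54]                 # 6 low water level years

mean_high = mean(high_water)
mean_low = mean(low_water)

# Coefficient a Poisson GLM (log link) would estimate for "high" relative to "low".
log_rate_ratio = math.log(mean_high / mean_low)

# Under a Poisson model the variance equals the mean, so ratios well above 1
# suggest overdispersion and point towards a negative binomial model instead.
dispersion_high = variance(high_water) / mean_high
dispersion_low = variance(low_water) / mean_low

print("mean high:", mean_high, " mean low:", mean_low)
print("log rate ratio:", round(log_rate_ratio, 3))
print("variance/mean (high, low):", round(dispersion_high, 2), round(dispersion_low, 2))
```

In a real analysis the standard errors and tests would come from the fitted GLM itself; the sketch only shows what the group coefficient measures and why the choice of error distribution matters.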
• asked a question related to Ecological Statistics
Question
Dear all,
I have what is maybe a silly question for the "time series" experts:
- I have a dataset of 80 European rivers
- In 50% of the rivers I have more than 1 project; in the other 50%, 1 river = 1 project
- In 50% of the projects I have data collected for only 1 year; in the other 50% of the projects, data were collected over several years (from 2 to 20 years, depending on the project)
-> I want to assess fish diversity as a function of altitude, latitude and catchment size.
After exploring the data for the model assumptions of normality, variance homogeneity, etc., I thought to run this model:
mod <- lme(Fish_Diversity ~ log(altitude) + log(latitude) + log(catchment_size), random = ~ 1 | Rivers/Projects, method = "ML", data = dati)
When I look at the residuals of model mod with acf(residuals(mod)) and pacf(residuals(mod)), they are pretty good, but in the acf there is autocorrelation at lag 1 and in the pacf the line goes slightly over at lag 3. I think I would give it a try with a corAR1 correction (p = 1) in lme.
My questions are:
1- Is the model developed in your opinion correct?
2- Can I fit a corAR1 correlation in the lme just by looking at the acf and pacf plots from the model mod? As you can see, I have different projects over time, which means potentially multiple time series (one per project). Can I fit a single AR1 structure by looking at the residuals of the model (without corAR1) rather than at the raw data, and assume that the same temporal trend is present in all the projects analysed? How can the acf and pacf know what the temporal repetition is (i.e. how do the acf and pacf build the lags in the plots)?
3- If the answer to question 2 is yes, do I have to sort the data frame chronologically within each project (e.g. Project 1 from 2000 until 2008; Project 2 from 1998 until 2015, and so on), as in dati[order(dati$Project_names, dati$Year_evaluation), ],
and give corAR1 the structure form = ~ 1 | Rivers/Project_names?
Would this model be ok?
modAR <- lme(Abu ~ log(altitude) + log(latitude) + log(catchment_size), random = ~ 1 | Rivers/Project_names, method = "ML",
  correlation = corARMA(form = ~ 1 | Rivers/Project_names, p = 1), data = dati)
Alessandro
Dear Keston and Fjdor,
thank you very much for your suggestions. These were really useful. I will check both options (including VEC models).
Thank you very much
• asked a question related to Ecological Statistics
Question
If I have a 100 square km forest site (homogeneous vegetation across the site), how many line transects should I design to get reliable density estimates of forest primates through Distance Sampling methods? Is there a documented relationship between study site size and number of transects?
Well, you have to put your transects where the birds are! No bird present = no bird counted lol. So put the transects where the birds are because you simply want to know what their density is in those locations. There's nothing wrong with doing that. Sampling doesn't have to be random remember :)
• asked a question related to Ecological Statistics
Question
I am trying to create a model in R that accounts for my survey. I’ve been doing a lot of reading over the past couple of weeks trying to understand the best way to do this, and my brain is ready to explode so I thought I’d ask for some advice!
I am studying a particular species of reptile, trying to discover the environmental variables that account for their distribution over my study site.
The study involves placing refuges that attract the reptiles evenly across the site, and counting the reptiles that I find underneath. There are 68 refuges in total across 34 grid squares: Each grid square contains 2 refuges, of different materials. I have surveyed these refuges 11 times, so this is a repeated measures study.
My dependent variable is number of reptiles found. This is count data so I think I need a Poisson distribution. I thought I’d have zero-inflation but using a goodness of fit test for Poisson distribution tells me no, as does a comparison of mean and variance:
Reptiles under refuge – Frequency
0 - 579
1 - 140
2 - 24
3 – 5
Mean = 0.271
Variance = 0.302
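The mean and variance quoted above can be reproduced directly from the frequency table. The sketch below does this in plain Python; the variance-to-mean ratio it prints is an informal overdispersion check (a value near 1 is what a Poisson distribution predicts), not a formal goodness-of-fit test.

```python
# Number of refuge checks (value) in which each reptile count (key) was recorded.
frequency = {0: 579, 1: 140, 2: 24, 3: 5}

n_checks = sum(frequency.values())  # 68 refuges x 11 surveys
total_reptiles = sum(count * freq for count, freq in frequency.items())

mean_count = total_reptiles / n_checks
# Sample variance (n - 1 denominator) computed from the grouped data.
variance_count = sum(
    freq * (count - mean_count) ** 2 for count, freq in frequency.items()
) / (n_checks - 1)

print("n =", n_checks)
print("mean =", round(mean_count, 3))
print("variance =", round(variance_count, 3))
print("variance / mean =", round(variance_count / mean_count, 2))
```

The ratio is only slightly above 1, which is consistent with the conclusion in the post that a plain Poisson distribution is a reasonable starting point.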
My independent variables are as follows:
• Date of survey (I assume I use this to tell R that this study is repeated measures)
• Temperature under each refuge (continuous)
• Proportion of area around refuge that is scrub vegetation (could be converted to area if proportion is problematic)
• Mean vegetation height immediately around each refuge (continuous)
• Material of refuge (binomial – although as I have already found that there is a significant preference by my reptiles for one over the other, should I still include this?)
• Angle of slope that the refuge is on (continuous)
• Direction of slope (continuous)
• ID code of grid square (factor – I’m not sure about this one. There were some squares that no reptiles were found, while others had several)
Temperature was measured for each refuge at each survey, while the other variables were measured once and assumed to remain constant. There are no missing values.
Do I need to check for independence between all these variables before including them? Should I use a correlation table for that?
From my reading, a repeated measures GLMM looks most appropriate for my study but I wanted a second opinion. I’m also getting confused on which factors are fixed and which are random.
Here is my attempt at building a model:
model <- glmer(reptiles ~ (1|date) + temp + scrub + meanvegheight + material + slope + direction + square, data = dataset, family = poisson)
Would this provide me with what I’m after? Please treat me as an ignorant ecologist rather than an experienced statistician! If anything needs clarification please just ask. Many thanks for your help!
Maybe the above link could help. I would suggest you start with a basic model and progressively add other variables, as in a hierarchical regression. You could start with the, for you, most interesting or relevant variables.
For example:
basic <- glmer(reptiles ~ 1 + (1 | date), data = dataset, family = poisson)  # intercept only, plus the random effect of survey date (glmer needs at least one random term)
model1 <- update(basic, . ~ . + temp)
then, each time, assess the significance of new model:
anova(basic, model1)
model2 <- update(model1, .~. +scrub)
anova(model1,model2)
etc.
You might also add possible interaction effects, e.g. temp*scrub. In an R formula this expands to both main effects plus their product, so the product does not need to be calculated beforehand. (This makes your study even more complicated ...)
Fixed vs random (I quote from Field et al., "Discovering Statistics using R", p.862):
An effect is said to be fixed if all possible treatment conditions that a researcher is interested in are present in the experiment
An effect is said to be random if the experiment contains only a random sample of possible treatment conditions. Fixed effects can be generalized only to the situations in your experiment, whereas random effects can be generalized beyond the treatment conditions in the experiment (provided that the treatment conditions are representative)
• asked a question related to Ecological Statistics
Question
I'm doing EBSD analysis on the same area after ECSTM analysis, to identify any differences that could explain the dissimilar behavior of two grain boundaries of the same type.
The EBSD data include the sigma value (CSL or not), the misorientation and the GB plane, but also a parameter called deviation and two others called plan(P1) and plan(P2). Example:
GB1: sigma=3, misorientation=58.7, plan=-18 17 17, deviation=1.7, plan(P1)= 10 -21 -12, plan(P2)= -10 -9 4.
I would like to know what deviation means, and whether plan(P1) and plan(P2) are the actual planes of the two crystals on either side of the GB.
Dear Mohamed Bettayeb,
Could you please mention the software/system that was used to get this data? Different software often use different terminology, and someone who uses the same software may be able to give you an accurate explanation.
In general, 5 parameters are used to describe grain boundaries, I suspect the data you mention would be along those lines. For general reading on the subject you could go through this link: https://www.tf.uni-kiel.de/matwis/amat/def_en/kap_7/backbone/r7_1_1.html
• asked a question related to Ecological Statistics
Question
I collected benthic samples from 12 stations (3 samples per station across 4 locations) across mudflats from one estuary, in autumn & the following spring, totalling 72 samples. In addition, I collected one sediment core from each station per season (total 24).
Data analysis is being conducted through R, though I do have access to CAP4. Within R, I've generated dendrograms, NMDS plots, rarefaction curves and basic Simpson's diversity analysis (Vegan package).
Sediment analysis was conducted through Gradistat.
To understand benthic density/presence on sediment type, I'm trying to analyse sediment composition against benthos for each station/season.  I'm assuming it is better to analyse against the full breakdown of sediment type rather than the generated classification (i.e. muddy sand/sandy mud).
However, to get this working in R, there is a lot of data that would need to be entered into the main data sheet alongside the species/site data, and I'm really not sure of the best way to do this or how it would look.
Additionally, I'm also not too sure which package/analysis is best for analysing these data.
Any help or advice would be greatly appreciated.
Thank you Andrew & Ajit!
• asked a question related to Ecological Statistics
Question
Dear all,
I'm working on dung beetle assemblages, and I would like to test the hypothesis that the community structure of these insects is different along a gradient of grazing pressure. In two similar sites, dung beetles were sampled into 3 levels of grazing pressure (High, Moderate and Low), with 5 pitfall traps in each level.
After analyzing my data with a Correspondence Analysis (where sampled communities are classified in 3 groups: High, Moderate and Low grazing), I would like to know whether the dung beetle community structure differs significantly between the 3 levels of grazing pressure. An ANOSIM (run in R) gives R = 0.4097, p = 0.000999. That part is fine! But I don't understand the other parts of the results, for example the values of "Dissimilarity ranks between and within classes".
Thanks a lot for your help !
Hi William
The R statistic in ANOSIM compares the ranked dissimilarities between groups with those within groups. The steps in the analysis are:
1. calculate a matrix of dissimilarity scores for every pair of sites
2. convert the dissimilarities to ranks
3. calculate the R statistic as the mean rank of dissimilarities between sites in different groups minus the mean rank of dissimilarities between sites within the same group (e.g. high grazing pressure), divided by N(N-1)/4, where N is the number of sites. R ranges from -1 to 1; the closer it is to 1, the more the sites within a group are similar to each other and dissimilar to sites in other groups. The "Dissimilarity ranks between and within classes" table summarises the distribution of these ranks for the between-group pairs and for each group.
4. The significance of the R-statistic is determined by permuting the membership of sites in groups.
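The four steps above can be sketched as follows. This is a plain-Python illustration with hypothetical data (six sites placed on a line, three per group), not the vegan implementation; the function names are made up for the example. It ranks the pairwise dissimilarities, computes R as the difference between the mean between-group and mean within-group rank divided by N(N-1)/4, and obtains a p value by shuffling the group labels.

```python
import random
from itertools import combinations


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def anosim_r(dissimilarity, groups):
    """ANOSIM R for a dict of pairwise dissimilarities keyed by (i, j), i < j."""
    pairs = list(combinations(range(len(groups)), 2))
    ranks = average_ranks([dissimilarity[pair] for pair in pairs])
    within = [r for (i, j), r in zip(pairs, ranks) if groups[i] == groups[j]]
    between = [r for (i, j), r in zip(pairs, ranks) if groups[i] != groups[j]]
    n = len(groups)
    mean_between = sum(between) / len(between)
    mean_within = sum(within) / len(within)
    return (mean_between - mean_within) / (n * (n - 1) / 4)


def anosim_test(dissimilarity, groups, permutations=999, seed=1):
    """Return (observed R, permutation p value) by shuffling group membership."""
    rng = random.Random(seed)
    observed = anosim_r(dissimilarity, groups)
    shuffled = list(groups)
    at_least_as_large = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if anosim_r(dissimilarity, shuffled) >= observed:
            at_least_as_large += 1
    return observed, (at_least_as_large + 1) / (permutations + 1)


# Hypothetical example: six sites on a gradient, two well-separated groups.
positions = [0, 1, 2, 10, 11, 12]
groups = ["High", "High", "High", "Low", "Low", "Low"]
dissimilarity = {
    (i, j): abs(positions[i] - positions[j])
    for i, j in combinations(range(len(positions)), 2)
}

r_stat, p_value = anosim_test(dissimilarity, groups)
print("R =", r_stat, " p =", p_value)
```

Because every between-group dissimilarity in this toy example is larger than every within-group one, R reaches its maximum of 1. With only six sites there are few distinct ways to assign the labels, so the p value cannot become very small no matter how clear the separation is.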
• asked a question related to Ecological Statistics
Question
Hello all,
I want to run a PGLS analysis. I have a phylogeny with branch lengths, but I want to run the PGLS analysis for a slightly smaller subset of taxa than the ones contained in the phylogeny. Is it ok/feasible to prune (remove) taxa from the pre-existing phylogeny, so I don't have to re-calculate a tree from scratch?
Hello Michael and Manichanh. I want to prune individual taxa for which I don't have data. My impression is that it would not be good practice to do this. However, I have seen this process described for PGLS analyses on r-phylo, and there are specific packages (ape) that allow you to drop tips from a tree. I don't know if this would bias my assumptions too much.
• asked a question related to Ecological Statistics
Question
I want to model the distribution of several species. I have read about the subject and have found that there are several models to achieve it:
Ellipsoid
Bioclim
Maxent
Maxlike
GLM
Which one do you suggest to use, considering that I only have presence records (GBIF)?
Some researchers use single-algorithm (for example Generalized Linear Models, Maximum Entropy, Generalized Additive Models, Neural network, Random Forest, etc.) while others are more willing to work with ensemble projections and consensus models (i.e. mean or weighted mean of more than one algorithm). It doesn't really matter; you can choose what you like best, so long as you describe the advantages and limitations of your model and do a sensitivity analysis.
This one uses Maxent:
Using species distribution modeling to delineate the botanical richness patterns and phytogeographical regions of China, Scientific Reports, Nature, 6, Article number: 22400 (2016)
doi:10.1038/srep22400
• asked a question related to Ecological Statistics
Question
Hi there,
I have data on a population of carrot fly, which have been trapped at various distances from a probable source of the flies, at regular time intervals.
I am interested at looking at the associations between the number of flies caught and other independent factors- including distance from the source, hedgerow properties (e.g. age of hedgerow) and the proportion of host plant species present.
Does anyone have any thoughts on how I could statistically analyse or measure the combined effects of the independent variables above upon the number of flies caught?
I'm aware and confident that I could simply look to see if there are correlations/associations between a singular variable e.g. % cover of a host species against the number of flies caught. However, I am more interested in (and think it is more interesting) considering how say % cover of the host plant species AND distance from the source may impact upon the presence of flies.
Programs available- SPSS, Minitab, GraphPad etc. and ArcGIS software.
Since your response is a count, it sounds like a good starting point would be a Poisson regression via a glm or glmm ... or, potentially, a negative binomial regression (if your count distribution is overdispersed). Use the glmm if you want to account for random effects in one of your predictors.
It's super duper easy to do in R; the glm() function is part of the stats package that ships with base R.
> mydata <- read.csv("my data file.csv")   # create data frame from .csv file
> fit <- glm(response ~ predictor1 + predictor2 + ... + predictorX, data = mydata, family = poisson())
> summary(fit)
# Before you run the glm above, it is probably a good idea to check that your variables are being read properly. For example, if you had a dummy variable (0/1) that you wanted to be a factor:
> sapply(mydata, class)
> mydata <- transform(mydata, myvariable = as.factor(myvariable))
#Once you have run the glm, you can then create a vector of the residuals and test (or plot) to see if they are autocorrelated. If this is the case, the glmm framework might be particularly useful because it is better able to handle correlated errors.
#There is great online support for doing a glm or glmm in r. In particular, I have found the tutorials from the IDRE center (at UCLA) to be very helpful. For example,
Good luck!
• asked a question related to Ecological Statistics
Question
Hello everyone, I am trying to apply a PERMANOVA with covariables to a benthic community dataset. I have species density per sample at 4 different distances from a shipwreck, and 4 covariables. I am trying to do this using Primer, but the pairwise tests for distance always return "no test" and df = 0. Can anyone help me with that? What am I doing wrong?
Try using Type 1 sums of squares
• asked a question related to Ecological Statistics
Question
I am looking for suggestions for analyses that can compare different taxa in terms of the relative difference in composition among sites.
I have 4 parallel datasets of species abundance data from 4 different taxa sampled in the same sites (n=12).
Each site was sampled between 4 - 10 times.  Usually (not always) sampling was done at the same time for all taxa within a site, but not all sites were sampled at the same time so the data are unbalanced.
I can create balanced subsets if needed but this would severely truncate the data.
I've heard of co-correspondence analysis, co-inertia analysis, and possibly multiple-factor analysis as potential candidates for this type of comparison, but I'm not sure about the differences or which is most appropriate.
Are there pros and cons/restrictions/assumptions for each of these?
Is there an alternative method that I have not mentioned that would be better?
Also, what do these analyses allow me to test exactly? Is the intention to be able to say, for example, that taxa A and B were highly correlated in terms of variation in composition across sites, while taxon C showed low correlation with any other taxon, etc.?
Thanks
Tania
Thank you for all your responses.
The question I would like to ask is
1) Do the spatial patterns of diversity differ among taxa? For example, one taxon may show strong clustering of sites based on habitat type, while another may show similar composition across all sites.
2) Do spatial trends differ over time? For example, one taxon may show stable composition in all habitats over time, while another may show convergence of composition between habitats, and a third may show high variability over time for one particular habitat.
The intention is to demonstrate that the different taxa have different spatial and temporal distributions and therefore can or cannot be used as surrogates for each other based on composition.
3) Characterise the similarity (betadiversity) between and within habitat types, based on all taxa.
I can of course compare univariate measures of diversity in each site using anova but I would like to compare the taxa based on their composition.
Thanks for any further suggestions to address these specific research questions.
• asked a question related to Ecological Statistics
Question
While searching the net, there seems to be a plethora of codes/packages, and I was wondering if ecologists could suggest the simplest approach to it.
Also, somewhat simpler, see the Rattlesnake example at the end of this page, using either nlme or lmer for mixed effects model.
• asked a question related to Ecological Statistics
Question
Hi, any suggestions will be welcome as the title reads.
- Four habitats to compare, one which is a control
- Bird data (categorical and quantitative, traits, abundances, microhabitat uses, nesting categories) for breeding and non-breeding seasons
- Various tree characteristics measured.
Which tests could I perform with what data to understand how bird communities use the different habitats?
First, it is meaningless to call a habitat a 'control' because all habitats are different – just say you have four habitats. Then, as David said in Prometheus, 'your answers depend on what you hoped to achieve'. So you need some definite questions (or hypotheses) that you require answering before you start analysing your data. It seems that your main question is 'habitat use by birds', the categories of which will be things such as foraging, roosting (or perching), and nesting. So, use the tree characteristics to define your habitats. Then, as Emigdio stated, you can perform GLM with post-hoc tests to determine if bird data differed significantly between habitats in breeding and non-breeding seasons. Your factors are 'habitat' (characterised by tree traits) and 'season', both of which are fixed, and your variables are the bird data. Canonical ordination is also an option. I hope this helps :)
• asked a question related to Ecological Statistics
Question
Our research campaign involved sampling 4 rivers (at three different altitude stations on each river). At every station we collected three different samples along a longitudinal 100 m transect of the river, taking special care to sample the full heterogeneity of substrata, and analyzed them for benthic macroinvertebrates.
At the same time we evaluated numerous catchment variables in order to test the relevance of the land use and catchment properties on the macroinvertebrate community.
Therefore, we ended up with 4 rivers * 3 stations * 3 samples per station = 36 samples.
My question is whether all samples could sensibly be included as cases in a Random Forest model (n=36), or whether I should instead average the macroinvertebrate samples per station to avoid pseudoreplication (n=12).
I would greatly appreciate any help and advice on this issue.
Salud, y gracias
Manuel
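If the station-level route (n=12) is taken, the averaging step itself is short in R. This is only a sketch of that step under an assumed layout; the column names and the simulated response (richness) are hypothetical, and it does not settle which of the two options is appropriate.

```r
# Hypothetical layout: 4 rivers x 3 stations x 3 samples (response values simulated).
set.seed(2)
samples <- expand.grid(sample  = 1:3,
                       station = c("high", "mid", "low"),
                       river   = paste0("R", 1:4))
samples$richness <- rpois(nrow(samples), lambda = 15)

# Average the three samples within each river-station combination (36 rows -> 12 rows)
stations <- aggregate(richness ~ river + station, data = samples, FUN = mean)
nrow(stations)
```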
• asked a question related to Ecological Statistics
Question
I want to compare evolutionary rates between two sets of traits (morphological and reproductive). The values of the reproductive traits are very small compared with the morphological traits (both are measured in the same unit and scale [mm]), so I want to perform a log-log regression of species mean trait measurements on species mean thorax volume, which I use as an index of body size. However, I don't know how to perform it, so I am looking for help or a potential collaborator.
thanks for any suggestion.
Transformation does not LEAD to skewed data; rather, transformation changes skewed data to normal :)
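For the log-log regression described in the question, a minimal sketch in R could look like the following. The species means are simulated and the variable names (thorax_volume, trait_mean) are assumptions; the sketch also treats species as independent points, which ignores phylogenetic relatedness.

```r
# Hypothetical species means: all names and values are made up for illustration.
set.seed(1)
thorax_volume <- exp(rnorm(20, mean = 2, sd = 0.5))                  # body size index
trait_mean    <- 0.3 * thorax_volume^0.8 * exp(rnorm(20, sd = 0.1))  # a trait in mm

# Log-log (allometric) regression: the slope estimates the scaling exponent
fit <- lm(log10(trait_mean) ~ log10(thorax_volume))
coef(fit)

# Residuals are size-corrected trait values on the log scale
size_corrected <- residuals(fit)
```

Because both axes are logged, traits with very different absolute magnitudes end up on a comparable proportional scale.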
• asked a question related to Ecological Statistics
Question
There are four sampling sites on a hillslope (top, upper, lower, bottom), and each site has three replicates. We have studied soil nutrients and plant biomass in these four treatments, and a reviewer suggested that a proper statistical analysis (autocorrelation between topographic positions) was needed.
My question is: how can I carry out an autocorrelation analysis for my study? I am familiar with SPSS. Thank you very much.
Here is a short, relevant, easily understood editorial article which explains pseudo-replication from a journal editor's point-of-view:
Binkley, D. (2008). Three key points in the design of forest experiments. Forest Ecology and Management 255.
Use the article to understand how to draw appropriate conclusions from your present and future studies.
• asked a question related to Ecological Statistics
Question
I have data where I measured the distribution of individuals among 4 patches that differ in known resource density (a continuous variable). Groups of 12 individuals were observed and their presence in the 4 patch types was recorded 10 times (every 20 minutes). Trials differed in the presence or absence of another species. If I only had two patches, I believe I would use a GLMM (family = binomial) with arena_ID as a random variable and presence/absence of other species as a fixed effect.
However, with 4 patch types I want to use patch type (amount of resources) as a continuous fixed effect. The percentages across patch types always sum to 100%, so a regression of percent in patch versus resource in patch seems somewhat incorrect, and the data are multinomial rather than binomial.
I have been calculating the slope of percent in patch versus patch resource for each arena_ID, and then asking if the collection of slopes differ from 0 using a t-test, but I am looking for a better way.
Hi Barney
You may want to consider a GLMM (family=poisson) and use counts of individuals on each of the patches over time. Use of random and fixed effects are analogous to the binomial case.
• asked a question related to Ecological Statistics
Question
To see whether fish species distribution depends on area, I collected fish samples for one year from four sampling sites in a lake. Ten fish species were recorded across all sites. I used a two-way chi-square test to compare the species distribution among the sampling sites and found a significant difference. However, I was not able to see which sites differ significantly from which others. Can anyone explain how to do this?
One way is to analyze the component 2xn tables. In this case you compare Site 1 with Site 2, and in a separate test compare Site 1 to Site 3, and so on.
Since you are conducting multiple tests, you may want to adjust the p-values or the alpha level to deal with the inflation of Type I errors.
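The pairwise approach described above can be sketched in R as follows. The count table is invented (five species rather than ten, for brevity), and the Holm correction is just one of the adjustments available in p.adjust().

```r
# Hypothetical counts: 4 sites (rows) x 5 species (columns); values are invented.
counts <- matrix(c(30, 12,  8, 20, 10,
                   28, 15, 10, 18,  9,
                   10, 30, 25,  5, 10,
                   12, 28, 22,  8, 10),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("Site", 1:4), paste0("Sp", 1:5)))

# Overall test of association between site and species
chisq.test(counts)

# One 2 x n test per pair of sites
pairs <- combn(rownames(counts), 2)
p_raw <- apply(pairs, 2, function(s) chisq.test(counts[s, ])$p.value)
names(p_raw) <- apply(pairs, 2, paste, collapse = " vs ")

# Holm adjustment for the six comparisons
p_adj <- p.adjust(p_raw, method = "holm")
round(p_adj, 4)
```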
The example here may be helpful:
• asked a question related to Ecological Statistics
Question
Hi. I want to assess the diversity of fish from 10 sampling sites at temporal scales. However, at each site I am going to use only 1 fishing gear to catch the fishes. If I want to prepare 5 replicates per site/sampling, is it valid if I take the replicates from the same fishing gear? I'm planning to use stratified random sampling technique to prepare these replicates. Thank you.
depends on the species under selection
• asked a question related to Ecological Statistics
Question
Dear all,
Does anyone know of an R package that can be used to perform a meta-analysis that takes into account spatial and temporal autocorrelation (maybe separately)?
#spatial
I work on fish abundance data and associated diversity metrics at 35 stations located along several large French rivers (Rhône, Vienne, Loire, Meuse, Seine).
Some of the stations are closer together than others: for example, 7 stations are located in the same area (distance <10 km), while others are located in different catchments without direct connectivity. Consequently, I expect that my data/results will be strongly spatially autocorrelated.
I am looking for a way to correct the time series meta-analysis for this spatial heterogeneity in R. Ideally, I was thinking of a method that would allow the weighting of the different time series in the meta-analysis according to their relative distance along the river network.
#temporal
The stations were sampled annually and the time series range from 18 to 36 years, so consecutive years are likely to be more correlated than, for instance, the first and last years. I would like to correct for temporal autocorrelation in the meta-analysis. For now, I have applied a Mann-Kendall trend analysis that accounts for the temporal autocorrelation, and I have extracted the correlation coefficient to be used in the meta-analysis. Can you think of another way to perform this correction?
The package below might be an option, but I agree with the recommendation above to test whether your errors are independent and then, if they are not, test whether they are spatially autocorrelated before trying to use a spatial model. https://cran.r-project.org/web/packages/MARSS/vignettes/UserGuide.pdf
• asked a question related to Ecological Statistics
Question
I would like to calculate the functional diversity measures proposed by Chiu & Chao (2014). The paper gives formulas for the calculation of functional hill numbers, mean functional diversity and (total) functional diversity. But it doesn't mention any R script or package and I don't feel confident encoding these formulas myself.
I was wondering if anyone already calculated these measures and how they did it.
Ref:
Chiu, C. H., & Chao, A. (2014). Distance-based functional diversity measures and their decomposition: A framework based on hill numbers. PLoS ONE, 9(7). doi:10.1371/journal.pone.0100014
Sorry that I answer this query so late as I just saw it.
The R code for Chiu and Chao (2014) has been uploaded in my website:
Thanks for your interest in my work.
• asked a question related to Ecological Statistics
Question
I am currently testing the correlation between environmental variables and biological data. I have been using DistLM, but because I have many environmental variables, the programme is unable to process them. Can anyone suggest a similar programme that can handle more variables?
An alternative analysis is correspondence analysis or detrended correspondence analysis. Both can be done using R. Google for R and the suggested analysis. Also a program like DECORANA performs the analysis. I learned it from Cajo ter Braak. To see what you can do with the analysis, google his name.
• asked a question related to Ecological Statistics
Question
I'm currently working on the alkaloid composition of the skin secretions of salamanders and am trying to test whether this composition differs between different populations.
In line with previous research on alkaloid profiles in poison frogs, I tested for differences among populations using an ANOSIM. Since I work with relative concentrations (a.k.a. proportions), I thought it was more appropriate to construct an Aitchison dissimilarity matrix for this analysis.
I was further interested in seeing which exact compounds were responsible for differences between the populations. A SIMPER, often associated with an ANOSIM, seemed perfect ... but SIMPER in R uses Bray-Curtis dissimilarities.
I was wondering if there is an alternative for SIMPER that uses other indices of dissimilarity? Could a PCA do the same?
Gilles, although this doesn't answer your question, just for your information: SIMPER and ANOSIM have serious issues, since it is very difficult to determine whether differences are attributable to within-group or between-group variation, which may give misleading results. You may want to look at Warton et al. 2012. Distance-based multivariate analyses confound location and dispersion effects. Methods in Ecology and Evolution, 3, 89-101.
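On the question of whether a PCA could be used with Aitchison dissimilarities: the Aitchison distance is the Euclidean distance between centred log-ratio (clr) transformed compositions, so a PCA on clr values is based on the same geometry. The sketch below illustrates this with simulated proportions; the object names are hypothetical and it is not a drop-in replacement for SIMPER.

```r
# Hypothetical relative concentrations: 6 individuals x 4 compounds (rows sum to 1).
set.seed(7)
raw  <- matrix(rgamma(24, shape = 2), nrow = 6,
               dimnames = list(paste0("ind", 1:6), paste0("alk", 1:4)))
comp <- raw / rowSums(raw)

# Centred log-ratio (clr) transform: log of each part minus the row mean of logs.
# Zeros would need replacing first (e.g. with a small pseudo-count), since log(0) is -Inf.
clr <- log(comp) - rowMeans(log(comp))

# Aitchison distance is the Euclidean distance between clr-transformed rows
aitchison <- dist(clr)

# PCA on clr data; loadings indicate which compounds drive the main axes
pca <- prcomp(clr)
pca$rotation[, 1:2]
```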
• asked a question related to Ecological Statistics
Question
I have a statistical question concerning my data. I am currently working on a methodological experiment to model some soil chemical variables. Let's say, for simplicity, that I have measured variable A in 6 different SITES (S) in four, fixed layers in each site (L). I want to use A as a model for another variable, B.
My question is: what would be the best model to use in this case? I am aware that, given my experimental design, factor S is Random and factor L is nested within S. I have tried to use a GLMM, but the results are not clear to me and, I think, too obscure for the purpose of my research. Given that my goal is to prove that variable A can be a good proxy for variable B, can I use a more straightforward regression to create a simple model that can be useful to the scientific community?
I can help! Factor L cannot be nested in itself; it can only be nested in another different factor. So in your case, there are 2 factors, S and L – one is random, the other fixed, therefore you have a mixed ANOVA model for testing variable A. Also, you cannot 'nest' because L is not nominal i.e. it does not have discrete categories (e.g. + or -, boy or girl, etc.) and L is not a subset of S (or of itself lol). So the GLMM data file is very simple, with two columns of factors and one of the variable (A). The ANOVA table will therefore show S, L, and S x L.
To use A as a proxy for variable B, you would then have to show a significant CORRELATION between A and B first.
I hope this helps :)
• asked a question related to Ecological Statistics
Question
Best approach for statistical analysis of an intercropped maize+cowpea experiment. Please take an example of any growth parameter such as plant height, number of leaves, LAI, CGR etc. and compare both ANOVA tables. Thank you.
Dr Ronald, we have four crop rotations in the main plots, i.e. Maize+Cowpea-Oats-Cowpea, Maize-Oats-Mungbean, Maize+Cowpea (2:1)-Oats-Mungbean and Cowpea-Oats-Cowpea, and 5 nutrient management treatments in the subplots, i.e. Control, 100% NPK (inorganic), 125% NPK (inorganic), 75% NPK+FYM+PGPR and 100% NPK+FYM. The experiment is ongoing under a split-plot design.
• asked a question related to Ecological Statistics
Question
I am performing a site selection analysis for wastewater treatment plants, however, I need help in performing AHP to determine the criterion weights of land cover, slope, and distance to roads for overlaying it in ArcGIS. Thank you for kind response.
Hi Jerome Marquez,
I made a video about AHP for ArcGIS. Clip is attached to this message.
thanks
• asked a question related to Ecological Statistics
Question
Hi, I am using the analyses; Canonical analysis of principal coordinates (CAP) and Similarity Percentages - species contributions (SIMPER) in the stats program, PRIMER. However, I prefer to use R and was wondering what the equivalent functions were? I believe the vegan package is where one goes for permutational analyses but am not sure which functions to use. Any help would be appreciated. Cheers
Hi Jessica,
Yes vegan has the simper() function.
For CAP analysis you need to use the rda() function. You will need to transform your data into Euclidean space though (Hellinger transformation). If you are using Bray-Curtis dissimilarities, you should use db-RDA via the capscale() function.
Vegan is a great substitute for the PRIMER software.
Good luck!
Chris
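A minimal sketch of the two vegan calls mentioned above, using the dune example data that ships with the package rather than the asker's data. It assumes vegan is installed; whether capscale() reproduces PRIMER's CAP routine exactly is not something this sketch establishes.

```r
library(vegan)      # assumes the vegan package is installed

# Example data shipped with vegan (not the asker's data)
data(dune)
data(dune.env)

# SIMPER: species contributions to Bray-Curtis dissimilarity between groups
sim <- simper(dune, group = dune.env$Management)
summary(sim)

# Constrained analysis of principal coordinates on Bray-Curtis dissimilarities
cap <- capscale(dune ~ Management, data = dune.env, distance = "bray")

# Permutation test of the constraint
anova(cap, permutations = 999)
```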
• asked a question related to Ecological Statistics
Question
Good day
I am currently working on a project attempting to assess the niche overlap of various species using functional traits.
The issue I am running into is that the analysis I had intended to use (link in replies) is individual based and requires multiple individuals of the same species within the data set in the form (Sheet 1) however my data takes the form (Sheet 2) due to my data dedicating a single row to a species and their predominant trait (literature based). My data incorporates categorical and continuous data (reason for using first analysis).
Any suggestions?
See YouTube: SPSS for newbies: Changing a scale/continuous variable to a categorical variable
• asked a question related to Ecological Statistics
Question
We have distribution data for larvae of 2 species that were first randomly spread over a square experimental arena. After some time, the arena was divided into 100 quadrats of 1*1 cm and the number of larvae of each species in each quadrat was counted.
What is the best way (i.e. which aggregation index) to 1/ show that the observed distribution of larvae is aggregated and 2/ show that this aggregation is interspecific? Several indices and methods are reported in the literature, but I was unable to find the best way to answer these 2 questions with our dataset (quadrats).
Thanks
Dear Damien,
I suggest the links and attached files on these topics:
- Aggregation Behaviour and Interspecific Responses in Three Species ...
- Intra- and interspecific aggregation among dung beetles - Cambridge ...
Best regards
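As one possible starting point for quadrat counts, the variance-to-mean ratio (index of dispersion) is a simple aggregation index, and a rank correlation between the two species' counts gives a first look at whether they share quadrats. The sketch below uses simulated counts and hypothetical function names; other indices (for example Morisita's) exist, and this is not claimed to be the best choice.

```r
# Hypothetical counts for two species in 100 quadrats (values are simulated).
set.seed(3)
patch <- rgamma(100, shape = 0.5, scale = 4)   # shared patchiness across quadrats
sp1 <- rpois(100, patch)
sp2 <- rpois(100, patch)

# Index of dispersion (variance-to-mean ratio); about 1 under a random (Poisson) pattern
dispersion_index <- function(x) var(x) / mean(x)

# Under randomness, (n - 1) * ID approximately follows a chi-square with n - 1 df
dispersion_test <- function(x) {
  n <- length(x)
  stat <- (n - 1) * dispersion_index(x)
  c(ID = dispersion_index(x),
    p_aggregated = pchisq(stat, df = n - 1, lower.tail = FALSE))
}
dispersion_test(sp1)
dispersion_test(sp2)

# Do the two species tend to share quadrats? Rank correlation of counts across quadrats
cor.test(sp1, sp2, method = "spearman", exact = FALSE)
```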
• asked a question related to Ecological Statistics
Question
I am trying to analyze an ecological data set. At different sites, we measured different parameters, e.g. sedimentation (n = 3) and fish biomass (n = 5). I want to explain fish biomass with the sedimentation rate. I can't do this directly, because the sample sizes differ (3 and 5) and the measurements were not directly linked: the sediment traps used to assess sedimentation rates were spot measurements, whereas fish biomass was assessed with transects. I don't want to explain fish biomass by the mean sedimentation rate because then I would lose the variability of the sedimentation measurements. Would resampling be an appropriate approach to deal with these problems? Are there other possibilities?
Hi Andreas,
Taking into account your new pieces of information, you should find valuable solutions by following these links:
1. Jombart et al., "Finding essential scales of spatial variation in ecological data: a multivariate approach", 2009: http://pbil.univ-lyon1.fr/members/dray/files/articles/jombart2009.pdf & Supplementary material: http://www.ecography.org/sites/ecography.org/files/appendix/e5567.pdf
2. http://www.ievbras.ru/ecostat/Kiril/R/Biblio/Statistic/Legendre%20P.,%20Legendre%20L.%20Numerical%20ecology.pdf (Chapters 11 to 13)
3. http://www.ievbras.ru/ecostat/Kiril/R/Biblio/R_eng/Numerical%20Ecology%20with%20R%20(use%20R).pdf
4. Chen, "New Methodology of Spatial Crosscorrelation Analysis", 2015: https://arxiv.org/ftp/arxiv/papers/1503/1503.02908.pdf
• asked a question related to Ecological Statistics
Question
We have tried to run an NMDS analysis (plant species composition versus environmental gradients) using both presence-absence and abundance data. I found the results quite similar, but I am unsure which result should be presented. Is it possible to compare the abundance of tree species with that of bushy species?
Dear Mekdes Ourge,
The outcome of an nMDS analysis depends on the similarity/distance index you choose. The choice, in turn, depends a lot on the data you have. Generally, abundance data is more informative because it contains more information than just presence or absence. Apart from that, indices are designed to specific needs; e.g., you might want to use an index applying data normalization when having different data types or large ranges. A frequently used index for abundance data in ecological studies is the Bray-Curtis similarity index.
Alternatively, you could also run a Correspondence Analysis (or a Detrended CA), which is specifically designed for investigating species composition vs. environmental data.
Best regards,
Thomas
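To see how much the choice of data type changes the dissimilarities that feed an nMDS, the Bray-Curtis index can be computed on abundances and on presence-absence data and the two compared. The sketch below uses an invented site-by-species matrix and a hand-written helper (bray_curtis is a hypothetical name); in practice vegan's vegdist() and metaMDS() cover these steps.

```r
# Hypothetical site-by-species abundance matrix (5 sites x 6 species, invented values).
abund <- matrix(c(10, 0, 3, 0, 1, 0,
                   8, 2, 0, 0, 1, 0,
                   0, 5, 0, 6, 0, 2,
                   1, 4, 0, 7, 0, 3,
                   3, 3, 3, 3, 0, 0),
                nrow = 5, byrow = TRUE,
                dimnames = list(paste0("site", 1:5), paste0("sp", 1:6)))

# Bray-Curtis dissimilarity: sum|x_i - x_j| / sum(x_i + x_j)
bray_curtis <- function(m) {
  n <- nrow(m)
  d <- matrix(0, n, n, dimnames = list(rownames(m), rownames(m)))
  for (i in seq_len(n)) for (j in seq_len(n)) {
    d[i, j] <- sum(abs(m[i, ] - m[j, ])) / sum(m[i, ] + m[j, ])
  }
  as.dist(d)
}

d_abund <- bray_curtis(abund)        # abundance-based
d_pa    <- bray_curtis(abund > 0)    # presence/absence (equals Sorensen dissimilarity)

# How similar are the two sets of dissimilarities?
cor(as.vector(d_abund), as.vector(d_pa), method = "spearman")
```

A high correlation suggests that the two ordinations will look alike, which is consistent with what the question reports. The pairwise values are not independent, so a formal test would need a permutation approach such as a Mantel test.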
• asked a question related to Ecological Statistics
Question
I have mortality data for incubating Atlantic salmon embryos that have been recorded as proportions of dead embryos/total embryos fertilized. In order to run ANOVAs with data that satisfied the heterogeneity of variance assumption, the proportion data were logit transformed. When presenting the data in a paper, should the original proportions be plotted or should the logit transformed data be plotted? My instinct is that the transformed data should be plotted, however I feel that the biological significance and readability of the plot would be sacrificed if I ignored the original proportions... Thoughts?
If you're trying to illustrate your ANOVA results in a journal paper, I think plotting the transformed data is more appropriate. You can always mention the differences in terms of mortality rate in the text.
If it's for a conference paper, I would plot the raw data and just mention that the data were logit transformed for the analysis.
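A further option, sketched here under invented data, is to analyse on the logit scale but back-transform the group means and confidence limits to proportions for plotting. The column names and values are hypothetical; qlogis() and plogis() are base R's logit and inverse-logit functions.

```r
# Hypothetical mortality proportions for two treatments (invented values).
mortality <- data.frame(
  treatment = rep(c("A", "B"), each = 5),
  prop_dead = c(0.05, 0.08, 0.10, 0.04, 0.07, 0.20, 0.25, 0.15, 0.30, 0.22)
)

# Logit transform for analysis (proportions of exactly 0 or 1 would need adjusting first)
mortality$logit_dead <- qlogis(mortality$prop_dead)

# Means and 95% confidence limits on the logit scale, back-transformed per treatment
summ <- do.call(rbind, lapply(split(mortality, mortality$treatment), function(d) {
  m  <- mean(d$logit_dead)
  se <- sd(d$logit_dead) / sqrt(nrow(d))
  tq <- qt(0.975, df = nrow(d) - 1)
  data.frame(treatment = d$treatment[1],
             mean  = plogis(m),            # back-transformed to a proportion
             lower = plogis(m - tq * se),
             upper = plogis(m + tq * se))
}))
summ
```

The back-transformed intervals are asymmetric around the mean, which is expected for proportions near 0 or 1.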
• asked a question related to Ecological Statistics
Question
Dear all,
I am working on an ecological community species data matrix (site by species), and I have many species and sites. I want to select sub-communities with different sample sizes randomly, and later compare the similarity of these communities. The idea of doing is that some of my sites have a few specimens, so I want to find a sample size (a threshold) that I can use to compare the communities with each other, and discard certain sites that fall below that threshold. I am trying to decide which sites I want to include in my data analysis.
Two questions:
1- How can I randomly subsample the communities? Along these lines, I have tried various options, e.g. rarefying the communities to a certain size or using the sample() function in R.
2- If I have communities with different sizes, and generate distance matrices using these communities, I am not able to compare them using mantel test in R, due to incompatible dimensions. How would you compare samples with different sizes, regarding their similarity?
Any suggestions on these issues are appreciated.
It may be better to perform stratified sampling. A stratified sample is a mini-reproduction of the population. Before sampling, the population is divided into characteristics of importance for the research. For example, by main species type. Then the population is randomly sampled within each category or stratum. Random sampling has a very precise meaning in that each community has an equal probability of selection, which it may in fact not have.
Since communities may be of different sizes, perhaps you should compare composition by percentage, for example using Shannon's or Simpson's diversity indices.
See: On sampling procedures in population and community ecology, Vegetatio 83: 195-207, 1989.
I hope this helps :-)
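For the random subsampling part of the question, one way to draw a fixed number of individuals per site in base R is sketched below, followed by Shannon diversity on the subsampled counts. The matrix, the threshold of 20 and the helper names are assumptions for illustration; vegan's rrarefy() performs the same kind of random rarefaction.

```r
# Hypothetical site-by-species counts (3 sites x 5 species, invented values).
comm <- matrix(c(40, 25, 10,  5,  0,
                 12,  8,  6,  3,  1,
                  3,  2,  0,  1,  0),
               nrow = 3, byrow = TRUE,
               dimnames = list(paste0("site", 1:3), paste0("sp", 1:5)))

# Randomly draw 'size' individuals without replacement from one site
subsample_site <- function(counts, size) {
  individuals <- rep(seq_along(counts), times = counts)   # one entry per individual
  picked <- sample(individuals, size)
  tabulate(picked, nbins = length(counts))
}

threshold <- 20                                   # hypothetical minimum sample size
keep <- rowSums(comm) >= threshold                # sites below the threshold are dropped
set.seed(5)
rarefied <- t(apply(comm[keep, , drop = FALSE], 1, subsample_site, size = threshold))
colnames(rarefied) <- colnames(comm)

# Shannon diversity from relative abundances
shannon <- function(x) { p <- x[x > 0] / sum(x); -sum(p * log(p)) }
apply(rarefied, 1, shannon)
```

Because every retained site is reduced to the same number of individuals, distance matrices built from repeated subsamples cover the same set of sites and therefore have matching dimensions.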

• asked a question related to Ecological Statistics
Question
I am analysing abundance data using Primer 7 and I am a bit confused about how to pre-treat the data before carrying out SIMPER. I don't know whether I have to standardise the samples by total or standardise the variables (species).
Many thanks!
Hello Paz Aranega Bou.
When the unit of sampling cannot be tightly controlled, standardisation of the samples by total may be necessary (Clarke and Gorley 2006). This turns abundance data into relative abundances (percentages).
Clarke K., Gorley R. 2006. Primer v6: User Manual/Tutorial. PRIMER-E, Plymouth, UK, 193 pp.
• asked a question related to Ecological Statistics
Question
ToxCast_AssayData_2013_12_11.zip
I would like to know how come there are compounds that have a higher inhibitory effect on aromatase enzyme, when compared to the control (Letrozole)?
Thanks!
• asked a question related to Ecological Statistics
Question
Considering more than one random factor is sometimes very important and useful. However, in the case of glmmPQL (in R), I do not know why it is not possible to consider two (or more) random factors.
Many thanks Boudjema. I will try to do this command in R to see what will happen.
Sincerely.
• asked a question related to Ecological Statistics
Question
Hi everyone,
I'm currently looking for a statistical model to analyse competition between three individuals/species. So far I have only been able to find papers where they focus on one individual, not on all three at the same time.
If anyone can direct me towards papers that do this that would be fantastic as I'm sure they're out there!
I hope this helps:
Nauplius 21(1): 01–07, 2013
Ecological model of competitive interaction among three species of amphipods associated to Bryocladia thrysigera (J. Agardh) and extreme environmental stress effects.
• asked a question related to Ecological Statistics
Question
I want to compute Moran's I with spdep in r using nearest-neighbour distances. I have computed Moran's I with ape using inverse distance weights but this isn't quite what I need to do.
I am trying to find if abundance data from sample plots within the same field are spatially autocorrelated.
I have attached an example data file.
Any help with code would be greatly appreciated.
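As a sketch of what a nearest-neighbour Moran's I involves, the following computes it in base R with k-nearest-neighbour weights and a permutation test. The coordinates and abundances are simulated (not the attached file), k = 4 is an arbitrary choice, and the helper names are hypothetical. In spdep the corresponding steps are knearneigh(), knn2nb(), nb2listw() and moran.test().

```r
# Hypothetical plot coordinates and abundances (simulated, not the attached data).
set.seed(9)
n <- 30
xy <- cbind(x = runif(n, 0, 100), y = runif(n, 0, 100))
abundance <- rpois(n, lambda = 5 + xy[, "x"] / 10)   # abundance increases along x

# Row-standardised weights: each plot's k nearest neighbours get weight 1/k
knn_weights <- function(coords, k) {
  d <- as.matrix(dist(coords))
  diag(d) <- Inf                                   # a plot is not its own neighbour
  w <- matrix(0, nrow(d), ncol(d))
  for (i in seq_len(nrow(d))) w[i, order(d[i, ])[1:k]] <- 1 / k
  w
}

# Moran's I = (n / S0) * sum_ij w_ij z_i z_j / sum_i z_i^2, with z the centred values
morans_i <- function(x, w) {
  z <- x - mean(x)
  (length(x) / sum(w)) * sum(w * outer(z, z)) / sum(z^2)
}

w   <- knn_weights(xy, k = 4)
obs <- morans_i(abundance, w)

# Permutation test: shuffle abundances among plots
perm <- replicate(999, morans_i(sample(abundance), w))
p_value <- (sum(perm >= obs) + 1) / (999 + 1)
c(I = obs, p = p_value)
```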
• asked a question related to Ecological Statistics
Question
I now have 10 years of species abundance data and want to test how community composition responds to treatment. Can year be added as a covariate in an RDA, as in the following formula?
RDA= treatment+temperature*precipitation+year?
How can multiple years be included in a redundancy analysis (RDA)?
The vegan library can handle time as a factor; see http://evol.bio.lmu.de/_statgen/Multivariate/11SS/rda.pdf (RIKZ example).
• asked a question related to Ecological Statistics
Question
I am currently studying about marine gastropod species composition in mangrove and rocky habitat. And I want to see the correlation between environmental variable and species composition. What ordination should i choose? DCA? NMDS? PCA? And what is the difference?
I agree with what Denis wrote about CCA. I would add that CCA assumes a unimodal, bell-shaped response of species composition to environmental variables, which is usually the case when your environmental gradients are long enough to include both suitable and unsuitable environments for your species. If instead you have short environmental gradients, so that you can assume a linear response of species composition to environmental variables, Redundancy Analysis (RDA) is usually recommended. RDA is very similar to PCA (it is a multiple linear regression) but it models the explicit dependence of the response variables (species composition in your case) on the environmental variables, whilst PCA handles all variables together.
I think the most complete manual about those methods is Legendre and Legendre's Numerical Ecology, and all those methods are implemented in the R package vegan, which in my opinion is very well done and easy to use.
• asked a question related to Ecological Statistics
Question
With my experiment on decomposition, I will need to obtain a litter group to study. I would like to encompass a fair amount while still maintaining an element of randomness
Thank you for the response. This greatly assisted me in my study
• asked a question related to Ecological Statistics
Question
Count data are the dependant variable (Y) and ecological area is independent (x). Can you recommend which test would be most appropriate (its a small data set)?
How small is a "small dataset"?
Anyway, assuming you need a univariate analysis and simply want to know if "region" (X) significantly affects your observed counts (Y), I would try a Generalised Linear Model (GLM) with a Poisson distribution. This way you can analyse your actual data and don't need to transform them.
You would need to validate the model residuals as usual to make sure it was the correct approach, but the models can be made relatively easily in R:
As someone else said, this won't actually demonstrate a causal relationship between X and Y, so report your findings accordingly!
• asked a question related to Ecological Statistics
Question
My understanding of statistics is very weak, so I apologise if this is not clear, or the question is unwarranted
Undergraduate students set out to find out whether substrate type (four types on a coral reef) affected algal community structure. They used a stratified random sampling design, with five replicates in each stratum, with each replicate represented by a quadrat that was randomly thrown.
Using a 1m by 1m quadrat subdivided into 100 squares they estimated percentage cover of different species of algae (i.e., percentage of each quadrat occupied by each species). Their resolution was 0.25% (quarter of a square).
They wanted to do an ANOVA, so they needed continuous data. Instead of using the data as percentages, they instead used actual area covered (each square is 10 cm by 10 cm, or 100 cm2).
A colleague disagreed with this strategy on the following counts:
1. they felt the percentage data should have been Arc Sine transformed instead, and converting to area did not represent a valid transformation for this purpose. Applying an arithmetic conversion was not a satisfactory option
2. the percentages, and hence areas, were an estimate and not an absolute measure. They mentioned that that means they are likely to vary from one person to the next (no questions were raised about whether or not the estimates were done by one or more students)
3. the resolution of measure was quarter of a square, or 25 cm2, so they felt this was not really continuous data
Are these valid concerns, and if so, which and why? In addressing 2 and 3, please add comments on how using ArcSine would have been better than using area (I felt that the transformation may carry forward the same concerns: estimated data and what I think they meant as inadequate resolution).
Hi Dawn
I really do not know how and why people may disagree on the original approach of using % (or proportion) cover as measurement unit and then performing ANOVAs (or t-test or any other parametric or non-parametric method). There are literally tons of papers using this approach for many many decades. This is just a random paper I googled:
Of course the only problem is to ensure that the data meets the parametric assumptions. Cover data often is not normal, and the regular recommended transformation is the so-called 'angular transformation' (arcsine of the square root of the proportion). Check out this paper:
Even if the angular transformation does not work, ANOVA is very robust to violations of normality (but not to heteroscedasticity).
1) Yes, I agree with your colleague: using areas instead of the original point-data (converted to % cover) was not really an improvement and is redundant. After all, it is exactly the same point-data treated in a different way. Moreover, the whole idea of using point-data is that it simplifies the sampling, focusing on specific points that can be easily inspected by any observer. These are points, and cannot be converted to area. A very different situation would be if you had photographic records, so that you could estimate the area covered by each alga in your plots. That would provide you with a direct estimate of the area (cm2) covered by each alga.
2) I partially disagree with your colleague. Point-data from quadrats is a very common and classic method that both marine and terrestrial ecologists have used for decades. Bias introduced by observers is always a possibility (when is it not?), but you can make the case that all observers were well-trained students with good experience in the identification of algae. In addition, you could run a few trials to quantify the % of error introduced by the use of different observers and simply report it.
3) I really disagree. Of course point-data is not truly continuous, but using a 100-point quadrat ensures a good approximation. Besides, using a 100-point quadrat is a standard method for the bulk of ecological studies on coastal rocky shores. Again, do a simple Google search and you will find thousands of papers using exactly the same approach. Your colleague's critique is really unfair, because it is aimed more at the historic approach used by the discipline than at your specific research. And even if we decide to obey the "talibans of the statistics", there are methods to deal with non-continuous data (e.g. generalized linear models with Poisson errors).
In summary, I recommend that you go back to the % cover (proportion data), use the angular transformation, test the parametric assumptions, and if the data are homoscedastic go with the classic ANOVA. Be sure to cite a bunch of classic marine ecology papers to justify your methods.
Marcelo
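To make the summary above concrete, here is a minimal R sketch of the suggested workflow. The data frame `cover`, its site labels and the simulated hit counts are all made up for illustration; only base R functions are used, and this is one possible way to run the checks, not the only one.

```r
# Hypothetical data: number of points (out of 100) hit by the alga in
# 10 quadrats at each of three sites. Replace with your own counts.
set.seed(1)
cover <- data.frame(
  site = factor(rep(c("A", "B", "C"), each = 10)),
  hits = c(rbinom(10, 100, 0.15), rbinom(10, 100, 0.40), rbinom(10, 100, 0.70))
)

cover$prop <- cover$hits / 100        # proportion cover, between 0 and 1
cover$ang  <- asin(sqrt(cover$prop))  # angular (arcsine square-root) transformation

# Fit the one-way ANOVA on the transformed scale
fit <- aov(ang ~ site, data = cover)

# Check the parametric assumptions
shapiro.test(residuals(fit))             # normality of the residuals
bartlett.test(ang ~ site, data = cover)  # homogeneity of variances among sites

summary(fit)                             # classic ANOVA table
```

If the variance test still signals heterogeneity after the transformation, that is the point at which an alternative (for example a generalized linear model) becomes worth considering.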
• asked a question related to Ecological Statistics
Question
Good day to everyone!
I've completed a Mood's median test comparing sea louse median intensity levels (number of lice/individual host) at four separate sites. I am doing a separate test for each month of data collection to detect if and when significant differences occurred between site locations. Some results indicate a statistical difference in medians between at least one site and the other three, but I need a method to determine which site differs statistically. I've been unsuccessful at finding an appropriate post-hoc choice and have resorted to doing pairwise Mood's median tests when significance is found. I know that this is not a strong method, and was hoping for feedback....
I also gathered that the Kruskal-Wallis test compares medians, but that it is not an appropriate method for parasite count data that have a negative binomial distribution. If I could use it, this would be simple, because I would be able to run Dunn's test as a post-hoc.... Am I being much too cautious? Any thoughts?
What are procedures of computing Cohen's Kappa agreement coefficient in R?
• asked a question related to Ecological Statistics
Question
In a morphometric variability study of montane shrub species populations, which I have just launched, leaf shape would be quite a promising character. Although commonly used in such studies, leaf shape is often analysed as a set of individual "shape-describing" linear traits and their ratios, plus leaf area and perimeter. But still, each of these traits is treated by the analysis as an individual independent variable. Thus, I am looking for a high-precision method to measure and analyse leaf shape as a single whole.
The procedure I have in mind is as follows: on the photographed or scanned leaf images, a number of control points are placed along the leaf outline in a computer program. The program then analyses the differences between leaf outlines based on these points, resulting in a numerical/graphical representation of the leaf shape variation.
Having reviewed some literature, I found that this can be done by so-called Elliptic Fourier leaf shape analysis using R statistics. Has anyone dealt with such kind of analysis? Is it applicable for within-species population studies? This analysis can be carried out by any of numerous algorithms, so did anybody compare their effectiveness? Also, are there any easier-to-use substitutes for this method? I would be grateful for recommending a relevant statistics and software, some manuals and publications.
Fourier shape analysis is quite common in fisheries studies where fish can be classified into groups by the shape of the otolith (ear bone). Momocs is a really neat package in R for this and the user guide should give you some good ideas on how to start.
regards,
james
• asked a question related to Ecological Statistics
Question
Hi, everybody!
In my regression analyses (performed through LM and GLM models) I found R-squared values from 8% to 15%, but high P values. The predictor variable was incisors' procumbency angle and the response variables were the mechanical advantages of jaw adductor muscles in a subterranean rodent species.
Can I include such low R-squared values in my research paper? Or do R-squared values always have to be 70% or more? If anyone can refer me to any books or journal articles about the validity of low R-squared values, it would be highly appreciated.
Hi Alejandra,
Whether an R-squared is "high" or "low" really does depend on the context. In some contexts an R-squared of 70% is considered appropriate, but in others an R-squared of about 20% may already satisfy the scientists.
In your model you have only one predictor, and it is significantly related to the dependent variable, but your R-squared is 8% to 15%. You could consider including other possibly important covariates.
R-squared never decreases when you add more covariates to your model, whether or not they are informative. So what we can do is use the adjusted R-squared instead of R-squared (in case you have many covariates). The adjusted R-squared can even decrease when we add more covariates to the model.
Below you can see a clear discussion on this problem:
I hope it helps.
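As a small illustration of the point about adjusted R-squared, here is an R sketch on simulated data. Everything in it (the sample size, the effect size, the five pure-noise covariates in `noise`) is an assumption made up for the example. It uses the standard definition, adjusted R-squared = 1 - (1 - R-squared)(n - 1)/(n - p - 1), where p is the number of predictors.

```r
set.seed(42)
n     <- 40
x     <- rnorm(n)
noise <- matrix(rnorm(n * 5), ncol = 5)  # five covariates unrelated to y
y     <- 0.4 * x + rnorm(n)              # weak true effect of x

fit1 <- lm(y ~ x)          # one predictor
fit2 <- lm(y ~ x + noise)  # same predictor plus five noise covariates

r2  <- c(one = summary(fit1)$r.squared,     six = summary(fit2)$r.squared)
adj <- c(one = summary(fit1)$adj.r.squared, six = summary(fit2)$adj.r.squared)

# Adjusted R-squared computed by hand for the six-predictor model (p = 6)
adj_manual <- 1 - (1 - r2[["six"]]) * (n - 1) / (n - 6 - 1)

print(round(rbind(r2, adj), 3))
```

The plain R-squared of the larger model cannot be lower than that of the smaller one, because the models are nested; the adjusted value is penalised for the extra terms and may go either way.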
• asked a question related to Ecological Statistics
Question
I am doing phytosociological research and am only interested in finding the vegetation composition (plant communities), so which is the best statistical software for me to use?
Try pls
Distance software
Sigmastat software
• asked a question related to Ecological Statistics
Question
I am looking at boldness in male Siamese fighting fish. Using sand and white gravel I created a slope from a deep end (16 cm deep) to a shallow end (5 cm deep) and split the tank into 3 equal sized sections to create a 'safe', intermediate and 'scary' zone. The 'safe' zone provides shelter with stones and plants. The scary zone is shallow, brightly lit and empty. A bird silhouette hangs above to act as a predation risk. The intermediate zone creates a gradient between the two zones. I allowed each male to acclimatise in the safe end for 5 minutes and then recorded the time they spent in each zone over 20 minutes. I repeated this 3 times for each fish over 3 weeks with different stimuli each trial.
I want to know what is the best way to analyse this information as I have had trouble with the intermediate zone and what to do with it.
Hi! Given that you have repeated observations of the same individuals in different contexts (different stimuli), I think some kind of (generalized) linear mixed model makes most sense. You could perhaps focus only on the proportion of time spent in the "scary" zone (given that each trial has a fixed duration), because that seems to be related to your research question (something about boldness...). You say that the zones are equally large: do they have the same volume, or the same bottom area? Given that Betta are not bottom dwellers, I would assume that sections are standardized to volume; otherwise you may want to check that any differences are not simply proportional to the volume available within a section...
• asked a question related to Ecological Statistics
Question
I am running some exercises to assess the correspondence of bootstrap samples to asymptotically expected results. For example, from a relatively small, single-variable, simulated normal population (N=100), can one confirm that the sample means and variances of a large number of bootstrap samples (with replacement) are independent? How dependent is the result on the size of the original population, the number of bootstrap samples, and the size of each sample? Not asking for actual results, just what you might expect to see, or resources you can suggest in the literature.
Thanks for your responses, they have tremendously helped me to understand the situation.
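For anyone wanting to try the exercise described in the question, here is a minimal R sketch. The population parameters, the number of bootstrap samples `B` and the bootstrap sample size `m` are arbitrary choices made for illustration, so vary them to see how the result depends on each.

```r
set.seed(123)
pop <- rnorm(100, mean = 10, sd = 2)  # simulated normal "population", N = 100
B   <- 5000                           # number of bootstrap samples
m   <- 100                            # size of each bootstrap sample

# One row per bootstrap sample: its mean and its variance
boot_stats <- t(replicate(B, {
  s <- sample(pop, size = m, replace = TRUE)
  c(mean = mean(s), var = var(s))
}))

# Sample correlation between bootstrap means and bootstrap variances
r_mv <- cor(boot_stats[, "mean"], boot_stats[, "var"])
print(r_mv)

plot(boot_stats[, "mean"], boot_stats[, "var"],
     xlab = "bootstrap mean", ylab = "bootstrap variance")
```

One thing to keep in mind when reading the output: the bootstrap resamples from the 100 observed values, not from the normal distribution itself. The covariance between a sample mean and a sample variance is proportional to the third central moment of the distribution being sampled, so any skewness in those 100 values can produce a small non-zero correlation even though the generating distribution is symmetric.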
• asked a question related to Ecological Statistics
Question
So far I have tested shoot hormones with 15 replicates per concentration and root hormones with only 5 replicates per concentration...
For shoots, I need to analyse the increase in length, the number of newly formed shoots and the length of the newly formed shoots. For roots, I need to analyse the number of roots formed, the length of primary roots and the length of secondary roots.
In fact graphs are also a form of analysis. It may not be very definitive, but it can help us in understanding our own data. Draw different graphs and ask yourself what each graph tells you. Then you can turn to statistical tests. Here too, do simple tests first and then more complicated tests.
• asked a question related to Ecological Statistics
Question
Dear all,
I would like to know about recent/advanced software for statistical analysis, especially for plant tissue culture and diversity studies. Could you suggest some software or websites where I can download it if it is available for free? Otherwise, you could send me the *.exe files too, if possible.
thank you all.
If free is the key word, then R is your best choice. I think the correct site is https://www.r-project.org/
There is a long list of non-free choices. SAS is my favorite, but someone else pays for it. Many people use SPSS. Other options include Minitab, Mathematica, Matlab, Statistica, JMP, SYSTat, and a very long list of old programs like BMDP.
Your best option depends a great deal upon what you want to do and your level of skill both as a statistician and as a programmer. If all you need is a t-test, then something like Microsoft Excel could work. If you need awesome statistical power because you are developing awesome new methods and applications, then your best bet might be R.
1) SAS is expensive. However, it has awesome documentation, good support (at least in the USA), and is stable. The IO has changed a great deal, but the program that I wrote in 1980 will still run. SAS is actively updated. Thus while my old program still runs, I have a whole range of new choices. Also, I really like that the 4,000++ pages that used to be the SAS documentation are now online and searchable.
2) R is free. However, the documentation is hard to use (at least for a beginner). There are a large number of friendly people willing to help. However, the R platform is flexible. So much so, that a graduate student had to rewrite his code when one of the R packages changed (or stopped getting support). So a large number of people are producing a large amount of computer code, but that diversity can be confusing and not all of it is supported by the core R product. That said, this platform enables very rapid change.
3) Products like JMP that are more point-and-click are great for performing standard analyses. However, you are stuck with whatever standard the software company thinks that its users want.
4) Products like Matlab and Mathematica can (I think) do any of the analyses you like if you are sufficiently skilled to program them correctly.
5) Minitab has changed a great deal over the last few decades. Originally I liked it as a tool for matrix manipulation. It is great to see all the steps without having to invert a 10 x 10 matrix by hand, or transpose a matrix without making an error.
6) Of course the most powerful stats package of all is doing the programming yourself. Fortran used to be the main platform, but I think C (or newer C++) is the more common platform. However, this is time consuming and risks a great deal of error.
7) The biodiversity analysis can be done using EstimateS. http://viceroy.eeb.uconn.edu/estimates/
• asked a question related to Ecological Statistics
Question
I have 50 soil samples from which I took sub-samples to carry out a number of tests such as chemistry analysis, pH, EC etc. I have 3 replicates (i.e. 3 sub-samples) of 3 of the samples. I would like to assess how representative my sub-samples are of the sample as a whole. Are there any statistical tests I can use for this? I have considered comparing the standard deviations of my replicates to the standard deviation of the samples overall, but I'm unsure if this is correct or if there is a critical value I can use to assess whether or not the difference in standard deviation is significant.
Thank you both, you've been very helpful
• asked a question related to Ecological Statistics
Question
I want to calculate the Simpson Index of Diversity (1-D) for % cover data of plant species in plots. I have a lot of plant species that have <1% cover in a plot, which then result in negative values in the formula. E.g., plant A is 0.17% --> D=n*(n-1)=0.17*(0.17-1)=-0.1411.
I also have a lot of species that have a cover of 1%, which results in 0 values in the formula: D=n*(n-1)=1*(1-1)=0. Because of these low/0% cover values, my Simpson Index for some of my plots comes out at values >1 (it should be between 0 and 1). Is that possible/correct?
Could I possibly multiply all my % cover values by 10 to get rid of my values <1 and =1, and so avoid negative/zero values in my formula?
As Joh and Daniel pointed out, the formula you are using is for counts of individuals and is not suited to your data.
Simpson's complementary diversity (1-D) relies on Simpson's dominance (D). This is a purely probabilistic approach: D is the probability that two individuals drawn at random from a sample belong to the same class (=species). Suppose you have a bag with 99 white balls and one black ball. White balls dominate. If you draw with replacement, the probability of randomly extracting two white balls is 0.99 x 0.99 (that is, the product of the independent probabilities of drawing each white ball, each of which is n/N, where n is the number of white balls and N the total number of balls). If you draw without replacement, it is (99/100) x (98/99), which is where the n(n-1) form comes from.
As you can see, the with-replacement form maps directly onto your percent data. The formula you are using has the n(n-1) finite-sample correction, which estimates dominance from counts of individuals in a finite sample. However, you can use the purely probabilistic form, (n/N)^2, as long as your "individuals" come from a sample large enough to be considered a fraction of an effectively infinite pool.
So you could, without much error and assuming that your data allow fractional values, calculate Simpson's complementary index as 1-SUM[(ni/N)^2], where ni is the percent cover of each species and N is the sum of all percent covers (because species can overlap, the covers may sum to more than 100%, so you cannot simply use (ni/100)^2).
That should both do away with your problem and get you a reasonable estimate.
Regards
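In R, the calculation described above can be sketched as follows. The function name `simpson_cover` and the cover values in `plot1` are hypothetical, chosen to include the problematic 0.17% and 1% covers from the question.

```r
# Simpson's complementary index (1 - D) from percent cover values.
# cover: numeric vector with one cover value per species in a plot.
simpson_cover <- function(cover) {
  cover <- cover[cover > 0]  # species with zero cover contribute nothing
  p <- cover / sum(cover)    # relative cover: ni / N, with N = sum of covers
  1 - sum(p^2)
}

# Hypothetical plot with covers below, at and above 1%
plot1 <- c(spA = 0.17, spB = 1, spC = 35, spD = 60, spE = 12)

simpson_cover(plot1)
simpson_cover(plot1 * 10)  # rescaling the covers leaves the index unchanged
```

Because the covers are divided by their own sum, multiplying every value by 10 changes nothing, and the result always lies between 0 and 1.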
• asked a question related to Ecological Statistics
Question
I'm interested in ecological statistics, so I want to know the best software to use.
Ecological statistics is a very broad field. I personally would use R if you are planning to continue doing this type of analysis in the future. It is an extremely flexible system and is quickly becoming the "standard software" for statistics in many different fields, including ecology. In addition, it is free and runs on most operating systems (Windows, Linux, OS X, Solaris) with no difficulties, and even, with some difficulty and for simpler jobs, on single-board computers like the Raspberry Pi 2.
I have used PASW (new name for SPSS) and, if you meant Origin, I have also used it. Origin is mainly a system for plotting with some statistics added. PASW comes with different modules, and depending on the licence you have it may not cover all your needs. My own experience is that it is unwise to spend too much time learning a commercial system that, because of its high cost, may not be available to you at a future workplace, or if you decide to work independently as a consultant or in a non-profit or similar organization. R is free, but there is now also a version from Microsoft that is just the same program with some "quality assurance" from Microsoft and a very few performance tweaks.
• asked a question related to Ecological Statistics
Question
like Maxent SDM or any other
Hi Husam,
I'm most familiar with the presence-only SDMs like Maxent. In general for SDMs, you'll want to focus on biologically relevant environmental factors as your predictor variables. So feasibly, you could include some sort of habitat suitability index (HSI) metrics as predictor variables; however, you would need those HSI metrics calculated for every unit you want to predict the species distribution into. This is why almost every example simply uses larger, publicly available datasets with large spatial coverage. If you are interested in using Maxent, I would highly recommend Elith et al. 2011:
• asked a question related to Ecological Statistics
Question
I am dealing with samples from trees, where samples higher up in adjacent trees are closer to each other than they are to those lower down on the same tree. Using a standard distance matrix function places samples on the same tree at zero distance from each other. Is there any way to do this that doesn't involve manually calculating each sample pair? For example, is there an R package with a distance function that incorporates height or elevation data? I couldn't find any.
The goal is to use this in a Mantel test comparing it with an ecological distance matrix.
Bryce,
The dist() function in R, if provided with data in 3 dimensions (x, y and elevation) will do the trick.
For example the following script in R creates and visualises an example data set with coordinates in 3 dimensions. The output is attached as a figure:
# example dataset with x coordinates, y coordinates and elevation
xcood <- c(10, 30, 40, 45, 40)
ycood <- c(10, 10, 40, 42, 40)
elev <- c(40, 42, 38, 57, 60)
dat1 <- cbind(xcood, ycood, elev)
# visualise data
library(scatterplot3d)
a <- scatterplot3d(xcood, ycood, elev, type="h", xlim=c(0,70), ylim=c(0,50), zlim=c(0,100),
                   pch=16, box=TRUE)
# label each point with its index (1 to 5), offset slightly along the x axis
a$points3d(xcood+2, ycood, elev, pch=c("1", "2", "3", "4", "5"), col=2)
As you can see from the figure, data point 5 should be closer to 4 than to 3, even though 3 and 5 are at the same x-y coordinates.
You can then calculate the distance matrix by providing the dist() function with the 3-dimensional dataset
# calculate distance matrix of dataset dat1
dist(dat1)
The output of the command above will show you that the distance between 5 and 3 (on the same tree) = 22. But the distance between 5 and 4 (on different trees) = 6.16
I think this is what you need.
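If you want to pull out individual pairs rather than read them off the printed triangle, a small follow-up sketch (same example data as above, base R only) is to convert the result to a full matrix:

```r
# Same example data as in the script above
xcood <- c(10, 30, 40, 45, 40)
ycood <- c(10, 10, 40, 42, 40)
elev  <- c(40, 42, 38, 57, 60)
dat1  <- cbind(xcood, ycood, elev)

d  <- dist(dat1)    # Euclidean distances in 3 dimensions ("dist" object)
dm <- as.matrix(d)  # full symmetric matrix, indexable by sample number

round(dm[5, 3], 2)  # samples 5 and 3, same tree, different heights: 22
round(dm[5, 4], 2)  # samples 5 and 4, different trees: 6.16

# The "dist" object d is the form that Mantel test functions usually expect,
# e.g. vegan::mantel(d, eco_dist), where eco_dist is a hypothetical ecological
# distance matrix for the same samples and vegan is assumed to be installed.
```

Note that this treats horizontal and vertical distance as equivalent, which is only sensible if x, y and elevation are all in the same units.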
• asked a question related to Ecological Statistics
Question
• I placed 19 litter traps in both areas.
• I separated the litter into four constituent parts: fruits, flowers, branches and leaves.
• I obtained their dry biomasses.
• I have two collections for each site.
What statistical analysis can I use to compare these two areas?
Dear Sheik
You need to determine the goals of your study first (I suppose that you want to compare two areas). If you want to compare the means of two groups (the two areas), then you must test your data for normality; you can use a t-test for normally distributed data or the Mann-Whitney test for non-normally distributed data.
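As a rough sketch of that decision path in R, for one litter fraction at a time: the data frame `litter`, the column names and the simulated leaf biomass values are all hypothetical, with 19 traps per area as in the question.

```r
# Hypothetical data: leaf dry biomass (g) from 19 traps in each of two areas
set.seed(7)
litter <- data.frame(
  area      = factor(rep(c("area1", "area2"), each = 19)),
  leaf_mass = c(rlnorm(19, meanlog = 3.0, sdlog = 0.4),
                rlnorm(19, meanlog = 3.3, sdlog = 0.4))
)

# Step 1: normality check within each area (Shapiro-Wilk p-values)
normality_p <- tapply(litter$leaf_mass, litter$area,
                      function(v) shapiro.test(v)$p.value)
print(normality_p)

# Step 2a: if both areas look roughly normal, a two-sample t-test
# (R uses the Welch version, which does not assume equal variances)
tt <- t.test(leaf_mass ~ area, data = litter)

# Step 2b: otherwise, the Mann-Whitney (Wilcoxon rank-sum) test
mw <- wilcox.test(leaf_mass ~ area, data = litter)

print(tt$p.value)
print(mw$p.value)
```

The same steps would be repeated for fruits, flowers and branches. With two collections per site, you would also need to decide whether to sum, average or analyse the collections separately before comparing areas.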
• asked a question related to Ecological Statistics
Question
Hi,
I have two separately generated habitat suitability maps and a set of independently gathered GPS relocations of the species.
How can I use the independent data to judge which one of the suitability maps is more accurate?
Hi Maarten,
You can use receiver operating characteristic (ROC) curves to determine the accuracy of the suitability maps that you have prepared. Using independent datasets for validation is ideal, because it avoids overestimating the accuracy.
Please see the discussion link attached below. You can find more discussion of ROC by searching in RG itself. The link relates to the validation of landslide susceptibility maps, but the ROC process is the same in any analysis.
Hope this will give some idea about the process of validation of spatial maps using ROC.
Good luck.
Vijith
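To show what the comparison could look like in practice, here is a minimal base-R sketch that computes the area under the ROC curve (AUC) for two maps. It assumes you have already extracted the suitability value of each map at the GPS relocations and at a set of random background points. All vectors below are simulated stand-ins, and the helper name `auc_rank` is made up for this example. Because relocations are presence-only data, the background points play the role of the "negative" class, which is an assumption you should state when reporting the result.

```r
# AUC from the rank-sum (Mann-Whitney) identity:
# the probability that a presence point scores higher than a background point.
auc_rank <- function(pres, bg) {
  r  <- rank(c(pres, bg))  # ties receive average ranks
  n1 <- length(pres)
  n0 <- length(bg)
  (sum(r[seq_len(n1)]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Simulated suitability values (0 to 1) standing in for extracted map values
set.seed(99)
pres_A <- rbeta(50, 4, 2)    # map A at the GPS relocations
pres_B <- rbeta(50, 2.5, 2)  # map B at the same relocations
bg_A   <- rbeta(500, 2, 2)   # map A at random background points
bg_B   <- rbeta(500, 2, 2)   # map B at the same background points

auc_A <- auc_rank(pres_A, bg_A)
auc_B <- auc_rank(pres_B, bg_B)
print(c(map_A = auc_A, map_B = auc_B))
```

An AUC of 0.5 means the map ranks relocations no better than random background, and values closer to 1 mean better discrimination, so the map with the higher AUC on the independent relocations would be preferred under this criterion.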
• asked a question related to Ecological Statistics
Question