
# Geostatistical Analysis - Science topic

Questions related to Geostatistical Analysis
Question
I would like to use the R package "gstat" for predicting and mapping the distribution of water quality parameters using two methods: kriging and co-kriging.
I need a guide to code or resources to do this.
Azzeddine
So you want the code for kriging using gstat. A simple Google search turns up plenty of tutorials; you just need to apply the principles shown in those tutorials to your data.
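Since the thread asks for gstat code in R, here is a language-neutral sketch of what those tutorials implement: ordinary kriging with an exponential covariance model, written in plain Python/NumPy. The locations, values, and model parameters below are invented for illustration.

```python
import numpy as np

# toy observations of a water-quality parameter at scattered 2D locations (made up)
pts = np.array([[0.0, 0.0], [1.0, 3.0], [4.0, 1.0], [3.0, 4.0], [2.0, 2.0]])
vals = np.array([10.0, 12.0, 9.0, 11.0, 10.5])

def cov(h, sill=2.0, rng_=3.0):
    # exponential covariance model (nugget omitted for brevity)
    return sill * np.exp(-h / rng_)

def krige(target):
    # ordinary kriging at one target location: solve the padded system
    # [[C, 1], [1', 0]] [w; mu] = [c0; 1], then predict w @ vals
    d = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
    n = len(pts)
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = cov(d)
    A[n, n] = 0.0
    b = np.append(cov(np.linalg.norm(pts - target, axis=1)), 1.0)
    w = np.linalg.solve(A, b)[:n]
    return w @ vals

est = krige(np.array([2.0, 3.0]))
```

Without a nugget the predictor interpolates exactly at the data points. Co-kriging extends the same system by stacking the cross-covariance blocks of the secondary variable into A and b; gstat's `gstat()`/`krige()` functions do that bookkeeping for you.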
Question
Presently I am attempting sequential Gaussian simulation using the SGeMS software. I have followed the steps shown in various videos available online and in the user guide (Remy et al., 2006). I have set a grid of 50x20x100 as per the procedure and finally derived 50 realisations. Now I want to retrieve the simulated data along with the grid point locations so that I can plot them in different software.
I followed the steps "Object -> Save project -> Grid to save -> Property to save -> CSV format" and saved the file. However, when I open it, I get the simulated values of the variable but not its XYZ coordinates. I have attempted this multiple times, unsuccessfully.
Can anyone point out which step I am missing?
TIA.
Thank you Dr Robin for your prompt answer. However, my problem still persists. The reasons are:
1. My cell sizes are 2m x 25m x 5m in the XYZ directions, hence even if I reshape my computation grid of 1x100000 to 50x20x100, I won't be able to assign the right XYZ coordinates to each centroid unless I know the sequence in which the centroids are numbered.
2. I don't know how to use Python.
I wonder why software that takes coordinate input in a proper format, with clearly defined X, Y, Z coordinates, does not deliver the output in a similar fashion.
Thank you for your help anyway.
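On the centroid-numbering point: SGeMS follows the GSLIB convention for Cartesian grids, with the x index cycling fastest, then y, then z, so the coordinates can be reconstructed from the flat index. A sketch with the grid and cell sizes from the question (origin assumed at zero; check both the ordering and the origin against your project's grid definition):

```python
import numpy as np

# grid definition from the question; origin is an assumption, replace with your own
nx, ny, nz = 50, 20, 100
dx, dy, dz = 2.0, 25.0, 5.0
ox, oy, oz = 0.0, 0.0, 0.0

# GSLIB/SGeMS flat ordering: x cycles fastest, then y, then z
i = np.arange(nx * ny * nz)
x = ox + dx * (i % nx + 0.5)            # cell-centre x coordinate
y = oy + dy * ((i // nx) % ny + 0.5)    # cell-centre y coordinate
z = oz + dz * (i // (nx * ny) + 0.5)    # cell-centre z coordinate

# stack alongside a simulated-property column of the same length to export as CSV:
# coords_and_value = np.column_stack([x, y, z, simulated_values])
```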
Best wishes
Question
What ways can I get measurements of ozone and nitrogen dioxide concentrations in an area? What satellites provide this type of data and can I use Google Earth Engine to retrieve the data when it is available?
Reghais.A
Thanks
Question
Hi, I was hoping someone could recommend papers that discuss the impact of using averaged data in random forest analyses or in making regression models with large data sets for ecology.
For example, if I had 4,000 samples each from 40 sites and did a random forest analysis (looking at predictors of SOC, for example) using environmental metadata, how would that compare with doing a random forest of the averaged sample values from the 40 sites (so 40 rows of averaged data vs. 4,000 raw data points)?
I ask this because a lot of the 4,000 samples have missing sample-specific environmental data in the first place, but there are other samples within the same site that do have that data available.
I'm just a little confused about 1) the appropriateness of interpolating average values based on missingness (best practices/warnings), 2) the drawbacks of using smaller, averaged sample sizes to deal with missingness vs. using incomplete data sets vs. using significantly smaller sample sizes from only "complete" data, and 3) the geospatial rules for linking environmental data with samples (if 50% of plots in a site have soil texture data and 50% don't, yet they're all within the same site/area, what would be the best route for analysis? It could depend on the variable, but I have ~50 soil chemical/physical variables).
Thank you for any advice or paper or tutorial recommendations.
Thank you!
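To make the trade-off concrete, here is a small synthetic sketch (invented site effects and noise levels): site-averaging suppresses within-site noise by roughly 1/sqrt(n), but it discards all sample-level variation, so a model fitted to the 40 means can only ever explain site-level signal.

```python
import numpy as np

rng = np.random.default_rng(6)

# hypothetical: 40 sites x 100 samples, site-level signal plus sample-level noise
n_sites, n_per = 40, 100
site_effect = rng.normal(0, 1, n_sites)
raw = site_effect[:, None] + rng.normal(0, 2, (n_sites, n_per))   # 4,000 raw rows
site_means = raw.mean(axis=1)                                      # 40 averaged rows

# within-site noise (sd ~2) vs. the error of the site means (sd ~2/sqrt(100) = 0.2)
within_sd_raw = raw.std(axis=1).mean()
sd_of_means_error = (site_means - site_effect).std()
```

The flip side, not visible in this sketch, is that with 40 rows you also lose degrees of freedom and the ability to include sample-level predictors at all.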
Question
Many open-source programs exist in the field of geology, across all its specializations (water resources, hydrology, hydrogeology, geostatistics, water quality, etc.), that many people are unaware of.
What software would you like to suggest to us?
Thanks
Reghais Azzeddine
Here is a list of common software and free alternatives
Software list
Illustrator => Inkscape, Scribus
Photoshop => GIMP
Matlab => Python, R, GNU Octave
Anaconda, RStudio, Jupyter Notebook, Spyder
ArcGIS Pro and ArcGIS Online => QGIS, GRASS, uDig, GeoDa, FOSS4G, Leaflet
PowerPoint => Google Slides, LibreOffice, FreeOffice
Microsoft Word => LaTeX, Google Docs, LibreOffice, FreeOffice
Excel => Google Sheets, LibreOffice, FreeOffice
Microsoft OS, Mac OS (and older computers) => Linux (Ubuntu, many others)
Others
GitHub, Arduino, Raspberry Pi, Audacity, BRL-CAD, FreeCAD, Dia, PDFCreator, Blender, Cinelerra, Bluefish, KeePass, 7-Zip, Psiphon, Clonezilla, VLC, Quanta Plus, NixNote, Overleaf, TeXstudio
Question
I want to use the daily GPP data from the FLUXNET2015 dataset. However, there are two kinds of daily GPP ("GPP_DT_VUT_MEAN" and "GPP_NT_VUT_MEAN"). I want to know the difference between the daytime partitioning method and the nighttime partitioning method, and how I can get the daily GPP value. Thank you very much.
• GPP_DT_VUT (Gross Primary Production, from Daytime partitioning method)
• GPP_NT_VUT(Gross Primary Production, from Nighttime partitioning method)
I don't know how to get the daily GPP. Thank you very much.
There are 40 GPP estimates in FLUXNET2015 dataset for a site. DT (daytime) and NT (nighttime) are two NEE partitioning methods, and CUT and VUT are USTAR (friction velocity) thresholds. You can download daily GPP from https://fluxnet.org/data/fluxnet2015-dataset/. For all GPP estimates, XX_REF is the most representative version according to Pastorello et al. (2020).
Question
I have two datasets. One with 9 past cyclones with their damage on the forest, wind speed, distance from the study site, recovery area. Another dataset with future sea-level rise (SLR) projections and potential loss area due to SLR.
1. By using data from both disturbance events datasets (loss area, recovery area, wind speed, predicted loss area from SLR) can I create any kinds of disturbance risk/vulnerability/disturbance index/ hazard indicator map of the study area?
2. What kinds of statistical analysis can I include in my study with these limited data sets which will help me to show some sort of relationship of "Loss Area" with other variables?
I hope these references might be helpful. Kindly have a look.
Regards
Gowhar Meraj
Question
We know that the dispersion variance is related to the domain size V and the support size v. The textbook says that if v is kept unchanged, the dispersion variance will increase as the domain size V increases. I want to know why. Is there a mathematical proof?
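A standard way to see this is Krige's additivity relation for the dispersion variance (the textbook identity, stated here for an intrinsically stationary field):

```latex
% Krige's relation: dispersion variance of supports v within domain V
D^2(v \mid V) \;=\; \bar{\gamma}(V, V) \;-\; \bar{\gamma}(v, v),
% where \bar{\gamma}(A, A) denotes the mean value of the variogram \gamma(h)
% over all pairs of points in A.
```

Keeping v fixed leaves the term γ̄(v, v) unchanged, while enlarging V introduces larger separation distances h into the average γ̄(V, V); since γ(h) is non-decreasing for the usual models, γ̄(V, V), and hence D²(v|V), can only grow.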
Question
We know that LU decomposition is an important method for the stochastic simulation of a 2D random field (RF). Assume the covariance matrix of the regionalized RVs is C; it can be decomposed as C = LU by the LU algorithm, and an RF can then be generated as X = L*y, where y is a vector of independent standard normal random numbers. I want to know whether I can use LU decomposition for simulation when n observations exist as conditioning data. If so, how can it be demonstrated?
There are many decomposition types; why not try one other than the one you applied before?
Regards
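For the conditional case the answer is yes: conditional simulation via the LU (Cholesky) factorization of the covariance matrix is usually attributed to Davis (1987). The sketch below uses the algebraically equivalent Gaussian-conditioning form: compute the conditional mean (simple kriging of the data) and simulate the residual with the Cholesky factor of the conditional covariance. The grid, covariance model, and data values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1D grid of locations; exponential covariance (an assumed model for illustration)
x = np.linspace(0.0, 10.0, 50)
C = np.exp(-np.abs(x[:, None] - x[None, :]) / 2.0)

# conditioning data at a few grid indices (hypothetical values)
idx = np.array([5, 20, 40])
z_data = np.array([1.2, -0.5, 0.8])
rest = np.setdiff1d(np.arange(x.size), idx)

# Gaussian conditioning: mean and covariance of the unknowns given the data
C11 = C[np.ix_(idx, idx)]
C21 = C[np.ix_(rest, idx)]
C22 = C[np.ix_(rest, rest)]
A = C21 @ np.linalg.inv(C11)             # simple-kriging weights (dual form)
mu = A @ z_data                           # conditional mean at unknown locations
Sigma = C22 - A @ C21.T                   # conditional covariance

# one conditional realization via the lower-triangular Cholesky factor
Lc = np.linalg.cholesky(Sigma + 1e-10 * np.eye(rest.size))
z = np.empty(x.size)
z[idx] = z_data                           # the realization honours the data exactly
z[rest] = mu + Lc @ rng.standard_normal(rest.size)
```

In Davis's formulation the same result falls out of the block structure of the Cholesky factor of the full (data + grid) covariance matrix; this form is just easier to read.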
Question
I want to simulate a random field Z(u) that has a nested variogram, say gamma(h) = gamma1(h) + gamma2(h) + gamma3(h), assuming the variogram is isotropic. Can I independently simulate three random fields, Z1(u) with correlation structure gamma1(h), Z2(u) with gamma2(h), and Z3(u) with gamma3(h), and then sum them: Z(u) = Z1(u) + Z2(u) + Z3(u)?
Thank you, Donald Myers. I intend to use unconditional sequential (Gaussian) simulation based on simple kriging. From my understanding, we do not have to assume the RF is Gaussian. Note that I assume second-order stationarity for my question.
I want to simulate a random field with a more complex spatial correlation structure, but the code available to me (Matlab) only supports simulation from a variogram with a single structure. Therefore, if the two practices are equivalent, I can do the simulation by independently simulating three random fields and summing them up.
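For independent zero-mean Gaussian components this works: the covariances (and hence the variograms) of independent fields add. A quick numerical check of that claim, using three assumed exponential structures on a 1D grid and many Cholesky-based realizations:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 30)
h = np.abs(x[:, None] - x[None, :])

# three assumed isotropic structures (exponential, different ranges and sills)
C1 = 0.5 * np.exp(-h / 1.0)
C2 = 0.3 * np.exp(-h / 3.0)
C3 = 0.2 * np.exp(-h / 8.0)

L1, L2, L3 = (np.linalg.cholesky(C + 1e-12 * np.eye(30)) for C in (C1, C2, C3))

# simulate many independent realizations of each component and sum them
n = 40000
Z = (L1 @ rng.standard_normal((30, n))
     + L2 @ rng.standard_normal((30, n))
     + L3 @ rng.standard_normal((30, n)))

emp_cov = np.cov(Z)        # empirical covariance of the summed field
target = C1 + C2 + C3      # nested model: covariances (and variograms) add
```

Note the caveat in the thread: the sum is only guaranteed to reproduce the nested model if the three fields are simulated independently (independent random number streams).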
Question
I did an MCA using FactoMineR. I know how to interpret cos2, contributions and coordinates, but I don't know how the values of v.test should be interpreted.
Thank you
(a v.test greater than 1.96 in absolute value is equivalent to a p-value less than 0.05)
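Since v.test values behave like standard normal z-scores, converting them to two-sided p-values is one line; a sketch using SciPy:

```python
from scipy.stats import norm

def vtest_to_pvalue(v):
    # v.test acts as a z-score: |v| > 1.96 corresponds to p < 0.05 (two-sided)
    return 2 * norm.sf(abs(v))

p = vtest_to_pvalue(2.5)
```

The sign of v.test tells you the direction: positive means the category's coordinate is significantly above the average, negative significantly below.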
Question
The published research papers mainly show the use of cluster analysis, such as the agglomerative hierarchical clustering (AHC) technique, to divide the data into clusters sharing a common trait, followed by a one-way ANOVA test and then Duncan's test as a post-hoc test. However, no one details the procedure by which the multiple variables or clusters are used to prepare a single set of site-specific management zones. This makes it difficult for the layman. Please share the details, or any video link, for preparing the zonation map in ArcGIS.
Efforts are good and there are many research papers on MZs, but these are still not useful for laymen, as only written text is given in the papers. It is easy to write but different in practice.
As in the attached paper, from the result of the cluster analysis (factor scores attached), we can prepare PC maps (Fig. 3).
Still, the question remains unanswered for preparing the final MZs (Fig. 5): how to get a value for each sampling point for preparing Fig. 5 and Table 4. Even after emailing many corresponding authors of these papers, no reply was received.
Question
I have created a soil moisture proxy model that shows areas with higher or lower soil moisture; each pixel in the GIS raster data has a value. I developed and applied the model in ArcGIS 10.2.2. I want to find a high-resolution land surface temperature dataset that I could use to validate the model. By high resolution, I mean at least sub-kilometer. For my purposes it needs to be high resolution, but I am only trying to determine whether the two correlate on an ordinal scale. It would be preferable for it to be in raster format as well.
I know that MODIS is probably one of the best datasets, but I think its resolution is only 1 km. Does anyone have any other suggestions? Thanks so much for your time and consideration.
Question
I want to carry out PCA on a set of chemical data, some of it in oxide form and some in elemental form. The oxides are in percent and the elements are in ppm.
I understand that the data have to be normalised/standardised before starting PCA. Now,
1) Should I have to convert all oxides to element first?
2) Should I have to convert all into single type of unit (either percentage or ppm )?
3) For normalisation, should I go for lognormal (10), lognormal (2) or natural log? What is the best way to decide which one is ideal?
4) If some elements show lognormal (10) distribution and some show Ln distribution, can I apply them separately or a single method to be followed for all?
5) Can I attempt IDF-Normal method for normalisation of such data?
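On points 1 and 2, note that once each column is standardised (z-scored), linear unit conversions such as ppm to percent no longer matter: PCA on the correlation matrix is unit-free. A quick check with invented data:

```python
import numpy as np

rng = np.random.default_rng(2)

# hypothetical data: two oxide columns in %, two trace-element columns in ppm
X = rng.lognormal(mean=[3.5, 2.0, 1.0, 0.5], sigma=0.4, size=(100, 4))

def pca_scores(X):
    # z-score each column: removes unit/scale effects, then PCA via SVD
    Z = (X - X.mean(0)) / X.std(0)
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    return U * S   # principal-component scores

# converting a ppm column to % (divide by 1e4) changes nothing after standardisation
X2 = X.copy()
X2[:, 2] /= 1e4
s1, s2 = pca_scores(X), pca_scores(X2)
```

The log-transform questions (3-5) are separate: log transforms are non-linear, so they do change the PCA, and the choice of base (10, 2 or e) only rescales a column by a constant, which standardisation removes again.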
Question
Dear all,
I know it might also depend on the distribution/behavior of the variable we are studying. The sample spacing must be able to capture the spatial dependence.
But since kriging depends strongly on the variance computed within each lag distance, with few observations we may fail to capture the spatial dependence, because we would have few pairs of points within a specific lag distance, and few lags overall. Especially when the points are very irregularly distributed across the study area, with many observations in one region and sparse observations in another, this will also affect the accuracy of the variance estimated for each lag.
Therefore, I think that in such circumstances computing the semivariogram seems useless. What are the best practices if we still want to use kriging instead of other interpolation methods?
PS
You need to separate two questions, first there is the number and spatial pattern of the data locations used in estimating and modeling the variogram. Secondly there is the number and spatial pattern of the data locations used in applying the kriging estimator/interpolator. These are two entirely different problems. The system of equations used to determine the coefficients in the kriging estimator only requires ONE data location but the results will not be very useful or reliable. Now you must decide whether to use a "unique" search neighborhood to determine the data locations used in the kriging equations or a "moving" neighborhood. Most geostatistical software will use a "moving" neighborhood, if you use a moving neighborhood then about 25 data locations is adequate, using more may result in negative weights and larger kriging variances. Depending on the total number of data locations and the spatial pattern there may be interpolation locations where there are less than 25 data locations. Using a "unique" search neighborhood will likely result in a very large coefficient matrix to invert.
With respect to estimating and modeling the variogram you must first consider how you are going to do this. Usually this will include computing empirical/experimental variograms but for a given data set the empirical variogram is NOT unique. It will depend on various choices made by the user such as the maximum lag distance, the width of the lag classes and whether it is directional or omnidirectional. An empirical variogram does not directly determine the variogram model type, e.g. spherical, gaussian, exponential, etc. It also does not directly determine the model parameters such as sill, range.
Silva's question may seem like a reasonable one to ask but it does NOT have a simple answer. Asking it implies a lack of understanding about geostatistics and kriging.
• Myers, D.E. (1991) On Variogram Estimation. In: Proceedings of the First Inter. Conf. Stat. Comp., Cesme, Turkey, 30 Mar.-2 Apr. 1987, Vol. II, American Sciences Press, 261-281
• Warrick, A. and Myers, D.E. (1987) Optimization of Sampling Locations for Variogram Calculations. Water Resources Research 23, 496-500
Question
The aim is to find the signal value at x0 from the signal values at xi, i = 1,...,N using kriging, given as Z(x0) = sum(wi * Z(xi)).
After fitting a non-decreasing curve to empirical variogram, we solve following equation to find the weights wi's-
Aw = B,
where A is padded matrix containing Cov(xi,xj) terms and B is vector containing Cov(xi,x0).
In my simulation setup, the weights often have negative values (which is non-intuitive). Am I missing a step? As per my understanding, the choice of curve-fitting function affects A, and the weights are positive only if A is positive-definite. Is there a way to ensure that A is positive-definite?
I think the problem is that the sample space of the variables considered has not been taken into account. Kriging in any form (simple, ordinary, ...) was devised for real random variables, supported on the whole real line, going at least conceptually from minus infinity to plus infinity, and endowed with the usual Euclidean geometry. In such a case negative weights would be no problem. If you expect positive estimates, then the variable is not supported on the whole real line, and you need to take this into account. In the field of compositional data analysis you can find tools to address this problem. The essential steps are to determine the natural scale of your data, to find appropriate orthonormal coordinates for it, and to perform the estimation in the resulting representation of your data.
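For completeness, negative weights do not by themselves signal an error: they arise from the screen effect even when the system is built from a perfectly valid positive-definite covariance model. A small sketch (invented 1D configuration, Gaussian model) that sets up the padded ordinary-kriging system from the question:

```python
import numpy as np

# 1D data locations and a prediction location (a hypothetical clustered setup)
xi = np.array([0.0, 0.2, 0.4, 2.0])
x0 = 1.0

def gauss_cov(h, sill=1.0, rng_=1.0):
    # Gaussian covariance model: strictly positive-definite for distinct locations
    return sill * np.exp(-(h / rng_) ** 2)

H = np.abs(xi[:, None] - xi[None, :])
C = gauss_cov(H)

# padded ordinary-kriging system: A w = b with the unbiasedness constraint sum(w) = 1
n = xi.size
A = np.zeros((n + 1, n + 1))
A[:n, :n] = C
A[:n, n] = 1.0
A[n, :n] = 1.0
b = np.append(gauss_cov(np.abs(xi - x0)), 1.0)

w = np.linalg.solve(A, b)[:n]   # kriging weights (last entry is the Lagrange multiplier)
# Negative weights can appear here even though C is positive-definite: points behind
# a closer point are "screened", and very smooth models (Gaussian) amplify the effect.
# Positive-definiteness guarantees a unique solution, not positive weights.
```

If negative weights are a problem for your application (e.g. a positively supported variable), the usual remedies are a less smooth model, a nugget effect, a smaller search neighborhood, or the compositional approach described above.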
Question
We know that the estimation error (in percent) can be calculated as EEx = a*sqrt(var(x))*100/x, where EEx is the estimation error, a is a constant, var(x) is the variance (from (co)kriging), and x is the value estimated by (co)kriging.
For an estimated value below 1%, the above equation may lead to large and meaningless estimation errors (thousands of percent).
The question is: what should we do for estimated values of less than 1%?
On the other hand, we need to report the estimation error in percent in order to classify mineral resources based on their estimation error.
P.S.: an example is phosphorus grades at an iron ore mine. They vary between 0 and 1, but after performing (co)kriging, their estimation error is around 1500%.
This happens, more or less, even when performing compositional data analysis (CoDa).
The equation must solve any problem
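Numerically the blow-up is easy to reproduce: the relative error has the estimate in the denominator, so the same absolute uncertainty explodes as x approaches zero. One common workaround is to report the absolute error a*sqrt(var(x)) for near-zero grades instead of the relative one. A sketch with invented values:

```python
import numpy as np

def relative_error(x, var, a=1.96):
    # relative estimation error EE(x) = a * sqrt(var) * 100 / x, as in the question
    return a * np.sqrt(var) * 100.0 / x

x = np.array([10.0, 1.0, 0.1, 0.01])   # estimated grades (%), made up
var = 0.01                              # hypothetical kriging variance, same everywhere

ee = relative_error(x, var)             # identical uncertainty, wildly different EE%
# for low grades, report the absolute error instead: it stays constant here
abs_err = 1.96 * np.sqrt(var) * 100.0 * np.ones_like(x)
```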
Question
I am interested in the suitability of Geostatistical Analysis versus Geospatial Analysis in ArcGIS.
Both tools contain interpolation techniques; I am not sure which one should be used for mapping soil properties (like bearing capacity) using point data.
Geostatistical Analyst uses sample points taken at different locations in a landscape and creates (interpolates) a continuous surface. The sample points are measurements of some phenomenon, such as radiation leaking from a nuclear power plant, an oil spill, or elevation heights. Geostatistical Analyst derives a surface using the values from the measured locations to predict values for each location in the landscape.
Geostatistical Analyst provides two groups of interpolation techniques: deterministic and geostatistical. All methods rely on the similarity of nearby sample points to create the surface. Deterministic techniques use mathematical functions for interpolation. Geostatistics relies on both statistical and mathematical methods, which can be used to create surfaces and assess the uncertainty of the predictions.
Geospatial analysis is the gathering, display, and manipulation of imagery, GPS, satellite photography and historical data, described explicitly in terms of geographic coordinates or implicitly, in terms of a street address, postal code, or forest stand identifier as they are applied to geographic models.
Question
Hello! I have a database of geochemistry analyses.
My doubts are:
1 - Can I apply statistical measures to the same variable even though it was measured by several analytical methods?
E.g.: Can I take the mean of SiO2 from whole-rock samples measured by X-ray fluorescence and by electron microprobe?
2 - Can I compute the linear correlation between two different variables measured by different analytical methods?
E.g.: the linear correlation between H2 measured by gas chromatography and FeO measured by electron microprobe?
3 - From which readings can I learn this theory?
Question
I have a dataset with distances between beneficiaries and the nearest provision point (nearest hub).
I want to develop a model to explain distances based on several attributes, like category of beneficiary and category of provision point, among others:
distance ~ cat_beneficiary + cat_provision + altitude + ...
I guess I should use a GLM, but I don't know which model would fit better with this kind of data (continuous and positive). Can I use a count-data model (like Poisson or NB)? Or do they only work with discrete data?
I attach a histogram.
Dear Josep,
Replacing zeros with small numbers (10^-6 or .Machine$double.xmin) may "solve" the problem, but a more elegant solution is to use zero-inflated models (e.g. the packages 'glmmTMB' or 'glmmADMB').
HTH,
Ákos
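Since the thread is about R, here is a language-neutral illustration of the model family usually suggested for continuous positive responses: a Gamma GLM with log link. Conveniently, with this link the IRLS working weights are constant, so the fit reduces to repeated OLS on a working response. All data below are simulated; the design matrix stands in for the beneficiary/provision-point attributes.

```python
import numpy as np

rng = np.random.default_rng(3)

# hypothetical design: intercept, one categorical dummy, one continuous covariate
n = 500
X = np.column_stack([np.ones(n), rng.integers(0, 2, n), rng.normal(size=n)])
beta_true = np.array([1.0, 0.5, -0.3])
mu = np.exp(X @ beta_true)
y = rng.gamma(shape=2.0, scale=mu / 2.0)   # Gamma distances: positive, continuous

# Gamma GLM with log link fitted by IRLS; for this link/family the working
# weights are 1, so each step is OLS on the working response z = eta + (y - mu)/mu
beta = np.zeros(3)
for _ in range(50):
    eta = X @ beta
    m = np.exp(eta)
    z = eta + (y - m) / m
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)
```

In R this is simply `glm(distance ~ ..., family = Gamma(link = "log"))`; the zero-inflated route above is needed only if exact zeros are present, which a Gamma model cannot produce.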
Question
In geostatistics, structural analysis aims to obtain a single expression characterizing the spatial variation of a regionalized variable in different directions. The general expression is γ(h) = Σγi(h), where γi(h) represents the variogram model in the i-th direction, which is subjected to some geometric transformation and is then isotropic.
My questions are:
(1) Does the expression γ(h) = Σγi(h) apply to the geometrically anisotropic variogram? In detail: if the variograms in two orthogonal directions have the same nugget and sill but different ranges, we call this geometric anisotropy. The common practice is to apply a linear transformation [1, 0; 0, K] to the lag vector, after which one variogram model can be used for all directions.
(2) If geometric anisotropy can also be expressed as a sum of variograms in different directions, then it should be equivalent to the commonly used approach mentioned in (1). But how are these two approaches related to each other?
(3) I think the core question is: why can the variograms in different directions be added?
Could anyone give me some explanations, and some illustrated examples?
No, this model can result in a non-invertible kriging matrix.
What you are trying to construct is not a geometric anisotropy but rather is more like a zonal anisotropy.
See Myers, D.E. and Journel, A. (1990) Variograms with Zonal Anisotropies and Non-Invertible Kriging Systems. Mathematical Geology 22, 779-785.
Also see Myers, D.E. (2008) Anisotropic radial basis functions. International J. of Pure and Applied Mathematics 42, 197-203.
A geometric anisotropy is when the range varies with direction
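To make the geometric-anisotropy construction from question (1) concrete, here is a small sketch: a single isotropic spherical model applied to lag vectors stretched by [1, 0; 0, K], giving range a along x and a/K along y. No sum over directions is involved; the parameters are invented.

```python
import numpy as np

def sph(h, sill=1.0, a=10.0):
    # isotropic spherical variogram with range a
    h = np.minimum(h, a)
    return sill * (1.5 * h / a - 0.5 * (h / a) ** 3)

K = 2.0                      # anisotropy ratio (assumed): range along y is a/K
T = np.array([[1.0, 0.0],
              [0.0, K]])     # linear transform applied to the lag vector

def gamma_aniso(hvec):
    # geometric anisotropy: stretch the lag, then evaluate ONE isotropic model
    return sph(np.linalg.norm(T @ hvec))

gx = gamma_aniso(np.array([10.0, 0.0]))  # lag of 10 along x: exactly at the x-range
gy = gamma_aniso(np.array([0.0, 5.0]))   # lag of 5 along y: exactly at the y-range
```

Both evaluations hit the sill, showing the single transformed model reproduces different ranges by direction while the nugget and sill stay common, which is what distinguishes this construction from the zonal (summed) form criticized above.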
Question
My observations are points along a transect, irregularly spaced.
I aim to find the distance values that maximize the clustering of my observation attribute, in order to use them in the subsequent LISA analysis (local Moran's I).
I iteratively run the global Moran's I function with PySAL 2.0, recreating a distance-based weight matrix (binary: 1 for neighbors, 0 for non-neighbors) with a search radius 0.5 m longer at every iteration.
At every iteration, I save z_sim, p_sim and the I statistic, together with the distance at which these stats were computed.
From this information, what is the best strategy to find distances that potentially reveal underlying spatial processes that (pseudo-)significantly cluster my point data?
• Esri style: the ArcMap Incremental Global Moran's I tool identifies peaks of z-values, where p is significant, as interesting distances
• Literature: I found many papers that simply choose the distance with the highest absolute significant value of I
CONSIDERATIONS
Because the number of observations considered in the neighborhood changes with the search radius, and thus the weight matrix also changes, the I values are not comparable.
Hi everyone,
After a little research, I finally came up with the answer I was looking for.
When using the global Moran's I index with incrementally increasing search distances (thus changing the weight matrix at every iteration), only the z-values are independent of both the weight matrices and the variations in variable intensity; thus, they are comparable across multiple analyses.
The I in Moran's I statistics is not comparable across analyses; i.e., if with a distance of 10 m I = 0.3 and with a distance of 15 m I = 0.6, we cannot say that at a distance of 15 m the clustering strength is double.
We could only say that in both cases there is a positive (sign of the I) spatial autocorrelation.
For the strengths, we use the z-values.
That is why ESRI plots distances on the x-axis and z-values on the y-axis, indicating significant peaks (p-value less than the specified significance level) as interesting distances.
For more information, this is clearly explained in Luc Anselin's Global Autocorrelation class, given in 2016 at the University of Chicago.
Enjoy!
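The incremental procedure described above can be sketched without PySAL, using plain NumPy for the binary distance-band weights and the global Moran's I formula (synthetic 1D transect data; in practice you would also run permutations at each radius to obtain z_sim and p_sim, which is what PySAL's Moran class reports):

```python
import numpy as np

rng = np.random.default_rng(4)

# hypothetical irregular transect: 1D coordinates and a spatially clustered attribute
coords = np.sort(rng.uniform(0, 100, 60))
y = np.where(coords < 50, 1.0, 0.0) + rng.normal(0, 0.1, 60)

def morans_i(y, W):
    # global Moran's I for a binary (unstandardised) weight matrix W
    z = y - y.mean()
    s0 = W.sum()
    return (y.size / s0) * (z @ W @ z) / (z @ z)

# incremental search: rebuild the distance-band weight matrix at each radius
d = np.abs(coords[:, None] - coords[None, :])
for radius in (2.0, 5.0, 10.0):
    W = ((d > 0) & (d <= radius)).astype(float)   # binary neighbors, no self-links
    I = morans_i(y, W)
```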
Question
I want to compare two maps (covering the same area):
• a raster map of the Topographic Position Index (TPI), with values ranging from 0 to 100. This is a continuous variable. I added a picture as an attachment.
• a raster map with wetlands (permanent wetland, temporary wetland, no wetland present), so this is a categorical variable. I added a picture as an attachment.
I want to check whether the TPI map can predict wetlands. I want to look for similar spatial patterns.
I was thinking in terms of probability (e.g. Bayesian Networks), but I am not sure this is the right technique.
Does someone know a technique I could use to analyse these two maps? I am using ArcGIS but other software programs (like R) can also be used of course.
Annelies Broeckx Thanks. I have downloaded the files and played around with this a bit. If you assume that any kind of gradient indicates different regions, then the following may work. I mainly used gradient operators and the Minkowski metric, and compared everything. Enclosed you will find some colored results which show potential overlaps as a color code (whitish). I got 17% overlap for the gradients, and more if I dilate them by at most 50% to allow for the influence of misclassification and misregistration.
Question
At present, I have a map of geological classification results; the classification is categorical and ordered, and some areas in the map are vacant.
I want to use geostatistical analysis, such as a kriging method, to interpolate the vacant areas.
So I want to know how to use these methods for discontinuous data types, or whether there is a kriging method for such data.
If you have a vector of probabilities at each sampling point, then you can use a compositional approach to interpolate them consistently.
Question
I first state that my domain of interest lies in the understanding of distances in the spatial, geographical domain. This does not involve distances in statistical non-spatial analysis.
Distances in the mathematical sense are characterised by the properties of positivity, separation, symmetry and triangular inequality. In the literature this last property is often wrongly interpreted. I state the hypothesis that the triangular inequality (d(AC) <= d(AB) + d(BC) whatever B) has for main purpose to guarantee the minimal nature of distances. I have so far listed three errors:
1. confusion with the non-Euclidean nature of geographical distances: the fact that distances do not follow the straight line is attributed to the triangular inequality, as in Müller 1982 ("Non-Euclidean geographic spaces: mapping functional distances", Geographical Analysis vol 14). This is clearly a misunderstanding of the non-Euclidean nature of transport and mobility realities.
2. considering non-optimal sets of measures (the word "measure" is used to avoid considering them as distances) containing triangular inequalities as in Haggett 2001 p 341 ("Geography a global synthesis" Prentice Hall). This argument is not consistent with the idea of minimality that I mentioned earlier: what can be a sub-optimal distance? Does it exist in spatial analysis of transport such distances?
3. the existence of rest-stops along transport routes creates the possibility that the additivity of time-distances (for instance) is not guaranteed. On the optimal route from A to C passing through B, if there is a need for a stop (for rest, for energy refuelling or other similar purposes) in or around the point B then the addition of time-distances AB and BC is inferior to the overall AC distance creating an apparent violation of the triangle inequality. This idea can be found in the reference article Huriot, Smith and Thisse 1989 "Minimum cost distances in Spatial Analysis", Geographical Analysis, p 300. This argument could be surpassed by using a non-continuous time-distance function to overcome the sub-additivity paradox.
All this discussion brings to the idea that distances are optimal in nature, and that in the spatial analysis of transport and mobility, the concept of distance contains an idea of optimality. This idea could link geography and economics through the principle of optimisation in the distribution of activities and the functioning of transport systems.
I would like to open here a discussion to test if my reasoning is consistent and solid: I welcome any counter arguments, examples, illustrations.
" All this discussion brings to the idea that distances are optimal in nature, and that in the spatial analysis of transport and mobility, the concept of distance contains an idea of optimality. This idea could link geography and economics through the principle of optimisation in the distribution of activities and the functioning of transport systems "
Interesting work ..... you are probably aware of the fact that the solution to a linear programming "transport problem" links activities with transport systems, and provides a set of prices (dual variables, or shadow prices) that relate to, and are derived from, the transport costs --- those transport costs do not have to obey the triangle inequality.
Question
I'm looking for examples and analyses of autocorrelation tests performed on cross-sectional data and the issues encountered (data prop for analysis, problems with estimations, etc). Not interested in spatial autocorrelation.
The richer you are the more you save
The taller and good looking you are the higher your income
The more fashionable is your neighbourhood... the more expensive your house will be.
OK? You get it: the residuals on each of the auto-correlated variables will move in the same direction.
This should not be confused with multicollinearity, when two or more variables mean the same thing and become redundant.
Question
To do ArcGIS geostatistical analysis, I need a China climatic zone shapefile. I searched for it in DIVA-GIS but it was not found.
The Köppen-Geiger classification Eric Delgado dos Santos Mafra Lino mentioned is a good choice.
I also found another link where you can download the global climate zones as a shapefile. I am not sure if that is what you are looking for, but have a look at it:
Question
I plan to use sequential Gaussian simulation (sGs) approaches in a study. However, when I apply the inverse normal score transformation (NST) to the data after the calculation, there are some errors (especially negative results). Therefore, I simulated the data with a Box-Cox transformation instead of the NST and obtained more appropriate results.
Can the Box-Cox transformation be used instead of normal score transformation in sequential Gaussian simulation (sGs)? Can I also find examples of such transformations previously used (to reference in the article)?
Thank you very much in advance
Mr. Murphy,
That could be a good idea. I'll take your advice. Thank you so much for your help.
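For the transform itself, SciPy's `boxcox`/`inv_boxcox` pair is convenient and exactly invertible, which avoids the back-transform artifacts mentioned above (whether Box-Cox scores are close enough to multi-Gaussian for sGs still needs to be checked against your data). A sketch with simulated positive values:

```python
import numpy as np
from scipy.stats import boxcox
from scipy.special import inv_boxcox

rng = np.random.default_rng(5)
data = rng.lognormal(mean=1.0, sigma=0.8, size=200)   # positive, skewed sample data

# forward transform: SciPy picks the lambda that maximizes the log-likelihood
transformed, lam = boxcox(data)

# ... run the simulation in the transformed (approximately Gaussian) space ...

# back-transform with the SAME lambda: exact inverse, no negative-value artifacts
restored = inv_boxcox(transformed, lam)
```

Note that Box-Cox requires strictly positive data; zeros need a shift (or a two-parameter variant) before transforming.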
Question
I am using ArcSWAT. I encounter some problems when I try to estimate the flow through these GIS files. I loaded my land use after the watershed delineation, but when I tried to add my look-up table and selected "Cropland data layer", I got this error message (error 1):
"The grid value 0 has not been defined."
I double-clicked 'value 0' and selected an item for it (Crop-AGRR), but 'Land Use\Soils\Slope Definition' could not complete this time. Then I added my look-up table again, but selected "User Table" this time, just like when running Example1 which comes with the SWAT model, and I got another error message (error 2):

'An error has occurred in the application. Record the call stack sequence and description error.
Error call stack sequence: fromChoose2LUSoilsfunctions.vb
Error Number: 5
Description: Argument 'Length' must be greater than or equal to zero'
I appreciate it if someone could give me some suggestions.
It is because some grid cells have an undefined value; check the grid to find those cells.
Question
Hi everyone!
I would greatly appreciate your help
These are my first weeks of self-studying GIS and geostatistical analysis for my hydrobiological research.
I have about 30 points on the lake with x, y, z information and species occurrence data.
I have just interpolated the z value (depth) and made a bathymetric map (isolines) of the lake using Golden Software Surfer 11.
Now I need to compare this information with the species occurrence data and maybe find some relation between them, but I don't clearly understand how I can do this.
Which software or tools should I use? Should I continue my work in Surfer? Are there any tutorials or literature for beginners? I'm stuck and in despair.
I've heard that cokriging may help, but I'm not sure. I have also heard about Maxent, but as a beginner I can't understand how to work with this software, so I need some simple tutorials or something like that, I think.
And I am sorry for my English.
Hi all,
All the information you guys provided is very helpful, but it might be overwhelming. Zhanna, you mentioned that you have the Surfer software available to conduct your analysis, so you have spatial data with coordinates. Unfortunately, none of us seems to be an expert in Surfer; I used it once and don't remember much about it. There is one more free program, SGeMS. You can download it from its website; it can perform variogram modeling, interpolation and even sequential Gaussian simulation. It is quite efficient at constructing histograms, QQ plots and scatterplots. It is a little painful to work with and it sometimes crashes on you, but at the end of the day it does a solid job.
Question
The original data are 2D and have been detrended using a second-order polynomial obtained by OLS. The variogram of the detrended data is shown in red in the attached figure. Conditional sequential Gaussian simulation was then performed and 100 realizations were obtained. The variograms of these realizations are in gray in the figure. Why is there an apparent deviation? Is there something wrong with my procedure? Thank you.
You need to recognize a number of aspects of your problem that you may not be understanding correctly.
1. "Detrending" is a technique applied to a data set, it does not affect nor is it a characteristic of a random function. The empirical variogram is only an unbiased estimator of variogram values if the expected value of the random function is a constant (with respect to the spatial locations). Unfortunately without knowledge of the multivariate probability density function (don't confuse multivariate with multidimensional) you can't actually compute the expected value. As a practical solution to the problem of whether the expected value is constant it is common to fit a trend surface to the data and then possibly use the residuals to compute the empirical variogram. Note that for a particular data set the empirical variogram is not unique, i.e. you must make decisions about the width of the distance classes, the number of distance classes and whether to use an omnidirectional or directional empirical variograms.
2. Conditional sequential Gaussian simulation means you are going to generate a partial realization of a random function using a covariance function (you can't use a variogram directly). The algorithm is supposed to preserve certain properties and is based on certain assumptions, one of which is that the random function has a multivariate Gaussian distribution and is second-order stationary (the existence of a variogram only assumes intrinsic stationarity). "Conditional" means that the realization is forced to match the data values at the data locations. "Sequential" means that the algorithm generates a simulated value at one location at a time, in sequence; hence you must choose (or the software chooses for you) a "path". If the same set of simulation locations is chosen in a different order, there is no assurance that the set of simulated values will be the same.
3. The simulation algorithm does not ensure that the empirical variograms for individual partial realizations will match the empirical variogram for the original data set.
4. Usually the software generates simulated values on a regular grid whereas the original data locations are usually not on a regular grid and there will be far fewer data locations than simulation locations.
5. Any spatial data simulation algorithm depends on the use of a random number generator and in turn the random number generator depends on the use of a "seed". To obtain multiple partial realizations you keep changing the seed. Note that for a given random number algorithm and given seed the sequence of "random numbers" is in fact not really random (see the book "Numerical Recipes").
a. The plot of the empirical variogram for the original data appears to be almost too smooth to be true unless you had a really large data set.
b. I assume that the single plot for the simulated data is an "average" of the empirical variograms for the 100 realizations, is that true or did you average the simulated values at each location to compute a single empirical variogram?
c. In addition to "detrending" the data did you also do a "normal" transform, if so before or after computing the empirical variogram for the detrended data.
d. Almost certainly the empirical variogram for the simulated data is based on a much larger number of "data" locations; you didn't provide enough information about what you did.
e. You should list all the choices you made in computing the empirical variograms and also the choices you made in using the Sgsim software.
f. The difference between the two plots appears to be difference in the implied sills, i.e. a difference in the variances.
You need to understand what the software is doing both for computing empirical variograms and also for the conditional sequential Gaussian simulation. i.e. the underlying algorithms and how the software implements those algorithms.
Question
The formula of Advanced Vegatation Index for Landsat 7 is:
AVI = [(B4 + 1) (256 - B3) (B4 - B3)] ^ 1/3
and for Landsat 8,
AVI = [(B5 + 1) (65536 - B4) (B5 - B4)] ^ 1/3
Now if I transform the DN values into Reflectance values, should the constants (256, 65536) be changed or they remain constant for reflectance as well?
I contacted USGS about this. They replied, "The constants represent the number of possible bits in a resolution (8 bit = 256; 16 bit = 65536), so they shouldn't change."
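For readers who want to script this, here is a minimal Python sketch of the Landsat 8 formula quoted above, using hypothetical DN values. Clamping a negative product to zero is my own convention for avoiding a complex cube root, not part of the USGS definition.

```python
# Advanced Vegetation Index (AVI) for Landsat 8, from the formula above.
# B5 = NIR band, B4 = red band; 65536 reflects the 16-bit radiometric depth.

def avi_landsat8(b5, b4):
    """AVI = [(B5 + 1) * (65536 - B4) * (B5 - B4)] ^ (1/3), on DN values."""
    product = (b5 + 1) * (65536 - b4) * (b5 - b4)
    if product < 0:
        # Negative product (B5 < B4, i.e. a non-vegetated pixel): returning 0
        # avoids a complex cube root. This clamp is my convention, not USGS's.
        return 0.0
    return product ** (1.0 / 3.0)

# Hypothetical DN values for a vegetated pixel:
print(avi_landsat8(20000, 8000))
```

Per the USGS reply above, the same bit-depth constants are kept even if the inputs are converted to reflectance.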
Question
When performing kriging with anisotropy, a very high length-to-width ratio of my sampling point layout can suggest an anisotropy that does not actually exist. So my question is: is there a recommended ratio of the longest to the shortest sampling axis for kriging to avoid this effect? Or how can I deal with this situation?
There is no way of determining that absent some knowledge of the phenomena that generates the data and at least some information on the spatial correlation for the variable of interest.
This is a little like asking for the minimum number of data locations necessary to apply kriging; again, without some preliminary information you can't say. Moreover, the problem is different for the variogram estimation/modeling step vs. using the variogram for kriging.
Note that for a given data set, the total number of data location pairs (used in the sample variogram(s)) is fixed, all of which might be used in an omnidirectional sample variogram but fewer in each of the directional sample variograms (and not necessarily the same number for different directions).
One of the consequences is that the directional sample variograms may be difficult to interpret (too few pairs) and/or for one direction the sample variogram is moderately good but for one or more other directions it is not good. The spatial pattern of the data locations can also cause effects in the sample variograms as you have noted.
You might find the following interesting
1987, A. Warrick and D.E. Myers, Optimization of Sampling Locations for Variogram Calculations Water Resources Research 23, 496-500
Question
I have a set of disease cases in the polygon form as an attribute of each city. There are some 180 cities (polygons) that 2-5 of them recorded more than 300 cases, about 100 of them contain 0-2 cases and the rest recorded 2-20 disease cases. I'm going to evaluate the possible correlation between illness and some environmental factor such as temperature, precipitation, etc.
However, the distribution of the disease data is severely non-normal and violates many statistical methods' assumptions.
Do you have any suggestion in this case?
To assess the correlations in your data set you could use a non-parametric correlation measure like Spearman's rho. Also, if you analyse spatial autocorrelation in your spatial pattern, you should use a Monte Carlo/randomization approach to determine your p-values. See e.g. http://pysal.readthedocs.io/en/latest/library/esda/moran.html --> p_rand
Another thing you could use is Poisson regression to determine the strength and direction of influence of your environmental factors on the number of local disease cases.
Cheers,
Jan
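As a sketch of the first suggestion above: Spearman's rho is simply the Pearson correlation computed on ranks (with midranks for ties), which sidesteps the normality assumption that skewed disease counts violate. A minimal pure-Python version:

```python
def ranks(values):
    """Ranks of values, 1-based, with average (mid) ranks for ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman_rho(x, y):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) *
           sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den
```

In practice one would use scipy.stats.spearmanr, which also returns a p-value; the sketch just shows the mechanics.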
Question
Hello everyone, I'm doing work on solute transport in soil. There are 40 sample points, and in the GS+ software the active lag distance defaults to half the sample distance. After calculation, only four or five plotted points were shown, but R2 is more than 0.9. Is this analysis reliable? How many plotted points are needed at least?
If you are going to use the variogram for kriging or simulation, remember that the purpose of the variogram is purely utilitarian; you are trying to develop a model that captures the spatial covariance of observed values. The model you specify for kriging or simulation helps control the smoothness and roughness of the resulting surface. Therefore you want enough variogram points to indicate a range of spatial dependence, the existence and magnitude of any very small-scale variability ("nugget effect"), and a value for the sill. Four points might be enough for some datasets, including yours. For others, forty might not be enough. You can play around with different lag sizes, but remember that as the number of variogram values increases, the number of pairs used to calculate each value decreases and the variogram can become noisier.
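The trade-off between lag size and pair count can be made concrete with a minimal empirical semivariogram routine (a sketch of the standard calculation; GS+ and other packages implement this with many more options):

```python
import math

def semivariogram(points, lag_width, n_lags):
    """Empirical semivariogram: for each lag class, half the mean squared
    difference over all point pairs whose separation falls in that class.
    points is a list of (x, y, value) tuples.
    Returns (lag_centre, semivariance, pair_count) per non-empty class."""
    sums = [0.0] * n_lags
    counts = [0] * n_lags
    for a in range(len(points)):
        for b in range(a + 1, len(points)):
            xa, ya, za = points[a]
            xb, yb, zb = points[b]
            h = math.hypot(xa - xb, ya - yb)
            c = int(h / lag_width)
            if c < n_lags:
                sums[c] += (za - zb) ** 2
                counts[c] += 1
    return [((c + 0.5) * lag_width, sums[c] / (2 * counts[c]), counts[c])
            for c in range(n_lags) if counts[c]]
```

Inspecting the returned pair counts per lag shows directly when a lag width is too small to give stable semivariance estimates.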
Question
I have used AHP, Frequency Ratio (FR), and Fuzzy Logic (FL) to create land suitability maps in the GIS environment. Do you know other methods?
When you create a land suitability model, you may face pixel-mixing problems that affect accuracy.
Weighted overlay analysis is another very good method for land suitability mapping, and sub-pixel analysis is a good way to handle the mixing of classes.
Question
1) From 73 static levels I interpolated the groundwater table using three methods: topo to raster, IDW, and kriging. For the last two I used geostatistical analysis, so errors are calculated. However, I do not know how to validate the "topo to raster" interpolation. In addition, I interpolated the GW flow direction using ArcHydro Groundwater. Is this valid? Is this modeling?
2) I calculated the mixing fraction of an "X" water source using a three-end-member mixing analysis (EMMA). I interpolated the mixing fraction using the same three methods. Topo to raster (apparently) gave the best fit; still, I do not know how to validate the method (i.e. RMSE). Kriging overestimated many values, and so did IDW.
Dear Christian,
The variogram for your data could be very telling.  If the data are very noisy the nugget effect will be large and the nugget constant relative to sill will be large.  Kriging is a type of smoother and will overestimate low values and underestimate high values.  Smoothing increases with the nugget/sill ratio.  Whether overestimation of low values or underestimation of high values is the more obvious depends on the distribution of data.
Sparseness in geographic data distribution makes the situation even worse.
As suggested above, look for outliers in the data.  Alternatively, if your software includes the option of a normal scores transform with conditional simulation you might want to explore that route.
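One simple way to validate topo to raster (or any interpolator without built-in error estimates) is hold-out validation: withhold some points, predict them, and compute the RMSE. A minimal sketch using IDW as a stand-in interpolator, with data as hypothetical (x, y, z) tuples:

```python
import math

def idw_predict(known, x, y, power=2.0):
    """Inverse-distance-weighted prediction at (x, y) from (xi, yi, zi) tuples."""
    num = den = 0.0
    for xi, yi, zi in known:
        d = math.hypot(x - xi, y - yi)
        if d == 0:
            return zi  # exact at data locations
        w = 1.0 / d ** power
        num += w * zi
        den += w
    return num / den

def rmse_holdout(train, test):
    """Root-mean-square error of predictions at held-out (x, y, z) points."""
    errs = [(idw_predict(train, x, y) - z) ** 2 for x, y, z in test]
    return (sum(errs) / len(errs)) ** 0.5
```

For topo to raster itself, the same idea applies: rerun the tool on the training subset and sample the output raster at the held-out locations.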
Question
Ku-band geostationary satellite elevation angle, for internet and TV.
No reference ... I performed the calculation myself.
It is only a problem of spherical geometry, considering the position of the geostationary satellite (whatever the band used) and the position of the user on the ground.
The only impact of the band used (Ku in your case) is the lowest usable elevation.
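The spherical geometry mentioned above can be sketched in a few lines. With γ the central angle between the ground station and the subsatellite point, a standard formula is E = atan((cos γ − Re/Rs) / sin γ). A Python version, with Earth and orbit radii rounded:

```python
import math

R_EARTH = 6378.137   # equatorial radius, km
R_GEO = 42164.0      # geostationary orbital radius, km

def elevation_angle(lat_deg, dlon_deg):
    """Elevation (degrees) of a geostationary satellite as seen from a ground
    station at latitude lat_deg, where dlon_deg is the longitude difference
    to the subsatellite point. The band (Ku, Ka, ...) does not enter the
    geometry; negative results mean the satellite is below the horizon."""
    lat, dlon = math.radians(lat_deg), math.radians(dlon_deg)
    cos_g = math.cos(lat) * math.cos(dlon)  # cosine of the central angle
    g = math.acos(cos_g)
    if g == 0:
        return 90.0  # station directly under the satellite
    el = math.atan2(cos_g - R_EARTH / R_GEO, math.sin(g))
    return math.degrees(el)
```

For example, a station at 45° latitude on the satellite's meridian sees it at roughly 38° elevation.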
Question
Hello,
I want to extract data from this netCDF file, 'Sea_sur_temp_tos_Omon_MPI-ESM-MR_historical_r1i1p1_200001-200512.nc'. It has 8 variables in total: time, time_bnds, j (cell index along the second dimension), i (cell index along the first dimension), lat, lat_vertices, lon, lon_vertices, and tos (sea surface temperature). While extracting it through ArcGIS, the dimension box asks for j and i values instead of lat and lon. I don't understand what the values for i and j should be.
Kindly help me in this regard.
Thanks
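For context: in CMIP-style ocean model output, lat and lon are themselves 2-D variables indexed by (j, i), so i and j are just grid cell indices on the model's curvilinear grid. To extract tos at a geographic point, you look up the (j, i) of the nearest cell. A toy, stdlib-only sketch of that lookup (the coordinate values below are hypothetical):

```python
def nearest_ij(lat2d, lon2d, target_lat, target_lon):
    """Return (j, i) of the grid cell whose centre is nearest the target point.
    lat2d and lon2d are 2-D arrays (lists of rows) indexed as [j][i]."""
    best, best_d2 = None, float("inf")
    for j, (lat_row, lon_row) in enumerate(zip(lat2d, lon2d)):
        for i, (la, lo) in enumerate(zip(lat_row, lon_row)):
            d2 = (la - target_lat) ** 2 + (lo - target_lon) ** 2
            if d2 < best_d2:
                best, best_d2 = (j, i), d2
    return best

# Toy 2x3 curvilinear grid (hypothetical coordinates):
lat = [[10.0, 10.1, 10.2], [11.0, 11.1, 11.2]]
lon = [[40.0, 41.0, 42.0], [40.1, 41.1, 42.1]]
```

With real files, tools such as xarray read the lat/lon arrays for you; the (j, i) index found this way is what ArcGIS's dimension box is asking for.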
Question
-
Generally speaking, specifics about the individual chemical cocktails used in a project are considered proprietary and are usually only available with explicit permission from the data owner (the company doing the job). Some exceptions, for example, would be data collected from IHS (usually at a cost), or data voluntarily promulgated (e.g. fracfocus.org). If you are looking for general soil/water data, I would start here:
I would then peruse similar pubs like this one:
then continue mining data from the subsources therein (i.e. the Acknowledgements). Hope that helps.
Question
I have cloud-masked time series images, and I want to apply temporal gap filling by interpolating the values of non-cloud pixels into those that had clouds (NaN after cloud masking) in my time series. Is there any software, methodology, or script I can use to perform such temporal refilling?
Here is some work we did on Land Surface Temperature (LST) data; see
Metz, M.; Rocchini, D.; Neteler, M. 2014: Surface temperatures at the continental scale: Tracking changes with remote sensing at unprecedented detail. Remote Sensing. 2014, 6(5): 3822-3840, http://www.mdpi.com/2072-4292/6/5/3822
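A simple per-pixel approach, independent of the paper above, is to linearly interpolate each pixel's time series across the cloud-masked dates (here represented as None). A minimal sketch:

```python
def fill_gaps(times, values):
    """Linearly interpolate None (cloud-masked) entries in one pixel's
    time series; gaps at the ends take the nearest valid observation."""
    known = [(t, v) for t, v in zip(times, values) if v is not None]
    filled = []
    for t, v in zip(times, values):
        if v is not None:
            filled.append(v)
            continue
        before = [(tk, vk) for tk, vk in known if tk < t]
        after = [(tk, vk) for tk, vk in known if tk > t]
        if before and after:
            t0, v0 = before[-1]
            t1, v1 = after[0]
            # Linear interpolation between the bracketing valid observations.
            filled.append(v0 + (v1 - v0) * (t - t0) / (t1 - t0))
        else:
            filled.append(before[-1][1] if before else after[0][1])
    return filled
```

Applied per pixel over the image stack (e.g. via numpy or GRASS r.series.interp), this handles the variable revisit times directly because it works on actual acquisition dates.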
Question
I am trying to evaluate the impact of an intervention that was implemented in very poor areas (more poor people, underserved communities). In addition, the location of these areas was such that health services were limited for various administrative reasons. Thus, the intervention areas had two problems: (1) individuals residing in these areas were mostly poor, illiterate, and belonged to underserved communities; (2) the geographical location of the areas also contributed to their vulnerability (people with a similar profile but living elsewhere, in non-intervention areas, had better access to services). I have cross-sectional data about health service utilization from both types of areas at endline. There is no baseline data available for intervention and control. I am willing to do two analyses. (1) Intent-to-treat analysis: here I wish to compare service utilization between "areas" (irrespective of whether a household in an intervention area was exposed to the intervention). The aim is to see whether the intervention could bring some change at the "area" (village) level. My question is: can I use propensity score analysis for this (by matching intervention "areas" with control "areas" on aggregated values of covariates obtained from the survey and Census)? For example, matching intervention areas with non-intervention areas in terms of % of poor households, % of illiterate population, etc. The second analysis is to examine the treatment effect: here I am using propensity score analysis at the individual level (comparing those who were exposed in intervention areas with matched unexposed people from non-intervention areas). Is this the right way of analysing data for my objective?
Thanks a lot, Sebastian. This is very useful. I am working on it and I will update you once I address all these concern, to the extent possible with existing data available.
Question
I have a time series of satellite-derived rasters. What is the best approach to define the spatial pattern of spatio-temporal variation by means of geostatistical analysis?
I would like to characterize how areas are affected by the temporal and spatial variation of a parameter derived from satellite data. How can I relate this to another forcing driver? That driver is measured as a numerical vector.
Is variogram analysis the best approach? Or empirical orthogonal functions?
If one of these is feasible, how would you develop it in R or other open-source software?
The time series raster dataset has more than 50 images with a variable revisit time; the lag time ranges from 7 to 96 days.
Thank you for the replies, but I am interested to see whether variograms or EOFs have been applied by someone to identify spatial patterns, what their opinions are, and what software was used.
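For intuition on the EOF option: EOFs are the eigenvectors of the spatial covariance matrix of the space-time field, and the leading EOF is the spatial pattern explaining the most temporal variance. A toy, stdlib-only sketch with just two spatial locations, where the 2x2 eigenproblem has a closed form (real imagery would use numpy/SVD; this sketch assumes the two series are correlated):

```python
import math

def eof_2pt(series_a, series_b):
    """Leading EOF of a two-location space-time field.
    Returns (explained_variance_fraction, spatial_pattern) from the
    2x2 covariance matrix. Assumes cov(a, b) != 0."""
    n = len(series_a)
    ma, mb = sum(series_a) / n, sum(series_b) / n
    a = [v - ma for v in series_a]
    b = [v - mb for v in series_b]
    caa = sum(x * x for x in a) / n
    cbb = sum(x * x for x in b) / n
    cab = sum(x * y for x, y in zip(a, b)) / n
    # Leading eigenvalue of [[caa, cab], [cab, cbb]] via the trace/determinant:
    tr, det = caa + cbb, caa * cbb - cab * cab
    l1 = tr / 2 + math.sqrt(max(tr * tr / 4 - det, 0.0))
    # Corresponding eigenvector = leading spatial pattern, normalized:
    vx, vy = cab, l1 - caa
    norm = math.hypot(vx, vy) or 1.0
    return l1 / tr, (vx / norm, vy / norm)
```

Two perfectly correlated series load onto a single EOF explaining all the variance; with gridded rasters the same decomposition is usually done with numpy.linalg.svd on the (time x space) anomaly matrix.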
Question
Dear RG members
While reading a paper, I came across the statement "M 6.0 Ranau earthquake dated on June 05, 2015 coupling with intense and prolonged rainfall caused several mass movements such as debris flow, deep-seated and shallow landslides in Mesilou, Sabah", and it prompted me to ask this question.
regards
IJAZ
Dear Ijaz,
There is nothing unusual in the text you cited in your question.
Obviously, an earthquake cannot cause rainfall. Volcanic eruptions can (due to dust injected into the upper troposphere), but there is no relationship between an earthquake (lithosphere) and rainfall (atmosphere).
However, what the text you cited says is clear: it is talking about the relationship between earthquakes and mass movements (landslides, rockfalls, debris flows).
Prolonged rainfall events are key factors in triggering mass movements such as debris flows, rotational slides, and sheet landslides. This is because in a region where hydrogeological instability exists (limestone above evaporites, schists, or swelling clay, for example), water can flow within rock layers, eroding and putting excessive weight on slope terrain.
Thus, just after an earthquake, seismically induced landslides are (unfortunately) a common related hazard. In fact, just after earthquakes (while aftershocks still occur), geologists must work with fire crews or the army (emergency services) to monitor mass movements in the targeted area.
If prolonged rainfall events follow or coincide with an earthquake, it is clear that water is an augmenting factor for such mass movement events.
Best regards
Nic
Question
Interested to know about flood return period.
It is an interesting question, and I am not an expert on earthquake risk. Earthquakes tend to depend on local geology, stratigraphy, and plate tectonics. The location of storm severity is less dependent on surface or subsurface features and more dependent on jet stream and air circulation patterns interacting with surface features as they move over them. Flood recurrence intervals are based primarily on collected storm data or retrievable historic records, and to some degree this may be the same for earthquakes. Since I am not an earthquake expert, I would hesitate to say that they are directly comparable, even though, with enough data, the tools to plot flood and earthquake frequencies and severities for a location may be the same or similar. The time series for floods tends to be more frequent and widespread within a climatic zone, while earthquake frequency and severity tend to be more localized within defined zones and less frequent for most areas, but with connected aftershocks for a period of time related to the initial severe event. As long as you understand the differences, the basic methods to assess risk exist, but note that fitted trend lines represent averages, so 50% of the data could be above and 50% below them. Even though the lines may appear impressive on plots, the confidence limits help to assess the uncertainty and should be greatly respected when severity relates to life and property damage.
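On the return-period side, the standard arithmetic is simple: an event with return period T years has an annual exceedance probability of 1/T, and the chance of at least one occurrence within a planning horizon of n independent years follows directly:

```python
def exceedance_prob(return_period_years, horizon_years):
    """Probability that at least one event with the given return period
    occurs within the horizon, assuming independent years."""
    annual_p = 1.0 / return_period_years
    return 1.0 - (1.0 - annual_p) ** horizon_years
```

For example, a "100-year" flood has roughly a 63% chance of occurring at least once in 100 years, a point often misunderstood when return periods are read as guarantees.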
Question
I am using krige.cv() for cross validation of a data set with 1394 locations. In my code, empirical variogram estimation, model fitting, and kriging (by krige()) everything works fine. But when I use krige.cv() I get the following error.
Error: dimensions do not match: locations 2786 and data 1394
One can notice that 1394*2 = 2786. What could I be missing? Please note that there are no NA, NaN, or missing values in the data, variogram, or kriging results. Everything works fine; it is just krige.cv() that does not work.
Your question is hard to answer because you didn't give details of how the function was called. I know several geostatistical programs with differently organized ways of inputting data and outputting results. My first reaction is that some technical problem is coming from the software itself or from your data file structure; something is not being read correctly.
Question
I am caught in a strange situation. I have been doing some kriging using the gstat package, but the data are exhibiting a lot of bad signs. There is a strong second-order trend surface, with almost all the coefficients declared significant with three asterisks *** in lm(). Though R^2 is small, the empirical variograms are clearly different in the two cases, i.e. variogram(attribute~1) and variogram(attribute~SO(x, y)) have different sills. Furthermore, the directional variograms also show both zonal and geometric anisotropies. When I try to fit a variogram model, I see a big change between the fitted ranges and sills of the simple (no anisotropy assumed) and anisotropy-corrected variogram models. How do I deal with this analysis?
First of all there is a question of terminology
1. "Trend surface" pertains to the data, but ultimately it is not the data but rather the expected value of the random function (that presumably generated the data). Unfortunately, without knowing the multivariate probability distribution there is no way to actually compute the expected value; hence the trend surface is an ad hoc solution. Note that any statistical tests in lm() are based on specific statistical assumptions which actually contradict the assumptions implicit in the use of kriging.
2. Geometric anisotropy and zonal anisotropy are characteristics of the random function and its variogram (the model, not the sample variogram)
Having said all that, there is the practical matter of what to do. First, I would suggest making a plot of the data locations with each location coded by the data value, as a quick way of checking whether there is a "trend". The "trend" may have a directional effect, and this could lead to thinking there is an anisotropy. Next, look at a histogram of the data. Third, ask yourself what you know about the phenomenon that generated the data, e.g. is it reasonable to expect a non-constant expected value and/or a directional effect?
lm( ) should be able to compute the residuals for both a first order polynomial trend and a second order trend. Generate the coded plots for both sets of residuals, how do they compare with each other and with the coded plot of the original data. Compute the histograms for both sets of residuals, again compare with each other and with histogram of original data.
Compute and plot the sample variograms for the original data, for the residuals from first order polynomial and for second order polynomial.
If you can fit reasonable variogram models to any or all of these data sets (original, first order residuals, second order residuals), use the model with cross-validation and the associated data sets.
For each of the three data sets the cross-validation should produce four values at each data location (the actual data or residual, the kriged value, the "error", and the kriging standard deviation). Since the kriging estimator is unbiased, the "mean error" should be zero, so compare with the average error in each of the three cases; the mean squared normalized error (normalized by the kriging standard deviation) should be one, so compare with the empirical value. Look at the fraction of normalized errors exceeding 2.5; theoretically this should be small. Make a bivariate plot of "data value vs kriged value"; the correlation should be high. Make a bivariate plot of kriged values vs errors; these should be uncorrelated.
Did you see evidence of anisotropy in the directional sample variograms for the first and/or second order residuals?
If the cross validation results look acceptable for either the first order residuals or the second order residuals then you can proceed two ways
A. Use ordinary kriging with the residuals data set and add to those results the value of the trend surface at each location. You won't get a kriging variance this way (you can't use the estimated variance from lm()).
B. Use Universal kriging with the original data set, the order of the trend surface but not the coefficients in the trend surface. This way you will get a kriging variance at each interpolation location.
Unless you have an explanation related to the phenomenon to justify using an anisotropy, I suggest not doing it, in particular not a zonal one; gstat does not include a zonal anisotropy. As for the geometric case, there is always the problem of an insufficient number of pairs for the plotted points in the directional sample variograms (the total number of pairs is fixed by the size of the data set, and hence the pairs are split up among the different directions).
Note that the cross-validation results, as well as any final kriging results, may be sensitive to choices such as (1) whether you use a circular or elliptical search neighborhood and (2) the minimum and maximum number of data locations used in the search neighborhood, as well as the variogram model type (exponential, Gaussian, spherical, etc.), but not as sensitive to the variogram parameter values.
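The cross-validation summary statistics described above (mean error near 0, mean squared normalized error near 1, few large normalized errors) can be computed in a few lines once the software has produced the per-location results:

```python
def cv_statistics(actual, kriged, krige_sd):
    """Summary statistics for leave-one-out cross-validation results:
    mean error (should be near 0), mean squared normalized error
    (should be near 1), and the fraction of |normalized errors| > 2.5
    (should be small)."""
    errors = [k - a for a, k in zip(actual, kriged)]
    norm = [e / s for e, s in zip(errors, krige_sd)]
    n = len(errors)
    return {
        "mean_error": sum(errors) / n,
        "mean_sq_norm_error": sum(z * z for z in norm) / n,
        "frac_norm_gt_2.5": sum(1 for z in norm if abs(z) > 2.5) / n,
    }
```

In R, the columns needed here are exactly what gstat's krige.cv() returns (observed, predicted, and kriging variance per location).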
Question
Is it valid to transform ilrs using normal score, Box-Cox or other transformations in order to perform Sequential Gaussian Simulation (SGS)?
Is it valid to use the combination of isometric log-ratios (ilrs) with Direct Sequential Simulation without any transformation?
The ilr is itself a transformation. In my view, it is valid to use it without any further transformation.
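For concreteness, one common pivot-coordinate form of the ilr for a D-part composition is ilr_i = sqrt(i/(i+1)) · ln(g(x_1,…,x_i)/x_{i+1}), where g is the geometric mean; this is one of several equivalent ilr bases. A minimal sketch:

```python
import math

def ilr(composition):
    """Isometric log-ratio transform of a D-part composition (pivot form),
    yielding D-1 unconstrained real coordinates. Parts must be positive."""
    d = len(composition)
    out = []
    for i in range(1, d):
        # Geometric mean of the first i parts.
        gmean = math.exp(sum(math.log(v) for v in composition[:i]) / i)
        out.append(math.sqrt(i / (i + 1)) * math.log(gmean / composition[i]))
    return out
```

The neutral composition (all parts equal) maps to the origin, which is why ilr coordinates are convenient inputs for Gaussian-based simulation.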
Question
What are the steps to project a WGS1984 ASCII file to a Kertau_RSO_Malaya_Meters ASCII file? I have been trying, but it is still not projected well. I cannot open it in FMP; it states that this file has no projection. Seeking an answer from all experts and professors. Thank you, and I appreciate your feedback.
I have also attached my original file for reference.
Question
Dear all
Regards
Ijaz
Excellent discussion. Two important ways landslides can be incorporated into  studies of tectonic deformation:
1. Strong ground shaking due to large earthquakes can cause landslides over a very large area (hundreds of km²), and age-dating is the critical piece of the puzzle, as mentioned by William Hansen above. If there is a known active fault (or faults) in the region, the timing of large seismic events as determined by paleoseismic methods can be compared to the timing of the landslides, and if the data allow for them to occur at the same time, an earthquake is a good candidate for the trigger.
2. Landslides of all sizes can also be directly triggered by uplift associated with fault movement.  For example movement on a daylighting thrust fault can produce very large gravitational failures of the overlying hangingwall at the range front, especially where the fault decreases in dip in the shallow section (upper <1-2 km) which is very common. They also occur over the forelimb of folds formed over blind thrust faults where fault movement steepens the forelimb and oversteepens the overlying terrain.  I observed and mapped both types of earthquake-related landslides caused by the 1999 M7.6 Chi Chi earthquake in Taiwan.
Question
Hello,
It will be really helpful, if someone can suggest some sources where DMSP-OLS data with thermal band is available.
Hi Sayan,
Hope you are doing good.
Concerning your question, as Vijith Sir suggested, NOAA seems to be the most appropriate link to retrieve TIR data for DMSP. Additionally, you may also want to check out the attached link.
Question
I have soil contaminant samples collected at different depths/layers. I generated a contaminant surface at each depth/layer using the ArcGIS kriging tool. However, I need a vertical picture of the contaminant across layers, which I think I can achieve by interpolating the values across the different layers. As far as I know, ArcGIS can't do this, so I would be happy to know of any freeware I can use to achieve it. Many thanks.
I would use GRASS GIS module v.vol.rst which interpolates vector point data to a 3D raster map using regularized spline with tension algorithm (RST).
Here is a small example in command line syntax (you can do the same in Python, GUI or modeler):
v.import input=points.shp output=points
g.region vector=points res=0.5 t=5 b=-15 tbres=0.5
v.vol.rst input=points wcolumn=values_column tension=20 smooth=0.6 segmax=400 dmin=0.5 zscale=100 elevation=result
r3.to.rast input=result output=result_slice
where:
points.shp is the name or full path of the vector file you want to import (uses GDAL/OGR)
points is the name of the vector map in GRASS GIS
values_column is the name of an attribute table column with values to interpolate
result is the resulting 3D raster map
result_slice is the base name for the 2D rasters created from horizontal layers of the 3D raster
v.import imports the data into the GRASS GIS database
g.region sets the computational region extent and resolution (2D and 3D) for subsequent raster calculations
v.vol.rst does the 3D interpolation
r3.to.rast does horizontal slices of the 3D raster and creates a series of 2D rasters
The result is a 3D raster which you can further analyze or visualize in GRASS GIS or you can slice it horizontally into a series of 2D rasters and analyze and visualize them (or export them using r.out.gdal).
Question
I have 5 raster layers depicting different temperature levels across a given geographic space. I need to use a common "Stretched" color ramp to show how this phenomenon varies across space, where a given color in the color scale represents the same value in all the layers. Kindly see the attached sample. I need it in a "stretched" style. I use ArcGIS 10.3.1.
I have tried a couple of things which haven’t worked yet.
I made a dummy raster with values equivalent to the min-max of the 5 rasters. The lowest value across all the rasters is 5 and the maximum is 74, so I created a dummy raster with a minimum value of 5 and a maximum of 74. Under Layer Properties > Symbology, I then symbolized the dummy with a color ramp of choice, chose Minimum-Maximum under "Stretch", and chose "From Custom Settings (below)" under "Statistics". I saved the layer as a layer file (.lyr).
The problem is that when I import the symbology or apply the layer file, all the rasters retain the same symbology and show the same min-max values (5 and 74). I need them to show their real value ranges, such as 34 to 58, and the colors should reflect that range within the common color scale/symbology.
How do I go about this? I need a quick way out. I am not experienced in Python or other programming languages, but with detailed steps I can also try if that’s the only way out.
Did you try saving a layer file of the ideal raster symbology? After you create that .lyr file you can import it to the other rasters you want to match.
Question
AIC and BIC are Information criteria methods used to assess model fit while penalizing the number of estimated parameters. As I understand, when performing model selection, the one with the lowest AIC or BIC is preferred. In a situation am working on, a model with the lowest AIC is not necessarily the one with the lowest BIC. Is there any reason to prefer one over the other?
As Geoffrey pointed out, the BIC penalizes more heavily for complex models.  So the context certainly matters here.
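Both criteria are one-liners given the maximized log-likelihood ln L, the number of parameters k, and the sample size n: AIC = 2k − 2 ln L and BIC = k ln n − 2 ln L, which makes the heavier BIC penalty explicit whenever ln n > 2 (i.e. n > ~7):

```python
import math

def aic(log_likelihood, k):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    """Bayesian information criterion: k ln(n) - 2 ln L.
    Its per-parameter penalty grows with sample size, unlike AIC's."""
    return k * math.log(n) - 2 * log_likelihood
```

Because the penalties differ, the AIC-best and BIC-best models can disagree, exactly as described in the question; BIC leans toward the more parsimonious model as n grows.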
Question
Hello everybody,
I used the ArcMap Geostatistical Analyst to interpolate a surface for heavy metals. Before interpolation I divided my data set into training and test subsets. With the training data I interpolated a surface, and now I want to determine residuals for the test data to check the precision of the interpolation. How can I determine the value at an unknown position?
A few observations:
1. When you use kriging you are relying on a model for the random function (the data is a non-random sample from one realization of the random function), and the random function must satisfy several crucial assumptions, such as intrinsic stationarity and that the data can be used to estimate/model the variogram. There are no statistical tests to determine whether these assumptions are valid.
2. The kriging estimator is "exact" (sometimes also called "perfect"). This means that if you generate an interpolated value at a data location and use the data value in the interpolation the interpolated value will be the data value (the residual is always zero).
3. If you split the data locations into a training set and an interpolation set you are assuming that both are samples from the same realization of the same random function. The method of choosing the two data subsets does not ensure this assumption is satisfied. Hence this technique does not really quantify the accuracy/precision of the interpolation.
4. Cross-validation (leave one out) is really a method to compare one variogram model against another, and/or one choice of the variogram parameters against another, or possibly to see the effect of the search neighborhood parameters on the interpolated values. It can also be useful for identifying "unusual" data values.
5. As an exploratory technique, splitting the data locations into two subsets may be useful but there is no real theory to back it up. A different split might produce very different results and there is no way to determine which set of results is more relevant.
As for cross-validation, there are multiple statistics that can be computed, but no single one is more important or reliable than the others. For a discussion of these see the following:
1991, Myers,D.E., On Variogram Estimation in Proceedings of the First Inter. Conf. Stat. Comp., Cesme, Turkey, 30 Mar.-2 April 1987, Vol II, American Sciences Press, 261-281
Question
Hi everyone, I have a point data set with 197 different coordinates. I am selecting 25% to be used for training. When I run Maxent with certain environmental layers, it only uses 8 presence records for training and 2 for testing. Any ideas why this is?
Thanks for the suggestions guys. I discovered it was a TWI layer I produced that was causing the problem. It was in a floating point pixel format which I have changed to signed 32 bit, the same as my other layers, and this seems to have done the trick.
Question
I want to compare blending data using Regression Kriging and Bayesian Kriging. What is the advantage of Bayesian Kriging compared to Regression Kriging? Does anyone have a recommended link/journal for learning Bayesian Kriging? Many thanks in advance.
You should also consider cokriging. Exactly what do you mean by "more powerful"? I.e., how would you quantify "power"?
Regression kriging usually means combining regression and kriging in some way, in particular where the expected value of the random function is nonconstant, e.g. it is a polynomial function of the position coordinates or a function (usually a low-degree polynomial) of another spatially distributed variable, such as elevation in the case of precipitation.
Precipitation data is nearly always "point" data whereas satellite data is non-point data (there is a pixel size), so you must compensate for the difference in data support. Regression doesn't do that, and in general Bayesian modeling doesn't either. I suggest that you look more at the literature on combining rain gauge data with Doppler radar data, which is more analogous to your problem. You will see that cokriging has most often been used.
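The two-step regression-kriging idea described above can be sketched in a few lines of Python: fit a trend of the target on a covariate (elevation, say), interpolate the regression residuals spatially, and add the two parts back together. In this sketch plain inverse-distance weighting stands in for kriging of the residuals, and all names and data are illustrative:

```python
import numpy as np

def regression_kriging_predict(xy, z, covariate, xy_new, cov_new):
    """Regression-kriging sketch: (1) OLS trend z ~ a + b*covariate,
    (2) spatial interpolation of the residuals (IDW stands in for
    kriging here), (3) trend + interpolated residual."""
    # step 1: ordinary least squares trend
    A = np.column_stack([np.ones_like(covariate), covariate])
    beta, *_ = np.linalg.lstsq(A, z, rcond=None)
    resid = z - A @ beta
    # step 2: inverse-distance interpolation of residuals at xy_new
    d = np.linalg.norm(xy - xy_new, axis=1)
    w = 1.0 / np.maximum(d, 1e-9) ** 2
    resid_hat = np.sum(w * resid) / np.sum(w)
    # step 3: recombine trend and residual
    return beta[0] + beta[1] * cov_new + resid_hat

# Tiny synthetic check: z is purely a linear function of the covariate,
# so the residual part is zero and the prediction is trend only.
xy = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
cov = np.array([10., 20., 30., 40.])
z = 2.0 + 3.0 * cov
pred = regression_kriging_predict(xy, z, cov, np.array([0.5, 0.5]), 25.0)
print(pred)   # 2 + 3*25 = 77.0
```

Note this sketch ignores the support-difference problem discussed above; it only illustrates the trend-plus-residual decomposition.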
Question
I'm modeling labor productivity on farm sites spread widely across the US, and I would like to include NDVI (or another vegetation index) as a predictor variable. I'm wondering if there is month during the growing season that makes the most sense for comparing across disparate climate types. Thank you for any suggestions!
Hi Rafter,
One tool to detect onset and other parameters of vegetation seasonality, as suggested by Wietske, is TimeSat: http://web.nateko.lu.se/timesat/timesat.asp
As in your study you are using isolated farm sites, it is probably better and simpler to use an available map of vegetation phenology (e.g. http://onlinelibrary.wiley.com/doi/10.1029/2006JG000217/epdf) to choose the month when NDVI should be compared (for example the one corresponding to the green peak), instead of collecting the NDVI data and running TimeSat or other software.
All the best,
João
Question
I am modeling undrained behavior of clays/drained behavior of sands under various static loading cases. I understand that the geostatic step is utilized in general analysis of soils. I was wondering if one can replace this with static, general case under gravity loading. I am not interested in calculating pore pressure and the soil is homogeneous in nature.
Elaheh, first let me describe how the initial stresses are specified. The predefined stresses option allows you to define the stress state of any object at the initial time step. You can use it in one of two ways, namely the initial stress option or the geostatic stress option. Using the initial stress option you can specify the 3 principal and 3 shear stresses. For most common geotechnical problems we do not need this option and stick to geostatic stresses.
ABAQUS requires the specification of vertical stresses at 2 points, which it then uses to calculate the stresses at all nodes in between by linear interpolation. For example, assume a column of soil 1 m high below the ground level: the vertical stress at the top of the soil is 0 at vertical coordinate 0, and the stress at vertical coordinate 1 m is equal to -density x gravity x height (here 1 m); the sign convention requires compressive stresses to be negative. The K0 value is specified using lateral coefficient 1; if the horizontal stress is asymmetric, you specify the other value in lateral coefficient 2.
As for modeling your wellbore problem, I am not sure of the nature of the simulation. If you can give a few more details, we can look at solving it. I hope this helps!
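The interpolation described above is easy to reproduce by hand as a sanity check. This Python sketch (illustrative density and K0 values) computes the vertical stress between the two specified points and the corresponding horizontal stress from K0:

```python
def geostatic_stress(y, y1, s1, y2, s2):
    """Vertical stress at coordinate y, linearly interpolated between two
    (coordinate, stress) pairs -- mirroring how ABAQUS fills in the
    nodes between the two specified points."""
    return s1 + (y - y1) * (s2 - s1) / (y2 - y1)

# 1 m soil column; density 1800 kg/m^3, g = 9.81 m/s^2, K0 = 0.5
# (all values illustrative)
rho, g, k0 = 1800.0, 9.81, 0.5
s_top = 0.0                    # free surface at coordinate 0
s_bot = -rho * g * 1.0         # compressive stresses are negative
sv = geostatic_stress(0.5, 0.0, s_top, 1.0, s_bot)   # mid-depth
sh = k0 * sv                   # horizontal stress from K0
print(sv, sh)                  # -8829.0 -4414.5 (Pa)
```

If the stresses you hand-compute this way disagree with what the geostatic step equilibrates to, the initial stress definition and the body force are inconsistent.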
Question
Hello all.
I have a data set with coordinates from locations that some fishes have been collected.
I want to be able to estimate the distances between each of those locations, but following the river path. I know how to estimate Euclidean distances, but they are computed as straight lines connecting the locations, ignoring the river.
Is there any program, R function or QGIS plugin that could calculate such distances? I have a shape file for the river basin in question, that I can transform to lines file in QGIS, but I can't get the estimates of distance of these locations along the river.
Thanks,
JP
This is known as linear referencing. In ArcGIS, you can use the linear referencing tools (see the tutorial). For QGIS, you may like to have a look at the plugin from Faunalia, which works on a PostGIS base (the plugin expects that all data for processing is available in a PostGIS database).
Main steps in ArcGIS to understand the workflow:
• First you create a "route" (= your river, as a PolylineM object). It must be clean, oriented in the right direction and not multi-part to work properly. You must have a unique id per river part if necessary.
• With your route feature created with Create Routes, you "Locate features along routes", which will ultimately give you a kilometric value from the start to the end of the river for each of your points (fishes). With this kilometric value, you can then calculate the distance between them, following the river.
Hope this will help.
______
Wikipedia: "Linear referencing (also called linear reference system or linear referencing system or LRS), is a method of spatial referencing, in which the locations of features are described in terms of measurements along a linear element, from a defined starting point, for example a milestone along a road. Each feature is located by either a point (e.g. a signpost) or a line (e.g. a no-passing zone). The system is designed so that if a segment of a route is changed, only those milepoints on the changed segment need to be updated. Linear referencing is suitable for management of data related to linear features like roads, railways, oil and gas transmission pipelines, power and data transmission lines, and rivers."
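The "kilometric value" step can be sketched in plain Python for a simplified, single-branch river: project each point onto the centreline polyline and record its distance along it; the along-river distance between two points is then the difference of those values. (A real basin with tributaries needs proper network routing, as in the tools above; the coordinates here are made up and assumed to be in a projected CRS, e.g. metres.)

```python
import math

def chainage(polyline, point):
    """Distance along a polyline to the point on it closest to `point`.
    `polyline` is a list of (x, y) vertices in projected coordinates."""
    best_d2, best_chain = float("inf"), 0.0
    run = 0.0  # cumulative length up to the current segment
    for (x1, y1), (x2, y2) in zip(polyline, polyline[1:]):
        dx, dy = x2 - x1, y2 - y1
        seg = math.hypot(dx, dy)
        # parameter of the perpendicular foot, clamped onto the segment
        t = max(0.0, min(1.0, ((point[0]-x1)*dx + (point[1]-y1)*dy) / seg**2))
        px, py = x1 + t*dx, y1 + t*dy
        d2 = (point[0]-px)**2 + (point[1]-py)**2
        if d2 < best_d2:
            best_d2, best_chain = d2, run + t*seg
        run += seg
    return best_chain

river = [(0, 0), (100, 0), (100, 100), (200, 100)]   # toy centreline
a = chainage(river, (10, 1))     # 10.0 along the first segment
b = chainage(river, (100, 50))   # 150.0
print(abs(b - a))                # along-river distance: 140.0
```

This is essentially what `shapely`'s `LineString.project` does, if you prefer a library call.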
Question
I plotted a variogram for a dataset which contains 83 observation sites. The maximum distance between the observation points is 60 km, but the range of the variogram I obtained is 261 km. What does this mean?
[I thought that a range of 261 km means kriging can predict well within a radius of 261 km. Probably my concept is wrong.]
Jennifer is right: if your software permits it, look at the numbers of pairs used to compute the sample variogram for different lag distances, and you will see that the numbers begin to decrease for lag distances that exceed about half the maximum distance between data locations. You may want to consider whether there are other hidden problems in the software if it produces clearly wrong results like this. It is certainly possible that the range is actually greater than the maximum distance, but you can't determine that from the sample variogram. There are multiple possible algorithms for auto-fitting the variogram parameters, but no one choice is best for all data sets. Also, if your software permits it, look at the kriging weights. In general you don't want any negative weights; those would indicate too large a search neighborhood and/or too large a range.
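The pair-count check suggested above is simple to do yourself if your software won't report it. A minimal Python sketch on synthetic data (83 random sites in a ~60 km square, matching the question's setup but with made-up values) shows how few pairs support the longest lags:

```python
import numpy as np

def empirical_variogram(xy, z, n_bins=10):
    """Empirical semivariogram with the number of pairs per lag bin.
    The pair counts show why lags beyond about half the maximum
    inter-point distance are unreliable."""
    n = len(z)
    i, j = np.triu_indices(n, k=1)                 # all point pairs
    h = np.linalg.norm(xy[i] - xy[j], axis=1)      # pair separations
    sq = 0.5 * (z[i] - z[j]) ** 2                  # semivariance terms
    edges = np.linspace(0, h.max(), n_bins + 1)
    which = np.clip(np.digitize(h, edges) - 1, 0, n_bins - 1)
    lags = 0.5 * (edges[:-1] + edges[1:])
    counts = np.bincount(which, minlength=n_bins)
    gamma = np.array([sq[which == b].mean() if counts[b] else np.nan
                      for b in range(n_bins)])
    return lags, gamma, counts

rng = np.random.default_rng(1)
xy = rng.uniform(0, 60, size=(83, 2))   # 83 sites, ~60 km extent
z = rng.normal(size=83)                 # placeholder data values
lags, gamma, counts = empirical_variogram(xy, z)
print(counts)   # counts shrink sharply in the longest-lag bins
```

A fitted range that is several times larger than the lag at which the counts collapse should not be trusted.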
Question
I want to know if there is an equation to compute the minimum number of samples to take in the field.
There is no simple answer because a lot depends on the specific geographic region you are studying. It also depends on the spatial pattern of the data locations. If you have not already consulted them, see papers in the European J. of Soil Science (formerly the J. of Soil Science) and the Soil Science Society of America Journal.
As suggested above, if you really don't know anything about the spatial distribution of the particular soil parameters of interest, you are going to have to do some preliminary data collection and use those results to guide you in further sampling. Also as suggested above, data collection for estimating and modeling the variogram is not the same as data collection for subsequent kriging. The cost of sampling will always have some impact; if there were no cost or difficulty in getting data, then get a couple of thousand samples, but that is usually not the situation.
Once you have some data you will want to compute various exploratory statistics (average, standard deviation, histogram, a plot of the data locations coded by the data values). Do you have any information on the soil type(s)? Are they the same over the entire region of interest or different? If you are using an auger to collect soil samples for lab analysis, note the size of the auger (depth, diameter).
1987, A. Warrick and D.E. Myers, Optimization of Sampling Locations for Variogram Calculations Water Resources Research 23, 496-500
1990, R. Zhang, A. Warrick and D.E. Myers, Variance as a function of sample support size Mathematical Geology 22, 107-121
R. Webster and some of his students had a series of four papers in the J. of Soil Science, circa 1980.
Question
Hi, I'm looking for Aphonopelma chalcodes distribution data for use in a GIS. I need the data in GeoTIFF or shapefile format.
The recent revision of the genus Aphonopelma will give you most of the data you need, certainly the best data available, before you seek to add records and distributional data for any of the species.
Question
Can anyone please tell me the use of Normalized Rank in Analytical Hierarchy Process used in ArcGIS? Is it used to scale all the maps into one as it is being taken the value from 0 to 1 for all the features.
Yes, the scale will be the same for all the features, as in the second-to-last slide of the presentation I attached to my previous answer. Consider the fields in the initial three maps (View, Slope and Price) as consisting of pixels of the same value. They are standardised, to change the rankings into probabilities, and multiplied by the weight calculated in the previous slide, to effectively change each map into a weighted factor in the final decision. When they are added together to form the decision map, the units are still probabilities of the suitability of that field, because the weights add up to one.
What was done with discrete units in the slide 6, is now done with fields of pixels on different feature maps on slide 7. For your laptop example, you will have to use the approach in slide 6.
There your factors will be Price, Design, RAM, Colour, Screen, Weight, Webcam, Hard disk and Size. So you have to decide which of the factors counts the most, the second most, ..., the eighth most and the least, effectively ranking the factors from 1 to 9. The factor ranks are then standardised to have values between zero and one, where 1 is the highest and 0 the lowest, and normalised by dividing each by the sum of all the ranks. These probabilities should add up to one, and they are the numbers in blue under the attributes in slide 6.
Each factor must then be rated against each laptop producer in the same way and these probabilities, which add up to one for each factor over the three products, become the numbers in blue on the lines in the last section of slide 6.
The final probabilities are calculated in the way illustrated by the calculation on the far right of slide 6.
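The normalization and weighted-sum steps described above amount to only a few lines of Python. The factor ranks and ratings below are made up for illustration (three factors and two alternatives rather than the nine-factor laptop example):

```python
def normalize_ranks(ranks):
    """Divide each rank by the sum of all ranks so the resulting
    weights add up to one (the 'normalized rank' step in AHP)."""
    total = sum(ranks)
    return [r / total for r in ranks]

# Three hypothetical factors ranked 3 (most important), 2, and 1.
weights = normalize_ranks([3, 2, 1])
print(weights)            # [0.5, 0.333..., 0.166...]; sums to 1

# Each alternative is rated per factor (ratings per factor sum to 1
# across alternatives); its final score is the weighted sum.
ratings = {"A": [0.5, 0.3, 0.2], "B": [0.2, 0.5, 0.6]}
scores = {k: sum(w * r for w, r in zip(weights, v))
          for k, v in ratings.items()}
print(scores)             # A scores higher than B here
```

Because the weights sum to one, the final scores remain on a common 0-to-1 probability scale, which is exactly why the decision map's units stay interpretable.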
Question
The nugget effect is the value of the variogram when the lag distance is equal to zero, and according to the literature it depends on measurement error. I wonder if anyone can help me clarify this point.
A large nugget effect relative to total variability can be caused by sampling that is too sparse with respect to spatial variability, or by measurement error. In either case it is small-scale variability.
Generally you fit a nugget effect by eye from the variogram. The first few points (two or more) are the most instructive. However, you can also use your experience as a guide in fitting the nugget effect.
If you are doing variography as a prelude to kriging or simulation, the nugget effect affects the degree of smoothing in the results.
Question
Hi everyone
I want to add the coordinates of several points as a post map in Surfer. The problem is that after importing the data they are not in a straight line; they are shown in a zigzag formation. How can I fix the problem? The points are the coordinates of a ship's movement, so they should be shown in a straight line.
Thanks
Thank you, Christian Günther. It's very good.
Question
Dear colleagues,
I am running a geostatic procedure for modeling a tunnel in clay soil.
ABAQUS is used.
The soil model is Cam-Clay.
Initial stresses in the soil are given.
Material data is fully specified.
Nevertheless, the program displays an error message:
The sum of initial pressure and elastic tensile strength for the porous elastic material in element 1 instance ground1-1 must be positive.
It seems that you use *INITIAL CONDITIONS, TYPE=PORE PRESSURE in your model. You have to reduce the initial pore pressure imposed before the start of the analysis. If its value looks reasonable, though, check the consistency of your units.
George
Question
I downloaded MODIS level 2 ocean colour images and displayed them in SeaDAS version 7.2. My area of interest is Lake Victoria. However, I am facing a problem with the areas that are affected by cloud cover, which causes no data to be available for those parts. Is there a way I can extrapolate using the available data in order to have data in the areas affected by cloud cover? Also, is there any other software I can use to perform that function?
There are several options depending on what you want to do and the spatial and temporal size of the gaps in the data. For small gaps, people often use only spatial filters or linear interpolation, for example. You could also empirically set up something that interpolates both in space and time.
In my case, over the Argentine Sea (using MODIS L3, 8-day composites, 4.6 km), I had huge regions with no data for 3 months every year. I used HANTS (Harmonic ANalysis of Time Series, available as an add-on in GRASS GIS 7) and DINEOF (available through the package sinkr in R). For my case, DINEOF worked much better :)
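For the small-gap case, the "spatial filter" idea can be sketched very simply in Python: repeatedly replace each missing cell with the mean of its valid 4-neighbours until the gap closes. This is only a crude stand-in for DINEOF or HANTS (it ignores the time dimension entirely and smears structure across large gaps), but it shows the mechanics on a toy grid:

```python
import warnings
import numpy as np

def fill_gaps(grid, max_iter=200):
    """Iteratively replace each NaN cell with the mean of its valid
    4-neighbours until no gaps remain. Suitable only for small gaps."""
    g = grid.astype(float)
    for _ in range(max_iter):
        nan = np.isnan(g)
        if not nan.any():
            break
        padded = np.pad(g, 1, constant_values=np.nan)
        stack = np.stack([padded[:-2, 1:-1], padded[2:, 1:-1],
                          padded[1:-1, :-2], padded[1:-1, 2:]])
        with warnings.catch_warnings():
            warnings.simplefilter("ignore", RuntimeWarning)
            nb_mean = np.nanmean(stack, axis=0)   # all-NaN cells stay NaN
        fillable = nan & ~np.isnan(nb_mean)
        g[fillable] = nb_mean[fillable]
    return g

field = np.arange(25, dtype=float).reshape(5, 5)
field[1:3, 1:3] = np.nan          # simulated 2x2 cloud gap
filled = fill_gaps(field)
print(np.isnan(filled).any())     # False: gap is closed
```

For multi-month gaps like the ones described above, a method that exploits the temporal dimension (DINEOF, HANTS) is clearly the better choice.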
Question
I am trying to find a reference for the following question:
Suppose we have 4 arrays (with the same probes on each).
The first and second are hybridized with sample 1.
The third and fourth are hybridized with sample 2.
Then we want to compare the signal from the probes between samples.
As I understand it, we must carry out the RMA procedure for each array to correct the background, then construct the empirical signal distribution from ALL 4 ARRAYS and perform quantile normalization. So our input matrix for QN will consist of 4 columns.
But colleagues say that we must perform QN independently for the first and second, and for the third and fourth.
Who is right?
Do you have any housekeeping genes which you can use to normalize for total RNA quantity? The problem with thematic arrays is that you have 2 sources of variation: (a) technical, due to differences in extracted/amplified RNA quantity, and (b) due to the process you target. It is not clear how to distinguish between them without additional information.
I would also try something that perturbs the data less, for example simple linear centring/scaling. Have a look at the distributions of the original and normalized data (of course in log scale) and identify the problems you would like to remove by normalization.
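Whichever grouping is chosen, the QN step itself is short. Here is a minimal NumPy sketch: each column (array) is forced to share a common reference distribution, the mean of the sorted columns. Ties are assigned by sort order rather than averaged, which production implementations (e.g. limma's normalizeQuantiles in R) handle more carefully; the toy matrix is made up:

```python
import numpy as np

def quantile_normalize(x):
    """Quantile normalization: force every column of x to share the
    same empirical distribution (the mean of the sorted columns)."""
    order = np.argsort(x, axis=0)              # per-column sort order
    ref = np.sort(x, axis=0).mean(axis=1)      # reference distribution
    out = np.empty_like(x, dtype=float)
    for j in range(x.shape[1]):
        out[order[:, j], j] = ref              # map ranks back to ref values
    return out

# Toy input: 3 probes x 4 arrays (columns), e.g. two arrays per sample
x = np.array([[5., 4., 3., 6.],
              [2., 1., 4., 2.],
              [3., 4., 6., 8.]])
print(quantile_normalize(x))
```

Running QN over all 4 columns versus over columns {1,2} and {3,4} separately simply changes which columns contribute to `ref`, which is exactly the disagreement in the question: joint QN also removes part of the between-sample biological difference.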
Question