REGRESSION AND CORRELATION ANALYSIS OF GRID SOIL DATA VERSUS
CELL SPATIAL DATA¹
J.P. MOLIN, H.T.Z. COUTO
University of São Paulo, Piracicaba, Brazil
E-mail: email@example.com, firstname.lastname@example.org
L.M. GIMENEZ, V. PAULETTI, R. MOLIN
ABC Foundation, Castro, Brazil
Instituto Agronômico, Campinas, Brazil.
One of the many issues in precision agriculture is the correlation between yield maps and the
yield factors involved. Low correlations have led to questions about how to make fertilizer
prescriptions, for instance, based on yield and soil sampling information. Some software runs
correlations and regressions relating each cell of the field generated from interpolation. Another
approach is to correlate the yield data only at grid points with the information that
comes from the laboratory, but a yield has to be attributed to each point because the yield data come
from a set of spot readings not related to the grid soil sampling. In this study some attempts
were made to represent the yield for each soil sample point and to run the analysis in both
ways with data from one field. The field was sampled at three depths (0-5, 0-10 and 0-20 cm) on a 30 m
regular grid. Both approaches resulted in low coefficients of determination (R2).
The complexity of explaining yield variability using specific factors is well known. Recent
literature presents many examples where yield factors have been listed based on the correlation
between yield and soil fertility parameters. Acock and Pachepsky (1997) state that it is not
possible to explain yield changes in the field by only a few limiting factors. They discuss the use
of mechanistic crop models as a powerful tool for helping in understanding the complex
soil/plant/atmosphere system. Nevertheless, because the plant behavior is, as yet, not completely
understood, these models still have limitations and need improvement.
Regression analysis and other techniques relating yield to the factors that have more potential for
causing yield variability have been frequently used. Drummond et al. (1995) presented a study
where they used different strategies of regression analysis on the kriging-interpolated values on
10 m spatialized cells. Clay et al. (1998) presented this same idea. They evaluated the impact of
grid distance on spatial analysis and profitability, correlating yield and its limiting factors to soil
¹ This work is being partially supported by FAPESP (São Paulo State Research Foundation).
An interesting aspect of correlating yield with soil fertility and other factors is the procedure
generally used. With yield monitor data, geostatistics and GIS, spatialized cell data are
frequently used to run correlations and regressions. Mallarino et al. (1999) collected grain yields
by hand around soil sampling points instead of using a monitor and ran simple correlations and
multiple regressions between the factors. In the first case the values are estimates, and in the
second the yield data have to be manipulated in order to estimate the yield that represents some
neighborhood of the point where soil samples were collected. Some specialized software used in
data collection and analysis for precision agriculture runs correlations and regressions relating
each cell of the field. Those cells result from data already interpolated and based on assumptions
related to geostatistical criteria. Another way to analyze the data is by correlating crop yield with the
information that comes from the laboratory for each point in the field. The difficulty is to attribute
a yield to each point, because the yield comes from a set of spot readings collected by the yield monitor and
not related to the grid soil sampling. This paper deals with the two different scenarios; multiple
regression analyses were run with grain yield and possible causes of yield variability related to
soil fertility, using data from one field that has been cultivated under no-till for years.
The study is part of a project involving local farmers, a private organization and a university in
Paraná State, southern Brazil (long. 49° 55' W, lat. 24° 51' S). A 23.6 ha field is being monitored
and has been cultivated under no-till, rotating corn, soybeans and winter crops, for more than 17
years. Soybean yield data were collected in 1999 using a commercial yield monitor on a combine.
Before planting the soybeans in 1999, grid soil samples were taken at 0-5, 0-10 and 0-20 cm
depth on a 30 m grid, totaling 225 samples for each depth. The soil samples were sent to the
laboratory for soil fertility analysis.
Fertility data were analyzed using geostatistical procedures and the best model for each
semivariogram was used for kriging 10 m cells, using SSToolbox software (SST Development
Group, Inc.). This process resulted in a total of 2360 common cells for yield (interpolated using
the inverse distance method) and soil chemical variables. This constituted the first scenario for
regression analysis.
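The inverse distance interpolation of yield onto 10 m cells can be illustrated with a minimal sketch. It is not the SSToolbox implementation: the function names, the power of 2 and the optional cut-off distance are assumptions made here for illustration, and coordinates are assumed to be projected (in metres).

```python
import math

def idw_estimate(x, y, samples, power=2.0, max_dist=None):
    """Inverse-distance-weighted estimate at (x, y).

    samples: iterable of (xs, ys, value) tuples, coordinates in metres.
    power and max_dist are illustrative parameters, not values from the study.
    Returns None when no sample lies within max_dist.
    """
    num = 0.0
    den = 0.0
    for xs, ys, value in samples:
        d = math.hypot(xs - x, ys - y)
        if d == 0.0:
            return value  # the cell centre coincides with a reading
        if max_dist is not None and d > max_dist:
            continue
        w = 1.0 / d ** power  # closer readings receive larger weights
        num += w * value
        den += w
    return num / den if den > 0.0 else None

def grid_cells(xmin, ymin, xmax, ymax, cell=10.0):
    """Yield the centre coordinates of square cells covering a bounding box."""
    nx = int(math.ceil((xmax - xmin) / cell))
    ny = int(math.ceil((ymax - ymin) / cell))
    for j in range(ny):
        for i in range(nx):
            yield (xmin + (i + 0.5) * cell, ymin + (j + 0.5) * cell)
```

Calling `idw_estimate` at every centre produced by `grid_cells` gives one interpolated yield per cell, which can then be paired with the kriged soil variables of the same cell.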
The other approach to correlating yield and soil fertility factors was based on the grid
sampling results, with a population of 225 points for each depth. As the yield data do not have the
same coordinates as the soil samples, a procedure was developed using spreadsheet tools to
define a representative yield for each of those points. The procedure allowed the search
radius around the point to be selected, so circles of 5, 10 and 15 m radius were tested, each resulting in an average
yield for every sampling location. This was the second scenario for the regression analysis.
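The search-radius averaging described above was done with spreadsheet tools; the sketch below shows the same idea in Python under stated assumptions. The function names are hypothetical, coordinates are assumed to be projected (in metres), and a simple arithmetic mean of all readings inside the circle is assumed.

```python
import math

def mean_yield_within_radius(point, yield_records, radius):
    """Average the yield readings within `radius` metres of a soil sample point.

    point: (x, y) of the soil sample.
    yield_records: iterable of (x, y, yield) readings from the yield monitor.
    Returns None when no reading falls inside the circle.
    """
    px, py = point
    values = [v for x, y, v in yield_records
              if math.hypot(x - px, y - py) <= radius]
    return sum(values) / len(values) if values else None

def yields_for_grid(points, yield_records, radii=(5.0, 10.0, 15.0)):
    """Return {radius: [mean yield per soil sample point]} for each search radius."""
    records = list(yield_records)  # allow several passes over the readings
    return {r: [mean_yield_within_radius(p, records, r) for p in points]
            for r in radii}
```

Each list returned by `yields_for_grid` has one entry per soil sample point, so it can be paired directly with the laboratory results for that depth.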
Multiple regression analysis was run using SAS with the stepwise procedure. Five
models were fitted for yield (linear, logarithmic, quadratic, inverse and square root) involving 10
factors from soil fertility laboratory tests: phosphorus (P), organic matter (OM), pH, hydrogen
plus aluminum (H+Al), potassium (K), calcium (Ca), magnesium (Mg), base saturation (V%), cation
exchange capacity (CEC) and sum of bases (SB: Ca2+, Mg2+ and K+). A significance level of 5% was used for the regression
analyses and the results were expressed by the corresponding coefficients of determination (R2).
The whole analysis was run for each depth for both grid data and cell data.
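How a coefficient of determination is obtained for each of the five model forms can be sketched as follows. This is only an illustration: it fits all supplied predictors by ordinary least squares and does not reproduce the SAS stepwise selection or its 5% entry test. It also assumes that each transformation is applied to the predictors and that the quadratic model uses both x and x squared, which the text does not specify.

```python
import math

# Assumed model forms, applied to each predictor value v (v must be positive
# for the logarithmic, inverse and square root forms).
TRANSFORMS = {
    "linear": lambda v: [v],
    "logarithmic": lambda v: [math.log(v)],
    "quadratic": lambda v: [v, v * v],
    "inverse": lambda v: [1.0 / v],
    "square root": lambda v: [math.sqrt(v)],
}

def solve(a, b):
    """Solve the linear system a.x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def r_squared(rows, y, model="linear"):
    """R2 of a least-squares fit of y on the transformed predictors plus an intercept.

    rows: one list of predictor values per observation (e.g. [pH, K, Ca]).
    y: yield per observation.
    """
    f = TRANSFORMS[model]
    X = [[1.0] + [t for v in row for t in f(v)] for row in rows]
    k = len(X[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    beta = solve(xtx, xty)
    mean_y = sum(y) / len(y)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    ss_res = sum((yi - sum(b * xi for b, xi in zip(beta, r))) ** 2
                 for r, yi in zip(X, y))
    return 1.0 - ss_res / ss_tot
```

Running `r_squared` once per model name, depth and data set (cell or grid) would give a set of R2 values comparable in layout, though not in variable selection, to those reported in the tables.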
RESULTS AND DISCUSSION
All the regression results are presented by their R2 values when significant at the 5% level. The
results for cell data are presented in Table 1 and those for grid data are
summarized in Table 2.
Table 1 – Multiple regression results with variables involved and coefficients of determination
(R2) for yield versus soil fertility factors for each model and depth for cell data analysis.

Model          No. of var.   R2      Variables
Depth: 0-5 cm
Linear         6             0.120   P, OM, pH, H+Al, Al, Mg
Logarithmic    8             0.147   P, OM, pH, H+Al, Al, K, Ca, CEC
Quadratic      8             0.210   P, H+Al, Al, K, Ca, Mg, SB, V%
Inverse        10            0.160   P, OM, pH, H+Al, K, Ca, Mg, SB, CEC, V%
Square root    9             0.179   P, pH, H+Al, Al, K, Ca, Mg, SB, CEC
Depth: 0-10 cm
Linear         5             0.156   P, pH, H+Al, Al, CEC
Logarithmic    3             0.193   pH, H+Al, Mg
Quadratic      10            0.228   P, OM, pH, H+Al, Al, Ca, Mg, SB, CEC, V%
Inverse        3             0.191   H+Al, Mg, CEC
Square root    3             0.158   pH, Mg, CEC
Depth: 0-20 cm
Linear         4             0.232   P, H+Al, Al, K
Logarithmic    6             0.247   pH, H+Al, Al, K, SB, CEC
Quadratic      8             0.284   P, OM, H+Al, K, Ca, Mg, SB, CEC
Inverse        6             0.255   H+Al, Al, K, SB, CEC, V%
Square root    6             0.244   P, pH, H+Al, Al, K, SB
Table 2 – Multiple regression results with variables involved and coefficients of determination
(R2) for yield versus soil fertility factors for each model and depth for grid data analysis.

Model          No. of var.   R2      Variables
Depth: 0-5 cm; Search radius: 5 m
Square root    NS
Depth: 0-5 cm; Search radius: 10 m
Inverse        1             0.065   K
Square root    NS
Depth: 0-5 cm; Search radius: 15 m
Square root    NS
Depth: 0-10 cm; Search radius: 5 m
Linear         2             0.159   pH, H+Al
Logarithmic    2             0.164   pH, H+Al
Quadratic      1             0.100   OM
Inverse        4             0.244   pH, H+Al, Al, K
Square root    2             0.163   pH, H+Al
Depth: 0-10 cm; Search radius: 10 m
Linear         1             0.113   CEC
Logarithmic    3             0.223   pH, Mg, CEC
Quadratic      3             0.196   pH, Mg, CEC
Inverse        2             0.172   K, CEC
Square root    3             0.222   pH, Mg, CEC
Depth: 0-10 cm; Search radius: 15 m
Linear         2             0.213   Mg, CEC
Logarithmic    3             0.280   pH, Mg, CEC
Quadratic      3             0.251   pH, Mg, CEC
Inverse        3             0.263   pH, Mg, CEC
Square root    3             0.281   pH, Mg, CEC
Depth: 0-20 cm; Search radius: 5 m
Linear         2             0.134   OM, CEC
Quadratic      2             0.216   pH, V%
Inverse        3             0.205   OM, K, CEC
Square root    3             0.172   OM, K, CEC
Depth: 0-20 cm; Search radius: 10 m
Linear         3             0.279   pH, K, V%
Quadratic      3             0.248   pH, K, V%
Inverse        3             0.271   OM, K, CEC
Square root    3             0.290   pH, K, V%
Depth: 0-20 cm; Search radius: 15 m
Linear         4             0.406   pH, Al, K, Ca
Logarithmic    4             0.415   pH, Al, K, Ca
Quadratic      5             0.417   pH, Al, Mg, CEC, V%
Inverse        3             0.336   pH, K, CEC
Square root    4             0.415   pH, Al, K, Ca

NS: not significant at the 5% level.
The data set represented by cells resulted in more variables entering the regression
equations, but the coefficients of determination (R2) were low, showing that even
with several factors, yield limitations cannot be attributed to soil fertility in the simple way that is,
in general, believed. The regressions based on grid data reached higher R2 values (up to 0.417 at
0-20 cm with a 15 m search radius, against 0.284 at most for cell data) and with fewer
factors involved. This indicates that using cell data attenuates spatial variability.
For grid data, the R2 values increase as the search radius and
depth increase. This means that yield and the surface soil layer also have a local variability
that is not well explained.
As the yield averaging area increases, the yield tends to be more related to some of the soil fertility
limiting factors represented by a composite sample taken at the center of the area
represented by that yield. On the one hand, the spatial variability of yield is attenuated by increasing the
search radius; on the other, the estimation of soil parameters carries uncertainties due to
interpolation. The increase in the coefficients of determination with search radius
may also be related to the fact that yield is strongly spatially dependent, or to local errors in yield
readings that cause distortions.
It is known that soil sampling depth is an important issue, especially under no-till. As the depth
increased, the R2 values of the regressions increased. For the grid data, the regressions at the 0-5 cm depth were almost never
significant. This suggests that spatial variability in the topsoil is higher and that more work has to be
done on soil sampling depth and its criteria.
In these cases pH, Al, K and Ca frequently appear to explain part of the yield variability.
Nevertheless, the results were not consistent, which corroborates the idea of Acock and
Pachepsky (1997) that it is not correct to try to explain yield variability by only a few factors,
normally restricted to soil fertility.
CONCLUSIONS
- Increasing the size of the yield sampling area increased the coefficients of determination,
meaning that the local variability of yield depends on the sampling area.
- Increasing soil sampling depth also increased the coefficients of determination, showing that
local soil variability is related to depth, especially under no-till, and deserves more study.
- Coefficients of determination were low, showing that even with several factors,
yield limitations cannot be attributed to soil fertility in the simple way that is, in general,
believed.
- The regressions based on grid data resulted in higher R2 values with fewer factors involved,
indicating the attenuation caused by interpolation in the cell data.
REFERENCES
Acock, B. and Pachepsky, Ya. (1997) Holes in precision farming: mechanistic crop models.
Precision Agriculture '97. Stafford, J. (Ed.), pp. 397-404.
Clay, D.E., Carlson, C.G., Chang, J., Clay, S.A., Malo, D.D., Ellsbury, M.M. and Lee, J. (1998)
Systematic evaluation of precision farming soil sampling requirements. In: Precision
Agriculture, Robert, P.C., Rust, R.H., Larson, W.E. (Eds.), Minneapolis, MN, USA,
Drummond, S.T., Sudduth, K.A. and Birrell, S.J. (1995) Analysis and correlation methods from
spatial data, ASAE Paper 95-1335, ASAE, St. Joseph, MI, USA.
Mallarino, A.P., Oyarzabal, E.S. and Hinz, P.N. (1999) Interpreting within-field relationships
between crop yields and soil and plant variables using factor analysis. Precision
Agriculture, 1 (1), pp. 15-25.