Ensemble of Cubist models for soy yield prediction using
soil features and remote sensing variables
Toronto, ON, Canada
University of Toronto
The goal of this work is to develop a predictive model for
selecting elite soy variants for commercial production. Current
breeding practices for new soy variants require rigorous
evaluation over three stages of field tests, corresponding to three
successive growing seasons. We propose to leverage machine
learning methods for identifying high yielding variants using
remote sensing and soil features. To support this proposition, we
trained an ensemble of fifteen decision tree models, one for each
relative maturity band. Collectively, our models used 21 predictive
variables to forecast soybean yields in 2015 at 58 test locations and
identified fifteen elite varieties. This method can boost
commercial soy yields by about 5% and shorten the time for
commercial variant development.
CCS Concepts: • Applied computing → Computers in other
domains → Agriculture
Keywords: yield prediction; decision trees; Cubist; remote sensing; NDVI;
Land Surface Temperature; MODIS; Random Forest.
1. INTRODUCTION
Crop yields are highly variable across fields as a result of complex
interactions among factors such as environmental conditions, soil
properties, management practices, and disease and pest attack [3,
10, 4]. In particular, environmental conditions play a vital role in
yield variability, and crop productivity is limited by water
availability, light, temperature, and nutrients [15,21]. Here, we
augmented the soil and weather data supplied by Syngenta with
publicly available remote sensing data and selected 21 variables
that were the most predictive of soybean yield. Our solution uses
vegetation indices calculated from satellite images to predict crop
yields on a site-by-site basis. Vegetation indices, such as the
normalized difference vegetation index (NDVI), have been widely
used for agricultural mapping and monitoring and are calculated
from reflectance in the red and near-infrared wavelengths. In the last
decade, remote sensing-based yield forecasting has shifted to
the National Aeronautics and Space Administration’s (NASA)
Moderate Resolution Imaging Spectroradiometer (MODIS),
which provides spatial data at a fine resolution [8,24].
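As a minimal illustration of how NDVI is derived from the red and near-infrared bands, the sketch below computes the standard ratio (NIR - Red) / (NIR + Red). The function name and the reflectance values are hypothetical and are not taken from the challenge data.

```python
def ndvi(nir: float, red: float) -> float:
    """Normalized difference vegetation index from NIR and red reflectance."""
    total = nir + red
    if total == 0:
        raise ValueError("NIR + red reflectance must be non-zero")
    return (nir - red) / total


# Hypothetical reflectance values: a dense canopy reflects strongly in the
# near-infrared and absorbs red light, so its NDVI is higher than bare soil.
dense_canopy = ndvi(nir=0.50, red=0.05)
bare_soil = ndvi(nir=0.25, red=0.20)
```

Because the index is a normalized ratio, it stays between -1 and 1 for non-negative reflectances.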
Soybean variants are classified according to relative
maturity (RM), which reflects the time it takes a variety to reach
physiological maturity. Depending on the day length and
flowering time, RM is designated with a numerical system
ranging from 0 to 10 from north to south [2, 12]. Effectively, RM
indicates how well the genetically coded flowering time fits the
environmental conditions of a geographic location. The dataset
provided in this challenge includes yield data for variants mostly
in RM bands ranging from 2.0 to 3.5, covering the soy regions of
the US Midwest (Indiana, Nebraska, Illinois, Iowa and
Minnesota). The differences in yields, weather and soil conditions
across RM bands prompted us to avoid a single global yield model
and instead to build an ensemble of predictive models, one for each
RM band. We applied decision tree methods to explore the
predictive properties of these variables. Other reported studies
have explored correlation, multiple linear regression, decision trees and
neural networks as potential predictive models for crop
yields, with varying degrees of success [20, 9, 1]. Our approach
differs from the existing body of published work by predicting
soybean yields for specific variants of interest at the farm site
2. CRITERIA FOR ELITE VARIETIES
An ‘elite’ variety stably produces high soy yields across years
and locations. Our primary goal was to develop
quantitative models for predicting the soy yield of each variant and
to identify elite soy variants. We used the median yield
of commercial (‘CHECK’) varieties in each RM band as a
benchmark for comparison (Table 1) and selected 15 elite
varieties with yields at least 2.5 bu/ac higher than the corresponding
commercial varieties. These elite lines are expected to boost soy
yields by about 4.5% relative to commercial varieties in each
RM band. Time series analysis of these elite variants indicates stable
productivity over the course of the study and reinforces their elite
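The selection rule described in this section can be sketched as follows. This is an illustrative reading of the criterion, not the code used in the study: the function name, the data layout, and the example yields are hypothetical, and the comparison is assumed to be "at least 2.5 bu/ac above the median CHECK yield of the same RM band".

```python
from statistics import median

GAIN_BU_AC = 2.5  # yield gain over the check benchmark, as stated in the text


def select_elite(check_yields, candidate_yields, gain=GAIN_BU_AC):
    """Return variety ids whose yield beats the RM-matched check benchmark.

    check_yields: {rm_band: [yields of CHECK varieties, bu/ac]}
    candidate_yields: {(variety_id, rm_band): yield in bu/ac}
    """
    # Benchmark per RM band: median yield of the commercial check varieties.
    benchmarks = {rm: median(ys) for rm, ys in check_yields.items()}
    elite = []
    for (variety, rm), value in candidate_yields.items():
        if rm in benchmarks and value >= benchmarks[rm] + gain:
            elite.append(variety)
    return sorted(elite)


# Hypothetical example values (not from Table 1).
checks = {2.5: [55.0, 57.0, 56.0], 3.0: [60.0, 62.0]}
candidates = {("V1", 2.5): 59.0, ("V2", 2.5): 57.0,
              ("V3", 3.0): 63.5, ("V4", 3.0): 63.0}
elite_ids = select_elite(checks, candidates)
```

Using the median rather than the mean keeps the benchmark robust to a single unusually good or bad check plot.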
3. ESTIMATES OF TYPE I ERRORS
The current process of seed selection is exposed to potential Type
I errors, in which variants are designated as highly productive elite
lines yet fail to produce high yields in subsequent years. To
evaluate whether our Cubist ensemble (described in detail in the
following sections) can reduce Type I errors, we applied our model
to the class of 2013 to predict yields
in 2014 using soil and remote sensing variables (RMSE = 1.16,
correlation = 0.84, Table 2). When the same 2.5 bu/ac gain over
RM-matched check lines was used as a cutoff to select elite lines,
our algorithm eliminated six soy lines from the elite list due to
insufficient yield consistency. We conclude that our predictive
analytics method can successfully ‘weed out’ non-elite lines and
enhance precision in soy seed selection by Syngenta.
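The two error estimates quoted above (RMSE and correlation between observed and predicted yields) can be computed as in the sketch below. The helper names and the example yields are hypothetical. The text does not say which correlation coefficient was used, so the Pearson coefficient here is an assumption.

```python
from math import sqrt


def rmse(observed, predicted):
    """Root mean squared error between paired observations and predictions."""
    n = len(observed)
    return sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical observed and predicted yields in bu/ac.
observed_2014 = [50.0, 55.0, 60.0, 65.0]
predicted_2014 = [51.0, 54.0, 61.0, 64.0]
error = rmse(observed_2014, predicted_2014)
agreement = pearson(observed_2014, predicted_2014)
```

RMSE is expressed in the same unit as the yields (bu/ac), whereas the correlation is unitless and only measures how well the ranking and trend are preserved.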
KDD’17, Month 8, 2017, Halifax, N.S., Canada.
Copyright 2010 ACM 1-58113-000-0/00/0010 …$15.00.
Table 1. Predicted yields of elite soy variants (2014 class)
Yields are reported in bu/ac
We supplemented soil features provided by Syngenta with
publicly available remote sensing data. Reflectance of vegetation
in multiple spectral bands (vegetation indices) is a good
estimator of crop yield, fruit ripening, biomass, and plant
senescence [11, 18, 25]. Vegetation indices can be calculated from
satellite observations and are conveniently supplied by NASA as Moderate
Resolution Imaging Spectroradiometer (MODIS) Normalized
Difference Vegetation Index (NDVI) data products. Here we use a
smoothed and gap-filled NDVI product generated for the
conterminous US for the period 2012-Jan-01 through 2015-Dec-31.
The data were obtained from the NASA Stennis Time Series
Product Tool (TSPT), which generates smoothed NDVI data from both
the Terra satellite (MODIS MOD13Q1 product) and Aqua
satellite (MODIS MYD13Q1 product) instruments. The data are
provided at a spatial resolution of 250 m and a temporal resolution
of 8 calendar days. NDVI was chosen over other vegetation
indices due to its ability to cancel out a proportion of the variability
caused by changing sun angles, topography, clouds, and various
atmospheric conditions. We merged data from locations
identical in their longitude, latitude, soil, and climatic variables
(six sites). We then focused on sites that were present only in
stage three of the class of 2014; thus NDVI data were collected for
58 locations. For each growing season at each location, we
determined the maximal NDVI value (NDVIMAX, on a 0 to 1 scale),
the timing in the year of the NDVI peak (MAXTIME), and the
sum of all NDVI values (SUMNDVI).
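The three NDVI summary features can be extracted from one site-season time series as sketched below. The function name, the input layout, and the example series are hypothetical, and representing MAXTIME as a day of year is an assumption, since the text only describes it as the timing of the peak within the year.

```python
def ndvi_features(series):
    """Summarize one site-season NDVI series.

    series: list of (day_of_year, ndvi) pairs, NDVI on a 0 to 1 scale.
    """
    if not series:
        raise ValueError("empty NDVI series")
    # The composite with the highest NDVI defines both the peak value and its timing.
    peak_day, peak_value = max(series, key=lambda point: point[1])
    return {
        "NDVIMAX": peak_value,
        "MAXTIME": peak_day,  # assumed to be expressed as day of year
        "SUMNDVI": sum(value for _, value in series),
    }


# Hypothetical composites for one location and growing season.
season = [(121, 0.30), (153, 0.55), (185, 0.80),
          (217, 0.85), (249, 0.60), (281, 0.35)]
features = ndvi_features(season)
```

SUMNDVI acts as a rough proxy for the greenness accumulated over the season, while NDVIMAX and MAXTIME describe the height and timing of the canopy peak.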
Surface temperature and soil moisture can be estimated
successfully from space and provide valuable variables for soy
yield prediction. We obtained daytime and nighttime land
surface temperatures (LST) from MODIS (product MOD11A1).
These data were downloaded at a spatial resolution of 1 km and a
temporal resolution of 8 calendar days for the 58 locations of
interest. Plotting the NDVI and LST time series reveals the
expected seasonal periodicity and the potential of these data
for yield prediction. The potential of
remote sensing to explore correlations between surface temperatures and
biomass production is illustrated in Figure 1, using
data from one location (location 4370) as an example. At
this location, low soy yields in 2012 coincided with high daytime
temperatures and low NDVI scores (Figure 1).
Figure 1: Yields (top), LST, and NDVI (bottom) in location
We extracted several features from the NDVI and LST time series
and performed correlation analyses to select features that correlate
with soy yields (Figure 2). We found a positive correlation of
NDVIMAX with soy yield and a negative correlation of
JULYSUM and AUGSUM with soy yield. In addition, strong
negative correlations of NDVIMAX with JULYSUM and
AUGSUM are noticeable. JULYSUM and AUGSUM are also
strongly correlated with each other. The sensitivity of soy to high
temperature is well known, and we surveyed cumulative monthly
daytime temperatures and minimum nightly temperatures as
potential predictors of soy yield [6,7]. We identified cumulative
daytime temperature in July and August (JULYSUM, AUGSUM)
and monthly minimum nighttime temperatures in May and July
(MAYMIN, JULYMIN) as the most inversely correlated with yields
and selected them as predictive variables in our models.
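A minimal sketch of how the four temperature features could be derived from dated LST observations is given below. The function name, the input layout, and the example temperatures are hypothetical. The sketch assumes that JULYSUM and AUGSUM are plain sums of the available daytime observations in each month and that MAYMIN and JULYMIN are the lowest nighttime observations in each month.

```python
from datetime import date


def lst_features(day_lst, night_lst):
    """Monthly temperature features for one site and year.

    day_lst, night_lst: lists of (date, temperature) observations.
    """
    def month_sum(observations, month):
        return sum(temp for day, temp in observations if day.month == month)

    def month_min(observations, month):
        # Raises ValueError if the month has no observations.
        return min(temp for day, temp in observations if day.month == month)

    return {
        "JULYSUM": month_sum(day_lst, 7),
        "AUGSUM": month_sum(day_lst, 8),
        "MAYMIN": month_min(night_lst, 5),
        "JULYMIN": month_min(night_lst, 7),
    }


# Hypothetical observations for one location in 2012.
daytime = [(date(2012, 7, 3), 30.0), (date(2012, 7, 11), 32.0),
           (date(2012, 8, 4), 31.0), (date(2012, 8, 12), 29.0)]
nighttime = [(date(2012, 5, 8), 8.0), (date(2012, 5, 16), 10.0),
             (date(2012, 7, 3), 18.0), (date(2012, 7, 11), 17.0)]
temperature_features = lst_features(daytime, nighttime)
```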
Figure 2: Correlation plots of remote sensing variables and
Figure 3: Correlation plots of soil variables and soy yield.
We also surveyed the correlations of soil variables with each other
and with soy yields (Figure 3). The strongest correlation was
observed with the irrigation variable. We decided to retain all soil
variables in our models, despite strong correlations of some soil
variables with each other.
Decision tree-based methods are versatile and powerful machine
learning techniques that are well suited for this challenge. Our
preliminary analysis compared random forests and Cubist
models for feature discovery and model tuning in the R
software package. We used the training data set
developed above to compare the predictive potential of the
randomForest and Cubist packages in R. With minimal tuning,
randomForest models achieved lower error rates than Cubist
models on the training data. However, when randomForest models were
cross-validated using data outside of the training set, or when applied to
real test data on the CodaLab portal, they showed poorer predictive
power (lower CodaLab scores) than Cubist models. We postulate that
the randomForest models were overfitting the training data, and we
therefore present here only our Cubist models.
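The check described above, comparing error on the training set with error on data held out from training, can be sketched generically as follows. This is not the randomForest or Cubist comparison itself: the memorizing one-nearest-neighbor learner is a deliberately overfit toy model, and all names and values are hypothetical.

```python
from math import sqrt


def train_and_holdout_rmse(fit, train, holdout):
    """Fit on `train`, then report RMSE on the training and held-out rows.

    fit: callable taking rows [(feature, yield), ...] and returning predict(feature).
    """
    predict = fit(train)

    def score(rows):
        squared = [(y - predict(x)) ** 2 for x, y in rows]
        return sqrt(sum(squared) / len(squared))

    return score(train), score(holdout)


def fit_memorizer(rows):
    """Toy learner that memorizes training rows (1-nearest neighbor)."""
    def predict(x):
        nearest = min(rows, key=lambda row: abs(row[0] - x))
        return nearest[1]
    return predict


# Hypothetical (NDVIMAX, yield) rows; the held-out rows come from another season.
train_rows = [(0.70, 50.0), (0.80, 58.0), (0.85, 61.0)]
holdout_rows = [(0.74, 56.0), (0.90, 60.0)]
train_error, holdout_error = train_and_holdout_rmse(
    fit_memorizer, train_rows, holdout_rows)
```

A model that reproduces its training rows exactly has zero training error, so only the held-out error says anything about how it will behave on new sites or years.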
5. QUANTITATIVE RESULTS
We used the Cubist package in R to generate regression decision
trees for yields in each RM band using 21 predictor variables
(VARIETY_ID, LATITUDE, LONGITUDE, FIPS, AREA,
IRRIGATION, CEC, PH, ORGANIC.MATTER, CLAY,
SILT_TOP, SAND_TOP, AWC_100CM, NDVIMAX,
MAXTIME, SUMNDVI, JULYSUM, AUGSUM, MAYMIN,
JULYMIN, FAMILY). The correlations of these models with the
training data varied from 0.3 to 0.85 (Table 3). Out of fifteen RM
bands with a sufficient number of cases for modeling, we generated
three high quality models (correlation coefficient ~0.85), eight
medium quality models (correlation coefficient ~0.7), and four
models with poor predictive power (correlation coefficient <0.4).
Low quality models
were typically generated when smaller datasets were used for
training (average of 114 cases) and are thus less reliable for yield
prediction. We recommend generating more experimental data for variants in
RM bands with low data concentration (RMs: 2.3, 2.9, 3.1,
3.3 and 3.6) to enhance RM coverage by elite soy variants.
We validated this modeling procedure by applying
it to ‘predict’ the yields of the class of 2013 in
2014 (see section 3, Table 2). This demonstrated the utility of the
‘divide-and-conquer’ procedure to (a) predict yields from soil and
remote sensing data and (b) select consistently high yielding soy variants.
We then used these models to predict the yields of 28 soy variants
grown in 58 locations in 2015 (Table 1, using soil and remote
sensing variables) and obtained high prediction accuracy on the
CodaLab portal (0.45 FMEASURE, 0.99 ACCURACY, 0.47
MATHEWSCC), reflecting the predictive value of our approach.
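The ‘divide-and-conquer’ structure, one model per RM band with predictions routed to the model of the matching band, can be sketched as below. The models in this work were Cubist models fitted in R on 21 predictors. The one-variable least-squares learner here is only a stand-in that keeps the sketch self-contained, and all names and data are hypothetical.

```python
from collections import defaultdict


def fit_linear(rows):
    """Stand-in learner (NOT Cubist): one-variable least-squares line."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx if sxx else 0.0
    return lambda x: mean_y + slope * (x - mean_x)


def fit_rm_ensemble(records, fit=fit_linear):
    """Train one model per RM band from (rm_band, feature, yield) records."""
    by_band = defaultdict(list)
    for rm, x, y in records:
        by_band[rm].append((x, y))
    return {rm: fit(rows) for rm, rows in by_band.items()}


def predict_yield(ensemble, rm, x):
    """Route a prediction to the model trained for the variety's RM band."""
    if rm not in ensemble:
        raise KeyError(f"no model trained for RM band {rm}")
    return ensemble[rm](x)


# Hypothetical (RM band, NDVIMAX, yield) records.
records = [(2.5, 0.70, 50.0), (2.5, 0.80, 56.0), (2.5, 0.90, 62.0),
           (3.0, 0.70, 58.0), (3.0, 0.90, 60.0)]
ensemble = fit_rm_ensemble(records)
```

Keeping the learner as a parameter of `fit_rm_ensemble` means the same routing logic would apply to any per-band model, and a band with too few cases simply fails loudly at prediction time rather than borrowing a model from another band.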
One advantage of using modern decision tree tools is
the interpretable ranking of variables according to their
contribution to the model. Here we list the top four variables in each
RM model (Table 3). Notably, the high variability in ranking
order likely reflects the high heterogeneity among RM
bands and supports our decision to segment the data according to
RM. Grouping RM bands into proximity zones (RM 2.0-2.5, RM 2.6-2.9,
RM 3.0-3.5) drove internal RMSE values down to about 4.
However, the CodaLab scores obtained with the RM zone models
were below the scores obtained with the separate RM models. Future
work will focus on (a) adding further remote sensing variables
to our models to enhance precision and (b) using time-series
prediction methods to extrapolate future NDVI and LST values
from data trends at each location, allowing true predictions of
future yield values.
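For reference, the grouping of RM bands into the three proximity zones mentioned above can be expressed as a simple lookup. The function name and the handling of values outside the listed zones are assumptions.

```python
RM_ZONES = [
    ("RM 2.0-2.5", 2.0, 2.5),
    ("RM 2.6-2.9", 2.6, 2.9),
    ("RM 3.0-3.5", 3.0, 3.5),
]


def rm_zone(rm):
    """Return the proximity zone label for an RM value, or None if outside all zones."""
    rm = round(rm, 1)  # RM bands are reported to one decimal place
    for label, low, high in RM_ZONES:
        if low <= rm <= high:
            return label
    return None
```

Pooling bands this way gives each model more training cases at the cost of mixing varieties adapted to different maturity conditions.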
We thank Syngenta for providing data and funding for this
project, IdeaConnection and AI For Good for organizing this
competition and the R community for developing and maintaining
the tools used in this analysis.
[1] Ainong, L., Shunlin, L., Angsheng, W., Jun, Q. (2007)
Estimating Crop Yield from Multi-temporal Satellite Data
Using Multivariate Regression and Neural Network
Techniques. Photogrammetric Engineering & Remote
Sensing, 10, 1149-1157
[2] Alliprandini, L. F., C. Abatti, P. F. Bertagnolli, J. E.
Cavassim, H. L. Gabe, A. Kurek, M. N. Matsumoto, M. A.
R. de Oliveira, C. Pitol, L. C. Prado, and C. Steckling. 2009.
Understanding Soybean Maturity Groups in Brazil:
Environment, Cultivar Classification, and Stability. Crop Sci.
[3] Al-Kaisi, M.M., Elmore, R.W., Guzman, J.G., Hanna, H.M.,
Hart, C.E., Helmers, M.J., Hodgson, E.W., Lenssen, A.W.,
Mallarino, A.P., Robertson, A.E., and Sawyer, J.E. 2012.
Drought impact on crop production and the soil environment:
2012 experiences from Iowa. Journal of Soil and Water
Conservation. 68, 19 – 24
[4] Altieri, M.A., Nicholls, C.I., Henao, A. et al. Agroecology
and the design of climate change-resilient farming systems
(2015) Agron. Sustain. Dev. 35, 869 – 890.
[5] Das, A., Qi, J. T., Oneto, M., Shere, S. & Ma, Z. Soybean
Yield Prediction: Using Satellite Imagery to Predict
Commodity Yields. (2016).
[6] Djanaguiraman, M., P. V. V. Prasad, D. L. Boyle, and W. T.
Schapaugh. 2013. “Soybean Pollen Anatomy, Viability and
Pod Set under High Temperature Stress.” Journal of
Agronomy and Crop Science 199: 171–77.
[7] Djanaguiraman, M., P. V. Vara Prasad, and W. T. Schapaugh.
2013. “High Day- or Nighttime Temperature Alters Leaf
Assimilation, Reproductive Success, and Phosphatidic Acid
of Pollen Grain in Soybean [Glycine Max (L.) Merr.].” Crop
Science 53(4): 1594–1604.
[8] Doraiswamy, P.C., Hatfield, J.L., Jackson, T.J., Akhmedov,
B., Prueger, J., Stern, A. (2004) Crop condition and yield
simulations using Landsat and MODIS. Remote Sens.
Environ. 92, 548–559
[9] Elizondo, D.A., McClendon, R. W., Hoogenboom, G. (1994)
Neural Network Models for Predicting Flowering and
Physiological Maturity of Soybean. American Society of
Agricultural and Biological Engineers 37(3): 981-988.
[10] Hartman, G.L., Chang, H.X., Leandro, L.F. (2015) Research
advances and management of soybean sudden death
syndrome. Crop Protection. 73, 60 – 66.
[11] Hatfield, J.L., Prueger, J.H., 2010. Value of using different
vegetative indices to quantify agricultural crop characteristics
at different growth stages under varying management
practices. Remote Sens. 2, 562–578.
[12] Henderson Communications LLC “Syngenta Seeds Develops
First Ever Soybean Relative Maturity Map for Canada” (2009)
AgriMarketing Global Hub for Agribusiness
[13] Holzman, M. E., Rivas, R., & Piccolo, M. C. (2014).
Estimating soil moisture and the relationship with crop yield
using surface temperature and vegetation index. International
Journal of Applied Earth Observation and Geoinformation,
[14] Liaw, A. and Wiener, M. (2002). Classification and Regression
by randomForest. R News 2(3), 18-22.
[15] Lobell, David B, and Claudia Tebaldi. 2014. “Getting Caught
with Our Plants down: The Risks of a Global Crop Yield
Slowdown from Climate Trends in the next Two Decades.”
Environmental Research Letters 9(7): 74003.
[16] Maccherone, B. & Frazier, S. MODIS Vegetation Index
Products (NDVI and EVI). Available at:
[17] Matsushita, B., Yang, W., Chen, J., Onda, Y. & Qiu, G.
Sensitivity of the Enhanced Vegetation Index (EVI) and
Normalized Difference Vegetation Index (NDVI) to
Topographic Effects: A Case Study in High-density Cypress
Forest. Sensors 7, 2636–2651 (2007).
[18] Merzlyak, M.N., Gitelson, A.A., Chivkunova, O.B., Rakitin,
V.Y., 1999. Non-destructive optical detection of pigment
changes during leaf senescence and fruit ripening. Physiol.
Plant. 106, 135–141.
[19] Myers, S. S. et al. Climate Change and Global Food Systems:
Potential Impacts on Food Security and Undernutrition.
Annu. Rev. Public Health 38, 259–277 (2017).
[20] Norouzi, M., Ayoubi, S., Jalalian, A., Khademi, H.,
Dehghani, A.A. Predicting rainfed wheat quality and quantity
by artificial neural network using terrain and soil
characteristics (2010) Acta Agriculturae Scandinavica 60,
341 - 352
[21] Powell, Nicola et al. 2012. “Yield Stability for Cereals in a
Changing Climate.” Functional Plant Biology 39(7): 539–52.
[22] Quinlan, J. R. (1992). Learning with continuous classes. In
5th Australian joint conference on artificial intelligence (Vol.
92, pp. 343-348).
[23] R Core Team (2013). R: A language and environment for
statistical computing. R Foundation for Statistical
Computing, Vienna, Austria. URL: http://www.R-
[24] Ren, J., Chen, Z., Zhou, Q., Tang, H. (2008) Regional yield
estimation for winter wheat with MODIS-NDVI data in
Shandong China. Int. J. Appl. Earth Obs. Geoinf, 10, 403–
[25] Yu, N., Li, L., Schmitz, N., Tian, L.F., Greenberg, J.A.,
Diers, B.W., Development of methods to improve soybean
yield estimation and predict plant maturity with an unmanned
aerial vehicle based platform (2016) Remote Sensing
of Environment 187, 97 - 10.
Table 2. Error estimates in the Class of 2013
Yields are reported in bu/ac
Table 3: Evaluation of Soy Yield Models
Top Contributing Variables
organic.matter, augsum, area, julysum
sand_top, silt_top, julysum, ndvimax
clay, pH, irrigation, cec
julysum, cec, sand_top, organic.matter
cec, julysum, ndvimax, maymin