Publisher: Taylor & Francis
Journal of Applied Statistics
Climate, agriculture, and
hunger: statistical prediction of
undernourishment using nonlinear
regression and data-mining techniques
Julie E. Shortridge (a), Stefanie M. Falconi (a), Benjamin F. Zaitchik (b) &
Seth D. Guikema (a)
(a) Department of Geography and Environmental Engineering, Johns
Hopkins University, 3400 N. Charles Street, Ames Hall, Room 511,
Baltimore, MD 21218, USA
(b) Department of Earth and Planetary Sciences, Johns Hopkins
University, 3400 N. Charles Street, 327 Olin Hall, Baltimore, MD
21218, USA
Published online: 16 Apr 2015.
To cite this article: Julie E. Shortridge, Stefanie M. Falconi, Benjamin F. Zaitchik & Seth D.
Guikema (2015): Climate, agriculture, and hunger: statistical prediction of undernourishment
using nonlinear regression and data-mining techniques, Journal of Applied Statistics, DOI:
10.1080/02664763.2015.1032216
To link to this article: http://dx.doi.org/10.1080/02664763.2015.1032216
(Received 14 April 2014; accepted 18 March 2015)
An estimated 1 billion people suffer from hunger worldwide, and climate change, urbanization, and
globalization have the potential to exacerbate this situation. Improved models for predicting food security
are needed to understand these impacts and design interventions. However, food insecurity is the result of
complex interactions between physical and socio-economic factors that can overwhelm linear regression
models. More sophisticated data-mining approaches could provide an effective way to model these rela-
tionships and accurately predict food insecure situations. In this paper, we compare multiple regression
and data-mining methods in their ability to predict the percent of a country’s population that suffers from
undernourishment using widely available predictor variables related to socio-economic settings, agricul-
tural production and trade, and climate conditions. Averaging predictions from multiple models results in
the lowest predictive error and provides an accurate method to predict undernourishment levels. Partial
dependence plots are used to evaluate covariate influence and demonstrate the relationship between food
insecurity and climatic and socio-economic variables. By providing insights into these relationships and
a mechanism for predicting undernourishment using readily available data, statistical models like those
developed here could be a useful tool for those tasked with understanding and addressing food insecurity.
Keywords: food security; hunger; data mining; regression; undernourishment; climate
1. Introduction
The simultaneous occurrence of climate change, globalization, and urbanization is expected to
have dramatic impacts on the global food system. Already, it is estimated that approximately 1
billion people suffer from hunger worldwide [15], and the food price crises of 2008 and 2011
demonstrated how rapidly the food security situation can change in many parts of the world.
*Corresponding author. Emails: jshortr1@jhu.edu; julieshortridge@gmail.com
© 2015 Taylor & Francis
In the near term, development of predictive models with demonstrable accuracy is needed to
identify conditions likely to lead to food insecure situations, support interventions, and reduce
suffering [2]. In the long term, an improved understanding of the physical, cultural, and socio-
economic conditions that lead to food insecurity is needed to better anticipate the effects of
globalization, climate change, and urbanization. Statistical and data-mining techniques have the
potential to be useful tools in both contexts, but have not yet been widely applied to questions of
food security.
Existing efforts to understand how global changes will influence food security have
largely relied on physical models of agricultural output. For example, numerous studies have
explored how climate change will impact yields of specific crops in different growing regions
[6,8,16,24,34,37], often using climate projections as inputs to physically based crop growth mod-
els. Comparatively, the number of empirical evaluations assessing the statistical relationship
between climate variables and crop yields [19,26,27,35] is small. Nevertheless, these studies
provide useful insights into the relationship between climate and agricultural yields in different
regions and serve as an important empirical counterpart to physical models.
However, estimates of agricultural yields do not sufficiently predict the prevalence of human
outcomes related to food insecurity. Large numbers of people throughout the world remain
undernourished, despite a steady increase in per-capita food production [2] and current agricultural
production levels that could provide enough food for the world’s population if all this food was
accessible to those who needed it [23]. Particularly in urban areas, where an increasing percent-
age of the world’s population lives, insufficient food access is likely to be the driving factor of
food insecurity, and this in turn is largely influenced by socio-economic conditions [7]. While
low crop yields can certainly influence food access through increased prices, there are numerous
other factors, such as socio-economic conditions, market integration, and trade policies that com-
plicate this relationship. To understand how changes in agricultural production could influence
human outcomes such as undernourishment, it is important to consider a full range of complex
and interacting factors.
Efforts to incorporate trade and socio-economic conditions into food security projections have
often been based on integrated dynamic models that combine climate scenarios, projections of
agricultural yield, and international trade dynamics to predict outcomes such as food prices and
undernourishment [17,30,31]. These detailed evaluations simulate interactions between trade,
economic development, and crop yields and their effects on food security under different future
scenarios. However, these simulations generally rely on detailed climatic and socio-economic
projections which are subject to considerable uncertainty, and require that physically based crop
growth models – which are generally calibrated using information on plant type, growing condi-
tions, and agricultural practices from a small number of regions – be scaled up and extrapolated
to cover much of the globe.
Statistical assessments can serve as an important complement to these dynamic models, and
can also provide an empirical basis for model structure and parameterization. Yet, statistical anal-
yses of food security outcomes to date have been relatively limited. A number of assessments
have used linear regression to evaluate the relationship between socio-economic factors, climate,
and food security outcomes at the household [1,7,20] and sub-national [25] level. However, these
localized analyses do not consider how external conditions can generally impact food security
through impacts on price and availability. The globalized nature of the food system requires anal-
yses at multiple scales, including international evaluations that can identify drivers and patterns
of food insecurity at a global scale. Furthermore, food aid programs often operate at an interna-
tional level, so being able to identify countries at risk for food insecurity could provide valuable
support to food aid and monitoring programs.
One challenge associated with statistical analysis of global food security is the many complex,
nonlinear relationships that exist between climate, socio-economic conditions, and food security
outcomes. This complexity is likely to confound linear regression approaches and lead to limited
success in developing models with good predictive accuracy. Furthermore, response data on
food security outcomes such as undernourishment rates are often inconsistent with assumptions
regarding parametric distribution and linearity that are required for linear regression approaches.
The evaluations referenced above were able to create linear regression models that fit observed
data reasonably well, but none evaluated the predictive capacity of their model. Demonstrable
predictive capacity is required to support interventions and policy-making, as well as projections
of food security into the future. Because of the complex relationships and interactions between
climate, development, and food security, we hypothesize that non-parametric data-mining
approaches can achieve better predictive accuracy than linear models.
In this paper, we assess the ability of multiple regression and data-mining techniques to
predict country-level prevalence of undernourishment based on a combination of
socio-economic, food production, trade, and climatic covariates. The development of statistical models
with demonstrable predictive accuracy serves two objectives. First, accurate estimates of under-
nourishment based on predictors that are relatively easy to measure could be used to supplement
survey-based estimates of undernourishment, which are resource intensive to produce and gen-
erally conducted on infrequent timescales. The flexible nature of these statistical methods allows
them to be tailored to available information and used in situations with relatively limited data,
which is often the case in developing countries suffering from food insecurity. Second, statisti-
cal models can highlight the empirical relationships that exist between predictor variables and
undernourishment. While these cross-sectional relationships cannot be used to prove causality,
they can be informative in understanding patterns of food security and guiding decisions about
where data collection and research activities should be focused.
This paper includes two evaluations. In the first, we evaluate models that predict undernour-
ishment when previous undernourishment levels are known, whereas the second evaluation does
not include information on previous levels of undernourishment. Through holdout testing of
predictive accuracy, we show that current undernourishment can be estimated quite accurately
using the previous period’s undernourishment alongside other predictor variables. Such models
can be used to supplement survey-based estimates which are generally difficult to obtain and
infrequently available. When prior levels of undernourishment are unknown (e.g. in countries
and territories where these data are unavailable or in simulations of food insecurity far into the
future), models are less accurate than they are when prior undernourishment is known. However,
they still provide significant predictive skill and can be used to approximate undernourishment
using predictors that are relatively easy to measure or model. Additionally, running the two eval-
uations provides comparative insights into which indicators are most important in estimating
incremental changes in undernourishment and which are most important in developing baseline
estimates of undernourishment when these data are not available.
2. Methods
2.1 Data sources
We evaluate national-level food insecurity in 144 countries during the period 2000–2002. The
number of countries included in the study was limited by data availability, which excluded
countries without Food and Agriculture Organization (FAO) food security data and very small
countries without cereal-growing regions defined in the Monthly Irrigated and Rainfed Crop
Areas (MIRCA 2000) data set [32]. FAO estimates of the prevalence of undernourishment were
chosen as a response variable to represent food insecurity in each country. This measurement
represents the percentage of the population whose food consumption is continuously below
a minimum level. It is based on three factors: (1) the amount of food available for human
consumption within the country as estimated in FAO food balance sheets; (2) the level of inequal-
ity in access to that food, based on household surveys of food expenditures and income levels,
or (in cases where food expenditure or income data are unavailable) empirical relationships
with infant mortality; and (3) the estimated minimum nutritional requirements for the popula-
tion based on demographic distribution [14]. The FAO quantifies undernourishment down to a
minimum value of 5%; therefore, 64 of the 144 countries have undernourishment rates simply
reported as <5% in 2000–2002.
Three classes of variables were used as covariates to predict undernourishment. Socio-
economic variables included per-capita gross domestic product (GDP) and the percent of crop
areas equipped for irrigation based on the study of Portmann et al. [32]. In the first evaluation
only, this also included the prevalence of undernourishment for the period 1995–1997 (the last
period in which the FAO undernourishment estimates were available). Food trade and production
variables included categorical variables on the quantity of imports and net trade as a percentage
of total dietary consumption, the change in the Production Index (PIN) relative to 1995–1997,
as well as the percent of protein and fat in dietary consumption. Hydroclimatic variables were
derived using daily and monthly meteorological fields from the Princeton 50-year reanalysis [36],
and associated hydrological fields generated by the Global Land Data Assimilation System ver-
sion 2 (GLDAS-2) simulations [33], which uses Princeton meteorological forcing. Hydroclimatic
data were only extracted for each country’s crop-growing seasons and regions to exclude the
influence of climate conditions during non-growing months and regions. Latitude was included
as a covariate to account for observed differences in development between temperate and tropical
countries [29].
While existing studies on climate and food have generally included temperature and precip-
itation only, we included growing degree days (GDDs), precipitation, relative humidity, actual
evapotranspiration, and soil moisture in an effort to capture a wider range of conditions and
climatic interactions that affect agriculture. GDDs are a cumulative measurement of the heat
accumulation of a plant, which acts as a control on plant development, and are calculated as in
Equation (1):
$$\mathrm{GDD} = \frac{T_{\max} + T_{\min}}{2} - T_{\mathrm{base}}, \tag{1}$$
where $T_{\max}$ and $T_{\min}$ are the daily maximum and minimum temperatures, and $T_{\mathrm{base}}$ is a crop-specific
baseline temperature below which growth does not occur. GDD is constrained to be
non-negative; any day where the average of the maximum and minimum temperature was below
$T_{\mathrm{base}}$ would be calculated as GDD = 0. We used 10°C as $T_{\mathrm{base}}$, which is suggested for most
cereal crops. Relative humidity is a control on water lost through plant transpiration. When
humidity is low, the vapor pressure gradient between plant leaves and the ambient air is stronger
and more water is lost through plant transpiration, resulting in higher water requirements for
plant growth. Evapotranspiration includes evaporation from bare ground or leaf surfaces as well
as plant transpiration, and is sensitive to available water and solar energy and to atmospheric
turbulence. Soil moisture represents the amount of water available for plant growth in a more
sophisticated way than precipitation alone, since only precipitation that infiltrates the soil (and is
not lost through runoff) will be available for crops.
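The GDD calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names and the three-day temperature series are hypothetical, and the sketch follows the text's description of GDD as the average of the daily maximum and minimum temperatures less the base temperature, floored at zero.

```python
def growing_degree_days(t_max, t_min, t_base=10.0):
    """Daily GDD: mean of daily max and min temperature minus the base
    temperature, constrained to be non-negative (temperatures in deg C)."""
    return max(0.0, (t_max + t_min) / 2.0 - t_base)


def accumulated_gdd(daily_temps, t_base=10.0):
    """Sum daily GDD over a sequence of (t_max, t_min) pairs."""
    return sum(growing_degree_days(hi, lo, t_base) for hi, lo in daily_temps)


# Hypothetical three-day series of (t_max, t_min) in deg C.
season = [(24.0, 12.0), (14.0, 4.0), (30.0, 18.0)]
total = accumulated_gdd(season)  # the cool second day contributes zero
```

The default base temperature of 10°C matches the value the text reports for cereal crops; other crops would need a different `t_base`.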
While near-surface relative humidity, evapotranspiration, and soil moisture are all related
through biophysical processes, collinearity between these variables was found not to be sig-
nificant. This is in part because each variable is influenced by multiple factors; for example,
relative humidity is strongly influenced by mesoscale and synoptic scale atmospheric dynam-
ics while evapotranspiration is sensitive to local solar radiation, winds, and temperature along
with atmospheric humidity and soil moisture availability. These covariates also exhibit differ-
ing timescales of variability, as soil moisture responds slowly to changes in weather conditions
while evapotranspiration and relative humidity can respond almost instantaneously. As a result,
each variable contains distinct information on hydrometeorological conditions relevant to crop
production.
Because a cross-sectional evaluation of mean climate conditions would simply measure dif-
ferences between countries (e.g. Bangladesh is warm and wet compared to Canada), we used
anomalies to assess climatic conditions during the response period relative to the prior decade.
Standard deviations were also included to represent the variability in climatic conditions over
the growing season in each country – for example, whether rainfall was characterized by regular
moderate events or heavy storms followed by long dry periods. A summary of the covariates
included as potential predictors of undernourishment is presented in Table 1.
Table 1. Response variable and covariate data summary.
| Variable | Source | Description |
|---|---|---|
| Undernourishment | FAOSTAT | Percentage of country unable to meet caloric dietary requirements in the period of 2000–2002 |
| Previous undernourishment | | |
| Previous period's undernourishment | FAOSTAT | Undernourishment for the period of 1995–1997 (only included in Evaluation 1) |
| Socio-economic and regional | | |
| Per-capita GDP | UNSTATS | Per-capita GDP in US dollars |
| Percent irrigated | MIRCA | Areal percentage of country equipped for irrigation |
| Latitude | Socrata Opendata | Average latitude measurement |
| Food production, trade, and consumption | | |
| Change in PIN | FAOSTAT | Relative level of agricultural production (volumetric) compared to base period (2002–2004), subtracted from the PIN from the previous data period |
| Fat | FAOSTAT | Percentage of calories from fat in consumed food |
| Protein | FAOSTAT | Percentage of calories from protein in consumed food |
| Food import | FAOSTAT | Categorical data on the quantity of food imports as a percentage of total food consumption |
| Food trade | FAOSTAT | Categorical data on the quantity of net food trade as a percentage of total food consumption |
| Average climate anomaly (relative to previous decade) | | |
| Precipitation | Sheffield et al. [36] | Average daily precipitation (mm) |
| Relative humidity | Sheffield et al. [36] | Average daily relative humidity (%) |
| GDD | Sheffield et al. [36] | Average annual accumulated GDDs |
| Evapotranspiration | GLDAS | Average monthly actual evapotranspiration (mm) |
| Soil moisture | GLDAS | Average monthly soil moisture (mm in top meter of soil) |
| Variability of climate | | |
| Precipitation | Sheffield et al. [36] | Standard deviation of daily precipitation (mm) |
| Relative humidity | Sheffield et al. [36] | Standard deviation of daily relative humidity (%) |
| GDD | Sheffield et al. [36] | Standard deviation of daily accumulated GDDs |
| Evapotranspiration | GLDAS | Standard deviation of monthly actual evapotranspiration (mm) |
| Soil moisture | GLDAS | Standard deviation of monthly soil moisture (mm in top meter of soil) |
2.2 Data processing
The majority of data collected from the FAOSTAT database (http://faostat.fao.org/) was
representative of average conditions over the response period (2000–2002) and was not processed
or altered prior to use. The only exception to this was the percentage of fat and protein in the
national diet; this was calculated by dividing FAO estimates of national average consumption
of fat and protein by the average dietary caloric consumption. The per-capita GDP covariate is
based on data for the year 2000 from the UNSTATS database (http://unstats.un.org). The percent
of the country equipped for irrigation was estimated based on MIRCA data [32] on the monthly
total growing area devoted to irrigated and rain fed maize, rice, wheat, and soy. The total growing
area for each crop was calculated as the sum of the growing region in January and July to account
for the existence of two-season crops (primarily wheat). The percent of the country equipped for
irrigation was calculated by dividing the total area devoted to irrigated maize, rice, wheat, and
soy by the total growing area (irrigated and rain fed) for these crops.
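The irrigation calculation in the preceding paragraph reduces to simple arithmetic, sketched below under stated assumptions. The function name, the dictionary layout (crop name mapped to a January and July area pair), and all area values are hypothetical; only the rule itself (sum January and July areas per crop, then divide irrigated area by irrigated plus rain fed area) comes from the text.

```python
def irrigated_fraction(irrigated, rainfed):
    """Share of the total growing area (irrigated plus rain fed) that is
    irrigated. Each argument maps crop -> (january_area, july_area); the
    two months are summed to account for two-season crops."""
    irrigated_total = sum(jan + jul for jan, jul in irrigated.values())
    rainfed_total = sum(jan + jul for jan, jul in rainfed.values())
    total = irrigated_total + rainfed_total
    if total == 0:
        return float("nan")  # no mapped growing area for these crops
    return irrigated_total / total


# Hypothetical growing areas (any consistent area unit).
irrigated = {"maize": (10.0, 30.0), "rice": (0.0, 20.0),
             "wheat": (40.0, 0.0), "soy": (0.0, 0.0)}
rainfed = {"maize": (20.0, 80.0), "rice": (0.0, 0.0),
           "wheat": (100.0, 50.0), "soy": (0.0, 50.0)}
share = irrigated_fraction(irrigated, rainfed)
```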
Precipitation, GDD, and relative humidity were extracted on a daily basis from GLDAS-2
meteorological forcing fields [36] and monthly evapotranspiration and soil moisture were
extracted from 0.25° resolution GLDAS-2 simulations [33] that use the Noah Land Surface
Model [9,12]. Each meteorological and hydrological variable was aggregated to an average
value for the three-year response period as follows:
(1) For each country, gridded data were extracted for each of eight crop-growing regions (irri-
gated and rain fed areas for maize, wheat, soy, and rice). Data were only extracted for months
during the crop’s growing season. This was done to exclude the influence of climate con-
ditions during non-growing months and regions (e.g. mountainous regions and winter in
temperate areas).
(2) For each crop region (e.g. irrigated maize, rain fed wheat, etc.), the daily and monthly
readings from crop-growing months from 2000 to 2002 were averaged to calculate a mea-
surement representative of conditions in the geographical regions and months devoted to
growing that crop during the response period.
(3) For each crop region, the standard deviation of all measurements during 2000–2002 was also
calculated to represent climatic variability during the growing season.
(4) To calculate an anomaly for each crop-growing region, the average value from 2000 to
2002 (from step (2)) was subtracted from the average value during 1990–1999, which was
calculated in the same manner, and then divided by the standard deviation from 1990 to
1999.
(5) Finally, the anomalies for each crop-growing region within the country were combined in a
weighted average based on the total area devoted to each crop. The same was done with the
standard deviation measurements. This resulted in a measurement of the anomaly and stan-
dard deviation of daily or monthly measurements averaged over the crop-growing months
and regions within the country.
(6) For a few large countries (Australia, Brazil, Canada, China, Germany, India, and the USA),
climate data were extracted at a state or provincial level. In these instances, an anomaly
and standard deviation was calculated for each sub-region as described above, and these
measurements were combined in a weighted average for the whole country based on the
total growing region in each sub-region.
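Steps (2) through (5) can be illustrated with a short Python sketch. Several choices here are assumptions rather than details from the text: the function names and numbers are hypothetical, the sample standard deviation is used for the baseline decade, and the anomaly is signed as the response-period mean minus the baseline mean (reading the subtraction order in step (4) literally would flip the sign).

```python
from statistics import mean, stdev


def standardized_anomaly(response_values, baseline_values):
    """Response-period mean relative to the baseline mean, in units of the
    baseline standard deviation (sign convention is an assumption)."""
    return (mean(response_values) - mean(baseline_values)) / stdev(baseline_values)


def area_weighted_mean(values, areas):
    """Combine per-region values into one number, weighting by crop area."""
    return sum(v * a for v, a in zip(values, areas)) / sum(areas)


# Hypothetical growing-season readings for two crop regions in one country.
maize_anomaly = standardized_anomaly([5.0, 7.0], [2.0, 4.0, 6.0])
wheat_anomaly = standardized_anomaly([1.0, 1.0], [1.0, 3.0, 5.0])

# Hypothetical crop areas used as weights.
country_anomaly = area_weighted_mean([maize_anomaly, wheat_anomaly], [300.0, 100.0])
```

The same `area_weighted_mean` helper would apply to the per-region standard deviations from step (3), and again to sub-region values for the large countries handled in step (6).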
Table 2 shows examples of data from a representative group of countries, as well as the
minimum, mean, and maximum values for each covariate. Prior to their inclusion within the models,
each covariate was tested for multicollinearity and correlation with other covariates. While mul-
ticollinearity will not influence the accuracy of model predictions, it can lead to misinterpretation
Table 2. Covariates from selected countries.

| | Units | Algeria | Bolivia | France | Indonesia | Kazakhstan | Mozambique | Max | Min | Mean |
|---|---|---|---|---|---|---|---|---|---|---|
| Prevalence of undernourishment (2000–2002) | % | <5 | 22 | <5 | 15 | 8 | 46 | 59 | <5 | 15.46 |
| Prevalence of undernourishment (1995–1997) | % | <5 | 24 | <5 | 11 | 5 | 47 | 62 | <5 | 16.90 |
| Per-capita GDP | $USD | 1794 | 1011 | 21,828 | 773 | 1223 | 237 | 46544 | 124 | 6181 |
| Extent of irrigation | % | .02 | .03 | .08 | .43 | .03 | .01 | 1.00 | .00 | .23 |
| Average latitude | Degrees | 28 | 17 | 46 | 5 | 48 | 18 | 64 | 1 | 27 |
| Change in PIN | % | 0 | 4 | −2 | −1 | 11 | −4 | 43 | −34 | 1.91 |
| Percent calories from fat | % | .21 | .19 | .42 | .17 | .27 | .18 | .42 | .06 | .25 |
| Percent calories from protein | % | .11 | .11 | .13 | .09 | .12 | .07 | .14 | .07 | .11 |
| Imports as percentage of food consumption | % | 100–150 | 25–50 | 50–100 | 0–25 | 0–25 | 25–50 | NA | NA | NA |
| Net trade as percentage of food consumption | % | <−50 | 0–25 | >50 | 0–25 | >50 | −25–0 | NA | NA | NA |
| GDD anomaly | – | .53 | −0.20 | .05 | .50 | −0.46 | .87 | 3.92 | −1.57 | .64 |
| Evapotranspiration anomaly | – | −0.63 | −0.34 | −0.02 | .23 | −0.12 | .51 | 1.58 | −3.36 | −0.22 |
| Precipitation anomaly | – | −0.83 | −0.19 | 1.64 | .30 | .65 | .39 | 2.98 | −1.35 | .11 |
| Relative humidity anomaly | – | −2.12 | 2.50 | .94 | −0.85 | 1.20 | −0.13 | 2.50 | −3.49 | −0.33 |
| Soil moisture anomaly | – | −1.00 | −0.04 | 1.80 | .47 | 1.99 | .78 | 1.99 | −3.06 | .09 |
| GDD S.D. | GDD | 6.10 | 2.10 | 3.63 | .63 | 3.91 | 1.64 | 8.87 | .43 | 2.89 |
| Evapotranspiration S.D. | mm/month | 14.24 | 12.06 | 27.10 | 16.40 | 21.29 | 24.27 | 41.45 | 3.89 | 18.77 |
| Precipitation S.D. | mm/day | 2.12 | 4.87 | 3.05 | 6.40 | .97 | 4.86 | 29.66 | .66 | 6.39 |
| Relative humidity S.D. | % | 11.26 | 9.16 | 6.56 | 2.49 | 14.71 | 12.76 | 20.80 | 1.88 | 9.66 |
| Soil moisture S.D. | mm/m soil | 28.29 | 36.72 | 20.76 | 28.03 | 27.74 | 42.40 | 83.56 | 3.74 | 32.48 |
Notes: The prevalence of undernourishment is only quantified down to 5%, so countries with undernourishment rates below this are simply shown as having <5% undernourishment.
S.D., standard deviation.
of covariate importance within the model. Collinearity was evaluated by assessing the correla-
tion coefficients between each pair of predictors, as well as the variance inflation factors for each
predictor. This measure indicates the degree to which a predictor can be explained by a linear
combination of all other predictors. All correlation coefficients between variables had an absolute
value less than 0.8, and variance inflation factors were all less than 10; thus all of the variables
above were left in the models.
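A collinearity screen of this kind can be sketched in plain Python. The sketch relies on a standard identity: the variance inflation factor of predictor j equals the j-th diagonal element of the inverse of the predictors' correlation matrix. The function names and the three example columns are hypothetical, and this is an illustration rather than the authors' R code.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def correlation_matrix(columns):
    return [[pearson(ci, cj) for cj in columns] for ci in columns]


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting (adequate for a small,
    well-conditioned matrix)."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def variance_inflation_factors(columns):
    """VIF of each predictor: diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]


# Hypothetical predictor columns (one list per covariate).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
x3 = [5.0, 3.0, 6.0, 1.0, 2.0, 4.0]
vifs = variance_inflation_factors([x1, x2, x3])
flagged = [j for j, v in enumerate(vifs) if v >= 10.0]  # threshold from the text
```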
2.3 Model development and selection
The relationship between undernourishment, climate, and socio-economic conditions is complex.
To find models that could capture the nonlinearities and interactions likely to be present in this
system, we compared the fit and predictive accuracy of multiple regression and data-mining
techniques available in the R software package. Holdout cross-validation was used to compare
the models’ out-of-sample predictive accuracy based on the mean-squared difference between
actual and predicted undernourishment.
Eleven models were tested to compare their out-of-sample predictive accuracy in each eval-
uation (with and without previous undernourishment). The prevalence of undernourishment
followed a log-normal distribution (p-value for log-transformed undernourishment values from
Shapiro test for normality >0.05), so the natural logarithm of undernourishment was used as
the response variable so that parametric models could be compared to non-parametric models.
The previous period’s undernourishment was log-transformed so that it would be on the same
scale as the response variable, and GDP was also log-transformed because this was found to
reduce error in preliminary model evaluation. Model predictions were converted back to percent
undernourished before calculating model error.
The predictive accuracy of each model was assessed using a 100-fold holdout analysis, hold-
ing out a randomly selected 10% of the countries in each iteration. Models were fit using the
non-held out countries and then used to predict undernourishment in the held out countries. In
each iteration, the mean square error (MSE) between predicted and actual undernourishment was
estimated for all held out countries, as well as for only those countries with actual undernourish-
ment above 5% (‘high-undernourishment’ countries). The eleven models evaluated are described
in the following sections.
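The repeated holdout procedure can be sketched as follows. This is a minimal illustration under several assumptions: a one-covariate least-squares line stands in for the eleven models, the data are synthetic, values reported as below 5% are coded as 5, and all names (`fit_line`, `holdout_mse`, `log_gdp`, `under`) are hypothetical. The model is fit on the log scale and predictions are converted back to percent undernourished before the error is computed, as the text describes.

```python
import math
import random


def fit_line(x, y):
    """Least-squares line through (x, y); returns a prediction function.
    Stand-in for any of the candidate models."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    return lambda value: intercept + slope * value


def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def holdout_mse(x, undernourishment, n_iter=100, holdout_frac=0.10, seed=0):
    """Average holdout MSE over all held-out countries and over held-out
    countries with actual undernourishment above 5%."""
    rng = random.Random(seed)
    n = len(x)
    n_hold = max(1, round(holdout_frac * n))
    errors_all, errors_high = [], []
    for _ in range(n_iter):
        held = rng.sample(range(n), n_hold)
        held_set = set(held)
        train = [i for i in range(n) if i not in held_set]
        predict_log = fit_line([x[i] for i in train],
                               [math.log(undernourishment[i]) for i in train])
        actual = [undernourishment[i] for i in held]
        # Back-transform from the log scale before computing the error.
        predicted = [math.exp(predict_log(x[i])) for i in held]
        errors_all.append(mse(actual, predicted))
        high = [(a, p) for a, p in zip(actual, predicted) if a > 5.0]
        if high:
            errors_high.append(mse([a for a, _ in high], [p for _, p in high]))
    mean_all = sum(errors_all) / len(errors_all)
    mean_high = sum(errors_high) / len(errors_high) if errors_high else float("nan")
    return mean_all, mean_high


# Synthetic data: undernourishment falls with log GDP, floored at 5.
log_gdp = [5.0 + 0.25 * k for k in range(20)]
under = [max(5.0, math.exp(6.5 - 0.5 * v) * (1.1 if k % 2 == 0 else 0.9))
         for k, v in enumerate(log_gdp)]
mse_all, mse_high = holdout_mse(log_gdp, under)
```

Fixing the random seed makes the comparison between models repeatable, since every model then sees the same sequence of held-out countries.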
2.3.1 Linear models
Linear regression models assume that the response variable is normally distributed and
homoskedastic. While link functions can be used to develop generalized linear models (GLMs)
when the response variable follows certain non-normal distributions, this cannot address the
inflated number of 5s in our data set that resulted from the FAO not reporting undernourishment
values below 5%. To address this, we used a two-part model based on the study of Guikema and
Quiring [21] that first used a classification tree to predict whether an observation had an
undernourishment rate greater than 5%, and then fit a linear regression model to predict the logarithm
of undernourishment in only those countries predicted to have undernourishment over 5%.
The first stage of the model used the full training data set to fit a binary classification tree
using the standard two-step grow-and-prune method of Breiman et al. [5]. A classification tree
is first grown using binary recursive partitioning on the basis of minimizing node impurity until
a maximum size is reached. This will generally result in a tree that is overfit to the data, so
splitting points are then removed from the tree until it is ‘pruned’ to the size that minimizes model
deviance based on 10-fold cross-validation. The pruned tree is then used to predict whether
observations should be classified as having low (<5%) or high (>5%) undernourishment.
Second, the high (>5%) undernourishment observations are used to fit a regression model
using the method of maximum likelihood that relates the logarithm of undernourishment to the
independent variables:
$$\ln(U_i) = \sum_{j=1}^{m} \beta_j x_{ij} + \varepsilon_i, \tag{2}$$
where $U_i$ is the undernourishment rate in country $i$, $x_{ij}$ is the value of the $j$th covariate in country
$i$, and $\beta_j$ is the linear regression parameter for the $j$th covariate.
Two linear models were developed; the first (GLM1) included all covariates, and the second
(GLM2) used variable removal based on Akaike information criterion minimization.
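The two-part structure can be sketched in Python with heavy simplifications, all of which are assumptions made for illustration. A single-split classifier (a stump chosen by misclassification count) stands in for the grown-and-pruned classification tree, one covariate stands in for the full covariate set, observations classified as low are assigned the 5% reporting floor, and the names and data are hypothetical.

```python
import math


def fit_stump(x, is_high):
    """Single-threshold classifier on one covariate, chosen to minimize
    misclassifications. Simplified stand-in for a pruned tree; needs at
    least two distinct covariate values."""
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2.0
        for high_below in (True, False):
            errors = sum(((xi < threshold) == high_below) != h
                         for xi, h in zip(x, is_high))
            if best is None or errors < best[0]:
                best = (errors, threshold, high_below)
    _, threshold, high_below = best
    return lambda xi: (xi < threshold) == high_below


def fit_ols(x, y):
    """Intercept and slope of the least-squares line."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sxx
    return mean_y - slope * mean_x, slope


def fit_two_part(x, undernourishment, floor=5.0):
    """Stage 1: classify high vs. low. Stage 2: regress log undernourishment
    on the covariate using only the high observations."""
    is_high = [u > floor for u in undernourishment]
    classify_high = fit_stump(x, is_high)
    intercept, slope = fit_ols(
        [xi for xi, h in zip(x, is_high) if h],
        [math.log(u) for u, h in zip(undernourishment, is_high) if h])

    def predict(xi):
        if not classify_high(xi):
            return floor  # assumption: low class is reported at the floor
        return math.exp(intercept + slope * xi)

    return predict


# Hypothetical data: log per-capita GDP and percent undernourished.
log_gdp = [5.0, 5.5, 6.0, 6.5, 7.0, 8.0, 9.0, 10.0]
under = [40.0, 30.0, 22.0, 15.0, 10.0, 5.0, 5.0, 5.0]
predict = fit_two_part(log_gdp, under)
```

Keeping classification and regression as separate stages means the regression is never asked to fit the block of values pinned at 5%, which is the motivation the text gives for the two-part design.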
2.3.2 Generalized additive models
Generalized additive models (GAMs) are a semi-parametric regression approach, where the
response variable is estimated as the sum of smoothing functions applied over the independent
variables as shown in Equation (3) [22]:
\[ \ln(U) = \sum_{j=1}^{m} s_j(x_j). \qquad (3) \]
In the above equation, $s_j(x_j)$ represents a smoothing function applied over each covariate,
allowing the models to capture nonlinear relationships between covariates and the response vari-
able. Smoothing functions are fit using penalized likelihood maximization to prevent overfitting
of the model.
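
For intuition, one common way to write such a penalized fit is shown below. This is a sketch under two assumptions that the text does not state: Gaussian errors (so that penalized likelihood reduces to penalized least squares) and a second-derivative roughness penalty. The smoothing parameters $\lambda_j$ and the sample size $n$ are notation introduced here for illustration.

```latex
\[
\min_{s_1,\dots,s_m} \;
\sum_{i=1}^{n} \Bigl( \ln(U_i) - \sum_{j=1}^{m} s_j(x_{ij}) \Bigr)^{2}
+ \sum_{j=1}^{m} \lambda_j \int \bigl( s_j''(x) \bigr)^{2} \, dx
\]
```

Larger values of $\lambda_j$ push $s_j$ toward a straight line, which is how the penalty guards against overfitting.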
Because GAMs are semi-parametric, we used the same two-stage approach to first classify
each observation as greater or less than 5% undernourished, and then predict those observations
classified as having high undernourishment. The classification tree used in the first stage was
equivalent to the one used for the linear models above. A GAM was then fit to the data with
undernourishment greater than 5% using cubic regression splines smoothed over each covariate
separately [22]. Two GAMs were developed; the first included all covariates (GAM1), while
the second incorporated variable removal by allowing covariate terms within the model to be
penalized to zero if they do not improve model fit (GAM2).
2.3.3 Gaussian mixture models
GMMs are a semi-parametric regression approach that assumes that observations in a data set are
drawn from G underlying Gaussian distributions. The observations therefore represent a mixture
of sub-populations, but the sub-population to which each observation belongs is not known a
priori. Observations are assumed to be random draws from a probability density function that
can be represented as
\[ f(x; \Psi) = \sum_{g=1}^{G} \pi_g f_g(x; \theta), \qquad (4) \]
where $f_g(x; \theta)$ represents the distribution for sub-population $g$ with parameters $\theta$ and $\pi_g$
represents the proportion of observations coming from sub-population $g$ [28].
In each evaluation, GMMs assuming two and three sub-populations were evaluated. Because
of the parametric assumptions implicit in the use of the mixture model, these would ideally be fit
using the same two-step approach used for the GLM and GAM models, where observations with
undernourishment below 5% are first identified using a classification tree, and then the paramet-
ric model is fit to the remaining high-undernourishment values. However, removing the <5%
observations, especially after removing approximately 10% of the samples for out-of-sample
holdout testing, results in a high-dimensional training sample relative to the number of obser-
vations (generally 70–80 observations with 18–19 covariates). The expectation–maximization
(EM) algorithm generally failed to fit mixture models to this data. Because of this issue, the mix-
ture models evaluated were fit over the full training data set, including all <5% observations.
The parameter $\Psi$ of the mixture model was estimated through likelihood maximization using the
EM algorithm [11]. Predictions for testing data sets were estimated as the weighted average of
the predictions from each component model:
\[ \ln(U_i) = \sum_{g=1}^{G} \pi_g \sum_{j=1}^{m} \beta_{gj} x_{ij} + \varepsilon_i. \qquad (5) \]
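
The point prediction in Equation (5) is a weighted average of the component linear predictors. The sketch below shows only that arithmetic, with the error term omitted. The function name, the mixing proportions and the coefficients are all hypothetical, and fitting them with the EM algorithm is not shown.

```python
def mixture_prediction(x_i, weights, betas):
    """Weighted average of component linear predictors for one observation.

    x_i     : covariate values for observation i
    weights : mixing proportions pi_g, one per component (must sum to 1)
    betas   : one coefficient list per component, aligned with x_i
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixing proportions must sum to 1")
    return sum(
        pi_g * sum(b * x for b, x in zip(beta_g, x_i))
        for pi_g, beta_g in zip(weights, betas)
    )


# Hypothetical two-component example; the first covariate is a constant 1.
x_i = [1.0, 2.0]
weights = [0.6, 0.4]
betas = [[1.0, 0.5], [2.0, 0.25]]
log_u = mixture_prediction(x_i, weights, betas)
```

Here the component predictions are 2.0 and 2.5, so the weighted average is 0.6 × 2.0 + 0.4 × 2.5 = 2.2 on the log scale.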
2.3.4 Tree models
Four different regression-tree models were evaluated. The first was a single regression tree (clas-
sification and regression tree [CART]) that was grown and pruned to an optimal size using the
same method from Breiman et al. [5] as the first stage of the two-part linear and GAMs. However,
in this instance, the tree predicts the undernourishment rate, rather than classifying countries as
low or high undernourishment. The tree is grown by minimizing node impurity and then pruned
back to the size that minimizes model deviance based on 10-fold cross-validation.
The second model was a Bayesian additive regression tree (BART) model [10]. This is a sum
of trees model based on the predictions from K small trees:
\[ \ln(U) = \frac{1}{K} \sum_{k=1}^{K} g(x, T_k, M_k) + \varepsilon, \qquad (6) \]
where $g(x, T, M)$ denotes the function that assigns each observation to a terminal node M within
tree T. A prior is imposed over the sum of trees model to keep each of the 200 individual trees
from being overly influential, and posterior samples are generated through Markov Chain Monte
Carlo simulation.
The third regression tree model was a bagged CART (BC) model consisting of 50 regression
trees, each of which is trained on a bootstrap resampled version of the data set [3]:
\[ \ln(U) = \frac{1}{K} \sum_{k=1}^{K} g\left(x_k^{(B)}, T_k^{(B)}, M_k^{(B)}\right), \qquad (7) \]
where K is the number of bootstrapped samples and $g(x^{(B)}, T^{(B)}, M^{(B)})$ represents the function
that assigns each observation to a terminal node M within tree T based on a bootstrapped
resample $x^{(B)}$ of the data set. Bagging (bootstrap-aggregation) can result in lower model
variance and improved accuracy if the individual trees are unbiased, uncorrelated predictors [3].
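
A minimal sketch of bagging follows. It assumes a single covariate and replaces the full regression tree with a depth-one tree (one split), purely to keep the example short. The function names, seed and data are hypothetical.

```python
import random


def fit_stump(x, y):
    """Depth-one regression tree: the single split with lowest squared error."""
    best = None
    cuts = sorted(set(x))
    for a, b in zip(cuts, cuts[1:]):
        cut = (a + b) / 2.0
        left = [yi for xi, yi in zip(x, y) if xi <= cut]
        right = [yi for xi, yi in zip(x, y) if xi > cut]
        mean_l, mean_r = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((v - mean_l) ** 2 for v in left)
               + sum((v - mean_r) ** 2 for v in right))
        if best is None or sse < best[0]:
            best = (sse, cut, mean_l, mean_r)
    if best is None:  # resample contained a single distinct x value
        mean_y = sum(y) / len(y)
        return lambda xi: mean_y
    _, cut, mean_l, mean_r = best
    return lambda xi: mean_l if xi <= cut else mean_r


def fit_bagged(x, y, n_trees=50, seed=0):
    """Average n_trees stumps, each fit to a bootstrap resample of (x, y)."""
    rng = random.Random(seed)
    n = len(x)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        trees.append(fit_stump([x[i] for i in idx], [y[i] for i in idx]))
    return lambda xi: sum(tree(xi) for tree in trees) / len(trees)


# Hypothetical data: log undernourishment drops from 3.0 to 1.0 at x = 5.
x = [float(v) for v in range(10)]
y = [3.0] * 5 + [1.0] * 5
bagged = fit_bagged(x, y)
```

Because each resample places the split slightly differently, the averaged prediction changes more smoothly near the step than any single stump would.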
The final regression tree method employed was a random forest model [4]. Like the BC model,
the random forest model is created by averaging the predictions from individual regression trees
trained on separate bootstrapped resamples of the data. However, the tree-learning algorithm is
modified so that the tree is fit using only a small, randomly selected subset of predictor variables.
This results in reduced correlation between trees. A total of 500 regression trees were used in the
random forest model, and each tree was trained using a randomly selected subset of p covariates,
where p is one-third of the total number of covariates included in the data set.
2.3.5 Ensemble model
This model generated a prediction by averaging the prediction from the highest performing
models listed above. Model averaging methods can improve predictive accuracy by reducing
overall model variance, assuming that the contributing models are unbiased, uncorrelated pre-
dictors [18]. For each evaluation, the models with the lowest predictive (out-of-sample) MSE
were included in the ensemble model.
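
The selection-and-averaging step can be sketched as follows. The number of models retained (`k`) is an assumption, since the text does not give a fixed rule, and the model names and predictions are invented.

```python
def mse(pred, obs):
    """Mean squared error between predictions and observations."""
    return sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs)


def ensemble_predict(model_preds, obs, k):
    """Average the predictions of the k models with the lowest MSE.

    model_preds maps a model name to its out-of-sample predictions.
    """
    ranked = sorted(model_preds, key=lambda name: mse(model_preds[name], obs))
    chosen = ranked[:k]
    averaged = [
        sum(model_preds[name][i] for name in chosen) / len(chosen)
        for i in range(len(obs))
    ]
    return chosen, averaged


# Hypothetical held-out observations and predictions from three models.
observed = [10.0, 20.0, 30.0]
model_preds = {
    "A": [12.0, 18.0, 33.0],
    "B": [9.0, 22.0, 29.0],
    "C": [20.0, 30.0, 40.0],
}
chosen, averaged = ensemble_predict(model_preds, observed, k=2)
```

In this toy case models A and B err in opposite directions, so their average has a lower MSE than either one alone, which is the variance-reduction effect described above.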
2.3.6 Null model
For each evaluation, a null model was included for comparative purposes. In the evaluation
where previous undernourishment was included as a covariate, this model assumed that under-
nourishment remained unchanged from the previous period. In the evaluation without previous
undernourishment, the model used the mean prevalence of undernourishment in the non-held out
countries as the predicted undernourishment value for all of the held out countries, with <5%
values counted as 5%.
3. Results
3.1 Evaluation 1: informed by previous period’s undernourishment
The average and standard deviation of MSEs between model predictions and observed values
from the holdout analysis are shown in Table 3. Among the individual models, the GLMs,
GMM3, CART, and BC all outperform the baseline model based on at least one error metric
(indicated by italics). However, the best performance is from the ensemble model, which aver-
ages the predictions from GLM2, GMM3, CART, and BC. This model has lower predictive error
than the baseline model based on all error metrics, and dominates all other models in terms of
minimizing predictive error. This model also exhibits a lower variance in MSE compared to the
null model, indicating a reduced possibility for instances where MSE is very high.
Paired Wilcoxon p-values were estimated to evaluate the statistical significance of the lower
mean errors obtained for the ensemble model compared to the null model. The uncorrected
p-values for MSE in all countries and high-undernourishment countries only were both below .005.
Bonferroni correction of these values indicates that the ensemble model results in statistically
significant reductions in error from the null model at a significance level of .01.
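
As a worked check of the correction, assume that it is applied over the two comparisons reported here ($k = 2$; the text does not state the number of comparisons explicitly). Each uncorrected p-value is below .005, so the adjusted value satisfies

```latex
\[
p_{\mathrm{adj}} = \min(1,\; k\,p) < 2 \times 0.005 = 0.01 .
\]
```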
Evaluation of the influence of different covariates in the ensemble model was conducted
to assess which variables were most important in predicting undernourishment. It is impor-
tant to recognize that a cross-sectional evaluation such as this does not provide the statistical
Table 3. Mean and standard deviation of MSE for models assessed in Evaluation 1.
              GLM1   GLM2   GAM1   GAM2   GMM2   GMM3   BART   CART   RF     BC     Ensemble  Null
All samples
  MSE mean    22.15  22.14  32.52  33.03  24.24  20.2   29.89  22.32  41.03  24.51  17.96     23.88
  MSE SD      14.32  13.15  25.81  24.87  17.27  11.28  18.77  16.18  29.14  15.42  10.4      15.36
Undernourishment >5% only
  MSE mean    37.24  37.44  54.32  55.51  40.72  33.99  50.29  34.65  66.20  39.23  29.45     40.03
  MSE SD      25.14  25.77  44.52  43.91  29.88  18.96  34.12  24.73  46.88  24.11  17.31     27.9
Notes: Italic results indicate models that outperform the null model based on MSE. The ‘All Samples’ portion of the
table refers to the MSE across all samples in the test data set, assuming an actual undernourishment value of 5% in the
countries with <5% undernourishment reported. The ‘Undernourishment >5% only’ portion of the table refers to the
MSE across only those samples with actual undernourishment above 5%. RF, random forest.
power needed to prove a causal association between covariates and undernourishment. How-
ever, evaluation of covariate influence over model predictions can identify which indicators are
the most valuable in estimating undernourishment (and thus inform data collection activities)
and illuminate correlations that may warrant further assessment.
Assessing covariate influence in a model that consists of parametric and non-parametric com-
ponents presents a number of challenges. While regression coefficients and their corresponding
p-values could be used to assess covariate influence in the GLM, this method is not applicable
to regression tree models. Therefore, partial dependence plots were developed to assess covari-
ate importance and influence in the ensemble model and each of its component models. Partial
dependence plots were developed by fitting the model using all countries, and then measuring
the marginal influence that changing the covariate of interest, while keeping all other covari-
ates equal, has on model predictions. A relatively flat partial dependence plot indicates that the
covariate of interest has little influence over the model’s predictions, while a large change in
response variable values indicates that the covariate has a large degree of influence in model
predictions. This variation was measured for the ensemble model and each of its m component
models by estimating the relative ‘swing’ attributable to each covariate, which consisted of the
range of partial dependence values associated with the covariate of interest, divided by the total
swing over all n covariates in that model (Equation (8)). Table 4 shows the influence of each
covariate in each model based on the relative swing in partial dependence values. High values
indicate that a particular covariate had a large influence on model predictions, whereas values of
zero indicate that a model did not use that covariate at all. Partial dependence plots for the eight
most influential covariates in the ensemble model are shown in Figure 1.
\[ \mathrm{Swing}_{mn} = \frac{\max(PD_{mn}) - \min(PD_{mn})}{\sum_{n} \mathrm{Swing}_{mn}}. \qquad (8) \]
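
The relative swing calculation can be sketched as below. This is an illustration, not the authors' code. It assumes the usual averaging definition of partial dependence (set one covariate to a grid value for every observation, then average the predictions) and divides each covariate's range by the sum of ranges over all covariates. The toy model and data are hypothetical.

```python
def partial_dependence(predict, rows, j, grid):
    """Mean prediction when covariate j is set to each grid value in turn."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)
            modified[j] = value  # all other covariates keep observed values
            total += predict(modified)
        curve.append(total / len(rows))
    return curve


def relative_swing(predict, rows, grids):
    """Range of each partial dependence curve as a share of the total range."""
    ranges = []
    for j, grid in enumerate(grids):
        curve = partial_dependence(predict, rows, j, grid)
        ranges.append(max(curve) - min(curve))
    total = sum(ranges)
    return [r / total if total > 0 else 0.0 for r in ranges]


# Hypothetical model that ignores its third covariate entirely.
def toy_model(row):
    return 3.0 * row[0] + 1.0 * row[1] + 0.0 * row[2]


rows = [[0.0, 0.0, 0.0], [1.0, 2.0, 5.0], [2.0, 4.0, 9.0]]
grids = [[0.0, 1.0, 2.0], [0.0, 2.0, 4.0], [0.0, 5.0, 9.0]]
swing = relative_swing(toy_model, rows, grids)
```

For this toy model the ranges are 6, 4 and 0, giving relative swings of 0.6, 0.4 and 0.0. The unused covariate receives zero influence, matching the interpretation of zero entries in Table 4.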
The previous period’s undernourishment made up 98% and 90% of the total variation in pre-
dictions in the CART and BC models, respectively, whereas the GLM and GMM made use of
Table 4. Relative covariate influence in each model in Evaluation 1 as measured by the range of
partial dependence values for each covariate.
Covariate                                   GLM2  GMM3  CART  BC    Ensemble
Previous period’s undernourishment          .426  .434  .984  .903  .584
Relative humidity variability               .127  .065  .000  .002  .070
Percent of calories obtained from fat       .065  .090  .000  .008  .057
Change in agricultural production           .093  .040  .000  .000  .049
GDD variability                             .063  .061  .000  .000  .045
Food trade                                  .063  .033  .000  .000  .034
Latitude                                    .046  .019  .000  .004  .024
GDD anomaly                                 .051  .013  .000  .000  .024
Gross domestic product                      .000  .051  .000  .024  .022
Percent of country equipped for irrigation  .025  .022  .016  .004  .020
Evapotranspiration variability              .000  .046  .000  .002  .017
Relative humidity anomaly                   .000  .029  .000  .002  .011
Soil moisture anomaly                       .000  .029  .000  .000  .010
Soil moisture variability                   .041  .001  .000  .035  .010
Food imports                                .000  .025  .000  .005  .009
Percent of calories obtained from protein   .000  .019  .000  .001  .007
Precipitation anomaly                       .000  .007  .000  .003  .003
Precipitation variability                   .000  .008  .000  .001  .003
Evapotranspiration anomaly                  .000  .006  .000  .005  .003
Figure 1. Relative covariate influence in the ensemble model from Evaluation 1. Dashed lines represent
95% confidence bounds.
more covariates. Because the ensemble model is an average of these four models, it is still heavily influ-
enced by the previous period’s undernourishment, but is also influenced by covariates related to
food availability (percent of dietary calories from fat, change in agricultural production, and net
food trade) and climate (variability in humidity and GDD and GDD anomaly).
In Figure 1, solid lines show the partial dependence plot using the full data set only, and
the dashed lines show 95% confidence bounds developed by calculating 100 partial dependence
plots using bootstrapped resampling. The results for the ensemble model indicate that a variety
of indicators are important in predicting undernourishment. High previous undernourishment, a
decrease in agricultural production, a high portion of calories from fat, and a negative food trade
balance all led to high predictions of undernourishment. Highly variable humidity levels, low
temperature variability, and warmer than average temperatures were also associated with high
rates of undernourishment.
To further assess the model’s predictive capacity, the ensemble and null models were used
to generate predictions for the prevalence of undernourishment for the period 2006–2008 (the
next period in which the FAO data was available). The error (predicted minus actual under-
nourishment) for each country is shown in Figure 2. Using the ensemble model resulted in an
MSE of 13.69, compared to an MSE of 16.15 using the null model. When only countries with
high levels of undernourishment are considered, the MSE for the ensemble and null models are
26.09 and 30.51, respectively. The predictions generated by the null model also demonstrate a
tendency to overestimate undernourishment, as represented by a skewness coefficient in error of
0.16, compared with 0.04 for the ensemble model. Both the ensemble and null models tended to
overestimate undernourishment in Africa and South America, although errors were smaller in the
ensemble model. However, the ensemble model also tended to overestimate undernourishment
in Southeast Asia compared to the null model.
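
The two error summaries used above can be computed as in the following sketch. The text does not say which skewness estimator was used; the moment coefficient (third central moment divided by the 1.5 power of the second) is assumed here, and the error values are invented.

```python
def mse(errors):
    """Mean squared error from a list of (predicted - actual) errors."""
    return sum(e * e for e in errors) / len(errors)


def skewness(errors):
    """Moment coefficient of skewness: m3 / m2 ** 1.5 (population moments)."""
    n = len(errors)
    mean = sum(errors) / n
    m2 = sum((e - mean) ** 2 for e in errors) / n
    m3 = sum((e - mean) ** 3 for e in errors) / n
    return m3 / m2 ** 1.5


# Hypothetical errors: one large overestimate produces positive skew.
skewed_errors = [-2.0, -1.0, 0.0, 1.0, 7.0]
symmetric_errors = [-2.0, -1.0, 0.0, 1.0, 2.0]

skewed_mse = mse(skewed_errors)
skewed_coef = skewness(skewed_errors)
symmetric_coef = skewness(symmetric_errors)
```

With errors defined as predicted minus actual, a positive coefficient indicates a longer tail of overestimates, while a symmetric error distribution gives a coefficient of zero.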
Finally, the ensemble model was evaluated to assess how it responded to stochastic pertur-
bations in covariate observations. Many of the covariates used to predict undernourishment
are likely to be subject to some degree of measurement and sampling error. For example,
survey-based measurements (such as the percent of diet obtained from fat and protein, as well
as undernourishment prevalence itself) are likely to be based on only a small sample of the over-
all population. Similarly, error in estimates of country-wide climatic conditions can stem from
limited meteorological station data and unaddressed bias in satellite measurements.
To understand how these errors might impact model predictions, 1000 simulations were con-
ducted in which the null and ensemble models (fit to the complete 2000–2002 data set) were used
to generate predictions for 2006–2008 covariate data with random errors. Errors were induced by
assuming that all covariates except for food imports (categorical), food trade (categorical), and
latitude were random Gaussian variables with a mean equal to the observed 2006–2008 value,
and a coefficient of variation equal to 0.2.
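
The perturbation scheme can be sketched as follows. The covariate names and values are hypothetical, and taking the standard deviation as 0.2 times the absolute observed value is an assumption made here so that negative-valued covariates (such as anomalies) still receive a valid standard deviation.

```python
import random


def perturb(observed, fixed, cv=0.2, rng=random):
    """Draw each non-fixed covariate from a Gaussian centred on its
    observed value with standard deviation cv * |observed value|."""
    draw = {}
    for name, value in observed.items():
        if name in fixed:
            draw[name] = value  # categorical covariates and latitude stay put
        else:
            draw[name] = rng.gauss(value, cv * abs(value))
    return draw


def simulate(predict, observed, fixed, n_sims=1000, seed=0):
    """Predictions from n_sims independently perturbed covariate sets."""
    rng = random.Random(seed)
    return [predict(perturb(observed, fixed, rng=rng)) for _ in range(n_sims)]


# Hypothetical country record and a null model that returns the prior value.
observed = {"previous_undernourishment": 20.0, "latitude": -13.0}
fixed = {"latitude"}


def null_model(covariates):
    return covariates["previous_undernourishment"]


draws = simulate(null_model, observed, fixed)
mean_draw = sum(draws) / len(draws)
```

For the null model the draws simply reproduce the Gaussian noise placed on the prior-period value, so they scatter around 20 with a standard deviation near 4.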
Kernel density plots for predicted undernourishment using the ensemble and null mod-
els in six representative countries are shown in Figure 3. One country (Denmark) has an
undernourishment rate below 5%, and the other countries were selected because they had
undernourishment rates roughly equivalent to quartile values for undernourishment in high (>5%)
undernourishment countries. In predictions from the null model, the uncertainty in prior-period
undernourishment directly translates into a Gaussian distribution of model predictions. How-
ever, the distribution of predictions from the ensemble model varies from country to country.
For the low-undernourishment country, the distribution of predictions from the ensemble model
is tightly centered around 5%, exhibiting much lower variance than predictions from the null
model. In the high-undernourishment countries, the model resulted in multi-modal distributions
of predictions. Interestingly, in three countries where the deterministic model predictions over-
estimated undernourishment (Brazil, Philippines, and Malawi), the observed value was found in
the second modes of the distributions. While this points toward the potential utility of generating
Figure 2. Model error in predicting undernourishment in 2006–2008 for the ensemble and null models.
Figure 3. Kernel density plots for ensemble and null model predictions of undernourishment in 2006–2008
assuming random Gaussian errors in covariate measurements. The heavy vertical line shows observed
undernourishment, and the light vertical line shows ensemble model predictions using observed covariate
values.
Table 5. Mean and standard deviation of MSE for models assessed in Evaluation 2.
              GLM1    GLM2    GAM1    GAM2    GMM2    GMM3    BART    CART    RF      BC      Ensemble  Null
All countries
  Mean MSE    134.46  98.83   252.65  166.96  117.37  108.45  95.92   108.08  96.66   86.70   87.15     184.47
  MSE SD      94.93   52.71   459.82  282.07  65.38   53.07   52.47   57.24   57.15   48.09   48.65     72.65
Undernourishment >5% only
  Mean MSE    204.67  153.43  363.77  237.57  179.70  166.90  153.38  161.83  154.00  128.42  137.33    230.50
  MSE SD      147.08  84.47   636.78  383.22  109.60  85.81   85.50   86.18   92.63   71.17   79.41     127.88
Notes: Italic results indicate models that outperform the null model based on a given error metric. The ‘All Samples’
portion of the table refers to the MSE across all samples in the test data set, assuming an actual undernourishment value
of 5% in the countries with <5% undernourishment reported. The ‘Undernourishment >5% only’ portion of the table
refers to the MSE across only those samples with actual undernourishment above 5%.
probabilistic projections of undernourishment by treating predictor covariates as random vari-
ables, quantification of probabilistic model error is hindered by the lack of data on expected
sampling error in the covariates of interest.
3.2 Evaluation 2: no information on previous period’s undernourishment
In Evaluation 2, we developed predictive models that did not rely on previous estimates of under-
nourishment, which may not be available in all countries or situations (such as projections far
into the future). The average and standard deviation of MSEs between model predictions and
observed values from the holdout analysis without information on previous undernourishment
are shown in Table 5. All of the tested models except GAM1 are an improvement over the
baseline model. The ensemble model consisted of the average predictions from GLM2, BART, RF,
and BC and resulted in the lowest predictive error based on both error metrics.
Paired Wilcoxon p-values were estimated to evaluate the statistical significance of the lower
mean errors obtained for the ensemble model compared to the null model. The uncorrected p-
values for MSE in all countries and high-undernourishment countries only are both less than
Table 6. Relative covariate influence in each model in Evaluation 2 as measured by the
range of partial dependence values for each covariate.
Covariate                                   GLM  BC   BART  RF   Ensemble
GDP                                         .48  .51  .25   .31  .41
Precipitation variability                   .29  .02  .04   .03  .12
Latitude                                    .02  .12  .10   .14  .09
Relative humidity anomaly                   .12  .08  .04   .05  .08
Temperature variability                     .06  .06  .06   .07  .06
Food trades                                 .00  .02  .10   .08  .05
Food imports                                .02  .01  .07   .05  .04
Soil moisture variability                   .00  .05  .04   .06  .03
Relative humidity variability               .00  .02  .05   .03  .02
Percent of calories obtained from fat       .00  .01  .04   .04  .02
Evapotranspiration variability              .00  .02  .04   .03  .02
Temperature anomaly                         .00  .01  .05   .02  .02
Soil moisture anomaly                       .00  .01  .03   .02  .01
Percent of calories obtained from protein   .00  .00  .02   .02  .01
Change in agricultural production           .00  .01  .02   .01  .01
Evapotranspiration anomaly                  .02  .02  .01   .02  .01
Precipitation anomaly                       .00  .01  .01   .01  .01
Percent of country equipped for irrigation  .00  .01  .01   .01  .00
Figure 4. Relative covariate influence in the ensemble model in Evaluation 2. Dashed lines represent 95%
confidence bounds.
.005. This indicates that the ensemble model results in statistically significant reductions in error
from the null model at a significance level of .01.
The relative influence of each covariate in determining model predictions based on partial
dependence plots is shown in Table 6. Relative influence of different covariates was generally
consistent between models, with the largest influence coming from GDP. Kendall’s coefficient
of concordance between the four models that comprise the ensemble model was 0.71, indicating
a relatively high amount of agreement in terms of ranked covariate usage. Without information
on previous undernourishment, the models rely more heavily on GDP and latitude, with higher
undernourishment predicted in low GDP, low-latitude countries (Figure 4). Lower undernourish-
ment is associated with a positive trade balance and high levels of food imports into the country.
In terms of climatic variables, high undernourishment was predicted in countries with highly
variable precipitation and soil moisture, a negative relative humidity anomaly, and low levels of
temperature variability.
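
Kendall's coefficient of concordance, used above to compare covariate rankings across models, can be computed as in the sketch below. It implements the standard formula without a tie correction. The influence values in Table 6 contain ties, and how ties were handled is not stated in the text, so this sketch uses small tie-free rankings that are purely hypothetical.

```python
def ranks(values):
    """Rank values from 1 (largest) to n; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def kendalls_w(rank_table):
    """Kendall's W for m rankings of n items (no tie correction).

    W = 12 * S / (m**2 * (n**3 - n)), where S is the sum of squared
    deviations of the per-item rank totals from their mean.
    """
    m = len(rank_table)
    n = len(rank_table[0])
    totals = [sum(row[i] for row in rank_table) for i in range(n)]
    mean_total = m * (n + 1) / 2.0
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12.0 * s / (m ** 2 * (n ** 3 - n))


# Hypothetical rankings of four covariates by three models.
perfect = [[1, 2, 3, 4]] * 3
partial = [[1, 2, 3, 4], [2, 1, 3, 4], [1, 3, 2, 4]]
w_perfect = kendalls_w(perfect)
w_partial = kendalls_w(partial)
```

W equals 1 when all models rank the covariates identically and falls toward 0 as the rankings diverge. The partially agreeing example gives a value of about 0.78.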
4. Discussion
4.1 Predictive capabilities
The results of Evaluation 1 indicate that country-level undernourishment can be accurately pre-
dicted by using previous undernourishment information in combination with covariates related
to socio-economic conditions, food production and trade, and climate. The predictors used in our
model are generally more easily estimated and available more frequently than FAO projections
of undernourishment. By providing a mechanism to estimate undernourishment using frequently
and readily available data, statistical methods such as those developed here could provide a useful
way of estimating and predicting undernourishment when surveyed estimates of food insecurity
are unavailable.
The strong performance of most models in Evaluation 1 indicates that undernourishment is a
persistent phenomenon in most countries. While this persistence is reflected by the large influ-
ence that previous undernourishment had in generating model predictions, inclusion of other
covariates resulted in an approximately 25% reduction in MSE compared to the null model
which assumed that undernourishment remained unchanged from the previous period. When
the ensemble and null models were used to predict undernourishment for the period 2006–2008,
the null model had a tendency to overestimate undernourishment. This is likely due to the fact
that undernourishment tends to decline with time. The average change in undernourishment from
2000–2002 to 2006–2008 was −1.9%, and in countries with undernourishment greater than 5%,
it was −3.9%. However, it is important to recognize that this is not always the case – in that
same period, changes in undernourishment ranged from increases of 5% to decreases of 20%.
Our model predicted changes in undernourishment from 2000–2002 to 2006–2008 ranging from
an increase of 6% to decreases of 10% (mean change =−0.8%), indicating that it can capture
the overall trend of decreasing undernourishment, as well as country-to-country variations on
this trend in a particular year, that the null model cannot.
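
The approximately 25% figure appears consistent with the ensemble and null MSE means reported in Table 3. As a worked check, for all samples and for the high-undernourishment samples respectively:

```latex
\[
\frac{23.88 - 17.96}{23.88} = \frac{5.92}{23.88} \approx 0.248,
\qquad
\frac{40.03 - 29.45}{40.03} = \frac{10.58}{40.03} \approx 0.264 .
\]
```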
4.2 Model structure, covariate influence, and implications for prediction
In both evaluations, using an ensemble model that averaged predictions from multiple high-
performing individual models resulted in the lowest predictive error, as well as the lowest
variance in predictive error. However, there were differences in terms of which individual mod-
els performed the best in each evaluation. When undernourishment was available as a covariate,
many of the highest performing models were relatively simple – the GLM, a single regression
tree based entirely on previous undernourishment, and a bagged regression tree model largely
based on previous undernourishment. Conversely, accurate prediction without information on
previous undernourishment levels required the use of more sophisticated models; instead of a
single regression tree, the ensemble model in Evaluation 2 included a BART and a random forest
model. These models each consist of hundreds of individual regression trees that are coerced to
be uncorrelated predictors of undernourishment. Additionally, the model predictions in Eval-
uation 1 were largely attributable to differences in the previous period’s undernourishment,
whereas influence in Evaluation 2 was distributed among more covariates. These differences may
indicate that predicting undernourishment without previous estimates not only relies on more
covariates, but also requires sophisticated models that can capture complex interactions between
covariates.
When information on previous undernourishment is available, it accounts for the majority
of variation in model predictions. Other covariates that were important in this model included
those related to dietary composition, agricultural production, food trade and imports, and cli-
matic conditions. Not surprisingly, the model predicted highest undernourishment in countries
with a decrease in agricultural production. Somewhat counterintuitively, high undernourishment
was also associated with a high percentage of calories from fat. The climate conditions asso-
ciated with highest undernourishment were high variability in humidity and low variability in
temperature as measured by GDD, as well as high GDD relative to previous years. These results
could point toward the negative impact of such climate conditions on agricultural production, as
well as the presence of regional-level patterns of food insecurity.
The ensemble model in Evaluation 2 was largely informed by development levels represented
by GDP and latitude. While the plot for food trade shows lower undernourishment in coun-
tries that are net exporters of food, the food import plot shows lower undernourishment for
countries with high food imports. Taken together, these two plots could be showing that coun-
tries with more robust food trade (high imports and high exports, with an overall positive food
trade balance) are likely to have lower undernourishment. The climatic conditions associated
with high undernourishment included highly variable hydrologic conditions (precipitation and
soil moisture) and drier than average conditions (negative relative humidity anomaly). How-
ever, agricultural production was not as influential in the second evaluation, indicating that
this covariate might not provide useful information for predicting undernourishment if previ-
ous undernourishment levels are unavailable. Taken as a whole, these results demonstrate that
while physical controls on food production are certainly important, they must be considered in
conjunction with socio-economic influences on food availability and access.
It is important to emphasize that even though the covariates described above were valuable
predictors of undernourishment in our models, this cross-sectional analysis cannot support any
assertions of causal relationships. Developing statistical evidence of causality would require
data-intensive longitudinal analyses, which are likely to prove quite challenging given the limited
amount of long-term time series data related to food security outcomes. However, identification
of the predictors that are most influential in accurately predicting undernourishment can help
focus data collection efforts on information that is most likely to support food security predic-
tions [2], and can also inform the development of systems dynamics models. Moreover, it can
identify overlooked variables that are often neglected from food security studies. For example,
while a number of studies have assessed the influence that precipitation has on crop yields [19,
26,27], our results indicate that other hydrologic variables, particularly relative humidity, evap-
otranspiration, and soil moisture, could be equally important in predicting food insecurity. Our
results also indicated that intra-seasonal variability should be considered alongside average con-
ditions. The predictive capacity achieved by using estimates of intra-seasonal to decadal scale
climate variability suggests that general projections of changes in mean climate and patterns of
variability can usefully inform adaptation measures. These statistical projections are typically
subject to less uncertainty than the detailed, daily projections required for physically based crop
models. Further development of statistical models of food security could provide support for
decision-makers who are unable to make use of complex, data-intensive agricultural models.
4.3 Limitations and areas for model improvement
While our results demonstrate that it is possible to develop statistical estimates of undernour-
ishment using widely available socio-economic and climatic predictors, there are limitations to
the models that could be addressed through additional refinement. Food insecurity is a complex
phenomenon, and no single metric is likely to provide a complete assessment. The prevalence
of undernourishment is subject to uncertainties in the factors used in its calculation, and pro-
vides no information on the severity of undernourishment or variations within a country. Other
measurements of food insecurity, such as those described in [38], could provide further insights.
Adding covariates to the model could improve its predictive ability and identify other factors
that influence food insecurity. Economic covariates related to food prices could provide predic-
tive power for events such as the 2006–2008 price shocks, although projecting such covariates
into the future would introduce additional uncertainty. Additional climate covariates, such as
days above 30°C, rainfall per event, and others described in [13], could capture other conditions
that influence food production.
The models’ predictive power was limited by the global scope of the analysis. We developed
highly generalized, global models in order to characterize widespread empirical relationships
based on national-level data and demonstrate the ability of non-parametric models in capturing
these relationships. This means that the models did not account for regionally specific indica-
tors or variations within countries, and model accuracy may vary between different regions. For
example, the errors shown in Figure 2 indicate that our models outperformed the null model in
generating predictions for South and Central America and Africa, but had higher errors in Asia.
Such differences could be taken into account through the development of customized regional
models that utilize data sets and predictors specific to conditions within a region or other country
grouping of interest.
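The regional comparison described above can be sketched as a grouped error summary. The example below assumes that held-out predictions from both the fitted model and the null model are already available, and it uses mean absolute error as an illustrative metric; the record layout, the function name, and the numbers are hypothetical and are not the values behind Figure 2.

```python
from collections import defaultdict
from statistics import mean


def mae_by_region(records):
    """Return {region: (model_mae, null_mae)}.

    Each record is (region, observed, model_prediction, null_prediction).
    """
    model_err = defaultdict(list)
    null_err = defaultdict(list)
    for region, observed, model_pred, null_pred in records:
        model_err[region].append(abs(observed - model_pred))
        null_err[region].append(abs(observed - null_pred))
    return {r: (mean(model_err[r]), mean(null_err[r])) for r in model_err}


# Hypothetical held-out predictions (percent undernourished).
records = [
    ("Africa", 30.0, 27.0, 20.0),
    ("Africa", 22.0, 24.0, 20.0),
    ("Asia", 15.0, 22.0, 20.0),
    ("Asia", 18.0, 12.0, 20.0),
]

summary = mae_by_region(records)

# Regions where the global model does worse than the null model are
# candidates for a customized regional model.
needs_regional_model = sorted(r for r, (m, n) in summary.items() if m > n)
```

A summary of this kind only flags where a global model underperforms; it does not by itself indicate which regional predictors would close the gap.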
5. Conclusions
Improved models for measurement and prediction of food security are needed to identify
conditions that lead to food-insecure situations and to support interventions. However, the
relationships between food security outcomes and the physical and socio-economic drivers that
influence them are highly complex and likely to exhibit many interactions and nonlinearities,
making development of these models a challenging undertaking. This complexity limits the effectiveness of
linear regression techniques, but more sophisticated data-mining methods have the potential to
be very useful in this regard. In this paper, we demonstrate that data-mining methods can be
used to generate estimates of country-level undernourishment with and without information on
undernourishment rates in a previous time period. When previous undernourishment is known,
model averaging can predict undernourishment quite accurately, capturing the overall trend of
decreasing undernourishment as well as country-to-country variations on this trend in a particular
year. Models that do not consider the previous period’s undernourishment sacrifice some accuracy,
but could be useful in situations or countries where past undernourishment data are unavailable
or in simulations of food security far into the future.
Our results indicate that model averaging leads to the greatest predictive accuracy. When
previous undernourishment is known, the lowest predictive error was obtained by averaging
predictions from a linear model, mixture model, regression tree, and bagged regression tree
model. When previous undernourishment is unknown, the lowest predictive error is obtained
by averaging a linear model, BART, random forest and bagged regression tree model. The
lower error rates obtained through model averaging suggest that these models are unbiased,
uncorrelated predictors. However, one challenge associated with the use of model averaging
is determining covariate influence, as traditional measures (such as regression coefficients and
their corresponding p-values) are not applicable. To see which covariates were driving model
predictions and understand the nature of this influence, we created partial dependence plots
that showed the marginal influence of altering each covariate in the model. These plots indicated
that consideration of both climatic and socio-economic covariates is necessary to accurately
predict undernourishment levels. In particular, our results suggest that anomalies and variability
in GDDs and hydrologic variables such as soil moisture, evapotranspiration, and humidity may
provide more informative estimates of food security outcomes than those based on temperature
and precipitation alone.
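To make the two ideas in this paragraph concrete, the sketch below averages the predictions of several models and then computes a partial dependence curve for the averaged predictor. It is a minimal illustration under stated assumptions: the two component models are hand-written stand-ins (a linear rule and a crude tree-like step rule), the covariate names and coefficients are invented, and none of this reproduces the fitted models or data used in the study.

```python
from statistics import mean
from typing import Callable, List, Sequence

Model = Callable[[Sequence[float]], float]


def average_models(models: List[Model]) -> Model:
    """Combine models by taking the unweighted mean of their predictions."""
    def predict(row: Sequence[float]) -> float:
        return mean([m(row) for m in models])
    return predict


def partial_dependence(model: Model, rows: List[Sequence[float]],
                       col: int, grid: Sequence[float]) -> List[float]:
    """Mean prediction over all rows with covariate `col` fixed at each grid value."""
    curve = []
    for value in grid:
        preds = []
        for row in rows:
            modified = list(row)
            modified[col] = value  # overwrite one covariate, keep the rest
            preds.append(model(modified))
        curve.append(mean(preds))
    return curve


# Hypothetical stand-in models over (gdp_index, soil_moisture_anomaly).
def linear_model(row: Sequence[float]) -> float:
    return 30.0 - 2.0 * row[0] - 4.0 * row[1]


def step_model(row: Sequence[float]) -> float:
    # Crude tree-like rule: one split on the first covariate.
    return 10.0 if row[0] > 5.0 else 26.0


ensemble = average_models([linear_model, step_model])

# Hypothetical covariate rows (not data from the study).
rows = [(2.0, -1.0), (4.0, 0.0), (6.0, 1.0), (8.0, 0.0)]
pd_curve = partial_dependence(ensemble, rows, col=1, grid=[-1.0, 0.0, 1.0])
```

Because the partial dependence routine only needs a prediction function, it applies equally to a single model or to an averaged ensemble, which is why it remains usable when regression coefficients are unavailable. As general background on the averaging step, if M predictors are unbiased and their errors are uncorrelated with common variance σ², the unweighted average has error variance σ²/M; correlated errors reduce this benefit.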
Our findings indicate that statistical methods can accurately predict undernourishment levels
and could serve as a useful supplement to survey-based estimates, which are generally difficult
to obtain and infrequently available. These analyses can also provide insights into covariate
influence, which can inform data collection efforts and systems dynamics models, improve our
understanding of patterns of food insecurity, and identify potential drivers of undernourish-
ment that could be assessed through longitudinal studies. By providing insights into these types
of relationships and a mechanism for estimating undernourishment when survey-based data are
unavailable, statistical models like those developed here could be a useful tool for those tasked
with understanding and addressing food insecurity.
Acknowledgements
We would like to sincerely thank two anonymous reviewers for their thoughtful, informative comments, which resulted in
a number of improvements to this paper. The work and views in this paper are those of the authors and do not necessarily
express the views of the sponsors.
Disclosure statement
No potential conflict of interest was reported by the authors.
Funding
This research was funded by the Global Water Program at Johns Hopkins University and the Faculty for the Future
Fellowship by the Schlumberger Foundation. This funding is gratefully acknowledged.
References
[1] D. Balk, A. Storeygard, M. Levy, J. Gaskell, M. Sharma, and R. Flor, Child hunger in the developing world: An
analysis of environmental and social correlates, Food Policy 30 (2005), pp. 584–611.
[2] C.B. Barrett, Measuring food insecurity, Science 327 (2010), pp. 825–828.
[3] L. Breiman, Bagging predictors, Mach. Learn. 24 (1996), pp. 123–140.
[4] L. Breiman, Random forests, Mach. Learn. 45 (2001), pp. 5–32.
[5] L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and Regression Trees, Wadsworth, New York,
1984.
[6] A.J. Challinor and T.R. Wheeler, Crop yield reduction in the tropics under climate change: Processes and
uncertainties, Agr. Forest Meteorol. 148 (2008), pp. 343–356.
[7] N. Chatterjee, G. Fernandes, and M. Hernandez, Food insecurity in urban poor households in Mumbai, India, Food
Secur. 4 (2012), pp. 619–632.
[8] D.R. Chavas, R. César Izaurralde, A.M. Thomson, and X. Gao, Long-term climate change impacts on agricultural
productivity in Eastern China, Agr. Forest Meteorol. 149 (2009), pp. 1118–1128.
[9] F. Chen, K. Mitchell, J. Schaake, Y.K. Xue, H.L. Pan, V. Koren, Q.Y. Duan, M. Ek, and A. Betts, Modeling of land
surface evaporation by four schemes and comparison with FIFE observations, J. Geophys. Res. 101 (1996), pp.
7251–7268.
[10] H.A. Chipman, E.I. George, and R.E. McCulloch, BART: Bayesian additive regression trees, Ann. Appl. Stat. 4
(2010), pp. 266–298.
[11] A.P. Dempster, N.M. Laird, and D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, J.
Roy. Stat. Soc. B 39 (1977), pp. 1–38.
[12] M.B. Ek, K.E. Mitchell, Y. Lin, E. Rogers, P. Grunmann, V. Koren, G. Gayno, and J.D. Tarpley, Implementation of
Noah land surface model advances in the National Centers for Environmental Prediction operational mesoscale Eta
model, J. Geophys. Res. 108 (2003). Available at http://onlinelibrary.wiley.com/doi/10.1029/2002JD003296/pdf.
[13] P. Ericksen, P. Thornton, A. Notenbaert, L. Cramer, P. Jones, and M. Herrero, Mapping hotspots of cli-
mate change and food insecurity in the global tropics, CCAFS Report no. 5, CGIAR Research Pro-
gram on Climate Change, Agriculture and Food Security (CCAFS), Copenhagen, 2011. Available at
http://ccafs.cgiar.org/resources/climate_hotspots (accessed April 2012).
[14] FAO, Proceedings of the international scientific symposium on measurement and assessment of food deprivation
and undernutrition, Rome, 2003. Available at http://www.fao.org/docrep/005/Y4249E/y4249e00.htm (accessed
April 2012).
[15] FAO, The state of food and agriculture 2010–2011, Rome, 2011. Available at http://www.fao.org/docrep/014/
i2330e/i2330e00.htm (accessed April 2012).
[16] J. Felkner, K. Tazhibayeva, and R. Townsend, Impact of climate change on rice production in Thailand, Am. Econ.
Rev. 99 (2009), pp. 205–210.
[17] G. Fischer, M. Shah, F.N. Tubiello, and H. van Velhuizen, Socio-economic and climate change impacts on
agriculture: An integrated assessment, 1980–2080, Phil. Trans. R. Soc. B 360 (2005), pp. 2067–2083.
[18] J. Friedman, T. Hastie, and R. Tibshirani, Elements of Statistical Learning: Data Mining, Inference and Prediction.
Chapter 8: Model Inference and Averaging, 2nd ed., Springer, Berlin, 2009.
[19] C.C. Funk and M.E. Brown, Declining global per capita agricultural production and warming oceans threaten
food security, Food Secur. 1 (2009), pp. 271–289.
[20] G.G. Gebre, Determinants of food insecurity among households in Addis Ababa city, Ethiopia, Interdiscip. Desc.
Com. Sys. 10 (2012), pp. 159–173.
[21] S.D. Guikema and S.M. Quiring, Hybrid data mining-regression for infrastructure risk assessment based on zero-
inflated data, Reliab. Eng. Syst. Safe 99 (2012), pp. 178–182.
[22] T.J. Hastie and R.J. Tibshirani, Generalized Additive Models, Chapman and Hall, London, 1990.
[23] J. Ingram, A food systems approach to researching food security and its interactions with global environmental
change, Food Secur. 3 (2011), pp. 417–431.
[24] P.G. Jones and P.K. Thornton, The potential impacts of climate change on maize production in Africa and Latin
America in 2055, Global Environ. Chang. 13 (2003), pp. 51–59.
[25] R.E.A. Kahn, T. Azid, and M.U. Toseef, Determinants of food security in rural areas of Pakistan, Int. J. Soc. Econ.
39 (2012), pp. 951–964.
[26] D.B. Lobell, M. Banziger, C. Magorokosho, and B. Vivek, Nonlinear heat effects on African maize as evidenced by
historical yield trials, Nat. Clim. Chang. 1 (2011), pp. 42–45.
[27] D.B. Lobell, W. Schlenker, and J. Costa-Roberts, Climate trends and global crop production since 1980, Science
333 (2011), pp. 616–620.
[28] G.J. McLachlan and K.E. Basford, Mixture Models. Inference and Applications to Clustering, Marcel Dekker, Inc.,
New York, 1988.
[29] W.D. Nordhaus, Geography and macroeconomics: New data and new findings, Proc. Natl. Acad. Sci. 103 (2006),
pp. 3510–3517.
[30] M. Parry, C. Rosenzweig, A. Iglesias, G. Fischer, and M. Livermore, Climate change and world food security: A
new assessment, Global Environ. Chang. 9 (1999), pp. S51–S67.
[31] M.L. Parry, C. Rosenzweig, A. Iglesias, M. Livermore, and G. Fischer, Effects of climate change on global
food production under SRES emissions and socio-economic scenarios, Global Environ. Chang. 14 (2004), pp.
53–67.
[32] F.T. Portmann, S. Siebert, and P. Doll, MIRCA2000 – Global monthly irrigated and rainfed crop areas around the
year 2000: A new high-resolution data set for agricultural and hydrological modeling, Global Biogeochem. Cy. 24
GB1101 (2010), pp. 1–24.
[33] M. Rodell, P.R. Houser, U. Jambor, J. Gottschalck, K. Mitchell, C.-J. Meng, K. Arsenault, B. Cosgrove, J.
Radakovich, M. Bosilovich, J.K. Entin, J.P. Walker, D. Lohmann, and D. Toll, The global land data assimilation
system, B. Am. Meteorol. Soc. 85 (2004), pp. 381–394.
[34] R.P. Rötter, T. Palosuo, N.K. Pirttioja, M. Dubrovsky, T. Salo, S. Fronzek, R. Aikasalo, M. Trnka, A. Ristolainen,
and T.R. Carter, What would happen to barley production in Finland if global warming exceeded 4°C? A
model-based assessment, Eur. J. Agron. 35 (4) (2011), pp. 205–214.
[35] W. Schlenker and M.J. Roberts, Nonlinear temperature effects indicate severe damages to U.S. crop yields under
climate change, Proc. Natl. Acad. Sci. 106 (2009), pp. 15594–15598.
[36] J. Sheffield, G. Goteti, and E.F. Wood, Development of a 50-year high-resolution global dataset of meteorological
forcings for land surface modeling, J. Clim. 19 (2006), pp. 3088–3111.
[37] P.K. Thornton, P.G. Jones, G. Alagarswamy, and J. Andreson, Spatial variation of crop yield response to climate
change in East Africa, Global Environ. Chang. 19 (2009), pp. 54–65.
[38] P. Webb, J. Coates, E.A. Frongillo, B.L. Rogers, A. Swindale, and P. Bilinsky, Measuring household food insecurity:
Why it’s so important and yet so difficult to do, J. Nutr. 136 (2006), pp. 1404S–1408S.