Diversity and Distributions. 2020;26:976–986. wileyonlinelibrary.com/journal/ddi
Received: 28 October 2019 | Revised: 21 February 2020 | Accepted: 21 March 2020
DOI: 10.1111/ddi.13068
BIODIVERSITY RESEARCH
Integrating citizen science data with expert surveys increases
accuracy and spatial extent of species distribution models
Orin J. Robinson1 | Viviana Ruiz-Gutierrez1 | Mark D. Reynolds2 |
Gregory H. Golet3 | Matthew Strimas-Mackey1 | Daniel Fink1
This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium,
provided the original work is properly cited.
© 2020 The Authors. Diversity and Distributions published by John Wiley & Sons Ltd
1Cornell Laboratory of Ornithology, Cornell University, Ithaca, NY, USA
2The Nature Conservancy, San Francisco, CA, USA
3The Nature Conservancy, Chico, CA, USA
Correspondence
Orin J. Robinson, Cornell Laboratory of Ornithology, Cornell University,
159 Sapsucker Woods Road, Ithaca, NY 14850, USA.
Email: ojr7@cornell.edu
Editor: Luigi Maiorano
Abstract
Aim: Information on species’ habitat associations and distributions, across a wide
range of spatial and temporal scales, is a fundamental source of ecological knowl-
edge. However, collecting information at relevant scales is often cost prohibitive, al-
though it is essential for framing the broader context of more focused research and
conservation efforts. Citizen science has been signalled as an increasingly important
source to fill in data gaps where information is needed to make comprehensive and
robust inferences on species distributions. However, there are perceived trade-offs
of combining highly structured, scientific survey data with largely un-structured, citi-
zen science data.
Methods: We explore these trade-offs by applying a simplified approach of filtering
citizen science data to resemble structured survey data and analyse both sources of
data under a common framework. To accomplish this, we integrated high-resolution
survey data on shorebirds in the northern Central Valley of California with observa-
tions in eBird for the entire region that were filtered to improve their quality.
Results: The integration of survey data with the filtered citizen science data resulted
in improved inference and increased the extent and accuracy of distribution models
on shorebirds for the Central Valley. The structured surveys improved the overall ac-
curacy of ecological inference over models using citizen science data only by increas-
ing the representation of data collected from high-quality habitats for shorebirds.
Main conclusions: The practical approach we have shown for data integration can
also be used to improve the efficiency of designing biological surveys in the context
of larger, citizen science monitoring efforts, ultimately reducing the financial and time
expenditures typically required of monitoring programs and focused research. The
simple method we present can be used to integrate other types of data with more
localized efforts, ultimately improving our ecological knowledge on the distribution
and habitat associations of species of conservation concern worldwide.
KEYWORDS
citizen science, data filtering, data integration, data quality, ecological niche model, species
distribution model, structured survey
  
1 | INTRODUCTION
Information on species’ habitat associations and distributions is a
fundamental source of ecological knowledge (Sofaer et al., 2019).
This information is often of interest across a broad range of spa-
tial and temporal scales, from high-resolution information that is
more relevant for research on habitat selection (Matthiopoulos,
Hebblewhite, Aarts, & Fieberg, 2011) or needed to inform management
objectives (Zipkin, Andrew Royle, Dawson, & Bates, 2010) to
larger-scale inferences that are useful to address broader questions
(e.g. potential range shifts with changing climatic conditions; Lyon,
Debinski, & Rangwala, 2019). However, the process of collecting
biological observations across large spatial scales is often cost
prohibitive for most research and monitoring efforts. At best, researchers
and practitioners are able to monitor plant and animal communities
just within their study regions, and during specific times of year.
The need to make inferences beyond the sampled range of
environmental conditions and seasons often limits our understanding
of the broader context of our results and can limit the use of
applied research to inform future monitoring efforts and effective
conservation actions.
Citizen science data have been signalled as a promising source
of information to fill in information gaps needed to model species
distributions (Bradter et al., 2018; Gouraguine et al., 2019). The
collection of citizen science data is growing rapidly, and for a num-
ber of taxa, large databases of observations exist. For example, the
National Moth Recording Scheme in Great Britain has collected
over 11 million observations (Fox et al., 2014). There are many more
taxa for which rich citizen science data sets exist, for example birds
(Sauer et al., 2017; Sullivan et al., 2014), invertebrates (Howard,
Aschen, & Davis, 2010), bats (Newson, Evans, & Gillings, 2015),
whales (Tonachella, Natasi, Kaufman, Maldini, & Rankin, 2012) and
frogs (Westgate et al., 2015) among others. However, there are few
examples of the potential trade-offs of combining observational
data collected at smaller spatial extents with citizen science data
collected across vast extents. This might be due in part to inherent
differences that often exist between the two data types. On the one
hand, structured survey data are collected at a high spatial resolution
by skilled observers, sampling effort is often standardized,
and sampling occurs across a habitat gradient that is representative
of the region of interest. On the other hand, citizen science data
can be collected across a wide range of sampling conditions and
spatial resolutions by observers that vary widely in their level of
expertise, and sampling effort is not
standardized. This has created a perceived trade-off between data
quality and quantity among data collected from structured, scientific
surveys and data collected from larger-scale, volunteer-based mon-
itoring efforts (Figure 1). The assumption being made is that an in-
crease in quantity of citizen science data comes at a significant cost
to quality. Therefore, the logical framework to integrate these two
sources of information would be one that would treat them as inde-
pendent sources of information used to inform a common underlying
distribution for a given species (Pacifici et al., 2017).
The integration of different data sources is a growing area of
methodological development in ecological statistics, and recent
advances have been made to develop ways of integrating survey
data (i.e. structured data) with citizen science data (i.e. un-structured
data) (Miller, Pacifici, Sanderlin, & Reich, 2019). For observational
data collected at discrete locations, these methods include spec-
ifying a joint likelihood for the two data sources to estimate the
underlying species distribution (Miller et al., 2019). In cases where
this is not possible, the data source that is deemed to be of lower
quality (e.g. citizen science data, museum observations) can be
used in two ways: (a) modelled as a covariate of the underlying
distribution or (b) used to estimate a separate species distribution,
where a correlation structure is specified to share information
across data sources. Pacifici et al. (2017) tested these different
approaches to integrate observational data from the citizen sci-
ence project eBird (Sullivan et al., 2014) with more structured
data from the North American Breeding Bird Survey (BBS; Sauer
et al., 2017). Their results showed that the joint-likelihood
approach of combining eBird and BBS data outperformed all other
approaches, including using BBS data alone.
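As a rough sketch of the joint-likelihood idea, and using illustrative notation that is not taken from Miller et al. (2019) or Pacifici et al. (2017): if the structured survey data $y_S$ and the citizen science data $y_C$ are assumed to be conditionally independent given shared distribution parameters $\theta$, with source-specific observation parameters $\phi_S$ and $\phi_C$, the joint likelihood factorizes as a product of the two source likelihoods.

```latex
\[
L(\theta, \phi_S, \phi_C \mid y_S, y_C)
  = L_S(\theta, \phi_S \mid y_S) \times L_C(\theta, \phi_C \mid y_C)
\]
```

Both factors inform the same $\theta$, which is how the two sources share information about the underlying species distribution under this assumption.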
The approach used by Pacifici et al. (2017) and Miller et al.
(2019) summarized observational data at a coarse grid level in
order to account for differences in effort, sampling approach, and
other variables that are known to influence detectability
(Guillera-Arroita, 2017). In addition, Pacifici et al. (2017) wanted to reduce
potential bias related to the degree of uncertainty about the spatial
scale at which observations were collected for each independent eBird
checklist. This mismatch in scales between the two data sources is
what often makes data integration between high-resolution survey
data (e.g. point count observations) and lower-resolution citizen
science data non-trivial. However, many citizen science programs
collect high-resolution information (e.g. camera traps) in ways
that allow absences to be inferred, and collect additional information
on effort (e.g. distance travelled, number of hours sampled) that is highly
valuable for improving the accuracy of species distribution models
(SDMs; Kelling et al., 2019).
FIGURE 1 The perceived trade-off between data quality and
quantity of data collected using structured surveys and citizen
science efforts. This trade-off can be bypassed by processing and
filtering citizen science data using criteria such as count type
(e.g. stationary versus travelling), duration of counts (e.g. all
observations < 30 min) and other measures that align well with
the existing data set. This approach is recommended as the most
flexible data integration method for the application of a broad
range of species distribution models
Here, we explore a practical approach for integrating high-quality
citizen science data with structured survey data that builds upon
existing methods for “data pooling” (e.g. Fithian, Elith, Hastie, &
Keith, 2015). We explore the trade-offs in inference of using citizen
science data alone, more localized and structured data alone, and
both data sets pooled together. In addition, we examine specific
trade-offs when combining structured and un-structured data sources
by comparing the performance gained from increasing the quantity of
citizen science data through simulations with that gained from adding
data from more targeted survey efforts. To accomplish this, we use The
Nature Conservancy's (TNC) BirdReturns project as a case study
(Reynolds et al., 2017). We explore ways of combining high-resolution
bird survey data collected for shorebirds on rice fields in the
northern region of the Central Valley in California with observations
in eBird for the entire Central Valley that are filtered to improve
their quality. The specific aim of the case study is to provide a
framework for leveraging survey data with citizen science data to
build more accurate distribution models for shorebird species across
the extent of the Central Valley.
2 | METHODS
2.1 | Data
We used point counts carried out during spring surveys
(February 1–May 31; n = 8,192) as part of the TNC BirdReturns project
conducted in 2014–2017. This project used predicted shorebird occurrence
and abundance (Johnston et al., 2015) along with predicted surface
water in the Sacramento River Valley to identify times and locations
that were likely to be important for migrating shorebirds. TNC used
a reverse auction approach to select and incentivize rice farmers in
the identified locations to flood their fields during the spring and fall,
making temporary wetlands available to the migrating shorebirds.
Observers made point counts at fields enrolled in the program and at
unenrolled control sites, surveying a semi-circle with a 200 m fixed
radius. Each survey lasted at least two minutes and continued for as
long as necessary to count all birds present (for more detail on count
methods see Golet et al., 2018). Effort information for each point count
consisted of date, time started and ended, and name of observer.
We combined the point counts with data from the citizen science
project eBird (Sullivan et al., 2014) collected during the same time
period as the point counts. The eBird data were restricted to the
Central Valley of California, USA, and to complete checklists so that
non-detection could be inferred (Johnston et al., 2019). We also
restricted the eBird data to stationary checklists and travelling
checklists limited to 300 m. After filtering these data, we were left with
12,891 checklists. Effort variables for the eBird data set were date,
time observations started, duration of observation in minutes, survey
protocol (stationary or travelling), distance travelled in metres
and number of observers.
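The filtering rules described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the `Checklist` fields and the `keep_checklist` helper are hypothetical names, and restriction to the Central Valley and to the survey dates is assumed to have been applied beforehand.

```python
from dataclasses import dataclass


@dataclass
class Checklist:
    # Hypothetical field names chosen for illustration only.
    checklist_id: str
    protocol: str        # "stationary", "travelling", or another protocol
    complete: bool       # True if the observer reported every species detected
    distance_m: float    # distance travelled in metres (0 for stationary)
    duration_min: float  # duration of the observation in minutes


MAX_DISTANCE_M = 300.0  # travelling checklists were limited to 300 m


def keep_checklist(c: Checklist) -> bool:
    """Return True if the checklist resembles a short, fixed-location survey."""
    if not c.complete:
        return False  # non-detection cannot be inferred from incomplete lists
    if c.protocol == "stationary":
        return True
    if c.protocol == "travelling":
        return c.distance_m <= MAX_DISTANCE_M
    return False  # any other protocol is dropped


def filter_checklists(checklists):
    """Keep only the checklists that pass the protocol and completeness rules."""
    return [c for c in checklists if keep_checklist(c)]


demo = [
    Checklist("a", "stationary", True, 0.0, 10),
    Checklist("b", "travelling", True, 250.0, 20),
    Checklist("c", "travelling", True, 1200.0, 45),   # too long a transect
    Checklist("d", "stationary", False, 0.0, 5),      # incomplete checklist
    Checklist("e", "incidental", True, 0.0, 1),       # protocol not retained
]
kept_ids = [c.checklist_id for c in filter_checklists(demo)]
```

In this toy example only checklists `a` and `b` survive the filter.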
We calculated the effort variables from the point counts to
match those of the eBird data set. Observer name in the point count
data set was converted into number of observers (however, it was
always one for this study), time started and ended for the point
counts was used to calculate duration in minutes, and each point
count was treated as a stationary count (distance travelled = 0). By
doing this, the two data sets contained the same effort information
and were identical in structure, which allowed us to simply join them
into one combined data set. We added a variable to the combined
data set to note whether an observation was a TNC point count or
an eBird checklist. We also calculated a checklist calibration index
(CCI) for each checklist in the combined data set to account for vari-
ation in expertise among observers (e.g. expertise score; Johnston,
Fink, Hochachka, & Kelling, 2018). As environmental variables, we
attached the remotely sensed Cropland Data Layer (CDL; Boryan,
Yang, Mueller, & Craig, 2011; Han, Yang, Di, & Mueller, 2012) to the
combined data set by computing the per cent of each land cover or
crop type in the CDL that was present within a 300 m radius centred
on each point count or eBird checklist. We also similarly attached
cloud-filled data from Water Tracker (Reiter, Elliott, Barbaree, &
Moody, 2018), a high spatial and temporal resolution surface water
tracking system for the Central Valley of California.
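The harmonization of effort variables described above might look like the following sketch. All field names (`start`, `end`, `n_observers`, `source`, and so on) are assumptions made for illustration; the checklist calibration index and the land cover and surface water covariates are not computed here.

```python
from datetime import datetime


def point_count_to_checklist(pc: dict) -> dict:
    """Recast one point-count record into an eBird-style effort schema."""
    start = datetime.fromisoformat(pc["start"])
    end = datetime.fromisoformat(pc["end"])
    return {
        "date": start.date().isoformat(),
        "time_started": start.time().isoformat(timespec="minutes"),
        "duration_min": (end - start).total_seconds() / 60.0,
        "protocol": "stationary",  # every point count is treated as stationary
        "distance_m": 0.0,         # so distance travelled is zero
        "n_observers": 1,          # one named observer per count in this study
        "source": "TNC",           # flag identifying the data source
    }


def combine(point_counts, ebird_checklists):
    """Join both sources into one list of records with identical keys."""
    tnc = [point_count_to_checklist(pc) for pc in point_counts]
    ebird = [{**c, "source": "eBird"} for c in ebird_checklists]
    return tnc + ebird


# Hypothetical example records.
pc = {"observer": "A. Observer",
      "start": "2015-03-02T07:30", "end": "2015-03-02T07:36"}
eb = {"date": "2015-03-02", "time_started": "08:10", "duration_min": 15.0,
      "protocol": "travelling", "distance_m": 220.0, "n_observers": 2}
combined = combine([pc], [eb])
```

Because both record types end up with the same keys, downstream steps can treat the combined list as a single data set and still recover the origin of each row from `source`.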
A comprehensive data set was created using all of the above
eBird checklists (12,891) plus simulated eBird checklists equal to
the number of TNC point counts that were included in the combined
data set (8,192). This resulted in a data set that was the same size
(21,083) as the combined data set and allowed us to determine
whether the improvement in accuracy from combining the data sets
was simply a function of increasing the sample size. The simulated
eBird data were created by adding a small amount of noise (via the
jitter function in base R; R Core Team, 2019) to the spatial covariates of
8,192 randomly chosen checklists, similar to the oversampling
procedure in Fink et al. (2019); however, we used all checklists rather than
only positive observations here. This ensured that we were maintaining
roughly the same prevalence rate in the data set and also were
not simply making exact copies of the randomly chosen checklists.
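The simulation step can be illustrated with a small Python stand-in for the jitter idea. This sketch adds uniform noise in [-amount, amount] to each spatial covariate; the explicit `amount`, sampling without replacement, and the record keys are assumptions made here, since the text does not state the settings that were used.

```python
import random


def jitter(values, amount, rng):
    """Add uniform noise in [-amount, amount] to each value."""
    return [v + rng.uniform(-amount, amount) for v in values]


def simulate_checklists(checklists, n_new, covariate_keys, amount, seed=0):
    """Copy n_new randomly chosen checklists and perturb their covariates.

    Detection status is copied unchanged, so prevalence stays roughly the
    same as in the original data.
    """
    rng = random.Random(seed)
    sampled = rng.sample(checklists, n_new)  # assumption: without replacement
    simulated = []
    for c in sampled:
        new = dict(c)  # shallow copy so the original record is untouched
        noisy = jitter([c[k] for k in covariate_keys], amount, rng)
        new.update(zip(covariate_keys, noisy))
        new["simulated"] = True
        simulated.append(new)
    return simulated


# Hypothetical records with two land cover covariates.
base = [{"id": i, "pct_rice": 10.0 * i, "pct_water": 5.0,
         "detected": i % 2 == 0} for i in range(10)]
sim = simulate_checklists(base, 4, ["pct_rice", "pct_water"],
                          amount=0.5, seed=1)
```

Note that the noise is not clipped, so a jittered percentage can fall slightly outside 0–100 in this sketch.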
2.2 | Spatial filtering and class imbalance
As spatial bias is always a concern when using citizen science data
(Geldmann et al., 2016), we spatially subsampled the combined data
set. The data were sparse for many species (proportion of detections
for a species <0.05 for all checklists in the data set), so class
imbalance was also a concern (He & Garcia, 2009). We spatially
undersampled the data (e.g. Robinson, Ruiz-Gutierrez, & Fink, 2017) by
first creating a hexagonal grid of 3.5 km cells (~10 times the radius
of each survey) over the region from which our observations came via
the dggridR package (Barnes et al., 2018) in R (R Core Team, 2019).
This was done to reduce the chance that we selected overlapping
observations at each run of each model. We then split the data for
a single species into checklists on which the species was detected
(positive observations) and those on which the species was not
detected (negative observations). We selected one checklist from the
negative observations from within each grid cell and recombined the
filtered negative observations with the positive observations. As
almost all of the spatial bias was from the negative observations (i.e.
only a small percentage of the total number of observations were
positive observations), this procedure relieves much of the spatial
bias, and because only negative observations were filtered out of the
data set, class imbalance is also addressed here (King & Zeng, 2001;
Robinson et al., 2017).
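The undersampling logic can be sketched as follows. For simplicity this example uses a square grid as a stand-in for the hexagonal dggridR grid, and it assumes that coordinates are already projected in kilometres; the record keys are hypothetical.

```python
import random
from collections import defaultdict

CELL_KM = 3.5  # cell size used in the text


def cell_of(x_km, y_km, cell_km=CELL_KM):
    """Return the index of the square grid cell containing a point."""
    return (int(x_km // cell_km), int(y_km // cell_km))


def spatially_undersample(records, seed=0):
    """Keep every detection and one randomly chosen non-detection per cell."""
    rng = random.Random(seed)
    positives = [r for r in records if r["detected"]]
    negatives_by_cell = defaultdict(list)
    for r in records:
        if not r["detected"]:
            negatives_by_cell[cell_of(r["x_km"], r["y_km"])].append(r)
    negatives = [rng.choice(group) for group in negatives_by_cell.values()]
    return positives + negatives


# Hypothetical records: two detections and four non-detections.
records = [
    {"x_km": 0.5, "y_km": 0.5, "detected": True},
    {"x_km": 1.0, "y_km": 2.0, "detected": True},
    {"x_km": 0.2, "y_km": 0.1, "detected": False},
    {"x_km": 1.5, "y_km": 3.0, "detected": False},
    {"x_km": 3.4, "y_km": 0.9, "detected": False},
    {"x_km": 4.0, "y_km": 1.0, "detected": False},  # falls in a second cell
]
sample = spatially_undersample(records, seed=42)
```

Repeating the call with different seeds gives different non-detections per cell, which is why the procedure is run many times in the study.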
To alleviate the effects of class imbalance, after spatially sam-
pling eBird checklists for training distribution and population trend
models, Fink et al. (2019) oversampled eBird checklists for species
that had a prevalence rate of less than 25%. After spatially under-
sampling our data for the current study, class imbalance was still a
concern, as our undersampling improved the imbalance, but it did
not improve class balance to 25% for species other than Yellowlegs.
Following the recommendation of Fink et al. (2019), we oversampled
the positive observations for each of our species and data sets if
prevalence was below 25% before training the distribution models.
We used the synthetic minority oversampling technique (SMOTE;
Chawla, Bowyer, Hall, & Kegelmeyer, 2002) to create one new ex-
ample of the positive class in the training data set for every positive
observation in the spatially undersampled data set. The SMOTE al-
gorithm does not create exact copies of the positive class as in tra-
ditional oversampling, but instead creates examples of the positive
class that occupy the parameter space between a randomly chosen
positive observation and a nearest neighbour. We did not randomly
undersample our data as is recommended when using SMOTE
(Chawla et al., 2002) because our data had already been spatially
undersampled as described above. The spatial undersampling and
SMOTE procedure were applied to each of the data sets. As the spatial
undersampling randomly chooses from negative observations within
a grid cell, many negative observations may not be included in training
the model. Therefore, we sampled each of the four data sets in
the study using this procedure 100 times, creating 100 unique data
sets. If a data set already met the 25% prevalence criterion described
above, the data were only spatially undersampled to create the
100 unique data sets.
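The core of the SMOTE step can be sketched in plain Python. This is a simplified illustration of the interpolation idea rather than the implementation used in the study: the number of neighbours `k = 5` and the helper names are assumptions, and one synthetic example is created per positive observation, as described above.

```python
import math
import random

PREVALENCE_THRESHOLD = 0.25  # oversample only below this prevalence


def prevalence(labels):
    """Proportion of positive (1) labels."""
    return sum(labels) / len(labels)


def smote(positives, k=5, seed=0):
    """Create one synthetic positive per observed positive.

    Each synthetic point lies on the line segment between a positive
    observation and one of its k nearest positive neighbours.
    """
    if len(positives) < 2:
        return []  # no neighbour to interpolate towards
    rng = random.Random(seed)
    synthetic = []
    for i, x in enumerate(positives):
        others = [p for j, p in enumerate(positives) if j != i]
        others.sort(key=lambda p: math.dist(x, p))  # Euclidean distance
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, neighbour)))
    return synthetic


# Hypothetical covariate vectors for four detections.
positives = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
labels = [1, 1, 1, 1] + [0] * 36
if prevalence(labels) < PREVALENCE_THRESHOLD:
    new_points = smote(positives, k=2, seed=3)
else:
    new_points = []
```

Because each synthetic point is an interpolation, it never duplicates an observed detection exactly unless the random gap is zero.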
2.3 | Analysis
We selected and spatially subsampled (but did not oversample) 15%
of the combined data set to be the testing data for evaluation of
each model. This test set was selected because our goal is to make
accurate predictions across the entire Central Valley and give equal
importance to each location. Therefore, the test set must include
observations across the spatial extent of evaluation and be spatially
balanced. For the training data sets, we removed any checklist or
point count that was in the test set. We repeated this process 100
times creating 100 unique data sets against which our models were
tested.
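A bare-bones version of the repeated hold-out described above is sketched below. The spatial subsampling of the test set is deliberately omitted, and the function name, the use of integer record identifiers, and the seeds are assumptions made for illustration.

```python
import random


def split_once(ids, test_fraction=0.15, seed=0):
    """Hold out a random fraction of records and train on the remainder."""
    ids = list(ids)
    rng = random.Random(seed)
    n_test = round(len(ids) * test_fraction)
    test = set(rng.sample(ids, n_test))
    train = [i for i in ids if i not in test]  # no test record used in training
    return train, sorted(test)


# 100 repeats, each with its own randomly drawn test set.
splits = [split_once(range(200), seed=s) for s in range(100)]
```

Each repeat yields a disjoint pair of training and testing identifiers, so model evaluation never reuses a record that was seen during training.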

TABLE 1 List of species for which we modelled spring distribution in the
Central Valley of California with each of the four data sets

Common Name               Latin Name                 Abbreviation in Tables & Figures
American Avocet           Recurvirostra americana    AMAV
Dunlin                    Calidris alpina            DUNL
Greater Yellowlegsᵃ       Tringa melanoleuca         YLEGᵃ
Least Sandpiper           Calidris minutilla         LESA
Lesser Yellowlegsᵃ        Tringa flavipes            YLEGᵃ
Long-billed Curlew        Numenius americanus        LBCU
Long-billed Dowitcherᵇ    Limnodromus scolopaceus    DOWIᵇ
Short-billed Dowitcherᵇ   Limnodromus griseus        DOWIᵇ
Western Sandpiper         Calidris mauri             WESA

ᵃCombined in analysis and referred to collectively as “Yellowlegs” in this study (e.g. Golet et al., 2018)
ᵇCombined in analysis and referred to collectively as “Dowitcher” in this study (e.g. Golet et al., 2018)

We used the R package “ranger” (Wright, Wager, Probst, &
Maintainer, 2019) to train a random forest model for each of the
seven species (or combined species; Table 1) and for each of four
data sets: (a) TNC point counts alone, (b) eBird checklists alone, (c)
an oversampled eBird data set, and (d) the combined TNC and eBird
data set. For each species, 1,000 trees were grown in the ensemble.
The number of variables from which the model could select at
each split for each tree was initially set to the square root of the
number of variables included in the model (√n ≈ 12, n = 142); however,
we allowed this to vary by half, double and triple this common
rule of thumb. We selected the output from the model that maximized
accuracy (Data S1). We evaluated the accuracy of the models
using multiple predictive performance metrics (PPMs). We did not
evaluate the models using the out of bag (OOB) metric calculated
by random forest internally as it favours total accuracy over cor-
rectly predicting the minority class and is highly sensitive to class
imbalance. We used the test data set to evaluate mean squared error
(MSE) between the model predictions of presence or absence and
the true presence or absence in the test set. We also evaluated error
using Brier score (Brier, 1950), the mean squared error between the
probabilistic model predictions and the true presence or absence in the
test set. We evaluated each model's ability to rank positive observa-
tions higher than negative ones using the area under the curve (AUC;
Fielding & Bell, 1997). We evaluated each model's ability to predict
presence or absence using Cohen's Kappa (Kappa; Cohen, 1960) and
its components, sensitivity (true positive rate), and specificity (true
negative rate). We produced distribution maps for each species and
data set, and we recorded the predictor importance metrics from
models trained on each data set. All variables included in the model
may be found in Data S2.
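The predictive performance metrics named above can be computed from a vector of true presences/absences and a vector of predicted probabilities, as in the sketch below. The 0.5 threshold used to turn probabilities into presence/absence predictions is an assumption made here for illustration, since the text does not state the threshold used, and the example data are invented.

```python
import math


def confusion(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded 0/1."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn


def mean_squared_error(y_true, y_hat):
    """MSE for binary predictions; the Brier score for probabilities."""
    return sum((h - t) ** 2 for t, h in zip(y_true, y_hat)) / len(y_true)


def auc(y_true, prob):
    """Probability that a random presence is ranked above a random absence."""
    pos = [p for t, p in zip(y_true, prob) if t == 1]
    neg = [p for t, p in zip(y_true, prob) if t == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def cohen_kappa(tp, tn, fp, fn):
    """Agreement between predictions and truth, corrected for chance."""
    n = tp + tn + fp + fn
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    if expected == 1.0:
        return math.nan  # kappa is undefined when chance agreement is total
    return (observed - expected) / (1.0 - expected)


# Invented test-set labels and model probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
prob = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1, 0.1]
y_pred = [1 if p >= 0.5 else 0 for p in prob]  # assumed 0.5 threshold

tp, tn, fp, fn = confusion(y_true, y_pred)
metrics = {
    "mse": mean_squared_error(y_true, y_pred),
    "brier": mean_squared_error(y_true, prob),
    "auc": auc(y_true, prob),
    "kappa": cohen_kappa(tp, tn, fp, fn),
    "sensitivity": tp / (tp + fn),
    "specificity": tn / (tn + fp),
}
```

Unlike overall accuracy, Kappa, sensitivity and specificity make it visible when a model performs well only because the majority class (absence) dominates the test set.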
3 | RESULTS
The fields that are stored as part of the eBird project allowed us to
filter the data to be of a similar protocol as the TNC point counts.
This filtering reduced uncertainty in the location of eBird checklists
and eliminated the need for coarse level summaries of the data for
integration (Miller et al., 2019; Pacifici et al., 2017). For all species,
the combined data set had higher predictive accuracy than either the
TNC point counts or the eBird checklist data sets on their own; however,
error (MSE, Brier score) was relatively low and AUC was relatively
high for all species and all data sets, particularly for the three
data sets where eBird checklists were included (Figure 2; Figure S3).
Improvement in accuracy varied among species. For the
Kappa statistic (predicting presence and absence against the test
set), the combined data set was an improvement over all three of
the other data sets evaluated, with the exception of Western Sandpiper
(−3% to 25.5% improvement; Figure 3). The improvement or loss in the
error statistics was usually negligible (with the exception of LBCU;
Figure 4; Figure S4) for the combined data set versus the next best
data set; however, error metrics were already very low for most
species, so great improvement here was not likely. Likewise, AUC
was relatively high for most species and for the three data sets
containing eBird checklists; therefore, the gain was negligible for many
species (improvement of up to 6%; Figure S4). For the few species/metric
combinations where the combined data set was not the best
performing model, it was the second best, with the best being the
eBird checklists with simulated eBird checklists added. Neither TNC
point counts nor eBird checklists alone performed better for any
metric than the data set where the two were combined.
FIGURE 2 Box plots for MSE, AUC, Brier score, Cohen's Kappa, Sensitivity and Specificity for models run for Least Sandpiper (LESA) with
each data set: TNC point counts (TNC), eBird checklists (eBird), eBird checklists with additional simulated eBird data (eBird + sim), and the
data set of TNC point counts and eBird checklists combined (Combined)
We produced distribution maps for each species (Figure 5;
Figure S5) to determine whether there was a difference in the overall
pattern of distribution estimated by models trained on the different
data sets. For most species, there was greater contrast
between the presence estimates and absence estimates for the
combined data set when compared to the other three. Visually,
this means more obvious differentiation between high predicted
probability of presence and absence (e.g. hotter “hotspots” and
darker regions where absence is predicted; Figure 5; Figure S5).
We collected the important variables identified by the model (via
Gini index) when run with each data set. For all species, the
importance of rice and water was apparent, as they were among the most
important variables for each of the data sets; however, it was not
until the data sets were combined that both rice and the Water
Tracker layer had a high importance score (Figure 6; Figure S6). The
combination of the data allowed the models to home in on rice and
surface water as highly important, where they had only moderate
importance comparatively when using the other data sets. This is
likely because only 1.5% of eBird checklists in our study come from
locations where the per cent landcover is >50% rice. Conversely,
almost 60% of the TNC point counts come from locations where
the per cent landcover is >50% rice.
FIGURE 3 Average improvement in
Cohen's Kappa for distribution models of
each species when using the combined
data set versus the data set that provided
the next highest Kappa value
FIGURE 4 Average improvement or
loss in Brier score for distribution models
of each species when using the combined
data set versus the data set that provided
the next lowest value
4 | DISCUSSION
Our results lend further support to efforts looking to combine data
from multiple sources to improve the inference and/or predictive
ability of distribution models (Miller et al., 2019; Pacifici et al., 2017). We
have shown that citizen science data can be filtered to generate a
high-quality data set that can closely match the resolution and sampling
approach of structured surveys, supporting the call for current and
future citizen science projects to collect essential information related to
location and effort, as well as complete surveys (Kelling et al., 2019).
The integration of survey data with the filtered citizen science data in
eBird resulted in improved inference and predictive ability, and ultimately
increased the extent of inference of the structured surveys. In turn,
the structured surveys were able to improve the ecological inference
of the citizen science data by improving the representation of sampled
habitats that are key for shorebird species. Most importantly, the
practical approach we have shown for data integration is an improvement
on simpler “data pooling” approaches for data integration and
can be used to improve the efficiency of designing biological surveys
to collect distribution information in the context of larger, citizen
science monitoring efforts, ultimately reducing the financial and time
expenditures typically required of monitoring programs and focused
research (Reich, Pacifici, & Stallings, 2018).
The combined data set resulted in improved accuracy across all
metrics relative to the TNC survey data or eBird data alone, for all
of the species considered in this study. Our combined approach also
predicted presence/absence via agreement with a test data set more
accurately than the other data sets considered
(i.e. TNC alone, eBird alone and eBird plus simulated eBird). The
observed improvement in accuracy is not likely a function of simply
increasing sample size, as augmenting eBird data with simulated data
did not show the same levels of improvement. The addition of the
TNC survey data to eBird data improved the coverage of the data
both spatially and in targeted habitats. Previous work has shown that
migrating shorebirds heavily use flooded rice fields in the Central
Valley, and will also use unflooded rice fields (Elphick & Oring, 1998;
Golet et al., 2018). Rice fields are the main focal habitat of the TNC
BirdReturns program, and ~60% of the surveys from this project were
carried out at sites where at least 50% of the landcover is rice. In
contrast, the majority of the rice fields in the Central Valley are not
accessible to regular eBird participants, as they are often on private
lands. This is likely why we see so few (~1.5%) eBird checklists from
the Central Valley in locations where at least 50% of the landcover
is rice. The addition of more data from high-quality habitat is what
provided the improvement in accuracy of the combined eBird and
TNC data set. This highlights the importance and value of more
targeted research and survey efforts within the context of large-scale
citizen science monitoring efforts. Given that private lands make up
more than half of the land in the United States, supporting wildlife
monitoring efforts on privately held lands that are linked into large-
scale efforts such as eBird can greatly improve inferences on species
distributions and habitat associations across scales of interest for all
stakeholders (Hilty & Merenlender, 2003).
Interestingly, models based only on the TNC point counts strug-
gled to learn where each species was likely to be absent across the
entire Central Valley because the data came largely from “good”
shorebird habitat in the northern portion of the Central Valley. The
original intent of this project was not focused on species distribution
modelling, so this does not come as a surprise. However, because
absence information allows for more accurate distribution models
(Brotons, Thuiller, Araújo, & Hirzel, 2004), adding eBird data
provided information on where species are likely to be absent and
improved inferences on habitat types that benefit shorebirds but
were not surveyed as part of the TNC monitoring efforts. On the
other hand, the models using the eBird data alone
were able to predict absences well and had relatively high accuracy
when predicting presence/absence, but overall ecological inference
was improved when combined with the point count data set. The
TNC point counts acted as targeted surveys in under-surveyed hab-
itat, which previous work has shown can improve the accuracy of
distribution models using eBird checklists (e.g. Xue, Davies, Fink,
Wood, & Gomes, 2016). The complementary nature of the data sets
is also shown when examining the important predictors. For Dunlin
(Figure 6), the rice land cover and the Water Tracker layer are important
predictors for each data set; however, their importance values
more than double when the combined data set is used to train the
model. While other species do not show such a drastic shift in the
importance value for these two habitat variables, the pattern is sim-
ilar for most in that these variables become more important to the
model's predictions when the data sets are combined.
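For readers unfamiliar with the importance values in Figure 6: Gini importance for a variable accumulates the decreases in Gini impurity produced by splits on that variable across the trees of a random forest. The sketch below computes that decrease for a single candidate split on toy data. It is only an illustration of the quantity being summed, not the authors' workflow; the helper names and the example values are hypothetical.

```python
def gini_impurity(labels):
    """Gini impurity of a list of 0/1 detection labels: 1 - sum of p_k squared."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 1.0 - (p ** 2 + (1.0 - p) ** 2)


def impurity_decrease(values, labels, cutoff):
    """Weighted drop in Gini impurity from splitting on `values <= cutoff`."""
    left = [y for x, y in zip(values, labels) if x <= cutoff]
    right = [y for x, y in zip(values, labels) if x > cutoff]
    n = len(labels)
    children = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(labels) - children


# Toy data (made up): detections line up with rice cover, not with the noise.
detected = [0, 0, 0, 1, 1, 1]
rice_cover = [0.0, 0.1, 0.2, 0.6, 0.7, 0.9]
noise = [0.0, 0.6, 0.1, 0.7, 0.2, 0.9]

informative = impurity_decrease(rice_cover, detected, cutoff=0.5)
uninformative = impurity_decrease(noise, detected, cutoff=0.5)
print(informative, uninformative)
```

A variable that separates detections from non-detections cleanly yields a large decrease, so adding data from habitats where a predictor is most informative can raise its importance, which is consistent with the pattern described above.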
The approach we present for combining data is applicable for
other conservation monitoring programs and ongoing research ef-
forts, given that a significant amount of research is conducted over
small spatial and temporal scales (Heidorn, 2008), and is often difficult
to scale up without similarly structured data (Poisot, Bruneau,
Gonzalez, Gravel, & Peres-Neto, 2019). The data fields that exist in
eBird data facilitated further processing and filtering to match the
structure of the individual data set of interest and allowed us to
leverage the strengths of both data sources. However, even citizen
science data sets that do collect effort information are often lacking
information on sampling locations, although incentivizing partici-
pants to collect data in these data-poor locations has been shown
to improve the accuracy of distribution models (e.g. avicaching; Xue
et al., 2016). Similarly, data from smaller-scale, individual research
projects can also help fill in these gaps in citizen science data. Given
that inferences from small-scale studies cannot be extrapolated to
larger spatial extents (Sandel & Smith, 2009), the approach we present
here for combining data from small-scale studies with citizen science
data filtered to match the existing data structure will increase
the overall extent of inference, and improve our ability to concep-
tualize conservation actions within the larger context of the target
population(s) of interest.
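The filtering step described in this paragraph, in which citizen science checklists are restricted so that they resemble the structured survey before the two sources are pooled, might be sketched as follows. The field names (`duration_min`, `distance_km`, `complete`) and the thresholds are illustrative assumptions only; they are not the eBird schema or the filtering criteria used in this study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checklist:
    checklist_id: str
    duration_min: float   # survey effort in minutes (hypothetical field)
    distance_km: float    # distance travelled while observing (hypothetical)
    complete: bool        # True if all detected species were reported


def filter_to_survey_structure(checklists, max_duration_min=60.0,
                               max_distance_km=0.3):
    """Keep complete checklists whose effort resembles a short point count.

    The default limits are placeholders chosen for illustration.
    """
    return [
        c for c in checklists
        if c.complete
        and 0 < c.duration_min <= max_duration_min
        and c.distance_km <= max_distance_km
    ]


# Toy records (made up), not real checklists.
ebird = [
    Checklist("E1", 45, 0.1, True),    # kept
    Checklist("E2", 180, 5.0, True),   # dropped: too long and too far
    Checklist("E3", 30, 0.0, False),   # dropped: incomplete checklist
    Checklist("E4", 60, 0.3, True),    # kept: exactly at both limits
]
point_counts = [Checklist("P1", 10, 0.0, True)]

# Pool the structured survey with the filtered citizen science records.
combined = point_counts + filter_to_survey_structure(ebird)
print([c.checklist_id for c in combined])
```

Keeping the effort fields on every record, rather than discarding them after filtering, lets a model use effort as a covariate for both data sources.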
Information on species distributions across large scales is one of
the most fundamental information needs for basic and applied re-
search fields in ecology. However, this level of information often re-
quires large-scale, coordinated surveys that can be time consuming
  
FIGURE 5 Predicted probability of an expert surveyor detecting a Dunlin in the spring of 2016 on a one-hour long checklist where the
distance travelled was <300 m, as estimated by the model using TNC point counts alone (A), eBird checklists alone (B), eBird checklists with
added simulated eBird checklists (C), and the combined data set of TNC point counts and eBird checklists (D)
and costly to manage. In addition, models used to estimate species
distributions are often data hungry, and are often unable to generate
information at the spatial and temporal scales that are most relevant
for research and conservation efforts. Citizen science data is a growing
source of additional, minimal cost surveys for a number of taxa
including invertebrates, bats, marine mammals, lichens, amphibians
and more (Deutsch, Bilenca, & Agostini, 2017; Devictor, Whittaker,
& Beltrame, 2010; Howard et al., 2010; Newson et al., 2015;
Tonachella et al., 2012; Westgate et al., 2015). We have shown the
utility of combining survey data with semi-structured citizen science
data (Kelling et al., 2019) for improving accuracy in species distribu-
tion models, which can result in more efficient and cost-effective
surveys (Miller et al., 2019; Pacifici et al., 2017; Reich et al., 2018).
The simple method we present for citizen science data allows for the
integration of a small-scale point count data set with eBird checklists
and can be used to integrate similar types of data being collected
by citizen scientists (e.g. camera traps) with more localized efforts
(e.g. patrolling by park rangers), ultimately improving our ecological
knowledge on the distribution and habitat associations of species of
conservation concern worldwide.
ACKNOWLEDGEMENTS
We would like to thank Point Blue for supplying Central Valley water
data. We would also like to thank Tom Auer, Ali Johnston, Steve
Kelling and Matt Reiter for comments that improved this project and
the manuscript.
DATA AVAILABILITY STATEMENT
The Nature Conservancy point count data are available from the
Dryad Digital Repository: https://doi.org/10.5061/dryad.724vn
The eBird data used for analyses in this manuscript may be
downloaded from https://ebird.org/data/download
Cropland Data Layer may be downloaded from: https://www.nass.usda.gov/Research_and_Science/Cropland/Release/index.php
Water Tracker Data may be downloaded from Point Blue: https://data.pointblue.org/apps/autowater/
ORCID
Orin J. Robinson https://orcid.org/0000-0001-8935-1242
REFERENCES
Barnes, R., Sahr, K., Evenden, G., Johnson, A., Warmerdam, F., &
Maintainer, ]. (2018). Package “dggridR” Type Package Title Discrete
Global Grids. Retrieved from https://github.com/r-barnes/dggridR/.
Boryan, C., Yang, Z., Mueller, R., & Craig, M. (2011). Monitoring US agriculture:
The US department of agriculture, national agricultural statistics
service, cropland data layer program. Geocarto International,
26(5), 341–358. https://doi.org/10.1080/10106049.2011.562309
Bradter, U., Mair, L., Jönsson, M., Knape, J., Singer, A., & Snäll, T.
(2018). Can opportunistically collected Citizen Science data fill a
data gap for habitat suitability models of less common species?
Methods in Ecology and Evolution, 9(7), 1667–1678. https://doi.org/10.1111/2041-210X.13012
Brier, G. W. (1950). Verification of forecasts expressed in terms of probability.
Monthly Weather Review, 78(1), 1–3. https://doi.org/10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2
Brotons, L., Thuiller, W., Araújo, M. B., & Hirzel, A. H. (2004). Presence-absence
versus presence-only modelling methods for predicting
bird habitat suitability. Ecography, 27(4), 437–448. https://doi.org/10.1111/j.0906-7590.2004.03764.x
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002).
SMOTE: Synthetic minority over-sampling technique. Journal of
Artificial Intelligence Research, 16, 321–357. https://doi.org/10.1613/
jair.953
FIGURE 6 Important variables selected by the model trained on each data set for Dunlin as selected by Gini importance index; TNC
point counts alone (A), eBird checklists alone (B) and the combined data set of TNC point counts and eBird checklists (C). Please note the
different scales for each panel and that this only represents the importance of each variable, not the direction of its relationship with
probability of detection
  
Cohen, J. (1960). A coefficient of agreement for nominal scales.
Educational and Psychological Measurement, 20(1), 37–46. https://doi.org/10.1177/001316446002000104
Deutsch, C., Bilenca, D., & Agostini, G. (2017). In search of the horned
frog (Ceratophrys ornata) in Argentina: Complementing field surveys
with citizen science. Herpetological Conservation and Biology, 12,
664–672.
Devictor, V., Whittaker, R. J., & Beltrame, C. (2010). Beyond scarcity:
Citizen science programmes as useful tools for conservation biogeography.
Diversity and Distributions, 16, 354–362. https://doi.org/10.1111/j.1472-4642.2009.00615.x
Elphick, C. S., & Oring, L. W. (1998). Winter management of Californian
rice fields for waterbirds. Journal of Applied Ecology, 35(1), 95–108.
https://doi.org/10.1046/j.1365-2664.1998.00274.x
Fielding, A. H., & Bell, J. F. (1997). A review of methods for the assess-
ment of prediction errors in conservation presence/absence models.
Environmental Conservation, 24(1), 38–49. https://doi.org/10.1017/S0376892997000088
Fink, D., Auer, T., Johnston, A., Ruiz-Gutierrez, V., Hochachka, W. M.,
& Kelling, S. (2019). Modeling Avian Full Annual Cycle Distribution
and Population Trends with Citizen Science Data. BioRxiv, 251868.
https://doi.org/10.1101/251868
Fithian, W., Elith, J., Hastie, T., & Keith, D. A. (2015). Bias correction in
species distribution models: Pooling survey and collection data for
multiple species. Methods in Ecology and Evolution, 6(4), 424–438.
https://doi.org/10.1111/2041-210X.12242
Fox, R., Oliver, T. H., Harrower, C., Parsons, M. S., Thomas, C. D., & Roy,
D. B. (2014). Long-term changes to the frequency of occurrence of
British moths are consistent with opposing and synergistic effects of
climate and land-use changes. Journal of Applied Ecology, 51, 949–957.
https://doi.org/10.1111/1365-2664.12256
Geldmann, J., Heilmann-Clausen, J., Holm, T. E., Levinsky, I., Markussen,
B., Olsen, K., … Tøttrup, A. P. (2016). What determines spatial bias in
citizen science? Exploring four recording schemes with different pro-
ficiency requirements. Diversity and Distributions, 22(11), 1139–1149.
https://doi.org/10.1111/ddi.12477
Golet, G. H., Low, C., Avery, S., Andrews, K., McColl, C. J., Laney, R., &
Reynolds, M. D. (2018). Using ricelands to provide temporary shore-
bird habitat during migration. Ecological Applications, 28(2), 409–426.
https://doi.org/10.1002/eap.1658
Gouraguine, A., Moranta, J., Ruiz-Frau, A., Hinz, H., Reñones, O.,
Ferse, S. C. A., … Smith, D. J. (2019). Citizen science in data and
resource-limited areas: A tool to detect long-term ecosystem
changes. PLoS ONE, 14(1), 1–14. https://doi.org/10.1371/journal.pone.0210007
Guillera-Arroita, G. (2017). Modelling of species distributions, range
dynamics and communities under imperfect detection: Advances,
challenges and opportunities. Ecography, 40(2), 281–295. https://doi.org/10.1111/ecog.02445
He, H., & Garcia, E. A. (2009). Learning from Imbalanced Data. IEEE
Transactions on Knowledge and Data Engineering, 21(9), 1263–1284.
https://doi.org/10.1109/TKDE.2008.239
Han, W., Yang, Z., Di, L., & Mueller, R. (2012). CropScape: A Web service
based application for exploring and disseminating US conterminous
geospatial cropland data products for decision support. Computers
and Electronics in Agriculture, 84, 111–123. https://doi.org/10.1016/J.
COMPAG.2012.03.005
Heidorn, B. (2008). Shedding light on the dark data in the long tail of
science. Library Trends, 57(2), 280–299. https://doi.org/10.1353/lib.0.0036
Hilty, J., & Merenlender, A. M. (2003). Studying biodiversity on private
lands. Conservation Biology, 17(1), 132–137. https://doi.org/10.1046/j.1523-1739.2003.01361.x
Howard, E., Aschen, H., & Davis, A. K. (2010). Citizen science observa-
tions of monarch butterfly overwintering in the southern United
States. Psyche: A Journal of Entomology, 2010, 1–6. https://doi.org/10.1155/2010/689301
Johnston, A., Fink, D., Hochachka, W. M., & Kelling, S. (2018). Estimates
of observer expertise improve species distributions from citizen science
data. Methods in Ecology and Evolution, 9(1), 88–97. https://doi.org/10.1111/2041-210X.12838
Johnston, A., Fink, D., Reynolds, M. D., Hochachka, W. M., Sullivan, B.
L., Bruns, N. E., … Kelling, S. (2015). Abundance models improve spatial
and temporal prioritization of conservation resources. Ecological
Applications, 25(7), 1749–1756. https://doi.org/10.1890/14-1826.1
Johnston, A., Hochachka, W., Strimas-Mackey, M., Gutierrez, V. R.,
Robinson, O., Miller, E., … Fink, D. (2019). Best practices for making
reliable inferences from citizen science data: Case study using
eBird to estimate species distributions. BioRxiv, 574392. https://doi.org/10.1101/574392
Kelling, S., Johnston, A., Bonn, A., Fink, D., Ruiz-Gutierrez, V., Bonney,
R., … Guralnick, R. (2019). Using semistructured surveys to improve
citizen science data for monitoring biodiversity. BioScience, 69(3),
170–179. https://doi.org/10.1093/biosci/biz010
King, G., & Zeng, L. (2001). Logistic Regression in Rare Events Data.
Political Analysis, 9(2), 137–163. https://doi.org/10.1093/oxfordjournals.pan.a004868
Lyon, N. J., Debinski, D. M., & Rangwala, I. (2019). Evaluating the utility
of species distribution models in informing climate change-resilient
grassland restoration strategy. Frontiers in Ecology and Evolution,
7(FEB), 1–8. https://doi.org/10.3389/fevo.2019.00033
Matthiopoulos, J., Hebblewhite, M., Aarts, G., & Fieberg, J. (2011).
Generalized functional responses for species distributions. Ecology,
92(3), 583–589. https://doi.org/10.1890/10-0751.1
Miller, D. A. W., Pacifici, K., Sanderlin, J. S., & Reich, B. J. (2019). The
recent past and promising future for data integration methods to es-
timate species’ distributions. Methods in Ecology and Evolution, 10(1),
22–37. https://doi.org/10.1111/2041-210X.13110
Newson, S. E., Evans, H. E., & Gillings, S. (2015). A novel citizen science
approach for large-scale standardised monitoring of bat activity and
distribution, evaluated in eastern England. Biological Conservation,
191, 38–49. https://doi.org/10.1016/j.biocon.2015.06.009
Pacifici, K., Reich, B. J., Miller, D. A. W., Gardner, B., Stauffer, G., Singh,
S., … Collazo, J. A. (2017). Integrating multiple data sources in species
distribution modeling: A framework for data fusion. Ecology, 98(3),
840–850. https://doi.org/10.1002/ecy.1710
Poisot, T., Bruneau, A., Gonzalez, A., Gravel, D., & Peres-Neto, P. (2019).
Ecological Data Should Not Be So Hard to Find and Reuse. Trends
in Ecology & Evolution, 34(6), 494–496. https://doi.org/10.1016/j.
tree.2019.04.005
R Core Team. (2019). R: A language and environment for statistical computing.
Vienna, Austria: R Foundation for Statistical Computing.
Retrieved from https://www.R-project.org/.
Reich, B. J., Pacifici, K., & Stallings, J. W. (2018). Integrating auxiliary
data in optimal spatial design for species distribution modelling.
Methods in Ecology and Evolution, 9(6), 1626–1637. https://doi.org/10.1111/2041-210X.13002
Reiter, M. E., Elliott, N. K., Barbaree, B., & Moody, D. (2018). Water Tracker:
An Automated Open Surface Water Tracking System for California’s
Central Valley. Point Blue Conservation Science. Retrieved from
www.pointblue.org/watertracker
Reynolds, M. D., Sullivan, B. L., Hallstein, E., Matsumoto, S., Kelling, S.,
Merrifield, M., … Morrison, S. A. (2017). Dynamic conservation for
migratory species. Science Advances, 3(8), e1700707. https://doi.org/10.1126/sciadv.1700707
Robinson, O. J., Ruiz-Gutierrez, V., & Fink, D. (2017). Correcting for bias
in distribution modelling for rare species using citizen science data.
Diversity and Distributions, 24(4), 460–472. https://doi.org/10.1111/ddi.12698
Sandel, B., & Smith, A. B. (2009). Scale as a lurking factor: Incorporating
scale-dependence in experimental ecology. Oikos, 118(9), 1284–1291.
https://doi.org/10.1111/j.1600-0706.2009.17421.x
Sauer, J. R., Pardieck, K. L., Ziolkowski, D. J., Smith, A. C., Hudson, M.-A.-R.,
Rodriguez, V., & Link, W. A. (2017). The first 50 years of the
North American Breeding Bird Survey. The Condor, 119(3), 576–593.
https://doi.org/10.1650/condor-17-83.1
Sofaer, H. R., Jarnevich, C. S., Pearse, I. S., Smyth, R. L., Auer, S., Cook,
G. L., … Hamilton, H. (2019). Development and Delivery of Species
Distribution Models to Inform Decision-Making. BioScience, 69(7),
544–557. https://doi.org/10.1093/biosci/biz045
Sullivan, B. L., Aycrigg, J. L., Barry, J. H., Bonney, R. E., Bruns, N.,
Cooper, C. B., … Kelling, S. (2014). The eBird enterprise: An integrated
approach to development and application of citizen science.
Biological Conservation, 169, 31–40. https://doi.org/10.1016/j.biocon.2013.11.003
Tonachella, N., Natasi, A., Kaufman, G., Maldini, D., & Rankin, R. W.
(2012). Predicting trends in humpback whale (Megaptera novaean-
gliae) abundance using citizen science. Pacific Conservation Biology,
18, 297–309. https://doi.org/10.1071/PC120297
Westgate, M. J., Scheele, B. C., Ikin, K., Hoefer, A. M., Beaty, R. M., Evans,
M., … Driscoll, D. A. (2015). Citizen Science Program Shows Urban
Areas Have Lower Occurrence of Frog Species, but Not Accelerated
Declines. PLoS ONE, 10(11), e0140973. https://doi.org/10.1371/journal.pone.0140973
Wright, M. N., Wager, S., & Probst, P. (2019). Package “ranger” A Fast
Implementation of Random Forests. https://doi.org/10.1080/10618600.2014.983641
Xue, Y., Davies, I., Fink, D., Wood, C., & Gomes, C. P. (2016). Behavior iden-
tification in two-stage games for incentivizing citizen science explo-
ration. Lecture Notes in Computer Science (Including Subseries Lecture
Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 9892
LNCS, 701–717. https://doi.org/10.1007/978-3-319-44953-1_44
Zipkin, E. F., Royle, J. A., Dawson, D. K., & Bates, S. (2010). Multi-species
occurrence models to evaluate the effects of conservation and management
actions. Biological Conservation, 143(2), 479–484. https://doi.org/10.1016/j.biocon.2009.11.016
BIOSKETCH
Orin Robinson is an ecologist at the Cornell Lab of Ornithology
interested in using and developing quantitative tools to learn
about vertebrate population and community ecology, and using
lessons learned to inform conservation.
SUPPORTING INFORMATION
Additional supporting information may be found online in the
Supporting Information section.
How to cite this article: Robinson OJ, Ruiz-Gutierrez V,
Reynolds MD, Golet GH, Strimas-Mackey M, Fink D.
Integrating citizen science data with expert surveys increases
accuracy and spatial extent of species distribution models.
Divers Distrib. 2020;26:976–986. https://doi.org/10.1111/ddi.13068
... (2) Understanding patterns, dynamics, and drivers of tree diversity presented a major research priority for urban ecosystems worldwide (Figure 7, green cluster). Studies analyzed alpha and beta diversities to quantify variations in community composition between sites and regions (Bourne & Conway, 2014;Ricotta et al., 2012). ...
... Landscape ecology approaches assessed habitat patches, fragmentation, and connectivity to understand compositional patterns (Norton et al., 2016). Remote sensing, field surveys, and citizen science datasets were combined to map and monitor communities at broader scales (Ozkan et al., 2016;Robinson et al., 2020). Experimental studies in gardens and field sites provided a process-based understanding of diversity drivers (Gaston et al., 2005;Mathieu et al., 2007). ...
... Biodiversity projections incorporated urbanization scenarios and environmental changes. Multiple facets were analyzed, from species composition to functional to phylogenetic diversity, to support conservation planning and prioritization (Pyšek et al., 2004;Ricotta et al., 2012;Stas et al., 2020). ...
Article
Full-text available
Ecosystem services offered by urban forests must be proactively managed to remain diverse and sustainable. Recent research findings deserve a systematic synthesis to elucidate inherent knowledge structures and dynamics. This study focused on the urban tree diversity theme from 2000 to 2022. Web of Science Core Collection database provided bibliometric details on academic publications. The data‐driven quantitative analysis explored research quantities, emphasis, trends, patterns, linkages, and impacts by countries, institutions, authors, journals, and citations. Publications and research topics have expanded continually, with accelerated growth in recent years. Research activities, outputs and interactions demonstrated conspicuous spatial clustering. A few countries, institutions and researchers generated a notable proportion of publications. Their scholarly contributions were visualized in knowledge graphs as complex networks of nodes and inter‐node links. Keyword analysis generated a network to indicate research hotspots and frontiers to steer and prioritize future studies. Recent findings affirmed that cities can harbor substantial tree diversity due to enhanced habitat heterogeneity and successful species adaptation. Aligning tree traits with environmental conditions and management objectives can improve benefits. Urbanization can filter tree traits to shape community assemblages through stressors: habitat degradation, fragmentation and loss, in conjunction with pollution, climate change, and introduced species. Diversity preservation strategies include protecting remnant natural vegetation, connecting green spaces, and restoring complex canopy geometry and biomass structure. The emerging frontiers are marked by modeling future species distributions, leveraging technologies like remote sensing, linking ecology with human values, and committing to community‐based stewardship. 
Management can be upgraded by interdisciplinary perspectives integrating ecological science and social engagement. The findings highlight the need for biodiversity enrichment anchored by native species, trait‐matched assemblages, adaptive policies, and community participation to create livable‐green cities. This review synthesizes key advances in urban tree ecology and biodiversity research to inform the planning and stewardship of resilient urban forests.
... Based on such thinking, some studies have proposed "reverse engineering" structure in biodiversity data by filtering data points (Rapacciuolo, Young & Johnson, 2021). Part of this reverse engineering has attempted to deal with spatial biases; for instance, by spatially subsampling data to reduce the unevenness of sampling effort across the landscape (Steen et al., 2021;Matutini et al., 2021;Steen, Elphick & Tingley, 2019;Boria et al., 2014;Robinson et al., 2020). This has been tested on, for instance, the semi-structured data compiled by eBird (Johnston et al., 2021). ...
... Recent class balancing approaches have been developed to ensure that important detections for rare species are not lost during the subsampling process (Robinson et al., 2020;Steen et al., 2021;Gaul et al., 2022). ...
Article
Full-text available
Big biodiversity data sets have great potential for monitoring and research because of their large taxonomic, geographic and temporal scope. Such data sets have become especially important for assessing temporal changes in species' populations and distributions. Gaps in the available data, especially spatial and temporal gaps, often mean that the data are not representative of the target population. This hinders drawing large‐scale inferences, such as about species' trends, and may lead to misplaced conservation action. Here, we conceptualise gaps in biodiversity monitoring data as a missing data problem, which provides a unifying framework for the challenges and potential solutions across different types of biodiversity data sets. We characterise the typical types of data gaps as different classes of missing data and then use missing data theory to explore the implications for questions about species' trends and factors affecting occurrences/abundances. By using this framework, we show that bias due to data gaps can arise when the factors affecting sampling and/or data availability overlap with those affecting species. But a data set per se is not biased. The outcome depends on the ecological question and statistical approach, which determine choices around which sources of variation are taken into account. We argue that typical approaches to long‐term species trend modelling using monitoring data are especially susceptible to data gaps since such models do not tend to account for the factors driving missingness. To identify general solutions to this problem, we review empirical studies and use simulation studies to compare some of the most frequently employed approaches to deal with data gaps, including subsampling, weighting and imputation. All these methods have the potential to reduce bias but may come at the cost of increased uncertainty of parameter estimates. 
Weighting techniques are arguably the least used so far in ecology and have the potential to reduce both the bias and variance of parameter estimates. Regardless of the method, the ability to reduce bias critically depends on knowledge of, and the availability of data on, the factors creating data gaps. We use this review to outline the necessary considerations when dealing with data gaps at different stages of the data collection and analysis workflow.
... sampling tends to concentrate in densely populated or touristic areas; Kendal et al., 2020;Reddy and Dávalos, 2003), SDMs such as MaxEnt can account for such spatial biases by considering the spatial distribution of sampling efforts when selecting pseudo-absence (background) locations (Milanesi et al., 2020;Phillips et al., 2009). When sampling efforts are adequately controlled, adding community-sourced data improves the accuracy of SDMs (Johnston et al., 2018;Robinson et al., 2020;Steen et al., 2019). This implies that SDMs may be substantially improved by utilising rapidly accumulating Biome's species occurrence records if we adequately control the sampling efforts. ...
... The inclusion of Biome data resulted in improved accuracy of SDMs ( Figure 3). The most accurate model predictions were obtained when the training data consisted of 50-70% Biome data (Appendix 1), highlighting the necessity of incorporating both traditional surveys and citizen observations for a comprehensive understanding of species distributions (Miller et al., 2019;Pacifici et al., 2017;Robinson et al., 2020). ...
Article
Full-text available
Comprehensive biodiversity data is crucial for ecosystem protection. The Biome mobile app, launched in Japan, efficiently gathers species observations from the public using species identification algorithms and gamification elements. The app has amassed >6 million observations since 2019. Nonetheless, community-sourced data may exhibit spatial and taxonomic biases. Species distribution models (SDMs) estimate species distribution while accommodating such bias. Here, we investigated the quality of Biome data and its impact on SDM performance. Species identification accuracy exceeds 95% for birds, reptiles, mammals, and amphibians, but seed plants, molluscs, and fishes scored below 90%. Our SDMs for 132 terrestrial plants and animals across Japan revealed that incorporating Biome data into traditional survey data improved accuracy. For endangered species, traditional survey data required >2000 records for accurate models (Boyce index ≥ 0.9), while blending the two data sources reduced this to around 300. The uniform coverage of urban-natural gradients by Biome data, compared to traditional data biased towards natural areas, may explain this improvement. Combining multiple data sources better estimates species distributions, aiding in protected area designation and ecosystem service assessment. Establishing a platform for accumulating community-sourced distribution data will contribute to conserving and monitoring natural ecosystems.
... Integrated distribution models (IDMs) provide a way to leverage the strengths while overcoming the weaknesses of multiple data sets from unmarked populations (count, detection/non-detection and presence-only; Dorazio, 2014, Pacifici et al., 2017, Fletcher et al., 2019, Miller et al., 2019, Isaac et al., 2020, sometimes involving participatory science data (Di Febbraro et al., 2023;Farr et al., 2021;Pagel et al., 2014;Robinson et al., 2020;Schindler et al., 2022;Stillman et al., 2023). Given the relative ease with which participatory science data can be obtained across broad spatiotemporal extents, IDMs can thus provide new opportunities for understanding relationships between ecological patterns and global change drivers (Di Febbraro et al., 2023;Doser et al., 2021;Grüss et al., 2023). ...
... We addressed this challenge by using a parameter in the integrated model that scale the estimates of participatory science data to the same level of the estimates of rigorous survey data (Schindler et al., 2022;Stillman et al., 2023). In this way, we successfully used participatory science data in a dynamic N-mixture model to inform population growth, while previous studies that integrated participatory science data were mostly constrained to static (Conn et al., 2022;Dambly et al., 2023;Farr et al., 2021;Gelfand & Shirota, 2019;Koshkina et al., 2017;Robinson et al., 2020) or trend models (Schindler et al., 2022;Stillman et al., 2023). Considering the fast development of dynamic N-mixture models to allow inferences of demographic rates (Dail & Madsen, 2011), disease structures (DiRenzo et al., 2019), spatial dependency among sites (Howell et al., 2020) or interspecific interactions , there are great potentials to extend our model to understand various ecological processes at broad spatiotemporal scales represented by participatory science data. ...
Article
Full-text available
Knowledge of variation in population processes (e.g. population growth) across broad spatiotemporal scales is fundamental to population ecology and critical for conservation decision‐making. Count data from rigorous surveys (e.g. surveys with probabilistic sampling design and distance sampling information) can inform population processes but are often limited in space and time. Participatory science data cover broader spatiotemporal extents but are prone to bias due to limited to no sampling design and lack of distance sampling information, hindering their capability of informing population processes. Here, we developed an integrated dynamic N‐mixture model that jointly analyses rigorous survey and participatory science data to inform population growth at broad spatiotemporal extents. The model contains a flexible scaling parameter that allows fixed and random effects to account for biases and errors in participatory science data. We conducted simulations to evaluate the inference performance of this model across a broad range of spatial and temporal overlap between rigorous survey and participatory science data. We also conducted a case study of Baird's Sparrow (Centronyx bairdii), a species of conservation concern, to illustrate the application of the integrated model with rigorous survey data from the Integrated Monitoring in Bird Conservation Regions programme and participatory science North American Breeding Bird Survey and eBird data. Simulations showed that the integrated model improved precision without biasing parameter estimates, in comparison with a model informed by rigorous survey data alone. The case study further demonstrated the utility of the integrated model for quantifying range‐wide, long‐term population processes and environmental drivers despite limited spatiotemporal extent of rigorous survey data. In particular, we found that population growth rate peaked under medium temperature, which were only apparent in the integrated model. 
The integrated model developed in this study is useful for understanding wildlife population processes at broad spatiotemporal scales with count data. The flexible structure of this model, in particular the scaling parameter, makes it highly adaptable to a broad range of ecological systems and survey procedures. These properties make this modelling approach highly relevant for both population ecology and conservation practice.
... This comparison facilitates the calibration of different datasets, making them suitable for trend analysis and enhancing their reliability (Forti et al., 2024; Hertzog et al., 2021). In fact, the integration of citizen-science data and structured surveys has been shown to offer effective complementary insights (Dimson et al., 2023; Robinson et al., 2020; Tulloch, Mustin, et al., 2013). ...
... In spite of the data available, other vertebrate groups have received less attention (Feldman et al., 2021). Based on GBIF data, citizen science can contribute to more accurate distribution models for some species (Robinson et al., 2020), such as the House gecko, which has over 1000 observations in Brazil. ...
Article
Understanding the distribution of rare species is important for conservation prioritisation. Traditionally, museums and other research institutions have served as depositories for specimens and biodiversity information. However, estimating abundance from these sources is challenging due to spatiotemporally biased collection methods. For instance, large‐bodied reptiles that are found near research institutions or in popular, easily accessible sites tend to be overrepresented in collections compared to smaller species found in remote areas. Recently, a substantial number of observations have been amassed through citizen (or community) science initiatives, which are invaluable for monitoring purposes. Given the unstructured nature of this sampling, these datasets are often affected by biases, such as taxonomic, spatial and temporal preferences. Therefore, analysing data from these two sources can lead to different abundance estimates. This study compiled data on Brazilian reptile species from the Global Biodiversity Information Facility (GBIF). It employed a community‐ecology approach to analyse data from research institutions and citizen science initiatives, separately and collectively, to assess taxonomic and spatial species coverage and predict species rarity. Using a 1‐degree hexagonal grid, we analysed the spatial distribution of reptile communities and calculated rarity indices for 754 reptile species. Our findings reveal that 87 species were exclusively recorded in the citizen science subset, while 212 were recorded only by research institutions. The number of observations per species in the citizen science data followed a Gambin distribution, which aligns with the expected pattern of abundance in natural communities, unlike the data from research institutions. This suggests that citizen science data may be a more accurate source for estimating species abundance and rarity.
The discrepancies in rarity classifications between the datasets were likely due to differences in sample size and potentially other sampling parameters. Nevertheless, combining data collected by both research institutions and citizen science initiatives can help to fill knowledge gaps in reptile species occurrence, thus enhancing the foundation for conservation efforts on a national scale.
... For regions and taxa where there is great public interest in biodiversity recording, the number of records generated can be comparable to EC data, often while sampled over a much shorter time period (Zapponi et al., 2017) and can improve the spatiotemporal coverage of species observations (Petersen et al., 2021). Integrating CS data with EC data can thus increase the accuracy and precision of models aimed at estimating species' distributions, which have immense potential for informing conservation decisions (Guisan et al., 2013; Robinson et al., 2020). ...
... This dataset was the primary basis for recent conservation assessments. CS sources were defined as observations collected in large part by nonexperts or non-formally trained individuals through open access repositories, such as iNaturalist (Robinson et al., 2020) and social media, specifically from the Facebook Group "The Bees and Wasps of Singapore" (https://www.facebook.com/groups/1450495321695805). ...
Article
Bees and the ecosystem services they provide are vital to urban ecosystems, but little is understood about their distributions, particularly in the Asian tropics. This is largely due to taxonomic impediments and limited inventorying, monitoring, and digitization of occurrence records. While expert collections (EC) are demonstrably insufficient by themselves as a data source to model and understand bee distributions, the boom of community science (CS) in urban areas provides an untapped opportunity to learn about bee distributions within our cities. We used CS observations in combination with EC observations to model the distribution of bees in Singapore, a small tropical city-state in Southeast Asia. To address the restricted spatial context, we performed multiple bias corrections and show that species distribution models performed well when estimating the distribution of habitat specialists with distinct range limits detectable within Singapore. We successfully modelled 37 bee species, where model statistics improved for 23 species upon the incorporation of CS observations. Nine species had insufficient EC observations to obtain acceptable models, but could be modelled with the incorporation of CS observations. This is the first study to combine both EC and CS observations to map and model the occurrences of tropical Asian bee species for a highly urbanized region at such fine resolution. Our results suggest that urban landscapes with impervious surfaces and higher temperatures are less suitable for bee species, and such findings can be used to advise the management of urban landscapes to optimize the diversity of bee pollinators and other organisms.
... Data found in publicly available databases have to be treated with care due to different biases and sources of uncertainties, e.g., the increasing data collection by citizen scientists with varying levels of expertise or variable data-gathering methodologies that challenge the reliability and comparability of the datasets (Kosmala et al., 2016; Callaghan et al., 2020). Moreover, inaccuracies in species identification and potential disturbances to legally protected or vulnerable species by untrained individuals can compromise data quality and further contribute to conservation concerns, such as the proliferation of pathogens through improper biosecurity during monitoring (Barve et al., 2011; Robinson et al., 2020). ...
Article
Freshwater crayfish are amongst the largest macroinvertebrates and play a keystone role in the ecosystems they occupy. Understanding the global distribution of these animals is often hindered due to a paucity of distributional data. Additionally, non-native crayfish introductions are becoming more frequent, which can cause severe environmental and economic impacts. Management decisions related to crayfish and their habitats require accurate, up-to-date distribution data and mapping tools. Such data are currently patchily distributed with limited accessibility and are rarely up-to-date. To address these challenges, we developed a versatile e-portal to host distributional data of freshwater crayfish and their pathogens (using Aphanomyces astaci, the causative agent of the crayfish plague, as the most prominent example). Populated with expert data and operating in near real-time, World of Crayfish™ is a living, publicly available database providing worldwide distributional data sourced by experts in the field. The database offers open access to the data through specialized standard geospatial services (Web Map Service, Web Feature Service) enabling users to view, embed, and download customizable outputs for various applications. The platform is designed to support technical enhancements in the future, with the potential to eventually incorporate various additional features. This tool serves as a step forward towards a modern era of conservation planning and management of freshwater biodiversity.
... If different platforms engage more people that have different surveying skills and preferences, new initiatives can focus on poorly represented areas or observational details missing from other initiatives or atlases (Schaaf et al. 2024). Community science can further enhance bird monitoring in the Atlantic Forest (Forti et al. 2024) and its contribution could be elevated by aggregating different datasets (Hertzog et al. 2021), including combining expert- and volunteer-collected data (Robinson et al. 2020; Farias et al. 2022). For instance, WikiAves, one of the major platforms to produce data on birds in Brazil, is not linked to GBIF, because observations are not georeferenced at a fine resolution. ...
Article
Context. Engaging the general public can increase spatio-temporal coverage of wildlife monitoring. Given the potentially substantial costs, we need to evaluate the contribution of known and planned initiatives and confirm whether multiple platforms increase the efficiency of data collection. As observer behaviour affects data quantity and quality, users of specialised and generalist platforms are expected to behave differently, resulting in more connected networks for specialised and higher nestedness for generalist platforms. Specialist observers are expected to contribute a balanced ratio of rare and common species, whereas non-specialist contribution will depend more on species detectability. Aims. We aim to evaluate whether the combined contribution of observers from different platforms can improve the quality of occurrence and distribution data of 218 endemic Atlantic Forest bird species in Brazil. We also describe and compare observer-bird species interaction networks to illustrate observer behaviour on different platforms. Methods. On the basis of data from five community science platforms in Brazil, namely, eBird, WikiAves, Biofaces, iNaturalist and Táxeus, we compared the spatial distribution of bird observations, the number of observers, the presence of the same observers on various platforms, bird species coverage, and the proportion of duplicate observations within and among platforms. Key results. Although species coverage of the joint dataset increased by up to 100%, spatial completeness among the five platforms was low. The network of individual platforms had low values of clustering, and the network of the joint dataset had low connectance and high nestedness. Conclusions. Each platform had a somewhat unique contribution. Pooling these datasets and integrating them with standardised data can inform our knowledge on bird distributions and trends in this fragile biome. 
Nevertheless, we encourage observers to provide precise coordinates, dates and other data (and platforms to accommodate such data) and recommend submitting data from all platforms into the Global Biodiversity Information Facility to support wildlife research and conservation. Implications. If new platforms engage more and different people, new initiatives can cover poorly represented areas and successfully expand monitoring effort for Atlantic Forest endemic bird species.
... We elected to combine the Oregon 20/20 and eBird data sets because the Oregon 20/20 data set is less susceptible to spatial and temporal biases common to citizen-science data (Strimas-Mackey et al. 2020), and the eBird data set allowed us to fully cover the known species breeding range. Furthermore, combining expert and citizen-science survey data may improve model accuracy, as different survey structures capture complementary information on species occurrences (Robinson et al. 2020b). ...
... By providing local communities with the tools, knowledge, and capability to effectively gather useful information, managers and custodians of ecosystems nationally (Robinson et al., 2020) and internationally (La Sorte and Somveille, 2020) will benefit. Having scientifically rigorous data collected by custodians at the doorstep of the environment in question can potentially deliver improved outcomes for waterways (Shupe, 2017) and the species that depend on them, such as the critically endangered Bellinger River snapping turtle. ...
Article
In 2015, the sudden decline in the only known population of Myuchelys georgesi (the Bellinger River snapping turtle) triggered a strong community response, and a link between turtle mortality and poor water quality in the Bellinger River was suggested. A multi-agency investigation later attributed the mortalities of M. georgesi to a novel virus (the Bellinger River virus) and not a direct effect of poor water quality. However, a lack of consistent water quality or river health data in the catchment limited the research of factors that may have heightened susceptibility to the virus or exacerbated its symptoms. Community consultation identified strong connections with the riverine environment and highlighted the cultural, social, economic, and environmental values of the Bellinger River catchment. In 2017 OzGREEN, a not-for-profit environmental education charity based in Bellingen, built upon their existing citizen science water quality monitoring program in collaboration with the Saving our Species (SoS) team in the New South Wales (NSW) Department of Planning and Environment (DPE), who provided funding and equipment and solicited the involvement of NSW Waterwatch, Western Sydney University and Taronga Zoo. Now known as Bellingen Riverwatch (Riverwatch), the program has become a long-term citizen science program that aims to assist the recovery of M. georgesi, now a critically endangered species, through the delivery of monthly water quality data covering the Bellinger River and its tributaries. SoS also engaged the DPE Estuaries and Catchments Team to commence the Bellinger River Health Program (BRHP), focusing on water quality and aquatic macroinvertebrates to assess river health, with the aim of providing scientifically rigorous data to support the management and recovery of M. georgesi. 
This case study compares and evaluates the Riverwatch citizen science and the BRHP professional science, examining methods and results to compare the accuracy of the citizen science data and assess its reliability for informing ongoing river management. The results demonstrate that Bellingen Riverwatch is a well-managed citizen science program and generally provides valid, accurate, and representative results that can be confidently used to enhance the spatial and temporal coverage of the professional science monitoring program.
Article
Information on species’ distributions, abundances, and how they change over time is central to the study of the ecology and conservation of animal populations. This information is challenging to obtain at landscape scales across range‐wide extents for two main reasons. First, landscape‐scale processes that affect populations vary throughout the year and across species’ ranges, requiring high‐resolution, year‐round data across broad, sometimes hemispheric, spatial extents. Second, while citizen science projects can collect data at these resolutions and extents, using these data requires appropriate analysis to address known sources of bias. Here, we present an analytical framework to address these challenges and generate year‐round, range‐wide distributional information using citizen science data. To illustrate this approach, we apply the framework to Wood Thrush (Hylocichla mustelina), a long‐distance Neotropical migrant and species of conservation concern, using data from the citizen science project eBird. We estimate occurrence and abundance across a range of spatial scales throughout the annual cycle. Additionally, we generate intra‐annual estimates of the range, intra‐annual estimates of the associations between species and characteristics of the landscape, and interannual trends in abundance for breeding and non‐breeding seasons. The range‐wide population trajectories for Wood Thrush show a close correspondence between breeding and non‐breeding seasons with steep declines between 2010 and 2013 followed by shallower rates of decline from 2013 to 2016. The breeding season range‐wide population trajectory based on the independently collected and analyzed North American Breeding Bird Survey data also shows this pattern. The information provided here fills important knowledge gaps for Wood Thrush, especially during the less studied migration and non‐breeding periods. 
More generally, the modeling framework presented here can be used to accurately capture landscape scale intra‐ and interannual distributional dynamics for broadly distributed, highly mobile species.
Article
Information on where species occur is an important component of conservation and management decisions, but knowledge of distributions is often coarse or incomplete. Species distribution models provide a tool for mapping habitat and can produce credible, defensible, and repeatable information with which to inform decisions. However, these models are sensitive to data inputs and methodological choices, making it important to assess the reliability and utility of model predictions. We provide a rubric that model developers can use to communicate a model's attributes and its appropriate uses. We emphasize the importance of tailoring model development and delivery to the species of interest and the intended use and the advantages of iterative modeling and validation. We highlight how species distribution models have been used to design surveys for new populations, inform spatial prioritization decisions for management actions, and support regulatory decision-making and compliance, tying these examples back to our model assessment rubric.
Article
Biodiversity is being lost at an unprecedented rate, and monitoring is crucial for understanding the causal drivers and assessing solutions. Most biodiversity monitoring data are collected by volunteers through citizen science projects, and often crucial information is lacking to account for the inevitable biases that observers introduce during data collection. We contend that citizen science projects intended to support biodiversity monitoring must gather information about the observation process as well as species occurrence. We illustrate this using eBird, a global citizen science project that collects information on bird occurrences as well as vital contextual information on the observation process while maintaining broad participation. Our fundamental argument is that regardless of what species are being monitored, when citizen science projects collect a small set of basic information about how participants make their observations, the scientific value of the data collected will be dramatically improved.
Preprint
Citizen science data are valuable for addressing a wide range of ecological research questions, and there has been a rapid increase in the scope and volume of data available. However, data from large-scale citizen science projects typically present a number of challenges that can inhibit robust ecological inferences. These challenges include: species bias, spatial bias, and variation in effort. To demonstrate addressing key challenges in analysing citizen science data, we use the example of estimating species distributions with data from eBird, a large semi-structured citizen science project. We estimate two widely applied metrics of species distributions: encounter rate and occupancy probability. For each metric, we assess the impact of data processing steps that either degrade or refine the data used in the analyses. We also test whether differences in model performance are maintained at different sample sizes. Model performance improved when data processing and analytical methods addressed the challenges arising from citizen science data. The largest gains in model performance were achieved with: 1) the use of complete checklists (where observers report all the species they detect and identify); and 2) the use of covariates describing variation in effort and detectability for each checklist. Occupancy models were more robust to a lack of complete checklists and effort variables. Improvements in model performance with data refinement were more evident with larger sample sizes. Here, we describe processes to refine semi-structured citizen science data to estimate species distributions. We demonstrate the value of complete checklists, which can inform the design and adaptation of citizen science projects. We also demonstrate the value of information on effort. 
The methods we have outlined are also likely to improve other forms of inference, and will enable researchers to conduct robust analyses and harness the vast ecological knowledge that exists within citizen science data.
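The two refinement steps highlighted in this abstract (keeping only complete checklists and using effort information) can be illustrated with a minimal Python sketch. The record fields, the effort thresholds (300 minutes, 5 km) and the function names are assumptions chosen for illustration; they are not taken from the preprint or from any eBird tooling.

```python
from dataclasses import dataclass


@dataclass
class Checklist:
    # Hypothetical record layout, for illustration only.
    checklist_id: str
    complete: bool        # observer reported every species detected and identified
    duration_min: float   # effort: survey duration in minutes
    distance_km: float    # effort: distance travelled in kilometres
    detected: bool        # target species reported on this checklist


def filter_checklists(checklists, max_duration_min=300.0, max_distance_km=5.0):
    """Keep complete checklists whose effort falls within the assumed bounds."""
    return [
        c for c in checklists
        if c.complete
        and 0 < c.duration_min <= max_duration_min
        and 0 <= c.distance_km <= max_distance_km
    ]


def encounter_rate(checklists):
    """Share of checklists reporting the target species (None if the list is empty)."""
    if not checklists:
        return None
    return sum(c.detected for c in checklists) / len(checklists)


example = [
    Checklist("a", True, 60, 1.0, True),
    Checklist("b", False, 30, 0.5, True),    # dropped: incomplete checklist
    Checklist("c", True, 600, 2.0, False),   # dropped: duration above bound
    Checklist("d", True, 45, 0.0, False),
    Checklist("e", True, 20, 12.0, True),    # dropped: distance above bound
]
kept = filter_checklists(example)
print([c.checklist_id for c in kept], encounter_rate(kept))
```

The raw proportion computed here is only a naive summary. The abstract describes using effort variables as model covariates, which this sketch does not attempt.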
Article
Tallgrass prairie ecosystems in North America are heavily degraded and require effective restoration strategies if prairie specialist taxa are to be preserved. One common management tool used to restore grassland is the application of a seed-mix of native prairie plant species. While this technique is effective in the short-term, it is critical that species' resilience to changing climate be evaluated when designing these mixes. By utilizing species distribution models (SDMs), species' bioclimatic envelopes, and thus the geographic area suitable for them, can be quantified and predicted under various future climate regimes, and current seed-mixes may be modified to include more climate resilient species or exclude more affected species. We evaluated climate response on plant functional groups to examine the generalizability of climate response among species of particular functional groups. We selected 14 prairie species representing the functional groups of cool-season and warm-season grasses, forbs, and legumes and we modeled their responses under both a moderate and more extreme predicted future. Our functional group “composite maps” show that warm-season grasses, forbs, and legumes responded similarly to other species within their functional group, while cool-season grasses showed less inter-species concordance. The value of functional group as a rough method for evaluating climate-resilience is therefore supported, but candidate cool-season grass species will require more individualized attention. This result suggests that seed-mix designers may be able to use species with more occurrence records to generate functional group-level predictions to assess the climate response of species for which there are prohibitively few occurrence records for modeling.
Article
With the advance of methods for estimating species distribution models has come an interest in how to best combine datasets to improve estimates of species distributions. This has spurred the development of data integration methods that simultaneously harness information from multiple datasets while dealing with the specific strengths and weaknesses of each dataset. We outline the general principles that have guided data integration methods and review recent developments in the field. We then outline key areas that allow for a more general framework for integrating data and provide suggestions for improving sampling design and validation for integrated models. Key to recent advances has been using point‐process thinking to combine estimators developed for different data types. Extending this framework to new data types will further improve our inferences, as well as relaxing assumptions about how parameters are jointly estimated. These along with the better use of information regarding sampling effort and spatial autocorrelation will further improve our inferences. Recent developments form a strong foundation for implementation of data integration models. Wider adoption can improve our inferences about species distributions and the dynamic processes that lead to distributional shifts.
Article
Coral reefs are threatened by numerous global and local stressors. In the face of predicted large-scale coral degradation over the coming decades, the importance of long-term monitoring of stress-induced ecosystem changes has been widely recognised. In areas where sustained funding is unavailable, citizen science monitoring has the potential to be a powerful alternative to conventional monitoring programmes. In this study we used data collected by volunteers in Southeast Sulawesi (Indonesia), to demonstrate the potential of marine citizen science programmes to provide scientifically sound information necessary for detecting ecosystem changes in areas where no alternative data are available. Data were collected annually between 2002 and 2012 and consisted of percent benthic biotic and abiotic cover and fish counts. Analyses revealed long-term coral reef ecosystem change. We observed a continuous decline of hard coral, which in turn had a significant effect on the associated fishes, at community, family and species levels. We provide evidence of the importance of marine citizen science programmes in detecting long-term ecosystem change as an effective way of delivering conservation data to local government and national agencies. This is particularly true for areas where funding for monitoring is unavailable, resulting in an absence of ecological data. For citizen science data to contribute to ecological monitoring and local decision-making, the data collection protocols need to adhere to sound scientific standards, and protocols for data evaluation need to be available to local stakeholders. Here, we describe the monitoring design, data treatment and statistical analyses to be used as potential guidelines in future marine citizen science projects.
Article
Opportunistically collected species observations contributed by volunteer reporters are increasingly available for species and regions for which systematically collected data are not available. However, it is unclear if they are suitable to produce reliable habitat suitability models (HSMs), and hence if the species–habitat relationships found and habitat suitability maps produced can be used with confidence to advise conservation management and address basic and applied research questions. We evaluated HSMs with opportunistically collected observations against HSMs with systematically collected observations. We enhanced the opportunistically collected presence‐only data by adding inferred species absences. To obtain inferred absences, we asked individual reporters about their identification skills and if they reported certain species consistently and combined this information with their observations. We evaluated several HSM methods using a forest bird species, Siberian jay (Perisoreus infaustus), in Sweden: logistic regression with inferred absences, two versions of MaxEnt, a model combining presence–absence with presence‐only observations and a Bayesian site‐occupancy‐detection model. All HSM methods produced nationwide habitat suitability maps of Siberian jay that agreed well with systematically collected observations (AUC: 0.86–0.88) and were very similar to a habitat suitability map produced from the HSM with systematically collected observations (Spearman rho: 0.94–0.98). At finer geographical scales there were differences among methods. At finer scale, the resulting habitat suitability maps from logistic regression with inferred absences agreed better with results from systematically collected observations than other methods. The species–habitat relationships found with logistic regression also agreed well with those found from systematically collected data and with prior expectations based on the species ecology. Synthesis and application.
For many regions and species, systematically collected data are not available. By using inferred absences from high‐quality, opportunistically collected contributions of a few very active reporters in logistic regression, we obtained HSMs that produced results similar to those from a systematic survey. Adding high‐quality inferred absences to opportunistically collected data is likely possible for many less common species across various organism groups. Well‐performing HSMs are important to facilitate applications such as spatial conservation planning and prioritization, monitoring of invasive species, understanding species habitat requirements or climate change studies.
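The abstract above reports agreement between model predictions and systematically collected observations as an AUC value. As a minimal sketch of what that statistic measures, the Python function below computes AUC as the probability that a randomly chosen presence site receives a higher predicted suitability than a randomly chosen absence site, with ties counted as one half. The function name and the toy inputs are assumptions for illustration and do not reproduce the study's data or code.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of presences and absences.

    scores: predicted habitat suitability per site.
    labels: 1 for observed presence, 0 for observed absence.
    """
    presences = [s for s, y in zip(scores, labels) if y == 1]
    absences = [s for s, y in zip(scores, labels) if y == 0]
    if not presences or not absences:
        raise ValueError("AUC needs at least one presence and one absence")

    wins = 0.0
    for p in presences:
        for a in absences:
            if p > a:
                wins += 1.0      # presence ranked above absence
            elif p == a:
                wins += 0.5      # tie counts as half
    return wins / (len(presences) * len(absences))


# Toy example: presences score 0.9 and 0.3, the single absence scores 0.8.
print(auc([0.9, 0.8, 0.3], [1, 0, 1]))  # → 0.5
```

The pairwise loop is quadratic in the number of sites, which is fine for a small illustration. A rank-based formulation would be preferable for large evaluation sets.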
Article
Drawing upon the data deposited in publicly shared archives has the potential to transform the way we conduct ecological research. For this transformation to happen, we argue that data need to be more interoperable and easier to discover. One way to achieve these goals is to adopt domain-specific data representations.