
Integrating citizen science data with expert surveys increases accuracy and spatial extent of species distribution models


Diversity and Distributions. 2020;26:976–
Received: 28 October 2019
Revised: 21 February 2020
Accepted: 21 March 2020
DOI: 10.1111/ddi.13068
Orin J. Robinson1 | Viviana Ruiz-Gutierrez1 | Mark D. Reynolds2 |
Gregory H. Golet3 | Matthew Strimas-Mackey1 | Daniel Fink1
This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium,
provided the original work is properly cited.
© 2020 The Authors. Diversity and Distributions published by John Wiley & Sons Ltd
1Cornell Laboratory of Ornithology, Cornell
University, Ithaca, NY, USA
2The Nature Conservancy, San Francisco,
3The Nature Conservancy, Chico, CA, USA
Orin J. Robinson, Cornell Laboratory
of Ornithology, Cornell University, 159
Sapsucker Woods Road, Ithaca, NY 14850
Editor: Luigi Maiorano
Aim: Information on species’ habitat associations and distributions, across a wide
range of spatial and temporal scales, is a fundamental source of ecological knowl-
edge. However, collecting information at relevant scales is often cost prohibitive, al-
though it is essential for framing the broader context of more focused research and
conservation efforts. Citizen science has been signalled as an increasingly important
source to fill in data gaps where information is needed to make comprehensive and
robust inferences on species distributions. However, there are perceived trade-offs
of combining highly structured, scientific survey data with largely un-structured, citi-
zen science data.
Methods: We explore these trade-offs by applying a simplified approach of filtering
citizen science data to resemble structured survey data and analyse both sources of
data under a common framework. To accomplish this, we integrated high-resolution
survey data on shorebirds in the northern Central Valley of California with observa-
tions in eBird for the entire region that were filtered to improve their quality.
Results: The integration of survey data with the filtered citizen science data resulted
in improved inference and increased the extent and accuracy of distribution models
on shorebirds for the Central Valley. The structured surveys improved the overall ac-
curacy of ecological inference over models using citizen science data only by increas-
ing the representation of data collected from high-quality habitats for shorebirds.
Main conclusions: The practical approach we have shown for data integration can
also be used to improve the efficiency of designing biological surveys in the context
of larger, citizen science monitoring efforts, ultimately reducing the financial and time
expenditures typically required of monitoring programs and focused research. The
simple method we present can be used to integrate other types of data with more
localized efforts, ultimately improving our ecological knowledge on the distribution
and habitat associations of species of conservation concern worldwide.
citizen science, data filtering, data integration, data quality, ecological niche model, species
distribution model, structured survey
ROBINSON et al.
1 | INTRODUCTION
Information on species’ habitat associations and distributions is a
fundamental source of ecological knowledge (Sofaer et al., 2019).
This information is often of interest across a broad range of spa-
tial and temporal scales, from high-resolution information that is
more relevant for research on habitat selection (Matthiopoulos,
Hebblewhite, Aarts, & Fieberg, 2011) or needed to inform management
objectives (Zipkin, Andrew Royle, Dawson, & Bates, 2010) to
larger-scale inferences that are useful to address broader questions
(e.g. potential range shifts with changing climatic conditions; Lyon,
Debinski, & Rangwala, 2019). However, the process of collecting biological
observations across large spatial scales is often cost prohibitive
for most research and monitoring efforts. At best, researchers
and practitioners are able to monitor plant and animal communities
just within their study regions, and during specific times of year.
However, the need to make inferences beyond the sampled range of
environmental conditions and seasons often limits our understanding
of the broader context of our results and can limit the use of
applied research to inform future monitoring efforts and effective
conservation actions.
Citizen science data have been signalled as a promising source
of information to fill in information gaps needed to model species
distributions (Bradter et al., 2018; Gouraguine et al., 2019). The
collection of citizen science data is growing rapidly, and for a num-
ber of taxa, large databases of observations exist. For example, the
National Moth Recording Scheme in Great Britain has collected
over 11 million observations (Fox et al., 2014). There are many more
taxa for which rich citizen science data sets exist, for example birds
(Sauer et al., 2017; Sullivan et al., 2014), invertebrates (Howard,
Aschen, & Davis, 2010), bats (Newson, Evans, & Gillings, 2015),
whales (Tonachella, Natasi, Kaufman, Maldini, & Rankin, 2012) and
frogs (Westgate et al., 2015) among others. However, there are few
examples of the potential trade-offs of combining observational
data collected at smaller spatial extents with citizen science data
collected across vast extents. This might be due in part to inherent
differences that often exist between the two data types. On the one
hand, structured survey data are typically collected at a high spatial
resolution by skilled observers, with standardized sampling effort and
sampling across a habitat gradient that is representative of the region
of interest. On the other hand, citizen science data can be collected
across a wide range of sampling conditions and spatial resolutions, by
observers who vary widely in their level of expertise, and without
standardized sampling effort. This has created a perceived trade-off between data
quality and quantity among data collected from structured, scientific
surveys and data collected from larger-scale, volunteer-based mon-
itoring efforts (Figure 1). The assumption being made is that an in-
crease in quantity of citizen science data comes at a significant cost
to quality. Therefore, the logical framework to integrate these two
sources of information would be one that would treat them as inde-
pendent sources of information used to inform a common underlying
distribution for a given species (Pacifici et al., 2017).
The integration of different data sources is a growing area of
methodological development in ecological statistics, and recent
advances have been made to develop ways of integrating survey
data (e.g. structured) with citizen science data (e.g. un-structured
data) (Miller, Pacifici, Sanderlin, & Reich, 2019). For observational
data collected at discrete locations, these methods include spec-
ifying a joint likelihood for the two data sources to estimate the
underlying species distribution (Miller et al., 2019). In cases where
this is not possible, the data source that is deemed to be of lower
quality (e.g. citizen science data, museum observations) can be
used in two ways: (a) modelled as a covariate of the underlying
distribution or (b) used to estimate a separate species distribution,
where a correlation structure is specified to share information
across data sources. Pacifici et al. (2017) tested these different
approaches to integrate observational data from the citizen sci-
ence project eBird (Sullivan et al., 2014) with more structured
data from the North American Breeding Bird Survey (BBS; Sauer
et al., 2017). Their results showed that the joint-likelihood approach
of combining eBird and BBS data outperformed all other
approaches, including using BBS data alone.
The approach used by Pacifici et al. (2017) and Miller et al.
(2019) summarized observational data at a coarse grid level in
order to account for differences in effort, sampling approach, and
other variables that are known to influence detectability (Guillera-Arroita,
2017). In addition, Pacifici et al. (2017) wanted to reduce
potential bias related to the degree of uncertainty about the spatial
scale at which observations were collected for each independent eBird
checklist. This mismatch in scales between the two data sources is
what often makes data integration between high-resolution survey
data (e.g. point count observations) and lower-resolution citizen
science data non-trivial. However, many citizen science programs
collect high-resolution information (e.g. camera traps) in ways
that allow absences to be inferred, and collect additional information on
effort (e.g. distance travelled, number of hours sampled) that is highly
valuable for improving the accuracy of species distribution models (SDMs; Kelling et al., 2019).
FIGURE 1 The perceived trade-off between data quality and
quantity of data collected using structured surveys and citizen
science efforts. This trade-off can be bypassed by processing and
filtering citizen science data using criteria such as count type
(e.g. stationary vs. travelling), duration of counts (e.g. all
observations < 30 min) and other measures that align well with
the existing data set. This approach is recommended as the most
flexible data integration method for the application of a broad
range of species distribution models
Here, we explore a practical approach for integrating high-quality
citizen science data with structured survey data
that builds upon existing methods for “data pooling” (e.g. Fithian,
Elith, Hastie, & Keith, 2015). We explore the trade-offs in inference
of using citizen science data alone, more localized and structured
data alone, and both data sets pooled together. In addition,
we examine specific trade-offs when combining structured and
un-structured data sources by comparing the performance gained from
increasing the quantity of citizen science data through simulations versus
the addition of data from more targeted survey efforts. To accomplish
this, we use The Nature Conservancy's (TNC) BirdReturns
project as a case study (Reynolds et al., 2017). We explore ways
of combining high-resolution bird survey data collected for shorebirds
on rice fields in the northern region of the Central Valley in
California, with observations in eBird for the entire Central Valley
that are filtered to improve their quality. The specific aim of the
case study is to provide a framework for leveraging survey data with
citizen science data to build more accurate distribution models for
shorebird species across the extent of the Central Valley.
2 | METHODS
2.1 | Data
We used point counts carried out during spring surveys (February 1–
May 31; n = 8,192) as part of the TNC BirdReturns project conducted
in 2014–2017. This project used predicted shorebird occurrence
and abundance (Johnston et al., 2015) along with predicted surface
water in the Sacramento River Valley to identify times and locations
that were likely to be important for migrating shorebirds. TNC used
a reverse auction approach to select and incentivize rice farmers in
the identified locations to flood their fields during the spring and fall,
making temporary wetlands available to the migrating shorebirds.
Observers made point counts at fields enrolled in the program and at
unenrolled control sites, surveying a semi-circle with a 200 m fixed
radius. Each site was surveyed for at least two minutes, and each survey lasted as
long as necessary to count all birds present (for more detail on count
methods see Golet et al., 2018). Effort information for each point count consisted
of date, time started and ended, and name of observer.
We combined the point counts with data from the citizen science
project eBird (Sullivan et al., 2014) collected during the same time
period as the point counts. The eBird data were restricted to the
Central Valley of California, USA, and to complete checklists so that
non-detection could be inferred (Johnston et al., 2019). We also restricted
the eBird data to stationary checklists and travelling checklists
limited to 300 m. After filtering these data, we were left with
12,891 checklists. Effort variables for the eBird data set were date,
time observations started, duration of observation in minutes, survey
protocol (stationary or travelling), distance travelled in metres
and number of observers.
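The filtering rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the analysis in this study was carried out in R, and the record layout and field names used here (`protocol`, `complete`, `distance_m`, and so on) are hypothetical rather than the actual eBird column names.

```python
from dataclasses import dataclass

# Hypothetical, simplified record; real eBird exports use different column names.
@dataclass
class Checklist:
    protocol: str        # "stationary" or "travelling"
    complete: bool       # True if the observer reported every species detected
    distance_m: float    # distance travelled in metres (0 for stationary counts)
    duration_min: float  # duration of the observation in minutes
    n_observers: int

MAX_TRAVEL_M = 300.0  # travelling checklists were limited to 300 m


def keep_checklist(c: Checklist) -> bool:
    """Return True if a checklist passes the filters described in the text."""
    if not c.complete:
        # Non-detection can only be inferred from complete checklists.
        return False
    if c.protocol == "stationary":
        return True
    return c.protocol == "travelling" and c.distance_m <= MAX_TRAVEL_M


def filter_checklists(checklists):
    """Keep only the checklists that resemble the structured point counts."""
    return [c for c in checklists if keep_checklist(c)]
```

Keeping the rule in a single predicate makes it straightforward to add further criteria (for example, a maximum duration) if a different structured survey has to be matched.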
We calculated the effort variables from the point counts to
match those of the eBird data set. Observer name in the point count
data set was converted into number of observers (however, it was
always one for this study), time started and ended for the point
counts was used to calculate duration in minutes, and each point
count was treated as a stationary count (distance travelled = 0). By
doing this, the two data sets contained the same effort information
and were identical in structure, which allowed us to simply join them
into one combined data set. We added a variable to the combined
data set to note whether an observation was a TNC point count or
an eBird checklist. We also calculated a checklist calibration index
(CCI) for each checklist in the combined data set to account for vari-
ation in expertise among observers (e.g. expertise score; Johnston,
Fink, Hochachka, & Kelling, 2018). As environmental variables, we
attached the remotely sensed Cropland Data Layer (CDL; Boryan,
Yang, Mueller, & Craig, 2011; Han, Yang, Di, & Mueller, 2012) to the
combined data set by computing the per cent of each land cover or
crop type in the CDL that was present within a 300 m radius centred
on each point count or eBird checklist. We also similarly attached
cloud-filled data from Water Tracker (Reiter, Elliott, Barbaree, &
Moody, 2018), a high spatial and temporal resolution surface water
tracking system for the Central Valley of California.
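As a rough illustration of the harmonization step described above, the following Python sketch converts a point count record into the eBird-style effort variables and joins the two sources with a flag for the data source. The dictionary keys and the ISO timestamp format are assumptions made for the example, not the structure of the actual TNC or eBird files.

```python
from datetime import datetime


def point_count_to_checklist(pc):
    """Recast one point count using the eBird-style effort variables.

    `pc` is a hypothetical dict with ISO-formatted "start" and "end" timestamps
    and an "observer" name.
    """
    start = datetime.fromisoformat(pc["start"])
    end = datetime.fromisoformat(pc["end"])
    return {
        "date": start.date().isoformat(),
        "time_started": start.time().isoformat(timespec="minutes"),
        "duration_min": (end - start).total_seconds() / 60.0,
        "protocol": "stationary",  # each point count is treated as a stationary count
        "distance_m": 0.0,         # so distance travelled is zero
        "n_observers": 1,          # one named observer per point count in this study
        "source": "TNC",
    }


def combine(point_counts, ebird_checklists):
    """Join both sources into one list, noting where each record came from."""
    tnc = [point_count_to_checklist(pc) for pc in point_counts]
    ebird = [dict(c, source="eBird") for c in ebird_checklists]
    return tnc + ebird
```

Because both sources end up with the same keys, any later step (covariate attachment, subsampling, model training) can treat the combined list uniformly and use `source` only when a comparison between data sets is needed.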
A comprehensive data set was created using all of the above
eBird checklists (12,891) plus simulated eBird checklists equal to
the number of TNC point counts that were included in the combined
data set (8,192). This resulted in a data set that was the same magnitude
(21,083) as the combined data set and allowed us to determine
whether the improvement in accuracy from combining the data sets
was simply a function of increasing the sample size. The simulated
eBird data were created by adding a small amount of noise (via the jitter
function in base R; R Core Team, 2019) to the spatial covariates of
8,192 randomly chosen checklists, similar to the oversampling procedure
in Fink et al. (2019); however, we used all checklists rather than
only positive observations here. This ensured that we were maintaining
roughly the same prevalence rate in the data set and also were
not simply making exact copies of the randomly chosen checklists.
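The idea behind the simulated checklists can be sketched as follows. This Python example is a simplified stand-in for the R `jitter` call used in the study: the uniform noise width (`noise`), sampling with replacement, and the record keys are all assumptions chosen for illustration.

```python
import random


def simulate_checklists(checklists, covariates, n_new, noise=0.01, seed=0):
    """Create n_new noisy copies of randomly chosen checklists.

    Only the listed covariates are perturbed; the detection outcome and all
    other fields are copied unchanged, so prevalence stays roughly the same.
    """
    rng = random.Random(seed)
    simulated = []
    for _ in range(n_new):
        base = rng.choice(checklists)  # assumption: sampled with replacement
        new = dict(base)               # copy, so the original record is untouched
        for name in covariates:
            # Uniform noise in [-noise, +noise], analogous in spirit to jitter().
            new[name] = base[name] + rng.uniform(-noise, noise)
        new["simulated"] = True
        simulated.append(new)
    return simulated
```

Drawing from all checklists, rather than from detections only, is what distinguishes this step from an oversampling procedure aimed at class imbalance.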
2.2 | Spatial filtering and class imbalance
As spatial bias is always a concern when using citizen science data
(Geldmann et al., 2016), we spatially subsampled the combined data
set. The data were sparse for many species (proportion of detec-
tions for a species <0.05 for all checklists in the data set), so class
imbalance was also a concern (He & Garcia, 2009). We spatially undersampled
the data (e.g. Robinson, Ruiz-Gutierrez, & Fink, 2017) by
first creating a hexagonal grid of 3.5 km (~10 times the radius of each
survey) cells over the region from which our observations came via
the dggridR package (Barnes et al., 2018) in R (R Core Team, 2019).
This was done to reduce the chance that we selected overlapping
observations at each run of each model. We then split the data for
a single species into checklists on which the species was detected
(positive observations) and those on which the species was not de-
tected (negative observations). We selected one checklist from the
negative observations from within each grid cell and recombined the
filtered negative observations with the positive observations. As almost
all of the spatial bias was from the negative observations (i.e.
only a small percentage of the total number of observations were
positive observations), this procedure relieves much of the spatial
bias, and because only negative observations were filtered out of the
data set, class imbalance is also addressed here (King & Zeng, 2001;
Robinson et al., 2017).
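A minimal sketch of the spatial undersampling step is given below. For simplicity it assigns records to square grid cells on projected coordinates instead of the hexagonal dggridR cells used in the study, and the field names (`x_km`, `y_km`, `detected`) are hypothetical.

```python
import random
from collections import defaultdict


def spatially_undersample(records, cell_km=3.5, seed=0):
    """Keep every detection and one randomly chosen non-detection per grid cell.

    Assumes each record has projected coordinates in kilometres ("x_km", "y_km")
    and a truthy "detected" field when the species was recorded.
    """
    rng = random.Random(seed)
    positives = [r for r in records if r["detected"]]
    negatives_by_cell = defaultdict(list)
    for r in records:
        if not r["detected"]:
            # Square cells stand in for the hexagonal cells of the original grid.
            cell = (int(r["x_km"] // cell_km), int(r["y_km"] // cell_km))
            negatives_by_cell[cell].append(r)
    negatives = [rng.choice(group) for group in negatives_by_cell.values()]
    return positives + negatives
```

Calling the function with different seeds yields different draws of non-detections, which is how repeated sampling can make use of negative observations that a single draw would discard.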
To alleviate the effects of class imbalance, after spatially sam-
pling eBird checklists for training distribution and population trend
models, Fink et al. (2019) oversampled eBird checklists for species
that had a prevalence rate of less than 25%. After spatially under-
sampling our data for the current study, class imbalance was still a
concern, as our undersampling improved the imbalance, but it did
not improve class balance to 25% for species other than Yellowlegs.
Following the recommendation of Fink et al. (2019), we oversampled
the positive observations for each of our species and data sets if
prevalence was below 25% before training the distribution models.
We used the synthetic minority oversampling technique (SMOTE;
Chawla, Bowyer, Hall, & Kegelmeyer, 2002) to create one new ex-
ample of the positive class in the training data set for every positive
observation in the spatially undersampled data set. The SMOTE al-
gorithm does not create exact copies of the positive class as in tra-
ditional oversampling, but instead creates examples of the positive
class that occupy the parameter space between a randomly chosen
positive observation and a nearest neighbour. We did not randomly
undersample our data as is recommended when using SMOTE
(Chawla et al., 2002) because our data had already been spatially
undersampled as described above. The spatial undersampling and
SMOTE procedure were applied to each of the data sets. As the spatial
undersampling randomly chooses from negative observations within
a grid cell, many negative observations may not be included in training
the model. Therefore, we sampled each of the four data sets in
the study using this procedure 100 times, creating 100 unique data
sets. If a data set already met the 25% prevalence criterion described
above, the data were only spatially undersampled to create the
100 unique data sets.
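The interpolation at the heart of SMOTE can be illustrated with a short, dependency-free Python sketch. It follows the general description in Chawla et al. (2002) but is not the implementation used in this study; the choice of `k` and the use of Euclidean distance on raw covariates are assumptions.

```python
import math
import random


def smote(positives, k=5, seed=0):
    """Create one synthetic example per positive observation.

    `positives` is a list of equal-length numeric feature vectors. Each synthetic
    point lies on the line segment between a positive observation and one of its
    k nearest positive neighbours.
    """
    rng = random.Random(seed)
    if len(positives) < 2:
        return []  # no neighbour to interpolate towards
    synthetic = []
    for i, x in enumerate(positives):
        others = [p for j, p in enumerate(positives) if j != i]
        others.sort(key=lambda p: math.dist(x, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic
```

Generating one synthetic example per detection doubles the number of positives, which raises prevalence without producing exact duplicates of existing records.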
2.3 | Analysis
We selected and spatially subsampled (but did not oversample) 15%
of the combined data set to be the testing data for evaluation of
each model. This test set was selected because our goal is to make
accurate predictions across the entire Central Valley and give equal
importance to each location. Therefore, the test set must include
observations across the spatial extent of evaluation and be spatially
balanced. For the training data sets, we removed any checklist or
point count that was in the test set. We repeated this process 100
times, creating 100 unique data sets against which our models were
evaluated.
We used the R package “ranger” (Wright, Wager, Probst, &
Maintainer, 2019) to train a random forest model for each of the
seven species (or combined species; Table 1) and for each of four
data sets: (a) TNC point counts alone, (b) eBird checklists alone, (c)
an oversampled eBird data set, and (d) the combined TNC and eBird
data set. For each species, 1,000 trees were grown in the ensemble.
The number of variables from which the model could select at
each split for each tree was initially set to the square root of the
number of variables included in the model ($\sqrt{p}$ for $p$ variables); however,
we also allowed this to be half, double and triple this common
rule of thumb. We selected the output from the model that maximized
accuracy (Data S1). We evaluated the accuracy of the models
using multiple predictive performance metrics (PPMs). We did not
evaluate the models using the out of bag (OOB) metric calculated
by random forest internally as it favours total accuracy over correctly
predicting the minority class and is highly sensitive to class
imbalance. We used the test data set to evaluate mean squared error
(MSE) between the model predictions of presence or absence and
the true presence or absence in the test set. We also evaluated error
using Brier score (Brier, 1950), the mean squared error of the probabilistic
model predictions and the true presence or absence in the
test set. We evaluated each model's ability to rank positive observations
higher than negative ones using the area under the curve (AUC;
Fielding & Bell, 1997). We evaluated each model's ability to predict
presence or absence using Cohen's Kappa (Kappa; Cohen, 1960) and
its components, sensitivity (true positive rate), and specificity (true
negative rate). We produced distribution maps for each species and
data set, and we recorded the predictor importance metrics from
models trained on each data set. All variables included in the model
may be found in Data S2.
TABLE 1 List of species for which we modelled spring distribution in the Central Valley of California with each of the four data sets
Common Name               Latin Name                  Abbreviation in Tables & Figures
American Avocet           Recurvirostra americana     AMAV
Dunlin                    Calidris alpina             DUNL
Greater Yellowlegs^a      Tringa melanoleuca          YLEG^a
Least Sandpiper           Calidris minutilla          LESA
Lesser Yellowlegs^a       Tringa flavipes             YLEG^a
Long-billed Curlew        Numenius americanus         LBCU
Long-billed Dowitcher^b   Limnodromus scolopaceus     DOWI^b
Short-billed Dowitcher^b  Limnodromus griseus         DOWI^b
Western Sandpiper         Calidris mauri              WESA
^a Combined in analysis and referred to collectively as “Yellowlegs” in this study (e.g. Golet et al., 2018)
^b Combined in analysis and referred to collectively as “Dowitcher” in this study (e.g. Golet et al., 2018)
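For readers less familiar with the predictive performance metrics listed above, the following Python sketch shows how each can be computed from test-set labels and predicted probabilities. The 0.5 threshold used to convert probabilities into presence or absence is an assumption made for the example; the threshold used in the study is not stated here.

```python
def brier_score(y_true, p_pred):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


def confusion(y_true, p_pred, threshold=0.5):
    """Counts of true/false positives and negatives at a given threshold."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, p_pred):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    return tp / (tp + fn)  # true positive rate


def specificity(tn, fp):
    return tn / (tn + fp)  # true negative rate


def cohens_kappa(tp, fp, tn, fn):
    """Agreement between predictions and observations, corrected for chance."""
    n = tp + fp + tn + fn
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    return (observed - expected) / (1 - expected)


def auc(y_true, p_pred):
    """Probability that a random presence is ranked above a random absence."""
    pos = [p for y, p in zip(y_true, p_pred) if y == 1]
    neg = [p for y, p in zip(y_true, p_pred) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

Brier score and AUC are computed from the probabilities directly, whereas Kappa, sensitivity and specificity depend on the chosen threshold, which is one reason the metrics can disagree for rare species.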
3 | RESULTS
The fields that are stored as part of the eBird project allowed us to
filter the data to follow a protocol similar to that of the TNC point counts.
This filtering reduced uncertainty in the location of eBird checklists
and eliminated the need for coarse level summaries of the data for
integration (Miller et al., 2019; Pacifici et al., 2017). For all species,
the combined data set had higher predictive accuracy than either the
TNC point counts or the eBird checklist data sets on their own; however,
error (MSE, Brier score) was relatively low and AUC was relatively
high for all species and all data sets, particularly for the three
data sets where eBird checklists were included (Figure 2; Figure S3).
Improvement in accuracy varied among species; however, for the
Kappa statistic (predicting presence and absence against the test
set), the combined data set was an improvement over all three of
the other data sets evaluated (with the exception of Western Sandpiper;
−3% to 25.5% improvement; Figure 3). The improvement or loss in the
error statistics was usually negligible (with the exception of LBCU;
Figure 4; Figure S4) for the combined data set versus the next best
data set; however, error metrics were already very low for most
species, so great improvement here was not likely. Likewise, AUC
was relatively high for most species and for the three data sets con-
taining eBird checklists; therefore, the gain was negligible for many
species (improvement of up to 6%; Figure S4). For the few species/
metrics combinations where the combined data set was not the best
performing model, it was the second best, with the best being the
eBird checklists with simulated eBird checklists added. Neither TNC
point counts nor eBird checklists alone performed better for any
metric than the data set where the two were combined.
FIGURE 2 Box plots for MSE, AUC, Brier score, Cohen's Kappa, Sensitivity and Specificity for models run for Least Sandpiper (LESA) with
each data set: TNC point counts (TNC), eBird checklists (eBird), eBird checklists with additional simulated eBird data (eBird + sim), and the
data set of TNC point counts and eBird checklists combined (Combined)
We produced distribution maps for each species (Figure 5;
Figure S5) to determine whether there was a difference in the overall
pattern of distribution estimated by models trained on the different
data sets. For most species, there was greater contrast
between the presence estimates and absence estimates for the
combined data set when compared to the other three. Visually,
this means more obvious differentiation between high predicted
probability of presence and absence (e.g. hotter “hotspots” and
darker regions where absence is predicted; Figure 5; Figure S5).
We collected the important variables identified by the model (via
Gini index) when run with each data set. For all species, the importance
of rice and water was apparent, as they were among the most
important variables for each of the data sets; however, it was not
until the data sets were combined that both rice and the Water
Tracker layer had a high importance score (Figure 6; Figure S6). The
combination of the data allowed the models to home in on rice and
surface water as highly important, whereas they had only moderate
importance when using the other data sets. This is
likely because only 1.5% of eBird checklists in our study come from
locations where the per cent landcover is >50% rice. Conversely,
almost 60% of the TNC point counts come from locations where
the per cent landcover is >50% rice.
FIGURE 3 Average improvement in
Cohen's Kappa for distribution models of
each species when using the combined
data set versus the data set that provided
the next highest Kappa value
FIGURE 4 Average improvement or
loss in Brier score for distribution models
of each species when using the combined
data set versus the data set that provided
the next lowest value
4 | DISCUSSION
Our results lend further support to efforts looking to combine data
from multiple sources to improve the inference and/or predictive abil-
ity of distribution models (Miller et al., 2019; Pacifici et al., 2017). We
have shown that citizen science data can be filtered to generate a high-
quality data set that can closely match the resolution and sampling
approach of structured surveys, supporting the call for current and future
citizen science projects to collect essential information related to
location and effort, as well as complete surveys (Kelling et al., 2019).
The integration of survey data with the filtered citizen science data in
eBird resulted in improved inference, predictive ability, and ultimately
increased the extent of inference of the structured surveys. In turn,
the structured surveys were able to improve the ecological inference
of the citizen science data, by improving the representation of sampled
habitats that are key for shorebird species. Most importantly, the
practical approach we have shown for data integration is an improvement
on simpler “data pooling” approaches for data integration and
can be used to improve the efficiency of designing biological surveys
to collect distribution information in the context of larger, citizen science
monitoring efforts, ultimately reducing the financial and time
expenditures typically required of monitoring programs and focused
research (Reich, Pacifici, & Stallings, 2018).
The combined data set resulted in improved accuracy across all
metrics relative to the TNC survey data or eBird data alone, for all
of the species considered in this study. Our combined approach also
predicted presence/absence, measured as agreement with a test data set, more
accurately than the other data sets considered
(i.e. TNC alone, eBird alone and eBird plus simulated eBird). The observed
improvement in accuracy is not likely a function of simply
increasing sample size, as augmenting eBird data with simulated data
did not show the same levels of improvement. The addition of the
TNC survey data to eBird data improved the coverage of the data
both spatially and in targeted habitats. Previous work has shown that
migrating shorebirds heavily use flooded rice fields in the Central
Valley, and will also use unflooded rice fields (Elphick & Oring, 1998;
Golet et al., 2018). Rice fields are the main focal habitat of the TNC
BirdReturns program; ~60% of the survey data from this project were
collected at sites where at least 50% of the landcover is rice. In
contrast, the majority of the rice fields in the Central Valley are not
accessible to regular eBird participants, as they are often on private
lands. This is likely why we see so few (~1.5%) eBird checklists from
the Central Valley in locations where at least 50% of the landcover
is rice. The addition of more data from high-quality habitat is what
provided the improvement in accuracy of the combined eBird and
TNC data set. This highlights the importance and value of more targeted
research and survey efforts within the context of large-scale
citizen science monitoring efforts. Given that private lands make up
more than half of the land in the United States, supporting wildlife
monitoring efforts on privately held lands that are linked into large-
scale efforts such as eBird can greatly improve inferences on species
distributions and habitat associations across scales of interest for all
stakeholders (Hilty & Merenlender, 2003).
Interestingly, models based only on the TNC point counts strug-
gled to learn where each species was likely to be absent across the
entire Central Valley because the data came largely from “good”
shorebird habitat in the northern portion of the Central Valley. The
original intent of this project was not focused on species distribution
modelling, so this does not come as a surprise. Because
absence information allows for more accurate distribution models
(Brotons, Thuiller, Araújo, & Hirzel, 2004), the addition of eBird data
was valuable: it provided information on where species are likely to be
absent and improved inferences about habitat types that benefit
shorebirds but were not surveyed as part of the TNC monitoring
efforts. On the other hand, the models using the eBird data alone
were able to predict absences well and had relatively high accuracy
when predicting presence/absence, but overall ecological inference
was improved when combined with the point count data set. The
TNC point counts acted as targeted surveys in under-surveyed hab-
itat, which previous work has shown can improve the accuracy of
distribution models using eBird checklists (e.g. Xue, Davies, Fink,
Wood, & Gomes, 2016). The complementary nature of the data sets
is also shown when examining the impor tant predictors. For Dunlin
(Figure 6), the rice la nd cov er and the Water Tracker layer ar e im po rt-
ant predictors for each data set, however, their impor tance values
more than double when the combined data set is used to train the
model. While other species do not show such a drastic shift in the
importance value for these two habitat variables, the pattern is sim-
ilar for most in that these variables become more important to the
model's predictions when the data sets are combined.
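To make the importance values discussed above more concrete: Gini importance in a random forest is built from the drop in Gini impurity that each split on a variable achieves, accumulated over splits and trees. The sketch below computes that drop for a single split on toy detection data. It illustrates the quantity only; it is not the authors' model, and the variable names, threshold values and data are assumptions.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 detection labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def impurity_decrease(values, labels, threshold):
    """Weighted drop in Gini impurity from splitting on values <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted_child = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted_child


# Toy data (invented): detections line up with rice cover but not with "noise"
detected = [0, 0, 0, 1, 1, 1]
rice = [0.0, 0.1, 0.2, 0.6, 0.7, 0.9]
noise = [0.3, 0.9, 0.1, 0.2, 0.8, 0.4]

rice_gain = impurity_decrease(rice, detected, threshold=0.5)
noise_gain = impurity_decrease(noise, detected, threshold=0.35)
print(rice_gain, noise_gain)
```

A variable that separates detections from non-detections cleanly, as rice cover does in the toy data, earns a large decrease, while an uninformative variable earns almost none. As the Figure 6 caption notes for the real models, such values say nothing about the direction of the relationship.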
The approach we present for combining data is applicable for
other conservation monitoring programs and ongoing research efforts,
given that a significant amount of research is conducted over
small spatial and temporal scales (Heidorn, 2008) and is often difficult
to scale up without similarly structured data (Poisot, Bruneau,
Gonzalez, Gravel, & Peres-Neto, 2019). The data fields that exist in
eBird data facilitated further processing and filtering to match the
structure of the individual data set of interest and allowed us to
leverage the strengths of both data sources. However, even citizen
science data sets that do collect effort information are often lacking
information on sampling locations, although incentivizing participants
to collect data in these data-poor locations has been shown
to improve the accuracy of distribution models (e.g. avicaching; Xue
et al., 2016). Similarly, data from smaller-scale, individual research
projects can also help fill in these gaps in citizen science data. Given
that inferences from small-scale studies cannot be extrapolated to
larger spatial extents (Sandel & Smith, 2009), the approach we present
here for combining data from small-scale studies with citizen science
data filtered to match the existing data structure will increase
the overall extent of inference, and improve our ability to conceptualize
conservation actions within the larger context of the target
population(s) of interest.
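The filter-then-pool idea described above can be sketched in a few lines. This is a hedged illustration rather than the filter used in the study: the `Checklist` fields, the function names and the effort thresholds (60 minutes, 0.3 km) are assumptions chosen for the example, and any real application would set them to match the structured survey at hand.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Checklist:
    source: str          # "TNC" or "eBird" (labels assumed for illustration)
    complete: bool       # True if the observer reported every species detected
    duration_min: float  # survey duration in minutes
    distance_km: float   # distance travelled in kilometres
    detected: bool       # whether the focal species was recorded


def filter_to_survey_structure(
    checklists: List[Checklist],
    max_duration_min: float = 60.0,  # assumed threshold
    max_distance_km: float = 0.3,    # assumed threshold
) -> List[Checklist]:
    """Keep complete checklists whose effort resembles a short, near-stationary count."""
    return [
        c for c in checklists
        if c.complete
        and 0 < c.duration_min <= max_duration_min
        and c.distance_km < max_distance_km
    ]


def combine(survey: List[Checklist], citizen: List[Checklist]) -> List[Checklist]:
    """Pool structured survey records with filtered citizen science checklists."""
    return list(survey) + filter_to_survey_structure(citizen)


# Toy records (invented values, not the study data)
survey = [Checklist("TNC", True, 10, 0.0, True), Checklist("TNC", True, 10, 0.0, False)]
citizen = [
    Checklist("eBird", True, 45, 0.1, False),   # kept
    Checklist("eBird", False, 30, 0.1, True),   # dropped: incomplete checklist
    Checklist("eBird", True, 180, 0.2, False),  # dropped: too long
    Checklist("eBird", True, 20, 5.0, True),    # dropped: travelled too far
]
combined = combine(survey, citizen)
print(len(combined))  # 3
```

Keeping the source label on every record means the pooled data can still be split later, for example to compare models trained on each source alone with the combined model.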
Information on species distributions across large scales is one of
the most fundamental information needs for basic and applied research
fields in ecology. However, this level of information often requires
large-scale, coordinated surveys that can be time consuming
and costly to manage. In addition, models used to estimate species
distributions are often data hungry, and are often unable to generate
information at the spatial and temporal scales that are most relevant
for research and conservation efforts. Citizen science data are a growing
source of additional, minimal-cost surveys for a number of taxa
including invertebrates, bats, marine mammals, lichens, amphibians
and more (Deutsch, Bilenca, & Agostini, 2017; Devictor, Whittaker,
& Beltrame, 2010; Howard et al., 2010; Newson et al., 2015;
Tonachella et al., 2012; Westgate et al., 2015). We have shown the
utility of combining survey data with semi-structured citizen science
data (Kelling et al., 2019) for improving accuracy in species distribution
models, which can result in more efficient and cost-effective
surveys (Miller et al., 2019; Pacifici et al., 2017; Reich et al., 2018).
The simple method we present for citizen science data allows for the
integration of a small-scale point count data set with eBird checklists
and can be used to integrate similar types of data being collected
by citizen scientists (e.g. camera traps) with more localized efforts
(e.g. patrolling by park rangers), ultimately improving our ecological
knowledge on the distribution and habitat associations of species of
conservation concern worldwide.
FIGURE 5 Predicted probability of an expert surveyor detecting a Dunlin in the spring of 2016 on a one-hour long checklist where the
distance travelled was < 300 m, as estimated by the model using TNC point counts alone (A), eBird checklists alone (B), eBird checklists with
added simulated eBird checklists (C), and the combined data set of TNC point counts and eBird checklists (D)
We would like to thank Point Blue for supplying Central Valley water
data. We would also like to thank Tom Auer, Ali Johnston, Steve
Kelling and Matt Reiter for comments that improved this project and
the manuscript.
The Nature Conservancy point count data are available from the
Dryad Digital Repository:
The eBird data used for analyses in this manuscript may be
downloaded from
Cropland Data Layer may be downloaded from: https://www. rch_and_Scien ce/Cropl and/Relea se/index.php
Water Tracker Data may be downloaded from Point Blue: https://
data.point ater/
Orin J. Robinson
Barnes, R., Sahr, K., Evenden, G., Johnson, A., Warmerdam, F., &
Maintainer, ]. (2018). Package “dggridR” Type Package Title Discrete
Global Grids. Retrieved from s/dggri dR/.
Boryan, C., Yang, Z., Mueller, R., & Craig, M. (2011). Monitoring US
agriculture: The US department of agriculture, national agricultural
statistics service, cropland data layer program. Geocarto International,
26(5), 341–358. ht tps :// 0/10106 0 49.2011.56230 9
Bradter, U., Mair, L., Jönsson, M., Knape, J., Singer, A., & Snäll, T.
(2018). Can opportunistically collected Citizen Science data fill a
data gap for habitat suitability models of less common species?
Methods in Ecology and Evolution, 9(7), 1667–1678.
https://doi.org/10.1111/2041-210X.13012
Brier, G. W. (1950). Verification of forecasts expressed in terms of
probability. Monthly Weather Review, 78(1), 1–3.
https://doi.org/10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2
Brotons, L., Thuiller, W., Araújo, M. B., & Hirzel, A. H. (2004).
Presence-absence versus presence-only modelling methods for predicting
bird habitat suitability. Ecography, 27(4), 437–448. https://doi.
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002).
SMOTE: Synthetic minority over-sampling technique. Journal of
Artificial Intelligence Research, 16, 321–357.
FIGURE 6 Important variables selected by the model trained on each data set for Dunlin as selected by Gini importance index; TNC
point counts alone (A), eBird checklists alone (B) and the combined data set of TNC point counts and eBird checklists (C). Please note the
different scales for each panel and that this only represents the importance of each variable, not the direction of its relationship with
probability of detection
Cohen, J. (1960). A coefficient of agreement for nominal scales.
Educational and Psychological Measurement, 20(1), 37–46.
https://doi.org/10.1177/001316446002000104
Deutsch, C., Bilenca, D., & Agostini, G. (2017). In search of the horned
frog (Ceratophrys ornata) in Argentina: Complementing field surveys
with citizen science. Herpetological Conservation and Biology, 12,
664–672.
Devictor, V., Whittaker, R. J., & Beltrame, C. (2010). Beyond scarcity:
Citizen science programmes as useful tools for conservation
biogeography. Diversity and Distributions, 16, 354–362.
https://doi.org/10.1111/j.1472-4642.2009.00615.x
Elphick, C. S., & Oring, L. W. (1998). Winter management of Californian
rice fields for waterbirds. Journal of Applied Ecology, 35(1), 95–108.
Fielding, A. H., & Bell, J. F. (1997). A review of methods for the
assessment of prediction errors in conservation presence/absence models.
Environmental Conservation, 24(1), 38–49. https :// .1017/
S0376 89299 7000088
Fink, D., Auer, T., Johnston, A., Ruiz-Gutierrez, V., Hochachka, W. M.,
& Kelling, S. (2019). Modeling Avian Full Annual Cycle Distribution
and Population Trends with Citizen Science Data. BioRxiv, 251868,
htt ps:// 8
Fithian, W., Elith, J., Hastie, T., & Keith, D. A. (2015). Bias correction in
species distribution models: Pooling survey and collection data for
multiple species. Methods in Ecology and Evolution, 6(4), 424–438.
https://doi.org/10.1111/2041-210X.12242
Fox, R., Oliver, T. H., Harrower, C., Parsons, M. S., Thomas, C. D., & Roy,
D. B. (2014). Long-term changes to the frequency of occurrence of
British moths are consistent with opposing and synergistic effects of
climate and land-use changes. Journal of Applied Ecology, 51, 949–957.
https://doi.org/10.1111/1365-2664.12256
Geldmann, J., Heilmann-Clausen, J., Holm, T. E., Levinsky, I., Markussen,
B., Olsen, K., … Tøttrup, A. P. (2016). What determines spatial bias in
citizen science? Exploring four recording schemes with different
proficiency requirements. Diversity and Distributions, 22(11), 1139–1149.
https://doi.org/10.1111/ddi.12477
Golet, G. H., Low, C., Avery, S., Andrews, K., McColl, C. J., Laney, R., &
Reynolds, M. D. (2018). Using ricelands to provide temporary shorebird
habitat during migration. Ecological Applications, 28(2), 409–426.
Gouraguine, A., Moranta, J., Ruiz-Frau, A., Hinz, H., Reñones, O.,
Ferse, S. C. A., … Smith, D. J. (2019). Citizen science in data and
resource-limited areas: A tool to detect long-term ecosystem
changes. PLoS ONE, 14(1), 1–14.
Guillera-Arroita, G. (2017). Modelling of species distributions, range
dynamics and communities under imperfect detection: Advances,
challenges and opportunities. Ecography, 40(2), 281–295.
https://doi.org/10.1111/ecog.02445
He, H., & Garcia, E. A. (2009). Learning from Imbalanced Data. IEEE
Transactions on Knowledge and Data Engineering, 21(9), 1263–1284.
Han, W., Yang, Z., Di, L., & Mueller, R. (2012). CropScape: A Web service
based application for exploring and disseminating US conterminous
geospatial cropland data products for decision support. Computers
and Electronics in Agriculture, 84, 111–123.
Heidorn, B. (2008). Shedding light on the dark data in the long tail of
science. Library Trends, 57(2), 280–299. https://doi.org/10.1353/
Hilty, J., & Merenlender, A. M. (2003). Studying biodiversity on
private lands. Conservation Biology, 17(1), 132–137.
https://doi.org/10.1046/j.1523-1739.2003.01361.x
Howard, E., Aschen, H., & Davis, A. K. (2010). Citizen science
observations of monarch butterfly overwintering in the southern United
States. Psyche: A Journal of Entomology, 2010, 1–6.
https://doi.org/10.1155/2010/689301
Johnston, A., Fink, D., Hochachka, W. M., & Kelling, S. (2018). Estimates
of observer expertise improve species distributions from citizen
science data. Methods in Ecology and Evolution, 9(1), 88–97.
https://doi.org/10.1111/2041-210X.12838
Johnston, A., Fink, D., Reynolds, M. D., Hochachka, W. M., Sullivan, B.
L., Bruns, N. E., … Kelling, S. (2015). Abundance models improve
spatial and temporal prioritization of conservation resources. Ecological
Applications, 25(7), 1749–1756. https://doi.org/10.1890/14-1826.1
Johnston, A., Hochachka, W., Strimas-Mackey, M., Gutierrez, V. R.,
Robinson, O., Miller, E., … Fink, D. (2019). Best practices for making
reliable inferences from citizen science data: Case study using
eBird to estimate species distributions. BioRxiv, 574392.
https://doi.org/10.1101/574392
Kelling, S., Johnston, A., Bonn, A., Fink, D., Ruiz-Gutierrez, V., Bonney,
R., … Guralnick, R. (2019). Using semistructured surveys to improve
citizen science data for monitoring biodiversity. BioScience, 69(3),
170–179. https://doi.org/10.1093/biosci/biz010
King, G., & Zeng, L. (2001). Logistic Regression in Rare Events Data.
Political Analysis, 9(2), 137–163. ht tps:// djour
Lyon, N. J., Debinski, D. M., & Rangwala, I. (2019). Evaluating the utility
of species distribution models in informing climate change-resilient
grassland restoration strategy. Frontiers in Ecology and Evolution,
7(FEB), 1–8.
Matthiopoulos, J., Hebblewhite, M., Aarts, G., & Fieberg, J. (2011).
Generalized functional responses for species distributions. Ecology,
92(3), 583–589.
Miller, D. A. W., Pacifici, K., Sanderlin, J. S., & Reich, B. J. (2019). The
recent past and promising future for data integration methods to
estimate species’ distributions. Methods in Ecology and Evolution, 10(1),
22–37. https://doi.org/10.1111/2041-210X.13110
Newson, S. E., Evans, H. E., & Gillings, S. (2015). A novel citizen science
approach for large-scale standardised monitoring of bat activity and
distribution, evaluated in eastern England. Biological Conservation,
191, 38–49.
Pacifici, K., Reich, B. J., Miller, D. A. W., Gardner, B., Stauffer, G., Singh,
S., … Collazo, J. A. (2017). Integrating multiple data sources in species
distribution modeling: A framework for data fusion*. Ecology, 98(3),
840–850. https://doi.org/10.1002/ecy.1710
Poisot, T., Bruneau, A., Gonzalez, A., Gravel, D., & Peres-Neto, P. (2019).
Ecological Data Should Not Be So Hard to Find and Reuse. Trends
in Ecology & Evolution, 34(6), 494–496.
R Core Team. (2019). R: A language and environment for statistical
computing. Vienna, Austria: R Foundation for Statistical Computing.
Retrieved from https://www.R-proje
Reich, B. J., Pacifici, K., & Stallings, J. W. (2018). Integrating auxiliary
data in optimal spatial design for species distribution modelling.
Methods in Ecology and Evolution, 9(6), 1626–1637.
https://doi.org/10.1111/2041-210X.13002
Reiter, M. E., Elliott, N. K., Barbaree, B., & Moody, D. (2018). Water Tracker:
An Automated Open Surface Water Tracking System for California’s
Central Valley. Point Blue Conservation
Science. Retrieved from w racker
Reynolds, M. D., Sullivan, B. L., Hallstein, E., Matsumoto, S., Kelling, S.,
Merrifield, M., … Morrison, S. A. (2017). Dynamic conservation for
migratory species. Science Advances, 3(8), e1700707.
https://doi.org/10.1126/sciadv.1700707
Robinson, O. J., Ruiz-Gutierrez, V., & Fink, D. (2017). Correcting for bias
in distribution modelling for rare species using citizen science data.
Diversity and Distributions, 24(4), 460–472. https://doi.org/10.1111/
Sandel, B., & Smith, A. B. (2009). Scale as a lurking factor: Incorporating
scale-dependence in experimental ecology. Oikos, 118(9), 1284–1291.
https://doi.org/10.1111/j.1600-0706.2009.17421.x
Sauer, J. R., Pardieck, K. L., Ziolkowski, D. J., Smith, A. C., Hudson,
M.-A.-R., Rodriguez, V., & Link, W. A. (2017). The first 50 years of the
North American Breeding Bird Survey. The Condor, 119(3), 576–593. r-17-83.1
Sofaer, H. R., Jarnevich, C. S., Pearse, I. S., Smyth, R. L., Auer, S., Cook,
G. L., … Hamilton, H. (2019). Development and Delivery of Species
Distribution Models to Inform Decision-Making. BioScience, 69(7),
544–557. i/biz045
Sullivan, B. L., Aycrigg, J. L., Barry, J. H., Bonney, R. E., Bruns, N.,
Cooper, C. B., … Kelling, S. (2014). The eBird enterprise: An integrated
approach to development and application of citizen science.
Biological Conservation, 169, 31–40.
Tonachella, N., Natasi, A., Kaufman, G., Maldini, D., & Rankin, R. W.
(2012). Predicting trends in humpback whale (Megaptera novaean-
gliae) abundance using citizen science. Pacific Conservation Biology,
18, 297–309.
Westgate, M. J., Scheele, B. C., Ikin, K., Hoefer, A. M., Beaty, R. M., Evans,
M., … Driscoll, D. A. (2015). Citizen Science Program Shows Urban
Areas Have Lower Occurrence of Frog Species, but Not Accelerated
Declines. PLoS ONE, 10(11), e0140973.
https://doi.org/10.1371/journal.pone.0140973
Wright, M. N., Wager, S., & Probst, P. (2019). Package “ranger” A Fast
Implementation of Random Forests.
Xue, Y., Davies, I., Fink, D., Wood, C., & Gomes, C. P. (2016). Behavior
identification in two-stage games for incentivizing citizen science
exploration. Lecture Notes in Computer Science (Including Subseries Lecture
Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 9892
LNCS, 701–717. -1_44
Zipkin, E. F., Royle, J. A., Dawson, D. K., & Bates, S. (2010). Multi-species
occurrence models to evaluate the effects of conservation and
management actions. Biological Conservation, 143(2), 479–484. https://
Orin Robinson is an ecologist at the Cornell Lab of Ornithology
interested in using and developing quantitative tools to learn
about vertebrate population and community ecology, and using
lessons learned to inform conservation.
Additional supporting information may be found online in the
Supporting Information section.
How to cite this article: Robinson OJ, Ruiz-Gutierrez V,
Reynolds MD, Golet GH, Strimas-Mackey M, Fink D.
Integrating citizen science data with expert surveys increases
accuracy and spatial extent of species distribution models.
Divers Distrib. 2020;26:976–986. https://doi.org/10.1111/
... The technology and software of our days have allowed the generation of a significant amount of biodiversity data, helpful to biological sciences (Dickinson et al. 2012, Ivanova & Shashkov 2021, Gueroun et al. 2022). Data about observations of organisms recorded by citizens are of immense value to ecological, evolutionary, and conservation research (Tiago et al. 2017, Petrovan et al. 2020, Robinson et al. 2020). The biodiversity information from citizen science is frequently opportunistic data, only presence records of organisms without a sampling design and without a method to standardize (Van Strien et al. 2013, Karppinen et al. 2022). ...
... The biodiversity information from citizen science is frequently opportunistic data, only presence records of organisms without a sampling design and without a method to standardize (Van Strien et al. 2013, Karppinen et al. 2022). However, the alliance of citizen science, field monitoring, and standardized surveys has revealed an increase in the spatial range of some species, distribution trends, and the distinct phenologies of populations in different ecoregions (Van Strien et al. 2013, Robinson et al. 2020, Gueroun et al. 2022). ...
... and Aurelia spp.; Lindsay et al. 2017, Scorrano et al. 2017, Lawley et al. 2021, this means that more than half of the total records of the colonial hydrozoan in the Mexican Atlantic were from biodiversity open-access databases generated through citizen science. These databases can play a crucial role in the evolutionary, ecological, and conservation research of P. porpita since they give access to taxonomic revisions (Telenius 2011), niche modeling (Telenius 2011, Tiago et al. 2017, Robinson et al. 2020), monitoring of population declines (Petrovan et al. 2020), and forecasts of management programs, to mention a few examples (Bradter et al. 2021). However, sometimes, it is necessary to do an exhaustive search of original resources and review each georeferenced record, as in the case of records in the OBIS database in this study; a similar situation is mentioned in other studies (e.g. ...
Access to digital technology and software allows the production of a significant amount of biodiversity data, including citizens' species records, combined with field monitoring and standardized surveys, which are valuable to biological sciences; these data can help to know the distribution of organisms like medusozoans. We compiled records of the presence of Porpita porpita from open-access databases, scientific articles, and field observations to know their current distribution on the Mexican Atlantic for the first time. The yearly records of P. porpita increased over time, adding up to 22 records, of which 18 were assigned to citizen science. Most records correspond to 2011-2022, from April to May, and the Veracruz and Quintana Roo states. The records of the species in warm months and states with long shores can be related to an increase in touristic activities and observations by citizens. On the Mexican Atlantic, it is suggested that the use of digital technological devices under citizen science practices for recording P. porpita and its compilation in open-access databases acts as the principal method for tracking the distribution of this species. This work evidenced the need for a solid research framework of knowledge for P. porpita in the Mexican Atlantic, and future research could combine citizen science records and fieldwork records and improve the relationship between biological and oceanographic data to understand their spatial-temporal distribution patterns.
... Among the plethora of different data integration methods proposed (e.g. Broennimann & Guisan, 2008; Gallien et al., 2012; Robinson et al., 2020), the most recent approaches focused on hierarchically combining big, unstructured citizen science data sets with small, scientifically structured data sets, often generated at different spatial resolution, extent and data type (i.e. presence-only and presence/absence; Ahmad Suhaimi et al., 2021; Chevalier et al., 2021; Fletcher et al., 2019; Grabow et al., 2022; Ovaskainen et al., 2016; Stuber et al., 2022). ...
... presence-only and presence/absence; Ahmad Suhaimi et al., 2021; Chevalier et al., 2021; Fletcher et al., 2019; Grabow et al., 2022; Ovaskainen et al., 2016; Stuber et al., 2022). Most of these studies explicitly compared the predictive performance of traditional versus integrated distribution models (Chevalier et al., 2021; Robinson et al., 2020; Zulian et al., 2021), without providing conclusive evidence to support the use of the latter (Ahmad Suhaimi et al., 2021; Simmonds et al., 2020). Only a small number of the most recent studies proposing data integration evaluated the contribution of this process to the effective sampling of species environmental niche and, consequently, its effect on model predictions on new areas/time intervals (see Chevalier et al., 2021; Scherrer et al., 2021). ...
... Besides the idea that data integration does not automatically imply a reduction in niche truncation, our outcomes suggest that pooled ENMs are not necessarily more accurate than models calibrated with separate data sets, since we did not find any significant difference among the evaluation metrics achieved by the three ENMs groups. About this outcome, there is inconsistent evidence in literature, with some authors underlining the superior predictive abilities of integrated modelling approaches (Robinson et al., 2020; Zulian et al., 2021), while others reported no significant differences (Chevalier et al., 2022), or even contexts where integration approaches might only be a suboptimal choice (Ahmad Suhaimi et al., 2021; Simmonds et al., 2020). At any rate, what is firmly supported by our findings is that a net reduction in niche truncation (whenever it happens) significantly explains an increase in future biological invasion risk, substantially confirming our second working hypothesis. ...
Citizen science initiatives have been increasingly used by researchers as a source of occurrence data to model the distribution of alien species. Since citizen science presence-only data suffer from some fundamental issues, efforts have been made to combine these data with those provided by scientifically structured surveys. Surprisingly, only a few studies proposing data integration evaluated the contribution of this process to the effective sampling of species' environmental niches and, consequently, its effect on model predictions on new time intervals. We relied on niche overlap analyses, machine learning classification algorithms and ecological niche models to compare the ability of data from citizen science and scientific surveys, along with their integration, in capturing the realized niche of 13 invasive alien species in Italy. Moreover, we assessed differences in current and future invasion risk predicted by each data set under multiple global change scenarios. We showed that data from citizen science and scientific surveys captured similar species niches though highlighting exclusive portions associated with clearly identifiable environmental conditions. In terrestrial species, citizen science data granted the highest gain in environmental space to the pooled niches, determining an increased future biological invasion risk. A few aquatic species modelled at the regional scale reported a net loss in the pooled niches compared to their scientific survey niches, suggesting that citizen science data may also lead to contraction in pooled niches. For these species, models predicted a lower future biological invasion risk. These findings indicate that citizen science data may represent a valuable contribution to predicting future spread of invasive alien species, especially within national-scale programmes. 
At the same time, citizen science data collected on species poorly known to citizen scientists, or in strictly local contexts, may strongly affect the niche quantification of these taxa and the prediction of their future biological invasion risk.
... roads or other human infrastructure) or interest in species and areas of conservation concern (Botts et al., 2011; Steger et al., 2017; Stolar & Nielsen, 2015; Tulloch et al., 2013). Thus, a trade-off between the quality and quantity of citizen science data is often assumed (Robinson et al., 2020). ...
... In peer-reviewed literature, distribution data for some ground-feeding birds are biased towards threatened species and protected areas (Boakes et al., 2010). Because sampling biases in different datasets can be complementary, combining citizen science observations with professional data has increased the spatial coverage of monitoring for shorebirds (Robinson et al., 2020), insects (Hochmair et al., 2020; Wilson et al., 2020; Zapponi et al., 2017), large mammals (Farhadinia et al., 2018) and easily recognizable non-native aquatic species (Lehtiniemi et al., 2020). Following brief training, citizen scientists have produced similar field estimates of species cover and occurrence as professional scientists (Crall et al., 2011; Danielsen et al., 2014). ...
... The effect of these treatments can vary depending on the species modelled, but several studies have shown that filtered eBird data can successfully produce more accurate models that are comparable to those based on more structured survey data (Robinson et al., 2018; Steen et al., 2019, 2021). Additionally, eBird-based studies have demonstrated that citizen science and professional observations can contain unique biases (Coxen et al., 2017; Robinson et al., 2020; Tanner et al., 2020). Combining multiple datasets to potentially mitigate their respective biases has thus become more common; integration methods range from data pooling to more formal techniques that can account for sampling issues (Fletcher et al., 2019). ...
Aim Citizen science is a cost‐effective potential source of invasive species occurrence data. However, data quality issues due to unstructured sampling approaches may discourage the use of these observations by science and conservation professionals. This study explored the utility of low‐structure iNaturalist citizen science data in invasive plant monitoring. We first examined the prevalence of invasive taxa in iNaturalist plant observations and sampling biases associated with these data. Using four invasive species as examples, we then compared iNaturalist and professional agency observations and used the two datasets to model suitable habitat for each species. Location Hawai'i, USA. Methods To estimate the prevalence of invasive plant data, we compared the number of species and observations recorded in iNaturalist to botanical checklists for Hawai'i. Sampling bias was quantified along gradients of site accessibility, protective status and vegetation disturbance using a bias index. Habitat suitability for four invasive species was modelled in Maxent, using observations from iNaturalist, professional agencies and stratified subsets of iNaturalist data. Results iNaturalist plant observations were biased towards invasive species, which were frequently recorded in areas with higher road/trail density and vegetation disturbance. Professional observations of four example invasive species tended to occur in less accessible, native‐dominated sites. Habitat suitability models based on iNaturalist versus professional data showed moderate overlap and different distributions of suitable habitat across vegetation disturbance classes. Stratifying iNaturalist observations had little effect on how suitable habitat was distributed for the species modelled in this study. Main Conclusions Opportunistic iNaturalist observations have the potential to complement and expand professional invasive plant monitoring, which we found was often affected by inverse sampling biases. 
Invasive species represented a high proportion of iNaturalist plant observations, and were recorded in environments that were not captured by professional surveys. Combining the datasets thus led to more comprehensive estimates of suitable habitat.
... Interestingly, some of the spatial biases present in professional monitoring demonstrate opposite patterns to those observed in community science datasets. Professional monitoring is less often concentrated near accessible areas such as roads and rivers (Hughes et al., 2021; Yang et al., 2014); and there is some evidence that structured protocols are biased towards natural or 'ideal' habitat, even when the nominal intention is to sample modified or fragmented landscapes (Martin, Blossey, et al., 2012; Robinson et al., 2020; Zhang et al., 2021). Because of these biases, models built around data collected by professionals can result in poorer inferences than using community science, performing worse when predicting species absences (Robinson et al., 2018, 2020). ...
... Approaches exist to limit the effects of spatial sampling biases, including spatially subsetting the data (e.g. Robinson et al., 2020) or using spatially explicit models that account for differences in data density (e.g. Fink et al., 2014). ...
Conservation planning requires extensive amounts of data, yet data collection is expensive, and there is often a trade‐off between the quantity and quality of data that can be collected. Researchers are increasingly turning to community science programs to meet their biodiversity data needs, yet the reliability of such data sources is still a common source of debate. Here, we argue that professionally collected data are subject to many of the limitations and biases present in community science datasets. We explore four common criticisms of community science data, and comparable issues that exist in data collected by experts: spatial biases, observer variability, taxonomic biases and the misapplication of data. We then outline solutions to these problems that have been developed to make better use of community science data, but can (and should) be equally applied to both kinds of data. We highlight four main solutions based on research using community science data that can be applied across all biodiversity data collection and research. Statistical techniques that have been developed for processing community science data can equally help account for spatial biases and observer variation in professional datasets. Benchmarking or vetting one dataset against another can strengthen evidence and uncover unknown sources of biases. Professional and community science datasets can be used together to fill knowledge gaps that are unique to each. Careful study design that accounts for the collection of relevant and important covariate data can help statistically account for sources of bias. Currently, a double standard exists in how researchers view data collected by professionals versus those collected by community scientists. Our aim is to ensure that valuable community science data are given the prominent place they deserve, and that data collected by experts are appropriately vetted and biases accounted for using all the tools at our disposal.
... Also, while citizen science data often suffer from spatially-biased sampling efforts (i.e., sampling tends to concentrate in densely populated or touristic areas [19,20]), SDMs such as Maxent can account for such spatial biases by considering the spatial distribution of sampling efforts when selecting pseudo-absence (background) locations [21,22]. When sampling efforts are adequately controlled, adding citizen science data improves the accuracy of SDMs [10,23,24]. This implies that SDMs may be substantially improved by utilizing rapidly accumulating Biome's species occurrence records if we adequately control the sampling efforts. ...
The Kunming-Montreal Global Biodiversity Framework has increased the demand for biodiversity distribution data. To gather species observations from the public, we introduced a mobile application called 'Biome' in Japan. By employing species identification algorithms and gamification elements, Biome has gathered >5M observations since its launch in 2019. However, crowd-sourced data often exhibit spatial and taxonomic biases. Species distribution models (SDMs) make it possible to infer species distributions while accommodating such biases. We investigated the quality of Biome data and how incorporating Biome data influences the performance of SDMs. Species identification accuracy of Biome data exceeds 95% for birds, reptiles, mammals, and amphibians, but seed plants, molluscs, and fishes scored below 90%. The distributions of 132 terrestrial plants and animals across Japan were modeled, and model accuracy was improved by incorporating Biome data into traditional survey data. For endangered species, traditional survey data required >2,000 records to build accurate models (Boyce index >0.9), though only ca. 300 records were required when Biome data were blended in. The unique data distributions may explain this improvement: Biome data cover urban-natural gradients uniformly, while traditional data are biased towards natural areas. Combining multiple data sources offers insights into species distributions across Japan, aiding protected area designation and ecosystem service assessment.
... It can also determine potential suitable areas for future surveys (Brito et al., 2011) or reintroductions (Kuemmerle et al., 2010), support the assessment of conservation areas (Peers et al., 2016) and the evaluation of the effect of climate change, and even lay out further research questions (Thornton and Peers, 2020). Furthermore, the combination of citizen science and other sources to expand the database (Robinson et al., 2020), and the use of remote sensing to obtain land cover, topographic and climatic variables can improve the predictions from these models (Thornton and Peers, 2020). ...
... Citizen science: to multiply the effort, we involved local communities and showed them how to identify the target species with the open-source smartphone application iNaturalist (...). Other studies have previously shown how citizen science can work as a tool to multiply researchers' efforts, contribute to expanding biodiversity knowledge (Chandler et al., 2017; Kobori et al., 2016), and improve distribution and ecological niche models (Miller et al., 2019; Reich et al., 2018; Robinson et al., 2020). Phone apps like iNaturalist are curated by peers, including experts that confirm the species identification (Nugent, 2008). ...
Habitat encroachment can have devastating effects upon biodiversity, especially amphibians. Phyllobates vittatus is an endemic frog from Costa Rica, where land cover has seen significant changes over recent decades. Here we use remote sensing to create a land cover map of the region and carry out ecological niche modelling to identify the main abiotic factors associated with the distribution of this species. We have informed our models based on our own field observations, those from other researchers, and citizen science participants to obtain a comprehensive database of P. vittatus occurrences. Elevation, forest percentage, distance to lakes and rivers, annual temperature range and precipitation variables were found to shape the ecological niche of P. vittatus, which is mostly located within protected areas. Prior knowledge of the habitat of the species was key to interpreting the model output. We identify populations that might be isolated, and areas where presence has not yet been verified or that have not been occupied by the species, thus identifying potential areas for reintroductions. We also calculated the area of occupancy and recommend that P. vittatus' status be adjusted to “Endangered”. Future surveys and evaluation of population health and connectivity would help to better ensure the protection of the species in the long term.
... We subsampled these detection/non-detection data to reduce the spatial bias in recording (i.e. the large skew in number of visits per grid cell) and the high imbalance in species records (i.e. for most species, there were a small number of detections vs. a large number of non-detections), following recent recommendations (Gaul et al., 2022; Robinson et al., 2020). To do this, we first retained all the detections for each species (i.e. ...
Woodland cover in Britain has increased over the past century and is set to increase further through woodland creation schemes aiming to tackle climate change. Nonetheless, the wider repercussions of increasing woodland cover for species, especially invertebrates, have not been comprehensively assessed. Here, we quantified the woodland associations of 2762 invertebrate species in Britain across 21 broad taxon groups using species occurrence records collected by specialist recording societies. We then related the strength of species' woodland associations to published estimates of their long‐term national distribution trends between 1970 and 2015. Across all taxa, 29% of species were positively associated with broadleaf woodland cover, whereas 27% of species were negatively associated. There was a slight tendency for species associated with broadleaf woodland to have more positive long‐term distribution trends, but the effect had little explanatory power. For 15% of species, we detected a non‐monotonic association with broadleaf woodland cover, such that their occurrence peaked at intermediate levels of cover. Intermediate‐cover species had more positive long‐term distribution trends than species with monotonic positive or negative woodland associations. Our findings suggest that woodland invertebrates have not consistently increased, despite the increases in woodland cover. While some caution is warranted owing to our use of heterogeneous occurrence records, the considerable variation in distribution trends of woodland‐associated species could be explained by the high diversity of woodland species and ways in which they use woodland habitat. Woodland creation, or increasing tree cover in general, could have idiosyncratic impacts on species, depending on how new woodlands are created and managed.
... Thus, practitioners often leverage opportunistic datasets that are available on smaller scales than the desired modeling application, when used with appropriate caution, to develop SDMs that can predict outside the original spatial extent (e.g., Stirling et al., 2016). While some work has shown that "scaling up" relatively small-scale, scientific survey data with opportunistic citizen science data can result in improved accuracy and spatial extent of SDMs (Robinson et al., 2020), our results suggest that survey-quality data may not be necessary when multiple, complementary, large-scale datasets exist, as is common for highly migratory marine species. Our results also corroborate previous findings that spatial mismatch between training data and the desired modeling application may not inhibit the development of robust SDMs. ...
Species distribution models (SDMs) are becoming an important tool for marine conservation and management. Yet while there is an increasing diversity and volume of marine biodiversity data for training SDMs, little practical guidance is available on how to leverage distinct data types to build robust models. We explored the effect of different data types on the fit, performance and predictive ability of SDMs by comparing models trained with four data types for a heavily exploited pelagic fish, the blue shark (Prionace glauca), in the Northwest Atlantic: two fishery-dependent (conventional mark-recapture tags, fisheries observer records) and two fishery-independent (satellite-linked electronic tags, pop-up archival tags). We found that all four data types can result in robust models, but differences among spatial predictions highlighted the need to consider ecological realism in model selection and interpretation regardless of data type. Differences among models were primarily attributed to biases in how each data type, and the associated representation of absences, sampled the environment and summarized the resulting species distributions. Outputs from model ensembles and a model trained on all pooled data both proved effective for combining inferences across data types and provided more ecologically realistic predictions than individual models. Our results provide valuable guidance for practitioners developing SDMs. With increasing access to diverse data sources, future work should further develop truly integrative modeling approaches that can explicitly leverage strengths of individual data types while statistically accounting for limitations, such as sampling biases.
... e.g. during or from participatory campaigns such as the NABU-Insektensommer, are at least a useful complement to strongly science-driven species surveys (Robinson et al. 2020); with a suitable approach, however, they can also yield insightful information and glimpses into the living natural world on their own. ...
Summary: Participatory campaigns are increasingly used across various fields of citizen science. They offer interested people a low-threshold way to create new knowledge in a particular subject area. One such citizen science campaign, the NABU-Insektensommer, is presented here. Alongside the project and its implementation, economic aspects of the campaign are examined, which underline the value created by citizen science. Through citizen science, members of the public carry out voluntary work that in most cases can no longer be done by paid staff. These campaigns can be genuine tools for generating larger datasets which, when run over longer periods, can reveal statistically robust changes, for example in particular groups of organisms. Using the data collected during the NABU-Insektensommer, the occurrence of the violet carpenter bee (Blaue Holzbiene) in Germany is traced, and an increase in the number of observations is linked to milder winters. In addition, observations of the hummingbird hawk-moth (Taubenschwänzchen) are shown and data are presented that suggest overwintering in Germany.
Rapid global urbanization has caused habitat degradation and fragmentation, resulting in biodiversity loss and the homogenization of urban species. Birds play a crucial role as biodiversity indicators in urban environments, providing multiple ecosystem services and demonstrating sensitivity to changes in habitat. However, construction activities often disrupt urban bird habitats, leading to a decline in habitat quality. This paper proposes a framework for prioritizing habitat restoration by pinpointing bird hotspots that demand attention and considering the matching relationship between bird richness and habitat quality. Shanghai represents a typical example of the high-density megacities in China, posing a significant challenge for biodiversity conservation efforts. Utilizing the random forest (RF) model, bird richness patterns in central Shanghai were mapped, and bird hotspots were identified by calculating local spatial autocorrelation indices. From this, the habitat quality of hotspot areas was evaluated, and the restoration priority of bird habitats was determined by matching bird richness with habitat quality through z-score standardization. The results were as follows: (1) Outer-ring green spaces, large urban parks, and green areas along coasts or rivers were found to be the most important habitats for bird richness. Notably, forests emerged as a crucial habitat, with approximately 50.68% of the forested areas identified as hotspots. (2) Four habitat restoration types were identified. The high-bird-richness–low-habitat-quality area (HBR-LHQ), mainly consisting of grassland and urban construction land, was identified as a key priority for restoration due to its vulnerability to human activities. (3) The Landscape Shannon’s Diversity Index (SHDI) and Normalized Difference Vegetation Index (NDVI) are considered the most significant factors influencing the bird distribution. 
Our findings provide a scientifically effective framework for identifying habitat restoration priorities in high-density urban areas.
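The matching step described in this abstract, in which bird richness and habitat quality are standardized as z-scores and cells are sorted into four restoration types, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the cut-off at z = 0, the function names, and the sample values are assumptions, and only the HBR-LHQ label comes from the abstract (the other three labels are formed by analogy).

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("z-scores are undefined when all values are equal")
    return [(v - mu) / sigma for v in values]


def classify_cells(richness, quality):
    """Label each cell by the sign of its richness and habitat-quality z-scores.

    Assumption: z > 0 counts as "high", anything else as "low".
    """
    if len(richness) != len(quality):
        raise ValueError("richness and quality must have the same length")
    labels = []
    for z_rich, z_qual in zip(z_scores(richness), z_scores(quality)):
        bird = "HBR" if z_rich > 0 else "LBR"
        habitat = "HHQ" if z_qual > 0 else "LHQ"
        labels.append(f"{bird}-{habitat}")
    return labels


# Hypothetical values for four hotspot cells.
richness = [12, 30, 8, 25]        # predicted bird richness per cell
quality = [0.8, 0.3, 0.2, 0.9]    # habitat-quality score per cell
labels = classify_cells(richness, quality)
print(labels)
```

Under these assumptions, the second cell (high richness, low habitat quality) falls into the HBR-LHQ class that the abstract identifies as the key restoration priority.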
Information on species' distributions, abundances, and how they change over time is central to the study of the ecology and conservation of animal populations. This information is challenging to obtain at landscape scales across range‐wide extents for two main reasons. First, landscape‐scale processes that affect populations vary throughout the year and across species' ranges, requiring high resolution, year‐round data across broad — sometimes hemispheric — spatial extents. Second, while citizen science projects can collect data at these resolutions and extents, using these data requires appropriate analysis to address known sources of bias. Here we present an analytical framework to address these challenges and generate year‐round, range‐wide distributional information using citizen science data. To illustrate this approach, we apply the framework to Wood Thrush (Hylocichla mustelina), a long‐distance Neotropical migrant and species of conservation concern, using data from the citizen science project eBird. We estimate occurrence and abundance across a range of spatial scales throughout the annual cycle. Additionally, we generate intra‐annual estimates of the range, intra‐annual estimates of the associations between species and characteristics of the landscape, and inter‐annual trends in abundance for breeding and non‐breeding seasons. The range‐wide population trajectories for Wood Thrush show a close correspondence between breeding and non‐breeding seasons with steep declines between 2010 and 2013 followed by shallower rates of decline from 2013 to 2016. The breeding season range‐wide population trajectory based on the independently collected and analyzed North American Breeding Bird Survey data also shows this pattern. The information provided here fills important knowledge gaps for Wood Thrush, especially during the less studied migration and non‐breeding periods. 
More generally, the modeling framework presented here can be used to accurately capture landscape scale intra‐ and inter‐annual distributional dynamics for broadly distributed, highly mobile species.
Information on where species occur is an important component of conservation and management decisions, but knowledge of distributions is often coarse or incomplete. Species distribution models provide a tool for mapping habitat and can produce credible, defensible, and repeatable information with which to inform decisions. However, these models are sensitive to data inputs and methodological choices, making it important to assess the reliability and utility of model predictions. We provide a rubric that model developers can use to communicate a model's attributes and its appropriate uses. We emphasize the importance of tailoring model development and delivery to the species of interest and the intended use and the advantages of iterative modeling and validation. We highlight how species distribution models have been used to design surveys for new populations, inform spatial prioritization decisions for management actions, and support regulatory decision-making and compliance, tying these examples back to our model assessment rubric.
Biodiversity is being lost at an unprecedented rate, and monitoring is crucial for understanding the causal drivers and assessing solutions. Most biodiversity monitoring data are collected by volunteers through citizen science projects, and often crucial information is lacking to account for the inevitable biases that observers introduce during data collection. We contend that citizen science projects intended to support biodiversity monitoring must gather information about the observation process as well as species occurrence. We illustrate this using eBird, a global citizen science project that collects information on bird occurrences as well as vital contextual information on the observation process while maintaining broad participation. Our fundamental argument is that regardless of what species are being monitored, when citizen science projects collect a small set of basic information about how participants make their observations, the scientific value of the data collected will be dramatically improved.
Citizen science data are valuable for addressing a wide range of ecological research questions, and there has been a rapid increase in the scope and volume of data available. However, data from large-scale citizen science projects typically present a number of challenges that can inhibit robust ecological inferences. These challenges include: species bias, spatial bias, and variation in effort. To demonstrate addressing key challenges in analysing citizen science data, we use the example of estimating species distributions with data from eBird, a large semi-structured citizen science project. We estimate two widely applied metrics of species distributions: encounter rate and occupancy probability. For each metric, we assess the impact of data processing steps that either degrade or refine the data used in the analyses. We also test whether differences in model performance are maintained at different sample sizes. Model performance improved when data processing and analytical methods addressed the challenges arising from citizen science data. The largest gains in model performance were achieved with: 1) the use of complete checklists (where observers report all the species they detect and identify); and 2) the use of covariates describing variation in effort and detectability for each checklist. Occupancy models were more robust to a lack of complete checklists and effort variables. Improvements in model performance with data refinement were more evident with larger sample sizes. Here, we describe processes to refine semi-structured citizen science data to estimate species distributions. We demonstrate the value of complete checklists, which can inform the design and adaptation of citizen science projects. We also demonstrate the value of information on effort. 
The methods we have outlined are also likely to improve other forms of inference, and will enable researchers to conduct robust analyses and harness the vast ecological knowledge that exists within citizen science data.
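The two refinements highlighted in this abstract, keeping only complete checklists and carrying effort covariates into the model, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' workflow or the eBird data format: the `Checklist` fields, the effort limits (300 minutes, 5 km, 10 observers), and the sample records are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checklist:
    checklist_id: str
    complete: bool            # observer reported every species detected and identified
    duration_minutes: float   # effort covariates
    distance_km: float
    n_observers: int
    species: frozenset


def refine(checklists, max_minutes=300.0, max_km=5.0, max_observers=10):
    """Keep complete checklists whose effort lies within the (assumed) limits."""
    kept = []
    for c in checklists:
        if not c.complete:
            continue
        if not 0 < c.duration_minutes <= max_minutes:
            continue
        if not 0 <= c.distance_km <= max_km:
            continue
        if not 1 <= c.n_observers <= max_observers:
            continue
        kept.append(c)
    return kept


def encounter_table(checklists, target):
    """Build detection/non-detection rows with effort covariates attached."""
    return [
        {
            "checklist_id": c.checklist_id,
            "detected": int(target in c.species),
            "duration_minutes": c.duration_minutes,
            "distance_km": c.distance_km,
            "n_observers": c.n_observers,
        }
        for c in checklists
    ]


# Hypothetical records for illustration only.
checklists = [
    Checklist("A", True, 60, 1.0, 2, frozenset({"Dunlin", "Killdeer"})),
    Checklist("B", False, 30, 0.5, 1, frozenset({"Dunlin"})),    # incomplete
    Checklist("C", True, 600, 2.0, 1, frozenset({"Killdeer"})),  # duration too long
    Checklist("D", True, 45, 0.0, 1, frozenset({"Killdeer"})),
]
rows = encounter_table(refine(checklists), "Dunlin")
for row in rows:
    print(row)
```

Because only complete checklists are retained, a checklist that omits the target species can be treated as a non-detection, and the effort columns are available as covariates for an encounter-rate or occupancy model.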
Tallgrass prairie ecosystems in North America are heavily degraded and require effective restoration strategies if prairie specialist taxa are to be preserved. One common management tool used to restore grassland is the application of a seed-mix of native prairie plant species. While this technique is effective in the short-term, it is critical that species' resilience to changing climate be evaluated when designing these mixes. By utilizing species distribution models (SDMs), species' bioclimatic envelopes–and thus the geographic area suitable for them–can be quantified and predicted under various future climate regimes, and current seed-mixes may be modified to include more climate resilient species or exclude more affected species. We evaluated climate response on plant functional groups to examine the generalizability of climate response among species of particular functional groups. We selected 14 prairie species representing the functional groups of cool-season and warm-season grasses, forbs, and legumes and we modeled their responses under both a moderate and more extreme predicted future. Our functional group “composite maps” show that warm-season grasses, forbs, and legumes responded similarly to other species within their functional group, while cool-season grasses showed less inter-species concordance. The value of functional group as a rough method for evaluating climate-resilience is therefore supported, but candidate cool-season grass species will require more individualized attention. This result suggests that seed-mix designers may be able to use species with more occurrence records to generate functional group-level predictions to assess the climate response of species for which there are prohibitively few occurrence records for modeling.
With the advance of methods for estimating species distribution models has come an interest in how to best combine datasets to improve estimates of species distributions. This has spurred the development of data integration methods that simultaneously harness information from multiple datasets while dealing with the specific strengths and weaknesses of each dataset. We outline the general principles that have guided data integration methods and review recent developments in the field. We then outline key areas that allow for a more general framework for integrating data and provide suggestions for improving sampling design and validation for integrated models. Key to recent advances has been using point‐process thinking to combine estimators developed for different data types. Extending this framework to new data types will further improve our inferences, as well as relaxing assumptions about how parameters are jointly estimated. These along with the better use of information regarding sampling effort and spatial autocorrelation will further improve our inferences. Recent developments form a strong foundation for implementation of data integration models. Wider adoption can improve our inferences about species distributions and the dynamic processes that lead to distributional shifts.
Coral reefs are threatened by numerous global and local stressors. In the face of predicted large-scale coral degradation over the coming decades, the importance of long-term monitoring of stress-induced ecosystem changes has been widely recognised. In areas where sustained funding is unavailable, citizen science monitoring has the potential to be a powerful alternative to conventional monitoring programmes. In this study we used data collected by volunteers in Southeast Sulawesi (Indonesia), to demonstrate the potential of marine citizen science programmes to provide scientifically sound information necessary for detecting ecosystem changes in areas where no alternative data are available. Data were collected annually between 2002 and 2012 and consisted of percent benthic biotic and abiotic cover and fish counts. Analyses revealed long-term coral reef ecosystem change. We observed a continuous decline of hard coral, which in turn had a significant effect on the associated fishes, at community, family and species levels. We provide evidence of the importance of marine citizen science programmes in detecting long-term ecosystem change as an effective way of delivering conservation data to local government and national agencies. This is particularly true for areas where funding for monitoring is unavailable, resulting in an absence of ecological data. For citizen science data to contribute to ecological monitoring and local decision-making, the data collection protocols need to adhere to sound scientific standards, and protocols for data evaluation need to be available to local stakeholders. Here, we describe the monitoring design, data treatment and statistical analyses to be used as potential guidelines in future marine citizen science projects.
Opportunistically collected species observations contributed by volunteer reporters are increasingly available for species and regions for which systematically collected data are not available. However, it is unclear if they are suitable to produce reliable habitat suitability models (HSMs), and hence if the species–habitat relationships found and habitat suitability maps produced can be used with confidence to advise conservation management and address basic and applied research questions. We evaluated HSMs with opportunistically collected observations against HSMs with systematically collected observations. We enhanced the opportunistically collected presence‐only data by adding inferred species absences. To obtain inferred absences, we asked individual reporters about their identification skills and if they reported certain species consistently and combined this information with their observations. We evaluated several HSM methods using a forest bird species, Siberian jay (Perisoreus infaustus), in Sweden: logistic regression with inferred absences, two versions of MaxEnt, a model combining presence–absence with presence‐only observations and a Bayesian site‐occupancy‐detection model. All HSM methods produced nationwide habitat suitability maps of Siberian jay that agreed well with systematically collected observations (AUC: 0.86–0.88) and were very similar to a habitat suitability map produced from the HSM with systematically collected observations (Spearman rho: 0.94–0.98). At finer geographical scales there were differences among methods. At finer scales, the habitat suitability maps from logistic regression with inferred absences agreed better with results from systematically collected observations than those from other methods. The species–habitat relationships found with logistic regression also agreed well with those found from systematically collected data and with prior expectations based on the species' ecology. Synthesis and application. For many regions and species, systematically collected data are not available. By using inferred absences from high‐quality, opportunistically collected contributions of a few very active reporters in logistic regression we obtained HSMs that produced results similar to those from a systematic survey. Adding high‐quality inferred absences to opportunistically collected data is likely possible for many less common species across various organism groups. Well‐performing HSMs are important to facilitate applications such as spatial conservation planning and prioritization, monitoring of invasive species, understanding species habitat requirements or climate change studies.
Drawing upon the data deposited in publicly shared archives has the potential to transform the way we conduct ecological research. For this transformation to happen, we argue that data need to be more interoperable and easier to discover. One way to achieve these goals is to adopt domain-specific data representations.