Standardized NEON organismal data for biodiversity research

Abstract

Understanding patterns and drivers of species distributions and abundances, and thus biodiversity, is a core goal of ecology. Despite advances in recent decades, research into these patterns and processes is currently limited by a lack of standardized, high-quality, empirical data that span large spatial scales and long time periods. The National Ecological Observatory Network (NEON) fills this gap by providing freely available observational data that are: generated during robust and consistent organismal sampling of several sentinel taxonomic groups within 81 sites distributed across the United States; and will be collected for at least 30 years. The breadth and scope of these data provide a unique resource for advancing biodiversity research. To maximize the potential of this opportunity, however, it is critical that NEON data be maximally accessible and easily integrated into investigators’ workflows and analyses. To facilitate their use for biodiversity research and synthesis, we created a workflow to process and format NEON organismal data into the ecocomDP (ecological community data design pattern) format, which is available through the `ecocomDP` R package; we then provided the standardized data as an R data package (`neonDivData`). We briefly summarize sampling designs and data wrangling decisions for the major taxonomic groups included in this effort. Our workflows are open-source so the biodiversity community may: add additional taxonomic groups; modify the workflow to produce datasets appropriate for their own analytical needs; and regularly update the data packages as more observations become available. Finally, we provide two simple examples of how the standardized data may be used for biodiversity research. By providing a standardized data package, we hope to enhance the utility of NEON organismal data in advancing biodiversity research.
Daijiang Li1,2†‡, Sydne Record3†‡, Eric Sokol4,5†‡, Matthew E. Bitters6, Melissa Y. Chen6, Anny Y. Chung7, Matthew R. Helmus8, Ruvi Jaimes9, Lara Jansen10, Marta A. Jarzyna11,12, Michael G. Just13, Jalene M. LaMontagne14, Brett Melbourne6, Wynne Moss6, Kari Norman15, Stephanie Parker4, Natalie Robinson4, Bijan Seyednasrollah16, Colin Smith17, Sarah Spaulding5, Thilina Surasinghe18, Sarah Thomsen19, Phoebe Zarnetske20,21
01 September, 2021
1 Department of Biological Sciences, Louisiana State University, Baton Rouge, LA, United States
2 Center for Computation & Technology, Louisiana State University, Baton Rouge, LA, United States
3 Department of Biology, Bryn Mawr College, Bryn Mawr, PA, United States
4 Battelle, National Ecological Observatory Network (NEON), Boulder, CO, United States
5 Institute of Arctic and Alpine Research (INSTAAR), University of Colorado Boulder, Boulder, CO, United States
6 Department of Ecology and Evolutionary Biology, University of Colorado Boulder, Boulder, CO, United States
7 Departments of Plant Biology and Plant Pathology, University of Georgia, Athens, GA, United States
8 Integrative Ecology Lab, Center for Biodiversity, Department of Biology, Temple University, Philadelphia, PA, United States
9 St. Edward’s University, Austin, Texas
10 Department of Environmental Science and Management, Portland State University, Portland, OR, United States
11 Department of Evolution, Ecology and Organismal Biology, The Ohio State University, Columbus, OH, United States
12 Translational Data Analytics Institute, The Ohio State University, Columbus, OH, United States
13 Ecological Processes Branch, U.S. Army ERDC CERL, Champaign, IL, United States
14 Department of Biological Sciences, DePaul University, Chicago, IL, United States
15 Department of Environmental Science, Policy, and Management, University of California Berkeley, Berkeley, CA, United States
16 School of Informatics, Computing and Cyber Systems, Northern Arizona University, Flagstaff, AZ, United States
17 Environmental Data Initiative, University of Wisconsin-Madison, Madison, WI
18 Department of Biological Sciences, Bridgewater State University, Bridgewater, MA, United States
19 Department of Integrative Biology, Oregon State University, Corvallis, OR, United States
20 Department of Integrative Biology, Michigan State University, East Lansing, MI, United States
21 Ecology, Evolution, and Behavior Program, Michigan State University, East Lansing, MI, United States
Equal contributions
Corresponding authors: dli30@lsu.edu; srecord@brynmawr.edu; esokol@battelleecology.org
Open Research Statement
No data were collected for this study. All original data were collected by NEON and are publicly available at NEON’s data portal. We standardized these data and provided them as a data package, which is available on GitHub (https://github.com/daijiang/neonDivData). Data were also permanently archived at the EDI data repository (https://portal-s.edirepository.org/nis/mapbrowse?scope=edi&identifier=190&revision=2). The code in the Supporting Information (CodeS1) is novel and will be available on GitHub upon acceptance.
Abstract: Understanding paerns and drivers of species distributions and abundances, and thus43
biodiversity, is a core goal of ecology. Despite advances in recent decades, research into these44
paerns and processes is currently limited by a lack of standardized, high-quality, empirical data45
that spans large spatial scales and long time periods. e National Ecological Observatory46
Network (NEON) lls this gap by providing freely available observational data that are:47
generated during robust and consistent organismal sampling of several sentinel taxonomic48
groups within 81 sites distributed across the United States; and will be collected for at least 3049
years. e breadth and scope of these data provides a unique resource for advancing biodiversity50
research. To maximize the potential of this opportunity, however, it is critical that NEON data be51
maximally accessible and easily integrated into investigators’ workows and analyses. To52
facilitate its use for biodiversity research and synthesis, we created a workow to process and53
format NEON organismal data into the ecocomDP (ecological community data design paern)54
format, and available through the ecocomDP R package; we then provided the standardized data55
as an R data package (neonDivData). We briey summarize sampling designs and data56
wrangling decisions for the major taxonomic groups included in this eort. Our workows are57
open-source so the biodiversity community may: add additional taxonomic groups; modify the58
2
workow to produce datasets appropriate for their own analytical needs; and regularly update59
the data packages as more observations become available. Finally, we provide two simple60
examples of how the standardized data may be used for biodiversity research. By providing a61
standardized data package, we hope to enhance the utility of NEON organismal data in62
advancing biodiversity research.63
Key words: NEON, Biodiversity, Organismal Data, Data Product, R, Data package, EDI
Introduction (or why standardized NEON organismal data)
A central goal of ecology is to understand the patterns and processes of biodiversity, and this is particularly important in an era of rapid global environmental change (Midgley and Thuiller 2005, Blowes et al. 2019). Such understanding is only possible through studies that address questions like: How is biodiversity distributed across large spatial scales, ranging from ecoregions to continents? What mechanisms drive spatial patterns of biodiversity? Are spatial patterns of biodiversity similar among different taxonomic groups, and if not, why do we see variation? How does community composition vary across spatial and environmental gradients? What are the local and landscape scale drivers of community structure? How and why do biodiversity patterns change over time? Answers to such questions will enable better management and conservation of biodiversity and ecosystem services.
Biodiversity research has a long history (Worm and Tittensor 2018), beginning with major scientific expeditions (e.g., Alexander von Humboldt, Charles Darwin) aiming to document global species lists after the establishment of Linnaeus’s Systema Naturae (Linnaeus 1758). Beginning in the 1950s (Curtis 1959, Hutchinson 1959), researchers moved beyond documentation to focus on quantifying patterns of species diversity and describing mechanisms underlying their heterogeneity. Since the beginning of this line of research, major theoretical breakthroughs (MacArthur and Wilson 1967, Hubbell 2001, Brown et al. 2004, Harte 2011) have advanced our understanding of potential mechanisms causing and maintaining biodiversity. Modern empirical studies, however, have been largely constrained to local or regional scales and focused on one or a few taxonomic groups, because of the considerable effort required to collect observational data. There are now unprecedented numbers of observations from independent small and short-term ecological studies. These data support research into generalities through syntheses and meta-analyses (Vellend et al. 2013, Blowes et al. 2019, Li et al. 2020), but this work is challenged by the difficulty of integrating data from different studies with varying limitations. Such limitations include: differing collection methods (methodological uncertainties); varying levels of statistical robustness; inconsistent handling of missing data; spatial bias; publication bias; and design flaws (Martin et al. 2012, Nakagawa and Santos 2012, Koricheva and Gurevitch 2014, Welti et al. 2021). Additionally, it has historically been challenging for researchers to obtain and collate data from a diversity of sources for use in syntheses and/or meta-analyses (Gurevitch and Hedges 1999).
Barriers to meta-analyses have been reduced in recent years to bring biodiversity research into the big data era (Hampton et al. 2013, Farley et al. 2018) by large efforts to digitize museum and herbarium specimens (e.g., iDigBio), successful community science programs (e.g., iNaturalist, eBird), technological advances (e.g., remote sensing, automated acoustic recorders), and long-running coordinated research networks. Yet, each of these remedies comes with its own limitations. For instance, museum/herbarium specimens and community science records are increasingly available, but are still incidental and unstructured in terms of the sampling design, and exhibit marked geographic and taxonomic biases (Martin et al. 2012, Beck et al. 2014, Geldmann et al. 2016). Remote sensing approaches may cover large spatial scales, but may also be of low spatial resolution and unable to reliably penetrate vegetation canopy (Palumbo et al. 2017, Pricope et al. 2019). The standardized observational sampling of woody trees by the United States Forest Service’s Forest Inventory and Analysis and of birds by the United States Geological Survey’s Breeding Bird Survey have been ongoing across the United States since 2001 and 1966, respectively (Bechtold and Patterson 2005, Sauer et al. 2017), but cover few taxonomic groups. The Long Term Ecological Research Network (LTER) and Critical Zone Observatory (CZO) are both hypothesis-driven research efforts built on decades of previous work (Jones et al. 2021). While both provide considerable observational and experimental datasets for diverse ecosystems and taxa, their sampling and dataset designs are tailored to their specific research questions, and a priori standardization is not possible. Thus, despite recent advances, biodiversity research is still impeded by a lack of standardized, high quality, and open-access data spanning large spatial scales and long time periods.
The recently established National Ecological Observatory Network (NEON) provides continental-scale observational and instrumentation data for a wide variety of taxonomic groups and measurement streams. Data are collected using standardized methods, across 81 field sites in both terrestrial and freshwater ecosystems, and will be freely available for at least 30 years. These consistently collected, long-term, and spatially robust measurements are directly comparable throughout the Observatory, and provide a unique opportunity for enabling a better understanding of ecosystem change and biodiversity patterns and processes across space and through time (Keller et al. 2008).
NEON data are designed to be maximally useful to ecologists by aligning with FAIR principles (findable, accessible, interoperable, and reusable; Wilkinson et al. 2016). Despite meeting these requirements, however, there are still challenges to integrating NEON organismal data for reproducible biodiversity research. For example: field names may vary across NEON data products, even for similar measurements; some measurements include sampling unit information, whereas units must be calculated for others; and data are in a raw form that often includes metadata unnecessary for biodiversity analyses. These issues and inconsistencies may be overcome through data cleaning and formatting, but understanding how best to perform this task requires a significant investment in the comprehensive NEON documentation for each data product involved in an analysis. Thoroughly reading large amounts of NEON documentation is time consuming, and the path to a standard data format, which is critical for reproducibility, may vary greatly between NEON organismal data products and users, even for similar analyses. Ultimately, this may result in subtle differences from study to study that hinder meta-analyses using NEON data. A simplified and standardized format for NEON organismal data would facilitate wider usage of these datasets for biodiversity research. Furthermore, if these data were formatted to interface well with datasets from other coordinated research networks, more comprehensive syntheses could be accomplished to advance macrosystem biology (Record et al. 2020).
One aractive standardized formaing style for NEON organismal data is that of ecocomDP143
(ecological community data design paern, O’Brien et al. 2021). EcocomDP is the brainchild of144
5
members of the LTER network, the Environmental Data Initiative (EDI), and NEON sta, and145
provides a model by which data from a variety of sources may be easily transformed into146
consistently formaed, analysis ready community-level organismal data packages. is is done147
using reproducible code that maintains dataset “levels”: L0 is incoming data, L1 represents an148
ecocomDP data format and includes tables representing observations, sampling locations, and149
taxonomic information (at a minimum), and L2 is an output format. us far, >70 LTER150
organismal datasets have been harmonized to the L1 ecocomDP format through the R package151
ecocomDP (Smith et al. 2021) and more datasets are in the queue for processing into the152
ecocomDP format by EDI (O’Brien et al. 2021).153
We standardized NEON organismal data into the ecocomDP format, and all R code to process NEON data products can be obtained through the R package ecocomDP. For the major taxonomic groups included in this initial effort, NEON sampling designs and major data wrangling decisions are summarized in the Materials and Methods section. We archived the standardized data in the EDI Data Repository. To facilitate the usage of the standardized datasets, we also developed an R data package, neonDivData. Throughout this paper, we refer to the input data streams provided by NEON as data products, and to the cleaned and standardized collections of data files provided here as objects within the R data package neonDivData. Standardized datasets will be maintained and updated as new data become available from the NEON portal. We hope this effort will substantially reduce data processing times for NEON data users and greatly facilitate the use of NEON organismal data to advance our understanding of Earth’s biodiversity.
Materials and Methods (or how to standardize NEON organismal data)
There are many details to consider when starting to use NEON organismal data products. Below we outline key points relevant to community-level biodiversity analyses with regard to the NEON sampling design and the decisions that were made as the data products presented in this paper were converted into the ecocomDP data model. While the methodological sections below are specific to particular taxonomic groups, there are some general points that apply to all NEON organismal data products. First, species occurrence and abundance measures as reported in NEON biodiversity data products are not standardized to sampling effort. Because there are often multiple approaches to cleaning (e.g., dealing with multiple levels of taxonomic resolution, interpretations of absences, etc.) and standardizing biodiversity survey data, NEON publishes raw observations along with sampling effort data to preserve as much information as possible so that data users can clean and standardize data as they see fit. The workflows described here for twelve taxonomic groups represented in eleven NEON data products produce standardized counts based on sampling effort, such as count of individuals per area sampled or count standardized to the duration of trap deployment, as described in Table 1. The data wrangling workflows described below can be used to access, download, and clean data from the NEON Data Portal by using the R ecocomDP package (Smith et al. 2021). To view a catalog of available NEON data products in the ecocomDP format, use ecocomDP::search_data("NEON"). To import data from a given NEON data product into your R environment, use ecocomDP::read_data(), and set the id argument to the selected NEON to ecocomDP mapping workflow (the “L0 to L1 ecocomDP workflow ID” in Table 1). This will return a list of ecocomDP formatted tables and accompanying metadata. To create a flat data table (similar to the R objects in the data package neonDivData described in Table 2), use the ecocomDP::flatten_ecocomDP() function. Second, it should be noted that NEON data collection efforts will continue well after this paper is published, and data collection methods and/or processing may change over time. Such changes (e.g., change in the number of traps used for ground beetle collection) or interruptions (e.g., due to COVID-19) to data collection are documented in the Issues log for each data product on the NEON Data Portal as well as the Readme text file that is included with NEON data downloads.
Figure 1: Generalized sampling schematics for Terrestrial Observation System (A) and Aquatic Observation System (B-D) plots. For Terrestrial Observation System (TOS) plots, Distributed, Tower, and Gradient plots, and locations of various sampling regimes, are presented via symbols. For Aquatic Observation System (AOS) plots, Wadeable streams, Non-wadeable streams, and Lake plots are shown in detail, with locations of sensors and different sampling regimes presented using symbols. Panel A was originally published in Thorpe et al. (2016).
[Figure 1 panels: A) Terrestrial Observation System; B) Wadeable Stream; C) Non-wadeable River; D) Lake. Legend: sensor station, water chemistry sampling, groundwater well, meteorological station, riparian assessment, reaeration drip, reaeration sampling. Note: Fish, sediments, macroinvertebrates, plants and macroalgae are sampled based on site-specific habitats and are not identified in the figures.]
Table 1: Mapping NEON data products to ecocomDP formatted data packages with abundances standardized to observation effort. IDs in the L0 to L1 ecocomDP workflow ID column were used in the R package ecocomDP to standardize organismal data. Notes: * Bird counts are reported per taxon per “cluster” observed in each point count in the NEON data product and have not been further standardized to sampling effort because standard methods for modeling bird abundances are beyond the scope of this paper; ** plants percent cover value NA represents presence/absence data only; *** incidence rate per number of tests conducted is reported for tick pathogens.
| Taxon group | L0 dataset (NEON data product ID) | Version of NEON data used in this study | L0 to L1 ecocomDP workflow ID | Primary variable reported in ecocomDP observation table | Units |
|---|---|---|---|---|---|
| Algae | DP1.20166.001 | https://doi.org/10.48443/3cvp-hw55 and provisional data | neon.ecocomdp.20166.001.001 | cell density OR cells OR valves | cells/cm² OR cells/mL |
| Beetles | DP1.10022.001 | https://doi.org/10.48443/tx5f-dy17 and provisional data | neon.ecocomdp.10022.001.001 | abundance | count per trap day |
| Birds* | DP1.10003.001 | https://doi.org/10.48443/s730-dy13 and provisional data | neon.ecocomdp.10003.001.001 | cluster size | count of individuals |
| Fish | DP1.20107.001 | https://doi.org/10.48443/17cz-g567 and provisional data | neon.ecocomdp.20107.001.001 | abundance | catch per unit effort |
| Herptiles | DP1.10022.001 | https://doi.org/10.48443/tx5f-dy17 and provisional data | neon.ecocomdp.10022.001.002 | abundance | count per trap day |
| Macroinvertebrates | DP1.20120.001 | https://doi.org/10.48443/855x-0n27 and provisional data | neon.ecocomdp.20120.001.001 | density | count per square meter |
| Mosquitoes | DP1.10043.001 | https://doi.org/10.48443/9smm-v091 and provisional data | neon.ecocomdp.10043.001.001 | abundance | count per trap hour |
| Plants** | DP1.10058.001 | https://doi.org/10.48443/abge-r811 and provisional data | neon.ecocomdp.10058.001.001 | percent cover | percent of plot area covered by taxon |
| Small mammals | DP1.10072.001 | https://doi.org/10.48443/j1g9-2j27 and provisional data | neon.ecocomdp.10072.001.001 | count | unique individuals per 100 trap nights per plot per month |
| Tick pathogens*** | DP1.10092.001 | https://doi.org/10.48443/5fab-xv19 and provisional data | neon.ecocomdp.10092.001.001 | positivity rate | positive tests per pathogen per sampling event |
| Ticks | DP1.10093.001 | https://doi.org/10.48443/dx40-wr20 and provisional data | neon.ecocomdp.10093.001.001 | abundance | count per square meter |
| Zooplankton | DP1.20219.001 | https://doi.org/10.48443/qzr1-jr79 and provisional data | neon.ecocomdp.20219.001.001 | density | count per liter |
Terrestrial Organisms
Breeding Land Birds
NEON Sampling Design. NEON designates breeding landbirds as “smaller birds (usually exclusive of raptors and upland game birds) not usually associated with aquatic habitats” (Ralph 1993, Thibault 2018). Most species observed are diurnal and include both resident and migrant species. Landbirds are surveyed via point counts in each of the 47 terrestrial sites (Thibault 2018). At most NEON sites, breeding landbird points are located in five to ten 3 × 3 grids (Fig. 1), which are themselves located in representative (dominant) vegetation. Whenever possible, grid centers are co-located with distributed base plot centers. When sites are too small to support a minimum of five grids, separated by at least 250 m from edge to edge, point counts are completed at single points instead of grids. In these cases, points are located at the southwest corners of distributed base plots within the site. Five to 25 points may be surveyed depending on the size and spatial layout of the site, with exact point locations dictated by a stratified-random spatial design that maintains a 250 m minimum separation between points.
Surveys occur during one or two sampling bouts per season, at large and small sites respectively. Observers go to the specified points early in the morning and track birds observed during each minute of a 6-minute period, following a 2-minute acclimation period, at each point (Thibault 2018). Each point count contains species, sex, and distance to each bird (measured with a laser rangefinder except in the case of flyovers) seen or heard. Information relevant for subsequent modeling of detectability is also collected during the point counts (e.g., weather, detection method). The point count surveys for NEON were modified from the Integrated Monitoring in Bird Conservation Regions (IMBCR) field protocol for spatially-balanced sampling of landbird populations (Pavlacky Jr et al. 2017).
Data Wrangling Decisions. The bird point count NEON data product (‘DP1.10003.001’) consists of a list of two associated data frames: brd_countdata and brd_perpoint. The former data frame contains information such as locations, species identities, and their counts. The latter data frame contains additional location information such as latitude and longitude coordinates and environmental conditions during the time of the observations. The separate data frames are linked by ‘eventID’, which refers to the location, date and time of the observation. To prepare the bird point count data for the L1 ecocomDP model, we first merged both data frames into one and then removed columns that are likely not needed for most community-level biodiversity analyses (e.g., observer names). The field taxon_id in the R object data_bird within the neonDivData data package consists of the standard AOU 4-letter species code, although taxon_rank refers to eight potential levels of identification (class, family, genus, species, speciesGroup, subfamily, and subspecies). Users can decide which level is appropriate; for example, one might choose to exclude all unidentified birds (taxon_id = UNBI), where no further details are available below the class level (Aves sp.). The NEON sampling protocol has evolved over time, so users are advised to check whether the ‘samplingProtocolVersion’ associated with bird point count data (‘DP1.10003.001’) fits their data requirements and subset as necessary. Older versions of protocols can be found at the NEON document library.
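The merge-and-trim step described above can be sketched as follows. This is a minimal illustration in Python rather than the authors' R workflow: the records are invented, the column names other than eventID (taxonID, clusterSize, observerName, decimalLatitude, decimalLongitude) and the helper names are assumptions made for the example, and it assumes one per-point record per eventID.

```python
# Hypothetical records standing in for brd_countdata and brd_perpoint.
count_rows = [
    {"eventID": "E1", "taxonID": "AMRO", "clusterSize": 2, "observerName": "obs1"},
    {"eventID": "E1", "taxonID": "UNBI", "clusterSize": 1, "observerName": "obs1"},
    {"eventID": "E2", "taxonID": "BCCH", "clusterSize": 3, "observerName": "obs2"},
]
point_rows = [
    {"eventID": "E1", "decimalLatitude": 40.0, "decimalLongitude": -105.0},
    {"eventID": "E2", "decimalLatitude": 40.1, "decimalLongitude": -105.2},
]


def merge_point_counts(count_rows, point_rows, drop=("observerName",)):
    """Join count records to per-point records on eventID and drop unneeded columns."""
    by_event = {row["eventID"]: row for row in point_rows}
    merged = []
    for row in count_rows:
        # Per-point fields first, so count fields win if a name is shared.
        combined = {**by_event.get(row["eventID"], {}), **row}
        for column in drop:
            combined.pop(column, None)
        merged.append(combined)
    return merged


def exclude_unidentified(rows, code="UNBI"):
    """Optionally remove records for unidentified birds."""
    return [row for row in rows if row["taxonID"] != code]


merged = merge_point_counts(count_rows, point_rows)
identified = exclude_unidentified(merged)
```

Whether to apply a filter such as `exclude_unidentified` is an analysis choice left to the user, as the text notes.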
Ground Beetles and Herp Bycatch
NEON Sampling Design. Ground beetle sampling is conducted via pitfall trapping, across 10 distributed plots at each NEON site. The original sampling design included the placement of a pitfall trap at each of the cardinal directions along the distributed plot boundary, for a total of four traps per plot and 40 traps per site. In 2018, sampling was reduced via the elimination of the North pitfall trap in each plot, resulting in 30 traps per site (LeVan et al. 2019b).
Beetle pitfall trapping begins when the temperature has been >4 ℃ for 10 days in the spring and ends when temperatures dip below this threshold in the fall. Sampling occurs biweekly throughout the sampling season with no single trap being sampled more frequently than every 12 days (LeVan 2020a). After collection, the samples are separated into carabid species and bycatch. Invertebrate bycatch is pooled to the plot level and archived. Vertebrate bycatch is sorted and identified by NEON technicians, then archived at the trap level. Carabid samples are sorted and identified by NEON technicians, after which a subset of carabid individuals are sent to be pinned and re-identified by an expert taxonomist. More details can be found in Hoekman et al. (2017) and LeVan et al. (2019b).
Pitfall traps and sampling methods are designed by NEON to reduce vertebrate bycatch (LeVan et al. 2019b). The pitfall cup is medium in size with a low clearance cover installed over the trap entrance to minimize large vertebrate bycatch. When a live vertebrate with the ability to move of its own volition is found in a trap, the animal is released. Live but moribund vertebrates are euthanized and collected along with deceased vertebrates. When ≥15 individuals of a vertebrate species are collected, cumulatively, within a single plot, NEON may initiate localized mitigation measures such as temporarily deactivating traps and removing all traps from the site for the remainder of the season. Thus, while herpetofaunal (herp) bycatch is present in many pitfall samples, it is unclear how well these pitfall traps capture herp community structure and diversity because of these active efforts to reduce vertebrate bycatch. Users of NEON herp bycatch data should be aware of these limitations.
Data Wrangling Decisions. The beetle and herp bycatch data product identifier is ‘DP1.10022.001’. Carabid samples are recorded and identified in a multi-step workflow wherein a subset of samples is passed on in each successive step. Individuals are first identified by the sorting technician, after which a subset is sent on to be pinned. Some especially difficult individuals are not identified by technicians during sorting, instead being labelled “other carabid”. The identifications for those individuals are recorded with the pinning data. Any individuals for which identification is still uncertain are then verified by an expert taxonomist. There are a few cases where an especially difficult identification was sent to multiple expert taxonomists and they did not agree on a final taxon; these individuals were excluded from the data set at the recommendation of NEON staff.
Preference is given to expert identification whenever available. These differences in taxonomic expertise do not seem to cause systematic biases in estimating species richness across sites, although non-expert taxonomists are more likely to misidentify non-native carabid species (Egli et al. 2020). Beetle abundances are recorded for the sorted samples by NEON technicians. To account for individual samples that were later reidentified, the final abundance for a species is the original sorting sample abundance minus the number of individuals that were given a new ID.
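The abundance adjustment described above amounts to simple bookkeeping, sketched here in Python for illustration only (the authors' workflow is in R). The function name, the taxon labels, and the input layout are hypothetical. Crediting each re-identified individual to its new taxon is an assumption of this sketch; the text only states that it is subtracted from the original taxon.

```python
from collections import Counter


def final_abundances(sorting_counts, reidentified):
    """Adjust sorting-stage counts for individuals that later received a new ID.

    sorting_counts: {taxon: count recorded at sorting}
    reidentified:   one (original_taxon, new_taxon) pair per re-identified individual
    """
    final = Counter(sorting_counts)
    for original_taxon, new_taxon in reidentified:
        if original_taxon == new_taxon:
            continue  # identification confirmed, nothing to move
        final[original_taxon] -= 1  # original abundance minus re-identified individuals
        final[new_taxon] += 1       # assumption: credit the individual to its new taxon
    if any(count < 0 for count in final.values()):
        raise ValueError("more individuals re-identified than were sorted")
    return {taxon: count for taxon, count in final.items() if count > 0}


sorting = {"species A": 10, "other carabid": 3}
pinned = [
    ("species A", "species B"),
    ("other carabid", "species B"),
    ("other carabid", "species A"),
]
result = final_abundances(sorting, pinned)
```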
Prior to 2018, trappingDays values were not included for many sites. Missing entries were calculated as the range from setDate through collectDate for each trap. In the trappingDays calculations, we also accounted for a few plots for which setDate was not updated based on a previous collection event. To facilitate easy manipulation of data within and across bouts, a new boutID field was created to identify all trap collection events at a site in a bout. The original EventID field is intended to identify a bout, but has a number of issues that necessitate the creation of a new ID. First, EventID does not correspond to a single collection date but rather to all collections in a week. This is appropriate for the small number of instances when collections for a bout happen over multiple consecutive days (~5% of bouts), but prevents analysis of bout patterns at the temporal scale of a weekday. The data here were updated so that all entries for a bout correspond to the date (i.e., collectDate) on which the majority of traps are collected, to maintain the weekday-level resolution with as high a fidelity as possible while allowing for easy aggregation within bouts and collectDates. Second, there were a few instances in which plots within a site were set and collected on the same day, but have different EventIDs. These instances were all considered a single bout by our new boutID, which is a unique combination of setDate, collectDate, and siteID.
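A minimal Python sketch of the three ideas in this paragraph follows; it is illustrative only and not the authors' R code. It assumes that "the range from setDate through collectDate" means the difference in whole days, and the boutID string layout, the helper names, and the example site code are hypothetical.

```python
from collections import Counter
from datetime import date


def trapping_days(set_date, collect_date):
    """Fill a missing trappingDays value (assumed: difference in whole days)."""
    return (collect_date - set_date).days


def majority_collect_date(collect_dates):
    """Return the date on which most traps in a bout were collected."""
    return Counter(collect_dates).most_common(1)[0][0]


def bout_id(site_id, set_date, collect_date):
    """Combine siteID, setDate, and collectDate into one key (layout is hypothetical)."""
    return f"{site_id}_{set_date.isoformat()}_{collect_date.isoformat()}"


# A bout in which two traps were collected on 15 June and one on 16 June.
set_on = date(2017, 6, 1)
collected = [date(2017, 6, 15), date(2017, 6, 15), date(2017, 6, 16)]
bout_date = majority_collect_date(collected)
example_id = bout_id("SITE", set_on, bout_date)
example_days = trapping_days(set_on, bout_date)
```

Keying the bout on the majority collection date keeps one date per bout while still letting records be grouped by either boutID or collectDate.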
Herpetofaunal bycatch (amphibian and reptile) in pitfall traps was identified to species or the
lowest taxonomic level possible within 24 h of recovery from the field. To process the herp
bycatch NEON data we cleaned trappingDays and the other variables and added boutID as
described above for beetles. The variable sampleType in the bet_sorting table provides the type
of animal caught in a pitfall trap as one of five types: ‘carabid’, ‘vert bycatch herp’, ‘other
carabid’, ‘invert bycatch’ and ‘vert bycatch mam’. We filtered the beetle data described above to
only include the ‘carabid’ and ‘other carabid’ types. For herps, we only kept the sampleType of
‘vert bycatch herp’. Abundance data of beetles and herp bycatch were standardized to be the
number of individuals captured per trap day.
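As a rough illustration of the sampleType filter and the per-trap-day standardization, the following Python sketch may help. It is not the authors' implementation: the column name individualCount, the helper name, and the example values are assumptions made for the example.

```python
BEETLE_TYPES = {"carabid", "other carabid"}
HERP_TYPES = {"vert bycatch herp"}

def per_trap_day(rows, keep_types):
    """Keep rows of the requested sampleType and add individuals per trap day."""
    out = []
    for r in rows:
        # Rows without a positive trappingDays cannot be standardized.
        if r["sampleType"] in keep_types and r["trappingDays"] > 0:
            out.append({**r, "abundance": r["individualCount"] / r["trappingDays"]})
    return out

# Made-up sorting records for one trap.
rows = [
    {"sampleType": "carabid", "individualCount": 28, "trappingDays": 14},
    {"sampleType": "invert bycatch", "individualCount": 5, "trappingDays": 14},
    {"sampleType": "vert bycatch herp", "individualCount": 7, "trappingDays": 14},
]

beetles = per_trap_day(rows, BEETLE_TYPES)
herps = per_trap_day(rows, HERP_TYPES)
```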
Mosquitos
NEON Sampling Design Mosquito specimens are collected at 47 terrestrial sites across all
NEON domains and the data are reported in NEON data product DP1.10043.001. Traps are
distributed throughout each site according to a stratified-random spatial design used for all
Terrestrial Observation System sampling, maintaining stratification across dominant (>5% of
total cover) vegetation types (LeVan 2020b). The number of mosquito traps placed in each
vegetation type is proportional to its percent cover, until 10 total mosquito traps have been
placed in the site. Mosquito traps are typically located within 30 m of a road to facilitate
expedient sampling, and are placed at least 300 m apart to maintain independence.
Mosquito monitoring is divided into off-season and field season sampling (LeVan et al. 2019a).
Off-season sampling begins after three consecutive zero-catch field sampling bouts have
occurred, and represents a reduced sampling regime that is designed for the rapid detection of
when the next field season should begin and to provide mosquito phenology data. Off-season
sampling is conducted at three dedicated mosquito traps spread throughout each core site, while
temperatures are >10 ℃. Once per week, technicians deploy traps at dusk and then collect them
at dawn the following day.
Field season sampling begins when the first mosquito is detected during off-season sampling
(LeVan et al. 2019a). Technicians deploy traps at all 10 dedicated mosquito trap locations per site.
Traps remain out for a 24-hour period, or sampling bout, and bouts occur every two or four
weeks at core and relocatable terrestrial sites, respectively. During the sampling bout, traps are
serviced twice and yield one night-active sample, collected at dawn or about eight hours after the
trap was set, and one day-active sample, collected at dusk or ~16 hours after the trap was set.
Thus, a 24-hour sampling bout yields 20 samples from 10 traps.
NEON collects mosquito specimens using Centers for Disease Control and Prevention (CDC) CO2 light traps
(LeVan et al. 2019a). These traps have been used by other public health and mosquito-control
agencies for a half-century, so that NEON mosquito data align across NEON field sites and with
existing long-term data sets. A CDC CO2 light trap consists of a cylindrical insulated cooler that
contains dry ice, a plastic rain cover attached to a battery-powered light/fan assembly, and a
mesh collection cup. During deployment, the dry ice sublimates and releases CO2. Mosquitoes
attracted to the CO2 bait are sucked into the mesh collection cup by the battery-powered fan,
where they remain alive until trap collection.
Following eld collection, NEON’s eld ecologists process, package, and ship the samples to an333
external lab where mosquitoes are identied to species and sex (when possible). A subset of334
identied mosquitoes are tested for infection by pathogens to quantify the presence/absence and335
prevalence of various arboviruses. Some mosquitoes are set aside for DNA barcode analysis as336
well as long-term archiving. Particularly rare or dicult to identify mosquito specimens are337
prioritized for DNA barcoding. More details can be found in LeVan et al. (2019a).338
Data Wrangling Decisions The mosquito data product (DP1.10043.001) consists of four data
frames: trapping data (mos_trapping), sorting data (mos_sorting), archiving data
(mos_archivepooling), and expert taxonomist processed data
(mos_expertTaxonomistIDProcessed). We first removed rows (records) with missing
information about location, collection date, and sample or subsample ID for all data frames. We
then merged all four data frames into one, wherein we only kept records for target taxa (i.e.,
targetTaxaPresent = “Y”) with no known compromised sampling condition (i.e., sampleCondition
= “No known compromise”). We further removed a small number of records with species
identified only to the family level; all remaining records were identified at least to the genus level.
We estimated the total individual count per trap-hour for each species within a trap as
(individualCount/subsampleWeight) * totalWeight / trapHours. We then removed columns
that were not likely to be used for calculating biodiversity values.
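The scaling formula above can be written out as a small worked example. This is a Python sketch for illustration only (not the authors' code); the function name and the input values are made up, while the arithmetic follows the expression given in the text.

```python
def count_per_trap_hour(individual_count, subsample_weight, total_weight, trap_hours):
    """Scale a subsample count to the whole catch, then divide by trapping effort."""
    if subsample_weight <= 0 or trap_hours <= 0:
        raise ValueError("subsample_weight and trap_hours must be positive")
    # (individualCount / subsampleWeight) * totalWeight / trapHours
    return (individual_count / subsample_weight) * total_weight / trap_hours

# 12 individuals identified in a 0.5 g subsample of a 2.0 g catch from a 16 h deployment.
estimate = count_per_trap_hour(12, 0.5, 2.0, 16.0)
```

Here the subsample represents one quarter of the catch, so the 12 identified individuals scale to 48 for the whole trap, or 3 per trap-hour.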
Small Mammals
NEON Sampling Design NEON defines small mammals based on taxonomic, behavioral,
dietary, and size constraints, and includes any rodent that is (1) nonvolant; (2) nocturnally active;
(3) forages predominantly aboveground; and (4) has a mass >5 grams, but <~500-600 grams
(Thibault et al. 2019). In North America, this includes cricetids, heteromyids, small sciurids, and
introduced murids, but excludes shrews, large squirrels, rabbits, or weasels, although individuals
of these species may be incidentally captured.
Small mammals are collected at NEON sites using Sherman traps, identified to species in the
field, marked with a unique tag, and released (Thibault et al. 2019). Multiple 90 m × 90 m
trapping grids are set up in each terrestrial field site within the dominant vegetation type. Each
90 m × 90 m trapping grid contains 100 traps placed in a pattern with 10 rows and 10 columns
set 10 m apart. Three of these 90 m × 90 m grids per site are designated pathogen (as opposed to
diversity) grids and additional blood sampling is conducted here.
Small mammal sampling occurs in bouts, with a bout comprised of three consecutive (or nearly
consecutive) nights of trapping at each pathogen grid and one night of trapping at each diversity
grid. The timing of sampling occurs within 10 days before or after the new moon. The number of
bouts per year is determined by site type: core sites are typically trapped for six bouts per year
(except for areas with shorter seasons due to cold weather), while relocatable sites are trapped
for four bouts per year. More information can be found in Thibault et al. (2019).
Data Wrangling Decisions In the small mammal NEON data product (DP1.10072.001), records
are stratified by NEON site, year, month, and day and represent data from both the diversity and
pathogen sampling grids. Capture records were removed if they were not identified to genus or
species (e.g., if the species name was denoted as ‘either/or’ or as family name), or if their trap
status was neither “5 - capture” nor “4 - more than 1 capture in one trap”. Abundance data for each plot
and month combination were standardized to be the number of individuals captured per 100 trap
nights.
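A minimal sketch of the trap-status filter and the per-100-trap-nights standardization follows. It is written in Python for illustration only. The column names trapStatus and tagID, the non-capture status string, the choice to count unique tagged individuals, and the way trap nights are supplied are all assumptions for the example rather than details confirmed by the text.

```python
CAPTURE_STATUSES = {"5 - capture", "4 - more than 1 capture in one trap"}

def per_100_trap_nights(records, n_trap_nights):
    """Individuals captured per 100 trap nights for one plot-month."""
    # Assumption: an individual is counted once, however often it is recaptured.
    captured = {r["tagID"] for r in records if r["trapStatus"] in CAPTURE_STATUSES}
    return 100 * len(captured) / n_trap_nights

# Made-up records for one plot and month.
records = [
    {"tagID": "A1", "trapStatus": "5 - capture"},
    {"tagID": "A1", "trapStatus": "5 - capture"},  # recapture of the same animal
    {"tagID": "B2", "trapStatus": "4 - more than 1 capture in one trap"},
    {"tagID": None, "trapStatus": "hypothetical non-capture status"},
]

# 100 traps set on three nights gives 300 trap nights.
rate = per_100_trap_nights(records, n_trap_nights=300)
```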
Terrestrial Plants
NEON Sampling Design NEON plant diversity sampling is completed once or twice per year
(one or two ‘bouts’) in multiscale, 400 m² (20 m × 20 m) plots (Barnett 2019). Each multiscale plot
is subdivided into four 100 m² (10 m × 10 m) subplots that each encompass one or two sets of 10
m² (3.16 m × 3.16 m) subplots within which a 1 m² (1 m × 1 m) subplot is nested. The percent
cover of each plant species is estimated visually in the 1 m² subplots, while only species
presences are documented in the 10 m² and 100 m² subplots.
To estimate plant percent cover by species, technicians record this value for all species in a 1 m²
subplot (Barnett 2019). Next, the remaining 9 m² area of the associated 10 m² subplot is searched
for the presence of species. The process is repeated if there is a second 1 and 10 m² nested pair in
the specific 100 m² subplot. Next, the remaining 80 m² area is searched for the presence of
species; data can be aggregated for a complete list of species present at the 100 m² subplot scale.
Data for all four 100 m² subplots represent indices of species at the 400 m² plot scale. In most
cases, species encountered in a nested, finer scale, subplot are not rerecorded in any
corresponding larger subplot - in order to avoid duplication. Plant species are occasionally
recorded more than once, however, when data are aggregated across all nested subplots within
each 400 m² plot, and these require removal from the dataset. More details about the sampling
design can be found in Barnett et al. (2019).
NEON manages plant taxonomic entries with a master taxonomy list that is based on the
community standard, where possible. Using this list, synonyms for a given species are converted
to the currently used name. The master taxonomy for plants is the USDA PLANTS Database
(USDA, NRCS. 2014. https://plants.usda.gov), and the portions of this database included in the
NEON plant master taxonomy list are those pertaining to native and naturalized plants present
within the NEON sampling area. A sublist for each NEON domain includes those species with
ranges that overlap the domain as well as nativity designations - introduced or native - in that
part of the range. If a species is reported at a location outside of its known range, and the record
proves reliable, the master taxonomy list is updated to reflect the distribution change. For more
details on plant taxonomic handling, see Barnett (2019). For more on the NEON plant master
taxonomy list see NEON.DOC.014042
(https://data.neonscience.org/api/v0/documents/NEON.DOC.014042vK).
Data Wrangling Decisions In the plant presence and percent cover NEON data product
(DP1.10058.001) sampling at the 1 m × 1 m scale also includes observations of abiotic and
non-target species ground cover (i.e., soil, water, downed wood), so we removed records with
divDataType as “otherVariables”. We also removed records whose targetTaxaPresent is N (i.e.,
a non-target species). Additionally, for all spatial resolutions (i.e., 1 m², 10 m², and 100 m² data),
any record lacking information critical for combining data within a plot and for a given sampling
bout (i.e., plotID, subplotID, boutNumber, endDate, or taxonID) was dropped from the dataset.
Furthermore, records without a definitive genus or species level taxonID (i.e., those representing
unidentified morphospecies) were not included. To combine data from different spatial
resolutions into one data frame, we created a pivot column entitled sample_area_m2 (with
possible values of 1, 10, and 100). Because of the nested sampling design of the plant data, to
capture all records within a subplot at the 100 m² scale, we incorporated all data from both the 1
m² and 10 m² scales for that subplot. Similarly, to obtain all records within a plot at the 400 m²
scale, we included all data from that plot. Species abundance information was only recorded as
area coverage within 1 m by 1 m subplots; however, users may use the frequency of a species
across subplots within a plot or plots within a site as a proxy of its abundance if needed.
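The nested aggregation described above can be sketched as follows, in Python and for illustration only. The sketch assumes a simplified layout in which subplotID names the enclosing 100 m² subplot (real NEON subplot identifiers encode the nesting differently); the function names, plot IDs, and taxon codes are hypothetical.

```python
def species_at_scale(records, plot_id, subplot_id, scale_m2):
    """Taxa recorded in one 100 m2 subplot at or below the requested scale."""
    return sorted({
        r["taxonID"] for r in records
        if r["plotID"] == plot_id
        and r["subplotID"] == subplot_id
        and r["sample_area_m2"] <= scale_m2
    })

def species_in_plot(records, plot_id):
    """Taxa at the 400 m2 plot scale: every record from the plot, deduplicated."""
    return sorted({r["taxonID"] for r in records if r["plotID"] == plot_id})

# Made-up records; sample_area_m2 is the pivot column with values 1, 10, or 100.
records = [
    {"plotID": "P1", "subplotID": "31", "sample_area_m2": 1, "taxonID": "ACRU"},
    {"plotID": "P1", "subplotID": "31", "sample_area_m2": 10, "taxonID": "QURU"},
    {"plotID": "P1", "subplotID": "31", "sample_area_m2": 100, "taxonID": "PIST"},
    {"plotID": "P1", "subplotID": "40", "sample_area_m2": 1, "taxonID": "ACRU"},
]

at_10 = species_at_scale(records, "P1", "31", 10)
at_100 = species_at_scale(records, "P1", "31", 100)
plot_list = species_in_plot(records, "P1")
```

Collecting taxa into a set removes the occasional duplicate that appears when a species is recorded at more than one nested scale.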
Ticks and Tick Pathogens
NEON Sampling Design Tick sampling occurs in six distributed plots at each site, which are
randomly chosen in proportion to NLCD land cover class (LeVan et al. 2019c). Ticks are sampled
by walking the perimeter of a 40 m × 40 m plot using a 1 m × 1 m drag cloth. Ideally, 160 meters
are sampled (shortest straight line distance between corners), but the cloth can be dragged
around obstacles if a straight line is not possible. Acceptable total sampling distance is between 80
and 180 m per plot. The cloth can also be flagged over vegetation when the cloth cannot be
dragged across it. Ticks are collected from the cloth and technicians’ clothing at appropriate
intervals, depending on vegetation density, and at every corner of the plot. Specimens are
immediately transferred to a vial containing 95% ethanol.
Onset and oset of tick sampling coincides with phenological milestones at each site, beginning433
within two weeks of the onset of green-up and ending within two weeks of vegetation434
senescence (LeVan et al. 2019c). Sampling bouts are only initiated if the high temperature on the435
two consecutive days prior to planned sampling was >0℃. Early season sampling is conducted436
on a low intensity schedule, with one sampling bout every six weeks. When more than ve ticks437
of any life stage have been collected within the last calendar year at a site, sampling switches to a438
high intensity schedule at the site - with one bout every three weeks. A site remains on the high439
intensity schedule until fewer than ve ticks are collected within a calendar year, then sampling440
reverts back to the low intensity schedule.441
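The schedule-switching rule can be read as a small state update, sketched here in Python for illustration. The function and constant names are hypothetical, and the handling of exactly five ticks (no change of schedule) is an assumption, since the text specifies only "more than five" and "fewer than five".

```python
# Weeks between bouts under each schedule, as stated in the text.
BOUT_INTERVAL_WEEKS = {"low": 6, "high": 3}

def next_schedule(current, ticks_last_year):
    """Return the sampling intensity for a site given last year's tick total."""
    if current == "low" and ticks_last_year > 5:
        return "high"
    if current == "high" and ticks_last_year < 5:
        return "low"
    # Assumption: a total of exactly five leaves the schedule unchanged.
    return current

interval = BOUT_INTERVAL_WEEKS[next_schedule("low", 12)]
```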
Ticks are sent to an external facility for identification to species, life stage, and sex (LeVan et al.
2019c). A subset of nymphal ticks are additionally sent to a pathogen testing facility. Ixodes
species are tested for Anaplasma phagocytophilum, Babesia microti, Borrelia burgdorferi sensu
lato, Borrelia miyamotoi, Borrelia mayonii, other Borrelia species (Borrelia sp.), and an Ehrlichia
muris-like agent (Pritt et al. 2017). Non-Ixodes species are tested for Anaplasma phagocytophilum,
Borrelia lonestari (and other undefined Borrelia species), Ehrlichia chaffeensis, Ehrlichia ewingii,
Francisella tularensis, and Rickettsia rickettsii. Additional information about tick pathogen testing
can be found in the Tick Pathogen Testing SOP
(https://data.neonscience.org/api/v0/documents/UMASS_LMZ_tickPathogens_SOP_20160829)
for the NEON Tick-borne Pathogen Status data product.
Data Wrangling Decisions e tick NEON data product (DP1.10093.001) consists of two452
dataframes: ‘tck_taxonomyProcessed’ hereaer referred to as ‘taxonomy data’ and ‘tck_elddata’453
hereaer referred to as ‘eld data. Users should be aware of some issues related to taxonomic ID.454
Counts assigned to higher taxonomic levels (e.g., at the order level Ixodida; IXOSP2) are not the455
sum of lower levels; rather they represent the counts of individuals that could not reliably be456
18
assigned to a lower taxonomic unit. Samples that were not identied in the lab were assigned to457
the highest taxonomic level (order Ixodida; IXOSP2). However, users could make an informed458
decision to assign these ticks to the most probable group if a subset of individuals from the same459
sample were assigned to a lower taxonomy.460
To clean the tick data, we first removed surveys and samples not meeting quality standards. In
the taxonomy data, we removed samples where sample condition was not listed as “OK” (<1% of
records). In the field data, we removed records where samples were not collected due to logistical
concerns (10%). We then combined male and female counts in the taxonomy table into one
“adult” class. The taxonomy table was re-formatted so that every row contained a sampleID and
counts for each species life-stage were separate columns (i.e., “wide format”). Next, we joined
the field data to the taxonomy data, using the sample ID to link the two tables. When joining, we
retained field records where no ticks were found in the field and thus there were no associated
taxonomy data. In drags where ticks were not found, counts were given zeros. All counts were
standardized by area sampled.
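The reshape-and-join steps above can be illustrated with a minimal Python sketch. It is not the authors' code: the column names sexOrAge, individualCount, and totalSampledArea, the helper names, and the example values are assumptions made for the example.

```python
from collections import defaultdict

def widen(taxonomy_rows):
    """One entry per sampleID with a count for each taxon/life-stage column."""
    wide = defaultdict(lambda: defaultdict(int))
    for r in taxonomy_rows:
        # Male and female counts are pooled into a single "adult" class.
        stage = "adult" if r["sexOrAge"] in {"Male", "Female"} else r["sexOrAge"].lower()
        wide[r["sampleID"]][f'{r["taxonID"]}_{stage}'] += r["individualCount"]
    return wide

def join_field(field_rows, wide, columns):
    """Left join from field data; drags without ticks receive zero counts."""
    out = []
    for f in field_rows:
        counts = wide.get(f["sampleID"], {})
        row = {"sampleID": f["sampleID"]}
        for c in columns:
            # Standardize each count by the area dragged (m2).
            row[c] = counts.get(c, 0) / f["totalSampledArea"]
        out.append(row)
    return out

taxonomy_rows = [
    {"sampleID": "S1", "taxonID": "IXOSCA", "sexOrAge": "Male", "individualCount": 2},
    {"sampleID": "S1", "taxonID": "IXOSCA", "sexOrAge": "Female", "individualCount": 3},
    {"sampleID": "S1", "taxonID": "IXOSCA", "sexOrAge": "Nymph", "individualCount": 10},
]
field_rows = [
    {"sampleID": "S1", "totalSampledArea": 160},
    {"sampleID": None, "totalSampledArea": 80},  # drag where no ticks were found
]
columns = ["IXOSCA_adult", "IXOSCA_nymph"]

result = join_field(field_rows, widen(taxonomy_rows), columns)
```

Joining from the field table outward is what preserves the zero-count drags, which would be lost if the taxonomy table were the starting point.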
Prior to 2019, both field surveyors and laboratory taxonomists enumerated each tick life-stage;
consequently, in the joined dataset there were two sets of counts (“field counts” and “lab counts”).
However, starting in 2019, counts were performed by taxonomists rather than field surveyors.
Field surveys conducted after 2019 no longer have field counts. Users of tick abundance data
should be aware that this change in protocol has several implications for data wrangling and for
analysis. First, after 2019, tick counts are no longer published at the same time as field survey
data. Consequently, some field records from the most recent years have tick presence recorded
(targetTaxaPresent = “Y”), but do not yet have associated counts or taxonomic information and
so the counts are still listed as NA. Users should be aware that counts of zero are therefore
published earlier than positive counts. We strongly urge users to filter data to those years where
there are no counts pending.
The second major issue is that in years where both field counts and lab counts were available,
they did not always agree (8% of records). In cases of disagreement, we generally used lab counts
in the final abundance data, because this is the source of all tick count data after 2019 and
because life-stage identification was more accurate. However, there were a few exceptions where
we used field count data. In some cases, only a subsample of a certain life-stage was counted in
the lab, which resulted in higher field counts than lab counts. In this case, we assigned the
additional un-identified individuals (e.g., the difference between the field and lab counts) to the
order level (IXOSP2). If quality notes from NEON described ticks being lost in transit, we also
added the additional lost individuals to the order level. There were some cases (<1%) where the
field counts were greater than lab counts by more than 20% and where the explanation was not
obvious; we removed these records. We note that the majority of samples (~85%) had no
discrepancies between the lab and field counts; therefore, this process could be ignored by users whose
analyses are not sensitive to exact counts.
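One way to express the reconciliation rules above is sketched below, in Python and for illustration only. The function name, the boolean flag standing in for "subsampled in the lab or lost in transit", the return layout, and the choice to measure the 20% threshold relative to the lab count are all assumptions rather than the authors' stated implementation.

```python
def reconcile(field_count, lab_count, explained):
    """Return final counts for one life stage, or None if the record is dropped.

    explained: True when notes attribute the excess field count to lab
    subsampling or to ticks lost in transit (hypothetical flag).
    """
    if field_count <= lab_count:
        # Lab counts are used by default.
        return {"identified": lab_count, "IXOSP2": 0}
    excess = field_count - lab_count
    if explained:
        # Unidentified surplus individuals go to the order level.
        return {"identified": lab_count, "IXOSP2": excess}
    if excess > 0.2 * lab_count:
        # Large unexplained discrepancy: remove the record.
        return None
    return {"identified": lab_count, "IXOSP2": 0}

agree = reconcile(10, 10, explained=False)
subsampled = reconcile(50, 30, explained=True)
dropped = reconcile(50, 30, explained=False)
minor = reconcile(11, 10, explained=False)
```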
The tick pathogen NEON data product (DP1.10092.001) consists of two dataframes:
tck_pathogen hereafter referred to as ‘pathogen data’ and tck_pathogenqa hereafter referred to
as ‘quality data’. First, we removed any samples that had flagged quality checks from the quality
data and removed any samples that did not have a positive DNA quality check from the
pathogen data. Although the original online protocol aimed to test 130 ticks per site per year
from multiple tick species, the final sampling decision was to extensively sample IXOSCA,
AMBAME, and AMBSP species only because IXOPAC and Dermacentor nymphs were
too rare to generate meaningful pathogen data. Borrelia burgdorferi and Borrelia burgdorferi sensu
lato tests were merged, since the former was an incomplete pathogen name and refers to B.
burgdorferi sensu lato as opposed to sensu stricto (Rudenko et al. 2011). Tick pathogen data are
presented as positivity rate calculated as number of positive tests per number of tests conducted for
a given pathogen on ticks collected during a given sampling event.
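The positivity rate, including the merging of the two Borrelia burgdorferi labels, can be sketched as follows. This Python example is for illustration only; the column names eventID, pathogen, and result, the "Positive"/"Negative" coding, and the helper names are assumptions for the example.

```python
from collections import defaultdict

# The incomplete name is folded into the sensu lato label before tallying.
MERGED_NAMES = {"Borrelia burgdorferi": "Borrelia burgdorferi sensu lato"}

def positivity_rates(tests):
    """Positive tests divided by tests conducted, per sampling event and pathogen."""
    tally = defaultdict(lambda: [0, 0])  # [positives, tests]
    for t in tests:
        pathogen = MERGED_NAMES.get(t["pathogen"], t["pathogen"])
        key = (t["eventID"], pathogen)
        tally[key][1] += 1
        if t["result"] == "Positive":
            tally[key][0] += 1
    return {key: pos / n for key, (pos, n) in tally.items()}

# Made-up test results from one sampling event.
tests = [
    {"eventID": "E1", "pathogen": "Borrelia burgdorferi", "result": "Positive"},
    {"eventID": "E1", "pathogen": "Borrelia burgdorferi", "result": "Negative"},
    {"eventID": "E1", "pathogen": "Borrelia burgdorferi sensu lato", "result": "Positive"},
    {"eventID": "E1", "pathogen": "Borrelia burgdorferi sensu lato", "result": "Negative"},
]

rates = positivity_rates(tests)
```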
Aquatic Organisms
Aquatic macroinvertebrates
NEON Sampling Design Aquatic macroinvertebrate sampling occurs three times/year at
wadeable stream, river, and lake sites from spring through fall. Timing of sampling is
site-specific and based on historical hydrological, meteorological, and phenological data
including dates of known ice cover, growing degree days, and green up and brown down (Cawley
et al. 2016). Samplers vary by habitat and include Surber, Hess, hand corer, modified kicknet,
D-frame sweep, and petite ponar samplers (Parker 2019). Stream sampling occurs throughout the
1 km permitted reach in wadeable areas of the two dominant habitat types. Lake sampling occurs
with a petite ponar near buoy, inlet, and outlet sensors, and D-frame sweeps in wadeable littoral
zones. Riverine sample collections in deep waters or near instrument buoys are made with a
petite ponar, and in littoral areas are made with a D-frame sweep or large-woody debris sampler.
In the field, samples are preserved in pure ethanol, and later in the domain support facility,
glycerol is added to prevent the samples from becoming brittle. Samples are shipped from the
domain facility to a taxonomy lab for sorting and identification to lowest possible taxon (e.g.,
genus or species) and counts of each taxon per size are made to the nearest mm.
Data Wrangling Decisions Aquatic macroinvertebrate data contained in the NEON data
product DP1.20120.001 are subsampled and identified to the lowest practical taxonomic level,
typically genus, by expert taxonomists in the inv_taxonomyProcessed table, measured to the
nearest mm size class, and counted. Taxonomic naming has been standardized in the
inv_taxonomyProcessed file, according to NEON’s master taxonomy
(https://data.neonscience.org/taxonomic-lists), removing any synonyms. We calculated
macroinvertebrate density by dividing estimatedTotalCount (which includes the corrections for
subsampling in the taxonomy lab) by benthicArea from the inv_fieldData table to return count
per square meter of stream, lake, or river bottom (Chesney et al. 2021).
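The density calculation amounts to a lookup of benthicArea by sample followed by a division, as in this Python sketch (for illustration only). Joining on sampleID, the dictionary layout, the output column name density_per_m2, and the example values are assumptions for the example.

```python
def densities(taxonomy_rows, field_data):
    """Add count per square meter to each taxonomy row."""
    out = []
    for row in taxonomy_rows:
        # benthicArea (m2) comes from the field table for the same sample.
        area = field_data[row["sampleID"]]["benthicArea"]
        out.append({**row, "density_per_m2": row["estimatedTotalCount"] / area})
    return out

# Made-up field and taxonomy records for one sample.
field_data = {"S1": {"benthicArea": 0.25}}
taxonomy_rows = [
    {"sampleID": "S1", "scientificName": "Baetis sp.", "estimatedTotalCount": 450},
    {"sampleID": "S1", "scientificName": "Chironomidae", "estimatedTotalCount": 1200},
]

result = densities(taxonomy_rows, field_data)
```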
Microalgae (Periphyton and Phytoplankton)
NEON Sampling Design NEON collects periphyton samples from natural surface substrata (i.e.,
cobble, silt, woody debris) over a 1 km reach in streams and rivers, and in the littoral zone of
lakes. Various collection methods and sampler types are used, depending on substrate (Parker
2020). In lakes and rivers, periphyton are also collected from the most dominant substratum type
in three areas within the littoral (i.e., shoreline) zone. Prior to 2019, littoral zone periphyton
sampling occurred in five areas.
NEON collects three phytoplankton samples per sampling date using Kemmerer or Van Dorn
samplers. In rivers, samples are collected near the sensor buoy and at two other deep-water
points in the main channel. For lakes, phytoplankton are collected near the central sensor buoy
as well as at two littoral sensors. Where lakes and rivers are stratified, each phytoplankton
sample is a composite from one surface sample, one sample from the metalimnion (i.e., middle
layer), and one sample from the bottom of the euphotic zone. For non-stratified lakes and
non-wadeable streams, each phytoplankton sample is a composite from one surface sample, one
sample just above the bottom of the euphotic zone, and one mid-euphotic zone sample - if the
euphotic zone is > 5 m deep.
All microalgae sampling occurs three times per year (i.e., spring, summer, and fall bouts) in the
same sampling bouts as aquatic macroinvertebrates and zooplankton. In wadeable streams,
which have variable habitats (e.g., riffles, runs, pools, step pools), three periphyton samples are
collected per bout in the dominant habitat type (five samples collected prior to 2019) and three
per bout in the second most dominant habitat type. No two samples are collected from the
same habitat unit (i.e., the same riffle).
Samples are processed at the domain support facility and separated into subsamples for
taxonomic analysis or for biomass measurements. Aliquots shipped to an external facility for
taxonomic determination are preserved in glutaraldehyde or Lugol’s iodine (before 2021).
Aliquots for biomass measurements are filtered onto glass-fiber filters and processed for ash-free
dry mass.
Data Wrangling Decisions e periphyton, seston, and phytoplankton NEON data product559
(DP1.20166.001) contains three dataframes for algae containing information on algae taxonomic560
identication, biomass and related eld data, which are hereaer referred to as alg_tax_long,561
alg_biomass and alg_eld_data. Algae within samples are identied to the lowest possible562
taxonomic resolution, usually species, by contracting laboratory taxonomists. Some specimens563
can only be identied to the genus or even class level, depending on the condition of the564
specimen. Ten percent of all samples are checked by a second taxonomist and are noted in the565
qcTaxonomyStatus. Taxonomic naming has been standardized in the alg_tax_long les,566
according to NEON’s master taxonomy, removing nomenclatural synonyms. Abundance and567
cell/colony counts are determined for each taxon of each sample with counts of cells or colonies568
that are either corrected for sample volume or not (as indicated by algalParameterUnit =569
22
‘cellsperBole’).570
We corrected sample units of cellsperBottle to density (Parker and Vance 2020). First, we
summed the preservative volume and the lab’s recorded sample volume for each sample (from
the alg_biomass file) and combined that with the alg_tax_long file using sampleID as a
common identifier. Where samples in the alg_tax_long file were missing data in the
perBottleSampleVolume field (measured after receiving samples at the external laboratory), we
estimated the sample volume using NEON domain lab sample volumes (measured prior to
shipping samples to the external laboratory). We then combined this updated file with
alg_field_data to have the related field conditions, including benthic area sampled for each
sample. parentSampleID was used for alg_field_data to join to the alg_biomass file’s
sampleID as alg_field_data only has parentSampleID. We then calculated cells per milliliter
for the uncorrected taxon of each sample, dividing algalParameterValue by the updated sample
volume. Benthic sample results are expressed in terms of area (i.e., multiplied by the field sample
volume, divided by benthic area sampled), in square meters. The final abundance units are either
cells/mL (phytoplankton and seston samples) or cells/m² for benthic samples.
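The two unit conversions described above can be written as a short worked example. This Python sketch is for illustration only; the function and argument names are hypothetical, the volumes are assumed to be in milliliters and the area in square meters, and the numbers are made up.

```python
def cells_per_ml(cells_per_bottle, lab_sample_volume_ml, preservative_volume_ml):
    """Convert a per-bottle count to density using the updated sample volume."""
    updated_volume = lab_sample_volume_ml + preservative_volume_ml
    return cells_per_bottle / updated_volume

def cells_per_m2(density_per_ml, field_sample_volume_ml, benthic_area_m2):
    """Express a benthic sample per unit area of substrate sampled."""
    return density_per_ml * field_sample_volume_ml / benthic_area_m2

# 12,000 cells per bottle, with 115 mL of sample plus 5 mL of preservative.
water_column = cells_per_ml(12000, 115, 5)

# The same density for a benthic sample: 250 mL field volume over 0.5 m2.
benthic = cells_per_m2(water_column, 250, 0.5)
```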
The sampleIDs are child records of each parentSampleID that will be collected as long as
sampling is not impeded (i.e., ice covered or dry). In the alg_biomass file, there should be only a
single entry for each parentSampleID, sampleID, and analysisType. Most often, there were two
sampleIDs per parentSampleID with one for ash-free dry mass (AFDM) and one for taxonomy
(analysis types). For the creation of the observation table with standardized counts, we used only
records from the alg_biomass file with the analysisType of taxonomy. In alg_tax_long, there
are multiple entries for each sampleID for each taxon by scientificName and algalParameter.
Fish
NEON Sampling Design Fish sampling is carried out across 19 of the NEON eco-climatic
domains, occurring in a total of 23 lotic (stream) and five lentic (lake) sites. In lotic sites, up to 10
non-overlapping reaches, each 70 to 130 m long, are designated within a 1 km section of stream
(Jensen et al. 2019a). These include three constantly sampled ‘fixed’ reaches, which encompass
all representative habitats found within the 1 km stretch, and seven ‘random’ reaches that are
sampled on a rotating schedule. In lentic sites, 10 pie-shaped segments are established, with each
segment ranging from the riparian zone into the lake center, therefore effectively capturing both
nearshore and offshore habitats (Jensen et al. 2019b). Three of the 10 segments are fixed and are
surveyed twice a year, and the remaining segments are random and are sampled rotationally. The
spatial layouts of these sites are designed to capture spatial and temporal heterogeneity in the
aquatic habitats.
Lotic sampling occurs at three fixed and three random reaches per sampling bout, and there are
two bouts per year - one in spring and one in fall. During each bout, the fixed reaches are
sampled via a three-pass electrofishing depletion approach (Moulton II et al. 2002, Peck et al.
2006) while the random reaches are sampled with a single-pass depletion approach.
Which random reaches are surveyed depends on the year, with three of the random reaches
sampled every other year. All sampling occurs during daylight hours, with each sampling bout
completed within five days and with a minimum two-week gap in between two successive
sampling bouts. The initial sampling date is determined using site-specific historical data on ice
melting, water temperature (or accumulated degree days), and riparian peak greenness.
The lentic sampling design is similar to that discussed above, with fixed segments being sampled
twice per year and random segments sampled twice per year on a rotational basis (i.e., each
random segment is not sampled every year). Lentic sampling is conducted using three gear types,
with backpack electrofishing and mini-fyke nets near the shoreline and gill nets in deeper waters.
Backpack electrofishing is done on a 4 m × 25 m reach near the shoreline via a three-pass (for
fixed segments) or single-pass (for random segments) electrofishing depletion approach
(Moulton II et al. 2002, Peck et al. 2006). All three passes in a fixed sampling segment are
completed on the same night, with ≤30 minutes between successive passes. Electrofishing begins
within 30 minutes of sunset and ceases within 30 minutes of sunrise, with a maximum of five
passes per sampling bout. A single gill net is also deployed within all segments being sampled,
both fixed and random, for 1-2 hours in either the morning or early afternoon. Finally, a fyke
(Baker et al. 1997) or mini-fyke net is deployed at each fixed or random segment, respectively.
Fyke nets are positioned before sunset and recovered after sunrise on the following day. Precise
start and end times for electrofishing and net deployments are documented by NEON technicians
at the time of sampling.
In all surveys, captured fish are identified to the lowest practical taxonomic level, and
morphometrics (i.e., body mass and body length) are recorded for 50 individuals of each taxon
before releasing. Relative abundance for each fish taxon is also recorded by direct enumeration
(up to first 50 individuals) or estimation by bulk counts (>50 individuals, i.e., by placing fish of a
given taxon into a dip net (i.e., net scoop), counting the total number of specimens in the dip net,
and then multiplying the total number of scoops of captured fish by the counts from the first
scoop).
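The bulk-count estimate is a single multiplication, shown here as a Python sketch for illustration. The function name and the example numbers are made up; only the arithmetic (count in the first scoop times the number of scoops) comes from the text.

```python
def estimate_bulk_count(first_scoop_count, n_scoops):
    """Estimate a bulk count as first-scoop count times the number of scoops."""
    if first_scoop_count < 0 or n_scoops < 0:
        raise ValueError("counts must be non-negative")
    return first_scoop_count * n_scoops

# 23 fish counted in the first dip-net scoop, 4 scoops in total.
bulk_estimate = estimate_bulk_count(23, 4)
```

The estimate treats every scoop as holding the same number of fish as the first one, so it is an approximation rather than a census.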
Data Wrangling Decisions Fish sampled via both electrofishing and trapping are identified at
variable taxonomic resolutions (as fine as subspecies level) in the field. Most identifications are
made to the species or genus level by a single field technician for a given bout per site. Sampled
fish are identified, measured, weighed, and then released back to the site of capture. If field
technicians are unable to identify to the species level, such specimens are identified to the finest
possible taxonomic resolution or assigned a morphospecies with a coarse-resolution
identification. The standard sources consulted for identification and a qualifier for identification
validity are also documented in the fsh_perFish table. The column bulkFishCount of the
fsh_bulkCount table records relative abundance for each species or the alternative next possible
taxon level (specified in the column scientificName).
Fish data (taxonomic identification and relative abundance) are recorded per each sampling reach
in streams or per segment in lakes in each bout and documented in the fsh_perFish table
(Monahan et al. 2020). The column eventID uniquely identifies the sampling date of the year, the
specific site within the domain, a reach/segment identifier, the pass number (i.e., number of
electrofishing passes or number of net deployment efforts), and the survey method. The eventID
column helps tie all fish data with stream reach/lake segment data or environmental data (i.e.,
water quality data) and sampling effort data (e.g., electrofishing and net set time). A reachID
column provided in the fsh_perPass table uniquely identifies surveys done per stream reach or
lake segment. The reachID is nested within the eventID as well. We used eventID as a nominal
variable to uniquely identify different sampling events and to join different, stacked fish data files
as described below.
e sh NEON data product (DP1.20107.001) consists of fsh_perPass,fsh_eldData,656
25
fsh_bulkCount,fsh_perFish, and the complete taxon table for sh, for both stream and lake657
sites. To join all reach-scale data, we rst joined the fsh_perPass with fsh_eldData, and658
eliminated all bouts where sampling was untenable. Subsequently, we joined the reach-scale659
table with fsh_perFsh to add individual sh counts and sh measurements. en, to add bulk660
counts, we joined the reach-scale table with fsh_bulkCount datasets, and subsequently added661
taxonRank which included the taxonomic resolution into the bulk-processed table. Aerward,662
both individual-level and bulk-processed datasets were appended into a single table. To include663
samples where no sh were captured, we ltered the fsh_perPass table retaining records where664
target taxa (sh) were absent, joined it with fsh_eldData, and nally merged it with the table665
that contained both bulk-processed and individual-level data. For each ner-resolution taxon in666
the individual-level dataset, we considered the relative abundance as one since each row667
represented a single individual sh. Whenever possible, we substituted missing data by668
cross-referencing other data columns, omied completely redundant data columns, and retained669
records with genus- and species-level taxonomic resolution. For the appended dataset, we also670
calculated the relative abundance for each species per sampling reach or segment at a given site.671
To calculate species-specic catch per unit eort (CPUE), we normalized the relative abundance672
by either average electroshing time (i.e., efTime,efTime2) or trap deployment time (i.e., the673
dierence between netEndTime and netSetTime). For trap data, we assumed that size of the674
traps used, water depths, number of neers used, and the reach lengths (a signicant proportion675
of bouts had reach lengths missing) to be comparable across dierent sampling reaches and676
segments.677
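
The counting and CPUE steps described above can be sketched as follows. This is a minimal illustration in Python, not the workflow code in ecocomDP: the toy records, the function names, the choice of hours as the effort unit, and the assumption that efTime values are in seconds are ours. Only the column names (eventID, scientificName, bulkFishCount, efTime, efTime2, netSetTime, netEndTime) come from the text.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical individual-level rows (one row per fish), in the spirit of fsh_perFish.
per_fish = [
    {"eventID": "E1", "scientificName": "Salmo trutta"},
    {"eventID": "E1", "scientificName": "Salmo trutta"},
    {"eventID": "E2", "scientificName": "Lepomis cyanellus"},
]

# Hypothetical bulk-count rows, in the spirit of fsh_bulkCount.
bulk_count = [
    {"eventID": "E1", "scientificName": "Salmo trutta", "bulkFishCount": 10},
]

# Hypothetical effort records keyed by eventID, in the spirit of fsh_perPass.
# Assumption: efTime and efTime2 are electrofishing times in seconds.
per_pass = {
    "E1": {"efTime": 600.0, "efTime2": 660.0},
    "E2": {"netSetTime": "2019-06-01T08:00", "netEndTime": "2019-06-01T12:00"},
}


def effort_hours(effort):
    """Return sampling effort in hours for one event."""
    if "efTime" in effort:
        # Average the available electrofishing times.
        times = [effort[k] for k in ("efTime", "efTime2") if effort.get(k) is not None]
        return (sum(times) / len(times)) / 3600.0
    # Trap deployment time: difference between net end and net set times.
    start = datetime.fromisoformat(effort["netSetTime"])
    end = datetime.fromisoformat(effort["netEndTime"])
    return (end - start).total_seconds() / 3600.0


def cpue_by_event_and_taxon(per_fish, bulk_count, per_pass):
    """Sum counts per (eventID, taxon) and divide by the event's effort."""
    counts = defaultdict(int)
    for row in per_fish:
        # Each individual-level row represents a single fish.
        counts[(row["eventID"], row["scientificName"])] += 1
    for row in bulk_count:
        counts[(row["eventID"], row["scientificName"])] += row["bulkFishCount"]
    return {key: n / effort_hours(per_pass[key[0]]) for key, n in counts.items()}


cpue = cpue_by_event_and_taxon(per_fish, bulk_count, per_pass)
for (event_id, taxon), value in sorted(cpue.items()):
    print(event_id, taxon, round(value, 2))
```

Keying every table on eventID mirrors the role the text gives that column; a real implementation would also need to handle events with missing effort values.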
Zooplankton
NEON Sampling Design. Zooplankton samples are collected at seven NEON lake sites across
four domains. Zooplankton samples are collected at the buoy sensor set (deepest location in the
lake) and at the two nearshore sensor sets using a vertical tow net for locations deeper than 4 m
and a Schindler trap for locations shallower than 4 m (Parker and Roehm 2019). This results in
three samples collected per sampling day. Samples are preserved with ethanol in the field and
shipped from the domain facility to a taxonomy lab for sorting and identification to lowest
possible taxon (e.g., genus or species) and counts of each taxon per size are made to the nearest
mm.
Data Wrangling Decisions e NEON zooplankton data product (DP1.20219.001) consists of687
dataframes for taxonomic identication and related eld data (Parker and Sco 2020).688
Zooplankton in NEON samples are identied at contracting labs to the lowest possible689
taxonomic resolution, usually genus, however some specimens can only be identied to the690
family (or even class) level, depending on the condition of the specimen. Ten percent of all691
samples are checked by two taxonomists and are noted in the qcTaxonomyStatus column. e692
taxonomic naming has been standardized in the zoo_taxonomyProcessed table, according to693
NEON’s master taxonomy, removing any synonyms. Density was calculated using694
adjCountPerBottle and towsTrapsVolume to correct count data to “count per liter”.695
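
As a rough illustration of the density correction, the sketch below divides an adjusted count by the sampled volume. It assumes that the volume is expressed in liters and that density is a simple ratio of the two columns named above; the function name and the example numbers are hypothetical, and the actual workflow code in ecocomDP should be consulted for the exact calculation.

```python
def density_per_liter(adj_count_per_bottle, tows_traps_volume_liters):
    """Convert an adjusted count into a density (individuals per liter).

    Assumption: the sampled volume is already expressed in liters.
    """
    if tows_traps_volume_liters <= 0:
        raise ValueError("sampled volume must be positive")
    return adj_count_per_bottle / tows_traps_volume_liters


# Hypothetical sample: 450 individuals of one taxon in 150 L of sampled water.
density = density_per_liter(450, 150.0)
print(density)  # 3.0
```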
Results (or how to get and use standardized NEON organismal data)
All cleaned and standardized datasets can be obtained from the R package neonDivData and
from the EDI data repository (temporary link, which will be finalized upon acceptance:
https://portal-s.edirepository.org/nis/mapbrowse?scope=edi&identifier=190&revision=2). Note
that neonDivData includes both stable and provisional data released by NEON while the data
repository in EDI only includes stable datasets. If users want to change some of the decisions to
wrangle the data differently, they can find the code in the R package ecocomDP and modify
it for their own purposes.
The data package neonDivData can be installed from GitHub. Installation instructions can be
found on the GitHub webpage (https://github.com/daijiang/neonDivData). Table 2 shows a
brief summary of all data objects. To get data for a specific taxonomic group, we can just call the
objects in the R object column in Table 2. Such data products include cleaned (and standardized
if needed) occurrence data for the taxonomic groups covered and are equivalent to the
“observation” table of the ecocomDP data format. If environmental information was provided by
NEON for some taxonomic groups, it is also included in these data objects. Information such
as latitude, longitude, and elevation for all taxonomic groups was saved in the neon_location
object of the R package, which is equivalent to the “sampling_location” table of the ecocomDP
data format. Information about species scientific names of all taxonomic groups was saved in
the neon_taxa object, which is equivalent to the “taxon” table of the ecocomDP data format.

Table 2: Summary of data products included in this study (as of 01 September, 2021). Users can call
the R objects in the R object column from the R data package neonDivData to get the standardized
data for specific taxonomic groups.

Taxon group         R object                 N species  N sites  Start date  End date
Algae               data_algae               1946       33       2014-07-02  2019-07-15
Beetles             data_beetle              756        47       2013-07-03  2020-10-13
Birds               data_bird                541        47       2015-05-13  2020-07-20
Fish                data_fish                147        28       2016-03-29  2020-12-03
Herptiles           data_herp_bycatch        128        41       2014-04-02  2020-09-29
Macroinvertebrates  data_macroinvertebrate   1330       34       2014-07-01  2020-08-12
Mosquitoes          data_mosquito            128        47       2014-04-09  2020-06-16
Plants              data_plant               6197       47       2013-06-24  2020-10-23
Small mammals       data_small_mammal        145        46       2013-06-19  2020-11-20
Tick pathogens      data_tick_pathogen       12         15       2014-04-17  2018-10-03
Ticks               data_tick                19         46       2014-04-02  2020-10-06
Zooplankton         data_zooplankton         157        7        2014-07-02  2020-07-22

To demonstrate the use of data packages, we used data_plant to quickly visualize the
distribution of species richness of plants across all NEON sites (Fig. 2). To show how easy it is to
get site level species richness, we presented the code used to generate the data for Fig. 2 as
Code S1 in the supporting information.
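
The idea behind a site-level richness calculation can be sketched in a few lines. The example below is not Code S1 (which is written in R against data_plant); it is a Python illustration on toy records, and the column names siteID and taxon_id as well as the placeholder taxon codes are assumptions made for the example.

```python
from collections import defaultdict

# Toy occurrence records; site codes are NEON-style, taxon codes are placeholders.
observations = [
    {"siteID": "HARV", "taxon_id": "sp_a"},
    {"siteID": "HARV", "taxon_id": "sp_b"},
    {"siteID": "HARV", "taxon_id": "sp_a"},  # repeated taxon, counted once
    {"siteID": "ORNL", "taxon_id": "sp_a"},
    {"siteID": "ORNL", "taxon_id": "sp_c"},
    {"siteID": "ORNL", "taxon_id": "sp_d"},
]


def richness_by_site(rows):
    """Count the number of distinct taxa recorded at each site."""
    taxa = defaultdict(set)
    for row in rows:
        taxa[row["siteID"]].add(row["taxon_id"])
    return {site: len(ids) for site, ids in taxa.items()}


richness = richness_by_site(observations)
print(richness)  # {'HARV': 2, 'ORNL': 3}
```

Joining the result to site coordinates (the information the text says is stored in neon_location) would then give the latitude and richness pairs needed for a plot like Fig. 2.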
Figure 2 shows the utility of the data package for exploring macroecological patterns at the
NEON site level. One of the most well known and studied macroecological patterns is the
latitudinal biodiversity gradient, wherein sites have more species at lower latitudes relative to
higher latitudes; temperature, biotic interactions, and historical biogeography are potential
reasons underlying these patterns (Fischer 1960, Hillebrand 2004). Herbaceous plants of NEON
generally follow this pattern. The latitudinal pattern for NEON small mammals is similar, and is
best explained by increased niche space and declining similarity in body size among species in
lower latitudes, rather than a direct effect of temperature (Read et al. 2018).

Figure 2: Plant species richness mapped across NEON terrestrial sites. The inset scatterplot shows
latitude on the x-axis and species richness on the y-axis, with red points representing sites in
Puerto Rico and Hawaii.

In addition to allowing for quick exploration of macroecological patterns of richness at NEON
sites, the data packages presented in this paper enable investigation of effects of taxonomic
resolution on diversity indices since taxonomic information is preserved for observations under
family level for all groups. The degree of taxonomic resolution varies for NEON taxa depending
on the diversity of the group and the level of taxonomic expertise needed to identify an organism
to the species level, with more diverse groups presenting a greater challenge. Beetles are one of
the most diverse groups of organisms on Earth and wide-ranging geographically, making them
ideal bioindicators of environmental change (Rainio and Niemelä 2003). To illustrate how the use
of the beetle data package presented in this paper enables NEON data users to easily explore the
effects of taxonomic resolution on community-level taxonomic diversity metrics, we calculated
Jost diversity indices (Jost 2006) for beetles at the Oak Ridge National Laboratory (ORNL) NEON
site for data subsetted at the genus, species, and subspecies level. To quantify biodiversity, we
used Jost indices, which are essentially Hill Numbers that vary in how abundance is weighted
with a parameter q. Higher values of q give lower weights to low-abundance species, with q = 0
being equivalent to species richness and q = 1 representing the effective number of species given
by the Shannon entropy. These indices are plotted as rarefaction curves, which assess the
sampling efficacy. When rarefaction curves asymptote they suggest that additional sampling will
not capture additional taxa. Statistical methods presented by Chao et al. (2014) provide estimates
of sampling efficacy beyond the observed data (i.e., extrapolated values shown by dashed lines in
Fig. 3). For the ORNL beetle data, Jost indices calculated with higher values of q (i.e., q > 0)
indicated sampling has reached an asymptote in terms of capturing diversity regardless of
taxonomic resolution (i.e., genus, species, subspecies). However, rarefaction curves for q = 0,
which is equivalent to species richness, do not asymptote, even with extrapolation. These plots
suggest that if a researcher is interested in low abundance, rare species, then the NEON beetle
data stream at ORNL may need to mature with additional sample collections over time before
confident inferences may be made, especially below the taxonomic resolution of genus.

Figure 3: Rarefaction of beetle abundance data from collections made at the Oak Ridge National
Laboratory (ORNL) National Ecological Observatory Network (NEON) site from 2014-2020, generated
using the beetle data package presented in this paper and the iNEXT package in R (Hsieh et al.
2016) based on different levels of taxonomic resolution (i.e., genus, species, subspecies). Different
colors indicate Jost indices with differing values of q (Jost 2006). Panels: genus, species,
subspecies; x-axis: number of individuals; y-axis: species diversity; colors: q = 0, 1, 2; line
types: interpolated, extrapolated.

Discussion (or how to maintain and update standardized NEON organismal data)
NEON organismal data hold enormous potential to understand biodiversity change across space
and time (Balch et al. 2019, Jones et al. 2021). Multiple biodiversity research and education
programs have used NEON data even before NEON became fully operational in May 2019 (e.g.,
Farrell and Carey 2018, Read et al. 2018). With the expected long-term investment to maintain
NEON over the next 30 years, NEON organismal data will be an invaluable tool for
understanding and tracking biodiversity change. NEON data are unique relative to data collected
by other similar networks (e.g., LTER, CZO) because observation collection protocols are
standardized across sites, enabling researchers to address macroscale questions in environmental
science without having to synthesize disparate data sets that differ in collection methods (Jones
et al. 2021). The data package presented in this paper holds great potential in making NEON data
easier to use and more comparable across studies. Whereas the data collection protocols
implemented by NEON staff are standardized, the decisions NEON data users make in wrangling
their data after downloading NEON’s open data will not necessarily be similar unless the user
community adopts a community data standard, such as the ecocomDP data model. Adopting
such a data model early on in the life of the observatory will ensure that results of studies using
NEON data will be comparable and thus easier to synthesize. By providing a standardized and
easy-to-use data package of NEON organismal data, our effort here will significantly lower the
barriers to using the NEON organismal data for biodiversity research by many current and future
researchers and will ensure that studies using NEON organismal data are comparable.
There are some important notes about the data package we provided. First, our processes assume
that NEON ensured correct identifications of species. However, since records may be identified
to any level of taxonomic resolution, and IDs above the genus level may not be useful for most
biodiversity projects, we removed records with such IDs for groups that are relatively easy to
identify (i.e., fish, plant, small mammals) or have very few taxon IDs that are above genus level
(i.e., mosquito). However, for groups that are hard to identify (i.e., algae, beetle, bird,
macroinvertebrate, tick, and tick pathogen), we decided to keep all records regardless of their
taxon ID level. Such information can be useful if we are interested in questions such as
species-to-genus ratio or species rarefaction curves at different taxonomic levels (e.g., Fig. 3).
Users thus need to carefully consider which level of taxon IDs they need to address their
research questions. Another note regarding species names is the term ‘sp.’ vs ‘spp.’ across NEON
organismal data collections; the term ‘sp.’ refers to a single morphospecies whereas the term
‘spp.’ refers to more than one morphospecies. This is an important point to consider for
community ecology or biodiversity analyses because it may add uncertainty into estimates of
biodiversity metrics such as species richness. It is also important to point out that NEON fuzzed
taxonomic IDs to one higher taxonomic level to protect species of concern. For example, if a
threatened Black-capped vireo (Vireo atricapilla) is recorded by a NEON technician, the
taxonomic identification is fuzzed to Vireo in the data. Rare, threatened and endangered species
are those listed as such by federal and/or state agencies. Second, we standardized species
abundance measurements to make them comparable across different sampling events within
each taxonomic group (Table 1). Such standardization is critical to study and compare
biodiversity. And finally, NEON publishes data for additional organismal groups, which were not
included in this study given the complexity of the data. For example, aquatic plants
(DP1.20066.001 and DP1.20072.001); benthic microbe abundances (DP1.20277.001), metagenome
sequences (DP1.20279.001), marker gene sequences (DP1.20280.001), and community
composition (DP1.20086.001); surface water microbe abundances (DP1.20278.001), metagenome
sequences (DP1.20281.001), marker gene sequences (DP1.20282.001), and community
composition (DP1.20141.001); and soil microbe biomass (DP1.10104.001), metagenome sequences
(DP1.10107.001), marker gene sequences (DP1.10108.001), and community composition
(DP1.10081.001) were not considered here, though future work may utilize neonDivData to align
these datasets. Users interested in further explorations of these data products may find more
information on the NEON data portal (https://data.neonscience.org/). Additionally, concurrent
work on a suggested bioinformatics pipeline and how to run sensitivity analyses on user-defined
parameters for NEON soil microbial data, including code and vignettes, is described in Qin et
al. in prep.
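
The first note above, on dropping identifications coarser than genus, amounts to a filter on taxonomic rank. A minimal Python sketch follows; the column name taxon_rank, the rank list, and the example records are assumptions for illustration and do not reproduce the filtering code in ecocomDP.

```python
# Ranks ordered from coarse to fine; this list is an assumption for illustration.
RANKS_COARSE_TO_FINE = [
    "kingdom", "phylum", "class", "order", "family",
    "genus", "species", "subspecies",
]


def keep_rank_or_finer(rows, coarsest_rank="genus"):
    """Keep records identified at coarsest_rank or at any finer rank.

    Records with an unrecognized rank are dropped.
    """
    cutoff = RANKS_COARSE_TO_FINE.index(coarsest_rank)
    kept = []
    for row in rows:
        rank = row["taxon_rank"].lower()
        if rank in RANKS_COARSE_TO_FINE and RANKS_COARSE_TO_FINE.index(rank) >= cutoff:
            kept.append(row)
    return kept


# Example records for illustration only.
records = [
    {"taxon_name": "Carabidae", "taxon_rank": "family"},
    {"taxon_name": "Harpalus", "taxon_rank": "genus"},
    {"taxon_name": "Harpalus pensylvanicus", "taxon_rank": "species"},
]

filtered = keep_rank_or_finer(records)
print([row["taxon_name"] for row in filtered])
# ['Harpalus', 'Harpalus pensylvanicus']
```

Because fuzzed identifications are reported one level up (for example at genus rather than species), a genus-level cutoff keeps such records, whereas a species-level cutoff would silently drop them.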
All code for the Data Wrangling Decisions is available within the R package ecocomDP
(https://github.com/EDIorg/ecocomDP). Users can modify the code if they need to make different
decisions during the data wrangling process and update our workflows in our code by submitting
a pull request to our GitHub repository. If researchers wish to generate their own derived
organismal data sets from NEON data with slightly different decisions than the ones outlined in
this paper, we recommend that they use the ecocomDP framework, contribute their workflow to
the ecocomDP R package, upload the data to the EDI repository, and cite their data with the
discoverable DOI given to them by EDI. Note that the ecocomDP data model was intended for
community ecology analyses and may not be well suited for population-level analyses.
Because ecocomDP is an R package to access and format datasets following the ecocomDP
format, we developed an R data package neonDivData to host and distribute the standardized
NEON organismal data derived from ecocomDP. A separate dedicated data package has several
advantages. First, it is ready to use and saves users the time needed to run the code in
ecocomDP to download and standardize NEON data products. Second, it is also easy to update
the data package when new raw data products are uploaded by NEON to their data portal, and
the updating process does not require any change in the ecocomDP package. This is ideal
because ecocomDP provides harmonized data from other sources besides NEON. Third, the
GitHub repository page of neonDivData can serve as a discussion forum for researchers
regarding the NEON data products without competing for attention in the ecocomDP GitHub
repository page. By opening issues on the GitHub repository, users can discuss and contribute to
improving our workflow of standardizing NEON data products. Users can also discuss whether
there are other data models that the NEON user community should adopt at the inception of the
observatory. As the observatory moves forward, this is an important discussion for the NEON
user community and NEON technical working groups to promote synthesis of NEON data with
data from other efforts (e.g., LTER, CZO, AmeriFlux, the International LTER, National Phenology
Network, Long Term Agricultural Research Network). Note that the standardized datasets that
are stable (defined by NEON as stable release) were archived at EDI and some of the above
advantages also apply to the data repository at EDI.
The derived data products presented here collectively represent hundreds of hours of work by
members of our team, a group that met at the NEON Science Summit in 2019 in Boulder,
Colorado and consists of researchers and NEON science staff. Just as it is helpful when working
with a dataset to either have collected the data or be in close correspondence with the person
who collected the data, final processing decisions were greatly informed by conversations with
NEON science staff and the NEON user community. Future opportunities that encourage
collaborations between NEON science staff and the NEON user community will be essential to
achieve the full potential of the observatory data.
Conclusion
Macrosystems ecology (sensu Heffernan et al. 2014) is at the start of an exciting new chapter
with the decades long awaited buildout of NEON completed and standardized data streams from
all sites in the observatory becoming publicly available online. As the research community
embarks on discovering new scientific insights from NEON data, it is important that we make
our analyses and all derived data as reproducible as possible to ensure that connections across
studies are possible. Harmonized data sets will help in this endeavor because they naturally
promote the collection of provenance as data are collated into derived products (Reichman et al.
2011, O’Brien et al. 2021). Harmonized data also make synthesis easier because efforts to clean
and format data leading up to analyses do not have to be repeatedly performed by individual
researchers (O’Brien et al. 2021). The data standardizing processes and derived data package
presented here illustrate a potential path forward in achieving a reproducible framework for data
derived from NEON organismal data for ecological analyses. This derived data package also
highlights the value of collaboration between the NEON user community and NEON staff for
advancing NEON-enabled science.
Acknowledgement
This work is a result of participating in the first NEON Science Summit in 2019 and an internship
program through the St. Edward’s Institute for Interdisciplinary Science (i4) funded through a
National Science Foundation award under Grant No. 1832282. The authors acknowledge support
from the NSF Award #1906144 to attend the 2019 NEON Science Summit. Additionally, the
authors acknowledge support from the NSF DEB 1926568 to S.R., NSF DEB 1926567 to P.L.Z., NSF
DEB 1926598 to M.A.J., and NSF DEB 1926341 to J.M.L. Comments from NEON staff (Katie LeVan,
Dylan Monahan, Sara Paull, Dave Barnett, Sam Simkin), Margaret O’Brien and Tad Dallas greatly
improved this work. The National Ecological Observatory Network is a program sponsored by
the National Science Foundation and operated under cooperative agreement by Battelle
Memorial Institute. This material is based in part upon work supported by the National Science
Foundation through the NEON Program.
References
Baker, J. R., D. V. Peck, and D. W. Sutton. 1997. Environmental monitoring and assessment
program surface waters: Field operations manual for lakes. US Environmental Protection
Agency, Washington.
Balch, J. K., R. Nagy, and B. S. Halpern. 2019. NEON is seeding the next revolution in ecology.
Frontiers in Ecology and the Environment 18.
Barnett, D. 2019. TOS protocol and procedure: DIV - plant diversity sampling.
NEON.DOC.014042vK. NEON (National Ecological Observatory Network).
Barne, D. T., P. B. Adler, B. R. Chemel, P. A. Duy, B. J. Enquist, J. B. Grace, S. Harrison, R. K.881
Peet, D. S. Schimel, T. J. Stohlgren, and others. 2019. e plant diversity sampling design for882
the national ecological observatory network. Ecosphere 10:e02603.883
Bechtold, W. A., and P. L. Paerson. 2005. e enhanced forest inventory and analysis884
program–national sampling design and estimation procedures. USDA Forest Service,885
Southern Research Station.886
Beck, J., M. Böller, A. Erhardt, and W. Schwanghart. 2014. Spatial bias in the gbif database and its887
eect on modeling species’ geographic distributions. Ecological Informatics 19:10–15.888
Blowes, S. A., S. R. Supp, L. H. Antão, A. Bates, H. Bruelheide, J. M. Chase, F. Moyes, A. Magurran,889
B. McGill, I. H. Myers-Smith, and others. 2019. e geography of biodiversity change in890
marine and terrestrial assemblages. Science 366:339–345.891
Brown, J. H., J. F. Gillooly, A. P. Allen, V. M. Savage, and G. B. West. 2004. Toward a metabolic892
theory of ecology. Ecology 85:1771–1789.893
Cawley, K. M., S. Parker, R. Utz, K. Goodman, C. Sco, M. Fitzgerald, J. Vance, B. Jensen, C.894
Bohall, and T. Baldwin. 2016. NEON aquatic sampling strategy. NEON.DOC.001152vA.895
NEON (National Ecological Observatory Network).896
Chao, A., N. J. Gotelli, T. Hsieh, E. L. Sander, K. Ma, R. K. Colwell, and A. M. Ellison. 2014.897
Rarefaction and extrapolation with hill numbers: A framework for sampling and estimation898
in species diversity studies. Ecological monographs 84:45–67.899
Chesney, T., S. Parker, and C. Sco. 2021. NEON user guide to aquatic macroinvertebrate900
collection (dp1.20120.001). Revision b. NEON (National Ecological Observatory Network).901
Curtis, J. T. 1959. e vegetation of wisconsin: An ordination of plant communities. University902
of Wisconsin Pres.903
Egli, L., K. E. LeVan, and T. T. Work. 2020. Taxonomic error rates aect interpretations of a904
national-scale ground beetle monitoring program at national ecological observatory network.905
Ecosphere 11:e03035.906
Farley, S. S., A. Dawson, S. J. Goring, and J. W. Williams. 2018. Situating ecology as a big-data907
science: Current advances, challenges, and solutions. BioScience 68:563–576.908
36
Farrell, K. J., and C. C. Carey. 2018. Power, pitfalls, and potential for integrating computational
literacy into undergraduate ecology courses. Ecology and Evolution 8:7744–7751.
Fischer, A. G. 1960. Latitudinal variations in organic diversity. Evolution 14:64–81.
Geldmann, J., J. Heilmann-Clausen, T. E. Holm, I. Levinsky, B. Markussen, K. Olsen, C. Rahbek,
and A. P. Tøttrup. 2016. What determines spatial bias in citizen science? Exploring four
recording schemes with different proficiency requirements. Diversity and Distributions
22:1139–1149.
Gurevitch, J., and L. V. Hedges. 1999. Statistical issues in ecological meta-analyses. Ecology
80:1142–1149.
Hampton, S. E., C. A. Strasser, J. J. Tewksbury, W. K. Gram, A. E. Budden, A. L. Batcheller, C. S.
Duke, and J. H. Porter. 2013. Big data and the future of ecology. Frontiers in Ecology and the
Environment 11:156–162.
Harte, J. 2011. Maximum entropy and ecology: A theory of abundance, distribution, and
energetics. OUP Oxford.
Heffernan, J. B., P. A. Soranno, M. J. Angilletta Jr, L. B. Buckley, D. S. Gruner, T. H. Keitt, J. R.
Kellner, J. S. Kominoski, A. V. Rocha, J. Xiao, and others. 2014. Macrosystems ecology:
Understanding ecological patterns and processes at continental scales. Frontiers in Ecology
and the Environment 12:5–14.
Hillebrand, H. 2004. On the generality of the latitudinal diversity gradient. The American
Naturalist 163:192–211.
Hoekman, D., K. E. LeVan, C. Gibson, G. E. Ball, R. A. Browne, R. L. Davidson, T. L. Erwin, C. B.
Knisley, J. R. LaBonte, J. Lundgren, and others. 2017. Design for ground beetle abundance and
diversity sampling within the National Ecological Observatory Network. Ecosphere 8:e01744.
Hsieh, T., K. Ma, and A. Chao. 2016. iNEXT: An R package for rarefaction and extrapolation of
species diversity (Hill numbers). Methods in Ecology and Evolution 7:1451–1456.
Hubbell, S. P. 2001. The unified neutral theory of biodiversity and biogeography (MPB-32).
Princeton University Press.
Hutchinson, G. E. 1959. Homage to Santa Rosalia or why are there so many kinds of animals? The
American Naturalist 93:145–159.
Jensen, B., S. Parker, and J. R. Fischer. 2019a. AOS protocol and procedure: Fish sampling in
wadeable streams. NEON.DOC.001295vF. NEON (National Ecological Observatory Network).
Jensen, B., S. Parker, and J. R. Fischer. 2019b. AOS protocol and procedure: Fish sampling in lakes.
NEON.DOC.001296vF. NEON (National Ecological Observatory Network).
Jones, J., P. Groffman, J. Blair, F. Davis, H. Dugan, E. Euskirchen, S. Frey, T. Harms, E. Hinckley,
M. Kosmala, and others. 2021. Synergies among environmental science research and
monitoring networks: A research agenda. Earth’s Future:e2020EF001631.
Jost, L. 2006. Entropy and diversity. Oikos 113:363–375.
Keller, M., D. S. Schimel, W. W. Hargrove, and F. M. Hoffman. 2008. A continental strategy for
the National Ecological Observatory Network. The Ecological Society of America: 282-284.
Koricheva, J., and J. Gurevitch. 2014. Uses and misuses of meta-analysis in plant ecology. Journal
of Ecology 102:828–844.
LeVan, K. 2020a. NEON user guide to ground beetles sampled from pitfall traps
(DP1.10022.001). Version C. NEON (National Ecological Observatory Network).
LeVan, K. 2020b. NEON user guide to mosquitoes sampled from CO2 traps (DP1.10043.001) and
mosquito-borne pathogen status (DP1.10041.001). Version C. NEON (National Ecological
Observatory Network).
LeVan, K., S. Paull, K. Tsao, D. Hoekman, and Y. Springer. 2019a. TOS protocol and procedure:
MOS - mosquito sampling. NEON.DOC.014049vL. NEON (National Ecological Observatory
Network).
LeVan, K., N. Robinson, D. Hoekman, and K. Blevins. 2019b. TOS protocol and procedure:
Ground beetle sampling. NEON.DOC.014041vJ. NEON (National Ecological Observatory
Network).
LeVan, K., K. ibault, K. Tsao, and Y. Springer. 2019c. TOS protocol and procedure: Tick and964
tick-borne pathogen sampling. NEON.DOC.014045vK. NEON (National Ecological965
Observatory Network).966
Li, D., J. D. Olden, J. L. Lockwood, S. Record, M. L. McKinney, and B. Baiser. 2020. Changes in967
taxonomic and phylogenetic diversity in the anthropocene. Proceedings of the Royal Society968
B 287:20200777.969
Linnaeus, C. 1758. Systema naturae. Stockholm Laurentii Salvii.970
MacArthur, R. H., and E. O. Wilson. 1967. e theory of island biogeography. Princeton971
university press.972
Martin, L. J., B. Blossey, and E. Ellis. 2012. Mapping where ecologists work: Biases in the global973
distribution of terrestrial ecological observations. Frontiers in Ecology and the Environment974
10:195–201.975
Midgley, G. F., and W. uiller. 2005. Global environmental change and the uncertain fate of976
biodiversity. e New Phytologist 167:638–641.977
Monahan, D., B. Jensen, S. Parker, and C. Sco. 2020. NEON user guide to sh electroshing, gill978
neing, and fyke neing counts (dp1.20107.001). Revision b. NEON (National Ecological979
Observatory Network).980
Moulton II, S. R., J. G. Kennen, R. M. Goldstein, and J. A. Hambrook. 2002. Revised protocols for981
sampling algal, invertebrate, and sh communities as part of the national water-quality982
assessment program. Geological Survey (US).983
Nakagawa, S., and E. S. Santos. 2012. Methodological issues and advances in biological984
meta-analysis. Evolutionary Ecology 26:1253–1274.985
O’Brien, M., C. A. Smith, E. R. Sokol, C. Gries, N. Lany, S. Record, and M. C. Castorani. 2021.986
EcocomDP: A exible data design paern for ecological community survey data. Ecological987
Informatics 64:101374.988
Palumbo, I., R. A. Rose, R. M. Headley, J. Nackoney, A. Vodacek, and M. Wegmann. 2017.989
Building capacity in remote sensing for conservation: Present and future challenges. Remote990
Sensing in Ecology and Conservation 3:21–29.991
39
Parker, S. 2019. AOS protocol and procedure: INV - aquatic macroinvertebrate sampling.992
NEON.DOC.003046vE. NEON (National Ecological Observatory Network).993
Parker, S. 2020. AOS protocol and procedure: ALG - periphyton and phytoplankton sampling.994
NEON.DOC.003045vE. NEON (National Ecological Observatory Network).995
Parker, S., and C. Roehm. 2019. AOS protocol and procedure: ZOO - zooplankton sampling in996
lakes. NEON.DOC.001194. NEON (National Ecological Observatory Network).997
Parker, S., and C. Sco. 2020. NEON user guide to aquatic zooplankton collection (dp1.20219.001).998
Revision b. NEON (National Ecological Observatory Network).999
Parker, S., and T. Vance. 2020. NEON user guide to periphyton and phytoplankton collection1000
(dp1.20166.001). Revision c. NEON (National Ecological Observatory Network).1001
Pavlacky Jr, D. C., P. M. Lukacs, J. A. Blakesley, R. C. Skorkowsky, D. S. Klute, B. A. Hahn, V. J.1002
Dreitz, T. L. George, and D. J. Hanni. 2017. A statistically rigorous sampling design to1003
integrate avian monitoring and management within bird conservation regions. PloS one1004
12:e0185924.1005
Peck, D. V., Herlihy, A. T., Hill, B. H., Hughes, R. M., Kaufmann, P. R., Klemm, D. J., Lazorchak, J.1006
M., McCormick, F. H., Peterson, S. A., Ringold, P. L., Magee, T., and M. R. and Cappaert. 2006.1007
Environmental monitoring and assessment program — surface waters: Western pilot study1008
eld operations manual for wadeable streams. US Environmental Protection Agency,1009
Washington.1010
Pri, B. S., M. E. Allerdice, L. M. Sloan, C. D. Paddock, U. G. Munderloh, Y. Rikihisa, T. Tajima, S.1011
M. Paskewitz, D. F. Neitzel, D. K. H. Johnson, and others. 2017. Proposal to reclassify1012
ehrlichia muris as ehrlichia muris subsp. Muris subsp. Nov. And description of ehrlichia1013
muris subsp. Eauclairensis subsp. Nov., a newly recognized tick-borne pathogen of humans.1014
International journal of systematic and evolutionary microbiology 67:2121.1015
Rainio, J., and J. Niemelä. 2003. Ground beetles (Coleoptera: Carabidae) as bioindicators.
Biodiversity & Conservation 12:487–506.
Ralph, C. J. 1993. Handbook of field methods for monitoring landbirds. Pacific Southwest
Research Station.
Read, Q. D., J. M. Grady, P. L. Zarnetske, S. Record, B. Baiser, J. Belmaker, M.-N. Tuanmu, A.
Strecker, L. Beaudrot, and K. M. Thibault. 2018. Among-species overlap in rodent body size
distributions predicts species richness along a temperature gradient. Ecography
41:1718–1727.
Record, S., N. M. Voelker, P. L. Zarnetske, N. I. Wisnoski, J. D. Tonkin, C. Swan, L. Marazzi, N.
Lany, T. Lamy, A. Compagnoni, and others. 2020. Novel insights to be gained from applying
metacommunity theory to long-term, spatially replicated biodiversity data. Frontiers in
Ecology and Evolution 8:479.
Reichman, O. J., M. B. Jones, and M. P. Schildhauer. 2011. Challenges and opportunities of open
data in ecology. Science 331:703–705.
Rudenko, N., M. Golovchenko, L. Grubhoffer, and J. H. Oliver Jr. 2011. Updates on Borrelia
burgdorferi sensu lato complex with respect to public health. Ticks and tick-borne diseases
2:123–128.
Sauer, J. R., K. L. Pardieck, D. J. Ziolkowski Jr, A. C. Smith, M.-A. R. Hudson, V. Rodriguez, H.
Berlanga, D. K. Niven, and W. A. Link. 2017. The first 50 years of the North American
Breeding Bird Survey. The Condor: Ornithological Applications 119:576–593.
Smith, C., E. Sokol, and M. O’Brien. 2021. EcocomDP: Work with datasets in the ecological
community design pattern.
Thibault, K. 2018. TOS protocol and procedure: Breeding landbird abundance and diversity.
NEON.DOC.014041vJ. NEON (National Ecological Observatory Network).
Thibault, K., K. Tsao, Y. Springer, and L. Knapp. 2019. TOS protocol and procedure: Small
mammal sampling. NEON.DOC.000481vL. NEON (National Ecological Observatory
Network).
Thorpe, A. S., D. T. Barnett, S. C. Elmendorf, E.-L. S. Hinckley, D. Hoekman, K. D. Jones, K. E.
LeVan, C. L. Meier, L. F. Stanish, and K. M. Thibault. 2016. Introduction to the sampling
designs of the National Ecological Observatory Network Terrestrial Observation System.
Ecosphere 7:e01627.
Vellend, M., L. Baeten, I. H. Myers-Smith, S. C. Elmendorf, R. Beauséjour, C. D. Brown, P. De
Frenne, K. Verheyen, and S. Wipf. 2013. Global meta-analysis reveals no net change in
local-scale plant biodiversity over time. Proceedings of the National Academy of Sciences
110:19456–19459.
Welti, E., A. Joern, A. M. Ellison, D. Lightfoot, S. Record, N. Rodenhouse, E. Stanley, and M.
Kaspari. 2021. Meta-analyses of insect temporal trends must account for the complex
sampling histories inherent to many long-term monitoring efforts. Nature Ecology and
Evolution.
Wilkinson, M. D., M. Dumontier, I. J. Aalbersberg, G. Appleton, M. Axton, A. Baak, N. Blomberg,
J.-W. Boiten, L. B. da Silva Santos, P. E. Bourne, and others. 2016. The FAIR guiding principles
for scientific data management and stewardship. Scientific data 3:1–9.
Worm, B., and D. P. Tittensor. 2018. A theory of global biodiversity (MPB-60). Princeton
University Press.