John Richard Wieczorek
University of California, Berkeley | UCB
Bachelor of Arts
121 Publications
33,026 Reads
4,165 Citations
Publications (121)
Georeferencing is a key process in the digitization of natural history collections as it assigns spatial coordinates to preserved specimen collecting locations, facilitating their use in ecological, evolutionary and conservation research. Georeference data in public repositories such as GBIF is often missing or incomplete, jeopardising their use in...
Camera trapping has revolutionized wildlife ecology and conservation by providing automated data acquisition, leading to the accumulation of massive amounts of camera trap data worldwide. Although management and processing of camera trap‐derived Big Data are becoming increasingly solvable with the help of scalable cyber‐infrastructures, harmonizati...
The standardization of data, encompassing both primary and contextual information (metadata), plays a pivotal role in facilitating data (re-)use, integration, and knowledge generation. However, the biodiversity and omics communities, converging on omics biodiversity data, have historically developed and adopted their own distinct standards, hinderi...
Access to high-quality ecological data is critical to assessing and modeling biodiversity and its changes through space and time. The Darwin Core standard has proven to be immensely helpful in sharing species occurrence data (see Wieczorek et al. 2012, Global Biodiversity Information Facility, GBIF) and promoting biodiversity research following the...
All aspects of biodiversity research, from taxonomy to conservation, rely on data associated with species names. Effective integration of names across multiple fields is paramount and depends on the coordination and organization of taxonomic data. We assess current efforts and find that even key applications for well-studied taxa still lack commona...
GeoPick is a new web application aimed at providing a simple yet powerful georeferencing tool to the natural history collections community (Fig. 1). Its conceptual foundation is based on the Georeferencing Best Practices by Chapman and Wieczorek (2020), whose guidelines it intends to implement. GeoPick also provides a close and direct relation betw...
Camera trapping has revolutionized wildlife ecology and conservation by providing automated data acquisition, leading to the accumulation of massive amounts of camera trap data worldwide. Although management and processing of camera trap-derived Big Data are becoming increasingly solvable with the help of scalable cyber-infrastructures, harmonizati...
All aspects of biodiversity research, from taxonomy to conservation, rely on data associated with species names. Effective integration of names across multiple fields is paramount and depends on coordination and organization of taxonomic data. We assess current efforts and find that even key applications for well-studied taxa still lack commonality...
Darwin Core, the data standard used for sharing modern biodiversity and paleodiversity occurrence records, has previously lacked proper mechanisms for reporting what is known about the estimated age range of specimens from deep time. This has led to data providers putting these data in fields where they cannot easily be found by users, which impede...
Assessing and addressing biodiversity needs are of critical and time-sensitive importance, with the post-2020 Global Biodiversity Framework’s Global Taxonomy Initiative underscoring the need to build capacity in how we conceptualize biodiversity (Abrahamse et al. 2021). Species—as biological units—and their names are the backbone for the data integ...
The Global Biodiversity Information Facility (GBIF) has been immensely successful in mobilizing a large number of records documenting species occurrence through a global network of data publishers. Today over 2 billion records are available for search and download through the GBIF.org infrastructure, with over 200 million originating in natural sci...
Access to high-quality ecological data is pivotal to assessing and modeling biodiversity and its change through space and time. Inventory data (i.e., recording multiple species at specific places and times) are particularly relevant to monitoring species distributions and abundance, but their reliability for use in downstream models depends on repo...
The Biodiversity Information Standards (TDWG) Material Sample Task Group kicked off in the third quarter of 2021. The group’s initial focus was to
1) achieve a clear conceptual delineation between the terms MaterialSample, PreservedSpecimen, LivingSpecimen, and FossilSpecimen (the terms used in basisOfRecord in the current DwC-A provided to th...
Data Quality Task Group 2 was established to create a suite of core tests and associated assertions about the 'quality' of biodiversity informatics data (Chapman et al. 2020). The group has been active since January 2017, about four years longer than its four main members would have anticipated. We all thought “How hard could it be?” The answer was...
Natural history collections (NHCs) represent an enormous and largely untapped wealth of information on the Earth's biota, made available through GBIF as digital preserved specimen records. Precise knowledge of where the specimens were collected is paramount to rigorous ecological studies, especially in the field of species distribution modelling. H...
The field of distributional ecology has seen considerable recent attention, particularly surrounding the theory, protocols, and tools for Ecological Niche Modeling (ENM) or Species Distribution Modeling (SDM). Such analyses have grown steadily over the past two decades—including a maturation of relevant theory and key concepts—but methodological co...
This sampling-event dataset provides primary data about species diversity, age structure, abundance (in terms of biomass and density) and seasonal activity of earthworms (Lumbricidae). The study was carried out in old-growth broad-leaved and young forests of two protected areas ("Kaluzhskiye Zaseki" Nature Reserve and Ugra National Park) of Kaluga...
Phenological data (i.e., data on growth and reproductive events of organisms) are increasingly being used to study the effects of climate change, and biodiversity specimens have arisen as important sources of phenological data. However, phenological data are not expressly treated by the Darwin Core standard (Wieczorek et al. 2012), and specimen-bas...
Biodiversity is increasingly being assessed using omic technologies (e.g. metagenomics or metatranscriptomics); however, the metadata generated by omic investigations is not fully harmonised with that of the broader biodiversity community.
There are two major communities developing metadata standards specifications relevant to omic biodiversity dat...
Full texts are openly available at the given doi in PDF or HTML format. Published in English and Spanish.
This publication provides guidelines to the best practice for georeferencing. Though it is targeted specifically at biological occurrence data, the concepts and methods presented here can be applied in other disciplines where spatial interpret...
Note: The full text is openly available at the given doi in PDF or HTML format. Published in English and Spanish.
This is a practical guide for georeferencing. It describes the protocols to determine the shapes of features and how to use them as the basis for georeferencing with the point-radius georeferencing method (Wieczor...
Natural history collections constitute an enormous wealth of information on life on Earth. It is estimated that over 2 billion specimens are preserved at institutions worldwide, of which fewer than 10% are accessible via biodiversity data aggregators such as GBIF. Moreover, they are a very important resource for eco‐evolutionary research, which grea...
Motivation
Other than data availability, ‘Data Quality’ is probably the most significant issue for users of biodiversity data and this is especially so for the research community. The Data Quality Tests and Assertions Task Group (TG-2) of the Biodiversity Information Standards (TDWG) Biodiversity Quality Interest Group is reviewing practical aspects...
To understand biological and geological events and the history of collected samples, it is essential to determine and communicate location information accurately. The accuracy of a georeference depends upon the circumstances of the event. Historical collections depend on having clear verbatim locality descriptions, the correct interpretation of dat...
Scientists frequently collect biological and environmental information over years and store it in database systems to answer their own research questions without exposing it in repositories that make it easy to find and retrieve.
While in recent years the community working on biodiversity informatics has made significant strides by creating common...
The quality of biodiversity data publicly accessible via aggregators such as GBIF (Global Biodiversity Information Facility), the ALA (Atlas of Living Australia), iDigBio (Integrated Digitized Biocollections), and OBIS (Ocean Biogeographic Information System) is often questioned, especially by the research community.
The Data Quality Interest Group...
Description and specifications for the tests following the conventions of the Fitness For Use Framework. This supplement is a copy of https://github.com/tdwg/bdq/blob/master/tg2/core/TG2_tests.csv as of commit 941e774 2019-Aug-20. Supplementary Material 1 to Chapman AD, Belbin L, Zermoglio PF, Wieczorek J, Morris PJ, Nicholls M, Rees ER, Veiga AK,...
Vocabulary of Terms used for the TDWG Task Group on Data Quality Tests and Assertions, plus key additional terms from the Use Case Study. The terms are consistent with the terms used in the Fitness for Use Framework (Veiga et al. 2017). Supplementary Material 1 to Chapman AD, Belbin L, Zermoglio PF, Wieczorek J, Morris PJ, Nicholls M, Rees ER, Vei...
To improve the suitability of the Darwin Core standard for the research and management of alien species, the standard needs to express the native status of organisms, how well established they are and how they came to occupy a location. To facilitate this, we propose:
1. To adopt a controlled vocabulary for the existing Darwin Core term dwc:establi...
‘Data Quality Test and Assertions’ Task Group 2 (https://www.tdwg.org/community/bdq/tg-2/) has taken another year to clarify the 102 tests (https://github.com/tdwg/bdq/issues?q=is%3Aissue+is%3Aopen+label%3ATest). The original mandate to develop a core suite of tests that could be widely applied from data collection to user evaluation of aggregated...
For the last 15 years, Biodiversity Information Standards (TDWG) has recognized two competing standards for organism occurrence data, ABCD (Access to Biological Collections Data; Holetschek et al. 2012) and DarwinCore (Wieczorek et al. 2012). These two representations emerged from contrasting strategies for mobilizing information about organism occ...
Interdisciplinary collaborations and data sharing are essential to addressing the long history of human-environmental interactions underlying the modern biodiversity crisis. Such collaborations are increasingly facilitated by, and dependent upon, sharing open access data from a variety of disciplinary communities and data sources, including those w...
The full spreadsheet from the Parnell site, an archaeological site in Florida.
This shows the cleaned dataset before the Darwin Core cross-walking is complete. Note the 'Verbatim' and 'Clean' Taxon and Element fields, which show how these fields are edited slightly in order to accommodate the UBERON mappings for element and the VertNet propagation...
There has been major progress over the last two decades in digitising historical knowledge of biodiversity and in making biodiversity data freely and openly accessible. Interlocking efforts bring together international partnerships and networks, national, regional and institutional projects and investments and countless individual contributors, spa...
Annex B - List of GBIC2 Attendees
Call for an alliance for biodiversity knowledge (Arabic version)
Call for an alliance for biodiversity knowledge
Call for an alliance for biodiversity knowledge (Spanish version)
Call for an alliance for biodiversity knowledge (Portuguese version)
Annex A - Outputs from GBIC2 Working Groups
Call for an alliance for biodiversity knowledge (French version)
Call for an alliance for biodiversity knowledge (Russian version)
Call for an alliance for biodiversity knowledge (Basque version)
Call: An alliance for knowledge about biodiversity (Dutch version)
The field of biodiversity informatics is in a massive, “grow-out” phase of creating and enabling large-scale biodiversity data resources. Because perhaps 90% of existing biodiversity data nonetheless remains unavailable for science and policy applications, the question arises as to how these existing and available data records can be mobilized most...
Response to longer-form reviews
Data for bird collections from VertNet
Herbarium specimen data derived from GBIF
As curators of biodiversity data in natural science collections, we are deeply concerned with data quality, but quality is an elusive concept. An effective way to think about data quality is in terms of fitness for use (Veiga 2016). To use data to manage physical collections, the data must be able to accurately answer questions such as what objects...
VertNet (vertnet.org) is a collaborative project that makes biodiversity data free and available on the web. VertNet is also a tool designed to help people discover, improve, and publish biodiversity data. It is also the core of a collaboration between hundreds of biocollections that contribute biodiversity data and work together to improve it. Ver...
Zooarchaeological specimens are the remains of animals, including vertebrate and invertebrate taxa, recovered from, or in association with, archaeological contexts of deposition or surrounding landscapes. The physical scope of zooarchaeological specimens is diverse and includes macro- and micro-zooarchaeological specimens composed of archaeological...
Task Group 2 of the TDWG Data Quality Interest Group aims to provide a standard suite of tests and resulting assertions that can assist with filtering occurrence records for as many applications as possible. Currently ‘data aggregators’ such as the Global Biodiversity Information Facility (GBIF), the Atlas of Living Australia (ALA) and iDigBio run...
The temporality of specimens is an often overlooked but quintessential part of using aggregated biodiversity occurrences for research, especially when millions of these occurrences exist in deep time. Presently in Darwin Core, there are terms for describing the geological context of specimens, which is needed for paleontological specimens. However,...
Biodiversity data exist in large quantities, which is a boon to biodiversity science. In spite of large numbers of data records being available, however, the proportion of those records that is readily usable for science applications is quite small. The difference between the full number of data records existing versus the records that are ready for...
Since the ratification of Darwin Core (Wieczorek et al. 2012) as a TDWG standard in 2009, data publishers have had to struggle with the essential step of mapping fields in working databases to Darwin Core terms in order to publish and share data using that standard. Doing so requires a good understanding of both the data set and Darwin Core. The accumulated...
Arctos (http://arctosdb.org) is a leader in providing museums with collaborative solutions to managing information in their collections. As both a community and a collection management database platform, Arctos is a consortium of museums that collaborate to serve secure and rich data on over 3 million records from natural and cultural history colle...
The YesWorkflow toolkit (McPhillips et al. 2015a, McPhillips et al. 2015b) was designed to annotate data curation workflows in conventional scripts (e.g., Python, R, Java) but it can also be used to annotate YAML-based Kurator workflow configuration files. From just a file that has been annotated by YesWorkflow, YesWorkflow is able to render a top-le...
In the Kurator project, we are developing libraries of small modules, each designed to address a particular data quality test. These libraries, which can be run on single computers or scalable architecture, can be incorporated into data management processes in the form of customizable data quality scripts. A script composed of these modules can be...
Darwin Core (Wieczorek et al. 2012) has become broadly used for biodiversity data sharing since its ratification as a standard in 2009. Despite its popularity, or perhaps because of it, questions about Darwin Core, its definitions, and its applications continue to arise. However, no easy mechanism previously existed for the users of the standard to a...
Biodiversity data may come from myriad sources. From data capture in the field through digitization processes, each source may choose distinctive ways to capture data. When it comes to sharing data more broadly at national or regional levels, it is imperative that data is presented in ways that encourage understanding both by humans and machines, a...
As part of efforts to mobilize zooarchaeological collections data, there is a strong need for new terms that can extend the Darwin Core standard in order to describe material condition, preparation history, and chronology. These data are important for understanding the full context of specimens from an array of natural and cultural heritage discipl...
Over the past two decades, the natural history collections community has ramped up efforts to mobilize museum data in order to increase access for biodiversity research. The willingness of collections to participate in these efforts has exceeded expectations and requires a close collaboration between museum staff, database managers, and informatics...
For vast areas of the globe and large parts of the tree of life, data needed to inform trait diversity is incomplete. Such trait data, when fully assembled, however, form the link between the evolutionary history of organisms, their assembly into communities, and the nature and functioning of ecosystems. Recent efforts to close data gaps have focus...
The Darwin Core vocabulary is widely used to transmit biodiversity data in the form of simple text files. In order to support expression of biodiversity data in the Resource Description Framework (RDF), a guide was created as a non-normative addition to the Darwin Core standard. This paper describes the major issues that were addressed in the creat...
Genomic samples of non-model organisms are becoming increasingly important in a broad range of studies, from developmental biology and biodiversity analyses to conservation. Genomic sample definition, description, quality, voucher information and metadata all need to be digitized and disseminated across scientific communities. This information needs t...
Taxonomic names associated with digitized biocollections labels have flooded into repositories such as GBIF, iDigBio and VertNet. The names on these labels are often misspelled, out of date, or present other problems, as they were often captured only once during accessioning of specimens, or have a history of label changes without clear provenance....
Fields used to construct the Reference Data Set for Vertebrate Taxon Name Resolution.
Fields are grouped into the four categories used in the study: Input, Convenience, Assessment and Output. Note that, among the assessment fields, sn-inf-missing only applies to scientificnameplus, and con-autherror, con-rnk and con-sgerror only apply to constructe...
Taxonomic Name Sources Consulted.
Number of name combinations for which distinct taxonomic name sources provided the name given in validCanonical. We do not include 30 sources that were only consulted once. Some name combinations were checked utilizing multiple sources, and in some cases secondary sources used for confirmation were not always captu...
Has issue.
Effect of the interaction of the variables basisOfRecord, Geographic region (shown as “continent”), Clade (shown as “cladedf”), Volume of records shared by Institution (shown as “RecordsCountPerInst”) and year on the probability of occurrence of at least one of the following issues: Synonymy, Misspelling, Conceptual error, Format Error....
Detailed VertNet Names Data Acquisition
(DOC)
Detailed assessment of the 1000 name combinations.
Number of name combinations for which a field or characteristic of the name combination matched a given condition. Numbers outside of parentheses are for the 991 name combinations for which there was no disagreement between the assessment of the two researchers. Numbers in parentheses are the count...
This report describes the outcomes of a recent workshop, building on a series of workshops from the last three years with the goal of integrating genomics and biodiversity research, with a more specific goal here to express terms in Darwin Core and Audubon Core, where class constructs have been historically underspecified, into a Biological Collect...
Biodiversity data is being digitized and made available online at a rapidly increasing rate but current practices typically do not preserve linkages between these data, which impedes interoperation, provenance tracking, and assembly of larger datasets. For data associated with biocollections, the biodiversity community has long recognized that an e...
We describe the outcomes of three recent workshops aimed at advancing development of the Biological Collections Ontology (BCO), the Population and Community Ontology (PCO), and tools to annotate data using those and other ontologies. The first workshop gathered use cases to help grow the PCO, agreed upon a format for modeling challenging concepts s...