Specimens held in private natural history collections form an essential but often neglected part of the world's natural history holdings. When engaging in regional, national or international initiatives aimed at increasing the accessibility of biodiversity data, it is paramount to include private collections as much and as often as possible. Compared to larger collections in natural history institutions, private collections present a unique set of challenges: they are numerous, anonymous, small and diverse in all aspects of collection management. In ICEDIG, a design study for DiSSCo, these challenges were tackled in Task 2, "Inventory of content and incentives for digitisation of small and private collections", under Work Package 2, "Inventory of current criteria for prioritization of digitization". First, we need to understand the current state and content of private collections within Europe in order to identify and tackle challenges more effectively. While some private collections will duplicate material already held in public collections, many are likely to fill more specialised or unusual niches relevant to their particular collector(s). At present, there is little evidence about the content of private collections, and this needs to be explored. In 2018, a European survey was carried out amongst private collection owners to gain more insight into the volume, scope and degree of digitisation of these collections. Based on this survey, all of the respondents' collections combined are estimated to contain between 9 and 33 million specimens. This is only the tip of the iceberg for private collections in Europe and underlines their importance. Digitisation and the sharing of collection data are activities that are generally considered important among private collection owners. The survey also showed that for those who have not yet started digitising their collection, the provision of tools and information would be most valuable.
These and other highlights of the survey will be presented. In addition, protocols for inventories of private collections will be discussed, as well as ways to keep these up to date. To enhance the inclusion of private collections in Europe's digitisation efforts, we recognise that we mainly have to focus on the challenges regarding the 'how' (work process) and the sharing of information residing in private collections (including ownership, legal issues and sensitive data). Where necessary, we will also draw attention to the 'why' (motivation) of digitisation. A communication strategy aimed at raising awareness about digitisation, offering insight into the practicalities of implementing digitisation, and providing answers to issues related to sharing information is an essential tool. Elements of a communication strategy to further engage private collection owners will be presented, as will conclusions and recommendations. Finally, digitisation and communication aspects related to private collection owners will need to be tested within the community. Therefore, a pilot project is currently (2018-2019) being carried out in Estonia, Finland and the Netherlands to digitise private collections in a variety of settings. Preliminary results will be presented, zooming in on different approaches to including data from private collections in the overall (research) infrastructures.
The paper investigates how to implement open access to data in collection institutions and in the DiSSCo research infrastructure. Large-scale digitisation projects generate large numbers of images, but data transcription often remains backlogged for years. The paper discusses minimum information standards (MIDS) for digital specimens and tentatively defines four hierarchical MIDS levels. Even partially available data can be useful for some purposes, and it is recommended that data and media be made openly accessible after minimal delay. The paper then discusses the FAIR concepts and the obstacles and restrictions to making data openly accessible. Data policies and data management plans (DMPs) of six ICEDIG collection institutions are reviewed. Typically only one or the other exists, depending on how digitisation and the related data management are organised. Data policies are found at the institution level, whereas DMPs belong at the project level. The paper comes to the following conclusions: 1) Digital Specimen Objects (DSOs) must be findable and accessible at the lowest (MIDS-0) level. Data should be deposited in a public research data repository, such as Zenodo. 2) As far as possible, projects must enable third parties to access, mine, exploit, reproduce and disseminate these data by using a copyright waiver such as CC0 or an open access licence such as CC-BY. 3) Exceptions to the openness policy must be stated clearly and strictly limited to reasons of national security, legal or regulatory compliance, sensitivity of collection information, and third-party rights.
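The idea of hierarchical completeness levels can be sketched in code. The following is an illustrative Python sketch only: the field names assigned to each level are assumptions for demonstration and do not reproduce the published MIDS term lists; only the notion of four cumulative levels, with MIDS-0 as a bare findable record, comes from the text above.

```python
# Illustrative, MIDS-style hierarchy of completeness levels.
# Field names per level are assumptions, not the official MIDS terms.
MIDS_LEVELS = {
    0: {"identifier"},
    1: {"identifier", "institution", "name"},
    2: {"identifier", "institution", "name", "collector", "date", "country"},
    3: {"identifier", "institution", "name", "collector", "date", "country",
        "coordinates", "image"},
}

def mids_level(record):
    """Return the highest level whose required fields are all non-empty,
    or -1 if even the lowest level is not met."""
    level = -1
    for lvl in sorted(MIDS_LEVELS):
        if all(record.get(field) for field in MIDS_LEVELS[lvl]):
            level = lvl
        else:
            break  # levels are cumulative, so stop at the first failure
    return level

record = {"identifier": "BR0000123", "institution": "BR",
          "name": "Ficus exasperata"}
print(mids_level(record))  # → 1
```

Because the levels are cumulative, a record can be published as soon as it reaches level 0 and re-assessed as transcription fills in further fields, which matches the recommendation to release data after minimal delay.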
There is a growing need to set data-driven priorities when planning for the digitisation of European natural history collections. Currently, there is no single location where the required information is gathered and where it can be easily consulted and used by decision-makers and scientists. In particular, information on digitised and non-digitised natural history collections can inform digitisation-on-demand and mass digitisation for certain taxonomic or geographic parts of the collection that are not (yet) digitally available. In this Deliverable D2.3 we aim to prepare a preliminary design for a Collection Digitisation Dashboard (CDD), with the main purpose of making European natural history collections visible and discoverable and of highlighting institutional contributions, strengths and weaknesses. First, we identified six main user groups of the CDD via workshop discussions: a) institutions harbouring natural history collections, b) (non-)professional researchers and collectors, c) education, d) policy makers and financing bodies, e) NGO nature groups and organisations, and f) the wider community interested in natural heritage. User stories were collected and the data elements belonging to these stories were summarised. The CDD will primarily be used to present high-level collection data for communication purposes and as a digitisation planning and data discovery tool. Second, we propose a set of collection classification schemes to describe and characterise a natural history collection at a metadata level. We distinguished a 'taxonomic' and a 'storage' classification that exist in parallel and are based on a scientific view and a collection manager's view, respectively. For the further description of geodiversity collections we identified a third parallel 'stratigraphic' classification.
In addition, 'geographic' and 'digitisation' classifications were identified to further characterise the spatial coverage and levels of digitisation of the collections. The most important parameters, to be minimally included in the CDD, are institution, country of institution, 'taxonomy', geography and digitisation. Based on these requirements we piloted two different CDDs: the first is based on an initial collection survey among DiSSCo partners, and the second on a pilot study with Dutch natural history collection institutes using improved classifications. In this deliverable we have provided a draft of how to create a collection digitisation dashboard to present collection digitisation data, and we give recommendations on how to proceed from here.
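The metadata model behind such a dashboard can be sketched as a simple record type. In the Python sketch below, only the five minimal parameters (institution, country, taxonomy, geography, digitisation) and the parallel 'storage' and 'stratigraphic' classifications come from the deliverable; the class name, the enumerated values and the example rows are illustrative assumptions.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

# Hypothetical data model for one CDD entry; values below are examples only.
@dataclass
class CollectionRecord:
    institution: str
    country: str
    taxonomy: str                        # 'taxonomic' classification term
    geography: str                       # spatial coverage
    digitisation: str                    # e.g. "none", "partial", "full"
    storage: Optional[str] = None        # parallel 'storage' classification
    stratigraphic: Optional[str] = None  # for geodiversity collections

records = [
    CollectionRecord("Institute A", "NL", "Insecta", "Palaearctic", "partial"),
    CollectionRecord("Institute B", "BE", "Plantae", "Africa", "full"),
]

# A simple dashboard aggregate: digitisation status per country.
by_country = Counter((r.country, r.digitisation) for r in records)
print(by_country[("NL", "partial")])  # → 1
```

Aggregates of this kind (counts per country, taxon or digitisation level) are the sort of high-level view a dashboard would present, while the underlying per-collection records support discovery and planning.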
ICEDIG is a design study for the new research infrastructure Distributed System of Scientific Collections (DiSSCo), focusing on the issues around digitisation of the collections and on making their data freely and openly available following the FAIR principles (data being Findable, Accessible, Interoperable, and Re-usable). As a design study, ICEDIG does not implement anything in an operational fashion, and therefore the amount of research data which ICEDIG will deal with is limited. DiSSCo, on the other hand, is expected to deal with huge amounts of data. The data management plan (DMP) for DiSSCo will be produced as one of the design documents of ICEDIG. In other words, ICEDIG will produce two data management plans: one for ICEDIG and another for DiSSCo. In order to achieve its objectives, ICEDIG will carry out a number of tasks, including specific pilots which may produce limited amounts of research data. ICEDIG will also deploy large volumes of already existing research data in order to explore how they can best be pooled in the available European Open Science Cloud infrastructures. This data management plan follows the Horizon 2020 DMP template, which has been designed to be applicable to any Horizon 2020 project that produces, collects or processes research data. The template is a set of questions that have been answered with a level of detail appropriate to the project. It is understood that the DMP is a living document which may be updated as the implementation of the project progresses and when significant changes occur. Therefore, the DMP has a clear version number and includes below a timetable for planned updates.
There are many ways to capture data from herbarium specimen labels. Here we compare the results of in-house versus out-sourced data transcription, with the aim of evaluating the pros and cons of each approach and guiding future projects that want to do the same. In 2014 Meise Botanic Garden (BR) embarked on a mass digitisation project. We digitally imaged some 1.2 million herbarium specimens from our African and Belgian herbaria. The minimal data for a third of these images were transcribed in-house, while the remainder was out-sourced to a commercial company. The minimal data comprised the fields: specimen's herbarium location, barcode, filing name, family, collector, collector number, country code and phytoregion (for the Democratic Republic of Congo, Rwanda and Burundi). The out-sourced data capture consisted of three types: additional label information for central African specimens having minimal data; complete data for the remaining African specimens; and species filing name information for African and Belgian specimens without minimal data. As part of the preparation for out-sourcing, a strict protocol had to be established setting the criteria for acceptable data quality levels. The creation of several lookup tables for data entry was also necessary to improve data quality. During the start-up phase all the data were checked, feedback was given, compromises were made and the protocol was amended. After this phase, an agreed-upon subsample was quality controlled. If the error score exceeded the agreed level, the batch was returned for retyping. The data underwent three quality control checks during the process: by the data capturers, by the contractor's project managers and by ourselves. Data quality was analysed and compared between the in-house and out-sourced modes of data capture. The error rates of our staff and the external company were comparable. The types of error that occurred were often linked to the specific field in question.
These errors include problems of interpretation, legibility, foreign languages, typographic errors, etc. A significant amount of data cleaning and post-capture processing was required prior to import into our database, despite the data being of good quality according to the protocol (error rate < 1%). By improving the workflow and field definitions, a notable improvement could be made in the "data cleaning" phase. The initial motivation for capturing some data in-house was financial. However, after analysis, this may not have been the most cost-effective approach. Many lessons have been learned from this first mass digitisation project that will be implemented in similar projects in the future.
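The batch-level acceptance step described above can be sketched as a sampling check. In the Python sketch below, the sample size, the 1% threshold and the `check` callback are illustrative assumptions standing in for the project's agreed subsample, agreed error level and manual verification, respectively.

```python
import random

def accept_batch(records, check, sample_size=100, max_error_rate=0.01):
    """Quality-control a transcription batch by sampling.

    `check(record)` should return True if the record is correctly
    transcribed (in practice this was a manual verification step).
    Returns True if the observed error rate in the sample stays within
    the agreed threshold; a False result means the batch goes back to
    the transcribers for retyping.
    """
    sample = random.sample(records, min(sample_size, len(records)))
    errors = sum(1 for record in sample if not check(record))
    return errors / len(sample) <= max_error_rate
```

Because only a subsample is inspected, the measured rate is an estimate: a larger `sample_size` gives more reliable accept/reject decisions at a higher checking cost, which is the trade-off any such protocol has to negotiate with the contractor.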
Globally there are a number of citizen science portals that support the digitisation of biodiversity collections. Digitisation not only involves imaging the specimen itself, but also includes the digital transcription of label and ledger data, georeferencing and linking to other digital resources. Making use of the skills and enthusiasm of volunteers is potentially a good way to reduce the great backlog of specimens to be digitised. These citizen science portals engage the public and are liberating data that would otherwise remain on paper. There is also considerable scope for expansion into other countries and languages. Therefore, should we continue to expand? Volunteers give their time for free, but the creation and maintenance of a platform is not without costs. Given a finite budget, what can you get for your money? How does the quality compare with other methods? Is crowdsourcing of label transcription faster, better and cheaper than other transcription systems? We will summarise the use of volunteer transcription from our own experience and from the reports of other projects. We will base our evaluation on the costs, speed and quality of the systems and reach conclusions on why you should or should not use this method.
More and more herbaria are digitising their collections. Images of specimens are made available online to facilitate access to them and allow extraction of information from them. Transcription of the data written on specimens is critical for general discoverability and enables incorporation into large aggregated research datasets. Different methods, such as crowdsourcing and artificial intelligence, are being developed to optimise transcription, but herbarium specimens pose difficulties in data extraction for many reasons. To provide developers of transcription methods with a means of optimisation, we have compiled a benchmark dataset of 1,800 herbarium specimen images with corresponding transcribed data. These images originate from nine different collections and include specimens that reflect the multiple potential obstacles that transcription methods may encounter, such as differences in language, text format (printed or handwritten), specimen age and nomenclatural type status. We are making these specimens available with a Creative Commons Zero licence waiver and with permanent online storage of the data. By doing this, we are minimising the obstacles to the use of these images for transcription training. This benchmark dataset of images may also be used where a defined and documented set of herbarium specimens is needed, such as for the extraction of morphological traits, handwriting recognition and colour analysis of specimens.
As of 2017, the task of mobilising data from biocollections that lies ahead of us is still enormous: data from 90% of biocollections still need to be mobilised. It is imperative for stakeholders, individual keepers of natural science collections, the community at large, and even funding agencies, not only to tackle this backlog as quickly as possible, but to do so in the best possible order. To establish the best possible order for digitising biocollections, a demand-driven framework is required, based, among other things, on the criteria used to digitise biocollections.