Enabling better aggregation and discovery of cultural heritage content for Europeana and its partner institutions

Master’s thesis supervised by:
Arnaud GAUDINAT, Associate Professor
Geneva, Switzerland, 14 August 2020
Information Science Department
Master of Science HES-SO in Information Science
Haute école de gestion de Genève
Master’s thesis in Information Science carried out by:
Julien Antoine RAEMY
Acknowledgements
I would like to thank all of the people who have helped me in any way possible over the course
of this master’s thesis, which also marks the end of my higher education at the Haute école de
gestion de Genève, where I completed a bachelor’s degree.
First of all, I would like to thank Professor Arnaud Gaudinat for his sound advice, his
availability and for agreeing to oversee this thesis.
I am very grateful to Emmanuelle Bermès, Deputy Director for Services and Networks at the
Bibliothèque nationale de France, for having agreed to be the external expert assessing this
master’s thesis.
I am very much indebted to Antoine Isaac, R&D Manager at Europeana, for involving me in
the team’s discussions, for facilitating exchanges, and for his countless suggestions and
feedback, as well as his patience. Many thanks as well to all Europeana R&D team members,
with whom I interacted on a weekly basis and who offered me a great deal of assistance:
Nuno Freire, Mónica Marrero, Albin Larsson, and José Eduardo Cejudo Grano de Oro.
In this respect, I would also like to mention and thank the following people working within the
Europeana Network for their ongoing support: Valentine Charles, Gregory Markus, Andy
Neale, Hugo Manguinhas, Henning Scholz, Sebastiaan ter Burg, Erwin Verbruggen,
Enno Meijers, Haris Georgiadis and Cosmina Berta.
I would also like to extend my deepest appreciation to Anne McLaughlin who kindly agreed
to give some of her time to proofread this dissertation.
Special thanks also go to everyone who took part in the online survey, and to all of those with
whom I corresponded by email about possible pilots or simply to obtain additional
information.
Lastly, I would like to express my deep appreciation to my girlfriend, friends, colleagues and
family who supported me not only during the last semester of this master’s degree in
Information Science, but throughout the course of my studies.
Abstract
Europeana, a non-profit foundation launched in 2008, aims to improve access to Europe’s
digital cultural heritage through its open data platform that aggregates metadata and links to
digital surrogates held by over 3700 providers. The data comes both directly from cultural
heritage institutions (libraries, archives, museums) and through intermediary
aggregators. Europeana’s current operating model leverages the Open Archives Initiative
Protocol for Metadata Harvesting (OAI-PMH) and the Europeana Data Model (EDM) for data
import through Metis, Europeana's ingestion and aggregation service.
However, OAI-PMH is an outdated technology that is not web-centric, which carries a high
maintenance burden, in particular for smaller institutions. Consequently, Europeana
seeks to find alternative aggregation mechanisms that could complement or supersede it in
the long term, and which could also bring further benefits.
This master’s thesis seeks to extend the research on earlier aggregation experiments
that Europeana successfully carried out with various technologies, such as aggregation based
on Linked Open Data (LOD) datasets or through the International Image Interoperability
Framework (IIIF) APIs.
The literature review first focuses on metadata standards and the aggregation landscape in
the cultural heritage domain, and then provides an extensive overview of Web-based
technologies with respect to two essential components that enable aggregation: data transfer
and synchronisation, and data modelling and representation.
Three key results were obtained. First, participation in the Europeana Common Culture
project resulted in a revision of the documentation of the LOD-aggregator, a generic toolset for
harvesting and transforming LOD. Second, an online survey completed by 52 respondents
gauged the awareness of, interest in, and use of technologies other than OAI-PMH for (meta)data
aggregation. Third, potential aggregation pilots were assessed for the 23 organisations that
expressed interest in follow-up experiments, on the basis of the available data and existing
implementations. In the allotted time, one pilot was attempted using
Sitemaps and Schema.org.
In order to encourage the adoption of new aggregation mechanisms, a list of
suggestions was then established. All of these recommendations were aligned with the
Europeana Strategy 2020-2025 and directed towards one or several of the key roles of the
aggregation workflow (data provider, aggregator, Europeana).
Even if a shift in Europeana’s operating model would require extensive human and technical
resources, such an effort is clearly worthwhile as the solutions presented in this dissertation are
well-suited for data enrichment and for allowing data to be easily updated. The transition from
OAI-PMH will also be facilitated by the integration of such mechanisms within the Metis
Sandbox, Europeana's new ad-hoc system where contributors will be able to test their data
sources before ingestion into Metis. Ultimately, this shift is also expected to lead to better
discoverability of digital cultural heritage objects.
Keywords: API, Cultural heritage, Data aggregation, Digital transformation, Discovery,
Europeana Common Culture, EDM, IIIF, LOD, OAI-PMH, RDF, ResourceSync, Schema.org,
SEO, Sitemaps, Social Web Protocols
Table of Contents
Acknowledgements
Abstract
List of Tables
List of Figures
List of Abbreviations
Terminology
1. Introduction
2. Context
2.1 Europeana
2.2 Rationale and background
2.2.1 Motivations for revising the aggregation workflow at Europeana
2.2.2 R&D projects and pilot experiments
2.2.3 Aggregation strategy
2.3 Research scope
2.3.1 Expectations
2.3.2 Constraints
2.3.3 Research questions
2.3.4 Objectives
3. Literature review
3.1 Metadata in the cultural heritage domain
3.1.1 Types of metadata
3.1.2 Metadata standards
3.1.3 Metadata convergence and interoperability
3.2 Aggregation landscape in the cultural heritage domain
3.2.1 Cultural heritage aggregation platforms
3.2.2 Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)
3.2.3 Publication requirements of the Europeana Network
3.3 Alternative web-based technologies for (meta)data aggregation
3.3.1 Aggregation components
3.3.2 Technologies for data transfer and synchronisation
3.3.2.1 ActivityStreams 2.0 (AS2)
3.3.2.2 ActivityPub (AP)
3.3.2.3 International Image Interoperability Framework (IIIF)
3.3.2.4 Linked Data Notifications (LDN)
3.3.2.5 Linked Data Platform (LDP)
3.3.2.6 Open Publication Distribution System Catalog 2.0 (OPDS2)
3.3.2.7 ResourceSync (RS)
3.3.2.8 Sitemaps
3.3.2.9 Webmention
3.3.2.10 WebSub
3.3.3 Technologies for data modelling and representation
3.3.3.1 Data Catalog Vocabulary (DCAT)
3.3.3.2 Schema.org
3.3.3.3 Vocabulary of Interlinked Datasets (VoID)
3.3.4 Overview of aggregation mechanisms
4. Methodology
4.1 Overall approach
4.2 Methods of data collection
4.2.1 Reviewing the state-of-the-art
4.2.2 Europeana Common Culture’s LOD Functional Application
4.2.3 Survey on alternative aggregation mechanisms
4.2.3.1 Timeline and promotion
4.2.3.2 Objectives
4.2.3.3 Structure and questions
4.2.3.4 Hypotheses
4.2.4 Assessment of potential aggregation pilots
4.3 Methods of data analysis
4.3.1 Tools
4.3.1.1 Spreadsheet software
4.3.1.2 Text editor
4.3.1.3 Europeana R&D tools as testbed
4.3.2 Service design
4.4 Limitations
5. Results
5.1 Analysis of ECC LOD Functional Application
5.1.1 Sustainability discussions
5.1.2 Assessment of the LOD-aggregator
5.1.3 Metadata and Semantics Research (MTSR) paper
5.2 Survey
5.2.1 Number and provenance of participants
5.2.2 Findings
5.2.2.1 Metadata for publishing and exchanging purposes
5.2.2.2 Metadata serialisations
5.2.2.3 OAI-PMH
5.2.2.4 Alternative aggregation mechanisms
5.2.2.5 LOD
5.2.2.6 IIIF
5.2.2.7 Possibility of further experiments
5.2.2.8 Feedback
5.2.3 Survey biases
5.3 Aggregation pilots
5.3.1 Parameters for defining and assessing potential pilots
5.3.2 Identifying aggregation routes for potential pilots
5.3.3 Assessment of potential aggregation pilots
5.3.3.1 Triage of potential pilots
5.3.3.2 Aggregation route selection
5.3.3.3 Follow-up emails
5.3.3.4 Resolution on the immediate conduct of pilots
5.3.3.5 Attempts to carry out the MuseuMap pilot
5.3.4 Conclusion
6. Recommendations
6.1 Target levels
6.2 Opportunity Solution Tree
6.3 Alignment with the Europeana Strategy
6.4 Suggestions for implementing the identified solutions
7. Conclusion
7.1 Retrospective
7.1.1 Study achievements and outcomes
7.1.2 Alternative mechanisms to OAI-PMH
7.1.3 Conditions to deploy alternative mechanisms for aggregation
7.2 Future work and discussion
Bibliography
Research Stakeholders
Europeana’s current ingestion process
Mapping examples of the Mona Lisa in EDM
Survey invitation and reminder
Survey structure
Survey questions
Identified resources to support aggregation
Survey findings and pilot follow-up email templates
Overview of aggregation mechanisms
Opportunity Solution Tree (full)
List of Tables
Table 1: Glossary of Terms
Table 2: A selection of metadata standards in the CH domain
Table 3: Aggregation components
Table 4: Validation interviews
Table 5: Assessment criteria of the LOD-aggregator
Table 6: Survey participants' provenance
Table 7: Additional metadata standards
Table 8: Additional ways to publish LOD
Table 9: Survey participants' feedback
Table 10: Alternative aggregation routes
Table 11: Triage on the conduct of potential aggregation pilots
Table 12: Type and number of follow-up email templates sent
Table 13: Typical questions raised in the follow-up emails (template 4)
Table 14: Resolution on the conduct of aggregation pilots
Table 15: Alignment between identified solutions and Europeana Strategy priorities
Table 16: Proposed suggestions
Table 17: Research Stakeholders
Table 18: Ingestion process in Metis
Table 19: Options for survey question 3
Table 20: Options for survey question 5
Table 21: Options for survey question 8
Table 22: Options for survey question 9
Table 23: Options for survey question 12
Table 24: Resources for facilitating alternative (meta)data aggregation
Table 25: High-level overview of aggregation mechanisms
List of Figures
Figure 1: Europeana's current operating model
Figure 2: Aggregation strategy's conceptual solution
Figure 3: Extended Metis Sandbox Concept
Figure 4: Five stars of LOD
Figure 5: OAI-PMH Structure
Figure 6: ActivityPub request examples
Figure 7: IIIF APIs in the client-server model
Figure 8: Overview of a IIIF Discovery ecosystem
Figure 9: LDN overview
Figure 10: Class relationship of types of LDP Containers
Figure 11: ResourceSync Framework Structure
Figure 12: WebSub high-level protocol flow
Figure 13: High-level architecture of the LOD-aggregator
Figure 14: Typology of survey participants
Figure 15: Awareness, use, and interest in metadata standards for publishing and exchanging purposes
Figure 16: Awareness, use, and interest in metadata serialisations
Figure 17: Use of OAI-PMH
Figure 18: Use of OAI-PMH in the Europeana context
Figure 19: Awareness, use, and interest in alternative aggregation mechanisms
Figure 20: Awareness, use, and interest in publishing LOD
Figure 21: Awareness, use and interest in IIIF APIs
Figure 22: Interest in pilot participation
Figure 23: Summarised representation of the assessment of aggregation pilots
Figure 24: Desired outcome and opportunities with respect to the target levels
Figure 25: Proposed solution and experiments for the digital object level
Figure 26: Proposed solution and experiments for the metadata level
Figure 27: Proposed solutions and experiments for the providing institution level
Figure 28: Simple representation of the Mona Lisa in EDM
Figure 29: Object-centric representation of the Mona Lisa in EDM
Figure 30: Event-centric representation of the Mona Lisa in EDM
Figure 31: General structure of the survey on alternative aggregation mechanisms
Figure 32: Opportunity Solution Tree to enable better aggregation and discovery of cultural heritage content
List of Abbreviations
AACR2 Anglo-American Cataloguing Rules, 2nd edition
ABCD Access to Biological Collections Data
AP ActivityPub
API Application programming interface
ArCo Architecture of Knowledge
AS2 ActivityStreams 2.0
BIBFRAME Bibliographic Framework
CARARE Connecting Archaeology and Architecture in Europe
CC Creative Commons
CCO Cataloguing Cultural Objects
CC0 Creative Commons Zero Public Domain Dedication
CH Cultural heritage
CHI Cultural heritage institution
CHO Cultural heritage object
CIDOC-CRM CIDOC Conceptual Reference Model
CLI Command-line interface
CMS Content management system
CSV Comma-separated values
DACS Describing Archives: A Content Standard
DAL Data Aggregation Lab
DC Dublin Core
DCAT Data Catalog Vocabulary
DCAT-AP DCAT Application Profile for data portals in Europe
DCT Dublin Core Terms
DCMES Dublin Core Metadata Element Set
DEA Data Exchange Agreement
DPLA Digital Public Library of America
DSI Digital Service Infrastructure
EAC-CPF Encoded Archival Context - Corporate Bodies, Persons, and Families
EAD Encoded Archival Description
EAG Encoded Archival Guide
ECC Europeana Common Culture
EDM Europeana Data Model
EAF Europeana Aggregators Forum
EF Europeana Foundation
ENA Europeana Network Association
EPF Europeana Publishing Framework
ESE Europeana Semantic Elements
FRBR Functional Requirements for Bibliographic Records
GLAM Galleries, Libraries, Archives, Museums
HDT Header Dictionary Triples
HTML Hypertext Markup Language
HTTP Hypertext Transfer Protocol
IIIF International Image Interoperability Framework
IIIF-C IIIF Consortium
ISAD(G) International Standard Archival Description (General)
JSON JavaScript Object Notation
JSON-LD JavaScript Object Notation for Linked Data
KB Koninklijke Bibliotheek (The Royal Library of the Netherlands)
LD Linked Data
LDF Linked Data Fragments
LDN Linked Data Notifications
LDP Linked Data Platform
LIDO Lightweight Information Describing Objects
LOD Linked Open Data
MADS Metadata Authority Description Schema
MARC Machine-Readable Cataloging
METS Metadata Encoding and Transmission Standard
MODS Metadata Object Description Schema
MTSR Metadata and Semantics Research
NDLI National Digital Library of India
NDE Netwerk Digitaal Erfgoed (Dutch Digital Heritage Network)
NISO National Information Standards Organization
NISV Netherlands Institute for Sound and Vision
NT N-Triples
N3 Notation 3
N/A Not applicable
OAI-ORE Open Archives Initiative Object Reuse and Exchange
OAI-PMH Open Archives Initiative Protocol for Metadata Harvesting
ONIX Online Information Exchange
OPDS2 Open Publication Distribution System 2.0
OWL Web Ontology Language
PID Persistent identifier
PSNC Poznan Supercomputing and Networking Center
PuSH PubSubHubbub
RDA Resource Description & Access
RDF Resource Description Framework
RDFa Resource Description Framework in Attributes
RDFS Resource Description Framework Schema
REST Representational state transfer
RiC Records in Contexts
RS ResourceSync
R&D Research and Development
SEO Search Engine Optimization
SHACL Shapes Constraint Language
SKOS Simple Knowledge Organization System
SOCH Swedish Open Cultural Heritage
SPARQL SPARQL Protocol and RDF Query Language
SRU/SRW Search and Retrieve URL/Web Service
TBD To be determined
Turtle Terse RDF Triple Language
UCD University College Dublin
URL Uniform Resource Locator
URI Uniform Resource Identifier
VoID Vocabulary of Interlinked Datasets
VRA Visual Resources Association
XML Extensible Markup Language
XSLT Extensible Stylesheet Language Transformations
W3C World Wide Web Consortium
Terminology
This terminology lays out the definitions of the most relevant concepts discussed in this
master’s thesis. The concepts identified in Table 1 provide a general overview and are not an
exhaustive list of the topics covered in this dissertation. Above all, it allows readers to have a
good sense of the study's rationale.
Table 1: Glossary of Terms
| Term | Definition | Source |
| --- | --- | --- |
| Application programming interface (API) | An API is an abstraction implemented in software that defines how others should make use of a software package such as a library or other reusable program. APIs are used to provide developers access to data and functionality from a given system. | (Hyland et al. 2013) |
| Conceptual Model | Conceptual Models provide a high-level approach to resource description in a certain domain. They typically define the entities of description and their relationship to one another. Metadata structure standards typically use terminology found in conceptual models in their domain. | (Riley, Becker 2010) |
| Content Standard | Content Standards provide specific guidance on the creation of data for certain fields or metadata elements, sometimes defining what the source of a given data element should be. They may or may not be designed for use with a specific metadata structure standard. | (Riley, Becker 2010) |
| Controlled Vocabulary | Controlled Vocabularies are enumerated (either fully or by stated patterns) lists of allowable values for elements for a specific use or domain. | (Riley, Becker 2010) |
| Data Modelling | Data modelling is a process of organising data and information describing it into a faithful representation of a specific domain of knowledge. | (Hyland et al. 2013) |
| Digital transformation | Digital transformation is an umbrella term that captures the impact of digital innovation on the ground in different sectors. [For Europeana, it’s not about simply applying technology, but by doing it] sensibly and with serious consideration to implementing [Europeana’s] values. | (D’Alterio 2018)¹ |
| Discovery | Discovery is the ability for automated processes to find harvestable content for the purposes of aggregating it, thus allowing that content to be subsequently retrieved on a search engine which is used by either humans with a user interface or machines via an API. NB: This is a very specific view on discovery as it is here conceptually understood as a process. | Robert Sanderson² |
| Ingestion | The process of collecting, mapping and publishing the data from a data provider. | (Europeana 2015) |
| Linked Data (LD) | A pattern for hyperlinking machine-readable data sets to each other using Semantic Web techniques, especially via the use of RDF and URIs. Enables distributed SPARQL queries of the data sets and a browsing or discovery approach to finding information (as compared to a search strategy). Linked Data is intended for access by both humans and machines. Linked Data uses the RDF family of standards for data interchange (e.g., RDF/XML, RDFa, Turtle) and query (SPARQL). | (Hyland et al. 2013) |
| Linked Open Data (LOD) | Linked Data published on the public Web and licensed under one of several open licenses permitting reuse. | (Hyland et al. 2013) |
| Markup Language | A formal way of annotating a document or collection of digital data using embedded encoding tags to indicate the structure of the document or data file and the contents of its data elements. It also provides a computer with information about how to process and display marked-up documents. [Markup Language] are unlike other "metadata" formats in that they provide not a surrogate for or other representation of a resource, but rather an enhanced version of the full resource itself. | (Baca 2016a; Riley, Becker 2010) |
| Metadata | Information used to administer, describe, preserve, present, use or link other information held in resources, especially knowledge resources, be they physical or virtual. Metadata may be further subcategorized into several types (including general, access and structural metadata). Linked Data incorporates human and machine-readable metadata along with it, making it self-describing. | (Hyland et al. 2013) |
| Metadata Aggregation | Metadata aggregation is an approach where centralized efforts like Europeana facilitate the discoverability [of resources] by collecting [ingesting] their metadata. | (Freire, Meijers, et al. 2018) |
| Metadata Mapping | An expression of rules to convert structured data from one format or model to another such as the Europeana Data Model (EDM). | (Europeana 2015) |
| Record Format | Record Formats are specific encodings for a set of data elements. Many structure standards are defined together with a record format that implements them. | (Riley, Becker 2010) |
| Structure Standard | Structure Standards are those that define at a conceptual level the data elements applicable for a certain purpose or for a certain type of material. These may be defined anew or borrowed from other standards. This category includes formal data dictionaries. Structure standards do not necessarily define specific record formats. | (Riley, Becker 2010) |
| Uniform Resource Identifier (URI) | A global identifier standardized by joint action of the World Wide Web Consortium (W3C) and Internet Engineering Task Force. A Uniform Resource Identifier (URI) may or may not be resolvable on the Web. URIs play a key role in enabling Linked Data. URIs can be used to uniquely identify virtually anything including a physical building or more abstract concepts such as colours. | (Hyland et al. 2013) |
| User/end-user | A person or entity making use of the services offered by Europeana through the Europeana Portal, Europeana API, third party services or social networks. | (Europeana 2015) |

¹ Interview with Harry Verwayen, Europeana Foundation Executive Director
² Direct message from Robert Sanderson, Cultural Heritage Metadata Director at Yale University, IIIF Slack instance, 4 June 2020
1. Introduction
The presentation of digital objects by cultural heritage institutions on their respective web
platforms is a great opportunity to showcase resources, facilitate access for researchers as
well as engage with new audiences. This is all the truer for unique digitised artefacts that are
rarely accessible to the general public and which, in most cases, can only be consulted
by a select few users.
Such democratisation and easy access to digital resources nonetheless face significant
challenges because end users, using Internet commodity search engines, hardly ever discover
what has been indexed in digital library catalogues, which are generally built in silos, i.e.
with access restricted to bespoke applications, and are often poorly referenced. Similarly, cultural
heritage institutions do not have an equal chance to cope with the pace of digital
transformation, and small and medium-sized institutions do not necessarily have the required
tools and resources.
To avoid users looking for a needle in a haystack, federated efforts are critical. In this respect,
Europeana, a web portal created by the European Union and officially launched in 2008, has
strived to position itself as the main gateway for accessing Europe's cultural heritage.
Europeana is in line with other large-scale digital library initiatives such as the Digital Public
Library of America, the National Digital Library of India, or Trove in Australia, which not only
want to aggregate and disseminate content on their platform, but, thanks to their expertise in
research and development, are able to explore new harvesting approaches.
Nevertheless, digital transformation is far from being an easy task. This is especially the case in
the cultural heritage field, where libraries, archives and museums are accustomed to working
with their own metadata standards and where, once a technology is implemented, it tends to
remain in the infrastructure for an extended period of time to justify the investment. For
instance, some of the technologies that Europeana relies on to ingest the collections displayed
on its open data platform have been around for twenty years. Indeed, the technology of choice
in the context of metadata aggregation is the Open Archives Initiative Protocol for Metadata
Harvesting (OAI-PMH), which is itself an outdated mechanism that pre-dates certain principles
of the architecture of the World Wide Web.
This master's thesis sought to query some Europeana stakeholders on alternative aggregation
mechanisms facilitating a transition to other technologies to bring other benefits such as more
efficient web referencing, enhanced synchronisation, or even greater interoperability. Improved
aggregation holds the promise of greater discoverability for both human and machine users.
An upgraded workflow is also key for cultivating new pathways for organisations and users
alike to further engage with digital cultural heritage resources.
Beginning with a presentation of the Context, this dissertation then follows the relatively
standard structure of a scholarly paper: the essential state-of-the-art components are
highlighted within the Literature review, the overall Methodology is listed and briefly
described, the outcomes are showcased and analysed in the Results, some
Recommendations are drawn, and a Conclusion reflecting on the achievements and
outcomes as well as establishing future work completes the master’s thesis.
2. Context
This master’s thesis is part of the final examination requirements of the Haute école de gestion
de Genève (HEG-GE), for obtaining the Master of Science HES-SO in Information Science.
This chapter outlines the background of the master's thesis which was conducted by the author
in collaboration with the Europeana Research and Development (R&D) team from 20 February
to 14 August 2020. In this regard, it provides a few key insights on Europeana, the rationale
and background, as well as the research scope.
2.1 Europeana
Europeana is a non-profit foundation based in The Hague that supports the Europeana service,
launched in 2008 as an initiative of the European Commission. The Europeana Foundation (EF)
serves as a facilitator for a community of 2400 experts in digital cultural heritage (the Europeana
Network Association, ENA). Their mission is to improve access to Europe’s digital cultural
heritage through their open data platform which aggregates metadata and links to digital
surrogates held by over 3700 providers (Isaac 2019) from cultural heritage institutions (CHIs)
such as libraries, archives and museums.
The data comes both directly from organisations, also called data providers, as well as through
aggregators, which are intermediaries in the aggregation process who collect data from
specific countries or regions, and from specific domains (audio heritage, fashion, photography,
etc.). These aggregators advise their providers, for instance, on formats, licenses or any
technical conditions under which data can be aggregated.
Since the beginning of Europeana, the data import has been based on the Open Archives
Initiative Protocol for Metadata Harvesting (OAI-PMH). The resources, which need to comply
with the Europeana Data Model (EDM), are currently ingested by Metis, Europeana's ingestion
and aggregation service.
2.2 Rationale and background
This section summarises the motivations for revising the aggregation workflow at Europeana,
previous and current R&D projects, as well as the current aggregation strategy.
2.2.1 Motivations for revising the aggregation workflow at Europeana
Europeana would like to use technologies other than OAI-PMH, which began development in
1999 and has been stabilised in its second version since 2002 (Lagoze et al. 2002), to
aggregate metadata. There are many arguments to support discontinuing the use of this protocol.
For instance, OAI-PMH is an outdated technology: it is not very efficient, as data must be
copied in several places, its scalability is not optimal, and it is not web-centric (Van de Sompel,
Nelson 2015; Bermès 2020, p. 52). Furthermore, it is also rather expensive to maintain,
especially for institutions that use it only for data consumption by Europeana.
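To make the copy-based nature of the protocol more tangible, the following minimal sketch shows how a harvester pages through a repository with the ListRecords verb and resumption tokens, ending up with its own copy of every record. The base URL, the sample responses and the function names are assumptions made for illustration only; they do not describe the implementation used by Europeana or Metis.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Two hypothetical response pages; an empty resumptionToken marks the last page.
PAGE_1 = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header><identifier>oai:example.org:2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""
PAGE_2 = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:3</identifier></header></record>
    <resumptionToken/>
  </ListRecords>
</OAI-PMH>"""


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None, token=None):
    """Build a ListRecords request URL.

    resumptionToken is an exclusive argument: when it is sent, only the verb
    accompanies it.
    """
    if token:
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            params["from"] = from_date  # selective harvesting by datestamp
    return base_url + "?" + urlencode(params)


def parse_page(xml_text):
    """Return (record identifiers, resumption token or None) for one page."""
    root = ET.fromstring(xml_text)
    identifiers = [
        el.text
        for el in root.findall(".//oai:record/oai:header/oai:identifier", OAI_NS)
    ]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    token = token_el.text if token_el is not None and token_el.text else None
    return identifiers, token


def harvest(fetch, base_url, from_date=None):
    """Follow resumption tokens until no further page is announced.

    `fetch` is any callable mapping a URL to a response body, so the sketch
    can be exercised without network access.
    """
    identifiers, token = [], None
    while True:
        url = list_records_url(base_url, from_date=from_date, token=token)
        page_ids, token = parse_page(fetch(url))
        identifiers.extend(page_ids)
        if token is None:
            return identifiers


def fake_fetch(url):
    """Stand-in for an HTTP client, serving the two sample pages above."""
    return PAGE_2 if "resumptionToken=page-2" in url else PAGE_1


harvested = harvest(fake_fetch, "https://example.org/oai")
```

Each additional aggregator in the chain repeats this loop against the previous one, which is one way to picture why the data ends up copied in several places.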
In Appendix 1, there is a table listing the research stakeholders within the ENA who had a
direct or indirect impact on the research addressed by this master’s thesis (cf. Table 17).
The Europeana service is funded by individual EU member states and by the European
Commission via the Europeana Digital Service Infrastructure (DSI), which is in its fourth
iteration (DSI-4). Europeana DSI-4 is intended to fulfil Europeana’s 2020 strategy and
“provides access to Europe’s cultural and scientific heritage”. See (Europeana 2018; 2019a)
and project documentation at https://pro.europeana.eu/project/europeana-dsi-4
Europeana also seeks to limit the burden of (semi-) manual labour (such as the scheduling
and frequency of updates or the granularity of these updates) when registering and managing
the collections that shall be aggregated from its partners’ data services. For instance, the
updating of cultural heritage objects (CHOs) and associated metadata by content providers is
complex to perform because, at the moment, data partners need to notify Europeana manually
whenever their data has been amended. Figure 1 below illustrates Europeana's current operating
model.
Figure 1: Europeana's current operating model
(Neale, Charles 2020)
Europeana therefore seeks to find alternative mechanisms to OAI-PMH. Such a move would also
help to promote technologies among Europeana’s content providers that can support their digital
transformation independently of their contribution to Europeana (i.e. these technologies can
benefit more general data publication and exchange processes). In
particular, Europeana is interested in technologies that are essentially geared towards
exchanging higher quality data, namely data with more semantics or better standardisation.
2.2.2 R&D projects and pilot experiments
Europeana has already carried out quite a few tests with different Web technologies to
aggregate metadata and links to digitised objects in different ways. Among the aggregation
pilots, the following three can be mentioned (Freire, Isaac, Raemy 2020):
• The Rise of Literacy Project, which consisted of evaluating the application of
Linked Data and the Schema.org data model. It was carried out by the Royal
Library of the Netherlands (KB) as a data provider, the Dutch Digital Heritage
Network (NDE) as an intermediary aggregator, and Europeana.
• IIIF aggregation pilots with the University College Dublin (UCD) and the
Wellcome Library: the first dataset was ingested via Sitemaps pointing to
IIIF Manifests, while the second was crawled via a IIIF Collection
using the Data Aggregation Lab (DAL).
• Evaluation of Wikidata for data enrichment, where the usability of Wikidata
as a Linked Data source for acquiring richer descriptions of CHOs was
evaluated.
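The two IIIF discovery routes mentioned in the pilots, a Sitemap pointing to Manifests and a Collection listing Manifests, can be sketched as follows. This is a minimal illustration under stated assumptions: the sample documents and URLs are invented, the function names are hypothetical, and the code does not reproduce how the Data Aggregation Lab actually crawls these resources. It only handles the simplest label forms of the IIIF Presentation API versions 2 and 3.

```python
import json
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample documents standing in for a provider's endpoints.
SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/iiif/1/manifest</loc></url>
  <url><loc>https://example.org/iiif/2/manifest</loc></url>
</urlset>"""
MANIFEST_V2 = json.dumps({
    "@id": "https://example.org/iiif/1/manifest",
    "@type": "sc:Manifest",
    "label": "Book of hours",
})
MANIFEST_V3 = json.dumps({
    "id": "https://example.org/iiif/2/manifest",
    "type": "Manifest",
    "label": {"en": ["Map of Dublin"]},
})
COLLECTION_V3 = json.dumps({
    "id": "https://example.org/iiif/collection",
    "type": "Collection",
    "items": [
        {"id": "https://example.org/iiif/2/manifest", "type": "Manifest"},
        {"id": "https://example.org/iiif/sub", "type": "Collection"},
    ],
})


def manifest_urls(sitemap_xml):
    """Route 1: read the <loc> entries of a Sitemap that lists Manifests."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)]


def collection_manifest_ids(collection_json):
    """Route 2: list the Manifests referenced directly by a IIIF Collection.

    Version 3 uses `items`/`id`/`type`, version 2 uses `manifests`/`@id`/`@type`.
    Nested Collections are skipped here rather than crawled recursively.
    """
    collection = json.loads(collection_json)
    entries = collection.get("items") or collection.get("manifests") or []
    return [
        entry.get("id") or entry.get("@id")
        for entry in entries
        if entry.get("type", entry.get("@type")) in ("Manifest", "sc:Manifest")
    ]


def manifest_summary(manifest_json):
    """Extract an identifier and a display label from a Manifest."""
    manifest = json.loads(manifest_json)
    label = manifest.get("label")
    if isinstance(label, dict):  # version 3 language map, e.g. {"en": ["..."]}
        label = next(iter(label.values()))[0]
    return {"id": manifest.get("id") or manifest.get("@id"), "label": label}
```

In both routes the aggregator only needs one entry point from the provider; the descriptive data is then read from the Manifests themselves.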
Cf. Section 3.3 for more information on the Web technologies tested in these pilots.
Finally, it is also important to highlight that from January 2019 to December 2020 the
Europeana Common Culture (ECC) project is being carried out. Within this project, there is
a Linked Data aggregation functional application led by the Netherlands Institute for Sound
and Vision (NISV) to bring the Linked Open Data (LOD) to EDM aggregation route into practice
(Freire 2020a; 2020b). The NDE has then developed the LOD-aggregator, a specific pipeline
to harvest the data and convert Schema.org into EDM (Freire, Verbruggen, et al. 2019).
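The conversion step of such a pipeline can be pictured as a property-by-property mapping. The sketch below is purely illustrative: the chosen Schema.org and EDM property pairs, the record structure and the function name are assumptions, and they should not be read as the mapping actually implemented in the LOD-aggregator.

```python
# Hypothetical, simplified pairs of Schema.org properties and EDM properties.
SCHEMA_TO_EDM = {
    "name": "dc:title",
    "creator": "dc:creator",
    "description": "dc:description",
    "inLanguage": "dc:language",
}


def schema_to_edm(item, data_provider):
    """Convert a flat Schema.org description into two EDM-like resources.

    Descriptive properties go to the edm:ProvidedCHO, while links and rights
    go to the ore:Aggregation that refers to it.
    """
    cho = {"@id": item["@id"], "@type": "edm:ProvidedCHO"}
    for schema_prop, edm_prop in SCHEMA_TO_EDM.items():
        if schema_prop in item:
            cho[edm_prop] = item[schema_prop]

    aggregation = {
        "@type": "ore:Aggregation",
        "edm:aggregatedCHO": item["@id"],
        "edm:dataProvider": data_provider,
    }
    if "url" in item:
        aggregation["edm:isShownAt"] = item["url"]    # landing page
    if "image" in item:
        aggregation["edm:isShownBy"] = item["image"]  # direct link to the media
    if "license" in item:
        aggregation["edm:rights"] = item["license"]
    return {"providedCHO": cho, "aggregation": aggregation}


# Invented example record, for illustration only.
example = schema_to_edm(
    {
        "@id": "https://example.org/item/1",
        "@type": "CreativeWork",
        "name": "Letter to a friend",
        "creator": "Anna Example",
        "url": "https://example.org/view/1",
        "license": "http://creativecommons.org/publicdomain/zero/1.0/",
    },
    data_provider="Example Library",
)
```

Properties without a counterpart in the mapping table are silently dropped in this sketch; a real conversion would need an explicit decision for each of them.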
2.2.3 Aggregation strategy
In May 2020, Europeana released a new aggregation strategy to “provide long-term direction
of the aggregation of European cultural heritage metadata and content” (Neale, Charles 2020).
This strategy has been adopted with the intention of supporting Europeana's technical
infrastructure, in particular Metis, with a view to providing more optimal and swifter publishing
options, facilitating data onboarding as well as ensuring improved data quality in a very
complex landscape where the three (groups of) stakeholders (CHIs, aggregators, and the EF)
have varying motivations, resources and technological skills (Neale, Charles 2020).
Furthermore, it should also be stressed that the strategy has been produced to be in line with
Europeana's global strategy 2020-2025 and more specifically with Objective 1A: Develop a
more efficient aggregation infrastructure (Europeana 2020).
As part of this aggregation strategy, the following seven outcomes have been articulated:
1. Maintain the current Metis service
2. Speed up dataset updates
3. Involve contributors in testing
4. Enable fast track publishing workflow
5. Add new data source options
6. Encourage data enrichment
7. Investigate content hosting
These different outcomes have also been designed to be represented as a conceptual solution
evolving over time, as shown in Figure 2 below with the top elements of the pyramid
symbolising a longer-term approach.
Figure 2: Aggregation strategy's conceptual solution
(Neale, Charles 2020)
Some of the outcomes identified in the conceptual solution have a greater impact on this
dissertation, particularly those that propose a number of alternative mechanisms for
aggregators and CHIs.
For instance, within the third outcome, which aims to involve contributors in testing their data
sources before ingestion into Metis, the idea of a Sandbox has been devised. The
functionalities of this tool would range from data import and conversion to EDM to
data enrichment and preview.
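The kind of check a contributor might want to run before ingestion can be sketched as follows. The rules are a deliberately simplified assumption covering only a handful of EDM requirements (a title or description, a media type, rights, a data provider and at least one link to the object); the record structure and function name are hypothetical, and the sketch does not describe the Sandbox or the validation performed by Metis.

```python
EDM_TYPES = {"TEXT", "IMAGE", "SOUND", "VIDEO", "3D"}


def check_record(record):
    """Return a list of human-readable problems; an empty list means the
    record passes this simplified subset of checks."""
    problems = []
    if not (record.get("dc:title") or record.get("dc:description")):
        problems.append("missing dc:title or dc:description")
    if record.get("edm:type") not in EDM_TYPES:
        problems.append("edm:type must be one of " + ", ".join(sorted(EDM_TYPES)))
    for prop in ("edm:rights", "edm:dataProvider"):
        if not record.get(prop):
            problems.append("missing " + prop)
    if not (record.get("edm:isShownAt") or record.get("edm:isShownBy")):
        problems.append("missing edm:isShownAt or edm:isShownBy")
    return problems


# Invented records, for illustration only.
complete = {
    "dc:title": "Letter to a friend",
    "edm:type": "TEXT",
    "edm:rights": "http://creativecommons.org/publicdomain/zero/1.0/",
    "edm:dataProvider": "Example Library",
    "edm:isShownAt": "https://example.org/view/1",
}
incomplete = {"dc:title": "Untitled photograph", "edm:type": "PHOTO"}

report = {"complete": check_record(complete), "incomplete": check_record(incomplete)}
```

Returning all problems at once, rather than stopping at the first one, lets a data provider correct a record in a single pass.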
In addition, the fourth, fifth and sixth outcomes of the aggregation strategy, which involve the
possibility to have a fast track publishing workflow, to add new data source options (such as
IIIF and Linked Data) as well as to encourage data enrichment, all rely on an extended Metis
Sandbox concept (cf. Figure 3). Encouraging data enrichment would, for instance,
make it possible to reduce human intervention, to upload data and transform it from common
standards, and to improve overall data quality (Neale, Charles 2020).
Figure 3: Extended Metis Sandbox Concept
(Neale, Charles 2020)
Moreover, within the strategy itself, a three-stage roadmap was planned with a view to
implementing the seven different outcomes. This sequential planning, outlining different tasks,
is foreseen to take place over a period of two years (Neale, Charles 2020).
2.3 Research scope
The scope of this research is briefly explained in this section in terms of expectations,
constraints as well as research questions and objectives.
2.3.1 Expectations
One of the main expectations of the research was to support Europeana’s decisions on the
directions for improved data aggregation, notably, but not exclusively, in terms of compliance
with EDM prior to ingestion in Metis, traceability upon data update, as well as guidance to data
providers and aggregators.
This master's thesis would allow the work already carried out by Europeana's experts in this
field to be continued while making new technical and strategic recommendations.
2.3.2 Constraints
The two identified areas that could constrain the realisation of this master’s thesis were the
dependencies on the management of Europeana’s activities and the limitations
resulting from the current technological landscape and aggregation operating model.
2.3.3 Research questions
This master’s thesis sought to address two research questions, each of which led to further
sub-questions that necessitated additional investigation.
1) Are there suitable alternatives to OAI-PMH in terms of scalability and ease of use
that cover the requirements of the ENA?
a. Do those technologies cover all the requirements currently supported by OAI-PMH?
b. What additional features do these technologies bring that are not, or are badly,
supported by OAI-PMH?
c. Could those technologies complement or replace the existing technologies?
2) What are the feasibility conditions for deploying these technologies in the Europeana
context?
a. How would adoption of these technologies impact Europeana's current operating
model?
b. How should these new alternatives be presented to CHIs so that they are interested in
investing in them?
c. How could institutions start using those technologies without too much investment?
2.3.4 Objectives
Three main objectives, themselves divided into several specific ones, were identified.
1) To provide a brief historical background since the creation of OAI-PMH as well
as a comparison between the different methods of aggregating cultural heritage
resources and their associated metadata.
a. To carry out a state-of-the-art study on the different methods and technologies
applied to metadata aggregation
b. To provide a comparative overview of the various technologies identified for
aggregation
c. To help identify the requirements of the most promising technology that can handle
updates and variety of metadata models
2) To participate in the design and evaluation of prototypes and pilot experiments
with the technologies identified
a. To gather representative data from Europeana's partner institutions
b. To establish a procedure with Europeana’s R&D team
c. To conduct and refine tests on different technologies with the help of DAL
d. To analyse and extrapolate the acquired outcomes
3) To investigate what services Europeana could offer to encourage adoption of
technologies that will gradually be used in place of OAI-PMH
a. To assess the impact of leveraging Web technologies to aggregate metadata
from Europeana’s partner institutions
b. To suggest different scenarios that conform to the strategy for Metis
c. To establish recommendations and guidelines to Europeana and partner
institutions to reduce non-automated labour
d. To determine the next steps to be carried out within Europeana and its Network
3. Literature review
This literature review focuses on the different (meta)data aggregation mechanisms in the
cultural heritage (CH) domain and how specific technologies enhance resource discoverability
and sharing.
First, it gives an overview of some important metadata standards in the CH domain. Then, a
dedicated section is devoted to the current aggregation process. Lastly, the literature review
covers the different Web-based technologies that can enable (meta)data aggregation.
3.1 Metadata in the cultural heritage domain
While CHIs share common information management goals and interests (Lim, Li Liew 2011), such
as providing access to knowledge and ensuring the sustainability of CHOs, they are also
characterised by their diversity with respect to the metadata landscape. Each domain has
distinct ways of describing the resources it collects, preserves, and showcases. Even within a
particular domain, significant differences can still be observed (Mitchell 2013; Freire, Voorburg,
et al. 2019).
The purpose of this section is to succinctly present the metadata types and standards used by
CHIs, both with respect to their distinct natures, but also in what brings them together, such as
through the application of LOD technologies.
3.1.1 Types of metadata
Metadata can be divided into several categories, or types, to serve different data management
purposes. Traditional library cataloguing focuses, for example, on the identification and
description of resources, but there are obviously other types of metadata that carry valuable
insights (Hillmann, Marker, Brady 2008).
According to Zeng and Qin (2016), five key purposes can be distinguished: administrative,
technical, descriptive, preservation, and use. These types of metadata can either coexist within
the same standard or be the subject of a specific one.
The metadata typology can help individuals in charge of data curation to anticipate future
actions. For instance, technical metadata can be leveraged for
collection profiling or format validation (Lindlar 2020), while use metadata can
indicate when or whether a given resource can enter the public domain or be freely
accessible (Whalen 2016).
Metadata are not fixed statements and can be created and maintained incrementally
throughout the data’s lifecycle (Baca 2016b; 2016c).
3.1.2 Metadata standards
Metadata standards are critical to establishing structured consistency of information, thereby
enabling a common interpretation between different stakeholders, both those who own and
those who use the resources. Within the CH domain, the use of metadata “aids in the
identification, assessment, and management of the described entities [users] seek” (Zeng, Qin
2016, p. 3).
Standards have evolved on the basis of the different cultures of each respective
subdomain and the underlying (typical) application focus. In libraries, the value is in the content
and the objects are rather regarded as carriers; in archives, the value is in the collection and
in its description; and in museums, which have many unique artefacts, the value is in the object
(Sanderson 2020a). In addition, the automation of museum and archives collection
management took place later than for libraries. As such, interoperability and information
exchange between libraries have progressed more rapidly than in other types of CHIs
(Jacquesson, Roten, Levrat 2019).
Each subdomain has accordingly created and maintained their own metadata standards, rules
and models. Many specifications developed for information resources have also been
endorsed by standards bodies (Greenberg 2005) and some of these standards are solely used
within a specific domain community (Hillmann, Marker, Brady 2008).
Although this dissertation does not specifically focus on CH metadata standards, Table 2
presents notable examples of the most common and widely used standards
along with a short description and the main functions they fulfil. For the latter, the choice of
classification was made on the basis of the comprehensive standard visualisation and six of
the seven functions of Riley and Becker (2010): conceptual model, content standard, controlled
vocabulary, markup language, record format, and structure standard.
Table 2: A selection of metadata standards in the CH domain
| Standard | Short description | Functions |
| --- | --- | --- |
| Anglo-American Cataloguing Rules, 2nd edition (AACR2) | AACR2 is a data content standard for describing bibliographic materials (Baca 2016a). | Content Standard |
| Bibliographic Framework (BIBFRAME) | BIBFRAME is a data model for bibliographic description designed to replace the MARC standards and to use the principles of linked data to make bibliographic data more useful within the library community as well as in the broader universe of information (Baca 2016a). | Conceptual Model, Structure Standard, Content Standard |
| Cataloguing Cultural Objects (CCO) | CCO is a manual for describing, documenting, and cataloguing cultural works and their visual surrogates (Coburn et al. 2010). | Content Standard, Controlled Vocabulary |
| CIDOC Conceptual Reference Model (CIDOC-CRM) | CIDOC-CRM is an object-oriented model for the publication and interchange of cultural heritage information (Baca 2016a). | Conceptual Model, Structure Standard |
| Dublin Core (DC): Dublin Core Metadata Element Set (DCMES); Dublin Core Terms (DCT) | Originally, the Dublin Core Metadata Initiative proposed a set of fifteen metadata elements (DCMES) as a common denominator for metadata mapping. The more recent Dublin Core Terms (DCT) include additional metadata elements for greater precision. Both namespaces can be used for a Linked Data application since the terms are expressed as RDF vocabularies (Baca 2016a; Jaffe 2017). | Structure Standard |
| Encoded Archival Description (EAD) | EAD is a data structure standard for encoding archival finding aids in SGML or XML according to the EAD document type definition (DTD) or XML schema that makes it possible for the semantic contents of a finding aid to be machine processed (Baca 2016a). | Record Format, Markup Language, Structure Standard |
| Linked Art | Linked Art is both a community and a data model based on LOD to describe art (Delmas-Glass, Sanderson 2020). | Conceptual Model, Record Format, Structure Standard |
| Lightweight Information Describing Objects (LIDO) | LIDO is a simple XML schema for describing and interchanging core information about museum objects (Baca 2016a). | Record Format, Markup Language, Structure Standard |
| Machine-Readable Cataloging (MARC) | MARC is a set of standardized data structures for describing bibliographic materials that facilitates cooperative cataloging and data exchange in bibliographic information systems (Baca 2016a). | Structure Standard, Record Format, Content Standard |
| Metadata Encoding and Transmission Standard (METS) | METS is a standard for encoding descriptive, administrative, and structural metadata relating to objects in a digital library, expressed in XML (Baca 2016a). | Structure Standard, Record Format |
| Resource Description & Access (RDA) | RDA is a cataloguing standard for libraries which has begun to replace AACR2 (Baca 2016a). | Content Standard, Structure Standard |
| Visual Resources Association (VRA) Core | VRA Core is a data standard for the description of works of art and architecture as well as the digital surrogates that document them (Riley, Becker 2010; Baca 2016c). | Structure Standard, Record Format, Controlled Vocabulary |

Most of the descriptions come from Introduction to Metadata’s glossary edited by Baca
(2016a): https://www.getty.edu/publications/intrometadata/glossary/
All of these functions are described in the Terminology (cf. Table 1)
Moreover, the classification of metadata standards according to their functionality is not crisp,
i.e. classification decisions may vary depending on perspective. Other classifications of
metadata standards in the CH domain are known to sort them according to each subdomain,
for instance, the taxonomy of Elings and Waibel (2007) and the metadata blocks’ clustering of
Mitchell (2013).
3.1.3 Metadata convergence and interoperability
Among the first interoperability efforts in the CH sector was the development of MARC
standards in the 1960s to facilitate the exchange of bibliographic data between libraries
(Bermès 2011; Baca 2016c). The exchange of records between libraries was further
facilitated by the establishment of Z39.50, a client-server standard from the late 1970s that
predates the Web and allows one to query different library catalogues
(Alexander, Gautam 2004).
OAI-PMH, which was developed at the end of the 1990s and has its roots in the open access
movement (Gaudinat et al. 2017), relies on DC and XML as a simple common denominator for
achieving interoperability. It is a well-established protocol within the CH domain (cf. 3.2.2
for more information) that bridges the gaps within the broader GLAM (Galleries, Libraries,
Archives, Museums) community, even though it is not based on the architecture of the Web
(Freire et al. 2017). Alternatively, SRU/SRW (Search and Retrieve URL/Web Service), which
was specified in 2006, is a “web service for search and retrieval based on Z39.50 semantics”
(Lynch 1997; Reiss 2007), but it is limited to the library domain (Bermès 2011).
As none of these efforts (with the exception of SRU/SRW) is based on web standards, they
presuppose that end-users will end up on CHIs’ platforms to discover CHOs (Bermès 2011).
In the last 30 years, the advent of the World Wide Web in 1989 (Berners-Lee, Fischetti 2001)
and the emergence of online catalogues in the decades that followed have challenged the
traditional functions of cataloguing as well as the metadata standards in use (Bermès, Isaac,
Poupeau 2013, p. 20). The Web is a tremendous enabler for CHIs to share information and
showcase their collections to a wider spectrum of users. However, in order to facilitate a rich
user experience where individuals can navigate seamlessly from one resource to another on
CH platforms without being concerned about their provenance, some common denominators
must be found, as CHIs have historically maintained siloed catalogues, disconnected from the
broader web ecosystem (Bermès 2011).
The Semantic Web, which can be seen as an extension of the Web through World Wide Web
Consortium (W3C) standards (Berners-Lee, Hendler, Lassila 2001), enables interoperability
based on URIs and the creation of a global information space. It is based on Resource
Description Framework (RDF) assertions that follow a subject-predicate-object structure
(Bermès, Isaac, Poupeau 2013, p. 37).
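As a minimal illustration of the subject-predicate-object structure, the following sketch serialises two assertions as N-Triples using only the Python standard library. The example.org URI and the literal values are hypothetical, and the helper `to_ntriples` is an illustrative name rather than part of any RDF library; the two predicates come from the DCMES namespace.

```python
# Each RDF assertion is a (subject, predicate, object) triple.
DC = "http://purl.org/dc/elements/1.1/"
CHO = "https://example.org/object/42"  # hypothetical URI of a cultural heritage object

triples = [
    (CHO, DC + "title", "View of Delft"),       # illustrative literal values
    (CHO, DC + "creator", "Johannes Vermeer"),
]

def to_ntriples(rows):
    """Serialise triples as N-Triples; objects starting with http(s) are treated as URIs."""
    lines = []
    for s, p, o in rows:
        if o.startswith(("http://", "https://")):
            obj = f"<{o}>"
        else:
            escaped = o.replace("\\", "\\\\").replace('"', '\\"')
            obj = f'"{escaped}"'
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)

print(to_ntriples(triples))
```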
Publishing data as part of the Semantic Web requires following certain steps, such as
using URIs to designate resources, dereferencing HTTP URIs, using W3C standards (RDF,
SPARQL), as well as linking the published dataset to other endpoints. A generic deployment
scheme is the five stars of LOD (as shown in Figure 4 below) initiated by Tim Berners-Lee
(2009). This scheme can, for instance, be used to indicate the compliance level and to assess
the effort required to reach LOD.
Figure 4: Five stars of LOD
(Berners-Lee 2006)
Historically, the CH domain has been interested in the Semantic Web from the very start.
DC is a notable example, as it was heavily inspired by RDF (Wolf 1998).
For the past few years, most of the LOD projects in the CH domain have been carried out to
expose data to larger audiences, for metadata enrichment, or to facilitate data interlinking
(Smith-Yoshimura 2018).
While an increasing number of CHIs, essentially research libraries or national libraries, are
involved in LOD publishing (Smith-Yoshimura 2018), it should be noted that, according to a
survey performed in 2018 by the ADAPT Centre at Trinity College Dublin among 185
information professionals, the greatest challenge lies in the difficulty of integrating and
interlinking datasets, and that the mapping between traditional CH metadata standards and
native LOD models still causes problems (McKenna, Debruyne, O’Sullivan 2018; 2020).
3.2 Aggregation landscape in the cultural heritage domain
This section gives a brief overview of the aggregation landscape in the CH domain, outlining
the main national or transnational aggregation initiatives, then providing an overview of OAI-
PMH, as well as providing Europeana's requirements in terms of publication on their platform.
3.2.1 Cultural heritage aggregation platforms
Recent years have seen several national and transnational initiatives set up scalable and
sustainable platforms to support resource discoverability in the CH domain, such as DigitalNZ
(New Zealand) launched in 2006, Europeana in 2008, Trove (Australia) in 2009, the Digital
Public Library of America (DPLA) in 2013, as well as the National Digital Library of India (NDLI)
in 2016. However, it should also be noted that all these initiatives still partially or heavily depend
on OAI-PMH to harvest metadata.
3.2.2 Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)
OAI-PMH is an XML-based specification that started in 1999 to improve the discoverability of
e-prints through metadata aggregation. The second version of OAI-PMH is the latest stable
release of the protocol and was defined in 2002. Data modelling in OAI-PMH relies on,
but is not restricted to, DC (Lagoze et al. 2002).
OAI-PMH divides the framework's actors into data providers, which provide access to
metadata, and service providers, which use harvested metadata to store and enrich their own
repositories (Alexander, Gautam 2004). As shown in Figure 5, the protocol defines six requests
(or “verbs”) that can be issued as parameters for HTTP GET or POST requests: Identify,
ListMetadataFormats, ListSets, GetRecord, ListRecords, and ListIdentifiers. OAI-PMH defines
five possible types of responses, encoded in XML: General Information, Metadata Formats, Set
Structure, Record Identifier, Metadata (Lagoze et al. 2002).
Figure 5: OAI-PMH Structure
(Lovrečić 2010)
As such, it is an asynchronous protocol that predates the modern architecture of the Web
(Bermès, Isaac, Poupeau 2013, p. 36). It is technically not in line with REST principles and can
be seen, on a conceptual level, as repository-centric as opposed to resource-centric (Van
de Sompel, Nelson 2015). Another resulting concern is the fact that OAI-PMH “has rarely been
implemented to its full scale, i.e. benefiting from its incremental harvesting features” and that
the CHO metadata of a data provider evolves separately from that hosted on the service
provider’s side (Freire, Robson, et al. 2018).
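To make the verbs and the incremental harvesting feature more concrete, the sketch below builds OAI-PMH request URLs and reads the resumptionToken that a data provider returns when a list response is incomplete. It is a minimal sketch, not a full harvester: the base URL, the sample response and the helper names `build_request` and `next_token` are assumptions made for illustration, while the verb, the `metadataPrefix` and `from` arguments, and the XML namespace come from the OAI-PMH 2.0 specification.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

def build_request(base_url, verb, **params):
    """Return the GET URL for an OAI-PMH request."""
    query = {"verb": verb}
    # 'from' is a Python keyword, so callers pass 'from_' instead.
    query.update({key.rstrip("_"): value for key, value in params.items()})
    return base_url + "?" + urlencode(query)

def next_token(response_xml):
    """Return the resumptionToken of a list response, or None if the list is complete."""
    root = ET.fromstring(response_xml)
    token = root.find(".//oai:resumptionToken", OAI_NS)
    return token.text if token is not None and token.text else None

# Incremental harvest: only records changed since the given date (hypothetical endpoint).
url = build_request("https://example.org/oai", "ListRecords",
                    metadataPrefix="oai_dc", from_="2020-01-01")

# Trimmed, hypothetical response used to show how paging continues.
sample = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <resumptionToken>abc123</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

follow_up = build_request("https://example.org/oai", "ListRecords",
                          resumptionToken=next_token(sample))
```

A harvester would repeat the follow-up request until no token is returned, and store the date of the run to use as the `from` argument next time.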
3.2.3 Publication requirements of the Europeana Network
This subsection covers metadata and content requirements that CHIs and aggregators have
to follow to publish their data onto the Europeana platform.
Once the Europeana Data Exchange Agreement (DEA) has been signed by a CHI or by an
aggregator, there is first a test phase with a potential data partner where a sample of their
collection has to be provided to Europeana, either by ZIP or via OAI-PMH (Scholz 2019a).
Europeana's overall data contribution workflow can be divided into three phases: data
submission, data processing (cf. Table 18 in the Appendices where the nine relevant ingestion
steps in Metis are listed and briefly explained) and data publication.
At Europeana, EDM is the solution that has been found to reconcile the different models as
well as to publish LOD (Charles, Isaac 2015). EDM is a generic model based on OAI-ORE,
SKOS, and DC among others (Doerr et al. 2010), is an improvement on the Europeana
Semantic Elements (ESE), and “aim[s] at being an integration medium for collecting,
connecting and enriching the descriptions provided by Europeana’s content providers”
(Charles et al. 2017, p. 8). While this dissertation does not go into detail on EDM, it is important
to mention that each CHO issued to Europeana leads to the creation of instances of the
following main classes of EDM:
“[a] Cultural Heritage Object (i.e., edm:ProvidedCHO and ore:Proxy that represent
different data sources for objects), one or more digital representations (i.e,
edm:WebResource) and ‘contextual’ resources (places, persons, concepts, timespans), in
compliance with the one-to-one principle” (Wallis et al. 2017)
For contributing metadata to Europeana, a number of EDM elements are mandatory to ensure
that the data and associated metadata are of the highest possible quality (Isaac, Clayphan
2013; Charles, Isaac 2015). The labelling of objects with valid rights statements, through the
edm:rights property, is also required. The 14 available rights statements come from
rightsstatements.org, an initiative led by Europeana, DPLA, Kennisland, and Creative
Commons (CC) (Fallon 2015; Scholz 2019a). Additionally, to display CHOs in thematic
collections (archaeology, art, fashion, etc.) and hence make them more visible on the
Europeana platform, a data partner must provide relevant keywords (Scholz 2019a).
The quality of data contributed to Europeana is measured via different tiers for metadata (A,
B, C) and for content (1, 2, 3, 4) within the Europeana Publishing Framework (EPF). In
essence, the better rated the data related to a CHO, the more that CHO will be visible on the
Europeana platform (Daley, Scholz, Charles 2019; Scholz 2019a; 2019b).
Mapping examples in EDM can be found in Appendix 3 starting on page 66.
Available rights statements: https://pro.europeana.eu/page/available-rights-statements
Finally, it bears remembering that EDM is not a fixed model and is updated in accordance with
the needs of the Europeana Foundation and the Europeana Network. As an example,
Europeana has recently extended EDM so that its data ingestion framework can accept and
recognize IIIF resources (Isaac, Charles 2016; Isaac 2019).
3.3 Alternative web-based technologies for (meta)data aggregation
This section looks at the different aggregation components and technologies which can be a
part of alternative mechanisms that Europeana could deploy and propose to its aggregators
and CHIs. These technologies may potentially one day supersede OAI-PMH either entirely or
partially.
The first subsection introduces the two essential components and the subsequent two
subsections outline the underlying technologies in terms of their capabilities and their
relevance as an aggregation mechanism. The fourth and last subsection gives a high-level
overview of the different aggregation mechanisms that are highlighted in this literature review.
3.3.1 Aggregation components
Several components need to be considered in the aggregation process. Based on Freire et al.
(2017), two main categories have been identified:
Data transfer and synchronisation
Data modelling and representation
In Table 3, the two aggregation components are briefly described.
Table 3: Aggregation components
Component | Short description
Data transfer and synchronisation | One of the essential components of aggregation is the transfer of data sources from a hosting website to a third party platform, in other words finding a way for an aggregator to collect (meta)data from a CHI. Resources are also likely to evolve on the CHI website and consideration needs to be given to the synchronisation of data sources as well, for instance by using an incremental approach which could be achieved by using a notification mechanism based on Semantic Web technologies.
Data modelling and representation | Data transfer and synchronisation needs to rely upon an agreed data model in order to tackle data heterogeneity. This is all the truer for aggregators showcasing on their platforms data from various domains, like the CH sector. In the case of Europeana, the data model and representation chosen, with which data providers and intermediary aggregators must comply, is EDM. Other data models can be explored, as has already been the case with Schema.org, but some metadata mapping would still be required.
3.3.2 Technologies for data transfer and synchronisation
The technologies for data transfer and synchronisation are organized in alphabetical order.
At the beginning of each section, there is a dotted box indicating one or more namespaces
depending on the number of mechanisms tied with the protocol, as well as a comment on
whether or not a pilot experiment has already taken place within the Europeana Network. Then
there is a short presentation of the technology and its relevance for (meta)data aggregation.
IIIF to EDM profile: https://pro.europeana.eu/page/edm-profiles#iiif-to-edm-profile
3.3.2.1 ActivityStreams 2.0 (AS2)
https://www.w3.org/TR/activitystreams-core/
https://www.w3.org/TR/activitystreams-vocabulary/
Pilot experiment already conducted in the
context of the IIIF Change Discovery API
which leverages AS2.
ActivityStreams 2.0 (AS2) is both a syntax and a W3C vocabulary. Part of the Social Web
Protocols series, it allows activity flows, actors, objects and collections to be represented in
JSON(-LD) and syndicated within social web applications (Guy 2017; Snell, Prodromou
2017a; 2017b).
AS2 is generally not used on its own and is a valuable adjunct to help with data transfer and
synchronisation. For example, its verbs are used by the IIIF Change Discovery API (cf. 3.3.2.3)
and AS2 can be combined with other Social Web Protocols such as AP or Linked Data
Notifications (LDN) for the payload of notification requests (Guy 2017; Sanderson 2018;
Appleby et al. 2020a).
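As a hedged illustration of the AS2 syntax, the sketch below expresses the statement "an institution updated a resource" as an AS2 activity and serialises it to JSON. The `@context` value, the `Update` and `Document` types and the `published` property come from the AS2 vocabulary; the URIs and the timestamp are hypothetical.

```python
import json

# An AS2 activity: who (actor) did what (type) to which resource (object), and when.
activity = {
    "@context": "https://www.w3.org/ns/activitystreams",
    "type": "Update",
    "actor": "https://example.org/institution",      # hypothetical CHI
    "object": {
        "id": "https://example.org/object/42",       # hypothetical resource
        "type": "Document",
    },
    "published": "2020-06-01T12:00:00Z",
}

payload = json.dumps(activity, indent=2)
print(payload)
```

Such a payload is what a notification protocol like AP or LDN would carry between a data provider and an aggregator.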
3.3.2.2 ActivityPub (AP)
https://www.w3.org/TR/activitypub/
No previous pilot experiment
ActivityPub (AP) is an open and decentralized W3C standard published as a W3C
Recommendation in 2018. Based on the AS2 vocabulary, it provides a JSON(-LD) API for
client-to-server (publishing) and server-to-server (federation) interactions.
It is part of the suite of Social Web Protocols and its use in metadata aggregation lies in its
ability to notify actors across a given network via GET and POST HTTP Requests of each
action (or activity) (Lemmer Webber, Tallon 2018). In other words, each actor has an inbox to
receive messages and an outbox to send them (Guy 2017). These different boxes are
equivalent to endpoints as illustrated by Figure 6.
Figure 6: ActivityPub request examples
(Lemmer Webber, Tallon 2018)
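The inbox and outbox mechanism can be sketched as follows. This is a minimal example under stated assumptions: the two actors and their URLs are hypothetical, the helper `prepare_delivery` is an illustrative name, and the request is only prepared, not sent. The `inbox` and `outbox` properties and the media type used for delivery come from the ActivityPub specification.

```python
import json
from urllib.request import Request

AS2_TYPE = 'application/ld+json; profile="https://www.w3.org/ns/activitystreams"'

# Hypothetical actors: a data provider and an aggregator, each with its own endpoints.
provider = {
    "type": "Organization",
    "id": "https://chi.example/actor",
    "inbox": "https://chi.example/inbox",
    "outbox": "https://chi.example/outbox",
}
aggregator = {
    "type": "Organization",
    "id": "https://aggregator.example/actor",
    "inbox": "https://aggregator.example/inbox",
    "outbox": "https://aggregator.example/outbox",
}

def prepare_delivery(activity, recipient):
    """Build (without sending) the POST request that delivers an activity to an inbox."""
    body = json.dumps(activity).encode("utf-8")
    return Request(recipient["inbox"], data=body, method="POST",
                   headers={"Content-Type": AS2_TYPE})

activity = {
    "@context": "https://www.w3.org/ns/activitystreams",
    "type": "Create",
    "actor": provider["id"],
    "object": "https://chi.example/object/42",
}
request = prepare_delivery(activity, aggregator)
```

Passing `request` to `urllib.request.urlopen` would perform the server-to-server delivery; reading an actor's outbox is a plain GET on the `outbox` URL.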
3.3.2.3 International Image Interoperability Framework (IIIF)
https://iiif.io/api/image/2.1/
https://iiif.io/api/presentation/2.1/
https://iiif.io/api/discovery/0.9/
Pilot experiments already conducted with aggregation based
on IIIF and Sitemaps, IIIF Collections (collection of IIIF
Manifests or collection of IIIF collections), as well as with the
Change Discovery API.
https://iiif.io/api/image/3.0/
https://iiif.io/api/presentation/3.0/
https://iiif.io/api/content-state/0.2/
No previous pilot experiment
The International Image Interoperability Framework (IIIF, pronounced ‘triple-eye-eff’) is a
community-driven initiative that creates shared APIs to display and annotate digital
representations of objects (Hadro 2019). As shown in Figure 7, IIIF has enabled the creation
of an ecosystem around web-based images, which consists of various organisations deploying
software that complies with the specifications (Snydman, Sanderson, Cramer 2015).
Figure 7: IIIF APIs in the client-server model
(Hadro 2019)
The following are the current stable IIIF APIs, which are all HTTP-based web services
serialised in JSON-LD (Raemy, Schneider 2019):
IIIF Image API 3.0: specifies a web service that returns an image in response
to a standard HTTP(S) request (Hadro 2019; Appleby et al. 2020b).
IIIF Presentation API 3.0: provides the necessary information about the object
structure and layout (Hadro 2019; Appleby et al. 2020c).
IIIF Content Search API 1.0: gives access and interoperability mechanisms for
searching within a textual annotation of an object (Appleby et al. 2016; Raemy
2017).
IIIF Authentication API 1.0: allows application of IIIF for access-restricted
objects (Appleby et al. 2017; Raemy 2017).
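To show what "an image in response to a standard HTTP(S) request" means in practice, the sketch below assembles an Image API request URL from its path segments (region, size, rotation, quality and format), as defined by the IIIF Image API. The server base URL and the identifier are hypothetical, and `image_url` is an illustrative helper name.

```python
from urllib.parse import quote

def image_url(base, identifier, region="full", size="max",
              rotation="0", quality="default", fmt="jpg"):
    """Build an Image API URL: {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}"""
    # Identifiers must be URI-encoded, including any slashes they contain.
    encoded = quote(identifier, safe="")
    return f"{base}/{encoded}/{region}/{size}/{rotation}/{quality}.{fmt}"

# Full image at maximum size (hypothetical server and identifier).
full = image_url("https://example.org/iiif", "ms/42")

# A 1000 x 800 pixel detail starting at the top-left corner, scaled to 500 pixels wide.
detail = image_url("https://example.org/iiif", "ms/42",
                   region="0,0,1000,800", size="500,")
```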
While all of these APIs have a strong focus on delivering rich data to end users, they were not
specifically designed to support metadata aggregation (Rabun 2016; Warner 2017). For
example, there aren’t any requirements in terms of metadata standards that accompany IIIF
Manifests (which are the representation and description of an object) and there aren’t any
elements indicating a timestamp for the creation or modification of an object (Freire et al. 2017).
The IIIF Image and Presentation APIs are sometimes referred to as the core IIIF
specifications. In June 2020, both were upgraded from 2.1.1 to 3.0, with breaking changes,
to integrate time-based media. Most organisations that have implemented IIIF,
including Europeana, will need to revamp their infrastructure and their IIIF resources as well
to support the newest versions of the IIIF core APIs.
The first issue, the absence of metadata requirements, can be addressed by adding an optional
link to structured metadata (through rdfs:seeAlso). The second, the absence of timestamps, is
currently being addressed by the IIIF community through its Discovery Technical Specification
Group and the creation of several specifications, namely the Change Discovery and Content
State APIs, as well as the creation of a central IIIF registry (Sanderson 2018; Robson et al. 2020).
IIIF Change Discovery API 0.9: specifies a machine to machine API that
provides the information needed to discover and subsequently make use of IIIF
resources. It leverages ActivityStreams to describe changes to resources and
facilitates crawling to build search indexes (Sanderson 2018; Raemy, Schneider
2019; Appleby et al. 2020a).
IIIF Content State API 0.2: describes the current or desired state of the content
that a client is rendering to a user. The API allows for a standardised approach to
deep-linking into objects and annotations from search results (Warner 2017;
Appleby et al. 2019).
An overview of a IIIF Discovery ecosystem enabling a well-defined harvesting process of IIIF
resources is illustrated by Figure 8 below.
Figure 8: Overview of a IIIF Discovery ecosystem
(Sanderson 2018)
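The way a harvester could consume such a stream of changes can be sketched as follows. The page below is a trimmed, hypothetical example of a page of AS2 activities as used by the Change Discovery API; the sketch assumes that activities are listed from oldest to newest and that all timestamps are UTC values in the same ISO 8601 format, so that they can be compared as strings. `changes_since` is an illustrative helper name.

```python
# Trimmed, hypothetical page of activities describing changes to IIIF Manifests.
page = {
    "type": "OrderedCollectionPage",
    "orderedItems": [
        {"type": "Create", "endTime": "2020-03-01T00:00:00Z",
         "object": {"id": "https://example.org/iiif/1/manifest", "type": "Manifest"}},
        {"type": "Update", "endTime": "2020-06-10T00:00:00Z",
         "object": {"id": "https://example.org/iiif/1/manifest", "type": "Manifest"}},
        {"type": "Delete", "endTime": "2020-06-12T00:00:00Z",
         "object": {"id": "https://example.org/iiif/2/manifest", "type": "Manifest"}},
    ],
}

def changes_since(page, last_harvest):
    """Return (manifests to fetch again, manifests to remove) since the last harvest."""
    to_fetch, to_remove = set(), set()
    for activity in page["orderedItems"]:
        if activity["endTime"] <= last_harvest:
            continue  # already processed during a previous harvest
        target = activity["object"]["id"]
        if activity["type"] == "Delete":
            to_fetch.discard(target)
            to_remove.add(target)
        else:  # Create, Update and similar activities
            to_remove.discard(target)
            to_fetch.add(target)
    return to_fetch, to_remove

to_fetch, to_remove = changes_since(page, "2020-05-01T00:00:00Z")
```

Only the Manifests in `to_fetch` need to be downloaded again, which is the incremental behaviour that plain IIIF Collections do not offer.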
Lastly, IIIF is, technically speaking, not LOD, but it is in a conceptual sense, as it is somewhat
a visual support for LOD (Cossu 2020), and the two frameworks can work alongside each
other. For example, Linked Art can be used to reference IIIF resources or services and IIIF can
point to a Linked Art description via an rdfs:seeAlso property to leverage semantic discovery
(Sanderson 2020b).
3.3.2.4 Linked Data Notifications (LDN)
https://www.w3.org/TR/ldn/
No previous pilot experiment
Linked Data Notifications (LDN) is a JSON-LD based Social Web Protocol for delivery. It
describes how servers (receivers) can have messages pushed to them by applications (senders)
and defines how other applications (consumers) can retrieve those messages (Capadisli, Guy
2017). In other words, LDN is a resource-centric protocol where notifications are structured as
well as being identifiable and reusable by different applications on the Web. Also, LDN treats
notifications as persistent entities (Capadisli 2019).
An overview of the different LDN concepts and the possible HTTP requests are shown in
Figure 9.
Figure 9: LDN overview
(Capadisli, Guy 2017)
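The first step for an LDN sender, discovering the inbox of a target resource, can be sketched as below. In LDN an inbox may be advertised in an HTTP Link header with the relation http://www.w3.org/ns/ldp#inbox; the header values and URLs used here are hypothetical, and the simplified parser `discover_inbox` is an illustrative helper that does not cover every form of Link header.

```python
import re

LDP_INBOX = "http://www.w3.org/ns/ldp#inbox"

def discover_inbox(link_header):
    """Return the inbox URL advertised in an HTTP Link header, or None."""
    for target, params in re.findall(r'<([^>]*)>\s*;\s*([^,]*)', link_header):
        rel = re.search(r'rel="?([^";]+)"?', params)
        if rel and LDP_INBOX in rel.group(1).split():
            return target
    return None

# Hypothetical Link header returned by a HEAD or GET request on a CHO's URL.
header = ('<https://example.org/>; rel="canonical", '
          '<https://example.org/inbox/>; rel="http://www.w3.org/ns/ldp#inbox"')
inbox = discover_inbox(header)

# The sender would then POST a JSON-LD notification (Content-Type: application/ld+json)
# to the inbox; here the payload reuses the AS2 vocabulary.
notification = {
    "@context": "https://www.w3.org/ns/activitystreams",
    "type": "Announce",
    "object": "https://example.org/object/42",
}
```

A consumer, for example an aggregator, would later GET the inbox URL to list the notifications stored by the receiver.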
The interest of LDN for aggregation lies in its great modularity, as it leverages the Linked
Data concepts of shared vocabularies and URIs. For instance, the storage of notifications is
compatible with LDP, an LDN receiver can understand requests coming from AP federated
servers, and LDN can also draw on AS2 syntax and vocabulary. Finally, a combined
implementation with a IIIF ecosystem is also possible and has already been done to connect
distributed scholarly discussion (Witt 2017a; 2017b) or as part of exploratory activities carried
out by the IIIF Discovery Technical Specification Group.
3.3.2.5 Linked Data Platform (LDP)
https://www.w3.org/TR/ldp/
No previous pilot experiment
Linked Data Platform (LDP) is a W3C standard that defines a set of rules for HTTP operations
on web resources, some of them based on RDF, to provide a read-write Linked Data
architecture. LDP considers everything as resources and can interact with RDF as well as
non-RDF sources (Speicher, Arwe, Malhotra 2015).
For RDF sources, the Container type has been defined by LDP, representing a collection of
linked documents or information resources. Three types of containers have been conceived: a
basic one defining a simple link to the information it contains, a direct container adding the
notion of membership, and the indirect container that can link to a totally different resource
than the one added (Correa 2015). Even though LDP is a W3C standard that predates the
Social Web Protocols, an LDP Basic Container can be compared to an LDN inbox (Capadisli
2019).
Figure 10 illustrates the relationships between the different types within the LDP standard.
LDN for aggregation of IIIF Services: https://github.com/nfreire/LDN4IIIF
For more information about LDP Containers, please consult
https://gist.github.com/hectorcorrea/dc20d743583488168703
Figure 10: Class relationship of types of LDP Containers
(Speicher, Arwe, Malhotra 2015)
While LDP’s potential for aggregation lies in its ability to easily access and manipulate
metadata, the issues of pointing to specific harvestable resources and of incremental
harvesting cannot be resolved by this mechanism alone (Freire et al. 2017).
3.3.2.6 Open Publication Distribution System Catalog 2.0 (OPDS2)
https://drafts.opds.io/opds-2.0
No previous pilot experiment
The Open Publication Distribution System Catalog 2.0 (OPDS2) is a syndication format for
electronic publications based on the Readium Web Publication Manifest model and JSON-LD.
The second version of the specification is still at the draft level and differs from V1.2 which was
based on Atom and an XML serialisation (Freire et al. 2017).
The purpose of this protocol is the aggregation, distribution, discovery and acquisition of
electronic publications. Although it is mostly geared towards e-books, other types of
publications can be syndicated (Gardeur 2020). Additionally, the core metadata vocabulary of
OPDS2 is Schema.org, which is an advantage over DC, used in V1.2, since it provides
greater expressiveness and can enable better web indexing.
3.3.2.7 ResourceSync (RS)
http://www.openarchives.org/rs/1.1/resourcesync
http://www.openarchives.org/rs/notification/1.0.1/notification
http://www.openarchives.org/rs/notification/1.0.1/framework_notification
Pilot experiment
conducted in the context
of aggregation based on
extended Sitemaps
leveraging elements from
the ResourceSync
namespace.
ResourceSync (RS) is a specification issued as a joint effort between the Open Archives
Initiative (OAI) and the National Information Standards Organization (NISO). RS, which is
also known as Z39.99-2017, leverages Sitemaps and adds extensions to the protocol to enable
third-party systems to remain synchronised with a data provider’s CHOs and their associated
metadata. It is also possible to use RS in conjunction with WebSub to establish a notification
mechanism (Haslhofer et al. 2013; Klein et al. 2013; Freire et al. 2017).
RS supports the following capabilities, which are all serialised in XML as RS relies on the
Sitemaps protocol (American National Standards Institute, NISO 2017):
Resource List: list and describe the resources that a Source makes available
for synchronisation
Resource List Index: a group of multiple Resource Lists
Resource Dump: a link to a package of the resources’ bitstreams
Resource Dump Index: a group of multiple Resource Dumps
Resource Dump Manifest: a description of the package’s constituent
bitstreams
Change List: a document that contains a description of changes to a Source’s
resources
Change List Index: a group of multiple Change Lists
Change Dump: a document that points to packages containing bitstreams for
the Source’s changed resources
Change Dump Index: a group of multiple Change Dumps
Change Dump Manifest: a description of the constituent bitstreams of the
package
Figure 11 gives an overview of the framework structure.
Figure 11: ResourceSync Framework Structure
(American National Standards Institute, NISO 2017)
Its relevance for aggregation is that not only can RS synchronize metadata but also content.
In addition, RS relies on Sitemaps, a well-known and fairly easily deployable protocol.
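As an illustration of how a third-party system could stay synchronised, the sketch below reads a small Change List and extracts, for each resource, its URL and the type of change. The document is a hypothetical, trimmed example written for this sketch; it assumes the ResourceSync namespace http://www.openarchives.org/rs/terms/ and the rs:md element with its capability, change and datetime attributes. `read_changes` is an illustrative helper name.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# Hypothetical Change List: a Sitemap document extended with ResourceSync metadata.
change_list = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist" from="2020-06-01T00:00:00Z"/>
  <url>
    <loc>https://example.org/object/1</loc>
    <rs:md change="updated" datetime="2020-06-02T10:00:00Z"/>
  </url>
  <url>
    <loc>https://example.org/object/2</loc>
    <rs:md change="deleted" datetime="2020-06-03T09:30:00Z"/>
  </url>
</urlset>"""

def read_changes(xml_text):
    """Return a list of (resource URL, change type) pairs from a Change List."""
    root = ET.fromstring(xml_text)
    changes = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        md = url.find("rs:md", NS)
        changes.append((loc, md.get("change") if md is not None else None))
    return changes

changes = read_changes(change_list)
```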
3.3.2.8 Sitemaps
https://www.sitemaps.org/protocol.html
Pilot experiments conducted with aggregation based on
standard Sitemaps, Sitemaps extended with elements from
the IIIF namespace, as well as Sitemaps extended with
elements from the ResourceSync namespace
Sitemaps is a protocol that enables webmasters to tell search engines which webpages of a
given site are available to be crawled by robots. It consists of an XML file listing URLs and
additional metadata (Schonfeld, Shivakumar 2009).
The following three XML tags are required: <urlset>, which references the Sitemaps
protocol and encapsulates the file; <url>, the parent tag of each URL entry; and <loc>,
which provides the URL of a specific webpage. The optional tags can provide information
about the last update (<lastmod>), the change frequency (<changefreq>) and the priority of
a page relative to other pages of the site, as a value from 0.0 to 1.0 (<priority>).
The relevance of Sitemaps in aggregation is that the protocol is very widespread on the Web
and that it can be an entry point to crawl pages that need to be harvested (Freire et al. 2017).
It can also be combined with other mechanisms or serve as a core technology, as is the case
with ResourceSync, which is built on it (Haslhofer et al. 2013; Freire, Robson, et al.
2018).
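A crawler using a Sitemap as an entry point could select the pages to harvest as in the sketch below. The Sitemap content is hypothetical and `urls_to_harvest` is an illustrative helper name; the element names and the namespace come from the Sitemaps protocol. Pages without a <lastmod> value are harvested anyway, which is a design choice of this sketch rather than a rule of the protocol.

```python
import xml.etree.ElementTree as ET
from datetime import date

SM = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical Sitemap with required and optional tags.
sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/object/1</loc><lastmod>2020-05-30</lastmod></url>
  <url>
    <loc>https://example.org/object/2</loc>
    <lastmod>2020-06-15</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
  <url><loc>https://example.org/object/3</loc></url>
</urlset>"""

def urls_to_harvest(xml_text, since):
    """Return the URLs modified after 'since', plus those without a <lastmod> value."""
    root = ET.fromstring(xml_text)
    selected = []
    for url in root.findall("sm:url", SM):
        loc = url.findtext("sm:loc", namespaces=SM)
        lastmod = url.findtext("sm:lastmod", namespaces=SM)
        # <lastmod> may be a date or a full timestamp; only the date part is compared.
        if lastmod is None or date.fromisoformat(lastmod[:10]) > since:
            selected.append(loc)
    return selected

selected = urls_to_harvest(sitemap, date(2020, 6, 1))
```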
3.3.2.9 Webmention
https://www.w3.org/TR/webmention/
No previous pilot experiment
Webmention is a form-encoding-based protocol for delivery developed by the W3C Social Web
Working Group which relies on HTTP and URL-encoded form content
(x-www-form-urlencoded). “It provides an API for sending and receiving notifications when a
relationship is created (or updated or deleted) between two documents” (a source and a
target) (Parecki 2017).
It should also be noted that Webmention and LDN are indeed both intended for delivery and
have some overlapping functionality, but they differ in how they handle “(...) different content
types of requests” (Guy 2017).
The interest that Webmention could have within an aggregation ecosystem is in its ability for a
data provider to keep track of when their CHO’s URLs are mentioned on a third-party platform,
as well as providing a mechanism for a data provider to notify an aggregator which resources
should be harvested (Freire et al. 2017).
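A Webmention is a small form-encoded POST request carrying a source and a target URL. The sketch below prepares such a request without sending it; the endpoint and both URLs are hypothetical, `prepare_webmention` is an illustrative helper name, and the discovery of the endpoint (advertised by the target resource) is left out.

```python
from urllib.parse import urlencode
from urllib.request import Request

def prepare_webmention(endpoint, source, target):
    """Build (without sending) the POST request that notifies 'target' that 'source' mentions it."""
    body = urlencode({"source": source, "target": target}).encode("ascii")
    return Request(endpoint, data=body, method="POST",
                   headers={"Content-Type": "application/x-www-form-urlencoded"})

# Hypothetical case: an aggregator's record page links to a CHO on the provider's website.
mention = prepare_webmention(
    endpoint="https://chi.example/webmention",
    source="https://aggregator.example/record/42",
    target="https://chi.example/object/42",
)
```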
3.3.2.10 WebSub
https://www.w3.org/TR/websub/
No previous pilot experiment
WebSub, previously known as PubSubHubbub (PuSH), is a W3C standard that is part of the
Social Web Protocols which describes an approach for “(...) subscription of any resource and
delivery of updates about it” (Guy 2017).
With WebSub, it is the platform hosting the data that itself pushes new content to the
aggregators, as opposed, for example, to an RSS feed, which subscribers must poll regularly
for updates. To receive these updates though, a subscription over HTTP through a dedicated
hub is needed. A hub acts as an intermediary entity relaying “fat ping” notifications
(Genestoux, Parecki 2018; Capadisli 2019).
Fat ping is a “(...) ping which contains a copy of the content that has been changed” (fat ping
2015)
This standard can be seen as a mechanism for communication between publishers and their
subscribers where hub and topic URLs can be discovered by looking at HTTP headers of the
resource URL (Genestoux, Parecki 2018). For instance, WebSub is leveraged by RS for its
notification mechanism (Haslhofer et al. 2013).
Figure 12 below outlines a high-level protocol flow for WebSub.
Figure 12: WebSub high-level protocol flow
(Genestoux, Parecki 2018)
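The discovery and subscription steps of this flow can be sketched as follows. The Link header, the hub, topic and callback URLs are hypothetical, and `discover` and `subscription_body` are illustrative helper names; the rel values hub and self and the form parameters hub.mode, hub.topic and hub.callback come from the WebSub specification. The parser is deliberately simplified.

```python
import re
from urllib.parse import urlencode

def discover(link_header):
    """Map each rel value found in an HTTP Link header to its target URL."""
    links = {}
    for target, rel in re.findall(r'<([^>]*)>\s*;\s*rel="?([^",;]+)"?', link_header):
        links[rel] = target
    return links

def subscription_body(topic, callback):
    """Form-encoded body that a subscriber POSTs to the hub to subscribe to a topic."""
    return urlencode({
        "hub.mode": "subscribe",
        "hub.topic": topic,
        "hub.callback": callback,
    })

# Hypothetical Link header returned by the data provider for a resource URL.
header = '<https://hub.example.org/>; rel="hub", <https://chi.example/changes>; rel="self"'
links = discover(header)

# The aggregator subscribes with its own callback URL, to which the hub pushes updates.
body = subscription_body(links["self"], "https://aggregator.example/websub/callback")
```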
3.3.3 Technologies for data modelling and representation
The presentation of the technologies follows the same logic as the previous subsection. It
should also be noted that of the three elements presented here, only Schema.org could really
be considered as a complement to EDM and that the two other methods are pieces that can
assist in data modelling and representation.
3.3.3.1 Data Catalog Vocabulary (DCAT)
https://www.w3.org/ns/dcat#
Pilot experiment already conducted with aggregation based on
Linked Data
Data Catalog Vocabulary (DCAT) is an RDF vocabulary initially created in Ireland by the Digital
Enterprise Research Institute and then transferred under W3C governance. DCAT facilitates
interoperability between different data catalogues published on the Web (Albertoni et al. 2020).
In terms of its harvesting options, DCAT allows specifying a downloadable dataset distribution
as well as referring to a SPARQL endpoint (Freire 2020b).
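To illustrate these two harvesting options, the following minimal Python sketch reads a dataset description and lists where a harvester could fetch the data. It rests on several assumptions: the description is written as a JSON-LD-like dictionary, the example.org URLs and the function name are hypothetical, and the SPARQL endpoint is expressed with the DCAT 2 terms dcat:accessService and dcat:endpointURL, which is not necessarily how the Europeana pilot encoded it.

```python
# Hypothetical DCAT description of a dataset, written as a JSON-LD-like dict.
dataset = {
    "@type": "dcat:Dataset",
    "dcterms:title": "Example museum collection",
    "dcat:distribution": [
        {
            "@type": "dcat:Distribution",
            "dcat:downloadURL": "https://data.example.org/dumps/collection.nt.gz",
            "dcat:mediaType": "application/n-triples",
        },
        {
            "@type": "dcat:Distribution",
            "dcat:accessService": {
                "@type": "dcat:DataService",
                "dcat:endpointURL": "https://data.example.org/sparql",
            },
        },
    ],
}


def harvesting_options(description):
    """List (kind, URL) pairs that a harvester could use for this dataset."""
    options = []
    for distribution in description.get("dcat:distribution", []):
        if "dcat:downloadURL" in distribution:
            options.append(("download", distribution["dcat:downloadURL"]))
        service = distribution.get("dcat:accessService")
        if service and "dcat:endpointURL" in service:
            options.append(("sparql", service["dcat:endpointURL"]))
    return options


for kind, url in harvesting_options(dataset):
    print(kind, url)
```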
3.3.3.2 Schema.org
https://schema.org/
Pilot experiments already conducted with aggregation based on Linked Data and based on
Sitemaps and Schema.org in HTML pages
Schema.org is the name of the cross-domain vocabulary as well as the initiative that was
created by major Internet search engines (Google, Bing, Yahoo and Yandex) in 2011, which
(...) seeks to encourage the publication and consumption of structured data on the Internet
(Freire, Verbruggen, et al. 2019). It allows indexing crawlers to more accurately identify the
meaning of indexed pages, which is sometimes referred to as Semantic Search Engine
Optimization (SEO) (Wallis et al. 2017). Schema.org can be serialized in Microdata, RDFa as
well as JSON-LD (Freire et al. 2017).
Schema.org became a community-based effort in 2015 with the creation of the W3C
Schema.org Community Group and its vocabulary maintenance is done through GitHub
repositories.
Its key role in aggregation lies in its vocabulary which provides relatively precise descriptions
of CHOs, as well as being able to indicate to crawlers where a downloadable dataset
distribution is available (Freire 2020a). Last but not least, Schema.org can also facilitate the
referencing of web pages (Freire, Charles 2017; Freire, Charles, Isaac 2018).
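As an illustration of how a crawler might locate a downloadable dataset distribution, the sketch below extracts Schema.org JSON-LD from an HTML page using only the Python standard library. The page content, the example.org URL and the class name are hypothetical; Dataset, DataDownload, distribution, encodingFormat and contentUrl are Schema.org terms. Pages that use Microdata or RDFa would need a different extraction step.

```python
import json
from html.parser import HTMLParser

# Hypothetical web page of a data provider with Schema.org markup in JSON-LD.
PAGE = """
<html><head>
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Dataset",
  "name": "Example museum collection",
  "distribution": {
    "@type": "DataDownload",
    "encodingFormat": "application/n-triples",
    "contentUrl": "https://data.example.org/dumps/collection.nt"
  }
}
</script>
</head><body>...</body></html>
"""


class JsonLdExtractor(HTMLParser):
    """Collect the JSON-LD blocks embedded in an HTML page."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = None  # list of text chunks while inside a JSON-LD script

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.blocks.append(json.loads("".join(self._buffer)))
            self._buffer = None


extractor = JsonLdExtractor()
extractor.feed(PAGE)
downloads = [
    block["distribution"]["contentUrl"]
    for block in extractor.blocks
    if block.get("@type") == "Dataset" and "distribution" in block
]
print(downloads)
```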
3.3.3.3 Vocabulary of Interlinked Datasets (VoID)
https://www.w3.org/TR/void/
Pilot experiment already conducted with aggregation based on Linked Data
The Vocabulary of Interlinked Datasets (VoID) is an RDF vocabulary for discovering and
leveraging Linked Data sets (Keith et al. 2011). VoID consists mainly of the following two
classes:
• void:Dataset to describe datasets issued by a single publisher in RDF and
accessible either through dereferenceable URIs, via a SPARQL endpoint, or by
other methods such as RDF data dumps or the ability to specify a list of URIs.
• void:Linkset (subclass of void:Dataset) to specify the links between
these datasets.
The relevance of VoID in aggregation is to allow a crawler to point towards the appropriate
target in several ways (Freire 2020b).
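The way a crawler could choose among these targets can be sketched as follows. This is a hypothetical example: the class, the function and the priority order (dumps first, then SPARQL, then crawling from listed URIs) are assumptions made for illustration, while void:dataDump, void:sparqlEndpoint and void:rootResource are properties defined by VoID.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class VoidDataset:
    """Subset of a void:Dataset description that matters for harvesting."""
    data_dumps: List[str] = field(default_factory=list)      # void:dataDump
    sparql_endpoint: Optional[str] = None                     # void:sparqlEndpoint
    root_resources: List[str] = field(default_factory=list)  # void:rootResource


def choose_harvest_route(dataset: VoidDataset) -> Tuple[str, List[str]]:
    """Pick a harvesting route; the priority order is an illustrative assumption."""
    if dataset.data_dumps:
        return "download-dumps", dataset.data_dumps
    if dataset.sparql_endpoint:
        return "query-sparql", [dataset.sparql_endpoint]
    if dataset.root_resources:
        return "crawl-from-uris", dataset.root_resources
    return "not-harvestable", []


# Hypothetical dataset that only exposes a SPARQL endpoint.
example = VoidDataset(sparql_endpoint="https://data.example.org/sparql")
print(choose_harvest_route(example))
```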
3.3.4 Overview of aggregation mechanisms
Table 25 in the Appendices provides a high-level overview of all the different technologies that
are highlighted here.
This overview contains the following categories: the name of the technology, the associated
URL, version, date, aggregation component (data transfer and synchronisation or data
modelling and representation), a short description, the governance bodies, HTTP requests
(such as GET, HEAD, POST, etc.), the serialisations (XML, JSON-LD, etc.), as well as its
notification style of network communication (push and/or pull).
Items that cannot be categorised are labelled N/A (not applicable).
4. Methodology
This chapter is divided into four sections: it first presents the general methodological
approach applied throughout the master's thesis, then focuses on the data collection stage,
followed by the subsequent data analysis, and closes with the limitations of the different
methods that were applied.
4.1 Overall approach
From a methodological point of view, a mixed approach (qualitative and quantitative) was used
to address the research questions.
The qualitative methods consisted mainly of regular interviews with collaborators of
Europeana, conducting a literature review, taking part in the testing and documentation phase
of the Europeana Common Culture's LOD-aggregator functional application, as well as the
assessment and conduct of aggregation pilots.
As for the quantitative methods, an online survey provided metrics on the use, interest and
awareness of different aggregation mechanisms. In addition, some outputs generated during
the aggregation pilots also yielded quantitative figures.
With the aim of validating the research and analysis carried out, informal, ad hoc interviews
were conducted with relevant stakeholders throughout the work on the dissertation. It is worth
pointing out, though, that no verbatim accounts were produced; the minutes were recorded
in the author's internal logbook. Most of these meetings tended to turn into constructive
discussions, working sessions or even demos. Table 4 outlines the different interactions that
were conducted. It lists the dates, the names of the people involved (cf. Table 17 in the
Appendices for more information on the stakeholders mentioned, especially their roles) as
well as the main meeting objectives.
Table 4: Validation interviews
Date | Stakeholder(s) | Objectives
05.03.2020 | Valentine Charles | Giving a demo of Metis and explaining Europeana’s current operating model
27.03.2020 | Nuno Freire | Providing clarification on LD aggregation and on the functionalities of DAL
17.04.2020 | Cosmina Berta, Enno Meijers, Erwin Verbruggen | ECC: defining the next steps and onboarding of the author
20.04.2020 | Nuno Freire, Enno Meijers, Erwin Verbruggen | ECC LOD Functional Application: documenting the pilot and its results
23.04.2020 | Cosmina Berta, Erwin Verbruggen | ECC: outlining the sustainability aspects of the different ECC functional applications
23.04.2020 | Antoine Isaac, Nuno Freire | Comparing the different aggregation mechanisms and discussing the deployment requirements of each mechanism
20.05.2020 | Antoine Isaac, Nuno Freire, Albin Larsson | Analysing the survey findings and the future aggregation pilots 1
20.05.2020 | Nuno Freire, Enno Meijers, Erwin Verbruggen | ECC LOD Functional Application: assessing the report on documentation and functionalities of the LOD-aggregator pipeline carried out by the author
29.05.2020 | Antoine Isaac, Nuno Freire, Albin Larsson | Analysing the survey findings and the future aggregation pilots 2
12.06.2020 | Antoine Isaac, Nuno Freire, Albin Larsson | Analysing the survey findings and the future aggregation pilots 3
15.07.2020 | Nuno Freire | Taking a decision on the feasibility of aggregation pilots
Trello, a Kanban board software, was used for the management of the overall project:
https://trello.com/b/w1Cb85vd/
4.2 Methods of data collection
The data were collected in a variety of ways throughout the master's thesis, but all stemmed
from a collaborative effort and mutual understanding with the Europeana R&D team.
Indeed, the data were collected thanks to the author's active participation in the team's
weekly catch-ups, through one-to-one meetings with Antoine Isaac, R&D Manager, as well
as by means of the ad hoc meetings listed in Table 4.
4.2.1 Reviewing the state-of-the-art
In order to fully capture the technological and strategic stakes of improving (meta)data within
the field of cultural heritage, an extensive literature review (cf. 3) was conducted.
Based on this literature review, a comparison of various aggregation mechanisms was then
carried out (cf. 3.3.4).
4.2.2 Europeana Common Culture’s LOD Functional Application
Following a discussion with the members of the Europeana R&D team at the beginning of April,
it was agreed that there were significant crossovers between this study and the LOD Functional
Application carried out within the ECC project (such as the willingness to improve metadata
harvesting).
The author was therefore involved in a couple of meetings related to the LOD Functional
Application and a few others concerning the ECC project in general. The main outcomes of
this collaboration were the review of the LOD-aggregator (cf. 5.1.2) and the co-authoring of a
submitted conference paper (cf. 5.1.3).
The relevant resources, which are not necessarily all included in this dissertation’s
Bibliography, were recorded in a public library on Zotero:
https://www.zotero.org/groups/2446985/ch_aggregation_discovery
4.2.3 Survey on alternative aggregation mechanisms
The survey on alternative aggregation methods (cf. 5.2), which was later retitled “survey on
alternative aggregation mechanisms” as most technologies have to be combined with one
another to provide an aggregation mechanism, was, to some extent, the centrepiece of this
master's thesis, as it made it possible to identify the trends and interests of different data
partners and was also useful for devising potential future pilots.
The following parts succinctly outline the timeline and promotion of the survey, its objectives,
structure as well as the hypotheses that were considered.
4.2.3.1 Timeline and promotion
Firstly, a test was carried out with the Europeana R&D team in April to validate the questions,
to correct any grammar and spelling errors and to ensure that the flow of questions worked.
After verification, the survey was conducted online through Google Forms and was available
from 20 April to 8 May 2020.
The call for participation was published on several channels, including EuropeanaTech's
listserv and Twitter account, the author’s Twitter account, and a dedicated Europeana
channel within IIIF's Slack instance. It was also presented through a lightning talk on the first
day of the Europeana Aggregators Forum (EAF), held on 6 and 7 May 2020. On
EuropeanaTech's listserv, two announcements were sent out, the very first one on 20 April and
a reminder on 4 May (see both messages in Appendix 4 on page 68).
4.2.3.2 Objectives
The main objective was to gauge the awareness, interest, and use of technologies other than
OAI-PMH for (meta)data aggregation. The main target audiences of the survey were the data
providers and the aggregators of the Europeana Network, although it was decided to keep it
open to other organisations and individuals working in the CH field.
The secondary objective of the survey was to identify possible pilot experiments that
Europeana could conduct with interested organisations.
4.2.3.3 Structure and questions
The survey was divided into nine sections with a total of fifteen questions (ten mandatory and
five optional). As shown in Figure 31 in Appendix 5, two sections were only shown to
participants depending on the answers given to a preceding question (cf. Appendix 6 on page
70 to see all survey questions).
https://forms.gle/iq2fZ8wCgBMGTrDq6
https://twitter.com/EuropeanaTech/status/1252163772652929024
4.2.3.4 Hypotheses
The following three assumptions were made prior to the launch of the survey:
• ResourceSync or W3C's Social Web Protocols (ActivityStreams, Linked Data
Notifications, Webmention, WebSub) are relatively unknown and rarely used
within the CH domain.
• Header Dictionary Triples (HDT), a binary RDF format that was created to
compress large sets of data and facilitate query scalability (Vander Sande et al.
2018), is still quite recent in the LOD sector and certainly a rarity in the CH field.
• The variety of metadata standards is very large and the number of in-house
"flavours" of these standards used within the Europeana Network is quite high.
The survey was not intended to give a thorough view of the metadata landscape,
though, but rather to give an idea of which metadata mappings would be
necessary.
4.2.4 Assessment of potential aggregation pilots
On the basis of the survey findings, the interest of the participants and the available data and
existing implementations, an assessment was carried out to determine the feasibility of
aggregation pilots (cf. 5.3). The following courses of action were identified:
• Carrying out an initial triage among the survey respondents who expressed an
interest in an aggregation pilot.
• Selecting the appropriate aggregation routes.
• Contacting the relevant organisations to inform them about the feasibility of a pilot
and/or to request additional information if necessary.
• Reaching a decision on whether or not a pilot could be conducted.
• Conducting the aggregation pilots that could be done in the allotted time.
4.3 Methods of data analysis
This section provides further information in terms of the tools and the service design method
used during the data analysis phase.
4.3.1 Tools
Three main types of tools were applied for data analysis: spreadsheet software, command-line
interface (CLI), as well as web-based prototypes.
4.3.1.1 Spreadsheet software
For the analysis of the different aggregation mechanisms and of the results of the online
survey, as well as for the production of a few charts, standard spreadsheet software was
employed: MS Excel for backup and Google's own application to facilitate collaboration on
several files.
4.3.1.2 Text editor
The review of the LOD-aggregator was done by forking the repository from GitHub. The
various functionalities were then tested using a command-line interface (CLI), and a text
editor was used to display the datasets produced.
4.3.1.3 Europeana R&D tools as testbed
The DAL as well as the Europeana Metadata Testing Tool, two web-based prototypes set up
by Nuno Freire, were utilised to test various aggregation mechanisms.
4.3.2 Service design
The “Opportunity Solution Tree” template, a four-step visual aid which maps out connections
to serve a desired outcome (Becker 2020), was chosen to formulate the different scenarios
that can enable better aggregation and discovery of CH content.
The propositions stemming from that visual representation were then linked to the Europeana
Strategy's priorities and further broken down into suggested steps.
4.4 Limitations
The possible methodological limitations of this study are mainly:
• the limited representativeness of the survey sample (sampling bias), which does
not fully reflect the diversity of all CHIs (cf. 5.2.3 for more details);
• the significant involvement of experts in the field of LOD and IIIF throughout the
entire study, as these people are very keen on deploying new protocols at a
relatively early stage.
In addition, the time constraints did not allow some potential pilot aggregations to take place,
because they would have required too much work on the data partners' side to adjust their data
models or protocol implementations (cf. 5.3.3), or from anyone else to help them do so.
5. Results
This chapter on the results of the master's thesis is divided into three sections and first
highlights the author's contribution to the ECC LOD Functional Application, then presents the
online survey findings, and, thirdly, features the aggregation pilots.
5.1 Analysis of ECC LOD Functional Application
The participation within the ECC project consisted of gathering stakeholders’ considerations
on the sustainability of the project’s various components, assessing the functionalities of the
LOD-aggregator (the pipeline created for the LOD Functional Application) and the conformity
of its documentation, as well as submitting a paper to the 2020 Metadata and Semantics
Research (MTSR) conference.
Besides the elements presented in this section, it is also worth mentioning that during the EAF,
which took place online at the beginning of May, there was a lightning talk co-presented by the
author about the activities on alternative aggregation mechanisms carried out by Europeana
R&D and its data partners in recent years, a call for participation in the online survey (cf. 5.2),
as well as an account of the goals of the ECC LOD Functional Application and its related
technical infrastructure (Raemy, Freire 2020).
5.1.1 Sustainability discussions
As the different project outcomes of the ECC project need to be sustained for a period of three
years, a sustainability effort was undertaken through a series of meetings to first determine
whether or not these outcomes could be further developed into production and what
requirements would be necessary. For the LOD Functional Application, the following two
sustainability points were addressed by the author:
• Looking for new datasets and pilots, in particular as a follow-up to the online
survey (cf. 5.2).
• Assessing the usability and possible integration of the system within the Metis
Sandbox.
5.1.2 Assessment of the LOD-aggregator
The LOD-aggregator, a generic open-source toolset based on Docker containers for
harvesting and transforming LOD for ingest into Europeana, was assessed with respect to its
documentation and its various functionalities.
The toolset leverages Docker for flexibility and scalability reasons as it “allows aggregators to
deploy only part of [it], according to their needs” (Freire et al. 2020). Its main components are
the Dataset Description Validator, the LD Harvester, the Mapper Service, the EDM Validator,
as well as the RDF to EDM RDF/XML Validator (cf. Figure 13). It should also be mentioned
that the toolset works as a CLI.
Figure 13: High-level architecture of the LOD-aggregator
(Freire et al. 2020)
Based on an internal Europeana document on criteria for selecting software components as
well as a guide on how to document code hosted on GitHub, the following seven assessment
criteria were taken into account: Value proposition, Licence, Maintenance, Functionality
testing, Documentation, Versioning, Quality/Security.
Although no problems were found regarding the functionalities, and although test runs for all
the datasets that went through the pipeline were easily executed, there were some concerns
about the terminology, a few typos, as well as the rationale of the toolset which was not clearly
stated. Indeed, the value proposition was not sufficiently explicit.
Some aspects, such as maintenance or versioning, were considered minor and were not
properly accounted for at this stage of the evaluation. Finally, the criterion concerning
quality/security was not addressed by the author, having judged that he did not necessarily
have the required expertise. Table 5 provides an exhaustive account of what was assessed.
Colour markings of Table 5: All in order | It doesn't appear to be a concern at this time. | Some improvements are needed. | This criterion was not assessed.
Table 5: Assessment criteria of the LOD-aggregator
Criteria | Feedback
Value proposition | The value proposition is not highlighted well enough; some tags should be added as well as a summary in the README. This aspect should be clearly articulated in the report to demonstrate the added value for Europeana, aggregators and data providers.
Licence | European Union Public Licence (EUPL) v.1.2.
Maintenance | There are no open issues, but there aren’t many contributors. Also, as it was a functional application that is part of the ECC project, the maintenance would need to be assessed at a later stage.
Functionality testing | Installation: the .env file step should be explained, with an example, before using the crawl command. Otherwise, it is quite straightforward. Crawler, Mapper, Validator, Export, Convertor, Zip: they all work well.
Documentation | Functional aspects are well documented. There are still a couple of typos, though, and some labels need to be consistent across the repository. Information is missing from the README (project name, a description/summary, a table of contents, contributing, credits). In the near future, it would be interesting to set up a wiki, containing for example more in-depth tutorials and a FAQ. Important acronyms (ECC, LOD, etc.) should also be spelled out the first time they appear.
Versioning | No version/no release.
Quality/Security | Code vulnerabilities and critical issues should perhaps be evaluated through an audit assessment.
Internal document under preparation by the Europeana Platform Services
To overcome these issues, a couple of pull requests were suggested to the developers, who
incorporated them between 5 May and 20 June 2020.
5.1.3 Metadata and Semantics Research (MTSR) paper
A paper written by five individuals, including the author of this dissertation, titled "Metadata
Aggregation via Linked Data in Europeana: results of the Common Culture project" was
submitted on 1 August 2020 to the 14th International Conference on Metadata and Semantics
Research. The authors will be notified at the beginning of September 2020 whether or not
this paper has been accepted for the conference proceedings.
Some Europeana recommendations for maintenance call for the existence of "an active and
sufficiently large community", but there is no explicit indication of what this means.
https://github.com/netwerk-digitaal-erfgoed/lod-aggregator/commits/master
5.2 Survey
This section is divided into three parts starting with a short subsection on the number and
provenance of survey participants, followed by a substantive subsection on the findings and
closing with any potential biases.
5.2.1 Number and provenance of participants
A total of 52 participants completed the survey. Aggregators (20 occurrences: 38.5%),
libraries (16: 30.8%) and museums (13: 25%) were the three most commonly selected types
of affiliation. Some of the affiliations mentioned in the "Other" category by respondents were
grouped together. These include three organisations or volunteers that were identified as part
of the Wikidata community and classified as "Wikimedia-affiliated". Apart from the latter
affiliation, each identifiable organisation was only counted once in the survey (cf. Figure 14).
Figure 14: Typology of survey participants
The survey did not ask for the country, but extrapolating from organisation names, one can
see that respondents came from 20 different countries and that, except in three cases (Brazil,
Israel, and England), all participants were from the European Union (cf. Table 6). Lithuania
had the highest participation (six respondents), followed by Belgium, Germany and Italy
(four each). In total, there were nine instances where a specific country could not be
ascertained. To address this issue, two categories were created: international (seven
occurrences) for thematic aggregators and N/A (two occurrences) when extrapolation was
not possible.
Table 6: Survey participants' provenance
Country | Occurrences
Lithuania | 6
Belgium, Germany, Italy | 4
France, Sweden, The Netherlands | 3
Czech Republic, Greece, Ireland | 2
Brazil, England, Estonia, Hungary, Israel, Latvia, Poland, Romania, Slovakia, Spain | 1
International | 7
N/A | 2
Data underlying Figure 14
Institution type or institutional domain | Occurrences
Gallery | 4
Library | 16
Archive | 7
Museum | 13
Research institute | 7
Industry | 1
Aggregator | 20
Wikimedia-affiliated | 3
Government entity | 3
Independent | 2
Unless otherwise indicated, the 100% is measured with respect to all participants (N = 52),
even for questions where multiple responses were possible.
Note that the choice was not exclusive here. The overlap is one-quarter with 13 survey
participants who checked off several options, with a very large majority from aggregators.
5.2.2 Findings
The sequence of questions outlined in this section does not strictly follow the online survey
structure, but clusters the questions thematically while maintaining a chronological order.
Some information, such as names of institutions, LOD and IIIF endpoints as well as emails,
is not disclosed in this dissertation.
In addition, a summary of the survey findings presented in this dissertation was included in an
official deliverable of the Europeana Digital Service Infrastructure (DSI-4) project (Freire, Isaac,
Raemy 2020).
5.2.2.1 Metadata for publishing and exchanging purposes
The survey demonstrates the wide variety of metadata and serialisations used or known by
the participants (cf. Figure 15), giving a fairly representative sample of the different sub-
domains of the CH field as well as the requirements that national or thematic aggregators
expect for ingestion.
The metadata standards that participants are most familiar with (without necessarily using
them) are Schema.org (27 occurrences: 51.9%), CIDOC-CRM (26: 50%) and EDM (22:
42.3%).
As for the deployment side, Dublin Core (33: 63.5%), EDM (28: 53.8%), MARC (25: 48.1%),
LIDO (14: 26.9%) and MODS (12: 23.1%) are, in order, the standards most used by survey
participants. Schema.org, which was identified as having a sufficient level of expressiveness
for CHOs and could become a complement to EDM for data modelling and representation,
is used by 11 survey participants (21.2%).
Although the survey shows that Schema.org is still in its early phase of adoption within the CH
domain, interest in it is growing. When Europeana started to investigate back in 2016, they had
indeed only been able to find cases of Schema.org usage outside of Europe (Wallis et al. 2017).
On the other hand, BIBFRAME and RiC were never indicated as standards being used for
publication and exchange purposes and Linked.art was only mentioned by one participant. The
latter three are also the least known standards, which is not surprising, considering that
they are fairly new and that each of these standards is rather aimed at a particular subdomain.
The interest in using any of the specific standards is rather limited. Among those selected
most often, RiC was chosen eight times (15.4%), Schema.org six times (11.5%), and
BIBFRAME, RDA as well as EDM five times each (9.6%).
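Since all percentages in this section are expressed relative to the 52 respondents, even where several answers were possible, they can be recomputed from the counts shown in Figure 15. The short Python sketch below does so for the "I use it." answers; the variable and function names are illustrative only.

```python
N_PARTICIPANTS = 52  # respondents who completed the survey

# "I use it." counts per metadata standard, as shown in Figure 15.
use_counts = {"DC": 33, "EDM": 28, "MARC": 25, "LIDO": 14, "MODS": 12, "Schema.org": 11}


def share(count, total=N_PARTICIPANTS):
    """Percentage of all respondents, rounded to one decimal place."""
    return round(100 * count / total, 1)


# List the standards from most to least used with their share of respondents.
for standard, count in sorted(use_counts.items(), key=lambda item: -item[1]):
    print(f"{standard}: {count} ({share(count)}%)")
```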
The anonymised version of the survey responses is accessible here:
https://doi.org/10.5281/zenodo.3966693 (Raemy 2020a)
Figure 15: Awareness, use, and interest in metadata standards for publishing and
exchanging purposes
Participants also had the opportunity to cite other metadata standards they use. METS was
mentioned four times, and ESE as well as in-house variations (of the metadata standards
mentioned beforehand) were each mentioned (or hinted at) three times. A total of 26 different
instances of standards were cited by the participants, and 16 of these 26 instances were
mentioned once (cf. Table 7).
Table 7: Additional metadata standards
Metadata standards | Occurrences
METS | 4
In-house variation, ESE | 3
ABCD, IIIF, EAC-CPF, EAG, MADS, UNIMARC, CARARE Metadata Schema | 2
OAI-PMH, Omeka XML, SOCH/K-samsök, ONIX, SKOS, ArCo Ontology, DCAT, PICA, Z39.50, Datacite, ResourceSync, EN19507 (Cinematographic Works Standard), Spectrum, PLMET, DNZ, JATS | 1
5.2.2.2 Metadata serialisations
As shown in Figure 16, the vast majority of participants are aware of or use one or more
metadata serialisations (CSV, JSON, MARCXML or MarcXchange, RDFa, RDF serialisations,
XML).
The serialisation that has the highest number of occurrences in terms of awareness (without
necessarily being used) is RDFa (23 occurrences: 44.2%). MARC serialisations (MARCXML
and MarcXchange) rank second with 16 instances (30.8%), but they are also the least known
among the participants (17: 32.7%). This can be explained by the fact that this serialisation
type is the only one to be almost exclusively used by libraries.
XML (39: 75%), CSV (36: 69.2%) and JSON (29: 55.8%) are the most commonly used
serialisations. Half of the survey respondents use one of the RDF serialisations (RDF/XML,
JSON-LD, Turtle, etc.), and while RDFa is the least used serialisation (8: 15.4%), it is also the
one that respondents are most interested in (12: 23.1%).
Participants could select more than one answer per question but were required to choose at
least one answer, so the total figure per item amounts to a minimum of 52.
Data underlying Figure 15 (Metadata for publishing and exchanging purposes)
Answer | BIBFRAME | CIDOC-CRM | DC | EAD | EDM | LIDO | Linked.art | MARC | MODS | RDA | RiC | Schema.org
I use it. | 0 | 5 | 33 | 10 | 28 | 14 | 1 | 25 | 12 | 8 | 0 | 11
I am interested in using it. | 5 | 4 | 2 | 3 | 5 | 3 | 3 | 1 | 2 | 5 | 8 | 6
I am familiar with it. | 19 | 26 | 18 | 20 | 22 | 18 | 7 | 18 | 21 | 16 | 10 | 27
I don't know this scheme/model. | 30 | 20 | 4 | 20 | 4 | 23 | 42 | 9 | 17 | 25 | 38 | 13
Figure 16: Awareness, use, and interest in metadata serialisations
5.2.2.3 OAI-PMH
OAI-PMH, the current technical solution for metadata aggregation into Europeana, is a
standard used by several aggregation efforts. 37 survey participants (71.2%) use OAI-PMH
for aggregation purposes, 13 do not (25%) and 2 do not know whether they use this
protocol (3.8%) (cf. Figure 17).
Figure 17: Use of OAI-PMH
Of these 37 participants who use OAI-PMH, more than a third do so only in the context of
aggregation towards the Europeana platform (13 out of 37: 35.1%; cf. Figure 18).
Among these 13 participants who use OAI-PMH only in this context, the great majority were
unaware of, and had neither tested nor implemented, the alternative aggregation methods
that were outlined in the survey. However, it is worth noting that two IIIF-related aggregation
mechanisms (based on IIIF Collections or on Sitemaps) as well as aggregation via Sitemaps
and Schema.org were all mentioned twice (cf. Figure 19 to consult the answers of all
participants).
Data underlying Figure 16 (Serialisations)
Answer | CSV | JSON | MARCXML or MarcXchange | RDFa | RDF serialisations | XML
I use it. | 36 | 29 | 17 | 8 | 26 | 39
I am interested in using it. | 1 | 5 | 3 | 12 | 8 | 0
I am familiar with it. | 13 | 14 | 16 | 23 | 12 | 13
I don't know this serialisation. | 7 | 8 | 17 | 12 | 9 | 5
Note: participants could select more than one answer per item.
Data underlying Figure 17 (Do you use the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)?)
Yes: 71.2% | No: 25.0% | I don't know: 3.8%
Figure 18: Use of OAI-PMH in the Europeana context
5.2.2.4 Alternative aggregation mechanisms
With respect to alternative aggregation mechanisms, out of the 10 proposed methods, only a
very small fraction has ever been tested or implemented by survey participants and most of
these methods were also unknown to them. It is worth noting that aggregation via IIIF
collections and the mechanism combining Sitemaps and Schema.org are the ones that have
been the most implemented (6 occurrences - 11.5% - each).
The participants are particularly interested in all IIIF-related mechanisms (between 16 and 20
participants responded that they were interested in using one of the three methods - between
30.8% and 38.5%), in the aggregation of LOD datasets (16: 30.8%) as well as in LDN (11:
21.2%).
Figure 19: Awareness, use, and interest in alternative aggregation mechanisms
Ibid.
64.9%
35.1%
IS YOUR OAI-PMH SERVER USED FOR
ANYTHING OTHER THAN AGGREGATION
TOWARDS THE EUROPEANA PLATFORM?
Yes No (only for Europeana)
Chart data: Alternative aggregation mechanisms
Response | Sitemaps and Schema.org in HTML pages | LOD datasets | IIIF Collections | IIIF/Sitemaps | IIIF Change Discovery | LDN | LDP | OPDS | RS/WebSub | Webmention
I have implemented it. | 6 | 5 | 6 | 3 | 0 | 0 | 0 | 1 | 2 | 1
I have already tested it. | 2 | 4 | 2 | 0 | 3 | 1 | 3 | 0 | 1 | 0
I am interested in it. | 8 | 16 | 20 | 16 | 18 | 11 | 5 | 8 | 8 | 5
I am familiar with it. | 24 | 18 | 13 | 12 | 7 | 5 | 9 | 5 | 5 | 4
I don't know this technology/method. | 17 | 13 | 20 | 26 | 29 | 38 | 37 | 40 | 38 | 43
5.2.2.5 LOD
While each method for publishing LOD is familiar to roughly a quarter of the survey
participants (between 7 and 17 occurrences: 13.5% to 32.7%), most of these methods remain
largely unexplored. For instance, HDT and LDF are largely unknown to the participants, with
only one implementation each (both by the same participant).
SPARQL is the method that participants have implemented most often to publish LOD (14
occurrences: 26.9%), followed by HTTP Content Negotiation and RDF file dumps (12 each: 23.1%)
and the publication of LOD inside HTML pages (11: 21.2%). It should be noted that
seven participants responded that they had implemented SPARQL, HTTP Content Negotiation
as well as RDF file dumps, indicating that the degree of co-occurrence is almost two-thirds.
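To illustrate what HTTP Content Negotiation involves on the client side, the following minimal
Python sketch prepares a request that asks a server for an RDF representation of a resource
through the Accept header. The resource URI is hypothetical, the choice and order of media types
is an assumption, and the request is only built, not sent.

```python
from urllib.request import Request

# Common RDF media types; the q-values express the client's order of preference.
RDF_ACCEPT = "text/turtle, application/ld+json;q=0.9, application/rdf+xml;q=0.8"


def build_rdf_request(uri: str, accept: str = RDF_ACCEPT) -> Request:
    """Prepare (but do not send) a GET request asking for an RDF representation."""
    return Request(uri, headers={"Accept": accept}, method="GET")


# Hypothetical resource URI, used for illustration only.
request = build_rdf_request("https://example.org/resource/123")

print(request.get_method(), request.full_url)
print("Accept:", request.get_header("Accept"))
```

Passing such a request to `urllib.request.urlopen` would send it; a server that supports content
negotiation is then expected to indicate the chosen serialisation in the Content-Type header of
its response.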
Figure 20: Awareness, use, and interest in publishing LOD
Six other ways of publishing LOD were raised by participants. A dedicated API was mentioned
twice, and all the others were mentioned once (cf. Table 8).
Table 8: Additional ways to publish LOD
Means to publish LOD | Occurrences
API | 2
CSV, RDF generated on-the-fly through OAI-PMH, Vocabularies in SKOS RDF, RAW JSON dumps | 1 (each)
The next question asked for LOD examples and endpoints: 19 participants responded, providing
38 links pointing to LOD data.
Ibid.
Chart data: Publishing LOD
Response | Within HTML pages | SPARQL | HTTP Content Negotiation | RDF file dumps | HDT | LDF
I have implemented it. | 11 | 14 | 12 | 12 | 1 | 1
I have already tested it. | 4 | 7 | 5 | 8 | 0 | 0
I am interested in it. | 9 | 7 | 9 | 6 | 5 | 5
I am familiar with it. | 17 | 17 | 11 | 16 | 7 | 11
I don't know this method of publishing LOD. | 16 | 12 | 21 | 15 | 41 | 37
5.2.2.6 IIIF
IIIF-based aggregation mechanisms are the ones where Europeana sees the most potential
for innovating metadata aggregation in the shorter term. The survey asked participants about
their awareness and experience in implementing the IIIF APIs, including the IIIF Presentation
API which is key for accessing metadata via IIIF.
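As a rough illustration of how descriptive metadata can be read from a IIIF Manifest, the Python
sketch below flattens the `label` and `metadata` properties of a manifest into a simple record.
It assumes the JSON layout of the Presentation API 3.0 (language maps); the manifest content, its
URI, and the helper names are hypothetical and heavily simplified.

```python
import json

# Hypothetical, heavily trimmed manifest following the Presentation API 3.0 layout.
MANIFEST_JSON = """
{
  "@context": "http://iiif.io/api/presentation/3/context.json",
  "id": "https://example.org/iiif/book1/manifest",
  "type": "Manifest",
  "label": {"en": ["Example book"]},
  "metadata": [
    {"label": {"en": ["Creator"]}, "value": {"en": ["Unknown"]}},
    {"label": {"en": ["Date"]}, "value": {"none": ["1850"]}}
  ],
  "items": []
}
"""


def first_text(language_map: dict, lang: str = "en") -> str:
    """Pick the first string for lang, falling back to any available language."""
    values = language_map.get(lang) or next(iter(language_map.values()), [])
    return values[0] if values else ""


def descriptive_metadata(manifest: dict) -> dict:
    """Flatten the manifest label and its metadata pairs into a plain dict."""
    record = {"title": first_text(manifest.get("label", {}))}
    for entry in manifest.get("metadata", []):
        record[first_text(entry["label"])] = first_text(entry["value"])
    return record


record = descriptive_metadata(json.loads(MANIFEST_JSON))
print(record)
```

Note that the `metadata` pairs of a manifest are intended for display to users rather than for
machine processing, so an aggregator would probably also need to follow links to structured
descriptions (for instance through `seeAlso`) where a provider exposes them.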
Participants have a good awareness of the four IIIF specifications: between 60% and 75% of
them know each API (that is, all those who did not answer "I don't know this API").
As for the implementation, the IIIF Image API is the most deployed specification among the
participants (11 occurrences: 21.2%), followed by the IIIF Presentation API (10: 19.2%) and
the IIIF Content Search API (once). All institutions that have deployed the IIIF Presentation
API have also deployed the IIIF Image API (both APIs are often referred to as the "IIIF core
APIs"), which is expected since the latter works in conjunction with the former.
The IIIF Authentication API has never been tested or implemented, which is in line with a
survey conducted by IIIF in 2017 (Rabun 2017).
Figure 21: Awareness, use and interest in IIIF APIs
In addition, nine URLs (IIIF Manifests, Canvas and other landing pages related to IIIF) were
given by six participants.
Ibid.
Chart data: IIIF APIs
Response | IIIF Image API | IIIF Presentation API | IIIF Content Search API | IIIF Authentication API
I have implemented it. | 11 | 10 | 1 | 0
I have already tested it. | 6 | 6 | 0 | 0
I am interested in it. | 15 | 13 | 21 | 20
I am familiar with it. | 14 | 13 | 15 | 15
I don't know this API. | 12 | 13 | 16 | 20
5.2.2.7 Possibility of further experiments
Regarding the pilot experiment phase, 23 of the 52 participants (44.2%) agreed to have a
subset of their metadata used to experiment with an alternative aggregation route.
24 (46.2%) were