
Semantic Enrichment of the Schoenberg Database of Manuscripts Name Authority through Wikidata

Abstract

This case study explored the semantic enrichment of name authority data from the Schoenberg Database of Manuscripts, a database of manuscript provenance data. Informed by previous linked data and semantic enrichment research, this study utilized a test dataset of approximately 12,500 named entities to align and link to corresponding Wikidata items. Working with the Wikidata community on data property creation and using OpenRefine for reconciliation and batch editing, approximately 9,000 SDBM names were linked to Wikidata pages. The resulting linked dataset was tested using a series of data- and research-related SPARQL queries of interest to manuscript scholars and Schoenberg Institute staff. All but one of ten exploratory questions were answered satisfactorily by the results of the SPARQL test queries. Future research will focus on expanding the number of SDBM name authority entities linked to Wikidata as well as using Wikidata as a linked data repository for other manuscript-related metadata projects.
L.P. Coladangelo¹ [0000-0003-1512-0649] and Lynn Ransom² [0000-0002-5231-3602]
¹ College of Communication and Information, Kent State University, Kent OH 44242, USA
² Schoenberg Institute for Manuscript Studies, University of Pennsylvania, Philadelphia PA 19104, USA
lcoladan@kent.edu
Keywords: semantic enrichment, name authorities, linked data, Wikidata,
manuscript studies.
1 Introduction
1.1 Background
The Schoenberg Database of Manuscripts (SDBM) [1], administered through the
Schoenberg Institute for Manuscript Studies in the Kislak Center for Special
Collections, Rare Books and Manuscripts within the University of Pennsylvania
Libraries, maintains Name and Place Authorities for entities related to manuscript
provenance data. The SDBM Name Authority includes a controlled vocabulary and
authoritative data regarding people, groups, and organizations related to
manuscript culture, such as scribes, collectors, monastic orders, auction houses,
and cultural institutions. Metadata from the SDBM featured prominently in previous
research developing the Linked Open Data project Mapping Manuscript Migrations
[2, 3], which harmonized and published data through a semantic portal to aid
manuscript research. This case study was developed as part of work undertaken
through the LIS Education and Data Science Integrated Network Group (LEADING)
Fellowship¹ through Drexel University, in which the Schoenberg Institute, as a
hosting site, proposed a project to explore and leverage SDBM authority data as
linked data.
1.2 Framework
Research in the digital humanities has increasingly advanced the importance of
smart data [4], that is, data which is well-structured to aid applications for big
data analysis. Metadata produced by the library, archive, and museum (LAM)
communities have the advantages of being both authoritative and often
well-structured, such that LAM metadata can be aligned and linked with other data
for use in semantic enrichment projects [5]. Such projects often include steps to
link and augment resources to add to existing metadata [6] using contextual
resources, which have included name authorities [7]. Semantic enrichment with one
such resource, Wikidata, has become increasingly popular in recent years [8], as
Wikidata has been used to enrich various types of research data and cultural
heritage metadata [5, 9, 10, 11, 12].
2 Methods
Initial SPARQL queries were conducted to explore the dataset and extract candidate
name authority records from the SDBM for linking with Wikidata items. Because
many Wikidata items for people, groups, and organizations already contained
well-structured metadata for identifiers from other external authorities, we
decided to leverage the existence of Virtual International Authority File
identifiers (VIAF IDs) in both Wikidata and the SDBM Name Authority. A targeted
SPARQL query identified records for names which were not deprecated/deleted, and
which had a corresponding VIAF ID. This dataset yielded over 12,500 named entities.
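The selection rule described above (keep records that are not deprecated or deleted and that carry a VIAF ID) can be illustrated with a minimal Python sketch. This is not the SPARQL query used in the study, which is not reproduced in the text; it applies the same rule to rows that are assumed to have been exported already. The field names `sdbm_id`, `name`, `viaf_id`, and `deprecated` and all sample values are hypothetical.

```python
from typing import Iterable


def select_candidates(records: Iterable[dict]) -> list[dict]:
    """Keep name records that are still active and carry a VIAF ID."""
    candidates = []
    for record in records:
        # Hypothetical field names: the SDBM schema is not given in the text.
        if record.get("deprecated", False):
            continue  # skip deprecated/deleted names
        viaf_id = str(record.get("viaf_id") or "").strip()
        if not viaf_id:
            continue  # skip names without a VIAF ID
        candidates.append({**record, "viaf_id": viaf_id})
    return candidates


# Made-up sample rows, for illustration only.
sample = [
    {"sdbm_id": 1, "name": "Example Collector", "viaf_id": "12345", "deprecated": False},
    {"sdbm_id": 2, "name": "Deleted Entry", "viaf_id": "67890", "deprecated": True},
    {"sdbm_id": 3, "name": "No Identifier", "viaf_id": None, "deprecated": False},
]

selected_ids = [r["sdbm_id"] for r in select_candidates(sample)]
print(selected_ids)
```

Only the first sample row passes both checks, so it is the only one that would move on to reconciliation.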
At the same time, we worked within the Wikidata community to propose and request
an authority control property for the SDBM Name IDs (as well as SDBM Place IDs),
to link SDBM URIs to Wikidata pages for the same item. Once the candidate dataset
was finalized and the requisite Wikidata property was created, the dataset was
loaded into OpenRefine version 3.4.1. Using the built-in Wikidata reconciliation
service in OpenRefine, the SDBM name dataset was reconciled to Wikidata items using
name and VIAF ID properties. As SDBM entities were successfully reconciled to
Wikidata items, OpenRefine was used to automate batch editing of Wikidata pages to
include values for the SDBM name ID property (see Fig. 1).
¹ Institute of Museum and Library Services (IMLS), LB21 LEADING project: RE-246450-OLS-20
Fig. 1. Diagram showing a conceptual overview of the project stages and workflows.
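The study performed reconciliation inside OpenRefine, but the underlying idea of matching on VIAF IDs can also be sketched as a query against Wikidata, where the VIAF ID is stored under property P214. The Python sketch below only builds the SPARQL text and does not send it; the function name and the sample identifiers are hypothetical, and it is an illustration of the matching idea rather than the workflow the authors ran.

```python
def build_viaf_lookup_query(viaf_ids: list[str]) -> str:
    """Build a SPARQL query that finds Wikidata items by VIAF ID (P214)."""
    for viaf_id in viaf_ids:
        # VIAF IDs are numeric; rejecting anything else keeps the query safe.
        if not viaf_id.isdigit():
            raise ValueError(f"not a numeric VIAF ID: {viaf_id!r}")
    values = " ".join(f'"{viaf_id}"' for viaf_id in viaf_ids)
    return (
        "SELECT ?item ?viaf WHERE {\n"
        f"  VALUES ?viaf {{ {values} }}\n"
        "  ?item wdt:P214 ?viaf .\n"
        "}"
    )


# Made-up identifiers, for illustration only.
query = build_viaf_lookup_query(["12345", "67890"])
print(query)
```

The resulting text could be pasted into the Wikidata Query Service, where the `wdt:` prefix is predefined. A VIAF ID that returns more than one item would signal the kind of discrepancy that needs manual review.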
3 Results
Automated and semi-automated reconciliation through alignment of name and VIAF
ID properties in OpenRefine ultimately yielded approximately 9,000 entities from
the SDBM Name Authority linked to corresponding Wikidata items. This process was
three-fold due to the nature of VIAF IDs recorded in the SDBM records. Where there
was no discrepancy between the VIAF IDs in both the SDBM and Wikidata, the
matching score between entities was high enough to allow for automatic
reconciliation, and thus batch editing of corresponding Wikidata pages could be
accomplished with minimal human oversight through OpenRefine. Where discrepancies
occurred, such as when more than one Wikidata item was assigned the same VIAF ID
or when the wrong VIAF ID had been recorded in an SDBM authority record, errors had
to be manually corrected or metadata had to be verified to make sure that the
alignment of entities was accurate. Finally, where SDBM authority records contained
deprecated VIAF IDs which redirected to current VIAF IDs, OpenRefine was used to
retrieve JSON data from VIAF, which was then parsed to isolate the most up-to-date
VIAF ID for a given entity. Those VIAF IDs were then used to repeat the
reconciliation and editing processes through OpenRefine.
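The three-fold process can be summarized as a routing decision per record. The Python sketch below is an illustration under stated assumptions, not the OpenRefine workflow itself: `wikidata_by_viaf` (VIAF ID to matching Wikidata item IDs) and `redirects` (deprecated VIAF ID to current VIAF ID) are hypothetical lookup tables assumed to have been built beforehand, and all identifiers are made up.

```python
from enum import Enum


class Route(Enum):
    AUTOMATIC = "automatic reconciliation"
    MANUAL_REVIEW = "manual correction or verification"
    RETRY_WITH_CURRENT_ID = "retry with current VIAF ID"


def route_record(
    viaf_id: str,
    wikidata_by_viaf: dict[str, list[str]],
    redirects: dict[str, str],
) -> tuple[Route, str]:
    """Decide how a record is handled and which VIAF ID to use next."""
    if viaf_id in redirects:
        # Deprecated ID: repeat reconciliation with the current one.
        return Route.RETRY_WITH_CURRENT_ID, redirects[viaf_id]
    matches = wikidata_by_viaf.get(viaf_id, [])
    if len(matches) == 1:
        # Exactly one Wikidata item shares this VIAF ID.
        return Route.AUTOMATIC, viaf_id
    # No match or several matches: a person has to check the record.
    return Route.MANUAL_REVIEW, viaf_id


# Made-up lookup tables, for illustration only.
wikidata_by_viaf = {"111": ["Q1001"], "222": ["Q2001", "Q2002"]}
redirects = {"333": "111"}

routes = {v: route_record(v, wikidata_by_viaf, redirects) for v in ["111", "222", "333", "444"]}
for viaf_id, (route, next_id) in routes.items():
    print(viaf_id, route.value, next_id)
```

A record routed to `RETRY_WITH_CURRENT_ID` would be passed through `route_record` again with the returned ID, mirroring the repeated reconciliation step described above.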
After editing of Wikidata pages, SPARQL queries were used to explore and test
data- and research-related questions of interest to manuscript scholars and
Schoenberg Institute and Kislak Center staff. Data-related questions that could be
asked and answered about the new linked dataset included the number of human and
non-human entities and the presence of other external authorities for SDBM-linked
entities in Wikidata. Research-related questions answered by SPARQL queries
addressed familial relationships, student-teacher relationships, collectors by
gender, collectors by occupation, and names in the SDBM affiliated with or members
of specific organizations (e.g., Franciscans; the Athenaeum Club, London). Only one
question proposed by scholars and staff members, regarding the gender affiliation
of monastic institutions, could not be directly answered through a SPARQL query,
due to the nature of the properties available to describe people and organizations
within the Wikidata data model as currently structured.
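The test queries themselves are not reproduced in the text. As a hedged illustration of the kind of question involved, the Python sketch below builds a SPARQL query that groups SDBM-linked humans in Wikidata by gender, using the standard Wikidata properties P31 (instance of), Q5 (human), and P21 (sex or gender). The Wikidata property for the SDBM name ID is not stated in the text, so it is passed in as a parameter, and `P0000` below is a placeholder rather than the real property. Note that the sketch covers all linked humans; restricting it to collectors would need role information from the SDBM.

```python
import re


def build_gender_count_query(sdbm_property: str) -> str:
    """Build a SPARQL query counting SDBM-linked humans by gender."""
    if not re.fullmatch(r"P\d+", sdbm_property):
        raise ValueError(f"not a Wikidata property ID: {sdbm_property!r}")
    return (
        "SELECT ?gender ?genderLabel (COUNT(DISTINCT ?item) AS ?count) WHERE {\n"
        f"  ?item wdt:{sdbm_property} ?sdbmId ;\n"
        "        wdt:P31 wd:Q5 ;\n"
        "        wdt:P21 ?gender .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }\n'
        "}\n"
        "GROUP BY ?gender ?genderLabel\n"
        "ORDER BY DESC(?count)"
    )


# Placeholder only: substitute the actual SDBM name ID property.
SDBM_NAME_ID_PROPERTY = "P0000"

gender_query = build_gender_count_query(SDBM_NAME_ID_PROPERTY)
print(gender_query)
```

Swapping `wdt:P21` for another property, such as occupation, would adapt the same pattern to other research questions listed above.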
4 Conclusions and Future Directions
Based on initial results, Wikidata presented a viable option for semantic
enrichment of name authority data related to premodern manuscripts and the
present-day scholarly community researching manuscript culture. In all but one
case, SPARQL queries constructed to address inquiries of interest to a team of
manuscript and special collections researchers were successful in retrieving
desired results and yielded data which accorded accurately with the subject
experts’ domain knowledge. The results were also promising enough to encourage the
Schoenberg Institute and Kislak Center teams to continue enrichment of their Name
Authority with metadata from Wikidata. More work will have to be done to increase
the number of SDBM name IDs added to Wikidata. This will mitigate a present
limitation of this study (which linked and enriched less than a quarter of the
total names present in the SDBM). It will also offer an opportunity to further
enhance the Wikidata community with authoritative data regarding named entities
which do not currently exist as Wikidata items. Moreover, this work will help
direct further research and practical applications of interest to the manuscript
community by utilizing Wikidata as an Open Linked Data repository for other
manuscript-related metadata.
References
1. Ransom, L., Emery, D., Cawlfield, E., Heller, B., & Budisin, M. The new Schoenberg
database of manuscripts: Creating an opensource tool for manuscript research and
discovery. In Driscoll, M. J. (ed.), Care and Conservation of Manuscripts 16:
Proceedings of the Sixteenth International Seminar Held at the University of
Copenhagen, 13th–15th April 2016. Museum Tusculanum Press, Copenhagen. (2018).
2. Burrows, T., Hyvönen, E., Ransom, L., & Wijsman, H. Mapping manuscript migrations:
Digging into data for the history and provenance of Medieval and Renaissance
manuscripts. Manuscript Studies, 3(1), 249-252. (2019).
3. Koho, M., Burrows, T., Hyvönen, E., Ikkala, E., Page, K., Ransom, L., Tuominen, J.,
Emery, D., Fraas, M., Heller, B., Lewis, D., Morrison, A., Porte, G., Thomson,
Velios, A., & Wijsman, H. Harmonizing and publishing heterogeneous premodern
manuscript metadata as Linked Open Data. Journal of the Association for Information
Science and Technology, asi.24499. https://doi.org/10.1002/asi.24499 (2021).
4. Schöch, C. Big? Smart? Clean? Messy? Data in the humanities. Journal of Digital
Humanities, 2(3).
http://journalofdigitalhumanities.org/2-3/big-smart-clean-messy-data-in-the-humanities/
(2013).
5. Zeng, M. L. Semantic enrichment for enhancing LAM data and supporting digital
humanities. El profesional de la información, 28(1), e280103.
https://doi.org/10.3145/epi.2019.ene.03 (2019).
6. Isaac, A., Manguinhas, H., Stiller, J., & Charles, V. Report on enrichment and
evaluation. Europeana Task Force on Enrichment and Evaluation, The Hague.
http://pro.europeana.eu/files/Europeana_Professional/EuropeanaTech/EuropeanaTech_taskforces/Enrichment_Evaluation/FinalReport_EnrichmentEvaluation_102015.pdf
(2015).
7. Chen, S.-J. Semantic enrichment of linked personal authority data: A case study of
elites in late Imperial China. Knowledge Organization 46(8), 607-614.
https://doi.org/10.5771/0943-7444-2019-8-607 (2019).
8. Smith-Yoshimura, K. The rise of Wikidata as a linked data source. Hanging Together:
The OCLC Research Blog. http://hangingtogether.org/?p=6775 (2018).
9. Candela, G., Escobar, P., Carrasco, R. C., & Marco-Such, M. A linked open data
framework to enhance the discoverability and impact of culture heritage. Journal of
Information Science, 45(6), 756-766. (2019).
10. Höper, J., & Müller-Birn, C. Assisting in semantic enrichment of scholarly
resources by connecting neonion and Wikidata.
https://refubium.fu-berlin.de/handle/fub188/22790 (2018).
11. Hyvönen, E., Leskinen, P., Tamper, M., Rantala, H., Ikkala, E., Tuominen, J., &
Keravuori, K. Linked data—A paradigm change for publishing and using biography
collections on the Semantic Web. In Proceedings of the Third Conference on
Biographical Data in a Digital World, 5-6 September 2019, Varna, Bulgaria.
https://seco.cs.aalto.fi/publications/2019/hyvonen-et-al-bs-2019b.pdf (2019).
12. Röpert, D., Reimeier, F., Holetschek, J., & Güntsch, A. Semantic annotation of
botanical collection data. Biodiversity Information Science and Standards, 3: e36187.
https://doi.org/10.3897/biss.3.36187 (2019).