added a research item
- Christiane Klaes
Person entities are important linking nodes both within and between Linked Open Data resources across different domains and use cases. Efficient identity management is therefore a crucial part of resource development and maintenance. This case study is concerned with the task of semi-automatically populating a newly developed domain knowledge graph, LexBib Wikibase [https://lexbib.elex.is/wiki/Main_Page], with high-quality person data. We aim to transform person name literals taken from publication metadata into Semantic Web entities, to enable improved retrieval and entity enrichment for the domain-specific discovery portal ElexiFinder [http://finder.elex.is/intelligence?type=articles]. In a prototype workflow for this transformation, the open-source tool OpenRefine is used as a one-tool solution to perform deduplication (synonym problem), disambiguation (homonym problem), and reconciliation of person names against reference datasets, using a sample of 3,104 name literals taken from the LexBib domain bibliography. We closely examine OpenRefine's clustering functions and their underlying string matching algorithms to gain a better understanding of their ability to account for error types that frequently occur in person name matching, such as spelling errors, phonetic variations, initials, or double names. Following the same approach, we examine the string matching processes implemented in two widely used reconciliation services, for Wikidata [https://github.com/wetneb/openrefine-wikibase] and VIAF [https://github.com/codeforkjeff/conciliator]. OpenRefine offers various features to support further processing of algorithmic output; we therefore also analyse the usefulness of these features within the scope of the presented use case. The results of this case study may contribute to a better understanding and further development of interlinking features in OpenRefine and adjoining reconciliation services.
By offering empirical data on OpenRefine's underlying string matching algorithms, the study's results supplement existing guides and tutorials on clustering and reconciliation, especially for person name matching projects.
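OpenRefine's default clustering method is key collision with the "fingerprint" keyer: literals whose normalized keys collide are proposed as one cluster. The following is a simplified re-implementation to illustrate the idea, not OpenRefine's actual code; the exact normalization steps of the real keyer differ in detail.

```python
import re
import unicodedata
from collections import defaultdict

def fingerprint(name: str) -> str:
    """Approximate a fingerprint keyer: normalize case and accents,
    strip punctuation, then sort and deduplicate tokens."""
    # Decompose accented characters and drop the combining marks
    # (e.g. "Klaës" -> "Klaes")
    s = unicodedata.normalize("NFKD", name)
    s = "".join(c for c in s if not unicodedata.combining(c))
    s = s.lower().strip()
    s = re.sub(r"[^\w\s]", "", s)      # drop punctuation such as commas
    tokens = sorted(set(s.split()))    # order-insensitive, deduplicated
    return " ".join(tokens)

def cluster(names):
    """Group name literals whose fingerprint keys collide."""
    buckets = defaultdict(list)
    for n in names:
        buckets[fingerprint(n)].append(n)
    return [group for group in buckets.values() if len(group) > 1]

names = ["Klaes, Christiane", "Christiane Klaes", "christiane klaës"]
print(cluster(names))  # all three variants collide into one cluster
```

Note that this key collision approach handles inverted name order and diacritics, but not the initials or spelling-error cases discussed above; those require nearest-neighbour methods such as Levenshtein distance, which OpenRefine offers as a separate clustering mode.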
added a research item
- Christiane Klaes
Semantic Web technologies and applications are steadily gaining traction in the field of knowledge representation. Data about persons are important linking nodes in a variety of Linked Open Data resources across different domains and use cases. The domain ontology "LexDo" is a newly developed knowledge base for the field of Lexicography and Dictionary Research, primarily intended for publication indexing within the digital library "ElexiFinder", which is part of the European research infrastructure ELEXIS. This study is focused on developing a modular workflow for populating LexDo with data on persons extracted from bibliographic metadata. To this end, the open-source tool "OpenRefine" is used to semi-automatically clean and transform name literals into Semantic Web entities, which are then linked to reference datasets (VIAF, Wikidata). OpenRefine's algorithms for clustering and interlinking are evaluated in detail for a sample of persons, focusing on the tool's potential to improve scalability and data quality. The study's results lead to practice-oriented recommendations for the ongoing development and long-term maintenance of the domain ontology "LexDo".
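The interlinking step described above is typically performed through a reconciliation service speaking the W3C Reconciliation API, as used by OpenRefine against Wikidata and VIAF. The sketch below builds such a batch query payload; the endpoint URL is illustrative, and the type restriction to Q5 ("human") is an assumption about how one would scope person matching.

```python
import json

# Illustrative endpoint; OpenRefine's Wikidata reconciliation service
# exposes an API of this shape.
ENDPOINT = "https://wikidata.reconci.link/en/api"

def build_queries(names, type_id="Q5"):
    """Build a reconciliation batch: each name literal becomes one query,
    restricted to an entity type (Q5 = 'human' on Wikidata) and limited
    to the top candidate matches."""
    return {
        f"q{i}": {"query": name, "type": type_id, "limit": 5}
        for i, name in enumerate(names)
    }

payload = build_queries(["Christiane Klaes", "Herbert Ernst Wiegand"])
print(json.dumps(payload, indent=2))
# To send: POST to ENDPOINT with the form field queries=json.dumps(payload);
# the service returns candidate entities with match scores per query key.
```

The service's scored candidates are what OpenRefine surfaces in its reconciliation panel; the semi-automatic part of the workflow is deciding which candidates above or below the score threshold to accept by hand.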
In this paper, we present ongoing work on Elexifinder (https://finder.elex.is), a lexicographic literature discovery portal developed in the framework of the ELEXIS (European Lexicographic Infrastructure) project. Since the first launch of the tool, the database behind Elexifinder has been enriched with publication metadata and full texts stemming from the LexBib project, and from other sources. We describe data curation and migration workflows, including the development of an RDF database, and the interaction between the database and Elexifinder. Several new features that have been added to the Elexifinder interface in version 2 are presented, such as a new Lexicography-focused category system for classifying article subjects called LexVoc, enhanced search options, and links to LexBib Zotero collection. Future tasks include getting lexicographic community more involved in the improvement of Elexifinder, e.g. in translation of LexVoc vocabulary, improving LexVoc classification, and suggesting new publications for inclusion.
This short paper presents preliminary considerations regarding LexBib, a corpus, bibliography, and domain ontology of Lexicography and Dictionary Research, which is currently being developed at the University of Hildesheim. The LexBib project is intended to provide a bibliographic metadata collection made available through an online reference platform. The corresponding full texts are processed with text mining methods for the generation of additional metadata, such as term candidates, topic models, and citations. All LexBib content is represented and also publicly accessible as RDF Linked Open Data. We discuss a data model that includes metadata for publication details and for the text mining results, and that considers relevant standards for an integration into the LOD cloud. 1 Introduction Our goal is an online bibliography of Lexicography and Dictionary Research (i.e. metalexicography) that offers hand-validated publication metadata as needed for citations, that represents, if possible, metadata using unambiguous identifiers, and that, in addition, is complemented with the output of a Natural Language Processing toolchain applied to the full texts. Items are tagged using nodes of a domain ontology developed in the project; terms extracted from the full texts serve as suggestions for a mapping to the domain ontology. Main considerations regarding the project have been presented in previous work. In this publication, we focus on the data model for LexBib items, its integration into the LOD cloud, and on relevant details of our workflow. In Section 2 we describe how publication metadata and full texts are collected and stored using Zotero, as well as data enrichment and transfer to RDF format. Section 3 addresses the text mining toolchain used for the generation of additional metadata, which are linked to the corresponding bibliographical items. As shown in Fig. 1, an OWL-RDF file is the place where this merging is carried out.
In Section 4 we describe the multilingual domain ontology that will be used to describe the full text content with keywords or tags.
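The data model sketched in this abstract links a bibliographical item to publication metadata and to ontology nodes as RDF triples. A minimal sketch of such triples, serialized as N-Triples, could look as follows; all URIs except the Dublin Core properties are hypothetical placeholders, and LexBib's actual vocabulary may differ.

```python
def ntriple(s, p, o, literal=False):
    """Serialize one RDF triple in N-Triples syntax."""
    obj = f'"{o}"' if literal else f"<{o}>"
    return f"<{s}> <{p}> {obj} ."

# Illustrative item URI; the real LexBib namespace differs.
ITEM = "http://example.org/lexbib/item/123"

triples = [
    # Publication metadata (Dublin Core Terms properties)
    ntriple(ITEM, "http://purl.org/dc/terms/title",
            "Preliminary considerations regarding LexBib", literal=True),
    ntriple(ITEM, "http://purl.org/dc/terms/creator",
            "http://example.org/lexbib/person/klaes"),
    # Text-mining output mapped to a node of the domain ontology
    ntriple(ITEM, "http://purl.org/dc/terms/subject",
            "http://example.org/lexbib/topic/metalexicography"),
]
print("\n".join(triples))
```

The point of the model is that creators and subjects are entity URIs rather than string literals, which is what makes the items linkable into the LOD cloud.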
The aim of the study presented here is a description of the intersection of discourse spaces in lexicography and metalexicography on the one hand and the Digital Humanities (DH) on the other. This involves identifying contributions on lexicographic topics that are explicitly or implicitly to be understood as part of the DH and, conversely, topics relevant to lexicography that are discussed within the DH. To delimit the discourse spaces, their intersections, and their disjoint sets, full texts and metadata are analysed, bibliometric networks (author and citation networks) are compared, and topic modelling is performed.
This paper presents preliminary considerations regarding objectives and workflow of LexBib, a project which is currently being developed at the University of Hildesheim. We briefly describe the state of the art in electronic bibliographies in general, and bibliographies of lexicography and dictionary research in particular. The LexBib project is intended to provide a collection of full texts and metadata of publications on metalexicography, as an online resource and research infrastructure; at the same time, LexBib has a strong experimental component: computational linguistic methods for automated keyword indexing, topic clustering and citation extraction will be tested and evaluated. The goal is to enrich the bibliography with the results of the text analytics in the form of additional metadata. 1 Introduction Domain-specific bibliographies are important tools for scientific research. We believe that much of their usefulness depends on the metadata they provide for (collections of) publications, and on advanced search functionalities. What is more, bibliographies for a limited domain may offer hand-validated publication metadata. As for lexicography and dictionary research, several bibliographies with different scopes and formats exist independently from each other; none of them covers the field completely, and most of them do not support advanced search functionalities, so that usability is dramatically reduced. Searches for bibliographical data and for the corresponding full texts are therefore most often performed using general search engines and domain-independent bibliography portals. However, big domain-independent repositories have two major shortcomings: They often contain noisy or incomplete publication metadata which have to be hand-validated by the users when copying them into their personal bibliographies, e.g. for citations.
Closely related to that, the search functions of leading bibliography portals still focus on query-based information retrieval, since a combination of cascaded filter options using keywords and metadata such as persons, places, events, and relations to other items only yields good results if the metadata meet certain requirements on precision and completeness. Our goal is a domain-specific online bibliography of lexicography and dictionary research (i.e. metalexicography) which offers hand-validated publication metadata as they are needed for citations, and which in addition is complemented with the output of an NLP toolchain. Several methods from computational linguistics produce useful results for seeking and retrieving scientific publications. For example, topic clustering has become very popular in the Digital Humanities. We suggest that assigning topics to publications provides valuable metadata for finding related work. Methods for term extraction have a similar objective. They detect text patterns (i.e. terms) that are more significant in a (more specific) domain corpus than in a (more general) reference corpus.
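The term-extraction idea described above (a pattern is a term candidate if it is markedly more frequent in the domain corpus than in a reference corpus) can be sketched with a simple relative-frequency ratio, sometimes called the "weirdness" measure. The threshold and smoothing below are illustrative choices, not those of the project's actual toolchain.

```python
from collections import Counter

def term_candidates(domain_tokens, reference_tokens, min_ratio=2.0):
    """Rank tokens by how much more frequent they are in the domain
    corpus than in a general reference corpus."""
    dom, ref = Counter(domain_tokens), Counter(reference_tokens)
    n_dom, n_ref = sum(dom.values()), sum(ref.values())
    scores = {}
    for tok, f in dom.items():
        # add-one smoothing so tokens unseen in the reference corpus
        # do not cause division by zero
        ratio = (f / n_dom) / ((ref[tok] + 1) / (n_ref + 1))
        if ratio >= min_ratio:
            scores[tok] = ratio
    return sorted(scores, key=scores.get, reverse=True)

domain = "dictionary entry lemma dictionary headword lemma".split()
reference = "the cat sat on the mat near the dictionary".split()
print(term_candidates(domain, reference))
```

In practice the same scheme is applied to multi-word patterns rather than single tokens, and the ranked candidates then serve, as described above, as suggestions for mapping publications to the domain ontology.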
In this paper, we present a generic workflow for retro-digitising and structuring large entry-based documents, using the 33,000 entries of the Internationale Bibliographie der Lexikographie by Herbert Ernst Wiegand, published in four volumes (Wiegand 2006–2014), as an example. The goal is to convert the large bibliography, at present available as a collection of images, to TEI-compliant XML, a structured format that enables enhanced interoperability and search functionalities (Lindemann, Kliche and Heid, 2018). Images of the printed publication are first processed with Optical Character Recognition (OCR) tools that are part of the Transkribus software application (Mühlberger and Terbul, 2018), the output of which is used for creating manually validated Handwritten Text Recognition (HTR) training material. The retro-digitised output is then used to train and create dedicated machine learning models in GROBID-Dictionaries (Khemakhem, Foppiano and Romary, 2017), a tool for the automatic segmentation of entry-based text documents and their representation as TEI-compliant XML. Both Transkribus and GROBID-Dictionaries are tools freely available to the community. Preliminary results suggest that the proposed workflow yields good precision in retro-digitisation and segmentation.
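To make the segmentation target concrete, the sketch below builds one bibliographic entry as a small TEI-like XML fragment. The element names follow common TEI conventions for bibliographic data; the exact schema emitted by GROBID-Dictionaries may differ, so this is an illustration of the structured output format, not its specification.

```python
import xml.etree.ElementTree as ET

def tei_entry(author, title, year):
    """Build one segmented bibliography entry as TEI-like XML:
    an <entry> wrapping a <bibl> with author, title, and date."""
    entry = ET.Element("entry")
    bibl = ET.SubElement(entry, "bibl")
    ET.SubElement(bibl, "author").text = author
    ET.SubElement(bibl, "title").text = title
    ET.SubElement(bibl, "date").text = year
    return entry

entry = tei_entry("Wiegand, Herbert Ernst",
                  "Internationale Bibliographie der Lexikographie", "2006")
print(ET.tostring(entry, encoding="unicode"))
```

The value of this conversion is that fields such as author and date become individually addressable, which is what enables the enhanced search functionalities mentioned above over the formerly image-only bibliography.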