Table 4: Term candidates extracted from the DH and Lexicography subcorpora.

Source publication
Conference Paper
This paper presents preliminary considerations regarding the objectives and workflow of LexBib, a project currently being developed at the University of Hildesheim. We briefly describe the state of the art in electronic bibliographies in general, and in bibliographies of lexicography and dictionary research in particular. The LexBib project is in...

Context in source publication

Context 1
... term extraction tool produced a list of term candidates for each of the two subcorpora, Digital Humanities (DH) and Lexicography (Lexicog), ranked by their termhood in relation to their frequency in the reference corpus, the BNC. Table 4 lists the top 25 terms for DH, Lexicog, and their overlap, i.e. lexicography terms (out of the top 1,000) also found in the DH top 500, sorted by their termhood ratio in comparison to their frequency in the BNC. After manually validating the 500 most salient DH terms, we performed the same term extraction procedure for every year of publication and measured the intersection of the Lexicog terms and the top 500 DH terms. ...
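The ranking described here boils down to comparing each candidate's relative frequency in the domain subcorpus with its relative frequency in a general-language reference corpus. A minimal sketch of such a termhood ratio, with toy counts standing in for the real corpus frequencies (the actual extraction tool and its weighting scheme are not specified in this excerpt):

```python
from collections import Counter

def termhood(candidate, domain_counts, domain_total,
             ref_counts, ref_total, smoothing=1.0):
    """Ratio of relative frequencies: domain subcorpus vs. reference corpus.

    Smoothing keeps candidates absent from the reference from dividing by zero.
    """
    rel_domain = domain_counts[candidate] / domain_total
    rel_ref = (ref_counts[candidate] + smoothing) / (ref_total + smoothing)
    return rel_domain / rel_ref

# Toy counts: a domain subcorpus versus a general-language reference.
domain = Counter({"dictionary": 120, "lemma": 80, "the": 5000})
reference = Counter({"dictionary": 40, "lemma": 2, "the": 600000})

ranked = sorted(domain, key=lambda t: termhood(
    t, domain, sum(domain.values()), reference, sum(reference.values())),
    reverse=True)
print(ranked)  # ['lemma', 'dictionary', 'the']
```

Candidates that are frequent in the subcorpus but rare in the BNC receive high ratios, while general-language words score near 1 and fall to the bottom of the list.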

Citations

... In parallel, the LexBib project was planned at the University of Hildesheim, with the goal of creating a domain ontology and digital bibliography of Lexicography and Dictionary Research. First steps in that project are described in Lindemann, Kliche & Heid (2018); the LexBib metadata and full-text collection was put together and made accessible using the Zotero platform, and workflows for a conversion of publication metadata to RDF Linked Data were explored. Despite slightly different aims (ELEXIS being focused more on providing a tool to efficiently find the relevant publication, and the LexBib Zotero collection having a more bibliographic focus, that is, providing validated metadata for the purpose of citations), both initiatives had a great deal of overlap. ...
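A hedged sketch of what such a Zotero-to-RDF conversion step could look like, assuming items exported from Zotero as JSON; the base URI and the choice of the Dublin Core and BIBO vocabularies are illustrative, not the project's actual data model:

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

BIBO = Namespace("http://purl.org/ontology/bibo/")

def zotero_item_to_rdf(item, graph):
    # Hypothetical base URI; the project's actual identifiers may differ.
    uri = URIRef("http://example.org/lexbib/" + item["key"])
    graph.add((uri, RDF.type, BIBO.Article))
    graph.add((uri, DCTERMS.title, Literal(item["title"])))
    graph.add((uri, DCTERMS.date, Literal(item["date"])))
    for creator in item.get("creators", []):
        name = creator["lastName"] + ", " + creator["firstName"]
        graph.add((uri, DCTERMS.creator, Literal(name)))

g = Graph()
zotero_item_to_rdf({"key": "ABC123", "title": "On Dictionaries", "date": "2018",
                    "creators": [{"lastName": "Doe", "firstName": "Jane"}]}, g)
print(g.serialize(format="turtle"))
```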
Conference Paper
In this paper, we present ongoing work on Elexifinder (https://finder.elex.is), a lexicographic literature discovery portal developed in the framework of the ELEXIS (European Lexicographic Infrastructure) project. Since the first launch of the tool, the database behind Elexifinder has been enriched with publication metadata and full texts stemming from the LexBib project and from other sources. We describe data curation and migration workflows, including the development of an RDF database and the interaction between the database and Elexifinder. Several new features added to the Elexifinder interface in version 2 are presented, such as a new lexicography-focused category system for classifying article subjects, called LexVoc, enhanced search options, and links to the LexBib Zotero collection. Future tasks include getting the lexicographic community more involved in the improvement of Elexifinder, e.g. in the translation of the LexVoc vocabulary, improving the LexVoc classification, and suggesting new publications for inclusion.
... Items are tagged using nodes of a domain ontology developed in the project; terms extracted from the full texts serve as suggestions for a mapping to the domain ontology. Main considerations regarding the project have been presented in [7]. In this publication, we focus on the data model for LexBib items, its integration into the LOD cloud, and on relevant details of our workflow. ...
... We run this method twice for each document; for English, once with the British National Corpus (BNC) as a general-language reference corpus in order to retrieve domain-specific terms, and once with the whole LexBib English corpus as a reference corpus in order to identify document-specific keywords. An example of term candidates extracted by this approach is shown in [7]. Term candidates will be stored in the LexBib database, linked to the corresponding item. ...
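A compact sketch of this two-pass setup, with placeholder counts; the scoring here is a simple relative-frequency ratio and stands in for whatever measure the tool actually uses:

```python
from collections import Counter

def contrastive_rank(doc, ref, top_n=5):
    """Rank tokens of one document by relative frequency against a reference."""
    doc_total, ref_total = sum(doc.values()), sum(ref.values())
    def score(t):
        return (doc[t] / doc_total) / ((ref[t] + 1) / (ref_total + 1))
    return sorted(doc, key=score, reverse=True)[:top_n]

doc = Counter({"headword": 12, "corpus": 9, "the": 300})       # one full text
bnc = Counter({"headword": 1, "corpus": 50, "the": 60000})     # general language
lexbib = Counter({"headword": 900, "corpus": 1200, "the": 40000})  # domain corpus

print(contrastive_rank(doc, bnc))     # pass 1: domain-specific terms
print(contrastive_rank(doc, lexbib))  # pass 2: document-specific keywords
```

Running the same ranking against the two references separates terms typical of lexicography in general from keywords that distinguish one document within the LexBib corpus.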
... Specific thematic indices of Lexicography and Dictionary Research have been proposed (see [7] for references), in isolation from each other. Most proposals are flat lists of keywords, while some define hierarchical relations between them. ...
Article
This short paper presents preliminary considerations regarding LexBib, a corpus, bibliography, and domain ontology of Lexicography and Dictionary Research, which is currently being developed at the University of Hildesheim. The LexBib project is intended to provide a bibliographic metadata collection made available through an online reference platform. The corresponding full texts are processed with text mining methods for the generation of additional metadata, such as term candidates, topic models, and citations. All LexBib content is represented and also publicly accessible as RDF Linked Open Data. We discuss a data model that includes metadata for publication details and for the text mining results, and that considers relevant standards for an integration into the LOD cloud.

1 Introduction

Our goal is an online bibliography of Lexicography and Dictionary Research (i.e. metalexicography) that offers hand-validated publication metadata as needed for citations, that represents metadata, where possible, using unambiguous identifiers, and that, in addition, is complemented with the output of a Natural Language Processing toolchain applied to the full texts. Items are tagged using nodes of a domain ontology developed in the project; terms extracted from the full texts serve as suggestions for a mapping to the domain ontology. Main considerations regarding the project have been presented in [7]. In this publication, we focus on the data model for LexBib items, its integration into the LOD cloud, and on relevant details of our workflow. In Section 2 we describe how publication metadata and full texts are collected and stored using Zotero, as well as data enrichment and transfer to RDF format. Section 3 addresses the text mining toolchain used for the generation of additional metadata, which are linked to the corresponding bibliographical items. As shown in Fig. 1, an OWL-RDF file is the place where this merging is carried out. In Section 4 we describe the multilingual domain ontology that will be used to describe the full text content with keywords or tags.
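The merging step mentioned here, bibliographic metadata and text mining output combined in one RDF graph, could be sketched as follows; the lexbib: namespace and property names are hypothetical placeholders, not the project's published vocabulary:

```python
from rdflib import BNode, Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

LEXBIB = Namespace("http://example.org/lexbib/vocab#")  # hypothetical vocabulary

g = Graph()
item = URIRef("http://example.org/lexbib/item/ABC123")

# Bibliographic metadata (from Zotero) and text mining output in one graph.
g.add((item, DCTERMS.title, Literal("On Dictionaries")))
for term in ["headword", "macrostructure"]:
    g.add((item, LEXBIB.termCandidate, Literal(term)))

topic = BNode()  # one topic model assignment with its weight
g.add((item, LEXBIB.hasTopic, topic))
g.add((topic, RDF.value, Literal("dictionary use")))
g.add((topic, LEXBIB.weight, Literal(0.42)))

print(g.serialize(format="turtle"))
```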
... In this paper, we present a generic workflow for retro-digitizing and structuring large entry-based documents, using the 33,000 entries of the Internationale Bibliographie der Lexikographie by Herbert Ernst Wiegand, published in four volumes (Wiegand 2006-2014), as an example. The goal is to convert the large bibliography, at present available as a collection of images, to TEI-compliant XML, a structured format that enables enhanced interoperability and search functionalities (Lindemann, Kliche and Heid, 2018). Images of the printed publication are first processed with Optical Character Recognition (OCR) tools which are part of the Transkribus software application (Mühlberger and Terbul, 2018), the output of which is used for creating manually validated Handwritten Text Recognition (HTR) training material. ...
Conference Paper
In this paper, we present a generic workflow for retro-digitizing and structuring large entry-based documents, using the 33,000 entries of the Internationale Bibliographie der Lexikographie by Herbert Ernst Wiegand, published in four volumes (Wiegand 2006-2014), as an example. The goal is to convert the large bibliography, at present available as a collection of images, to TEI-compliant XML, a structured format that enables enhanced interoperability and search functionalities (Lindemann, Kliche and Heid, 2018). Images of the printed publication are first processed with Optical Character Recognition (OCR) tools which are part of the Transkribus software application (Mühlberger and Terbul, 2018), the output of which is used for creating manually validated Handwritten Text Recognition (HTR) training material. The retro-digitised output is then used to train dedicated machine learning models in GROBID-Dictionaries (Khemakhem, Foppiano and Romary, 2017), a tool for automatic segmentation of entry-based text documents and their representation as TEI-compliant XML. Both Transkribus and GROBID-Dictionaries are tools freely available to the community. Preliminary results suggest that the proposed workflow yields good precision in retro-digitisation and segmentation.
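As a downstream illustration, TEI output from such a segmentation step can be consumed with standard XML tooling. The sketch below assumes output shaped roughly like one <entry> element per bibliography entry; the sample snippet is invented for illustration and is not actual GROBID-Dictionaries output:

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample; real GROBID-Dictionaries output is richer than this.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <entry>Wiegand, H. E.: Internationale Bibliographie der Lexikographie.</entry>
    <entry>A second bibliography entry.</entry>
  </body></text>
</TEI>"""

root = ET.fromstring(sample)
entries = [" ".join(e.itertext()).strip()
           for e in root.findall(".//tei:entry", TEI_NS)]
print(len(entries), "entries recovered")  # quick sanity check on segmentation
```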