About the lab

The NLP research group at Hochschule Hannover (HsH, Hanover University of Applied Sciences and Arts) is embedded in the Department of Information Management and is part of the Research Cluster Smart Data Analytics. The focus of the research group is the extraction of information from texts, with distributional semantics in its various forms as a common thread in the group's work. On the application side, keyword extraction and keyword similarity have been central topics: keyword extraction from image captions, for example, is done in the NOA Project, and various papers on controlled vocabularies have been published. Further topics include acronym disambiguation and the structural analysis of legal texts.

Featured research (11)

Many terms used in physics have a different meaning or usage pattern in general language, constituting a learning barrier in physics teaching. The systematic identification of such terms is considered useful for science education as well as for terminology extraction. This article compares three methods based on vector semantics, together with a simple frequency-based baseline, for automatically identifying terms from general language that have a domain-specific use in physics. For evaluation, we use ambiguity scores from a survey among physicists and data about the number of term senses from Wiktionary. We show that the so-called Vector Initialization method obtains the best results.
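One way such a comparison can be operationalized is sketched below; it assumes two gensim word2vec models where the physics model was trained starting from the general-language vectors, so the two spaces stay directly comparable. The file names and the ranking heuristic are illustrative, not necessarily the paper's exact procedure.

    from gensim.models import KeyedVectors
    from numpy import dot
    from numpy.linalg import norm

    # Hypothetical vector files: a general-language model and a physics model
    # that continued training from the general-language vectors, so no extra
    # alignment step is needed before comparing them.
    general = KeyedVectors.load("general_de.kv")   # assumed file name
    physics = KeyedVectors.load("physics_de.kv")   # assumed file name

    def shift(word):
        """Cosine distance between the general and the physics vector of a word.
        A large shift suggests a domain-specific usage pattern in physics."""
        g, p = general[word], physics[word]
        return 1.0 - dot(g, p) / (norm(g) * norm(p))

    # Rank the shared vocabulary by how far a word's vector moved during
    # continued training on the physics corpus.
    shared = [w for w in physics.key_to_index if w in general.key_to_index]
    for word in sorted(shared, key=shift, reverse=True)[:20]:
        print(f"{word}\t{shift(word):.3f}")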
Concreteness is a property of words that has recently received attention in computational linguistics. Since concreteness is a property of word senses rather than of words, it makes most sense to determine concreteness in a given context. Recent approaches to predicting the concreteness of a word occurrence in context have relied on collecting many features from all words in the context. In this paper, we show that we can achieve state-of-the-art results using only contextualized word embeddings of the target words. We circumvent the lack of training data for this task by training a regression model on context-independent concreteness judgments, which are widely available for English. The trained model then needs only a small amount of additional training data to predict concreteness in context well. We can even train the initial model on English data, do the final training on another language, and obtain good results for that language as well.
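A minimal sketch of this kind of setup, assuming the transformers and scikit-learn libraries; the model name is a generic choice and the two-word norms table is a toy stand-in for a real concreteness lexicon:

    import torch
    from transformers import AutoTokenizer, AutoModel
    from sklearn.linear_model import Ridge

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    def target_embedding(sentence, target):
        """Mean of the last-layer vectors of the target's subword tokens."""
        enc = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0]
        ids = tokenizer(target, add_special_tokens=False)["input_ids"]
        toks = enc["input_ids"][0].tolist()
        # Find the target's subword span in the encoded sentence (first match).
        for i in range(len(toks) - len(ids) + 1):
            if toks[i:i + len(ids)] == ids:
                return hidden[i:i + len(ids)].mean(dim=0).numpy()
        raise ValueError(f"{target!r} not found in sentence")

    # Train on context-independent norms: each word is embedded in a trivial
    # one-word context, so the same encoder serves training and prediction.
    norms = {"banana": 5.0, "idea": 1.4}  # toy stand-in for a norms lexicon
    X = [target_embedding(w, w) for w in norms]
    reg = Ridge().fit(X, list(norms.values()))

    print(reg.predict([target_embedding("She had a brilliant idea.", "idea")]))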
To learn a subject, acquiring the associated technical language is important. Despite this widely accepted importance, hardly any published studies describe the characteristics of the technical languages that students are supposed to learn. This is probably largely due to the absence of specialized text corpora for studying such languages at the lexical, syntactic, and textual levels. In the present paper we describe a corpus of German physics texts that can be used to study the language of physics. A large and a small variant have been compiled. The small version of the corpus consists of 5.3 million words and is available on request.
Legal documents often have a complex layout with many different headings, headers and footers, side notes, etc. For further processing, it is important to extract these individual components correctly from a legally binding document, for example a signed PDF. A common approach is to classify each (text) region of a page using its geometric and textual features. This approach works well when the training and test data have a similar structure and when the documents of a collection have a rather uniform layout. We show that the use of global page properties can improve the accuracy of text element classification: we first classify each page into one of three layout types and then train a separate classifier for each page type. On a manually annotated collection of 70 legal documents comprising 20,938 text elements, splitting by page type improves accuracy from 0.95 to 0.98 for single-column pages with left marginalia and from 0.95 to 0.96 for double-column pages. We developed our own feature-based method for page layout detection, which we benchmark against a standard implementation of a CNN image classifier. The approach presented here is based on a corpus of freely available German contracts and general terms and conditions. Both the corpus and all manual annotations are made freely available. The method is language agnostic.
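A schematic of the two-stage idea, with entirely hypothetical toy features and scikit-learn classifiers standing in for the paper's actual feature set and models:

    from sklearn.ensemble import RandomForestClassifier

    # Toy stand-ins for the real data: global page features (e.g. column
    # count, left-margin width) and per-element features (e.g. x, y, font size).
    page_features = [[1, 80], [1, 75], [2, 10], [2, 12]]
    page_types    = ["single-marginalia", "single-marginalia",
                     "double-column", "double-column"]

    element_features   = [[50, 100, 8], [300, 100, 10], [40, 90, 9], [320, 95, 11]]
    element_labels     = ["marginalia", "body", "marginalia", "body"]
    element_page_types = ["single-marginalia", "single-marginalia",
                         "double-column", "double-column"]

    # Stage 1: classify each page into one of the layout types.
    page_clf = RandomForestClassifier(random_state=0).fit(page_features, page_types)

    # Stage 2: a separate text-element classifier per page type, trained only
    # on elements from pages of that type.
    element_clfs = {}
    for t in set(page_types):
        rows = [i for i, pt in enumerate(element_page_types) if pt == t]
        element_clfs[t] = RandomForestClassifier(random_state=0).fit(
            [element_features[i] for i in rows],
            [element_labels[i] for i in rows])

    def classify_element(elem_feats, page_feats):
        """Route an element to the classifier of its page's predicted layout type."""
        t = page_clf.predict([page_feats])[0]
        return element_clfs[t].predict([elem_feats])[0]

    print(classify_element([45, 95, 8], [1, 78]))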
In this paper we investigate how concreteness and abstractness are represented in word embedding spaces. Using data for English and German, we show that concreteness and abstractness can be determined independently and turn out to be exactly opposite directions in the embedding space. Various methods can be used to determine the direction of concreteness, all resulting in roughly the same vector. Although concreteness is a central aspect of word meaning and can be detected clearly in embedding spaces, it does not seem as easy to add or subtract concreteness to obtain other words or word senses, as can be done with a semantic property such as gender.
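One simple way to obtain such a direction is sketched below using a mean-difference over seed words; the paper compares several methods, and this variant, the seed lists, and the vector file name are all illustrative assumptions:

    import numpy as np
    from gensim.models import KeyedVectors

    vectors = KeyedVectors.load("embeddings.kv")  # assumed pre-trained vectors

    concrete = ["stone", "table", "banana", "hammer"]   # illustrative seeds
    abstract = ["idea", "freedom", "theory", "honesty"]

    # Mean-difference direction: concrete centroid minus abstract centroid.
    direction = (np.mean([vectors[w] for w in concrete], axis=0)
                 - np.mean([vectors[w] for w in abstract], axis=0))
    direction /= np.linalg.norm(direction)

    def concreteness_score(word):
        """Projection of a (normalized) word vector onto the direction."""
        v = vectors[word]
        return float(np.dot(v / np.linalg.norm(v), direction))

    for w in ["apple", "justice"]:
        print(w, round(concreteness_score(w), 3))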

Lab head

Christian Wartena
Department
  • Faculty of Information and Communication

Members (3)

Frieda Josi
  • Dataport AöR
Jean Charbonnier
  • Hochschule Hannover

Alumni (3)

John Rothman
Birte Rohden
Rosa Tsegaye Aga
  • Hochschule Hannover