Wolfgang Lehner

TU Dresden | TUD · Faculty of Computer Science

Prof. Dr.-Ing.

About

562
Publications
117,111
Reads
6,805
Citations
Introduction
Wolfgang Lehner conducts a variety of research projects with his team members, ranging from designing data-warehouse infrastructures from a modeling perspective, supporting data-intensive applications and processes in large distributed information systems, and adding novel database functionality to relational database engines to support data mining/forecast algorithms, to investigating techniques of approximate query processing (e.g., sampling) to speed up execution times over very large data sets and exploiting the power of main-memory-centric database architectures with an emphasis on modern hardware capabilities.
Additional affiliations
May 2024 - present
Aalborg University
Position
  • Professor
Description
  • Teaching at the Data Engineering, Science and Systems Group
October 2000 - September 2002
Martin Luther University Halle-Wittenberg
Position
  • Temporary C3/C4 professorship
August 1999 - September 2000
Friedrich-Alexander-University Erlangen-Nürnberg
Position
  • Senior Research Assistant
Education
October 1998 - July 2001
October 1995 - September 1998


Publications (562)
Preprint
Full-text available
The Single Instruction Multiple Data (SIMD) parallel paradigm is a well-established and heavily-used hardware-driven technique to increase the single-thread performance in different system domains such as database or machine learning. Depending on the hardware vendor and the specific processor generation/version, SIMD capabilities come in different...
Article
Full-text available
Despite achieving state-of-the-art results in nearly all Natural Language Processing applications, fine-tuning Transformer-encoder based language models still requires a significant amount of labeled data to achieve satisfying work. A well known technique to reduce the amount of human effort in acquiring a labeled dataset is Active Learning (AL): a...
Article
Full-text available
In the field of query optimization, a tremendous amount of research has been delivered into fixing and adjusting existing optimizer solutions. Moreover, recent work has also shown that substantial performance gain can be achieved by setting query optimizer instructions appropriately. However, the sets of beneficial instructions may vary greatly for...
Chapter
Searching for the right code snippet is cumbersome and not a trivial task. Online platforms such as Github.com or searchcode.com provide tools to search, but they are limited to publicly available and internet-hosted code. However, during the development of research prototypes or confidential tools, it is preferable to store source code locally. Co...
Chapter
This work focuses on the task of Mathematical Answer Retrieval and studies the factors a recent Transformer-Encoder-based Language Model (LM) uses to assess the relevance of an answer for a given mathematical question. Mainly, we investigate three factors: (1) the general influence of mathematical formulae, (2) the usage of structural information o...
Article
Full-text available
Air pollution through particulate matter (PM) is one of the largest threats to human health. To understand the causes of PM pollution and enact suitable countermeasures, reliable predictions of future PM concentrations are required. In the scientific literature, many methods exist for machine learning (ML)-based PM prediction, though their quality...
Chapter
Despite achieving state-of-the-art results in nearly all Natural Language Processing applications, fine-tuning Transformer-encoder based language models still requires a significant amount of labeled data to achieve satisfying work. A well known technique to reduce the amount of human effort in acquiring a labeled dataset is Active Learning (AL): a...
Article
Full-text available
The traditional and well-established cost-based query optimizer approach enumerates different execution plans for each query, assesses each plan with costs, and selects the plan that promises the lowest costs for execution. However, the optimal execution plan is not always selected. To steer the optimizer in the right direction, many query optimize...
Article
Full-text available
Integer compression plays an important role in columnar database systems to reduce the main memory footprint as well as to speed up query processing. To keep the additional computational effort of (de)compression as low as possible, the powerful Single Instruction Multiple Data (SIMD) extensions of modern CPUs are heavily applied. While a scalar com...
Article
Full-text available
The task of the judge of difficulty in trampoline gymnastics is to check the elements and difficulty values entered on the competition cards and the difficulty of each element according to a numeric system. To do this, the judge must count all somersaults and twists for each jump during a routine and thus record the difficulty of the routine. This...
Article
Full-text available
The Single Instruction Multiple Data (SIMD) paradigm became a core principle for optimizing query processing in columnar database systems. Until now, only the instructions are considered to be efficient enough to achieve the expected speedups, while avoiding is considered almost imperative. However, the instruction offers a very flexible way to pop...
Article
Full-text available
Software Testing is an established activity in the software development process to ensure and improve the quality of a software. Consequently, there exists a wide range of literature, popular information, and even multiple ISO standards covering this topic. However, we found that testing very large database management systems (DBMS) requires specia...
Conference Paper
Full-text available
The results of the presented work show that machine learning (ML) can be used to support correct training logging in order to improve technical performance in trampoline gymnastics. They indicate considerable potential for expanding mobile applications in a sport with complex movement requirements.
Chapter
Active Learning (AL) is a well-known standard method for efficiently obtaining annotated data by first labeling the samples that contain the most information based on a query strategy. In the past, a large variety of such query strategies has been proposed, with each generation of new strategies increasing the runtime and adding more complexity. Ho...
Preprint
Full-text available
Integer compression plays an important role in columnar database systems to reduce the main memory footprint as well as to speed up query processing. To keep the additional computational effort of (de)compression as low as possible, the powerful Single Instruction Multiple Data (SIMD) extensions of modern CPUs are heavily applied. While a scalar com...
Preprint
Full-text available
Despite achieving state-of-the-art results in nearly all Natural Language Processing applications, fine-tuning Transformer-based language models still requires a significant amount of labeled data to work. A well-known technique to reduce the amount of human effort in acquiring a labeled dataset is Active Learning (AL): an iterative proces...
Preprint
Full-text available
Active Learning (AL) is a well-known standard method for efficiently obtaining annotated data by first labeling the samples that contain the most information based on a query strategy. In the past, a large variety of such query strategies has been proposed, with each generation of new strategies increasing the runtime and adding more complexity. Ho...
Article
Thirty doctoral candidates from more than 20 universities and research institutions across Germany took part in the workshop on "Reproducible Science in Data Management", organized by the database group of TU Dresden and held June 8–10, 2022, at the Faculty of Computer Science of TU Dresden. With a balanced mix of theory and practice, the...
Chapter
Mathematical Information Retrieval (MIR) deals with the task of finding relevant documents that contain text and mathematical formulas. Therefore, retrieval systems should not only be able to process natural language, but also mathematical and scientific notation to retrieve documents. In this work, we evaluate two transformer-encoder-based approach...
Article
The optimization of select-project-join (SPJ) queries entails two major challenges: (i) finding a good join order and (ii) selecting the best-fitting physical join operator for each single join within the chosen join order. Previous work mainly focuses on the computation of a good join order, but leaves open to which extent the physical join operat...
Article
Full-text available
Query execution techniques in database systems constantly adapt to novel hardware features to achieve high query performance, in particular for analytical queries. In recent years, vectorization based on the Single Instruction Multiple Data parallel paradigm has been established as a state-of-the-art approach to increase single-query performance. H...
Chapter
The demand for annotated datasets for supervised machine learning (ML) projects is growing rapidly. Annotating a dataset often requires domain experts and is a timely and costly process. A premier method to reduce this overhead drastically is Active Learning (AL). Despite a tremendous potential for annotation cost savings, AL is still not used univ...
Article
Full-text available
Cardinality estimation is a fundamental task in database query processing and optimization. As shown in recent papers, machine learning (ML)-based approaches may deliver more accurate cardinality estimations than traditional approaches. However, a lot of training queries have to be executed during the model training phase to learn a data-dependent...
Conference Paper
Full-text available
Integrated data analysis (IDA) pipelines, which combine data management (DM) and query processing, high-performance computing (HPC), and machine learning (ML) training and scoring, become increasingly common in practice. Interestingly, systems of these areas share many compilation and runtime techniques, and the used, increasingly heterogeneous...
Article
Full-text available
For its third installment, the Data Science Challenge of the 19th symposium “Database Systems for Business, Technology and Web” (BTW) of the Gesellschaft für Informatik (GI) tackled the problem of predictive energy management in large production facilities. For the first time, this year’s challenge was organized as a cooperation between Technische...
Article
Full-text available
Processing and analyzing time series datasets have become a central issue in many domains requiring data management systems to support time series as a native data type. A core access primitive of time series is matching, which requires efficient algorithms on-top of appropriate representations like the symbolic aggregate approximation (SAX) repres...
Preprint
Full-text available
One of the biggest challenges that complicates applied supervised machine learning is the need for huge amounts of labeled data. Active Learning (AL) is a well-known standard method for efficiently obtaining labeled data by first labeling the samples that contain the most information based on a query strategy. Although many methods for query strate...
Article
In this demo, we present PostCENN, an enhanced PostgreSQL database system with an end-to-end integration of machine learning (ML) models for cardinality estimation. In general, cardinality estimation is a topic with a long history in the database community. While traditional models like histograms are extensively used, recent works mainly focus on...
Preprint
Processing and analyzing time series datasets have become a central issue in many domains requiring data management systems to support time series as a native data type. A crucial prerequisite of these systems is time series matching, which still is a challenging problem. A time series is a high-dimensional data type, its representation is storag...
Chapter
Graph-structured data can be found in nearly every aspect of today’s world which contributes to an increasing importance of this data structure for storing and processing data. From a processing perspective, finding comprehensive patterns in graph-structured data is a processing primitive in a variety of applications, such as fraud detection, biolo...
Chapter
Supervised Learning requires a huge amount of labeled data, making efficient labeling one of the most critical components for the success of Machine Learning (ML). One well-known method to gain labeled data efficiently is Active Learning (AL), where the learner interactively asks human experts to label the most informative data point. Nevertheless,...
Article
Ad hoc code generation is a state-of-the-art processing paradigm for database execution engines. It minimizes resource consumption by generating specialized code, tailored and streamlined for the single query at hand. In this work, we apply ad hoc code generation to regular path queries (RPQs), an advanced query type in declarative graph query lang...
Article
Full-text available
In this paper, we present MorphStore, an open-source in-memory columnar analytical query engine with a novel holistic compression-enabled processing model. Basically, compression using lightweight integer compression algorithms already plays an important role in existing in-memory column-store database systems, but mainly for base data. In particula...
Article
Full-text available
Modern big data frameworks (such as Hadoop and Spark) allow multiple users to do large-scale analysis simultaneously, by deploying data-intensive workflows (DIWs). These DIWs of different users share many common tasks (i.e., 50–80%), which can be materialized and reused in future executions. Materializing the output of such common tasks improves the...
Article
OLTP applications are usually executed by a high number of clients in parallel and are typically faced with high throughput demand as well as a constraint latency requirement for individual statements. Interestingly, OLTP workloads are often read-heavy and comprise similar query patterns, which provides a potential to share work of statements belon...
Preprint
Cardinality estimation is a fundamental task in database query processing and optimization. As shown in recent papers, machine learning (ML)-based approaches can deliver more accurate cardinality estimations than traditional approaches. However, a lot of example queries have to be executed during the model training phase to learn a data-dependent M...
Article
Modern organizations typically store their data in a raw format in data lakes. These data are then processed and usually stored under hybrid layouts, because they allow projection and selection operations. Thus, they allow (when required) to read less data from the disk. However, this is not very well exploited by distributed processing frameworks...
Article
Full-text available
The long-awaited nonvolatile random-access memory technology NVRAM is finally publicly available on the market and requires significant changes to the architecture of in-memory database systems. Since such hybrid DRAM–NVRAM database systems may be able to keep the primary data solely persistent in the NVRAM, efficient replication mechanisms need to...
Article
Full-text available
The ability to efficiently analyze changing data is a key requirement of many real-time analytics applications. In prior work, we have proposed general dynamic Yannakakis (GDyn), a general framework for dynamically processing acyclic conjunctive queries with θ-joins in the presence of data updates. Whereas traditional approaches face a tr...
Preprint
In this paper, we present MorphStore, an open-source in-memory columnar analytical query engine with a novel holistic compression-enabled processing model. Basically, compression using lightweight integer compression algorithms already plays an important role in existing in-memory column-store database systems, but mainly for base data. In particul...
Article
The Internet of Things (IoT) sparks a revolution in time series forecasting. Traditional techniques forecast time series individually, which becomes unfeasible when the focus changes to thousands of time series exhibiting anomalies like noise and missing values. This work presents CSAR, a technique forecasting a set of time series with only one mod...
Article
Modern applications employ key-value stores (KVS) in at least some point of their software stack, often as a caching system or a storage manager. Many of these applications also require a high degree of responsiveness and performance predictability. However, most KVS have similar design decisions which focus on improving throughput metrics, at time...
Conference Paper
Full-text available
Vectorization based on the Single Instruction Multiple Data (SIMD) parallel paradigm is a core technique to improve query processing performance especially in state-of-the-art in-memory column-stores. In mainstream CPUs, vectorization is offered by a large number of powerful SIMD extensions growing not only in vector size but also in terms of comp...
Conference Paper
With the ongoing shift to a data-driven world in almost all application domains, the management and in particular the analytics of large amounts of data gain in importance. For that reason, a variety of new big data systems has been developed in recent years. Aside from that, a revision of the data organization and formats has been initiated as a f...
Preprint
There are massive amounts of textual data residing in databases, valuable for many machine learning (ML) tasks. Since ML techniques depend on numerical input representations, word embeddings are increasingly utilized to convert symbolic representations such as text into meaningful numbers. However, a naive one-to-one mapping of each word in a datab...
Article
The ability to efficiently analyze changing data is a key requirement of many real-time analytics applications. Traditional approaches to this problem were developed around the notion of Incremental View Maintenance (IVM), and are based either on the materialization of subresults (to avoid their recomputation) or on the recomputation of subresults...
Conference Paper
This paper presents DECO, a dataset of spreadsheet files, annotated on the basis of layout and contents. It comprises 1,165 files, extracted from the Enron corpus (Hermans, 2015). Using a tool specifically developed for this task, three different annotators (judges) assigned layout roles to non-empty cells and marked the borders of tables. File...
Conference Paper
Spreadsheets are very successful content generation tools, used in almost every enterprise to create a wealth of information. However, this information is often intermingled with various formatting, layout, and textual metadata, making it hard to identify and interpret the tabular payload. Previous work proposed to solve this problem by mainly usin...