
Ingo Müller
Google Inc. | Google
Dr. rer. nat.
29 Publications
18,428 Reads
592 Citations
Additional affiliations
January 2016 - present
April 2011 - December 2015
Education
September 2009 - August 2010
September 2009 - August 2010
September 2008 - August 2010
Publications (29)
In the domain of high-energy physics (HEP), general-purpose query languages have found little adoption in analysis. This is surprising regarding SQL-based systems, as HEP data analysis matches SQL’s processing model well: the data is fully structured and makes use of predominantly standard operators. To better understand the situation, we select si...
Data lakes hold a growing amount of cold data that is infrequently accessed, yet requires interactive response times. Serverless functions are seen as a way to address this use case since they offer an appealing alternative to maintaining (and paying for) a fixed infrastructure. Recent research has analyzed the potential of serverless for data proce...
With the advent of cloud computing, where computational resources are expensive and data movement needs to be secured and minimized, database management systems need to reconsider their architecture to accommodate such requirements. In this paper, we present our analysis, design and evaluation of an FPGA-based hardware accelerator for offloading co...
Serverless platforms have attracted attention due to their promise of elasticity, low cost, and fast deployment. Instead of using a fixed virtual machine (VM) infrastructure, which can incur considerable costs to operate and run, serverless platforms support short computations, triggered on demand, with cost proportional to fine-grain function exec...
In the domain of high-energy physics (HEP), query languages in general and SQL in particular have found limited acceptance. This is surprising since HEP data analysis matches the SQL model well: the data is fully structured and queried using mostly standard operators. To gain insights on why this is the case, we perform a comprehensive analysis of...
The enormous quantity of data produced every day together with advances in data analytics has led to a proliferation of data management and analysis systems. Typically, these systems are built around highly specialized monolithic operators optimized for the underlying hardware. While effective in the short term, such an approach makes the operators...
Query languages in general and SQL in particular are arguably one of the most successful programming interfaces. Yet, in the domain of high-energy physics (HEP), they have found limited acceptance. This is surprising since data analysis in HEP matches the SQL model well: it is fully structured data queried using combinations of selections, projecti...
This paper introduces Rumble, a query execution engine for large, heterogeneous, and nested collections of JSON objects built on top of Apache Spark. While data sets of this type are increasingly widespread, most existing tools are built around a tabular data model, creating an impedance mismatch for both the engine and the query interface. In co...
The massive, instantaneous parallelism of serverless functions has created a lot of excitement for interactive batch applications. We argue that functions are in fact the wrong abstraction for this use case. We call instead for another type of infrastructure, "serverless clusters," and discuss what is missing to make them a reality.
Today's data analytics displays an overwhelming diversity along many dimensions: data types, platforms, hardware acceleration, etc. As a result, system design often has to choose between depth and breadth: high efficiency for a narrow set of use cases or generality at a lower performance. In this paper, we pave the way to get the best of both world...
Getting the best performance from the ever-increasing number of hardware platforms has been a recurring challenge for data processing systems. In recent years, the advent of data science with its increasingly numerous and complex types of analytics has made this challenge even more difficult. In practice, system designers are overwhelmed by the num...
The promise of ultimate elasticity and operational simplicity of serverless computing has recently led to an explosion of research in this area. In the context of data analytics, the concept sounds appealing, but due to the limitations of current offerings, there is no consensus yet on whether or not this approach is technically and economically v...
This paper introduces Rumble, an engine that executes JSONiq queries on large, heterogeneous and nested collections of JSON objects, leveraging the parallel capabilities of Spark so as to provide a high degree of data independence. The design is based on two key insights: (i) how to map JSONiq expressions to Spark transformations on RDDs and (ii) h...
Concurrency control is a cornerstone of distributed database engines and storage systems. In pursuit of scalability, a common assumption is that Two-Phase Locking (2PL) and Two-Phase Commit (2PC) are not viable solutions due to their communication overhead. Recent results, however, have hinted that 2PL and 2PC might not have such a bad performance....
Cloud-based data analysis is nowadays common practice because of the lower system management overhead as well as the pay-as-you-go pricing model. The pricing model, however, is not always suitable for query processing as heavy use results in high costs. For example, in query-as-a-service systems, where users are charged per processed byte, collecti...
Industry-grade database systems are expected to produce the same result if the same query is repeatedly run on the same input. However, the numerous sources of non-determinism in modern systems make reproducible results difficult to achieve. This is particularly true if floating-point numbers are involved, where the order of the operations affects...
Traditional database operators such as joins are relevant not only in the context of database engines but also as a building block in many computational and machine learning algorithms. With the advent of big data, there is an increasing demand for efficient join algorithms that can scale with the input data size and the available hardware resource...
In this thesis we study the design and implementation of aggregation operators in the context of relational in-memory database systems. In particular, we identify and address the following challenges: cache-efficiency, CPU-friendliness, parallelism within and across processors, robust handling of skewed data, adaptive processing, processing with co...
For decades researchers have studied the duality of hashing and sorting for the implementation of the relational operators, especially for efficient aggregation. Depending on the underlying hardware and software architecture, the specifically implemented algorithms, and the data sets used in the experiments, different authors came to different conc...
We present scalable parallel algorithms with sublinear communication volume and low latency for several fundamental problems related to finding the most relevant elements in a set: the classical selection problem with unsorted input, its variant with locally sorted input, bulk parallel priority queues, multicriteria selection using threshold algori...
Recent work has shown that perfect hashing and retrieval of data values associated with a key can be done in such a way that there is no need to store the keys and that only a few bits of additional space per element are needed. We present FiRe -- a new, very simple approach to such data structures. FiRe allows very fast construction and better cac...
Domain encoding is a common technique to compress the columns of a column store and to accelerate many types of queries at the same time. It is based on the assumption that most columns contain a relatively small set of distinct values, in particular string columns. In this paper, we argue that domain encoding is not the end of the story. In real w...
Big Data applications often store or obtain their data distributed over many computers connected by a network. Since the network is usually slower than the local memory of the machines, it is crucial to process the data in such a way that not too much communication takes place. Indeed, only communication volume sublinear in the input size may be af...
The performance of the full table scan is critical for the overall performance of column-store database systems such as the SAP HANA database. Compressing the underlying column data format is both an advantage and a challenge, because it reduces the data volume involved in a scan on one hand and introduces the need for decompression during the scan...
Requirements of enterprise applications have become much more demanding because they execute complex reports on transactional data while thousands of users may read or update records of the same data. The goal of the SAP HANA database is the integration of transactional and analytical workloads within the same database management system. To achieve...
Project (1)
Project RAPID is a hardware-software co-design project targeting large-scale data management and analysis. RAPID aims to improve the energy efficiency of database-processing systems by an order of magnitude over today's solutions. To achieve these savings, RAPID leverages a heterogeneous hardware architecture combined with architecture-conscious software.