Anastasia Ailamaki's research while affiliated with Lem Sa Switzerland and other places
What is this page?
This page lists the scientific contributions of an author, who either does not have a ResearchGate profile, or has not yet added these contributions to their profile.
It was automatically created by ResearchGate to create a record of this author's body of work. We create such pages to advance our goal of creating and maintaining the most comprehensive scientific repository possible. In doing so, we process publicly available (personal) data relating to the author as a member of the scientific community.
If you're a ResearchGate member, you can follow this page to keep up with this author's work.
If you are this author, and you don't want us to display this page anymore, please let us know.
Publications (47)
As modern data pipelines continue to collect, produce, and store a variety of data formats, extracting and combining value from traditional and context-rich sources such as strings, text, video, audio, and logs becomes a manual process, since such formats are unsuitable for an RDBMS. To tap into this dark data, domain experts analyze and extract insight...
To sustain the input rate of high-throughput streams, modern stream processing systems rely on parallel execution. However, skewed data yield imbalanced load assignments and create stragglers that hinder scalability. Deciding on a static partitioning for a given set of "hot" keys is not sufficient as these keys are not known in advance, and even wor...
Every five years, a group of leading database researchers meets to reflect on the community's impact on the computing industry as well as examine current research challenges.
Modern data analytical workloads often need to run queries over a large number of tables. An optimal query plan for such queries is crucial for being able to run these queries within acceptable time bounds. However, with queries involving many tables, finding the optimal join order becomes a bottleneck in query optimization. Due to the exponential...
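The exponential blow-up mentioned above is easy to see in the classic dynamic-programming formulation of join ordering. The sketch below is a Selinger-style DP over table subsets (left-deep plans only) with a made-up cost model — assumed per-table cardinalities, a uniform join selectivity, and a nested-loop-style cost — purely to show why exhaustive search stops scaling past a modest number of tables:

```python
from itertools import combinations

# Toy, assumed statistics — not from any real system.
CARD = {"A": 1000, "B": 100, "C": 10, "D": 5000}
SEL = 0.01  # uniform join selectivity

def best_join_order(tables):
    """DP over table subsets: O(2^n) subsets are enumerated, which is
    why optimizers cap exhaustive search at a small number of tables."""
    best = {}  # frozenset of tables -> (cost, cardinality, plan string)
    for t in tables:
        best[frozenset([t])] = (0.0, CARD[t], t)
    for size in range(2, len(tables) + 1):
        for subset in combinations(tables, size):
            s = frozenset(subset)
            for t in subset:  # left-deep split: join (s - {t}) with t
                left = s - {t}
                lcost, lcard, lplan = best[left]
                cost = lcost + lcard * CARD[t]      # nested-loop-style cost
                card = lcard * CARD[t] * SEL
                if s not in best or cost < best[s][0]:
                    best[s] = (cost, card, f"({lplan} ⋈ {t})")
    return best[frozenset(tables)]

cost, card, plan = best_join_order(["A", "B", "C", "D"])
print(plan)
```

With these numbers the cheapest plan joins the small tables first; with many more tables, the subset enumeration itself becomes the bottleneck the abstract describes.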
Micro-architectural behavior of traditional disk-based online transaction processing (OLTP) systems has been investigated extensively over the past couple of decades. Results show that traditional OLTP systems mostly under-utilize the available micro-architectural resources. In-memory OLTP systems, on the other hand, process all the data in main-me...
Online Transaction Processing (OLTP) deployments are migrating from on-premise to cloud settings in order to exploit the elasticity of cloud infrastructure which allows them to adapt to workload variations. However, cloud adaptation comes at the cost of redesigning the engine, which has led to the introduction of several, new, cloud-based transacti...
Modern Hybrid Transactional/Analytical Processing (HTAP) systems use an integrated data processing engine that performs analytics on fresh data, which are ingested from a transactional engine. HTAP systems typically consider data freshness at design time, and are optimized for a fixed range of freshness requirements, addressed at a performance cost...
Approximately every five years, a group of database researchers meet to do a self-assessment of our community, including reflections on our impact on the industry as well as challenges facing our research community. This report summarizes the discussion and conclusions of the 9th such meeting, held during October 9-10, 2018 in Seattle.
Data cleaning is a time-consuming process which depends on the data analysis that users perform. Existing solutions treat data cleaning as a separate, offline process, that takes place before analysis starts. Applying data cleaning before analysis assumes a priori knowledge of the inconsistencies and the query workload, thereby requiring effort on...
Understanding micro-architectural behavior is important for efficiently using hardware resources. Recent work has shown that in-memory online transaction processing (OLTP) systems severely underutilize their core micro-architecture resources [29]. Online analytical processing (OLAP) workloads, in contrast, exhibit a completely different computing patter...
The constant flux of data and queries alike has been pushing the boundaries of data analysis systems. The increasing size of raw data files has made data loading an expensive operation that delays the data-to-insight time. To alleviate the loading cost, in situ query processing systems operate directly over raw data and offer instant access to data...
While organizing the submission evaluation process for the SIGMOD 2019 research track, we aim at maximizing the value of the reviews while minimizing the probability of misunderstandings due to factual errors, thereby valorizing impactful ideas. The objective is an educational and rewarding experience for both the authors and the reviewers. The actio...
Efficiently querying multiple spatial data sets is a growing challenge for scientists. Astronomers query data sets that contain different types of stars (e.g., dwarfs, giants, stragglers) while neuroscientists query different data sets that model different aspects of the brain in the same space (e.g., neurons, synapses, blood vessels). The results...
Tracing the evolution of the five-minute rule to help identify imminent changes in the design of data management engines.
Understanding micro-architectural behavior is essential for using hardware resources efficiently. Recent work has shown that, despite being aggressively optimized for modern hardware, in-memory online transaction processing (OLTP) systems severely underutilize their core micro-architecture resources [25]. Online analytical processing (OLAP) workloads...
Index joins present a case of pointer-chasing code that causes data cache misses. In principle, we can hide these cache misses by overlapping them with computation: The lookups involved in an index join are parallel tasks whose execution can be interleaved, so that, when a cache miss occurs in one task, the processor executes independent instructio...
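The interleaving idea in this abstract — switch to another lookup at the point where a cache miss would occur — can be sketched with coroutines. Python cannot issue hardware prefetches, so this toy only shows the scheduling structure (in the spirit of AMAC-style interleaving), over a hypothetical two-level tree layout invented for the example:

```python
def index_lookup(tree, key):
    """One index probe as a coroutine. It yields right before each node
    dereference — where a real engine would issue a prefetch and switch
    to another probe instead of stalling on a cache miss."""
    node = tree
    while "children" in node:
        yield id(node)  # switch point: another probe runs "during the miss"
        i = sum(k <= key for k in node["keys"])
        node = node["children"][i]
    yield id(node)
    return node["values"].get(key)

def interleaved_probes(tree, probe_keys, group_size=4):
    """Advance up to `group_size` probes in lockstep, round-robin."""
    results, active = [], []
    pending = [index_lookup(tree, k) for k in probe_keys]
    while pending or active:
        while pending and len(active) < group_size:
            active.append(pending.pop(0))
        still_running = []
        for task in active:
            try:
                next(task)  # run this probe until its next would-be miss
                still_running.append(task)
            except StopIteration as done:
                results.append(done.value)
        active = still_running
    return results

# Hypothetical two-level "B-tree": one inner node, two leaves.
left = {"values": {1: "a", 5: "b"}}
right = {"values": {10: "x", 15: "y"}}
root = {"keys": [10], "children": [left, right]}
print(interleaved_probes(root, [1, 5, 10, 15]))
```

The point of the structure is that no single probe ever waits alone: while one probe's node "fetch" is outstanding, the processor executes independent instructions from the other probes in the group.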
Non-Volatile Memory (NVM) technologies exhibit 4X the read access latency of conventional DRAM. When the working set does not fit in the processor cache, this latency gap between DRAM and NVM leads to more than 2X runtime increase for queries dominated by latency-bound operations such as index joins and tuple reconstruction. We explain how to easil...
Nowadays, massive amounts of point cloud data can be collected thanks to advances in data acquisition and processing technologies such as dense image matching and airborne LiDAR scanning. With the increase in volume and precision, point cloud data offers a useful source of information for natural-resource management, urban planning, self-driving ca...
Modern server hardware is increasingly heterogeneous as hardware accelerators, such as GPUs, are used together with multicore CPUs to meet the computational demands of modern data analytics workloads. Unfortunately, query parallelization techniques used by analytical database engines are designed for homogeneous multicore servers, where query plan...
Caching intermediate query results for future re-use is a common technique for improving the performance of analytics over raw data sources. An important design choice in this regard is whether to lazily cache only the offsets of satisfying tuples, or to eagerly cache the entire tuples. Lazily cached offsets have the benefit of small...
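The lazy-versus-eager trade-off described above can be made concrete with a toy sketch over an in-memory stand-in for a raw CSV file (the data and predicate are invented for illustration, not taken from the paper):

```python
RAW = "id,name,score\n1,ann,9\n2,bob,7\n3,cat,9\n"  # stand-in raw data file

def scan(predicate):
    """Full scan: split/parse every line — the cost both cache designs
    try to avoid paying again when the result is re-used."""
    for offset, line in enumerate(RAW.splitlines()[1:], start=1):
        row = line.split(",")
        if predicate(row):
            yield offset, row

# Eager caching: keep the satisfying tuples themselves.
# Larger cache footprint, but re-use never touches the raw file again.
eager_cache = [row for _, row in scan(lambda r: r[2] == "9")]

# Lazy caching: keep only the offsets of satisfying tuples.
# Small footprint, but re-use must revisit and re-parse the raw data.
lazy_cache = [off for off, _ in scan(lambda r: r[2] == "9")]

def reuse_lazy(offsets):
    lines = RAW.splitlines()[1:]
    return [lines[o - 1].split(",") for o in offsets]

print(eager_cache)
print(lazy_cache)
```

Both caches answer the repeated query correctly; they differ only in where the parsing cost is paid, which is exactly the design choice the abstract weighs.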
Query optimizers depend heavily on statistics representing column distributions to create good query plans. In many cases, though, statistics are outdated or nonexistent, and the process of refreshing statistics is very expensive, especially for ad hoc workloads on ever bigger data. This results in suboptimal plans that severely hurt performance. T...
Main-memory OLTP engines are being increasingly deployed on multicore servers that provide abundant thread-level parallelism. However, recent research has shown that even the state-of-the-art OLTP engines are unable to exploit available parallelism for high contention workloads. While previous studies have shown the lack of scalability of all popul...
In this chapter, we discuss parallelism opportunities in modern CPUs. We cover all the topics shown in Figure 2.1. First, we show that parallelization already exists inside a single-threaded CPU core. We give a brief overview of instruction pipelining, and we explain superscalar and SIMD processors. Then, we move a step further to CPUs with more th...
As Chapter 2 detailed, hardware vendors heavily innovated on implicit parallelism until 2005 through aggressive microarchitectural techniques (e.g., pipelining, superscalar execution, out-of-order execution, SIMD, etc.). Despite the differences across these techniques, all these innovations aim at one thing: minimizing the stall cycles where a core...
Data Infrastructure for Medical Research is an ideal primer for medical and computer science professionals designing and understanding data infrastructure for medical research. It focuses primarily on the data management and computer aspects of the problem. Nevertheless, the monograph is intended to appeal to both audiences: technical/computer scie...
Different types of workloads subject database management systems to different kinds of scalability challenges. For transaction processing systems where many threads access small portions of the data and enter numerous critical sections to ensure ACID properties, the main challenge that limits scalability is the access latency to the shar...
In this chapter, we outline a few of the most prominent future directions of databases on modern and novel hardware. These future directions may require a substantial redesign of database systems in order to fully exploit the potential of novel hardware and answer the challenges posed by emerging workloads.
In the previous chapter, we showed that scaling-up OLTP workloads on modern hardware is sensitive mostly to the latency of memory accesses. In this chapter, we show that scaling-up OLAP workloads involves further challenges that pertain to the efficient utilization and saturation of the limited number of hardware resources, e.g., the number of hard...
Citations
... For instance, PostgreSQL 14 by default performs a full search for the optimal plan only up to 11 joins before falling back on heuristic optimisation techniques. Sophisticated pruning methods and parallelisation have been shown to push this threshold higher [32,33], but the task still remains fundamentally challenging. Moreover, the problem of huge intermediate results is not restricted to the choice of a bad join order. ...
... The implications for applications are natural and real as discussed in [46], including a better use of time and space and the ability to process larger amounts of data with fewer resources. Database systems, which are essential to any kind of large data-analysis task, can benefit the most from this new role of machine learning [47]. ...
... When users query the same data again, the task will directly come from the application cache layer. By saving the result in a cache, a large number of repeated links to databases are avoided [4]. ...
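The application-cache pattern this citation describes — serve a repeated query from a cache layer so it never reaches the database again — is essentially memoization. A minimal sketch, with a stand-in function in place of a real database round trip:

```python
import functools

db_hits = 0

def run_on_database(sql: str):
    """Stand-in for an actual database round trip (hypothetical)."""
    global db_hits
    db_hits += 1
    return f"rows for: {sql}"

@functools.lru_cache(maxsize=1024)
def query(sql: str):
    # A repeated query string is answered from the cache layer,
    # avoiding another round trip to the database.
    return run_on_database(sql)

query("SELECT * FROM t WHERE x = 1")
query("SELECT * FROM t WHERE x = 1")   # served from cache
print(db_hits)  # → 1
```

Real cache layers additionally have to bound memory and invalidate entries when the underlying data changes — concerns this sketch deliberately omits.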
... With increasingly difficult cost and cardinality estimation, fast sampling running on modern hardware [28] or speculation techniques [29] can come in handy to provide mechanisms for practical and adaptive query optimization and execution. ...
... In MIMD (Multiple Instruction Multiple Data), multiple processors execute different instructions on different data elements in a parallel manner. This parallelization concept is heavily exploited in database systems [1,3,15,20,23] and Heimel et al. [12] already proposed a hardware-oblivious parallel library to easily map to different MIMD processing architectures. In contrast to that, multiple processing elements perform the same operation on multiple data elements simultaneously for SIMD parallelization (often called vectorization). ...
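The SIMD/MIMD distinction drawn in this citation can be shown in miniature. Python has no real vector instructions, so the lane-wise loop below only models the concept: SIMD applies one operation across all lanes, while MIMD lets each "processor" run its own operation on its own data:

```python
# SIMD: Single Instruction, Multiple Data — the same operation is
# applied to every lane at once (modeled here as one comprehension).
def simd_add(lanes_a, lanes_b):
    return [a + b for a, b in zip(lanes_a, lanes_b)]

# MIMD: Multiple Instruction, Multiple Data — each "processor" executes
# a different operation on a different data element.
def mimd(ops, data):
    return [op(x) for op, x in zip(ops, data)]

print(simd_add([1, 2, 3, 4], [10, 20, 30, 40]))   # [11, 22, 33, 44]
print(mimd([abs, len, sum], [-5, "abc", [1, 2, 3]]))  # [5, 3, 6]
```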
... 1) General Query Optimization: Besides the general handling of the queries, Sirin et al. [47] show the importance of isolating the OLTP and OLAP workloads on shared hardware systems in order to achieve optimal performance. ...
... State Modules [222] store tuples for base tables and support insert and probe operations that facilitate adaptive query processing with, for example, eddies. RouLette [247] is a recent system that exploits reinforcement learning for runtime adaptation when sharing work across multiple queries. QOOP [179] re-plans a distributed query during its execution based on the resource availability in the cluster. ...
... Workloads on OLTP systems are typified by high volumes of short-lived queries on relatively few data items; in order to process such queries at scale and with minimal delay, OLTP systems employ cache-optimized concurrent data structures as index structures that enable key-based lookups on stored data. However, microarchitectural analyses of in-memory OLTP workloads have shown that they are still plagued by memory stalls: long-latency last-level cache misses due to random data accesses can account for more than half of the execution time in otherwise highly optimized OLTP systems [59,60]. Despite their cache-conscious optimizations, index structures are fundamentally pointer-chasing data structures, in which lookups involve iteratively accessing nodes at memory addresses referenced by next-node pointers stored at each node. ...
... They have also been used to reduce I/O access times by reducing the number of disk seeks necessary for range queries [1] or improving disk address prediction from disk indices [4]. SFCs have shown promise in large point-data management software, for both data compression and access [5], [6]. Given their locality preservation properties, SFCs are effective at improving k nearest-neighbour queries [7], [8], [9], [10]. ...
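The locality-preservation property of space-filling curves mentioned in this citation is easiest to see with a Z-order (Morton) code, the simplest common SFC: interleaving the bits of the x and y coordinates makes points that are close in 2-D tend to be close in the resulting 1-D order. A minimal 2-D sketch (toy bit width, illustration only):

```python
def morton2d(x: int, y: int, bits: int = 16) -> int:
    """Z-order (Morton) code: interleave the bits of x and y."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)      # x bits go to even positions
        code |= ((y >> i) & 1) << (2 * i + 1)  # y bits go to odd positions
    return code

# Sorting a small grid by Morton code groups spatial neighbours into
# contiguous runs — which is what cuts disk seeks for range queries.
cell = sorted((morton2d(x, y), (x, y)) for x in range(4) for y in range(4))
print([p for _, p in cell][:4])  # the first Z-shaped run of the curve
```

Because a small 2-D query region maps to a few contiguous runs of codes, a range query touches far fewer separate stretches of the 1-D index than it would under row-major ordering.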