Yannis Papakonstantinou’s research while affiliated with University of California, San Diego and other places


Publications (183)


Foreign Keys Open the Door for Faster Incremental View Maintenance
  • Article

May 2023

·

17 Reads

·

4 Citations

Proceedings of the ACM on Management of Data

Christoforos Svingos

·

Andre Hernich

·

·

[...]

·

Yannis Ioannidis

Serverless cloud-based warehousing systems enable users to create materialized views in order to speed up predictable and repeated query workloads. Incremental view maintenance (IVM) minimizes the time needed to bring a materialized view up-to-date. It allows a materialized view to be refreshed based solely on the base table changes since the last refresh. In serverless cloud-based warehouses, IVM uses computations defined as SQL scripts that update the materialized view based on updates to its base tables. However, the scripts set up for materialized views with inner joins are not optimal in the presence of foreign key constraints. For instance, for a join of two tables, state-of-the-art IVM computations use a UNION ALL of two joins: one computing the contributions to the join from updates to the first table, and the other computing the remaining contributions from the second table. Knowing that one of the join keys is a foreign key would allow us to prune all but one of the UNION ALL branches and obtain a more efficient IVM script. In this work, we explore ways of incorporating knowledge about foreign keys into IVM in order to speed up its performance. Experiments in Redshift showed that the proposed technique improved the execution time of the whole refresh process by up to 2x, and the time to calculate the changes to be applied to the materialized view by up to 2.7x.
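As a rough illustration of the delta computation being optimized, the following is a minimal Python sketch, not the paper's SQL scripts: the table names, keys, and the "only the referencing table changed" scenario are hypothetical. It shows the generic two-branch UNION ALL delta rule for an inner-join view, plus a pruned variant that skips the second branch in that special case; the paper's foreign-key-based pruning is more general than this.

```python
# Toy sketch (not the paper's algorithm): the classical two-branch delta rule
# for maintaining the inner-join view V = R JOIN S ON R.fk = S.pk.
# Relations are lists of row-dicts; deltas are rows inserted since the last refresh.

def join(left, right, lkey, rkey):
    """Naive hash join of two lists of row-dicts."""
    index = {}
    for row in right:
        index.setdefault(row[rkey], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[lkey], [])]

def delta_view(r_old, s_old, delta_r, delta_s):
    """Generic IVM delta: UNION ALL of two join branches."""
    s_new = s_old + delta_s
    branch1 = join(delta_r, s_new, "fk", "pk")   # contributions from updates to R
    branch2 = join(r_old, delta_s, "fk", "pk")   # remaining contributions from S
    return branch1 + branch2                     # UNION ALL

def delta_view_pruned(r_old, s_old, delta_r, delta_s):
    """Pruned variant for the special case where only R (the referencing
    table) changed: the second branch is provably empty and is skipped."""
    assert not delta_s, "sketch assumes an R-only update batch"
    return join(delta_r, s_old, "fk", "pk")

# Hypothetical example: orders reference customers through a foreign key.
customers  = [{"pk": 1, "name": "Ada"}, {"pk": 2, "name": "Bob"}]
orders     = [{"oid": 10, "fk": 1}]
new_orders = [{"oid": 11, "fk": 2}]
assert delta_view(orders, customers, new_orders, []) == \
       delta_view_pruned(orders, customers, new_orders, [])
```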


Database Education at UC San Diego

November 2022

·

17 Reads

ACM SIGMOD Record

We are in the golden age of data-intensive computing. CS is now the largest major in most US universities. Data Science, ML/AI, and cloud computing have been growing rapidly. Many new data-centric job categories are taking shape in industry, e.g., data scientists, ML engineers, analytics engineers, and data associates. The DB/data management/data systems area is naturally a central part of all these transformations. Thus, the DB community must keep evolving and innovating to fulfill the need for DB education in all its facets, including its intersection with other areas such as ML, systems, HCI, various domain sciences, etc., as well as bridging the gap with practice and industry.


Artifacts Availability & Reproducibility (VLDB 2021 Round Table)

July 2022

·

25 Reads

·

2 Citations

ACM SIGMOD Record

In the last few years, SIGMOD and VLDB have intensified efforts to encourage, facilitate, and establish reproducibility as a key process for accepted research papers, awarding them the Reproducibility badge. In addition, complementary efforts have focused on increasing the sharing of accompanying artifacts of published work (code, scripts, data), independently of reproducibility, awarding them the Artifacts Available badge. In this short note, we summarize the discussion of a panel held during VLDB 2021 titled "Artifacts, Availability & Reproducibility". We first present a more detailed summary of the recent efforts. Then, we present the discussion and the key points that were contributed, aiming to assess the reproducibility of data management research and to propose changes moving forward.



Figure captions:
  • An example of a cosine threshold query with six 10-dimensional vectors. The missing values are 0's. Only the lists L1, L3, and L4 need to be scanned, since the query vector has non-zero values in dimensions 1, 3, and 4. For θ = 0.6, the gathering phase terminates after each list has examined three entries, because the score of any unseen vector is at most 0.8 × 0.3 + 0.3 × 0.3 + 0.5 × 0.2 = 0.43 < 0.6. The verification phase only needs to retrieve from the database the vectors obtained during the gathering phase, i.e., s1, s2, s3, and s5, compute the cosines, and produce the final result.
  • The construction of the convex hull $\tilde{\mathsf{H}}_i$.
  • Why the new convex hull $\tilde{\mathsf{H}}_i$ can only contain vertices from the original hull $\mathsf{H}_i$: here A and C are from $\mathsf{H}_i$ and B is not.
  • An example of the lower and upper bounds.
  • The distribution of the number of accesses and true similarities.


Index-based, High-dimensional, Cosine Threshold Querying with Optimality Guarantees
  • Article

January 2021

·

80 Reads

·

11 Citations

Theory of Computing Systems

Given a database of vectors, a cosine threshold query returns all vectors in the database having cosine similarity to a query vector above a given threshold θ. These queries arise naturally in many applications, such as document retrieval, image search, and mass spectrometry. The paper considers the efficient evaluation of such queries, as well as of the closely related top-k cosine similarity queries, providing novel optimality guarantees and exhibiting good performance on real datasets. We take as a starting point Fagin's well-known Threshold Algorithm (TA), which can be used to answer cosine threshold queries as follows: an inverted index is first built from the database vectors during pre-processing; at query time, the algorithm traverses the index partially to gather a set of candidate vectors to be later verified for θ-similarity. However, directly applying TA in its raw form misses significant optimization opportunities. Indeed, we first show that one can take advantage of the fact that the vectors can be assumed to be normalized to obtain an improved, tight stopping condition for index traversal and to compute it efficiently and incrementally. Then we show that multiple real-world datasets from mass spectrometry, natural language processing, and computer vision exhibit a certain form of data skewness, and we exploit this property to obtain better traversal strategies. We show that under the skewness assumption, the new traversal strategy has a strong, near-optimal performance guarantee. The techniques developed in the paper are quite general, since they can be applied to a large class of similarity functions beyond cosine.
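To make the gathering/verification pipeline concrete, here is a minimal Python sketch assuming unit-normalized, non-negative vectors (so cosine reduces to a dot product). It uses a simple round-robin traversal and the basic TA-style stopping bound, not the paper's tightened condition or skewness-aware strategy; all names and data are illustrative.

```python
import numpy as np

def build_inverted_index(vectors):
    """One posting list per dimension, sorted by value descending: (value, vector_id)."""
    dims = len(next(iter(vectors.values())))
    index = {d: [] for d in range(dims)}
    for vid, vec in vectors.items():
        for d, x in enumerate(vec):
            if x > 0:
                index[d].append((x, vid))
    for lst in index.values():
        lst.sort(reverse=True)
    return index

def gather(index, query, theta):
    """Gathering phase: scan the posting lists of the query's non-zero dimensions
    in lockstep. Stop once sum_d q_d * frontier_d, an upper bound on the score of
    any still-unseen vector, drops below theta."""
    active = [d for d, q in enumerate(query) if q > 0]
    pos = {d: 0 for d in active}
    candidates = set()
    while True:
        bound = sum(query[d] * (index[d][pos[d]][0] if pos[d] < len(index[d]) else 0.0)
                    for d in active)
        if bound < theta:
            return candidates
        advanced = False
        for d in active:
            if pos[d] < len(index[d]):
                candidates.add(index[d][pos[d]][1])
                pos[d] += 1
                advanced = True
        if not advanced:          # every active list is exhausted
            return candidates

def verify(vectors, candidates, query, theta):
    """Verification phase: compute the exact cosine for each gathered candidate."""
    q = np.asarray(query, dtype=float)
    return {vid for vid in candidates if float(np.dot(vectors[vid], q)) >= theta}

# Tiny usage example with hypothetical (approximately unit) vectors.
vectors = {"s1": np.array([0.8, 0.0, 0.6]), "s2": np.array([0.6, 0.8, 0.0])}
query = np.array([0.9, 0.0, 0.436])
candidates = gather(build_inverted_index(vectors), query, theta=0.6)
print(verify(vectors, candidates, query, theta=0.6))   # {'s1'}
```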


Query Optimization for Faster Deep CNN Explanations

September 2020

·

24 Reads

·

3 Citations

ACM SIGMOD Record

Deep Convolutional Neural Networks (CNNs) now match human accuracy in many image prediction tasks, resulting in a growing adoption in e-commerce, radiology, and other domains. Naturally, "explaining" CNN predictions is a key concern for many users. Since the internal workings of CNNs are unintuitive for most users, occlusion-based explanations (OBE) are popular for understanding which parts of an image matter most for a prediction. One occludes a region of the image using a patch and moves it around to produce a heatmap of changes to the prediction probability. This approach is computationally expensive due to the large number of re-inference requests produced, which wastes time and raises resource costs. We tackle this issue by casting the OBE task as a new instance of the classical incremental view maintenance problem. We create a novel and comprehensive algebraic framework for incremental CNN inference combining materialized views with multi-query optimization to reduce computational costs. We then present two novel approximate inference optimizations that exploit the semantics of CNNs and the OBE task to further reduce runtimes. We prototype our ideas in a tool we call Krypton. Experiments with real data and CNNs show that Krypton reduces runtimes by up to 5x (resp. 35x) to produce exact (resp. high-quality approximate) results without raising resource requirements.
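For context, the brute-force OBE loop that is being optimized can be sketched in a few lines of Python. Here `predict_proba`, the patch size, and the stride are placeholders for whatever model and configuration a user might have; this naive version performs one full re-inference per patch position, which is exactly the cost that the incremental and multi-query techniques reduce.

```python
import numpy as np

def occlusion_heatmap(image, predict_proba, target_class,
                      patch=16, stride=8, fill=0.0):
    """Naive occlusion-based explanation: slide an occluding patch over the image
    and record the drop in the target class probability at each position."""
    base = predict_proba(image)[target_class]
    h, w = image.shape[:2]
    heat = np.zeros(((h - patch) // stride + 1, (w - patch) // stride + 1))
    for i, y in enumerate(range(0, h - patch + 1, stride)):
        for j, x in enumerate(range(0, w - patch + 1, stride)):
            occluded = image.copy()
            occluded[y:y + patch, x:x + patch] = fill   # occlude one region
            # One full re-inference per patch position: the expensive step.
            heat[i, j] = base - predict_proba(occluded)[target_class]
    return heat
```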


Incremental and Approximate Computations for Accelerating Deep CNN Inference

May 2020

·

24 Reads

·

21 Citations

ACM Transactions on Database Systems

Deep learning now offers state-of-the-art accuracy for many prediction tasks. A form of deep learning called deep convolutional neural networks (CNNs) is especially popular on image, video, and time series data. Due to its high computational cost, CNN inference is often a bottleneck in analytics tasks on such data. Thus, much work in the computer architecture, systems, and compilers communities studies how to make CNN inference faster. In this work, we show that by elevating the abstraction level and re-imagining CNN inference as queries, we can bring to bear database-style query optimization techniques to improve CNN inference efficiency. We focus on tasks that perform CNN inference repeatedly on inputs that are only slightly different. We identify two popular CNN tasks with this behavior: occlusion-based explanations (OBE) and object recognition in videos (ORV). OBE is a popular method for “explaining” CNN predictions. It outputs a heatmap over the input to show which regions (e.g., image pixels) mattered most for a given prediction. It leads to many re-inference requests on locally modified inputs. ORV uses CNNs to identify and track objects across video frames. It also leads to many re-inference requests. We cast such tasks in a unified manner as a novel instance of the incremental view maintenance problem and create a comprehensive algebraic framework for incremental CNN inference that reduces computational costs. We produce materialized views of features produced inside a CNN and connect them with a novel multi-query optimization scheme for CNN re-inference. Finally, we also devise novel OBE-specific and ORV-specific approximate inference optimizations exploiting their semantics. We prototype our ideas in Python to create a tool called Krypton that supports both CPUs and GPUs. Experiments with real data and CNNs show that Krypton reduces runtimes by up to 5× (respectively, 35×) to produce exact (respectively, high-quality approximate) results without raising resource requirements.
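One way to see why re-inference on a slightly modified input can be made incremental: when only a small patch of the input changes, each convolution layer only needs to recompute the output region whose receptive field overlaps that patch. The sketch below shows how such an affected region propagates through a stack of conv layers along one spatial axis; the layer shapes are illustrative, not taken from any particular model or from the paper.

```python
def affected_output_range(start, end, kernel, stride, padding):
    """Map a changed input interval [start, end) on one spatial axis to the
    output interval of a conv layer that depends on it."""
    # Output position o reads input [o*stride - padding, o*stride - padding + kernel).
    out_start = max(0, (start + padding - kernel + 1 + stride - 1) // stride)
    out_end = (end - 1 + padding) // stride + 1
    return out_start, out_end

# A 16-pixel-wide occlusion patch starting at position 100 of a 224-wide input,
# pushed through three 3x3 stride-1 conv layers and then one stride-2 layer:
region = (100, 116)
for kernel, stride, padding in [(3, 1, 1), (3, 1, 1), (3, 1, 1), (3, 2, 1)]:
    region = affected_output_range(*region, kernel, stride, padding)
    print(region)   # the region grows slowly, staying far smaller than the full axis
```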


Plato: approximate analytics over compressed time series with tight deterministic error guarantees

March 2020

·

17 Reads

·

17 Citations

Proceedings of the VLDB Endowment

Plato provides fast approximate analytics on time series by precomputing and storing compressed time series. Plato's key novelty is the delivery of tight deterministic error guarantees for linear algebra operators over vectors/time series: the inner product operator and arithmetic operators. Composing them allows for evaluating common statistics, such as correlation and cross-correlation. In the offline processing phase, Plato (i) segments each time series into several disjoint segments using known fixed-length or variable-length segmentation algorithms; (ii) compresses each segment with a compression function drawn from a user-chosen compression function family; and (iii) associates with each segment one to three precomputed error measures. In the online query processing phase, Plato uses the error measures to compute the error guarantees. Importantly, we identify certain compression function families that lead to theoretically and experimentally higher-quality guarantees.
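To illustrate the flavor of such a guarantee, here is a minimal sketch, not Plato's implementation: the fixed segment length, the mean-based compression function, and the single stored error measure are simplifying assumptions. Because each segment's residual is orthogonal to its constant reconstruction, the error of the approximate inner product is exactly the inner product of the residuals, which the stored per-segment residual norms bound deterministically via Cauchy-Schwarz.

```python
import numpy as np

def compress(series, seg_len):
    """Fixed-length segmentation; each segment stored as (mean, residual L2 norm, length)."""
    out = []
    for i in range(0, len(series), seg_len):
        seg = np.asarray(series[i:i + seg_len], dtype=float)
        mean = seg.mean()
        out.append((mean, float(np.linalg.norm(seg - mean)), len(seg)))
    return out

def approx_inner_product(comp_a, comp_b):
    """Approximate <a, b> plus a deterministic error bound.
    Assumes both series share the same segmentation. The error equals
    <r_a, r_b> and is bounded per segment by ||r_a|| * ||r_b||."""
    approx = sum(ma * mb * n for (ma, _, n), (mb, _, _) in zip(comp_a, comp_b))
    bound = sum(ea * eb for (_, ea, _), (_, eb, _) in zip(comp_a, comp_b))
    return approx, bound

# Usage: the exact inner product always lies within the reported bound.
a = np.sin(np.linspace(0, 6, 120)) + 0.1 * np.random.randn(120)
b = np.cos(np.linspace(0, 6, 120))
approx, bound = approx_inner_product(compress(a, 20), compress(b, 20))
assert abs(np.dot(a, b) - approx) <= bound + 1e-9
```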


Incremental and Approximate Inference for Faster Occlusion-based Deep CNN Explanations

June 2019

·

90 Reads

·

32 Citations

Deep Convolutional Neural Networks (CNNs) now match human accuracy in many image prediction tasks, resulting in a growing adoption in e-commerce, radiology, and other domains. Naturally, explaining CNN predictions is a key concern for many users. Since the internal workings of CNNs are unintuitive for most users, occlusion-based explanations (OBE) are popular for understanding which parts of an image matter most for a prediction. One occludes a region of the image using a patch and moves it around to produce a heat map of changes to the prediction probability. Alas, this approach is computationally expensive due to the large number of re-inference requests produced, which wastes time and raises resource costs. We tackle this issue by casting the OBE task as a new instance of the classical incremental view maintenance problem. We create a novel and comprehensive algebraic framework for incremental CNN inference combining materialized views with multi-query optimization to reduce computational costs. We then present two novel approximate inference optimizations that exploit the semantics of CNNs and the OBE task to further reduce runtimes. We prototype our ideas in Python to create a tool we call Krypton that supports both CPUs and GPUs. Experiments with real data and CNNs show that Krypton reduces runtimes by up to 5X (resp. 35X) to produce exact (resp. high-quality approximate) results without raising resource requirements.


Index-based, High-dimensional, Cosine Threshold Querying with Optimality Guarantees

December 2018

·

22 Reads

Given a database of vectors, a cosine threshold query returns all vectors in the database having cosine similarity to a query vector above a given threshold θ. These queries arise naturally in many applications, such as document retrieval, image search, and mass spectrometry. The present paper considers the efficient evaluation of such queries, providing novel optimality guarantees and exhibiting good performance on real datasets. We take as a starting point Fagin's well-known Threshold Algorithm (TA), which can be used to answer cosine threshold queries as follows: an inverted index is first built from the database vectors during pre-processing; at query time, the algorithm traverses the index partially to gather a set of candidate vectors to be later verified for θ-similarity. However, directly applying TA in its raw form misses significant optimization opportunities. Indeed, we first show that one can take advantage of the fact that the vectors can be assumed to be normalized, to obtain an improved, tight stopping condition for index traversal and to efficiently compute it incrementally. Then we show that one can take advantage of data skewness to obtain better traversal strategies. In particular, we show a novel traversal strategy that exploits a common data skewness condition which holds in multiple domains including mass spectrometry, documents, and image databases. We show that under the skewness assumption, the new traversal strategy has a strong, near-optimal performance guarantee. The techniques developed in the paper are quite general since they can be applied to a large class of similarity functions beyond cosine.


Citations (76)


... an operator that counts the number of rows aggregated (as described by Mumick [87]), which is used to detect empty groups, and a filtering operator (linear), which eliminates NULLs prior to aggregation. For many SQL aggregation functions, aggregating over an empty set should produce a NULL result. All aggregation circuits described so far will return an empty Z-set for an empty input. ...

Reference:

DBSP: automatic incremental view maintenance for rich query languages
Foreign Keys Open the Door for Faster Incremental View Maintenance
  • Citing Article
  • May 2023

Proceedings of the ACM on Management of Data

... In the past 15 years, the database research community has experienced a great shift in best practices for managing research data, as highlighted in a recent overview [1]: Influential conferences like VLDB now require the availability of research data (or artifacts) in the review process (and not only after publication). For double-anonymous reviews (e.g., SIGMOD'24), authors go to great lengths to anonymize their research data, either creating pseudonymous GitHub repositories or using services such as Anonymous GitHub. ...

Artifacts Availability & Reproducibility (VLDB 2021 Round Table)
  • Citing Article
  • July 2022

ACM SIGMOD Record

... The main filter exploits a unique upper bound on a similarity, which is calculated by utilizing both the summable property of the inner product (similarity) and an estimated shared (ES) threshold on mean-feature values, based on the foregoing data structures (Sections IV and V). Besides our upper-bound-based pruning (UBP) filter, we can consider other UBP filters with state-of-the-art techniques such as a threshold algorithm (TA) in the similarity-search field [12], [28] and blockification using the Cauchy-Schwarz inequality (CS) [29] (Section VI-D). Our UBP filter differs from the TA and CS in that its threshold is estimated by minimizing an approximate number of multiplications for similarity calculations and is shared with all the objects. ...

Index-based, High-dimensional, Cosine Threshold Querying with Optimality Guarantees

Theory of Computing Systems

... Recent work also leveraged this property for decremental updates [84] (e.g., to remove tuples for GDPR regulations). Further work includes incremental grounding and inference in DeepDive [88], and incremental computation of occlusion-based explanations for CNNs [69,70]. In contrast to the incremental maintenance of intermediates, our partial reuse is more general because it allows rewrites to augment intermediates by complex compensation plans. ...

Query Optimization for Faster Deep CNN Explanations
  • Citing Article
  • September 2020

ACM SIGMOD Record

... However, with the AI inference process, these query systems face several challenges. First of all, the AI inference models usually incur intensive computations [12], [13], which can significantly slow down the query execution process. Secondly, many AI inference procedures may require expensive computing resources, including GPUs and memories, which may not always be affordable in various scenarios. ...

Incremental and Approximate Computations for Accelerating Deep CNN Inference
  • Citing Article
  • May 2020

ACM Transactions on Database Systems

... To overcome these limitations, model-based data compression has proven to be an effective solution for time series data, providing compact representations at low computational costs [6]. Recent methods, such as those utilizing piecewise linear approximation (PLA) [7]- [9], have achieved high compression with deterministic error bounds. However, these methods mainly focus on compression performance, giving limited consideration to direct analytics on compressed data; hence, often overlook multiscale contextual information, which is crucial for tasks requiring fine-grained representations, such as anomaly detection. ...

Plato: approximate analytics over compressed time series with tight deterministic error guarantees
  • Citing Article
  • March 2020

Proceedings of the VLDB Endowment

... In recent years, researchers in the database community have been working on raising the level of abstractions of machine learning (ML) and integrating such functionality into today's data management systems [95,96], e.g., SystemML [25], SystemDS [8], Snorkel [71], ZeroER [91], TFX [5,9], Query 2.0 [92], Krypton [66], Cerebro [67], ModelDB [86], MLFlow [94], Deep-Dive [14], HoloClean [72], EaseML [1], ActiveClean [48], and NorthStar [47]. End-to-end AutoML systems [93,97,33] have been an emerging type of systems that has significantly raised the level of abstractions of building ML applications. ...

Incremental and Approximate Inference for Faster Occlusion-based Deep CNN Explanations
  • Citing Conference Paper
  • June 2019

... The need to store and process large graphs supports the growing interest in graph databases. An overview of several aspects of graph databases can be found in (Deutsch & Papakonstantinou, 2018). Typically, such databases can store graph vertices and edges, labeled with sets of attributes. ...

Graph data models, query languages and programming paradigms
  • Citing Article
  • August 2018

Proceedings of the VLDB Endowment

... There is active research on interactive and human-in-the-loop systems in many computer science sub-disciplines. The database and visualization communities have produced numerous tools [3][4][5][6][7][8] to aid data scientists with data wrangling and analysis. At the decision-making stage, the machine learning community has looked at making black-box models explainable [2,[9][10][11][12], while the human-computer interaction (HCI) community has been studying how differences in explainability affect decision making [13,14]. ...

ViDeTTe Interactive Notebooks
  • Citing Conference Paper
  • June 2018

... Furthermore, there has been a plethora of flash-based file systems which utilize hybrid/tiered storage devices, including NOR-based flash [144], Storage Class Memory (SCM) [109,156,174,203,210], byte addressable Non-Volatile Memory (NVM) [142], and methods that expose byte addressable NVRAM on SSDs with custom firmware for metadata placement [104,279]. Our focus is on block address storage, which is the most prevalent for storage. ...

Improving SSD lifetime with byte-addressable metadata
  • Citing Conference Paper
  • October 2017