Michael J. Carey’s research while affiliated with University of California, Irvine and other places


Publications (317)


Figure 7: Aggregation of subscriptions based on matching parameters and brokers.
Figure 9: (a) Original vs. (b) Optimized channel plan.
Figure 11: The query plans for retrieving relevant tweets for the TweetsAboutCrime channel were created without any index, with a traditional index, and with the BAD index.
Figure 13: Determining the ideal subgroup subscription size relative to frame size f = 80KB.

Optimizing Big Active Data Management Systems
  • Preprint

December 2024 · 5 Reads

Shahrzad Haji Amin Shirazi · Xikui Wang · Michael J. Carey

Within the dynamic world of Big Data, traditional systems typically operate in a passive mode, processing and responding to user queries by returning the requested data. However, this methodology falls short of meeting the evolving demands of users who not only wish to analyze data but also to receive proactive updates on topics of interest. To bridge this gap, Big Active Data (BAD) frameworks have been proposed to support extensive data subscriptions and analytics for millions of subscribers. As data volumes and the number of interested users continue to increase, the imperative to optimize BAD systems for enhanced scalability, performance, and efficiency becomes paramount. To this end, this paper introduces three main optimizations, namely: strategic aggregation, intelligent modifications to the query plan, and early result filtering, all aimed at reinforcing a BAD platform's capability to actively manage and efficiently process soaring rates of incoming data and distribute notifications to larger numbers of subscribers.

A new window Clause for SQL++

December 2023 · 71 Reads · 1 Citation

The VLDB Journal

Window queries are important analytical tools for ordered data and have been researched both in streaming and stored data environments. By incorporating ideas for window queries from existing streaming and stored data systems, we propose a new window syntax that makes a wide range of window queries easier to write and optimize. We have implemented this new window syntax in SQL++, an SQL extension that supports querying semistructured data, on top of AsterixDB, a Big Data Management System, thus allowing us to process window queries over large datasets in a parallel and efficient manner.


Revisiting Runtime Dynamic Optimization for Join Queries in Big Data Management Systems

June 2023 · 18 Reads · 2 Citations

ACM SIGMOD Record

Effective query optimization remains an open problem for Big Data Management Systems. In this work, we revisit an old idea, runtime dynamic optimization, and adapt it to a big data management system, AsterixDB. The approach runs in stages (re-optimization points), starting by executing all predicates local to a single dataset. The intermediate result created by a stage is then used to re-optimize the remaining query. This re-optimization approach avoids inaccurate intermediate result cardinality estimates, thus leading to much better execution plans. While it introduces overhead for materializing intermediate results, experiments show that this overhead is relatively small and is an acceptable price to pay given the optimization benefits.
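
The staged idea in the abstract above can be illustrated with a toy sketch. It is not AsterixDB's implementation: the function names (`apply_local_predicates`, `hash_join`, `staged_join`), the in-memory lists of dictionaries, and the single shared join key are all assumptions made for illustration. The sketch first materializes each dataset after its local predicate, then uses the actual (not estimated) sizes to decide which input to join next and which side to build the hash table on.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Predicate = Callable[[Row], bool]


def apply_local_predicates(
    datasets: Dict[str, List[Row]], predicates: Dict[str, Predicate]
) -> Dict[str, List[Row]]:
    """Stage 1: filter each dataset by its own predicate and materialize it."""
    materialized = {}
    for name, rows in datasets.items():
        keep = predicates.get(name, lambda _row: True)
        materialized[name] = [row for row in rows if keep(row)]
    return materialized


def hash_join(left: List[Row], right: List[Row], key: str) -> List[Row]:
    """Equi-join on `key`, building the hash table on the smaller input."""
    build, probe = (left, right) if len(left) <= len(right) else (right, left)
    table: Dict[object, List[Row]] = {}
    for row in build:
        table.setdefault(row[key], []).append(row)
    return [{**match, **row} for row in probe for match in table.get(row[key], [])]


def staged_join(
    datasets: Dict[str, List[Row]], predicates: Dict[str, Predicate], key: str
) -> List[Row]:
    """Join all datasets on `key`, re-planning after every materialized stage."""
    remaining = list(apply_local_predicates(datasets, predicates).values())
    # Re-optimization point: order inputs by their real cardinality.
    remaining.sort(key=len)
    result = remaining.pop(0)
    while remaining:
        # The intermediate result is materialized, so its size is known exactly;
        # pick the smallest remaining input as the next join partner.
        remaining.sort(key=len)
        result = hash_join(result, remaining.pop(0), key)
    return result
```

Calling `staged_join` with a selective predicate on one dataset shows the effect: the filtered dataset becomes the build side because its real size, measured after filtering, is the smallest.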



Multi-valued indexing in Apache AsterixDB (SI DOLAP 2022)

November 2022 · 3 Reads

Information Systems

Secondary indexes in relational database systems are traditionally built under the assumption that one data record maps to one indexed value. Nowadays, particularly in NoSQL systems, single data records can hold collections of values that users want to access efficiently in an ad-hoc manner. Multi-valued indexes aim to give users the best of both worlds: (i) to keep a more natural data model of records with collections of values, and (ii) to reap the benefits of a secondary index. In this paper, we detail the steps taken to realize multi-valued indexes in AsterixDB, a Big Data management system with a structured query language operating over a collection of documents. This includes (a) creating the specification language for such indexes, (b) illustrating data flows for bulk-loading and maintaining an index, and (c) discussing query plans to take advantage of multi-valued indexes for use in predicates with existential and universal quantification. We conclude with experiments that measure the impact of maintaining an AsterixDB multi-valued index and experiments that compare the query performance of our multi-valued indexes against similar indexes in MongoDB and Couchbase Server’s Query Service.
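
As a rough illustration of the multi-valued index idea in the abstract above, the following sketch maps every element of a collection-valued field to the primary keys of the records containing it. It is a simplified in-memory model, not AsterixDB's design: the class `MultiValuedIndex`, its method names, and the dictionary-based records are hypothetical, and only equality predicates under existential quantification ("some element equals v") are covered.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Set


class MultiValuedIndex:
    """Toy secondary index: one record contributes one entry per collection element."""

    def __init__(self, field: str) -> None:
        self.field = field
        # indexed value -> primary keys of records whose collection contains it
        self.entries: Dict[Hashable, Set[Hashable]] = defaultdict(set)

    def insert(self, pk: Hashable, record: Dict[str, List[Hashable]]) -> None:
        """Index maintenance on insert: add one entry per distinct element."""
        for value in set(record.get(self.field, [])):
            self.entries[value].add(pk)

    def delete(self, pk: Hashable, record: Dict[str, List[Hashable]]) -> None:
        """Index maintenance on delete: remove the record's entries."""
        for value in set(record.get(self.field, [])):
            self.entries[value].discard(pk)
            if not self.entries[value]:
                del self.entries[value]

    def lookup_some(self, value: Hashable) -> Set[Hashable]:
        """Existential predicate: primary keys where SOME element equals value."""
        return set(self.entries.get(value, set()))
```

A lookup such as `index.lookup_some("json")` then returns candidate primary keys without scanning every record's collection, which is the benefit a secondary index is meant to provide.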



Citations (70)


... The BAD platform can enable millions of users to subscribe to data of interest and receive updates continuously. It is different than continuous queries, streaming engines and pub/sub systems as it also supports Big Data analytics with a declarative language, SQL++ (a SQL-inspired query language for semi-structured data [12,18]). ...

Reference:

Optimizing Big Active Data Management Systems
SQL++: We Can Finally Relax!
  • Citing Conference Paper
  • May 2024

... The BAD platform can enable millions of users to subscribe to data of interest and receive updates continuously. It is different than continuous queries, streaming engines and pub/sub systems as it also supports Big Data analytics with a declarative language, SQL++ (a SQL-inspired query language for semi-structured data [12,18]). ...

A new window Clause for SQL++

The VLDB Journal

... In recent years, the management and analysis of open big data have garnered significant attention in both academia and industry (e.g., [26]). In this Section, we provide a comprehensive review of relevant literature by focusing on key developments and methodologies in the field of Open Big Data Management and Analytics (e.g., [27]). ...

Revisiting Runtime Dynamic Optimization for Join Queries in Big Data Management Systems
  • Citing Article
  • June 2023

ACM SIGMOD Record

... Furthermore, to reduce cache misses, blocking operators use carefully designed data structures (e.g., cache-friendly hash tables [3]) and software prefetching [13] as much as possible. Besides, all blocking operators have an optimized spill-to-disk version to handle out-of-memory crises, such as dynamic hybrid hash Join [29]. • Expression Evaluation. ...

Design trade-offs for a robust dynamic hybrid hash join
  • Citing Article
  • June 2022

Proceedings of the VLDB Endowment

... Apache AsterixDB [2,20] is a Big Data Management System (BDMS) designed to manage semi-structured data across multiple nodes in a shared-nothing cluster architecture. It uses AsterixDB Data Model (ADM) which is a super-set of JSON providing flexibility in record structure [3]. ...

Columnar formats for schemaless LSM-based document stores
  • Citing Article
  • June 2022

Proceedings of the VLDB Endowment

... It also does not consider how to include time in ranking search results c.f., [38]. Keyword search is not the only kind of search on JSON or XML data, similarity search has also been popular [7,36,70]. The kinds of search are different, but the techniques presented in this paper could be adapted to similarity search. ...

JEDI: These aren't the JSON documents you're looking for?
  • Citing Conference Paper
  • June 2022

... This passive approach often falls short for users who not only want to analyze data but also actively receive updates on new data items that interest them, explore their relationships with other data, and even enrich them with additional information existing in different datasets. These demands have led to the creation of Big Active Data (BAD) frameworks [13,23,33] that aim to support extensive data subscriptions and analytics for millions of subscribers. A BAD framework circumvents the inefficiencies of cobbling together multiple independent systems (each dealing with a part of the needed processing, i.e., accessing Big Data, managing incoming streaming data, matching subscribers to information etc). ...

Subscribing to big data at scale

Distributed and Parallel Databases

... aiDM ' [21], and route and point-of-interest recommendation [17,26]. For efficient analysis of spatial data, it is essential to use distributed parallel processing systems for spatial data such as Sedona [3], SpatialHadoop [7], and others [1,25,28,29,36]. ...

A brief introduction to geospatial big data analytics with apache AsterixDB

... Combining these optimal solutions in the most suitable manner is thus severely restricted. In contrast, Polyframe appears more effective in simulating the generation of three-dimensional fully compressed truss structures, producing intricate forms such as shells and funnel structures [41,42]. However, when dealing with cantilever or complex hybrid structures, a more robust optimization algorithm is necessary. ...

PolyFrame: a retargetable query-based approach to scaling dataframes
  • Citing Article
  • July 2021

Proceedings of the VLDB Endowment