The VLDB Journal

Published by Springer Nature
Online ISSN: 0949-877X
Print ISSN: 1066-8888
Recent publications
Given a graph G where each node is associated with a set of attributes, attributed network embedding (ANE) maps each node v ∈ G to a compact vector X_v, which can be used in downstream machine learning tasks. Ideally, X_v should capture node v's affinity to each attribute, considering not only v's own attribute associations but also those of the nodes connected to v along edges in G. It is challenging to obtain high-utility embeddings that enable accurate predictions; scaling effective ANE computation to massive graphs with millions of nodes pushes the difficulty of the problem to a whole new level. Existing solutions largely fail on such graphs, leading to prohibitive costs, low-quality embeddings, or both. 
This paper proposes PANE, an effective and scalable approach to ANE computation for massive graphs that achieves state-of-the-art result quality on multiple benchmark datasets, measured by the accuracy of three common prediction tasks: attribute inference, link prediction, and node classification. PANE obtains high scalability and effectiveness through three main algorithmic designs. First, it formulates the learning objective based on a novel random walk model for attributed networks. The resulting optimization task is still challenging on large graphs. Second, PANE includes a highly efficient solver for this optimization problem, whose key module is a carefully designed initialization of the embeddings, which drastically reduces the number of iterations required to converge. 
Finally, PANE utilizes multi-core CPUs through non-trivial parallelization of the above solver, which achieves scalability while retaining the high quality of the resulting embeddings. The performance of PANE depends on the number of attributes in the input network. To handle large networks with numerous attributes, we further extend PANE to PANE++, which employs an effective attribute clustering technique. 
Extensive experiments, comparing 10 existing approaches on 8 real datasets, demonstrate that PANE and PANE++ consistently outperform all existing methods in terms of result quality, while being orders of magnitude faster.
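To make the notion of attribute affinity concrete, here is a minimal sketch (not PANE's actual model; `attribute_affinity`, `alpha`, and the restart formulation are illustrative choices) of mixing a node's own attributes with those of its neighbors via random-walk-style propagation:

```python
import numpy as np

def attribute_affinity(adj, attrs, alpha=0.5, iters=10):
    """Propagate node-attribute associations along edges.

    adj   : (n, n) adjacency matrix
    attrs : (n, d) binary node-attribute matrix
    Returns an (n, d) affinity matrix mixing each node's own
    attributes with those reachable via random walks.
    """
    # Row-normalize the adjacency into a transition matrix.
    deg = adj.sum(axis=1, keepdims=True)
    P = np.divide(adj, deg, out=np.zeros_like(adj, dtype=float), where=deg > 0)
    F = attrs.astype(float)
    for _ in range(iters):
        # With weight alpha restart at the node's own attributes,
        # otherwise take one more random-walk step.
        F = alpha * attrs + (1 - alpha) * (P @ F)
    return F

# Toy graph: nodes 0 and 1 are connected, node 2 is isolated.
adj = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]])
attrs = np.array([[1, 0], [0, 1], [1, 1]])
F = attribute_affinity(adj, attrs)
```

In this sketch, node 0 acquires a nonzero affinity to attribute 1 through its neighbor, while its own attribute 0 still dominates.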
Incremental processing is widely adopted in many applications, ranging from incremental view maintenance and stream computing to the recently emerging progressive data warehouses and intermittent query processing. Despite the many algorithms developed on this topic, none of them can produce an incremental plan that always achieves the best performance, since the optimal plan is data dependent. In this paper, we develop a novel cost-based optimizer framework, called Tempura, for optimizing incremental data processing. We propose an incremental query planning model called TIP based on the concept of time-varying relations, which can formally model incremental processing in its most general form. We give a full specification of Tempura, which not only unifies various existing techniques to generate an optimal incremental plan, but also allows developers to add their own rewrite rules. We study how to explore the plan space and search for an optimal incremental plan. We evaluate Tempura in various incremental processing scenarios to show its effectiveness and efficiency.
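To ground the notion of incremental processing, here is a minimal sketch (unrelated to Tempura's actual planner) of maintaining a grouped count under inserted and deleted tuples rather than recomputing it from scratch:

```python
class IncrementalCount:
    """Maintain COUNT(*) GROUP BY key incrementally over insertions
    and deletions, instead of recomputing from scratch."""
    def __init__(self):
        self.counts = {}

    def apply_delta(self, inserts=(), deletes=()):
        for key in inserts:
            self.counts[key] = self.counts.get(key, 0) + 1
        for key in deletes:
            self.counts[key] -= 1
            if self.counts[key] == 0:
                del self.counts[key]
        return dict(self.counts)

view = IncrementalCount()
view.apply_delta(inserts=["a", "a", "b"])
result = view.apply_delta(inserts=["b"], deletes=["a"])
```

Each delta is processed in time proportional to its own size, not the size of the accumulated input, which is the basic payoff any incremental plan tries to realize.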
We investigate the problem of incremental denial constraint (DC) discovery, aiming at discovering DCs in response to a set Δr of tuple insertions to a given relational instance r, given the known set Σ of DCs holding on r. The need for this study is evident, since real-life data are often frequently updated, and it is often prohibitively expensive to perform DC discovery from scratch for every update. We tackle this problem in two steps. We first employ indexing techniques to efficiently identify the incremental evidences caused by Δr. We present algorithms to build indexes for Σ and r in the pre-processing step, and to visit and update the indexes in response to Δr. In particular, we propose a novel indexing technique for two inequality comparisons, possibly across the attributes of r. By leveraging the indexes, we can identify all the tuple pairs incurred by Δr that simultaneously satisfy the two comparisons, with a cost dependent on log(|r|). We then compute the changes ΔΣ to Σ based on the incremental evidences, such that Σ ⊕ ΔΣ is the set of DCs holding on r + Δr. ΔΣ may contain new DCs that are added to Σ and obsolete DCs that are removed from Σ. Our experimental evaluations show that our incremental approach is faster than the two state-of-the-art batch DC discovery approaches that compute from scratch on r + Δr by orders of magnitude, even when Δr is up to 30% of r.
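For intuition, a denial constraint forbids any tuple pair that satisfies all of its predicates; the sketch below (with a made-up constraint and schema) shows why, after insertions, only the pairs involving at least one new tuple need to be revisited:

```python
from operator import eq, gt

def violates(dc, t, s):
    """The ordered pair (t, s) violates the DC if every predicate
    (attr_of_t, operator, attr_of_s) holds simultaneously."""
    return all(op(t[a], s[b]) for a, op, b in dc)

def incremental_violations(dc, r, delta):
    """Only pairs involving at least one inserted tuple need checking."""
    everything = r + delta
    hits = [(t, s) for t in delta for s in everything
            if t is not s and violates(dc, t, s)]
    hits += [(s, t) for s in r for t in delta if violates(dc, s, t)]
    return hits

# Hypothetical DC: not(t.dept = s.dept and t.level > s.level)
dc = [("dept", eq, "dept"), ("level", gt, "level")]
r = [{"dept": "eng", "level": 2}]
delta = [{"dept": "eng", "level": 3}]
found = incremental_violations(dc, r, delta)
```

The quadratic pair scan here is exactly what the paper's indexes avoid, locating the qualifying pairs at a cost dependent on log(|r|) instead.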
Computing the shortest distance between two vertices is a fundamental problem in road networks. Since a direct search using Dijkstra's algorithm results in a large search space, researchers resort to indexing-based approaches. State-of-the-art indexing-based solutions can be categorized into hierarchy-based solutions and hop-based solutions. However, the hierarchy-based solutions require a large search space for long-distance queries, while the hop-based solutions incur high computational waste for short-distance queries. To overcome the drawbacks of both solutions, in this paper, we propose a novel hierarchical 2-hop index (H2H-Index) which assigns a label to each vertex and at the same time preserves a hierarchy among all vertices. With the H2H-Index, we design an efficient query processing algorithm with performance guarantees, which visits only part of the labels of the source and the destination based on the hierarchy. We propose a novel algorithm to construct the H2H-Index based on distance preserved graphs. We also extend the H2H-Index and propose a set of algorithms to identify the shortest path between vertices. We conducted extensive performance studies using large real road networks, including the whole USA road network. The experimental results demonstrate that our approach achieves a speedup of an order of magnitude in query processing compared to the state-of-the-art, while consuming comparable indexing time and index size.
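To see how a 2-hop index answers distance queries, here is a deliberately naive sketch (far from the H2H-Index's pruned, hierarchy-aware labels) in which every vertex serves as a hub and a query reduces to a scan over common hub entries:

```python
import heapq

def dijkstra(graph, src):
    """Standard Dijkstra over an adjacency dict {u: [(v, w), ...]}."""
    dist = {src: 0}
    pq = [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

def build_labels(graph):
    """Naive 2-hop labels: every vertex keeps its distance to every
    vertex (all vertices act as hubs). Real indexes keep far fewer
    entries per vertex."""
    return {v: dijkstra(graph, v) for v in graph}

def query(labels, s, t):
    """dist(s, t) = min over common hubs h of d(s, h) + d(h, t)."""
    common = labels[s].keys() & labels[t].keys()
    return min(labels[s][h] + labels[t][h] for h in common)

graph = {
    0: [(1, 2), (2, 5)],
    1: [(0, 2), (2, 1)],
    2: [(0, 5), (1, 1), (3, 2)],
    3: [(2, 2)],
}
labels = build_labels(graph)
```

The shortest route from 0 to 3 goes 0-1-2-3 with weight 2 + 1 + 2 = 5, and the label scan recovers it without any graph traversal at query time.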
In this paper, we study the problem of (p, q)-biclique counting and enumeration for large sparse bipartite graphs. Given a bipartite graph G = (U, V, E) and two integer parameters p and q, we aim to efficiently count and enumerate all (p, q)-bicliques in G, where a (p, q)-biclique B(L, R) is a complete subgraph of G with L ⊆ U, R ⊆ V, |L| = p, and |R| = q. The problem of (p, q)-biclique counting and enumeration has many applications, such as graph neural network information aggregation, densest subgraph detection, and cohesive subgroup analysis. Despite this wide range of applications, to the best of our knowledge, there is no efficient and scalable solution to this problem in the literature. 
This problem is computationally challenging, due to the worst-case exponential number of (p, q)-bicliques. In this paper, we propose a competitive branch-and-bound baseline method, namely BCList, which explores the search space in a depth-first manner, together with a variety of pruning techniques. Although BCList offers a useful computation framework for our problem, its worst-case time complexity is exponential in p + q. To alleviate this, we propose an advanced approach, called BCList++. In particular, BCList++ applies a layer-based exploration strategy to enumerate (p, q)-bicliques by anchoring the search on either U or V only, which has a worst-case time complexity exponential in either p or q only. 
Consequently, a vital task is to choose the layer with the lower computation cost. To this end, we develop a cost model, which is built upon an unbiased estimator for the density of the 2-hop graph induced by U or V. To improve computation efficiency, BCList++ exploits pre-allocated arrays and vertex labeling techniques so that the frequent subgraph creation operations can be replaced by array element switching operations. We conduct extensive experiments on 16 real-life datasets, and the experimental results demonstrate that BCList++ significantly outperforms the baseline methods by up to 3 orders of magnitude. We show via a case study that (p, q)-bicliques optimize the efficiency of graph neural networks. In this paper, we extend our techniques to count and enumerate (p, q)-bicliques on uncertain bipartite graphs. 
An efficient method, IUBCList, is developed on top of BCList++, together with a couple of pruning techniques, including common neighbor refinement and search branch early termination, to discard unpromising uncertain (p, q)-bicliques early. The experimental results demonstrate that IUBCList significantly outperforms the baseline method by up to 2 orders of magnitude.
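For reference, the quantity being counted can be computed by brute force on tiny graphs; the sketch below checks every p-subset of U against every q-subset of V, which is precisely the exponential behavior BCList and BCList++ are designed to avoid:

```python
from itertools import combinations

def count_pq_bicliques(U, V, edges, p, q):
    """Count (p, q)-bicliques by testing every p-subset of U against
    every q-subset of V; exponential, only viable on toy inputs."""
    E = set(edges)
    count = 0
    for L in combinations(U, p):
        for R in combinations(V, q):
            # B(L, R) is a biclique iff every cross pair is an edge.
            if all((u, v) in E for u in L for v in R):
                count += 1
    return count

# Toy bipartite graph: u0 and u1 both connect to v0 and v1.
U, V = ["u0", "u1"], ["v0", "v1", "v2"]
edges = [("u0", "v0"), ("u0", "v1"), ("u1", "v0"),
         ("u1", "v1"), ("u1", "v2")]
```

On this toy graph there is exactly one (2, 2)-biclique, {u0, u1} x {v0, v1}, and four (1, 2)-bicliques.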
Language-integrated query (LINQ) frameworks offer a convenient programming abstraction for processing in-memory collections of data, allowing developers to concisely express declarative queries using popular programming languages. Existing LINQ frameworks rely on the type system of statically typed languages such as C# or Java to perform query compilation and execution. As a consequence of this design, they do not support dynamic languages such as Python, R, or JavaScript. Such languages are, however, very popular among data scientists, who would certainly benefit from LINQ frameworks in data-analytics applications. The gap between dynamic languages and LINQ frameworks has been partially bridged by the recent work on DynQ, a novel query engine designed for dynamic languages. DynQ is language-agnostic, since it is able to execute SQL queries on all languages supported by the GraalVM platform. Moreover, DynQ can execute queries combining data from multiple sources, namely in-memory object collections as well as on-file data and external database systems. The evaluation of DynQ shows performance comparable with equivalent hand-optimized code, and in line with common data-processing libraries and embedded databases, making DynQ an appealing query engine for standalone analytics applications and for data-intensive server-side workloads. In this work, we extend DynQ to address the problem of optimizing high-throughput workloads in the context of fluent APIs. In particular, we focus on applications which use data-processing libraries mostly for executing many queries on small batches of data, e.g., in micro-services, as well as applications which use data-processing libraries within recursive functions. 
For this purpose, we present reusable compiled queries, a novel approach to query execution which allows reusing the same dynamically compiled code for different queries. As we show in our evaluation, thanks to reusable compiled queries, DynQ can also speed up applications that heavily use data-processing libraries on small datasets using a typical fluent API.
Streaming graph analysis is gaining importance in various fields due to the natural dynamicity of many real graph applications. However, approximately counting triangles in real-world streaming graphs with duplicate edges under the sliding-window model remains an unsolved problem. In this paper, we propose the SWTC algorithm to address the approximate sliding-window triangle counting problem in streaming graphs. In SWTC, we propose a fixed-length slicing strategy that addresses both the sample maintenance and cardinality estimation issues with bounded memory usage. We theoretically prove the superiority of our method in sample graph size and estimation accuracy under a given memory upper bound. To further improve the performance of our algorithm, we propose two optimization techniques: vision counting, to avoid computation peaks, and asynchronous grouping, to stabilize the accuracy. Extensive experiments also confirm that our approach achieves higher accuracy than the baseline method under the same memory usage.
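To illustrate the sliding-window semantics the problem is defined over, here is an exact (non-sampling) sketch that keeps the last few edges and recounts triangles; it assumes distinct edges and unbounded per-window memory, which is exactly what SWTC's bounded-memory sampling avoids:

```python
from collections import deque

class SlidingTriangleCounter:
    """Exact triangle counting over the last `window` edges of a
    stream (assumes distinct edges; SWTC instead samples within a
    memory bound and copes with duplicates)."""
    def __init__(self, window):
        self.window = window
        self.edges = deque()   # edges in arrival order
        self.adj = {}          # adjacency sets of the current window

    def insert(self, u, v):
        if len(self.edges) == self.window:
            ou, ov = self.edges.popleft()   # expire the oldest edge
            self.adj[ou].discard(ov)
            self.adj[ov].discard(ou)
        self.edges.append((u, v))
        self.adj.setdefault(u, set()).add(v)
        self.adj.setdefault(v, set()).add(u)

    def triangles(self):
        # Each triangle is seen once per incident edge, hence // 3.
        return sum(len(self.adj[u] & self.adj[v])
                   for u, v in self.edges) // 3

tc = SlidingTriangleCounter(window=4)
for e in [(0, 1), (1, 2), (0, 2)]:
    tc.insert(*e)
count_after_three = tc.triangles()
tc.insert(2, 3)
tc.insert(1, 3)   # window full: evicts (0, 1), breaking triangle 0-1-2
count_after_five = tc.triangles()
```

After the two extra insertions the old triangle 0-1-2 has slid out of the window while a new triangle 1-2-3 has formed, so the count stays at one.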
Matching items for a user from a travel item pool of large cardinality is the most important technology for Fliggy, one of the most popular online travel platforms (OTPs) in China. In this paper, we propose a novel Fliggy ITinerary-aware deep matching network (FitNET) to address the major challenges facing OTPs. FitNET is designed based on the effective deep matching framework. First, the concept of a user's active itinerary is defined for OTPs. Then, several itinerary-aware attention mechanisms that capture the interactions between user active itineraries and other inputs are designed to better infer users' travel intentions and preferences and to handle their diverse needs. Next, two learning objectives, i.e., user travel intention prediction and user click behavior prediction, are optimized simultaneously. In addition to the FitNET model, an improved version, named FitNET+, is also proposed. FitNET+ optimizes FitNET by additionally considering the information in a user's historical itineraries and devising an effective itinerary weighting unit to control the impact of each historical itinerary on the learning of the user's preferences. An offline experiment on the Fliggy production dataset and an online A/B test both show that FitNET and FitNET+ outperform other state-of-the-art methods, supporting the idea that a user should be modeled at the granularity of his or her itinerary rather than a single order. In addition, FitNET+ further improves over FitNET by on average 9.4% in precision and 2.4% in hit rate, which indicates the importance of leveraging the historical itineraries of users to better capture their needs.
Real-world data in multi-class classification tasks often show complex characteristics that reduce classification performance. Major analytical challenges are a high degree of multi-class imbalance within the data and a heterogeneous feature space, which increases the number and complexity of class patterns. Existing solutions to classification or data pre-processing address only one of these two challenges in isolation. We propose a novel classification approach that explicitly addresses both challenges, multi-class imbalance and heterogeneous feature space, together. As its main contribution, this approach exploits domain knowledge in terms of a taxonomy to systematically prepare the training data. Based on an experimental evaluation on both real-world data and several synthetically generated datasets, we show that our approach outperforms competing classification techniques in terms of accuracy. Furthermore, it entails considerable practical benefits in real-world use cases, e.g., it reduces the rework required in the area of product quality control.
Uncertain butterflies are among the most important graphlet structures on uncertain bipartite networks. In this paper, we examine the uncertain butterfly structure (in which the existential probability of the graphlet is greater than or equal to a threshold parameter), as well as the global Uncertain Butterfly Counting Problem (counting the total number of such instances over an entire network). To solve this task, we propose a non-trivial exact baseline (UBFC), as well as an improved algorithm (IUBFC) which we show to be faster both theoretically and practically. We also design two sampling frameworks (UBS and PES) which sample a vertex, edge, or wedge from the network uniformly and estimate the global count quickly. Furthermore, a notable butterfly-based community structure examined in the past is the k-bitruss. We adapt this community structure to the uncertain bipartite graph setting and introduce the Uncertain Bitruss Decomposition Problem (whose solution can directly answer any k-bitruss search query for any k). We then propose an exact algorithm (UBitD) to solve our problem, with three variations in deriving the initial uncertain support. Using a range of networks with different edge existential probability distributions, we validate the efficiency and effectiveness of our solutions.
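To make the counted object concrete: a butterfly is a 2x2 biclique, and under the common edge-independence assumption its existential probability is the product of its four edge probabilities. The brute-force sketch below (toy data, nothing like UBFC/IUBFC's optimized enumeration) counts butterflies whose probability meets a threshold:

```python
from itertools import combinations
from math import prod

def uncertain_butterfly_count(edges, threshold):
    """Count butterflies (2x2 bicliques) whose existential probability,
    the product of their four edge probabilities, meets the threshold.
    `edges` maps (left, right) vertex pairs to edge probabilities."""
    left = {u for u, _ in edges}
    right = {v for _, v in edges}
    count = 0
    for us in combinations(sorted(left), 2):
        for vs in combinations(sorted(right), 2):
            probs = [edges.get((u, v)) for u in us for v in vs]
            # All four edges must exist to form a butterfly.
            if None not in probs and prod(probs) >= threshold:
                count += 1
    return count

edges = {
    ("u1", "v1"): 0.9, ("u1", "v2"): 0.9,
    ("u2", "v1"): 0.9, ("u2", "v2"): 0.2,
    ("u3", "v1"): 0.9, ("u3", "v2"): 0.9,
}
```

With a high threshold only the all-0.9 butterfly {u1, u3} x {v1, v2} qualifies; lowering the threshold admits the two butterflies that include the 0.2-probability edge.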
In this paper, we present the first comprehensive survey of window types for stream processing systems which have been presented in research and commercial systems. We cover publications from the most relevant conferences, journals, and system whitepapers on stream processing, windowing, and window aggregation which have been published over the last 20 years. For each window type, we provide detailed specifications, formal notations, synonyms, and use-case examples. We classify each window type according to categories that have been proposed in literature and describe the out-of-order processing. In addition, we examine academic, commercial, and open-source systems with respect to the window types that they support. Our survey offers a comprehensive overview that may serve as a guideline for the development of stream processing systems, window aggregation techniques, and frameworks that support a variety of window types.
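To make two of the most common window types concrete, the sketch below assigns a single timestamp to tumbling and sliding windows (a simplification: real systems must also track watermarks and out-of-order arrivals):

```python
def tumbling_windows(ts, size):
    """A tumbling window assigns a timestamp to exactly one fixed,
    non-overlapping window of length `size`."""
    start = (ts // size) * size
    return [(start, start + size)]

def sliding_windows(ts, size, slide):
    """A sliding window assigns a timestamp to every window of length
    `size` (advancing by `slide`) that covers it."""
    windows = []
    start = (ts // slide) * slide
    while start + size > ts:
        windows.append((start, start + size))
        start -= slide
    return windows

t_windows = tumbling_windows(7, size=10)
s_windows = sliding_windows(7, size=10, slide=5)
```

Timestamp 7 lands in exactly one tumbling window [0, 10), but in two sliding windows, [0, 10) and [5, 15), which is why sliding-window aggregation needs sharing techniques to avoid duplicated work.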
Subgraph counting, a fundamental problem in network analysis, is to count the number of subgraphs in a data graph that match a given query graph by either homomorphism or subgraph isomorphism. The importance of subgraph counting derives from the fact that it provides insights into a large graph, in particular a labeled graph, when a collection of query graphs with different sizes and labels are issued. The counting problem is challenging. On the one hand, exact counting by enumerating subgraphs is NP-hard. On the other hand, approximate counting by subgraph isomorphism can only support small query graphs over unlabeled graphs. Another way to count subgraphs is to specify the task as an SQL query and estimate the cardinality of the query in an RDBMS. Existing approaches for cardinality estimation can only support subgraph counting by homomorphism to some extent, as it is difficult to deal with sampling failure when a query graph becomes large. A question that arises is how we can support subgraph counting by machine learning (ML) and deep learning (DL). To devise an ML/DL solution, apart from handling the query graphs, another issue is dealing with large data graphs, as the existing DL approach for subgraph isomorphism counting can only support small data graphs. In addition, the ML/DL approaches proposed in the RDBMS context for approximate query processing and cardinality estimation cannot be used, as subgraph counting requires complex self-joins over one relation, whereas existing approaches focus on multiple relations. In this work, we propose an active learned sketch for subgraph counting (ALSS) with two main components: a learned sketch for subgraph counting and an active learner. 
The sketch is constructed by a neural network regression model, and the active learner performs model updates based on newly arriving test query graphs. Our holistic learning framework supports both undirected and directed graphs, whose nodes and/or edges are associated with zero to multiple labels. We conduct extensive experimental studies to confirm the effectiveness and efficiency of ALSS using large real labeled graphs. Moreover, we show that ALSS can assist query optimizers in finding better query plans for complex multi-way self-joins.
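To pin down what is being estimated, a brute-force homomorphism counter (nothing like ALSS's learned sketch, just the definition) enumerates all mappings of query nodes to data nodes and keeps those that preserve every query edge:

```python
from itertools import product

def count_homomorphisms(query_edges, data_edges):
    """Count mappings from query nodes to data nodes that preserve
    every query edge (homomorphisms: nodes need not map injectively,
    which is what distinguishes this from subgraph isomorphism)."""
    qnodes = sorted({u for e in query_edges for u in e})
    dnodes = sorted({u for e in data_edges for u in e})
    E = set(data_edges)
    count = 0
    for assignment in product(dnodes, repeat=len(qnodes)):
        phi = dict(zip(qnodes, assignment))
        if all((phi[u], phi[v]) in E for u, v in query_edges):
            count += 1
    return count

# Undirected path a-b-c, with each edge stored in both directions.
data = [("a", "b"), ("b", "a"), ("b", "c"), ("c", "b")]
```

A single-edge query counts the directed edges (4 here); a two-edge path counts walks of length two (6 here, including non-injective ones such as a-b-a), illustrating why homomorphism counting maps to self-joins over one edge relation.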
To bridge the gap between users and data, numerous text-to-SQL systems have been developed that allow users to pose natural language questions over relational databases. Recently, novel text-to-SQL systems are adopting deep learning methods with very promising results. At the same time, several challenges remain open making this area an active and flourishing field of research and development. To make real progress in building text-to-SQL systems, we need to de-mystify what has been done, understand how and when each approach can be used, and, finally, identify the research challenges ahead of us. The purpose of this survey is to present a detailed taxonomy of neural text-to-SQL systems that will enable a deeper study of all the parts of such a system. This taxonomy will allow us to make a better comparison between different approaches, as well as highlight specific challenges in each step of the process, thus enabling researchers to better strategise their quest towards the “holy grail” of database accessibility.
The subtrajectory query is a fundamental operator in mobility data management, useful in applications such as trajectory clustering, co-movement pattern mining, and contact tracing in epidemiology. In this paper, we make the first attempt to study subtrajectory queries in trillion-scale GPS databases, so as to support applications with urban-scale moving users and weeks-long historical data. We develop SQUID, a distributed subtrajectory query processing engine on Spark, with threefold technical contributions. First, we propose compact index and storage layers to handle massive trajectory datasets with trillion-scale GPS points. Second, we leverage hybrid partitioning, together with local indexes that are disk-I/O friendly, to facilitate pruning. Third, we devise a novel filter-and-refine query processing framework to effectively reduce the number of trajectories requiring verification. Our experiments are conducted on huge trajectory datasets with up to 520 billion GPS points. The results validate the compactness of the storage mechanism and the scalability of the distributed query processing framework.
We present Ditto, a novel entity matching (EM) system based on pre-trained Transformer language models. We fine-tune and cast EM as a sequence-pair classification problem to leverage such models with a simple architecture. Our experiments show that a straightforward application of language models such as BERT, DistilBERT, or RoBERTa pre-trained on large text corpora already significantly improves the matching quality and outperforms the previous state-of-the-art (SOTA) by up to 29% in F1 score on benchmark datasets. We also developed three optimization techniques to further improve Ditto's matching capability. Ditto allows domain knowledge to be injected by highlighting important pieces of input information that may be of interest when making matching decisions. Ditto also summarizes strings that are too long, so that only the essential information is retained and used for EM. Finally, Ditto adapts a SOTA data augmentation technique for text to EM, augmenting the training data with (difficult) examples. This way, Ditto is forced to learn "harder" to improve the model's matching capability. The optimizations we developed further boost Ditto's performance by up to 9.8%. Perhaps more surprisingly, we establish that Ditto can achieve the previous SOTA results with at most half the amount of labeled data. Finally, we demonstrate Ditto's effectiveness on a real-world large-scale EM task. On matching two company datasets consisting of 789K and 412K records, Ditto achieves a high F1 score of 96.5%.
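The sequence-pair formulation can be illustrated with an attribute-value serialization in the spirit of Ditto's [COL]/[VAL] scheme (the records and helper names below are made up for the sketch; the resulting string would be fed to a Transformer sequence-pair classifier):

```python
def serialize(record):
    """Flatten an attribute-value record into a token sequence,
    in the spirit of Ditto-style [COL]/[VAL] serialization."""
    return " ".join(f"[COL] {k} [VAL] {v}" for k, v in record.items())

def serialize_pair(r1, r2):
    """A candidate record pair becomes one input sequence for a
    sequence-pair classifier (match / no-match)."""
    return f"{serialize(r1)} [SEP] {serialize(r2)}"

a = {"name": "Apple iPhone 12", "price": "699"}
b = {"name": "iPhone 12 (Apple)", "price": "699.00"}
pair = serialize_pair(a, b)
```

Serializing both records into one sequence lets the language model attend across the pair, which is what allows a simple classification head to decide match versus no-match.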
Real-time detection of anomalies in streaming data is receiving increasing attention, as it allows us to raise alerts, predict faults, and detect intrusions or threats across industries. Yet, little attention has been given to comparing the effectiveness and efficiency of anomaly detectors for streaming data (i.e., of online algorithms). In this paper, we present a qualitative, synthesizing overview of major online detectors from different algorithmic families (i.e., distance-, density-, tree-, or projection-based) and highlight their main ideas for constructing, updating, and testing detection models. Then, we provide a thorough analysis of the results of a quantitative experimental evaluation of online detection algorithms, along with their offline counterparts. The behavior of the detectors is correlated with the characteristics of different datasets (i.e., meta-features), thereby providing a meta-level analysis of their performance. Our study addresses several insights missing from the literature, such as (a) how reliable detectors are against a random classifier and which dataset characteristics make them perform randomly; (b) to what extent online detectors approximate the performance of their offline counterparts; (c) which sketch strategy and update primitives of detectors are best for detecting anomalies visible only within a feature subspace of a dataset; (d) what the trade-offs are between the effectiveness and the efficiency of detectors belonging to different algorithmic families; and (e) which specific dataset characteristics lead an online algorithm to outperform all others.
Differential privacy is the state-of-the-art formal definition for data release under strong privacy guarantees. A variety of mechanisms have been proposed in the literature for releasing the output of numeric queries (e.g., the Laplace mechanism and the smooth sensitivity mechanism). These mechanisms guarantee differential privacy by adding noise to the query's true output. The amount of noise added is calibrated by the notions of global sensitivity and local sensitivity of the query, which measure the impact of the addition or removal of an individual on the query's output. Mechanisms that use local sensitivity add less noise and, consequently, yield more accurate answers. However, although there has been some work on generic mechanisms for releasing the output of non-numeric queries using global sensitivity (e.g., the exponential mechanism), the literature lacks generic mechanisms for releasing the output of non-numeric queries using local sensitivity to reduce the noise in the query's output. In this work, we remedy this shortcoming and present the local dampening mechanism. We adapt the notion of local sensitivity to the non-numeric setting and leverage it to design a generic non-numeric mechanism. We provide theoretical comparisons to the exponential mechanism and show under which conditions the local dampening mechanism is more accurate than the exponential mechanism. We illustrate the effectiveness of the local dampening mechanism by applying it to three diverse problems: (i) percentile selection, where we report the p-th element in the database; (ii) influential node analysis, where, given an influence metric, we release the top-k most influential nodes while preserving the privacy of the relationships between nodes in the network; and (iii) decision tree induction, where we provide a private adaptation of the ID3 algorithm to build decision trees from a given tabular dataset.
Experimental evaluation shows that we can reduce the error for the percentile selection application by up to 73%, reduce privacy budget consumption by 2 to 4 orders of magnitude for the influential node analysis application, and increase accuracy by up to 12% for decision tree induction, compared to global sensitivity-based approaches. Finally, to illustrate the scalability of our local dampening mechanism, we empirically evaluate its runtime performance on the influential node analysis problem and show sub-quadratic behavior.
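The global-sensitivity baseline this work compares against, the exponential mechanism, can be sketched as follows. This is a textbook-style toy (not the paper's local dampening mechanism); the percentile utility function at the end is an invented illustration:

```python
import math
import random

def exponential_mechanism(candidates, utility, epsilon, sensitivity):
    """Sample a candidate with probability proportional to
    exp(epsilon * u(c) / (2 * sensitivity)) -- the standard
    global-sensitivity form of the exponential mechanism."""
    weights = [math.exp(epsilon * utility(c) / (2 * sensitivity))
               for c in candidates]
    total = sum(weights)
    r = random.random() * total
    acc = 0.0
    for c, w in zip(candidates, weights):
        acc += w
        if r <= acc:
            return c
    return candidates[-1]

# Toy percentile selection: utility is minus the rank distance to the median.
data = sorted([3, 7, 1, 9, 4])
target = len(data) // 2
choice = exponential_mechanism(
    data, lambda v: -abs(data.index(v) - target), epsilon=1.0, sensitivity=1.0)
```

Local-sensitivity approaches such as local dampening aim to replace the worst-case `sensitivity` above with a data-dependent quantity, concentrating more probability mass on high-utility outputs.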
Many real-world networks evolve over time and are naturally modeled as temporal graphs. A temporal graph is informative and contains two types of features, i.e., temporal features and topological features, where the temporal features relate to the establishing time of the relationships in the graph, and the topological features are determined by its structure. In this paper, considering both types of features, we perform time-topology analysis on temporal graphs to analyze their cohesiveness and extract cohesive subgraphs. First, a new metric named T-cohesiveness is proposed to evaluate the cohesiveness of a temporal subgraph from the time and topology dimensions jointly. Specifically, given a temporal subgraph Gs=(Vs,Es), cohesiveness in the time dimension reflects whether the connections in Gs happen within a short period of time, while cohesiveness in the topology dimension indicates whether the vertices in Vs are densely connected and have few connections with vertices outside Gs. Then, T-cohesiveness is utilized to perform time-topology analysis on temporal graphs, and two analysis methods are proposed. In detail, T-cohesiveness evolution tracking traces the evolution of the T-cohesiveness of a subgraph, and combo searching finds cohesive subgraphs containing the query vertex whose T-cohesiveness values are larger than a given threshold. Moreover, since combo searching is NP-hard, a pruning strategy is proposed to estimate the upper bound of the T-cohesiveness value and thereby improve the efficiency of combo searching. Experimental results demonstrate the efficiency of the proposed time-topology analysis methods and the pruning strategy. In addition, four alternative definitions of T-cohesiveness are compared with ours, and the experimental results confirm the superiority of our definition.
Data-centric AI is at the center of a fundamental shift in software engineering, where machine learning becomes the new software, powered by big data and computing infrastructure. Software engineering then needs to be rethought, with data becoming a first-class citizen on par with code. One striking observation is that a significant portion of the machine learning process is spent on data preparation. Without good data, even the best machine learning algorithms cannot perform well. As a result, data-centric AI practices are now becoming mainstream. Unfortunately, many datasets in the real world are small, dirty, biased, and even poisoned. In this survey, we study the research landscape of data collection and data quality, primarily for deep learning applications. Data collection is important because recent deep learning approaches need less feature engineering but larger amounts of data. For data quality, we study data validation, cleaning, and integration techniques. Even if the data cannot be fully cleaned, we can still cope with imperfect data during model training using robust training techniques. In addition, while bias and fairness have been less studied in traditional data management research, these issues are essential topics in modern machine learning applications. We thus study fairness measures and unfairness mitigation techniques that can be applied before, during, or after model training. We believe that the data management community is well positioned to solve these problems.
Existing systems dealing with the increasing volume of data series cannot guarantee interactive response times, even for fundamental tasks such as similarity search. Therefore, it is necessary to develop analytic approaches that support exploration and decision making by providing progressive results before the final and exact ones have been computed. Prior works lack both efficiency and accuracy when applied to large-scale data series collections. We present and experimentally evaluate ProS, a new probabilistic learning-based method that provides quality guarantees for progressive nearest neighbor (NN) query answering. We develop our method for k-NN queries and demonstrate how it can be applied with the two most popular distance measures, namely Euclidean distance and dynamic time warping. We provide both initial and progressive estimates of the final answer, which improve as the similarity search progresses, as well as suitable stopping criteria for progressive queries. Moreover, we describe how this method can be used to develop a progressive algorithm for data series classification (based on a k-NN classifier), and we additionally propose a method designed specifically for the classification task. Experiments with diverse synthetic and real datasets demonstrate that our prediction methods constitute the first practical solutions to the problem, significantly outperforming competing approaches.
Result diversification is extensively studied in the context of search, recommendation, and data exploration. There are numerous algorithms that return top-k results that are both diverse and relevant. These algorithms typically have computational loops that compare the pairwise diversity of records to decide which ones to retain. We propose an access primitive DivGetBatch() that replaces repeated pairwise comparisons of diversity scores of records by pairwise comparisons of "aggregate" diversity scores of a group of records, thereby improving the running time of these algorithms while preserving the same results. We integrate the access primitive inside three representative diversity algorithms and prove that the augmented algorithms leveraging the access primitive preserve original results. We analyze the worst and expected case running times of these algorithms. We propose a computational framework to design this access primitive that has a pre-computed index structure I-tree that is agnostic to the specific details of diversity algorithms. We develop principled solutions to construct and maintain I-tree. Our experiments on multiple large real-world datasets corroborate our theoretical findings, while ensuring up to a 24× speedup.
Data lineage allows information to be traced to its origin in data analysis by showing how results were derived. Although many methods have been proposed to identify the source data from which analysis results are derived, analysis is becoming increasingly complex, both with regard to the target (e.g., images, videos, and texts) and the technology (e.g., AI and machine learning (ML)). In such complex data analysis, simply showing the source data may not ensure traceability. For example, ML analysts building image classifier models often need to know which parts of images are relevant to the output and why the classifier made a decision. Recent studies have intensively investigated interpretability and explainability in the AI/ML domain. Integrating these techniques into the lineage framework helps analysts understand more precisely how the analysis results were derived and why they can be trusted. In this paper, we propose the concept of augmented lineage, an extended form of lineage, for this purpose, together with an efficient method to derive the augmented lineage for complex data analysis. We express complex data analysis flows using relational operators combined with user-defined functions (UDFs), where UDFs can represent invocations of AI/ML models within the analysis. We then present a method that takes UDFs into consideration to derive the augmented lineage for arbitrarily chosen tuples among the analysis results. We also experimentally demonstrate the efficiency of the proposed method.
Graph neural networks (GNNs) and their variants have generalized deep learning methods to non-Euclidean graph data, bringing substantial improvements in many graph mining tasks. In practice, a large graph may be partitioned across different databases. Recently, user privacy protection has become a crucial concern in practical machine learning, which motivates us to explore a GNN framework that enables data sharing without violating user privacy. However, it is challenging to scale GNN training to edge-partitioned distributed graph databases while preserving data privacy and model quality. In this paper, we propose a privacy-preserving collaborative GNN training framework, P2CG, which aims to obtain model performance competitive with the centralized setting. We present a clustering-based differential privacy algorithm to reduce the model degradation caused by noisy edge generation. Moreover, we propose a novel interaction-based secure multi-layer graph convolution algorithm to alleviate the noise diffusion problem. Experimental results on benchmark datasets and a production dataset from Tencent Inc. show that P2CG can significantly increase model performance and obtain results competitive with the centralized setting.
Groupjoins combine the execution of a join and a subsequent group-by. They are common in analytical queries and occur in a sizeable fraction of the queries in TPC-H and TPC-DS. While they were originally invented to improve performance, efficient parallel execution of groupjoins can be limited by contention in many-core systems. Efficient implementations of groupjoins are highly desirable, as groupjoins are not only used to fuse group-by and join, but are also useful for efficiently executing nested aggregates. For these, the query optimizer needs to reason over the result of aggregation to schedule it optimally. Traditional systems quickly reach the limits of their selectivity and cardinality estimation over computed columns and often treat group-by as an optimization barrier. In this paper, we present techniques to efficiently estimate, plan, and execute groupjoins and nested aggregates. We propose four novel techniques: aggregate estimates to predict the result distributions of aggregates, parallel groupjoin execution for scalable execution of groupjoins, index groupjoins, and a greedy eager aggregation optimization technique that introduces nested preaggregations to significantly improve execution plans. The resulting system has improved estimates, better execution plans, and a contention-free evaluation of groupjoins, which speeds up TPC-H and TPC-DS queries significantly.
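The fusion of join and group-by that a groupjoin performs can be sketched as follows. This is a toy single-threaded illustration of the general idea, not the paper's contention-free parallel implementation; the field names (`ck`, `v`, `sum_v`) are invented:

```python
def hash_groupjoin(left, right, key):
    """Fused join + group-by: one hash table keyed on the shared join/group
    key. Each left row opens a group; matching right rows are aggregated
    directly, without materializing the intermediate join result."""
    groups = {row[key]: {**row, "sum_v": 0} for row in left}
    for row in right:
        g = groups.get(row[key])
        if g is not None:        # non-matching right rows are dropped
            g["sum_v"] += row["v"]
    return list(groups.values())

customers = [{"ck": 1, "name": "a"}, {"ck": 2, "name": "b"}]
orders = [{"ck": 1, "v": 10}, {"ck": 1, "v": 5}, {"ck": 3, "v": 99}]
result = hash_groupjoin(customers, orders, "ck")
```

Because the join key and the group key coincide, a single pass over the probe side both joins and aggregates, which is exactly what makes separate join-then-aggregate plans wasteful.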
Frequent queries on semi-structured hierarchical data are Content-and-Structure (CAS) queries that filter data items based on their location in the hierarchical structure and their value for some attribute. We propose the Robust and Scalable Content-and-Structure (RSCAS) index to efficiently answer CAS queries on big semi-structured data. To get an index that is robust against queries with varying selectivities, we introduce a novel dynamic interleaving that merges the path and value dimensions of composite keys in a balanced manner. We store interleaved keys in our trie-based RSCAS index, which efficiently supports a wide range of CAS queries, including queries with wildcards and descendant axes. We implement RSCAS as a log-structured merge tree to scale it to data-intensive applications with a high insertion rate. We illustrate RSCAS’s robustness and scalability by indexing data from the Software Heritage (SWH) archive, which is the world’s largest, publicly available source code archive.
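The idea of merging the path and value dimensions of a composite key can be illustrated with a fixed byte-alternating interleaving. This is only a simplified sketch: RSCAS's dynamic interleaving adapts the alternation to the data distribution, which this toy does not attempt:

```python
def interleave(path_key: bytes, value_key: bytes) -> bytes:
    """Naive alternating interleaving of the two key dimensions, so that a
    trie over the interleaved keys discriminates on path and value bytes in
    a balanced manner. (A fixed zip-style variant for illustration only.)"""
    out = bytearray()
    for p, v in zip(path_key, value_key):
        out += bytes([p, v])
    # Append whatever remains of the longer dimension.
    tail = path_key[len(value_key):] or value_key[len(path_key):]
    return bytes(out) + tail

key = interleave(b"/a/b", b"\x00\x2a")
```

Keys sharing a path prefix and a value prefix then share a long prefix of the interleaved key, which is what lets one trie serve both path-selective and value-selective CAS queries.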
Data series approximate similarity search is a basic building block essential for almost all analytical tasks. To speed up this important operation, the prevalent approach is to construct indexes directly on the data series objects, which suffers from very high construction time and storage cost due to the inherent complexity of indexing these high-dimensional objects. We instead design a promising new approach that leverages correlations between the high-dimensional data series objects and the (often simple) partitioning attribute(s) in distributed data series repositories. Our proposed infrastructure, called PARROT, discovers, assesses, and exploits such correlations for similarity query optimization. PARROT addresses several critical challenges, including the high dimensionality of the data series objects, the softness (uncertainty) of correlations, correlation granularity, and the lack of a proper measure for assessing correlation strength in big data series. We present scalable solutions tackling each of these challenges, including pattern-level indexing, exception handling strategies for soft correlations, and a new entropy-based measure for assessing correlation strength and judging its potential effectiveness. The PARROT query engine efficiently supports approximate kNN similarity queries by leveraging the PARROT index. The PARROT prototype is implemented on Apache Spark. Extensive experiments on real and synthetic datasets demonstrate that PARROT has substantially lower index construction costs, smaller storage overhead, and better performance and accuracy for processing similarity queries compared to alternative state-of-the-art solutions.
In this paper, we study hypercore maintenance in large-scale dynamic hypergraphs. A hypergraph, whose hyperedges may connect a set of vertices rather than just two as in ordinary pairwise graphs, can represent the complex interactions arising in more sophisticated applications. However, the exponential number of hyperedges incurs unaffordable costs when recomputing the hypercore numbers of vertices and hyperedges after a hypergraph update. This motivates us to propose an efficient approach for exact hypercore maintenance that significantly reduces the hypercore updating time compared with recomputation approaches. The proposed algorithms can pinpoint the vertices and hyperedges whose hypercore numbers have to be updated by traversing only a small sub-hypergraph. We also propose an index called Core-Index that facilitates our maintenance algorithms. Extensive experiments on real-world and temporal hypergraphs demonstrate the superiority of our algorithms in terms of efficiency.
Approximate nearest neighbor search (ANNS) is a fundamental problem that has attracted widespread attention for decades. Multi-probe ANNS is one of the most important classes of ANNS methods, playing crucial roles in disk-based, GPU-based, and distributed scenarios. The state-of-the-art multi-probe ANNS approaches typically run with a fixed configuration. For example, each query is dispatched to a fixed number of partitions to run ANNS algorithms locally, and the results are merged to obtain the final result set. Our observations show that such fixed configurations typically lead to a non-optimal accuracy–efficiency trade-off. To further optimize multi-probe ANNS, we propose to generate efficient configurations for each query individually. By formalizing the per-query optimization as a 0–1 knapsack problem and its variants, we identify that the kNN distribution (the proportion of the k nearest neighbors of a query placed in each partition) is essential to the optimization. We then develop LEQAT (LEarned Query-Aware OpTimizer), which leverages the kNN distribution to seek optimal configurations for each query. LEQAT comes with (i) a machine learning model to learn and estimate kNN distributions based on historical or sample queries and (ii) efficient query optimization algorithms to determine the partitions to probe and the number of neighbors to search in each partition. We apply LEQAT to three state-of-the-art ANNS methods, IVF, HNSW, and SSG, under clustering-based partitioning, evaluating the overall performance on several real-world datasets. The results show that LEQAT consistently reduces latency by up to 58% and improves throughput by up to 3.9 times.
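How a kNN distribution can drive per-query partition selection can be sketched with a greedy knapsack-style heuristic. This is a simplified illustration under invented numbers, not LEQAT's actual optimizer (which also sets per-partition search depths and handles knapsack variants):

```python
def choose_partitions(knn_fraction, budget):
    """Greedy 0-1 knapsack-style selection: probe the partitions predicted
    to hold the largest fraction of the query's k nearest neighbors, until
    the probing budget (number of partitions) is exhausted."""
    ranked = sorted(knn_fraction.items(), key=lambda kv: kv[1], reverse=True)
    chosen, covered = [], 0.0
    for pid, frac in ranked:
        if len(chosen) == budget:
            break
        chosen.append(pid)
        covered += frac
    return chosen, covered

# Hypothetical estimated kNN distribution for one query over 4 partitions.
dist = {"p0": 0.55, "p1": 0.30, "p2": 0.10, "p3": 0.05}
chosen, covered = choose_partitions(dist, budget=2)
```

With a fixed configuration every query would probe the same partitions; here the query above covers an estimated 85% of its true neighbors with only two probes, illustrating why per-query configurations beat fixed ones.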
Implementation of an ERP system in a manufacturing organization is closely related to customization of the standard prototype offered by a software house. In the case of manufacturing health-care products, some additional aspects must be added to the standard information circuits: QC, ISO, and FDA-requested data should be processed in parallel with bookkeeping data. In this paper, the use of SAP Fiori in the QC process of inbound logistics is presented. Adding SAP Fiori to the existing SAP solution allows the creation of an informal sociotechnical group of users with new key competences, bringing benefits to the organization that can be measured with the CBA method. It is concluded that integrated IT systems of the ERP class may develop the sociotechnical systems of the organizations that use them, leading to process re-engineering and transition changes. A case study of changes in the process map of a plant manufacturing health-care products is presented and discussed.
The Multi-Constraint Shortest Path (MCSP) problem aims to find the shortest path between two nodes in a network subject to a given constraint set. It is typically processed as a skyline path problem. However, the number of intermediate skyline paths grows as the network size increases and the number of constraints grows, which brings about a dramatic growth in computational cost and makes existing index-based methods hardly capable of obtaining complete exact results. In this paper, we propose a novel high-dimensional skyline path concatenation method that avoids the expensive skyline path search and supports the efficient construction of a hop labeling index for MCSP queries. Specifically, a set of insightful observations and techniques is proposed to improve the efficiency of concatenating two skyline path sets, an n-Cube technique is designed to prune the concatenation space among multiple hops, and a constraint pruning method is used to avoid unnecessary computation. Furthermore, to scale to larger networks, we propose a novel forest hop labeling that enables parallel label construction from different network partitions. Our approach is the first method that achieves both accuracy and efficiency for MCSP queries. Extensive experiments on real-life road networks demonstrate the superiority of our method over state-of-the-art solutions.
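The skyline (Pareto) filtering at the heart of skyline path processing can be sketched as follows. This is a textbook dominance check on toy cost vectors, not the paper's concatenation or pruning machinery; the (length, toll) dimensions are invented for illustration:

```python
def dominates(p, q):
    """Cost vector p dominates q if it is no worse in every dimension and
    strictly better in at least one."""
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))

def skyline(paths):
    """Keep only the non-dominated multi-criteria path cost vectors."""
    return [p for p in paths
            if not any(dominates(q, p) for q in paths if q is not p)]

# (length, toll) cost vectors of candidate paths between two nodes.
paths = [(10, 5), (8, 7), (12, 4), (9, 9), (11, 6)]
sky = skyline(paths)
```

The quadratic dominance checks above are exactly what explodes as networks and constraint sets grow, which motivates the paper's concatenation-based construction that sidesteps full skyline search.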
In this paper, we study the Time-Dependent k Nearest Neighbor (TD-kNN) query on moving objects, which aims to return the k objects that reach the query location with the least travel cost when departing at a given time t. Although the kNN query on moving objects has been widely studied for static road networks, the TD-kNN query is more complicated and challenging because, under a time-dependent road network, the cost of each edge is given by a cost function rather than a fixed distance value. To tackle this difficulty, we adopt the framework of GLAD and develop an advanced index structure to support efficient fastest-travel-cost queries on time-dependent road networks. In particular, we propose the Time-Dependent H2H (TD-H2H) index, which pre-computes the aggregated weight functions between each node and some specific nodes in the decomposition tree derived from the road network. Additionally, we establish a grid index on moving objects for candidate object retrieval and location updates. To further accelerate the TD-kNN query, two pruning strategies are proposed in our solution. We also extend our framework to tackle the time-dependent approachable kNN (TD-AkNN) query on moving objects, targeting taxi-hailing services where a moving object might already be occupied. Extensive experiments with different parameter settings on real-world road networks show that our solutions for both TD-kNN and TD-AkNN queries outperform the competitors by orders of magnitude.
The Bloom filter is a compact, memory-efficient probabilistic data structure supporting membership testing, i.e., checking whether an element is in a given set. However, as a Bloom filter maps each element with random hash functions, little flexibility is provided even if information about negative keys (elements not in the set) is available, especially when the misidentification of different negative keys carries different costs. The problem worsens when the hash functions are non-uniform, i.e., map elements into the Bloom filter non-uniformly. To address this problem, we propose a new hash adaptive Bloom filter (HABF) that supports customizing the hash functions for keys. In addition, we propose a filter family including f-HABF (a fast hashing version), c-HABF (a cache-friendly version), and s-HABF (a stacked version). We show that the HABF family is Pareto-optimal among all compared filters in terms of accuracy and query latency. We conduct extensive experiments on representative datasets, and the results show that the HABF family outperforms the standard Bloom filter and its cutting-edge variants overall in terms of accuracy, construction/query time, and memory consumption. All source code is publicly available.
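For context, the standard Bloom filter that HABF improves upon can be sketched as follows. This is the textbook structure only; HABF's key-customized hashing is not reproduced here, and the parameters are arbitrary:

```python
import hashlib

class BloomFilter:
    """Minimal textbook Bloom filter: k hash positions per element over an
    m-bit array. No false negatives; false positives possible."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray(m)  # one byte per bit, for simplicity

    def _positions(self, item):
        # Derive k independent positions by salting a cryptographic hash.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(m=1024, k=3)
bf.add("alice")
```

Every inserted key is guaranteed to test positive; the abstract's point is that this fixed, key-oblivious hashing wastes the opportunity to steer known costly negative keys away from set bits, which is what HABF's customized hash functions exploit.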
Enterprises create domain-specific knowledge bases (KBs) by curating and integrating their business data from multiple sources. To support a variety of query types over domain-specific KBs, we propose Hermes, an ontology-based system that allows storing KB data in multiple backends, and querying them with different query languages. In this paper, we address two important challenges in realizing such a system: data placement and schema optimization. First, we identify the best data store for any query type and determine the subset of the KB that needs to be stored in this data store, while minimizing data replication. Second, we optimize how we organize the data for best query performance. To choose the best data stores, we partition the data described by the domain ontology into multiple overlapping subsets based on the operations performed in a given query workload, and place these subsets in appropriate data stores according to their capabilities. Then, we optimize the schema on each data store to enable efficient querying. In particular, we focus on the property graph schema optimization, which has been largely ignored in the literature. We propose two algorithms to generate an optimized schema from the domain ontology. We demonstrate the effectiveness of our data placement and schema optimization algorithms with two real-world KBs from the medical and financial domains. The results show that the proposed data placement algorithm generates near-optimal data placement plans with minimal data replication overhead, and the schema optimization algorithms produce high-quality schemas, achieving up to two orders of magnitude speed-up compared to alternative schema designs.
We introduce the Reverse Spatial Top-k Keyword (RSK) query, defined as follows: given a query term q, an integer k, and a neighborhood size, find all neighborhoods of that size where q is among the top-k most frequent terms in the social posts within them. An obvious approach would be to partition the dataset with a uniform grid of a given cell size and identify the cells where the term is among the top-k most frequent keywords. However, this answer would be incomplete, since it only checks neighborhoods that are perfectly aligned with the grid. Furthermore, for every neighborhood (square) that is an answer, we can define infinitely many more result neighborhoods by minimally shifting the square without including more posts in it. To address this, we need to identify contiguous regions where any point in the region can be the center of a neighborhood that satisfies the query. We propose an algorithm to efficiently answer an RSK query using an index structure consisting of a uniform grid augmented by materialized lists of term frequencies. We apply various optimizations that drastically improve query latency over baseline approaches. We also provide a theoretical model to choose the cell size that minimizes query latency. We further examine a restricted version of the problem (RSKR) that limits the scope of the answer and propose efficient approximate algorithms. Finally, we examine how parallelism can improve performance by balancing the workload using a smart load-slicing technique. Extensive experimental evaluation of the proposed methods using real Twitter and crime report datasets shows the efficiency of our optimizations and the accuracy of the proposed theoretical model.
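The "obvious approach" the abstract starts from, checking grid-aligned cells only, can be sketched as follows. This toy (with invented post data) illustrates the filtering step; the paper's full algorithm additionally handles neighborhoods not aligned with the grid:

```python
from collections import Counter

def rsk_candidate_cells(posts, cell_size, term, k):
    """Bucket posts into uniform grid cells and keep the cells where `term`
    ranks among the k most frequent keywords."""
    cells = {}
    for x, y, words in posts:
        key = (int(x // cell_size), int(y // cell_size))
        cells.setdefault(key, Counter()).update(words)
    answer = []
    for key, counts in cells.items():
        topk = {w for w, _ in counts.most_common(k)}
        if term in topk:
            answer.append(key)
    return answer

posts = [(0.2, 0.4, ["coffee", "music"]),
         (0.8, 0.1, ["coffee"]),
         (5.5, 0.3, ["pizza", "pizza", "music"])]
cells = rsk_candidate_cells(posts, cell_size=1.0, term="coffee", k=1)
```

As the abstract notes, this answer is incomplete: a qualifying neighborhood may straddle cell boundaries, which is why the proposed index materializes term-frequency lists per cell and reasons about shifted squares.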
Data fusion, within the data integration pipeline, addresses the problem of discovering the true values of a data item when multiple sources provide different values for it. An important contribution to the solution of the problem can be given by assessing the quality of the involved sources and relying more on the values coming from trusted sources. State-of-the-art data fusion systems define source trustworthiness on the basis of the accuracy of the provided values and of the dependence on other sources, and it has recently also been recognized that the trustworthiness of the same source may vary with the domain of interest. In this paper, we propose STORM, a novel domain-aware algorithm for data fusion designed for the multi-truth case, that is, when a data item can also have multiple true values. Like many other data-fusion techniques, STORM relies on Bayesian inference. However, differently from the other Bayesian approaches to the problem, it determines the trustworthiness of sources by taking into account their authority: here, we define authoritative sources as those that have been copied by many others, assuming that, when source administrators decide to copy data from other sources, they choose the ones they perceive as the most reliable. To group together the values that have been recognized as variants representing the same real-world entity, STORM also provides a value-reconciliation step, thus reducing the possibility of making mistakes in the remaining part of the algorithm. The experimental results on multi-truth synthetic and real-world datasets show that STORM represents a solid step forward in data-fusion research.
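The multi-truth fusion setting can be illustrated with a toy trust-weighted voting scheme. This is a deliberate simplification with invented data: STORM's actual Bayesian inference and authority-based trust estimation are far richer, and the 0.8 threshold below is arbitrary:

```python
def fuse(claims, trust):
    """Toy truth discovery: score each claimed value by the summed trust of
    the sources asserting it; in multi-truth style, return every value whose
    score is close to the best one."""
    scores = {}
    for source, value in claims:
        scores[value] = scores.get(value, 0.0) + trust[source]
    best = max(scores.values())
    return {v for v, s in scores.items() if s >= 0.8 * best}

claims = [("s1", "Paris"), ("s2", "Paris"), ("s3", "Lyon")]
trust = {"s1": 0.9, "s2": 0.8, "s3": 0.3}
truths = fuse(claims, trust)
```

The open questions the abstract addresses are precisely how to obtain the `trust` values (via authority, i.e., how often a source is copied) and how to reconcile value variants before voting.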
Searching for a community containing a query node in an online social network enjoys wide applications such as recommendation and team organization. When applied to real-life networks, existing approaches face two major limitations. First, they usually take two steps, i.e., crawling a large part of the network first and then finding the community, but the entire network is usually too big and most of the data are not interesting to end users. Second, existing methods utilize hand-crafted rules to measure community membership, while it is very difficult to define effective rules, as communities are flexible for different query nodes. This paper proposes an interactive community search method based on graph neural networks (abbreviated as ICS-GNN+) to locate the target community over a subgraph collected on the fly from an online network iteratively. In each iteration, we first build a candidate subgraph around the query node and the labeled nodes. We then train a node classification model using a GNN to determine whether each node belongs to the target community; the model captures similarities between nodes by combining content and structural features seamlessly and flexibly under the guidance of users' labeling. Based on the probabilities inferred by the trained GNN, we introduce the k-sized Maximum-GNN-scores (kMG) community to describe the target community and design a method to locate the kMG community, which end users evaluate to provide further feedback. In addition, various optimization strategies are proposed, including an adaptive method to maintain the subgraph during iterations, combining a ranking loss into the GNN model, generating node embeddings enhanced by pseudo-labels from node clusters in the subgraph, and a greedy community searching method with globally computed benefit.
We conduct experiments on both offline and online real-life datasets and demonstrate that ICS-GNN+ can produce effective communities with low overhead in communication, computation, and user labeling.
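The kMG notion, a k-sized connected community around the query node with high total GNN scores, can be sketched with a simple greedy growth heuristic. This is an illustrative approximation, not the paper's exact search procedure; `adj` (adjacency sets) and `scores` (per-node GNN probabilities) are assumed inputs:

```python
import heapq

def kmg_community(adj, scores, query, k):
    """Greedy sketch of a k-sized maximum-GNN-score (kMG) community.

    adj: node -> set of neighbours; scores: node -> GNN probability.
    Grows a connected set from the query node, always absorbing the
    frontier node with the highest GNN score, until k nodes are chosen.
    """
    community = {query}
    frontier = [(-scores[n], n) for n in adj[query]]
    heapq.heapify(frontier)
    while len(community) < k and frontier:
        _, best = heapq.heappop(frontier)
        if best in community:
            continue  # stale heap entry
        community.add(best)
        for n in adj[best]:
            if n not in community:
                heapq.heappush(frontier, (-scores[n], n))
    return community
```

Because the result is grown from the query node, connectivity is maintained by construction; the heuristic trades global optimality for a single pass over the subgraph.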
Personal data management system (PDMS) solutions are flourishing, boosted by smart disclosure initiatives and new regulations. PDMSs allow users to easily store and manage data directly generated by their devices or resulting from their (digital) interactions. Users can then leverage the power of their PDMS to benefit from their personal data, for their own good and in the interest of the community. The PDMS paradigm thus brings exciting perspectives by unlocking novel usages, but it also raises security issues. An effective approach, considered in several recent works, is to keep user data distributed on personal platforms, secured locally using hardware and/or software security mechanisms. This paper goes beyond local security issues and addresses the important question of securely querying this massively distributed personal data. To this end, we propose DISPERS, a fully distributed peer-to-peer PDMS architecture. DISPERS allows users to securely and efficiently share and query their personal data, even in the presence of malicious nodes. We consider three increasingly powerful threat models and derive, for each, a security requirement that must be fulfilled to reach a lower bound on sensitive data leakage: (1) hidden communications, (2) random dispersion of data, and (3) collaborative proofs. These requirements are incremental and resist, respectively, spied-on, leaking, and corrupted nodes. We show that the expected security level can be guaranteed with near certainty and experimentally validate the efficiency of the proposed protocols, which allow an adjustable trade-off between the security level and its cost.
End-to-end AutoML has attracted intense interest from both academia and industry: it automatically searches for ML pipelines in a space induced by feature engineering, algorithm/model selection, and hyper-parameter tuning. Existing AutoML systems, however, suffer from scalability issues when applied to application domains with large, high-dimensional search spaces. We present VolcanoML, a scalable and extensible framework that facilitates systematic exploration of large AutoML search spaces. VolcanoML introduces and implements basic building blocks that decompose a large search space into smaller ones, and allows users to compose these building blocks into an execution plan for the AutoML problem at hand. VolcanoML further supports a Volcano-style execution model, akin to the one supported by modern database systems, to execute the constructed plan. Our evaluation demonstrates that not only does VolcanoML raise the level of expressiveness for search-space decomposition in AutoML, it also leads to decomposition strategies that are significantly more efficient than those employed by state-of-the-art AutoML systems such as auto-sklearn.
In many fields, e.g., data mining and machine learning, distance-based outlier detection (DOD) is widely employed to remove noise and find abnormal phenomena, because DOD is unsupervised, can be employed in any metric space, and makes no assumptions about the data distribution. Nowadays, data mining and machine learning applications face the challenge of dealing with large datasets, which requires efficient DOD algorithms. We address the DOD problem under two different definitions. Our key idea for both problems is to exploit an in-memory proximity graph. For each problem, we propose a new algorithm that exploits a proximity graph and analyze the appropriate type of proximity graph for the algorithm. Our empirical study on real datasets confirms that our DOD algorithms are significantly faster than state-of-the-art ones.
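For concreteness, one common DOD definition declares a point an outlier if fewer than k other points lie within distance r of it. The quadratic scan below is the naive baseline that proximity-graph-based algorithms are designed to accelerate (an illustrative sketch, not the paper's proposed algorithms):

```python
def distance_outliers(points, r, k, dist):
    """Naive (r, k)-distance-based outlier detection.

    A point is an outlier if fewer than k other points lie within
    distance r of it. O(n^2) scan; returns indices of outliers.
    """
    outliers = []
    for i, p in enumerate(points):
        close = sum(1 for j, q in enumerate(points)
                    if i != j and dist(p, q) <= r)
        if close < k:
            outliers.append(i)
    return outliers
```

A proximity graph helps because most inliers can be confirmed after inspecting only a few graph neighbours, avoiding the full pairwise scan.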
Influential nodes with rich connections in online social networks (OSNs) are of great value for initiating marketing campaigns. However, the potential influence spread these nodes can generate is hidden behind the structure of the OSN, which is often held by the OSN provider and unavailable to advertisers due to privacy concerns. In the social advertising model known as influencer marketing, OSN providers offer and price candidate nodes that advertisers purchase to seed marketing campaigns. In this setting, a reasonable price profile for the candidate nodes should effectively reflect the expected influence gain they can bring to a marketing campaign. In this paper, we study the problem of pricing influential nodes based on their expected influence spread, to help advertisers select the initiators of marketing campaigns without knowledge of the OSN structure. We design a function characterizing the divergence between the price and the expected influence of initiator sets, formulate the problem of minimizing this divergence, and derive an optimal price profile. An advanced algorithm is developed to estimate the price profile with accuracy guarantees. Experiments on real OSN datasets show that our pricing algorithm significantly outperforms other baselines.
Data exploration, the problem of extracting knowledge from a database even when we do not know exactly what we are looking for, is important for data discovery and analysis. However, precisely specifying SQL queries is not always practical, as in "finding and ranking off-road cars based on a combination of Price, Make, Model, Age, Mileage, etc.", not only because of query complexity (e.g., the queries may involve much if-then-else, and, or, and not logic), but also because the user typically does not know all data instances (and their variants). We propose DExPlorer, a system for interactive data exploration. From the user's perspective, we provide a simple and user-friendly interface that allows users to: (1) confirm whether a tuple is desired or not, and (2) decide whether one tuple is preferred over another. Behind the scenes, we jointly use multiple ML models to learn from these two types of user feedback. Moreover, to effectively keep the human in the loop, we must select a set of tuples to solicit feedback on at each interaction. We therefore devise question-selection algorithms that consider not only the estimated benefit of each tuple, but also the possible partial orders between any two suggested tuples. Experiments on real-world datasets show that DExPlorer outperforms existing approaches in effectiveness.
Crowdsourcing has been a prevalent way to obtain answers for tasks that need human intelligence. In general, a crowdsourcing platform is responsible for allocating workers to each received task, with high-quality workers in priority. However, the allocation results can in turn yield knowledge about workers’ quality. For example, unallocated workers are presumably less qualified. They can be upset if such information becomes public, which is an invasion of their privacy. To alleviate such concerns, we study the privacy-preserving worker allocation problem in this paper, aiming to properly allocate the workers while protecting their privacy. We propose worker allocation methods with the property of differential privacy, which proceed by first computing weights for each potential allocation and then sampling according to the weights. The Markov Chain Monte Carlo-based method is shown in our experiments to improve over the trivial random allocation method by 18.9% in terms of worker quality on synthetic data. On the real data, it realizes differential privacy with less than 20% loss in quality even when \(\epsilon = \frac{1}{3}\).
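The "compute weights, then sample" scheme is essentially the exponential mechanism of differential privacy. A minimal sketch, assuming a per-allocation quality score with bounded sensitivity; the paper's MCMC-based method samples from such a distribution without enumerating all allocations, which this naive version does:

```python
import math
import random

def dp_allocate(allocations, quality, eps, sensitivity=1.0, rng=random):
    """Exponential-mechanism sketch for private worker allocation.

    Samples one allocation with probability proportional to
    exp(eps * quality / (2 * sensitivity)); for a quality function with
    the given sensitivity this satisfies eps-differential privacy.
    """
    weights = [math.exp(eps * quality(a) / (2 * sensitivity))
               for a in allocations]
    x = rng.random() * sum(weights)
    for a, w in zip(allocations, weights):
        x -= w
        if x <= 0:
            return a
    return allocations[-1]  # guard against floating-point drift
```

Smaller eps flattens the weights toward uniform random allocation (more privacy, lower expected quality), which matches the quality-vs-epsilon trade-off reported in the abstract.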
Clustering is a fundamental primitive in many applications. To achieve valuable results in exploratory clustering analyses, the parameters of the clustering algorithm have to be set appropriately, which is a major pitfall. We observe multiple challenges for large-scale exploration processes. On the one hand, they require specific methods to efficiently explore large parameter search spaces. On the other hand, they often exhibit long runtimes, in particular when large datasets are analyzed with clustering algorithms of super-polynomial runtime, which must be executed repeatedly within exploratory clustering analyses. We address these challenges as follows: First, we present LOG-Means and show that it provides estimates for the number of clusters in sublinear time with respect to the defined search space, i.e., it provably requires fewer executions of a clustering algorithm than existing methods. Second, we demonstrate how to exploit fundamental characteristics of exploratory clustering analyses to significantly accelerate the (repetitive) execution of clustering algorithms on large datasets. Third, we show how both challenges can be tackled at the same time. To the best of our knowledge, this is the first work that addresses both of the above challenges simultaneously. In our comprehensive evaluation, we show that our proposed methods significantly outperform state-of-the-art methods, especially supporting novice analysts in exploratory clustering analyses within large-scale exploration processes.
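To convey the sublinear spirit of such estimators, the sketch below scans a geometric grid of k values and picks the k after the largest relative cost drop (an elbow heuristic). It needs only O(log k_max) clustering runs, but it is NOT the LOG-Means algorithm; `cost` stands in for one execution of a clustering algorithm:

```python
def estimate_k(cost, k_max):
    """Elbow sketch over a geometric grid of k values.

    cost(k) runs a clustering algorithm with k clusters and returns its
    cost (e.g., SSE). Only O(log k_max) runs are needed; the estimate is
    the k immediately after the largest relative cost drop.
    """
    ks = []
    k = 1
    while k <= k_max:
        ks.append(k)
        k *= 2
    costs = [cost(k) for k in ks]
    drops = [costs[i] / costs[i + 1] for i in range(len(ks) - 1)]
    best = max(range(len(drops)), key=drops.__getitem__)
    return ks[best + 1]
```

The contrast to a linear sweep over all k in the search space is exactly what "fewer executions of a clustering algorithm" refers to.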
More and more weakly structured and irregular data sources are becoming available every day. The schema of these sources is useful for a number of tasks, such as query answering, exploration, and summarization. However, although semantic web data may contain schema information, in many cases it is completely missing or only partially defined. In this paper, we present a survey of the state of the art in schema information extraction approaches. We analyze and classify these approaches into three families: (1) approaches that exploit the implicit structure of the data, without assuming that explicit schema statements are provided in the dataset; (2) approaches that use the explicit schema statements contained in the dataset to complement and enrich the schema; and (3) approaches that discover structural patterns contained in a dataset. We compare these studies in terms of their approach, advantages, and limitations. Finally, we discuss the problems that remain open.
Given a user-specified minimum degree threshold \(\gamma \), a \(\gamma \)-quasi-clique is a subgraph in which each vertex connects to at least a \(\gamma \) fraction of the other vertices. The quasi-clique is a natural definition for dense structures, so finding large, and hence statistically significant, quasi-cliques is useful in applications such as community detection in social networks and the discovery of significant biomolecule structures and pathways. However, mining maximal quasi-cliques is notoriously expensive, and even a recent algorithm for mining large maximal quasi-cliques is flawed and can lead to many repeated searches. This paper proposes a parallel solution for mining maximal quasi-cliques that fully utilizes CPU cores. Our solution uses divide and conquer to decompose the workload into independent tasks for parallel mining, and we address (i) the drastic load imbalance among tasks and (ii) the difficulty of predicting task running time and its growth with task-subgraph size, by (a) a timeout-based task decomposition strategy and (b) a priority task queue that schedules long-running tasks earlier for mining and decomposition to avoid stragglers. Unlike our conference version in PVLDB 2020, where the solution was built on a distributed graph mining framework called G-thinker, this paper targets a single-machine multi-core environment, which is more accessible to the average end user. A general framework called T-thinker is developed to facilitate the programming of parallel programs for algorithms that adopt divide and conquer, including but not limited to our quasi-clique mining algorithm. Additionally, we consider the problem of directly mining large quasi-cliques from dense parts of a graph, where we identify the repeated-search issue of a recent method and address it using a carefully designed concurrent trie data structure.
Extensive experiments verify that our parallel solution scales well with the number of CPU cores, achieving a 26.68\(\times \) runtime speedup when mining a graph with 3.77M vertices and 16.5M edges using 32 mining threads. Moreover, mining large quasi-cliques directly from dense parts provides a further speedup of up to 89.46\(\times \).
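The definition underlying the mining problem is easy to check. A minimal sketch of the \(\gamma \)-quasi-clique test, where `adj` maps each vertex to its neighbour set (no self-loops assumed):

```python
def is_quasi_clique(adj, S, gamma):
    """Check whether vertex set S is a gamma-quasi-clique.

    Each vertex of S must be adjacent to at least gamma * (|S| - 1)
    of the other vertices in S.
    """
    s = set(S)
    need = gamma * (len(s) - 1)
    return all(len(adj[v] & s) >= need for v in s)
```

At \(\gamma = 1\) this degenerates to the clique test; lowering \(\gamma \) relaxes the density requirement, which is what makes the mining search space so large.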
Reachability is a fundamental problem in graph analysis. In applications such as social networks and collaboration networks, edges are always associated with timestamps. Most existing works on reachability queries in temporal graphs assume that two vertices are related if they are connected by a path with non-decreasing edge timestamps (a time-respecting path). This assumption fails to capture the relationship between entities involved in the same group or activity when no time-respecting path connects them. In this paper, we define a new reachability model, called span-reachability, designed to relax the time-order dependency and identify relationships between entities within a given time period. We adopt the idea of the two-hop cover and propose an index-based method to answer span-reachability queries. Several optimizations are also given to improve the efficiency of index construction and query processing. We conduct extensive experiments on eighteen real-world datasets to show the efficiency of our proposed solution.
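A two-hop cover answers a reachability query by intersecting precomputed labels. A generic sketch of the query step only; label construction, and the span-reachability semantics layered on top, are the paper's contribution and are omitted here. By convention, each vertex appears in its own labels:

```python
def span_reach(u, v, out_label, in_label):
    """Two-hop label query: does u reach v?

    out_label[u]: hop vertices reachable from u (including u itself).
    in_label[v]:  hop vertices that can reach v (including v itself).
    u reaches v iff the two label sets intersect.
    """
    return bool(out_label[u] & in_label[v])
```

The appeal of the scheme is that each query is a small set intersection, so query time depends on label size rather than graph size.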
Today’s social networks continuously generate massive streams of data, which provide a valuable starting point for the detection of rumours as soon as they start to propagate. However, rumour detection faces tight latency bounds, which cannot be met by contemporary algorithms, given the sheer volume of high-velocity streaming data emitted by social networks. Hence, in this paper, we argue for best-effort rumour detection that detects most rumours quickly rather than all rumours with a high delay. To this end, we combine techniques for efficient, graph-based matching of rumour patterns with effective load shedding that discards some of the input data while minimising the loss in accuracy. Experiments with large-scale real-world datasets illustrate the robustness of our approach in terms of runtime performance and detection accuracy under diverse streaming conditions.
Top-cited authors
Xuemin Lin
  • UNSW Sydney
Wenjie Zhang
  • UNSW Sydney
Lu Qin
  • University of Technology Sydney
Christopher Ré
  • Stanford University
Christian Jensen
  • Aalborg University