ABSTRACT: While price and data quality should define the major trade-off for consumers in data markets, prices are usually prescribed by vendors and data quality is not negotiable. In this paper we study a model where data quality can be traded for a discount. We focus on the case of XML documents and consider completeness as the quality dimension. In our setting, the data provider offers an XML document, and sets both the price of the document and a weight for each node of the document, depending on its potential worth. The data consumer proposes a price. If the proposed price is lower than that of the entire document, then the data consumer receives a sample, i.e., a random rooted subtree of the document whose selection depends on the discounted price and the weight of nodes. By requesting several samples, the data consumer can iteratively explore the data in the document. We present a pseudo-polynomial time algorithm to select a rooted subtree of prescribed weight uniformly at random, but show that this problem is unfortunately intractable in general. Yet, we are able to identify several practical cases where our algorithm runs in polynomial time. The first case is uniform random sampling of a rooted subtree of prescribed size rather than prescribed weight; the second restricts to binary weights. As a more challenging scenario for the sampling problem, we also study the uniform sampling of a rooted subtree of prescribed weight and prescribed height. We adapt our pseudo-polynomial time algorithm to this setting and identify tractable cases.
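The tractable prescribed-size case admits a simple two-phase scheme: a bottom-up dynamic program counts, for every node, how many rooted subtrees of each size exist below it, and a top-down pass then draws one uniformly. The sketch below illustrates this idea under simplifying assumptions (the document is a dict mapping each node to its children, and the helper names are ours, not the paper's):

```python
import random

def count_polys(children, root):
    """poly[v][k] = number of rooted subtrees of v's subtree that contain v
    and have exactly k nodes; prefix[v][i] = ways the first i children of v
    can jointly contribute a given number of nodes (kept for sampling)."""
    poly, prefix = {}, {}
    def conv(a, b):
        out = [0] * (len(a) + len(b) - 1)
        for i, x in enumerate(a):
            for j, y in enumerate(b):
                out[i + j] += x * y
        return out
    def dfs(v):
        pres = [[1]]                      # running convolution over v's children
        for c in children.get(v, []):
            dfs(c)
            q = poly[c][:]
            q[0] = 1                      # option: exclude child c entirely
            pres.append(conv(pres[-1], q))
        prefix[v] = pres
        poly[v] = [0] + pres[-1]          # shift by one: v itself is always included
    dfs(root)
    return poly, prefix

def sample_subtree(children, root, k, poly, prefix):
    """Draw one rooted subtree of exactly k nodes uniformly at random."""
    def pick(v, size):
        nodes, budget = [v], size - 1     # nodes left to distribute among children
        kids = children.get(v, [])
        for i in range(len(kids) - 1, -1, -1):
            c = kids[i]
            q = poly[c][:]
            q[0] = 1
            pre = prefix[v][i]            # ways for children 0..i-1
            weights = [
                (q[m] if m < len(q) else 0) *
                (pre[budget - m] if budget - m < len(pre) else 0)
                for m in range(budget + 1)
            ]
            m = random.choices(range(budget + 1), weights=weights)[0]
            if m > 0:
                nodes += pick(c, m)       # recurse into the chosen child's portion
            budget -= m
        return nodes
    return pick(root, k)
```

The sampler assumes `poly[root][k] > 0`, i.e., at least one subtree of the requested size exists.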
ABSTRACT: A family of practical queries, which aim to return or manipulate paths as first class objects, cannot be expressed by XPath or XQuery FLWOR expressions. In this paper, we propose a seamless extension to XQuery FLWOR to elegantly express path-centric queries. We further investigate the expression and processing of intra-path aggregation, an analytical operation in path-centric queries.
ABSTRACT: In this paper, we propose a learning approach to adaptive performance tuning of database applications. The objective is to validate the opportunity to devise a tuning strategy that does not need prior knowledge of a cost model. Instead, the cost model is learned through reinforcement learning. We instantiate our approach to the use case of index tuning. We model the execution of queries and updates as a Markov decision process whose states are database configurations, actions are configuration changes, and rewards are functions of the cost of configuration change and query and update evaluation. During the reinforcement learning process, we face two important challenges: not only the unavailability of a cost model, but also the size of the state space. To address the latter, we devise strategies to prune the state space, both in the general case and for the use case of index tuning. We empirically and comparatively evaluate our approach on a standard OLTP dataset. We show that our approach is competitive with state-of-the-art adaptive index tuning, which is dependent on a cost model.
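The MDP formulation described above can be sketched with tabular Q-learning. Everything concrete here is an illustrative assumption (a three-column toy schema, fixed scan and index costs, a toggle-one-index action space); the point is only that the agent learns from observed costs, with no cost model handed to it:

```python
import random
from collections import defaultdict

# Illustrative environment: three indexable columns, fixed costs. The agent
# never consults these numbers as a model; it only observes realized costs.
COLUMNS = ("a", "b", "c")
SCAN_COST, INDEX_COST, CHANGE_COST = 10.0, 1.0, 5.0

def evaluate(config, query_col):
    """Observed cost of answering a query on query_col under a configuration."""
    return INDEX_COST if query_col in config else SCAN_COST

def q_learning(episodes=2000, alpha=0.1, gamma=0.9, eps=0.2, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)                     # Q[(configuration, action)]
    state = frozenset()                        # start with no indexes
    actions = [("noop", None)] + [("toggle", c) for c in COLUMNS]
    for _ in range(episodes):
        query_col = rng.choice(COLUMNS)        # incoming workload step
        if rng.random() < eps:                 # epsilon-greedy exploration
            action = rng.choice(actions)
        else:
            action = max(actions, key=lambda a: Q[(state, a)])
        kind, col = action
        nxt = state if kind == "noop" else state ^ {col}
        reward = -(CHANGE_COST if kind != "noop" else 0.0) - evaluate(nxt, query_col)
        best_next = max(Q[(nxt, a)] for a in actions)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = nxt
    return Q, state
```

States are frozensets of indexed columns, so the table grows with the configuration space; the state-space pruning strategies mentioned in the abstract address exactly that growth.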
ABSTRACT: Virtual machines (VM) offer simple and practical mechanisms to address many of the manageability problems of leveraging heterogeneous computing resources. VM live migration is an important feature of virtualization in cloud computing: it allows administrators to transparently tune the performance of the computing infrastructure. However, VM live migration may open the door to security threats. Classic anomaly detection schemes such as Local Outlier Factors (LOF) fail to detect anomalies in the process of VM live migration. To tackle such critical security issues, we propose an adaptive scheme that mines data from the cloud infrastructure in order to detect abnormal statistics when VMs are migrated to new hosts. In our scheme, we extend the classic LOF approach by defining novel dimension reasoning (DR) rules, yielding DR-LOF, to figure out the possible sources of anomalies. We also incorporate Symbolic Aggregate approXimation (SAX) to enable the exploration of timing information that LOF ignores. In addition, we implement our scheme with an adaptive procedure to reduce the chances of performance instability. Compared with LOF, which fails to detect anomalies in the process of VM live migration, our scheme is able not only to detect anomalies but also to identify their possible sources, giving cloud computing operators important clues to pinpoint and clear the anomalies. Our scheme further outperforms other classic clustering tools in WEKA (Waikato Environment for Knowledge Analysis), with higher detection rates and lower false alarm rates. Our scheme can thus serve as a novel anomaly detection tool to improve the security framework in VM management for cloud computing.
Full-text Article · Jun 2015 · Future Generation Computer Systems
ABSTRACT: We propose algorithms for the detection of disjoint and overlapping communities in networks. The algorithms exploit both the degree and the clustering coefficient of vertices, as these metrics characterize dense connections, which we hypothesize to be indicative of communities. Each vertex independently seeks the community to which it belongs, by visiting its neighboring vertices and choosing its peers on the basis of their degrees and clustering coefficients. The algorithms are intrinsically data parallel. We devise a version for Graphics Processing Units (GPU). We empirically evaluate the performance of our methods, measuring and comparing their efficiency and effectiveness against several state-of-the-art community detection algorithms. Effectiveness is quantified by six metrics, namely modularity, conductance, internal density, cut ratio, weighted community clustering and normalized mutual information. Additionally, average community size and community size distribution are measured. Efficiency is measured by the running time. We show that our methods are both effective and efficient. Moreover, the opportunity to parallelize our algorithm yields an efficient solution to the community detection problem.
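A loose sketch of the idea, not the authors' exact algorithm: score each vertex by its degree and clustering coefficient, let every vertex independently follow its best-scoring neighbour, and read communities off the resulting pointer graph. The particular score (degree times clustering coefficient) and the tie-breaking are our illustrative choices:

```python
from collections import defaultdict

def clustering_coefficient(adj, v):
    """Fraction of pairs of v's neighbours that are themselves connected."""
    nbrs = list(adj[v])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i, u in enumerate(nbrs) for w in nbrs[i + 1:] if w in adj[u])
    return 2.0 * links / (k * (k - 1))

def communities(adj):
    """adj: dict vertex -> set of neighbours. Returns a list of vertex sets."""
    # Illustrative score combining the two metrics; other combinations work too.
    score = {v: len(adj[v]) * clustering_coefficient(adj, v) for v in adj}
    # Each vertex independently follows its best-scoring neighbour (or itself);
    # ties are broken by vertex id to keep the outcome deterministic.
    leader = {v: max(adj[v] | {v}, key=lambda u: (score[u], u)) for v in adj}
    def root(v):
        # Chase leader pointers until a cycle closes; canonicalise the cycle.
        path, seen = [], set()
        while v not in seen:
            seen.add(v)
            path.append(v)
            v = leader[v]
        return min(path[path.index(v):])
    groups = defaultdict(set)
    for v in adj:
        groups[root(v)].add(v)
    return list(groups.values())
```

Because every vertex decides on local information only, the leader step parallelizes trivially, which is what the GPU version exploits.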
ABSTRACT: While price and data quality should define the major trade-off for consumers in data markets, prices are usually prescribed by vendors and data quality is not negotiable. In this paper we study a model where data quality can be traded for a discount. We focus on the case of XML documents and consider completeness as the quality dimension. In our setting, the data provider offers an XML document, and sets both the price of the document and a weight for each node of the document, depending on its potential worth. The data consumer proposes a price. If the proposed price is lower than that of the entire document, then the data consumer receives a sample, i.e., a random rooted subtree of the document whose selection depends on the discounted price and the weight of nodes. By requesting several samples, the data consumer can iteratively explore the data in the document. We show that the uniform random sampling of a rooted subtree with prescribed weight is unfortunately intractable. However, we are able to identify several practical cases that are tractable. The first case is uniform random sampling of a rooted subtree with prescribed size; the second case restricts to binary weights. For both these practical cases we present polynomial-time algorithms and explain how they can be integrated into an iterative exploratory sampling approach.
ABSTRACT: In database applications, the availability of a conceptual schema and semantics constitute invaluable leverage for improving the effectiveness, and sometimes the efficiency, of many tasks including query processing, keyword search and schema/data integration. The Object-Relationship-Attribute model for Semi-Structured data (ORA-SS) is a conceptual model intended to capture the semantics of object classes, object identifiers, relationship types, etc., underlying XML schemas and data. We refer to the set of these semantic concepts as the ORA-semantics. In this work, we present a novel approach to automatically discover the ORA-semantics from data-centric XML. We also empirically and comparatively evaluate the effectiveness of the approach.
ABSTRACT: Data is a modern commodity. Yet the pricing models in use on electronic data markets either focus on the usage of computing resources, or are proprietary, opaque, most likely ad hoc, and not conducive to healthy commodity-market dynamics. In this paper we propose a generic data pricing model that is based on minimal provenance, i.e., minimal sets of tuples contributing to the result of a query. We show that the proposed model fulfills desirable properties such as contribution monotonicity, bounded price and contribution arbitrage-freedom. We present a baseline algorithm to compute the exact price of a query based on our pricing model, and show that the problem is NP-hard. We therefore devise, present and compare several heuristics. We conduct a comprehensive experimental study to show their effectiveness and efficiency.
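To make the notion concrete, here is a brute-force sketch: enumerate the inclusion-minimal subsets of the database that still produce the query answer, then aggregate their tuple prices. The "charge the cheapest witness" aggregation is one simple monotone choice of ours, not necessarily the paper's pricing function, and the exponential enumeration mirrors the NP-hardness of exact pricing:

```python
from itertools import combinations

def minimal_witnesses(tuples, answer, evaluate):
    """Enumerate inclusion-minimal subsets of the database that still yield
    the full query answer. Brute force (exponential), in line with the
    NP-hardness of exact pricing."""
    wits, ts = [], list(tuples)
    for r in range(len(ts) + 1):           # ascending size, so minimality
        for sub in combinations(ts, r):    # reduces to a subset test
            s = set(sub)
            if evaluate(s) == answer and not any(w <= s for w in wits):
                wits.append(s)
    return wits

def query_price(wits, price):
    # One simple monotone aggregation: charge the cheapest witness.
    return min(sum(price[t] for t in w) for w in wits)
```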
ABSTRACT: We propose an algorithm for the detection of communities in networks. The algorithm exploits the degree and clustering coefficient of vertices, as these metrics characterize dense connections, which, we hypothesize, are indicative of communities. Each vertex independently seeks the community to which it belongs by visiting its neighbour vertices and choosing its peers on the basis of their degrees and clustering coefficients. The algorithm is intrinsically data parallel. We devise a version for Graphics Processing Units (GPU). We empirically evaluate the performance of our method, measuring and comparing its efficiency and effectiveness against several state-of-the-art community detection algorithms. Effectiveness is quantified by five metrics, namely modularity, conductance, internal density, cut ratio and weighted community clustering. Efficiency is measured by the running time. Moreover, the opportunity to parallelize our algorithm yields an efficient solution to the community detection problem.
ABSTRACT: Among the many reasons that justify the need for efficient and effective graph sampling algorithms is the ability to replace a graph too large to be processed by a tractable yet representative subgraph. For instance, some approximation algorithms start by looking for a solution on a sample subgraph and then extrapolate it. The sample graph should be of manageable size and should preserve the properties of interest. There exist several efficient and effective algorithms for the sampling of graphs. However, the graphs encountered in modern applications are dynamic: edges and vertices are added or removed. Existing graph sampling algorithms were designed for static graphs and are not incremental: if the original graph changes, the sample must be entirely recomputed. Is it possible to design an algorithm that reuses all or part of the already computed sample?
We present two incremental graph sampling algorithms that preserve selected properties. The rationale of the algorithms is to replace a fraction of the vertices in the former sample with newly updated vertices. We analytically and empirically evaluate the performance of the proposed algorithms, comparing it with that of baseline algorithms. The experimental results on both synthetic and real graphs show that our algorithms realize a compromise between effectiveness and efficiency, and therefore provide practical solutions to the problem of incrementally sampling large dynamic graphs.
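As a minimal illustration of incremental (rather than from-scratch) sampling, the class below maintains a uniform fixed-size vertex sample under insertions via reservoir sampling, touching at most one sample slot per update. It is a simplified stand-in for the property-preserving algorithms of the paper:

```python
import random

class IncrementalVertexSample:
    """Maintain a uniform sample of up to k vertices under vertex insertions
    (classic reservoir sampling). Instead of recomputing the sample after each
    change, every update touches at most one sample slot."""
    def __init__(self, k, seed=None):
        self.k, self.n = k, 0
        self.sample = []
        self.rng = random.Random(seed)

    def add_vertex(self, v):
        self.n += 1
        if len(self.sample) < self.k:
            self.sample.append(v)
        else:
            j = self.rng.randrange(self.n)   # keep v with probability k/n
            if j < self.k:
                self.sample[j] = v

    def remove_vertex(self, v):
        """On deletion, evict v if sampled; the sample may temporarily shrink
        and is refilled by later insertions (a simplifying assumption)."""
        self.n = max(0, self.n - 1)
        if v in self.sample:
            self.sample.remove(v)
```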
ABSTRACT: We propose a graph-layout-based method for detecting communities in networks. We first project the graph onto a Euclidean space using the Fruchterman-Reingold algorithm, a force-based graph drawing algorithm. We then cluster the vertices according to Euclidean distance. The idea is a form of dimension reduction: the graph drawing in two or more dimensions provides a heuristic decision as to whether vertices are connected by a short path, approximated by their Euclidean distance. We study community detection for both disjoint and overlapping communities. For the case of disjoint communities, we use k-means clustering; for the case of overlapping communities, we use the fuzzy c-means algorithm. We evaluate the performance of our algorithms for varying parameters and numbers of iterations. We compare the results to several state-of-the-art community detection algorithms, each of which clusters the graph directly or indirectly according to geodesic distance. We show that, for non-trivially small graphs, our method is both effective and efficient. We measure effectiveness using modularity when the communities are not known in advance, and precision when they are known in advance. We measure efficiency by the running time, which can be controlled by the number of iterations of the Fruchterman-Reingold algorithm.
ABSTRACT: In most data markets, prices are prescribed and accuracy is determined by the data. Instead, we consider a model in which accuracy can be traded for discounted prices: “what you pay for is what you get”.
The data market model consists of data consumers, data providers and data market owners. The data market owners are brokers between the data providers and data consumers. A data consumer proposes a price for the data that she requests. If the price is less than the price set by the data provider, then she gets an approximate value. The data market owners negotiate the pricing schemes with the data providers. They implement these schemes for the computation of the discounted approximate values.
We propose a theoretical and practical pricing framework, with its algorithms, for the above mechanism. In this framework, the published value is drawn at random from a probability distribution. The distribution is computed such that its distance to the actual value is commensurate with the discount. The published value comes with a guarantee on the probability that it is the exact value; this probability is also commensurate with the discount. We present and formalize the principles that a healthy data market should meet for such a transaction. We define two ancillary functions and describe the algorithms that compute the approximate value from the proposed price using these functions. We prove that the functions and the algorithms meet the required principles.
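A minimal sketch of such a mechanism, with illustrative probability and noise functions standing in for the paper's ancillary functions:

```python
import random

def approximate_value(true_value, proposed_price, full_price, spread=1.0, rng=random):
    """'What you pay for is what you get' (sketch). With probability equal to
    the fraction of the full price paid, the exact value is published;
    otherwise a perturbed value is drawn whose expected distance to the truth
    grows with the discount. Returns (published_value, probability_of_exactness).
    The linear probability and Gaussian noise are illustrative assumptions."""
    ratio = max(0.0, min(1.0, proposed_price / full_price))
    if rng.random() < ratio:                    # P(exact) commensurate with price paid
        return true_value, ratio
    discount = 1.0 - ratio
    noise = rng.gauss(0.0, spread * discount)   # spread commensurate with discount
    return true_value + noise, ratio
```

Paying the full price always yields the exact value; paying nothing yields a pure draw from the noise distribution.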
ABSTRACT: It is now possible to collect and share trajectory data for any ship in the world by various means, such as satellite and VHF systems. However, the publication of such data also creates new risks of privacy breaches, with consequences for the security and liability of the stakeholders. Thus, there is an urgent need to develop methods for preserving the privacy of published trajectory data. In this paper, we propose and comparatively investigate two mechanisms for the publication of the trajectory of individual ships under differential privacy guarantees. Traditionally, differential privacy is achieved by perturbing the result or the data according to the sensitivity of the query. Our approach, instead, combines sampling and interpolation. We present and compare two techniques in which we sample and interpolate (a priori) and interpolate and sample (a posteriori), respectively. We show that both techniques achieve a (0, δ) form of differential privacy. We study the privacy guarantees and utility of the methods analytically and empirically on real ship trajectories.
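The a-priori variant (sample, then interpolate) can be illustrated as follows; the privacy accounting is omitted, and the linear interpolation and endpoint handling are our simplifying assumptions:

```python
import random

def sample_then_interpolate(traj, k, rng=random):
    """Keep k randomly chosen positions (always including both endpoints,
    so 2 <= k <= len(traj)) and linearly interpolate the published trajectory
    back to the original length."""
    n = len(traj)
    keep = sorted({0, n - 1} | set(rng.sample(range(1, n - 1), k - 2)))
    out = []
    for a, b in zip(keep, keep[1:]):
        (x0, y0), (x1, y1) = traj[a], traj[b]
        for t in range(a, b):
            f = (t - a) / (b - a)          # linear blend between kept positions
            out.append((x0 + f * (x1 - x0), y0 + f * (y1 - y0)))
    out.append(traj[-1])
    return out
```

The a-posteriori variant would swap the order: interpolate a dense trajectory first, then publish a random sample of it.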
ABSTRACT: The pervasiveness of location-acquisition technologies has made it possible to collect the movement data of individuals or vehicles. However, it has to be carefully managed to ensure that there is no privacy breach. In this paper, we investigate the problem of publishing trajectory data under the differential privacy model. A straightforward solution is to add noise to a trajectory - this can be done either by adding noise to each coordinate of the position, to each position of the trajectory, or to the whole trajectory. However, such naive approaches result in trajectories with zigzag shapes and many crossings, making the published trajectories of little practical use. We introduce a mechanism called SDD (Sampling Distance and Direction), which is ε-differentially private. SDD samples a suitable direction and distance at each position to publish the next possible position. Numerical experiments conducted on real ship trajectories demonstrate that our proposed mechanism can deliver ship trajectories that are of good practical utility.
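A sketch of one publication step in the spirit of SDD: candidate directions are scored by how close the induced next position lands to the true one, and one is drawn with the exponential mechanism. The discretisation into a fixed number of directions, the utility function and the sensitivity bound are simplifying assumptions of ours, not the paper's exact mechanism:

```python
import math
import random

def sdd_step(cur, true_next, eps, max_dist, rng=random, n_dirs=360):
    """Publish the next position by sampling a direction; the distance is kept
    at the true step length, capped by max_dist. Candidates landing closer to
    the true next position get exponentially larger weight."""
    (x, y), (tx, ty) = cur, true_next
    r = min(math.hypot(tx - x, ty - y), max_dist)
    sensitivity = 2 * max_dist             # crude bound on the utility's sensitivity
    cands, weights = [], []
    for i in range(n_dirs):
        theta = 2 * math.pi * i / n_dirs
        nx, ny = x + r * math.cos(theta), y + r * math.sin(theta)
        u = -math.hypot(nx - tx, ny - ty)  # utility: negative distance to truth
        cands.append((nx, ny))
        weights.append(math.exp(eps * u / (2 * sensitivity)))
    return rng.choices(cands, weights=weights)[0]
```

Because the step length is preserved and directions near the true heading dominate, published trajectories stay smooth instead of zigzagging.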
ABSTRACT: Efficient spatial joins are pivotal for many applications, and particularly important for geographical information systems and for the simulation sciences, where scientists work with spatial models. Past research has primarily focused on disk-based spatial joins; efficient in-memory approaches, however, are important for two reasons: a) main memory has grown so large that many datasets fit in it, and b) the in-memory join is a very time-consuming part of all disk-based spatial joins. In this paper we develop TOUCH, a novel in-memory spatial join algorithm that uses hierarchical data-oriented space partitioning, thereby keeping both its memory footprint and the number of comparisons low. Our results show that TOUCH outperforms known in-memory spatial-join algorithms as well as in-memory implementations of disk-based join approaches. In particular, it has an order-of-magnitude advantage over the memory-demanding state of the art in terms of the number of comparisons (i.e., pairwise object comparisons) as well as execution time, while it is two orders of magnitude faster than approaches with a similar memory footprint. Furthermore, TOUCH is more scalable than competing approaches as data density grows.
ABSTRACT: The database group at the National University of Singapore (NUS) has worked on a wide range of research, from traditional database technology to more advanced database technology and novel database utilities. The group has been developing efficient cloud computing platforms for large-scale services, and Big Data management and analytics using commodity hardware. One of its goals is to allow users of MapReduce-based systems to keep the programming model of the MapReduce framework while empowering them with data management functionalities at an acceptable performance. The group has also developed a query processing engine under the MapReduce framework. Its MapReduce-based similarity (kNN) join exploits Voronoi diagrams to minimize the number of objects sent to each reducer node, thereby reducing computation and communication overheads.
ABSTRACT: The widespread usage of random graphs has been highlighted in the context of database applications for several years. This is because such data structures turn out to be very useful in a large family of database applications, ranging from simulation to sampling, from the analysis of complex networks to the study of randomized algorithms, and so forth. Among others, Erdős–Rényi Γv,p is the most popular model for obtaining and manipulating random graphs. Unfortunately, it has been demonstrated that classical algorithms for generating Erdős–Rényi random graphs do not scale well to large instances and, in addition, fail to make use of the parallel processing capabilities of modern hardware. Motivated by this, in this paper we propose and experimentally assess PPreZER, a novel parallel algorithm for generating random graphs under the Erdős–Rényi model, designed and implemented on a Graphics Processing Unit (GPU). We demonstrate the benefits of our solution via a succession of intermediary algorithms, both sequential and parallel, which expose the limitations of classical approaches and the gains achieved by PPreZER. Finally, our comprehensive experimental assessment and analysis brings to light a significant average speedup of PPreZER over the baseline algorithms.
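The key trick behind ZER-style generators, which PPreZER parallelizes, is skip sampling: rather than one Bernoulli trial per vertex pair, draw the gap to the next edge from a geometric distribution, so the expected work is proportional to the number of edges instead of the number of pairs. A sequential sketch (requires 0 < p < 1):

```python
import math
import random

def er_skip_edges(n, p, rng=random):
    """Edges of an Erdos-Renyi G(n, p) graph, 0 < p < 1, via skip sampling:
    the gap to the next edge, over the row-major enumeration of vertex pairs,
    is geometrically distributed with parameter p."""
    edges = []
    last = -1                              # row-major index of last decided pair
    m = n * (n - 1) // 2                   # number of candidate pairs
    logq = math.log1p(-p)                  # log(1 - p)
    while True:
        skip = int(math.log1p(-rng.random()) / logq)   # geometric gap >= 0
        last += skip + 1
        if last >= m:
            break
        i = (1 + math.isqrt(1 + 8 * last)) // 2        # decode pair index
        j = last - i * (i - 1) // 2                    # into (j, i) with j < i
        edges.append((j, i))
    return edges
```

The decode step inverts the triangular-number indexing of pairs; in the parallel versions, threads generate blocks of skips independently and the results are stitched together.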
Article · Mar 2013 · Journal of Parallel and Distributed Computing