ABSTRACT: Analytic functions represent the state-of-the-art way of performing complex
data analysis within a single SQL statement. In particular, an important class
of analytic functions that has been frequently used in commercial systems to
support OLAP and decision support applications is the class of window
functions. A window function returns for each input tuple a value derived from
applying a function over a window of neighboring tuples. However, existing
window function evaluation approaches are based on a naive sorting scheme. In
this paper, we study the problem of optimizing the evaluation of window
functions. We propose several efficient techniques, and identify optimization
opportunities that allow us to optimize the evaluation of a set of window
functions. We have integrated our scheme into PostgreSQL. Our comprehensive
experimental study on the TPC-DS datasets as well as synthetic datasets and
queries demonstrates significant speedup over existing approaches.
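To make the notion of a window function concrete, the following is a minimal sketch of the naive, sort-based evaluation the abstract refers to, not the optimized scheme proposed in the paper. The function name `moving_sum`, its parameters, and the `sales` rows are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def moving_sum(rows, partition_key, order_key, value_key, preceding=1):
    """Naive window-function evaluation (illustrative sketch only).

    For each input row, return the sum of value_key over the current row
    and up to `preceding` earlier rows in the same partition, where rows
    in a partition are ordered by order_key.
    """
    # Group row positions by partition value.
    partitions = defaultdict(list)
    for idx, row in enumerate(rows):
        partitions[row[partition_key]].append(idx)

    result = [None] * len(rows)
    for indices in partitions.values():
        # Sort each partition on the ordering attribute.
        indices.sort(key=lambda i: rows[i][order_key])
        for pos, i in enumerate(indices):
            start = max(0, pos - preceding)
            window = indices[start:pos + 1]
            result[i] = sum(rows[j][value_key] for j in window)
    return result


# Hypothetical data: one output value per input tuple.
sales = [
    {"store": "A", "day": 1, "amount": 10},
    {"store": "A", "day": 2, "amount": 20},
    {"store": "B", "day": 1, "amount": 5},
    {"store": "A", "day": 3, "amount": 30},
]
print(moving_sum(sales, "store", "day", "amount"))
```

Each input tuple receives exactly one output value, which is what distinguishes a window function from a grouped aggregate.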
ABSTRACT: The corporate network is often used for sharing information among the participating companies and facilitating collaboration in a certain industry sector where companies share a common interest. It can effectively help the companies to reduce their operational costs and increase revenues. However, inter-company data sharing and processing pose unique challenges to such a data management system, including scalability, performance, throughput, and security. In this paper, we present BestPeer++, a system which delivers elastic data sharing services for corporate network applications in the cloud based on BestPeer, a peer-to-peer (P2P) based data management platform. By integrating cloud computing, database, and P2P technologies into one system, BestPeer++ provides an economical, flexible and scalable platform for corporate network applications and delivers data sharing services to participants based on the widely accepted pay-as-you-go business model. We evaluate BestPeer++ on the Amazon EC2 Cloud platform. The benchmarking results show that BestPeer++ outperforms HadoopDB, a recently proposed large-scale data processing system, in performance when both systems are employed to handle typical corporate network workloads. The benchmarking results also demonstrate that BestPeer++ achieves near linear scalability for throughput with respect to the number of peer nodes.
ABSTRACT: Text databases provide rapid access to collections of digital documents. Such databases have become ubiquitous: text search
engines underlie the online text repositories accessible via the Web and are central to digital libraries and online corporate
ABSTRACT: There has been a growing acceptance of the object-oriented data model as the basis of next generation database management
systems (DBMSs). Both pure object-oriented DBMS (OODBMSs) and object-relational DBMS (ORDBMSs) have been developed based on
object-oriented concepts. Object-relational DBMS, in particular, extend the SQL language by incorporating all the concepts
of the object-oriented data model. A large number of products in both categories of DBMS are available today. In particular,
all major vendors of relational DBMSs are turning their products into ORDBMSs [Nori, 1996].
ABSTRACT: Most of the existing privacy-preserving techniques, such as k-anonymity methods, are designed for static data sets. As such, they cannot be applied to streaming data which are continuous, transient, and usually unbounded. Moreover, in streaming applications, there is a need to offer strong guarantees on the maximum allowed delay between incoming data and the corresponding anonymized output. To cope with these requirements, in this paper, we present Continuously Anonymizing STreaming data via adaptive cLustEring (CASTLE), a cluster-based scheme that anonymizes data streams on-the-fly and, at the same time, ensures the freshness of the anonymized data by satisfying specified delay constraints. We further show how CASTLE can be easily extended to handle ℓ-diversity. Our extensive performance study shows that CASTLE is efficient and effective w.r.t. the quality of the output data.
IEEE Transactions on Dependable and Secure Computing 07/2011; · 1.06 Impact Factor
ABSTRACT: The International Maritime Organization (IMO) requires a majority of cargo and passenger ships to use the Automatic Identification System (AIS) for navigation safety and traffic control. Distributing live AIS data on the Internet can offer a global view based on ships' status for both operational and analytical purposes to port authorities, shipping and insurance companies, cargo owners and ship captains and other stakeholders. Yet, uncontrolled, this distribution can seriously undermine navigation safety and security and the privacy of the various stakeholders. In this paper we present ASSIST, a system prototype based on our recently proposed access control framework, to protect data streams from unauthorized access. We demonstrate the effectiveness of the system in a real scenario with real AIS data streams.
19th ACM SIGSPATIAL International Symposium on Advances in Geographic Information Systems, ACM-GIS 2011, November 1-4, 2011, Chicago, IL, USA, Proceedings; 01/2011
ABSTRACT: Computing preference queries has received a lot of attention in the database community. It is common that the user is unsure of his/her preference, so care must be taken to elicit the preference of the user correctly. In this paper, we propose to elicit the preferred ordering of a user by utilizing skyline objects as the representatives of the possible orderings. We introduce the notion of order-based representative skylines, which selects representatives based on the orderings that they represent. To further facilitate preference exploration, a hierarchical clustering algorithm is applied to compute a dendrogram on the skyline objects. By coupling the hierarchical clustering with visualization techniques, we allow users to refine their preference weight settings by browsing the hierarchy. Extensive experiments were conducted and the results validate the feasibility and the efficiency of our approach.
Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2010, Indianapolis, Indiana, USA, June 6-10, 2010; 01/2010
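As background for the skyline objects mentioned in the abstract above, here is a minimal sketch of the standard skyline (Pareto-optimal set) computation. It does not implement the order-based representative selection or the hierarchical clustering from the paper. The `hotels` data and the convention that smaller values are better in every dimension are assumptions made for illustration.

```python
def dominates(p, q):
    """Return True if p dominates q.

    p dominates q when p is no worse than q in every dimension and
    strictly better in at least one (assumption: smaller is better).
    """
    no_worse = all(a <= b for a, b in zip(p, q))
    strictly_better = any(a < b for a, b in zip(p, q))
    return no_worse and strictly_better


def skyline(points):
    """Quadratic-time skyline: keep the points no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (price, distance) pairs.
hotels = [(50, 8), (60, 2), (80, 1), (70, 5), (90, 9)]
print(skyline(hotels))
```

Every skyline point is the best choice under some monotone preference ordering, which is why skyline objects are natural candidates for representing possible orderings.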
ABSTRACT: In this paper, we propose an online aggregation system called COSMOS (Continuous Sampling for Multiple queries in an Online aggregation System), to process multiple aggregate queries efficiently. In COSMOS, a dataset is first scrambled so that sequentially scanning the dataset gives rise to a stream of random samples for all queries. Moreover, COSMOS organizes queries into a dissemination graph to exploit the dependencies across queries. In this way, aggregates of queries closer to the root (source of data flow) can potentially be used to compute the aggregates of descendant/dependent queries. COSMOS applies a statistical approach to combine answers from ancestor nodes to generate the online aggregates for a node. COSMOS also offers a partitioning strategy to further salvage intermediate answers. We have implemented COSMOS and conducted an extensive experimental study in PostgreSQL. Our results on the TPC-H benchmark show the efficiency and effectiveness of COSMOS.
Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2010, Indianapolis, Indiana, USA, June 6-10, 2010; 01/2010
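The idea of scrambling a dataset so that a sequential scan yields random samples can be sketched for a single AVG query as follows. This is generic online aggregation under stated assumptions, not the COSMOS implementation: the function `online_average`, its parameters, and the normal-approximation interval (which ignores the finite-population correction) are all choices made for this illustration.

```python
import math
import random


def online_average(values, z=1.96, report_every=100, seed=0):
    """Yield (samples_seen, running_mean, half_width) during a scan.

    The data is shuffled once so that a sequential scan behaves like
    random sampling. The half-width is z times the standard error
    (normal approximation; an assumption of this sketch).
    """
    data = list(values)
    random.Random(seed).shuffle(data)

    n, mean, m2 = 0, 0.0, 0.0  # Welford's running mean and variance
    for x in data:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n > 1 and n % report_every == 0:
            std_err = math.sqrt(m2 / (n - 1) / n)
            yield n, mean, z * std_err


for seen, estimate, half_width in online_average(range(1000), report_every=250):
    print(f"after {seen} rows: {estimate:.1f} +/- {half_width:.1f}")
```

The user can stop the scan as soon as the reported interval is tight enough, which is the main appeal of online aggregation.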
ABSTRACT: Many applications require sorting a table over multiple sort orders: generation of multiple reports from a table, evaluation of a complex query that involves multiple instances of a relation, and batch processing of a set of queries. In this paper, we study how multiple sortings of a table can be efficiently performed. We introduce a new evaluation technique, called cooperative sort, that exploits the relationships among the input set of sort orders to minimize I/O operations for the collection of sort operations. To demonstrate the efficiency of the proposed scheme, we implemented it in PostgreSQL and evaluated its performance using both the TPC-DS benchmark and synthetic data. Our experimental results show significant performance improvement over the traditional non-cooperative sorting scheme.
Proceedings of the 26th International Conference on Data Engineering, ICDE 2010, March 1-6, 2010, Long Beach, California, USA; 01/2010
ABSTRACT: In a Moving Object Database (MOD), the dataset, for example, the location of objects and their distribution, and the workload change frequently. Traditional static indexes are not able to cope well with such changes, that is, their effectiveness and efficiency are seriously affected. This calls for the development of novel indexes that can be reconfigured automatically based on the state of the system. In this article, we design and present the ST2B-tree, a Self-Tunable Spatio-Temporal B+-tree index for MODs. In ST2B-tree, the data space is partitioned into regions of different density with respect to a set of reference points. Based on the density, objects in a region are managed using a grid of appropriate granularity; intuitively, a dense region employs a grid with fine granularity, while a sparse region uses a grid with coarse granularity. In this way, the ST2B-tree adapts itself to workload diversity in space. To enable online tuning, the ST2B-tree employs a “multitree” indexing technique. The underlying B+-tree is logically divided into two subtrees. Objects are dispatched to either subtree depending on their last update time. The two subtrees are rebuilt periodically and alternately. Whenever a subtree is rebuilt, it is tuned to optimize performance by picking an appropriate setting (e.g., the set of reference points and grid granularity) based on the most recent data and workload. To cut down the overhead of rebuilding, we propose an eager update technique to construct the subtree. Finally, we present a tuning framework for the ST2B-tree, where the tuning is conducted online and automatically without human intervention, and without interfering with the regular functions of the MOD. We have implemented the tuning framework and the ST2B-tree, and conducted extensive performance evaluations.
The results show that the self-tuning mechanism minimizes the degradation of performance caused by workload changes without any noticeable overhead.
ABSTRACT: In many decision making applications, users typically issue aggregate queries. To evaluate these computationally expensive queries, online aggregation has been developed to provide approximate answers (with their respective confidence intervals) quickly, and to continuously refine the answers. In this paper, we extend the online aggregation technique to a distributed context where sites are maintained in a DHT (Distributed Hash Table) network. Our Distributed Online Aggregation (DoA) scheme iteratively and progressively produces approximate aggregate answers as follows: in each iteration, a small set of random samples are retrieved from the data sites and distributed to the processing sites; at each processing site, a local aggregate is computed based on the allocated samples; at a coordinator site, these local aggregates are combined into a global aggregate. DoA adaptively grows the number of processing nodes as the sample size increases. To further reduce the sampling overhead, the samples are retained as a precomputed synopsis over the network to be used for processing future queries. We also study how these synopses can be maintained incrementally. We have conducted extensive experiments on PlanetLab. The results show that our DoA scheme reduces the initial waiting time significantly and provides high quality approximate answers with running confidence intervals progressively.
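The per-iteration flow described above (local aggregates at processing sites, combined at a coordinator) can be sketched for a mean as follows. This is only an illustration of mergeable partial aggregates; the functions `local_aggregate` and `combine`, and the choice of (count, sum, sum of squares) as the partial state, are assumptions rather than details taken from the DoA paper.

```python
def local_aggregate(samples):
    """Processing site: summarize its allocated samples.

    Returns (count, sum, sum of squares), which is sufficient to merge
    means and variances later without shipping the raw samples.
    """
    return (len(samples), sum(samples), sum(x * x for x in samples))


def combine(partials):
    """Coordinator: merge local aggregates into a global estimate.

    Returns (count, mean, sample variance).
    """
    n = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    total_sq = sum(p[2] for p in partials)
    mean = total / n
    variance = (total_sq - n * mean * mean) / (n - 1) if n > 1 else 0.0
    return n, mean, variance


# Hypothetical samples allocated to two processing sites.
partials = [local_aggregate([1, 2, 3]), local_aggregate([4, 5])]
print(combine(partials))
```

Because the partial state is small and additive, adding more processing sites in later iterations only changes how many tuples the coordinator merges.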
ABSTRACT: In a co-space environment, the physical space and the virtual space co-exist, and interact simultaneously. While the physical space is virtually enhanced with information, the virtual space is continuously refreshed with real-time, real-world information. To allow users to process and manipulate information seamlessly between the real and digital spaces, novel technologies must be developed. These include smart interfaces, new augmented realities, efficient storage and data management and dissemination techniques. In this paper, we first discuss some promising co-space applications. These applications offer experiences and opportunities that neither of the spaces can realize on its own. We then argue that the database community has much to offer to this field. Finally, we present several challenges that we, as a community, can contribute towards managing the co-space.
ABSTRACT: The processing of a Continuous Reverse k-Nearest-Neighbor (CRkNN) query on moving objects can be divided into two subtasks: continuous filter and continuous refinement. The algorithms for the two tasks can be completely independent. Existing CRkNN solutions employ Continuous k-Nearest-Neighbor (CkNN) queries for both continuous filter and continuous refinement. We analyze the CkNN-based solution and point out that when k > 1 the refinement cost becomes the system bottleneck. We propose a new continuous refinement method called CRange-k. In CRange-k, we transform the continuous verification problem into a Continuous Range-k query, which is also defined in this paper, and process it efficiently. Our experimental study shows that the CRkNN solution based on our CRange-k refinement method is more efficient and scalable than the state-of-the-art CRkNN solution.
Mobile Data Management, 2008. MDM '08. 9th International Conference on; 05/2008
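For readers unfamiliar with reverse nearest-neighbor queries, the following brute-force sketch shows what a static (snapshot) RkNN query returns. It is not the continuous filter and refinement pipeline or the CRange-k method from the paper. It assumes the common monochromatic definition: an object belongs to the answer if fewer than k other objects are strictly closer to it than the query point. The function name and sample points are hypothetical.

```python
import math


def reverse_knn(query, objects, k):
    """Brute-force snapshot RkNN (illustrative sketch only).

    Return the objects that would count `query` among their k nearest
    neighbors, i.e. fewer than k other objects lie strictly closer.
    """
    result = []
    for i, obj in enumerate(objects):
        dist_to_query = math.dist(obj, query)
        closer = sum(
            1
            for j, other in enumerate(objects)
            if j != i and math.dist(obj, other) < dist_to_query
        )
        if closer < k:
            result.append(obj)
    return result


# Hypothetical object positions on a line.
points = [(0, 0), (1, 0), (5, 0), (6, 0)]
print(reverse_knn((2, 0), points, k=1))
```

The per-object check in the loop is the verification step; it is the part whose cost grows with k.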
ABSTRACT: Most existing privacy-preserving techniques, such as anonymity methods, are designed for static data sets. As such, they cannot be applied to streaming data which are continuous, transient and usually unbounded. Moreover, in streaming applications, there is a need to offer strong guarantees on the maximum allowed delay between incoming data and its anonymized output. To cope with these requirements, in this paper, we present CASTLE (continuously anonymizing streaming data via adaptive clustering), a cluster-based scheme that anonymizes data streams on-the-fly and, at the same time, ensures the freshness of the anonymized data by satisfying specified delay constraints. We further show how CASTLE can be easily extended to handle ℓ-diversity. Our extensive performance study shows that CASTLE is efficient and effective.
Proceedings of the 24th International Conference on Data Engineering, ICDE 2008, April 7-12, 2008, Cancún, México; 01/2008