Kian-Lee Tan

National University of Singapore, Singapore

Publications (80) · 20.69 Total Impact Points

  • ABSTRACT: Companies are increasingly moving their data processing to the cloud, for reasons of cost, scalability, and convenience, among others. However, hosting multiple applications and storage systems on the same cloud introduces resource sharing and heterogeneous data processing challenges due to the variety of resource usage patterns employed, the variety of data types stored, and the variety of query interfaces presented by those systems. Furthermore, real clouds are never perfectly symmetric: there are often differences between individual processors in their capabilities and connectivity. In this paper, we introduce a federation framework to manage such heterogeneous clouds. We then use this framework to discuss several challenges and their potential solutions.
    IEEE Transactions on Knowledge and Data Engineering 01/2014; 26(7):1670-1678. · 1.89 Impact Factor
  • ABSTRACT: Data cubes are widely used as a powerful tool to provide multidimensional views in data warehousing and On-Line Analytical Processing (OLAP). However, with increasing data sizes, it is becoming computationally expensive to perform data cube analysis. The problem is exacerbated by the demand for supporting more complicated aggregate functions (e.g. CORRELATION, statistical analysis) as well as supporting frequent view updates in data cubes. This calls for new scalable and efficient data cube analysis systems. In this paper, we introduce HaCube, an extension of MapReduce, designed for efficient parallel data cube analysis on large-scale data, taking advantage of both MapReduce (scalability) and parallel DBMSs (efficiency). We also provide a general data cube materialization algorithm that leverages the features of MapReduce-like systems for efficient data cube computation. Furthermore, we demonstrate how HaCube supports view maintenance through either incremental computation (e.g. used for SUM or COUNT) or recomputation (e.g. used for MEDIAN or CORRELATION). We implement HaCube by extending Hadoop and evaluate it based on the TPC-D benchmark over billions of tuples on a cluster with over 320 cores. The experimental results demonstrate the efficiency, scalability and practicality of HaCube for cube analysis over a large amount of data in a distributed environment.
    11/2013;
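The two view-maintenance strategies mentioned above can be illustrated with a small sketch. Function names and data layout here are illustrative assumptions, not HaCube's actual implementation: SUM is distributive, so a cube cell can absorb a delta directly, while MEDIAN is holistic and forces recomputation over all tuples.

```python
# Sketch: view maintenance for distributive vs. holistic aggregates.
# SUM is distributive: a cube cell can be refreshed from the delta alone.
# MEDIAN is holistic: the cell must be recomputed from all tuples.
from statistics import median

def refresh_sum(old_sum, delta_tuples):
    # Incremental computation: only the newly arrived tuples are touched.
    return old_sum + sum(delta_tuples)

def refresh_median(all_tuples, delta_tuples):
    # Recomputation: base data plus delta are re-aggregated in full.
    return median(all_tuples + delta_tuples)

base = [10, 20, 30]
delta = [40]
print(refresh_sum(sum(base), delta))    # 100
print(refresh_median(base, delta))      # 25.0
```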
  • Htoo Htet Aung, Long Guo, Kian-Lee Tan
    ABSTRACT: Knowledge of the routes frequently used by tracked objects is embedded in massive trajectory databases. Such knowledge has various applications in optimizing ports' operations and in route-recommendation systems, but is difficult to extract, especially when the underlying road network information is unavailable. We propose a novel approach, which discovers frequent routes without any prior knowledge of the underlying road network, by mining sub-trajectory cliques. Since mining all sub-trajectory cliques is NP-complete, we propose two approximate algorithms based on the Apriori algorithm. Empirical results show that our algorithms run fast and produce intuitive results.
    Proceedings of the 13th international conference on Advances in Spatial and Temporal Databases; 08/2013
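The Apriori principle behind the approximate mining above can be sketched as follows: a set of trajectories can only form a frequent clique if every subset of it does, so candidates with an infrequent subset are pruned. The object IDs and the co-travel test are assumptions for illustration, not the paper's algorithm.

```python
# Apriori-style growth of co-traveling trajectory cliques.
from itertools import combinations

def apriori_cliques(pairs):
    """pairs: co-traveling ('frequent') 2-cliques of trajectory IDs.
    Grows candidate cliques level by level, pruning any candidate
    that has an infrequent subset (the Apriori principle)."""
    level = {frozenset(p) for p in pairs}
    result = set(level)
    k = 2
    while level:
        candidates = set()
        for a in level:
            for b in level:
                u = a | b
                if len(u) == k + 1:
                    # Apriori pruning: every k-subset must be frequent.
                    if all(frozenset(s) in level for s in combinations(u, k)):
                        candidates.add(u)
        result |= candidates
        level = candidates
        k += 1
    return result

cliques = apriori_cliques([("s1", "s2"), ("s2", "s3"), ("s1", "s3")])
print(frozenset({"s1", "s2", "s3"}) in cliques)  # True
```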
  • ABSTRACT: The pervasiveness of location-acquisition technologies has made it possible to collect the movement data of individuals or vehicles. However, such data have to be carefully managed to ensure that there is no privacy breach. In this paper, we investigate the problem of publishing trajectory data under the differential privacy model. A straightforward solution is to add noise to a trajectory: either to each coordinate of a position, to each position of the trajectory, or to the whole trajectory. However, such naive approaches result in trajectories with zigzag shapes and many crossings, making the published trajectories of little practical use. We introduce a mechanism called SDD (Sampling Distance and Direction), which is ε-differentially private. SDD samples a suitable direction and distance at each position to publish the next possible position. Numerical experiments conducted on real ship trajectories demonstrate that our proposed mechanism can deliver ship trajectories that are of good practical utility.
    Proceedings of the 25th International Conference on Scientific and Statistical Database Management; 07/2013
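The idea of privately sampling a direction can be illustrated with a toy exponential-mechanism sketch. The utility function, the discretization into compass bins, and the parameters here are all assumptions for illustration, not the paper's actual SDD design.

```python
# Toy illustration: privately pick a heading via the exponential mechanism.
import math
import random

def sample_direction(preferred, eps, n_bins=36, sensitivity=1.0):
    """Pick one of n_bins compass directions; directions close to the
    (non-private) preferred heading get exponentially higher weight."""
    bins = [2 * math.pi * i / n_bins for i in range(n_bins)]

    def utility(theta):
        # Negative angular distance to the preferred heading.
        d = abs(theta - preferred) % (2 * math.pi)
        return -min(d, 2 * math.pi - d)

    weights = [math.exp(eps * utility(t) / (2 * sensitivity)) for t in bins]
    return random.choices(bins, weights=weights)[0]

random.seed(0)
theta = sample_direction(preferred=math.pi / 2, eps=5.0)
print(0 <= theta < 2 * math.pi)  # True
```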
  • Conference Paper: Nearest group queries
    ABSTRACT: k nearest neighbor (kNN) search is an important problem in a vast number of applications, including clustering, pattern recognition, image retrieval and recommendation systems. It finds k elements from a data source D that are closest to a given query point q in a metric space. In this paper, we extend the kNN query to retrieve closest elements from multiple data sources. This new type of query is named the k nearest group (kNG) query, which finds k groups of elements that are closest to q, with each group containing one object from each data source. The kNG query is useful in many location based services. To efficiently process kNG queries, we propose a baseline algorithm using the R-tree as well as an improved version using the Hilbert R-tree. We also study a variant of the kNG query, named kNG Join, which is analogous to kNN Join. Given a set of query points Q, kNG Join returns k nearest groups for each point in Q. Such a query is useful in publish/subscribe systems to find matching items for a collection of subscribers. A comprehensive performance study was conducted on both synthetic and real datasets, and the experimental results show that the Hilbert R-tree achieves significantly better performance than the R-tree in answering both kNG queries and kNG Joins.
    Proceedings of the 25th International Conference on Scientific and Statistical Database Management; 07/2013
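A brute-force kNG baseline can be sketched as below; the sum-of-distances aggregate is an assumption for illustration. The paper's R-tree and Hilbert R-tree algorithms exist precisely to prune this exhaustive enumeration.

```python
# Brute-force k nearest group (kNG) baseline: one point per data source.
from itertools import product
from math import dist

def kng(query, sources, k):
    """sources: list of point lists, one per data source. Returns the k
    groups (one point from each source) with the smallest total
    distance to the query point."""
    groups = product(*sources)
    scored = sorted(groups, key=lambda g: sum(dist(query, p) for p in g))
    return scored[:k]

a = [(0, 0), (5, 5)]
b = [(1, 0), (6, 6)]
best = kng((0, 0), [a, b], k=1)[0]
print(best)  # ((0, 0), (1, 0))
```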
  • ABSTRACT: The popularity of similarity search has grown with increased interest in multimedia databases, bioinformatics, and social networks, and with the growing number of users trying to find information in huge collections of unstructured data. During the ...
    ACM SIGMOD Record 06/2013; 42(2):46-51. · 0.46 Impact Factor
  • ABSTRACT: Regularly releasing aggregate statistics about data streams in a privacy-preserving way not only serves valuable commercial and social purposes, but also protects the privacy of individuals. This problem has already been studied under differential privacy, but only for the case of a single continuous query that covers the entire time span, e.g., counting the number of tuples seen so far in the stream. However, most real-world applications are window-based; that is, they are interested in statistical information about streaming data within a window, instead of the whole unbounded stream. Furthermore, a Data Stream Management System (DSMS) may need to answer numerous correlated aggregate queries simultaneously, rather than a single one. To cope with these requirements, we study how to release differentially private answers for a set of sliding window aggregate queries. We propose two solutions, each consisting of query sampling and composition. We first selectively sample a subset of representative sliding window queries from the set of all the submitted ones. The representative queries are answered by adding Laplace noise in a way that satisfies differential privacy. For each non-representative query, we compose its answer from the query results of those representatives. The experimental evaluation shows that our solutions are efficient and effective.
    Proceedings of the 16th International Conference on Extending Database Technology; 03/2013
  • ABSTRACT: In this big data era, huge numbers of spatial documents are generated every day through various location based services. Top-k spatial keyword search is an important approach to exploring useful information in a spatial database. It retrieves k documents based on a ranking function that takes into account both textual relevance (similarity between the query and document keywords) and spatial relevance (distance between the query and document locations). Various hybrid indexes have been proposed in recent years, which mainly combine the R-tree and the inverted index so that spatial pruning and textual pruning can be executed simultaneously. However, the rapid growth in data volume poses significant challenges to existing methods in terms of index maintenance cost and query processing time. In this paper, we propose a scalable integrated inverted index, named I3, which adopts the Quadtree structure to hierarchically partition the data space into cells. The basic unit of I3 is the keyword cell, which captures the spatial locality of a keyword. Moreover, we design a new storage mechanism for efficient retrieval of keyword cells and preserve additional summary information to facilitate pruning. Experiments conducted on real spatial datasets (Twitter and Wikipedia) demonstrate the superiority of I3 over existing schemes such as the IR-tree and S2I in various aspects: it incurs shorter construction time to build the index, it has lower index storage cost, it is orders of magnitude faster for updates, and it is highly scalable and answers top-k spatial keyword queries efficiently.
    Proceedings of the 16th International Conference on Extending Database Technology; 03/2013
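A ranking function of the kind described above can be sketched as a weighted combination of textual and spatial relevance. The exact formula, the weight alpha, and the distance normalization are assumptions for illustration, not the paper's definition.

```python
# Sketch: top-k spatial keyword ranking combining text and distance.
import heapq
from math import dist

def topk(query_loc, query_terms, docs, k, alpha=0.5, max_d=100.0):
    """docs: list of (location, set_of_terms). Higher score is better."""
    def score(loc, terms):
        # Textual relevance: fraction of query terms the document covers.
        text_rel = len(query_terms & terms) / max(len(query_terms), 1)
        # Spatial relevance: distance normalized into [0, 1], closer is better.
        spatial_rel = max(0.0, 1 - dist(query_loc, loc) / max_d)
        return alpha * text_rel + (1 - alpha) * spatial_rel
    return heapq.nlargest(k, docs, key=lambda d: score(*d))

docs = [((1, 1), {"cafe", "wifi"}), ((90, 90), {"cafe", "wifi"})]
best = topk((0, 0), {"cafe", "wifi"}, docs, k=1)[0]
print(best[0])  # (1, 1)
```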
  • Zhengkui Wang, D. Agrawal, Kian-Lee Tan
    ABSTRACT: In many scientific applications, it is critical to determine if there is a relationship among a combination of objects. The strength of such an association is typically computed using some statistical measure. In order not to miss any important associations, it is not uncommon to exhaustively enumerate all possible combinations of a certain size. However, discovering significant associations among hundreds of thousands or even millions of objects is a computationally intensive job that typically takes days, if not weeks, to complete. We are, therefore, motivated to provide efficient and practical techniques to speed up the processing by exploiting parallelism. In this paper, we propose a framework, COSAC, for such combinatorial statistical analysis of large-scale data sets over a MapReduce-based cloud computing platform. COSAC operates in two key phases: 1) In the distribution phase, a novel load balancing scheme distributes the combination enumeration tasks across the processing units; 2) In the statistical analysis phase, each unit optimizes the processing of the allocated combinations by salvaging computations that can be reused. COSAC also supports a more practical scenario, where only a selected subset of objects needs to be analyzed against all the objects. As a representative application, we developed COSAC to find combinations of Single Nucleotide Polymorphisms (SNPs) that may interact to cause diseases. We have evaluated our framework on a cluster of more than 40 nodes. The experimental results show that our framework is computationally practical, efficient, scalable, and flexible.
    IEEE Transactions on Knowledge and Data Engineering 01/2013; 25(9):2010-2023. · 1.89 Impact Factor
  • Htoo Htet Aung, Kian-Lee Tan
    ABSTRACT: In this paper, we present ongoing PhD research on mining Multi-object Spatial-temporal Movement Patterns (M-STEM Patterns) from a Trajectory Database (TJDB). Information about M-STEM Pattern instances has numerous applications in epidemiology, ecology, location-based services, transportation, and the social and behavioural sciences, since it supplements the information provided by a traditional GIS. We describe the research we have conducted to find instances of two M-STEM Patterns, namely the Meeting pattern and the Convoy pattern. We conclude this paper after introducing our ongoing research on discovering instances of another M-STEM Pattern called the Tried-and-True Route pattern.
    SIGSPATIAL Special. 11/2012; 4(3):14-19.
  • Yu Cao, Chee-Yong Chan, Jie Li, Kian-Lee Tan
    ABSTRACT: Analytic functions represent the state-of-the-art way of performing complex data analysis within a single SQL statement. In particular, an important class of analytic functions that has been frequently used in commercial systems to support OLAP and decision support applications is the class of window functions. A window function returns for each input tuple a value derived from applying a function over a window of neighboring tuples. However, existing window function evaluation approaches are based on a naive sorting scheme. In this paper, we study the problem of optimizing the evaluation of window functions. We propose several efficient techniques, and identify opportunities to share work across the evaluation of a set of window functions. We have integrated our scheme into PostgreSQL. Our comprehensive experimental study on the TPC-DS datasets as well as synthetic datasets and queries demonstrates significant speedup over existing approaches.
    07/2012;
  • ABSTRACT: Many database applications require sorting a table (or relation) over multiple sort orders. Some examples include creation of multiple indices on a relation, generation of multiple reports from a table, evaluation of a complex query that involves multiple instances of a relation, and batch processing of a set of queries. In this paper, we study how to optimize multiple sortings of a table. We investigate the correlation between sort orders and exploit sort-sharing techniques of reusing the (partial) work done to sort a table on a particular order for another order. Specifically, we introduce a novel and powerful evaluation technique, called cooperative sorting, that enables sort sharing between seemingly non-related sort orders. Subsequently, given a specific set of sort orders, we determine the best combination of various sort-sharing techniques so as to minimize the total processing cost. We also develop techniques to make a traditional query optimizer extensible so that it will not miss the truly cheapest execution plan with the sort-sharing (post-) optimization turned on. We demonstrate the efficiency of our ideas with a prototype implementation in PostgreSQL and evaluate the performance using both the TPC-DS benchmark and synthetic data. Our experimental results show significant performance improvement over the traditional evaluation scheme.
    The VLDB Journal 06/2012; 21(3). · 1.40 Impact Factor
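One simple flavor of sort sharing can be sketched as follows: a table sorted on (a, b) is already sorted on the prefix (a), and within each a-group it is sorted on b, so an order on (b) can be produced by a cheap multiway merge of those runs instead of a full re-sort. This is an illustrative case of reuse, not the paper's full cooperative sorting algorithm.

```python
# Reuse a sort on (a, b) to produce order (b) via a multiway run merge.
from heapq import merge
from itertools import groupby

def reuse_for_b(rows_sorted_ab):
    """rows_sorted_ab: (a, b) tuples already sorted on (a, b)."""
    # Each a-group is a run already sorted on b.
    runs = [list(g) for _, g in groupby(rows_sorted_ab, key=lambda r: r[0])]
    # Merge the runs instead of sorting from scratch.
    return list(merge(*runs, key=lambda r: r[1]))

rows = sorted([(1, 9), (1, 2), (2, 5), (2, 1)])  # sorted on (a, b)
print(reuse_for_b(rows))  # [(2, 1), (1, 2), (2, 5), (1, 9)]
```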
  • Qian Xiao, Htoo Htet Aung, Kian-Lee Tan
    ABSTRACT: Social Networking Sites (SNSs) allow users to publish posts to certain user-defined circles (sets of users). However, existing SNS models are limited in several ways. First, it is not practical to predefine all circles a user will ever need for disseminating her posts. Second, existing SNSs do not currently have an effective mechanism for a user to create and/or customize dynamic (ad-hoc) circles for each publishing session. Third, SNSs do not have features that help users manage and use circles easily as their habits change. In this paper, we propose a novel model for creating ad-hoc circles as needs arise. We present a recommendation framework -- the Circle OpeRation RECommendaTion (CORRECT) framework -- to assist users in easily utilizing our proposed model. Contrary to current SNS offerings, our proposed model does not require a user to create an extensive list of predefined circles; instead, ad-hoc circles are recommended based on a few building-block circles the user has defined and historical ad-hoc circles the user has created.
    05/2012;
  • Qian Xiao, Kian-Lee Tan
    ABSTRACT: Today's online social networks (OSNs) allow a user to share his photos with others and tag the co-owners, i.e., friends who also appear in the co-owned photos. However, it is not uncommon that conflicts arise among the co-owners because of their different privacy concerns. OSNs, unfortunately, offer only limited access control support, where the publisher of the shared content is the sole decision maker in restricting access. There is thus an urgent need to develop mechanisms for multiple owners of shared content to collaboratively determine the access rights of other users, as well as to resolve the conflicts among co-owners with different requirements. Rather than competing with each other and simply wanting one's own decision to be executed, OSN users may be affected by their peers' concerns and adjust their decisions accordingly. To incorporate such peer effects, we formulate a model that simulates an emotional mediation among multiple co-owners. Our mechanism, called CAPE, considers the intensity with which the co-owners are willing to adopt a choice (e.g. to release a photo to the public) and the extent to which they want their decisions to be affected by their peers' actions. Moreover, CAPE automatically yields the final actions for the co-owners as the mediation reaches equilibrium. It frees the co-owners from the mediation process after the initial setting, and, at the same time, offers a way to achieve greater agreement among them.
    Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom), 2012 8th International Conference on; 01/2012
  • ABSTRACT: The corporate network is often used for sharing information among the participating companies and facilitating collaboration in a certain industry sector where companies share a common interest. It can effectively help companies reduce their operational costs and increase their revenues. However, inter-company data sharing and processing poses unique challenges to such a data management system, including scalability, performance, throughput, and security. In this paper, we present BestPeer++, a system which delivers elastic data sharing services for corporate network applications in the cloud, based on BestPeer, a peer-to-peer (P2P) based data management platform. By integrating cloud computing, database, and P2P technologies into one system, BestPeer++ provides an economical, flexible and scalable platform for corporate network applications and delivers data sharing services to participants based on the widely accepted pay-as-you-go business model. We evaluate BestPeer++ on the Amazon EC2 cloud platform. The benchmarking results show that BestPeer++ outperforms HadoopDB, a recently proposed large-scale data processing system, when both systems are employed to handle typical corporate network workloads. The benchmarking results also demonstrate that BestPeer++ achieves near linear scalability in throughput with respect to the number of peer nodes.
    01/2012;
  • ABSTRACT: Text databases provide rapid access to collections of digital documents. Such databases have become ubiquitous: text search engines underlie the online text repositories accessible via the Web and are central to digital libraries and online corporate document management.
    07/2011: pages 151-183;
  • ABSTRACT: There has been a growing acceptance of the object-oriented data model as the basis of next generation database management systems (DBMSs). Both pure object-oriented DBMSs (OODBMSs) and object-relational DBMSs (ORDBMSs) have been developed based on object-oriented concepts. Object-relational DBMSs, in particular, extend the SQL language by incorporating all the concepts of the object-oriented data model. A large number of products in both categories of DBMS are available today. In particular, all major vendors of relational DBMSs are turning their products into ORDBMSs [Nori, 1996].
    07/2011: pages 1-38;
  • ABSTRACT: Most of the existing privacy-preserving techniques, such as k-anonymity methods, are designed for static data sets. As such, they cannot be applied to streaming data, which are continuous, transient, and usually unbounded. Moreover, in streaming applications, there is a need to offer strong guarantees on the maximum allowed delay between incoming data and the corresponding anonymized output. To cope with these requirements, in this paper, we present Continuously Anonymizing STreaming data via adaptive cLustEring (CASTLE), a cluster-based scheme that anonymizes data streams on-the-fly and, at the same time, ensures the freshness of the anonymized data by satisfying specified delay constraints. We further show how CASTLE can be easily extended to handle ℓ-diversity. Our extensive performance study shows that CASTLE is efficient and effective w.r.t. the quality of the output data.
    IEEE Transactions on Dependable and Secure Computing 07/2011; · 1.06 Impact Factor
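The cluster-then-generalize idea behind such schemes can be sketched as follows: once a cluster holds at least k tuples, each tuple's quasi-identifier is released as the cluster's value range, making the k tuples indistinguishable. This is a minimal sketch of the general principle; CASTLE's adaptive clustering and delay constraints are omitted.

```python
# Sketch: k-anonymization of a cluster by range generalization.
def generalize(cluster, k):
    """cluster: list of (qi_value, payload). Returns anonymized tuples,
    or None if the cluster is still too small to release."""
    if len(cluster) < k:
        return None
    lo = min(v for v, _ in cluster)
    hi = max(v for v, _ in cluster)
    # Every released tuple carries the same generalized quasi-identifier.
    return [((lo, hi), payload) for _, payload in cluster]

out = generalize([(25, "flu"), (29, "cold"), (31, "flu")], k=3)
print(out[0][0])  # (25, 31)
print(generalize([(25, "flu")], k=3))  # None -- must wait for more tuples
```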
  • ABSTRACT: Recent studies have suggested that a combination of multiple single nucleotide polymorphisms (SNPs) could have more significant associations with a specific phenotype. However, discovering epistasis, i.e., the epistatic interactions of SNPs, among a large number of SNPs is a computationally challenging task. We are, therefore, motivated to develop efficient and effective solutions for identifying epistatic interactions of SNPs. In this article, we propose an efficient Cloud-based Epistasis cOmputing (eCEO) model for large-scale epistatic interaction detection in genome-wide association studies (GWAS). Given a large number of combinations of SNPs, our eCEO model is able to distribute them to balance the load across the processing nodes. Moreover, our eCEO model can efficiently process each combination of SNPs to determine the significance of its association with the phenotype. We have implemented and evaluated our eCEO model on our own cluster of more than 40 nodes. The experimental results demonstrate that the eCEO model is computationally efficient, flexible, scalable and practical. In addition, we have also deployed our eCEO model on the Amazon Elastic Compute Cloud. Our study further confirms its efficiency and ease of use in a public cloud. The source code of eCEO is available at http://www.comp.nus.edu.sg/~wangzk/eCEO.html. Contact: wangzhengkui@nus.edu.sg.
    Bioinformatics 03/2011; 27(8):1045-51. · 5.47 Impact Factor
  • Qian Xiao, Zhengkui Wang, Kian-Lee Tan
    ABSTRACT: In this paper, we propose a randomization scheme, LORA (Link Obfuscation by Randomization), to obfuscate edge existence in graphs. Specifically, we extract the source graph's hierarchical random graph model and reconstruct the released graph randomly with this model. We show that the released graph can preserve critical graph statistical properties even after a large number of edges have been replaced. To measure the effectiveness of our scheme, we introduce the notion of link entropy to quantify its privacy-preserving strength with respect to the existence of edges.
    Secure Data Management - 8th VLDB Workshop, SDM 2011, Seattle, WA, USA, September 2, 2011, Proceedings; 01/2011
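Quantifying an adversary's uncertainty about an edge's existence can be sketched with binary entropy over the edge's posterior probability; whether this matches the paper's precise link-entropy definition is an assumption.

```python
# Sketch: entropy of a Bernoulli(p) edge-existence variable, in bits.
import math

def binary_entropy(p):
    """0 bits when existence is certain either way; 1 bit at p = 0.5."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.5))  # 1.0 -- adversary is maximally uncertain
print(binary_entropy(1.0))  # 0.0 -- edge existence is fully disclosed
```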

Publication Stats

1k Citations
20.69 Total Impact Points

Institutions

  • 1998–2013
    • National University of Singapore
      • School of Computing
      • Department of Computer Science
      Singapore
  • 2009
    • Harbin Institute of Technology
      • School of Computer Science and Technology
      Harbin, Heilongjiang, China
  • 2008
    • The Ohio State University
      Columbus, Ohio, United States
  • 1997–1999
    • Australian National University
      Canberra, Australian Capital Territory, Australia