Kian-Lee Tan

National University of Singapore, Singapore

Publications (95) · 28.03 Total impact

  • Qian Xiao, Rui Chen, Kian-Lee Tan
    08/2014;
  • 07/2014;
  • ABSTRACT: Companies are increasingly moving their data processing to the cloud, for reasons of cost, scalability, and convenience, among others. However, hosting multiple applications and storage systems on the same cloud introduces resource sharing and heterogeneous data processing challenges due to the variety of resource usage patterns employed, the variety of data types stored, and the variety of query interfaces presented by those systems. Furthermore, real clouds are never perfectly symmetric: there are often differences between individual processors in their capabilities and connectivity. In this paper, we introduce a federation framework to manage such heterogeneous clouds. We then use this framework to discuss several challenges and their potential solutions.
    IEEE Transactions on Knowledge and Data Engineering 01/2014; 26(7):1670-1678. · 1.89 Impact Factor
  • ABSTRACT: Data cubes are widely used as a powerful tool to provide multidimensional views in data warehousing and On-Line Analytical Processing (OLAP). However, with increasing data sizes, it is becoming computationally expensive to perform data cube analysis. The problem is exacerbated by the demand for supporting more complicated aggregate functions (e.g. CORRELATION, Statistical Analysis) as well as frequent view updates in data cubes. This calls for new scalable and efficient data cube analysis systems. In this paper, we introduce HaCube, an extension of MapReduce, designed for efficient parallel data cube analysis on large-scale data by taking advantage of both MapReduce (in terms of scalability) and parallel DBMS (in terms of efficiency). We also provide a general data cube materialization algorithm that takes advantage of the features of MapReduce-like systems to compute data cubes efficiently. Furthermore, we demonstrate how HaCube supports view maintenance through either incremental computation (e.g. used for SUM or COUNT) or recomputation (e.g. used for MEDIAN or CORRELATION). We implement HaCube by extending Hadoop and evaluate it based on the TPC-D benchmark over billions of tuples on a cluster with over 320 cores. The experimental results demonstrate the efficiency, scalability and practicality of HaCube for cube analysis over a large amount of data in a distributed environment.
    11/2013;
  • Htoo Htet Aung, Long Guo, Kian-Lee Tan
    ABSTRACT: Knowledge of the routes frequently used by tracked objects is embedded in massive trajectory databases. Such knowledge has various applications in optimizing ports' operations and route-recommendation systems but is difficult to extract, especially when the underlying road network information is unavailable. We propose a novel approach, which discovers frequent routes without any prior knowledge of the underlying road network, by mining sub-trajectory cliques. Since mining all sub-trajectory cliques is NP-Complete, we propose two approximate algorithms based on the Apriori algorithm. Empirical results show that our algorithms run fast and that their results are intuitive.
    Proceedings of the 13th international conference on Advances in Spatial and Temporal Databases; 08/2013
  • ABSTRACT: The pervasiveness of location-acquisition technologies has made it possible to collect the movement data of individuals or vehicles. However, it has to be carefully managed to ensure that there is no privacy breach. In this paper, we investigate the problem of publishing trajectory data under the differential privacy model. A straightforward solution is to add noise to a trajectory: this can be done either by adding noise to each coordinate of the position, to each position of the trajectory, or to the whole trajectory. However, such naive approaches result in trajectories with zigzag shapes and many crossings, making the published trajectories of little practical use. We introduce a mechanism called SDD (Sampling Distance and Direction), which is ε-differentially private. SDD samples a suitable direction and distance at each position to publish the next possible position. Numerical experiments conducted on real ship trajectories demonstrate that our proposed mechanism can deliver ship trajectories that are of good practical utility.
    Proceedings of the 25th International Conference on Scientific and Statistical Database Management; 07/2013
  • Conference Paper: Nearest group queries
    ABSTRACT: k nearest neighbor (kNN) search is an important problem in a vast number of applications, including clustering, pattern recognition, image retrieval and recommendation systems. It finds k elements from a data source D that are closest to a given query point q in a metric space. In this paper, we extend kNN query to retrieve closest elements from multiple data sources. This new type of query is named k nearest group (kNG) query, which finds k groups of elements that are closest to q with each group containing one object from each data source. kNG query is useful in many location based services. To efficiently process kNG queries, we propose a baseline algorithm using R-tree as well as an improved version using Hilbert R-tree. We also study a variant of kNG query, named kNG Join, which is analogous to kNN Join. Given a set of query points Q, kNG Join returns k nearest groups for each point in Q. Such a query is useful in publish/subscribe systems to find matching items for a collection of subscribers. A comprehensive performance study was conducted on both synthetic and real datasets and the experimental results show that Hilbert R-tree achieves significantly better performance than R-tree in answering both kNG query and kNG Join.
    Proceedings of the 25th International Conference on Scientific and Statistical Database Management; 07/2013
  • ABSTRACT: The popularity of similarity search expanded with the increased interest in multimedia databases, bioinformatics, or social networks, and with the growing number of users trying to find information in huge collections of unstructured data. During the ...
    ACM SIGMOD Record 06/2013; 42(2):46-51. · 0.46 Impact Factor
  • ABSTRACT: Regularly releasing the aggregate statistics about data streams in a privacy-preserving way not only serves valuable commercial and social purposes, but also protects the privacy of individuals. This problem has already been studied under differential privacy, but only for the case of a single continuous query that covers the entire time span, e.g., counting the number of tuples seen so far in the stream. However, most real-world applications are window-based, that is, they are interested in the statistical information about streaming data within a window, instead of the whole unbounded stream. Furthermore, a Data Stream Management System (DSMS) may need to answer numerous correlated aggregate queries simultaneously, rather than a single one. To cope with these requirements, we study how to release differentially private answers for a set of sliding window aggregate queries. We propose two solutions, each consisting of query sampling and composition. We first selectively sample a subset of representative sliding window queries from the set of all the submitted ones. The representative queries are answered by adding Laplace noise in a way that satisfies differential privacy. For each non-representative query, we compose its answer from the query results of those representatives. The experimental evaluation shows that our solutions are efficient and effective.
    Proceedings of the 16th International Conference on Extending Database Technology; 03/2013
  • ABSTRACT: In this big data era, huge amounts of spatial documents are generated every day through various location based services. Top-k spatial keyword search is an important approach to exploring useful information from a spatial database. It retrieves k documents based on a ranking function that takes into account both textual relevance (similarity between the query and document keywords) and spatial relevance (distance between the query and document locations). Various hybrid indexes have been proposed in recent years which mainly combine the R-tree and the inverted index so that spatial pruning and textual pruning can be executed simultaneously. However, the rapid growth in data volume poses significant challenges to existing methods in terms of the index maintenance cost and query processing time. In this paper, we propose a scalable integrated inverted index, named I3, which adopts the Quadtree structure to hierarchically partition the data space into cells. The basic unit of I3 is the keyword cell, which captures the spatial locality of a keyword. Moreover, we design a new storage mechanism for efficient retrieval of keyword cells and preserve additional summary information to facilitate pruning. Experiments conducted on real spatial datasets (Twitter and Wikipedia) demonstrate the superiority of I3 over existing schemes such as IR-tree and S2I in various aspects: it incurs shorter construction time to build the index, it has lower index storage cost, it is an order of magnitude faster in updates, and it is highly scalable and answers top-k spatial keyword queries efficiently.
    Proceedings of the 16th International Conference on Extending Database Technology; 03/2013
  • ABSTRACT: Finding a location for a new facility such that the facility attracts the maximal number of customers is a challenging problem. Existing studies either model customers as static sites and thus do not consider customer movement, or they focus on theoretical aspects and do not provide solutions that are shown empirically to be scalable. Given a road network, a set of existing facilities, and a collection of customer route traversals, an optimal segment query returns the optimal road network segment(s) for a new facility. We propose a practical framework for computing this query, where each route traversal is assigned a score that is distributed among the road segments covered by the route according to a score distribution model. The query returns the road segment(s) with the highest score. To achieve low latency, it is essential to prune the very large search space. We propose two algorithms that adopt different approaches to computing the query. Algorithm AUG uses graph augmentation, and ITE uses iterative road-network partitioning. Empirical studies with real data sets demonstrate that the algorithms are capable of offering high performance in realistic settings.
    03/2013;
  • Zhengkui Wang, D. Agrawal, Kian-Lee Tan
    ABSTRACT: In many scientific applications, it is critical to determine if there is a relationship between a combination of objects. The strength of such an association is typically computed using some statistical measures. In order not to miss any important associations, it is not uncommon to exhaustively enumerate all possible combinations of a certain size. However, discovering significant associations among hundreds of thousands or even millions of objects is a computationally intensive job that typically takes days, if not weeks, to complete. We are, therefore, motivated to provide efficient and practical techniques to speed up the processing exploiting parallelism. In this paper, we propose a framework, COSAC, for such combinatorial statistical analysis for large-scale data sets over a MapReduce-based cloud computing platform. COSAC operates in two key phases: 1) In the distribution phase, a novel load balancing scheme distributes the combination enumeration tasks across the processing units; 2) In the statistical analysis phase, each unit optimizes the processing of the allocated combinations by salvaging computations that can be reused. COSAC also supports a more practical scenario, where only a selected subset of objects need to be analyzed against all the objects. As a representative application, we developed COSAC to find combinations of Single Nucleotide Polymorphisms (SNPs) that may interact to cause diseases. We have evaluated our framework on a cluster of more than 40 nodes. The experimental results show that our framework is computationally practical, efficient, scalable, and flexible.
    IEEE Transactions on Knowledge and Data Engineering 01/2013; 25(9):2010-2023. · 1.89 Impact Factor
  • ABSTRACT: GPS-enabled devices are pervasive nowadays. Finding movement patterns in trajectory data streams is gaining in importance. We propose a group discovery framework that aims to efficiently support the online discovery of moving objects that travel together. The framework adopts a sampling-independent approach that makes no assumptions about when positions are sampled, gives no special importance to sampling points, and naturally supports the use of approximate trajectories. The framework's algorithms exploit state-of-the-art, density-based clustering (DBSCAN) to identify groups. The groups are scored based on their cardinality and duration, and the top-k groups are returned. To avoid returning similar subgroups in a result, notions of domination and similarity are introduced that enable the pruning of low-interest groups. Empirical studies on real and synthetic data sets offer insight into the effectiveness and efficiency of the proposed framework.
    IEEE Transactions on Knowledge and Data Engineering 01/2013; 25(12):2752-2766. · 1.89 Impact Factor
  • Htoo Htet Aung, Kian-Lee Tan
    ABSTRACT: In this paper, we present ongoing PhD research on mining Multi-object Spatial-temporal Movement Patterns (M-STEM Patterns) from a Trajectory Database (TJDB). Information on M-STEM Pattern instances has numerous applications in epidemiology, ecology, location-based services, transportation, and social and behavioural sciences since it supplements the information provided by a traditional GIS. We describe the research we have conducted to find instances of two M-STEM Patterns, namely the Meeting pattern and the Convoy pattern. We conclude this paper by introducing our ongoing research on discovering instances of another M-STEM pattern called the Tried-and-True Route pattern.
    SIGSPATIAL Special. 11/2012; 4(3):14-19.
  • ABSTRACT: This paper argues that an algebraic approach to regular languages, such as using monoids, can yield efficient algorithms on strings and trees.
    ACM SIGMOD Record 08/2012;
  • Yu Cao, Chee-Yong Chan, Jie Li, Kian-Lee Tan
    ABSTRACT: Analytic functions represent the state-of-the-art way of performing complex data analysis within a single SQL statement. In particular, an important class of analytic functions that has been frequently used in commercial systems to support OLAP and decision support applications is the class of window functions. A window function returns for each input tuple a value derived from applying a function over a window of neighboring tuples. However, existing window function evaluation approaches are based on a naive sorting scheme. In this paper, we study the problem of optimizing the evaluation of window functions. We propose several efficient techniques, and identify optimization opportunities that allow us to optimize the evaluation of a set of window functions. We have integrated our scheme into PostgreSQL. Our comprehensive experimental study on the TPC-DS datasets as well as synthetic datasets and queries demonstrates significant speedup over existing approaches.
    07/2012;
  • ABSTRACT: Many database applications require sorting a table (or relation) over multiple sort orders. Some examples include creation of multiple indices on a relation, generation of multiple reports from a table, evaluation of a complex query that involves multiple instances of a relation, and batch processing of a set of queries. In this paper, we study how to optimize multiple sortings of a table. We investigate the correlation between sort orders and exploit sort-sharing techniques of reusing the (partial) work done to sort a table on a particular order for another order. Specifically, we introduce a novel and powerful evaluation technique, called cooperative sorting, that enables sort sharing between seemingly non-related sort orders. Subsequently, given a specific set of sort orders, we determine the best combination of various sort-sharing techniques so as to minimize the total processing cost. We also develop techniques to make a traditional query optimizer extensible so that it will not miss the truly cheapest execution plan with the sort-sharing (post-) optimization turned on. We demonstrate the efficiency of our ideas with a prototype implementation in PostgreSQL and evaluate the performance using both the TPC-DS benchmark and synthetic data. Our experimental results show significant performance improvement over the traditional evaluation scheme.
    The VLDB Journal 06/2012; 21(3). · 1.40 Impact Factor
  • Qian Xiao, Htoo Htet Aung, Kian-Lee Tan
    ABSTRACT: Social Networking Sites (SNSs) allow users to publish posts to certain user-defined circles (sets of users). However, existing SNS models are limited in several ways. First, it is not practical to predefine all circles a user will ever need for disseminating her posts. Second, existing SNSs do not currently have an effective mechanism for a user to create and/or customize dynamic (ad-hoc) circles for each publishing session. Third, SNSs do not have features that help users manage and use the circles easily while adapting to the user's ever-changing habits. In this paper, we propose a novel model for creating ad-hoc circles as needs arise. We present a recommendation framework, the Circle OpeRation RECommendaTion (CORRECT) framework, to assist users in easily utilizing our proposed model. Contrary to current SNS offerings, our proposed model does not require a user to create an extensive list of predefined circles; instead, ad-hoc circles are recommended based on a few building-block circles the user has defined and historical ad-hoc circles the user has created.
    05/2012;
  • ABSTRACT: A range of applications call for a mobile client to continuously monitor others in close proximity. Past research on such problems has covered two extremes: It has offered totally centralized solutions, where a server takes care of all queries, and totally distributed solutions, in which there is no central authority at all. Unfortunately, neither of these two solutions scales to intensive moving object tracking applications, where each client poses a query. In this paper, we formulate the moving continuous query (MCQ) problem and propose a balanced model where servers cooperatively take care of the global view and handle the majority of the workload. Meanwhile, moving clients, having basic memory and computation resources, handle small portions of the workload. This model is further enhanced by dynamic region allocation and grid size adjustment mechanisms that reduce the communication and computation cost for both servers and clients. An experimental study demonstrates that our approaches offer better scalability than competitors.
    Mobile Data Management (MDM), 2012 IEEE 13th International Conference on; 01/2012
  • Qian Xiao, Kian-Lee Tan
    ABSTRACT: Today's online social networks (OSNs) allow a user to share his photos with others and tag the co-owners, i.e., friends who also appear in the co-owned photos. However, it is not uncommon that conflicts may arise among the co-owners because of their different privacy concerns. OSNs, unfortunately, offer only limited access control support where the publisher of the shared content is the sole decision maker to restrict access. There is thus an urgent need to develop mechanisms for multiple owners of the shared content to collaboratively determine the access rights of other users, as well as to resolve the conflicts among co-owners with different requirements. Rather than competing with each other and just wanting one's own decision to be executed, OSN users may be affected by their peers' concerns and adjust their decisions accordingly. To incorporate such peer effects in the strategy, we formulate a model to simulate an emotional mediation among multiple co-owners. Our mechanism, called CAPE, considers the intensity with which the co-owners are willing to pick up a choice (e.g. to release a photo to the public) and the extent to which they want their decisions to be affected by their peers' actions. Moreover, CAPE automatically yields the final actions for the co-owners as the mediation reaches equilibrium. It frees the co-owners from the mediation process after the initial setting, and meanwhile, offers a way to achieve more agreements among themselves.
    Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom), 2012 8th International Conference on; 01/2012

Publication Stats

1k Citations
28.03 Total Impact Points

Institutions

  • 1998–2013
    • National University of Singapore
      • School of Computing
      • Department of Computer Science
      Singapore
  • 2009
    • Harbin Institute of Technology
      • School of Computer Science and Technology
      Harbin, Heilongjiang, China
  • 2008
    • The Ohio State University
      Columbus, Ohio, United States
  • 1997–1999
    • Australian National University
      Canberra, Australian Capital Territory, Australia