Kian-Lee Tan

National University of Singapore, Tumasik, Singapore

Publications (115) · 43.94 Total impact

  • ABSTRACT: Growing main memory capacity has fueled the development of in-memory big data management and processing. By eliminating the disk I/O bottleneck, it is now possible to support interactive data analytics. However, in-memory systems are much more sensitive to other sources of overhead that do not matter in traditional I/O-bound disk-based systems. Some issues such as fault-tolerance and consistency are also more challenging to handle in an in-memory environment. We are witnessing a revolution in the design of database systems that exploit main memory as their data storage layer. Much of this research has focused on several dimensions: modern CPU and memory hierarchy utilization, time/space efficiency, parallelism, and concurrency control. In this survey, we aim to provide a thorough review of a wide range of in-memory data management and processing proposals and systems, including both data storage systems and data processing frameworks. We also give a comprehensive presentation of important technology in memory management, and some key factors that need to be considered in order to achieve efficient in-memory data management and processing.
    IEEE Transactions on Knowledge and Data Engineering 07/2015; 27(7):1-1. DOI:10.1109/TKDE.2015.2427795 · 1.82 Impact Factor
  • 02/2015; 8(6):666-677. DOI:10.14778/2735703.2735706
  • IEEE Transactions on Knowledge and Data Engineering 01/2015; DOI:10.1109/TKDE.2015.2399306 · 1.82 Impact Factor
  • Article: CANDS
  • Qian Xiao, Rui Chen, Kian-Lee Tan
    ABSTRACT: Information networks, such as social media and email networks, often contain sensitive information. Releasing such network data could seriously jeopardize individual privacy. Therefore, we need to sanitize network data before the release. In this paper, we present a novel data sanitization solution that infers a network's structure in a differentially private manner. We observe that, by estimating the connection probabilities between vertices instead of considering the observed edges directly, the noise scale enforced by differential privacy can be greatly reduced. Our proposed method infers the network structure by using a statistical hierarchical random graph (HRG) model. The guarantee of differential privacy is achieved by sampling possible HRG structures in the model space via Markov chain Monte Carlo (MCMC). We theoretically prove that the sensitivity of such inference is only O(log n), where n is the number of vertices in a network. This bound implies less noise to be injected than those of existing works. We experimentally evaluate our approach on four real-life network datasets and show that our solution effectively preserves essential network structural properties like degree distribution, shortest path length distribution and influential nodes.
  • Article: R3
  • ABSTRACT: We examine the spatial keyword search problem to retrieve objects of interest that are ranked based on both their spatial proximity to the query location as well as the textual relevance of the object's keywords. Existing solutions for the problem are based on either using a combination of textual and spatial indexes or using specialized hybrid indexes that integrate the indexing of both textual and spatial attribute values. In this paper, we propose a new approach that is based on modeling the problem as a top-k aggregation problem which enables the design of a scalable and efficient solution that is based on the ubiquitous inverted list index. Our performance study demonstrates that our approach outperforms the state-of-the-art hybrid methods by a wide margin.
  • ABSTRACT: Companies are increasingly moving their data processing to the cloud, for reasons of cost, scalability, and convenience, among others. However, hosting multiple applications and storage systems on the same cloud introduces resource sharing and heterogeneous data processing challenges due to the variety of resource usage patterns employed, the variety of data types stored, and the variety of query interfaces presented by those systems. Furthermore, real clouds are never perfectly symmetric - there often are differences between individual processors in their capabilities and connectivity. In this paper, we introduce a federation framework to manage such heterogeneous clouds. We then use this framework to discuss several challenges and their potential solutions.
    IEEE Transactions on Knowledge and Data Engineering 07/2014; 26(7):1670-1678. DOI:10.1109/TKDE.2014.2326659 · 1.82 Impact Factor
  • ABSTRACT: Although influence maximization, which selects a set of users in a social network to maximize the expected number of users influenced by the selected users (called influence spread), has been extensively studied, existing works neglected the fact that the location information can play an important role in influence maximization. Many real-world applications such as location-aware word-of-mouth marketing have location-aware requirements. In this paper we study the location-aware influence maximization problem. One big challenge in location-aware influence maximization is to develop an efficient scheme that offers wide influence spread. To address this challenge, we propose two greedy algorithms with 1-1/e approximation ratio. To meet the instant-speed requirement, we propose two efficient algorithms with ε·(1-1/e) approximation ratio for any ε ∈ (0,1]. Experimental results on real datasets show our method achieves high performance while keeping large influence spread and significantly outperforms state-of-the-art algorithms.
  • 04/2014; 7(8):613-624. DOI:10.14778/2732296.2732298
  • ABSTRACT: Attributed graphs are becoming important tools for modeling information networks, such as the Web and various social networks (e.g. Facebook, LinkedIn, Twitter). However, it is computationally challenging to manage and analyze attributed graphs to support effective decision making. In this paper, we propose Pagrol, a parallel graph OLAP (Online Analytical Processing) system over attributed graphs. In particular, Pagrol introduces a new conceptual Hyper Graph Cube model (which is an attributed-graph analogue of the data cube model for relational DBMS) to aggregate attributed graphs at different granularities and levels. The proposed model supports different queries as well as a new set of graph OLAP Roll-Up/Drill-Down operations. Furthermore, on the basis of Hyper Graph Cube, Pagrol provides an efficient MapReduce-based parallel graph cubing algorithm, MRGraph-Cubing, to compute the graph cube for an attributed graph. Pagrol employs numerous optimization techniques: (a) a self-contained join strategy to minimize I/O cost; (b) a scheme that groups cuboids into batches so as to minimize redundant computations; (c) a cost-based scheme to allocate the batches into bags (each with a small number of batches); and (d) an efficient scheme to process a bag using a single MapReduce job. Results of extensive experimental studies using both real Facebook and synthetic datasets on a 128-node cluster show that Pagrol is effective, efficient and scalable.
    2014 IEEE 30th International Conference on Data Engineering (ICDE); 03/2014
  • Article: epiC
  • Guoliang Li, Jun Hu, Jianhua Feng, Kian-lee Tan
    ABSTRACT: The rapid development of social networks has resulted in a proliferation of user-generated content (UGC). The UGC data, when properly analyzed, can be beneficial to many applications. For example, identifying a user's locations from microblogs is very important for effective location-based advertisement and recommendation. In this paper, we study the problem of identifying a user's locations from microblogs. This problem is rather challenging because the location information in a microblog is incomplete and we cannot get an accurate location from a local microblog. To address this challenge, we propose a global location identification method, called Glitter. Glitter combines multiple microblogs of a user and utilizes them to identify the user's locations. Glitter not only improves the quality of identifying a user's location but also supplements the location of a microblog so as to obtain an accurate location of a microblog. To facilitate location identification, GLITTER organizes points of interest (POIs) into a tree structure where leaf nodes are POIs and non-leaf nodes are segments of POIs, e.g., countries, states, cities, districts, and streets. Using the tree structure, Glitter first extracts candidate locations from each microblog of a user which correspond to some tree nodes. Then Glitter aggregates these candidate locations and identifies top-k locations of the user. Using the identified top-k user locations, Glitter refines the candidate locations and computes top-k locations of each microblog. To achieve high recall, we enable fuzzy matching between locations and microblogs. We propose an incremental algorithm to support dynamic updates of microblogs. Experimental results on real-world datasets show that our method achieves high quality and good performance, and scales very well.
    2014 IEEE 30th International Conference on Data Engineering (ICDE); 03/2014
  • Long Guo, Jie Shao, Htoo Htet Aung, Kian-Lee Tan
    ABSTRACT: With the development of GPS-enabled mobile devices, more and more pieces of information on the web are geotagged. Spatial keyword queries, which consider both spatial locations and textual descriptions to find objects of interest, adapt well to this trend. Therefore, a considerable number of studies have focused on the interesting problem of efficiently processing spatial keyword queries. However, most of them assume Euclidean space or examine a single snapshot query only. This paper investigates a novel problem, namely, continuous top-k spatial keyword queries on road networks, for the first time. We propose two methods that can monitor such moving queries in an incremental manner and reduce repetitive traversing of network edges for better performance. Experimental evaluation using large real datasets demonstrates that the proposed methods both outperform baseline methods significantly. Discussion about the parameters affecting the efficiency of the two methods is also presented to reveal their relative advantages.
    GeoInformatica 01/2014; 19(1):29-60. DOI:10.1007/s10707-014-0204-8 · 1.29 Impact Factor
  • ABSTRACT: GPS-enabled devices are pervasive nowadays. Finding movement patterns in trajectory data streams is gaining in importance. We propose a group discovery framework that aims to efficiently support the online discovery of moving objects that travel together. The framework adopts a sampling-independent approach that makes no assumptions about when positions are sampled, gives no special importance to sampling points, and naturally supports the use of approximate trajectories. The framework's algorithms exploit state-of-the-art, density-based clustering (DBSCAN) to identify groups. The groups are scored based on their cardinality and duration, and the top-k groups are returned. To avoid returning similar subgroups in a result, notions of domination and similarity are introduced that enable the pruning of low-interest groups. Empirical studies on real and synthetic data sets offer insight into the effectiveness and efficiency of the proposed framework.
    IEEE Transactions on Knowledge and Data Engineering 12/2013; 25(12):2752-2766. DOI:10.1109/TKDE.2012.193 · 1.82 Impact Factor
  • ABSTRACT: With the fast development of location-based services and geo-tagging, spatial keyword queries that retrieve objects satisfying both spatial and keyword conditions are gaining in prevalence. A hybrid index that integrates a spatial index (e.g., the R-tree or its variations) with a keyword filter offers a promising approach for processing such queries efficiently. However, how a hybrid index can be effectively constructed from scratch remains an open problem. The state-of-the-art bulk loading algorithms for the R-tree consider only spatial relationships, and cannot be employed for the hybrid index. In this paper, we propose a new bulk loading algorithm, named TPA, which constructs a hybrid index top-down. TPA utilizes a two-phase method to construct the children of nodes at each level of the hybrid index, taking both spatial and keyword information into consideration, and thus optimizes the hybrid index for spatial keyword queries. We analyze and evaluate its performance using both real and synthetic datasets. Comprehensive experiments show that TPA can achieve good performance and space utilization, reducing the construction time, the query latency and the index size remarkably.
    2013 International Conference on Parallel and Distributed Systems (ICPADS); 12/2013
  • ABSTRACT: Data cubes are widely used as a powerful tool to provide multidimensional views in data warehousing and On-Line Analytical Processing (OLAP). However, with increasing data sizes, it is becoming computationally expensive to perform data cube analysis. The problem is exacerbated by the demand for supporting more complicated aggregate functions (e.g. CORRELATION, Statistical Analysis) as well as supporting frequent view updates in data cubes. This calls for new scalable and efficient data cube analysis systems. In this paper, we introduce HaCube, an extension of MapReduce, designed for efficient parallel data cube analysis on large-scale data by taking advantage of both MapReduce (in terms of scalability) and parallel DBMS (in terms of efficiency). We also provide a general data cube materialization algorithm which is able to facilitate the features in MapReduce-like systems towards an efficient data cube computation. Furthermore, we demonstrate how HaCube supports view maintenance through either incremental computation (e.g. used for SUM or COUNT) or recomputation (e.g. used for MEDIAN or CORRELATION). We implement HaCube by extending Hadoop and evaluate it based on the TPC-D benchmark over billions of tuples on a cluster with over 320 cores. The experimental results demonstrate the efficiency, scalability and practicality of HaCube for cube analysis over a large amount of data in a distributed environment.
  • ABSTRACT: In this paper we study the problem of kNN search on road networks. Given a query location and a set of candidate objects in a road network, the kNN search finds the k nearest objects to the query location. To address this problem, we propose a balanced search tree index, called G-tree. The G-tree of a road network is constructed by recursively partitioning the road network into sub-networks and each G-tree node corresponds to a sub-network. Inspired by classical kNN search on metric space, we introduce a best-first search algorithm on road networks, and propose an elaborately-designed assembly-based method to efficiently compute the minimum distance from a G-tree node to the query location. G-tree only takes O(|V|log|V|) space, where |V| is the number of vertices in a network, and thus can easily scale up to large road networks with more than 20 million vertices. Experimental results on eight real-world datasets show that our method significantly outperforms state-of-the-art methods, even by 2-3 orders of magnitude.
    Proceedings of the 22nd ACM International Conference on Information & Knowledge Management; 10/2013
  • ABSTRACT: In many scientific applications, it is critical to determine if there is a relationship between a combination of objects. The strength of such an association is typically computed using some statistical measures. In order not to miss any important associations, it is not uncommon to exhaustively enumerate all possible combinations of a certain size. However, discovering significant associations among hundreds of thousands or even millions of objects is a computationally intensive job that typically takes days, if not weeks, to complete. We are, therefore, motivated to provide efficient and practical techniques to speed up the processing exploiting parallelism. In this paper, we propose a framework, COSAC, for such combinatorial statistical analysis for large-scale data sets over a MapReduce-based cloud computing platform. COSAC operates in two key phases: 1) In the distribution phase, a novel load balancing scheme distributes the combination enumeration tasks across the processing units; 2) In the statistical analysis phase, each unit optimizes the processing of the allocated combinations by salvaging computations that can be reused. COSAC also supports a more practical scenario, where only a selected subset of objects need to be analyzed against all the objects. As a representative application, we developed COSAC to find combinations of Single Nucleotide Polymorphisms (SNPs) that may interact to cause diseases. We have evaluated our framework on a cluster of more than 40 nodes. The experimental results show that our framework is computationally practical, efficient, scalable, and flexible.
    IEEE Transactions on Knowledge and Data Engineering 09/2013; 25(9):2010-2023. DOI:10.1109/TKDE.2012.113 · 1.82 Impact Factor
  • Htoo Htet Aung, Long Guo, Kian-Lee Tan
    ABSTRACT: Knowledge of the routes frequently used by tracked objects is embedded in massive trajectory databases. Such knowledge has various applications in optimizing ports' operations and route-recommendation systems but is difficult to extract, especially when the underlying road network information is unavailable. We propose a novel approach, which discovers frequent routes without any prior knowledge of the underlying road network, by mining sub-trajectory cliques. Since mining all sub-trajectory cliques is NP-Complete, we propose two approximate algorithms based on the Apriori algorithm. Empirical results show that our algorithms run fast and their results are intuitive.
    Proceedings of the 13th international conference on Advances in Spatial and Temporal Databases; 08/2013