Kian-Lee Tan

National University of Singapore, Tumasik, Singapore

Publications (123) · 48.12 Total impact

  • Qi Fan · Zhengkui Wang · Chee-Yong Chan · Kian-Lee Tan ·
    ABSTRACT: In relational DBMS, window functions have been widely used to facilitate data analytics. Surprisingly, while similar concepts have been employed for graph analytics, there has been no explicit notion of graph window analytic functions. In this paper, we formally introduce window queries for graph analytics. In such queries, for each vertex, the analysis is performed on a window of vertices defined based on the graph structure. In particular, we identify two instantiations, namely the k-hop window and the topological window. We develop two novel indices, Dense Block index (DBIndex) and Inheritance index (I-Index), to facilitate efficient processing of these two types of windows respectively. Extensive experiments are conducted over both real and synthetic datasets with hundreds of millions of vertices and edges. Experimental results indicate that our proposed index-based query processing solutions achieve four orders of magnitude of query performance gain over the non-index algorithm and are superior to EAGR with respect to scalability and efficiency.

  • ACM SIGMOD Record 08/2015; 44(2):35-40. DOI:10.1145/2814710.2814717 · 1.05 Impact Factor
  • Ruicheng Zhong · Guoliang Li · Kian-Lee Tan · Lizhu Zhou · Zhiguo Gong ·
    ABSTRACT: In recent decades, we have witnessed the rapidly growing popularity of location-based systems. Three types of location-based queries on road networks, single-pair shortest path query, $k$ nearest neighbor ($k$NN) query, and keyword-based $k$NN query, are widely used in location-based systems. Inspired by R-tree, we propose a height-balanced and scalable index, namely G-tree, to efficiently support these queries. The space complexity of G-tree is $\mathcal{O}(|\mathcal{V}|\log|\mathcal{V}|)$, where $|\mathcal{V}|$ is the number of vertices in the road network. Unlike previous works that support these queries separately, G-tree supports all these queries within one framework. The basis for this framework is an assembly-based method to calculate the shortest-path distances between two vertices. Based on the assembly-based method, efficient search algorithms to answer $k$NN queries and keyword-based $k$NN queries are developed. Experiment results show G-tree's theoretical and practical superiority over existing methods.
    IEEE Transactions on Knowledge and Data Engineering 08/2015; 27(8):1-1. DOI:10.1109/TKDE.2015.2399306 · 2.07 Impact Factor
  • Dawei Jiang · Sai Wu · Gang Chen · Beng Chin Ooi · Kian-Lee Tan · Jun Xu ·
    ABSTRACT: The Big Data problem is characterized by the so-called 3V features: volume (a huge amount of data), velocity (a high data ingestion rate), and variety (a mix of structured data, semi-structured data, and unstructured data). The state-of-the-art solutions to the Big Data problem are largely based on the MapReduce framework (aka its open source implementation Hadoop). Although Hadoop handles the data volume challenge successfully, it does not deal with the data variety well since the programming interfaces and its associated data processing model are inconvenient and inefficient for handling structured data and graph data. This paper presents epiC, an extensible system to tackle the Big Data's data variety challenge. epiC introduces a general Actor-like concurrent programming model, independent of the data processing models, for specifying parallel computations. Users process multi-structured datasets with appropriate epiC extensions: the implementation of a data processing model best suited for the data type, and auxiliary code for mapping that data processing model into epiC's concurrent programming model. Like Hadoop, programs written in this way can be automatically parallelized and the runtime system takes care of fault tolerance and inter-machine communications. We present the design and implementation of epiC's concurrent programming model. We also present two customized data processing models, an optimized MapReduce extension and a relational model, on top of epiC. We show how users can leverage epiC to process heterogeneous data by linking different types of operators together. To improve the performance of complex analytic jobs, epiC supports a partition-based optimization technique where data are streamed between the operators to avoid the high I/O overheads. Experiments demonstrate the effectiveness and efficiency of our proposed epiC.
    The VLDB Journal 07/2015; DOI:10.1007/s00778-015-0393-2 · 1.57 Impact Factor
  • Hao Zhang · Gang Chen · Beng Chin Ooi · Kian-Lee Tan · Meihui Zhang ·
    ABSTRACT: Growing main memory capacity has fueled the development of in-memory big data management and processing. By eliminating the disk I/O bottleneck, it is now possible to support interactive data analytics. However, in-memory systems are much more sensitive to other sources of overhead that do not matter in traditional I/O-bounded disk-based systems. Some issues such as fault-tolerance and consistency are also more challenging to handle in an in-memory environment. We are witnessing a revolution in the design of database systems that exploit main memory as their data storage layer. Much of this research has focused on several dimensions: modern CPU and memory hierarchy utilization, time/space efficiency, parallelism, and concurrency control. In this survey, we aim to provide a thorough review of a wide range of in-memory data management and processing proposals and systems, including both data storage systems and data processing frameworks. We also give a comprehensive presentation of important technology in memory management, and some key factors that need to be considered in order to achieve efficient in-memory data management and processing.
    IEEE Transactions on Knowledge and Data Engineering 07/2015; 27(7):1-1. DOI:10.1109/TKDE.2015.2427795 · 2.07 Impact Factor
  • Yuchen Li · Dongxiang Zhang · Kian-Lee Tan ·

    Proceedings of the VLDB Endowment 06/2015; 8(10):1070-1081. DOI:10.14778/2794367.2794376
  • Shuo Chen · Ju Fan · Guoliang Li · Jianhua Feng · Kian-lee Tan · Jinhui Tang ·

    Proceedings of the VLDB Endowment 02/2015; 8(6):666-677. DOI:10.14778/2735703.2735706
  • Weiwei Hu · Guoliang Li · Jiacai Ni · Dalie Sun · Kian-Lee Tan ·
    ABSTRACT: Phase change memory (PCM) has been considered an attractive alternative to flash memory and DRAM. It has promising features, including non-volatile storage, byte addressability, fast read and write operations, and support for random accesses. However, there are challenges in designing algorithms for PCM-based memory systems, such as longer write latency and higher energy consumption compared to DRAM. In this paper, we propose a new predictive B$^{+}$-tree index, called the B$^{p}$-tree, which is tailored for database systems that make use of PCM. Our B$^{p}$-tree reduces data movements caused by tree node splits and merges that arise from insertions and deletions. This is achieved by pre-allocating space on PCM for near future data. To ensure that space is allocated where it is needed, we propose a novel predictive model to ascertain future data distribution based on the current data. In addition, as in [4], when keys are inserted into a leaf node, they are packed but need not be in sorted order. We have implemented the B$^{p}$-tree in PostgreSQL and evaluated it in an emulated environment. Our experimental results show that the B$^{p}$-tree significantly reduces the number of writes, therefore making it write and energy efficient and suitable for a PCM-like hardware environment.
    IEEE Transactions on Knowledge and Data Engineering 10/2014; 26(10):2368-2381. DOI:10.1109/TKDE.2014.5 · 2.07 Impact Factor
  • Article: CANDS

  • Qian Xiao · Rui Chen · Kian-Lee Tan ·
    ABSTRACT: Information networks, such as social media and email networks, often contain sensitive information. Releasing such network data could seriously jeopardize individual privacy. Therefore, we need to sanitize network data before the release. In this paper, we present a novel data sanitization solution that infers a network's structure in a differentially private manner. We observe that, by estimating the connection probabilities between vertices instead of considering the observed edges directly, the noise scale enforced by differential privacy can be greatly reduced. Our proposed method infers the network structure by using a statistical hierarchical random graph (HRG) model. The guarantee of differential privacy is achieved by sampling possible HRG structures in the model space via Markov chain Monte Carlo (MCMC). We theoretically prove that the sensitivity of such inference is only O(log n), where n is the number of vertices in a network. This bound implies that less noise needs to be injected than in existing works. We experimentally evaluate our approach on four real-life network datasets and show that our solution effectively preserves essential network structural properties like degree distribution, shortest path length distribution and influential nodes.
  • Article: R3
    Henan Wang · Guoliang Li · Huiqi Hu · Shuo Chen · Bingwen Shen · Hao Wu · Wen-Syan Li · Kian-Lee Tan ·

  • Dongxiang Zhang · Chee-Yong Chan · Kian-Lee Tan ·
    ABSTRACT: We examine the spatial keyword search problem to retrieve objects of interest that are ranked based on both their spatial proximity to the query location as well as the textual relevance of the object's keywords. Existing solutions for the problem are based on either using a combination of textual and spatial indexes or using specialized hybrid indexes that integrate the indexing of both textual and spatial attribute values. In this paper, we propose a new approach that is based on modeling the problem as a top-k aggregation problem which enables the design of a scalable and efficient solution that is based on the ubiquitous inverted list index. Our performance study demonstrates that our approach outperforms the state-of-the-art hybrid methods by a wide margin.
  • ABSTRACT: Companies are increasingly moving their data processing to the cloud, for reasons of cost, scalability, and convenience, among others. However, hosting multiple applications and storage systems on the same cloud introduces resource sharing and heterogeneous data processing challenges due to the variety of resource usage patterns employed, the variety of data types stored, and the variety of query interfaces presented by those systems. Furthermore, real clouds are never perfectly symmetric - there often are differences between individual processors in their capabilities and connectivity. In this paper, we introduce a federation framework to manage such heterogeneous clouds. We then use this framework to discuss several challenges and their potential solutions.
    IEEE Transactions on Knowledge and Data Engineering 07/2014; 26(7):1670-1678. DOI:10.1109/TKDE.2014.2326659 · 2.07 Impact Factor
  • Guoliang Li · Shuo Chen · Jianhua Feng · Kian-lee Tan · Wen-syan Li ·
    ABSTRACT: [...] users in a social network to maximize the expected number of users influenced by the selected users (called influence spread), has been extensively studied, existing works neglected the fact that the location information can play an important role in influence maximization. Many real-world applications such as location-aware word-of-mouth marketing have location-aware requirements. In this paper we study the location-aware influence maximization problem. One big challenge in location-aware influence maximization is to develop an efficient scheme that offers wide influence spread. To address this challenge, we propose two greedy algorithms with 1-1/e approximation ratio. To meet the instant-speed requirement, we propose two efficient algorithms with ε· (1-1/e) approximation ratio for any ε ∈ (0,1]. Experimental results on real datasets show our method achieves high performance while keeping large influence spread and significantly outperforms state-of-the-art algorithms.
  • Dongxiang Zhang · Chee-Yong Chan · Kian-Lee Tan ·
    ABSTRACT: Many of today's publish/subscribe (pub/sub) systems have been designed to cope with a large volume of subscriptions and high event arrival rate (velocity). However, in many novel applications (such as e-commerce), there is an increasing variety of items, each with different attributes. This leads to a very high-dimensional and sparse database that existing pub/sub systems can no longer support effectively. In this paper, we propose an efficient in-memory index that is scalable to the volume and update of subscriptions, the arrival rate of events and the variety of subscribable attributes. The index is also extensible to support complex scenarios such as prefix/suffix filtering and regular expression matching. We conduct extensive experiments on synthetic datasets and two real datasets (AOL query log and eBay products). The results demonstrate the superiority of our index over state-of-the-art methods: our index incurs orders of magnitude less index construction time, consumes a small amount of memory and performs event matching efficiently.
    Proceedings of the VLDB Endowment 04/2014; 7(8):613-624. DOI:10.14778/2732296.2732298
  • ABSTRACT: Attributed graphs are becoming important tools for modeling information networks, such as the Web and various social networks (e.g. Facebook, LinkedIn, Twitter). However, it is computationally challenging to manage and analyze attributed graphs to support effective decision making. In this paper, we propose Pagrol, a parallel graph OLAP (Online Analytical Processing) system over attributed graphs. In particular, Pagrol introduces a new conceptual Hyper Graph Cube model (which is an attributed-graph analogue of the data cube model for relational DBMS) to aggregate attributed graphs at different granularities and levels. The proposed model supports different queries as well as a new set of graph OLAP Roll-Up/Drill-Down operations. Furthermore, on the basis of Hyper Graph Cube, Pagrol provides an efficient MapReduce-based parallel graph cubing algorithm, MRGraph-Cubing, to compute the graph cube for an attributed graph. Pagrol employs numerous optimization techniques: (a) a self-contained join strategy to minimize I/O cost; (b) a scheme that groups cuboids into batches so as to minimize redundant computations; (c) a cost-based scheme to allocate the batches into bags (each with a small number of batches); and (d) an efficient scheme to process a bag using a single MapReduce job. Results of extensive experimental studies using both real Facebook and synthetic datasets on a 128-node cluster show that Pagrol is effective, efficient and scalable.
    2014 IEEE 30th International Conference on Data Engineering (ICDE); 03/2014
  • Article: epiC
    Dawei Jiang · Gang Chen · Beng Chin Ooi · Kian-Lee Tan · Sai Wu ·

  • Guoliang Li · Jun Hu · Jianhua Feng · Kian-lee Tan ·
    ABSTRACT: The rapid development of social networks has resulted in a proliferation of user-generated content (UGC). The UGC data, when properly analyzed, can be beneficial to many applications. For example, identifying a user's locations from microblogs is very important for effective location-based advertisement and recommendation. In this paper, we study the problem of identifying a user's locations from microblogs. This problem is rather challenging because the location information in a microblog is incomplete and we cannot get an accurate location from a local microblog. To address this challenge, we propose a global location identification method, called Glitter. Glitter combines multiple microblogs of a user and utilizes them to identify the user's locations. Glitter not only improves the quality of identifying a user's location but also supplements the location of a microblog so as to obtain an accurate location of a microblog. To facilitate location identification, Glitter organizes points of interest (POIs) into a tree structure where leaf nodes are POIs and non-leaf nodes are segments of POIs, e.g., countries, states, cities, districts, and streets. Using the tree structure, Glitter first extracts candidate locations from each microblog of a user which correspond to some tree nodes. Then Glitter aggregates these candidate locations and identifies top-k locations of the user. Using the identified top-k user locations, Glitter refines the candidate locations and computes top-k locations of each microblog. To achieve high recall, we enable fuzzy matching between locations and microblogs. We propose an incremental algorithm to support dynamic updates of microblogs. Experimental results on real-world datasets show that our method achieves high quality and good performance, and scales very well.
    2014 IEEE 30th International Conference on Data Engineering (ICDE); 03/2014
  • Long Guo · Jie Shao · Htoo Htet Aung · Kian-Lee Tan ·
    ABSTRACT: With the development of GPS-enabled mobile devices, more and more pieces of information on the web are geotagged. Spatial keyword queries, which consider both spatial locations and textual descriptions to find objects of interest, adapt well to this trend. Therefore, a considerable number of studies have focused on the interesting problem of efficiently processing spatial keyword queries. However, most of them assume Euclidean space or examine a single snapshot query only. This paper investigates a novel problem, namely, continuous top-k spatial keyword queries on road networks, for the first time. We propose two methods that can monitor such moving queries in an incremental manner and reduce repetitive traversing of network edges for better performance. Experimental evaluation using large real datasets demonstrates that the proposed methods both outperform baseline methods significantly. Discussion about the parameters affecting the efficiency of the two methods is also presented to reveal their relative advantages.
    GeoInformatica 01/2014; 19(1):29-60. DOI:10.1007/s10707-014-0204-8 · 0.75 Impact Factor
  • Xiaohui Li · Vaida Ceikute · Christian S. Jensen · Kian-Lee Tan ·
    ABSTRACT: GPS-enabled devices are pervasive nowadays. Finding movement patterns in trajectory data streams is gaining importance. We propose a group discovery framework that aims to efficiently support the online discovery of moving objects that travel together. The framework adopts a sampling-independent approach that makes no assumptions about when positions are sampled, gives no special importance to sampling points, and naturally supports the use of approximate trajectories. The framework's algorithms exploit state-of-the-art, density-based clustering (DBSCAN) to identify groups. The groups are scored based on their cardinality and duration, and the top-k groups are returned. To avoid returning similar subgroups in a result, notions of domination and similarity are introduced that enable the pruning of low-interest groups. Empirical studies on real and synthetic data sets offer insight into the effectiveness and efficiency of the proposed framework.
    IEEE Transactions on Knowledge and Data Engineering 12/2013; 25(12):2752-2766. DOI:10.1109/TKDE.2012.193 · 2.07 Impact Factor