Kian-Lee Tan

National University of Singapore, Tumasik, Singapore


Publications (124) · 48.67 Total impact

  • Source
    ABSTRACT: Modern Internet applications such as websites and mobile games produce a large amount of activity data representing information associated with user actions such as login or online purchases. Cohort analysis, which originated in the social sciences, is a powerful data exploration technique for finding unusual user behavior trends in large activity datasets using the concept of a cohort. This paper presents the design and implementation of database support for cohort analysis. We introduce an extended relational data model for representing a collection of activity data as an activity relation, and define a set of cohort operators on activity relations for composing cohort queries. To evaluate a cohort query, we present three schemes: a SQL-based approach which translates a cohort query into a set of SQL statements for execution, a materialized view approach which materializes birth activity tuples to speed up SQL execution, and a new evaluation scheme specially designed for cohort query processing. We implement the first two schemes on MySQL and MonetDB, respectively, and develop a prototype of our own cohort query engine, COHANA, for the third scheme. An extensive experimental evaluation shows that the proposed cohort query evaluation scheme is up to three orders of magnitude faster than the two SQL-based schemes.
    Full-text · Article · Jan 2016
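The SQL-based evaluation scheme described above can be sketched with a toy activity relation in SQLite; the table layout, action names, and the retention-style query below are illustrative assumptions, not the paper's actual schema.

```python
import sqlite3

# Toy activity relation: (user_id, action, week).
rows = [
    (1, "login", 0), (1, "buy", 1),
    (2, "login", 0), (2, "buy", 2),
    (3, "login", 1), (3, "buy", 1),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE activity (user_id INT, action TEXT, week INT)")
conn.executemany("INSERT INTO activity VALUES (?, ?, ?)", rows)

# Birth activity: each user's first "login" week defines the cohort.
# Then count, per (cohort, age), how many cohort members performed "buy".
cohort_sql = """
WITH birth AS (
    SELECT user_id, MIN(week) AS cohort_week
    FROM activity WHERE action = 'login'
    GROUP BY user_id
)
SELECT b.cohort_week, a.week - b.cohort_week AS age,
       COUNT(DISTINCT a.user_id) AS n
FROM activity a JOIN birth b ON a.user_id = b.user_id
WHERE a.action = 'buy'
GROUP BY b.cohort_week, age
ORDER BY b.cohort_week, age
"""
report = conn.execute(cohort_sql).fetchall()
```

Each output row reads as (cohort week, age in weeks since birth, number of cohort users who bought at that age); materializing the `birth` CTE is essentially what the materialized view scheme does.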
  • Qi Fan · Zhengkui Wang · Chee-Yong Chan · Kian-Lee Tan
    ABSTRACT: In relational DBMS, window functions have been widely used to facilitate data analytics. Surprisingly, while similar concepts have been employed for graph analytics, there has been no explicit notion of graph window analytic functions. In this paper, we formally introduce window queries for graph analytics. In such queries, for each vertex, the analysis is performed on a window of vertices defined based on the graph structure. In particular, we identify two instantiations, namely the k-hop window and the topological window. We develop two novel indices, the Dense Block index (DBIndex) and the Inheritance index (I-Index), to facilitate efficient processing of these two types of windows, respectively. Extensive experiments are conducted over both real and synthetic datasets with hundreds of millions of vertices and edges. Experimental results indicate that our proposed index-based query processing solutions achieve up to four orders of magnitude performance gain over the non-index algorithm and are superior to EAGR with respect to scalability and efficiency.
    No preview · Article · Oct 2015
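As a rough illustration of a k-hop window query, here is a non-index baseline (a per-vertex BFS) in Python; the function name and the sum aggregate are invented for illustration, and indices like DBIndex exist precisely to avoid this per-vertex traversal.

```python
from collections import deque

def khop_window_sum(adj, vals, k):
    """For each vertex, sum `vals` over its k-hop window (the vertex
    itself plus all vertices within k hops). Non-index baseline."""
    out = {}
    for s in adj:
        seen = {s}
        frontier = deque([(s, 0)])
        while frontier:
            v, d = frontier.popleft()
            if d == k:
                continue  # do not expand beyond k hops
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    frontier.append((w, d + 1))
        out[s] = sum(vals[v] for v in seen)
    return out

# Path graph 0-1-2-3 with vertex values 1, 10, 100, 1000.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
vals = {0: 1, 1: 10, 2: 100, 3: 1000}
win = khop_window_sum(adj, vals, 1)
```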

  • No preview · Article · Aug 2015 · ACM SIGMOD Record
  • ABSTRACT: In recent decades, we have witnessed the rapidly growing popularity of location-based systems. Three types of location-based queries on road networks, the single-pair shortest path query, the k nearest neighbor (kNN) query, and the keyword-based kNN query, are widely used in location-based systems. Inspired by the R-tree, we propose a height-balanced and scalable index, namely the G-tree, to efficiently support these queries. The space complexity of the G-tree is O(|V| log |V|), where |V| is the number of vertices in the road network. Unlike previous works that support these queries separately, the G-tree supports all of them within one framework. The basis for this framework is an assembly-based method to calculate the shortest-path distance between two vertices. Based on the assembly-based method, efficient search algorithms to answer kNN queries and keyword-based kNN queries are developed. Experimental results show the G-tree's theoretical and practical superiority over existing methods.
    No preview · Article · Aug 2015 · IEEE Transactions on Knowledge and Data Engineering
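The assembly-based distance computation itself is beyond a short sketch, but the kNN query it accelerates can be illustrated with a plain Dijkstra baseline over a toy road network; all names and weights below are invented.

```python
import heapq

def network_knn(graph, source, objects, k):
    """k nearest objects by road-network (shortest-path) distance.
    Plain Dijkstra expansion; an index like the G-tree prunes this
    search using its partition hierarchy."""
    dist = {source: 0}
    pq = [(0, source)]
    found = []
    objs = set(objects)
    while pq and len(found) < k:
        d, v = heapq.heappop(pq)
        if d > dist.get(v, float("inf")):
            continue  # stale queue entry
        if v in objs:
            found.append((v, d))
            objs.discard(v)
        for w, wgt in graph[v]:
            nd = d + wgt
            if nd < dist.get(w, float("inf")):
                dist[w] = nd
                heapq.heappush(pq, (nd, w))
    return found

# Small road network: vertex -> list of (neighbor, edge length).
graph = {
    "a": [("b", 2), ("c", 5)],
    "b": [("a", 2), ("c", 1), ("d", 4)],
    "c": [("a", 5), ("b", 1), ("d", 1)],
    "d": [("b", 4), ("c", 1)],
}
nearest = network_knn(graph, "a", {"c", "d"}, 2)
```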
  • Dawei Jiang · Sai Wu · Gang Chen · Beng Chin Ooi · Kian-Lee Tan · Jun Xu
    ABSTRACT: The Big Data problem is characterized by the so-called 3V features: volume, a huge amount of data; velocity, a high data ingestion rate; and variety, a mix of structured, semi-structured, and unstructured data. The state-of-the-art solutions to the Big Data problem are largely based on the MapReduce framework and its open-source implementation Hadoop. Although Hadoop handles the data volume challenge successfully, it does not deal well with data variety, since its programming interfaces and associated data processing model are inconvenient and inefficient for handling structured data and graph data. This paper presents epiC, an extensible system to tackle Big Data's data variety challenge. epiC introduces a general Actor-like concurrent programming model, independent of the data processing models, for specifying parallel computations. Users process multi-structured datasets with appropriate epiC extensions: an implementation of the data processing model best suited to the data type, together with auxiliary code for mapping that data processing model into epiC's concurrent programming model. Like Hadoop programs, programs written in this way can be automatically parallelized, and the runtime system takes care of fault tolerance and inter-machine communication. We present the design and implementation of epiC's concurrent programming model. We also present two customized data processing models, an optimized MapReduce extension and a relational model, built on top of epiC. We show how users can leverage epiC to process heterogeneous data by linking different types of operators together. To improve the performance of complex analytic jobs, epiC supports a partition-based optimization technique in which data are streamed between operators to avoid high I/O overheads. Experiments demonstrate the effectiveness and efficiency of epiC.
    No preview · Article · Jul 2015 · The VLDB Journal
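A minimal single-threaded sketch in the spirit of an Actor-like concurrent programming model, wiring a map unit to a reduce unit for word count; the `Unit`/`run` names are invented here, and the real epiC runtime adds parallel execution, fault tolerance, and inter-machine messaging.

```python
from collections import deque

class Unit:
    """Actor-like computation unit: reacts to messages in its mailbox
    and may emit messages addressed to other named units."""
    def __init__(self, name, handler):
        self.name, self.handler = name, handler
        self.mailbox = deque()

def run(units, initial):
    """Drive units until every mailbox drains (a toy sequential
    runtime; a real runtime would schedule units in parallel)."""
    for target, msg in initial:
        units[target].mailbox.append(msg)
    pending = deque(u for u in units.values() if u.mailbox)
    while pending:
        u = pending.popleft()
        while u.mailbox:
            msg = u.mailbox.popleft()
            for target, out in u.handler(msg):
                units[target].mailbox.append(out)
                pending.append(units[target])

# Word-count sketch: a "map" unit tokenizes, a "reduce" unit counts.
counts = {}
def mapper(line):
    return [("reduce", w) for w in line.split()]
def reducer(word):
    counts[word] = counts.get(word, 0) + 1
    return []

units = {"map": Unit("map", mapper), "reduce": Unit("reduce", reducer)}
run(units, [("map", "a b a")])
```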
  • Source
    Hao Zhang · Gang Chen · Beng Chin Ooi · Kian-Lee Tan · Meihui Zhang
    ABSTRACT: Growing main memory capacity has fueled the development of in-memory big data management and processing. By eliminating the disk I/O bottleneck, it is now possible to support interactive data analytics. However, in-memory systems are much more sensitive to sources of overhead that do not matter in traditional I/O-bound, disk-based systems, and some issues, such as fault tolerance and consistency, are more challenging to handle in an in-memory environment. We are witnessing a revolution in the design of database systems that exploit main memory as the data storage layer. Much of this research has focused on several dimensions: utilization of modern CPUs and the memory hierarchy, time/space efficiency, parallelism, and concurrency control. In this survey, we aim to provide a thorough review of a wide range of in-memory data management and processing proposals and systems, including both data storage systems and data processing frameworks. We also give a comprehensive presentation of important techniques in memory management and key factors that need to be considered in order to achieve efficient in-memory data management and processing.
    Full-text · Article · Jul 2015 · IEEE Transactions on Knowledge and Data Engineering
  • ABSTRACT: The increase in the capacity of main memory, coupled with the decrease in cost, has fueled research in and development of in-memory databases. In recent years, the emergence of new hardware has further given rise to new challenges which have attracted a lot of attention from the research community. In particular, it is widely accepted that hardware solutions can provide promising alternatives for realizing the full potential of in-memory systems. Here, we argue that naive adoption of hardware solutions does not guarantee superior performance over software solutions, and identify problems in such hardware solutions that limit their performance. We also highlight the primary challenges faced by in-memory databases and summarize their potential solutions, from both software and hardware perspectives.
    No preview · Article · Jun 2015 · ACM SIGMOD Record
  • Yuchen Li · Dongxiang Zhang · Kian-Lee Tan
    ABSTRACT: Advertising in social networks has become a multi-billion-dollar industry. A main challenge is to identify key influencers who can effectively contribute to the dissemination of information. Although the influence maximization problem, which finds a seed set of k most influential users based on certain propagation models, has been well studied, it is not target-aware and cannot be directly applied to online advertising. In this paper, we propose a new problem, named Keyword-Based Targeted Influence Maximization (KB-TIM), to find a seed set that maximizes the expected influence over users who are relevant to a given advertisement. To solve the problem, we propose a sampling technique based on weighted reverse influence sets and achieve an approximation ratio of (1-1/e-ε). To meet the instant-speed requirement, we propose two disk-based solutions that improve the query processing time by two orders of magnitude over the state-of-the-art solutions, while keeping the theoretical bound. Experiments conducted on two real social networks confirm our theoretical findings as well as the efficiency. Given an advertisement with 5 keywords, it takes only 2 seconds to find the most influential users in a social network with billions of edges.
    No preview · Article · Jun 2015 · Proceedings of the VLDB Endowment
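A hedged sketch of the reverse-influence-sampling idea (unweighted and in-memory here, whereas KB-TIM uses weighted RR sets and disk-based indexes): sample random reverse reachable sets under the independent-cascade model, then greedily pick seeds that cover the most sets.

```python
import random

def reverse_reachable_set(in_edges, nodes, p, rng):
    """One random RR set under the independent-cascade model: start
    from a random node and walk incoming edges, each live with
    probability p."""
    root = rng.choice(nodes)
    seen, stack = {root}, [root]
    while stack:
        v = stack.pop()
        for u in in_edges.get(v, []):
            if u not in seen and rng.random() < p:
                seen.add(u)
                stack.append(u)
    return seen

def ris_seeds(in_edges, nodes, k, n_samples, p=0.5, seed=7):
    rng = random.Random(seed)
    rr_sets = [reverse_reachable_set(in_edges, nodes, p, rng)
               for _ in range(n_samples)]
    seeds, covered = [], set()
    for _ in range(k):  # greedy max coverage over the RR sets
        best = max(nodes, key=lambda v: sum(
            1 for i, rr in enumerate(rr_sets)
            if i not in covered and v in rr))
        seeds.append(best)
        covered |= {i for i, rr in enumerate(rr_sets) if best in rr}
    return seeds

# Star graph: node 0 points to 1..4, so leaf RR sets often contain 0.
in_edges = {v: [0] for v in range(1, 5)}
nodes = list(range(5))
seeds = ris_seeds(in_edges, nodes, 1, 200)
```

A node appearing in many RR sets is likely to influence many nodes, which is why covering RR sets approximates maximizing spread.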
  • Shuo Chen · Ju Fan · Guoliang Li · Jianhua Feng · Kian-lee Tan · Jinhui Tang

    No preview · Article · Feb 2015 · Proceedings of the VLDB Endowment
  • Article: CANDS

    No preview · Article · Oct 2014
  • Qian Xiao · Rui Chen · Kian-Lee Tan
    ABSTRACT: Information networks, such as social media and email networks, often contain sensitive information. Releasing such network data could seriously jeopardize individual privacy. Therefore, we need to sanitize network data before the release. In this paper, we present a novel data sanitization solution that infers a network's structure in a differentially private manner. We observe that, by estimating the connection probabilities between vertices instead of considering the observed edges directly, the noise scale enforced by differential privacy can be greatly reduced. Our proposed method infers the network structure by using a statistical hierarchical random graph (HRG) model. The guarantee of differential privacy is achieved by sampling possible HRG structures in the model space via Markov chain Monte Carlo (MCMC). We theoretically prove that the sensitivity of such inference is only O(log n), where n is the number of vertices in a network. This bound implies less noise to be injected than those of existing works. We experimentally evaluate our approach on four real-life network datasets and show that our solution effectively preserves essential network structural properties like degree distribution, shortest path length distribution and influential nodes.
    No preview · Article · Aug 2014
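The private HRG sampling can be viewed as an exponential mechanism realized by MCMC; the toy Metropolis sampler below (candidate space and scoring entirely invented) shows that pattern on a small discrete space rather than on HRG structures.

```python
import math
import random

def exp_mech_mcmc(candidates, score, eps, sensitivity, steps, seed=1):
    """Sample approximately from the exponential mechanism,
    Pr[x] proportional to exp(eps * score(x) / (2 * sensitivity)),
    using a Metropolis chain with uniform proposals (a toy stand-in
    for sampling HRG structures via MCMC)."""
    rng = random.Random(seed)
    cur = rng.choice(candidates)
    for _ in range(steps):
        nxt = rng.choice(candidates)
        accept = math.exp(eps * (score(nxt) - score(cur))
                          / (2 * sensitivity))
        if rng.random() < min(1.0, accept):
            cur = nxt
    return cur

# Toy "structure space": thresholds scored by closeness to a private
# statistic (here 7); a larger eps concentrates mass on the best one.
candidates = list(range(10))
score = lambda x: -abs(x - 7)
draws = [exp_mech_mcmc(candidates, score, eps=8.0, sensitivity=1.0,
                       steps=200, seed=s) for s in range(30)]
```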
  • Article: R3

    No preview · Article · Aug 2014
  • Dongxiang Zhang · Chee-Yong Chan · Kian-Lee Tan
    ABSTRACT: We examine the spatial keyword search problem to retrieve objects of interest that are ranked based on both their spatial proximity to the query location as well as the textual relevance of the object's keywords. Existing solutions for the problem are based on either using a combination of textual and spatial indexes or using specialized hybrid indexes that integrate the indexing of both textual and spatial attribute values. In this paper, we propose a new approach that is based on modeling the problem as a top-k aggregation problem which enables the design of a scalable and efficient solution that is based on the ubiquitous inverted list index. Our performance study demonstrates that our approach outperforms the state-of-the-art hybrid methods by a wide margin.
    No preview · Article · Jul 2014
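A toy version of the ranking that such a top-k aggregation computes, with an invented linear weighting of keyword overlap and distance decay; the paper's contribution is evaluating this kind of score over inverted lists rather than by scanning all objects as done here.

```python
import heapq
import math

def topk_spatial_keyword(objects, q_loc, q_terms, k, alpha=0.5):
    """Rank objects by a weighted sum of keyword overlap and spatial
    proximity (toy scoring; alpha balances the two components)."""
    def score(obj):
        loc, terms = obj
        text = len(q_terms & terms) / len(q_terms)
        dist = math.hypot(loc[0] - q_loc[0], loc[1] - q_loc[1])
        prox = 1.0 / (1.0 + dist)  # closer objects score higher
        return alpha * text + (1 - alpha) * prox
    return heapq.nlargest(k, objects, key=score)

# Objects as (location, keyword set) pairs.
objects = [
    ((0, 0), {"coffee", "wifi"}),
    ((3, 4), {"coffee"}),
    ((1, 0), {"tea"}),
]
best = topk_spatial_keyword(objects, (0, 0), {"coffee"}, 1)
```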
  • Source
    ABSTRACT: Companies are increasingly moving their data processing to the cloud, for reasons of cost, scalability, and convenience, among others. However, hosting multiple applications and storage systems on the same cloud introduces resource sharing and heterogeneous data processing challenges due to the variety of resource usage patterns employed, the variety of data types stored, and the variety of query interfaces presented by those systems. Furthermore, real clouds are never perfectly symmetric - there often are differences between individual processors in their capabilities and connectivity. In this paper, we introduce a federation framework to manage such heterogeneous clouds. We then use this framework to discuss several challenges and their potential solutions.
    Full-text · Article · Jul 2014 · IEEE Transactions on Knowledge and Data Engineering
  • Guoliang Li · Shuo Chen · Jianhua Feng · Kian-lee Tan · Wen-syan Li
    ABSTRACT: Although the influence maximization problem, which selects a set of users in a social network to maximize the expected number of users influenced by the selected users (called the influence spread), has been extensively studied, existing works neglect the fact that location information can play an important role in influence maximization. Many real-world applications, such as location-aware word-of-mouth marketing, have location-aware requirements. In this paper we study the location-aware influence maximization problem. One big challenge in location-aware influence maximization is to develop an efficient scheme that offers a wide influence spread. To address this challenge, we propose two greedy algorithms with a 1-1/e approximation ratio. To meet the instant-speed requirement, we propose two efficient algorithms with an ε·(1-1/e) approximation ratio for any ε ∈ (0,1]. Experimental results on real datasets show our method achieves high performance while keeping a large influence spread, and significantly outperforms state-of-the-art algorithms.
    No preview · Article · Jun 2014
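The 1-1/e guarantee mentioned above comes from the standard greedy algorithm on a monotone submodular spread function; a minimal sketch with invented reach sets restricted to a target region:

```python
def greedy_influence(candidates, spread, k):
    """Standard greedy for a monotone submodular spread function,
    which carries the classic (1-1/e) guarantee. `spread` maps a
    seed set to its (expected) influence."""
    seeds = set()
    for _ in range(k):
        best = max((c for c in candidates if c not in seeds),
                   key=lambda c: spread(seeds | {c}) - spread(seeds))
        seeds.add(best)
    return seeds

# Toy reach sets, counted only inside the query region.
reach = {"u1": {1, 2, 3}, "u2": {3, 4}, "u3": {5}}
region = {1, 2, 3, 4, 5}
spread = lambda S: (len(set().union(*(reach[c] for c in S)) & region)
                    if S else 0)
seeds = greedy_influence(list(reach), spread, 2)
```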
  • Dongxiang Zhang · Chee-Yong Chan · Kian-Lee Tan
    ABSTRACT: Many of today's publish/subscribe (pub/sub) systems have been designed to cope with a large volume of subscriptions and high event arrival rate (velocity). However, in many novel applications (such as e-commerce), there is an increasing variety of items, each with different attributes. This leads to a very high-dimensional and sparse database that existing pub/sub systems can no longer support effectively. In this paper, we propose an efficient in-memory index that is scalable to the volume and update of subscriptions, the arrival rate of events and the variety of subscribable attributes. The index is also extensible to support complex scenarios such as prefix/suffix filtering and regular expression matching. We conduct extensive experiments on synthetic datasets and two real datasets (AOL query log and Ebay products). The results demonstrate the superiority of our index over state-of-the-art methods: our index incurs orders of magnitude less index construction time, consumes a small amount of memory and performs event matching efficiently.
    No preview · Article · Apr 2014 · Proceedings of the VLDB Endowment
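A count-based inverted index is one classic way to match sparse equality subscriptions; this toy class (names invented, equality predicates only) shows the idea, without the prefix/suffix or regular-expression extensions described above.

```python
from collections import defaultdict

class SubscriptionIndex:
    """Count-based matching: a subscription fires when every one of
    its (attribute, value) predicates appears in the event. Sparse,
    high-dimensional attributes only cost what they use."""
    def __init__(self):
        self.posting = defaultdict(list)   # (attr, value) -> [sub_id]
        self.size = {}                     # sub_id -> #predicates

    def subscribe(self, sub_id, predicates):
        self.size[sub_id] = len(predicates)
        for pair in predicates.items():
            self.posting[pair].append(sub_id)

    def match(self, event):
        hits = defaultdict(int)
        for pair in event.items():
            for sub_id in self.posting.get(pair, []):
                hits[sub_id] += 1
        # A subscription matches iff all of its predicates were hit.
        return {s for s, n in hits.items() if n == self.size[s]}

idx = SubscriptionIndex()
idx.subscribe("s1", {"brand": "acme", "color": "red"})
idx.subscribe("s2", {"color": "red"})
matched = idx.match({"brand": "acme", "color": "red", "size": "M"})
```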
  • ABSTRACT: Attributed graphs are becoming important tools for modeling information networks, such as the Web and various social networks (e.g., Facebook, LinkedIn, Twitter). However, it is computationally challenging to manage and analyze attributed graphs to support effective decision making. In this paper, we propose Pagrol, a parallel graph OLAP (Online Analytical Processing) system over attributed graphs. In particular, Pagrol introduces a new conceptual Hyper Graph Cube model (an attributed-graph analogue of the data cube model for relational DBMS) to aggregate attributed graphs at different granularities and levels. The proposed model supports different queries as well as a new set of graph OLAP Roll-Up/Drill-Down operations. Furthermore, on the basis of the Hyper Graph Cube, Pagrol provides an efficient MapReduce-based parallel graph cubing algorithm, MRGraph-Cubing, to compute the graph cube for an attributed graph. Pagrol employs numerous optimization techniques: (a) a self-contained join strategy to minimize I/O cost; (b) a scheme that groups cuboids into batches so as to minimize redundant computation; (c) a cost-based scheme to allocate the batches into bags (each with a small number of batches); and (d) an efficient scheme to process a bag using a single MapReduce job. Results of extensive experimental studies using both real Facebook and synthetic datasets on a 128-node cluster show that Pagrol is effective, efficient and scalable.
    No preview · Conference Paper · Mar 2014
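The graph-cube idea can be glimpsed by materializing one cuboid on a single machine: collapse vertices that agree on the chosen attributes and aggregate vertex and edge counts (function and data invented; a system like Pagrol computes all cuboids at scale with MapReduce).

```python
def cuboid(vertices, edges, dims):
    """One cuboid: group vertices by their values on `dims`, counting
    vertices per group and edges between groups."""
    key = lambda v: tuple(vertices[v][d] for d in dims)
    vcount = {}
    for v in vertices:
        vcount[key(v)] = vcount.get(key(v), 0) + 1
    ecount = {}
    for u, v in edges:
        pair = (key(u), key(v))
        ecount[pair] = ecount.get(pair, 0) + 1
    return vcount, ecount

# Tiny attributed graph.
vertices = {
    "a": {"gender": "M", "city": "SG"},
    "b": {"gender": "F", "city": "SG"},
    "c": {"gender": "M", "city": "KL"},
}
edges = [("a", "b"), ("a", "c")]
vc, ec = cuboid(vertices, edges, ["gender"])
```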
  • Article: epiC
    Dawei Jiang · Gang Chen · Beng Chin Ooi · Kian-Lee Tan · Sai Wu

    No preview · Article · Mar 2014
  • Guoliang Li · Jun Hu · Jianhua Feng · Kian-lee Tan
    ABSTRACT: The rapid development of social networks has resulted in a proliferation of user-generated content (UGC). UGC data, when properly analyzed, can be beneficial to many applications. For example, identifying a user's locations from microblogs is very important for effective location-based advertisement and recommendation. In this paper, we study the problem of identifying a user's locations from microblogs. This problem is rather challenging because the location information in a microblog is incomplete and we cannot get an accurate location from a single microblog. To address this challenge, we propose a global location identification method, called Glitter. Glitter combines multiple microblogs of a user and utilizes them to identify the user's locations. Glitter not only improves the quality of identifying a user's location but also supplements the location of a microblog so as to obtain an accurate location for it. To facilitate location identification, Glitter organizes points of interest (POIs) into a tree structure where leaf nodes are POIs and non-leaf nodes are segments of POIs, e.g., countries, states, cities, districts, and streets. Using the tree structure, Glitter first extracts candidate locations from each microblog of a user, which correspond to some tree nodes. Then Glitter aggregates these candidate locations and identifies the top-k locations of the user. Using the identified top-k user locations, Glitter refines the candidate locations and computes the top-k locations of each microblog. To achieve high recall, we enable fuzzy matching between locations and microblogs. We propose an incremental algorithm to support dynamic updates of microblogs. Experimental results on real-world datasets show that our method achieves high quality and good performance, and scales very well.
    No preview · Conference Paper · Mar 2014
  • Long Guo · Jie Shao · Htoo Htet Aung · Kian-Lee Tan
    ABSTRACT: With the development of GPS-enabled mobile devices, more and more pieces of information on the web are geotagged. Spatial keyword queries, which consider both spatial locations and textual descriptions to find objects of interest, adapt well to this trend. Therefore, a considerable number of studies have focused on the interesting problem of efficiently processing spatial keyword queries. However, most of them assume Euclidean space or examine a single snapshot query only. This paper investigates a novel problem, namely, continuous top-k spatial keyword queries on road networks, for the first time. We propose two methods that can monitor such moving queries in an incremental manner and reduce repetitive traversing of network edges for better performance. Experimental evaluation using large real datasets demonstrates that the proposed methods both outperform baseline methods significantly. Discussion about the parameters affecting the efficiency of the two methods is also presented to reveal their relative advantages.
    No preview · Article · Jan 2014 · GeoInformatica