Beng Chin Ooi

University of California, Santa Barbara, Santa Barbara, California, United States

Publications (286) · 112.68 Total Impact Points

  • ABSTRACT: Multicore CPUs and large memories are increasingly becoming the norm in modern computer systems. However, current database management systems (DBMSs) are generally ineffective in exploiting the parallelism of such systems. In particular, contention can lead to a dramatic fall in performance. In this paper, we propose a new concurrency control protocol called DGCC (Dependency Graph based Concurrency Control) that separates concurrency control from execution. DGCC builds dependency graphs for batched transactions before executing them. Using these graphs, contentions within the same batch of transactions are resolved before execution. As a result, the execution of the transactions does not need to deal with contention while remaining fully equivalent to a serialized execution. This better exploits multicore hardware and achieves a higher level of parallelism. To facilitate DGCC, we also propose a system architecture that eliminates certain centralized control components, yielding better scalability, and supports a more efficient recovery mechanism. Our extensive experimental study shows that DGCC achieves up to four times higher throughput than state-of-the-art concurrency control protocols on high-contention workloads.
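    A minimal sketch of the dependency-graph idea (the transaction format with explicit read/write sets and the sequential executor are illustrative assumptions, not the paper's actual DGCC implementation):

      from collections import defaultdict, deque

      def build_dependency_graph(batch):
          """batch: list of (txn_id, read_set, write_set).
          An edge i -> j means txn i must run before txn j (conflict, batch order)."""
          edges, indeg = defaultdict(set), defaultdict(int)
          for i, (_, r_i, w_i) in enumerate(batch):
              for j in range(i + 1, len(batch)):
                  _, r_j, w_j = batch[j]
                  # write-write, write-read or read-write conflict on any record
                  if (w_i & w_j) or (w_i & r_j) or (r_i & w_j):
                      if j not in edges[i]:
                          edges[i].add(j)
                          indeg[j] += 1
          return edges, indeg

      def execute_batch(batch, execute_fn):
          """Execute in a conflict-respecting order; transactions that become ready
          together are independent and could be dispatched to separate cores."""
          edges, indeg = build_dependency_graph(batch)
          ready = deque(i for i in range(len(batch)) if indeg[i] == 0)
          while ready:
              i = ready.popleft()
              execute_fn(batch[i])      # no locking or contention handling needed here
              for j in edges[i]:
                  indeg[j] -= 1
                  if indeg[j] == 0:
                      ready.append(j)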
  • ABSTRACT: By maintaining the data in main memory, in-memory databases dramatically reduce the I/O cost of transaction processing. However, for recovery purposes, those systems still need to flush their logs to disk, generating a significant number of I/Os. A new type of log, the command log, is being employed to replace the traditional data log (e.g., the ARIES log). A command log only tracks the transactions being executed, thereby effectively reducing the size of the log and improving performance. Command logging, on the other hand, increases the cost of recovery, because all the transactions in the log after the last checkpoint must be completely redone when there is a failure. For distributed database systems with many processing nodes, failures cannot be treated as exceptions, and as such, the long recovery time incurred by command logging may compromise the objective of providing efficient support for OLTP. In this paper, we first extend command logging to a distributed system, where all the nodes can perform their recovery in parallel. We show that the synchronization cost caused by dependencies is the bottleneck for command logging in a distributed system, and consequently propose an adaptive logging approach that combines data logging and command logging. The intuition is to use data logging to break the dependencies, while applying command logging to most transactions to reduce I/O costs. The ratio of data logging to command logging becomes an optimization that trades transaction processing performance against recovery time to suit different OLTP applications. Our experimental study compares the performance of our proposed adaptive logging, ARIES-style data logging and command logging on top of H-Store. The results show that adaptive logging can achieve a 10x boost in recovery speed and a transaction throughput that is comparable to that of command logging.
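    A rough sketch of the adaptive choice between the two log types (the per-transaction dependency-cost estimate and the fixed data-logging fraction are simplifying assumptions for illustration):

      def assign_log_types(transactions, data_log_fraction):
          """transactions: list of (txn_id, estimated_recovery_dependency_cost), where the
          cost approximates how much cross-node synchronization a command-log-only
          recovery of this transaction would require. Data-log the most expensive ones
          to break dependency chains; command-log the rest to keep logging I/O low."""
          k = int(len(transactions) * data_log_fraction)
          ranked = sorted(transactions, key=lambda t: t[1], reverse=True)
          return {txn_id: ("data" if rank < k else "command")
                  for rank, (txn_id, _) in enumerate(ranked)}

      # assign_log_types([("t1", 5), ("t2", 0), ("t3", 2)], 1 / 3)
      # -> {"t1": "data", "t3": "command", "t2": "command"}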
  • ABSTRACT: Edit distance is widely used for measuring the similarity between two strings. As a primitive operation, edit distance based string similarity search finds the strings in a collection that are similar to a given query string under edit distance. Existing approaches for answering such string similarity queries follow the filter-and-verify framework and use various indexes. Typically, most approaches assume that indexes and data sets are maintained in main memory. To overcome this limitation, in this paper, we propose B+-tree based approaches to answer edit distance based string similarity queries, and hence our approaches can be easily integrated into existing RDBMSs. In general, we answer string similarity search using pruning techniques from metric spaces, since edit distance is a metric. First, we split the string collection into partitions according to a set of reference strings. Then, we index the strings in all partitions using a single B+-tree, based on the distances of these strings to their corresponding reference strings. Finally, we propose two approaches to efficiently answer range and KNN queries, respectively, based on the B+-tree. We prove that the optimal partitioning of the data set is an NP-hard problem, and therefore propose a heuristic approach that selects the reference strings greedily, together with an optimal partition assignment strategy that minimizes the expected number of strings to be verified during query evaluation. Through extensive experiments over a variety of real data sets, we demonstrate that our B+-tree based approaches provide superior performance over state-of-the-art techniques on both range and KNN queries in most cases.
    IEEE Transactions on Knowledge and Data Engineering 12/2014; 26(12):2983-2996. DOI:10.1109/TKDE.2014.2309131 · 1.82 Impact Factor
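    A simplified sketch of the metric-space pruning described above, using a sorted in-memory list per partition in place of the single B+-tree and assuming one reference string per partition:

      import bisect

      def edit_distance(a, b):
          """Classic dynamic-programming (Levenshtein) edit distance."""
          prev = list(range(len(b) + 1))
          for i, ca in enumerate(a, 1):
              cur = [i]
              for j, cb in enumerate(b, 1):
                  cur.append(min(prev[j] + 1,                  # deletion
                                 cur[j - 1] + 1,               # insertion
                                 prev[j - 1] + (ca != cb)))    # substitution
              prev = cur
          return prev[-1]

      class MetricPartitionIndex:
          """Each string is assigned to its nearest reference string and stored sorted by
          its distance to that reference, mimicking the (partition id, distance) key."""
          def __init__(self, strings, references):
              self.references = references
              self.partitions = [[] for _ in references]
              for s in strings:
                  dists = [edit_distance(s, r) for r in references]
                  p = min(range(len(references)), key=dists.__getitem__)
                  self.partitions[p].append((dists[p], s))
              for part in self.partitions:
                  part.sort()

          def range_query(self, q, tau):
              """Strings s with edit_distance(q, s) <= tau. Candidates are pruned by the
              triangle inequality |d(q, r) - d(s, r)| <= d(q, s) before verification."""
              results = []
              for p, r in enumerate(self.references):
                  d_qr = edit_distance(q, r)
                  part = self.partitions[p]
                  lo = bisect.bisect_left(part, (d_qr - tau, ""))
                  for d_sr, s in part[lo:]:
                      if d_sr > d_qr + tau:
                          break
                      if edit_distance(q, s) <= tau:           # verification step
                          results.append(s)
              return results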
  • ABSTRACT: Companies are increasingly moving their data processing to the cloud, for reasons of cost, scalability, and convenience, among others. However, hosting multiple applications and storage systems on the same cloud introduces resource sharing and heterogeneous data processing challenges due to the variety of resource usage patterns employed, the variety of data types stored, and the variety of query interfaces presented by those systems. Furthermore, real clouds are never perfectly symmetric - there often are differences between individual processors in their capabilities and connectivity. In this paper, we introduce a federation framework to manage such heterogeneous clouds. We then use this framework to discuss several challenges and their potential solutions.
    IEEE Transactions on Knowledge and Data Engineering 07/2014; 26(7):1670-1678. DOI:10.1109/TKDE.2014.2326659 · 1.82 Impact Factor
  • ABSTRACT: The need to locate the k-nearest data points with respect to a given query point in a multi- and high-dimensional space is common in many applications. Therefore, it is essential to provide efficient support for such a search. Locality Sensitive Hashing (LSH) has been widely accepted as an effective hash method for high-dimensional similarity search. However, data sets are typically not distributed uniformly over the space, and as a result, the buckets of LSH are unbalanced, causing the performance of LSH to degrade. In this paper, we propose a new and efficient method called Data Sensitive Hashing (DSH) to address this drawback. DSH improves the hashing functions and hashing family, and is orthogonal to most of the recent state-of-the-art approaches which mainly focus on indexing and querying strategies. DSH leverages data distributions and is capable of directly preserving the nearest neighbor relations. We show the theoretical guarantee of DSH, and demonstrate its efficiency experimentally.
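    For context, a minimal sketch of the standard random-projection (E2LSH-style) hash family whose bucket imbalance on skewed data motivates DSH; the data-sensitive hash family itself is not reproduced here:

      import random
      from collections import defaultdict

      class RandomProjectionLSH:
          """h(v) = floor((a . v + b) / w), concatenated over k random projections."""
          def __init__(self, dim, k=4, w=4.0, seed=0):
              rnd = random.Random(seed)
              self.w = w
              self.a = [[rnd.gauss(0, 1) for _ in range(dim)] for _ in range(k)]
              self.b = [rnd.uniform(0, w) for _ in range(k)]

          def key(self, v):
              return tuple(int((sum(ai * vi for ai, vi in zip(a, v)) + b) // self.w)
                           for a, b in zip(self.a, self.b))

      def bucketize(points, lsh):
          """On skewed data the resulting histogram is heavily unbalanced: a few huge
          buckets and many near-empty ones, which is the degradation DSH targets."""
          buckets = defaultdict(list)
          for p in points:
              buckets[lsh.key(p)].append(tuple(p))
          return buckets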
  • ABSTRACT: It is important to provide efficient execution for ad-hoc data processing programs. In contrast to constructing complex declarative queries, many users prefer to write their programs using procedural code with simple queries. As many users are not expert programmers, their programs usually exhibit poor performance in practice, and it is a challenge to automatically optimize such programs and execute them efficiently. In this paper, we present UniAD, a system designed to simplify the programming of data processing tasks and provide efficient execution for user programs. We propose a novel intermediate representation named UniQL, which utilizes HOQs to describe the operations performed in programs. By combining both procedural and declarative logic, we can perform various optimizations across the boundary between procedural and declarative code. We describe these optimizations and conduct extensive empirical studies using UniAD. The experimental results on four benchmarks demonstrate that our techniques can significantly improve the performance of a wide range of data processing programs.
  • ABSTRACT: Most data analytics applications are industry/domain specific, e.g., predicting patients at high risk of being admitted to the intensive care unit in the healthcare sector or predicting malicious SMSs in the telecommunication sector. Existing solutions are based on "best practices", i.e., the systems' decisions are knowledge-driven and/or data-driven. However, there are rules and exceptional cases that can only be precisely formulated and identified by subject-matter experts (SMEs) who have accumulated many years of experience. This paper envisions a more intelligent database management system (DBMS) that captures such knowledge to effectively address industry/domain specific applications. At the core, the system is a hybrid human-machine database engine where the machine interacts with the SMEs as part of a feedback loop to gather, infer, ascertain and enhance the database knowledge and processing. We discuss the challenges towards building such a system through examples in healthcare predictive analysis -- a popular area for big data analytics.
    ACM SIGKDD Explorations Newsletter 06/2014; 16(1). DOI:10.1145/2674026.2674032
  • IEEE Transactions on Knowledge and Data Engineering 06/2014; 26(6):1298-1299. DOI:10.1109/TKDE.2014.2314529 · 1.82 Impact Factor
  • Feng Li, M. Tamer Özsu, Gang Chen, Beng Chin Ooi
    ABSTRACT: It is widely recognized that OLTP and OLAP queries have different data access patterns, processing needs and requirements. Hence, OLTP queries and OLAP queries are typically handled by two different systems, and the data are periodically extracted from the OLTP system, transformed, and loaded into the OLAP system for analysis. With the awareness that big data can provide enterprises with useful insights from vast amounts of data, effective and timely decisions derived from real-time analytics are important. It is therefore desirable to provide real-time OLAP querying support, where OLAP queries read the latest data while OLTP queries create new versions. In this paper, we propose R-Store, a scalable distributed system that supports real-time OLAP by extending the MapReduce framework. We extend an open source distributed key/value system, HBase, as the underlying storage system that stores the data cube and the real-time data. When real-time data are updated, they are streamed to a streaming MapReduce engine, named Hstreaming, which updates the cube on an incremental basis. Based on the metadata stored in the storage system, either the data cube, the OLTP database, or both are used by the MapReduce jobs for OLAP queries. We propose techniques to efficiently scan the real-time data in the storage system, and design an adaptive algorithm that processes real-time queries based on our proposed cost model. The main objectives are to ensure the freshness of answers and low processing latency. Experiments conducted on the TPC-H data set demonstrate the effectiveness and efficiency of our approach.
    2014 IEEE 30th International Conference on Data Engineering (ICDE); 03/2014
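    The streaming cube maintenance can be pictured with two plain functions; the group-by dimensions, the rollup and the HBase/Hstreaming plumbing shown here are illustrative assumptions, not R-Store's actual schema:

      from collections import defaultdict

      def map_update(update):
          """update: ((month, region), measure_delta). Emit one key/value pair per
          cube cell the update contributes to: the base cell and a rollup over region."""
          (month, region), delta = update
          yield ((month, region), delta)
          yield ((month, '*'), delta)

      def reduce_deltas(pairs):
          """Sum the deltas per cube cell; R-Store would apply these sums to the
          materialized cube kept in HBase rather than returning them."""
          cube_delta = defaultdict(float)
          for key, delta in pairs:
              cube_delta[key] += delta
          return dict(cube_delta)

      # updates = [(("2014-03", "EUROPE"), 120.0), (("2014-03", "ASIA"), 80.0)]
      # reduce_deltas(kv for u in updates for kv in map_update(u))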
  • ABSTRACT: The Web is teeming with rich structured information in the form of HTML tables, which provides us with the opportunity to build a knowledge repository by integrating these tables. An essential problem of web data integration is to discover semantic correspondences between web table columns, and schema matching is a popular means to determine the semantic correspondences. However, conventional schema matching techniques are not always effective for web table matching due to the incompleteness in web tables. In this paper, we propose a two-pronged approach for web table matching that effectively addresses the above difficulties. First, we propose a concept-based approach that maps each column of a web table to the best concept, in a well-developed knowledge base, that represents it. This approach overcomes the problem that sometimes values of two web table columns may be disjoint, even though the columns are related, due to incompleteness in the column values. Second, we develop a hybrid machine-crowdsourcing framework that leverages human intelligence to discern the concepts for "difficult" columns. Our overall framework assigns the most "beneficial" column-to-concept matching tasks to the crowd under a given budget and utilizes the crowdsourcing result to help our algorithm infer the best matches for the rest of the columns. We validate the effectiveness of our framework through an extensive experimental study over two real-world web table data sets. The results show that our two-pronged approach outperforms existing schema matching techniques at only a low cost for crowdsourcing.
    2014 IEEE 30th International Conference on Data Engineering (ICDE); 03/2014
  • ABSTRACT: Crowdsourcing has created a variety of opportunities for many challenging problems by leveraging human intelligence. For example, applications such as image tagging, natural language processing, and semantic-based information retrieval can exploit crowd-based human computation to supplement existing computational algorithms. Naturally, human workers in crowdsourcing solve problems based on their knowledge, experience, and perception. It is therefore not clear which problems can be solved better by crowdsourcing than by traditional machine-based methods alone, and a cost-sensitive quantitative analysis method is needed. In this paper, we design and implement such a cost-sensitive method for crowdsourcing. We estimate the profit of a crowdsourcing job online, so that questions with no future profit from crowdsourcing can be terminated. Two models are proposed to estimate the profit of a crowdsourcing job, namely the linear value model and the generalized non-linear model. Using these models, the expected profit of obtaining new answers for a specific question is computed based on the answers already received. A question is terminated in real time if the marginal expected profit of obtaining more answers is not positive. We further extend the method to publish a batch of questions in a HIT. We evaluate the effectiveness of our proposed method using two real-world jobs on AMT. The experimental results show that our proposed method outperforms the state-of-the-art methods.
    Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data; 06/2013
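    A toy sketch of the termination rule under a linear value model, with accuracy approximated by majority-vote agreement (the paper's estimators are more elaborate; all parameter names and defaults here are illustrative):

      from collections import Counter

      def estimate_accuracy(answers):
          """Crude accuracy proxy: fraction of answers agreeing with the current majority."""
          if not answers:
              return 0.0
          counts = Counter(answers)
          return counts.most_common(1)[0][1] / len(answers)

      def marginal_expected_profit(answers, value_per_accuracy_point, cost_per_answer):
          """Linear value model: value = value_per_accuracy_point * accuracy. The gain of
          one more answer is approximated by the accuracy improvement contributed by the
          last answer received; its profit is that gain minus the answer's cost."""
          if len(answers) < 2:
              return float("inf")        # always collect the first couple of answers
          gain = estimate_accuracy(answers) - estimate_accuracy(answers[:-1])
          return value_per_accuracy_point * max(gain, 0.0) - cost_per_answer

      def should_terminate(answers, value_per_accuracy_point=10.0, cost_per_answer=0.05):
          """Stop collecting answers once the marginal expected profit is not positive."""
          return marginal_expected_profit(answers, value_per_accuracy_point,
                                          cost_per_answer) <= 0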
  • ABSTRACT: The popularity of similarity search expanded with the increased interest in multimedia databases, bioinformatics, or social networks, and with the growing number of users trying to find information in huge collections of unstructured data. During the ...
    ACM SIGMOD Record 06/2013; 42(2):46-51. DOI:10.1145/2503792.2503803 · 0.96 Impact Factor
  • Beng Chin Ooi
    ABSTRACT: We focus on measuring relationships between pairs of objects in Wikipedia whose pages can be regarded as individual objects. Two kinds of relationships between two objects exist: in Wikipedia, an explicit relationship is represented by a single link ...
    IEEE Transactions on Knowledge and Data Engineering 02/2013; 25(2):241-244. DOI:10.1109/TKDE.2013.6 · 1.82 Impact Factor
  • Feng Li, Beng Chin Ooi, M. Tamer Özsu, Sai Wu
    ABSTRACT: MapReduce is a framework for processing and managing large-scale data sets in a distributed cluster, which has been used for applications such as generating search indexes, document clustering, access log analysis, and various other forms of data analytics. MapReduce adopts a flexible computation model with a simple interface consisting of map and reduce functions whose implementations can be customized by application developers. Since its introduction, a substantial amount of research effort has been directed towards making it more usable and efficient for supporting database-centric operations. In this paper we aim to provide a comprehensive review of a wide range of proposals and systems that focus fundamentally on the support of distributed data management and processing using the MapReduce framework.
    ACM Computing Surveys 01/2013; DOI:10.1145/2503009 · 4.04 Impact Factor
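    The programming model being surveyed reduces to two user-supplied functions; below is the classic word-count illustration with a single-process stand-in for the framework's shuffle phase (a sketch, not any particular MapReduce implementation):

      from itertools import groupby
      from operator import itemgetter

      def map_fn(doc_id, text):
          """User-defined map: emit (word, 1) for every word in the document."""
          for word in text.split():
              yield (word.lower(), 1)

      def reduce_fn(word, counts):
          """User-defined reduce: sum the partial counts for one word."""
          yield (word, sum(counts))

      def run_local(documents):
          """Stand-in for the framework: the shuffle is a sort plus group-by-key."""
          intermediate = [kv for doc_id, text in documents
                          for kv in map_fn(doc_id, text)]
          intermediate.sort(key=itemgetter(0))
          output = {}
          for word, group in groupby(intermediate, key=itemgetter(0)):
              for k, v in reduce_fn(word, (c for _, c in group)):
                  output[k] = v
          return output

      # run_local([("d1", "to be or not to be")]) -> {'be': 2, 'not': 1, 'or': 1, 'to': 2}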
  • ABSTRACT: Spatial web objects that possess both a geographical location and a textual description are gaining in prevalence. This gives prominence to spatial keyword queries that exploit both location and textual arguments. Such queries are used in many web services such as yellow pages and map services. We present SWORS, the Spatial Web Object Retrieval System, that is capable of efficiently retrieving spatial web objects that satisfy spatial keyword queries. Specifically, SWORS supports two types of queries: a) the location-aware top-k text retrieval (LkT) query that retrieves k individual spatial web objects taking into account query location proximity and text relevancy; b) the spatial keyword group (SKG) query that retrieves a group of objects that cover the query keywords, are nearest to the query location, and have the shortest inter-object distances. SWORS provides browser-based interfaces for desktop and laptop computers and provides a client application for mobile devices. The interfaces and the client enable users to formulate queries and view the query results on a map. The server side stores the data and processes the queries. We use three real-life data sets to demonstrate the functionality and performance of SWORS.
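    A simplified scoring function for the LkT query, combining normalised location proximity with keyword coverage (the linear combination, the weights and the object format are illustrative; SWORS's actual ranking function and indexes are not reproduced):

      import heapq
      import math

      def lkt_query(objects, q_loc, q_keywords, k, alpha=0.5, max_dist=1.0):
          """objects: list of (obj_id, (x, y), set_of_keywords).
          Score = alpha * spatial proximity + (1 - alpha) * text relevance, where
          proximity is 1 - dist/max_dist and relevance is the fraction of query
          keywords covered by the object's description."""
          q_keywords = set(q_keywords)

          def score(obj):
              _, loc, words = obj
              proximity = max(0.0, 1.0 - math.dist(loc, q_loc) / max_dist)
              relevance = (len(q_keywords & words) / len(q_keywords)
                           if q_keywords else 0.0)
              return alpha * proximity + (1 - alpha) * relevance

          return heapq.nlargest(k, objects, key=score)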
  • ABSTRACT: Numerous applications such as financial transactions (e.g., stock trading) are write-heavy in nature. The shift from reads to writes in web applications has also been accelerating in recent years. Write-ahead logging is a common approach for providing recovery capability while improving performance in most storage systems. However, the separation of log and application data incurs additional write overhead in write-heavy environments and hence adversely affects write throughput and recovery time in the system. In this paper, we introduce LogBase, a scalable log-structured database system that adopts log-only storage to remove the write bottleneck and support fast system recovery. LogBase is designed to be dynamically deployed on commodity clusters to take advantage of the elastic scaling property of cloud environments. LogBase provides in-memory multiversion indexes to support efficient access to the data maintained in the log. LogBase also supports transactions that bundle read and write operations spanning multiple records. We implemented the proposed system and compared it with HBase and a disk-based log-structured record-oriented system modeled after RAMCloud. The experimental results show that LogBase is able to provide sustained write throughput, efficient data access out of the cache, and effective system recovery.
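    A toy sketch of the log-only storage idea: every write is an append to the log, and an in-memory multiversion index maps each key to its versions' log offsets (recovery would rebuild this index by scanning the log). The file format and API are illustrative assumptions, not LogBase's:

      import os
      import pickle
      import time
      from collections import defaultdict

      class LogOnlyStore:
          def __init__(self, path):
              self.index = defaultdict(list)        # key -> [(timestamp, offset), ...]
              self.log = open(path, "ab+")          # the log *is* the database

          def put(self, key, value):
              ts = time.time()
              record = pickle.dumps((key, ts, value))
              self.log.seek(0, os.SEEK_END)
              offset = self.log.tell()
              self.log.write(len(record).to_bytes(4, "big") + record)
              self.log.flush()
              self.index[key].append((ts, offset))

          def get(self, key, as_of=None):
              """Read the latest version, or the latest one no newer than `as_of`."""
              versions = self.index.get(key, [])
              if as_of is not None:
                  versions = [v for v in versions if v[0] <= as_of]
              if not versions:
                  return None
              _, offset = versions[-1]
              self.log.seek(offset)
              size = int.from_bytes(self.log.read(4), "big")
              _, _, value = pickle.loads(self.log.read(size))
              return value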
  • ABSTRACT: Some complex problems, such as image tagging and natural language processing, are very challenging for computers, and even state-of-the-art technology is not yet able to provide satisfactory accuracy. Therefore, rather than relying solely on developing new and better algorithms to handle such tasks, we look to the crowdsourcing solution -- employing human participation -- to make good the shortfall in current technology. Crowdsourcing is a good supplement to many computer tasks. A complex job may be divided into computer-oriented tasks and human-oriented tasks, which are then assigned to machines and humans respectively. To leverage the power of crowdsourcing, we design and implement a Crowdsourcing Data Analytics System, CDAS. CDAS is a framework designed to support the deployment of various crowdsourcing applications. The core part of CDAS is a quality-sensitive answering model, which guides the crowdsourcing engine to process and monitor the human tasks. In this paper, we introduce the principles of our quality-sensitive model. To satisfy the user-required accuracy, the model guides the crowdsourcing query engine in the design and processing of the corresponding crowdsourcing jobs. It provides an estimated accuracy for each generated result based on the human workers' historical performance. When verifying the quality of the result, the model employs an online strategy to reduce waiting time. To show the effectiveness of the model, we implement and deploy two analytics jobs on CDAS, a Twitter sentiment analytics job and an image tagging job. We use real Twitter and Flickr data as our queries respectively. We compare our approaches with state-of-the-art classification and image annotation techniques. The results show that the human-assisted methods can indeed achieve a much higher accuracy. By embedding the quality-sensitive model into the crowdsourcing query engine, we effectiv...[truncated].
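    The quality-sensitive answering idea can be sketched as a worker-accuracy-weighted vote whose normalised score serves as the estimated accuracy of the produced result (a simplification; CDAS's actual model and its online verification strategy are more involved):

      import math
      from collections import defaultdict

      def weighted_vote(answers, worker_accuracy):
          """answers: list of (worker_id, label); worker_accuracy: worker_id -> historical
          accuracy. Each vote is weighted by the log-odds of the worker's accuracy, and
          the winning label's normalised (softmax) score is a rough accuracy estimate."""
          if not answers:
              return None, 0.0
          scores = defaultdict(float)
          for worker, label in answers:
              acc = min(max(worker_accuracy.get(worker, 0.5), 1e-6), 1 - 1e-6)
              scores[label] += math.log(acc / (1 - acc))
          best = max(scores, key=scores.get)
          total = sum(math.exp(s) for s in scores.values())
          return best, math.exp(scores[best]) / total

      def enough_quality(answers, worker_accuracy, required_accuracy=0.9):
          """Stop requesting more answers once the estimated accuracy meets the target."""
          return weighted_vote(answers, worker_accuracy)[1] >= required_accuracy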
  • Wei Lu, Yanyan Shen, Su Chen, Beng Chin Ooi
    ABSTRACT: The k nearest neighbor join (kNN join), designed to find the k nearest neighbors from a dataset S for every object in another dataset R, is a primitive operation widely adopted by many data mining applications. As a combination of the k nearest neighbor query and the join operation, the kNN join is an expensive operation. Given the increasing volume of data, it is difficult to perform a kNN join efficiently on a centralized machine. In this paper, we investigate how to perform the kNN join using MapReduce, a well-accepted framework for data-intensive applications over clusters of computers. In brief, the mappers cluster objects into groups; the reducers then perform the kNN join on each group of objects separately. We design an effective mapping mechanism that exploits pruning rules for distance filtering, and hence reduces both the shuffling and computational costs. To further reduce the shuffling cost, we propose two approximate algorithms to minimize the number of replicas. Extensive experiments on our in-house cluster demonstrate that our proposed methods are efficient, robust and scalable.
    06/2012; DOI:10.14778/2336664.2336674
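    A single-process sketch of the partition-then-join scheme: objects from both data sets are mapped to the group of their nearest pivot, and each group is joined independently. The boundary-object replication and pruning rules of the paper are omitted, so this baseline is approximate:

      import heapq
      import math
      from collections import defaultdict

      def map_phase(R, S, pivots):
          """Assign every object (a coordinate tuple) to the group of its nearest pivot."""
          groups = defaultdict(lambda: ([], []))       # gid -> (r_objects, s_objects)
          for side, dataset in ((0, R), (1, S)):
              for obj in dataset:
                  gid = min(range(len(pivots)),
                            key=lambda i: math.dist(obj, pivots[i]))
                  groups[gid][side].append(obj)
          return groups

      def reduce_phase(groups, k):
          """Per-group kNN join: for each r in a group, find its k nearest s objects.
          Without replicating boundary objects, some true neighbors in other groups
          may be missed, which is exactly what the paper's mapping mechanism avoids."""
          result = {}
          for r_objs, s_objs in groups.values():
              for r in r_objs:
                  result[r] = heapq.nsmallest(k, s_objs, key=lambda s: math.dist(r, s))
          return result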
  • ABSTRACT: An increasing amount of trajectory data is being annotated with text descriptions to better capture the semantics associated with locations. The fusion of spatial locations and text descriptions in trajectories engenders a new type of top-k query that takes both aspects into account. Each trajectory in consideration consists of a sequence of geo-spatial locations associated with text descriptions. Given a user location λ and a keyword set ψ, a top-k query returns k trajectories whose text descriptions cover the keywords ψ and that have the shortest match distance. To the best of our knowledge, previous research on querying trajectory databases has focused on trajectory data without any text description, and no existing work has studied this kind of top-k query on trajectories. This paper proposes a novel method for efficiently computing top-k trajectories. The method is built on a new hybrid index, the cell-keyword conscious B+-tree, which enables us to exploit both text relevance and location proximity to facilitate efficient and effective query processing. The results of our extensive empirical studies with an implementation of the proposed algorithms on BerkeleyDB demonstrate that our proposed methods are capable of achieving excellent performance and good scalability.
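    A naive baseline for the top-k query described above (linear scan, with the match distance simplified to the distance from the query location to the nearest keyword-matching point; the cell-keyword conscious B+-tree is what makes this efficient in the paper):

      import heapq
      import math

      def topk_trajectories(trajectories, q_loc, q_keywords, k):
          """trajectories: trajectory_id -> list of ((x, y), set_of_keywords) points.
          A trajectory qualifies if its points' keywords jointly cover q_keywords."""
          q_keywords = set(q_keywords)
          scored = []
          for tid, points in trajectories.items():
              covered = set().union(*(kw for _, kw in points)) if points else set()
              hits = [math.dist(loc, q_loc) for loc, kw in points if kw & q_keywords]
              if q_keywords <= covered and hits:
                  scored.append((min(hits), tid))
          return [tid for _, tid in heapq.nsmallest(k, scored)]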
  • ABSTRACT: Scalable database management systems (DBMSs) are a critical part of the cloud infrastructure and play an important role in ensuring the smooth transition of applications from classical enterprise infrastructures to next-generation cloud infrastructures. Though scalable data management on distributed platforms has been a vision for more than three decades and much research has focused on large-scale data management in traditional enterprise settings, cloud computing brings its own set of novel challenges that must be addressed to ensure the success of data management solutions in the cloud environment, which is inherently distributed. This article presents an organised picture of the challenges faced by application developers and DBMS designers in developing and deploying internet-scale applications. Our background study encompasses systems for supporting update-heavy applications and focuses on providing an in-depth analysis of such systems. We crystallise the design choices made by some successful large-scale database management systems, analyse the application demands and access patterns, and examine how scalable database management systems have evolved to meet such requirements.
    International Journal of Computational Science and Engineering 03/2012; 7(1):2-16. DOI:10.1504/IJCSE.2012.046177

Publication Stats

6k Citations
112.68 Total Impact Points

Institutions

  • 2015
    • University of California, Santa Barbara
      Santa Barbara, California, United States
  • 2014
    • University of Auckland
      • Department of Computer Science
Auckland, Auckland, New Zealand
  • 2001–2014
    • National University of Singapore
      • Department of Computer Science
Singapore, Singapore
  • 2007
    • Aalborg University
      • Department of Computer Science
      Aalborg, Region North Jutland, Denmark
  • 2006
    • Singapore Management University
      • School of Information Systems
Singapore, Singapore
  • 2004
    • University of Michigan
      Ann Arbor, Michigan, United States
    • AT&T Labs
      Austin, Texas, United States
    • Singapore-MIT Alliance
      Cambridge, Massachusetts, United States
  • 2003
    • The Hong Kong University of Science and Technology
      • Department of Computer Science and Engineering
      Kowloon, Hong Kong
    • Fudan University
      • School of Computer Science
      Shanghai, Shanghai Shi, China
  • 1999
    • University of Milan
      Milano, Lombardy, Italy
  • 1997
    • University of Wisconsin, Madison
      • Department of Computer Sciences
Madison, Wisconsin, United States