Beng Chin Ooi

University of California, Santa Barbara, Santa Barbara, California, United States

Publications (303) · 119.62 Total impact

  • Ju Fan · Meihui Zhang · Stanley Kok · Meiyu Lu · Beng Chin Ooi
    ABSTRACT: We study the query optimization problem in declarative crowdsourcing systems. Declarative crowdsourcing is designed to hide the complexities and relieve the user of the burden of dealing with the crowd. The user is only required to submit an SQL-like query and the system takes the responsibility of compiling the query, generating the execution plan and evaluating it in the crowdsourcing marketplace. A given query can have many alternative execution plans and the difference in crowdsourcing cost between the best and the worst plans may be several orders of magnitude. Therefore, as in relational database systems, query optimization is important to crowdsourcing systems that provide declarative query interfaces. In this paper, we propose CrowdOp, a cost-based query optimization approach for declarative crowdsourcing systems. CrowdOp considers both cost and latency in query optimization objectives and generates query plans that provide a good balance between the cost and latency. We develop efficient algorithms in CrowdOp for optimizing three types of queries: selection queries, join queries, and complex selection-join queries. We validate our approach via extensive experiments by simulation as well as with the real crowd on Amazon Mechanical Turk.
    IEEE Transactions on Knowledge and Data Engineering 08/2015; 27(8):1-1. DOI:10.1109/TKDE.2015.2407353 · 2.07 Impact Factor
  • Hao Zhang · Gang Chen · Beng Chin Ooi · Kian-Lee Tan · Meihui Zhang
    ABSTRACT: Growing main memory capacity has fueled the development of in-memory big data management and processing. By eliminating the disk I/O bottleneck, it is now possible to support interactive data analytics. However, in-memory systems are much more sensitive to other sources of overhead that do not matter in traditional I/O-bound disk-based systems. Some issues such as fault-tolerance and consistency are also more challenging to handle in an in-memory environment. We are witnessing a revolution in the design of database systems that exploit main memory as their data storage layer. Much of this research has focused on several dimensions: modern CPU and memory hierarchy utilization, time/space efficiency, parallelism, and concurrency control. In this survey, we aim to provide a thorough review of a wide range of in-memory data management and processing proposals and systems, including both data storage systems and data processing frameworks. We also give a comprehensive presentation of important technology in memory management, and some key factors that need to be considered in order to achieve efficient in-memory data management and processing.
    IEEE Transactions on Knowledge and Data Engineering 07/2015; 27(7):1-1. DOI:10.1109/TKDE.2015.2427795 · 2.07 Impact Factor
  • ABSTRACT: Multicore CPUs and large memories are increasingly becoming the norm in modern computer systems. However, current database management systems (DBMSs) are generally ineffective in exploiting the parallelism of such systems. In particular, contention can lead to a dramatic fall in performance. In this paper, we propose a new concurrency control protocol called DGCC (Dependency Graph based Concurrency Control) that separates concurrency control from execution. DGCC builds dependency graphs for batched transactions before executing them. Using these graphs, contentions within the same batch of transactions are resolved before execution. As a result, the execution of the transactions does not need to deal with contention while maintaining full equivalence to that of serialized execution. This better exploits multicore hardware and achieves a higher level of parallelism. To facilitate DGCC, we have also proposed a system architecture that omits certain centralized control components, yielding better scalability, and supports a more efficient recovery mechanism. Our extensive experimental study shows that DGCC achieves up to four times higher throughput compared to that of state-of-the-art concurrency control protocols for high contention workloads.
  • Chang Yao · Divyakant Agrawal · Gang Chen · Beng Chin Ooi · Sai Wu
    ABSTRACT: By maintaining the data in main memory, in-memory databases dramatically reduce the I/O cost of transaction processing. However, for recovery purposes, those systems still need to flush the logs to disk, generating a significant number of I/Os. A new type of log, the command log, is being employed to replace the traditional data log (e.g., ARIES log). A command log only tracks the transactions being executed, thereby effectively reducing the size of the log and improving the performance. Command logging, on the other hand, increases the cost of recovery, because all the transactions in the log after the last checkpoint must be completely redone when there is a failure. For distributed database systems with many processing nodes, failures cannot be assumed as exceptions, and as such, the long recovery time incurred by command logging may compromise the objective of providing efficient support for OLTP. In this paper, we first extend command logging to a distributed system, where all the nodes can perform their recovery in parallel. We show that the synchronisation cost caused by dependency is the bottleneck for command logging in a distributed system, and consequently propose an adaptive logging approach that combines data logging and command logging. The intuition is to use data logging to break the dependency, while applying command logging for most transactions to reduce I/O costs. The percentage of data logging versus command logging becomes an optimization between the performance of transaction processing and recovery to suit different OLTP applications. Our experimental study compares the performance of our proposed adaptive logging, ARIES style data logging and command logging on top of H-Store. The results show that adaptive logging can achieve a 10x boost for recovery and a transaction throughput that is comparable to that of command logging.
  • Proceedings of the VLDB Endowment 02/2015; 8(7):762-773. DOI:10.14778/2752939.2752945
  • Proceedings of the VLDB Endowment 12/2014; 8(4):437-448. DOI:10.14778/2735496.2735506
  • ABSTRACT: Edit distance is widely used for measuring the similarity between two strings. As a primitive operation, edit distance based string similarity search finds strings in a collection that are similar to a given query string using edit distance. Existing approaches for answering such string similarity queries follow the filter-and-verify framework by using various indexes. Typically, most approaches assume that indexes and data sets are maintained in main memory. To overcome this limitation, in this paper, we propose B+-tree based approaches to answer edit distance based string similarity queries, and hence, our approaches can be easily integrated into existing RDBMSs. In general, we answer string similarity search using pruning techniques employed in the metric space, since edit distance is a metric. First, we split the string collection into partitions according to a set of reference strings. Then, we index strings in all partitions using a single B+-tree based on the distances of these strings to their corresponding reference strings. Finally, we propose two approaches to efficiently answer range and KNN queries, respectively, based on the B+-tree. We prove that the optimal partitioning of the data set is an NP-hard problem, and therefore propose a heuristic approach for selecting the reference strings greedily and present an optimal partition assignment strategy to minimize the expected number of strings that need to be verified during the query evaluation. Through extensive experiments over a variety of real data sets, we demonstrate that our B+-tree based approaches provide superior performance over state-of-the-art techniques on both range and KNN queries in most cases.
    IEEE Transactions on Knowledge and Data Engineering 12/2014; 26(12):2983-2996. DOI:10.1109/TKDE.2014.2309131 · 2.07 Impact Factor
  • Article: ScalaGiST
    Peng Lu · Gang Chen · Beng Chin Ooi · Hoang Tam Vo · Sai Wu
  • Ashraf Aboulnaga · Beng Chin Ooi · Patrick Valduriez
    The VLDB Journal 09/2014; 23(6). DOI:10.1007/s00778-014-0371-0 · 1.70 Impact Factor
  • ABSTRACT: Companies are increasingly moving their data processing to the cloud, for reasons of cost, scalability, and convenience, among others. However, hosting multiple applications and storage systems on the same cloud introduces resource sharing and heterogeneous data processing challenges due to the variety of resource usage patterns employed, the variety of data types stored, and the variety of query interfaces presented by those systems. Furthermore, real clouds are never perfectly symmetric - there often are differences between individual processors in their capabilities and connectivity. In this paper, we introduce a federation framework to manage such heterogeneous clouds. We then use this framework to discuss several challenges and their potential solutions.
    IEEE Transactions on Knowledge and Data Engineering 07/2014; 26(7):1670-1678. DOI:10.1109/TKDE.2014.2326659 · 2.07 Impact Factor
  • ABSTRACT: The need to locate the k-nearest data points with respect to a given query point in a multi- and high-dimensional space is common in many applications. Therefore, it is essential to provide efficient support for such a search. Locality Sensitive Hashing (LSH) has been widely accepted as an effective hash method for high-dimensional similarity search. However, data sets are typically not distributed uniformly over the space, and as a result, the buckets of LSH are unbalanced, causing the performance of LSH to degrade. In this paper, we propose a new and efficient method called Data Sensitive Hashing (DSH) to address this drawback. DSH improves the hashing functions and hashing family, and is orthogonal to most of the recent state-of-the-art approaches which mainly focus on indexing and querying strategies. DSH leverages data distributions and is capable of directly preserving the nearest neighbor relations. We show the theoretical guarantee of DSH, and demonstrate its efficiency experimentally.
  • Xiaogang Shi · Bin Cui · Gillian Dobbie · Beng Chin Ooi
    ABSTRACT: It is important to provide efficient execution for ad-hoc data processing programs. In contrast to constructing complex declarative queries, many users prefer to write their programs using procedural code with simple queries. As many users are not expert programmers, their programs usually exhibit poor performance in practice and it is a challenge to automatically optimize these programs and efficiently execute the programs. In this paper, we present UniAD, a system designed to simplify the programming of data processing tasks and provide efficient execution for user programs. We propose a novel intermediate representation named UniQL which utilizes HOQs to describe the operations performed in programs. By combining both procedural and declarative logics, we can perform various optimizations across the boundary between procedural and declarative codes. We describe optimizations and conduct extensive empirical studies using UniAD. The experimental results on four benchmarks demonstrate that our techniques can significantly improve the performance of a wide range of data processing programs.
  • ABSTRACT: Most data analytics applications are industry/domain specific, e.g., predicting patients at high risk of being admitted to intensive care unit in the healthcare sector or predicting malicious SMSs in the telecommunication sector. Existing solutions are based on "best practices", i.e., the systems' decisions are knowledge-driven and/or data-driven. However, there are rules and exceptional cases that can only be precisely formulated and identified by subject-matter experts (SMEs) who have accumulated many years of experience. This paper envisions a more intelligent database management system (DBMS) that captures such knowledge to effectively address the industry/domain specific applications. At the core, the system is a hybrid human-machine database engine where the machine interacts with the SMEs as part of a feedback loop to gather, infer, ascertain and enhance the database knowledge and processing. We discuss the challenges towards building such a system through examples in healthcare predictive analysis -- a popular area for big data analytics.
    ACM SIGKDD Explorations Newsletter 06/2014; 16(1). DOI:10.1145/2674026.2674032
  • Johannes Gehrke · Beng Chin Ooi · Evaggelia Pitoura
    ABSTRACT: The ten papers included in this special section were presented at the 28th International Conference on Data Engineering, held in Washington, DC, on April 1-5, 2012. All papers were revised and substantially extended over their conference versions and went through a rigorous review process to ensure the high quality standards of the IEEE Transactions on Knowledge and Data Engineering. They cover a broad range of topics highlighting the liveliness of the data engineering field.
    IEEE Transactions on Knowledge and Data Engineering 06/2014; 26(6):1298-1299. DOI:10.1109/TKDE.2014.2314529 · 2.07 Impact Factor
  • Hao Zhang · Bogdan Marius Tudor · Gang Chen · Beng Chin Ooi
    Proceedings of the VLDB Endowment 06/2014; 7(10):833-836. DOI:10.14778/2732951.2732956
  • Proceedings of the VLDB Endowment 04/2014; 7(8):649-660. DOI:10.14778/2732296.2732301
  • Sheng Wang · David Maier · Beng Chin Ooi
    Proceedings of the VLDB Endowment 03/2014; 7(7):529-540. DOI:10.14778/2732286.2732290
  • Feng Li · M. Tamer Ozsu · Gang Chen · Beng Chin Ooi
    ABSTRACT: It is widely recognized that OLTP and OLAP queries have different data access patterns, processing needs and requirements. Hence, the OLTP queries and OLAP queries are typically handled by two different systems, and the data are periodically extracted from the OLTP system, transformed and loaded into the OLAP system for data analysis. With growing awareness of the ability of big data to provide enterprises with useful insights from vast amounts of data, effective and timely decisions derived from real-time analytics are important. It is therefore desirable to provide real-time OLAP querying support, where OLAP queries read the latest data while OLTP queries create the new versions. In this paper, we propose R-Store, a scalable distributed system for supporting real-time OLAP by extending the MapReduce framework. We extend an open source distributed key/value system, HBase, as the underlying storage system that stores the data cube and real-time data. When real-time data are updated, they are streamed to a streaming MapReduce, namely Hstreaming, for updating the cube on an incremental basis. Based on the metadata stored in the storage system, either the data cube or OLTP database or both are used by the MapReduce jobs for OLAP queries. We propose techniques to efficiently scan the real-time data in the storage system, and design an adaptive algorithm to process the real-time query based on our proposed cost model. The main objectives are to ensure the freshness of answers and low processing latency. The experiments conducted on the TPC-H data set demonstrate the effectiveness and efficiency of our approach.
    2014 IEEE 30th International Conference on Data Engineering (ICDE); 03/2014
  • Article: epiC
    Dawei Jiang · Gang Chen · Beng Chin Ooi · Kian-Lee Tan · Sai Wu
  • Ju Fan · Meiyu Lu · Beng Chin Ooi · Wang-Chiew Tan · Meihui Zhang
    ABSTRACT: The Web is teeming with rich structured information in the form of HTML tables, which provides us with the opportunity to build a knowledge repository by integrating these tables. An essential problem of web data integration is to discover semantic correspondences between web table columns, and schema matching is a popular means to determine the semantic correspondences. However, conventional schema matching techniques are not always effective for web table matching due to the incompleteness in web tables. In this paper, we propose a two-pronged approach for web table matching that effectively addresses the above difficulties. First, we propose a concept-based approach that maps each column of a web table to the best concept, in a well-developed knowledge base, that represents it. This approach overcomes the problem that sometimes values of two web table columns may be disjoint, even though the columns are related, due to incompleteness in the column values. Second, we develop a hybrid machine-crowdsourcing framework that leverages human intelligence to discern the concepts for “difficult” columns. Our overall framework assigns the most “beneficial” column-to-concept matching tasks to the crowd under a given budget and utilizes the crowdsourcing result to help our algorithm infer the best matches for the rest of the columns. We validate the effectiveness of our framework through an extensive experimental study over two real-world web table data sets. The results show that our two-pronged approach outperforms existing schema matching techniques at only a low cost for crowdsourcing.
    2014 IEEE 30th International Conference on Data Engineering (ICDE); 03/2014

Publication Stats

7k Citations
119.62 Total Impact Points


  • 2015
    • University of California, Santa Barbara
      Santa Barbara, California, United States
  • 2014
    • University of Auckland
      • Department of Computer Science
      Auckland, New Zealand
    • National University (California)
      San Diego, California, United States
  • 2001–2014
    • National University of Singapore
      • Department of Computer Science
      Singapore
  • 2007
    • Aalborg University
      • Department of Computer Science
      Aalborg, Region North Jutland, Denmark
  • 2006
    • Singapore Management University
      • School of Information Systems
      Singapore
  • 2004–2005
    • Singapore-MIT Alliance
      Cambridge, Massachusetts, United States
    • University of Michigan
      Ann Arbor, Michigan, United States
    • AT&T Labs
      Austin, Texas, United States
  • 2003
    • Fudan University
      • School of Computer Science
      Shanghai, Shanghai Shi, China
  • 1999
    • University of Milan
      Milano, Lombardy, Italy
  • 1997
    • University of Wisconsin, Madison
      • Department of Computer Sciences
      Madison, WI, United States
  • 1989
    • Monash University (Australia)
      Melbourne, Victoria, Australia