Beng Chin Ooi

University of California, Santa Barbara, Santa Barbara, California, United States

Publications (303) · 127.66 Total Impact Points

  • ACM SIGMOD Record 08/2015; 44(2):35-40. DOI:10.1145/2814710.2814717 · 1.05 Impact Factor
  • Ju Fan · Meihui Zhang · Stanley Kok · Meiyu Lu · Beng Chin Ooi
    ABSTRACT: We study the query optimization problem in declarative crowdsourcing systems. Declarative crowdsourcing is designed to hide the complexities of dealing with the crowd and relieve the user of that burden. The user is only required to submit an SQL-like query, and the system takes responsibility for compiling the query, generating the execution plan, and evaluating it in the crowdsourcing marketplace. A given query can have many alternative execution plans, and the difference in crowdsourcing cost between the best and the worst plans may be several orders of magnitude. Therefore, as in relational database systems, query optimization is important to crowdsourcing systems that provide declarative query interfaces. In this paper, we propose CrowdOp, a cost-based query optimization approach for declarative crowdsourcing systems. CrowdOp considers both cost and latency in its optimization objectives and generates query plans that provide a good balance between the two. We develop efficient algorithms in CrowdOp for optimizing three types of queries: selection queries, join queries, and complex selection-join queries. We validate our approach via extensive experiments, both in simulation and with the real crowd on Amazon Mechanical Turk.
    IEEE Transactions on Knowledge and Data Engineering 08/2015; 27(8):1-1. DOI:10.1109/TKDE.2015.2407353 · 2.07 Impact Factor
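The abstract above leaves the plan-selection step abstract. Below is a minimal, hypothetical sketch (not CrowdOp's actual optimizer) of how a declarative crowdsourcing optimizer could weigh monetary cost against latency when picking among alternative plans; the plan names and cost figures are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Plan:
    """A candidate crowdsourcing execution plan (hypothetical model)."""
    name: str
    cost: float      # estimated monetary cost, e.g., HIT price times number of tasks
    latency: float   # estimated completion time in hours

def choose_plan(plans, latency_budget):
    """Pick the cheapest plan whose estimated latency fits the budget;
    fall back to the lowest-latency plan if none qualifies."""
    feasible = [p for p in plans if p.latency <= latency_budget]
    if feasible:
        return min(feasible, key=lambda p: p.cost)
    return min(plans, key=lambda p: p.latency)

# Example: two orderings of a selection-join query can differ widely in crowd cost.
plans = [
    Plan("filter-then-join", cost=12.0, latency=3.0),
    Plan("join-then-filter", cost=540.0, latency=1.5),
]
print(choose_plan(plans, latency_budget=4.0).name)  # filter-then-join
```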
  • Dawei Jiang · Sai Wu · Gang Chen · Beng Chin Ooi · Kian-Lee Tan · Jun Xu
    The VLDB Journal 07/2015; DOI:10.1007/s00778-015-0393-2 · 1.57 Impact Factor
  • The VLDB Journal 07/2015; DOI:10.1007/s00778-015-0391-4 · 1.57 Impact Factor
  • Source
    Hao Zhang · Gang Chen · Beng Chin Ooi · Kian-Lee Tan · Meihui Zhang
    ABSTRACT: Growing main memory capacity has fueled the development of in-memory big data management and processing. By eliminating the disk I/O bottleneck, it is now possible to support interactive data analytics. However, in-memory systems are much more sensitive to other sources of overhead that do not matter in traditional I/O-bounded disk-based systems. Some issues, such as fault tolerance and consistency, are also more challenging to handle in an in-memory environment. We are witnessing a revolution in the design of database systems that exploit main memory as their data storage layer. Much of this research has focused on several dimensions: modern CPU and memory hierarchy utilization, time/space efficiency, parallelism, and concurrency control. In this survey, we aim to provide a thorough review of a wide range of in-memory data management and processing proposals and systems, including both data storage systems and data processing frameworks. We also give a comprehensive presentation of important technologies in memory management and some key factors that need to be considered in order to achieve efficient in-memory data management and processing.
    IEEE Transactions on Knowledge and Data Engineering 07/2015; 27(7):1-1. DOI:10.1109/TKDE.2015.2427795 · 2.07 Impact Factor
  • Source
    ABSTRACT: Multicore CPUs and large memories are increasingly becoming the norm in modern computer systems. However, current database management systems (DBMSs) are generally ineffective in exploiting the parallelism of such systems. In particular, contention can lead to a dramatic fall in performance. In this paper, we propose a new concurrency control protocol called DGCC (Dependency Graph based Concurrency Control) that separates concurrency control from execution. DGCC builds dependency graphs for batched transactions before executing them. Using these graphs, contentions within the same batch of transactions are resolved before execution. As a result, the execution of the transactions does not need to deal with contention, while maintaining full equivalence to serialized execution. This better exploits multicore hardware and achieves a higher level of parallelism. To facilitate DGCC, we also propose a system architecture that removes certain centralized control components, yielding better scalability, and supports a more efficient recovery mechanism. Our extensive experimental study shows that DGCC achieves up to four times the throughput of state-of-the-art concurrency control protocols for high-contention workloads.
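To make the "resolve contention before execution" idea concrete, here is a small sketch, under assumed read/write-set semantics, of building a dependency graph over a batch of transactions and deriving a conflict-respecting execution order. It illustrates the general technique only, not the paper's DGCC protocol.

```python
from collections import defaultdict
from graphlib import TopologicalSorter  # Python 3.9+

def build_dependency_graph(batch):
    """batch: list of (txn_id, read_set, write_set).
    Add a dependency j -> i when an earlier transaction i conflicts with a later
    transaction j (write-write, write-read, or read-write on the same key)."""
    deps = defaultdict(set)
    for i, (tid_i, r_i, w_i) in enumerate(batch):
        deps.setdefault(tid_i, set())   # make sure independent txns appear too
        for tid_j, r_j, w_j in batch[i + 1:]:
            if w_i & w_j or w_i & r_j or r_i & w_j:
                deps[tid_j].add(tid_i)  # tid_j must wait for tid_i
    return deps

batch = [
    ("t1", {"x"}, {"y"}),
    ("t2", {"y"}, {"z"}),   # reads y written by t1, so it depends on t1
    ("t3", {"a"}, {"b"}),   # independent, can run in parallel
]
order = list(TopologicalSorter(build_dependency_graph(batch)).static_order())
print(order)  # e.g. ['t1', 't3', 't2']
```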
  • Source
    Chang Yao · Divyakant Agrawal · Gang Chen · Beng Chin Ooi · Sai Wu
    ABSTRACT: By maintaining the data in main memory, in-memory databases dramatically reduce the I/O cost of transaction processing. However, for recovery purposes, those systems still need to flush the logs to disk, generating a significant number of I/Os. A new type of log, the command log, is being employed to replace the traditional data log (e.g., ARIES log). A command log only tracks the transactions being executed, thereby effectively reducing the size of the log and improving performance. Command logging, on the other hand, increases the cost of recovery, because all the transactions in the log after the last checkpoint must be completely redone when there is a failure. For distributed database systems with many processing nodes, failures cannot be treated as exceptions, and as such, the long recovery time incurred by command logging may compromise the objective of providing efficient support for OLTP. In this paper, we first extend command logging to a distributed system, where all the nodes can perform their recovery in parallel. We show that the synchronization cost caused by dependencies is the bottleneck for command logging in a distributed system, and consequently propose an adaptive logging approach that combines data logging and command logging. The intuition is to use data logging to break the dependencies, while applying command logging to most transactions to reduce I/O costs. The ratio of data logging to command logging becomes an optimization parameter that trades transaction-processing performance against recovery time to suit different OLTP applications. Our experimental study compares the performance of our proposed adaptive logging, ARIES-style data logging and command logging on top of H-Store. The results show that adaptive logging can achieve a 10x boost in recovery speed and a transaction throughput that is comparable to that of command logging.
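The trade-off described above can be illustrated with a toy per-transaction policy. The rule used here (data-log transactions that span multiple nodes, command-log the rest) is a hypothetical stand-in for the paper's optimization, not its actual algorithm.

```python
def choose_log_record(txn):
    """txn: dict with 'id', 'params', 'touched_nodes', and 'writes'.
    Hypothetical policy: data-log transactions that span multiple nodes
    (they would otherwise force synchronization during parallel replay);
    command-log everything else to keep the log small."""
    if len(txn["touched_nodes"]) > 1:
        # Data (ARIES-style) record: after-images of the writes, replayable independently.
        return {"type": "data", "txn": txn["id"], "writes": txn["writes"]}
    # Command record: just the stored-procedure name and its parameters.
    return {"type": "command", "txn": txn["id"], "params": txn["params"]}

t = {"id": 7, "params": ("pay", 42, 99), "touched_nodes": {1, 3}, "writes": {"acct42": 10}}
print(choose_log_record(t)["type"])  # data
```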
  • Proceedings of the VLDB Endowment 02/2015; 8(7):762-773. DOI:10.14778/2752939.2752945
  • Proceedings of the VLDB Endowment 12/2014; 8(4):437-448. DOI:10.14778/2735496.2735506
  • ABSTRACT: Edit distance is widely used for measuring the similarity between two strings. As a primitive operation, edit-distance-based string similarity search finds strings in a collection that are similar to a given query string under edit distance. Existing approaches for answering such string similarity queries follow the filter-and-verify framework using various indexes. Typically, most approaches assume that indexes and data sets are maintained in main memory. To overcome this limitation, in this paper we propose B+-tree based approaches to answer edit-distance-based string similarity queries, and hence our approaches can be easily integrated into existing RDBMSs. In general, we answer string similarity search using pruning techniques from metric spaces, since edit distance is a metric. First, we split the string collection into partitions according to a set of reference strings. Then, we index the strings in all partitions using a single B+-tree based on the distances of these strings to their corresponding reference strings. Finally, we propose two approaches to efficiently answer range and KNN queries, respectively, based on the B+-tree. We prove that the optimal partitioning of the data set is an NP-hard problem, and therefore propose a heuristic approach for selecting the reference strings greedily, as well as an optimal partition assignment strategy to minimize the expected number of strings that need to be verified during query evaluation. Through extensive experiments over a variety of real data sets, we demonstrate that our B+-tree based approaches provide superior performance over state-of-the-art techniques on both range and KNN queries in most cases.
    IEEE Transactions on Knowledge and Data Engineering 12/2014; 26(12):2983-2996. DOI:10.1109/TKDE.2014.2309131 · 2.07 Impact Factor
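The metric-space pruning that underlies this approach rests on the triangle inequality of edit distance. The sketch below illustrates it with a single hypothetical reference string and a sorted list standing in for the B+-tree; it omits the paper's partitioning and KNN machinery.

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def range_query(strings, reference, query, tau):
    """Index each string by its distance to a reference string; by the triangle
    inequality, s can match q only if |ed(q, ref) - ed(s, ref)| <= tau."""
    index = sorted((edit_distance(s, reference), s) for s in strings)  # stand-in for the B+-tree
    dq = edit_distance(query, reference)
    candidates = [s for key, s in index if abs(dq - key) <= tau]  # range scan on the keys
    return [s for s in candidates if edit_distance(query, s) <= tau]  # verify survivors

strings = ["kitten", "sitting", "mitten", "flask"]
print(range_query(strings, reference="mitten", query="kitten", tau=1))  # ['mitten', 'kitten']
```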
  • Article: ScalaGiST
    Peng Lu · Gang Chen · Beng Chin Ooi · Hoang Tam Vo · Sai Wu
  • Source
    Ashraf Aboulnaga · Beng Chin Ooi · Patrick Valduriez
    The VLDB Journal 09/2014; 23(6). DOI:10.1007/s00778-014-0371-0 · 1.57 Impact Factor
  • Source
    ABSTRACT: Companies are increasingly moving their data processing to the cloud, for reasons of cost, scalability, and convenience, among others. However, hosting multiple applications and storage systems on the same cloud introduces resource-sharing and heterogeneous data processing challenges, due to the variety of resource usage patterns employed, the variety of data types stored, and the variety of query interfaces presented by those systems. Furthermore, real clouds are never perfectly symmetric: there are often differences between individual processors in their capabilities and connectivity. In this paper, we introduce a federation framework to manage such heterogeneous clouds. We then use this framework to discuss several challenges and their potential solutions.
    IEEE Transactions on Knowledge and Data Engineering 07/2014; 26(7):1670-1678. DOI:10.1109/TKDE.2014.2326659 · 2.07 Impact Factor
  • ABSTRACT: The need to locate the k nearest data points with respect to a given query point in a multi- and high-dimensional space is common in many applications. Therefore, it is essential to provide efficient support for such a search. Locality Sensitive Hashing (LSH) has been widely accepted as an effective hash method for high-dimensional similarity search. However, data sets are typically not distributed uniformly over the space, and as a result, the buckets of LSH are unbalanced, causing the performance of LSH to degrade. In this paper, we propose a new and efficient method called Data Sensitive Hashing (DSH) to address this drawback. DSH improves the hashing functions and hashing family, and is orthogonal to most of the recent state-of-the-art approaches, which mainly focus on indexing and querying strategies. DSH leverages data distributions and is capable of directly preserving the nearest-neighbor relations. We show the theoretical guarantee of DSH, and demonstrate its efficiency experimentally.
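For context, the data-oblivious baseline the abstract contrasts with is classic p-stable (random-projection) LSH. The sketch below shows that baseline, not DSH itself (whose learned, data-aware hash functions are not specified here), and makes the bucket-imbalance problem observable.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_lsh_family(dim, num_hashes, w=4.0):
    """Classic p-stable (Gaussian) projection LSH for Euclidean distance:
    h(v) = floor((a . v + b) / w). Data-oblivious, which is exactly the
    weakness DSH targets: skewed data yields unbalanced buckets."""
    A = rng.normal(size=(num_hashes, dim))
    b = rng.uniform(0, w, size=num_hashes)
    return lambda v: tuple(np.floor((A @ v + b) / w).astype(int))

points = rng.normal(size=(1000, 16))
h = make_lsh_family(dim=16, num_hashes=6)
buckets = {}
for p in points:
    buckets.setdefault(h(p), []).append(p)
sizes = sorted(len(v) for v in buckets.values())
print(len(buckets), sizes[-1])  # bucket count and the largest (possibly skewed) bucket
```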
  • Xiaogang Shi · Bin Cui · Gillian Dobbie · Beng Chin Ooi
    ABSTRACT: It is important to provide efficient execution for ad-hoc data processing programs. In contrast to constructing complex declarative queries, many users prefer to write their programs using procedural code with simple queries. As many users are not expert programmers, their programs usually exhibit poor performance in practice, and it is a challenge to automatically optimize these programs and execute them efficiently. In this paper, we present UniAD, a system designed to simplify the programming of data processing tasks and provide efficient execution for user programs. We propose a novel intermediate representation named UniQL, which utilizes HOQs to describe the operations performed in programs. By combining both procedural and declarative logic, we can perform various optimizations across the boundary between procedural and declarative code. We describe the optimizations and conduct extensive empirical studies using UniAD. The experimental results on four benchmarks demonstrate that our techniques can significantly improve the performance of a wide range of data processing programs.
  • Source
    ABSTRACT: Most data analytics applications are industry/domain specific, e.g., predicting patients at high risk of being admitted to an intensive care unit in the healthcare sector, or predicting malicious SMSs in the telecommunication sector. Existing solutions are based on "best practices", i.e., the systems' decisions are knowledge-driven and/or data-driven. However, there are rules and exceptional cases that can only be precisely formulated and identified by subject-matter experts (SMEs) who have accumulated many years of experience. This paper envisions a more intelligent database management system (DBMS) that captures such knowledge to effectively address industry/domain specific applications. At the core, the system is a hybrid human-machine database engine in which the machine interacts with the SMEs as part of a feedback loop to gather, infer, ascertain and enhance the database knowledge and processing. We discuss the challenges of building such a system through examples in healthcare predictive analysis, a popular area for big data analytics.
    ACM SIGKDD Explorations Newsletter 06/2014; 16(1). DOI:10.1145/2674026.2674032
  • Source
    Johannes Gehrke · Beng Chin Ooi · Evaggelia Pitoura
    ABSTRACT: The ten papers included in this special section were presented at the 28th International Conference on Data Engineering, held in Washington, DC, on April 1-5, 2012. All papers were revised and substantially extended over their conference versions, and went through a rigorous review process to ensure the high quality standards of the IEEE Transactions on Knowledge and Data Engineering. They cover a broad range of topics, highlighting the liveliness of the data engineering field.
    IEEE Transactions on Knowledge and Data Engineering 06/2014; 26(6):1298-1299. DOI:10.1109/TKDE.2014.2314529 · 2.07 Impact Factor
  • Hao Zhang · Bogdan Marius Tudor · Gang Chen · Beng Chin Ooi
    Proceedings of the VLDB Endowment 06/2014; 7(10):833-836. DOI:10.14778/2732951.2732956
  • ABSTRACT: Multi-modal retrieval is emerging as a new search paradigm that enables seamless information retrieval from various types of media. For example, users can simply snap a movie poster to search for relevant reviews and trailers. To solve the problem, a set of mapping functions is learned to project high-dimensional features extracted from data of different media types into a common low-dimensional space so that metric distance measures can be applied. In this paper, we propose an effective mapping mechanism based on deep learning (i.e., stacked auto-encoders) for multi-modal retrieval. Mapping functions are learned by optimizing a new objective function, which effectively captures both intra-modal and inter-modal semantic relationships of data from heterogeneous sources. Compared with previous works, which require a substantial amount of prior knowledge such as similarity matrices of intra-modal data and ranking examples, our method requires little prior knowledge. Given a large training dataset, we split it into mini-batches and continually adjust the mapping functions for each batch of input. Hence, our method is memory-efficient with respect to the data volume. Experiments on three real datasets illustrate that our proposed method achieves significant improvement in search accuracy over the state-of-the-art methods.
    Proceedings of the VLDB Endowment 04/2014; 7(8):649-660. DOI:10.14778/2732296.2732301
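The combination of intra-modal and inter-modal terms in the objective can be illustrated on a toy mini-batch. The linear projections below are hypothetical stand-ins for the paper's stacked auto-encoders, and the loss terms are illustrative rather than the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(1)
d_img, d_txt, d_common = 64, 32, 10

# Hypothetical linear stand-ins for the stacked auto-encoder mappings.
W_img = rng.normal(scale=0.1, size=(d_img, d_common))
W_txt = rng.normal(scale=0.1, size=(d_txt, d_common))

def batch_loss(X_img, X_txt, alpha=1.0):
    """Toy objective over one mini-batch of paired image/text features:
    the inter-modal term pulls the latent codes of each pair together, and
    the intra-modal term (a simple reconstruction via the transpose here)
    preserves the structure of each modality."""
    Z_img, Z_txt = X_img @ W_img, X_txt @ W_txt
    inter = np.mean(np.sum((Z_img - Z_txt) ** 2, axis=1))
    intra = (np.mean(np.sum((Z_img @ W_img.T - X_img) ** 2, axis=1)) +
             np.mean(np.sum((Z_txt @ W_txt.T - X_txt) ** 2, axis=1)))
    return inter + alpha * intra

X_img = rng.normal(size=(8, d_img))   # one mini-batch of paired features
X_txt = rng.normal(size=(8, d_txt))
print(round(batch_loss(X_img, X_txt), 3))
```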
  • Sheng Wang · David Maier · Beng Chin Ooi
    ABSTRACT: Huge amounts of data are being generated by sensing devices every day, recording the status of objects and the environment. Such observational data is widely used in scientific research. As the capabilities of sensors keep improving, the data produced are drastically expanding in precision and quantity, making it a write-intensive domain. Log-structured storage is capable of providing high write throughput, and hence is a natural choice for managing large-scale observational data. In this paper, we propose an approach to indexing and querying observational data in log-structured storage. Based on key traits of observational data, we design a novel index approach called the CR-index (Continuous Range Index), which provides fast query performance without compromising write throughput. It is a lightweight structure that is fast to construct and often small enough to reside in RAM. Our experimental results show that the CR-index is superior in handling observational data compared to other indexing techniques. While our focus is scientific data, we believe our index will be effective for other applications with similar properties, such as process monitoring in manufacturing.
    Proceedings of the VLDB Endowment 03/2014; 7(7):529-540. DOI:10.14778/2732286.2732290
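The abstract describes the CR-index only at a high level. The block-range sketch below is a hypothetical approximation in that spirit (per-block [min, max] summaries over append-only data), not the paper's actual structure.

```python
class BlockRangeIndex:
    """Hypothetical lightweight index over append-only blocks of observations:
    store only (min, max) per block; a range query scans just the overlapping blocks."""
    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = []   # lists of values, standing in for log segments
        self.ranges = []   # (min, max) per block, small enough to keep in RAM

    def append(self, value):
        if not self.blocks or len(self.blocks[-1]) == self.block_size:
            self.blocks.append([])
            self.ranges.append((value, value))
        self.blocks[-1].append(value)
        lo, hi = self.ranges[-1]
        self.ranges[-1] = (min(lo, value), max(hi, value))

    def range_query(self, lo, hi):
        hits = []
        for (blo, bhi), block in zip(self.ranges, self.blocks):
            if bhi >= lo and blo <= hi:          # block may contain matches
                hits.extend(v for v in block if lo <= v <= hi)
        return hits

idx = BlockRangeIndex()
for v in [21.0, 21.1, 21.2, 21.1, 35.0, 34.9, 35.2, 36.0]:
    idx.append(v)
print(idx.range_query(21.0, 21.5))  # the second block never overlaps and is skipped
```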

Publication Stats

7k Citations
127.66 Total Impact Points


  • 2015
    • University of California, Santa Barbara
      Santa Barbara, California, United States
  • 2014
    • University of Auckland
      • Department of Computer Science
      Auckland, Auckland, New Zealand
    • National University (California)
      San Diego, California, United States
  • 2001–2014
    • National University of Singapore
      • Department of Computer Science
      Singapore
  • 2007
    • Aalborg University
      • Department of Computer Science
      Aalborg, Region North Jutland, Denmark
  • 2006
    • Singapore Management University
      • School of Information Systems
      Singapore
  • 2004–2005
    • Singapore-MIT Alliance
      Cambridge, Massachusetts, United States
    • University of Michigan
      • Department of Electrical Engineering and Computer Science (EECS)
      Ann Arbor, Michigan, United States
    • AT&T Labs
      Austin, Texas, United States
  • 2003
    • Fudan University
      • School of Computer Science
      Shanghai, Shanghai Shi, China
  • 1999
    • University of Milan
      Milan, Lombardy, Italy
  • 1997
    • University of Wisconsin, Madison
      • Department of Computer Sciences
      Madison, Wisconsin, United States
  • 1989
    • Monash University (Australia)
      Melbourne, Victoria, Australia