Beng Chin Ooi

National University of Singapore, Tumasik, Singapore

Publications (272) · 87.08 Total Impact

  •
    ABSTRACT: Companies are increasingly moving their data processing to the cloud, for reasons of cost, scalability, and convenience, among others. However, hosting multiple applications and storage systems on the same cloud introduces resource sharing and heterogeneous data processing challenges due to the variety of resource usage patterns employed, the variety of data types stored, and the variety of query interfaces presented by those systems. Furthermore, real clouds are never perfectly symmetric - there often are differences between individual processors in their capabilities and connectivity. In this paper, we introduce a federation framework to manage such heterogeneous clouds. We then use this framework to discuss several challenges and their potential solutions.
    IEEE Transactions on Knowledge and Data Engineering 01/2014; 26(7):1670-1678. · 1.89 Impact Factor
  •
    ABSTRACT: Crowdsourcing has created a variety of opportunities for many challenging problems by leveraging human intelligence. For example, applications such as image tagging, natural language processing, and semantic-based information retrieval can exploit crowd-based human computation to supplement existing computational algorithms. Naturally, human workers in crowdsourcing solve problems based on their knowledge, experience, and perception. It is therefore not clear which problems are better solved by crowdsourcing than by traditional machine-based methods alone, and a cost-sensitive quantitative analysis method is needed. In this paper, we design and implement a cost-sensitive method for crowdsourcing. We estimate the profit of a crowdsourcing job online so that questions with no future profit from crowdsourcing can be terminated. Two models are proposed to estimate the profit of a crowdsourcing job, namely the linear value model and the generalized non-linear model. Using these models, the expected profit of obtaining new answers for a specific question is computed based on the answers already received. A question is terminated in real time if the marginal expected profit of obtaining more answers is not positive. We extend the method to publish a batch of questions in a HIT. We evaluate the effectiveness of our proposed method using two real-world jobs on AMT. The experimental results show that our proposed method outperforms all the state-of-the-art methods.
    Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data; 06/2013
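The termination rule above can be sketched in a few lines. This is a minimal illustration, not the paper's actual linear value model: the "optimistic confidence gain" estimator and both parameter names are assumptions for the sketch.

```python
from collections import Counter

def majority_confidence(answers):
    """Fraction of received answers that agree with the current majority answer."""
    if not answers:
        return 0.0
    return Counter(answers).most_common(1)[0][1] / len(answers)

def should_terminate(answers, value_per_confidence, cost_per_answer):
    """Terminate a question when the marginal expected profit of one more
    answer is not positive. The optimistic confidence-gain estimate below
    is an illustrative stand-in for the paper's profit models."""
    n = len(answers)
    if n == 0:
        return False                       # nothing received yet; keep asking
    majority = Counter(answers).most_common(1)[0][1]
    current = majority / n
    optimistic = (majority + 1) / (n + 1)  # best case: the next answer agrees
    marginal_value = value_per_confidence * (optimistic - current)
    return marginal_value <= cost_per_answer
```

With a near-unanimous answer set, an extra answer adds little confidence, so the marginal value quickly drops below the per-answer cost and the question is retired.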
  •
    ABSTRACT: The popularity of similarity search expanded with the increased interest in multimedia databases, bioinformatics, or social networks, and with the growing number of users trying to find information in huge collections of unstructured data. During the ...
    ACM SIGMOD Record 06/2013; 42(2):46-51. · 0.46 Impact Factor
  • Beng Chin Ooi
    ABSTRACT: We focus on measuring relationships between pairs of objects in Wikipedia whose pages can be regarded as individual objects. Two kinds of relationships between two objects exist: in Wikipedia, an explicit relationship is represented by a single link ...
    IEEE Transactions on Knowledge and Data Engineering 01/2013; 25(2):241-244. · 1.89 Impact Factor
  • Feng Li, Beng Chin Ooi, M. Tamer Özsu, Sai Wu
    ABSTRACT: MapReduce is a framework for processing and managing large scale data sets in a distributed cluster, which has been used for applications such as generating search indexes, document clustering, access log analysis, and various other forms of data analytics. MapReduce adopts a flexible computation model with a simple interface consisting of map and reduce functions whose implementations can be customized by application developers. Since its introduction, a substantial amount of research effort has been directed towards making it more usable and efficient for supporting database-centric operations. In this paper we aim to provide a comprehensive review of a wide range of proposals and systems that focus on supporting distributed data management and processing using the MapReduce framework.
    ACM Computing Surveys 01/2013; · 3.54 Impact Factor
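The map/reduce programming model the survey covers fits in a few lines of Python. This toy driver simulates the framework's shuffle phase in-process; it is a sketch of the model, not of any real MapReduce runtime.

```python
from collections import defaultdict

def map_fn(doc):
    """Map: emit (word, 1) for every word in a document."""
    for word in doc.split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Reduce: sum the counts collected for one word."""
    yield (word, sum(counts))

def run_mapreduce(docs, mapper, reducer):
    groups = defaultdict(list)            # the "shuffle": group values by key
    for doc in docs:
        for key, value in mapper(doc):
            groups[key].append(value)
    out = {}
    for key, values in groups.items():    # one reduce call per distinct key
        for k, v in reducer(key, values):
            out[k] = v
    return out
```

Calling `run_mapreduce(["a b a", "b"], map_fn, reduce_fn)` yields the word counts `{"a": 2, "b": 2}`; in a real cluster the map and reduce calls run in parallel on different nodes.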
  •
    ABSTRACT: Spatial web objects that possess both a geographical location and a textual description are gaining in prevalence. This gives prominence to spatial keyword queries that exploit both location and textual arguments. Such queries are used in many web services such as yellow pages and map services. We present SWORS, the Spatial Web Object Retrieval System, that is capable of efficiently retrieving spatial web objects that satisfy spatial keyword queries. Specifically, SWORS supports two types of queries: (a) the location-aware top-k text retrieval (LkT) query that retrieves k individual spatial web objects taking into account query location proximity and text relevancy; (b) the spatial keyword group (SKG) query that retrieves a group of objects that together cover the query keywords, are nearest to the query location, and have the shortest inter-object distances. SWORS provides browser-based interfaces for desktop and laptop computers and provides a client application for mobile devices. The interfaces and the client enable users to formulate queries and view the query results on a map. The server side stores the data and processes the queries. We use three real-life data sets to demonstrate the functionality and performance of SWORS.
    Proceedings of the VLDB Endowment. 08/2012; 5(12):1914-1917.
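The LkT query semantics can be conveyed with a brute-force scan. The linear scoring function and the `alpha` weight below are assumptions for illustration; SWORS itself answers such queries with index-based processing rather than scanning every object.

```python
import math

def lkt_query(objects, q_loc, q_keywords, k, alpha=0.5):
    """Location-aware top-k text retrieval: rank spatial web objects by a
    weighted combination of location proximity and text relevance.
    Illustrative scoring only; not the system's actual ranking function."""
    def distance(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    def relevance(text):
        words = text.lower().split()
        return sum(w.lower() in words for w in q_keywords) / len(q_keywords)

    max_d = max(distance(o["loc"], q_loc) for o in objects) or 1.0
    def score(o):
        proximity = 1 - distance(o["loc"], q_loc) / max_d  # 1 = at the query
        return alpha * proximity + (1 - alpha) * relevance(o["text"])

    return sorted(objects, key=score, reverse=True)[:k]
```

A nearby object matching all keywords outranks both a distant full match and a nearby non-match.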
  •
    ABSTRACT: Numerous applications such as financial transactions (e.g., stock trading) are write-heavy in nature. The shift from reads to writes in web applications has also been accelerating in recent years. Write-ahead logging is a common approach for providing recovery capability while improving performance in most storage systems. However, the separation of log and application data incurs write overhead that is especially pronounced in write-heavy environments, and hence adversely affects the write throughput and recovery time of the system. In this paper, we introduce LogBase - a scalable log-structured database system that adopts log-only storage to remove the write bottleneck and support fast system recovery. LogBase is designed to be dynamically deployed on commodity clusters to take advantage of the elastic scaling property of cloud environments. LogBase provides in-memory multiversion indexes to support efficient access to data maintained in the log. LogBase also supports transactions that bundle read and write operations spanning multiple records. We implemented the proposed system and compared it with HBase and a disk-based log-structured record-oriented system modeled after RAMCloud. The experimental results show that LogBase is able to provide sustained write throughput, efficient data access out of the cache, and effective system recovery.
    06/2012;
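The core log-only idea reduces to an append-only log plus an in-memory index over it. This single-node, single-version sketch is an assumption-laden simplification: LogBase itself maintains multiversion indexes over a distributed log.

```python
class LogOnlyStore:
    """Minimal sketch of log-only storage: every write is appended to one
    log, and an in-memory index maps each key to the offset of its latest
    version. Single-version and in-process only; illustrative, not LogBase."""
    def __init__(self):
        self.log = []      # append-only list of (key, value) records
        self.index = {}    # key -> offset of the latest version in the log

    def put(self, key, value):
        self.index[key] = len(self.log)
        self.log.append((key, value))

    def get(self, key):
        off = self.index.get(key)
        return None if off is None else self.log[off][1]

    def recover(self):
        """Rebuild the index by scanning the log; no separate write-ahead
        log is needed because the log itself is the data."""
        self.index = {k: i for i, (k, _) in enumerate(self.log)}
```

Because the log is the only copy of the data, a write costs exactly one append, and recovery is a single sequential scan.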
  •
    ABSTRACT: Some complex problems, such as image tagging and natural language processing, are very challenging for computers, where even state-of-the-art technology is not yet able to provide satisfactory accuracy. Therefore, rather than relying solely on developing new and better algorithms to handle such tasks, we look to the crowdsourcing solution -- employing human participation -- to make good the shortfall in current technology. Crowdsourcing is a good supplement to many computer tasks. A complex job may be divided into computer-oriented tasks and human-oriented tasks, which are then assigned to machines and humans respectively. To leverage the power of crowdsourcing, we design and implement a Crowdsourcing Data Analytics System, CDAS. CDAS is a framework designed to support the deployment of various crowdsourcing applications. The core part of CDAS is a quality-sensitive answering model, which guides the crowdsourcing engine to process and monitor the human tasks. In this paper, we introduce the principles of our quality-sensitive model. To satisfy the user's required accuracy, the model guides the crowdsourcing query engine in the design and processing of the corresponding crowdsourcing jobs. It provides an estimated accuracy for each generated result based on the human workers' historical performances. When verifying the quality of the result, the model employs an online strategy to reduce waiting time. To show the effectiveness of the model, we implement and deploy two analytics jobs on CDAS: a Twitter sentiment analytics job and an image tagging job. We use real Twitter and Flickr data as our queries respectively. We compare our approaches with state-of-the-art classification and image annotation techniques. The results show that the human-assisted methods can indeed achieve a much higher accuracy. By embedding the quality-sensitive model into the crowdsourcing query engine, we effectiv...[truncated].
    06/2012;
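The intuition of weighting answers by workers' historical performance can be sketched as accuracy-weighted voting. CDAS's actual quality-sensitive model is probabilistic; the weighting scheme and the unnormalized confidence below are illustrative assumptions.

```python
from collections import defaultdict

def weighted_vote(answers, worker_accuracy):
    """Aggregate crowd answers by weighting each vote with the worker's
    historical accuracy; unknown workers get a neutral 0.5 prior. Returns
    the winning answer and an illustrative confidence score."""
    scores = defaultdict(float)
    for worker, answer in answers:
        scores[answer] += worker_accuracy.get(worker, 0.5)
    best = max(scores, key=scores.get)
    confidence = scores[best] / sum(scores.values())
    return best, confidence
```

Two reliable workers outvote one unreliable dissenter, and the confidence value can feed a stopping rule that requests more answers while accuracy is below the user's requirement.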
  • Wei Lu, Yanyan Shen, Su Chen, Beng Chin Ooi
    ABSTRACT: The k nearest neighbor join (kNN join), designed to find, for every object in a dataset R, its k nearest neighbors in another dataset S, is a primitive operation widely adopted by many data mining applications. As a combination of the k nearest neighbor query and the join operation, kNN join is an expensive operation. Given the increasing volume of data, it is difficult to perform a kNN join efficiently on a centralized machine. In this paper, we investigate how to perform the kNN join using MapReduce, a well-accepted framework for data-intensive applications over clusters of computers. In brief, the mappers cluster objects into groups, and the reducers perform the kNN join on each group of objects separately. We design an effective mapping mechanism that exploits pruning rules for distance filtering, and hence reduces both the shuffling and computational costs. To further reduce the shuffling cost, we propose two approximate algorithms to minimize the number of replicas. Extensive experiments on our in-house cluster demonstrate that our proposed methods are efficient, robust and scalable.
    06/2012;
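The map-then-local-join structure can be shown with a toy partitioned kNN join. The naive partition function here is only correct when every R object's k nearest neighbors land in its own partition; guaranteeing correctness (via replication and distance-based pruning) is exactly the paper's contribution, which this sketch does not attempt.

```python
import math
from collections import defaultdict

def knn_join(R, S, k, partition):
    """MapReduce-style kNN join sketch: the "map" step assigns each object
    (a point tuple) to a group via `partition`; each "reduce" step joins
    one group locally by brute force."""
    groups_R, groups_S = defaultdict(list), defaultdict(list)
    for r in R:                               # map phase over R
        groups_R[partition(r)].append(r)
    for s in S:                               # map phase over S
        groups_S[partition(s)].append(s)
    result = {}
    for g, rs in groups_R.items():            # reduce phase, one group each
        for r in rs:
            result[r] = sorted(groups_S[g], key=lambda s: math.dist(r, s))[:k]
    return result
```

Each reducer touches only its own group, so shuffling cost is driven by how many copies of S objects the partitioning must ship, which is what the paper's replica-minimizing algorithms reduce.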
  •
    ABSTRACT: An increasing amount of trajectory data is being annotated with text descriptions to better capture the semantics associated with locations. The fusion of spatial locations and text descriptions in trajectories engenders a new type of top-k queries that take into account both aspects. Each trajectory in consideration consists of a sequence of geo-spatial locations associated with text descriptions. Given a user location λ and a keyword set ψ, a top-k query returns k trajectories whose text descriptions cover the keywords ψ and that have the shortest match distance. To the best of our knowledge, previous research on querying trajectory databases has focused on trajectory data without any text description, and no existing work has studied this kind of top-k query on trajectories. This paper proposes a novel method for efficiently computing top-k trajectories. The method is developed based on a new hybrid index, the cell-keyword conscious B+-tree, which enables us to exploit both text relevance and location proximity to facilitate efficient and effective query processing. The results of our extensive empirical studies with an implementation of the proposed algorithms on BerkeleyDB demonstrate that our proposed methods are capable of achieving excellent performance and good scalability.
    05/2012;
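The query semantics can be conveyed with a brute-force scan. The match-distance definition used here (sum over keywords of the distance from the query location to the nearest covering point) is an assumption for illustration; the paper's exact definition and its index-based evaluation differ.

```python
import math

def topk_trajectories(trajectories, q_loc, q_keywords, k):
    """Illustrative top-k spatial-textual trajectory query: keep only
    trajectories whose points collectively cover all query keywords, then
    rank by a simple per-keyword match distance. Not the paper's method."""
    def covering_dist(points, kw):
        ds = [math.dist(loc, q_loc) for loc, text in points
              if kw in text.lower().split()]
        return min(ds) if ds else None     # None: keyword not covered

    ranked = []
    for tid, points in trajectories.items():
        dists = [covering_dist(points, kw) for kw in q_keywords]
        if all(d is not None for d in dists):        # covers every keyword
            ranked.append((sum(dists), tid))
    return [tid for _, tid in sorted(ranked)[:k]]
```

A trajectory passing a nearby coffee shop and museum beats one covering the same keywords far from the user.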
  •
    ABSTRACT: Scalable database management systems (DBMSs) are a critical part of the cloud infrastructure and play an important role in ensuring the smooth transition of applications from classical enterprise infrastructures to next generation cloud infrastructures. Though scalable data management on distributed platforms has been a vision for more than three decades, and much research has focused on large scale data management in the traditional enterprise setting, cloud computing brings its own set of novel challenges that must be addressed to ensure the success of data management solutions in the inherently distributed cloud environment. This article presents an organised picture of the challenges faced by application developers and DBMS designers in developing and deploying internet scale applications. Our background study encompasses systems for supporting update heavy applications and focuses on providing an in-depth analysis of such systems. We crystallise the design choices made by some successful large scale database management systems, and analyse the application demands and access patterns and how scalable database management systems have evolved to meet such requirements.
    International Journal of Computational Science and Engineering. 03/2012; 7(1):2-16.
  • Beng Chin Ooi
    IEEE Transactions on Knowledge and Data Engineering 01/2012; 24:193-196. · 1.89 Impact Factor
  • Sai Wu, Vibhore Kumar, Kun-Lung Wu, Beng Chin Ooi
    ABSTRACT: We consider a distributed stream processing application, expressed as a data-flow graph with operators as vertices connected by streams and deployed over a cluster of compute nodes, where a small subset of the operators are often the performance bottlenecks for the entire application. In cases where a bottleneck operator is stateless, it is obvious that parallelization by splitting the incoming stream among multiple parallel operators deployed on different nodes can help improve performance. However, it is not so obvious when the bottleneck operator is stateful. In such a case, parallelization is much more challenging as it often requires a state sharing mechanism for the parallel operators. Moreover, it incurs additional overheads from the parallel operators' accesses to the shared state and from the required synchronization constructs. In this paper, we propose a parallelization framework for stateful stream processing operators. The framework not only addresses issues related to the system model and support for operator parallelization, but also delves into the theoretical details that model the suitability of parallelization and the optimal degree of parallelism. We have implemented and evaluated our framework in the context of IBM's System S distributed stream processing middleware. While microbenchmarks are used to validate the proposed theoretical model, a parallelized implementation of a moving KNN application is used for the purpose of evaluation.
    01/2012;
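One standard way to sidestep shared state, usable when the operator's state partitions cleanly by key, can be sketched as hash-routing the stream. This shows the easy case; when state cannot be partitioned by key, the shared-state and synchronization machinery the paper addresses becomes necessary. The counter operator here is a made-up example.

```python
class PartitionedCounter:
    """A stateful per-key counting operator parallelized by key
    partitioning: the stream is split by a hash of the key, so each
    parallel instance owns a disjoint slice of the state and needs no
    state sharing or synchronization."""
    def __init__(self, degree):
        self.degree = degree
        self.instances = [dict() for _ in range(degree)]  # per-instance state

    def process(self, key):
        state = self.instances[hash(key) % self.degree]   # route by key
        state[key] = state.get(key, 0) + 1
        return state[key]
```

Because routing is deterministic, every tuple for a given key reaches the same instance, preserving the semantics of the sequential operator.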
  •
    ABSTRACT: There has been a growing acceptance of the object-oriented data model as the basis of next generation database management systems (DBMSs). Both pure object-oriented DBMSs (OODBMSs) and object-relational DBMSs (ORDBMSs) have been developed based on object-oriented concepts. Object-relational DBMSs, in particular, extend the SQL language by incorporating all the concepts of the object-oriented data model. A large number of products in both categories of DBMS are available today. In particular, all major vendors of relational DBMSs are turning their products into ORDBMSs [Nori, 1996].
    07/2011: pages 1-38;
  •
    ABSTRACT: Text databases provide rapid access to collections of digital documents. Such databases have become ubiquitous: text search engines underlie the online text repositories accessible via the Web and are central to digital libraries and online corporate document management.
    07/2011: pages 151-183;
  •
    ABSTRACT: Cloud computing represents a paradigm shift driven by the increasing demand of Web-based applications for elastic, scalable and efficient system architectures that can efficiently support their ever-growing data volume and large-scale data analysis. A typical data management system has to deal with real-time updates by individual users as well as periodic large-scale analytical processing, indexing, and data extraction. While such operations may take place in the same domain, the design and development of systems for transactional and for periodic analytical processing have evolved independently. This system-level separation has resulted in problems such as data freshness issues as well as serious data storage redundancy. Ideally, it would be more efficient to apply ad-hoc analytical processing on the same data directly. However, to the best of our knowledge, such an approach has not been adopted in real implementations. Intrigued by this observation, we have designed and implemented epiC, an elastic power-aware data-intensive Cloud platform for supporting both data-intensive analytical operations (referred to as OLAP) and online transactions (referred to as OLTP). In this paper, we present ES2 - the elastic data storage system of epiC, which is designed to support both functionalities within the same storage. We present the system architecture and the functions of each system component, together with experimental results that demonstrate the efficiency of the system.
    Data Engineering (ICDE), 2011 IEEE 27th International Conference on; 05/2011
  •
    ABSTRACT: Conventional keyword search engines are restricted to a given data model and cannot easily adapt to unstructured, semi-structured or structured data. In this paper, we propose an efficient and adaptive keyword search method, called EASE, for indexing and querying large collections of heterogeneous data. To achieve high efficiency in processing keyword queries, we first model unstructured, semi-structured and structured data as graphs, and then summarize the graphs and construct graph indices instead of using traditional inverted indices. We propose an extended inverted index to facilitate keyword-based search, and present a novel ranking mechanism for enhancing search effectiveness. We have conducted an extensive experimental study using real datasets, and the results show that EASE achieves both high search efficiency and high accuracy, and outperforms the existing approaches significantly.
    Inf. Syst. 01/2011; 36:248-266.
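The inverted-index foundation that EASE extends can be shown in miniature. This is the plain term-to-ids structure only; EASE's graph summarization, extended index, and ranking mechanism are beyond this sketch.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Plain inverted index mapping each term to the set of item ids that
    contain it; `docs` is {id: text}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index[term].add(doc_id)
    return index

def keyword_search(index, keywords):
    """Return ids of items containing every query keyword (AND semantics)."""
    sets = [index.get(kw.lower(), set()) for kw in keywords]
    return set.intersection(*sets) if sets else set()
```

Intersecting posting sets answers multi-keyword queries without scanning the data, whatever data model the items came from, which is the property EASE exploits for heterogeneous data.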
  • Conference Paper: ES2
    Proceedings of the 27th International Conference on Data Engineering, ICDE 2011, April 11-16, 2011, Hannover, Germany; 01/2011
  • Beng Chin Ooi
    IEEE Transactions on Knowledge and Data Engineering 01/2011; 23:1-4. · 1.89 Impact Factor
  •
    ABSTRACT: Online travel services and resources are far from well organized and integrated. Trip planning is still a laborious job requiring interaction with a combination of services such as travel guides, personal travel blogs, map services and public transportation to piece together an itinerary. To facilitate this process, we have designed a cross-service travel engine for trip planners. Our system seamlessly and semantically integrates various types of travel services and resources based on a geographical ontology. We also built a user-friendly visualization tool for travellers to conveniently browse and design personal itineraries on Google Maps.
    Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2011, Athens, Greece, June 12-16, 2011; 01/2011

Publication Stats

5k Citations
87.08 Total Impact Points

Institutions

  • 1991–2013
    • National University of Singapore
      • Department of Computer Science
      Tumasik, Singapore
  • 2009
    • Harbin Institute of Technology
      • School of Computer Science and Technology
      Harbin, Heilongjiang, China
  • 2003–2008
    • Fudan University
      • School of Computer Science
      Shanghai, China
    • The Hong Kong University of Science and Technology
      • Department of Computer Science and Engineering
      Kowloon, Hong Kong
  • 2006–2007
    • Aalborg University
      • Department of Computer Science
      Aalborg, Region North Jutland, Denmark
    • University of Michigan
      Ann Arbor, Michigan, United States
    • University of Queensland 
      • School of Information Technology and Electrical Engineering
      Brisbane, Queensland, Australia
  • 2004
    • AT&T Labs
      Austin, Texas, United States
    • Concordia University–Ann Arbor
      Ann Arbor, Michigan, United States
  • 1999
    • University of Milan
      Milano, Lombardy, Italy
  • 1997
    • University of Wisconsin, Madison
      • Department of Computer Sciences
      Madison, Wisconsin, United States