Conference Paper

RanKloud: A scalable ranked query processing framework on Hadoop

DOI: 10.1145/1951365.1951444
Conference: EDBT 2011, 14th International Conference on Extending Database Technology, Uppsala, Sweden, March 21-24, 2011, Proceedings
Source: DBLP


The popularity of batch-oriented cluster architectures like Hadoop is on the rise. These batch-based systems achieve high degrees of scalability by carefully allocating resources and leveraging opportunities to parallelize basic processing tasks. However, they are known to fall short in certain application domains, such as large-scale media analysis. In these applications, the utility of a given data element plays a vital role in a particular analysis task, and this utility most often depends on the way the data is collected or interpreted. Yet existing batch data processing frameworks do not consider data utility when allocating resources, and hence fail to optimize for ranked/top-k query processing, in which the user is interested in obtaining only a relatively small subset of the best result instances. A naïve implementation of these operations on an existing system would need to enumerate many more candidates than needed before it can filter out the k best results. Such waste can be avoided by utility-aware task partitioning and resource allocation strategies that prune unpromising objects from consideration. In this demonstration, we introduce RanKloud, an efficient and scalable utility-aware parallel processing system built for the analysis of large media datasets. RanKloud extends Hadoop's MapReduce paradigm to support ranked query operations such as k-nearest neighbor and k-closest pair search, skylines, skyline-joins, and top-k join processing.
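The RanKloud implementation itself is not included on this page; the sketch below is a minimal, single-machine Python illustration of the pruning idea the abstract describes (the names score, naive_topk_join, and pruned_topk_join are hypothetical, not RanKloud APIs). Rather than materializing every join candidate and then selecting the k best, each partition forwards only its local top-k, so the final merge step sees at most k candidates per partition.

```python
import heapq
import random
from collections import defaultdict

def score(a, b):
    """Hypothetical utility of a joined pair (higher is better)."""
    return a["quality"] * b["quality"]

def naive_topk_join(left, right, k):
    """Enumerate every matching pair, then keep the k best (wasteful)."""
    pairs = [(score(a, b), a["id"], b["id"])
             for a in left for b in right if a["key"] == b["key"]]
    return heapq.nlargest(k, pairs)

def pruned_topk_join(left, right, k, num_partitions=4):
    """Combiner-style sketch: each partition emits only its local top-k,
    so the final merge sees at most k * num_partitions candidates."""
    # "Map": hash-partition both inputs on the join key.
    parts = defaultdict(lambda: ([], []))
    for a in left:
        parts[hash(a["key"]) % num_partitions][0].append(a)
    for b in right:
        parts[hash(b["key"]) % num_partitions][1].append(b)

    # "Combine": local join plus local top-k within each partition.
    local_best = []
    for ls, rs in parts.values():
        pairs = [(score(a, b), a["id"], b["id"])
                 for a in ls for b in rs if a["key"] == b["key"]]
        local_best.extend(heapq.nlargest(k, pairs))

    # "Reduce": merge the small per-partition lists into the global top-k.
    return heapq.nlargest(k, local_best)

if __name__ == "__main__":
    random.seed(0)
    left = [{"id": i, "key": i % 10, "quality": random.random()} for i in range(200)]
    right = [{"id": i, "key": i % 10, "quality": random.random()} for i in range(200)]
    assert pruned_topk_join(left, right, 5) == naive_topk_join(left, right, 5)
    print(naive_topk_join(left, right, 5))
```

The pruned variant remains exact because any pair in the global top-k is necessarily among the top-k of the single partition that produces it; on an actual Hadoop cluster the same local pruning would typically run in a combiner before the shuffle.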

Available from: K. Selcuk Candan, Aug 25, 2014
  • ABSTRACT: Today, multimedia data are produced in massive quantities, thanks to a diverse spectrum of applications including entertainment, surveillance, e-commerce, web, and social media. In particular, social media data have three challenging characteristics: data sizes are enormous, data are often multi-faceted, and data are dynamic. Tensors (multi-dimensional arrays) are widely used for representing such high-order data. Consequently, a system dealing with social media data needs to scale with the tensor volume and with the number and diversity of the data facets. This necessitates highly parallelizable, and in many cases cloud-based, frameworks for scalable processing and efficient analysis of large media and social media collections. Most multimedia applications share a few core operations, including integration/fusion, classification, clustering, graph analysis, near-neighbor search, and similarity search. When performed naively, however, these core operations are often very costly, because the number of objects and object features that need to be considered can be prohibitive. Avoiding this cost requires eliminating redundant work. Thus, for the next generation of cloud-based massive media processing and analysis systems to have transformative impact, the fundamental principles that govern their design must include an awareness of the utility of data and features to a particular analysis task. Recently, the observation that a significant class of data processing applications (though not all) can be expressed in terms of a small set of primitives that are, in many cases, easy to parallelize has led to frameworks such as MapReduce, which have been applied successfully in data processing, mining, and information retrieval domains. Yet, in many other domains (including many aggregation and join tasks that are hard to parallelize) they significantly lag behind traditional solutions. In particular, many multimedia and social media analysis tasks fall into the category of applications that pose significant challenges. In this talk, I will present an overview of recent developments in the area of scalable multimedia and social media retrieval and analysis in the cloud, as well as our own efforts [1, 2, 3, 4, 5, 6] to build a scalable data processing middleware, called RanKloud, specifically sensitive to the needs and requirements of multimedia and social media analysis applications. RanKloud avoids waste by intelligently partitioning the data and allocating it across available resources to minimize data replication and indexing overheads and to prune superfluous low-utility processing. It also includes a tensor-based relational data model to support the complete lifecycle (from collection to analysis) of the data, involving various integration and other manipulation steps. RanKloud also addresses the computational cost of various multi-dimensional data analysis operations, including decomposition and structural change detection, by (a) leveraging a priori background knowledge (or metadata) about one or more domain dimensions and (b) extending compressed sensing (CS) to tensor data to encode the observed tensor streams in the form of compact descriptors. RanKloud will extend the scope of cloud-based systems to the delivery of efficient and large-scale analysis over data with variable utility and will thus enable new and efficient applications, tools, and systems for multimedia and social media retrieval and analysis.
  • ABSTRACT: Fast skyline selection of high-quality web services is of critical importance for upgrading e-commerce and various cloud applications. In this paper, we present a new MapReduce skyline method for scalable parallel skyline query processing. Our new angular partitioning of the data space reduces the processing time for selecting optimal skyline services. Our method shortens the Reduce time significantly by eliminating redundant dominance computations. In Hadoop experiments on large server clusters, our method scales well with increases in both attribute dimensionality and data-space cardinality. We define a new performance metric to assess the local optimality of the selected skyline services. Experimenting on over 10,000 real-life web service applications with 10 performance attribute dimensions, we find that the angular-partitioned MapReduce method is 1.7 and 2.3 times faster than the dimensional and grid partitioning methods, respectively, with a higher probability of reaching local optimality. These results are encouraging for selecting optimal web services in real time out of a large number of candidates.
    2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW); 01/2012. (A minimal sketch of the angle-based partitioning idea follows below.)
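Neither the authors' MapReduce code nor their cluster setup is reproduced here; the following is a minimal, single-machine Python sketch of the angle-based partitioning idea for 2-D skyline queries, assuming smaller attribute values are better (dominates, skyline, and angular_partition_skyline are illustrative names, not the paper's API). Points are "mapped" to wedges of the positive quadrant by their angle, each wedge is "reduced" to its local skyline, and the small local skylines are then merged.

```python
import math
import random

def dominates(p, q):
    """p dominates q if p is no worse in every dimension and strictly better
    in at least one (minimization convention: smaller is better)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    """Straightforward O(n^2) skyline of a small point set."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

def angular_partition_skyline(points, num_partitions=4):
    """Sketch of angle-based partitioning for 2-D points: 'map' each point to a
    wedge of the positive quadrant by its angle, 'reduce' each wedge to its
    local skyline, then merge the small local skylines."""
    wedges = [[] for _ in range(num_partitions)]
    for x, y in points:
        angle = math.atan2(y, x)  # in [0, pi/2] for non-negative coordinates
        idx = min(int(angle / (math.pi / 2) * num_partitions), num_partitions - 1)
        wedges[idx].append((x, y))

    local = [p for wedge in wedges for p in skyline(wedge)]  # local skylines
    return skyline(local)                                    # final exact merge

if __name__ == "__main__":
    random.seed(1)
    pts = [(random.random(), random.random()) for _ in range(500)]
    assert sorted(angular_partition_skyline(pts)) == sorted(skyline(pts))
    print(len(skyline(pts)), "skyline points")
```

The merge is exact for any partitioning, since a global skyline point is undominated within its own wedge; the benefit claimed for angular wedges is that each local skyline stays small, which is what shortens the Reduce phase on a real cluster.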