Conference Paper
RanKloud: a scalable ranked query processing framework on hadoop.
DOI: 10.1145/1951365.1951444
Conference: EDBT 2011, 14th International Conference on Extending Database Technology, Uppsala, Sweden, March 21-24, 2011, Proceedings
Source: DBLP
 Citations (11)

Conference Paper: Parallel Large Scale Feature Selection for Logistic Regression
ABSTRACT: In this paper we examine the problem of efficient feature evaluation for logistic regression on very large data sets. We present a new forward feature selection heuristic that ranks features by their estimated effect on the resulting model's performance. An approximate optimization, based on backfitting, provides a fast and accurate estimate of each new feature's coefficient in the logistic regression model. Further, the algorithm is highly scalable by parallelizing simultaneously over both features and records, allowing us to quickly evaluate billions of potential features even for very large data sets.
Proceedings of SIAM International Conference on Data Mining (SDM'09); 01/2009
ABSTRACT: Assume that each object in a database has m grades, or scores, one for each of m attributes. For example, an object can have a color grade, that tells how red it is, and a shape grade, that tells how round it is. For each attribute, there is a sorted list, which lists each object and its grade under that attribute, sorted by grade (highest grade first). Each object is assigned an overall grade, that is obtained by combining the attribute grades using a fixed monotone aggregation function, or combining rule, such as min or average. To determine the top k objects, that is, k objects with the highest overall grades, the naive algorithm must access every object in the database, to find its grade under each attribute. Fagin has given an algorithm (“Fagin's Algorithm”, or FA) that is much more efficient. For some monotone aggregation functions, FA is optimal with high probability in the worst case. We analyze an elegant and remarkably simple algorithm (“the threshold algorithm”, or TA) that is optimal in a much stronger sense than FA. We show that TA is essentially optimal, not just for some monotone aggregation functions, but for all of them, and not just in a high-probability worst-case sense, but over every database. Unlike FA, which requires large buffers (whose size may grow unboundedly as the database size grows), TA requires only a small, constant-size buffer. TA allows early stopping, which yields, in a precise sense, an approximate version of the top k answers. We distinguish two types of access: sorted access (where the middleware system obtains the grade of an object in some sorted list by proceeding through the list sequentially from the top), and random access (where the middleware system requests the grade of an object in a list, and obtains it in one step). We consider the scenarios where random access is either impossible, or expensive relative to sorted access, and provide algorithms that are essentially optimal for these cases as well.
Journal of Computer and System Sciences 01/2001; 66(466):614-656. · 1.09 Impact Factor
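The threshold algorithm described in the abstract above can be illustrated with a minimal Python sketch. It is not taken from the cited paper: the function name `threshold_algorithm`, the dictionary-based input, and the tie-breaking by object id are assumptions made for illustration, and the sketch assumes every object has a grade under every attribute and that `aggregate` is monotone.

```python
import heapq


def threshold_algorithm(grades, k, aggregate=min):
    """Return the top-k (overall_grade, object) pairs, best first.

    grades: list of m dicts, one per attribute, mapping object id -> grade
            (hypothetical input format; every object appears in every dict).
    aggregate: monotone function over a list of grades, e.g. min.
    """
    # One sorted list per attribute, highest grade first.
    sorted_lists = [
        sorted(g.items(), key=lambda kv: (-kv[1], kv[0])) for g in grades
    ]
    top = []        # min-heap holding at most k (overall_grade, object) pairs
    in_top = set()  # ids currently in the heap, so the buffer stays O(k)

    for depth in range(len(sorted_lists[0])):
        frontier = []  # last grade seen under sorted access in each list
        for lst in sorted_lists:
            obj, grade = lst[depth]          # sorted access
            frontier.append(grade)
            if obj in in_top:
                continue
            # Random access: fetch this object's grade under every attribute.
            overall = aggregate([g[obj] for g in grades])
            if len(top) < k:
                heapq.heappush(top, (overall, obj))
                in_top.add(obj)
            elif overall > top[0][0]:
                _, evicted = heapq.heapreplace(top, (overall, obj))
                in_top.discard(evicted)
                in_top.add(obj)
        # No unseen object can score above the aggregate of the frontier.
        threshold = aggregate(frontier)
        if len(top) == k and top[0][0] >= threshold:
            break  # early stop: the current top k cannot be beaten

    return sorted(top, key=lambda t: (-t[0], t[1]))
```

For example, with a color list `{"a": 0.9, "b": 0.8, "c": 0.3, "d": 0.1}`, a shape list `{"a": 0.5, "b": 0.85, "c": 0.9, "d": 0.2}`, `k=2`, and `min` as the combining rule, the loop stops after three rounds of sorted access without ever reading object `d`.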
Conference Paper: Brute force and indexed approaches to pairwise document similarity comparisons with MapReduce.
ABSTRACT: This paper explores the problem of computing pairwise similarity on document collections, focusing on the application of "more like this" queries in the life sciences domain. Three MapReduce algorithms are introduced: one based on brute force, a second where the problem is treated as large-scale ad hoc retrieval, and a third based on the Cartesian product of postings lists. Each algorithm supports one or more approximations that trade effectiveness for efficiency, the characteristics of which are studied experimentally. Results show that the brute force algorithm is the most efficient of the three when exact similarity is desired. However, the other two algorithms support approximations that yield large efficiency gains without significant loss of effectiveness.
Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2009, Boston, MA, USA, July 19-23, 2009; 01/2009
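As a rough illustration of the third approach named in the abstract above, the following Python sketch computes inner-product similarity by pairing entries within each term's postings list. It is one plausible single-machine reading, not the cited paper's implementation: the function name `pairwise_similarity`, the term-weight input format, and the in-memory loops standing in for the map and reduce phases are all assumptions.

```python
from collections import defaultdict
from itertools import combinations


def pairwise_similarity(docs):
    """Inner-product similarity for every document pair sharing a term.

    docs: dict mapping doc id -> {term: weight} (hypothetical input format).
    Returns a dict mapping (doc_a, doc_b), with doc_a < doc_b, to a score.
    """
    # Indexing step: build postings lists, term -> [(doc_id, weight), ...].
    postings = defaultdict(list)
    for doc_id, weights in docs.items():
        for term, weight in weights.items():
            postings[term].append((doc_id, weight))

    # "Map": each postings list emits a partial score for every pair of
    # documents it contains. "Reduce": partial scores are summed per pair.
    scores = defaultdict(float)
    for plist in postings.values():
        for (d1, w1), (d2, w2) in combinations(sorted(plist), 2):
            scores[(d1, d2)] += w1 * w2
    return dict(scores)
```

Document pairs with no term in common never appear in the output, which is where this formulation saves work relative to comparing every pair directly.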