About
98 Publications
11,513 Reads
7,543 Citations
Publications (98)
Amazon is one of the world's largest e-commerce sites and Amazon Search powers the majority of Amazon's sales. As a consequence, even small improvements in relevance ranking both positively influence the shopping experience of millions of customers and significantly impact revenue. In the past, Amazon's product search engine consisted of several ha...
The present invention is directed towards systems and methods for providing dynamic search results based upon historical data through the use of one or more widgets. The method of the present invention comprises receiving a request for content from a client and generating one or more widgets for providing search result content. A display profile is...
The critical task of predicting clicks on search advertisements is typically addressed by learning from historical click data. When enough history is observed for a given query-ad pair, future clicks can be accurately modeled. However, based on the empirical distribution of queries, sufficient historical information is unavailable for many query-ad...
Parallel Global Optimization Algorithms (PGOA) provide an efficient way of dealing with hard optimization problems. One method of parallelization of GOAs that is frequently applied and commonly found in the contemporary literature is the so-called Island ...
Previous studies on search engine click modeling have identified two presentation factors that affect users' behavior: (1) position bias: the same result will get a different number of clicks when displayed in different positions and (2) externalities: the same result might get more clicks when displayed with results of relatively lower quality tha...
Sponsored search is a multi-billion dollar business that generates most of the revenue for search engines. Predicting the probability that users click on ads is crucial to sponsored search because the prediction is used to influence ranking, filtering, placement, and pricing of ads. Ad ranking, filtering and placement have a direct impact on the us...
A fundamental problem in sponsored search advertising is the estimation of probability of click for ads displayed in response to search queries. The historical click-through rate (CTR) is one of the most important predictors of the click, and can be extracted at multiple resolutions of the query-ad hierarchy. However, the new ads do not have any c...
The objective in editing this book was to assemble a sample of the best work in parallel and distributed biologically inspired algorithms. The editors invited researchers in different domains to submit their work. They aimed to include diverse topics to appeal to a wide audience. Some of the chapters summarize work that has been ongoing for several...
The present invention is directed towards systems and methods for predicting a frequency with which an advertisement displayed in response to a query will be selected. The method of the present invention comprises receiving analytics data associated with a display of one or more advertisements in response to one or more queries. One or more feat...
This chapter focuses on the parallelization of Estimation of Distribution Algorithms (EDAs). More specifically, it presents guidelines for designing efficient parallel EDAs that employ parallel fitness evaluation and parallel model building. Scalability analysis techniques are employed to identify and parallelize the main performance bottlenecks to...
The performance of classification algorithms is affected by the features used to describe the labeled examples presented to the inducers. Therefore, the problem of feature subset selection has received considerable attention. Approaches to this problem based on evolutionary algorithms typically use the wrapper method, treating the inducer as a blac...
Parallel genetic algorithms (GAs) have numerous parameters that affect their efficiency and accuracy. Traditionally, these parameters have been studied using empirical studies whose generality and limitations are difficult to assess. This chapter reviews existing theoretical models that predict the effects of the parameters. The models are used to...
Estimation of distribution algorithms (EDAs) are a wide-ranging family of evolutionary algorithms whose common feature is the way they evolve by learning a probability distribution from the best individuals in a population and sampling it to generate ...
A data mining decision tree system that uncovers patterns, associations, anomalies, and other statistically significant structures in data by reading and displaying data files, extracting relevant features for each of the objects, and using a method of recognizing patterns among the objects based upon object features through a decision tree that re...
There are numerous combinations of neural networks (NNs) and evolutionary algorithms (EAs) used in classification problems. EAs have been used to train the networks, design their architecture, and select feature subsets. However, most of these combinations have been tested on only a few data sets and many comparisons are done inappropriately measur...
Introduction. Master-Slave Parallel GAs. Multipopulation Parallel GAs. Cellular Parallel GAs. Conclusions. References.
This paper describes GridAssist, a user friendly Grid-based workflow management tool that allows users to execute workflows in a Grid environment and hides the underlying technology. Two cases are described in which this tool is now being used: processing ...
The population size of genetic algorithms (GAs) affects the quality of the solutions and the time required to find them. While progress has been made in estimating the population sizes required to reach a desired solution quality for certain problems, in practice the sizing of populations is still usually performed by trial and error. These trials...
Numerous applications of data mining to scientific data involve the induction of a classification model. In many cases, the collection of data is not performed with this task in mind, and therefore, the data might contain irrelevant or redundant features that affect negatively the accuracy of the induction algorithms. The size and dimensionality of...
Comparing the output of a physics simulation with an experiment is often done by visually comparing the two outputs. In order to determine which simulation is a closer match to the experiment, more quantitative measures are needed. This paper describes our early experiences with this problem by considering the slightly simpler problem of finding ob...
This paper illustrates the application of evolutionary algorithms (EAs) to the problem of oblique decision-tree (DT) induction. The objectives are to demonstrate that EAs can find classifiers whose accuracy is competitive with other oblique tree construction methods, and that, at least in some cases, this can be accomplished in a shorter time. We p...
The FIRST (Faint Images of the Radio Sky at Twenty-cm) survey is an ambitious project scheduled to cover 10,000 square degrees of the northern and southern galactic caps. Until recently, astronomers associated with FIRST identified radio-emitting galaxies with a bent-double morphology through a visual inspection of images. Besides being subjective,...
The performance of classification algorithms in machine learning is affected by the features used to describe the labeled examples presented to the inducers. Therefore, the problem of feature subset selection has received considerable attention. Genetic approaches to this problem usually follow the wrapper approach: treat the inducer as a black...
The usual approach to deal with noise present in many real-world optimization problems is to take an arbitrary number of samples of the objective function and use the sample average as an estimate of the true objective value. The number of samples is typically chosen arbitrarily and remains constant for the entire optimization process. This paper s...
The performance of classification algorithms in machine learning is affected by the features used to describe the labeled examples presented to the inducers. Therefore, the problem of feature subset selection has received considerable attention. Genetic approaches to this problem usually follow the wrapper approach: treat the inducer as a black box...
This paper describes the application of four evolutionary algorithms to the pruning of neural networks used in classification problems. Besides a simple genetic algorithm (GA), the paper considers three distribution estimation algorithms (DEAs): a compact GA, an extended compact GA, and the Bayesian Optimization Algorithm. The objective is to de...
There are conflicting reports over whether multiple independent runs of genetic algorithms (GAs) with small populations can reach solutions of higher quality or can find acceptable solutions faster than a single run with a large population. This paper investigates this question analytically using two approaches. First, the analysis assumes that the...
This paper describes the application of four evolutionary algorithms to the selection of feature subsets for classification problems.
Astronomy data sets have led to interesting problems in mining scientific data. These problems will likely become more challenging as the astronomy community brings several surveys online as part of the National Virtual Observatory, giving rise to the possibility of mining data across many different surveys. In this article, we discuss the work we...
Pseudo random number generators (PRNGs) are the basic input to the stochastic selection, recombination, and mutation operations of genetic algorithms (GAs). Although it does not seem like a crucial decision, recent studies suggest that the choice of PRNG can affect the performance of GAs. The objective of this paper is to study the effect of PRNGs on...
Selection methods are essential components of evolutionary algorithms (EAs). This paper reviews five popular selection methods used in EAs. The algorithms are examined using the cumulants of the fitness distribution of the selected individuals. The cumulants are calculated using order statistics. The method presented here considers finite populatio...
We describe an application of probabilistic modeling to the problem of recognizing radio galaxies with a bent-double morphology. The galaxies in question contain distinctive signatures of geometric shape and flux density that can be used to build a probabilistic model that is then used to score potential galaxy configurations. The experi...
The FIRST survey (Faint Images of the Radio Sky at Twenty-cm) is scheduled to cover 10,000 square degrees of the northern and southern galactic caps. Until recently, astronomers classified radio-emitting galaxies through a visual inspection of FIRST images.
Preface. Acknowledgments. 1. Introduction. 2. The Gambler's Ruin and Population Sizing. 3. Master-Slave Parallel GAs. 4. Bounding Cases of GAs With Multiple Demes. 5. Markov Chain Models of Multiple Demes. 6. Migration Rates and Optimal Topologies. 7. Migration and Selection Pressure. 8. Fine-Grained and Hierarchical Parallel GAs. 9. Summary, Exten...
With computers becoming more pervasive, disks becoming cheaper, and sensors becoming ubiquitous, we are collecting data at an ever-increasing pace. However, it is far easier to collect the data than to extract useful information from it. Sophisticated techniques, such as those developed in the multi-disciplinary field of data mining, are increasing...
In this paper, we describe the use of data mining techniques to search for radio-emitting galaxies with a bent-double morphology. In the past, astronomers from the FIRST (Faint Images of the Radio Sky at Twenty-cm) survey identified these galaxies through visual inspection. This was not only subjective but also tedious as the on-going survey now co...
This paper investigates how the policy used to select migrants and the individuals they replace affects the selection pressure in parallel evolutionary algorithms (EAs) with multiple populations. The four possible combinations of random and fitness-based emigration and replacement of existing individuals are considered. The investigation follows tw...
This paper introduces simple model-building evolutionary algorithms (EAs) that operate on continuous domains. The algorithms are based on supervised and unsupervised discretization methods that have been used as preprocessing steps in machine learning. The basic idea is to discretize the continuous variables and use the discretization as a simple...
Data mining techniques are increasingly gaining popularity in various scientific domains as viable approaches to the analysis of massive data sets. In this chapter, we describe our experiences in applying data mining to a problem in astronomy, namely, the identification of radio-emitting galaxies with a bent-double morphology. Until recently, astro...
Master-slave parallel GAs are easy to implement, often yield considerable improvements in performance, and all the theory available for simple GAs can be used to choose adequate values for the search parameters. The analysis of this chapter showed that, for many applications, the reduction in computation time is sufficient to overcome the cost of c...
This chapter presented a solution to a long-standing problem in genetic algorithms: how to determine an adequate population size to reach a solution of a particular quality. The model is based on a random walk where the position of a particle on a bounded one-dimensional space represents the number of copies of the correct BBs in the population. Th...
This chapter presented models that predict the expected solution quality of parallel GAs with multiple populations after any number of epochs and for any choice of deme size, deme count, topology, or migration rate. The basic idea was to model the parallel GAs as Markov chains to determine the number of correct BBs that are present in the demes at...
This chapter treated fine-grained and hierarchical parallel GAs. It began with a brief review of fine-grained parallel GAs. The chapter identified some of the most salient design problems of this type of algorithm, and discussed some of the recent work in this area.
This chapter focused on hierarchical combinations of parallel GAs. The hierarchica...
The design of efficient and accurate parallel GAs is a complex problem. One must decide on a configuration among the many choices of topologies, migration rates, deme counts and sizes. Each parameter affects the quality of the search and the efficiency of the algorithm in non-linear ways, which makes the choices difficult. The ultimate goal is to d...
The calculations presented in this chapter recognize that the design of parallel GAs is a complex problem, and that the choices of topologies, migration rates, number of demes, and their size are intimately related. To make progress on the deme-sizing problem without ignoring the other choices, the analysis used bounds on the topologies and migrati...
The choice of migrants and the replacement of individuals are not often considered important parameters of parallel GAs. However, this chapter used two different methods to show that choosing the migrants or replacements according to their fitness increases the selection pressure. Some migration policies may cause the algorithm to converge signific...
This chapter extended the previous deme-sizing equations to consider configurations that are likely to be used by practitioners. The first part of the chapter described the relation between the deme size, the migration rate, and the topology’s degree with the probability of success after two epochs. It showed how to find the configuration that opti...
Decision trees have long been popular in classification as they use simple and easy-to-understand tests at each node. Most variants of decision trees test a single attribute at a node, leading to axis-parallel trees, where the test results in a hyperplane which is parallel to one of the dimensions in the attribute space. These trees can be rather...
Implementations of parallel genetic algorithms (GA) with multiple populations are common, but they introduce several parameters whose effect on the quality of the search is not well understood. Parameters such as the number of populations, their size, the topology of communications, and the migration rate have to be set carefully to reach adequate...
This paper analyzes convergence properties of the
As data mining techniques are applied to ever larger data sets, it is becoming clear that parallel processors will play an important role in reducing the turn-around time for data analysis. In this paper, we describe the design of a parallel object-oriented toolkit for mining scientific data sets. After a brief discussion of our design goals, we desc...
Migration of individuals between populations may increase the selection pressure. This has the desirable consequence of speeding up convergence, but it may result in an excessively rapid loss of variation that may cause the search to fail. This paper investigates the effects of migration on the distribution of fitness. It considers arbitrary migrat...
This paper proposes an algorithm that uses an estimation of the joint distribution of promising solutions in order to generate new candidate solutions. The algorithm is settled into the context of genetic and evolutionary computation and the algorithms based on the estimation of distributions. The proposed algorithm is called the Bayesian Optimizat...
This paper presents calculations of the selection intensity of common selection and replacement methods used in genetic algorithms (GAs) with generation gaps. The selection intensity measures the increase of the average fitness of the population after selection, and it can be used to predict the average fitness of the population at each iteration a...
Printout. Thesis (Ph. D.)--University of Illinois at Urbana-Champaign, 1999. Vita. Includes bibliographical references (leaves 140-146).
Parallel implementations of genetic algorithms (GAs) are common, and, in most cases, they succeed in reducing the time required to find acceptable solutions. However, the effect of the parameters of parallel GAs on the quality of their search and on their efficiency is not well understood. This insufficient knowledge limits our ability to design fas...
Genetic algorithms (GAs) are powerful search techniques that are used successfully to solve problems in many different disciplines. Parallel GAs are particularly easy to implement and promise substantial gains in performance. As such, there has been extensive research in this field. This survey attempts to collect, organize, and present in a unif...
Parallel genetic algorithms (GAs) are complex programs that are controlled by many parameters, which affect their search quality and their efficiency. The goal of this paper is to provide guidelines to choose those parameters rationally. The investigation centers on the sizing of populations, because previous studies show that there is a crucial re...
This paper examines the scalability of several types of parallel genetic algorithms (GAs). The objective is to determine the optimal number of processors that can be used by each type to minimize the execution time. The first part of the paper considers algorithms with a single population. The investigation focuses on an implementation where the po...
This paper presents a model to predict the convergence quality of genetic algorithms based on the size of the population. The model is based on an analogy between selection in GAs and one-dimensional random walks. Using the solution to a classic random walk problem (the gambler's ruin), the model naturally incorporates previous knowledge about the ini...
As genetic algorithms (GAs) are used to solve harder problems, it is becoming necessary to use better algorithms and more efficient implementations to reach good solutions fast. This chapter describes the implementation of master-slave and multiple-population parallel GAs. The goal of the chapter is to help others to implement their own parallel co...
The paper presents a model for predicting the convergence quality of genetic algorithms. The model incorporates previous knowledge about decision making in genetic algorithms and the initial supply of building blocks in a novel way. The result is an equation that accurately predicts the quality of the solution found by a GA using a given population...
This paper presents models that predict the speedup of two cases that bound the possible topologies and migration rates of parallel genetic algorithms (GAs). The first bounding case is a parallel GA with completely isolated demes or subpopulations and for this case the model and the experiments show that the speedup is not very significant when mor...
This paper investigates the possibility of gaining any computational benefit from multiple deme, small population GAs compared to a single large population GA. Our framework is based on an earlier decision theoretic framework developed by Goldberg, Deb and Clark (1992) for population sizing. Our analysis and empirical results for different bounding...
Recent work in classification indicates that significant improvements in accuracy can be obtained by growing an ensemble of classifiers and having them vote for the most popular class. This paper focuses on ensembles of decision trees that are created with a randomized procedure based on sampling. Randomization can be introduced by using random sam...
In earlier work, we have described our experiences with the use of decision tree classifiers to identify radio-emitting galaxies with a bent-double morphology in the FIRST astronomical survey. We now extend this work to include ensembles of decision tree classifiers, including two algorithms developed by us. These algorithms randomize the decision...
This paper uses Markov chains to analyze the search quality of a bounding case of parallel genetic algorithms with multiple populations. In the bounding case considered here, each population exchanges individuals with all the others. First, the migration rate is set to the maximum value possible, and later the analysis is refined to consider lower migration...
A decision tree system that is part of a parallel object-oriented pattern recognition system, which in turn is part of an object oriented data mining system. A decision tree process includes the step of reading the data. If necessary, the data is sorted. A potential split of the data is evaluated according to some criterion. An initial split of the...
High-resolution computer simulations produce large volumes of data. As a first step in the analysis of these data, supervised machine learning techniques can be used to retrieve objects similar to a query that the user finds interesting. These objects may be characterized by a large number of features, some of which may be redundant or irrelevant t...
This paper describes the design and implementation of a general-purpose anomaly detector for streaming data. Based on a survey of similar work from the literature, a basic anomaly detector builds a model on normal data, compares this model to incoming data, and uses a threshold to determine when the incoming data represent an anomaly. Models compac...