Andrei Z. Broder · Google Inc. | Google · Research Department
Doctor of Philosophy
195 publications · 65,510 reads · 23,929 citations
Publications (195)
We present a new framework to conceptualize and operationalize the total user experience of search, by studying the entirety of a search journey from a utilitarian point of view. Web search engines are widely perceived as "free". But search requires time and effort: in reality there are many intermingled non-monetary costs (e.g. time costs, cognit...
We present double pooling, a simple, easy-to-implement variation on test pooling, that in certain ranges for the a priori probability of a positive test, is significantly more efficient than the standard single pooling approach (the Dorfman method).
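As a point of reference for the baseline named above, the following minimal sketch computes the expected number of tests per person under standard single pooling (the Dorfman method): test a pool of size k once, then retest each member individually only if the pool is positive. It does not implement double pooling, whose details are not given in this summary. The function names are illustrative, and independence of test outcomes with a perfectly accurate test is assumed.

```python
def dorfman_tests_per_person(p: float, k: int) -> float:
    """Expected tests per person for single (Dorfman) pooling.

    p: a priori probability that one sample is positive (assumed independent).
    k: pool size. One pooled test is shared by k people, and all k are
       retested individually when the pool is positive.
    """
    if k < 2:
        return 1.0  # a pool of one is just individual testing
    prob_pool_positive = 1.0 - (1.0 - p) ** k
    return 1.0 / k + prob_pool_positive


def best_pool_size(p: float, max_k: int = 100) -> tuple[int, float]:
    """Return (k, expected tests per person) minimizing the cost over 1..max_k."""
    candidates = ((k, dorfman_tests_per_person(p, k)) for k in range(1, max_k + 1))
    return min(candidates, key=lambda pair: pair[1])
```

For small p the optimum pool size is large and the saving substantial; as p grows, pooling stops paying off and the best choice falls back to k = 1.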
A quarter-century ago Web search stormed the world: within a few years the Web search box became a standard tool of daily life ready to satisfy informational, transactional, and navigational queries needed for some task completion. However, two recent trends are dramatically changing the box's role: first, the explosive spread of smartphones brings...
According to recent estimates, about 90% of consumer received emails are machine-generated. Such messages include shopping receipts, promotional campaigns, newsletters, booking confirmations, etc. Most such messages are created by populating a fixed template with a small amount of personalized information, such as name, salutation, reservation numb...
This talk is a review of some Web research and predictions that I co-authored over the last two decades: both what turned out gratifyingly right and what turned out embarrassingly wrong. Topics will include near-duplicates, the Web graph, query intent, inverted indices efficiency, and others. While this seems a completely idiosyncratic collection t...
Gartner's 2014 Hype Cycle, released last August, moves Big Data technology from the Peak of Inflated Expectations to the beginning of the Trough of Disillusionment, when interest starts to wane as reality does not live up to previous promises. As the hype is starting to dissipate it is worth asking what Big Data (however defined) means from a scie...
A computer-implemented method is disclosed for determining a type of landing page to which to transfer web searchers that enter a particular query, the method comprising: classifying a landing page as one of a plurality of landing page classes with a trained classifier of a computer based on textual content of the landing page; determining, by the...
Disclosed is a system and method for securely, conveniently and effectively storing information in a secure data repository or database, and securely delivering such information to a respective user. The secure repository and database, referred to as a Vault, is a secure storage utility used for storing and safekeeping valuable personal information...
The k-means clustering algorithm has a long history and a proven practical performance; however, it does not scale to clustering millions of data points into thousands of clusters in high dimensional spaces. The main computational bottleneck is the need to recompute the nearest centroid for every data point at every iteration, a prohibitive cost when...
The World Wide Web portion of the Internet is largely supported by advertising. To deliver the most effective advertising, a system for dynamically creating customized advertisements is introduced. The behavior and any demographic information known about web viewers is used to select an advertising template that will be used to create an advertisem...
The present invention is directed towards systems, methods and computer program products for providing query-based advertising content. According to one embodiment, a method for providing query-based advertising content comprises receiving a web query and generating an ad query associated with the web query, wherein the ad query is generated on the...
A system and method for implementing a multi-step challenge and response test includes steps or acts of: using an input/output subsystem for presenting a series of challenges to a user that require said user to correctly solve each challenge before a next challenge is revealed to the user; receiving the user's response to each challenge; and submit...
Embodiments are directed towards identifying auto-folder tags for messages by using a combinational optimization approach of bi-clustering folder names and features of messages based on relationship strengths. The combinational optimization approach of bi-clustering, generally, groups a plurality of folder names and a plurality of features into one...
A method is carried out by storing information describing configurations of discussion threads formed of respective series of EMTs that are exchanged among at least two individuals. The discussion threads have a root EMT, zero or more reply EMTs, and a last offspring EMT. The method is further carried out by compacting the EMT discussion threads, a...
The present invention is directed towards methods and computer readable media for annotating and ranking user reviews on social review systems with inferred analytics. A reference framework is provided by creating context according to previous activity, bias, or background information of a given reviewer. The method of the present invention compris...
The central problem in the emerging discipline of computational advertising is to find the "best match" between a given user in a given context and a suitable advertisement. The context could be a user entering a query in a search engine ("sponsored search"), a user reading a web page ("content match" and "display ads"), a user streaming a movie, a...
Much research in information management begins by asking how to manage a given information corpus. But information management systems can only be as good as the information they manage. They struggle and often fail to correctly infer meaning from large ...
Contextual advertising is a type of Web advertising, which, given the URL of a Web page, aims to embed into the page the most relevant textual ads available. For static pages that are displayed repeatedly, the matching of ads can be based on prior analysis of their entire content; however, often ads need to be matched to new or dynamically created...
We introduce the problem of evaluating graph constraints in content-based publish/subscribe (pub/sub) systems. This problem formulation extends traditional content-based pub/sub systems in the following manner: publishers and subscribers are connected via a (logical) directed graph G with node and edge constraints, which limits the set of valid pat...
Online user interaction is becoming increasingly personalized both via explicit means: customizations, options, add-ons, skins, apps, etc. and via implicit means, that is, deep data mining of user activities that allows automated selection of content and experiences, e.g. individualized top news stories, personalized ranking of search results, pers...
Display advertising is one of the two major advertising channels on the web (in addition to search advertising). Display advertising on the Web is usually done by graphical ads placed on the publishers' Web pages. There is no explicit user query, and the ad selection is performed based on the page where the ad is placed (contextual targeting) or us...
Sponsored search is a three-way interaction between advertisers, users, and the search engine. The basic ad selection in sponsored search lets the advertiser choose the exact queries where the ad is to be shown. To increase advertising volume, many advertisers opt into advanced match, where the search engine can select additional queries that are...
The central problem of Computational Advertising is to find the “best match” between a given user in a given context and a suitable advertisement. The context could be a user entering a query in a search engine (“sponsored search”), a user reading a web page (“content match” and “display ads”), a user interacting with a portable device, and so on....
The long tail of consumer demand is consistent with two fundamentally different theories. The first, and more popular hypothesis, is that a majority of consumers have similar tastes and only few have any interest in niche content; the second, is that everyone is a bit eccentric, consuming both popular and niche products. By examining extensive data...
Ranking Web search results has long evolved beyond simple bag-of-words retrieval models. Modern search engines routinely employ machine learning ranking that relies on exogenous relevance signals. Yet the majority of current methods still evaluate each Web page out of context. In this work, we introduce a novel source of relevance information for W...
Back in the heady days of 1999 and WWW8 (Toronto) we held a panel titled "Finding Anything in the Billion Page Web: Are Algorithms the Key?" In retrospect the answer to this question seems laughably obvious - the search industry has burgeoned on a foundation of algorithms, cloud computing and machine learning. As we move into the second decade of t...
Queries on major Web search engines produce complex result pages, primarily composed of two types of information: organic results, that is, short descriptions and links to relevant Web pages, and sponsored search results, the small textual advertisements often displayed above or to the right of the organic results. Strategies for optimizing each ty...
The success of "infinite-inventory" retailers such as Amazon.com and Netflix has been ascribed to a "long tail" phenomenon. To wit, while the majority of their inventory is not in high demand, in aggregate these "worst sellers," unavailable at limited-inventory competitors, generate a significant fraction of total revenue. The long tail phenomenon...
One of the most prevalent online advertising methods is textual advertising. To produce a textual ad, an advertiser must craft a short creative (the text of the ad) linking to a landing page, which describes the product or service being promoted. Furthermore, the advertiser must associate the creative to a set of manually chosen bid phrases repre...
The classic Web search experience, consisting of returning “ten blue links” in response to a short user query, is powered today by a mature technology where progress has become incremental and expensive. Furthermore, the “ten blue links” represent only a fractional part of the total Web search experience: today, what users expect and receive in res...
Information extraction from unstructured text has much in common with querying in database systems. Despite some differences in how data is modeled or represented, the general goal remains the same, i.e. to retrieve data or tag elements that satisfy some user-specified constraints. In recent years, the two paradigms have become much closer thanks...
Unbeknownst to most users, when a query is submitted to a search engine two distinct searches are performed: the organic or algorithmic search that returns relevant Web pages and related data (maps, images, etc.), and the sponsored search that returns paid advertisements. While an enormous amount of work has been invested in understanding the user...
We define and study the process of context transfer in search advertising, which is the transition of a user from the context of Web search to the context of the landing page that follows an ad-click. We conclude that in the vast majority of cases, the user is shown one of three types of pages, which can be accurately distinguished using automatic...
Computational advertising is an emerging new scientific sub-discipline, at the intersection of large scale search and text analysis, information retrieval, statistical modeling, machine learning, classification, optimization, and microeconomics. The central challenge of computational advertising is to find the “best match” between a given user in a...
Contextual advertising (also called content match) refers to the placement of small textual ads within the content of a generic web page. It has become a significant source of revenue for publishers ranging from individual bloggers to major newspapers. At the same time it is an important way for advertisers to reach their intended audience. This re...
Sponsored search systems are tasked with matching queries to relevant advertisements. The current state-of-the-art matching algorithms expand the user's query using a variety of external resources, such as Web search results. While these expansion-based algorithms are highly effective, they are largely inefficient and cannot be applied in real-time...
Motivated by contextual advertising systems and other web applications involving efficiency-accuracy tradeoffs, we study similarity caching. Here, a cache hit is said to occur if the requested item is similar but not necessarily equal to some cached item. We study two objectives that dictate the efficiency-accuracy tradeoff and provide our caching policies...
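The notion of a similarity cache hit described above can be illustrated with a small sketch. Everything specific here is an assumption made for illustration and is not taken from the publication: items are points in Euclidean space, "similar" means within a fixed `radius`, lookup is a linear scan, and eviction is plain LRU rather than any of the policies the authors analyze.

```python
import math
from collections import OrderedDict


class SimilarityCache:
    """Toy similarity cache: a lookup hits if any cached point is within `radius`."""

    def __init__(self, capacity: int, radius: float):
        self.capacity = capacity
        self.radius = radius
        self._items = OrderedDict()  # point (tuple of floats) -> cached value

    def get(self, point):
        """Return the value of the nearest cached point within radius, else None."""
        best_key, best_dist = None, self.radius
        for key in self._items:
            d = math.dist(point, key)
            if d <= best_dist:
                best_key, best_dist = key, d
        if best_key is None:
            return None  # miss: nothing similar enough is cached
        self._items.move_to_end(best_key)  # mark as most recently used
        return self._items[best_key]

    def put(self, point, value):
        """Insert a point, evicting the least recently used entry when full."""
        key = tuple(point)
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)
```

A larger radius raises the hit rate (efficiency) at the price of returning answers computed for less similar items (accuracy), which is the tradeoff the abstract refers to.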
We propose a methodology for building a robust query classification system that can identify thousands of query classes, while dealing in real time with the query volume of a commercial Web search engine. We use a pseudo relevance feedback technique: given a query, we determine its topic by classifying the Web search results retrieved by the query....
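The idea of classifying a query through its search results can be sketched as a simple vote. This is only an illustration under stated assumptions: `retrieve` and `classify_document` are hypothetical callables standing in for a search engine and a trained page classifier, and unweighted majority voting is assumed here, since the summary does not say how the result-level classifications are combined.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple


def classify_query(
    query: str,
    retrieve: Callable[[str], Iterable[str]],   # hypothetical: query -> result texts
    classify_document: Callable[[str], str],    # hypothetical: text -> class label
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Infer query classes by voting over the classes of the retrieved results.

    Returns up to top_k (label, vote share) pairs, most voted first.
    """
    votes = Counter(classify_document(doc) for doc in retrieve(query))
    total = sum(votes.values())
    if total == 0:
        return []  # no results, so no evidence about the query's topic
    return [(label, count / total) for label, count in votes.most_common(top_k)]
```

The appeal of this design is that short, ambiguous queries borrow the much richer text of their results, so the classifier only ever has to handle full documents.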
The non-English Web is growing at phenomenal speed, but available language processing tools and resources are predominantly English-based. Taxonomies are a case in point: while there are plenty of commercial and non-commercial taxonomies for the English Web, taxonomies for other languages are either not available or of arguable quality. Given t...
Unbeknownst to most users, when a query is submitted to a search engine two distinct searches are performed: the organic or algorithmic search that returns relevant Web pages and related data (maps, images, etc.), and the sponsored search that returns paid advertisements. While an enormous amount of work has been invested in understanding the u...
We introduce the hiring problem, in which a growing company continuously interviews and decides whether to hire applicants. This problem is similar in spirit but quite different from the well-studied secretary problem. Like the secretary problem, it captures fundamental aspects of decision making under uncertainty and has many possible applications...
The non-English Web is growing at breakneck speed, but available language processing tools are mostly English based. Taxonomies are a case in point: while there are plenty of commercial and non-commercial taxonomies for the English Web, taxonomies for other languages are either not available or of very limited quality. Given that building taxonomie...
The business of Web search, a $10 billion industry, relies heavily on sponsored search, in which a few carefully selected paid advertisements are displayed alongside algorithmic search results. A key technical challenge in sponsored search is to select ads that are relevant for the user's query. Identifying relevant ads is challenging because querie...
In contextual advertising, estimating the number of impressions of an ad is critical in planning and budgeting advertising campaigns. However, producing this forecast, even within large margins of error, is quite challenging. We attack this problem by simulating the presence of a given ad with its associated bid over historical data, involving bill...
Web textual advertising can be interpreted as a search problem over the corpus of ads available for display in a particular context. In contrast to conventional information retrieval systems, which always return results if the corpus contains any documents lexically related to the query, in Web advertising it is acceptable, and occasionally e...
Computational advertising is an emerging scientific discipline, at the intersection of large scale search and text analysis, information retrieval, statistical modeling, machine learning, optimization, and microeconomics. The central challenge of computational advertising is to find the "best match" between a given user in a given context and a sui...
As popular search engines face the sometimes conflicting interests of protecting privacy while retaining query logs for a variety of uses, numerous technical measures have been suggested to both enhance privacy and preserve at least a portion of the ...
Traditional document classification frameworks, which apply the learned classifier to each document in a corpus one by one, are infeasible for extremely large document corpora, like the Web or large corporate intranets. We consider the classification problem on a corpus that has been processed primarily for the purpose of searching, and thus our ac...
The primary business model behind Web search is based on textual advertising, where contextually relevant ads are displayed alongside search results. We address the problem of selecting these ads so that they are both relevant to the queries and profitable to the search engine, showing that optimizing ad relevance and revenue is not equivale...
Computational advertising is an emerging new scientific sub-discipline, at the intersection of large scale search and text analysis, information retrieval, statistical modeling, machine learning, classification, optimization, and microeconomics. The central challenge of computational advertising is to find the "best match" between a given user in a...
We propose augmenting collaborative reviewing systems with an automatic annotation capability that helps users interpret reviews. Given an item and its review by a certain author, our approach is to find a reference set of similar items that is both easy to describe and meaningful to users. Depending on the number of available same-author reviews...
Folksonomies allow users to collaboratively tag a variety of textual and multimedia objects with sets of labels. The largest folksonomy projects, such as FLICKR and DEL.ICIO.US, contain millions of multi-labeled objects, and embed significant amounts of human knowledge. We propose a method for automatically using this knowledge to augment tra...
Contextual Advertising is a type of Web advertising, which, given the URL of a Web page, aims to embed into the page (typically via JavaScript) the most relevant textual ads available. For static pages that are displayed repeatedly, the matching of ads can be based on prior analysis of their entire content; however, ads need to be matched also to...
We consider the problem of estimating occurrence rates of rare events for extremely sparse data, using pre-existing hierarchies to perform inference at multiple resolutions. In particular, we focus on the problem of estimating click rates for (webpage, advertisement) pairs (called impressions) where both the pages and the ads are classified into hie...
We propose a methodology for building a practical robust query classification system that can identify thousands of query classes with reasonable accuracy, while dealing in real-time with the query volume of a commercial web search engine. We use a blind feedback technique: given a query, we determine its topic by classifying the web search resu...
Contextual advertising or Context Match (CM) refers to the placement of commercial textual advertisements within the content of a generic web page, while Sponsored Search (SS) advertising consists in placing ads on result pages from a web search engine, with ads driven by the originating query. In CM there is usually an intermediary commercial ad-n...
We present a framework for margin based active learning of linear separators. We instantiate it for a few important cases, some of which have been previously considered in the literature. We analyze the effectiveness of our framework both in the realizable case and in a specific noisy setting related to the Tsybakov small noise condition.
The classic IR model assumes a human engaged in activity that generates an “information need”. This need is verbalized and then expressed as a query to a search engine over a defined corpus. In the past decade, Web search engines have evolved from a first generation based on classic IR algorithms scaled to web size and thus supporting only informatio...
We consider the problem of efficiently sampling Web search engine query results. In turn, using a small random sample instead of the full set of results leads to efficient approximate algorithms for several applications, such as:
Determining the set of categories in a given taxonomy spanned by the search results;
Finding the range of metadata value...
In recent years, the emergence of the Web and the dramatic increase in computing, storage and networking capacity has given rise to the concept of networked information spaces. The prime example of a networked information space is the World Wide Web itself. The Web, in its pure form, is a set of hypertext documents, with links in one document point...
For barely a decade now the Web graph (the network formed by Web pages and their hyperlinks) has been the focus of scientific study. In that short a time, this study has made a significant impact on research in physics, computer science and mathematics. It has focussed the attention of the scientific community on all the different kinds of networks...
Modern document collections often contain groups of documents with overlapping or shared content. However, most information retrieval systems process each document separately, causing shared content to be indexed multiple times. In this paper, we describe a new document representation model where related documents are organized as a tree, allow...
We present a framework for approximating random-walk based probability distributions over Web pages using graph aggregation. The basic idea is to partition the graph into classes of quasi-equivalent vertices, to project the page-based random walk to be approximated onto those classes, and to compute the stationary probability distribution of the re...
We consider the problem of estimating the size of a collection of documents using only a standard query interface. Our main idea is to construct an unbiased and low-variance estimator that can closely approximate the size of any set of documents defined by certain conditions, including that each document in the set must match at least one query f...
In the past decade, Web search engines have evolved from a first generation based on classic Information Retrieval (IR) algorithms scaled to web size and thus supporting only informational queries, to a second generation supporting navigational queries using web specific information (primarily link analysis), to a third generation enabling transact...
Traditional document classification frameworks, which apply the learned classifier to each document in a corpus one by one, are infeasible for extremely large document corpora, like the Web or large corporate intranets. We consider the classification problem on a corpus that has been processed primarily for the purpose of searching, and thus our ac...
We consider a multidimensional variant of the balls-and-bins problem, where balls correspond to random D-dimensional 0-1 vectors. This variant is motivated by a problem in load balancing documents for distributed search engines. We demonstrate the utility of the power of two choices in this domain.
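A small simulation can make the multidimensional setting concrete. The sketch below is an illustration under assumptions not stated in the summary: each ball is a random D-dimensional 0-1 vector with independent coordinates, each bin keeps a per-dimension load vector, and a ball goes to whichever of its randomly sampled candidate bins would have the smallest maximum coordinate load after placement. The publication may analyze a different placement rule.

```python
import random
from typing import List


def place_balls(num_balls: int, num_bins: int, dim: int, choices: int,
                density: float = 0.5, seed: int = 0) -> List[List[int]]:
    """Simulate placing random 0-1 vectors into bins; return per-bin load vectors.

    choices=1 is purely random placement; choices=2 is the "two choices" scheme.
    """
    rng = random.Random(seed)
    loads = [[0] * dim for _ in range(num_bins)]
    for _ in range(num_balls):
        ball = [1 if rng.random() < density else 0 for _ in range(dim)]
        candidates = [rng.randrange(num_bins) for _ in range(choices)]

        def max_load_after(bin_index: int) -> int:
            # Largest coordinate load the bin would have if it took this ball.
            return max(load + bit for load, bit in zip(loads[bin_index], ball))

        target = min(candidates, key=max_load_after)
        for j, bit in enumerate(ball):
            loads[target][j] += bit
    return loads


def max_load(loads: List[List[int]]) -> int:
    """Maximum load over all bins and all dimensions."""
    return max(max(row) for row in loads)
```

Comparing `max_load(place_balls(n, n, d, choices=1))` with the same call using `choices=2` for a moderately large n gives an empirical feel for the benefit of the second choice.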
This panel will focus on exploring future enhancements of Web technology for active Internet-scale information delivery and dissemination. It will ask the questions of whether the current Web technology is sufficient, what can be leveraged in this endeavor, and how a combination of ideas from a variety of existing disciplines can help in meeting th...
Searching and browsing are the two basic information discovery paradigms, since the early days of the Web. After more than ten years down the road, three schools seem to have emerged: (1) The search-centric school argues that guided navigation is superfluous since free form search has become so good and the search UI so common, that users can satis...
The state of the web today has been and continues to be greatly influenced by the existence of web-search engines. This panel will discuss the ways in which search engines have affected the web in the past and ways in which they may affect it in the future. Both positive and negative effects will be discussed as will potential measures to combat th...
The Web graph, meaning the graph induced by Web pages as nodes and their hyperlinks as directed edges, has become a fascinating object of study for many people: physicists, sociologists, mathematicians, computer scientists, and information retrieval specialists. Recent results range from theoretical (e.g.: models for the graph, semi-external algo...
The rapid growth of the web has been noted and tracked extensively. Recent studies have however documented the dual phenomenon: web pages have small half lives, and thus the web exhibits rapid death as well. Consequently, page creators are faced with an increasingly burdensome task of keeping links up-to-date, and many are falling behind. In additi...
Unstructured information represents the vast majority of data collected and accessible to enterprises. Exploiting this information requires systems for managing and extracting knowledge from large collections of unstructured data and applications for discovering patterns and relationships. This paper elucidates the differences between search system...
A Bloom filter is a simple space-efficient randomized data structure for representing a set in order to support membership queries. Bloom filters allow false positives but the space savings often outweigh this drawback when the probability of an error is controlled. Bloom filters have been used in database applications since the 1970s, but only in...
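The data structure described above can be sketched in a few lines. This is a minimal illustration, not code from the publication: the class and parameter names are made up, items are assumed to be strings, and the k bit positions are derived from one SHA-256 digest by double hashing, which is one common implementation choice among several.

```python
import hashlib
from typing import List


class BloomFilter:
    """Minimal Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # all bits start at zero

    def _positions(self, item: str) -> List[int]:
        # Derive num_hashes positions from two 64-bit values (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force a nonzero stride
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

A membership test such as `"example.com" in bf` can return a false positive when other items happen to have set all of its bit positions, but an item that was added is never reported absent.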
We present an efficient query evaluation method based on a two level approach: at the first level, our method iterates in parallel over query term postings and identifies candidate documents using an approximate evaluation taking into account only partial information on term occurrences and no query independent factors; at the second level, promisi...
The Web graph, meaning the graph induced by Web pages as nodes and their hyperlinks as directed edges, has become a fascinating object of study for many people: physicists, sociologists, mathematicians, computer scientists, and information retrieval specialists. Recent results range from theoretical (e.g.: models for the graph, semi-external algorit...