Andrew Tomkins's research while affiliated with Mountain View College and other places

Publications (145)

Article
A/B testing is widely used to tune search and recommendation algorithms, to compare product variants as efficiently and effectively as possible, and even to study animal behavior. With ongoing investment, due to diminishing returns, the items produced by the new alternative B show smaller and smaller improvement in quality from the items produced b...
Preprint
Full-text available
Google Maps uses current and historical traffic trends to provide routes to drivers. In this paper, we use microscopic traffic simulation to quantify the improvements to both travel time and CO$_2$ emissions from Google Maps real-time navigation. A case study in Salt Lake City shows that Google Maps users are, on average, saving 1.7% of CO$_2$ emis...
Preprint
Full-text available
Metropolitan scale vehicular traffic modeling is used by a variety of private and public sector urban mobility stakeholders to inform the design and operations of road networks. High-resolution stochastic traffic simulators are increasingly used to describe detailed demand-supply interactions. The design of efficient calibration techniques remains...
Preprint
Full-text available
In this work, we propose CARLS, a novel framework for augmenting the capacity of existing deep learning frameworks by enabling multiple components -- model trainers, knowledge makers and knowledge banks -- to concertedly work together in an asynchronous fashion across hardware platforms. The proposed CARLS is particularly suitable for learning para...
Preprint
Full-text available
Recent studies have indicated that Graph Convolutional Networks (GCNs) act as a low-pass filter in the spectral domain and encode smoothed node representations. In this paper, we consider their opposite, namely Graph Deconvolutional Networks (GDNs) that reconstruct graph signals from smoothed node representations. We motivate the design of Graph...
Preprint
Full-text available
Adversarial robustness corresponds to the susceptibility of deep neural networks to imperceptible perturbations made at test time. In the context of image tasks, many algorithms have been proposed to make neural networks robust to adversarial perturbations made to the input pixels. These perturbations are typically measured in an $\ell_p$ norm. How...
Preprint
Work in information retrieval has largely been centered around ranking and relevance: given a query, return some number of results ordered by relevance to the user. The problem of result list truncation, or where to truncate the ranked list of results, however, has received less attention despite being crucial in a variety of applications. Such tru...
Conference Paper
Full-text available
We present BusTr, a machine-learned model for translating road traffic forecasts into predictions of bus delays, used by Google Maps to serve the majority of the world’s public transit systems where no official real-time bus tracking is provided. We demonstrate that our neural sequence model improves over DeepTTE, the state-of-the-art baseline, bot...
Preprint
Large generative language models such as GPT-2 are well-known for their ability to generate text as well as their utility in supervised downstream tasks via fine-tuning. Our work is twofold: firstly we demonstrate via human evaluation that classifiers trained to discriminate between human and machine-generated text emerge as unsupervised predictors...
Preprint
Full-text available
We present BusTr, a machine-learned model for translating road traffic forecasts into predictions of bus delays, used by Google Maps to serve the majority of the world's public transit systems where no official real-time bus tracking is provided. We demonstrate that our neural sequence model improves over DeepTTE, the state-of-the-art baseline, bot...
Preprint
Work in information retrieval has traditionally focused on ranking and relevance: given a query, return some number of results ordered by relevance to the user. However, the problem of determining how many results to return, i.e. how to optimally truncate the ranked result list, has received less attention despite being of critical importance in a...
Preprint
This paper seeks to develop a deeper understanding of the fundamental properties of neural text generation models. The study of artifacts that emerge in machine-generated text as a result of modeling choices is a nascent research area. Previously, the extent and degree to which these artifacts surface in generated text has not been well studied. I...
Preprint
We propose improving the privacy properties of a dataset by publishing only a strategically chosen "core-set" of the data containing a subset of the instances. The core-set allows strong performance on primary tasks, but forces poor performance on unwanted tasks. We give methods for both linear models and neural networks and demonstrate their effic...
Conference Paper
In this paper we consider the problem of estimating the difficulty of parking at a particular time and place; this problem is a critical sub-component for any system providing parking assistance to users. We describe an approach to this problem that is currently in production in Google Maps, providing inferences in cities across the world. We prese...
Preprint
Full-text available
Learning image representations to capture fine-grained semantics has been a challenging and important task enabling many applications such as image search and clustering. In this paper, we present Graph-Regularized Image Semantic Embedding (Graph-RISE), a large-scale neural graph learning framework that allows us to train embeddings to discriminate...
Conference Paper
Full-text available
Sequential behavior such as sending emails, gathering in groups, tagging posts, or authoring academic papers may be characterized by a set of recipients, attendees, tags, or coauthors respectively. Such "sequences of sets" show complex repetition behavior, sometimes repeating prior sets wholesale, and sometimes creating new sets from partial copies...
Article
Many online social networks allow directed edges: Alice can unilaterally add an "edge" to Bob, typically indicating interest in Bob or Bob's content, without Bob's permission or reciprocation. In directed social networks we observe the rise of two distinctive classes of users: celebrities who accrue unreciprocated incoming links, and follow spammer...
Conference Paper
Multinomial logistic regression is a classical technique for modeling how individuals choose an item from a finite set of alternatives. This methodology is a workhorse in both discrete choice theory and machine learning. However, it is unclear how to generalize multinomial logistic regression to subset selection, allowing the choice of more than on...
Conference Paper
We study the problem of automatically and efficiently generating itineraries for users who are on vacation. We focus on the common case, wherein the trip duration is more than a single day. Previous efficient algorithms based on greedy heuristics suffer from two problems. First, the itineraries are often unbalanced, with excellent days visiting top...
Article
Full-text available
Significance: Scientific peer review has been a cornerstone of the scientific method since the 1600s. Debate continues regarding the merits of single-blind review, in which anonymous reviewers know the authors of a paper and their affiliations, compared with double-blind review, in which this information is hidden. We present an experimental study o...
Conference Paper
Artificial Intelligence has been present in literature at least since the ancient Greeks. Depictions present a wide range of perspectives of AI ranging from malefic overlords to depressive androids. Perhaps the most common recurring theme is the AI Assistant: C3PO from Star Wars; the Jetsons' Rosie the Robot; the benign hyper-efficient Minds of Iai...
Article
We introduce LAMP: the Linear Additive Markov Process. Transitions in LAMP may be influenced by states visited in the distant history of the process, but unlike higher-order Markov processes, LAMP retains an efficient parametrization. LAMP also allows the specific dependence on history to be learned efficiently from data. We characterize some theor...
Conference Paper
We introduce LAMP: the Linear Additive Markov Process. Transitions in LAMP may be influenced by states visited in the distant history of the process, but unlike higher-order Markov processes, LAMP retains an efficient parameterization. LAMP also allows the specific dependence on history to be learned efficiently from data. We characterize some theo...
Article
In this paper we study the implications for conference program committees of adopting single-blind reviewing, in which committee members are aware of the names and affiliations of paper authors, versus double-blind reviewing, in which this information is not visible to committee members. WSDM 2017, the 10th ACM International Conference on Web S...
Conference Paper
In this paper we propose and investigate a novel end-to-end method for automatically generating short email responses, called Smart Reply. It generates semantically diverse suggestions that can be used as complete email responses with just one tap on mobile. The system is currently used in Inbox by Gmail and is responsible for assisting with 10% of...
Conference Paper
Multinomial logistic regression is a powerful tool to model choice from a finite set of alternatives, but it comes with an underlying model assumption called the independence of irrelevant alternatives, stating that any item added to the set of choices will decrease all other items' likelihood by an equal fraction. We perform statistical tests of t...
Conference Paper
We study sequences of consumption in which the same item may be consumed multiple times. We identify two macroscopic behavior patterns of repeated consumptions. First, in a given user's lifetime, very few items live for a long time. Second, the last consumptions of an item exhibit growing inter-arrival gaps consistent with the notion of increasing...
Patent
Full-text available
Methods and apparatus for matching sets of text to objects are disclosed. In accordance with one embodiment, a set of text is obtained. For instance, the set of text may include a review. A numerical value is determined for each of a plurality of objects, where the numerical value indicates a likelihood that the corresponding one of t...
Article
In this work we study the dynamics of geographic choice, i.e., how users choose one from a set of objects in a geographic region. We postulate a model in which an object is selected from a slate of candidates with probability that depends on how far it is (distance) and how many closer alternatives exist (rank). Under a discrete choice formulation,...
Article
We consider the problem of inferring choices made by users based only on aggregate data containing the relative popularity of each item. We propose a framework that models the problem as that of inferring a Markov chain given a stationary distribution. Formally, we are given a graph and a target steady-state distribution on its nodes. We are also g...
Patent
An improved system and method for evolutionary clustering of sequential data sets is provided. A snapshot cost may be determined for representing the data set for a particular clustering method used and may determine the cost of clustering the data set independently of a series of clusterings of the data sets in the sequence. A history cost may als...
Patent
Full-text available
Methods and systems are described for navigating a corpus of content items stored in one or more information repositories within a distributed communications system. The content items may include video feeds, audio feeds, television broadcasts, website, a web log or the like. Using any browser application, the user views content items presented in...
Patent
Full-text available
The subject matter disclosed herein relates to a process for receiving, evaluating and selecting content modules such as content summary boxes and landing pages for display on a network-accessible search engine results page. In one particular example, potential content providers may be provided with incentives and guidelines for the preparation of...
Conference Paper
We study the patterns by which a user consumes the same item repeatedly over time, in a wide variety of domains ranging from check-ins at the same business location to re-watches of the same video. We find that recency of consumption is the strongest predictor of repeat consumption. Based on this, we develop a model by which the item from $t$ timestep...
Patent
Full-text available
An improved system and method for web destination profiling for online population-targeted advertising is provided. A web destination profiler may be provided for generating web destination profiles. Traffic may be analyzed at a particular web destination in order to understand the population visiting the web destination. The analysis of user traff...
Patent
Full-text available
An improved system and method for web destination profiling for online population-targeted advertising is provided. A web destination profiler may be provided for generating web destination profiles. Traffic may be analyzed at a particular web destination in order to understand the population visiting the web destination. The analysis of user traff...
Conference Paper
Full-text available
In this paper, we consider the natural arrival and departure of users in a social network, and ask whether the dynamics of arrival, which have been studied in some depth, also explain the dynamics of departure, which are not as well studied. Through study of the DBLP co-authorship network and a large online social network, we show that the dynamics...
Conference Paper
The phenomenal growth in the volume of easily accessible information via various web-based services has made it essential for service providers to provide users with personalized representative summaries of such information. Further, online commercial services including social networking and micro-blogging websites, e-commerce portals, leisure and...
Article
Did celebrity last longer in 1929, 1992 or 2009? We investigate the phenomenon of fame by mining a collection of news articles that spans the twentieth century, and also perform a side study on a collection of blog posts from the last 10 years. By analyzing mentions of personal names, we measure each person's time in the spotlight, using two simple...
Conference Paper
Social media has witnessed an explosive growth in the past few years. Wikipedia has over 3.5 million pages with descriptions of entities. Flickr members have uploaded over 5 billion photos, YouTube has 35 hours of videos uploaded to the site each minute, and Twitter users generate 65 million tweets a day. While some forms of social media like Wikip...
Conference Paper
We present a model of tabbed browsing that represents a hybrid between a Markov process capturing the graph of hyperlinks, and a branching process capturing the birth and death of tabs. We present a mathematical criterion to characterize whether the process has a steady state independent of initial conditions, and we show how to characterize the li...
Conference Paper
The NP-hard Max-k-cover problem requires selecting k sets from a collection so as to maximize the size of the union. This classic problem occurs commonly in many settings in web search and advertising. For moderately-sized instances, a greedy algorithm gives an approximation of (1-1/e). However, the greedy algorithm requires updating scores of arbi...
Conference Paper
Two-sided markets arise when two different types of users may realize gains by interacting with one another through one or more platforms or mediators. We initiate a study of the evolution of such markets. We present an empirical analysis of the value accruing to members of each side of the market, based on the presence of the other side. We codify t...
Conference Paper
In this paper, we undertake a large-scale study of online user behavior based on search and toolbar logs. We propose a new CCS taxonomy of pageviews consisting of Content (news, portals, games, verticals, multimedia), Communication (email, social networking, forums, blogs, chat), and Search (Web search, item search, multimedia search). We show that...
Conference Paper
Back in the heady days of 1999 and WWW8 (Toronto) we held a panel titled "Finding Anything in the Billion Page Web: Are Algorithms the Key?" In retrospect the answer to this question seems laughably obvious - the search industry has burgeoned on a foundation of algorithms, cloud computing and machine learning. As we move into the second decade of t...
Conference Paper
Full-text available
Graphs appear in several settings, like social networks, recommendation systems, computer communication networks, gene/protein biological networks, among others. A deep, recurring question is “What do real graphs look like?” That is, how can we separate real ones from synthetic or real graphs with masked portions? The main contribution of this pap...
Conference Paper
We develop a general method to match unstructured text reviews to a structured list of objects. For this, we propose a language model for generating reviews that incorporates a description of objects and a generic review language model. This mixture model gives us a principled method to find, given a review, the object most likely to be the t...
Conference Paper
We make the case for developing a web of concepts by starting with the current view of the web (comprised of hyperlinked pages, or documents, each seen as a bag of words), extracting concept-centric metadata, and stitching it together to create a semantically rich aggregate view of all the information available on the web for each concept instanc...
Conference Paper
Full-text available
In this paper we present a general framework to study sequences of search activities performed by a user. Our framework provides (i) a vocabulary to discuss types of features, models, and tasks, (ii) straightforward feature re-use across problems, (iii) realistic baselines for many sequence analysis tasks we study, and (iv) a simple mechanism to...
Conference Paper
We develop a generic method for the review matching problem, which is to match unstructured text reviews to a list of objects, where each object has a set of attributes. To this end, we propose a translation model for generating reviews from a structured description of objects. We develop an EM-based method to estimate the model parameters and use...
Article
In this paper we undertake a large-scale study of online user search behavior based on search and toolbar logs. We identify three types of search: web, multimedia, and item. Together, we show that these different flavors represent almost 10% of all online pageviews, and indirectly result in over 21% of all pageviews. We study search queries themsel...
Conference Paper
We address the problem of large-scale automatic detection of online reviews without using any human labels. We propose an efficient method that combines two basic ideas: Building a classifier from a large number of noisy examples and using the structure of the website to enhance the performance of this classifier. Experiments suggest that our metho...
Conference Paper
We address the problem of large-scale automatic detection of online reviews without using any human labels. We propose an efficient method that combines two basic ideas: Building a classifier from a large number of noisy examples and using the structure of the website to enhance the performance of this classifier. Experiments suggest that...
Conference Paper
By now, online social networks have become an indispensable part of both online and offline lives of human beings. A large fraction of time spent online by a user is directly influenced by the social networks to which he/she belongs. This calls for a deeper examination of social networks as large-scale dynamic objects that foster efficient person-pe...
Conference Paper
We present a detailed study of network evolution by analyzing four large online social networks with full temporal information about node and edge arrivals. For the first time at such a large scale, we study individual node arrival and edge creation processes that collectively lead to macroscopic properties of networks. Using a methodology based on...
Article
In this paper we propose a novel document retrieval model in which text queries are augmented with multi-dimensional taxonomy restrictions. These restrictions may be relaxed at a cost to result quality. This new model may be applicable in many arenas, including multifaceted, product, and local search, where documents are augmented with hierarchical...
Conference Paper
There is a growing need for ad-hoc analysis of extremely large data sets, especially at internet companies where innovation critically depends on being able to analyze terabytes of data collected every day. Parallel database products, e.g., Teradata, offer a solution, but are usually prohibitively expensive at this scale. Besides, many of the peopl...
Conference Paper
Full-text available
Given a dynamic corpus whose content and attention are changing on a daily basis, is it possible to collect and maintain the high-quality resources with a minimal investment? We address two problems that arise from this question for hyperlinked corpora such as Web pages or blogs: how to efficiently discover the correct set of authoritative resource...
Conference Paper
Full-text available
Online communities in the form of message boards, listservs, and newsgroups continue to represent a considerable amount of the social activity on the Internet. Every year thousands of groups flourish while others decline into relative obscurity; likewise, millions of members join a new community every year, some of whom will come to manage or mod...
Conference Paper
A recently proposed approach to address privacy concerns in storing web search querylogs is bundling logs of multiple users together. In this work we investigate privacy leaks that are possible even when querylogs from multiple users are bundled together, without any user or session identifiers. We begin by quantifying users' propensity to is...
Article
Important properties of users and objects will move from being tied to individual Web sites to being globally available. The conjunction of a global object model with portable user context will lead to a richer content structure and introduce significant shifts in online communities and information discovery.
Article
We consider the problem of visualizing the evolution of tags within the Flickr (flickr.com) online image sharing community. Any user of the Flickr service may append a tag to any photo in the system. Over the past year, users have on average added over a million tags each week. Understanding the evolution of these tags over time is therefore a chal...
Conference Paper
We present a family of measures of proximity of an arbitrary node in a directed graph to a pre-specified subset of nodes, called the anchor. Our measures are based on three different propagation schemes and two different uses of the connectivity structure of the graph. We consider a web-specific application of the above measures with two disjoint anc...
Conference Paper
We investigate the subtle cues to user identity that may be exploited in attacks on the privacy of users in web search query logs. We study the application of simple classifiers to map a sequence of queries into the gender, age, and location of the user issuing the queries. We then show how these classifiers may be carefully combined at multiple...
Conference Paper
Social networks are navigable small worlds, in which two arbitrary people are likely connected by a short path of intermediate friends that can be found by a "decentralized" routing algorithm using only local information. We develop a model of social networks based on an arbitrary metric space of points, with population density varying across the...
Conference Paper
In this paper, we consider the evolution of structure within large online social networks. We present a series of measurements of two such networks, together comprising in excess of five million people and ten million friendship links, annotated with metadata capturing the time of every event in the life of the network. Our measurements expose a su...
Article
This article describes the CLEVER search system developed at the IBM Almaden Research Center. We present a detailed and unified exposition of the various algorithmic components that make up the system, and then present results from two user studies.
Chapter
Introduction; Related Work; Finding the densest subgraph; Trawling; Graph Shingling; Connection Subgraphs; Conclusions; References
Conference Paper
Full-text available
We consider the problem of clustering data over time. An evolutionary clustering should simultaneously optimize two potentially conflicting criteria: first, the clustering at any point in time should remain faithful to the current data as much as possible; and second, the clustering should not shift dramatically from one timestep to the next. We pr...
Conference Paper
Full-text available
We consider the problem of estimating the size of a collection of documents using only a standard query interface. Our main idea is to construct an unbiased and low-variance estimator that can closely approximate the size of any set of documents defined by certain conditions, including that each document in the set must match at least one query f...