About
39
Publications
3,977
Reads
How we measure 'reads'
A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more
846
Citations
Publications
Publications (39)
This study presents a systematic comparison of methods for individual treatment assignment. We group the various methods proposed in the literature into three general classes of algorithms (or metalearners): learning models to predict outcomes (the O-learner), learning models to predict causal effects (the E-learner), and learning models to predict...
A challenge that machine learning practitioners in the industry face is the task of selecting the best model to deploy in production. As a model is often an intermediate component of a production system, online controlled experiments such as A/B tests yield the most reliable estimation of the effectiveness of the whole system, but can only compare...
Users of music streaming, video streaming, news recommendation, and e-commerce services often engage with content in a sequential manner. Providing and evaluating good sequences of recommendations is therefore a central problem for these services. Prior reweighting-based counterfactual evaluation methods either suffer from high variance or make str...
We present a systematic analysis of causal treatment assignment decision making, a general problem that arises in many applications and has received significant attention from economists, computer scientists, and social scientists. We focus on choosing, for each user, the best algorithm for playlist generation in order to optimize engagement. We ch...
It remains unknown whether personalized recommendations increase or decrease the diversity of content people consume. We present results from a randomized field experiment on Spotify testing the effect of personalized recommendations on consumption diversity. In the experiment, both control and treatment users were given podcast recommendations, wi...
Instant search has become a popular search paradigm in which users are shown a new result page in response to every keystroke triggered. Over recent years, the paradigm has been widely adopted in several domains including personal email search, e-commerce, and music search. However, the topic of evaluation and metrics of such systems has been less...
Music listening is a commonplace activity that has transformed as users engage with online streaming platforms. When presented with anytime, anywhere access to a vast catalog of music, users face challenges in searching for what they want to hear. We propose that users who engage in domain-specific search (e.g., music search) have different informa...
Evaluating algorithmic recommendations is an important, but difficult, problem. Evaluations conducted offline using data collected from user interactions with an online system often suffer from biases arising from the user interface or the recommendation engine. Online evaluation (A/B testing) can more easily address problems of bias, but depending...
Recently, there has been considerable interest in the use of historical logged user interaction data—queries and clicks—for evaluation of search systems in the context of counterfactual analysis [8,10]. Recent approaches attempt to de-bias the historical log data by conducting randomization experiments and modeling the bias in user behavior. Thus f...
We present an Information Retrieval framework that leverages Heterogeneous Information Network (HIN) embeddings for contextual suggestion. Our method represents users, documents and other context-related documents as heterogeneous objects in a HIN. Using meta-paths, selected based on domain knowledge, we create graph embeddings from this network, t...
We investigate the use of logged user interaction data---queries and clicks---for offline evaluation of new search systems in the context of counterfactual analysis. The challenge of evaluating a new ranker against log data collected from a static production ranker is that new rankers may retrieve documents that have never been seen in the logs bef...
Many conversational agents (CAs) are developed to answer users' questions in a specialized domain. In everyday use of CAs, user experience may extend beyond satisfying information needs to the enjoyment of conversations with CAs, some of which represent playful interactions. By studying a field deployment of a Human Resource chatbot, we report on u...
The task of onboarding a new hire consumes great amounts of resources from organizations. The faster a “newbie” becomes an “insider”, the higher the chances of job satisfaction, retention, and advancement in their position. Conversational agents (AI agents) have the potential to effectively transform productivity in many enterprise workplace scenar...
Clinical data access involves complex, but opaque communication between medical researchers and query analysts. Understanding such communication is indispensable for designing intelligent human-machine dialog systems that automate query formulation. This study investigates email communication and proposes a novel scheme for classifying dialog acts...
Terminologies can suffer from poor concept coverage due to delays in addition of new concepts. This study tests a similarity-based approach to recommending concepts from a text corpus to a terminology. Our approach involves extraction of candidate concepts from a given text corpus, which are represented using a set of features. The model learns the...
The Generalizability Index for Study Traits (GIST) has been proposed recently for assessing the population representativeness of a set of related clinical trials using eligibility features (e.g., age or BMI), one each time. However, GIST has not yet been evaluated. To bridge this knowledge gap, this paper reports a simulation-based validation study...
Different users may be attempting to satisfy different information needs while providing the same query to a search engine. Addressing that issue is addressing Novelty and Diversity in information retrieval. Novelty and Diversity search task models the task wherein users are interested in seeing more and more documents that are not only relevant, b...
Novel and diverse document ranking is an effective strategy that involves reducing redundancy in a ranked list to maximize the amount of novel and relevant information available to users. Evaluation for novelty and diversity typically involves an assessor judging each document for relevance against a set of pre-identified subtopics, which may be di...
The notion of relevance differs between assessors, thus giving rise to assessor disagreement. Although assessor disagreement has been frequently observed, the factors leading to disagreement are still an open problem. In this paper we study the relationship between assessor disagreement and various topic independent factors such as readability and...
Assessors are well known to disagree frequently on the relevance of documents to a topic, but the factors leading to assessor disagreement are still poorly understood. In this paper, we examine the relationship between the rank at which a document is returned by a set of retrieval systems and the likelihood that a second assessor will disagree with...
Recently, researchers have shown interest in the use of preference judgments for evaluation in IR literature. Although preference judgments have several advantages over absolute judgment, one of the major disadvantages is that the number of judgments needed increases polynomially as the number of documents in the pool increases. We propose a novel...
There has been considerable interest in incorporating diversity in search results to account for redundancy and the space of possible user needs. Most work on this problem is based on subtopics: diversity rankers score documents against a set of hypothesized subtopics, and diversity rankings are evaluated by assigning a value to each ranked documen...
A set of words is often insufficient to express a user's information need. In order to account for various information needs associated with a query, diversification seems to be a reasonable strategy. By diversifying the result set, we increase the probability of results being relevant to the user's information needs when the given query is ambiguo...
Traditional models of information retrieval assume documents are independently relevant. But when the goal is retrieving diverse or novel information about a topic, retrieval models need to capture dependencies between documents. Such tasks require alternative evaluation and optimization methods that operate on different types of relevance judgment...
Evaluation measures play a vital role in analyzing the per-formance of a system, comparing two or more systems, and optimizing systems to perform some task. In this paper, we analyze and highlight the strengths and weaknesses of commonly used measures for evaluating the diversity in search results. We compare MAP-IA, α-nDCG, and ERR-IA using data f...