Haggai Roitman

IBM Research - Haifa · Cognitive Computing

Doctor of Philosophy

100 publications
12,229 reads
1,042 citations

Publications (100)
Preprint
Full-text available
A frequent pattern in customer care conversations is the agents responding with appropriate webpage URLs that address users' needs. We study the task of predicting the documents that customer care agents can use to facilitate users' needs. We also introduce a new public dataset which supports the aforementioned problem. Using this dataset and two o...
Article
Schema matching is a process that serves in integrating structured and semi-structured data. Being a handy tool in multiple contemporary business and commerce applications, it has been investigated in the fields of databases, AI, Semantic Web, and data mining for many years. The core challenge still remains the ability to create quality algorithmic...
Preprint
Researchers and students face an explosion of newly published papers which may be relevant to their work. This has led to a trend of sharing human summaries of scientific papers. We analyze the summaries shared on one such platform, Shortscience.org. The goal is to characterize human summaries of scientific papers, and use some of the insights obta...
Article
Schema matching is at the heart of integrating structured and semi-structured data with applications in data warehousing, data analysis recommendations, Web table matching, etc. Schema matching is known as an uncertain process and a common method to overcome this uncertainty introduces a human expert with a ranked list of possible schema matches to...
Conference Paper
The usage of passage-level information has been successfully demonstrated in many core IR tasks, and among such tasks, the task of passage-based document retrieval. In this work, we study the merits of utilizing similar information for the fusion-based document retrieval task. Overall, we show that such information can be highly useful for this tas...
Conference Paper
We study a constrained retrieval setting in which either a single qualitative answer is provided as a response to a user-query or none. Given a user-query and the "best" answer that was retrieved from the underlying search engine, we wish to determine whether or not to accept it. To address this challenge, we propose an answer quality determination...
Preprint
Full-text available
We present a novel system providing summaries for Computer Science publications. Through a qualitative user study, we identified the most valuable scenarios for discovery, exploration and understanding of scientific documents. Based on these findings, we built a system that retrieves and summarizes scientific documents for a given information need,...
Preprint
Full-text available
We study the use of BERT for non-factoid question-answering, focusing on the passage re-ranking task under varying passage lengths. To this end, we explore the fine-tuning of BERT in different learning-to-rank setups, comprising both point-wise and pair-wise methods, resulting in substantial improvements over the state-of-the-art. We then analyze t...
Conference Paper
The query performance prediction (QPP) task is to estimate retrieval effectiveness in the absence of relevance judgments. Prior work has focused on prediction for retrieval methods based on surface level query-document similarities (e.g., query likelihood). We address the prediction challenge for pseudo-feedback-based retrieval methods which utilize...
Conference Paper
We revisit the Normalized Query Commitment (NQC) query performance prediction (QPP) method. To this end, we suggest a scaled extension to a discriminative QPP framework and use it to analyze NQC. Using this analysis allows us to redesign NQC and suggest several options for improvement.
Preprint
We suggest a new idea of Editorial Network - a mixed extractive-abstractive summarization approach, which is applied as a post-processing step over a given sequence of extracted sentences. Our network tries to imitate the decision process of a human editor during summarization. Within such a process, each extracted sentence may be either kept untou...
Preprint
This work presents a general query term weighting approach based on query performance prediction (QPP). To this end, a given term is weighed according to its predicted effect on query performance. Such an effect is assumed to be manifested in the responses made by the underlying retrieval method for the original query and its (simple) variants in t...
Preprint
Full-text available
Advances in Web technology enable personalization proxies that assist users in satisfying their complex information monitoring and aggregation needs through the repeated querying of multiple volatile data sources. Such proxies face a scalability challenge when trying to maximize the number of clients served while at the same time fully satisfying c...
Preprint
We propose Dual-CES -- a novel unsupervised, query-focused, multi-document extractive summarizer. Dual-CES is designed to better handle the tradeoff between saliency and focus in summarization. To this end, Dual-CES employs a two-step dual-cascade optimization approach with saliency-based pseudo-feedback distillation. Overall, Dual-CES significantl...
Conference Paper
We study the query performance prediction (QPP) task for fusion-based retrieval. Within such a retrieval setting, several ranked lists, each one retrieved by a different method, are combined into a single (fused) ranked list. A common prediction approach is to treat the (base) ranked lists as reference lists and combine those lists' QPP estimates a...
Conference Paper
We show that document-level post-retrieval query performance prediction (QPP) methods are mostly suited for short query prediction tasks; such methods perform significantly worse in verbose (long and informative) query prediction settings. To address the prediction quality gap among query lengths, we propose a novel passage-level post-retrieval QPP...
Conference Paper
The usage of positive relevance feedback in fusion-based retrieval was previously shown to be very useful. Yet, in many retrieval use-cases, no actual relevance feedback may be available. In the absence of relevance data, pseudo-relevance feedback models have been suggested as an alternative. Encouraged by the previous success of using positive r...
Conference Paper
Most collaborative filtering models assume that the interaction of users with items takes a single form, e.g., only ratings or clicks or views. In fact, in most real-life recommendation scenarios, users interact with items in diverse ways. This, in turn, generates complex usage data that contains multiple and diverse types of user feedback. In additi...
Conference Paper
This work studies the merits of using query-drift analysis for search re-ranking. A relationship between the ability to predict the quality of a result list retrieved by an arbitrary method, as manifested by its estimated query-drift, and the ability to improve that method's initial retrieval by re-ranking documents in the list based on such predic...
Conference Paper
We focus on the post-retrieval query performance prediction (QPP) task. Specifically, we make a new use of passage information for this task. Using such information, we derive a new mean score calibration predictor that provides a more accurate prediction. Using an empirical evaluation over several common TREC benchmarks, we show that QPP methods t...
Conference Paper
In this work we explore relationships between human and algorithmic schema matchers. We provide a novel approach to similar schema matchers termed coordinated matchers and use it to predict future human matching choices. We show through a comprehensive analysis that human matchers are usually coordinated with intuitive algorithms, e.g., based on...
Conference Paper
We propose to demonstrate SummIt -- a tool for extractive summarization, discovery and analysis. The main goal of SummIt is to provide consumable summaries that are driven by users' information intents. To this end, SummIt discovers and analyzes potential intents that can be used for summarization. Given an intent, SummIt generates a summary based...
Conference Paper
We study the problem of item recommendation using complex usage data. We assume that users may interact with items in various ways, each such interaction generates a usage point which may be accompanied with multiple feedback types. In addition, each user may interact with each item multiple times. We propose a generic framework that re-models the...
Conference Paper
We derive a robust standard deviation estimator for post-retrieval query performance prediction. To this end, we propose a novel bootstrap sampling approach which is inspired by user search behavior. Using an evaluation with several TREC benchmarks and a comparison with several different types of baselines, we demonstrate that, overall, our estimat...
Conference Paper
Full-text available
We study the problem of mean retrieval score estimation for query performance prediction (QPP). We propose an enhanced estimator which estimates the mean based on calibrated retrieval scores. Each document score is adjusted based on features that model potential tradeoffs that may exist in the retrieval process of that specific document. Using the...
Conference Paper
In this work we address the problem of Multi-Label Text Quantification. To this end, for a given collection of documents, each pre-classified with one or more labels by some multi-label classifier, our goal is to estimate the cardinality of each actual label set as accurately as possible. We present two enhanced Probabilistic Classify...
Conference Paper
We address the problem of query performance prediction (QPP) using reference lists. To date, no previous QPP method has been fully successful in generating and utilizing several pseudo-effective and pseudo-ineffective reference lists. In this work, we try to fill the gaps. We first propose a novel unsupervised approach for generating and selecting...
Conference Paper
The session search task aims at best serving the user's information need given her previous search behavior during the session. We propose an extended relevance model that captures the user's dynamic information need in the session. Our relevance modelling approach is directly driven by the user's query reformulation (change) decisions and the esti...
Conference Paper
We present a novel unsupervised query-focused multi-document summarization approach. To this end, we generate a summary by extracting a subset of sentences using the Cross-Entropy (CE) Method. The proposed approach is generic and requires no domain knowledge. Using an evaluation over DUC 2005-2007 datasets with several other state-of-the-art baseli...
Article
The session search task aims at best serving the user's information need given her previous search behavior during the session. We propose an extended relevance model that captures the user's dynamic information need in the session. Our relevance modelling approach is directly driven by the user's query reformulation (change) decisions and the esti...
Conference Paper
In this paper we address a novel retrieval problem we term the "Better Than This" problem. For a given pair of a user query to be answered by some search engine and a single example answer provided by the user that may or may not be a correct answer to the query, we determine whether or not there exists some better answer within the search engine....
Conference Paper
This work presents a novel claim-oriented document retrieval task. For a given controversial topic, relevant articles containing claims that support or contest the topic are retrieved from a Wikipedia corpus. For that, a two-step retrieval approach is proposed. At the first step, an initial pool of articles that are relevant to the topic are retrie...
Conference Paper
Ontology & schema matching predictors assess the quality of matchers in the absence of an exact match. We propose MCD (Match Competitor Deviation), a new diversity-based predictor that compares the strength of a matcher confidence in the correspondence of a concept pair with respect to other correspondences that involve either concept. We also prop...
Patent
Full-text available
Method, system, and computer program product for automatic generation of a word-cloud for a content item are provided. The method includes: extracting terms from a content item using statistical selection criteria; weighting a term by a probability that the term is used as a tag; and generating a visual representation of terms with enhanced represe...
Article
Full-text available
This work addresses the problem of detecting topic-based influencers in social media. To that end, we devise a novel behavioral model of authors and readers, where authors try to influence readers by generating "attractive" content, which is both relevant and unique, and readers can become authors themselves by further citing...
Article
Full-text available
Understanding customer behavior in brick-and-mortar stores and other physical indoor venues is essential for any business aiming to provide a more personal and compelling shopping experience, optimize store layout, and improve store operations. Achieving these goals ultimately leads to improved user experience, conversion rates, and increased reven...
Article
We present a novel unsupervised approach to re-ranking an initially retrieved list. The approach is based on the Cross Entropy method applied to permutations of the list, and relies on performance prediction. Using pseudo predictors we establish a lower bound on the prediction quality that is required so as to have our approach significantly outper...
Article
We present a novel approach to the cluster labeling task using fusion methods. The core idea of our approach is to weigh labels, suggested by any labeler, according to the estimated labeler's decisiveness with respect to each of its suggested labels. We hypothesize that a cluster labeler's labeling choice for a given cluster should remain stable e...
Patent
Full-text available
Method, system, and computer program product for indexing and searching entity-relationship data are provided. The method includes: defining a logical document model for entity-relationship data including: representing an entity as a document containing the entity's searchable content and metadata; dually representing the entity as a document and a...
Conference Paper
Full-text available
Measuring the effectiveness of marketing campaigns across different channels is one of the most challenging tasks for today's brand marketers. Such measurement usually relies on a combination of key performance indicators (KPIs), used for assessing various aspects of marketing outcomes. Recently, with the availability of social-media sources, new o...
Conference Paper
Measuring the effectiveness of marketing efforts across various channels is a challenging task. Such measurement usually relies on a combination of key performance indicators (KPIs) used to assess marketing outcomes. In this work we present the Multi-channel Marketing Monitoring Platform (M3P). M3P harnesses the crowds (people) as sources for effec...
Conference Paper
Social communities play an important role in many domains. While a lot of attention has been given to developing efficient methods for detecting and analyzing social communities, it still remains a great challenge to provide intuitive search interfaces for end-users who wish to discover and explore such communities. Trying to fill the gaps, in this...
Patent
Full-text available
A graphical user interface generation system offers a management module that displays GUI elements and a visual indicator in an editing window. The visual indicator is movable in the editing window, which has at least two panels and a divider between the panels. A configuration history of the divider including at least one prior location of the div...
Patent
Full-text available
Method, system, and computer program product for faceted search with relationships between categories are provided. The method includes: having a document set of multiple documents, each document having associated categories to which it belongs; grouping multiple categories associated with a document into a category set based on a relationship betw...
Conference Paper
In this paper we propose a novel framework for modeling the uniqueness of the user preferences for recommendation systems. User uniqueness is determined by learning to what extent the user's item preferences deviate from those of an "average user" in the system. Based on this framework, we suggest three different recommendation strategies that trad...
Article
Full-text available
In this paper, we present one possible way of analyzing social media conversational data in order to better understand customers. Ultimately, our goal is to analyze customer behavior as it is expressed in free-form conversations and extract from it commercially valuable information about the customer. In this study, we concentrate on using statistica...
Patent
Full-text available
A method and system for visualizing shared route information are provided. The method includes receiving a route query from a user and retrieving multiple route results for the query for display as an overlay on a map. The method further includes processing the route results for display by dividing each route result into sub-routes, wherein a sub-r...
Article
Full-text available
The First Joint International Workshop on Entity-oriented and Semantic Search (JIWES) was held on Aug 16, 2012 in Portland, Oregon, USA, in conjunction with the 35th Annual International ACM SIGIR Conference (SIGIR 2012). The objective for the workshop was to bring together academic researchers and industry practitioners working on entity-...
Conference Paper
Full-text available
In this work we discuss the challenge of harnessing the crowd for smart city sensing. Within a city's context, such reports by citizen or city visitor eyewitnesses may provide important information to city officials, in addition to more traditional data gathered by other means (e.g., through the city's control center, emergency services, sensors...
Conference Paper
In this work we discuss the challenges of utilizing social media data, and more specifically microblogs, for helping brand managers. Brand perception is one of the most important tasks of a brand manager, requiring an understanding of how customers perceive and select brands in specific product categories or market segments. While understanding the brand...
Conference Paper
This paper provides an overview of the 1st International Workshop on Multimodal Crowd Sensing (CrowdSens 2012), held at the 21st ACM International Conference on Information and Knowledge Management (CIKM 2012). This workshop aimed to provide an open forum for researchers from various fields such as Natural Language Processing, Inform...
Conference Paper
In this work we argue that two main gaps currently hinder the development of new applications requiring sophisticated data discovery capabilities over rich (semi-structured) entity-relationship data. The first gap exists at the conceptual level, and the second at the logical level. Aiming to fill the identified gaps, we propose a novel method...
Article
Many new socially flavored medical services have recently emerged, utilizing the data openness and sharing through social channels. The adoption of such services by patients is still very limited, mainly due to privacy issues. Existing social-medical discovery services support only strict patient privacy policies and are not flexible enough to acco...
Article
We propose to demonstrate an end-to-end framework for leveraging time-sensitive and critical social media information for businesses. More specifically, we focus on identifying, structuring, integrating, and exposing timely insights that are essential to marketing services and monitoring reputation over social media. Our system includes components...
Article
Full-text available
In this paper we describe a novel approach for exploratory search over rich entity-relationship data that utilizes a unique combination of expressive, yet intuitive, query language, faceted search, and graph navigation. We describe an extended faceted search solution which allows indexing, searching, and browsing rich entity-relationship data. We report...
Article
To enable true human/computer collaboration, knowledge needs to be freely communicated between the forms that each sort of intelligent agent finds most useful (text, speech and images, notably for people; logic, program fragments, probabilities, databases ...
Conference Paper
This paper summarizes the details of the first international workshop on search and mining entity-relationship data. This workshop will bridge between IR, DB, and KM researchers to seek novel solutions for search and data mining of rich entity-relationship data and their applications in various domains. We first provide an overview about the worksh...
Conference Paper
Full-text available
In this work we study the task of term extraction for word cloud generation. We present a folksonomy-based term extraction method, called tag-boost, which boosts terms that are frequently used by the public to tag content. Our experiments with tag-boost-based term extraction over different domains demonstrate tremendous improvement in word cloud qu...
Conference Paper
In this demo we shall present the IBM Patient Empowerment System (PES), and more specifically, its social-medical discovery sub-system. Social and medical data are represented using entities and relationships and are explored using a combination of expressive, yet intuitive, query language, faceted search, and ER graph navigation. While this demons...
Article
Full-text available
A variety of emerging online data delivery applications challenge existing techniques for data delivery to human users, applications, or middleware that are accessing data from multiple autonomous servers. In this paper, we develop a framework for formalizing and comparing pull-based solutions and present dual optimization approaches. The first app...
Article
Full-text available
In this paper we describe a novel social-medical discovery solution, based on an idea of social and medical data unification. Built on foundations of exploratory search technologies, the proposed discovery solution is better tailored for the social-medical discovery task. We then describe its implementation within the IBM Medics system and discuss...
Article
Adverse drug events (ADEs) have significant implications for patient safety and are recognized as a major cause of fatalities and hospital expenses. Although some medical systems today can help reduce the number of ADE occurrences, these primarily take into account clinical factors, even though recent studies show the significance of genetic profiles in...
Article
Social bookmarking enables knowledge sharing and efficient discovery on the web, where users can collaborate by tagging documents of interest. A lot of attention has lately been given to utilizing social bookmarking data to enhance traditional IR tasks. Yet, much less attention has been given to the problem of estimating the effectiveness of an in...
Conference Paper
Architectures and middleware for event delivery face scalability challenges in providing up-to-date events to meet clients' specifications. The use of proxy middleware is a common practice for increasing scalability. Proxies can aggregate client specifications, taking advantage of similar needs to reduce communication overhead, workload on event so...
Conference Paper
Full-text available
This work deals with the task of predicting the popularity of user-generated content. We demonstrate how the novelty of newly published content plays an important role in affecting its popularity. We study three dimensions of novelty: contemporaneous novelty, self novelty, and discussion novelty. We demonstrate the contribution of the new novelty m...
Conference Paper
In this paper we describe a novel approach of increasing patients' safety using explanation-driven personalized content recommendation. In this approach, patients are exposed to relevant medical information that is continuously gathered from various web sources and timely delivered, educating patients toward better preventive medicine decision maki...
Article
Full-text available
In this work we present the details of a large scale user profiling framework that we developed here in IBM on top of Apache Hadoop. We address the problem of extracting and maintaining a very large number of user profiles from large scale data. We first describe an efficient user profiling framework with high user profiling quality guarantees. W...
Article
Full-text available
Computer server management is an important component of the global IT (information technology) services business. The providers of server management services face unrelenting efficiency challenges in order to remain competitive with other providers. Server system administrators (SAs) represent the majority of the workers in this industry, and their...
Conference Paper
Full-text available
Web monitoring 2.0 supports the complex information needs of clients who probe multiple information sources and generate mashups by integrating across these volatile streams. A proxy that aims at satisfying multiple customized client profiles will face a scalability challenge in trying to maximize the number of clients served while at the same time...
Article
We consider a novel problem of top-k query processing under budget constraints. We provide both a framework and a set of algorithms to address this problem. Existing algorithms for top-k processing are budget-oblivious, i.e., they do not take budget constraints into account when making scheduling decisions, but focus on the performance to compute t...