# Victor Lavrenko's research while affiliated with The University of Edinburgh and other places

## Publications (76)

Article
We explore the relation between classical probabilistic models of information retrieval and the emerging language modeling approaches. It has long been recognized that the primary obstacle to effective performance of classical models is the need to estimate a relevance model: probabilities of words in the relevant class. We propose a novel techniqu...
Conference Paper
In this paper we explore the impact of processing unbounded data streams on First Story Detection (FSD) accuracy. In particular, we study three different types of FSD algorithms: comparison-based, LSH-based and k-term based FSD. Our experiments reveal for the first time that the novelty scores of all three algorithms decay over time. We explain why...
Article
Rumour detection is hard because the most accurate systems operate retrospectively, only recognizing rumours once they have collected repeated signals. By then the rumours might have already spread and caused harm. We introduce a new category of features based on novelty, tailored to detect rumours early on. To compensate for the absence of repeate...
Article
Relevance Models are well-known retrieval models and capable of producing competitive results. However, because they use query expansion they can be very slow. We address this slowness by incorporating two variants of locality sensitive hashing (LSH) into the query expansion process. Results on two document collections suggest that we can obtain la...
Conference Paper
In this paper we propose Regularised Cross-Modal Hashing (RCMH), a new cross-modal hashing model that projects annotation and visual feature descriptors into a common Hamming space. RCMH optimises the hashcode similarity of related data-points in the annotation modality using an iterative three-step hashing algorithm: in the first step each training...
Conference Paper
Hashing has witnessed an increase in popularity over the past few years due to the promise of compact encoding and fast query time. In order to be effective, hashing methods must maximally preserve the similarity between the data points in the underlying binary representation. The current best performing hashing techniques have utilised supervision....
Article
Tracking topics on social media streams is non-trivial as the number of topics mentioned grows without bound. This complexity is compounded when we want to track such topics against other fast moving streams. We go beyond traditional small scale topic tracking and consider a stream of topics against another document stream. We introduce two trackin...
Article
In this paper, we introduce a new form of the continuous relevance model (CRM), dubbed the SKL-CRM, that adaptively selects the best performing kernel per feature type for automatic image annotation. Previous image annotation models apply a standard selection of kernels to model the distribution of image features. Popular examples include a Gaussia...
Conference Paper
In this paper we introduce a sparse kernel learning framework for the Continuous Relevance Model (CRM). State-of-the-art image annotation models linearly combine evidence from several different feature types to improve image annotation accuracy. While previous authors have focused on learning the linear combination weights for these features, there...
Conference Paper
We introduce a scheme for optimally allocating a variable number of bits per LSH hyperplane. Previous approaches assign a constant number of bits per hyperplane. This neglects the fact that a subset of hyperplanes may be more informative than others. Our method, dubbed Variable Bit Quantisation (VBQ), provides a data-driven non-uniform bit allocati...
Conference Paper
We introduce a scheme for optimally allocating multiple bits per hyperplane for Locality Sensitive Hashing (LSH). Existing approaches binarise LSH projections by thresholding at zero, yielding a single bit per dimension. We demonstrate that this is a sub-optimal bit allocation approach that can easily destroy the neighbourhood structure in the origi...
Article
Twitter has become a major source of data for social media researchers. One important aspect of Twitter not previously considered is deletions: the removal of tweets from the stream. Deletions can be due to a multitude of reasons such as privacy concerns, rashness or attempts to undo public statements. We show how deletions can be automaticall...
Conference Paper
First story detection (FSD) involves identifying first stories about events from a continuous stream of documents. A major problem in this task is the high degree of lexical variation in documents which makes it very difficult to detect stories that talk about the same event but expressed using different words. We suggest using paraphrases to allev...
Article
Relevance feedback is one method for creating a 'virtuous cycle' - as put by Baeza-Yates - between semantics and search. Previous approaches to search have generally considered the Semantic Web and hypertext Web search to be entirely disparate, indexing and searching over different domains. While relevance feedback has traditionally improved inform...
Conference Paper
We investigate the possibility of using structured data to improve search over unstructured documents. In particular, we use relevance feedback to create a 'virtuous cycle' between structured data from the Semantic Web and web-pages from the hypertext Web. Previous approaches have generally considered searching over the Semantic Web and hypertext W...
Conference Paper
With the recent rise in popularity and size of social media, there is a growing need for systems that can extract useful information from this amount of data. We address the problem of detecting new events from a stream of Twitter posts. To make event detection feasible on web-scale corpora, we present an algorithm based on locality-sensiti...
Conference Paper
We describe the first release of our corpus of 97 million Twitter posts. We believe that this data will prove valuable to researchers working in social media, natural language processing, large-scale data processing, and similar areas.
Conference Paper
Pseudo-relevance feedback (PRF) improves search quality by expanding the query using terms from high-ranking documents from an initial retrieval. Although PRF can often result in large gains in effectiveness, running two queries is time consuming, limiting its applicability. We describe a PRF method that uses corpus pre-processing to achieve query-...
Article
Relevance feedback is one method for creating a 'virtuous cycle' - as put by Baeza-Yates - between semantics and search. Previous approaches to search have generally considered the Semantic Web and hypertext Web search to be entirely disparate, indexing and searching over different domains. While relevance feedback has traditionally improved infor...
Conference Paper
We demonstrate how user ratings can be accurately predicted from a set of tags assigned to a book on a social-networking site. Since a newly-published book is unlikely to have social-tags already assigned to it, we describe a probabilistic model for inferring the most probable tags from the text of the book. We evaluate the proposed approach on a n...
Article
Legal argument is based on the facts of a case as well as legal issues, concepts and factors. We assess the feasibility of using semantic events extracted from court judgements to adequately represent the legal concepts and factual content of cases for legal information retrieval (IR). Results of a preliminary study show extracted events are attrib...
Conference Paper
We explore the problem of discovering multiple missing values in a semi-structured database. For this task, we formally develop the Structured Relevance Model (SRM), built on a hypothetical generative model for semi-structured records. SRM is based on the idea that plausible values for a given field could be inferred from the context provided by the o...
Conference Paper
We explore the problem of retrieving semi-structured documents from a real-world collection using a structured query. We formally develop Structured Relevance Models (SRM), a retrieval model that is based on the idea that plausible values for a given field could be inferred from the context provided by the other fields in the record. We then car...
Conference Paper
Ranking documents or sentences according to both topic and sentiment relevance should serve a critical function in helping users when topics and sentiment polarities of the targeted text are not explicitly given, as is often the case on the web. In this paper, we propose several sentiment information retrieval models in the framework of proba...
Conference Paper
We are interested in the problem of understanding the connections between human activities and the content of textual information generated in regard to those activities. Firstly, we define and motivate this problem as an important part in making sense of various life events. Secondly, we introduce the domain of massive online collaborative envir...
Conference Paper
We apply a continuous relevance model (CRM) to the problem of directly retrieving the visual content of videos using text queries. The model computes a joint probability model for image features and words using a training set of annotated images. The model may then be used to annotate unseen test images. The probabilistic annotations are used for r...
Article
In language modeling, it is nearly always assumed that documents are generated by sampling from a multinomial distribution. Many formal methods of estimation exist for the multinomial case. In this paper, we reexamine language models based on a multiple-Bernoulli distribution. This assumption has been explored in the past, but has never been formal...
Article
We propose an approach to learning the semantics of images which allows us to automatically annotate an image with keywords and to retrieve images based on text queries. We do this using a formalism that models the generation of annotated images. We assume that every image is divided into regions, each described by a continuous-valued feature vecto...
Conference Paper
Most offline handwriting recognition approaches proceed by segmenting words into smaller pieces (usually characters) which are recognized separately. The recognition result of a word is then the composition of the individually recognized parts. Inspired by results in cognitive psychology, researchers have begun to focus on holistic word recognition...
Conference Paper
Retrieving images in response to textual queries requires some knowledge of the semantics of the picture. Here, we show how we can do both automatic image annotation and retrieval (using one word queries) from images and videos using a multiple Bernoulli relevance model. The model assumes that a training set of images or videos along with keyword a...
Conference Paper
Topic tracking is complicated when the stories in the stream occur in multiple languages. Typically, researchers have trained only English topic models because the training stories have been provided in English. In tracking, non-English test stories are then machine translated into English to compare them with the topic models. We propose a native...
Conference Paper
Many museum and library archives are digitizing their large collections of handwritten historical manuscripts to enable public access to them. These collections are only available in image formats and require expensive manual annotation work for access to them. Current handwriting recognizers have word error rates in excess of 50% and therefore can...
Article
Handwritten historical document collections in libraries and other areas are often of interest to researchers, students or the general public. Convenient access to such corpora generally requires an index, which allows one to locate individual text units (pages, sentences, lines) that are relevant to a given query (usually provided as text). Severa...
Article
We describe a demonstration system built upon Topic Detection and Tracking (TDT) technology. The demonstration system monitors a stream of news stories, organizes them into clusters that represent topics, presents the clusters to a user, and visually describes the changes that occur in those clusters over time. A user may also mark certain clusters...
Article
Recent interest in the area of music information retrieval and related technologies is exploding. However, very few of the existing techniques take advantage of recent developments in statistical modeling. In this paper we discuss an application of Random Fields to the problem of creating accurate yet flexible statistical models of polyphonic music...
Article
Libraries have traditionally used manual image annotation for indexing and then later retrieving their image collections. However, manual image annotation is an expensive and labor intensive procedure and hence there has been great interest in coming up with automatic ways to retrieve images based on content. Here, we propose an automatic approach...
Article
In this report we discuss an application of Random Fields to the problem of statistical modeling of polyphonic music. With such models in hand, the challenges of developing effective searching, browsing, and organization techniques for the growing bodies of music collections may be successfully met.
Article
We describe the one-month (June 2003) effort to create a topic detection and tracking (TDT) system to support news stories in Hindi. The University of Massachusetts submitted results for three different TDT tasks in the DARPA surprise language evaluation. The official task was topic tracking, but we also provided results for the new event detection...
Article
Information retrieval (IR) research has reached a point where it is appropriate to assess progress and to define a research agenda for the next five to ten years. This report summarizes a discussion of IR research challenges that took place at a recent workshop. The attendees of the workshop considered information retrieval research in a range of a...
Conference Paper
We propose an approach to learning the semantics of images which allows us to automatically annotate an image with keywords and to retrieve images based on text queries. We do this using a formalism that models the generation of annotated images. We assume that every image is divided into regions, each described by a continuous-valued feature v...
Article
We develop a simple statistical model, called a relevance model, for capturing the notion of topical relevance in information retrieval. Estimating probabilities of relevance has been an important part of many previous retrieval models, but we show how this estimation can be done in a more principled way based on a generative or language model appr...
Article
We propose a formal model of Cross-Language Information Retrieval that does not rely on either query translation or document translation. Our approach leverages recent advances in language modeling to directly estimate an accurate topic model in the target language, starting with a query in the source language. The model integrates popular techniqu...
Article
We extend relevance modeling to the link detection task of Topic Detection and Tracking (TDT) and show that it substantially improves performance. Relevance modeling, a statistical language modeling technique related to query expansion, is used to enhance the topic model estimate associated with a news story, boosting the probability of words that...
Conference Paper
We explore the use of Optimal Mixture Models to represent topics. We analyze two broad classes of mixture models: set-based and weighted. We provide an original proof that estimation of set-based models is NP-hard, and therefore not feasible. We argue that weighted models are superior to set-based models, and the solution can be estimated by a simp...
Article
This chapter presents the system used by the Center for Intelligent Information Retrieval (CIIR) at the University of Massachusetts for its participation in four of the five TDT tasks: tracking, detection, first story detection, and story link detection. For each task, we discuss the parameter setting approach that we used and the results of our sy...
Article
Many approaches to personalization involve learning short-term and long-term user models. The user models provide context for queries and other interactions with the information system. In this paper, we discuss how language models can be used to represent context and support context-based techniques such as relevance feedback and query disambiguat...
Article
It has long been recognized that the primary obstacle to effective performance of classical models is the need to estimate a relevance model with no training data. We propose a novel technique for estimating such models using the query alone. We demonstrate that our technique can produce highly accurate relevance models. Our experiments show releva...
Article
The following work describes our solutions to the detection and tracking problems defined by the Topic Detection and Tracking (TDT2) research initiative. We discuss the implementation and results of the approaches which were recently tested on the TDT2 evaluation corpus. Our solutions to these problems extend text-based ranked retrieval techniques...
Article
We explore a formal approach to dealing with the zero frequency problem that arises in applications of probabilistic models to language. In this report we introduce the zero frequency problem in the context of probabilistic language models, describe several popular solutions, and introduce localized smoothing, a potentially better alternative. We f...
Article
We present a statistical model of vocabulary growth for applications involving large volumes of text. Vocabulary growth is modeled as repeated sampling of words from some underlying distribution. We derive general expressions for the expected number of unique words and the confidence interval around the expected value. We suggest a parametric form...
Conference Paper
We explore the relation between classical probabilistic models of information retrieval and the emerging language modeling approaches. It has long been recognized that the primary obstacle to effective performance of classical models is the need to estimate a relevance model: probabilities of words in the relevant class. We propose a novel technique...
Article
ersity's Center for Language and Speech Processing.[1] It was substantially reworked to provide improved support for "language model" approaches to the TDT tasks, though that functionality was not used significantly for TDT 2000. 1.1. Detection algorithms Our system supports two models of comparing a story to previously seen material: centroid (agg...
Article
We present a unique approach to identifying news stories that influence the behavior of financial markets. We describe the design and implementation of AEnalyst, a system for predicting trends in stock prices based on the content of news stories that precede the trends. We identify trends in time series using piecewise linear fitting and then assig...
Article
This report presents the system used by the University of Massachusetts for its participation in three of the five TDT tasks this year: detection, first story detection, and story link detection. For each task, we discuss the parameter setting approach that we used and the results of our system on the test data. In addition, we use TDT evaluation a...
Article
We define and describe the related problems of new event detection and event tracking within a stream of broadcast news stories. We focus on a strict on-line setting, i.e., the system must make decisions about one story before looking at any subsequent stories. Our approach to detection uses a single pass clustering algorithm and a novel threshold...
Article
This paper introduces Event Tracking, a new application of Information Retrieval technology with interesting research and evaluation questions. We describe the problem, a pilot corpus of news stories that was constructed for experimental studies, and a "rolling" evaluation strategy that uses different segments of the corpus for each query. As part...
Article
Convenient access to handwritten historical document collections in libraries generally requires an index, which allows one to locate individual text units (pages, sentences, lines) that are relevant to a given query (usually provided as text). Currently, extensive manual labor is used to annotate and organize such collections, because handwrit...
Article
Automatic query expansion is well known as a technique that improves query effectiveness on average. Unfortunately, it is usually very slow, increasing the time to process a query by 20 times or more. In this study, we use relevance models to show how the process can be made almost as fast as running a non-expanded query.

## Citations

... Such methods are also known in the literature as pseudo-relevance feedback or implicit feedback [9,23] ...
... In the context of the language-specific knowledge graph, events are a subset of entities. Whereas many definitions of an event exist in the literature, in this work, we follow an event definition by J. Allan et al. proposed in the context of the event detection and tracking within news stories [9]: Having introduced the entities, events, and their relations, we can now define the task of language-specific event recommendation. ...
... Document expansion techniques address the vocabulary mismatch problem [Zhao 2012]: queries can use terms semantically similar but lexically different from those used in the relevant documents. Traditionally, this problem has been addressed using query expansion techniques, such as relevance feedback [Rocchio 1971] and pseudo relevance feedback [Lavrenko and Croft 2001]. The advances in neural networks and natural language processing have paved the way to different techniques to address the vocabulary mismatch problem by expanding the documents by learning new terms. ...
... The TDT initiative held several competitions during which the UMass-FSD [3] system was recognized for its strong performance in detection effectiveness [6,9]. In recent years UMass-FSD has been actively used as a high accuracy baseline by state-of-the-art FSD systems [9][10][11]13], which try to scale to high volume streams while retaining a level of accuracy comparable to UMass-FSD. In this paper, we investigate the novelty computation algorithm of UMass and discover for the first time that it applies a new form of temporal bias in the decision making process. ...
... Its fitness function needs to interact with people, thus greatly limits its efficiency. Using searching agent, Cope's Experiments in Musical Intelligence (EMI) program successfully composes music in the styles of many famous composers such as Chopin, Bach, and Rachmaninoff [32]; Victor et al. [33] used random fields to model polyphonic music. The process of music generation using evolutionary algorithm is too abstract to follow. ...
... Two decades ago Berger and Lafferty [4] proposed to reduce the vocabulary gap and, thus, to improve retrieval effectiveness with a help of a lexical translation model called IBM Model 1 (henceforth, simply Model 1). Model 1 has strong performance when applied to finding answers in English question-answer (QA) archives using questions as queries [35,56,64,70] as well as to cross-lingual retrieval [72,37]. Yet, little is known about its effectiveness on realistic monolingual English queries, partly, because training Model 1 requires large query sets, which previously were not publicly available. ...
Citing conference paper
... Many of the existing studies have exploited content-based features at least in some capacity, extracting some basic knowledge from the text (sentence, tokens, linguistics, etc.) and also some of the studies employed various linguistic cues which characterize deception like the use of negative words, swear words. These linguistic cues, including some other features that have been successfully implemented in numerous methods concerning fake news and rumour detection (Castillo et al. 2011;Gupta et al. 2014;Qin et al. 2016;Zhang et al. 2012). The authors of Zubiaga et al. (2016) and Qin et al. (2016) have employed syntactic features such as the number of content words(noun, verb adjectives), frequency of specific keyword has been employed. ...
... The hashcodes exhibit the neighbourhood preserving property that similar data-points will be assigned similar (low Hamming distance) hashcodes. To compute these hashcodes, many hashing models partition the input feature space into disjoint regions with hyperplanes [1,6,9]. In the case of hyperplanes the polytope-shaped regions formed by the inter- ...
... We have recast the entity linking problem as an application of a more generic mention encoding task. This approach is related to methods which perform clustering on test mentions in order to improve inference (Le and Titov, 2018;Angell et al., 2020), and can also be viewed as a form of crossdocument coreference resolution (Rao et al., 2010;Shrimpton et al., 2015;Barhom et al., 2019). We also take inspiration from recent instance-based language modelling approaches (Khandelwal et al., 2020;Lewis et al., 2020b). ...