
Stephen E. Robertson
- Doctor of Philosophy
- Professor at University College London
About
258 Publications · 163,779 Reads
24,096 Citations
Introduction
Stephen E. Robertson is currently retired, but has a visiting position at the Department of Computer Science, University College London. Stephen's research area is information retrieval. His most recent publication is 'A Brief History of Search Results Ranking'.
Publications (258)
The idea that the digital age has revolutionized our day-to-day experience of the world is nothing new, and has been amply recognized by cultural historians. In contrast, Stephen Robertson’s 'BC: Before Computers' is a work which questions the idea that the mid-twentieth century saw a single moment of rupture. It is about all the things that we had...
The theory and practice of search results ranking, as currently offered by most web search engines, is older than one might think. The first proposal for a system of ranking was in a JACM paper in 1960. Through the remainder of the twentieth century, extensive research was done on ranking systems - on devising methods of ranking, on the use of lear...
To me, an awareness of history is a fundamental requirement for progress; and I believe that we in the field of information retrieval are currently ill-served in this domain, or at least not as aware as we should be. While it is true that a researcher in IR is expected to acquire some knowledge of what has gone before, this knowledge is typically f...
Stemming is a widely used technique in information retrieval systems to address the vocabulary mismatch problem arising out of morphological phenomena. The major shortcoming of the commonly used stemmers is that they accept the morphological variants of the query words without considering their thematic coherence with the given query, which leads t...
Score-distribution models are used for various practical purposes in search, for example for results merging and threshold setting. In this paper, the basic ideas of the score-distributional approach to viewing and analysing the effectiveness of search systems are re-examined. All recent score-distribution modelling work depends on the availability...
The possibility of using fewer topics in TREC, and in TREC-like initiatives, has been studied recently, with encouraging results: even when decreasing consistently the number of topics (for example, using a topic subset of cardinality only 10, in place of the usual 50) it is possible, at least potentially, to obtain similar results when evaluating...
A solid research path towards new information retrieval models is to further develop the theory behind existing models. A profound understanding of these models is therefore essential. In this paper, we revisit probability ranking principle (PRP)-based models, probability of relevance (PR) models, and language models, finding conceptual differences...
Increasingly, web recommender systems face scenarios where they need to serve suggestions to groups of users; for example, when families share e-commerce or movie rental web accounts. Research to date in this domain has proposed two approaches: computing recommendations for the group by merging any members' ratings into a single profile, or computi...
Lab-based evaluations typically assess the quality of a retrieval system with respect to its ability to retrieve documents that are relevant to the information need of an end user. In a real-time search task however users not only wish to retrieve the most relevant items but the most recent as well. The current evaluation framework is not adequate...
We explore the notion, put forward by Cormack & Lynam and Robertson, that we should consider a document collection used for Cranfield-style experiments as a sample from some larger population of documents. In this view, any per-topic metric (such as average precision) should be regarded as an estimate of that metric's true value for that topic in t...
We propose an approach to the retrieval of entities that have a specific relationship with the entity given in a query. Our research goal is to investigate whether related entity finding problem can be addressed by combining a measure of relatedness of candidate answer entities to the query, and likelihood that the candidate answer entity belongs t...
In this work, we propose a theory for information matching. It is motivated by the observation that retrieval is about the relevance matching between two sets of properties (features), namely, the information need representation and information item representation. However, many probabilistic retrieval models rely on fixing one representation and o...
On the basis of a theoretical analysis of issues around populations and sampling, for both topics and documents, and parameters with which we hope to characterise the effectiveness of different systems, we propose a modification to the traditional average precision metric. This modification involves both transformation and (in the estimation of the...
The web search engine has come to occupy a central position in our information-seeking habits as citizens. This chapter explores the genesis of the idea of a search engine, and how this mechanism has developed in the web context. Search engines have adapted to the world of the web (in particular, to users and to the uses to which they have been put...
In this paper, an Eliteness Hypothesis for information retrieval is proposed, where we define two generative processes to create information items and queries. By assuming the deterministic relationships between the eliteness of terms and relevance, we obtain a new theoretical retrieval framework. The resulting ranking function is a unified one as...
We consider the selection of good subsets of topics for system evaluation. It has previously been suggested that some individual topics and some subsets of topics are better for system evaluation than others: given limited resources, choosing the best subset of topics may give significantly better prediction of overall system effectiveness than (fo...
We review the history of modeling score distributions, focusing on the mixture of normal-exponential by investigating the theoretical as well as the empirical evidence supporting its use. We discuss previously suggested conditions which valid binary mixture models should satisfy, such as the Recall-Fallout Convexity Hypothesis, and formulate two ne...
We first present in this paper an analytical view of heuristic retrieval constraints which yields simple tests to determine whether a retrieval function satisfies the constraints or not. We then review empirical findings on word frequency distributions ...
Traditional information retrieval research has mostly focussed on satisfying clearly specified information needs. However, in reality, queries are often ambiguous and/or underspecified. In light of this, evaluating search result diversity is beginning to receive attention. We propose simple evaluation metrics for diversified Web search results. O...
Most current machine learning methods for building search engines are based on the assumption that there is a target evaluation metric that evaluates the quality of the search engine with respect to an end user and the engine should be trained to optimize for that metric. Treating the target evaluation metric as a given, many different approaches (...
Evaluation metrics play a critical role both in the context of comparative evaluation of the performance of retrieval systems and in the context of learning-to-rank (LTR) as objective functions to be optimized. Many different evaluation metrics have been proposed in the IR literature, with average precision (AP) being the dominant one due to a number...
Most information retrieval evaluation metrics are designed to measure the satisfaction of the user given the results returned by a search engine. In order to evaluate user satisfaction, most of these metrics have underlying user models, which aim at modeling how users interact with search engine results. Hence, the quality of an evaluation metric i...
We consider the issue of evaluating information retrieval systems on the basis of a limited number of topics. In contrast to statistically-based work on sample sizes, we hypothesize that some topics or topic sets are better than others at predicting true system effectiveness, and that with the right choice of topics, accurate predictions can be obt...
The LETOR datasets consist of data extracted from traditional IR test corpora. For each of a number of test topics, a set of documents has been extracted, in the form of features of each document-query pair, for use by a ranker. An examination of the ways in which documents were selected for each topic shows that the selection has (for each o...
Although Average Precision (AP) has been the most widely-used retrieval effectiveness metric since the advent of the Text Retrieval Conference (TREC), the general belief among researchers is that it lacks a user model. In light of this, Robertson recently pointed out that AP can be interpreted as a special case of Normalised Cumulative Precision (NCP...
The Probabilistic Relevance Framework (PRF) is a formal framework for document retrieval, grounded in work done in the 1970s–1980s, which led to the development of one of the most successful text-retrieval algorithms, BM25. In recent years, research in the PRF has yielded new retrieval models capable of taking into account document meta-data (especi...
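The BM25 function named above can be sketched in a few lines. This is a minimal illustration rather than the exact formulation of any particular system; the +1 inside the logarithm (a common variant that keeps term weights positive) and the default parameter values are assumptions here.

```python
import math

def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len,
               num_docs, doc_freq, k1=1.2, b=0.75):
    """Minimal BM25 score of one document against a query.

    doc_tf: term -> frequency in the document
    doc_freq: term -> number of documents containing the term
    k1, b: the usual free parameters (term-frequency saturation
           and document-length normalisation)
    """
    score = 0.0
    for t in query_terms:
        tf = doc_tf.get(t, 0)
        n = doc_freq.get(t, 0)
        if tf == 0 or n == 0:
            continue
        # Robertson/Sparck Jones-style IDF component (+1 keeps it positive).
        idf = math.log((num_docs - n + 0.5) / (n + 0.5) + 1)
        # Saturating term frequency, normalised by document length.
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
        score += idf * norm
    return score

# Illustrative corpus statistics, not real data:
tf = {"probabilistic": 2, "retrieval": 3}
print(bm25_score(["probabilistic", "retrieval"], tf, doc_len=120,
                 avg_doc_len=100, num_docs=10000,
                 doc_freq={"probabilistic": 300, "retrieval": 1200}))
```

Rarer terms receive higher IDF, and repeated occurrences of a term add diminishing amounts of score because of the saturation in the numerator/denominator.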
We took part in the Web and Relevance Feedback tracks, using the ClueWeb09 corpus. To process the corpus, we developed a parallel processing pipeline which avoids the generation of an inverted file. We describe the components of the parallel architecture and the pipeline and how we ran the TREC experiments, and we present effectiveness results.
Ranked retrieval has a particular disadvantage in comparison with traditional Boolean retrieval: there is no clear cut-off point where to stop consulting results. This is a serious problem in some setups. We investigate and further develop methods to select the rank cut-off value which optimizes a given effectiveness measure. Assuming no other inp...
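One simple way to frame the cut-off problem is as maximising an expected utility down the ranking. The linear gain/cost utility below is an illustrative assumption, not the measure optimised in the paper; it merely shows why an optimal cut-off exists when estimated probabilities of relevance decline with rank.

```python
def best_cutoff(rel_probs, gain=1.0, cost=1.0):
    """Pick the rank cut-off maximising a simple linear utility:
    expected relevant retrieved * gain  -  expected non-relevant retrieved * cost.

    rel_probs: estimated probability of relevance at each rank, top first.
    Returns (cut-off rank, utility at that rank); (0, 0.0) means stop before rank 1.
    """
    best_k, best_u, utility = 0, 0.0, 0.0
    for k, p in enumerate(rel_probs, start=1):
        utility += p * gain - (1 - p) * cost
        if utility > best_u:
            best_k, best_u = k, utility
    return best_k, best_u

# Utility peaks where the expected gain of one more document
# drops below its expected cost -- here at rank 3.
print(best_cutoff([0.9, 0.8, 0.6, 0.4, 0.2]))
```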
Much research in learning to rank has been placed on developing sophisticated learning methods, treating the training set as a given. However, the number of judgments in the training set directly affects the quality of the learned system. Given the expense of obtaining relevance judgments for constructing training data, one often has a limited bud...
The ESP Game was designed to harvest human intelligence to assign labels to images - a task which is still difficult for even the most advanced systems in image processing. However, the ESP Game as it is currently implemented encourages players to assign "obvious" labels, which can be easily predicted given previously assigned labels. We present a...
This book constitutes the refereed proceedings of the Second International Conference on the Theory of Information Retrieval, ICTIR 2009, held in Cambridge, UK, in September 2009. The 18 revised full papers, 14 short papers, and 11 posters presented together with one invited talk were carefully reviewed and selected from 82 submissions. The papers...
Themes of the talk: search as a science; the role of experiment and other empirical data gathering in IR; the (partial) standoff between the Cranfield tradition and user-oriented work; the role of theory in IR (the relation of theories and models to empirical data); abstraction. (Evaluation workshop, SIGIR 09, Boston, July 2009.)
Collaborative filtering is concerned with making recommendations about items to users. Most formulations of the problem are specifically designed for predicting user ratings, assuming past data of explicit user ratings is available. However, in practice we may only have implicit evidence of user preference; and furthermore, a better view of the tas...
Relevance Feedback has been one of the successes of information retrieval research for the past 30 years. It has been proven to be worthwhile in a wide variety of settings, both when actual user feedback is available, and when the user feedback is implicit. However, while the applications of relevance feedback and type of user input to relevance fe...
This paper is a personal take on the history of evaluation experiments in information retrieval. It describes some of the early experiments that were formative in our understanding, and goes on to discuss the current dominance of TREC (the Text REtrieval Conference) and to assess its impact.
We present the results of experiments using terms from citations for scientific literature search. To index a given document, we use terms used by citing documents to describe that document, in combination with terms from the document itself. We find that the combination of terms gives better retrieval performance than standard indexing of the docu...
This article presents a bilingual ontology-based dialog system with multiple services. An ontology-alignment algorithm is proposed to integrate ontologies of different languages for cross-language applications. A domain-specific ontology is further ...
Query expansion by word alterations (alternative forms of a word) is often used in Web search to replace word stemming. This allows users to specify particular word forms in a query. However, if many alterations are added, query traffic will be greatly increased. In this paper, we propose methods to select only a few useful word alterations for q...
In the field of information retrieval, one is often faced with the problem of computing the correlation between two ranked lists. The most commonly used statistic that quantifies this correlation is Kendall's τ. Oftentimes, in the information retrieval community, discrepancies among those items having high rankings are more important than those am...
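For reference, Kendall's τ for two rankings of the same items can be computed directly from concordant and discordant pairs. This plain-Python sketch ignores ties and the top-weighted variants that the paper is concerned with.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two rankings of the same items.

    rank_a, rank_b map each item to its rank position (1 = top).
    tau = (concordant pairs - discordant pairs) / total pairs.
    """
    items = list(rank_a)
    n = len(items)
    concordant = discordant = 0
    for x, y in combinations(items, 2):
        # A pair is concordant when both rankings order x and y the same way.
        sign_a = rank_a[x] - rank_a[y]
        sign_b = rank_b[x] - rank_b[y]
        if sign_a * sign_b > 0:
            concordant += 1
        elif sign_a * sign_b < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

a = {"d1": 1, "d2": 2, "d3": 3}
print(kendall_tau(a, a))                             # 1.0
print(kendall_tau(a, {"d1": 3, "d2": 2, "d3": 1}))   # -1.0
```

Identical rankings give τ = 1 and a full reversal gives τ = -1, with intermediate disagreement falling in between.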
Pseudo-relevance feedback assumes that most frequent terms in the pseudo-feedback documents are useful for the retrieval. In this study, we re-examine this assumption and show that it does not hold in reality - many expansion terms identified in traditional approaches are indeed unrelated to the query and harmful to the retrieval. We also show that...
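The traditional assumption being re-examined here can be made concrete with a toy sketch: naive pseudo-relevance feedback simply takes the most frequent terms from the top-ranked (pseudo-relevant) documents as expansion terms. The function name and the whitespace tokenisation are illustrative assumptions, not the paper's method.

```python
from collections import Counter

def expansion_terms(feedback_docs, query_terms, k=5):
    """Naive pseudo-relevance feedback term selection.

    Returns the k most frequent terms across the pseudo-relevant
    documents, excluding the query's own terms. This is exactly the
    assumption the paper questions: the most frequent terms in the
    feedback documents are not necessarily related to the query.
    """
    counts = Counter()
    for doc in feedback_docs:
        counts.update(doc.split())  # crude whitespace tokenisation
    return [t for t, _ in counts.most_common() if t not in query_terms][:k]

docs = ["ranking with bm25 bm25", "bm25 term weighting"]
print(expansion_terms(docs, {"ranking"}, k=2))
```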
We consider the question of whether Average Precision, as a measure of retrieval effectiveness, can be regarded as deriving from a model of user searching behaviour. It turns out that indeed it can be so regarded, under a very simple stochastic model of user behaviour.
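The Average Precision measure under discussion is easy to state in code: it averages precision at the ranks where relevant documents appear, with unretrieved relevant documents contributing zero.

```python
def average_precision(ranked_relevance, num_relevant):
    """Average precision over a ranked list.

    ranked_relevance: 0/1 relevance judgements in rank order.
    num_relevant: total relevant documents for the topic
                  (missed ones contribute zero to the average).
    """
    hits = 0
    total = 0.0
    for i, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            total += hits / i  # precision at this rank
    return total / num_relevant if num_relevant else 0.0

# Relevant at ranks 1 and 3, two relevant in total:
print(average_precision([1, 0, 1, 0], 2))  # (1/1 + 2/3) / 2 = 0.8333...
```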
We address the problem of learning large complex ranking functions. Most IR applications use evaluation metrics that depend only upon the ranks of documents. However, most ranking functions generate document scores, which are sorted to produce a ranking. Hence IR metrics are innately non-smooth with respect to the scores, due to the sort. Unfor...
The Cranfield projects began in 1958 -- fifty years ago. They have of course been extraordinarily influential, forming a view of information retrieval as an experimental science, which in some fashion persists to this day. Although the Cranfield tradition has had its ups and downs - the main down being in the late eighties, when it showed signs of...
In previous work, we have shown that using terms from around citations in citing papers to index the cited paper, in addition to the cited paper's own terms, can improve retrieval effectiveness. Now, we investigate how to select text from around the citations in order to extract good index terms. We compare the retrieval effectiveness that results...
In this work, we analyze the popular KL-divergence ranking function in information retrieval. We uncover the generative distribution, namely the Smoothed Dirichlet distribution, underlying this ranking function and show that this distribution captures term occurrence distribution much better than the multinomial, thus offering, for the first...
This paper describes the official measures of retrieval effectiveness that are planned to be employed for the ad hoc track of INEX 2007.
This paper describes the official measures of retrieval effectiveness that are employed for the Ad Hoc Track at INEX 2007. Whereas in earlier years all, but only, XML elements could be retrieved, the result format has been liberalized to arbitrary passages. In response, the INEX 2007 measures are based on the amount of highlighted text retrieved, l...
Retrieval system experimentation has assumed that user requests represent a single information need. The problem is identifying and meeting this need. Search engine experience demonstrates that this assumption is far from holding in the real world. Responding appropriately to this fact raises new issues for research on retrieval system theory, desi...
This paper describes research that aims to define the information needs of mobile individuals, to implement a mobile information system that can satisfy those needs, and finally to evaluate the performance of that system with end-users. First a review ...
Purpose
An issue that tends to be ignored in information retrieval is the issue of updating inverted files. This is largely because inverted files were devised to provide fast query service, and much work has been done with the emphasis strongly on queries. This paper aims to study the effect of using parallel methods for the update of inverted fil...
Many current retrieval models and scoring functions contain free parameters which need to be set - ideally, optimized. The process of optimization normally involves some training corpus of the usual document-query-relevance judgement type, and some choice of measure that is to be optimized. The paper proposes a way to think about the process of...
We investigate the effect of different sources of relevant documents in the creation of a test collection in the scientific domain. Based on the Cranfield 2 design, paper authors are asked to judge their cited papers for relevance in the first stage. In a second stage, documents outside the reference list are judged. In this paper, we use the test...
We discuss the idea of modelling the statistical distributions of scores of documents, classified as relevant or non-relevant. Various specific combinations of standard statistical distributions have been used for this purpose. Some theoretical considerations indicate problems with some of the choices of pairs of distributions. Specifically, we rev...
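A common concrete instance of this approach is the normal-exponential pair: scores of relevant documents modelled as Gaussian, scores of non-relevant documents as exponential. The sketch below shows how such a mixture yields a probability of relevance given a score via Bayes' rule; the parameter values in the usage line are purely illustrative (in practice they would be fitted to observed scores, e.g. by EM).

```python
import math

def normal_pdf(x, mu, sigma):
    # Gaussian density, assumed here to model relevant-document scores.
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def exponential_pdf(x, lam):
    # Exponential density, assumed here to model non-relevant-document scores.
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def prob_relevant(score, p_rel, mu, sigma, lam):
    """P(relevant | score) under the normal-exponential mixture, via Bayes' rule.

    p_rel: prior probability that a scored document is relevant.
    """
    rel = p_rel * normal_pdf(score, mu, sigma)
    nonrel = (1 - p_rel) * exponential_pdf(score, lam)
    return rel / (rel + nonrel)

# Illustrative parameters: high scores map to high probability of relevance.
print(prob_relevant(5.0, p_rel=0.1, mu=6.0, sigma=1.0, lam=1.0))
```

Such a posterior is what makes the model useful for results merging and threshold setting: scores from different systems become comparable probabilities.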
In early 2006, as a result of a series of conversations between Steve Robertson, Mark Sanderson and Karen Spärck Jones, Karen circulated a note summing up our discussions, which were on the topic of ambiguous requests. At the core of our discussion was the question: is too much information retrieval research focussed on search tasks where the query...
We propose a novel method of analysing data gathered from TREC or similar information retrieval evaluation experiments. We define two normalized versions of average precision, that we use to construct a weighted bipartite graph of TREC systems and topics. We analyze the meaning of well known — and somewhat generalized — indicators from social n...
The experimental evaluation of information retrieval systems has a venerable history. Long before the current notion of a search engine, in fact before search by computer was even feasible, people in the library and information science community were beginning to tackle the evaluation issue. Sometimes it feels as though evaluation methodology has b...
Work on the statistical validity of experimental results in retrieval tests has concentrated on treating the topics as a sample from a population, but regarding the collection of documents as fixed. This paper raises the argument that we should also consider the documents as having been sampled from a population. It follows that we should regard a...
In this paper, we describe the Centre for Interactive Systems Research’s participation in the INEX 2006 ad hoc track. Rather than the field-weighted BM25 model used in INEX 2005, we revert to the traditional BM25 weighting function. Our main research aims this year are to investigate the effects of document filtering (by considering onl...
Lexical cohesion is a property of text, achieved through lexical-semantic relations between words in text. Most information retrieval systems make use of lexical relations in text only to a limited extent. In this paper we empirically investigate whether the degree of lexical cohesion between the contexts of query terms’ occurrences in a document i...
We consider the question of how information from the textual context of citations in scientific papers could improve indexing of the cited papers. We first present examples which show that the context should in principle provide better and new index terms. We then discuss linguistic phenomena around citations and which type of processing would...
We consider the retrieval of XML-structured documents, and of passages from such documents, defined as elements of the XML structure. These are considered from the point of view of passage retrieval, as a form of document retrieval. A retrievable unit (an element chosen as defining suitable passages for retrieval) is a textual document in its own r...
The two previous probabilistic models of information retrieval, which seemed to be in some sense incompatible, can now be regarded as two complementary parts of a unified model. The new Model 3, which is derived in the framework of the unified model from a combination of Models 1 and 2, makes use of relevance feedback information from the individua...
We present an approach to building a test collection of research papers. The approach is based on the Cranfield 2 tests but uses as its vehicle a current conference; research questions and relevance judgements of all cited papers are elicited from conference authors. The resultant test collection is different from TREC's in that it comprises scie...
This is the first year for the participation of the City University Centre of Interactive System Research (CISR) in the Expert Search Task. In this paper, we describe an expert search experiment based on window-based techniques; that is, we build a profile for each expert by using information around the expert's name and email address in the docum...
This paper, based on a talk, presents an overview of evaluation experiments in information retrieval, and also of statistical approaches to search. A strong connection exists between them: the notion that the objective of search can be expressed in terms of the measures used for evaluation informs the statistical theory in several ways. The latest...
Optimising the parameters of ranking functions with respect to standard IR rank-dependent cost functions has eluded satisfactory analytical treatment. We build on recent advances in alternative differentiable pairwise cost functions, and show that these techniques can be successfully applied to tuning the parameters of an existing family of IR scor...
As an alternative to the usual Mean Average Precision, some use is currently being made of the Geometric Mean Average Precision (GMAP) as a measure of average search effectiveness across topics. GMAP is specifically used to emphasise the lower end of the average precision scale, in order to shed light on poor performance of search engines. This pap...
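The contrast between MAP and GMAP is easy to see in code: GMAP is the geometric mean of per-topic average precision, so a single very poor topic pulls it down sharply. The small epsilon guarding zero scores is a conventional choice, assumed here.

```python
import math

def gmap(ap_scores, eps=1e-5):
    """Geometric Mean Average Precision over per-topic AP scores.

    eps guards against zero scores, which would otherwise drive
    the geometric mean to exactly zero.
    """
    logs = [math.log(max(ap, eps)) for ap in ap_scores]
    return math.exp(sum(logs) / len(logs))

aps = [0.5, 0.5, 0.01]
print(sum(aps) / len(aps))  # MAP ~ 0.337: dominated by the good topics
print(gmap(aps))            # GMAP ~ 0.136: the poor topic drags it down
```

This is exactly the emphasis on the lower end of the AP scale that the paper discusses: arithmetic averaging hides a failing topic, geometric averaging exposes it.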
Purpose
The generation of inverted indexes is one of the most computationally intensive activities for information retrieval systems: indexing large multi‐gigabyte text databases can take many hours or even days to complete. We examine the generation of partitioned inverted files in order to speed up the process of indexing. Two types of index part...
A query independent feature, relating perhaps to document content, linkage or usage, can be transformed into a static, per-document relevance weight for use in ranking. The challenge is to find a good function to transform feature values into relevance scores. This paper presents FLOE, a simple density analysis method for modelling the shape of the...
A basic notion of probability theory is the event space, on which the probability measure is defined. A probabilistic model needs an event space. However, some classes of events (which we may want to model probabilistically) exhibit structure which does not fit well into the traditional event space notion. A simple one-to-many example is discussed...
Summary: A major focus of much work of the group (as it has been since the City University Okapi work) is the development and refinement of basic ranking algorithms. The workhorse remains the BM25 algorithm; recently (3, 4) we introduced a field-weighted version of this, allowing differential treatment of different fields in the original documen...
This is the first year for the Centre for Interactive Systems Research participation in INEX. Based on a newly developed XML indexing and retrieval system on Okapi, we extend Robertson’s field-weighted BM25F for document retrieval to the element-level retrieval function BM25E. In this paper, we introduce this new function and our experimental method in...
The term-weighting function known as IDF was proposed in 1972, and has since been extremely widely used, usually as part of a TF*IDF function. It is often described as a heuristic, and many papers have been written (some based on Shannon's Information Theory) seeking to establish some theoretical basis for it. Some of these attempts are reviewed, a...
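The classic IDF weight, in its simplest 1972 form, is just the logarithm of the inverse fraction of documents containing the term, usually combined multiplicatively with term frequency:

```python
import math

def idf(num_docs, doc_freq):
    # Classic Sparck Jones IDF (1972): log of the inverse document frequency.
    return math.log(num_docs / doc_freq)

def tf_idf(tf, num_docs, doc_freq):
    # The usual TF*IDF combination: raw term frequency times IDF.
    return tf * idf(num_docs, doc_freq)

# A term appearing in 10 of 1000 documents is weighted far more
# heavily than one appearing in 500 of them.
print(idf(1000, 10))   # ~4.61
print(idf(1000, 500))  # ~0.69
```

The base of the logarithm is immaterial for ranking (it rescales all weights uniformly); natural log is assumed here.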
Questions
Question (1)
I have several papers co-authored with Stephen Walker (who retired many years ago). In your database, they are attributed to someone called Shelia Walker, who is apparently a current researcher. How can I make the correction?