
Andrew Trotman
- PhD
- Professor (Associate) at University of Otago
About
182
Publications
26,339
Reads
2,651
Citations
Publications (182)
Researchers have had much recent success with ranking models based on so-called learned sparse representations generated by transformers. One crucial advantage of this approach is that such models can exploit inverted indexes for top-k retrieval, thereby leveraging decades of work on efficient query evaluation. Yet, there remain many open question...
We present a novel approach to the Metagenomic Geolocation Challenge based on random projection of the sample reads from each location. This approach explores the direct use of k-mer composition to characterise samples so that we can avoid the computationally demanding step of aligning reads to available microbial reference sequences. Each variable...
Recent advances in retrieval models based on learned sparse representations generated by transformers have led us to, once again, consider score-at-a-time query evaluation techniques for the top-k retrieval problem. Previous studies comparing document-at-a-time and score-at-a-time approaches have consistently found that the former approach yields l...
Sarcasm target detection (identifying the target of mockery in a sarcastic sentence) is an emerging field in computational linguistics. Although there has been some research in this field, accurately identifying the target still remains problematic especially when the target of mockery is not presented in the text. In this paper, we propose a combi...
There exists a natural tension between encouraging a diverse ecosystem of open-source search engines and supporting fair, replicable comparisons across those systems. To balance these two goals, we examine two approaches to providing interoperability between the inverted indexes of several systems. The first takes advantage of internal abstractions...
The SIGIR 2019 Workshop on eCommerce (ECOM19) was a full-day workshop that took place on Thursday, July 25, 2019 in Paris, France. The purpose of the workshop was to serve as a platform for publication and discussion of Information Retrieval and NLP research and their applications in the domain of eCommerce. The workshop program was designed to br...
We make available a document collection of a million product titles from 3,008 anonymized categories of the rakuten.com product catalog. The anonymization has been done due to intellectual property rights on the underlying data organization taxonomy. Our analysis of the characteristics of the 800,000 training and 20,000 validation titles shows that...
eCommerce Information Retrieval is receiving increasing attention in the academic literature, and is an essential component of some of the largest web sites (such as eBay, Amazon, Airbnb, Alibaba, Taobao, Target, Facebook, Home Depot, and others). These kinds of organisations clearly value the importance of research into Information Retrieval. The...
We introduce and test several micro‐ and macro‐optimizations to the Score‐at‐a‐Time approach to processing impact‐ordered postings lists in a search engine. Our micro‐optimizations are at the single‐assembly instruction level, but our macro‐optimizations are algorithmic. Overall, we see an improvement of 37% on our baseline (22% on state of the art...
Query expansion is commonly used to combat the vocabulary mismatch problem; it bridges the disparity between the vocabulary used in the corpus and that used in search queries. However, if expansion terms are not chosen carefully, there is a risk of including spurious expansion terms, which can broaden the potential interpretations of the modified query. Uninten...
The prior belief that the Elias gamma and delta coding are slow because of the bit-wise manipulations is examined in the light of new CPU instructions that perform those manipulations. It is shown that despite using those instructions, Elias gamma and Elias delta remain slow compared to SIMD codecs such as QMX. We provide a theoretical basis on whi...
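For readers unfamiliar with the code format this abstract refers to, the following is a minimal sketch of Elias gamma coding. It is not the authors' implementation: it builds bit strings rather than using the CPU instructions the paper examines, so it illustrates only the layout of a codeword, not its speed. The function names are hypothetical.

```python
from typing import List


def elias_gamma_encode(n: int) -> str:
    """Return the Elias gamma code of a positive integer as a bit string."""
    if n < 1:
        raise ValueError("Elias gamma codes only positive integers")
    binary = bin(n)[2:]  # binary form, always starts with '1'
    # Unary length prefix (one '0' per bit after the first), then the value.
    return "0" * (len(binary) - 1) + binary


def elias_gamma_decode(bits: str) -> List[int]:
    """Decode a concatenation of well-formed Elias gamma codes."""
    values = []
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":  # count the leading zeros of this codeword
            zeros += 1
            i += 1
        # The next zeros + 1 bits hold the value itself.
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values
```

The bit-by-bit loop in the decoder is the kind of manipulation the abstract describes as the traditional reason for the codes being considered slow.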
The purpose of the Strategic Workshop in Information Retrieval in Lorne is to explore the long-range issues of the Information Retrieval field, to recognize challenges that are on, or even over, the horizon, to build consensus on some of the key challenges, and to disseminate the resulting information to the research community. The intent is that thi...
eCommerce Information Retrieval has received little attention in the academic literature, yet it is an essential component of some of the largest web sites (such as eBay, Amazon, Airbnb, Alibaba, Taobao, Target, Facebook, and others). SIGIR has for several years seen sponsorship from these kinds of organisations, who clearly value the importance of...
The effectiveness of a search engine is typically evaluated using hand-labeled datasets, where the labels indicate the relevance of documents to queries. Often the number of labels needed is too large to be created by the best annotators, and so less expensive labels (e.g., from crowdsourcing) are used. This introduces errors in the labels, and thu...
The SIGIR 2017 Workshop on eCommerce (ECOM17) was a full-day workshop that took place on Friday, August 11, 2017 in Tokyo, Japan. The purpose of the workshop was to serve as a platform for publication and discussion of Information Retrieval and NLP research and their applications in the domain of eCommerce. The workshop program was designed to bri...
Patents are a source of technical knowledge, but often difficult to understand. Technological solutions that would help understand the knowledge expressed in patents can assist the creation of new knowledge, and inventions. This paper explores anchor text selection for linking patents to external knowledge sources such as web pages and prior patent...
Query expansion is used to overcome the vocabulary mismatch between the documents and queries, but it can lead to query drift. We propose an automatic term reweighting strategy for BM25 ranking functions. Using expansion terms obtained from general purpose thesauri, we found that reweighting through term frequency merging is more effective than sta...
eCommerce Information Retrieval has received little attention in the academic literature, yet it is an essential component of some of the largest web sites (such as eBay, Amazon, Airbnb, Alibaba, Taobao, Target, Facebook, and others). SIGIR has for several years seen sponsorship from these kinds of organizations, who clearly value the importance of...
This paper explores the performance of top k document retrieval with score-at-a-time query evaluation on impact-ordered indexes in main memory. To better understand execution efficiency in the context of modern processor architectures, we examine the role of index compression on query evaluation latency. Experiments include compressing postings wit...
The quality of a search engine is typically evaluated using hand-labeled data sets, where the labels indicate the relevance of documents to queries. Often the number of labels needed is too large to be created by the best annotators, and so less accurate labels (e.g. from crowdsourcing) must be used. This introduces errors in the labels, and thus e...
We present an empirical comparison between document-at-a-time (DaaT) and score-at-a-time (SaaT) document ranking strategies within a common framework. Although both strategies have been extensively explored, the literature lacks a fair, direct comparison: such a study has been difficult due to vastly different query evaluation mechanics and index o...
The size of a search engine index and the time to search are inextricably related through the compression codec. This investigation examines this tradeoff using several relatively unexplored SIMD-based codecs including QMX, TurboPackV, and TurboPFor. It uses (the non-SIMD) OPTPFor as a baseline. Four new variants of QMX are introduced and also comp...
The Open-Source IR Reproducibility Challenge brought together developers of open-source search engines to provide reproducible baselines of their systems in a common environment on Amazon EC2. The product is a repository that contains all code necessary to generate competitive ad hoc retrieval baselines, such that with a single script, anyone with...
The SIGIR 2015 Workshop on Reproducibility, Inexplicability, and Generalizability of Results (RIGOR) took place on Thursday, August 13, 2015 in Santiago, Chile. The goal of the workshop was twofold. The first was to provide a venue for the publication and presentation of negative results. The second was to provide a venue through which the authors of...
During indexing, the vocabulary of a collection needs to be built. The structure used for this needs to account for the skewed distribution of terms. Parallel indexing allows for a large reduction in the number of times the global vocabulary needs to be examined; however, this also raises a new set of challenges. In this paper we examine the structures us...
There are many competing models for the indexing process of an information retrieval system, one of which is a pipeline-based model. Information retrieval is also an inherently parallel process: indexing one document is independent of indexing another. A pipeline model allows for easy experimentation on the parallelism within an indexer. In this pa...
The Simple family of codecs is popular for encoding postings lists for a search engine because they are both space effective and time efficient at decoding. These algorithms pack as many integers into a codeword as possible before moving on to the next codeword. This technique is known as left-greedy. This contribution proves that left-greedy is no...
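The left-greedy strategy described in this abstract can be sketched as follows. This is an illustration only, assuming a Simple-9 style table of nine selectors over a 28-bit payload; it returns the grouping of integers into codewords rather than the packed bits, and the names `SELECTORS` and `left_greedy_pack` are hypothetical. A short final group is accepted as-is, where a real codec would pad it.

```python
# (integers per codeword, bits per integer) for a 28-bit payload, Simple-9 style.
SELECTORS = [(28, 1), (14, 2), (9, 3), (7, 4), (5, 5),
             (4, 7), (3, 9), (2, 14), (1, 28)]


def left_greedy_pack(values):
    """Group values into codewords, each taking as many integers as will fit.

    Returns a list of (bit_width, integers) pairs, one per codeword.
    """
    codewords = []
    pos = 0
    while pos < len(values):
        # Try the densest selector first; the first one that fits wins.
        for count, width in SELECTORS:
            chunk = values[pos:pos + count]
            if all(0 <= v < (1 << width) for v in chunk):
                codewords.append((width, chunk))
                pos += len(chunk)
                break
        else:
            raise ValueError(f"value at position {pos} needs more than 28 bits")
    return codewords
```

For example, `left_greedy_pack([1, 1, 1, 1000])` places the three small integers in one 9-bit-per-integer codeword and the large integer in a second codeword, because the first codeword is filled without regard to what follows.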
The ability of a ranking function to control its own execution time is useful for managing load, reining in outliers, and adapting to different types of queries. We propose a simple yet effective anytime algorithm for impact-ordered indexes that builds on a score-at-a-time query evaluation strategy. In our approach, postings segments are processe...
The three generations of postings list compression strategies (Variable Byte Encoding, Word Aligned Codes, and SIMD Codecs) are examined in order to test whether or not each truly represented a generational change -- they do. Some weaknesses of the current SIMD-based schemes are identified and a new scheme, QMX, is introduced to address both space...
Recent work on search engine ranking functions reports improvements on BM25 and Language Models with Dirichlet Smoothing. In this investigation 9 recent ranking functions (BM25, BM25+, BM25T, BM25-adpt, BM25L, TFl∘δ∘p×IDF, LM-DS, LM-PYP, and LM-PYP-TFIDF) are compared by training on the INEX 2009 Wikipedia collection and testing on INEX 2010 and 9 TR...
This paper proposes a novel approach to explore emergent patterns in images in an unsupervised setting. We consider emergent patterns to be sets of co-occurring visual words that appear together more often than chance would indicate. Rather than focusing on finding ways to learn a large number of objects or their categories we focus on analyzing be...
Cross-Lingual Link Discovery (CLLD) is a new problem in Information Retrieval. The aim is to automatically identify meaningful and relevant hypertext links between documents in different languages. This is particularly helpful in knowledge discovery if a multi-lingual knowledge base is sparse in one language or another, or the topical coverage in e...
In this paper we discuss some of the document encoding errors that were found when scaling our indexer and search engine up to large collections crawled from the web, such as ClueWeb09. In this paper we describe the encoding errors, what effect they could have on indexing and searching, how they are processed within our indexer and search engine an...
Previous work has examined space saving and throughput increasing techniques for long postings lists in an inverted file search engine. In this contribution we show that highly sporadic terms (terms that occur in 1 or 2 documents) are a high proportion of the unique terms in the collection and that these terms are seen in queries. The previously kn...
INEX investigates focused retrieval from structured documents by providing large test collections of structured documents, uniform evaluation measures, and a forum for organizations to compare their results. This paper reports on the INEX 2013 evaluation campaign, which consisted of four activities addressing three themes: searching professional an...
The time cost of searching with an inverted index is directly proportional to the number of postings processed and the cost of processing each posting. Dynamic pruning reduces the number of postings examined. Pre-calculation then quantization of term / document weights reduces the cost of evaluating each posting. The effect of quantization on preci...
Introduction: Before embarking on the design of any computer system it is first necessary to assess the magnitude of the problem. In the case of a web search engine this assessment amounts to determining the current size of the web, the growth rate of the web, and the quantity of computing resource necessary to search it, and projecting the histori...
The Seventeenth Australian Document Computing Symposium was held in Dunedin, New Zealand on the 5th and 6th of December 2012. In total twenty four papers were submitted. From those eleven were accepted for full presentation and 8 for short presentation. A poster session was held jointly with the Australasian Language Technology Workshop.
On August 16, 2012 the SIGIR 2012 Workshop on Open Source Information Retrieval was held as part of the SIGIR 2012 conference in Portland, Oregon, USA. There were 2 invited talks, one from industry and one from academia. There were 6 full papers and 6 short papers presented as well as demonstrations of 4 open source tools. Finally there was a livel...
In this paper we examine automated Chinese to English link discovery in Wikipedia and the effects of Chinese segmentation and Chinese to English translation on the hyperlink recommendation. Our experimental results show that the implemented link discovery framework can effectively recommend Chinese-to-English cross-lingual links. The techniques des...
Language identification is automatically determining the language that a previously unseen document was written in. We compared several prior methods on samples from the Wikipedia and the EuroParl collections. Most of these methods work well. But we identify that these (and presumably other document) collections are heterogeneous in size, and short...
In this paper, we describe a machine-translated parallel English corpus for the NTCIR Chinese, Japanese and Korean (CJK) Wikipedia collections. This document collection is named CJK2E Wikipedia XML corpus. The corpus could be used by the information retrieval research community and knowledge sharing in Wikipedia in many ways; for example, this corp...
Spam has long been identified as a problem that web search engines are required to deal with. Large collection sizes are also an increasing issue for institutions that do not have the necessary resources to process them in their entirety. In this paper we investigate the effect that withholding documents identified as spam has on the resources requ...
In this paper we investigate an unsupervised learning method applied to low level image features extracted from a large collection of images using data mining strategies. The mining process resulted in several interesting emergent semantic patterns. Initially, local image features are extracted using image processing techniques which are then clust...
INEX investigates focused retrieval from structured documents by providing large test collections of structured documents, uniform evaluation measures, and a forum for organizations to compare their results. This paper reports on the INEX’12 evaluation campaign, which consisted of five tracks: Linked Data, Relevance Feedback, Snippet Retrieval,...
Many image retrieval and object recognition systems rely on high-dimensional feature representation schemes such as SIFT. Because of this high dimensionality these features suffer from the curse of dimensionality and high memory needs. In this paper we evaluate an approach that reduces the size of a SIFT descriptor from 128 bytes to 128 bits. We t...
Divergence from a random baseline is a technique for the evaluation of document clustering. It ensures cluster quality measures are performing work that prevents ineffective clusterings from giving high scores to clusterings that provide no useful result. These concepts are defined and analysed using intrinsic and extrinsic approaches to the evalua...
INEX investigates focused retrieval from structured documents by providing large test collections of structured documents, uniform evaluation measures, and a forum for organizations to compare their results. This paper reports on the INEX 2011 evaluation campaign, which consisted of five active tracks: Books and Social Search, Data Centric, Quest...
We introduce a set of new metrics for hyperlink quality. These metrics are based on users' interactions with hyperlinks as recorded in click logs. Using a year-long click log, we assess the INEX 2008 link discovery (Link-the-Wiki) runs and find that our metrics rank them differently from the existing metrics (INEX automatic and manual assessment)...
This paper gives an overview of the INEX 2011 Snippet Retrieval Track. The goal of the Snippet Retrieval Track is to provide a common forum for the evaluation of the effectiveness of snippets, and to investigate how best to generate snippets for search results, which should provide the user with sufficient information to determine whether the under...
This paper describes the evaluation used in benchmarking the effectiveness of cross-lingual link discovery (CLLD). Cross-lingual link discovery is a way of automatically finding prospective links between documents in different languages, which is particularly helpful for knowledge discovery of different language domains. A CLLD evaluation framework is...
The search engine vocabulary is normally stored in alphabetical order so that it can be searched with a binary search. If the vocabulary is large, it can be represented as a 2-level B-tree and only the root of the tree is held in memory. The leaves are retrieved from disk only when required at runtime. In this paper, we investigate and address issu...
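The two-level layout described in this abstract can be sketched as below. This is an assumption-laden illustration, not the paper's structure: leaves are ordinary in-memory lists standing in for disk blocks, the root holds the first term of each leaf, and the class name `TwoLevelVocabulary`, its `leaf_size` parameter, and the `leaf_reads` counter are all hypothetical.

```python
from bisect import bisect_left, bisect_right


class TwoLevelVocabulary:
    """Sorted vocabulary split into leaves, with only the root kept 'in memory'."""

    def __init__(self, sorted_terms, leaf_size=4):
        # Each leaf stands in for one disk block of alphabetically ordered terms.
        self.leaves = [sorted_terms[i:i + leaf_size]
                       for i in range(0, len(sorted_terms), leaf_size)]
        # The root stores the first term of every leaf.
        self.root = [leaf[0] for leaf in self.leaves]
        self.leaf_reads = 0  # counts simulated disk reads

    def _read_leaf(self, index):
        self.leaf_reads += 1
        return self.leaves[index]

    def contains(self, term):
        # Binary search the root for the last leaf whose first term <= term.
        index = bisect_right(self.root, term) - 1
        if index < 0:
            return False  # term sorts before everything; no leaf is read
        leaf = self._read_leaf(index)
        # Binary search within the single leaf that could hold the term.
        slot = bisect_left(leaf, term)
        return slot < len(leaf) and leaf[slot] == term
```

Under these assumptions a lookup costs one binary search of the root plus at most one leaf read, which is the property that makes the layout attractive when the full vocabulary does not fit in memory.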
It is sometimes required to order search results using textual document attributes such as titles. This is problematic for performance because of the memory required to store these long text strings at indexing and search time. We create a method for compressing strings which may be used for approximate ordering of search results on textual attribu...
Interaction with a mobile device remains difficult due to inherent physical limitations. This difficulty is particularly evident for search, which requires typing. We extend the One-Search-Only search paradigm by adding a novel link-browsing scheme built on top of automatic link discovery. A prototype was built for iPhone and tested with 12 subje...
The University of Otago submitted runs to the Snippet Retrieval Track and the Relevance Feedback tracks at INEX 2011. Snippets were generated using vector space ranking functions, taking into account or ignoring structural hints, and using word clouds. We found that using passages made better snippets than XML elements and that word clouds make bad...
1 Opening Matters The 2011 ACM SIGIR Annual Business Meeting took place on Thursday 27th July 2011, at the SIGIR 2011 conference in Beijing, China. The meeting opened at 12:05pm and was led by James Allan. The meeting started with a tribute to the late Efthi Efthimiadis who passed away on the 7th of April 2011. Efthi was a long time contributor...
The INEX 2010 Link-the-Wiki track examined link-discovery in the Te Ara collection, a previously unlinked document collection. Te Ara is structured more as a digital cultural history than as a set of entities. With no links and no automatic entity identification, previous Link-the-Wiki algorithms could not be used. Assessment was also necessarily manu...
In this paper, we describe the University of Otago’s participation in the Ad Hoc, Link-the-Wiki, Efficiency and Data Centric Tracks of INEX 2010. In the Link-the-Wiki Track, we show that the simpler relevance summation method works better for producing Best Entry Points (BEP). In the Ad Hoc Track, we discuss the effect of various stemming algorith...
INEX investigates focused retrieval from structured documents by providing large test collections of structured documents, uniform evaluation measures, and a forum for organizations to compare their results. This paper reports on the INEX 2010 evaluation campaign, which consisted of a wide range of tracks: Ad Hoc, Book, Data Centric, Interactive, Q...
We explore statistical properties of links within Wikipedia. We demonstrate that a simple algorithm can predict many of the links that would normally be added to a new article, without considering the topic of the article itself. We then explore a variant of topic-oriented PageRank, which can effectively identify topical links within existing artic...
Ranking function performance reached a plateau in 1994. The reason for this is investigated. First the performance of BM25 is measured as the proportion of queries satisfied on the first page of 10 results -- it performs well. The performance is then compared to human performance. They perform comparably. The conclusion is there isn't much room for...
Mobile phones are now powerful and pervasive making them ideal information browsers. The Internet has revolutionized our lives and is a major knowledge sharing media. However, many mobile phone users cannot access the Internet (for financial or technical reasons) and so the mobile Internet has not been fully realized. We propose a novel content del...
The INEX 2010 Data Centric Track is discussed. A dump of IMDb was used as the document collection, 28 topics were submitted, 36 runs were submitted by 8 institutes, and 26 topics were assessed. Most runs (all except 2) did not use the structure present in the topics; and consequently no improvement is yet seen by search engines that do so.
This paper gives an overview of the INEX 2010 Ad Hoc Track. The main goals of the Ad Hoc Track were three-fold. The first goal was to study focused retrieval under resource restricted conditions such as a small screen mobile device or a document summary on a hitlist. This leads to variants of the focused retrieval tasks that address the impact of r...
Wikimedia article archives (Wikipedia, Wiktionary, and so on) assemble open-access, authoritative corpora for semantic-informed datamining, machine learning, information retrieval, and natural language processing. In this paper, we show the MediaWiki wikitext grammar to be context-sensitive, thus precluding application of simple parsing techniques....
IR efficiency is normally addressed in terms of accumulator initialisation, disk I/O, decompression, ranking and sorting. Traditionally, the performance of search engines is dominated by slow disk I/O, CPU-intensive decompression, complex similarity ranking functions and sorting a large number of candidate documents. However, after we have applied...
Both focused retrieval and result aggregation provide the user with answers to their information needs, rather than just pointers to whole documents. Focused retrieval identifies not only relevant documents but also which parts of those documents are relevant, thus reducing the time it takes the user to navigate in a document. Result aggregation is...
INEX investigates focused retrieval from structured documents by providing large test collections of structured documents, uniform evaluation measures, and a forum for organizations to compare their results. This paper reports on the INEX 2009 evaluation campaign, which consisted of a wide range of tracks: Ad hoc, Book, Efficiency, Entity Ranking,...
Building an efficient and an effective search engine is a very challenging task. In this paper, we present the efficiency and effectiveness of our search engine at the INEX 2009 Efficiency and Ad Hoc Tracks. We have developed a simple and effective pruning method for fast query evaluation, and used a two-step process for Ad Hoc retrieval. The overa...
This paper gives an overview of the INEX 2009 Ad Hoc Track. The main goals of the Ad Hoc Track were three-fold. The first goal was to investigate the impact of the collection scale and markup, by using a new collection that is again based on the Wikipedia but is over 4 times larger, with longer articles and additional semantic annotations. For th...
Over the last decade spam has become a serious problem to email-users all over the world. Most of the daily email-traffic consists of this unwanted spam. There are various methods that have been proposed to fight spam, from IP-based blocking to filtering incoming email-messages. However it seems that it is impossible to overcome this problem as t...
This paper describes our participation in the Chinese word segmentation task of CIPS-SIGHAN 2010. We implemented an n-gram mutual information (NGMI) based segmentation algorithm with the mixed-up features from unsupervised, supervised and dictionary-based segmentation methods. This algorithm is also combined with a simple strategy for out-of-voc...
On July 23, 2009 the SIGIR Workshop on the Future of IR Evaluation was held as part of SIGIR in Boston. The program consisted of four keynotes, a boaster and poster session with 20 accepted papers, four breakout groups, and a final panel discussion of the breakout group reports. This report outlines the events of the workshop and summarizes the maj...
This paper analyzes the results of the INEX 2009 Ad Hoc Track, focusing on a variety of topics. First, we examine in detail the relevance judgments. Second, we study the resulting system rankings, for each of the four ad hoc tasks, and determine whether differences between the best scoring participants are statistically significant. Third, we restr...
In the third year of the Link the Wiki track, the focus has been shifted to anchor-to-bep link discovery. The participants were encouraged to utilize different technologies to resolve the issue of focused link discovery. Apart from the 2009 Wikipedia collection, the Te Ara collection was introduced for the first time in INEX. For the link the wiki...
The 2008 proxy log covering all student access to the Wikipedia from the University of Otago is analysed. The log covers 17,635 student users for all 366 days in the year, amounting to over 577,973 user sessions. The analysis shows the Wikipedia is used every hour of the day, but seasonally. Use is low between semesters, rising steadily throughout...