Rob Gaizauskas

The University of Sheffield · Department of Computer Science (Faculty of Engineering)

DPhil

About

286
Publications
64,460
Reads
8,284
Citations
Citations since 2017
8 Research Items
2,118 Citations
[Chart: citations per year, 2017–2023]

Publications (286)
Chapter
The concept of comparability, or linguistic relatedness, or closeness between textual units or corpora has many possible applications in computational linguistics. Consequently, the task of measuring comparability has increasingly become a core technological challenge in the field, and needs to be developed and evaluated systematically. Many practi...
Chapter
Extracting parallel units (e.g. sentences or phrases) from comparable corpora in order to enrich existing statistical translation models is an avenue that has attracted a lot of research in recent years. There are experiments that convincingly show how parallel sentences extracted from comparable corpora are able to improve statistical machine tran...
Chapter
The availability of parallel corpora is limited, especially for under-resourced languages and narrow domains. On the other hand, the number of comparable documents in these areas that are freely available on the Web is continuously increasing. Algorithmic approaches to identify these documents from the Web are needed for the purpose of automaticall...
Book
This book provides an overview of how comparable corpora can be used to overcome the lack of parallel resources when building machine translation systems for under-resourced languages and domains. It presents a wealth of methods and open tools for building comparable corpora from the Web, evaluating comparability and extracting parallel data that c...
Article
Full-text available
Deep CNN-based object detection systems have achieved remarkable success on several large-scale object detection benchmarks. However, training such detectors requires a large number of labeled bounding boxes, which are more difficult to obtain than image-level annotations. Previous work addresses this issue by transforming image-level classifiers i...
Conference Paper
Automatic summarization of reader comments in on-line news is a challenging but clearly useful task. Work to date has produced extractive summaries using well-known techniques from other areas of NLP. But do users really want these, and do they support users in realistic tasks? We specify an alternative summary type for reader comments, based on th...
Conference Paper
Measuring the similarity of interlanguage-linked Wikipedia articles often requires the use of suitable language resources (e.g., dictionaries and MT systems) which can be problematic for languages with limited or poor translation resources. The size of Wikipedia can also present computational demands when computing similarity. This paper presents a...
Conference Paper
Full-text available
Researchers are beginning to explore how to generate summaries of extended argumentative conversations in social media, such as those found in reader comments in on-line news. To date, however, there has been little discussion of what these summaries should be like and a lack of human-authored exemplars, quite likely because writing summaries of thi...
Article
Full-text available
Researchers are beginning to explore how to generate summaries of extended argumentative conversations in social media, such as those found in reader comments in on-line news. To date, however, there has been little discussion of what these summaries should be like and a lack of human-authored exemplars, quite likely because writing summaries of th...
Conference Paper
Full-text available
Deep CNN-based object detection systems have achieved remarkable success on several large-scale object detection benchmarks. However, training such detectors requires a large number of labeled bounding boxes, which are more difficult to obtain than image-level annotations. Previous work addresses this issue by transforming image-level classifiers i...
Conference Paper
Full-text available
This paper presents an overview of the ImageCLEF 2016 evaluation campaign, an event that was organized as part of the CLEF (Conference and Labs of the Evaluation Forum) labs 2016. ImageCLEF is an ongoing initiative that promotes the evaluation of technologies for annotation, indexing and retrieval for providing information access to collections of...
Conference Paper
Full-text available
This paper investigates graph-based approaches to labeled topic clustering of reader comments in online news. For graph-based clustering we propose a linear regression model of similarity between the graph nodes (comments) based on similarity features and weights trained using automatically derived training data. To label the clusters our graph-bas...
Conference Paper
Existing approaches to summarizing multi-party argumentative conversations in reader comment are extractive and fail to capture the argumentative nature of these conversations. Work on argument mining proposes schemes for identifying argument elements and relations in text but has not yet addressed how summaries might be generated from a global ana...
Conference Paper
Full-text available
The ImageCLEF 2015 Scalable Image Annotation, Localization and Sentence Generation task was the fourth edition of a challenge aimed at developing more scalable image annotation systems. In particular this year the focus of the three subtasks available to participants had the goal to develop techniques to allow computers to reliably describe images,...
Conference Paper
Full-text available
Online commenting to news articles provides a communication channel between media professionals and readers offering a crucial tool for opinion exchange and freedom of expression. Currently, comments are detached from the news article and thus removed from the context that they were written for. In this work, we propose a method to connect readers'...
Article
Full-text available
Literature-based discovery (LBD) aims to identify "hidden knowledge" in the medical literature by: (1) analyzing documents to identify pairs of explicitly related concepts (terms), then (2) hypothesizing novel relations between pairs of unrelated concepts that are implicitly related via a shared concept to which both are explicitly related. Many LB...
Article
Full-text available
In this paper we make two contributions. First, we describe a multi-component system called BiTES (Bilingual Term Extraction System) designed to automatically gather domain-specific bilingual term pairs from Web data. BiTES components consist of data gathering tools, domain classifiers, monolingual text extraction systems and bilingual term aligner...
Article
Full-text available
This paper presents work on collecting comparable corpora for 9 language pairs: Estonian-English, Latvian-English, Lithuanian-English, Greek-English, Greek-Romanian, Croatian-English, Romanian-English, Romanian-German and Slovenian-English. The objective of this work was to gather texts from the same domains and genres and with a similar level of c...
Conference Paper
Full-text available
Different people may describe the same object in different ways, and at varied levels of granularity ("poodle", "dog", "pet" or "animal"?) In this paper, we propose the idea of 'granularity-aware' groupings where semantically related concepts are grouped across different levels of granularity to capture the variation in how different people describ...
Conference Paper
Full-text available
Named Entity Disambiguation (NED) refers to the task of mapping different named entity mentions in running text to their correct interpretations in a specific knowledge base (KB). This paper presents a collective disambiguation approach using a graph model. All possible NE candidates are represented as nodes in the graph and associations between di...
Article
In this paper we investigate the application of entity type models in extractive multi-document summarization using the automatic caption generation for images of geo-located entities (e.g. Westminster Abbey, Loch Ness, Eiffel Tower) as an application scenario. Entity type models contain sets of patterns aiming to capture the ways the geo-located e...
Conference Paper
Full-text available
In this paper we investigate a number of questions relating to the identification of the domain of a term by domain classification of the document in which the term occurs. We propose and evaluate a straightforward method for domain classification of documents in 24 languages that exploits a multilingual thesaurus and Wikipedia. We investigate and...
Article
Full-text available
Considerable attention is being paid to methods for gathering and evaluating comparable corpora, not only to improve Statistical Machine Translation (SMT) but for other applications as well, e.g. the extraction of paraphrases. The potential value of such corpora requires efficient and effective methods for gathering and evaluating them. Most of thes...
Conference Paper
Full-text available
In this paper we present a novel approach to disambiguate textual mentions of named entities against the Wikipedia knowledge base. The conditional dependencies between different named entities across Wikipedia are represented as a Markov network. In our approach, named entities are treated as hidden variables and textual mentions as observations. T...
Article
The temporal bounding problem is that of finding the beginning and ending times of a temporal interval during which an assertion holds. Existing approaches to temporal bounding have assumed the provision of a reference document from which to extract temporal bounds. We argue that a real-world setting does not include a reference document and that a...
Article
Full-text available
Product and service reviews are abundantly available online, but selecting relevant information from them involves a significant amount of time. The authors address this problem with Starlet, a novel approach for extracting multidocument summarizations that considers aspect rating distributions and language modeling. These features encourage the in...
Article
In this article, we investigate what sorts of information humans request about geographical objects of the same type. For example, Edinburgh Castle and Bodiam Castle are two objects of the same type: “castle.” The question is whether specific information is requested for the object type “castle” and how this information differs for objects of other...
Chapter
Reviews about products and services are abundantly available online. However, gathering information relevant to shoppers involves a significant amount of time reading reviews and weeding out extraneous information. While recent work in multi-document summarization has attempted to some degree to address this challenge, many questions about extracti...
Article
In this paper we present a method for extracting bilingual terminologies from comparable corpora. In our approach we treat bilingual term extraction as a classification problem. For classification we use an SVM binary classifier and training data taken from the EUROVOC thesaurus. We test our approach on a held-out test set from EUROVOC and perform...
Conference Paper
Automatically determining the temporal order of events and times in a text is difficult, though humans can readily perform this task. Sometimes events and times are related through use of an explicit co-ordination which gives information about the temporal relation: expressions like "before" and "as soon as". We investigate the rôle that these co-o...
Conference Paper
Full-text available
In this paper we present a novel approach to search a knowledge base for an entry that contains information about a named entity (NE) mention as specified within a given context. A document similarity function (NEBSim) based on NE co-occurrence has been developed to calculate the similarity between two documents given a specific NE mention in one o...
Chapter
Full-text available
This paper reports an initial study that aims to assess the viability of multi-document summarization techniques for automatic captioning of geo-referenced images. The automatic captioning procedure requires summarizing multiple Web documents that contain information related to images’ location. We use different state-of-the art summarization syste...
Conference Paper
Full-text available
Images with geo-tagging information are increasingly available on the Web. However, such images need to be annotated with additional textual information if they are to be retrievable, since users do not search by geo-coordinates. We propose to automatically generate such textual information by (1) generating toponyms from the geo-tagging informatio...
Article
Automatic temporal ordering of events described in discourse has been of great interest in recent years. Event orderings are conveyed in text via various linguistic mechanisms including the use of expressions such as "before", "after" or "during" that explicitly assert a temporal relation -- temporal signals. In this paper, we investigate the role...
Article
Full-text available
This paper describes the University of Sheffield's entry in the 2011 TAC KBP entity linking and slot filling tasks. We chose to participate in the monolingual entity linking task, the monolingual slot filling task and the temporal slot filling tasks. We set out to build a framework for experimentation with knowledge base population. This framework...
Article
Automated answering of natural language questions is an interesting and useful problem to solve. Question answering (QA) systems often perform information retrieval at an initial stage. Information retrieval (IR) performance, provided by engines such as Lucene, places a bound on overall system performance. For example, no answer bearing documents a...
Article
Full-text available
We present CAVaT, a tool that performs Corpus Analysis and Validation for TimeML. CAVaT is an open source, modular checking utility for statistical analysis of features specific to temporally-annotated natural language corpora. It provides reporting, highlights salient links between a variety of general and time-specific linguistic features, and al...
Article
Temporal information conveyed by language describes how the world around us changes through time. Events, durations and times are all temporal elements that can be viewed as intervals. These intervals are sometimes temporally related in text. Automatically determining the nature of such relations is a complex and unsolved problem. Some words can ac...
Article
We describe the University of Sheffield system used in the TempEval-2 challenge, USFD2. The challenge requires the automatic identification of temporal entities and relations in text. USFD2 identifies and anchors temporal expressions, and also attempts two of the four temporal relation assignment tasks. A rule-based system picks out and anchors tem...
Article
In this paper we present RTMML, a markup language for the tenses of verbs and temporal relations between verbs. There is a richness to tense in language that is not fully captured by existing temporal annotation schemata. Following Reichenbach we present an analysis of tense in terms of abstract time points, with the aim of supporting automated pro...
Article
Full-text available
Statistical Machine Translation (SMT) relies on the availability of rich parallel corpora. However, in the case of under-resourced languages, parallel corpora are not readily available. To overcome this problem previous work has recognized the potential of using comparable corpora as training data. The process of obtaining such data usually involve...
Conference Paper
Wikipedia articles in different languages have been mined to support various tasks, such as Cross-Language Information Retrieval (CLIR) and Statistical Machine Translation (SMT). Articles on the same topic in different languages are often connected by inter-language links, which can be used to identify similar or comparable content. In this work, w...
Conference Paper
In this paper we address the problem of optimizing global multi-document summary quality using A* search and discriminative training. Different search strategies have been investigated to find the globally best summary. In them the search is usually guided by an existing prediction model which can distinguish between good and bad summaries. However...
Conference Paper
Full-text available
Reviews about products and services are abundantly available online. However, selecting information relevant to a potential buyer involves a significant amount of time reading user's reviews and weeding out comments unrelated to the important aspects of the reviewed entity. In this work, we present STARLET, a novel approach to multi-document summar...
Conference Paper
Full-text available
In this paper we investigate what sorts of information humans request about geographical objects of the same type. For example, Edinburgh Castle and Bodiam Castle are two objects of the same type - castle. The question is whether specific information is requested for the object type castle and how this information differs for objects of other t...
Conference Paper
Full-text available
This demonstration presents a novel interactive graphical interface to document content focusing on the time dimension. The objective of Time-Surfer is to let users search and explore information related to a specific period, event, or event participant within a document. The system is based on the automatic detection not only of time expressions,...
Conference Paper
Full-text available
Lack of sufficient linguistic resources and parallel corpora for many languages and domains currently is one of the major obstacles to further advancement of automated translation. The solution proposed in this paper is to exploit the fact that non-parallel bi- or multilingual text resources are much more widely available than parallel translation d...
Conference Paper
Full-text available
Increasing quantities of images are indexed by GPS coordinates. However, it is difficult to search within such pictures. In this paper, we propose a solution to automatically generate captions (including place name, keywords and summary) from the web content based on image location information. The richer descriptions have great potential to help i...
Conference Paper
We describe the University of Sheffield system used in the TempEval-2 challenge, USFD2. The challenge requires the automatic identification of temporal entities and relations in text. USFD2 identifies and anchors temporal expressions, and also attempts two of the four temporal relation assignment tasks. A rule-based system picks out and anchors tem...
Conference Paper
Full-text available
In this paper, we present an approach to measure the transliteration similarity of English-Hindi word pairs. Our approach has two components. First we propose a bi-directional mapping between one or more characters in the Devanagari script and one or more characters in the Roman script (pronounced as in English). This allows a given Hindi word writ...
Conference Paper
We present CAVaT, a tool that performs Corpus Analysis and Validation for TimeML. CAVaT is an open source, modular checking utility for statistical analysis of features specific to temporally-annotated natural language corpora. It provides reporting, highlights salient links between a variety of general and time-specific linguistic features, and...
Conference Paper
Full-text available
A considerable amount of work has been put into development of stemmers and morphological analysers. The majority of these approaches use hand-crafted suffix-replacement rules but a few try to discover such rules from corpora. While most of the approaches remove or replace suffixes, there are examples of derivational stemmers which are based on p...
Conference Paper
Full-text available
This paper presents a novel approach to automatic captioning of geo-tagged images by summarizing multiple web-documents that contain information related to an image's location. The summarizer is biased by dependency pattern models towards sentences which contain features typically provided for different scene types such as those of churches, bridge...
Conference Paper
Full-text available
In this paper we address two key challenges for extractive multi-document summarization: the search problem of finding the best scoring summary and the training problem of learning the best model parameters. We propose an A* search algorithm to find the best extractive summary up to a given length, which is both optimal and efficient to run. Furthe...
Article
Full-text available
In this paper we explore three approaches to assigning Gene Ontology semantic classifications to abstracts from the PubMed database: lexical lookup, information retrieval and machine learning. To evaluate the approaches we use two "gold" standards derived from the yeast genome database (SGD). While evaluation provides insights into the three ap...
Article
Full-text available
Abbreviations are common in biomedical documents and many are ambiguous in the sense that they have several potential expansions. Identifying the correct expansion is necessary for language understanding and important for applications such as document retrieval. Identifying the correct expansion can be viewed as a Word Sense Disambiguation (WSD...
Article
Full-text available
TempEval is a framework for evaluating systems that automatically annotate texts with temporal relations. It was created in the context of the SemEval 2007 workshop and uses the TimeML annotation language. The evaluation consists of three subtasks of temporal annotation: anchoring an event to a time expression in the same sentence, anchoring an eve...
Article
In this paper, we describe the construction of a semantically annotated corpus of clinical texts for use in the development and evaluation of systems for automatically extracting clinically significant information from the textual component of patient records. The paper details the sampling of textual material from a collection of 20,000 cancer pat...
Article
Full-text available
BACKGROUND: The Clinical E-Science Framework (CLEF) project has built a system to extract clinically significant information from the textual component of medical records in order to support clinical research, evidence-based healthcare and genotype-meets-phenotype informatics. One part of this system is the identification of relationships between...
Article
Full-text available
Like text in other domains, biomedical documents contain a range of terms with more than one possible meaning. These ambiguities form a significant obstacle to the automatic processing of biomedical texts. Previous approaches to resolving this problem have made use of various sources of information including linguistic features of the context in wh...
Article
Text mining technology can be used to assist in finding relevant or novel information in large volumes of unstructured data, such as that which is increasingly available in the electronic scientific literature. However, publishers are not text mining specialists, nor typically are the end user scientists who consume their products. This situation s...
Conference Paper
In the near future digital cameras will come standardly equipped with GPS and compass and will automatically add global position and direction information to the metadata of every picture taken. Can we use this information, together with information from geographical information systems and the Web more generally, to caption images automatically?
Conference Paper
Full-text available
Automated answering of natural language questions is an interesting and useful problem to solve. Question answering (QA) systems often perform information retrieval at an initial stage. Information retrieval (IR) performance, provided by engines such as Lucene, places a bound on overall system performance. For example, no answer bearing documents...
Article
Full-text available
The Clinical E-Science Framework (CLEF) project has built a system to extract clinically significant information from the textual component of medical records, for clinical research, evidence-based healthcare and genotype-meets-phenotype informatics. One part of this system is the identification of relationships between clinically important entitie...
Conference Paper
Full-text available
Like text in other domains, biomedical documents contain a range of terms with more than one possible meaning. These ambiguities form a significant obstacle to the automatic processing of biomedical texts. Previous approaches to resolving this problem have made use of a variety of knowledge sources including linguistic information (from the context...