About
162 Publications
25,114 Reads
3,411 Citations
Publications (162)
Topic modelling is a popular unsupervised method for identifying the underlying themes in document collections that has many applications in information retrieval. A topic is usually represented by a list of terms ranked by their probability but, since these can be difficult to interpret, various approaches have been developed to assign descriptive...
Identifying relevant studies for inclusion in systematic reviews requires significant effort from human experts who manually screen large numbers of studies. The problem is made more difficult by the growing volume of medical literature and Information Retrieval techniques have proved to be useful to reduce workload. Reviewers are often interested...
This paper describes the University of Sheffield’s approach to the CLEF 2019 eHealth Task 2: Technologically Assisted Reviews in Empirical Medicine. This task focuses on identifying relevant studies for systematic reviews. The University of Sheffield participated in subtask 2 (Abstract and Title Screening). Our approach used lexical statistics (Log...
Topic models, such as LDA, are widely used in Natural Language Processing. Making their output interpretable is an important area of research with applications to areas such as the enhancement of exploratory search interfaces and the development of interpretable machine learning models. Conventionally, topics are represented by their n most probab...
Background
The volume of research published in the biomedical domain has increasingly led to researchers focussing on specific areas of interest and connections between findings being missed. Literature based discovery (LBD) attempts to address this problem by searching for previously unnoticed connections between published information (also known...
The identification of duplicated and plagiarised passages of text has become an increasingly active area of research. In this paper we investigate methods for plagiarism detection that aim to identify potential sources of plagiarism from MEDLINE, particularly when the original text has been modified through the replacement of words or phrases. A sc...
Distant supervision is a useful technique for creating relation classifiers in the absence of labelled data. The approaches are often evaluated using a held-out portion of the distantly labelled data, thereby avoiding the need for labelled data entirely. However, held-out evaluation means that systems are tested against noisy data, making it diffi...
Distant supervision is a widely applied approach to automatic training of relation extraction systems and has the advantage that it can generate large amounts of labelled data with minimal effort. However, this data may contain errors and consequently systems trained using distant supervision tend not to perform as well as those based on manually l...
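The core distant-supervision idea summarised above can be illustrated with a small sketch (not the paper's pipeline; the knowledge-base triples and sentences below are invented): a sentence that mentions an entity pair listed as related in the knowledge base is taken as a noisy positive example for that relation.

```python
# Minimal sketch of distant-supervision labelling (illustrative data, not the
# paper's pipeline): a sentence mentioning an entity pair that the knowledge
# base lists as related is treated as a (noisy) positive training example.

KB = {
    ("aspirin", "headache"): "treats",
    ("ibuprofen", "inflammation"): "treats",
}

sentences = [
    "Aspirin is commonly taken to relieve a headache.",
    "Aspirin was first synthesised in 1897.",            # no known pair -> unlabelled
    "Ibuprofen reduces inflammation and mild pain.",
]

def distant_label(sentence, kb):
    """Return (entity1, entity2, relation) if a known pair co-occurs, else None."""
    text = sentence.lower()
    for (e1, e2), relation in kb.items():
        if e1 in text and e2 in text:
            return (e1, e2, relation)   # noisy: co-occurrence may not express the relation
    return None

training_data = [(s, distant_label(s, KB)) for s in sentences]
for sentence, label in training_data:
    print(label, "<-", sentence)
```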
Automatic recognition of relationships between key entities in text is an important problem which has many applications. Supervised machine learning techniques have proved to be the most effective approach to this problem. However, they require labelled training data which may not be available in sufficient quantity (or at all) and is expensive to...
Topic models have been shown to be a useful way of representing the content of large document collections, for example via visualisation interfaces (topic browsers). These systems enable users to explore collections by way of latent topics. A standard way to represent a topic is using a term list, i.e. the top-n words with highest conditional proba...
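For readers unfamiliar with the term-list representation mentioned above, the following minimal sketch shows how a topic is reduced to its top-n terms; the vocabulary and topic-word probabilities are invented rather than taken from a trained model.

```python
# Sketch of the standard "top-n terms" topic representation. The vocabulary
# and topic-word probabilities are invented for illustration; in practice they
# would come from a trained topic model such as LDA.
import numpy as np

vocab = ["match", "team", "goal", "election", "vote", "league"]
topic_word_probs = np.array([0.30, 0.25, 0.20, 0.02, 0.03, 0.20])  # sums to 1

def top_n_terms(probs, vocabulary, n=3):
    """Return the n terms with highest probability under the topic."""
    order = np.argsort(probs)[::-1][:n]
    return [(vocabulary[i], float(probs[i])) for i in order]

print(top_n_terms(topic_word_probs, vocab))
# [('match', 0.3), ('team', 0.25), ('goal', 0.2)]
```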
We introduce a new problem, identifying the type of relation that holds between a pair of similar items in a digital library. Being able to provide a reason why items are similar has applications in recommendation, personalization, and search. We investigate the problem within the context of Europeana, a large digital library containing items relat...
Literature-based discovery (LBD) aims to identify "hidden knowledge" in the medical literature by: (1) analyzing documents to identify pairs of explicitly related concepts (terms), then (2) hypothesizing novel relations between pairs of unrelated concepts that are implicitly related via a shared concept to which both are explicitly related. Many LB...
Introduction
"Cultural heritage involves rich and highly heterogeneous collections that are challenging to archive and convey to the general public." (Hardman et al., 2009, 23)
This statement describes two aspects that make access to cultural heritage information challenging: the heterogeneous nature of many cultural heritage collections and the growin...
A range of approaches to the representation of lexical semantics have been explored within Computational Linguistics. Two of the most popular are distributional and knowledge-based models. This paper proposes hybrid models of lexical semantics that combine the advantages of these two approaches. Our models provide robust representations of synonymo...
Purpose
The purpose of this paper is to investigate the effects of cognitive style on navigating a large digital library of cultural heritage information; specifically, the paper focuses on the wholist/analytic dimension as experienced in the field of educational informatics. The hypothesis is that wholist and analytic users have characteristically...
Self-supervised relation extraction uses a knowledge base to automatically annotate a training corpus which is then used to train a classifier. This approach has been successfully applied to different domains using a range of knowledge bases. This paper applies the approach to the biomedical domain using UMLS, a large biomedical knowledge base cont...
Topic models have been shown to be a useful way of representing the content of large document collections, for example via visualisation interfaces (topic browsers). These systems enable users to explore collections by way of latent topics. A standard way to represent a topic is using a set of keywords, i.e. the top-n words with highest marginal...
Search boxes providing simple keyword-based search are insufficient when users have complex information needs or are unfamiliar with a collection, for example in large digital libraries. Browsing hierarchies can support these richer interactions, but many collections do not have a suitable hierarchy available. In this paper we present a number of a...
This paper introduces an unsupervised graph-based method that selects textual labels for automatically generated topics. Our approach uses the topic keywords to query a search engine and generate a graph from the words contained in the results. PageRank is then used to weigh the words in the graph and score the candidate labels. The state-of-the-ar...
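A rough sketch of the graph-based labelling idea described in this abstract (the retrieved text is hard-coded, the scoring is simplified, and networkx's PageRank stands in for the paper's implementation):

```python
# Hedged sketch of graph-based topic labelling: build a co-occurrence graph
# from text retrieved for the topic keywords, weight words with PageRank, and
# score candidate labels by their words' weights. The "search results" below
# are hard-coded stand-ins; the paper's retrieval and scoring details differ.
import itertools
import networkx as nx

retrieved_snippets = [
    "premier league football match ends in late goal",
    "football team wins league title after final match",
]
candidate_labels = ["football league", "election campaign"]

graph = nx.Graph()
for snippet in retrieved_snippets:
    words = set(snippet.split())
    # connect words that co-occur within the same snippet
    for w1, w2 in itertools.combinations(words, 2):
        graph.add_edge(w1, w2)

scores = nx.pagerank(graph)

def label_score(label):
    """Score a candidate label by the summed PageRank of its words."""
    return sum(scores.get(word, 0.0) for word in label.split())

print(max(candidate_labels, key=label_score))  # -> "football league"
```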
This paper describes first results using the Unified Medical Language System (UMLS) for distantly supervised relation extraction. UMLS is a large knowledge base which contains information about millions of medical concepts and relations between them. Our approach is evaluated using existing relation extraction data sets that contain relations that...
Previous approaches to the problem of measuring similarity between automatically generated topics have been based on comparison of the topics' word probability distributions. This paper presents alternative approaches, including ones based on distributional semantics and knowledge-based measures, evaluated by comparison with human judgements. The...
Classic approaches to automatic input data generation are usually driven by the goal of obtaining program coverage and the need to solve or find solutions to path constraints to achieve this. As inputs are generated with respect to the structure of the code, they can be ineffective, difficult for humans to read, and unsuitable for testing missing i...
There is ample evidence of the influence of individual differences on information-seeking behaviours. Trailways and paths are increasingly important objects to support internet navigation. The EU-funded PATHS (Personalised Access to Cultural Heritage Spaces) project is investigating ways of assisting users with exploring a large collection of cultu...
In this paper we compare the results of the user-centred evaluation of two iterations of the PATHS system, which aims at supporting exploration, navigation and use of information in cultural heritage online collections. We focus on two path creation exercises, and examine the format and content of the paths according to available functionality and...
User-centered design and evaluation of a system to improve information access and assist the wider information activities of users in cultural heritage digital collections is described. Extending beyond simple, standalone information seeking and retrieval tasks, the system aims to enhance content ‘findability’ and to support users’ cognitive proces...
Automatic processing of biomedical documents is made difficult by the fact that many of the terms they contain are ambiguous. Word Sense Disambiguation (WSD) systems attempt to resolve these ambiguities and identify the correct meaning. However, the published literature on WSD systems for biomedical documents reports considerable differences in perf...
This paper describes a system for navigating large collections of information about cultural heritage which is applied to Europeana, the European Library. Europeana contains over 20 million artefacts with metadata in a wide range of European languages. The system currently provides access to Europeana content with metadata in English and Spanish. T...
Current Information Retrieval systems for digital cultural heritage support only the actual search aspect of the information seeking process. This demonstration presents the second PATHS system which provides the exploration, analysis, and sense-making features to support the full information seeking process.
Topics generated automatically, e.g. using LDA, are now widely used in Computational Linguistics. Topics are normally represented as a set of keywords, often the n terms in a topic with the highest marginal probabilities. We introduce an alternative approach in which topics are represented using images. Candidate images for each topic are retrieved...
Objective We aim to identify duplicate pairs of Medline citations, particularly when the documents are not identical but contain similar information.
Materials and methods Duplicate pairs of citations are identified by comparing word n-grams in pairs of documents. N-grams are modified using two approaches which take account of the fact that the doc...
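A minimal sketch of the underlying n-gram comparison (plain word n-grams with a Jaccard score; the paper's modified n-gram schemes are not reproduced here):

```python
# Sketch of duplicate detection by word n-gram overlap between two citations.

def word_ngrams(text, n=3):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_jaccard(doc1, doc2, n=3):
    """Jaccard overlap between the word n-gram sets of two documents."""
    a, b = word_ngrams(doc1, n), word_ngrams(doc2, n)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

citation_a = "randomised trial of aspirin for prevention of cardiovascular events"
citation_b = "a randomised trial of aspirin for prevention of cardiovascular disease"
print(round(ngram_jaccard(citation_a, citation_b), 3))
```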
This paper introduces distributional semantic similarity methods for automatically measuring the coherence of a set of words generated by a topic model. We construct a semantic space to represent each topic word by making use of Wikipedia as a reference corpus to identify context features and collect frequencies. Relatedness between topic words and...
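The coherence measure can be sketched as the mean pairwise similarity between the topic words' context vectors; the vectors below are toy values, whereas the paper derives them from Wikipedia co-occurrence statistics.

```python
# Sketch of distributional topic coherence: mean pairwise cosine similarity
# between the topic words' context vectors (toy values, not Wikipedia counts).
import itertools
import numpy as np

context_vectors = {            # word -> context-feature counts (invented)
    "match": np.array([5.0, 1.0, 0.0, 4.0]),
    "team":  np.array([4.0, 2.0, 0.0, 5.0]),
    "goal":  np.array([3.0, 1.0, 1.0, 4.0]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def topic_coherence(words, vectors):
    """Mean pairwise cosine similarity over all word pairs in the topic."""
    pairs = list(itertools.combinations(words, 2))
    return sum(cosine(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)

print(round(topic_coherence(["match", "team", "goal"], context_vectors), 3))
```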
The frequent non-availability of an automated oracle means that, in practice, checking software behaviour is often a painstaking manual task. Despite the high cost of human oracle involvement, there has been little research investigating how to make the role easier and less time-consuming. One source of human oracle cost is the inherent unre...
We approach the typed-similarity task using several kinds of heuristic similarity techniques for each type of similarity, based on information from the appropriate metadata fields. In addition we train a linear regressor for each type of similarity. The results indicate that the linear regression is key for good performance. Our best system ranked t...
The development of models for automatic detection of text re-use and plagiarism across languages has received increasing attention in recent years. However, the lack of an evaluation framework composed of annotated datasets has caused these efforts to be isolated. In this paper we present the CL!TR 2011 corpus, the first manually created corpus for...
Large amounts of cultural heritage content have now been digitized and are available in digital libraries. However, these are often unstructured and difficult to navigate. Automatic techniques for identifying similar items in these collections could be used to improve navigation since it would allow items that are implicitly connected to be linked...
There is a demand for taxonomies to organise large collections of documents into categories for browsing and exploration. This paper examines four existing taxonomies that have been manually created, along with two methods for deriving taxonomies automatically from data items. We use these taxonomies to organise items from a large online cultural h...
Large amounts of digital cultural heritage (CH) information have become available over the past years, requiring more powerful exploration systems than just a search box. The PATHS system aims to provide an environment in which users can successfully explore a large, unknown collection through two modalities: following existing paths to learn about...
Large digital libraries have become available over the past years through digitisation and aggregation projects. These large collections present a challenge to the new user who wishes to discover what is available in the collections. Subject classification can help in this task; however, in large collections it is frequently incomplete or inconsiste...
Digitisation of the cultural heritage means that a significant amount of material is now available through online digital library portals. However, the vast quantity of cultural heritage material can also be overwhelming for many users who lack knowledge of the collections, subject knowledge and the specialist language used to describe this content...
Classic approaches to test input generation -- such as dynamic symbolic execution and search-based testing -- are commonly driven by a test adequacy criterion such as branch coverage. However, there is no guarantee that these techniques will generate meaningful and realistic inputs, particularly in the case of string test data. Also, these techniqu...
Access to the vast body of research literature that is now available on biomedicine and related fields can be improved with automatic summarization. This paper describes a summarization system for the biomedical domain that represents documents as graphs formed from concepts and relations in the UMLS Metathesaurus. This system has to deal with the...
This paper describes the University of Sheffield's submission to SemEval-2012 Task 6: Semantic Text Similarity. Two approaches were developed. The first is an unsupervised technique based on the widely used vector space model and information from WordNet. The second method relies on supervised machine learning and represents each sentence as a set...
Text reuse is common in many scenarios and documents are often based, at least in part, on existing documents. This paper reports an approach to detecting text reuse which identifies not only documents which have been reused verbatim but is also designed to identify cases of reuse when the original has been rewritten. The approach identifies reuse...
A significant amount of information about Cultural Heritage artefacts is now available in digital format and has been made available in digital libraries. Being able to identify items that are similar would be useful for search and navigation through these data sets. Information about items in these repositories is often multimodal, such as pict...
Large numbers of cultural heritage items are now archived digitally along with accompanying metadata and are available to anyone with internet access. This information could be enriched by adding links to resources that provide background information about the items. Techniques have been developed for automatically adding links to Wikipedia to text...
Generating realistic, branch-covering string inputs is a challenging problem, due to the diverse and complex types of real-world data that are naturally encodable as strings, for example resource locators, dates of different localised formats, international banking codes, and national identity numbers. This paper presents an approach in which examp...
External plagiarism detection systems compare suspicious texts against a reference collection to identify the original one(s). The suspicious text may not contain a verbatim copy of the reference collection since plagiarists often try to disguise their behaviour by altering the text. For large reference collections, such as those accessible via the...
The most accurate approaches to Word Sense Disambiguation (WSD) for biomedical documents are based on supervised learning. However, these require manually labeled training examples which are expensive to create and consequently supervised WSD systems are normally limited to disambiguating a small set of ambiguous terms. An alternative approach is...
The Cultural Heritage in CLEF 2012 (CHiC) pilot evaluation included the ad-hoc retrieval, semantic enrichment and variability tasks. At CHiC 2012, the University of Sheffield and the University of the Basque Country submitted a joint entry, attempting the three English monolingual tasks. For the ad-hoc task, the baseline approach used the In...
Word-sense disambiguation (WSD) is the process of identifying the meanings of words in context. This article begins with discussing the origins of the problem in the earliest machine translation systems. Early attempts to solve the WSD problem suffered from a lack of coverage. The main approaches to tackle the problem were dictionary-based, connect...
Previous systems for literature based discovery, an automatic method of identifying hidden knowledge, have largely been based on bag of words approaches which perform only limited semantic analysis and interpretation. We describe the shortcomings of these approaches and suggest possible solutions that make use of techniques from Natural Language Pr...
In this paper, we present the results of the user requirements and interface design phase for a prototype system, designed to enhance interaction with cultural heritage collections online through means of a pathway metaphor. We present a single user interaction model that supports various work and information seeking tasks undertaken by both expert...
Topic models are an established technique for generating information about the subjects discussed in collections of documents. Latent Dirichlet Allocation (LDA) is a widely applied topic model. The topic models generated by LDA consist of sets of terms associated with each topic and these are used to provide context for a Word Sense Disambiguation...
Current techniques for knowledge-based Word Sense Disambiguation (WSD) of ambiguous biomedical terms rely on relations in the Unified Medical Language System Metathesaurus but do not take into account the domain of the target documents. The authors' goal is to improve these methods by using information about the topic of the document in which the a...
Plagiarism is widely acknowledged to be a significant and increasing problem for higher education institutions (McCabe 2005; Judge 2008). A wide range of solutions, including several commercial systems, have been proposed to assist the educator in the task of identifying plagiarised work, or even to detect it automatically. Direct comparison of t...
This paper describes the University of Sheffield entry for the 3rd International Competition on Plagiarism Detection which attempted the monolingual external plagiarism detection task. A three stage framework was used: preprocessing and indexing, candidate document selection (using an Information Retrieval based approach) and detailed analysis (usi...
Previous work on relation extraction has focussed on identifying relationships between entities that occur in the same sentence (intra-sentential relations) rather than between entities in different sentences (inter-sentential relations) despite previous research having shown that inter-sentential relations commonly occur in information extraction c...
Corpus-based techniques have proved to be very beneficial in the development of efficient and accurate approaches to word sense disambiguation (WSD) despite the fact that they generally represent relatively shallow knowledge. It has always been thought, however, that WSD could also benefit from deeper knowledge sources. We describe a novel approach...
Word Sense Disambiguation (WSD), the automatic identification of the meanings of ambiguous terms in a document, is an important stage in text processing. We describe a WSD system that has been developed specifically for the types of ambiguities found in biomedical documents. This system uses a range of knowledge sources. It employs both linguistic...
Researchers have access to a vast amount of information stored in textual documents and there is a pressing need for the development of automated methods to enable and improve access to this resource. Lexical ambiguity, the phenomenon in which a word or phrase has more than one possible meaning, presents a significant obstacle to automated text proc...
Word Sense Disambiguation (WSD), automatically identifying the meaning of ambiguous words in context, is an important stage of text processing. This article presents a graph-based approach to WSD in the biomedical domain. The method is unsupervised and does not require any labeled training data. It makes use of knowledge from the Unified Medical La...
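The abstract is truncated before the algorithmic details, so the sketch below shows only one common graph-based, unsupervised WSD strategy, personalised PageRank over a concept graph seeded by the context, with an invented toy graph in place of the UMLS.

```python
# Hedged sketch of one common graph-based WSD strategy (personalised PageRank
# over a concept graph, seeded by concepts mentioned in the context). The toy
# graph and concept names are invented; the paper's use of the UMLS and its
# exact ranking procedure are not reproduced here.
import networkx as nx

# Toy concept graph: "cold" could mean the illness or the temperature sense.
concept_graph = nx.Graph([
    ("cold_illness", "fever"), ("cold_illness", "cough"),
    ("cold_temperature", "weather"), ("cold_temperature", "ice"),
])

def disambiguate(candidate_senses, context_concepts, graph):
    """Rank candidate senses by personalised PageRank mass from the context."""
    personalization = {node: (1.0 if node in context_concepts else 0.0)
                       for node in graph.nodes}
    scores = nx.pagerank(graph, personalization=personalization)
    return max(candidate_senses, key=lambda sense: scores.get(sense, 0.0))

context = {"fever", "cough"}
print(disambiguate(["cold_illness", "cold_temperature"], context, concept_graph))
# -> "cold_illness"
```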
Due to the frequent non-existence of an automated oracle, test cases are often evaluated manually in practice. However, this fact is rarely taken into account by automatic test data generators, which seek to maximise a program's structural coverage only. The test data produced tends to be a poor fit with the program's operational profile. As a r...
In natural language, relationships between entities can be asserted within a single sentence or over many sentences in a document. Many information extraction systems are constrained to extracting binary relations that are asserted within a single sentence (single-sentence relations) and this limits the proportion of relations they can extract since th...
Several methods for automatically generating labeled examples that can be used as training data for WSD systems have been proposed, including a semi-supervised approach based on relevance feedback (Stevenson et al., 2008a). This approach was shown to generate examples that improved the performance of a WSD system for a set of ambiguous terms from t...
This paper describes the University of Sheffield entry for the 2nd international plagiarism detection competition (PAN 2010). Our system attempts to identify extrinsic plagiarism. A three-stage approach is used: pre-processing, candidate document selection (using word n-grams) and detailed analysis (using the Running Karp-Rabin Greedy String Tiling...
This paper examines the problem of finding articles in Wikipedia to match noun synsets in WordNet. The motivation is that these articles enrich the synsets with much more information than is already present in WordNet. Two methods are used. The first is title matching, following redirects and disambiguation links. The second is information retrie...
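The title-matching method can be sketched as follows, assuming the NLTK WordNet interface and a small stand-in set of Wikipedia titles (redirect and disambiguation handling omitted):

```python
# Sketch of the title-matching idea: look up each noun synset's lemma names in
# a set of Wikipedia article titles. The title set here is a small stand-in,
# and redirect/disambiguation handling is omitted. Requires the NLTK WordNet
# data (nltk.download("wordnet")).
from nltk.corpus import wordnet as wn

wikipedia_titles = {"bank", "bank (geography)", "jaguar", "piano"}  # stand-in

def match_synset_to_title(synset, titles):
    """Return the first lemma name of the synset that is also an article title."""
    for lemma in synset.lemma_names():
        candidate = lemma.replace("_", " ").lower()
        if candidate in titles:
            return candidate
    return None

for synset in wn.synsets("bank", pos=wn.NOUN)[:3]:
    print(synset.name(), "->", match_synset_to_title(synset, wikipedia_titles))
```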
We describe a concept-based summarization system for biomedical documents and show that its performance can be improved using Word Sense Disambiguation. The system represents the documents as graphs formed from concepts and relations from the UMLS. A degree-based clustering algorithm is applied to these graphs to discover different themes or topics...
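As a rough illustration of degree-based clustering over a concept graph (a toy graph and a deliberately simplified hub-assignment rule, not the paper's algorithm):

```python
# Hedged sketch of degree-based clustering over a concept graph: the
# highest-degree concepts are taken as hub vertices and every other concept is
# attached to a hub it is directly connected to (simplified rule). The graph
# is a toy example; the paper's clustering criteria differ.
import networkx as nx

g = nx.Graph([
    ("diabetes", "insulin"), ("diabetes", "glucose"), ("diabetes", "obesity"),
    ("hypertension", "blood_pressure"), ("hypertension", "sodium"),
])

def degree_clusters(graph, num_hubs=2):
    """Cluster nodes around the num_hubs highest-degree vertices."""
    hubs = sorted(graph.nodes, key=graph.degree, reverse=True)[:num_hubs]
    clusters = {hub: {hub} for hub in hubs}
    for node in graph.nodes:
        if node in hubs:
            continue
        # attach the node to the hub it shares an edge with (first hub otherwise)
        best = max(hubs, key=lambda hub: int(graph.has_edge(node, hub)))
        clusters[best].add(node)
    return clusters

print(degree_clusters(g))
```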
We describe two systems that participated in SemEval-2010 task 17 (All-words Word Sense Disambiguation on a Specific Domain) and were ranked in the third and fourth positions in the formal evaluation. Domain adaptation techniques using the background documents released in the task were used to assign ranking scores to the words and their senses. Th...
This paper presents a novel approach to the problem of paraphrase identification. Although paraphrases often make use of synonymous or near synonymous terms, many previous approaches have either ignored or made limited use of information about similarities between word meanings. We present an algorithm for paraphrase identification which makes e...
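One simple way to use word-meaning similarity for this task is sketched below with WordNet path similarity and an illustrative threshold (this is not the algorithm proposed in the paper):

```python
# Hedged sketch of paraphrase identification using word-meaning similarity:
# each word in one sentence is greedily aligned to its most similar word in
# the other (WordNet path similarity, exact match = 1.0), and the pair is
# called a paraphrase if the mean alignment score clears a threshold. The
# procedure and threshold are illustrative only. Requires the NLTK WordNet
# data (nltk.download("wordnet")).
from nltk.corpus import wordnet as wn

def word_similarity(w1, w2):
    """Best WordNet path similarity over the words' sense pairs (1.0 if equal)."""
    if w1 == w2:
        return 1.0
    scores = [s1.path_similarity(s2) or 0.0
              for s1 in wn.synsets(w1) for s2 in wn.synsets(w2)]
    return max(scores, default=0.0)

def is_paraphrase(sent1, sent2, threshold=0.6):
    tokens1, tokens2 = sent1.lower().split(), sent2.lower().split()
    aligned = [max(word_similarity(t1, t2) for t2 in tokens2) for t1 in tokens1]
    return sum(aligned) / len(aligned) >= threshold

print(is_paraphrase("the car crashed into the wall",
                    "the automobile smashed into the wall"))
```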
Abbreviations are common in biomedical documents and many are ambiguous in the sense that they have several potential expansions. Identifying the correct expansion is necessary for language understanding and important for applications such as document retrieval. Identifying the correct expansion can be viewed as a Word Sense Disambiguation (WSD...
Several techniques for the automatic acquisition of Information Extraction (IE) systems have used dependency trees to form the basis of an extraction pattern representation. These approaches have used a variety of pattern models (schemes for representing IE patterns based on particular parts of the dependency analysis). An appropriate pattern model...
Plagiarism is a serious problem in higher education and generally acknowledged to be on the increase (McCabe, 2005). Text analysis tools have the potential to be applied to work submitted by students and assist the educator in the detection of plagiarised text. It is difficult to develop and evaluate such systems without examples of such documents....
Like text in other domains, biomedical documents contain a range of terms with more than one possible meaning. These ambiguities form a significant obstacle to the automatic processing of biomedical texts. Previous approaches to resolving this problem have made use of various sources of information including linguistic features of the context in wh...
Like text in other domains, biomedical documents contain a range of terms with more than one possible meaning. These ambiguities form a significant obstacle to the automatic processing of biomedical texts. Previous approaches to resolving this problem have made use of a variety of knowledge sources including linguistic information (from the context...
Supervised approaches to Word Sense Disambiguation (WSD) have been shown to outperform other approaches but are hampered by reliance on labeled training examples (the data acquisition bottleneck). This paper presents a novel approach to the automatic acquisition of labeled examples for WSD which makes use of the Information Retrieval techni...
This chapter explores the different sources of linguistic knowledge that can be employed by WSD systems. These are more abstract than the features used by WSD algorithms, which are encoded at the algorithmic level and normally extracted from a lexical resource or corpora. The chapter begins by listing a comprehensive set of knowledge sources with e...
We present a novel approach to the word sense disambiguation problem which makes use of corpus-based evidence combined with background knowledge. Employing an inductive logic programming algorithm, the approach generates expressive disambiguation rules which exploit several knowledge sources and can also model relations between them. The ap...
We describe an approach to the automatic creation of a sense tagged corpus intended to train a word sense disambiguation (WSD) system for English-Portuguese machine translation. The approach uses parallel corpora, translation dictionaries and a set of straightforward heuristics. In an evaluation with nine corpora containing 10 ambiguous verbs,...
Several recent approaches to Information Extraction (IE) have used dependency trees as the basis for an extraction pattern representation. These approaches have used a variety of pattern models (schemes which define the parts of the dependency tree which can be used to form extraction patterns). Previous comparisons of these pattern models are limi...