Rob Sanderson

Yale University | YU · Office of the Provost

PhD (Liverpool, 2003)

About

Publications: 49
Reads: 7,193
Citations: 896
Since 2017: 0 research items, 290 citations
Additional affiliations
August 2009 - March 2014
Los Alamos National Laboratory
Position
  • Information Scientist
January 2004 - August 2009
University of Liverpool
Position
  • Lecturer

Publications

Article
Full-text available
The emergence of the web has fundamentally affected most aspects of information communication, including scholarly communication. The immediacy that characterizes publishing information to the web, as well as accessing it, allows for a dramatic increase in the speed of dissemination of scholarly knowledge. But, the transition from a paper-based to...
Article
Persistent IDentifiers (PIDs), such as DOIs, Handles and ARK identifiers, play a significant role in the identification of a wide variety of assets that are created and used in scholarly endeavours, including research papers, datasets, images, etc. Motivated by concerns about long-term persistence, among others, PIDs are minted outside the informat...
Article
Full-text available
Web applications frequently leverage resources made available by remote web servers. As resources are created, updated, deleted, or moved, these applications face challenges to remain in lockstep with the server's change dynamics. Several approaches exist to help meet this challenge for use cases where "good enough" synchronization is acceptable. B...
Conference Paper
Conventional Web archives are created by periodically crawling a Web site and archiving the responses from the Web server. Although easy to implement and commonly deployed, this form of archiving typically misses updates and may not be suitable for all preservation scenarios, for example a site that is required (perhaps for records compliance) to k...
Conference Paper
This tutorial provides an overview and a practical introduction to ResourceSync, a web-based synchronization framework consisting of multiple modular capabilities that a server can selectively implement to enable third party systems to remain synchronized with the server’s evolving resources. The tutorial motivates the ResourceSync approach by outl...
Conference Paper
The preservation of traditional, digital scholarly output, such as PDF or HTML journal articles, is relatively well understood, and adequately organized through systems such as Portico and LoCKSS. However, the scholarly record is expanding with a wide variety of materials for which no established archival approaches exist. This includes, for exampl...
Article
Full-text available
Maintenance of multiple, distributed up-to-date copies of collections of changing Web resources is important in many application contexts and is often achieved using ad hoc or proprietary synchronization solutions. ResourceSync is a resource synchronization framework that integrates with the Web architecture and leverages XML sitemaps. We define a...
Conference Paper
When retrieving archived copies of web resources (mementos) from web archives, the original resource's URI-R is typically used as the lookup key in the web archive. This is straightforward until the resource on the live web issues a redirect: R -> R'. Then it is not clear if R or R' should be used as the lookup key to the web archive. In this paper,...
Conference Paper
Many applications need up-to-date copies of collections of changing Web resources. Such synchronization is currently achieved using ad-hoc or proprietary solutions. We propose ResourceSync, a general Web resource synchronization protocol that leverages XML Sitemaps. It provides a set of capabilities that can be combined in a modular manner to meet...
Article
Full-text available
Many applications need up-to-date copies of collections of changing Web resources. Such synchronization is currently achieved using ad-hoc or proprietary solutions. We propose ResourceSync, a general Web resource synchronization protocol that leverages XML Sitemaps. It provides a set of capabilities that can be combined in a modular manner to meet...
Article
The Open Annotation Core Data Model specifies an interoperable framework for creating associations between related resources, called annotations, using a methodology that conforms to the Architecture of the World Wide Web. Open Annotations can easily be shared between platforms, with sufficient richness of expression to satisfy complex requirements...
Article
This is the second paper in D-Lib Magazine about the ResourceSync effort conducted by the National Information Standards Organization (NISO) and the Open Archives Initiative (OAI). The first part provided a perspective on the resource synchronization problem and introduced a template that organized possible components of a resource synchronization...
Article
Digital scholarship offers the opportunity to move beyond the limitations of traditional scholarly publication. Rather than limiting scholarly communication to text-based static documents, the Web makes it possible for scholars to expose and share the full evidence of their research, including data, images, video, and other genres of materials. These...
Article
In this article, we present a model based on the principles of Linked Data that can be used to describe the interrelationships of images, texts and other resources to facilitate the interoperability of repositories of medieval manuscripts or other culturally important handwritten documents. The model is designed from a set of requirements derived f...
Article
Web applications frequently leverage resources made available by remote web servers. As resources are created, updated, deleted, or moved, these applications face challenges to remain in lockstep with changes on the server. Several approaches exist to help meet this challenge for use cases where "good enough" synchronization is acceptable. But when...
Article
Linked datasets contain descriptions that change over time. Applications that leverage linked data must be aware of these change dynamics to deliver accurate services. Here, the authors highlight important challenges that are involved in dealing with change and review possible solutions.
Article
In this poster, we describe the approach taken to designing and implementing a tera-scale multi-repository index of archived web resources using massively parallel processing.
Article
Full-text available
Many Web portals allow users to associate additional information with existing multimedia resources such as images, audio, and video. However, these portals are usually closed systems and user-generated annotations are almost always kept locked up and remain inaccessible to the Web of Data. We believe that an important step to take is the integrati...
Article
Full-text available
In this paper, we present the SharedCanvas model for describing the layout of culturally important, hand-written objects such as medieval manuscripts, which is intended to be used as a common input format to presentation interfaces. The model is evaluated using two collections from CATCHPlus not consulted during the design phase, each with their ow...
Article
Full-text available
Annotations allow users to associate additional information with existing resources. Using proprietary and closed systems on the Web, users are already able to annotate multimedia resources such as images, audio and video. So far, however, this information is almost always kept locked up and inaccessible to the Web of Data. We believe that an impor...
Article
In this paper we present the results of a study into the persistence and availability of web resources referenced from papers in scholarly repositories. Two repositories with different characteristics, arXiv and the UNT digital library, are studied to determine if the nature of the repository, or of its content, has a bearing on the availability of...
Article
In this paper we present a model based on the principles of Linked Data that can be used to describe the interrelationships of images, texts and other resources to facilitate the interoperability of repositories of medieval manuscripts or other culturally important handwritten documents. The model is designed from a set of requirements derived from...
Conference Paper
Full-text available
Textual Feature Selection (TFS) is an important phase in the process of text classification. It aims to identify the most significant textual features (i.e. key words and/or phrases), in a textual dataset, that serve to distinguish between text categories. In TFS, basic techniques can be divided into two groups: linguistic vs. statistical. For the...
Article
Full-text available
Dereferencing a URI returns a representation of the current state of the resource identified by that URI. But on the Web, representations of prior states of a resource are also available, for example, as resource versions in Content Management Systems or archival resources in Web Archives such as the Internet Archive. This paper introduces a resour...
Article
As Digital Libraries (DL) become more aligned with the web architecture, their functional components need to be fundamentally rethought in terms of URIs and HTTP. Annotation, a core scholarly activity enabled by many DL solutions, exhibits a clearly unacceptable characteristic when existing models are applied to the web: due to the representations...
Article
Full-text available
The Web is ephemeral. Many resources have representations that change over time, and many of those representations are lost forever. A lucky few manage to reappear as archived resources that carry their own URIs. For example, some content management systems maintain version pages that reflect a frozen prior state of their changing resources. Archiv...
Conference Paper
Full-text available
Data pre-processing is an important topic in Text Classification (TC). It aims to convert the original textual data into a data-mining-ready structure, where the most significant text-features that serve to differentiate between text-categories are identified. Broadly speaking, textual data pre-processing techniques can be divided into three groups:...
Article
Full-text available
Aggregations of Web resources are increasingly important in scholarship as it adopts new methods that are data-centric, collaborative, and network-based. The same notion of aggregations of resources is common to the mashed-up, socially networked information environment of Web 2.0. We present a mechanism to identify and describe aggregations of We...
Conference Paper
A graph-based approach to document classification is described in this paper. The graph representation offers the advantage that it allows for a much more expressive document encoding than the more standard bag of words/phrases approach, and consequently gives an improved classification accuracy. Document sets are represented as graph sets to which...
Conference Paper
This poster evaluates the OAI-ORE specifications through experiments providing access to the JSTOR digital archive and the Flickr website. A browser-based dynamic graph visualization tool was designed and tested to determine if making the topology of the information available would provide end-user benefits in terms of navigation and discovery.
Article
Full-text available
Work in the Open Archives Initiative - Object Reuse and Exchange (OAI-ORE) focuses on an important aspect of infrastructure for eScience: the specification of the data model and a suite of implementation standards to identify and describe compound objects. These are objects that aggregate multiple sources of content including text, images, data, vi...
Conference Paper
Full-text available
Many text mining applications, especially when investigating Text Classification (TC), require experiments to be performed using common text-collections, such that results can be compared with alternative approaches. With regard to single-label TC, most text-collections (textual data-sources) in their original form have at least one of the followin...
Article
Full-text available
The OAI Object Reuse and Exchange (OAI-ORE) framework recasts the repository-centric notion of digital object to a bounded aggregation of Web resources. In this manner, digital library content is more integrated with the Web architecture, and thereby more accessible to Web applications and clients. This generalized notion of an aggregation that is...
Conference Paper
Full-text available
Algorithms for text classification generally involve two stages, the first of which aims to identify textual elements (words and/or phrases) that may be relevant to the classification process. This stage often involves an analysis of the text that is both language-specific and possibly domain-specific, and may also be computationally costly. In...
Conference Paper
Full-text available
In this paper we describe the concept of Meta ARM in the context of its objectives and challenges and go on to describe and analyse a number of potential solutions. Meta ARM is defined as the process of combining the results of a number of individually obtained Association Rule Mining (ARM) operations to produce a composite result. The typical scenar...
Conference Paper
The strengths within six library collections were automatically determined through automated enrichment and analysis of bibliographic-level metadata records, with a view towards efficient resource sharing and collaborative collection management. This involved very large scale deduplication, enrichment and automatic reclassification of records usin...
Conference Paper
Full-text available
This paper explores the integration of text mining and data mining techniques, digital library systems, and computational and data grid technologies with the objective of developing an online classification service exemplar. We discuss the current research issues relating to the use of data mining algorithms and toolkits for textual data; the neces...
Chapter
Full-text available
A number of language-independent text pre-processing techniques, to support multi-class single-label text classification, are described and compared. A simple but effective statistical keyword identification approach is proposed, coupled with a number of phrase identification mechanisms. Experimental results are presented. Keywords: Text Mining, Mult...
Conference Paper
Full-text available
This poster describes the ongoing research of the Cheshire project with a particular focus on knowledge generation and digital preservation. The infrastructure described makes use of tools from computational linguistics, distributed parallel processing and storage, information retrieval and digital preservation environments to produce new knowledg...
Conference Paper
We describe a curated harvesting approach to creating and maintaining a subject portal, comprising selected records harvested from remote services via information retrieval standards such as SRU, Z39.50 and OAI-PMH. The result was a web-based data curation interface where administrative users can configure access to remote resources, queries to be...
Conference Paper
Full-text available
This acceptance talk is a curious mixture of personal history and developing ideas in the context of the growing field of IR covering several decades. I want to concentrate on models and theories, interpreted loosely, and try and give an insight into ...
Conference Paper
Full-text available
The University of California, Berkeley and the University of Liverpool, in conjunction with the San Diego Supercomputer Center, are developing a framework for Grid-Based Digital Library systems and Information Retrieval Services (Cheshire3) that operates in both single-processor and distributed computing environments. In this paper we discuss some...
Article
SRW/U (the Search/Retrieve Webservice) and OAI (Open Archives Initiative) are both modern information retrieval protocols developed by distinct groups from different backgrounds at around the same time. This article sets out to briefly contrast the two protocols' aims and approaches, and then to look at some novel ways in which they have been or ma...
Conference Paper
Full-text available
The University of California, Berkeley and the University of Liverpool are developing an Information Retrieval and Digital Library system (Cheshire3) that operates in both single-processor and "Grid" distributed computing environments. This paper discusses the architecture of the system and how it performs Digital Library tasks in a Grid computing e...
Article
A thesis submitted for the degree of Doctor of Philosophy at the University of Liverpool.
