Richard Connor's research while affiliated with University of St Andrews and other places

Publications (124)

Poster
Full-text available
Most work in similarity search takes place in a domain far distant from the actual semantics of a space. Given a metric and a set of values, there is a wealth of research in how to efficiently perform search. However almost without exception, the aspect of mapping the results to the "real" closeness semantic the metric is intended to model is ignor...
Article
In high dimensional datasets, exact indexes are ineffective for proximity queries, and a sequential scan over the entire dataset is unavoidable. Accepting this, here we present a new approach employing two-dimensional embeddings. Each database element is mapped to the XY plane using the four-point property. The caveat is that the mapping is local:...
Chapter
Similarity search using metric indexing techniques is largely a solved problem in low-dimensional spaces. However techniques based only on the triangle inequality property start to fail as dimensionality increases. Since proper metric spaces allow a finite projection of any three objects into a 2D Euclidean space, the notion of angle can be validly...
Article
Scalable similarity search in metric spaces relies on using the mathematical properties of the space in order to allow efficient querying. Most important in this context is the triangle inequality property, which can allow the majority of individual similarity comparisons to be avoided for a given query. However many important metric spaces, typica...
Article
Full-text available
Approximate Nearest Neighbor (ANN) search is a prevalent paradigm for searching intrinsically high dimensional objects in large-scale data sets. Recently, the permutation-based approach for ANN has attracted a lot of interest due to its versatility in being used in the more general class of metric spaces. In this approach, the entire database is ra...
Article
We define BitPart (Bitwise representations of binary Partitions), a novel exact search mechanism intended for use in high-dimensional spaces. In outline, a fixed set of reference objects is used to define a large set of regions within the original space, and each data item is characterised according to its containment within these regions. In contr...
Chapter
The concept of local pivoting is to partition a metric space so that each element in the space is associated with precisely one of a fixed set of reference objects or pivots. The idea is that each object of the data set is associated with the reference object that is best suited to filter that particular object if it is not relevant to a query, max...
Chapter
Full-text available
Many approaches for approximate metric search rely on a permutation-based representation of the original data objects. The main advantage of transforming metric objects into permutations is that the latter can be efficiently indexed and searched using data structures such as inverted-files and prefix trees. Typically, the permutation is obtained by...
Article
Incorporating intellectual and developmental frameworks into a Scottish school curriculum.
Conference Paper
In a metric space, triangle inequality implies that, for any three objects, a triangle with edge lengths corresponding to their pairwise distances can be formed. The n-point property is a generalisation of this where, for any \((n+1)\) objects in the space, there exists an n-dimensional simplex whose edge lengths correspond to the distances among t...
Article
Full-text available
In 1953, Blumenthal showed that every semi-metric space that is isometrically embeddable in a Hilbert space has the n-point property; we have previously called such spaces supermetric spaces. Although this is a strictly stronger property than triangle inequality, it is nonetheless closely related and many useful metric spaces possess it. These incl...
Article
Full-text available
Metric search is concerned with the efficient evaluation of queries in metric spaces. In general, a large space of objects is arranged in such a way that, when a further object is presented as a query, those objects most similar to the query can be efficiently found. Most mechanisms rely upon the triangle inequality property of the metric governing...
Conference Paper
There are many contexts where the definition of similarity in multivariate space needs to be based on the correlation, rather than absolute value, of the variables. Examples include classic IR measurements such as TF/IDF and BM25, client similarity measures based on collaborative filtering, feature analysis of chemical molecules, and biodiversit...
Conference Paper
Our context of interest is tree-structured exact search in metric spaces. We make the simple observation that, the deeper a data item is within the tree, the higher the probability of that item being excluded from a search. Assuming a fixed and independent probability p of any subtree being excluded at query time, the probability of an individual d...
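The depth observation in the entry above can be illustrated with a small calculation. This is only a sketch of one reading of the stated assumption (each enclosing subtree is excluded independently with probability p); the abstract is truncated here, so this is not necessarily the formula the paper itself derives, and the function name `exclusion_probability` is hypothetical.

```python
def exclusion_probability(p: float, depth: int) -> float:
    """Probability that an item is skipped by a query.

    Assumption (ours, for illustration): an item at the given depth lies
    beneath `depth` subtree boundaries, each excluded independently with
    probability p. The item is visited only if none of them is excluded.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return 1.0 - (1.0 - p) ** depth


# Deeper items are more likely to be excluded under this assumption.
for depth in range(0, 5):
    print(depth, exclusion_probability(0.5, depth))
```

Under this reading, the chance of an item being visited shrinks geometrically with its depth, which is consistent with the abstract's remark that deeper items are more likely to be excluded.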
Conference Paper
Metric indexing research is concerned with the efficient evaluation of queries in metric spaces. In general, a large space of objects is arranged in such a way that, when a further object is presented as a query, those objects most similar to the query can be efficiently found. Most such mechanisms rely upon the triangle inequality property of the...
Article
Most research into similarity search in metric spaces relies upon the triangle inequality property. This property allows the space to be arranged according to relative distances to avoid searching some subspaces. We show that many common metric spaces, notably including those using Euclidean and Jensen-Shannon distances, also have a stronger proper...
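The way the triangle inequality lets a search skip distance computations can be sketched as follows. This is a minimal illustration of standard single-pivot filtering, not the stronger property or the method of the listed paper; the names `euclidean`, `can_exclude`, `range_search` and the sample points are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def can_exclude(d_query_pivot, d_pivot_object, threshold):
    """True if the object cannot lie within `threshold` of the query.

    By the triangle inequality, d(q, o) >= |d(q, p) - d(p, o)|, so if that
    lower bound already exceeds the threshold, d(q, o) need not be computed.
    """
    return abs(d_query_pivot - d_pivot_object) > threshold


# Hypothetical toy data set and a single pivot (reference object).
pivot = (0.0, 0.0)
data = [(0.1, 0.0), (5.0, 0.0), (0.0, 4.9), (3.0, 4.0)]
pivot_dists = [euclidean(pivot, o) for o in data]  # precomputed at index time


def range_search(query, threshold):
    """Return (objects within threshold, number of object distances computed)."""
    d_qp = euclidean(query, pivot)
    results, computed = [], 0
    for obj, d_po in zip(data, pivot_dists):
        if can_exclude(d_qp, d_po, threshold):
            continue  # excluded without computing d(query, obj)
        computed += 1
        if euclidean(query, obj) <= threshold:
            results.append(obj)
    return results, computed


results, computed = range_search((5.0, 0.1), 0.5)
print(results, computed)
```

Only objects whose pivot distance is close to the query's pivot distance survive the filter; the rest are discarded using distances that were computed once, at index time.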
Conference Paper
There are many published methods for detecting similar and near-duplicate images. Here, we consider their use in the context of unsupervised near-duplicate detection, where the task is to find a (relatively small) near-duplicate intersection of two large candidate sets. Such scenarios are of particular importance in forensic near-duplicate detecti...
Article
There are many contexts where the automated detection of near-duplicate images is important, for example the detection of copyright infringement or images of child abuse. There are many published methods for the detection of similar and near-duplicate images; however it is still uncommon for methods to be objectively compared with each other, proba...
Book
This book constitutes the proceedings of the 8th International Conference on Similarity Search and Applications, SISAP 2015, held in Glasgow, UK, in October 2015. The 19 full papers, 12 short and 9 demo and poster papers presented in this volume were carefully reviewed and selected from 68 submissions. They are organized in topical sections named:...
Conference Paper
Full-text available
This paper argues that the “institutionalised understanding” of pseudo-code as a blend of formal and natural languages makes it an unsuitable choice for national assessment where the intention is to test program comprehension skills. It permits question-setters to inadvertently introduce a level of ambiguity and consequent confusion. This is not in...
Conference Paper
It is well known that, as the dimensionality of a metric space increases, metric search techniques become less effective and the cost of indexing mechanisms becomes greater than the saving they give. This is due to the so-called curse of dimensionality. One effect of increasing dimensionality is that the ratio of unit hypersphere to unit hypercube...
Patent
Full-text available
A distributed mediation network and method of employing such is provided, having a plurality of different types of network module. Each module has a non-reciprocal path therethrough for network traffic and the distribution of network traffic across the network is managed by an autonomic control plane.
Conference Paper
Jensen-Shannon divergence is a symmetrised, smoothed version of Kullback-Leibler divergence. It has been shown to be the square of a proper distance metric, and has other properties which make it an excellent choice for many high-dimensional spaces in ℝ*. The metric as defined is however expensive to evaluate. In sparse spaces over many dimensions the Intrin...
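The definition summarised in the entry above can be sketched directly. This is a plain textbook formulation, assuming base-2 logarithms and inputs that are equal-length probability vectors summing to 1; it does not reproduce the cheaper evaluation technique the paper goes on to discuss, and the function names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Symmetrised, smoothed KL: average KL of p and q to their midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    # m_i > 0 whenever p_i > 0 or q_i > 0, so no division by zero occurs.
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def jensen_shannon_distance(p, q):
    """Square root of the divergence, which is a proper metric."""
    # Clamp tiny negative values caused by floating-point rounding.
    return math.sqrt(max(0.0, jensen_shannon_divergence(p, q)))


p = [0.5, 0.5, 0.0]
q = [0.0, 0.5, 0.5]
print(jensen_shannon_divergence(p, q), jensen_shannon_distance(p, q))
```

With base-2 logarithms the divergence lies between 0 (identical distributions) and 1 (distributions with disjoint support). Each evaluation needs a logarithm per non-zero dimension, which gives a sense of why the abstract calls the metric expensive.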
Conference Paper
The majority of work in similarity search focuses on the efficiency of threshold and nearest-neighbour queries. Similarity join has been less well studied, although efficient indexing algorithms have been shown. The multi-way similarity join, extending similarity join to multiple spaces, has received relatively little treatment. Here we present a n...
Conference Paper
Full-text available
Over the years a number of models have been introduced as solutions to the central IR problem of ranking documents given textual queries. Here we define another new model. It is a probabilistic model and has no term inter-dependencies, thus allowing calculation from inverted indices. It is based upon a simple core hypothesis, directly calculating a...
Conference Paper
We investigate a distance metric, previously defined for the measurement of structured data, in the more general context of vector spaces. The metric has a basis in information theory and assesses the distance between two vectors in terms of their relative information content. The resulting metric gives an outcome based on the dimensional correlati...
Article
Comparing tree-structured data for structural similarity is a recurring theme and one on which much effort has been spent. Most approaches so far are grounded, implicitly or explicitly, in algorithmic information theory, being approximations to an information distance derived from Kolmogorov complexity. In this paper we propose a novel complexity m...
Conference Paper
The similarity of objects is one of the most fundamental concepts in any collection of complex information; similarity, along with techniques for storing and indexing sets of values based on it, is a concept of ever increasing importance as inherently unordered data sets become ever more common. Examples of such datasets include collections of imag...
Article
Full-text available
Pervasive services may be defined as services that are available to any client (anytime, anywhere). Here we focus on the software and network infrastructure required to support pervasive contextual services operating over a wide area. One of the key requirements is a matching service capable of assimilating and filtering information from various so...
Conference Paper
A GLObal Smart Space (GLOSS) provides support for interaction amongst people, artefacts and places while taking account of both context and movement on a global scale. Crucial to the definition of a GLOSS is the provision of a set of location-aware services that detect, convey, store and exploit location information. We use one of these services,...
Article
This document describes the Gloss software currently implemented. The description of the Gloss demonstrator for multi-surface interaction can be found in D17. The ongoing integration activity for the work described in D17 and D8 constitutes our development of infrastructure for a first smart space. In this report, the focus is on infrastructure to...
Conference Paper
We show a new metric for comparing unordered, tree-structured data. While such data is increasingly important in its own right, the methodology underlying the construction of the metric is generic and may be reused for other classes of ordered and partially ordered data. The metric is based on the information content of the two values under conside...
Article
Distributed application systems have become a popular and provenly viable computing paradigm. There are a number of reasons for this such as: the geographical dispersal of information; the improved reliability of multiple computer systems; and the possibility of concurrent execution of applications. As yet no single model of distribution has been p...
Article
Many valuable earth science data are not available in a digital format. Manual entry of such information into databases is time consuming, unrewarding, and prone to introducing errors. Taxonomic descriptions of fossils are a good example of valuable data that are overwhelming and available only in printed volumes and journals, some of which are inc...
Article
Full-text available
, and autonomic adaptation to highly dynamic environments - which defeat the scope of commodity technologies and ad-hoc developments alike. We thus turn to e-Infrastructures and the Grid vision which underlies them for the provision of the human, hardware and application resources which are required to support true Virtual Research Environments...
Conference Paper
In everyday life there are low risk situations where resources normally need to be protected and thus access to them controlled, but at the same time the implications of occasional unauthorized access are not severe. In these situations using an authentication technique that is expensive or requires significant effort from the user may not be justi...
Conference Paper
Full-text available
There are various situations where a distinction needs to be made between group members and outsiders. For example, to protect students in chat groups from unpleasant incidents caused by intruders; or to provide access to common domains such as computer labs. In some of these situations the implications of unauthorized access are negligible. Thus,...
Article
Full-text available
On the Internet, there is an uneasy tension between the security and usability of authentication mechanisms. An easy three-part classification is: "something you know" (e.g. password); "something you hold" (e.g. device holding digital certificate), and "who you are" (e.g. biometric assessment) [9]. Each of these has well-known problems; passwords a...
Conference Paper
We consider the topic of query evaluation over semistructured information streams, and XML data streams in particular. Streaming evaluation methods are necessarily event-driven, which is in tension with high-level query models; in general, the more expressive the query language, the harder it is to translate queries into an event-based implementatio...
Article
In this paper, we investigate the issues that arise when binding statically typed languages to XML data. In particular, our motivation is to exploit the computational facilities of mainstream languages when computing over real-world entities encoded as XML documents or document fragments. These include completeness, strong typing, efficiency, as...
Article
In this paper, we investigate the issues that arise when binding statically typed languages to XML data. In particular, our motivation is to exploit the computational facilities of mainstream languages when computing over real-world entities encoded as XML documents or document fragments. These include completeness, strong typing, efficiency, as we...
Conference Paper
The Internet is experiencing an overwhelming growth that will have a negative impact on its performance and quality of service. In this paper we describe a new architecture that offers a better use of Internet resources and help improve security at the server nodes. The Thin Server Architecture aims to dynamically push code and data in arbitrary lo...
Conference Paper
We discuss the design of a quasi-statically typed language for XML in which data may be associated with different structures and different algebras in different scopes, whilst preserving identity. In declarative scopes, data are trees and may be queried with the full flexibility associated with XML query algebras. In procedural scopes, data have mo...
Article
We describe Projector, a language that can be used to perform a mixture of typed and untyped computation against data represented in XML. For some problems, notably when the data is unstructured or semistructured, the most desirable programming model is against the tree structure underlying the document. When this tree structure has been used to mo...
Article
Values of existing typed programming languages are increasingly generated and manipulated outside the language jurisdiction. Instead, they often occur as fragments of XML documents, where they are uniformly interpreted as labelled trees in spite of their domain-specific semantics. In particular, the values are divorced from the high-level type with...
Article
The traditional representation of a program is as a linear sequence of text. At some stage in the execution sequence the source text is checked for type correctness and its translated form is linked to values in the environment. When this is performed early in the execution process, confidence in the correctness of the program is raised. During p...
Article
Full-text available
Data Types: 11.1 Abstract Data Type Definition; 11.2 Creation of Abstract Data Objects; 11.3 Use of Abstract Data Objects; 11.4 Equality and Equivalence...
Chapter
Full-text available
Persistent programming systems are designed as an implementation technology for long lived, concurrently accessed and potentially large bodies of data and programs, known here as persistent application systems (PASs). Within a PAS the persistence concept is used to abstract over the physical properties of data such as where it is kept, how long it...
Article
Traditional database technology may be extended by taking advantage of the facilities of an integrated persistent programming environment. This paper focuses on how such an environment may be used to provide new solutions to a long standing problem in traditional databases, that of schema evolution. A general mechanism is first described, followed...
Chapter
The application of orthogonal persistence to a number of traditional programming language mechanisms has produced significant results for the modelling power of persistent languages. Many of the implementation techniques for these mechanisms have been copied or adapted from non-persistent systems for use in early persistent language implementations...
Chapter
Reflective systems allow their own structures to be altered from within. In a programming system reflection can occur in two ways: by a program altering its own interpretation or by it changing itself. Reflection has been used to facilitate the production and evolution of data and programs in database and programming language systems. This paper is...
Chapter
The most significant difference between persistent and non-persistent programming languages is that, in a persistent language, the long-term data is typed. This results in a major shift in the emphasis of type system protection, from one of a safety mechanism over programs to that of a safety mechanism over the entire software system, including bot...
Article
The Hippo project is an investigation into the use of high-level programming paradigms in the context of global computation. The intention is to discover abstractions which allow global applications to be written by an average applications programmer, rather than a specialist in networked computer systems. This will be achieved by the combination o...
Article
We identify problems in using traditional programming systems to write applications that access internet data. Some of these problems may be artefacts of current internet technology, and some are fundamental in any large heterogeneous and autonomous network. Our experimental approach to this is to design a new language system, with a semantic domai...
Conference Paper
In its infancy, the World-Wide Web consisted of a web of largely static hypertext documents. As time progresses it is evolving into a domain which supports almost arbitrary networked computations. Central to its successful operation is the agreement of the HTML and http standards, which provide inter-node communication via the medium of streamed...
Conference Paper
The traditional representation of a program is as a linear sequence of text. At some stage in the execution sequence the source text is checked for type correctness and its translated form is linked to values in the environment. When this is performed early in the execution process, confidence in the correctness of the program is raised. During pro...
Article
Full-text available
A major design goal of the Flask architecture is to separate the mechanisms of concurrency control and recovery management in database programming systems. This paper describes the DataSafe component of Flask, which is the second recovery mechanism to be implemented within the architecture and therefore provides a proof of concept. The DataSafe is...
Article
Contents: 1 Introduction; 2 Downloading; 3 DEC Alpha Only...
Article
Full-text available
This document should be referenced as:
Conference Paper
Existential quantification of procedures is introduced as a mechanism for languages with dynamic typing. It allows abstraction over types whose representations require to be manipulated at run time. Universal quantification, the mechanism normally associated with procedural type abstraction, is shown to be unsuitable for this style of abstraction....
Article
In a type secure persistent programming system, all data is governed for its entire lifetime by a single type system. The universality of the persistent type system has implications in terms of both the modelling and protection provided by the type system itself, and also presents some new challenges in terms of implementation. With respect to mode...
Article
Full-text available
Crash recovery in database systems aims to provide an acceptable level of protection from failure at a given engineering cost. A large number of recovery mechanisms are known, and have been compared both analytically and empirically. However, recent trends in computer hardware present different engineering tradeoffs in the design of recovery mechan...
Article
Full-text available
This paper contains an examination of the typings associated with the construction of persistent systems through the use of parametric abstract modules. Of particular interest is the class of dependent, potentially dynamic typings required as a consequence of diamond import. An example is introduced, including the diamond problem; solutions to...
Conference Paper
We demonstrate the use of a hyper-programming system in building persistent applications. This allows program representations to contain type-safe links to persistent objects embedded directly within the source code. The benefits include improved efficiency and potential for static program checking, reduced programming effort and the ability to dis...
Article
Full-text available
This paper briefly and selectively reviews the experience gained in designing and implementing the orthogonally persistent programming languages PS-algol and Napier88. A major design issue is how much is built into the support system and how much is built on top of the language itself. The PS-algol and Napier88 systems provide at the system or lang...
Article
Full-text available
Distributed application systems have become a popular and provenly viable computing paradigm. There are a number of reasons for this such as: the geographical dispersal of information; the improved reliability of multiple computer systems; and the possibility of concurrent execution of applications. As yet no single model of distribution has been p...
Article
Dynamic subclass checks are common in many sub-paradigms of object-oriented programming. These checks may be simply implemented by following the superclass chain through the objects which represent the class hierarchy. However, in large or data-intensive applications, such as OODBMS, this simple checking strategy may have significant efficiency con...