About
53 Publications
37,415 Reads
1,253 Citations
Publications (53)
Labelling unlabeled data is a time-consuming and expensive process. Labelling initiatives should select samples that are likely to enhance the classification accuracy of the classifier. Several methods can be employed to accomplish this goal. One of these techniques is to select samples with the highest level of uncertainty in their predicted label...
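A minimal sketch of the uncertainty-selection idea described above, assuming a fitted scikit-learn-style classifier exposing predict_proba; the function name and the least-confidence criterion are illustrative choices, not taken from the paper:

import numpy as np

def least_confident_indices(model, X_pool, k):
    # Probability of the most likely label for each unlabeled sample.
    proba = model.predict_proba(X_pool)      # shape: (n_samples, n_classes)
    confidence = proba.max(axis=1)
    # The k samples the classifier is least sure about are the ones
    # most likely to improve accuracy once labelled.
    return np.argsort(confidence)[:k]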
Blockmodelling is the process of determining community structure in a graph. Real graphs contain noise and so it is up to the blockmodelling method to allow for this noise and reconstruct the most likely role memberships and role relationships. Relationships are encoded in a graph using the absence and presence of edges. Two objects are considered...
Blockmodelling is an important technique for detecting underlying patterns in graphs. Existing blockmodelling algorithms are unsupervised and cannot take advantage of the existing information that might be available about objects that are known to be similar. This background information can help in finding complex patterns, such as hierarchical or rin...
The iVAT (asiVAT) algorithms reorder symmetric (asymmetric) dissimilarity data so that an image of the data may reveal cluster substructure. Images formed from incomplete data don’t offer a very rich interpretation of cluster structure. In this paper we examine four methods for completing the input data with imputed values before imaging. We choose...
Multi-label classifiers allow us to predict the state of a set of responses using a single model. A multi-label model is able to make use of the correlation between the labels to potentially increase the accuracy of its prediction. Critical applications of multi-label classifiers (such as medical diagnoses) require that the system’s confidence in p...
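As a point of reference, the sketch below fits the simplest multi-label setup, one binary model per label (binary relevance), and reads off per-label probabilities as confidences; unlike the models the abstract describes, this baseline ignores label correlations, and all names here are illustrative:

from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Toy multi-label data: each sample may carry several labels at once.
X, Y = make_multilabel_classification(n_samples=200, n_labels=3, random_state=0)

clf = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)

# One probability per label per sample: the model's confidence in each
# individual response, usable for the kind of calibration check that
# critical applications require.
per_label_confidence = [p[:, 1] for p in clf.predict_proba(X[:5])]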
It is becoming increasingly difficult to stay aware of the state-of-the-art in any research field due to the exponential increase in the number of academic publications. This problem affects authors and reviewers of submissions to academic journals and conferences, who must be able to identify which portions of an article are novel and which are no...
Measurement of graph centrality provides us with an indication of the importance or popularity of each vertex in a graph. When dealing with graphs that are not centrally controlled (such as the Web, social networks and academic citation graphs), centrality measures must 1) correlate with vertex importance/popularity, 2) scale well in terms of comput...
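As one concrete example of a centrality measure that scales, PageRank-style eigenvector centrality can be computed by power iteration; a small dense sketch (a real graph would use sparse matrices), not the measure proposed in the paper:

import numpy as np

def pagerank(A, d=0.85, iters=100):
    # A is a 0/1 adjacency matrix with A[i, j] = 1 if i links to j.
    n = A.shape[0]
    out = A.sum(axis=1)
    P = np.full((n, n), 1.0 / n)                 # dangling rows jump anywhere
    P[out > 0] = A[out > 0] / out[out > 0, None] # row-stochastic transitions
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - d) / n + d * (P.T @ r)          # follow links or teleport
    return r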
Retinal vascular landmark points such as branching points and crossovers are important features for automatic retinal image matching and vascular abnormality detection. These landmark points can enable automatic screening of large datasets through the detection of vascular network abnormalities (e.g., arteriovenous nicking, retinal vein occlusion) w...
Retinal arteriovenous nicking (AV nicking) is the phenomenon where the venule is compressed or decreases in its caliber at both sides of an arteriovenous crossing. Recent research suggests that retinal AVN is associated with hypertension and cardiovascular diseases such as stroke. In this article, we propose a computer method for assessing the seve...
Retinal arteriovenous (AV) nicking is one of the prominent and significant microvascular abnormalities. It is characterized by the decrease in the venular calibre at both sides of an artery-vein crossing. Recent research suggests that retinal AV nicking is a strong predictor of eye diseases such as branch retinal vein occlusion and cardiovascular d...
Outlier detection is an important process for text document collections, but as the collection grows, the detection process becomes a computationally expensive task. Random projection has shown to provide a good fast approximation of sparse data, such as document vectors, for outlier detection. The random samples of Fourier and cosine spectrum have...
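A small sketch of the projection step on sparse high-dimensional vectors; the distance-to-centroid outlier score at the end is an illustrative stand-in, not the paper's detector:

import numpy as np
from sklearn.random_projection import SparseRandomProjection

# Sparse high-dimensional "document" vectors (toy data).
rng = np.random.default_rng(0)
X = rng.random((500, 10000)) * (rng.random((500, 10000)) < 0.01)

# Project to a few hundred dimensions; pairwise distances are roughly
# preserved (Johnson-Lindenstrauss), so outlier scores stay meaningful
# while each comparison becomes far cheaper.
Z = SparseRandomProjection(n_components=300, random_state=0).fit_transform(X)
score = np.linalg.norm(Z - Z.mean(axis=0), axis=1)   # larger = more outlying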
Changes in retinal blood vessel features are precursors of serious diseases such as cardiovascular disease and stroke. Therefore, analysis of retinal vascular features can assist in detecting these changes and allow the patient to take action while the disease is still in its early stages. Automation of this process would help to re...
Analysis of retinal blood vessels allows us to identify individuals with the onset of cardiovascular diseases, diabetes and hypertension. Unfortunately, this analysis requires a specialist to identify specific retinal features which is not always possible. Automation of this process will allow the analysis to be performed in regions where special...
Web link analysis methods such as PageRank, HITS, and SALSA have focused on obtaining global popularity or authority of the set of Web pages in question. Although global popularity is useful for general queries, we find that global popularity is not as useful for queries about which the global population has less knowledge. By examining the many di...
In this paper, we present a supervised framework for extracting blood vessels from retinal images. The local standardisation of the green channel of the retinal image and the Gabor filter responses at four different scales are used as features for pixel classification. The Bayesian classifier is used with a bagging framework to classify each image...
Document clustering involves repetitive scanning of a document set; therefore, as the size of the set increases, the time required for the clustering task increases and may even become impossible due to computational constraints. Compressive sampling is a feature sampling technique that allows us to perfectly reconstruct a vector from a small number...
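A toy sketch of the reconstruction claim, recovering a sparse vector from far fewer random measurements than its length; orthogonal matching pursuit is used here as the recovery algorithm, which may differ from the paper's solver:

import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
n, k, m = 1000, 10, 100            # signal length, sparsity, measurements
x = np.zeros(n)
x[rng.choice(n, k, replace=False)] = rng.standard_normal(k)   # sparse vector

Phi = rng.standard_normal((m, n)) / np.sqrt(m)   # random measurement matrix
y = Phi @ x                                      # m << n compressed samples

# Recover the sparse vector from the few measurements.
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=k).fit(Phi, y)
x_hat = omp.coef_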
Comparing, clustering and merging ellipsoids are problems that arise in various applications, e.g., anomaly detection in wireless sensor networks and motif-based patterned fabrics. We develop a theory underlying three measures of similarity that can be used to find groups of similar ellipsoids in p-space. Clusters of ellipsoids are suggested by dar...
We model anomalies in wireless sensor networks with ellipsoids that represent node measurements. Elliptical anomalies (EAs) are level sets of ellipsoids, and we classify them as type 1, type 2 and higher order anomalies. Three measures of (dis)similarity between pairs of ellipsoids convert model ellipsoids into dissimilarity data. Clusters in the diss...
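The paper's three measures are not reproduced here; purely as an illustration, a stand-in dissimilarity between ellipsoids (centre c, shape matrix S) that can populate the kind of dissimilarity matrix the clustering step consumes:

import numpy as np

def ellipsoid_dissimilarity(c1, S1, c2, S2):
    # Illustrative only: combine centre separation with a Frobenius
    # gap between shape matrices. Not one of the paper's measures.
    return np.linalg.norm(c1 - c2) + np.linalg.norm(S1 - S2, ord="fro")

def dissimilarity_matrix(ellipsoids):
    # Pairwise dissimilarities over model ellipsoids, ready for
    # reordering/imaging (e.g. by iVAT) to expose clusters.
    n = len(ellipsoids)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = ellipsoid_dissimilarity(*ellipsoids[i], *ellipsoids[j])
    return D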
Search effectiveness metrics are used to evaluate the quality of the answer lists returned by search services, usually based on a set of relevance judgments. One plausible way of calculating an effectiveness score for a system run is to compute the inner-product of the run’s relevance vector and a “utility” vector, where the ith element in the util...
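A small sketch of this inner-product formulation, using rank-biased precision weights as the example utility vector (the ith weight being (1 - p) * p**(i - 1) for persistence p):

import numpy as np

def metric_score(relevance, utility):
    # Effectiveness as the inner product of a run's relevance vector
    # and a utility vector over rank positions.
    return float(np.dot(relevance, utility))

p, depth = 0.8, 10
rbp_utility = (1 - p) * p ** np.arange(depth)
score = metric_score(np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 0]), rbp_utility)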
Information retrieval results are currently limited to the publication in which they exist. Significance tests are used to remove the dependence of the evaluation on the query sample, but the findings cannot be transferred to other systems not involved in the test. Confidence intervals for the population parameters provide query independent results...
Data clustering is a difficult and challenging task, especially when the hidden clusters are of different shapes and non-linearly separable in the input space. This paper addresses this problem by proposing a new method that combines a path-based dissimilarity measure and multi-dimensional scaling to effectively identify these complex separable str...
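A sketch of one standard construction of a path-based (minimax) dissimilarity, which on a complete graph equals the largest edge on the minimum-spanning-tree path between two points, followed by metric MDS on the precomputed matrix; the paper's exact measure may differ:

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

def path_based_dissimilarity(X):
    # Pairwise Euclidean distances as edges of a complete graph.
    D = squareform(pdist(X))
    n = len(D)
    T = minimum_spanning_tree(D).toarray()
    T = np.maximum(T, T.T)                  # symmetrise the MST
    P = np.where(T > 0, T, np.inf)
    np.fill_diagonal(P, 0.0)
    # Minimax closure: Floyd-Warshall with max in place of addition.
    for k in range(n):
        P = np.minimum(P, np.maximum(P[:, [k]], P[[k], :]))
    return P

# Points joined by a chain of short hops end up close together, so an
# elongated cluster (here a noisy arc) becomes compact after embedding.
rng = np.random.default_rng(0)
t = np.linspace(0, np.pi, 50)
X = np.c_[np.cos(t), np.sin(t)] + rng.normal(scale=0.02, size=(50, 2))
Z = MDS(n_components=2, dissimilarity="precomputed",
        random_state=0).fit_transform(path_based_dissimilarity(X))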
Spectral co-clustering is a generic method of computing co-clusters of relational data, such as sets of documents and their terms. Latent semantic analysis is a method of document and term smoothing that can assist in the information retrieval process. In this article we examine the process behind spectral clustering for documents and terms, and...
Reducing the Web access latency perceived by a Web user has become a problem of interest. Web prefetching and caching are two effective techniques that can be used together to reduce the access latency problem on the Internet. Because the success of Web prefetching mainly relies on the prediction accuracy of prediction methods, in this paper we emp...
It has been shown that the use of topic models for information retrieval provides an increase in precision when used in the appropriate form. Latent Dirichlet Allocation (LDA) is a generative topic model that allows us to model documents using a Dirichlet prior. Using this topic model, we are able to obtain a fitted Dirichlet parameter that provide...
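A minimal sketch of fitting LDA and reading off per-document topic mixtures, using scikit-learn rather than the paper's own estimation procedure; the toy corpus is illustrative:

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "cats and dogs are pets",
        "stock markets fell sharply", "investors sold their shares"]

counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Per-document topic mixtures under the fitted Dirichlet prior; a query
# can be folded in the same way and compared to documents in topic space.
doc_topics = lda.transform(counts)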
Information retrieval systems are evaluated against test collections of topics, documents, and assessments of which documents are relevant to which topics. Documents are chosen for relevance assessment by pooling runs from a set of existing systems. New systems can return unassessed documents, leading to an evaluation bias against them. In this pap...
The concept of recall has been one of the key elements of system measurement throughout the history of information retrieval, despite the fact that there are many unanswered questions as to its value. In this essay, we review those questions and explore several further issues that affect the usefulness of recall. In particular, we ask whether it is...
Web page prefetching has been shown to provide a reduction in Web access latency, but is highly dependent on the accuracy of the Web page prediction method. Conditional Random Fields (CRFs) with Error Correcting Output Coding (ECOC) have been shown to provide highly accurate and efficient Web page prediction on large-size websites. However, the limited class i...
Probabilistic latent semantic analysis (PLSA) is a method for computing term and document relationships from a document set. The probabilistic latent semantic index (PLSI) has been used to store PLSA information, but unfortunately the PLSI uses excessive storage space relative to a simple term frequency index, which causes lengthy query times. To o...
Latent semantic analysis (LSA) is a generalised vector space method (GVSM) that uses dimension reduction to generate term correlations for use during the information retrieval process. We hypothesised that even though the dimension reduction establishes correlations between terms, the reduction is causing a degradation in the correlation of a term...
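A compact sketch of how LSA's dimension reduction yields the term correlations in question, via truncated SVD of a TF-IDF matrix; the corpus and component count are illustrative:

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["information retrieval systems", "retrieval of text documents",
        "latent semantic analysis of text", "semantic term correlations"]

X = TfidfVectorizer().fit_transform(docs)            # documents x terms
svd = TruncatedSVD(n_components=2, random_state=0).fit(X)

# Rows of components_.T are term vectors in the reduced space; their
# cosine similarities are the term correlations LSA establishes.
term_correlations = cosine_similarity(svd.components_.T)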
D is an m × n matrix of pairwise dissimilarities between m row objects O_r and n column objects O_c, which, taken together, comprise m+n objects O = {o_1, ..., o_m, o_{m+1}, ..., o_{m+n}}. There are four clustering problems associated with O: (P1) amongst the row objects O_r; (P2) amongst the column...
We introduce smoothing of retrieval effectiveness scores, which balances results from prior incomplete query sets against limited additional complete information, in order to obtain more refined system orderings than would be possible on the new queries alone.
Hidden term relationships can be found within a document collection using Latent semantic analysis (LSA) and can be used to assist in information retrieval. LSA uses the inner product as its similarity function, which unfortunately introduces bias due to document length and term rarity into the term relationships. In this article, we present the no...
Web page prefetching has been used efficiently to reduce the access latency problem of the Internet; its success mainly relies on the accuracy of Web page prediction. As powerful sequential learning models, conditional random fields (CRFs) have been used successfully to improve the Web page prediction accuracy when the total number of unique Web pa...
Probabilistic latent semantic analysis (PLSA) is a method of calculating term relationships within a document set using term frequencies. It is well known within the information retrieval community that raw term frequencies contain various biases that affect the precision of the retrieval system. Weighting schemes, such as BM25, have been developed...
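For reference, the standard BM25 weight mentioned above damps raw term frequency and corrects for document length and term rarity; the constants k1 and b take their usual default values:

import math

def bm25_weight(tf, df, n_docs, doc_len, avg_len, k1=1.2, b=0.75):
    # Rarity correction: rare terms get a higher inverse document frequency.
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    # Saturating term frequency with document-length normalisation.
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))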
Language modelling is a new form of information retrieval that is rapidly becoming the preferred choice over probabilistic and vector space models, due to the intuitiveness of the model formulation and its effectiveness. The language model assumes that all terms are independent; therefore, the majority of the documents returned to the user will be thos...
Web page prefetching is used to reduce the access latency of the Internet. However, if most prefetched Web pages are not visited by the users in their subsequent accesses, the limited network bandwidth and server resources will not be used efficiently and may worsen the access delay problem. Therefore, it is critical that we have an accurate predic...
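To make the prediction step concrete, here is a first-order Markov baseline that counts page-to-page transitions in past sessions and prefetches the most frequent successors; it is a stand-in for illustration, not the predictor developed in the paper:

from collections import Counter, defaultdict

class FirstOrderPredictor:
    def __init__(self):
        self.transitions = defaultdict(Counter)

    def train(self, sessions):
        # Count how often each page follows each other page.
        for s in sessions:
            for a, b in zip(s, s[1:]):
                self.transitions[a][b] += 1

    def predict(self, page, k=3):
        # The k most frequent successors are the prefetch candidates.
        return [p for p, _ in self.transitions[page].most_common(k)]

predictor = FirstOrderPredictor()
predictor.train([["home", "news", "sport"], ["home", "news", "weather"]])
print(predictor.predict("news"))   # pages worth prefetching after /news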
Web page prefetching techniques are used to address the access latency problem of the Internet. To perform successful prefetching, we must be able to predict the next set of pages that will be accessed by users. The PageRank algorithm used by Google is able to compute the popularity of a set of Web pages based on their link structure. In this paper...
Many queries on collections of text documents are too short to produce informative results. Automatic query expansion is a method of adding terms to the query without interaction from the user in order to obtain more refined results. In this investigation, we examine our novel automatic query expansion method using the probabilistic latent semantic...
The PageRank algorithm is used in Web information retrieval to calculate a single list of popularity scores for each page in the Web. These popularity scores are used to rank query results when presented to the user. By using the structure of the entire Web to calculate one score per document, we are calculating a general popularity score, not...
Rank-biased precision (RBP) is a new method of information retrieval system evaluation that takes into account any uncertainty due to incomplete relevance judgements for a given document and query set. To do so, RBP uses a model of user persistence. In this article, we will present a statistical analysis of the RBP user persistence model to observe...
Current information retrieval methods either ignore the term positions or deal with exact term positions; the former can be seen as coarse document resolution, the latter as fine document resolution. We propose a new spectral-based information retrieval method that is able to utilize many different levels of document resolution by examining the ter...
The vector space model (VSM) of information retrieval suffers in two areas: it does not utilise term positions and it treats every term as being independent. We examine two information retrieval methods based on the simple vector space model. The first uses the query term position flow within the documents to calculate the document score, the secon...
We propose a new spectral text retrieval method using the Discrete Cosine Transform (DCT). By taking advantage of the properties of the DCT and by employing the fast query and compression techniques found in vector space methods (VSM), we show that we can process queries as fast as VSM and achieve a much higher precision.
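The sketch below only illustrates the core idea of turning a term's positions into a compact DCT spectrum; the paper's actual indexing and scoring scheme differs, and all names and bin sizes here are assumptions:

import numpy as np
from scipy.fft import dct

def term_spectrum(positions, doc_len, n_bins=64, n_coeffs=8):
    # Bin the term's positions into a coarse occurrence signal.
    signal = np.zeros(n_bins)
    for p in positions:
        signal[int(p / doc_len * (n_bins - 1))] += 1.0
    # Keep only the first few DCT coefficients: a compact summary of
    # where in the document the term tends to occur.
    return dct(signal, norm="ortho")[:n_coeffs]

# Query terms with similar spectra occur in similar regions of the
# document, which can contribute to a proximity-aware score.
spec_a = term_spectrum([10, 12, 300], doc_len=400)
spec_b = term_spectrum([11, 14, 290], doc_len=400)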
Current document retrieval methods use a vector space similarity measure to give scores of relevance to documents when related to a specific query. The central problem with these methods is that they neglect any spatial information within the documents in question. We present a new method, called Fourier Domain Scoring (FDS), which takes advantage...
The information found on the Internet is growing at such a rapid rate that soon methods of searching through text using term frequencies will not be enough. At the moment, many Web search engines are showing signs of imprecision because they are based on these term counting methods which do not examine the relationships between the document terms....
Latent semantic retrieval methods (unlike vector space methods) take the document and query vectors and map them into a topic space to cluster related terms and documents. This produces a more precise retrieval but also a long query time. We present a new method of document retrieval which allows us to process the latent semantic information into a...
The fast vector space and probabilistic methods use term counts, and the slower proximity methods use term positions. We present the spectral-based information retrieval method, which is able to use both term count and position information to obtain high-precision document rankings. We are able to do this in a time comparable to the vector...
Fourier Domain Scoring (FDS) has been shown to give a 60% improvement in precision over the existing vector space methods, but its index requires a large storage space. We propose a new Web text mining method using the discrete cosine transform (DCT) to extract useful information from text documents and to provide improved document ranking, witho...
The traditional methods of spectral text retrieval (FDS, CDS) create an index of spatial data and convert the data to its spectral form at query time. We present a new method of implementing and querying an index containing spectral data which will conserve the high precision performance of the spectral methods, reduce the time needed to resolve the...
Contents: 1 Introduction; 2 Problem Statement; 3 Literature Review (3.1 Retrieving Text, 3.2 Understanding Music, 3.3 Identifying Images, 3.4 Extracting Video)...
Most search engines return a lot of unwanted information. A more thorough filtering process can be performed on this information to sort out the relevant documents. A new method called Frequency Domain Scoring (FDS), which is based on the Fourier Transform, is proposed. FDS performs the filtering by examining the locality of the keywords throughout...