ABSTRACT: Identifying unusual or anomalous patterns in an underlying dataset is an important but challenging task in many applications. The focus of the unsupervised anomaly detection literature has mostly been on vectorised data. However, many applications are more naturally described using higher-order tensor representations. Approaches that vectorise tensorial data can destroy the structural information encoded in the high-dimensional space and suffer from the curse of dimensionality. In this paper we present the first unsupervised tensorial anomaly detection method, along with a randomised version of our method. Our anomaly detection method, the One-class Support Tensor Machine (1STM), is a generalisation of conventional one-class Support Vector Machines to higher-order spaces. 1STM preserves the multiway structure of tensor data, while achieving significant improvements in accuracy and efficiency over conventional vectorised methods. We then leverage the theory of nonlinear random projections to propose the Randomised 1STM (R1STM). Our empirical analysis on several real and synthetic datasets shows that our R1STM algorithm delivers accuracy comparable to or better than a state-of-the-art deep learning method and traditional kernelised approaches for anomaly detection, while being approximately 100 times faster in training and testing.
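The 1STM itself has no off-the-shelf implementation, but the vectorised baseline it is compared against can be sketched with scikit-learn's `OneClassSVM`. Flattening each tensor sample is exactly the step the abstract argues destroys multiway structure; the tensor shapes, means and `nu`/`gamma` settings below are illustrative assumptions, not the paper's setup.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Tensorial data: 200 normal samples, each a 5x5 second-order tensor,
# plus 10 anomalies drawn far from the normal distribution.
normal = rng.normal(0.0, 1.0, size=(200, 5, 5))
anomalies = rng.normal(4.0, 1.0, size=(10, 5, 5))

# Vectorised baseline: flatten each tensor, losing its multiway structure.
X_train = normal.reshape(len(normal), -1)
X_test = np.vstack([normal[:10].reshape(10, -1), anomalies.reshape(10, -1)])

clf = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(X_train)
pred = clf.predict(X_test)  # +1 for inliers, -1 for anomalies
```

A tensorial method such as 1STM would instead operate on the `(5, 5)` samples directly, replacing the flattening step with mode-wise weight tensors.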
ABSTRACT: We address the problem of outlying aspects mining: given a query object and a reference multidimensional data set, how can we discover what aspects (i.e., subsets of features, or subspaces) make the query object most outlying? Outlying aspects mining can be used to explain any data point of interest, which itself might be an inlier or an outlier. In this paper, we investigate several open challenges faced by existing outlying aspects mining techniques and propose novel solutions, including (a) how to design effective scoring functions that are unbiased with respect to dimensionality yet remain computationally efficient, and (b) how to efficiently search through the exponentially large space of all possible subspaces. We formalize the concept of dimensionality unbiasedness, a desirable property of outlyingness measures. We then characterize existing scoring measures, as well as our novel proposed ones, in terms of efficiency, dimensionality unbiasedness and interpretability. Finally, we evaluate the effectiveness of different methods for outlying aspects discovery and demonstrate the utility of our proposed approach on both large real and synthetic data sets.
No preview · Article · Feb 2016 · Data Mining and Knowledge Discovery
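The paper's own scoring functions are not reproduced here; as one hedged illustration of a dimensionality-unbiased outlyingness measure, a kernel-density Z-score can be sketched: densities are standardised within each subspace, so scores from subspaces of different dimensionality become comparable. The bandwidth `h`, the toy data and the query are all assumptions for illustration.

```python
import numpy as np

def density(data, point, h=1.0):
    """Gaussian kernel density estimate of `point` w.r.t. `data` (n x d)."""
    d = data.shape[1]
    sq = ((data - point) ** 2).sum(axis=1)
    return np.exp(-sq / (2 * h * h)).mean() / (np.sqrt(2 * np.pi) * h) ** d

def z_score_outlyingness(data, query, subspace):
    """Density Z-score of the query in a subspace: standardising against
    the densities of all points in that subspace makes scores comparable
    across subspaces of different dimensionality."""
    sub = data[:, subspace]
    dens = np.array([density(sub, p) for p in sub])
    q = density(sub, query[subspace])
    return (dens.mean() - q) / dens.std()  # larger => more outlying

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 2))
query = np.array([8.0, 0.0])  # extreme in feature 0, ordinary in feature 1
s0 = z_score_outlyingness(data, query, [0])
s1 = z_score_outlyingness(data, query, [1])
# subspace {0} should explain the query's outlyingness far better than {1}
```

Searching over subspaces (challenge (b) in the abstract) would then rank candidate feature subsets by this kind of score.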
ABSTRACT: Big Data can mean different things to different people. The scale and challenges of Big Data are often described using three attributes, namely Volume, Velocity and Variety (3Vs), which reflect only some aspects of the data. In this chapter we review historical aspects of the term "Big Data" and the associated analytics. We augment the 3Vs with additional attributes of Big Data to make the characterisation more comprehensive and relevant. We show that Big Data is not just 3Vs, but 3² Vs, that is, 9 Vs covering the fundamental motivation behind Big Data, which is to incorporate Business Intelligence (BI) based on different hypotheses or statistical models so that Big Data Analytics (BDA) can enable decision makers to make useful predictions for crucial decisions or research outcomes. The history of Big Data has demonstrated that the most cost-effective way of performing BDA is to employ Machine Learning (ML) on Cloud Computing (CC) based infrastructure or, simply, ML + CC -> BDA. This chapter is devoted to helping decision makers by defining BDA as a solution and an opportunity to address their business needs.
ABSTRACT: The emergence of cloud computing has made possible the dynamic provisioning of elastic capacity to applications on demand. Cloud data centers contain thousands of physical servers hosting orders of magnitude more virtual machines that can be allocated on demand to users in a pay-as-you-go model. However, not all systems are able to scale up by just adding more virtual machines. Therefore, it is essential, even for scalable systems, to project workloads in advance rather than using a purely reactive approach. Given the scale of modern cloud infrastructures generating real-time monitoring information, along with all the information generated by operating systems and applications, this data poses the issues of volume, velocity, and variety that are addressed by Big Data approaches. In this paper, we investigate how Big Data analytics can enhance the operation of cloud computing environments. We discuss diverse applications of Big Data analytics in clouds, open issues for enhancing cloud operations via Big Data analytics, and an architecture for anomaly detection and prevention in clouds, along with future research directions.
ABSTRACT: White matter lesions (WMLs) are small groups of dead cells that clump together in the white matter of the brain. In this paper, we propose a reliable method to automatically segment WMLs. Our method uses a novel filter to enhance the intensity of WMLs. A feature set containing enhanced intensity, anatomical and spatial information is then used to train a random forest classifier for the initial segmentation of WMLs. Following that, a reliable and robust edge potential function based Markov Random Field (MRF) is proposed to obtain the final segmentation by removing false positive WMLs. Quantitative evaluation of the proposed method is performed on 24 subjects of the ENVISion study. The segmentation results are validated against manual segmentation, performed under the supervision of an expert neuroradiologist. The results show a dice similarity index of 0.76 for severe lesion load, 0.73 for moderate lesion load and 0.61 for mild lesion load. In addition, we have compared our method with three state-of-the-art methods on 20 subjects of the Medical Image Computing and Computer Assisted Intervention Society's (MICCAI's) MS lesion challenge dataset, where our method shows better segmentation accuracy compared to the state-of-the-art methods. These results indicate that the proposed method can assist neuroradiologists in assessing WMLs in clinical practice.
No preview · Article · Aug 2015 · Computerized medical imaging and graphics: the official journal of the Computerized Medical Imaging Society
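The dice similarity index used for the evaluation above is straightforward to compute for binary lesion masks; a minimal sketch (the toy masks are illustrative, not study data):

```python
import numpy as np

def dice_index(seg, truth):
    """Dice similarity index: 2|A∩B| / (|A| + |B|) for binary masks."""
    seg, truth = seg.astype(bool), truth.astype(bool)
    denom = seg.sum() + truth.sum()
    return 2.0 * np.logical_and(seg, truth).sum() / denom if denom else 1.0

a = np.zeros((4, 4), dtype=int); a[1:3, 1:3] = 1   # 4 segmented voxels
b = np.zeros((4, 4), dtype=int); b[1:3, 1:4] = 1   # 6 truth voxels, 4 shared
print(dice_index(a, b))  # 2*4 / (4+6) = 0.8
```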
ABSTRACT: Monitoring and predicting traffic conditions are of utmost importance for reacting to emergency events in time and for computing the real-time shortest travel-time path. Mobile sensors, such as GPS devices and smartphones, are useful for monitoring urban traffic due to their large coverage area and ease of deployment. Many researchers have employed such sensed data to model and predict traffic conditions. To do so, we first have to address the problem of associating GPS trajectories with the road network in a robust manner. Existing methods rely on point-by-point matching to map individual GPS points to a road segment. However, GPS data is imprecise due to noise in GPS signals. GPS coordinates can have errors of several meters and, therefore, direct mapping of individual points is error prone. Acknowledging that every GPS point is potentially noisy, we propose a radically different approach to overcome inaccuracy in GPS data. Instead of focusing on a point-by-point approach, our proposed method considers the set of relevant GPS points in a trajectory that can be mapped together to a road segment. This clustering approach gives us a macroscopic view of the GPS trajectories even under very noisy conditions. Our method clusters points based on the direction of movement as a spatial-linear cluster, ranks the possible route segments in the graph for each group, and searches for the best combination of segments as the overall path for the given set of GPS points. Through extensive experiments on both synthetic and real datasets, we demonstrate that, even with highly noisy GPS measurements, our proposed algorithm outperforms state-of-the-art methods in terms of both accuracy and computational cost.
No preview · Article · Jul 2015 · International Journal of Geographical Information Science
ABSTRACT: In this paper a reliable and robust method is presented for the quantification of Focal Arteriolar Narrowing (FAN), a precursor for hypertension, stroke and other cardiovascular diseases. Our contribution is a novel edge-based retinal blood vessel segmentation technique which is very effective in low-contrast retinal images. In addition, we developed a robust and reliable measurement technique to quantify FAN. For initial results and quantitative evaluation, we evaluated the proposed method on a dataset of 53 manually graded vessel segments. The experimental results indicate a strong correlation between the computed FAN values and the expert grading (Spearman coefficient of 0.76, P < 0.0001). The results also show that the system can detect healthy and FAN-affected cases with an accuracy of 93%. This demonstrates the reliability of the proposed method for automatic focal narrowing assessment. The quantitative measurements provided by the system may help to establish a more reliable link between FAN and known systemic and eye diseases, which will be investigated further.
ABSTRACT: In outlying aspects mining, given a query object, we aim to answer the question of what features make the query most outlying. The most recent works tackle this problem using two different strategies. (i) Feature selection approaches select the features that best distinguish the two classes: the query point vs. the rest of the data. (ii) Score-and-search approaches define an outlyingness score, then search for subspaces in which the query point exhibits the best score. In this paper, we first present an insightful theoretical result connecting the two types of approaches. Second, we present OARank, a hybrid framework that leverages the efficiency of feature selection based approaches and the effectiveness and versatility of score-and-search based methods. Our proposed approach is orders of magnitude faster than previously proposed score-and-search based approaches while being slightly more effective, making it suitable for mining large data sets.
ABSTRACT: Recent studies show that cerebral White Matter Lesion (WML) is related to cerebrovascular diseases, cardiovascular diseases, dementia and psychiatric disorders. Manual segmentation of WML is not appropriate for long-term longitudinal studies because it is time consuming and shows high intra- and inter-rater variability. In this paper, a fully automated segmentation method is utilized to segment WML from brain Magnetic Resonance Imaging (MRI). The segmentation method uses a combination of a global neighbourhood given contrast feature-based Random Forest (RF) classifier and a Markov Random Field (MRF) to segment WML. To remove false positive lesions we use a rule-based morphological post-processing operation. Quantitative evaluation of the proposed method was performed on 24 subjects of the ENVISion study. The segmentation results were validated against the manual segmentation, performed by an experienced radiologist, and were compared to a recently published WML segmentation method. The results show a dice similarity index of 0.75 for high lesion load, 0.71 for medium lesion load and 0.60 for low lesion load.
ABSTRACT: Canonical correlation analysis (CCA) has proven an effective tool for two-view dimension reduction due to its profound theoretical foundation and success in practical applications. For multi-view learning, however, it is limited by its capability of handling only data represented by two-view features, while in many real-world applications the number of views is frequently much larger. Although the ad hoc approach of simultaneously exploring all possible pairs of features can numerically deal with multi-view data, it ignores the higher-order statistics (correlation information) that can only be discovered by simultaneously exploring all features. Therefore, in this work, we develop tensor CCA (TCCA), which straightforwardly yet naturally generalizes CCA to handle data with an arbitrary number of views by analyzing the covariance tensor of the different views. TCCA aims to directly maximize the canonical correlation of multiple (more than two) views. Crucially, we prove that the multi-view canonical correlation maximization problem is equivalent to finding the best rank-1 approximation of the data covariance tensor, which can be solved efficiently using the well-known alternating least squares (ALS) algorithm. As a consequence, the higher-order correlation information contained in the different views is exploited, and thus a more reliable common subspace shared by all features can be obtained. In addition, a non-linear extension of TCCA is presented. Experiments on various challenging tasks, including large-scale biometric structure prediction, internet advertisement classification and web image annotation, demonstrate the effectiveness of the proposed method.
Preview · Article · Feb 2015 · IEEE Transactions on Knowledge and Data Engineering
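The key computational step named above, the best rank-1 approximation of a covariance tensor via ALS, can be sketched for a third-order tensor in NumPy. This is a generic ALS sketch, not the authors' implementation; the tensor sizes, seed and toy factors are illustrative assumptions.

```python
import numpy as np

def rank1_als(T, iters=50):
    """Best rank-1 approximation of a 3-way tensor via alternating least
    squares: fix two factor vectors, update the third by contracting the
    tensor against them, and repeat."""
    I, J, K = T.shape
    rng = np.random.default_rng(0)
    a, b, c = (rng.normal(size=n) for n in (I, J, K))
    for _ in range(iters):
        a = np.einsum('ijk,j,k->i', T, b, c); a /= np.linalg.norm(a)
        b = np.einsum('ijk,i,k->j', T, a, c); b /= np.linalg.norm(b)
        c = np.einsum('ijk,i,j->k', T, a, b)  # the norm of c absorbs the scale
    return a, b, c

# A genuinely rank-1 tensor is recovered (up to sign cancellation) exactly.
u, v, w = np.array([1., 2.]), np.array([3., 1.]), np.array([2., 5.])
T = np.einsum('i,j,k->ijk', u, v, w)
a, b, c = rank1_als(T)
approx = np.einsum('i,j,k->ijk', a, b, c)
```

In TCCA the tensor being approximated would be the covariance tensor across all views, and the factor vectors yield the per-view canonical weights.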
ABSTRACT: Graphs are a powerful representation of relational data, such as social and biological networks. Often, these entities form groups and are organised according to a latent structure. However, these groupings and structures are generally unknown and can be difficult to identify. Graph clustering is an important type of approach used to discover these vertex groups and the latent structure within graphs. One approach to graph clustering is non-negative matrix factorisation (NMF). However, the formulations of existing factorisation approaches can be overly relaxed, so their groupings and results may be difficult to interpret, may fail to discover the true latent structure and groupings, and may converge to extreme solutions. In this paper, we propose a new formulation of the graph clustering problem that results in clusterings that are easy to interpret. Combined with a novel algorithm, the clusterings are also more accurate than those of state-of-the-art algorithms for both synthetic and real datasets.
ABSTRACT: Scientific workflows are used to model applications of high-throughput computation and complex large-scale data analysis. In recent years, Cloud computing has been fast evolving as the target platform for such applications among researchers. Furthermore, new pricing models have been pioneered by Cloud providers that allow users to provision resources and to use them in an efficient manner with significant cost reductions. In this paper, we propose a scheduling algorithm that schedules tasks on Cloud resources using two different pricing models (spot and on-demand instances) to reduce the cost of execution whilst meeting the workflow deadline. The proposed algorithm is fault tolerant against the premature termination of spot instances and is also robust against performance variations of Cloud resources. Experimental results demonstrate that our heuristic reduces execution cost by up to 70% compared to using only on-demand instances.
Preview · Article · Dec 2014 · Procedia Computer Science
ABSTRACT: Altered expression profiles of microRNAs (miRNAs) are linked to many diseases, including lung cancer. miRNA expression profiling is reproducible and miRNAs are very stable. These characteristics make miRNAs ideal biomarker candidates.
This work aims to detect 2- and 3-miRNA groups, together with specific expression ranges of these miRNAs, to form simple linear discriminant rules for biomarker identification and biological interpretation. Our method is based on a novel committee of decision trees to derive 2- and 3-miRNA 100%-frequency rules. This method is applied to a data set of lung miRNA expression profiles of 61 squamous cell carcinoma (SCC) samples and 10 normal tissue samples. A distance separation technique is used to select the most reliable rules, which are then evaluated on a large independent data set.
We obtained four 2-miRNA and three 3-miRNA top-ranked rules. One important rule is: if the expression level of miR-98 is above 7.356 and the expression level of miR-205 is below 9.601 (log2 quantile-normalized MirVan miRNA Bioarray signals), then the sample is normal rather than cancerous, with both specificity and sensitivity of 100%. The classification performance of our best miRNA rules remarkably outperformed that of randomly selected miRNA rules. Our data analysis also showed that miR-98 and miR-205 have two common predicted target genes, FZD3 and RPS6KA3, which are genes associated with carcinoma according to the Online Mendelian Inheritance in Man (OMIM) database. We also found that most of the chromosomal loci of these miRNAs have a high frequency of genomic alteration in lung cancer. On the independent data set (with balanced controls), the three miRNAs miR-126, miR-205 and miR-182 from our best rule can separate the two classes of samples with an accuracy of 84.49%, a sensitivity of 91.40% and a specificity of 77.14%.
Our results indicate that rule discovery followed by distance separation is a powerful computational method for identifying reliable miRNA biomarkers. The visualization of the rules and the clear separation between the normal and cancer samples achieved by our rules will help biology experts in their analysis and biological interpretation.
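The top-ranked rule quoted above is simple enough to encode directly; the thresholds come from the abstract itself (log2 quantile-normalized MirVan miRNA Bioarray signals), while the function name and example expression values are illustrative.

```python
def is_normal_by_rule(mir98, mir205):
    """Top 2-miRNA rule from the abstract (log2 quantile-normalized
    signals): miR-98 above 7.356 AND miR-205 below 9.601 => normal tissue
    rather than squamous cell carcinoma."""
    return mir98 > 7.356 and mir205 < 9.601

print(is_normal_by_rule(8.1, 9.0))   # both thresholds satisfied: normal
print(is_normal_by_rule(6.9, 9.0))   # miR-98 too low: not classified normal
```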
ABSTRACT: It is well known that processing big graph data can be costly on the Cloud. Processing big graph data introduces complex and multiple iterations that raise challenges such as parallel memory bottlenecks, deadlocks, and inefficiency. To tackle these challenges, we propose a novel technique for effectively processing big graph data on the Cloud. Specifically, the big data is compressed using its spatiotemporal features on the Cloud. By exploiting spatial data correlation, we partition a graph data set into clusters. Within a cluster, the workload can be shared by inference based on time-series similarity. By exploiting temporal correlation, temporal data compression is conducted in each time series or single graph edge. A novel data-driven scheduling scheme is also developed for data processing optimization. The experimental results demonstrate that the spatiotemporal compression and scheduling achieve significant performance gains in terms of data size and data fidelity loss.
No preview · Article · Dec 2014 · Journal of Computer and System Sciences
ABSTRACT: Mining GPS trajectories of moving vehicles has led to many research directions, such as traffic modeling and driving prediction. An important challenge is how to map GPS traces to a road network accurately under noisy conditions. However, to the best of our knowledge, there is no existing work that first simplifies a trajectory to improve map matching. In this paper we propose three trajectory simplification algorithms that can deal with both offline and online trajectory data. We use weighting functions to incorporate spatial knowledge, such as segment lengths and turning angles, into our simplification algorithms. In addition, we measure the noise degree of a GPS point based on its spatio-temporal relationship to its neighbors. The effectiveness of our algorithms is comprehensively evaluated on real trajectory datasets with varying noise levels and sampling rates. Our evaluation shows that under highly noisy conditions, our proposed algorithms considerably improve map matching accuracy and reduce computational costs compared to the state-of-the-art methods.
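The paper's three algorithms are not reproduced here; as an illustration of the turning-angle idea mentioned above, a minimal simplifier can keep only points where the heading changes sharply. The angle threshold and toy trajectory are assumptions for illustration.

```python
import math

def turning_angle(p, q, r):
    """Absolute change in heading (radians) at point q on the path p->q->r."""
    a1 = math.atan2(q[1] - p[1], q[0] - p[0])
    a2 = math.atan2(r[1] - q[1], r[0] - q[0])
    d = abs(a2 - a1)
    return min(d, 2 * math.pi - d)

def simplify(traj, angle_thresh=0.2):
    """Keep endpoints plus interior points where the heading turns sharply;
    near-collinear points carry little shape information and are dropped."""
    kept = [traj[0]]
    for i in range(1, len(traj) - 1):
        if turning_angle(traj[i - 1], traj[i], traj[i + 1]) > angle_thresh:
            kept.append(traj[i])
    kept.append(traj[-1])
    return kept

# An L-shaped trace: only the corner survives among the interior points.
traj = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
print(simplify(traj))  # [(0, 0), (2, 0), (2, 2)]
```

A weighted variant, closer to the paper's framing, would score each point by a combination of turning angle and adjacent segment lengths rather than a hard threshold.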
ABSTRACT: We explore the idea that by modeling a financial time series at regular points in space (i.e., price) rather than regular points in time, more predictive power can be extracted from the time series. We term this concept of modeling time series at regular points in space 'volatility homogenisation'. Our hypothesis is that if we select the correct quantum in terms of regular steps in space, we remove noise that can normally interfere with prediction methods and thus uncover the underlying patterns in the time series. Furthermore, this technique can also be viewed as a way of decoupling spatial and temporal dependence, which, again, can remove unnecessary noise. We apply this decomposition to nine different financial time series and then apply support vector classification to make our predictions on the decomposed time series. Our results show that in the majority of cases, this technique yields better predictions than applying techniques such as support vector regression and Autoregressive Integrated Moving Average models to data taken at regular points in time. The contribution of this paper is that it demonstrates the efficacy of this new methodology, 'volatility homogenisation'.
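The sampling step can be sketched as recording the series only when the price has moved by at least one quantum from the last recorded level, so observations land at regular points in price rather than time. The quantum value and toy price series are illustrative assumptions, not the paper's decomposition in full.

```python
def homogenise(prices, quantum):
    """Resample a price series at regular points in *space*: record a new
    observation only when the price has moved by at least `quantum` from
    the last recorded level. Small within-quantum wiggles are discarded
    as noise."""
    out = [prices[0]]
    for p in prices[1:]:
        if abs(p - out[-1]) >= quantum:
            out.append(p)
    return out

prices = [100.0, 100.2, 100.1, 101.5, 101.4, 99.0, 99.1]
print(homogenise(prices, quantum=1.0))  # [100.0, 101.5, 99.0]
```

A classifier (support vector classification in the paper) would then be trained on the direction of successive recorded moves instead of on the raw clock-sampled series.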
ABSTRACT: Retinal arteriovenous (AV) nicking is a precursor for hypertension, stroke and other cardiovascular diseases. In this paper, an effective method is proposed for the analysis of retinal venular widths to automatically classify the severity level of AV nicking. We use a combination of intensity and edge information of the vein to compute its widths. The widths at various sections of the vein near the crossover point are then utilized to train a random forest classifier to classify the severity of AV nicking. We analyzed 47 color retinal images obtained from two population-based studies for quantitative evaluation of the proposed method. We compare the detection accuracy of our method with a recently published four-class AV nicking classification method. Our proposed method shows 64.51% classification accuracy, in contrast to the reported classification accuracy of 49.46% by the state-of-the-art method.
ABSTRACT: Imbalanced dataset classification is a challenging problem, since many classifiers are sensitive to class distribution, so that their predictions are biased towards the majority class. Hellinger Distance has been proven to be skew-insensitive, and decision trees that employ Hellinger Distance as a splitting criterion have shown better performance than decision trees based on Information Gain. We propose a new decision tree induction classifier (HeDEx) based on Hellinger Distance that is an ensemble of randomized trees selecting both attribute and split-point at random. We also propose hyperplanes as decision surfaces for HeDEx to improve the performance. A new pattern-based oversampling method is also proposed in this paper to reduce the bias towards the majority class. The patterns are detected from HeDEx and the newly generated instances are applied after a verification process using Hellinger Distance Decision Trees. Our experiments show that the proposed methods improve performance on imbalanced datasets over state-of-the-art Hellinger Distance Decision Trees.
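The skew-insensitivity of Hellinger Distance as a splitting criterion can be sketched for a binary split using one common formulation: the distance compares how each class distributes across the two children, normalised by that class's own total, so the overall class skew cancels out. The toy counts below are illustrative.

```python
import math

def hellinger_split(left_counts, right_counts):
    """Hellinger distance splitting criterion for a binary split.
    Each *_counts is (n_positive, n_negative) in that child. Class
    proportions are normalised per class, making the score insensitive
    to how rare the positive class is overall."""
    pos = left_counts[0] + right_counts[0]
    neg = left_counts[1] + right_counts[1]
    return math.sqrt(
        (math.sqrt(left_counts[0] / pos) - math.sqrt(left_counts[1] / neg)) ** 2
        + (math.sqrt(right_counts[0] / pos) - math.sqrt(right_counts[1] / neg)) ** 2
    )

# A perfect split reaches the maximum distance sqrt(2), even at 10:1 skew;
# an uninformative split (same class mix in both children) scores 0.
print(hellinger_split((50, 0), (0, 500)))     # ~1.4142
print(hellinger_split((25, 250), (25, 250)))  # 0.0
```

HeDEx's randomized trees would evaluate candidate (attribute, split-point) pairs with a score of this kind instead of Information Gain.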
ABSTRACT: Clustering in graphs aims to group vertices with similar patterns of connections. Applications include discovering communities and latent structures in graphs. Many algorithms have been proposed to find graph clusterings, but an open problem is the need for suitable comparison measures to quantitatively validate these algorithms, perform consensus clustering, and track evolving (graph) clusters over time. To date, most comparison measures have focused on comparing the vertex groupings and completely ignore the difference in the structural approximations in the clusterings, which can lead to counter-intuitive comparisons. In this paper, we propose new measures that account for differences in the approximations. We focus on comparison measures for two important graph clustering approaches, community detection and blockmodelling, and propose comparison measures that work for weighted (and unweighted) graphs.