ABSTRACT: Identifying unusual or anomalous patterns in an underlying dataset is an important but challenging task in many applications. The unsupervised anomaly detection literature has mostly focused on vectorised data. However, many applications are more naturally described using higher-order tensor representations. Approaches that vectorise tensorial data can destroy the structural information encoded in the high-dimensional space and suffer from the curse of dimensionality. In this paper we present the first unsupervised tensorial anomaly detection method, along with a randomised version of our method. Our anomaly detection method, the One-class Support Tensor Machine (1STM), is a generalisation of conventional one-class Support Vector Machines to higher-order spaces. 1STM preserves the multiway structure of tensor data, while achieving significant improvements in accuracy and efficiency over conventional vectorised methods. We then leverage the theory of nonlinear random projections to propose the Randomised 1STM (R1STM). Our empirical analysis on several real and synthetic datasets shows that our R1STM algorithm delivers accuracy comparable to or better than a state-of-the-art deep learning method and traditional kernelised approaches for anomaly detection, while being approximately 100 times faster in training and testing.
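The nonlinear random projections behind R1STM can be illustrated with random Fourier features (Rahimi and Recht), which approximate an RBF kernel with an explicit low-dimensional map. The sketch below is an illustrative stand-in on vectorised data, not the authors' tensor algorithm; the anomaly score (distance to the training mean in the random feature space) and all names are assumptions for demonstration.

```python
import numpy as np

def random_fourier_features(X, n_features=64, gamma=0.5, seed=0):
    """Approximate an RBF kernel exp(-gamma * ||x - y||^2) with an
    explicit random cosine feature map (random Fourier features)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, n_features))
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

# Toy anomaly score: distance to the training mean in the feature space.
rng = np.random.default_rng(1)
train = rng.normal(size=(200, 3))        # "normal" data
outlier = np.array([[8.0, 8.0, 8.0]])    # far from the training cloud

Z = random_fourier_features(train)
mu = Z.mean(axis=0)                      # empirical kernel mean embedding

def score(X):
    """Larger score = more anomalous under this toy criterion."""
    return np.linalg.norm(random_fourier_features(X) - mu, axis=1)
```

Because the feature map is explicit, training and scoring are linear in the number of samples, which is the source of the speed-up random projections buy over exact kernel methods; a point far from the training cloud scores higher than a typical training point.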
ABSTRACT: Big Data can mean different things to different people. The scale and challenges of Big Data are often described using three attributes, namely Volume, Velocity and Variety (the 3Vs), which reflect only some aspects of the data. In this chapter we review historical aspects of the term "Big Data" and the associated analytics. We augment the 3Vs with additional attributes of Big Data to make the characterisation more comprehensive and relevant. We show that Big Data is not just the 3Vs, but 3² Vs, that is, 9 Vs covering the fundamental motivation behind Big Data, which is to incorporate Business Intelligence (BI) based on different hypotheses or statistical models so that Big Data Analytics (BDA) can enable decision makers to make useful predictions for crucial decisions or research results. The history of Big Data has demonstrated that the most cost-effective way of performing BDA is to employ Machine Learning (ML) on Cloud Computing (CC) based infrastructure, or simply, ML + CC -> BDA. This chapter is devoted to helping decision makers by defining BDA as a solution and an opportunity to address their business needs.
ABSTRACT: The emergence of cloud computing has made dynamic provisioning of elastic capacity to applications available on demand. Cloud data centers contain thousands of physical servers hosting orders of magnitude more virtual machines that can be allocated on demand to users in a pay-as-you-go model. However, not all systems are able to scale up simply by adding more virtual machines. Therefore, it is essential, even for scalable systems, to project workloads in advance rather than using a purely reactive approach. Given the scale of modern cloud infrastructures generating real-time monitoring information, along with all the information generated by operating systems and applications, this data poses the issues of volume, velocity, and variety that are addressed by Big Data approaches. In this paper, we investigate how Big Data analytics can enhance the operation of cloud computing environments. We discuss diverse applications of Big Data analytics in clouds, open issues for enhancing cloud operations via Big Data analytics, and an architecture for anomaly detection and prevention in clouds, along with future research directions.
ABSTRACT: White matter lesions (WMLs) are small groups of dead cells that clump together in the white matter of the brain. In this paper, we propose a reliable method to automatically segment WMLs. Our method uses a novel filter to enhance the intensity of WMLs. A feature set containing enhanced intensity, anatomical and spatial information is then used to train a random forest classifier for the initial segmentation of WMLs. Following that, a reliable and robust edge-potential-function-based Markov Random Field (MRF) is proposed to obtain the final segmentation by removing false positive WMLs. Quantitative evaluation of the proposed method is performed on 24 subjects of the ENVISion study. The segmentation results are validated against manual segmentation, performed under the supervision of an expert neuroradiologist. The results show a Dice similarity index of 0.76 for severe lesion load, 0.73 for moderate lesion load and 0.61 for mild lesion load. In addition, we have compared our method with three state-of-the-art methods on 20 subjects of the Medical Image Computing and Computer Assisted Intervention Society's (MICCAI's) MS lesion challenge dataset, where our method shows better segmentation accuracy than the state-of-the-art methods. These results indicate that the proposed method can assist neuroradiologists in assessing WMLs in clinical practice.
No preview · Article · Aug 2015 · Computerized medical imaging and graphics: the official journal of the Computerized Medical Imaging Society
ABSTRACT: Monitoring and predicting traffic conditions are of utmost importance in reacting to emergency events in time and for computing the real-time shortest travel-time path. Mobile sensors, such as GPS devices and smartphones, are useful for monitoring urban traffic due to their large coverage area and ease of deployment. Many researchers have employed such sensed data to model and predict traffic conditions. To do so, we first have to address the problem of associating GPS trajectories with the road network in a robust manner. Existing methods rely on point-by-point matching to map individual GPS points to a road segment. However, GPS data is imprecise due to noise in GPS signals. GPS coordinates can have errors of several meters and, therefore, direct mapping of individual points is error prone. Acknowledging that every GPS point is potentially noisy, we propose a radically different approach to overcome inaccuracy in GPS data. Instead of focusing on a point-by-point approach, our proposed method considers the set of relevant GPS points in a trajectory that can be mapped together to a road segment. This clustering approach gives us a macroscopic view of the GPS trajectories even under very noisy conditions. Our method clusters points based on the direction of movement as a spatial-linear cluster, ranks the possible route segments in the graph for each group, and searches for the best combination of segments as the overall path for the given set of GPS points. Through extensive experiments on both synthetic and real datasets, we demonstrate that, even with highly noisy GPS measurements, our proposed algorithm outperforms state-of-the-art methods in terms of both accuracy and computational cost.
No preview · Article · Jul 2015 · International Journal of Geographical Information Science
ABSTRACT: In this paper a reliable and robust method is presented for the quantification of Focal Arteriolar Narrowing (FAN), a precursor for hypertension, stroke and other cardiovascular diseases. Our contribution is a novel edge-based retinal blood vessel segmentation technique that is very effective in low-contrast retinal images. In addition, we developed a robust and reliable measurement technique to quantify FAN. For initial results and quantitative evaluation, we evaluate the proposed method on a dataset of 53 manually graded vessel segments. The experimental results indicate a strong correlation between the computed FAN values and the expert grading (Spearman coefficient of 0.76, P < 0.0001). The results also show that the system can detect healthy and FAN-affected cases with an accuracy of 93%. This demonstrates the reliability of the proposed method for automatic focal narrowing assessment. The quantitative measurements provided by the system may help to establish a more reliable link between FAN and known systemic and eye diseases, which will be investigated further.
ABSTRACT: Recent studies show that cerebral White Matter Lesion (WML) is related to cerebrovascular diseases, cardiovascular diseases, dementia and psychiatric disorders. Manual segmentation of WML is not appropriate for long-term longitudinal studies because it is time consuming and shows high intra- and inter-rater variability. In this paper, a fully automated segmentation method is utilized to segment WML from brain Magnetic Resonance Imaging (MRI). The segmentation method uses a combination of a global neighbourhood given contrast feature-based Random Forest (RF) classifier and a Markov Random Field (MRF) to segment WML. To remove false positive lesions we use a rule-based morphological post-processing operation. Quantitative evaluation of the proposed method was performed on 24 subjects of the ENVISion study. The segmentation results were validated against manual segmentation, performed by an experienced radiologist, and were compared to a recently published WML segmentation method. The results show a Dice similarity index of 0.75 for high lesion load, 0.71 for medium lesion load and 0.60 for low lesion load.
ABSTRACT: Graphs are a powerful representation of relational data, such as social and biological networks. Often, these entities form groups and are organised according to a latent structure. However, these groupings and structures are generally unknown and can be difficult to identify. Graph clustering is an important type of approach used to discover these vertex groups and the latent structure within graphs. One approach to graph clustering is non-negative matrix factorisation. However, the formulations of existing factorisation approaches can be overly relaxed; their groupings and results are consequently difficult to interpret, may fail to discover the true latent structure and groupings, and may converge to extreme solutions. In this paper, we propose a new formulation of the graph clustering problem that results in clusterings that are easy to interpret. Combined with a novel algorithm, the clusterings are also more accurate than those of state-of-the-art algorithms on both synthetic and real datasets.
ABSTRACT: Scientific workflows are used to model applications of high-throughput computation and complex large-scale data analysis. In recent years, Cloud computing has been fast evolving as the target platform for such applications among researchers. Furthermore, Cloud providers have pioneered new pricing models that allow users to provision resources and use them efficiently, with significant cost reductions. In this paper, we propose a scheduling algorithm that schedules tasks on Cloud resources using two different pricing models (spot and on-demand instances) to reduce the cost of execution whilst meeting the workflow deadline. The proposed algorithm is fault tolerant against the premature termination of spot instances and robust against performance variations of Cloud resources. Experimental results demonstrate that our heuristic reduces execution cost by up to 70% compared with using only on-demand instances.
Preview · Article · Dec 2014 · Procedia Computer Science
ABSTRACT: Altered expression profiles of microRNAs (miRNAs) are linked to many diseases, including lung cancer. miRNA expression profiling is reproducible and miRNAs are very stable. These characteristics make miRNAs ideal biomarker candidates.
This work aims to detect 2- and 3-miRNA groups, together with specific expression ranges of these miRNAs, to form simple linear discriminant rules for biomarker identification and biological interpretation. Our method is based on a novel committee of decision trees that derives 2- and 3-miRNA 100%-frequency rules. This method is applied to a dataset of lung miRNA expression profiles of 61 squamous cell carcinoma (SCC) samples and 10 normal tissue samples. A distance separation technique is used to select the most reliable rules, which are then evaluated on a large independent dataset.
We obtained four 2-miRNA and three 3-miRNA top-ranked rules. One important rule is: if the expression level of miR-98 is above 7.356 and the expression level of miR-205 is below 9.601 (log2 quantile-normalised MirVan miRNA Bioarray signals), then the sample is normal rather than cancerous, with specificity and sensitivity both 100%. The classification performance of our best miRNA rules remarkably outperformed that of randomly selected miRNA rules. Our data analysis also showed that miR-98 and miR-205 have two common predicted target genes, FZD3 and RPS6KA3, which are genes associated with carcinoma according to the Online Mendelian Inheritance in Man (OMIM) database. We also found that most of the chromosomal loci of these miRNAs have a high frequency of genomic alteration in lung cancer. On the independent dataset (with balanced controls), the three miRNAs miR-126, miR-205 and miR-182 from our best rule can separate the two classes of samples with an accuracy of 84.49%, sensitivity of 91.40% and specificity of 77.14%.
Our results indicate that rule discovery followed by distance separation is a powerful computational method to identify reliable miRNA biomarkers. The visualization of the rules and the clear separation between the normal and cancer samples by our rules will help biology experts with their analysis and biological interpretation.
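The top-ranked 2-miRNA rule quoted above is simple enough to restate directly in code. The thresholds come from the abstract; the function name, argument names, and the "abstain" behaviour for samples the rule does not cover are assumptions for illustration.

```python
def mirna_rule(mir98: float, mir205: float) -> str:
    """Top-ranked 2-miRNA rule from the abstract, on log2 quantile-normalised
    MirVan miRNA Bioarray signals. The published rule only asserts the
    'normal' case; what happens otherwise is not specified, so we abstain."""
    if mir98 > 7.356 and mir205 < 9.601:
        return "normal"
    return "rule does not fire"
```

For example, `mirna_rule(8.0, 9.0)` returns `"normal"` because both threshold conditions hold, while `mirna_rule(6.0, 10.2)` abstains.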
ABSTRACT: It is well known that processing big graph data can be costly on the Cloud. Processing big graph data involves complex and repeated iterations that raise challenges such as parallel memory bottlenecks, deadlocks, and inefficiency. To tackle these challenges, we propose a novel technique for effectively processing big graph data on the Cloud. Specifically, the big data is compressed using its spatiotemporal features on the Cloud. By exploiting spatial data correlation, we partition a graph dataset into clusters. Within a cluster, the workload can be shared by inference based on time series similarity. By exploiting temporal correlation, temporal data compression is conducted within each time series, i.e., on each graph edge. A novel data-driven scheduling scheme is also developed for data processing optimization. The experimental results demonstrate that the spatiotemporal compression and scheduling achieve significant performance gains in terms of data size and data fidelity loss.
No preview · Article · Dec 2014 · Journal of Computer and System Sciences
ABSTRACT: Mining GPS trajectories of moving vehicles has led to many research directions, such as traffic modeling and driving prediction. An important challenge is how to map GPS traces to a road network accurately under noisy conditions. However, to the best of our knowledge, there is no existing work that first simplifies a trajectory to improve map matching. In this paper we propose three trajectory simplification algorithms that can deal with both offline and online trajectory data. We use weighting functions to incorporate spatial knowledge, such as segment lengths and turning angles, into our simplification algorithms. In addition, we measure the noise degree of a GPS point based on its spatio-temporal relationship to its neighbors. The effectiveness of our algorithms is comprehensively evaluated on real trajectory datasets with varying noise levels and sampling rates. Our evaluation shows that under highly noisy conditions, our proposed algorithms considerably improve map matching accuracy and reduce computational costs compared to state-of-the-art methods.
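One way to realise the weighting idea described here (combining turning angles and segment lengths) is a rank-and-keep simplifier: score each interior point and retain the highest-weighted ones. This is a minimal offline sketch under assumed weight coefficients, not a reconstruction of the paper's three algorithms.

```python
import math

def turning_angle(p0, p1, p2):
    """Absolute change of heading (radians) at point p1."""
    a1 = math.atan2(p1[1] - p0[1], p1[0] - p0[0])
    a2 = math.atan2(p2[1] - p1[1], p2[0] - p1[0])
    return abs(math.atan2(math.sin(a2 - a1), math.cos(a2 - a1)))

def simplify(traj, keep_ratio=0.5, w_angle=1.0, w_len=0.1):
    """Keep the endpoints plus the interior points with the highest weights,
    where a point's weight combines its turning angle and adjacent segment
    lengths (w_angle and w_len are illustrative coefficients)."""
    if len(traj) <= 2:
        return traj
    weights = []
    for i in range(1, len(traj) - 1):
        p0, p1, p2 = traj[i - 1], traj[i], traj[i + 1]
        length = math.dist(p0, p1) + math.dist(p1, p2)
        weights.append((w_angle * turning_angle(p0, p1, p2) + w_len * length, i))
    k = max(0, int(keep_ratio * (len(traj) - 2)))
    keep = {i for _, i in sorted(weights, reverse=True)[:k]}
    return [p for i, p in enumerate(traj)
            if i == 0 or i == len(traj) - 1 or i in keep]

# An L-shaped track: keeping one interior point retains the sharp corner.
traj = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (2.0, 2.0)]
```

The design mirrors the abstract's intuition: straight, evenly spaced points carry little information for map matching, while sharp turns anchor the trajectory to the road network and so are preserved first.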
ABSTRACT: We explore the idea that by modeling a financial time series at regular points in space (i.e. price) rather than regular points in time, more predictive power can be extracted from the time series. We term this concept of modeling a time series at regular points in space 'volatility homogenisation'. Our hypothesis is that if we select the correct quantum in terms of regular steps in space, we remove noise that would normally interfere with prediction methods, and thus uncover the underlying patterns in the time series. Furthermore, this technique can also be viewed as a way of decoupling spatial and temporal dependence, which again removes unnecessary noise. We apply this decomposition to nine different financial time series and then apply support vector classification to make our predictions on the decomposed time series. Our results show that in the majority of cases, this technique yields better predictions than applying techniques such as support vector regression and Autoregressive Integrated Moving Average (ARIMA) models to data sampled at regular points in time. The contribution of this paper is that it demonstrates the efficacy of this new methodology, 'volatility homogenisation'.
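The "regular points in space" resampling can be sketched very compactly: emit a new observation only when the price has moved by at least one quantum from the last emitted level. This assumes a fixed, user-chosen price step; the paper's procedure for selecting the correct quantum is not reproduced here.

```python
def homogenise(prices, step):
    """Resample a price series at regular points in *price* ('space'):
    keep a point each time the price has moved by at least `step`
    from the last kept level, discarding intermediate wiggles."""
    out = [prices[0]]
    for p in prices[1:]:
        if abs(p - out[-1]) >= step:
            out.append(p)
    return out

# Small oscillations within the 0.5 quantum are dropped as noise:
series = homogenise([100, 100.2, 100.8, 100.1, 99.4, 99.5], 0.5)
```

The resulting series is irregular in time but regular in price moves, which is exactly the decoupling of spatial and temporal dependence the abstract describes; a classifier can then be trained on the direction of successive quantum moves.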
ABSTRACT: Retinal arteriovenous (AV) nicking is a precursor for hypertension, stroke and other cardiovascular diseases. In this paper, an effective method is proposed for the analysis of retinal venular widths to automatically classify the severity level of AV nicking. We use a combination of intensity and edge information of the vein to compute its widths. The widths at various sections of the vein near the crossover point are then used to train a random forest classifier to classify the severity of AV nicking. We analyzed 47 color retinal images obtained from two population-based studies for quantitative evaluation of the proposed method. We compare the detection accuracy of our method with a recently published four-class AV nicking classification method. Our proposed method shows 64.51% classification accuracy, in contrast to the reported classification accuracy of 49.46% for the state-of-the-art method.
ABSTRACT: Imbalanced dataset classification is a challenging problem, since many classifiers are sensitive to the class distribution and their predictions are biased towards the majority class. Hellinger Distance has been proven to be skew-insensitive, and decision trees that employ Hellinger Distance as a splitting criterion have shown better performance than decision trees based on Information Gain. We propose a new decision tree induction classifier (HeDEx) based on Hellinger Distance that builds randomized ensemble trees, selecting both the attribute and the split point at random. We also propose hyperplanes as decision surfaces for HeDEx to improve performance. A new pattern-based oversampling method is also proposed in this paper to reduce the bias towards the majority class. The patterns are detected from HeDEx, and the newly generated instances are applied after a verification process using Hellinger Distance Decision Trees. Our experiments show that the proposed methods improve performance on imbalanced datasets over state-of-the-art Hellinger Distance Decision Trees.
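The skew-insensitive splitting criterion referred to here is, in Cieslak and Chawla's Hellinger Distance Decision Trees, computed from the *within-class proportions* of examples sent to each branch, which is why class imbalance does not distort it. A sketch for a binary split, with made-up argument names:

```python
import math

def hellinger_split(pos_left, pos_right, neg_left, neg_right):
    """Hellinger distance between the class-conditional distributions of a
    binary split. Each class is normalised by its own total, so the score
    depends only on how each class splits, not on the class ratio.
    Ranges from 0 (uninformative split) to sqrt(2) (perfect separation)."""
    pos = pos_left + pos_right
    neg = neg_left + neg_right
    return math.sqrt(
        (math.sqrt(pos_left / pos) - math.sqrt(neg_left / neg)) ** 2
        + (math.sqrt(pos_right / pos) - math.sqrt(neg_right / neg)) ** 2
    )
```

Even with 10 positives against 1,000 negatives, a split that isolates all positives on the left scores the maximum sqrt(2), whereas an Information Gain criterion would be dominated by the majority class.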
ABSTRACT: Dynamic resource provisioning and the notion of seemingly unlimited resources are rapidly attracting scientific workflows to Cloud computing. Existing works on workflow scheduling in the context of Clouds focus on either deadline or cost optimization, ignoring the necessity for robustness. Robust scheduling that handles performance variations of Cloud resources and failures in the environment is essential in the context of Clouds. In this paper, we present a robust scheduling algorithm with resource allocation policies that schedules workflow tasks on heterogeneous Cloud resources while trying to minimize the total elapsed time (makespan) and the cost. Our results show that the proposed resource allocation policies provide a robust and fault-tolerant schedule while minimizing makespan. The results also show that, as the budget increases, our policies increase the robustness of the schedule.
ABSTRACT: Retinal imaging can facilitate the measurement and quantification of subtle variations and abnormalities in retinal vasculature. Retinal vascular imaging may thus offer potential as a noninvasive research tool to probe the role and pathophysiology of the microvasculature, and as a cardiovascular risk prediction tool. To perform this, an accurate method must be provided that is statistically sound and repeatable. This paper presents the methodology of such a system, which assists physicians in measuring vessel caliber (i.e., diameter or width) from digitized fundus photographs. The system uses texture and edge information to measure and quantify vessel caliber. Graphical user interfaces are developed to allow retinal image graders to select an individual vessel area, which automatically returns the vessel calibers for noisy images. The accuracy of the method is validated against calibers measured by graders and an existing method. The system provides highly accurate vessel caliber measurement that is also reproducible with high consistency.
Full-text · Article · Jan 2014 · Computers in Biology and Medicine
ABSTRACT: Infrastructure-as-a-Service cloud providers offer diverse purchasing options and pricing plans, namely on-demand, reservation, and spot market plans. This allows them to efficiently target a variety of customer groups with distinct preferences and to generate more revenue accordingly. An important consequence of this diversification, however, is that it introduces a nontrivial optimization problem related to the allocation of the provider's available data center capacity to each pricing plan. The complexity of the problem follows from the different levels of revenue generated per unit of capacity sold, and the different commitments consumers and providers make when resources are allocated under a given plan. In this work, we address the novel problem of maximizing revenue through an optimization of capacity allocation to each pricing plan by means of admission control for reservation contracts, in a setting where the aforementioned plans are jointly offered to customers. We devise both an optimal algorithm based on a stochastic dynamic programming formulation and two heuristics that trade off optimality and computational complexity. Our evaluation, which relies on an adaptation of a large-scale real-world workload trace from Google, shows that our algorithms can significantly increase revenue compared to an allocation without capacity control, provided that sufficient resource contention is present in the system. In addition, we show that our heuristics effectively allow for online decision making, and we quantify the revenue loss caused by the assumptions made to render the optimization problem tractable.
Full-text · Article · Jan 2014 · IEEE Transactions on Cloud Computing