Archived project

BigData@BTH - Scalable resource-efficient systems for big data analytics

Updates: 20
Recommendations: 3
Followers: 44
Reads: 644

Project log

Håkan Grahn
added 2 research items
Data has become an integral part of our society in the past years, arriving faster and in larger quantities than before. Traditional clustering algorithms rely on the availability of entire datasets to model them correctly and efficiently. Such requirements are not possible in the data stream clustering scenario, where data arrives and needs to be analyzed continuously. This paper proposes a novel evolutionary clustering algorithm, entitled EvolveCluster, capable of modeling evolving data streams. We compare EvolveCluster against two other evolutionary clustering algorithms, PivotBiCluster and Split-Merge Evolutionary Clustering, by conducting experiments on three different datasets. Furthermore, we perform additional experiments on EvolveCluster to further evaluate its capabilities on clustering evolving data streams. Our results show that EvolveCluster manages to capture evolving data stream behaviors and adapts accordingly.
Håkan Grahn
added a research item
Chromatic aberration is an error that occurs in color images due to the fact that camera lenses refract light of different wavelengths at different angles. The common approach today to correct the error is to use a lookup table for each camera-lens combination, e.g., as in Adobe Photoshop Lightroom or DxO Optics Pro. In this paper, we propose a method that corrects the chromatic aberration error without any prior knowledge of the camera-lens combination, and does the correction already on the Bayer data, i.e., before the raw image data is interpolated to an RGB image. We evaluate our method in comparison to DxO Optics Pro, a state-of-the-art tool based on lookup tables, using 25 test images and the variance of the color differences (VCD) metric. The results show that our blind method has a similar error correction performance as DxO Optics Pro, but without prior knowledge of the camera-lens setup.
Abbas Cheddad
added a research item
This paper presents a digital image dataset of historical handwritten birth records stored in the archives of several parishes across Sweden, together with the corresponding metadata that supports the evaluation of document analysis algorithms' performance. The dataset is called SHIBR (the Swedish Historical Birth Records). The contribution of this paper is twofold. First, we believe it is the first and the largest Swedish dataset of its kind provided as open access (15,000 high-resolution colour images of the era between 1800 and 1840). We also perform some data mining of the dataset to uncover some statistics and facts that might be of interest and use to genealogists. Second, we provide a comprehensive survey of contemporary datasets in the field that are open to the public along with a compact review of word spotting techniques. The word transcription file contains 17 columns of information pertaining to each image (e.g., child's first name, birth date, date of baptism, father's first/last name, mother's first/last name, death records, town, job title of the father/mother, etc.). Moreover, we evaluate some deep learning models, pre-trained on two other renowned datasets, for word spotting in SHIBR. However, our dataset proved challenging due to the unique handwriting style. Therefore, the dataset could also be used for competitions dedicated to a large set of document analysis problems, including word spotting.
Håkan Grahn
added a research item
Recently, machine learning researchers have been designing algorithms that can run on embedded and mobile devices, which introduces additional constraints compared to traditional algorithm design approaches. One of these constraints is energy consumption, which directly translates to battery capacity for these devices. Streaming algorithms, such as the Very Fast Decision Tree (VFDT), are designed to run on such devices due to their high velocity and low memory requirements. However, they have not been designed with an energy efficiency focus. This paper addresses this challenge by presenting the nmin adaptation method, which reduces the energy consumption of the VFDT algorithm with only minor effects on accuracy. nmin adaptation allows the algorithm to grow faster in those branches where there is more confidence to create a split, and delays the split on the less confident branches. This removes unnecessary computations related to checking for splits but maintains similar levels of accuracy. We have conducted extensive experiments on 29 public datasets, showing that the VFDT with nmin adaptation consumes up to 31% less energy than the original VFDT, and up to 96% less energy than the CVFDT (VFDT adapted for concept drift scenarios), trading off up to 1.7 percent of accuracy.
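A minimal sketch of the idea behind nmin adaptation, not the authors' implementation (which lives inside a Hoeffding-tree learner such as the VFDT): given the observed gain gap between the two best split attributes at a leaf, estimate how many instances are needed before the Hoeffding bound can separate them, and only re-check for a split after that many arrivals. The function names, the floor value and the default confidence below are illustrative assumptions.

```python
import math

def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Hoeffding bound epsilon after observing n instances."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def adapt_nmin(delta_g: float, value_range: float = 1.0,
               delta: float = 1e-7, nmin_floor: int = 200) -> int:
    """Estimate how many instances a leaf should observe before the next
    split check: the smallest n for which the Hoeffding bound drops below
    the gain gap delta_g between the two best split candidates."""
    if delta_g <= 0.0:                      # no separation between candidates yet
        return nmin_floor
    n_needed = value_range ** 2 * math.log(1.0 / delta) / (2.0 * delta_g ** 2)
    return max(nmin_floor, math.ceil(n_needed))

# A confident leaf (large gap) re-checks soon; an unconfident one waits longer.
print(adapt_nmin(0.20), adapt_nmin(0.02))
```

Branches with a large gap therefore split quickly, while low-confidence branches skip many redundant split checks, which is where the energy saving comes from.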
Håkan Grahn
added a research item
In this study, we propose a higher order mining approach that can be used for the analysis of real-world datasets. The approach can be used to monitor and identify the deviating operational behaviour of the studied phenomenon in the absence of prior knowledge about the data. The proposed approach consists of several different data analysis techniques, such as sequential pattern mining, clustering analysis, consensus clustering and the minimum spanning tree (MST). Initially, a clustering analysis is performed on the extracted patterns to model the behavioural modes of the studied phenomenon for a given time interval. The generated clustering models, which correspond to every two consecutive time intervals, can further be assessed to determine changes in the monitored behaviour. In cases in which significant differences are observed, further analysis is performed by integrating the generated models into a consensus clustering and applying an MST to identify deviating behaviours. The validity and potential of the proposed approach are demonstrated on a real-world dataset originating from a network of district heating (DH) substations. The obtained results show that our approach is capable of detecting deviating and sub-optimal behaviours of DH substations.
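The MST step of the pipeline can be illustrated with a small sketch. This is an interpretation of the description above, not the paper's exact procedure: once the clustering models of two consecutive intervals are integrated into a consensus clustering, a minimum spanning tree is built over the cluster centroids and unusually long edges are flagged as candidate deviating behaviours. The z-score threshold and the function name are assumptions.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def long_mst_edges(centroids: np.ndarray, z_thresh: float = 2.0):
    """Build an MST over cluster centroids and return edges whose length is
    unusually large, hinting at deviating behavioural modes."""
    dist = squareform(pdist(centroids))            # dense pairwise distances
    mst = minimum_spanning_tree(dist).toarray()    # MST edge weights (upper triangle)
    i, j = np.nonzero(mst)
    lengths = mst[i, j]
    z = (lengths - lengths.mean()) / (lengths.std() + 1e-12)
    return [(int(a), int(b), float(l))
            for a, b, l, s in zip(i, j, lengths, z) if s > z_thresh]
```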
Abbas Cheddad
added an update
Appointed by PRL (Pattern Recognition Letters) as lead guest editor for a special section, “Topical Collection on Intelligent Systems and Pattern Recognition (ISPR’2020)” (submission open to conference awardees only).
 
Abbas Cheddad
added an update
Our project proposal has been granted:
DocPRESERV: Preserving & Processing Historical Document Images with Artificial Intelligence
 
Abbas Cheddad
added a research item
Nowadays, the field of multimedia retrieval systems has attracted a lot of attention as it helps retrieve information more efficiently and accelerates daily tasks. Within this context, image processing techniques such as layout analysis and word recognition play an important role in transcribing content in printed or handwritten documents into digital data that can be further processed. This transcription procedure is called document digitization. This work stems from an industrial need, namely, a Swedish company (Arkiv Digital AB) has scanned more than 80 million pages of Swedish historical documents from all over the country and there is a high demand to transcribe the contents into digital data. Such a process starts by figuring out the text location which, seen from another angle, is merely table layout analysis. In this study, the aim is to reveal the most effective solution to extract document layout w.r.t. Swedish handwritten historical documents, which are characterized by their tabular forms. In short, the outcomes of public tools (i.e., Breuel’s OCRopus method), traditional image processing techniques (e.g., Hessian/Gabor filters, Hough transform, histograms of oriented gradients (HOG) features), and machine learning techniques (e.g., support vector machines, transfer learning) are studied and compared. Results show that the existing OCR tool cannot carry out the layout analysis task on our Swedish historical handwritten documents. Traditional image processing techniques are mildly capable of extracting the general table layout in these documents, but the accuracy is enhanced by introducing machine learning techniques. The best performing approach will be used in our future document mining research to allow for the development of scalable resource-efficient systems for big data analytics.
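As a concrete illustration of the traditional image-processing baseline mentioned above, the sketch below extracts candidate table rulings with Canny edges and a probabilistic Hough transform. The thresholds and the function name are illustrative assumptions; this is not the study's actual pipeline.

```python
import cv2
import numpy as np

def detect_table_rulings(gray: np.ndarray):
    """Classical table-layout baseline: Canny edges followed by a
    probabilistic Hough transform, keeping near-horizontal and
    near-vertical segments as candidate table rulings.
    `gray` is an 8-bit grayscale page image."""
    edges = cv2.Canny(gray, 50, 150, apertureSize=3)
    lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=120,
                            minLineLength=gray.shape[1] // 4, maxLineGap=10)
    horizontal, vertical = [], []
    if lines is not None:
        for x1, y1, x2, y2 in lines[:, 0]:
            dx, dy = abs(x2 - x1), abs(y2 - y1)
            if dy <= dx // 10:
                horizontal.append((x1, y1, x2, y2))
            elif dx <= dy // 10:
                vertical.append((x1, y1, x2, y2))
    return horizontal, vertical
```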
Håkan Grahn
added 3 research items
This paper compares three different word image representations as a basis for label-free sample selection for word spotting in historical handwritten documents. These representations are a temporal pyramid representation based on pixel counts, a graph-based representation, and a pyramidal histogram of characters (PHOC) representation predicted by a PHOCNet trained on synthetic data. We show that the PHOC representation can help to reduce the amount of required training samples by up to 69% depending on the dataset, if it is learned iteratively in an active-learning-like fashion. While this works for larger datasets containing about 1,700 images, for smaller datasets with 100 images, we find that the temporal pyramid and the graph representation perform better.
Abbas Cheddad
added a research item
A unique member of the power transformation family is known as the Box-Cox transformation. The latter can be seen as a mathematical operation that leads to finding the optimum lambda (λ) value that maximizes the log-likelihood function to transform data to a normal distribution and to reduce heteroscedasticity. In data analytics, a normality assumption underlies a variety of statistical test models. This technique, however, is best known in statistical analysis for handling one-dimensional data. Herein, this paper revolves around the utility of such a tool as a pre-processing step to transform two-dimensional data, namely digital images, and to study its effect. Moreover, to reduce time complexity, it suffices to estimate the parameter lambda in real time for large two-dimensional matrices by merely considering their probability density function as a statistical inference of the underlying data distribution. We compare the effect of this lightweight Box-Cox transformation with well-established state-of-the-art low-light image enhancement techniques. We also demonstrate the effectiveness of our approach through several test-bed data sets for generic improvement of the visual appearance of images and for ameliorating the performance of a colour pattern classification algorithm as an example application. Results with and without the proposed approach are compared using the AlexNet (transfer deep learning) pretrained model. To the best of our knowledge, this is the first time that the Box-Cox transformation is extended to digital images by exploiting histogram transformation.
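For reference, the Box-Cox transform is y = (x^λ − 1)/λ for λ ≠ 0 and y = ln x for λ = 0, with λ chosen to maximize the log-likelihood. The sketch below applies it to a grayscale image; unlike the paper, which estimates λ from the histogram for speed, this sketch runs the full maximum-likelihood estimate over all pixels, and the rescaling back to 8-bit is an illustrative choice.

```python
import numpy as np
from scipy import stats

def boxcox_image(image: np.ndarray):
    """Box-Cox transform of a grayscale image with lambda estimated by
    maximum likelihood; the result is rescaled to 8-bit for display."""
    pixels = image.astype(np.float64).ravel() + 1.0   # Box-Cox requires x > 0
    transformed, lam = stats.boxcox(pixels)           # MLE estimate of lambda
    out = transformed.reshape(image.shape)
    out = (out - out.min()) / (out.max() - out.min() + 1e-12)
    return (out * 255).astype(np.uint8), lam
```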
Abbas Cheddad
added an update
Siva Krishna Dasari, Abbas Cheddad, Jonatan Palmquist, "Melt-pool Defects Classification for Additive Manufactured Components in Aerospace Use-case," accepted for oral presentation at the 7th Intl. Conference on Soft Computing & Machine Intelligence (ISCMI 2020), IEEE, Stockholm, Sweden, November 14-15, 2020.
 
Håkan Grahn
added a research item
We propose a cluster analysis approach for organizing, visualizing and understanding households' electricity consumption data. We initially partition the consumption data into a number of clusters with similar daily electricity consumption profiles. The centroids of each cluster can be seen as representative signatures of a household's electricity consumption behaviors. We evaluate the proposed approach by conducting a number of experiments on electricity consumption data of ten selected households. Our results show that the approach is suitable for data analysis, understanding and creating electricity consumption behavior models.
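The abstract does not name a particular clustering algorithm, so the sketch below uses k-means as a stand-in: each day becomes a 24-dimensional hourly profile, and the cluster centroids serve as the household's consumption signatures. The cluster count and the synthetic data are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def consumption_signatures(daily_profiles: np.ndarray, n_clusters: int = 4):
    """Cluster daily load profiles (one row = 24 hourly readings) and return
    the centroids as representative consumption signatures of a household."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = km.fit_predict(daily_profiles)
    return km.cluster_centers_, labels

# Synthetic example: one year of hourly consumption for a single household.
rng = np.random.default_rng(0)
year = rng.gamma(shape=2.0, scale=0.5, size=(365, 24))
signatures, day_labels = consumption_signatures(year)
```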
Abbas Cheddad
added an update
S. K. Dasari, A. Cheddad, P. Andersson, "Predictive Modelling to Support Sensitivity Analysis for Robust Design in Aerospace Engineering." Accepted for publication in Structural and Multidisciplinary Optimization, 2019, Springer Berlin Heidelberg. DOI: 10.1007/s00158-019-02467-5.
 
Håkan Grahn
added 2 research items
Involving humans in the learning process of a machine learning algorithm can have many advantages, ranging from establishing trust in a particular model, to adding personalization capabilities, to reducing labeling efforts. While these approaches are commonly summarized under the term interactive machine learning (iML), no unambiguous definition of iML exists to clearly define this area of research. In this position paper, we discuss the shortcomings of current definitions of iML and propose and define the term guided machine learning (gML) as an alternative.
Energy consumption has been widely studied in the computer architecture field for decades. While the adoption of energy as a metric in machine learning is emerging, the majority of research is still primarily focused on obtaining high levels of accuracy without any computational constraint. We believe that one of the reasons for this lack of interest is researchers' unfamiliarity with approaches to evaluate energy consumption. To address this challenge, we present a review of the different approaches to estimate energy consumption in general and in machine learning applications in particular. Our goal is to provide useful guidelines to the machine learning community, giving them the fundamental knowledge to use and build specific energy estimation methods for machine learning algorithms. We also present the latest software tools that give energy estimation values, together with two use cases that enhance the study of energy consumption in machine learning.
Abbas Cheddad
added a research item
Diabetic retinopathy is the most common cause of new cases of blindness in people of working age. Early diagnosis is the key to slowing the progression of the disease, thus preventing blindness. Retinal fundus images form an important basis for judging these retinal diseases. To the best of our knowledge, no prior studies have scrutinized the predictive power of the different compositions of retinal images using deep learning. This paper investigates whether there exists a specific region that could assist in better prediction of the retinopathy disease, i.e., it aims to find the region in fundus images that best boosts the prediction power of models for retinopathy classification. To this end, with image segmentation techniques, the fundus image is divided into three different segments, namely, the optic disc, the blood vessels, and the other regions (regions other than blood vessels and optic disc). These regions are then contrasted against the performance of original fundus images. A convolutional neural network as well as transfer deep learning with state-of-the-art pre-trained models (i.e., AlexNet, GoogleNet, ResNet50, VGG19) are deployed. We report the average of ten runs for each model. Different machine learning evaluation metrics are used. The other-regions segment reveals more predictive power than the original fundus image, especially when using AlexNet/ResNet50. URL: https://ieeexplore.ieee.org/document/8936078
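A hedged sketch of the transfer-learning setup described above (torchvision ≥ 0.13 API): reuse AlexNet's ImageNet features and retrain only the final layer on one of the fundus-image segments. The freezing scheme, the binary output and the function name are assumptions; the paper evaluates GoogleNet, ResNet50 and VGG19 in the same way.

```python
import torch.nn as nn
from torchvision import models

def fundus_segment_classifier(n_classes: int = 2) -> nn.Module:
    """Transfer-learning baseline: ImageNet-pretrained AlexNet with its
    convolutional features frozen and a new final classification layer,
    to be trained on a chosen fundus segment (optic disc, vessels, or
    the remaining regions)."""
    model = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)
    for param in model.features.parameters():
        param.requires_grad = False           # keep the pretrained features
    model.classifier[6] = nn.Linear(4096, n_classes)
    return model
```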
Abbas Cheddad
added an update
Visit to our school: a delegation from Shanghai Polytechnic University (SSU) visited our university today, where we exchanged overviews of our research teams. I presented the different projects we are running within the BigData@BTH profile, where we exploit machine learning to address contemporary issues. We established an initial mutual agreement to collaborate on project funding, student exchange, etc.
 
Abbas Cheddad
added an update
Wu Qian and Abbas Cheddad. "Segmentation-based Deep Learning Fundus Image Analysis," Accepted for oral presentation at the 9th International Conference on Image Processing Theory, Tools and Applications IPTA 2019. Nov 6-9, 2019, Istanbul, Turkey.
 
Håkan Grahn
added a research item
In this study we apply clustering techniques for analyzing and understanding households’ electricity consumption data. The knowledge extracted by this analysis is used to create a model of normal electricity consumption behavior for each particular household. Initially, the household’s electricity consumption data are partitioned into a number of clusters with similar daily electricity consumption profiles. The centroids of the generated clusters can be considered as representative signatures of a household’s electricity consumption behavior. The proposed approach is evaluated by conducting a number of experiments on electricity consumption data of ten selected households. The obtained results show that the proposed approach is suitable for data organizing and understanding, and can be applied for modeling electricity consumption behavior on a household level.
Håkan Grahn
added 5 research items
This paper proposes a preprocessing stage to augment the bank of features that one can retrieve from binary images to help increase the accuracy of pattern recognition algorithms. To this end, by applying successive dilations to a given shape, we can capture a new dimension of its vital characteristics which we term hereafter: the shape growth pattern (SGP). This work investigates the feasibility of such a notion and also builds upon our prior work on structure preserving dilation using Delaunay triangulation. Experiments on two public data sets are conducted, including comparisons to existing algorithms. We deployed two renowned machine learning methods into the classification process (i.e., convolutional neural network-CNN-and random forests-RF-) since they perform well in pattern recognition tasks. The results show a clear improvement of the proposed approach's classification accuracy (especially for data sets with limited training samples) as well as robustness against noise when compared to existing methods.
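A minimal sketch of the successive-dilation idea using plain binary dilation: the shape is dilated repeatedly and the growth of its foreground area is recorded as an extra feature vector. The paper additionally builds on the authors' structure-preserving Delaunay dilation, which this sketch does not reproduce; the step count and normalisation are assumptions.

```python
import numpy as np
from scipy import ndimage

def shape_growth_pattern(binary_shape: np.ndarray, steps: int = 10) -> np.ndarray:
    """Describe a binary shape by how its foreground area grows under
    successive dilations; the normalised growth curve is appended to the
    usual bank of features before classification."""
    current = binary_shape.astype(bool)
    growth = [current.sum()]
    for _ in range(steps):
        current = ndimage.binary_dilation(current)
        growth.append(current.sum())
    growth = np.asarray(growth, dtype=np.float64)
    return growth / max(growth[0], 1.0)       # normalise by the original area
```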
Abbas Cheddad
added 2 research items
Historical documents essentially consist of handwritten texts that exhibit a variety of perceptual environment complexities. The cursive and connected nature of text lines on one hand, and the presence of artefacts and noise on the other, hinder achieving plausible results using current image processing algorithms. In this paper, we present a new algorithm, which we term QTE (Query by Text Example), that allows for training-free and binarisation-free pattern spotting in scanned handwritten historical documents. Our algorithm gives promising results on a subset of our database, revealing a ∼83% success rate in locating word patterns supplied by the user.
Mathematical morphology has been of great significance to several scientific fields. Dilation, as one of the fundamental operations, has been very much reliant on common methods based on set theory and on using specific shaped structuring elements to morph binary blobs. We hypothesised that by performing morphological dilation while exploiting the geometric relationship between dot patterns, one can gain some advantages. The Delaunay triangulation was our choice to examine the feasibility of such a hypothesis due to its favourable geometric properties. We compared our proposed algorithm to existing methods and it became apparent that Delaunay-based dilation has the potential to emerge as a powerful tool in preserving object structure and elucidating the influence of noise. Additionally, defining a structuring element is no longer needed in the proposed method, and the dilation is adaptive to the topology of the dot patterns. We assessed the property of object structure preservation by using common measurement metrics. We also demonstrated this property through handwritten digit classification using HOG descriptors extracted from dilated images of different approaches and trained using Support Vector Machines. The confusion matrix shows that our algorithm has the best accuracy estimate in 80% of the cases. In both experiments, our approach shows a consistently improved performance over other methods, which advocates for the suitability of the proposed method.
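A rough sketch of how a Delaunay triangulation can drive a structure-aware dilation of a dot pattern. This is an interpretation of the idea above, not the authors' algorithm: triangulate the foreground pixels and fill only triangles whose edges stay below a length threshold, so nearby dots merge while distant components remain separate. The threshold, the function name and the use of scikit-image for rasterisation are assumptions.

```python
import numpy as np
from scipy.spatial import Delaunay
from skimage.draw import polygon

def delaunay_dilate(binary: np.ndarray, max_edge: float = 15.0) -> np.ndarray:
    """Structure-aware dilation of a sparse dot pattern: fill every Delaunay
    triangle whose three edges are shorter than max_edge."""
    points = np.column_stack(np.nonzero(binary))   # (row, col) foreground coords
    out = binary.astype(bool).copy()
    if len(points) < 3:
        return out
    tri = Delaunay(points)
    for simplex in tri.simplices:
        pts = points[simplex]
        edges = np.linalg.norm(pts - np.roll(pts, 1, axis=0), axis=1)
        if edges.max() <= max_edge:
            rr, cc = polygon(pts[:, 0], pts[:, 1], shape=binary.shape)
            out[rr, cc] = True
    return out
```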
Abbas Cheddad
added an update
S.K. Dasari, A. Cheddad and P. Andersson, (2019) "Random Forest Surrogate Models to Support Design Space Exploration in Aerospace Use-case." Accepted for oral presentation at the 15th International Conference on Artificial Intelligence Applications and Innovations (AIAI'19). 24-26 May 2019, Crete, Greece. SPRINGER IFIP AICT (LNCS) Series.
 
Håkan Grahn
added 2 research items
Machine learning algorithms are responsible for a significant amount of computations. These computations are increasing with the advancements in different machine learning fields. For example, fields such as deep learning require algorithms to run for weeks, consuming vast amounts of energy. While there is a trend towards optimizing machine learning algorithms for performance and energy consumption, there is still little knowledge on how to estimate an algorithm's energy consumption. Currently, a straightforward cross-platform approach to estimate energy consumption for different types of algorithms does not exist. For that reason, well-known researchers in computer architecture have published extensive works on approaches to estimate energy consumption. This study presents a survey of methods to estimate energy consumption, and maps them to specific machine learning scenarios. Finally, we illustrate our mapping suggestions with a case study, where we measure energy consumption in a big data stream mining scenario. Our ultimate goal is to bridge the current gap that exists in estimating energy consumption in machine learning scenarios.
Emiliano Casalicchio
added 2 research items
Container technologies are changing the way cloud platforms and distributed applications are architected and managed. Containers are used to run enterprise, scientific and big data applications, to architect IoT and edge/fog computing systems, and by cloud providers to internally manage their infrastructure and services. However, we are far away from the maturity stage and there are still many research challenges to be solved. One of them is container orchestration that makes it possible to define how to select, deploy, monitor, and dynamically control the configuration of multi-container packaged applications in the cloud. This paper surveys the state-of-the-art solutions and discusses research challenges in autonomic orchestration of containers. A reference architecture of an autonomic container orchestrator is also proposed.
Huseyin Kusetogullari
added a research item
In this paper, a new approach is proposed to enhance handwriting images by using learning-based windowing contrast enhancement and a Gaussian Mixture Model (GMM). A fixed-size window moves over the handwriting image, and two quantitative methods, discrete entropy (DE) and edge-based contrast measure (EBCM), are used to estimate the quality of each patch. The obtained results are used in an unsupervised learning method, k-means clustering, to label the quality of the handwriting as bad (if it is low contrast) or good (if it is high contrast). After that, if the corresponding patch is estimated as low contrast, a contrast enhancement method is applied to the window to enhance the handwriting. The GMM is used as a final step to smoothly exchange information between the original and enhanced images and discard artifacts in the final image. The proposed method has been compared with other contrast enhancement methods on different datasets: Swedish historical documents, DIBCO2010, DIBCO2012 and DIBCO2013. Results illustrate that the proposed method performs well in enhancing the handwriting compared to existing contrast enhancement methods. Index Terms: handwriting image enhancement, contrast enhancement, learning-based windowing, Gaussian mixture modeling, k-means clustering.
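A simplified sketch of the learning-based windowing idea. Assumptions: the patch standard deviation stands in for the EBCM measure, adaptive histogram equalisation stands in for the paper's contrast-enhancement step, and the final GMM blending between the original and enhanced images is omitted.

```python
import numpy as np
from skimage import exposure
from sklearn.cluster import KMeans

def entropy(patch: np.ndarray) -> float:
    """Discrete entropy of an 8-bit patch."""
    hist, _ = np.histogram(patch, bins=256, range=(0, 256), density=True)
    hist = hist[hist > 0]
    return float(-(hist * np.log2(hist)).sum())

def enhance_low_contrast_windows(image: np.ndarray, win: int = 64) -> np.ndarray:
    """Slide a fixed window over the page, describe each patch by simple
    quality measures, split the patches into two clusters with k-means,
    and enhance only the cluster with the lower mean contrast."""
    h, w = image.shape
    coords, feats = [], []
    for y in range(0, h - win + 1, win):
        for x in range(0, w - win + 1, win):
            patch = image[y:y + win, x:x + win]
            coords.append((y, x))
            feats.append([entropy(patch), patch.std()])
    feats = np.asarray(feats)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(feats)
    low = 0 if feats[labels == 0, 1].mean() < feats[labels == 1, 1].mean() else 1
    out = image.astype(np.float64) / 255.0
    for (y, x), lab in zip(coords, labels):
        if lab == low:
            out[y:y + win, x:x + win] = exposure.equalize_adapthist(
                out[y:y + win, x:x + win])
    return (out * 255).astype(np.uint8)
```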
Shahrooz Abghari
added a research item
The growth of Internet video and over-the-top transmission techniques has enabled online video service providers to deliver high quality video content to viewers. To maintain and improve the quality of experience, video providers need to detect unexpected issues that can highly affect the viewers' experience. This requires analyzing massive amounts of video session data in order to find unexpected sequences of events. In this paper we combine sequential pattern mining and clustering to discover such event sequences. The proposed approach applies sequential pattern mining to find frequent patterns by considering contextual and collective outliers. In order to distinguish between the normal and abnormal behavior of the system, we initially identify the most frequent patterns. Then a clustering algorithm is applied on the most frequent patterns. The generated clustering model together with Silhouette Index are used for further analysis of less frequent patterns and detection of potential outliers. Our results show that the proposed approach can detect outliers at the system level.
Håkan Grahn
added a research item
Machine learning software accounts for a significant amount of the energy consumed in data centers. These algorithms are usually optimized towards predictive performance, i.e. accuracy, and scalability. This is the case for data stream mining algorithms. Although these algorithms are adaptive to the incoming data, they have fixed parameters from the beginning of the execution. We have observed that having fixed parameters leads to unnecessary computations, thus making the algorithm energy inefficient. In this paper we present the nmin adaptation method for Hoeffding trees. This method adapts the value of the nmin parameter, which significantly affects the energy consumption of the algorithm. The method reduces unnecessary computations and memory accesses, thus reducing the energy, while the accuracy is only marginally affected. We experimentally compared VFDT (Very Fast Decision Tree, the first Hoeffding tree algorithm) and CVFDT (Concept-adapting VFDT) with VFDT-nmin (VFDT with nmin adaptation). The results show that VFDT-nmin consumes up to 27% less energy than the standard VFDT, and up to 92% less energy than CVFDT, trading off a few percent of accuracy on a few datasets.
Håkan Grahn
added 2 research items
Graphics processing units (GPUs) in embedded mobile platforms are reaching performance levels where they may be useful for computer vision applications. We compare two generations of embedded GPUs for mobile devices when running a state-of-the-art feature detection algorithm, i.e., Harris-Hessian/FREAK. We compare architectural differences, execution time, temperature, and frequency on Sony Xperia Z3 and Sony Xperia XZ mobile devices. Our results indicate that the performance is soon sufficient for real-time feature detection, the GPUs have no temperature problems, and support for large work-groups is important.
Håkan Grahn
added a research item
In the context of historical document analysis, image binarization is a first important step, which separates foreground from background, despite common image degradations, such as faded ink, stains, or bleed-through. Fast binarization has great significance when analyzing vast archives of document images, since even small inefficiencies can quickly accumulate to years of wasted execution time. Therefore, efficient binarization is especially relevant to companies and government institutions, who want to analyze their large collections of document images. The main challenge with this is to speed up the execution performance without affecting the binarization performance. We modify a state-of-the-art binarization algorithm and achieve on average a 3.5 times faster execution performance by correctly mapping this algorithm to a heterogeneous platform, consisting of a CPU and a GPU. Our proposed parameter tuning algorithm additionally improves the execution time for parameter tuning by a factor of 1.7, compared to previous parameter tuning algorithms. We see that for the chosen algorithm, machine learning-based parameter tuning improves the execution performance more than heterogeneous computing, when comparing absolute execution times.
Abbas Cheddad
added a research item
This paper proposes a preprocessing stage to augment the bank of features that one can retrieve from binary images to help increase the accuracy of pattern recognition algorithms. To this end, by applying successive dilations to a given shape, we can capture a new dimension of its vital characteristics which we term hereafter: the shape growth pattern (SGP). This work investigates the feasibility of such a notion and also builds upon our prior work on structure preserving dilation using Delaunay triangulation. Experiments on two public data sets are conducted, including comparisons to existing algorithms. We deployed two renowned machine learning methods into the classification process (i.e., convolutional neural network-CNN-and random forests-RF-) since they perform well in pattern recognition tasks. The results show a clear improvement of the proposed approach's classification accuracy (especially for data sets with limited training samples) as well as robustness against noise when compared to existing methods.
Abbas Cheddad
added an update
Huseyin Kusetogullari & Håkan Grahn
Deleted publication
Abbas Cheddad, Huseyin Kusetogullari and Håkan Grahn (2017). "Object Recognition using Shape Growth Pattern," 10th International Symposium on Image and Signal Processing and Analysis (ISPA 2017), 18-20 September 2017, pp. 47-52, Ljubljana, Slovenia.
 
Håkan Grahn
added a research item
In the telecommunication business, a major investment goes into the infrastructure and its maintenance, while business revenues are proportional to how big, good, and well-balanced the customer base is. We present a data-driven analytic strategy based on combinatorial optimization and analysis of historical mobility, designed to quantify the desirability of different geo-demographic segments. In our case study, several segments were recommended for a partial reduction. Within a segment, clients are different. In order to enable intelligent reduction, we introduce the term infrastructure-stressing client and, using the proposed method, we reveal the list of the IDs of such clients. Fuzzy logic is used to build a natural language interface between a manager (who does not want technicalities but a comprehensive summary) and big data with its processing: a query is formulated in natural language, e.g., "retrieve absolutely unwanted clients". Once the list is given, to convince the manager, we have developed a visualization tool to allow for manual checking: it shows how a client moved through a sequence of hot spots and was repeatedly served by critically loaded antennas.
Emiliano Casalicchio
added an update
On the 18th of September 2017, the 1st edition of the Workshop on Autonomic Management of Large Scale Container-based Systems (AMLCS17) took place at the University of Arizona. AMLCS17 is part of the 2nd IEEE International Workshops on Foundations and Applications of Self* Systems, co-located with the IEEE International Conference on Cloud and Autonomic Computing.
The workshop was an opportunity to discuss research challenges and state-of-the-art results in the field of autonomic computing and large-scale container-based systems.
The workshop hosted three keynote speeches. Justin Cappos (inventor of Stork and TUF), from New York University, talked about how to secure the software deployment chain, presenting The Update Framework (TUF), a framework that helps developers secure new or existing software update systems, and in-toto, a tool designed to ensure the integrity of a software product from initiation to end-user installation. Alan Still, Senior Director of the High Performance Computing Center and Co-Director of the multi-university US National Science Foundation Cloud and Autonomic Computing Industry/University Cooperative Research Center, talked about his experience of emulating large-scale datacenters (>5000 nodes) using containers. Finally, Sherif Abdelwahed, Professor of Electrical and Computer Engineering at Virginia Commonwealth University, presented a state-of-the-art scalable distributed performance optimization framework for the autonomic performance management of distributed computing systems operating in a dynamic environment to satisfy desired quality-of-service objectives.
The five research papers presented at the workshop addressed different hot topics and applications of container-based systems. Uros Pascinski, from the University of Ljubljana (Slovenia), presented how Kubernetes and Docker could be used to implement an event-driven videoconferencing service. Vasily Tarasov, from IBM Almaden (USA), presented an extensive performance study of storage solutions for Docker containers; their results provide solid guidelines for selecting and configuring the most appropriate file system and storage solution for Docker containers. Emiliano Casalicchio, from Blekinge Institute of Technology (Sweden), presented issues and guidelines for collecting performance counters and auto-scaling Docker-based systems using Kubernetes. Wang Kangjin, from Peking University (China), presented a state-of-the-art solution for reducing latency when Docker images are distributed to deploy large-scale applications. Finally, Matej Cigale, from Cardiff University (UK), presented a QoS modeling approach that complements and extends standard microservice and component-based software engineering tools by giving the software engineer information on which Non-Functional Requirements and quality constraints have a critical influence on QoS.
Many people contributed to the success of the workshop. I would like to thank the attendees, the authors, the keynote speakers, the University of Arizona and the IEEE ICCAC organization (special thanks go to Stefano Iannucci, Chian Tunc and Salim Hariri), the Blekinge Institute of Technology and the BigData@BTH project for supporting the travel, the Technical Program Committee members, the workshop co-chair Nectario Koziris, and the panel chair Kurt Tutschku.
In 2018, a second edition of the AMLCS workshop will be organized. We are looking forward to new and interesting contributions.
 
Emiliano Casalicchio
added a research item
Today, the cloud industry is adopting container technology both for internal usage and as a commercial offering. The use of containers as a base technology for large-scale systems opens many challenges in the area of resource management at run-time. This paper addresses the problem of selecting the most appropriate performance metrics to activate auto-scaling actions. Specifically, we investigate the use of relative and absolute metrics. Results demonstrate that, for CPU-intensive workloads, the use of absolute metrics enables more accurate scaling decisions. We propose and evaluate the performance of a new auto-scaling algorithm that can reduce the response time by a factor of between 0.5 and 0.66 compared to the current Kubernetes horizontal auto-scaling algorithm.
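To make the relative-versus-absolute distinction concrete, the sketch below sizes a deployment from an absolute metric (CPU cores actually consumed) rather than a percentage of the requested CPU. This is only an illustration of the metric choice discussed above, not the paper's proposed auto-scaling algorithm; the names and limits are assumptions.

```python
import math

def desired_replicas_absolute(cpu_cores_used: list[float],
                              target_cores_per_replica: float,
                              min_replicas: int = 1,
                              max_replicas: int = 20) -> int:
    """Scaling decision driven by an absolute metric: total CPU cores
    consumed divided by the per-replica capacity we want to sustain."""
    total = sum(cpu_cores_used)
    replicas = math.ceil(total / target_cores_per_replica)
    return max(min_replicas, min(max_replicas, replicas))

# Three pods burning 0.9, 0.8 and 0.7 cores with a 0.5-core target
# -> ceil(2.4 / 0.5) = 5 replicas.
print(desired_replicas_absolute([0.9, 0.8, 0.7], 0.5))
```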
Håkan Grahn
added a research item
The aim of this study is to improve the monitoring and controlling of heating systems located at customer buildings through the use of a decision support system. To achieve this, the proposed system applies a two-step classifier to detect manual changes of the temperature of the heating system. We apply data from the Swedish company NODA, active in energy optimization and services for energy efficiency, to train and test the suggested system. The decision support system is evaluated through an experiment and the results are validated by experts at NODA. The results show that the decision support system can detect changes within three days after their occurrence and only by considering daily average measurements.
Emiliano Casalicchio
added a research item
Apache Cassandra is a highly scalable and available NoSQL datastore, largely used by enterprises of all sizes and for application areas that range from entertainment to big data analytics. Managed Cassandra service providers are emerging to hide the complexity of the installation, fine-tuning and operation of Cassandra Virtual Data Centers (VDCs). This paper addresses the problem of energy-efficient auto-scaling of Cassandra VDCs in managed Cassandra data centers. We propose three energy-aware auto-scaling algorithms: Opt, LocalOpt and LocalOpt-H. The first provides the optimal scaling decision, orchestrating horizontal and vertical scaling and optimal placement. The other two are heuristics and provide sub-optimal solutions; both orchestrate horizontal scaling and optimal placement, and LocalOpt also considers vertical scaling. In this paper we provide an analysis of the computational complexity of the optimal and heuristic auto-scaling algorithms; we discuss the issues in auto-scaling Cassandra VDCs and provide best practices for using auto-scaling algorithms; and we evaluate the performance of the proposed algorithms under programmed SLA variations, unexpected surges of throughput, and failures of physical nodes. We also compare the performance of the energy-aware auto-scaling algorithms with the performance of two energy-blind auto-scaling algorithms, namely BestFit and BestFit-H. The main findings are: VDC allocation aiming at reducing energy consumption, or resource usage in general, can heavily reduce the reliability of Cassandra in terms of the consistency level offered; horizontal scaling of Cassandra is very slow and makes it hard to manage surges of throughput; vertical scaling is a valid alternative, but it is not supported by all cloud infrastructures.
Håkan Grahn
added a research item
Data mining algorithms are usually designed to optimize a trade-off between predictive accuracy and computational efficiency. This paper introduces energy consumption and energy efficiency as important factors to consider during data mining algorithm analysis and evaluation. We conducted an experiment to illustrate how energy consumption and accuracy are affected when varying the parameters of the Very Fast Decision Tree (VFDT) algorithm. These results are compared with a theoretical analysis on the algorithm, indicating that energy consumption is affected by the parameters design and that it can be reduced significantly while maintaining accuracy.
Håkan Grahn
added 4 research items
JPEG encoding is a common technique to compress images. However, since JPEG is a lossy compression certain artifacts may occur in the compressed image. These artifacts typically occur in high frequency or detailed areas of the image. This paper proposes an algorithm based on the SSIM metric to improve the experienced quality in JPEG encoded images. The algorithm improves the quality in detailed areas by up to 1.29 dB while reducing the quality in less detailed areas of the image, thereby increasing the overall experienced quality without increasing the image data size. Further, the algorithm can also be used to decrease the file size (by up to 43%) while preserving the experienced image quality. Finally, an efficient GPU implementation is presented.
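The sketch below only illustrates the global form of the SSIM/size trade-off discussed above: search for the lowest JPEG quality whose SSIM against the original stays above a target, so the file shrinks while the perceived quality is preserved. The paper's algorithm instead redistributes quality between detailed and smooth regions within the image; the threshold, step size and grayscale conversion here are assumptions.

```python
import io
import numpy as np
from PIL import Image
from skimage.metrics import structural_similarity

def smallest_jpeg_above_ssim(path: str, min_ssim: float = 0.95):
    """Return the JPEG bytes of the lowest quality setting whose SSIM
    against the original stays above min_ssim (None if even quality 95
    falls short)."""
    original = np.asarray(Image.open(path).convert("L"), dtype=np.float64)
    best = None
    for quality in range(95, 9, -5):
        buf = io.BytesIO()
        Image.open(path).convert("L").save(buf, format="JPEG", quality=quality)
        buf.seek(0)
        decoded = np.asarray(Image.open(buf), dtype=np.float64)
        if structural_similarity(original, decoded, data_range=255) >= min_ssim:
            best = buf.getvalue()           # keep the smallest acceptable file
        else:
            break
    return best
```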
Data mining algorithms are usually designed to optimize a trade-off between predictive accuracy and computational efficiency. This paper introduces energy consumption and energy efficiency as important factors to consider during data mining algorithm analysis and evaluation. We extended the CRISP (Cross Industry Standard Process for Data Mining) framework to include energy consumption analysis. Based on this framework, we conducted an experiment to illustrate how energy consumption and accuracy are affected when varying the parameters of the Very Fast Decision Tree (VFDT) algorithm. The results indicate that energy consumption can be reduced by up to 92.5% (557 J) while maintaining accuracy.
GPUs in embedded platforms are reaching performance levels comparable to desktop hardware, thus it becomes interesting to apply Computer Vision techniques. We propose, implement, and evaluate a novel feature detector and descriptor combination, i.e., we combine the Harris-Hessian detector with the FREAK binary descriptor. The implementation is done in OpenCL, and we evaluate the execution time and classification performance. We compare our approach with two other methods, FAST/BRISK and ORB. Performance data is presented for the mobile device Xperia Z3 and the desktop Nvidia GTX 660. Our results indicate that the execution times on the Xperia Z3 are insufficient for real-time applications while desktop execution shows future potential. Classification performance of Harris-Hessian/FREAK indicates that the solution is sensitive to rotation, but superior in scale variant images.
Emiliano Casalicchio
added a research item
Today, a new technology is going to change the way platforms for the internet of services are designed and managed. This technology is called the container (e.g. Docker and LXC). The internet of services industry is adopting container technology both for internal usage and as a commercial offering. The use of containers as a base technology for large-scale systems opens many challenges in the area of resource management at run-time, for example: auto-scaling, optimal deployment and monitoring. Specifically, monitoring of container-based systems is at the basis of any resource management solution, and it is the focus of this work. This paper explores the tools available to measure the performance of Docker from the perspective of the host operating system and of the virtualization environment, and it provides a characterization of the CPU and disk I/O overhead introduced by containers.
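One simple host-side measurement option can be scripted as below, using `docker stats` in non-streaming mode; the abstract does not say which tools the paper settles on, so treat this purely as a generic illustration, with the JSON formatting added for convenience.

```python
import json
import subprocess

def docker_container_stats() -> list:
    """Snapshot CPU and memory figures for all running containers as seen
    from the host, via `docker stats --no-stream`."""
    fmt = '{"name":"{{.Name}}","cpu":"{{.CPUPerc}}","mem":"{{.MemUsage}}"}'
    out = subprocess.run(
        ["docker", "stats", "--no-stream", "--format", fmt],
        check=True, capture_output=True, text=True,
    ).stdout
    return [json.loads(line) for line in out.splitlines() if line.strip()]
```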
Emiliano Casalicchio
added 2 research items
Apache Cassandra is a NoSQL database offering high scalability and availability. Along with its competitors, e.g. HBase, SimpleDB and BigTable, Cassandra is a widely used platform for big data systems. Tuning the performance of those systems is a complex task and there is a growing demand for autonomic management solutions. In this paper we present an energy-aware adaptation model built from a real case based on Apache Cassandra and on real application data from Ericsson AB Sweden. Along with the optimal adaptation model, we propose a sub-optimal adaptation algorithm to avoid system perturbations due to re-configuration actions triggered by the subscription of new tenants and/or an increase in the volume of queries. Results show that the penalty for an adaptation mechanism that does not hurt system stability is between 20 and 30% with respect to the optimal adaptation.
Platforms for big data include mechanisms and tools to model, organize, store and access big data (e.g. Apache Cassandra, HBase, Amazon SimpleDB, Dynamo, Google BigTable). Resource management for those platforms is a complex task and must also account for multi-tenancy and infrastructure scalability. Human-assisted control of big data platforms is unrealistic and there is a growing demand for autonomic solutions. In this paper we propose a QoS- and energy-aware adaptation model designed to cope with the real case of a Cassandra-as-a-Service provider.
Abbas Cheddad
added an update
Stefan Axelsson
added a research item
The lack of legitimate datasets on mobile money transactions for performing research in the domain of fraud detection is a big problem today in the scientific community. Part of the problem is the intrinsically private nature of financial transactions, which leads to no publicly available datasets. This leaves researchers with the burden of first harnessing the dataset before performing the actual research on it. This paper proposes an approach to this problem that we name the PaySim simulator. PaySim is a financial simulator that simulates mobile money transactions based on an original dataset. In this paper, we present a solution that ultimately yields the possibility to simulate mobile money transactions in such a way that they become similar to the original dataset. With technology frameworks such as agent-based simulation techniques, and the application of mathematical statistics, we show in this paper that the simulated data can be as prudent as the original dataset for research.
Lars Lundberg
added 2 research items
The number of applications that use virtualized cloud-based systems is growing, and one would like to use this kind of systems also for real-time applications with hard deadlines. There is scheduling on two levels in real-time applications executing in a virtualized environment: traditional real-time scheduling of the tasks in the real-time application, and scheduling of different Virtual Machines (VMs) on the hypervisor level. Traditional real-time scheduling is well understood, and most of the existing results calculate schedules based on periods, deadlines and worst-case execution times of the real-time tasks. In order to apply the existing theory also to cloud-based virtualized environments we must obtain periods and worst-case execution times for the VMs containing real-time applications. In this paper, we describe a technique for calculating a period and a worst-case execution time for a VM containing a real-time application with hard deadlines. This new result makes it possible to apply existing real-time scheduling theory when scheduling VMs on the hypervisor level, thus making it possible to guarantee that the real-time tasks in a VM meet their deadlines.
Emiliano Casalicchio
added an update
Håkan Grahn
added a project goal