Massimo Cafaro
Università del Salento | Unisalento · Department of Engineering for Innovation

Ph.D.
Associate Professor, University of Salento, Lecce, Italy

About

151 Publications
17,244 Reads
1,330 Citations
Since 2017: 43 research items, 381 citations
Introduction
My research focuses on parallel, distributed and Grid/Cloud/P2P computing. In particular, I am interested in the design and analysis of sequential, parallel and distributed algorithms in the context of data mining, machine learning, deep learning, security and cryptography, and resource, data and information management in Grid/Cloud/P2P environments.
Additional affiliations
December 2015 - present
Università del Salento
Position
  • Professor (Associate)
October 2006 - December 2006
British Institute of Technology and E-commerce, London, UK
Position
  • Grid Technology and Security
Education
November 1996 - November 1999

Publications (151)
Chapter
As the economy and technology continue to advance, the need for energy for human activities is growing, placing significant pressure on power distribution to meet this demand instantly. Household energy behaviors can be tracked by using Smart Meters (SM), whose data undoubtedly contains valuable insights into household electricity consumption. Ho...
Chapter
Clustering is one of the main data mining techniques used to analyze and group data, but often applications have to deal with a very large amount of spatially distributed data for which most of the clustering algorithms available so far are impractical. In this paper we present P2PRASTER, a distributed algorithm relying on a gossip–based protocol f...
Article
Full-text available
Ship sailing is a complex endeavour, requiring carefully considered proactive and reactive strategies in choosing the course of action that best suits the various events to be managed. Humans are already supported by different technologies for sailing; however, these technologies are usually available in isolation. In this paper we show how to use s...
Chapter
Many real-world problems deal with collections of high-dimensional data, i.e., data with many different features. A dataset exhibiting a high number of features incurs the so-called curse of dimensionality: when the dimensionality increases, the volume of the space increases at a fast rate, causing the sparseness of the data. This makes challenging...
Article
UDDSketch is a recent algorithm for accurate tracking of quantiles in data streams, derived from the DDSketch algorithm. UDDSketch provides accuracy guarantees covering the full range of quantiles independently of the input distribution and greatly improves the accuracy with regard to DDSketch. In this paper we show how to compress and fuse two or...
Article
Full-text available
We present afqn (Approximate Fast Qn), a novel algorithm for approximate computation of the Qn scale estimator in a streaming setting, in the sliding window model. It is well-known that computing the Qn estimator exactly may be too costly for some applications, and the problem is a fortiori exacerbated in the streaming setting, in which the time av...
Article
Full-text available
The heavy hitters q-tail latencies problem has been introduced recently. This problem, framed in the context of data stream monitoring, requires approximating the quantiles of the heavy hitters items of an input stream whose elements are pairs (item, latency). The underlying rationale is that heavy hitters are obviously among the most important...
Article
Full-text available
We estimate the zonal drift velocity of small-scale ionospheric irregularities at low latitude by leveraging the spaced-receivers technique applied to two GNSS receivers for scintillation monitoring installed along the magnetic parallel passing in Presidente Prudente (Brazil, magnetic latitude 12.8°S). The investigated ionospheric sector is ideal t...
Article
Full-text available
We aim at contributing to the reliability of the phase scintillation index on Global Navigation Satellite System (GNSS) signals at high latitude. To this aim, we leverage a recently introduced detrending scheme based on the signal decomposition provided by the fast iterative filtering (FIF) technique. This detrending scheme has been demonstrate...
Article
We present P2PTFHH (Peer-to-Peer Time-Faded Heavy Hitters) which, to the best of our knowledge, is the first distributed algorithm for mining time-faded heavy hitters on unstructured P2P networks. P2PTFHH is based on the FDCMSS (Forward Decay Count-Min Space-Saving) sequential algorithm, and efficiently exploits an averaging gossip protocol by merg...
Preprint
Full-text available
UDDSKETCH is a recent algorithm for accurate tracking of quantiles in data streams, derived from the DDSKETCH algorithm. UDDSKETCH provides accuracy guarantees covering the full range of quantiles independently of the input distribution and greatly improves the accuracy with regard to DDSKETCH. In this paper we show how to compress and fuse data st...
Article
Full-text available
A stream can be thought of as a very large set of data, sometimes even infinite, which arrives sequentially and must be processed without the possibility of being stored. In fact, the memory available to the algorithm is limited and it is not possible to store the whole stream of data which is instead scanned upon arrival and summarized through a s...
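To make the idea of summarizing a stream concrete: a classic example of such a summary is the Count-Min sketch, which trades a small, fixed amount of memory for approximate frequency counts. The snippet below is a minimal illustrative version under my own assumptions (the hashing scheme and table sizes are arbitrary choices for the example, not taken from any specific paper listed here).

```python
import hashlib

class CountMinSketch:
    """Minimal Count-Min sketch: approximate frequency counts in sublinear space."""

    def __init__(self, width=2000, depth=5):
        self.width = width          # counters per row (controls the additive error)
        self.depth = depth          # number of rows (controls the failure probability)
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Hash the item together with the row id to simulate independent hash functions.
        digest = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def update(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def query(self, item):
        # Never underestimates; overestimates by a bounded amount with high probability.
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))

# Example: feed a short stream and query an item's approximate frequency.
cms = CountMinSketch()
for x in ["a", "b", "a", "c", "a"]:
    cms.update(x)
print(cms.query("a"))   # >= 3, typically exactly 3 at this scale
```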
Article
Full-text available
We present UDDSketch (Uniform DDSketch), a novel sketch for fast and accurate tracking of quantiles in data streams. This sketch is heavily inspired by the recently introduced DDSketch, and is based on a novel bucket collapsing procedure that allows overcoming the intrinsic limits of the corresponding DDSketch procedures. Indeed, the DDSketch bucke...
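As a rough illustration of the bucket-based approach behind DDSketch/UDDSketch, the toy code below keeps logarithmically spaced buckets and, when the bucket budget is exceeded, collapses pairs of adjacent buckets, which squares the relative-accuracy parameter. This is a simplified sketch of the general idea under my own assumptions (positive inputs only, arbitrary constants); it is not the authors' implementation.

```python
import math
from collections import Counter

class SimpleLogSketch:
    """Toy quantile sketch with log-spaced buckets and uniform pairwise collapsing."""

    def __init__(self, alpha=0.01, max_buckets=256):
        self.gamma = (1 + alpha) / (1 - alpha)   # bucket growth factor
        self.max_buckets = max_buckets
        self.buckets = Counter()                 # bucket index -> count
        self.count = 0

    def _bucket(self, x):
        # Positive values only in this toy version.
        return math.ceil(math.log(x) / math.log(self.gamma))

    def add(self, x):
        self.buckets[self._bucket(x)] += 1
        self.count += 1
        while len(self.buckets) > self.max_buckets:
            self._collapse()

    def _collapse(self):
        # Merge pairs of adjacent buckets; this squares gamma (uniformly halves the resolution).
        merged = Counter()
        for i, c in self.buckets.items():
            merged[math.ceil(i / 2)] += c
        self.buckets = merged
        self.gamma = self.gamma ** 2

    def quantile(self, q):
        target = q * (self.count - 1)
        running = 0
        for i in sorted(self.buckets):
            running += self.buckets[i]
            if running > target:
                # Representative value of bucket i (geometric midpoint estimate).
                return 2 * self.gamma ** i / (self.gamma + 1)
        return None
```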
Article
We present fqn (Fast Qn), a novel algorithm for online computation of the Qn scale estimator. The algorithm works in the sliding window model, cleverly computing the Qn scale estimator in the current window. We thoroughly compare our algorithm for online Qn with the state of the art competing algorithm by Nunkesser et al., and show that fqn (i) is...
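For context, the Qn scale estimator of Rousseeuw and Croux is (up to a constant) the k-th smallest of all pairwise absolute differences, with k = h(h-1)/2 and h = floor(n/2)+1, so a naive exact computation is quadratic in the window size; that cost is what makes fast exact and approximate approaches attractive. A minimal unoptimized version is sketched below (using the usual 2.2219 Gaussian consistency factor and omitting finite-sample correction factors).

```python
from itertools import combinations

def qn_naive(window, c=2.2219):
    """Exact Qn scale estimate of a window via all pairwise differences.

    O(n^2) time and space: fine for small windows, too costly in a streaming
    setting, which motivates faster exact and approximate algorithms.
    """
    n = len(window)
    h = n // 2 + 1
    k = h * (h - 1) // 2                     # rank of the order statistic to pick
    diffs = sorted(abs(a - b) for a, b in combinations(window, 2))
    return c * diffs[k - 1]                  # k-th smallest pairwise |difference|

print(qn_naive([1.0, 2.0, 4.0, 8.0, 16.0]))
```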
Conference Paper
Rapid and sudden fluctuations of phase and amplitude in Global Navigation Satellite System (GNSS) signals, caused by diffraction of the ionospheric phase components as the signals pass through small-scale irregularities (less than hundreds of meters), are commonly referred to as ionospheric scintillation. The aim of the paper is to analyze the implementation an...
Article
Full-text available
We contribute to the debate on the identification of phase scintillation induced by the ionosphere on the global navigation satellite system (GNSS) by introducing a phase detrending method able to provide realistic values of the phase scintillation index at high latitude. It is based on the fast iterative filtering signal decomposition technique, w...
Preprint
Full-text available
We present UDDSketch (Uniform DDSketch), a novel sketch for fast and accurate tracking of quantiles in data streams. This sketch is heavily inspired by the recently introduced DDSketch, and is based on a novel bucket collapsing procedure that allows overcoming the intrinsic limits of the corresponding DDSketch procedures. Indeed, the DDSketch bucke...
Article
Full-text available
A statistical analysis of Loss of Lock (LoL) over Brazil throughout the 24th solar cycle is performed. Four geodetic GPS dual-frequency (L1, L2) receivers, deployed at different geographic latitudes ranging from about 25° to 2° South in the eastern part of the country, are used to investigate the LoL dependence on time of the day, season, solar and...
Preprint
Full-text available
We present FQN (Fast $Q_n$), a novel algorithm for fast detection of outliers in data streams. The algorithm works in the sliding window model, checking if an item is an outlier by cleverly computing the $Q_n$ scale estimator in the current window. We thoroughly compare our algorithm for online $Q_n$ with the state of the art competing algorithm by...
Article
Full-text available
Reliably tracking large network flows in order to determine so-called elephant flows, also known as heavy hitters or frequent items, is a common data mining task. Indeed, this kind of information is crucial in many different contexts and applications. Since storing all of the traffic is impossible, owing to the fact that flows arrive as an unbounde...
Article
Full-text available
We introduce a new information and communication technology (ICT) cloud-based architecture for Global Navigation Satellite System (GNSS) high-accuracy solutions, offering also a commercial overview of GNSS downstream market to show how the developed innovation is thought to fit in the real context. The designed architecture is featured by dynamic s...
Chapter
We present a survey of the most important algorithms that have been proposed in the context of the frequent itemset mining. We start with an introduction and overview of basic sequential algorithms, and then discuss and compare different parallel approaches based on shared-memory, message-passing, map-reduce, and the use of GPU accelerators. Even t...
Preprint
We present P2PTFHH (Peer-to-Peer Time-Faded Heavy Hitters) which, to the best of our knowledge, is the first distributed algorithm for mining time-faded heavy hitters on unstructured P2P networks. P2PTFHH is based on the FDCMSS (Forward Decay Count-Min Space-Saving) sequential algorithm, and efficiently exploits an a...
Article
Full-text available
Large scale decentralized systems, such as P2P, sensor or IoT device networks are becoming increasingly common, and require robust protocols to address the challenges posed by the distribution of data and the large number of peers belonging to the network. In this paper, we deal with the problem of mining frequent items in unstructured P2P networks...
Chapter
Full-text available
We present a message-passing based parallel algorithm for mining Correlated Heavy Hitters from a two-dimensional data stream. To the best of our knowledge, this is the first parallel algorithm solving the problem. We show, through experimental results, that our algorithm provides very good scalability, whilst retaining the accuracy of its sequentia...
Preprint
Full-text available
Large scale decentralized systems, such as P2P, sensor or IoT device networks are becoming increasingly common, and require robust protocols to address the challenges posed by the distribution of data and the large number of peers belonging to the network. In this paper, we deal with the problem of mining frequent items in unstructured P2P networks...
Article
Full-text available
The problem of mining Correlated Heavy Hitters (CHH) from a bi-dimensional data stream has been introduced recently, and a deterministic algorithm based on the use of the Misra--Gries algorithm has been proposed to solve it. In this paper we present a new counter-based algorithm for tracking CHHs, formally prove its error bounds and correctness and...
Article
Full-text available
We deal with the problem of detecting frequent items in a stream under the constraint that items are weighted, and recent items must be weighted more than older ones. This kind of problem naturally arises in a wide class of applications in which recent data is considered more useful and valuable with regard to older, stale data. The weight assigned...
Article
This manuscript deals with a novel approach aimed at identifying multiple damaged sites in structural components through finite frequency changes. Natural frequencies, meant as a privileged set of modal data, are adopted along with a numerical model of the system. The adoption of finite changes efficiently allows challenging characteristic problems...
Article
Full-text available
We present PFDCMSS, a novel message-passing based parallel algorithm for mining time-faded heavy hitters. The algorithm is a parallel version of the recently published FDCMSS sequential algorithm. We formally prove its correctness by showing that the underlying data structure, a sketch augmented with a Space Saving stream summary holding exactly tw...
Article
Given an array of n elements and a value 2≤k≤n, a frequent item or k-majority element is an element occurring in more than n/k times. The k-majority problem requires finding all of the k-majority elements. In this paper, we deal with parallel shared-memory algorithms for frequent items; we present a shared-memory version of the Space Saving algorit...
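The sequential Space Saving algorithm that this shared-memory work builds on fits in a few lines: keep at most k counters, increment an existing counter if the item is already monitored, otherwise evict the minimum counter and let the new item inherit its count. A minimal sequential sketch (variable names are mine) is shown below.

```python
def space_saving(stream, k):
    """Sequential Space Saving: track candidate frequent items with at most k counters.

    Every item with true frequency greater than N/k is guaranteed to be among the
    returned candidates; each estimate overestimates the true count by at most the
    value of the counter it inherited.
    """
    counters = {}                              # item -> estimated count
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            victim = min(counters, key=counters.get)
            estimate = counters.pop(victim) + 1
            counters[item] = estimate          # inherit the evicted counter's value
    return counters

print(space_saving("abracadabra", k=3))
```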
Article
We present FDCMSS, a new sketch based algorithm for mining frequent items in data streams. The algorithm cleverly combines key ideas borrowed from forward decay, the Count-Min and the Space Saving algorithms. It works in the time fading model, mining data streams according to the cash register model. We formally prove its correctness and show, thro...
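FDCMSS operates in the time-fading model via forward decay, where an item arriving at time t_i contributes weight g(t_i - L) / g(t - L) at query time t, for a fixed landmark L and a monotone function g, so recent items weigh more than old ones. The snippet below is only a hedged illustration of that weighting (an exponential g and arbitrary parameters chosen by me, not the paper's code).

```python
import math

def forward_decayed_count(timestamps, query_time, landmark=0.0, lam=0.05):
    """Forward-decayed count with exponential g(x) = exp(lam * x).

    Each arrival at time t_i contributes g(t_i - L) / g(t - L), so recent
    arrivals contribute close to 1 and older ones decay towards 0.
    """
    g = lambda x: math.exp(lam * x)
    return sum(g(t_i - landmark) / g(query_time - landmark) for t_i in timestamps)

# Example: three old arrivals and three recent ones.
print(forward_decayed_count([1, 2, 3, 98, 99, 100], query_time=100))
```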
Article
We present a message-passing based parallel version of the Space Saving algorithm designed to solve the k-majority problem. The algorithm determines in parallel frequent items, i.e., those whose frequency is greater than a given threshold, and is therefore useful for iceberg queries and many other different contexts. We apply our algorithm to th...
Article
Full-text available
The access control problem in a hierarchy can be solved by using a hierarchical key assignment scheme, where each class is assigned an encryption key and some private information. A formal security analysis for hierarchical key assignment schemes has been traditionally considered in two different settings, i.e., the unconditionally secure and the c...
Article
Full-text available
Preserving data confidentiality in clouds is a key issue. Secret Sharing, a cryptographic primitive for the distribution of a secret among a group of $n$ participants designed so that only subsets of shareholders of cardinality $0 < t \leq n$ are allowed to reconstruct the secret by pooling their shares, can help mitigate and minimize the probl...
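The (t, n) threshold primitive described above is classically instantiated by Shamir's secret sharing over a prime field: the dealer hides the secret in the constant term of a random degree t-1 polynomial, hands out point evaluations as shares, and any t shares reconstruct the secret by Lagrange interpolation at zero. The toy version below illustrates that classic scheme; the specific scheme and prime are my own choices and not necessarily those analyzed in the paper.

```python
import random

PRIME = 2**127 - 1   # a Mersenne prime, large enough for a toy demo

def make_shares(secret, t, n, prime=PRIME):
    """Split `secret` into n shares, any t of which reconstruct it."""
    coeffs = [secret] + [random.randrange(prime) for _ in range(t - 1)]
    def f(x):
        return sum(c * pow(x, i, prime) for i, c in enumerate(coeffs)) % prime
    return [(x, f(x)) for x in range(1, n + 1)]

def reconstruct(shares, prime=PRIME):
    """Lagrange interpolation at x = 0 over the prime field."""
    secret = 0
    for i, (x_i, y_i) in enumerate(shares):
        num, den = 1, 1
        for j, (x_j, _) in enumerate(shares):
            if i != j:
                num = (num * -x_j) % prime
                den = (den * (x_i - x_j)) % prime
        secret = (secret + y_i * num * pow(den, -1, prime)) % prime
    return secret

shares = make_shares(secret=123456789, t=3, n=5)
print(reconstruct(shares[:3]))   # any 3 of the 5 shares recover 123456789
```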
Article
Full-text available
We deal with the problem of preference-based matchmaking of computational resources belonging to a Grid. We introduce CP–Nets, a recent development in the field of Artificial Intelligence, as a means to deal with user’s preferences in the context of Grid scheduling. We discuss CP–Nets from a theoretical perspective and then analyze, qualitatively a...
Conference Paper
Full-text available
The sharing and integration of health care data such as medical history, pathology, therapy, radiology images, etc., is a key requirement for improving the patient diagnosis and in general the patient care. Today, many EPR (Electronic Patient Record) systems are present both in the same or different health centers and record a huge amount of data r...
Chapter
Grid computing is an emerging and enabling technology allowing organizations to easily share, integrate and manage resources in a distributed environment. Computational Grid allows running millions of jobs in parallel, but the huge amount of generated data has caused another interesting problem: the management (classification, storage, discovery et...
Chapter
Grid computing is an emerging and enabling technology allowing organizations to easily share, integrate and manage resources in a distributed environment. Computational Grid allows running millions of jobs in parallel, but the huge amount of generated data has caused another interesting problem: the management (classification, storage, discovery et...
Article
The report describes the design of a system that can monitor the load on the Calypso IBM Power 6 cluster and act on different nodes turning them on and off according to their load (in terms of jobs), scheduled by LSF. The system is able to collect and analyze information about the cluster (queues, jobs, and hosts) in order to decide its course of a...
Article
Data centers now play an important role in modern IT infrastructures. Much research effort has been made in the field of green data center computing in order to save energy, thus making data centers environmentally friendly, for example by reducing CO2 emission. The Scientific Computing and Operations (SCO) division of the CMCC has started a resear...
Article
This report describes the design, implementation and test of a software able to monitor the load of a cluster IBM Power 6, acting on its nodes switching them on and off on the basis of their load (jobs), scheduled by LSF, taking into account a linear, integer variables optimization model which produces a feasible schedule, possibly optimal, of pend...
Article
We present a deterministic parallel algorithm for the k-majority problem, that can be used to find in parallel frequent items, i.e. those whose multiplicity is greater than a given threshold, and is therefore useful to process iceberg queries and in many other different contexts of applied mathematics and information theory. The algorithm can be us...
Article
This Special Section of Future Generation Computer Systems contains selected high-quality papers from the 4th International Conference on Grid and Pervasive Computing (GPC 2009), which was held in May 2009 in Geneva, Switzerland, and its related workshops. Research problems in these papers have been analyzed systematically, and for specific approac...
Book
Research into grid computing has been driven by the need to solve large-scale, increasingly complex problems for scientific applications. Yet the applications of grid computing for business and casual users did not begin to emerge until the development of the concept of cloud computing, fueled by advances in virtualization techniques, coupled with...
Chapter
Full-text available
This chapter introduces and puts in context Grids, Clouds, and Virtualization. Grids promised to deliver computing power on demand. However, despite a decade of active research, no viable commercial grid computing provider has emerged. On the other hand, it is widely believed—especially in the Business World—that HPC will eventually become a commod...
Article
We present two deterministic parallel Selection algorithms for distributed memory machines, under the coarse-grained multicomputer model. Both are based on the use of two weighted 3-medians, which allows discarding at least 1/3 of the elements in each iteration. The first algorithm slightly improves the current experimentally fastest algorithm by Sa...
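For intuition on why a 3-median pivot lets each iteration discard a constant fraction of the input: choosing the median of group medians over groups of three guarantees that roughly a third of the elements fall on each side of the pivot. The sequential toy below illustrates only that principle; the papers' algorithms are distributed, coarse-grained and use weighted 3-medians, which this sketch does not reproduce.

```python
def select(items, k):
    """Return the k-th smallest element (1-based) using a median-of-3-medians pivot.

    With groups of three, roughly a third of the elements fall on each side of
    the pivot, so each recursion discards a constant fraction of the input.
    """
    if len(items) <= 3:
        return sorted(items)[k - 1]
    groups = [sorted(items[i:i + 3]) for i in range(0, len(items), 3)]
    medians = [g[len(g) // 2] for g in groups]
    pivot = select(medians, (len(medians) + 1) // 2)   # median of the group medians
    lower = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    if k <= len(lower):
        return select(lower, k)
    if k <= len(lower) + len(equal):
        return pivot
    return select([x for x in items if x > pivot], k - len(lower) - len(equal))

print(select([9, 1, 7, 3, 5, 8, 2, 6, 4], 5))   # -> 5, the median
```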
Article
Workflows are widely used in applications that require coordinated use of computational resources. Workflow definition languages typically abstract over some aspects of the way in which a workflow is to be executed, such as the level of parallelism to ...
Article
Full-text available
In this paper, we describe the process of parallelizing an existing, production level, sequential Synthetic Aperture Radar (SAR) processor based on the Range-Doppler algorithmic approach. We show how, taking into account the constraints imposed by the software architecture and related software engineering costs, it is still possible with a moderat...
Article
Grid computing is an emerging and enabling technology allowing organizations to easily share, integrate and manage resources in a distributed environment. Computational Grid allows running millions of jobs in parallel, but the huge amount of generated data has caused another interesting problem: the management (classification, storage, discovery et...
Chapter
Grid computing is an emerging and enabling technology allowing organizations to easily share, integrate and manage resources in a distributed environment. Computational Grid allows running millions of jobs in parallel, but the huge amount of generated data has caused another interesting problem: the management (classification, storage, discovery et...
Chapter
In this chapter, the ProGenGrid (Proteomics and Genomics Grid) research project, which started in 2004, is described. It is a Grid Problem Solving Environment, specialized for the Bioinformatics domain, which aims at providing an integrated environment in order to compose, schedule and monitor biological applications in a Computational Grid. The ma...
Conference Paper
In a growing number of scientific disciplines, large data collections are emerging as important community resources. Data and metadata management exploiting the data grid paradigm is becoming more and more important as the number of involved data sources is continuously increasing and decentralizing. Efficient grid data access services are perceive...
Article
Increasingly, complex scientific applications are structured in terms of workflows. These applications are usually computationally and/or data intensive and thus are well suited for execution in grid environments. Distributed, geographically spread computing and storage resources are made available to scientists belonging to virtual organizations s...
Article
Even though many useful tools for sequence alignment are available, such as BLAST and PSI-BLAST by NCBI and FASTA by the University of Virginia, a key issue regarding sequence databases is their size, growing at an exponential rate. Grid and parallel computing are crucial techniques to maintain and improve the effectiveness of sequence comparison t...
Article
Full-text available
In this paper we describe a Workflow Management System, named ProGenGrid (Proteomics and Genomics Grid, developed at the University of Lecce) which aims at providing a tool where e-scientists can simulate biological experiments through the composition of existing analysis and visualization tools, wrapped as Web Services. Since bioinformatics applic...
Conference Paper
With a growing trend towards grid-based data repositories and data analysis services, scientific data analysis often involves accessing multiple data sources, and analyzing the data using a variety of analysis programs. A strictly related critical challenge is the fact that data sources often hold the same type of data in a number of different form...
Conference Paper
Data grid management systems are becoming increasingly important in the context of the recently adopted service oriented science paradigm. The Grid Relational Catalog (GRelC) project is working towards ubiquitous, integrated, seamless and comprehensive data grid management solutions to fully address application specific requirements. This paper des...
Conference Paper
A large-scale simulation in e-science experiments can be modeled by using a workflow. The ProGenGrid workflow management system has been under development at the University of Salento in Lecce since 2004 and consists of an editor for designing the experiment and an engine for scheduling the jobs in a computational grid. The initial version was based on wra...
Article
Grids encourage and promote the publication, sharing and integration of scientific data, distributed across Virtual Organizations. Scientists and researchers work on huge, complex and growing datasets. The complexity of data management within a grid environment comes from the distribution, heterogeneity and number of data sources. Along with coarse...
Chapter
Design and Implementation of a Grid Computing Environment for Remote Sensing. Massimo Cafaro (Euromediterranean Center for Climate Change & University of Salento, Italy); Italo Epicoco (Euromediterranean Center for Climate Change & University of Salento, Italy); Gianvito Quarta (Institute of Atmospheric Sciences and Climate, National Research Council, I...
Conference Paper
Increasingly, complex scientific applications are structured in terms of workflows. These applications are usually computationally and/or data intensive and thus are well suited for execution in grid environments. Distributed, geographically spread computing and storage resources are made available to scientists belonging to virtual organizations s...
Conference Paper
Full-text available
Increasingly, grid computing is becoming the paradigm of choice for building large-scale complex scientific applications. These applications are characterized as being computationally and/or data intensive, requiring computational power and storage resources well beyond the capability of a single computer. Grid environments provide distributed, geo...
Conference Paper
Many data grid applications manage and process huge datasets distributed across multiple grid nodes and stored into heterogeneous databases; e-Science projects need to access widespread databases within a computational grid environment, through a set of secure, interoperable and efficient data grid services. In the data grid management area several...
Article
This paper describes the Grid Resource Broker (GRB), a Grid portal built leveraging a set of high-level, Globus-Toolkit-based Grid libraries called GRB libraries. The portal leverages the Liferay framework to provide users with an intuitive, highly customizable Web GUI. The underlying GRB middleware allows trusted users seamless access to their com...
Article
We present an integrated Grid system for the prediction of protein secondary structures, based on the frequent automatic update of proteins in the training set. The predictor model is based on a feed-forward multilayer perceptron (MLP) neural network which is trained with the back-propagation algorithm; the design reuses existing legacy software an...
Conference Paper
This paper presents a data Grid system, built on top of specific biological data sources in flat file format, which carries out the ingestion into a relational DBMS that integrates these data. The prototype has been implemented for UniProtKB (located at EBI - European Bioinformatics Institute, UK) and UTRdb (located at ITB/CNR Bari, Italy) data ban...
Conference Paper
Full-text available
Data grids are middleware systems that offer secure shared storage of massive scientific datasets over wide area networks. In this paper we describe the GReIC Data Storage, a novel grid storage service which has been developed within the Grid Relational Catalog (GReIC) Project. The aim of this service is to manage efficiently, securely and transpar...
Conference Paper
Current production Grids involve hundreds of sites and thousands of machines. In this context, P2P solutions are well suited - with regard to existing centralized and hierarchical approaches