Pietro Morerio

Istituto Italiano di Tecnologia | IIT · Department of Pattern Analysis and Computer Vision

PhD in Computational Intelligence

About

Publications: 75
Reads: 21,077
Citations: 1,092
Introduction
I received my B.Sc. and M.Sc. in Physics from the University of Milan (Italy) in 2007 and 2010 (summa cum laude). I was a Research Fellow at the University of Genoa (Italy) from 2011 to 2012, working in Video Analysis for Interactive Cognitive Environments. I completed a PhD in Computational Intelligence at the same institution in 2016. Currently I am a Postdoctoral Researcher at Istituto Italiano di Tecnologia (IIT). My research focuses on machine learning, deep learning and computer vision.
Additional affiliations
July 2016 - present
Istituto Italiano di Tecnologia
Position
  • PostDoc Position
January 2013 - June 2016
Università degli Studi di Genova
Position
  • PhD Student
April 2011 - December 2012
Università degli Studi di Genova
Position
  • Research Grant
Education
October 2004 - December 2010
University of Milan
Field of study
  • Physics

Publications (75)
Article
Counterfeiting is a worldwide issue affecting many industrial sectors, ranging from specialized technologies to the retail market, such as fashion brands, pharmaceutical products, and consumer electronics. Counterfeiting is not only a huge economic burden (>$1 trillion in losses per year), but it also represents a serious risk to human health, for example, d...
Conference Paper
The large-scale use of surveillance cameras in public spaces has raised serious concerns about breaches of individual privacy. Introducing privacy and security in video surveillance systems, primarily in person re-identification (re-id), is quite challenging. Event cameras are novel sensors, which only respond to brightness changes in the scene. This char...
Conference Paper
Full-text available
Gaze prediction in egocentric videos is a fairly new research topic, which might have several applications for assistive technology (e.g., supporting blind people in their daily interactions), security (e.g., attention tracking in risky work environments), education (e.g., augmented / mixed reality training simulators, immersive games) and so forth...
Article
Full-text available
Existing person re-identification (re-id) methods mostly exploit a large set of cross-camera identity labelled training data. This requires a tedious data collection and annotation process, leading to poor scalability in practical re-id applications. On the other hand, unsupervised re-id methods do not need identity label information, but they usual...
Conference Paper
Full-text available
Recent years have seen a surge of interest in associating faces and voices for cross-modal biometric applications, along with speaker recognition. Inspired by this, we introduce the challenging task of establishing associations between faces and voices across multiple languages spoken by the same set of persons. The aim of this paper is to an...
Preprint
Full-text available
This paper presents a novel 3D human pose estimation approach using a single stream of asynchronous events as input. Most state-of-the-art approaches solve this task with RGB cameras, but struggle when subjects are moving fast. On the other hand, event-based 3D pose estimation benefits from the advantages of event cameras, especially t...
Preprint
The concept of compressing deep Convolutional Neural Networks (CNNs) is essential to use limited computation, power, and memory resources on embedded devices. However, existing methods achieve this objective at the cost of a drop in inference accuracy in computer vision tasks. To address such a drawback, we propose a framework that leverages knowle...
Article
Full-text available
We propose a guided dropout regularizer for deep networks based on the evidence of a network prediction defined as the firing of neurons in specific paths. In this work, we utilize the evidence at each neuron to determine the probability of dropout, rather than dropping out neurons uniformly at random as in standard dropout. In essence, we drop out...
Preprint
Full-text available
The majority of existing Unsupervised Domain Adaptation (UDA) methods presumes source and target domain data to be simultaneously available during training. Such an assumption may not hold in practice, as source data is often inaccessible (e.g., due to privacy reasons). On the contrary, a pre-trained source model is always considered to be availabl...
Chapter
In this paper, we propose the use of a new modality characterized by a richer information content, namely acoustic images, for the sake of audio-visual scene understanding. Each pixel in such images is characterized by a spectral signature, associated to a specific direction in space and obtained by processing the audio signals coming from an array...
Preprint
In this work, we address the problem of estimating the so-called "Social Distancing" given a single uncalibrated image in unconstrained scenarios. Our approach proposes a semi-automatic solution to approximate the homography matrix between the scene ground and image plane. With the estimated homography, we then leverage an off-the-shelf pose detect...
Conference Paper
In this paper, we propose the use of a new modality characterized by a richer information content, namely acoustic images, for the sake of audiovisual scene understanding. Each pixel in such images is characterized by a spectral signature, associated to a specific direction in space and obtained by processing the audio signals coming from an array...
Preprint
Full-text available
Recent years have seen a surge of interest in associating faces and voices for cross-modal biometric applications, along with speaker recognition. Inspired by this, we introduce the challenging task of establishing associations between faces and voices across multiple languages spoken by the same set of persons. The aim of this paper is to an...
Preprint
Full-text available
The design of an automatic visual inspection system is usually performed in two stages. While the first stage consists in selecting the most suitable hardware setup for highlighting most effectively the defects on the surface to be inspected, the second stage concerns the development of algorithmic solutions to exploit the potentials offered by the...
Preprint
Full-text available
Existing person re-identification (re-id) methods mostly exploit a large set of cross-camera identity labelled training data. This requires a tedious data collection and annotation process, leading to poor scalability in practical re-id applications. On the other hand, unsupervised re-id methods do not need identity label information, but they usual...
Preprint
Full-text available
We investigate and characterize the inherent resilience of conditional Generative Adversarial Networks (cGANs) against noise in their conditioning labels, and exploit this fact in the context of Unsupervised Domain Adaptation (UDA). In UDA, a classifier trained on the labelled source set can be used to infer pseudo-labels on the unlabelled target s...
Article
Full-text available
This paper aims at investigating the action prediction problem from a pure kinematic perspective. Specifically, we address the problem of recognizing future actions, indeed human intentions, underlying the same initial (and apparently unrelated) motor act. This study is inspired by neuroscientific findings asserting that motor acts at the very onset...
Preprint
Full-text available
In this work, we address the problem of learning an ensemble of specialist networks using multimodal data, while considering the realistic and challenging scenario of possible missing modalities at test time. Our goal is to leverage the complementary information of multiple modalities to the benefit of the ensemble and each individual network. We i...
Preprint
Full-text available
Pedestrian attributes, e.g., hair length, clothes type and color, locally describe the semantic appearance of a person. Training person re-identification (ReID) algorithms under the supervision of such attributes has proven to be effective in extracting local features which are important for ReID. Unlike person identity, attributes are consistent...
Article
Full-text available
Heterogeneous data modalities can provide complementary cues for several tasks, usually leading to more robust algorithms and better performance. However, while training data can be accurately collected to include a variety of sensory modalities, it is often the case that not all of them are available in real life (testing) scenarios, where a model...
Preprint
Full-text available
In this paper, we investigate how to learn rich and robust feature representations for audio classification from visual data and a novel audio data modality, namely acoustic images. Former models learn audio representations from raw signals or spectral data acquired by a single microphone, with remarkable results in classification and retrieval. Ho...
Preprint
Full-text available
Heterogeneous data modalities can provide complementary cues for several tasks, usually leading to more robust algorithms and better performance. However, while training data can be accurately collected to include a variety of sensory modalities, it is often the case that not all of them are available in real life (testing) scenarios, where a model...
Conference Paper
Full-text available
Autism is a behavioral neurological disorder affecting a significant percentage of the worldwide population. It typically starts manifesting at a very early age, but early diagnosis is difficult since there is no specific exam or trial able to detect it reliably. Its detection in fact depends mainly on the medical expertise used to...
Conference Paper
Full-text available
Diverse input data modalities can provide complementary cues for several tasks, usually leading to more robust algorithms and better performance. However, while a (training) dataset could be accurately designed to include a variety of sensory inputs, it is often the case that not all modalities are available in real life (testing) scenarios, where...
Preprint
Full-text available
Diverse input data modalities can provide complementary cues for several tasks, usually leading to more robust algorithms and better performance. However, while a (training) dataset could be accurately designed to include a variety of sensory inputs, it is often the case that not all modalities could be available in real life (testing) scenarios, w...
Preprint
Full-text available
We propose a guided dropout regularizer for deep networks based on the evidence of a network prediction: the firing of neurons in specific paths. In this work, we utilize the evidence at each neuron to determine the probability of dropout, rather than dropping out neurons uniformly at random as in standard dropout. In essence, we drop out with highe...
Poster
Full-text available
In this work, we face the problem of unsupervised domain adaptation with a novel deep learning approach which leverages our finding that entropy minimization is induced by the optimal alignment of second order statistics between source and target domains. We formally demonstrate this hypothesis and, aiming at achieving an optimal alignment in pract...
Conference Paper
Full-text available
In this work, we face the problem of unsupervised domain adaptation with a novel deep learning approach which leverages our finding that entropy minimization is induced by the optimal alignment of second order statistics between source and target domains. We formally demonstrate this hypothesis and, aiming at achieving an optimal alignment in pr...
Conference Paper
Full-text available
Dropout is a simple yet effective regularization technique that has been applied to various machine learning tasks, including linear classification, matrix factorization (MF) and deep learning. However, despite its solid empirical performance, the theoretical properties of dropout as a regularizer remain quite elusive. In this paper, we present a...
Article
Full-text available
Despite the recent deep learning (DL) revolution, kernel machines remain powerful methods for action recognition. DL has brought the use of large datasets and this is typically a problem for kernel approaches, which do not scale up efficiently due to kernel Gram matrices. Nevertheless, kernel methods are still attractive and more generally...
Conference Paper
Full-text available
Recent works showed that Generative Adversarial Networks (GANs) can be successfully applied in unsupervised domain adaptation, where, given a labeled source dataset and an unlabeled target dataset, the goal is to train powerful classifiers for the target samples. In particular, it was shown that a GAN objective function can be used to learn target...
Conference Paper
Full-text available
Dropout is a very effective way of regularizing neural networks. Stochastically "dropping out" units with a certain probability discourages over-specific co-adaptations of feature detectors, preventing overfitting and improving network generalization. Besides, Dropout can be interpreted as an approximate model aggregation technique, where an expone...
Article
Full-text available
Regularization for matrix factorization (MF) and approximation problems has been carried out in many different ways. Due to its popularity in deep learning, dropout has been applied also for this class of problems. Despite its solid empirical performance, the theoretical properties of dropout as a regularizer remain quite elusive for this class of...
Conference Paper
Full-text available
3D action recognition was shown to benefit from a covariance representation of the input data (joint 3D positions). A kernel machine fed with such features is an effective paradigm for 3D action recognition, yielding state-of-the-art results. Yet, the whole framework is affected by the well-known scalability issue. In fact, in general, the kernel f...
Conference Paper
Full-text available
Human action recognition from skeletal data is a hot research topic and important in many open domain applications of computer vision, thanks to recently introduced 3D sensors. In the literature, naive methods simply transfer off-the-shelf techniques from video to the skeletal representation. However, the current state-of-the-art is contended betwe...
Preprint
Domain adaptation techniques address the problem of reducing the sensitivity of machine learning methods to the so-called domain shift, namely the difference between source (training) and target (test) data distributions. In particular, unsupervised domain adaptation assumes no labels are available in the target domain. To this end, aligning second...
Article
Full-text available
Wearable cameras allow people to record their daily activities from a user-centered (First Person Vision) perspective. Due to their favorable location, wearable cameras frequently capture the hands of the user, and may thus represent a promising user-machine interaction tool for different applications. Existing First Person Vision methods handle ha...
Article
Under a tracking framework, the definition of the target state is the basic step for automatic understanding of dynamic scenes. More specifically, far object tracking raises challenges related to potentially abrupt size changes of targets as they approach the sensor. If not handled, size changes can introduce heavy issues in data association and po...
Thesis
Full-text available
The emergence of new pervasive wearable technologies such as action cameras and smart glasses brings the focus of Computer Vision research to the so-called First Person Vision (FPV), or Egocentric Vision. Nowadays, more and more everyday-life videos are being shot from a first-person point of view, overturning the classical fixed-camera understand...
Data
Full-text available
UNIGE-HANDS is a dataset containing a large set of videos recorded with a head-mounted camera. It was originally proposed to test hand-detection methods in egocentric videos. The dataset is carefully recorded to guarantee a good balance between frames with hands and without hands, and offers challenging conditions such as illumination changes,...
Conference Paper
Full-text available
Classifying frames, or parts of them, is a common way of carrying out detection tasks in computer vision. However, frame by frame classification suffers from sudden significant variations in image texture, colour and luminosity, resulting in noise in the extracted features and consequently in the decisions taken. Support Vector Machines have been w...
Conference Paper
Full-text available
Hand detection and segmentation methods stand as two of the most prominent objectives in First Person Vision. Their popularity is mainly explained by the importance of a reliable detection and location of the hands to develop human-machine interfaces for emergent wearable cameras. Current developments have been focused on hand segmentation...
Conference Paper
Full-text available
First Person Vision (Egocentric) video analysis stands nowadays as one of the emerging fields in computer vision. The availability of wearable devices recording exactly what the user is looking at is ineluctable and the opportunities and challenges carried by this kind of device are broad. Particularly, for the first time a device is so intimate w...
Article
In this work, we propose a strategy for optimizing a superpixel algorithm for video signals, in order to get closer to real-time performance, which is needed for egocentric vision applications and must be sustainable on wearable technologies. Instead of applying the algorithm frame by frame, we propose a technique inspire...
Conference Paper
Full-text available
Superpixel methods have become popular in recent years as they provide an efficient preprocessing tool for a manifold of computer vision applications. In this work, we propose a method based on a self-adapting and self-growing network, which is bred starting from two random initialization seeds in the image. Such a network, which is a modification...
Conference Paper
Recently, a Bayesian estimator with a hybrid update was developed, based on a mathematical formulation of sampling. Such an Event Based State Estimator (EBSE) allows for a stable synchronous state estimate, relying on asynchronous measurements. Usefulness of such a filter comes with its approximate analytic formulation, which is attainable given a...
Article
Full-text available
Cognitive algorithms, integrated in intelligent systems, represent an important innovation in designing interactive smart environments. More specifically, Cognitive Systems have important applications in anomaly detection and management in advanced video surveillance. These algorithms mainly address the problem of modelling interactions and behaviou...