
Vittorio MurinoUniversity of Verona | UNIVR · Department of Computer Science
Vittorio Murino
Professor, PhD, Eng
About
633
Publications
116,835
Reads
How we measure 'reads'
A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more
15,015
Citations
Citations since 2017
Introduction
I am full professor at the University of Verona, Computer Science Dept. I was the former director of the Pattern Analysis & Computer Vision (PAVIS) at the Italian Institute of Technology, Genova.
My research interest are in computer vision, machine learning and pattern recognition methods. In particular, I work on deep learning techniques for vision and image/video understanding, with main applications in surveillance and biomedical image analysis.
More info: https://pavis.iit.it/
Additional affiliations
October 1998 - November 2011
Università degli studi di Verona
October 1995 - October 1998
Università degli studi di Udine
Publications
Publications (633)
Source-free Unsupervised Domain Adaptation (SUDA) approaches inherently exhibit catastrophic forgetting. Typically, models trained on a labeled source domain and adapted to unlabeled target data improve performance on the target while dropping performance on the source, which is not available during adaptation. In this study, our goal is to cope wi...
In this paper, we introduce a novel framework for the challenging problem of One-Shot Unsupervised Domain Adaptation (OS-UDA), which aims to adapt to a target domain with only a single unlabeled target sample. Unlike existing approaches that rely on large labeled source and unlabeled target data, our Target-Driven One-Shot UDA (TOS-UDA) approach em...
Vector Quantized Variational Autoencoders (VQ-VAEs) have gained popularity in recent years due to their ability to represent images as discrete sequences of tokens that index a learned codebook of vectors, enabling efficient image compression. One variant of particular interest is VQ-VAE 2, which extends previous works by representing images as a h...
Objects are crucial for understanding human-object interactions. By identifying the relevant objects, one can also predict potential future interactions or actions that may occur with these objects. In this paper, we study the problem of Short-Term Object interaction anticipation (STA) and propose NAOGAT (Next-Active-Object Guided Anticipation Tran...
In zero-shot learning (ZSL), the task of recognizing unseen categories when no data for training is available, state-of-the-art methods generate visual features from semantic auxiliary information (e.g., attributes). In this work, we propose a valid alternative (simpler, yet better scoring) to fulfill the very same task. We observe that, if first-...
In this technical report, we describe the Guided-Attention mechanism based solution for the short-term anticipation (STA) challenge for the EGO4D challenge. It combines the object detections, and the spatiotemporal features extracted from video clips, enhancing the motion and contextual information, and further decoding the object-centric and motio...
Short-term action anticipation (STA) in first-person videos is a challenging task that involves understanding the next active object interactions and predicting future actions. Existing action anticipation methods have primarily focused on utilizing features extracted from video clips, but often overlooked the importance of objects and their intera...
In this paper, we introduce a novel framework for the challenging problem of One-Shot Unsupervised Domain Adaptation (OSUDA), which aims to adapt to a target domain with only a single unlabeled target sample. Unlike existing approaches that rely on large labeled source and unlabeled target data, our Target-driven One-Shot UDA (TOS-UDA) approach emp...
Existing Source-free Unsupervised Domain Adaptation (SUDA) approaches inherently exhibit catastrophic forgetting. Typically, models trained on a labeled source domain and adapted to unlabeled target data improve performance on the target while dropping performance on the source, which is not available during adaptation. In this study, our goal is t...
Our brain constantly combines sensory information in unitary percept to build coherent representations of the environment. Even though this process could appear smooth, integrating sensory inputs from various sensory modalities must overcome several computational issues, such as recoding and statistical inferences problems. Following these assumpti...
Deploying a person re-identification (Re-ID) system in a real scenario requires adapting a model trained on one labeled dataset to a different environment, with no person identity information. This poses an evident challenge that can be faced by unsupervised domain adaptation approaches. Recent state-of-the-art methods adopt architectures composed...
This paper addresses the problem of anticipating the next-active-object location in the future, for a given egocentric video clip where the contact might happen, before any action takes place. The problem is considerably hard, as we aim at estimating the position of such objects in a scenario where the observed clip and the action segment are separ...
Multiple sclerosis (MS) is a neurological condition characterized by severe structural brain damage and by functional reorganization of the main brain networks that try to limit the clinical consequences of structural burden. Resting‐state (RS) functional connectivity (FC) abnormalities found in this condition were shown to be variable across diffe...
In this paper, we investigate brain activity associated with complex visual tasks, showing that electroencephalography (EEG) data can help computer vision in reliably recognizing actions from video footage that is used to stimulate human observers. Notably, we consider not only typical “explicit” video action benchmarks, but also more complex data...
Acoustic images are an emergent data modality for multimodal scene understanding. Such images have the peculiarity of distinguishing the spectral signature of the sound coming from different directions in space, thus providing a richer information as compared to that derived from single or binaural microphones. However, acoustic images are typicall...
Over the last few years, Unsupervised Domain Adaptation (UDA) techniques have acquired remarkable importance and popularity in computer vision. However, when compared to the extensive literature available for images, the field of videos is still relatively unexplored. On the other hand, the performance of a model in action recognition is heavily af...
In Generalized Zero-Shot Learning (GZSL), unseen categories (for which no visual data are available at training time) can be predicted by leveraging their class embeddings (e.g., a list of attributes describing them) together with a complementary pool of seen classes (paired with both visual data and class embeddings). Despite GZSL is arguably chal...
Increased explainability in machine learning is traditionally associated with lower performance, e.g. a decision tree is more explainable, but less accurate than a deep neural network. We argue that, in fact, increasing the explainability of a deep classifier can improve its generalization. In this chapter, we survey a line of our published work th...
Zero-Shot Learning (ZSL) objective is to classify instances of classes that were not seen during the training phase. ZSL methods take advantage of side information, i.e., class attributes, to leverage information between the seen and unseen classes. Lately, generative methods have been used to synthesize unseen features in order to train a classifi...
The relationship between structure and function is of interest in many research fields involving the study of complex biological processes. In neuroscience in particular, the fusion of structural and functional data can help to understand the underlying principles of the operational networks in the brain. To address this issue, this paper proposes...
The increasing presence of robots in society necessitates a deeper understanding into what attitudes people have toward robots. People may treat robots as mechanistic artifacts or may consider them to be intentional agents. This might result in explaining robots’ behavior as stemming from operations of the mind (intentional interpretation) or as a...
Acoustic images constitute an emergent data modality for multimodal scene understanding. Such images have the peculiarity to distinguish the spectral signature of sounds coming from different directions in space, thus providing richer information than the one derived from mono and binaural microphones. However, acoustic images are typically generat...
Existing person re-identification (re-id) methods mostly exploit a large set of cross-camera identity labelled training data. This requires a tedious data collection and annotation process, leading to poor scalability in practical re-id applications. On the other hand unsupervised re-id methods do not need identity label information, but they usual...
The concept of compressing deep Convolutional Neural Networks (CNNs) is essential to use limited computation, power, and memory resources on embedded devices. However, existing methods achieve this objective at the cost of a drop in inference accuracy in computer vision tasks. To address such a drawback, we propose a framework that leverages knowle...
We propose a guided dropout regularizer for deep networks based on the evidence of a network prediction defined as the firing of neurons in specific paths. In this work, we utilize the evidence at each neuron to determine the probability of dropout, rather than dropping out neurons uniformly at random as in standard dropout. In essence, we dropout...
The majority of existing Unsupervised Domain Adaptation (UDA) methods presumes source and target domain data to be simultaneously available during training. Such an assumption may not hold in practice, as source data is often inaccessible (e.g., due to privacy reasons). On the contrary, a pre-trained source model is always considered to be availabl...
In Generalized Zero-Shot Learning (GZSL), unseen categories (for which no visual data are available at training time) can be predicted by leveraging their class embeddings (e.g., a list of attributes describing them) together with a complementary pool of seen classes (paired with both visual data and class embeddings). Despite GZSL is arguably chal...
The precise segmentation of organs from computed tomography is a fundamental and pivotal task for correct diagnosis and proper treatment of diseases. Neural network models are widely explored for their promising performance in the segmentation of medical images. However, the small dimension of available datasets is affecting the biomedical imaging...
In this paper, we address zero-shot learning (ZSL), the problem of recognizing categories for which no labeled visual data are available during training. We focus on the transductive setting, in which unlabelled visual data from unseen classes is available. State-of-the-art paradigms in ZSL typically exploit generative adversarial networks to synth...
In state-of-the-art deep single-label classification models, the top-
$k$
$(k=2,3,4, \dots)$
accuracy is usually significantly higher than the top-1 accuracy. This is more evident in fine-grained datasets, where differences between classes are quite subtle. Exploiting the information provided in the top
$k$
predicted classes boosts the final p...
We address the challenging Voice Activity Detection (VAD) problem, which determines "Who is Speaking and When?" in audiovisual recordings. The typical audio-based VAD systems can be ineffective in the presence of ambient noise or noise variations. Moreover, due to technical or privacy reasons, audio might not be always available. In such cases, the...
In this paper, we propose the use of a new modality characterized by a richer information content, namely acoustic images, for the sake of audio-visual scene understanding. Each pixel in such images is characterized by a spectral signature, associated to a specific direction in space and obtained by processing the audio signals coming from an array...
This paper presents a novel setup for automatic visual inspection of cracks in ceramic tile as well as studies the effect of various classifiers and height-varying illumination conditions for this task. The intuition behind this setup is that cracks can be better visualized under specific lighting conditions than others. Our setup, which is designe...
DSLib is an open-source implementation of the Dominant Set (DS) clustering algorithm written entirely in Matlab. The DS method is a graph-based clustering technique rooted in the evolutionary game theory that starts gaining lots of interest in the computer science community. Thanks to its duality with game theory and its strict relation to the noti...
The retina is a complex nervous system which encodes visual stimuli before higher order processing occurs in the visual cortex. In this study we evaluated whether information about the stimuli received by the retina can be retrieved from the firing rate distribution of Retinal Ganglion Cells (RGCs), exploiting High-Density 64x64 MEA technology. To...
The increasing presence of robots in society necessitates a deeper understanding into what attitudes people have toward robots. People may treat robots as mechanistic artifacts or may consider them to be intentional agents. This might result in explaining robots' behavior as stemming from operations of the mind (intentional interpretation) or as a...
In this paper, we propose the use of a new modality characterized by a richer information content, namely acoustic images, for the sake of audiovisual scene understanding. Each pixel in such images is characterized by a spectral signature, associated to a specific direction in space and obtained by processing the audio signals coming from an array...
One of the main and most effective measures to contain the recent viral outbreak is the maintenance of the so-called Social Distancing (SD). To comply with this constraint, governments are adopting restrictions over the minimum inter-personal distance between people. Given this actual scenario, it is crucial to massively measure the compliance to s...
We present an automatic voice activity detection (VAD) method that is solely based on visual cues. Unlike traditional approaches processing audio, we show that upper body motion analysis is desirable for the VAD task. The proposed method consists of components for body motion representation, feature extraction from a Convolutional Neural Network (C...
We present an automatic voice activity detection (VAD) method that is solely based on visual cues. Unlike traditional approaches processing audio, we show that upper body motion analysis is desirable for the VAD task. The proposed method consists of components for body motion representation , feature extraction from a Convolutional Neural Network (...
One of the main and most effective measures to contain the recent viral outbreak is the maintenance of the so-called Social Distancing (SD). To comply with this constraint, workplaces, public institutions, transports and schools will likely adopt restrictions over the minimum inter-personal distance between people. Given this actual scenario, it is...
The design of an automatic visual inspection system is usually performed in two stages. While the first stage consists in selecting the most suitable hardware setup for highlighting most effectively the defects on the surface to be inspected, the second stage concerns the development of algorithmic solutions to exploit the potentials offered by the...
In this paper, we tackle the task of automatically analyzing 3D volumetric scans obtained from computed tomography (CT) devices. In particular, we address a particular task for which data is very limited: the segmentation of ancient Egyptian mummies CT scans. We aim at digitally unwrapping the mummy and identify different segments such as body, ban...
We are interested in learning data-driven representations that can generalize well, even when trained on inherently biased data. In particular, we face the case where some attributes (bias) of the data, if learned by the model, can severely compromise its generalization properties. We tackle this problem through the lens of information theory, leve...
Conventionally, AI models are thought to trade off explainability for lower accuracy. We develop a training strategy that not only leads to a more explainable AI system for object classification, but as a consequence, suffers no perceptible accuracy degradation. Explanations are defined as regions of visual evidence upon which a deep classification...
Existing person re-identification (re-id) methods mostly exploit a large set of cross-camera identity labelled training data. This requires a tedious data collection and annotation process, leading to poor scalability in practical re-id applications. On the other hand unsupervised re-id methods do not need identity label information, but they usual...
We investigate and characterize the inherent resilience of conditional Generative Adversarial Networks (cGANs) against noise in their conditioning labels, and exploit this fact in the context of Unsupervised Domain Adaptation (UDA). In UDA, a classifier trained on the labelled source set can be used to infer pseudo-labels on the unlabelled target s...
This paper aims at investigating the action prediction problem from a pure kinematic perspective. Specifically, we address the problem of recognizing future actions, indeed human intentions, underlying a same initial (and apparently unrelated) motor act. This study is inspired by neuroscientific findings asserting that motor acts at the very onset...
In this work, we address the problem of learning an ensemble of specialist networks using multimodal data, while considering the realistic and challenging scenario of possible missing modalities at test time. Our goal is to leverage the complementary information of multiple modalities to the benefit of the ensemble and each individual network. We i...
Visual features designed for image classification have shown to be useful in zero-shot learning (ZSL) when generalizing towards classes not seen during training. In this paper, we argue that a more effective way of building visual features for ZSL is to extract them through captioning, in order not just to classify an image but, instead, to describ...
This paper presents a simple and effective generalization method for magnetic resonance imaging (MRI) segmen-tation when data is collected from multiple MRI scanning sites and as a consequence is affected by (site-)domain shifts. We propose to integrate a traditional encoder-decoder network with a regularization network. This added network includes...
1 Abstract This paper presents a simple and effective generalization method for magnetic resonance imaging (MRI) segmen-tation when data is collected from multiple MRI scanning sites and as a consequence is affected by (site-)domain shifts. We propose to integrate a traditional encoder-decoder network with a regularization network. This added netwo...