Alberto Del Bimbo

  • University of Florence

About

672
Publications
100,797
Reads
14,869
Citations

Publications (672)
Conference Paper
Full-text available
The impressive advancements of AI have marked the progress of technological solutions in many fields in the last few years. A key role in this progress has been played by Computer Vision, providing machines with the capability to observe, interpret and, based on such understanding, predict. These progresses require to consider this technology with re...
Poster
Full-text available
There is a growing awareness of the need for action to address the ecological emergency. The 2030 Agenda is widely shared around the world and the Sustainable Development Goals (SDGs) represent global moral values. The 9 Planetary Boundaries (PBs) [1] feature the constraints of the planet within which human beings can continue to develop in a susta...
Preprint
Full-text available
Human facial expressions change dynamically, so their recognition / analysis should be conducted by accounting for the temporal evolution of face deformations either in 2D or 3D. While abundant 2D video data do exist, this is not the case in 3D, where few 3D dynamic (4D) datasets were released for public use. The negative consequence of this scarci...
Chapter
Face recognition in unconstrained open-world settings is a challenging problem. Differently from the closed-set and open-set face recognition scenarios that assume that the face representations of known subjects have been manually enrolled in a gallery, the open-world scenario requires that the system learns identities incrementally from frame to f...
Conference Paper
We propose a method aimed at reducing human intervention in football video shooting and highlights editing, allowing automatic highlight detection together with panning and zooming on salient areas of the playing field. Our recognition subsystem exploits computer vision algorithms to perform automatic detection, pan and zoom and extraction of salie...
Conference Paper
Full-text available
This paper describes an action classification pipeline for detecting and evaluating correct execution of actions in video recorded by smartphone cameras; the use case is that of simplifying monitoring of how physiotherapeutic exercises are performed by patients in the comfort of their own home, reducing the need of physical presence of therapists....
Conference Paper
The multimedia and multi-modal community is witnessing an explosive transformation in the recent years with major societal impact. With the unprecedented deployment of multimedia devices and systems, multimedia research is critical to our abilities and prospects in advancing state-of-the-art technologies and solving real-world challenges facing the...
Conference Paper
NeuronUnityIntgration2.0 (demo video is available at http://tiny.cc/u1lz6y) is a plugin for Unity which provides gesture recognition functionalities through the Perception Neuron motion capture suit. The system offers a recording mode, which guides the user through the collection of a dataset of gestures, and a recognition mode, capable of detecting...
Conference Paper
Full-text available
Video compression algorithms result in a reduction of image quality, because of their lossy approach to reduce the required bandwidth. This affects commercial streaming services such as Netflix, or Amazon Prime Video, but affects also video conferencing and video surveillance systems. In all these cases it is possible to improve the video quality,...
Conference Paper
Recent years have seen unprecedented research on using artificial intelligence to understand the subjective attributes of images and videos. These attributes are not objective properties of the content but are highly dependent on the perception of the viewers. Subjective attributes are extremely valuable in many applications where images are tailor...
Preprint
Full-text available
Text to Image Synthesis refers to the process of automatic generation of a photo-realistic image starting from a given text and is revolutionizing many real-world applications. In order to perform such process it is necessary to exploit datasets containing captioned images, meaning that each image is associated with one (or more) captions describin...
Chapter
Autonomous driving is becoming a reality, yet vehicles still need to rely on complex sensor fusion to understand the scene they act in. The ability to discern static environment and dynamic entities provides a comprehension of the road layout that poses constraints to the reasoning process about moving objects. We pursue this through a GAN-based se...
Preprint
Full-text available
In this report, we provide additional and corrected results for the paper "Extended YouTube Faces: a Dataset for Heterogeneous Open-Set Face Identification" [1]. 1 After further investigations, we discovered and corrected wrongly labeled images and incorrect identities. This forced us to regenerate the evaluation protocol for the new data; in doing...
Preprint
Full-text available
Autonomous driving is becoming a reality, yet vehicles still need to rely on complex sensor fusion techniques to fully understand the scene they act in. Being able to discern the static environment from the dynamic entities that populate it, will improve scene comprehension algorithms and will pose constraints to the reasoning process about moving...
Chapter
Full-text available
Depth cameras enable long term re-identification exploiting 3D information that captures biometric cues such as face and characteristic lengths of the body. In the typical approach, person re-identification is performed using appearance, thus invalidating any application in which a person may change dress across subsequent acquisitions. For example...
Article
Full-text available
In this paper we present a machine vision system to efficiently monitor, analyze and present visual data acquired with a railway overhead gantry equipped with multiple cameras. This solution aims to improve the safety of daily life railway transportation in a two-fold manner: (1) by providing automatic algorithms that can process large imagery of...
Chapter
Identity recognition using 3D scans of the face has been recently proposed as an alternative or complementary solution to conventional 2D face recognition approaches based on still images or videos. In fact, face representations based on 3D data are expected to be more robust to pose changes and illumination variations than 2D images, thus allowin...
Chapter
In this Chapter, we describe the design and development of low-cost solutions, exploiting Microsoft Kinect™ and 3D graphics, for two specific use cases: the simulation of the Basic Life Support Defibrillation (BLSD-S) and Surgical Safety Checklist (SSC-S) procedures. These prototypes have been designed with the aim to propose innovative pedagogica...
Chapter
Depth cameras simplify many tasks in computer vision, such as background modeling, 3D reconstruction, articulated object tracking, and gesture analysis. These sensors provide a great tool for real-time analysis of human behavior. In this chapter, we cover two important issues that can be solved using computer vision for natural interaction. First,...
Chapter
Skills training is of primary importance for healthcare professionals. New technologies such as natural interfaces and virtual reality bring new possibilities in learning outcomes especially if used synergically exploiting some principles derived from the so-called serious gaming interfaces. Virtual reality combined with natural interaction interfa...
Article
We present a novel online unsupervised method for face identity learning from video streams. The method exploits deep face descriptors together with a memory based learning mechanism that takes advantage of the temporal coherence of visual data. Specifically, we introduce a discriminative feature matching solution based on Reverse Nearest Neighbour...
Article
Full-text available
Ensembles of Exemplar-SVMs have been introduced as a framework for Object Detection but have rapidly found a large interest in a wide variety of computer vision applications such as mid-level feature learning, tracking and segmentation. What makes this technique so attractive is the possibility of associating to instance specific classifiers one or...
Conference Paper
Full-text available
We present a smart audio guide that adapts itself to the environment the user is navigating into. The system builds automatically a point of interest database exploiting Wikipedia and Google APIs as source. We rely on a computer vision system, to overcome the likely sensor limitations, and determine with high accuracy if the user is facing a certai...
Conference Paper
Full-text available
In this demo we present a system for immersive experiences in museums using Voice Commands (VCs) and Virtual Reality (VR). The system has been specifically designed for use by people with motor disabilities. Natural interaction is provided through Automatic Speech Recognition (ASR) and allows to experience VR environments wearing a Head Mounted Di...
Conference Paper
Modern automated visual surveillance scenarios demand to process effectively a large set of visual stream with a limited amount of human resources. Actionable information is required in real-time, therefore abnormal pattern detection shall be performed in order to select the most useful streams for an operator to visually inspect. To tackle this ch...
Conference Paper
Full-text available
Emotion recognition is attracting great interest for its potential application in a multitude of real-life situations. Much of the Computer Vision research in this field has focused on relating emotions to facial expressions, with investigations rarely including more than upper body. In this work, we propose a new scenario, for which emotional stat...
Article
We present a novel unsupervised method for face identity learning from video sequences. The method exploits the ResNet deep network for face detection and VGGface fc7 face descriptors together with a smart learning mechanism that exploits the temporal coherence of visual data in video streams. We present a novel feature matching solution based on R...
Article
Full-text available
Person re-identification is best known as the problem of associating a single person that is observed from one or more disjoint cameras. The existing literature has mainly addressed such an issue, neglecting the fact that people usually move in groups, like in crowded scenarios. We believe that the additional information carried by neighboring indi...
Article
Emotion recognition is attracting great interest for its potential application in a multitude of real-life situations. Much of the Computer Vision research in this field has focused on relating emotions to facial expressions, with investigations rarely including more than upper body. In this work, we propose a new scenario, for which emotional stat...
Article
Full-text available
In this article, we address the problem of creating a smart audio guide that adapts to the actions and interests of museum visitors. As an autonomous agent, our guide perceives the context and is able to interact with users in an appropriate fashion. To do so, it understands what the visitor is looking at, if the visitor is moving inside the museum...
Conference Paper
In this paper we present a system for the detection and validation of macro and micro-events in cities (e.g. concerts, business meetings, car accidents) through the analysis of geolocalized messages from Twitter. A simple but effective method is proposed for unknown event detection designed to alleviate computational issues in traditional approache...
Article
Full-text available
Background: The traditional paper and pencil tests are often inadequate to detect the mild forms of Unilateral Spatial Neglect (USN). Objective: To verify the effectiveness of a touchscreen-based cancellation test in assessing individuals with USN. Methods: Seven individuals, six with right and one with left brain damage, who showed moderate to seve...
Conference Paper
Digital and mobile technologies have become increasingly popular to support and improve the quality of experience during cultural visits. The portability of the device, the daily adaptation of most people to its usage, the easy access to information and the opportunity of interactive augmented reality have been key factors of this popularity. We be...
Conference Paper
We present a new tool we have developed to ease the annotation of crowded environments, typical of visual surveillance datasets. Our tool is developed using HTML5 and Javascript and has two back-ends. A PHP based back-end implements the persistence using a relational database and manages the dynamic creation of pages and the authentication procedure....
Conference Paper
Full-text available
Given the huge quantity of hours of video available on video sharing platforms such as YouTube, Vimeo, etc. development of automatic tools that help users find videos that fit their interests has attracted the attention of both scientific and industrial communities. So far the majority of the works have addressed semantic analysis, to identify obje...
Article
Face analysis from 2D images and videos is a central task in many multimedia applications. Methods developed to this end perform either face recognition or facial expression recognition, and in both cases results are negatively influenced by variations in pose, illumination and resolution of the face. Such variations have a lower impact on 3D face...
Article
Human motion and behaviour in crowded spaces is influenced by several factors, such as the dynamics of other moving agents in the scene, as well as the static elements that might be perceived as points of attraction or obstacles. In this work, we present a new model for human trajectory prediction which is able to take advantage of both human-human...
Article
Full-text available
In this paper we introduce the problem of predicting action progress in untrimmed videos. We argue that this is an extremely important task because, on the one hand, it can be valuable for a wide range of applications and, on the other hand, it facilitates better action detection results. To solve this problem we introduce a novel approach, named P...
Article
Full-text available
Compression artifacts arise in images whenever a lossy compression algorithm is applied. These artifacts eliminate details present in the original image, or add noise and small structures; because of these effects they make images less pleasant for the human eye, and may also lead to decreased performance of computer vision algorithms such as objec...
Article
When running multiplayer online games on IP networks with losses and delays, the order of actions may be changed when compared to the order run on an ideal network with no delays and losses. To maintain a proper ordering of events, traditional approaches either use rollbacks to undo certain actions or local lags to introduce additional delays. Both...
Article
Object detection is one of the most important tasks of computer vision. It is usually performed by evaluating a subset of the possible locations of an image that are more likely to contain the object of interest. Exhaustive approaches have now been superseded by object proposal methods. The interplay of detectors and proposal algorithms has not bee...
Conference Paper
In this paper, we propose a new and effective frontalization algorithm for frontal rendering of unconstrained face images, and experiment it for face recognition. Initially, a 3DMM is fit to the image, and an interpolating function maps each pixel inside the face region on the image to the 3D model’s. Thus, we can render a frontal view without intr...
Conference Paper
Full-text available
The analysis of human gait is more and more investigated due to its large panel of potential applications in various domains, like rehabilitation, deficiency diagnosis, surveillance and movement optimization. In addition, the release of depth sensors offers new opportunities to achieve gait analysis in a non-intrusive context. In this paper, we pr...
Article
Performing face recognition across 3D scans with different resolution is now attracting an increasing interest thanks to the introduction of a new generation of depth cameras, capable of acquiring color/depth images over time. In fact, these devices acquire and provide depth data with much lower resolution compared with the 3D high-resolution scann...
Article
We present an approach for human activity recognition based on trajectory grouping. Our representation allows to perform partial matching between videos obtaining a robust similarity measure. This approach is extremely useful in sport videos where multiple entities are involved in the activities. Many existing works perform person detection, tracki...
Conference Paper
Imaging Novecento is a native mobile application that can be used to get insights on artworks in the “Museo Novecento” in Florence, IT. The App provides smart paradigms of interaction to ease the learning of the Italian art history of the 20th century. Imaging Novecento exploits automatic approaches and gamification techniques with recreatio...
Conference Paper
In this paper we evaluate methods to move 'naturally' in an Immersive Virtual Environment (IVE) visualised through a Head Mounted Display (HMD). Natural interaction is provided through gesture recognition on depth sensors' data. Gestural input solutions in the literature to provide locomotion are discussed. Two new methods for locomotion are propo...
Article
Full-text available
Pan-tilt-zoom (PTZ) cameras are powerful to support object identification and recognition in far-field scenes. Real-time detection and tracking of targets with these cameras is nevertheless complicated by the fact that the geometrical relationship between the camera view and the 3D observed scene is time-varying and, over long periods of operation,...
Conference Paper
In this paper we present a simple yet effective approach to extend without supervision any object proposal from static images to videos. Unlike previous methods, these spatio-temporal proposals, to which we refer as “tracks”, are generated relying on little or no visual content by only exploiting bounding boxes spatial correlations through time. Th...
Conference Paper
Full-text available
The goal of this work is to implement a real-time computer vision system that can run on wearable devices to perform object classification and artwork recognition, to improve the experience of a museum visit through understanding the interests of users. Object classification helps to understand the context of the visit, e.g. differentiating when a...
Conference Paper
We present a novel method to improve action recognition by leveraging a set of captioned videos. By learning linear projections to map videos and text onto a common space, our approach shows that improved results on unseen videos can be obtained. We also propose a novel structure preserving loss that further ameliorates the quality of the projectio...
Article
Full-text available
In this paper we present a simple yet effective approach to extend without supervision any object proposal from static images to videos. Unlike previous methods, these spatio-temporal proposals, to which we refer as tracks, are generated relying on little or no visual content by only exploiting bounding boxes spatial correlations through time. The...
Conference Paper
This paper discusses the role of computer vision to bridge the experiential gap between the cultural and emotional experience of the visitors in museums or cultural heritage sites. We don't argue against the use of multiple sensors to provide a more complete cultural experience but claim the primary role of computer vision for such a task. Although...
Article
The traditional k-out-of-n Visual Cryptography (VC) scheme is the conception of “all or nothing” for n participants to share a secret image. The original secret image can be visually revealed only when a subset of k or more shares are superimposed together, but if the number of stacked shares is less than k, nothing will be revealed. On the other...
Article
Full-text available
In this paper, we propose a framework for analyzing and understanding human behavior from depth videos. The proposed solution first employs shape analysis of the human pose across time to decompose the full motion into short temporal segments representing elementary motions. Then, each segment is characterized by human motion and depth appearance ar...
Article
In this paper we introduce a method to overcome one of the main challenges of person re-identification in multi-camera networks, namely cross-view appearance changes. The proposed solution addresses the extreme variability of person appearance in different camera views by exploiting multiple feature representations. For each feature, Kernel Canonic...
Article
Where previous reviews on content-based image retrieval emphasize on what can be seen in an image to bridge the semantic gap, this survey considers what people tag about an image. A comprehensive treatise of three closely linked problems, i.e., image tag assignment, refinement, and tag-based image retrieval is presented. While existing works vary i...
Conference Paper
Full-text available
Hundreds of hours of videos are uploaded every minute on YouTube and other video sharing sites: some will be viewed by millions of people and others will go unnoticed by all but the uploader. In this paper we propose to use visual sentiment and content features to predict the popularity of web videos. The proposed approach outperforms current state-...
Conference Paper
In this paper we propose a method for video recommendation in Social Networks based on crowdsourced and automatic video annotations of salient frames. We show how two human factors, users' self-expression in user profiles and perception of visual saliency in videos, can be exploited in order to stimulate annotations and to obtain an efficient repre...
Article
While most automatic image annotation methods rely solely on visual features, we consider integrating additional information into a unified embedding comprised of visual and textual information. We propose an approach based on Kernel Canonical Correlation Analysis, which builds a latent semantic space where correlation of visual and textual featur...
Article
In this paper we present an efficient method for visual descriptors retrieval based on compact hash codes computed using a multiple k-means assignment. The method has been applied to the problem of approximate nearest neighbor (ANN) search of local and global visual content descriptors, and it has been tested on different datasets: three large scal...
Article
Full-text available
This paper presents a novel method for efficient image retrieval, based on a simple and effective hashing of CNN features and the use of an indexing structure based on Bloom filters. These filters are used as gatekeepers for the database of image features, allowing to avoid to perform a query if the query features are not stored in the database and...
Article
In this paper, we present a novel approach for fusing shape and texture local binary patterns (LBP) on a mesh for 3D face recognition. Using a recently proposed framework [1], we compute LBP directly on the face mesh surface, then we construct a grid of the regions on the facial surface that can accommodate global and partial descriptions. Compared...
Article
Crowd system has motivated a surge of interests in many areas of multimedia, as it contains plenty of information about crowd scenes. In crowd systems, individuals tend to exhibit collective behaviors, and the motion of all those individuals is called collective motion. As a comprehensive descriptor of collective motion, collectiveness has been pro...
Conference Paper
Full-text available
This tutorial focuses on challenges and solutions for content-based image annotation and retrieval in the context of online image sharing and tagging. We present a unified review on three closely linked problems, i.e., tag assignment, tag refinement, and tag-based image retrieval. We introduce a taxonomy to structure the growing literature, under...
Conference Paper
Full-text available
Affective content analysis has gained great attention in recent years and is an important challenge of content-based multimedia information retrieval. In this paper, a hierarchical approach is proposed for affect recognition in movie datasets. This approach has been verified on the AFEW dataset, showing an improvement in classification results comp...
Conference Paper
Full-text available
In this paper, we propose a new approach for constructing a 3D morphable model (3DMM) and experiment its application to face recognition. Differently from existing solutions, the proposed 3DMM is constructed from a training set that includes a large spectrum of variability in terms of ethnicity and facial expressions. By exploiting annotated landma...
Conference Paper
Full-text available
The goal of the system presented in this demo is to make possible for the visually and hearing impaired audience to live empathetic viewing experiences using their home theatre. In this work we suggest the incorporation of new emotion communication modalities into the standard television, to provide the targeted audience with sensations that they d...
Conference Paper
In this paper, we present a web based annotation tool we developed allowing creating collaboratively a detailed ground truth for datasets related to visual surveillance and behavior understanding. The system persistence is based on a relational database and the user interface is designed using HTML5, Javascript and CSS. Our tool can easily manage d...
Conference Paper
In this demo we present smArt, a low-cost framework to quickly set up indoor exhibits featuring a smart navigation system for museums. The framework is web-based and allows the design on a digital map of a sensorized museum environment and the dynamic and assisted definition of the multimedia materials and sensors associated to the artworks. The kn...
Conference Paper
In this demo we present PITAGORA (demo video available at http://bit.ly/1GgtUrN): a mobile web contextual social network designed for the check-in area of an airport. The app provides recommendation of potential friends, local experts and targeted services. Recommendation is hybrid and combines social media analysis and collaborative filter...
Conference Paper
Full-text available
Images in social networks share different destinies: some are going to become popular while others are going to be completely unnoticed. In this paper we propose to use visual sentiment features together with three novel context features to predict a concise popularity score of social images. Experiments on large scale datasets show the benefits of...
Conference Paper
Full-text available
In this paper we present a system for content-based video recommendation that exploits visual saliency to better represent video features and content (demo video available at http://bit.ly/1FYloeQ). Visual saliency is used to select relevant frames to be presented in a web-based interface to tag and annotate video frames in a social network...
Conference Paper
In the last decade facial age estimation has grown its importance in computer vision. In this paper we propose an efficient and effective age estimation system from face imagery. To assess the quality of the proposed approach we compare the results obtained by our system with those achieved by other recently published methods on a very large datase...
Conference Paper
In this paper we present a solution for tracking-by-detection that is able to handle both scale variations and occlusions of the tracked object. We build upon the framework proposed in [7] based on structured output SVM and improve it in order to deal with both variations of target scale and occlusions. We first propose to modify the original solut...
Article
Tagging of multimedia content is becoming more and more widespread as web 2.0 sites, like Flickr and Facebook for images, YouTube and Vimeo for videos, have popularized tagging functionalities among their users. These user-generated tags are used to ease browsing and exploration of media collections, e.g. using tag clouds, or to retrieve multimedia...
Article
Full-text available
In this paper, we introduce an original framework for computing local binary-like patterns on 2D mesh manifolds (i.e., surfaces in the 3D space). This framework, dubbed mesh-LBP, preserves the simplicity and the adaptability of the 2D LBP and has the capacity of handling both open and closed mesh surfaces without requiring normalization as compared...
Article
In this paper we introduce a method for person re-identification based on discriminative, sparse basis expansions of targets in terms of a labeled gallery of known individuals. We propose an iterative extension to sparse discriminative classifiers capable of ranking many candidate targets. The approach makes use of soft- and hard- re-weighting to r...
