Pietro Pala’s research while affiliated with University of Florence and other places


Publications (35)


Pain Level Estimation from Videos by Analyzing the Dynamics of Facial Landmarks with a Spatio-Temporal Graph Neural Network
  • Preprint · March 2025 · 5 Reads
  • Fatemah Alhamdoosh · Pietro Pala
Fig. 1: Pooling-based fusion approach. We pool the features after a cut-off layer (the encoder) and process the blended features with the final part of the model.
Fig. 2: Asymmetric modality injection. The main domain (event) is informed about the complementary domain (RGB) thanks to a cross-attention mechanism that blends the features asymmetrically.
Fig. 3: Symmetric fusion architecture. Two asymmetric injections are performed to inform the two modalities of each other; a final pooling layer is used to merge the features symmetrically.
Table: Comparison of existing event-based drone datasets. Other datasets either have low resolution, do not contain RGB versions of the samples, are not drone-centric, or are very small.
Table: Results of the different proposed architectures.

Neuromorphic Drone Detection: an Event-RGB Multimodal Approach
  • Preprint · File available · September 2024 · 31 Reads

In recent years, drone detection has quickly become a subject of extreme interest: the potential for small, fast-moving objects to be used for malicious intents or even terrorist attacks has drawn attention to the need for precise and resilient systems for detecting and identifying such elements. While extensive literature exists on object detection based on RGB data, it is also critical to recognize the limits of this modality when applied to UAV detection. Detecting drones indeed poses several challenges, such as fast-moving objects and scenes with a high dynamic range or, even worse, scarce illumination. Neuromorphic cameras, on the other hand, can retain precise and rich spatio-temporal information in situations that are challenging for RGB cameras. They are resilient to both high-speed moving objects and scarce illumination settings, while being prone to a rapid loss of information when the objects in the scene are static. In this context, we present a novel model that integrates both domains, leveraging multimodal data to take advantage of the best of both worlds. To this end, we also release NeRDD (Neuromorphic-RGB Drone Detection), a novel spatio-temporally synchronized Event-RGB drone detection dataset of more than 3.5 hours of multimodal annotated recordings.
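The asymmetric modality injection described above (Fig. 2) can be illustrated with a minimal NumPy sketch: event-domain features act as queries and attend over RGB-domain keys/values, with the attended values blended back residually. This is not the authors' implementation; the random projection matrices stand in for learned weights, and all dimensions are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_inject(event_feats, rgb_feats, d_k=32, rng=None):
    """Inject RGB information into event features via cross-attention.

    event_feats: (n_event_tokens, d) main-domain features (queries)
    rgb_feats:   (n_rgb_tokens, d)  complementary-domain features (keys/values)
    The projection matrices are random placeholders for learned weights.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    d = event_feats.shape[1]
    Wq = rng.standard_normal((d, d_k)) / np.sqrt(d)
    Wk = rng.standard_normal((d, d_k)) / np.sqrt(d)
    Wv = rng.standard_normal((d, d)) / np.sqrt(d)

    Q = event_feats @ Wq                      # queries from the main (event) domain
    K = rgb_feats @ Wk                        # keys from the complementary (RGB) domain
    V = rgb_feats @ Wv                        # values from the complementary domain
    attn = softmax(Q @ K.T / np.sqrt(d_k))    # (n_event, n_rgb) attention map
    return event_feats + attn @ V             # residual, asymmetric blend

event = np.random.default_rng(1).standard_normal((10, 64))
rgb = np.random.default_rng(2).standard_normal((16, 64))
fused = cross_attention_inject(event, rgb)
print(fused.shape)  # (10, 64)
```

The symmetric variant (Fig. 3) would simply apply this injection twice, once in each direction, before pooling the two outputs.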



Explaining autonomous driving with visual attention and end-to-end trainable region proposals
  • February 2023 · 84 Reads · 7 Citations
  • Journal of Ambient Intelligence and Humanized Computing

Autonomous driving is advancing at a fast pace, with driving algorithms becoming more and more accurate and reliable. Despite this, it is of the utmost importance to develop models that offer a certain degree of explainability in order to be trusted, understood and accepted by researchers and, especially, society. In this work we present a conditional imitation learning agent based on a visual attention mechanism, designed to provide visually explainable decisions. We propose different variations of the method, relying on end-to-end trainable region proposal functions that generate regions of interest to be weighed by an attention module. We show that visual attention can improve driving capabilities while at the same time providing explainable decisions.
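The core idea of weighing region proposals with an attention module can be sketched as follows. This is a simplified illustration, not the paper's architecture: the scoring vector stands in for a learned scorer, and the returned weights are what make the decision visually inspectable.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend_regions(region_feats, w, b=0.0):
    """Weigh region-of-interest features with a scalar attention score each.

    region_feats: (n_regions, d) features of the proposed regions
    w: (d,) scoring vector (a stand-in for a learned linear scorer)
    Returns the normalized attention weights (one per region, usable to
    visualize which regions drove the decision) and the attention-pooled
    feature that a driving policy head would consume.
    """
    scores = region_feats @ w + b       # one relevance score per region
    weights = softmax(scores)           # normalized importance of each region
    pooled = weights @ region_feats     # (d,) context vector for the policy
    return weights, pooled

rng = np.random.default_rng(0)
feats = rng.standard_normal((5, 16))    # 5 proposed regions, 16-d features
weights, pooled = attend_regions(feats, rng.standard_normal(16))
print(weights.sum())  # ~1.0
```

Because the weights sum to one, they can be overlaid on the proposed regions as a heatmap, which is the sense in which the decisions are "explainable by design".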


Addressing Limitations of State-Aware Imitation Learning for Autonomous Driving
  • January 2023 · 22 Reads · 3 Citations
  • IEEE Transactions on Intelligent Vehicles

Conditional imitation learning is a common and effective approach to train autonomous driving agents. However, two issues limit the full potential of this approach: (i) the inertia problem, a special case of causal confusion in which the agent mistakenly correlates low speed with no acceleration, and (ii) low correlation between offline and online performance due to the accumulation of small errors that brings the agent into previously unseen states. Both issues are critical for state-aware models, yet informing the driving agent of its internal state as well as the state of the environment is of crucial importance. In this paper we propose a multi-task learning agent based on a multi-stage vision transformer with state token propagation. We feed the state of the vehicle, along with the representation of the environment, as a special token of the transformer and propagate it throughout the network. This allows us to tackle the aforementioned issues from different angles: guiding the driving policy with learned stop/go information, performing data augmentation directly on the state of the vehicle, and visually explaining the model's decisions. We report a drastic decrease in inertia and a high correlation between offline and online metrics.
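The "special token" mechanism above can be sketched minimally: the vehicle state is embedded and prepended to the visual patch tokens, after which a transformer would propagate it through every stage. This is only an illustration of the token layout; the embedding matrix, state contents, and dimensions are placeholders, not the paper's trained model.

```python
import numpy as np

def with_state_token(patch_tokens, vehicle_state, W_embed):
    """Prepend the vehicle state as a special token, transformer-style.

    patch_tokens:  (n_patches, d) visual tokens representing the environment
    vehicle_state: (s,) e.g. current speed and a high-level command
    W_embed:       (s, d) placeholder for a learned state-embedding matrix
    Returns the (n_patches + 1, d) token sequence fed to the transformer;
    augmenting the state (e.g. perturbing the speed) only touches row 0.
    """
    state_token = vehicle_state @ W_embed                   # (d,) embedded state
    return np.vstack([state_token[None, :], patch_tokens])  # (n + 1, d)

rng = np.random.default_rng(0)
tokens = rng.standard_normal((49, 32))   # e.g. a 7x7 patch grid, 32-d tokens
state = np.array([0.4, 1.0])             # illustrative: speed, stop/go flag
seq = with_state_token(tokens, state, rng.standard_normal((2, 32)))
print(seq.shape)  # (50, 32)
```

Keeping the state in a dedicated token is what makes the state-level data augmentation cheap: the visual tokens are untouched while the state token is resampled.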


Automatic Estimation of Self-Reported Pain by Trajectory Analysis in the Manifold of Fixed Rank Positive Semi-Definite Matrices
  • October 2022 · 28 Reads · 9 Citations
  • IEEE Transactions on Affective Computing

We propose an automatic method to estimate self-reported pain intensity based on facial landmarks extracted from videos. For each video sequence, we decompose the face into four different regions and pain intensity is measured by modeling the dynamics of facial movement using the landmarks of these regions. A formulation based on Gram matrices is used to represent the trajectory of facial landmarks on the Riemannian manifold of symmetric positive semi-definite matrices of fixed rank. A curve fitting algorithm is then used to smooth the trajectories and a temporal alignment is performed to compute the similarity between the trajectories on the manifold. A Support Vector Regression model is then trained to encode the extracted trajectories into pain intensity levels consistent with the self-reported pain intensity measurement. Finally, a late fusion of the estimation for each region is performed to obtain the final predicted pain intensity level. The proposed approach is evaluated on two publicly available databases, the UNBC-McMaster Shoulder Pain Archive and the BioVid Heat Pain database. We compared our method to the state-of-the-art on both databases using different testing protocols, showing the competitiveness of the proposed approach.
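The Gram-matrix representation at the heart of this method can be sketched directly: each frame's centered landmark configuration P maps to G = P Pᵀ, a positive semi-definite matrix whose rank is at most the landmark dimension, so a video becomes a trajectory on the fixed-rank PSD manifold. The sketch below is illustrative (random points in place of detected landmarks), not the authors' pipeline.

```python
import numpy as np

def gram_trajectory(landmark_seq):
    """Map a sequence of 2D facial-landmark configurations to Gram matrices.

    landmark_seq: (T, n, 2) landmark coordinates per frame. Each frame is
    centered (translation removed) and mapped to G = P @ P.T, an n x n
    positive semi-definite matrix of rank at most 2. G is also invariant
    to in-plane rotations of P, since (P R)(P R).T = P P.T for rotations R.
    """
    grams = []
    for P in landmark_seq:
        P = P - P.mean(axis=0)       # remove translation
        grams.append(P @ P.T)        # Gram matrix of the configuration
    return np.stack(grams)           # (T, n, n) trajectory on the manifold

rng = np.random.default_rng(0)
seq = rng.standard_normal((5, 10, 2))    # 5 frames, 10 landmarks each
G = gram_trajectory(seq)
ranks = [np.linalg.matrix_rank(g) for g in G]
print(G.shape, max(ranks))  # (5, 10, 10), rank <= 2
```

Curve fitting and temporal alignment then operate on such (T, n, n) trajectories using the manifold's geometry rather than plain Euclidean distances.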


Automatic Estimation of Self-Reported Pain by Trajectory Analysis in the Manifold of Fixed Rank Positive Semi-Definite Matrices
  • September 2022 · 66 Reads
  • IEEE Transactions on Affective Computing

We propose an automatic method to estimate self-reported pain based on facial landmarks extracted from videos. For each video sequence, we decompose the face into four different regions and the pain intensity is measured by modeling the dynamics of facial movement using the landmarks of these regions. A formulation based on Gram matrices is used for representing the trajectory of landmarks on the Riemannian manifold of symmetric positive semi-definite matrices of fixed rank. A curve fitting algorithm is used to smooth the trajectories and temporal alignment is performed to compute the similarity between the trajectories on the manifold. A Support Vector Regression model is then trained to encode extracted trajectories into pain intensity levels consistent with the self-reported pain intensity measurement. Finally, a late fusion of the estimation for each region is performed to obtain the final predicted pain level. The proposed approach is evaluated on two publicly available datasets, the UNBC-McMaster Shoulder Pain Archive and the BioVid Heat Pain dataset. We compared our method to the state-of-the-art on both datasets using different testing protocols, showing the competitiveness of the proposed approach.


Automatic Estimation of Self-Reported Pain by Trajectory Analysis in the Manifold of Fixed Rank Positive Semi-Definite Matrices
  • September 2022 · 46 Reads

We propose an automatic method to estimate self-reported pain based on facial landmarks extracted from videos. For each video sequence, we decompose the face into four different regions and the pain intensity is measured by modeling the dynamics of facial movement using the landmarks of these regions. A formulation based on Gram matrices is used for representing the trajectory of landmarks on the Riemannian manifold of symmetric positive semi-definite matrices of fixed rank. A curve fitting algorithm is used to smooth the trajectories and temporal alignment is performed to compute the similarity between the trajectories on the manifold. A Support Vector Regression model is then trained to encode extracted trajectories into pain intensity levels consistent with the self-reported pain intensity measurement. Finally, a late fusion of the estimation for each region is performed to obtain the final predicted pain level. The proposed approach is evaluated on two publicly available datasets, the UNBC-McMaster Shoulder Pain Archive and the BioVid Heat Pain dataset. We compared our method to the state-of-the-art on both datasets using different testing protocols, showing the competitiveness of the proposed approach.


Measuring 3D face deformations from RGB images of expression rehabilitation exercises
  • August 2022 · 19 Reads · 2 Citations
  • Virtual Reality & Intelligent Hardware

Background: The accurate, quantitative analysis of face deformations in 3D is a problem of increasing interest for its many potential applications. In particular, defining a 3D model of the face that can deform to a 2D target image, while capturing local and asymmetric deformations, is still a challenge in the existing literature. Computing a measure of such local deformations may provide a relevant index for monitoring the rehabilitation exercises used in Parkinson's and Alzheimer's disease or in recovering from a stroke.

Methods: In this study, we present a complete framework for constructing a 3D Morphable Shape Model (3DMM) of the face and fitting it to a target RGB image. The model has the specific characteristic of being based on localized components of deformation; the fitting transformation is performed from 3D to 2D and is guided by the correspondence between landmarks detected in the target image and landmarks manually annotated on the average 3DMM. The fitting also has the peculiarity of being performed in two steps, disentangling face deformations due to the identity of the target subject from those induced by facial actions.

Results: In the experimental validation of the method, we used the MICC-3D dataset, which includes 11 subjects, each acquired in one neutral pose plus 18 facial actions that deform the face in localized and asymmetric ways. For each acquisition, we fit the 3DMM to an RGB frame at the apex of a facial action and to the neutral frame, and computed the extent of the deformation. Results indicated that the proposed approach can accurately capture face deformations, even localized and asymmetric ones.

Conclusions: The proposed framework demonstrated the idea of measuring the deformations of a reconstructed 3D face model to monitor the facial actions performed in response to a set of target ones. Interestingly, these results were obtained using only RGB targets, without the need for 3D scans captured with costly devices. This opens the way to the use of the proposed tool for remote medical monitoring of rehabilitation.
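The landmark-guided fitting step can be illustrated as a regularized least-squares problem: find deformation coefficients for the localized components so the projected model landmarks match the landmarks detected in the image. This is a deliberately simplified sketch, assuming an orthographic projection and synthetic data; the real method also handles identity/expression disentanglement and camera estimation.

```python
import numpy as np

def fit_coefficients(mean_lms3d, components, target_lms2d, reg=1e-2):
    """Least-squares fit of 3DMM deformation coefficients to 2D landmarks.

    mean_lms3d:   (n, 3) landmarks annotated on the average model
    components:   (k, n, 3) localized deformation components
    target_lms2d: (n, 2) landmarks detected in the target RGB image
    Assumes an orthographic projection (drop z); `reg` is a ridge term.
    Solves min_a || proj(mean + sum_i a_i C_i) - target ||^2 + reg ||a||^2.
    """
    proj = lambda X: X[:, :2]                                    # orthographic
    A = np.stack([proj(C).ravel() for C in components], axis=1)  # (2n, k)
    b = (target_lms2d - proj(mean_lms3d)).ravel()                # (2n,)
    k = A.shape[1]
    return np.linalg.solve(A.T @ A + reg * np.eye(k), A.T @ b)   # (k,)

rng = np.random.default_rng(0)
mean = rng.standard_normal((10, 3))
comps = rng.standard_normal((3, 10, 3))
true_a = np.array([0.5, -0.2, 0.1])
# Synthesize a target from known coefficients, then recover them.
target = (mean + np.tensordot(true_a, comps, axes=1))[:, :2]
alpha = fit_coefficients(mean, comps, target, reg=1e-8)
print(np.round(alpha, 2))  # close to [0.5, -0.2, 0.1]
```

Because the components are localized, the magnitudes of the recovered coefficients directly quantify how much each face region deformed, which is the measurement the rehabilitation monitoring relies on.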


Citations (21)


... The rise of large language models such as GPT-4.5, QwenLM, and DeepSeek, along with virtual reality technologies, has driven affective computing beyond traditional single-modal analyses to multimodal data fusion [1,2], improving both the accuracy of cultural heritage communication and user immersion [3,4]. ...

Reference:

Affective-Computing-Driven Personalized Display of Cultural Information for Commercial Heritage Architecture
Personalized Generative Storytelling with AI-Visual Illustrations for the Promotion of Knowledge in Cultural Heritage Tourism
  • Citing Conference Paper
  • October 2024

... Furthermore, another study [11] emphasizes the benefits of visual attention in enhancing model interpretability and decision-making. ViT [12] is also implemented with multi-task learning to address key limitations in state-aware imitation learning, such as compounding errors and offline-online performance gaps. By leveraging state token propagation, these models improve performance handling of unseen states and low inertia problems. ...

Addressing Limitations of State-Aware Imitation Learning for Autonomous Driving

IEEE Transactions on Intelligent Vehicles

... In the realm of 3D face mesh acquisition for datasets collection, methodologies vary from extracting geometry from images or videos [32], [71] or generating synthetic data [47], to employing specialized scanners in controlled environments [18], [20], [23], [42], [50], [54], [68], [72]. While methods that follow the former paradigm are simpler and cost-effective, and typically result in meshes with known topology, they may not capture complete 3D information with the necessary fidelity. ...

The Florence multi-resolution 3D facial expression dataset
  • Citing Article
  • October 2023

Pattern Recognition Letters

... Human-inspired attention mechanisms [8] that dynamically adjust focus based on situational familiarity, similar to how drivers prioritize potentially dangerous situations, have been explored in autonomous driving research [9]. Additionally, methods using feature extractors and gating mechanisms have shown promise in optimizing sensor input to reduce computational inference time [10]. ...

Explaining autonomous driving with visual attention and end-to-end trainable region proposals

Journal of Ambient Intelligence and Humanized Computing

... Current methods for FEQA in healthcare primarily rely on RGB video input, and analyse its spatio-temporal features [47], [37], [6], [19], [7], [26], [27], [52]. To increase accuracy, while reducing computational overheads, others have explored additional modalities to either complement [23], [46], [43], [18], [16], [32], [15], [13], [40] or replace [54], [33], [44], [20], [35] RGB features with more focused features, such as optical flow [46], [40], action units (AUs) [54], [33], [32], [15], [35], and facial landmarks [23], [18], [43], [16], [44], [20], [13]. ...

Automatic Estimation of Self-Reported Pain by Trajectory Analysis in the Manifold of Fixed Rank Positive Semi-Definite Matrices
  • Citing Article
  • October 2022

IEEE Transactions on Affective Computing

... The study shows the suitability for color-based grading of the components. Ferrari et al. [28] worked on the 3D face analysis of the captured image. A framework was developed to measure the deformations that occurred in a 3D face model. ...

Measuring 3D face deformations from RGB images of expression rehabilitation exercises
  • Citing Article
  • August 2022

Virtual Reality & Intelligent Hardware

... We present this comparison with various existing fall detection models in terms of accuracy and other metrics. For instance, Kwolek et al. [46] and Youssfi et al. [69] used hand-crafted features from skeleton and depth data, achieving accuracies of 94.28% and 96.55%, respectively, through SVM methods. Cai et al. [17] employed a CNN-based encoder-decoder system, achieving an accuracy of 90.50%. ...

Fall Detection of Elderly People Using the Manifold of Positive Semidefinite Matrices

... It also applies statistical analysis and restrictions based on the shapes and sizes of human faces in order to construct 3D human faces automatically. Face reconstruction is a leading field in the science and technology industry today [5]; therefore, many more powerful methods have been developed, including the application of multistage 3D-mapping algorithms to deformable models to ensure high accuracy and speed of the process [6], the development of cluster-based face recognition techniques [7], and fully automated methods for transferring their dense semantic annotations to the original 3D faces to establish dense correspondences between them [8]. At present, most operators of 3D face reconstruction technology have stopped using traditional expensive 3D scanners and instead adopted the more modern method of using a CNN to estimate the parameters of a face in three dimensions, which can then be used to reconstruct the face in real-time. ...

A Sparse and Locally Coherent Morphable Face Model for Dense Semantic Correspondence Across Heterogeneous 3D Faces
  • Citing Article
  • June 2021

IEEE Transactions on Pattern Analysis and Machine Intelligence

... Lastly, we also considered the temporal variations of facial expressions with each frame representing a unique variation of pain expression. This ensures that the model learns from diverse facial movements rather than memorizing specific identities [46]. ...

Automatic Estimation of Self-Reported Pain by Interpretable Representations of Motion Dynamics
  • Citing Conference Paper
  • January 2021

... Curves in homogeneous spaces were used for animation purposes in [1]. We improve the implementations of temporal alignment procedures introduced in [9,7,4,2,3]. Here our main concern is to align motions in order to be able to display them in a synchronized manner. ...

Modelling the Statistics of Cyclic Activities by Trajectory Analysis on the Manifold of Positive-Semi-Definite Matrices
  • Citing Conference Paper
  • November 2020