Giuseppe Serra
Università degli Studi di Modena e Reggio Emilia | UNIMO · Imagelab

PhD

About

112 Publications
31,897 Reads
3,751 Citations
Introduction
Giuseppe Serra was awarded the title of Doctor of Philosophy in Computer Engineering, Multimedia and Telecommunications in 2010 from the MICC Center. He was a visiting scholar at Carnegie Mellon University, USA, and at Telecom ParisTech/ENST, Paris, in 2006 and 2010 respectively. His research interests include image and video analysis, multimedia ontologies and multiple-view geometry. He received the Best Paper Award at the ACM SIGMM Workshop on Social Media in 2010.
Additional affiliations
January 2013 - present: Università degli Studi di Modena e Reggio Emilia (University of Modena and Reggio Emilia)
January 2007 - December 2012: Università degli Studi di Firenze

Education
January 2007 - December 2009: University of Florence, PhD in Computer Engineering, Multimedia and Telecommunication
September 1999 - October 2006: University of Florence, Laurea (MS) in Computer Engineering

Publications (112)
Article
Accurately estimating the amount of productive time remaining to a machine before it suffers a fault condition is a fundamental problem, which is commonly found in several contexts (e.g. mechanical systems) and can have great industrial, societal, and safety-related consequences. Recently, deep learning showed promising results and significant impr...
Preprint
Full-text available
This paper describes the models developed by the AILAB-Udine team for the SMM4H 22 Shared Task. We explored the limits of Transformer based models on text classification, entity extraction and entity normalization, tackling Tasks 1, 2, 5, 6 and 10. The main take-aways we got from participating in different tasks are: the overwhelming positive effec...
Preprint
Full-text available
In the last decade, an increasing number of users have started reporting Adverse Drug Events (ADE) on social media platforms, blogs, and health forums. Given the large volume of reports, pharmacovigilance has focused on ways to use Natural Language Processing (NLP) techniques to rapidly examine these large collections of text, detecting mentions of...
Preprint
Every hour, huge amounts of visual contents are posted on social media and user-generated content platforms. To find relevant videos by means of a natural language query, text-video retrieval methods have received increased attention over the past few years. Data augmentation techniques were introduced to increase the performance on unseen test exa...
Preprint
This report presents the technical details of our submission to the EPIC-Kitchens-100 Multi-Instance Retrieval Challenge 2022. To participate in the challenge, we designed an ensemble consisting of different models trained with two recently developed relevance-augmented versions of the widely used triplet loss. Our submission, visible on the public...
Preprint
Video retrieval using natural language queries has attracted increasing interest due to its relevance in real-world applications, from intelligent access in private media galleries to web-scale video search. Learning the cross-similarity of video and text in a joint embedding space is the dominant approach. To do so, a contrastive loss is usually e...
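The joint-embedding approach described above is usually trained with a max-margin contrastive loss over a batch of matched video-caption pairs. A minimal numpy sketch of one common hinge-based, bidirectional variant (function and variable names are ours, not the paper's):

```python
import numpy as np

def contrastive_loss(video_emb, text_emb, margin=0.2):
    """Bidirectional max-margin contrastive loss over a batch.

    video_emb, text_emb: (N, D) L2-normalized embeddings; row i of each is a
    matching video/caption pair, and all other rows act as negatives.
    """
    sim = video_emb @ text_emb.T          # (N, N) cosine similarities
    pos = np.diag(sim)                    # matched-pair similarities
    # hinge: each negative should score at least `margin` below the positive
    cost_v2t = np.maximum(0, margin + sim - pos[:, None])  # video -> texts
    cost_t2v = np.maximum(0, margin + sim - pos[None, :])  # text -> videos
    n = sim.shape[0]
    mask = 1.0 - np.eye(n)                # ignore the positive on the diagonal
    return float((cost_v2t * mask + cost_t2v * mask).sum() / n)
```

With perfectly aligned embeddings the loss is zero; any mismatched pair whose similarity comes within `margin` of the positive contributes to the cost.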
Preprint
Full-text available
Due to the amount of videos and related captions uploaded every hour, deep learning-based solutions for cross-modal video retrieval are attracting more and more attention. A typical approach consists in learning a joint text-video embedding space, where the similarity of a video and its associated caption is maximized, whereas a lower similarity is...
Article
Background: In the current phase of the COVID-19 pandemic, we are witnessing the most massive vaccine rollout in human history. Like any other drug, vaccines may cause unexpected side effects, which need to be timely investigated to minimize harm in the population. If not properly dealt with, side effects may also impact the public trust in the va...
Preprint
Background: In the current phase of the COVID-19 pandemic, we are witnessing the most massive vaccine rollout in human history. Like any other drug, vaccines may cause unexpected side effects, which need to be investigated in a timely manner to minimize harm in the population. If not properly dealt with, side effects may also impact public trust in...
Preprint
Full-text available
Adverse Drug Event (ADE) extraction models can rapidly examine large collections of social media texts, detecting mentions of drug-related adverse reactions and triggering medical investigations. However, despite the recent advances in NLP, it is currently unknown if such models are robust in the face of negation, which is pervasive across language varieti...
Article
Full-text available
Recently, the misinformation problem has been addressed with a crowdsourcing-based approach: to assess the truthfulness of a statement, instead of relying on a few experts, a crowd of non-experts is exploited. We study whether crowdsourcing is an effective and reliable method to assess truthfulness during a pandemic, targeting statements related to...
Preprint
Full-text available
Recently, the misinformation problem has been addressed with a crowdsourcing-based approach: to assess the truthfulness of a statement, instead of relying on a few experts, a crowd of non-experts is exploited. We study whether crowdsourcing is an effective and reliable method to assess truthfulness during a pandemic, targeting statements related to...
Preprint
Full-text available
In recent years, Internet users have been reporting Adverse Drug Events (ADE) on social media, blogs and health forums. Because of the large volume of reports, pharmacovigilance is looking to NLP to monitor these outlets. We propose for the first time the use of the SpanBERT architecture for the task of ADE extraction: this new version of the p...
Article
Full-text available
Keyphrase Generation is the task of predicting keyphrases: short text sequences that convey the main semantic meaning of a document. In this paper, we introduce a keyphrase generation approach that makes use of a Generative Adversarial Networks (GANs) architecture. In our system, the Generator produces a sequence of keyphrases for an input document...
Preprint
Full-text available
Video Question Answering (VideoQA) is a task that requires a model to analyze and understand both the visual content given by the input video and the textual part given by the question, and the interaction between them in order to produce a meaningful answer. In our work we focus on the Egocentric VideoQA task, which exploits first-person videos, b...
Chapter
Text-to-Image Synthesis refers to the process of automatic generation of a photo-realistic image starting from a given text and is revolutionizing many real-world applications. In order to perform such a process it is necessary to exploit datasets containing captioned images, meaning that each image is associated with one (or more) captions describin...
Chapter
The Internet of Things, whose main goal is to automatically predict users' desires, can find very interesting opportunities in the art and culture field, as tourism is one of the main driving engines of modern society. Currently, the innovation process in this field is growing at a slower pace, so the cultural heritage is a prerogative of a...
Chapter
Video Question Answering (VideoQA) is a task that requires a model to analyze and understand both the visual content given by the input video and the textual part given by the question, and the interaction between them in order to produce a meaningful answer. In our work we focus on the Egocentric VideoQA task, which exploits first-person videos, b...
Preprint
Full-text available
Text to Image Synthesis refers to the process of automatic generation of a photo-realistic image starting from a given text and is revolutionizing many real-world applications. In order to perform such a process it is necessary to exploit datasets containing captioned images, meaning that each image is associated with one (or more) captions describin...
Chapter
In this chapter we address the problem of partitioning social gatherings into interacting groups in egocentric scenarios. People in the scene are tracked, and their head pose and 3D location are estimated. Following the formalism of the f-formation, we define, based on orientation and distance, inherently social pairwise features capable of des...
Conference Paper
Full-text available
The prediction of human eye fixations has been recently gaining a lot of attention thanks to the improvements shown by deep architectures. In our work, we go beyond classical feed-forward networks to predict saliency maps and propose a Saliency Attentive Model which incorporates neural attention mechanisms to iteratively refine predictions. Experim...
Article
Full-text available
Image captioning has been recently gaining a lot of attention thanks to the impressive achievements shown by deep captioning architectures, which combine Convolutional Neural Networks to extract image representations, and Recurrent Neural Networks to generate the corresponding captions. At the same time, a significant research effort has been dedic...
Article
In this paper we propose a new method for customized summarization of egocentric videos according to specific user preferences, so that different users can extract different summaries from the same stream. Our approach, tailored to the cultural heritage scenario, relies on creating a short synopsis of the original video focused on key shots, in which c...
Conference Paper
Full-text available
Image and video captioning are important tasks in visual data analytics, as they concern the capability of describing visual content in natural language. They are the pillars of query answering systems, improve indexing and search and allow a natural form of human-machine interaction. Even though promising deep learning strategies are becoming popu...
Article
Full-text available
Data-driven saliency has recently gained a lot of attention thanks to the use of Convolutional Neural Networks for predicting gaze fixations. In this paper we go beyond standard approaches to saliency prediction, in which gaze maps are computed with a feed-forward network, and we present a novel model which can predict accurate saliency maps by inc...
Conference Paper
Full-text available
With the increasing popularity of wearable cameras, such as GoPro or Narrative Clip, research on continuous activity monitoring from egocentric cameras has received a lot of attention. Research in hardware and software is devoted to find new efficient, stable and long-time running solutions; however, devices are too power-hungry for truly always-on...
Conference Paper
Full-text available
State of the art approaches for saliency prediction are based on Fully Convolutional Networks, in which saliency maps are built using the last layer. In contrast, we here present a novel model that predicts saliency maps exploiting a non-linear combination of features coming from different layers of the network. We also present a new loss function...
Conference Paper
Nowadays, egocentric wearable devices are becoming increasingly widespread among both the academic community and the general public. For this reason, methods capable of automatically segmenting a video based on the recorder's motion patterns are gaining attention. These devices present the unique opportunity of both high quality video recordin...
Conference Paper
Full-text available
This paper presents a novel deep architecture for saliency prediction. Current state of the art models for saliency prediction employ Fully Convolutional networks that perform a non-linear combination of features extracted from the last convolutional layer to predict saliency maps. We propose an architecture which, instead, combines features extrac...
Article
With the spread of wearable devices and head mounted cameras, a wide range of applications requiring precise user localization is now possible. In this paper we propose to treat the problem of obtaining the user position with respect to a known environment as a video registration problem. Video registration, i.e. the task of aligning an input video...
Conference Paper
With the spread of wearable cameras, many consumer applications ranging from social tagging to video summarization would greatly benefit from people re-identification methods capable of dealing with the egocentric perspective. In this regard, first-person camera views present such a unique setting that traditional re-identification methods results...
Conference Paper
With the spread of wearable and mobile devices, the request for interactive augmented reality applications is in constant growth. Among the different possibilities, we focus on the cultural heritage domain, where a key step in the development of applications for augmented cultural experiences is to obtain a precise localization of the user, i.e. the 6...
Article
The new technologies characterizing the Internet of Things (IoT) allow realizing real smart environments able to provide advanced services to the users. Recently, these smart environments have also been exploited to renew users' interest in cultural heritage, by guaranteeing real interactive cultural experiences. In this paper, we design...
Article
Among all the possible areas of applicability of ICT technologies, art and culture (identified by “Education & Tourism” in the aforementioned architecture) are becoming more and more interesting since they play an important role in human beings' lives. Over the centuries, hundreds of museums and art galleries have preserved our diverse cultural heri...
Article
Full-text available
Augmented user experiences in the cultural heritage domain are in increasing demand by the new digital native tourists of the 21st century. In this paper, we propose a novel solution that aims at assisting the visitor during an outdoor tour of a cultural site using the unique first person perspective of wearable cameras. In particular, the approach exp...
Conference Paper
Full-text available
Tracking objects moving around a person is one of the key steps in human visual augmentation: we could estimate their locations when they are out of our field of view, know their position, distance or velocity just to name a few possibilities. This is no easy task: in this paper, we show how current state-of-the-art visual tracking algorithms fail...
Conference Paper
Full-text available
In this paper we propose a novel approach for egocentric video personalization in a cultural experience scenario, based on automatic shot labelling according to different semantic dimensions, such as web-leveraged knowledge of the surrounding cultural Points Of Interest, information about stops and moves, both relying on geolocalization, and camera...
Article
Tagging of multimedia content is becoming more and more widespread as web 2.0 sites, like Flickr and Facebook for images, YouTube and Vimeo for videos, have popularized tagging functionalities among their users. These user-generated tags are used to ease browsing and exploration of media collections, e.g. using tag clouds, or to retrieve multimedia...
Conference Paper
Full-text available
The interest in cultural cities is in constant growth, and so is the demand for new multimedia tools and applications that enrich the visiting experience. In this paper we propose an egocentric vision system to enhance tourists' cultural heritage experience. Exploiting a wearable board and a glass-mounted camera, the visitor can retrieve architectural detai...
Conference Paper
Smart cities are a trending topic in both the academic literature and the industrial world. The capability to provide users with added-value services through low-power and low-cost smart objects is very attractive in many fields. Among these, art and culture represent very interesting examples, as tourism is one of the main driving engines of mo...
Conference Paper
Full-text available
Smart cities are a trending topic in both the academic literature and the industrial world. The capability to provide users with added-value services through low-power and low-cost smart objects is very attractive in many fields. Among these, art and culture represent very interesting examples, as tourism is one of the main driving engines of mo...
Article
Full-text available
We introduce a novel approach to the cultural heritage experience: by means of ego-vision embedded devices we develop a system which offers a more natural and entertaining way of accessing museum knowledge. Our method is based on distributed self-gesture and artwork recognition, and needs neither fixed cameras nor radio-frequency identification senso...
Article
The Bag of Words paradigm has been the baseline from which several successful image classification solutions were developed in the last decade. These represent images by quantizing local descriptors and summarizing their distribution. The quantization step introduces a dependency on the dataset, that even if in some contexts significantly boosts th...
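The quantize-and-summarize pipeline that the abstract describes can be sketched in a few lines of numpy: local descriptors are hard-assigned to their nearest visual word in a codebook learned offline (typically by k-means), and the image is represented by the normalized histogram of assignments. Function and variable names here are illustrative:

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Bag of Words image representation.

    descriptors: (N, D) local features (e.g. SIFT); codebook: (K, D) visual
    words learned offline on a training set, which is the dataset dependency
    the abstract refers to.
    """
    # squared Euclidean distance from every descriptor to every visual word
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)                     # hard assignment
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()                      # L1-normalized histogram
```

Soft assignment or other pooling schemes replace the `argmin`/`bincount` step, but the quantization against a learned codebook is what ties the representation to the training dataset.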
Article
In this paper we describe a system for automatically analyzing old documents and creating hyperlinks between different epochs, thus opening ancient documents to young people and making them available on the web with old and current content. We propose a supervised learning approach to segment text and illustrations of digitized old documents usi...
Conference Paper
Full-text available
This paper reports a novel approach to deal with the problem of Object and Scene recognition extending the traditional Bag of Words approach in two ways. Firstly, a dataset independent method of summarizing local features, based on multivariate Gaussian descriptors, is employed. Secondly, a recently proposed classification technique, particularly s...
Conference Paper
Full-text available
In this paper we present a new method for real-time head pose estimation in ego-vision scenarios, which is a key step in the understanding of social interactions. In order to robustly detect heads under changing aspect ratio, scale and orientation, we use and extend the Hough-Based Tracker, which allows us to simultaneously follow each subject in the scene...
Conference Paper
Full-text available
In this paper we present a novel approach to detect groups in ego-vision scenarios. People in the scene are tracked through the video sequence and their head pose and 3D location are estimated. Based on the concept of f-formation, we define with the orientation and distance an inherently social pairwise feature that describes the affinity of a pair...
Conference Paper
Full-text available
We present a novel method for monocular hand gesture recognition in ego-vision scenarios that deals with static and dynamic gestures and can achieve high accuracy results using a few positive samples. Specifically, we use and extend the dense trajectories approach that has been successfully introduced for action recognition. Dense features are extr...
Conference Paper
Full-text available
In this paper we propose a novel image descriptor built by computing pixel-level features on densely sampled patches and encoding them using their covariance. Appropriate projections to the Euclidean space and feature normalizations are employed in order to provide a strong descriptor usable with linear classifiers. In order to re...
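A hedged sketch of the general idea: the covariance of per-pixel features is a symmetric positive-definite matrix, and a standard way to project it to Euclidean space (so linear classifiers apply) is the matrix logarithm. The function name, regularization, and vectorization details below are our assumptions, not the paper's exact recipe:

```python
import numpy as np

def log_euclidean_covariance(patch_features, eps=1e-6):
    """Covariance descriptor of one patch, mapped to Euclidean space.

    patch_features: (P, D) array, one D-dimensional feature vector per pixel.
    Returns the upper triangle of log(C) as a flat vector.
    """
    c = np.cov(patch_features, rowvar=False)      # (D, D) covariance
    c += eps * np.eye(c.shape[0])                 # regularize to keep C SPD
    w, v = np.linalg.eigh(c)                      # symmetric eigendecomposition
    log_c = (v * np.log(w)) @ v.T                 # matrix logarithm of C
    iu = np.triu_indices(c.shape[0])
    return log_c[iu]                              # vectorized descriptor
```

Because log(C) is symmetric, keeping only its upper triangle loses no information, and Euclidean distances between these vectors approximate the log-Euclidean metric on the SPD manifold.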
Article
Full-text available
In this paper we present a novel method to improve the flexibility of descriptor matching for image recognition by using local multiresolution pyramids in feature space. We propose that image patches be represented at multiple levels of descriptor detail and that these levels be defined in terms of local spatial pooling resolution. Preserving multi...
Conference Paper
Full-text available
Portable devices for first-person camera views will play a central role in future interactive systems. One necessary step for feasible human-computer guided activities is gesture recognition, preceded by a reliable hand segmentation from egocentric vision. In this work we provide a novel hand segmentation algorithm based on Random Forest superpixel...
Conference Paper
Full-text available
Common techniques represent images by quantizing local descriptors and summarizing their distribution in a histogram. In this paper we propose to employ a parametric description and compare its capabilities to histogram based approaches. We use the multivariate Gaussian distribution, applied over the SIFT descriptors, extracted with dense sampling...
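In contrast with the histogram pipeline, the parametric description mentioned above can be sketched as fitting a single multivariate Gaussian to an image's densely sampled SIFT descriptors and using its parameters as the signature. This is a simplified illustration; the paper's actual embedding of the Gaussian may differ:

```python
import numpy as np

def gaussian_descriptor(sift, eps=1e-6):
    """Parametric alternative to the BoW histogram: summarize an image's
    (N, D) dense SIFT descriptors by the mean and covariance of a fitted
    multivariate Gaussian. No codebook is needed, so the representation is
    dataset-independent.
    """
    mu = sift.mean(axis=0)                              # (D,) mean vector
    cov = np.cov(sift, rowvar=False)                    # (D, D) covariance
    cov += eps * np.eye(sift.shape[1])                  # keep it well-posed
    iu = np.triu_indices(cov.shape[0])                  # symmetric: upper half
    return np.concatenate([mu, cov[iu]])                # flat image signature
```

The resulting vector has D + D(D+1)/2 entries and can be fed directly to a linear classifier, which is the comparison with histogram-based approaches that the abstract sets up.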