Sergio EscaleraUniversity of Barcelona | UB · Facultad de Matemáticas
Sergio Escalera
Professor
Univ. de Barcelona. Computer Vision Center. Aalborg University. ELLIS & IAPR & AAIA Fellow. ChaLearn Looking at People.
About
532
Publications
205,869
Reads
How we measure 'reads'
A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more
14,816
Citations
Introduction
Prof. Sergio Escalera is vice-president of ChaLearn, co-creator of Codalab platform for challenges organization, Fellow of ELLIS. He received a CVPR best paper award and a CVPR outstanding reviewer award. He has been General co-Chair of FG. His research interests include automatic analysis of humans from visual and multi-modal data, with special interest in inclusive, transparent, and fair affective computing and people characterization.
www.sergioescalera.com
http://chalearnlap.cvc.uab.es/
Additional affiliations
January 2015 - present
January 2013 - present
ChaLearn challenges in machine learning
Position
- Research Director
Description
- Organization of scientific events: http://gesture.chalearn.org/
January 2008 - present
Publications
Publications (532)
Background: Depression and anxiety are among the leading causes of disability worldwide, significantly impacting workplace productivity through absenteeism and presenteeism. The MetrikaMind platform offers a scalable, digital solution for addressing these challenges by providing personalized, AI-driven mental health assessments and real-time monito...
The rising interest in leveraging higher-order interactions present in complex systems has led to a surge in more expressive models exploiting high-order structures in the data, especially in topological deep learning (TDL), which designs neural networks on high-order domains such as simplicial complexes. However, progress in this field is hindered...
We present Agglomerative Token Clustering (ATC), a novel token merging method that consistently outperforms previous token merging and pruning methods across image classification, image synthesis, and object detection & segmentation tasks. ATC merges clusters through bottom-up hierarchical clustering, without the introduction of extra learnable par...
The SoccerNet 2024 challenges represent the fourth annual video understanding challenges organized by the SoccerNet team. These challenges aim to advance research across multiple themes in football, including broadcast video understanding, field understanding, and player understanding. This year, the challenges encompass four vision-based tasks. (1...
The SoccerNet 2023 challenges were the third annual video understanding challenges organized by the SoccerNet team. For this third edition, the challenges were composed of seven vision-based tasks split into three main themes. The first theme, broadcast video understanding, is composed of three high-level tasks related to describing events occurrin...
The ability to detect unfamiliar or unexpected images is essential for safe deployment of computer vision systems. In the context of classification the task of detecting images outside of a model's training domain is known as out-of-distribution (OOD) detection. While there has been a growing research interest in developing post-hoc OOD detection m...
Topological Deep Learning seeks to enhance the predictive performance of neural network models by harnessing topological structures in input data. Topological neural networks operate on spaces such as cell complexes and hypergraphs, that can be seen as generalizations of graphs. In this work, we introduce the Cellular Transformer (CT), a novel arch...
Sign Language Recognition (SLR) has garnered significant attention from researchers in recent years, particularly the intricate domain of Continuous Sign Language Recognition (CSLR), which presents heightened complexity compared to Isolated Sign Language Recognition (ISLR). One of the prominent challenges in CSLR pertains to accurately detecting th...
We introduce TopoX, a Python software suite that provides reliable and user-friendly building blocks for computing and machine learning on topological domains that extend graphs: hypergraphs, simplicial, cellular, path and combinatorial complexes. TopoX consists of three packages: TopoNetX facilitates constructing and computing on these domains, in...
Hajij et al. We introduce TopoX, a Python software suite that provides reliable and user-friendly building blocks for computing and machine learning on topological domains that extend graphs: hypergraphs, simplicial, cellular, path and combinatorial complexes. TopoX consists of three packages: TopoNetX facilitates constructing and computing on thes...
This paper presents the computational challenge on topological deep learning that was hosted within the ICML 2023 Workshop on Topology and Geometry in Machine Learning. The competition asked participants to provide open-source implementations of topological neural networks from the literature by contributing to the python packages TopoNetX (data pr...
The EMPATHIC project aimed to design an emotionally expressive virtual coach capable of engaging healthy seniors to improve well-being and promote independent aging. One of the core aspects of the system is its human sensing capabilities, allowing for the perception of emotional states to provide a personalized experience. This paper outlines the d...
Objectives
To assess the feasibility of extracting radiomics signal intensity based features from the myocardium using cardiovascular magnetic resonance (CMR) imaging stress perfusion sequences. Furthermore, to compare the diagnostic performance of radiomics models against standard-of-care qualitative visual assessment of stress perfusion images, w...
Deployments of real-world object detection systems often experience a degradation in performance over time due to concept drift. Systems that leverage thermal cameras are especially susceptible because the respective thermal signatures of objects and their surroundings are highly sensitive to environmental changes. In this study, two types of weath...
Deployments of real-world object-detection systems often experience a degradation in performance over time due to concept drift. Systems that leverage thermal cameras are especially susceptible because the respective thermal signatures of objects and their surroundings are highly sensitive to environmental changes. In this study, a conditioning met...
Deep neural networks have demonstrated the ability to outperform humans in multiple tasks, but they often require substantial amounts of data and computational resources. These resources may be limited in certain fields. Meta-learning seeks to overcome these challenges by utilizing past task experiences to efficiently solve new tasks, achieving bet...
We propose a novel way to improve the generalisation capacity of deep learning models by reducing high correlations between neurons. For this, we present two regularisation terms computed from the weights of a minimum spanning tree of the clique whose vertices are the neurons of a given network (or a sample of those), where weights on edges are cor...
Since the introduction of the Vision Transformer (ViT), researchers have sought to make ViTs more efficient by removing redundant information in the processed tokens. While different methods have been explored to achieve this goal, we still lack understanding of the resulting reduction patterns and how those patterns differ across token reduction m...
Sign Language Translation (SLT) is a challenging task due to its cross-domain nature, involving the translation of visual-gestural language to text. Many previous methods employ an intermediate representation, i.e., gloss sequences, to facilitate SLT, thus transforming it into a two-stage task of sign language recognition (SLR) followed by sign lan...
Through the release of three large-scale datasets and the successful holding of three competitions, we have promoted the development of the face anti-spoofing community. In this chapter, we will summarize our work in recent years from two aspects of datasets and competitions, including the characteristics of datasets CASIA-SURF, CASIA-SURF CeFA and...
In recent years, the security of face recognition systems has been increasingly threatened. Face Anti-spoofing (FAS) is essential to secure face recognition systems primarily from various attacks. In order to attract researchers and push forward the state of the art in Face Presentation Attack Detection (PAD), we organized three editions of Face An...
In this chapter, we first report the results obtained by each team that has participated in the face anti-spoofing challenge series, including ablation study results when available. Then, we analyze the advantages and disadvantages of the analyzed methods based on the experimental results. Finally, we outline the common characteristics we identifie...
The PAD competitions we organized attracted more than 835 teams from home and abroad, most of them from the industry, which shows that the topic of face anti-spoofing is closely related to daily life, and there is an urgent need for advanced algorithms to solve its application needs. Specifically, the Chalearn LAP multi-modal face anti-spoofing att...
With the ubiquity of facial authentication systems and the prevalence of security cameras around the world, the impact that facial presentation attack techniques may have is huge. However, research progress in this field has been slowed by a number of factors, including the lack of appropriate and realistic datasets, ethical and privacy issues that...
While there has been a growing research interest in developing out-of-distribution (OOD) detection methods, there has been comparably little discussion around how these methods should be evaluated. Given their relevance for safe(r) AI, it is important to examine whether the basis for comparing OOD detection methods is consistent with practical need...
The Multi-modal Multiple Appropriate Facial Reaction Generation Challenge (REACT2023) is the first competition event focused on evaluating multimedia processing and machine learning techniques for generating human-appropriate facial reactions in various dyadic interaction scenarios, with all participants competing strictly under the same conditions...
While there has been a growing research interest in developing out-of-distribution (OOD) detection methods, there has been comparably little discussion around how these methods should be evaluated. Given their relevance for safe(r) AI, it is important to examine whether the basis for comparing OOD detection methods is consistent with practical need...
The power requirements of video-oculography systems can be prohibitive for high-speed operation on portable devices. Recently, low-power alternatives such as photosensors have been evaluated, providing gaze estimates at high frequency with a trade-off in accuracy and robustness. Potentially, an approach combining slow/high-fidelity and fast/low-fid...
Gesture Recognition (GR) is a challenging research area in computer vision. To tackle the annotation bottleneck in GR, we formulate the problem of Zero-Shot Gesture Recognition (ZS-GR) and propose a two-stream model from two input modalities: RGB and Depth videos. To benefit from the vision Transformer capabilities, we use two vision Transformer mo...
Convolutional Neural Networks are the de facto models for image recognition. However 3D CNNs, the straight forward extension of 2D CNNs for video recognition, have not achieved the same success on standard action recognition benchmarks. One of the main reasons for this reduced performance of 3D CNNs is the increased computational complexity requiri...
In recent years, several deep learning models have been proposed to accurately quantify and diagnose cardiac pathologies. These automated tools heavily rely on the accurate segmentation of cardiac structures in MRI images. However, segmentation of the right ventricle is challenging due to its highly complex shape and ill-defined borders. Hence, the...
Face Anti-spoofing (FAS) is essential to secure face recognition systems from various physical attacks. However, most of the studies lacked consideration of long-distance scenarios. Specifically, compared with FAS in traditional scenes such as phone unlocking, face payment, and self-service security inspection, FAS in long-distance such as station...
Face anti-spoofing (FAS) is an essential mechanism for safeguarding the integrity of automated face recognition systems. Despite substantial advancements, the generalization of existing approaches to real-world applications remains challenging. This limitation can be attributed to the scarcity and lack of diversity in publicly available FAS dataset...
Assessment of myocardial viability is essential in diagnosis and treatment management of patients suffering from myocardial infarction, and classification of pathology on the myocardium is the key to this assessment. This work defines a new task of medical image analysis, i.e., to perform myocardial pathology segmentation (MyoPS) combining three-se...
In thermal video security monitoring the reliability of deployed systems rely on having varied training data that can effectively generalize and have consistent performance in the deployed context. However, for security monitoring of an outdoor environment the amount of variation introduced to the imaging system would require extensive annotated da...
We introduce Meta-Album, an image classification meta-dataset designed to facilitate few-shot learning, transfer learning, meta-learning, among other tasks. It includes 40 open datasets, each having at least 20 classes with 40 examples per class, with verified licences. They stem from diverse domains, such as ecology (fauna and flora), manufacturin...
The ECCV 2022 Sign Spotting Challenge focused on the problem of fine-grain sign spotting for continuous sign language recognition. We have released and made publicly available a new dataset of Spanish sign language of around 10 h of video data in the health domain performed by 7 deaf people and 3 interpreters. The added value of this dataset over e...
Transformer models have shown great success handling long-range interactions, making them a promising tool for modeling video. However, they lack inductive biases and scale quadratically with input length. These limitations are further exacerbated when dealing with the high dimensionality introduced by the temporal dimension. While there are survey...
Face Anti-spoofing (FAS) is essential to secure face recognition systems from various physical attacks. However, recent research generally focuses on short-distance applications (i.e., phone unlocking) while lacking consideration of long-distance scenes (i.e., surveillance security checks). In order to promote relevant research and fill this gap in...
Face Anti-spoofing (FAS) is essential to secure face recognition systems from various physical attacks. However, recent research generally focuses on short-distance applications (
i.e
., phone unlocking) while lacking consideration of long-distance scenes (
i.e
., surveillance security checks). In order to promote relevant research and fill this...
We present a general framework for the garment animation problem through unsupervised deep learning inspired in physically based simulation. Existing trends in the literature already explore this possibility. Nonetheless, these approaches do not handle cloth dynamics. Here, we propose the first methodology able to learn realistic cloth dynamics uns...
A crucial part of image classification consists of capturing non-local spatial semantics of image content. This paper describes the multi-scale hybrid vision transformer (MSHViT), an extension of the classical convolutional neural network (CNN) backbone, for multi-label sewer defect classification. To better model spatial semantics in the images, f...
We present a general framework for the garment animation problem through unsupervised deep learning inspired in physically based simulation. Existing trends in the literature already explore this possibility. Nonetheless, these approaches do not handle cloth dynamics. Here, we propose the first methodology able to learn realistic cloth dynamics uns...
Stochastic human motion prediction (HMP) has generally been tackled with generative adversarial networks and variational autoencoders. Most prior works aim at predicting highly diverse movements in terms of the skeleton joints' dispersion. This has led to methods predicting fast and motion-divergent movements, which are often unrealistic and incohe...
Memes evolve and mutate through their diffusion in social media. They have the potential to propagate ideas and, by extension, products. Many studies have focused on memes, but none so far, to our knowledge, on the users that post them, their relationships, and the reach of their influence. In this article, we define a meme influence graph together...
The SoccerNet 2022 challenges were the second annual video understanding challenges organized by the SoccerNet team. In 2022, the challenges were composed of 6 vision-based tasks: (1) action spotting, focusing on retrieving action timestamps in long untrimmed videos, (2) replay grounding, focusing on retrieving the live moment of an action shown in...
Domain Adaptation (DA) has recently been of strong interest in the medical imaging community. While a large variety of DA techniques have been proposed for image segmentation, most of these techniques have been validated either on private datasets or on small publicly available datasets. Moreover, these datasets mostly addressed single-class proble...
We present the design and baseline results for a new challenge in the ChaLearn meta-learning series, accepted at NeurIPS'22, focusing on "cross-domain" meta-learning. Meta-learning aims to leverage experience gained from previous tasks to solve new tasks efficiently (i.e., with better performance, little training data, and/or modest computational r...
In this paper, we present a novel hand –based Video Question Answering framework, entitled Multi-View Video Question Answering (MV-VQA), employing the Single Shot Detector (SSD), Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), Bidirectional Encoder Representations from Transformers (BERT), and Co-Attention mechanism with RGB vide...