Conference Paper

The Unconstrained Ear Recognition Challenge 2023: Maximizing Performance and Minimizing Bias*

... The accuracy of ear recognition systems has improved significantly in the last decade [38], [12], primarily driven by advances in deep learning methods. The large Unconstrained Ear Recognition Challenge (UERC) datasets [22], [19] have played a critical role in reigniting interest, as they capture the variety of ear appearances encountered in real-world scenarios, such as variations in pose, illumination, occlusion, and resolution. ...
... This patch-based approach makes ViTs particularly flexible for various image recognition applications, including ear recognition. The ViTEar work [19] fine-tuned pre-trained DINOv2 Vision Transformers [37] with datasets like UERC2023 [19] and EarVN1.0 [27] to obtain a rank-1 accuracy of 96.27% on the UERC2019 dataset [22]. ...
Preprint
Full-text available
Ear recognition has emerged as a promising biometric modality due to the relative stability of ear appearance during adulthood. Although Vision Transformers (ViTs) have been widely used in image recognition tasks, their effectiveness in ear recognition has been hampered by a lack of attention to overlapping patches, which are crucial for capturing intricate ear features. In this study, we evaluate ViT-Tiny (ViT-T), ViT-Small (ViT-S), ViT-Base (ViT-B) and ViT-Large (ViT-L) configurations on a diverse set of datasets (OPIB, AWE, WPUT, and EarVN1.0), using an overlapping patch selection strategy. The results demonstrate the critical importance of overlapping patches, which yield superior performance in 44 of 48 experiments in a structured study. Moreover, the gains over the corresponding non-overlapping configurations are significant, reaching up to 10% on the EarVN1.0 dataset. In terms of model performance, the ViT-T model consistently outperformed the ViT-S, ViT-B, and ViT-L models on the AWE, WPUT, and EarVN1.0 datasets. The highest scores were achieved with a patch size of 28x28 pixels and a stride of 14 pixels; the patch side corresponds to 25% of the normalized image side (112x112 pixels) and the stride to 12.5% of the row or column size. This study confirms that transformer architectures with overlapping patch selection can serve as an efficient and high-performing option for ear-based biometric recognition in verification scenarios.
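As a concrete illustration of the overlapping patch strategy evaluated above, the sketch below extracts 28x28 patches with a 14-pixel stride from a 112x112 image using torch.nn.Unfold; the tensor names and the ViT-Tiny projection width are illustrative assumptions, not the authors' code.

```python
import torch

# Hedged sketch: overlapping patch extraction for a ViT-style model,
# assuming 112x112 inputs, 28x28 patches and a stride of 14 (as above).
patch, stride = 28, 14
x = torch.randn(1, 3, 112, 112)                  # dummy RGB ear image batch

unfold = torch.nn.Unfold(kernel_size=patch, stride=stride)
patches = unfold(x).transpose(1, 2)              # (1, 49, 2352): one token per patch
# ((112 - 28) // 14 + 1)**2 = 49 overlapping tokens vs. 16 non-overlapping ones

embed = torch.nn.Linear(3 * patch * patch, 192)  # 192 = ViT-Tiny embedding width
tokens = embed(patches)                          # (1, 49, 192) patch embeddings
```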
... Extensive evaluation on the Unconstrained Ear Recognition Challenge 2023 (UERC2023) benchmark [9] demonstrates that EdgeEar achieves performance comparable to state-of-the-art deep learning architectures across key metrics, such as Equal Error Rate, as shown in Fig. 1, while maintaining a significantly smaller parameter count. These findings highlight the potential of adapting edge-optimized face recognition architectures for ear recognition in resource-constrained scenarios. ...
... This has pushed research toward stronger feature representations and larger datasets to handle real-world variability. The Unconstrained Ear Recognition Challenge (UERC) [9], [10], [12] revealed the strengths and limits of current methods, stressing the need for approaches that generalize to new identities and enable open-set recognition. ...
... Two datasets were used during training: 1) UERC2023 [9], containing 1310 subjects with a total of 248,655 samples. 2) EarVN1.0 [15], containing 164 subjects of Asian descent with a total of 28,412 samples. ...
Preprint
Full-text available
Ear recognition is a contactless and unobtrusive biometric technique with applications across various domains. However, deploying high-performing ear recognition models on resource-constrained devices is challenging, limiting their applicability and widespread adoption. This paper introduces EdgeEar, a lightweight model based on a proposed hybrid CNN-transformer architecture to solve this problem. By incorporating low-rank approximations into specific linear layers, EdgeEar reduces its parameter count by a factor of 50 compared to the current state-of-the-art, bringing it below two million while maintaining competitive accuracy. Evaluation on the Unconstrained Ear Recognition Challenge (UERC2023) benchmark shows that EdgeEar achieves the lowest EER while significantly reducing computational costs. These findings demonstrate the feasibility of efficient and accurate ear recognition, which we believe will contribute to the wider adoption of ear biometrics.
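The low-rank substitution EdgeEar applies to selected linear layers can be sketched generically: one wide projection is replaced by two thin factors of rank r. The layer sizes and rank below are assumptions for illustration, not EdgeEar's actual configuration.

```python
import torch.nn as nn

class LowRankLinear(nn.Module):
    """A d_in x d_out projection factorized through a rank-r bottleneck."""
    def __init__(self, d_in: int, d_out: int, rank: int):
        super().__init__()
        self.down = nn.Linear(d_in, rank, bias=False)  # d_in * r parameters
        self.up = nn.Linear(rank, d_out)               # r * d_out + d_out parameters

    def forward(self, x):
        return self.up(self.down(x))

full = nn.Linear(512, 512)               # 262,656 parameters
slim = LowRankLinear(512, 512, rank=32)  # 33,280 parameters, roughly 8x fewer
```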
Chapter
Every biometric system relies on one or more biometric modalities. The choice of modality is a key driver of how the system is architected, how it is presented to the user, and how match versus non-match decisions are made. Understanding particular modalities and how best to use them is critical to overall system effectiveness.
Article
Full-text available
Ear images have been shown to be a reliable modality for biometric recognition with desirable characteristics, such as high universality, distinctiveness, measurability and permanence. While a considerable amount of research has been directed towards ear recognition techniques, the problem of ear alignment is still under‐explored in the open literature. Nonetheless, accurate alignment of ear images, especially in unconstrained acquisition scenarios, where the ear appearance is expected to vary widely due to pose and view point variations, is critical for the performance of all downstream tasks, including ear recognition. Here, the authors address this problem and present a framework for ear alignment that relies on a two‐step procedure: (i) automatic landmark detection and (ii) fiducial point alignment. For the first (landmark detection) step, the authors implement and train a Two‐Stack Hourglass model (2‐SHGNet) capable of accurately predicting 55 landmarks on diverse ear images captured in uncontrolled conditions. For the second (alignment) step, the authors use the Random Sample Consensus (RANSAC) algorithm to align the estimated landmark/fiducial points with a pre‐defined ear shape (i.e. a collection of average ear landmark positions). The authors evaluate the proposed framework in comprehensive experiments on the AWEx and ITWE datasets and show that the 2‐SHGNet model leads to more accurate landmark predictions than competing state‐of‐the‐art models from the literature. Furthermore, the authors also demonstrate that the alignment step significantly improves recognition accuracy with ear images from unconstrained environments compared to unaligned imagery.
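A rough sketch of the second (alignment) step, assuming scikit-image's generic RANSAC and similarity-transform estimators as stand-ins for the authors' procedure; the landmark arrays below are random placeholders for the 55 points the 2-SHGNet detector would supply.

```python
import numpy as np
from skimage.measure import ransac
from skimage.transform import SimilarityTransform, warp

detected = np.random.rand(55, 2) * 112    # placeholder: landmarks from the detector
mean_shape = np.random.rand(55, 2) * 112  # placeholder: average ear landmark positions

# Robustly fit a similarity transform mapping detected points onto the mean shape.
model, inliers = ransac(
    (detected, mean_shape),
    SimilarityTransform,
    min_samples=3,
    residual_threshold=2.0,
    max_trials=1000,
)

image = np.zeros((112, 112))              # stand-in for the cropped ear image
aligned = warp(image, model.inverse, output_shape=(112, 112))  # aligned ear image
```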
Article
Full-text available
Bias and fairness of biometric algorithms have been key topics of research in recent years, mainly due to the societal, legal and ethical implications of potentially unfair decisions made by automated decision-making models. A considerable amount of work has been done on this topic across different biometric modalities, aiming at better understanding the main sources of algorithmic bias or devising mitigation measures. In this work, we contribute to these efforts and present the first study investigating bias and fairness of sclera segmentation models. Although sclera segmentation techniques represent a key component of sclera-based biometric systems with a considerable impact on the overall recognition performance, the presence of different types of biases in sclera segmentation methods is still underexplored. To address this limitation, we describe the results of a group evaluation effort (involving seven research groups), organized to explore the performance of recent sclera segmentation models within a common experimental framework and study performance differences (and bias), originating from various demographic as well as environmental factors. Using five diverse datasets, we analyze seven independently developed sclera segmentation models in different experimental configurations. The results of our experiments suggest that there are significant differences in the overall segmentation performance across the seven models and that among the considered factors, ethnicity appears to be the biggest cause of bias. Additionally, we observe that training with representative and balanced data does not necessarily lead to less biased results. Finally, we find that in general there appears to be a negative correlation between the amount of bias observed (due to eye color, ethnicity and acquisition device) and the overall segmentation performance, suggesting that advances in the field of semantic segmentation may also help with mitigating bias.
Article
Full-text available
This paper presents a Lightweight Dense Convolutional (LDC) neural network for edge detection. The proposed model is an adaptation of two state-of-the-art approaches, but requires less than 4% of their parameters. The proposed architecture generates thin edge maps and reaches the highest score (i.e., ODS) when compared with lightweight models (models with less than 1 million parameters), and similar performance when compared with heavy architectures (models with about 35 million parameters). Both quantitative and qualitative results, as well as comparisons with state-of-the-art models on different edge detection datasets, are provided. The proposed LDC does not use pre-trained weights and requires straightforward hyper-parameter settings. The source code is released at https://github.com/xavysp/LDC .
Article
Full-text available
Recognition using ear images has been an active field of research in recent years. Besides faces and fingerprints, ears have a unique structure that can be used to identify people and can be captured from a distance, contactlessly, and without the subject's cooperation. They therefore represent an appealing choice for building surveillance, forensic, and security applications. However, many techniques used in those applications, e.g., convolutional neural networks (CNNs), usually demand large-scale datasets for training. This research work introduces a new dataset of ear images taken under uncontrolled conditions that presents high inter-class and intra-class variability. We built this dataset from an existing face dataset, VGGFace, which gathers more than 3.3 million images. In addition, we perform ear recognition using transfer learning with CNNs pretrained on image and face recognition. Finally, we perform two experiments on two unconstrained datasets and report our results using rank-based metrics.
Article
Full-text available
Ear detection represents one of the key components of contemporary ear recognition systems. While significant progress has been made in the area of ear detection over recent years, most of the improvements are direct results of advances in the field of visual object detection. Only a limited number of techniques presented in the literature are domain-specific and designed explicitly with ear detection in mind. In this paper, we aim to address this gap and present a novel detection approach that does not rely only on general ear (object) appearance, but also exploits contextual information, i.e., face-part locations, to ensure accurate and robust ear detection with images captured in a wide variety of imaging conditions. The proposed approach is based on a Context-aware Ear Detection Network (ContexedNet) and poses ear detection as a semantic image segmentation problem. ContexedNet consists of two processing paths: (i) a context provider that extracts probability maps corresponding to the locations of facial parts from the input image, and (ii) a dedicated ear segmentation model that integrates the computed probability maps into a context-aware segmentation-based ear detection procedure. ContexedNet is evaluated in rigorous experiments on the AWE and UBEAR datasets and shown to ensure competitive performance when evaluated against state-of-the-art ear detection models from the literature. Additionally, because the proposed contextualization is model agnostic, it can also be utilized with other ear detection techniques to improve performance.
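The contextualization idea can be sketched by stacking the face-part probability maps with the image channels before segmentation; the number of maps and the toy conv stack below are assumptions, not ContexedNet's actual architecture.

```python
import torch
import torch.nn as nn

rgb = torch.randn(1, 3, 224, 224)        # input image
face_parts = torch.rand(1, 5, 224, 224)  # assumed: 5 face-part probability maps
                                         # from the context-provider path

x = torch.cat([rgb, face_parts], dim=1)  # context-aware input with 8 channels

segmenter = nn.Sequential(               # toy stand-in for the ear segmentation model
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 1),                 # 1-channel ear-mask logits
)
ear_mask = torch.sigmoid(segmenter(x))   # per-pixel ear probability map
```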
Article
Full-text available
Biometric recognition technology has made significant advances over the last decade and is now used across a number of services and applications. However, this widespread deployment has also resulted in privacy concerns and evolving societal expectations about the appropriate use of the technology. For example, the ability to automatically extract age, gender, race, and health cues from biometric data has heightened concerns about privacy leakage. Face recognition technology, in particular, has been in the spotlight, and is now seen by many as posing a considerable risk to personal privacy. In response to these and similar concerns, researchers have intensified efforts towards developing techniques and computational models capable of ensuring privacy to individuals, while still facilitating the utility of face recognition technology in several application scenarios. These efforts have resulted in a multitude of privacy–enhancing techniques that aim at addressing privacy risks originating from biometric systems and providing technological solutions for legislative requirements set forth in privacy laws and regulations, such as GDPR. The goal of this overview paper is to provide a comprehensive introduction into privacy–related research in the area of biometrics and review existing work on Biometric Privacy–Enhancing Techniques (B–PETs) applied to face biometrics. To make this work useful for as wide of an audience as possible, several key topics are covered as well, including evaluation strategies used with B–PETs, existing datasets, relevant standards, and regulations and critical open issues that will have to be addressed in the future.
Preprint
Full-text available
Systems incorporating biometric technologies have become ubiquitous in personal, commercial, and governmental identity management applications. Both cooperative (e.g. access control) and non-cooperative (e.g. surveillance and forensics) systems have benefited from biometrics. Such systems rely on the uniqueness of certain biological or behavioural characteristics of human beings, which enables individuals to be reliably recognised using automated algorithms. Recently, however, there has been a wave of public and academic concerns regarding the existence of systemic bias in automated decision systems (including biometrics). Most prominently, face recognition algorithms have often been labelled as "racist" or "biased" by the media, non-governmental organisations, and researchers alike. The main contributions of this article are: (1) an overview of the topic of algorithmic bias in the context of biometrics, (2) a comprehensive survey of the existing literature on biometric bias estimation and mitigation, (3) a discussion of the pertinent technical and social matters, and (4) an outline of the remaining challenges and future work items, both from technological and social points of view.
Conference Paper
Full-text available
This paper presents a summary of the 2019 Unconstrained Ear Recognition Challenge (UERC), the second in a series of group benchmarking efforts centered around the problem of person recognition from ear images captured in uncontrolled settings. The goal of the challenge is to assess the performance of existing ear recognition techniques on a challenging large-scale ear dataset and to analyze the performance of the technology from various viewpoints, such as generalization abilities to unseen data characteristics, sensitivity to rotations, occlusions and image resolution, and performance bias on sub-groups of subjects selected based on demographic criteria, i.e. gender and ethnicity. Research groups from 12 institutions entered the competition and submitted a total of 13 recognition approaches ranging from descriptor-based methods to deep-learning models. The majority of submissions focused on ensemble-based methods combining either representations from multiple deep models or hand-crafted with learned image descriptors. Our analysis shows that methods incorporating deep learning models clearly outperform techniques relying solely on hand-crafted descriptors, even though both groups of techniques exhibit similar behavior when it comes to robustness to various covariates, such as the presence of occlusions, changes in (head) pose, or variability in image resolution. The results of the challenge also show that there has been considerable progress since the first UERC in 2017, but that there is still ample room for further research in this area.
Article
Full-text available
Ear recognition has started to grow as an alternative to other types of biometric recognition in recent years. The EarVN1.0 dataset was constructed by collecting ear images of 164 Asian subjects during 2018. It is among the largest ear datasets publicly available to the research community, comprising 28,412 colour images of 98 males and 66 females. The dataset differs from previous work by providing images of both ears per person under unconstrained conditions. The original facial images were acquired in unconstrained environments, with varying camera systems and lighting conditions. Ear images were then cropped from the facial images and exhibit large variations in pose, scale and illumination. Several machine learning tasks can be applied, such as ear recognition, image classification or clustering, gender recognition, right-ear or left-ear detection, and enhanced super-resolution.
Article
Full-text available
We present an unconstrained ear recognition framework that outperforms state-of-the-art systems on different publicly available image databases. To this end, we developed CNN-based solutions for ear normalization and description, used well-known handcrafted descriptors, and fused learned and handcrafted features to improve recognition. We designed a two-stage landmark detector that works successfully under untrained scenarios. We used the generated results to perform a geometric image normalization that boosted the performance of all evaluated descriptors. Our CNN descriptor outperformed other CNN-based works in the literature, especially in more difficult scenarios. The fusion of learned and handcrafted matchers appears to be complementary, as it achieved the best performance in all experiments. The obtained results outperformed all other reported results for the UERC challenge, which contains the most difficult database available today.
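The fusion of learned and handcrafted matchers can be sketched as score-level fusion; the min-max normalization and equal weights below are illustrative assumptions rather than the paper's exact fusion rule.

```python
import numpy as np

def minmax(scores: np.ndarray) -> np.ndarray:
    """Map raw matcher scores to [0, 1] so they can be combined."""
    return (scores - scores.min()) / (scores.max() - scores.min() + 1e-12)

cnn_scores = np.random.rand(100)   # placeholder: similarities from the CNN matcher
hand_scores = np.random.rand(100)  # placeholder: similarities from a handcrafted matcher

fused = 0.5 * minmax(cnn_scores) + 0.5 * minmax(hand_scores)  # weighted-sum fusion
```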
Article
Full-text available
In this paper we present the results of the Unconstrained Ear Recognition Challenge (UERC), a group benchmarking effort centered around the problem of person recognition from ear images captured in uncontrolled conditions. The goal of the challenge was to assess the performance of existing ear recognition techniques on a challenging large-scale dataset and identify open problems that need to be addressed in the future. Five groups from three continents participated in the challenge and contributed six ear recognition techniques for the evaluation, while multiple baselines were made available for the challenge by the UERC organizers. A comprehensive analysis was conducted with all participating approaches addressing essential research questions pertaining to the sensitivity of the technology to head rotation, flipping, gallery size, large-scale recognition and others. The top performer of the UERC was found to ensure robust performance on a smaller part of the dataset (with 180 subjects) regardless of image characteristics, but still exhibited a significant performance drop when the entire dataset comprising 3,704 subjects was used for testing.
Article
Full-text available
Automatic identity recognition from ear images represents an active field of research within the biometric community. The ability to capture ear images from a distance and in a covert manner makes the technology an appealing choice for surveillance and security applications as well as other application domains. Significant contributions have been made in the field over recent years, but open research problems still remain and hinder a wider (commercial) deployment of the technology. This paper presents an overview of the field of automatic ear recognition (from 2D images) and focuses specifically on the most recent, descriptor-based methods proposed in this area. Open challenges are discussed and potential research directions are outlined with the goal of providing the reader with a point of reference for issues worth examining in the future. In addition to a comprehensive review on ear recognition technology, the paper also introduces a new, fully unconstrained dataset of ear images gathered from the web and a toolbox implementing several state-of-the-art techniques for ear recognition. The dataset and toolbox are meant to address some of the open issues in the field and are made publicly available to the research community.
Article
Automatic identity recognition from ear images is an active research topic in the biometric community. The ability to covertly acquire images of the ear from a distance and the stability of the ear shape over time make this technology a promising alternative for surveillance, authentication, and forensic applications. In recent years, significant research has been conducted in this area. Nevertheless, challenges remain that limit the commercial use of this technology. Several phases of the ear recognition pipeline have been studied in the literature, from ear detection, normalization, and feature extraction to classification. This paper reviews the most recent methods used to describe and classify biometric features of the ear. We propose a first taxonomy to group existing approaches to ear recognition, including 2D, 3D, and combined 2D and 3D methods, as well as an overview of historical advances in this field. It is well known that data and algorithms are the essential components in biometrics, particularly in ear recognition. However, early ear recognition datasets were very limited and collected in laboratory settings under controlled conditions. With the wider use of deep neural networks, a considerable amount of training data has become necessary if acceptable ear recognition performance is to be achieved. As a consequence, current ear recognition datasets have increased significantly in size. This paper gives an overview of the chronological evolution of ear recognition datasets and compares the performance of conventional vs. deep learning methods on several datasets. We propose a second taxonomy to classify the existing databases, including 2D, 3D, and video ear datasets. Finally, some open challenges and trends are discussed for future research.
Article
Media reports have accused face recognition of being “biased”, “sexist” and “racist”. There is consensus in the research literature that face recognition accuracy is lower for females, who often have both a higher false match rate and a higher false non-match rate. However, there is little published research aimed at identifying the cause of lower accuracy for females. For instance, the 2019 Face Recognition Vendor Test that documents lower female accuracy across a broad range of algorithms and datasets also lists “Analyze cause and effect” under the heading “What we did not do”. We present the first experimental analysis to identify major causes of lower face recognition accuracy for females on datasets where previous research has observed this result. Controlling for an equal amount of visible face in the test images mitigates the apparent higher false non-match rate for females. Additional analysis shows that makeup-balanced datasets further lower the false non-match rate for females. Finally, a clustering experiment suggests that images of two different females are inherently more similar than those of two different males, potentially accounting for the difference in false match rates.
Article
Previous generations of face recognition algorithms differ in accuracy for images of different races (race bias). Here, we present the possible underlying factors (data-driven and scenario modeling) and methodological considerations for assessing race bias in algorithms. We discuss data-driven factors (e.g., image quality, image population statistics, and algorithm architecture), and scenario modeling factors that consider the role of the “user” of the algorithm (e.g., threshold decisions and demographic constraints). To illustrate how these issues apply, we present data from four face recognition algorithms (a previous-generation algorithm and three deep convolutional neural networks, DCNNs) for East Asian and Caucasian faces. First, dataset difficulty affected both overall recognition accuracy and race bias, such that race bias increased with item difficulty. Second, for all four algorithms, the degree of bias varied depending on the identification decision threshold. To achieve equal false accept rates (FARs), East Asian faces required higher identification thresholds than Caucasian faces, for all algorithms. Third, demographic constraints on the formulation of the distributions used in the test, impacted estimates of algorithm accuracy. We conclude that race bias needs to be measured for individual applications and we provide a checklist for measuring this bias in face recognition algorithms.
Chapter
We address the problem of bias in automated face recognition and demographic attribute estimation algorithms, where errors are lower on certain cohorts belonging to specific demographic groups. We present a novel de-biasing adversarial network (DebFace) that learns to extract disentangled feature representations for both unbiased face recognition and demographics estimation. The proposed network consists of one identity classifier and three demographic classifiers (for gender, age, and race) that are trained to distinguish identity and demographic attributes, respectively. Adversarial learning is adopted to minimize correlation among feature factors so as to abate bias influence from other factors. We also design a new scheme to combine demographics with identity features to strengthen robustness of face representation in different demographic groups. The experimental results show that our approach is able to reduce bias in face recognition as well as demographics estimation while achieving state-of-the-art performance.
Article
Face recognition technology has recently become controversial over concerns about possible bias due to accuracy varying based on race or skin tone. We explore three important aspects of face recognition technology related to this controversy. Using two different deep convolutional neural network face matchers, we show that for a fixed decision threshold, the African-American image cohort has a higher false match rate (FMR), and the Caucasian cohort has a higher false nonmatch rate. We present an analysis of the impostor distribution designed to test the premise that darker skin tone causes a higher FMR, and find no clear evidence to support this premise. Finally, we explore how using face recognition for one-to-many identification can have a very low false-negative identification rate and still present concerns related to the false-positive identification rate. Both the ArcFace and VGGFace2 matchers and the MORPH dataset used in our experiments are available to the research community so that others should be able to reproduce or reanalyze our results.
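The fixed-threshold cohort analysis described above reduces to computing the false match and false non-match rates per cohort; the score distributions below are synthetic placeholders.

```python
import numpy as np

def fmr_fnmr(genuine: np.ndarray, impostor: np.ndarray, threshold: float):
    fmr = np.mean(impostor >= threshold)  # impostor pairs wrongly accepted
    fnmr = np.mean(genuine < threshold)   # genuine pairs wrongly rejected
    return fmr, fnmr

# Synthetic stand-ins for one cohort's similarity scores.
genuine = np.random.normal(0.7, 0.1, 10_000)
impostor = np.random.normal(0.3, 0.1, 100_000)
print(fmr_fnmr(genuine, impostor, threshold=0.5))
```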
Conference Paper
We study the question of feature sets for robust visual object recognition, adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.
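For reference, the configuration the abstract singles out (fine orientation binning, relatively coarse spatial binning, overlapping block normalization) corresponds to the standard HOG parameters, reproducible with scikit-image as below.

```python
import numpy as np
from skimage.feature import hog

image = np.random.rand(128, 64)  # placeholder for a 128x64 detection window

descriptor = hog(
    image,
    orientations=9,              # fine orientation binning
    pixels_per_cell=(8, 8),      # relatively coarse spatial binning
    cells_per_block=(2, 2),      # overlapping descriptor blocks
    block_norm='L2-Hys',         # high-quality local contrast normalization
)
print(descriptor.shape)          # (3780,) for the 128x64 window
```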
Article
The paucity of videos in current action classification datasets (UCF-101 and HMDB-51) has made it difficult to identify good video architectures, as most methods obtain similar performance on existing small-scale benchmarks. This paper re-evaluates state-of-the-art architectures in light of the new Kinetics Human Action Video dataset. Kinetics has two orders of magnitude more data, with 400 human action classes and over 400 clips per class, and is collected from realistic, challenging YouTube videos. We provide an analysis on how current architectures fare on the task of action classification on this dataset and how much performance improves on the smaller benchmark datasets after pre-training on Kinetics. We also introduce a new Two-Stream Inflated 3D ConvNet (I3D) that is based on 2D ConvNet inflation: filters and pooling kernels of very deep image classification ConvNets are expanded into 3D, making it possible to learn seamless spatio-temporal feature extractors from video while leveraging successful ImageNet architecture designs and even their parameters. We show that, after pre-training on Kinetics, I3D models considerably improve upon the state-of-the-art in action classification, reaching 80.7% on HMDB-51 and 98.0% on UCF-101.
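The inflation step can be sketched in a few lines: a pretrained 2D kernel is repeated along a new temporal axis and rescaled so that a static video initially reproduces the 2D activations. The 7-frame temporal extent below is an assumption for illustration.

```python
import torch

w2d = torch.randn(64, 3, 7, 7)  # placeholder for pretrained 2D filters (out, in, kH, kW)
T = 7                           # assumed temporal kernel size

# Repeat along a new time axis and divide by T to preserve activation scale.
w3d = w2d.unsqueeze(2).repeat(1, 1, T, 1, 1) / T  # (64, 3, 7, 7, 7)

conv3d = torch.nn.Conv3d(3, 64, kernel_size=(T, 7, 7),
                         padding=(3, 3, 3), bias=False)
conv3d.weight.data.copy_(w3d)   # bootstrap the 3D conv from 2D weights
```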
Article
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
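The reformulation amounts to blocks that learn a residual F(x) added back to an identity shortcut; a minimal basic block (without the downsampling and bottleneck variants) looks as follows.

```python
import torch.nn as nn

class BasicBlock(nn.Module):
    """Minimal residual block: output is F(x) + x rather than an unreferenced mapping."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # identity shortcut carries x around the two convs
```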
Article
Convolutional Neural Networks define an exceptionally powerful class of models, but are still limited by the lack of ability to be spatially invariant to the input data in a computationally and parameter efficient manner. In this work we introduce a new learnable module, the Spatial Transformer, which explicitly allows the spatial manipulation of data within the network. This differentiable module can be inserted into existing convolutional architectures, giving neural networks the ability to actively spatially transform feature maps, conditional on the feature map itself, without any extra training supervision or modification to the optimisation process. We show that the use of spatial transformers results in models which learn invariance to translation, scale, rotation and more generic warping, resulting in state-of-the-art performance on several benchmarks, and for a number of classes of transformations.
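The differentiable warping at the core of the module can be sketched with PyTorch's affine_grid and grid_sample; in a full spatial transformer the 2x3 matrix theta would be predicted by a small localization network conditioned on the input, not fixed as here.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 64, 64)              # input feature map
theta = torch.tensor([[[1.0, 0.0, 0.1],    # fixed affine parameters for illustration:
                       [0.0, 1.0, 0.0]]])  # a small horizontal translation

grid = F.affine_grid(theta, x.size(), align_corners=False)  # sampling coordinates
warped = F.grid_sample(x, grid, align_corners=False)        # warped feature map
```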
DINOv2: Learning robust visual features without supervision
  • Oquab