Conference Paper

A Discriminative Feature Learning Approach for Deep Face Recognition

Authors: Yandong Wen, Kaipeng Zhang, Zhifeng Li, Yu Qiao

Abstract

Convolutional neural networks (CNNs) have been widely used in the computer vision community, significantly improving the state of the art. In most of the available CNNs, the softmax loss function is used as the supervision signal to train the deep model. In order to enhance the discriminative power of the deeply learned features, this paper proposes a new supervision signal, called center loss, for the face recognition task. Specifically, the center loss simultaneously learns a center for the deep features of each class and penalizes the distances between the deep features and their corresponding class centers. More importantly, we prove that the proposed center loss function is trainable and easy to optimize in CNNs. With the joint supervision of the softmax loss and the center loss, we can train robust CNNs to obtain deep features with the two key learning objectives, inter-class dispersion and intra-class compactness, which are essential to face recognition. It is encouraging to see that our CNNs (with such joint supervision) achieve state-of-the-art accuracy on several important face recognition benchmarks: Labeled Faces in the Wild (LFW), YouTube Faces (YTF), and the MegaFace Challenge. In particular, our new approach achieves the best results on MegaFace (the largest public-domain face benchmark) under the small training set protocol (fewer than 500,000 images and fewer than 20,000 persons), significantly improving on the previous results and setting a new state of the art for both face recognition and face verification tasks.
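To make the joint supervision concrete, below is a minimal NumPy sketch of the center loss term combined with the softmax loss. The balancing weight `lam`, the toy tensor sizes, and the random projection standing in for the classification layer are illustrative assumptions, not values from the paper.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Standard softmax cross-entropy, averaged over the batch."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def center_loss(features, labels, centers):
    """Center loss: half the squared distance between each deep feature and its
    class center (averaged over the batch here for readability)."""
    diffs = features - centers[labels]
    return 0.5 * (diffs ** 2).sum(axis=1).mean()

# Toy joint supervision; all sizes and weights below are illustrative assumptions.
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 3))            # deep features for 4 samples
labels = np.array([0, 1, 0, 1])               # 2 classes
centers = np.zeros((2, 3))                    # one learnable center per class
logits = features @ rng.normal(size=(3, 2))   # stand-in for the classification layer

lam = 0.003                                   # assumed balancing weight
total_loss = softmax_cross_entropy(logits, labels) + lam * center_loss(features, labels, centers)
print(total_loss)
```

In training, both the network parameters and the class centers are updated so that features of the same identity cluster tightly while the softmax term keeps different identities separated.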

... Advanced loss functions can be used to address this problem. Generally, advanced loss functions are divided into two categories: angular-distance-based methods (L-Softmax [24], AM-Softmax [25]) and Euclidean-distance-based methods (contrastive loss [26], triplet loss [27], center loss [28]). ...
... The training procedure for these losses remained challenging owing to the need to select effective training samples. Center loss [28] decreased intra-class variations during training by penalizing the distances between deep features and their corresponding class centers. However, if CNNs are trained with the center loss alone, the deep features and class centers may degrade to zero. ...
... The proposed loss function was compared with softmax, center [28], range [33], and marginal losses [34] using the same CNN architectures to demonstrate the effectiveness of the proposed loss function. Accuracy, a crucial quantitative metric for evaluating the performance of the proposed method, can be calculated as follows: ...
Article
Full-text available
Facial expression recognition is crucial for understanding human emotions and nonverbal communication. With the growing prevalence of facial recognition technology and its various applications, accurate and efficient facial expression recognition has become a significant research area. However, most previous methods have focused on designing unique deep-learning architectures while overlooking the loss function. This study presents a new loss function, applied to a CNN architecture for facial expression recognition, that considers inter- and intra-class variations simultaneously. More concretely, this loss function reduces the intra-class variations by minimizing the distances between the deep features and their corresponding class centers. It also increases the inter-class variations by maximizing the distances between deep features and their non-corresponding class centers, and the distances between different class centers. Numerical results on several benchmark facial expression databases, such as Cohn-Kanade Plus, Oulu-CASIA, MMI, and FER2013, are provided to demonstrate the capability of the proposed loss function compared with existing ones.
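For orientation only, the following NumPy sketch loosely mirrors the idea stated in this abstract: pull features to their own class center, push them away from non-corresponding centers, and spread the centers apart. The hinge-style margins and the way the three terms are combined are assumptions, not the authors' actual formulation.

```python
import numpy as np

def inter_intra_center_loss(features, labels, centers, margin=1.0):
    """Rough sketch: intra-class pull to own center, inter-class push from other
    centers, and a push between different class centers (margins are assumed)."""
    # intra-class term: squared distance to the corresponding class center
    intra = ((features - centers[labels]) ** 2).sum(axis=1).mean()

    # inter-class term: hinge on distances to all non-corresponding centers
    d_all = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    mask = np.ones_like(d_all, dtype=bool)
    mask[np.arange(len(labels)), labels] = False
    inter_feat = np.maximum(0.0, margin - d_all[mask]).mean()

    # center-to-center term: hinge on pairwise distances between class centers
    d_cc = ((centers[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    iu = np.triu_indices(len(centers), k=1)
    inter_center = np.maximum(0.0, margin - d_cc[iu]).mean()

    return intra + inter_feat + inter_center
```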
... Recently, it has also been observed that the traditional softmax loss fails to produce highly discriminative feature vectors [1]. Thus, Deep Metric Learning (DML) has received massive attention and has been used for a variety of tasks, especially face recognition [2], [3]. However, due to the high computational costs of the previously proposed methods, such as Triplet Loss, on larger datasets, several variants have been proposed to improve the discriminative power of the softmax loss [1], [4]- [8]. ...
... As DML has become a de facto approach for face recognition and verification, other related tasks, where earlier approaches relied on handcrafted features [20]-[24], have also been receiving attention and benefit from the robust, discriminative features learned by novel algorithms [25]. Thus, with the increasing attention towards DML, various methods have been proposed in recent years; some of the eminent works in this area include FaceNet [2], Center Loss [3], Circle Loss [26], L-Softmax Loss [27], N-pair Loss [28], AdaReg [29], CRS-CONT [30], and Ada-CM [31]. However, methods such as Triplet Loss, N-pair Loss, and Circle Loss, which require mining of both negative and positive samples, become computationally burdensome as the size of the dataset increases. ...
... For comparison, we select recent state-of-the-art (SOTA) methods that are closely related to ours. These include Center Loss [3], SphereFace [6], VGGFace2 [46], UV-GAN [47], Multicolumn [49], AdaptiveFace [53], P2SGrad [50], AdaCos [51], CosFace [5], MV-AM-Softmax-a [54], ArcFace [4], CurricularFace [7], MagFace [48]. In each experiment, we only include methods that have previously reported their performance. ...
Article
Full-text available
Feature learning is a widely used method for large-scale face recognition tasks. Recently, large-margin softmax loss methods have demonstrated significant improvements in deep face recognition. However, these methods typically propose fixed positive margins to enforce intra-class compactness and inter-class diversity, without considering imbalanced learning issues that arise due to different learning difficulties or the number of training samples available in each class. This overlook not only compromises the efficiency of the learning process but, more critically, the generalization capability of the resultant models. To address this problem, we introduce a novel adaptive strategy called KappaFace, which modulates the relative importance of each class based on its learning difficulty and imbalance. Drawing inspiration from the von Mises-Fisher distribution, KappaFace increases the margin values for the challenging or underrepresented classes and decreases that of more well-represented classes. Comprehensive experiments across eight cutting-edge baselines and nine well-established facial benchmark datasets strongly confirm the advantage of our method. Notably, we observed an enhancement of up to 0.5% on the verification task when evaluated on the IJB-B/C datasets. In conclusion, KappaFace offers a novel solution that effectively tackles imbalanced learning in deep face recognition tasks and establishes a new baseline.
... This field of research has been one of the hottest subjects in the deep FR community, and the histopathology domain shares a similar challenge in which intra-class variances and interclass similarities exist in the dataset. One method for learning discriminative feature embeddings is to learn a center for each class in Euclidean space and penalize the distance between a feature embedding and its class center, as proposed by Center loss [28]. Alternatively, angular-margin-based loss functions implement a cosine angular margin penalty to make the learned features more separable. ...
... Meanwhile, oversampling can cause overfitting [31] and significantly increase training time. Regarding algorithmic solutions to the class-imbalance problem, we can implement cost-sensitive learning by assigning weights to each class [21], [31], or modify the loss function [20], [28], [32]. Modifying the loss function essentially changes how we train the model and provides a better theoretical foundation for solving specific learning issues with the softmax loss. ...
... is the similarity of the sample with other classes, and λ denotes the enforced margin. R(w) is a 'diversity regularizer' term that enforces the class centers to spread out in the feature space; this ensures the learned features converge to a class prototype center (the idea behind center loss [28]) while the prototypes are equally spaced out to give a fair representation to all class samples. Here, µ is the mean distance between all class prototypes. ...
Article
Full-text available
Early-stage cancer diagnosis potentially improves the chances of survival for many cancer patients worldwide. Manual examination of Whole Slide Images (WSIs) is a time-consuming task for analyzing the tumor microenvironment. To overcome this limitation, the conjunction of deep learning with computational pathology has been proposed to assist pathologists in efficiently prognosing the cancerous spread. Nevertheless, the existing deep learning methods are ill-equipped to handle fine-grained histopathology datasets. This is because these models are constrained by the conventional softmax loss function, which cannot drive them to learn distinct representational embeddings of similarly textured WSIs with an imbalanced data distribution. To address this problem, we propose a novel center-focused affinity loss (CFAL) function that 1) constructs uniformly distributed class prototypes in the feature space, 2) penalizes difficult samples, 3) minimizes intra-class variations, and 4) places greater emphasis on learning minority-class features. We evaluated the performance of the proposed CFAL loss function on two publicly available breast and colon cancer datasets with varying levels of class imbalance. The proposed CFAL function shows better discrimination ability than popular loss functions such as ArcFace, CosFace, and Focal loss. Moreover, it outperforms several SOTA methods for histology image classification across both datasets. The paper code is publicly available at https://github.com/Taslim-M/CFAL-max-margin-keras.
... Advanced loss functions can be used to address this problem. Generally, advanced loss functions are divided into two categories: angular-distance-based methods (L-Softmax [22], AM-Softmax [23]) and Euclidean-distance-based methods (contrastive loss [24], triplet loss [25], center loss [26]). ...
... The training procedure for these losses remained challenging owing to the need to select effective training samples. Center loss [26] decreased intra-class variations during training by penalizing the distances between deep features and their corresponding class centers. However, if CNNs are trained with the center loss alone, the deep features and class centers may degrade to zero. ...
... The rotated facial images are shown in Fig. 7. The proposed loss function was compared with softmax, center [26], range [31], and marginal losses [32] using the same CNN architectures to demonstrate the effectiveness of the proposed loss function. The experiment was conducted in a subject-independent scenario. ...
Preprint
Full-text available
Facial expression recognition is crucial for understanding human emotions and nonverbal communication. With the growing prevalence of facial recognition technology and its various applications, accurate and efficient facial expression recognition has become a significant research area. However, most previous methods have focused on designing unique deep-learning architectures while overlooking the loss function. This study presents a new loss function, applied to a CNN architecture for facial expression recognition, that considers inter- and intra-class variations simultaneously. More concretely, this loss function reduces the intra-class variations by minimizing the distances between the deep features and their corresponding class centers. It also increases the inter-class variations by maximizing the distances between deep features and their non-corresponding class centers, and the distances between different class centers. Numerical results on several benchmark facial expression databases, such as Cohn-Kanade Plus, Oulu-CASIA, MMI, and FER2013, are provided to demonstrate the capability of the proposed loss function compared with existing ones.
... A common solution is to design a deep network that targets embeddings that distinguish people without recognizing them. Wen et al. [36] introduced the center loss to minimize the distance between samples within the same class, thereby reducing intra-class deep-feature distances. In the same direction, the following improvements have been proposed: Zhang et al. created a loss function for long-tailed distributions [37], while Zheng et al. [38] found that normalizing embeddings with a ring loss enhances performance. ...
... The behavior of the four experimental losses is shown in Fig. 1 for clarity. The center loss method [36] compacts samples around centroids without considering the distance between clusters. For tangled data, this may be problematic. ...
Article
Full-text available
This paper addresses the problem of recognition of naturally-appearing human facial movements (action units), as an intermediate step toward their aggregation for the recognition and understanding of facial expressions. As the proposed method, we introduce a domain adaptation solution applied to deep convolutional networks, taking advantage of the networks' capability to provide simultaneous predictions and discriminative embeddings. In this way, we adapt information gathered from training on mutual expression recognition to facial action unit detection. The described strategy is evaluated in the context of action units in the wild within the EmotioNet dataset and action units acquired in laboratory conditions within the DISFA and CK+ datasets. Our method achieves results comparable to the state of the art and demonstrates superior recognition in the case of rarely occurring action units. Additionally, the embedding space structuring is significantly enhanced with respect to the results obtained by classical losses.
... This modification yields a more robust judgment when the range of similar signals is large. In this case, the center loss function [25] and the cross-entropy loss function are introduced for the joint training of the network. ...
... In order to test the scalability of each method for interference signal testing, the known interference signal set and the unknown class interference signal set are combined to form an open test set. The CNN, LRN, TF-LSTM, SNRSN and AFUCR-SNRSN models are tested, and the model results are evaluated using two indicators, which are well-accepted metrics in intelligent recognition to compare model performance [26,27]: the true positive rate (TPR) and the false positive rate (FPR), calculated as in (24) and (25). True positive (TP) represents the number of known class samples that are correctly recognized as a known class. ...
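The excerpt refers to equations (24) and (25) without reproducing them; for reference, the standard definitions of the two metrics are sketched below (the cited equations may use different notation).

```python
def tpr_fpr(tp, fn, fp, tn):
    """Standard definitions of the two evaluation metrics."""
    tpr = tp / (tp + fn)   # fraction of known-class samples correctly recognized as known
    fpr = fp / (fp + tn)   # fraction of unknown-class samples wrongly recognized as known
    return tpr, fpr
```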
Article
Full-text available
In complex battlefield environments, flying ad-hoc network (FANET) faces challenges in manually extracting communication interference signal features, a low recognition rate in strong noise environments, and an inability to recognize unknown interference types. To solve these problems, one simple non-local correction shrinkage (SNCS) module is constructed. The SNCS module modifies the soft threshold function in the traditional denoising method and embeds it into the neural network, so that the threshold can be adjusted adaptively. Local importance-based pooling (LIP) is introduced to enhance the useful features of interference signals and reduce noise in the downsampling process. Moreover, the joint loss function is constructed by combining the cross-entropy loss and center loss to jointly train the model. To distinguish unknown class interference signals, the acceptance factor is proposed. Meanwhile, the acceptance factor-based unknown class recognition simplified non-local residual shrinkage network (AFUCR-SNRSN) model with the capacity for both known and unknown class recognition is constructed by combining AFUCR and SNRSN. Experimental results show that the recognition accuracy of the AFUCR-SNRSN model is the highest in the scenario of a low jamming to noise ratio (JNR). The accuracy is increased by approximately 4-9% compared with other methods on known class interference signal datasets, and the recognition accuracy reaches 99% when the JNR is -6 dB. At the same time, compared with other methods, the false positive rate (FPR) in recognizing unknown class interference signals drops to 9%.
... A prototype, also known as a proxy (Movshovitz-Attias et al., 2017) or center (Wen et al., 2016), is a single representative of a class among the training examples. In contrast to using softmax weights for decision making, prototype-based methods (Yang et al., 2018; Wang et al., 2019; Zhou et al., 2022) aim to learn a latent feature space in which the prediction is made by calculating the distance between the test anchor and the prototype of each class. ...
Article
Full-text available
Deep learning has achieved great success in academic benchmarks but fails to work effectively in the real world due to the potential dataset bias. The current learning methods are prone to inheriting or even amplifying the bias present in a training dataset and under-represent specific demographic groups. More recently, some dataset debiasing methods have been developed to address the above challenges based on the awareness of protected or sensitive attribute labels. However, the number of protected or sensitive attributes may be considerably large, making it laborious and costly to acquire sufficient manual annotation. To this end, we propose a prototype-based network to dynamically balance the learning of different subgroups for a given dataset. First, an object pattern embedding mechanism is presented to make the network focus on the foreground region. Then we design a prototype learning method to discover and extract the visual patterns from the training data in an unsupervised way. The number of prototypes is dynamic depending on the pattern structure of the feature space. We evaluate the proposed prototype-based network on three widely used polyp segmentation datasets with abundant qualitative and quantitative experiments. Experimental results show that our proposed method outperforms the CNN-based and transformer-based state-of-the-art methods in terms of both effectiveness and fairness metrics. Moreover, extensive ablation studies are conducted to show the effectiveness of each proposed component and various parameter values. Lastly, we analyze how the number of prototypes grows during the training process and visualize the associated subgroups for each learned prototype. The code and data will be released at https://github.com/zijinY/dynamic-prototype-debiasing .
... To obtain more separable features, we apply the center loss \(L_{center}\) [33]. As a result, the total loss function of the SAST framework is: ...
Article
Full-text available
Facial expression recognition (FER) suffers from insufficient label information, as human expressions are complex and diverse, with many expressions ambiguous. Using low-quality or low-quantity labels aggravates the ambiguity of model predictions and reduces the accuracy of FER. How to improve the robustness of FER to ambiguous data with insufficient information remains challenging. To this end, we propose the Suppressing Ambiguity Self-Training (SAST) framework, the first attempt to address the problem of insufficient information in both label quality and label quantity simultaneously. Specifically, we design an Ambiguous Relative Label Usage (ARLU) strategy that mixes hard labels and soft labels to alleviate the information loss caused by hard labels. We also enhance the robustness of the model to ambiguous data by means of Self-Training Resampling (STR). We further use landmarks and a Patch Branch (PB) to enhance the ability to suppress ambiguity. Experiments on the RAF-DB, FERPlus, SFEW, and AffectNet datasets show that our SAST outperforms 6 semi-supervised methods with fewer annotations and achieves accuracy competitive with State-Of-The-Art (SOTA) FER methods. Our code is available at https://github.com/Liuxww/SAST.
... Recently, deep learning-based methods have been proposed for iris recognition, which leverages the power of convolutional neural networks (CNNs) for feature extraction and various hashing techniques for efficient retrieval. Wen et al. [11] proposed a deep learning-based method that combines CNNs for feature extraction and product quantization for efficient ANN search. Their method demonstrated improved performance compared to traditional methods in terms of retrieval accuracy and query time. ...
Conference Paper
Full-text available
Iris recognition is a widely used biometric identification technology that relies on the accurate and efficient matching of iris images. However, fast matching in large databases poses a significant challenge due to the increasing search time for a given query. To address this problem, this paper proposes an end-to-end hashing framework for iris recognition tasks based on the DenseFly algorithm. The presented approach utilizes a deep convolutional neural network to extract features from iris images and then applies hashing to map the features into compact binary codes. This process enables efficient retrieval of the query iris templates by reducing the whole search space. To evaluate and compare the proposed method with the existing IHashNet approach, we conduct experiments on three publicly available iris datasets namely CASIA-Irisv4-Thousand, UBIRIS.v2 and CASIA-Irisv4-Lamp. Our simulation results demonstrate that the proposed method outperforms IHashNet in terms of retrieval accuracy and equal error rate (EER). Furthermore, our method achieves significantly lower query time over all the datasets, thereby vindicating its usage over large iris datasets.
... In addition, the NE loss is proposed to constrain the training process so that the extracted features exhibit excellent clustering characteristics. Specifically, the center loss [45] of f and x is calculated to measure their intra-class dispersion. Considering that the intra-class dispersion of the output feature f should be smaller than that of the input feature x, the NE loss is designed as: ...
Article
Full-text available
Existing person re-identification methods are difficult to generalize to unseen datasets because of the domain gaps. The key to solving this problem is to extract domain-invariant and discriminative features. In this paper, we propose a normalization and enhancement (NE) module that can effectively suppress domain gaps and enhance pedestrian features without any target domain data, thereby enhancing the generalization ability of the model. NE module consists of a residual connection and a channel attention (CA) block. The residual connection is designed to suppress the domain gaps while retaining discriminative information. The proposed CA block embeds spatial information into the channel dimension and models the dependencies between channels to enhance pedestrian features. In addition, the NE loss is designed to constrain the training of NE module to extract domain-invariant features with excellent distribution. Ablation experiments were conducted on Market-1501, DukeMTMC-reID, CUHK03-NP and MSMT17. Experimental results confirmed the superiority of the proposed method, the mAP reached a maximum of 47.3% in cross-domain scenarios.
... Moreover, for model stability, the attention maps of each channel are expected to provide specific attention patterns; e.g., in rock classification, one attention map focuses on color features and another on rock texture features. However, the supervision of the classification loss alone does not guarantee this. To make the intra-class samples more aggregated, the center loss is introduced [31]. It initializes an intra-class center for each class and pulls the feature vectors of that class's samples closer to the center; each class center is determined by the feature vectors of that class and is continuously updated during training, and all images in each training iteration help to update their corresponding center vectors. The initial centers are denoted by \(c \in \mathbb{R}^{D \times C}\), the centers after a round of updates by \(c'\), and the update process is as in Equation (10). ...
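Equation (10) is not reproduced in the excerpt; for reference, the per-class center update proposed in the original center loss paper, which updates of this kind typically follow, is

\[
\Delta c_j = \frac{\sum_{i=1}^{m} \delta(y_i = j)\,(c_j - x_i)}{1 + \sum_{i=1}^{m} \delta(y_i = j)}, \qquad
c_j^{t+1} = c_j^{t} - \alpha\,\Delta c_j^{t},
\]

where \(\delta(\cdot)\) equals 1 when its condition holds and 0 otherwise, \(m\) is the mini-batch size, and \(\alpha\) is the center learning rate.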
Article
Full-text available
Efficient and convenient rock image classification methods are important for geological research. They help in identifying and categorizing rocks based on their physical and chemical properties, which can provide insights into their geological history, origin, and potential uses in various applications. The classification and identification of rocks often rely on experienced and knowledgeable professionals and are less efficient. Fine-grained rock image classification is a challenging task because of the inherent subtle differences between highly confusing categories, which require a large number of data samples and computational resources, resulting in low recognition accuracy, and are difficult to apply in mobile scenarios, requiring the design of a high-performance image processing classification architecture. In this paper we design a knowledge distillation and high-accuracy feature localization comparison network (FPCN)-based learning architecture for generating small high-performance rock image classification models. Specifically, for a pair of images, we interact with the feature vectors generated from the localized feature maps to capture common and unique features, let the network focus on more complementary information according to the different scales of the objects, and then the important features of the images learned in this way are made available for the micro-model to learn the critical information for discrimination via model distillation. The proposed method improves the accuracy of the micro-model by 3%.
... The objective is to bring samples closer to their true label in the embedding space while maintaining distance from negative samples (viz. this is a well-known technique called center loss [22], not, as claimed, a new embedding technique). They achieved an accuracy of 88.8%, precision of 92.1%, and recall of 87.1% when evaluated on the UNSW-NB15 dataset. ...
... The ArcFace loss function used in this article is one of the loss functions in metric learning. Like Contrastive Loss [26], Triplet Loss [27], Center Loss [28], L-Softmax Loss [29], SphereFace [30], and CosFace [31], it is designed to increase the inter-class distance and reduce the intra-class distance. There have been a lot of compara- ...
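Since ArcFace is only named here, the sketch below shows the standard additive angular margin logit that ArcFace-style losses apply before softmax cross-entropy; the scale `s` and margin `m` are typical values and are assumptions, not settings from the cited work.

```python
import numpy as np

def arcface_logits(features, weights, labels, s=64.0, m=0.5):
    """ArcFace-style logits: cos(theta + m) for the target class, cos(theta) otherwise."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)   # L2-normalized features
    w = weights / np.linalg.norm(weights, axis=0, keepdims=True)     # L2-normalized class weights
    cos = np.clip(f @ w, -1.0, 1.0)
    theta = np.arccos(cos)
    target = np.cos(theta[np.arange(len(labels)), labels] + m)       # add angular margin m
    logits = cos.copy()
    logits[np.arange(len(labels)), labels] = target
    return s * logits                                                # feed into softmax cross-entropy
```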
Article
Full-text available
Currently, widely used object detection can predict targets present in the training set. However, in fine-grained object detection tasks, such as commodity detection, the introduction of a new target class requires retraining the model, which significantly reduces the flexibility of the algorithm in applications. In response to this problem, we propose an end-to-end fine-grained object detection and feature extraction network (FOF). To detect and identify objects beyond the target categories of the training set, the category output in the network head is removed and replaced with a 128-dimensional feature vector. We used the ArcFace loss function to improve feature classification during training. Since there is no category output, an improved non-maximum suppression algorithm, non-maximum suppression-feature similarity, is proposed to distinguish same-class and dissimilar-class prediction boxes by feature similarity. During inference, FOF outputs prediction boxes and feature vectors and matches them with the feature vectors in the feature gallery to determine the detected object category, completing object detection and recognition. Experimental results indicate that FOF achieves high accuracy on the MS COCO, PASCAL VOC2012, SmartUVM, and the large-scale, fine-grained Retail Product Checkout datasets. In addition, the method exhibits a low equal error rate when identifying new categories, achieving the objective of detecting and identifying new categories without the need to retrain the model.
... The Cluster Center can be designed manually, or with a Center Loss optimization [35]. ...
Article
Full-text available
In the field of artificial intelligence, pattern recognition is widely used to extract abstract information from high-dimensional inputs such as images, voice, or video. However, the interpretability of pattern recognition remains understudied. The incomplete features extracted from the system input still limit recognition performance. To reject the disturbance caused by feature incompleteness, an error compensation is incorporated into the pattern recognition model under a quantum computation framework. The quantum-based recognition system fulfills the information transmission from input to output with the transformation of quantum states. Then, a compensation for the quantum state is used to reject intermediate errors in the pattern recognition task. The experimental results in this paper indicate the effectiveness of the proposed method, with which the compensated Quantum Neural Network obtains better performance. The proposed method yields a more robust recognition system under unknown disturbances.
... The comparisons of some classic loss functions for adjusting distances between classes are also included. Table VIII compares SSC under OpenSARShip and FUSAR-ship with Softmax Loss, Contrastive Loss [53], Center Loss [54], and Affinity Loss [55] to better demonstrate the effectiveness of the proposed method. ...
Article
Full-text available
Ship recognition in synthetic aperture radar (SAR) images is essential for many applications in maritime surveillance tasks. Recently, convolutional neural network (CNN)-based methods tend to be the mainstream in SAR recognition. Though considerable developments have been achieved, there are still several challenging issues toward superior ship recognition performance: 1) Ships have a large variance in size, making it difficult to recognize ships by using a single scale features of CNN. 2) The SAR ship’s large aspect ratio presents an obvious geometric characteristic. However, standard convolution is limited by the fixed convolution kernel, which is less effective in processing elongated SAR ships. 3) Existing CNN classifiers with softmax loss are less powerful to deal with intraclass diversity and interclass similarity in SAR ships. In this paper, we propose a task-specific hierarchically designed network with a spherical space classifier (HDSS-Net) to alleviate the above issues. Firstly, to realize SAR ship recognition with large size variation, a feature aggregation module (FAM) is designed for obtaining a feature pyramid that has strong representational power at all scales. Secondly, a FeatureBoost module (FBM) is devised to provide rectangular receptive fields to refine the features generated by FAM. Finally, a novel spherical space classifier (SSC) is proposed to expand the interclass margin and compress the intraclass feature distribution by fully taking advantage of the property of spherical space. The experimental results on two benchmark datasets (OpenSARShip and FUSAR-Ship) jointly show that the proposed HDSS-Net performs better than classic CNN methods and novel SAR ship recognition CNN methods.
... Originally applied in face recognition, some techniques [18][19][20][21][22] have been widely applied in various domains to cater to the demands of real-world recognition tasks. For example, Wen et al. [23] introduced Center Loss based on the Triplet Loss [24], which calculates the center vector of each class and the Euclidean distance between the sample and the center vector. By optimizing the Center Loss, the network can learn more compact and discriminating feature representations. ...
Article
Full-text available
Open-set methods are crucial for rejecting unknown facial expressions in real-world scenarios. Traditional open-set methods primarily rely on a single feature vector for constructing the centers of known facial expression categories, which limits their ability to discriminate unknown categories. To address this problem, we propose the OpenFE method. This method introduces an attention mechanism that focuses on critical regions to improve the quality of feature vectors. Simultaneously, reconstruction methods are employed to extract low-dimensional potential features from images. By enriching the feature representation of known categories, the OpenFE method significantly amplifies the differentiation between unknown and known facial categories. Extensive experimental validation demonstrates the exceptional performance of the OpenFE method in expression open set classification, confirming its robustness.
... We first initialize several clusters of the learned physiological feature representations for training the RHPRNet. Each cluster describes the center and spread of the representation distribution from a scenario via a center loss [52]. ...
... Meanwhile, to improve the recognition accuracy of neural networks, the loss function also plays a key role. In previous studies, Wen et al. [38] learned a center for each class with a new center loss function to enhance intra-class compactness. L2-softmax [39] and NormFace [40] investigated the necessity of the normalization operation and applied L2 normalization constraints to both features and weights. ...
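As a rough illustration of the normalization idea mentioned for L2-softmax and NormFace, the sketch below replaces raw inner-product logits with scaled cosine similarities; the scale value is an assumption, and details differ between the two cited methods.

```python
import numpy as np

def normalized_softmax_logits(features, weights, s=30.0):
    """Cosine logits: L2-normalize features and class weights, then rescale by s
    before applying the usual softmax cross-entropy."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=0, keepdims=True)
    return s * (f @ w)
```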
Article
Full-text available
The fractal features of liver fibrosis MR images exhibit an irregular fragmented distribution, and the diffuse feature distribution lacks interconnectivity, resulting in incomplete feature learning and poor recognition accuracy. In this paper, we insert recursive gated convolution into the ResNet18 network to introduce spatial information interactions during the feature learning process and extend it to higher orders using recursion. Higher-order spatial information interactions enhance the correlation between features and enable the neural network to focus more on pixel-level dependencies, enabling a global interpretation of liver MR images. Additionally, light scattering and quantum noise during the imaging process, coupled with environmental factors such as breathing artifacts caused by prolonged breath holding, affect the quality of the MR images. To improve the classification performance of the neural network and better capture sample features, we introduce the Adaptive Rebalance loss function and incorporate the feature paradigm as a learnable adaptive attribute into the angular margin auxiliary function. The Adaptive Rebalance loss function can expand the inter-class distance and narrow the intra-class difference to further enhance the discriminative ability of the model. We conduct extensive experiments on liver fibrosis MR imaging involving 209 patients. The results demonstrate an average improvement of two percent in recognition accuracy compared to ResNet18. The GitHub repository is at https://github.com/XZN1233/paper.git.
... This work discloses how a compact adaptation network benefits from additional negative data and is evaluated with unknown samples. (Figure caption: LeNet++ network [4] topologies are trained on 10 MNIST classes (knowns, colored dots) and evaluated with EMNIST letters (negatives, black) as well as Devanagari letters (unknowns, gray).) ...
Article
Full-text available
Open-set face recognition characterizes a scenario where unknown individuals, unseen during the training and enrollment stages, appear at operation time. This work concentrates on watchlists, an open-set task that is expected to operate at a low False Positive Identification Rate and generally includes only a few enrollment samples per identity. We introduce a compact adapter network that benefits from additional negative face images when combined with distinct cost functions, such as the Objectosphere Loss (OS) and the proposed Maximal Entropy Loss (MEL). MEL modifies the traditional Cross-Entropy loss in favor of increasing the entropy for negative samples and attaches a penalty to known target classes in pursuance of gallery specialization. The proposed approach adopts pre-trained deep neural networks (DNNs) for face recognition as feature extractors. Then, the adapter network takes deep feature representations and acts as a substitute for the output layer of the pre-trained DNN in exchange for agile domain adaptation. Promising results have been achieved following open-set protocols on three different datasets (LFW, IJB-C, and UCCS), as well as state-of-the-art performance when supplementary negative data is properly selected to fine-tune the adapter network.
... And some data augmentation was carried out during the fine-tuning stage. The disadvantage is that the features learned through transfer learning are not effective, and the data augmentation applied during the fine-tuning stage does not produce a substantial enrichment of the data [28]. ...
Article
Full-text available
With the development of deepfake technology, deepfake detection has received widespread attention. Although some deepfake forensics techniques have been proposed, they are still very difficult to implement in real-world scenarios. This is due to the differences in different deepfake technologies and the compression or editing of videos during the propagation process. Considering the issue of sample imbalance with few-shot scenarios in deepfake detection, we propose a multi-feature channel domain-weighted framework based on meta-learning (MCW). In order to obtain outstanding detection performance of a cross-database, the proposed framework improves a meta-learning network in two ways: it enhances the model’s feature extraction ability for detecting targets by combining the RGB domain and frequency domain information of the image and enhances the model’s generalization ability for detecting targets by assigning meta weights to channels on the feature map. The proposed MCW framework solves the problems of poor detection performance and insufficient data compression resistance of the algorithm for samples generated by unknown algorithms. The experiment was set in a zero-shot scenario and few-shot scenario, simulating the deepfake detection environment in real situations. We selected nine detection algorithms as comparative algorithms. The experimental results show that the MCW framework outperforms other algorithms in cross-algorithm detection and cross-dataset detection. The MCW framework demonstrates its ability to generalize and resist compression with low-quality training images and across different generation algorithm scenarios, and it has better fine-tuning potential in few-shot learning scenarios.
... In the context of comparison techniques, numerous researchers have focused on feature extractors and embedding models. Notably, research on learning suitable embedding models for comparisons has been typically based on the Siamese network structure [5] with contrastive loss [6], triplet loss [7], and center loss [8]. These techniques have been applied to problems including face verification [9] and few-shot learning, with representative models including the prototypical network [10]. ...
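Because the excerpt mentions the Siamese structure with contrastive loss only in passing, a minimal NumPy sketch of the standard pairwise contrastive loss is given below; the margin value is an assumption.

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, same, margin=1.0):
    """Standard pairwise contrastive loss for a Siamese network: pull matching
    pairs together and push non-matching pairs apart by at least the margin.
    `same` is 1 for matching pairs and 0 otherwise."""
    d = np.linalg.norm(emb_a - emb_b, axis=1)
    pos = same * d ** 2
    neg = (1 - same) * np.maximum(0.0, margin - d) ** 2
    return 0.5 * (pos + neg).mean()
```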
Article
Full-text available
In recent years, deep learning has attracted considerable attention owing to its ability to address complex problems in various fields. One notable problem is metric space learning, which is aimed at learning feature embeddings through the calculation of the metric space similarity of feature vectors in the embedding space by training embedding models. However, research on metric space similarity has been limited, and the existing methods, such as those based on the cosine similarity or concatenate layer, exhibit drawbacks in terms of flexibility and performance. For example, to apply the cosine similarity, the shapes of the two vectors must be identical, and thus, an advanced comparator cannot be used. Moreover, the concatenate layer cannot reflect the positions of the elements in the two vectors, leading to deteriorated performance and learning success rates. To address these limitations, this paper proposes a specialized artificial neural network layer named SimilarNet, designed to compare two feature vectors while considering the positions of their elements to produce an output in a vector format. By leveraging the advantages of the cosine similarity and concatenation layer, SimilarNet can effectively compare two vectors, enabling the construction of trained comparison models using multidimensional activation functions. In addition, SimilarNet can realize 1:1 comparisons of data with different shapes, unlike cosine similarity. The results of experiments conducted on various datasets indicate that models employing SimilarNet outperform those with the concatenate layer in terms of the comparison accuracy by 4.3% to 26.5% and learning success rate by 5% to 75%.
Chapter
Iris recognition is a widely used biometric identification technology that relies on the accurate and efficient matching of iris images. However, fast matching in large databases poses a significant challenge due to the increasing search time for a given query. To address this problem, this paper proposes an end-to-end hashing framework for iris recognition tasks based on the DenseFly algorithm. The presented approach utilizes a deep convolutional neural network to extract features from iris images and then applies hashing to map the features into compact binary codes. This process enables efficient retrieval of the query iris templates by reducing the whole search space. To evaluate and compare the proposed method with the existing IHashNet approach, we conduct experiments on three publicly available iris datasets namely CASIA-Irisv4-Thousand, UBIRIS.v2 and CASIA-Irisv4-Lamp. Our simulation results demonstrate that the proposed method outperforms IHashNet in terms of retrieval accuracy and equal error rate (EER). Furthermore, our method achieves significantly lower query time over all the datasets, thereby vindicating its usage over large iris datasets.
Conference Paper
Since facial manipulation technology has raised serious concerns, facial forgery detection has also attracted increasing attention. Although recent work has made good achievements, the detection of unseen fake faces is still a big challenge. In this paper, we tackle the facial forgery detection problem from the perspective of distance metric learning and design a new Intra-Variance guided Metric Learning (IVML) method to drive classification, adopting the Vision Transformer (ViT) as the backbone, with the aim of improving the generalization ability of face forgery detection methods. Specifically, considering that there is a large gap between different real faces, our proposed IVML method increases the distance between real and fake faces while maintaining a certain distance within real faces. We choose ViT as the backbone as our experiments prove that ViT has better generalization ability in face forgery detection. A large number of experiments demonstrate the effectiveness and superiority of our IVML method in cross-dataset evaluation.
Chapter
Traditionally, researchers train facial gender and age recognition models separately using deep convolutional networks. However, in the real world, it is crucial to build a low-cost and time-efficient multitask learning system that can recognize both these tasks simultaneously. In multitask learning, the synergy among the tasks creates imbalance in the loss functions and influences their individual performances. This imbalance among the task-specific loss functions leads to a drop in accuracy. To overcome this challenge and achieve better performance, we propose a novel weighted sum of loss functions that balances the loss of each task. We train our method for the recognition of gender and age on the publicly available Adience benchmark dataset. Finally, we evaluate our method on the VGGFace and FaceNet architectures on the Adience test set, achieving better performance than previous architectures.
Chapter
Multimedia face images appear in social networks, digital entertainment, and other applications, exhibiting dramatic pose, illumination, and expression variations, resulting in considerable performance degradation. Using deep visualization techniques, this study analyses the classic network VGG Face. First, we analyze features computed by neurons, looking at diversity, invariance, and discrimination characteristics. According to the conventional view, the robustness to transformation increases as the network gets deeper, but it turns out that the middle layer is the least robust. One of the most significant findings is that high-level features are correlated with complex face attributes that humans may not be able to describe verbally.
Chapter
Incremental similarity learning in neural networks poses a challenge due to catastrophic forgetting. To address this, previous research suggests that retaining “image exemplars” can proxy for past learned features. Additionally, it is widely accepted that the output layers acquire task-specific features during later training stages, while the input layers develop general features earlier on. We lock the input layers of a neural network and then explore the feasibility of producing “embedding” models from a VAE that can safeguard the essential knowledge in the intermediate to output layers of the neural network. The VAEs eliminate the necessity of preserving “exemplars”. In an incremental similarity learning setup, we tested three metric learning loss functions on CUB-200 (Caltech-UCSD Birds-200-2011) and CARS-196 datasets. Our approach involved training VAEs to produce exemplars from intermediate convolutional and linear output layers to represent the base knowledge. Our study compared our method with a previous technique and evaluated the baseline knowledge (\(\varOmega _{base}\)), new knowledge (\(\varOmega _{new}\)), and average knowledge (\(\varOmega _{all}\)) preservation metrics. The results show that generating exemplars from the linear and convolutional layers is the most effective way to retain base knowledge. It should be noted that embeddings from the linear layers result in better performance when it comes to new knowledge compared to convolutional embeddings. Overall, our methods have shown better average knowledge performance (\(\varOmega _{all}=[0.7879, 0.7805]\)) compared to iCaRL (\(\varOmega _{all}=[0.7476, 0.7683]\)) in the CUB-200 and CARS-196 experiments, respectively. Based on the results, it appears that it is important to focus on embedding exemplars for the intermediate to output layers to prevent catastrophic forgetting during incremental similarity learning in classes. Additionally, our findings suggest that the later linear layers play a greater role in incremental similarity learning for new knowledge than convolutions. Further research is needed to explore the connection between transfer learning and similarity learning and investigate ways to protect the intermediate layer embedding space from catastrophic forgetting.
Chapter
One goal of Automatic Speech Recognition (ASR) is to convert human speech commands into computer-readable input, but noise interference is an important yet challenging problem. By capturing speech context, deep neural networks have been demonstrated to be superior for identifying specific command words. Existing deep neural networks generally rely on a two-layer structure, where different layers are used to identify the noisy environment and the speech, respectively, which makes the model large and complex. In addition, their performance generally drops dramatically in unknown noisy environments, which restricts the generalization of the method. In this paper, we propose a novel deep framework, named Adaptive-Attention and Joint Supervision (AJS), to circumvent the above challenge. Specifically, we use the spectrogram as the input. Adaptive attention is employed to refine the features from the noisy environment and get rid of the limitation of the noisy scene. Furthermore, a combination of coarse-to-fine losses is adopted to process difficult words step by step. Extensive experiments on four public datasets demonstrate the robustness of our method to various noise environments and its superiority for ASR in terms of accuracy. Codes are available at: https://github.com/zhishulin/bajs.
Article
Existing supervised quantization methods usually learn the quantizers from pair-wise, triplet, or anchor-based losses, which only capture their relationship locally without aligning them globally. This may cause an inadequate use of the entire space and a severe intersection among different semantics, leading to inferior retrieval performance. Furthermore, to enable quantizers to learn in an end-to-end way, current practices usually relax the non-differentiable quantization operation by substituting it with softmax, which unfortunately is biased, leading to an unsatisfying suboptimal solution. To address the above issues, we present Spherical Centralized Quantization (SCQ), which contains a Priori Knowledge based Feature (PKFA) module for the global alignment of feature vectors, and an Annealing Regulation Semantic Quantization (ARSQ) module for low-biased optimization. Specifically, the PKFA module first applies Semantic Center Allocation (SCA) to obtain semantic centers based on prior knowledge, and then adopts Centralized Feature Alignment (CFA) to gather feature vectors based on corresponding semantic centers. The SCA and CFA globally optimize the inter-class separability and intra-class compactness, respectively. After that, the ARSQ module performs a partial-soft relaxation to tackle biases, and an Annealing Regulation Quantization loss for further addressing the local optimal solution. Experimental results show that our SCQ outperforms state-of-the-art algorithms by a large margin (2.1%, 3.6%, 5.5% mAP respectively) on CIFAR-10, NUS-WIDE, and ImageNet with a code length of 8 bits. Codes are publicly available: https://github.com/zzb111/Spherical-Centralized-Quantization.
Chapter
Based on improvements to existing person re-identification network models and an investigation of person re-identification problems on open datasets, a pedestrian re-identification method that incorporates a shared feature branch as well as a fused attention mechanism is proposed. First, the benchmark network framework of the pedestrian re-identification method is formed by the image pre-processing method, generalized average pooling, and the loss function together, and a BNNeck structure is added to optimize the problem of inconsistent multi-loss objectives before obtaining classification features. Second, the global feature branch is assisted by the shared feature branch, which can effectively help the global feature branch obtain more distinguishable pedestrian features; the attention network is combined with person re-identification using a network structure that fuses non-local modules with channel attention to fully utilize the feature information of pedestrian images. Finally, the performance improvement of the benchmark network with the shared feature branch and the fused attention mechanism is demonstrated by experiments on the re-identification dataset.
Chapter
To tackle the fine-grained classification problem encountered in mushroom identification, a classification model, IAMR-Net (Integrating Attention Mechanism and ResNet), which combines a bilinear residual network and an attention mechanism, is proposed. The model uses a modified residual block to extract features from input images. The extracted features are embedded to fit the multi-head self-attention blocks for spatial dimensional modeling, in order to capture fine-grained relationships in the feature maps. Our model is trained with mixed loss functions on benchmarks such as Oxford 102 Flowers, CUB-200-2011, Stanford Cars, Stanford Dogs, and the Mushroom-96 dataset. The model obtained an accuracy score of over \(91\%\) on both the fine-grained benchmarks and the Mushroom-96 dataset, demonstrating its efficacy in classifying fine-grained mushroom images.
Chapter
With the increasing adoption of online learning, decreasing student engagement is becoming rampant. Detecting this is the first step in making online education more viable and effective. We present MuOE, a Multi-task Ordinality-aware Engagement detection model to identify attention levels from students' webcam videos. MuOE uses a transformer with exceptional sequence-processing capability and a novel selector-based attention mechanism that picks important video frames. Facial cue detection is used as an auxiliary task in our multi-task formulation of the problem, so the shared model base has more supervision. We leverage the ordinal nature of engagement levels by introducing a smooth loss function that penalizes predictions based on closeness to the true label. In this paper, we motivate each component of MuOE and demonstrate its utility through a set of quantitative experiments. We achieve a state-of-the-art accuracy of 57.65% (Top-2 accuracy 95.07%) on the DAiSEE dataset.
Chapter
Person search based on a semantic attribute description is an interesting task for intelligent video surveillance applications. The main objective is to locate a suspect or to find a missing person in public areas using a semantic description (e.g., a 40-year-old Asian woman) provided by an eyewitness. Such a description provides the facial soft biometrics related to the facial semantic attributes (i.e., age, gender, and ethnicity). In this paper, we introduce a new approach for person search named "Quick-Search", based on a facial semantic attribute description, to enhance the person search task in an unconstrained environment. The main contribution of the paper is to introduce a multi-attribute score fusion method that relies on soft biometric features (age, gender, ethnicity) to improve person search in a large dataset. An experimental study was conducted on the challenging FairFace dataset to validate the effectiveness of the proposed person search approach.
Article
Facial recognition technology has been developed and widely used for decades. However, it has also raised privacy concerns and researchers' expectations for privacy-preserving facial recognition technologies. To provide privacy, detailed or semantic content in face images should be obfuscated. However, face recognition algorithms have to be tailor-designed according to current obfuscation methods; as a result, the face recognition service provider has to update its commercial off-the-shelf (COTS) products for each obfuscation method. Meanwhile, current obfuscation methods have no clearly quantified explanation. This paper presents a universal face obfuscation method for a family of face recognition algorithms using the global or local structure of the eigenvector space. Through specific mathematical explanations, we show that the upper bound of the distance between the original and obfuscated face images is smaller than the given recognition threshold. Experiments show that the recognition degradation is 0% for the global-structure-based variant and 0.3%-5.3% for the local-structure-based variant, respectively. Meanwhile, we show that even if an attacker knows the whole obfuscation method, he/she has to enumerate all the possible roots of a polynomial with an obfuscation coefficient, which makes it computationally infeasible to reconstruct the original faces. Thus our method shows good performance in both privacy and recognition accuracy without modifying recognition algorithms.
Article
Human activity recognition (HAR) has become increasingly important in healthcare, sports, and fitness domains due to its wide range of applications. However, existing deep learning based HAR methods often overlook the challenges posed by the diversity of human activities and data quality, which can make feature extraction difficult. To address these issues, we propose a new neural network model called MAG-Res2Net, which incorporates the Borderline-SMOTE data upsampling algorithm, a loss function combination algorithm based on metric learning, and the Lion optimization algorithm. We evaluate the proposed method on two widely used public datasets, UCI-HAR and WISDM, and achieve state-of-the-art performance. Specifically, on the UCI-HAR dataset, our model achieves accuracy, F1-macro, and F1-weighted scores of 93.58%, 93.83%, and 92.16%, respectively. On the WISDM dataset, the scores are 94.28%, 94.01%, and 94.25%, respectively. Our results show that the proposed MAG-Res2Net model can flexibly control time and space costs by adding or reducing network layers, and has better performance compared to existing methods.
Conference Paper
Convolutional neural networks (CNNs) have been widely used in the computer vision community, significantly improving the state-of-the-art. But learning good features is often computationally expensive in machine learning settings and is especially difficult when data are scarce. One-shot learning is one such area where only limited data are available. In one-shot learning, predictions have to be made after seeing only one example from a class, which requires special techniques. In this paper we explore different approaches to one-shot identification tasks in different domains, including an industrial application and face recognition. We use a stacked-image technique together with siamese capsule networks. It is encouraging to see that the capsule-based approach achieves strong results and exceeds other techniques on a wide range of datasets, from an industrial application to face recognition benchmarks, while being easy to use and optimise.
Article
Full-text available
Face recognition for surveillance remains a complex challenge due to the disparity between low-resolution (LR) face images captured by surveillance cameras and the typically high-resolution (HR) face images in databases. To address this cross-resolution face recognition problem, we propose a two-stage dual-resolution face network to learn more robust resolution-invariant representations. In the first stage, we pre-train the proposed dual-resolution face network using solely HR images. Our network utilizes a two-branch structure and introduces bilateral connections to fuse the high- and low-resolution features extracted by two branches, respectively. In the second stage, we introduce the triplet loss as the fine-tuning loss function and design a training strategy that combines the triplet loss with competence-based curriculum learning. According to the competence function, the pre-trained model can train first from easy sample sets and gradually progress to more challenging ones. Our method achieves a remarkable face verification accuracy of 99.25% on the native cross-quality dataset SCFace and 99.71% on the high-quality dataset LFW. Moreover, our method also enhances the face verification accuracy on the native low-quality dataset.
Article
Full-text available
Personalized learning has gained significant attention in education as a means to cater to the diverse needs of learners and optimize educational outcomes. However, ensuring the efficiency of personalized learning remains a challenge. It requires the ability to accurately analyze and interpret vast amounts of data collected from learners. Traditional analytical approaches often struggle to handle the complexity and heterogeneity of this data, limiting the potential for personalized learning interventions. To address these challenges, this paper proposes a personalized learning efficiency data analysis network (PLEDANet) based on machine learning. First, PLEDANet redesigns a convolutional neural network based on the ResNet structure. The network performs convolutions using multiple convolution kernels of different scales to extract diverse feature information from personalized learning efficiency data. To enhance the extraction and representation of fine-grained differentiated features, PLEDANet introduces a hybrid attention module to combine channel and spatial information among feature maps. Second, PLEDANet designs a hybrid loss function for model training, which consists of the AM-softmax loss and the Center loss. The former increases the inter-class distance of features by imposing a fixed angular margin, while the latter reduces the intra-class distance by constraining the samples and feature centers. Finally, extensive experiments are conducted on PLEDANet. The experimental results validate the superiority of PLEDANet for personalized learning efficiency analysis.
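A minimal PyTorch-style sketch of the hybrid supervision described above (an additive-margin softmax term plus a center-loss term) is given below; the scale s, margin m, and weighting factor lam are assumed hyper-parameters rather than values from the paper:

import torch
import torch.nn.functional as F

def am_softmax_loss(features, weight, labels, s=30.0, m=0.35):
    # cosine similarity between L2-normalised features (N, D) and class weights (C, D)
    cos = F.normalize(features) @ F.normalize(weight).t()        # (N, C)
    target = cos.gather(1, labels.view(-1, 1)).squeeze(1) - m    # subtract margin on the true class
    cos = cos.scatter(1, labels.view(-1, 1), target.unsqueeze(1))
    return F.cross_entropy(s * cos, labels)

def center_loss(features, centers, labels):
    # pull each deep feature towards its class center (centers: (C, D) learnable tensor)
    return ((features - centers[labels]) ** 2).sum(dim=1).mean() / 2

# joint objective: total = am_softmax_loss(f, W, y) + lam * center_loss(f, C, y)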
Conference Paper
Full-text available
In this paper we propose a novel semantic label transfer method using supervised geodesic propagation (SGP). We use supervised learning to guide the seed selection and the label propagation. Given an input image, we first retrieve its similar image set from annotated databases. A JointBoost model is learned on the similar image set of the input image. The recognition proposal map of the input image is then inferred by this learned model. The initial distance map is defined by the proposal map: the higher the probability, the smaller the distance. In each iteration of the geodesic propagation, the seed is selected as the undetermined superpixel with the smallest distance. We learn a classifier as an indicator of whether to propagate labels between two neighboring superpixels. The training samples of the indicator are annotated neighboring pairs from the similar image set. The geodesic distances of the seed's neighbors are updated according to the combination of the texture and boundary features and the indicator value. Experiments on three datasets show that our method outperforms traditional learning-based methods and a previous label transfer method on the semantic segmentation task.
Article
Full-text available
This paper proposes to learn a set of high-level feature representations through deep learning, referred to as Deep hidden IDentity features (DeepID), for face verification. We argue that DeepID can be effectively learned through challenging multi-class face identification tasks, whilst they can be generalized to other tasks (such as verification) and to new identities unseen in the training set. Moreover, the generalization capability of DeepID increases as more face classes are to be predicted at training. DeepID features are taken from the last hidden layer neuron activations of deep convolutional networks (ConvNets). When trained as classifiers to recognize about 10,000 face identities in the training set and configured to keep reducing the neuron numbers along the feature extraction hierarchy, these deep ConvNets gradually form compact identity-related features in the top layers with only a small number of hidden neurons. The proposed features are extracted from various face regions to form complementary and over-complete representations. Any state-of-the-art classifier can be learned based on these high-level representations for face verification. 97.45% verification accuracy on LFW is achieved with only weakly aligned faces.
Article
Full-text available
Scene recognition is one of the hallmark tasks of computer vision, allowing definition of a context for object recognition. Whereas the tremendous recent progress in object recognition tasks is due to the availability of large datasets like ImageNet and the rise of Convolutional Neural Networks (CNNs) for learning high-level features, performance at scene recognition has not attained the same level of success. This may be because current deep features trained from ImageNet are not competitive enough for such tasks. Here, we introduce a new scene-centric database called Places with over 7 million labeled pictures of scenes. We propose new methods to compare the density and diversity of image datasets and show that Places is as dense as other scene datasets and has more diversity. Using CNN, we learn deep features for scene recognition tasks, and establish new state-of-the-art results on several scene-centric datasets. A visualization of the CNN layers' responses allows us to show differences in the internal representations of object-centric and scene-centric networks.
Conference Paper
Full-text available
Visual features are of vital importance for human action understanding in videos. This paper presents a new video representation, called trajectory-pooled deep-convolutional descriptor (TDD), which shares the merits of both hand-crafted features and deep-learned features. Specifically, we utilize deep architectures to learn discriminative convolutional feature maps, and conduct trajectory-constrained pooling to aggregate these convolutional features into effective descriptors. To enhance the robustness of TDDs, we design two normalization methods to transform convolutional feature maps, namely spatiotemporal normalization and channel normalization. The advantages of our features come from (i) TDDs are automatically learned and contain high discriminative capacity compared with those hand-crafted features; (ii) TDDs take account of the intrinsic characteristics of temporal dimension and introduce the strategies of trajectory-constrained sampling and pooling for aggregating deep-learned features. We conduct experiments on two challenging datasets: HMDB51 and UCF101. Experimental results show that TDDs outperform previous hand-crafted features and deep-learned features. Our method also achieves superior performance to the state of the art on these datasets (HMDB51 65.9%, UCF101 91.5%).
Article
Full-text available
Recent face recognition experiments on the LFW benchmark show that face recognition is performing stunningly well, surpassing human recognition rates. In this paper, we study face recognition at scale. Specifically, we have collected one million faces from Flickr and evaluated state-of-the-art face recognition algorithms on this dataset. We found that the performance of algorithms varies: while all perform great on LFW, once evaluated at scale recognition rates drop drastically for most algorithms. Interestingly, the deep-learning-based approach of Schroff et al. (FaceNet) performs much better, but still becomes less robust at scale. We consider both verification and identification problems, and evaluate how pose affects recognition at scale. Moreover, we ran an extensive human study on Mechanical Turk to evaluate human recognition at scale, and report results. All the photos are Creative Commons photos and will be released for research and further experiments.
Article
Full-text available
We present the Yahoo Flickr Creative Commons 100 Million Dataset (YFCC100M), the largest public multimedia collection that has ever been released. The dataset contains a total of 100 million media objects, of which approximately 99.2 million are photos and 0.8 million are videos, all of which carry a Creative Commons license. Each media object in the dataset is represented by several pieces of metadata, e.g. Flickr identifier, owner name, camera, title, tags, geo, media source. The collection provides a comprehensive snapshot of how photos and videos were taken, described, and shared over the years, from the inception of Flickr in 2004 until early 2014. In this article we explain the rationale behind its creation, as well as the implications the dataset has for science, research, engineering, and development. We further present several new challenges in multimedia research that can now be expanded upon with our dataset.
Article
Full-text available
With the success of new computational architectures for visual processing, such as convolutional neural networks (CNN) and access to image databases with millions of labeled examples (e.g., ImageNet, Places), the state of the art in computer vision is advancing rapidly. One important factor for continued progress is to understand the representations that are learned by the inner layers of these deep architectures. Here we show that object detectors emerge from training CNNs to perform scene classification. As scenes are composed of objects, the CNN for scene classification automatically discovers meaningful object detectors, representative of the learned scene categories. With object detectors emerging as a result of learning scenes, our work demonstrates that the same network can perform both scene recognition and object localization in a single forward-pass, without ever having explicitly learned the notion of objects.
Article
Full-text available
This paper designs a high-performance deep convolutional network (DeepID2+) for face recognition. It is learned with the identification-verification supervisory signal. By increasing the dimension of hidden representations and adding supervision to early convolutional layers, DeepID2+ achieves new state-of-the-art on LFW and YouTube Faces benchmarks. Through empirical studies, we have discovered three properties of its deep neural activations critical for the high performance: sparsity, selectiveness and robustness. (1) It is observed that neural activations are moderately sparse. Moderate sparsity maximizes the discriminative power of the deep net as well as the distance between images. It is surprising that DeepID2+ still can achieve high recognition accuracy even after the neural responses are binarized. (2) Its neurons in higher layers are highly selective to identities and identity-related attributes. We can identify different subsets of neurons which are either constantly excited or inhibited when different identities or attributes are present. Although DeepID2+ is not taught to distinguish attributes during training, it has implicitly learned such high-level concepts. (3) It is much more robust to occlusions, although occlusion patterns are not included in the training set.
Article
Full-text available
Predicting face attributes from web images is challenging due to background clutter and face variations. A novel deep learning framework is proposed for face attribute prediction in the wild. It cascades two CNNs (LNet and ANet) for face localization and attribute prediction, respectively. These nets are trained in a cascaded manner with attribute labels, but pre-trained differently. LNet is pre-trained with massive general object categories, while ANet is pre-trained with massive face identities. This framework not only outperforms the state-of-the-art by a large margin, but also reveals multiple valuable facts about learning face representations, as below. (1) It shows how LNet and ANet can be improved by different pre-training strategies. (2) It reveals that although the filters of LNet are fine-tuned by attribute labels, their response maps over the entire image give a strong indication of the face's location. This fact enables training LNet for face localization with only attribute tags, but without face bounding boxes (which are required by all detection works). With a novel fast feed-forward scheme, the cascade of LNet and ANet can localize faces and recognize attributes in images of arbitrary size in real time. (3) It also demonstrates that the high-level hidden neurons of ANet automatically discover semantic concepts after pre-training, and such concepts are significantly enriched after fine-tuning. Each attribute can be well explained by a sparse linear combination of these concepts. By analyzing such combinations, attributes show clear grouping patterns, which can be well interpreted semantically.
Article
Full-text available
Pushed by big data and deep convolutional neural networks (CNNs), the performance of face recognition is becoming comparable to that of humans. Using private large-scale training datasets, several groups achieve very high performance on LFW, i.e., 97% to 99%. While there are many open-source implementations of CNNs, no large-scale face dataset is publicly available. The current situation in the field of face recognition is that data is more important than the algorithm. To solve this problem, this paper proposes a semi-automatic way to collect face images from the Internet and builds a large-scale dataset containing about 10,000 subjects and 500,000 images, called CASIA-WebFace. Based on this database, we use an 11-layer CNN to learn a discriminative representation and obtain state-of-the-art accuracy on LFW and YTF. The publication of CASIA-WebFace will attract more research groups into this field and accelerate the development of face recognition in the wild.
Article
Full-text available
We propose a deep convolutional neural network architecture codenamed "Inception", which was responsible for setting the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC 2014). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. This was achieved by a carefully crafted design that allows for increasing the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC 2014 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.
Article
Full-text available
The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the five years of the challenge, and propose future directions and improvements.
Article
Full-text available
Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models. The framework is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures. Caffe fits industry and internet-scale media needs by CUDA GPU computation, processing over 40 million images a day on a single K40 or Titan GPU (approx 2 ms per image). By separating model representation from actual implementation, Caffe allows experimentation and seamless switching among platforms for ease of development and deployment from prototyping machines to cloud environments. Caffe is maintained and developed by the Berkeley Vision and Learning Center (BVLC) with the help of an active community of contributors on GitHub. It powers ongoing research projects, large-scale industrial applications, and startup prototypes in vision, speech, and multimedia.
Article
Full-text available
The key challenge of face recognition is to develop effective feature representations for reducing intra-personal variations while enlarging inter-personal differences. In this paper, we show that it can be well solved with deep learning and using both face identification and verification signals as supervision. The Deep IDentification-verification features (DeepID2) are learned with carefully designed deep convolutional networks. The face identification task increases the inter-personal variations by drawing DeepID2 extracted from different identities apart, while the face verification task reduces the intra-personal variations by pulling DeepID2 extracted from the same identity together, both of which are essential to face recognition. The learned DeepID2 features can be well generalized to new identities unseen in the training data. On the challenging LFW dataset, 99.15% face verification accuracy is achieved. Compared with the best deep learning result on LFW, the error rate has been significantly reduced by 67%.
Conference Paper
Full-text available
We propose in this paper a fully automated deep model, which learns to classify human actions without using any prior knowledge. The first step of our scheme, based on the extension of Convolutional Neural Networks to 3D, automatically learns spatio-temporal features. A Recurrent Neural Network is then trained to classify each sequence considering the temporal evolution of the learned features for each timestep. Experimental results on the KTH dataset show that the proposed approach outperforms existing deep models, and gives comparable results with the best related works.
Conference Paper
Full-text available
The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called “ImageNet”, a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.
Conference Paper
Full-text available
Dimensionality reduction involves mapping a set of high-dimensional input points onto a low-dimensional manifold so that "similar" points in input space are mapped to nearby points on the manifold. We present a method - called Dimensionality Reduction by Learning an Invariant Mapping (DrLIM) - for learning a globally coherent nonlinear function that maps the data evenly to the output manifold. The learning relies solely on neighborhood relationships and does not require any distance measure in the input space. The method can learn mappings that are invariant to certain transformations of the inputs, as is demonstrated with a number of experiments. Comparisons are made to other techniques, in particular LLE.
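A minimal sketch of the margin-based contrastive objective underlying DrLIM, written with plain PyTorch tensor operations (the margin value is an assumption): similar pairs are pulled together, dissimilar pairs are pushed apart until they exceed the margin.

import torch

def contrastive_loss(x1, x2, same, margin=1.0):
    # x1, x2: (N, D) embeddings; same: (N,) float tensor, 1.0 for similar pairs, 0.0 otherwise
    d = torch.norm(x1 - x2, dim=1)
    pos = same * d.pow(2)                                   # pull similar pairs together
    neg = (1 - same) * torch.clamp(margin - d, min=0).pow(2)  # push dissimilar pairs beyond the margin
    return 0.5 * (pos + neg).mean()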
Conference Paper
Full-text available
We present a method for training a similarity metric from data. The method can be used for recognition or verification applications where the number of categories is very large and not known during training, and where the number of training samples for a single category is very small. The idea is to learn a function that maps input patterns into a target space such that the L1 norm in the target space approximates the "semantic" distance in the input space. The method is applied to a face verification task. The learning process minimizes a discriminative loss function that drives the similarity metric to be small for pairs of faces from the same person, and large for pairs from different persons. The mapping from raw to the target space is a convolutional network whose architecture is designed for robustness to geometric distortions. The system is tested on the Purdue/AR face database which has a very high degree of variability in the pose, lighting, expression, position, and artificial occlusions such as dark glasses and obscuring scarves.
Article
Full-text available
Multilayer neural networks trained with the back-propagation algorithm constitute the best example of a successful gradient-based learning technique. Given an appropriate network architecture, gradient-based learning algorithms can be used to synthesize a complex decision surface that can classify high-dimensional patterns, such as handwritten characters, with minimal preprocessing. This paper reviews various methods applied to handwritten character recognition and compares them on a standard handwritten digit recognition task. Convolutional neural networks, which are specifically designed to deal with the variability of 2D shapes, are shown to outperform all other techniques. Real-life document recognition systems are composed of multiple modules including field extraction, segmentation, recognition, and language modeling. A new learning paradigm, called graph transformer networks (GTN), allows such multimodule systems to be trained globally using gradient-based methods so as to minimize an overall performance measure. Two systems for online handwriting recognition are described. Experiments demonstrate the advantage of global training, and the flexibility of graph transformer networks. A graph transformer network for reading a bank cheque is also described. It uses convolutional neural network character recognizers combined with global training techniques to provide record accuracy on business and personal cheques. It is deployed commercially and reads several million cheques per day.
Conference Paper
Rectified activation units (rectifiers) are essential for state-of-the-art neural networks. In this work, we study rectifier neural networks for image classification from two aspects. First, we propose a Parametric Rectified Linear Unit (PReLU) that generalizes the traditional rectified unit. PReLU improves model fitting with nearly zero extra computational cost and little overfitting risk. Second, we derive a robust initialization method that particularly considers the rectifier nonlinearities. This method enables us to train extremely deep rectified models directly from scratch and to investigate deeper or wider network architectures. Based on our PReLU networks (PReLU-nets), we achieve 4.94% top-5 test error on the ImageNet 2012 classification dataset. This is a 26% relative improvement over the ILSVRC 2014 winner (GoogLeNet, 6.66%). To our knowledge, our result is the first to surpass human-level performance (5.1%, Russakovsky et al.) on this visual recognition challenge.
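As a brief illustration, the PReLU activation can be spelled out as follows; torch.nn.PReLU provides equivalent behaviour, and this sketch only makes the learnable negative slope explicit (the initial value 0.25 matches common practice but is an assumption here):

import torch
import torch.nn as nn

class SimplePReLU(nn.Module):
    def __init__(self, init=0.25):
        super().__init__()
        self.a = nn.Parameter(torch.tensor(init))  # negative-part slope, learned with the network

    def forward(self, x):
        # f(x) = x for x > 0, and a * x for x <= 0
        return torch.clamp(x, min=0) + self.a * torch.clamp(x, max=0)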
Conference Paper
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
Article
Face detection and alignment in unconstrained environments are challenging due to various poses, illuminations and occlusions. Recent studies show that deep learning approaches can achieve impressive performance on these two tasks. In this paper, we propose a deep cascaded multi-task framework which exploits the inherent correlation between them to boost their performance. In particular, our framework adopts a cascaded structure with three stages of carefully designed deep convolutional networks that predict face and landmark locations in a coarse-to-fine manner. In addition, in the learning process, we propose a new online hard sample mining strategy that can improve the performance automatically without manual sample selection. Our method achieves superior accuracy over the state-of-the-art techniques on the challenging FDDB and WIDER FACE benchmarks for face detection, and the AFLW benchmark for face alignment, while maintaining real-time performance.
Article
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
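A minimal sketch of a residual building block in the spirit described above, where the stacked layers learn F(x) and the block outputs F(x) + x via an identity shortcut; channel counts and layer choices are illustrative assumptions:

import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # the block learns the residual F(x); the identity shortcut adds x back
        return self.relu(self.body(x) + x)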
Article
Large face datasets are important for advancing face recognition research, but they are tedious to build, because a lot of work has to go into cleaning the huge amount of raw data. To facilitate this task, we describe an approach to building face datasets that starts with detecting faces in images returned from searches for public figures on the Internet, followed by discarding those not belonging to each queried person. We formulate the problem of identifying the faces to be removed as a quadratic programming problem, which exploits the observations that faces of the same person should look similar, have the same gender, and normally appear at most once per image. Our results show that this method can reliably clean a large dataset, leading to a considerable reduction in the work needed to build it. Finally, we are releasing the FaceScrub dataset that was created using this approach. It consists of 141,130 faces of 695 public figures and can be obtained from http://vintage.winklerbros.net/facescrub.html.
Article
Face recognition has been studied for many decades. As opposed to traditional hand-crafted features such as LBP and HOG, much more sophisticated features can be learned automatically by deep learning methods in a data-driven way. In this paper, we propose a two-stage approach that combines a multi-patch deep CNN and deep metric learning, which extracts low-dimensional but very discriminative features for face verification and recognition. Experiments show that this method outperforms other state-of-the-art methods on the LFW dataset, achieving 99.85% pair-wise verification accuracy and significantly better accuracy under two other, more practical protocols. This paper also discusses the importance of data size and the number of patches, showing a clear path to practical high-performance face recognition systems in the real world.
Article
This paper introduces a method for face recognition across age and also a dataset containing variations of age in the wild. We use a data-driven method to address the cross-age face recognition problem, called cross-age reference coding (CARC). By leveraging a large-scale image dataset freely available on the Internet as a reference set, CARC can encode the low-level feature of a face image with an age-invariant reference space. In the retrieval phase, our method only requires a linear projection to encode the feature and thus it is highly scalable. To evaluate our method, we introduce a large-scale dataset called the cross-age celebrity dataset (CACD). The dataset contains more than 160,000 images of 2,000 celebrities with ages ranging from 16 to 62. Experimental results show that our method can achieve state-of-the-art performance on both CACD and the other widely used dataset for face recognition across age. To understand the difficulties of face recognition across age, we further construct a verification subset from the CACD called CACD-VS and conduct human evaluation using Amazon Mechanical Turk. CACD-VS contains 2,000 positive pairs and 2,000 negative pairs and is carefully annotated by checking both the associated image and web contents. Our experiments show that although state-of-the-art methods can achieve performance competitive with average human performance, majority votes of several humans achieve much higher performance on this task. The gap between machine and human suggests possible directions for further improvement of cross-age face recognition in the future.
Conference Paper
This paper presents a new discriminative deep metric learning (DDML) method for face verification in the wild. Different from existing metric-learning-based face verification methods, which aim to learn a Mahalanobis distance metric that simultaneously maximizes inter-class variations and minimizes intra-class variations, the proposed DDML trains a deep neural network that learns a set of hierarchical nonlinear transformations to project face pairs into the same feature subspace. In this subspace, the distance of each positive face pair is required to fall below a smaller threshold and that of each negative pair to exceed a larger threshold, so that discriminative information can be exploited in the deep network. Our method achieves very competitive face verification performance on the widely used LFW and YouTube Faces (YTF) datasets.
Article
Despite significant recent advances in the field of face recognition, implementing face verification and recognition efficiently at scale presents serious challenges to current approaches. In this paper we present a system, called FaceNet, that directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity. Once this space has been produced, tasks such as face recognition, verification and clustering can be easily implemented using standard techniques with FaceNet embeddings as feature vectors. Our method uses a deep convolutional network trained to directly optimize the embedding itself, rather than an intermediate bottleneck layer as in previous deep learning approaches. To train, we use triplets of roughly aligned matching / non-matching face patches generated using a novel online triplet mining method. The benefit of our approach is much greater representational efficiency: we achieve state-of-the-art face recognition performance using only 128 bytes per face. On the widely used Labeled Faces in the Wild (LFW) dataset, our system achieves a new record accuracy of 99.63%. On YouTube Faces DB it achieves 95.12%. Our system cuts the error rate in comparison to the best published result by 30% on both datasets.
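A minimal sketch of the triplet objective described above, omitting the online triplet mining step; the margin alpha is an assumed hyper-parameter. The anchor-positive distance is driven to be smaller than the anchor-negative distance by at least the margin.

import torch

def triplet_loss(anchor, positive, negative, alpha=0.2):
    # anchor, positive, negative: (N, D) embeddings of roughly aligned face patches
    d_ap = (anchor - positive).pow(2).sum(dim=1)   # squared distance to the same identity
    d_an = (anchor - negative).pow(2).sum(dim=1)   # squared distance to a different identity
    return torch.clamp(d_ap - d_an + alpha, min=0).mean()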
Article
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively.
Conference Paper
In modern face recognition, the conventional pipeline consists of four stages: detect => align => represent => classify. We revisit both the alignment step and the representation step by employing explicit 3D face modeling in order to apply a piecewise affine transformation, and derive a face representation from a nine-layer deep neural network. This deep network involves more than 120 million parameters using several locally connected layers without weight sharing, rather than the standard convolutional layers. Thus we trained it on the largest facial dataset to-date, an identity labeled dataset of four million facial images belonging to more than 4,000 identities, where each identity has an average of over a thousand samples. The learned representations coupling the accurate model-based alignment with the large facial database generalize remarkably well to faces in unconstrained environments, even with a simple classifier. Our method reaches an accuracy of 97.25% on the Labeled Faces in the Wild (LFW) dataset, reducing the error of the current state of the art by more than 25%, closely approaching human-level performance.
Conference Paper
This paper proposes a hybrid convolutional network (ConvNet)-Restricted Boltzmann Machine (RBM) model for face verification in wild conditions. A key contribution of this work is to directly learn relational visual features, which indicate identity similarities, from raw pixels of face pairs with a hybrid deep network. The deep ConvNets in our model mimic the primary visual cortex to jointly extract local relational visual features from two face images compared with the learned filter pairs. These relational features are further processed through multiple layers to extract high-level and global features. Multiple groups of ConvNets are constructed in order to achieve robustness and characterize face similarities from different aspects. The top-layer RBM performs inference from complementary high-level features extracted from different ConvNet groups with a two-level average pooling hierarchy. The entire hybrid deep network is jointly fine-tuned to optimize for the task of face verification. Our model achieves competitive face verification performance on the LFW dataset.
Recognizing faces in unconstrained videos is a task of mounting importance. While obviously related to face recognition in still images, it has its own unique characteristics and algorithmic requirements. Over the years several methods have been suggested for this problem, and a few benchmark data sets have been assembled to facilitate its study. However, there is a sizable gap between the actual application needs and the current state of the art. In this paper we make the following contributions. (a) We present a comprehensive database of labeled videos of faces in challenging, uncontrolled conditions (i.e., 'in the wild'), the 'YouTube Faces' database, along with benchmark, pair-matching tests. (b) We employ our benchmark to survey and compare the performance of a large variety of existing video face recognition techniques. Finally, (c) we describe a novel set-to-set similarity measure, the Matched Background Similarity (MBGS). This similarity is shown to considerably improve performance on the benchmark tests.
Conference Paper
We consider the automated recognition of human actions in surveillance videos. Most current methods build classifiers based on complex handcrafted features computed from the raw inputs. Convolutional neural networks (CNNs) are a type of deep model that can act directly on the raw inputs. However, such models are currently limited to handling 2D inputs. In this paper, we develop a novel 3D CNN model for action recognition. This model extracts features from both the spatial and the temporal dimensions by performing 3D convolutions, thereby capturing the motion information encoded in multiple adjacent frames. The developed model generates multiple channels of information from the input frames, and the final feature representation combines information from all channels. To further boost the performance, we propose regularizing the outputs with high-level features and combining the predictions of a variety of different models. We apply the developed models to recognize human actions in the real-world environment of airport surveillance videos, and they achieve superior performance in comparison to baseline methods.
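As a short illustration of the core operation only, a 3D convolution spans both the spatial and the temporal dimensions of a stack of adjacent frames; the input size and channel counts below are arbitrary assumptions, not the model's actual configuration:

import torch
import torch.nn as nn

frames = torch.randn(1, 1, 7, 60, 40)             # (batch, channels, frames, height, width)
conv3d = nn.Conv3d(1, 16, kernel_size=(3, 5, 5))  # kernel spans 3 frames and a 5x5 spatial window
features = conv3d(frames)                          # -> (1, 16, 5, 56, 36): motion-aware feature maps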
Article
Most face databases have been created under controlled conditions to facilitate the study of specific parameters on the face recognition problem. These parameters include such variables as position, pose, lighting, background, camera quality, and gender. While there are many applications for face recognition technology in which one can control the parameters of image acquisition, there are also many applications in which the practitioner has little or no control over such parameters. This database, Labeled Faces in the Wild, is provided as an aid in studying the latter, unconstrained, recognition problem. The database contains labeled face photographs spanning the range of conditions typically encountered in everyday life. The database exhibits “natural” variability in factors such as pose, lighting, race, accessories, occlusions, and background. In addition to describing the details of the database, we provide specific experimental paradigms for which the database is suitable. This is done in an effort to make research performed with the database as consistent and comparable as possible. We provide baseline results, including results of a state of the art face recognition system combined with a face alignment system. To facilitate experimentation on the database, we provide several parallel databases, including an aligned version.
Article
This paper presents a novel and efficient facial image representation based on local binary pattern (LBP) texture features. The face image is divided into several regions from which the LBP feature distributions are extracted and concatenated into an enhanced feature vector to be used as a face descriptor. The performance of the proposed method is assessed in the face recognition problem under different challenges. Other applications and several extensions are also discussed.
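A hedged sketch of a basic LBP face descriptor in the spirit of the method above: compute a 3x3 LBP code per pixel, then concatenate per-region histograms. The region grid and the 256-bin histogram length are illustrative assumptions.

import numpy as np

def lbp_descriptor(gray, grid=(7, 7)):
    """gray: 2-D array of pixel intensities; returns concatenated regional LBP histograms."""
    h, w = gray.shape
    center = gray[1:-1, 1:-1]
    codes = np.zeros((h - 2, w - 2), dtype=np.int32)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        # each of the 8 neighbors contributes one bit of the LBP code
        neighbor = gray[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes += (neighbor >= center).astype(np.int32) << bit
    hists = []
    for rows in np.array_split(codes, grid[0], axis=0):
        for region in np.array_split(rows, grid[1], axis=1):
            hists.append(np.bincount(region.ravel(), minlength=256))
    return np.concatenate(hists)  # the face descriptor used for matching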
Article
The nearest neighbor decision rule assigns to an unclassified sample point the classification of the nearest of a set of previously classified points. This rule is independent of the underlying joint distribution on the sample points and their classifications, and hence the probability of error R of such a rule must be at least as great as the Bayes probability of error R*, the minimum probability of error over all decision rules taking the underlying probability structure into account. However, in a large-sample analysis, we will show in the M-category case that R* ≤ R ≤ R*(2 - MR*/(M-1)), where these bounds are the tightest possible, for all suitably smooth underlying distributions. Thus for any number of categories, the probability of error of the nearest neighbor rule is bounded above by twice the Bayes probability of error. In this sense, it may be said that half the classification information in an infinite sample set is contained in the nearest neighbor.
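For completeness, the two-class specialization of the bound quoted above follows by setting M = 2:

\[
R \;\le\; R^{\ast}\left(2 - \frac{2R^{\ast}}{2-1}\right) \;=\; 2R^{\ast}\left(1 - R^{\ast}\right) \;\le\; 2R^{\ast},
\]

which is exactly the "at most twice the Bayes probability of error" statement in the abstract.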
Article
Computation of the k-nearest neighbors generally requires a large number of expensive distance computations. The method of branch and bound is implemented in the present algorithm to facilitate rapid calculation of the k-nearest neighbors, by eliminating the necessity of calculating many distances. Experimental results demonstrate the efficiency of the algorithm. Typically, an average of only 61 distance computations were made to find the nearest neighbor of a test sample among 1000 design samples.
Labeled faces in the wild: updates and new reporting procedures
  • G B Huang
  • E Learned-Miller
Labeled faces in the wild: A database for studying face recognition in unconstrained environments
  • G B Huang
  • M Ramesh
  • T Berg
  • E Learned-Miller