Conference Paper

# A Discriminative Feature Learning Approach for Deep Face Recognition

## Abstract

Convolutional neural networks (CNNs) have been widely used in the computer vision community, significantly improving the state of the art. In most available CNNs, the softmax loss function is used as the supervision signal to train the deep model. To enhance the discriminative power of the deeply learned features, this paper proposes a new supervision signal, called center loss, for the face recognition task. Specifically, the center loss simultaneously learns a center for the deep features of each class and penalizes the distances between the deep features and their corresponding class centers. More importantly, we prove that the proposed center loss function is trainable and easy to optimize in CNNs. With the joint supervision of softmax loss and center loss, we can train robust CNNs to obtain deep features that satisfy, as far as possible, the two key learning objectives of inter-class dispersion and intra-class compactness, both of which are essential to face recognition. It is encouraging to see that our CNNs (with such joint supervision) achieve state-of-the-art accuracy on several important face recognition benchmarks: Labeled Faces in the Wild (LFW), YouTube Faces (YTF), and the MegaFace Challenge. In particular, our new approach achieves the best results on MegaFace (the largest public-domain face benchmark) under the small-training-set protocol (fewer than 500,000 images and fewer than 20,000 persons), significantly improving on previous results and setting a new state of the art for both face recognition and face verification tasks.
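
As a rough illustration of the joint supervision described in the abstract, the sketch below takes the center loss to be half the summed squared Euclidean distance between each feature and its class center, and adds it to a softmax cross-entropy term. The weighting factor `lam`, the step size `alpha`, the simple move-toward-the-batch-mean center update, and all function names are assumptions made for this example; the abstract does not specify them.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def center_loss(features: List[Vector], labels: List[int],
                centers: Dict[int, List[float]]) -> float:
    """Half the summed squared distance from each feature to its class center."""
    return 0.5 * sum(squared_distance(f, centers[y])
                     for f, y in zip(features, labels))


def softmax_cross_entropy(logits: Vector, label: int) -> float:
    """Numerically stable softmax cross-entropy for one sample."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def joint_loss(logits_batch: List[Vector], features: List[Vector],
               labels: List[int], centers: Dict[int, List[float]],
               lam: float = 0.01) -> float:
    """Summed softmax loss plus lam times the center loss (lam is hypothetical)."""
    softmax_term = sum(softmax_cross_entropy(z, y)
                       for z, y in zip(logits_batch, labels))
    return softmax_term + lam * center_loss(features, labels, centers)


def update_centers(features: List[Vector], labels: List[int],
                   centers: Dict[int, List[float]],
                   alpha: float = 0.5) -> Dict[int, List[float]]:
    """Return centers moved a fraction alpha toward each class's batch mean.

    This simple rule is an assumption for illustration only.
    """
    updated = {}
    for k, c in centers.items():
        members = [f for f, y in zip(features, labels) if y == k]
        if not members:
            updated[k] = list(c)  # class absent from the batch: keep its center
            continue
        mean = [sum(col) / len(members) for col in zip(*members)]
        updated[k] = [ci + alpha * (mi - ci) for ci, mi in zip(c, mean)]
    return updated


features = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
labels = [0, 0, 1]
centers = {0: [2.0, 0.0], 1: [0.0, 0.0]}
logits = [[0.0, 0.0]] * 3  # uninformative logits, just to exercise the code

print(center_loss(features, labels, centers))
print(joint_loss(logits, features, labels, centers, lam=0.1))
print(center_loss(features, labels, update_centers(features, labels, centers)))
```

In this toy batch the center of class 1 starts far from its only sample, so moving it toward the batch mean lowers the center loss while the softmax term is unchanged.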

... To overcome this problem, Center loss [33] is proposed to compress the intra-class variations for face recognition tasks. A center vector of each class is created to represent the center of each class. ...
... Inspired by the idea of Center loss, PM loss is proposed to compress the same class's features by minimizing the loss. Different from Center loss, which is sensitive to parameter initialization and difficult to compress all intraclass examples into one center under multiple centers [33], PM loss does not introduce the center vectors to represent the class centers or need more memory space under many categories. Instead, it directly calculates a center based on sampled multiple examples of the same category in a batch. ...
... Analysis of PM loss: To verify the effectiveness of the proposed method, PM loss was evaluated on the CUB-200-2011 dataset. The ablation study compares training with and without PM loss, and against the widely used Center loss [33]. The experimental results are shown in Table 4. ...
Article
Full-text available
Fine-grained visual classification (FGVC) is widely used to identify sub-categories of ships, dogs, flowers, and so on, and aims to help ordinary people distinguish sub-categories with only slight differences. It mainly faces the challenges of small inter-class differences and large intra-class variations. Current effective methods adopt multi-scale or multi-granularity features to find the subtle differences. However, these methods focus on accuracy while neglecting the computational cost in practice. Therefore, in this paper, an improved efficient Multi-granularity Learning method with Only Forward Once (MLOFO) is proposed. It reduces the forward and backward propagation in training from several passes to one, and decreases the computational cost several-fold. In addition, an intra-class metric loss, named prototype metric (PM) loss, is proposed to supervise the learning of effective features for classification in a multi-granularity network (MGN) framework. The effectiveness of the proposed method is verified on four fine-grained classification datasets (CUB-200-2011, Stanford Cars, FGVC-Aircraft, and AircraftCarrier). Experimental results demonstrate that our method achieves state-of-the-art accuracies, substantially improving FGVC tasks. Furthermore, we discuss how the new PM loss, like label smoothing, can compress the distribution of the intra-class features to achieve better generalization ability. Our method helps improve the training efficiency of the MGN model and, to a certain extent, the accuracy of fine-grained classification.
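
The excerpts above say that PM loss computes a center directly from the examples of a category sampled in a batch, rather than storing a learned center vector per class. The sketch below shows one plausible reading of that idea: each class's batch mean serves as its prototype, and the loss is the mean squared distance of each feature to its prototype. This is not the authors' exact formulation; the function names and the choice of squared Euclidean distance are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

Vector = Sequence[float]


def batch_prototypes(features: List[Vector],
                     labels: List[int]) -> Dict[int, List[float]]:
    """Mean feature vector of every class that appears in the batch."""
    groups = defaultdict(list)
    for f, y in zip(features, labels):
        groups[y].append(f)
    return {y: [sum(col) / len(fs) for col in zip(*fs)]
            for y, fs in groups.items()}


def batch_prototype_loss(features: List[Vector], labels: List[int]) -> float:
    """Mean squared distance from each feature to its class's batch mean."""
    protos = batch_prototypes(features, labels)
    total = sum(
        sum((a - b) ** 2 for a, b in zip(f, protos[y]))
        for f, y in zip(features, labels)
    )
    return total / len(features)


features = [[0.0, 0.0], [2.0, 0.0], [5.0, 5.0]]
labels = [0, 0, 1]
print(batch_prototypes(features, labels))
print(batch_prototype_loss(features, labels))
```

Because the prototypes are recomputed from each batch, no per-class state has to be kept in memory, which matches the memory argument made in the excerpt. A class with a single sample in the batch contributes zero to this loss.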
... Deep face recognition has improved remarkably in recent years [1], [2], [3], [4], [5], [6], [7], [8], [9]. The pipeline for deep face recognition has been widely adopted in practice [10], [4], [5], [8]. ...
... The first type applies metric learning method in deep learning [1], [2], [3], which maps face images to a deep feature space and directly optimizes distances, so that the inter-class distance is larger than the intra-class distance. The contrastive loss [1], triplet loss [2] and N-pair loss [29] are early methods to enhance the discrimination ability of deep features, which optimize intra-class and inter-class variance by using face pairs. ...
... The contrastive loss [1], triplet loss [2] and N-pair loss [29] are early methods to enhance the discrimination ability of deep features, which optimize intra-class and inter-class variance by using face pairs. Combined with softmax loss, center loss [3] obtains promising performance by simultaneously learning a center for the deep features of each class and minimizing the distances between training samples and their corresponding class centers. Then, range loss [24] minimizes overall intra-personal differences and maximizes inter-personal differences in one mini-batch. ...
Preprint
Deep face recognition has achieved great success due to large-scale training databases and rapidly developing loss functions. Existing algorithms are devoted to realizing an ideal: minimizing the intra-class distance and maximizing the inter-class distance. However, they may neglect that there are also low-quality training images that should not be optimized in this strict way. Considering the imperfection of training databases, we propose that intra-class and inter-class objectives can be optimized in a moderate way to mitigate the overfitting problem, and further propose a novel loss function, named sigmoid-constrained hypersphere loss (SFace). Specifically, SFace imposes intra-class and inter-class constraints on a hypersphere manifold, which are controlled by two sigmoid gradient re-scale functions respectively. The sigmoid curves precisely re-scale the intra-class and inter-class gradients so that training samples can be optimized to some degree. Therefore, SFace can strike a better balance between decreasing the intra-class distances for clean examples and preventing overfitting to the label noise, and contributes to more robust deep face recognition models. Extensive experiments with models trained on the CASIA-WebFace, VGGFace2, and MS-Celeb-1M databases and evaluated on several face recognition benchmarks, such as the LFW, MegaFace, and IJB-C databases, have demonstrated the superiority of SFace.
... Each average ROC curve comes from ten instantiations of the model hyper-parameterization it features. The NN used on real data from the testbed SWaT is a 1D-CNN autoencoder with the following topology [(600, 51), (150,25), (38,10),75,380, (150,25),(600, 51)], taking time windows of size 600 and 51 dimensions. ...
... A well-known loss function is the MSE, usually applied at the output of a NN. Regarding losses applied on a hidden layer, let us mention the Center Loss [150] (2.2), a version of which is applied on the code layer in [21] in addition to the MSE so as to take into account the class of the sample point x by minimizing the distance of the encoded sample point Enc(x) from the centroid c_{s(x)} of its class s(x). Thereafter, we will say that a NN is trained under a loss L jointly with another loss L when one concerns the output layer, referred to as the main loss, and the other concerns a hidden layer. ...
Thesis
Industrial systems are meant to operate for years, and their devices sometimes face energy constraints that prevent the deployment of new security measures. We therefore study passive solutions, that is, solutions that need only the data, to the problem of monitoring the physical processes of industrial systems by observing the values of sensors, actuators, and controller commands. Most of our work concerns the integrity of these data, meaning that the data linked to a set of system actions have not undergone an unexpected change, and the traceability of information, which we define as the ability to authenticate each data transformation process from the data's creation by the industrial system to its last use. We propose a new concept of Cyber-Physical System state that machine learning models can use to address the questions of data integrity and traceability, and we apply it more specifically to the autoencoder. We propose a new type of classifier neural network accompanied by a confidence measure that allows us to address our traceability problem.
... The basic principle is to combine a classification loss with an auxiliary metric learning loss. Wen et al. (2016) aim to support intra-class connectivity by forcing all feature vectors related to one class to be close to the corresponding center of the feature vectors using an auxiliary center loss. Qi and Su (2017) expand the center loss by an additional term such that it also requires inter-class separability. ...
... Qi and Su (2017) expand the center loss by an additional term such that it also requires inter-class separability. Instead of forcing the distances in feature space to be small for features belonging to the same class and large for features belonging to different classes, respectively (Wen et al., 2016;Qi and Su, 2017), there are also margin-based loss variants that introduce within-class and between-class margins to explicitly force the produced clusters to reflect inter-class separability and intraclass connectivity. Whereas distance-based margin constraints are proposed in (Huang et al., 2016;Liu et al., 2017;Yang et al., 2020) , the approaches in (Choi et al., 2020;Hameed et al., 2021) rely on angular margins. ...
... (Liu et al., 2017;Choi et al., 2020), we do not require additional hyper-parameter tuning for cluster definitions. The approaches in (Wen et al., 2016;Qi and Su, 2017) also do not introduce additional hyper-parameters in the auxiliary clustering loss, but in contrast to those works as well as the works investigating margin-based approaches, the features in our work are not only clustered based on the available class information but also based on the visual properties of the related input images. Nevertheless, the proposed clustering loss forces the distances of the feature vectors to reflect intraclass connectivity and inter-class separability. ...
Article
Full-text available
Learning from imbalanced class distributions generally leads to a classifier that is not able to distinguish classes with few training examples from the other classes. In the context of cultural heritage, addressing this problem becomes important when existing digital online collections consisting of images depicting artifacts and assigned semantic annotations shall be completed automatically; images with known annotations can be used to train a classifier that predicts missing information, where training data is often highly imbalanced. In the present paper, combining a classification loss with an auxiliary clustering loss is proposed to improve the classification performance particularly for underrepresented classes, where additionally different sampling strategies are applied. The proposed auxiliary loss aims to cluster feature vectors with respect to the semantic annotations as well as to visual properties of the images to be classified and thus, is supposed to help the classifier in distinguishing individual classes. We conduct an ablation study on a dataset consisting of images depicting silk fabrics coming along with annotations for different silk-related classification tasks. Experimental results show improvements of up to 10.5% in average F1-score and up to 20.8% in the F1-score averaged over the underrepresented classes in some classification tasks.
... Thus, the classifier trained on labeled source samples may misidentify target samples distributed at the edge of each class [26]. Considering this problem, advanced feature-based UDA methods [17,[27][28][29][30]26,31,32] encourage involving Discriminative Feature Learning (DFL) methods to learn the domain-invariant discriminative feature. Generally, DFL methods construct the intra-class compactness [27,17,[33][34][35][36] (similar to intra-class scatter [31,37] or intra-class discrepancy [38]) and the inter-class separability [33][34][35][36] (similar to inter-class dispersion [17,27] or interclass scatter [24]) to learn the discriminative feature. ...
... Considering this problem, advanced feature-based UDA methods [17,[27][28][29][30]26,31,32] encourage involving Discriminative Feature Learning (DFL) methods to learn the domain-invariant discriminative feature. Generally, DFL methods construct the intra-class compactness [27,17,[33][34][35][36] (similar to intra-class scatter [31,37] or intra-class discrepancy [38]) and the inter-class separability [33][34][35][36] (similar to inter-class dispersion [17,27] or interclass scatter [24]) to learn the discriminative feature. Most DFL methods optimize the intra-class compactness and the inter-class separability by minimizing the distances between every-two samples within the same class and maximizing the distances between every-two samples from different classes [17,26,27,33]. ...
Article
Full-text available
Most feature-based Unsupervised Domain Adaptation (UDA) methods aligned distributions of the source and target domains by minimizing Maximum Mean Discrepancy (MMD) between the two domains. However, MMD using mean values may misalign distributions of the two domains due to outliers. Besides, to enhance the identifiability of learned features, some feature-based UDA methods adopted samples-based distances to measure the intra-class compactness and inter-class separability. But, the number of samples-based distances is a quadratic function of the sample size, feature-based UDA methods using samples-based distances may lead to the inefficient computation of measuring the intra-class compactness and inter-class separability. To overcome the two problems, we propose Discriminative Transfer Feature Learning based on Robust-Centers (DTFLRC) for UDA. First, we design robust-class-centers and robust-domain-centers for decreasing the influence of outliers and establish MMD with robust-centers to align distributions of the two domains. Second, noticing that the number of centers is far smaller than the sample size, we construct three robust-centers-based distances to effectively reduce the number of distances in measuring the intra-class compactness and inter-class separability. Specifically, three robust-centers-based distances include Sample-Class-center, Class-center-Domain-center, and Class-center-Nearest-neighbor-class-center distances, where Sample-Class-center distance measures the intra-class compactness, and Class-center-Domain-center and Class-center-Nearest-neighbor-class-center distances jointly reflect the inter-class separability. Then, the optimization objective of DTFLRC is established to minimize the MMD with robust-centers and Sample-Class-center distance, and maximize Class-center-Domain-center and Class-center-Nearest-neighbor-class-center distances. 
Finally, experimental results demonstrate that DTFLRC outperforms state-of-the-art methods, where the accuracies of DTFLRC on datasets CMU-PIE, Office-Caltech, ImageCLEF-DA, and VisDA-2017 are 81.8%, 94.6%, 90.2%, and 80.5%.
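
Two claims in this abstract lend themselves to a small numeric illustration: that a mean-based MMD can be distorted by outliers, and that sample-to-sample distances grow quadratically with the sample size while sample-to-center distances do not. The sketch below uses the linear-kernel form of MMD, which reduces to the squared distance between the two domain means. The function names and toy data are assumptions for this example, and the robust centers proposed in the paper are not reproduced here.

```python
from typing import List, Sequence

Vector = Sequence[float]


def mean_vector(points: List[Vector]) -> List[float]:
    """Coordinate-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def linear_mmd_squared(source: List[Vector], target: List[Vector]) -> float:
    """Squared distance between domain means (MMD with a linear kernel)."""
    mu_s, mu_t = mean_vector(source), mean_vector(target)
    return sum((a - b) ** 2 for a, b in zip(mu_s, mu_t))


def num_pairwise_distances(n: int) -> int:
    """Number of unordered sample pairs among n samples: n(n-1)/2."""
    return n * (n - 1) // 2


def num_sample_to_center_distances(n: int) -> int:
    """One distance per sample, to its own class center."""
    return n


source = [[0.0], [1.0], [2.0]]
target_clean = [[0.0], [1.0], [2.0]]
target_outlier = [[0.0], [1.0], [2.0], [97.0]]  # one extreme point

print(linear_mmd_squared(source, target_clean))
print(linear_mmd_squared(source, target_outlier))
print(num_pairwise_distances(1000), num_sample_to_center_distances(1000))
```

A single extreme target sample moves the target mean far from the source mean even though the remaining samples coincide, which is the kind of misalignment the abstract attributes to mean-based MMD.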
... Deep learning has achieved considerable success in computer vision [1], [2], [3], [4], significantly improving the state of the art of face recognition [5], [6], [7], [8], [9], [10], [11], [12], [13]. This ubiquitous technology is now used to create innovative applications for entertainment and commercial services. ...
... With the help of deep learning technologies, face recognition has developed with unprecedented success [5], [6], [7], [8], [9], [10], [11], [12], [13]. Face recognition models are trained on large-scale training databases [29], [37], [38], and used as feature extractors to test identities that are usually disjoint from the training set [8]. ...
... which is a popular method [7] to approximate class centers. In this way, the privacy mask Δ can be generated by ...
Article
Full-text available
While convenient in daily life, face recognition technologies also raise privacy concerns for regular users on the social media since they could be used to analyze face images and videos, efficiently and surreptitiously without any security restrictions. In this paper, we investigate the face privacy protection from a technology standpoint based on a new type of customized cloak, which can be applied to all the images of a regular user, to prevent malicious face recognition systems from uncovering their identity. Specifically, we propose a new method, named one person one mask (OPOM), to generate person-specific (class-wise) universal masks by optimizing each training sample in the direction away from the feature subspace of the source identity. To make full use of the limited training images, we investigate several modeling methods, including affine hulls, class centers and convex hulls, to obtain a better description of the feature subspace of source identities. The effectiveness of the proposed method is evaluated on both common and celebrity datasets against black-box face recognition models with different loss functions and network architectures. In addition, we discuss the advantages and potential problems of the proposed method.
... Many methods have been proposed to narrow train-test gaps. Some works revise loss designs [30,39,49], some focus on imbalanced data distribution [5,12,25,53], some propose specific training strategies [42,51,52]. Although these approaches are powerful for common vision tasks, the train-test gap on Mars rover data is too challenging, making existing methods ineffective. ...
... Triplet loss [39] minimizes the distance between positive pairs and maximizes the distance between negative pairs. Center loss [49] clusters the feature representation. Focal Loss [30] aims at the imbalance between positive and negative samples. ...
... Our supervised inter-class contrastive loss not only resolves the conflicts between contrastive learning and Mars images but also clusters the feature by categories, yielding better decision boundaries and improving the classification performance. Triplet loss [39] and center loss [49] also enforce the embedded distances among different categories. However, these loss functions do not enrich the feature representation, therefore have limited effectiveness as we will show in the experiments. ...
Preprint
Full-text available
With the progress of Mars exploration, numerous Mars image data are collected and need to be analyzed. However, due to the imbalance and distortion of Martian data, the performance of existing computer vision models is unsatisfactory. In this paper, we introduce a semi-supervised framework for machine vision on Mars and try to resolve two specific tasks: classification and segmentation. Contrastive learning is a powerful representation learning technique. However, there is too much information overlap between Martian data samples, leading to a contradiction between contrastive learning and Martian data. Our key idea is to reconcile this contradiction with the help of annotations and further take advantage of unlabeled data to improve performance. For classification, we propose to ignore inner-class pairs on labeled data as well as neglect negative pairs on unlabeled data, forming supervised inter-class contrastive learning and unsupervised similarity learning. For segmentation, we extend supervised inter-class contrastive learning into an element-wise mode and use online pseudo labels for supervision on unlabeled areas. Experimental results show that our learning strategies can improve the classification and segmentation models by a large margin and outperform state-of-the-art approaches.
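
The citing excerpts for this entry describe triplet loss as pulling positive pairs together and pushing negative pairs apart. A minimal sketch of the common hinge form follows; the margin value, the use of plain (not squared) Euclidean distance, and the function name are assumptions for illustration, not details taken from the paper.

```python
import math
from typing import Sequence


def triplet_loss(anchor: Sequence[float], positive: Sequence[float],
                 negative: Sequence[float], margin: float = 1.0) -> float:
    """Hinge on (anchor-positive distance) - (anchor-negative distance) + margin."""
    d_pos = math.dist(anchor, positive)  # same-class pair: should be small
    d_neg = math.dist(anchor, negative)  # different-class pair: should be large
    return max(0.0, d_pos - d_neg + margin)


anchor = [0.0, 0.0]
positive = [3.0, 4.0]
print(triplet_loss(anchor, positive, negative=[6.0, 8.0]))  # negative already far away
print(triplet_loss(anchor, positive, negative=[0.0, 1.0]))  # negative closer than positive
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, so only violating triplets contribute gradient.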
... In the past decade, face recognition has achieved remarkable and continuous progress in improving recognition accuracy [13,50,3,63,47,56,44] and has been widely used in daily activities such as online payment and security for identification. Advanced face recognition algorithms [13,50,3,47,48,46,58,60,49,22,51,59,50] and large-scale public face datasets [63,56,29] are two key factors of these progresses. ...
... In the past decade, face recognition has achieved remarkable and continuous progress in improving recognition accuracy [13,50,3,63,47,56,44] and has been widely used in daily activities such as online payment and security for identification. Advanced face recognition algorithms [13,50,3,47,48,46,58,60,49,22,51,59,50] and large-scale public face datasets [63,56,29] are two key factors of these progresses. Nevertheless, collecting and releasing large-scale face datasets raise increasingly more concerns on the privacy leakage of identity membership [20,52] and attribute [31,6] of training samples in recent years. ...
Preprint
Face recognition, as one of the most successful applications in artificial intelligence, has been widely used in security, administration, advertising, and healthcare. However, the privacy issues of public face datasets have attracted increasing attention in recent years. Previous works simply mask most areas of faces or synthesize samples using generative models to construct privacy-preserving face datasets, which overlooks the trade-off between privacy protection and data utility. In this paper, we propose a novel framework, FaceMAE, in which face privacy and recognition performance are considered simultaneously. First, randomly masked face images are used to train the reconstruction module in FaceMAE. We tailor the instance relation matching (IRM) module to minimize the distribution gap between real faces and FaceMAE-reconstructed ones. During the deployment phase, we use the trained FaceMAE to reconstruct images from masked faces of unseen identities without extra training. The risk of privacy leakage is measured based on face retrieval between reconstructed and original datasets. Experiments show that the identities of reconstructed images are difficult to retrieve. We also perform extensive privacy-preserving face recognition experiments on several public face datasets (i.e. CASIA-WebFace and WebFace260M). Compared to the previous state of the art, FaceMAE consistently reduces the error rate by at least 50% on LFW, CFP-FP and AgeDB.
Preprint
While convenient in daily life, face recognition technologies also raise privacy concerns for regular users on the social media since they could be used to analyze face images and videos, efficiently and surreptitiously without any security restrictions. In this paper, we investigate the face privacy protection from a technology standpoint based on a new type of customized cloak, which can be applied to all the images of a regular user, to prevent malicious face recognition systems from uncovering their identity. Specifically, we propose a new method, named one person one mask (OPOM), to generate person-specific (class-wise) universal masks by optimizing each training sample in the direction away from the feature subspace of the source identity. To make full use of the limited training images, we investigate several modeling methods, including affine hulls, class centers, and convex hulls, to obtain a better description of the feature subspace of source identities. The effectiveness of the proposed method is evaluated on both common and celebrity datasets against black-box face recognition models with different loss functions and network architectures. In addition, we discuss the advantages and potential problems of the proposed method. In particular, we conduct an application study on the privacy protection of a video dataset, Sherlock, to demonstrate the potential practical usage of the proposed method. Datasets and code are available at https://github.com/zhongyy/OPOM.
... Following [16], we treat the Re-ID task as both classification learning and metric learning. We apply ID loss, triplet loss [37] and center loss [38] to each branch. The ID loss with label smoothing is formulated as: ...
... Our proposed method is implemented on three benchmark person re-ID datasets: Market1501, DukeMTMC-reID, and CUHK03. Following [38], we generate the training set, the query set, and the gallery set. For each query image, the pedestrian retrieval task is to search for the images of the same ID in the gallery set. ...
Article
Full-text available
The attention mechanism is widely employed in the Person Re-Identification task to allocate the weights of features. However, most existing attention-based methods focus on the region of interest but ignore other potentially diverse information, which may lead to sub-optimal results in some situations. To alleviate the problem, we propose a novel Attention-Guided Multi-Clue Mining Network (AMMN). By leveraging the attention mechanism and DropBlock, the model can further emphasize features outside the attention areas. All of the output features are finally grouped into a multi-clue representation that contributes to person identities. Extensive experimental results demonstrate that the proposed method outperforms current competing methods on several benchmark datasets such as Market1501, DukeMTMC-reID, and CUHK03. We also achieve state-of-the-art performance on occluded datasets.
... The margin in our work here is 0.3. We selected four different loss functions: the softmax loss, which is commonly used in multiclassification tasks, the hardmax loss, the contrastive loss [52], and the center loss [53]. The classification loss with global and local feature-matching loss achieved the best results. ...
... Its shortcomings are obvious: it is necessary to specify a margin for each pair of nonhomogeneous samples, and this margin is fixed, which results in the fixed embedding space with no distortion. The center loss learns a center for each category and pulls all the feature vectors of each category into the corresponding category center [53]. It is based on the softmax loss, and is compact only within explicit constraints. ...
Article
Full-text available
In practical classification tasks, the sample distribution of the dataset is often unbalanced; for example, this is the case in a dataset that contains a massive quantity of samples with weak labels and for which concrete identification is unavailable. Even in samples with exact labels, the number of samples corresponding to many labels is small, resulting in difficulties in learning the concepts through a small number of labeled samples. In addition, there is always a small interclass variance and a large intraclass variance among categories. Weak labels, few-shot problems, and fine-grained analysis are the key challenges affecting the performance of the classification model. In this paper, we develop a progressive training technique to address the few-shot challenge, along with a weak-label boosting method, by considering all of the weak IDs as negative samples of every predefined ID in order to take full advantage of the more numerous weak-label data. We introduce an instance-aware hard ID mining strategy in the classification loss and then further develop the global and local feature-mapping loss to expand the decision margin. We entered the proposed method into the Kaggle competition, which aims to build an algorithm to identify individual humpback whales in images. With a few other common training tricks, the proposed approach won first place in the competition. All three problems (weak labels, few-shot problems, and fine-grained analysis) exist in the dataset used in the competition. Additionally, we applied our method to CUB-2011 and Cars-196, which are the most widely-used datasets for fine-grained visual categorization tasks, and achieved respective accuracies of 90.1% and 94.9%. This experiment shows that the proposed method achieves the optimal effect compared with other common baselines, and verifies the effectiveness of our method. Our solution has been made available as an open source project.
... Further, we introduce Squeeze-Excitation Adaptors, as introduced by Wang et al. [22], for domain-specific attention to improve our model and incorporate a center loss, as proposed by Wen et al. [24], in each of the domain adaptation components to reduce the intra-class variance in the source and target domain feature space. We perform experiments and evaluate our results on the Cityscapes and Foggy Cityscapes datasets to demonstrate the superiority of our approach. ...
... In [24], Wen et al. propose a new loss function called center loss to efficiently enhance the discriminative power of the deeply learned features in neural networks. Specifically, a center (a vector with the same dimension as a feature) is learned for the deep features of each class. ...
Preprint
Full-text available
Despite growing interest in object detection, very few works address the extremely practical problem of cross-domain robustness, especially for automotive applications. In order to prevent drops in performance due to domain shift, we introduce an unsupervised domain adaptation method built on the foundation of faster-RCNN with two domain adaptation components addressing the shift at the instance and image levels respectively and apply a consistency regularization between them. We also introduce a family of adaptation layers that leverage the squeeze excitation mechanism, called SE Adaptors, to improve domain attention and thus improve performance without any prior requirement of knowledge of the new target domain. Finally, we incorporate a center loss in the instance and image level representations to reduce the intra-class variance. We report all results with Cityscapes as our source domain and Foggy Cityscapes as the target domain, outperforming previous baselines.
... With the rapid developments in computing hardware, big data, and novel algorithms, deep learning-based FR techniques have fostered numerous startups with practical applications in the past five years. Massive deployment of FR systems based on the deep learning models [11,12,13,14,15,16,17,18] draws the public's attention to the privacy and security concerns of reconstructing face images from deep features [19,20]. According to the techniques used, face image reconstruction can be divided into conventional and deep learning methods. ...
... To find the best population size, we set the crossover probability to 0.2 and the mutation ratio to 0.1. The fitness value in terms of mean square error (MSE) is evaluated at different ranges (16,32,64,128,256) of population size. The lower the MSE, the better performance is achieved. ...
Preprint
Full-text available
Face recognition based on the deep convolutional neural networks (CNN) shows superior accuracy performance attributed to the high discriminative features extracted. Yet, the security and privacy of the extracted features from deep learning models (deep features) have been often overlooked. This paper proposes the reconstruction of face images from deep features without accessing the CNN network configurations as a constrained optimization problem. Such optimization minimizes the distance between the features extracted from the original face image and the reconstructed face image. Instead of directly solving the optimization problem in the image space, we innovatively reformulate the problem by looking for a latent vector of a GAN generator, then use it to generate the face image. The GAN generator serves a dual role in this novel framework, i.e., face distribution constraint of the optimization goal and a face generator. On top of the novel optimization task, we also propose an attack pipeline to impersonate the target user based on the generated face image. Our results show that the generated face images can achieve a state-of-the-art successful attack rate of 98.0% on LFW under type-I attack @ FAR of 0.1%. Our work sheds light on the biometric deployment to meet the privacy-preserving and security policies.
... Face Recognition is one of the most important research fields in computer vision and pattern recognition. Recent advances in deep learning, coupled with abundant face data, have led to excellent progress in face recognition algorithms [8,24,27,30,31,32,37]. Due to these achievements, face recognition technology is widely utilized in the real world, such as human-computer interaction [20], video surveillance [5], and identification [14,34]. ...
Preprint
Full-text available
Face recognition is one of the most active tasks in computer vision and has been widely used in the real world. With great advances made in convolutional neural networks (CNN), lots of face recognition algorithms have achieved high accuracy on various face datasets. However, existing face recognition algorithms based on CNNs are vulnerable to noise. Noise-corrupted image patterns could lead to false activations, significantly decreasing face recognition accuracy in noisy situations. To equip CNNs with built-in robustness to noise of different levels, we proposed a Median Pixel Difference Convolutional Network (MeDiNet) by replacing some traditional convolutional layers with the proposed novel Median Pixel Difference Convolutional Layer (MeDiConv). The proposed MeDiNet integrates the idea of traditional multiscale median filtering with deep CNNs. The MeDiNet is tested on four face datasets (LFW, CA-LFW, CP-LFW, and YTF) with versatile settings on blur kernels, noise intensities, scales, and JPEG quality factors. Extensive experiments show that our MeDiNet can effectively remove noisy pixels in the feature map and suppress the negative impact of noise, resulting in limited accuracy loss under these practical noises compared with the standard CNN under clean conditions.
... • explicitly minimize the intraclass variance [142], ...
Preprint
Full-text available
Deep neural networks such as convolutional neural networks (CNNs) and transformers have achieved many successes in image classification in recent years. It has been consistently demonstrated that best practice for image classification is when large deep models can be trained on abundant labelled data. However there are many real world scenarios where the requirement for large amounts of training data to get the best performance cannot be met. In these scenarios transfer learning can help improve performance. To date there have been no surveys that comprehensively review deep transfer learning as it relates to image classification overall. However, several recent general surveys of deep transfer learning and ones that relate to particular specialised target image classification tasks have been published. We believe it is important for the future progress in the field that all current knowledge is collated and the overarching patterns analysed and discussed. In this survey we formally define deep transfer learning and the problem it attempts to solve in relation to image classification. We survey the current state of the field and identify where recent progress has been made. We show where the gaps in current knowledge are and make suggestions for how to progress the field to fill in these knowledge gaps. We present a new taxonomy of the applications of transfer learning for image classification. This taxonomy makes it easier to see overarching patterns of where transfer learning has been effective and, where it has failed to fulfill its potential. This also allows us to suggest where the problems lie and how it could be used more effectively. We show that under this new taxonomy, many of the applications where transfer learning has been shown to be ineffective or even hinder performance are to be expected when taking into account the source and target datasets and the techniques used.
... Thus, for further improving the performance of the MMD-based feature adaptation methods, the relations between samples and centers should be considered. Wisely, Wen et al. [41] and Tahmoresnezhad et al. [42] designed the center loss function based on Class-center-Sample distances. Chen et al. [43] verified that the center-based distance is better than the sample-based distance. ...
Article
Full-text available
Current feature adaptation methods align the joint distributions across domains. But they may be limited because the difference between distributions cannot be completely eliminated. Existing classifier adaptation methods find the shared classifier across domains based on the original features or Manifold features. However, the shared classifier may be ineffective due to the high granularity at the category level of the features. Inspired by these, we propose the unsupervised domain adaptation via Discriminative feature learning and Classifier adaptation from Center-based Distances (DCCD). First, we define the data-centers and class-centers. Second, Discriminative feature learning from Center-based Distances (DCD) is established by using the data-centers and class-centers to align the joint distributions across domains and maximize the intra-class compactness and inter-class separability of features at the category level. Specifically, the optimization objective of DCD is constructed by minimizing the maximum mean discrepancy (MMD) between two domains and Class-center-Sample distances (CS), and maximizing the Data-center-Class-center (DC) and Class-center-Nearest-Class-center distances (CNC). Next, we propose Classifier adaptation from Center-based Distances (CCD). In detail, CCD applies Structural Risk Minimization (SRM), dynamic distribution alignment, and the constructed Laplacian Regularization to solve the shared classifier, where the constructed Laplacian Regularization additionally considers CS and CNC to measure the local structure of features. Benefiting from CCD, the joint distributions can be further aligned at the classifier level. Besides, integrating with the learned features from DCD, the shared classifier can be effective on the two domains. Finally, extensive experiments on four benchmark datasets show that DCCD outperforms the state-of-the-art UDA methods.
... It was trained on a large scale dataset of 2.6M images of 2622 subjects. Wen et al. [43] proposed a center loss to reduce the intra-class features variations. To separate samples more strictly and avoid misclassifying the difficult samples, angular/cosine margin based loss is proposed to make learned features potentially separable with a larger angular/cosine distance on a hypersphere manifold, such as Sphereface [44], L-softmax [45], Cosface [46], AMS [47] and Arcface [48]. ...
Preprint
Full-text available
Despite great progress in face recognition tasks achieved by deep convolutional neural networks (CNNs), these models often face challenges in real-world tasks where training images gathered from the Internet are different from test images because of different lighting conditions, pose and image quality. These factors increase domain discrepancy between training (source domain) and testing (target domain) databases and make the learnt models degenerate in application. Meanwhile, due to lack of labeled target data, directly fine-tuning the pre-learnt models becomes intractable and impractical. In this paper, we propose a new clustering-based domain adaptation method designed for face recognition tasks in which the source and target domain do not share any classes. Our method effectively learns the discriminative target feature by aligning the feature domain globally, and, in the meantime, distinguishing the target clusters locally. Specifically, it first learns a more reliable representation for clustering by minimizing global domain discrepancy to reduce domain gaps, and then applies a simplified spectral clustering method to generate pseudo-labels in the domain-invariant feature space, and finally learns discriminative target representation. Comprehensive experiments on widely-used GBU, IJB-A/B/C and RFW databases clearly demonstrate the effectiveness of our newly proposed approach. State-of-the-art performance on the GBU data set is achieved by only unsupervised adaptation from the target training data.
... Existence of bias. We examine some SOTA algorithms, i.e., Center-loss [16], Sphereface [17], VGGFace2 [19] and ArcFace [18], as well as four commercial recognition APIs, i.e., Face++, Baidu, Amazon, Microsoft on our IDS-4 and IDS-8, respectively. The results on IDS-4 are presented in Table 2, Fig. 7 and Fig.10. ...
Preprint
Full-text available
Although deep face recognition has achieved impressive progress in recent years, controversy has arisen regarding discrimination based on skin tone, questioning their deployment into real-world scenarios. In this paper, we aim to systematically and scientifically study this bias from both data and algorithm aspects. First, using the dermatologist approved Fitzpatrick Skin Type classification system and Individual Typology Angle, we contribute a benchmark called Identity Shades (IDS) database, which effectively quantifies the degree of the bias with respect to skin tone in existing face recognition algorithms and commercial APIs. Further, we provide two skin-tone aware training datasets, called BUPT-Globalface dataset and BUPT-Balancedface dataset, to remove bias in training data. Finally, to mitigate the algorithmic bias, we propose a novel meta-learning algorithm, called Meta Balanced Network (MBN), which learns adaptive margins in large margin loss such that the model optimized by this loss can perform fairly across people with different skin tones. To determine the margins, our method optimizes a meta skewness loss on a clean and unbiased meta set and utilizes backward-on-backward automatic differentiation to perform a second order gradient descent step on the current margins. Extensive experiments show that MBN successfully mitigates bias and learns more balanced performance for people with different skin tones in face recognition. The proposed datasets are available at http://www.whdeng.cn/RFW/index.html.
... Center loss [31] aims to constrain the distance between the features and their class centers in Euclidean space. Triplet loss [22] minimizes the distance between the paired images and maximizes the negative ones in the triplets. ...
Article
Full-text available
Face recognition (FR) has received remarkable attention for improving feature discrimination with the development of deep convolutional neural networks (CNNs). Although the existing methods have achieved great success in designing margin-based loss functions by using hard sample mining strategy, they still suffer from two issues: 1) the neglect of some training status and feature position information and 2) inaccurate weight assignment for hard samples due to the coarse hardness description. To solve these issues, we develop a novel loss function, namely Hardness Loss, to adaptively assign weights for the misclassified (hard) samples guided by their corresponding hardness, which accounts for multiple training status and feature position information. Specifically, we propose an estimator to provide the real-time training status to precisely compute the hardness for weight assignment. To the best of our knowledge, this is the first attempt to design a loss function by using multiple pieces of information about the training status and feature positions. Extensive experiments on popular face benchmarks demonstrate that the proposed method is superior to the state-of-the-art (SOTA) losses under various FR scenarios.
... The softmax loss function $L_{CCE}$ decreases whereas the interclass dispersion increases as model training progresses. The centre loss is utilized as a portion of the loss function within CNNs to improve the discriminative capability of the modelling effect [45]. We may train CNNs to attain features possessing two primary learning objectives, intraclass compactness and interclass dispersion, simultaneously using the combined supervision of the centre loss and softmax loss. ...
Article
Full-text available
Background: The zone adjacent to a transcription start site (TSS), namely, the promoter, is primarily involved in the process of DNA transcription initiation and regulation. As a result, proper promoter identification is critical for further understanding the mechanism of the networks controlling genomic regulation. A number of methodologies for the identification of promoters have been proposed. Nonetheless, due to the great heterogeneity existing in promoters, the results of these procedures are still unsatisfactory. In order to establish additional discriminative characteristics and properly recognize promoters, we developed the hybrid model for promoter identification (HMPI), a hybrid deep learning model that can characterize both the native sequences of promoters and the morphological outline of promoters at the same time. We developed the HMPI to combine a method called the PSFN (promoter sequence features network), which characterizes native promoter sequences and deduces sequence features, with a technique referred to as the DSPN (deep structural profiles network), which is specially structured to model the promoters in terms of their structural profile and to deduce their structural attributes.
Results: The HMPI was applied to human, plant and Escherichia coli K-12 strain datasets, and the findings showed that the HMPI was successful at extracting the features of the promoter while greatly enhancing the promoter identification performance. In addition, after the improvements of synthetic sampling, transfer learning and label smoothing regularization, the improved HMPI models achieved good results in identifying subtypes of promoters on prokaryotic promoter datasets.
Conclusions: The results showed that the HMPI was successful at extracting the features of promoters while greatly enhancing the performance of identifying promoters on both eukaryotic and prokaryotic datasets, and the improved HMPI models are good at identifying subtypes of promoters on prokaryotic promoter datasets. The HMPI is additionally adaptable to different biological functional sequences, allowing for the addition of new features or models.
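The joint supervision of softmax loss and centre loss mentioned in the citing excerpt for this entry can be illustrated with a minimal, dependency-free sketch. This is not the code of any work listed here: the function names, the per-batch averaging, and the weight `lam` are assumptions made for illustration, and the class centers are passed in as fixed values rather than learned.

```python
import math

def softmax_cross_entropy(logits, labels):
    """Mean softmax cross-entropy over a batch of logit rows."""
    total = 0.0
    for row, y in zip(logits, labels):
        m = max(row)  # subtract the row maximum for numerical stability
        log_sum = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum - row[y]
    return total / len(labels)

def center_loss(features, labels, centers):
    """Half the squared Euclidean distance from each feature to its
    class center, averaged over the batch (averaging is an assumption)."""
    total = 0.0
    for x, y in zip(features, labels):
        total += 0.5 * sum((xi - ci) ** 2 for xi, ci in zip(x, centers[y]))
    return total / len(labels)

def joint_loss(logits, features, labels, centers, lam=0.01):
    """Softmax loss plus lam times center loss; lam is a hypothetical weight."""
    return softmax_cross_entropy(logits, labels) + lam * center_loss(
        features, labels, centers
    )
```

The first term rewards separating classes, while the second shrinks when features sit close to their own class center, so a larger `lam` puts more emphasis on intraclass compactness.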
... We elaborately design the spatio-temporal Transformer to capture discriminative feature tokens and model temporal dependencies among different frames. To further increase the model discriminant ability, we impose the constraint on the prediction distribution by the loss function, following [21,25,35,55]. We propose the compact softmax cross entropy loss to decrease the intra-class distance and increase the inter-class distance. ...
Preprint
Previous methods for dynamic facial expression recognition in the wild are mainly based on Convolutional Neural Networks (CNNs), whose local operations ignore the long-range dependencies in videos. To solve this problem, we propose the spatio-temporal Transformer (STT) to capture discriminative features within each frame and model contextual relationships among frames. Spatio-temporal dependencies are captured and integrated by our unified Transformer. Specifically, given an image sequence consisting of multiple frames as input, we utilize the CNN backbone to translate each frame into a visual feature sequence. Subsequently, the spatial attention and the temporal attention within each block are jointly applied for learning spatio-temporal representations at the sequence level. In addition, we propose the compact softmax cross entropy loss to further encourage the learned features to have the minimum intra-class distance and the maximum inter-class distance. Experiments on two in-the-wild dynamic facial expression datasets (i.e., DFEW and AFEW) indicate that our method provides an effective way to make use of the spatial and temporal dependencies for dynamic facial expression recognition. The source code and the training logs will be made publicly available.
... A large intra-class variance is often observed for FER in-the-wild. To address this issue, some approaches have been inspired by the center loss [173] that penalizes the distance between the features and the centroid of the corresponding class. For instance, Cai et al. propose the island loss [22] to penalize the pairwise-distance between different FE class centroids. ...
Thesis
Despite the recent advances in face analysis, it is still difficult to adapt the models to the immense variety of morphological traits, head poses or occlusions that can strongly affect the face appearance. Thus, the objective of this thesis is to improve the robustness to these variations. To this end, we propose to combine deep learning methods with ensemble methods. In particular, we have proposed a new gating mechanism that allows to adaptively combine the decisions of an ensemble of predictors. This first architecture has been extended to allow the combination of an ensemble of representations. We also showed that it was possible to better guide this combination by using an exogenous variable, identified as an important source of variation in the face appearance. In addition to these architectural contributions, we have shown the contribution of a new training loss that improves the representation extracted by the network. The genericity and performance of the method have been experimentally validated. In particular, this method outperforms the state-of-the-art in both face alignment and facial expression recognition.
... Our final results training from scratch from U (1) labels is similar in spirit to work seeking to regularize the label and/or the activation feature space, including using real-valued rather than one-hot training labels [61], learning-based classifiers [81,83], prototypical networks for few-shot learning [51,68], deep k-nearest neighbors [54], geometrical regularization based on hyperspheres [48], enforcing constant radial distance from the feature space origin [87] or angular loss between prototypes [79]. We seek to present our theory in the broadest, most general context. ...
Preprint
Full-text available
We report on a significant discovery linking deep convolutional neural networks (CNN) to biological vision and fundamental particle physics. A model of information propagation in a CNN is proposed via an analogy to an optical system, where bosonic particles (i.e. photons) are concentrated as the 2D spatial resolution of the image collapses to a focal point $1\times 1=1$. A 3D space $(x,y,t)$ is defined by $(x,y)$ coordinates in the image plane and CNN layer $t$, where a principal ray $(0,0,t)$ runs in the direction of information propagation through both the optical axis and the image center pixel located at $(x,y)=(0,0)$, about which the sharpest possible spatial focus is limited to a circle of confusion in the image plane. Our novel insight is to model the principal optical ray $(0,0,t)$ as geometrically equivalent to the medial vector in the positive orthant $I(x,y) \in R^{N+}$ of a $N$-channel activation space, e.g. along the greyscale (or luminance) vector $(t,t,t)$ in $RGB$ colour space. Information is thus concentrated into an energy potential $E(x,y,t)=\|I(x,y,t)\|^2$, which, particularly for bottleneck layers $t$ of generic CNNs, is highly concentrated and symmetric about the spatial origin $(0,0,t)$ and exhibits the well-known "Sombrero" potential of the boson particle. This symmetry is broken in classification, where bottleneck layers of generic pre-trained CNN models exhibit a consistent class-specific bias towards an angle $\theta \in U(1)$ defined simultaneously in the image plane and in activation feature space. Initial observations validate our hypothesis from generic pre-trained CNN activation maps and a bare-bones memory-based classification scheme, with no training or tuning. Training from scratch using a random $U(1)$ class label then leads to improved classification in all cases.
... We believe the deeper model and inclusion of vision modality in the proposed scheme will further enhance the efficiency of the proposed model. ResiDen [45]: 76.54%; ResNet-PL [46]: 81.97%; PG-CNN [47]: 83.27%; Center Loss [48]: 83.68%; DLP-CNN [35]: 84.13%; ALT [49]: 84.50%; gACNN [50]: 85.07%; OADN [51]: 87.16%; Proposed Model: 91.66%. [52] 52.97%; DLP-CNN [35]: 54.47%; PG-CNN [47]: 55.33%; ResNet-PL [46]: 56.42%; gACNN [50]: 58.78%; OADNN [51]: 64.06%; Proposed Model: 72.06%. ...
Preprint
Full-text available
Facial expression recognition has been a hot topic for decades, but high intraclass variation makes it challenging. To overcome intraclass variation for visual recognition, we introduce a novel fusion methodology, in which the proposed model first extracts features and then fuses them. Specifically, ResNet-50, VGG-19, and Inception-V3 are used for feature learning, followed by feature fusion. Finally, the three feature extraction models are combined using Ensemble Learning techniques for final expression classification. The representation learnt by the proposed methodology is robust to occlusions and pose variations and offers promising accuracy. To evaluate the efficiency of the proposed model, we use two in-the-wild benchmark datasets, Real-world Affective Faces Database (RAF-DB) and AffectNet, for facial expression recognition. The proposed model classifies the emotions into seven different categories, namely: happiness, anger, fear, disgust, sadness, surprise, and neutral. Furthermore, the performance of the proposed model is also compared with other algorithms focusing on the analysis of computational cost, convergence and accuracy based on a standard problem specific to classification applications.
... For example, Guo et al. have pointed out that neural networks are over-confident under training with certain loss functions [6]. Wen [37] adds an additional center loss, which forces the last-layer feature embeddings of same-class samples to cluster together, to the cross-entropy loss to jointly optimize the neural network. Liu [21,22] decouples the dot product behavior of the optimization function and designs new optimization functions that optimize the angle of the last layer's feature embedding. ...
Preprint
Full-text available
Deep learning approaches have provided state-of-the-art performance in many applications by relying on extremely large and heavily overparameterized neural networks. However, such networks have been shown to be very brittle, do not generalize well to new use cases, and are often difficult if not impossible to deploy on resource-limited platforms. Model pruning, i.e., reducing the size of the network, is a widely adopted strategy that can lead to more robust and generalizable networks -- usually orders of magnitude smaller with the same or even improved performance. While there exist many heuristics for model pruning, our understanding of the pruning process remains limited. Empirical studies show that some heuristics improve performance while others can make models more brittle or have other side effects. This work aims to shed light on how different pruning methods alter the network's internal feature representation, and the corresponding impact on model performance. To provide a meaningful comparison and characterization of model feature space, we use three geometric metrics that are decomposed from the commonly adopted classification loss. With these metrics, we design a visualization system to highlight the impact of pruning on model prediction as well as the latent feature embedding. The proposed tool provides an environment for exploring and studying differences among pruning methods and between pruned and original models. By leveraging our visualization, ML researchers can not only identify samples that are fragile to model pruning and data corruption but also obtain insights and explanations on how some pruned models achieve superior robustness performance.
... As stated in Ref. [34], cross-entropy loss contributes to dispersing inter-class samples and provides a basis for category identification. In AIC loss, a cross-entropy (CE) loss is applied as joint supervision for a large inter-class variance. ...
Article
Full-text available
Radar‐based hand gesture recognition (HGR) has attracted growing interest in human–computer interaction. A rich diversity in how people perform gestures causes a large intra‐class variance, and the sample quality varies from person to person. It makes HGR more challenging to identify dynamic, complicated, and deforming hand gestures. It is urgent for the real world to explore a robust method that better identifies the gestures from non‐specified users. To address the above issues, an adaptive framework is proposed for gesture recognition, and it has two main contributions. First of all, a trajectory range Doppler map (t‐RDM) is obtained by non‐coherent accumulating for inter‐frame dependencies, and then t‐RDM is enhanced to highlight the trajectory information. Taking into account different movement patterns of the gestures, a two‐pathway convolutional neural network targeted for raw and enhanced t‐RDMs is proposed, which independently mines discriminative information from the two t‐RDMs with different salient features. Second, an adaptive individual cost (AIC) loss is proposed, aiming to establish a powerful feature representation by adaptively extracting the commonalities in variant gestures according to the sample quality. Based on a public dataset using soli radar, the proposed method is evaluated on two tasks: cross‐person recognition and cross‐scenario recognition. These two recognition modes require that the training set and the test set are mutually exclusive not only at the sample level but also at the source level. Extensive experiments demonstrate that the proposed method is superior to the existing approaches for alleviating the low recognition performance caused by gesture diversity.
... Second, softmax does not encourage classification between classes (As shown in Figure 1, we can see that softmax only separates different classes, while LMCL and other methods also form interclass gaps). In response to this situation, Wen et al. proposed Center Loss [3]. In the face recognition task, in order to increase the interclass distance of different face IDs and reduce the intraclass distance of the same face ID, based on the original softmax classification, a set of weights are added, which represent the center of each face ID in the space. ...
Article
Full-text available
Benefiting from deep learning, the accuracy of face expression recognition tasks based on convolutional neural networks has been greatly improved. However, the traditional SoftMax activation function lacks the ability to discriminate between classes. To solve this problem, the industry has proposed several activation functions based on softmax, such as A-softmax, LMCL, etc. We investigate the geometric significance of the weights from a fully connected layer and consider the weights as the class centers. By extracting the feature vector of several samples and extending the corresponding means to the weights, the model can develop the ability to recognize custom classes without training, while maintaining the accuracy of the original classification. On the expression task, the original seven-category classification is validated to obtain 97.10% accuracy on the CK+ dataset and 88% accuracy on the custom dataset.
... Many efforts are devoted to extracting domain-invariant features with deep learning algorithms. The center loss [27] and triplet loss [28] are utilized to reduce NIR-VIS discrepancy. [9] proposes a Wasserstein CNN (W-CNN) to capture invariant deep features by minimizing the Wasserstein distance between NIR and VIS features. ...
Preprint
Heterogeneous Face Recognition (HFR) aims to match faces across different domains (e.g., visible to near-infrared images), which has been widely applied in authentication and forensics scenarios. However, HFR is a challenging problem because of the large cross-domain discrepancy, limited heterogeneous data pairs, and large variation of facial attributes. To address these challenges, we propose a new HFR method from the perspective of heterogeneous data augmentation, named Face Synthesis with Identity-Attribute Disentanglement (FSIAD). Firstly, the identity-attribute disentanglement (IAD) decouples face images into identity-related representations and identity-unrelated representations (called attributes), and then decreases the correlation between identities and attributes. Secondly, we devise a face synthesis module (FSM) to generate a large number of images with stochastic combinations of disentangled identities and attributes for enriching the attribute diversity of synthetic images. Both the original images and the synthetic ones are utilized to train the HFR network for tackling the challenges and improving the performance of HFR. Extensive experiments on five HFR databases validate that FSIAD obtains superior performance than previous HFR approaches. Particularly, FSIAD obtains 4.8% improvement over state of the art in terms of VR@FAR=0.01% on LAMP-HQ, the largest HFR database so far.
... Hybrid Local Feature Analysis [19], Component-Based method [20], Modular Eigenfaces [21], and Shape normalized methods [22] are examples of hybrid methods presented in the literature. Other famous methods include neural network-based methods [23,24], Support Vector Machines [25,26], and Bayesian classifier methods [27]. ...
Article
Full-text available
Recognition of face images is still a challenging and open research problem. A number of recent algorithms have shown that there is vast scope for improving recognition accuracy by utilizing facial symmetry for the face recognition task. The lower computational complexity and faster processing times make this method well suited for real-time applications. In this paper, we use only one half of the face image for the recognition task against various facial challenges. Keeping in view all the previous related studies that are limited in their scope, an unbiased comparison is presented between full face images and half face images by applying four subspace-based algorithms with four different distance metrics. Experiments are conducted on two of the most challenging face databases: FERET, a benchmark database that closely simulates real-life scenarios, and LFW, which was developed for the problem of unconstrained face recognition.
Article
In continual learning over deep neural networks (DNNs), the rehearsal strategy, in which the previous exemplars are jointly trained with new samples, is commonly employed for the purpose of addressing catastrophic forgetting. Unfortunately, due to the memory limit, rehearsal-based techniques inevitably cause the class imbalance issue leading to a DNN biased toward new tasks having more samples. Existing works mostly focus on correcting such a bias in the fully connected layer or classifier. In this paper, we newly discover that class imbalance tends to make old classes even more highly correlated with their similar new classes in the feature space, which turns out to be the major reason behind catastrophic forgetting, called inter-task forgetting. To alleviate inter-task forgetting, we propose a novel class incremental learning method, called attractive & repulsive training (ART), which effectively captures the previous feature space into a set of class-wise flags, and thereby makes old and new similar classes less correlated in the new feature space. In our empirical study, our ART method is observed to be quite effective to improve the performance of the state-of-the-art methods by substantially mitigating inter-task forgetting. Our implementation is available at: https://github.com/bigdata-inha/ART/.
Chapter
Face anti-spoofing (FAS) plays an important role in protecting face recognition systems from face presentation attacks. Many recent studies in FAS have approached this problem with domain generalization techniques. Domain generalization aims to increase generalization performance to better detect various types of attacks, including unseen attacks. However, previous studies in this area have defined each domain simply as an anti-spoofing dataset and focused on developing learning techniques. In this paper, we propose a method that enables the network to judge its domain by itself with the clustered convolutional feature statistics from intermediate layers of the network, without labeling domains as datasets. We obtain pseudo-domain labels not only by using the feature-extracting network, but also by using depth estimators, which were previously used only as an auxiliary task in FAS. In our experiments, we train with three datasets and evaluate on the remaining dataset, conducting a total of four sets of experiments to demonstrate the effectiveness of the proposed method.
Article
Multiple object tracking (MOT) generally employs the paradigm of tracking-by-detection, where object detection and object tracking are executed conventionally using separate systems. Current progress in MOT has focused on detecting and tracking objects by harnessing the representational power of deep learning. Since existing methods always combine two submodules in the same network, it is particularly important that they must be trained effectively together. Therefore, the development of a suitable network architecture for the end-to-end joint training of detection and tracking submodules remains a challenging issue. The present work addresses this issue by proposing a novel architecture denoted as YOLOTracker that performs online MOT by exploiting a joint detection and embedding network. First, an efficient and powerful joint detection and tracking model is constructed to accomplish instance-level embedded training, which can ensure that the proposed tracker achieves highly accurate MOT results with high efficiency. Then, the Path Aggregation Network is employed to combine low-resolution and high-resolution features for integrating textural features and semantic information and mitigating the misalignment of the re-identification features. Experiments are conducted on three challenging and publicly available benchmark datasets and results demonstrate the proposed tracker outperforms other state-of-the-art MOT trackers in terms of accuracy and efficiency.
Chapter
First-person hand activity recognition plays a significant role in the computer vision field with various applications. Thanks to recent advances in depth sensors, several 3D skeleton-based hand activity recognition methods using supervised Deep Learning (DL) have been proposed and proven effective when a large amount of labeled data is available. However, the annotation of such data remains difficult and costly, which motivates the use of unsupervised methods. We propose in this paper a new approach based on unsupervised domain adaptation (UDA) for 3D skeleton hand activity clustering. It aims at exploiting the knowledge derived from labeled samples of the source domain to categorize the unlabeled ones of the target domain. To this end, we introduce a novel metric learning-based loss function to learn a highly discriminative representation while preserving a good activity recognition accuracy on the source domain. The learned representation is used as a low-level manifold to cluster unlabeled samples. In addition, to ensure the best clustering results, we propose a statistical and consensus-clustering-based strategy. The proposed approach is evaluated on the real-world FPHA data set.
Article
Recently, deep learning based Computer-Aided Diagnosis methods have been widely utilized due to their highly effective diagnosis of patients. Although Convolutional Neural Networks (CNNs) are capable of extracting the latent structural characteristics of dementia and of capturing the changes of brain anatomy in Magnetic Resonance Imaging (MRI) scans, the high-dimensional input to a deep CNN usually makes the network difficult to train, and affects its diagnostic accuracy. In this paper, a novel method called the hierarchical pseudo-3D convolution neural network based on a kernel attention mechanism with a new global context block, which is abbreviated as “PKG-Net”, is proposed to accurately predict Alzheimer’s disease even when the input features are complex. Specifically, the proposed network first extracts multi-scale features from pre-processed images. Second, the attention mechanism and global context blocks are applied to combine features from different layers to hierarchically transform the MRI into more compact high-level features. Then, a joint loss function is used to train the proposed network to generate more distinguishing features, which improve the generalization performance of the network. In addition, we combine our method with different architectures. Extensive experiments are conducted to analyze the performance of the PKG-Net with different hyper-parameters and architectures. Finally, in order to verify the effectiveness of our method on Alzheimer’s disease diagnosis, we carry out extensive experiments on the ADNI dataset, and compare the results of our method with that of existing methods in terms of accuracy, recall and precision. Furthermore, our network can fully take advantage of the deep 3D convolutional neural network for automatic feature extraction and representation, and thus can avoid the limitation of low processing efficiency caused by the preprocessing procedure in which a specific area needs to be annotated in advance. 
Finally, we evaluate our proposed framework using two public datasets, ADNI-1 and ADNI-2, and the experimental results show that our proposed framework can achieve superior performance over state-of-the-art approaches.
Article
As a widely mentioned topic in face recognition, the margin-based loss function enhances the discriminability of face recognition models by applying margin between class decision boundaries. However, there is still room to improve the representation of face features. Local face feature extraction has been employed in traditional face recognition methods, but with the increase of network depth in deep learning, the traditional approach requires a large number of computational resources. In this paper, we propose a novel face recognition architecture called LocalFace to extract local face features. First, by analyzing the distribution of significant features in face images, we propose an efficient face fixed-point local feature extraction approach and improve this method to propose a more effective face dynamic local feature extraction scheme. Subsequently, we propose a block-based random occlusion method for the limitations of the random face occlusion method to better simulate the occlusion situation in real scenes. In the end, we present a detailed discussion on the channel attention method that is more appropriate for face recognition and classification tasks. Our method enhances the representation of face features by ensembling local features into global features without extra parameters, which is efficient and easy to implement. Extensive experiments on various benchmarks demonstrate the superiority of our LocalFace, and part of the experimental results achieve SOTA results.
Article
The recognition of Chinese characters has always been a challenging task due to their huge variety and complex structures. The current radical-based methods fail to recognize Chinese characters without learning all of their radicals in the training stage. To this end, we propose a novel Hippocampus-heuristic Character Recognition Network (HCRN), which can recognize unseen Chinese characters only by training part of radicals. More specifically, the network architecture of HCRN is a new pseudo-siamese network designed by us, which can learn features from pairs of input samples and use them to predict unseen characters. The experimental results on the recognition of printed and handwritten characters show that HCRN is robust and effective on zero/few-shot learning tasks. For the printed characters, the mean accuracy of HCRN outperforms the state-of-the-art approach by 23.93% on recognizing unseen characters. For the handwritten characters, HCRN improves the mean accuracy by 11.25% on recognizing unseen characters.
Article
Fine-grained visual classification is challenging due to similarities within classes and discriminative features located in subtle regions. Conventional methods focus on extracting features from the most discriminative parts, which may underperform when these parts are occluded or invisible. And the limited training data also leads to serious overfitting problem. In this paper, we propose an Attention-based Cropping and Erasing Network (ACEN) with a coarse-to-fine refinement strategy to address these problems. By convolving the feature maps from CNN, we obtain a set of attention maps which focus on discriminative object parts. Guided by the attention maps, we propose attention region cropping and erasing operations to augment training data. Moreover, the attention region cropping enhances local discriminative feature learning, and the attention region erasing promotes multi-attention learning. During inference phase, the coarse-to-fine refinement strategy is proposed to refine the model prediction. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on challenging benchmarks, including CUB-200-2011, FGVC-Aircraft and Stanford Cars.
Article
Video-based re-identification (ReID) is a crucial task in computer vision that draws increasing attention due to advances in deep learning (DL) and modern computational devices. Despite recent success with CNN architectures, single models (e.g., 2D-CNNs or 3D-CNNs) alone fail to leverage temporal information together with spatial cues. This is due to uncontrolled surveillance scenarios and variable poses leading to inevitable misalignment of ROIs across the tracklets, which is accompanied by occlusion and motion blur. In this context, designing temporal and spatial cues for two different models and their combinations can be beneficial, considering the global view of a video tracklet. 3D-CNNs allow encoding of temporal information while 2D-CNNs extract spatial or appearance information. In this paper, we propose a Spatio-Temporal Cross Attention (STCA) network that utilizes both 2D-CNNs and 3D-CNNs: it calculates the cross attention mapping from the layers of both the 3D-CNNs and the 2D-CNNs along a person's trajectory to gate the following layers of the 2D-CNNs and highlight relevant appearance features for person ReID. Given an input tracklet, the proposed cross attention (CA) is able to capture the salient regions that propagate throughout the tracklet to obtain the global view. This provides a spatio-temporal attention approach that can be dynamically aggregated with spatial features of 2D-CNNs to perform finer-grained recognition. Additionally, we exploit the advantage of utilizing cosine similarity during triplet sampling as well as for calculating the final recognition score. Experimental analyses on three challenging benchmark datasets indicate that integrating spatio-temporal cross attention into state-of-the-art video ReID backbone CNN architectures improves their recognition accuracy.
Preprint
Dynamic facial expression recognition (DFER) in the wild is an extremely challenging task, due to a large number of noisy frames in the video sequences. Previous works focus on extracting more discriminative features, but ignore distinguishing the key frames from the noisy frames. To tackle this problem, we propose a noise-robust dynamic facial expression recognition network (NR-DFERNet), which can effectively reduce the interference of noisy frames on the DFER task. Specifically, at the spatial stage, we devise a dynamic-static fusion module (DSF) that introduces dynamic features to static features for learning more discriminative spatial features. To suppress the impact of target irrelevant frames, we introduce a novel dynamic class token (DCT) for the transformer at the temporal stage. Moreover, we design a snippet-based filter (SF) at the decision stage to reduce the effect of too many neutral frames on non-neutral sequence classification. Extensive experimental results demonstrate that our NR-DFERNet outperforms the state-of-the-art methods on both the DFEW and AFEW benchmarks.
Article
Compound faults and their involved single faults often have severe overlap in traditional feature spaces, and the strong background noise unavoidably exacerbates the degree of overlap. Aiming at the problem, this article constructs a multi-level discriminative feature learning method, namely deep progressive shrinkage learning, to progressively suppress intra-class dispersions using a few feature-level shrinkage modules and a decision-level shrinkage module for separating compound faults from single faults. First, soft thresholding is embedded as a key part of feature-level shrinkage modules to gradually eliminate noise-related information in the multi-layer feature learning process, in which thresholds are adaptively set using attention mechanism. Second, in the decision-level shrinkage module, high penalties are imposed on the samples that are far from their class centers. Finally, the efficacy of the method in compound fault diagnosis along with single faults has been verified through a variety of experiments.
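Soft thresholding, the operation embedded in the feature-level shrinkage modules described above, can be sketched in a few lines. This is a generic illustration, not the paper's module: the function names are hypothetical, and the threshold `tau` is passed in as a fixed number here, whereas the abstract states that thresholds are set adaptively by an attention mechanism.

```python
def soft_threshold(x, tau):
    """Shrink x toward zero by tau; values with |x| <= tau become exactly 0."""
    if tau < 0:
        raise ValueError("tau must be non-negative")
    if x > tau:
        return x - tau
    if x < -tau:
        return x + tau
    return 0.0

def shrink(features, tau):
    """Apply soft thresholding element-wise to a feature vector."""
    return [soft_threshold(x, tau) for x in features]

# Small-magnitude entries (treated as noise in this sketch) are zeroed,
# while larger entries are kept but reduced in magnitude by tau.
print(shrink([0.05, -0.3, 1.2, -0.08], 0.1))
```

Unlike hard thresholding, the surviving values are also pulled toward zero, which keeps the mapping continuous.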
Article
Damage detection of composite materials is crucial to monitor the component condition over the life cycle for the maintenance management and possible replacement. Delamination damage, as a common damage form in glass-fiber reinforced polymer (GFRP) composites, may occur during the manufacturing and service process due to the mechanical and thermal loads. Terahertz (THz) NDT technique, as a novel characterization approach, can provide promising alternatives to fulfill the 3D characterization of delamination defects in multi-layer GFRP composites. During THz testing, in order to attain adequate discrimination in depth direction, complex signal processing and prior knowledge of the undamaged stratigraphy are usually required to suppress confounding effects from noise, overlap, and dispersion of THz signal, which hinders the realization of automatic defect detection. Therefore, here we propose an effective, reliable, and end-to-end 3D THz characterization system based on deep learning methods to fulfill the automatic localization and imaging of delamination defects in GFRP composites without any additional signal processing or prior knowledge. In the localization process, the full-scale promoted convolution neural network (FSP–CNN) is developed by integrating dual mechanisms of the full-scale feature learning and the promoted classifier. In the imaging process, the class encoding strategy is employed to obtain the 2D and 3D information of delamination defects based on the classification results. Finally, a series of experiments validate the effectiveness of the system for automatic localization and imaging of delamination defects in GFRP composites, which provides a novel and efficient paradigm for the intelligent and automatic THz 3D characterization of hidden delamination defects in composites.
Chapter
Human activity classification has several applications in smart homes, healthcare and human-machine interaction systems. Several human activity recognition systems using IMU sensors, namely accelerometer and gyroscope, have been proposed in literature. In this paper, we propose an improved integrated solution involving two stages with detection network followed by recognition network for classification of human activities. In a practical deployed environment, the IMU sensor is expected to encounter out-of-distribution (OoD) samples due to sensor degradation, alien operating environment or unknown activities. To handle such adverse examples we propose to use generalized out-of-distribution detector for Neural Networks (ODIN), which also acts to engage the following recognition network only when a valid activity example is encountered leading to power saving. Furthermore, compared to conventional cross-entropy based loss function, we propose quadruplet-loss to train our activity recognition network leading to improved classification and clustering scores. We demonstrate the performance of our proposed solution using commercial off-the-shelf IMU sensors.
Article
Examination of pathological images is the gold standard for diagnosing and screening many kinds of cancers. Multiple datasets, benchmarks, and challenges have been released in recent years, resulting in significant improvements in computer-aided diagnosis (CAD) of related diseases. However, few existing works focus on the digestive system. We released two well-annotated benchmark datasets and organized challenges for digestive-system pathological cell detection and tissue segmentation, in conjunction with the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI). This paper first introduces the two released datasets, i.e., signet ring cell detection and colonoscopy tissue segmentation, with descriptions of data collection, annotation, and potential uses. We also report the set-up, evaluation metrics, and top-performing methods and results of the two challenge tasks for cell detection and tissue segmentation. In particular, the challenge received 234 effective submissions from 32 participating teams, where top-performing teams developed advanced approaches and tools for the CAD of digestive pathology. To the best of our knowledge, these are the first publicly available datasets with corresponding challenges for digestive-system pathological detection and segmentation. The related datasets and results provide new opportunities for the research and application of digestive pathology.
Conference Paper
Full-text available
In this paper we propose a novel semantic label transfer method using supervised geodesic propagation (SGP). We use supervised learning to guide the seed selection and the label propagation. Given an input image, we first retrieve its similar image set from annotated databases. A Joint Boost model is learned on the similar image set of the input image. Then the recognition proposal map of the input image is inferred by this learned model. The initial distance map is defined by the proposal map: the higher probability, the smaller distance. In each iteration step of the geodesic propagation, the seed is selected as the one with the smallest distance from the undetermined superpixels. We learn a classifier as an indicator to indicate whether to propagate labels between two neighboring superpixels. The training samples of the indicator are annotated neighboring pairs from the similar image set. The geodesic distances of its neighbors are updated according to the combination of the texture and boundary features and the indication value. Experiments on three datasets show that our method outperforms the traditional learning based methods and the previous label transfer method for the semantic segmentation work.
Article
Full-text available
This paper proposes to learn a set of high-level feature representations through deep learning, referred to as Deep hidden IDentity features (DeepID), for face verification. We argue that DeepID can be effectively learned through challenging multi-class face identification tasks, whilst they can be generalized to other tasks (such as verification) and new identities unseen in the training set. Moreover, the generalization capability of DeepID increases as more face classes are to be predicted at training. DeepID features are taken from the last hidden layer neuron activations of deep convolutional networks (ConvNets). When learned as classifiers to recognize about 10,000 face identities in the training set and configured to keep reducing the neuron numbers along the feature extraction hierarchy, these deep ConvNets gradually form compact identity-related features in the top layers with only a small number of hidden neurons. The proposed features are extracted from various face regions to form complementary and over-complete representations. Any state-of-the-art classifiers can be learned based on these high-level representations for face verification. 97.45% verification accuracy on LFW is achieved with only weakly aligned faces.
Article
Full-text available
Scene recognition is one of the hallmark tasks of computer vision, allowing definition of a context for object recognition. Whereas the tremendous recent progress in object recognition tasks is due to the availability of large datasets like ImageNet and the rise of Convolutional Neural Networks (CNNs) for learning high-level features, performance at scene recognition has not attained the same level of success. This may be because current deep features trained from ImageNet are not competitive enough for such tasks. Here, we introduce a new scene-centric database called Places with over 7 million labeled pictures of scenes. We propose new methods to compare the density and diversity of image datasets and show that Places is as dense as other scene datasets and has more diversity. Using CNN, we learn deep features for scene recognition tasks, and establish new state-of-the-art results on several scene-centric datasets. A visualization of the CNN layers' responses allows us to show differences in the internal representations of object-centric and scene-centric networks.
Conference Paper
Full-text available
Visual features are of vital importance for human action understanding in videos. This paper presents a new video representation, called trajectory-pooled deep-convolutional descriptor (TDD), which shares the merits of both hand-crafted features and deep-learned features. Specifically, we utilize deep architectures to learn discriminative convolutional feature maps, and conduct trajectory-constrained pooling to aggregate these convolutional features into effective descriptors. To enhance the robustness of TDDs, we design two normalization methods to transform convolutional feature maps, namely spatiotemporal normalization and channel normalization. The advantages of our features come from (i) TDDs are automatically learned and contain high discriminative capacity compared with those hand-crafted features; (ii) TDDs take account of the intrinsic characteristics of temporal dimension and introduce the strategies of trajectory-constrained sampling and pooling for aggregating deep-learned features. We conduct experiments on two challenging datasets: HMDB51 and UCF101. Experimental results show that TDDs outperform previous hand-crafted features and deep-learned features. Our method also achieves superior performance to the state of the art on these datasets (HMDB51 65.9%, UCF101 91.5%).
Article
Full-text available
Recent face recognition experiments on the LFW benchmark show that face recognition is performing stunningly well, surpassing human recognition rates. In this paper, we study face recognition at scale. Specifically, we have collected from Flickr a million faces and evaluated state-of-the-art face recognition algorithms on this dataset. We found that the performance of algorithms varies: while all perform great on LFW, once evaluated at scale recognition rates drop drastically for most algorithms. Interestingly, the deep learning based approach of FaceNet (Schroff et al., 2015) performs much better, but still gets less robust at scale. We consider both verification and identification problems, and evaluate how pose affects recognition at scale. Moreover, we ran an extensive human study on Mechanical Turk to evaluate human recognition at scale, and report results. All the photos are Creative Commons photos and will be released for research and further experiments.
Article
Full-text available
We present the Yahoo Flickr Creative Commons 100 Million Dataset (YFCC100M), the largest public multimedia collection that has ever been released. The dataset contains a total of 100 million media objects, of which approximately 99.2 million are photos and 0.8 million are videos, all of which carry a Creative Commons license. Each media object in the dataset is represented by several pieces of metadata, e.g. Flickr identifier, owner name, camera, title, tags, geo, media source. The collection provides a comprehensive snapshot of how photos and videos were taken, described, and shared over the years, from the inception of Flickr in 2004 until early 2014. In this article we explain the rationale behind its creation, as well as the implications the dataset has for science, research, engineering, and development. We further present several new challenges in multimedia research that can now be expanded upon with our dataset.
Article
Full-text available
With the success of new computational architectures for visual processing, such as convolutional neural networks (CNN) and access to image databases with millions of labeled examples (e.g., ImageNet, Places), the state of the art in computer vision is advancing rapidly. One important factor for continued progress is to understand the representations that are learned by the inner layers of these deep architectures. Here we show that object detectors emerge from training CNNs to perform scene classification. As scenes are composed of objects, the CNN for scene classification automatically discovers meaningful objects detectors, representative of the learned scene categories. With object detectors emerging as a result of learning scenes, our work demonstrates that the same network can perform both scene recognition and object localization in a single forward-pass, without ever having explicitly learned the notion of objects.
Article
Full-text available
This paper designs a high-performance deep convolutional network (DeepID2+) for face recognition. It is learned with the identification-verification supervisory signal. By increasing the dimension of hidden representations and adding supervision to early convolutional layers, DeepID2+ achieves new state-of-the-art on LFW and YouTube Faces benchmarks. Through empirical studies, we have discovered three properties of its deep neural activations critical for the high performance: sparsity, selectiveness and robustness. (1) It is observed that neural activations are moderately sparse. Moderate sparsity maximizes the discriminative power of the deep net as well as the distance between images. It is surprising that DeepID2+ still can achieve high recognition accuracy even after the neural responses are binarized. (2) Its neurons in higher layers are highly selective to identities and identity-related attributes. We can identify different subsets of neurons which are either constantly excited or inhibited when different identities or attributes are present. Although DeepID2+ is not taught to distinguish attributes during training, it has implicitly learned such high-level concepts. (3) It is much more robust to occlusions, although occlusion patterns are not included in the training set.
Article
Full-text available
Predicting face attributes from web images is challenging due to background clutter and face variations. A novel deep learning framework is proposed for face attribute prediction in the wild. It cascades two CNNs (LNet and ANet) for face localization and attribute prediction respectively. These nets are trained in a cascade manner with attribute labels, but pre-trained differently. LNet is pre-trained with massive general object categories, while ANet is pre-trained with massive face identities. This framework not only outperforms the state of the art by a large margin, but also reveals multiple valuable facts on learning face representation, as follows. (1) It shows how LNet and ANet can be improved by different pre-training strategies. (2) It reveals that although filters of LNet are fine-tuned by attribute labels, their response maps over the entire image give a strong indication of the face's location. This fact enables training LNet for face localization with only attribute tags, but without face bounding boxes (which are required by all detection works). With a novel fast feed-forward scheme, the cascade of LNet and ANet can localize faces and recognize attributes in images with arbitrary sizes in real time. (3) It also demonstrates that the high-level hidden neurons of ANet automatically discover semantic concepts after pre-training, and such concepts are significantly enriched after fine-tuning. Each attribute can be well explained by a sparse linear combination of these concepts. By analyzing such combinations, attributes show clear grouping patterns, which could be well interpreted semantically.
Article
Full-text available
Pushed by big data and deep convolutional neural networks (CNNs), the performance of face recognition is becoming comparable to that of humans. Using private large-scale training datasets, several groups achieve very high performance on LFW, i.e., 97% to 99%. While there are many open source implementations of CNNs, no large-scale face dataset is publicly available. The current situation in the field of face recognition is that data is more important than algorithms. To solve this problem, this paper proposes a semi-automatic way to collect face images from the Internet and builds a large-scale dataset containing about 10,000 subjects and 500,000 images, called CASIA-WebFace. Based on the database, we use an 11-layer CNN to learn discriminative representations and obtain state-of-the-art accuracy on LFW and YTF. The publication of CASIA-WebFace will attract more research groups to this field and accelerate the development of face recognition in the wild.
Article
We propose a deep convolutional neural network architecture codenamed "Inception", which was responsible for setting the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC 2014). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. This was achieved by a carefully crafted design that allows for increasing the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC 2014 is called GoogLeNet, a 22-layer deep network, the quality of which is assessed in the context of classification and detection.
Article
The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the five years of the challenge, and propose future directions and improvements.
Article
Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models. The framework is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures. Caffe fits industry and internet-scale media needs by CUDA GPU computation, processing over 40 million images a day on a single K40 or Titan GPU (approx 2 ms per image). By separating model representation from actual implementation, Caffe allows experimentation and seamless switching among platforms for ease of development and deployment from prototyping machines to cloud environments. Caffe is maintained and developed by the Berkeley Vision and Learning Center (BVLC) with the help of an active community of contributors on GitHub. It powers ongoing research projects, large-scale industrial applications, and startup prototypes in vision, speech, and multimedia.
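The throughput figure quoted above can be checked with a line of arithmetic. The sketch below is only an illustration; the helper name `ms_per_image` is hypothetical and it assumes the GPU runs at a steady rate for a full 24-hour day.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def ms_per_image(images_per_day: float) -> float:
    """Average time per image in milliseconds at a steady daily rate."""
    return SECONDS_PER_DAY / images_per_day * 1000.0


# 40 million images a day is the figure quoted in the abstract.
print(f"{ms_per_image(40_000_000):.2f} ms per image")  # prints "2.16 ms per image"
```

The result, about 2.16 ms, is consistent with the "approx 2 ms per image" stated in the abstract.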
Article
The key challenge of face recognition is to develop effective feature representations for reducing intra-personal variations while enlarging inter-personal differences. In this paper, we show that it can be well solved with deep learning and using both face identification and verification signals as supervision. The Deep IDentification-verification features (DeepID2) are learned with carefully designed deep convolutional networks. The face identification task increases the inter-personal variations by drawing DeepID2 extracted from different identities apart, while the face verification task reduces the intra-personal variations by pulling DeepID2 extracted from the same identity together, both of which are essential to face recognition. The learned DeepID2 features can be well generalized to new identities unseen in the training data. On the challenging LFW dataset, 99.15% face verification accuracy is achieved. Compared with the best deep learning result on LFW, the error rate has been significantly reduced by 67%.
Conference Paper
We propose in this paper a fully automated deep model, which learns to classify human actions without using any prior knowledge. The first step of our scheme, based on the extension of Convolutional Neural Networks to 3D, automatically learns spatio-temporal features. A Recurrent Neural Network is then trained to classify each sequence considering the temporal evolution of the learned features for each timestep. Experimental results on the KTH dataset show that the proposed approach outperforms existing deep models, and gives comparable results with the best related works.
Conference Paper
The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called "ImageNet", a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.
Conference Paper
Dimensionality reduction involves mapping a set of high dimensional input points onto a low dimensional manifold so that "similar" points in input space are mapped to nearby points on the manifold. We present a method, called Dimensionality Reduction by Learning an Invariant Mapping (DrLIM), for learning a globally coherent nonlinear function that maps the data evenly to the output manifold. The learning relies solely on neighborhood relationships and does not require any distance measure in the input space. The method can learn mappings that are invariant to certain transformations of the inputs, as is demonstrated with a number of experiments. Comparisons are made to other techniques, in particular LLE.
Conference Paper
We present a method for training a similarity metric from data. The method can be used for recognition or verification applications where the number of categories is very large and not known during training, and where the number of training samples for a single category is very small. The idea is to learn a function that maps input patterns into a target space such that the L<sub>1</sub> norm in the target space approximates the "semantic" distance in the input space. The method is applied to a face verification task. The learning process minimizes a discriminative loss function that drives the similarity metric to be small for pairs of faces from the same person, and large for pairs from different persons. The mapping from raw to the target space is a convolutional network whose architecture is designed for robustness to geometric distortions. The system is tested on the Purdue/AR face database which has a very high degree of variability in the pose, lighting, expression, position, and artificial occlusions such as dark glasses and obscuring scarves.
Article
Multilayer neural networks trained with the back-propagation algorithm constitute the best example of a successful gradient based learning technique. Given an appropriate network architecture, gradient-based learning algorithms can be used to synthesize a complex decision surface that can classify high-dimensional patterns, such as handwritten characters, with minimal preprocessing. This paper reviews various methods applied to handwritten character recognition and compares them on a standard handwritten digit recognition task. Convolutional neural networks, which are specifically designed to deal with the variability of 2D shapes, are shown to outperform all other techniques. Real-life document recognition systems are composed of multiple modules including field extraction, segmentation recognition, and language modeling. A new learning paradigm, called graph transformer networks (GTN), allows such multimodule systems to be trained globally using gradient-based methods so as to minimize an overall performance measure. Two systems for online handwriting recognition are described. Experiments demonstrate the advantage of global training, and the flexibility of graph transformer networks. A graph transformer network for reading a bank cheque is also described. It uses convolutional neural network character recognizers combined with global training techniques to provide record accuracy on business and personal cheques. It is deployed commercially and reads several million cheques per day.
Conference Paper
Rectified activation units (rectifiers) are essential for state-of-the-art neural networks. In this work, we study rectifier neural networks for image classification from two aspects. First, we propose a Parametric Rectified Linear Unit (PReLU) that generalizes the traditional rectified unit. PReLU improves model fitting with nearly zero extra computational cost and little overfitting risk. Second, we derive a robust initialization method that particularly considers the rectifier nonlinearities. This method enables us to train extremely deep rectified models directly from scratch and to investigate deeper or wider network architectures. Based on our PReLU networks (PReLU-nets), we achieve 4.94% top-5 test error on the ImageNet 2012 classification dataset. This is a 26% relative improvement over the ILSVRC 2014 winner (GoogLeNet, 6.66%). To our knowledge, our result is the first to surpass human-level performance (5.1%, Russakovsky et al.) on this visual recognition challenge.
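As a minimal sketch of the two quantitative points in this abstract, the snippet below shows the usual form of a parametric rectifier (identity for positive inputs, a slope `a` for the rest) and recomputes the quoted relative improvement. The function names are hypothetical, and the slope is held fixed at an illustrative value here, whereas in PReLU it is a learned parameter.

```python
def prelu(x: float, a: float) -> float:
    """Parametric rectifier: identity for positive inputs, slope `a` otherwise.

    With a = 0 this reduces to the traditional rectified unit.
    """
    return x if x > 0 else a * x


def relative_improvement(baseline_error: float, new_error: float) -> float:
    """Fraction by which the error drops relative to the baseline."""
    return (baseline_error - new_error) / baseline_error


slope = 0.25  # illustrative fixed value (assumption); learned during training in practice
print([prelu(v, slope) for v in (-2.0, 0.0, 3.0)])    # [-0.5, 0.0, 3.0]
print(round(relative_improvement(6.66, 4.94) * 100))  # 26
```

The second line of output matches the "26% relative improvement" over GoogLeNet's 6.66% stated in the abstract.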
Conference Paper
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
Article
Face detection and alignment in unconstrained environments are challenging due to various poses, illuminations and occlusions. Recent studies show that deep learning approaches can achieve impressive performance on these two tasks. In this paper, we propose a deep cascaded multi-task framework which exploits the inherent correlation between them to boost their performance. In particular, our framework adopts a cascaded structure with three stages of carefully designed deep convolutional networks that predict face and landmark location in a coarse-to-fine manner. In addition, in the learning process, we propose a new online hard sample mining strategy that can improve the performance automatically without manual sample selection. Our method achieves superior accuracy over the state-of-the-art techniques on the challenging FDDB and WIDER FACE benchmarks for face detection, and the AFLW benchmark for face alignment, while keeping real-time performance.
Conference Paper
Article
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
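The reformulation described above, in which layers learn a residual function with reference to their input, can be sketched in a few lines. This is a toy illustration on plain Python lists, not the paper's architecture; `residual_block`, `zero_residual` and `small_residual` are hypothetical names, and the residual functions stand in for stacked learned layers.

```python
from typing import Callable, List

Vector = List[float]


def residual_block(x: Vector, residual_fn: Callable[[Vector], Vector]) -> Vector:
    """Return F(x) + x: the layers model the residual F, a shortcut carries x."""
    fx = residual_fn(x)
    if len(fx) != len(x):
        raise ValueError("residual and input must have the same length")
    return [f + xi for f, xi in zip(fx, x)]


def zero_residual(x: Vector) -> Vector:
    """Stand-in for layers whose output is zero everywhere."""
    return [0.0 for _ in x]


def small_residual(x: Vector) -> Vector:
    """Stand-in for layers that learned a small correction to the input."""
    return [0.1 * v for v in x]


x = [1.0, -2.0, 3.0]
print(residual_block(x, zero_residual))   # identical to x
print(residual_block(x, small_residual))  # x plus a small correction
```

When the residual function outputs zero, the block reduces to the identity mapping, which gives one intuition for why adding such blocks need not make optimization harder.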
Article
Large face datasets are important for advancing face recognition research, but they are tedious to build, because a lot of work has to go into cleaning the huge amount of raw data. To facilitate this task, we describe an approach to building face datasets that starts with detecting faces in images returned from searches for public figures on the Internet, followed by discarding those not belonging to each queried person. We formulate the problem of identifying the faces to be removed as a quadratic programming problem, which exploits the observations that faces of the same person should look similar, have the same gender, and normally appear at most once per image. Our results show that this method can reliably clean a large dataset, leading to a considerable reduction in the work needed to build it. Finally, we are releasing the FaceScrub dataset that was created using this approach. It consists of 141,130 faces of 695 public figures and can be obtained from http://vintage.winklerbros.net/facescrub.html.
Article
Face recognition has been studied for many decades. As opposed to traditional hand-crafted features such as LBP and HOG, much more sophisticated features can be learned automatically by deep learning methods in a data-driven way. In this paper, we propose a two-stage approach that combines a multi-patch deep CNN and deep metric learning, which extracts low dimensional but very discriminative features for face verification and recognition. Experiments show that this method outperforms other state-of-the-art methods on the LFW dataset, achieving 99.85% pair-wise verification accuracy and significantly better accuracy under two other, more practical protocols. This paper also discusses the importance of data size and the number of patches, showing a clear path to practical high-performance face recognition systems in the real world.
Article
This paper introduces a method for face recognition across age and also a dataset containing variations of age in the wild. We use a data-driven method to address the cross-age face recognition problem, called cross-age reference coding (CARC). By leveraging a large-scale image dataset freely available on the Internet as a reference set, CARC can encode the low-level feature of a face image with an age-invariant reference space. In the retrieval phase, our method only requires a linear projection to encode the feature and thus it is highly scalable. To evaluate our method, we introduce a large-scale dataset called cross-age celebrity dataset (CACD). The dataset contains more than 160,000 images of 2,000 celebrities with age ranging from 16 to 62. Experimental results show that our method can achieve state-of-the-art performance on both CACD and the other widely used dataset for face recognition across age. To understand the difficulties of face recognition across age, we further construct a verification subset from the CACD called CACD-VS and conduct human evaluation using Amazon Mechanical Turk. CACD-VS contains 2,000 positive pairs and 2,000 negative pairs and is carefully annotated by checking both the associated image and web contents. Our experiments show that although state-of-the-art methods can achieve competitive performance compared to average human performance, majority votes of several humans can achieve much higher performance on this task. The gap between machine and human would imply possible directions for further improvement of cross-age face recognition in the future.
Conference Paper
This paper presents a new discriminative deep metric learning (DDML) method for face verification in the wild. Different from existing metric learning-based face verification methods which aim to learn a Mahalanobis distance metric to maximize the inter-class variations and minimize the intra-class variations, simultaneously, the proposed DDML trains a deep neural network which learns a set of hierarchical nonlinear transformations to project face pairs into the same feature subspace, under which the distance of each positive face pair is less than a smaller threshold and that of each negative pair is higher than a larger threshold, respectively, so that discriminative information can be exploited in the deep network. Our method achieves very competitive face verification performance on the widely used LFW and YouTube Faces (YTF) datasets.
Article
Despite significant recent advances in the field of face recognition, implementing face verification and recognition efficiently at scale presents serious challenges to current approaches. In this paper we present a system, called FaceNet, that directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity. Once this space has been produced, tasks such as face recognition, verification and clustering can be easily implemented using standard techniques with FaceNet embeddings as feature vectors. Our method uses a deep convolutional network trained to directly optimize the embedding itself, rather than an intermediate bottleneck layer as in previous deep learning approaches. To train, we use triplets of roughly aligned matching / non-matching face patches generated using a novel online triplet mining method. The benefit of our approach is much greater representational efficiency: we achieve state-of-the-art face recognition performance using only 128-bytes per face. On the widely used Labeled Faces in the Wild (LFW) dataset, our system achieves a new record accuracy of 99.63%. On YouTube Faces DB it achieves 95.12%. Our system cuts the error rate in comparison to the best published result by 30% on both datasets.
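Training on triplets of matching and non-matching faces is commonly expressed as a hinge on the gap between two distances in the embedding space. The sketch below shows one common form of such an objective; the abstract does not spell out the loss, so the squared Euclidean distance, the margin value, and the names `squared_distance` and `triplet_loss` are all assumptions made for illustration.

```python
from typing import Sequence


def squared_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Squared Euclidean distance between two embeddings of equal length."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 0.2,  # illustrative margin (assumption)
) -> float:
    """Zero once the negative is farther from the anchor than the positive by `margin`."""
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)


anchor, positive = [0.0, 0.0], [0.1, 0.0]
print(triplet_loss(anchor, positive, negative=[1.0, 0.0]))  # easy triplet: loss is zero
print(triplet_loss(anchor, positive, negative=[0.2, 0.0]))  # hard triplet: positive loss
```

Only triplets that violate the margin contribute to the loss, which is why selecting informative triplets, as in the online mining the abstract mentions, matters for training.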
Article
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively.
Conference Paper
In modern face recognition, the conventional pipeline consists of four stages: detect => align => represent => classify. We revisit both the alignment step and the representation step by employing explicit 3D face modeling in order to apply a piecewise affine transformation, and derive a face representation from a nine-layer deep neural network. This deep network involves more than 120 million parameters using several locally connected layers without weight sharing, rather than the standard convolutional layers. Thus we trained it on the largest facial dataset to-date, an identity labeled dataset of four million facial images belonging to more than 4,000 identities, where each identity has an average of over a thousand samples. The learned representations coupling the accurate model-based alignment with the large facial database generalize remarkably well to faces in unconstrained environments, even with a simple classifier. Our method reaches an accuracy of 97.25% on the Labeled Faces in the Wild (LFW) dataset, reducing the error of the current state of the art by more than 25%, closely approaching human-level performance.
Conference Paper
This paper proposes a hybrid convolutional network (ConvNet)-Restricted Boltzmann Machine (RBM) model for face verification in wild conditions. A key contribution of this work is to directly learn relational visual features, which indicate identity similarities, from raw pixels of face pairs with a hybrid deep network. The deep ConvNets in our model mimic the primary visual cortex to jointly extract local relational visual features from two face images compared with the learned filter pairs. These relational features are further processed through multiple layers to extract high-level and global features. Multiple groups of ConvNets are constructed in order to achieve robustness and characterize face similarities from different aspects. The top-layer RBM performs inference from complementary high-level features extracted from different ConvNet groups with a two-level average pooling hierarchy. The entire hybrid deep network is jointly fine-tuned to optimize for the task of face verification. Our model achieves competitive face verification performance on the LFW dataset.
Conference Paper
Recognizing faces in unconstrained videos is a task of mounting importance. While obviously related to face recognition in still images, it has its own unique characteristics and algorithmic requirements. Over the years several methods have been suggested for this problem, and a few benchmark data sets have been assembled to facilitate its study. However, there is a sizable gap between the actual application needs and the current state of the art. In this paper we make the following contributions. (a) We present a comprehensive database of labeled videos of faces in challenging, uncontrolled conditions (i.e., 'in the wild'), the 'YouTube Faces' database, along with benchmark pair-matching tests. (b) We employ our benchmark to survey and compare the performance of a large variety of existing video face recognition techniques. Finally, (c) we describe a novel set-to-set similarity measure, the Matched Background Similarity (MBGS). This similarity is shown to considerably improve performance on the benchmark tests.
Conference Paper
We consider the automated recognition of human actions in surveillance videos. Most current methods build classifiers based on complex handcrafted features computed from the raw inputs. Convolutional neural networks (CNNs) are a type of deep model that can act directly on the raw inputs. However, such models are currently limited to handling 2D inputs. In this paper, we develop a novel 3D CNN model for action recognition. This model extracts features from both the spatial and the temporal dimensions by performing 3D convolutions, thereby capturing the motion information encoded in multiple adjacent frames. The developed model generates multiple channels of information from the input frames, and the final feature representation combines information from all channels. To further boost the performance, we propose regularizing the outputs with high-level features and combining the predictions of a variety of different models. We apply the developed models to recognize human actions in the real-world environment of airport surveillance videos, and they achieve superior performance in comparison to baseline methods.
Article
Most face databases have been created under controlled conditions to facilitate the study of specific parameters on the face recognition problem. These parameters include such variables as position, pose, lighting, background, camera quality, and gender. While there are many applications for face recognition technology in which one can control the parameters of image acquisition, there are also many applications in which the practitioner has little or no control over such parameters. This database, Labeled Faces in the Wild, is provided as an aid in studying the latter, unconstrained, recognition problem. The database contains labeled face photographs spanning the range of conditions typically encountered in everyday life. The database exhibits “natural” variability in factors such as pose, lighting, race, accessories, occlusions, and background. In addition to describing the details of the database, we provide specific experimental paradigms for which the database is suitable. This is done in an effort to make research performed with the database as consistent and comparable as possible. We provide baseline results, including results of a state of the art face recognition system combined with a face alignment system. To facilitate experimentation on the database, we provide several parallel databases, including an aligned version.
Article
This paper presents a novel and efficient facial image representation based on local binary pattern (LBP) texture features. The face image is divided into several regions from which the LBP feature distributions are extracted and concatenated into an enhanced feature vector to be used as a face descriptor. The performance of the proposed method is assessed in the face recognition problem under different challenges. Other applications and several extensions are also discussed.
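The descriptor outlined above (per-region LBP histograms concatenated into one vector) can be sketched as follows. This is a minimal illustration of the basic 8-neighbour operator on a nested-list grayscale image, not the authors' implementation; the neighbour ordering, the 256-bin histograms, the skipping of border pixels, and all function names are assumptions.

```python
from typing import List

Image = List[List[int]]

# Offsets of the 8 neighbours, clockwise from the top-left pixel (ordering is an assumption).
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(img: Image, r: int, c: int) -> int:
    """8-bit code: bit k is set when neighbour k is >= the centre pixel."""
    center = img[r][c]
    code = 0
    for bit, (dr, dc) in enumerate(NEIGHBOURS):
        if img[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(img: Image, top: int, left: int, height: int, width: int) -> List[int]:
    """256-bin histogram of LBP codes in a region, skipping image border pixels."""
    hist = [0] * 256
    for r in range(max(top, 1), min(top + height, len(img) - 1)):
        for c in range(max(left, 1), min(left + width, len(img[0]) - 1)):
            hist[lbp_code(img, r, c)] += 1
    return hist


def face_descriptor(img: Image, grid_rows: int, grid_cols: int) -> List[int]:
    """Concatenate the regional histograms into a single feature vector."""
    h, w = len(img) // grid_rows, len(img[0]) // grid_cols
    descriptor: List[int] = []
    for row in range(grid_rows):
        for col in range(grid_cols):
            descriptor.extend(lbp_histogram(img, row * h, col * w, h, w))
    return descriptor


patch = [[5, 9, 1], [4, 6, 7], [2, 8, 3]]
print(lbp_code(patch, 1, 1))  # 42: neighbours 9, 7 and 8 are >= 6 (bits 1, 3, 5)
```

Keeping one histogram per region rather than a single global histogram preserves coarse spatial layout, which is the point of dividing the face image into regions.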