Article

Tattoo Image Search at Scale: Joint Detection and Compact Representation Learning


Abstract

The explosive growth of digital images in video surveillance and social media has led to a significant need for efficient search of persons of interest in law enforcement and forensic applications. Despite tremendous progress in person identification based on primary biometric traits (e.g., face and fingerprint), a single biometric trait alone cannot meet the desired recognition accuracy in forensic scenarios. Tattoos, as one of the important soft biometric traits, have been found to be valuable for assisting in person identification. However, tattoo search in a large collection of unconstrained images remains a difficult problem, and existing tattoo search methods mainly focus on matching cropped tattoos, which differs from real application scenarios. To close the gap, we propose an efficient tattoo search approach that is able to learn tattoo detection and compact representation jointly in a single convolutional neural network (CNN) via multi-task learning. While the features in the backbone network are shared by both tattoo detection and compact representation learning, individual latent layers of each sub-network optimize the shared features toward the detection and feature learning tasks, respectively. We resolve the small batch size issue inside the joint tattoo detection and compact representation learning network via random image stitch and preceding feature buffering. We evaluate the proposed tattoo search system using multiple public-domain tattoo benchmarks, and a gallery set with about 300K distracter tattoo images compiled from these datasets and images from the Internet. In addition, we also introduce a tattoo sketch dataset containing 300 tattoos for sketch-based tattoo search. Experimental results show that the proposed approach has superior performance in tattoo detection and tattoo search at scale compared to several state-of-the-art tattoo retrieval algorithms.
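As a rough sketch of the joint detection and compact-representation idea described in the abstract (a multi-task network with shared backbone features and task-specific latent layers), the skeleton below shows one plausible PyTorch layout; the layer sizes, anchor count, and embedding dimension are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class JointTattooNet(nn.Module):
    """Minimal sketch: shared backbone, task-specific latent layers."""
    def __init__(self, num_anchors=9, embed_dim=256):
        super().__init__()
        # shared backbone features used by both sub-networks
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        # detection sub-network: objectness + box regression per anchor
        self.det_latent = nn.Conv2d(128, 256, 3, padding=1)
        self.cls_head = nn.Conv2d(256, num_anchors, 1)
        self.box_head = nn.Conv2d(256, num_anchors * 4, 1)
        # compact-representation sub-network: pooled, normalized embedding
        self.emb_latent = nn.Conv2d(128, 256, 3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.embed = nn.Linear(256, embed_dim)

    def forward(self, x):
        f = self.backbone(x)
        d = torch.relu(self.det_latent(f))
        e = self.pool(torch.relu(self.emb_latent(f))).flatten(1)
        z = nn.functional.normalize(self.embed(e), dim=1)
        return self.cls_head(d), self.box_head(d), z
```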


... In the case of [12], despite the relatively large number of samples and an annotation focused on tattoo detection, the dataset is private, which makes comparisons, establishing benchmarks, and analyzing new methods for the dataset difficult. On the other hand, the dataset proposed in [23] is publicly available and annotated only with BB, i.e., it is not possible to use it in the context of segmentation. The datasets proposed in [24] and [22], in turn, are focused exclusively on classification and provide only one label for each image or image patch, restricting their use to multiclass classification problems, without the location of the tattoo in the image. ...
... In the context of tattoo semantic segmentation, as stated in the previous Section, to the extent of our knowledge, no works focus on the open-set context; the most closely related works are [23] and [47]. In [23], a tattoo search approach that learns tattoo detection and compact representation jointly in a single CNN via multi-task learning is presented. However, the compactness proposed by the authors is more focused on the compressive yet discriminative feature learning for large-scale visual search and ...
Article
Full-text available
Tattoos can serve as an essential source of biometric information for public security, aiding in identifying suspects and victims. In order to automate tattoo analysis, tasks like classification require more detailed image content analysis, such as semantic segmentation. However, a dataset with appropriate semantic segmentation annotations is currently lacking. Also, there are countless ways to categorize tattoo classes, and many are not directly categorizable, either because they belong to a specific artistic trait or characterize an object with previously undefined semantics. An effective way to overcome these limitations is to build recognition systems based on open-set assumptions. Nevertheless, state-of-the-art open-set approaches are not directly applicable to tattoo semantic segmentation, mainly due to the significant class imbalance (predominant background). To the best of our knowledge, this paper is the first to explore semantic segmentation in closed- and open-set scenarios for tattoos. In this sense, this paper presents two key contributions: (i) a novel large-margin loss function and generalized open-set classifier approach and (ii) an open-set tattoo semantic segmentation dataset with a publicly accessible test set, enabling comparisons and future research in this area. The proposed approach outperforms other methods, achieving an AUROC of 0.8013, a Macro F1 of 0.6318, an mIoU of 0.4900, and notably an IoU of 0.2753 for the unknown class, demonstrating the feasibility of this approach for automatic tattoo analysis. The paper also highlights key limitations and open research areas in this challenging field. Dataset and codes are available at https://github.com/Brilhador/tssd2023.
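The paper's novel large-margin loss is not spelled out here; as a reference point, a standard additive-margin softmax (one common large-margin formulation) over cosine logits can be sketched as follows, with the scale s and margin m as assumed hyperparameters.

```python
import torch
import torch.nn.functional as F

def am_softmax_loss(cos_logits, labels, s=30.0, m=0.35):
    # cos_logits: (N, C) cosine similarities between (pixel) features and
    # class prototypes; the margin m is subtracted from the target class only
    one_hot = F.one_hot(labels, cos_logits.size(1)).float()
    return F.cross_entropy(s * (cos_logits - m * one_hot), labels)
```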
... Although a tattoo alone may not be sufficient for reliable subject identification, it can, however, help as an additional factor to establish the identity of suspects or victims in forensic investigations. These reasons have therefore driven the development of automated image-based tattoo retrieval techniques. ...
[Table fragment — tattoo datasets: [3] 890 images; NTU-Tattoo-V1 [6] 5,740 images, no localisation annotation; WebTattoo [4] 5,000 images, bounding boxes; BIVTatt [7] 210 images, no localisation annotation.]
... To that end, some studies have addressed the tattoo segmentation task by using hand-crafted techniques [2] or the recent concept of semantic segmentation with the use of Convolutional Neural Networks (CNNs) [3]. To develop efficient systems, other works have proposed end-to-end schemes combining tattoo detection and retrieval [4], [5]. ...
... Inspired by the pixel-wise segmentation advances obtained in 2014 by Simonyan and Zisserman [9], Hrkac et al. [3], [10] proposed both a new uncontrolled tattoo database called "DeMSI" and an architecture for tattoo segmentation. More recently, Han et al. [4] introduced a system that was able to learn tattoo detection and compact representation thereof jointly in a single CNN via multi-task learning. Following this idea, Zhang et al. [11] proposed a CNN-based framework for joint tattoo detection and person reidentification. ...
... Throughout the years, tattoos have played a significant role in the identification of criminals and victims in a variety of forensic scenarios [8]. They are considered a soft biometric trait [3] that provides some identifying information about an individual, and they present several advantages [14] that have gained the attention of the scientific community in recent years [2,5,17]. Tattoo feature extraction and recognition is a fundamental step in the development of automatic tattoo identification systems. ...
... Many works have proposed "hand-crafted" CBIR methods for tattoo identification, with SIFT [15] arguably the most used one [6,14]. However, most recent tattoo identification research [4,5] has focused on deep learning based methods due to their great success on many computer vision tasks. ...
... In [5], a multi-task learning approach was used to train a Faster R-CNN based network to detect and describe the content of a tattoo at the same time. The retrieval process is optimized by feature binarization. ...
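The feature-binarization step mentioned above lends itself to Hamming-distance ranking; a minimal sketch (plain sign thresholding, not necessarily the exact scheme used in [5]):

```python
import numpy as np

def binarize(features):
    # sign-threshold real-valued embeddings into compact binary codes
    return (features > 0).astype(np.uint8)

def hamming_rank(query_bits, gallery_bits):
    # rank gallery by Hamming distance (number of mismatching bits)
    dist = (query_bits[None, :] != gallery_bits).sum(axis=1)
    return np.argsort(dist)   # most similar tattoos first
```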
Article
Full-text available
The use of deep learning in computer vision has been extremely successful. Nevertheless, for the tattoo recognition task very few approaches incorporate this technique, mainly due to the lack of large datasets to train specific models. Within this domain, some works have used the intermediate layers of pre-trained object classification networks to extract a global tattoo image descriptor, avoiding the expensive work of training from scratch. Although that approach showed good results, it does not incorporate specific knowledge of tattoo identification. In this work, we propose an attention pooling method that addresses this problem. Our method uses several functions to weight the local features of a convolutional feature map, and the resulting pooled features are then averaged using a second set of weights associated with each function. The use of these weighting functions gives more or less importance to local regions of the tattoo image, allowing the recognition process to take into account some domain-specific characteristics. This approach showed promising results on three tattoo databases, outperforming previous state-of-the-art works.
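A minimal PyTorch sketch of the attention-pooling idea (several spatial weighting functions whose pooled outputs are averaged with learned per-function weights); the concrete weighting functions and their learned weights are placeholders, not the paper's definitions.

```python
import torch

def attention_pool(fmap, weight_fns, fn_weights):
    # fmap: (B, C, H, W) convolutional feature map
    # weight_fns: functions mapping fmap -> non-negative (B, 1, H, W) maps
    # fn_weights: learnable tensor of shape (len(weight_fns),)
    pooled = []
    for fn in weight_fns:
        w = fn(fmap)                                      # spatial attention
        w = w / (w.sum(dim=(2, 3), keepdim=True) + 1e-8)  # normalize per image
        pooled.append((fmap * w).sum(dim=(2, 3)))         # weighted avg (B, C)
    pooled = torch.stack(pooled, dim=0)                   # (K, B, C)
    alpha = torch.softmax(fn_weights, dim=0)[:, None, None]
    return (alpha * pooled).sum(dim=0)                    # (B, C) descriptor
```

For instance, weight_fns could contain a mean-activation map such as `lambda f: f.mean(dim=1, keepdim=True).clamp(min=0)`.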
... To evaluate the effectiveness of our approaches, we conduct extensive experiments on three benchmark datasets, including the CUHK-SYSU dataset [57], the PRW dataset [45] and the Webtattoo dataset [58], for image search tasks. The first two datasets focus on person image search (i.e., person search), which refers to joint object (person) detection and person re-identification tasks in our models. ...
... The Webtattoo dataset [58] contains tattoo images captured under different viewpoints and illuminations, and consists of three parts: (i) the first part is a combination of three small-scale (less than 10K) tattoo datasets: Tatt-C [60], Flickr [61] and DeMSI [62]. ...
... The Webtattoo dataset [58] is proposed for searching the gallery image set for images containing the same tattoo as the query (probe) image. Therefore, this dataset is used for object (tattoo) search instead of person search. ...
Preprint
The traditional object retrieval task aims to learn a discriminative feature representation with intra-similarity and inter-dissimilarity, which supposes that the objects in an image are manually or automatically pre-cropped exactly. However, in many real-world searching scenarios (e.g., video surveillance), the objects (e.g., persons, vehicles, etc.) are seldom accurately detected or annotated. Therefore, object-level retrieval becomes intractable without bounding-box annotation, which leads to a new but challenging topic, i.e. image-level search. In this paper, to address the image search issue, we first introduce an end-to-end Integrated Net (I-Net), which has three merits: 1) A Siamese architecture and an on-line pairing strategy for similar and dissimilar objects in the given images are designed. 2) A novel on-line pairing (OLP) loss is introduced with a dynamic feature dictionary, which alleviates the multi-task training stagnation problem, by automatically generating a number of negative pairs to restrict the positives. 3) A hard example priority (HEP) based softmax loss is proposed to improve the robustness of classification task by selecting hard categories. With the philosophy of divide and conquer, we further propose an improved I-Net, called DC-I-Net, which makes two new contributions: 1) two modules are tailored to handle different tasks separately in the integrated framework, such that the task specification is guaranteed. 2) A class-center guided HEP loss (C2HEP) by exploiting the stored class centers is proposed, such that the intra-similarity and inter-dissimilarity can be captured for ultimate retrieval. Extensive experiments on famous image-level search oriented benchmark datasets demonstrate that the proposed DC-I-Net outperforms the state-of-the-art tasks-integrated and tasks-separated image search models.
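The OLP loss itself relies on a dynamic feature dictionary that is not reproduced here; in its spirit, a simple online contrastive pairing over detected-object features might look like this sketch (the margin and the pairing rule are assumptions, not the paper's formulation):

```python
import torch
import torch.nn.functional as F

def online_pairing_loss(feats, ids, margin=0.5):
    # feats: (N, D) L2-normalized features of objects detected in a batch
    # ids: (N,) instance labels; the same id seen in two images forms a
    # positive pair (assumes the batch contains at least one such pair)
    sim = feats @ feats.t()                         # pairwise cosine similarity
    pos = ids[:, None] == ids[None, :]
    eye = torch.eye(len(ids), dtype=torch.bool, device=feats.device)
    pos_sim = sim[pos & ~eye]                       # similar-object pairs
    neg_sim = sim[~pos]                             # dissimilar-object pairs
    # pull positives toward similarity 1, push negatives below the margin
    return (1.0 - pos_sim).mean() + F.relu(neg_sim - margin).mean()
```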
... Several techniques have been proposed to automatically identify or verify the identity of a person based on soft biometric traits [2,8]. In particular, person identification and retrieval systems based on tattoos have gained much interest in recent years [4,16]. This is due to several reasons, including the fact that the tendency of people to have tattoos has increased [11]. ...
... However, in recent years deep learning methods have shown better results than this kind of feature in similar computer vision tasks [3]. That is the reason why some works [3,4] have proposed the use of deep neural networks for tattoo identification. ...
... We extract the features from intermediate layers of these networks and use them as descriptors of the tattoo images. The difference with previous works [3,4] is that we show that it is possible to achieve competitive results without training or even fine-tuning the networks. ...
Chapter
Full-text available
Recently, interest has grown in using tattoos as a biometric feature for person identification. Previous works used handcrafted features for the tattoo identification task, such as SIFT. However, deep learning methods have shown better results than such methods in many computer vision tasks. Taking into account that there is little research on tattoo identification using deep learning, we assess several publicly available CNN models, pre-trained on large generic image databases, for the task of tattoo identification. We believe that, since tattoos mostly depict objects of the real world, their semantic and visual features might be related to those learned from a generic image database with real objects. Our experiments show that these models can outperform previous approaches without even fine-tuning them for tattoo identification. This allows developing tattoo identification applications with minimum implementation cost. Besides, due to the difficult access to public tattoo databases, we created two tattoo datasets and put one of them in the public domain.
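A minimal sketch of this off-the-shelf strategy, assuming torchvision's ResNet-50 as the frozen pre-trained backbone (the backbone choice and layer are assumptions; the chapter evaluates several such models):

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# assumption: an ImageNet-pretrained ResNet-50 used purely as a frozen extractor
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
extractor = torch.nn.Sequential(*list(backbone.children())[:-1]).eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def describe(img):
    # img: a PIL image of a (cropped) tattoo -> (1, 2048) L2-normalized descriptor
    x = preprocess(img).unsqueeze(0)
    f = extractor(x).flatten(1)
    return torch.nn.functional.normalize(f, dim=1)
```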
... La identificación de personas desaparecidas en México: retos actuales y potencial de los tatuajes. Rev Mex Med Forense, 10(1): 21-40. ...
... Beyond the description of the tattoo, the collection of AM and PM data should ensure detailed and comprehensive photographic documentation, or consider the available images of the tattoos of missing persons, which may be provided by relatives, investigating authorities, non-governmental organizations, and/or social networks. Automatic tattoo comparison can help in a content-based image retrieval system such as those proposed by several authors (Jain AK, et al., 2009; Han H, et al., 2019), but it is not yet applicable in routine cases, and until a technically appropriate system is developed and implemented, practical solutions must be used to identify unidentified deceased persons quickly and reliably (Tendencias, 2009; CICR, 2022). Considering the current situation in Mexico, the authors recommend increasing the use of tattoos as an individualization tool in order to assist in the identification of deceased persons. Depending on the individual circumstances of each case, tattoos can be used as the sole or a complementary identification method, or to help prioritize subsequent investigations. ...
Article
Full-text available
More than 110,000 people are missing in Mexico. It is very likely that some of them are among the 52,000 registered but unidentified bodies. Identifying these bodies quickly and reliably is a colossal but essential task: for the deceased themselves, for the relatives and friends searching for them, and for upholding the rule of law. This study addresses the overall process of identifying unknown bodies in Mexico, in particular the (current) limits of using fingerprints, DNA, and dental status as primary identifiers. The use of tattoos as secondary identifiers is discussed, as an indication of identity, or possibly even to establish identity. To this end, 1,000 tattooed bodies were examined and a classification of the tattoos was developed in order to compare them efficiently with data on missing persons. In addition, a systematic descriptive registration of tattoos on deceased persons is proposed, including the corresponding photographic documentation. Hydrogen peroxide and infrared photography are presented for re-visualizing tattoos after a longer postmortem interval, mummification, and/or exposure to fire. The overall aim is to create a nationwide database with data on persons reported missing and data on unknown deceased persons that automatically searches for matching characteristics and thus makes the identification process more efficient. Until such a nationwide database exists, secondary identifiers should be used at the local level to generate more positive identifications.
... Some of the literature uses different deep architectures together to extract efficient features, including CNN + RNN [83], DAE + GAN [74,84], DAE + RNN [36,85], RNN + CapsNet [81], and CNN + GNN [48,86]. The published papers basically present the body shape in two different ways: one is appearance-based, which is the silhouette, and another is pose-based, which is the skeleton: the 2D or 3D body joint representation [87]. ...
... For example, an RNN can be used to process the silhouettes of different gait cycles, and the learned features are then used for classification. Moreover, an RNN can also be used to process joint angles of different gait cycles, with the learned features used for gender classification [87,119]. It can also be used to process acceleration signals from wearable sensors, with the learned features used for activity recognition. ...
Article
Full-text available
Gait recognition, also known as walking pattern recognition, has attracted deep interest from the computer vision and biometrics community due to its potential to identify individuals from a distance. It has attracted increasing attention due to its potential applications and non-invasive nature. Since 2014, deep learning approaches have shown promising results in gait recognition by automatically extracting features. However, recognizing gait accurately is challenging due to covariate factors, the complexity and variability of environments, and human body representations. This paper provides a comprehensive overview of the advancements made in this field along with the challenges and limitations associated with deep learning methods. For that, it initially examines the various gait datasets used in the literature and analyzes the performance of state-of-the-art techniques. After that, a taxonomy of deep learning methods is presented to characterize and organize the research landscape in this field. Furthermore, the taxonomy highlights the basic limitations of deep learning methods in the context of gait recognition. The paper concludes by focusing on the present challenges and suggesting several research directions to improve the performance of gait recognition in the future.
... Studies have investigated tattoos from an information science perspective (authors, 2021;Fortier & Menard, 2018;Gorichanaz, 2016;Sundberg & Kjellman, 2018;Cwojdzinski, 2019;Ellis, 2008;Han, Shan, & Chen, 2019;Minahan, 2015;Chronis, 2019;Perzanowski, 2017;Tan, 2013). Tattoos are an important part of personal and collective identity containing narratives and memory (Patterson, 2017). ...
... The ability of tattoos to function as a document and a record is noticeable in research coming from a wide variety of fields. For example, tattoos can record medical information (Lai, Nesrsta, Bodiford, & Jain, 2018;Spaete, Zheng, Chow, Burbridge, & Garman, 2019;Wolf & Laumann, 2008), or be used in criminology for identification purposes (Bȃlan, 2020;Han et al., 2019;Miranda, 2019). There has been a limited amount of research within the field of library and information science regarding the informational nature of tattoos. ...
Article
The tattoo information experience reveals possibilities to explore how tattoo images are created as things, what actions lead to the creation of a tattoo image, who is considered a creator of a tattoo image, and how different personal, social and cultural contexts influence creation of information through the tattoo acquisition experience. Based on the findings from nine interviews, the process of tattoo information creation was conceptualized encompassing all stages of the tattoo experience: from the moment the first idea of getting a tattoo emerges to sharing of information about a tattoo. Participants' stories about their tattoo experiences were used to develop a framework of four key phases of tattoo information creation: conceptualizing, verbalizing, visualizing, and pluralizing. These phases occur between four anchors identified in the participants' stories: anticipation, identification, ideation, and creation. This framework can be used to assist future empirical and theoretical research on tattoo information experience.
... Heflin et al. [11] introduced a methodology that uses a graph-based visual saliency model (GBVS) [12] and GrabCut [13] for localising tattoos found in unconstrained images. Han et al. [14] proposed an efficient tattoo localisation method that is able to learn tattoo localisation and representation in a single CNN via multi-task learning. Chowdhury et al. [15] combine a skin segmentation procedure with a deformable convolution and inception (DCINN)-based scene text detector. ...
... The proposed text tattoo localisation networks utilise the relationship between general tattoos and text tattoos and therefore, a large general tattoo dataset is also required. Existing open-access tattoo datasets (Tatt-C [4], DeMSI [39], NTU Tattoo V1 [40] and WebTattoo [14]) contain a very limited number of tattoo images for the tattoo localisation task. Table 1 summarises the size of each public tattoo dataset. ...
Article
Full-text available
Text tattoos contain rich information about an individual for forensic investigation. To extract this information, text tattoo localisation is the first and essential step. Previous tattoo studies applied existing object detectors to detect general tattoos, but none of them considered text tattoo localisation, neglecting the prior knowledge that text tattoos are usually inside or nearby larger tattoos and appear only on human skin. To use this prior knowledge, a prior knowledge-based attention mechanism (PKAM) and a network named Text Tattoo Localisation Network based on Double Attention (TTLN-DA) are proposed. In addition to TTLN-DA, two variants of TTLN-DA are designed to study the effectiveness of different prior knowledge. For this study, NTU Tattoo V2, the largest tattoo dataset, and NTU Text Tattoo V1, the largest text tattoo dataset, are established. To examine the importance of the prior knowledge and the effectiveness of the proposed attention mechanism and the networks, TTLN-DA and its variants are compared with state-of-the-art object detectors and text detectors. The experimental results indicate that the prior knowledge is vital for text tattoo localisation; the PKAM contributes significantly to the performance, and TTLN-DA outperforms the state-of-the-art object detectors and scene text detectors.
... Moreover, pre-trained CNNs can be used to generate high-level feature descriptors to represent visual objects in images. The activations generated at the intermediate layers of pre-trained CNNs can be used for instance search [14]-[16], in which, given a query image, similar objects can be searched and retrieved from different images or videos. Region-based proposal methods [17], [18] based on CNNs play a crucial role in object retrieval by easing the object localization process [19]. ...
... Since the breakthrough of deep learning in the computer vision domain, the neural activations of a pre-trained network have served as a robust image descriptor. Several works [14], [27] used the neural activations extracted from the intermediate layers and achieved state-of-the-art results in instance retrieval tasks. Radenovic et al. [28] proposed to fine-tune CNNs for image retrieval by introducing a trainable generalized-mean pooling layer that boosts the retrieval performance. ...
Article
Full-text available
Image representations in the form of neural activations derived from intermediate layers of deep neural networks are the state-of-the-art descriptors for instance-based retrieval. However, the problem that persists is how to retrieve identical images as the most relevant ones from a large image or video corpus. In this work, we introduce colour neural descriptors that are made of convolutional neural network (CNN) features obtained by combining different colour spaces and colour channels. In contrast to previous works, which rely on fine-tuning pre-trained networks, we compute the proposed descriptors based on the activations generated from a pretrained VGG-16 network without fine-tuning. Besides, we take advantage of an object detector to optimize our proposed instance retrieval architecture to generate features at both local and global scales. In addition, we introduce a stride-based query expansion technique to retrieve objects from multi-view datasets. Finally, we experimentally prove that the proposed colour neural descriptors obtain state-of-the-art results on the Paris 6K, Revisiting-Paris 6k, INSTRE-M and COIL-100 datasets, with mAPs of 81.70, 82.02, 78.8 and 97.9, respectively.
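A rough sketch of the colour-space idea (one descriptor per colour space from a frozen VGG-16, then concatenation), assuming OpenCV for the conversions; the choice of HSV and Lab, the pooling, and the skipped ImageNet normalization are simplifications, not the paper's exact pipeline.

```python
import cv2
import torch
from torchvision.models import vgg16, VGG16_Weights

# assumption: frozen VGG-16 conv trunk; ImageNet normalization omitted for brevity
vgg = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features.eval()

@torch.no_grad()
def colour_descriptor(bgr):
    # bgr: HxWx3 uint8 OpenCV image; one pooled activation vector per
    # colour space, L2-normalized and concatenated into a single descriptor
    spaces = [bgr,
              cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV),
              cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)]
    feats = []
    for img in spaces:
        x = torch.from_numpy(img).permute(2, 0, 1).float().unsqueeze(0) / 255.0
        f = vgg(x).mean(dim=(2, 3))                 # (1, 512) pooled activations
        feats.append(torch.nn.functional.normalize(f, dim=1))
    return torch.cat(feats, dim=1)                  # (1, 1536) colour descriptor
```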
... It uses a detector trained on the Tatt-C and PASCAL VOC 2007 datasets, based on region-based deep learning. In [27], an efficient approach for tattoo search is used, based on convolutional neural networks (CNNs) that integrate detection and compact representation of tattoos jointly. ...
Conference Paper
Full-text available
In recent years, tattoo-based person identification and retrieval systems have attracted growing interest, especially for their value in assisting law enforcement agencies with the identification of individuals in various investigations. An effective solution to this problem lies in similarity search systems, which allow retrieving images similar to a given query. In this study, an approach based on preprocessing and feature extraction via Transfer Learning with the ResNet50 architecture is proposed to perform similarity-based tattoo searches. This method demonstrates remarkable performance, achieving significant hit rates without the need for retraining or fine-tuning, making it an efficient and robust option for operational applications.
... It achieved only 53.38% accuracy on its own dataset, a rather modest result. Recently, Han et al. [20] have also presented a detection model using Faster R-CNN. This model cast the detection problem as part of an image retrieval system in which representation learning and detection were performed simultaneously. ...
Article
Full-text available
The large number of images in different areas and the possibilities of modern technologies lead to various solutions in automatization using image data. In this paper, tattoo detection and identification were analyzed. The combination of YOLOv5 object detection methods and similarity measures was investigated. During the experimental research, various parameters were investigated to determine the best combination of parameters for tattoo detection. In this case, the influence of data augmentation parameters, the size of the YOLOv5 models (n, s, m, l, x), and the three main hyperparameters of YOLOv5 were analyzed. Also, the efficiency of the most popular similarity distances, cosine and Euclidean, was analyzed in the tattoo identification process with the purpose of matching the detected tattoo with the person's tattoo in the database. Experiments were performed using the deMSI dataset, where images were manually labeled to be suitable for use by the YOLOv5 algorithm. To validate the results obtained, a newly collected tattoo dataset was used. The results show that the highest average accuracy across all tattoo detection experiments was obtained using the YOLOv5l model, where mAP@0.5:0.95 equals 0.60 and mAP@0.5 equals 0.79. The accuracy for tattoo identification reaches 0.98, and the F-score is up to 0.52 when the tattoo with the highest cosine similarity is associated. Meanwhile, to ensure that no suspects will be missed, a cosine similarity threshold value of 0.15 should be applied. Then, only photos with higher similarity scores should be analyzed. This would lead to a recall of 1.0 and would reduce manual tattoo comparison by 20%.
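The matching rule described above (associate the highest-cosine-similarity tattoo, with a 0.15 threshold so no suspect is missed) can be sketched directly:

```python
import numpy as np

def cosine_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def match_tattoo(query_feat, db_feats, threshold=0.15):
    # keep every database tattoo at or above the similarity threshold,
    # sorted so the analyst reviews the most similar candidates first
    sims = np.array([cosine_sim(query_feat, f) for f in db_feats])
    keep = np.where(sims >= threshold)[0]
    return keep[np.argsort(-sims[keep])]
```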
... Han et al. [10] (2018) propose a method to detect and recognize tattoos on human bodies in surveillance images for POI detection and in social media search. ...
Conference Paper
Full-text available
The research problem addressed in the paper centers around the difficulty of identifying Persons of Interest (POIs) in law enforcement activity due to the vast amount of data stored on mobile devices. Given the complexity and volume of mobile forensic data, traditional analysis methods are often insufficient. The paper proposes leveraging Artificial Intelligence (AI) techniques, including machine learning and natural language processing, to improve the efficiency and effectiveness of data analysis in mobile forensics. This approach aims to overcome the limitations of manual data examination and enhance the identification process of POIs in a forensically sound manner. The main objective of the study is to explore and demonstrate the effectiveness of Artificial Intelligence techniques in improving the identification of POIs from mobile forensic data. The study proposes AI-driven approaches, particularly machine learning and natural language processing, which can significantly enhance the efficiency, accuracy, and depth of analysis in mobile forensics, thereby addressing the challenges of handling vast amounts of data and the complexity of modern digital evidence. The study employs a quantitative research design, utilizing AI algorithms to process mobile forensic data from simulated environments. The study particularly demonstrates how deep learning can be utilized for searching POIs in WhatsApp messenger data. The result of the experiment shows that using AI for face recognition may produce false positive results, which means humans cannot be replaced at the current stage of AI evolution. Also, the results emphasize that using AI is helpful in mobile forensics data analysis, achieving an 88% rate of successful face recognition. The findings underscore the transformative potential of AI in mobile forensics, highlighting its capacity to enhance investigative accuracy and efficiency. This advancement could lead to more effective law enforcement and judicial processes by enabling quicker identification of POIs with higher precision. Moreover, the research underscores the importance of addressing ethical and privacy concerns in the application of AI technologies in forensic investigations, suggesting a balanced approach to leverage AI benefits while safeguarding individual rights.
... In order to enhance the search diversity and the quality of the user experience, WSEs have been developed that can search by image file: when a user wants to search for information that is difficult to express in terms of keywords, the WSE allows the user to upload an image file to carry out a search. An image WSE uses the image path name, file name and surrounding text information to search based on text keywords [7][8][9][10][11][12][13][14][15]. The performance of this type of search is lower than expected, as the above information is not highly correlated with the image file. ...
Article
Full-text available
With the rapid development of the Internet and the World Wide Web, and the increasing amounts and variety of information on the Internet, people can now use search engines to obtain a diverse and rich range of information. This paper proposes a user intent prediction search engine system (UIPSES) based on query history, using machine learning and deep learning image recognition technologies. Two different search methods are developed, based on a user keyword search and an uploaded image file search. The uploaded image file search uses deep learning image recognition technology to obtain multiple intent features for the image. Both the keyword and image searches use machine learning technology to extract multiple search intent features from the search logs, which are used as a basis for creating a user intent prediction for the keyword information search and the image file search. UIPSES provides website index information highly correlated between user browsing and predicted intent behaviour, and uses machine learning to periodically train each user search process to update the user search intent recognition model to adapt to changes in user intent, to improve the overall inference performance and analyse the accuracy of UIPSES, and to realise a search engine system with personalisation and a high-quality user experience. UIPSES is a novel image search system that compares the relevance of search engine results for image and text information, using mean average precision, against well-known advanced web image search engines (Google, Bing, and Yandex). When the user uploads an image file for a search, the highest mean average precision value achieved by these three web image search engines was 2.28% for image and text information feedback. In contrast, UIPSES can adapt to different conditions for single-object or multi-object image searches by obtaining multiple features from images and making inferences based on search logs, and therefore achieves high mean average precision values of 82.57% and 98.28%. UIPSES can also accurately find preset image and text information with higher relevance to allow users to search for images.
... Hence, various biometric modalities and gait can complement one another to compensate for each other's weaknesses in the context of a multi-biometric system [246], [247]. Apart from the complementary (hard-)biometric traits, soft-biometric traits such as age [248], height [249], [250], weight [251], gender [252], and particular body marks including tattoos [253] can also be included to boost overall performance. The combination of other soft- and hard-biometric traits with gait has mostly been done in the literature based on non-deep methods [254], [255], [256], [257], [258], [259], while multi-modal deep learning methods [260], [261], notably based on fusion [262], joint learning [234], and attention [263] networks, can also be adopted. ...
Article
Full-text available
Gait recognition is an appealing biometric modality which aims to identify individuals based on the way they walk. Deep learning has reshaped the research landscape in this area since 2015 through the ability to automatically learn discriminative representations. Gait recognition methods based on deep learning now dominate the state-of-the-art in the field and have fostered real-world applications. In this paper, we present a comprehensive overview of breakthroughs and recent developments in gait recognition with deep learning, and cover broad topics including datasets, test protocols, state-of-the-art solutions, challenges, and future research directions. We first review the commonly used gait datasets along with the principles designed for evaluating them. We then propose a novel taxonomy made up of four separate dimensions namely body representation, temporal representation, feature representation, and neural architecture, to help characterize and organize the research landscape and literature in this area. Following our proposed taxonomy, a comprehensive survey of gait recognition methods using deep learning is presented with discussions on their performances, characteristics, advantages, and limitations. We conclude this survey with a discussion on current challenges and mention a number of promising directions for future research in gait recognition.
... Text detection and recognition in natural scene images and video is gaining huge popularity because of several real-world applications, such as surveillance, monitoring and forensic applications [1][2][3][4]. In the same way, tattoo detection and identification also play a vital role in person identification [5], criminal/terrorist identification [6], studying personality traits [7] and the psychology of a person [8]. This is because tattoos are considered soft biometric features for person identification. ...
Chapter
Identifying tattoos is an integral part of forensic investigation and crime identification. Tattoo text detection is challenging because of its freestyle handwriting over the skin region with a variety of decorations. This paper introduces a Deformable Convolution and Inception based Neural Network (DCINN) for detecting tattoo text. Before tattoo text detection, the proposed approach detects skin regions in the tattoo images based on color models. This results in skin regions containing tattoo text, which reduces the background complexity of the tattoo text detection problem. For detecting tattoo text in the skin regions, we explore a DCINN, which generates binary maps from the final feature maps using a differential binarization technique. Finally, polygonal bounding boxes are generated from the binary map for any orientation of text. Experiments on our Tattoo-Text dataset and two standard datasets of natural scene text images, namely Total-Text and CTW1500, show that the proposed method is effective in detecting tattoo text as well as natural scene text in images. Furthermore, the proposed method outperforms the existing text detection methods on several criteria.
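The differential binarization step is presumably in the spirit of the well-known differentiable-binarization formulation from scene-text detection, which the sketch below makes concrete; the steepness factor k is an assumed value.

```python
import torch

def differentiable_binarize(prob_map, thresh_map, k=50.0):
    # soft, differentiable approximation of hard thresholding:
    #   B = 1 / (1 + exp(-k * (P - T)))
    # P is the text probability map, T a (learned) threshold map
    return torch.sigmoid(k * (prob_map - thresh_map))
```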
... Ma et al. [13] have used a convolutional neural network to assist in image retrieval. Representation learning with a convolutional neural network can also be used to search for a specific form of an image in image forensics, as seen in the work of Han et al. [14]. Reillo et al. [15] have used attribute-based identification of varied human signatures by extracting geometric attributes. ...
Article
Full-text available
Digital image forensics is becoming significantly popular owing to the increasing usage of images as a medium of information propagation. However, owing to the presence of various image editing tools and software, there are also increasing threats to image content security. A review of the existing approaches to identifying traces or artifacts shows that there is large scope for optimization to further enhance the processing. Therefore, this paper presents a novel framework that performs cost-effective optimization of digital forensic techniques with the aim of accurately localizing the tampered area as well as offering the capability to mitigate attacks of various forms. The study outcome shows that the proposed system offers a better outcome in contrast to existing systems to a significant degree, proving that a minor novelty in design attributes can induce better improvement with respect to accuracy as well as resilience toward all potential image threats.
... Li et al. [8] proposed the so-called MVP-Net, which is a multi-view feature pyramid network (FPN) [26] with position-aware attention to incorporate multi-view information for ULD. Han et al. [27] leveraged cascaded multi-task learning to jointly optimize object detection and representation learning. ...
Preprint
Universal Lesion Detection (ULD) in computed tomography plays an essential role in computer-aided diagnosis systems. Many detection approaches achieve excellent results for ULD using possible bounding boxes (or anchors) as proposals. However, empirical evidence shows that using anchor-based proposals leads to a high false-positive (FP) rate. In this paper, we propose a box-to-map method to represent a bounding box with three soft continuous maps with bounds in the x-, y- and xy-directions. The bounding maps (BMs) are used in two-stage anchor-based ULD frameworks to reduce the FP rate. In the 1st stage of the region proposal network, we replace the sharp binary ground-truth label of anchors with the corresponding xy-direction BM, hence the positive anchors are now graded. In the 2nd stage, we add a branch that takes our continuous BMs in the x- and y-directions for extra supervision of detailed locations. Our method, when embedded into three state-of-the-art two-stage anchor-based detection methods, brings a free detection accuracy improvement (e.g., a 1.68% to 3.85% boost of sensitivity at 4 FPs) without extra inference time.
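As an illustration of the box-to-map idea (one soft continuous map per direction), here is a sketch with an assumed linear decay from the box centre to its borders; the paper's exact map profile may differ.

```python
import numpy as np

def box_to_maps(x1, y1, x2, y2, h, w):
    # soft maps peaking at the box centre and decaying linearly to the borders
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    mx = np.clip(1 - np.abs(np.arange(w) - cx) / max(cx - x1, 1e-6), 0, 1)
    my = np.clip(1 - np.abs(np.arange(h) - cy) / max(cy - y1, 1e-6), 0, 1)
    bm_x = np.tile(mx, (h, 1))              # x-direction map, shape (h, w)
    bm_y = np.tile(my[:, None], (1, w))     # y-direction map, shape (h, w)
    return bm_x, bm_y, bm_x * bm_y          # plus the joint xy-direction map
```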
... After the cross-verified learning of MD-Net, the encoders E_PAD from DR-Net can be expected to have the ability to generate more general features that are preferred for cross-domain PAD. All the above optimizations in MD-Net are performed in a commonly used multi-task learning manner [17]. Finally, for all the face images from domains A and B, we use the learned PAD encoders E_PAD^A and E_PAD^B to extract two PAD features per face image, which are then concatenated into one feature and used for learning a live vs. spoof classifier F_MD. ...
Conference Paper
Face presentation attack detection (PAD) has been an urgent problem to be solved in the face recognition systems. Conventional approaches usually assume the testing and training are within the same domain; as a result, they may not generalize well into unseen scenarios because the representations learned for PAD may overfit to the subjects in the training set. In light of this, we propose an efficient disentangled representation learning for cross-domain face PAD. Our approach consists of disentangled representation learning (DR-Net) and multi-domain learning (MD-Net). DR-Net learns a pair of encoders via generative models that can disentangle PAD informative features from subject discriminative features. The disentangled features from different domains are fed to MD-Net which learns domain-independent features for the final cross-domain face PAD task. Extensive experiments on several public datasets validate the effectiveness of the proposed approach for cross-domain PAD.
... In terms of the baselines, since there are no known prior methods reported to jointly handle low-resolution and occlusion, we use two straightforward baselines: successively performing face completion followed by face super-resolution (i.e., GFC + SRResNet), or vice versa (i.e., SRResNet + GFC). In other words, we expect to evaluate the advantages of joint face completion and super-resolution via multi-task learning [51], [52], [53], [54] over applying face completion and face super-resolution models successively. The PSNR and MSSIM of the individual methods are shown in Fig. 7. ...
Preprint
Full-text available
Combined variations containing low-resolution and occlusion are often present in face images in the wild, e.g., under the scenario of video surveillance. While most of the existing face image recovery approaches can handle only one type of variation per model, in this work, we propose a deep generative adversarial network (FCSR-GAN) for performing joint face completion and face super-resolution via multi-task learning. The generator of FCSR-GAN aims to recover a high-resolution face image without occlusion given an input low-resolution face image with occlusion. The discriminator of FCSR-GAN uses a set of carefully designed losses (an adversarial loss, a perceptual loss, a pixel loss, a smooth loss, a style loss, and a face prior loss) to assure the high quality of the recovered high-resolution face images without occlusion. The whole network of FCSR-GAN can be trained end-to-end using our two-stage training strategy. Experimental results on the public-domain CelebA and Helen databases show that the proposed approach outperforms the state-of-the-art methods in jointly performing face super-resolution (up to 8×) and face completion, and shows good generalization ability in cross-database testing. Our FCSR-GAN is also useful for improving face identification performance when there are low-resolution and occlusion in face images.
... We would like to investigate an adaptive relationship learning approach between the pedestrian and its body parts to enhance the flexibility of utilizing contextual information in joint pedestrian and body part detection. In addition, we would also like to investigate the effectiveness of the proposed approach for the joint pedestrian detection and retrieval task [37]. ...
Article
Full-text available
While remarkable progress has been made in pedestrian detection in recent years, robust pedestrian detection in the wild, e.g., under surveillance scenarios with occlusions, remains a challenging problem. In this paper, we present a novel approach for joint pedestrian and body part detection via semantic relationship learning under unconstrained scenarios. Specifically, we propose a Body Part Indexed Feature (BPIF) representation to encode the semantic relationship between individual body parts (i.e., head, head-shoulder, upper body, and whole body) and highlight per body part features, providing robustness against partial occlusions of the whole body. We also propose an Adaptive Joint Non-Maximum Suppression (AJ-NMS) to replace the original NMS algorithm widely used in object detection, leading to higher precision and recall for detecting overlapped pedestrians. Experimental results on the public-domain CUHK-SYSU Person Search Dataset show that the proposed approach outperforms the state-of-the-art methods for joint pedestrian and body part detection in the wild.
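For context on what AJ-NMS replaces, the standard greedy NMS baseline looks like this (a textbook sketch, not the paper's adaptive joint variant):

```python
import numpy as np

def greedy_nms(boxes, scores, iou_thr=0.5):
    # boxes: (N, 4) as [x1, y1, x2, y2]; keep highest-scoring, drop overlaps
    order = scores.argsort()[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + areas - inter + 1e-12)
        order = rest[iou <= iou_thr]   # suppress boxes overlapping the winner
    return keep
```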
Chapter
There is clinical evidence that suppressing the bone structures in X-rays (e.g., Chest X-rays (CXRs), pelvic X-rays (PXRs)) improves diagnostic value, either for radiologists or computer-aided diagnosis. However, bone-free CXRs are not always accessible. In this chapter, we explore the integration of 3D CT prior knowledge into X-ray imaging using generative adversarial networks (GANs) to address challenges posed by 2D projection superposition and improve diagnostic accuracy. First, we introduce the Decomposition GAN (DecGAN) designed for the anatomical decomposition of CXR images, leveraging unpaired CT data. DecGAN utilizes decomposition loss, adversarial loss, cycle consistency loss, and mask loss to ensure realistic anatomical separation of components such as bone, lung, and soft tissue. We can remove the bone components and get the bone-suppressed CXRs. Next, we propose a coarse-to-fine High-Resolution CXRs Suppression (HRCS) approach to suppress bone structures in high-resolution CXRs. By leveraging digitally reconstructed radiographs (DRRs) and domain adaptation techniques, this method mitigates domain differences between CXRs and CT-derived images. Experiments on benchmark datasets show that this method outperforms existing unsupervised bone suppression techniques and significantly reduces false-negative rates in lung disease diagnoses. Finally, we address the superposition problem in PXRs by introducing the Pelvis Extraction (PELE) module. This module, comprising a decomposition network, a domain adaptation network, and an enhancement module, utilizes 3D anatomical knowledge from CT scans to isolate the pelvis from PXR images, enhancing landmark detection. Evaluations of public and private datasets demonstrate that the PELE module significantly improves landmark detection accuracy, achieving state-of-the-art performance across several metrics. These approaches are based on similar principles but evaluated across different scenarios. The results demonstrate the potential to improve X-ray diagnosis at no extra cost by leveraging generative models enriched with CT knowledge.
Article
Disaster victim identification is structured according to international recommendations in an attempt to optimize forensic logistics. The International Criminal Police Organization (INTERPOL) establishes primary and secondary methods for human identification. This study aimed to revisit the existing literature to address the forensic importance of tattoos. The scientific literature has shown advances in the forensic analysis of tattoos, especially when it comes to the application of special imaging techniques, namely photography with infrared light to visualize latent tattoo inks and cover-up tattoos, as well as the use of biochemical processing to distinguish components of the tattoo inks. Another relevant aspect is the fields dedicated to tattoo descriptions in software used worldwide for disaster victim identification, namely PlassData. Coding systems have been proposed as well to facilitate communication in the human identification process. The future of forensic analysis of tattoos is promising considering the increase in research in recent years. Forensic practice might benefit from it with more scientific evidence to support the utilization of tattoo analysis in casework.
Article
Methods for identifying persons of interest (POI) based on mobile device data are considered. The problem is relevant and unresolved in the activities of law enforcement intelligence and other agencies involved in operational search activities due to the large amount of data stored on mobile devices. Given the complexity and volume of mobile data, traditional analysis methods are often insufficiently effective. The authors propose the use of artificial intelligence (AI), including machine learning and natural language processing, to improve the efficiency and speed of mobile device data analysis. This approach aims to overcome the limitations of manual data analysis and enhance the process of identifying POIs while adhering to the principles of forensic integrity. The research specifically demonstrates how machine learning can be utilized to search for persons of interest in WhatsApp messenger data. A method has been developed for decentralized control of adaptive data collection processes using the principle of equilibrium and reinforcement learning with the normalized exponential function method. The developed method allows for efficient operation of autonomous distributed systems under dynamic changes in the number of data collection processes and limited information interaction between them. The results of the experiment indicate that using artificial intelligence for facial recognition may result in false positive outcomes, implying that humans cannot be entirely replaced at the current stage of AI evolution. However, the application of deep learning showed an 88% success rate in facial recognition. These findings underscore the transformative potential of artificial intelligence in mobile forensics, highlighting its capacity to enhance the accuracy and efficiency of data analysis on mobile devices. Key words: artificial intelligence, mobile forensics, data analysis, iOS, WhatsApp.
Article
Image retrieval with fine-grained categories is an extremely challenging task due to the high intraclass variance and low interclass variance. Most previous works have focused on localizing discriminative image regions in isolation, but have rarely exploited correlations across the different discriminative regions to alleviate intraclass differences. In addition, the intraclass compactness of embedding features is ensured by extra regularization terms that only exist during the training phase, which appear to generalize less well in the inference phase. Finally, the information granularity of the distance measure should distinguish subtle visual differences and the correlation between the embedding features and the quantized features should be maximized sufficiently. To address the above issues, we propose a logit variated product quantization method based on part interaction and metric learning with knowledge distillation for fine-grained image retrieval. Specifically, we introduce a causal context module into the deep navigator to generate discriminative regions and utilize a channelwise cross-part fusion transformer to model the part correlations while alleviating intraclass differences. Subsequently, we design a logit variation module based on a weighted sum scheme to further reduce the intraclass variance of the embedding features directly and enhance the learning power of the quantization model. Finally, we propose a novel product quantization loss based on metric learning and knowledge distillation to enhance the correlation between the embedding features and the quantized features and allow the quantization features to learn more knowledge from the embedding features. The experimental results on several fine-grained datasets demonstrate that the proposed method is superior to state-of-the-art fine-grained image retrieval methods.
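This and the related fine-grained hashing papers build on product quantization; a minimal generic PQ encoder (not the paper's logit-variated scheme) is sketched below, assuming pre-trained codebooks.

```python
import numpy as np

def pq_encode(x, codebooks):
    # x: (D,) embedding split into M equal sub-vectors
    # codebooks: (M, K, D // M), K centroids per sub-space, K <= 256
    M, K, d = codebooks.shape
    codes = np.empty(M, dtype=np.uint8)
    for m in range(M):
        sub = x[m * d:(m + 1) * d]
        # index of the nearest centroid in this sub-space
        codes[m] = np.argmin(((codebooks[m] - sub) ** 2).sum(axis=1))
    return codes  # an M-byte compact code per image
```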
Article
Text spotting in natural scenes is of increasing interest and significance due to its critical role in several applications, such as visual question answering, named entity recognition and event rumor detection on social media. One of the newly emerging challenging problems is Tattoo Text Spotting (TTS) in images for assisting forensic teams and for person identification. Unlike the generally simpler scene text addressed by current state-of-the-art methods, tattoo text is typically characterized by the presence of decorative backgrounds, calligraphic handwriting and several distortions due to the deformable nature of the skin. This paper describes the first approach to address TTS in a real-world application context by designing an end-to-end text spotting method employing a Hilbert transform-based Generative Adversarial Network (GAN). To reduce the complexity of the TTS task, the proposed approach first detects fine details in the image using the Hilbert transform and the Optimum Phase Congruency (OPC). To overcome the challenges of only having a relatively small number of training samples, a GAN is then used for generating suitable text samples and descriptors for text spotting (i.e. both detection and recognition). The superior performance of the proposed TTS approach, for both tattoo and general scene text, over the state-of-the-art methods is demonstrated on a new TTS-specific dataset (publicly available) as well as on the existing benchmark natural scene text datasets: Total-Text, CTW1500 and ICDAR 2015.
Article
The task considered is improving the speed and adequacy of automated search for images that are closest to a given image in terms of the objects they contain and the locality of those objects, in systems managing large image repositories. An image comparison method is proposed that extracts information about image objects by applying, at the first stage of a cascade process, the cellular neural network paradigm, stores this information in the service fields of the image files, and allows different image comparison metrics to be used depending on the characteristics of the task being solved. The basic indicator determines a similarity coefficient based on the types and numbers of objects in the compared images. The frame indicator determines a similarity coefficient that takes into account the placement of individual objects and their intersections in the images. The matrix indicator, oriented toward processing images with a large number of objects, determines similarity coefficients that account for the intersection area occupied by objects of a certain type. The effectiveness of the comparison indicators depends on the image type and specific conditions, which makes their systematic application within a single method advisable.
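As an illustration of the basic indicator, the sketch below scores similarity from object types and counts alone using a multiset Jaccard coefficient; this is one plausible reading of the described coefficient, not the paper's exact formula, and the object lists are hypothetical.

```python
from collections import Counter

def base_similarity(objs_a, objs_b):
    """Similarity coefficient from object types and counts (multiset Jaccard)."""
    ca, cb = Counter(objs_a), Counter(objs_b)
    inter = sum((ca & cb).values())  # shared objects, per type
    union = sum((ca | cb).values())  # all objects, per type
    return inter / union if union else 1.0

# e.g. two detected-object lists produced by the cascade stage
print(base_similarity(["car", "car", "tree"], ["car", "tree", "dog"]))  # 0.5
```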
Article
In the field of computer vision, fine-grained image retrieval is an extremely challenging task due to the inherently subtle intra-class object variations. In addition, the high-dimensional real-valued features extracted from large-scale fine-grained image datasets slow the retrieval speed and increase the storage cost. To solve the above issues, existing fine-grained image retrieval methods mainly focus on finding more discriminative local regions for generating discriminative and compact hash codes, which achieve limited fine-grained image retrieval performance due to the large quantization errors and the confounding granularities and context of discriminative parts, i.e., the correct recognition of fine-grained objects is mainly attributable to the discriminative parts and their context. To learn robust causal features and reduce the quantization errors, we propose a deep progressive asymmetric quantization (DPAQ) method based on causal intervention to learn compact and robust descriptions for the fine-grained image retrieval task. Specifically, we introduce a structural causal model to learn robust causal features via causal intervention for fine-grained visual recognition. Subsequently, we design a progressive asymmetric quantization layer in the feature embedding space, which can preserve the semantic information and sufficiently reduce the quantization errors. Finally, we incorporate both the fine-grained image classification and retrieval tasks into an end-to-end deep learning architecture for generating robust and compact descriptions. Experimental results on several fine-grained image retrieval datasets demonstrate that the proposed DPAQ method performs best for the fine-grained image retrieval task and surpasses the state-of-the-art fine-grained hashing methods by a large margin.
Article
Full-text available
Person search aims to simultaneously localize and identify a query person from uncropped images. To achieve this goal, state-of-the-art models typically add a re-id branch upon two-stage detectors like Faster R-CNN. Owing to the ROI-Align operation, this pipeline yields promising accuracy as re-id features are explicitly aligned with the corresponding object regions, but in the meantime, it introduces high computational overhead due to dense object anchors. In this work, we present an anchor-free approach to efficiently tackling this challenging task, by introducing the following dedicated designs. First, we select an anchor-free detector (i.e., FCOS) as the prototype of our framework. Due to the lack of dense object anchors, it exhibits significantly higher efficiency compared with existing person search models. Second, when directly accommodating this anchor-free detector for person search, there exist several misalignment issues in different levels (i.e., scale, region, and task). To address these issues, we propose an aligned feature aggregation module to generate more discriminative and robust feature embeddings. Accordingly, we name our framework as Feature-Aligned Person Search Network (AlignPS). Third, by investigating the advantages of both anchor-based and anchor-free models, we further augment AlignPS with an ROI-Align head, which significantly improves the robustness of re-id features while still keeping our model highly efficient. Our framework not only achieves state-of-the-art or competitive performance on two challenging person search benchmarks, but can be also extended to other challenging searching tasks such as animal and object search. All the source codes, data, and trained models are available at: https://github.com/daodaofr/alignps.
Article
Existing representation learning methods based on Euclidean distance are shown to be unstable under a broad set of conditions. Furthermore, the scarcity and high cost of labels prompt us to explore more expressive representation learning methods that depend on as few labels as possible. To address the above issues, the small-perturbation ideology is first introduced into representation learning models based on the representation probability distribution. Positive small-perturbation information (SPI), which depends on only two labels per cluster, is used to stimulate the representation probability distribution, and two variant models are proposed to fine-tune the expected representation distribution of the Restricted Boltzmann Machine (RBM), namely the Micro-supervised Disturbance Gaussian-binary RBM (Micro-DGRBM) and the Micro-supervised Disturbance RBM (Micro-DRBM). The Kullback-Leibler (KL) divergence of SPI is minimized within the same cluster to make the representation probability distributions more similar during Contrastive Divergence (CD) learning. In contrast, the KL divergence of SPI is maximized across different clusters to make the representation probability distributions more dissimilar in CD learning. To explore the representation learning capability under continuous stimulation by the SPI, we present a deep Micro-supervised Disturbance Learning (Micro-DL) framework based on the Micro-DGRBM and Micro-DRBM models and compare it with a similar deep structure that has no external stimulation. Experimental results demonstrate that the proposed deep Micro-DL architecture outperforms the baseline method, the most closely related shallow models, and deep frameworks for clustering.
Article
Tattoo text detection provides a vital clue for person and crime identification. Due to the freestyle and unconstrained nature of handwritten tattoo text over skin regions, accurate tattoo text detection is very challenging. This paper proposes a comprehensive scheme for tattoo text detection which comprises (a) an adaptive Deformable Convolutional Neural Network (DCNN) for skin region detection to reduce text detection complexity, (b) a Decoupled Gradient Text Detector (DGTD) for tattoo text detection from the skin region, (c) a Deep Q-Network (DQN) to refine the bounding boxes detected by the DGTD, and (d) a Term Frequency-Inverse Document Frequency (TF-IDF) model to group the words into text lines based on semantic information to fix the bounding box for each line. To test its effectiveness, the proposed method is evaluated on different datasets, namely, (i) a newly developed tattoo text dataset, (ii) a benchmark marathon bib number dataset, and (iii) a person re-identification dataset. The proposed method achieves 91.2, 87.5, and 88.8 F-scores on these three respective datasets. To demonstrate its superior performance, the text detection module (without skin detection) is also compared with state-of-the-art scene text detection methods on benchmark datasets, namely, ICDAR 2019 ArT, Total-Text, and DAST1500, and the proposed method achieves 90.3, 88.5 and 89.8 F-scores on these respective datasets.
Article
Occlusions are often present in face images in the wild, e.g., under video surveillance and forensic scenarios. Existing face de-occlusion methods are limited as they require knowledge of an occlusion mask. To overcome this limitation, we propose in this paper a new generative adversarial network (named OA-GAN) for natural face de-occlusion without an occlusion mask, enabled by learning in a semi-supervised fashion using (i) paired images with known masks of artificial occlusions and (ii) natural images without occlusion masks. The generator of our approach first predicts an occlusion mask, which is used for filtering the feature maps of the input image as a semantic cue for de-occlusion. The filtered feature maps are then used for face completion to recover a non-occluded face image. The initial occlusion mask prediction might not be accurate enough, but it gradually converges to the accurate one because of the adversarial loss we use to perceive which regions in a face image need to be recovered. The discriminator of our approach consists of an adversarial loss, distinguishing the recovered face images from natural face images, and an attribute preserving loss, ensuring that the face image after de-occlusion retains the attributes of the input face image. Experimental evaluations on the widely used CelebA dataset and a dataset with natural occlusions that we collected show that the proposed approach outperforms state-of-the-art methods in natural face de-occlusion.
Article
The low storage and strong representation capabilities of hash codes for image retrieval have made hashing technologies very popular. Several existing deep hashing methods focus on the task of general image retrieval, while neglecting the task of fine-grained image retrieval. Recently, some fine-grained hashing methods have been proposed to capture subtle differences; they mainly utilize single-modality visual features for discriminative region localization while ignoring the semantic information. In this paper, we propose a correlation filtering hashing (CFH) method to learn discrete binary codes, which can adequately take advantage of the cross-modal correlation between the semantic information and the visual features for discriminative region localization. Specifically, we utilize a feature pyramid network to learn multi-level visual features. Subsequently, the label vector is embedded into the visual space, where it can be used as a correlation filter on the feature maps to capture the latent location of objects. Finally, we perform global average pooling over the output maps and concatenate the features of different levels to produce the hash codes of query images. Extensive experiments on two fine-grained datasets show that the proposed CFH outperforms state-of-the-art hashing methods.
Chapter
Universal Lesion Detection (ULD) in computed tomography plays an essential role in computer-aided diagnosis systems. Many detection approaches achieve excellent results for ULD using possible bounding boxes (or anchors) as proposals. However, empirical evidence shows that using anchor-based proposals leads to a high false-positive (FP) rate. In this paper, we propose a box-to-map method that represents a bounding box with three soft continuous maps with bounds in the x-, y- and xy-directions. The bounding maps (BMs) are used in two-stage anchor-based ULD frameworks to reduce the FP rate. In the first stage of the region proposal network, we replace the sharp binary ground-truth label of anchors with the corresponding xy-direction BM, so that the positive anchors are graded. In the second stage, we add a branch that takes our continuous BMs in the x- and y-directions for extra supervision of detailed locations. Our method, when embedded into three state-of-the-art two-stage anchor-based detection methods, brings a free detection accuracy improvement (e.g., a 1.68% to 3.85% boost of sensitivity at 4 FPs) without extra inference time.
Article
On account of the remarkable performance of convolutional neural network (CNN) features for natural image search, utilizing them for images collected with anamorphic lenses has become a research hotspot. This article selects aurora images generated from a circular fisheye lens as a typical example. By considering the imaging principle and geomagnetic information, a saliency-weighted region network (SWRN) is presented and introduced into the Mask R-CNN pipeline. Our SWRN selects salient regions with important semantic information and weights them both hierarchically and spatially. Hence, regions encompassing the search target are strengthened while uninformative regions are discarded, which benefits the suppression of background interference and the reduction of computational complexity. In practice, by aggregating the outputs of the SWRN with post-processing, a compact CNN feature is generated to represent the aurora image. Large-scale aurora image search experiments are conducted, and the results prove that our method performs better than state-of-the-art methods in both accuracy and efficiency.
Article
The traditional object (person) retrieval (re-identification) task aims to learn a discriminative feature representation with intra-similarity and inter-dissimilarity, which supposes that the objects in an image are manually or automatically pre-cropped exactly. However, in many real-world searching scenarios (e.g., video surveillance), the objects (e.g., persons, vehicles, etc.) are seldom accurately detected or annotated. Therefore, object-level retrieval becomes intractable without bounding-box annotation, which leads to a new but challenging topic, i.e., image-level search with multi-task integration of joint detection and retrieval. In this paper, to address the image search issue, we first introduce an end-to-end Integrated Net (I-Net), which has three merits: 1) A Siamese architecture and an on-line pairing strategy for similar and dissimilar objects in the given images are designed. Benefiting from the Siamese structure, I-Net learns a shared feature representation, on which both the object detection and classification tasks are handled. 2) A novel on-line pairing (OLP) loss is introduced with a dynamic feature dictionary, which alleviates the multi-task training stagnation problem by automatically generating a number of negative pairs to restrict the positives. 3) A hard example priority (HEP) based softmax loss is proposed to improve the robustness of the classification task by selecting hard categories. The shared feature representation of I-Net may restrict the task-specific flexibility and learning capability between the detection and retrieval tasks. Therefore, with the philosophy of divide and conquer, we further propose an improved I-Net, called DC-I-Net, which makes two new contributions: 1) Two modules are tailored to handle different tasks separately in the integrated framework, such that task specification is guaranteed. 2) A class-center guided HEP loss (C2HEP) that exploits the stored class centers is proposed, such that the intra-similarity and inter-dissimilarity can be captured for ultimate retrieval. Extensive experiments on famous image-level search oriented benchmark datasets, such as the CUHK-SYSU and PRW datasets for person search and the large-scale WebTattoo dataset for tattoo search, demonstrate that the proposed DC-I-Net outperforms the state-of-the-art tasks-integrated and tasks-separated image search models.
Article
There is clinical evidence that suppressing the bone structures in chest X-rays (CXRs) improves diagnostic value, either for radiologists or for computer-aided diagnosis. However, bone-free CXRs are not always accessible. We hereby propose a coarse-to-fine CXR bone suppression approach that uses structural priors derived from unpaired computed tomography (CT) images. In the low-resolution stage, we use the digitally reconstructed radiograph (DRR) image computed from CT as a bridge to connect CT and CXR. We then perform CXR bone decomposition by leveraging the DRR bone decomposition model learned from unpaired CTs and domain adaptation between CXR and DRR. To further mitigate the domain differences between CXRs and DRRs and speed up learning convergence, we perform all the above operations in the Laplacian of Gaussian (LoG) domain. After obtaining the bone decomposition result in the DRR, we upsample it to high resolution, based on which the bone region in the original high-resolution CXR is cropped and processed to produce a high-resolution bone decomposition result. Finally, the produced bone image is subtracted from the original high-resolution CXR to obtain the bone suppression result. We conduct experiments and clinical evaluations based on two benchmark CXR databases to show that (i) the proposed method outperforms the state-of-the-art unsupervised CXR bone suppression approaches; (ii) the CXRs with bone suppression are instrumental to radiologists in reducing their false-negative rate for lung diseases from 15% to 8%; and (iii) state-of-the-art disease classification performance is achieved by learning a deep network that takes the original CXR and its bone-suppressed image as inputs.
Article
Combined variations containing low-resolution and occlusion often present in face images in the wild, e.g., under the scenario of video surveillance. While most of the existing face image recovery approaches can handle only one type of variation per model, in this work, we propose a deep generative adversarial network (FCSR-GAN) for performing joint face completion and face super-resolution via multi-task learning. The generator of FCSR-GAN aims to recover a high-resolution face image without occlusion given an input low-resolution face image with occlusion. The discriminator of FCSR-GAN uses a set of carefully designed losses (an adversarial loss, a perceptual loss, a pixel loss, a smooth loss, a style loss, and a face prior loss) to assure the high quality of the recovered high-resolution face images without occlusion. The whole network of FCSR-GAN can be trained end-to-end using our two-stage training strategy. Experimental results on the public-domain CelebA and Helen databases show that the proposed approach outperforms the state-of-the-art methods in jointly performing face super-resolution (up to 8×) and face completion, and shows good generalization ability in cross-database testing. Our FCSR-GAN is also useful for improving face identification performance when there are low-resolution and occlusion in face images. The code of FCSR-GAN is available at: https://github.com/swordcheng/FCSR-GAN.
Article
Full-text available
Plenty of face detection and recognition methods have been proposed and have achieved excellent results over the past decades. A common face recognition pipeline consists of: 1) face detection, 2) face alignment, 3) feature extraction, and 4) similarity calculation, which are separate and independent from each other. The separated face analysis stages lead to redundant computation and make end-to-end training difficult. In this paper, we propose a novel end-to-end trainable convolutional network framework for face detection and recognition, in which a geometric transformation matrix is directly learned to align the faces, instead of predicting the facial landmarks. In the training stage, our single CNN model is supervised only by face bounding boxes and personal identities, which are publicly available from the WIDER FACE [52] dataset and the CASIA-WebFace [53] dataset. Tested on the Face Detection Dataset and Benchmark (FDDB) [21] dataset and the Labeled Faces in the Wild (LFW) [19] dataset, we achieve 89.24% recall for the face detection task and 98.63% verification accuracy for the face recognition task simultaneously, which is comparable to state-of-the-art results.
Article
Full-text available
Texture is a fundamental characteristic of many types of images, and texture representation is one of the essential and challenging problems in computer vision and pattern recognition which has attracted extensive research attention over several decades. Since 2000, texture representations based on Bag of Words and on Convolutional Neural Networks have been extensively studied with impressive performance. Given this period of remarkable evolution, this paper aims to present a comprehensive survey of advances in texture representation over the last two decades. More than 250 major publications are cited in this survey covering different aspects of the research, including benchmark datasets and state of the art results. In retrospect of what has been achieved so far, the survey discusses open challenges and directions for future research.
Article
Full-text available
The nearest neighbor problem is the following: given a set of n points P in some metric space X, preprocess P so as to efficiently answer queries that require finding the point in P closest to a query point q in X. We focus on the particularly interesting case of the d-dimensional Euclidean space, where X = R^d under some l_p norm.
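A brute-force baseline makes the problem statement concrete; the preprocessing that makes queries efficient, which is the actual subject of the work, is omitted here. The point count, dimension, and choice of l_2 are illustrative assumptions.

```python
import numpy as np

def nearest_neighbor(P, q, p=2):
    """Exhaustive search: the point of P closest to q under the l_p norm."""
    dists = np.linalg.norm(P - q, ord=p, axis=1)  # distance to every point
    i = int(np.argmin(dists))
    return i, float(dists[i])

rng = np.random.default_rng(1)
P = rng.normal(size=(1000, 16))  # n = 1000 points in R^16
idx, dist = nearest_neighbor(P, np.zeros(16))
```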
Article
Full-text available
Learning-based hashing algorithms are “hot topics” because they can greatly increase the scale at which existing methods operate. In this paper, we propose a new learning-based hashing method called “fast supervised discrete hashing” (FSDH), based on “supervised discrete hashing” (SDH). Regressing the training examples (or hash codes) to the corresponding class labels is widely used in ordinary least squares regression. Rather than adopting this method, FSDH uses a very simple yet effective regression of the class labels of training examples to the corresponding hash codes to accelerate the algorithm. To the best of our knowledge, this strategy has not previously been used for hashing. Traditional SDH decomposes the optimization into three sub-problems, with the most critical sub-problem, discrete optimization for binary hash codes, solved using iterative discrete cyclic coordinate descent (DCC), which is time-consuming. However, FSDH has a closed-form solution and requires only a single rather than iterative hash-code-solving step, which is highly efficient. Furthermore, FSDH is usually faster than SDH in solving the projection matrix for least squares regression, making FSDH generally faster than SDH. For example, our results show that FSDH is about 12 times faster than SDH when the number of hashing bits is 128 on the CIFAR-10 database, and FSDH is about 151 times faster than FastHash when the number of hashing bits is 64 on the MNIST database. Our experimental results show that FSDH is not only fast, but also outperforms other comparative methods.
Conference Paper
Full-text available
Image representations derived from pre-trained Convolutional Neural Networks (CNNs) have become the new state of the art in computer vision tasks such as instance retrieval. This work explores the suitability for instance retrieval of image- and region-wise representations pooled from an object detection CNN such as Faster R-CNN. We take advantage of the object proposals learned by a Region Proposal Network (RPN) and their associated CNN features to build an instance search pipeline composed of a first filtering stage followed by a spatial reranking. We further investigate the suitability of Faster R-CNN features when the network is fine-tuned for the same objects one wants to retrieve. We assess the performance of our proposed system with the Oxford Buildings 5k, Paris Buildings 6k and a subset of TRECVid Instance Search 2013, achieving competitive results.
Article
Full-text available
Face detection is one of the most studied topics in the computer vision community. Much of the progress has been made possible by the availability of face detection benchmark datasets. We show that there is a gap between current face detection performance and real-world requirements. To facilitate future face detection research, we introduce the WIDER FACE dataset, which is 10 times larger than existing datasets. The dataset contains rich annotations, including occlusions, poses, event categories, and face bounding boxes. Faces in the proposed dataset are extremely challenging due to large variations in scale, pose and occlusion. Furthermore, we show that the WIDER FACE dataset is an effective training source for face detection. We benchmark several representative detection systems, providing an overview of state-of-the-art performance, and propose a solution to deal with large scale variation. Finally, we discuss common failure cases that are worth further investigation. The dataset can be downloaded at: mmlab.ie.cuhk.edu.hk/projects/WIDERFace
Conference Paper
Full-text available
When implementing real-world computer vision systems, researchers can use mid-level representations as a tool to adjust the trade-off between accuracy and efficiency. Unfortunately, existing mid-level representations that improve accuracy tend to decrease efficiency, or are specifically tailored to work well within one pipeline or vision problem at the exclusion of others. We introduce a novel, efficient mid-level representation that improves classification efficiency without sacrificing accuracy. Our Exemplar Codes are based on linear classifiers and probability normalization from extreme value theory. We apply Exemplar Codes to two problems: facial attribute extraction and tattoo classification. In these settings, our Exemplar Codes are competitive with the state of the art and offer efficiency benefits, making it possible to achieve high accuracy even on commodity hardware with a low computational budget.
Article
Full-text available
Pushed by big data and deep convolutional neural networks (CNNs), the performance of face recognition is becoming comparable to that of humans. Using private large-scale training datasets, several groups have achieved very high performance on LFW, i.e., 97% to 99%. While there are many open-source implementations of CNNs, no large-scale face dataset is publicly available. The current situation in the field of face recognition is that data is more important than algorithms. To solve this problem, this paper proposes a semi-automatic way to collect face images from the Internet and builds a large-scale dataset containing about 10,000 subjects and 500,000 images, called CASIA-WebFace. Based on this database, we use an 11-layer CNN to learn a discriminative representation and obtain state-of-the-art accuracy on LFW and YTF. The publication of CASIA-WebFace will attract more research groups to this field and accelerate the development of face recognition in the wild.
Article
Full-text available
The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to the present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the five years of the challenge, and propose future directions and improvements.
Conference Paper
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
Chapter
Soft biometrics are physiological and behavioral characteristics that provide some identifying information about an individual. Eye color, gender, ethnicity, skin color, height, weight, hair color, scars, birthmarks, and tattoos are examples of soft biometrics. Several techniques have been proposed in the literature to identify or verify an individual based on soft biometrics. In particular, person identification and retrieval systems based on tattoos have gained a lot of interest in recent years. Tattoos, to some extent, indicate one's personal beliefs and characteristics. Hence, the analysis of tattoos can lead to a better understanding of one's background and membership in gangs and hate groups. They have been used to assist law enforcement in investigations leading to the identification of criminals. In this chapter, we provide an overview of recent advances in tattoo recognition and detection based on deep learning. In particular, we present deep convolutional neural network-based methods for automatic matching of tattoo images based on the AlexNet and Siamese networks. Furthermore, we show that rather than using a simple contrastive loss function, a triplet loss function can significantly improve the performance of a tattoo matching system. Various experimental results on the recently introduced Tatt-C dataset are presented.
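The triplet loss mentioned above has a standard form: for an anchor, a positive sample of the same tattoo, and a negative sample of a different one, it penalizes the anchor-positive distance exceeding the anchor-negative distance minus a margin. A minimal sketch follows, with the margin value and squared-Euclidean distance as generic assumptions rather than the chapter's exact configuration.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge form: d(anchor, positive) should undercut d(anchor, negative)
    by at least the margin; zero loss once the triplet is well separated."""
    d_ap = np.sum((anchor - positive) ** 2)  # anchor-positive distance
    d_an = np.sum((anchor - negative) ** 2)  # anchor-negative distance
    return max(0.0, d_ap - d_an + margin)
```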
Article
With the rapid development of information storage and networking technologies, quintillion bytes of data are generated every day from social networks, business transactions, sensors, and many other domains. The increasing data volumes impose significant challenges to traditional data analysis tools in storing, processing, and analyzing these extremely large-scale data. For decades, hashing has been one of the most effective tools commonly used to compress data for fast access and analysis, as well as information integrity verification. Hashing techniques have also evolved from simple randomization approaches to advanced adaptive methods considering locality, structure, label information, and data security, for effective hashing. This survey reviews and categorizes existing hashing techniques as a taxonomy, in order to provide a comprehensive view of mainstream hashing techniques for different types of data and applications. The taxonomy also studies the uniqueness of each method and therefore can serve as technique references in understanding the niche of different hashing mechanisms for future development.
Article
In this paper, we propose a new deep hashing (DH) approach to learn compact binary codes for scalable image search. Unlike most existing binary code learning methods, which usually seek a single linear projection to map each sample into a binary feature vector, we develop a deep neural network to seek multiple hierarchical non-linear transformations to learn these binary codes, so that the nonlinear relationship of samples can be well exploited. Our model is learned under three constraints at the top layer of the developed deep network: 1) the loss between the compact real-valued code and the learned binary vector is minimized, 2) the binary codes distribute evenly on each bit, and 3) different bits are as independent as possible. To further improve the discriminative power of the learned binary codes, we extend DH into supervised DH (SDH) and multi-label supervised DH (MSDH) by including a discriminative term into the objective function of DH which simultaneously maximizes the inter-class variations and minimizes the intra-class variations of the learned binary codes under the single-label and multi-label settings, respectively. Extensive experimental results on eight widely used image search datasets show that our proposed methods achieve very competitive results with the state of the art.
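One common way to express the three top-layer constraints as differentiable penalties is sketched below; the exact weighting and formulation in the paper may differ, and H here is a hypothetical batch of real-valued codes.

```python
import numpy as np

def dh_penalties(H):
    """H: (batch, bits) real-valued codes from the top layer of the network."""
    B = np.sign(H)                                  # target binary vectors
    quant = np.mean((H - B) ** 2)                   # 1) quantization loss
    balance = np.mean(H.mean(axis=0) ** 2)          # 2) each bit near half on/off
    C = (H.T @ H) / len(H)                          # bit correlation matrix
    indep = np.sum((C - np.diag(np.diag(C))) ** 2)  # 3) decorrelate the bits
    return quant, balance, indep
```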
Article
In content-based image retrieval, SIFT feature and the feature from deep convolutional neural network (CNN) have demonstrated promising performance. To fully explore both visual features in a unified framework for effective and efficient retrieval, we propose a collaborative index embedding method to implicitly integrate the index matrices of them. We formulate the index embedding as an optimization problem from the perspective of neighborhood sharing and solve it with an alternating index update scheme. After the iterative embedding, only the embedded CNN index is kept for on-line query, which demonstrates significant gain in retrieval accuracy, with very economical memory cost. Extensive experiments have been conducted on the public datasets with million-scale distractor images. The experimental results reveal that, compared with the recent state-of-the-art retrieval algorithms, our approach achieves competitive accuracy performance with less memory overhead and efficient query computation.
Article
This paper presents a simple yet effective supervised deep hashing approach that constructs binary hash codes from labeled data for large-scale image search. We assume that the semantic labels are governed by several latent attributes, with each attribute on or off, and that classification relies on these attributes. Based on this assumption, our approach, dubbed supervised semantics-preserving deep hashing (SSDH), constructs hash functions as a latent layer in a deep network, and the binary codes are learned by minimizing an objective function defined over classification error and other desirable hash code properties. With this design, SSDH has the nice characteristic that classification and retrieval are unified in a single learning model. Moreover, SSDH performs joint learning of image representations, hash codes, and classification in a point-wise manner, and thus is scalable to large-scale datasets. SSDH is simple and can be realized by a slight enhancement of an existing deep architecture for classification; yet it is effective and outperforms other hashing approaches on several benchmarks and large datasets. Compared with state-of-the-art approaches, SSDH achieves higher retrieval accuracy, while the classification performance is not sacrificed.
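The latent-layer construction can be sketched as follows: a sigmoid-activated hash layer sits between the backbone features and the classifier, so classification error shapes the codes. The feature dimension, code length, and class count below are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

class SSDHHead(nn.Module):
    """Latent hash layer between backbone features and a classifier."""
    def __init__(self, feat_dim=2048, bits=48, num_classes=100):
        super().__init__()
        self.hash = nn.Sequential(nn.Linear(feat_dim, bits), nn.Sigmoid())
        self.cls = nn.Linear(bits, num_classes)  # classification drives the codes

    def forward(self, feats):
        h = self.hash(feats)   # relaxed codes in (0, 1)
        return self.cls(h), h  # threshold h at 0.5 for retrieval-time binary codes
```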
Article
The Bag-of-Words (BoW) model has been predominantly viewed as the state of the art in Content-Based Image Retrieval (CBIR) systems since 2003. The past 13 years have seen its advance based on the SIFT descriptor, due to its advantages in dealing with image transformations. In recent years, image representation based on the Convolutional Neural Network (CNN) has attracted more attention in image retrieval and demonstrates impressive performance. Given this time of rapid evolution, this article provides a comprehensive survey of image retrieval methods over the past decade. In particular, according to the feature extraction and quantization schemes, we classify current methods into three types, i.e., SIFT-based, one-pass CNN-based, and multi-pass CNN-based. This survey reviews milestones in BoW image retrieval, compares previous works that fall into different BoW steps, and shows that SIFT and CNN share common characteristics that can be incorporated in the BoW model. After presenting and analyzing the retrieval accuracy on several benchmark datasets, we highlight promising directions in image retrieval that demonstrate how the CNN-based BoW model can learn from the SIFT feature.
Conference Paper
In this paper, we demonstrate that the essentials of image classification and retrieval are the same, since both tasks could be tackled by measuring the similarity between images. To this end, we propose ONE (Online Nearest-neighbor Estimation), a unified algorithm for both image classification and retrieval. ONE is surprisingly simple, which only involves manual object definition, regional description and nearest-neighbor search. We take advantage of PCA and PQ approximation and GPU parallelization to scale our algorithm up to large-scale image search. Experimental results verify that ONE achieves state-of-the-art accuracy in a wide range of image classification and retrieval benchmarks.
Article
This paper presents a novel compact coding approach, composite quantization, for approximate nearest neighbor search. The idea is to use the composition of several elements selected from the dictionaries to accurately approximate a vector and to represent the vector by a short code composed of the indices of the selected elements. To efficiently compute the approximate distance of a query to a database vector using the short code, we introduce an extra constraint, constant inter-dictionary-element-product, so that approximating the distance using only the distances of the query to each selected element is sufficient for nearest neighbor search. Experimental comparison with state-of-the-art algorithms over several benchmark datasets demonstrates the efficacy of the proposed approach.
Conference Paper
We introduce a new compression scheme for high-dimensional vectors that approximates the vectors using sums of M codewords coming from M different codebooks. We show that the proposed scheme permits efficient distance and scalar product computations between compressed and uncompressed vectors. We further suggest vector encoding and codebook learning algorithms that can minimize the coding error within the proposed scheme. In the experiments, we demonstrate that the proposed compression can be used instead of or together with product quantization. Compared to product quantization and its optimized versions, the proposed compression approach leads to lower coding approximation errors, higher accuracy of approximate nearest neighbor search in the datasets of visual descriptors, and lower image classification error, whenever the classifiers are learned on or applied to compressed vectors.
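A small sketch of the decode-and-compare step under this scheme: a vector is approximated by a sum of M codewords, one from each of M codebooks, and the query-to-database distance can be computed from scalar products with the selected codewords plus a norm term that is precomputable at encoding time. Codebook learning and the encoding step, the harder parts of the method, are omitted; all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, d = 4, 256, 64
codebooks = rng.normal(size=(M, K, d))  # M codebooks of K codewords each
code = rng.integers(0, K, size=M)       # short code: one index per codebook

x_hat = codebooks[np.arange(M), code].sum(axis=0)  # decompressed vector

# Distance from an uncompressed query via scalar products with the selected
# codewords; ||x_hat||^2 can be stored alongside the code at encoding time.
q = rng.normal(size=d)
dots = sum(codebooks[m, code[m]] @ q for m in range(M))
dist2 = q @ q - 2.0 * dots + x_hat @ x_hat  # ||q - x_hat||^2
```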
Article
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
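The core residual idea is small enough to show directly: a block learns a residual function and adds it back to its input through an identity shortcut. The sketch below is a generic basic block in PyTorch with illustrative channel settings, not the exact ResNet architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Basic residual block: output = ReLU(F(x) + x)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)  # identity shortcut: learn F(x), add x back
```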
Article
Recent years have witnessed wide application of hashing for large-scale image retrieval. However, most existing hashing methods are based on hand-crafted features which might not be optimally compatible with the hashing procedure. Recently, deep hashing methods have been proposed to perform simultaneous feature learning and hash-code learning with deep neural networks, which have shown better performance than traditional hashing methods with hand-crafted features. Most of these deep hashing methods are supervised, with the supervised information given as triplet labels. For another common application scenario with pairwise labels, no methods existed for simultaneous feature learning and hash-code learning. In this paper, we propose a novel deep hashing method, called deep pairwise-supervised hashing (DPSH), to perform simultaneous feature learning and hash-code learning for applications with pairwise labels. Experiments on real datasets show that our DPSH method can outperform other methods to achieve state-of-the-art performance in image retrieval applications.
Article
The explosive growth in big data has attracted much attention in designing efficient indexing and search methods recently. In many critical applications such as large-scale search and pattern matching, finding the nearest neighbors to a query is a fundamental research problem. However, the straightforward solution using exhaustive comparison is infeasible due to the prohibitive computational complexity and memory requirement. In response, Approximate Nearest Neighbor (ANN) search based on hashing techniques has become popular due to its promising performance in both efficiency and accuracy. Prior randomized hashing methods, e.g., Locality-Sensitive Hashing (LSH), explore data-independent hash functions with random projections or permutations. Although having elegant theoretic guarantees on the search quality in certain metric spaces, performance of randomized hashing has been shown insufficient in many real-world applications. As a remedy, new approaches incorporating data-driven learning methods in development of advanced hash functions have emerged. Such learning to hash methods exploit information such as data distributions or class labels when optimizing the hash codes or functions. Importantly, the learned hash codes are able to preserve the proximity of neighboring data in the original feature spaces in the hash code spaces. The goal of this paper is to provide readers with systematic understanding of insights, pros and cons of the emerging techniques. We provide a comprehensive survey of the learning to hash framework and representative techniques of various types, including unsupervised, semi-supervised, and supervised. In addition, we also summarize recent hashing approaches utilizing the deep learning models. Finally, we discuss the future direction and trends of research in this area.
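As a concrete instance of the data-independent end of this taxonomy, the sketch below implements sign-of-random-projection LSH: nearby vectors tend to receive codes with small Hamming distance. The hyperplane count and dimension are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 128))  # 64 random hyperplanes for vectors in R^128

def lsh_code(x):
    """Data-independent hashing: signs of random projections -> 64-bit code."""
    return (W @ x > 0).astype(np.uint8)

def hamming(a, b):
    return int(np.count_nonzero(a != b))

x = rng.normal(size=128)
x_noisy = x + 0.1 * rng.normal(size=128)        # a near neighbor of x
far = rng.normal(size=128)                      # an unrelated vector
print(hamming(lsh_code(x), lsh_code(x_noisy)))  # usually small
print(hamming(lsh_code(x), lsh_code(far)))      # usually near 32 (half the bits)
```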
Article
Logo search is widely required in many real-world applications. As a special case of near-duplicate images, logo pictures have some particular properties, for instance, suffering from flipping operations, e.g., geometry-inverted and brightness-inverted operations. Such operations completely change the spatial structure of local descriptors, such as SIFT, so that image search algorithms based on Bag-of-Visual-Words (BoVW) often fail to retrieve the flipped logos. We propose a novel descriptor named Max-SIFT, which finds the maximal SIFT value sequence for detecting flipping operations. Compared with previous algorithms, our algorithm is extremely easy to implement yet very efficient to carry out. We evaluate the improved descriptor on a large-scale Web logo search dataset, and demonstrate that our method enjoys good performance and low computational costs.
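The flip-invariance trick can be sketched as follows: given a descriptor and its mirrored counterpart, keep the lexicographically larger value sequence so that an image and its flipped version map to the same representation. Note this sketch takes the mirrored descriptor as given, whereas the published Max-SIFT derives it from a fixed index permutation of the SIFT layout.

```python
import numpy as np

def max_sift(desc, mirrored_desc):
    """Return the lexicographically larger of a descriptor and its mirrored
    counterpart, so both orientations share a single representation."""
    a = np.asarray(desc)
    b = np.asarray(mirrored_desc)
    for x, y in zip(a, b):
        if x != y:
            return a if x > y else b
    return a  # identical sequences: either one works
```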
Article
Most existing hashing methods adopt some projection functions to project the original data into several dimensions of real values, and then each of these projected dimensions is quantized into one bit (zero or one) by thresholding. Typically, the variances of different projected dimensions are different for existing projection functions such as principal component analysis (PCA). Using the same number of bits for different projected dimensions is unreasonable because larger-variance dimensions will carry more information. Although this viewpoint has been widely accepted by many researchers, it is still not verified by either theory or experiment because no methods have been proposed to find a projection with equal variances for different dimensions. In this paper, we propose a novel method, called isotropic hashing (IsoHash), to learn projection functions which can produce projected dimensions with isotropic variances (equal variances). Experimental results on real data sets show that IsoHash can outperform its counterpart with different variances for different dimensions, which verifies the viewpoint that projections with isotropic variances will be better than those with anisotropic variances.
Article
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet and Fast R-CNN have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully-convolutional network that simultaneously predicts object bounds and objectness scores at each position. RPNs are trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. With a simple alternating optimization, RPN and Fast R-CNN can be trained to share convolutional features. For the very deep VGG-16 model, our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP) using 300 proposals per image. The code will be released.
Article
This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection. Fast R-CNN builds on previous work to efficiently classify object proposals using deep convolutional networks. Compared to previous work, Fast R-CNN employs several innovations to improve training and testing speed while also increasing detection accuracy. Fast R-CNN trains the very deep VGG16 network 9x faster than R-CNN, is 213x faster at test-time, and achieves a higher mAP on PASCAL VOC 2012. Compared to SPPnet, Fast R-CNN trains VGG16 3x faster, tests 10x faster, and is more accurate. Fast R-CNN is implemented in Python and C++ (using Caffe) and is available under the open-source MIT License at https://github.com/rbgirshick/fast-rcnn.
Article
Recently, learning-based hashing techniques have attracted broad research interest due to the resulting efficient storage and retrieval of images, videos, documents, etc. However, a major difficulty of learning to hash lies in handling the discrete constraints imposed on the needed hash codes, which typically makes hash optimization very challenging (NP-hard in general). In this work, we propose a new supervised hashing framework, where the learning objective for hashing is to make the optimal binary hash codes for classification. By introducing an auxiliary variable, we reformulate the objective such that it can be solved substantially more efficiently by using a regularization algorithm. One of the key steps in the algorithm is to solve the regularization sub-problem associated with the NP-hard binary optimization. We show that with cyclic coordinate descent, the sub-problem admits an analytical solution. As such, a high-quality discrete solution can eventually be obtained in an efficient computing manner, which makes it possible to tackle massive datasets. We evaluate the proposed approach, dubbed Supervised Discrete Hashing (SDH), on four large image datasets, and demonstrate that SDH outperforms the state-of-the-art hashing methods in large-scale image retrieval.
Article
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively.
Article
Recent years have witnessed the growing popularity of hashing in large-scale vision problems. It has been shown that the hashing quality can be boosted by leveraging supervised information into hash function learning. However, the existing supervised methods either lack adequate performance or often incur cumbersome model training. In this paper, we propose a novel kernel-based supervised hashing model which requires a limited amount of supervised information, i.e., similar and dissimilar data pairs, and a feasible training cost in achieving high quality hashing. The idea is to map the data to compact binary codes whose Hamming distances are minimized on similar pairs and simultaneously maximized on dissimilar pairs. Our approach is distinct from prior works in utilizing the equivalence between optimizing the code inner products and the Hamming distances. This enables us to sequentially and efficiently train the hash functions one bit at a time, yielding very short yet discriminative codes. We carry out extensive experiments on two image benchmarks with up to one million samples, demonstrating that our approach significantly outperforms the state of the art in searching both metric distance neighbors and semantically similar neighbors, with accuracy gains ranging from 13% to 46%.
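The inner-product/Hamming equivalence the method exploits is easy to verify numerically: for L-bit codes in {-1, +1}^L, the Hamming distance equals (L - <b1, b2>) / 2. A small check, with an arbitrary code length:

```python
import numpy as np

rng = np.random.default_rng(0)
b1 = rng.choice([-1, 1], size=64)  # codes in {-1, +1}^L, L = 64
b2 = rng.choice([-1, 1], size=64)

# Each agreeing bit adds +1 to <b1, b2>, each disagreeing bit adds -1, so
# <b1, b2> = L - 2 * hamming; optimizing inner products is therefore
# equivalent to optimizing Hamming distances.
hamming = np.count_nonzero(b1 != b2)
assert hamming == (len(b1) - b1 @ b2) / 2
```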
Conference Paper
We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in understanding an object's precise 2D location. Our dataset contains photos of 91 object types that would be easily recognizable by a 4-year-old, along with per-instance segmentation masks. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.
Conference Paper
The Bag-of-Words (BoW) model based on SIFT has been widely used in large-scale image retrieval applications. Feature quantization plays a crucial role in the BoW model; it generates visual words from the high-dimensional SIFT features so as to adapt to the inverted file structure for indexing. Traditional feature quantization approaches suffer from several problems: 1) high computational cost: visual word generation (codebook construction) is time-consuming, especially with a large number of features; 2) limited reliability: different collections of images may produce totally different codebooks, and the quantization error is hard to control; 3) update inefficiency: once the codebook is constructed, it is not easy to update. In this paper, a novel feature quantization algorithm, scalar quantization, is proposed. With scalar quantization, a SIFT feature is quantized to a descriptive and discriminative bit-vector, of which the first tens of bits are taken as the code word. Our quantizer is independent of the image collection. In addition, the result of scalar quantization naturally lends itself to the classic inverted file structure for image indexing. Moreover, the quantization error can be flexibly reduced and controlled by efficiently enumerating nearest neighbors of code words. The performance of scalar quantization has been evaluated in partial-duplicate Web image search on a database of one million images. Experiments reveal that the proposed scalar quantization achieves a relative 42% improvement in mean average precision over the baseline (the hierarchical visual vocabulary tree approach), and also outperforms the state-of-the-art Hamming Embedding approach and the soft assignment method.
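A toy version of the quantization step conveys the flavor: binarize each SIFT dimension by thresholding and use the first t bits as the code word for the inverted file. The median threshold here is a stand-in assumption; the paper's actual binarization rule and bit ordering differ.

```python
import numpy as np

def scalar_quantize(sift, t=32):
    """Binarize a 128-D SIFT vector; the first t bits act as the code word."""
    bits = (sift > np.median(sift)).astype(np.uint8)  # stand-in threshold rule
    return bits[:t], bits[t:]  # (code word for the inverted file, remainder)

sift = np.random.default_rng(0).integers(0, 256, size=128)
code_word, remainder = scalar_quantize(sift)
```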