Article · Publisher preview available

Depthwise Separable Convolutional Neural Networks for Pedestrian Attribute Recognition


SN Computer Science (2021) 2:100
https://doi.org/10.1007/s42979-021-00493-z
ORIGINAL RESEARCH
Depthwise Separable Convolutional Neural Networks forPedestrian
Attribute Recognition
ImranN.Junejo1· NaveedAhmed2
Received: 11 March 2020 / Accepted: 29 January 2021 / Published online: 14 February 2021
© The Author(s), under exclusive licence to Springer Nature Singapore Pte Ltd., part of Springer Nature 2021
Abstract
Video surveillance is ubiquitous. In addition to understanding various scene objects, extracting human visual attributes from the scene has attracted tremendous attention over the past several years. This is a challenging problem even for human observers, and it is a multi-label problem: a subject in a scene can have multiple attributes that we hope to recognize, such as shoe type, clothing type, whether an accessory is worn, or whether an object is carried. Solutions have been presented over the years, and many researchers have employed convolutional neural networks (CNNs). In this work, we propose using a Depthwise Separable Convolutional Neural Network (DS-CNN) to solve the pedestrian attribute recognition problem. The network employs depthwise separable convolution layers (DSCL) instead of regular 2D convolution layers. DS-CNN performs extremely well, especially with smaller datasets. In addition, with a compact network, DS-CNN reduces the number of trainable parameters while making learning efficient. We evaluated our method on two benchmark pedestrian datasets, and the results show improvements over the state of the art.
Keywords Pedestrian attribute recognition · Computer vision · Deep learning · Depthwise separable convolution
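The abstract's claim that swapping regular 2D convolution layers for depthwise separable ones shrinks the trainable-parameter count can be checked with simple arithmetic. The sketch below is illustrative only; the kernel size and channel counts are our own example values, not the paper's actual architecture:

```python
def conv2d_params(k, c_in, c_out):
    # Standard 2D convolution: one k x k x c_in kernel per output channel.
    return k * k * c_in * c_out

def dsc_params(k, c_in, c_out):
    # Depthwise stage: one k x k spatial filter per input channel,
    # plus pointwise stage: a 1 x 1 convolution mapping c_in -> c_out channels.
    return k * k * c_in + c_in * c_out

# Illustrative layer: 3x3 kernels, 32 input channels, 64 output channels.
print(conv2d_params(3, 32, 64))  # 18432
print(dsc_params(3, 32, 64))     # 2336
```

For this example layer the separable form needs roughly an eighth of the parameters; the overall savings for a full network depend on its layer configuration.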
Introduction
One of the active areas of research in computer vision is pedestrian attribute recognition, which deals with identifying a number of visual attributes from image data. The identified attributes can belong to different classes, e.g., clothing style, footwear, gender, age group, etc. A successful outcome of this research can be applied to various domains. It can be employed for motion analysis [1], where it can be used to identify crowd behavior attributes. Another important area of application is image-based surveillance, or visual feature extraction for person identification [2, 3]. Other applications include video analytics for business intelligence, or searching a criminal database for suspects using the identified visual attributes.

Various factors make this a challenging problem. One of the main factors is varying lighting conditions: attributes of the same type of clothing can appear completely different under different lighting. For example, distinguishing between black and dark blue is very difficult in certain weather conditions, and both colors will appear very similar to the camera in a darker environment. Occlusion also complicates correct visual attribute identification and recognition. Occlusions can be either complete or partial and can result from the camera orientation or from object self-occlusions. For example, if a person wears a hat, it might appear only partially in the image, or its shape might look completely different. Similarly, the orientation of a person or a camera can hide a backpack partially or completely from view. These examples clearly show that the settings of an acquisition environment for image or video capture result in high intra-class variation for the same visual attributes.

The focus of this work is the identification of visual attributes from image and video data. The distance of an object from the camera affects how that object appears in the image. If the object is very far from the camera, or if the image resolution is very low, a visual attribute, e.g., a dress, hat, backpack, scarf, or shoes, will only occupy a few pixels in the image. The combination of low image resolution with self-occlusions or view-oriented occlusions makes visual attribute identification a very challenging problem.
* Imran N. Junejo
imran.junejo@zu.ac.ae
1 Zayed University, Dubai, United Arab Emirates
2 University of Sharjah, Sharjah, United Arab Emirates
... DSC is composed of two stages (Figure 1b): a depthwise convolution (intra-channel), in which each input channel is independently convolved with a filter to extract that channel's features, followed by a pointwise convolution (intra-pixel) that combines the outputs of the depthwise stage with a 1 × 1 convolution, creating a new feature map. This representation significantly reduces the number of parameters and the computational cost while preserving spectral fidelity [57,58]. ...
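The two stages described in this snippet can be sketched directly in NumPy. This is a minimal illustration with stride 1 and no padding; the function name and tensor shapes are our own choices, not from the cited paper:

```python
import numpy as np

def depthwise_separable_conv(x, dw_filters, pw_filters):
    """Two-stage depthwise separable convolution (stride 1, no padding).

    x          : (H, W, C_in) input feature map
    dw_filters : (k, k, C_in) -- one spatial filter per input channel
    pw_filters : (C_in, C_out) -- 1x1 pointwise filters that mix channels
    """
    H, W, C_in = x.shape
    k = dw_filters.shape[0]
    Ho, Wo = H - k + 1, W - k + 1

    # Stage 1: depthwise (intra-channel) -- convolve each channel independently.
    dw_out = np.zeros((Ho, Wo, C_in))
    for c in range(C_in):
        for i in range(Ho):
            for j in range(Wo):
                dw_out[i, j, c] = np.sum(x[i:i + k, j:j + k, c] * dw_filters[:, :, c])

    # Stage 2: pointwise -- a 1x1 convolution combining the depthwise outputs
    # across channels at each spatial position.
    return dw_out @ pw_filters  # shape (Ho, Wo, C_out)
```

For example, an all-ones 4×4×2 input with all-ones 3×3 depthwise filters and a 2×3 pointwise matrix yields a 2×2×3 output where every entry is 18 (each depthwise window sums 9 ones, and the pointwise stage adds the 2 channels).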
Preprint
The Sentinel-5P (S5P) satellite provides atmospheric measurements for air quality and climate monitoring. While the S5P satellite offers rich spectral resolution, it inherits physical limitations that restrict its spatial resolution. Super-resolution (SR) techniques can overcome these limitations and enhance the spatial resolution of S5P data. In this work, we introduce a novel SR model specifically designed for S5P data, which comprise eight spectral bands with around 500 channels each. Our proposed S5-DSCR model relies on the Depth Separable Convolution (DSC) architecture to effectively perform spatial SR by exploiting cross-channel correlations. Quantitative evaluation demonstrates that our model outperforms existing methods for the majority of the spectral bands. This work highlights the potential of leveraging the DSC architecture to address the challenges of hyperspectral SR. Our model allows for capturing fine details necessary for precise analysis and paves the way for advancements in air quality monitoring as well as remote sensing applications.
... However, handcrafted features were used in studies [35] and [34], which could not represent images from surveillance cameras effectively. In research [36], the Depthwise Separable Convolution method achieved a recall value of 72.07 and an F1 score of 66.60. In the study [37], multi-visual feature recognition with multi-label focal loss was carried out, producing 84.83% ...
Article
Full-text available
Various methods are employed in computer vision applications to identify individuals, including face recognition as a human visual feature helpful in tracking or searching for a person. However, tracking systems that rely solely on facial information encounter limitations, particularly when faced with occlusions, blurred images, or faces oriented away from the camera. Under these conditions, the system struggles to achieve accurate tracking-based face recognition. Therefore, this research addresses the issue by fusing facial visual descriptions with body visual features. When the system cannot find the target face, the hybrid CNN+LSTM method assists in multi-feature body visual recognition, narrowing the search space and speeding up the search process. The results indicate that the combined CNN+LSTM method yields higher accuracy, recall, precision, and F1 scores (reaching 89.20%, 87.36%, 91.02%, and 88.43%, respectively) compared to the single CNN method (reaching 88.84%, 74.00%, 67.00%, and 69.00%, respectively). However, the combination of these two visual features requires heavy computation, so a tracking system is added to reduce the computational load and predict the target's location. Furthermore, this research utilizes the Q-Learning algorithm to make optimal decisions in automatically tracking objects in dynamic environments. The system considers factors such as face and body visual features, object location, and environmental conditions to make the best decisions, aiming to enhance tracking efficiency and accuracy. Based on the conducted experiments, it is concluded that the system can adjust its actions in response to environmental changes with better outcomes: it achieves an accuracy rate of 91.5% and an average of 50 fps on five different videos, as well as 84% accuracy and an average error of 11.15 pixels on a video benchmark dataset. Utilizing the proposed method speeds up the search process and optimizes tracking decisions, saving time and computational resources.
... It has been used in applications such as image classification [20], object detection [21], et cetera [22]. Depthwise separable convolution is a type of convolutional operation that reduces the number of parameters and computations in the neural network. ...
Preprint
Medical image segmentation is vital to the area of medical imaging because it enables professionals to examine and understand the information offered by different imaging modalities more accurately. Medical image segmentation is the technique of splitting a medical image into various segments or regions of interest; the segmented images can be used for many purposes, including diagnosis, surgery planning, and therapy evaluation. In the initial phase of this research, the major focus has been on reviewing existing deep-learning approaches, including MultiResUNet, Attention U-Net, the classical U-Net, and other variants. Most of these variants use attention feature vectors or maps to dynamically assign higher weights to critical information and thereby increase accuracy, but their network parameter requirements are correspondingly more stringent. They face certain problems such as overfitting, as their number of trainable parameters is very high, and so is their inference time. Therefore, the aim of this research is to reduce the network parameter requirements using depthwise separable convolutions, while maintaining performance on medical image segmentation tasks such as skin lesion segmentation using an attention system and residual connections.
... The efficiency of the presented model was evaluated on the RAP dataset, and the obtained accuracy was 91.1%. Imran N. Junejo and Naveed Ahmed [29] proposed a depthwise separable CNN for solving the PA recognition problem. This 24-layer network reduced the number of trainable parameters by almost 50% while making learning efficient. ...
Article
Full-text available
In recent years, pedestrian attribute and activity recognition has attracted growing research emphasis due to its considerable research importance and application value in the intelligent civil and military domains. Owing to the inadequate image or frame quality of inexpensive cameras, the absence of obvious and stable feature information, varying directions of pedestrian movement, and so on, the complexity of pedestrian attribute and activity recognition is increased. However, with the widespread adoption of deep learning techniques, pedestrian attribute and activity recognition has made substantial advances. This paper is the first of its kind because it reviews works on both pedestrian attribute recognition and activity recognition using deep learning, in relation to their datasets. The fundamental concepts, corresponding challenges, and popular solutions are also explained. Furthermore, evaluation metrics and concise performance comparisons in this community are given. In the end, the hotspots of present research and directions for future research are summarized.
... The overall number of learnable parameters in our system is 6,844, as opposed to 14,362 for the very same system using traditional convolutions. We chose this particular DS-CNN because of its demonstrated versatility, training efficacy, small parameter count, and impressive performance on smaller samples (25). ...
Article
Full-text available
Deep neural networks have made tremendous strides in the categorization of facial photos in the last several years. Due to the complexity of features, the enormous size of the picture/frame, and the severe inhomogeneity of image data, efficient face image classification using deep convolutional neural networks remains a challenge. Therefore, as data volumes continue to grow, the effective categorization of face photos in a mobile context utilizing advanced deep learning techniques is becoming increasingly important. In the recent past, some Deep Learning (DL) approaches for learning to identify face images have been designed; many of them use convolutional neural networks (CNNs). To address the problem of face mask recognition in facial images, we propose to use a Depthwise Separable Convolution Neural Network based on MobileNet (DWS-based MobileNet). The proposed network utilizes depth-wise separable convolution layers instead of 2D convolution layers. With limited datasets, the DWS-based MobileNet performs exceptionally well. DWS-based MobileNet decreases the number of trainable parameters while enhancing learning performance by adopting a lightweight network. Our technique outperformed the existing state of the art when tested on benchmark datasets. When compared to Full Convolution MobileNet and baseline methods, the results of this study reveal that adopting Depthwise Separable Convolution-based MobileNet significantly improves performance (Acc. = 93.14, Pre. = 92, recall = 92, F-score = 92).
Article
Pedestrian attribute recognition aims to generate a structured description of pedestrians, which serves an important role in surveillance. Current works usually assume that the images and the specific pedestrian states, including pedestrian occlusion and pedestrian orientation, are given. However, we argue that current works ignore the guidance of the pedestrian state and cannot achieve appropriate performance, since the appearance feature becomes unreliable under variance of the pedestrian state, which is common in practice. Therefore, this paper proposes Explicit State Representation (ExSR) Guided Pedestrian Attribute Recognition to improve accuracy through state learning and attribute fusion among frames. Firstly, the pedestrian state is explicitly represented by concatenating the pedestrian orientation and occlusion, which can be accurately determined by analyzing the pose. Secondly, a state-aware pedestrian attribute fusion method is proposed and divided into two cases, namely the inter-state case and the intra-state case. In the intra-state case, the appearance feature remains stable, and attribute relations within a single frame are propagated and refined using a Graph Neural Network. In the inter-state case, where the state changes, attribute relationship propagation is prevented, and the advantages of attribute recognition in each frame are combined to make a reliable judgment on the invisible region. The experimental results demonstrate that ExSR outperforms the state-of-the-art methods on two public databases, benefiting from the explicit introduction of the state into attribute recognition.
Chapter
In computer vision, pedestrian attribute recognition is one of the hot topics. It often learns convolutional networks to perform feature extraction as well as multi-label classification for pedestrian attributes. Nowadays, researchers often employ attention mechanisms to improve network performance. In this work, we focus on the less-studied aspect of learning the relationships between different pedestrian attributes. To that end, we propose a method based on Feature Combination in Transformer with an attention model (referred to as FCT) to learn the inconsistency relationship between attributes, where the attention mechanism is built on visual transformers. Extensive experiments conducted on three benchmark datasets (i.e., DukeMTMC-reID, PA-100K and PETA) clearly demonstrate the effectiveness of our proposed FCT method.
Article
Pedestrian Attribute Recognition (PAR) is an important task in the computer vision community and plays an important role in practical video surveillance. The goal of this paper is to review existing works using traditional methods or deep learning networks. Firstly, we introduce the background of pedestrian attribute recognition, including the fundamental concepts and formulation of pedestrian attributes and the corresponding challenges. Secondly, we analyze popular solutions for this task from eight perspectives. Thirdly, we discuss specific attribute recognition and then compare deep learning based and traditional PAR methods. After that, we show the connections between PAR and other computer vision tasks. Fourthly, we introduce the benchmark datasets and evaluation metrics in this community, and give a brief performance comparison. Finally, we summarize this paper and give several possible research directions for PAR. The project page of this paper can be found at: https://sites.google.com/view/ahu-pedestrianattributes/.
Article
Pedestrian attribute recognition in surveillance is a challenging task due to poor image quality, significant appearance variations and diverse spatial distribution of different attributes. This paper treats pedestrian attribute recognition as a sequential attribute prediction problem and proposes a novel visual-semantic graph reasoning framework to address this problem. Our framework contains a spatial graph and a directed semantic graph. By performing reasoning using the Graph Convolutional Network (GCN), one graph captures spatial relations between regions and the other learns potential semantic relations between attributes. An end-to-end architecture is presented to perform mutual embedding between these two graphs to guide the relational learning for each other. We verify the proposed framework on three large scale pedestrian attribute datasets including PETA, RAP, and PA100k. Experiments show superiority of the proposed method over state-of-the-art methods and effectiveness of our joint GCN structures for sequential attribute prediction.
Article
Pedestrian attribute recognition aims to predict attribute labels of pedestrians from surveillance images, which is a very challenging task for computer vision due to poor imaging quality and small training datasets. It is observed that many semantic pedestrian attributes to be recognized tend to show spatial locality and semantic correlations by which they can be grouped, while previous works mostly ignore this phenomenon. Inspired by the Recurrent Neural Network (RNN)'s strong capability of learning context correlations and the Attention Model's capability of highlighting regions of interest on a feature map, this paper proposes end-to-end Recurrent Convolutional (RC) and Recurrent Attention (RA) models, which are complementary to each other. The RC model mines the correlations among different attribute groups with a convolutional LSTM unit, while the RA model takes advantage of intra-group spatial locality and inter-group attention correlation to improve the performance of pedestrian attribute recognition. Our RA method combines recurrent learning and the attention model to highlight spatial positions on the feature map and mine attention correlations among different attribute groups to obtain more precise attention. Extensive empirical evidence shows that our recurrent model frameworks achieve state-of-the-art results on pedestrian attribute datasets, i.e., the standard PETA and RAP datasets.
Conference Paper
Pedestrian behavior understanding and identification in surveillance scenarios has attracted a tremendous amount of attention over the past many years. An integral part of this problem involves identifying various human visual attributes in the scene. Over the years, researchers have proposed various solutions and explored various features. However, they have focused on either engineered features or simple RGB images. In this paper, we explore the problem of crowd attribute recognition using the RGB (Red, Green, Blue), HSV (Hue, Saturation, Value) and L*a*b* color models and propose a 3-branch Siamese network to solve the problem. We present a unique approach of using these three color models and fine-tune a pre-trained VGG-19 network for our task. We perform extensive experimentation on the most challenging public PETA dataset, which is by far the largest and most diverse dataset of its kind. We show an improvement over the state of the art.