Article · Publisher preview available

Depthwise Separable Convolutional Neural Networks for Pedestrian Attribute Recognition

Authors: Imran N. Junejo · Naveed Ahmed

Abstract and Figures

Video surveillance is ubiquitous. In addition to understanding various scene objects, extracting human visual attributes from the scene has gained tremendous traction over the past several years. This is a challenging problem even for human observers. It is a multi-label problem, i.e., a subject in a scene can have multiple attributes that we hope to recognize, such as shoe type, clothing type, whether the subject is wearing an accessory, or carrying an object, etc. Solutions have been presented over the years, and many researchers have employed convolutional neural networks (CNNs). In this work, we propose using a Depthwise Separable Convolution Neural Network (DS-CNN) to solve the pedestrian attribute recognition problem. The network employs depthwise separable convolution layers (DSCL) instead of regular 2D convolution layers. DS-CNN performs extremely well, especially with smaller datasets. In addition, with a compact network, DS-CNN reduces the number of trainable parameters while making learning efficient. We evaluated our method on two benchmark pedestrian datasets, and the results show improvements over the state of the art.
SN Computer Science (2021) 2:100
https://doi.org/10.1007/s42979-021-00493-z
ORIGINAL RESEARCH
Depthwise Separable Convolutional Neural Networks for Pedestrian Attribute Recognition
Imran N. Junejo¹ · Naveed Ahmed²
Received: 11 March 2020 / Accepted: 29 January 2021 / Published online: 14 February 2021
© The Author(s), under exclusive licence to Springer Nature Singapore Pte Ltd., part of Springer Nature 2021
Keywords: Pedestrian attribute recognition · Computer vision · Deep learning · Depthwise separable convolution
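The abstract frames attribute recognition as a multi-label problem: one subject can carry several attributes at once, so each attribute is typically given an independent sigmoid output rather than competing in a single softmax. A minimal decoding sketch, using a hypothetical attribute set and illustrative logit values (not the paper's actual label list or model):

```python
import math

# Hypothetical attribute set for illustration; a real benchmark such as
# PETA defines dozens of binary attributes per pedestrian.
ATTRIBUTES = ["hat", "backpack", "long_coat", "sneakers"]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict_attributes(logits, threshold=0.5):
    """Map per-attribute network logits to a list of predicted attributes.

    Each attribute is decided independently, so any number of them
    (including zero) can be predicted for the same subject.
    """
    probs = [sigmoid(z) for z in logits]
    return [a for a, p in zip(ATTRIBUTES, probs) if p >= threshold]

# Illustrative logits for one detected pedestrian.
print(predict_attributes([2.1, -1.3, 0.4, 3.0]))  # ['hat', 'long_coat', 'sneakers']
```

Note that more than one attribute fires for the same subject, which is exactly what distinguishes this setting from single-label classification.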
Introduction
One of the active areas of research in computer vision is pedestrian attribute recognition, which deals with identifying a number of visual attributes from image data. The identified attributes can belong to different classes, e.g., clothing style, footwear, gender, age group, etc. A successful outcome of this research can be applied to various domains. It can be employed for motion analysis [1], where it can be used to identify crowd behavior attributes. Another important area of application is image-based surveillance, or visual feature extraction for person identification [2, 3]. Other applications include video analytics for business intelligence, or searching a criminal database for suspects using the identified visual attributes. Various factors make this a challenging problem. One of the main factors is varying lighting conditions: attributes of the same type of clothing can appear completely different under different lighting. For example, distinguishing between black and dark blue is very difficult in certain weather conditions, and both colors will appear very similar to the camera in a darker environment. Occlusion also complicates correct visual attribute identification and recognition. Occlusions can be either complete or partial, and can result from the camera orientation or from object self-occlusions. For example, if a person wears a hat, it might appear only partially in the image, or its shape might look completely different. Similarly, the orientation of a person or a camera can hide a backpack partially or completely from view. These examples clearly show that the settings of an acquisition environment for image or video capture result in high intra-class variation for the same visual attributes.
The focus of this work is the identification of visual attributes from image and video data. The distance of an object from the camera affects how that object appears in the image. If the object is very far from the camera, or if the image resolution is very low, a visual attribute, e.g., a dress, hat, backpack, scarf, or shoes, will only occupy a few pixels in the image. The combination of low image resolution with self-occlusions or view-oriented occlusions makes visual attribute identification a very challenging problem.
* Imran N. Junejo
imran.junejo@zu.ac.ae
¹ Zayed University, Dubai, United Arab Emirates
² University of Sharjah, Sharjah, United Arab Emirates
... It has been used in applications such as image classification [20], object detection [21], etc. [22] Depthwise separable convolution is a type of convolutional operation that reduces the number of parameters and computations in the neural network. ...
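The operation named in this excerpt factors a standard convolution into a per-channel spatial filter (depthwise step) followed by a 1×1 channel-mixing step (pointwise step). A NumPy sketch with valid padding and stride 1; the array shapes are illustrative assumptions, not the configuration of any cited paper:

```python
import numpy as np

def depthwise_separable_conv(x, dw_kernels, pw_weights):
    """Depthwise separable convolution (valid padding, stride 1).

    x:          (H, W, C_in) input feature map
    dw_kernels: (k, k, C_in) one spatial filter per input channel
    pw_weights: (C_in, C_out) 1x1 pointwise channel-mixing weights
    """
    H, W, C_in = x.shape
    k = dw_kernels.shape[0]
    Ho, Wo = H - k + 1, W - k + 1

    # Depthwise step: each input channel is filtered independently,
    # so no information flows across channels here.
    dw = np.zeros((Ho, Wo, C_in))
    for c in range(C_in):
        for i in range(Ho):
            for j in range(Wo):
                dw[i, j, c] = np.sum(x[i:i+k, j:j+k, c] * dw_kernels[:, :, c])

    # Pointwise step: a 1x1 convolution mixes the channels.
    return dw @ pw_weights  # shape (Ho, Wo, C_out)
```

For a k×k kernel mapping C_in to C_out channels, this factorization needs k·k·C_in + C_in·C_out weights instead of the k·k·C_in·C_out of a standard convolution, which is where the parameter and computation savings come from.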
Preprint
Medical image segmentation is vital to the area of medical imaging because it enables professionals to more accurately examine and understand the information offered by different imaging modalities. The technique of splitting a medical image into various segments or regions of interest is known as medical image segmentation. The segmented images that are produced can be used for many different purposes, including diagnosis, surgery planning, and therapy evaluation. In the initial phase of this research, the major focus has been on reviewing existing deep-learning approaches, including MultiResUNet, Attention U-Net, the classical U-Net, and other variants. Most of these variants use attention feature vectors or maps, which dynamically assign important weights to critical information, to increase accuracy, but their network parameter requirements are somewhat more stringent. They face certain problems such as overfitting, as their number of trainable parameters is very high, and so is their inference time. Therefore, the aim of this research is to reduce the network parameter requirements using depthwise separable convolutions, while maintaining performance on medical image segmentation tasks such as skin lesion segmentation using an attention system and residual connections.
... The efficiency of the presented model was evaluated on the RAP dataset, and the obtained accuracy was 91.1%. Imran N. Junejo and Naveed Ahmed [29] proposed a depthwise separable CNN for solving the PA recognition problem. This 24-layer network reduced the number of trainable parameters by almost 50% while making learning efficient. ...
Article
Full-text available
In recent years, pedestrian attribute and activity recognition has attracted growing research emphasis due to its considerable research importance and application value in the intelligent civil and military domains. Owing to the inadequate image or frame quality of inexpensive cameras and the absence of obvious and stable feature information, direction of pedestrian movement, and so on, the complexity of pedestrian attribute and activity recognition is increased. However, with the comprehensive implementation of deep learning techniques, pedestrian attribute and activity recognition has made substantial advances. This paper is the first of its kind because it combines the recognition of pedestrian attributes and activities, reviewing the works on each using deep learning in relation to datasets. The fundamental concepts, corresponding challenges, and popular solutions are also explained. Furthermore, evaluation metrics and concise performance comparisons for this community are given. Finally, the hotspots of present research and directions for future research are summarized.
... The overall number of learnable parameters in our system is 6,844, as opposed to 14,362 for the very same system using traditional convolutions. We chose this particular DS-CNN because of its demonstrated versatility, training efficacy, small parameter count, and impressive performance on smaller samples (25). ...
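The parameter savings quoted in this excerpt follow directly from the depthwise-separable factorization. A quick back-of-the-envelope comparison; the 3×3, 32→64-channel layer below is a hypothetical example, not the cited system's actual configuration, so it does not reproduce the 6,844 / 14,362 figures:

```python
def conv_params(k, c_in, c_out, bias=True):
    """Trainable parameters in a standard k x k convolution layer."""
    return k * k * c_in * c_out + (c_out if bias else 0)

def ds_conv_params(k, c_in, c_out, bias=True):
    """Trainable parameters in a depthwise (k x k per channel)
    plus pointwise (1 x 1) separable convolution layer."""
    depthwise = k * k * c_in + (c_in if bias else 0)
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise

# Hypothetical layer: 3x3 kernels, 32 input channels, 64 output channels.
std = conv_params(3, 32, 64)     # 18,496
ds = ds_conv_params(3, 32, 64)   # 2,432
print(std, ds, ds / std)         # the separable layer is far smaller
```

The ratio improves further as the channel counts grow, since the dominant k·k·C_in·C_out term of the standard layer is replaced by k·k·C_in + C_in·C_out.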
Article
Full-text available
Deep neural networks have made tremendous strides in the categorization of facial photos in the last several years. Due to the complexity of features, the enormous size of the picture/frame, and the severe inhomogeneity of image data, efficient face image classification using deep convolutional neural networks remains a challenge. Therefore, as data volumes continue to grow, the effective categorization of face photos in a mobile context utilizing advanced deep learning techniques is becoming increasingly important. In the recent past, some Deep Learning (DL) approaches for learning to identify face images have been designed; many of them use convolutional neural networks (CNNs). To address the problem of face mask recognition in facial images, we propose to use a Depthwise Separable Convolution Neural Network based on MobileNet (DWS-based MobileNet). The proposed network utilizes depth-wise separable convolution layers instead of 2D convolution layers. With limited datasets, the DWS-based MobileNet performs exceptionally well. DWS-based MobileNet decreases the number of trainable parameters while enhancing learning performance by adopting a lightweight network. Our technique outperformed the existing state of the art when tested on benchmark datasets. When compared to Full Convolution MobileNet and baseline methods, the results of this study reveal that adopting Depthwise Separable Convolution-based MobileNet significantly improves performance (Acc. = 93.14, Pre. = 92, recall = 92, F-score = 92).
... The co-attentive sharing module introduced in [18] helps to extract discriminative channels and spatial regions for more effective feature sharing for each task. Time complexity was also addressed by utilizing DS-CNN [19] to reduce the number of model parameters of the PAR network. ...
Article
Full-text available
This paper presents an extended model for a pedestrian attribute recognition network utilizing skeleton data as a soft attention model to extract a local feature corresponding to a specific attribute. This technique helped keep valuable information surrounding the target area and handle the variation of human posture. The attention masks were designed to focus on the partial and the whole-body regions. This research utilized an augmented layer for data augmentation inside the network to reduce over-fitting errors. Our network was evaluated in two datasets (RAP and PETA) with various backbone networks (ResNet-50, Inception V3, and Inception-ResNet V2). The experimental result shows that our network improves overall classification performance with a mean accuracy of about 2–3% in the same backbone network, especially local attributes and various human postures.
Chapter
In computer vision, pedestrian attribute recognition is one of the hot topics. It often learns convolutional networks to perform feature extraction as well as multi-label classification for pedestrian attributes. Nowadays, researchers often employ attention mechanisms to improve the network performance. In this work, we focus on the less studied aspect to learn the relationships between different pedestrian attributes. For that, we propose a method based on Feature Combination in Transformer with the attention model (referred to as FCT) to learn the inconsistency relationship between attributes, where the attention mechanism is built based on visual transformers. Extensive experiments conducted on three benchmark datasets (i.e., DukeMTMC-reID, PA-100K and PETA) clearly demonstrate the effectiveness of our proposed FCT method.
Article
Pedestrian attribute recognition in surveillance is a challenging task due to poor image quality, significant appearance variations and diverse spatial distribution of different attributes. This paper treats pedestrian attribute recognition as a sequential attribute prediction problem and proposes a novel visual-semantic graph reasoning framework to address this problem. Our framework contains a spatial graph and a directed semantic graph. By performing reasoning using the Graph Convolutional Network (GCN), one graph captures spatial relations between regions and the other learns potential semantic relations between attributes. An end-to-end architecture is presented to perform mutual embedding between these two graphs to guide the relational learning for each other. We verify the proposed framework on three large scale pedestrian attribute datasets including PETA, RAP, and PA100k. Experiments show superiority of the proposed method over state-of-the-art methods and effectiveness of our joint GCN structures for sequential attribute prediction.
Article
Pedestrian attribute recognition is to predict attribute labels of pedestrian from surveillance images, which is a very challenging task for computer vision due to poor imaging quality and small training dataset. It is observed that many semantic pedestrian attributes to be recognised tend to show spatial locality and semantic correlations by which they can be grouped while previous works mostly ignore this phenomenon. Inspired by Recurrent Neural Network (RNN)’s super capability of learning context correlations and Attention Model’s capability of highlighting the region of interest on feature map, this paper proposes end-to-end Recurrent Convolutional (RC) and Recurrent Attention (RA) models, which are complementary to each other. RC model mines the correlations among different attribute groups with convolutional LSTM unit, while RA model takes advantage of the intra-group spatial locality and inter-group attention correlation to improve the performance of pedestrian attribute recognition. Our RA method combines the Recurrent Learning and Attention Model to highlight the spatial position on feature map and mine the attention correlations among different attribute groups to obtain more precise attention. Extensive empirical evidence shows that our recurrent model frameworks achieve state-of-the-art results, based on pedestrian attribute datasets, i.e. standard PETA and RAP datasets.
Conference Paper
Pedestrian behavior understanding and identification in surveillance scenarios has attracted a tremendous amount of attention over the past several years. An integral part of this problem involves identifying various human visual attributes in the scene. Over the years, researchers have proposed various solutions and explored various features. However, they have focused on either engineered features or simple RGB images. In this paper, we explore the problem of crowd attribute recognition using the RGB (Red, Green, Blue), HSV (Hue, Saturation, Value) and L*a*b* color models and propose a 3-branch Siamese network to solve the problem. We present a unique approach of using these three color models and fine-tune a pre-trained VGG-19 network for our task. We perform extensive experimentation on the most challenging public PETA dataset, which is by far the largest and the most diverse dataset of its kind. We show an improvement over the state of the art.