SN Computer Science (2021) 2:100
https://doi.org/10.1007/s42979-021-00493-z
ORIGINAL RESEARCH
Depthwise Separable Convolutional Neural Networks for Pedestrian Attribute Recognition
Imran N. Junejo1 · Naveed Ahmed2
Received: 11 March 2020 / Accepted: 29 January 2021 / Published online: 14 February 2021
© The Author(s), under exclusive licence to Springer Nature Singapore Pte Ltd., part of Springer Nature 2021
Abstract
Video surveillance is ubiquitous. In addition to understanding various scene objects, extracting human visual attributes from the scene has gained tremendous traction in recent years. This is a challenging problem even for human observers. It is a multi-label problem, i.e., a subject in a scene can have multiple attributes that we hope to recognize, such as shoe type, clothing type, whether the subject is wearing an accessory, or whether the subject is carrying an object, etc. Solutions have been presented over the years, and many researchers have employed convolutional neural networks (CNNs). In this work, we propose using a Depthwise Separable Convolutional Neural Network (DS-CNN) to solve the pedestrian attribute recognition problem. The network employs depthwise separable convolution layers (DSCL) instead of regular 2D convolution layers. DS-CNN performs extremely well, especially on smaller datasets. In addition, with a compact network, DS-CNN reduces the number of trainable parameters while making learning efficient. We evaluated our method on two benchmark pedestrian datasets, and the results show improvements over the state of the art.
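To illustrate the parameter reduction the abstract refers to, the sketch below compares the weight count of a standard 2D convolution against a depthwise separable one. The kernel size and channel counts are illustrative assumptions, not the paper's actual architecture, and bias terms are omitted:

```python
def conv2d_params(k, c_in, c_out):
    # Standard 2D convolution: one k x k filter per (input, output) channel pair.
    return k * k * c_in * c_out

def ds_conv_params(k, c_in, c_out):
    # Depthwise separable convolution:
    #   depthwise stage: one k x k filter per input channel,
    #   pointwise stage: a 1 x 1 convolution that mixes channels.
    return k * k * c_in + c_in * c_out

# Illustrative layer: 3x3 kernel, 64 input channels, 128 output channels.
std = conv2d_params(3, 64, 128)   # 73728 weights
ds = ds_conv_params(3, 64, 128)   # 8768 weights
print(std, ds, round(std / ds, 1))  # roughly 8x fewer parameters
```

The same factorization underlies the DSCL layers described above: separating spatial filtering from channel mixing is what shrinks the trainable parameter count.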
Keywords Pedestrian attribute recognition · Computer vision · Deep learning · Depthwise separable convolution
Introduction
One of the active areas of research in computer vision is pedestrian attribute recognition, which deals with identifying a number of visual attributes from image data. The identified attributes can belong to
different classes, e.g., clothing style, footwear, gender, age
group, etc. A successful outcome of this research can be
applied to various domains. It can be employed for motion
analysis [1], where it can be used to identify crowd behavior
attributes. Another important area of application is image-
based surveillance or visual features extractions for person
identification [2, 3]. Other applications include video analyt-
ics for business intelligence, or searching a criminal database
for suspects using the identified visual attributes. Various
factors make this a challenging problem. One of the main difficulties is varying lighting conditions: attributes of the same type of clothing
can appear completely different under different lighting
conditions. For example, distinguishing between black and
dark blue colors is very difficult in certain weather condi-
tions. Both colors will appear very similar to the camera in a
darker environment. Occlusion also complicates correct visual attribute identification and recognition. Occlusions can be either complete or partial, and can result from the camera orientation or from object self-occlusion. For example, if a person wears a hat, it might appear only partially in the image, or its shape might look completely different. Similarly,
the orientation of a person or a camera can hide a back-
pack partially or completely from the view. These examples
clearly show that the settings of the acquisition environment for image or video capture result in high intra-class variation for the same visual attributes.
The focus of this work is the identification of visual
attributes from image and video data. The distance of an
object from the camera affects how that object appears in
the image. If the object is very far from the camera, or if the
image resolution is very low, a visual attribute, e.g., a dress, hat, backpack, scarf, or shoes, will only occupy a few pixels
in the image. The combination of low image resolution, in
addition to the self-occlusions or view-oriented occlusions,
makes visual attribute identification a very challenging
* Imran N. Junejo
imran.junejo@zu.ac.ae
1 Zayed University, Dubai, United Arab Emirates
2 University of Sharjah, Sharjah, United Arab Emirates