
Pedestrian detection via a leg-driven physiology framework

Gongbo Liang, Qi Li
Western Kentucky University
Department of Computer Science
Ogden College of Science & Engineering
Bowling Green, KY 42101, USA
Xiangui Kang
Guangdong Key Lab of Information Security
Sun Yat-Sen University
School of Data and Computer Science
Guangzhou 510006, China
ABSTRACT

In this paper, we propose a leg-driven physiology framework for pedestrian detection. The framework is introduced to reduce the search space of candidate regions of pedestrians. Given a set of vertical line segments, we can generate a space of rectangular candidate regions based on a model of body proportions. The proposed framework can be used either with or without a learning-based pedestrian detection method to validate the candidate regions. A symmetry constraint is then applied to validate each candidate region and decrease the false positive rate. The experiment demonstrates the promising results of the proposed method by comparing it with Dalal & Triggs method. For example, rectangular regions detected by the proposed method have areas much closer to the ground truth than regions detected by Dalal & Triggs method.
Index Terms— Pedestrian detection, leg, line segment, bounding box
1. INTRODUCTION

Pedestrian detection has been an active topic in computer vision and pattern recognition [6, 4] due to its wide range of applications, e.g., video surveillance [20] and driving assistance [17]. Most pedestrian detection methods follow a machine learning strategy that usually contains two aspects: i) designing
a distinct representation robust with respect to various ap-
pearance of a pedestrian, and ii) designing/selecting an ef-
fective classifier. In the context of pedestrian detection, ex-
amples of well-known representations include Haar wavelet
coefficients [15], grid of Histogram of Oriented Gradients
(HOG) [3], Local Binary Patterns [23], and edgelet part rep-
resentations [21]; examples of well-known classifiers include
SVM [15, 3], neural network [22, 5], boosting [14, 9], and
Bayesian [21]. Recently, deep neural networks (also called deep learning) have received extensive study for pedestrian detection [16, 13, 24, 18]. An appealing advantage of deep neural
networks is that they can be directly applied to the raw representation of a candidate region without an explicit feature extraction procedure, such as HOG [3].

Fig. 1. A theory of body proportions of a pedestrian (in a standing pose), where the head height is used as the basic unit to measure the length of other body parts [2]. In practice, it is not easy to estimate the head height, and thus we propose to estimate the height and the width of a pedestrian in terms of his/her legs (either lower legs or entire legs) under a standing pose or a walking pose.
In this paper, we propose a leg-driven physiology frame-
work for pedestrian detection. The basic idea of the proposed
framework is that we can construct a small set of rectangu-
lar candidate regions based on the theory of body proportions
of a pedestrian [2] (as illustrated in Fig. 1), in addition to
recent developments on line segment detection (or 2-piece
polylines) [10, 19, 11]. The proposed framework is driven
by legs, which is motivated by the following facts: 1) a leg is
more salient than an arm in terms of its length and width; 2)
a leg can be modeled by simpler geometry primitives (line
segments or 2-piece polylines) than a head. More specif-
ically, we first detect a number of “vertical” line segments
(whose cross angle with the ground is larger than 45°). For
each line segment, we generate a number of bounding boxes
whose locations, heights, and widths are estimated by a devel-
oped theory of body proportion of a pedestrian in a standing
or a walking pose. Finally, a symmetry constraint is applied
to remove non-pedestrian bounding boxes. It is worth noting
that the proposed framework can be integrated with an exist-
ing machine learning method, by adding an additional step of
verifying bounding boxes via a machine learning method. In
the experiment, we compare Dalal & Triggs method (i.e., the HOG+SVM method) with the proposed framework, and the results demonstrate its effectiveness.
The paper is structured as follows: Section 2 reviews line
segment detection. A pedestrian detection algorithm is pro-
posed in Section 3. Experiments are presented in Section 4.
Conclusions and future work are presented in Section 5.
2. LINE SEGMENT DETECTION

Recently, several interesting methods have been proposed to detect line segments [10, 19, 11]. Here, we review two of them, both of which engage connected components of edge pixels: i) Kosecka and Zhang [10], and ii) Li et al. [11].
Kosecka and Zhang [10] proposed a fitting based method to
detect line segments in the context of vanishing point estimation. Specifically, their method first applied quantized gradient directions to label edge pixels, and then applied the connected components algorithm to group edge pixels with the
same label. A fitting algorithm was finally applied to each
connected component to estimate its line parameter. Li et
al. [11] proposed a method that can detect not only line seg-
ments (called 1-piece polylines) but also two joint line seg-
ments (called 2-piece polylines) in the context of stop sign
detection. Note that 2-piece polylines can be used to model
a bent leg of a walking pedestrian under a side view. Due to the space constraint, we focus here on line segments.
Given a connected component C of edge pixels, Li et al. method performs three steps for line segment detection [11]. The first step extracts three dominant points, where the first two points (v1 and v2) maximize the distance over all pairs of points in C, and the third point v3 maximizes the sum of distances between a point p ∈ C and vi, i = 1, 2, i.e.,

v3 = arg max_{p ∈ C} [d(p, v1) + d(p, v2)],

where d(·, ·) denotes the Euclidean distance.
The second step verifies the piecewise linearity of C. The third step partitions C into two subsets if C does not form a 1-piece polyline. This idea is then applied recursively to the two subsets.
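The dominant-point step above can be sketched as follows (a minimal sketch; the function name and the brute-force pairwise search are our own illustration, not the implementation of [11]):

```python
import itertools

def dominant_points(C):
    """Extract three dominant points of a connected component C
    (a list of (x, y) edge pixels): v1, v2 maximize the pairwise
    distance; v3 maximizes the summed distance to v1 and v2."""
    d2 = lambda p, q: (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    # v1, v2: the farthest pair of points in C (brute force for clarity)
    v1, v2 = max(itertools.combinations(C, 2), key=lambda pq: d2(*pq))
    d = lambda p, q: d2(p, q) ** 0.5
    # v3: the point maximizing d(p, v1) + d(p, v2)
    v3 = max(C, key=lambda p: d(p, v1) + d(p, v2))
    return v1, v2, v3
```

For a bent component, v1 and v2 land on the two endpoints and v3 lands near the corner, which is what makes the subsequent linearity test and partitioning possible.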
Fig. 2 shows “vertical” line segments detected in a pedes-
trian image by the above two methods: i) Kosecka and Zhang
[10], and ii) Li et al. [11]. Recall that a vertical line segment
refers to a line segment whose cross angle with the ground is
larger than 45° in this paper. Note that some line segments detected by Li et al.'s method may partially overlap each other due to the application of a scale space. Both methods
detected a sufficient number of line segments consistent with
the length of lower legs or entire legs of a pedestrian, which
helps to generate precise bounding boxes.
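For illustration, the "vertical" criterion (cross angle with the ground larger than 45°) can be checked per segment as follows (a sketch; the helper name and endpoint representation are our own):

```python
import math

def is_vertical(p, q, min_angle_deg=45.0):
    """A segment p-q is 'vertical' in the paper's sense if its angle
    with the ground (the horizontal axis) exceeds 45 degrees."""
    dx, dy = q[0] - p[0], q[1] - p[1]
    # abs() makes the test independent of segment direction and of
    # whether the image y-axis points up or down.
    angle = math.degrees(math.atan2(abs(dy), abs(dx)))
    return angle > min_angle_deg
```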
(a) Kosecka & Zhang [10] (b) Li et al. [11]
Fig. 2. “Vertical” line segments detected by Kosecka & Zhang
method [10] and Li et al. method [11]. Both methods detected
a sufficient number of line segments consistent with the length
of lower legs or entire legs of a pedestrian, which helps to
generate precise bounding boxes.
3. THE PROPOSED LEG-DRIVEN FRAMEWORK

In this section, we propose a leg-driven framework for pedestrian detection. The framework contains two key components: i) how to generate hypothesis boxes, given vertical line segments; and ii) how to remove false-positive boxes.
For convenience, we will use the notations listed in Ta-
ble 1.
notation   meaning
h          height of a hypothesis box
w          width of a hypothesis box
l          length of a line segment
θ          cross angle between a line and the ground

Table 1. Notations
3.1. Generation of hypothesis boxes
Pedestrians can have many different poses under different
(camera) viewing directions. Poses and viewing directions
have a significant impact on the width of a hypothesis box,
and have relatively small impact on the height of a hypothesis
box. Exhaustively modeling a large number of poses may not be realistic, since it can generate a large number of hypothesis boxes, increasing the computation cost and, more seriously, the number of false-positive instances. Thus, we
propose two standard configurations on hypothesis boxes: i)
narrow, and ii) wide. A narrow box has a width equal to 2 times the height of a head (refer to Fig. 1); a wide box has
a width equal to 4 times the height of a head. The height of
both types of boxes is equal to 8 times the height of a head.
Given a “vertical” line segment with length l, it is not dif-
ficult for us to decide the height and width of a box, under
different combinations of two factors: i) a lower or entire leg,
and ii) a narrow or wide type. Table 2 shows the formula to
estimate the size of a box under different scenarios.
The two standard types of boxes are exclusive, i.e., only
one type of box can be generated for a given vertical line seg-
ment. The selection of a narrow or wide box depends on θ,
in addition to the leg configuration (lower or entire). When a
line segment is an edge of an entire leg, it is easy to see that
θ = 60° can be used to decide the type of a box. (Note that cos 60° = 0.5.) Otherwise, the analysis seems difficult without introducing a 2-piece polyline that can model a walking leg very well. We will leave this to future work.
             narrow type            wide type
        lower leg  entire leg  lower leg  entire leg
         h    w     h    w      h    w     h    w
         4l   l     2l   0.5l   4l   2l    2l   l

Table 2. The length of a "vertical" line segment is used to decide the height and the width of a hypothesis box.
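Table 2 amounts to a small lookup from the segment length l to the box size (a sketch; the function and argument names are ours):

```python
def box_size(l, leg="lower", box_type="narrow"):
    """Height h and width w of a hypothesis box from the length l
    of a 'vertical' line segment, per Table 2."""
    sizes = {
        ("lower",  "narrow"): (4 * l,     l),
        ("entire", "narrow"): (2 * l, 0.5 * l),
        ("lower",  "wide"):   (4 * l, 2 * l),
        ("entire", "wide"):   (2 * l,     l),
    }
    return sizes[(leg, box_type)]
```

In every case the height stays 8 head-heights; only the width changes between the narrow (2 head-heights) and wide (4 head-heights) configurations.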
Since a line segment can be one of four edges of two lower
legs, or one of four edges of two entire legs, we need to decide
the location of a hypothesis box for each possible case. One
key factor that has an impact on the location estimation is the
width of a leg. Based on measurements of images in the INRIA pedestrian dataset, we set the width of a lower leg to 0.25 × l.
Under a narrow size configuration, we set the gap between
two legs to be 0.1 × l. Thus, we can derive the relative distance
between each of four possible edges of a leg and the boundary
of a bounding box, as illustrated in Fig. 3. Note that the numbers displayed in Fig. 3 represent ratios only. The analysis under a wide size configuration is similar to Fig. 3, except that only 4 boxes are generated for a fairly skewed line segment.
3.2. Symmetry constraint
Left/right symmetry has been shown to be an effective way to reduce the false positive rate of a pedestrian detection method [1, 7, 8]. Bertozzi et al. [1] measured the symmetry of an
image region (enclosed by a bounding box) by computing the
similarity of the normalized histograms of gray values of left
and right sub-regions (that are divided by the central vertical line of a given bounding box).

(a) (b)
Fig. 3. Under a narrow size configuration, eight hypothesis boxes are generated for a given "vertical" line segment. (a) Four boxes are generated with the hypothesis that the line segment is one of four edges of two lower legs; (b) four boxes are generated with the hypothesis that the line segment is one of four edges of two entire legs.

Specifically, assume that h1 and h2 are the histograms of the gray values of the left and right sub-regions, respectively. The symmetry of the region is measured by the dot product (h1/‖h1‖) · (h2/‖h2‖), where ‖·‖ denotes the 2-norm of a vector. Following the above idea, symmetry measurement is translated to similarity measurement,
and thus many existing feature representations, such as HOG,
SIFT [12], and LBP [23], can be used as alternatives, espe-
cially in the scenarios that computational time is not critical
in an application of pedestrian/human detection.
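The symmetry measure of Bertozzi et al. [1] as described above can be sketched as follows (assuming an 8-bit gray-value region; the bin count and function name are our own choices):

```python
import numpy as np

def symmetry(region, bins=32):
    """Symmetry of a gray-value region (2-D array): dot product of
    the L2-normalized gray-value histograms of its left and right
    halves, split at the central vertical line."""
    h, w = region.shape
    left, right = region[:, : w // 2], region[:, w - w // 2 :]
    h1, _ = np.histogram(left, bins=bins, range=(0, 256))
    h2, _ = np.histogram(right, bins=bins, range=(0, 256))
    # Normalize so a perfectly symmetric region scores 1.0.
    h1 = h1 / (np.linalg.norm(h1) + 1e-12)
    h2 = h2 / (np.linalg.norm(h2) + 1e-12)
    return float(np.dot(h1, h2))
```

A uniform region scores 1.0, while a region whose two halves have disjoint gray values scores 0; any of the alternative descriptors mentioned above (HOG, SIFT, LBP) could replace the histograms inside this same similarity scheme.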
3.3. Merging boxes with significant overlaps
There may be multiple boxes that partially overlap each other. One reason is that multiple hypothesis boxes may be
generated by the same vertical line. In the context of scale
space, similar line segments may be detected, which leads to
similar bounding boxes. We try to merge them together in order to have a clear view of the output results. Given hypothesis boxes b1 and b2, their overlap is quantified as the overlapping ratio between the two boxes as follows:

overlapping(b1, b2) = area(b1 ∩ b2) / area(b1 ∪ b2).

If the overlapping ratio is significant, i.e., larger than a threshold (set to 50% in this paper), b2 is added to the cluster containing b1. For each cluster of boxes, we return the
bounding box with the highest symmetry. Note that if two boxes have a significant difference in size, the two boxes won't be merged even if one box is enclosed by the other.
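The overlap test and greedy merging above can be sketched as follows (a sketch assuming boxes given as (x, y, w, h) and intersection-over-union as the overlap measure; the size-difference check is omitted, and all names are ours):

```python
def overlap_ratio(b1, b2):
    """Intersection-over-union of two axis-aligned boxes (x, y, w, h)."""
    x1, y1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    x2 = min(b1[0] + b1[2], b2[0] + b2[2])
    y2 = min(b1[1] + b1[3], b2[1] + b2[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = b1[2] * b1[3] + b2[2] * b2[3] - inter
    return inter / union if union else 0.0

def merge_boxes(boxes, scores, thr=0.5):
    """Greedy clustering by overlap: each box joins the first cluster
    whose representative it overlaps by more than thr; the
    highest-scoring (most symmetric) box represents each cluster."""
    reps = []  # cluster representatives, best symmetry score first
    for b, _ in sorted(zip(boxes, scores), key=lambda t: -t[1]):
        if not any(overlap_ratio(b, r) > thr for r in reps):
            reps.append(b)
    return reps
```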
4. EXPERIMENTS

In this section, we test the performance of the proposed framework, along with a comparison to Dalal & Triggs method, i.e., the HOG+SVM method [3]. (The pedestrian detector in the vision package of Matlab 2016a is used as the implementation of Dalal & Triggs method.)

method                  true positive #   false positive #
Dalal and Triggs [3]          86                367
Proposed                     127                215

Table 3. The test set contains 250 images and 287 pedestrians. The proposed method detects 48% more true positives and 41% fewer false positives than Dalal & Triggs method.

Li et al. method
[11] is used to detect line segments. The left/right symmetry
of a region is measured by the similarity of histograms of
gray values in its left and right sub-regions. We will first give
a visual comparison, and then a quantitative comparison.
Fig. 4 shows a visual comparison between Dalal & Triggs method and the proposed method. It is clear that the bounding boxes detected by Dalal & Triggs method are commonly much larger than those of the proposed method, while the bounding boxes detected by the proposed method are much more accurate. Moreover, many bounding boxes detected by Dalal & Triggs method are false positives. In the first two images, each of which contains two pedestrians, Dalal & Triggs method only detects one.
Next, we present a quantitative comparison between the
two methods. Our test set includes 250 INRIA images that contain 287 pedestrians in total. A bounding box output by a method is considered true-positive if its overlap with the ground truth is over 50%.
Table 3 shows the number of true positives and false posi-
tives obtained by the two methods. Dalal & Triggs method de-
tected 86 pedestrians correctly, and the proposed one detected
127 pedestrians. Dalal & Triggs method outputs 367 false-positive bounding boxes, and the proposed one outputs 215. Overall, the proposed method increases true positives by 48% and decreases false positives by 41%.
5. CONCLUSION

In this paper, we proposed a leg-driven framework for pedestrian detection. Experiments show that the proposed framework achieves more precise localization of a pedestrian region than Dalal & Triggs method. In the future, we will explore features of 2-piece polylines, such as their orientation, to reduce the number of hypothesis boxes, which is in turn equivalent to reducing the false positive rate.
Acknowledgements: The work of Xiangui Kang was sup-
ported by NSFC (Grant nos. 61379155, U1536204) and NSF
of Guangdong province (Grant no. s2013020012788).
(a) Dalal & Triggs [3] (b) Proposed
Fig. 4. A comparison between Dalal & Triggs [3] and the
proposed method. The bounding boxes detected by Dalal &
Triggs method are commonly much larger than those of the proposed method.
REFERENCES

[1] M. Bertozzi, A. Broggi, R. Chapuis, F. Chausse, A. Fascioli,
and A. Tibaldi. Shape-based pedestrian detection and local-
ization. In IEEE Trans. on Intelligent Transportation Systems,
volume 1, pages 328–333, 2003.
[2] B. Bogin and M. Varela-Silva. Leg length, body proportion,
and health: a review with a note on beauty. International Jour-
nal of Environmental Research and Public Health, 7(3):1047–
1075, 2010.
[3] N. Dalal and B. Triggs. Histograms of oriented gradients for
human detection. In IEEE Computer Society Conference on
Computer Vision and Pattern Recognition (CVPR’05), vol-
ume 1, pages 886–893, vol. 1, 2005.
[4] P. Dollar, C. Wojek, B. Schiele, and P. Perona. Pedestrian
detection: An evaluation of the state of the art. IEEE Trans.
Pattern Anal. Mach. Intell., 34(4):743–761, 2012.
[5] D. Gavrila and J. Giebel. Shape-based pedestrian detection
and tracking. In IEEE Intelligent Vehicle Symposium, vol-
ume 1, pages 8–14, 2002.
[6] D. Geronimo, A. Lopez, A. Sappa, and T. Graf. Survey of
pedestrian detection for advanced driver assistance systems.
IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 32(7):1239–1258, 2010.
[7] I. Havasi, Z. Szlavik, and T. Sziranyi. Pedestrian detection
using derived third-order symmetry of legs: A novel method
of motion-based information extraction from video image-
sequences. In K. Wojciechowski, B. Smolka, H. Palus, R. Koz-
era, W. Skarbek, and L. Noakes, editors, Computer Vision and
Graphics, volume 32 of Computational Imaging and Vision,
pages 733–739. 2006.
[8] L. Havasi, Z. Szlávik, and T. Szirányi. Detection of gait characteristics for scene registration in video surveillance system. IEEE Trans. Image Processing, 16(2):503–510, 2007.
[9] V.-D. Hoang, M.-H. Le, and K.-H. Jo. Hybrid cascade boost-
ing machine using variant scale blocks based HOG features for
pedestrian detection. Neurocomputing, 135:357–366, 2014.
[10] J. Kosecká and W. Zhang. Video compass. In European Conference on Computer Vision (4), pages 476–490, 2002.
[11] Q. Li, G. Liang, and Y. Gong. A geometric framework for stop sign detection. In IEEE China Summit and International Conference on Signal and Information Processing, ChinaSIP 2015, Chengdu, China, July 12-15, 2015, pages 258–262, 2015.
[12] D. Lowe. Distinctive image features from scale-invariant key-
points. International Journal on Computer Vision, 60(2):91–
110, 2004.
[13] W. Ouyang and X. Wang. Joint deep learning for pedestrian
detection. In Computer Vision (ICCV), 2013 IEEE Interna-
tional Conference on, pages 2056–2063, 2013.
[14] S. Paisitkriangkrai, C. Shen, and J. Zhang. Fast pedestrian
detection using a cascade of boosted covariance features. IEEE
Trans. Circuits Syst. Video Techn., 18(8):1140–1151, 2008.
[15] C. Papageorgiou, T. Evgeniou, and T. Poggio. A trainable
pedestrian detection system. In Intelligent Vehicles, pages
241–246, 1998.
[16] P. Sermanet, K. Kavukcuoglu, S. Chintala, and Y. LeCun.
Pedestrian detection with unsupervised multi-stage feature
learning. In 2013 IEEE Conference on Computer Vision and
Pattern Recognition, Portland, OR, USA, June 23-28, 2013,
pages 3626–3633, 2013.
[17] A. Shashua, Y. Gdalyahu, and G. Hayun. Pedestrian detection
for driving assistance systems: single-frame classification and
system level performance. In IEEE Intelligent Vehicles Sym-
posium, pages 1–6, 2004.
[18] Y. Tian, P. Luo, X. Wang, and X. Tang. Pedestrian detection
aided by deep learning semantic tasks. In IEEE Conference
on Computer Vision and Pattern Recognition, CVPR 2015,
Boston, MA, USA, June 7-12, 2015, pages 5079–5087, 2015.
[19] R. von Gioi, J. Jakubowicz, J.-M. Morel, and G. Randall. LSD: a fast line segment detector with a false detection control. IEEE Trans. Pattern Anal. Mach. Intell., 32(4):722–732, 2010.
[20] X. Wang, M. Wang, and W. Li. Scene-specific pedestrian detection for static video surveillance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(2):361–374, 2014.
[21] B. Wu and R. Nevatia. Detection of multiple, partially oc-
cluded humans in a single image by bayesian combination of
edgelet part detectors. In International Conference on Com-
puter Vision, volume 1, pages 90–97, 2005.
[22] L. Zhao and C. Thorpe. Stereo- and neural network-based
pedestrian detection. IEEE Transactions on Intelligent Trans-
portation Systems, 1(3):148–154, 2000.
[23] Y. Zheng, C. Shen, and X. Huang. Pedestrian detection us-
ing center-symmetric local binary patterns. In Proceedings of
the International Conference on Image Processing, ICIP 2010,
September 26-29, Hong Kong, China, pages 3497–3500, 2010.
[24] H. Zhu, X. Chen, W. Dai, K. Fu, Q. Ye, and J. Jiao. Orientation
robust object detection in aerial images using deep convolu-
tional neural network. In 2015 IEEE International Conference
on Image Processing, ICIP 2015, Quebec City, QC, Canada,
September 27-30, 2015, pages 3735–3739, 2015.