Salient Object Detection Using UAVs
Mo Shan∗, Feng Lin∗, and Ben M. Chen§
∗Temasek Laboratories, National University of Singapore
§Department of Electrical and Computer Engineering, National University of Singapore
ABSTRACT
A salient object detection approach is proposed
in this paper, tailored to the aerial images col-
lected by Unmanned Aerial Vehicles (UAVs). In
particular, the aerial images are classiﬁed. The
selected image is segmented using superpixels, and then a weak saliency map is constructed based on image priors. Positive and negative samples are selected in accordance with the
weak saliency map for training a boosted classi-
ﬁer. Consequently, the classiﬁer is used to pro-
duce a strong saliency map. The weak and strong
saliency maps are integrated to locate the candi-
date objects, and false alarms are pruned after
post-processing. Experiments on aerial images
collected above meadow and roof demonstrate
the effectiveness of the proposed approach.
Keywords: Saliency detection, UAVs
1 INTRODUCTION
UAVs have already been widely employed for a range of
tasks, including inspection and surveillance. Furthermore, the
advent of visual inertial odometry presented recently in [1, 2],
which utilizes an on-board camera and IMU in a tightly coupled manner, dramatically reduces their dependence on GPS and enables UAVs to operate in cluttered environments. For instance, researchers in the SFLY project have demonstrated that UAVs are capable of performing localization and mapping in GPS-denied environments [3].
This paper studies salient object detection in aerial im-
ages. Speciﬁcally, illegal dumping detection is selected as
a case study. Illegal dumping refers to the unauthorized disposal of garbage, appliances, or furniture on public or private properties. A patrol team is usually required to deter such offensive activities. Meanwhile, automatic illegal dumping detection from the images captured by UAVs is critical for reducing the burden of manual labor. Since the amount of images
obtained from the on-board camera is significant, it is troublesome to scan through every image. Consequently, detection algorithms are necessary to select those images that may contain dumped objects. Normally, the waste is left at the places
where it stands out from the environment, such as a plastic
bag on the meadow, and this makes the dumped object salient. Therefore, salient object detection becomes handy for
detecting illegal dumping automatically. This problem may
seem trivial at ﬁrst glance, because the dumped waste is usu-
ally quite distinguishable from its environment. However,
other man-made objects may exist in the scene as well, such
as a basketball court on a meadow, or windows on the roof. The existence of these outliers makes it more challenging to detect the dumped waste.
Saliency detection could be divided into three categories: bottom-up, top-down, and hybrid approaches [4]. The first
category includes [5, 6, 7, 8, 9], and considers primitive in-
formation, such as the intensity contrast, color histogram,
and global distribution of colors, whereas the second category contains [10, 11] and concerns the application of
prior knowledge for the speciﬁc task at hand. An example
of bottom-up saliency detection is to detect a red ball against
a white background, while a top-down approach is used when cyclists are searching for the bicycle lane [12].
In this paper, a task-driven salient object detection algorithm is proposed. The images are pre-processed by scene
classiﬁcation. For the images that are likely to contain de-
bris, a weak saliency map is generated from the input image
based on prior knowledge, and then both positive and nega-
tive training samples are selected to produce a strong saliency
map. Those two saliency maps are combined and threshold-
ed to identify salient objects. Post-processing is carried out
to prune outliers. The proposed approach is implemented in
MATLAB based on the open-source code of [13]†. Aerial
images collected by our UAV are used for the experiments.
The remainder of the paper is organized as follows: Section 2 presents a brief literature review; the methodology is described in Section 3; Section 4 contains the experiments for salient object detection; Section 5 consists of the conclusion.
2 RELATED WORKS
2.1 Bottom-up approaches
One of the early bottom-up models is presented in [5].
It is based on a biologically-plausible architecture related to
the feature integration theory. The primitive visual features
considered include intensity contrast, color contrast, and lo-
cal orientation, all of which are computed at multiple scales.
To integrate the feature maps with different modalities, a nor-
malization operator is designed to promote the unique maps,
mimicking the cortical lateral inhibition mechanisms. The
feature maps are combined into conspicuity maps, which are
normalized and summed to produce the saliency map. While
this framework could be tailored to different tasks via dedi-
cated feature maps, it may not be able to detect objects salient
for feature types not implemented.
Another biologically-plausible bottom-up approach is proposed in [6]. The structure of graph algorithms is exploited to compute saliency efficiently. Specifically, Markov
chains are deﬁned for the images, and the equilibrium distri-
bution is used as activation. Because each node is indepen-
dent, this process can be computed in a parallel way. In the
normalization phase, a graph is constructed to concentrate activation into key locations. This approach outperforms [5] by
14% on human ﬁxation prediction, and it could be extended
to multiple resolutions for improved performance. Nevertheless, this work favors the central bias and thus may not be suitable for detecting salient objects in the image periphery.
The recent work [9] relies on a convolutional neural network (CNN) trained on the ImageNet dataset to perform feature extraction. The features are extracted from three rectangular windows enclosing the target region, its surrounding
regions, and the whole image, in order to evaluate the visual
contrast. Fully connected layers are trained from these multi-
scale CNN features to infer the saliency score. The saliency
maps are reﬁned to maintain spatial coherence and fused to
obtain an aggregated map. The drawbacks of this approach include its requirement for training datasets and its long processing time if a GPU is not used.
2.2 Top-down approaches
As for the top-down approach, [10] focuses on salient object detection by incorporating high-level concepts. The
problem is modeled by a conditional random field (CRF) to
combine multiscale contrast, center-surround histogram, and
color spatial distribution as local, regional as well as global
salient features. Temporal cues are also exploited for dealing
with sequential images. Being only able to detect a single
salient object is one of the remaining issues for this approach.
An alternative top-down model is described in [11]. The
image is decomposed into multiscale segmentations. A ran-
dom forest regressor is learnt to map the regional descrip-
tors representing the contrast, property and backgroundness
to a saliency score. An aggregated saliency map is obtained
from fusing the saliency maps of different segmentation levels. The key differences of this approach are that it computes a contrast vector instead of a scalar value, and that it combines the features to generate the integrated saliency map, rather than combining the saliency maps generated by individual features. This method
requires the collection of training samples with groundtruth
labels as well.
2.3 Hybrid approach
In addition to the two categories mentioned above, there
are hybrid approaches as well. In [13], a novel salient object detection approach is proposed. Image priors and multiple features are exploited to generate positive and negative
training samples for bootstrap learning. Since the training
samples are selected using the bottom-up model, the off-line
training and groundtruth labeling are alleviated. This is criti-
cal as there is a lack of datasets for illegal dumping detection.
Our work is similar to [13], with the following contributions:
1. Saliency map is used to detect illegal dumping using
aerial images captured by a UAV.
2. A simple yet efﬁcient pre-processing algorithm is used
for scene classiﬁcation.
3. The color and size priors are used to take into account
the features of the scene and the objects to be detected.
4. Steps for post-processing the saliency map to prune the
outliers are proposed.
3 METHODOLOGY
Referring to the overview in Fig. 1, the proposed approach consists of scene classification as pre-processing, generation of a saliency map that is thresholded to identify candidates for the dumped waste, and post-processing to prune the outliers.
3.1 Scene classification
UAVs are often needed to survey a building to detect illegal dumping, and it is common that the building is surrounded by meadow and trees. Since the debris is most likely to exist
on the meadow and roof, the aerial images are classified into three categories, namely meadow, roof, and tree images.
The salient object detection is only carried out in the meadow
and roof images. In this way, not only is the number of false alarms reduced, but the computational time is also saved by discarding the tree images.
In scene classiﬁcation, green regions are detected ﬁrst,
and then blob detection of these regions is performed to count the clusters. If there are few clusters, the scene is probably a meadow. In contrast, the existence of more clusters indicates that the green regions are not connected, and
hence the image is likely a tree image. If no green region is
found, the probability of the image depicting the roof is high.
The green region detection is based on [14]. The input
RGB image is transformed to a grayscale image by

I = 0.2989 × R + 0.5870 × G + 0.1140 × B    (1)

where R, G, B are the red, green, and blue channels of the input, and I is the grayscale image. Then I is subtracted from G
to obtain the green part of the image, followed by applying
the median ﬁlter to suppress the noise. The resulting image is
converted to a binary one by using the threshold of threshgr.
Afterwards, blob detection is performed in the binary image,
and the green regions whose area is smaller than threshga
are discarded as noise. Suppose the number of the remaining blobs is k; the scene is then classified as meadow if 0 < k ≤ threshk, or as tree if k > threshk. If k = 0, the scene depicts a roof. In the experiments, the parameters are set to threshgr = 0.07, threshga = 1000, and threshk = 3.
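As a minimal sketch, the classification rule above might look as follows in pure Python; the function name classify_scene and the flood-fill blob labeling are illustrative choices of ours, not the paper's MATLAB implementation, and the thresholds default to the values reported above.

```python
# Illustrative sketch of the scene classification in Section 3.1.
# The function name and flood-fill labeling are our choices; the
# thresholds default to the values reported in the paper.
def classify_scene(rgb, thresh_gr=0.07, thresh_ga=1000, thresh_k=3):
    """rgb: rows of (R, G, B) tuples with channels in [0, 1]."""
    h, w = len(rgb), len(rgb[0])

    # Eq. (1): luminance I, then G - I isolates the green component.
    def is_green(y, x):
        r, g, b = rgb[y][x]
        return g - (0.2989 * r + 0.5870 * g + 0.1140 * b) > thresh_gr

    binary = [[is_green(y, x) for x in range(w)] for y in range(h)]
    seen = [[False] * w for _ in range(h)]
    k = 0  # number of green blobs at least thresh_ga pixels large
    for y0 in range(h):
        for x0 in range(w):
            if binary[y0][x0] and not seen[y0][x0]:
                stack, area = [(y0, x0)], 0
                seen[y0][x0] = True
                while stack:  # flood fill one 4-connected blob
                    y, x = stack.pop()
                    area += 1
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                if area >= thresh_ga:
                    k += 1
    if k == 0:
        return "roof"
    return "meadow" if k <= thresh_k else "tree"
```

On a toy image, a single large green blob yields "meadow", several disconnected blobs yield "tree", and no green region yields "roof"; a small thresh_ga must be passed for such tiny inputs.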
Figure 1: Overview of the proposed approach for salient object detection.
3.2 Saliency detection
A weak saliency map is constructed using color and size
priors. As for the color prior, two cases are tested: meadow and roof, since it is common to find debris on the meadow, and it is also important to detect debris on the roof using UAVs, because the roof is out of the view of people on the ground. The color prior of a pixel p for the meadow is defined as

Sc(p) = √((h(p) − Hm)² + (s(p) − Sm)² + (v(p) − Vm)²)    (2)
where h, s, v are the Hue, Saturation, and Value channels of the image in the HSV colorspace. Hm, Sm, Vm are the typical values for the meadow, with Hm = 0.2, Sm = 1, and Vm = 0.5. This color prior essentially computes the distance of the pixel to the green color. In this way, non-green objects are assigned higher weights, and thus they are more
likely to be chosen as positive training samples.
Similarly, the color prior for the roof is Sc(p) = s(p),
which is the value of the Saturation channel. Based on the
observation that the Saturation of the roof is low, the higher the Saturation of a pixel, the more likely it is not part of the roof, indicating that the pixel belongs to objects such as debris.
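Both priors are simple pixel-wise formulas; the sketch below assumes pixels are already converted to HSV with channels scaled to [0, 1] (the function names are ours).

```python
import math

# Typical meadow color in HSV from the paper: Hm = 0.2, Sm = 1, Vm = 0.5.
MEADOW_HSV = (0.2, 1.0, 0.5)

def color_prior_meadow(h, s, v, ref=MEADOW_HSV):
    """Eq. (2): distance from the typical meadow green, so non-green
    pixels receive larger prior values."""
    hm, sm, vm = ref
    return math.sqrt((h - hm) ** 2 + (s - sm) ** 2 + (v - vm) ** 2)

def color_prior_roof(h, s, v):
    """Roof pixels have low Saturation, so Saturation itself is the
    prior: the higher it is, the less roof-like the pixel."""
    return s
```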
The superpixels proposed in [15] are computed in the image to exploit the size prior, defined as Ss(p) = 1 − A, where A is the normalized area of the superpixel that pixel p belongs to. The size prior penalizes large regions, because these often correspond to buildings and are unlikely to be the dumped waste.
Besides the color and size priors, another useful criterion for
determining salient objects is the difference from the image
border, assuming that the majority of the border region con-
tains only background. To compute the distance of each su-
perpixel to those close to the boundary, the RGB, CIELab,
and Local Binary Pattern (LBP) features are used. Therefore,
the per-pixel saliency could be obtained from

Sweak(p) = Sc(p) × Ss(p) × d(p)    (3)

where Sc(p), Ss(p), and d(p) are the color prior, the size prior, and the superpixel difference to the image border, respectively.
This weak saliency map is smoothed by Graph Cut.
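A sketch of the size prior and of Eq. (3), evaluated per superpixel; the min-max rescaling at the end is our addition for illustration and is not prescribed by the paper.

```python
# Size prior Ss = 1 - A, with A the superpixel area normalized by
# the image area, so large regions (e.g. buildings) are penalized.
def size_prior(superpixel_area, image_area):
    return 1.0 - superpixel_area / image_area

# Eq. (3): product of color prior, size prior, and distance to the
# border superpixels, rescaled to [0, 1] (rescaling is our addition).
def weak_saliency(color_priors, size_priors, border_dists):
    raw = [c * s * d for c, s, d in zip(color_priors, size_priors, border_dists)]
    lo, hi = min(raw), max(raw)
    return [(r - lo) / (hi - lo) if hi > lo else 0.0 for r in raw]
```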
Next, the weak saliency map is used to generate the training samples for the strong saliency map. The superpixels whose average saliency values are below the lower threshold are selected as negative training samples, whereas those with average saliency values exceeding the higher threshold are chosen as positive training samples.
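The double-threshold selection can be sketched as below; the numeric thresholds here are illustrative, since the paper does not report their values.

```python
# Select training samples for the boosted classifier from the mean
# weak-saliency value of each superpixel (threshold values are
# illustrative, not taken from the paper).
def select_training_samples(mean_saliency, low=0.2, high=0.7):
    positives = [i for i, s in enumerate(mean_saliency) if s > high]
    negatives = [i for i, s in enumerate(mean_saliency) if s < low]
    return positives, negatives
```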
Since it is unclear how to integrate different features for training, the Multiple Kernel Boosting (MKB) presented in [16] is
employed. Several weak classiﬁers, which are Support Vector
Machines (SVMs) with linear, polynomial, RBF, and sigmoid
kernels, are combined to form a strong classiﬁer. The details
of the boosting process are described in [16]. The aggregated classifier is then used to generate a pixel-wise strong saliency map Sstrong. The map is smoothed by Graph Cut and guided filtering.
To address the multiscale issue, superpixels with four different granularities are generated, producing four pairs of
weak and strong saliency maps. Averaging these maps gives
the multiresolution weak and strong saliency maps.
To summarize, a weak saliency map is generated based
on priors, and then a strong saliency map is constructed using
the training samples. The former detects ﬁne details due to its
bottom-up model, while the latter focuses on global shapes.
As the two maps are complementary, they are integrated to
produce the final saliency map as

Saggregated = Sweak + Sstrong    (4)

The final saliency map Saggregated is thresholded by threshsa to identify objects that possess a high saliency value.
3.3 Post-processing
The post-processing algorithm is described in Algorithm 1. The first step is to detect homogeneous regions. This is necessary to exclude false alarms existing on non-homogeneous regions, such as the basketball court beside the meadow, or the sidewall on the roof.
The meadow is considered ﬁrst. Green region detec-
tion as described in the pre-processing step is conducted.
For the roof, the image is converted from RGB colorspace
Algorithm 1 Post-processing algorithm
1: Detect homogeneous regions
2: if The scene is meadow then
3: Detect green region
4: threshgr ←0.05
5: end if
6: if The scene is roof then
7: Threshold the HSV image
8: threshsl ←0,threshsu ←0.1
9: end if
10: Retain the salient objects on the homogeneous region
11: for Each salient object candidate do
12: if Rectangularity < threshrt then
13: Remove outlier
14: end if
15: if (Area < threshal or Area > threshau )then
16: Remove outlier
17: end if
18: if (ECD < threshdl or ECD > threshdu )then
19: Remove outlier
20: end if
21: end for
to the HSV colorspace and thresholded using pre-defined lower and upper thresholds for the Saturation channel, where threshsl = 0 and threshsu = 0.1, since the roof part has very low saturation. The holes in these binary maps are filled to
produce masks for homogeneous regions.
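The roof mask can be sketched as a per-pixel Saturation test (hole filling, which the paper also applies, is omitted here; the function name is ours).

```python
# Mask of roof-like (homogeneous) pixels from the Saturation channel,
# using the paper's thresholds threshsl = 0 and threshsu = 0.1.
def roof_mask(saturation, thresh_sl=0.0, thresh_su=0.1):
    return [[thresh_sl <= s <= thresh_su for s in row] for row in saturation]
```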
Even though the salient objects could be detected on the homogeneous regions, there may still exist some outliers, such as the boundaries of the meadow or the building. To remove
these outliers, the rectangularity of the detected object could
be taken into account, since the garbage usually consists of
boxes or appliances that are rectangular. The rectangularity
could be computed as the ratio of pixels belonging to the ob-
ject to the total pixels in its bounding box. The objects whose
rectangularity is lower than the threshold threshrt are discarded.
Besides rectangularity, the size is also taken into account to prune objects that are either too small, such as a shadow, or too large, such as an entire wall. The parameters for size include the area and the equivalent circular diameter (ECD), computed as √(4 × area / π), which specifies the diameter of a circle with the same area as the detected object. Their lower and upper thresholds are threshal, threshau, and threshdl, threshdu respectively.
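Lines 11-21 of Algorithm 1 reduce to three tests per candidate; a sketch using, for concreteness, the meadow thresholds reported in Section 4.3 as defaults (the helper name is ours).

```python
import math

# Keep a candidate only if it passes the rectangularity, area, and
# equivalent-circular-diameter tests of Algorithm 1 (defaults are the
# meadow thresholds from the experiments in Section 4.3).
def keep_candidate(area, bbox_area, thresh_rt=0.5,
                   thresh_al=30, thresh_au=600,
                   thresh_dl=20, thresh_du=40):
    rectangularity = area / bbox_area       # object pixels / bounding box
    ecd = math.sqrt(4 * area / math.pi)     # equivalent circular diameter
    return (rectangularity >= thresh_rt
            and thresh_al <= area <= thresh_au
            and thresh_dl <= ecd <= thresh_du)
```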
4 EXPERIMENTS
4.1 Image collection
The images used in the experiments are captured when the
UAV surveys a building surrounded by a meadow, as shown
Figure 2: Stitched map of the test site, generated by
Pix4Dmapper. The locations of plastic bag and tree branches
are marked in red.
in Fig. 2 ‡. Black plastic bags and tree branches are placed
on the meadow and the roof to simulate illegal dumping. The
UAV operates at about 35m, and the height of the building
is 18m. In other words, the height in the aerial images for
garbage detection on meadow is about 35m, while the height
for garbage detection on roof is about 17m.
The camera carried on-board is a Sony A6000, with a focal length of 16mm. The original resolution of the images is 6000 × 4000, and to reduce the computational time, they are downsampled to 640 × 427.
4.2 Scene classiﬁcation
Figure 3: Scene classiﬁcation results. From left to right: orig-
inal image, binary image after green region detection, classi-
ﬁcation result with blobs marked in red.
The images are classiﬁed into different categories prior
to salient object detection. From Fig. 3, it is evident that the green regions can be effectively detected in the images collected by the UAV. Moreover, the number of blobs in the meadow image is smaller than that in the tree image, where the green regions tend to be discontinuous. Furthermore, there is no green region in the roof image. With the help of the scene classification, illegal dumping detection is only performed in the meadow and roof images to save computational time.
‡These images used for the experiments are available at
4.3 Garbage detection on the meadow
Figure 4: Garbage detection result on the meadow. a: origi-
nal image. b: saliency map. c: mask image of the meadow
region. d: result of detected garbage marked in red. Best
viewed in color.
The experimental results for the illegal dumping detection
on a meadow will be presented in this section, where some
black plastic bags are placed on the meadow as the dumped
waste. The rectangularity threshold threshrt is set to 0.5.
The size thresholds are threshal = 30,threshau = 600, and
threshdl = 20,threshdu = 40 respectively. The saliency
threshold is threshsa = 0.1.
As shown in Fig. 4, there is a plastic bag on the mead-
ow, which could be clearly observed in the saliency map.
Moreover, the mask image of the meadow effectively cov-
ers the entire meadow region, such that the salient objects on
the basketball court can be removed as outliers. After post-processing, the boundaries of the meadow could be further removed, and the resulting saliency map only contains the dumped waste.
For comparison, using the same original image as in Fig. 4, the saliency maps from other methods are displayed in Fig. 5, generated by the approaches proposed in [6, 9] and the original version of [13], respectively. Since the dumped waste is not in the center, methods that emphasize the central bias may not work well.
Figure 5: Comparison of saliency maps. From left to right: saliency maps generated by the approaches in [6, 9] and [13].
Figure 6: Garbage detection result on the roof. a: original
image. b: saliency map. c: mask image of the roof region.
d: result of detected garbage marked in red. Best viewed in color.
4.4 Garbage detection on the roof
This section presents the results of the proposed approach
for garbage detection on the roof, where the dumped waste consists of tree branches. The rectangularity threshold threshrt is set to 0.6, which is higher than that of the previous experiment, because the water pond on the roof, whose boundaries have irregular shapes, produces many outliers. The size thresholds are threshal = 300, threshau = 600, and threshdl = 20, threshdu = 30. The saliency threshold is also set to threshsa = 0.1, as in the meadow case.
It could be observed in Fig. 6 that although the tree branches are not as salient as the plastic bag shown in Fig. 4, they can still be detected.
Figure 7: Comparison of saliency maps. From left to right: saliency maps generated by the approaches in [6, 9] and [13].
To compare the results of different saliency detection algorithms, the same image from Fig. 6 is used, and the saliency maps are shown in Fig. 7. Similar to what is observed in
Fig. 5, the algorithms assuming that the object is close to the
image center fail to detect the tree branches.
Some may contend that instead of using saliency, HSV
thresholds could be applied to the meadow or roof regions to
detect the plastic bags or tree branches, as they possess unique
colors such as black or brown. Admittedly, color threshold-
ing may work well as the meadow or roof regions are quite
homogeneous. However, the type of garbage is not limited
to only two kinds, and thus saliency is preferable over color thresholding because of its capability to detect all kinds of dumped waste.
Though the proposed approach is extensible to detect var-
ied kinds of garbage, there are also certain limitations associ-
ated with its extensibility. For instance, sometimes the light-
ing pole on the meadow is detected as dumped waste, because
it stands out from the scene.
Another drawback of our work is its limited dataset size,
which contains few images for each category, and thus the
number of images that have debris in the scene is very small.
Hence it is difficult to evaluate the proposed approach quantitatively. To obtain statistically meaningful results, a larger
dataset is needed. In addition, the small dataset also hinders
the usage of CNNs for scene classiﬁcation and garbage de-
tection as the training samples are quite limited. One possible
way is to apply transfer learning to resolve this issue.
5 CONCLUSION
In this paper, a new saliency detection algorithm is proposed, which consists of pre-processing, saliency detection,
and post-processing. During the ﬁrst stage, the image is clas-
siﬁed based on the number of green blobs. For the second
stage, color and size priors are designed to obtain a weak
saliency map. This map is used to generate the training sam-
ples for a boosted classiﬁer. The classiﬁer then produces a
strong saliency map, which is fused with the weak saliency map. In the third stage, outliers that are located on non-homogeneous regions or are inconsistent with the size requirements are pruned. Experimental results show that the
proposed approach could perform illegal dumping detection
on aerial images captured above the meadow and roof. The proposed approach could be extended to other salient object detection applications as well.
ACKNOWLEDGEMENTS
The authors would like to thank the members of NUS
UAV Research Group for their kind support, and Prof Lu
Huchuan for the generous sharing of their code.
REFERENCES
[1] Christian Forster, Matia Pizzoli, and Davide Scaramuzza. SVO: Fast semi-direct monocular visual odometry.
In Robotics and Automation (ICRA), 2014 IEEE Inter-
national Conference on, pages 15–22. IEEE, 2014.
[2] Stefan Leutenegger, Simon Lynen, Michael Bosse,
Roland Siegwart, and Paul Furgale. Keyframe-
based visual–inertial odometry using nonlinear opti-
mization. The International Journal of Robotics Re-
search, 34(3):314–334, 2015.
[3] Davide Scaramuzza, Michael C Achtelik, Lefteris Doitsidis, Felice Friedrich, Elias Kosmatopoulos, Alessio
Martinelli, Markus W Achtelik, Maria Chli, Savvas A
Chatzichristoﬁs, Laurent Kneip, et al. Vision-controlled
micro ﬂying robots: from system design to autonomous
navigation and mapping in gps-denied environments.
Robotics & Automation Magazine, IEEE, 21(3):26–40, 2014.
[4] Kate Duncan and Santonu Sarkar. Saliency in images and video: a brief survey. Computer Vision, IET.
[5] Laurent Itti, Christof Koch, and Ernst Niebur. A model
of saliency-based visual attention for rapid scene analy-
sis. IEEE Transactions on Pattern Analysis & Machine
Intelligence, (11):1254–1259, 1998.
[6] Jonathan Harel, Christof Koch, and Pietro Perona.
Graph-based visual saliency. In Advances in neural in-
formation processing systems, pages 545–552, 2006.
[7] Ming Cheng, Niloy J Mitra, Xumin Huang, Philip HS
Torr, and Song Hu. Global contrast based salient region
detection. Pattern Analysis and Machine Intelligence,
IEEE Transactions on, 37(3):569–582, 2015.
[8] Xiaodi Hou and Liqing Zhang. Saliency detection: A
spectral residual approach. In Computer Vision and Pat-
tern Recognition, 2007. CVPR’07. IEEE Conference on,
pages 1–8. IEEE, 2007.
[9] Guanbin Li and Yizhou Yu. Visual saliency based on
multiscale deep features. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recogni-
tion, pages 5455–5463, 2015.
[10] Tie Liu, Zejian Yuan, Jian Sun, Jingdong Wang, Nanning Zheng, Xiaoou Tang, and Heung-Yeung Shum.
Learning to detect a salient object. Pattern Analy-
sis and Machine Intelligence, IEEE Transactions on,
[11] Huaizu Jiang, Jingdong Wang, Zejian Yuan, Yang Wu,
Nanning Zheng, and Shipeng Li. Salient object detec-
tion: A discriminative regional feature integration ap-
proach. In Proceedings of the IEEE conference on com-
puter vision and pattern recognition, pages 2083–2090,
 Simone Frintrop, Erich Rome, and Henrik I Chris-
tensen. Computational visual attention systems and
their cognitive foundations: A survey. ACM Transac-
tions on Applied Perception (TAP), 7(1):6, 2010.
[13] Na Tong, Huchuan Lu, Xiang Ruan, and Ming-Hsuan
Yang. Salient object detection via bootstrap learning. In
Proceedings of the IEEE Conference on Computer Vi-
sion and Pattern Recognition, pages 1884–1892, 2015.
[14] Arindam Bose. How to detect and track red, green and
blue objects in live video, 2013–2014.
[15] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, and Sabine Susstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. Pattern Analysis and Machine Intelligence, IEEE
Transactions on, 34(11):2274–2282, 2012.
[16] Fan Yang, Huchuan Lu, and Yen-Wei Chen. Human
tracking by multiple kernel boosting with locality afﬁn-
ity constraints. In Computer Vision–ACCV 2010, pages
39–50. Springer, 2010.