Salient Object Detection Using UAVs
Mo Shan, Feng Lin, and Ben M. Chen§
Temasek Laboratories, National University of Singapore
§Department of Electrical and Computer Engineering, National University of Singapore
ABSTRACT
A salient object detection approach tailored to aerial images collected by Unmanned Aerial Vehicles (UAVs) is proposed in this paper. In particular, the aerial images are first classified by scene type. A selected image is segmented using superpixels, and a weak saliency map is constructed based on image priors. Positive and negative samples are selected in accordance with the weak saliency map for training a boosted classifier, which is then used to produce a strong saliency map. The weak and strong saliency maps are integrated to locate candidate objects, and false alarms are pruned by post-processing. Experiments on aerial images collected above a meadow and a roof demonstrate the effectiveness of the proposed approach.
Keywords: Saliency detection, UAVs
1 INTRODUCTION
UAVs have already been widely employed for a range of
tasks, including inspection and surveillance. Furthermore, the
advent of visual inertial odometry presented recently in [1, 2],
which utilizes on-board camera and IMU in a tightly coupled
manner, reduces their dependence on GPS dramatically and
enables the UAVs to operate in clustered environment. For
instance, researchers in the SFLY project have demonstrat-
ed that the UAVs are capable of performing localization and
mapping in GPS-denied environment [3].
This paper studies salient object detection in aerial images. Specifically, illegal dumping detection is selected as a case study. Illegal dumping refers to the unauthorized disposal of garbage, appliances, or furniture on public or private property. A patrol team is usually required to deter such offenses, so automatic detection of illegal dumping from images captured by UAVs is critical to lowering the burden of manual labor. Since the number of images obtained from the on-board camera is significant, it is troublesome to scan through every image; detection algorithms are therefore necessary to select those images that may contain dumped objects. Normally, waste is left at places where it stands out from the environment, such as a plastic bag on a meadow, which makes the dumped objects salient. Salient object detection is therefore well suited to detecting illegal dumping automatically. The problem may seem trivial at first glance, because dumped waste is usually quite distinguishable from its environment. However, other man-made objects may exist in the scene as well, such as a basketball court beside a meadow or windows on a roof, and the existence of these outliers makes it more challenging to detect the garbage.
Saliency detection can be divided into three categories: bottom-up, top-down, and hybrid approaches [4]. The first category includes [5, 6, 7, 8, 9] and considers primitive information, such as intensity contrast, color histograms, and the global distribution of colors, whereas the second category contains [10, 11] and concerns the application of prior knowledge for the specific task at hand. An example of bottom-up saliency detection is detecting a red ball against a white background, while a top-down approach is used when cyclists search for the bicycle lane [12].
In this paper, a task-driven salient object detection algorithm is proposed. The images are pre-processed by scene classification. For the images that are likely to contain debris, a weak saliency map is generated from the input image based on prior knowledge, and then both positive and negative training samples are selected from it to produce a strong saliency map. The two saliency maps are combined and thresholded to identify salient objects, and post-processing is carried out to prune outliers. The proposed approach is implemented in MATLAB based on the open-source code of [13], available at http://202.118.75.4/lu/publications.html. Aerial images collected by our UAV are used for the experiments.
The remainder of the paper is organized as follows: Section 2 presents a brief literature review; the methodology is described in Section 3; Section 4 contains the experiments for salient object detection; Section 5 concludes the paper.
2 RELATED WORKS
2.1 Bottom-up approaches
One of the early bottom-up models is presented in [5]. It is based on a biologically plausible architecture related to the feature integration theory. The primitive visual features considered include intensity contrast, color contrast, and local orientation, all of which are computed at multiple scales. To integrate feature maps of different modalities, a normalization operator is designed to promote the unique maps, mimicking the cortical lateral inhibition mechanism. The feature maps are combined into conspicuity maps, which are normalized and summed to produce the saliency map. While this framework can be tailored to different tasks via dedicated feature maps, it cannot detect objects that are salient with respect to feature types that are not implemented.
Another biologically plausible bottom-up approach is proposed in [6]. The structure of graph algorithms is exploited to compute saliency efficiently. Specifically, Markov chains are defined over the images, and the equilibrium distribution is used as activation. Because each node is independent, this process can be computed in parallel. In the normalization phase, a graph is constructed to concentrate activation at key locations. This approach outperforms [5] by 14% on human fixation prediction, and it can be extended to multiple resolutions for improved performance. Nevertheless, this work favors the center bias and thus may not be suitable for detecting salient objects in the image periphery.
The recent work [9] relies on a convolutional neural network (CNN) trained on the ImageNet dataset to perform feature extraction. Features are extracted from three rectangular windows enclosing the target region, its surrounding regions, and the whole image, in order to evaluate visual contrast. Fully connected layers are trained on these multiscale CNN features to infer the saliency score. The saliency maps are refined to maintain spatial coherence and fused to obtain an aggregated map. The drawbacks of this approach include its requirement for training datasets and its long processing time when a GPU is not used.
2.2 Top-down approaches
As for the top-down approaches, [10] focuses on salient object detection by incorporating high-level concepts. The problem is modeled by a conditional random field (CRF) that combines multiscale contrast, center-surround histograms, and color spatial distribution as local, regional, and global salient features. Temporal cues are also exploited for dealing with sequential images. Being able to detect only a single salient object is one of the remaining issues of this approach.
An alternative top-down model is described in [11]. The image is decomposed into multiscale segmentations, and a random forest regressor is learnt to map the regional descriptors representing contrast, property, and backgroundness to a saliency score. An aggregated saliency map is obtained by fusing the saliency maps of different segmentation levels. The key differences of this approach are that it computes a contrast vector instead of a single value, and that it combines features to generate the integrated saliency map rather than combining saliency maps generated from individual features. This method also requires the collection of training samples with ground-truth labels.
2.3 Hybrid approach
In addition to the two categories mentioned above, there are hybrid approaches. In [13], a novel salient object detection approach is proposed in which image priors and multiple features are exploited to generate positive and negative training samples for bootstrap learning. Since the training samples are selected using the bottom-up model, off-line training and ground-truth labeling are alleviated. This is critical, as there is a lack of datasets for illegal dumping detection. Our work is similar to [13], with the following contributions:
1. A saliency map is used to detect illegal dumping in aerial images captured by a UAV.
2. A simple yet efficient pre-processing algorithm is used for scene classification.
3. Color and size priors are used to take into account the features of the scene and of the objects to be detected.
4. Steps for post-processing the saliency map to prune outliers are proposed.
3 METHODOLOGY
Referring to the overview in Fig. 1, the proposed approach consists of scene classification as pre-processing; generation of a saliency map, which is thresholded to identify candidates for the dumped waste; and post-processing to prune false alarms.
3.1 Pre-processing
UAVs are often tasked to survey a building for illegal dumping, and it is common that the building is surrounded by meadows and trees. Since debris is most likely to be found on the meadow and the roof, the aerial images are classified into three categories, namely meadow, roof, and tree images. Salient object detection is only carried out on the meadow and roof images. In this way, not only is the number of false alarms reduced, but computational time is also saved by discarding the tree images.
In scene classification, green regions are detected first, and then blob detection is performed on these regions to count the number of clusters. If there are few clusters, the scene is probably a meadow. In contrast, the existence of more clusters indicates that the green regions are not connected, and hence the image is likely a tree image. If no green region is found, the probability that the image depicts the roof is high.
The green region detection is based on [14]. The input RGB image is transformed to a grayscale image by

I = 0.2989 × R + 0.5870 × G + 0.1140 × B    (1)

where R, G, B are the red, green, and blue channels of the input and I is the grayscale image. Then I is subtracted from G to obtain the green part of the image, followed by a median filter to suppress noise. The resulting image is converted to a binary one using the threshold thresh_gr. Afterwards, blob detection is performed on the binary image, and green regions whose area is smaller than thresh_ga are discarded as noise. Suppose the number of remaining blobs is k; then the scene is classified as meadow if 0 < k ≤ thresh_k, or tree if k > thresh_k. If k = 0, the scene depicts the roof. In the experiments, the parameters are set to thresh_gr = 0.07, thresh_ga = 1000, and thresh_k = 3.
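To make this classification rule concrete, the following is a minimal Python sketch of the procedure (the paper's implementation is in MATLAB; the use of OpenCV and the function name `classify_scene` are our own assumptions):

```python
import cv2
import numpy as np

def classify_scene(bgr, thresh_gr=0.07, thresh_ga=1000, thresh_k=3):
    """Classify an aerial image as 'meadow', 'tree', or 'roof'
    by counting connected green blobs, following Section 3.1."""
    b, g, r = cv2.split(bgr.astype(np.float32) / 255.0)
    # Eq. (1): standard luminance-weighted grayscale conversion.
    gray = 0.2989 * r + 0.5870 * g + 0.1140 * b
    # Subtract the grayscale image from the green channel to isolate
    # green areas, then suppress noise with a median filter.
    green = cv2.medianBlur(g - gray, 5)
    binary = (green > thresh_gr).astype(np.uint8)
    # Blob detection via connected components; small blobs are noise.
    n, labels, stats, _ = cv2.connectedComponentsWithStats(binary)
    areas = stats[1:, cv2.CC_STAT_AREA]      # skip background label 0
    k = int(np.sum(areas >= thresh_ga))
    if k == 0:
        return "roof"
    return "meadow" if k <= thresh_k else "tree"
```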
Figure 1: Overview of the proposed approach for salient object detection.
3.2 Saliency detection
A weak saliency map is constructed using color and size priors. For the color prior, two cases are considered: meadow and roof, since it is common to find debris on a meadow, and detecting debris on a roof with UAVs is important because roofs are out of view and difficult for people to inspect. The color prior of a pixel p for the meadow is defined as

S_c(p) = √((h(p) − H_m)² + (s(p) − S_m)² + (v(p) − V_m)²)    (2)

where h, s, v are the Hue, Saturation, and Value channels of the image in the HSV colorspace, and H_m, S_m, V_m are the typical values for the meadow, set to H_m = 0.2, S_m = 1, V_m = 0.5. This color prior essentially computes the distance of a pixel from the green color. In this way, non-green objects are assigned higher weights, and thus they are more likely to be chosen as positive training samples.
Similarly, the color prior for the roof is S_c(p) = s(p), the value of the Saturation channel. Based on the observation that the Saturation of a roof is low, the higher the Saturation, the more likely the pixel is not part of the roof, indicating that it belongs to objects such as debris.
The superpixels proposed in [15] are computed on the image to exploit the size prior, defined as S_s(p) = 1 − A, where A is the normalized area of the superpixel that pixel p belongs to. The size prior penalizes large areas, because such regions often correspond to buildings and are unlikely to be dumped objects.
Besides the color and size priors, another useful criterion for determining salient objects is the difference from the image border, under the assumption that the majority of the border region contains only background. To compute the distance of each superpixel to those close to the boundary, the RGB, CIELab, and Local Binary Pattern (LBP) features are used. The per-pixel saliency is then obtained from

S_weak(p) = S_c(p) × S_s(p) × d(p)    (3)

where S_c(p), S_s(p), and d(p) are the color prior, the size prior, and the superpixel difference from the image border, respectively. This weak saliency map is smoothed by Graph Cut.
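A minimal Python sketch of Eqs. (2)-(3), using scikit-image (our own choice). The border-distance term d(p) is stubbed out as uniform and the Graph Cut smoothing is omitted; the paper computes d(p) from RGB, CIELab, and LBP superpixel features:

```python
import numpy as np
from skimage.color import rgb2hsv
from skimage.segmentation import slic

# Typical meadow HSV values from Section 3.2.
H_M, S_M, V_M = 0.2, 1.0, 0.5

def weak_saliency(rgb, n_segments=200):
    """Per-pixel weak saliency from color and size priors."""
    hsv = rgb2hsv(rgb)
    # Eq. (2): Euclidean distance from the typical meadow color in HSV.
    s_c = np.sqrt((hsv[..., 0] - H_M) ** 2
                  + (hsv[..., 1] - S_M) ** 2
                  + (hsv[..., 2] - V_M) ** 2)
    # SLIC superpixels [15]; the size prior penalizes large segments.
    labels = slic(rgb, n_segments=n_segments, start_label=0)
    areas = np.bincount(labels.ravel()) / labels.size  # normalized areas
    s_s = 1.0 - areas[labels]
    d = np.ones_like(s_c)       # placeholder for the border-distance term
    s_weak = s_c * s_s * d      # Eq. (3)
    return s_weak / (s_weak.max() + 1e-12)
```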
Next, the weak saliency map is used to generate training samples for the strong saliency map. Superpixels whose average saliency values are below a lower threshold are selected as negative training samples, whereas those with average saliency values exceeding a higher threshold are chosen as positive training samples.
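The selection step might look as follows in Python; the threshold values `lo` and `hi` below are illustrative rather than taken from the paper, and `features` is assumed to hold one descriptor vector per superpixel:

```python
import numpy as np

def select_training_samples(s_weak, labels, features, lo=0.2, hi=0.7):
    """Pick positive/negative superpixels from the weak saliency map."""
    n = labels.max() + 1
    # Mean weak saliency per superpixel.
    sums = np.bincount(labels.ravel(), weights=s_weak.ravel(), minlength=n)
    counts = np.bincount(labels.ravel(), minlength=n)
    mean_sal = sums / np.maximum(counts, 1)
    pos = features[mean_sal > hi]   # confident foreground samples
    neg = features[mean_sal < lo]   # confident background samples
    X = np.vstack([pos, neg])
    y = np.hstack([np.ones(len(pos)), -np.ones(len(neg))])
    return X, y
```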
Since it is unclear how best to integrate different features for training, Multiple Kernel Boosting (MKB), presented in [16], is employed. Several weak classifiers, namely Support Vector Machines (SVMs) with linear, polynomial, RBF, and sigmoid kernels, are combined to form a strong classifier. The details of the boosting process are described in [13]. The aggregated classifier is then used to generate a pixel-wise strong saliency map S_strong, which is smoothed by Graph Cut and a guided filter.
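The exact MKB formulation is given in [16] and [13]; as an illustration only, here is a simplified AdaBoost-style stand-in of our own construction, which at each round picks the best-performing single-kernel SVM and reweights the samples:

```python
import numpy as np
from sklearn.svm import SVC

KERNELS = ("linear", "poly", "rbf", "sigmoid")

def train_mkb(X, y, n_rounds=4):
    """Boost a pool of single-kernel SVMs into a strong classifier.
    Simplified sketch; see [13] for the exact boosting procedure."""
    w = np.full(len(y), 1.0 / len(y))           # sample weights
    ensemble = []
    for _ in range(n_rounds):
        best = None
        for k in KERNELS:
            clf = SVC(kernel=k).fit(X, y, sample_weight=w)
            err = np.sum(w * (clf.predict(X) != y))
            if best is None or err < best[1]:
                best = (clf, err)
        clf, err = best
        err = np.clip(err, 1e-6, 0.499)
        alpha = 0.5 * np.log((1 - err) / err)   # classifier weight
        w *= np.exp(-alpha * y * clf.predict(X))
        w /= w.sum()                            # re-normalize weights
        ensemble.append((alpha, clf))
    return ensemble

def predict_strong(ensemble, X):
    """Signed ensemble score, used as the strong saliency response."""
    return sum(a * clf.decision_function(X) for a, clf in ensemble)
```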
To address the multiscale issue, superpixels at four different granularities are generated, producing four pairs of weak and strong saliency maps. Averaging these maps gives the multiresolution weak and strong saliency maps.
To summarize, a weak saliency map is generated based on priors, and a strong saliency map is then constructed using the training samples. The former captures fine details due to its bottom-up nature, while the latter focuses on global shapes. As the two maps are complementary, they are integrated to produce the final saliency map as

S_aggregated = (S_weak + S_strong) / 2    (4)

The final saliency map S_aggregated is thresholded by thresh_sa to identify objects that possess a high saliency value.
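Eq. (4) and the thresholding step amount to a few lines (thresh_sa = 0.1 is the value used in the paper's experiments):

```python
def aggregate(s_weak, s_strong, thresh_sa=0.1):
    """Eq. (4): average the two maps, then threshold."""
    s = 0.5 * (s_weak + s_strong)
    return s, s > thresh_sa     # saliency map and candidate mask
```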
3.3 Post-processing
The post-processing procedure is described in Algorithm 1. The first step is to detect homogeneous regions. This is necessary to exclude false alarms on non-homogeneous regions, such as the basketball court beside the meadow, or the sidewall on the roof.
The meadow is considered first: green region detection, as described in the pre-processing step, is conducted.
Algorithm 1 Post-processing algorithm
1: Detect homogeneous region
2: if the scene is meadow then
3:     Detect green region
4:     thresh_gr ← 0.05
5: end if
6: if the scene is roof then
7:     Threshold the HSV image
8:     thresh_sl ← 0, thresh_su ← 0.1
9: end if
10: Retain the salient objects on the homogeneous region
11: for each salient object candidate do
12:     if Rectangularity < thresh_rt then
13:         Remove outlier
14:     end if
15:     if Area < thresh_al or Area > thresh_au then
16:         Remove outlier
17:     end if
18:     if ECD < thresh_dl or ECD > thresh_du then
19:         Remove outlier
20:     end if
21: end for
For the roof, the image is converted from the RGB colorspace to the HSV colorspace and thresholded using pre-defined lower and upper thresholds on the Saturation channel, thresh_sl = 0 and thresh_su = 0.1, since the roof has very low saturation. The holes in these binary maps are filled to produce masks for the homogeneous regions.
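The roof mask (Algorithm 1, steps 6-9, plus the hole filling) can be sketched in Python as follows; the function name and library choices are our own:

```python
import numpy as np
from scipy.ndimage import binary_fill_holes
from skimage.color import rgb2hsv

def roof_mask(rgb, thresh_sl=0.0, thresh_su=0.1):
    """Homogeneous-region mask for the roof case: threshold the
    Saturation channel and fill holes in the binary map."""
    s = rgb2hsv(rgb)[..., 1]
    mask = (s >= thresh_sl) & (s <= thresh_su)
    return binary_fill_holes(mask)
```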
Even though salient objects can be detected on the homogeneous regions, some outliers may still exist, such as the boundaries of the meadow or the building. To remove these outliers, the rectangularity of each detected object is taken into account, since the garbage usually consists of boxes or appliances that are rectangular. Rectangularity is computed as the ratio of the pixels belonging to the object to the total pixels in its bounding box. Objects whose rectangularity is lower than the threshold thresh_rt are discarded.
Besides rectangularity, size is also taken into account to prune objects that are either too small, such as shadows, or too large, such as an entire wall. The size parameters include the area and the equivalent circular diameter (ECD), computed as √(4 × area / π), which specifies the diameter of a circle with the same area as the detected object. Their lower and upper thresholds are thresh_al, thresh_au and thresh_dl, thresh_du respectively.
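A minimal Python sketch of the pruning loop (Algorithm 1, steps 10-21), using scikit-image and assuming boolean masks as inputs; the default thresholds are those of the meadow experiment in Section 4.3:

```python
import numpy as np
from skimage.measure import label, regionprops

def prune_outliers(candidate_mask, homog_mask, thresh_rt=0.5,
                   thresh_al=30, thresh_au=600,
                   thresh_dl=20, thresh_du=40):
    """Keep candidates on the homogeneous region and filter them
    by rectangularity, area, and equivalent circular diameter."""
    kept = np.zeros_like(candidate_mask, dtype=bool)
    for r in regionprops(label(candidate_mask & homog_mask)):
        rectangularity = r.extent               # area / bounding-box area
        ecd = np.sqrt(4.0 * r.area / np.pi)     # equivalent circular diameter
        if (rectangularity >= thresh_rt
                and thresh_al <= r.area <= thresh_au
                and thresh_dl <= ecd <= thresh_du):
            minr, minc, maxr, maxc = r.bbox
            kept[minr:maxr, minc:maxc] |= r.image   # keep this object
    return kept
```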
4 EXPERIMENT
4.1 Image collection
The images used in the experiments are captured when the
UAV surveys a building surrounded by a meadow, as shown
Figure 2: Stitched map of the test site, generated by
Pix4Dmapper. The locations of plastic bag and tree branches
are marked in red.
in Fig. 2. Black plastic bags and tree branches are placed on the meadow and the roof to simulate illegal dumping. The UAV operates at about 35 m, and the height of the building is 18 m. In other words, the flight height above the ground for garbage detection on the meadow is about 35 m, while the height above the roof is about 17 m.
The camera carried on-board is a Sony A6000 with a focal length of 16 mm. The original resolution of the images is 6000 × 4000; to reduce computational time, they are downsampled to 640 × 427.
4.2 Scene classification
Figure 3: Scene classification results. From left to right: original image, binary image after green region detection, classification result with blobs marked in red.
The images are classified into different categories prior to salient object detection. From Fig. 3, it is evident that the green regions can be effectively detected in the images collected by the UAV. Moreover, the number of blobs in the meadow image is smaller than in the tree image, where the green regions tend to be discontinuous, and there is no green region in the roof image. With the help of the scene classification, illegal dumping detection is only performed on the meadow and roof images to save computational time. The images used for the experiments are available at https://github.com/shanmo/IMAV2016-Dataset.
4.3 Garbage detection on the meadow
Figure 4: Garbage detection result on the meadow. a: original image. b: saliency map. c: mask image of the meadow region. d: result of detected garbage marked in red. Best viewed in color.
This section presents the experimental results for illegal dumping detection on a meadow, where black plastic bags are placed on the meadow as the dumped waste. The rectangularity threshold thresh_rt is set to 0.5. The size thresholds are thresh_al = 30, thresh_au = 600, and thresh_dl = 20, thresh_du = 40 respectively. The saliency threshold is thresh_sa = 0.1.
As shown in Fig. 4, there is a plastic bag on the meadow, which can be clearly observed in the saliency map. Moreover, the mask image of the meadow effectively covers the entire meadow region, so that salient objects on the basketball court can be removed as outliers. After post-processing, the boundaries of the meadow are further removed, and the resulting saliency map only contains the plastic bag.
For comparison, using the same original image as Fig. 4, the saliency maps produced by other methods are displayed in Fig. 5, generated by the approaches proposed in [6, 9] and the original version of [13] respectively. Since the dumped waste is not in the center, the methods that emphasize the center bias, for instance [6] and [13], may not work well.
Figure 5: Comparison of saliency maps. From left to right:
saliency maps generated by [6], [9], [13].
Figure 6: Garbage detection result on the roof. a: original
image. b: saliency map. c: mask image of the roof region.
d: result of detected garbage marked in red. Best viewed in
color.
4.4 Garbage detection on the roof
This section presents the results of the proposed approach for garbage detection on the roof, where the dumped waste is tree branches. The rectangularity threshold thresh_rt is set to 0.6, higher than in the previous experiment, because the boundaries of the water pond on the roof have irregular shapes and produce many outliers. The size thresholds are thresh_al = 300, thresh_au = 600, and thresh_dl = 20, thresh_du = 30. The saliency threshold is also set to thresh_sa = 0.1, as in the meadow case.
It can be observed in Fig. 6 that although the tree branches are not as salient as the plastic bag shown in Fig. 4, they can still be detected.
Figure 7: Comparison of saliency maps. From left to right:
saliency maps generated by [6], [9], [13].
To compare the results of different saliency detection algorithms, the same image as in Fig. 6 is used, and the saliency maps are shown in Fig. 7. Similar to what is observed in Fig. 5, the algorithms assuming that the object is close to the image center fail to detect the tree branches.
4.5 Discussion
Some may contend that, instead of using saliency, HSV thresholds could be applied to the meadow or roof regions to detect the plastic bags or tree branches, as they possess distinctive colors such as black or brown. Admittedly, color thresholding may work well, as the meadow and roof regions are quite homogeneous. However, the type of garbage is not limited to these two kinds, and thus saliency is preferable to color thresholding because of its capability to detect all kinds of dumped waste.
Though the proposed approach is extensible to detecting varied kinds of garbage, there are also certain limitations associated with its extensibility. For instance, the lighting pole on the meadow is sometimes detected as dumped waste, because it stands out from the scene.
Another drawback of our work is the limited dataset size, which contains few images per category, so the number of images with debris in the scene is very small. Hence it is difficult to evaluate the proposed approach quantitatively. To obtain statistically meaningful results, a larger dataset is needed. In addition, the small dataset also hinders the use of CNNs for scene classification and garbage detection, as the training samples are quite limited. One possible way to resolve this issue is to apply transfer learning.
5 CONCLUSION
In this paper, a new saliency detection algorithm is proposed, which consists of pre-processing, saliency detection, and post-processing. In the first stage, the image is classified based on the number of green blobs. In the second stage, color and size priors are designed to obtain a weak saliency map, which is used to generate training samples for a boosted classifier. The classifier then produces a strong saliency map, which is fused with the weak saliency map. In the third stage, outliers that are located on non-homogeneous regions, or that are inconsistent with the size requirements, are pruned. Experimental results show that the proposed approach can perform illegal dumping detection in aerial images captured above the meadow and the roof. The proposed approach could be extended to other salient object detection applications as well.
ACKNOWLEDGMENT
The authors would like to thank the members of the NUS UAV Research Group for their kind support, and Prof. Lu Huchuan for generously sharing their code.
REFERENCES
[1] Christian Forster, Matia Pizzoli, and Davide Scaramuzza. SVO: Fast semi-direct monocular visual odometry. In Robotics and Automation (ICRA), 2014 IEEE International Conference on, pages 15–22. IEEE, 2014.
[2] Stefan Leutenegger, Simon Lynen, Michael Bosse, Roland Siegwart, and Paul Furgale. Keyframe-based visual–inertial odometry using nonlinear optimization. The International Journal of Robotics Research, 34(3):314–334, 2015.
[3] Davide Scaramuzza, Michael C. Achtelik, Lefteris Doitsidis, Friedrich Fraundorfer, Elias Kosmatopoulos, Agostino Martinelli, Markus W. Achtelik, Maria Chli, Savvas A. Chatzichristofis, Laurent Kneip, et al. Vision-controlled micro flying robots: from system design to autonomous navigation and mapping in GPS-denied environments. IEEE Robotics & Automation Magazine, 21(3):26–40, 2014.
[4] Kate Duncan and Santonu Sarkar. Saliency in images and video: a brief survey. IET Computer Vision, 6(6):514–523, 2012.
[5] Laurent Itti, Christof Koch, and Ernst Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1254–1259, 1998.
[6] Jonathan Harel, Christof Koch, and Pietro Perona. Graph-based visual saliency. In Advances in Neural Information Processing Systems, pages 545–552, 2006.
[7] Ming-Ming Cheng, Niloy J. Mitra, Xiaolei Huang, Philip H. S. Torr, and Shi-Min Hu. Global contrast based salient region detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(3):569–582, 2015.
[8] Xiaodi Hou and Liqing Zhang. Saliency detection: A spectral residual approach. In Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on, pages 1–8. IEEE, 2007.
[9] Guanbin Li and Yizhou Yu. Visual saliency based on multiscale deep features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5455–5463, 2015.
[10] Tie Liu, Zejian Yuan, Jian Sun, Jingdong Wang, Nanning Zheng, Xiaoou Tang, and Heung-Yeung Shum. Learning to detect a salient object. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(2):353–367, 2011.
[11] Huaizu Jiang, Jingdong Wang, Zejian Yuan, Yang Wu, Nanning Zheng, and Shipeng Li. Salient object detection: A discriminative regional feature integration approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2083–2090, 2013.
[12] Simone Frintrop, Erich Rome, and Henrik I. Christensen. Computational visual attention systems and their cognitive foundations: A survey. ACM Transactions on Applied Perception (TAP), 7(1):6, 2010.
[13] Na Tong, Huchuan Lu, Xiang Ruan, and Ming-Hsuan Yang. Salient object detection via bootstrap learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1884–1892, 2015.
[14] Arindam Bose. How to detect and track red, green and blue objects in live video, 2013–2014.
[15] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, and Sabine Süsstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11):2274–2282, 2012.
[16] Fan Yang, Huchuan Lu, and Yen-Wei Chen. Human tracking by multiple kernel boosting with locality affinity constraints. In Computer Vision – ACCV 2010, pages 39–50. Springer, 2010.