Digital Image Stabilization Technique for Fixed
Camera on Small Size Drone
Ekkaphon Mingkhwan
Control and Communication Division
Defence Technology Institute (Public Organization)
Nontaburi, Thailand
ekkaphon.m@dti.or.th
Weerawat Khawsuk
Control and Communication Division
Defence Technology Institute (Public Organization)
Nontaburi, Thailand
weerawat.k@dti.or.th
Abstract—This paper explores a digital image-processing algorithm to stabilize videos recorded from a fixed camera (without mechanical stabilization hardware) on a small drone. In particular, the paper focuses on an implementation of the Speeded-Up Robust Features (SURF) method. The fundamental concept is to match two images, one obtained from the current frame and another from the previous (or reference) frame. The matching is achieved by locating common keypoints in the current and reference frames and associating them with each other. A transformation is then applied to translate and rotate the current frame so that its keypoints remain in the same positions, or as close as possible, to those of the reference frame. Various video samples are used to validate the efficiency of the SURF method. The scenarios include videos recorded under normal lighting and videos with partial shadows on the image. Movement caused by the drone's engine and by environmental wind is also studied. The results indicate that the SURF method can be used to stabilize image frames so that the processed video becomes smoother and more suitable for viewing.
Keywords—SURF, Video Stabilization, Matching Estimation, Warping
I. INTRODUCTION
Nowadays, unmanned aerial vehicles (UAVs), or drones, are widely used for landscape surveying and aerial imaging. These UAVs come in various sizes depending on the application. A larger UAV can carry more payload, such as a high-quality camera and a stabilizing device (or gimbal). A gimbal compensates for the UAV's attitude by rotating in the opposite direction, so that the camera's view remains unchanged (with respect to the Earth's axes) and the resulting video appears smoother. For a smaller UAV, on the other hand, there is not enough space to install both a video camera and a gimbal; typically, the camera is fixed to the UAV's body. When the UAV is in operation, engine vibration and shaky maneuvers caused by uncontrolled environmental disturbances directly produce unstable video images.
Video from a fixed camera on a small UAV is therefore not smooth enough for comfortable viewing. Small objects and their movement can be difficult to notice. Moreover, when continually watching such unstable video, a viewer can develop motion sickness or nausea; these symptoms are caused by a lack of coordination between the viewing position and body balance. Hence, if the problems caused by unstable images from a fixed camera can be solved or reduced without adding extra payload, the video quality of a small UAV used for landscape surveying will be efficiently improved.
Image stabilization can be achieved with mechanical devices, optical sensors, or digital methods. For digital stabilization, there are various well-developed algorithms, such as the Scale-Invariant Feature Transform (SIFT) and the Speeded-Up Robust Features (SURF) methods. In particular, this work focuses on an implementation of the SURF method.
The organization of this paper is as follows. The literature review in Section II summarizes motion in mobile videos, motion models, image stabilization techniques, feature detection methods, the SURF algorithm, and related research. Section III describes the implementation, consisting of the experimental setup, the stabilization process, and the performance measures. Results and discussion are presented in Section IV. Finally, concluding remarks are given in Section V.
II. LITERATURE REVIEW
A. Motion in Mobile Videos
When an untrained person records video, various types of motion act on the hand-held camera. In general, object motion or camera movement causes the video images to shake or drift. There are seven typical motions in video images, described as follows [1, 2].
Track is a left or right translation in a horizontal or X-
direction.
Boom is an up or down translation in a vertical or Y-
direction.
Dolly is a forward or backward translation in the camera
axis direction.
Pan is a left or right turn around the vertical axis.
Tilt is an up or down turn around the horizontal axis.
Roll is a rotation around the camera axis direction.
Zoom is not a camera position movement, but a change
of camera focal length causing changes of image size.
Camera movement about each axis causes motion in the recorded video. Therefore, to generate stabilized video images, we should estimate the camera movement and then compensate for it in the opposite direction.
B. Motion Model
A 2D or 3D motion model is used to represent camera movement. A 2D model describes various transformations of an image occurring in a 2D plane, i.e. in $(x, y)$ coordinates. Denote the original location $X$, the new location $X'$, and the translation distance $t$ in each direction as
$$X = \begin{bmatrix} x \\ y \end{bmatrix}, \qquad X' = \begin{bmatrix} x' \\ y' \end{bmatrix}, \qquad t = \begin{bmatrix} t_x \\ t_y \end{bmatrix} \qquad (1)$$
The transformations for movement are modeled as follows [1].
1) Translation can be written as $X' = X + t$.
2) Rotation and translation, also known as rigid-body motion, can be written as $X' = RX + t$, where
$$R = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \qquad (2)$$
is a rotation matrix at angle $\theta$ with respect to the X axis.
3) Scaled rotation, also known as a similarity transform, preserves the angles of the original image in its new (scaled) version. This transformation can be described as $X' = sRX + t$, where the scaled rotation matrix is
$$sR = \begin{bmatrix} a & -b \\ b & a \end{bmatrix} \qquad (3)$$
Here it is not necessary that $a^2 + b^2 = 1$.
4) Affine transformation preserves parallel lines between the images. The relation is given as $X' = AX$ (with $X$ expressed in homogeneous form $[x, y, 1]^T$, since $A$ is a $2 \times 3$ matrix), where
$$A = \begin{bmatrix} a_{00} & a_{01} & a_{02} \\ a_{10} & a_{11} & a_{12} \end{bmatrix} \qquad (4)$$
5) Projective transformation operates on homogeneous coordinates of $X$ and $X'$ such that
$$\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} h_{00} & h_{01} & h_{02} \\ h_{10} & h_{11} & h_{12} \\ h_{20} & h_{21} & h_{22} \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \qquad (5)$$
To obtain a homogeneous coordinate, the new location $X'$ is normalized by enforcing the condition $h_{20}x + h_{21}y + h_{22} = 1$, as shown in the equation above.
The projective transformation is a combination of the affine transformation and projective warps, and it uses nine parameters for estimation. The affine transformation, on the other hand, uses six parameters and preserves parallel lines and image ratios. Hence the affine transformation, requiring fewer parameters, is more suitable for motion estimation.
Fig. 1: Illustration of similarity (scaled rotation), affine, and
projective transformations [4].
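Which model to fit is ultimately an implementation choice. As one possible illustration (not taken from the paper), the similarity and affine models above can be estimated from matched keypoint pairs with OpenCV; the point coordinates below are synthetic placeholders.

```python
import numpy as np
import cv2

# Matched keypoint locations in the reference frame (src) and the current
# frame (dst); in practice these come from a feature matcher.
src = np.array([[10, 10], [200, 40], [50, 180], [220, 200]], dtype=np.float32)
dst = np.array([[12, 14], [203, 43], [52, 185], [223, 204]], dtype=np.float32)

# Similarity (scaled rotation + translation): 4 degrees of freedom, Eq. (3).
M_sim, _ = cv2.estimateAffinePartial2D(src, dst, method=cv2.RANSAC)

# Full affine: 6 degrees of freedom, preserves parallel lines, Eq. (4).
M_aff, _ = cv2.estimateAffine2D(src, dst, method=cv2.RANSAC)

print("similarity:\n", M_sim)   # 2x3 matrix [sR | t]
print("affine:\n", M_aff)       # 2x3 matrix [A | t]
```

Both estimators use RANSAC to reject mismatched keypoint pairs before solving for the transform parameters.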
C. Image Stabilization Techniques
In general, a video quality will be improved if one can
reduce the image motion. One of practical reduction tech-
niques is to re-order frames of video images. This technique
consists of 2 stages as: 1) a motion estimation of image frames
and 2) a removal of undesired inter-frame motion as well as
any distortion due to movement [5, 6]. The process at each
stage can be achieved either by a hardware or a software. The
hardware implementation is performed via a mechanical image
stabilization or an optical image stabilization. On the other
hand, a software implementation stabilizes image digitally. We
can classify techniques of image stabilization as follows [1].
Fig. 2: (a) Gyroscopic sensor; (b) 3-axis camera gimbal.
1) Mechanical image stabilization: This approach uses a gimbal, comprising gyroscopic sensors and a mechanical system, as illustrated in Figure 2, to stabilize the video image. The gyroscopic sensors provide the tilt angle of the platform about each of the three axes. The mechanical system of the gimbal compensates for these angles by tilting the camera in the opposite direction in order to preserve the same image view (relative to the Earth's axes) at all times.
Fig. 3: Sensor-shift (left) and lens-shift (right) methods [7].
2) Optical image stabilization: When a ray of light passes through the aperture and camera lens onto a charge-coupled device (CCD) sensor, electrical signals are generated and then transformed into an image. The distance between the lens and the CCD sensor is critical to image quality. There are two implementations of optical image stabilization, the lens-shift method and the sensor-shift method [7], as illustrated in Figure 3. Using information from motion sensors installed within the camera, the first method adjusts the lens position so that the image formed on the CCD sensor stays at the same position while the camera is in motion. The second method instead adjusts the position of the CCD sensor so that the same image is formed as the camera moves.
3) Digital video stabilization: This is a post-processing technique with various implementations. For example, the path of the camera movement may be estimated to compensate for its motion, or additional frames may be introduced so that image features (keypoints) move more smoothly. Note that a video image appears to move because the keypoints change their positions in each frame. This technique therefore tries to keep the keypoints of the current frame at exactly, or as close as possible to, their positions in the previous frame. The feature detection depends on processing speed, image size, and image orientation, and its efficiency remains a major challenge. The digital video stabilization technique consists of three stages: 1) motion estimation, 2) motion compensation to obtain a smooth image, and 3) image warping to obtain the same viewing angle [1, 5], as demonstrated in Figure 4.
Fig. 4: Demonstration of digital video stabilization: (a) original, (b) unstabilized, (c) motion compensated, (d) warped image.
D. Feature Detection Method
Feature detection becomes a crucial part in the video
stabilization process. Various methods have been extensively
studied to detect image features with higher efficiency. Examples of feature detection methods are as follows.
1) Scale-invariant feature transform (SIFT): This method
transforms an image into a large collection of local feature vec-
tors, each of which is invariant to image translation, scaling,
rotation, affine, and projection [8]. The SIFT parameters for an
image feature are partially related to those parameters obtained
from the affine transform. There are 4 major computational
stages to generate sets of these image features [9]:
(1) Scale-space extrema detection searches over all image
locations and scales for its extrema. The efficient imple-
mentation uses a Difference-of-Gaussian (DoG) function
to identify keypoints invariant to scale and orientation.
(2) Keypoint localization and filtering refines the locations of important keypoints and discards insignificant ones.
(3) Orientation assignment assigns an orientation to each keypoint based on the local image gradient directions. The effects of rotation and scale are thereby removed, so the resulting features are invariant to these transformations.
(4) Keypoint descriptor creation generates a descriptor from histograms of local gradient orientations.
The SIFT method detects features with DoG filters at various levels, as shown in Figure 5. It produces characteristics that are invariant to common image transformations, which makes the method suitable for images taken from different viewpoints.
Fig. 5: DoG filters at various levels in the SIFT method [9]
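For reference, recent OpenCV releases ship a SIFT implementation, so the DoG-based detection described above can be tried with a few lines; the frame path below is a placeholder, not a file from the paper's experiments.

```python
import cv2

# Placeholder path; any grayscale frame extracted from a video will do.
img = cv2.imread("frame_0001.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()                       # DoG-based detector/descriptor
keypoints, descriptors = sift.detectAndCompute(img, None)

print(len(keypoints), "keypoints,", descriptors.shape[1], "values each")  # 128-D descriptors
vis = cv2.drawKeypoints(img, keypoints, None,
                        flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)
cv2.imwrite("sift_keypoints.png", vis)
```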
2) Affine scale-invariant feature transform (ASIFT): This method extends the SIFT method so that it is also invariant to the camera-axis orientation parameters, namely the longitude and latitude angles [6]. Figure 6 illustrates the keypoint association for a magazine cover page determined by the SIFT and ASIFT methods. It is evident that when the reference image changes in size and orientation, the ASIFT method associates more keypoints than the SIFT method does.
Fig. 6: Comparison of keypoint association obtained by SIFT
and ASIFT methods [10]
3) Principal Component Analysis SIFT (PCA-SIFT): This method applies the standard PCA technique to reduce data dimensionality. It uses a 36-dimensional descriptor rather than the 128-dimensional descriptor of the SIFT method. This data reduction makes keypoint detection and matching with PCA-SIFT faster [11].
4) Speeded-Up Robust Features (SURF): This method builds upon the SIFT method to respond to changes in image size and orientation. Instead of a DoG function, Haar wavelets are used to approximate the Laplacian of Gaussian (LoG), resulting in more accurate detection [12].
Each of the feature detection methods described above aims to improve detection efficiency. The SURF method is faster and more accurate than the SIFT method; it responds better to changes in image size, movement and speed, illumination, and orientation [14]. Therefore, the SURF method is more suitable for detecting keypoints in the moving and unstable images obtained from a small UAV's video camera.
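As a hedged sketch, SURF keypoints can be extracted with OpenCV's contrib module (the detector lives in xfeatures2d and requires a build with the non-free modules enabled); the Hessian threshold of 400 is an illustrative value, not one prescribed by the paper.

```python
import cv2

img = cv2.imread("frame_0001.png", cv2.IMREAD_GRAYSCALE)   # placeholder frame

# SURF is in the contrib 'xfeatures2d' module; the build must enable
# the non-free algorithms (OPENCV_ENABLE_NONFREE).
surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
keypoints, descriptors = surf.detectAndCompute(img, None)

print(len(keypoints), "SURF keypoints, descriptor length:", descriptors.shape[1])  # 64 by default
```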
E. SURF Algorithm
The SURF detection method is faster than the SIFT method
because it can determine keypoints from integral images rather
than the original images. An integral image $I(X)$ at location $X = (x, y)$ stores the sum of all pixel intensities in the rectangular area formed by the point $X$ and the image origin:
$$I(X) = \sum_{i=0}^{x} \sum_{j=0}^{y} I(i, j) \qquad (6)$$
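Equation (6) can be evaluated for a whole image with two cumulative sums, after which the sum over any rectangle needs only four lookups; a minimal NumPy sketch (the paper does not prescribe an implementation):

```python
import numpy as np

def integral_image(img):
    """Integral image: entry (y, x) holds the sum of all pixels above and to the left."""
    return np.cumsum(np.cumsum(img.astype(np.float64), axis=0), axis=1)

def box_sum(ii, y0, x0, y1, x1):
    """Sum of pixels in the rectangle [y0..y1] x [x0..x1] from four integral-image lookups."""
    s = ii[y1, x1]
    if y0 > 0:
        s -= ii[y0 - 1, x1]
    if x0 > 0:
        s -= ii[y1, x0 - 1]
    if y0 > 0 and x0 > 0:
        s += ii[y0 - 1, x0 - 1]
    return s
```

This constant-time box sum is what makes the box-filter approximations of the next stage cheap regardless of filter size.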
There are 4 stages for the SURF algorithm as follows.
1) Keypoint detection: This stage computes the determinant of the Hessian matrix $H(X, \sigma)$ at location $X$ and scale $\sigma$ to represent the local change around an interest area:
$$H(X, \sigma) = \begin{bmatrix} L_{xx}(X, \sigma) & L_{xy}(X, \sigma) \\ L_{yx}(X, \sigma) & L_{yy}(X, \sigma) \end{bmatrix}, \qquad (7)$$
$$L_{xy}(X, \sigma) = I(X) * \frac{\partial^2}{\partial x \, \partial y} g(\sigma), \qquad (8)$$
where $L_{xy}(X, \sigma)$ is the convolution ($*$) of the integral image $I(X)$ with the second-order partial derivative of the Gaussian $g(\sigma)$ with respect to the $x$ and $y$ directions.
The SURF algorithm approximates these derivatives with rectangular box filters. These approximations can be evaluated very fast using integral images, independently of the filter size [13]. The 9×9 box filters with $\sigma = 1.2$ represent the lowest scale (i.e. the highest spatial resolution). Denote by $D_{xx}$, $D_{yy}$, and $D_{xy}$ the discretized approximations of $L_{xx}(X, \sigma)$, $L_{yy}(X, \sigma)$, and $L_{xy}(X, \sigma)$, respectively. The determinant of the approximated Hessian matrix then becomes
$$\det(H_{\mathrm{approx}}) = D_{xx} D_{yy} - (w D_{xy})^2 \qquad (9)$$
The weight $w$ applied to the rectangular regions is kept simple for computational efficiency; the relative weight is adjusted to $w = 0.9$ to balance the terms in the Hessian determinant [14].
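The response of Eq. (9) can be sketched as follows. For brevity, the second-order derivatives are computed here with true Gaussian derivative filters from SciPy rather than the 9×9 box-filter approximations used by SURF; only the weight w = 0.9 is kept from the description above.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def hessian_determinant(img, sigma=1.2, w=0.9):
    """Approximate determinant-of-Hessian response map at scale sigma."""
    img = img.astype(np.float64)
    Dxx = gaussian_filter(img, sigma, order=(0, 2))   # d^2/dx^2 (axis 1 is x)
    Dyy = gaussian_filter(img, sigma, order=(2, 0))   # d^2/dy^2 (axis 0 is y)
    Dxy = gaussian_filter(img, sigma, order=(1, 1))   # d^2/(dx dy)
    return Dxx * Dyy - (w * Dxy) ** 2
```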
2) Keypoint localization: This stage finds the local extrema (maxima or minima) of the response by comparing each candidate point with its nearest neighbors. It builds a pyramid of LoG response maps with several levels per octave. An octave represents a series of filter response maps obtained by convolving the same input image with filters of increasing size; in total, an octave encompasses a scaling factor of 2, and each octave is subdivided into a constant number of scale levels. The output of the 9×9 filter is considered the initial scale level. The following levels are obtained by filtering the image with gradually bigger masks, taking into account the discrete nature of integral images and the filter structure. Figure 7 shows the increase of filter size from 9 to 15 for the 1st octave, where the top and bottom rows represent the discretized approximations of $D_{yy}$ and $D_{xy}$, respectively. The filter sizes are {9, 15, 21, 27} for the 1st octave, {15, 27, 39, 51} for the 2nd octave, {27, 51, 75, 99} for the 3rd octave, and {51, 99, 147, 195} for the 4th octave.
Fig. 7: Increasing the filter size while keeping the Gaussian derivatives at the corrected scales [14].
Keypoint candidates are taken from the two middle scales of each octave and compared with their adjacent scale-space neighborhood. A 3×3×3 scale-space neighborhood is used to determine whether the interest pixel is a local maximum within the search region. A pictorial representation of the adjacent pixels in space and scale is given in Figure 8. The center pixel (red) is considered a local maximum among the surrounding points (grey area) when it has the highest intensity in the search area. If its value also exceeds a pre-defined threshold, that pixel is regarded as a keypoint.
Fig. 8: Non-maximum suppression to detect a keypoint in a 3×3×3 scale-space search [14].
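A compact sketch of this 3×3×3 non-maximum suppression, assuming `responses` is a stack of determinant-of-Hessian maps indexed by (scale, y, x), such as those produced by the previous sketch; the threshold is an illustrative value.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def find_keypoints(responses, threshold=0.002):
    """responses: array of shape (n_scales, H, W). Returns (scale, y, x) indices of keypoints."""
    # A pixel is kept if it equals the maximum over its 3x3x3 neighbourhood
    # (space plus the adjacent scales) and also exceeds the threshold.
    local_max = maximum_filter(responses, size=(3, 3, 3), mode="constant")
    mask = (responses == local_max) & (responses > threshold)
    mask[0], mask[-1] = False, False        # only the middle scales of the stack are searched
    return np.argwhere(mask)
```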
3) Orientation assignment: This stage calculates the Haar wavelet responses in the X and Y directions around the keypoint, with a wavelet side length of 4s (where s is the detected scale). Once the wavelet responses are calculated and weighted with a Gaussian ($\sigma = 2.5s$) centered at the interest point, each response is represented as a vector whose coordinates are the horizontal and vertical response strengths. The dominant orientation is estimated by calculating the sum of all responses within a sliding orientation window covering an angle of $\pi/3$: the horizontal and vertical responses within the window are summed, yielding a new vector, and the longest such vector lends its orientation to the interest point [13].
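A sketch of this dominant-orientation search, assuming `dx` and `dy` already hold the Gaussian-weighted Haar responses of the sample points around a keypoint; the number of window positions is an arbitrary discretization, not a value from the paper.

```python
import numpy as np

def dominant_orientation(dx, dy, n_windows=72):
    """Orientation of the longest summed-response vector over sliding pi/3 windows."""
    angles = np.arctan2(dy, dx)                       # angle of each Haar response
    best_len, best_angle = -1.0, 0.0
    for start in np.linspace(-np.pi, np.pi, n_windows, endpoint=False):
        # Responses falling inside the window [start, start + pi/3), with wrap-around.
        diff = (angles - start) % (2 * np.pi)
        in_win = diff < (np.pi / 3)
        sx, sy = dx[in_win].sum(), dy[in_win].sum()   # summed response vector
        length = np.hypot(sx, sy)
        if length > best_len:
            best_len, best_angle = length, np.arctan2(sy, sx)
    return best_angle
```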
4) Descriptor generation: The descriptor describes the distribution of Haar wavelet responses within the interest-point neighborhood. This stage partitions the interest region into 4×4 (i.e. 16) square sub-regions, where each sub-region is further divided into 5×5 (i.e. 25) regularly spaced sample points, as shown in Figure 9.
Fig. 9: An interest area is divided into 4×4 sub-regions, each sampled at 5×5 points [14].
Denote by $d_x$ and $d_y$ the Haar wavelet responses in the horizontal and vertical directions, relative to the selected keypoint orientation. The wavelet responses $d_x$ and $d_y$ are summed over each sub-region and form a first set of entries of the feature vector. To bring in information about the polarity of the intensity changes, the sums of the absolute values of the responses, $|d_x|$ and $|d_y|$, are also included. These values form the descriptor vector of each sub-region:
$$v = \left( \sum d_x, \; \sum |d_x|, \; \sum d_y, \; \sum |d_y| \right) \qquad (10)$$
Over all 4×4 (i.e. 16) sub-regions, the length of the descriptor vector becomes 64. The wavelet responses are invariant to a bias in illumination; invariance to contrast is achieved by normalizing the descriptor to a unit vector.
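Applying Eq. (10) to all 16 sub-regions yields the 64-element vector; a minimal sketch, assuming `dx` and `dy` are 20×20 grids of Haar responses already aligned to the keypoint orientation (4×4 sub-regions of 5×5 samples):

```python
import numpy as np

def surf_descriptor(dx, dy):
    """Build the 64-D SURF-style descriptor from 20x20 grids of Haar responses dx, dy."""
    v = []
    for i in range(0, 20, 5):            # 4x4 sub-regions ...
        for j in range(0, 20, 5):        # ... of 5x5 sample points each
            bx, by = dx[i:i+5, j:j+5], dy[i:i+5, j:j+5]
            v.extend([bx.sum(), np.abs(bx).sum(), by.sum(), np.abs(by).sum()])
    v = np.array(v)
    return v / np.linalg.norm(v)         # unit length -> invariance to contrast
```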
F. Related Research
In general, video from a camera can be mechanically stabilized with a gimbal. For a small UAV, however, installing both a video camera and a gimbal is limited by the available space and payload capacity. A typical approach is therefore to use a post-processing technique for digital video stabilization. The three major stages of video stabilization are motion estimation, motion compensation, and image composition [12, 16, 17].
At the motion estimation stage, the video is divided into image frames; the features of the current frame are extracted and then matched with those of the previous frame. This stage yields the orientation and displacement of the image. During the motion compensation stage, the features are adjusted so that their positions lie as close as possible to the corresponding features of the previous frame. Lastly, image composition is performed via a projective transform so that all features end up in the correct positions. These three stages are repeated for each new frame.
A vital process during motion estimation is the detection of image features. Studies suggest that feature detection using the SURF method is more efficient than with the SIFT method: not only does the SURF method yield more accurate results, it also leads to higher precision in keypoint matching with less computation [12, 16, 17].
III. IMPLEMENTATION
A. Experiment Setups
This experiment uses a multi-rotor equipped with a digital video camera to capture motion images, as shown in Figure 10. The video is recorded in front of the Defence Technology Institute (DTI) building around mid-day under normal lighting conditions. The multi-rotor is operated at a height of about 20 meters, while the camera is set to full HD (1080p) resolution at a rate of 30 frames/s.
Fig. 10: (a) Multi-rotor and (b) video camera.
B. Stabilization Process
For a small multi-rotor (UAV), the video camera is usually installed without any stabilization device, and unstable movement due to the UAV's flight behavior often appears in the video. To reduce this unpleasant viewing effect, the recorded images are digitally processed as follows.
1) Frame extraction: The recorded video is split into individual image frames, each of which is further processed.
2) Motion detection: It estimates an image motion using the
SURF method. Figure 11 illustrates feature detection,
extraction, and matching processes. Here, the keypoint
examples are indicated by green circles. Hundreds of keypoints are detected in this frame; however, only 20 important keypoints are shown for the extraction and matching processes.
3) Motion compensation: An image frame is compensated
in an opposite direction to offset camera motion. This
compensation includes scale and rotation transforma-
tions with a proper adjustment.
Fig. 11: Keypoint examples of image frame (top), and feature
extraction and matching between adjacent frames (bottom).
4) Image composition: A compensated image is adjusted
such that all detected keypoints between adjacent frames
are aligned. Illustration of motion compensation and
image composition is shown in Figure 12.
Fig. 12: Motion compensation and image composition.
5) Image stabilization: The final image of the current frame has all detected keypoints located as close as possible to those of the previous frame.
Figure 13 summarizes the video stabilization process. As a result, the final image of each frame exhibits less movement and is easier to view than the original version. All stabilized image frames are then reassembled to produce a smoother video.
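A condensed sketch of the frame-by-frame process in Figure 13 is given below. It assumes an OpenCV build with the contrib SURF module, uses the first frame as a fixed reference, and applies a similarity transform for warping; the file name, Hessian threshold, and ratio-test value are placeholders rather than settings reported in the paper.

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("drone_video.mp4")              # placeholder input file
ok, ref = cap.read()
ref_gray = cv2.cvtColor(ref, cv2.COLOR_BGR2GRAY)

surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
matcher = cv2.BFMatcher(cv2.NORM_L2)
ref_kp, ref_des = surf.detectAndCompute(ref_gray, None)

stabilized = [ref]
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # 2) motion detection: match current keypoints against the reference frame
    kp, des = surf.detectAndCompute(gray, None)
    matches = matcher.knnMatch(des, ref_des, k=2) if des is not None else []
    good = [m for m, n in matches if m.distance < 0.7 * n.distance]
    if len(good) < 3:                                   # too few matches: keep the frame unchanged
        stabilized.append(frame)
        continue
    src = np.float32([kp[m.queryIdx].pt for m in good])
    dst = np.float32([ref_kp[m.trainIdx].pt for m in good])

    # 3)-4) motion compensation and image composition: warp the current frame
    # so its keypoints land on the reference keypoint positions
    M, _ = cv2.estimateAffinePartial2D(src, dst, method=cv2.RANSAC)
    h, w = gray.shape
    stabilized.append(frame if M is None else cv2.warpAffine(frame, M, (w, h)))
```

Unreliable matches are filtered by a ratio test and by RANSAC before the transform is estimated; other choices (affine or projective warps, or matching against the immediately preceding frame) fit the same skeleton.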
C. Performance Measures
Since the visual quality of digitally stabilized video is rather subjective, it is necessary to establish quantitative measures to compare the effect of different video enhancement algorithms on the quality of each image frame. Two commonly used measures are the Mean Squared Error (MSE) and the Peak Signal-to-Noise Ratio (PSNR). The MSE represents the cumulative squared error between the original frames and the stabilized version, whereas the PSNR represents a measure of the peak error. A lower error in the stabilization algorithm results in a lower MSE value, and a higher PSNR value between two adjacent stabilized frames indicates good stabilized video quality.
Fig. 13: Video stabilization process.
The MSE and PSNR values are computed by
$$\mathrm{MSE}(n) = \frac{1}{MN} \sum_{y=1}^{M} \sum_{x=1}^{N} \left[ I_n(x, y) - I_{n+1}(x, y) \right]^2, \qquad (11)$$
$$\mathrm{PSNR}(n) = 10 \log_{10} \left( \frac{I_{\max}^2}{\mathrm{MSE}(n)} \right), \qquad (12)$$
where $M$ and $N$ are the numbers of rows and columns of pixels in a frame. $I_n(x, y)$ and $I_{n+1}(x, y)$ denote the intensity values at pixel location $(x, y)$ of the current frame $n$ and the next frame $n+1$; typically, the intensity takes values between 0 and 255. $I_{\max}$ is the maximum pixel intensity of the current frame. The PSNR is measured in decibels (dB).
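Equations (11) and (12) translate directly into a few lines; a sketch assuming two equally sized 8-bit grayscale frames stored as NumPy arrays:

```python
import numpy as np

def mse(frame_a, frame_b):
    """Mean squared error between two equally sized grayscale frames, Eq. (11)."""
    diff = frame_a.astype(np.float64) - frame_b.astype(np.float64)
    return np.mean(diff ** 2)

def psnr(frame_a, frame_b, i_max=255.0):
    """Peak signal-to-noise ratio in dB between adjacent frames, Eq. (12)."""
    e = mse(frame_a, frame_b)
    if e == 0:
        return float("inf")               # identical frames
    return 10.0 * np.log10(i_max ** 2 / e)
```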
Fig. 14: Quality of stabilized video and PSNR value.
This experiment compares a compensated frame, from which the undesired motion has been removed, against the reference image of the previous frame. A high PSNR value indicates that the compensated and reference frames have similar quality. As illustrated in Figure 14, two adjacent frames whose keypoint locations lie in close vicinity can be combined into an easily viewable video.
IV. RESULTS AND DISCUSSION
A. Experimental Results
There are four video scenarios in this analysis, and each video has a total of 700 image frames. The 1st video is recorded by the multi-rotor's digital camera flying in front of the DTI building. It contains significant motion because of the high-wind environment, but it is a good sample since the keypoint locations are well spread throughout the frame. The 2nd video is also recorded under high-wind conditions but at a later time, and almost 50% of the image is in shadow. The 3rd video is an internet sample recorded from another multi-rotor camera; it contains more vibration because the camera operates at a low frame rate without any stabilizing device, making it a good reference for a low-quality camera. The 4th video is recorded before the multi-rotor takes off, so its images show no significant vibration effects.
Fig. 15: PSNR value at each frame of the 1st video sample.
In Figure 15, the PSNR values of both the original and stabilized versions are high because of the higher frame rate: the keypoints in each frame are located close to one another with minimal differences, so the image frames of the stabilized version have similar quality to those of the original video. However, the higher PSNR values of the stabilized samples indicate that the SURF algorithm yields a smoother video. The PSNR values of both the original and stabilized samples of the 2nd video, shown in Figure 16, are also high. Although there are shadows on the image, the SURF algorithm reduces their effect and gives a cleaner video, as evidenced by higher PSNR values than those of the original version.
Fig. 16: PSNR value at each frame of the 2nd video sample.
In Figure 17, the PSNR values vary more because of the lower camera quality and the weaker control of the multi-rotor in maintaining a steady position. However, the stabilized version still yields a higher PSNR value at each frame, resulting in a video that is easier to view. The PSNR values of the original and stabilized samples shown in Figure 18 are roughly equal because the multi-rotor has not yet flown; their visual qualities are thus similar.
Fig. 17: PSNR value at each frame of the 3rd video sample.
Fig. 18: PSNR value at each frame of the 4th video sample.
A comparison of the average PSNR values of the original and stabilized video samples is summarized in Table I. The 1st video yields a higher improvement percentage than the 2nd because the former has no shadows on the image, so the stabilization process is able to match more keypoints with the reference frame; nevertheless, the stabilization performance of the 2nd scenario is at roughly the same level as that of the 1st video. The 1st, 2nd, and 3rd videos show quantitatively small improvement percentages, but their visual quality appears noticeably smoother. The 4th video yields an insignificant improvement percentage since the multi-rotor has not yet flown and the recorded video contains little movement.
TABLE I: Average PSNR value (dB) of each video sample

Scenario | Original | Stabilized | Improvement
1st      | 70.7489  | 74.6266    | 5.48%
2nd      | 72.7996  | 76.1505    | 4.60%
3rd      | 72.3716  | 75.4981    | 4.32%
4th      | 86.8969  | 87.5343    | 0.73%
B. Discussion
All graphs in Figures 15 to 18 show that the PSNR values of the stabilized videos are higher from the beginning of each recording. Before take-off, most keypoints do not drastically change their locations, since there is little vibration affecting the recorded video. During flight, however, the video shakes because of the multi-rotor's engine and movement as well as environmental wind. These disturbances directly account for the difference in PSNR between the stabilized and original videos. All stabilized versions have keypoint locations that lie closer together between adjacent frames than those of the original video. Therefore, the videos stabilized with the SURF algorithm appear smoother and easier to view.
In situations with abrupt changes in the attitude, height, or movement of the multi-rotor, matching keypoints between image frames is more difficult than under normal conditions. The processed video appears more stable overall but can contain some jerky portions. Nonetheless, the PSNR values of the stabilized version are still higher than those of the original sample. It is evident that the SURF algorithm can be used to reduce video motion and thereby improve the visual quality of video recorded from a UAV camera.
V. CONCLUSION
The SURF algorithm detects keypoints in each image frame by determining the locations where the intensity response attains a maximum among all points in the surrounding neighborhood. The experiments in this paper use four video samples recorded from a fixed camera on a small multi-rotor to verify the method's efficiency. The videos include conditions with normal mid-day light and with partial shadows on the recorded images. The SURF algorithm detects keypoints in each image frame and compares their locations between the current and previous frames. The matched keypoints are then used to compensate the movement of the current frame so that these keypoints are located as close as possible to those of the previous frame. The compensated image frames are combined into a more stable video.
Several transformations, such as the projective, similarity, and affine transformations, can be applied to achieve this compensation, and each is suitable for compensating keypoints with different characteristics. A quantitative criterion for selecting an appropriate transformation can be explored in a future study. A combination of these methods could also be implemented instead of a single transformation so that the processed video becomes even smoother and more stable for viewing.
REFERENCES
[1] P. Rawat and J. Singhai, "Review of Motion Estimation and Video Stabilization Techniques for Hand Held Mobile Video," Signal and Image Process.: An Int. J., Vol. 2(2), pp. 159-168, Jun. 2011.
[2] S. Navayot and N. Homsup, "Real-Time Video Stabilization for Aerial Mobile Multimedia Communication," Int. J. of Innovation, Management and Technology, Vol. 4(1), pp. 26-30, Feb. 2013.
[3] M. Liebling. (2010). PoorMan3DReg [Online]. Available: http://sybil.ece.ucsb.edu
[4] A. Neumann, H. Freimark and A. Wehrle. (2010). Geodata and Spatial Relation [Online]. Available: https://geodata.ethz.ch
[5] P. Rawat and J. Singhai, "Efficient Video Stabilization Technique for Hand Held Mobile Videos," Int. J. of Signal Process., Image Process. and Pattern Recognition, Vol. 6(3), pp. 17-32, Jun. 2013.
[6] M. Niskanen, O. Silven and M. Tico, "Video Stabilization Performance Assessment," Proc. of IEEE Int. Conf. on Multimedia and Expo, Canada, Jul. 2006, pp. 405-408.
[7] C. Macmanus. (2009). The Technology Behind Sony Alpha DSLR's Steady Shot Inside [Online]. Available: http://www.sonyinsider.com
[8] D. G. Lowe, "Object Recognition from Local Scale-Invariant Features," Proc. of IEEE Int. Conf. on Computer Vision, Kerkyra, Greece, Sep. 1999, pp. 1150-1157.
[9] D. G. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints," Int. J. of Computer Vision, Vol. 60(2), pp. 91-110, Nov. 2004.
[10] G. Yu and J. M. Morel, "ASIFT: An Algorithm for Fully Affine Invariant Comparison," Image Process. On Line, Feb. 2011, pp. 1-28.
[11] Y. Ke and R. Sukthankar, "PCA-SIFT: A More Distinctive Representation for Local Image Descriptors," Proc. of IEEE Comput. Soc. Conf. on Computer Vision and Pattern Recognition, USA, Jun. 2004, Vol. 2, pp. 506-513.
[12] X. Zheng, C. Shaohui, W. Gang and L. Jinlun, "Video Stabilization System Based on Speeded-up Robust Features," Proc. of Int. Ind. Informatics and Computer Eng. Conf., Xian, China, Jan. 2015, pp. 1995-1998.
[13] H. Bay, T. Tuytelaars and L. V. Gool, "SURF: Speeded Up Robust Features," J. of Computer Vision and Image Understanding, Vol. 110(3), pp. 346-359, Jun. 2008.
[14] J. T. Pederson, SURF: Feature Detection & Description, Dept. of Computer Sci., Aarhus Univ., Denmark, 2011.
[15] S. M. Jurgensen, "The Rotated Speeded-Up Robust Features Algorithm (R-SURF)," M.S. thesis, Dept. of Elec. and Comp. Eng., Naval Postgraduate School, CA, 2014.
[16] D. Patel, D. Patel, D. Bhatt and K. R. Jadav, "Motion Compensation for Hand Held Camera Device," Int. J. of Research in Engineering and Technology, Vol. 4, pp. 771-775, Feb. 2015.
[17] J. Y. Kim and C. H. Caldas, "Exploring Local Feature Descriptors for Construction Site Video Stabilization," Proc. of 31st Int. Symp. on Automation and Robotics in Construction and Mining, Sydney, Australia, Jul. 2014, pp. 654-660.
[18] H. M. Sergieh, E. E. Zsigmond, M. Doller, D. Coquil, J. M. Pinon and H. Kosch, "Improving SURF Image Matching Using Supervised Learning," Proc. of 8th Int. Conf. on Signal Image Technology and Internet Based Systems, Sorrento, Italy, Nov. 2012, pp. 230-237.
[19] S. Namarateruangsuk, "Image Filtering using Raised Cosine-Blur," SDU Research J. of Sci. and Technology, Vol. 7(2), pp. 23-32, May 2014.
[20] A. Walhaa, A. Walia and A. M. Alimia, "Video Stabilization for Aerial Video Surveillance," AASRI Conf. on Intell. Syst. and Control, Vol. 4, pp. 72-77, 2013.