
Evaluation of Tightly- and Loosely-coupled Approaches in CNN-based Pose Estimation Systems for Uncooperative Spacecraft

Lorenzo Pasqualetto Cassinis^a,*, Robert Fonod^a, Eberhard Gill^a, Ingo Ahrns^b, Jesus Gil Fernandez^c

^a Delft University of Technology, Kluyverweg 1, 2629 HS Delft, The Netherlands
^b Airbus DS GmbH, Airbusallee 1, 28199 Bremen, Germany
^c ESTEC, Keplerlaan 1, 2201 AZ Noordwijk, The Netherlands

Abstract

The relative pose estimation of an inactive spacecraft by an active servicer spacecraft is a critical task in the design of current and planned space missions, due to its relevance for close-proximity operations such as In-Orbit Servicing and Active Debris Removal. This paper introduces a novel framework to enable robust monocular pose estimation for close-proximity operations around an uncooperative spacecraft, which combines a Convolutional Neural Network (CNN) for feature detection with a Covariant Efficient Procrustes Perspective-n-Points (CEPPnP) solver and a Multiplicative Extended Kalman Filter (MEKF). The performance of the proposed method is evaluated at different levels of the pose estimation system. A Single-stack Hourglass CNN is proposed for the feature detection step in order to decrease the computational load of the Image Processing (IP), and its accuracy is compared to the standard, more complex High-Resolution Net (HRNet). Subsequently, heatmaps-derived covariance matrices are included in the CEPPnP solver to assess the pose estimation accuracy prior to the navigation filter. This is done in order to support the performance evaluation of the proposed tightly-coupled approach against a loosely-coupled approach,

*Corresponding author
Email addresses: L.PasqualettoCassinis@tudelft.nl (Lorenzo Pasqualetto Cassinis), Robert.Fonod@ieee.org (Robert Fonod), E.K.A.Gill@tudelft.nl (Eberhard Gill), ingo.ahrns@airbus.com (Ingo Ahrns), Jesus.Gil.Fernandez@esa.int (Jesus Gil Fernandez)

Preprint submitted to Elsevier, February 25, 2021

in which the detected features are converted into pseudomeasurements of the relative pose prior to the filter. The performance results of the proposed system indicate that a tightly-coupled approach can guarantee an advantageous coupling between the rotational and translational states within the filter, whilst reflecting a representative measurement covariance. This suggests a promising scheme to cope with the challenging demand for robust navigation in close-proximity scenarios. Synthetic 2D images of the European Space Agency's Envisat spacecraft are used to generate datasets for training, validation, and testing of the CNN. Likewise, the images are used to recreate a representative close-proximity scenario for the validation of the proposed filter.

Keywords: Relative Pose Estimation, Active Debris Removal, In-Orbit Servicing, Monocular-based Navigation Filters, Convolutional Neural Networks
2010 MSC: 00-01, 99-00

1. Introduction

Nowadays, key Earth-based applications such as remote sensing, navigation, and telecommunication rely on satellite technology on a daily basis. To ensure a high reliability of these services, the safety and operations of satellites in orbit have to be guaranteed. In this context, advancements in the field of Guidance, Navigation, and Control (GNC) were made in the past years to cope with the challenges involved in In-Orbit Servicing (IOS) and Active Debris Removal (ADR) missions [1, 2]. For such scenarios, the estimation of the relative pose (position and attitude) of an uncooperative spacecraft by an active servicer spacecraft represents a critical task. Compared to cooperative close-proximity missions, the pose estimation problem is indeed complicated by the fact that the target satellite is not functional and/or not able to aid the relative navigation. Hence, optical sensors shall be preferred over Radio Frequency (RF) sensors to cope with the lack of navigation devices, such as Global Positioning System (GPS) sensors and/or antennas, onboard the target.

From a high-level perspective, optical sensors can be divided into active and passive devices, depending on whether they require power to function, i.e. LIght Detection And Ranging (LIDAR) sensors and Time-Of-Flight (TOF) cameras, or whether they passively acquire radiation, i.e. monocular and stereo cameras. Spacecraft relative navigation usually exploits Electro-Optical (EO) sensors such as stereo cameras [3, 4] and/or a LIDAR sensor [5] in combination with one or more monocular cameras, in order to overcome the partial observability that results from the lack of range information in the latter [6]. In this framework, pose estimation systems based solely on a monocular camera are recently becoming an attractive alternative to systems based on active sensors or stereo cameras, due to their reduced mass, power consumption, and system complexity [7]. However, given the low Signal-To-Noise Ratio (SNR) and the high contrast which characterize space images, a significant effort is still required to comply with most of the demanding requirements for a robust and accurate monocular-based navigation system. Interested readers are referred to Pasqualetto Cassinis et al. [8] for a recent overview of the current trends in monocular-based pose estimation systems.

Notably, the aforementioned navigation system cannot rely on known visual markers, as they are typically not installed on an uncooperative target. Since the extraction of visual features is an essential step in the pose estimation process, advanced Image Processing (IP) techniques are required to extract keypoints (or interest points), corners, and/or edges on the target body. In model-based methods, the detected features are then matched with pre-defined features on an offline wireframe 3D model of the target to solve for the relative pose. In other words, a reliable detection of key features under adverse orbital conditions is highly desirable to guarantee safe operations around an uncooperative spacecraft. Moreover, it would be beneficial to obtain a model of the feature detection uncertainties. This would provide the navigation system with additional statistical information about the measurements, which could in turn improve the robustness of the entire estimation process.

Unfortunately, standard pose estimation solvers such as the Efficient Perspective-n-Point (EPnP) [9], the Efficient Procrustes Perspective-n-Point (EPPnP) [10], or the multi-dimensional Newton-Raphson Method (NRM) [11] do not have the capability to include feature uncertainties. Only recently, the Maximum-Likelihood PnP (MLPnP) [12] and the Covariant EPPnP (CEPPnP) [13] solvers were introduced to exploit statistical information by including feature covariances in the pose estimation. Ferraz et al. [13] proposed a method for computing the covariance which takes different camera poses to create a fictitious distribution around each detected keypoint. Other authors proposed an improved pose estimation method based on projection vectors, in which the covariance is associated to the image gradient magnitude and direction at each feature location [14], or a method in which covariance information is derived for each feature based on the feature's visibility and robustness against illumination changes [15]. However, in all these methods the derivation of feature covariance matrices is a lengthy process which generally cannot be directly related to the actual detection uncertainty. Moreover, this procedure could not be easily applied if Convolutional Neural Networks (CNNs) are used in the feature detection step, due to the difficulty of associating statistical meaning to the IP tasks performed within the network. In this context, another procedure should be followed in which the output of the CNNs is directly exploited to return relevant statistical information about the detection step. This could, in turn, provide a reliable representation of the detection uncertainty.

The implementation of CNNs for monocular pose estimation in space has already become an attractive solution in recent years, also thanks to the creation of the Spacecraft PosE Estimation Dataset (SPEED) [16], a database of highly representative synthetic images of PRISMA's TANGO spacecraft made publicly available by Stanford's Space Rendezvous Laboratory (SLAB) and applicable to train and test different network architectures. One of the main advantages of CNNs over standard feature-based algorithms for relative pose estimation [7, 17, 18] is an increase in robustness under adverse illumination conditions, as well as a reduction in the computational complexity. Initially, end-to-end CNNs were exploited to map a 2D input image directly into a relative pose by means of learning complex non-linear functions [19, 20, 21, 22]. However, since the pose accuracies of these end-to-end CNNs proved to be lower than the accuracies returned by common pose estimation solvers, especially in the estimation of the relative attitude [19], recent efforts investigated the capability of CNNs to perform keypoint localization prior to the actual pose estimation [23, 24, 25, 26]. The output of these networks is a set of so-called heatmaps around pre-trained features. The coordinates of the heatmap's peak intensity characterize the predicted feature location, with the intensity and the shape indicating the confidence of locating the corresponding keypoint at this position [23]. Additionally, due to the fact that the trainable features can be selected offline prior to the training, the matching of the extracted feature points with the features of the wireframe model can be performed without the need for a large search space for the image-model correspondences, which usually characterizes most of the edges/corners-based methods [27]. In this context, the High-Resolution Net (HRNet) [28] already proved to be a reliable and accurate keypoint detector prior to pose estimation, due to its capability of maintaining a high-resolution representation of the heatmaps through the whole detection process.

To the best of the authors' knowledge, the reviewed implementations of CNNs feed solely the heatmap's peak location into the pose estimation solver, although more information could be extracted from the detected heatmaps. Only in Pavlakos et al. [23] is the pose estimation solved by assigning weights to each feature based on their heatmap's peak intensities, in order to penalize inaccurate detections. Yet, there is another aspect related to the heatmaps which has not been considered. It is in fact hardly acknowledged how the overall shape of the detected heatmaps returned by a CNN can be translated into a statistical distribution around the peak, allowing reliable feature covariances and, in turn, a robust navigation performance. As already investigated by the authors in earlier works [29, 30], deriving an accurate representation of the measurement uncertainty from feature heatmaps can in fact not only improve the pose estimation, but it can also benefit the estimation of the full relative state vector, which would include the relative pose as well as the relative translational and rotational velocities.

From a high-level perspective, two different navigation architectures are normally exploited in the framework of relative pose estimation: a tightly-coupled architecture, where the extracted features are directly processed by the navigation filter as measurements, and a loosely-coupled architecture, in which the relative pose is computed by a pose solver prior to the navigation filter, in order to derive pseudomeasurements from the target features [31]. Usually, a loosely-coupled approach is preferred for an uncooperative tumbling target, due to the fact that the fast relative dynamics could jeopardize feature tracking and return highly-variable measurements to the filter. However, one shortcoming of this approach is that it is generally hard to obtain a representative covariance matrix for the pseudomeasurements. This can be quite challenging when filter robustness is demanded. Remarkably, the adoption of a CNN in the feature detection step can overcome the challenges in feature tracking by guaranteeing the detection of a constant, pre-defined set of features. At the same time, the CNN heatmaps can be used to derive a measurement covariance matrix and improve filter robustness. Following this line of reasoning, a tightly-coupled filter is expected to interface well with a CNN-based IP and to outperform its loosely-coupled counterpart.

In this framework, the objective of this paper is to combine a CNN-based feature detector with a CEPPnP solver whilst evaluating the performance of a proposed tightly-coupled navigation filter against the performance of a loosely-coupled filter. Specifically, the novelty of this work lies in extending the authors' previous findings [29, 30] by further linking the current research on CNN-based feature detection, covariant PnP solvers, and navigation filters. The main contributions of this work are:

1. To assess the feasibility of a simplified CNN for feature detection within the IP;

2. To improve the pose estimation by incorporating heatmaps-derived covariance matrices in the CEPPnP;

3. To compare the performance of tightly- and loosely-coupled navigation filters.

The paper is organized as follows. The overall pose estimation framework is illustrated in Section II. Section III introduces the proposed CNN architecture together with the adopted training, validation, and testing datasets. In Section IV, special focus is given to the derivation of covariance matrices from the CNN heatmaps, whereas Section V describes the CEPPnP solver. Besides, Section VI provides a description of the adopted tightly- and loosely-coupled filters. The simulation environment is presented in Section VII together with the simulation results. Finally, Section VIII provides the main conclusions and recommendations.

2. Pose Estimation Framework

This work considers a servicer spacecraft flying in relative motion around a target spacecraft located in a Low Earth Orbit (LEO), with the relative motion being described in a Local Vertical Local Horizontal (LVLH) reference frame co-moving with the servicer (Figure 1a). Furthermore, it is assumed that the servicer is equipped with a single monocular camera. The relative attitude of the target with respect to the servicer can then be defined as the rotation of the target body-fixed frame B with respect to the servicer camera frame C, where these frames are tied to each spacecraft's body. The distance between the origins of these two frames defines their relative position. Together, these two quantities characterize the relative pose. This information can then be transferred from the camera frame to the servicer's center of mass by accounting for the relative pose of the camera with respect to the LVLH frame.

From a high-level perspective, a model-based monocular pose estimation system receives as input a 2D image and matches it with an existing wireframe 3D model of the target spacecraft to estimate the pose of such target with respect to the servicer camera. Referring to Figure 1b, the pose estimation problem consists in determining the position of the target's centre of mass $\mathbf{t}^C$ and its orientation with respect to the camera frame C, represented by the rotation matrix $\mathbf{R}^C_B$. The Perspective-n-Points (PnP) equations,

$$\mathbf{r}^C = \begin{bmatrix} x^C & y^C & z^C \end{bmatrix}^T = \mathbf{R}^C_B \mathbf{r}^B + \mathbf{t}^C, \quad (1)$$

$$\mathbf{p} = (u_i, v_i) = \left( \frac{x^C}{z^C} f_x + C_x,\ \frac{y^C}{z^C} f_y + C_y \right), \quad (2)$$

relate the unknown pose with a feature point $\mathbf{p}$ in the image plane via the relative position $\mathbf{r}^C$ of the feature with respect to the camera frame. Here, $\mathbf{r}^B$ is the point location in the 3D model, expressed in the body-frame coordinate system B, whereas $f_x$ and $f_y$ denote the focal lengths of the camera and $(C_x, C_y)$ is the principal point of the image.
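As a minimal sketch of Eqs. (1)-(2), the projection of a single model point can be written as follows; the rotation matrix, feature location, and camera intrinsics below are illustrative placeholder values, not the parameters used in this work.

```python
def project_feature(r_B, R_CB, t_C, fx, fy, Cx, Cy):
    """Project a 3D model point r_B (body frame) into pixel coordinates."""
    # Eq. (1): rotate the body-frame point into the camera frame and translate
    r_C = [sum(R_CB[i][j] * r_B[j] for j in range(3)) + t_C[i] for i in range(3)]
    xC, yC, zC = r_C
    # Eq. (2): perspective division followed by the intrinsic calibration
    u = xC / zC * fx + Cx
    v = yC / zC * fy + Cy
    return u, v

# Identity relative attitude; hypothetical feature 10 m along the boresight
R_CB = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
u, v = project_feature([1.0, 0.5, 0.0], R_CB, [0.0, 0.0, 10.0],
                       fx=1000.0, fy=1000.0, Cx=256.0, Cy=256.0)
```

With n such correspondences, the PnP solvers discussed below invert this mapping to recover $\mathbf{R}^C_B$ and $\mathbf{t}^C$.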

From these equations, it can already be seen that an important aspect of estimating the pose resides in the capability of the IP system to extract features $\mathbf{p}$ from a 2D image of the target spacecraft, which in turn need to be matched with pre-selected features $\mathbf{r}^B$ in the wireframe 3D model. Notably, such a wireframe model of the target needs to be made available prior to the estimation. Notice also that the problem is not well defined for n < 3 feature points, and can have up to four positive solutions for n = 3 [33]. Generally, more features are required in the presence of large noise and/or symmetric objects. Besides, it can also be expected that the time variation of the relative pose plays a crucial role while navigating around the target spacecraft, e.g. if rotational synchronization with the target spacecraft is required in the final approach phase. As such, it is clear that the estimation of both the relative translational and angular velocities

Figure 2: Functional flow of the proposed tightly-coupled pose estimation architecture.

represent an essential step within the navigation system.


The proposed tightly-coupled architecture combines the above key ingredients in three main stages, which are shown in Figure 2 and described in more detail in the following sections. In the CNN-based IP block, a CNN is used to extract features from a 2D image of the target spacecraft. Statistical information is derived by computing a covariance matrix for each feature using the information included in the output heatmaps. In the Navigation block, both the peak locations and the covariances are fed into the navigation filter, which estimates the relative pose as well as the relative translational and rotational velocities. The filter is initialized by the CEPPnP block, which takes the peak location and covariance matrix of each feature as input and outputs the initial relative pose by solving the PnP problem in Eqs. 1-2. Thanks to the availability of a covariance matrix of the detected features, this architecture can guarantee a more accurate representation of feature uncertainties, especially in the case of inaccurate detections by the CNN due to adverse illumination conditions and/or unfavourable relative geometries between servicer and target. Together with the CEPPnP initialization, this aspect can return a robust and accurate estimation of the relative pose and velocities and assure a safe approach of the target spacecraft.

In this work, a rectilinear VBAR approach of the servicer spacecraft towards the target spacecraft is considered, as this typically occurs during the final stages of close-proximity operations in rendezvous and docking missions [1, 2]. This assumption is justified by the fact that the proposed method needs to be first validated on simplified relative trajectories before assessing its feasibility under more complex relative geometries. Following the same line of reasoning, the relative attitude is also simplified by considering perturbation-free rotational dynamics between the servicer and the target. This is described in more detail in Section VI.

3. Convolutional Neural Network

CNNs are currently emerging as a promising feature extraction method, mostly due to the capability of their convolutional layers to extract high-level features of objects with improved robustness against image noise and illumination conditions. In order to optimize CNNs for the feature extraction process, a stacked hourglass architecture has been proposed [23, 24], and other architectures such as the U-net [34] and the HRNet [28] were tested in recent years.

Compared to the network proposed by Pavlakos et al. [23], the architecture proposed in this work is composed of only one encoder/decoder block, constituting a single hourglass module. This was chosen in order to reduce the network size and comply with the limitations in computing power which characterize space-grade processors. The encoder includes six blocks, each including a convolutional layer formed by a fixed number of filter kernels of size 3×3, a batch normalization module, and a max pooling layer, whereas the six decoder blocks accommodate an up-sampling block in place of max pooling. In the encoder stage, the initial image resolution is decreased by a factor of two, with this downsampling process continuing until reaching the lowest resolution of 4×4 pixels. An upsampling process follows in the decoder, with each layer increasing the resolution by a factor of two and returning output heatmaps at the same resolution as the input image. Figure 3 shows the high-level architecture of the network layers, together with the corresponding input and output.

Overall, the size of the 2D input image and the number of kernels per convolutional layer drive the total number of parameters. In the current analysis, an input size of 256×256 pixels is chosen, and 128 kernels are considered per convolutional layer, leading to a total of ∼1,800,000 trainable parameters. Compared to the CNNs analyzed by Sun et al. [28], this represents a reduction of more than an order of magnitude in network size.

Figure 3: Overview of the single hourglass architecture. Downsampling is performed in the encoder stage, in which the image size is decreased after each block, whereas upsampling occurs in the decoder stage. The output of the network consists of heatmap responses and is used for keypoint localization.

As already mentioned, the output of the network is a set of heatmaps around the selected features. Ideally, the heatmap's peak intensity associated to a wrong detection should be relatively small compared to the correctly detected features, highlighting that the network is not confident about that particular wrongly-detected feature. At the same time, the heatmap's amplitude should provide an additional insight into the confidence level of each detection, a large amplitude being related to a large uncertainty about the detection. The network is trained


Figure 4: Illustration of the selected features for a given Envisat pose.

Table 1: Parameters of the camera used to generate the synthetic images in Cinema 4D©.

Parameter          Value       Unit
Image resolution   512×512     pixels
Focal length       3.9·10^-3   m
Pixel size         1.1·10^-5   m

with the x- and y-image coordinates of the feature points, computed offline based on the intrinsic camera parameters as well as on the feature coordinates in the target body frame, which were extracted from the wireframe 3D model prior to the training. During training, the network is optimized to locate 16 features of the Envisat spacecraft, consisting of the corners of the main body, the Synthetic-Aperture Radar (SAR) antenna, and the solar panel, respectively. Figure 4 illustrates the selected features for a specific target pose.

3.1. Training, Validation and Test

For the training, validation, and test datasets, synthetic images of the Envisat spacecraft were rendered in the Cinema 4D© software. Table 1 lists the main camera parameters adopted. Constant Sun elevation and azimuth angles of 30 degrees were chosen in order to recreate favourable as well as adverse illumination conditions. Relative distances between camera and target were discretized every 30 m in the interval 90 m - 180 m, with the Envisat always located in the camera boresight direction in order to prevent some of the Envisat features from falling outside the camera field of view. Although this is a conservative assumption, it allows testing the CNN detection under ideal servicer-target geometries during a rectilinear approach. Subsequently, relative attitudes were generated by discretizing the yaw, pitch, and roll angles of the target with respect to the camera by 10 degrees each. Together, these two choices were made in order to recreate several relative attitudes between the servicer and the target. The resulting database was then shuffled to randomize the images, and was ultimately split into training (18,000 images), validation (6,000 images), and test (6,000 images) datasets. Figure 5 shows a subset of the camera pose distribution for 100 representative training images, whereas Figure 6 illustrates some of the images included in the training dataset.
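The shuffle-and-split step described above can be sketched as follows; the file-name pattern and the fixed seed are illustrative assumptions, not details from the paper.

```python
import random

def split_dataset(image_paths, n_train=18000, n_val=6000, n_test=6000, seed=0):
    """Shuffle the rendered images and split them into training,
    validation, and test sets with the sizes reported in Section 3.1."""
    assert len(image_paths) == n_train + n_val + n_test
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)  # deterministic shuffle for repeatability
    return (paths[:n_train],
            paths[n_train:n_train + n_val],
            paths[n_train + n_val:])

# Hypothetical file names for the 30,000 rendered images
train, val, test = split_dataset([f"img_{i:05d}.png" for i in range(30000)])
```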

During training, the validation dataset is used beside the training one to compute the validation losses and avoid overfitting. The Adam optimizer [35] is used with a learning rate of 10^-3 for a total number of 50 epochs. Finally, the network performance after training is assessed with the test dataset.

Preliminary results on the single-stack network performance were already reported by Pasqualetto Cassinis et al. [29]. Above all, one key advantage of relying on CNNs for feature detection was found in the capability of learning the relative position between features under the variety of relative poses present in the training. As a result, both features which are not visible due to adverse illumination and features occulted by other parts of the target can be detected. Besides, a challenge was identified in the specific selection of the trainable features. Since the features selected in this work represent highly symmetrical points of the Envisat spacecraft, such as corners of the solar panel, SAR antenna, or main body, the network could be unable to distinguish between similar features and return multiple heatmaps for a single feature output. Figure 7 illustrates these


Figure 5: Illustration of the pose space discretization in the training dataset. The concentric

spheres represent the discretization of the relative distance in the range 90 m - 180 m. Only

100 random relative camera poses are shown for clarity.

Figure 6: A montage of eight synthetic images selected from the training set.


findings. Notably, the detection of wrong features results in weak heatmaps, which can be filtered out by selecting a proper threshold on their total brightness.

(a) Occulted spacecraft body. (b) Heatmap of a single solar panel's corner.

Figure 7: Robustness and challenges of feature detection with the proposed CNN. On the left-hand side, the network has been trained to recognize the pattern of the features, and can correctly locate the body features which are not visible, i.e. parts occulted by the solar panel and corners of the SAR antenna. Conversely, the right-hand side shows the detection of multiple heatmaps for a single corner of the solar panel. As can be seen, the network can have difficulties in distinguishing similar features, such as the corners of the solar panel.

In order to compare the feature detection accuracy of the proposed Single-stack Hourglass with a more complex CNN architecture, the HRNet proposed by Sun et al. [28] has been selected and trained on the same Envisat datasets. This architecture had already been tested on the SPEED dataset [25] and proved to return highly accurate features of the TANGO spacecraft. The performance is assessed in terms of the Root Mean Squared Error (RMSE) between the ground truth (GT) and the x, y coordinates of the extracted features, which is computed as

$$E_{\mathrm{RMSE}} = \sqrt{\frac{\sum_{i=1}^{n_{\mathrm{tot}}}\left[(x_{\mathrm{GT},i} - x_i)^2 + (y_{\mathrm{GT},i} - y_i)^2\right]}{n_{\mathrm{tot}}}}. \quad (3)$$
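A minimal implementation of Eq. (3), using made-up ground-truth and detected coordinates for illustration:

```python
import math

def feature_rmse(gt, pred):
    """RMSE of Eq. (3) between ground-truth and detected feature
    coordinates, both given as lists of (x, y) pixel tuples."""
    n = len(gt)
    sq = sum((xg - x) ** 2 + (yg - y) ** 2
             for (xg, yg), (x, y) in zip(gt, pred))
    return math.sqrt(sq / n)

# Two features, each offset by (3, 4) pixels from the ground truth
gt   = [(100.0, 100.0), (200.0, 150.0)]
pred = [(103.0, 104.0), (203.0, 154.0)]
err = feature_rmse(gt, pred)
```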


Figure 8: RMSE over the test dataset for the HRNet and the Single-stack Hourglass.

Figure 8 shows the RMSE over the test dataset for the two CNNs, whereas Table 2 reports the mean µ and standard deviation σ of the associated histograms. As expected, the added complexity of HRNet translates into a more accurate detection of the selected features, thanks to the higher number of parameters: only 4% of the test images are characterized by an RMSE above 5 pixels, as opposed to 15% in the Single-stack Hourglass case.

Table 2: Mean µ and standard deviation σ of the adopted networks over the Envisat test dataset.

Network                  No. Params   µ [pxl]   σ [pxl]
Single-stack Hourglass   1.8M         3.4       4.3
HRNet                    25M          2.4       1.4

Although HRNet proves to return more accurate features, it is also believed that the larger-RMSE scenarios returned by the Single-stack Hourglass can be properly handled if a larger uncertainty can be associated to their corresponding heatmaps. As an example, a large RMSE could be associated to the inaccurate detection of only a few features which, if properly weighted, would not have a severe impact on the pose estimation step. This task can be performed by deriving a covariance matrix for each detected feature, in order to represent its detection uncertainty. Above all, this may prevent the pose solver and the navigation filter from trusting wrong detections by relying more on other, accurate features. In this way, the navigation filter can cope with poorly accurate heatmaps while at the same time relying on a computationally-light CNN.

4. Covariance Computation

Compared to the methods discussed in Section I [13, 14, 15], the proposed method derives a covariance matrix associated to each feature directly from the heatmaps detected by the CNN, rather than from the computation of the image gradient around each feature. In order to do so, the first step is to obtain a statistical population around the heatmap's peak. This is done by thresholding each heatmap image so that only the x- and y-locations of the pixels above the threshold are extracted. Secondly, each pixel within the population is given a normalized weight $w_i$ based on the gray intensity $I_i$ at its location,

$$w_i = w_R R_i + w_G G_i + w_B B_i, \quad (4)$$

where $R, G, B$ are the components of the coloured image and $w_R, w_G, w_B$ are the weights assigned to each channel in order to get the grayscale intensity. This is done in order to give more weight to pixels which are particularly bright and close to the peak, and less weight to pixels which are very faint and far from the peak. Finally, the obtained statistical population of each feature is used to compute the weighted covariance between $x, y$ and consequently the covariance matrix $C_i$,

$$C_i = \begin{bmatrix} \mathrm{cov}(x,x) & \mathrm{cov}(x,y) \\ \mathrm{cov}(y,x) & \mathrm{cov}(y,y) \end{bmatrix}, \quad (5)$$

where

$$\mathrm{cov}(x,y) = \sum_{i=1}^{n} w_i (x_i - p_x)(y_i - p_y) \quad (6)$$

and $n$ is the number of pixels in each feature's heatmap. In this work, the mean is replaced by the peak location $\mathbf{p} = (p_x, p_y)$ in order to represent a distribution around the peak of the detected feature, rather than around the heatmap's mean. This is particularly relevant when the heatmaps are asymmetric and their mean does not coincide with their peak.
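The peak-centred weighted covariance of Eqs. (5)-(6) can be sketched as follows; it is assumed here that the thresholding step has already produced the pixel list, and that the weights passed in are the grayscale intensities of Eq. (4), normalized inside the function.

```python
def heatmap_covariance(pixels, weights, peak):
    """Weighted 2x2 covariance of Eqs. (5)-(6), computed around the
    heatmap peak rather than the sample mean. `pixels` is a list of
    (x, y) locations above the threshold, `weights` their grayscale
    intensities."""
    px, py = peak
    total = sum(weights)
    w = [wi / total for wi in weights]  # normalize the weights to sum to one
    cxx = sum(wi * (x - px) ** 2 for wi, (x, y) in zip(w, pixels))
    cyy = sum(wi * (y - py) ** 2 for wi, (x, y) in zip(w, pixels))
    cxy = sum(wi * (x - px) * (y - py) for wi, (x, y) in zip(w, pixels))
    return [[cxx, cxy], [cxy, cyy]]

# Four equally bright pixels placed symmetrically around a peak at (10, 10)
C = heatmap_covariance([(9, 10), (11, 10), (10, 9), (10, 11)],
                       [1.0, 1.0, 1.0, 1.0], peak=(10, 10))
```

An asymmetric pixel population would instead yield a non-zero off-diagonal term, capturing the tilted ellipses of Figure 9.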

Figure 9 shows the overall flow to obtain the covariance matrix for three different heatmap shapes. The ellipse associated to each feature's covariance is obtained by computing the eigenvalues $\lambda_x$ and $\lambda_y$ of the covariance matrix,

$$\left(\frac{x}{\lambda_x}\right)^2 + \left(\frac{y}{\lambda_y}\right)^2 = s, \tag{7}$$

where $s$ defines the scale of the ellipse and is derived from the confidence interval of interest, e.g. $s = 2.2173$ for a 68% confidence interval. As can be seen, different heatmaps can result in very different covariance matrices. Above all, the computed covariance can capture the different CNN uncertainty over $x$ and $y$. Notice that, due to its symmetric nature, the covariance matrix can only represent bivariate normal distributions. As a result, asymmetrical heatmaps such as the one in the third scenario are approximated by Gaussian distributions characterized by an ellipse which might overestimate the heatmap's dispersion over some directions.
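The ellipse construction can be sketched as follows, under the common convention that the semi-axes scale with the square roots of the covariance eigenvalues times the confidence scale (the exact normalization in Eq. 7 may differ; this is an illustrative assumption).

```python
import numpy as np

def confidence_ellipse_axes(C, s=2.2173):
    """Semi-axes and orientation of the confidence ellipse for covariance C.

    s is the scale derived from the confidence interval of interest
    (e.g. s = 2.2173 for 68%, as stated in the text). The sqrt(s * lambda)
    normalization is one common convention, assumed here for illustration.
    """
    eigvals, eigvecs = np.linalg.eigh(C)                # eigenvalues, ascending
    lam_min, lam_max = eigvals
    a = np.sqrt(s * lam_max)                            # semi-major axis
    b = np.sqrt(s * lam_min)                            # semi-minor axis
    angle = np.arctan2(eigvecs[1, -1], eigvecs[0, -1])  # major-axis direction
    return a, b, angle
```

An isotropic covariance yields equal semi-axes, i.e. the circular ellipses associated with accurate detections in the first test scenario.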

5. Pose Estimation

The CEPPnP method proposed by Ferraz et al. [13] was selected to estimate the relative pose from the detected features as well as from their covariance matrices. The first step of this method is to rewrite the PnP problem in Eqs. 1-2 as a function of a 12-dimensional vector $y$ containing the control point coordinates


Figure 9: Schematic of the procedure followed to derive covariance matrices from CNN heatmaps (panels: thresholding, weight allocation, covariance computation). The displayed ellipses are derived from the computed covariances by assuming the confidence intervals 1σ = 0.68 and 3σ = 0.99.

in the camera reference system,

$$M y = 0, \tag{8}$$

where $M$ is a $2n \times 12$ known matrix. This is the fundamental equation in the EPnP problem [9]. The likelihood of each observed feature location $u_i$ is then represented as

$$P(u_i) = k \cdot e^{-\frac{1}{2}\,\Delta u_i^T C_{u_i}^{-1} \Delta u_i}, \tag{9}$$

where $\Delta u_i$ is a small, independent and unbiased noise with expectation $E[\Delta u_i] = 0$ and covariance $E[\Delta u_i \Delta u_i^T] = \sigma^2 C_{u_i}$, and $k$ is a normalization constant. Here, $\sigma^2$ represents the global uncertainty in the image, whereas $C_{u_i}$ is the $2 \times 2$ unnormalized covariance matrix representing the Gaussian distribution of each detected feature, computed from the CNN heatmaps. After some calculations [13], the EPnP formulation can be rewritten as

$$(N - L)\,y = \lambda y. \tag{10}$$

This is an eigenvalue problem in which both the $N$ and $L$ matrices are a function of $y$ and $C_{u_i}$. The problem is solved iteratively, starting from the closed-form EPPnP solution for the four control points, which assumes no feature uncertainty. Once $y$ is estimated, the relative pose is computed by solving the generalized Orthogonal Procrustes problem used in the EPPnP [10].
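The final pose-recovery step can be illustrated with a generic SVD-based solution of the Orthogonal Procrustes problem over 3D-3D correspondences; this is a Kabsch-style sketch of the idea, not the authors' EPPnP implementation.

```python
import numpy as np

def procrustes_pose(model_pts, cam_pts):
    """Rigid pose (R, t) aligning model-frame points to camera-frame points.

    Generic SVD-based Orthogonal Procrustes solve, illustrating the final
    step of EPPnP-style solvers; a sketch, not the paper's exact code.
    model_pts, cam_pts: (n, 3) arrays of corresponding 3D points.
    """
    mu_m = model_pts.mean(axis=0)
    mu_c = cam_pts.mean(axis=0)
    H = (model_pts - mu_m).T @ (cam_pts - mu_c)       # 3x3 correlation matrix
    U, _, Vt = np.linalg.svd(H)
    # Reflection guard: force a proper rotation (det(R) = +1)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = mu_c - R @ mu_m
    return R, t
```

Given noise-free correspondences, the solver recovers the generating rotation and translation exactly.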

6. Navigation Filter

Several navigation filters for close-proximity operations were investigated in recent years in the context of relative pose estimation. The reader is referred to Pasqualetto Cassinis et al. [8] for a comprehensive overview that goes beyond the scope of this work. In the proposed navigation system, the so-called Multiplicative Extended Kalman Filter (MEKF) is used. Remarkably, other works [15, 30] adopted a standard formulation of the EKF that propagates the relative pose, expressed in terms of relative position and quaternions, as well as the relative translational and rotational velocities (prediction step), correcting the prediction with the measurements obtained from the monocular camera (correction step). However, the quaternion set consists of four parameters to describe the 3DOF attitude, hence one of its parameters is deterministic. As reported by Tweddle and Saenz-Otero [36] and Sharma and D'Amico [31], this makes the covariance matrix of a quaternion have one eigenvalue that is exactly zero. As a result, the entire state covariance propagated by the filter may become non-positive-definite and lead to the divergence of the filter. The MEKF, introduced for the first time by Lefferts et al. [37], aims at solving the above issue by using two different parametrizations of the relative attitude. A three-element error parametrization, expressed in terms of quaternions, is propagated and corrected inside the filter to return an estimate of the attitude error. At each estimation step, this error estimate is used to update a reference quaternion and is reset to zero for the next iteration. Notably, the reset step prevents the attitude error parametrization from reaching singularities, which generally occur for large angles.

6.1. Propagation Step

A standard EKF state vector for relative pose estimation is composed of the relative pose between the servicer and the target, as well as the relative translational and rotational velocities $v$ and $\omega$. Under the assumption that the camera frame onboard the servicer is co-moving with the LVLH frame, with the camera boresight aligned with the along-track direction, this translates into

$$x = \begin{bmatrix} t_C & v & q & \omega \end{bmatrix}^T, \tag{11}$$

where $q = \begin{bmatrix} q_0 & q_v \end{bmatrix}$ is the quaternion set that represents the relative attitude. Notice that the assumption of the camera co-moving with the LVLH frame is made only to focus on the navigation aspects rather than on the attitude control of the servicer. Therefore, the application of the filter can be extended to other scenarios, if attitude control is included in the system.


In the MEKF, the modified state vector propagated inside the filter becomes

$$\tilde{x} = \begin{bmatrix} t_C & v & a & \omega \end{bmatrix}^T, \tag{12}$$

where $a$ is four times the Modified Rodrigues Parameters (MRP) $\sigma$,

$$a = 4\sigma = 4\,\frac{q_v}{1 + q_0}. \tag{13}$$

The discrete attitude propagation step is derived by linearizing $\dot{a}$ around $a = 0_{3\times1}$ and assuming small angle rotations [36],

$$\dot{a} = \frac{1}{2}[\omega\times]\,a + \omega. \tag{14}$$
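The mapping of Eq. 13 and its inverse (which reappears as Eq. 36 in the correction step) can be sketched as:

```python
import numpy as np

def quat_to_a(q):
    """Attitude-error parameter a = 4*q_v / (1 + q0) (Eq. 13), q = [q0, qv] scalar-first."""
    return 4.0 * q[1:] / (1.0 + q[0])

def a_to_quat(a):
    """Inverse mapping (Eq. 36): dq0 = (16 - |a|^2)/(16 + |a|^2), dqv = 8a/(16 + |a|^2)."""
    n2 = float(a @ a)
    return np.concatenate(([(16.0 - n2) / (16.0 + n2)], 8.0 * a / (16.0 + n2)))
```

The two mappings are exact inverses for any unit quaternion with $q_0 > -1$, consistent with the singularity for large angles mentioned above.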

As a result, the discrete linearized propagation of the full state becomes

$$\tilde{x}_k = \Phi_k\,\tilde{x}_{k-1} + \Gamma_k Q_k, \tag{15}$$

where $Q_k$ represents the process noise and

$$\Phi_k = \begin{bmatrix} \Phi_{CW} & 0_{6\times6} \\ 0_{6\times6} & \Phi_{a,\omega} \end{bmatrix} \tag{16}$$

$$\Phi_{CW} = \begin{bmatrix} \Phi_{rr} & \Phi_{rv} \\ \Phi_{vr} & \Phi_{vv} \end{bmatrix} \tag{17}$$

$$\Phi_{rr} = \begin{bmatrix} 1 & 0 & 6(\Delta\theta - \sin\Delta\theta) \\ 0 & \cos\Delta\theta & 0 \\ 0 & 0 & 4 - 3\cos\Delta\theta \end{bmatrix} \tag{18}$$

$$\Phi_{rv} = \begin{bmatrix} \frac{1}{\omega_s}(4\sin\Delta\theta - 3\Delta\theta) & 0 & \frac{2}{\omega_s}(1 - \cos\Delta\theta) \\ 0 & \frac{1}{\omega_s}\sin\Delta\theta & 0 \\ \frac{2}{\omega_s}(\cos\Delta\theta - 1) & 0 & \frac{\sin\Delta\theta}{\omega_s} \end{bmatrix} \tag{19}$$

$$\Phi_{vr} = \begin{bmatrix} 0 & 0 & 6\omega_s(1 - \cos\Delta\theta) \\ 0 & -\omega_s\sin\Delta\theta & 0 \\ 0 & 0 & 3\omega_s\sin\Delta\theta \end{bmatrix} \tag{20}$$

$$\Phi_{vv} = \begin{bmatrix} 4\cos\Delta\theta - 3 & 0 & 2\sin\Delta\theta \\ 0 & \cos\Delta\theta & 0 \\ -2\sin\Delta\theta & 0 & \cos\Delta\theta \end{bmatrix} \tag{21}$$

$$\Phi_{a,\omega} = \begin{bmatrix} e^{\frac{1}{2}[\omega\times]\Delta t} & \int_0^{\Delta t} e^{\frac{1}{2}[\omega\times]\tau}\,d\tau \\ 0_{3\times3} & I_{3\times3} \end{bmatrix} \tag{22}$$

$$\Gamma_{k+1} = \begin{bmatrix} \frac{1}{2m} I_{3\times3}\,\Delta t^2 & 0_{3\times3} \\ \frac{1}{m} I_{3\times3}\,\Delta t & 0_{3\times3} \\ 0_{3\times3} & \int_0^{\Delta t} e^{\frac{1}{2}[\omega\times]\tau} J^{-1}\,d\tau \\ 0_{3\times3} & J^{-1}\Delta t \end{bmatrix}. \tag{23}$$

The terms $\omega_s$ and $\Delta\theta$ in Eq. 17 represent the servicer orbital angular rate and the true anomaly variation from time $t_0$ to $t$, respectively, whereas the term $J$ in Eq. 23 is the inertia matrix of the target spacecraft. In Tweddle and Saenz-Otero [36], the integral terms in Eqs. 22-23 are solved by creating a temporary linear system from Eq. 14, augmented with the angular velocity and the process noise. The State Transition Matrix of this system is then solved numerically with the matrix exponential.
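The Clohessy-Wiltshire block of Eqs. 17-21 can be assembled as follows; the per-axis ordering and the sign of the cross-track velocity term are assumptions based on the standard CW solution, restored here so that the blocks form a consistent state transition matrix.

```python
import numpy as np

def cw_stm(ws, dth):
    """Clohessy-Wiltshire state transition matrix Phi_CW (Eqs. 17-21).

    ws: servicer orbital rate; dth: true anomaly variation from t0 to t.
    Axis ordering follows the matrices in the text; a sketch for illustration.
    """
    c, s = np.cos(dth), np.sin(dth)
    Prr = np.array([[1.0, 0.0, 6.0 * (dth - s)],
                    [0.0, c, 0.0],
                    [0.0, 0.0, 4.0 - 3.0 * c]])
    Prv = np.array([[(4.0 * s - 3.0 * dth) / ws, 0.0, 2.0 * (1.0 - c) / ws],
                    [0.0, s / ws, 0.0],
                    [2.0 * (c - 1.0) / ws, 0.0, s / ws]])
    Pvr = np.array([[0.0, 0.0, 6.0 * ws * (1.0 - c)],
                    [0.0, -ws * s, 0.0],      # assumed sign, from d/dt of cos
                    [0.0, 0.0, 3.0 * ws * s]])
    Pvv = np.array([[4.0 * c - 3.0, 0.0, 2.0 * s],
                    [0.0, c, 0.0],
                    [-2.0 * s, 0.0, c]])
    return np.block([[Prr, Prv], [Pvr, Pvv]])
```

Two sanity checks confirm the reconstruction: the matrix reduces to the identity for $\Delta\theta = 0$, and it composes like a true state transition matrix, $\Phi(\Delta\theta_1 + \Delta\theta_2) = \Phi(\Delta\theta_2)\,\Phi(\Delta\theta_1)$.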

6.2. Correction Step

At this stage, the propagated state $\tilde{x}_k$ is corrected with the measurements $z$ to return an update of the state $\hat{x}_k$. In a loosely-coupled filter, these measurements are represented by the relative pose between the servicer and the target spacecraft, obtained by solving the PnP problem with the CEPPnP solver described in Section 5. In this case, a pseudomeasurements vector is derived by transforming the relative quaternion set into the desired attitude error $a$,

$$\delta q_z = q_z \otimes q_{ref,k}^{*} \;\rightarrow\; a = 4\,\frac{\delta q_v}{1 + \delta q_0} \tag{24}$$


$$z_k = \begin{bmatrix} t_C \\ a \end{bmatrix} = H_k x_k + V_k = \begin{bmatrix} I_{3\times3} & 0_{3\times3} & 0_{3\times3} & 0_{3\times3} \\ 0_{3\times3} & 0_{3\times3} & I_{3\times3} & 0_{3\times3} \end{bmatrix} \begin{bmatrix} t_C \\ v \\ a \\ \omega \end{bmatrix}_k + \begin{bmatrix} V_r \\ V_a \end{bmatrix}_k. \tag{25}$$

In Eq. 24, $\otimes$ denotes the quaternion product. Conversely, in a tightly-coupled filter the measurements are represented by the pixel coordinates of the detected features,

$$z = (x_1, y_1, \ldots, x_n, y_n)^T. \tag{26}$$

Referring to Eqs. 1-2, this translates into the following equations for each detected point $p_i$:

$$h_i = \begin{bmatrix} \dfrac{x_i^C}{z_i^C} f_x + C_x, & \dfrac{y_i^C}{z_i^C} f_y + C_y \end{bmatrix}^T \tag{27}$$

$$r_i^C = q \otimes r_i^B \otimes q^{*} + t_C, \tag{28}$$

where $q^{*}$ is the quaternion conjugate.
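Eqs. 27-28 amount to a quaternion rotation of a body-frame feature followed by a pinhole projection; a minimal sketch, where the intrinsics $f_x$, $f_y$, $C_x$, $C_y$ are assumed illustrative values rather than the camera parameters used in the paper:

```python
import numpy as np

def quat_rotate(q, v):
    """Rotate vector v by unit quaternion q = [q0, q1, q2, q3] (Eq. 28)."""
    q0, qv = q[0], q[1:]
    # Expanded form of q (x) [0, v] (x) q*
    return v + 2.0 * q0 * np.cross(qv, v) + 2.0 * np.cross(qv, np.cross(qv, v))

def project_feature(q, t_c, r_b, fx, fy, cx, cy):
    """Pixel coordinates of body-frame point r_b (Eqs. 27-28)."""
    r_c = quat_rotate(q, r_b) + t_c                  # camera-frame point
    return np.array([r_c[0] / r_c[2] * fx + cx,      # Eq. 27, x component
                     r_c[1] / r_c[2] * fy + cy])     # Eq. 27, y component
```

This observation function is the $h_i$ that the tightly-coupled filter linearizes in Eqs. 30-34.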

equation can be written as

zk=Hkxk+V=

HtC,i 02n×3Ha,i 02n×3

.

.

..

.

..

.

..

.

.

HtC,n 02n×3Ha,n 02n×3

tC

v

a

w

k

+

Vr

Va

(29)

and the Jacobian

Hk

of the observation model with respect of the state vector

is a 2n×13 matrix whose elements are440

$$H_{t_C,i} = H_i^{int} \cdot H_{t_C,i}^{ext} \tag{30}$$


$$H_{a,i} = H_i^{int} \cdot H_{a,i}^{ext} = H_i^{int} \cdot H_{q,i}^{ext} \cdot H_a^{q} \tag{31}$$

$$H_i^{int} = \frac{\partial h_i}{\partial r_i^C} = \begin{bmatrix} \dfrac{f_x}{z_i^C} & 0 & -\dfrac{f_x}{(z_i^C)^2}\,x_i^C \\ 0 & \dfrac{f_y}{z_i^C} & -\dfrac{f_y}{(z_i^C)^2}\,y_i^C \end{bmatrix} \tag{32}$$

$$H_{q,i}^{ext} = \frac{\partial r_i^C}{\partial q} = \frac{\partial\,(q \otimes r_i^B \otimes q^{*})}{\partial q}; \qquad H_{t_C,i}^{ext} = \frac{\partial r_i^C}{\partial t_C} = I_3 \tag{33}$$

$$H_a^{q} = \frac{\partial\,(\delta q \otimes q_{ref})}{\partial a} = \frac{\partial\,(Q_{ref}\,\delta q)}{\partial a} = Q_{ref}\,\frac{\partial \delta q}{\partial a} \tag{34}$$

$$Q_{ref} = \begin{bmatrix} q_0 & -q_1 & -q_2 & -q_3 \\ q_1 & q_0 & q_3 & -q_2 \\ q_2 & -q_3 & q_0 & q_1 \\ q_3 & q_2 & -q_1 & q_0 \end{bmatrix}_{ref}. \tag{35}$$

The partial derivatives of the differential quaternion set with respect to the attitude error are computed from the relation between the attitude error $a$ and the differential quaternion set $\delta q$,

$$\delta q_0 = \frac{16 - \|a\|^2}{16 + \|a\|^2}, \qquad \delta q_v = \frac{8\,a}{16 + \|a\|^2} \tag{36}$$

$$\frac{\partial \delta q}{\partial a} = \frac{8}{\left(16 + \|a\|^2\right)^2} \begin{bmatrix} -8a_1 & -8a_2 & -8a_3 \\ 16 + \|a\|^2 - 2a_1^2 & -2a_1 a_2 & -2a_1 a_3 \\ -2a_2 a_1 & 16 + \|a\|^2 - 2a_2^2 & -2a_2 a_3 \\ -2a_3 a_1 & -2a_3 a_2 & 16 + \|a\|^2 - 2a_3^2 \end{bmatrix}. \tag{37}$$
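The analytical Jacobian of Eq. 37 can be checked against a numerical differentiation of Eq. 36; a short sketch:

```python
import numpy as np

def delta_q(a):
    """Differential quaternion from attitude error a (Eq. 36), scalar-first."""
    n2 = float(a @ a)
    return np.concatenate(([(16.0 - n2) / (16.0 + n2)], 8.0 * a / (16.0 + n2)))

def ddq_da(a):
    """Analytical 4x3 Jacobian d(delta q)/da (Eq. 37)."""
    n2 = float(a @ a)
    J = np.zeros((4, 3))
    J[0, :] = -8.0 * a                                       # scalar-part row
    J[1:, :] = (16.0 + n2) * np.eye(3) - 2.0 * np.outer(a, a)  # vector-part rows
    return 8.0 / (16.0 + n2) ** 2 * J
```

Central finite differences of Eq. 36 reproduce Eq. 37 to machine-level accuracy, confirming the reconstructed matrix.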

In the tightly-coupled filter, the measurement covariance matrix $R$ is a time-varying block diagonal matrix constructed with the heatmaps-derived covariances $C_i$ in Eq. 5,

$$R = \begin{bmatrix} C_1 & & \\ & \ddots & \\ & & C_n \end{bmatrix}. \tag{38}$$

Notice that $C_i$ can differ for each feature in a given frame as well as vary over time. Preliminary navigation results [30] already showed that such a heatmaps-derived covariance matrix can capture the statistical distribution of the measured features and improve the measurements update step of the navigation filter. Conversely, in the loosely-coupled filter $R$ represents the uncertainty in the pose estimation step and hence it is not directly related to the CNN heatmaps. A constant value is therefore chosen based on the pose estimation accuracy observed for the test dataset.


Finally, the updated state estimate $\hat{x}_k$ is obtained from the propagated state $\tilde{x}_k$, the residuals $\tilde{y}$, and the Kalman gain $K$,

$$\tilde{y} = z - h(\tilde{x}_k) \tag{39}$$

$$K = P_k H_k^T \left(H_k P_k H_k^T + R_k\right)^{-1} \tag{40}$$

$$\hat{x}_k = \tilde{x}_k + K\tilde{y}. \tag{41}$$
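Eqs. 39-41 are the standard Kalman measurement update; a generic sketch (the simple covariance update shown here is one common choice, not necessarily the one used by the authors):

```python
import numpy as np

def ekf_update(x_pred, P_pred, z, h, H, R):
    """Kalman measurement update (Eqs. 39-41); illustrative sketch.

    x_pred, P_pred: propagated state and covariance; z: measurement vector;
    h: measurement function; H: its Jacobian at x_pred; R: measurement covariance.
    """
    y = z - h(x_pred)                                 # residuals (Eq. 39)
    S = H @ P_pred @ H.T + R                          # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)               # Kalman gain (Eq. 40)
    x_upd = x_pred + K @ y                            # updated state (Eq. 41)
    P_upd = (np.eye(len(x_pred)) - K @ H) @ P_pred    # covariance update
    return x_upd, P_upd
```

With a near-zero measurement covariance, the update drives the measured component of the state to the measurement while leaving the unobserved component untouched.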

6.3. Reset Step

In the reset step, the reference quaternion $q_{ref}$ is updated with the attitude error estimate $\hat{a}$ and the new attitude error is set to zero,

$$\hat{q}_k = \delta q(\hat{a}) \otimes q_{ref,k} \tag{42}$$

$$\hat{a} = 0_{3\times1} \tag{43}$$

$$q_{ref,k+1} = \hat{q}_k. \tag{44}$$

The obtained estimated quaternion set $\hat{q}_k$ is then compared to the real quaternion set to assess the angle accuracy of the filter.
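The reset step of Eqs. 42-44, with $\delta q(\hat{a})$ built from Eq. 36, can be sketched as follows (scalar-first quaternion convention assumed):

```python
import numpy as np

def quat_mul(p, q):
    """Quaternion product p (x) q, scalar-first convention [q0, q1, q2, q3]."""
    p0, pv = p[0], p[1:]
    q0, qv = q[0], q[1:]
    return np.concatenate(([p0 * q0 - pv @ qv],
                           p0 * qv + q0 * pv + np.cross(pv, qv)))

def reset_step(a_hat, q_ref):
    """MEKF reset (Eqs. 42-44): fold the error estimate into the reference
    quaternion and zero the attitude-error state; delta_q follows Eq. 36."""
    n2 = float(a_hat @ a_hat)
    dq = np.concatenate(([(16.0 - n2) / (16.0 + n2)],
                         8.0 * a_hat / (16.0 + n2)))
    q_new = quat_mul(dq, q_ref)
    return q_new / np.linalg.norm(q_new), np.zeros(3)   # (Eq. 44), (Eq. 43)
```

A zero error estimate leaves the reference quaternion unchanged, and the reset always returns a unit quaternion.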

7. Simulations

In this section, the simulation environment and the results are presented. Firstly, the impact of including a heatmaps-derived covariance in the pose estimation step is addressed by comparing the CEPPnP method with a standard solver which does not account for feature uncertainty. The weights in Eq. 4 are selected based on the standard RGB-to-grayscale conversion ($w_R = 0.299$, $w_G = 0.587$, $w_B = 0.114$). Secondly, the performance of the MEKF is evaluated by comparing the convergence profiles obtained with a heatmaps-derived covariance matrix against covariance matrices with arbitrarily selected covariances. Initialization is provided by the CEPPnP for all the scenarios.

Two separate error metrics are adopted in the evaluation, in accordance with Sharma and D'Amico [20]. Firstly, the translational error between the estimated relative position $\hat{t}_C$ and the ground truth $t_C$ is computed as

$$E_T = |t_C - \hat{t}_C|. \tag{45}$$

This metric is also applied to the translational and rotational velocities estimated in the navigation filter. Secondly, the attitude accuracy is measured in terms of the Euler axis-angle error between the estimated quaternion $\hat{q}$ and the ground truth $q$,

$$\beta = \begin{bmatrix} \beta_s & \beta_v \end{bmatrix} = q \otimes \hat{q} \tag{46}$$

$$E_R = 2\arccos\left(|\beta_s|\right). \tag{47}$$
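The metrics of Eqs. 45-47 can be sketched as follows; here the error quaternion is formed with the conjugate of the estimate, so that identical attitudes give a zero error, which is an assumption on top of the notation in Eq. 46:

```python
import numpy as np

def translation_error(t_true, t_est):
    """Per-axis absolute translational error (Eq. 45)."""
    return np.abs(t_true - t_est)

def attitude_error(q_true, q_est):
    """Euler axis-angle error (Eqs. 46-47), scalar-first quaternions.

    The error quaternion is formed with the conjugate of the estimate
    (assumed here) so that identical attitudes yield zero error.
    """
    q_est_conj = np.concatenate(([q_est[0]], -q_est[1:]))
    beta_s = q_true[0] * q_est_conj[0] - q_true[1:] @ q_est_conj[1:]
    return 2.0 * np.arccos(min(1.0, abs(beta_s)))   # clamp guards fp round-off
```

For example, a 90-degree rotation about any axis relative to the estimate yields $E_R = \pi/2$.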


7.1. Pose Estimation

Three representative scenarios are selected from the CNN test dataset for a preliminary evaluation of the Single-stack Hourglass performance. These scenarios were chosen in order to analyze different heatmaps' distributions around the detected features. A comparison is made between the proposed CEPPnP and the EPPnP. Figure 10 shows the characteristics of the covariance matrices derived from the predicted heatmaps. Here, the ratio between the minimum and maximum eigenvalues of the associated covariances is represented against the ellipse's area and the RMSE between the Ground Truth (GT) and the $x$, $y$ coordinates of the extracted features,

$$E_{RMSE,i} = \sqrt{(x_{GT,i} - x_i)^2 + (y_{GT,i} - y_i)^2}. \tag{48}$$

Notably, interesting relations can be established between the three quantities reported in the figure. In the first scenario, the correlation between the sub-pixel RMSE and the large eigenvalues ratio suggests that a very accurate CNN detection can be associated with circular-shaped heatmaps. Moreover, the relatively low ellipses' areas indicate that, in general, small heatmaps are expected for an accurate detection. Conversely, in the second scenario the larger ellipses' area correlates with a larger RMSE. Furthermore, it can be seen that the largest difference between the $x$- and $y$-components of the RMSE occurs either for the most eccentric heatmap (ID 13) or for the one with the largest area (ID 6). The same behaviour can be observed in the last scenario, where the largest RMSE coincides with a large, highly eccentric heatmap.

Table 3 lists the pose estimation results for the three scenarios. As anticipated in Figure 10, the statistical information derived from the heatmaps in the first scenario is uniform for all the features, due to the very accurate CNN detection. As a result, the inclusion of features covariance in the CEPPnP solver does not help refine the estimated pose. Both solvers are characterized by the same pose accuracy.

Figure 10: Characteristics of the ellipses derived from the covariance matrices for the three selected scenarios ((a) Test scenario 1, (b) Test scenario 2, (c) Test scenario 3; each panel reports, per frame ID, the eigenvalues ratio, the ellipse's area, and the $x$/$y$ RMSE components).

Table 3: Single-stack Hourglass pose estimation performance results for the three selected representative scenarios.

Metric       Scenario   CEPPnP              EPPnP
E_T [m]      1          [0.18 0.22 0.24]    [0.17 0.22 0.24]
             2          [0.35 0.41 0.59]    [0.14 0.40 22.8]
             3          [0.49 0.12 1.41]    [0.56 0.16 5.01]
E_R [deg]    1          0.36                0.35
             2          0.75                6.08
             3          1.99                2.72

Not surprisingly, the situation changes as soon as the heatmaps are not uniform across the feature IDs. Due to its capability of accommodating feature uncertainties in the estimation, the CEPPnP method outperforms the EPPnP for the remaining scenarios. In other words, the CEPPnP solver proves to be more robust against inaccurate CNN detections by accounting for a reliable representation of the features covariance.

Next, the previous comparison is extended to the entire test dataset as well as to HRNet, by computing the mean and standard deviation of the estimated relative position and attitude as a function of the relative range. This is represented in Figures 11-12. First of all, it can be seen that the pose accuracy of the CEPPnP solver in the Single-stack Hourglass scenario does not improve compared to the EPPnP, as opposed to the ideal behaviour reported in Table 3. There are two potential causes of this behaviour. On the one hand, most of the test images characterized by a large RMSE (Figure 8) could not return statistically-meaningful heatmaps that would help the CEPPnP solver. This could be due to multiple heatmaps or highly inaccurate detections in which two different corners are confused with each other. On the other hand, this could be a direct consequence of the large relative ranges considered in this work. As already reported by Park et al. [26] and Sharma and D'Amico [31], a decreasing performance of EPPnP is indeed expected for increasing relative distances, due to the nonlinear relation between the pixel location of the detected features and $z^C$ in Eq. 2. In other words, relatively large pixel errors could lead to inaccurate pose estimates at large relative distances, independently of the use of either CEPPnP or EPPnP.

Furthermore, it can be seen from a different comparison level that both the mean and standard deviation of the estimated relative pose are improved when HRNet is used prior to the PnP solver (Figures 11b-12b). Again, this is a direct consequence of the smaller RMSE reported in Figure 8. As a result, the above-mentioned degradation of the pose estimation accuracy for increasing relative ranges is less critical for HRNet. Notice also that, despite an actual


Figure 11: Pose Estimation Results - Mean position error versus relative range for (a) Single-stack Hourglass and (b) HRNet. The standard deviation of the position error $E_T$ is depicted as the length of each error bar above and below the mean error $\bar{E}_T$.

improvement of CEPPnP over EPPnP in the HRNet scenario, the improvements in both the mean and the standard deviation of the estimation error are relatively small at large relative distances. This is considered to be related to the fact that HRNet returns circular heatmaps for most of the detected features, due to its higher detection accuracy compared to the Single-stack Hourglass.

Notably, it is important to assess how well the pose estimation system can scale when tested on datasets different from the Envisat one. To this aim, the proposed heatmaps-based scheme was benchmarked on the SPEED dataset, in order to compare its pose accuracy against standard as well as CNN-based systems [19, 26, 25]. The reader is referred to Barad [38, p. 115] for a comprehensive quantitative analysis of such a comparison. The results demonstrated that the performance of the proposed pipeline, based on extracting feature heatmaps and using the CEPPnP solver, compares well with state-of-the-art pose estimation systems.

7.2. Navigation Filter

To assess the performance of the proposed MEKF, a rendezvous scenario with Envisat is rendered in Cinema 4D©. This is a perturbation-free VBAR trajectory characterized by a relative velocity $\|v\| = 0$ m/s. The Envisat performs


Figure 12: Pose Estimation Results - Mean attitude error versus relative range for (a) Single-stack Hourglass and (b) HRNet. The standard deviation of the attitude error $E_R$ is depicted as the length of each error bar above and below the mean error $\bar{E}_R$.

a roll rotation of $\|\omega\|$ = 5 deg/s, with the servicer camera frame aligned with the LVLH frame. Table 4 lists the initial conditions of the trajectory, whereas Figure 13 shows some of the associated rendered 2D images. It is assumed that the images are made available to the filter every 2 seconds for the measurement update step, with the propagation step running at 1 Hz. In both scenarios, the MEKF is initialized with the CEPPnP pose solution at time $t_0$. The other elements of the initial state vector are randomly chosen assuming a standard deviation of 1 mm/s and 1 deg/s, for all axes, on the terms $(\hat{v}_0 - v)$ and $(\hat{\omega}_0 - \omega)$, respectively. Table 5 reports the initial conditions of the filter.

Table 4: VBAR approach scenario. The attitude is represented in terms of ZYX Euler angles for clarity. Note that the camera boresight is the Y-axis of the LVLH frame.

θ0 [deg]          ω0 [deg/s]           t_C0 [m]      v0 [mm/s]
[-180 30 -80]^T   [-2.5 -4.3 0.75]^T   [0 150 0]^T   [0 0 0]^T

Figures 14-15 show the convergence profiles for the translational and rotational states in the tightly- and loosely-coupled MEKF, respectively. Besides, a Monte Carlo simulation with 1,000 runs was performed to assess the robustness of the filter estimate against variations of the initial state $\hat{x}_0$. Table 6 lists the standard deviation chosen for the deviation from the true initial state of the filter. The


Figure 13: Montage of the selected VBAR approach scenario. Images are shown every 6 s for clarity.

Table 5: Initial state vector in the MEKF. Here, HG refers to the Single-stack Hourglass architecture.

CNN     θ̂0 [deg]                ω̂0 [deg/s]          t̂_C0 [m]              v̂0 [mm/s]
HG      [-180.2 28.7 -80.6]^T   [-2.1 -4.1 0.1]^T   [0.1 149.7 0.1]^T     [2.8 -1.3 3]^T
HRNet   [-179.5 33.5 -79.7]^T   [-2.1 -4.1 0.1]^T   [-0.1 149.8 -0.1]^T   [2.8 -1.3 3]^T

Table 6: Standard deviation of Monte Carlo variables.

σ_Δφ0 [deg]   σ_ω0 [deg/s]   σ_tC0 [m]    σ_v0 [mm/s]
10            1              [1 10 1]^T   10


distribution follows a Gaussian profile with true-state mean. For the attitude initial error, the initial reference quaternion $q_{ref,0}$ is perturbed by introducing a random angular error around the correct Euler axis [39, p. 44],

$$\Delta\boldsymbol{\phi}_0 = \Delta\phi_0\, q_v \tag{49}$$

$$q_{ref,0} = q_0 \otimes \begin{bmatrix} 1 \\ \frac{1}{2}\Delta\boldsymbol{\phi}_0 \end{bmatrix}. \tag{50}$$
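The perturbation of Eqs. 49-50 can be sketched as follows, assuming the random angle is drawn from a zero-mean Gaussian with the standard deviation of Table 6 and that the axis is the unit vector part of the true quaternion:

```python
import numpy as np

def perturb_reference(q0, sigma_phi_deg, rng):
    """Perturb a reference quaternion by a random small angle (Eqs. 49-50).

    q0: true quaternion, scalar-first; sigma_phi_deg: assumed 1-sigma angular
    error in degrees; rng: a numpy random Generator.
    """
    dphi = np.radians(rng.normal(0.0, sigma_phi_deg))   # random angle
    axis = q0[1:] / np.linalg.norm(q0[1:])              # Euler axis from q_v (Eq. 49)
    dq = np.concatenate(([1.0], 0.5 * dphi * axis))     # small-angle quaternion (Eq. 50)
    # Quaternion product q0 (x) dq, scalar-first convention
    p0, pv = q0[0], q0[1:]
    d0, dv = dq[0], dq[1:]
    q = np.concatenate(([p0 * d0 - pv @ dv],
                        p0 * dv + d0 * pv + np.cross(pv, dv)))
    return q / np.linalg.norm(q)                        # renormalize
```

A zero standard deviation leaves the reference quaternion unchanged, and the output is always renormalized to unit length.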

Table 7 reports the mean of the steady-state pose estimates together with their standard deviation. From these results, important insights can be gained at two different levels of the comparison.

Table 7: Monte Carlo Simulation Results. The mean and standard deviation of the relative pose errors are taken from the absolute errors at filter steady state for each Monte Carlo run.

Single-stack Hourglass
MEKF              E_T1 [m]          E_T2 [m]        E_T3 [m]         E_R [deg]
Tightly-coupled   0.1182 ± 6.3E-4   0.03 ± 0.0024   0.096 ± 5.4E-4   1.33 ± 0.03
Loosely-coupled   0.1 ± 2E-7        0.33 ± 3E-7     0.01 ± 1E-4      4.7 ± 12.6

HRNet
MEKF              E_T1 [m]          E_T2 [m]        E_T3 [m]         E_R [deg]
Tightly-coupled   0.0683 ± 5.2E-4   0.03 ± 0.014    0.01 ± 3.4E-5    4.3 ± 0.03
Loosely-coupled   0.0075 ± 1E-4     0.36 ± 6.3E-5   0.002 ± 4E-4     0.93 ± 3.2E-7

On a CNN performance level, the results in Figure 14 show that a slightly worse cross-track estimate of the Single-stack Hourglass is compensated by a more accurate estimate of the relative attitude. Given the limited impact of these estimation errors at the relatively large inter-satellite range of 150 m, these results suggest that the Single-stack Hourglass has a performance comparable with that of HRNet for the selected scenario. Next, on a filter architecture level, a comparison between Figures 14-15 illustrates the different convergence patterns of the tightly- and loosely-coupled MEKF. Most importantly, it can be


seen that the loosely-coupled estimate of the relative along-track position is characterized by a bias which is not present in the tightly-coupled estimate. This occurs due to the decoupling of the translational and rotational states, reflected in the Jacobian $H_k$ in Eq. 25. As a result, the relative position is estimated without accounting for the attitude measurements, and vice versa. In other words, the creation of pseudomeasurements of the relative pose prior to the loosely-coupled filter leads to two separate translational and rotational estimates. Conversely, in the tightly-coupled filter the full statistical information is enclosed in the detected features, and can be used to simultaneously refine both the translational and the rotational states. Moreover, a close inspection of the Single-stack Hourglass attitude estimates in Table 7 suggests that the tightly-coupled MEKF is characterized by a lower standard deviation, highlighting a better robustness with respect to the initial conditions of the filter when compared to the loosely-coupled MEKF. Note that, due to the higher accuracy of HRNet in the feature detection step, and hence also in the pose estimation step, this is not observed for the latter CNN. In conclusion, a tightly-coupled architecture is expected to return higher pose accuracies if simplified CNNs, such as the proposed Single-stack Hourglass, are implemented at the feature detection level.

8. Conclusions and Recommendations

This paper introduces a novel framework to estimate the relative pose of an uncooperative target spacecraft with a single monocular camera onboard a servicer spacecraft. A method is proposed in which a CNN-based IP algorithm is combined with a CEPPnP solver and a tightly-coupled MEKF to return a robust estimate of the relative pose as well as of the relative translational and rotational velocities. The performance of the proposed method is evaluated at different levels of the pose estimation system, by comparing the detection accuracy of two different CNNs (feature detection step and pose estimation step) whilst assessing the accuracy and robustness of the selected tightly-coupled filter against a loosely-coupled filter (navigation filter step).


Figure 14: Navigation Filter Results - Tightly-coupled MEKF. Euler axis-angle error over the number of orbits for the Single-stack Hourglass and HRNet. The dashed lines represent the 1σ of the estimated quantities.

Figure 15: Navigation Filter Results - Loosely-coupled MEKF. Euler axis-angle error over the number of orbits for the Single-stack Hourglass and HRNet. The dashed lines represent the 1σ of the estimated quantities.

The main novelty of the proposed CNN-based pose estimation system is to introduce a heatmaps-derived covariance representation of the detected features and to exploit this information in a tightly-coupled, Single-stack Hourglass-based MEKF. On a feature detection level, the performance of the proposed Single-stack Hourglass is compared to the more complex HRNet to assess the feasibility of a reduced-parameters CNN within the IP. Results on the selected test dataset suggest a comparable mean detection accuracy, despite a larger standard deviation of the former network. Notably, this latter aspect is found to decrease the pose estimation accuracy of the proposed CNN compared to HRNet, despite the adoption of CEPPnP to capture features uncertainty. However, important insights are gained at a navigation filter level, delineating two major benefits of the proposed tightly-coupled MEKF. First of all, the capability of deriving a measurements covariance matrix directly from the CNN heatmaps allows the filter to capture a more representative statistical distribution of the measurements. Notably, this is expected to be a more complex task if a loosely-coupled filter is used, due to the need to convert the heatmaps distribution into a pose estimation uncertainty through a linear transformation. Secondly, the coupling between the rotational and translational states within the filter guarantees a mutual interaction which is expected to improve the global accuracy of the filter, especially in the along-track estimate. Besides, the navigation results for the selected VBAR scenario demonstrated that the proposed Single-stack Hourglass could represent a valid alternative to the more complex HRNet, provided that its larger detection uncertainty is reflected in the measurements covariance matrix. Together, these improvements suggest a promising scheme to cope with the challenging demand for robust navigation in close-proximity scenarios.

However, further work is required in several directions. First of all, more recent CNN architectures shall be investigated to assess the achievable robustness and accuracy in the feature detection step. Secondly, the impact of a reduction in the number of CNN parameters on the computational complexity shall be assessed by testing the CNNs on space-representative processors. Moreover, broader relative ranges between the servicer camera and the target spacecraft shall be considered, most importantly to allow a thorough investigation of the 3D depth perception challenges when approaching the target spacecraft with a single monocular camera. Besides, more close-proximity scenarios shall be recreated to assess the impact of perturbations on the accuracy and robustness of the navigation filter. In this context, other navigation filters such as the Unscented Kalman Filter shall be investigated.

Acknowledgments

This study is funded and supported by the European Space Agency and Airbus Defence and Space under the Network/Partnering Initiative (NPI) program with grant number NPI 577 - 2017. The first author would like to thank Christoph Haskampf for the rendering of the images in Cinema 4D© and Kuldeep Barad for the adaptation of HRNet to the Envisat scenario.

References660

[1]

A. Tatsch, N. Fitz-Coy, S. Gladun, On-orbit Servicing: A brief survey,

in: Proceedings of the 2006 Performance Metrics for Intelligent Systems

Workshop, 2006, pp. 21–23.

[2]

M. Wieser, H. Richard, G. Hausmann, J.-C. Meyer, S. Jaekel, M. Lavagna,

R. Biesbroek, e.deorbit mission: OHB debris removal concepts, in: ASTRA

665

2015-13th Symposium on Advanced Space Technologies in Robotics and

Automation, Noordwijk, The Netherlands, 2015.

[3]

J. Davis, H. Pernicka, Proximity operations about and identiﬁcation of non-

cooperative resident space objects using stereo imaging, Acta Astronautica

155 (2019) 418–425.670

40

[4]

V. Pesce, M. Lavagna, R. Bevilacqua, Stereovision-based pose and inertia

estimation of unknown and uncooperative space objects, Advances in Space

Research 59 (2017) 236–251.

[5]

R. Opromolla, G. Fasano, G. Ruﬁno, M. Grassi, Uncooperative pose

estimation with a LIDAR-based system, Acta Astronautica 110 (2015)

675

287–297.

[6]

S. Segal, P. Gurﬁl, K. Shahid, In-Orbit Tracking of Resident Space Objects:

A Comparison of Monocular and Stereoscopic Vision, IEEE Transactions

on Aerospace and Electronic Systems 50 (2014) 676–688.

[7]

S. Sharma, J. Ventura, S. D’Amico, Robust Model-Based Monocular

680

Pose Initialization for Noncooperative Spacecraft Rendezvous, Journal of

Spacecraft and Rockets 55 (2018) 1–16.

[8]

L. Pasqualetto Cassinis, R. Fonod, E. Gill, Review of the Robustness and

Applicability of Monocular Pose Estimation Systems for Relative Navigation

with an Uncooperative Spacecraft, Progress in Aerospace Sciences 110

685

(2019).

[9]

Lepetit, F. Moreno-Noguer, P. Fua, EPnP: an accurate O(n) solution to the

PnP problem, International Journal of Computer Vision 81 (2009) 155–166.

[10]

L. Ferraz, X. Binefa, F. Moreno-Noguer, Very Fast Solution to the

PnP Problem with Algebraic Outlier Rejection, in: IEEE Conference

690

on Computer Vision and Pattern Recognition, Columbus, OH, USA, 2014.

doi:10.1109/CVPR.2014.71.

[11]

A. Ostrowsky, Solution of Equations and Systems of Equations, 2 ed.,

Academic Press, New York, 1966.

[12]

S. Urban, J. Leitloﬀ, S. Hinz, Mlpnp: A rel-time maximum likelihood

695

solution to the perspective-n-point problem, Proceedings of the British

Machine Vision Conference 3 (2016).

41

[13]

L. Ferraz, X. Binefa, F. Moreno-Noguer, Leveraging Feature Uncertainty in

the PnP Problem, in: Proceedings of the British Machine Vision Conference,

Nottingham, UK, 2014. doi:10.5244/C.28.83.700

[14]

J. Cui, C. Min, X. Bai, J. Cui, An Improved Pose Estimation Method

Based on Projection Vector with Noise Error Uncertainty, IEEE Photonics

Journal 11 (2019).

[15]

A. Harvard, V. Capuano, E. Shao, S.-J. Chung, Spacecraft Pose Estima-

tion from Monocular Images Using Neural Network Based Keypoints and

705

Visibility Maps, in: AIAA Scitech 2020 Forum, Orlando, FL, USA, 2020.

doi:10.1109/AERO.2018.8396425.

[16] M. Kisantal, S. Sharma, T. Park, D. Izzo, M. Martens, S. D'Amico, Satellite Pose Estimation Challenge: Dataset, Competition Design and Results, IEEE Transactions on Aerospace and Electronic Systems (2020).

[17] D. Rondao, N. Aouf, Multi-View Monocular Pose Estimation for Spacecraft Relative Navigation, in: 2018 AIAA Guidance, Navigation, and Control Conference, Kissimmee, FL, USA, 2018. doi:10.2514/6.2018-2100.

[18] V. Capuano, S. Alimo, A. Ho, S. Chung, Robust Features Extraction for On-board Monocular-based Spacecraft Pose Acquisition, in: AIAA Scitech 2019 Forum, San Diego, CA, USA, 2019. doi:10.2514/6.2019-2005.

[19] S. Sharma, C. Beierle, S. D'Amico, Pose Estimation for Non-Cooperative Spacecraft Rendezvous using Convolutional Neural Networks, in: IEEE Aerospace Conference, Big Sky, MT, USA, 2018. doi:10.1109/AERO.2018.8396425.

[20] S. Sharma, S. D'Amico, Pose Estimation for Non-Cooperative Spacecraft Rendezvous using Neural Networks, in: 29th AAS/AIAA Space Flight Mechanics Meeting, Ka'anapali, HI, USA, 2019.

[21] J. Shi, S. Ulrich, S. Ruel, CubeSat Simulation and Detection using Monocular Camera Images and Convolutional Neural Networks, in: 2018 AIAA Guidance, Navigation, and Control Conference, Kissimmee, FL, USA, 2018. doi:10.2514/6.2018-1604.

[22] S. Sonawani, R. Alimo, R. Detry, D. Jeong, A. Hess, H. Ben Amor, Assistive relative pose estimation for on-orbit assembly using convolutional neural networks, in: AIAA Scitech 2020 Forum, Orlando, FL, USA, 2020.

[23] G. Pavlakos, X. Zhou, A. Chan, K. Derpanis, K. Daniilidis, 6-DoF Object Pose from Semantic Keypoints, in: IEEE International Conference on Robotics and Automation, 2017.

[24] A. Newell, K. Yang, J. Deng, Stacked hourglass networks for human pose estimation, in: B. Leibe, J. Matas, N. Sebe, M. Welling (Eds.), Computer Vision - ECCV 2016, volume 9912, Springer, Cham, 2016, pp. 483–499.

[25] B. Chen, J. Cao, A. Parra, T. Chin, Satellite Pose Estimation with Deep Landmark Regression and Nonlinear Pose Refinement, in: International Conference on Computer Vision, Seoul, South Korea, 2019.

[26] T. Park, S. Sharma, S. D'Amico, Towards Robust Learning-Based Pose Estimation of Noncooperative Spacecraft, in: AAS/AIAA Astrodynamics Specialist Conference, Portland, ME, USA, 2019.

[27] S. D'Amico, M. Benn, J. Jorgensen, Pose Estimation of an Uncooperative Spacecraft from Actual Space Imagery, International Journal of Space Science and Engineering 2 (2014) 171–189.

[28] K. Sun, B. Xiao, D. Liu, J. Wang, Deep high-resolution representation learning for human pose estimation, in: 2019 IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 2019.

[29] L. Pasqualetto Cassinis, R. Fonod, E. Gill, I. Ahrns, J. Gil Fernandez, Comparative Assessment of Image Processing Algorithms for the Pose Estimation of an Uncooperative Spacecraft, in: International Workshop on Satellite Constellations & Formation Flying, Glasgow, UK, 2019.

[30] L. Pasqualetto Cassinis, R. Fonod, E. Gill, I. Ahrns, J. Gil Fernandez, CNN-Based Pose Estimation System for Close-Proximity Operations Around Uncooperative Spacecraft, in: AIAA Scitech 2020 Forum, Orlando, FL, USA, 2020. doi:10.2514/6.2020-1457.

[31] S. Sharma, S. D'Amico, Reduced-Dynamics Pose Estimation for Non-Cooperative Spacecraft Rendezvous Using Monocular Vision, in: 38th AAS Guidance and Control Conference, Breckenridge, CO, USA, 2017.

[32] H. Curtis, Orbital Mechanics for Engineering Students, Elsevier, 2005.

[33] M. A. Fischler, R. Bolles, Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography, Communications of the ACM 24 (1981) 381–395.

[34] O. Ronneberger, P. Fischer, T. Brox, U-net: Convolutional networks for biomedical image segmentation, in: Medical Image Computing and Computer-Assisted Intervention, Springer, 2015, pp. 234–241.

[35] D. Kingma, J. Ba, Adam: A method for stochastic optimization, in: 3rd International Conference for Learning Representations, San Diego, CA, USA, 2015.

[36] B. Tweddle, A. Saenz-Otero, Relative Computer Vision Based Navigation for Small Inspection Spacecraft, Journal of Guidance, Control, and Dynamics 38 (2015) 969–978.

[37] E. Lefferts, F. Markley, M. Shuster, Kalman Filtering for Spacecraft Attitude Estimation, Journal of Guidance, Control, and Dynamics 5 (1982) 417–429.

[38] K. Barad, Robust Navigation Framework for Proximity Operations around Uncooperative Spacecraft, MSc Thesis, Delft University of Technology, Delft, The Netherlands, 2020.

[39] J. Solà, Quaternion kinematics for the error-state Kalman filter, arXiv e-prints (2017) arXiv:1711.02508.