
Evaluation of Tightly- and Loosely-coupled Approaches in CNN-based Pose Estimation Systems for Uncooperative Spacecraft

Lorenzo Pasqualetto Cassinis^a,*, Robert Fonod^a, Eberhard Gill^a, Ingo Ahrns^b, Jesus Gil Fernandez^c

^a Delft University of Technology, Kluyverweg 1, 2629 HS Delft, The Netherlands
^b Airbus DS GmbH, Airbusallee 1, 28199 Bremen, Germany
^c ESTEC, Keplerlaan 1, 2201 AZ Noordwijk, The Netherlands

Abstract

The relative pose estimation of an inactive spacecraft by an active servicer spacecraft is a critical task in the design of current and planned space missions, due to its relevance for close-proximity operations such as In-Orbit Servicing and Active Debris Removal. This paper introduces a novel framework to enable robust monocular pose estimation for close-proximity operations around an uncooperative spacecraft, which combines a Convolutional Neural Network (CNN) for feature detection with a Covariant Efficient Procrustes Perspective-n-Points (CEPPnP) solver and a Multiplicative Extended Kalman Filter (MEKF). The performance of the proposed method is evaluated at different levels of the pose estimation system. A Single-stack Hourglass CNN is proposed for the feature detection step in order to decrease the computational load of the Image Processing (IP), and its accuracy is compared to the standard, more complex High-Resolution Net (HRNet). Subsequently, heatmaps-derived covariance matrices are included in the CEPPnP solver to assess the pose estimation accuracy prior to the navigation filter. This is done in order to support the performance evaluation of the proposed tightly-coupled approach against a loosely-coupled approach,

*Corresponding author
Email addresses: L.PasqualettoCassinis@tudelft.nl (Lorenzo Pasqualetto Cassinis), Robert.Fonod@ieee.org (Robert Fonod), E.K.A.Gill@tudelft.nl (Eberhard Gill), ingo.ahrns@airbus.com (Ingo Ahrns), Jesus.Gil.Fernandez@esa.int (Jesus Gil Fernandez)

Preprint submitted to Elsevier, February 25, 2021

in which the detected features are converted into pseudomeasurements of the relative pose prior to the filter. The performance results of the proposed system indicate that a tightly-coupled approach can guarantee an advantageous coupling between the rotational and translational states within the filter, whilst reflecting a representative measurement covariance. This suggests a promising scheme to cope with the challenging demand for robust navigation in close-proximity scenarios. Synthetic 2D images of the European Space Agency's Envisat spacecraft are used to generate datasets for training, validation, and testing of the CNN. Likewise, the images are used to recreate a representative close-proximity scenario for the validation of the proposed filter.

Keywords: Relative Pose Estimation, Active Debris Removal, In-Orbit Servicing, Monocular-based Navigation Filters, Convolutional Neural Networks
2010 MSC: 00-01, 99-00

1. Introduction

Nowadays, key Earth-based applications such as remote sensing, navigation, and telecommunication rely on satellite technology on a daily basis. To ensure a high reliability of these services, the safety and operations of satellites in orbit have to be guaranteed. In this context, advancements in the field of Guidance, Navigation, and Control (GNC) were made in the past years to cope with the challenges involved in In-Orbit Servicing (IOS) and Active Debris Removal (ADR) missions [1, 2]. For such scenarios, the estimation of the relative pose (position and attitude) of an uncooperative spacecraft by an active servicer spacecraft represents a critical task. Compared to cooperative close-proximity missions, the pose estimation problem is indeed complicated by the fact that the target satellite is not functional and/or not able to aid the relative navigation. Hence, optical sensors shall be preferred over Radio Frequency (RF) sensors to cope with the lack of navigation devices, such as Global Positioning System (GPS) sensors and/or antennas, onboard the target.

From a high-level perspective, optical sensors can be divided into active and passive devices, depending on whether they require power to function, i.e. LIght Detection And Ranging (LIDAR) sensors and Time-Of-Flight (TOF) cameras, or whether they passively acquire radiation, i.e. monocular and stereo cameras. Spacecraft relative navigation usually exploits Electro-Optical (EO) sensors such as stereo cameras [3, 4] and/or a LIDAR sensor [5] in combination with one or more monocular cameras, in order to overcome the partial observability that results from the lack of range information in the latter [6]. In this framework, pose estimation systems based solely on a monocular camera are recently becoming an attractive alternative to systems based on active sensors or stereo cameras, due to their reduced mass, power consumption, and system complexity [7]. However, given the low Signal-To-Noise Ratio (SNR) and the high contrast which characterize space images, a significant effort is still required to comply with most of the demanding requirements for a robust and accurate monocular-based navigation system. Interested readers are referred to Pasqualetto Cassinis et al. [8] for a recent overview of the current trends in monocular-based pose estimation systems.

Notably, the aforementioned navigation system cannot rely on known visual markers, as they are typically not installed on an uncooperative target. Since the extraction of visual features is an essential step in the pose estimation process, advanced Image Processing (IP) techniques are required to extract keypoints (or interest points), corners, and/or edges on the target body. In model-based methods, the detected features are then matched with pre-defined features on an offline wireframe 3D model of the target to solve for the relative pose. In other words, a reliable detection of key features under adverse orbital conditions is highly desirable to guarantee safe operations around an uncooperative spacecraft. Moreover, it would be beneficial to obtain a model of the feature detection uncertainties. This would provide the navigation system with additional statistical information about the measurements, which could in turn improve the robustness of the entire estimation process.

Unfortunately, standard pose estimation solvers such as the Efficient Perspective-n-Point (EPnP) [9], the Efficient Procrustes Perspective-n-Point (EPPnP) [10], or the multi-dimensional Newton-Raphson Method (NRM) [11] do not have the capability to include feature uncertainties. Only recently, the Maximum-Likelihood PnP (MLPnP) [12] and the Covariant EPPnP (CEPPnP) [13] solvers were introduced to exploit statistical information by including feature covariances in the pose estimation. Ferraz et al. [13] proposed a method for computing the covariance which takes different camera poses to create a fictitious distribution around each detected keypoint. Other authors proposed an improved pose estimation method based on projection vectors, in which the covariance is associated to the image gradient magnitude and direction at each feature location [14], or a method in which covariance information is derived for each feature based on the feature's visibility and robustness against illumination changes [15]. However, in all these methods the derivation of feature covariance matrices is a lengthy process which generally cannot be directly related to the actual detection uncertainty. Moreover, this procedure could not be easily applied if Convolutional Neural Networks (CNNs) are used in the feature detection step, due to the difficulty of associating statistical meaning to the IP tasks performed within the network. In this context, another procedure should be followed in which the output of the CNNs is directly exploited to return relevant statistical information about the detection step. This could, in turn, provide a reliable representation of the detection uncertainty.

The implementation of CNNs for monocular pose estimation in space has already become an attractive solution in recent years, also thanks to the creation of the Spacecraft PosE Estimation Dataset (SPEED) [16], a database of highly representative synthetic images of PRISMA's TANGO spacecraft made publicly available by Stanford's Space Rendezvous Laboratory (SLAB) and applicable to train and test different network architectures. One of the main advantages of CNNs over standard feature-based algorithms for relative pose estimation [7, 17, 18] is an increase in robustness under adverse illumination conditions, as well as a reduction in the computational complexity. Initially, end-to-end CNNs were exploited to map a 2D input image directly into a relative pose by means of learning complex non-linear functions [19, 20, 21, 22]. However, since the pose accuracies of these end-to-end CNNs proved to be lower than the accuracies returned by common pose estimation solvers, especially in the estimation of the relative attitude [19], recent efforts investigated the capability of CNNs to perform keypoint localization prior to the actual pose estimation [23, 24, 25, 26]. The output of these networks is a set of so-called heatmaps around pre-trained features. The coordinates of the heatmap's peak intensity characterize the predicted feature location, with the intensity and the shape indicating the confidence of locating the corresponding keypoint at this position [23]. Additionally, due to the fact that the trainable features can be selected offline prior to the training, the matching of the extracted feature points with the features of the wireframe model can be performed without the need for a large search space for the image-model correspondences, which usually characterizes most of the edges/corners-based methods [27]. In this context, the High-Resolution Net (HRNet) [28] already proved to be a reliable and accurate keypoint detector prior to pose estimation, due to its capability of maintaining a high-resolution representation of the heatmaps through the whole detection process.

To the best of the authors' knowledge, the reviewed implementations of CNNs feed solely the heatmap's peak location into the pose estimation solver, although more information could be extracted from the detected heatmaps. Only in Pavlakos et al. [23] is the pose estimation solved by assigning weights to each feature based on their heatmap's peak intensities, in order to penalize inaccurate detections. Yet, there is another aspect related to the heatmaps which has not been considered. It is in fact hardly acknowledged how the overall shape of the detected heatmaps returned by a CNN can be translated into a statistical distribution around the peak, allowing reliable feature covariances and, in turn, a robust navigation performance. As already investigated by the authors in earlier works [29, 30], deriving an accurate representation of the measurement uncertainty from feature heatmaps can in fact not only improve the pose estimation, but it can also benefit the estimation of the full relative state vector, which would include the relative pose as well as the relative translational and rotational velocities.

From a high-level perspective, two different navigation architectures are normally exploited in the framework of relative pose estimation: a tightly-coupled architecture, where the extracted features are directly processed by the navigation filter as measurements, and a loosely-coupled architecture, in which the relative pose is computed by a pose solver prior to the navigation filter, in order to derive pseudomeasurements from the target features [31]. Usually, a loosely-coupled approach is preferred for an uncooperative tumbling target, due to the fact that the fast relative dynamics could jeopardize feature tracking and return highly-variable measurements to the filter. However, one shortcoming of this approach is that it is generally hard to obtain a representative covariance matrix for the pseudomeasurements. This can be quite challenging when filter robustness is demanded. Remarkably, the adoption of a CNN in the feature detection step can overcome the challenges in feature tracking by guaranteeing the detection of a constant, pre-defined set of features. At the same time, the CNN heatmaps can be used to derive a measurement covariance matrix and improve filter robustness. Following this line of reasoning, a tightly-coupled filter is expected to interface well with a CNN-based IP and to outperform its loosely-coupled counterpart.

In this framework, the objective of this paper is to combine a CNN-based feature detector with a CEPPnP solver whilst evaluating the performance of a proposed tightly-coupled navigation filter against the performance of a loosely-coupled filter. Specifically, the novelty of this work lies in extending the authors' previous findings [29, 30] by further linking the current research on CNN-based feature detection, covariant PnP solvers, and navigation filters. The main contributions of this work are:

1. To assess the feasibility of a simplified CNN for feature detection within the IP;

2. To improve the pose estimation by incorporating heatmaps-derived covariance matrices in the CEPPnP;

3. To compare the performance of tightly- and loosely-coupled navigation filters.

The paper is organized as follows. The overall pose estimation framework is illustrated in Section II. Section III introduces the proposed CNN architecture together with the adopted training, validation, and testing datasets. In Section IV, special focus is given to the derivation of covariance matrices from the CNN heatmaps, whereas Section V describes the CEPPnP solver. Besides, Section VI provides a description of the adopted tightly- and loosely-coupled filters. The simulation environment is presented in Section VII together with the simulation results. Finally, Section VIII provides the main conclusions and recommendations.

2. Pose Estimation Framework

This work considers a servicer spacecraft flying in relative motion around a target spacecraft located in a Low Earth Orbit (LEO), with the relative motion being described in a Local Vertical Local Horizontal (LVLH) reference frame co-moving with the servicer (Figure 1a). Furthermore, it is assumed that the servicer is equipped with a single monocular camera. The relative attitude of the target with respect to the servicer can then be defined as the rotation of the target body-fixed frame B with respect to the servicer camera frame C, where these frames are tied to each spacecraft's body. The distance between the origins of these two frames defines their relative position. Together, these two quantities characterize the relative pose. This information can then be transferred from the camera frame to the servicer's center of mass by accounting for the relative pose of the camera with respect to the LVLH frame.

From a high-level perspective, a model-based monocular pose estimation system receives as input a 2D image and matches it with an existing wireframe 3D model of the target spacecraft to estimate the pose of such target with respect to the servicer camera. Referring to Figure 1b, the pose estimation problem consists in determining the position of the target's centre of mass $\mathbf{t}^C$ and its orientation with respect to the camera frame C, represented by the rotation matrix $\mathbf{R}^C_B$. The Perspective-n-Points (PnP) equations,

$$\mathbf{r}^C = \begin{bmatrix} x^C & y^C & z^C \end{bmatrix}^T = \mathbf{R}^C_B \mathbf{r}^B + \mathbf{t}^C, \quad (1)$$

$$\mathbf{p} = (u_i, v_i) = \left( \frac{x^C}{z^C} f_x + C_x,\ \frac{y^C}{z^C} f_y + C_y \right), \quad (2)$$

relate the unknown pose with a feature point $\mathbf{p}$ in the image plane via the relative position $\mathbf{r}^C$ of the feature with respect to the camera frame. Here, $\mathbf{r}^B$ is the point location in the 3D model, expressed in the body-frame coordinate system B, whereas $f_x$ and $f_y$ denote the focal lengths of the camera and $(C_x, C_y)$ is the principal point of the image.
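As a minimal sketch of Eqs. (1)-(2), the projection of a single model point can be written as follows; the rotation matrix, feature location, and camera intrinsics below are illustrative placeholder values, not the parameters used in this work.

```python
def project_feature(r_B, R_CB, t_C, fx, fy, Cx, Cy):
    """Project a 3D model point r_B (body frame) into pixel coordinates."""
    # Eq. (1): rotate the body-frame point into the camera frame and translate
    r_C = [sum(R_CB[i][j] * r_B[j] for j in range(3)) + t_C[i] for i in range(3)]
    xC, yC, zC = r_C
    # Eq. (2): perspective division followed by the intrinsic calibration
    u = xC / zC * fx + Cx
    v = yC / zC * fy + Cy
    return u, v

# Identity relative attitude; hypothetical feature 10 m along the boresight
R_CB = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
u, v = project_feature([1.0, 0.5, 0.0], R_CB, [0.0, 0.0, 10.0],
                       fx=1000.0, fy=1000.0, Cx=256.0, Cy=256.0)
```

With n such correspondences, the PnP solvers discussed below invert this mapping to recover $\mathbf{R}^C_B$ and $\mathbf{t}^C$.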

From these equations, it can already be seen that an important aspect of estimating the pose resides in the capability of the IP system to extract features $\mathbf{p}$ from a 2D image of the target spacecraft, which in turn need to be matched with pre-selected features $\mathbf{r}^B$ in the wireframe 3D model. Notably, such a wireframe model of the target needs to be made available prior to the estimation. Notice also that the problem is not well defined for n < 3 feature points, and can have up to four positive solutions for n = 3 [33]. Generally, more features are required in the presence of large noise and/or symmetric objects. Besides, it can also be expected that the time variation of the relative pose plays a crucial role while navigating around the target spacecraft, e.g. if rotational synchronization with the target spacecraft is required in the final approach phase. As such, it is clear that the estimation of both the relative translational and angular velocities

Figure 2: Functional flow of the proposed tightly-coupled pose estimation architecture.

represent an essential step within the navigation system.


The proposed tightly-coupled architecture combines the above key ingredients in three main stages, which are shown in Figure 2 and described in more detail in the following sections. In the CNN-based IP block, a CNN is used to extract features from a 2D image of the target spacecraft. Statistical information is derived by computing a covariance matrix for each feature using the information included in the output heatmaps. In the Navigation block, both the peak locations and the covariances are fed into the navigation filter, which estimates the relative pose as well as the relative translational and rotational velocities. The filter is initialized by the CEPPnP block, which takes the peak location and covariance matrix of each feature as input and outputs the initial relative pose by solving the PnP problem in Eqs. 1-2. Thanks to the availability of a covariance matrix of the detected features, this architecture can guarantee a more accurate representation of feature uncertainties, especially in the case of inaccurate detections by the CNN due to adverse illumination conditions and/or unfavourable relative geometries between servicer and target. Together with the CEPPnP initialization, this aspect can return a robust and accurate estimation of the relative pose and velocities and assure a safe approach of the target spacecraft.

In this work, a rectilinear VBAR approach of the servicer spacecraft towards the target spacecraft is considered, as this typically occurs during the final stages of close-proximity operations in rendezvous and docking missions [1, 2]. This assumption is justified by the fact that the proposed method needs to be first validated on simplified relative trajectories before assessing its feasibility under more complex relative geometries. Following the same line of reasoning, the relative attitude is also simplified by considering perturbation-free rotational dynamics between the servicer and the target. This is described in more detail in Section VI.

3. Convolutional Neural Network

CNNs are currently emerging as a promising feature extraction method, mostly due to the capability of their convolutional layers to extract high-level features of objects with improved robustness against image noise and illumination conditions. In order to optimize CNNs for the feature extraction process, a stacked hourglass architecture has been proposed [23, 24], and other architectures such as the U-net [34] and the HRNet [28] were tested in recent years.

Compared to the network proposed by Pavlakos et al. [23], the architecture proposed in this work is composed of only one encoder/decoder block, constituting a single hourglass module. This was chosen in order to reduce the network size and comply with the limitations in computing power which characterize space-grade processors. The encoder includes six blocks, each including a convolutional layer formed by a fixed number of filter kernels of size 3×3, a batch normalization module, and a max pooling layer, whereas the six decoder blocks accommodate an up-sampling block in place of max pooling. In the encoder stage, the initial image resolution is decreased by a factor of two, with this downsampling process continuing until reaching the lowest resolution of 4×4 pixels. An upsampling process follows in the decoder, with each layer increasing the resolution by a factor of two and returning output heatmaps at the same resolution as the input image. Figure 3 shows the high-level architecture of the network layers, together with the corresponding input and output.

Overall, the size of the 2D input image and the number of kernels per convolutional layer drive the total number of parameters. In the current analysis, an input size of 256×256 pixels is chosen, and 128 kernels are considered per convolutional layer, leading to a total of ∼1,800,000 trainable parameters. Compared to the CNNs analyzed by Sun et al. [28], this represents a reduction of more than an order of magnitude in network size.

Figure 3: Overview of the single hourglass architecture. Downsampling is performed in the encoder stage, in which the image size is decreased after each block, whereas upsampling occurs in the decoder stage. The output of the network consists of heatmap responses and is used for keypoint localization.

As already mentioned, the output of the network is a set of heatmaps around the selected features. Ideally, the heatmap's peak intensity associated to a wrong detection should be relatively small compared to the correctly detected features, highlighting that the network is not confident about that particular wrongly-detected feature. At the same time, the heatmap's amplitude should provide an additional insight into the confidence level of each detection, a large amplitude being related to a large uncertainty about the detection. The network is trained


Figure 4: Illustration of the selected features for a given Envisat pose.

Table 1: Parameters of the camera used to generate the synthetic images in Cinema 4D©.

Parameter          Value       Unit
Image resolution   512×512     pixels
Focal length       3.9·10^-3   m
Pixel size         1.1·10^-5   m

with the x- and y-image coordinates of the feature points, computed offline based on the intrinsic camera parameters as well as on the feature coordinates in the target body frame, which were extracted from the wireframe 3D model prior to the training. During training, the network is optimized to locate 16 features of the Envisat spacecraft, consisting of the corners of the main body, the Synthetic-Aperture Radar (SAR) antenna, and the solar panel, respectively. Figure 4 illustrates the selected features for a specific target pose.

3.1. Training, Validation and Test

For the training, validation, and test datasets, synthetic images of the Envisat spacecraft were rendered in the Cinema 4D© software. Table 1 lists the main camera parameters adopted. Constant Sun elevation and azimuth angles of 30 degrees were chosen in order to recreate favourable as well as adverse illumination conditions. Relative distances between camera and target were discretized every 30 m in the interval 90 m - 180 m, with the Envisat always located in the camera boresight direction in order to prevent some of the Envisat features from falling outside the camera field of view. Although this is a conservative assumption, it allows testing the CNN detection under ideal servicer-target geometries during a rectilinear approach. Subsequently, relative attitudes were generated by discretizing the yaw, pitch, and roll angles of the target with respect to the camera by 10 degrees each. Together, these two choices were made in order to recreate several relative attitudes between the servicer and the target. The resulting database was then shuffled to randomize the images, and was ultimately split into training (18,000 images), validation (6,000 images), and test (6,000 images) datasets. Figure 5 shows a subset of the camera pose distribution for 100 representative training images, whereas Figure 6 illustrates some of the images included in the training dataset.
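The shuffle-and-split step described above can be sketched as follows; the file-name pattern and the fixed seed are illustrative assumptions, not details from the paper.

```python
import random

def split_dataset(image_paths, n_train=18000, n_val=6000, n_test=6000, seed=0):
    """Shuffle the rendered images and split them into training,
    validation, and test sets with the sizes reported in Section 3.1."""
    assert len(image_paths) == n_train + n_val + n_test
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)  # deterministic shuffle for repeatability
    return (paths[:n_train],
            paths[n_train:n_train + n_val],
            paths[n_train + n_val:])

# Hypothetical file names for the 30,000 rendered images
train, val, test = split_dataset([f"img_{i:05d}.png" for i in range(30000)])
```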

During training, the validation dataset is used beside the training one to compute the validation losses and avoid overfitting. The Adam optimizer [35] is used with a learning rate of 10^-3 for a total number of 50 epochs. Finally, the network performance after training is assessed with the test dataset.

Preliminary results on the single-stack network performance were already reported by Pasqualetto Cassinis et al. [29]. Above all, one key advantage of relying on CNNs for feature detection was found in the capability of learning the relative position between features under the variety of relative poses present in the training. As a result, both features which are not visible due to adverse illumination and features occulted by other parts of the target can be detected. Besides, a challenge was identified in the specific selection of the trainable features. Since the features selected in this work represent highly symmetrical points of the Envisat spacecraft, such as corners of the solar panel, SAR antenna, or main body, the network could be unable to distinguish between similar features and return multiple heatmaps for a single feature output. Figure 7 illustrates these


Figure 5: Illustration of the pose space discretization in the training dataset. The concentric

spheres represent the discretization of the relative distance in the range 90 m - 180 m. Only

100 random relative camera poses are shown for clarity.

Figure 6: A montage of eight synthetic images selected from the training set.


findings. Notably, the detection of wrong features results in weak heatmaps, which can be filtered out by selecting a proper threshold on their total brightness.

(a) Occulted spacecraft body. (b) Heatmap of a single solar panel's corner.

Figure 7: Robustness and challenges of feature detection with the proposed CNN. On the left-hand side, the network has been trained to recognize the pattern of the features, and can correctly locate the body features which are not visible, i.e. parts occulted by the solar panel and corners of the SAR antenna. Conversely, the right-hand side shows the detection of multiple heatmaps for a single corner of the solar panel. As can be seen, the network can have difficulties in distinguishing similar features, such as the corners of the solar panel.

In order to compare the feature detection accuracy of the proposed Single-stack Hourglass with a more complex CNN architecture, the HRNet proposed by Sun et al. [28] has been selected and trained on the same Envisat datasets. This architecture had already been tested on the SPEED dataset [25] and proved to return highly accurate features of the TANGO spacecraft. The performance is assessed in terms of the Root Mean Squared Error (RMSE) between the ground truth (GT) and the x, y coordinates of the extracted features, which is computed as

$$E_{\mathrm{RMSE}} = \sqrt{\frac{\sum_{i=1}^{n_{\mathrm{tot}}}\left[(x_{\mathrm{GT},i} - x_i)^2 + (y_{\mathrm{GT},i} - y_i)^2\right]}{n_{\mathrm{tot}}}}. \quad (3)$$
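A minimal implementation of Eq. (3), using made-up ground-truth and detected coordinates for illustration:

```python
import math

def feature_rmse(gt, pred):
    """RMSE of Eq. (3) between ground-truth and detected feature
    coordinates, both given as lists of (x, y) pixel tuples."""
    n = len(gt)
    sq = sum((xg - x) ** 2 + (yg - y) ** 2
             for (xg, yg), (x, y) in zip(gt, pred))
    return math.sqrt(sq / n)

# Two features, each offset by (3, 4) pixels from the ground truth
gt   = [(100.0, 100.0), (200.0, 150.0)]
pred = [(103.0, 104.0), (203.0, 154.0)]
err = feature_rmse(gt, pred)
```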


Figure 8: RMSE over the test dataset for the HRNet and the Single-stack Hourglass.

Figure 8 shows the RMSE over the test dataset for the two CNNs, whereas Table 2 reports the mean µ and standard deviation σ of the associated histograms. As expected, the added complexity of HRNet translates into a more accurate detection of the selected features, thanks to the higher number of parameters: only 4% of the test images are characterized by an RMSE above 5 pixels, as opposed to 15% in the Single-stack Hourglass case.

Table 2: Mean µ and standard deviation σ of the adopted networks over the Envisat test dataset.

Network                  No. Params   µ [pxl]   σ [pxl]
Single-stack Hourglass   1.8M         3.4       4.3
HRNet                    25M          2.4       1.4

Although HRNet proves to return more accurate features, it is also believed that the larger-RMSE scenarios returned by the Single-stack Hourglass can be properly handled if a larger uncertainty can be associated to their corresponding heatmaps. As an example, a large RMSE could be associated to the inaccurate detection of only a few features which, if properly weighted, would not have a severe impact on the pose estimation step. This task can be performed by deriving a covariance matrix for each detected feature, in order to represent its detection uncertainty. Above all, this may prevent the pose solver and the navigation filter from trusting wrong detections by relying more on other, accurate features. In this way, the navigation filter can cope with poorly accurate heatmaps while at the same time relying on a computationally-light CNN.

4. Covariance Computation

Compared to the methods discussed in Section I [13, 14, 15], the proposed method derives a covariance matrix associated to each feature directly from the heatmaps detected by the CNN, rather than from the computation of the image gradient around each feature. In order to do so, the first step is to obtain a statistical population around the heatmap's peak. This is done by thresholding each heatmap image so that only the x- and y-locations of the pixels above the threshold are extracted. Secondly, each pixel within the population is given a normalized weight $w_i$ based on the gray intensity $I_i$ at its location,

$$w_i = w_R R_i + w_G G_i + w_B B_i, \quad (4)$$

where $R, G, B$ are the components of the coloured image and $w_R, w_G, w_B$ are the weights assigned to each channel in order to get the grayscale intensity. This is done in order to give more weight to pixels which are particularly bright and close to the peak, and less weight to pixels which are very faint and far from the peak. Finally, the obtained statistical population of each feature is used to compute the weighted covariance between $x, y$ and consequently the covariance matrix $C_i$,

$$C_i = \begin{bmatrix} \mathrm{cov}(x,x) & \mathrm{cov}(x,y) \\ \mathrm{cov}(y,x) & \mathrm{cov}(y,y) \end{bmatrix}, \quad (5)$$

where

$$\mathrm{cov}(x,y) = \sum_{i=1}^{n} w_i (x_i - p_x)(y_i - p_y) \quad (6)$$

and $n$ is the number of pixels in each feature's heatmap. In this work, the mean is replaced by the peak location $\mathbf{p} = (p_x, p_y)$ in order to represent a distribution around the peak of the detected feature, rather than around the heatmap's mean. This is particularly relevant when the heatmaps are asymmetric and their mean does not coincide with their peak.
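The peak-centred weighted covariance of Eqs. (5)-(6) can be sketched as follows; it is assumed here that the thresholding step has already produced the pixel list, and that the weights passed in are the grayscale intensities of Eq. (4), normalized inside the function.

```python
def heatmap_covariance(pixels, weights, peak):
    """Weighted 2x2 covariance of Eqs. (5)-(6), computed around the
    heatmap peak rather than the sample mean. `pixels` is a list of
    (x, y) locations above the threshold, `weights` their grayscale
    intensities."""
    px, py = peak
    total = sum(weights)
    w = [wi / total for wi in weights]  # normalize the weights to sum to one
    cxx = sum(wi * (x - px) ** 2 for wi, (x, y) in zip(w, pixels))
    cyy = sum(wi * (y - py) ** 2 for wi, (x, y) in zip(w, pixels))
    cxy = sum(wi * (x - px) * (y - py) for wi, (x, y) in zip(w, pixels))
    return [[cxx, cxy], [cxy, cyy]]

# Four equally bright pixels placed symmetrically around a peak at (10, 10)
C = heatmap_covariance([(9, 10), (11, 10), (10, 9), (10, 11)],
                       [1.0, 1.0, 1.0, 1.0], peak=(10, 10))
```

An asymmetric pixel population would instead yield a non-zero off-diagonal term, capturing the tilted ellipses of Figure 9.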

Figure 9 shows the overall flow to obtain the covariance matrix for three different heatmap shapes. The ellipse associated to each feature's covariance is obtained by computing the eigenvalues $\lambda_x$ and $\lambda_y$ of the covariance matrix,

$$\left(\frac{x}{\lambda_x}\right)^2 + \left(\frac{y}{\lambda_y}\right)^2 = s, \tag{7}$$

where $s$ defines the scale of the ellipse and is derived from the confidence interval of interest, e.g. $s = 2.2173$ for a 68% confidence interval. As can be seen, different heatmaps can result in very different covariance matrices. Above all, the computed covariance can capture the different CNN uncertainty over $x$ and $y$. Notice that, due to its symmetric nature, the covariance matrix can only represent bivariate normal distributions. As a result, asymmetrical heatmaps such as the one in the third scenario are approximated by Gaussian distributions characterized by an ellipse which might overestimate the heatmap's dispersion over some directions.
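The ellipse construction can be sketched as follows, under the common convention that the semi-axes scale with the square roots of the covariance eigenvalues times the confidence scale (the exact normalization in Eq. 7 may differ; this is an illustrative assumption).

```python
import numpy as np

def confidence_ellipse_axes(C, s=2.2173):
    """Semi-axes and orientation of the confidence ellipse for covariance C.

    s is the scale derived from the confidence interval of interest
    (e.g. s = 2.2173 for 68%, as stated in the text). The sqrt(s * lambda)
    normalization is one common convention, assumed here for illustration.
    """
    eigvals, eigvecs = np.linalg.eigh(C)                # eigenvalues, ascending
    lam_min, lam_max = eigvals
    a = np.sqrt(s * lam_max)                            # semi-major axis
    b = np.sqrt(s * lam_min)                            # semi-minor axis
    angle = np.arctan2(eigvecs[1, -1], eigvecs[0, -1])  # major-axis direction
    return a, b, angle
```

An isotropic covariance yields equal semi-axes, i.e. the circular ellipses associated with accurate detections in the first test scenario.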

5. Pose Estimation

The CEPPnP method proposed by Ferraz et al. [13] was selected to estimate the relative pose from the detected features as well as from their covariance matrices. The first step of this method is to rewrite the PnP problem in Eqs. 1-2 as a function of a 12-dimensional vector $y$ containing the control point coordinates


Figure 9: Schematic of the procedure followed to derive covariance matrices from CNN heatmaps (panels: thresholding, weight allocation, covariance computation). The displayed ellipses are derived from the computed covariances by assuming the confidence intervals 1σ = 0.68 and 3σ = 0.99.

in the camera reference system,

$$M y = 0, \tag{8}$$

where $M$ is a $2n \times 12$ known matrix. This is the fundamental equation in the EPnP problem [9]. The likelihood of each observed feature location $u_i$ is then represented as

$$P(u_i) = k \cdot e^{-\frac{1}{2}\,\Delta u_i^T C_{u_i}^{-1} \Delta u_i}, \tag{9}$$

where $\Delta u_i$ is a small, independent and unbiased noise with expectation $E[\Delta u_i] = 0$ and covariance $E[\Delta u_i \Delta u_i^T] = \sigma^2 C_{u_i}$, and $k$ is a normalization constant. Here, $\sigma^2$ represents the global uncertainty in the image, whereas $C_{u_i}$ is the $2 \times 2$ unnormalized covariance matrix representing the Gaussian distribution of each detected feature, computed from the CNN heatmaps. After some calculations [13], the EPnP formulation can be rewritten as

$$(N - L)\,y = \lambda y. \tag{10}$$

This is an eigenvalue problem in which both the $N$ and $L$ matrices are a function of $y$ and $C_{u_i}$. The problem is solved iteratively, starting from the closed-form EPPnP solution for the four control points, which assumes no feature uncertainty. Once $y$ is estimated, the relative pose is computed by solving the generalized Orthogonal Procrustes problem used in the EPPnP [10].
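The final pose-recovery step can be illustrated with a generic SVD-based solution of the Orthogonal Procrustes problem over 3D-3D correspondences; this is a Kabsch-style sketch of the idea, not the authors' EPPnP implementation.

```python
import numpy as np

def procrustes_pose(model_pts, cam_pts):
    """Rigid pose (R, t) aligning model-frame points to camera-frame points.

    Generic SVD-based Orthogonal Procrustes solve, illustrating the final
    step of EPPnP-style solvers; a sketch, not the paper's exact code.
    model_pts, cam_pts: (n, 3) arrays of corresponding 3D points.
    """
    mu_m = model_pts.mean(axis=0)
    mu_c = cam_pts.mean(axis=0)
    H = (model_pts - mu_m).T @ (cam_pts - mu_c)       # 3x3 correlation matrix
    U, _, Vt = np.linalg.svd(H)
    # Reflection guard: force a proper rotation (det(R) = +1)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = mu_c - R @ mu_m
    return R, t
```

Given noise-free correspondences, the solver recovers the generating rotation and translation exactly.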

6. Navigation Filter

Several navigation filters for close-proximity operations were investigated in recent years in the context of relative pose estimation. The reader is referred to Pasqualetto Cassinis et al. [8] for a comprehensive overview that goes beyond the scope of this work. In the proposed navigation system, the so-called Multiplicative Extended Kalman Filter (MEKF) is used. Remarkably, other works [15, 30] adopted a standard formulation of the EKF that propagates the relative pose, expressed in terms of relative position and quaternions, as well as the relative translational and rotational velocities (prediction step), correcting the prediction with the measurements obtained from the monocular camera (correction step). However, the quaternion set consists of four parameters to describe the 3DOF attitude, hence one of its parameters is deterministic. As reported by Tweddle and Saenz-Otero [36] and Sharma and D'Amico [31], this makes the covariance matrix of a quaternion have one eigenvalue that is exactly zero. As a result, the entire state covariance propagated by the filter may become non-positive-definite and lead to the divergence of the filter. The MEKF, introduced for the first time by Lefferts et al. [37], aims at solving the above issue by using two different parametrizations of the relative attitude. A three-element error parametrization, expressed in terms of quaternions, is propagated and corrected inside the filter to return an estimate of the attitude error. At each estimation step, this error estimate is used to update a reference quaternion and is reset to zero for the next iteration. Notably, the reset step prevents the attitude error parametrization from reaching singularities, which generally occur for large angles.

6.1. Propagation Step

A standard EKF state vector for relative pose estimation is composed of the relative pose between the servicer and the target, as well as the relative translational and rotational velocities $v$ and $\omega$. Under the assumption that the camera frame onboard the servicer is co-moving with the LVLH frame, with the camera boresight aligned with the along-track direction, this translates into

$$x = \begin{bmatrix} t_C & v & q & \omega \end{bmatrix}^T, \tag{11}$$

where $q = \begin{bmatrix} q_0 & q_v \end{bmatrix}$ is the quaternion set that represents the relative attitude. Notice that the assumption of the camera co-moving with the LVLH frame is made only to focus on the navigation aspects rather than on the attitude control of the servicer. Therefore, the application of the filter can be extended to other scenarios, if attitude control is included in the system.


In the MEKF, the modified state vector propagated inside the filter becomes

$$\tilde{x} = \begin{bmatrix} t_C & v & a & \omega \end{bmatrix}^T, \tag{12}$$

where $a$ is four times the Modified Rodrigues Parameters (MRP) $\sigma$,

$$a = 4\sigma = 4\,\frac{q_v}{1 + q_0}. \tag{13}$$

The discrete attitude propagation step is derived by linearizing $\dot{a}$ around $a = 0_{3\times1}$ and assuming small angle rotations [36],

$$\dot{a} = \frac{1}{2}[\omega\times]\,a + \omega. \tag{14}$$
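The mapping of Eq. 13 and its inverse (which reappears as Eq. 36 in the correction step) can be sketched as:

```python
import numpy as np

def quat_to_a(q):
    """Attitude-error parameter a = 4*q_v / (1 + q0) (Eq. 13), q = [q0, qv] scalar-first."""
    return 4.0 * q[1:] / (1.0 + q[0])

def a_to_quat(a):
    """Inverse mapping (Eq. 36): dq0 = (16 - |a|^2)/(16 + |a|^2), dqv = 8a/(16 + |a|^2)."""
    n2 = float(a @ a)
    return np.concatenate(([(16.0 - n2) / (16.0 + n2)], 8.0 * a / (16.0 + n2)))
```

The two mappings are exact inverses for any unit quaternion with $q_0 > -1$, consistent with the singularity for large angles mentioned above.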

As a result, the discrete linearized propagation of the full state becomes

$$\tilde{x}_k = \Phi_k\,\tilde{x}_{k-1} + \Gamma_k Q_k, \tag{15}$$

where $Q_k$ represents the process noise and

$$\Phi_k = \begin{bmatrix} \Phi_{CW} & 0_{6\times6} \\ 0_{6\times6} & \Phi_{a,\omega} \end{bmatrix} \tag{16}$$

$$\Phi_{CW} = \begin{bmatrix} \Phi_{rr} & \Phi_{rv} \\ \Phi_{vr} & \Phi_{vv} \end{bmatrix} \tag{17}$$

$$\Phi_{rr} = \begin{bmatrix} 1 & 0 & 6(\Delta\theta - \sin\Delta\theta) \\ 0 & \cos\Delta\theta & 0 \\ 0 & 0 & 4 - 3\cos\Delta\theta \end{bmatrix} \tag{18}$$

$$\Phi_{rv} = \begin{bmatrix} \frac{1}{\omega_s}(4\sin\Delta\theta - 3\Delta\theta) & 0 & \frac{2}{\omega_s}(1 - \cos\Delta\theta) \\ 0 & \frac{1}{\omega_s}\sin\Delta\theta & 0 \\ \frac{2}{\omega_s}(\cos\Delta\theta - 1) & 0 & \frac{\sin\Delta\theta}{\omega_s} \end{bmatrix} \tag{19}$$

$$\Phi_{vr} = \begin{bmatrix} 0 & 0 & 6\omega_s(1 - \cos\Delta\theta) \\ 0 & -\omega_s\sin\Delta\theta & 0 \\ 0 & 0 & 3\omega_s\sin\Delta\theta \end{bmatrix} \tag{20}$$

$$\Phi_{vv} = \begin{bmatrix} 4\cos\Delta\theta - 3 & 0 & 2\sin\Delta\theta \\ 0 & \cos\Delta\theta & 0 \\ -2\sin\Delta\theta & 0 & \cos\Delta\theta \end{bmatrix} \tag{21}$$

$$\Phi_{a,\omega} = \begin{bmatrix} e^{\frac{1}{2}[\omega\times]\Delta t} & \int_0^{\Delta t} e^{\frac{1}{2}[\omega\times]\tau}\,d\tau \\ 0_{3\times3} & I_{3\times3} \end{bmatrix} \tag{22}$$

$$\Gamma_{k+1} = \begin{bmatrix} \frac{1}{2m} I_{3\times3}\,\Delta t^2 & 0_{3\times3} \\ \frac{1}{m} I_{3\times3}\,\Delta t & 0_{3\times3} \\ 0_{3\times3} & \int_0^{\Delta t} e^{\frac{1}{2}[\omega\times]\tau} J^{-1}\,d\tau \\ 0_{3\times3} & J^{-1}\Delta t \end{bmatrix}. \tag{23}$$

The terms $\omega_s$ and $\Delta\theta$ in Eq. 17 represent the servicer orbital angular rate and the true anomaly variation from time $t_0$ to $t$, respectively, whereas the term $J$ in Eq. 23 is the inertia matrix of the target spacecraft. In Tweddle and Saenz-Otero [36], the integral terms in Eqs. 22-23 are solved by creating a temporary linear system from Eq. 14, augmented with the angular velocity and the process noise. The State Transition Matrix of this system is then solved numerically with the matrix exponential.
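The Clohessy-Wiltshire block of Eqs. 17-21 can be assembled as follows; the per-axis ordering and the sign of the cross-track velocity term are assumptions based on the standard CW solution, restored here so that the blocks form a consistent state transition matrix.

```python
import numpy as np

def cw_stm(ws, dth):
    """Clohessy-Wiltshire state transition matrix Phi_CW (Eqs. 17-21).

    ws: servicer orbital rate; dth: true anomaly variation from t0 to t.
    Axis ordering follows the matrices in the text; a sketch for illustration.
    """
    c, s = np.cos(dth), np.sin(dth)
    Prr = np.array([[1.0, 0.0, 6.0 * (dth - s)],
                    [0.0, c, 0.0],
                    [0.0, 0.0, 4.0 - 3.0 * c]])
    Prv = np.array([[(4.0 * s - 3.0 * dth) / ws, 0.0, 2.0 * (1.0 - c) / ws],
                    [0.0, s / ws, 0.0],
                    [2.0 * (c - 1.0) / ws, 0.0, s / ws]])
    Pvr = np.array([[0.0, 0.0, 6.0 * ws * (1.0 - c)],
                    [0.0, -ws * s, 0.0],      # assumed sign, from d/dt of cos
                    [0.0, 0.0, 3.0 * ws * s]])
    Pvv = np.array([[4.0 * c - 3.0, 0.0, 2.0 * s],
                    [0.0, c, 0.0],
                    [-2.0 * s, 0.0, c]])
    return np.block([[Prr, Prv], [Pvr, Pvv]])
```

Two sanity checks confirm the reconstruction: the matrix reduces to the identity for $\Delta\theta = 0$, and it composes like a true state transition matrix, $\Phi(\Delta\theta_1 + \Delta\theta_2) = \Phi(\Delta\theta_2)\,\Phi(\Delta\theta_1)$.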

6.2. Correction Step

At this stage, the propagated state $\tilde{x}_k$ is corrected with the measurements $z$ to return an update of the state $\hat{x}_k$. In a loosely-coupled filter, these measurements are represented by the relative pose between the servicer and the target spacecraft, obtained by solving the PnP problem with the CEPPnP solver described in Section 5. In this case, a pseudomeasurements vector is derived by transforming the relative quaternion set into the desired attitude error $a$,

$$\delta q_z = q_z \otimes q_{ref,k}^{*} \;\rightarrow\; a = 4\,\frac{\delta q_v}{1 + \delta q_0} \tag{24}$$


$$z_k = \begin{bmatrix} t_C \\ a \end{bmatrix} = H_k x_k + V_k = \begin{bmatrix} I_{3\times3} & 0_{3\times3} & 0_{3\times3} & 0_{3\times3} \\ 0_{3\times3} & 0_{3\times3} & I_{3\times3} & 0_{3\times3} \end{bmatrix} \begin{bmatrix} t_C \\ v \\ a \\ \omega \end{bmatrix}_k + \begin{bmatrix} V_r \\ V_a \end{bmatrix}_k. \tag{25}$$

In Eq. 24, $\otimes$ denotes the quaternion product. Conversely, in a tightly-coupled filter the measurements are represented by the pixel coordinates of the detected features,

$$z = (x_1, y_1, \ldots, x_n, y_n)^T. \tag{26}$$

Referring to Eqs. 1-2, this translates into the following equations for each detected point $p_i$:

$$h_i = \begin{bmatrix} \dfrac{x_i^C}{z_i^C} f_x + C_x, & \dfrac{y_i^C}{z_i^C} f_y + C_y \end{bmatrix}^T \tag{27}$$

$$r_i^C = q \otimes r_i^B \otimes q^{*} + t_C, \tag{28}$$

where $q^{*}$ is the quaternion conjugate.
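Eqs. 27-28 amount to a quaternion rotation of a body-frame feature followed by a pinhole projection; a minimal sketch, where the intrinsics $f_x$, $f_y$, $C_x$, $C_y$ are assumed illustrative values rather than the camera parameters used in the paper:

```python
import numpy as np

def quat_rotate(q, v):
    """Rotate vector v by unit quaternion q = [q0, q1, q2, q3] (Eq. 28)."""
    q0, qv = q[0], q[1:]
    # Expanded form of q (x) [0, v] (x) q*
    return v + 2.0 * q0 * np.cross(qv, v) + 2.0 * np.cross(qv, np.cross(qv, v))

def project_feature(q, t_c, r_b, fx, fy, cx, cy):
    """Pixel coordinates of body-frame point r_b (Eqs. 27-28)."""
    r_c = quat_rotate(q, r_b) + t_c                  # camera-frame point
    return np.array([r_c[0] / r_c[2] * fx + cx,      # Eq. 27, x component
                     r_c[1] / r_c[2] * fy + cy])     # Eq. 27, y component
```

This observation function is the $h_i$ that the tightly-coupled filter linearizes in Eqs. 30-34.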

equation can be written as

zk=Hkxk+V=

HtC,i 02n×3Ha,i 02n×3

.

.

..

.

..

.

..

.

.

HtC,n 02n×3Ha,n 02n×3

tC

v

a

w

k

+

Vr

Va

(29)

and the Jacobian

Hk

of the observation model with respect of the state vector

is a 2n×13 matrix whose elements are440

$$H_{t_C,i} = H_i^{int} \cdot H_{t_C,i}^{ext} \tag{30}$$


$$H_{a,i} = H_i^{int} \cdot H_{a,i}^{ext} = H_i^{int} \cdot H_{q,i}^{ext} \cdot H_a^{q} \tag{31}$$

$$H_i^{int} = \frac{\partial h_i}{\partial r_i^C} = \begin{bmatrix} \dfrac{f_x}{z_i^C} & 0 & -\dfrac{f_x}{(z_i^C)^2}\,x_i^C \\ 0 & \dfrac{f_y}{z_i^C} & -\dfrac{f_y}{(z_i^C)^2}\,y_i^C \end{bmatrix} \tag{32}$$

$$H_{q,i}^{ext} = \frac{\partial r_i^C}{\partial q} = \frac{\partial\,(q \otimes r_i^B \otimes q^{*})}{\partial q}; \qquad H_{t_C,i}^{ext} = \frac{\partial r_i^C}{\partial t_C} = I_3 \tag{33}$$

$$H_a^{q} = \frac{\partial\,(\delta q \otimes q_{ref})}{\partial a} = \frac{\partial\,(Q_{ref}\,\delta q)}{\partial a} = Q_{ref}\,\frac{\partial \delta q}{\partial a} \tag{34}$$

$$Q_{ref} = \begin{bmatrix} q_0 & -q_1 & -q_2 & -q_3 \\ q_1 & q_0 & q_3 & -q_2 \\ q_2 & -q_3 & q_0 & q_1 \\ q_3 & q_2 & -q_1 & q_0 \end{bmatrix}_{ref}. \tag{35}$$

The partial derivatives of the differential quaternion set with respect to the attitude error are computed from the relation between the attitude error $a$ and the differential quaternion set $\delta q$,

$$\delta q_0 = \frac{16 - \|a\|^2}{16 + \|a\|^2}, \qquad \delta q_v = \frac{8\,a}{16 + \|a\|^2} \tag{36}$$

$$\frac{\partial \delta q}{\partial a} = \frac{8}{\left(16 + \|a\|^2\right)^2} \begin{bmatrix} -8a_1 & -8a_2 & -8a_3 \\ 16 + \|a\|^2 - 2a_1^2 & -2a_1 a_2 & -2a_1 a_3 \\ -2a_2 a_1 & 16 + \|a\|^2 - 2a_2^2 & -2a_2 a_3 \\ -2a_3 a_1 & -2a_3 a_2 & 16 + \|a\|^2 - 2a_3^2 \end{bmatrix}. \tag{37}$$
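The analytical Jacobian of Eq. 37 can be checked against a numerical differentiation of Eq. 36; a short sketch:

```python
import numpy as np

def delta_q(a):
    """Differential quaternion from attitude error a (Eq. 36), scalar-first."""
    n2 = float(a @ a)
    return np.concatenate(([(16.0 - n2) / (16.0 + n2)], 8.0 * a / (16.0 + n2)))

def ddq_da(a):
    """Analytical 4x3 Jacobian d(delta q)/da (Eq. 37)."""
    n2 = float(a @ a)
    J = np.zeros((4, 3))
    J[0, :] = -8.0 * a                                       # scalar-part row
    J[1:, :] = (16.0 + n2) * np.eye(3) - 2.0 * np.outer(a, a)  # vector-part rows
    return 8.0 / (16.0 + n2) ** 2 * J
```

Central finite differences of Eq. 36 reproduce Eq. 37 to machine-level accuracy, confirming the reconstructed matrix.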

In the tightly-coupled filter, the measurement covariance matrix $R$ is a time-varying block diagonal matrix constructed with the heatmaps-derived covariances $C_i$ in Eq. 5,

$$R = \begin{bmatrix} C_1 & & \\ & \ddots & \\ & & C_n \end{bmatrix}. \tag{38}$$

Notice that $C_i$ can differ for each feature in a given frame as well as vary over time. Preliminary navigation results [30] already showed that such a heatmaps-derived covariance matrix can capture the statistical distribution of the measured features and improve the measurements update step of the navigation filter. Conversely, in the loosely-coupled filter $R$ represents the uncertainty in the pose estimation step and hence it is not directly related to the CNN heatmaps. A constant value is therefore chosen based on the pose estimation accuracy observed for the test dataset.


Finally, the updated state estimate $\hat{x}_k$ is obtained from the propagated state $\tilde{x}_k$, the residuals $\tilde{y}$, and the Kalman gain $K$,

$$\tilde{y} = z - h(\tilde{x}_k) \tag{39}$$

$$K = P_k H_k^T \left(H_k P_k H_k^T + R_k\right)^{-1} \tag{40}$$

$$\hat{x}_k = \tilde{x}_k + K\tilde{y}. \tag{41}$$
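Eqs. 39-41 are the standard Kalman measurement update; a generic sketch (the simple covariance update shown here is one common choice, not necessarily the one used by the authors):

```python
import numpy as np

def ekf_update(x_pred, P_pred, z, h, H, R):
    """Kalman measurement update (Eqs. 39-41); illustrative sketch.

    x_pred, P_pred: propagated state and covariance; z: measurement vector;
    h: measurement function; H: its Jacobian at x_pred; R: measurement covariance.
    """
    y = z - h(x_pred)                                 # residuals (Eq. 39)
    S = H @ P_pred @ H.T + R                          # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)               # Kalman gain (Eq. 40)
    x_upd = x_pred + K @ y                            # updated state (Eq. 41)
    P_upd = (np.eye(len(x_pred)) - K @ H) @ P_pred    # covariance update
    return x_upd, P_upd
```

With a near-zero measurement covariance, the update drives the measured component of the state to the measurement while leaving the unobserved component untouched.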

6.3. Reset Step

In the reset step, the reference quaternion $q_{ref}$ is updated with the attitude error estimate $\hat{a}$ and the new attitude error is set to zero,

$$\hat{q}_k = \delta q(\hat{a}) \otimes q_{ref,k} \tag{42}$$

$$\hat{a} = 0_{3\times1} \tag{43}$$

$$q_{ref,k+1} = \hat{q}_k. \tag{44}$$

The obtained estimated quaternion set $\hat{q}_k$ is then compared to the real quaternion set to assess the angle accuracy of the filter.
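The reset step of Eqs. 42-44, with $\delta q(\hat{a})$ built from Eq. 36, can be sketched as follows (scalar-first quaternion convention assumed):

```python
import numpy as np

def quat_mul(p, q):
    """Quaternion product p (x) q, scalar-first convention [q0, q1, q2, q3]."""
    p0, pv = p[0], p[1:]
    q0, qv = q[0], q[1:]
    return np.concatenate(([p0 * q0 - pv @ qv],
                           p0 * qv + q0 * pv + np.cross(pv, qv)))

def reset_step(a_hat, q_ref):
    """MEKF reset (Eqs. 42-44): fold the error estimate into the reference
    quaternion and zero the attitude-error state; delta_q follows Eq. 36."""
    n2 = float(a_hat @ a_hat)
    dq = np.concatenate(([(16.0 - n2) / (16.0 + n2)],
                         8.0 * a_hat / (16.0 + n2)))
    q_new = quat_mul(dq, q_ref)
    return q_new / np.linalg.norm(q_new), np.zeros(3)   # (Eq. 44), (Eq. 43)
```

A zero error estimate leaves the reference quaternion unchanged, and the reset always returns a unit quaternion.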

7. Simulations

In this section, the simulation environment and the results are presented. Firstly, the impact of including a heatmaps-derived covariance in the pose estimation step is addressed by comparing the CEPPnP method with a standard solver which does not account for feature uncertainty. The weights in Eq. 4 are selected based on the standard RGB-to-grayscale conversion ($w_R = 0.299$, $w_G = 0.587$, $w_B = 0.114$). Secondly, the performance of the MEKF is evaluated by comparing the convergence profiles obtained with a heatmaps-derived covariance matrix against covariance matrices with arbitrarily selected covariances. Initialization is provided by the CEPPnP for all the scenarios.

Two separate error metrics are adopted in the evaluation, in accordance with Sharma and D'Amico [20]. Firstly, the translational error between the estimated relative position $\hat{t}_C$ and the ground truth $t_C$ is computed as

$$E_T = |t_C - \hat{t}_C|. \tag{45}$$

This metric is also applied to the translational and rotational velocities estimated in the navigation filter. Secondly, the attitude accuracy is measured in terms of the Euler axis-angle error between the estimated quaternion $\hat{q}$ and the ground truth $q$,

$$\beta = \begin{bmatrix} \beta_s & \beta_v \end{bmatrix} = q \otimes \hat{q} \tag{46}$$

$$E_R = 2\arccos\left(|\beta_s|\right). \tag{47}$$
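The metrics of Eqs. 45-47 can be sketched as follows; here the error quaternion is formed with the conjugate of the estimate, so that identical attitudes give a zero error, which is an assumption on top of the notation in Eq. 46:

```python
import numpy as np

def translation_error(t_true, t_est):
    """Per-axis absolute translational error (Eq. 45)."""
    return np.abs(t_true - t_est)

def attitude_error(q_true, q_est):
    """Euler axis-angle error (Eqs. 46-47), scalar-first quaternions.

    The error quaternion is formed with the conjugate of the estimate
    (assumed here) so that identical attitudes yield zero error.
    """
    q_est_conj = np.concatenate(([q_est[0]], -q_est[1:]))
    beta_s = q_true[0] * q_est_conj[0] - q_true[1:] @ q_est_conj[1:]
    return 2.0 * np.arccos(min(1.0, abs(beta_s)))   # clamp guards fp round-off
```

For example, a 90-degree rotation about any axis relative to the estimate yields $E_R = \pi/2$.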


7.1. Pose Estimation

Three representative scenarios are selected from the CNN test dataset for a preliminary evaluation of the Single-stack Hourglass performance. These scenarios were chosen in order to analyze different heatmaps' distributions around the detected features. A comparison is made between the proposed CEPPnP and the EPPnP. Figure 10 shows the characteristics of the covariance matrices derived from the predicted heatmaps. Here, the ratio between the minimum and maximum eigenvalues of the associated covariances is represented against the ellipse's area and the RMSE between the Ground Truth (GT) and the $x$, $y$ coordinates of the extracted features,

$$E_{RMSE,i} = \sqrt{(x_{GT,i} - x_i)^2 + (y_{GT,i} - y_i)^2}. \tag{48}$$

Notably, interesting relations can be established between the three quantities reported in the figure. In the first scenario, the correlation between the sub-pixel RMSE and the large eigenvalues ratio suggests that a very accurate CNN detection can be associated with circular-shaped heatmaps. Moreover, the relatively low ellipses' areas indicate that, in general, small heatmaps are expected for an accurate detection. Conversely, in the second scenario the larger ellipses' area correlates with a larger RMSE. Furthermore, it can be seen that the largest difference between the $x$- and $y$-components of the RMSE occurs either for the most eccentric heatmap (ID 13) or for the one with the largest area (ID 6). The same behaviour can be observed in the last scenario, where the largest RMSE coincides with a large, highly eccentric heatmap.

Table 3 lists the pose estimation results for the three scenarios. As anticipated in Figure 10, the statistical information derived from the heatmaps in the first scenario is uniform for all the features, due to the very accurate CNN detection. As a result, the inclusion of features covariance in the CEPPnP solver does not help refine the estimated pose. Both solvers are characterized by the same pose accuracy.

Figure 10: Characteristics of the ellipses derived from the covariance matrices for the three selected scenarios ((a) Test scenario 1, (b) Test scenario 2, (c) Test scenario 3; each panel reports, per frame ID, the eigenvalues ratio, the ellipse's area, and the $x$/$y$ RMSE components).

Table 3: Single-stack Hourglass pose estimation performance results for the three selected representative scenarios.

Metric       Scenario   CEPPnP              EPPnP
E_T [m]      1          [0.18 0.22 0.24]    [0.17 0.22 0.24]
             2          [0.35 0.41 0.59]    [0.14 0.40 22.8]
             3          [0.49 0.12 1.41]    [0.56 0.16 5.01]
E_R [deg]    1          0.36                0.35
             2          0.75                6.08
             3          1.99                2.72

Not surprisingly, the situation changes as soon as the heatmaps are not uniform across the feature IDs. Due to its capability of accommodating feature uncertainties in the estimation, the CEPPnP method outperforms the EPPnP for the remaining scenarios. In other words, the CEPPnP solver proves to be more robust against inaccurate CNN detections by accounting for a reliable representation of the features covariance.

Next, the previous comparison is extended to the entire test dataset as well as to HRNet, by computing the mean and standard deviation of the estimated relative position and attitude as a function of the relative range. This is represented in Figures 11-12. First of all, it can be seen that the pose accuracy of the CEPPnP solver in the Single-stack Hourglass scenario does not improve compared to the EPPnP, as opposed to the ideal behaviour reported in Table 3. There are two potential causes of this behaviour. On the one hand, most of the test images characterized by a large RMSE (Figure 8) could not return statistically-meaningful heatmaps that would help the CEPPnP solver. This could be due to multiple heatmaps or highly inaccurate detections in which two different corners are confused with each other. On the other hand, this could be a direct consequence of the large relative ranges considered in this work. As already reported by Park et al. [26] and Sharma and D'Amico [31], a decreasing performance of EPPnP is indeed expected for increasing relative distances, due to the nonlinear relation between the pixel location of the detected features and $z^C$ in Eq. 2. In other words, relatively large pixel errors could lead to inaccurate pose estimates at large relative distances, independently of the use of either CEPPnP or EPPnP.

Furthermore, it can be seen from a different comparison level that both the mean and standard deviation of the estimated relative pose are improved when HRNet is used prior to the PnP solver (Figures 11b-12b). Again, this is a direct consequence of the smaller RMSE reported in Figure 8. As a result, the above-mentioned degradation of the pose estimation accuracy for increasing relative ranges is less critical for HRNet. Notice also that, despite an actual


Figure 11: Pose Estimation Results - Mean position error versus relative range for (a) Single-stack Hourglass and (b) HRNet. The standard deviation of the position error $E_T$ is depicted as the length of each error bar above and below the mean error $\bar{E}_T$.

improvement of CEPPnP over EPPnP in the HRNet scenario, the improvements in both the mean and the standard deviation of the estimation error are relatively small at large relative distances. This is considered to be related to the fact that HRNet returns circular heatmaps for most of the detected features, due to its higher detection accuracy compared to the Single-stack Hourglass.

Notably, it is important to assess how well the pose estimation system can scale when tested on datasets different from the Envisat one. To this aim, the proposed heatmaps-based scheme was benchmarked on the SPEED dataset, in order to compare its pose accuracy against standard as well as CNN-based systems [19, 26, 25]. The reader is referred to Barad [38, p. 115] for a comprehensive quantitative analysis of such a comparison. The results demonstrated that the performance of the proposed pipeline, based on extracting feature heatmaps and using the CEPPnP solver, compares well with state-of-the-art pose estimation systems.

7.2. Navigation Filter

To assess the performance of the proposed MEKF, a rendezvous scenario with Envisat is rendered in Cinema 4D©. This is a perturbation-free VBAR trajectory characterized by a relative velocity $\|v\| = 0$ m/s. The Envisat performs


Figure 12: Pose Estimation Results - Mean attitude error versus relative range for (a) Single-stack Hourglass and (b) HRNet. The standard deviation of the attitude error $E_R$ is depicted as the length of each error bar above and below the mean error $\bar{E}_R$.

a roll rotation of $\|\omega\|$ = 5 deg/s, with the servicer camera frame aligned with the LVLH frame. Table 4 lists the initial conditions of the trajectory, whereas Figure 13 shows some of the associated rendered 2D images. It is assumed that the images are made available to the filter every 2 seconds for the measurement update step, with the propagation step running at 1 Hz. In both scenarios, the MEKF is initialized with the CEPPnP pose solution at time $t_0$. The other elements of the initial state vector are randomly chosen assuming a standard deviation of 1 mm/s and 1 deg/s, for all axes, on the terms $(\hat{v}_0 - v)$ and $(\hat{\omega}_0 - \omega)$, respectively. Table 5 reports the initial conditions of the filter.

Table 4: VBAR approach scenario. The attitude is represented in terms of ZYX Euler angles for clarity. Note that the camera boresight is the Y-axis of the LVLH frame.

θ0 [deg]          ω0 [deg/s]           t_C0 [m]      v0 [mm/s]
[-180 30 -80]^T   [-2.5 -4.3 0.75]^T   [0 150 0]^T   [0 0 0]^T

Figures 14-15 show the convergence profiles for the translational and rotational states in the tightly- and loosely-coupled MEKF, respectively. Besides, a Monte Carlo simulation with 1,000 runs was performed to assess the robustness of the filter estimate against variations of the initial state $\hat{x}_0$. Table 6 lists the standard deviation chosen for the deviation from the true initial state of the filter. The


Figure 13: Montage of the selected VBAR approach scenario. Images are shown every 6 s for clarity.

Table 5: Initial state vector in the MEKF. Here, HG refers to the Single-stack Hourglass architecture.

CNN     θ̂0 [deg]                ω̂0 [deg/s]          t̂_C0 [m]              v̂0 [mm/s]
HG      [-180.2 28.7 -80.6]^T   [-2.1 -4.1 0.1]^T   [0.1 149.7 0.1]^T     [2.8 -1.3 3]^T
HRNet   [-179.5 33.5 -79.7]^T   [-2.1 -4.1 0.1]^T   [-0.1 149.8 -0.1]^T   [2.8 -1.3 3]^T

Table 6: Standard deviation of Monte Carlo variables.

σ_Δφ0 [deg]   σ_ω0 [deg/s]   σ_tC0 [m]    σ_v0 [mm/s]
10            1              [1 10 1]^T   10


distribution follows a Gaussian profile with true-state mean. For the attitude initial error, the initial reference quaternion $q_{ref,0}$ is perturbed by introducing a random angular error around the correct Euler axis [39, p. 44],

$$\Delta\boldsymbol{\phi}_0 = \Delta\phi_0\, q_v \tag{49}$$

$$q_{ref,0} = q_0 \otimes \begin{bmatrix} 1 \\ \frac{1}{2}\Delta\boldsymbol{\phi}_0 \end{bmatrix}. \tag{50}$$
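The perturbation of Eqs. 49-50 can be sketched as follows, assuming the random angle is drawn from a zero-mean Gaussian with the standard deviation of Table 6 and that the axis is the unit vector part of the true quaternion:

```python
import numpy as np

def perturb_reference(q0, sigma_phi_deg, rng):
    """Perturb a reference quaternion by a random small angle (Eqs. 49-50).

    q0: true quaternion, scalar-first; sigma_phi_deg: assumed 1-sigma angular
    error in degrees; rng: a numpy random Generator.
    """
    dphi = np.radians(rng.normal(0.0, sigma_phi_deg))   # random angle
    axis = q0[1:] / np.linalg.norm(q0[1:])              # Euler axis from q_v (Eq. 49)
    dq = np.concatenate(([1.0], 0.5 * dphi * axis))     # small-angle quaternion (Eq. 50)
    # Quaternion product q0 (x) dq, scalar-first convention
    p0, pv = q0[0], q0[1:]
    d0, dv = dq[0], dq[1:]
    q = np.concatenate(([p0 * d0 - pv @ dv],
                        p0 * dv + d0 * pv + np.cross(pv, dv)))
    return q / np.linalg.norm(q)                        # renormalize
```

A zero standard deviation leaves the reference quaternion unchanged, and the output is always renormalized to unit length.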

Table 7 reports the mean of the steady-state pose estimates together with their standard deviation. From these results, important insights can be gained at two different levels of the comparison.

Table 7: Monte Carlo Simulation Results. The mean and standard deviation of the relative pose errors are taken from the absolute errors at filter steady state for each Monte Carlo run.

Single-stack Hourglass
MEKF              E_T1 [m]          E_T2 [m]        E_T3 [m]         E_R [deg]
Tightly-coupled   0.1182 ± 6.3E-4   0.03 ± 0.0024   0.096 ± 5.4E-4   1.33 ± 0.03
Loosely-coupled   0.1 ± 2E-7        0.33 ± 3E-7     0.01 ± 1E-4      4.7 ± 12.6

HRNet
MEKF              E_T1 [m]          E_T2 [m]        E_T3 [m]         E_R [deg]
Tightly-coupled   0.0683 ± 5.2E-4   0.03 ± 0.014    0.01 ± 3.4E-5    4.3 ± 0.03
Loosely-coupled   0.0075 ± 1E-4     0.36 ± 6.3E-5   0.002 ± 4E-4     0.93 ± 3.2E-7

On a CNN performance level, the results in Figure 14 show that a slightly worse cross-track estimate of the Single-stack Hourglass is compensated by a more accurate estimate of the relative attitude. Given the limited impact of these estimation errors at the relatively large inter-satellite range of 150 m, these results suggest that the Single-stack Hourglass has a performance comparable with that of HRNet for the selected scenario. Next, on a filter architecture level, a comparison between Figures 14-15 illustrates the different convergence patterns of the tightly- and loosely-coupled MEKF. Most importantly, it can be


seen that the loosely-coupled estimate of the relative along-track position is characterized by a bias which is not present in the tightly-coupled estimate. This occurs due to the decoupling of the translational and rotational states, reflected in the Jacobian $H_k$ in Eq. 25. As a result, the relative position is estimated without accounting for the attitude measurements, and vice versa. In other words, the creation of pseudomeasurements of the relative pose prior to the loosely-coupled filter leads to two separate translational and rotational estimates. Conversely, in the tightly-coupled filter the full statistical information is enclosed in the detected features, and can be used to simultaneously refine both the translational and the rotational states. Moreover, a close inspection of the Single-stack Hourglass attitude estimates in Table 7 suggests that the tightly-coupled MEKF is characterized by a lower standard deviation, highlighting a better robustness with respect to the initial conditions of the filter when compared to the loosely-coupled MEKF. Note that, due to the higher accuracy of HRNet in the feature detection step, and hence also in the pose estimation step, this is not observed for the latter CNN. In conclusion, a tightly-coupled architecture is expected to return higher pose accuracies if simplified CNNs, such as the proposed Single-stack Hourglass, are implemented at the feature detection level.

8. Conclusions and Recommendations

This paper introduces a novel framework to estimate the relative pose of an uncooperative target spacecraft with a single monocular camera onboard a servicer spacecraft. A method is proposed in which a CNN-based IP algorithm is combined with a CEPPnP solver and a tightly-coupled MEKF to return a robust estimate of the relative pose as well as of the relative translational and rotational velocities. The performance of the proposed method is evaluated at different levels of the pose estimation system, by comparing the detection accuracy of two different CNNs (feature detection step and pose estimation step) whilst assessing the accuracy and robustness of the selected tightly-coupled filter against a loosely-coupled filter (navigation filter step).


Figure 14: Navigation Filter Results - Tightly-coupled MEKF. Euler axis-angle error over the number of orbits for the Single-stack Hourglass and HRNet. The dashed lines represent the 1σ of the estimated quantities.

Figure 15: Navigation Filter Results - Loosely-coupled MEKF. Euler axis-angle error over the number of orbits for the Single-stack Hourglass and HRNet. The dashed lines represent the 1σ of the estimated quantities.

The main novelty of the proposed CNN-based pose estimation system is to introduce a heatmaps-derived covariance representation of the detected features and to exploit this information in a tightly-coupled, Single-stack Hourglass-based MEKF. On a feature detection level, the performance of the proposed Single-stack Hourglass is compared to the more complex HRNet to assess the feasibility of a reduced-parameters CNN within the IP. Results on the selected test dataset suggest a comparable mean detection accuracy, despite a larger standard deviation of the former network. Notably, this latter aspect is found to decrease the pose estimation accuracy of the proposed CNN compared to HRNet, despite the adoption of CEPPnP to capture features uncertainty. However, important insights are gained at a navigation filter level, delineating two major benefits of the proposed tightly-coupled MEKF. First of all, the capability of deriving a measurements covariance matrix directly from the CNN heatmaps allows the filter to capture a more representative statistical distribution of the measurements. Notably, this is expected to be a more complex task if a loosely-coupled filter is used, due to the need to convert the heatmaps distribution into a pose estimation uncertainty through a linear transformation. Secondly, the coupling between the rotational and translational states within the filter guarantees a mutual interaction which is expected to improve the global accuracy of the filter, especially in the along-track estimate. Besides, the navigation results for the selected VBAR scenario demonstrated that the proposed Single-stack Hourglass could represent a valid alternative to the more complex HRNet, provided that its larger detection uncertainty is reflected in the measurements covariance matrix. Together, these improvements suggest a promising scheme to cope with the challenging demand for robust navigation in close-proximity scenarios.

However, further work is required in several directions. First of all, more recent CNN architectures shall be investigated to assess the achievable robustness and accuracy in the feature detection step. Secondly, the impact of a reduction in the number of CNN parameters on the computational complexity shall be assessed by testing the CNNs on space-representative processors. Moreover, broader relative ranges between the servicer camera and the target spacecraft shall be considered, most importantly to allow a thorough investigation of the 3D depth perception challenges when approaching the target spacecraft with a single monocular camera. Besides, more close-proximity scenarios shall be recreated to assess the impact of perturbations on the accuracy and robustness of the navigation filter. In this context, other navigation filters such as the Unscented Kalman Filter shall be investigated.

Acknowledgments

This study is funded and supported by the European Space Agency and Airbus Defence and Space under the Network/Partnering Initiative (NPI) program with grant number NPI 577 - 2017. The first author would like to thank Christoph Haskampf for the rendering of the images in Cinema 4D© and Kuldeep Barad for the adaptation of HRNet to the Envisat scenario.

References660

[1]

A. Tatsch, N. Fitz-Coy, S. Gladun, On-orbit Servicing: A brief survey,

in: Proceedings of the 2006 Performance Metrics for Intelligent Systems

Workshop, 2006, pp. 21–23.

[2]

M. Wieser, H. Richard, G. Hausmann, J.-C. Meyer, S. Jaekel, M. Lavagna,

R. Biesbroek, e.deorbit mission: OHB debris removal concepts, in: ASTRA

665

2015-13th Symposium on Advanced Space Technologies in Robotics and

Automation, Noordwijk, The Netherlands, 2015.

[3]

J. Davis, H. Pernicka, Proximity operations about and identiﬁcation of non-

cooperative resident space objects using stereo imaging, Acta Astronautica

155 (2019) 418–425.670

40

[4]

V. Pesce, M. Lavagna, R. Bevilacqua, Stereovision-based pose and inertia

estimation of unknown and uncooperative space objects, Advances in Space

Research 59 (2017) 236–251.

[5]

R. Opromolla, G. Fasano, G. Ruﬁno, M. Grassi, Uncooperative pose

estimation with a LIDAR-based system, Acta Astronautica 110 (2015)

675

287–297.

[6]

S. Segal, P. Gurﬁl, K. Shahid, In-Orbit Tracking of Resident Space Objects:

A Comparison of Monocular and Stereoscopic Vision, IEEE Transactions

on Aerospace and Electronic Systems 50 (2014) 676–688.

[7]

S. Sharma, J. Ventura, S. D’Amico, Robust Model-Based Monocular

680

Pose Initialization for Noncooperative Spacecraft Rendezvous, Journal of

Spacecraft and Rockets 55 (2018) 1–16.

[8]

L. Pasqualetto Cassinis, R. Fonod, E. Gill, Review of the Robustness and

Applicability of Monocular Pose Estimation Systems for Relative Navigation

with an Uncooperative Spacecraft, Progress in Aerospace Sciences 110

685

(2019).

[9]

Lepetit, F. Moreno-Noguer, P. Fua, EPnP: an accurate O(n) solution to the

PnP problem, International Journal of Computer Vision 81 (2009) 155–166.

[10]

L. Ferraz, X. Binefa, F. Moreno-Noguer, Very Fast Solution to the

PnP Problem with Algebraic Outlier Rejection, in: IEEE Conference

690

on Computer Vision and Pattern Recognition, Columbus, OH, USA, 2014.

doi:10.1109/CVPR.2014.71.

[11]

A. Ostrowsky, Solution of Equations and Systems of Equations, 2 ed.,

Academic Press, New York, 1966.

[12]

S. Urban, J. Leitloﬀ, S. Hinz, Mlpnp: A rel-time maximum likelihood

695

solution to the perspective-n-point problem, Proceedings of the British

Machine Vision Conference 3 (2016).

41

[13]

L. Ferraz, X. Binefa, F. Moreno-Noguer, Leveraging Feature Uncertainty in

the PnP Problem, in: Proceedings of the British Machine Vision Conference,

Nottingham, UK, 2014. doi:10.5244/C.28.83.700

[14]

J. Cui, C. Min, X. Bai, J. Cui, An Improved Pose Estimation Method

Based on Projection Vector with Noise Error Uncertainty, IEEE Photonics

Journal 11 (2019).

[15]

A. Harvard, V. Capuano, E. Shao, S.-J. Chung, Spacecraft Pose Estima-

tion from Monocular Images Using Neural Network Based Keypoints and

705

Visibility Maps, in: AIAA Scitech 2020 Forum, Orlando, FL, USA, 2020.

doi:10.1109/AERO.2018.8396425.

[16] M. Kisantal, S. Sharma, T. Park, D. Izzo, M. Martens, S. D'Amico, Satellite Pose Estimation Challenge: Dataset, Competition Design and Results, IEEE Transactions on Aerospace and Electronic Systems (2020).

[17] D. Rondao, N. Aouf, Multi-View Monocular Pose Estimation for Spacecraft Relative Navigation, in: 2018 AIAA Guidance, Navigation, and Control Conference, Kissimmee, FL, USA, 2018. doi:10.2514/6.2018-2100.

[18] V. Capuano, S. Alimo, A. Ho, S. Chung, Robust Features Extraction for On-board Monocular-based Spacecraft Pose Acquisition, in: AIAA Scitech 2019 Forum, San Diego, CA, USA, 2019. doi:10.2514/6.2019-2005.

[19] S. Sharma, C. Beierle, S. D'Amico, Pose Estimation for Non-Cooperative Spacecraft Rendezvous using Convolutional Neural Networks, in: IEEE Aerospace Conference, Big Sky, MT, USA, 2018. doi:10.1109/AERO.2018.8396425.

[20] S. Sharma, S. D'Amico, Pose Estimation for Non-Cooperative Spacecraft Rendezvous using Neural Networks, in: 29th AAS/AIAA Space Flight Mechanics Meeting, Ka'anapali, HI, USA, 2019.

[21] J. Shi, S. Ulrich, S. Ruel, CubeSat Simulation and Detection using Monocular Camera Images and Convolutional Neural Networks, in: 2018 AIAA Guidance, Navigation, and Control Conference, Kissimmee, FL, USA, 2018. doi:10.2514/6.2018-1604.

[22] S. Sonawani, R. Alimo, R. Detry, D. Jeong, A. Hess, H. Ben Amor, Assistive relative pose estimation for on-orbit assembly using convolutional neural networks, in: AIAA Scitech 2020 Forum, Orlando, FL, USA, 2020.

[23] G. Pavlakos, X. Zhou, A. Chan, K. Derpanis, K. Daniilidis, 6-DoF Object Pose from Semantic Keypoints, in: IEEE International Conference on Robotics and Automation, 2017.

[24] A. Newell, K. Yang, J. Deng, Stacked hourglass networks for human pose estimation, in: B. Leibe, J. Matas, N. Sebe, M. Welling (Eds.), Computer Vision - ECCV 2016, volume 9912, Springer, Cham, 2016, pp. 483–499.

[25] B. Chen, J. Cao, A. Parra, T. Chin, Satellite Pose Estimation with Deep Landmark Regression and Nonlinear Pose Refinement, in: International Conference on Computer Vision, Seoul, South Korea, 2019.

[26] T. Park, S. Sharma, S. D'Amico, Towards Robust Learning-Based Pose Estimation of Noncooperative Spacecraft, in: AAS/AIAA Astrodynamics Specialist Conference, Portland, ME, USA, 2019.

[27] S. D'Amico, M. Benn, J. Jorgensen, Pose Estimation of an Uncooperative Spacecraft from Actual Space Imagery, International Journal of Space Science and Engineering 2 (2014) 171–189.

[28] K. Sun, B. Xiao, D. Liu, J. Wang, Deep high-resolution representation learning for human pose estimation, in: 2019 IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 2019.

[29] L. Pasqualetto Cassinis, R. Fonod, E. Gill, I. Ahrns, J. Gil Fernandez, Comparative Assessment of Image Processing Algorithms for the Pose Estimation of an Uncooperative Spacecraft, in: International Workshop on Satellite Constellations & Formation Flying, Glasgow, UK, 2019.

[30] L. Pasqualetto Cassinis, R. Fonod, E. Gill, I. Ahrns, J. Gil Fernandez, CNN-Based Pose Estimation System for Close-Proximity Operations Around Uncooperative Spacecraft, in: AIAA Scitech 2020 Forum, Orlando, FL, USA, 2020. doi:10.2514/6.2020-1457.

[31] S. Sharma, S. D'Amico, Reduced-Dynamics Pose Estimation for Non-Cooperative Spacecraft Rendezvous Using Monocular Vision, in: 38th AAS Guidance and Control Conference, Breckenridge, CO, USA, 2017.

[32] H. Curtis, Orbital Mechanics for Engineering Students, Elsevier, 2005.

[33] M. A. Fischler, R. Bolles, Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography, Communications of the ACM 24 (1981) 381–395.

[34] O. Ronneberger, P. Fischer, T. Brox, U-net: Convolutional networks for biomedical image segmentation, in: Medical Image Computing and Computer-Assisted Intervention, Springer, 2015, pp. 234–241.

[35] D. Kingma, J. Ba, Adam: A method for stochastic optimization, in: 3rd International Conference for Learning Representations, San Diego, CA, USA, 2015.

[36] B. Tweddle, A. Saenz-Otero, Relative Computer Vision Based Navigation for Small Inspection Spacecraft, Journal of Guidance, Control, and Dynamics 38 (2015) 969–978.

[37] E. Lefferts, F. Markley, M. Shuster, Kalman Filtering for Spacecraft Attitude Estimation, Journal of Guidance, Control, and Dynamics 5 (1982) 417–429.

[38] K. Barad, Robust Navigation Framework for Proximity Operations around Uncooperative Spacecraft, MSc Thesis, Delft University of Technology, Delft, The Netherlands, 2020.

[39] J. Solà, Quaternion kinematics for the error-state Kalman filter, arXiv e-prints (2017) arXiv:1711.02508.