JINT manuscript No.
(will be inserted by the editor)
Guided Sonar-to-Satellite Translation
Giovanni G. De Giacomo ·Matheus M.
dos Santos ·Paulo L. J. Drews-Jr ·Silvia
S. C. Botelho
Received: date / Accepted: date
Abstract Underwater navigation and localization are greatly enhanced by the
use of acoustic images. However, such images are difficult to interpret. In
contrast, aerial images are easier to interpret, but acquiring them requires
Global Positioning System (GPS) sensors. Due to the absorption of
electromagnetic waves in water, GPS sensors are unavailable in underwater
environments. Thus, we propose a method to translate sonar images acquired
underwater into an aerial counterpart. This process is called sonar-to-satellite
translation. To perform the conversion, a U-Net based neural network is proposed,
enhanced with state-of-the-art techniques such as dilated convolutions and guided
filters. Our approach is validated on two datasets containing sonar images and
their satellite analogues. Qualitative experimental results indicate that
the proposed method can transfer features from acoustic images to aerial images,
generating satellite images that are easier to interpret and visualize.
Keywords Deep Learning · Neural Networks · Robotics
PACS 68 Computer Science
Mathematics Subject Classification (2010) MSC 68T45 · MSC 68T40
Giovanni G. De Giacomo (Corresponding Author)
Federal University of Rio Grande (FURG)
Computer Science Center
Av. Itália Km 8
Rio Grande, RS, Brazil
Tel.: +55 (53) 99951-9892
Matheus M. dos Santos, Paulo L. J. Drews-Jr, Silvia S. C. Botelho
Federal University of Rio Grande (FURG)
Computer Science Center
ORCID: 0000-0002-6036-8670, 0000-0002-7519-0502, 0000-0002-8857-0221
E-mail: email@example.com, firstname.lastname@example.org, email@example.com
1 Introduction
Converting images into other images, even across different domains, is a
problem that lies at the core of many Machine Learning applications. This
conversion is known as image-to-image translation, and Isola et al. (2017)
proposed a general U-Net based Convolutional Neural Network (CNN) to solve it.
Water has physical properties that make it difficult to operate underwater
robots, such as Autonomous Underwater Vehicles (AUVs). The main problems are the
malfunction of light-based sensors, e.g., cameras and lasers, and of Global
Positioning System (GPS) sensors. These malfunctions happen because of the rapid
attenuation that electromagnetic waves undergo underwater. Therefore, underwater
robots traditionally use sonar images as their preferred source of reliable
input. However, acoustic images require intensive processing to extract
information, due to phenomena such as noise. As a consequence, underwater robot
localization and navigation, as attempted in Dos Santos et al. (2019b,a), is an
area that still requires new and effective methods.
This paper proposes facilitating the interpretation of sonar images through
aerial images when operating in underwater environments. The proposal consists
of translating an acoustic image into a satellite counterpart. Giacomo et al.
(2018) defined sonar-to-satellite translation as the task of converting an
acoustic image into an aerial one. The objective is that satellite images
generated by such methods could be matched against authentic satellite images to
locate a robot without needing a GPS sensor. Figure 1 shows a schematic of the
sonar-to-satellite translation problem when solved by a CNN.
Fig. 1: Sonar-to-satellite translation is defined as the conversion of the
acoustic image on the left into the satellite image on the right. This diagram
also shows a CNN as a solution to the problem.
This work is an extension of Giacomo et al. (2018), which defined
sonar-to-satellite translation and presented results with a pix2pix CNN. The
method was extended with a new architecture that includes dilated convolutions
from Yu and Koltun (2015); Steffens et al. (2019, 2020) and guided filters from
He et al. (2013) in the generator, as well as more sophisticated loss functions,
such as the style reconstruction loss by Johnson et al. (2016). Also, a whole new
dataset was included for testing our model. These extensions are discussed
further in the following sections.
In this paper, several experiments are conducted with two local datasets,
using a U-Net based neural network architecture augmented with guided layers
from Gonçalves et al. (2018) and with a conditional Generative Adversarial
Network (cGAN). The visual quality of the results is then verified through
side-by-side comparisons with the actual ground truth satellite images, as well
as by calculating image quality metrics. These ground truth images were obtained
by using the data from the GPS and magnetometer sensors of the underwater robot.
The goal of this research is to verify whether it is possible to reduce the
difficulty of acquiring aerial images when navigating underwater. This
difficulty is caused by the unreliability of GPS sensors in such environments.
The proposed method attempts to solve the issue by translating sonar images into
satellite ones using a neural network.
This paper is organized as follows: in the next section, we discuss the related
works; then, we introduce the two datasets used in the experiments and describe
their particularities; afterwards, we present the methodology used to attack the
sonar-to-satellite problem; subsequently, we present and discuss the
experimental results on the two datasets. Finally, we conclude and summarize our
contributions.
2 Related Works
Apart from Giacomo et al. (2018), using a CNN to extract an aerial image from a
sonar one is, to the best of our knowledge, an unprecedented concept. However,
there are various related works upon which this paper is based. Therefore, this
section describes papers on CNNs, image filtering and general neural network
techniques that inspired our model and research. Articles about locating
vehicles on land using satellite images are also discussed, since a similar idea
is proposed in this paper, but for underwater domains.
Image-to-image translation is the task of translating one possible
representation of a scene into another, as defined in Isola et al. (2017). To
solve this problem, Isola et al. created a CNN architecture called pix2pix. This
architecture is based on the U-Net network for medical segmentation that was
proposed in Ronneberger et al. (2015). By adding a cGAN component to the network
architecture and providing general hyperparameters, Isola et al. provided a
standardized approach to solving image translation problems. In addition, Isola
et al. performed various experiments using their proposed methodology, working
with varied datasets covering several problems within the domain of Computer
Vision (CV). One of these problems, converting aerial images into maps, inspired
the present work.
As one of the bases of our CNN architecture, the U-Net network, described in
Ronneberger et al. (2015), is an important related work. In their paper,
Ronneberger et al. employed skip connections between the two halves of the
network. A skip connection feeds a feature map from an earlier layer directly
into a later layer; in encoder-decoder architectures, it links an encoding layer
to its decoding counterpart. Due to their proven capacity to improve learning by
returning structures previously discarded by the network, skip connections have
been widely used in many applications and neural network architectures,
including our own. However, skip connections come with a large memory footprint,
since the required feature maps from previous layers must be kept in memory.
Yu and Koltun (2015) introduced dilated convolutions, which aggregate
contextual information without losing coverage or resolution. By making use of
dilated convolutions, one is able to expand the receptive field dramatically.
Therefore, the network is able to capture significantly more context than it
would with traditional convolutional layers.
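The effect of dilation on the receptive field can be illustrated with a minimal
sketch. It assumes 1-D signals and stride-1 convolutions stacked one after
another, as in Yu and Koltun (2015); the function names are hypothetical and are
not taken from the authors' implementation.

```python
def dilated_conv1d(signal, kernel, dilation):
    """'Valid' 1-D dilated convolution (cross-correlation, as in CNN layers)."""
    span = (len(kernel) - 1) * dilation  # distance between first and last tap
    return [
        sum(kernel[t] * signal[i + t * dilation] for t in range(len(kernel)))
        for i in range(len(signal) - span)
    ]


def receptive_field(kernel_size, dilations):
    """Receptive field of a stack of stride-1 dilated convolutions."""
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d  # each layer widens the field by (k-1)*d
    return field


if __name__ == "__main__":
    # Four stacked 3-tap layers: plain versus dilation rates 1, 2, 4 and 8.
    print(receptive_field(3, [1, 1, 1, 1]))
    print(receptive_field(3, [1, 2, 4, 8]))
```

With the same number of weights, the dilated stack covers a much wider
neighbourhood than the plain one, which is the property exploited in this work.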
He et al. (2013) proposed the guided filter, an image filter that outputs a
locally linear transform of the guidance image. As detailed in its original
paper, the guided filter has good edge-preserving properties. Unlike the
bilateral filter from Tomasi and Manduchi (1998), it does not suffer from
gradient reversal artifacts. Also, unlike the bilateral and simple linear
translation-invariant (LTI) filters, the guided filter can be used to transfer
structure from the guidance image to the output. Due to this
structure-transferring property, He et al. (2013) envisioned that the filter
could be used in applications such as feathering, dehazing and high-quality
stereo matching. Wu et al. (2018) published and made available an implementation
of a fast end-to-end trainable guided filter.
Gonçalves et al. (2018) introduced GuidedNet, a model that uses a new neural
network layer called the guided layer. Guided layers use guided filters to
transfer structural information, which is partially lost due to convolutions,
back into the output of the neural network. Although the model was initially
proposed for image dehazing, it also works well for other tasks involving image
generation or restoration.
Viswanathan et al. (2014) localize a ground vehicle by using satellite images
as a map. The method creates a feature database by splitting the satellite image
into a grid and describing each cell. First, the ground-based panoramic images
are warped into a top-down view of the scene. Then, the view is described and
used as a query on the satellite database. Finally, a particle filter framework
is integrated into the solution to estimate the vehicle position and orientation
during navigation. The proposal is validated with experimental tests that often
show better position estimates than the GPS.
Kim and Walter (2017) proposed a ground localization method using a learned
embedding strategy. A CNN based on the Siamese architecture is used to extract
a 4096-dimensional feature vector able to match ground-level imagery with its
respective satellite view. These matches then serve as noisy observations of the
position and orientation of the vehicle, which are fed into a particle filter
that maintains a distribution over the pose during navigation.
Deng et al. proposed a method to generate ground level images from aerial
images. Combined with the method proposed in this paper, it would be possible
to produce an aerial image from an underwater acoustic image. Thereafter, a
ground level image can be generated, using the method by Deng et al., and used
for appropriate applications.
3 Datasets
To evaluate our model under different conditions and locations, two real-world
datasets were used: ARACATI 2014 and ARACATI 2017. Both datasets were captured
at the Yacht Club of Rio Grande, Brazil. However, they were obtained in
different places inside the Yacht Club, as well as in different years.
Therefore, the acoustic images contained in these datasets are substantially
different, as can be verified in Figure 2.
Fig. 2: 2a shows an example of an acoustic image from the ARACATI 2014 dataset.
2b presents a sonar image from the ARACATI 2017 dataset. 2c displays a satellite
image highlighting the places where the images 2a and 2b were captured. Satellite
images from Google ©, Digital Globe © 08-06-2017, 32°01'30.1"S 52°06'24.1"W.
Both datasets were recorded by a Seabotix LBV-300 mini Remotely Operated
Vehicle (ROV) equipped with a Teledyne BlueView P900-130 Multibeam Forward
Looking Sonar (MFLS), a magnetic compass and a SOUTH S82T Differential Global
Positioning System (DGPS). A floating board is attached to the vehicle so that
it follows the vehicle and remains on the surface of the water during the
trajectory. The DGPS is installed on top of the floating board and records the
2D vehicle position with high precision.
3.1 ARACATI 2014
This dataset was first published in the work of Silveira et al. (2015). In 75
minutes, the vehicle travelled a total of 802 meters, acquiring 10232 images.
Figure 3a shows the path travelled by the robot. The MFLS was configured to
cover a range of 30 meters. Because of the location and the path travelled by
the vehicle, the main feature of this dataset is the dense presence of
structures such as piers and boats.
3.2 ARACATI 2017
As previously mentioned, this dataset was collected at one of the harbors of the
Yacht Club of Rio Grande, Brazil, by an underwater robot. The MFLS was
configured to cover a range of 50 meters. Figure 3b shows the path travelled by
the vehicle, alongside further information regarding the length of the voyage.
In total, 24676 images were captured in 77 minutes. Unlike ARACATI 2014, the main
characteristic of this dataset is the sparse presence of structures, which
occupy a smaller area of the images because of the increased coverage range of
the sonar.
Fig. 3: Robot path of the adopted datasets. 3a shows the path of ARACATI
2014. 3b shows the path of ARACATI 2017. Satellite images from Google ©,
Digital Globe © 08-06-2017, (a) 32°01'33.7"S 52°06'30.7"W (b) 32°01'30.1"S
Fig. 4: Diagram of the data preprocessing workflow that generates the ground
truth data. Satellite images from Google ©, Digital Globe ©.
3.3 Data Preprocessing
To create the training dataset, a satellite image of the Yacht Club provided by
Google Earth was used. The satellite image is automatically cropped considering
the position from the DGPS, the heading from the magnetic compass and the
coverage field of each acoustic image,¹ as shown in Figure 4.
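The geometry behind this cropping can be illustrated with a minimal sketch. It
assumes a local east/north frame in meters, a heading measured clockwise from
north, and a 130-degree horizontal field of view (inferred here from the sonar
model name). None of these conventions are stated in the paper, and the function
name is hypothetical.

```python
import math


def sonar_footprint(east, north, heading_deg, range_m, fov_deg=130.0):
    """Return the vehicle position plus the two far corners of the sonar fan."""
    corners = [(east, north)]
    for offset in (-fov_deg / 2.0, fov_deg / 2.0):
        # Heading is assumed clockwise from north, so east uses sin and
        # north uses cos of the beam angle.
        angle = math.radians(heading_deg + offset)
        corners.append((east + range_m * math.sin(angle),
                        north + range_m * math.cos(angle)))
    return corners


if __name__ == "__main__":
    # A vehicle at the origin heading north with a 30 m sonar range.
    for point in sonar_footprint(0.0, 0.0, 0.0, 30.0):
        print(point)
```

The region of the satellite image enclosed by these points (and the arc between
the two far corners) would be the crop paired with the acoustic image.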
Fig. 5: Manual compass correction tool. Each satellite image is manually rotated
with mouse commands until it correctly matches the corresponding sonar image.
After initial processing, problems were discovered with the compass data. Some
cropped satellite images did not correctly match the sonar images, which
hindered the learning process of the neural network. In order to fix this
problem, a tool was developed for manual correction of the compass data.
Figure 5 displays the interface of the tool.
A fixed offset is not enough to solve the misalignment problem for all images,
because the compass is affected by magnetic interference from ship hulls or even
from the vehicle motor. Therefore, each image had to be manually corrected.
An image selection criterion was adopted: only images with a time-stamp
difference lower than 0.13 seconds in the DGPS and compass data were selected.
This procedure resulted in 2894 images. Since the DGPS, compass and sonar have
different acquisition rates, the criterion ensures that the most synchronized
data are selected. Figure 6 shows the positions of the selected images along the
trajectory.
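A selection of this kind can be sketched as follows. The sketch assumes one
possible reading of the criterion, namely that both the nearest DGPS reading and
the nearest compass reading must lie within 0.13 seconds of the sonar frame. The
function names and the exact matching rule are assumptions, not the authors'
tool.

```python
from bisect import bisect_left


def nearest(sorted_times, t):
    """Return the element of the sorted list sorted_times closest to t."""
    i = bisect_left(sorted_times, t)
    candidates = sorted_times[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda c: abs(c - t))


def select_synchronized(sonar_times, dgps_times, compass_times, max_diff=0.13):
    """Keep sonar frames whose nearest DGPS and compass readings are in sync."""
    selected = []
    for t in sonar_times:
        g = nearest(dgps_times, t)
        c = nearest(compass_times, t)
        # Assumed rule: both sensor readings must be close to the frame time.
        if abs(g - t) < max_diff and abs(c - t) < max_diff:
            selected.append(t)
    return selected


if __name__ == "__main__":
    sonar = [0.0, 1.0, 2.0]
    dgps = [0.05, 1.5, 2.1]
    compass = [0.1, 1.05, 2.0]
    print(select_synchronized(sonar, dgps, compass))
```

In this toy run the frame at 1.0 s is discarded because the closest DGPS reading
is half a second away.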
After preprocessing of the datasets, the ARACATI 2017 dataset contained 2894
pairs of acoustic and ground truth satellite images that were used for training
purposes. On the other hand, the ARACATI 2014 dataset contained 839 pairs of
acoustic and ground truth satellite images used exclusively for testing purposes.
¹ A video showcasing the cropping of the dataset is available at https://youtu.be/
Fig. 6: The positions of the 2894 selected images that were manually corrected.
Image provided by Google ©, Digital Globe ©.
4 Methodology
The formal definition of sonar-to-satellite translation is a function
$G: \mathbb{R}^{H \times W} \to \mathbb{R}^{C \times H \times W}$, where $G$ is,
in this case, a generative CNN and $C$, $H$, $W$ are the depth, height and width
of the satellite image, respectively.
To attack the sonar-to-satellite problem, this section proposes a trainable
end-to-end CNN, using state-of-the-art techniques from the Deep Learning
literature. The architecture is built from a generator and a discriminator
network that operate jointly. The generator is a custom U-Net architecture,
making use of encoding and decoding layers, as well as skip connections.
Additionally, the generator uses trainable end-to-end guided filters to transfer
structure from acoustic images to aerial ones. The discriminator, in turn, is a
Deep Convolutional Generative Adversarial Network (DCGAN) and exists only for
training purposes, i.e., it is not used during evaluation. Both of these
networks and their details are outlined in this section.
4.1.1 U-Net based network augmented with guided ﬁlter
One of the most critical pieces of the proposed architecture is the guided filter
from He et al. (2013). Since the guided filter is a general linear
translation-variant filter, the following equation describes its output at a
pixel $i$:
$$q_i = \sum_j W_{ij}(I)\, p_j \tag{1}$$
In this equation, $p_j$ is the input pixel, $W_{ij}$ is the filter kernel, $I$ is
the guide and $q_i$ is the output pixel.
As defined in He et al. (2013), the following function defines the guided filter
for color images:
$$q_i = a_k^{T} I_i + b_k, \quad \forall i \in \omega_k \tag{2}$$
where $I_i$ is a 3 × 1 color vector, $\omega_k$ is a window centered in pixel
$k$, $a_k$ is a 3 × 1 vector of coefficients, and $q_i$ and $b_k$ are scalars.
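By minimizing a linear ridge regression model, as in Draper and Smith (2014),
the coefficients for the local linear model can be defined as follows:
$$a_k = (\Sigma_k + \epsilon U)^{-1} \left( \frac{1}{|\omega|} \sum_{i \in \omega_k} I_i p_i - \mu_k \bar{p}_k \right) \tag{3}$$
$$b_k = \bar{p}_k - a_k^{T} \mu_k \tag{4}$$
where $\Sigma_k$ is the 3×3 covariance matrix of $I$ in $\omega_k$, $U$ is a 3×3
identity matrix, $\epsilon$ is the regularization parameter, $\mu_k$ is the mean
of $I$ in $\omega_k$ and $\bar{p}_k$ is the mean of $p$ in $\omega_k$.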
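By manipulating Equations 1, 2, 3 and 4, it can be proven that the kernel
weights, written here for a gray-scale guide, are given by:
$$W_{ij}(I) = \frac{1}{|\omega|^2} \sum_{k:(i,j) \in \omega_k} \left( 1 + \frac{(I_i - \mu_k)(I_j - \mu_k)}{\sigma_k^2 + \epsilon} \right) \tag{5}$$
where $\mu_k$ and $\sigma_k^2$ are the mean and variance of $I$ in the window
$\omega_k$, $\epsilon$ is the regularization parameter and $|\omega|$ is the
number of pixels in $\omega_k$.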
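The local linear model can be made concrete with a minimal sketch of the
gray-scale guided filter of He et al. (2013) on 1-D signals. This is the scalar
case, in which the coefficient reduces to the covariance of guide and input
divided by the guide variance plus the regularization term. It is not the
trainable implementation by Wu et al. used in this work, and the function names
are hypothetical.

```python
def box_mean(values, radius):
    """Mean of values over a window of the given radius, clipped at the borders."""
    n = len(values)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def guided_filter_1d(guide, src, radius, eps):
    """Gray-scale guided filter: q_i = mean(a)_i * I_i + mean(b)_i."""
    mean_i = box_mean(guide, radius)
    mean_p = box_mean(src, radius)
    corr_ii = box_mean([g * g for g in guide], radius)
    corr_ip = box_mean([g * p for g, p in zip(guide, src)], radius)
    a, b = [], []
    for mi, mp, cii, cip in zip(mean_i, mean_p, corr_ii, corr_ip):
        var_i = cii - mi * mi    # variance of the guide in the window
        cov_ip = cip - mi * mp   # covariance of guide and input in the window
        a_k = cov_ip / (var_i + eps)
        a.append(a_k)
        b.append(mp - a_k * mi)
    # Each pixel belongs to several windows, so the coefficients are averaged.
    mean_a = box_mean(a, radius)
    mean_b = box_mean(b, radius)
    return [ma * g + mb for ma, g, mb in zip(mean_a, guide, mean_b)]


if __name__ == "__main__":
    step = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
    print(guided_filter_1d(step, step, radius=1, eps=1e-6))
```

When the guide contains an edge, the output keeps that edge instead of blurring
it, which is the structure-transferring behaviour exploited by the guided layers.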
For this architecture, the guided filter implementation provided by Wu et al.
(2018) was used to integrate the filter with CNNs, forming a deep guided
filtering network.
Another vital piece of the architecture is the Generative Adversarial Network
(GAN). It works by using a discriminator network, which is described in the next
section, to calculate the probability of images in a training batch being real. After
that, the sigmoid cross-entropy of these probabilities is computed and used as part
of the objective function.
The network model is inspired by GuidedNet from Gonçalves et al. (2018).
Similarly to GuidedNet, our generator uses guided filters to transfer structure
from the original acoustic image to the output satellite image. Also, dilated
convolutions, proposed by Yu and Koltun (2015), were used to increase the
receptive field dramatically. Therefore, the proposed network is capable of
capturing more context and structure than the original CNN used in Giacomo et
al. (2018).
Our general architecture is a U-Net based CNN, as in Ronneberger et al. (2015).
The network expects a 256 ×128 acoustic image as input. A schematic of our
generator architecture is presented in Figure 7.
Fig. 7: A diagram that describes the model of our generator network. Blocks
shown in the diagram include Sonar, Encode 2 to Encode 4 and Decode 1 to
Decode 3.
As displayed in Figure 7, the network possesses three types of high-level
layers, namely the encode, decode and guided layers. These layers are formulated
as follows:
–Encode is a layer starting with four dilated convolutions of kernel 3 × 3 with
dilation rates 1, 2, 4 and 8. The outputs of these convolutions are
concatenated, and a max pooling and a ReLU are applied. Finally, the layer
finishes with a batch normalization.
–Decode is a layer that starts with an up convolution, i.e., up-sampling
followed by a convolution of kernel size 4 × 4. Then, a dropout of rate 0.2 and a
ReLU are applied. Afterwards, a batch normalization step is employed. Finally,
the layer performs a skip connection with the feature map of the equivalent
encode layer.
–Guided is a layer starting with a convolution of kernel 3 × 3 followed by a
ReLU activation. Finally, a batch normalization step and a guided filter are
performed, using the input acoustic image as a guide.
After going through all the layers presented in Figure 7, the network applies
the Rectified Linear Unit (ReLU) activation function. Previously, Giacomo et al.
(2018) had used the hyperbolic tangent and concluded that it whitened the output
images.
In the end, the network produces a 256 × 128 satellite image from a given
acoustic image. Therefore, the model described constitutes a trainable end-to-end
solution to the sonar-to-satellite problem.
4.1.2 Loss function based on discriminative network
For the cost function of our architecture, a linear combination of three diﬀerent
losses was used.
Let $x$ be the acoustic image, $y$ the ground truth aerial image, $G$ the
generator, $D$ the discriminator and $\phi$ the VGG-16 neural network. Then,
the first loss, the L1 distance, is given by:
$$\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y}\left[ \lVert y - G(x) \rVert_1 \right] \tag{6}$$
The second loss function, the style reconstruction loss, proposed by Johnson
et al. (2016), requires the computation of the Gram matrix of the feature maps,
given by the following mathematical function:
$$G_j(x) = \frac{\psi \psi^{T}}{C_j H_j W_j} \tag{7}$$
where $\psi$ is $\phi_j(x)$ reshaped into a matrix of dimensions
$C_j \times H_j W_j$.
Then, the style reconstruction loss can be calculated by the squared Frobenius
norm of the difference between the Gram matrices of the output and target images,
as shown below:
$$\mathcal{L}_{style}(G) = \mathbb{E}_{x,y}\left[ \lVert G_j(G(x)) - G_j(y) \rVert_F^2 \right] \tag{8}$$
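The Gram matrix and the style reconstruction loss can be sketched in a few
lines. The sketch assumes that the feature maps of one VGG-16 layer have already
been extracted and flattened into one list of activations per channel, and it
normalizes the Gram matrix by the number of channels times the number of spatial
positions, as in Johnson et al. (2016). The function names are hypothetical.

```python
def gram_matrix(features):
    """features: C rows (channels), each a flat list of H*W activations."""
    channels, positions = len(features), len(features[0])
    norm = channels * positions  # C_j * H_j * W_j
    return [
        [sum(a * b for a, b in zip(row_i, row_j)) / norm for row_j in features]
        for row_i in features
    ]


def style_loss(output_features, target_features):
    """Squared Frobenius norm of the difference between two Gram matrices."""
    g_out = gram_matrix(output_features)
    g_tgt = gram_matrix(target_features)
    return sum(
        (o - t) ** 2
        for row_o, row_t in zip(g_out, g_tgt)
        for o, t in zip(row_o, row_t)
    )


if __name__ == "__main__":
    out_feats = [[1.0, 0.0], [0.0, 1.0]]  # two channels, two positions
    tgt_feats = [[0.0, 0.0], [0.0, 0.0]]
    print(gram_matrix(out_feats))
    print(style_loss(out_feats, tgt_feats))
```

Because the Gram matrix discards spatial positions, the loss compares which
features co-occur in the two images rather than where they occur.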
The third loss function is the cGAN loss, which can be expressed as follows:
$$\mathcal{L}_{GAN}(G, D) = \mathbb{E}_{x,y}\left[ \log D(x, y) \right] + \mathbb{E}_{x}\left[ \log\left( 1 - D(x, G(x)) \right) \right] \tag{9}$$
The final objective function is then derived by linearly combining Equations
6, 8 and 9. Each of these individual losses is weighted by a hyperparameter, as
follows:
$$\mathcal{L}(G, D) = \arg\min_G \max_D \; \lambda_1 \mathcal{L}_{GAN}(G, D) + \lambda_2 \mathcal{L}_{style}(G) + \lambda_3 \mathcal{L}_{L1}(G) \tag{10}$$
This equation is a minimax two-player game, where the generator attempts to
minimize the function and the discriminator attempts to maximize it.
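The way the three terms enter the two players' objectives can be sketched as
follows. The sketch assumes that the discriminator outputs probabilities, that
the adversarial term is computed with a cross-entropy, and that the generator
uses the common non-saturating form of the adversarial loss. The weights in
`lambdas` are placeholders, since the values used in this work are not stated
here, and the function names are hypothetical.

```python
import math


def bce(probabilities, label):
    """Mean cross-entropy of predicted probabilities against a single label."""
    eps = 1e-12  # avoids log(0)
    return -sum(
        label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps)
        for p in probabilities
    ) / len(probabilities)


def discriminator_loss(d_real, d_fake):
    """Minimizing this maximizes log D(x, y) + log(1 - D(x, G(x)))."""
    return bce(d_real, 1.0) + bce(d_fake, 0.0)


def generator_loss(d_fake, style, l1, lambdas=(1.0, 1.0, 100.0)):
    """Weighted sum of adversarial, style and L1 terms (placeholder weights)."""
    lam_gan, lam_style, lam_l1 = lambdas
    # Non-saturating form (an assumption): the generator wants D to label
    # its outputs as real.
    return lam_gan * bce(d_fake, 1.0) + lam_style * style + lam_l1 * l1


if __name__ == "__main__":
    print(discriminator_loss([0.9, 0.8], [0.2, 0.1]))
    print(generator_loss([0.2, 0.1], style=0.5, l1=0.05))
```

The discriminator only sees the adversarial term, while the generator is also
pulled towards the target image by the style and L1 terms.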
The input of the discriminator network is a concatenation of the input acous-
tic image with either the target satellite image or the generated satellite image
outputted by the generator network. On the other hand, the discriminator out-
puts a probability vector that estimates the chance of a given image in the batch
belonging to the training set, i.e., being real as understood by the discriminator.
The architecture of the discriminator network can be visualized in Figure 8.
This architecture uses some ideas from Radford et al. (2015), such as batch nor-
malization and strided convolutions in the discriminator.
Our discriminator consists of three convolutional steps followed by a flattening
and then a dense layer. Each convolution is applied with a 5 × 5 kernel size and
stride 2, and is followed by a batch normalization. Afterwards, the contents of
the feature maps are flattened and fed into a Multilayer Perceptron (MLP).
Finally, the network applies the sigmoid activation function to obtain the
probability of each image in the batch having come from the training data.
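The effect of the three strided convolutions on the spatial size of the feature
maps can be sketched as follows. The sketch assumes 'same' padding, which is not
stated in the paper, and a 256 × 128 input, and the function names are
hypothetical.

```python
import math


def conv_output_size(size, stride):
    """Spatial size after a strided convolution with 'same' padding."""
    return math.ceil(size / stride)


def discriminator_shapes(first=256, second=128, steps=3, stride=2):
    """Spatial sizes of the feature maps after each convolutional step."""
    shapes = [(first, second)]
    for _ in range(steps):
        first = conv_output_size(first, stride)
        second = conv_output_size(second, stride)
        shapes.append((first, second))
    return shapes


if __name__ == "__main__":
    for shape in discriminator_shapes():
        print(shape)
```

Under these assumptions each step halves both dimensions, so the flattening
operates on feature maps that are one eighth of the input size per side.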
Fig. 8: The schematic that represents the model of our discriminator network.
4.3 Optimization and Training
To train the network, one gradient descent step on the generator $G$ was
alternated with one step on the discriminator $D$. The weights were updated with
the Adam optimizer, introduced in Kingma and Ba (2014).
The networks² were implemented using the TensorFlow (TF) library.
An NVIDIA Titan X was used for the majority of the conducted experiments.
Training ran for exactly 100 epochs. The hyperparameters were those suggested
in Isola et al. (2017): a learning rate of 0.0002 and Adam momentum parameters
β1 = 0.5 and β2 = 0.999. Each epoch took approximately 5 minutes of training
on an NVIDIA Titan X or NVIDIA GTX 1080. After training, new acoustic images
can be processed at a rate of about 20 Hz.
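For reference, a single Adam update with the hyperparameters listed above can be
sketched for a scalar parameter. This follows the update rule of Kingma and Ba
(2014); the function name and the value of `eps` are assumptions, and the
experiments themselves rely on the optimizer provided by TensorFlow.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.0002, beta1=0.5, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    param -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


if __name__ == "__main__":
    param, m, v = 0.0, 0.0, 0.0
    for step in range(1, 4):
        param, m, v = adam_step(param, grad=1.0, m=m, v=v, t=step)
        print(step, param)
```

With a constant gradient, each step moves the parameter by roughly the learning
rate, independently of the gradient magnitude.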
5 Experimental Results
In this section, some results for the two datasets, ARACATI 2014 and 2017 will
be presented. Also, qualitative and quantitative analyses of the results will be
performed to identify the strengths and weaknesses of the proposed method.
5.1 ARACATI 2017
ARACATI 2017 was divided into two parts: 90% was used for training and the
remaining 10% was held out for evaluation. Figure 9 showcases some samples from
this held-out testing set that were used to evaluate the method.
² TensorFlow implementation of the model and the ARACATI 2017 dataset are
available for download at: https://github.com/giovgiac/son2sat.
Fig. 9: Results extracted when running our method on the testing set of the
ARACATI 2017 dataset. Columns, from left to right: acoustic image, output and
ground truth.
The testing set consisted of about 289 sonar images, which were propagated
through the trained generator. Afterwards, a few output images that highlighted
the strengths and weaknesses of the method were picked.
As is visible in Figure 9, the network manages to properly transfer structures
from the acoustic image to the corresponding satellite image.
From Figure 9, it can also be seen that the CNN encounters some issues when
dealing with the pier. These issues are likely caused by incorrect GPS and
compass data that remain in the dataset even after manual correction. However,
the network still manages to transfer the pier to the output image.
Methods                  MSE      PSNR      SSIM
Giacomo et al. (2018)    0.0176   18.9372   0.8310
Ours                     0.0122   21.8455   0.8213
Table 1: Quantitative results for the ARACATI 2017 dataset in three image quality
metrics.
Table 1 lays out the results of three image quality metrics, Mean Squared
Error (MSE), Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index
Measure (SSIM), for both the presented method and the one proposed in Giacomo
et al. (2018). The new method performs better in MSE and PSNR, but scores
slightly lower in SSIM (0.8213 versus 0.8310).
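Two of these metrics are simple enough to sketch directly. The sketch assumes
images given as flat lists of pixel values normalized to the range [0, 1]; the
function names are hypothetical and this is not the evaluation code used to
produce the tables.

```python
import math


def mse(output, target):
    """Mean squared error between two equally sized flat pixel lists."""
    return sum((o - t) ** 2 for o, t in zip(output, target)) / len(output)


def psnr(output, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    error = mse(output, target)
    if error == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / error)


if __name__ == "__main__":
    generated = [0.0, 0.5, 1.0, 0.25]
    reference = [0.1, 0.5, 0.9, 0.25]
    print(mse(generated, reference))
    print(psnr(generated, reference))
```

Lower MSE and higher PSNR both indicate a generated image that is closer,
pixel by pixel, to the ground truth.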
5.2 ARACATI 2014
ARACATI 2014 was used as a testing dataset where reliable data was available for
producing ground truth images. Satellite images were generated from a total of
839 acoustic images in a location that the network had never seen before. This
dataset was introduced to test whether the method generalizes when encountering
different scenarios. In particular, the acoustic images from the two datasets
are quite different, as ARACATI 2017 was captured with a forward distance of
50 m and ARACATI 2014 with 30 m.
Figure 10 presents a few samples that were chosen to highlight the performance
of the network in this dataset.
Fig. 10: Results extracted when running our method on the ARACATI 2014
dataset. Columns, from left to right: acoustic image, output and ground truth.
As observable in Figure 10, the network does not perform as well on the ARACATI
2014 dataset as it did on ARACATI 2017. However, the main strengths of the
method were maintained: the generator successfully transferred essential
features from the acoustic image to the satellite image. Also, the CNN can
capture contextual information from the images and take appropriate advantage
of it.
It is important to note that the network missed several of the piers from the
ARACATI 2014 dataset. However, that outcome is expected, since the acoustic
images have different forward distances. Also, the ARACATI 2014 dataset has
a much larger density of objects than the ARACATI 2017 dataset. Therefore, the
2014 images are significantly more cluttered.
Table 2 lays out the results of the same three image quality metrics for the
presented method and Giacomo et al. (2018). As observed in the ARACATI 2017
dataset, the proposed method scores lower in SSIM (0.6035 versus 0.6699), but
performs better in the other two metrics.
Methods                  MSE      PSNR      SSIM
Giacomo et al. (2018)    0.0404   14.1208   0.6699
Ours                     0.0316   15.0925   0.6035
Table 2: Quantitative results for the ARACATI 2014 dataset in three image quality
metrics.
Key features are successfully translated from the source image into the target
image. For example, the borders between water and land can be adequately
visualized in both the ARACATI 2014 and 2017 datasets. Also, the generator
manages to avoid pitfalls that could occur due to the noisy nature of sonar
images. However, some details, such as boats and piers, are often missed. With
all that considered, these results still reach our goal, i.e., adequately
transferring structure from a sonar image to a generated satellite image to
allow for easier image processing down the line.
6 Conclusion
In this paper, a novel method was introduced, which improves on the one
proposed in Giacomo et al. (2018), for generating satellite images from acoustic
images captured in the same region. Our proposal consists of a U-Net based CNN
generator, augmented with guided filters and dilated convolutions, trained
together with a DCGAN discriminator. We trained and validated the proposed
network with two real datasets, which were captured by underwater vehicles on
the coast of Brazil. We qualitatively and quantitatively analyzed the generated
results with samples from the testing sets of both datasets.
We believe our method can help facilitate traditionally difficult robotic tasks
like underwater localization and navigation. Using our proposed methodology,
AUVs can acquire acoustic images, convert them to satellite images and then use
that data to locate themselves or map the environment around them. Since
satellite images are easier to interpret, robots should be able to achieve
superior results in less time.
For future work, we intend to consider whether it is possible to use drones to
capture aerial images. If so, it would be beneficial to have drones and
underwater robots cooperate for effective localization. Finally, we want
to follow up on different applications that open up when successfully translating
acoustic images into aerial ones. These applications might include, for example,
underwater localization and navigation, among others.
Acknowledgements This research is partly supported by CNPq, CAPES and FAPERGS.
We would also like to thank the colleagues from NAUTEC-FURG for helping with the
experimental data and for productive discussions and meetings. Finally, we would
like to thank NVIDIA for donating high-performance graphics cards. All authors
are with NAUTEC, Intelligent Robotics and Automation Group, Universidade Federal
do Rio Grande - FURG, Rio Grande - Brazil.
7.1 Ethical Approval
7.2 Consent to Participate
7.3 Consent to Publish
7.4 Authors Contributions
–Giovanni G. De Giacomo: implementation and execution of the Deep Learning
experiments; writing of the manuscript.
–Matheus M. dos Santos: development of the dataset and associated tools;
helped writing the manuscript.
–Paulo L. J. Drews-Jr: theoretical support on the idea; revising the manuscript.
–Silvia S. C. Botelho: theoretical support on the idea; revising the manuscript.
7.5 Funding
This study was partly supported by the National Council for Scientific and
Technological Development (CNPq) and Coordenação de Aperfeiçoamento de Pessoal
de Nível Superior - Brasil (CAPES) - Finance Code 001. This paper is also a
contribution of the INCT-Mar COI funded by CNPq Grant Number 610012/2011-8.
7.6 Conﬂicts of Interest
The authors declare that they have no conﬂict of interest.
7.7 Availability of data and materials
TensorFlow implementation of the model and the ARACATI 2017 dataset are
available for download at: https://github.com/giovgiac/son2sat.
References
Deng X, Zhu Y, Newsam S (2018) What is it like down there? generating dense
ground-level views and image features from overhead imagery using conditional
generative adversarial networks. arXiv preprint arXiv:180605129
Dos Santos MM, De Giacomo GG, Drews P, Botelho SS (2019a) Satellite and
underwater sonar image matching using deep learning. In: 2019 Latin American
Robotics Symposium (LARS), 2019 Brazilian Symposium on Robotics (SBR)
and 2019 Workshop on Robotics in Education (WRE), IEEE, pp 109–114
Dos Santos MM, De Giacomo GG, Drews PL, Botelho SS (2019b) Underwater
sonar and aerial images data fusion for robot localization. In: 2019 19th Inter-
national Conference on Advanced Robotics (ICAR), IEEE, pp 578–583
Draper NR, Smith H (2014) Applied regression analysis, vol 326. John Wiley &
Giacomo G, Machado M, Drews P, Botelho S (2018) Sonar-to-satellite translation
using deep learning. In: 2018 17th IEEE International Conference on Machine
Learning and Applications (ICMLA), IEEE, pp 454–459
Gonçalves LT, de Oliveira Gaya JF, Junior PJLD, da Costa Botelho SS (2018)
Guidednet: Single image dehazing using an end-to-end convolutional neural net-
work. In: 2018 31st SIBGRAPI Conference on Graphics, Patterns and Images
(SIBGRAPI), IEEE, pp 79–86
He K, Sun J, Tang X (2013) Guided image ﬁltering. IEEE transactions on pattern
analysis and machine intelligence 35(6):1397–1409
Isola P, Zhu JY, Zhou T, Efros AA (2017) Image-to-image translation with condi-
tional adversarial networks. In: Proceedings of the IEEE conference on computer
vision and pattern recognition, pp 1125–1134
Johnson J, Alahi A, Fei-Fei L (2016) Perceptual losses for real-time style transfer
and super-resolution. In: European conference on computer vision, Springer, pp
Kim D, Walter MR (2017) Satellite image-based localization via learned embed-
dings. In: 2017 IEEE International Conference on Robotics and Automation
(ICRA), pp 2073–2080, DOI 10.1109/ICRA.2017.7989239
Kingma DP, Ba J (2014) Adam: A method for stochastic optimization. arXiv
Radford A, Metz L, Chintala S (2015) Unsupervised representation learn-
ing with deep convolutional generative adversarial networks. arXiv preprint
Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for
biomedical image segmentation. In: International Conference on Medical image
computing and computer-assisted intervention, Springer, pp 234–241
Silveira L, Guth F, Drews-Jr P, Ballester P, Machado M, Codevilla F, Duarte-
Filho N, Botelho S (2015) An open-source bio-inspired solution to underwater
slam. IFAC-PapersOnLine 48(2):212–217
Steﬀens C, Messias L, Drews-Jr P, Botelho S (2020) Cnn based image restoration:
Adjusting ill-exposed srgb images in post-processing. Journal of Intelligent &
Robotic Systems DOI 10.1007/s10846-019-01124-9
Steﬀens CR, Messias LRV, Drews-Jr P, Botelho SSdC (2019) Contrast enhance-
ment and image completion: A cnn based model to restore ill exposed images.
In: 2019 IEEE 17th International Conference on Industrial Informatics (INDIN),
IEEE, vol 1, pp 226–232
Tomasi C, Manduchi R (1998) Bilateral ﬁltering for gray and color images. In:
Iccv, vol 98, p 2
Viswanathan A, Pires BR, Huber D (2014) Vision based robot localization by
ground to satellite matching in gps-denied situations. In: 2014 IEEE/RSJ In-
ternational Conference on Intelligent Robots and Systems, pp 192–198, DOI
Wu H, Zheng S, Zhang J, Huang K (2018) Fast end-to-end trainable guided ﬁl-
ter. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp 1838–1847
Yu F, Koltun V (2015) Multi-scale context aggregation by dilated convolutions.
arXiv preprint arXiv:151107122