JINT manuscript No.
(will be inserted by the editor)
Guided Sonar-to-Satellite Translation
Giovanni G. De Giacomo · Matheus M. dos Santos · Paulo L. J. Drews-Jr · Silvia S. C. Botelho
Received: date / Accepted: date
Abstract Underwater navigation and localization are greatly enhanced by the use of
acoustic images. However, such images are difficult to interpret. In contrast, aerial
images are easier to interpret, but require Global Positioning System (GPS) sensors.
Due to absorption phenomena, GPS sensors are unavailable in underwater environments.
Thus, we propose a method to translate sonar images acquired underwater into an aerial
counterpart. This process is called sonar-to-satellite translation. To perform the
conversion, a U-Net based neural network is proposed, enhanced with state-of-the-art
techniques such as dilated convolutions and guided filters. Our approach is then
validated on two datasets containing sonar images and their satellite analogues.
Qualitative experimental results indicate that the proposed method can transfer
features from acoustic images to aerial images, generating satellite images that are
easier to interpret and visualize.
Keywords Deep Learning · Neural Networks · Robotics
PACS 68 Computer Science
Mathematics Subject Classification (2010) MSC 68T45 · MSC 68T40 · MSC 68T30
Giovanni G. De Giacomo (Corresponding Author)
Federal University of Rio Grande (FURG)
Computer Science Center
Av. Itália Km 8
Rio Grande, RS, Brazil
Tel.: +55 (53) 99951-9892
ORCID: 0000-0003-1670-2536
Matheus M. dos Santos, Paulo L. J. Drews-Jr, Silvia S. C. Botelho
Federal University of Rio Grande (FURG)
Computer Science Center
ORCID: 0000-0002-6036-8670, 0000-0002-7519-0502, 0000-0002-8857-0221
1 Introduction
Converting images into other images, even when dealing with different domains, is
an exciting problem that lies at the core of many Machine Learning applications.
Performing that conversion is known as image-to-image translation. A general
U-Net based Convolutional Neural Network (CNN) was proposed by Isola et al.
(2017) to solve it.
Water has physical properties that make it difficult to work with underwater
robots, such as Autonomous Underwater Vehicles (AUVs). The main difficulty is the
malfunction of light-based sensors, e.g. cameras and lasers, and of Global Positioning
System (GPS) sensors. These malfunctions happen because of the rapid attenuation
that electromagnetic waves undergo below water. Therefore, underwater robots
traditionally use sonar images as their preferred source of reliable input. However,
acoustic images require intensive processing to extract information, due to
phenomena such as noise. As a consequence, underwater robot localization and
navigation, as attempted in Dos Santos et al. (2019b,a), is an area that still
requires new and effective methods.
This paper proposes facilitating the interpretation of sonar images through
aerial images when operating in underwater environments. The proposal consists
of translating an acoustic image into a satellite counterpart. Giacomo et al.
defined sonar-to-satellite translation as the task of converting an acoustic image
into an aerial one. The objective is that satellite images generated by such methods
could be matched against authentic satellite images to locate a robot without
needing a GPS sensor. Figure 1 shows a schematic of the sonar-to-satellite
translation problem when solved by a CNN.
Fig. 1: Sonar-to-satellite translation is defined as the conversion of the acoustic
image on the left into the satellite image on the right. This diagram also shows a
CNN as a solution to the problem.
This work is an extension of Giacomo et al. (2018), which defined sonar-to-satellite
translation and presented results with a pix2pix CNN. The method is extended with a
new architecture that includes dilated convolutions from Yu and Koltun (2015);
Steffens et al. (2019, 2020) and guided filters from He et al. (2013) in the
generator, as well as more sophisticated loss functions, such as the style
reconstruction loss by Johnson et al. (2016). In addition, an entirely new dataset
was included for testing our model. These extensions are discussed further in
later sections.
In this paper, several experiments are conducted with two local datasets,
using a U-Net based neural network architecture augmented with guided layers
from Gonçalves et al. (2018) and with a conditional Generative Adversarial Network
(cGAN). After that, the visual quality of the results is verified through
side-by-side comparisons with the actual ground truth satellite images, as well as
by calculating image quality metrics. These ground truth images were obtained by
using the data from the GPS and magnetometer sensors of the underwater robot.
The goal of the research is to verify whether it is possible to diminish the
difficulty of acquiring aerial images when navigating underwater. This difficulty is
caused by the unreliability of GPS sensors in such environments. The proposed method
attempts to solve the issue by translating sonar images into satellite ones, using a
This paper is organized as follows: in the next section, we discuss the related
works; then, we introduce the two datasets used in the experiments and describe
their particularities; afterwards, we present the methodology used to attack the
sonar-to-satellite problem; subsequently, we present and discuss the experimental
results on the two datasets. Finally, we conclude and summarize our contributions.
2 Related Works
Other than Giacomo et al., using a CNN to extract an aerial image from a sonar
one is an unprecedented concept. Nevertheless, there are various related works
upon which this paper is based. Therefore, this section describes papers on CNNs,
image filtering and general neural network techniques that inspired our model and
research. Articles about locating vehicles on land using satellite images are also
discussed, since a similar idea is being proposed in this paper, but for
underwater domains.
Image-to-image translation is the task of translating one possible representation
of a scene into another, as defined in Isola et al. (2017). To solve this problem,
Isola et al. created a CNN architecture called pix2pix. This architecture is based
on the U-Net network for medical segmentation that was proposed in Ronneberger
et al. (2015). By adding a cGAN component to the network architecture and
providing general hyperparameters, Isola et al. provided a standardized approach to
solving image translation problems. In addition, Isola et al. performed various
experiments using their proposed methodology, working with varied datasets that
covered several problems within the domain of Computer Vision (CV). One of these
problems, the conversion of aerial images into charts, inspired the present work.
As one of the bases of our CNN architecture, the U-Net network, described in
Ronneberger et al. (2015), is an important related work. In their paper, Ronneberger
et al. relied on skip connections. A skip connection is a step in a neural network
where a feature map from an earlier layer is fed into a later layer; in
encoder-decoder architectures, it links an encoding layer to the corresponding
decoding layer. Due to their proven capacity to improve learning by returning
structures previously discarded by the network, skip connections have been widely
used in many applications and neural network architectures, including our own.
However, it is also important to mention that skip connections come with a large
memory footprint, since the feature maps from previous layers need to be stored.
Yu and Koltun introduced dilated convolutions that are useful for aggregating
contextual information without losing coverage or resolution. By making use of
dilated convolutions, one is able to expand the receptive field dramatically. There-
fore, the network is able to capture significantly higher amounts of context than
it would with traditional convolutional layers.
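As an illustration of how dilation enlarges the receptive field, the following is a minimal Python sketch on a 1-D signal. The function names are hypothetical and not part of any library, and the receptive-field formula assumes stride-1 convolutions stacked one after another, which is the setting discussed by Yu and Koltun; it is not meant to reproduce the exact layers used in this paper.

```python
def dilated_conv1d(signal, kernel, dilation):
    """'Valid' 1-D dilated convolution (cross-correlation form, as used in CNNs)."""
    span = (len(kernel) - 1) * dilation  # distance between first and last tap
    return [
        sum(w * signal[i + t * dilation] for t, w in enumerate(kernel))
        for i in range(len(signal) - span)
    ]


def receptive_field(kernel_size, dilations):
    """Receptive field of a stack of stride-1 dilated convolutions."""
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field


if __name__ == "__main__":
    signal = list(range(10))
    # Each output sums samples that are two positions apart.
    print(dilated_conv1d(signal, [1, 1, 1], dilation=2))  # [6, 9, 12, 15, 18, 21]
    print(receptive_field(3, [1, 1, 1, 1]))  # 9  (ordinary convolutions)
    print(receptive_field(3, [1, 2, 4, 8]))  # 31 (dilated convolutions)
```

With the same number of weights, the stack with dilation rates 1, 2, 4 and 8 covers 31 samples instead of 9. When dilated branches are applied in parallel rather than stacked, each branch with kernel size 3 and dilation d covers 2d + 1 samples on its own.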
He et al. proposed the guided filter, an image filter that outputs a locally
linear transform of the guidance image. As detailed in its original paper, the guided
filter has good edge-preserving properties and, unlike the bilateral filter from
Tomasi and Manduchi (1998), it does not suffer from gradient reversal artifacts.
Also, unlike the bilateral and simple linear translation-invariant (LTI) filters, the
guided filter can be used to transfer structure from the guidance image to the
output. Due to this structure-transferring property, He et al. (2013) envisioned
that it could be used in applications such as feathering, dehazing and high-quality
stereo matching. Wu et al. (2018) published and made available an implementation
of a fast end-to-end trainable guided filter.
Gonçalves et al. introduced GuidedNet, a model that uses a new neural network
layer, called the guided layer. Guided layers use guided filters as a way to transfer
structural information, which is partially lost due to convolutions, back into the
output of the neural network. Although the model was initially proposed for image
dehazing, it also works well for other tasks involving image generation or
restoration.
Viswanathan et al. localize a ground vehicle by using satellite images as a map.
The method creates a feature database by splitting the satellite image into a grid
and describing each cell. First, the ground-based panoramic images are warped
into a top-down view of the scene. Then, the view is described and used as a query
on the satellite database. Finally, a particle filter framework is integrated into
the solution to estimate the vehicle position and orientation during navigation. In
Viswanathan et al. (2014), the proposal is validated with experimental tests that
often show better position estimates than the GPS.
Kim and Walter proposed a ground localization method using a learned embedding
strategy. A CNN based on the Siamese architecture is used to extract
a 4096-dimensional feature vector able to match ground-level imagery with the
respective satellite view. These matches then serve as noisy observations of the
position and orientation of the vehicle. The observations are fed into a
particle filter that maintains a distribution over the pose during navigation.
Deng et al. proposed a method to generate ground level images from aerial
images. Combined with the method proposed in this paper, it would be possible
to produce an aerial image from an underwater acoustic image. Thereafter, a
ground level image can be generated, using the method by Deng et al., and used
for appropriate applications.
3 Datasets
To evaluate our model under different conditions and locations, two real-world
datasets were used: ARACATI 2014 and ARACATI 2017. Both datasets were captured
in the Yacht Club of Rio Grande, Brazil. However, they were obtained in different
places inside the Yacht Club, as well as in different years. Therefore, the acoustic
images contained in these datasets are substantially different, as can be seen in
Figure 2.
Fig. 2: 2a shows an example of an acoustic image from the ARACATI 2014 dataset. 2b
presents a sonar image from the ARACATI 2017 dataset. 2c displays a satellite
image highlighting the places where the images 2a and 2b were captured. Satellite
images from Google ©, Digital Globe © 08-06-2017, 32°01'30.1"S 52°06'24.1"W.
Both datasets were recorded by a mini Remotely Operated Vehicle (ROV)
Seabotix LBV-300 equipped with a Teledyne BlueView P900-130 Multibeam Forward
Looking Sonar (MFLS), a magnetic compass and a SOUTH S82T Differential Global
Positioning System (DGPS). A floating board is attached to the vehicle so that it
follows the vehicle and remains on the surface of the water during the trajectory.
The DGPS is installed on top of the floating board and records the 2D vehicle
position with high precision.
3.1 ARACATI 2014
This dataset was first published in the work of Silveira et al. (2015). In 75 minutes,
the vehicle travelled a total of 802 meters, acquiring 10232 images. Figure 3a shows
the path travelled by the robot. The MFLS was configured to cover a range of
30 meters. Because of the location and the path travelled by the vehicle, the main
feature of this dataset is the dense presence of structures such as piers and boats.
3.2 ARACATI 2017
As previously mentioned, this dataset was collected at one of the harbors of the
Yacht Club of Rio Grande, Brazil, by an underwater robot. The MFLS was configured
to cover a range of 50 meters. Figure 3b shows the path travelled by the
vehicle, alongside further information regarding the length of the voyage. In
total, 24676 images were captured in 77 minutes. Unlike ARACATI 2014, the main
characteristic of this dataset is the sparse presence of structures, which occupy a
smaller area of the images because of the increased coverage range of the sonar.
Fig. 3: Robot path of the adopted datasets. 3a shows the path of ARACATI
2014. 3b shows the path of ARACATI 2017. Satellite images from Google ©,
Digital Globe © 08-06-2017, (a) 32°01'33.7"S 52°06'30.7"W (b) 32°01'30.1"S
Fig. 4: Diagram of the data preprocessing workflow that generates the ground truth
data. Satellite images from Google ©, Digital Globe © 08-06-2017, 32°01'30.1"S
3.3 Data Preprocessing
To create the training dataset, a satellite image of the Yacht Club provided by
Google Earth was used. The satellite image is automatically cropped considering
the position from the DGPS, the heading from the magnetic compass and the coverage
field of each acoustic image¹, as shown in Figure 4.
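The geometry behind this automatic cropping can be sketched as follows. This is only a minimal Python illustration under stated assumptions, not the tool used to build the dataset: the function name `crop_corners`, the rectangular footprint, and the image convention (x grows east, y grows south, heading measured clockwise from north) are all hypothetical choices. The vehicle position is assumed to be already converted from DGPS coordinates to pixel coordinates of the satellite image.

```python
import math


def crop_corners(x_px, y_px, heading_deg, width_px, height_px):
    """Corners of a crop rectangle whose bottom-centre sits at the vehicle.

    Assumed convention: x grows to the right (east), y grows downward (south),
    and the heading is measured clockwise from north.
    """
    theta = math.radians(heading_deg)
    # Unit vectors in pixel coordinates: 'forward' points along the heading,
    # 'right' is perpendicular to it (to the right-hand side of the vehicle).
    forward = (math.sin(theta), -math.cos(theta))
    right = (math.cos(theta), math.sin(theta))
    half_w = width_px / 2.0
    offsets = (
        (0.0, -half_w),        # near-left corner
        (0.0, half_w),         # near-right corner
        (height_px, half_w),   # far-right corner
        (height_px, -half_w),  # far-left corner
    )
    return [
        (
            x_px + along * forward[0] + across * right[0],
            y_px + along * forward[1] + across * right[1],
        )
        for along, across in offsets
    ]


if __name__ == "__main__":
    # Vehicle at pixel (100, 200) heading north, with a 40 x 50 pixel footprint.
    for corner in crop_corners(100, 200, 0, 40, 50):
        print(corner)
```

The four corners could then be passed to any image library that supports perspective or affine warping in order to extract an axis-aligned crop. An error in the compass heading rotates the whole footprint around the vehicle position.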
Fig. 5: Manual compass correction tool. Each satellite image is manually rotated
using mouse commands until it correctly matches the corresponding sonar image.
After initial processing, problems were discovered with the compass data. Some
cropped satellite images did not correctly match the sonar images, which hindered
the learning process of the neural network. To fix this problem, a tool
was developed for manual correction of the compass data. Figure 5 displays the
interface of the tool.
A fixed offset is not enough to solve the misalignment problem for all images,
because the compass is affected by magnetic interference from ship hulls or even
from the vehicle motor. Therefore, each image had to be manually corrected.
An image selection criterion was adopted: only images with a time-stamp
difference lower than 0.13 seconds in the DGPS and compass data were selected. This
procedure resulted in 2894 images. Since the DGPS, compass and sonar have
different acquisition rates, the criterion ensures that the most synchronized
data are selected. Figure 6 shows that the selected images partially cover the
entire dataset.
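The selection criterion can be sketched in a few lines of Python. This is a minimal illustration, not the code used to build the dataset: the function names are hypothetical, time stamps are assumed to be sorted lists of seconds, and the 0.13 second threshold is interpreted here as the maximum allowed gap between a sonar frame and its nearest DGPS and compass readings, which is one possible reading of the criterion.

```python
from bisect import bisect_left


def nearest(sorted_times, t):
    """Return the entry of a non-empty sorted list that is closest to t."""
    i = bisect_left(sorted_times, t)
    candidates = sorted_times[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda c: abs(c - t))


def select_synchronized(sonar_times, dgps_times, compass_times, max_diff=0.13):
    """Keep sonar frames whose nearest DGPS and compass stamps are close enough."""
    selected = []
    for t in sonar_times:
        g = nearest(dgps_times, t)
        c = nearest(compass_times, t)
        if abs(g - t) < max_diff and abs(c - t) < max_diff:
            selected.append((t, g, c))
    return selected


if __name__ == "__main__":
    sonar = [0.0, 1.0, 2.0]
    dgps = [0.05, 1.5, 2.1]
    compass = [0.1, 0.9, 2.2]
    # Only the first frame has both sensors within 0.13 s.
    print(select_synchronized(sonar, dgps, compass))  # [(0.0, 0.05, 0.1)]
```

Because the binary search is logarithmic in the number of readings, the selection scales well to tens of thousands of frames.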
After preprocessing of the datasets, the ARACATI 2017 dataset contained 2894
pairs of acoustic and ground truth satellite images that were used for training
purposes. On the other hand, the ARACATI 2014 dataset contained 839 pairs of
acoustic and ground truth satellite images used exclusively for testing purposes.
¹ A video showcasing the cropping of the dataset is available at
Fig. 6: The position of the 2894 selected images that were manually corrected.
Image provided by Google ©, Digital Globe © 08-06-2017, 32°01'30.1"S
4 Methodology
The formal definition of sonar-to-satellite translation is a function
$G: \mathbb{R}^{H \times W} \rightarrow \mathbb{R}^{C \times H \times W}$, where $G$ is,
in this case, a generative CNN and $C$, $H$, $W$ are the depth, height and width of
the satellite image, respectively.
To attack the sonar-to-satellite problem, this section proposes a trainable
end-to-end CNN, using state-of-the-art techniques from the Deep Learning literature.
A generator and a discriminator network operate jointly to build up the
architecture. The generator is a custom U-Net architecture, making use of encoding
and decoding layers, as well as skip connections. Additionally, the generator uses
trainable end-to-end guided filters to transfer structure from acoustic images to
aerial ones. The discriminator, in turn, is a Deep Convolutional Generative
Adversarial Network (DCGAN) discriminator and exists only for training purposes,
i.e., it is not used during evaluation. Both networks and their details are
outlined in this section.
4.1 Generator
4.1.1 U-Net based network augmented with guided filter
One of the most critical pieces of the proposed architecture is the guided filter
from He et al. (2013). Since the guided filter is a general linear translation-variant
filter, the following equation describes its output at a pixel i:
$$q_i = \sum_j W_{ij}(I)\, p_j. \tag{1}$$
In this equation, $p_j$ is the input pixel, $W_{ij}$ is the filter kernel, $I$ is the
guide and $q_i$ is the output pixel.
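As defined in He et al. (2013), the following function defines the guided filter
for color images:
$$q_i = a_k^T I_i + b_k, \quad \forall i \in \omega_k, \tag{2}$$
where $I_i$ is a $3 \times 1$ color vector, $\omega_k$ is a window centered in pixel
$k$, $a_k$ is a $3 \times 1$ vector of coefficients, and $q_i$ and $b_k$ are scalars.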
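By minimizing a linear ridge regression model, as in Draper and Smith (2014),
the coefficients for the local linear model can be defined as follows:
$$a_k = (\Sigma_k + \epsilon U)^{-1} \left( \frac{1}{|\omega|} \sum_{i \in \omega_k} I_i p_i - \mu_k \bar{p}_k \right), \tag{3}$$
$$b_k = \bar{p}_k - a_k^T \mu_k, \tag{4}$$
where $\Sigma_k$ is the $3 \times 3$ covariance matrix of $I$ in $\omega_k$, $U$ is a
$3 \times 3$ identity matrix, $\bar{p}_k$ is the mean of $p$ in $\omega_k$ and
$\epsilon$ is a regularization parameter.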
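By manipulating Equations 1, 2, 3 and 4, it can be proven that the kernel
weights are given by:
$$W_{ij}(I) = \frac{1}{|\omega|^2} \sum_{k:(i,j) \in \omega_k} \left( 1 + \frac{(I_i - \mu_k)(I_j - \mu_k)}{\sigma_k^2 + \epsilon} \right), \tag{5}$$
where $\mu_k$ and $\sigma_k^2$ are the mean and variance of $I$ in the filter and
$|\omega|$ is the number of pixels in $\omega_k$.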
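To make the local linear model more concrete, the following is a minimal Python sketch of the guided filter for a single-channel guide on a 1-D signal, following the box-filter formulation of He et al. (2013). It is an illustration only: it is not the trainable implementation by Wu et al., it uses a scalar guide instead of the color guide of Equations 2 to 4, and the function names, window radius and `eps` value are hypothetical choices.

```python
def box_mean(values, radius):
    """Mean over a sliding window of the given radius, clipped at the borders."""
    n = len(values)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def guided_filter_1d(guide, src, radius, eps):
    """Guided filter with a scalar guide: q_i = mean(a)_i * I_i + mean(b)_i."""
    mean_i = box_mean(guide, radius)
    mean_p = box_mean(src, radius)
    corr_ii = box_mean([g * g for g in guide], radius)
    corr_ip = box_mean([g * p for g, p in zip(guide, src)], radius)
    a, b = [], []
    for mi, mp, cii, cip in zip(mean_i, mean_p, corr_ii, corr_ip):
        var_i = cii - mi * mi          # variance of the guide in the window
        cov_ip = cip - mi * mp         # covariance of guide and input
        a_k = cov_ip / (var_i + eps)   # scalar counterpart of the coefficient a_k
        a.append(a_k)
        b.append(mp - a_k * mi)        # scalar counterpart of the coefficient b_k
    mean_a = box_mean(a, radius)
    mean_b = box_mean(b, radius)
    return [ma * g + mb for ma, g, mb in zip(mean_a, guide, mean_b)]


if __name__ == "__main__":
    step = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
    # Using the signal as its own guide keeps the step edge sharp.
    print([round(q, 3) for q in guided_filter_1d(step, step, radius=1, eps=1e-4)])
```

In flat regions the guide variance is close to zero, so the coefficient `a_k` vanishes and the output becomes a local average; near strong edges `a_k` approaches one and the structure of the guide is carried over to the output.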
For this architecture, an implementation of the guided filter provided by Wu
et al. was used to integrate the filter with CNNs to form deep guided filtering
Another vital piece of the architecture is the Generative Adversarial Network
(GAN). It works by using a discriminator network, which is described in the next
section, to calculate the probability of images in a training batch being real. After
that, the sigmoid cross-entropy of these probabilities is computed and used as part
of the objective function.
The network model is inspired by GuidedNet from Gonçalves et al. (2018).
Similarly to GuidedNet, our generator uses guided filters as a way to transfer
structure from the original acoustic image to the output satellite image. Also,
dilated convolutions, proposed by Yu and Koltun, were used to increase the receptive
field dramatically. Therefore, the proposed network is capable of capturing more
context and structure than the original CNN used in Giacomo et al. (2018).
Our general architecture is a U-Net based CNN, as in Ronneberger et al. (2015).
The network expects a 256 ×128 acoustic image as input. A schematic of our
generator architecture is presented in Figure 7.
Fig. 7: A diagram that describes the model of our generator network. Blocks shown:
Sonar (input), Encode 1 to Encode 4, Decode 1 to Decode 3, Guided 1 and Guided 2.
As displayed in Figure 7, the network possesses three high-level layers, namely
the encode, decode and guided layers. These layers are formulated as follows:
– Encode is a layer starting with four dilated convolutions of kernel 3 × 3 with
dilation rates 1, 2, 4 and 8. Then, the outputs of these convolutions are
concatenated, and a max pooling and a ReLU are applied. Finally, the layer finishes
with a batch normalization step.
– Decode is the layer that starts with an up convolution, i.e., up-sampling
followed by a convolution of kernel size 4 × 4. Then, a dropout of rate 0.2 and a
ReLU are applied. Afterwards, a batch normalization step is employed. Finally,
the layer performs a skip connection with the feature map of the equivalent
encoding layer.
– Guided is a layer starting with a convolution of kernel 3 × 3 followed by a
ReLU activation. Finally, a batch normalization step and a guided filter are
performed using the input acoustic image as a guide.
After going through all the layers presented in Figure 7, the network applies
the Rectified Linear Unit (ReLU) activation function. Previously, Giacomo et al.
(2018) had used the hyperbolic tangent and concluded that it whitened the output
In the end, the network will produce a 256 ×128 satellite image from a given
acoustic image. Therefore, the model described constitutes a trainable end-to-end
solution to the sonar-to-satellite problem.
4.1.2 Loss function based on discriminative network
For the cost function of our architecture, a linear combination of three different
losses was used.
Let $x$ be the acoustic image, $y$ the ground truth aerial image, $G$ the
generator, $D$ the discriminator and $\phi$ the VGG-16 neural network. Then,
the first loss, the L1 distance, is given by:
$$\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y}\left[\, |y - G(x)| \,\right]. \tag{6}$$
The second loss function, the style reconstruction loss proposed by Johnson
et al. (2016), requires the computation of the Gram matrix of the feature maps,
given by the following mathematical function:
$$G_j^{\phi}(x) = \frac{\psi \psi^T}{C_j H_j W_j}, \tag{7}$$
where $\psi$ is $\phi_j(x)$ reshaped into a matrix of dimensions $C_j \times H_j W_j$.
Then, the style reconstruction loss can be calculated as the squared Frobenius
norm of the difference between the Gram matrices of the output and target images,
as shown below:
$$\mathcal{L}_{style}(G) = \mathbb{E}_{x,y}\left[ \left\| G_j^{\phi}(G(x)) - G_j^{\phi}(y) \right\|_F^2 \right]. \tag{8}$$
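The third loss function is the cGAN loss, which can be expressed as follows:
$$\mathcal{L}_{GAN}(G, D) = \mathbb{E}_{x,y}\left[ \log D(x, y) \right] + \mathbb{E}_{x}\left[ \log \left( 1 - D(x, G(x)) \right) \right]. \tag{9}$$
The final objective function is then derived by linearly combining Equations
6, 8 and 9. Each of these individual losses is weighted by a hyperparameter, as
follows:
$$\mathcal{L}(G, D) = \arg\min_G \max_D \; \lambda_1 \mathcal{L}_{GAN}(G, D) + \lambda_2 \mathcal{L}_{style}(G) + \lambda_3 \mathcal{L}_{L1}(G). \tag{10}$$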
This equation is a minimax two-player game, where the generator attempts to
minimize the function and the discriminator to maximize it.
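The three terms of the objective can be illustrated with a small Python sketch that operates on plain lists instead of tensors. It is a toy under stated assumptions: images and feature maps are flattened lists of floats, the Gram matrix is normalized by $C_j H_j W_j$ as in Johnson et al. (2016), expectations are replaced by a single sample, and the default weights `lam1`, `lam2` and `lam3` are placeholders, since the values of the weighting hyperparameters are not given in this text.

```python
import math


def l1_loss(target, output):
    """Mean absolute difference between two equally sized flat images."""
    return sum(abs(t - o) for t, o in zip(target, output)) / len(target)


def gram_matrix(features, height, width):
    """Gram matrix of C feature maps, each given as a flat list of H*W values."""
    norm = len(features) * height * width
    return [
        [sum(a * b for a, b in zip(fi, fj)) / norm for fj in features]
        for fi in features
    ]


def style_loss(feat_output, feat_target, height, width):
    """Squared Frobenius norm of the difference between two Gram matrices."""
    g_out = gram_matrix(feat_output, height, width)
    g_tgt = gram_matrix(feat_target, height, width)
    return sum(
        (a - b) ** 2
        for row_out, row_tgt in zip(g_out, g_tgt)
        for a, b in zip(row_out, row_tgt)
    )


def gan_loss(d_real, d_fake, tiny=1e-12):
    """log D(x, y) + log(1 - D(x, G(x))) for a single sample."""
    return math.log(d_real + tiny) + math.log(1.0 - d_fake + tiny)


def total_loss(l_gan, l_style, l_l1, lam1=1.0, lam2=1.0, lam3=1.0):
    """Weighted sum of the three terms; the weights are placeholder values."""
    return lam1 * l_gan + lam2 * l_style + lam3 * l_l1


if __name__ == "__main__":
    features = [[1.0, 2.0], [3.0, 4.0]]  # two feature maps of size 1 x 2
    print(gram_matrix(features, 1, 2))   # [[1.25, 2.75], [2.75, 6.25]]
    print(l1_loss([1.0, 2.0, 3.0], [1.0, 1.0, 5.0]))  # 1.0
```

The discriminator tries to increase `gan_loss`, while the generator tries to decrease the weighted sum, which is what makes the objective a two-player game.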
4.2 Discriminator
The input of the discriminator network is a concatenation of the input acoustic
image with either the target satellite image or the satellite image generated
by the generator network. The discriminator outputs a probability vector that
estimates the chance of each image in the batch belonging to the training set,
i.e., being real as understood by the discriminator.
The architecture of the discriminator network can be visualized in Figure 8.
This architecture uses some ideas from Radford et al. (2015), such as batch nor-
malization and strided convolutions in the discriminator.
Our discriminator consists of three convolutional steps followed by a flattening
step and then a dense layer. Each convolution is applied with a 5 × 5 kernel size
and stride 2, and is followed by a batch normalization. Afterwards, the contents of
the feature maps are flattened and fed into a Multilayer Perceptron (MLP). Finally,
the network applies the sigmoid activation function to obtain the probability of
each image in the batch having come from the training data.
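The effect of the three strided convolutions on the spatial size of the feature maps can be traced with a short Python sketch. The padding mode is not stated in the text, so the sketch assumes 'same' padding (output size equal to the input size divided by the stride, rounded up); the function names and the choice of a 256 × 128 input are illustrative.

```python
import math


def conv_output_size(size, kernel, stride, padding="same"):
    """Spatial output size of a convolution along one dimension."""
    if padding == "same":
        return math.ceil(size / stride)
    return (size - kernel) // stride + 1  # 'valid' padding


def discriminator_shapes(shape=(256, 128), steps=3, kernel=5, stride=2):
    """Feature-map sizes after each strided convolution, input included."""
    shapes = [shape]
    for _ in range(steps):
        shape = tuple(conv_output_size(s, kernel, stride) for s in shape)
        shapes.append(shape)
    return shapes


if __name__ == "__main__":
    print(discriminator_shapes())
    # [(256, 128), (128, 64), (64, 32), (32, 16)]
```

Under this assumption, the dense layer would receive 32 × 16 values per channel of the last convolution after flattening.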
Fig. 8: The schematic that represents the model of our discriminator network.
4.3 Optimization and Training
To train the network, one gradient descent step on the generator $G$ and then one
step on the discriminator $D$ were alternated. For updating the weights, the Adam
optimizer, introduced in Kingma and Ba (2014), was used.
The networks² were implemented using the TensorFlow (TF) library.
An NVIDIA Titan X was used for the majority of the conducted experiments.
Training ran for 100 epochs. The hyperparameters were those suggested
in Isola et al. (2017): a learning rate of 0.0002 and Adam momentum parameters of
$\beta_1 = 0.5$ and $\beta_2 = 0.999$. Each epoch took approximately 5 minutes of
training on an NVIDIA Titan X or NVIDIA GTX 1080. After training, new acoustic
images can be evaluated at a frequency of about 20 Hz.
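For reference, a single Adam update with these hyperparameters can be written in a few lines of Python. This is a scalar sketch of the update rule from Kingma and Ba (2014), not the TensorFlow optimizer used in the experiments; the function name, the `eps` value and the toy objective are assumptions made for illustration.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.0002, beta1=0.5, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad         # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad  # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)               # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


if __name__ == "__main__":
    # Toy objective f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
    w, m, v = 0.0, 0.0, 0.0
    for t in range(1, 1001):
        w, m, v = adam_step(w, 2.0 * (w - 3.0), m, v, t)
    print(w)  # moves from 0 towards 3 in steps of roughly the learning rate
```

In the alternating scheme described above, the generator and the discriminator would each keep their own moment estimates `m` and `v`.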
5 Experimental Results
In this section, some results for the two datasets, ARACATI 2014 and 2017 will
be presented. Also, qualitative and quantitative analyses of the results will be
performed to identify the strengths and weaknesses of the proposed method.
5.1 ARACATI 2017
ARACATI 2017 was divided into two parts: 90% to be used for training purposes
and 10% for validation. Figure 9 showcases some samples from the testing set that
were used to evaluate the method.
² A TensorFlow implementation of the model and the ARACATI 2017 dataset are available
for download at:
Fig. 9: Results extracted when running our method on the testing set of the
ARACATI 2017 dataset. Columns, from left to right: Acoustic, Output, Ground Truth.
The testing set consisted of about 289 sonar images propagated through the
trained generator. Afterwards, a few output images that highlighted the strengths
and weaknesses of the method were picked.
As is visible in Figure 9, the network manages to properly transfer structures
from the acoustic image to the corresponding satellite image.
Figure 9 also shows that the CNN encounters some issues when dealing with the
pier. These issues are likely caused by incorrect GPS and compass data that remain
in the dataset even after manual correction. However, the network still manages to
transfer the pier, leading to impressive results.
Method                  MSE      PSNR      SSIM
Giacomo et al. (2018)   0.0176   18.9372   0.8310
Ours                    0.0122   21.8455   0.8213
Table 1: Quantitative results for the ARACATI 2017 dataset in three image quality
metrics.
Table 1 lays out the results of three image quality metrics, Mean Squared
Error (MSE), Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index
Measure (SSIM), for both the presented method and the one proposed in Giacomo
et al. (2018). The new method performs better in MSE (0.0122 against 0.0176) and
PSNR (21.8455 against 18.9372), but is slightly worse in SSIM (0.8213 against
0.8310).
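For readers unfamiliar with these metrics, the first two can be computed as in the minimal Python sketch below. It assumes images flattened to lists of pixel values in the range [0, 1] and uses hypothetical function names; how the values in the table were aggregated over the test images is not stated here, so the sketch only shows the per-image definitions. SSIM is omitted because it requires windowed local statistics.

```python
import math


def mse(reference, estimate):
    """Mean squared error between two equally sized flat images."""
    return sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in decibels; infinite for identical images."""
    error = mse(reference, estimate)
    if error == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / error)


if __name__ == "__main__":
    truth = [0.0, 0.0, 0.0, 0.0]
    output = [0.1, 0.1, 0.1, 0.1]
    print(mse(truth, output))   # approximately 0.01
    print(psnr(truth, output))  # approximately 20 dB
```

Lower MSE and higher PSNR both indicate a generated image that is closer to the ground truth pixel by pixel, whereas SSIM compares local structure.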
5.2 ARACATI 2014
ARACATI 2014 was used as a testing dataset, since reliable data was available for
producing ground truth images. Therefore, satellite images were extracted from
a total of 839 acoustic images in a location that the network had never seen
before. This dataset was introduced to test whether the method would generalize
when encountering different scenarios. In particular, the acoustic images from these
two datasets are quite different, as ARACATI 2017 was captured with a forward
distance of 50 m and ARACATI 2014 with 30 m.
Figure 10 presents a few samples that were chosen to highlight the performance
of the network in this dataset.
Fig. 10: Results extracted when running our method on the ARACATI 2014
dataset. Columns, from left to right: Acoustic, Output, Ground Truth.
As observable in Figure 10, the network does not perform as well on the
ARACATI 2014 dataset as it did on ARACATI 2017. However, the main strengths of the
method were maintained: the generator successfully transferred essential features
from the acoustic image to the satellite image. Also, the CNN can capture
contextual information from the images and take appropriate advantage of it.
It is important to note that the network missed several of the piers from the
ARACATI 2014 dataset. However, that outcome is expected, since the acoustic
images have different forward distances. Also, the ARACATI 2014 dataset has
a much larger density of objects when compared to the ARACATI 2017 dataset.
Therefore, the 2014 images are significantly more cluttered.
Table 2 once again lays out the results of the three image quality metrics for the
presented method and Giacomo et al. (2018). As observed in the ARACATI 2017
dataset, the proposed method is worse in SSIM (0.6035 against 0.6699), but
performs better in the other two metrics.
Method                  MSE      PSNR      SSIM
Giacomo et al. (2018)   0.0404   14.1208   0.6699
Ours                    0.0316   15.0925   0.6035
Table 2: Quantitative results for the ARACATI 2014 dataset in three image quality
metrics.
Key features are successfully translated from the source image into the target
image. For example, it is possible to adequately visualize the borders between
water and land in both the ARACATI 2014 and 2017 datasets. Also, the generator
manages to avoid pitfalls that could occur due to the noisy nature of sonar
images. However, some details, such as boats and piers, are often missed. With all
that considered, these results still succeed in reaching our goal,
i.e., adequately transferring structure from a sonar image to a generated satellite
image to allow for easier image processing down the line.
6 Conclusions
In this paper, a novel method was introduced, which improves the
one proposed in Giacomo et al. (2018), for acquiring satellite images from given
acoustic images that were captured in the same region. Our proposal consists of
using a U-Net based CNN augmented with guided filters and dilated convolutions
to train a generator neural network attached to a DCGAN discriminator. Also,
we trained and validated our proposed network with two real datasets, which were
captured by underwater vehicles on the coast of Brazil. We qualitatively and
quantitatively analyzed the generated results with samples from the testing sets of
the datasets.
We believe our method can help facilitate traditionally difficult robotic tasks
like underwater localization and navigation. Using our proposed methodology,
AUVs can acquire acoustic images, convert them to satellite images and then use
that data to locate themselves or map the environment around them. Since satellite
images are easier to interpret, robots should be able to achieve superior
results in less time.
For future work, we intend to consider whether it is possible to use drones to
capture aerial images. If so, it would be beneficial to have drones and underwater
robots cooperate for effective localization. Finally, we want
to follow up on different applications that open up when successfully translating
acoustic images into aerial ones. These applications might include, for example,
underwater localization and navigation, among others.
Acknowledgements This research is partly supported by CNPq, CAPES and FAPERGS.
We also would like to thank the colleagues from NAUTEC-FURG for helping with the
experimental data and for productive discussions and meetings. Finally, we would like
to thank NVIDIA for donating high-performance graphics cards. All authors are with
NAUTEC, Intelligent Robotics and Automation Group, Universidade Federal do Rio
Grande - FURG, Rio Grande - Brazil.
7 Declarations
7.1 Ethical Approval
Not applicable.
7.2 Consent to Participate
Not applicable.
7.3 Consent to Publish
Not applicable.
7.4 Authors' Contributions
Giovanni G. De Giacomo: implementation and execution of the Deep Learning
experiments; writing of the manuscript.
Matheus M. dos Santos: development of the dataset and associated tools;
helped write the manuscript.
Paulo L. J. Drews-Jr: theoretical support on the idea; revising the manuscript.
Silvia S. C. Botelho: theoretical support on the idea; revising the manuscript.
7.5 Funding
This study was partly supported by the National Council for Scientific and
Technological Development (CNPq) and Coordenação de Aperfeiçoamento de Pessoal
de Nível Superior - Brasil (CAPES) - Finance Code 001. This paper is also a
contribution of the INCT-Mar COI funded by CNPq Grant Number 610012/2011-8.
7.6 Conflicts of Interest
The authors declare that they have no conflict of interest.
7.7 Availability of data and materials
TensorFlow implementation of the model and the ARACATI 2017 dataset are
available for download at:
References
Deng X, Zhu Y, Newsam S (2018) What is it like down there? generating dense
ground-level views and image features from overhead imagery using conditional
generative adversarial networks. arXiv preprint arXiv:1806.05129
Dos Santos MM, De Giacomo GG, Drews P, Botelho SS (2019a) Satellite and
underwater sonar image matching using deep learning. In: 2019 Latin American
Robotics Symposium (LARS), 2019 Brazilian Symposium on Robotics (SBR)
and 2019 Workshop on Robotics in Education (WRE), IEEE, pp 109–114
Dos Santos MM, De Giacomo GG, Drews PL, Botelho SS (2019b) Underwater
sonar and aerial images data fusion for robot localization. In: 2019 19th Inter-
national Conference on Advanced Robotics (ICAR), IEEE, pp 578–583
Draper NR, Smith H (2014) Applied regression analysis, vol 326. John Wiley &
Giacomo G, Machado M, Drews P, Botelho S (2018) Sonar-to-satellite translation
using deep learning. In: 2018 17th IEEE International Conference on Machine
Learning and Applications (ICMLA), IEEE, pp 454–459
Gonçalves LT, de Oliveira Gaya JF, Junior PJLD, da Costa Botelho SS (2018)
Guidednet: Single image dehazing using an end-to-end convolutional neural net-
work. In: 2018 31st SIBGRAPI Conference on Graphics, Patterns and Images
(SIBGRAPI), IEEE, pp 79–86
He K, Sun J, Tang X (2013) Guided image filtering. IEEE transactions on pattern
analysis and machine intelligence 35(6):1397–1409
Isola P, Zhu JY, Zhou T, Efros AA (2017) Image-to-image translation with condi-
tional adversarial networks. In: Proceedings of the IEEE conference on computer
vision and pattern recognition, pp 1125–1134
Johnson J, Alahi A, Fei-Fei L (2016) Perceptual losses for real-time style transfer
and super-resolution. In: European conference on computer vision, Springer, pp
Kim D, Walter MR (2017) Satellite image-based localization via learned embed-
dings. In: 2017 IEEE International Conference on Robotics and Automation
(ICRA), pp 2073–2080, DOI 10.1109/ICRA.2017.7989239
Kingma DP, Ba J (2014) Adam: A method for stochastic optimization. arXiv
preprint arXiv:1412.6980
Radford A, Metz L, Chintala S (2015) Unsupervised representation learn-
ing with deep convolutional generative adversarial networks. arXiv preprint
Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for
biomedical image segmentation. In: International Conference on Medical image
computing and computer-assisted intervention, Springer, pp 234–241
Silveira L, Guth F, Drews-Jr P, Ballester P, Machado M, Codevilla F, Duarte-
Filho N, Botelho S (2015) An open-source bio-inspired solution to underwater
slam. IFAC-PapersOnLine 48(2):212–217
Steffens C, Messias L, Drews-Jr P, Botelho S (2020) Cnn based image restoration:
Adjusting ill-exposed srgb images in post-processing. Journal of Intelligent &
Robotic Systems DOI 10.1007/s10846-019-01124-9
Steffens CR, Messias LRV, Drews-Jr P, Botelho SSdC (2019) Contrast enhance-
ment and image completion: A cnn based model to restore ill exposed images.
In: 2019 IEEE 17th International Conference on Industrial Informatics (INDIN),
IEEE, vol 1, pp 226–232
Tomasi C, Manduchi R (1998) Bilateral filtering for gray and color images. In:
ICCV, vol 98, p 2
Viswanathan A, Pires BR, Huber D (2014) Vision based robot localization by
ground to satellite matching in gps-denied situations. In: 2014 IEEE/RSJ In-
ternational Conference on Intelligent Robots and Systems, pp 192–198, DOI
Wu H, Zheng S, Zhang J, Huang K (2018) Fast end-to-end trainable guided fil-
ter. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp 1838–1847
Yu F, Koltun V (2015) Multi-scale context aggregation by dilated convolutions.
arXiv preprint arXiv:1511.07122