
Sonification of 3D Object Shape for Sensory Substitution: An Empirical Exploration


Abstract

An empirical investigation is presented of different approaches to sonification of 3D objects as part of a sensory substitution system. The system takes 3D point clouds of objects obtained from a depth camera and presents them to a user as spatial audio. Two approaches to shape sonification are presented and their characteristics investigated. The first approach directly encodes the contours belonging to the object in the image as sound waveforms. The second approach categorizes the object according to its 3D surface properties as encapsulated in the rotation-invariant Fast Point Feature Histogram (FPFH), and each category is represented by a different synthesized musical instrument. Object identification experiments are done with human users to evaluate the ability of each encoding to transmit object identity to a user. Each approach has its disadvantages. Although the FPFH approach is more invariant to object pose and contains more information about the object, it lacks generality because of the intermediate recognition step. On the other hand, since the contour-based approach has no information about the depth and curvature of objects, it fails to distinguish different objects with similar silhouettes. On the task of distinguishing between 10 different 3D shapes, the FPFH approach produced more accurate responses. However, because it is a direct encoding, the contour-based approach is more likely to scale up to a wider variety of shapes.
Torkan Gholamalizadeh, Hossein Pourghaemi, Ahmad Mhaish,
Gökhan İnce and Damien Jade Duff
Faculty of Computer & Informatics Engineering
Istanbul Technical University
Keywords: sensory substitution; sensory augmentation; point
clouds; depth cameras; sound synthesis.
Sensory substitution is the use of technology to replace
one sensory modality with another. In visual-to-audio sensory
substitution, visual information captured by a camera is pre-
sented to users as sound. Such systems promise help for the
sight-impaired: imagine users navigating using space/obstacle
information, grasping novel objects, eating meals with utensils,
and so forth. Because sensory substitution avoids the trap, common
to many artificial intelligence-based assistive systems, of
aggressively abstracting the data provided to users, user agency
is preserved and the user’s own advanced cognitive data processing
capabilities are leveraged. Sensory substitution systems also
provide interesting platforms for exploring synaesthesia and
cross-modal sensory processing [1].
Recent work in utilizing depth cameras for sensory sub-
stitution promises to increase the usefulness of such visual-
to-audio sensory substitution systems [2][3]. Mhaish et al.’s
system [2] uses a 3D depth camera to create point clouds
characterizing the surfaces of objects in a scene and presents
those surfaces to a user using spatial audio. See Figure 1 for a
summary of the information flow in that approach. The present
work extends that system, offering an investigation of different
ways of encoding 3D spatial surfaces as audio, an area ripe
for exploration in the context of sensory substitution systems.
Figure 1. The flow of data in the full sensory substitution system, from
real-world objects, via the depth camera, to a point cloud, and a segmented
tracked object, and finally a sound waveform played to a user.
Figure 2. Left: the system being used “in the wild”. Right: physical set-up
of the experiments described in this paper.
Broadly speaking, the process of encoding information as
(non-speech) sounds is called sonification. Sonification is used
in applications such as medical imaging (for example, to
differentiate a healthy brain from an unhealthy one),
geological activity detection, and so forth. In the present paper,
we present two approaches to sonification of object shape.
Object shape is particularly important in providing functional
information about objects, particularly for blind users who
may wish to perceive objects in their environment in order
to recognize them, avoid them, or manipulate them. The first
approach to shape sonification described in the present paper
is encoding based on 2D object contours and the second is
based on a 3D object recognition descriptor called the Fast
Point Feature Histogram.
The remainder of this paper is organized in three sections. In
the next section, the sensory substitution system is briefly
overviewed and the two sound generation approaches are
explained in detail. In the third section, the experiments and
their results are described. Finally, the results are
discussed and future work is considered.
Sensory substitution systems are systems that map visual
information to audio in an attempt to create an effect like
vision but channeled through a different sense. More broadly,
systems that map any kind of information (including visual
and graphical) to audio are called sonification systems. In
general, there are two kinds of sonification systems, high-
level and low-level sonification systems, where the high-level
approaches are designed to convert information to speech.
A significant proportion of these systems are text-to-speech
applications, widely used for visually impaired people. Exam-
ples include VoiceOver and JAWS [1]. In addition to text-to-
speech applications there are some other high-level sonification
systems which are more complex and can detect objects and
identify them and return their names in real time, like LookTel
[4] and the Microsoft Seeing AI project [5], which can read text,
describe people and identify their emotions. These high-level
sonification systems are easy to use and do not require
training, but they can fail in sonifying complex environments
or shapes for which the system has not been adapted. On
the other hand, low-level sonification systems generate sound
directly based on visual information. The main difference
between these systems and high-level sonification systems is
that users need to be trained before using these systems to be
able to understand the relation between the generated sounds
and properties of observed objects. Though these kinds of
systems can seem difficult to use, they can be more flexible for
new environments and undefined objects because they produce
sounds based on characteristics directly calculated from input
data [6]. One of the most well-known systems of this group
is the sensory substitution system The vOICe [1], which uses
the gray-level image of the scene and scans the image from
left to right and generates and sums audible frequencies based
on pixels’ location with amplitude based on pixels’ intensity.
One disadvantage of The vOICe is that it requires 1
second to scan the image. Further, the image-sound mapping
is somewhat abstract if used with depth images and does
not explore physical or metaphorical synergies with shape in
particular. In contrast, our proposed system is conceived as a
system for generating spatial audio based on surface
and shape information, helping users to localize objects and
identify them in real time. Systems closest to our own include
the electro-tactile stereo-based navigation system ENVS of [7]
with ten channels of depth information calculated from stereo
transmitted to ten fingers, which focuses on navigation but
not shape understanding and uses the tactile pathway, and
the depth-camera visual-to-audio based sensory substitution
system See ColOr of [3] which, though using depth-cameras,
concentrates on bringing color (and not space or shape) to
blind users by mapping different intervals of hue and value to
instruments like violin, trumpet, piano etc. Conversely, finding
a proper method for mapping shape information to audio is a
vital step in many low-level sonification systems. In this area,
the work of Shelley et al. [8] is close to the proposed system,
focusing on sonification of shape and curvature of 3D objects
in an augmented reality environment as part of the SATIN
project, where the user of the system is able to touch and alter
the 3D objects using the visual-haptic interface of the system.
In that article, object cross sections (and associated curvature)
are used to modulate the frequency of a carrier signal or the
parameters of physical sound generation [8].
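The left-to-right scanning scheme attributed above to The vOICe can be illustrated with a minimal sketch. This is not the implementation of The vOICe: the function name, the sample rate, the frequency range and the linear spacing of frequencies across rows are all assumptions chosen for illustration. Only the overall idea (one column at a time, frequency from pixel location, amplitude from pixel intensity, one second per scan) follows the description in the text.

```python
import math

def scan_image_to_audio(image, sample_rate=8000, scan_seconds=1.0,
                        f_low=500.0, f_high=5000.0):
    """Left-to-right scan: each image column becomes a short sum of sinusoids.

    image: list of rows (top row first), each a list of grey levels in [0, 1].
    sample_rate, f_low, f_high and the linear row-to-frequency spacing are
    illustrative assumptions, not parameters taken from The vOICe.
    """
    n_rows, n_cols = len(image), len(image[0])
    samples_per_col = int(sample_rate * scan_seconds / n_cols)
    # Rows nearer the top of the image are given higher frequencies.
    freqs = [f_high - (f_high - f_low) * r / max(n_rows - 1, 1)
             for r in range(n_rows)]
    audio = []
    for c in range(n_cols):
        for k in range(samples_per_col):
            t = (c * samples_per_col + k) / sample_rate
            s = sum(image[r][c] * math.sin(2 * math.pi * freqs[r] * t)
                    for r in range(n_rows))
            audio.append(s / n_rows)  # dividing by n_rows keeps |sample| <= 1
    return audio
```

With the default one-second scan, a scene can be refreshed at most once per second, which is the latency limitation noted above.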
As discussed above, choosing a good approach to sonification
plays an important role in achieving good performance
of low-level sonification systems. Therefore, in the current
work, two different sound generation methods are provided for
Mhaish et al.’s [2] system and their accuracies are measured
on the task of synthetic 3D object identification. The idea of
using synthetic objects instead of real objects is to evaluate the
performance of different sound generation methods isolated
from the performance of other components and environmental
noise. In future work, the best approach or mix of these
approaches will be applied in the identification of real objects.
Output from a head-mounted depth camera (DepthSense
325 or ASUS Xtion) is converted to a head-centred point cloud,
which is segmented by curvature and point-distance in real-
time [9] into surface primitives. These surface primitives are
tracked using simple data association, selected using size and
closeness criteria, and presented to the user as spatially-located
audio (played using a wrapper around the spatial audio library
OpenAL [10], the wrapper taking care of time tracking and
Heavy use of the Point Cloud Library [11] is made in
the point-cloud processing steps, and particular care is taken
to keep processing of point clouds at 15+ frames per second
so as to provide responsive sensory feedback to user probing
motions. An illustration of the system being tested can be
found in the left picture of Figure 2.
Note that this system is designed to segment surface
primitives rather than objects. Although for some applications,
such as tabletop object manipulation, short-cuts can be taken
to extract separate objects, general object segmentation is
an unsolved problem. Since the current paper is focused on
sonification (making sounds to represent data), the focus here
is on the sonification of whole but mostly simple objects.
Figure 3. Software level of data flow diagram.
Before going into detail about the approaches used to
sonify shape, the processing steps used by the system to extract
visual information and process it to audio will be explained.
Figure 3 shows the data flow architecture of the system. The
steps in the architecture are further explained as follows:
1) Preprocessing: RGB and depth information produced
by the time-of-flight or structured-light camera is passed to a
preprocessing step in the form of a point cloud. In this step,
normals are calculated from the point cloud, which is “organized”
as a 2D array of points, using the real-time integral image
algorithm of Holzer et al. [12].
Figure 4. Rotary-contour-based encoding. Left: original object contour in
x-y image space. Right: resulting waveform as a plot of amplitude (A)
against time (t).
Figure 5. Vertical-contour-based encoding. Left: original object contour.
Right: the resulting waveform as amplitude (A) against time (t).
Figure 6. A simplified diagram of the direct encoding of an object contour
as sound, showing the carrier and modulator signals.
2) Segmentation: The 2D organized point cloud is then
segmented by the method of Trevor et al. [9] and segments
are obtained characterized by slowly changing surface normal
vectors and no intervening gaps. Further processing can be
applied to find and remove tabletop surfaces for tabletop
3) Feature Extraction: Information to characterize the ac-
quired segments is extracted. In the current work, contour-
based and FPFH-based approaches are explored.
4) Sound Generation: In the system presented by Mhaish
et al. [2], a simple sonification approach was proposed based
on a conversion of principal object dimensions to frequencies.
A circular buffer is used to create, update and interpolate sound
waves and their envelopes and the rate at which frames are
arriving is estimated in order to send the appropriate number
of samples to the OpenAL spatial sound system. In the current
system, FPFH signatures are converted via a recognition step to
different instruments from the STK simulation toolkit [13] and
object contours are converted via interpolation and modulation
data processing steps to sound waves.
The present paper focuses on the feature extraction and
sound generation steps, proposing the contour- and FPFH-
based approaches, explained in the next sections.
5) Sound player: The sound is played using the OpenAL
library, which is supplied with as many samples as necessary from
the filled circular buffer and generates spatial audio based
on binaural cues or, alternatively, Head-Related Transfer
Functions (HRTFs).
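The buffering described in steps 4 and 5 might be sketched as follows. This is a minimal illustration under stated assumptions, not the system's actual wrapper: the class name `SampleRingBuffer`, the helper `samples_per_frame` and the overwrite-oldest policy are hypothetical, and the interpolation of waveforms and envelopes and all OpenAL calls are omitted.

```python
class SampleRingBuffer:
    """Fixed-capacity circular buffer of audio samples (illustrative only)."""

    def __init__(self, capacity):
        self._data = [0.0] * capacity
        self._capacity = capacity
        self._read = 0    # index of the oldest unread sample
        self._count = 0   # number of unread samples

    def write(self, samples):
        """Append samples, overwriting the oldest unread ones when full."""
        for s in samples:
            write_pos = (self._read + self._count) % self._capacity
            self._data[write_pos] = s
            if self._count < self._capacity:
                self._count += 1
            else:
                # Buffer was full: the oldest sample has just been overwritten.
                self._read = (self._read + 1) % self._capacity

    def read(self, n):
        """Remove and return up to n of the oldest unread samples."""
        n = min(n, self._count)
        out = [self._data[(self._read + i) % self._capacity] for i in range(n)]
        self._read = (self._read + n) % self._capacity
        self._count -= n
        return out


def samples_per_frame(sample_rate_hz, estimated_fps):
    """Number of audio samples needed to cover one camera frame interval."""
    return int(round(sample_rate_hz / estimated_fps))
```

For example, at the 44100 Hz sample rate of Table II and an estimated 15 frames per second, `samples_per_frame(44100, 15)` gives 2940 samples to hand to the audio back end per frame.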
A. Contour-based sonification
In the contour-based encoding, object contours are trans-
lated directly into auditory waveforms, and frequency and
amplitude modulations of waveforms.
In the variation on this idea tested in this paper, the rotary-
contour is extracted from the object and used to generate a
carrier signal. In the rotary-contour-based encoding, a path is
traced out around the contour of the object and the distance of
each contour point from the object’s horizontal axis (defined
by the centroid of the points in the object) becomes an
instantaneous amplitude in the sound waveform (normalized to
fit within the range of acceptable sample amplitudes). Spher-
ical or circular objects thus translate perfectly into sinusoidal
waveforms. For instance, the object on the left side of Figure
4 becomes the waveform plot (amplitude vs time) on the right
hand side, with radial distances converted into instantaneous
amplitudes which are then potentially interpolated.
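A minimal sketch of the rotary-contour idea follows. It assumes that the "distance from the horizontal axis" is taken as a signed vertical offset from the horizontal line through the centroid, which is what makes a circle map to a sinusoid; that reading, the function names and the test contour are assumptions made for illustration, and contour extraction and interpolation are omitted.

```python
import math

def rotary_contour_waveform(contour, max_amplitude=1.0):
    """Map an ordered closed contour to one period of a waveform.

    contour: list of (x, y) points traced once around the silhouette.
    Each point contributes one sample: its signed vertical offset from the
    horizontal line through the centroid, scaled so that the largest offset
    has magnitude max_amplitude.
    """
    cy = sum(y for _, y in contour) / len(contour)
    offsets = [y - cy for _, y in contour]
    peak = max(abs(o) for o in offsets) or 1.0  # avoid dividing by zero
    return [max_amplitude * o / peak for o in offsets]


def circle_contour(n_points, radius=50.0, centre=(160.0, 120.0)):
    """Hypothetical test contour: n_points evenly spaced on a circle."""
    cx, cy = centre
    return [(cx + radius * math.cos(2 * math.pi * k / n_points),
             cy + radius * math.sin(2 * math.pi * k / n_points))
            for k in range(n_points)]
```

Under the 1:1 pixel-sample ratio listed in Table II, a contour of N points played at 44100 samples per second would repeat 44100/N times per second, so longer contours would give lower-pitched carriers.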
Because the signal waveform depends on the object con-
tour, some timbre properties also depend on the object contour.
The carrier signal is then modulated at a slower (consciously
perceivable) time-scale using frequency and amplitude modu-
lation by another time-varying function which we call here
vertical-contour-based encoding. In this kind of encoding,
which is illustrated in Figure 5, the top to bottom scanned
width of the object silhouette is converted to the amplitude
of a modulating signal which is then applied to the carrier
signal as frequency and amplitude modulation. Thus, multiple
perceptual channels are used to transfer information to the user.
For a sketch of the signal processing flow used to generate the
resulting waveform, see Figure 6.
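The modulation stage can be sketched in the same spirit. The exact modulation formulas are not given in the text, so everything below is an assumption: the modulator is the row width normalized to [0, 1], each row is held for `samples_per_row` output samples (one reading of the 600:1 ratio in Table II), frequency modulation is done by varying the read speed through a one-period carrier table, and amplitude modulation by a gain between 0 and 1.

```python
def modulated_signal(carrier_period, widths, samples_per_row=600,
                     fm_depth=1.0, am_depth=1.0):
    """Apply width-driven frequency and amplitude modulation to a carrier.

    carrier_period: one period of the carrier waveform (list of samples).
    widths: silhouette widths in pixels, scanned from top to bottom.
    The mapping from width to read speed and gain is an illustrative choice.
    """
    peak_width = max(widths) or 1
    table_len = len(carrier_period)
    phase = 0.0
    out = []
    for w in widths:
        m = w / peak_width                      # modulator value in [0, 1]
        step = 1.0 + fm_depth * (m - 0.5)       # table positions per sample
        gain = (1.0 - am_depth) + am_depth * m  # amplitude scaling in [0, 1]
        for _ in range(samples_per_row):
            i = int(phase)
            frac = phase - i
            a = carrier_period[i % table_len]
            b = carrier_period[(i + 1) % table_len]
            out.append(gain * (a + frac * (b - a)))  # linear interpolation
            phase = (phase + step) % table_len
    return out
```

With these choices, wide parts of the silhouette sound louder and higher in pitch than narrow parts, so the top-to-bottom width profile is heard as a slow sweep on top of the contour-derived timbre.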
The contour-based approach is motivated both by the
conceptual clarity of the mapping and by the fact that
sounds already arise as vibrations in objects and spaces, and
travel through the objects, reflecting in the resulting waveforms
the shape and size of these spaces; thus, the method, depending
on the exact encoding used, has an analogue in the physics of
real sound generation and consequently natural synergies with
B. FPFH-based sonification
The FPFH is a feature extracted from point clouds or point
cloud parts, designed for representing information about the
shape of the cloud that is invariant to rotation. It is comparable
to a histogram of curvatures measured in different ways across
the object surface.
FPFH is a 33-bin histogram extracted from the points
and normals in the point cloud. This histogram counts 3
different curvature measures with 11 bins for each measure.
The relative position and surface-normal vector of each point
are processed and the bin into which the point falls for each
of the 3 dimensions is incremented [14]. Note: the FPFH was not
originally designed as a full object descriptor but it has proved
sufficient for current purposes: other more or less viewpoint-
invariant or object-global descriptors can also be easily adapted
to this purpose. Examples of FPFH descriptors extracted from
point clouds used in experiments in this paper can be found
in Figure 7. As can be seen in the figure, different shapes
generally correspond to different histograms, and different sizes
and scales generally do not affect the histograms radically.
However, the external contours do not always affect the result,
as can be seen by comparing the bottom view of the cone and
the side view of the cube.
Figure 7. Sample point cloud views with normalized histogram shapes
(FPFH). Top row: teapot. Middle row: cube. Bottom row: cone. Left:
baseline view. Middle: a different object size. Right: a different view.
Table I. Mapping from object to instrument.
Object                    Instrument
Teapot (Tp)               Shakers
Cube (Cb)                 Struck Bow
Cuboid (Cd)               Drawn Bow
Cylinder (Cl)             High Flute
Cone (Cn)                 Plucked String
Ellipsoid (El)            Hammond-style Organ
Icosahedron (Ic)          Saxophone
Stretched Cylinder (SC)   Low Flute
Sphere (Sp)               Clarinet
Torus (Ts)                Sitar
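The 3 × 11 binning described above can be illustrated with a simplified sketch. It is not the full FPFH of [14]: it accumulates the three angular pair features over all point pairs of a small cloud, with no neighbourhood radius, no weighting of neighbouring histograms and no normalization. The feature definitions follow one common convention for the pair frame, and all function names are hypothetical.

```python
import math

BINS = 11  # bins per feature; 3 features give a 33-bin histogram

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _bin(value, lo, hi):
    """Index of the bin containing value, clamped to the valid range."""
    idx = int((value - lo) / (hi - lo) * BINS)
    return min(max(idx, 0), BINS - 1)

def pair_features(p_s, n_s, p_t, n_t):
    """Three angular features relating two oriented points (unit normals)."""
    d = _sub(p_t, p_s)
    dist = math.sqrt(_dot(d, d))
    if dist == 0.0:
        return None
    d = (d[0] / dist, d[1] / dist, d[2] / dist)
    u = n_s
    v = _cross(u, d)
    v_len = math.sqrt(_dot(v, v))
    if v_len < 1e-12:          # normal parallel to the connecting line
        return None
    v = (v[0] / v_len, v[1] / v_len, v[2] / v_len)
    w = _cross(u, v)
    alpha = _dot(v, n_t)                             # in [-1, 1]
    phi = _dot(u, d)                                 # in [-1, 1]
    theta = math.atan2(_dot(w, n_t), _dot(u, n_t))   # in [-pi, pi]
    return alpha, phi, theta

def simplified_histogram(points, normals):
    """33-bin histogram of pair features over all ordered point pairs."""
    hist = [0] * (3 * BINS)
    for i in range(len(points)):
        for j in range(len(points)):
            if i == j:
                continue
            f = pair_features(points[i], normals[i], points[j], normals[j])
            if f is None:
                continue
            alpha, phi, theta = f
            hist[_bin(alpha, -1.0, 1.0)] += 1
            hist[BINS + _bin(phi, -1.0, 1.0)] += 1
            hist[2 * BINS + _bin(theta, -math.pi, math.pi)] += 1
    return hist
```

For a flat patch with identical normals, all three features are zero for every pair, so each 11-bin block has a single occupied central bin. This is consistent with the observation above that views showing only a flat surface are hard to tell apart from their histograms alone.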
After an object is encoded using FPFH, a database of
existing FPFH descriptors is searched (using an indexing KD-
tree) for the closest descriptor and the resulting object label
retrieved. A mapping (Table I) is provided from object label
to instrument type and the relevant instrument is synthesized
using the Synthesis ToolKit (STK) [13].
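The lookup from descriptor to instrument can be sketched as a nearest-neighbour query followed by a table lookup. The instrument table below is Table I; everything else is an assumption for illustration: the small pure-Python KD-tree stands in for whatever indexing structure the system uses, the function names are hypothetical, and the STK synthesis call itself is not shown.

```python
# Table I of this paper: object label -> instrument.
INSTRUMENT_FOR = {
    "Teapot": "Shakers", "Cube": "Struck Bow", "Cuboid": "Drawn Bow",
    "Cylinder": "High Flute", "Cone": "Plucked String",
    "Ellipsoid": "Hammond-style Organ", "Icosahedron": "Saxophone",
    "Stretched Cylinder": "Low Flute", "Sphere": "Clarinet", "Torus": "Sitar",
}

def sq_dist(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_kdtree(items, depth=0):
    """items: list of (descriptor, label). Returns nested tuples or None."""
    if not items:
        return None
    axis = depth % len(items[0][0])
    items = sorted(items, key=lambda it: it[0][axis])
    mid = len(items) // 2
    return (items[mid], axis,
            build_kdtree(items[:mid], depth + 1),
            build_kdtree(items[mid + 1:], depth + 1))

def nearest(node, query, best=None):
    """Return (squared distance, label) of the stored descriptor closest to query."""
    if node is None:
        return best
    (desc, label), axis, left, right = node
    d = sq_dist(desc, query)
    if best is None or d < best[0]:
        best = (d, label)
    diff = query[axis] - desc[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, query, best)
    if diff * diff < best[0]:   # the other side may still hold a closer point
        best = nearest(far, query, best)
    return best

def instrument_for_descriptor(tree, descriptor):
    """Label of the closest database entry and its instrument from Table I."""
    _, label = nearest(tree, descriptor)
    return label, INSTRUMENT_FOR[label]
```

Because the sound depends only on the retrieved label, any recognition error is passed on to the user unchanged as the wrong instrument.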
3D object recognition techniques are attractive for the
current application since the field of robotic vision has well-
established approaches, and many descriptors are available for
representing shape, having rotational invariance built in for
example [14]. Moreover, synthetic instrument models provide
highly discriminable sounds, which can support a sound-object
mapping approach to the task under consideration.
To evaluate the ability of the encodings discussed above
to transmit shape as sound, the ability of users to identify
objects under changing conditions was investigated. For a
clear evaluation of the relationship between shape and sound,
point clouds presented to the user were based on point-cloud
samplings of views of the ten object meshes shown in Figure
8. Performance of the proposed encodings was measured
by conducting two experiments. In the first experiment, the
location from which objects are viewed was varied among five
different equidistant viewpoints, illustrated in Figure 9. In
the second experiment, five different scales of objects were
presented. Scale here stands for either size or viewing distance,
but only in the context of the encodings used in the present
paper: not all point cloud encodings will confuse size and
distance. The five sizes used are shown in Figure 10.
Figure 8. The set of objects used in experiments. Top: Teapot (Tp), Cube
(Cb), Cuboid (Cd), Cylinder (Cl), Cone (Cn). Bottom: Ellipsoid (El),
Icosahedron (Ic), Stretched Cylinder (SC), Sphere (Sp), Torus (Ts).
Figure 9. The set of poses used in the pose-varying experiment. Top: the
object model as seen from different viewpoints. Bottom: the point cloud
resulting from each viewpoint. Colour in the point cloud represents
normalized distance from the camera of the points.
These two experiments were chosen because viewpoint and scale
are the parameters that vary most in the wild. Other possible
parameters include lighting conditions, although our cameras
use active lighting, and material properties, although these
depend on the particular choice of depth-sensing device, to
which our approach is designed to be mostly agnostic.
A. Training session
Each experiment comprises two conditions, the FPFH
condition and the contour-based condition, presented to the
individuals in a random order. For each condition, an indepen-
dent training session was conducted. A single training session
takes 15 minutes, including 2-3 minutes for describing the
principles of the system followed by a free experimentation
period. During the training sessions participants were given
the ability to play sounds for all five different viewpoints of
all ten objects for the viewpoint experiment and for all five
scales for the scale/size experiment. Users were allowed to
play with the system, choosing objects and viewpoints/sizes
from the training set and playing the sounds as well as viewing
a 3D visual representation from that viewpoint/size, and they
were asked to remember the sounds related to each object.
Figure 10. The set of object sizes used in the size-varying experiment.
Table II. Parameters of the system used in experiments.
Parameter                                          Value
General parameters
  Input point cloud size (width × height)          320 × 240
  Object distance to camera (metres)               3
  Sound generation: sample rate (Hz)               44100
  Sound generation: bits per sample                16
Contour-based approach
  Pixel-sample ratio: rotary-contour carrier       1:1
  Pixel-sample ratio: vertical-contour modulator   600:1
  Peak frequency deviation proportion: modulator   1.0
  Peak amplitude deviation proportion: modulator   1.0
FPFH-based approach
  FPFH database size (num object views)            100
  FPFH database size (num objects)                 10
B. Experimental session
Experiments were conducted with 16 non-disabled par-
ticipants with an average age of 24 years, divided into two
groups with each group containing both male and female
participants. Participants of each group performed only either
the viewpoint or the scale experiment. Both sound generation
approaches were tested with each participant. In experimental
sessions, sounds of randomly selected objects with randomly
selected viewpoints/sizes from the training set were presented
and participants were asked to identify the objects. In these
sessions, for each approach 30 trials were conducted with each
participant and the participant was informed after each trial
whether the answer was correct; when the answer was
wrong, the experimenter informed the participant of the actual
object identity. Answers were recorded in confusion matrices.
In these experiments, users were not supposed to guess the
viewpoints or scales; they were asked only to identify the
objects based on the sound that they were hearing. The physical
set-up of the experiment can be found on the right of Figure
2. Parameters of the system used in experiments are shown in
Table II.
C. Results
The complementary properties of the two methods tested
can be observed in the confusion matrices in Tables III and
IV. In these matrices, numbers in cells represent how often
(as a percentage of that object's trials) the object in the row
header was identified by participants as the object in the
column header. Zero values are left blank. The inability of the
contour-based approach to take account of the depth information
in the interior of an object leads it to confuse objects with
similar silhouettes, as can be seen in the tables.
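The bookkeeping behind such confusion matrices can be sketched as follows. The object abbreviations are those used in this paper; the function names and the trial data in the usage note are hypothetical and are not experimental results.

```python
LABELS = ["Tp", "Cb", "Cd", "Cl", "Cn", "El", "Ic", "SC", "Sp", "Ts"]

def confusion_matrix(trials):
    """trials: iterable of (presented object, participant response) pairs."""
    counts = {actual: {resp: 0 for resp in LABELS} for actual in LABELS}
    for actual, response in trials:
        counts[actual][response] += 1
    return counts

def row_percentages(counts):
    """Convert each row of counts to percentages of that object's trials."""
    result = {}
    for actual, row in counts.items():
        total = sum(row.values())
        result[actual] = {resp: (100.0 * n / total if total else 0.0)
                          for resp, n in row.items()}
    return result

def overall_accuracy(counts):
    """Fraction of all trials in which the response matched the object."""
    total = sum(sum(row.values()) for row in counts.values())
    correct = sum(counts[label][label] for label in LABELS)
    return correct / total if total else 0.0
```

For instance, the made-up trials `[("Cb", "Cb"), ("Cb", "Cd"), ("Cb", "Cb"), ("Tp", "Tp")]` give an overall accuracy of 0.75, with two thirds of the cube trials on the diagonal and one third in the cuboid column.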
With an overall accuracy of 60% and 57% on the two
experiments vs. 36% and 42% for the contour-based method,
the FPFH approach performed better (verified with χ² tests,
which are applicable when class sizes are balanced; 1 d.o.f.,
p = 0.01). However, the FPFH-based approach was still unable
to account for the contour on the silhouette of an object,
leading it to confuse objects with similar visible curvatures.
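A comparison of two accuracies with a χ² test on 1 degree of freedom can be sketched as below. The exact test set-up used here is not described in detail, so this is a generic 2 × 2 test of correct versus incorrect counts for two conditions. The function name is hypothetical, and the counts in the usage note are illustrative, not the trial counts of these experiments.

```python
import math

def chi_square_2x2(correct_a, total_a, correct_b, total_b):
    """Pearson chi-square test (1 d.o.f.) comparing two proportions.

    Returns (chi2, p). Assumes both outcome columns (correct, incorrect)
    are non-empty overall, so that all expected counts are positive.
    """
    observed = [[correct_a, total_a - correct_a],
                [correct_b, total_b - correct_b]]
    row_totals = [total_a, total_b]
    col_totals = [observed[0][0] + observed[1][0],
                  observed[0][1] + observed[1][1]]
    n = total_a + total_b
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 d.o.f.
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p
```

As an illustration only, `chi_square_2x2(60, 100, 36, 100)` compares 60% with 36% accuracy on a hypothetical 100 trials per condition; the resulting p value depends strongly on the assumed number of trials.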
Contour-based (rows: presented object; columns: response)
Tp Cb Cd Cl Cn El Ic SC Sp Ts
Tp 33.3 11.1 11.1 11.1 11.1 22.2
Cb 33.3 25.0 8.3 8.3 8.3 16.6
Cd 16.6 50.0 16.6
Cl 6.6 13.3 13.3 33.3 6.6 13.3 13.3
Cn 10.0 70.0 10.0 10.0
El 11.1 11.1 11.1 11.1 33.3 11.1 11.1
Ic 15.3 15.3 23.0 7.6 7.6 23.0 7.6
SC 11.1 11.1 22.2 55.5
Sp 25.0 25.0 50.0
Ts 33.3 66.6
FPFH-based (rows: presented object; columns: response)
Tp Cb Cd Cl Cn El Ic SC Sp Ts
Tp 100.0
Cb 76.9 7.6 15.3
Cd 21.4 64.2 14.2
Cl 16.6 33.3 50.0
Cn 11.1 5.5 5.5 55.5 5.5 5.5 11.1
El 7.6 7.6 15.3 7.6 15.3 46.1
Ic 6.6 33.3 6.6 26.6 6.6 13.3
SC 37.5 6.2 56.2 6.6
Sp 16.6 33.3 50.0
Ts 11.1 88.8
Contour-based (rows: presented object; columns: response)
Tp Cb Cd Cl Cn El Ic SC Sp Ts
Tp 56.2 6.2 25.0 6.2 6.2
Cb 29.4 17.6 17.6 7.6 7.6
Cd 69.2 7.6 7.6 7.6 7.6
Cl 9.2 28.5 14.2 14.2 4.7 9.2 9.2 4.7 4.7
Cn 9.0 9.0 4.5 4.5 36.3 9.0 4.5 18.1 4.5
El 8.3 8.3 8.3 41.6 8.3 25.0
Ic 6.6 20.0 20.0 20.0 20.0 13.3
SC 50.0 25.0 25.0
Sp 5.2 5.2 5.2 15.7 10.4 10.4 36.8 10.4
Ts 11.1 22.2 11.1 22.2 33.3
FPFH-based (rows: presented object; columns: response)
Tp Cb Cd Cl Cn El Ic SC Sp Ts
Tp 100.0
Cb 37.5 37.5 25.0
Cd 23.5 52.9 11.7 11.7
Cl 15.3 53.8 7.6 15.3 7.6
Cn 36.3 9.0 36.3 9.0 8.0
El 12.5 12.5 50.0 12.5 12.5
Ic 28.5 14.2 57.1
SC 5.5 11.1 11.1 66.6 5.5
Sp 9.0 9.0 9.0 18.1 54.5
Ts 7.1 92.8
Figure 11. Five different viewpoints of the cube and their histograms. Top
row, left: top view; middle: frontal view; right: right front top corner
view. Bottom row, left: right front edge view; right: front bottom edge view.
Figure 12. Five different viewpoints of the cuboid and their histograms. Top
row, left: top view; middle: frontal view; right: right front top corner
view. Bottom row, left: right front edge view; right: front bottom edge view.
Figure 13. Similarity in histograms (FPFH) causes the system to misclassify
the objects. Left: FPFH for the third view of the cone. Right: FPFH of the
largest stretched cylinder (with the second viewpoint).
For the FPFH-based approach, the natural user strategy to
the identification problem is to learn a sound-object mapping.
This worked as long as the system could find the correct
mapping, but the system itself did not always use all available
information. For instance, the flat bottom of a cone and cube
or cuboid produce the same FPFH signature - see Figure
7. The same confusion occurred between cube and cuboid.
Participants using the system frequently misclassified the cube
as the cuboid in all 5 viewpoints of the cube, as is shown
in Figure 11 and Figure 12. The FPFH of the cube is so
similar to that of the cuboid as to cause the system to
misclassify the cube. This is sufficient to explain why in Table
IV the cube is classified as a cuboid almost as often as it is
classified as a cube.
The lack of distinguishability of FPFH signatures between
the top view of the cylinder, the stretched cylinder and the
cube is apparent because they all have a single flat surface
visible. There are also some unexpected confusions, such as
the recognition system itself wrongly identifying the third view
of the cone as the largest scale of the stretched cylinder (see
Figure 13).
Since the frequency of occurrence of this confusion was low,
participants were able to hear the related sound for the cone
more frequently, so it did not affect their performance and they
could treat the second sound as noise.
There were some objects that the system did not have any
difficulty in identifying, such as the icosahedron, ellipsoid,
sphere, teapot and torus. However, their classification accuracy
varies from one-in-two to near-perfect. For example, the ellipsoid,
sphere and icosahedron were correctly identified in 50.0%,
54.5% and 57.1% of trials, while the teapot and torus were
identified perfectly or nearly so (100% for the teapot and 92.8%
for the torus; see Table IV). This large difference in confusion
rates is due to the choice of instrument corresponding to each
object, a fact that was mentioned by most of the participants
during the experimental session. They believed that identifying
the teapot and torus was easy because their sounds (shakers and
sitar) are more distinct than the others. Hence, assigning similar
sounds to shapes that are geometrically similar to each other
may not be a good idea; alternatively, work should be done to
ensure that instruments are more distinguishable from each other.
However, using dissimilar instruments for similar shapes can
defeat any attempt to sonify subtle differences in shape.
Figure 14. Example of object contours for which the proposed
contour-based approach cannot generate sufficiently distinguishable sounds
(red lines around the objects represent object contours). Left column, top:
sphere contour; bottom: icosahedron contour. Right column, top: cylinder
(front view); bottom: rectangle (front view).
Figure 15. Similar contours of multiple objects used in the experiment (red
lines around the objects represent object contours). Top row, left: sphere;
middle: cone; right: icosahedron. Bottom row, left: torus; right: ellipsoid.
In the contour-based approach, it was also observed that
some participants preferred the abstract strategy of
learning identity-sound associations to understanding
the sound-shape mapping. For this
approach, participants reported that some important object
properties were not available to them, leading them to confuse
objects like the sphere with the icosahedron or the front view of
the cuboid with the same view of the cylinder (see Figure 14).
For these two pairs of objects, the sounds produced by the system
are not identical but are similar, which makes it hard for users
to distinguish them from each other; users need to put in more
effort to understand the differences.
However, it should be noted at this point that visually similar
objects should be expected in any successful system to pose a
larger challenge. Moreover, as discussed before, this approach
is viewpoint-variant and for some viewpoints of different
objects which have similar contours, it generates identical or
too-similar sounds which causes the user to choose the wrong
object. For instance, as shown in Figure 15, the top view of
cylinder, cone, torus and sphere all have a circular contour
which makes their sounds exactly the same.
Two different encodings of 3D shape into sound were
presented, i.e., contour-based and FPFH-based. The contour-
based approach presented maps directly from shape to sound.
This is an advantage in that any new object can be represented
in sound, and that similarly shaped objects produce similar
sounds. However, the encoding attempted here only transmits
the image-contour of the object and is not robust to viewpoint.
Some participants also preferred to learn the abstract object
mapping, suggesting that work is needed on making this
approach more intuitive when it comes to the relationship
between shape and sound.
The FPFH-based approach solves these problems by using
data about the full 3D object shape and by representing
features that are somewhat invariant to viewpoint (though
only to the extent that surfaces are visible). The FPFH-based
approach also has the advantage when creating distinguishable
sounds of using a mature sound-synthesis system with highly
recognizable objects. However, again, the use of discrete
instruments reduces flexibility in encoding different shape
properties. In order to make the system work, object exemplars
must be paired with sounds, restricting the generalizability of
the system to new objects and abstracting some of the user’s
agency, not fully utilizing their cognitive capacity.
The next step in this work is to extend these approaches
to reduce the above-mentioned limitations. In the case of
the contour-based approach, a more sophisticated encoding
is needed that takes into account 3D aspects of the object.
Adding some viewpoint invariance may be desirable, but whether
it would actually help is a matter for empirical investigation,
since other tasks that users might want to perform with
objects, such as manipulation, also require them to perceive
the orientation of the object. In the case of the FPFH-based
approach, a way is needed of generalizing from the exemplars
in an appropriate way, for example by using machine learning
techniques in conjunction with user input. Other point cloud
features with different properties should also be systematically
investigated.
Further work also involves testing these sonifications “in
the wild” and with multiple objects, which will require work
on more aggressive noise elimination and object (or surface
primitive) tracking. In both approaches, it is important to
exploit and extend the intuitive mappings from shape to sound
whose exploration was begun here, for quick learning and
application of the system, as well as for recruiting the advanced
cognitive capabilities of users.
This work was supported by the Scientific and Technological
Research Council of Turkey (TÜBİTAK), Project No
[1] M. Auvray, S. Hanneton, and J. K. O’Regan, “Learning to perceive
with a visuo-auditory substitution system: Localisation and object
recognition with ‘The vOICe’,” Perception, vol. 36, no. 3, 2007, pp.
416 – 430, URL:
[retrieved: 2017-02-04].
[2] A. Mhaish, T. Gholamalizadeh, G. ˙
Ince, and D. Duff, “Assessment of
a visual-to-spatial audio sensory substitution system,” in Signal Pro-
cessing and Communications Applications(SIU). Zonguldak, Turkey:
IEEE, May 2016 24th, pp. 245–248, URL:
stamp/stamp.jsp?arnumber=7495723 [retrieved:2017-02-04].
[3] J. D. Gomez Valencia, “A computer-vision based sensory substitution
device for the visually impaired (See ColOr),” Ph.D. dissertation,
University of Geneva, 2014, URL:
34568 [retrieved:2017-02-04].
[4] J. Sudol, O. Dialameh, C. Blanchard, and T. Dorcey, “Looktel—A
comprehensive platform for computer-aided visual assistance,” in IEEE
Computer Society Conference on Computer Vision and Pattern Recog-
nition Workshops (CVPRW). IEEE, 2010, pp. 73–80, URL:http:// [retrieved:2017-02-01].
[5] “Microsoft,” 2016, URL:
microsoft-cognitive-services- introducing-the-seeing- ai-app/#sm.
00001gckhn7usbfpbyeceij35wz8i#570rbWhUTwooz7X5.97 [accessed:
[6] T. Yoshida, K. M. Kitani, H. Koike, S. Belongie, and K. Schlei,
“EdgeSonic: image feature sonification for the visually impaired,”
in Proceedings of the 2nd Augmented Human International Confer-
ence. ACM, 2011, p. 11, URL:
1959837 [retrieved:2017-02-01].
[7] S. Meers and K. Ward, “A vision system for providing 3d perception
of the environment via transcutaneous electro-neural stimulation,” in
International Conference on Information Visualisation. London, UK:
IEEE, Jul. 2004, pp. 546–552, URL:
document/1320198/ [retrieved:2017-02-02].
[8] S. Shelley, M. Alonso, J. Hollowood, M. Pettitt, S. Sharples,
D. Hermes, and A. Kohlrausch, “Interactive Sonification of Curve
Shape and Curvature Data,” in Haptic and Audio Interaction
Design. Springer, Berlin, Heidelberg, Sep. 2009, pp. 51–
60, DOI: 10.1007/978-3-642-04076-4 6, URL:
chapter/10.1007/978-3- 642-04076- 4 6 [retrieved:2017-02-01].
[9] A. J. Trevor, S. Gedikli, R. B. Rusu, and H. I. Christensen,
“Efficient organized point cloud segmentation with connected
components,” Semantic Perception Mapping and Exploration
(SPME), 2013, URL:
ad70db489701ade007b365fe215478303003.pdf [retrieved: 2017-
[10] G. Hiebert, “Openal 1.1 specification and reference,” 2005, URL:http:
// [retrieved: 2017-02-
[11] R. B. Rusu and S. Cousins, “3d is here: Point cloud library (PCL),”
in IEEE Intl. Conf. on Robotics & Automation. Shanghai, China:
IEEE, 2011, pp. 1–4, URL: all.jsp?
arnumber=5980567 [retrieved:2017-02-04].
[12] S. Holzer, R. B. Rusu, M. Dixon, S. Gedikli, and N. Navab, “Adaptive
neighborhood selection for real-time surface normal estimation from
organized point cloud data using integral images,” in Intelligent Robots
and Systems (IROS), 2012 IEEE/RSJ International Conference on.
IEEE, 2012, pp. 2684–2689, URL:
all.jsp?arnumber=6385999 [retrieved:2017-01-20].
[13] G. P. Scavone and P. R. Cook, “RtMidi, RtAudio, and a synthesis
toolkit (STK) update,” Synthesis, 2004, URL:
[retrieved: 2017-02-04].
[14] R. B. Rusu, N. Blodow, and M. Beetz, “Fast point feature histograms
(FPFH) for 3d registration,” in IEEE Intl. Conf. on Robotics &
Automation. Kobe, Japan: IEEE, 2009, URL:
org/xpls/abs all.jsp?arnumber=5152473 [retrieved: 2017-02-04].