Look and Listen: A Multi-modality Late Fusion
Approach to Scene Classification for Autonomous Machines
Jordan J. Bird, Diego R. Faria, Cristiano Premebida, Anikó Ekárt, and George Vogiatzis
Abstract: The novelty of this study consists in a multi-modality approach to scene classification, where image and audio complement each other in a process of deep late fusion. The approach is demonstrated on a difficult classification problem, consisting of two synchronised and balanced datasets of 16,000 data objects, encompassing 4.4 hours of video of 8 environments with varying degrees of similarity. We first extract video frames and accompanying audio at one second intervals. The image and the audio datasets are first classified independently, using a fine-tuned VGG16 and an evolutionary optimised deep neural network, with accuracies of 89.27% and 93.72%, respectively. This is followed by late fusion of the two neural networks to enable a higher order function, leading to accuracy of 96.81% in this multi-modality classifier with synchronised video frames and audio clips. The tertiary neural network implemented for late fusion outperforms classical state-of-the-art classifiers by around 3% when the two primary networks are considered as feature generators. We show that situations where a single modality may be confused by anomalous data points are now corrected through an emerging higher order integration. Prominent examples include a water feature in a city misclassified as a river by the audio classifier alone and a densely crowded street misclassified as a forest by the image classifier alone. Both are examples which are correctly classified by our multi-modality approach.
I. INTRODUCTION
‘Where am I?’ is a relatively simple question for human beings to answer, though doing so requires exceptionally complex neural processes. Humans use their senses of vision, hearing, temperature, etc., as well as past experiences, to discern whether they happen to be indoors or outdoors, and to geolocate in general. This process occurs, for all intents and purposes,
in an instant. Visuo-auditory perception is optimally inte-
grated by humans in order to solve ambiguities; it is widely
recognised that audition dominates time perception while
vision dominates space perception. Both modalities are es-
sential for awareness of the surrounding environment [1]. In a
world rapidly moving towards autonomous machines outside
of the laboratory or home, environmental recognition is an
important piece of information which should be considered
as part of interpretive processes of spatial awareness.
Current trends in Robotic Vision [2]–[5] indicate two main reasons for the usefulness of scene classification. The most obvious is simply awareness of where one currently is; furthermore, in more complex situations, awareness of one's surroundings can be used as input to learning models or as a parameter within an intelligent decision-making process. Just as humans ‘classify’ their surroundings for everyday navigation and reasoning, this ability will very soon become paramount for the growing field of autonomous machines
Fig. 1: The proposed multi-modality (video and audio) approach to scene classification. MFCCs are extracted from audio frames as input to an optimised DNN, while VGG16 and an ANN classify the images. We propose a higher-order function to perform late fusion.
in the outside world such as self-driving cars and self-flying
drones, and possibly, autonomous humanoid androids further
into the future. Related work (Section II) finds that although the processes of classification themselves are well-explored, multi-modality classification is a ripe area, enabled by the rapidly increasing hardware capabilities available to researchers and consumers. With this finding in mind, we explore a bi-modal
sensory cue combination for environment recognition as
illustrated in Figure 1. This endows the autonomous machine
with the ability to look (Computer Vision) and to hear (Audio
Processing) before predicting the environment with a late
fusion interpretation network for higher order functions such
as anomaly detection and decision making. The main moti-
vation for this is to disambiguate the classification process;
for example, if a person were to observe busy traffic on a
country road, the sound of the surroundings alone could be
misclassified as a city street, whereas vision enables the ob-
server to recognise the countryside and correct this mistake.
Conversely, a densely crowded city street confuses a strong vision model, since no discernible objects are recognised at multiple scales, but the sounds of the city street can still
be heard. Though this anomalous data point has confused
the visual model, the interpretation network learns these
patterns, and the audio classification is given precedence
leading to a correct prediction. The main contributions of
this study are centred around the proposed multi-modality
framework illustrated in Figure 2 and are the following: (1) A
large dataset encompassing multiple dynamic environments
Fig. 2: Overview of the multi-modality network. Pre-trained networks with their softmax activation layers removed take synchronised images and audio segments as input; the final prediction is based on an interpretation of the outputs of the two models.
is formed and made publicly available.¹ This dataset provides a challenging problem, since many environments have similar visual and audio features. (2) Supervised Transfer Learning of the VGG16 model towards scene classification by training upon the visual data, together with engineering a range of interpretation neurons for fine-tuning, leads to accurate classification abilities. (3) The evolutionary opti-
misation of a deep neural network for audio processing
of attributes extracted from the accompanying audio leads
to accurate classification abilities, similarly to the vision
network. (4) The final late fusion model combines and
interprets the output of previously trained networks in order
to discern and correct various anomalous data points that
led to mistakes (examples of this are given in Section IV-D).
The multi-modality model outperforms both the visual and
audio networks alone, therefore we argue that multi-modality
classification is a better solution for scene classification.
II. RELATED WORK
Much state-of-the-art work in scene classification explores
the field of autonomous navigation in self-driving cars. Many
notable recent studies [6]–[8] find dynamic environment
mapping leading to successful real-time navigation and ob-
ject detection through LiDAR data. Visual data in the form
of images are often shown to be useful in order to observe
and classify an environment; notably 66.2% accuracy was
achieved on a large scene dataset through transfer learning
from the Places CNN² compared to ImageNet transfer learning and an SVM, which achieved only 49.6% [9]. Similarly, Xie et al. [10] found that through a hybrid CNN trained
for scene classification, scores of 82.24% were achieved for
¹ Full dataset is available at: https://www.kaggle.com/birdy654/scene-classification-images-and-audio
² http://places.csail.mit.edu/downloadCNN.html
the ImageNet dataset.³ Though beyond the current capa-
bilities of autonomous machine hardware, an argument has
recently been put forward for temporal awareness through
LSTM [11], achieving 78.56% and 70.11% pixel accuracy on
two large image datasets. A previous single-modality study
found improvement of scene classification ability by transfer-
ring from both VGG16 and scene images from videogames
to photographic images of real-life environments [12] with
an average improvement of +7.15% when simulation data
was present prior to transfer of weights. In terms of audio,
the usefulness of MFCC audio features in statistical learning
for recognition of environment has recently been shown [13],
gaining classification accuracies of 89.5%, 89.5% and 95.1%
with KNN, GMM, and SVM methods respectively. Nearest-
neighbour MFCC classification of 25 environments achieved
68.4% accuracy compared to a subject group of human
beings who on average recognised environments from audio
data with 70% accuracy [14]. It is argued that a deep neural
network outperforms an SVM for scene classification from
audio data, gaining up to 92% accuracy [15].
Researchers have shown that human beings use multiple
parts of the brain for general recognition tasks, including
the ability of environmental awareness [16], [17]. Though
in many of these studies a single-modality is successful, we
argue that, since the human brain merges the senses into
a robust percept for recognition tasks, the field of scene
classification should find some loose inspiration from this
process through data fusion. We explore vision and audio in this experiment due to accessibility, since a large amount of audio-visual video data is available to researchers. We propose
that in the future further sensory data are explored, given the
success of this preliminary experiment (Section V).
³ http://www.image-net.org
III. PROBLEM AND METHOD
The state-of-the-art is to interpret real-world data consid-
ering a single input. Conversely, the idea of multi-modality
learning is to consider multiple forms of input [18].
Simply put, the question posed to a classifier is ‘where are
you?’. Synchronised images and audio are treated as inputs
to the classifier, and are labelled semantically. A diagram
of this process can be observed in Figure 2;⁴ visual and
auditory functions consider synchronised image and audio
independently, before a higher order function occurs. The
two neural networks are concatenated into an interpretation
network via late fusion to a further hidden layer before a
final prediction is made. Following dataset acquisition of
videos, video frames and accompanying audio clips, the
general experimental processes are as follows. (i) For audio
classification: the extraction of MFCCs of each audio clip to
generate numerical features and evolutionary optimisation of
neural network topology to derive network hyperparameters.
(ii) For image classification: pre-processing through a centre crop (square) and resizing to a 128×128×3 RGB matrix, due to the computational complexity required for larger images, and subsequent fine-tuning of the interpretation layers during transfer learning of the VGG16 trained weight
set. (iii) For the final model: freeze the trained weights of
the first two models while benchmarking an interpretation
layer for synchronised classification of both visual and audio
data. This process is described in more detail throughout
Subsections III-A and III-B.
A. Dataset Acquisition
Initially, 45 source videos of varying length are collected for 9 environmental classes at NTSC 29.97 FPS and are later reduced to 2000 seconds each: Beach (4 sources, 2080 seconds), City (5 sources, 2432 seconds), Forest (3 sources, 2000 seconds), River (8 sources, 2500 seconds), Jungle (3 sources, 2000 seconds), Football Match (4 sources,
2300 seconds), Classroom (6 sources, 2753 seconds), Restau-
rant (8 sources, 2300 seconds), and Grocery Store (4 sources,
2079 seconds). The videos are dynamic, from the point of
view of a human being. All audio is naturally occurring
within the environment. It must be noted that some classes
are similar environments and thus provide a difficult recog-
nition problem. To generate the initial data objects, the video is segmented at one-second intervals: the central frame of each second of video is extracted along with the accompanying second of audio. An example of this data processing for a city clip is shown in Figure 3. Further observation lengths should be explored in future.
This led to 32,000 data objects, 16,000 images (128x128x3
RGB matrices) accompanied by 16,000 seconds (4.4 hours)
of audio data. We then extract the Mel-Frequency Cepstral Coefficients (MFCC) [20] of the audio clips through a set of sliding windows 0.25 s in length (i.e. a frame size of 4K sampling points) and an additional set of overlapping windows, thus producing 8 sliding windows. From each audio frame, we extract 13 MFCC attributes, producing 104 attributes per one-second clip. MFCC extraction consists of the following steps. The Fourier Transform (FT) of the windowed time signal $x(t)$ is derived as $X(\omega) = \int_{-\infty}^{\infty} x(t)\, e^{-j\omega t}\, dt$. The powers from the FT are mapped to the Mel scale, the psychological scale of audible pitch [21]. This occurs through the use of a triangular temporal window. The Mel-Frequency Cepstrum (MFC), or power spectrum of sound, is considered and the log of each of the Mel powers is taken. The derived Mel-log powers are treated as a signal, and a Discrete Cosine Transform (DCT) is taken. This is given as $X_k = \sum_{n=0}^{N-1} x_n \cos\left[\frac{\pi}{N}\left(n+\frac{1}{2}\right)k\right]$, where $k = 0, \ldots, N-1$ is the index of the output coefficient being calculated and $x$ is the array of length $N$ being transformed. The amplitudes of the resulting spectrum are known as the MFCCs.
⁴ The VGG convolutional topology is detailed in [19].
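To make these steps concrete, the following is a minimal sketch of single-frame MFCC extraction in Python with NumPy and SciPy. The 16 kHz sample rate (so that a 0.25 s frame holds roughly 4K samples) and the 26-filter Mel bank are assumptions for illustration; only the 13 retained coefficients follow the text.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sample_rate, n_filters=26, n_coeffs=13):
    """MFCCs for one windowed audio frame, following the steps described above."""
    # 1. Power spectrum via the Fourier transform of the windowed frame.
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    # 2. Map the powers onto the Mel scale with a bank of triangular filters.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    fbank = np.zeros((n_filters, len(freqs)))
    for i in range(n_filters):
        left, centre, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (freqs - left) / (centre - left)
        falling = (right - freqs) / (right - centre)
        fbank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    mel_powers = fbank @ spectrum
    # 3. Take the log of each Mel-band power.
    log_powers = np.log(mel_powers + 1e-10)
    # 4. DCT of the Mel-log powers; the first n_coeffs amplitudes are the MFCCs.
    return dct(log_powers, type=2, norm="ortho")[:n_coeffs]

# A 0.25 s frame of white noise (4,000 samples at an assumed 16 kHz) yields 13 coefficients.
rng = np.random.default_rng(0)
print(mfcc_frame(rng.standard_normal(4000), 16000).shape)  # -> (13,)
```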
The learning process we present is applicable to consumer-
level hardware (unlike temporal techniques) and thus acces-
sible for the current abilities of autonomous machines.
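As an illustration of the data preparation described in this subsection, the sketch below pairs the central frame of each second of a source video with a 104-attribute MFCC vector for the accompanying second of audio. The file name, the 16 kHz sample rate, and the exact window offsets (four non-overlapping 0.25 s windows plus four windows offset by 0.125 s, the last zero-padded) are assumptions, as the paper does not specify them; OpenCV and librosa stand in for whichever tools were actually used.

```python
import cv2        # pip install opencv-python
import librosa    # pip install librosa
import numpy as np

VIDEO_PATH = "city_01.mp4"   # hypothetical source video
SR = 16000                   # assumed audio sample rate

def centre_frame(cap, second, fps):
    """Grab the central frame of the given second, centre-crop to a square and resize."""
    cap.set(cv2.CAP_PROP_POS_FRAMES, int((second + 0.5) * fps))
    ok, frame = cap.read()
    if not ok:
        return None
    h, w, _ = frame.shape
    side = min(h, w)
    top, left = (h - side) // 2, (w - side) // 2
    crop = frame[top:top + side, left:left + side]
    return cv2.resize(cv2.cvtColor(crop, cv2.COLOR_BGR2RGB), (128, 128))  # 128x128x3 RGB

def mfcc_104(clip, sr=SR):
    """13 MFCCs for each of 8 sliding 0.25 s windows of a 1 s clip -> 104 attributes."""
    win = int(0.25 * sr)
    starts = [0.0, 0.25, 0.5, 0.75, 0.125, 0.375, 0.625, 0.875]   # assumed offsets
    feats = []
    for s in starts:
        seg = clip[int(s * sr):int(s * sr) + win]
        seg = np.pad(seg, (0, win - len(seg)))                    # zero-pad the final window
        m = librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=13,
                                 n_fft=win, hop_length=win, center=False)
        feats.append(m[:, 0])
    return np.concatenate(feats)                                  # shape (104,)

cap = cv2.VideoCapture(VIDEO_PATH)
fps = cap.get(cv2.CAP_PROP_FPS)
audio, _ = librosa.load(VIDEO_PATH, sr=SR)   # decodes the audio track (needs ffmpeg/audioread)
n_seconds = int(len(audio) // SR)

images, mfccs = [], []
for sec in range(n_seconds):
    frame = centre_frame(cap, sec, fps)
    if frame is None:
        break
    images.append(frame)
    mfccs.append(mfcc_104(audio[sec * SR:(sec + 1) * SR]))

print(np.array(images).shape, np.array(mfccs).shape)   # (N, 128, 128, 3) (N, 104)
```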
B. Machine Learning Processes
For audio classification, an evolutionary algorithm [22] was used to select the number of layers and neurons contained within an MLP in order to derive the best network topology. The population is set to 20 and the number of generations to 10, since stabilisation occurs prior to generation 10. The simulation is executed five times in order to avoid stagnation at a local minimum being taken forward as a false best solution. Activations of the hidden layers are set to ReLU. For image classification, the VGG16 layers and weights [19] are implemented except for the dense interpretation layers beyond the convolutional layers; these are replaced by {2, 4, 8, ..., 4096} ReLU neurons for interpretation and finally a softmax-activated layer towards the nine-class problem. In order to generate the
final model, the previous process of neuron benchmarking
is also followed. The two trained models for audio and
image classification have their weights frozen, and training
concentrates on the interpretation of the outputs of the
networks. Referring back to Figure 2, the softmax activation layers are removed from the initial two networks in order to pass their interpretations to the final interpretation layer through concatenation; a densely connected layer following the two networks with {2, 4, 8, ..., 4096} ReLU neurons is benchmarked in order to show multi-modality classification ability. All neural networks are trained for 100 epochs with
shuffled 10-fold cross-validation.
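A minimal sketch of this architecture in Keras (TensorFlow 2.x) is given below, using the best-performing settings reported in Section IV (2048 image interpretation neurons, the 977-365-703-41 audio topology, and a 32-neuron fusion layer). In the actual pipeline each branch is first trained on its own modality with a softmax head that is later discarded, and the loaded branch weights are frozen before the fusion layers are trained; the sketch indicates this with comments rather than reproducing the full training procedure.

```python
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import VGG16

NUM_CLASSES = 9  # nine environment classes

def build_image_branch():
    """VGG16 convolutional base plus the tuned dense interpretation layer (2048 ReLU)."""
    inp = layers.Input(shape=(128, 128, 3), name="image")
    vgg = VGG16(include_top=False, weights="imagenet", input_tensor=inp)
    x = layers.Flatten()(vgg.output)
    x = layers.Dense(2048, activation="relu")(x)
    return Model(inp, x, name="visual_branch")

def build_audio_branch():
    """Best evolved MLP topology (977-365-703-41) over the 104 MFCC attributes."""
    inp = layers.Input(shape=(104,), name="mfcc")
    x = inp
    for units in (977, 365, 703, 41):
        x = layers.Dense(units, activation="relu")(x)
    return Model(inp, x, name="audio_branch")

visual, audio = build_image_branch(), build_audio_branch()
# Each branch would first be trained on its own modality (with a softmax head that is
# discarded here); we assume those weights are loaded before freezing the branches.
visual.trainable = False
audio.trainable = False

# Late fusion: concatenate the penultimate outputs and add a trainable interpretation layer.
fused = layers.concatenate([visual.output, audio.output])
fused = layers.Dense(32, activation="relu", name="fusion_interpretation")(fused)  # 32 was best of {2,...,4096}
out = layers.Dense(NUM_CLASSES, activation="softmax")(fused)

fusion_model = Model([visual.input, audio.input], out, name="late_fusion")
fusion_model.compile(optimizer="adam",
                     loss="categorical_crossentropy",
                     metrics=["accuracy"])
fusion_model.summary()
# Training would then call fusion_model.fit([image_batch, mfcc_batch], one_hot_labels,
# epochs=100) under shuffled 10-fold cross-validation, as described above.
```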
IV. EXPERIMENTAL RESULTS
A. Fine Tuning of VGG16 Weights and Topology
Figure 4 shows the tuning of the number of interpretation neurons for the image classification network. The best result was 89.27% 10-fold classification accuracy, achieved with 2048 neurons.
B. Evolving the Sound Processing Network
Regardless of initial (random) population, stabilisation of
the audio network topology search occurred around the 92-
94% accuracy mark. The best solution was a deep network
of 977, 365, 703, 41 hidden-layer neurons, which gained
Fig. 3: Example of extracted data from a five second timeline. Each second, a frame is extracted from the video along with
the accompanying second of audio.
Fig. 4: Image 10-fold Classification Accuracy corresponding
to interpretation neuron numbers.
TABLE I: Final results of the five evolutionary searches, sorted by 10-fold validation accuracy.

Simulation   Hidden Neurons              Connections   Accuracy
2            977, 365, 703, 41           743,959       93.72%
4            1521, 76, 422, 835          664,902       93.54%
1            934, 594, 474               937,280       93.47%
3            998, 276, 526, 797, 873     1,646,563     93.45%
5            1524, 1391, 212, 1632       2,932,312     93.12%
93.72% accuracy via 10-fold cross validation. All final solu-
tions are presented in Table I. Interestingly, a less complex solution achieves a competitive 93.54% accuracy with 79,057 fewer network connections.
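The topologies in Table I come from the evolutionary search outlined in Section III-B. The following is a heavily simplified, illustrative sketch of such a search rather than the DEvo algorithm of [22] itself; it uses scikit-learn's MLPClassifier on a small stand-in dataset so that it runs quickly, whereas the paper evaluates candidate deep networks on the 104-attribute MFCC data with 10-fold cross-validation.

```python
import random
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)   # small stand-in; the paper uses the MFCC dataset

def fitness(hidden_layers):
    """Cross-validated accuracy of an MLP with the given hidden-layer sizes
    (3-fold here for speed; the paper reports 10-fold cross-validation)."""
    clf = MLPClassifier(hidden_layer_sizes=hidden_layers, activation="relu",
                        max_iter=100, random_state=0)
    return cross_val_score(clf, X, y, cv=3).mean()

def random_topology():
    return tuple(random.randint(16, 512) for _ in range(random.randint(1, 5)))

def mutate(t):
    sizes = list(t)
    i = random.randrange(len(sizes))
    sizes[i] = max(16, sizes[i] + random.randint(-64, 64))
    return tuple(sizes)

def crossover(a, b):
    cut = random.randint(1, min(len(a), len(b)))
    return a[:cut] + b[cut:]

POP, GENS = 20, 10    # population and generation counts used in the paper
population = [random_topology() for _ in range(POP)]
for gen in range(GENS):
    scored = sorted(((fitness(t), t) for t in population), reverse=True)
    parents = [t for _, t in scored[:POP // 2]]          # 'survival of the fittest'
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP - len(parents))]
    population = parents + children
    print(f"generation {gen}: best {scored[0][1]} acc={scored[0][0]:.3f}")
```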
C. Fine Tuning the Final Model
With the two input networks frozen at the previously
trained weights, the results of the multi-modality network
can be observed in Figure 5. The best interpretation layer
was selected as 32, which attained a classification accuracy
of 96.81%, as shown in Table II. Late fusion was also tested with other models by treating the two networks as feature generators: a Random Forest scored 94.21%, Naive Bayes scored 93.61%, and an SVM scored 95.08%, all of which were outperformed by the tertiary deep neural network.
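A sketch of this comparison is given below. Placeholder random arrays stand in for the penultimate-layer outputs of the two frozen branches (here 2048 visual and 41 audio features per synchronised data object), since reproducing the real features requires the trained models and the dataset; with real features the classical models reach the 93-95% figures quoted above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 500                                           # stand-in sample count
visual_features = rng.standard_normal((n, 2048))  # stand-in for the visual branch outputs
audio_features = rng.standard_normal((n, 41))     # stand-in for the audio branch outputs
labels = rng.integers(0, 9, size=n)               # nine environment classes

# Late fusion for classical models: concatenate the two feature blocks per data object.
fused = np.concatenate([visual_features, audio_features], axis=1)

for name, clf in [("Random Forest", RandomForestClassifier()),
                  ("Naive Bayes", GaussianNB()),
                  ("SVM", SVC())]:
    acc = cross_val_score(clf, fused, labels, cv=10).mean()
    print(f"{name}: {acc:.4f}")   # near chance on random data; 93-95% on the real features
```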
Fig. 5: Multi-modality 10-fold Classification Accuracy cor-
responding to interpretation neuron numbers.
Fig. 6: Beach sonogram (with speech at 3s-4.5s)
Fig. 7: Restaurant sonogram (with speech throughout)
Image:"CITY"
Audio:"CITY"
Multi-modality:"CITY"
Image:"FOREST"
Audio:"CITY"
Multi-modality:"CITY"
Fig. 8: An example of confusion of the vision model, which
is corrected through multi-modality. In the second frame,
the image of hair is incorrectly classified as the “FOREST”
environment through Computer Vision.
Image:"CITY"
Audio:"RIVER"
Multi-modality:"CITY"
Image:"CITY"
Audio:"RIVER"
Multi-modality:"CITY"
Fig. 9: An example of confusion of the audio model, which
is corrected through multi-modality. In both examples, the
audio of a City is incorrectly classified as the “RIVER”
environment due to the sounds of a fountain and flowing
water by the audio classification network.
D. Comparison and Analysis of Models
For final comparison of classification models, Table II
shows the best performances of the tuned vision, audio,
and multi-modality models, through 10-fold cross valida-
tion. Though visual classification was the most difficult
task at 89.27% prediction accuracy, it was only slightly
outperformed by the audio classification task at 93.72%.
Outperforming both models was the multi-modality approach (Figure 2): when both vision and audio are considered through network concatenation, the model learns not only to classify both network outputs concurrently but, more importantly, to calculate relationships between them. An example of confusion of the audio model can be seen in the sonograms in Figures 6 and 7. Multiple frames of audio from the beach clip were misclassified as ‘Restaurant’ due to focus on the human speech audio. The image classification model, on the other hand, correctly classified these frames, and the multi-modality model did also. The same was observed several times within the classes ‘City’, ‘Grocery Store’, and ‘Football Match’. We note that these sonograms show the frequency content of the raw audio (stereo averaged into mono) for demonstrative purposes, and MFCC extraction occurs after
this point. Another example of this can be seen in Figure 8,
in which the Vision model has been confused by a passerby.
The audio model recognises the sounds of traffic and crowds
etc. (this is also possibly why the audio model outperforms
TABLE II: Scene classification ability of the three tuned models on the dataset.

Model            Scene Classification Ability
Visual           89.27%
Auditory         93.72%
Multi-modality   96.81%
TABLE III: Results of the three approaches applied to completely unseen data (9 classes).

Approach               Correct/Total   Classification Accuracy
Audio Classification   359/1071        33.52%
Image Classification   706/1071        65.92%
Multi-modality         856/1071        79.93%
the image model slightly); the interpretation network has learnt this pattern and thus has ‘preferred’ the outputs of the audio model in this case. Since the multi-modality model outperforms both single-modality models, confusion is also corrected in the opposite direction: observe that in Figure 9, the audio model has inadvertently predicted that the environment is a river due to the sounds of water, yet the image classifier correctly predicts that it is a city, in this case, Las Vegas. The multi-modality model, again, has learnt such patterns and has preferred the prediction of the image model, leading to a correct recognition of the environment.
The results of applying the models to completely unseen
data (two minutes per class) can be seen in Table III. It can
be observed that audio classification of environments is weak
at 33.52%, which is outperformed by image classification at
65.92% accuracy. Both approaches are outperformed by the
multi-modality approach which scores 79.93% classification
accuracy. The confusion matrix of the multi-modality model
True class \ Predicted   Forest  Beach  Classroom  City   Football-match  Restaurant  Jungle  River  Supermarket
Forest                   0.975   0      0          0.017  0               0           0       0.008  0
Beach                    0       1      0          0      0               0           0       0      0
Classroom                0       0      0.63       0      0               0.37        0       0      0
City                     0       0      0          1      0               0           0       0      0
Football-match           0       0      0          0      0.992           0           0       0      0.008
Restaurant               0       0      0          0.084  0               0.05        0       0      0.866
Jungle                   0.193   0      0          0      0.008           0           0.756   0.042  0
River                    0       0      0          0      0.118           0           0       0.882  0
Supermarket              0       0      0          0.051  0.017           0.025       0       0      0.907

Fig. 10: Confusion matrix for the multi-modality model applied to completely unseen data (rows: true class, columns: predicted class).
can be observed in Figure 10; the main issue is caused by ‘Restaurant’ being confused as ‘Supermarket’, while all other environments are classified strongly. On manual observation, both classes in the unseen data feature a large number of people with speech audio; we conjecture that this is possibly most similar to the supermarkets in the training dataset, and thus the model is confident that both of these classes belong to the supermarket class. This suggests that the data could be more diversified in future in order to feature more minute details and thus improve the model’s ability to discern between the two.
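The per-class rates in Figure 10 are row-normalised. A minimal sketch of producing such a matrix from predictions on the unseen clips is shown below, with placeholder label arrays standing in for the ground truth and for the fusion model's predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

classes = ["Forest", "Beach", "Classroom", "City", "Football-match",
           "Restaurant", "Jungle", "River", "Supermarket"]

rng = np.random.default_rng(0)
y_true = rng.integers(0, len(classes), size=1071)           # stand-in for the unseen-data labels
y_pred = np.where(rng.random(1071) < 0.8, y_true,           # stand-in for model predictions
                  rng.integers(0, len(classes), size=1071))

# Rows are true classes, columns are predicted classes, normalised per row as in Fig. 10.
cm = confusion_matrix(y_true, y_pred, normalize="true")
print(np.round(cm, 3))
```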
V. CONCLUSIONS AND FUTURE WORK
This study presented and analysed three scene classification models: (a) a vision model using fine-tuned VGG16 weights for classification of images of environments; (b) a deep neural network for classification of audio of environments; and (c) a multi-modality approach, which outperformed the two original approaches through the gained ability to detect anomalous data points by considering the outputs of both models. The tertiary neural network for late fusion was compared and found to be superior to Naive Bayes, Random Forest, and Support Vector Machine classifiers. We argue that since audio classification
is a relatively easy task, it should be implemented where
available to improve environmental recognition tasks. This
work focused on the context of autonomous machines, and
thus consumer hardware capability was taken into account
through temporal-awareness implemented within the feature
extraction process rather than within the learning process.
In future, better results could be gained by enabling a neural network to learn temporal awareness through recurrence. Since the model was found to be effective with the
complex problem posed through our dataset, future studies
could concern other publicly available datasets in order to
explore the applicability more widely. With the available
hardware, evolutionary selection of network topology was
only possible with the audio classifier. In future and with
more resources, this algorithm could be applied to both
the vision and interpretation models with the expectation of achieving a better set of hyperparameters beyond the tuning
performed in this study. The model could also be applied in
real-world scenarios. For example, it has recently been shown
that autonomous environment detection is useful in the
automatic application of scene settings for hearing aids [23].
Future works could also consider optimisation of the frame
segmentation process itself as well as exploration of the
possibility of multiple image inputs per task. Additionally,
given the success of late fusion in this work, applications
to video classification tasks could be considered through a
similar approach.
REFERENCES
[1] R. Arandjelovic and A. Zisserman, “Look, listen and learn,” in IEEE
International Conference on Computer Vision (ICCV), pp. 609–617,
2017.
[2] A. Geiger, M. Lauer, C. Wojek, C. Stiller, and R. Urtasun, “3D traffic scene understanding from movable platforms,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 36, no. 5, pp. 1012–1025, 2014.
[3] M. Cordts, T. Rehfeld, M. Enzweiler, U. Franke, and S. Roth, “Tree-structured models for efficient multi-cue scene labeling,” IEEE TPAMI, vol. 39, no. 7, pp. 1444–1454, 2017.
[4] J. Xue, H. Zhang, and K. Dana, “Deep texture manifold for ground
terrain recognition,” in IEEE/CVF CVPR, pp. 558–567, 2018.
[5] K. Onda, T. Oishi, and Y. Kuroda, “Dynamic environment recogni-
tion for autonomous navigation with wide FOV 3D-LiDAR,” IFAC-
PapersOnLine, vol. 51, no. 22, pp. 530–535, 2018.
[6] F. Yu, J. Xiao, and T. Funkhouser, “Semantic alignment of LiDAR data
at city scale,” in IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pp. 1722–1731, 2015.
[7] C. Zach, A. Penate-Sanchez, and M. Pham, “A dynamic programming
approach for fast and robust object pose recognition from range
images,” in IEEE CVPR, pp. 196–203, 2015.
[8] D. Xu, D. Anguelov, and A. Jain, “PointFusion: Deep sensor fusion
for 3D bounding box estimation,” in IEEE/CVF CVPR, pp. 244–253,
2018.
[9] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, “Learning
deep features for scene recognition using places database,” in Advances
in neural information processing systems (NIPS), pp. 487–495, 2014.
[10] G.-S. Xie, X.-Y. Zhang, S. Yan, and C.-L. Liu, “Hybrid cnn and
dictionary-based models for scene recognition and domain adaptation,”
IEEE Transactions on Circuits and Systems for Video Technology,
vol. 27, no. 6, pp. 1263–1274, 2015.
[11] W. Byeon, T. M. Breuel, F. Raue, and M. Liwicki, “Scene labeling
with lstm recurrent neural networks,” in IEEE Conf. on Computer
Vision and Pattern Recognition (CVPR), pp. 3547–3555, 2015.
[12] J. J. Bird, D. R. Faria, P. P. Ayrosa, and A. Ekárt, “From simulation to reality: CNN transfer learning for scene classification,” in 2020 International Conference on Intelligent Systems (IS), IEEE, 2020.
[13] S. Chu, S. Narayanan, C.-C. J. Kuo, and M. J. Mataric, “Where am
I? Scene recognition for mobile robots using audio features,” in IEEE
International conference on multimedia and expo, pp. 885–888, 2006.
[14] V. Peltonen, J. Tuomi, A. Klapuri, J. Huopaniemi, and T. Sorsa,
“Computational auditory scene recognition,” in IEEE International
Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. II–
1941, 2002.
[15] Y. Petetin, C. Laroche, and A. Mayoue, “Deep neural networks for
audio scene recognition,” in IEEE 23rd European Signal Processing
Conference (EUSIPCO), pp. 125–129, 2015.
[16] M. P. Mattson, “Superior pattern processing is the essence of the evolved human brain,” Frontiers in Neuroscience, vol. 8, p. 265, 2014.
[17] M. W. Eysenck and M. T. Keane, Cognitive psychology: A student’s
handbook. Psychology press, 2015.
[18] N. Srivastava and R. Salakhutdinov, “Multimodal learning with deep
Boltzmann machines,” in Advances in Neural Information Processing
Systems (NIPS), (USA), pp. 2222–2230, 2012.
[19] K. Simonyan and A. Zisserman, “Very deep convolutional networks
for large-scale image recognition,” in International Conference on
Learning Representations, 2015.
[20] L. Muda, M. Begam, and I. Elamvazuthi, “Voice recognition algo-
rithms using mel frequency cepstral coefficient (MFCC) and dynamic
time warping (DTW) techniques,” arXiv preprint arXiv:1003.4083,
2010.
[21] S. S. Stevens, J. Volkmann, and E. B. Newman, “A scale for the
measurement of the psychological magnitude pitch,” The Journal of
the Acoustical Society of America, vol. 8, no. 3, pp. 185–190, 1937.
[22] J. J. Bird, A. Ekárt, C. D. Buckingham, and D. R. Faria, “Evolutionary optimisation of fully connected artificial neural network topology,” in Intelligent Computing-Proceedings of the Computing Conference, pp. 751–762, Springer, 2019.
[23] M. Büchler, S. Allegro, S. Launer, and N. Dillier, “Sound classification in hearing aids inspired by auditory scene analysis,” Journal on Advances in Signal Processing, no. 18, pp. 387–845, 2005.