Perceived Emotion from Images through Deep Neural Networks
Institute of Cognitive Science
University of Osnabrück
Abstract—One of the goals of affective computing is predict-
ing the emotional response of people elicited by multimedia
content. Although remarkable steps have been made in the
ﬁeld, the problem still remains open. Emotions are conveyed
by many and varied factors, from very low-level cues, such as
the colors of the stimulus, to high-level aspects, such as the
semantics of the scene. One of the main challenges is the fact
that even though some of these factors are known by neuro-
scientists, computer scientists or artists, many of the stimulus
features that play a role in eliciting emotion probably remain
unknown. The recent success of deep learning methods, which
are able to automatically learn relevant stimuli representations,
seems to set a promising path to follow. Here we will explore
new deep neural architectures suitable for affective content
analysis, together with semi-supervised models that help handle
the relative lack of available data and the uncertainty and
subjectivity of the annotations.
1. Introduction

The study of emotion was one of the first research areas
in psychology and neuroscience , and today it receives
renewed and increasing attention . The reason is probably
that emotions are sometimes regarded as “the most
significant events in our lives” . They play a central
role in communication, problem-solving, decision-making
and other aspects of cognition . The awareness of the
importance of emotions in everyday life and particularly
in the interaction with technology has therefore drawn the
attention of other ﬁelds such as artiﬁcial intelligence or
affective computing .
Since the early days of affective computing, the analysis
of the expected emotional response of viewers of multimedia
content, often referred to as affective content analysis, has
been one of its main areas of research, and these days it
receives ever-increasing attention from the community .
Especially considering the increasing amount of available
audiovisual content, such a technology is becoming crucial
for successful systems that involve human-computer inter-
action and multimedia retrieval and recommendation. Other
applications include mental healthcare, emotional interfaces
for social networks, advertising and audiovisual production.
In spite of the many works on emotion, there is still
no consensus about what emotions are  and their exact
definition and description have been the subject of intense
debate , . In this research we will simply consider the
wide and comprehensive deﬁnition given by Antonio Dama-
sio : emotions are “a collection of changes occurring in
both brain and body, usually prompted by a particular mental
content”. Such mental content can in turn be triggered by
external stimuli, such as the images studied in this work.
How and which visual stimuli elicit certain emotional
responses are among the main research questions we aim
to shed light on with this work. Finding an accurate
answer to these questions is however challenging due to
the complexity and subjectivity of emotions. We believe
that it requires the contribution of different disciplines, not
only psychology and neuroscience, but also art and artificial
intelligence . Hence, here we propose to combine the
knowledge from machine learning and cognitive science to
design algorithms that can learn meaningful representations
of images and predict the induced emotional response on the
viewers. Analyzing these representations and the models can
be in turn insightful to understand how emotional stimuli are
perceived by human observers and how computers can learn
about emotions and affect.
Affective content analysis has been traditionally ap-
proached by relying on hand-crafted audiovisual features,
often inspired by psychology or aesthetics theories .
These features are then used to train a machine learning
classifier, such as an SVM or logistic regression, to learn the
emotion induced by images or videos , . A survey
on the literature of affective content analysis shows that
relative success has been achieved by using either low-level
, mid-level  or high-level features . This only
shows that visual emotion elicitation is triggered by many
and varied factors and perhaps hand-crafted features alone
cannot provide a solid solution to the problem.
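For illustration, this classical pipeline can be sketched in a few lines of NumPy. All data below is synthetic, and the mean-brightness feature and its class-dependent distributions are our own illustrative assumptions, not descriptors from the cited works: a single hand-crafted feature is extracted and a logistic-regression classifier is trained on it with gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for an affective data set: assume "positive" images
# are on average brighter than "negative" ones (illustrative only).
n = 100
x = np.concatenate([rng.normal(0.7, 0.05, n),    # mean brightness, positive
                    rng.normal(0.3, 0.05, n)])   # mean brightness, negative
y = np.concatenate([np.ones(n), np.zeros(n)])

# Logistic regression on the single hand-crafted feature, trained by
# gradient descent on the cross-entropy loss.
w, b = 0.0, 0.0
lr = 1.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(w * x + b)))   # predicted probabilities
    w -= lr * np.mean((p - y) * x)           # gradient w.r.t. weight
    b -= lr * np.mean(p - y)                 # gradient w.r.t. bias

p = 1.0 / (1.0 + np.exp(-(w * x + b)))
accuracy = np.mean((p > 0.5) == y)
```

The same scheme extends to richer descriptors (color statistics, composition, semantics) by replacing the scalar feature with a feature vector and the scalar weight with a weight vector.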
This prompts us to hypothesize that convolutional neural
networks (CNNs) can be a promising tool for affective visual
content analysis. Artiﬁcial deep neural networks are able
to automatically learn hierarchical representations of their
inputs , from very low- to high-level features. Some
of the descriptors learned by a neural network might be
similar to the hand-crafted features already proposed in the
literature, but there is also a great potential to ﬁnd relevant
features that are still unknown or ignored by experts.
Automatically learning these features comes with the
need for large sets of labeled data, which in the case of
affective content is a high price to pay. By way of illus-
tration, the ImageNet data set that has enabled the great
success of CNNs on object recognition consists of over
10 million labeled images . For this reason we also
propose to apply semi-supervised learning methods that are
able to learn useful representations from unlabeled data
and combine them with features learned from the available
labeled data sets .
2. Related work
Much research in affective computing has been devoted
to the recognition of emotions from face images , where
the challenge is detecting the emotion from information
coming directly from the subject. A similar concept but
using different data sources is the affect detection from
psychophysiological signals, such as electrodermal activity
. One of the ﬁrst works where emotional information
is detected directly from image content can be found in
, where low-level visual features are mapped to natural
language. In  the standard approach of training an SVM
classiﬁer with low-level descriptors was already followed.
A similar procedure applied to video data can be found in
. A related popular research area is the recognition of
aesthetics from image and video content , . See 
for a general review of models, methods and applications
of affect detection.
The great success of deep learning ,  in computer
vision, natural language understanding, speech recognition
and other ﬁelds has recently started drawing the attention
of affective computing as well. In  recurrent neural
networks were used to model the variation of emotion over
time from EEG signals and facial expressions. One of the
ﬁrst applications of deep learning for affective video content
analysis can be found in , with promising results. Deep
networks have also been applied to speech emotion recognition .
Regarding emotion recognition from images, which is
the focus of this research, convolutional neural networks
have only recently been employed in , where emotions
are recognized from abstract paintings, and in , where a
large-scale data set with emotion annotations was presented.
State-of-the-art results on this data set using a combination
of CNNs were recently presented in .
3. Description of the research idea
Given the early stage in the application of deep learning
methods on affective image content analysis, we propose to
further explore the use of neural networks to predict the ex-
pected emotion elicited by natural images on their viewers.
Convolutional neural networks (CNNs) have demonstrated
an exceptional performance in other computer vision tasks
like object recognition since 2012, when they started achiev-
ing largely better results than other traditional machine
learning methods. For a review of the history, techniques and
achievements of deep learning methods in different ﬁelds
see , . The good results obtained by deep neural
networks in object recognition tasks lead us to hypothesize
that they have high potential in emotion recognition as well.
In contrast to currently standard methods, where hand-
crafted visual descriptors are proposed to train classiﬁers
, deep neural networks can be trained by directly using
the original digital images as inputs. The networks, typically
organized by interconnected layers of artiﬁcial neurons, are
trained through backpropagation, which iteratively updates
the weights of the neurons to minimize a training prediction
error. In the case of convolutional neural networks (CNNs),
 and similar works have shown that these learned
weights form kernels that can be interpreted as feature
hierarchies. That is, neurons in early layers automatically
learn low-level representations, whereas higher layers learn
mid- to high-level features, in a relatively comparable way
to how the visual cortex is organized .
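The low-level end of such a hierarchy can be made concrete with a minimal sketch (plain NumPy on a synthetic image; the Sobel kernel is a classical hand-designed edge detector, shown here only as an example of the kind of kernel a first convolutional layer typically learns rather than anything from the cited works):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D cross-correlation of a grayscale image with a kernel."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Vertical-edge (Sobel) kernel: in a CNN, such kernels are not designed
# by hand but learned from data via backpropagation.
sobel_x = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])

# Synthetic image: dark left half, bright right half -> one vertical edge.
img = np.zeros((8, 8))
img[:, 4:] = 1.0

response = conv2d(img, sobel_x)
# The response is zero on the flat regions and peaks along the edge.
```

Stacking many such learned kernels, interleaved with non-linearities and pooling, is what lets later layers respond to increasingly abstract patterns.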
If neural networks trained to recognize objects on images
can learn efﬁcient, meaningful and useful features for that
task, training them to recognize emotions might very well
generate valuable representations that improve the accuracy
of affective content analysis methods. It is possible that
some of these representations resemble the features that have
been traditionally hand-crafted for affective computing tasks
, , , whereas others might be still unknown or
ignored by the experts. This area is worth exploring, as it
could help augment the knowledge about emotion
elicitation in fields like psychology and the visual arts.
One disadvantage of current supervised deep neural
networks is that they require many more labeled examples
than other machine learning methods in order to enable
learning and yield good generalization performance. In the
case of affective content analysis, labeling data is a costly
and subjective task: there is often signiﬁcant inter-subject
variability in the report of perceived affect. Therefore, af-
fective data sets usually exhibit two undesirable properties:
1) a reduced number of examples  and 2) noisy labels
. By way of comparison, the well-known large database
for object recognition, ImageNet , contains millions of
images with highly reliable labels. Therefore, the application
of deep learning to affective content analysis poses particular
challenges: rethinking the typical procedures to ﬁnd ways to
handle both the lack of data and the noisy labels.
In order to compensate for the lack of labeled data,
we propose to make use of semi-supervised methods. Semi-
supervised machine learning, which has been employed
in affective computing in , consists in enhancing the
learning process of a supervised model by making use
of a large set of unlabeled examples. In the context of
deep learning, such an approach can be addressed through
autoencoders , among other methods. We hypothesize
that it is possible to learn useful emotional representations
from images similar to the labeled ones, i.e. coming from a
similar probability distribution, and then use these features
to enhance the performance of a classiﬁer ﬁne-tuned with
labeled images. In order to handle the noisy nature of
the labels, we propose adding uncertainty information that
can regularize the learning process of purely supervised
models.
Figure 1. Example images from the data set. Top row, positive emotions: (a) amusement, (b) awe, (c) contentment, (d) excitement. Bottom row, negative emotions: (e) anger, (f) disgust, (g) fear, (h) sadness.
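The semi-supervised approach described above can be sketched with a toy linear autoencoder (NumPy; random vectors stand in for unlabeled image features, and all dimensions and hyperparameters are illustrative assumptions). It learns a low-dimensional code by minimizing reconstruction error; the learned encoder could then initialize a supervised emotion classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

# Unlabeled "feature vectors" (synthetic stand-ins for image descriptors).
X = rng.normal(0.0, 1.0, (64, 16))

# Tiny linear autoencoder: 16 -> 4 -> 16. The 4-dim code is the learned
# representation that could later be fine-tuned with labeled images.
W_enc = rng.normal(0.0, 0.1, (4, 16))
W_dec = rng.normal(0.0, 0.1, (16, 4))

def loss(X, W_enc, W_dec):
    recon = (X @ W_enc.T) @ W_dec.T
    return np.mean((recon - X) ** 2)

init_loss = loss(X, W_enc, W_dec)
lr = 0.01
for _ in range(200):
    code = X @ W_enc.T                    # encode
    err = code @ W_dec.T - X              # reconstruction error
    grad_dec = err.T @ code / len(X)      # gradient of the MSE w.r.t. W_dec
    grad_enc = (err @ W_dec).T @ X / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

final_loss = loss(X, W_enc, W_dec)        # reconstruction improves with training
```

In practice one would use convolutional autoencoders with non-linearities, but the training principle, reconstructing unlabeled inputs to learn a useful code, is the same.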
Finally, an important part of the proposal is performing
an analysis beyond the machine learning results. We believe
that the study of emotions should be approached in an inter-
disciplinary way . Therefore we will combine insights
from cognitive science, in particular our experience with
eye-movements research , , to gain more valuable
knowledge about visual emotion perception. For example,
we plan to study the correlation between eye movements
and relevant regions for the neural network prediction .
Additionally we will base the network architecture exper-
imentation on neurobiology concepts, which has provided
inspiration to deep learning in the past .
4. Data set and methodology

Even if we propose to apply semi-supervised techniques,
a large amount of labeled images is still required. Recently,
a new data set of images for emotion recognition was pub-
lished, containing a considerably larger number of images
than the previously existing databases .
This data set consists of images from Flickr and Instagram
that were tagged with one of the 8 emotions in 
(amusement, awe, contentment, excitement, anger, disgust,
fear and sadness) as keywords. An original set of 90,000
images was submitted to Amazon Mechanical Turk, where
5 people validated the tagged emotion for each image. After
the validation process, 23,000 images got 3 or more favorable
votes and became part of what the authors called the
strongly labeled set. Figure 1 shows some sample images
from the data set.
It is important to notice that even the so-called strong
labels have some degree of uncertainty, so we propose taking
this into consideration and treating the annotations as noisy
labels, for example by softening them or establishing some
conﬁdence by making use of the validation information.
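One simple way to implement such softening (a hypothetical scheme of our own, not one proposed by the data set's authors) is to turn each validated keyword into a soft target whose peak equals the fraction of favorable votes:

```python
import numpy as np

def soften_label(emotion_idx, favorable_votes, n_voters=5, n_classes=8):
    """Soft target for one image: the tagged emotion receives probability
    favorable_votes / n_voters, and the remaining mass is spread
    uniformly over the other classes (illustrative scheme)."""
    confidence = favorable_votes / n_voters
    target = np.full(n_classes, (1.0 - confidence) / (n_classes - 1))
    target[emotion_idx] = confidence
    return target

# An image tagged with emotion index 6, validated by 4 of the 5 voters:
target = soften_label(6, 4)
```

Training with a cross-entropy loss against such targets penalizes overconfident predictions on images whose annotations received weak agreement.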
The rest of the images with weak labels will be used as
input for training unsupervised models that can potentially
enhance the classiﬁcation. We also plan to enrich this new
data set with images from other existing databases from the
literature, such as IAPS .
In order to carry out our proposal and achieve successful
results we will ﬁrst systematically train different neural
networks. Suitable architectures for dealing with emotional
content might be different from the ones that work well for
object recognition. Therefore, we will explore new archi-
tectures with an emphasis on biologically inspired models.
For example, in  the baseline prediction of typical neural
networks is improved with an architecture that considers the
different nature of features that potentially have an inﬂuence
on the elicited emotion.
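A sketch of such a design (our own illustrative assumption of what a multi-branch architecture could look like, not the exact model of the cited work): a low-level branch computes color statistics while a high-level branch transforms a semantic embedding, and the two outputs are concatenated before a shared emotion classification head.

```python
import numpy as np

rng = np.random.default_rng(0)

def color_branch(image):
    """Low-level branch: per-channel mean and standard deviation."""
    return np.concatenate([image.mean(axis=(0, 1)), image.std(axis=(0, 1))])

def semantic_branch(embedding, W):
    """High-level branch: ReLU over a linear projection of a
    (hypothetically pretrained) semantic embedding."""
    return np.maximum(0.0, W @ embedding)

img = rng.random((32, 32, 3))      # synthetic RGB image
emb = rng.random(128)              # stand-in for a semantic feature vector
W = rng.normal(0.0, 0.1, (16, 128))

# Concatenated multi-level features feeding a shared 8-way emotion head.
features = np.concatenate([color_branch(img), semantic_branch(emb, W)])
```

Keeping the branches separate lets the network weigh low-level cues (e.g. color) and high-level cues (e.g. scene semantics) independently before fusing them.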
In a subsequent experimentation phase we will try to
enhance the performance with semi-supervised training, as
explained above. Finally, we will perform an analysis of the
learned features through network visualization techniques
aiming at widening the knowledge about emotion elicitation
in psychology, affective computing and other ﬁelds.
5. Preliminary results and tentative plans
Although this research project is still at an early stage
and we have not obtained state-of-the-art results so far, we
have ﬁrst performed an analysis of the data annotations
from  that has given us some insight and inspiration for
the future experimentation. Quantitatively we have assessed
the amount of uncertainty from the keywords validation
process by analyzing the level of agreement for the different
emotions.
Figure 2. Histograms showing the number of images that got 0, 1, ..., 5 agreement/disagreement votes in the validation process for annotating the data set. (a) Amusement. (b) Disgust.
Figure 2 shows the analysis for two of the emotions,
amusement and disgust (the rest of the emotions are not
shown due to space limitations). First, the histograms show
that there is a large number of images whose original
emotion keyword was not validated by the majority of the
voters, especially in the cases of disgust, awe, fear and
anger, which all have a similar distribution. Second, even
when most voters validated the emotion keyword, there are
only a few images on which the voters were unanimous,
which is an indicator of the relative uncertainty of the
annotations.
In view of this analysis, one may think that the labels
might not be reliable. However, by manually inspecting
some of the images with strong labels, we have found that in
most cases the labels are reasonable, which is an indicator
that neural networks can learn useful representations from
the images and achieve good performance.
Regarding the experimentation, up to now we have only
trained a small network with three convolutional layers
and one fully connected layer using only the data set with
strong labels, together with data augmentation techniques.
We obtained a classification accuracy of 50 %. Although
this result is still far from the state of the art of 65 % in ,
it is largely above the chance baseline (12.5 % for eight
balanced classes) and shows that the neural network approach
is promising, with much room for improvement given the
simplicity of this network.
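The data augmentation can be illustrated with two standard label-preserving transforms (a sketch with assumed image sizes; the exact transforms and parameters used in our experiments may differ): random horizontal flips and random crops.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_flip_crop(image, crop=28):
    """Random horizontal flip followed by a random crop: two
    label-preserving transforms that enlarge a small data set."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]          # horizontal flip
    h, w, _ = image.shape
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    return image[top:top + crop, left:left + crop, :]

img = rng.random((32, 32, 3))              # synthetic RGB image
patch = random_flip_crop(img)
```

Each training image thus yields many slightly different variants, which reduces overfitting when labeled examples are scarce.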
The main expected contribution of this work is im-
proving the performance of affective image analysis sys-
tems through the application of deep convolutional neural
networks, which is a signiﬁcant change with respect to
the standard approach found in the literature of employing
hand-crafted features. We expect deep learning methods
to not only achieve better classification performance, but
also provide new insights about the type of visual information
that plays a role in emotion elicitation. We also expect to
contribute to the ﬁeld by making an extensive exploration
of new neural network architectures that are particularly
suitable for affective content.
To date, the PhD candidate has contributed to the ﬁeld
of affective computing with research on aesthetics recog-
nition from videos , proposing successful hand-crafted
features based on psychology and ﬁlm theory and providing
a systematic quantitative comparison of them . Aural
descriptors were proposed as well in . Similar descrip-
tors have proved to be suitable as well for the prediction of
emotion and attention in videos, labeled from electrodermal
activity measurements .
The current research laboratory of the candidate has
long experience in eye-movement research (see  for a
recent review) and has conducted experiments exploring the
connections between emotions and human visual attention
, , an area that we will further explore in combina-
tion with the results from the deep learning experiments.
Acknowledgments

This project has received funding from the European
Union’s Horizon 2020 research and innovation programme
under the Marie Skłodowska-Curie grant agreement 641805.
References

 C. Darwin, “The expression of the emotions in man and animals,”
London, UK: John Murray, 1872.
 J. LeDoux, “Rethinking the emotional brain,” Neuron, vol. 73, no. 4,
pp. 653–676, 2012.
 J. E. LeDoux and R. Brown, “A higher-order theory of emotional
consciousness,” Proceedings of the National Academy of Sciences, 2017.
 N. Bianchi-Berthouze and C. L. Lisetti, “Modeling multimodal ex-
pression of users affective subjective experience,” User modeling and
user-adapted interaction, vol. 12, no. 1, pp. 49–84, 2002.
 R. W. Picard, Affective computing. Cambridge, MA: MIT Press, 1997.
 M. Soleymani, Y.-H. Yang, G. Irie, A. Hanjalic et al., “Challenges and
perspectives for affective analysis in multimedia,” IEEE Transactions
on Affective Computing, vol. 6, no. 3, pp. 206–208, 2015.
 K. R. Scherer, “What are emotions? and how can they be measured?”
Social science information, vol. 44, no. 4, pp. 695–729, 2005.
 R. B. Zajonc, “Feeling and thinking: Preferences need no inferences.”
American Psychologist, vol. 35, no. 2, pp. 151–175, 1980.
 R. S. Lazarus, “A cognitivist’s reply to Zajonc on emotion and
cognition,” American Psychologist, 1981.
 A. R. Damasio, Descartes’ error: emotion, reason, and the human
brain. New York: Avon Books, 1994.
 Y. Baveye, C. Chamaret, E. Dellandréa, and L. Chen, “Affective video
content analysis: A multidisciplinary insight,” IEEE Transactions on
Affective Computing, 2017.
 J. Machajdik and A. Hanbury, “Affective image classiﬁcation using
features inspired by psychology and art theory,” in Proceedings of
the 18th ACM international conference on Multimedia. ACM, 2010.
 X. Wang, J. Jia, J. Yin, and L. Cai, “Interpretable aesthetic features
for affective image classiﬁcation,” in Image Processing (ICIP), 2013
20th IEEE International Conference on. IEEE, 2013, pp. 3230–3234.
 A. Hanjalic and L.-Q. Xu, “Affective video content representation
and modeling,” IEEE Transactions on Multimedia, vol. 7, no. 1, 2005.
 Q. Wu, C. Zhou, and C. Wang, “Content-based affective image
classiﬁcation and retrieval using support vector machines,” in Interna-
tional Conference on Affective Computing and Intelligent Interaction.
Springer, 2005, pp. 239–247.
 D. Joshi, R. Datta, E. Fedorovskaya, Q.-T. Luong, J. Z. Wang,
J. Li, and J. Luo, “Aesthetics and emotions in images,” IEEE Signal
Processing Magazine, vol. 28, no. 5, pp. 94–115, 2011.
 R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature
hierarchies for accurate object detection and semantic segmentation,”
in Proceedings of the IEEE conference on computer vision and
pattern recognition, 2014, pp. 580–587.
 J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei,
“ImageNet: A large-scale hierarchical image database,” in Computer
Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference
on. IEEE, 2009, pp. 248–255.
 Q. You, J. Luo, H. Jin, and J. Yang, “Building a large scale dataset
for image emotion recognition: The ﬁne print and the benchmark,”
arXiv preprint arXiv:1605.02677, 2016.
 C. Busso, Z. Deng, S. Yildirim, M. Bulut, C. M. Lee, A. Kazemzadeh,
S. Lee, U. Neumann, and S. Narayanan, “Analysis of emotion recog-
nition using facial expressions, speech and multimodal information,”
in Proceedings of the 6th international conference on Multimodal
interfaces. ACM, 2004, pp. 205–211.
 M. Soleymani, G. Chanel, J. J. M. Kierkels, and T. Pun, “Affective
characterization of movie scenes based on multimedia content anal-
ysis and user’s physiological emotional responses.” in ISM. IEEE
Computer Society, 2008, pp. 228–235.
 N. Bianchi-Berthouze, “K-dime: an affective image ﬁltering system,”
IEEE MultiMedia, vol. 10, no. 3, pp. 103–106, 2003.
 R. Datta, D. Joshi, J. Li, and J. Wang, “Studying aesthetics in
photographic images using a computational approach,” Computer
Vision–ECCV 2006, pp. 288–301, 2006.
 F. Fernández-Martínez, A. Hernández-García, and F. Díaz-de-María,
“Succeeding metadata based annotation scheme and visual tips for the
automatic assessment of video aesthetic quality in car commercials.”
Expert Systems with Applications, pp. 293–305, 2015.
 R. A. Calvo and S. D’Mello, “Affect detection: An interdisciplinary
review of models, methods, and their applications,” IEEE Transac-
tions on affective computing, vol. 1, no. 1, pp. 18–37, 2010.
 Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol.
521, no. 7553, pp. 436–444, 2015.
 J. Schmidhuber, “Deep learning in neural networks: An overview,”
Neural networks, vol. 61, pp. 85–117, 2015.
 M. Soleymani, S. Asghari-Esfeden, Y. Fu, and M. Pantic, “Analysis of
EEG signals and facial expressions for continuous emotion detection,”
IEEE Transactions on Affective Computing, vol. 7, no. 1, pp. 17–28,
 Y. Baveye, E. Dellandréa, C. Chamaret, and L. Chen, “Deep learning
vs. kernel methods: Performance for emotion prediction in videos,”
in Affective Computing and Intelligent Interaction (ACII), 2015 In-
ternational Conference on. IEEE, 2015, pp. 77–83.
 W. Zheng, J. Yu, and Y. Zou, “An experimental study of speech
emotion recognition based on deep convolutional neural networks,”
in Affective Computing and Intelligent Interaction (ACII), 2015 In-
ternational Conference on. IEEE, 2015, pp. 827–831.
 X. Alameda-Pineda, E. Ricci, Y. Yan, and N. Sebe, “Recognizing
emotions from abstract paintings using non-linear matrix completion,”
in Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2016, pp. 5240–5248.
 T. Rao, M. Xu, and D. Xu, “Learning multi-level deep representations
for image emotion classiﬁcation,” arXiv preprint arXiv:1611.07145,
 Y. Bengio, D.-H. Lee, J. Bornschein, T. Mesnard, and Z. Lin,
“Towards biologically plausible deep learning,” arXiv preprint
 A. Hernández-García, F. Fernández-Martínez, and F. Díaz-de-María,
“Comparing visual descriptors and automatic rating strategies for
video aesthetics prediction,” Signal Processing: Image Communica-
tion, vol. 47, pp. 280–288, 2016.
 P. J. Lang, M. M. Bradley, and B. N. Cuthbert, “International affec-
tive picture system (IAPS): Technical manual and affective ratings,”
NIMH Center for the Study of Emotion and Attention, pp. 39–58,
 T.-S. Park and B.-T. Zhang, “Consensus analysis and modeling of
visual aesthetic perception,” IEEE Transactions on Affective Comput-
ing, vol. 6, no. 3, pp. 272–285, 2015.
 N. Li, Y. Xia, and Y. Xia, “Semi-supervised emotional classiﬁcation
of color images by learning from cloud,” in Affective Computing
and Intelligent Interaction (ACII), 2015 International Conference on.
IEEE, 2015, pp. 84–90.
 J. Masci, U. Meier, D. Cireşan, and J. Schmidhuber, “Stacked convo-
lutional auto-encoders for hierarchical feature extraction,” Artificial
Neural Networks and Machine Learning–ICANN 2011, pp. 52–59, 2011.
 P. König, N. Wilming, T. C. Kietzmann, J. P. Ossandón, S. Onat,
B. V. Ehinger, R. R. Gameiro, and K. Kaspar, “Eye movements as a
window to cognitive processes,” Journal of Eye Movement Research,
vol. 9, pp. 1–16, 2016.
 K. Kaspar and P. König, “Emotions and personality traits as high-
level factors in visual attention: a review,” Frontiers in Human Neu-
roscience, vol. 6, p. 321, 2012.
 G. Montavon, S. Lapuschkin, A. Binder, W. Samek, and K.-R. Müller,
“Explaining nonlinear classiﬁcation decisions with deep taylor de-
composition,” Pattern Recognition, vol. 65, pp. 211–222, 2017.
 J. A. Mikels, B. L. Fredrickson, G. R. Larkin, C. M. Lindberg, S. J.
Maglio, and P. A. Reuter-Lorenz, “Emotional category data on images
from the international affective picture system,” Behavior research
methods, vol. 37, no. 4, pp. 626–630, 2005.
 F. Fernández-Martínez, A. Hernández-García, A. Gallardo-Antolín,
and F. Díaz-de-María, “Combining audio-visual features for viewers’
perception classification of YouTube car commercials,” 2014.
 A. Hernández-García, F. Fernández-Martínez, and F. Díaz-de-María,
“Emotion and attention: predicting electrodermal activity through
video visual descriptors,” in Workshop on Affective Computing and
Emotion Recognition–ACER 2017, under review, 2017.
 K. Kaspar, T.-M. Hloucal, J. Kriz, S. Canzler, R. R. Gameiro,
V. Krapp, and P. König, “Emotions’ impact on viewing behavior under
natural conditions,” PloS one, vol. 8, no. 1, p. e52737, 2013.