Perceived Emotion from Images through Deep Neural Networks
Alex Hernández-García
Institute of Cognitive Science
University of Osnabrück, Germany
Abstract—One of the goals of affective computing is predicting the emotional response of people elicited by multimedia content. Although remarkable steps have been made in the field, the problem remains open. Emotions are conveyed by many and varied factors, from very low-level cues, such as the colors of the stimulus, to high-level aspects, such as the semantics of the scene. One of the main challenges is that even though some of these factors are known to neuroscientists, computer scientists or artists, many of the stimulus features that play a role in eliciting emotion probably remain unknown. The recent success of deep learning methods, which are able to automatically learn relevant stimulus representations, seems to set a promising path to follow. Here we will explore new deep neural architectures suitable for affective content analysis, together with semi-supervised models that help handle the relative lack of available data and the uncertainty and subjectivity of the annotations.
1. Motivation
The study of emotion was one of the first research areas
in psychology and neuroscience [1] and today it receives renewed and increasing attention [2]. The reason is probably that emotions are sometimes regarded as “the most
significant events in our lives” [3]. They play a central
role in communication, problem-solving, decision-making
and other aspects of cognition [4]. The awareness of the
importance of emotions in everyday lives and particularly
in the interaction with technology has therefore drawn the
attention of other fields such as artificial intelligence or
affective computing [5].
Since the early days of affective computing, the analysis
of the expected emotional response of viewers of multimedia
content, often referred to as affective content analysis, has
been one of its main areas of research, and these days it receives ever-increasing attention from the community [6].
Especially considering the increasing amount of available
audiovisual content, such a technology is becoming crucial
for successful systems that involve human-computer inter-
action and multimedia retrieval and recommendation. Other
applications include mental healthcare, emotional interfaces
for social networks, advertising and audiovisual production.
In spite of the many works on emotion, there is still
no consensus about what emotions are [7] and the exact
definition and description has been the subject of intense
debate [8], [9]. In this research we will simply consider the
wide and comprehensive definition given by Antonio Dama-
sio [10]: emotions are “a collection of changes occurring in
both brain and body, usually prompted by a particular mental
content”. Such mental content can be in turn triggered by
visual stimuli.
How and which visual stimuli elicit certain emotional responses are some of the main research questions we aim to shed light on with this work. Finding an accurate
answer to these questions is however challenging due to
the complexity and subjectivity of emotions. We believe
that it requires the contribution of different disciplines, not
only psychology and neuroscience, but also art or artificial
intelligence [11]. Hence, here we propose to combine the
knowledge from machine learning and cognitive science to
design algorithms that can learn meaningful representations
of images and predict the induced emotional response on the
viewers. Analyzing these representations and the models can
be in turn insightful to understand how emotional stimuli are
perceived by human observers and how computers can learn
about emotions and affect.
Affective content analysis has been traditionally ap-
proached by relying on hand-crafted audiovisual features,
often inspired by psychology or aesthetics theories [12].
These features are then used to train a machine learning
classifier, such as SVM or logistic regression, to learn the
induced emotion by images or videos [13], [14]. A survey
on the literature of affective content analysis shows that
relative success has been achieved by using either low-level
[15], mid-level [16] or high-level features [12]. This only
shows that visual emotion elicitation is triggered by many
and varied factors and perhaps hand-crafted features alone
cannot provide a solid solution to the problem.
This prompts us to hypothesize that convolutional neural
networks (CNNs) can be a promising tool for affective visual
content analysis. Artificial deep neural networks are able
to automatically learn hierarchical representations of their
inputs [17], from very low- to high-level features. Some
of the descriptors learned by a neural network might be
similar to the hand-crafted features already proposed in the
literature, but there is also a great potential to find relevant
features that are still unknown or ignored by experts.
Automatically learning these features comes with the
need for large sets of labeled data, which in the case of
affective content is a high price to pay. By way of illus-
tration, the ImageNet data set that has enabled the great
success of CNNs on object recognition consists of over
10 million labeled images [18]. For this reason we also
propose to apply semi-supervised learning methods that are
able to learn useful representations from unlabeled data
and combine them with features learned from the available
labeled data sets [19].
2. Related work
Much research in affective computing has been devoted
to the recognition of emotions from face images [20], where
the challenge is detecting the emotion from information
coming directly from the subject. A similar concept but
using different data sources is the affect detection from
psychophysiological signals, such as electrodermal activity
[21]. One of the first works where emotional information
is detected directly from image content can be found in
[22], where low-level visual features are mapped to natural
language. In [15] the standard approach of training an SVM
classifier with low-level descriptors was already followed.
A similar procedure applied to video data can be found in
[14]. A related popular research area is the recognition of
aesthetics from image and video content [23], [24]. See [25]
for a general review of models, methods and applications of
affective detection.
The great success of deep learning [26], [27] in computer
vision, natural language understanding, speech recognition
and other fields has recently started drawing the attention
of affective computing as well. In [28] recurrent neural
networks were used to model the variation of emotion over
time from EEG signals and facial expressions. One of the
first applications of deep learning for affective video content
analysis can be found in [29], with promising results, and deep networks have also been applied to speech emotion recognition in [30].
Regarding emotion recognition from images, which is
the focus of this research, convolutional neural networks
have only recently been employed in [31], where emotions
are recognized from abstract paintings and in [19], where a
large-scale data set with emotion annotations was presented.
State-of-the-art results on this data set using a combination
of CNNs were recently presented in [32].
3. Description of the research idea
Given the early stage in the application of deep learning
methods on affective image content analysis, we propose to
further explore the use of neural networks to predict the ex-
pected emotion elicited by natural images on their viewers.
Convolutional neural networks (CNNs) have demonstrated
an exceptional performance in other computer vision tasks
like object recognition since 2012, when they started achieving substantially better results than traditional machine learning methods. For a review of the history, techniques and
achievements of deep learning methods in different fields
see [26], [27]. The good results obtained by deep neural networks in object recognition tasks lead us to hypothesize that they have a high potential in emotion recognition as well.
In contrast to currently standard methods, where hand-
crafted visual descriptors are proposed to train classifiers
[12], deep neural networks can be trained by directly using
the original digital images as inputs. The networks, typically
organized by interconnected layers of artificial neurons, are
trained through backpropagation, which iteratively updates
the weights of the neurons to minimize a training prediction
error. In the case of convolutional neural networks (CNNs),
[17] and similar works have shown that these learned
weights form kernels that can be interpreted as feature
hierarchies. That is, neurons in early layers automatically
learn low-level representations, whereas higher layers learn
mid- to high-level features, in a relatively comparable way
to how the visual cortex is organized [33].
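The kernels learned in a CNN's first layers typically resemble simple filters such as edge detectors. The following minimal sketch shows the underlying convolution operation, with a hand-crafted edge kernel standing in for a learned one (a toy illustration, not the networks used in this work):

```python
# Minimal 2D convolution, the core operation of a CNN layer.
def conv2d(image, kernel):
    """Valid cross-correlation of a 2D image with a 2D kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            out[i][j] = sum(
                image[i + u][j + v] * kernel[u][v]
                for u in range(kh) for v in range(kw)
            )
    return out

# A hand-crafted vertical-edge kernel; in a trained CNN, similar kernels
# emerge automatically in the early layers via backpropagation.
edge_kernel = [[1.0, -1.0],
               [1.0, -1.0]]

# A 4x4 image: dark left half (0), bright right half (1).
image = [[0, 0, 1, 1]] * 4

feature_map = conv2d(image, edge_kernel)
# The response is nonzero only at the vertical boundary between halves.
print(feature_map[0])  # → [0.0, -2.0, 0.0]
```

Deeper layers then combine such feature maps into progressively more abstract representations.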
If neural networks trained to recognize objects on images
can learn efficient, meaningful and useful features for that
task, training them to recognize emotions might very well
generate valuable representations that improve the accuracy
of affective content analysis methods. It is possible that
some of these representations resemble the features that have
been traditionally hand-crafted for affective computing tasks
[12], [13], [34], whereas others might still be unknown or
ignored by the experts. This area is worth exploring, as it
could contribute to augment the knowledge about emotion
elicitation for fields like psychology and visual arts.
One disadvantage of current supervised deep neural
networks is that they require many more labeled examples
than other machine learning methods in order to enable
learning and yield good generalization performance. In the
case of affective content analysis, labeling data is a costly
and subjective task: there is often significant inter-subject
variability in the report of perceived affect. Therefore, af-
fective data sets usually exhibit two undesirable properties:
1) reduced number of examples [35] and 2) noisy labels
[36]. By way of comparison, the well-known large database
for object recognition, ImageNet [18], contains millions of
images with highly reliable labels. Therefore, the application
of deep learning to affective content analysis poses particular
challenges: rethinking the typical procedures to find ways to
handle both the lack of data and the noisy labels.
In order to compensate for the lack of labeled data,
we propose to make use of semi-supervised methods. Semi-
supervised machine learning, which has been employed
in affective computing in [37], consists in enhancing the
learning process of a supervised model by making use
of a large set of unlabeled examples. In the context of
deep learning, such an approach can be addressed through
autoencoders [38], among other methods. We hypothesize
that it is possible to learn useful emotional representations
from images similar to the labeled ones, i.e. coming from a
similar probability distribution, and then use these features
to enhance the performance of a classifier fine-tuned with
labeled images. In order to handle the noisy nature of
the labels, we propose adding uncertainty information that can regularize the learning process of purely supervised models.

Figure 1. Example images from the data set. Top row, positive emotions: (a) amusement, (b) awe, (c) contentment, (d) excitement. Bottom row, negative emotions: (e) anger, (f) disgust, (g) fear, (h) sadness.
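The semi-supervised idea sketched above can be illustrated with a toy autoencoder: a linear encoder compresses the input to a low-dimensional code and a linear decoder reconstructs it, both trained by plain gradient descent on the reconstruction error. This is only a minimal sketch; a real pipeline would use deep convolutional autoencoders on unlabeled images [38].

```python
# Toy linear autoencoder: 2-D inputs, 1-D bottleneck, trained by
# gradient descent on the reconstruction error. Illustrative only.
data = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]  # rank-1 "unlabeled" data

a = b = c = d = 0.5  # encoder weights (a, b); decoder weights (c, d)
lr = 0.01

def loss():
    total = 0.0
    for x1, x2 in data:
        h = a * x1 + b * x2          # encode to a 1-D code
        r1, r2 = c * h, d * h        # decode back to 2-D
        total += (x1 - r1) ** 2 + (x2 - r2) ** 2
    return total

for _ in range(200):
    ga = gb = gc = gd = 0.0
    for x1, x2 in data:
        h = a * x1 + b * x2
        e1, e2 = x1 - c * h, x2 - d * h   # reconstruction errors
        gc += -2 * e1 * h
        gd += -2 * e2 * h
        gh = -2 * e1 * c - 2 * e2 * d     # backprop through the decoder
        ga += gh * x1
        gb += gh * x2
    a -= lr * ga; b -= lr * gb; c -= lr * gc; d -= lr * gd

print(round(loss(), 6))  # near 0: the 1-D code captures the data structure
```

Because the reconstruction target is the input itself, no labels are needed, which is what makes the approach attractive for affective data.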
Finally, an important part of the proposal is performing
an analysis beyond the machine learning results. We believe
that the study of emotions should be approached in an inter-
disciplinary way [11]. Therefore we will combine insights
from cognitive science, in particular our experience with eye-movement research [39], [40], to gain more valuable
knowledge about visual emotion perception. For example,
we plan to study the correlation between eye movements
and relevant regions for the neural network prediction [41].
Additionally, we will base the network architecture experimentation on neurobiology concepts, which have provided inspiration to deep learning in the past [33].
4. Methodology
Even though we propose to apply semi-supervised techniques, a large amount of labeled images is still required. Recently,
a new data set of images for emotion recognition was pub-
lished, containing a considerably larger number of images
than the previously existing databases [19].
This data set consists of images from Flickr and Instagram that were tagged with one of the 8 emotions in [42] (amusement, awe, contentment, excitement, anger, disgust, fear and sadness) as keywords. An original set of 90,000
images was submitted to Amazon Mechanical Turk, where
5 people validated the tagged emotion for each image. After the validation process, 23,000 images got 3 or more favorable votes and became part of what the authors called the strongly labeled set. Figure 1 shows some sample images
from the data set.
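The validation scheme described above (five AMT validators per image, three or more favorable votes for a strong label) can be sketched as follows, with hypothetical vote counts in place of the real data:

```python
# Partition images into strongly and weakly labeled sets by the number
# of AMT workers (out of 5) who agreed with the original emotion tag.
# The image identifiers and vote counts are illustrative, not real data.
votes = {"img_01": 5, "img_02": 3, "img_03": 1, "img_04": 4, "img_05": 2}

strong = {img for img, agree in votes.items() if agree >= 3}
weak = set(votes) - strong

print(sorted(strong))  # → ['img_01', 'img_02', 'img_04']
```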
It is important to note that even the so-called strong labels carry some degree of uncertainty, so we propose taking
this into consideration and treating the annotations as noisy
labels, for example by softening them or establishing some
confidence by making use of the validation information.
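One possible way to soften the labels is to interpolate between the one-hot tagged emotion and a uniform distribution, weighted by the vote-based confidence. The following sketch assumes this linear weighting scheme for illustration; it is not the specific method of [19]:

```python
# Turn the 0-5 agreement votes into soft labels over the 8 emotions:
# high agreement keeps the label close to one-hot, low agreement pushes
# it toward uniform. The interpolation scheme is an assumption.
EMOTIONS = ["amusement", "awe", "contentment", "excitement",
            "anger", "disgust", "fear", "sadness"]

def soft_label(tag, agree_votes, total_votes=5):
    confidence = agree_votes / total_votes           # in [0, 1]
    uniform = 1.0 / len(EMOTIONS)
    return [confidence * (1.0 if e == tag else 0.0)
            + (1.0 - confidence) * uniform
            for e in EMOTIONS]

label = soft_label("fear", agree_votes=4)
print(round(label[EMOTIONS.index("fear")], 3))  # → 0.825
```

Such soft targets act as a regularizer during supervised training, preventing the network from fitting noisy one-hot annotations too confidently.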
The rest of the images with weak labels will be used as
input for training unsupervised models that can potentially
enhance the classification. We also plan to enrich this new
data set with images from other existing databases from the
literature, such as IAPS [35].
In order to carry out our proposal and achieve successful
results we will first systematically train different neural
networks. Suitable architectures for dealing with emotional
content might be different to the ones that work well for
object recognition. Therefore, we will explore new archi-
tectures with an emphasis on biologically inspired models.
For example, in [32] the baseline prediction of typical neural
networks is improved with an architecture that considers the
different nature of features that potentially have an influence
on the elicited emotion.
In a subsequent experimentation phase we will try to
enhance the performance with semi-supervised training, as
explained above. Finally, we will perform an analysis of the
learned features through network visualization techniques
aiming at widening the knowledge about emotion elicitation
in psychology, affective computing and other fields.
5. Preliminary results and tentative plans
Although this research project is still at an early stage
and we have not obtained state-of-the-art results so far, we
have first performed an analysis of the data annotations
from [19] that has given us some insight and inspiration for
the future experimentation. Quantitatively, we have assessed the amount of uncertainty from the keyword validation process by analyzing the level of agreement for the different emotions.

Figure 2. Histograms showing the number of images that got 0, 1, ..., 5 agreement/disagreement votes in the validation process for annotating the data set: (a) amusement, (b) disgust.
Figure 2 shows the analysis for two of the emotions,
amusement and disgust (the rest of the emotions are not
shown due to space limitations). First, the histograms show
that there is a large number of images whose original
emotion keyword was not validated by the majority of the
voters, especially in the cases of disgust, awe, fear and
anger, which all have a similar distribution. Second, even
when most voters validated the emotion keyword, there are
only a few images on which the viewers made a unanimous
vote, which is an indicator of the relative uncertainty of the annotations.
In view of this analysis, one may think that the labels
might not be reliable. However, by manually inspecting
some of the images with strong labels, we have found that in
most cases the labels are reasonable, which is an indicator
that neural networks can learn useful representations from
the images and achieve good performance.
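The agreement analysis behind Figure 2 amounts to histogramming the favorable votes per image; a minimal sketch with illustrative counts:

```python
# Sketch of the agreement analysis behind Figure 2: count how many
# images received 0..5 favorable votes. The vote list is illustrative.
from collections import Counter

agreement_votes = [5, 3, 1, 4, 2, 3, 0, 5, 3, 1]  # one entry per image

histogram = Counter(agreement_votes)
for n_votes in range(6):
    print(n_votes, histogram.get(n_votes, 0))

# Unanimous validations are rarer than simple-majority validations:
unanimous = histogram[5]
majority = sum(histogram[v] for v in range(3, 6))
print(unanimous, majority)  # → 2 6
```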
Regarding the experimentation, up to now we have only
trained a small network with three convolutional layers and one fully connected layer, using only the data set with strong labels, together with data augmentation techniques.
We obtained a classification accuracy of 50%. Although this result is still far from the state-of-the-art 65% in [32], it is largely above the chance baseline and shows that the neural network approach is promising and there is a lot of room for improvement, given the simplicity of this network.
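The data augmentation step mentioned above can be illustrated with two common transformations, assuming images represented as nested lists of pixel values (a real pipeline would operate on image tensors):

```python
# Two common augmentations used to stretch a small labeled set:
# horizontal flipping and offset cropping. Purely illustrative.
import random

def hflip(img):
    """Mirror an image left-to-right."""
    return [row[::-1] for row in img]

def crop(img, size, top, left):
    """Extract a size x size window starting at (top, left)."""
    return [row[left:left + size] for row in img[top:top + size]]

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]

print(hflip(img)[0])       # → [3, 2, 1]
print(crop(img, 2, 1, 1))  # → [[5, 6], [8, 9]]

# Each labeled image yields several training examples: the original,
# its flip, and a few random crops.
random.seed(0)
augmented = [img, hflip(img)] + [
    crop(img, 2, random.randint(0, 1), random.randint(0, 1))
    for _ in range(2)
]
print(len(augmented))  # → 4
```

Augmentation multiplies the effective size of the strongly labeled set without requiring any new annotations.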
6. Contributions
The main expected contribution of this work is im-
proving the performance of affective image analysis sys-
tems through the application of deep convolutional neural
networks, a significant departure from the standard approach in the literature of employing
hand-crafted features. We expect that deep learning methods
not only achieve better classification performance, but also
provide new insights about the type of visual information
that plays a role in emotion elicitation. We also expect to
contribute to the field by making an extensive exploration
of new neural network architectures that are particularly
suitable for affective content.
To date, the PhD candidate has contributed to the field
of affective computing with research on aesthetics recog-
nition from videos [24], proposing successful hand-crafted
features based on psychology and film theory and providing
a systematic quantitative comparison of them [34]. Aural descriptors were proposed in [43]. Similar descriptors have also proved suitable for the prediction of emotion and attention in videos, labeled from electrodermal activity measurements [44].
The current research laboratory of the candidate has long experience in eye-movement research (see [39] for a
recent review) and has conducted experiments exploring the
connections between emotions and human visual attention
[40], [45], an area that we will further explore in combina-
tion with the results from the deep learning experiments.
Acknowledgments
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement 641805.
References
[1] C. Darwin, The Expression of the Emotions in Man and Animals. London, UK: John Murray, 1872.
[2] J. LeDoux, “Rethinking the emotional brain,” Neuron, vol. 73, no. 4,
pp. 653–676, 2012.
[3] J. E. LeDoux and R. Brown, “A higher-order theory of emotional
consciousness,” Proceedings of the National Academy of Sciences, p.
201619316, 2017.
[4] N. Bianchi-Berthouze and C. L. Lisetti, “Modeling multimodal ex-
pression of users affective subjective experience,” User modeling and
user-adapted interaction, vol. 12, no. 1, pp. 49–84, 2002.
[5] R. W. Picard, Affective Computing. Cambridge, MA: MIT Press, 1997.
[6] M. Soleymani, Y.-H. Yang, G. Irie, A. Hanjalic et al., “Challenges and perspectives for affective analysis in multimedia,” IEEE Transactions on Affective Computing, vol. 6, no. 3, pp. 206–208, 2015.
[7] K. R. Scherer, “What are emotions? and how can they be measured?”
Social science information, vol. 44, no. 4, pp. 695–729, 2005.
[8] R. B. Zajonc, “Feeling and thinking: Preferences need no inferences,” American Psychologist, vol. 35, no. 2, pp. 151–175, 1980.
[9] R. S. Lazarus, “A cognitivist’s reply to Zajonc on emotion and cognition,” 1981.
[10] A. R. Damasio, Descartes’ error: emotion, reason, and the human
brain. New York: Avon Books, 1994.
[11] Y. Baveye, C. Chamaret, E. Dellandréa, and L. Chen, “Affective video
content analysis: A multidisciplinary insight,” IEEE Transactions on
Affective Computing, 2017.
[12] J. Machajdik and A. Hanbury, “Affective image classification using
features inspired by psychology and art theory,” in Proceedings of
the 18th ACM international conference on Multimedia. ACM, 2010,
pp. 83–92.
[13] X. Wang, J. Jia, J. Yin, and L. Cai, “Interpretable aesthetic features
for affective image classification,” in Image Processing (ICIP), 2013
20th IEEE International Conference on. IEEE, 2013, pp. 3230–3234.
[14] A. Hanjalic and L.-Q. Xu, “Affective video content representation
and modeling,” IEEE transactions on multimedia, vol. 7, no. 1, pp.
143–154, 2005.
[15] Q. Wu, C. Zhou, and C. Wang, “Content-based affective image
classification and retrieval using support vector machines,” in Interna-
tional Conference on Affective Computing and Intelligent Interaction.
Springer, 2005, pp. 239–247.
[16] D. Joshi, R. Datta, E. Fedorovskaya, Q.-T. Luong, J. Z. Wang,
J. Li, and J. Luo, “Aesthetics and emotions in images,” IEEE Signal
Processing Magazine, vol. 28, no. 5, pp. 94–115, 2011.
[17] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature
hierarchies for accurate object detection and semantic segmentation,”
in Proceedings of the IEEE conference on computer vision and
pattern recognition, 2014, pp. 580–587.
[18] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei,
“ImageNet: A large-scale hierarchical image database,” in Computer
Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference
on. IEEE, 2009, pp. 248–255.
[19] Q. You, J. Luo, H. Jin, and J. Yang, “Building a large scale dataset
for image emotion recognition: The fine print and the benchmark,”
arXiv preprint arXiv:1605.02677, 2016.
[20] C. Busso, Z. Deng, S. Yildirim, M. Bulut, C. M. Lee, A. Kazemzadeh,
S. Lee, U. Neumann, and S. Narayanan, “Analysis of emotion recog-
nition using facial expressions, speech and multimodal information,”
in Proceedings of the 6th international conference on Multimodal
interfaces. ACM, 2004, pp. 205–211.
[21] M. Soleymani, G. Chanel, J. J. M. Kierkels, and T. Pun, “Affective
characterization of movie scenes based on multimedia content anal-
ysis and user’s physiological emotional responses.” in ISM. IEEE
Computer Society, 2008, pp. 228–235.
[22] N. Bianchi-Berthouze, “K-DIME: An affective image filtering system,” IEEE MultiMedia, vol. 10, no. 3, pp. 103–106, 2003.
[23] R. Datta, D. Joshi, J. Li, and J. Wang, “Studying aesthetics in
photographic images using a computational approach,” Computer
Vision–ECCV 2006, pp. 288–301, 2006.
[24] F. Fernández-Martínez, A. Hernández-García, and F. Díaz-de-María, “Succeeding metadata based annotation scheme and visual tips for the automatic assessment of video aesthetic quality in car commercials,” Expert Systems with Applications, pp. 293–305, 2015.
[25] R. A. Calvo and S. D’Mello, “Affect detection: An interdisciplinary
review of models, methods, and their applications,” IEEE Transactions on Affective Computing, vol. 1, no. 1, pp. 18–37, 2010.
[26] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol.
521, no. 7553, pp. 436–444, 2015.
[27] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural Networks, vol. 61, pp. 85–117, 2015.
[28] M. Soleymani, S. Asghari-Esfeden, Y. Fu, and M. Pantic, “Analysis of EEG signals and facial expressions for continuous emotion detection,” IEEE Transactions on Affective Computing, vol. 7, no. 1, pp. 17–28, 2016.
[29] Y. Baveye, E. Dellandréa, C. Chamaret, and L. Chen, “Deep learning
vs. kernel methods: Performance for emotion prediction in videos,”
in Affective Computing and Intelligent Interaction (ACII), 2015 In-
ternational Conference on. IEEE, 2015, pp. 77–83.
[30] W. Zheng, J. Yu, and Y. Zou, “An experimental study of speech emotion recognition based on deep convolutional neural networks,”
in Affective Computing and Intelligent Interaction (ACII), 2015 In-
ternational Conference on. IEEE, 2015, pp. 827–831.
[31] X. Alameda-Pineda, E. Ricci, Y. Yan, and N. Sebe, “Recognizing
emotions from abstract paintings using non-linear matrix completion,”
in Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2016, pp. 5240–5248.
[32] T. Rao, M. Xu, and D. Xu, “Learning multi-level deep representations for image emotion classification,” arXiv preprint arXiv:1611.07145, 2016.
[33] Y. Bengio, D.-H. Lee, J. Bornschein, T. Mesnard, and Z. Lin,
“Towards biologically plausible deep learning,” arXiv preprint
arXiv:1502.04156, 2015.
[34] A. Hernández-García, F. Fernández-Martínez, and F. Díaz-de-María,
“Comparing visual descriptors and automatic rating strategies for
video aesthetics prediction,” Signal Processing: Image Communica-
tion, vol. 47, pp. 280–288, 2016.
[35] P. J. Lang, M. M. Bradley, and B. N. Cuthbert, “International affec-
tive picture system (IAPS): Technical manual and affective ratings,”
NIMH Center for the Study of Emotion and Attention, pp. 39–58.
[36] T.-S. Park and B.-T. Zhang, “Consensus analysis and modeling of
visual aesthetic perception,” IEEE Transactions on Affective Comput-
ing, vol. 6, no. 3, pp. 272–285, 2015.
[37] N. Li, Y. Xia, and Y. Xia, “Semi-supervised emotional classification
of color images by learning from cloud,” in Affective Computing
and Intelligent Interaction (ACII), 2015 International Conference on.
IEEE, 2015, pp. 84–90.
[38] J. Masci, U. Meier, D. Cireşan, and J. Schmidhuber, “Stacked convolutional auto-encoders for hierarchical feature extraction,” in Artificial Neural Networks and Machine Learning–ICANN 2011, pp. 52–59, 2011.
[39] P. König, N. Wilming, T. C. Kietzmann, J. P. Ossandón, S. Onat, B. V. Ehinger, R. R. Gameiro, and K. Kaspar, “Eye movements as a window to cognitive processes,” J. Eye Mov. Res., vol. 9, pp. 1–16, 2016.
[40] K. Kaspar and P. König, “Emotions and personality traits as high-level factors in visual attention: A review,” Frontiers in Human Neuroscience, vol. 6, p. 321, 2012.
[41] G. Montavon, S. Lapuschkin, A. Binder, W. Samek, and K.-R. Müller, “Explaining nonlinear classification decisions with deep Taylor decomposition,” Pattern Recognition, vol. 65, pp. 211–222, 2017.
[42] J. A. Mikels, B. L. Fredrickson, G. R. Larkin, C. M. Lindberg, S. J.
Maglio, and P. A. Reuter-Lorenz, “Emotional category data on images
from the international affective picture system,” Behavior Research Methods, vol. 37, no. 4, pp. 626–630, 2005.
[43] F. Fernández-Martínez, A. Hernández-García, A. Gallardo-Antolín, and F. Díaz-de-María, “Combining audio-visual features for viewers’ perception classification of YouTube car commercials,” 2014.
[44] A. Hernández-García, F. Fernández-Martínez, and F. Díaz-de-María, “Emotion and attention: Predicting electrodermal activity through video visual descriptors,” in Workshop on Affective Computing and Emotion Recognition–ACER 2017, under review, 2017.
[45] K. Kaspar, T.-M. Hloucal, J. Kriz, S. Canzler, R. R. Gameiro,
V. Krapp, and P. König, “Emotions’ impact on viewing behavior under natural conditions,” PLoS ONE, vol. 8, no. 1, p. e52737, 2013.