Pororobot: A Deep Learning Robot That Plays Video Q&A Games
Kyung-Min Kim1, Chang-Jun Nan1, Jung-Woo Ha2, Yu-Jung Heo1, and Byoung-Tak Zhang1,3
1School of Computer Science and Engineering & 3Institute for Cognitive Science, Seoul National University, Seoul 151-744, Korea
2 NAVER LABS, NAVER Corp., Seongnam 463-867, Korea
{kmkim, cjnan, yjheo, btzhang}@bi.snu.ac.kr, jungwoo.ha@navercorp.com
Abstract
Recent progress in machine learning has led to great advancements in robot intelligence and human-robot interaction (HRI). It is reported that robots can deeply understand visual scene information and describe the scenes in natural language using object recognition and natural language processing methods. Image-based question and answering (Q&A) systems can be used for enhancing HRI. However, despite these successful results, several key issues still remain to be discussed and improved. In particular, it is essential for an agent to act in a dynamic, uncertain, and asynchronous environment for achieving human-level robot intelligence. In this paper, we propose a prototype system for a video Q&A robot, "Pororobot". The system uses state-of-the-art machine learning methods such as a deep concept hierarchy model. In our scenario, a robot and a child play a video Q&A game together in a real-world environment. Here we demonstrate preliminary results of the proposed system and discuss some directions for future work.
Introduction
Human-robot interaction (HRI) is an emerging field that strives to create socially interactive robots that help humans in various aspects, including healthcare (Fasola and Mataric 2013) and education (Kory and Breazeal 2014). Robots can now interact with children and support their educational development (Saerbeck et al. 2010; Kory et al. 2013; Fridin 2014), and personalized tutor robots have been shown to notably increase the effectiveness of tutoring (Leyzberg et al. 2014). In particular, recent advances in machine learning enable robots to deeply understand visual scene information and describe scenes in natural language at a level approaching that of humans (Fang et al. 2015; Karpathy and Fei-Fei 2015; Vinyals et al. 2015).
Many studies on image question and answering (Q&A) have also shown successful results (Gao et al. 2015; Malinowski et al. 2015; Ren et al. 2015) and improved the level of HRI.
Copyright © 2015, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
However, in order to interact with a human in an effective manner, a robot agent must deal with real-world environments that are dynamic, uncertain, and asynchronous, based on lifelong learning (Zhang 2013). In this paper, we propose a prototype system using state-of-the-art deep learning methods for a child-robot interaction scenario. The deep concept hierarchy (Ha et al. 2015) is used as a knowledge base, and deep learning methods including deep convolutional neural networks (CNNs) (Krizhevsky et al. 2012) and recurrent neural networks (RNNs) (Mikolov et al. 2013; Gao et al. 2015) are used for representing features of the video data and generating answers to the questions. The scenario is based on a video Q&A game in a real-world environment, where a child and a robot ask each other questions about the cartoon video content and answer them in turn. As a robot platform, we use the Nao Evolution V5. Nao can synthesize speech from text and recognize the human voice. For preliminary experiments, we generate questions from the video; the results, evaluated by BLEU score and human judges, show that the questions are reasonably appropriate to pose to the child and the robot. Furthermore, we discuss some directions of future work toward achieving human-level robot intelligence.
Video Q&A Game Robots
Video Q&A game robots could be a promising technology for improving children's early education in two respects. First, both the child and the robot can learn new concepts or knowledge from the video during the question-and-answering game. For example, the robot can learn an unknown fact by asking questions of the child, and vice versa. Second, the robot can help develop children's social abilities: the robot and the child share the same experience, i.e. watching a video, and interact with each other on the basis of that experience.
Figure 1: A Video Q&A Game Robot "Pororobot"
Figure 3: The Structure of Deep Concept Hierarchies
Here is an example scenario of a child-robot interaction while playing the video Q&A game. A child and a robot sit in front of a television and watch a cartoon video together for 10-15 minutes. Afterward, the robot asks the child some questions about the story (e.g. "First question, what did they do last night?"). The robot should be able to generate such questions from the observed video. If the child answers correctly, the robot confirms the answer and poses the next question (e.g. the robot claps and says, "Correct! Second question, what is the color of the sky?"). As the game goes on, the child learns more concepts from the video with the robot. Figure 1 shows an example of a video question-and-answering game played by a child and a robot "Pororobot".
Video Q&A System
The overall process of the video question and answering is described in Figure 2. The system has four parts: (i) a preprocessing part for feature generation, consisting of convolutional neural networks (CNNs) and recurrent neural networks (RNNs); (ii) deep concept hierarchies for learning visual-linguistic concepts and storing the knowledge from the video in a network structure; (iii) a vision-question conversion part; and (iv) a candidate microcode extraction component using RNNs. An RNN is used to generate the next word in the answer, similar to current image question-and-answering techniques (Gao et al. 2015).
The Four Parts of the System
(i) The preprocessing part converts a video into a set of visual-linguistic features. First, the video is converted into scene-subtitle pairs: whenever a subtitle appears in the video, the scene at that moment is captured. Each scene is then converted into a set of image patches using Regions with Convolutional Neural Network (R-CNN) features (Girshick et al. 2014). Each patch is represented by a CNN feature, a 4096-dimensional real-valued vector computed with the Caffe implementation (Jia 2013). A subtitle is converted into a sequence of words, and each word is represented by a real-valued vector; in this work, we use word2vec to encode the words (Mikolov et al. 2013).
Figure 2: Diagram of the Video Question & Answering Process
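As a rough illustration, the preprocessing step can be sketched as follows. The helper names and the word-embedding dimensionality are assumptions made for this sketch; real features would come from the R-CNN/Caffe pipeline and a trained word2vec model, which are stubbed out here with random vectors.

```python
import numpy as np

PATCH_DIM = 4096  # dimensionality of a Caffe CNN feature, as stated in the paper
WORD_DIM = 300    # word2vec dimensionality (an assumption; not given in the paper)

def extract_patch_features(patches):
    """Stand-in for the R-CNN + Caffe step: one 4096-d real-valued vector
    per image patch. A real system would run the CNN on each patch."""
    return [np.random.randn(PATCH_DIM) for _ in patches]

def embed_subtitle(subtitle, word_vectors):
    """Map each word of a subtitle to its word2vec vector, skipping
    out-of-vocabulary words."""
    return [word_vectors[w] for w in subtitle.lower().split() if w in word_vectors]

def preprocess(scene_subtitle_pairs, word_vectors):
    """Convert a video, given as a list of (patches, subtitle) pairs captured
    whenever a subtitle appears, into visual-linguistic features."""
    return [{"visual": extract_patch_features(patches),
             "linguistic": embed_subtitle(subtitle, word_vectors)}
            for patches, subtitle in scene_subtitle_pairs]
```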
(ii) The objective of the second part is to learn visual-linguistic knowledge, i.e. concepts or stories, in the video and store that knowledge in a network structure. For concept learning, we use a recently proposed algorithm, deep concept hierarchies (DCH). The overall structure of DCH is shown in Figure 3. DCH contains a large population of microcodes in the sparse population code (SPC) layer (layer h), which encodes the concrete textual words and image patches in the scenes and subtitles of the video (Zhang et al. 2012). The flexibility of the model structure allows it to incrementally learn new conceptual knowledge from the video (Ha et al. 2015). A node in concept layer 1 (layer c1) is a cluster of microcodes in layer h; as the model continuously observes the video, the number of nodes in c1 changes dynamically according to the similarities of the microcodes in each cluster. A node in concept layer 2 (layer c2) represents an observable character in the video. The number of c2 nodes matches the total number of characters, and the connections between the c1 and c2 layers are constructed according to the appearance of the characters in hm. The weight of the connection between a node in c1 and a node in c2 is proportional to the frequency of appearance of the character in hm. DCH stores the learned concepts in a network that serves as a knowledge base. The knowledge base is used to generate questions for a given video clip in the third part and to provide a hypothesis microcode set as input to the fourth part.
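As a toy sketch of the c1-c2 weighting just described (the data layout and function name are hypothetical, not from the paper): each c1 node is a cluster of microcodes, and its connection weight to a character's c2 node is proportional to how often that character appears in the cluster's microcodes.

```python
from collections import Counter

def c1_c2_weights(clusters, characters):
    """clusters: {c1_node_id: list of microcodes}, where a microcode is
    modeled here as a set of word/patch tags. Returns, for each c1 node,
    a weight per character proportional to its frequency of appearance."""
    weights = {}
    for node_id, microcodes in clusters.items():
        counts = Counter()
        for code in microcodes:
            for character in characters:
                if character in code:
                    counts[character] += 1
        total = sum(counts.values()) or 1  # avoid division by zero
        weights[node_id] = {ch: counts[ch] / total for ch in characters}
    return weights
```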
(iii) The third part converts scenes to questions using DCH. The conversion is similar to a machine translation problem. For this conversion, we build an additional question-and-answering dataset and convert it into scene-question pairs. Another DCH is used to learn the patterns between scenes and questions and to generate questions from the images. The vision-question translation can be formulated as follows:

q* = argmax_q P(q | r, θ) = argmax_q P(r | q, θ) P(q),   (1)

where θ is a DCH model, r is the scene representation, and q* are the best questions generated from the model.
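Read as a search problem, Eq. (1) scores each candidate question by a likelihood term times a prior, a noisy-channel-style decomposition. A minimal sketch, where the candidate set and the two scoring callables are assumptions for illustration (in the paper this role is played by a DCH model):

```python
def best_question(candidates, scene, likelihood, prior):
    """Pick q* = argmax_q P(scene | q) * P(q) over a finite candidate set.
    `likelihood(scene, q)` and `prior(q)` are assumed scoring functions."""
    return max(candidates, key=lambda q: likelihood(scene, q) * prior(q))
```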
(iv) The fourth part contains a candidate microcode extraction component and an RNN. The extraction component receives a question Q and selects a hypothesis microcode set m:

m = argmax_i s_m(Q, E_i),   (2)

where E_i is the i-th microcode in the DCH and s_m is a selection function. s_m converts Q into a set of distributed semantic vectors in the same embedding vector space used in the preprocessing part and computes the cosine similarity between Q and E_i. The selected hypothesis microcode set is then fed into the RNN to generate an answer:

a = argmax_w s_a(Q, [E_1 ... E_M], w),   (3)

where M is the total number of selected microcodes and s_a is an answer generation function that produces the next answer word w. In this paper, this function is similar to recent RNN techniques used in image question-and-answering problems (Gao et al. 2015).
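The selection function s_m in Eq. (2) amounts to ranking microcodes by cosine similarity to the embedded question. A minimal sketch (the vector inputs and the top-M cutoff are assumptions for illustration):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def select_microcodes(question_vec, microcode_vecs, top_m=3):
    """Rank microcodes E_i by cosine similarity to the embedded question Q
    and keep the indices of the top M as the hypothesis set for the RNN."""
    order = sorted(range(len(microcode_vecs)),
                   key=lambda i: cosine(question_vec, microcode_vecs[i]),
                   reverse=True)
    return order[:top_m]
```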
Experiment Design and Preliminary Results
Cartoon Video and Q&A Dataset Description
Cartoon videos are popular material for children's early language learning. They have succinct, explicit stories told with very simple images and easy words. These properties make cartoon videos a suitable test bed for video question and answering played by a child and a robot. For the experiment, we use the famous cartoon video 'Pororo', consisting of 183 episodes totaling 1,232 minutes.
We also create approximately 1,200 question-answer pairs from the 'Pororo' video by making five questions for every five minutes of video. In detail, there are two types of questions: one type can be answered using the image information alone (e.g. "What did Pororo and his friends do after eating?", "What did Eddy ride?"), and the other type needs additional information such as subtitles or the story (e.g. "Why did Pororo and his friends go into the cave?", "How was Loopy's cooking?").
Robotic Platform
For a robotic platform, we use the Nao Evolution V5, a 58-cm-tall humanoid robot. Nao is equipped with text-to-speech and face detection software and is mainly used in educational environments.
Video Q&A Game Environment
A child and a robot will play a video question-and-answering game. The game runs on a laptop computer, and game events can be streamed to the robot. Based on the video contents, the robot asks the child questions and the child answers, as described above. Some simple sound effects or animations may be included.
Question Generation Results
To evaluate the reasonableness of the questions generated by DCH, we measure a BLEU score for the questions and conduct human evaluations. Table 1 summarizes the results. The BLEU score is typically used in machine translation problems and ranges from 0 to 1 (Papineni et al. 2002); higher scores indicate closer matches to the reference translations. In this paper, we match the generated questions against the ground-truth questions, and the DCH achieves a BLEU score of 0.3513 with 1-gram precision on 200 generated questions. For the human evaluations, seven human judges rated the generated questions with scores from 1 to 5; the average score on 80 generated questions is 2.534. These results show that the generated questions are reasonably appropriate to pose to the child and the robot. Figure 4 shows examples of the generated questions. The query scene image is positioned in the center (4th image) of each image row, and the images are ordered sequentially as they appear in the video.
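For reference, the 1-gram BLEU precision reported above reduces to clipped unigram precision; a minimal sketch (without the brevity penalty of full BLEU):

```python
from collections import Counter

def unigram_precision(candidate, reference):
    """Clipped 1-gram precision: the fraction of candidate words that also
    occur in the reference, with per-word counts clipped to the reference."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    matched = sum(min(c, ref_counts[w]) for w, c in cand_counts.items())
    total = sum(cand_counts.values())
    return matched / total if total else 0.0
```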
Discussions and Conclusions
We proposed a deep learning-based system for a video question-and-answering game robot, "Pororobot". The system consists of four parts using state-of-the-art machine learning methods. The first part is the preprocessing part, consisting of CNNs and RNNs for feature construction. Knowledge of an observed video story is learned in the second part, where DCH is used to learn visual-linguistic concepts in the video. The third part converts the observed video scenes into questions, and the last part uses an RNN that encodes the words in the answer and generates the next answer words. We demonstrate that the preliminary results of the proposed system point toward a socially interactive robot that can play with a child, and thus lay a stepping stone toward achieving human-level robot intelligence.
For further research, the robot should be able to generate questions and react differently according to the child's state, such as emotions. The state determined by the robot's sensors may be an important factor in deciding which questions should be generated during child-robot interaction. To this end, the system described in this paper should be expanded into a "purposive" or "intentional" agent. Such creative modes are a prerequisite for lifelong learning environments (Zhang 2014). Also, the model should be trained end-to-end on a large-scale GPU cluster to cope with the real-time, dynamic, heterogeneous nature of real-world data in everyday life.
Acknowledgements
This work was supported by the ICT R&D program of MSIP/IITP (R0126-15-1072). In addition, this work was supported in part by the ICT R&D program of MSIP/IITP (10044009), the US Air Force Research Laboratory (AFOSR 124087), and NAVER LABS.
References
Fang, H., Gupta, S., Iandola, F., Srivastava, R. K., Deng, L., Dol-
lár, P., Gao, J., He, X., Mitchell, M., Platt, J. C., Zitnick, C. L.,
Zweig, G. 2015. From Captions to Visual Concepts and Back, In
Proceedings of IEEE Conference on Computer Vision and Pat-
tern Recognition (CVPR 2015). 1473-1482
Fasola, J., and Mataric, M. 2013. A socially assistive robot exer-
cise coach for the elderly. Journal of Human-Robot Interaction.
2(2):3-32.
Fridin, M. 2014. Storytelling by a kindergarten social assistive
robot: A tool for constructive learning in preschool education.
Computers & Education. 70(0):53–64.
Gao, H., Mao, J., Zhou, J., Huang, Z., Wang, L., Xu, W. 2015.
Are You Talking to a Machine? Dataset and Methods for Multi-
lingual Image Question Answering. arXiv preprint
arXiv:1505.05612.
Girshick, R., Donahue, J., Darrell, T., Malik, J. 2014. Rich feature
hierarchies for accurate object detection and semantic segmenta-
tion. In Proceedings of IEEE Conference on Computer Vision and
Pattern Recognition (CVPR 2014). 580-587.
Ha, J.-W., Kim, K.-M., and Zhang, B.-T. 2015. Automated Visu-
al-Linguistic Knowledge Construction via Concept Learning from
Cartoon Videos. In Proceedings of the Twenty-Ninth AAAI Con-
ference on Artificial Intelligence (AAAI 2015). 522-528.
Jia, Y. 2013. Caffe: An open source convolutional architecture for
fast feature embedding. http://caffe.berkeleyvision.org/.
Figure 4: Example Questions Generated from DCH
Karpathy, A., Fei-Fei, L. 2015. Deep Visual-Semantic Align-
ments for Generating Image Description. In Proceedings of IEEE
Conference on Computer Vision and Pattern Recognition (CVPR
2015). 3128-3137
Kory, J. M., and Breazeal, C. L. 2014. Storytelling with Robots:
Learning Companions for Preschool Children’s Language Devel-
opment. In Proceedings of the 23rd IEEE International Symposi-
um on Robot and Human Interactive Communication (RO-MAN).
Kory, J. M., Jeong, S., and Breazeal, C. L. 2013. Robotic learning
companions for early language development. In Proceedings of
the 15th ACM on International conference on multimodal interac-
tion. 71–72.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. 2012. Imagenet
classification with deep convolutional neural networks. In Pro-
ceedings of Advances in Neural Information Processing Systems
(NIPS 2012). 1097–1105.
Leyzberg, D., Spaulding, S., and Scassellati, B. 2014. Personaliz-
ing robot tutors to individuals’ learning differences. In Proceed-
ings of the 2014 ACM/IEEE international conference on Human-
robot interaction. 423–430.
Malinowski, M., Rohrbach, M., and Fritz, M. 2015. Ask your
neurons: A neural-based approach to answering questions about
images. arXiv preprint arXiv:1505.01121.
Mikolov T., Sutskever I., Chen K., Corrado G., Dean J. 2013.
Distributed Representation of Words and Phrases and Their
Compositionality. In Proceedings of Advances in Neural Infor-
mation Processing Systems (NIPS 2013). 3111-3119
Papineni, K., Roukos, S., Ward, T., and Zhu, W. 2002. Bleu: A
method for automatic evaluation of machine translation. In Pro-
ceedings of ACL '02 Proceedings of the 40th Annual Meeting on
Association for Computational Linguistics. 311-318
Ren, M., Kiros, R., and Zemel, R. 2015. Image question answer-
ing: A visual semantic embedding model and a new dataset.
ICML 2015 Deep Learning Workshop.
Saerbeck, M., Schut, T., Bartneck, C.; and Janse, M. D. 2010.
Expressive robots in education: varying the degree of social sup-
portive behavior of a robotic tutor. In Proceedings of the SIGCHI
Conference on Human Factors in Computing Systems. 1613–
1622.
Vinyals, O., Toshev, A., Bengio, S., Erhan, D. 2015. Show and
Tell: A Neural Image Caption Generator. arXiv preprint
arXiv:1411.4555.
Zhang, B.-T., Ha, J.-W., and Kang, M. 2012. Sparse Population
Code Models of Word Learning in Concept Drift. In Proceedings
of the 34th Annual Conference of Cognitive Science Society
(Cogsci 2012). 1221-1226.
Zhang, B.-T. 2013. Information-Theoretic Objective Functions
for Lifelong Learning. AAAI 2013 Spring Symposium on Lifelong
Machine Learning. 62-69.
Zhang, B.-T. 2014. Ontogenesis of agency in machines: A multi-
disciplinary review. AAAI 2014 Fall Symposium on The Nature of
Humans and Machines: A Multidisciplinary Discourse.
... Subsequently, Mnih et al. (2014) proposed a new recurrent neural network model, which can adaptively select specific areas or locations to extract information from images or videos and process the selected area at high resolution. As the algorithm has increasingly mature, the application of the algorithm in related fields has also been breaking through recently, such as the caption generation of car images (Chen, He & Fan, 2017), the description generation of facial expressions (Kuznetsova et al., 2014), and educational NAO robots driven by image caption generation for video Q&A games for children's education (Kim et al., 2015). Recent research on image caption generation also shows that the accuracy and reliability of the technology have increased (Ding et al., 2019). ...
... An increasing number of studies have been conducted on HRI combined with image caption generation algorithm. Kim et al. (2015) used the structure of a convolutional neural network (CNN) combined with RNN + deep concept hierarchies (DCH) to design and develop an educational intelligent humanoid robot system for play video games with children. In this study, CNN was used to extract and pre-process cartoons with educational features, and RNN and DCH were used to convert the collected video features into Q&A about cartoons. ...
... The framework uses the structure of the NIC algorithm to better realize the interaction of HSRs from HRI to the direction of bionic-companionship. According to the initial descriptions of robot companions, as in the studies by Turkle (2006) and (Kim et al., 2015), the proposed framework should provide HSRs with more natural interactions and a more sensitive understanding of the environment, and hence, the design of the framework is divided into two subsystems (see the dotted red). ...
Article
Full-text available
At present, industrial robotics focuses more on motion control and vision, whereas humanoid service robotics (HSRs) are increasingly being investigated and researched in the field of speech interaction. The problem and quality of human-robot interaction (HRI) has become a widely debated topic in academia. Especially when HSRs are applied in the hospitality industry, some researchers believe that the current HRI model is not well adapted to the complex social environment. HSRs generally lack the ability to accurately recognize human intentions and understand social scenarios. This study proposes a novel interactive framework suitable for HSRs. The proposed framework is grounded on the novel integration of Trevarthen ’s (2001) companionship theory and neural image captioning (NIC) generation algorithm. By integrating image-to-natural interactivity generation and communicating with the environment to better interact with the stakeholder, thereby changing from interaction to a bionic-companionship. Compared to previous research a novel interactive system is developed based on the bionic-companionship framework. The humanoid service robot was integrated with the system to conduct preliminary tests. The results show that the interactive system based on the bionic-companionship framework can help the service humanoid robot to effectively respond to changes in the interactive environment, for example give different responses to the same character in different scenes.
... The content of work, methods, and roles of the service staff are in flux in healthcare and well-being services in, for example, Finland, the rest of Europe, and Japan. Many believe that new technological solutions and the possibilities of robotics require from future care professionals new kinds of expertise, from their educators new approaches to curricula development, and "deep learning" (Kim, Nan, Ha, Heo, & Zhang, 2015;Lecun, Bengio & Hinto, 2015;Niiniluoto, 2019), maybe even radically new models of learning. ...
... These methods are different from previous approaches in that they do not require any feature engineering and exploit a large amount of training data. The performance improvements have continuously extended to image QA tasks [Fukui et al., 2016;Kim et al., 2016]. ...
Conference Paper
Full-text available
Question-answering (QA) on video contents is a significant challenge for achieving human-level intelligence as it involves both vision and language in real-world settings. Here we demonstrate the possibility of an AI agent performing video story QA by learning from a large amount of cartoon videos. We develop a video-story learning model, i.e. Deep Embedded Memory Networks (DEMN), to reconstruct stories from a joint scene-dialogue video stream using a latent embedding space of observed data. The video stories are stored in a long-term memory component. For a given question, an LSTM-based attention model uses the long-term memory to recall the best question-story-answer triplet by focusing on specific words containing key information. We trained the DEMN on a novel QA dataset of children’s cartoon video series, Pororo. The dataset contains 16,066 scene-dialogue pairs of 20.5-hour videos, 27,328 fine-grained sentences for scene description, and 8,913 story-related QA pairs. Our experimental results show that the DEMN outperforms other QA models. This is mainly due to 1) the reconstruction of video stories in a scene-dialogue combined form that utilize the latent embedding and 2) attention. DEMN also achieved state-of-the-art results on the MovieQA benchmark.
... These methods are different from previous approaches in that they do not require any feature engineering and exploit a large amount of training data. The performance improvements have continuously extended to image QA tasks [Fukui et al., 2016;Kim et al., 2016]. ...
Article
Question-answering (QA) on video contents is a significant challenge for achieving human-level intelligence as it involves both vision and language in real-world settings. Here we demonstrate the possibility of an AI agent performing video story QA by learning from a large amount of cartoon videos. We develop a video-story learning model, i.e. Deep Embedded Memory Networks (DEMN), to reconstruct stories from a joint scene-dialogue video stream using a latent embedding space of observed data. The video stories are stored in a long-term memory component. For a given question, an LSTM-based attention model uses the long-term memory to recall the best question-story-answer triplet by focusing on specific words containing key information. We trained the DEMN on a novel QA dataset of children's cartoon video series, Pororo. The dataset contains 16,066 scene-dialogue pairs of 20.5-hour videos, 27,328 fine-grained sentences for scene description, and 8,913 story-related QA pairs. Our experimental results show that the DEMN outperforms other QA models. This is mainly due to 1) the reconstruction of video stories in a scene-dialogue combined form that utilize the latent embedding and 2) attention. DEMN also achieved state-of-the-art results on the MovieQA benchmark.
... Kim et al. [7] highlight that applying deep learning to visual scene information in an HRI scenario was successful, but that generating behaviours for the robot to be able to act in a dynamic and uncertain environment remains a challenge. ...
Conference Paper
Full-text available
Child-robot interactions are increasingly being explored in domains which require longer-term application, such as healthcare and education. In order for a robot to behave in an appropriate manner over longer timescales, its behaviours should be coterminous with that of the interacting children. Generating such sustained and engaging social behaviours is an ongoing research challenge, and we argue here that the recent progress of deep machine learning opens new perspectives that the HRI community should embrace. As an initial step in that direction, we propose the creation of a large open dataset of child-robot social interactions. We detail our proposed methodology for data acquisition: children interact with a robot puppeted by an expert adult during a range of playful face-to-face social tasks. By doing so, we seek to capture a rich set of human-like behaviours occurring in natural social interactions, that are explicitly mapped to the robot's embodiment and affordances.
Article
Video-text matching is of high importance in the field of machine vision and artificial intelligence. The main challenging issue in video-text matching is the projection of the video and textual features into a common semantic space, which is called video-text joint embedding. The proper functionality of video-text joint embedding depends on two important factors: the effectiveness of the extracted information for video-text matching and the suitability of network structure for the projection of the extracted features into a common space. Generally, existing approaches do not leverage all the audio and visual information of a video for video-text matching. This study proposes a new approach for video-text matching by extracting a comprehensive set of visual and textual features and projecting them into a common semantic space using an effective structure. We use a new deep network with two textual and video branches that extracts several informative high-level visual and textual features and maps them into a shared space. In the video branch, we extract several descriptors comprising appearance-based, concept-based, and action-based features as well as an audio encoder to extract sound features in the video clip. We also utilize several feature extraction approaches in the textual branch for the effective transformation of an input sentence to a common semantic space. Furthermore, an image description database is used to pre-train and initialize network weights. We evaluated the proposed architecture with two popular video description datasets and compared the results with the results of several state-of-the-art approaches. The comparison results showed the effectiveness of the proposed algorithm.
Article
Full-text available
Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In this paper, we present a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image. The model is trained to maximize the likelihood of the target description sentence given the training image. Experiments on several datasets show the accuracy of the model and the fluency of the language it learns solely from image descriptions. Our model is often quite accurate, which we verify both qualitatively and quantitatively. For instance, while the current state-of-the-art BLEU score (the higher the better) on the Pascal dataset is 25, our approach yields 59, to be compared to human performance around 69. We also show BLEU score improvements on Flickr30k, from 55 to 66, and on SBU, from 19 to 27.
Conference Paper
Full-text available
We address a question answering task on real-world images that is set up as a Visual Turing Test. By combining latest advances in image representation and natural language processing, we propose Neural-Image-QA, an end-to-end formulation to this problem for which all parts are trained jointly. In contrast to previous efforts, we are facing a multi-modal problem where the language output (answer) is conditioned on visual and natural language input (image and question). Our approach Neural-Image-QA doubles the performance of the previous best approach on this problem. We provide additional insights into the problem by analyzing how much information is contained only in the language part for which we provide a new human baseline. To study human consensus, which is related to the ambiguities inherent in this challenging task, we propose two novel metrics and collect additional answers which extends the original DAQUAR dataset to DAQUAR-Consensus.
Conference Paper
Can a large convolutional neural network trained for whole-image classification on ImageNet be coaxed into detecting objects in PASCAL? We show that the answer is yes, and that the resulting system is simple, scalable, and boosts mean average precision, relative to the venerable deformable part model, by more than 40% (achieving a final mAP of 48% on VOC 2007). Our framework combines powerful computer vision techniques for generating bottom-up region proposals with recent advances in learning high-capacity convolutional neural networks. We call the resulting system R-CNN: Regions with CNN features. The same framework is also competitive with state-of-the-art semantic segmentation methods, demonstrating its flexibility. Beyond these results, we execute a battery of experiments that provide insight into what the network learns to represent, revealing a rich hierarchy of discriminative and often semantically meaningful features.
Conference Paper
The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of “Canada” and “Air” cannot be easily combined to obtain “Air Canada”. Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.
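The negative-sampling objective mentioned above replaces the full softmax with a handful of binary discriminations per training pair. The sketch below is a toy stdlib-only illustration of that per-pair loss (vector arguments and sampling of negatives are left to the caller; this is not the paper's implementation):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def neg_sampling_loss(center_vec, context_vec, negative_vecs):
    """Negative-sampling objective for one (center, context) pair,
    returned as a loss to minimize: push the true pair's score up via
    log sigma(u.v), and each sampled negative's score down via
    log sigma(-u.v_neg)."""
    loss = -math.log(sigmoid(dot(center_vec, context_vec)))
    for neg in negative_vecs:
        loss -= math.log(sigmoid(-dot(center_vec, neg)))
    return loss
```

In practice the negatives are drawn from a smoothed unigram distribution, and gradients of this loss update only the handful of vectors involved, which is what makes training fast.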
Conference Paper
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
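The top-1 and top-5 error rates quoted above have a simple definition: an example counts as correct under top-k if its true label appears among the k highest-scoring classes. A minimal stdlib-only sketch of that metric (not code from the paper):

```python
def top_k_error(scores, labels, k):
    """Fraction of examples whose true label is not among the k
    highest-scoring classes. `scores` is a list of per-class score
    lists, `labels` the corresponding true class indices."""
    wrong = 0
    for row, label in zip(scores, labels):
        topk = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        if label not in topk:
            wrong += 1
    return wrong / len(labels)
```

Top-5 error is the standard ImageNet headline number because many images legitimately contain several of the 1000 classes, so penalizing only a miss of all five guesses is more forgiving of that ambiguity.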
Article
Conventional paradigms of machine learning assume all the training data are available when learning starts. However, in lifelong learning, the examples are observed sequentially as learning unfolds, and the learner should continually explore the world and reorganize and refine the internal model or knowledge of the world. This leads to a fundamental challenge: How to balance long-term and short-term goals, and how to trade off between information gain and model complexity? These questions boil down to "what objective functions can best guide a lifelong learning agent?" Here we develop a sequential Bayesian framework for lifelong learning, build a taxonomy of lifelong-learning paradigms, and examine information-theoretic objective functions for each paradigm, with an emphasis on active learning. The objective functions can provide theoretical criteria for designing algorithms and determining effective strategies for selective sampling, representation discovery, knowledge transfer, and continual update over a lifetime of experience.
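One concrete instance of the selective-sampling strategies discussed above is uncertainty sampling: query the example whose predictive distribution carries the most entropy, a simple proxy for expected information gain. The following is a generic stdlib-only sketch, not the framework's own formulation:

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_most_informative(candidate_predictions):
    """Return the index of the candidate example whose predicted label
    distribution has the highest entropy, i.e. where the current model
    is least certain and a label would be most informative."""
    return max(range(len(candidate_predictions)),
               key=lambda i: entropy(candidate_predictions[i]))
```

A full lifelong learner would additionally weigh such information gain against model complexity, as the abstract's trade-off question suggests, rather than greedily maximizing uncertainty alone.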
Conference Paper
Children's oral language skills in preschool can predict their academic success later in life. As such, increasing children's skills early on could improve their success in middle and high school. To this end, we propose that a robotic learning companion could supplement children's early language education. The robot targets both the social nature of language learning and the adaptation necessary to help individual children. The robot is designed as a social character that interacts with children as a peer, not as a tutor or teacher. It will play a storytelling game, during which it will introduce new vocabulary words, and model good story narration skills, such as including a beginning, middle, and end; varying sentence structure; and keeping cohesion across the story. We will evaluate whether adapting the robot's level of language to the child's - so that, as children improve their storytelling skills, so does the robot - influences (i) whether children learn new words from the robot, (ii) the complexity and style of stories children tell, and (iii) the similarity of children's stories to the robot's stories. We expect children will learn more from a robot that adapts to maintain an equal or greater ability than the children, and that they will copy its stories and narration style more than they would with a robot that does not adapt (a robot of lesser ability). However, we also expect that playing with a robot of lesser ability could prompt teaching or mentoring behavior from children, which could also be beneficial to language learning.
Article
This work aims to address the problem of image-based question-answering (QA) with new models and datasets. In our work, we propose to use recurrent neural networks and visual semantic embeddings without intermediate stages such as object detection and image segmentation. Our model performs 1.8 times better than the recently published results on the same dataset. Another main contribution is an automatic question generation algorithm that converts the currently available image description dataset into QA form, resulting in a 10 times bigger dataset with more evenly distributed answers.
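The visual semantic embedding idea above amounts to scoring answers by similarity in a shared vector space. The toy sketch below is emphatically not the paper's architecture (which uses a recurrent network over the question); it only illustrates scoring candidate answers against a fused image-question representation, with all vectors and the additive fusion being illustrative assumptions:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either is zero."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot_product(u, v) / (nu * nv) if nu and nv else 0.0

def dot_product(u, v):
    return sum(a * b for a, b in zip(u, v))

def answer_scores(image_vec, question_vec, answer_vecs):
    """Score each candidate answer embedding against a naive additive
    fusion of the image and question embeddings."""
    joint = [a + b for a, b in zip(image_vec, question_vec)]
    return [cosine(joint, ans) for ans in answer_vecs]
```

In a trained system the embeddings would be learned jointly so that correct (image, question, answer) triples score highest; here the vectors are hand-picked purely to show the scoring mechanics.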