Conference Paper

Bag of multimodal LDA models for concept formation

Dept. of Electron. Eng., Univ. of Electro-Commun., Chofu, Japan
DOI: 10.1109/ICRA.2011.5980324
In proceedings of: Robotics and Automation (ICRA), 2011 IEEE International Conference on
Source: IEEE Xplore

ABSTRACT In this paper, a novel framework for multimodal categorization using a bag of multimodal LDA models is proposed. The main issue tackled in this paper is the granularity of categories: categories are not fixed but vary according to context. Selective attention is the key to modeling this granularity, which motivates us to introduce various sets of weights on the perceptual information. As the weights change, the categories vary. In the proposed model, various sets of weights and model structures are assumed, and multimodal LDA-based categorization is carried out many times, resulting in a variety of models. To make the categories (concepts) useful for inference, significant models must be selected. The selection is carried out through interaction between the robot and the user. The selected models enable the robot to infer unobserved properties of an object. For example, the robot can infer audio information from appearance alone. Furthermore, the robot can describe the appearance of any object using suitable words, thanks to the connection between words and perceptual information. The proposed algorithm is implemented on a robot platform, and a preliminary experiment is carried out to validate it.
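The idea of weighting perceptual modalities and then inferring an unobserved modality can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes a simplified single-category mixture model as a stand-in for multimodal LDA, and the function names, the attention `weights`, and all parameter values are hypothetical, chosen only for illustration.

```python
import numpy as np

def category_posterior(counts, phi, weights, prior):
    """Posterior over categories from weighted, observed modalities.

    counts:  dict modality -> 1-D array of feature counts (observed only)
    phi:     dict modality -> (K, V_m) array; row k is the feature
             distribution of category k in that modality
    weights: dict modality -> non-negative attention weight (assumed knob)
    prior:   (K,) array of category prior probabilities
    """
    log_post = np.log(np.asarray(prior, dtype=float))
    for m, c in counts.items():
        # Weighted multinomial log-likelihood of the counts under each category.
        log_post = log_post + weights[m] * (np.log(phi[m]) @ c)
    log_post -= log_post.max()  # stabilise before exponentiating
    post = np.exp(log_post)
    return post / post.sum()

def predict_modality(post, phi_m):
    """Expected feature distribution of an unobserved modality."""
    return post @ phi_m

# Toy, made-up parameters: two categories, 3 visual and 2 audio features.
phi = {
    "visual": np.array([[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]),
    "audio": np.array([[0.9, 0.1], [0.2, 0.8]]),
}
prior = np.array([0.5, 0.5])
visual_counts = {"visual": np.array([5, 1, 0])}

# Attend to vision: the posterior concentrates on category 0,
# so the predicted audio distribution is close to [0.9, 0.1].
post = category_posterior(visual_counts, phi, {"visual": 1.0}, prior)
audio_pred = predict_modality(post, phi["audio"])

# Ignore vision (weight 0): the posterior falls back to the prior,
# and the audio prediction becomes the prior-weighted average.
post_ignored = category_posterior(visual_counts, phi, {"visual": 0.0}, prior)
audio_ignored = predict_modality(post_ignored, phi["audio"])
```

Changing the weight changes which category wins, which is one way to picture how different weight sets yield different categorizations. In the paper's setting, many such models would be trained and the useful ones selected through interaction with the user.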

  • ABSTRACT: This paper considers Bayesian data fusion of conventional robot sensor information with ambiguous human-generated categorical information about continuous world states of interest. First, it is shown that such soft information can be generally modeled via hybrid continuous-to-discrete likelihoods based on the softmax function. A new hybrid fusion procedure, called variational Bayesian importance sampling (VBIS), is then introduced to combine the strengths of variational Bayes approximations and fast Monte Carlo methods to produce reliable posterior estimates for Gaussian priors and softmax likelihoods. VBIS is then extended to more general fusion problems involving complex Gaussian mixture (GM) priors and multimodal softmax likelihoods, leading to accurate GM approximations of highly non-Gaussian fusion posteriors for a wide range of robot sensor data and soft human data. Experiments for hardware-based multitarget search missions with a cooperative human-autonomous robot team show that humans can serve as highly informative sensors through proper data modeling and fusion, and that VBIS provides reliable and scalable Bayesian fusion estimates via GMs.
    IEEE Transactions on Robotics 01/2013; 29(1):189-206. · 2.57 Impact Factor
  • ABSTRACT: This paper proposes a robot that acquires multimodal information, i.e., auditory, visual, and haptic information, in a fully autonomous way using its embodiment. We also propose an online algorithm for multimodal categorization based on the acquired multimodal information and words, which are partially given by human users. The proposed framework makes it possible for the robot to learn object concepts naturally in everyday operation in conjunction with a small amount of linguistic information from human users. To obtain multimodal information, the robot detects an object on a flat surface. Then the robot grasps and shakes it to gain haptic and auditory information. To obtain visual information, the robot uses a small hand-held observation table so that it can control the viewpoints from which it observes the object. For multimodal concept formation, multimodal LDA using Gibbs sampling is extended to an online version in this paper. The proposed algorithms are implemented on a real robot and tested using real everyday objects in order to show the validity of the proposed system.
    2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2011, San Francisco, CA, USA, September 25-30, 2011; 01/2011
  • ABSTRACT: Natural human-robot interaction in complex and unpredictable environments is one of the main research lines in robotics. In typical real-world scenarios, humans are at some distance from the robot and the acquired signals are strongly impaired by noise, reverberations and other interfering sources. In this context, the detection and localisation of speakers play a key role, since they are the pillar on which several tasks (e.g., speech recognition and speaker tracking) rely. We address the problem of how to detect and localize people that are both seen and heard by a humanoid robot. We introduce a hybrid deterministic/probabilistic model. The deterministic component allows us to map the visual information into the auditory space. By means of the probabilistic component, the visual features guide the grouping of the auditory features in order to form AV objects. The proposed model and the associated algorithm are implemented in real time (17 FPS) using a stereoscopic camera pair and two microphones embedded into the head of the humanoid robot NAO. We performed experiments on (i) synthetic data, (ii) a publicly available data set and (iii) data acquired using the robot. The results we obtained validate the approach and encourage us to further investigate how vision can help robot hearing.