Conference Paper

Bag of multimodal LDA models for concept formation

Dept. of Electron. Eng., Univ. of Electro-Commun., Chofu, Japan
DOI: 10.1109/ICRA.2011.5980324 Conference: Robotics and Automation (ICRA), 2011 IEEE International Conference on
Source: IEEE Xplore

ABSTRACT This paper proposes a novel framework for multimodal categorization using a bag of multimodal LDA models. The main issue addressed is the granularity of categories: categories are not fixed but vary according to context. Selective attention is the key to modeling this granularity, which motivates us to apply various sets of weights to the perceptual information. As the weights change, the categories vary. The proposed model assumes various sets of weights and model structures, and multimodal LDA-based categorization is carried out many times, which results in a variety of models. To make the categories (concepts) useful for inference, significant models must be selected. The selection is carried out through interaction between the robot and the user. The selected models enable the robot to infer unobserved properties of an object. For example, the robot can infer audio information from the object's appearance alone. Furthermore, the robot can describe the appearance of any object using suitable words, thanks to the connection between words and perceptual information. The proposed algorithm is implemented on a robot platform, and a preliminary experiment is carried out to validate it.
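
The procedure outlined in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy histograms, the function names (expand_tokens, fit_mlda, build_bag), the integer weight levels, the hyperparameters, and the choice to apply a weight by repeating a modality's counts are all hypothetical. Each model is a small multimodal LDA fitted by collapsed Gibbs sampling, and one model is fitted per combination of weight set and topic count.

```python
import itertools
import random

# Toy data (invented for illustration): each object is a dict of
# per-modality feature histograms.
TOY_OBJECTS = [
    {"visual": [4, 0, 1], "audio": [3, 0], "word": [2, 0, 0]},
    {"visual": [3, 1, 0], "audio": [0, 3], "word": [2, 0, 0]},
    {"visual": [0, 4, 1], "audio": [3, 0], "word": [0, 2, 0]},
    {"visual": [0, 3, 2], "audio": [0, 3], "word": [0, 0, 2]},
]


def expand_tokens(obj, weights):
    """Turn histograms into (modality, feature) tokens, repeating each
    token weights[modality] times (an assumed weighting scheme)."""
    tokens = []
    for modality, hist in obj.items():
        for feature, count in enumerate(hist):
            tokens.extend([(modality, feature)] * (count * weights[modality]))
    return tokens


def fit_mlda(objects, weights, n_topics, n_iters=30, alpha=0.5, beta=0.1, seed=0):
    """Collapsed Gibbs sampling for a multimodal LDA: topics are shared
    across modalities, each modality keeps its own topic-feature counts."""
    rng = random.Random(seed)
    vocab = {m: len(h) for m, h in objects[0].items()}
    docs = [expand_tokens(o, weights) for o in objects]
    n_dk = [[0] * n_topics for _ in docs]
    n_kw = {m: [[0] * v for _ in range(n_topics)] for m, v in vocab.items()}
    n_k = {m: [0] * n_topics for m in vocab}
    z = [[rng.randrange(n_topics) for _ in doc] for doc in docs]

    def update(d, m, w, k, delta):
        n_dk[d][k] += delta
        n_kw[m][k][w] += delta
        n_k[m][k] += delta

    # Initialise the count tables from the random topic assignments.
    for d, doc in enumerate(docs):
        for (m, w), k in zip(doc, z[d]):
            update(d, m, w, k, +1)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, (m, w) in enumerate(doc):
                update(d, m, w, z[d][i], -1)  # remove current assignment
                probs = [
                    (n_dk[d][k] + alpha) * (n_kw[m][k][w] + beta)
                    / (n_k[m][k] + vocab[m] * beta)
                    for k in range(n_topics)
                ]
                z[d][i] = rng.choices(range(n_topics), weights=probs)[0]
                update(d, m, w, z[d][i], +1)  # add resampled assignment

    # Category of each object = its most frequently assigned topic.
    return [max(range(n_topics), key=lambda k: n_dk[d][k]) for d in range(len(docs))]


def build_bag(objects, weight_levels=(0, 1, 2), topic_counts=(2, 3)):
    """Fit one model per (weight set, topic count) pair."""
    modalities = list(objects[0])
    bag = []
    for combo in itertools.product(weight_levels, repeat=len(modalities)):
        if not any(combo):
            continue  # skip the all-zero weight set, which has no data
        weights = dict(zip(modalities, combo))
        for n_topics in topic_counts:
            bag.append({
                "weights": weights,
                "n_topics": n_topics,
                "categories": fit_mlda(objects, weights, n_topics),
            })
    return bag
```

Calling build_bag(TOY_OBJECTS) yields 52 candidate models (26 non-zero weight sets times 2 topic counts), each with its own category assignment for the four objects. The selection of significant models through interaction with the user, which the abstract describes, is not sketched here.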

  • ABSTRACT: Natural human-robot interaction in complex and unpredictable environments is one of the main research lines in robotics. In typical real-world scenarios, humans are at some distance from the robot and the acquired signals are strongly impaired by noise, reverberations and other interfering sources. In this context, the detection and localisation of speakers play a key role, since they are the pillar on which several tasks (e.g., speech recognition and speaker tracking) rely. We address the problem of how to detect and localise people that are both seen and heard by a humanoid robot. We introduce a hybrid deterministic/probabilistic model. The deterministic component allows us to map the visual information into the auditory space. By means of the probabilistic component, the visual features guide the grouping of the auditory features in order to form AV objects. The proposed model and the associated algorithm are implemented in real time (17 FPS) using a stereoscopic camera pair and two microphones embedded into the head of the humanoid robot NAO. We performed experiments on (i) synthetic data, (ii) a publicly available data set and (iii) data acquired using the robot. The results we obtained validate the approach and encourage us to further investigate how vision can help robot hearing.
    The International Journal of Robotics Research 11/2014; · 2.86 Impact Factor
  • ABSTRACT: This paper considers Bayesian data fusion of conventional robot sensor information with ambiguous human-generated categorical information about continuous world states of interest. First, it is shown that such soft information can be generally modeled via hybrid continuous-to-discrete likelihoods based on the softmax function. A new hybrid fusion procedure, called variational Bayesian importance sampling (VBIS), is then introduced to combine the strengths of variational Bayes approximations and fast Monte Carlo methods to produce reliable posterior estimates for Gaussian priors and softmax likelihoods. VBIS is then extended to more general fusion problems involving complex Gaussian mixture (GM) priors and multimodal softmax likelihoods, leading to accurate GM approximations of highly non-Gaussian fusion posteriors for a wide range of robot sensor data and soft human data. Experiments for hardware-based multitarget search missions with a cooperative human-autonomous robot team show that humans can serve as highly informative sensors through proper data modeling and fusion, and that VBIS provides reliable and scalable Bayesian fusion estimates via GMs.
    IEEE Transactions on Robotics 01/2013; 29(1):189-206. · 2.57 Impact Factor
  • ABSTRACT: We present a method to detect social events in a set of pictures from an image hosting service (Flickr). This method relies on the analysis of user-generated tags, using statistical models trained on both a small set of manually annotated data and a large data set collected from the Internet. Social event modeling relies on a multi-span topic model based on LDA (Latent Dirichlet Allocation). Experiments are conducted in the experimental setup of the MediaEval'2011 evaluation campaign. The proposed system significantly outperforms the best system of this benchmark, reaching an F-measure score of about 71%.
    Content-Based Multimedia Indexing (CBMI), 2013 11th International Workshop on; 01/2013