Conference Paper

Non-Speech Audio Event Detection

INESC-ID, Lisboa
DOI: 10.1109/ICASSP.2009.4959998 · Conference: ICASSP 2009 - IEEE International Conference on Acoustics, Speech and Signal Processing, Taipei, Taiwan
Source: IEEE Xplore

ABSTRACT: Audio event detection is one of the tasks of the European project VIDIVIDEO. This paper focuses on the detection of non-speech events, and as such only searches for events in audio segments that have been previously classified as non-speech. Preliminary experiments with a small corpus of sound effects have shown the potential of this type of corpus for training purposes. This paper describes our experiments with SVM and HMM-based classifiers, using a 290-hour corpus of sound effects. Although we have only built detectors for 15 semantic concepts so far, the method seems easily portable to other concepts. The paper reports experiments with multiple features, different kernels and several analysis windows. Preliminary experiments on documentaries and films yielded promising results, despite the difficulties posed by the mixtures of audio events that characterize real sounds.
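The per-concept SVM detectors described above can be sketched as follows. This is a minimal, hypothetical illustration, assuming fixed-length feature vectors (MFCC-like, 13 coefficients here) have already been extracted from each analysis window; the synthetic data stands in for the paper's 290-hour sound-effects corpus, and the RBF kernel is just one of the kernels such experiments compare.

```python
# Hedged sketch: one binary SVM detector for a single semantic concept,
# operating on per-window feature vectors (all values synthetic).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic per-window features: examples where the target concept is
# present (label 1) vs. background sound effects (label 0).
pos = rng.normal(loc=1.0, scale=0.5, size=(100, 13))
neg = rng.normal(loc=-1.0, scale=0.5, size=(100, 13))
X = np.vstack([pos, neg])
y = np.array([1] * 100 + [0] * 100)

# One such detector would be trained per semantic concept.
detector = SVC(kernel="rbf")
detector.fit(X, y)

window = np.full((1, 13), 1.0)      # a window resembling the concept
print(detector.predict(window)[0])  # 1 -> concept detected in this window
```

Running one detector per concept over the previously segmented non-speech audio yields, for each window, the set of detected concepts; an HMM-based detector would instead model the temporal evolution of the features across windows.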

  • ABSTRACT: Advances in audio recognition have enabled the real-world success of a wide variety of interactive voice systems over the last two decades. More recently, these same techniques have shown promise in recognizing non-speech audio events. Sounds are ubiquitous in real-world manipulation, such as the click of a button, the crash of an object being knocked over, and the whine of activation from an electric power tool. Surprisingly, very few autonomous robots leverage audio feedback to improve their performance. Modern audio recognition techniques exist that are capable of learning and recognizing real-world sounds, but few implementations exist that are easily incorporated into modern robotic programming frameworks. This paper presents a new software library known as the ROS Open-source Audio Recognizer (ROAR). ROAR provides a complete set of end-to-end tools for online supervised learning of new audio events, feature extraction, automatic one-class Support Vector Machine model tuning, and real-time audio event detection. Through implementation on a Barrett WAM arm, we show that combining the contextual information of the manipulation action with a set of learned audio events yields significant improvements in robotic task-completion rates.
    Autonomous Robots, 34(3), April 2013.
  • ABSTRACT: Auditory data contains important information about the content of multimedia data. This paper presents a method for content-based event retrieval on broadcast audio. The aim of this study is to retrieve audio events from large multimedia databases. Seventeen classes that are most frequently observed in TV broadcasts, and that are considered an important input to higher-level semantic analysis of multimedia data, are selected. Audio streams are divided into homogeneous segments in order to generate fingerprints that describe both the temporal and spectral information of audio events. Both spectral and temporal properties of audio events are analyzed, and fingerprints to represent these properties are presented. Audio events are modeled by Gaussian mixture models. For retrieval, an ordered sequence is provided to the user for each event, sorted by the likelihood values of the fingerprints; the system aims to return the query events with higher likelihood values first. Mean average precision is used to evaluate retrieval performance. The 17 audio classes are tested on 11 hours of TV recordings, and an average precision of 18.5% is achieved.
    2011 IEEE 19th Signal Processing and Communications Applications Conference (SIU), January 2011.
  • ABSTRACT: QoS factors such as accuracy and time delay play a major role in time-critical applications. The proposed SVM instance-based algorithm improves accuracy and reduces time delay for the recognition of emergency-vehicle sounds. In this approach, time delay is reduced by identifying the support vectors, which are the data points near the margin of the hyperplane, and accuracy is increased by widening the margin between the classes. MFCCs, which are derived from frequency and intensity, are used for accurate sound recognition. Thus, time delay is reduced and accuracy is improved in the recognition of emergency-vehicle sounds.
    2012 Fourth International Conference on Advanced Computing (ICoAC), January 2012.
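The one-class SVM modeling mentioned in the ROAR abstract above can be illustrated with a short sketch. The feature dimensions, event name, and parameter values below are assumptions for illustration only, not the ROAR library's actual API; ROAR additionally tunes the model automatically and fuses detections with manipulation context.

```python
# Hedged sketch: a per-event one-class SVM trained only on positive
# examples of a single learned audio event (e.g. a button click).
# All feature values are synthetic stand-ins.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)

# Synthetic 20-dimensional feature vectors for the known event.
click_features = rng.normal(loc=0.0, scale=0.3, size=(200, 20))

# nu bounds the fraction of training points treated as outliers; a real
# system (as the ROAR abstract describes) would tune the model automatically.
model = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale")
model.fit(click_features)

centroid = np.zeros((1, 20))     # a frame at the center of the event cloud
outlier = np.full((1, 20), 3.0)  # a frame unlike anything seen in training
print(model.predict(centroid)[0], model.predict(outlier)[0])  # 1 -1
```

Because only positive examples of each event are needed, new events can be enrolled online without collecting negative data, which is what makes the one-class formulation attractive for interactive robot learning.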
