Conference Paper

Deep Image Features in Music Information Retrieval

Authors: Grzegorz Gwardys, Daniel Grzywczak

Abstract

Applications of Convolutional Neural Networks (CNNs) to various problems have been the subject of a number of recent studies, ranging from image classification and object detection to scene parsing, segmentation of 3D volumetric images, and action recognition in videos. CNNs are able to learn input data representations instead of relying on fixed, engineered features. In this study, an image model trained with a CNN was applied to Music Information Retrieval (MIR), in particular to musical genre recognition. The model was trained on ILSVRC-2012 (more than 1 million natural images) to perform image classification and was reused for genre classification on spectrogram images. Harmonic/percussive separation was applied, because these components are characteristic of musical genre. At the final stage, various strategies for merging Support Vector Machines (SVMs) were evaluated on the GTZAN dataset, which is well known in the MIR community. Even though the model was trained on natural images, the results achieved in this study are close to the state of the art.
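A rough sketch of the pipeline the abstract describes is given below, assuming librosa, PyTorch/torchvision, and scikit-learn. The spectrogram sizing, the choice of AlexNet's convolutional output as the feature layer, and the single linear SVM are illustrative assumptions, not the authors' exact configuration.

```python
import numpy as np
import librosa
import torch
import torchvision.models as models
from sklearn.svm import SVC

def spectrogram_image(y, sr):
    """Log-mel spectrogram rescaled to a 3-channel 224x224 'image' for an ImageNet CNN."""
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    S_db = librosa.power_to_db(S, ref=np.max)
    img = (S_db - S_db.min()) / (S_db.max() - S_db.min() + 1e-8)   # scale to [0, 1]
    img = torch.tensor(img, dtype=torch.float32)[None, None]       # (1, 1, mels, frames)
    img = torch.nn.functional.interpolate(img, size=(224, 224), mode="bilinear",
                                          align_corners=False)
    return img.repeat(1, 3, 1, 1)                                  # replicate to fake RGB

cnn = models.alexnet(weights="DEFAULT")   # ImageNet weights (older torchvision: pretrained=True)
cnn.eval()

def cnn_features(img):
    with torch.no_grad():
        feats = cnn.features(img)                 # convolutional feature maps
        return torch.flatten(feats, 1).numpy()[0]

def track_features(path):
    y, sr = librosa.load(path, sr=22050, duration=30.0)
    y_harm, y_perc = librosa.effects.hpss(y)      # harmonic/percussive separation
    return np.concatenate([cnn_features(spectrogram_image(part, sr))
                           for part in (y_harm, y_perc)])

# With GTZAN paths and genre labels (not shown), a single linear SVM stands in
# for the paper's SVM-merging stage:
# svm = SVC(kernel="linear").fit([track_features(p) for p in train_paths], train_labels)
```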
... 3) Harmony: Harmony enhances the beauty of the melody. 4) Rhythm: It is the driving beat or pulse of a song, defining the movement of the piece of music [4]. ...
Article
The ability of music to spread joy and excitement across lives makes it widely acknowledged as the human race's universal language. The phrase "music genre" is frequently used to group several musical styles together as following a shared custom or set of guidelines. According to their unique preferences, people now make playlists based on particular musical genres. Because it requires determining and extracting appropriate audio features, music genre identification is regarded as a challenging task. Music information retrieval, which extracts meaningful information from music, is one of several real-world applications of machine learning. The objective of this paper is to efficiently categorise songs into various genres based on their attributes using various machine learning approaches. To enhance the outcomes, appropriate feature engineering and data pre-processing techniques have been performed. Finally, the output from each model has been compared using suitable performance assessment measures. Compared to other machine learning algorithms, Random Forest combined with efficient feature selection and hyperparameter tuning produced better results in classifying music genres.
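A hedged sketch of the kind of pipeline that abstract describes, assuming a tabular file of precomputed audio features; the file name features.csv, the genre column, and the grid values are placeholders for illustration.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

df = pd.read_csv("features.csv")                        # placeholder feature table
X, y = df.drop(columns=["genre"]), df["genre"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=30)),           # simple feature selection
    ("rf", RandomForestClassifier(random_state=0)),
])
grid = GridSearchCV(pipe, {"rf__n_estimators": [200, 500],
                           "rf__max_depth": [None, 20]}, cv=5)
grid.fit(X_tr, y_tr)                                    # hyperparameter tuning
print(grid.score(X_te, y_te))                           # held-out accuracy
```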
... (3) Spectrum flatness: It is a measure of the uniformity of the power spectrum frequency distribution. It can be calculated as the ratio of the sub-band geometric average to the arithmetic average (equivalent to the MPEG-7 Audio Spectrum Flatness (ASF) descriptor) (Grzywczak and Gwardys, 2014). ...
Article
Timbre fusion is the theoretical basis of instrument acoustics and Chinese and Western orchestral acoustics. Currently, studies on timbre fusion are mainly focused on Western instruments, but there are some studies on the timbre fusion of Chinese instruments. In this paper, the characteristics of timbre fusion for Chinese and Western instruments are explored, focusing on the subjective attributes and objective acoustic parameters, and a series of experiments is carried out. First, a database containing 518 mixed timbre stimuli of Chinese and Western instruments was constructed to provide basic data that are necessary for the subjective and objective analyses of timbre fusion. We designed and conducted a subjective evaluation experiment of timbre perception attributes based on the method of successive categories. The experimental data were processed using statistical approaches, such as variance analysis, multidimensional preference analysis, and correlation analysis, and we studied the influence of the temporal envelopes and instrument types on fusion, segregation, roughness, and pleasantness. In addition, the differences between Chinese and Western instruments were compared based on these four perception attributes. The results show that fusion and segregation are the most important attributes for Chinese instrument timbre, while roughness is the most important attribute for Western instrument timbre. In addition, multiple linear regression, random forest, and multilayer perceptron were used to construct a set of timbre fusion models for Chinese and Western instruments. The results show that these models can better predict the timbre fusion attributes. It was also found that there are some differences between the timbre fusion models for Chinese and Western instruments, which is consistent with the analysis results of subjective experimental data. The contribution of acoustic objective parameters to the fusion model is also discussed.
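The spectral flatness measure quoted in the citation context for this article (geometric mean of the sub-band power spectrum divided by its arithmetic mean) is straightforward to compute; a minimal NumPy sketch:

```python
import numpy as np

def spectral_flatness(power_spectrum, eps=1e-12):
    """Geometric mean / arithmetic mean of a (sub-band) power spectrum."""
    p = np.asarray(power_spectrum, dtype=float) + eps   # eps avoids log(0)
    geometric_mean = np.exp(np.mean(np.log(p)))
    arithmetic_mean = np.mean(p)
    return geometric_mean / arithmetic_mean

# A flat (noise-like) spectrum gives a value near 1, a tonal spectrum near 0:
print(spectral_flatness(np.ones(512)))                          # ~1.0
print(spectral_flatness(np.r_[1000.0, np.ones(511) * 1e-6]))    # close to 0
```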
... This paper also suggests using MFCC spectrograms for song pre-processing. [8] The work of Gwardys et al. demonstrates an intriguing strategy incorporating transfer learning. They used ILSVRC-2012 [9] to train the model for image recognition, and then reused it for genre classification on MFCC spectrograms. ...
Conference Paper
Due to the enormous expansion in the accessibility of music data, music genre classification has taken on new significance in recent years. In order to have better access to such collections, we need to index them correctly. Automatic music genre classification is essential when working with a large collection of music. For the majority of contemporary music genre classification methodologies, researchers have favoured machine learning techniques. In this study, we employed two datasets with different genres. A deep learning approach is utilised to train and classify the system, with a convolutional neural network used for training and classification. In speech analysis, the most crucial task is feature extraction; the Mel Frequency Cepstral Coefficient (MFCC) is utilised as the main audio feature extraction technique. By extracting the feature vector, the suggested method classifies music into several genres. Our findings suggest that our system has an 80% accuracy level, which will improve substantially with further training and facilitate music genre classification.
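A minimal sketch of the approach that abstract outlines, assuming librosa for MFCC extraction and PyTorch for the classifier; the layer sizes and the ten-genre output are simplifications, not the authors' architecture.

```python
import librosa
import torch
import torch.nn as nn

def mfcc_tensor(path, n_mfcc=20):
    """Load a 30 s excerpt ("path" is a placeholder) and return a (1, n_mfcc, frames) tensor."""
    y, sr = librosa.load(path, sr=22050, duration=30.0)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return torch.tensor(mfcc, dtype=torch.float32).unsqueeze(0)

class GenreCNN(nn.Module):
    def __init__(self, n_genres=10):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(32, n_genres))

    def forward(self, x):            # x: (batch, 1, n_mfcc, frames)
        return self.head(self.conv(x))

# logits = GenreCNN()(mfcc_tensor("song.wav").unsqueeze(0))   # shape (1, n_genres)
```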
... For imaging applications, the ImageNet database [49], with over 14 million natural images, is frequently used in this role. The applications range from image classification [50] to other domains like audio data [51]. Transfer learning has also been successfully applied to CT reconstruction tasks. ...
Article
Deep learning approaches for tomographic image reconstruction have become very effective and have been demonstrated to be competitive in the field. Comparing these approaches is a challenging task as they rely to a great extent on the data and setup used for training. With the Low-Dose Parallel Beam (LoDoPaB)-CT dataset, we provide a comprehensive, open-access database of computed tomography images and simulated low photon count measurements. It is suitable for training and comparing deep learning methods as well as classical reconstruction approaches. The dataset contains over 40,000 scan slices from around 800 patients selected from the LIDC/IDRI database. The data selection and simulation setup are described in detail, and the generating script is publicly accessible. In addition, we provide a Python library for simplified access to the dataset and an online reconstruction challenge. Furthermore, the dataset can also be used for transfer learning as well as sparse and limited-angle reconstruction scenarios.
... Training CNNs on MFCCs or log-mel-spectrograms has been shown to produce state-of-the-art results in audio processing for a number of applications [10,11,12,13]. Pretraining CNNs on images has been shown to transfer well to audio tasks and has been used to establish benchmarks in a number of audio datasets [14,15,16,17,18]. Segments of audio were fed to the network and trained against a binary decision of "hit/not a hit". ...
Conference Paper
We present a new method and a large-scale database to detect audio-video synchronization (A/V sync) errors in tennis videos. A deep network is trained to detect the visual signature of the tennis ball being hit by the racquet in the video stream. Another deep network is trained to detect the auditory signature of the same event in the audio stream. During evaluation, the audio stream is searched by the audio network for the audio event of the ball being hit. If the event is found in audio, the neighboring interval in video is searched for the corresponding visual signature. If the event is not found in the video stream but is found in the audio stream, an A/V sync error is flagged. We developed a large-scale database of 504,300 frames from 6 hours of videos of tennis events, simulated A/V sync errors, and found our method achieves high accuracy on the task.
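A hedged sketch of the decision logic that abstract describes; audio_hit and video_hit stand in for the two trained networks (hypothetical per-frame detectors), and the search window size is an arbitrary choice.

```python
def find_av_sync_error(audio_hit, video_hit, n_frames, window=15):
    """Return the first frame index where a racquet hit is heard but not seen
    within +/- window frames, or None if no such mismatch is found."""
    for t in range(n_frames):
        if not audio_hit(t):                         # audio network: hit event at frame t?
            continue
        lo, hi = max(0, t - window), min(n_frames, t + window + 1)
        if not any(video_hit(v) for v in range(lo, hi)):
            return t                                 # heard but not seen nearby -> sync error
    return None
```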
Article
The paper presents an application for automatically classifying emotions in film music. A model of emotions is proposed, which is also associated with colors. The model created has nine emotional states, to which colors are assigned according to the color theory in film. Subjective tests are carried out to check the correctness of the assumptions behind the adopted emotion model. For that purpose, a statistical analysis of the subjective test results is performed. The application employs a deep convolutional neural network (CNN), which classifies emotions based on 30 s excerpts of music works presented to the CNN input using mel-spectrograms. Examples of classification results of the selected neural networks used to create the system are shown.
Conference Paper
This paper describes an algorithm for real-time beat tracking with a visual interface. Multiple tempo and phase hypotheses are represented by a comb filter matrix. The user can interact by specifying the tempo and phase to be tracked by the algorithm, which will seek to find a continuous path through the space. We present results from evaluating the algorithm on the Hainsworth database and offer a comparison with another existing real-time beat tracking algorithm and offline algorithms.
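Not the authors' algorithm, but a toy illustration of the comb-filter idea, assuming an onset-strength envelope sampled at a known frame rate: each (tempo, phase) hypothesis is scored by summing onset strength on its beat grid.

```python
import numpy as np

def comb_filter_scores(onset_env, frame_rate, bpm_range=range(60, 181)):
    """Score (bpm, phase) hypotheses against an onset-strength envelope."""
    onset_env = np.asarray(onset_env, dtype=float)
    scores = {}
    for bpm in bpm_range:
        period = max(1, int(round(frame_rate * 60.0 / bpm)))   # frames per beat
        for phase in range(period):
            scores[(bpm, phase)] = float(onset_env[phase::period].sum())
    return scores

# Picking the best hypothesis (user interaction could instead constrain the search):
# bpm, phase = max(scores, key=scores.get)
```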
Conference Paper
Traditional methods of computer vision and machine learning cannot match human performance on tasks such as the recognition of handwritten digits or traffic signs. Our biologically plausible, wide and deep artificial neural network architectures can. Small (often minimal) receptive fields of convolutional winner-take-all neurons yield large network depth, resulting in roughly as many sparsely connected neural layers as found in mammals between retina and visual cortex. Only winner neurons are trained. Several deep neural columns become experts on inputs preprocessed in different ways; their predictions are averaged. Graphics cards allow for fast training. On the very competitive MNIST handwriting benchmark, our method is the first to achieve near-human performance. On a traffic sign recognition benchmark it outperforms humans by a factor of two. We also improve the state-of-the-art on a plethora of common image classification benchmarks.
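A minimal sketch of the multi-column idea described above: several models, each paired with a different input preprocessing, vote by averaging their class probabilities. The model and preprocessor objects are hypothetical placeholders, not the paper's GPU implementation.

```python
import numpy as np

def ensemble_predict(columns, image):
    """columns: list of (preprocess_fn, model) pairs; each model exposes a
    predict_proba(x) returning class probabilities (assumed interface)."""
    probs = [model.predict_proba(prep(image)) for prep, model in columns]
    return int(np.argmax(np.mean(probs, axis=0)))   # average the columns, take argmax
```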
Conference Paper
A great deal of research has focused on algorithms for learning features from unlabeled data. Indeed, much progress has been made on benchmark datasets like NORB and CIFAR by employing increasingly complex unsupervised learning algorithms and deep models. In this paper, however, we show that several very simple factors, such as the number of hidden nodes in the model, may be as important to achieving high performance as the choice of learning algorithm or the depth of the model. Specifically, we will apply several off-the-shelf feature learning algorithms (sparse auto-encoders, sparse RBMs, K-means clustering, and Gaussian mixtures) to NORB and CIFAR datasets using only single-layer networks. We then present a detailed analysis of the effect of changes in the model setup: the receptive field size, number of hidden nodes (features), the step-size (stride) between extracted features, and the effect of whitening. Our results show that large numbers of hidden nodes and dense feature extraction are as critical to achieving high performance as the choice of algorithm itself: so critical, in fact, that when these parameters are pushed to their limits, we are able to achieve state-of-the-art performance on both CIFAR and NORB using only a single layer of features. More surprisingly, our best performance is based on K-means clustering, which is extremely fast, has no hyper-parameters to tune beyond the model structure itself, and is very easy to implement. Despite the simplicity of our system, we achieve performance beyond all previously published results on the CIFAR-10 and NORB datasets (79.6% and 97.0% accuracy respectively).
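A rough sketch of the single-layer K-means pipeline this abstract describes, using scikit-learn; the dictionary size, PCA whitening, and the "triangle" soft encoding follow the general recipe, but the sizes and helper names are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import PCA

def learn_dictionary(patches, k=400):
    """patches: (n, d) flattened image patches. Whiten, then cluster into k centroids."""
    whitener = PCA(whiten=True).fit(patches)
    kmeans = MiniBatchKMeans(n_clusters=k).fit(whitener.transform(patches))
    return whitener, kmeans

def triangle_encode(patches, whitener, kmeans):
    """Soft activation per centroid: max(0, mean distance - distance to that centroid)."""
    z = whitener.transform(patches)
    d = np.linalg.norm(z[:, None, :] - kmeans.cluster_centers_[None, :, :], axis=2)
    return np.maximum(0.0, d.mean(axis=1, keepdims=True) - d)
```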
Conference Paper
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
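This is the architecture the genre-classification pipeline above reuses; torchvision ships a close variant, which makes it easy to inspect the quoted parameter count (the weights argument shown assumes torchvision >= 0.13; older versions use pretrained=True).

```python
import torchvision.models as models

net = models.alexnet(weights="DEFAULT")           # ImageNet-pretrained AlexNet-style model
n_params = sum(p.numel() for p in net.parameters())
print(f"{n_params / 1e6:.1f}M parameters")        # roughly 61M in torchvision's variant
```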
Article
Low-level aspects of music audio such as timbre, loudness and pitch can be relatively well modelled by features extracted from short-time windows. Higher-level aspects such as melody, harmony, phrasing and rhythm, on the other hand, are salient only at larger timescales and require a better representation of time dynamics. For various music information retrieval tasks, one would benefit from modelling both low and high level aspects in a unified feature extraction framework. By combining adaptive features computed at different timescales, short-timescale events are put in context by detecting longer timescale features. In this paper, we describe a method to obtain such multi-scale features and evaluate its effectiveness for automatic tag annotation.
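As a simplified stand-in for the adaptive multiscale features this abstract describes (not the authors' method), one can summarize short-time MFCC frames over several window lengths and concatenate the statistics; a sketch assuming librosa, with the scales chosen arbitrarily.

```python
import numpy as np
import librosa

def multiscale_mfcc(y, sr, scales_s=(0.5, 2.0, 8.0), n_mfcc=13):
    """Concatenate MFCC mean/std statistics pooled over several timescales."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)        # (n_mfcc, frames)
    frames_per_s = mfcc.shape[1] / (len(y) / sr)
    feats = []
    for sec in scales_s:
        w = max(1, int(round(sec * frames_per_s)))                # frames per window
        n = (mfcc.shape[1] // w) * w
        if n == 0:
            continue                                              # track shorter than this scale
        pooled = mfcc[:, :n].reshape(n_mfcc, -1, w).mean(axis=2)  # per-window means
        feats.append(np.concatenate([pooled.mean(axis=1), pooled.std(axis=1)]))
    return np.concatenate(feats)
```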
Article
Semantic word spaces have been very useful but cannot express the meaning of longer phrases in a principled way. Further progress towards understanding compositionality in tasks such as sentiment detection requires richer supervised training and evaluation resources and more powerful models of composition. To remedy this, we introduce a Sentiment Treebank. It includes fine-grained sentiment labels for 215,154 phrases in the parse trees of 11,855 sentences and presents new challenges for sentiment compositionality. To address them, we introduce the Recursive Neural Tensor Network. When trained on the new treebank, this model outperforms all previous methods on several metrics. It pushes the state of the art in single sentence positive/negative classification from 80% up to 85.4%. The accuracy of predicting fine-grained sentiment labels for all phrases reaches 80.7%, an improvement of 9.7% over bag-of-features baselines. Lastly, it is the only model that can accurately capture the effects of negation and its scope at various tree levels for both positive and negative phrases.