Article

A Simple Method to Determine if a Music Information Retrieval System is a “Horse”

Authors: Bob L. Sturm

Abstract

We propose and demonstrate a simple method to explain the figure of merit (FoM) of a music information retrieval (MIR) system evaluated in a dataset, specifically, whether the FoM comes from the system using characteristics confounded with the “ground truth” of the dataset. Akin to the controlled experiments designed to test the supposed mathematical ability of the famous horse “Clever Hans,” we perform two experiments to show how three state-of-the-art MIR systems produce excellent FoM in spite of not using musical knowledge. This provides avenues for improving MIR systems, as well as their evaluation. We make available a reproducible research package so that others can apply the same method to evaluating other MIR systems.
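To make the idea concrete, here is a minimal sketch (not the paper's released research package) of the kind of test the method prescribes: apply a transformation that is irrelevant to the "ground truth" and check whether the figure of merit survives it. The function names and the particular transformation (a barely audible gain change) are illustrative assumptions.

```python
import numpy as np

def apply_small_gain(x, gain_db=0.5):
    """An 'irrelevant' transformation: a barely audible level change."""
    return x * (10.0 ** (gain_db / 20.0))

def figure_of_merit(predict, recordings, labels):
    """Accuracy of a black-box system `predict` on a labelled test set."""
    return float(np.mean([predict(x) == y for x, y in zip(recordings, labels)]))

def horse_test(predict, recordings, labels, transform=apply_small_gain):
    """Compare the FoM before and after the irrelevant transformation."""
    fom_before = figure_of_merit(predict, recordings, labels)
    fom_after = figure_of_merit(predict, [transform(x) for x in recordings], labels)
    return fom_before, fom_after
```

A figure of merit that collapses (or can be driven up) under such a transformation is evidence that the system responds to characteristics confounded with the ground truth rather than to the music itself.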


... A significant amount of research in the disciplines of music content analysis and content-based music information retrieval (MIR) is plagued by an inability to distinguish between solutions and "horses" [Gouyon et al. 2013; Urbano et al. 2013; Sturm 2014a; 2014b]. In its most basic form, a "horse" is a system that appears as if it is solving a particular problem when it actually is not [Sturm 2014a]. This was exactly the case with Clever Hans [Pfungst 1911], a real horse that was claimed to be capable of doing arithmetic and other feats of abstract thought. ...
... We cannot tell which task DeSPerF-BALLROOM is performing just from looking at Fig. 1. While comparing the output of a music content analysis system to the ground truth of a dataset is convenient, it simply does not distinguish between "horses" and solutions [Sturm 2013; 2014a]. It does not produce valid evidence of intelligence. ...
Preprint
Building systems that possess the sensitivity and intelligence to identify and describe high-level attributes in music audio signals continues to be an elusive goal, but one that surely has broad and deep implications for a wide variety of applications. Hundreds of papers have so far been published toward this goal, and great progress appears to have been made. Some systems produce remarkable accuracies at recognising high-level semantic concepts, such as music style, genre and mood. However, it might be that these numbers do not mean what they seem. In this paper, we take a state-of-the-art music content analysis system and investigate what causes it to achieve exceptionally high performance in a benchmark music audio dataset. We dissect the system to understand its operation, determine its sensitivities and limitations, and predict the kinds of knowledge it could and could not possess about music. We perform a series of experiments to illuminate what the system has actually learned to do, and to what extent it is performing the intended music listening task. Our results demonstrate how the initial manifestation of music intelligence in this state of the art can be deceptive. Our work provides constructive directions toward developing music content analysis systems that can address the music information and creation needs of real-world users.
... Diversity in non-lingual effects can be expected to increase in larger speech corpora with greater diversity in data collection settings. Such corpora may improve generalization by reducing the mismatch with unknown test utterances [83]. Therefore, the availability of a large and diversified corpus is important for developing robust LID systems. ...
... However, we suspect that if excessively large numbers of augmented samples are used to compensate for the insufficient training data of low-resourced corpora, the computational cost increases. Further, with too many augmented utterances compared to the original samples, the LID model can start learning the acoustic variations and diverge from learning the true language-discriminating cues present in the original training data [83]. In spite of notable advancements in exploring different audio augmentation techniques, we observe that the existing literature [25,36,96] does not provide a comprehensive analysis of how much audio data should be augmented for training. ...
... This is an interesting observation that can be justified by saying that beyond γ = 3, there are too many acoustically diversified training samples compared to the original training data. So, the LID models may start learning the non-lingual diversities instead of the actual language-discriminating cues present in the original data [83]. Moreover, a smaller value of the fold-factor, γ = 2, 3, also reduces computation and the model training time. ...
Preprint
Full-text available
This work addresses the cross-corpora generalization issue for the low-resourced spoken language identification (LID) problem. We have conducted the experiments in the context of Indian LID and identified strikingly poor cross-corpora generalization due to corpora-dependent non-lingual biases. Our contribution to this work is twofold. First, we propose domain diversification, which diversifies the limited training data using different audio data augmentation methods. We then propose the concept of maximally diversity-aware cascaded augmentations and optimize the augmentation fold-factor for effective diversification of the training data. Second, we introduce the idea of domain generalization considering the augmentation methods as pseudo-domains. Towards this, we investigate both domain-invariant and domain-aware approaches. Our LID system is based on the state-of-the-art emphasized channel attention, propagation, and aggregation based time delay neural network (ECAPA-TDNN) architecture. We have conducted extensive experiments with three widely used corpora for Indian LID research. In addition, we conduct a final blind evaluation of our proposed methods on the Indian subset of VoxLingua107 corpus collected in the wild. Our experiments demonstrate that the proposed domain diversification is more promising over commonly used simple augmentation methods. The study also reveals that domain generalization is a more effective solution than domain diversification. We also notice that domain-aware learning performs better for same-corpora LID, whereas domain-invariant learning is more suitable for cross-corpora generalization. Compared to basic ECAPA-TDNN, its proposed domain-invariant extensions improve the cross-corpora EER up to 5.23%. In contrast, the proposed domain-aware extensions also improve performance for same-corpora test scenarios.
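The fold-factor idea above lends itself to a brief illustration. The sketch below is an assumption-laden simplification (numpy only, additive noise at a random SNR as the sole augmentation, and gamma interpreted as the number of augmented copies per original utterance); the cited work uses a richer cascade of augmentation methods.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(x, snr_db):
    """Add white noise to utterance x at the requested signal-to-noise ratio."""
    noise = rng.standard_normal(len(x))
    scale = np.sqrt(np.mean(x ** 2) / (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
    return x + scale * noise

def diversify(utterances, gamma=3, snr_range=(5, 20)):
    """Keep each original utterance and append gamma noisy copies of it."""
    out = []
    for x in utterances:
        out.append(x)
        for _ in range(gamma):
            out.append(add_noise(x, rng.uniform(*snr_range)))
    return out
```

As the excerpts above note, pushing gamma too high risks the model learning the augmentation-induced acoustic variation rather than language-discriminating cues, so the fold-factor is a quantity worth tuning rather than maximizing.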
... Under the Cranfield Paradigm, state-of-the-art MIR systems perform exceptionally well in reproducing the ground truth of some datasets, e.g., inferring rhythm, genre or emotion from audio data. However, slight and irrelevant transformations of the audio can suddenly render these systems ineffectual [Sturm, 2014b, Rodríguez-Algarra et al., 2016]. In one case [Rodríguez-Algarra et al., 2016], a "genre recognition" system relies on infrasonic signatures, seemingly originating from the data collection, but nonetheless imperceptible and irrelevant for human listeners. ...
... While that review does not explicitly refer to validity or even mention the Cranfield Paradigm, it makes clear how central computer experiments are to MIR, and criticizes a general lack of consideration of users in its conclusions. Sturm [2013] asserts that just reporting figures of merit like accuracy is not sufficient to decide whether an MIR system is really recognizing "genre" in musical signals, or whether it relies on irrelevant confounding factors, later stating that these uncontrolled factors are a danger to the validity of conclusions drawn from such experiments [Sturm, 2014b, 2017]. Later work attempts to define music description (including the "use case") to motivate evaluating music description systems in ways that allow for valid and relevant conclusions. ...
... The above review of previous work on validity in MIR shows it to be dispersed and fragmented across only relatively few publications. Despite a small chorus of calls to address major methodological problems of MIR experiments to improve validity in the discipline, e.g., Urbano [2011], Sturm [2014b], Urbano and Flexer [2018], Liem and Mostert [2020], there has yet to be published a systematic and critical engagement of what validity means in the context of MIR, and how to consider it when designing, implementing and analyzing experiments. ...
Preprint
Full-text available
Validity is the truth of an inference made from evidence, such as data collected in an experiment, and is central to working scientifically. Given the maturity of the domain of music information research (MIR), validity in our opinion should be discussed and considered much more than it has been so far. Considering validity in one's work can improve its scientific and engineering value. Puzzling MIR phenomena like adversarial attacks and performance glass ceilings become less mysterious through the lens of validity. In this article, we review the subject of validity in general, considering the four major types of validity from a key reference: Shadish et al. 2002. We ground our discussion of these types with a prototypical MIR experiment: music classification using machine learning. Through this, MIR experimentalists can be guided to make valid inferences from the data collected in their experiments.
... Adversarial examples were first reported in the field of image classification (Szegedy et al., 2014), where marginal perturbations of input data could significantly degrade the performance of a machine learning system. From there on, the phenomenon was observed in various other fields, including natural language and speech processing (Carlini and Wagner, 2018; Zhang et al., 2020), as well as MIR (Sturm, 2014). Nevertheless, literature concerning adversarial vulnerability in MIR remains sparse and its relevance questionable, as it is not considered to pose a security issue, as it is in other fields. ...
... As the first adversarial attacks in MIR, filtering transformations and tempo changes were used in untargeted black-box attacks to deflate and inflate the performance of genre, emotion and rhythm classification systems to no better than chance level or a perfect 100% (Sturm, 2014, 2016). Also, a targeted white-box attack on genre recognition systems has been proposed (Kereliuk et al., 2015), in which magnitude spectral frames computed from audio are treated as images and attacked building upon approaches from image object recognition. ...
... Changes larger than zero denote an increase of the k-occurrence after an attack. It has been shown that impressive results at almost human-level performance might not involve musical knowledge at all (Sturm 2013, 2014). This directly brings us to the question of validity, i.e., whether our instrument classification experiment is actually measuring what we intended to measure (Trochim and Donnelly, 2001; Urbano et al., 2013). ...
Article
Full-text available
Small adversarial perturbations of input data can drastically change the performance of machine learning systems, thereby challenging their validity. We compare several adversarial attacks targeting an instrument classifier, where for the first time in Music Information Retrieval (MIR) the perturbations are computed directly on the waveform. The attacks can reduce the accuracy of the classifier significantly, while at the same time keeping perturbations almost imperceptible. Furthermore, we show the potential of adversarial attacks being a security issue in MIR by artificially boosting playcounts through an attack on a real-world music recommender system.
... We make the instrument annotations for these recordings publicly available. 1 A common pitfall of music classification systems is overreliance on confounding factors in training and test data, which may lead to poor generalization ability. A system for genre classification may, for example, make decisions based on inaudible artifacts rather than musical content [1]. Such confounding effects may also arise for our IAD system by, e.g., affecting predictions for classes that are often active simultaneously (such as brass and woodwinds). ...
Article
Full-text available
Instrument activity detection is a fundamental task in music information retrieval, serving as a basis for many applications, such as music recommendation, music tagging, or remixing. Most published works on this task cover popular music and music for smaller ensembles. In this paper, we embrace orchestral and opera music recordings as a rarely considered scenario for automated instrument activity detection. Orchestral music is particularly challenging since it consists of intricate polyphonic and polytimbral sound mixtures where multiple instruments are playing simultaneously. Orchestral instruments can naturally be arranged in hierarchical taxonomies, according to instrument families. As the main contribution of this paper, we show that a hierarchical classification approach can be used to detect instrument activity in our scenario, even if only few fine-grained, instrument-level annotations are available. We further consider additional loss terms for improving the hierarchical consistency of predictions. For our experiments, we collect a dataset containing 14 hours of orchestral music recordings with aligned instrument activity annotations. Finally, we perform an analysis of the behavior of our proposed approach with regard to potential confounding errors.
... However, [7] identified most of these approaches as being "horses", i.e., classifying not based on causal relationships between acoustic features and genre, but based on statistical coincidences inherent in the given dataset. This means that many of the commonly extracted low-level features are meaningless for the task of genre classification, and the classifier is not universally valid, i.e., it is not likely to perform similarly well when applied to another dataset. ...
... The final predictions are made by averaging the predictions of each individual tree. Model fitting was done using sklearn. For the classifier we applied 5-fold cross-validation, i.e., the original sample is randomly partitioned into five equal-sized subsamples. ...
Conference Paper
Full-text available
Producers of Electronic Dance Music (EDM) typically spend more time creating, shaping, mixing and mastering sounds than on aspects of composition and arrangement. They analyze the sound by close listening and by leveraging audio metering and audio analysis tools until they have successfully created the desired sound aesthetics. DJs of EDM tend to play sets of songs that meet their sound ideal. We therefore suggest using audio metering and monitoring tools from the recording studio to analyze EDM, instead of relying on conventional low-level audio features. We test our novel set of features with a simple classification task: we attribute songs to DJs who would play the specific song. This new set of features and the focus on DJ sets is targeted at EDM, as it takes the producer and DJ culture into account. With simple dimensionality reduction and machine learning, these features enable us to attribute a song to a DJ with an accuracy of 63%. The features from the audio metering and monitoring tools in the recording studio could serve many applications in Music Information Retrieval, such as genre, style and era classification, and music recommendation for both DJs and consumers of electronic dance music.
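The classification setup sketched in the excerpt above (a random forest whose trees' predictions are averaged, fitted with sklearn and assessed with 5-fold cross-validation) can be reproduced in outline as follows; the feature matrix and DJ labels here are random placeholders standing in for the audio-metering features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X = np.random.rand(200, 24)        # placeholder feature matrix (songs x metering features)
y = np.random.randint(0, 4, 200)   # placeholder DJ labels

clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)   # 5 equal-sized folds
print(scores.mean())
```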
... Considering the current scenario of first-impression analysis in the context of job candidate screening, the quest for algorithmic accountability will inevitably also bring ethical questions. Will automatically trained pipelines inadvertently pick up on data properties that actually are irrelevant to the problem at hand [80]? May certain individuals get disadvantaged because of this, or due to inherent data biases? ...
Preprint
Explainability and interpretability are two critical aspects of decision support systems. Within computer vision, they are critical in certain tasks related to human behavior analysis such as in health care applications. Despite their importance, it is only recently that researchers are starting to explore these aspects. This paper provides an introduction to explainability and interpretability in the context of computer vision with an emphasis on looking at people tasks. Specifically, we review and study those mechanisms in the context of first impressions analysis. To the best of our knowledge, this is the first effort in this direction. Additionally, we describe a challenge we organized on explainability in first impressions analysis from video. We analyze in detail the newly introduced data set, the evaluation protocol, and summarize the results of the challenge. Finally, derived from our study, we outline research opportunities that we foresee will be decisive in the near future for the development of the explainable computer vision field.
... It is not too abundant in classical music, and extremely rare at the very beginning of pieces - and thus all the more unexpected and surprising here. 7 This may also help avoid the Clever Hans effect recently identified in various MIR systems [Sturm 2014], which is clearly related to these systems focusing on features at musically irrelevant levels. 8 Some of these problems are addressed in recent work on Markov Constraint models, such as [Roy and Pachet 2013], which proposes a solution for the counting problem in musical meter and, more recently, [Papadopoulos et al. 2015], which presents a method for sampling Markov sequences that satisfy some regular constraints (represented by an automaton). ...
Preprint
This text offers a personal and very subjective view on the current situation of Music Information Research (MIR). Motivated by the desire to build systems with a somewhat deeper understanding of music than the ones we currently have, I try to sketch a number of challenges for the next decade of MIR research, grouped around six simple truths about music that are probably generally agreed on, but often ignored in everyday research.
... What we find most promising about gradient regularization, though, is that it significantly changes the shape of the models' decision boundaries, which suggests that they make predictions for qualitatively different (and perhaps better) reasons. It is unlikely that regularizing for this kind of smoothness will be a panacea for all manifestations of the "Clever Hans" effect (Sturm 2014) in deep neural networks, but in this case the prior it represents -that predictions should not be sensitive to small perturbations in input space -helps us find models that make more robust and interpretable predictions. No matter what method proves most effective in the general case, we suspect that any progress towards ensuring either interpretability or adversarial robustness in deep neural networks will likely represent progress towards both. ...
Preprint
Deep neural networks have proven remarkably effective at solving many classification problems, but have been criticized recently for two major weaknesses: the reasons behind their predictions are uninterpretable, and the predictions themselves can often be fooled by small adversarial perturbations. These problems pose major obstacles for the adoption of neural networks in domains that require security or transparency. In this work, we evaluate the effectiveness of defenses that differentiably penalize the degree to which small changes in inputs can alter model predictions. Across multiple attacks, architectures, defenses, and datasets, we find that neural networks trained with this input gradient regularization exhibit robustness to transferred adversarial examples generated to fool all of the other models. We also find that adversarial examples generated to fool gradient-regularized models fool all other models equally well, and actually lead to more "legitimate," interpretable misclassifications as rated by people (which we confirm in a human subject experiment). Finally, we demonstrate that regularizing input gradients makes them more naturally interpretable as rationales for model predictions. We conclude by discussing this relationship between interpretability and robustness in deep neural networks.
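A minimal sketch of the input gradient regularization idea described above, assuming a standard PyTorch classifier; the exact penalty (which norm is used and how it is weighted) varies between formulations.

```python
import torch
import torch.nn.functional as F

def gradient_regularized_loss(model, x, y, lam=0.1):
    """Cross-entropy plus a penalty on the squared norm of the input gradient."""
    x = x.clone().requires_grad_(True)
    ce = F.cross_entropy(model(x), y)
    # Gradient of the loss w.r.t. the inputs, kept in the graph so the penalty
    # itself can be backpropagated through during training.
    grad_x, = torch.autograd.grad(ce, x, create_graph=True)
    penalty = grad_x.pow(2).sum(dim=tuple(range(1, grad_x.dim()))).mean()
    return ce + lam * penalty
```

Minimizing this combined loss discourages predictions from being sensitive to small input perturbations, which is the prior the excerpt above credits with more robust and interpretable behaviour.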
... As interest in these grand challenges has grown, so has scrutiny of the benchmarks and models that appear to solve them. Are we making progress towards these challenges, or are good results the latest incarnation of horses [6], [7] and Potemkin villages [8], with neural networks finding unexpected correlates that provide shortcuts to give away the answer? ...
Preprint
In recent years, visual question answering (VQA) has become topical. The premise of VQA's significance as a benchmark in AI, is that both the image and textual question need to be well understood and mutually grounded in order to infer the correct answer. However, current VQA models perhaps `understand' less than initially hoped, and instead master the easier task of exploiting cues given away in the question and biases in the answer distribution. In this paper we propose the inverse problem of VQA (iVQA). The iVQA task is to generate a question that corresponds to a given image and answer pair. We propose a variational iVQA model that can generate diverse, grammatically correct and content correlated questions that match the given answer. Based on this model, we show that iVQA is an interesting benchmark for visuo-linguistic understanding, and a more challenging alternative to VQA because an iVQA model needs to understand the image better to be successful. As a second contribution, we show how to use iVQA in a novel reinforcement learning framework to diagnose any existing VQA model by way of exposing its belief set: the set of question-answer pairs that the VQA model would predict true for a given image. This provides a completely new window into what VQA models `believe' about images. We show that existing VQA models have more erroneous beliefs than previously thought, revealing their intrinsic weaknesses. Suggestions are then made on how to address these weaknesses going forward.
... The difference in model performance serves as a coarse detection mechanism for shortcuts. A similar methodology is employed by Sturm [174] in the context of music information retrieval systems. In their work, they propose the "method of irrelevant transformations", modifying data using changes that should not affect the target variable (e.g., applying slight equalization or cropping irrelevant parts of audio recordings). ...
Preprint
Full-text available
Shortcuts, also described as Clever Hans behavior, spurious correlations, or confounders, present a significant challenge in machine learning and AI, critically affecting model generalization and robustness. Research in this area, however, remains fragmented across various terminologies, hindering the progress of the field as a whole. Consequently, we introduce a unifying taxonomy of shortcut learning by providing a formal definition of shortcuts and bridging the diverse terms used in the literature. In doing so, we further establish important connections between shortcuts and related fields, including bias, causality, and security, where parallels exist but are rarely discussed. Our taxonomy organizes existing approaches for shortcut detection and mitigation, providing a comprehensive overview of the current state of the field and revealing underexplored areas and open challenges. Moreover, we compile and classify datasets tailored to study shortcut learning. Altogether, this work provides a holistic perspective to deepen understanding and drive the development of more effective strategies for addressing shortcuts in machine learning.
... In AI, the Clever Hans effect has also been widely recognised (Sturm, 2014;Hernandez-Orallo, 2019). In particular, reinforcement learning agents are known for solving tasks in unintended ways (Krakovna et al., 2020). ...
Preprint
Full-text available
The integrity of AI benchmarks is fundamental to accurately assess the capabilities of AI systems. The internal validity of these benchmarks - i.e., making sure they are free from confounding factors - is crucial for ensuring that they are measuring what they are designed to measure. In this paper, we explore a key issue related to internal validity: the possibility that AI systems can solve benchmarks in unintended ways, bypassing the capability being tested. This phenomenon, widely known in human and animal experiments, is often referred to as the 'Clever Hans' effect, where tasks are solved using spurious cues, often involving much simpler processes than those putatively assessed. Previous research suggests that language models can exhibit this behaviour as well. In several older Natural Language Processing (NLP) benchmarks, individual n-grams like "not" have been found to be highly predictive of the correct labels, and supervised NLP models have been shown to exploit these patterns. In this work, we investigate the extent to which simple n-grams extracted from benchmark instances can be combined to predict labels in modern multiple-choice benchmarks designed for LLMs, and whether LLMs might be using such n-gram patterns to solve these benchmarks. We show how simple classifiers trained on these n-grams can achieve high scores on several benchmarks, despite lacking the capabilities being tested. Additionally, we provide evidence that modern LLMs might be using these superficial patterns to solve benchmarks. This suggests that the internal validity of these benchmarks may be compromised and caution should be exercised when interpreting LLM performance results on them.
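The n-gram probe described above can be sketched as follows, assuming a plain bag-of-n-grams classifier; the benchmark texts and labels below are placeholders, not the datasets used in the cited work.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

texts = [
    "the answer is not obvious",       # placeholder benchmark items
    "all of the above are correct",
    "none of the statements hold",
    "this is clearly the case",
] * 10
labels = [0, 1, 0, 1] * 10

probe = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),   # unigrams and bigrams as features
    LogisticRegression(max_iter=1000),
)
scores = cross_val_score(probe, texts, labels, cv=5)
print(scores.mean())   # a high score here flags superficial, label-predictive cues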
... It has long been known that deep neural networks (and, in fact, most other machine learning models as well) are sensitive to adversarial perturbation [39,41]. In the case of image processing tasks, this means that-given a network and an input image-an adversary can compute a specific input perturbation that is invisible to the human eye yet changes the output arbitrarily. ...
Preprint
Achieving robustness against adversarial input perturbation is an important and intriguing problem in machine learning. In the area of semantic image segmentation, a number of adversarial training approaches have been proposed as a defense against adversarial perturbation, but the methodology of evaluating the robustness of the models is still lacking, compared to image classification. Here, we demonstrate that, just like in image classification, it is important to evaluate the models over several different and hard attacks. We propose a set of gradient based iterative attacks and show that it is essential to perform a large number of iterations. We include attacks against the internal representations of the models as well. We apply two types of attacks: maximizing the error with a bounded perturbation, and minimizing the perturbation for a given level of error. Using this set of attacks, we show for the first time that a number of models in previous work that are claimed to be robust are in fact not robust at all. We then evaluate simple adversarial training algorithms that produce reasonably robust models even under our set of strong attacks. Our results indicate that a key design decision to achieve any robustness is to use only adversarial examples during training. However, this introduces a trade-off between robustness and accuracy.
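A minimal sketch of one of the attack types described above (maximizing the error under a bounded perturbation, in the style of projected gradient descent); the attacks in the cited work, including those on internal representations, are more elaborate.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=40):
    """Iteratively ascend the loss, projecting back into an L-infinity eps-ball."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        # Step in the sign of the gradient, then project to the ball and valid range.
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()
```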
... The origin of hearing and its physiology already reveal a close connection between sound and space, as discussed in the next chapter. 109 See Sturm (2014). 110 See, e.g., ...
Chapter
Music is spatial in many respects. Musical concepts and music perception are described in spatial terms in many cultures. This spatial thinking is reflected in music itself, from spatial compositions to stereophonic recording and mixing techniques. Consequently, both traditional music theories and modern approaches to computational music analysis use spatial concepts and operations to gain a deeper understanding of music. This chapter provides an overview of concepts of spatiality in music psychology, the state of the art in spatial music composition and mixing in the recording studio, and an overview of spatiality in music theory and computational music analysis. The importance of spatial concepts in all of these theoretical and practical disciplines underlines the significance of space in music. This deep relationship becomes apparent when music is considered as a creative art, as an acoustic signal, and as a psychological phenomenon.
... The second point is to notice the internal representation that the network learns to perform its transduction: the only thing the network "knows" is the set of weights between units onto which it has converged during training. While it is tempting to characterize the knowledge that has been captured by these weights as purely musical, several studies have demonstrated how the network may instead have learned some entirely extrinsic regularity of the training set, e.g., background noise or recording artifacts [22,23]. One of the difficulties in working with neural networks is that they end up with this black-box quality. ...
Article
Full-text available
The history of algorithmic composition using a digital computer has undergone many representations—data structures that encode some aspects of the outside world, or processes and entities within the program itself. Parallel histories in cognitive science and artificial intelligence have (of necessity) confronted their own notions of representations, including the ecological perception view of J.J. Gibson, who claims that mental representations are redundant to the affordances apparent in the world, its objects, and their relations. This review tracks these parallel histories and how the orientations and designs of multimodal interactive systems give rise to their own affordances: the representations and models used expose parameters and controls to a creator that determine how a system can be used and, thus, what it can mean.
... • Ideally, there should not be significant bias for the acoustic room environments among the utterances of different language classes. If this criterion is not fulfilled, the classifier model can recognize different recording environments as different language classes [128]. ...
Preprint
Full-text available
Automatic spoken language identification (LID) is a very important research field in the era of multilingual voice-command-based human-computer interaction (HCI). A front-end LID module helps to improve the performance of many speech-based applications in the multilingual scenario. India is a populous country with diverse cultures and languages. The majority of the Indian population needs to use their respective native languages for verbal interaction with machines. Therefore, the development of efficient Indian spoken language recognition systems is useful for adapting smart technologies in every section of Indian society. The field of Indian LID has started gaining momentum in the last two decades, mainly due to the development of several standard multilingual speech corpora for the Indian languages. Even though significant research progress has already been made in this field, to the best of our knowledge, there are not many attempts to analytically review them collectively. In this work, we have conducted one of the very first attempts to present a comprehensive review of the Indian spoken language recognition research field. In-depth analysis has been presented to emphasize the unique challenges of low-resource and mutual influences for developing LID systems in the Indian contexts. Several essential aspects of the Indian LID research, such as the detailed description of the available speech corpora, the major research contributions, including the earlier attempts based on statistical modeling to the recent approaches based on different neural network architectures, and the future research trends are discussed. This review work will help assess the state of the present Indian LID research by any active researcher or any research enthusiasts from related fields.
... As the number of singers to be considered increases, this issue becomes crucial. Second, since the songs on each singer's albums usually contain instrumental accompaniment, it is difficult for the SID model to extract vocal-only features from such recordings, which reduces the generalization ability of the SID model (Hsieh et al. 2020; Van, Quang, and Thanh 2019; Sharma, Das, and Li 2019; Rafii et al. 2018; Sturm 2014). ...
Article
Recently, a non-local (NL) operation was designed as the central building block for deep-net models to capture long-range dependencies (Wang et al. 2018). Despite its excellent performance, it does not consider the interaction between positions across channels and layers, which is crucial in fine-grained classification tasks. To address this limitation, we target the singer identification (SID) task and present a fully generalized non-local (FGNL) module to help identify fine-grained vocals. Specifically, we first propose a FGNL operation, which extends the NL operation to explore the correlations between positions across channels and layers. Secondly, we further apply a depth-wise convolution with a Gaussian kernel in the FGNL operation to smooth feature maps for better generalization. Moreover, we modify the squeeze-and-excitation (SE) scheme into the FGNL module to adaptively emphasize correlated feature channels, helping uncover relevant feature responses and eventually the target singer. Evaluation results on the benchmark artist20 dataset show that the FGNL module significantly improves the accuracy of deep-net models in SID. Codes are available at https://github.com/ian-k-1217/Fully-Generalized-Non-Local-Network.
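For context, a minimal version of the baseline non-local (NL) operation of Wang et al. (2018) that the FGNL module generalizes is sketched below; the FGNL-specific extensions (cross-channel and cross-layer correlations, Gaussian smoothing, the modified SE scheme) are not shown.

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    """Baseline non-local block: attention over all spatial positions of a feature map."""
    def __init__(self, channels):
        super().__init__()
        inner = channels // 2
        self.theta = nn.Conv2d(channels, inner, kernel_size=1)
        self.phi = nn.Conv2d(channels, inner, kernel_size=1)
        self.g = nn.Conv2d(channels, inner, kernel_size=1)
        self.out = nn.Conv2d(inner, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (b, hw, c/2)
        k = self.phi(x).flatten(2)                     # (b, c/2, hw)
        v = self.g(x).flatten(2).transpose(1, 2)       # (b, hw, c/2)
        attn = torch.softmax(torch.bmm(q, k), dim=-1)  # pairwise position affinities
        y = torch.bmm(attn, v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                         # residual connection
```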
... that give rise to this phenomenon, wherein they appear to be intelligently solving tasks when they are actually only finessing solutions via a number of "shortcuts" (the so-called Clever Hans effect of AI; Sebeok and Rosenthal, 1981; Sturm, 2014; Hernández-Orallo, 2019). Geirhos et al. argue that an effective measure against this issue is to test AIs on out-of-distribution (o.o.d.) data, since, so long as testing uses data drawn from the same distribution as the training data (i.i.d., independent and identically distributed), it is impossible to distinguish between an agent that genuinely knows how to solve a problem and one that is using problem-irrelevant shortcuts to maximize reward (see also Dickinson, 2012, for a perspective from animal psychology). ...
Article
Full-text available
Artificial Intelligence is making rapid and remarkable progress in the development of more sophisticated and powerful systems. However, the acknowledgement of several problems with modern machine learning approaches has prompted a shift in AI benchmarking away from task-oriented testing (such as Chess and Go) towards ability-oriented testing, in which AI systems are tested on their capacity to solve certain kinds of novel problems. The Animal-AI Environment is one such benchmark which aims to apply the ability-oriented testing used in comparative psychology to AI systems. Here, we present the first direct human-AI comparison in the Animal-AI Environment, using children aged 6–10 (n = 52). We found that children of all ages were significantly better than a sample of 30 AIs across most of the tests we examined, as well as performing significantly better than the two top-scoring AIs, “ironbar” and “Trrrrr,” from the Animal-AI Olympics Competition 2019. While children and AIs performed similarly on basic navigational tasks, AIs performed significantly worse in more complex cognitive tests, including detour tasks, spatial elimination tasks, and object permanence tasks, indicating that AIs lack several cognitive abilities that children aged 6–10 possess. Both children and AIs performed poorly on tool-use tasks, suggesting that these tests are challenging for both biological and non-biological machines.
... There are several possible explanations for the shortcomings of traditional methods. On the one hand, a song's accompaniment is an intense source of noise for the SID task [16]. In an SID model, eliminating the influence of accompaniment is essential for improving identification accuracy. ...
Preprint
Full-text available
Metaverse is an interactive world that combines reality and virtuality, where participants can be virtual avatars. Anyone can hold a concert in a virtual concert hall, and users can quickly identify the real singer behind the virtual idol through singer identification. Most singer identification methods use frame-level features. However, beyond the singer's timbre, a music frame includes other musical information, such as melodiousness, rhythm, and tonality. This means such musical information acts as noise when frame-level features are used to identify singers. In this paper, instead of only frame-level features, we propose to use two additional features that address this problem: the middle-level feature, which represents the music's melodiousness, rhythmic stability, and tonal stability, and is able to capture the perceptual features of music; and the timbre feature, which is used in speaker identification and represents the singer's voice characteristics. Furthermore, we propose a convolutional recurrent neural network (CRNN) to combine the three features for singer identification. The model first fuses the frame-level and timbre features, and then combines the middle-level features with this mix. In experiments, the proposed method achieves an average F1 score of 0.81 on the benchmark dataset Artist20, which significantly improves on related works.
... Adversarial examples were previously reported in various fields of application (cf. [1][2][3]) as small perturbations of input data that drastically change the performance of machine learning systems. Since then, numerous attempts to make systems more robust and defend them against these attacks have been made (cf. ...
Preprint
Full-text available
Adversarial attacks can drastically degrade performance of recommenders and other machine learning systems, resulting in an increased demand for defence mechanisms. We present a new line of defence against attacks which exploit a vulnerability of recommenders that operate in high dimensional data spaces (the so-called hubness problem). We use a global data scaling method, namely Mutual Proximity (MP), to defend a real-world music recommender which previously was susceptible to attacks that inflated the number of times a particular song was recommended. We find that using MP as a defence greatly increases robustness of the recommender against a range of attacks, with success rates of attacks around 44% (before defence) dropping to less than 6% (after defence). Additionally, adversarial examples still able to fool the defended system do so at the price of noticeably lower audio quality as shown by a decreased average SNR.
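A rough sketch of the empirical Mutual Proximity rescaling referred to above; normalization conventions and the exact variant (e.g., Gaussian MP) differ across implementations, so treat this as an illustration rather than the defence used in the cited work.

```python
import numpy as np

def mutual_proximity(D):
    """D: symmetric (n x n) distance matrix. Returns an MP-rescaled distance matrix."""
    n = D.shape[0]
    mp = np.zeros_like(D, dtype=float)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Empirical probability that a random object is farther from i (resp. j)
            # than j (resp. i) is; two objects are "mutually close" only if both
            # probabilities are high. Normalization conventions vary.
            p_i = np.mean(D[i] > D[i, j])
            p_j = np.mean(D[j] > D[i, j])
            mp[i, j] = 1.0 - p_i * p_j   # convert the MP similarity to a distance
    return mp
```

Because this global rescaling penalizes points that are close to many others but not mutually close, it reduces hubness, which is the vulnerability the attacks above exploit.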
... Issues of confounding variables crop up in Music Information Retrieval tasks too, where analytical approaches have been developed to identify and resolve these confounds (Sturm, 2014). This analytic approach works in musical applications such as instrument classification because, invariably, we know what we are looking for. ...
Article
Full-text available
Theories across sciences and humanities posit a central role for musicking in the evolution of the social, biological and technical patterns that underpin modern humanity. In this talk I suggest that contemporary computer musicking can play a similarly critical role in supporting us through contemporary existential, ecological, technological and social crises, by providing a space for reworking our relationships with each other and the world, including the technologies that we make. Framed by Gregory Bateson’s analysis of the fundamental epistemological error which leads to interrelated existential, social and ecological crises, I will draw upon a range of personal projects to illustrate the value of computer music practices in learning to think better: from cybernetic generative art, through ecosystemic evolutionary art and feedback musicianship to the need for interactive approaches to algorithm interpretation in machine listening to biodiversity. I will illustrate how computer musicking can help in three ways: firstly by developing complexity literacy, helping us to better understand the complex systems of the anthropocene; secondly by providing a space to explore other modes of relation through learning to let others be; and thirdly to clarify the importance of aligning technologies with, and not against, the biosphere. As pre-historic musicking made us human, so contemporary computer musicking can help us learn to think through the challenges we face today and be better humans tomorrow.
... • Ideally, there should not be significant bias for the acoustic room environments among the utterances of different language classes. If this criterion is not fulfilled, the classifier model can recognize different recording environments as different language classes [128]. • The variations due to several transmission channels should also be taken care of such that these variations should not be confused with individual language identities. ...
Article
Automatic spoken language identification (LID) is a very important research field in the era of multilingual voice-command-based human-computer interaction (HCI). A front-end LID module helps to improve the performance of many speech-based applications in the multilingual scenario. India is a populous country with diverse cultures and languages. The majority of the Indian population needs to use their respective native languages for verbal interaction with machines. Therefore, the development of efficient Indian spoken language recognition systems is useful for adapting smart technologies in every section of Indian society. The field of Indian LID has started gaining momentum in the last two decades, mainly due to the development of several standard multilingual speech corpora for the Indian languages. Even though significant research progress has already been made in this field, to the best of our knowledge, there are not many attempts to analytically review them collectively. In this work, we have conducted one of the very first attempts to present a comprehensive review of the Indian spoken language recognition research field. In-depth analysis has been presented to emphasize the unique challenges of low-resource and mutual influences for developing LID systems in the Indian contexts. Several essential aspects of the Indian LID research, such as the detailed description of the available speech corpora, the major research contributions, including the earlier attempts based on statistical modeling to the recent approaches based on different neural network architectures, and the future research trends are discussed. This review work will help assess the state of the present Indian LID research by any active researcher or any research enthusiasts from related fields.
... Our approach to ED evaluation taps into the fast-growing area of research aimed at assessing model robustness, which is especially relevant for data-driven machine learning techniques. One of the first studies on this topic (Sturm, 2014) argued that state-of-the-art music information retrieval systems show very good performance on standard benchmarks without real understanding of the task at hand, since their predictions relied solely on the confounds present in the ground truth. Sturm (2014) also coined a term for this phenomenon: the "Clever Hans" effect, named after the famous horse that appeared to solve arithmetic problems while only following unintentional body-language cues given by its trainer. ...
... Another key area in which the robustness of explanations can play a central role is in the context of assessing a model's ability to generalise. Explainability methods can often be used to determine whether a model's decisions are 'right for the right reasons', and hence whether the model will remain accurate when faced with unseen data [31]. Since this new data may have a slightly different distribution to previous data, explanations lacking in robustness may obscure the similarities in model behaviour and make it more difficult to trust the model's transferability. ...
Preprint
There exist several methods that aim to address the crucial task of understanding the behaviour of AI/ML models. Arguably, the most popular among them are local explanations that focus on investigating model behaviour for individual instances. Several methods have been proposed for local analysis, but relatively less effort has gone into understanding whether the explanations are robust and accurately reflect the behaviour of the underlying models. In this work, we present a survey of the works that analysed the robustness of two classes of local explanations (feature importance and counterfactual explanations) that are popularly used in analysing AI/ML models in finance. The survey aims to unify existing definitions of robustness, introduces a taxonomy to classify different robustness approaches, and discusses some interesting results. Finally, the survey introduces some pointers about extending current robustness analysis approaches so as to identify reliable explainability methods.
... As mentioned in Qian et al. (87), to fully understand the concepts brought into play when working on these corpora, the best practice is to collect extensive data about the speakers and the recording conditions. This allows unraveling biases that are not necessarily identified and ensuring that, when elaborating classifiers, the samples are classified by what they are meant to be, and not by a related bias (88, 89). Moreover, these measures allow studying the robustness of systems with respect to other factors (sex, age, demographic data, comorbidity, ...). ...
Article
Full-text available
This article presents research on the detection of pathologies affecting speech through automatic analysis. Voice processing has indeed been used for evaluating several diseases such as Parkinson, Alzheimer, or depression. While some studies present results that seem sufficient for clinical applications, this is not the case for the detection of sleepiness. Even two international challenges and the recent advent of deep learning techniques have still not managed to change this situation. This article explores the hypothesis that the observed average performances of automatic processing find their cause in the design of the corpora. To this aim, we first discuss and refine the concept of sleepiness related to the ground-truth labels. Second, we present an in-depth study of four corpora, bringing to light the methodological choices that have been made and the underlying biases they may have induced. Finally, in light of this information, we propose guidelines for the design of new corpora.
... Even if this approach is not commonly applied and is not equivalent to the result of embedding learning, we include this simplification to demonstrate how graph embedding information generalises to unseen data. In particular, we would like to show that our system is not learning a hashing of the data but musically relevant features, demonstrating that it is not a "horse" [56]. In the context of this experiment, the reported accuracy refers to the predictions on the test set generated by the neural network trained on the training set. ...
Article
Full-text available
An important problem in large symbolic music collections is the low availability of high-quality metadata, which is essential for various information retrieval tasks. Traditionally, systems have addressed this by relying either on costly human annotations or on rule-based systems at a limited scale. Recently, embedding strategies have been exploited for representing latent factors in graphs of connected nodes. In this work, we propose MIDI2vec, a new approach for representing MIDI files as vectors based on graph embedding techniques. Our strategy consists of representing the MIDI data as a graph, including the information about tempo, time signature, programs and notes. Next, we run and optimise node2vec for generating embeddings using random walks in the graph. We demonstrate that the resulting vectors can successfully be employed for predicting the musical genre and other metadata such as the composer, the instrument or the movement. In particular, we conduct experiments using those vectors as input to a Feed-Forward Neural Network and we report accuracy scores in the prediction comparable to other approaches relying purely on symbolic music, avoiding feature engineering and producing highly scalable and reusable models with low dimensionality. Our proposal has real-world applications in automated metadata tagging for symbolic music, for example in digital libraries for musicology, datasets for machine learning, and knowledge graph completion.
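The graph-then-embedding idea can be illustrated with the following sketch, which substitutes uniform random walks plus gensim's Word2Vec for the node2vec procedure used by MIDI2vec; the node names are made up for illustration.

```python
import random
import networkx as nx
from gensim.models import Word2Vec

G = nx.Graph()
# Each MIDI-file node is linked to nodes for its tempo class, time signature
# and programs (instruments); MIDI2vec's actual graph is richer.
G.add_edges_from([
    ("midi:001", "tempo:120"), ("midi:001", "timesig:4/4"), ("midi:001", "program:piano"),
    ("midi:002", "tempo:120"), ("midi:002", "timesig:3/4"), ("midi:002", "program:violin"),
])

def random_walks(graph, num_walks=10, walk_length=20):
    """Uniform random walks over the graph, used as 'sentences' for skip-gram."""
    walks = []
    for _ in range(num_walks):
        for node in graph.nodes:
            walk = [node]
            for _ in range(walk_length - 1):
                walk.append(random.choice(list(graph.neighbors(walk[-1]))))
            walks.append(walk)
    return walks

model = Word2Vec(random_walks(G), vector_size=32, window=5, min_count=1, sg=1)
vector = model.wv["midi:001"]   # embedding used downstream for genre/metadata prediction
```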
... Our goal was not to question state-of-the-art results in wheeze classification, but to emphasize the critical role of experimental design in the evaluation of the algorithms' performance. As asserted by Sturm [35], when the performance of an artificial system appears to support the claim that the system is addressing a complex human task (e.g., classifying sounds as wheezes), the default position (null hypothesis) should be that the system is ...
Conference Paper
Full-text available
Patients with respiratory conditions typically exhibit adventitious respiratory sounds, such as wheezes. Wheeze events have variable duration. In this work we studied the influence of event duration on wheeze classification, namely how the creation of the non-wheeze class affected the classifiers' performance. First, we evaluated several classifiers on an open access respiratory sound database, with the best one reaching sensitivity and speci-ficity values of 98% and 95%, respectively. Then, by changing one parameter in the design of the non-wheeze class, i.e., event duration, the best classifier only reached sensitivity and specificity values of 55% and 76%, respectively. These results demonstrate the importance of experimental design on the assessment of wheeze classification algorithms' performance.
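Since the reported figures of merit are sensitivity and specificity, a small reminder of how they are computed from a binary confusion matrix (wheeze as the positive class) follows; the labels are placeholders.

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 1, 0]   # placeholder ground truth (1 = wheeze)
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]   # placeholder classifier output

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # proportion of true wheezes detected
specificity = tn / (tn + fp)   # proportion of non-wheezes correctly rejected
print(sensitivity, specificity)
```

The excerpt above shows how strongly both numbers can depend on how the non-wheeze class is constructed, which is exactly the experimental-design point being made.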
... For example, it may be understandable if a model that learns term representations based on the text of Shakespeare's Hamlet is effective at retrieving passages relevant to a search query from The Bard's other works, but performs poorly when the retrieval task involves a corpus of song lyrics by Jay-Z. However, poor performance on a new corpus can also indicate that the model is overfitting, or suffering from the Clever Hans effect [140]. For example, an IR model trained on a recent news corpus may learn to associate "Theresa May" with the query "uk prime minister" and as a consequence may perform poorly on older TREC datasets where the connection to "John Major" may be more appropriate. ML models that are hyper-sensitive to corpus distributions may be vulnerable when faced with unexpected changes in distributions in the test data. ...
Thesis
Neural networks with deep architectures have demonstrated significant performance improvements in computer vision, speech recognition, and natural language processing. The challenges in information retrieval (IR), however, are different from these other application areas. A common form of IR involves ranking of documents--or short passages--in response to keyword-based queries. Effective IR systems must deal with query-document vocabulary mismatch problem, by modeling relationships between different query and document terms and how they indicate relevance. Models should also consider lexical matches when the query contains rare terms--such as a person's name or a product model number--not seen during training, and to avoid retrieving semantically related but irrelevant results. In many real-life IR tasks, the retrieval involves extremely large collections--such as the document index of a commercial Web search engine--containing billions of documents. Efficient IR methods should take advantage of specialized IR data structures, such as inverted index, to efficiently retrieve from large collections. Given an information need, the IR system also mediates how much exposure an information artifact receives by deciding whether it should be displayed, and where it should be positioned, among other results. Exposure-aware IR systems may optimize for additional objectives, besides relevance, such as parity of exposure for retrieved items and content publishers. In this thesis, we present novel neural architectures and methods motivated by the specific needs and challenges of IR tasks.
... As mentioned in (Qian et al., 2020), to fully understand the concepts brought into play when working on these corpora, the best practice is to collect extensive data about the speakers and the recording conditions. This allows unraveling biases that are not necessarily identified and ensuring that, when elaborating classifiers, the samples are classified by what they are meant to be, and not by a related bias (Sturm, 2014, 2016). Moreover, these measures allow studying the robustness of systems with respect to other factors (sex, age, demographic data, comorbidity, ...). ...
Preprint
Full-text available
This article presents research on the detection of pathologies affecting speech through automatic analysis. Voice processing has indeed been used for evaluating several diseases such as Parkinson, Alzheimer or depression. While some studies present results that seem sufficient for clinical applications, this is not the case for the detection of sleepiness. Even two international challenges and the recent advent of deep learning techniques have still not managed to change this situation. This paper explores the hypothesis that the observed average performances of automatic processing find their cause in the design of the corpora. To this aim, we first discuss and refine the concept of sleepiness related to the ground-truth labels. Second, we present an in-depth study of four corpora, bringing to light the methodological choices that have been made and the underlying biases they may have induced. Finally, in light of this information, we propose guidelines for the design of new corpora.
... In these cases, even after the compensation, cross-corpora evaluations show an EER of more than 50%, which is an interesting fact for further investigation. This indicates that the classifier captures non-lingual similarities, which affect the similarity score between the target language and the test audio [34]. ...
Preprint
Full-text available
In this paper, we conduct one of the very first studies for cross-corpora performance evaluation in the spoken language identification (LID) problem. Cross-corpora evaluation was not explored much in LID research, especially for the Indian languages. We have selected three Indian spoken language corpora: IIITH-ILSC, LDC South Asian, and IITKGP-MLILSC. For each of the corpus, LID systems are trained on the state-of-the-art time-delay neural network (TDNN) based architecture with MFCC features. We observe that the LID performance degrades drastically for cross-corpora evaluation. For example, the system trained on the IIITH-ILSC corpus shows an average EER of 11.80 % and 43.34 % when evaluated with the same corpora and LDC South Asian corpora, respectively. Our preliminary analysis shows the significant differences among these corpora in terms of mismatch in the long-term average spectrum (LTAS) and signal-to-noise ratio (SNR). Subsequently, we apply different feature level compensation methods to reduce the cross-corpora acoustic mismatch. Our results indicate that these feature normalization schemes can help to achieve promising LID performance on cross-corpora experiments.
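One common feature-level compensation scheme, per-utterance cepstral mean and variance normalization (CMVN) of MFCCs, can be sketched as follows; whether and how the cited work applies this particular scheme is an assumption here, and the sample rate and feature settings are placeholders.

```python
import numpy as np
import librosa

def mfcc_cmvn(path, n_mfcc=20):
    """MFCCs with per-utterance mean/variance normalization to reduce channel mismatch."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, frames)
    mean = mfcc.mean(axis=1, keepdims=True)
    std = mfcc.std(axis=1, keepdims=True) + 1e-8
    return (mfcc - mean) / std
```

Normalizing each utterance's features toward zero mean and unit variance removes part of the corpus-specific spectral and SNR bias that the abstract identifies as a source of cross-corpora degradation.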
... In several situations, ML models, especially those based on Deep Learning (DL) techniques, have even been claimed to perform 'better than humans' [1], [2]. At the same time, this claim has met with controversy, as seemingly well-performing ML models were found to make unexpected mistakes that humans would not make [3], [4]. ...
Preprint
Full-text available
Mutation testing is a well-established technique for assessing a test suite's quality by injecting artificial faults into production code. In recent years, mutation testing has been extended to machine learning (ML) systems, and deep learning (DL) in particular; researchers have proposed approaches, tools, and statistically sound heuristics to determine whether mutants in DL systems are killed or not. However, as we argue in this work, questions can be raised as to what extent currently used mutation testing techniques in DL are actually in line with the classical interpretation of mutation testing. We observe that ML model development resembles a test-driven development (TDD) process, in which a training algorithm (the 'programmer') generates a model (the program) that fits the data points (test data) to labels (implicit assertions), up to a certain threshold. However, considering proposed mutation testing techniques for ML systems under this TDD metaphor, in current approaches the distinction between production and test code is blurry, and the realism of mutation operators can be challenged. We also consider the fundamental hypotheses underlying classical mutation testing: the competent programmer hypothesis and the coupling effect hypothesis. As we illustrate, these hypotheses do not trivially translate to ML system development, and more conscious and explicit scoping and concept mapping will be needed to truly draw parallels. Based on our observations, we propose several action points for better alignment of mutation testing techniques for ML with the paradigms and vocabulary of classical mutation testing.
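One way to make this TDD analogy concrete is a data-level mutation operator: perturb the training set, retrain, and declare the mutant "killed" if the resulting model is measurably worse than the original. A minimal sketch with a hypothetical label-flipping operator; the operator, classifier and killing threshold are illustrative assumptions, not taken from the paper:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def label_flip_mutant(y_train, fraction, rng):
    """Mutation operator (assumed for illustration): flip a random fraction of binary 0/1 labels."""
    y_mut = y_train.copy()
    idx = rng.choice(len(y_mut), size=int(fraction * len(y_mut)), replace=False)
    y_mut[idx] = 1 - y_mut[idx]
    return y_mut

def mutant_killed(X_train, y_train, X_test, y_test, fraction=0.1, margin=0.05, seed=0):
    """Retrain on the mutated data; the mutant is 'killed' if test accuracy drops by more than margin."""
    rng = np.random.default_rng(seed)
    original = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    mutated = LogisticRegression(max_iter=1000).fit(X_train, label_flip_mutant(y_train, fraction, rng))
    return original.score(X_test, y_test) - mutated.score(X_test, y_test) > margin
```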
... However, [8] showed most of these approaches to be "horses", i.e., classifying not on the basis of causal relationships between acoustic features and genre, but on statistical coincidences inherent in the given dataset. This means that many of the commonly extracted low-level features are meaningless for the task of genre classification, and that the classifier is not universally valid, i.e., it is not likely to perform similarly well when applied to another dataset. ...
Preprint
Full-text available
In the recording studio, producers of Electronic Dance Music (EDM) spend more time creating, shaping, mixing and mastering sounds than on compositional aspects or arrangement. They tune the sound by close listening and by leveraging audio metering and audio analysis tools, until they successfully create the desired sound aesthetics. DJs of EDM tend to play sets of songs that meet their sound ideal. We therefore suggest using audio metering and monitoring tools from the recording studio to analyze EDM, instead of relying on conventional low-level audio features. We test our novel set of features with a simple classification task: we attribute songs to the DJs who would play them. This new set of features and the focus on DJ sets are targeted at EDM, as they take the producer and DJ culture into account. With simple dimensionality reduction and machine learning, these features enable us to attribute a song to a DJ with an accuracy of 63%. The features from the audio metering and monitoring tools in the recording studio could serve many applications in Music Information Retrieval, such as genre, style and era classification and music recommendation for both DJs and consumers of electronic dance music.
Article
Counterfactual attention learning [1] utilizes counterfactual causality to guide attention learning and has demonstrated great potential in vision-based fine-grained recognition tasks. Despite its excellent performance, existing counterfactual attention is not learned directly from the network itself; instead, it relies on employing random attentions. To address this limitation, and considering the inherent differences between visual and acoustic characteristics, we target music classification tasks and present a learnable counterfactual attention (LCA) mechanism to enhance the ability of counterfactual attention to help identify fine-grained sounds. Specifically, our LCA mechanism is implemented by introducing a counterfactual attention branch into the original attention-based deep-net model. Guided by multiple well-designed loss functions, the model pushes the counterfactual attention branch to uncover biased attention regions that are meaningful yet not overly discriminative (seemingly accurate but ultimately misleading), while guiding the main branch to deviate from those regions, thereby focusing attention on discriminative regions to learn task-specific features in fine-grained sounds. Evaluations on the benchmark datasets artist20 [2], GTZAN [3], and FMA [4] demonstrate that our LCA mechanism brings a comprehensive performance improvement for deep-net models on singer identification and musical genre classification. Moreover, since the LCA mechanism is only used during training, it does not impact testing efficiency.
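The underlying counterfactual-attention idea can be stated compactly: compare the prediction made with the learned attention against the prediction made with a substitute (e.g., random) attention, and train on the difference, so that only attention that genuinely changes the decision is rewarded. A minimal PyTorch sketch of that baseline idea, not of the LCA mechanism or its loss functions; the pooling scheme and tensor shapes are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def counterfactual_attention_loss(frames, attention, classifier, labels):
    """frames: (batch, time, dim); attention: (batch, time); classifier maps dim -> class logits."""
    # Prediction using the learned attention weights (attention-weighted pooling over time).
    pooled_fact = (frames * attention.unsqueeze(-1)).sum(dim=1)
    logits_fact = classifier(pooled_fact)
    # Prediction using a random "counterfactual" attention of the same shape.
    rand = torch.rand_like(attention)
    rand = rand / rand.sum(dim=1, keepdim=True)
    logits_cf = classifier((frames * rand.unsqueeze(-1)).sum(dim=1))
    # The effect of attention is the difference between the two predictions; training it
    # against the labels penalizes attention that random weights could replace.
    return F.cross_entropy(logits_fact - logits_cf, labels)
```

Here classifier can be, for instance, a torch.nn.Linear(dim, n_classes) head on top of any attention-based backbone.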
Conference Paper
Does a model that appears to detect fake voices use cues relevant to the problem? Or is its performance merely a product of how a dataset was constructed? In this paper, we demonstrate how spurious correlations in training data result in seemingly improved voice spoofing detection. A simple framework to identify such effects, also known as the Clever Hans effect in machine learning (ML), is proposed, and its efficacy is demonstrated using a popular deep spoofing detector on two anti-spoofing benchmarks: ASVspoof 2017 and ASVspoof 2019 PA. By raising awareness of this effect we hope to increase the credibility and reliability of anti-spoofing solutions on these benchmarks. Furthermore, using a separate deep architecture, we demonstrate that the effect is not model specific and that any ML solution may exhibit such behaviour.
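The check the paper motivates can be phrased simply: if a feature that should carry no spoofing-relevant information predicts the labels well above chance, the benchmark contains a Clever Hans cue. A minimal sketch using a hypothetical nuisance feature (duration of leading silence); the feature, threshold and classifier are illustrative assumptions, not the paper's framework:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def leading_silence(y, sr, threshold=0.01):
    """Duration (seconds) of the initial low-amplitude segment of a waveform."""
    active = np.flatnonzero(np.abs(y) > threshold)
    return (active[0] / sr) if active.size else len(y) / sr

def confound_check(waveforms, sample_rates, labels):
    """How well does the nuisance feature alone separate bona fide from spoofed audio?"""
    X = np.array([[leading_silence(y, sr)] for y, sr in zip(waveforms, sample_rates)])
    return cross_val_score(LogisticRegression(), X, np.asarray(labels), cv=5).mean()
```

An accuracy far above the majority-class rate would suggest the detector can succeed without attending to spoofing artefacts at all.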
Article
Full-text available
Data mining promotes the growth of knowledge discovery. It is the process of analyzing descriptive data from divergent perspectives and summarizing it into valuable information; in high-level music processing, a machine attempts to decipher the Raaga or the pitch of the music from its frequency content. One way to approach the task is by comparing selected music features from the spectrum against a Raaga database. Recognizing emotion from music has become one of the active research themes in image processing and in applications based on human-computer interaction. This research conducts an experimental study on recognizing facial emotions. The flow of the emotion recognition system includes the basic processes of a singular value decomposition system: music acquisition, pre-processing of a spectrum, feature detection, feature extraction, and classification; once the emotions are classified, the system assigns the particular user music according to his or her emotion. The proposed system focuses on live images taken from the music database. This research aims to develop an automatic music recognition system for innovative manufacturing through the additive manufacturing route. The emotions considered for the experiments include happiness, sadness, surprise, fear, disgust, and anger, which are universally accepted. This paper overviews the progress of applying additive manufacturing in applied machine learning, which sustains the capability of disruptive digital manufacturing.
Article
In hyperspectral anomaly detection, anomalies are rare targets that exhibit spectral signatures distinct from the background. Thus, anomalies have low probabilities of occurrence in hyperspectral images. In this article, we develop a new technique for hyperspectral anomaly detection that adopts an information-theoretic perspective to fully utilize the aforementioned concepts. Our goal is to transform system entropy into quantitative metrics of the anomaly conspicuousness of pixels. To do so, two tasks are first completed: the construction of occurrence probabilities of pixels based on the density peak clustering algorithm, and valid system definitions for pixels in specific anomaly detection problems with multiple views. Specifically, three types of systems are separately established from pixel pairs to conform to three entropy definitions in information theory, i.e., Shannon entropy, joint entropy, and relative entropy. Then, three individual entropy-based metrics that assess anomaly conspicuousness are defined. In addition, we design a standard deviation-based ensemble strategy for the integrated representation of the three individual metrics, which considers both logical "or" and "and" operations to simultaneously improve the detection rate and reduce the false alarm rate. Our experimental results obtained on two publicly available datasets with anomalies of different sizes and shapes demonstrate the superiority of our newly proposed anomaly detection method.
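For reference, the three information-theoretic quantities the metrics are built on are standard; a minimal numerical sketch of just these definitions (not the authors' detection pipeline, whose probability construction relies on density peak clustering):

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy (bits) of a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def joint_entropy(p_joint):
    """Entropy of a joint distribution over pixel pairs, given as a 2-D probability table."""
    return shannon_entropy(np.asarray(p_joint).ravel())

def relative_entropy(p, q):
    """Relative entropy (KL divergence) D(p || q); assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))
```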
Preprint
Entity disambiguation (ED) is the last step of entity linking (EL), when candidate entities are reranked according to the context they appear in. All datasets for training and evaluating models for EL consist of convenience samples, such as news articles and tweets, that propagate the prior probability bias of the entity distribution towards more frequently occurring entities. It was previously shown that the performance of the EL systems on such datasets is overestimated since it is possible to obtain higher accuracy scores by merely learning the prior. To provide a more adequate evaluation benchmark, we introduce the ShadowLink dataset, which includes 16K short text snippets annotated with entity mentions. We evaluate and report the performance of popular EL systems on the ShadowLink benchmark. The results show a considerable difference in accuracy between more and less common entities for all of the EL systems under evaluation, demonstrating the effects of prior probability bias and entity overshadowing.
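The "merely learning the prior" failure mode is easy to reproduce with a baseline that ignores context entirely and always returns the most frequent entity seen for a mention string during training. A minimal sketch; the data layout is an assumption for illustration:

```python
from collections import Counter, defaultdict

def fit_prior(training_pairs):
    """training_pairs: iterable of (mention_string, gold_entity) from the training set."""
    counts = defaultdict(Counter)
    for mention, entity in training_pairs:
        counts[mention.lower()][entity] += 1
    return {m: c.most_common(1)[0][0] for m, c in counts.items()}

def disambiguate(prior, mention, context=None):
    """Context is accepted but ignored: this baseline uses only the prior."""
    return prior.get(mention.lower())  # None for mentions never seen in training
```

On head-heavy benchmarks this baseline scores deceptively well; on overshadowed mentions, where the gold entity is not the most frequent one, it fails by construction.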
Preprint
Full-text available
In this paper we present a controlled study of the linearized IRM framework (IRMv1) introduced in Arjovsky et al. (2020). We show that the IRMv1 framework (and its variants) can be unstable under small changes to the optimal regressor. This can, notably, lead to worse generalisation to new environments, even compared with ERM, which simply converges to the global minimum over all training environments mixed together. We also highlight the issues of scaling in the IRMv1 setup. These observations highlight the importance of rigorous evaluation and of unit-testing for measuring progress towards IRM.
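For reference, the IRMv1 penalty the study examines is the squared gradient of each environment's risk with respect to a fixed scalar "dummy" classifier. A minimal PyTorch sketch of that penalty for a binary task, following the linearized formulation; the binary cross-entropy risk is an illustrative choice:

```python
import torch
import torch.nn.functional as F

def irmv1_penalty(logits, targets):
    """Squared norm of the gradient of the environment risk w.r.t. a dummy scale w = 1.0."""
    w = torch.ones(1, device=logits.device, requires_grad=True)
    risk = F.binary_cross_entropy_with_logits(logits * w, targets.float())
    grad = torch.autograd.grad(risk, [w], create_graph=True)[0]
    return grad.pow(2).sum()

def irmv1_objective(per_env_logits, per_env_targets, lam=1.0):
    """Mean ERM risk across environments plus lambda times the mean IRM penalty."""
    risks = [F.binary_cross_entropy_with_logits(l, t.float()) for l, t in zip(per_env_logits, per_env_targets)]
    penalties = [irmv1_penalty(l, t) for l, t in zip(per_env_logits, per_env_targets)]
    return torch.stack(risks).mean() + lam * torch.stack(penalties).mean()
```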
Book
Full-text available
With this work of science, Velarde brings us closer to questions that are current, perennial, and drawn from science fiction. Velarde weaves a captivating story of what she has called the Artificial Era, and presents predictions about the technological impact made possible by advances in artificial intelligence.
Conference Paper
Full-text available
Because music conveys and evokes feelings, a wealth of research has been performed on music emotion recognition. Previous research has shown that musical mood is linked to features based on rhythm, timbre, spectrum and lyrics. For example, sad music correlates with slow tempo, while happy music is generally faster. However, only limited success has been obtained in learning automatic classifiers of emotion in music. In this paper, we collect a ground truth data set of 2904 songs that have been tagged with one of the four words "happy", "sad", "angry" and "relaxed" on the Last.FM web site. An excerpt of the audio is then retrieved from 7Digital.com, and various sets of audio features are extracted using standard algorithms. Two classifiers are trained using support vector machines with the polynomial and radial basis function kernels, and these are tested with 10-fold cross validation. Our results show that spectral features outperform those based on rhythm, dynamics and, to a lesser extent, harmony. We also find that the polynomial kernel gives better results than the radial basis function, and that the fusion of different feature sets does not always lead to improved classification.
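The classification setup described (support vector machines with polynomial and RBF kernels, assessed by 10-fold cross-validation) maps directly onto standard tooling. A minimal sketch with scikit-learn, assuming a precomputed feature matrix X and mood labels y:

```python
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def compare_kernels(X, y):
    """Return 10-fold cross-validated accuracy for polynomial and RBF kernels."""
    scores = {}
    for kernel in ("poly", "rbf"):
        model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
        scores[kernel] = cross_val_score(model, X, y, cv=10).mean()
    return scores
```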
Article
Full-text available
We argue that an evaluation of system behavior at the level of the music is required to usefully address the fundamental problems of music genre recognition (MGR), and indeed other tasks of music information retrieval, such as autotagging. A recent review of works in MGR since 1995 shows that most (82%) measure the capacity of a system to recognize genre by its classification accuracy. After reviewing evaluation in MGR, we show that neither classification accuracy, nor recall and precision, nor confusion tables, necessarily reflect the capacity of a system to recognize genre in musical signals. Hence, such figures of merit cannot be used to reliably rank, promote or discount the genre recognition performance of MGR systems if genre recognition (rather than identification by irrelevant confounding factors) is the objective. This motivates the development of a richer experimental toolbox for evaluating any system designed to intelligently extract information from music signals.
Conference Paper
Full-text available
Many robot dances are preprogrammed by choreographers for a particular piece of music so that the motions can be smoothly executed and synchronized to the music. We are interested in automating the task of robot dance choreography so that robots can dance without detailed human planning. Robot dance movements are synchronized to the beats and reflect the emotion of any music. Our work is made up of two parts: (1) an algorithm that plans a sequence of dance movements driven by the beats and the emotions detected through preprocessing of the selected dance music; and (2) a real-time synchronizing algorithm that minimizes the error between the execution of the motions and the plan. Our work builds on previous research to extract beats and emotions from music audio. We created a library of parameterized motion primitives, whereby each motion primitive is composed of a set of keyframes and durations, and we generate the sequence of dance movements from this library. We demonstrate the feasibility of our algorithms on the NAO humanoid robot to show that the robot is capable of using the mappings defined to autonomously dance to any music. Although we present our work using a humanoid robot, our algorithm is applicable to other robots.
Conference Paper
Full-text available
Availability of large amounts of raw unlabeled data has sparked the recent surge in semi-supervised learning research. In most works, however, it is assumed that labeled and unlabeled data come from the same distribution. This restriction is removed in the self-taught learning approach, where unlabeled data can be different but nevertheless have similar structure. First, a representation is learned from the unlabeled data via sparse coding, and then it is applied to the labeled data used for classification. In this work, we implemented this method for the music genre classification task using two different databases: one as an unlabeled data pool and the other for supervised classifier training. Music pieces come from 10 and 6 genres for the two databases respectively, with only one genre common to both. Results from a wide variety of experimental settings show that the self-taught learning method improves the classification rate when the amount of labeled data is small and, more interestingly, that consistent improvement can be achieved for a wide range of unlabeled data sizes.
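The self-taught learning recipe has two stages: learn a sparse code book from the unlabeled pool, then encode the labeled set with it before supervised training. A minimal sketch with scikit-learn; the dictionary size and the linear SVM are illustrative choices, not those of the paper:

```python
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.svm import LinearSVC

def self_taught_genre_classifier(X_unlabeled, X_labeled, y_labeled, n_atoms=256):
    # Stage 1: learn a sparse-coding dictionary from the unlabeled feature pool.
    coder = MiniBatchDictionaryLearning(n_components=n_atoms, transform_algorithm="lasso_lars")
    coder.fit(X_unlabeled)
    # Stage 2: re-represent the labeled data with the learned codes and train a classifier.
    Z = coder.transform(X_labeled)
    clf = LinearSVC().fit(Z, y_labeled)
    return coder, clf
```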
Article
Full-text available
Personalization and context-awareness are highly important topics in research on Intelligent Information Systems. In the fields of Music Information Retrieval (MIR) and Music Recommendation in particular, user-centric algorithms should ideally provide music that perfectly fits each individual listener in each imaginable situation and for each of her information or entertainment needs. Even though preliminary steps towards such systems have recently been presented at the "International Society for Music Information Retrieval Conference" (ISMIR) and at similar venues, this vision is still far from becoming a reality. In this article, we investigate and discuss literature on the topic of user-centric music retrieval and reflect on why the breakthrough in this field has not been achieved yet. Given the authors' different areas of expertise, we shed light on why this topic is a particularly challenging one, taking computer science and psychology points of view. Whereas the computer science aspect centers on the problems of user modeling, machine learning, and evaluation, the psychological discussion is mainly concerned with proper experimental design and interpretation of the results of an experiment. We further present our ideas on aspects crucial to consider when elaborating user-aware music retrieval systems.
Conference Paper
Full-text available
We propose a multi-modal approach to the music emotion recognition (MER) problem, combining information from distinct sources, namely audio, MIDI and lyrics. We introduce a methodology for the automatic creation of a multi-modal music emotion dataset resorting to the AllMusic database, based on the emotion tags used in the MIREX Mood Classification Task. Then, MIDI files and lyrics corresponding to a sub-set of the obtained audio samples were gathered. The dataset was organized into the same 5 emotion clusters defined in MIREX. From the audio data, 177 standard features and 98 melodic features were extracted. As for MIDI, 320 features were collected. Finally, 26 lyrical features were extracted. We experimented with several supervised learning and feature selection strategies to evaluate the proposed multi-modal approach. Employing only standard audio features, the best attained performance was 44.3% (F-measure). With the multi-modal approach, results improved to 61.1%, using only 19 multi-modal features. Melodic audio features were particularly important to this improvement.
Article
Full-text available
In this paper we present an approach to music genre classification which converts an audio signal into spectrograms and extracts texture features from these time-frequency images, which are then used for modeling music genres in a classification system. The texture features are based on the Local Binary Pattern (LBP), a structural texture operator that has been successful in recent image classification research. Experiments are performed with two well-known datasets: the Latin Music Database (LMD) and the ISMIR 2004 dataset. The proposed approach considers several different zoning mechanisms to perform local feature extraction. Results obtained with and without local feature extraction are compared. We compare the performance of texture features with that of commonly used audio content-based features (i.e. from the MARSYAS framework), and show that the texture features always outperform the audio content-based features. We also compare our results with results from the literature. On the LMD, the performance of our approach reaches about 82.33%, above the best result obtained in the MIREX 2010 competition on that dataset. On the ISMIR 2004 database, the best result obtained is about 80.65%, which is below the best result on that dataset found in the literature.
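The pipeline described (audio to spectrogram image, Local Binary Pattern texture extraction, then classification) can be sketched in a few lines, assuming librosa and scikit-image are available; the LBP parameters are simplified and the zoning step is omitted:

```python
import librosa
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_texture_features(path, n_points=8, radius=1):
    """Log-magnitude spectrogram -> uniform LBP histogram usable as a genre feature vector."""
    y, sr = librosa.load(path, sr=22050)
    spec = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
    # Rescale the spectrogram to an 8-bit grayscale image before applying the texture operator.
    img = ((spec - spec.min()) / (spec.ptp() + 1e-9) * 255).astype(np.uint8)
    lbp = local_binary_pattern(img, n_points, radius, method="uniform")
    # Uniform LBP with P sampling points yields P + 2 distinct codes; histogram them.
    hist, _ = np.histogram(lbp, bins=n_points + 2, range=(0, n_points + 2), density=True)
    return hist
```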
Article
Full-text available
A typical small-sample biomarker classification paper discriminates between types of pathology based on, say, 30,000 genes and a small labeled sample of fewer than 100 points. Some classification rule is used to design the classifier from this data, but we are given no good reason or conditions under which this algorithm should perform well. An error estimation rule is used to estimate the classification error on the population using the same data, but once again we are given no good reason or conditions under which this error estimator should produce a good estimate, and thus we do not know how well the classifier should be expected to perform. In fact, in virtually all such papers the error estimate is expected to be highly inaccurate. In short, we are given no justification for any claims. Given the ubiquity of vacuous small-sample classification papers in the literature, one could easily conclude that scientific knowledge is impossible in small-sample settings. It is not that thousands of papers overtly claim that scientific knowledge is impossible in regard to their content; rather, it is that they utilize methods that preclude scientific knowledge. In this paper, we argue to the contrary that scientific knowledge in small-sample classification is possible provided there is sufficient prior knowledge. A natural way to proceed, discussed herein, is via a paradigm for pattern recognition in which we incorporate prior knowledge into the whole classification procedure (classifier design and error estimation), optimize each step of the procedure given the available information, and obtain theoretical measures of performance for both classifiers and error estimators, the latter being the critical epistemological issue. In sum, we can achieve scientific validation for a proposed small-sample classifier and its error estimate.
Article
Full-text available
In music genre classification, most approaches rely on statistical characteristics of low-level features computed on short audio frames. In these methods, it is implicitly assumed that frames carry equally relevant information loads and that either individual frames, or distributions thereof, somehow capture the specificities of each genre. In this paper we study the representation space defined by short-term audio features with respect to class boundaries, and compare different processing techniques to partition this space. These partitions are evaluated in terms of accuracy on two genre classification tasks, with several types of classifiers. Experiments show that a randomized and unsupervised partition of the space, used in conjunction with a Markov model classifier, leads to accuracies comparable to the state of the art. We also show that unsupervised partitions of the space tend to create fewer hubs.
Article
Full-text available
In Exp I, 2 White Carneaux pigeons responded at more than 80% correct in a single-operandum discrimination learning task when the S+ was a 1-min excerpt of Bach flute music and the S– was a 1-min excerpt of Hindemith viola music. In Exp II, 4 Ss responded at more than 70% correct when they were required to peck the left of 2 disks during presentations of any portion of a 20-min Bach organ piece and to peck the right disk during any portion of Stravinsky's Rite of Spring for orchestra. These discriminations were learned slowly. However, the Ss generalized consistently and independently of the instruments involved when presented with novel musical excerpts in Exp III. They preferred the left "Bach disk" when novel excerpts from Buxtehude and Scarlatti were introduced and the right "Stravinsky disk" when novel excerpts from Eliot Carter, Walter Piston, and another Stravinsky work were introduced. Seven college students responded similarly. Therefore, the pigeon's response to complex auditory events may be more like the human's than is often assumed.
Article
Full-text available
Domestic animals are highly capable of detecting human cues, while wild relatives tend to perform less well (e.g., in responding to pointing gestures). It has been suggested that domestication may have led to the development of such cognitive skills. Here, we hypothesized that because domestic animals are so attentive and dependent on humans' actions for resources, the counter-effect may be a decline in self-sufficiency, such as individual task solving. We show a negative correlation between performance in a learning task (opening a chest) and the interest shown by horses toward humans, despite high motivation expressed by investigative behaviors directed at the chest. If human-directed attention reflects the development of particular skills in domestic animals, this is to our knowledge the first study highlighting a link between human-directed behaviors and impaired individual task-solving skills (the ability to solve a task by themselves) in horses.
Article
Full-text available
This paper surveys the state of the art in automatic emotion recognition in music. Music is oftentimes referred to as a "language of emotion" [1], and it is natural for us to categorize music in terms of its emotional associations. Myriad features, such as harmony, timbre, interpretation, and lyrics, affect emotion, and the mood of a piece may also change over its duration. But in developing automated systems to organize music in terms of emotional content, we are faced with a problem that oftentimes lacks a well-defined answer; there may be considerable disagreement regarding the perception and interpretation of the emotions of a song, or ambiguity within the piece itself. When compared to other music information retrieval tasks (e.g., genre identification), the identification of musical mood is still in its early stages, though it has received increasing attention in recent years. In this paper we explore a wide range of research in music emotion recognition, particularly focusing on methods that use contextual text information (e.g., websites, tags, and lyrics) and content-based approaches, as well as systems combining multiple feature domains.
Article
Full-text available
In this paper we describe how to build a variety of information retrieval models for music collections based on social tags. We discuss the particular nature of social tags for music and apply latent semantic dimension reduction methods to co-occurrence counts of words in tags given to individual tracks. We evaluate the performance of various latent semantic models in relation to both previous work and a simple full-rank vector space model based on tags. We investigate the extent to which our low-dimensional semantic spaces respect traditional catalogue organisation by artist and genre, and how well they generalise to unseen tracks, and we illustrate some of the concepts expressed by the learned dimensions.
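The latent semantic step described amounts to a low-rank factorization of the track-by-tag co-occurrence matrix. A minimal sketch with scikit-learn, assuming rows are tracks and columns are counts over the tag vocabulary; the dimensionality is an illustrative choice:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

def latent_tag_space(track_tag_counts, n_dims=50):
    """Project a (tracks x tags) co-occurrence matrix into a low-dimensional semantic space."""
    lsa = make_pipeline(TruncatedSVD(n_components=n_dims), Normalizer(copy=False))
    track_vectors = lsa.fit_transform(track_tag_counts)
    # Cosine similarity between rows of track_vectors approximates tag-based track similarity.
    return lsa, track_vectors
```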
Article
Full-text available
Organising or browsing music collections in a musically meaningful way calls for tagging the data in terms of e.g. rhythmic, melodic or harmonic aspects, among others. In some cases, such metadata can be extracted automatically from musical files; in others, a trained listener must extract it by hand. In this article, we consider a specific set of rhythmic descriptors for which we provide procedures of automatic extraction from audio signals. Evaluating the relevance of such descriptors is a difficult task that can easily become highly subjective. To avoid this pitfall, we assessed the relevance of these descriptors by measuring their rate of success in genre classification experiments. We conclude on the particular relevance of the tempo and a set of 15 MFCC-like descriptors.
Article
Full-text available
Nowadays, it appears essential to design automatic indexing tools which provide meaningful and efficient means to describe musical audio content. There is in fact growing interest in music information retrieval (MIR) applications, amongst which the most popular are related to music similarity retrieval, artist identification, and musical genre or instrument recognition. Current MIR-related classification systems usually do not take into account the mid-term temporal properties of the signal (over several frames) and rely on the assumption that the observations of the features in different frames are statistically independent. The aim of this paper is to demonstrate the usefulness of the information carried by the evolution of these characteristics over time. To that purpose, we propose a number of methods for early and late temporal integration and provide an in-depth experimental study of their interest for the task of musical instrument recognition on solo musical phrases. In particular, the impact of the time horizon over which the temporal integration is performed is assessed for both fixed and variable frame length analysis. Also, a number of proposed alignment kernels are used for late temporal integration. For all experiments, the results are compared to a state-of-the-art musical instrument recognition system.
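The early/late distinction can be illustrated with the two simplest members of each family: pooling frame features over a texture window before classification (early), versus aggregating per-frame decisions afterwards (late). A minimal sketch; mean/std pooling and majority voting stand in for the richer integration methods and alignment kernels studied in the paper:

```python
import numpy as np

def early_integration(frame_features, window=40, hop=20):
    """Pool frame-level features (n_frames x dim) into mean+std vectors per texture window."""
    pooled = []
    for start in range(0, len(frame_features) - window + 1, hop):
        block = frame_features[start:start + window]
        pooled.append(np.concatenate([block.mean(axis=0), block.std(axis=0)]))
    return np.array(pooled)

def late_integration(frame_predictions):
    """Aggregate per-frame class decisions into a single label by majority vote."""
    values, counts = np.unique(np.asarray(frame_predictions), return_counts=True)
    return values[np.argmax(counts)]
```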
Conference Paper
Full-text available
Music is a widely enjoyed content type, existing in many multifaceted representations. With the digital information age, a great deal of digitized music information has theoretically become available at the user's fingertips. However, the abundance of information is too large-scale and too diverse to annotate, oversee and present in a consistent and human manner, motivating the development of automated Music Information Retrieval (Music-IR) techniques. In this paper, we encourage consideration of music content beyond a monomodal audio signal and argue that Music-IR approaches with multimodal and user-centered strategies are necessary to serve real-life usage patterns and to maintain and improve the accessibility of digital music data. After discussing relevant existing work in these directions, we show that the field of Music-IR faces similar challenges to neighboring fields, and thus suggest opportunities for joint collaboration and mutual inspiration.
Conference Paper
Full-text available
Recent research has studied the relevance of various features for automatic genre classification, showing the particular importance of tempo in dance music classification. We complement this work by considering a domain-specific learning methodology, where the computed tempo is used to select an expert classifier which has been specialised on its own tempo range. This enables the all-class learning task to be reduced to a set of two- and three-class learning tasks. Current results are around 70% classification accuracy (8 ballroom dance music classes, 698 instances, baseline 15.9%).
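The methodology described reduces to a simple router: estimate the tempo, then hand the excerpt to the expert classifier trained for that tempo range. A minimal sketch, assuming librosa for tempo estimation and pre-trained per-range experts; the range boundaries and feature extractor are placeholders, not those used in the paper:

```python
import librosa

def classify_with_tempo_expert(path, experts, extract_features):
    """experts: list of (tempo_low, tempo_high, classifier); extract_features: (y, sr) -> vector."""
    y, sr = librosa.load(path, sr=22050)
    tempo = float(librosa.beat.tempo(y=y, sr=sr)[0])  # global tempo estimate in BPM
    for low, high, clf in experts:
        if low <= tempo < high:
            # Only the expert specialised on this tempo range sees the excerpt.
            return clf.predict([extract_features(y, sr)])[0]
    return None  # no expert covers the estimated tempo
```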
Conference Paper
Full-text available
Audio-based music similarity measures can be applied to automatically generate playlists or recommendations. In this paper spectral similarity is combined with complementary information from fluctuation patterns, including two new descriptors derived therefrom. The performance is evaluated in a series of experiments on four music collections. The evaluations are based on genre classification, assuming that very similar tracks belong to the same genre. The main findings are that (1) although the improvements are substantial on two of the four collections, our extensive experiments confirm earlier findings that we are approaching the limit of how far we can get using simple audio statistics; (2) evaluating similarity through genre classification is biased by the music collection (and genre taxonomy) used; and (3) in a cross validation, no pieces from the same artist should be in both training and test sets.
Conference Paper
Full-text available
Genre definition and attribution are generally considered to be subjective. This makes evaluation of any genre-labelling system intrinsically difficult, as the ground truth against which it is compared is based upon subjective responses, with little inter-participant consensus. This paper presents a novel method of analysing the results of a genre-labelling task, and demonstrates that there are groups of genre-labelling behaviour which are self-consistent. It is proposed that the evaluation of any genre classification system use this modified analysis method.
Conference Paper
Full-text available
The purpose of this paper is to address several aspects of music autotagging. We start by presenting autotagging experiments conducted with two different systems and show performances on a par with a method representative of the state of the art. Beyond that, we illustrate via systematic experiments the importance of a number of issues relevant to autotagging, yet seldom reported in the literature. First, we show that the evaluation of autotagging techniques is fragile in the sense that small alterations to the set of tags to be learned, or to the set of music pieces, may lead to dramatically different results. Hence we stress a set of methodological recommendations regarding data and evaluation metrics. Second, we conduct experiments on the generality of autotagging models, showing that a number of different methods at a similar performance level to the state of the art fail to learn tag models able to generalize to datasets from different origins. Third, we show that the current performance level of a direct mapping between audio features and tags still appears insufficient to enable the exploitation of natural tag correlations as a second stage to improve performance.
Conference Paper
Full-text available
Typical signal-based approaches to extracting musical descriptions from audio only have limited precision. A possible explanation is that they do not exploit context, which provides important cues in human cognitive processing of music: e.g. electric guitar is unlikely in 1930s music, children's choirs rarely perform heavy metal, etc. We propose an architecture to train a large set of binary classifiers simultaneously, for many different types of musical metadata (genre, instrument, mood, etc.), in such a way that correlation between metadata is used to reinforce each individual classifier. The system is iterative: it uses classification decisions it made on some classification problems as new features for new, harder problems; and hybrid: it uses a signal classifier based on timbre similarity to bootstrap symbolic inference with decision trees. While further work is needed, the approach seems to outperform signal-only algorithms by 5% precision on average, and sometimes by up to 15% for traditionally difficult problems such as cultural and subjective categories.
Conference Paper
Full-text available
Musical genre classification is the automatic classification of audio signals into user-defined labels describing pieces of music. A problem inherent to genre classification experiments in music information retrieval research is the use of songs from the same artist in both training and test sets. We show that this not only leads to overoptimistic accuracy results but also selectively favours particular classification approaches. The advantage of using models of songs rather than models of genres vanishes when applying an artist filter. The same holds true for the use of spectral features versus fluctuation patterns for preprocessing of the audio files.
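An artist filter is straightforward to implement with grouped splitting, so that no artist's tracks appear in both the training and the test set. A minimal sketch with scikit-learn:

```python
from sklearn.model_selection import GroupShuffleSplit

def artist_filtered_split(X, y, artists, test_size=0.3, seed=0):
    """Split so that each artist's songs fall entirely in either the train or the test partition."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(X, y, groups=artists))
    return train_idx, test_idx
```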
Article
Full-text available
Social tags are user-generated keywords associated with some resource on the Web. In the case of music, social tags have become an important component of “Web 2.0” recommender systems, allowing users to generate playlists based on use-dependent terms such as chill or jogging that have been applied to particular songs. In this paper, we propose a method for predicting these social tags directly from MP3 files. Using a set of 360 classifiers trained using the online ensemble learning algorithm FilterBoost, we map audio features onto social tags collected from the Web. The resulting automatic tags (or autotags) furnish information about music that is otherwise untagged or poorly tagged, allowing for insertion of previously unheard music into a social recommender. This avoids the “cold-start problem” common in such systems. Autotags can also be used to smooth the tag space from which similarities and recommendations are made by providing a set of comparable baseline tags for all tracks in a recommender system. Because the words we learn are the same as those used by people who label their music collections, it is easy to integrate our predictions into existing similarity and prediction methods based on web data.
Article
Full-text available
Musical genres are categorical labels created by humans to characterize pieces of music. A musical genre is characterized by the common characteristics shared by its members. These characteristics are typically related to the instrumentation, rhythmic structure, and harmonic content of the music. Genre hierarchies are commonly used to structure the large collections of music available on the Web. Currently, musical genre annotation is performed manually. Automatic musical genre classification can assist or replace the human user in this process and would be a valuable addition to music information retrieval systems. In addition, automatic musical genre classification provides a framework for developing and evaluating features for any type of content-based analysis of musical signals. In this paper, the automatic classification of audio signals into a hierarchy of musical genres is explored. More specifically, three feature sets for representing timbral texture, rhythmic content and pitch content are proposed. The performance and relative importance of the proposed features are investigated by training statistical pattern recognition classifiers on real-world audio collections. Both whole-file and real-time frame-based classification schemes are described. Using the proposed feature sets, a classification accuracy of 61% for ten musical genres is achieved. This result is comparable to results reported for human musical genre classification.
Article
Recently there has been a great deal of attention paid to the automatic prediction of tags for music and audio in general. Social tags are user-generated keywords associated with some resource on the Web. In the case of music, social tags have become an important component of "Web 2.0" recommender systems. There have been many attempts at automatically applying tags to audio for different purposes: database management, music recommendation, improved human-computer interfaces, estimating similarity among songs, and so on. Many published results show that this problem can be tackled using machine learning techniques; however, no method so far has been proven to be particularly suited to the task. First, it seems that no one has yet found an appropriate algorithm to solve this challenge. But second, the task definition itself is problematic. In an effort to better understand the task and also to help new researchers bring their insights to bear on this problem, this chapter provides a review of the state-of-the-art methods for addressing automatic tagging of audio. It is divided into the following sections: goal, framework, audio representation, labeled data, classification, evaluation, and future directions. Such a division helps in understanding the commonalities and strengths of the different methods that have been proposed.
Conference Paper
We make explicit the formalism underlying evaluation in music information retrieval research. We define a “system,” what it means to “analyze” one, and make clear the aims, parts, design, execution, interpretation, assumptions and limitations of its “evaluation.” We apply this formalism to discuss the MIREX automatic mood classification task.
Article
Too much current statistical work takes a superficial view of the client's research question, adopting techniques which have a solid history, a sound mathematical basis or readily available software, but without considering in depth whether the questions being answered are in fact those which should be asked. Examples, some familiar and others less so, are given to illustrate this assertion. It is clear that establishing the mapping from the client's domain to a statistical question is one of the most difficult parts of a statistical analysis. It is a part in which the responsibility is shared by both client and statistician. A plea is made for more research effort to go in this direction and some suggestions are made for ways to tackle the problem.
Technical Report
This document proposes a Roadmap for Music Information Research with the aim of expanding the context of this research field from the perspectives of technological advances, user behaviour, social and cultural aspects, and exploitation methods. The Roadmap embraces the themes of multimodality, multidisciplinarity and multiculturalism, and promotes ideas of personalisation, interpretation, embodiment, findability and community.

From the perspective of technological advances, the Roadmap defines Music Information Research as a research field which focuses on the processing of digital data related to music, including the gathering and organisation of machine-readable musical data, the development of data representations, and methodologies to process and understand that data. More specifically, this section of the Roadmap examines (i) musically relevant data; (ii) music representations; (iii) data processing methodologies; (iv) knowledge-driven methodologies; (v) estimation of elements related to musical concepts; and (vi) evaluation methodologies. A series of challenges are identified for each of these research subjects, including: (i) identifying all relevant types of data sources describing music, ensuring quality of data, and addressing legal and ethical issues concerning data; (ii) investigating more meaningful features and representations, unifying formats and extending the scope of ontologies; (iii) enabling cross-disciplinary transfer of methodologies, integrating multiple modalities of data, and adopting recent machine learning techniques; (iv) integrating insights from relevant disciplines, incorporating musicological knowledge and strengthening links to music psychology and neurology; (v) separating the various sources of an audio signal, developing style-specific musical representations and considering non-Western notation systems; and (vi) promoting best-practice evaluation methodology, defining meaningful evaluation methodologies and targeting the long-term sustainability of MIR. Further challenges can be found by referring to the Specific Challenges section under each subject in the Roadmap.

In terms of user behaviour, the Roadmap addresses the user perspective, both in order to understand the user roles within the music communication chain and to develop technologies for the interaction of these users with music data. User behaviour is examined by identifying the types of users related to listening to, performing or creating music. User interaction is analysed by addressing established Human Computer Interaction methodologies, and novel methods of Tangible and Tabletop Interaction. Challenges derived from these investigations include analysing user needs and behaviour carefully; identifying new user roles related to music activities; developing tools and open systems which automatically adapt to the user; designing MIR-based systems more holistically; addressing collaborative, co-creative and sharing multi-user applications; and expanding MIR interaction beyond the multi-touch paradigm.

Social and cultural aspects define music as a social phenomenon centering on communication and on the context in which music is created. Within this context, Music Information Research aims at processing musical data that captures the social and cultural context and at developing data processing methodologies with which to model the whole musical phenomenon. The Roadmap specifically analyses music-related collective influences, trends and behaviours, and multiculturalism. Identified challenges include promoting methodologies for modelling music-related social and collective behaviour, adapting complex networks and dynamic systems, analysing interaction and activity in social music networks, identifying music cultures that can be studied from a data-driven perspective, gathering culturally relevant data for different music cultures, and identifying specific music characteristics for each culture.

The exploitation perspective considers Music Information Research as relevant for producing exploitable technologies for organising, discovering, retrieving, delivering, and tracking information related to music, in order to enable improved user experience and commercially viable applications and services for digital media stakeholders. This section of the Roadmap focuses specifically on music distribution applications, creative tools, and other exploitation areas such as applications in musicology, digital libraries, education and eHealth. Challenges include demonstrating better exploitation possibilities of MIR technologies, developing systems that go beyond recommendation and towards discovery, developing music similarity methods for particular applications and contexts, developing methodologies of MIR for artistic applications, developing real-time MIR tools for performance, developing creative tools for commercial environments, producing descriptors based on musicological concepts, facilitating seamless access to distributed data in digital libraries, overcoming barriers to the uptake of technology in music pedagogy, and expanding the scope of MIR applications in eHealth. For a full list of challenges, please refer to the relevant sections of the Roadmap.

The Music Information Research Roadmap thus identifies current opportunities and challenges and reflects a variety of stakeholder views, in order to inspire novel research directions for the MIR community, and further inform policy makers in establishing key future funding strategies for this expanding research field.
Conference Paper
Systems built using deep learning neural networks trained on low-level spectral periodicity features (DeSPerF) reproduced the most “ground truth” of the systems submitted to the MIREX 2013 task, “Audio Latin Genre Classification.” To answer why this was the case, we take a closer look at the behavior of a DeSPerF system we create and evaluate using the benchmark dataset BALLROOM. We find through time stretching that this DeSPerF system appears to obtain a high figure of merit on the task of music genre recognition because of a confounding of tempo with “ground truth” in BALLROOM. This observation motivates several predictions.
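The time-stretching probe described here can be reproduced for any black-box classifier: stretch an excerpt so that its tempo changes while other musical attributes are largely preserved, and record how the predicted label moves. A minimal sketch, assuming librosa and a generic predict_genre(waveform, sample_rate) function standing in for the system under test:

```python
import librosa

def tempo_sensitivity_probe(path, predict_genre, rates=(0.8, 0.9, 1.0, 1.1, 1.2)):
    """Return {stretch_rate: predicted_label} for one excerpt. A system that keys on tempo
    will change its answer as the rate moves, even though the genre has not changed."""
    y, sr = librosa.load(path, sr=22050)
    return {rate: predict_genre(librosa.effects.time_stretch(y, rate=rate), sr)
            for rate in rates}
```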
Article
A decade has passed since the first review of research on a ‘flagship application’ of music information retrieval (MIR): the problem of music genre recognition (MGR). During this time, about 500 works addressing MGR have been published, and at least 10 campaigns have been run to evaluate MGR systems. This makes MGR one of the most researched areas of MIR. So, where does MGR now lie? We show that in spite of this massive amount of work, MGR does not lie far from where it began, and the paramount reason for this is that most evaluation in MGR lacks validity. We perform a case study of all published research using the most-used benchmark dataset in MGR during the past decade: GTZAN. We show that none of the evaluations in these many works is valid to produce conclusions with respect to recognizing genre, i.e. that a system is using criteria relevant for recognizing genre. In fact, the problems of validity in evaluation also affect research in music emotion recognition and autotagging. We conclude by discussing the implications of our work for MGR and MIR in the next ten years.
Conference Paper
Much work is focused upon music genre recognition (MGR) from audio recordings, symbolic data, and other modalities. While reviews have been written of some of this work before, no survey has been made of the approaches to evaluating approaches to MGR. This paper compiles a bibliography of work in MGR, and analyzes three aspects of evaluation: experimental designs, datasets, and figures of merit.
Conference Paper
A fundamental problem with nearly all work in music genre recognition (MGR) is that evaluation lacks validity with respect to the principal goals of MGR. This problem also occurs in the evaluation of music emotion recognition (MER). Standard approaches to evaluation, though easy to implement, do not reliably differentiate between recognizing genre or emotion from music, or by virtue of confounding factors in signals (e.g., equalization). We demonstrate such problems for evaluating an MER system, and conclude with recommendations.
Article
The field of Music Information Retrieval has always acknowledged the need for rigorous scientific evaluations, and several efforts have set out to develop and provide the infrastructure, technology and methodologies needed to carry out these evaluations. The community has enormously gained from these evaluation forums, but we have reached a point where we are stuck with evaluation frameworks that do not allow us to improve as much and as well as we want. The community recently acknowledged this problem and showed interest in addressing it, though it is not clear what to do to improve the situation. We argue that a good place to start is again the Text IR field. Based on a formalization of the evaluation process, this paper presents a survey of past evaluation work in the context of Text IR, from the point of view of validity, reliability and efficiency of the experiments. We show the problems that our community currently has in terms of evaluation, point to several lines of research to improve it and make various proposals in that line.
Article
As we look to advance the state of the art in content-based music informatics, there is a general sense that progress is decelerating throughout the field. On closer inspection, performance trajectories across several applications reveal that this is indeed the case, raising some difficult questions for the discipline: why are we slowing down, and what can we do about it? Here, we strive to address both of these concerns. First, we critically review the standard approach to music signal analysis and identify three specific deficiencies to current methods: hand-crafted feature design is sub-optimal and unsustainable, the power of shallow architectures is fundamentally limited, and short-time analysis cannot encode musically meaningful structure. Acknowledging breakthroughs in other perceptual AI domains, we offer that deep learning holds the potential to overcome each of these obstacles. Through conceptual arguments for feature learning and deeper processing architectures, we demonstrate how deep processing models are more powerful extensions of current methods, and why now is the time for this paradigm shift. Finally, we conclude with a discussion of current challenges and the potential impact to further motivate an exploration of this promising research area.
Article
Studies using three koi (Cyprinus carpio) investigated discrimination of musical stimuli. The common protocol used a single manipulandum and a multiple continuous reinforcement-extinction schedule signaled by music of the S+ and S− types in 30-sec presentations separated by a silent 15-sec intertrial interval. In a categorization study, the fish learned to discriminate blues recordings from classical, generalizing from John Lee Hooker (guitar and vocals) and Bach (oboe concertos) to multiple artists and ensembles. A control-by-reversal test developed into a demonstration of progressive improvement in iterated reversal learning. The subjects next learned to discriminate single-timbre synthesized versions of similar music. In the final study, which used melodies with the same order of note-duration values, but with mirror-image orders of pitch values, one fish discriminated melodies with no timbre cues, in contrast to results reported in rats.
Article
The reinforcing property of music for Java sparrows was examined in a chamber with three perches. One of the end perches produced music by Bach while the other perch produced music by Schoenberg. Two of the four birds stayed significantly longer on the perch associated with Bach music and retained their preference for Bach over Schoenberg when other pieces of music by Bach and Schoenberg were used. These two birds also preferred Vivaldi to Carter, suggesting a preference for classical music over modern music. One of the two birds that did not show a preference between Bach and Schoenberg preferred Bach to white noise, but the remaining bird did not show any preference for music over noise. These results suggest that Java sparrows have musical preferences and that the reinforcing properties depend on the individual.
Article
Musical genre is probably the most popular music descriptor. In the context of large musical databases and Electronic Music Distribution, genre is therefore a crucial piece of metadata for the description of music content. However, genre is intrinsically ill-defined, and attempts at defining genre precisely have a strong tendency to end up in circular, ungrounded projections of fantasies. Is genre an intrinsic attribute of music titles, as, say, tempo is? Or is genre an extrinsic description of the whole piece? In this article, we discuss the various approaches to representing musical genre, and propose to classify these approaches into three main categories: manual, prescriptive and emergent approaches. We discuss the pros and cons of each approach, and illustrate our study with results from the Cuidado IST project.
Article
A music piece can be considered as a sequence of sound events which represent both short-term and long-term temporal information. However, in the task of automatic music genre classification, most text-categorization-based approaches capture only local temporal dependencies (e.g., unigram and bigram-based occurrence statistics) to represent music contents. In this paper, we propose the use of time-constrained sequential patterns (TSPs) as effective features for music genre classification. First, an automatic language identification technique is performed to tokenize each music piece into a sequence of hidden Markov model indices. Then TSP mining is applied to discover genre-specific TSPs, followed by the computation of occurrence frequencies of TSPs in each music piece. Finally, support vector machine classifiers are employed on these occurrence frequencies to perform the classification task. Experiments conducted on two widely used datasets for music genre classification, GTZAN and ISMIR2004Genre, show that the proposed method can discover more discriminative temporal structures and achieve better recognition accuracy than the unigram and bigram-based statistical approach.
Article
The GTZAN dataset appears in at least 100 published works, and is the most-used public dataset for evaluation in machine listening research for music genre recognition (MGR). Our recent work, however, shows GTZAN has several faults (repetitions, mislabelings, and distortions), which challenge the interpretability of any result derived using it. In this article, we disprove the claims that all MGR systems are affected in the same ways by these faults, and that the performances of MGR systems in GTZAN are still meaningfully comparable since they all face the same faults. We identify and analyze the contents of GTZAN, and provide a catalog of its faults. We review how GTZAN has been used in MGR research, and find few indications that its faults have been known and considered. Finally, we rigorously study the effects of its faults on evaluating five different MGR systems. The lesson is not to banish GTZAN, but to use it with consideration of its contents.
Article
The diversity of symbolic dimensions along which we think about music in our everyday listening experience is puzzling. Songs are commonly said to be "energetic", to make us "sad" or "nostalgic", to sound "like film music", to be perfect to "drive a car on the highway", among many other similar metaphors. Such descriptions are generally considered as well-defined sensory constructs, strongly coupled to the acoustic properties of the corresponding musical stimuli. Yet 10 years of computer pattern recognition research for musical audio signals have failed to reliably characterize such mappings. This either means that pattern recognition needs to develop richer models of human auditory processing, or that the words we use to talk about music are less heavily based on acoustic similarity than we usually think. In this paper, we pursue the latter hypothesis. We show that even if their acoustic mapping is weak, typical audio categories can be reliably predicted when considering inter-symbolic associations. We propose a computational model to re-integrate pattern recognition techniques into this larger vision, and discuss its implications for the evolution and learnability of human symbolic systems.
Article
Design of Comparative Experiments develops a coherent framework for thinking about factors that affect experiments and their relationships, including the use of Hasse diagrams. These diagrams are used to elucidate structure, calculate degrees of freedom and allocate treatment sub-spaces to appropriate strata. Good design considers units and treatments first, and then allocates treatments to units. Based on a one-term course the author has taught since 1989, the book is ideal for advanced undergraduate and beginning graduate courses. This book should be on the shelf of every practicing statistician who designs experiments.
Article
Music information retrieval (MIR) is an emerging research area that receives growing attention from both the research community and the music industry. It addresses the problem of querying and retrieving certain types of music from large music data sets. Classification is a fundamental problem in MIR. Many tasks in MIR can be naturally cast in a classification setting, such as genre classification, mood classification, artist recognition, and instrument recognition. Music annotation, a new research area in MIR that has attracted much attention in recent years, is also a classification problem in the general sense. Due to the importance of music classification in MIR research, the rapid development of new methods, and the lack of review papers on recent progress in the field, we provide a comprehensive review of audio-based classification in this paper and systematically summarize the state-of-the-art techniques for music classification. Specifically, we stress the differences in the features and the types of classifiers used for different classification tasks. This survey emphasizes recent developments in the techniques and discusses several open issues for future research.
Conference Paper
We argue that it is time to re-evaluate the MIR community's approach to building artificial systems which operate over music. We suggest that it is fundamentally problematic to view music simply as data representing audio signals, and that the notion of the so-called "semantic gap" is misleading. We propose a philosophical framework within which scientific and/or technological study of music can be carried out, free from such artificial constructions. Ultimately, we argue that Music (as opposed to sound) can be studied only in a context which explicitly allows for, and is built on, (albeit de facto) models of human perception; to do otherwise is not to study Music at all.