Gerhard Widmer
Johannes Kepler University Linz | JKU · Department of Computational Perception

About

431 Publications
87,005 Reads
11,243 Citations (since 2017: 134 research items, 4,393 citations)
[Chart: citations per year, 2017–2023]

Publications (431)
Preprint
We introduce a novel way to incorporate prior information into (semi-) supervised non-negative matrix factorization, which we call differentiable dictionary search. It enables general, highly flexible and principled modelling of mixtures where non-linear sources are linearly mixed. We study its behavior on an audio decomposition task, and conduct a...
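(As a rough illustration of the linear-mixture setting that differentiable dictionary search builds on, the sketch below is plain supervised NMF with a fixed, pretrained dictionary — not the paper's flow-based method. All names and shapes are hypothetical.)

```python
import numpy as np

def nmf_activations(V, W, n_iter=200, eps=1e-9):
    # Estimate nonnegative activations H for spectrogram V given a
    # fixed dictionary W, via multiplicative updates minimizing the
    # Euclidean cost ||V - W @ H||^2.
    H = np.random.rand(W.shape[1], V.shape[1])
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
    return H

# Hypothetical usage: V is a magnitude spectrogram, W stacks spectral
# templates learned per source on isolated training material.
V = np.abs(np.random.randn(513, 100))
W = np.abs(np.random.randn(513, 20))
print(nmf_activations(V, W).shape)  # (20, 100)
```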
Preprint
This paper describes several improvements to a new method for signal decomposition that we recently formulated under the name of Differentiable Dictionary Search (DDS). The fundamental idea of DDS is to exploit a class of powerful deep invertible density estimators called normalizing flows, to model the dictionary in a linear decomposition method s...
Preprint
The success of supervised deep learning methods is largely due to their ability to learn relevant features from raw data. Deep Neural Networks (DNNs) trained on large-scale datasets are capable of capturing a diverse set of features, and learning a representation that can generalize onto unseen tasks and datasets that are from the same domain. Henc...
Preprint
Audio Spectrogram Transformer models rule the field of Audio Tagging, outperforming the previously dominant Convolutional Neural Networks (CNNs). Their superiority is based on the ability to scale up and exploit large-scale datasets such as AudioSet. However, Transformers are demanding in terms of model size and computational requirements compared to CN...
Article
Full-text available
Given the rise of deep learning and its inherent black-box nature, the desire to interpret these systems and explain their behaviour has become increasingly prominent. The main idea of so-called explainers is to identify which features of particular samples have the most influence on a classifier’s prediction, and present them as explanations. Eva...
Article
Full-text available
Current state-of-the-art methods for polyphonic piano transcription tend to use high capacity neural networks. Most models are trained “end-to-end”, and learn a mapping from audio input to pitch labels. They require large training corpora consisting of many audio recordings of different piano models and temporally aligned pitch labels. It has been...
Preprint
Full-text available
Current approaches for explaining deep learning systems applied to musical data provide results in a low-level feature space, e.g., by highlighting potentially relevant time-frequency bins in a spectrogram or time-pitch bins in a piano roll. This can be difficult to understand, particularly for musicologists without technical knowledge. To address...
Preprint
The absence of large labeled datasets remains a significant challenge in many application areas of deep learning. Researchers and practitioners typically resort to transfer learning and data augmentation to alleviate this issue. We study these strategies in the context of audio retrieval with natural language queries (Task 6b of the DCASE 2022 Chal...
Preprint
Standard machine learning models for tagging and classifying acoustic signals cannot handle classes that were not seen during training. Zero-Shot (ZS) learning overcomes this restriction by predicting classes based on adaptable class descriptions. This study sets out to investigate the effectiveness of self-attention-based audio embedding architect...
Preprint
Full-text available
This paper presents the specifications of match: a file format that extends a MIDI human performance with note-, beat-, and downbeat-level alignments to a corresponding musical score. This enables advanced analyses of the performance that are relevant for various tasks, such as expressive performance modeling, score following, music transcription,...
Preprint
Full-text available
Partitura is a lightweight Python package for handling symbolic musical information. It provides easy access to features commonly used in music information retrieval tasks, like note arrays (lists of timed pitched events) and 2D piano roll matrices, as well as other score elements such as time and key signatures, performance directives, and repeat...
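(A minimal usage sketch of the kind of access Partitura provides; function and field names follow the package's documentation, but exact locations and signatures may differ between versions.)

```python
import partitura as pt

# Load a symbolic score and view it as a structured note array of
# timed pitched events, as described above. "example.musicxml" is a
# placeholder path; MusicXML is one of the supported formats.
score = pt.load_score("example.musicxml")
notes = score.note_array()
print(notes["pitch"][:5], notes["onset_beat"][:5])
```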
Preprint
Full-text available
Adversarial attacks can drastically degrade performance of recommenders and other machine learning systems, resulting in an increased demand for defence mechanisms. We present a new line of defence against attacks which exploit a vulnerability of recommenders that operate in high dimensional data spaces (the so-called hubness problem). We use a glo...
Article
Full-text available
The task of real-time alignment between a music performance and the corresponding score (sheet music), also known as score following, poses a challenging multi-modal machine learning problem. Training a system that can solve this task robustly with live audio and real sheet music (i.e., scans or score images) requires precise ground truth alignment...
Preprint
We present a prototype of an automatic page turning system that works directly on real scores, i.e., sheet images, without any symbolic representation. Our system is based on a multi-modal neural network architecture that observes a complete sheet image page as input, listens to an incoming musical performance, and predicts the corresponding positi...
Preprint
The great success of transformer-based models in natural language processing (NLP) has led to various attempts at adapting these architectures to other domains such as vision and audio. Recent work has shown that transformers can outperform Convolutional Neural Networks (CNNs) on vision and audio tasks. However, one of the main shortcomings of tran...
Preprint
Fully automatic opera tracking is challenging because of the acoustic complexity of the genre, combining musical and linguistic information (singing, speech) in complex ways. In this paper, we propose a new pipeline for complete opera tracking. The pipeline is based on two trackers. A music tracker that has proven to be effective at tracking orches...
Preprint
Audio-to-lyrics alignment has become an increasingly active research task in MIR, supported by the emergence of several open-source datasets of audio recordings with word-level lyrics annotations. However, there are still a number of open problems, such as a lack of robustness in the face of severe duration mismatches between audio and lyrics repre...
Preprint
Full-text available
Despite recent advances in audio content-based music emotion recognition, a question that remains to be explored is whether an algorithm can reliably discern emotional or expressive qualities between different performances of the same piece. In the present work, we analyze several sets of features on their effectiveness in predicting arousal and va...
Preprint
Convolutional Neural Networks (CNNs) have been dominating classification tasks in various domains, such as machine vision, machine listening, and natural language processing. In machine listening, while generally exhibiting very good generalization capabilities, CNNs are sensitive to the specific audio recording device used, which has been recogniz...
Preprint
Full-text available
Local explanation methods such as LIME have become popular in MIR as tools for generating post-hoc, model-agnostic explanations of a model's classification decisions. The basic idea is to identify a small set of human-understandable features of the classified example that are most influential on the classifier's prediction. These are then presented...
Article
Full-text available
Small adversarial perturbations of input data can drastically change the performance of machine learning systems, thereby challenging their validity. We compare several adversarial attacks targeting an instrument classifier, where for the first time in Music Information Retrieval (MIR) the perturbations are computed directly on the waveform. The at...
Preprint
Full-text available
Music emotion recognition is an important task in MIR (Music Information Retrieval) research. Owing to factors like the subjective nature of the task and the variation of emotional cues between musical genres, there are still significant challenges in developing reliable and generalizable models. One important step towards better models would be to...
Preprint
This paper addresses the problem of cross-modal musical piece identification and retrieval: finding the appropriate recording(s) from a database given a sheet music query, and vice versa, working directly with audio and scanned sheet music images. The fundamental approach to this is to learn a cross-modal embedding space with a suitable similarity...
Preprint
In this paper, we study the performance of variants of well-known Convolutional Neural Network (CNN) architectures on different audio tasks. We show that tuning the Receptive Field (RF) of CNNs is crucial to their generalization. An insufficient RF limits the CNN's ability to fit the training data. In contrast, CNNs with an excessive RF tend to ove...
Article
In this paper, we study the performance of variants of well-known Convolutional Neural Network (CNN) architectures on different audio tasks. We show that tuning the Receptive Field (RF) of CNNs is crucial to their generalization. An insufficient RF limits the CNN's ability to fit the training data. In contrast, CNNs with an excessive RF tend to ove...
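(The receptive field discussed here can be computed layer by layer with the standard recurrence RF += (kernel - 1) * jump; jump *= stride. A minimal sketch, with a hypothetical VGG-like layer stack:)

```python
def receptive_field(layers):
    # Theoretical receptive field (in input pixels/frames) of a stack
    # of conv/pool layers, each given as a (kernel, stride) pair.
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# E.g. 3x3 convs interleaved with two stride-2 poolings:
stack = [(3, 1), (2, 2), (3, 1), (2, 2), (3, 1), (3, 1), (3, 1)]
print(receptive_field(stack))  # 34
```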
Preprint
Algorithms for reliable real-time score following in live opera promise a lot of useful applications such as automatic subtitles display, or real-time video cutting in live streaming. Until now, such systems were based on the strong assumption that an opera performance follows the structure of the score linearly. However, this is rarely the case in...
Preprint
This paper addresses the problem of sheet-image-based on-line audio-to-score alignment also known as score following. Drawing inspiration from object detection, a conditional neural network architecture is proposed that directly predicts x,y coordinates of the matching positions in a complete score sheet image at each point in time for a given musi...
Chapter
Full-text available
Although current music recommender systems suggest new tracks to their users, they do not provide listenable explanations of why a user should listen to them. LEMONS (Demonstration video: https://youtu.be/giSPrPnZ7mc) is a new system that addresses this gap by (1) adopting a deep learning approach to generate audio content-based recommendations fro...
Preprint
Full-text available
Emotion and expressivity in music have been topics of considerable interest in the field of music information retrieval. In recent years, mid-level perceptual features have been suggested as means to explain computational predictions of musical emotion. We find that the diversity of musical styles and genres in the available dataset for learning th...
Preprint
Unsupervised anomalous sound detection is concerned with identifying sounds that deviate from what is defined as 'normal', without explicitly specifying the types of anomalies. A significant obstacle is the diversity and rareness of outliers, which typically prevent us from collecting a representative set of anomalous sounds. As a consequence, most...
Preprint
Full-text available
Deep Neural Networks are known to be very demanding in terms of computing and memory requirements. Due to the ever-increasing use of embedded systems and mobile devices with a limited resource budget, designing low-complexity models without sacrificing too much of their predictive performance has gained great importance. In this work, we investigate an...
Preprint
Robust real-time opera tracking (score following) would be extremely useful for many processes surrounding live opera staging and streaming, including automatic lyrics displays, camera control, or live video cutting. Recent work has shown that, with some appropriate measures to account for common problems such as breaks and interruptions, spontaneo...
Preprint
Deep neural networks (DNNs) are successfully applied in a wide variety of music information retrieval (MIR) tasks. Such models are usually considered "black boxes", meaning that their predictions are not interpretable. Prior work on explainable models in MIR has generally used image processing tools to produce explanations for DNN predictions, but...
Preprint
Full-text available
We explore how much can be learned from noisy labels in audio music tagging. Our experiments show that carefully annotated labels result in the highest figures of merit, but even high amounts of noisy labels contain enough information for successful learning. Artificial corruption of curated data allows us to quantify this contribution of noisy labels.
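(A sketch of the kind of artificial label corruption described, assuming integer class labels; the paper's actual experimental protocol may differ.)

```python
import numpy as np

def corrupt_labels(y, noise_rate, n_classes, seed=0):
    # Replace a fraction `noise_rate` of the labels with uniformly
    # random classes, to measure how performance degrades with noise.
    rng = np.random.default_rng(seed)
    y = y.copy()
    flip = rng.random(len(y)) < noise_rate
    y[flip] = rng.integers(0, n_classes, size=flip.sum())
    return y

y = np.arange(10) % 4
print(corrupt_labels(y, 0.5, 4))
```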
Preprint
Full-text available
A piece of music can be expressively performed, or interpreted, in a variety of ways. With the help of an online questionnaire, the Con Espressione Game, we collected some 1,500 descriptions of expressive character relating to 45 performances of 9 excerpts from classical piano pieces, played by different famous pianists. More specifically, listener...
Preprint
Deep neural networks (DNNs) are successfully applied in a wide variety of music information retrieval (MIR) tasks but their predictions are usually not interpretable. We propose audioLIME, a method based on Local Interpretable Model-agnostic Explanations (LIME) extended by a musical definition of locality. The perturbations used in LIME are created...
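(A minimal sketch of the LIME-style procedure over interpretable audio components, e.g. source-separated stems. `predict_fn` and the component list are assumed inputs; the actual audioLIME implementation differs in detail.)

```python
import numpy as np
from sklearn.linear_model import Ridge

def lime_explain(components, predict_fn, n_samples=256, seed=0):
    # Switch components on/off at random, query the classifier on
    # each remix, and fit a linear surrogate whose weights rank each
    # component's influence on the prediction.
    rng = np.random.default_rng(seed)
    masks = rng.integers(0, 2, size=(n_samples, len(components)))
    remixes = masks @ np.stack(components)   # linear remixing of stems
    preds = np.array([predict_fn(x) for x in remixes])
    return Ridge(alpha=1.0).fit(masks, preds).coef_

# Hypothetical usage with two stems and a toy "classifier":
stems = [np.random.randn(22050), np.random.randn(22050)]
print(lime_explain(stems, predict_fn=lambda x: float(np.mean(x**2))))
```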
Preprint
Convolutional Neural Networks (CNNs) have been successfully used in various Music Information Retrieval (MIR) tasks, both as end-to-end models and as feature extractors for more complex systems. However, the MIR field is still dominated by the classical VGG-based CNN architecture variants, often in combination with more complex modules such as atte...
Preprint
This paper addresses the task of score following in sheet music given as unprocessed images. While existing work either relies on OMR software to obtain a computer-readable score representation, or crucially relies on prepared sheet image excerpts, we propose the first system that directly performs score following in full-page, completely unprocess...
Article
Recurrent voice-leading patterns like the Mi-Re-Do compound cadence (MRDCC) rarely appear on the musical surface in complex polyphonic textures, so finding these patterns using computational methods remains a tremendous challenge. The present study extends the canonical n-gram approach by using skip-grams, which include sub-sequences in an n-gram l...
Preprint
Data augmentation techniques have become standard practice in deep learning, as they have been shown to greatly improve the generalisation abilities of models. These techniques rely on different ideas such as invariance-preserving transformations (e.g., expert-defined augmentation), statistical heuristics (e.g., Mixup), and learning the data distributio...
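(Mixup, mentioned above as a statistical heuristic, is compact in practice; a minimal NumPy sketch, assuming one-hot label vectors:)

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    # Convex combination of two training examples and their one-hot
    # labels, with the mixing weight drawn from Beta(alpha, alpha).
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```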
Preprint
Recurrent voice-leading patterns like the Mi-Re-Do compound cadence (MRDCC) rarely appear on the musical surface in complex polyphonic textures, so finding these patterns using computational methods remains a tremendous challenge. The present study extends the canonical n-gram approach by using skip-grams, which include sub-sequences in an n-gram l...
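(A minimal sketch of skip-gram enumeration over a symbolic event sequence, assuming a "total span at most n + max_skip events" definition; the paper's exact formulation may differ.)

```python
from itertools import combinations

def skipgrams(seq, n, max_skip):
    # All n-element subsequences whose span covers at most
    # n + max_skip consecutive events (max_skip == 0 gives n-grams).
    out = []
    for start in range(len(seq) - n + 1):
        window = seq[start:start + n + max_skip]
        for rest in combinations(range(1, len(window)), n - 1):
            out.append(tuple(window[i] for i in (0, *rest)))
    return out

print(skipgrams(["mi", "re", "fa", "do"], n=3, max_skip=1))
```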
Preprint
Recent advances in real-time music score following have made it possible for machines to automatically track highly complex polyphonic music, including full orchestra performances. In this paper, we attempt to take this to an even higher level, namely, live tracking of full operas. We first apply a state-of-the-art audio alignment method based on o...
Article
Full-text available
Score following is the process of tracking a musical performance (audio) in a corresponding symbolic representation (score). While methods using computer-readable score representations as input are able to achieve reliable tracking results, there is little research on score following based on raw score images. In this paper, we build on previous wo...
Preprint
We present the CP-JKU submission to MediaEval 2019: a Receptive Field (RF)-regularized and Frequency-Aware CNN approach for tagging music with emotion/mood labels. We perform an investigation regarding the impact of the RF of the CNNs on their performance on this dataset. We observe that ResNets with smaller receptive fields -- originally adapted for a...
Preprint
The goal of score following is to track a musical performance, usually in the form of audio, in a corresponding score representation. Established methods mainly rely on computer-readable scores in the form of MIDI or MusicXML and achieve robust and reliable tracking results. Recently, multimodal deep learning methods have been used to follow along...
Preprint
Full-text available
Quantitative analysis of commonalities and differences between recorded music performances is an increasingly common task in computational musicology. A typical scenario involves manual annotation of different recordings of the same piece along the time dimension, for comparative analysis of, e.g., the musical tempo, or for mapping other performanc...
Preprint
Acoustic scene classification and related tasks have been dominated by Convolutional Neural Networks (CNNs). Top-performing CNNs use mainly audio spectrograms as input and borrow their architectural design primarily from computer vision. A recent study has shown that restricting the receptive field (RF) of CNNs in appropriate ways is crucial for the...
Preprint
Full-text available
We explore a novel way of conceptualising the task of polyphonic music transcription, using so-called invertible neural networks. Invertible models unify both discriminative and generative aspects in one function, sharing one set of parameters. Introducing invertibility enables the practitioner to directly inspect what the discriminative model has...
Preprint
Distribution mismatches between the data seen at training and at application time remain a major challenge in all application areas of machine learning. We study this problem in the context of machine listening (Task 1b of the DCASE 2019 Challenge). We propose a novel approach to learn domain-invariant classifiers in an end-to-end fashion by enforc...
Technical Report
Full-text available
This technical report describes the submissions from the SAL/CP JKU team for Task 1-Subtask C (classification on data that includes classes not encountered in the training data) of the DCASE-2019 challenge. Our method uses a ResNet variant specifically adapted to be used along with spectrograms in the context of Acoustic Scene Classification (ASC)....
Preprint
Full-text available
Emotional aspects play an important part in our interaction with music. However, modelling these aspects in MIR systems has been notoriously challenging since emotion is an inherently abstract and subjective experience, thus making it difficult to quantify or predict in the first place, and to make sense of the predictions in the next. In an attem...
Technical Report
Full-text available
In this report, we detail the CP-JKU submissions to the DCASE-2019 challenge Task 1 (acoustic scene classification) and Task 2 (audio tagging with noisy labels and minimal supervision). In all of our submissions, we use fully convolutional deep neural networks architectures that are regularized with Receptive Field (RF) adjustments. We adjust the R...
Preprint
Full-text available
Convolutional Neural Networks (CNNs) have had great success in many machine vision as well as machine audition tasks. Many image recognition network architectures have consequently been adapted for audio processing tasks. However, despite some successes, the performance of many of these did not translate from the image to the audio domain. For exam...
Preprint
Full-text available
Connecting large libraries of digitized audio recordings to their corresponding sheet music images has long been a motivation for researchers to develop new cross-modal retrieval systems. In recent years, retrieval systems based on embedding space learning with deep neural networks have come a step closer to fulfilling this vision. However, global and lo...
Conference Paper
Full-text available
Musicians produce individualized, expressive performances by manipulating parameters such as dynamics, tempo and articulation. This manipulation of expressive parameters is informed by elements of score information such as pitch, meter, and tempo and dynamics markings (among others). In this paper we present an interactive interface that gives use...
Preprint
Full-text available
In many musical traditions, the melody line is of primary significance in a piece. Human listeners can readily distinguish melodies from accompaniment; however, making this distinction given only the written score -- i.e. without listening to the music performed -- can be a difficult task. Solving this task is of great importance for both Music Inf...
Preprint
Full-text available
We investigate a late-fusion approach to piano transcription, combined with a strong temporal prior in the form of a handcrafted Hidden Markov Model (HMM). The network architecture under consideration is compact in terms of its number of parameters and easy to train with gradient descent. The network outputs are fused over time in the final stage t...
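(The temporal-fusion idea can be illustrated, per pitch, with a two-state on/off HMM decoded by Viterbi over framewise posteriors — a sketch of the general technique, not the paper's exact model:)

```python
import numpy as np

def viterbi_binary(posteriors, p_stay=0.99):
    # Smooth framewise note-on posteriors for one pitch with a sticky
    # two-state (off/on) HMM, favouring contiguous notes over
    # frame-by-frame thresholding.
    T = len(posteriors)
    log_A = np.log(np.array([[p_stay, 1 - p_stay],
                             [1 - p_stay, p_stay]]))
    log_B = np.log(np.stack([1 - posteriors, posteriors], axis=1) + 1e-12)
    delta = np.zeros((T, 2))
    psi = np.zeros((T, 2), dtype=int)
    delta[0] = np.log([0.5, 0.5]) + log_B[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # [prev, cur]
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1][path[t + 1]]
    return path  # 0/1 per frame

post = np.array([0.1, 0.8, 0.9, 0.2, 0.85, 0.9, 0.1])
print(viterbi_binary(post))
```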
Article
Full-text available
Generative Adversarial Networks have a surprising ability to generate sharp and realistic images, but they are known to suffer from the so-called mode collapse problem. In this paper, we propose a new GAN variant called Mixture Density GAN that overcomes this problem by encouraging the Discriminator to form clusters in its embedding space, which in...
Preprint
Musicians produce individualized, expressive performances by manipulating parameters such as dynamics, tempo and articulation. This manipulation of expressive parameters is informed by elements of score information such as pitch, meter, and tempo and dynamics markings (among others). In this paper we present an interactive interface that gives user...
Article
Full-text available
The availability of increasingly larger multimedia collections has fostered extensive research in recommender systems. Instead of capturing general user preferences, the task of next-item recommendation focuses on revealing specific session preferences encoded in the most recent user interactions. This study focuses on the music domain, particularl...
Preprint
Current ML models for music emotion recognition, while generally working quite well, do not give meaningful or intuitive explanations for their predictions. In this work, we propose a 2-step procedure to arrive at spectrogram-level explanations that connect certain aspects of the audio to interpretable mid-level perceptual features, and these to th...
Article
Full-text available
Music recommender systems have become a key technology to support the interaction of users with the increasingly larger music catalogs of on-line music streaming services, on-line music shops, and personal devices. An important task in music recommender systems is the automated continuation of music playlists, that enables the recommendation of mus...
Article
Full-text available
There has been a rapid growth of digitally available music data, including audio recordings, digitized images of sheet music, album covers and liner notes, and video clips. This huge amount of data calls for retrieval strategies that allow users to explore large music collections in a convenient way. More precisely, there is a need for cross-modal...
Preprint
Full-text available
Viewing polyphonic piano transcription as a multitask learning problem, where we need to simultaneously predict onsets, intermediate frames and offsets of notes, we investigate the performance impact of additional prediction targets, using a variety of suitable convolutional neural network architectures. We quantify performance differences of addit...
Chapter
Music recommender systems have become a key technology supporting the access to increasingly larger music catalogs in on-line music streaming services, on-line music shops, and private collections. The interaction of users with large music catalogs is a complex phenomenon researched from different disciplines. We survey our works investigating the...
Article
Clinical decision-making in cancer and other diseases relies on timely and cost-effective genome-wide testing. Classical bioinformatic algorithms, such as Rawcopy, can support genomic analysis by calling genomic breakpoints and copy-number variations (CNVs), but often require manual data curation, which is error prone, time-consuming, and thus subs...
Presentation
Deep Neural Networks (DNNs) are known for excellent performance in supervised tasks such as classification. Convolutional Neural Networks (CNNs), in particular, can learn effective features and build high-level representations that can be used for classification, but also for querying and nearest neighbor search. However, CNNs have also been shown...
Conference Paper
Full-text available
Generative Adversarial Networks have a surprising ability to generate sharp and realistic images, though they are known to suffer from the so-called mode collapse problem. In this paper, we propose a new GAN variant called Mixture Density GAN that, while capable of generating high-quality images, overcomes this problem by encouraging the Disc...
Chapter
The buildup and release of a sense of tension is one of the most essential aspects of the process of listening to music. A veridical computational model of perceived musical tension would be an important ingredient for many music informatics applications [27]. The present paper presents a new approach to modelling harmonic tension based on a distri...
Conference Paper
Full-text available
This paper describes our submission to the first Freesound general-purpose audio tagging challenge carried out within the DCASE 2018 challenge. Our proposal is based on a fully convolutional neural network that predicts one out of 41 possible audio class labels when given an audio spectrogram excerpt as an input. What makes this classification datas...
Conference Paper
Full-text available
In this paper, we present our approach used for the CP-JKU submission in Task 4 of the DCASE-2018 Challenge. We propose a novel iterative knowledge distillation technique for weakly-labeled semi-supervised event detection using neural networks, specifically Recurrent Convolutional Neural Networks (R-CNNs). R-CNNs are used to tag the unlabeled data...
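(The core knowledge-distillation idea, phrased for a multi-label tagging setting: blend the loss on ground-truth tags with a loss against the teacher's soft predictions. A generic sketch; the paper's iterative, weakly-labeled variant differs.)

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, alpha=0.5):
    # Loss on ground-truth tags, plus a loss against the teacher's
    # soft (sigmoid) predictions on the same clips.
    hard = F.binary_cross_entropy_with_logits(student_logits, targets)
    soft = F.binary_cross_entropy_with_logits(
        student_logits, torch.sigmoid(teacher_logits))
    return alpha * hard + (1 - alpha) * soft
```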
Preprint
Generative Adversarial Networks have a surprising ability to generate sharp and realistic images, though they are known to suffer from the so-called mode collapse problem. In this paper, we propose a new GAN variant called Mixture Density GAN that, while capable of generating high-quality images, overcomes this problem by encouraging the Disc...