Stefan Balke’s research while affiliated with Johannes Kepler University of Linz and other places

What is this page?


This page lists works of an author who doesn't have a ResearchGate profile or hasn't added the works to their profile yet. It is automatically generated from public (personal) data to further our legitimate goal of comprehensive and accurate scientific recordkeeping. If you are this author and want this page removed, please let us know.

Publications (23)


Figure 1: (a) Above: overview of the Jazz Structure Dataset (JSD). (b) Below: running example "Jordu" by Clifford Brown. The figure shows a novelty function and structure annotations within a web-based interface (T = theme; the pictograms indicate the current soloist and the accompaniment).
Figure 4: Raw annotation format for "Jordu" as contained in the JSD. Each row of the CSV file corresponds to a segment. The columns indicate the start time, the end time, the label, and the instrumentation of each segment.
Figure 7: Overview of the evaluation results for all recordings contained in the JSD's test set. The link (red arrow) leads to the details page as depicted in Figure 8.
Figure 8: (a) Evaluation web page showing the output of all methods for the running example "Jordu" by Clifford Brown. (b) Evaluation results of Foote's method with the input SSM based on MFCCs. (c) Evaluation results of the CNN approach, consisting of the novelty curves of five networks and the bagged novelty curve.
The loss function is the binary cross-entropy. Note that the binary classification problem is highly unbalanced, since there are only a few boundary frames.
JSD: A Dataset for Structure Analysis in Jazz Music
  • Article
  • Full-text available

November 2022 · 208 Reads · 5 Citations

Transactions of the International Society for Music Information Retrieval

Stefan Balke · [...]

Given a music recording, music structure analysis aims at identifying important structural elements and segmenting the recording according to these elements. In jazz music, a performance is often structured by repeating harmonic schemata (known as choruses), which lay the foundation for improvisation by soloists. Within the fields of music information retrieval (MIR) and computational musicology, the Weimar Jazz Database (WJD) has turned out to be an extremely valuable resource for jazz research. Containing high-quality solo transcriptions for 456 solo sections, the dataset opened up new avenues for the understanding of creative processes in jazz improvisation using computational methods. In this paper, we complement this dataset by introducing the Jazz Structure Dataset (JSD), which provides annotations on structure and instrumentation of entire recordings. The JSD comprises 340 recordings with more than 3000 annotated segments, along with a segment-wise encoding of the solo and accompanying instruments. These annotations provide the basis for training, testing, and evaluating models for various important MIR tasks, including structure analysis, solo detection, or instrument recognition. As an example application, we consider the task of structure boundary detection. Based on a traditional novelty-based as well as a more recent data-driven approach using deep learning, we indicate the potential of the JSD while critically reflecting on some evaluation aspects of structure analysis. In this context, we also demonstrate how the JSD annotations and analysis results can be made accessible in a user-friendly way via web-based interfaces for data inspection and visualization. All annotations, experimental results, and code for reproducibility are made publicly available for research purposes.
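
As a rough illustration of the novelty-based baseline mentioned in the abstract, the following Python sketch slides a Gaussian-tapered checkerboard kernel along the main diagonal of a self-similarity matrix (SSM) and picks local maxima of the resulting novelty curve as boundary candidates. It is a minimal re-implementation of the general Foote-style procedure under assumed inputs (e.g., an MFCC-based SSM); the kernel size and peak-picking parameters are placeholders, not the values used in the JSD experiments.

import numpy as np
from scipy.signal import find_peaks

def checkerboard_kernel(size):
    """Gaussian-tapered checkerboard kernel of shape (2*size, 2*size)."""
    axis = np.arange(-size, size)
    gauss = np.exp(-0.5 * (axis / (0.5 * size)) ** 2)
    taper = np.outer(gauss, gauss)
    sign = np.outer(np.sign(axis + 0.5), np.sign(axis + 0.5))
    return taper * sign

def foote_novelty(ssm, kernel_size=32):
    """Correlate a checkerboard kernel along the SSM diagonal (Foote, 2000)."""
    kernel = checkerboard_kernel(kernel_size)
    n = ssm.shape[0]
    padded = np.pad(ssm, kernel_size, mode="constant")
    novelty = np.zeros(n)
    for i in range(n):
        patch = padded[i:i + 2 * kernel_size, i:i + 2 * kernel_size]
        novelty[i] = np.sum(patch * kernel)
    novelty -= novelty.min()
    return novelty / (novelty.max() + 1e-9)

def detect_boundaries(novelty, prominence=0.1, min_distance=16):
    """Pick local maxima of the novelty curve as boundary candidates (frame indices)."""
    peaks, _ = find_peaks(novelty, prominence=prominence, distance=min_distance)
    return peaks

# Toy usage with a random, symmetrized "SSM"; in practice the SSM would be built from MFCCs.
ssm = np.random.rand(500, 500)
ssm = 0.5 * (ssm + ssm.T)
boundaries = detect_boundaries(foote_novelty(ssm))

In the data-driven alternative discussed in the paper, the novelty curve is instead predicted frame-wise by a CNN trained with binary cross-entropy, where boundary frames are rare and the two classes are therefore highly unbalanced.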


MTD: A Multimodal Dataset of Musical Themes for MIR Research

October 2020 · 516 Reads · 9 Citations

Transactions of the International Society for Music Information Retrieval

Musical themes are essential elements in Western classical music. In this paper, we present the Musical Theme Dataset (MTD), a multimodal dataset inspired by “A Dictionary of Musical Themes” by Barlow and Morgenstern from 1948. For a subset of 2067 themes of the printed book, we created several digital representations of the musical themes. Beyond graphical sheet music, we provide symbolic music encodings, audio snippets of music recordings, alignments between the symbolic and audio representations, as well as detailed metadata on the composer, work, recording, and musical characteristics of the themes. In addition to the data, we also make several parsers and web-based interfaces available to access and explore the different modalities and their relations through visualizations and sonifications. These interfaces also include computational tools, bridging the gap between the original dictionary and music information retrieval (MIR) research. The dataset is of relevance for various subfields and tasks in MIR, such as cross-modal music retrieval, music alignment, optical music recognition, music transcription, and computational musicology.
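
The symbolic-audio alignments mentioned above can in principle be computed with standard chroma-based dynamic time warping (DTW). The sketch below is not the MTD reference implementation; it converts a toy note list into a coarse symbolic chromagram and aligns it to an audio snippet with librosa, where the note list, hop size, and file name are placeholder assumptions.

import numpy as np
import librosa

def symbolic_chroma(notes, hop_sec=0.1, total_sec=None):
    """Binary chromagram from (onset_sec, duration_sec, midi_pitch) triples."""
    if total_sec is None:
        total_sec = max(onset + dur for onset, dur, _ in notes)
    n_frames = int(np.ceil(total_sec / hop_sec))
    chroma = np.zeros((12, n_frames))
    for onset, dur, pitch in notes:
        start = int(onset / hop_sec)
        end = int(np.ceil((onset + dur) / hop_sec))
        chroma[int(pitch) % 12, start:end] = 1.0
    return chroma

# Placeholder theme: a few notes of a hypothetical melody (onset, duration, MIDI pitch).
theme = [(0.0, 0.5, 60), (0.5, 0.5, 62), (1.0, 0.5, 64), (1.5, 1.0, 65)]
chroma_sym = symbolic_chroma(theme)

# Chroma features of the (assumed) corresponding audio snippet.
y, sr = librosa.load("theme_snippet.wav")                               # placeholder file name
chroma_audio = librosa.feature.chroma_cqt(y=y, sr=sr, hop_length=2205)  # ~0.1 s hop at 22.05 kHz

# DTW alignment between the two chroma sequences.
cost, warping_path = librosa.sequence.dtw(X=chroma_sym, Y=chroma_audio, metric="cosine")
# warping_path holds index pairs (symbolic frame, audio frame), listed from end to start.

The resulting warping path pairs symbolic frames with audio frames and could, for example, be used to sonify a theme in sync with the recording.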


Figure 6: Optimal tempo curve and corresponding optimal actions A_t for a continuous agent (piece: J. S. Bach, BWV 994). The A_t would be the target values for training an agent with supervised, feed-forward regression.
Figure 8: Two examples of policy outputs. (a) Agent is behind the target, resulting in a high probability for increasing the pixel speed (π(+Δν_pxl | s; θ) = 0.795). (b) Agent is ahead of the target, suggesting a reduction of pixel speed (π(-Δν_pxl | s; θ) = 0.903).
Figure 9: Visualization of the agent's focus on different parts of the input state for the situation shown in Figure 8b. The salience map was created via integrated gradients (Sundararajan et al., 2017), a technique to identify the most relevant input features for the agent's decision, in this case, for decreasing its pixel speed.
Comparison of score following approaches. MIDI-ODTW considers a perfectly extracted score MIDI file and aligns it to a performance with ODTW. OMR-ODTW does the same, but uses a score MIDI file extracted by an OMR system. MM-Loc is obtained by using the method presented by Dorfer et al. (2016) with a temporal context of 4 and 2 seconds for Nottingham and MSMD, respectively. For MSMD, we use the models from the references and re-evaluate them on the cleaned data set. For A2C, PPO, and REINFORCE_bl, we report the average over 10 evaluation runs. The mean absolute tracking error and its standard deviation are given in centimeters.
Hyperparameter overview.
Score Following as a Multi-Modal Reinforcement Learning Problem

November 2019 · 234 Reads · 21 Citations

Transactions of the International Society for Music Information Retrieval

Score following is the process of tracking a musical performance (audio) in a corresponding symbolic representation (score). While methods using computer-readable score representations as input are able to achieve reliable tracking results, there is little research on score following based on raw score images. In this paper, we build on previous work that formulates the score following task as a multi-modal Markov Decision Process (MDP). Given this formal definition, one can address the problem of score following with state-of-the-art deep reinforcement learning (RL) algorithms. In particular, we design end-to-end multi-modal RL agents that simultaneously learn to listen to music recordings, read the scores from images of sheet music, and follow the music along in the sheet. Using algorithms such as synchronous Advantage Actor Critic (A2C) and Proximal Policy Optimization (PPO), we reproduce and further improve existing results. We also present first experiments indicating that this approach can be extended to track real piano recordings of human performances. These audio recordings are made openly available to the research community, along with precise note-level alignment ground truth.
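
To make the MDP formulation a bit more concrete, the toy sketch below mimics the core loop: the agent controls a scalar "pixel speed", receives a reward that reflects how closely it tracks a target moving at a varying speed, and is trained with a plain REINFORCE policy-gradient update. This is a heavily simplified, hypothetical stand-in with a two-dimensional state instead of sheet-image and spectrogram inputs, not the A2C/PPO agents evaluated in the paper.

import math
import torch
import torch.nn as nn

class TrackingEnv:
    """Toy score-following MDP: keep up with a target moving at a varying, unknown speed."""
    def __init__(self):
        self.reset()

    def reset(self):
        self.agent_pos, self.agent_speed = 0.0, 1.0
        self.target_pos, self.t = 0.0, 0
        return self._state()

    def _state(self):
        # The real agents observe sheet-image and spectrogram excerpts;
        # this stand-in exposes only the tracking error and the current speed.
        return torch.tensor([self.target_pos - self.agent_pos, self.agent_speed])

    def step(self, action):                                    # 0: slow down, 1: keep speed, 2: speed up
        self.agent_speed += (action - 1) * 0.1
        self.agent_pos += self.agent_speed
        self.target_pos += 1.0 + 0.5 * math.sin(self.t / 10.0)  # "performance" tempo curve
        self.t += 1
        reward = -abs(self.target_pos - self.agent_pos)          # negative tracking error
        return self._state(), reward, self.t >= 100

policy = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 3))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for episode in range(200):
    env = TrackingEnv()
    state, done, log_probs, rewards = env.reset(), False, [], []
    while not done:
        dist = torch.distributions.Categorical(logits=policy(state))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward, done = env.step(action.item())
        rewards.append(reward)
    returns = torch.tensor(rewards).flip(0).cumsum(0).flip(0)   # undiscounted returns-to-go
    loss = -(torch.stack(log_probs) * returns).sum()            # REINFORCE objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()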


Figure 3: Illustration of the sheet music, input attention (normalized), and spectrogram for five examples from the MSMD test set. (a) L. v. Beethoven - Piano Sonata (Op. 79, 1st Mvt.), (b) J. S. Bach - Goldberg Variations: Variatio 12 (BWV 988), (c) J. S. Bach - French Suite VI: Menuet (BWV 817), (d) R. Schumann - Album for the Youth: Untitled (Op. 68, Nr. 26), and (e) M. Mussorgsky - Pictures at an Exhibition VIII: Catacombae.
Figure 4: Entropy of the input attention vs. the number of onsets in the respective audio frame.
Learning Soft-Attention Models for Tempo-invariant Audio-Sheet Music Retrieval

June 2019 · 124 Reads

Connecting large libraries of digitized audio recordings to their corresponding sheet music images has long been a motivation for researchers to develop new cross-modal retrieval systems. In recent years, retrieval systems based on embedding space learning with deep neural networks got a step closer to fulfilling this vision. However, global and local tempo deviations in the music recordings still require careful tuning of the amount of temporal context given to the system. In this paper, we address this problem by introducing an additional soft-attention mechanism on the audio input. Quantitative and qualitative results on synthesized piano data indicate that this attention increases the robustness of the retrieval system by focusing on different parts of the input representation based on the tempo of the audio. Encouraged by these results, we argue for the potential of attention models as a very general tool for many MIR tasks.
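
The soft-attention mechanism can be pictured as a learned weighting over the time frames of the audio excerpt before it is pooled into a fixed-size embedding. The PyTorch sketch below shows one plausible attention-pooling layer of this kind; the dimensions and the surrounding embedding network are assumptions rather than the paper's exact architecture.

import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Pool a sequence of frame features into one vector via learned soft attention."""
    def __init__(self, feature_dim, attn_hidden=64):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(feature_dim, attn_hidden),
            nn.Tanh(),
            nn.Linear(attn_hidden, 1),
        )

    def forward(self, frames):
        # frames: (batch, time, feature_dim)
        weights = torch.softmax(self.score(frames), dim=1)   # (batch, time, 1)
        pooled = (weights * frames).sum(dim=1)               # (batch, feature_dim)
        return pooled, weights.squeeze(-1)

# Toy usage: 8 excerpts, 100 spectrogram frames, 128-dim frame features.
frames = torch.randn(8, 100, 128)
pooled, attention = AttentionPooling(feature_dim=128)(frames)
# Inspecting `attention` shows which frames the model focuses on,
# e.g., whether its temporal focus narrows or widens with the tempo.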



Cross-Modal Music Retrieval and Applications: An Overview of Key Methodologies

February 2019 · 5 Reads

There has been a rapid growth of digitally available music data, including audio recordings, digitized images of sheet music, album covers and liner notes, and video clips. This huge amount of data calls for retrieval strategies that allow users to explore large music collections in a convenient way. More precisely, there is a need for cross-modal retrieval algorithms that, given a query in one modality (e.g., a short audio excerpt), find corresponding information and entities in other modalities (e.g., the name of the piece and the sheet music). This goes beyond exact audio identification and subsequent retrieval of metainformation as performed by commercial applications like Shazam [1].


Figure 1: The different representations for music data and data transformations relevant for cross-modal music retrieval.
Figure 4: An illustration of symbolic fingerprints.
Figure 5: An illustration of cross-modal retrieval via piano transcription and symbolic fingerprinting. (Photo of Werner Goebl courtesy of Clemens Chmelar.)
Cross-Modal Music Retrieval and Applications: An Overview of Key Methodologies

February 2019 · 813 Reads · 68 Citations

IEEE Signal Processing Magazine

There has been a rapid growth of digitally available music data, including audio recordings, digitized images of sheet music, album covers and liner notes, and video clips. This huge amount of data calls for retrieval strategies that allow users to explore large music collections in a convenient way. More precisely, there is a need for cross-modal retrieval algorithms that, given a query in one modality (e.g., a short audio excerpt), find corresponding information and entities in other modalities (e.g., the name of the piece and the sheet music). This goes beyond exact audio identification and subsequent retrieval of metainformation as performed by commercial applications like Shazam [1].
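
The symbolic fingerprints illustrated in Figure 4 are, roughly speaking, compact hashes computed from small groups of neighboring notes so that matching becomes robust to tempo differences. The sketch below illustrates the general idea with a toy fingerprint built from note triples, combining pitch classes with the tempo-invariant ratio of inter-onset intervals; the exact hash layout, quantization, and tolerances used in the literature differ, so this should be read as an assumed, simplified variant.

from collections import defaultdict

def note_triple_fingerprints(notes, max_gap=2.0):
    """Yield (hash, onset) pairs from consecutive note triples.

    notes: list of (onset_sec, midi_pitch), sorted by onset.
    The inter-onset-interval ratio makes the hash invariant to global tempo.
    """
    for i in range(len(notes) - 2):
        (t1, p1), (t2, p2), (t3, p3) = notes[i], notes[i + 1], notes[i + 2]
        ioi1, ioi2 = t2 - t1, t3 - t2
        if not (0 < ioi1 <= max_gap and 0 < ioi2 <= max_gap):
            continue
        ratio = round(ioi2 / ioi1, 1)                 # quantized, tempo-invariant
        yield (p1 % 12, p2 % 12, p3 % 12, ratio), t1

def build_index(database_pieces):
    """Map each fingerprint hash to the pieces (and offsets) it occurs in."""
    index = defaultdict(list)
    for piece_id, notes in database_pieces.items():
        for h, onset in note_triple_fingerprints(notes):
            index[h].append((piece_id, onset))
    return index

def query(index, query_notes):
    """Count matching fingerprints per piece; the best match wins."""
    votes = defaultdict(int)
    for h, _ in note_triple_fingerprints(query_notes):
        for piece_id, _ in index.get(h, []):
            votes[piece_id] += 1
    return max(votes, key=votes.get) if votes else None

# Toy usage with two hypothetical database pieces and a query excerpt.
db = {
    "piece_a": [(0.0, 60), (0.5, 62), (1.0, 64), (1.5, 65), (2.0, 67)],
    "piece_b": [(0.0, 55), (0.4, 57), (0.8, 59), (1.2, 60)],
}
query_excerpt = [(10.0, 62), (10.5, 64), (11.0, 65)]   # same pitch/IOI pattern as part of piece_a
print(query(build_index(db), query_excerpt))           # -> "piece_a"

In the cross-modal setting sketched in Figure 5, such fingerprints would be computed on notes obtained from an automatic piano transcription of the audio query and then matched against fingerprints of the symbolic database.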


Figure 1. Procedure for mapping feature values from individual solos onto the timeline using the recording years.
Figure 2. Complexity measure Γ based on the circle of fifths. Values for a sparse chroma vector (left), a flat chroma vector (middle), and a more realistic chroma vector (right) are shown. The red arrows denote the resultant vectors.
Figure 3. Complexity values for musical scales in several tempi, computed with different window lengths. (a) Diatonic scale. (b) Chromatic scale. (c) Complexity values for the diatonic scale. (d) Complexity values for the chromatic scale.
Figure 5. (a) Average number of solos per year contained in the dataset. Evolution curve and artist means based on (b) symbolic transcriptions and (c) harmonic component of audio recordings.
Figure 6. Evolution curve based on (a) symbolic transcription, (b) source-separated melody (score-informed), (c) harmonic part of audio (HPRS), (d) full audio mix.
Computational Corpus Analysis: A Case Study on Jazz Solos

September 2018 · 474 Reads · 6 Citations

For musicological studies on large corpora, the compilation of suitable data constitutes a time-consuming step. In particular, this is true for high-quality symbolic representations that are generated manually in a tedious process. A recent study on Western classical music has shown that musical phenomena such as the evolution of tonal complexity over history can also be analyzed on the basis of audio recordings. As our first contribution, we transfer this corpus analysis method to jazz music using the Weimar Jazz Database, which contains high-level symbolic transcriptions of jazz solos along with the audio recordings. Second, we investigate the influence of the input representation type on the corpus-level observations. In our experiments, all representation types led to qualitatively similar results. We conclude that audio recordings can form a reasonable basis for conducting this type of corpus analysis.
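
The complexity measure Γ from Figure 2 can be sketched as follows: the energies of a chroma vector are placed on the circle of fifths, and the length of the resultant vector indicates how concentrated the tonal content is. A common variant then maps a long resultant vector (sparse, key-centered chroma) to low complexity and a short one (flat chroma) to high complexity. The normalization Γ = sqrt(1 - r) used below is an assumption and may differ from the paper's exact definition.

import numpy as np

def fifths_complexity(chroma):
    """Tonal complexity from a 12-dim chroma vector via the circle of fifths.

    Each pitch class is placed at an angle on the circle of fifths and weighted
    by its normalized chroma energy; the length r of the resultant vector is
    large for sparse, key-centered chroma and small for flat chroma. One common
    variant of the complexity measure is then Gamma = sqrt(1 - r).
    """
    chroma = np.asarray(chroma, dtype=float)
    chroma = chroma / (chroma.sum() + 1e-12)
    # Reorder the chromatic pitch classes (C, C#, ..., B) along the circle of fifths.
    angles = 2 * np.pi * (np.arange(12) * 7 % 12) / 12
    resultant = np.sum(chroma * np.exp(1j * angles))
    return np.sqrt(1.0 - np.abs(resultant))

# Sparse (single pitch class) vs. flat chroma, cf. Figure 2:
print(fifths_complexity(np.eye(12)[0]))   # ~0.0 (low complexity)
print(fifths_complexity(np.ones(12)))     # ~1.0 (maximal complexity)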


Improving Bass Saliency Estimation using Label Propagation and Transfer Learning

June 2018 · 55 Reads · 2 Citations

In this paper, we consider two methods to improve a deep-neural-network-based algorithm for bass saliency estimation in jazz ensemble recordings. First, we apply label propagation to increase the amount of training data by transferring pitch labels from our labeled dataset to unlabeled audio recordings using a spectral similarity measure. Second, we study in several transfer learning experiments whether isolated note recordings can be beneficial for pre-training a model which is later fine-tuned on ensemble recordings. Our results indicate that both strategies can improve the performance of bass saliency estimation by up to five percent in accuracy.
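
The label propagation step can be pictured as copying pitch labels from labeled frames to spectrally similar unlabeled frames. The sketch below is a deliberately simple nearest-neighbor variant with a cosine-similarity threshold; the actual similarity measure, features, and thresholds used in the paper may differ.

import numpy as np

def propagate_labels(labeled_spectra, labels, unlabeled_spectra, threshold=0.9):
    """Assign pitch labels to unlabeled frames via the nearest labeled frame (cosine similarity).

    labeled_spectra:   (n_labeled, n_bins) magnitude spectra with known bass pitch labels
    labels:            (n_labeled,) pitch labels (e.g., MIDI numbers)
    unlabeled_spectra: (n_unlabeled, n_bins) spectra from unannotated recordings
    Returns labels for the unlabeled frames; frames below the threshold get -1 (discarded).
    """
    def normalize(X):
        return X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)

    similarity = normalize(unlabeled_spectra) @ normalize(labeled_spectra).T
    nearest = similarity.argmax(axis=1)
    best = similarity.max(axis=1)
    propagated = labels[nearest].copy()
    propagated[best < threshold] = -1      # too dissimilar: do not transfer a label
    return propagated

# Toy usage with random spectra; in practice these would be frames of jazz recordings.
rng = np.random.default_rng(0)
labeled = rng.random((100, 128))
unlabeled = rng.random((20, 128))
pitches = rng.integers(28, 60, size=100)   # hypothetical bass MIDI pitch range
print(propagate_labels(labeled, pitches, unlabeled, threshold=0.5))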


Figure 1: Example of a basic trackswitch.js instance with 3 tracks, a visualization of the waveform, and a seekhead indicating the current playback position. 
Figure 4: Application of trackswitch.js for sound source separation. Toggling the "solo" buttons switches the tracks or mixes them if multiple tracks are selected. 
Figure 5: Presenting results obtained from applying audio decomposition techniques to the "Amen Break". The player comprises 7 tracks with several visualizations including sheet music, spectral, and temporal representations of the decomposed sources. Furthermore, trackswitch.js's seekhead functionality is used to indicate the current playback position. 
Figure 6: Example player instance with 2 tracks and a spectrogram visualization. The fundamental frequency annotation in the second figure is only shown if the second track is selected.
trackswitch.js: A Versatile Web-Based Audio Player for Presenting Scientific Results

March 2018 · 1,492 Reads · 8 Citations

trackswitch.js is a versatile web-based audio player that enables researchers to conveniently present examples and results from scientific audio processing applications. Based on a multitrack architecture, trackswitch.js allows a listener to seamlessly switch between multiple audio tracks, while synchronously indicating the playback position within images associated with the audio tracks. These images may correspond to feature representations such as spectrograms or to visualizations of annotations such as structural boundaries or musical note information. The provided switching and playback functionalities are simple yet useful tools for analyzing, navigating, understanding, and evaluating results obtained from audio processing algorithms. Furthermore, trackswitch.js is an easily extendible and manageable software tool, designed for non-expert developers and inexperienced users. Offering a small but useful selection of options and buttons, trackswitch.js requires only basic knowledge to implement a versatile range of components for web-based audio demonstrators and user interfaces. Besides introducing the underlying techniques and the main functionalities of trackswitch.js we provide several use cases that indicate the flexibility and usability of our software for different audio-related research areas.


Citations (20)


... The Jazz musicians dataset is real and it is utilized in network analysis to explore the relationships and collaborations among jazz musicians. By examining this dataset, researchers can gain insights into the structure of the jazz community, identify influential musicians, and study patterns of musical collaboration within the genre [129]. ...

Reference:

Community detection in social networks using machine learning: a systematic mapping study
JSD: A Dataset for Structure Analysis in Jazz Music

Transactions of the International Society for Music Information Retrieval

... The selection of modalities for dataset construction is a key consideration in MIR research. For instance, [11] utilized audio and lyrics to predict musical moods, while [23] employed symbolic and audio data for MIR applications. These studies collectively underscore the overarching goal of integrating diverse modalities to enhance the analytical capabilities of MIR systems. ...

MTD: A Multimodal Dataset of Musical Themes for MIR Research

Transactions of the International Society for Music Information Retrieval

... Deep Learning has largely been absent from music alignment, with notable exceptions in real-time audio-image matching [17,18] and in symbolic score following [9]. On the other hand, we take inspiration from image processing, in particular from the task of local feature matching [19]: the matching of pixels encoding the same location on an object in two images of the same object. ...

Score Following as a Multi-Modal Reinforcement Learning Problem

Transactions of the International Society for Music Information Retrieval

... It is common to align different modalities to perform multi-modal applications. Efforts spanned aligning text and images [20,23,41,43], videos [2,3,47], audio [33,54], robotic states [19], 3D shapes [62], 3D scenes [34], 3D human poses [15], human motions [59,72], and so forth. Beyond empowering cross-modal retrieval, connecting modalities gives birth to powerful multi-informed versatile encodings. ...

Cross-Modal Music Retrieval and Applications: An Overview of Key Methodologies

IEEE Signal Processing Magazine

... are measuring. For example, the implementation of several complexity strategies for finding patterns in music (Lopes & Tenreiro, 2019), the detection of complexity patterns in the study of tonal evolution in large jazz corpora (Weiss et al., 2018), the analysis of biodiversity in musical scores (Angeler, 2020), quantitative visual representations with musical notation (Chase, 2006), or algorithms for measuring syntactic complexity in music (Holder, 2015) all use computational capacity to calculate data, but they do not take into account the human capacity for doing so. They presumably accept that a model relying on computational memory is compatible with the way neurons produce brain activity. ...

Computational Corpus Analysis: A Case Study on Jazz Solos

... Various model architectures, ranging from fully-connected neural networks (FCNN) [1, 14-16], over convolutional neural networks (CNN) [8, 17], to recurrent neural networks (RNN) [10, 14], are used and combined for the tasks of pitch estimation and voicing detection. Bittner et al. [11] propose a CNN model for multitask learning, which is trained to simultaneously perform melody, bass, and vocal transcription. ...

Improving Bass Saliency Estimation using Label Propagation and Transfer Learning
  • Citing Conference Paper
  • June 2018

... Such digital music collections comprise, for example, audio recordings, digitized images of sheet music, album covers and liner notes, and an increasing number of video clips. Facing these vast volumes of multimodal data makes it difficult and time consuming to manually manage such collections, and raises the need for support by intelligent systems (Balke, 2018). One important property of this kind of data is its multimodality, meaning that entities are represented by two or more of the mentioned modalities at the same time. ...

Multimedia Processing Techniques for Retrieving, Extracting, and Accessing Musical Content
  • Citing Thesis
  • January 2018

... In our approach, these notes are grouped with other notes to form a final note. System accuracy diminishes mainly due to the limited training dataset [15]. The initial stage of the transcription system follows the Ratio to Tonic measure [2], which finds the ratio of the different notes to the tonic [1]; the melody contour [15] is then plotted on the Cent scale [1] (note melody values in Cents) derived from the ratio-to-tonic values [1]. Significantly higher performance is achieved when evaluating the same transcription algorithm on a training dataset containing the individual fundamental frequency ranges of the notes. ...

Data-Driven Solo Voice Enhancement for Jazz Music Retrieval

... The effect can be easily demonstrated by listening to a single rectangular pulse with long enough pulse width so that one clearly hears two distinct clicks. Sound examples of DVN as well as filtered OVN are provided online¹, using the web audio player from [25]. The filtered OVN sequences were generated using linear prediction (LP) of order N = 10 on the DVN sequence. ...

trackswitch.js: A Versatile Web-Based Audio Player for Presenting Scientific Results