Visualisations of Jazz Standards Derived from Transformer-based Multi-task Learning

KONSTANTINOS VELENIS, Aristotle University, School of Music Studies, Greece
MAXIMOS KALIAKATSOS-PAPAKOSTAS, Hellenic Mediterranean University, Department of Music Technology and Acoustics, Greece
LEANDROS PASIAS, Aristotle University, School of Music Studies, Greece
EMILIOS CAMBOUROPOULOS, Aristotle University, School of Music Studies, Greece
This paper presents a method for creating multiple 2D visualisations of 1048 jazz standard chord charts in text format. The novel component of this method is that each visualisation re-arranges the available data, prioritising specific musical aspects, namely harmonic context, genre style, composition year, music form, composer, and tonality, allowing the exploration of this dataset according to user-defined criteria. The basic component of this model is a stack of transformer encoders that generate contextualised information from chart-text input. This information is subsequently fed to downstream feed-forward models that perform classification and regression based on manually annotated target labels and values of the aforementioned visualisation criteria. The training of our model was performed on a distinct dataset from the validation and testing sets, ensuring robustness and minimising the risk of overfitting. A model variation in which the transformer component jointly performs unmasking of the text input outperforms a variation without unmasking, when both models are trained on all 11 transpositions of the available pieces and tested on the original tonalities. The resulting visualisations reflect the categorisation and regression capabilities of the outperforming model, while revealing some interesting inter-cluster relations as well as some justifiable out-of-place pieces. A few hard-to-explain placements of specific pieces provide pointers for future work. Overall, such visualisations show the potential to empower jazz musicians to explore jazz standards under multiple perspectives.
CCS Concepts: • Human-centered computing → Visualization theory, concepts and paradigms; • Computing methodologies → Artificial intelligence.
Additional Key Words and Phrases: music visualisation, jazz standards, transformers, multi-task learning
ACM Reference Format:
Konstantinos Velenis, Maximos Kaliakatsos-Papakostas, Leandros Pasias, and Emilios Cambouropoulos. 2023. Visualisations of Jazz
Standards Derived from Transformer-based Multi-task Learning. In 27th Pan-Hellenic Conference on Progress in Computing and
Informatics (PCI 2023), November 24–26, 2023, Lamia, Greece. PCI, Lamia, Greece, 13 pages. https://doi.org/10.1145/3635059.3635072
1 INTRODUCTION
Facilitating the exploration of large datasets offers an interesting perspective towards music discovery and education. For instance, in jazz music education, musicians are required to develop a solid foundation in harmonic elements such as chord types, chord progressions, scales, and tonalities, as well as other structural components such as forms. Additionally, a musician may have studied the compositions of a specific composer or a genre and would like to explore similar
composers or genres, based on general harmonic and structural characteristics that cannot be clearly defined. To enhance the learning experience and expand the repertoire of Jazz students, it is crucial to provide them with tools that facilitate exploration and familiarization with pieces with a wide spectrum of characteristics [1]. The ever-increasing availability of music data offers opportunities to transform music education, making it more engaging, effective, and tailored to the individual needs and interests of students and teachers alike.
However, the abundance of data can also present challenges, particularly in identifying relevant content for users within large datasets. Traditional music recommendation systems, relying on community-based similarity metrics and content-based techniques [16], are primarily designed for music streaming platforms like Spotify or Pandora. These systems may not effectively address the needs of music education, especially in the context of Jazz standards and improvisation training. Unlike mainstream genres, Jazz standards demand a more nuanced understanding of harmonic contexts and personalized preferences.
This paper addresses the challenge of presenting a large music collection to users through visualisations that re-arrange data according to priorities in specific criteria given by the user. This is performed through a novel method that leverages the representation capabilities of transformer models for implicitly modelling harmonic similarity and generating 2-dimensional mappings that facilitate content-based exploration and retrieval of Jazz standards. The method employs symbolic data that describe the harmonic and structural information of jazz charts in a custom text format and include manual annotations of harmonic context, harmonic style, composition year, composer, genre and music form. Through a multi-dimensional transformation and projection into a visually appealing 2D space using t-SNE, the method positions data points based on their harmonic content in conjunction with their metadata similarities.¹
2 RELATED WORK
In the exploration of large musical datasets, various approaches have been employed, each offering unique insights into content-based exploration. Notable among these is the Every Noise at Once initiative by EchoNest², which utilizes a scatterplot to visually organize diverse musical genres found on Spotify into a two-dimensional map. This map represents one dimension as a spectrum between organic and mechanical/electric qualities, while the other dimension captures the dichotomy between dense/atmospheric and spiky/bouncy qualities. Another strategy involves visualizing databases on a two-dimensional space, taking into account assigned moods and genres [25].
Alternatively, a different approach suggests the use of a Self-Organizing Map constructed by a neural network, which follows perceived patterns of sound similarity [19]. In an innovative interface, music databases are arranged in three-dimensional virtual landscapes [13, 14]. These landscapes generate virtual islands dedicated to specific musical genres, where close distances indicate similarity between pieces belonging to the same style, while larger distances separate different styles. Multiple modes are offered, allowing users to listen to specific songs, explore descriptive words and explicit genre tags, or view related images. A recent variation of this visualization interface involves alternating virtual islands with city tower blocks [21].
The idea that is developed in the method presented in the paper at hand concerns the visualisation of information from data that are “compressed” by machine learning methods during training. Variational Autoencoders (VAE) [12] are a good candidate for such an approach, since they are trained in an unsupervised way and their role is to generate meaningful compressed representations of fewer dimensions than the original data, which additionally follow a normal
1https://github.com/maximoskp/chameleon_jazz.git
2https://everynoise.com
distribution. Such an approach, e.g., was followed in [27], where the latent space from a variation of a VAE model is employed for generating visualisations of rhythms.
Since Autoencoders are not ideal for representing sequential data, latent spaces from Long Short-Term Memory (LSTM) networks have been employed for creating visualisations. In these cases, the LSTM networks are trained autoregressively, i.e., for predicting the next time step at every step, and therefore there is no requirement for annotated data. In [22] the latent spaces of both VAE and LSTM networks are trained on audio data of traditional Chinese music for producing visualisations, among other tasks. Regarding symbolic music data, a method that was based on autoregressively trained LSTM networks was presented in [11] for generating visualisations.
In parallel, transformer architectures [26] have taken by storm several tasks that concern sequence processing. Beginning from natural language processing tasks [5, 18], transformer-based architectures have also been employed for music-related tasks. Such tasks include symbolic music generation based on user-defined conditions, e.g. on target emotion [9] or genre [20], structural segmentation [29] and motif variation [28]. The effectiveness of those models lies in their ability to capture contextual relations in large sequences: they transform an input sequence of multidimensional representations into another abstract multidimensional sequence that reflects context-based relations between components of the sequence. Regarding visualisation of transformer-based latent spaces, MuseNet³ learns to generate music autoregressively using a GPT-2 decoder with text prompts from a text encoder. The latent space of this network was employed for generating 2D representations based on t-SNE [24].
Transformer models learn rich representations in their latent space, capturing meaningful semantic information about the data. Transformer encoders have also been examined as means of generating latent representations not for autoregressive generation, but for feeding subsequent networks for multi-task classification [6]. This way, transformers can shape their latent representations to provide information that is useful under multiple perspectives, according to the number and the nature of the downstream tasks. The principle of this approach is the foundation of the paper at hand: a transformer encoder is trained as a basic transformation module for feeding other models for multiple downstream tasks. We expand further from this idea, by leveraging the internal abstract representations of each downstream task model to create visualisations, as explained in detail in Section 3.2.
3 METHOD
The aim of the method is to receive multiple strings that represent the harmonic information of a large set of jazz standards (a harmonic string describing the harmony of each jazz standard) and to generate numerical data that can be used to create different types of 2D visualisations that focus on different attributes. This section describes the form of the harmonic strings, the manually annotated dataset and the method that processes string information and generates the visualisations. The method employs a transformer [26] encoder for creating contextualised representations of masked versions of the harmonic strings and multiple feedforward models for decoding the transformer output to multiple supervised downstream learning tasks. In addition to the supervised tasks, a version of the model employs a BERT-like [5] unmasking learning process in the output of the transformer component for “catalysing” context-learning. The penultimate layer of each classifier is employed for generating multidimensional (64 dimensions) representations of data that are afterwards reduced to 2D representations with t-SNE [24].
3https://openai.com/research/musenet
3.1 Data representation and description
The dataset comprises 1048 jazz standards, which are represented in string format that includes information about
style, section, tonality, tempo, time signature per bar and chords (root and type) and beat position of chords. Style,
section and time signature information might change within a piece string whereas tempo and tonality are constant.
An example of a string is given as follows:
style~Open Swing, tempo~160, tonality~C, bar~4/4, chord~CΔ7@0.0, bar~4/4, chord~Bø7@0.0, chord~E7b9@2.0, bar~4/4, chord~Am7@0.0, chord~D7@2.0, bar~4/4, chord~Gm7@0.0, chord~C7@2.0, bar~4/4, chord~F7@0.0, bar~4/4, chord~Fm7@0.0, chord~B♭7@2.0, bar~4/4, chord~Em7@0.0, chord~A7@2.0, bar~4/4, chord~Em7@0.0, chord~A7@2.0, bar~4/4, chord~Dm7@0.0, bar~4/4, chord~G7@0.0, bar~4/4, chord~CΔ7@0.0, chord~A7b9@2.0, bar~4/4, chord~Dm7@0.0, chord~G7@2.0, end
The text representation for all available jazz standard charts is transposed to all 11 remaining tonalities. This helps not only to populate the dataset with additional entries (12576 in total), but also to integrate the harmonic context introduced by each jazz standard to all tonalities. The harmonic strings are split into meaningful blocks of information that are isolated in separate tokens for describing tonality (24 tokens, 12 for major and 12 for minor), style, tempo, time signature, chord root and type (among 60 available types) and chord temporal position (in quarter notes). This process leads to a vocabulary of 2649 unique tokens⁴; the maximum length of all tokenised strings is a sequence of 324 tokens. Incorporating all transpositions ensures that all 24 tonalities and all root transpositions of each encountered chord type are accounted for.
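As an illustration of this augmentation step, the following Python sketch transposes the chord roots and the tonality annotation of a harmonic string by a given number of semitones; the helper names and the simplified root handling are illustrative assumptions and not the implementation used for the paper, which operates on the full custom format.

PITCHES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']
FLAT_TO_SHARP = {'Db': 'C#', 'Eb': 'D#', 'Gb': 'F#', 'Ab': 'G#', 'Bb': 'A#'}

def split_root(symbol):
    """Split a chord or tonality symbol (e.g. 'Bb7@2.0', 'Am') into root and remainder."""
    root = symbol[:2] if len(symbol) > 1 and symbol[1] in 'b#' else symbol[:1]
    rest = symbol[len(root):]
    return FLAT_TO_SHARP.get(root, root), rest

def transpose_symbol(symbol, semitones):
    root, rest = split_root(symbol)
    return PITCHES[(PITCHES.index(root) + semitones) % 12] + rest

def transpose_string(harmonic_string, semitones):
    """Transpose every chord~ and tonality~ entry of a harmonic string."""
    out = []
    for token in harmonic_string.split(', '):
        if '~' in token:
            key, value = token.split('~', 1)
            if key in ('chord', 'tonality'):
                value = transpose_symbol(value, semitones)
            token = key + '~' + value
        out.append(token)
    return ', '.join(out)

# All 11 remaining transpositions of a short example string:
example = 'style~Open Swing, tempo~160, tonality~C, bar~4/4, chord~Dm7@0.0, chord~G7@2.0, end'
transpositions = [transpose_string(example, k) for k in range(1, 12)]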
The dataset of 1048 jazz standard songs underwent a comprehensive annotation process by one of the authors, which
added the following information to each piece: Form, Harmony, Year of Composition, Tonality, Composer, and Genre.
Year of composition and composer are retrieved from the real books and online resources. Only the starting tonality
is annotated, which is also included in the song string. The remaining categories were annotated according to the
following labels:
Harmonic context. [2, 17]
Tonal: Tonality features Functional Harmony with a Tonal Center. Each chord serves a specific function in the progression.
Modal: In the context of jazz, modal compositions can be written in any mode. In such compositions chords do not have a functional role and there is no need for chords to resolve to a tonic. Chords stand independently as standalone entities.
Modern: Similar to Modal harmony, but without necessarily belonging to a mode. It breaks traditional rules and may or may not have a tonal center. The focus is on experimenting with different chord colors without adhering to functional patterns.
Mixed: Compositions that feature a combination of harmonic contexts, such as Tonal/Modern, Tonal/Modal and Modal/Modern.
⁴Three special and necessary tokens are included in the vocabulary: [PAD] for trail-padding sequences with sizes smaller than the maximum size (324); [UNK] for letting the tokeniser represent tokens that are not accounted for (practically unused in the presented application, but typically necessary for the employed tokeniser implementation); and [MASK], which will be useful for the BERT-like unmasking training process.
Genre style. [4, 7, 8]
Swing: Encompassing all jazz styles before the emergence of Bebop, such as Early Jazz, New Orleans jazz, Dixieland, Hollywood, Broadway and more.
Latin Jazz: Involving compositions by Latin American composers and pieces that incorporate Latin musical elements.
Ballad: Encompassing all types of ballads, as long as they exhibit a ballad-like character.
Modern: Referring to jazz standards composed roughly from the 1980s onwards.
Fusion: Describing songs that blend jazz with elements of rock or funk.
Avant-Garde: Including harmonically and melodically sophisticated pieces, some of which incorporate free jazz elements.
Form. [3, 15]
Blues: The most significant form of blues compositions, the 12-bar form.
16b: Compositions written in a 16-bar form.
24b: Compositions written in a 24-bar form.
32b: Compositions written in a 32-bar form with thematic variations presented in the plot as AB, AABA, AABC, ABAC, ABCD and Rhythm Changes. In some instances, a few additional measures were justified within the form to fit into a specific form group.
-: Compositions that don't belong in any of the previous forms.
3.2 Neural Network Architecture and visualisation
Fig. 1. Overview of the model.
An overview of the unmasking variation of the model is illustrated in Figure 1. The input to this version of the model is masked (25%) tokenised sequences with a size of 324 tokens (context length), which is the maximum tokenised sequence length in the dataset; smaller sequences are padded with the [PAD] token. The transformer encoder comprises 8 layers of transformer encoder blocks with 8 attention heads each. Embeddings of size 256 are employed and the internal dimensionality of each attention head is 256. The output of the transformer encoder is a sequence of 324 × 256-dimensional embeddings, which is split into two parts: one part trains the system to unmask the masked tokens and the other performs 1-dimensional global average pooling on the transformer output, creating a 256-dimensional average of the 256 × 324 transformer output, which is propagated to the “predictor” parts of the network for the regression and classification tasks. Year prediction for each jazz standard is the only regression task, while all the others are classification tasks, namely: composer (445 classes), harmonic style (7 classes), form (11 classes), tonality (24 classes), genre (18 classes). The variation of the model that does not incorporate masking receives the entire text of the chart without masking, and the output of the transformer is not directed to the unmasking task.
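The shared encoder described above can be sketched as follows in PyTorch, assuming the reported dimensions (vocabulary of 2649 tokens, context length 324, embedding size 256, 8 layers, 8 heads); this is an illustrative reconstruction rather than the authors' implementation, and details such as the positional encoding and the exact unmasking head are assumptions.

import torch
import torch.nn as nn

VOCAB_SIZE, CONTEXT_LEN, D_MODEL, N_LAYERS, N_HEADS = 2649, 324, 256, 8, 8

class SharedEncoder(nn.Module):
    """Transformer encoder shared by the unmasking head and all downstream predictors."""
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB_SIZE, D_MODEL)
        self.pos_emb = nn.Embedding(CONTEXT_LEN, D_MODEL)  # learned positions (assumption)
        layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=N_HEADS,
                                           dim_feedforward=4 * D_MODEL, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=N_LAYERS)
        self.unmask_head = nn.Linear(D_MODEL, VOCAB_SIZE)   # token logits for the unmasking task

    def forward(self, token_ids, pad_mask=None):
        # token_ids: (batch, 324) token indices, 25% of which are [MASK] during training
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        h = self.encoder(self.tok_emb(token_ids) + self.pos_emb(positions),
                         src_key_padding_mask=pad_mask)      # (batch, 324, 256)
        unmask_logits = self.unmask_head(h)                  # (batch, 324, 2649)
        pooled = h.mean(dim=1)                               # global average pooling -> (batch, 256)
        return unmask_logits, pooled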
Fig. 2. Overview of the predictor module in a downstream task.

Each predictor (Figure 2) receives a 256-dimensional array coming from the global average pooling of the transformer output and passes this information through a series of dense layers before making a prediction that corresponds to a specific task. As shown in Figure 2, the penultimate layer of the series of dense layers in each predictor is important and its output is preserved for producing the visualisations; we call this layer the “descriptor” layer, since it is employed to describe what the system has learned regarding the attribute of the predictor. The idea is to use the penultimate layer (descriptor) of each predictor to get the “last mile” latent information that the system has learned to describe before actually performing regression or classification. The last layer of the predictor produces probability vectors of classes (for the classification tasks) or a logistic regression output (for the regression task); those predictions are based on the information that has been shaped by the transformer and the subsequent cascade of dense layer transformations within each predictor, up to the penultimate layer, which we claim holds useful “disentangled” information that can be used to produce meaningful visualisations of data.
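A per-task predictor with its 64-dimensional descriptor layer could then be sketched as follows; the number and width of the dense layers before the descriptor are assumptions (only the 64-dimensional penultimate layer is stated above), while the class counts are those reported for each task.

import torch
import torch.nn as nn

D_MODEL = 256  # size of the pooled encoder output, as in the encoder sketch above

class Predictor(nn.Module):
    """Dense head for one downstream task; the penultimate layer is the 'descriptor'."""
    def __init__(self, n_outputs, hidden=128, descriptor_dim=64, regression=False):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(D_MODEL, hidden), nn.ReLU(),
                                  nn.Linear(hidden, descriptor_dim), nn.ReLU())
        self.out = nn.Linear(descriptor_dim, n_outputs)
        self.regression = regression

    def forward(self, pooled):
        descriptor = self.body(pooled)        # (batch, 64), kept for the visualisations
        out = self.out(descriptor)            # class logits, or a single unit for the year task
        if self.regression:
            out = torch.sigmoid(out)          # "logistic" output for the (scaled) year value
        return out, descriptor

# One head per annotated attribute, with the class counts given above:
heads = nn.ModuleDict({
    'composer': Predictor(445), 'harmony': Predictor(7), 'form': Predictor(11),
    'tonality': Predictor(24), 'genre': Predictor(18), 'year': Predictor(1, regression=True),
})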
The descriptor (penultimate) layer for all tasks is a 64-dimensional representation. Each of these representations is converted to 2D using t-SNE [24], providing the data for visualisation according to each criterion. All visualisations are rotated for optimal alignment, so as to have minimal average displacement of all points/pieces when switching from one visualisation to the other. This is performed by applying only the rotation component of the Kabsch-Umeyama algorithm [10, 23]. Specifically, all visualisations are rotated to align with the “year of composition” visualisation, which is conveniently laid out in a left-older / right-newer arrangement of pieces (see the result in Fig. 5(a) in Section 4.2).
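A sketch of this projection and alignment step is given below, using scikit-learn's t-SNE and a plain NumPy Kabsch rotation; the variable names are illustrative and the t-SNE settings shown are library defaults rather than the configuration used for the figures.

import numpy as np
from sklearn.manifold import TSNE

def project_2d(descriptors, seed=0):
    """Reduce the 64-dimensional descriptor outputs of one task to 2D with t-SNE."""
    return TSNE(n_components=2, random_state=seed).fit_transform(descriptors)

def rotate_to_align(source, target):
    """Rotate `source` about its centroid so that it best matches `target` (Kabsch, rotation only)."""
    src = source - source.mean(axis=0)
    tgt = target - target.mean(axis=0)
    u, _, vt = np.linalg.svd(src.T @ tgt)        # cross-covariance of the two centred point sets
    d = np.sign(np.linalg.det(u @ vt))           # guard against reflections
    r = u @ np.diag([1.0, d]) @ vt               # optimal 2x2 rotation matrix
    return src @ r + source.mean(axis=0)

# Example: align each task's map to the "year of composition" map (points in the same order).
# year_2d = project_2d(year_descriptors)
# genre_2d_aligned = rotate_to_align(project_2d(genre_descriptors), year_2d)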
During training of the unmasking variation, the model learns jointly to optimise the output of all tasks, namely the unmasking task and the regression/classification downstream tasks. This forces the transformer encoder part of the model to produce contextualised output that is useful for multiple viewpoints. Some of those viewpoints are orthogonal, i.e., one might be unrelated to the other (e.g., the tonality of a piece is not necessarily related to its genre), a fact that forces the transformer part of the network to use all attention heads in all encoder layers wisely, to end up with a parsimonious, yet useful representation of all this information in the final concatenation and global average pooling part that prepares data for the predictors.
The aim of jointly learning the unmasking task in parallel with the predictions is twofold: first, to force the entire network (including the transformer and the predictors) to learn even with missing information (and prevent the model from over-fitting); and second, to force the transformer part to maintain contextual information that is generally meaningful and does not skew towards quick convergence to the tasks that might be easier than others. The intuition behind the second aim is based on the assumption that some tasks might be easier than others, e.g., predicting among 7 harmonic styles might be easier for the network than predicting among 445 composers. In this specific example, this holds not only because of the number of classes but also because, in harmonic styles, the vocabulary of chords used in a piece might be clearly indicative of its harmonic style, while different composers might use the same chord vocabulary and actually have compositions belonging to the same harmonic style. This fact will expectedly produce larger (in an absolute sense) gradients in the harmonic style predictor, which will back-propagate to the transformer, forcing the transformer to focus more on improving the performance on this attribute over the attribute of composer classification. This would force the transformer to over-fit early in terms of harmonic style and ignore the composer classification task, ending up in weight configurations that are only close to local minima in terms of the ignored attribute. The unmasking task is expected to smooth such transformer biases and force it to produce contextualised information that is useful in its own right, while also improving the outcome of all the prediction tasks.
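In practice, such joint training amounts to summing the unmasking loss with the per-task losses; the sketch below shows one plausible formulation (an unweighted sum, with cross-entropy restricted to the masked positions), which is an assumption rather than the configuration reported here.

import torch
import torch.nn.functional as F

def joint_loss(unmask_logits, token_targets, mask_positions, task_outputs, task_targets):
    """Unmasking loss on masked positions plus all classification/regression losses."""
    # unmask_logits: (batch, 324, vocab); token_targets: (batch, 324); mask_positions: bool (batch, 324)
    targets = token_targets.masked_fill(~mask_positions, -100)       # ignore unmasked positions
    loss = F.cross_entropy(unmask_logits.transpose(1, 2), targets, ignore_index=-100)
    for name, output in task_outputs.items():
        if name == 'year':                                            # regression on the scaled year
            loss = loss + F.mse_loss(output.squeeze(-1), task_targets[name].float())
        else:                                                         # classification tasks
            loss = loss + F.cross_entropy(output, task_targets[name])
    return loss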
4 RESULTS
Two model variations are compared in terms of convergence, with the one that includes unmasking proving to be more effective than the one that does not. Afterwards, the visualisations generated using the unmasking version are examined through an online interactive interface by a jazz expert.
4.1 Convergence and classification/regression analysis
Both versions of the model (with and without unmasking) converge apparently similarly to low error and high accuracy values, as shown in Fig. 3. This figure shows the error values relative to the maximum error value in each task; therefore, each curve has a maximum value of 1. Some curves in both graphs in this figure present some spikes of error increase, which are permitted by the employed variation of gradient descent (the Adam method). Note that the image in Fig. 3(a) includes an additional curve that describes the unmasking error, which follows an interesting convergence pattern: after an initial quick error drop (first few epochs), it follows a slow and almost linear drop of around 10% and then, around epoch 1000, a “typical” exponential-like convergence. This behaviour of the unmasking error curve, however, does not influence the convergence of the other curves in comparison to the no-unmasking case, i.e., there does not appear to be a similar “convergence” boost in any other curve after epoch 1000. This was verified by examining the behaviour of all curves with moving averages in a range between 30 and 100 (results not shown).

(a) Unmasking (b) No unmasking
Fig. 3. Convergence (normalized error) when training with and without unmasking.
The unmasking method appears to be superior in terms of generalisation capabilities, meaning that it helps to reduce over-fitting. Classification and regression errors on the test set are shown in Table 1, where the method with no unmasking misclassifies more pieces in all tasks except the tonality task, where both methods produce no error. The fact that both methods perform accurately in tonality classification has to do with the fact that tonality is explicitly included in the harmonic string that describes each piece (e.g., tonality~Am indicates an A minor tonality). Therefore, the transformer component of the model needs to learn that the important part of the string for identifying tonality is the first one or two characters (in the case of major and minor tonalities, respectively) that come after the “tonality~” prefix. Apparently, the transformer component is able to successfully extract this information and pass it to the tonality classifier.
Table 1. Misclassified songs for each method variation

Task          Unmasking    No unmasking
Misclassified songs
harmony           0              3
form              0              4
tonality          0              0
composer         11             28
genre             6             15
Regression loss
year loss     5.61e-5        6.58e-5
As the masking error exhibits a steeper decrease, errors in the other downstream tasks also experience a more pronounced reduction, as shown in Fig. 4. Consequently, the impact of unmasking seems to expedite error reduction on the validation set, providing additional evidence that this approach mitigates the risk of overfitting. However, it is important to note that this assertion warrants further investigation through focused studies.
(a) Composer training/validation losses (b) Year training/validation losses
Fig. 4. Composer and Year training/validation losses.
Form information is also incorporated in some cases within the harmonic string that describes each piece. If the form of a piece is, e.g., AABA, then this means that the piece transitions to “sections”, i.e., groups of bars, that are annotated with A, A, B and A respectively. In the harmonic string, the bar that incorporates the section change includes a corresponding annotation, i.e., the bar includes a “sectionA” substring that indicates, e.g., that section A begins. Therefore, in these cases, the transformer component needs to identify the “section” prefix, focus on the next character each time and keep track of how those characters succeed one another. Both model versions learn to perform this almost perfectly, with a slight superiority for the unmasking variation, which presents no errors (with 4 errors for the other version). It has to be noted that other annotated form types are based on other factors, e.g., the number of bars, and not the succession of section parts; some such pieces do not even include the “section” prefix at all within their harmonic string. In those cases, again, both models correctly learn to count the occurrences of “bar”.
The remaining classification tasks, as well as the regression task, incorporate more subtle characteristics that the models need to learn. In all those cases, again, the unmasking variation outperforms the non-unmasking variation. Given that both model variations are trained only on the transpositions of the original pieces, where only the roots of chords (and the tonality annotation) are changed, and that they perform so well on the original pieces, it appears that the models focus on the chord types and their transitions. It would be useful to examine whether and to what extent root relations are implicitly learned, e.g., a chord with root G going to a chord with root C is the same relation as going from D to G. This examination, however, requires focused experimental processes that are left for future work.
4.2 Qualitative analysis
The visualisation results obtained by the unmasking variation of the model for each downstream task are shown in Figure 5. As expected, given the high classification results, the model indeed presents the songs in groups “correctly”, i.e., according to the annotated labels, which are employed for coloring each song/point on the graph. A publicly accessible web-page⁵ has been developed based on these graphs that allows users to explore the visualisation of each downstream task in an interactive way; this visualisation interface enables users to acquire extra information about each piece upon mouse hover. Additionally, zooming in/out capabilities are offered, as well as the option to highlight
5https://maxim.mus.auth.gr:5001
(a) Year (warmer color, newer piece) (b) Tonality
(c) Form (d) Genre
(e) Composer (f) Harmony
Fig. 5. Results of visualisations obtained from the predictor of each downstream task. Colors correspond to the annotated labels.
a specific piece selected through a search text-box. The web visualisation interface was employed for the qualitative analysis of the results by one of the authors who is an expert in jazz musicology.
Some interesting findings concern inter-cluster relations. When the representation of the model is based on Form Style, compositions following the Rhythm Changes, a distinctive harmonic variation of the AABA form, were accurately
clustered adjacent to traditional AABA songs. Likewise, compositions following the 12-bar Blues Form were positioned near the 16-bar compositions due to their close resemblance in terms of form style, highlighting the model's sensitivity to temporal characteristics.
Regarding specific songs that appear to be “out of place”, the model effectively grouped the blues song “Serenade to a Soul Sister” alongside other 24-bar compositions. Blues songs written in a 3/4 time signature, such as the one mentioned above, have their 12-bar structure doubled into 24 bars. This result confirmed the model's ability to understand complex harmonic frameworks and appropriately organize compositions with similar structures. However, during the analysis, specific clustering errors were identified, notably in the placement of the blues song “All Blues”: despite sharing a matching 24-bar structure with “Serenade to a Soul Sister”, the model positioned it in a different region, seemingly prioritizing other musical elements. Such discrepancies indicate the need for further refinement in the model's decision-making process to improve its accuracy.
Regarding the analysis of the tonalities graph, minor tonalities were effectively grouped together, but they appeared significantly distant from their respective major tonalities. This suggests that the string representation, incorporating chord symbols, does not lead to a musically convenient circle-of-fifths pattern but instead emphasizes the convenient separation of tonality mode annotation (major/minor).
These observations underscore the model's proficiency in capturing musical nuances and its potential for meaningful visualizations. The web interface provides a user-friendly platform for exploring and interpreting the results. However, the identified discrepancies and clustering errors indicate areas for refinement in the model's decision-making process. Overall, the qualitative analysis reinforces the model's ability to organize and present jazz standard compositions based on various annotated perspectives, laying the foundation for further refinement and exploration.
5 CONCLUSIONS
Visualising large datasets of music under multiple perspectives, according to the needs of the user, facilitates the discovery of useful material based on user-selected criteria. This paper presents a method for generating multiple visualisations of a jazz standard dataset under several user-defined music-related perspectives. The examined dataset includes strings of harmonic and structural descriptions of 1048 jazz standard charts, each accompanied by a set of manual annotations for each perspective. A model is proposed that receives a string as input and learns to classify and perform regression based on the annotations. The basis of the model is a stack of transformer encoders, which prepares a context-aware output for multiple feed-forward downstream models that perform classification and regression for each annotated aspect. The output of the penultimate layer of each downstream task model is employed as a multidimensional latent representation that is further compressed to 2D for visualisation.
Two variations of the model are examined: one learns only the downstream tasks, while in the other, the transformer component learns to unmask a masked version of the input jointly with the downstream tasks. Training is performed on the 11 transpositions of each string to all remaining keys (with corresponding changes in the annotations) and testing is performed on the original strings of the dataset. Results show that both versions learn each downstream task accurately, with the unmasking variation slightly outperforming the other in most tasks. The resulting visualisations show that the model clusters the data properly for each downstream task. The observed acceleration in error reduction due to unmasking during validation underscores its role in effectively mitigating overfitting concerns. After examination by an expert on jazz standards, it was shown that in some cases inter-cluster relations make musical sense and that the placement of some out-of-place pieces, i.e., pieces that are visualised in the area of a different cluster than the one they were annotated with, is musically justified.
However, it's essential to acknowledge certain limitations in our research. While the proposed model demonstrates proficiency in accurately learning and visualizing downstream tasks, there is room for further exploration. Future empirical experiments could delve into the model's ability to identify relations between chord roots within progressions. Investigating whether the model's visual representations align with perceptual distances perceived by participants would contribute valuable insights.
Moreover, the potential for creating visualizations based on compound criteria, such as both "form" and "genre," remains an intriguing avenue for future research. This would necessitate a deeper exploration of the latent representations within each downstream module, paving the way for aligning these spaces with those of pre-trained language models. Such alignment could enable the construction of custom visualizations based on free-text queries, providing users with more flexibility in exploring and understanding the rich landscape of jazz standards.
In conclusion, while our research demonstrates promising results in music data visualization, there are exciting opportunities for further refinement and expansion. By addressing these limitations and exploring new directions, we can advance the field of music education and discovery, offering users enhanced tools for exploring the intricate world of jazz standards. This work underscores the importance of pushing the boundaries of music data visualization to empower users with more meaningful and personalized insights, ultimately contributing to the broader landscape of music research and education.
ACKNOWLEDGMENTS
This research has been co-financed by the European Regional Development Fund of the European Union and Greek national funds through the Operational Program Competitiveness, Entrepreneurship and Innovation, under the call RESEARCH CREATE INNOVATE. Project Acronym: MusiCoLab, Project Code: T2EDK-00353.
REFERENCES
[1] Charles W. Beale. 2001. From jazz to jazz in education: An investigation of tensions between player and educator definitions of jazz. Doctoral thesis. Institute of Education, University of London.
[2] D. Berkman and Sher Music Co. 2013. The Jazz Harmony Book: A Course in Adding Chords to Melodies. Sher Music Company.
[3] M. Cooke and D. Horn. 2003. The Cambridge Companion to Jazz. Cambridge University Press.
[4] J.S. Davis. 2012. Historical Dictionary of Jazz. Scarecrow Press.
[5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[6] Timothy Greer, Xuan Shi, Benjamin Ma, and Shrikanth Narayanan. 2023. Creating musical features using multi-faceted, multi-task encoders based on transformers. Scientific Reports 13, 1 (2023), 10713.
[7] M.C. Gridley. 2011. Jazz Styles. Pearson Education.
[8] T. Holmes and W. Duckworth. 2006. American Popular Music: Jazz. Checkmark.
[9] Shulei Ji and Xinyu Yang. 2023. EmoMusicTV: Emotion-conditioned Symbolic Music Generation with Hierarchical Transformer VAE. IEEE Transactions on Multimedia (2023).
[10] Wolfgang Kabsch. 1976. A solution for the best rotation to relate two sets of vectors. Acta Crystallographica Section A: Crystal Physics, Diffraction, Theoretical and General Crystallography 32, 5 (1976), 922–923.
[11] Maximos Kaliakatsos-Papakostas, Konstantinos Velenis, Konstantinos Giannos, and Emilios Cambouropoulos. 2022. Exploring Jazz Standards with Web Visualisation for Improvisation Training. In Web Audio Conference.
[12] Diederik P. Kingma and Max Welling. 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114 (2013).
[13] Peter Knees, Markus Schedl, Tim Pohle, and Gerhard Widmer. 2006. An innovative three-dimensional user interface for exploring music collections enriched. In Proceedings of the 14th ACM International Conference on Multimedia. 17–24.
[14] Peter Knees, Markus Schedl, Tim Pohle, and Gerhard Widmer. 2007. Exploring music collections in virtual landscapes. IEEE MultiMedia 14, 3 (2007), 46–54.
[15] R.J. Lawn. 2013. Experiencing Jazz. Taylor & Francis.
[16] Dave Liebman, Phil Markowitz, Vic Juris, and Bob Reich. 1991. A Chromatic Approach to Jazz Harmony and Melody. Advance Music, Rottenburg am Neckar.
[17] R. Miller. 1996. Modal Jazz Composition & Harmony, Vol. 1. Advance Music.
[18] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. 2018. Improving language understanding by generative pre-training. (2018).
[19] Andreas Rauber, Elias Pampalk, and Dieter Merkl. 2003. The SOM-enhanced JukeBox: Organization and visualization of music collections based on perceptual models. Journal of New Music Research 32, 2 (2003), 193–210.
[20] Pedro Sarmento, Adarsh Kumar, Yu-Hua Chen, CJ Carr, Zack Zukowski, and Mathieu Barthet. 2023. GTR-CTRL: Instrument and Genre Conditioning for Guitar-Focused Music Generation with Transformers. In International Conference on Computational Intelligence in Music, Sound, Art and Design (Part of EvoStar). Springer, 260–275.
[21] Markus Schedl, Michael Mayr, and Peter Knees. 2020. Music Tower Blocks: Multi-Faceted Exploration Interface for Web-Scale Music Access. In Proceedings of the 2020 International Conference on Multimedia Retrieval. 388–392.
[22] Jingyi Shen, Runqi Wang, and Han-Wei Shen. 2020. Visual exploration of latent space for traditional Chinese music. Visual Informatics 4, 2 (2020), 99–108. https://doi.org/10.1016/j.visinf.2020.04.003 PacificVis 2020 Workshop on Visualization Meets AI.
[23] Shinji Umeyama. 1991. Least-squares estimation of transformation parameters between two point patterns. IEEE Transactions on Pattern Analysis & Machine Intelligence 13, 04 (1991), 376–380.
[24] Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9, 11 (2008).
[25] Rob van Gulik, Fabio Vignoli, and Huub van de Wetering. 2004. Mapping music in the palm of your hand, explore and discover your collection. In Proceedings of the 5th International Conference on Music Information Retrieval. Queen Mary, University of London, London.
[26] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).
[27] Gabriel Vigliensoni, Louis McCallum, Esteban Maestre, Rebecca Fiebrink, et al. 2022. R-VAE: Live latent space drum rhythm generation from minimal-size datasets. Journal of Creative Music Systems 1, 1 (2022).
[28] Heng Wang, Sen Hao, Cong Zhang, Xiaohu Wang, and Yilin Chen. 2023. Motif Transformer: Generating Music With Motifs. IEEE Access (2023).
[29] Guowei Wu, Shipei Liu, and Xiaoya Fan. 2023. The Power of Fragmentation: A Hierarchical Transformer Model for Structural Segmentation in Symbolic Music Generation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 31 (2023), 1409–1420.