Visualisations of Jazz Standards Derived from Transformer-based Multi-task
Learning
KONSTANTINOS VELENIS, Aristotle University, School of Music Studies, Greece
MAXIMOS KALIAKATSOS-PAPAKOSTAS, Hellenic Mediterranean University, Department of Music Technology and Acoustics, Greece
LEANDROS PASIAS, Aristotle University, School of Music Studies, Greece
EMILIOS CAMBOUROPOULOS, Aristotle University, School of Music Studies, Greece
This paper presents a method for creating multiple 2D visualizations of 1048 jazz standard chord charts in text format. The novel
component of this method is that each visualization re-arranges the data available, prioritizing specific musical aspects, namely
harmonic context, genre style, composition year, music form, composer, and tonality, allowing the exploration of this dataset according
to user-defined criteria. The basic component of this model is a stack of transformer encoders that generate contextualized information
from chart-text input. This information is subsequently fed to downstream feed-forward models that perform classification and
regression based on manually annotated target labels and values of the aforementioned visualization criteria. The training of our
model was performed on a distinct dataset from the validation and testing sets, ensuring robustness and minimizing the risk of
overfitting. A model variation where the transformer component jointly performs unmasking of the text input outperforms a variation
without unmasking, when both models are trained on all 11 transpositions of the available pieces and tested on the original tonalities.
The resulting visualisations reflect the categorisation and regression capabilities of the outperforming model, while revealing some
interesting inter-cluster relations as well as some justifiable out-of-place pieces. A few hard-to-explain placements of specific pieces
provide pointers for future work. Overall, such visualisations show the potential to empower jazz musicians to explore jazz standards
under multiple perspectives.
CCS Concepts: • Human-centered computing → Visualization theory, concepts and paradigms; • Computing methodologies → Artificial intelligence.
Additional Key Words and Phrases: music visualisation, jazz standards, transformers, multi-task learning
ACM Reference Format:
Konstantinos Velenis, Maximos Kaliakatsos-Papakostas, Leandros Pasias, and Emilios Cambouropoulos. 2023. Visualisations of Jazz
Standards Derived from Transformer-based Multi-task Learning. In 27th Pan-Hellenic Conference on Progress in Computing and
Informatics (PCI 2023), November 24–26, 2023, Lamia, Greece. PCI, Lamia, Greece, 13 pages. https://doi.org/10.1145/3635059.3635072
1 INTRODUCTION
Facilitating the exploration of large datasets offers an interesting perspective towards music discovery and education. For
instance, in jazz music education, musicians are required to develop a solid foundation in harmonic elements such as
chord types, chord progressions, scales, and tonalities, as well as other structural components such as forms. Additionally,
a musician may have studied the compositions of a specific composer or a genre and would like to explore similar
composers or genres, based on general harmonic and structural characteristics that cannot be clearly defined. To
enhance the learning experience and expand the repertoire of jazz students, it is crucial to provide them with tools that
facilitate exploration and familiarization with pieces of a wide spectrum of characteristics [1]. The ever-increasing
availability of music data offers opportunities to transform music education, making it more engaging, effective, and
tailored to the individual needs and interests of students and teachers alike.
However, the abundance of data can also present challenges, particularly in identifying relevant content for users
within large datasets. Traditional music recommendation systems, relying on community-based similarity metrics and
content-based techniques [16], are primarily designed for music streaming platforms like Spotify or Pandora. These
systems may not effectively address the needs of music education, especially in the context of jazz standards and
improvisation training. Unlike mainstream genres, jazz standards demand a more nuanced understanding of harmonic
contexts and personalized preferences.
This paper addresses the challenge of presenting a large music collection to users through visualisations that re-arrange
data according to priorities in specific criteria given by the user. This is performed through a novel method
that leverages the representation capabilities of transformer models for implicitly modeling harmonic similarity and
generating 2-dimensional mappings that facilitate content-based exploration and retrieval of jazz standards. The method
employs symbolic data that describe the harmonic and structural information of jazz charts in a custom text format
and include manual annotations of harmonic context, harmonic style, composition year, composer, genre and music
form. Through a multi-dimensional transformation and projection into a visually appealing 2D space using t-SNE, the
method positions data points based on their harmonic content in conjunction with their metadata similarities
(https://github.com/maximoskp/chameleon_jazz.git).
2 RELATED WORK
In the exploration of large musical datasets, various approaches have been employed, each offering unique insights into
content-based exploration. Notable among these is the Every Noise at Once initiative by EchoNest (https://everynoise.com),
which utilizes a scatterplot to visually organize diverse musical genres found on Spotify into a two-dimensional map. This map
represents one dimension as a spectrum between organic and mechanical/electric qualities, while the other dimension
captures the dichotomy between dense/atmospheric and spiky/bouncy qualities. Another strategy involves visualizing
databases on a two-dimensional space, taking into account assigned moods and genres [25].
Alternatively, a different approach suggests the use of a Self-Organizing Map constructed by a neural network,
which follows perceived patterns of sound similarity [19]. In an innovative interface, music databases are arranged in
three-dimensional virtual landscapes [13, 14]. These landscapes generate virtual islands dedicated to specific musical
genres, where close distances indicate similarity between pieces belonging to the same style, while larger distances
separate different styles. Multiple modes are offered, allowing users to listen to specific songs, explore descriptive words
and explicit genre tags, or view related images. A recent variation of this visualization interface involves alternating
virtual islands with city tower blocks [21].
The idea that is developed in the method presented in the paper at hand concerns the visualisation of information
from data that are “compressed” by machine learning methods during training. Variational Autoencoders (VAE) [12]
are a good candidate for such an approach, since they are trained in an unsupervised way and their role is to generate
meaningful compressed representations of fewer dimensions than the original data, which additionally follow a normal
distribution. Such an approach, e.g., was followed in [27], where the latent space from a variation of a VAE model is
employed for generating visualisations of rhythms.
Since Autoencoders are not ideal for representing sequential data, latent spaces from Long Short-Term Memory
(LSTM) networks have been employed for creating visualisations. In these cases, the LSTM networks are trained autoregressively,
i.e., for predicting the next time step at every step, and therefore there is no requirement for annotated data. In [22], the
latent spaces of both VAE and LSTM networks trained on audio data of traditional Chinese music are used for producing
visualisations, among other tasks. Regarding symbolic music data, a method based on autoregressively trained
LSTM networks was presented in [11] for generating visualisations.
In parallel, transformer architectures [26] have taken several sequence-processing tasks by storm.
Beginning with natural language processing tasks [5, 18], transformer-based architectures have also been employed
for music-related tasks. Such tasks include symbolic music generation based on user-defined conditions, e.g. on
target emotion [9] or genre [20], structural segmentation [29] and motif variation [28]. The effectiveness of those
models relies on their ability to capture contextual relations in large sequences: they transform an input sequence of
multidimensional representations into another abstract multidimensional sequence that reflects context-based relations
between components of the sequence. Regarding visualisation of transformer-based latent spaces, MuseNet
(https://openai.com/research/musenet) learns to generate music autoregressively using a GPT-2 decoder with text prompts
from a text encoder. The latent space of this network was employed for generating 2D representations based on t-SNE [24].
Transformer models learn rich representations in their latent space, capturing meaningful semantic information
about the data. Transformer encoders have also been examined as a means of generating latent representations not for
autoregressive generation, but for feeding subsequent networks for multi-task classification [6]. This way, transformers
can shape their latent representations to provide information that is useful under multiple perspectives, according to the
number and the nature of the downstream tasks. The principle of this approach is the foundation of the paper at hand:
a transformer encoder is trained as a basic transformation module for feeding other models for multiple downstream
tasks. We expand further on this idea, by leveraging the internal abstract representations of each downstream task
model to create visualisations, as explained in detail in Section 3.2.
3 METHOD
The aim of the method is to receive multiple strings that represent the harmonic information of a large set of jazz standards
(a harmonic string describing the harmony of each jazz standard) and to generate numerical data that can be used
to create different types of 2D visualisations that focus on different attributes. This section describes the form of the
harmonic strings, the manually annotated dataset and the method that processes the string information and generates
the visualisations. The method employs a transformer encoder [26] for creating contextualised representations of
masked versions of the harmonic strings and multiple feedforward models for decoding the transformer output into
multiple supervised downstream learning tasks. In addition to the supervised tasks, a version of the model employs a
BERT-like [5] unmasking learning process in the output of the transformer component for “catalysing” context learning.
The penultimate layer of each classifier is employed for generating multidimensional (64 dimensions) representations
of data that are afterwards reduced to 2D representations with t-SNE [24].
3.1 Data representation and description
The dataset comprises 1048 jazz standards, which are represented in a string format that includes information about
style, section, tonality, tempo, time signature per bar, chords (root and type) and the beat position of chords. Style,
section and time signature information might change within a piece string, whereas tempo and tonality are constant.
An example of a string is given as follows:
style~Open Swing, tempo~160, tonality~C, bar~4/4, chord~C\u03947@0.0, bar~4/4, chord~B\u00f87@0.0,
chord~E7b9@2.0, bar~4/4, chord~Am7@0.0, chord~D7@2.0, bar~4/4, chord~Gm7@0.0, chord~C7@2.0, bar~4/4,
chord~F7@0.0, bar~4/4, chord~Fm7@0.0, chord~B−7@2.0, bar~4/4, chord~Em7@0.0, chord~A7@2.0, bar~4/4,
chord~E−m7@0.0, chord~A−7@2.0, bar~4/4, chord~Dm7@0.0, bar~4/4, chord~G7@0.0, bar~4/4, chord~C\u03947@0.0,
chord~A7b9@2.0, bar~4/4, chord~Dm7@0.0, chord~G7@2.0, end
The text representation of all available jazz standard charts is transposed to all 11 remaining tonalities. This
helps not only to populate the dataset with additional entries (12576 in total), but also to integrate the harmonic context
introduced by each jazz standard into all tonalities. The harmonic strings are split into meaningful blocks of information
that are isolated in separate tokens describing tonality (24 tokens, 12 for major and 12 for minor), style, tempo, time
signature, chord root and type (among 60 available types) and chord temporal position (in quarter notes). This process
leads to a vocabulary of 2649 unique tokens; the maximum length of all tokenised strings is a sequence of 324 tokens.
Three special and necessary tokens are included in the vocabulary: “[PAD]” for trail-padding sequences with sizes smaller
than the maximum size (324); “[UNK]” for letting the tokeniser represent tokens that are not accounted for (practically
unused in the presented application, but typically necessary for the employed tokeniser implementation); and “[MASK]”,
which will be useful for the BERT-like unmasking training process. Incorporating all transpositions ensures that all 24
tonalities and all root transpositions of each encountered chord type are accounted for.
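To make the splitting concrete, the following sketch (in Python) shows one possible way of turning a chart string into tokens of the kinds listed above. The exact splitting rules, the handling of accidentals and the function name are illustrative assumptions and do not reproduce the tokeniser used in the actual implementation.

def tokenize_chart(chart: str) -> list[str]:
    """Illustrative splitting of a chart string into separate tokens (sketch).

    Chord fields such as "chord~Am7@2.0" are split into root, type and
    beat-position tokens; all other fields (tonality, style, tempo, bar)
    are kept as single tokens.
    """
    tokens = []
    for field in (f.strip() for f in chart.split(",")):
        if field.startswith("chord~"):
            symbol, position = field[len("chord~"):].split("@")
            # split the root (with a possible accidental) from the chord type
            root_len = 2 if len(symbol) > 1 and symbol[1] in "#b" else 1
            tokens += [symbol[:root_len], symbol[root_len:], "@" + position]
        else:
            tokens.append(field)  # e.g. "tonality~C", "bar~4/4", "end"
    return tokens

print(tokenize_chart("style~Swing, tempo~160, tonality~C, bar~4/4, chord~Am7@0.0, chord~D7@2.0, end"))
# ['style~Swing', 'tempo~160', 'tonality~C', 'bar~4/4', 'A', 'm7', '@0.0', 'D', '7', '@2.0', 'end']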
The dataset of 1048 jazz standard songs underwent a comprehensive annotation process by one of the authors, which
added the following information to each piece: Form, Harmony, Year of Composition, Tonality, Composer, and Genre.
Year of composition and composer are retrieved from the real books and online resources. Only the starting tonality
is annotated, which is also included in the song string. The remaining categories were annotated according to the
following labels:
Harmonic context. [2, 17]
Tonal: Features functional harmony with a tonal center; each chord serves a specific function in the progression.
Modal: In the context of jazz, modal compositions can be written in any mode. In such compositions chords do not
have a functional role and there is no need for chords to resolve to a tonic; chords stand independently as
standalone entities.
Modern: Similar to modal harmony, but without necessarily belonging to a mode. It breaks traditional rules and may
or may not have a tonal center; the focus is on experimenting with different chord colors without adhering to
functional patterns.
Mixed: Compositions that feature a combination of harmonic contexts, such as Tonal/Modern, Tonal/Modal and
Modal/Modern.
Genre style. [4, 7, 8]
Swing: Encompassing all jazz styles before the emergence of Bebop, such as Early Jazz, New Orleans jazz, Dixieland,
Hollywood, Broadway and more.
Latin Jazz: Involving compositions by Latin American composers and pieces that incorporate Latin musical elements.
Ballad: Encompassing all types of ballads, as long as they exhibit a ballad-like character.
Modern: Referring to jazz standards composed roughly from the 1980s onwards.
Fusion: Describing songs that blend jazz with elements of rock or funk.
Avant-Garde: Including harmonically and melodically sophisticated pieces, some of which incorporate free jazz elements.
Form. [3, 15]
Blues: The most significant form of blues compositions, the 12-bar form.
16b: Compositions written in a 16-bar form.
24b: Compositions written in a 24-bar form.
32b: Compositions written in a 32-bar form, with thematic variations presented in the plot as AB, AABA, AABC, ABAC,
ABCD and Rhythm Changes. In some instances, a few additional measures were justified within the form to fit a piece
into a specific form group.
-: Compositions that do not belong to any of the previous forms.
3.2 Neural Network Architecture and visualisation
Fig. 1. Overview of the model.
An overview of the unmasking variation of the model is illustrated in Figure 1. The input to this version of the
model is masked (25%) tokenised sequences with a size of 324 tokens (context length), which is the maximum tokenised
sequence length in the dataset; smaller sequences are padded with the “[PAD]” token. The transformer encoder comprises
8 layers of transformer encoder blocks with 8 attention heads each; embeddings of size 256 are employed and the
internal dimensionality of each attention head is 256. The output of the transformer encoder is a sequence of 324 ×
256-dimensional embeddings, which is split into two parts: one part trains the system to unmask the masked tokens,
while the other part performs a 1-dimensional global average pooling on the transformer output, creating a 256-dimensional
average of the 324 × 256 transformer output, which is propagated to the “predictor” parts of the network
for the regression and classification tasks. Year prediction for each jazz standard is the only regression task, while all
the others are classification tasks, namely: composer (445 classes), harmonic style (7 classes), form (11 classes), tonality
(24 classes) and genre (18 classes). The variation of the model that does not incorporate masking receives the entire text of
the chart without masking, and the output of the transformer is not directed to the unmasking task.
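A minimal sketch of this encoder part is given below, using PyTorch purely for illustration (the paper does not state the framework used). The vocabulary size, context length, embedding size and numbers of layers and heads follow the values reported above; the positional embeddings and the exact form of the unmasking head are assumptions, and note that standard multi-head attention splits the 256 embedding dimensions across the 8 heads, which differs slightly from the per-head dimensionality of 256 stated above.

import torch
import torch.nn as nn

VOCAB_SIZE, MAX_LEN, D_MODEL, N_HEADS, N_LAYERS = 2649, 324, 256, 8, 8

class ChartEncoder(nn.Module):
    """Transformer encoder over tokenised chart strings (illustrative sketch)."""
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB_SIZE, D_MODEL)
        self.pos_emb = nn.Embedding(MAX_LEN, D_MODEL)
        layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=N_HEADS,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=N_LAYERS)
        self.unmask_head = nn.Linear(D_MODEL, VOCAB_SIZE)  # BERT-like unmasking output

    def forward(self, token_ids):                        # token_ids: (batch, 324)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        h = self.encoder(self.tok_emb(token_ids) + self.pos_emb(positions))
        unmask_logits = self.unmask_head(h)               # per-token vocabulary logits
        pooled = h.mean(dim=1)                            # 1-D global average pooling -> (batch, 256)
        return unmask_logits, pooled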
Fig. 2. Overview of the predictor module in a downstream task.

Each predictor (Figure 2) receives a 256-dimensional array coming from the global average pooling of the transformer
output and passes this information through a series of dense layers before making a prediction that corresponds to a
specific task. As shown in Figure 2, the penultimate layer of the series of dense layers in each predictor is important
and its output is preserved for producing the visualisations; we call this layer the “descriptor” layer, since it is
employed to describe what the system has learned regarding the attribute of the predictor. The idea is to use the
penultimate layer (descriptor) of each predictor to get the “last mile” latent information that the system has learned to
describe before actually performing regression or classification. The last layer of the predictor produces probability
vectors over classes (for the classification tasks) or a logistic regression output (for the regression task); those predictions are based
on the information that has been shaped by the transformer and the subsequent cascade of dense layer transformations
within each predictor, up to the penultimate layer, which we claim holds useful “disentangled” information that can be
used to produce meaningful visualisations of data.
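Continuing the same illustrative sketch, a predictor head with its 64-dimensional descriptor layer could look as follows; the size of the hidden layer preceding the descriptor (128 here) and the use of ReLU activations are assumptions, since the text only specifies the descriptor dimensionality and the output size of each task.

import torch.nn as nn

class Predictor(nn.Module):
    """One downstream head; returns the prediction and the 64-dim descriptor."""
    def __init__(self, n_outputs: int, hidden: int = 128):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(256, hidden), nn.ReLU())      # 256-dim pooled encoder output
        self.descriptor = nn.Sequential(nn.Linear(hidden, 64), nn.ReLU())   # "descriptor" layer
        self.out = nn.Linear(64, n_outputs)   # class logits, or a single unit for year regression

    def forward(self, pooled):                 # pooled: (batch, 256) from the encoder
        d = self.descriptor(self.hidden(pooled))
        return self.out(d), d                  # prediction and descriptor used for visualisation

# One head per downstream task listed above.
heads = nn.ModuleDict({
    "composer": Predictor(445), "harmony": Predictor(7), "form": Predictor(11),
    "tonality": Predictor(24), "genre": Predictor(18), "year": Predictor(1),
})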
The descriptor (penultimate) layer for all tasks is a 64-dimensional representation. Each of these representations is
converted to 2D using t-SNE [24], providing the data for visualisation according to each criterion. All visualisations are
rotated for optimal alignment, so as to have minimal average displacement of all points/pieces when switching from
one visualisation to the other. This is performed by applying only the rotation component of the Kabsch-Umeyama
algorithm [10, 23]. Specifically, all visualisations are rotated to align with the “year of composition” visualisation, which
is conveniently laid out in a left-older / right-newer arrangement of pieces (see result in Fig. 5(a) in Section 4.2).
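A sketch of this projection and alignment step is given below, using scikit-learn's t-SNE and a NumPy implementation of the rotation component of the Kabsch solution; the t-SNE settings, the centring of the point clouds and the variable names are assumptions made for illustration.

import numpy as np
from sklearn.manifold import TSNE

def project_descriptors(descriptors: np.ndarray) -> np.ndarray:
    """Reduce the (n_pieces, 64) descriptor matrix of one task to 2-D with t-SNE."""
    return TSNE(n_components=2).fit_transform(descriptors)

def align_rotation(points: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Rotate `points` about their centroid so they best match `reference` (Kabsch rotation only)."""
    p = points - points.mean(axis=0)
    r = reference - reference.mean(axis=0)
    u, _, vt = np.linalg.svd(p.T @ r)                 # SVD of the 2x2 cross-covariance
    d = np.sign(np.linalg.det(vt.T @ u.T))            # guard against reflections
    rotation = vt.T @ np.diag([1.0, d]) @ u.T
    return p @ rotation.T + points.mean(axis=0)

# e.g. (hypothetical variable names): align every task's 2-D map with the "year" map
# year_2d = project_descriptors(descriptors["year"])
# genre_2d = align_rotation(project_descriptors(descriptors["genre"]), year_2d)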
During training of the unmasking variation, the model learns jointly to optimize the output of all tasks, namely the
unmasking task and the regression / classification downstream tasks. This fact forces the transformer encoder part
of the model to produce contextualised output that is useful for multiple viewpoints. Some of those viewpoints are
orthogonal, i.e., one might be unrelated to the other (e.g., the tonality of a piece is not necessarily related to its genre),
a fact that forces the transformer part of the network to use all attention heads in all encoder layers wisely, to end up
with a parsimonious, yet useful representation of all this information in the final concatenation and global average
pooling part that prepares data for the predictors.
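As a rough sketch of this joint optimisation, the losses of all tasks can be summed into a single objective; the paper does not report the loss weighting, so the equal (unit) weights and the variable names below are assumptions made purely for illustration. The variation without unmasking would simply omit the unmasking term.

import torch.nn.functional as F

def joint_loss(unmask_logits, token_ids, mask_positions, head_outputs, targets):
    """Sum of the unmasking loss and all downstream task losses (equal weights assumed)."""
    # Cross-entropy on the originally masked positions only (BERT-like unmasking).
    loss = F.cross_entropy(unmask_logits[mask_positions], token_ids[mask_positions])
    for task, (prediction, _descriptor) in head_outputs.items():
        if task == "year":                               # the single regression task
            loss = loss + F.mse_loss(prediction.squeeze(-1), targets[task].float())
        else:                                            # classification tasks
            loss = loss + F.cross_entropy(prediction, targets[task])
    return loss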
The aim of jointly learning the unmasking task in parallel with the predictions is twofold: first, to force the entire
network (including the transformer and the predictors) to learn even with missing information (and prevent the
model from overfitting); and second, to force the transformer part to maintain contextual information that is generally
meaningful and does not skew towards quick convergence to the tasks that might be easier than others. The intuition
behind the second aim is based on the assumption that some tasks might be easier than others, e.g., predicting among 7
harmonic styles might be easier for the network than predicting among 445 composers. In this specific example, this
holds not only because of the number of classes but also because, in harmonic styles, the vocabulary of chords used in a
piece might be clearly indicative of its harmonic style, while different composers might use the same chord vocabulary,
and actually have compositions belonging to the same harmonic style. This fact will expectedly produce larger (in an
absolute sense) gradients in the harmonic style predictor, which will back-propagate to the transformer, forcing the
transformer to focus more on improving the performance on this attribute over the attribute of composer classification.
This would make the transformer overfit early in terms of harmonic style and ignore the composer classification task,
ending up in weight configurations that are only close to local minima in terms of the ignored attribute. The unmasking
task is expected to smooth out such transformer biases and force it to produce contextualised information that is useful
in its own right, while also improving the outcome of all the prediction tasks.
4 RESULTS
Two model variations are compared in terms of convergence, with the one that includes unmasking proving to be more
effective than the one that does not. Afterwards, the visualisations generated using the unmasking version are examined
through an online interactive interface by a jazz expert.
4.1 Convergence and classification/regression analysis
Both versions of the model (with and without unmasking) converge apparently similarly to low error and high accuracy
values, as shown in Fig. 3. This figure shows the error values relative to the maximum error value in each task; therefore,
each curve has a maximum value of 1. Some curves in both graphs in this figure present some spikes of error increase,
which are permitted by the employed variation of gradient descent (the Adam method). Note that the graph in Fig. 3(a)
includes an additional curve that describes the unmasking error, which follows an interesting convergence pattern:
after an initial quick error drop (first few epochs), it follows a slow and almost linear drop of around 10% and then,
around epoch 1000, it follows a “typical” exponential-like convergence. This behaviour of the unmasking error curve,
however, does not influence the convergence of the other curves in comparison to the no-unmasking case, i.e., there
does not appear to be a similar “convergence” boost in any other curve after epoch 1000. This was verified by examining
the behaviour of all curves with moving averages in a range between 30 and 100 (results not shown).

Fig. 3. Convergence (normalized error) when training with (a) and without (b) unmasking.
The unmasking method appears to be superior in terms of generalisation capabilities, meaning that it helps to
reduce overfitting. Classification and regression errors on the test set are shown in Table 1, where the method with no
unmasking misclassifies more pieces in all tasks except for the tonality task, where both methods produce no error.
The fact that both methods perform accurately in tonality classification has to do with the fact that tonality is explicitly
included in the harmonic string that describes each piece (e.g., tonality∼Am indicates an A minor tonality). Therefore,
the transformer component of the model needs to learn that the important part in the string for identifying tonality
is the first one or two characters (in the case of major and minor tonalities, respectively) that come after the tonality∼
prefix. Apparently, the transformer component is able to successfully extract this information and pass it to the tonality
classifier.
Table 1. Misclassified songs for each method variation

Task                     Unmasking    No unmasking
Misclassified songs
  harmony                     0             3
  form                        0             4
  tonality                    0             0
  composer                   11            28
  genre                       6            15
Regression loss
  year                   5.61e-5       6.58e-5
As the masking error exhibits a steeper decrease, it becomes evident that errors in other downstream tasks also
experience a more pronounced reduction, as shown in Fig. 4. Consequently, the impact of unmasking seems to expedite
error reduction for the validation set, providing additional evidence that this approach mitigates the risk of overfitting.
However, it’s important to note that this assertion warrants further investigation through focused studies.
Fig. 4. Training and validation losses for the Composer (a) and Year (b) tasks.
Form information is also incorporated in some cases within the harmonic string that describes each piece. If the
form of a piece is, e.g., AABA, then this means that the piece transitions through “sections”, i.e. groups of bars, that are
annotated with A, A, B and A respectively. In the harmonic string, the bar that incorporates the section change includes
a corresponding annotation, i.e., the bar∼ token includes a section∼A substring that indicates, e.g., that section A begins.
Therefore, in these cases, the transformer component needs to identify the section∼ prefix, focus on the next character
each time and keep track of how those characters succeed one another. Both model versions learn to perform this
almost perfectly, with a slight superiority for the unmasking variation, which presents no errors (compared to 4 errors for the
other version). It has to be noted that other annotated form types are based on other factors, e.g., the number of bars,
and not the succession of section parts; some such pieces do not even include the section∼ prefix at all within their
harmonic string. In those cases, again, both models correctly learn to count the occurrences of bar∼.
The remaining classification tasks, as well as the regression task, incorporate more subtle characteristics that the
models need to learn. In all those cases, again, the unmasking variation outperforms the non-unmasking variation.
Given that both model variations are trained only on the transpositions of the original pieces, where only the roots of
chords (and the tonality annotation) are changed, and that they perform so well on the original pieces, it appears that
the models focus on the chord types and their transitions. It would be useful to examine whether and to what extent
root relations are implicitly learned, e.g., whether a chord with root G going to a chord with root C is recognised as the same
relation as going from D to G. This examination, however, requires focused experimental processes that are left for future work.
4.2 Qualitative analysis
The visualisation results obtained by the unmasking variation of the model for each downstream task are shown
in Figure 5. As expected from the high classification results, the model indeed presents the songs in groups
“correctly”, i.e., according to the annotated labels, which are employed for coloring each song/point on the graph. A
publicly accessible web page (https://maxim.mus.auth.gr:5001) has been developed based on these graphs that allows users
to explore the visualisation of each downstream task in an interactive way; this visualisation interface enables users to acquire
extra information about each piece upon mouse hover. Additionally, zooming in/out capabilities are offered, as well as the option
to highlight a specific piece selected through a search text box. The web visualisation interface was employed for the qualitative
analysis of the results by one of the authors, who is an expert in jazz musicology.

Fig. 5. Results of visualisations obtained from the predictor of each downstream task: (a) Year (warmer color, newer piece); (b) Tonality; (c) Form; (d) Genre; (e) Composer; (f) Harmony. Colors correspond to the annotated labels.
Some interesting findings concern inter-cluster relations. When the representation of the model is based on Form
Style, compositions following the Rhythm Changes, a distinctive harmonic variation of the AABA form, were accurately
clustered adjacent to traditional AABA songs. Likewise, compositions following the 12-bar Blues Form were positioned
near the 16-bar compositions due to their closest resemblance in terms of form style, highlighting the model’s sensitivity
to temporal characteristics.
Regarding specific songs that appear to be “out of place”, the model effectively grouped the blues song “Serenade to a
Soul Sister” alongside other 24-bar compositions. Blues songs written in a 3/4 time signature, such as the one mentioned
above, have their 12-bar structure doubled into 24 bars. This result confirmed the model’s ability to understand
complex harmonic frameworks and appropriately organize compositions with similar structures. However, during
the analysis, specific clustering errors were identified, notably in the placement of the blues song “All Blues”; despite
sharing a matching 24-bar structure with “Serenade to a Soul Sister”, the model positioned it in a different region,
seemingly prioritizing other musical elements. Such discrepancies indicate the need for further refinement in the
model’s decision-making process to improve its accuracy.
Regarding the analysis of the tonalities graph, minor tonalities were effectively grouped together, but they appeared
significantly distant from their respective major tonalities. This suggests that the string representation, incorporating
chord symbols, does not lead to a musically convenient circle of fifths pattern but instead emphasizes the convenient
separation of tonality mode annotation (major/minor).
These observations underscore the model’s proficiency in capturing musical nuances and its potential for meaningful
visualizations. The web interface provides a user-friendly platform for exploring and interpreting the results. However,
the identified discrepancies and clustering errors indicate areas for refinement in the model’s decision-making process.
Overall, the qualitative analysis reinforces the model’s ability to organize and present jazz standard compositions based
on various annotated perspectives, laying the foundation for further refinement and exploration.
5 CONCLUSIONS
Visualising large datasets of music under multiple perspectives, according to the needs of the user, facilitates the
discovery of useful material based on user-selected criteria. This paper presents a method for generating multiple
visualisations of a jazz standard dataset under several user-defined music-related perspectives. The examined dataset
includes strings of harmonic and structural descriptions of 1048 jazz standard charts, each accompanied by a
set of manual annotations for each perspective. A model is proposed that receives a string as input and learns to classify
and do regression based on the annotations. The basis of the model is a stack of transformer encoders, which prepare
a context-aware output for multiple feed-forward downstream models that do classification and regression for each
annotated aspect. The output of the penultimate layer of each downstream task model is employed as a multidimensional
latent representation that is further compressed to 2D for visualisation.
Two variations of the model are examined: one learns only the downstream tasks, while in the other, the transformer
component learns to unmask a masked version of the input jointly with the downstream tasks. Training is performed
on the 11 transpositions in all keys for each string (with corresponding changes in the annotations) and testing is
performed on the original strings of the dataset. Results show that both versions learn each downstream task accurately,
with the unmasking variation slightly outperforming the other in most tasks. The resulting visualisations show that
the model clusters the data properly for each downstream task. The observed acceleration in error reduction due to
unmasking during validation underscores its role in effectively mitigating overfitting concerns. After examination by
an expert on jazz standards, it was shown that in some cases inter-cluster relations make musical sense and that the
placement of some out-of-place pieces, i.e., pieces that are visualised in the area of a different cluster than the one they
were annotated with, is musically justified.
However, it’s essential to acknowledge certain limitations in our research. While the proposed model demonstrates
proficiency in accurately learning and visualizing downstream tasks, there is room for further exploration. Future
empirical experiments could delve into the model’s ability to identify relations between chord roots within progressions.
Investigating whether the model’s visual representations align with perceptual distances perceived by participants
would contribute valuable insights.
Moreover, the potential for creating visualizations based on compound criteria, such as both "form" and "genre,"
remains an intriguing avenue for future research. This would necessitate a deeper exploration of the latent
representations within each downstream module, paving the way for aligning these spaces with those of pre-trained language
models. Such alignment could enable the construction of custom visualizations based on free-text queries, providing
users with more flexibility in exploring and understanding the rich landscape of jazz standards.
In conclusion, while our research demonstrates promising results in music data visualization, there are exciting
opportunities for further refinement and expansion. By addressing these limitations and exploring new directions,
we can advance the field of music education and discovery, offering users enhanced tools for exploring the intricate
world of jazz standards. This work underscores the importance of pushing the boundaries of music data visualization to
empower users with more meaningful and personalized insights, ultimately contributing to the broader landscape of
music research and education.
ACKNOWLEDGMENTS
This research has been co-financed by the European Regional Development Fund of the European Union and Greek
national funds through the Operational Program Competitiveness, Entrepreneurship and Innovation, under the call
RESEARCH – CREATE – INNOVATE. Project Acronym: MusiCoLab, Project Code: T2EDK-00353.
REFERENCES
[1] Charles W. Beale. 2001. From jazz to jazz in education: An investigation of tensions between player and educator definitions of jazz. Doctoral thesis. Institute of Education, University of London.
[2] D. Berkman and Sher Music Co. 2013. The Jazz Harmony Book: A Course in Adding Chords to Melodies. Sher Music Company.
[3] M. Cooke and D. Horn. 2003. The Cambridge Companion to Jazz. Cambridge University Press.
[4] J.S. Davis. 2012. Historical Dictionary of Jazz. Scarecrow Press.
[5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[6] Timothy Greer, Xuan Shi, Benjamin Ma, and Shrikanth Narayanan. 2023. Creating musical features using multi-faceted, multi-task encoders based on transformers. Scientific Reports 13, 1 (2023), 10713.
[7] M.C. Gridley. 2011. Jazz Styles. Pearson Education.
[8] T. Holmes and W. Duckworth. 2006. American Popular Music: Jazz. Checkmark.
[9] Shulei Ji and Xinyu Yang. 2023. EmoMusicTV: Emotion-conditioned Symbolic Music Generation with Hierarchical Transformer VAE. IEEE Transactions on Multimedia (2023).
[10] Wolfgang Kabsch. 1976. A solution for the best rotation to relate two sets of vectors. Acta Crystallographica Section A: Crystal Physics, Diffraction, Theoretical and General Crystallography 32, 5 (1976), 922–923.
[11] Maximos Kaliakatsos-Papakostas, Konstantinos Velenis, Konstantinos Giannos, and Emilios Cambouropoulos. 2022. Exploring Jazz Standards with Web Visualisation for Improvisation Training. In Web Audio Conference.
[12] Diederik P Kingma and Max Welling. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013).
[13] Peter Knees, Markus Schedl, Tim Pohle, and Gerhard Widmer. 2006. An innovative three-dimensional user interface for exploring music collections enriched. In Proceedings of the 14th ACM International Conference on Multimedia. 17–24.
[14] Peter Knees, Markus Schedl, Tim Pohle, and Gerhard Widmer. 2007. Exploring music collections in virtual landscapes. IEEE MultiMedia 14, 3 (2007), 46–54.
[15] R.J. Lawn. 2013. Experiencing Jazz: eBook Only. Taylor & Francis.
[16] Dave Liebman, Phil Markowitz, Vic Juris, and Bob Reich. 1991. A chromatic approach to jazz harmony and melody. Advance Music, Rottenburg am Neckar.
[17] R. Miller. 1996. Modal Jazz Composition & Harmony, Vol. 1. Advance Music.
[18] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. 2018. Improving language understanding by generative pre-training. (2018).
[19] Andreas Rauber, Elias Pampalk, and Dieter Merkl. 2003. The SOM-enhanced JukeBox: Organization and visualization of music collections based on perceptual models. Journal of New Music Research 32, 2 (2003), 193–210.
[20] Pedro Sarmento, Adarsh Kumar, Yu-Hua Chen, CJ Carr, Zack Zukowski, and Mathieu Barthet. 2023. GTR-CTRL: Instrument and Genre Conditioning for Guitar-Focused Music Generation with Transformers. In International Conference on Computational Intelligence in Music, Sound, Art and Design (Part of EvoStar). Springer, 260–275.
[21] Markus Schedl, Michael Mayr, and Peter Knees. 2020. Music Tower Blocks: Multi-Faceted Exploration Interface for Web-Scale Music Access. In Proceedings of the 2020 International Conference on Multimedia Retrieval. 388–392.
[22] Jingyi Shen, Runqi Wang, and Han-Wei Shen. 2020. Visual exploration of latent space for traditional Chinese music. Visual Informatics 4, 2 (2020), 99–108. https://doi.org/10.1016/j.visinf.2020.04.003 PacificVis 2020 Workshop on Visualization Meets AI.
[23] Shinji Umeyama. 1991. Least-squares estimation of transformation parameters between two point patterns. IEEE Transactions on Pattern Analysis & Machine Intelligence 13, 04 (1991), 376–380.
[24] Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9, 11 (2008).
[25] Rob van Gulik, Fabio Vignoli, and Huub van de Wetering. 2004. Mapping music in the palm of your hand, explore and discover your collection. In Proceedings of the 5th International Conference on Music Information Retrieval. Queen Mary, University of London.
[26] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).
[27] Gabriel Vigliensoni, Louis McCallum, Esteban Maestre, Rebecca Fiebrink, et al. 2022. R-VAE: Live latent space drum rhythm generation from minimal-size datasets. Journal of Creative Music Systems 1, 1 (2022).
[28] Heng Wang, Sen Hao, Cong Zhang, Xiaohu Wang, and Yilin Chen. 2023. Motif Transformer: Generating Music With Motifs. IEEE Access (2023).
[29] Guowei Wu, Shipei Liu, and Xiaoya Fan. 2023. The Power of Fragmentation: A Hierarchical Transformer Model for Structural Segmentation in Symbolic Music Generation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 31 (2023), 1409–1420.