GuessTheMusic: Song Identification from
Electroencephalography response
Dhananjay Sonawane
dhananjay.sonawane@alumni.iitgn.ac.in
Indian Institute of Technology Gandhinagar
Gandhinagar, Gujarat
Krishna Prasad Miyapuram
kprasad@iitgn.ac.in
Indian Institute of Technology Gandhinagar
Gandhinagar, Gujarat
Bharatesh RS
bharatesh.rayappa@iitgn.ac.in
Indian Institute of Technology Gandhinagar
Gandhinagar, Gujarat
Derek J. Lomas
J.D.Lomas@tudelft.nl
Delft University of Technology, Netherlands
ABSTRACT
The music signal comprises different features such as rhythm, timbre, melody, and harmony. Its impact on the human brain has been an exciting research topic for the past several decades. Electroencephalography (EEG) enables the non-invasive measurement of brain activity. Leveraging recent advances in deep learning, we propose a novel approach for song identification from electroencephalography (EEG) responses using a convolutional neural network (CNN). We recorded EEG signals from a group of 20 participants while they listened to a set of 12 song clips, each approximately 2 minutes long, presented in random order. The repeating nature of music is captured by a data slicing approach that treats 1-second segments of brain signals as representative of each song clip. More specifically, we predict the song corresponding to one second of EEG data given as input, rather than to a complete two-minute response. We also discuss pre-processing steps to handle the large dimensionality of the dataset and various CNN architectures. For all experiments, we included each participant's EEG response for each song in both the training and test data. We obtained 84.96% accuracy at a 0.3 train-test split ratio. Moreover, our model gave commendable results compared to chance-level probability when trained on only 10% of the total dataset. The observed performance supports the notion that listening to a song creates specific patterns in the brain, and that these patterns vary from person to person.
CCS CONCEPTS
• Computing methodologies → Neural networks; Supervised learning by classification; Cognitive science.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
CODS COMAD 2021, January 2–4, 2021, Bangalore, India
© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-8817-7/21/01...$15.00
https://doi.org/10.1145/3430984.3431023
KEYWORDS
EEG, CNN, neural entrainment, music, frequency following response, brain signals, classification
ACM Reference Format:
Dhananjay Sonawane, Krishna Prasad Miyapuram, Bharatesh RS, and Derek
J. Lomas. 2021. GuessTheMusic: Song Identification from Electroencephalography response. In 8th ACM IKDD CODS and 26th COMAD (CODS COMAD
2021), January 2–4, 2021, Bangalore, India. ACM, New York, NY, USA, 9 pages.
https://doi.org/10.1145/3430984.3431023
1 INTRODUCTION
Audio is a type of time-series signal characterized by frequency and amplitude. Music signals are a particular type of audio signal that possess specific acoustic and structural features. Accordingly, one would expect music to affect different parts of the brain than other audio signals. Nevertheless, how closely is brain activity related to the perception of a periodic signal such as music? Electroencephalography (EEG) is a method to measure the electrical activity generated by the synchronized activity of neurons. There is a plethora of published evidence supporting a link between EEG responses and music. Brattico et al. [2] showed that the brain anticipates melodic information before the onset of the stimulus and that it is processed in the secondary auditory cortex. A study of rhythm processing by Snyder et al. found that gamma activity in the EEG response corresponds to the beats of simple rhythms [14]. A recent study showed that it is possible to extract tempo, a critical feature of music stimuli, from the EEG signal [16]; the authors concluded that the quality of the tempo estimation was highly dependent on the music stimulus used. The frequency of the neural response generated by entrainment to music is closely related to its beat frequency [11]. Further, some researchers have carried out Canonical Correlation Analysis to estimate the correlation coefficient between music stimuli and EEG data [5, 13].
However, work on the patterns of brain activity reflecting neural entrainment during music listening, and on recognizing those patterns, is still at an early stage. These patterns are intricate, and it is therefore hard to interpret what is happening in the human brain when a person is listening to a song. Moreover, the aesthetic experience associated with music listening is highly subjective: it varies from person to person and also from time to time, depending on various contextual factors such as the mood of the individual listening to the music. That is why the song identification task is challenging. Previous research has focused on the relationship between a song and its brain (EEG) responses, using engineered features for processing the EEG data that depend on domain knowledge. There have been few attempts at automatic feature extraction from EEG data using neural networks for the song classification task [18].
Building on the notion of resonance between EEG signals and music stimuli, in this paper we hypothesize the following: 1) music stimuli create identifiable patterns in the EEG response; 2) for a given song, these patterns vary from person to person. We pose these hypotheses as a song identification task using a deep learning architecture. To study the first hypothesis, we split each participant's EEG response for each song into training and test datasets. We explored how large the training data should be and the effect of training data size on the performance of the model. For a given participant, the model learns the song patterns present in the EEG response from the training data and predicts the song ID for the test data. For the second hypothesis, we exclude some participants entirely from the training dataset. During data preprocessing, the raw EEG response is divided into segments, each corresponding to a 1-second-long EEG response. Our model predicts the song ID for each such segment in the test data. Each segment is represented as a 2D matrix, which we call a "song image". Data preprocessing is discussed further in Section 3.2. This representation allows us to use 2D and 3D convolutional neural networks, which are usually used in the computer vision and image processing fields. The features extracted from the song image by the CNN are fed to a multilayer perceptron network for the classification task. Our results outperform state-of-the-art accuracy.
The remainder of the paper is organized as follows: Section 2 describes prior work on the song classification problem using EEG data and its results. Section 3 describes our dataset, including data collection and pre-processing steps. Section 4 presents the CNN architectures and model development. In Section 5, we discuss the performance of our models, and in Section 6, we draw conclusions about the cognitive process behind music perception and suggest possible future work.
2 RELATED WORK
It has been shown that human mental states can be unraveled from non-invasive measurements of brain activity such as functional MRI and EEG [6]. Several researchers have documented the frequency following response (FFR), the potential induced in the brain while listening to a periodic or nearly periodic audio signal [1]. A successful attempt has also been made to reconstruct perceptual input from the EEG response: in [8], a person looks at an image while their brain activity is captured by EEG in real time, and the EEG signals are then analyzed and transformed into an image semantically similar to the one the person is looking at. The Brain2Image system was modeled using variational autoencoders (VAE) and generative adversarial networks (GAN). The work in [12] relates music and the corresponding brain activity using a statistical framework. The authors study the classification of musical content from individual EEG responses by targeting three tasks: stimulus-specific classification; group classification, i.e., songs recorded with lyrics, songs recorded without lyrics, and instrumental pieces; and meter classification, i.e., 3/4 vs. 4/4 meter. They used the OpenMIIR dataset [17], which includes response data from 10 subjects who listened to 12 music fragments from popular musical pieces, with durations ranging from 7 s to 16 s. They proposed a Hidden Markov Model and a probabilistic computation method; the model was trained on 9 subjects and tested on the 10th subject. They achieved 42.7%, 49.6%, and 68.7% classification rates for the three tasks, respectively. Foster et al. investigated the correlation between EEG responses and music features extracted with the librosa library in Python [4]. Using representational similarity analysis, they report a correlation coefficient of the EEG data with normalized tempogram features of 0.63 and with MFCCs of 0.62. They also address song identification from EEG data and obtained 28.7% accuracy using a logistic regression model. Our study stands out in terms of methodology, as we exploit the power of a deep learning architecture for automatic feature extraction from the EEG response.
Yi Yu et al. used a convolutional neural network called DenseNet [7] for audio event classification [18]. The EEG responses were collected from 9 male participants. The audio stimuli were 10 seconds long and spanned 8 different categories (chant, child singing, choir, female singing, male singing, rapping, synthetic singing, and yodeling). They achieved 69% accuracy using EEG data only; however, the best result was 81%, obtained when they combined the EEG responses with audio features extracted by another convolutional network, VGG16 [9]. Sebastian Stober et al. aimed to classify 24 music stimuli [15]. Each music segment comprised 2 unique rhythms played at different pitches. Due to the small amount of data, they processed and classified each EEG channel individually. The CNN was trained on 84% of the complete response (approximately 27 seconds out of 32), validated on 8% (approximately 2.5 seconds), and tested on 8% (approximately 2.5 seconds). They report 24.4% accuracy. In this study, we deal with more complex data: our music segments comprise diverse tones, rhythms, and pitches, and some of them also include vocals, which makes our song identification task more challenging. The proposed architecture is also much simpler than DenseNet [7] and VGG16 [9].
3 DATASET
We collected a new dataset for this study. The overall data analysis process can be divided into two parts: data collection and data preprocessing.
3.1 Data Collection
Participants were seated in a dimly lit room. We then collected demographic information such as age, gender, and handedness. Brief information about the EEG collection setup, the time the experiment would take, and the responses they would have to make was discussed with all participants. We then measured the circumference of each participant's head to select a suitable EEG cap. A 128-channel high-density Geodesic electrode net cap (Hydrocel Geodesic SensorNet platform, Electrical Geodesics Inc., USA, now Philips) was chosen according to the head-size measurement. The cap was immersed in a KCl electrolyte solution prepared in 1 litre of pure distilled water. The reference electrode position was measured as the intersection point of the line between the nasion (the point between the eyebrows) and the inion (the midpoint at the back of the skull) with the line between the preauricular points on both sides, and then marked. The other electrodes were placed according to the International 10-20 system, as depicted in Fig. 1.

Figure 1: Illustration of the 128-channel system and electrode positions
After this setup, participants were asked to close their eyes on hearing a single beep tone. This was followed by 10 seconds of silence, after which the song stimulus was presented. At the end of each stimulus, a double beep tone was sounded, at which the participants were instructed to open their eyes and respond. They were asked two questions, rating their familiarity with and enjoyment of the song on a scale of 1 to 5. Since the maximum song length is 132 seconds, all other responses were zero-padded accordingly; therefore, all song responses are 142 seconds long after including the silence window described above. In total, data from 20 participants was collected for 12 music stimuli. The songs used in the experiment are listed in Table 1 along with their genres; they contain both tonal and vocal excerpts. The sampling rate was 1000 Hz for 11 participants and 250 Hz for the remaining 9. Sixteen participants were male and 4 were female. All were right-handed, with an average age of 25.3 years (standard deviation 3.38). All music stimuli were presented to each participant in random order.
3.2 Data Preprocessing
EEG is highly sensitive to noise; it also captures eye blinks and high-frequency muscle movements. It is therefore necessary to clean EEG data before using it for any application. The EEGLAB toolbox was used to implement the majority of the preprocessing steps. After providing the channel locations, we performed re-referencing with respect to the average of all channels. The raw EEG signal was then loaded as one epoch per presented song, giving 12 epochs per participant; this eliminated the signals that do not lie in our period of interest. We then used independent component analysis to remove artifacts, using the 'runica' algorithm available in MATLAB, together with the ADJUST toolbox for artifact removal. For simplicity, positive infinities, negative infinities, and NaN (not a number) values were replaced by zero; taking the average value of the surrounding electrodes for these outliers would be a better approach and may improve performance. The above steps produce the final, ready-to-use data.

Table 1: Songs used in EEG data collection

Song ID | Song Name                 | Genre       | Artist                        | Song Length (s)
--------|---------------------------|-------------|-------------------------------|----------------
1       | Trip to the lonely planet | Electronics | Mark Alow                     | 125
2       | Sail                      | Indie       | Awolnation                    | 114
3       | Concept 15                | Electronics | Kodomo                        | 132
4       | Aurore                    | New age     | Claire David                  | 111
5       | Proof                     | Dance       | Idiotape                      | 124
6       | Glider                    | Ambient     | Tycho                         | 100
7       | Raga Behag                | Classical   | Pandit Hari Prasad Chaurasiya | 116
8       | Albela sajan              | Classical   | Shankar Mahadevan             | 121
9       | Mor Bani Thanghat Kare    | Dance       | Aditi Paul                    | 126
10      | Red Suit                  | Rock        | DJ David. G                   | 129
11      | Fly Kicks                 | Hip Hop     | DJ Kimera                     | 113
12      | JB                        | Rock        | Nobody.one                    | 117
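The preprocessing above was implemented with EEGLAB and MATLAB ('runica' ICA plus the ADJUST toolbox). Purely as an illustrative sketch, a broadly comparable pipeline in MNE-Python might look as follows; the file name, number of ICA components, excluded components, and event codes are placeholders, not the authors' actual settings.

```python
import numpy as np
import mne
from mne.preprocessing import ICA

# Placeholder file: a raw 128-channel EGI recording for one participant.
raw = mne.io.read_raw_egi("participant_1902.raw", preload=True)  # hypothetical path

# Re-reference to the average of all channels, as described in the paper.
raw.set_eeg_reference("average")

# One epoch per presented song; the event codes 1-12 are assumed here.
events = mne.find_events(raw)
epochs = mne.Epochs(raw, events,
                    event_id={f"song_{i}": i for i in range(1, 13)},
                    tmin=0.0, tmax=142.0, baseline=None, preload=True)

# ICA-based artifact removal (the paper uses EEGLAB's 'runica' + ADJUST;
# here components are excluded manually after inspection).
ica = ICA(n_components=20, method="infomax", random_state=0)
ica.fit(epochs)
ica.exclude = [0, 1]                      # placeholder indices
epochs_clean = ica.apply(epochs.copy())

# Replace non-finite values with zero, as described in the paper.
data = epochs_clean.get_data()            # shape: (12, n_channels, n_samples)
data = np.nan_to_num(data, nan=0.0, posinf=0.0, neginf=0.0)
```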
Deep learning models typically need extensive data with reasonable feature dimensions to detect a pattern; in our task, we had the opposite situation. Our data contains 240 EEG responses, corresponding to 20 participants and 12 song stimuli. However, the number of samples collected at one electrode for one song of one participant is greater than 27,000 at the 250 Hz sampling rate, and exceeds 100,000 at the 1000 Hz sampling rate. Thus, our data consists of a few examples with very high dimensionality.
A data augmentation technique was used to increase the amount of data. We split the EEG response of each participant for each song into chunks using a 1-second-long window, giving 2D matrices of dimension 128 (electrodes) * 250 or 1000 (samples per second). We call these matrices "song images" and label each with the song ID of the corresponding song. This formulation not only increases the number of examples in the original dataset but also allows us to use 2D and 3D convolutional networks. Fig. 2 shows the song images of the 26th second of two different songs for one of the participants in the time domain. However, the window size is a design parameter, and it is difficult to decide the optimal window size for obtaining song images. A larger window size increases the dimensionality of a song image and decreases the total number of song examples, defeating the purpose of the data augmentation; a high-dimensional input to the Convolutional Neural Network (CNN) also drastically increases the number of trainable parameters, provided the rest of the architecture remains the same. A smaller window size carries less information about the EEG signal, and the CNN may perform poorly. Moreover, as the sampling rate changes, a 1-second time window carries a different number of samples in the song image, causing inconsistency in the input data to the CNN. We therefore needed a more concrete preprocessing step so that a smaller window size would increase the number of examples in the dataset, but not at the cost of performance. Fig. 3 illustrates the Fast Fourier Transform (FFT) of all 12 song responses for one participant. It is worth noting that, in all the FFTs, the maximum frequency component is below 100 Hz. This is expected, as EEG is characterized by frequency bands (0-4 Hz delta, 4-8 Hz theta, 8-15 Hz alpha, 15-32 Hz beta, and above 32 Hz gamma) and exhibits high power in the low frequency ranges.

Figure 2: Song images of the 26th second for participant ID 1902 in the time domain ((a) Song ID 6, (b) Song ID 7)
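To make the windowing concrete, the sketch below illustrates the slicing step under our own assumptions: the cleaned response is taken to be a NumPy array of shape channels x samples recorded at 250 Hz.

```python
import numpy as np

def make_song_images(response, fs, song_id, window_sec=1):
    """Slice one cleaned EEG response (channels x samples) into
    non-overlapping windows of `window_sec` seconds, each labelled
    with the song ID of the full response."""
    win = int(fs * window_sec)
    n_windows = response.shape[1] // win
    images = [response[:, i * win:(i + 1) * win] for i in range(n_windows)]
    labels = [song_id] * n_windows
    return np.stack(images), np.array(labels)

# Example: a 142-second response from 128 electrodes at 250 Hz yields
# 142 song images of shape 128 x 250, all labelled with song ID 6.
response = np.random.randn(128, 142 * 250)   # stand-in for real data
images, labels = make_song_images(response, fs=250, song_id=6)
print(images.shape)   # (142, 128, 250)
```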
Using the spectopo function in EEGLAB, we converted the time-domain EEG data into the frequency domain. Spectopo calculates the amplitude of each frequency component present in every 1-second window of the EEG response. The maximum frequency component in the frequency-domain representation is chosen as per the Nyquist criterion: it is 125 Hz and 500 Hz when the sampling rate is 250 Hz and 1000 Hz, respectively. Regardless of the EEG sampling frequency, we can therefore safely choose 125 Hz as the maximum frequency component. Fig. 4 explains the dimensionality and the conversion of time-domain data to frequency-domain data. The frequency-domain representation reduces dimensionality and makes the input dataset consistent and compact. However, the time window over which we calculate the FFT is again a design parameter; the effect of the time window on the performance of the CNN in both the time and frequency domains is addressed in the next section.

Figure 3: FFT of the EEG responses of participant ID 1902 for all 12 songs

Figure 4: Time-domain to frequency-domain conversion for one participant's data using spectopo
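The authors used EEGLAB's spectopo for this step; the following NumPy sketch shows the same idea under our own assumptions, computing the amplitude spectrum of each 1-second window and keeping only the components up to 125 Hz so that recordings sampled at 250 Hz and at 1000 Hz yield song images of the same 128 x 126 shape.

```python
import numpy as np

def to_frequency_song_image(window, fs, f_max=125):
    """Convert one time-domain song image (channels x samples) into a
    frequency-domain song image: rows are electrodes, columns are the
    amplitudes of the 0..f_max Hz components of a 1-second window."""
    spectrum = np.abs(np.fft.rfft(window, axis=1)) / window.shape[1]
    freqs = np.fft.rfftfreq(window.shape[1], d=1.0 / fs)
    return spectrum[:, freqs <= f_max]

# A 1-second window at 250 Hz (128 x 250) and one at 1000 Hz (128 x 1000)
# both map to 128 x 126 frequency-domain song images.
win_250 = np.random.randn(128, 250)
win_1000 = np.random.randn(128, 1000)
print(to_frequency_song_image(win_250, 250).shape)    # (128, 126)
print(to_frequency_song_image(win_1000, 1000).shape)  # (128, 126)
```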
The rows of a song image in the frequency domain denote electrodes, while the columns denote frequencies. Fig. 5 shows the song images of the 26th second of two different songs for participant 1902 in the frequency domain.
Figure 5: Song images of the 26th second for participant ID 1902 in the frequency domain ((a) Song ID 6, (b) Song ID 7)
We follow standard machine learning practice to develop the model. The train-test split is 0.3, with a random selection of song images for the training (70%) and testing (30%) data. We set the shuffle and random_state parameters of the train_test_split() function to True and None, respectively, throughout the experiments, which ensures a fully random split. The validation split parameter is set to 0.2.
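A minimal sketch of this split, assuming the song images and their labels are held in NumPy arrays named images and labels:

```python
from sklearn.model_selection import train_test_split

# 70% of the song images for training, 30% for testing; shuffle=True and
# random_state=None give a different random split on every run.
X_train, X_test, y_train, y_test = train_test_split(
    images, labels, test_size=0.3, shuffle=True, random_state=None)

# The 20% validation split is handed to Keras at fit time,
# e.g. model.fit(..., validation_split=0.2).
```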
4 METHODS
The aim of this work is to create an approach that classifies an EEG response according to the corresponding song. This section describes the CNN architecture and model development, followed by model testing.
4.1 Experiment and Model development
The Convolutional Neural Network is the core part of our model because it learns the underlying patterns in the song images. It includes many hyperparameters that need to be set carefully. We apply the CNN to both the time-domain and the frequency-domain song image datasets. The CNN architecture remains the same except for the input layer, where the shape of the input song image changes according to the domain and the time window.
We created a 3-layer CNN for feature extraction and a 2-layer dense neural network for song classification. Except for the final output layer, each convolution layer and dense layer uses the ReLU [10] activation function, which brings non-linearity into the architecture and helps detect complex patterns. Since we are performing multiclass classification, the output layer has a softmax activation function. The loss function used is the categorical cross-entropy, given by
CrossEntropy = -\sum_{c=1}^{C} y_{o,c} \log(p_{o,c}),    (1)

where C is the total number of classes (12 in our case), y_{o,c} is a binary indicator of whether class label c is the correct classification for observation o, and p_{o,c} is the predicted probability that observation o belongs to class c.
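As a quick numerical check of Eq. (1): if an observation of song 3 is one-hot encoded and the model assigns it probability 0.7, the loss for that observation is -log(0.7) ≈ 0.357.

```python
import numpy as np

# One-hot target for song ID 3 and a hypothetical predicted distribution.
y_true = np.zeros(12)
y_true[2] = 1.0
y_pred = np.full(12, 0.3 / 11)
y_pred[2] = 0.7

loss = -np.sum(y_true * np.log(y_pred))   # categorical cross-entropy, Eq. (1)
print(round(loss, 3))                     # 0.357
```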
We used the Adam optimizer to minimize the categorical cross-entropy loss. The kernel size is 3*3, and we used 16 such filters at each convolution layer. Two max-pooling layers were added after convolution layers 2 and 3; excluding them almost doubles the total number of trainable parameters, thereby increasing network complexity and training time. Fig. 6 shows the 2D CNN architecture. Removing one or more layers from this architecture resulted in an under-determined system that failed to learn all the patterns, while adding extra layers led to overfitting and thus reduced performance. We trained the CNN for 30 epochs. The proposed network architecture is implemented in the Keras framework [3], which also handles the random initialization of the weights. The NVIDIA GTX 1050 GPU used for these experiments has 4 GB of memory, so the batch size was kept at 16 for all experiments because of memory constraints.
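A minimal Keras sketch of this setup for the 128 x 126 frequency-domain song images is given below; the width of the hidden dense layer is an assumption, as the text does not state it.

```python
from tensorflow.keras import layers, models, optimizers

def build_model(input_shape=(128, 126, 1), n_classes=12):
    """3 convolution layers (16 filters of 3x3, ReLU) with max-pooling
    after layers 2 and 3, followed by a 2-layer dense classifier."""
    model = models.Sequential([
        layers.Conv2D(16, (3, 3), activation="relu", input_shape=input_shape),
        layers.Conv2D(16, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(16, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),        # hidden width assumed
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer=optimizers.Adam(),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Training settings reported in the paper: 30 epochs, batch size 16, and a
# 20% validation split. X_train must have shape (n, 128, 126, 1) and
# y_train must be one-hot encoded (e.g. with keras.utils.to_categorical).
# model = build_model()
# model.fit(X_train, y_train, epochs=30, batch_size=16, validation_split=0.2)
```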
4.2 Model Testing
To study our first hypothesis, that music creates identifiable patterns in the EEG signals, we include all participants' data in the training data. An immediate question is how much of one participant's response should be included in the training data to predict the song ID. To answer this, we vary the train-test split from 20% to 95%: an x% train-test split means that x% of the data is chosen as test data while the remaining (100 - x)% is treated as training data. For each split, the test samples are selected randomly.
We analyzed the effect of the time window on the performance of the CNN in the time domain. We created three separate datasets with the time window set to 1, 2, and 3 seconds, giving song image shapes of 128*250, 128*500, and 128*750, respectively. We did not increase the time window beyond 3 seconds, as the number of examples in the dataset would be reduced significantly. The 9 participants with an EEG sampling frequency of 250 Hz were used for these datasets. Similar steps were applied to the participants whose EEG sampling rate was 1000 Hz; to maintain symmetry, we randomly chose 9 of these 11 participants. In the frequency domain, we did not increase the time window over which the FFT is calculated, because the 1-second time window already gave reasonable results. We also analyzed the effect of a higher sampling frequency. For this, we chose the 11 participants whose sampling rate was 1000 Hz, which makes the maximum frequency component 500 Hz and changes the song image shape to 128*501; the previous model was trained on this new data. This result cannot be compared directly with the earlier model, where the song image was 128*126, because the earlier data included 20 participants. For a fair comparison, we therefore took the same 11 participants, discarded all frequencies from 127 Hz to 501 Hz, and trained the model on this data as well.

Figure 6: 2D CNN architecture followed by a dense network for song classification
To investigate our second hypothesis, that music creates different patterns in different people, we used the responses of 5 randomly chosen participants as the test dataset, with the training data comprising the remaining 15 participants' responses. To improve the results for this task, we developed a 3D CNN model. All parameters remain the same as in the previous model, except that the kernel is changed to 3*3*3 and the max-pool layer to 2*2*2. We stacked 10 consecutive song images and fed them to the first layer of the 3D CNN as input. The choice of 10 was made by considering the trade-off between the number of samples in the data and the dimensionality of each 3D input sample.
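A small sketch of how 10 consecutive song images can be stacked into a single 3D input sample (assuming an array images of consecutive 128 x 126 frequency-domain song images from one song):

```python
import numpy as np

def stack_song_images(images, depth=10):
    """Group consecutive song images (n, 128, 126) into non-overlapping
    stacks of `depth` images, giving 3D samples of shape 128 x 126 x depth."""
    n_stacks = images.shape[0] // depth
    stacks = [np.stack(images[i * depth:(i + 1) * depth], axis=-1)
              for i in range(n_stacks)]
    return np.array(stacks)

# 120 one-second song images of one song -> 12 samples of shape (128, 126, 10),
# suitable for Conv3D layers with 3x3x3 kernels and 2x2x2 max-pooling.
samples = stack_song_images(np.random.randn(120, 128, 126))
print(samples.shape)   # (12, 128, 126, 10)
```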
5 RESULT AND DISCUSSION
When each participant's response is included in both the training and test data, our model achieves an outstanding training accuracy of 90.90% and a test accuracy of 84.96% in the frequency domain (model 7, train-test split = 0.3, validation split = 0.2). We outperform the state-of-the-art 24.7% accuracy for within-participant classification obtained by the 2-layer CNN model (train-test split = 0.33, validation split = 0.33) in [15]. Although the dataset in [15] is different from ours, the study is very similar in terms of problem statement, methodology, and pre-processing steps. Our dataset is more sophisticated, as it includes EEG recordings of songs with different tones, rhythms, pitches, and lyrics, rather than only songs with different tones. A notable accuracy of 22.12% is observed when we train the model on only 5% of the total data (train-test split = 0.95), almost the same as the accuracy in [15]; a detailed discussion appears at the end of this section. The confusion matrix is shown in Fig. 8a. It shows that the test data is well spread across all 12 classes and that almost all samples are correctly classified. Fig. 7b shows the accuracy vs. epoch curve for the above model. Table 2 summarizes the accuracy of all the CNN models. The models trained on the time-domain datasets hardly learnt anything for the song identification task; changing the time window and sampling frequency did not change the performance of the CNN in the time domain. However, the same CNN architecture obtained high accuracy when trained on the frequency-domain dataset. The performance of the CNN in the frequency domain could be due either to learning the temporal patterns in the EEG or to learning from other participants' responses. To examine the latter possibility, we retrained the same CNN model, this time excluding 5 participants entirely from the training data. We obtained 86.95% training accuracy, but the model reported only 7.73% test accuracy (model 10). We extended this experiment by training the 3D CNN model and observed 9.44% test accuracy for cross-participant data (model 11). This shows that the CNN depends on the temporal features in each participant's response for song prediction. It also suggests that the EEG patterns generated by music entrainment differ from person to person for the same song. In contrast, 42.7% test accuracy was obtained in [12] for a cross-participant classification task using engineered features computed with Hidden Markov Models (HMMs); their accuracy increases to 99.8% when the EEG response is ignored and only acoustic features are used. We also studied the high-frequency signals generated in the brain due to music entrainment. For this, we chose the participants whose data was collected at 1000 Hz, which ensures a high maximum frequency component (up to 500 Hz) in the song image. All participant responses were included in the training and test data. Two models of the same architecture were developed: one trained on data containing all frequency components up to 500 Hz, and the other trained on data containing only the first 126 frequency components. Both performed almost equally well, giving 80.99% and 76.19% test accuracy, respectively (models 8 and 9). This indicates that higher EEG frequencies do not contribute much to the pattern generated while listening to music. To investigate how much of each participant's data should be included in the training data to predict the song ID on the test data, we varied the train-test split from 20% to 95%. Fig. 7a shows the accuracy plot for the different train-test split values.
Table 2: Performance of the CNN models

CNN Model | Domain                                    | Song image shape | Total number of song images | CNN trainable parameters | Train accuracy (%) | Test accuracy (%)
----------|-------------------------------------------|------------------|-----------------------------|--------------------------|--------------------|------------------
1         | Time                                      | 128*250          | 11,772                      | 3,398,476                | 8.60               | 8.01
2         | Time                                      | 128*500          | 5,832                       | 6,953,804                | 8.64               | 7.48
3         | Time                                      | 128*750          | 3,888                       | 10,623,820               | 8.09               | 8.82
4         | Time                                      | 128*1000         | 14,338                      | 14,179,148               | 8.65               | 8.05
5         | Time                                      | 128*2000         | 7,194                       | 28,515,148               | 8.79               | 7.45
6         | Time                                      | 128*3000         | 3,597                       | 42,851,148               | 8.48               | 8.29
7         | Frequency                                 | 128*126          | 34,080                      | 1,678,156                | 90.90              | 84.96
8         | Frequency                                 | 128*500          | 18,774                      | 5,823,308                | 80.01              | 80.99
9         | Frequency                                 | 128*126          | 18,774                      | 1,678,156                | 91.31              | 76.19
10        | Frequency (cross-participant)             | 128*126          | 34,080                      | 1,678,156                | 86.95              | 7.73
11        | Frequency (3D input, cross-participant)   | 128*126*10       | 3,600                       | 1,925,228                | 9.44               | 9.44
Figure 7: Accuracy plots: (a) change in test accuracy for different train-test split values; (b) training and validation curves
We obtained a remarkable test accuracy of 78.12% by training the CNN model on 20% of the total data (train-test split = 0.8). In other words, by learning from approximately 17 seconds of the EEG response to a roughly 120-second music stimulus, we were able to predict the song ID for the remaining 103 seconds with 78% correct prediction probability. Even more notable is the 22.12% accuracy at a 0.95 train-test split. This performance is much better than a random guess, which is 8.33% for this 12-class classification problem. It also indicates that our model is not overfitting, given its decent generalization accuracy with very little training data. Figs. 8a, 8b, and 8c show the confusion matrices for the 0.3, 0.5, and 0.95 train-test split ratios, respectively. We have also visualized the intermediate CNN outputs. Fig. 9 shows the output of the 3rd convolution layer of model 7. For the same song ID (9), participants 1901 and 1905 yield different learnt features: filters 1, 7, 14, and 15 show different patterns in Figs. 9a and 9b. This supports our second hypothesis that the EEG patterns vary from person to person for a given song. A similar observation applies to Figs. 9c and 9d.
6 CONCLUSION
In this paper, we proposed an approach to identify a song from the brain activity recorded while a person listens to it. We worked on our own data, collected from 20 participants listening to 12 roughly two-minute songs with diverse tones, pitches, rhythms, and vocals. In particular, we were able to classify songs from only a 1-second-long EEG response in the frequency domain by implementing a deep learning model that acts as an automatic feature extractor; the CNN model failed in the time domain, however. We developed a simple yet efficient 3-layer deep learning model in the Keras framework. The results show that identifiable patterns are generated in the brain during music entrainment, and we were able to detect them when each participant's EEG response was included in both the training and test data. Our model performed poorly when some participants were completely excluded from the training data, and in that setting a manual feature extraction approach worked better than automatic feature extraction by the deep learning model. This gives us insight into the different patterns created when different people listen to the same song; a possible reason is that people focus on different tones or vocals during music entrainment, thereby reducing performance on the cross-participant song identification task. As future work, we aim to acquire more data and to explore other preprocessing methods and CNN architectures to improve accuracy on cross-participant data. Nevertheless, the results achieved in this paper are highly encouraging and provide an essential step towards the ambitious goal of mind reading.
Figure 8: Confusion matrices for (a) train-test split = 0.3, (b) train-test split = 0.5, and (c) train-test split = 0.95
Figure 9: Convolution layer 3 output for (a) participant ID 1901, song ID 9; (b) participant ID 1905, song ID 9; (c) participant ID 1901, song ID 3; (d) participant ID 1905, song ID 3
REFERENCES
[1] Gavin Bidelman and Louise Powers. 2018. Response properties of the human frequency-following response (FFR) to speech and non-speech sounds: level dependence, adaptation and phase-locking limits. International Journal of Audiology 57, 9 (2018), 665–672.
[2] Elvira Brattico, Mari Tervaniemi, Risto Näätänen, and Isabelle Peretz. 2006. Musical scale properties are automatically processed in the human auditory cortex. Brain Research 1117, 1 (2006), 162–174.
[3] F. Chollet. 2015. Keras. https://github.com/fchollet/keras.
[4] Chris Foster, Dhanush Dharmaretnam, Haoyan Xu, Alona Fyshe, and George Tzanetakis. 2018. Decoding music in the human brain using EEG data. In 2018 IEEE 20th International Workshop on Multimedia Signal Processing (MMSP). IEEE, 1–6.
[5] Nick Gang, Blair Kaneshiro, Jonathan Berger, and Jacek P. Dmochowski. 2017. Decoding Neurally Relevant Musical Features Using Canonical Correlation Analysis. In ISMIR. 131–138.
[6] John-Dylan Haynes and Geraint Rees. 2006. Decoding mental states from brain activity in humans. Nature Reviews Neuroscience 7, 7 (2006), 523–534.
[7] Forrest Iandola, Matt Moskewicz, Sergey Karayev, Ross Girshick, Trevor Darrell, and Kurt Keutzer. 2014. DenseNet: Implementing efficient convnet descriptor pyramids. arXiv preprint arXiv:1404.1869 (2014).
[8] Isaak Kavasidis, Simone Palazzo, Concetto Spampinato, Daniela Giordano, and Mubarak Shah. 2017. Brain2Image: Converting brain signals into images. In Proceedings of the 25th ACM International Conference on Multimedia. 1809–1817.
[9] Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2015. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3431–3440.
[10] Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In ICML.
[11] Sylvie Nozaradan. 2014. Exploring how musical rhythm entrains brain activity with electroencephalogram frequency-tagging. Philosophical Transactions of the Royal Society B: Biological Sciences 369, 1658 (2014), 20130393.
[12] Stavros Ntalampiras and Ilyas Potamitis. 2019. A statistical inference framework for understanding music-related brain activity. IEEE Journal of Selected Topics in Signal Processing 13, 2 (2019), 275–284.
[13] Shankha Sanyal, Sayan Nag, Archi Banerjee, Ranjan Sengupta, and Dipak Ghosh. 2019. Music of brain and music on brain: a novel EEG sonification approach. Cognitive Neurodynamics 13, 1 (2019), 13–31.
[14] Joel S. Snyder and Edward W. Large. 2005. Gamma-band activity reflects the metric structure of rhythmic tone sequences. Cognitive Brain Research 24, 1 (2005), 117–126.
[15] Sebastian Stober, Daniel J. Cameron, and Jessica A. Grahn. 2014. Using Convolutional Neural Networks to Recognize Rhythm Stimuli from Electroencephalography Recordings. In Advances in Neural Information Processing Systems. 1449–1457.
[16] Sebastian Stober, Thomas Prätzlich, and Meinard Müller. 2016. Brain Beats: Tempo Extraction from EEG Data. In ISMIR. 276–282.
[17] Sebastian Stober, Avital Sternin, Adrian M. Owen, and Jessica A. Grahn. 2015. Towards Music Imagery Information Retrieval: Introducing the OpenMIIR Dataset of EEG Recordings from Music Perception and Imagination. In ISMIR. 763–769.
[18] Yi Yu, Samuel Beuret, Donghuo Zeng, and Keizo Oyama. 2018. Deep learning of human perception in audio event classification. In 2018 IEEE International Symposium on Multimedia (ISM). IEEE, 188–189.