
GuessTheMusic: Song Identification from Electroencephalography response
Dhananjay Sonawane
Indian Institute of Technology Gandhinagar
Gandhinagar, Gujarat
Krishna Prasad Miyapuram
Indian Institute of Technology Gandhinagar
Gandhinagar, Gujarat
Bharatesh RS
Indian Institute of Technology Gandhinagar
Gandhinagar, Gujarat
Derek J. Lomas
Delft University of Technology, Netherlands
ABSTRACT

The music signal comprises different features like rhythm, timbre, melody, and harmony. Its impact on the human brain has been an exciting research topic for the past several decades. Electroencephalography (EEG) enables the non-invasive measurement of brain activity. Leveraging recent advancements in deep learning, we propose a novel approach for song identification using a convolutional neural network given the electroencephalography (EEG) responses. We recorded EEG signals from a group of 20 participants while they listened to a set of 12 song clips, each of approximately 2 minutes, presented in random order. The repeating nature of music is captured by a data slicing approach that treats brain signals of 1-second duration as representative of each song clip. More specifically, we predict the song corresponding to 1 second of EEG data given as input, rather than a complete two-minute response. We also discuss pre-processing steps to handle the large dimensionality of the dataset, and various CNN architectures. For all the experiments, we considered each participant's EEG response for each song in both train and test data. We obtained 84.96% accuracy at a 0.3 train-test split ratio. Moreover, our model gave commendable results compared to chance-level probability when trained on only 10% of the total dataset. The observed performance supports the notion that listening to a song creates specific patterns in the brain, and that these patterns vary from person to person.
CCS CONCEPTS

• Computing methodologies → Neural networks; Supervised learning by classification; Cognitive science.
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than the
author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or
republish, to post on servers or to redistribute to lists, requires prior specific permission
and/or a fee. Request permissions from permissions@acm.org.
CODS COMAD 2021, January 2–4, 2021, Bangalore, India
©2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-8817-7/21/01. . . $15.00
KEYWORDS

EEG, CNN, neural entrainment, music, frequency following response, brain signals, classification
ACM Reference Format:
Dhananjay Sonawane, Krishna Prasad Miyapuram, Bharatesh RS, and Derek J. Lomas. 2021. GuessTheMusic: Song Identification from Electroencephalography response. In 8th ACM IKDD CODS and 26th COMAD (CODS COMAD 2021), January 2–4, 2021, Bangalore, India. ACM, New York, NY, USA, 9 pages.
1 INTRODUCTION

Audio is a type of time-series signal characterized by frequency and amplitude. Music signals are a particular type of audio signal that possesses specific acoustic and structural features. Accordingly, one would expect that music affects different parts of the brain compared to other audio signals. Nevertheless, how closely is brain activity related to the perception of a periodic signal like music? Electroencephalography (EEG) is a method to measure the electrical activity generated by the synchronized activity of neurons. There is a plethora of evidence published by researchers to bolster the linkage between EEG responses and music. Brattico et al. [ ] showed that the brain anticipates melodic information before the onset of the stimulus and that it is processed in the secondary auditory cortex. A study on the processing of rhythms by Snyder et al. found that gamma activity in the EEG response corresponds to the beats in simple rhythms [ ]. A recent study showed that it is possible to extract tempo, a critical feature of music stimuli, from the EEG signal [ ]. The authors concluded that the quality of tempo estimation was highly dependent on the music stimulus used. The frequency of the neural response generated after entrainment to music is highly related to its beat [ ]. Further, a few researchers carried out Canonical Correlation Analysis to estimate the correlation coefficient of music stimuli with EEG data [5, 13].

However, work on patterns of brain activity reflecting neural entrainment during music listening, and on their recognition, is still at an early stage. These patterns are intricate, and thus it is hard to interpret what is happening in the human brain when a person is listening to a song. Moreover, the aesthetic experience associated with music listening is highly subjective, i.e., it varies from person to person and also from time to time depending on
various contextual factors such as the mood of the individual who is listening. That is why the song identification task is challenging. Previous research has focused on the relationship between a song and its brain (EEG) responses, using engineered features for processing the EEG data that depend on domain knowledge. There have been few attempts at automatic feature extraction from EEG data using neural networks for the song classification task [18].
Taking the notion of the resonance between EEG signals and music stimuli, in this paper we hypothesize the following: 1) music stimuli create identifiable patterns in EEG responses; 2) for a given song, these patterns vary from person to person. We pose these hypotheses as a song identification task using a deep learning architecture. To study the first hypothesis, we split each participant's EEG response for each song into training and test datasets. We explored how large the training data should be and the effect of training data size on the performance of the model. For a given participant, the model learns the song pattern present in the EEG response from training data and tries to predict the song ID for test data. For the second hypothesis, we exclude some participants entirely from the training dataset. During data preprocessing, the raw EEG response is divided into segments, each chunk corresponding to 1 second of EEG response. Our model predicts the song ID for each such chunk present in the test data. Each chunk is represented as a 2D matrix, which we call a "song image". Data preprocessing is discussed in more detail in section 3.2. This formulation allows us to use 2D and 3D convolutional neural networks, which are usually used in the computer vision and image processing fields. The features extracted from the song image by the CNN are fed to a multilayer perceptron network for the classification task. Our results outperform state-of-the-art accuracy.
The remainder of the paper is organized as follows: section 2 describes prior work on the song classification problem using EEG data and its results. Section 3 reports our methods, including data collection, pre-processing steps, and the CNN architectures used. In section 4, we discuss the performance of our model, and in section 5, we draw conclusions on the cognitive process behind music perception and suggest possible future work.
2 RELATED WORK

It has been shown that human mental states can be unraveled from non-invasive measurements of brain activity such as functional MRI and EEG [ ]. Several researchers have documented the frequency following response (FFR), which is the potential induced in the brain while listening to periodic or nearly periodic audio [ ]. A successful attempt has been made to reconstruct perceptual input from EEG responses. In [ ], the objective of the study was as follows: a person looks at an image, and brain activity is captured by EEG in real time; the EEG signals are then analyzed and transformed into an image semantically similar to the one the person is looking at. They modeled this Brain2Image system using variational autoencoders (VAE) and generative adversarial networks (GAN). The work in [ ] relates music and its brain activity using a statistical framework. The authors study the classification of musical content via individual EEG responses by targeting three tasks: stimulus-specific classification; group classification, i.e., songs recorded with lyrics, songs recorded without lyrics, and instrumental pieces; and meter classification, i.e., 3/4 vs. 4/4 meter. They used the OpenMIIR dataset [ ], which includes response data from 10 subjects who listened to 12 music fragments, with durations ranging from 7 s to 16 s, taken from popular musical pieces. They proposed a Hidden Markov Model and a probabilistic computation method for the developed model, which was trained on 9 subjects and tested on the 10th subject. They achieved 42.7%, 49.6%, and 68.7% classification rates for task 1, task 2, and task 3, respectively. Foster et al. investigated the correlation between EEG responses and music features extracted with the librosa library in Python [ ]. Using representational similarity analysis, they report the correlation coefficient of EEG data with normalized tempogram features as 0.63 and with MFCCs as 0.62. They also deal with the problem of song identification from EEG data and obtained 28.7% accuracy using a logistic regression model. Our study stands out in terms of methodology, as we exploit the power of deep learning architectures for automatic feature extraction from EEG responses.
Yi Yu et al. used a convolutional neural network called DenseNet [7] for audio event classification [ ]. The EEG responses were collected from 9 male participants. The audio stimuli were 10 seconds long and spanned 8 different categories (Chant, Child singing, Choir, Female singing, Male singing, Rapping, Synthetic singing, and Yodeling). They achieved 69% accuracy using EEG data only. However, the optimal result was 81%, where they used audio features extracted by another convolutional network, VGG16 [9], along with the EEG response. Sebastian Stober et al. aimed to classify 24 music stimuli [ ]. Each music segment comprised 2 unique rhythms played at a different pitch. Due to the small amount of data, they processed and classified each EEG channel individually. The CNN was trained on 84% of the complete response (approximately a 27-second chunk out of 32 seconds), validated on 8% of the complete response (approximately a 2.5-second chunk), and tested on 8% of the complete response (approximately a 2.5-second chunk). They report 24.4% accuracy. In this study, we deal with more complex data: our music segments comprise diverse tones, rhythms, and pitches, and some of them also include vocals. This makes our song identification task more challenging. The proposed architecture is also much simpler compared to DenseNet [7] and VGG16 [9].
We have collected a new dataset for this study. The overall data analysis process can be divided into two subparts: data collection and data preprocessing.
3.1 Data Collection
Participants were seated in a dimly lit room. We then collected demographic information such as age, gender, and handedness. Brief information regarding the EEG collection setup, the time the experiment would take, and the responses they had to make was discussed with all participants. We then measured the circumference of each participant's head to select a suitable EEG cap. The 128-channel high-density Geodesic electrode net cap (Hydrocel Geodesic Sensor Net platform, Electrical Geodesics Inc., USA, now Philips) was chosen according to the head-size measurement. The cap was immersed in a KCl electrolyte solution prepared in 1 litre of pure
Figure 1: Illustration of the 128-channel system and electrode positions
distilled water. The reference electrode position is measured and marked as the intersection of the line between the nasion (the point between the eyebrows) and the inion (the midpoint where the skull ends at the back) with the line between the preauricular points on both sides. The other electrodes are placed according to the International 10-20 system, as depicted in Fig. 1.
After this setup, participants were asked to close their eyes on a single beep tone. This was followed by 10 seconds of silence, after which the song stimulus was presented. At the end of each stimulus, a double beep tone was sounded, at which point the participants were instructed to open their eyes and make a response. They were asked two questions to rate their familiarity with and enjoyment of the song on a scale of 1 to 5. Since the maximum length of a song is 132 seconds, all other responses were zero-padded accordingly. Therefore, all song responses are 142 seconds long after considering the above window. Data from a total of 20 participants was collected on 12 music stimuli. The songs used in the experiment are listed in Table 1 along with their genres; they contain some tonal and vocal excerpts. The sampling rate for 11 participants was 1000 Hz, and for the remaining 9 participants it was 250 Hz. 16 participants were male, while 4 were female. All of them were right-handed, with an average age of 25.3 years and a standard deviation of 3.38. All music stimuli were presented to each subject in random order.
3.2 Data Preprocessing
EEG is highly sensitive to noise. It also captures eye blinks and high-frequency muscle movements. Therefore, it is necessary to clean EEG data before using it for any application. The EEGLAB toolbox was used to implement the majority of the preprocessing steps. Once the channel locations were provided, we performed re-referencing with respect to the average of all channels. The raw EEG signal was then loaded as epochs for each presented song, creating 12 epochs for each participant. By this, we eliminated the signals that do not lie in our area of interest. We then used independent component analysis (ICA) to remove artifacts. This was achieved
Table 1: Songs used in EEG data collection

| Song ID | Song Name                | Genre       | Artist                        | Song Length (in sec) |
|---------|--------------------------|-------------|-------------------------------|----------------------|
| 1       | Trip to the lonely planet| Electronics | Mark Alow                     | 125                  |
| 2       | Sail                     | Indie       | Awolnation                    | 114                  |
| 3       | Concept 15               | Electronics | Kodomo                        | 132                  |
| 4       | Aurore                   | New age     | Claire David                  | 111                  |
| 5       | Proof                    | Dance       | Idiotape                      | 124                  |
| 6       | Glider                   | Ambient     | Tycho                         | 100                  |
| 7       | Raga Behag               | Classical   | Pandit Hari Prasad Chaurasiya | 116                  |
| 8       | Albela sajan             | Classical   | Shankar Mahadevan             | 121                  |
| 9       | Mor Bani Thanghat Kare   | Dance       | Aditi Paul                    | 126                  |
| 10      | Red Suit                 | Rock        | DJ David. G                   | 129                  |
| 11      | Fly Kicks                | Hip Hop     | DJ Kimera                     | 113                  |
| 12      | JB                       | Rock        |                               | 117                  |
using the 'runica' algorithm in MATLAB. We used the ADJUST toolbox for artifact removal. For simplicity, positive infinities, negative infinities, and NaN (not a number) values were replaced by zero. However, taking the average value of the surrounding electrodes for these outliers would be a better approach and may improve performance. The above steps produce the final ready-to-use data.
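The inf/NaN replacement described above amounts to a single NumPy call (an illustrative sketch; the paper's pipeline performs this step inside MATLAB/EEGLAB):

```python
import numpy as np

# A toy epoch containing the three kinds of invalid values mentioned
epoch = np.array([[0.5, np.nan, -1.2],
                  [np.inf, 0.1, -np.inf]])

# Replace NaN, +inf, and -inf with zero, as done in the paper
clean = np.nan_to_num(epoch, nan=0.0, posinf=0.0, neginf=0.0)
print(clean)
```

Replacing these outliers with the average of the surrounding electrodes, as the text suggests, would be a straightforward extension of this step.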
For deep learning models, we would need extensive data with reasonable feature dimensions to detect a pattern. In our task, we had the opposite situation. Our data contains 240 EEG responses corresponding to 20 participants and 12 song stimuli. However, the number of samples collected at one electrode for one song of one participant is greater than 27,000 at the 250 Hz sampling rate, and goes beyond 100,000 at the 1000 Hz sampling rate. Thus, our data consists of a few examples with very high dimensionality.
A data augmentation technique was used to increase the amount of data. We split the EEG response of each participant for each song into chunks using a 1-second-long window, giving us 2D matrices of dimension 128 (electrodes) * 250 or 1000 (samples per second). We call them "song images" and label them with the song ID of the corresponding song. Such a formulation not only increases the number of examples in the original dataset but also allows us to use 2D and 3D convolutional networks. Fig. 2 shows the song images of the 26th second of two different songs for one of the participants in the time domain. However, the window size is a design parameter, and it is difficult to decide the optimal window size for obtaining song images. A larger window size increases the dimensionality of a song image and decreases the total number of song examples, thereby defeating the purpose of the data augmentation. High-dimensional input to a convolutional neural network (CNN) also drastically increases the number of trainable parameters, provided the rest of the architecture remains the same. A smaller window size will carry less information about the EEG signal, and the CNN may perform poorly. As the sampling rate changes, the 1-second time window will carry a different number of samples in
(a) Song ID - 6
(b) Song ID - 7
Figure 2: Song images of the 26th second for participant ID 1902 in the time domain
the song image. This causes inconsistency in the input data to the CNN. We needed more concrete preprocessing steps so that a smaller window size would increase the number of examples in the dataset, but not at the cost of performance.

Fig. 3 illustrates the Fast Fourier Transform (FFT) of all 12 song responses for one participant. It is worth noting that, in all the FFTs, the maximum frequency component is less than 100 Hz. This is expected, as EEG is characterized by frequency bands - 0 Hz - 4 Hz (delta), 4 Hz - 8 Hz (theta), 8 Hz - 15 Hz (alpha), 15 Hz - 32 Hz (beta), and frequencies above 32 Hz (gamma) - and exhibits high power in the low frequency ranges.
Using the spectopo function in EEGLAB, we converted the time-domain EEG data into the frequency domain. Spectopo calculates the amplitude of each frequency component present in each 1-second window of the EEG response. The maximum frequency component in the frequency-domain representation of the data is chosen as per the Nyquist criterion: it is 125 Hz and 500 Hz when the sampling rate is 250 Hz and 1000 Hz, respectively. Regardless of the sampling frequency of the EEG, we can safely choose 125 Hz as the maximum frequency component. Fig. 4 explains the dimensionality and the conversion of time-domain data to frequency-domain data. Frequency-domain data helps in dimensionality reduction and makes the input dataset consistent and compact. However, the time window for which we calculate the FFT is again a design parameter. The effect of the time window
Figure 3: FFT of the EEG responses of participant ID 1902 for all 12 songs
Figure 4: Time-domain to frequency-domain conversion for one participant's data using spectopo
on the performance of the CNN in both the time and frequency domains is further addressed in the next section.
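An equivalent of this conversion can be sketched in NumPy with one FFT per 1-second window (the paper uses EEGLAB's spectopo in MATLAB; the function below is our own illustrative stand-in, keeping bins up to 125 Hz so every song image has the same shape regardless of sampling rate):

```python
import numpy as np

def to_frequency_song_image(window, fs, f_max=125):
    """Convert one 1-second (channels, fs) time-domain window into a
    (channels, f_max + 1) amplitude spectrum, one bin per Hz."""
    spectrum = np.abs(np.fft.rfft(window, axis=1)) / fs  # amplitude per bin
    # For a 1-second window the rfft bins fall exactly at 0, 1, ..., fs/2 Hz,
    # so keeping the first f_max + 1 bins keeps 0..125 Hz.
    return spectrum[:, :f_max + 1]

# 250 Hz and 1000 Hz windows both map to the same 128*126 shape
for fs in (250, 1000):
    img = to_frequency_song_image(np.random.randn(128, fs), fs)
    print(fs, img.shape)  # (128, 126) in both cases
```

Keeping only the sub-125 Hz bins is what makes the 128*126 song images in Table 2 consistent across the two sampling rates.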
In the frequency domain, the rows of a song image denote electrodes while the columns denote frequencies. Fig. 5 shows the song images of the 26th second of two different songs for participant 1902 in the frequency domain.
(a) Song ID - 6
(b) Song ID - 7
Figure 5: Song images of the 26th second for participant ID 1902 in the frequency domain
We follow standard machine learning practice to develop the model. The train-test split is 0.3, with random selection of song images into training (70%) and testing (30%) data. We set the shuffle and random_state parameters of the train_test_split() function to True and None, respectively, throughout the experiment, which ensures full randomness. The validation split parameter is set to 0.2.
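The split described above can be reproduced with scikit-learn's train_test_split (a sketch with dummy arrays; the parameter values are the ones stated in the text):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy dataset: 1000 frequency-domain song images with 12 class labels
X = np.random.randn(1000, 128, 126)
y = np.random.randint(0, 12, size=1000)

# 70% train / 30% test, shuffled, no fixed seed (random_state=None)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=None)

print(X_train.shape, X_test.shape)  # (700, 128, 126) (300, 128, 126)
```

The further 0.2 validation split is applied at training time via Keras's validation_split argument rather than here.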
The aim of this work is to create an approach to classify EEG responses according to the corresponding song event. This section describes CNN architecture and model development, followed by model testing.
4.1 Experiment and Model development
A convolutional neural network is the core part of our model because it learns the underlying pattern in the song image. It includes many hyperparameters that need to be set carefully. We apply the CNN to both the time-domain and frequency-domain song image datasets. The CNN architecture remains the same except for the input layer, where the shape of the input song image changes as per the domain and time window.

We created a 3-layer CNN network for feature extraction and a 2-layer dense neural network for song classification. Except for the final output layer, each convolutional layer as well as the dense layer has the ReLU [ ] activation function, which brings non-linearity into the architecture and helps to detect complex patterns. Since we are doing multiclass classification, the output layer has a softmax activation function. The loss function used is categorical cross-entropy, given by
$$ L = -\sum_{c=1}^{M} y_{o,c} \log(p_{o,c}) $$

where $M$ is the total number of classes (12 in our case), $y_{o,c}$ is a binary indicator that class label $c$ is the correct classification for observation $o$, and $p_{o,c}$ is the predicted probability that observation $o$ is of class $c$.
We used the Adam optimizer to minimize the categorical cross-entropy loss. The kernel size is 3*3, and we used 16 such filters at each convolutional layer. Two max-pooling layers were added after convolutional layers 2 and 3; their exclusion almost doubles the total number of trainable parameters, thereby increasing network complexity and training time. Fig. 6 shows the 2D CNN architecture. Removing one or more layers from the architecture mentioned above resulted in an underdetermined system that failed to learn all the patterns, while adding extra layers led to overfitting and thus reduced performance.

We used 30 epochs for training the CNN. The proposed network architecture is implemented in the Keras framework [ ], which also handles the random initialization of weights. The NVIDIA GTX 1050 GPU used for these experiments has 4 GB of RAM. The batch size was kept at 16 for all the experiments because of memory constraints.
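A minimal Keras sketch of the described architecture: 3 convolutional layers with 16 3*3 filters each, max-pooling after convolutional layers 2 and 3, a 2-layer dense head with softmax over 12 classes, and Adam with categorical cross-entropy. The dense layer width of 128 is our assumption; the paper does not state it:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(input_shape=(128, 126, 1), n_classes=12):
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(16, (3, 3), activation="relu"),
        layers.Conv2D(16, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),   # after conv layer 2
        layers.Conv2D(16, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),   # after conv layer 3
        layers.Flatten(),
        layers.Dense(128, activation="relu"),  # hidden width is an assumption
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_model()
# Training as described in the text:
# model.fit(X_train, y_train_onehot, batch_size=16, epochs=30,
#           validation_split=0.2)
print(model.output_shape)  # (None, 12)
```

For the time-domain experiments only input_shape would change (e.g. (128, 250, 1)), matching the paper's statement that the architecture is otherwise fixed.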
4.2 Model Testing
To study our first hypothesis, that music creates an identifiable pattern in brain EEG signals, we include all participants' data in the training data. An immediate problem is how much of one participant's response should be included in the training data to predict the song ID. To answer this question, we vary the train-test split from 20% to 95%. An x% train-test split means x% of the data is chosen as test data while the remaining (100 - x)% is treated as training data. For each split, we randomly select the test samples.
We analyzed the effect of the time window on the performance of the CNN in the time domain. We created 3 separate datasets with the time window set to 1 second, 2 seconds, and 3 seconds, having song image shapes of 128*250, 128*500, and 128*750, respectively. We did not increase the time window beyond 3 seconds, as the number of examples in the dataset was reduced significantly. The 9 participants with an EEG sampling frequency of 250 Hz were chosen for these datasets. Similar steps were applied to the participants whose EEG sampling rate was 1000 Hz; to maintain symmetry, we randomly chose 9 participants out of 11. In the frequency domain, we did not increase the time window for which the FFT is calculated, because the 1-second time window gave reasonable results. We also analyzed the effect of a higher sampling frequency. For this, we chose the 11 participants for whom the sampling rate was 1000 Hz. This resulted in the maximum frequency component being 500 Hz, so the song image shape changes to 128*501. The previous model was trained on this new data. We could have compared this result with the earlier model, where the song image was 128*126, but the earlier data had 20 participants. For a fair comparison, we chose the same 11 participants and discarded
Figure 6: 2D CNN architecture followed by a dense network for song classification
all the frequencies from 127 Hz to 501 Hz. We trained the model on this data as well.
To investigate our second hypothesis (music creates a different pattern in a different person), we used 5 randomly chosen participants' responses as the test dataset. The training data included the remaining 15 participants' responses. To improve the results for this task, we developed a 3D CNN model. All the parameters remain the same as in the previous model, except that the kernel changed to 3*3*3 and the max-pooling layer was modified to 2*2*2. We stacked 10 consecutive song images and fed them to the first layer of the 3D CNN as input. The choice of 10 was made by considering the trade-off between the number of samples in the data and the dimensionality of each 3D input sample.
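The stacking of 10 consecutive song images into one 3D input can be sketched with NumPy (an illustrative sketch; the shape convention matches the 128*126*10 input reported in Table 2):

```python
import numpy as np

def stack_song_images(images, depth=10):
    """Group consecutive (128, 126) song images into non-overlapping
    stacks of `depth`, shaped (n_stacks, 128, 126, depth) for a 3D CNN."""
    n = (len(images) // depth) * depth  # drop the leftover tail
    stacks = images[:n].reshape(-1, depth, *images.shape[1:])
    # Move the temporal axis last: (n_stacks, 128, 126, depth)
    return np.moveaxis(stacks, 1, -1)

# 125 one-second images from one song -> 12 stacks of 10 (5 left over)
song_images = np.random.randn(125, 128, 126)
stacks = stack_song_images(song_images)
print(stacks.shape)  # (12, 128, 126, 10)
```

Each stack inherits the song ID of the images it was built from, exactly as with the 2D song images.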
When each participant's response is included in both the train and test data, our model achieves an outstanding training accuracy of 90.90% and a test accuracy of 84.96% in the frequency domain (model 7, train-test split = 0.3, validation split = 0.2). We outperformed the state-of-the-art 24.7% accuracy for within-participant classification obtained by a 2-layer CNN model (train-test split = 0.33, validation split = 0.33) in [ ]. Though the dataset in [ ] is different from ours, the study is very similar in terms of problem statement, methodology, and pre-processing steps. Our dataset is more sophisticated, as it includes EEG recordings for songs with different tones, rhythms, pitches, and lyrics, rather than songs differing only in tone. A notable accuracy of 22.12% is observed when we train the model on only 5% of the total data (train-test split = 0.95), almost the same as that of [ ]. A detailed discussion appears at the end of this section. The confusion matrix is shown in Fig. 8a: the test data is well spread across all 12 classes, and almost all samples are correctly classified. Fig. 7b shows the accuracy vs. epoch curve for the above model.

Table 2 summarizes the accuracy of all the CNN models. The models trained on the time-domain datasets hardly learnt anything for the song identification task, and changing the time window and sampling frequency did not change the performance of the CNN in the time domain. However, the same CNN architecture obtained high accuracy when trained on the frequency-domain dataset. The performance of the CNN in the frequency domain could be due either to learning the temporal pattern in the EEG or to learning from other participants' responses. To examine the latter possibility, we retrained the same CNN model, this time excluding 5 participants entirely from the training data. We obtained 86.95% training accuracy, but the model reported only 7.73% test accuracy (model 10). We extended this experiment by training the 3D CNN model and observed 9.44% test accuracy for cross-participant data (model 11). This shows that the CNN depends on the temporal features in each participant's response for song prediction. It also suggests that the EEG patterns generated by music entrainment differ from person to person for the same song. In contrast, 42.7% test accuracy was obtained by [ ] for the cross-participant classification task using engineered features calculated with Hidden Markov Models (HMMs); their accuracy increases to 99.8% when they ignore the EEG response and use only acoustic features.

We also studied the high-frequency signals generated in the brain due to music entrainment. For this, we chose the participants whose data was collected at 1000 Hz, which ensures a high maximum frequency component (up to 500 Hz) in the song image. All participants' responses were included in both train and test data. Two models of the same architecture were developed: one trained on data containing all frequency components up to 500 Hz, and the other trained on data keeping only the first 126 Hz of frequencies out of 500 Hz. Both performed almost equally well, giving 80.99% and 76.19% test accuracy, respectively (models 8 and 9). This suggests that higher EEG frequencies do not contribute much to the pattern generated while listening to music. To investigate how much of each participant's data should be included in the training data to predict the song ID on test data, we vary the train-test split from 20% to 95%. Fig. 7a shows the accuracy plot for different train-test split
Table 2: Performance of the CNN models

| CNN Model | Domain                                  | Song image shape | Total number of song images | CNN trainable parameters | Train accuracy (%) | Test accuracy (%) |
|-----------|-----------------------------------------|------------------|-----------------------------|--------------------------|--------------------|-------------------|
| 1         | Time                                    | 128*250          | 11,772                      | 3,398,476                | 8.60               | 8.01              |
| 2         | Time                                    | 128*500          | 5,832                       | 6,953,804                | 8.64               | 7.48              |
| 3         | Time                                    | 128*750          | 3,888                       | 10,623,820               | 8.09               | 8.82              |
| 4         | Time                                    | 128*1000         | 14,338                      | 14,179,148               | 8.65               | 8.05              |
| 5         | Time                                    | 128*2000         | 7,194                       | 28,515,148               | 8.79               | 7.45              |
| 6         | Time                                    | 128*3000         | 3,597                       | 42,851,148               | 8.48               | 8.29              |
| 7         | Frequency                               | 128*126          | 34,080                      | 1,678,156                | 90.90              | 84.96             |
| 8         | Frequency                               | 128*500          | 18,774                      | 5,823,308                | 80.01              | 80.99             |
| 9         | Frequency                               | 128*126          | 18,774                      | 1,678,156                | 91.31              | 76.19             |
| 10        | Frequency, cross-participant            | 128*126          | 34,080                      | 1,678,156                | 86.95              | 7.73              |
| 11        | Frequency, 3D input, cross-participant  | 128*126*10       | 3,600                       | 1,925,228                | 9.44               | 9.44              |
(a) Change in the test accuracy for different train-test split values
(b) Training and validation curve
Figure 7: Accuracy plots
values. We obtained a remarkable test accuracy of 78.12% by training the CNN model on 20% of the total data (train-test split = 0.8). In other words, by learning from an approximately 17-second-long EEG response to a roughly 120-second music stimulus, we were able to predict the song ID for the remaining 103 seconds with 78% correct prediction probability. Even more commendable is the 22.12% accuracy at a 0.95 train-test split. This performance is much better than a random guess, which is 8.33% for this 12-class classification problem. It also indicates that our model is not overfitting, given its decent generalization accuracy with a very small training dataset. Figs. 8a, 8b, and 8c show the confusion matrices for the 0.3, 0.5, and 0.95 train-test split ratios, respectively. We have also visualized the intermediate CNN outputs. Fig. 9 shows the output of the 3rd convolutional layer of model 7. For the same song (ID 9), participants 1901 and 1905 have learnt different features: filters 1, 7, 14, and 15 show different patterns in Figs. 9a and 9b. This supports our 2nd hypothesis that EEG patterns vary from person to person for a given song. A similar observation applies to Figs. 9c and 9d.
In this paper, we proposed an approach to identify a song from the brain activity recorded while a person listens to it. We worked on our own dataset, collected from 20 participants listening to 12 two-minute songs with diverse tones, pitches, rhythms, and vocals. In particular, we were able to classify songs from only a 1-second-long EEG response in the frequency domain by implementing a deep learning model that acts as an automatic feature extractor, whereas the CNN model failed in the time domain. We developed a simple yet efficient 3-layer deep learning model in the Keras framework. The results show that identifiable patterns are generated in the brain during music entrainment: we were able to detect them when each participant's EEG response was included in both train and test data. Our model performed poorly when some participants were completely excluded from the training data, and in that setting a manual feature extraction approach worked better than automatic feature extraction by a deep learning model. This gives us insight into the different patterns created when different people listen to the same song. A possible reason is that people focus on different tones or vocals during music entrainment, thereby reducing performance on the cross-participant song identification task. Thus, as future work, we aim to acquire more data and to explore other preprocessing methods and CNN architectures to improve accuracy for cross-participant data. Nevertheless, the results achieved in this paper are highly encouraging and provide an essential step towards the ambitious goal of mind reading.
(a) Train-test split = 0.3 (b) Train-test split = 0.5 (c) Train-test split = 0.95
Figure 8: Confusion matrices
Figure 9: Convolution layer 3 output. (a) Participant ID: 1901, Song ID: 9; (b) Participant ID: 1905, Song ID: 9; (c) Participant ID: 1901, Song ID: 3; (d) Participant ID: 1905, Song ID: 3.
Gavin Bidelman and Louise Powers. 2018. Response properties of the human frequency-following response (FFR) to speech and non-speech sounds: level dependence, adaptation and phase-locking limits. International Journal of Audiology 57, 9 (2018), 665–672.
Elvira Brattico, Mari Tervaniemi, Risto Näätänen, and Isabelle Peretz. 2006. Musical scale properties are automatically processed in the human auditory cortex. Brain Research 1117, 1 (2006), 162–174.
F. Chollet. 2015. Keras. https://keras.io.
Chris Foster, Dhanush Dharmaretnam, Haoyan Xu, Alona Fyshe, and George Tzanetakis. 2018. Decoding music in the human brain using EEG data. In 2018 IEEE 20th International Workshop on Multimedia Signal Processing (MMSP). IEEE.
Nick Gang, Blair Kaneshiro, Jonathan Berger, and Jacek P Dmochowski. 2017. Decoding Neurally Relevant Musical Features Using Canonical Correlation Analysis. In ISMIR. 131–138.
John-Dylan Haynes and Geraint Rees. 2006. Decoding mental states from brain activity in humans. Nature Reviews Neuroscience 7, 7 (2006), 523–534.
Forrest Iandola, Matt Moskewicz, Sergey Karayev, Ross Girshick, Trevor Darrell, and Kurt Keutzer. 2014. DenseNet: Implementing efficient convnet descriptor pyramids. arXiv preprint arXiv:1404.1869 (2014).
Isaak Kavasidis, Simone Palazzo, Concetto Spampinato, Daniela Giordano, and Mubarak Shah. 2017. Brain2Image: Converting brain signals into images. In Proceedings of the 25th ACM International Conference on Multimedia. 1809–1817.
Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2015. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3431–3440.
Vinod Nair and Geoffrey E Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In ICML.
Sylvie Nozaradan. 2014. Exploring how musical rhythm entrains brain activity with electroencephalogram frequency-tagging. Philosophical Transactions of the Royal Society B: Biological Sciences 369, 1658 (2014), 20130393.
Stavros Ntalampiras and Ilyas Potamitis. 2019. A statistical inference framework for understanding music-related brain activity. IEEE Journal of Selected Topics in Signal Processing 13, 2 (2019), 275–284.
Shankha Sanyal, Sayan Nag, Archi Banerjee, Ranjan Sengupta, and Dipak Ghosh. 2019. Music of brain and music on brain: a novel EEG sonification approach. Cognitive Neurodynamics 13, 1 (2019), 13–31.
Joel S Snyder and Edward W Large. 2005. Gamma-band activity reflects the metric structure of rhythmic tone sequences. Cognitive Brain Research 24, 1 (2005).
Sebastian Stober, Daniel J Cameron, and Jessica A Grahn. 2014. Using Convolutional Neural Networks to Recognize Rhythm Stimuli from Electroencephalography Recordings. In Advances in Neural Information Processing Systems. 1449–1457.
Sebastian Stober, Thomas Prätzlich, and Meinard Müller. 2016. Brain Beats: Tempo Extraction from EEG Data. In ISMIR. 276–282.
Sebastian Stober, Avital Sternin, Adrian M Owen, and Jessica A Grahn. 2015. Towards Music Imagery Information Retrieval: Introducing the OpenMIIR Dataset of EEG Recordings from Music Perception and Imagination. In ISMIR. 763–769.
Yi Yu, Samuel Beuret, Donghuo Zeng, and Keizo Oyama. 2018. Deep learning of human perception in audio event classification. In 2018 IEEE International Symposium on Multimedia (ISM). IEEE, 188–189.