Does the Definition of Difficulty Matter?
Scoring Functions and their Role for
Curriculum Learning
Simon Rampp1, Manuel Milling1,2, Andreas Triantafyllopoulos1,2, Björn W. Schuller1,2,3,4
1CHI – Chair of Health Informatics, Technical University of Munich, Munich, Germany
2MCML – Munich Center for Machine Learning, Munich, Germany
3GLAM – Group on Language, Audio, & Music, Imperial College, London, UK
4MDSI – Munich Data Science Institute, Munich, Germany
{simon.rampp;manuel.milling;andreas.triantafyllopoulos;schuller}@tum.de
Abstract—Curriculum learning (CL) describes a machine learn-
ing training strategy in which samples are gradually introduced
into the training process based on their difficulty. Despite a
partially contradictory body of evidence in the literature, CL
finds popularity in deep learning research due to its promise
of leveraging human-inspired curricula to achieve higher model
performance. Yet, the subjectivity and biases that follow any
necessary definition of difficulty, especially for those found
in orderings derived from models or training statistics, have
rarely been investigated. To shed more light on the underlying
unanswered questions, we conduct an extensive study on the
robustness and similarity of the most common scoring functions
for sample difficulty estimation, as well as their potential benefits
in CL, using the popular benchmark dataset CIFAR-10 and
the acoustic scene classification task from the DCASE2020
challenge as representatives of computer vision and computer
audition, respectively. We report a strong dependence of scoring
functions on the training setting, including randomness, which
can partly be mitigated through ensemble scoring. While we do
not find a general advantage of CL over uniform sampling, we
observe that the ordering in which data is presented for CL-
based training plays an important role in model performance.
Furthermore, we find that the robustness of scoring functions
across random seeds positively correlates with CL performance.
Finally, we uncover that models trained with different CL
strategies complement each other by boosting predictive power
through late fusion, likely due to differences in the learnt concepts.
Alongside our findings, we release the aucurriculum toolkit
(https://github.com/autrainer/aucurriculum), implementing sample difficulty estimation and CL-based training in a modular fashion.
Index Terms—Curriculum Learning, Sample Difficulty, Scoring
Function Similarity, Computer Vision, Computer Audition, Deep
Learning
I. INTRODUCTION
As for many other concepts in machine learning (ML),
curriculum learning (CL) owes parts of its appeal to its relatable
inspiration from human learning: in the same way that children
are first taught clear and distinguishable concepts before more
difficult or nuanced ones [1], [2], ML should also benefit from
a curriculum structure, in which models are first confronted
with easy examples before difficulty is steadily increased. In
the broader optimisation background of deep neural networks
(DNNs), this idea could be conceptualised as easier samples
providing a smoother loss landscape, allowing models to
quickly reach favourable parameter regions and finally converge
therein [3], [4]. The most common implementations of CL
apply the underlying concept in the following way [5]: First, all
samples of a given training dataset are sorted by difficulty. Then,
during training, the model is gradually exposed to more and
more samples of increasing difficulty, beginning with an initial
‘easy’ subset, with the full dataset incrementally introduced in
later stages [3], [4].
Despite the intuitive motivation behind CL, defining or
determining a measure of sample difficulty (SD) to create such
a difficulty ordering remains a key open challenge [4], [6], [7].
This aspect is so crucial that any investigations into the potential
benefits of CL depend on a well-founded quantification of SD.
Without at least a robust contextual sense of difficulty, the core
assumption for the concept of CL vanishes. Additionally, a
universally applicable human-inspired difficulty understanding
is still missing despite efforts in this direction [8], [9]. Moreover,
a straightforward transfer of human difficulty to CL is debatable
due to significant differences in learning between humans
and machines [10]. Instead, many model-based approaches
to SD estimation have been suggested in recent years. Yet,
despite the fact that these approximations depend on training
setups (e. g., model architecture, hyperparameters, or even
random effects due to initialisation), there is no comprehensive
study quantifying their impact. This is of particular importance
given the recent trend of using SD to garner insights into the
structure of datasets [11]. We aim to overcome this limitation
by exploring various configuration settings that shed more
insight into the impact of these different factors.
Furthermore, despite the fact that evidence of the benefits
of CL has been reported “en masse” [5], [12], [13], [14], the
methodology has not found its way into the standard training
procedures. It is not clear whether this is due to the additional
computational overhead and implementation complexity or
an overestimation of promised benefits due to the factors
mentioned above. A promising attempt to analyse contexts
in which CL proves beneficial is provided by [15]. The study
compares various scoring functions (SFs) for the estimation
of SD, pacing functions (PFs) for the scheduling of samples
during training, and different difficulty orderings (e. g., easy-to-
hard, hard-to-easy, and random), providing evidence that the
benefits of CL are mainly limited to shorter training time or
mitigating label noise. However, the authors focus on a coarse
statistical view of the problem, leaving out a more granular
analysis of the individual aspects of SFs and their interplay with
CL settings. Moreover, with an isolated performance analysis
of the trained models, it remains an open question as to how
training dynamics influence the concepts learnt by the final
model states through the use of CL strategies – a question we
additionally explore in this work.
In our study, we also try to overcome a common limitation
in the selection of datasets for large-scale CL studies (and
fundamental ML studies in general), which heavily focus on
standard computer vision (CV) benchmark datasets [6], [16].
To sustain comparability to existing literature, we perform
our experiments on one of the most popular CV datasets,
CIFAR-10. We further extend them to a popular computer
audition (CA) task of acoustic scene classification (ASC) to
potentially unravel modality-specific differences, as suggested
in prior studies [17], [18].
In summary, we aim to present a thorough analysis of prop-
erties and similarities for an extensive subset of popular SFs for
SD estimation. We further evaluate the interplay of difficulty
orderings with CL and analyse synergies between model states
trained with different CL strategies. In the process, we try to
bring new insights to the advantages, the limitations, and the
general workings of curriculum-based learning paradigms by,
at least partially, answering the following questions:
•
How sensitive are different SFs to varying model archi-
tectures, training settings, as well as random seeds, and
what are their limitations in a practical setting?
•
To what extent do different SFs share a similar notion of
SD?
•
Are less training setting-sensitive SFs advantageous for
achieving higher performance?
•
How do different CL strategies impact model perfor-
mance?
•
Do models trained with varying CL strategies learn
different concepts that can be combined to improve their
predictive performance?
The remainder of this work is structured as follows: Section II
presents a thorough overview of the most common SFs and
CL studies. Section III introduces the datasets and model
architectures that are investigated in this work, as well as the
selection and modifications applied to SFs and CL strategies.
The results of our study are presented and discussed in
Section IV and finally summarised in Section V.
II. RELATED WORK
A. Scoring Functions
Related literature offers several approaches to estimate the
difficulty of each sample in the training set, primarily discussed
in the context of CL. These can be conceptually split into two
different categories: human- and model-based difficulty. Human-
based difficulty definitions fall under two main paradigms:
Human priors
One straightforward strategy is to base the
estimation on human priors of difficulty understanding,
which are easy to define and implement. Such attempts
may include the length of a sentence for natural language
processing (NLP) tasks [19], [20], [21], or the signal-to-noise-ratio (SNR) in a signal [22], [23] for CA [4]. Despite
their intuitive definitions, these difficulty estimations are
often limited by the tendency to disregard the complexity
of the data structures, inter-class relationships, and the
fact that human- and model-based difficulty perceptions
may differ.
Labelling efforts
A more direct approach incorporating a
human understanding of difficulty into the SD estima-
tion involves considering the human labelling behaviour
for individual samples. This approach is particularly
advantageous for datasets lacking ground truth labels
due to the subjectivity of the task, such as in emotion
recognition [24]. Here, gold standards are often formed for each sample based on a summary of the varying ratings assigned by different human annotators [25]. The inter-
annotator agreement can then be interpreted as a measure
of difficulty, with harder samples expected to result in
greater variation among ratings [13]. Alternatively, humans
can perform an explicit (direct or indirect) rating of the
SD. As reported in [8], the minimum viewing time (MVT)
necessary for a human to categorise an image can be
interpreted as a good proxy for difficulty. In regression
tasks, the targets – in terms of their distance to the mean
of the value range – can be considered an estimate for
difficulty [14].
The concrete examples of SFs, however, that we primarily
focus on in this work are model-based. These can be employed
in the context of DNN-based classification tasks on any data
type without requiring additional human annotations beyond
the target labels. A plethora of these approaches exist in the
literature, all of which generally share the idea of interpreting
SD through the lens of trained ML models [7], [4], [26]. While
we refer to Section III-C for a detailed definition of the SFs
relevant to this work, some of the implemented concepts for
sample-wise difficulty scores include: the expected performance
on the sample if withheld from training, the epoch at which
the sample is learnt, the consistency with which the sample
is correctly classified during training, the value of the loss
function on the sample after training, the DNN layer at which
the classification of the sample’s intermediate representation
aligns with the final model’s prediction, and the separability
of the sample’s features obtained from a pre-trained model.
Despite the abundance of approaches (and to some degree
arbitrary choices like network architectures and training param-
eters), little effort has been spent to analyse the robustness and
similarity of different SFs: Wu et al. [
15
] investigate some of
the SFs further analysed in this work and discover that difficulty
estimations vary more across architecture families – like fully
connected neural networks (FCNNs) and convolutional neural
networks (CNNs) – than within. Beyond that, they observe
a moderate to high correlation of the SD across similar
architectures for the chosen SFs. Further, Mayo et al. [8] explore
the likeness of the MVT based on human image recognition
capabilities and a subset of model-based SFs. They discover a
reasonable similarity across methods, yet with some limitations
indicating differences in the understanding of SD between
humans and machines.
B. Curriculum Learning
Experimental studies surrounding the potential benefits of
CL-based learning paradigms are naturally limited through
computational resource demands to specific contexts and
settings, with the most extensive examples being restricted
to CV datasets. Consequently, contradictory evidence is
reported in some cases across different works, leaving the
question of the advantages of CL open.
For instance, Hacohen and Weinshall [5] report, compared
to the baseline, a higher performance for models trained with
a curriculum considering two SFs: one is based on pre-trained
feature extraction, while the other one leverages the loss
values of the final model state. However, a form of self-paced
learning (SPL), i. e., a method in which a model determines
and dynamically adjusts the SD during training, performed
worse than the baseline training setup. Amongst the chosen
CV datasets, the advantages of CL were deemed higher for
more difficult datasets. Beyond that, the results indicate that most of the impact of CL occurs in the earlier stages of the training.
On the other hand, the likely most extensive CV-based study
with respect to the number of trained models [15] is purely
based on a set of model-based SFs. It considers ensembles to
achieve higher stability in the difficulty orderings. It agrees
with the conclusion that CL shows faster learning speed at
the beginning of the training but cannot find evidence for a
significantly higher performance enabled through CL in any
setting with a fair comparison to the baseline. As a random
SD ordering is reported to yield similar performance benefits
as any order obtained from the applied SFs, it is hypothesised
that the advantages of CL can be attributed primarily to the
dynamic dataset size during training. In particular, PFs tend
to perform better when they quickly incorporate more difficult
samples and saturate on the full dataset.
Beyond the realm of CV, Lotfian et al. [13] perform CL
experiments for speech emotion recognition (SER). The study
considers both model-based SFs obtained from the models’
loss functions as well as human-based SFs derived from the
inter-rater agreement. The authors report marginally higher
performance for the model-based ordering and noticeably
higher performance for the rater-based ordering, compared
to the baseline. Contrary to Wu et al. [15], they do not find
any comparable benefits from random orderings. A study on
CL for automatic speech recognition (ASR) was performed
by Karakasidis et al. [27], which considers difficulty orderings
based on the utterance length, a model-based SF, and SPL. The
length of the utterance, which aims to incorporate human priors
into the SD estimation, achieved the lowest performance overall.
Contrary to Hacohen and Weinshall [5], the best results are
obtained with SPL, yet CL-based approaches still performed
on par. It is further reported that the benefits of employing
PFs are particularly evident in the later stages of the training,
contradicting both [5] and [15].
The non-uniform conclusions reported in this section may
originate in the vastly different settings in which the studies
are performed. Many variations can be observed in terms of
SFs, PFs, datasets, model architectures, as well as the training
routines and how baseline comparisons are defined. The studies
further fail to analyse the characteristics of the suggested SFs
and the corresponding importance for CL experiments, which
we plan to address in the following.
III. METHODOLOGY
To answer the research questions posed in the introduction,
we investigate a total of six scoring functions, three sample
difficulty orderings, and four pacing functions over two datasets
and five DNN models. In the following subsections, we
elaborate on each of these aspects in detail.
A. Datasets
Our experiments are performed on two 10-class classification
datasets from the domains of CA and CV, respectively.
Both datasets vary in size, comprise samples with different
representation sizes, and have different baseline accuracies.
CIFAR-10: The subset of the tiny image recognition dataset
referred to as CIFAR-10 [28] is one of the most popular
benchmark datasets for understanding deep learning (DL)
training in general and for CV in particular [29], [30]. It
comprises 60 000 images, divided into 50 000 samples for
training and 10 000 for testing, with both subsets being fully
balanced across ten classes, which can coarsely be categorised
as animals and vehicles. The images are characterised by a
rather small representation size of 32 × 32 pixels with three
colour channels. Modern state-of-the-art approaches are able
to achieve very high performance on the dataset [31], [32], [33], [34]. We upsample all images using bilinear interpolation
to balance accuracy and computational costs. This results in
a representation size of 64 × 64 × 3. It is important to note
that we employ a train-test split without an explicit validation
subset, as the main focus of our work is on understanding SD
on the full training dataset rather than achieving state-of-the-
art performance. This approach is applied consistently across
both datasets, as well as the baseline and CL-based training
experiments.
DCASE2020: For the CA dataset, we choose task 1a of
the DCASE2020 challenge [35]. It comprises 13 962 training
and 2968 testing audio samples of 10 s length, recorded in many different cities with different real and simulated devices. Both the training and testing subsets are
nearly class-balanced. The labels indicate the acoustic scene,
representing the type of location where the audio samples
were recorded. Examples are public transport vehicles, open
spaces, and indoor environments. In our experiments, raw audio
samples are transformed into log Mel-spectrograms with 64
Mel bins, a window size of 512 ms, and a hop size of 64 ms,
following the spectrogram extraction process outlined in [36].
This results in a representation size of 64 × 1001 × 1, where each
element corresponds to the magnitude of the signal’s energy
in a specific Mel-frequency bin and time frame.
B. Network architectures
Given the image-like nature of both the CIFAR-10 samples
and the log-Mel spectrograms in the DCASE2020 datasets, we
decide to employ CNN-based architectures, which are well-
established in the literature of the respective tasks [37], [38].
ResNets: Residual networks (ResNets) [39] are among
the most impactful DNN architectures in DL, playing a
significant role up to this day. The introduction of residual
(or skip) connections allowed for developing very deep CNN
architectures. In our experiments, we specifically employ the
ResNet50 architecture for the CIFAR-10 dataset.
EfficientNets: The EfficientNet family [31] is inspired by
residual connections of the ResNet architecture and further
employs mobile inverted bottlenecks [40] as well as squeeze-and-excitation blocks [41]. The base model, EfficientNet-B0,
is optimised to balance task accuracy and computational
complexity, while larger versions are derived via compound
scaling of the architecture’s depth, width, and resolution. In our
experiments, we utilise EfficientNet-B0 for both the CIFAR-10
and DCASE2020 datasets, and the larger EfficientNet-B4 only
for the CIFAR-10 dataset.
PANNs: The CNN10 and CNN14 architectures were in-
troduced as part of the large-scale pre-trained audio neu-
ral network (PANN) [36] family for spectrogram-based CA
tasks. These models follow a more traditional CNN design,
comprising regular convolutional and batch normalisation
layers without skip connections, and are inspired by the VGG
architecture [42]. CNN14 can be considered a scaled-up version
of the CNN10 architecture. Both models are only utilised for
the DCASE2020 dataset.
Overall, we consider two types of initialisation for all presented networks: random initialisation and pre-training on large datasets, namely ImageNet [43] for the ResNet and EfficientNet architectures and AudioSet [44] for the PANN architectures.
C. Scoring Functions
Consistency Score (C-score): The concept best-suited to
a model-based SD quantification in a classification setting
is the C-score, as introduced in [45]. Its core idea is to
measure how consistently a sample is classified correctly across
different training subsets of varying sizes, with the sample in
question excluded from the training process. Samples that can
consistently be classified with a small amount of training data
likely require less data complexity, are more representative
of their class, and are, therefore, easier to classify. However,
obtaining a robust version of the C-score is limited by the
computational costs required to train at least one model for each
subset size and sample excluded from training. By definition,
a perfect calculation of the C-score would require several
evaluations across all possible subsets of the training data. Since
this would lead to a prohibitively large amount of experiments,
the C-score can only be approximated. Jiang et al. [45] draw 2000 subsets for each subset ratio s ∈ {10 %, . . . , 90 %}, creating publicly available proxies of the C-score (https://pluskid.github.io/structural-regularity/) for the CIFAR-10 dataset based on 18000 models. Given the high computational costs associated with the calculation of the C-score, we limit our analysis to the publicly available C-scores for CIFAR-10 and do not determine them for the ASC task.
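For concreteness, the approximation described above can be written as follows; the notation is only a sketch of the estimator and does not follow [45] verbatim, with f_D denoting a model trained on a subset D of the training set D (excluding the sample in question):

```latex
\hat{C}(x, y) \;=\;
\mathbb{E}_{s \in \{10\%, \dots, 90\%\}}
\Big[
  \mathbb{E}_{D \overset{s}{\sim} \mathcal{D} \setminus \{(x, y)\}}
  \big[ \mathbb{1}\{ f_{D}(x) = y \} \big]
\Big],
```

where the inner expectation is approximated by the finite number of random subsets of ratio s drawn without the sample (x, y).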
Cross-Validation Loss (CVLoss): As a computationally
cheaper yet still expensive alternative to the C-score, we
draw inspiration from the cross-validated self-thought (CVST)
SF [15] and a loss-based C-score variant [12], defining the
cross-validation loss (CVLoss) as the average loss of a sample
when it is held out in a randomly partitioned k-fold cross-
validation setting. The loss-based version allows for a more
fine-grained SD estimation compared to the accuracy-based
C-score but does not account for the effects of differently sized
training subsets. In our experiments, we use k = 3 partitions,
allowing us to obtain one score for each sample from the
training of three models. This approach achieves a reasonable
trade-off between computational cost and the model’s expected
performance given the size of the training subset.
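As an illustration of the CVLoss computation, the following sketch records the held-out cross-entropy loss of every sample in a randomly partitioned k-fold split. A scikit-learn classifier and the digits dataset stand in for the DNNs and datasets used in our experiments; the function name and hyperparameters are illustrative and not taken from the aucurriculum toolkit.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold


def cv_loss(X: np.ndarray, y: np.ndarray, k: int = 3, seed: int = 0) -> np.ndarray:
    """Per-sample held-out cross-entropy loss from a randomly partitioned k-fold split."""
    scores = np.zeros(len(y))
    for train_idx, held_out_idx in KFold(k, shuffle=True, random_state=seed).split(X):
        clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        proba = clf.predict_proba(X[held_out_idx])
        # cross-entropy of the true class for every held-out sample
        cls = np.searchsorted(clf.classes_, y[held_out_idx])
        scores[held_out_idx] = -np.log(proba[np.arange(len(held_out_idx)), cls] + 1e-12)
    return scores  # higher held-out loss -> estimated to be harder


X, y = load_digits(return_X_y=True)
difficulty = cv_loss(X, y, k=3)
```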
Cumulative Accuracy (CumAcc) and First Iteration
(FIT): More computationally lightweight examples of SFs
include cumulative accuracy (CumAcc) [45] and first iteration (FIT) [46], [15]. Both approaches rely on the learning statistics
of each sample during the training process. They allow for
the SD estimation of all training samples from a single
model training. CumAcc is calculated as the sum of correct
classifications over the total number of training epochs. In
contrast, FIT is defined as the ratio between the first epoch in which a sample is correctly classified and remains correct thereafter, and the total number of epochs. In this context,
easier samples are expected to be learnt earlier and classified
correctly with greater consistency.
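The following sketch illustrates how CumAcc and FIT can be derived from a per-epoch correctness matrix; the indexing convention and the treatment of samples that are never learnt (assigned the maximum ratio) are assumptions made for illustration.

```python
import numpy as np


def cumacc_and_fit(correct: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """correct[e, i] is True if sample i is classified correctly after epoch e."""
    n_epochs = correct.shape[0]
    cum_acc = correct.mean(axis=0)  # fraction of epochs in which the sample is correct
    # first epoch from which the sample is correct and stays correct until the end
    stays_correct = np.flip(np.cumprod(np.flip(correct, 0), 0), 0).astype(bool)
    first_epoch = np.where(stays_correct.any(axis=0),
                           stays_correct.argmax(axis=0), n_epochs - 1)
    fit = (first_epoch + 1) / n_epochs  # never-learnt samples receive the maximum ratio
    return cum_acc, fit


# toy example: 5 epochs, 3 samples (easy, late-learnt, never-learnt)
correct = np.array([[1, 0, 0],
                    [1, 0, 0],
                    [1, 1, 0],
                    [1, 1, 0],
                    [1, 1, 0]], dtype=bool)
cum_acc, fit = cumacc_and_fit(correct)  # cum_acc = [1.0, 0.6, 0.0], fit = [0.2, 0.6, 1.0]
```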
Cross-Entropy Loss (CELoss): Another SF can be inferred
from the loss of each training sample in a single trained model
state. Examples with a lower loss after training are interpreted
as easier, as the model can fit them more effectively. Although
this approach was originally introduced as bootstrapping in [5],
we refer to this SF as cross-entropy loss (CELoss) for clarity,
particularly in the context of the classification tasks investigated
in this work.
Transfer Teacher (TT): Similarly, transfer teacher (TT) [47]
estimates SD by analysing the classification boundaries of the
samples after training. In contrast to CELoss and previous
SFs, however, TT does not evaluate DL models trained on the
target task. Instead, TT extracts features from the penultimate
layer of a model pre-trained on a larger dataset, which are then
classified via a support vector classifier (SVC). The margins
between the samples and the classification boundaries indicate
the estimated SD. The underlying idea is that the SVC should
place easy samples far from the boundaries.
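One possible reading of the TT procedure is sketched below: pre-extracted features are classified by an SVC, and the one-vs-rest decision value of the true class acts as a proxy for the margin. The exact margin definition of [47] may differ, and the random features in the usage example are purely illustrative.

```python
import numpy as np
from sklearn.svm import SVC


def transfer_teacher_scores(features: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Fit an SVC on pre-trained features; samples far from their class boundary count as easy."""
    svc = SVC(kernel="rbf", decision_function_shape="ovr").fit(features, labels)
    decision = svc.decision_function(features)  # shape (n_samples, n_classes)
    cls = np.searchsorted(svc.classes_, labels)
    margin = decision[np.arange(len(labels)), cls]
    return -margin  # higher score -> smaller margin -> estimated to be harder


# illustrative usage with random stand-in features
rng = np.random.default_rng(0)
scores = transfer_teacher_scores(rng.normal(size=(200, 16)), rng.integers(0, 4, 200))
```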
Prediction Depth (PD): Another combination of DNN-based
representations and traditional machine learning algorithms for
classification is the prediction depth (PD) introduced in [48].
PD looks at the representations extracted from the final state of
a trained model at different layers, which supposedly represent
different views of the same data sample in the form of low-
to high-level features. The representations extracted from the
layers are then classified by k-nearest neighbours (KNN) probes
with k = 30. Baldock et al. [48] place probes both at the input
of a model and after applying the softmax function to the
output. For ResNet architectures, probes are positioned after
the normalisation layer in the stem and following the skip
connection in each residual block. For VGG architectures,
probes are placed after each convolutional layer. We extend the
ResNet probe placement strategy to the EfficientNet family by
inserting probes after each block. As both CNN10 and CNN14
are inspired by the VGG architecture, we place probes after
each normalisation layer and the first linear layer. This results
in 19 probes for ResNet50, 20 for EfficientNet-B0, 36 for
EfficientNet-B4, 12 for CNN10, and 16 for CNN14. The PD is
defined as the depth of the first layer in which the KNN probe’s
prediction matches the model’s final prediction and aligns for
all subsequent probes in deeper layers. Easier samples are
expected to be separable at lower layers, while harder samples
should require higher-level features for classification. While
Baldock et al. [48] do not assign a SD score if the prediction
of the final probe differs from the network’s prediction, we
assign the maximum depth to obtain a difficulty measure
for the complete training subset. To ensure that the KNN
probes are computationally feasible, we limit the representation
size at each layer to 8192. If any representation exceeds
this limit, global average pooling (GAP) is applied to the
spatial dimensions of the input, while the number of channels
is preserved. Beyond that, GAP is applied to match the spatial dimensions of the input, allowing for an equal contribution of time and frequency information in the case of our ASC
task. However, the use of GAP may lead to the loss of critical
features, potentially challenging the SF’s ability to accurately
estimate the SD.
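A compact sketch of the PD computation is given below. For brevity, the KNN probes are fitted on the same samples whose depth is estimated and use the ground-truth labels, and the probe placement and pooling described above are omitted; the depth indexing, including the maximum depth assigned when the final probe disagrees, follows an illustrative convention.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def prediction_depth(layer_reps: list[np.ndarray], labels: np.ndarray,
                     final_preds: np.ndarray, k: int = 30) -> np.ndarray:
    """Depth of the first probe from which all deeper probes agree with the final prediction."""
    probe_preds = np.stack([
        KNeighborsClassifier(n_neighbors=k).fit(reps, labels).predict(reps)
        for reps in layer_reps
    ])                                             # (n_probes, n_samples)
    agrees = probe_preds == final_preds[None, :]
    # True at probe p if probes p, p+1, ..., last all match the final prediction
    agrees_from_here = np.flip(np.cumprod(np.flip(agrees, 0), 0), 0).astype(bool)
    n_probes = len(layer_reps)
    return np.where(agrees_from_here.any(axis=0),
                    agrees_from_here.argmax(axis=0), n_probes)  # max depth if final probe disagrees


# illustrative usage: three probes over random 8-dimensional representations
rng = np.random.default_rng(0)
reps = [rng.normal(size=(300, 8)) for _ in range(3)]
labels = rng.integers(0, 10, 300)
depths = prediction_depth(reps, labels, final_preds=labels, k=30)
```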
D. Difficulty Ordering and Ensemble Scoring
Any of the SFs introduced in Section III-C can assign a
difficulty score to each individual sample in the training set; however, these scores are generally not normalised. For the
purpose of CL, as we apply it, the exact difficulty score is
not of particular importance. Instead, we solely rely on a
difficulty-based ordering derived from the SD scores provided
by the respective SF. In cases where multiple samples share
the same exact difficulty score, they are sorted according to
the original dataset ordering to avoid introducing randomness
into the process. Ensembles across multiple SFs of the same
type can be constructed by averaging the SD scores from each
SF prior to creating the final difficulty ordering [12].
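The tie-breaking and ensembling described above amount to a stable sort over (averaged) scores, as in the following minimal sketch; it mirrors the behaviour described here, not the implementation of the aucurriculum toolkit.

```python
import numpy as np


def difficulty_ordering(scores: np.ndarray) -> np.ndarray:
    """Indices sorted from easiest to hardest; ties keep the original dataset order."""
    return np.argsort(scores, kind="stable")


def ensemble_ordering(score_runs: list[np.ndarray]) -> np.ndarray:
    """Average the SD scores of several runs of the same SF before ordering."""
    return difficulty_ordering(np.mean(score_runs, axis=0))


# three seed-wise score vectors for five samples; samples 0 and 4 tie on average
runs = [np.array([0.25, 0.75, 0.50, 1.00, 0.25]),
        np.array([0.25, 0.75, 0.50, 1.00, 0.25]),
        np.array([0.25, 1.00, 0.50, 0.75, 0.25])]
order = ensemble_ordering(runs)  # -> [0, 4, 2, 1, 3]: the tie is broken by dataset index
```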
E. Pacing Functions
A CL experiment, as defined in [15] and implemented in
this work, considers a SD ordering and starts a gradient-based
optimisation on the first fraction b of the training set according
to said ordering. For comparison, the experiment may also
use a reversed SD ordering (referred to as anti-curriculum
learning (ACL)) or a random ordering (referred to as random
curriculum learning (RCL)). In the case of CL and b = 0.2, the
first fraction corresponds to the 20 % easiest examples of the
training set. Throughout the training, the size of the dataset is
monotonically increased according to the provided SD ordering
until a fraction a of the total training iterations is reached, after
which the full training set is utilised. For the phase between the
training on the initial subset b and the full dataset, we employ
several functions that monotonically increase the size of the
subset. All PFs share the same boundary conditions, defined by b and a, for the initial dataset size and the iteration at which
the full dataset is employed. However, some PFs add samples
more slowly at the beginning and faster towards the end, while
others start by adding samples quickly before slowing down as
they approach a. Ordered from the fastest to the slowest dataset
size increase at the beginning of the training, we utilise some
of the PFs defined according to Wu et al. [15]: logarithmic (log), root, linear, and exponential (exp).
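To make the roles of b and a concrete, the sketch below maps training progress to the fraction of the difficulty-ordered training set that is available. The functional forms are illustrative stand-ins that reproduce the fast-to-slow ordering described above and are not the exact definitions of Wu et al. [15].

```python
import numpy as np


def pacing_fraction(progress: float, kind: str, b: float = 0.2, a: float = 0.5) -> float:
    """Fraction of the (difficulty-ordered) training set available at a given point.

    progress: current iteration divided by the total number of iterations (in [0, 1]).
    b: initial dataset fraction; a: fraction of iterations after which the full set is used.
    """
    if progress >= a:
        return 1.0
    t = progress / a  # position within the pacing phase, in [0, 1)
    shapes = {
        "log":    np.log1p(9.0 * t) / np.log(10.0),                # fastest at the start
        "root":   np.sqrt(t),                                      # fast at the start
        "linear": t,
        "exp":    (np.exp(3.0 * t) - 1.0) / (np.exp(3.0) - 1.0),   # slowest at the start
    }
    return float(b + (1.0 - b) * shapes[kind])


# e.g., dataset fraction halfway through the pacing phase of a root schedule
frac = pacing_fraction(progress=0.25, kind="root", b=0.2, a=0.5)
```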
IV. EXPERIMENTS AND DISCUSSION
In the following, our experiments and corresponding dis-
cussions are structured sequentially, first focusing on the be-
haviours of SFs, then analysing the performance of curriculum-
based training settings. All experiments are conducted using
Python 3.10 and PyTorch 2.1.0 [49]. For any hyperparameter
not explicitly modified, the default values provided by PyTorch
are used. Additionally, we release the aucurriculum toolkit (https://github.com/autrainer/aucurriculum), built on top of autrainer (https://github.com/autrainer/autrainer), which implements all SFs and PFs explored in this work. This package allows for obtaining SD scores from arbitrary classification datasets and supports curriculum-based training. The code to reproduce our experiments is publicly available on GitHub (https://github.com/ramppdev/sample-difficulty-curriculum-learning). Training is performed
across a variety of GPU architectures, including NVIDIA
GeForce GTX TITAN X, GeForce RTX 3090, and A40 GPUs.
Given the diverse nature of the utilised GPUs, we omit an
analysis of training time comparisons between the standard
training baseline and curriculum-based experiments, as this strongly depends on hardware specifications.
A. Baselines
To identify the set of hyperparameters best suited to each
model and dataset and establish a baseline to compare with,
we run a set of preliminary experiments iterating over a small
set of hyperparameters (cf. Table V). In all cases, models are
trained for the full 50 epochs, and the final model selection
is based on the best-performing model state on the respective
validation set. Overall, we train 54 models per dataset, and
the best-performing training configuration for each model and
initialisation is reported in the appendix (cf. Table VI). Even
though the best performances for CIFAR-10 and DCASE2020
are obtained with pre-trained versions of ResNet50 and CNN14,
respectively, we select the best-performing EfficientNet-B0
configuration for both datasets as a baseline reference for further
analysis. This choice is motivated by the model being the only
architecture employed for both datasets and its comparably low
parameter count, allowing for a more efficient training. Training
the model from scratch further reduces the potential impact
of pre-training data. The reference baseline performances are
thus 0.839 for CIFAR-10 and 0.583 for DCASE2020, achieved
using the Adam optimiser with a learning rate of 0.001 and
the SAM optimiser with a learning rate of 0.01, respectively.
Nonetheless, we consider all models to investigate the impact
of varying training settings on SFs in IV-B.
B. Scoring Functions
1) Impact of Training Setting: Our first analysis builds on
top of the previous baseline experiments and the obtained
model states. We focus on how much the choice of different
training settings affects the final difficulty ordering provided
by a given SF.
We first create different SD orderings for the SFs presented
in Section III-C by independently varying the random seed, the
model architecture (including initialisation), and the optimiser-
learning rate combination, respectively. Each variation is based
on the best-performing EfficientNet-B0 reference configuration,
with only one setting – seed, model, or optimiser-learning rate
combination – varied at a time for comparison. Overall, we
choose six different variations per training setting and obtain
one SD ordering per SF and variation. The variations are
summarised in the appendix (cf. Table VII) and follow the
respective choices from Table V, but we add five new random
seeds to obtain variations w. r. t. randomness. Additionally, we
limit our analysis to the two best-performing learning rate
combinations for each optimiser, as some combinations did not
converge during training. Note that the random seed controls
both the model initialisation and the ordering of training data
during epochs. This guarantees that the results are reproducible,
and we further aim to minimise the influence of randomness
on model or optimiser variations.
While most SD orderings can be obtained straight from the
baseline experiments performed in Section IV-A, an additional
53 models need to be trained per dataset due to the added
consideration of random seeds and the requirements of the
CVLoss SF calculation. To reduce the computational costs
of the CVLoss SF, we reduce the training time to 25 epochs.
Moreover, TT is only calculated over the pre-trained versions of
the three model architectures, as the SF requires a pre-trained
model by design, and the SVC fitting on the target task does
not allow for varying random seeds or optimisation routines.
Whenever the selection of a specific model state is required to
calculate a SF, we choose the best-performing state analogous
to Section IV-A.
Table I summarises the SF ‘robustness’ per training setting for CIFAR-10 and DCASE2020, respectively. We compute
the mean and standard deviation of pairwise Spearman rank
correlations of the difficulty orderings obtained for each
independently varied training setting. From the six variations
within each training setting (seed, model, and optimiser-learning
rate combination), we obtain 15 unique pairwise correlations
(excluding self-correlations).
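The robustness values reported in Table I can be computed as sketched below, using Spearman correlations over all unique pairs of score vectors; the synthetic data merely illustrates the call.

```python
from itertools import combinations

import numpy as np
from scipy.stats import spearmanr


def pairwise_spearman(score_vectors: list[np.ndarray]) -> tuple[float, float]:
    """Mean and standard deviation of Spearman's rho over all unique pairs of SD scores."""
    rhos = [spearmanr(a, b)[0] for a, b in combinations(score_vectors, 2)]
    return float(np.mean(rhos)), float(np.std(rhos))


# six variations of one training setting -> 15 unique pairs, as in Table I
rng = np.random.default_rng(0)
base = rng.normal(size=1000)  # shared underlying difficulty
variations = [base + rng.normal(scale=0.5, size=1000) for _ in range(6)]
mean_rho, std_rho = pairwise_spearman(variations)
```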
The mean pairwise correlation is, in most cases, moderate (≥ 0.4) or strong (≥ 0.6). This is true for both datasets, a fact
that indicates moderate to high agreement for all SFs when
changing one training setting and, thus, a consistent underlying
notion of SD for each SF.
With respect to the type of training setting, varying the
random seed results in overall higher agreement, with a change
in the model architecture exhibiting the lowest agreement. This
TABLE I: Training setting robustness in terms of agreement
of difficulty rankings obtained with varying training settings.
The respective training setting of each column is varied with
respect to the reference configuration. We then report the mean
and standard deviation of Spearman correlation coefficients (ρ) computed between unique pairs of training setting variations.
SF Seed Model Optim + LR
CIFAR-10
CELoss .507 ±.026 .428 ±.043 .483 ±.055
CVLoss .676 ±.007 .629 ±.045 .689 ±.022
CumAcc .760 ±.008 .557 ±.101 .752 ±.019
FIT .586 ±.033 .416 ±.076 .623 ±.019
PD .790 ±.012 .653 ±.076 .799 ±.032
TT – .648 ±.025 –
DCASE2020
CELoss .410 ±.060 .415 ±.115 .369 ±.041
CVLoss .591 ±.018 .579 ±.044 .556 ±.046
CumAcc .821 ±.012 .590 ±.099 .758 ±.048
FIT .604 ±.020 .475 ±.084 .513 ±.052
PD .748 ±.023 .694 ±.046 .683 ±.068
TT – .523 ±.191 –
shows how different models – with different inductive biases –
encapsulate a different notion of SD, with important training
settings such as the optimiser and learning rate adding further
variability. Nevertheless, even changing the random seed alone –
and keeping all other things equal – results in a low to moderate
amount of disagreement. As the random seed controls both
the model initialisation and the order in which examples are
presented during training, it is evident that these two aspects
of training play a non-negligible role when discussing SD.
In terms of comparing the two datasets, we observe that they
exhibit differences in both directions, with CIFAR-10 showing
higher correlations than DCASE2020 in some constellations
and lower in others. This shows how the dataset plays an
additional role for the robustness of a SF across different
training settings and is an extra source of variability.
In summary, we conclude that an SD ordering is context-
dependent, as it is influenced by the choice of model, training
hyperparameters, and even random seed selection. The corollary
is that the model-based curricula we investigate later can be
seen as ‘global’ curricula in only a relative sense, as each
training setting results in a different SD ordering.
2) Robustness of Ensembles: The strong influence of ran-
domness across all SFs observed in Section IV-B1 poses
some concerns towards the legitimacy of model-based SD
prediction. The impacts of variations in model architecture or
optimisation routines could be attributed to inherent biases,
resulting in differing interpretations of SD. However, the role
of randomness through varying seeds is harder to justify and,
to some extent, contradicts the concept of CL as it prevents
an objective difficulty ordering.
A possible mitigation strategy for this limitation might
lie in the use of ensemble scoring functions, following the
experiments of [15] and [12]. The underlying hypothesis of
this approach is that the effects of random variations can be
counteracted by averaging over several difficulty predictions
obtained from different random seeds for each sample, thus
Fig. 1: Correlation of SFs with increased ensemble sizes for
the datasets CIFAR-10 and DCASE2020. The ensemble size
encapsulates how many individual orderings – obtained from
different random seeds – are considered to build one ensemble.
For each SF, we report pairwise Spearman correlations across
three ensemble orderings with the same ensemble size.
unravelling a more robust and less biased SD estimation. In
order to investigate this hypothesis, we first train a total of
nine new models per dataset by varying the random seed of
the EfficientNet-B0 reference configuration. Per dataset and
SF, we obtain 15 difficulty estimations, the settings of which
only vary in the underlying random seed. We omit CVLoss from the discussion due to its computational costs, and TT due to its lack of dependence on randomness. From the 15 difficulty
estimations, we form three ensemble difficulty orderings for
each of the varying ensemble sizes {1, 2, 3, 4, 5}, as described
in Section III-D, while ensuring no overlap in the underlying
SFs across the ensembles.
Figure 1 illustrates the average pairwise correlation across
the three considered ensemble difficulty orderings for each
ensemble size. It is apparent that the pairwise correlation
consistently increases with larger ensemble sizes across datasets
and SFs. SD orders obtained from the same experimental
settings but different random seeds thus agree more with each
other if the orders are built via an ensemble over multiple
random seeds. This result is in line with our hypothesis, clearly
indicating that difficulty orderings become more robust towards
randomness with increasing ensemble sizes.
3) A Deeper Look into Difficulty Distributions: A misleading
conclusion that might be drawn from Section IV-B1 and
Figure 1 is that higher training setting robustness automatically signals a better SF. However, there are
some methodological decisions limiting this interpretation: A
comparison of difficulty orderings assumes unique and distinct
difficulty values for each sample, which cannot be guaranteed
through the SFs.
In order to still provide a necessary, ranked difficulty ordering
for CL experiments, we introduced in Section III-D the decision
to rank samples with an identical difficulty estimation by the
index provided by the original dataset. This decision is mainly motivated by excluding further randomness effects from the CL experiments. It also ensures the similarity of two
difficulty estimations, which assign the same difficulty values
to various samples. However, this decision has implications
for the comparison of orderings with coarse granularity, as it
TABLE II: Granularity of CIFAR-10 and DCASE2020 SF
difficulty distributions for single seed ordering and ensemble
orderings. We report the number of unique difficulty values
and the maximum number of samples assigned to a single
difficulty value (bin). The ensemble ordering for TT consists of three different scores, as TT only utilises pre-trained models.
Ensemble Size 1 6
Scoring Function Unique Max Bin Unique Max Bin
CIFAR-10
CELoss 17 424 5164 49 844 10
CVLoss 32 872 768 50 000 1
CumAcc 33 18 279 157 4033
FIT 46 18 279 34 286 3824
PD 20 21 135 196 10 915
TT 50 000 1 50 000 1
C-score – – 50 000 1
DCASE2020
CELoss 8402 470 13962 1
CVLoss 13262 22 13962 1
CumAcc 32 1889 151 670
FIT 48 1889 12384 670
PD 21 4119 203 2881
TT 13962 1 13962 1
assigns the same difficulty value to many samples.
An overview of the granularity of the different SFs in the
single-score as well as the ensemble context is given by Table II.
Naturally, the SFs FIT, CumAcc, and PD show the coarsest
granularity, as their discrete nature only allows for a limited
number of difficulty predictions – at least in the single-score
context. The loss- or margin-based scoring functions CELoss,
CVLoss, and TT, as well as C-score, on the other hand, have
(almost) completely unique difficulty values assigned to each
sample – at least in the ensemble version. Any non-uniqueness,
especially in the single-score version, results from limited
numerical precision for very small loss values. Therefore, they
provide a clear ordering in terms of their estimation of difficulty.
The effects of the coarse granularity on robustness are
particularly reflected in Figure 1. Here, FIT, CumAcc, and
PD all show clearly higher robustness to randomness than the
more fine-grained CELoss.
4) Agreement of Difficulty Notion: Given the previous
insights on the characteristics of SFs on an individual level, we
now extend our analysis to the similarity of difficulty notions
across SFs. For this, we calculate pairwise Spearman rank corre-
lation coefficients of the different SFs based on their ensembles
w. r. t. seed, model architecture (including initialisation), and
optimisation routine, with all results presented in the appendix
(cf. Figure 5, Figure 6, and Figure 7). Evidently, almost all
SFs share a very similar notion of difficulty. Only PD shows
an uncharacteristically low, yet still moderate, correlation to all other SFs, remaining above 0.5 in almost all cases. This indicates some
limitations of the approach, likely due to the coarse granularity
as well as the applied GAP. With all other approaches agreeing
to more than 70 % in all but one case, their ensemble-based
difficulty orderings generally seem to be good estimators for
model-based SD in the context of CL.
TABLE III: Mean SF agreement for CIFAR-10 and
DCASE2020. Reported is the mean pairwise agreement of
ensemble SFs under variation of either random seeds, model
architectures, or optimiser-learning rate combinations.
Variation CIFAR-10 DCASE2020
Seed .724 ±.097 .689 ±.129
Model .673 ±.085 .652 ±.132
Optim + LR .732 ±.112 .724 ±.080
Table III summarises the SF agreement per ensemble type as
an average of the pairwise agreements, as, e. g., displayed in
Figure 5. Interestingly, the agreement across SFs is even higher
than most agreements within the same SF with different random
seeds. While this might seem surprising at first, the reasons
for this may lie in the fact that most approaches are based on
identical model trainings. For instance, a likely scenario might
be that a model is confronted with a certain sample early on
during training and is able to learn it quickly. Consequently,
the sample might be correctly classified in all epochs, making
it both ‘easy’ in the context of CumAcc and FIT, and having
a persistently low loss value, making it ‘easy’ in the context
of CELoss.
A high correlation of SFs is also observed when ensembles
are constructed w. r. t. varying model architectures and opti-
misation routines, as summarised in Figure 6 and Figure 7.
The mean correlations across all ensembles are reported in
Table III.
C. Curriculum Learning
Despite the initial motivation through CL, we have so far
only investigated SFs in isolation regarding their alignment
of difficulty orderings. This section focuses on the practical
implications of SD orderings for CL. We perform experiments
as described in Section III-E, based on the ensemble SD orderings obtained from Section IV-B4. This approach aims to
provide further insights into how the performance of a model
changes when we alter the data it is exposed to based on
different scoring functions.
1) Different orderings and pacing functions: We start the
experimental investigation of CL with very coarse differences
in the training setup. While we did notice differences in the
agreement of difficulty values across SFs and training settings,
the results of Section IV-B provide enough evidence to assume
that either of the SFs is a suitable estimator for a distinction
between the easiest and the most difficult samples within a
dataset. Moreover, to limit computational demands, we exclude
the CVLoss and FIT SFs from the subsequent performance
experiments, as both show a correlation above 80 % with CumAcc for seed-based ensembles across both datasets.
In our first experiments, we evaluate how model performance
is impacted when examples are presented in the intended
curriculum ordering (easy-to-hard; CL), the reverse ordering
(hard-to-easy; ACL), and a completely random ordering (RCL).
For a robust evaluation across the different orderings, we
choose the seed-based ensembles for all SFs. For training, we
select the EfficientNet-B0 architecture in combination with
the best baseline training setup, as preliminary experiments
indicated that this configuration also shows high performance
in combination with different pacing functions. We evaluate
each SF across four PFs (logarithmic, root, linear, exponential),
starting with an initial training dataset size of b = 20 % and a saturation on the full dataset after a = 50 % and a = 80 % of the training iterations. Following the approach of [5], we
incorporate new training samples in a class-balanced manner,
such that the training subsets remain balanced throughout
training, assuming the full dataset is balanced. Analogous
to [15], we also delay the introduction of new samples until all
examples in the current training subset have been used at least
once. We train each model for a number of steps equivalent
to 50 epochs of the full dataset and replicate the experiments
across three different seeds, averaging the performance to
ensure robustness. Overall, this leads to a total of 264 CL experiments for the CIFAR-10 and 216 for the DCASE2020 dataset (as we utilise the publicly available C-score difficulty values, the C-score SF is only applied to CIFAR-10).
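For illustration, a stripped-down version of such a CL run with a linear PF and a toy model is sketched below; class-balanced sample introduction is omitted, and all sizes and names are placeholders rather than the configuration used in our experiments.

```python
import numpy as np
import torch
from torch import nn
from torch.utils.data import DataLoader, Subset, TensorDataset

# toy stand-ins for the dataset, model, and SD ordering used in the paper
X, y = torch.randn(1000, 16), torch.randint(0, 10, (1000,))
dataset = TensorDataset(X, y)
model = nn.Linear(16, 10)
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
order = np.argsort(np.random.rand(len(dataset))).tolist()  # easy-to-hard indices

b, a, total_steps, batch_size = 0.2, 0.5, 500, 32
step = 0
while step < total_steps:
    progress = step / total_steps
    frac = 1.0 if progress >= a else b + (1.0 - b) * (progress / a)  # linear pacing
    subset = Subset(dataset, order[: max(1, int(frac * len(dataset)))])
    # one full pass over the current subset before its size is updated again
    for xb, yb in DataLoader(subset, batch_size=batch_size, shuffle=True):
        loss = criterion(model(xb), yb)
        optimiser.zero_grad()
        loss.backward()
        optimiser.step()
        step += 1
        if step >= total_steps:
            break
```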
The results of our experiments – averaged over SFs – as
well as the baseline performance without CL are summarised
in Figure 2. For each ordering strategy, PFs are sorted from top
to bottom by decreasing saturation speed, i. e., pacing functions
at the top quickly introduce new samples to the training set,
while those at the bottom initially maintain the number of
samples close to the initial training set. Firstly, we observe a
clear trend towards better performance obtained from quickly
saturating PFs, which is in line with the findings in [15]. Only
logarithmic PFs combined with a CL ordering, but not with
a RCL or ACL ordering, are able to marginally surpass the
baseline on both datasets.
Our experiments further show a clear performance advantage
of CL over ACL, with RCL scoring in between CL and ACL.
Despite being less prominent for the quickly saturating PFs, this
difference steadily increases with decreasing saturation speed.
In the case of the slowly saturating PFs in particular, models
are trained for a long time with only the easiest (CL), the
most difficult (ACL), or random samples (RCL), respectively.
While most combinations of curriculum orderings and PFs degrade performance to some degree, which can likely be attributed
to an effect of overfitting from which the model struggles to
recover at the later stages of the training, the effects are the
strongest for ACL and the weakest for CL. We conclude that
model training is clearly negatively impacted if confronted with
difficult samples first. This finding underlines the importance
of the data that models are exposed to early on. It provides
evidence supporting the concepts behind CL and adds to the
validity of model-based SFs for SD estimation.
2) Benefits of Robust Scoring Functions for Curriculum
Learning: Beyond the apparent importance of data ordering
for CL settings in the extreme cases of easy-to-hard, hard-
to-easy, and random difficulty orderings, we further aim to
investigate the more subtle impacts SF-based orderings have on
CL. To further analyse this question, we revisit the robustness
experiments of SFs from Section IV-B2. We hypothesise that
ensemble SFs of larger size – being more robust to randomness
Fig. 2: Mean PF performance on CIFAR-10 and DCASE2020,
averaged across SFs, saturation fractions, and three seeds.
Each bar represents the average performance for each PF and
curriculum ordering. The grey dashed vertical lines indicate the
baseline performance, averaged across the 15 random seeds.
Fig. 3: Comparison of SF robustness and CL performance
for CIFAR-10 and DCASE2020. For each scoring function,
we evaluate ensemble orderings of different ensemble sizes,
noted as a number next to each point. The y-axis represents the pairwise correlation across the ensembles of the respective ensemble size (cf. x-axis in Figure 1) as an indicator of SF robustness. The x-axis displays the average accuracy of CL
experiments based on the corresponding ensemble orderings.
Coloured and grey dashed lines are linear least-squared-error
fits per SF and across all SFs, respectively. The slope of the
lines indicates whether trends of higher SF robustness and
higher CL performance exist.
– should have a clearer notion of difficulty, which in turn should
lead to a more reliable difficulty ordering and thus have benefits
in a CL setting.
In order to investigate this hypothesis, we base our CL
experiments on the ensemble orderings previously investigated
in Section IV-B2. We once again use three different orderings per ensemble size and three different random seeds in the same setting as Section IV-C1, resulting in overall 60 curriculum
experiments per SF for each dataset. Figure 3 displays the
average pairwise correlation for each ensemble size, i. e., the
values on the y-axis in Figure 1, versus the average performance
of the CL-based training with the same ensemble size.
At this point, results for CIFAR-10 and DCASE2020
seem to diverge considerably. The outcomes for CIFAR-10
clearly support our hypothesis both within and across SFs
(correlation across all points), with high Pearson correlation
coefficients (PCCs) between accuracy and correlation of 0.760
and 0.498, respectively. The first value is calculated as the
average correlation for each SF separately (macro PCC), while
the second value is based on the correlation of all points
independent of the concrete SFs (micro PCC). This aligns with
our expectation that a more robust SD ordering will result in
improved performance when used as the starting point for CL.
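The two coefficients can be reproduced as sketched below from per-SF robustness and accuracy values (one per ensemble size); the dictionaries in the usage example are hypothetical placeholders, not measured results.

```python
import numpy as np
from scipy.stats import pearsonr


def macro_micro_pcc(robustness: dict[str, list[float]],
                    accuracy: dict[str, list[float]]) -> tuple[float, float]:
    """Macro: mean per-SF Pearson correlation; micro: correlation over all points at once."""
    macro = float(np.mean([pearsonr(robustness[sf], accuracy[sf])[0] for sf in robustness]))
    micro = float(pearsonr(np.concatenate([robustness[sf] for sf in robustness]),
                           np.concatenate([accuracy[sf] for sf in robustness]))[0])
    return macro, micro


# hypothetical values for two SFs and ensemble sizes 1-5
rob = {"CumAcc": [0.70, 0.78, 0.83, 0.86, 0.88], "PD": [0.75, 0.82, 0.86, 0.89, 0.91]}
acc = {"CumAcc": [0.830, 0.833, 0.835, 0.836, 0.838], "PD": [0.828, 0.831, 0.832, 0.834, 0.835]}
macro_pcc, micro_pcc = macro_micro_pcc(rob, acc)
```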
DCASE2020, however, adds more ambiguity to the discus-
sion: both the average correlation within and across SFs show
close to no correlation of -0.047 and 0.045, respectively. In
this case, a higher agreement on the SD did not translate to
improved performance when this ordering was used for CL,
indicating that other factors might influence CL behaviour.
While we find clear evidence supporting the benefit of more
robust SFs for CL on the CIFAR-10 dataset, no such evidence
can be reported for DCASE2020, suggesting that further investigation into how SF robustness and other factors contribute to CL performance is needed.
3) Scoring Function and Performance: Having compared
the effects of different orderings within the CL setting, the next
aspect we aim to investigate is whether the CL methodology
is in any way superior to standard DL training beyond the
limited training time scenario reported in [15]. To test this, we select the 264 CL experiments for the CIFAR-10 and 216 for the DCASE2020 dataset from IV-C1. We report the best-
performing combination for easy-to-hard (CL), hard-to-easy
(ACL), and the random ordering (RCL) averaged across three
seeds, respectively. For comparison, we use the 15 models
from IV-B2, each replicating the EfficientNet-B0 reference
configuration across different random seeds, as baselines. We
report the performance of the single best model (B1), the
average performance of the best three models (B3), the best
five models (B5), and all baselines (B15). This setup allows us
to investigate whether CL, RCL, or ACL can offer performance
benefits over standard training across both datasets in order
to challenge the findings of Wu et al. [15], which reported no
improvement over the baseline. Despite the differing number
of CL and baseline experiments, the comparison aims to assess
whether significant performance improvements can be achieved
through CL-based training. However, the choice of the best-
performing CL-based configurations introduces a selection
bias, as they represent the best training setup for CL, while
the baselines are derived from a fixed training setup averaged
across multiple (best-performing) random seeds.
Table IV provides an overview of the performance exper-
iments. The results, however, are inconclusive. For CIFAR-
10, the best performance is indeed achieved by a CL-based
configuration using the computation-heavy C-score ordering.
However, the next best performances are achieved by all
baselines, followed by RCL and, finally, ACL showing the
worst performance. For DCASE2020, the best CL setting cannot
reach the performance of the best baseline run but outperforms
the average performance from the best 3 to 15 baselines, with
also ACL and RCL outperforming the average over all baseline runs. A clear plug-and-play improvement of CL over standard
training across datasets can therefore not be concluded, which
falls in line with the results from [15]. Despite the apparent
limitations of this analysis due to the asymmetric selection of
TABLE IV: Performance comparison between the averages
of the best standard training baselines (abbreviated with B)
and the best training settings for CL, RCL, and ACL averaged
across three seeds.
CIFAR-10 DCASE2020
SF Type Accuracy SF Type Accuracy
C-score CL .844 – B1 .583
– B1 .839 TT CL .577
– B3 .839 – B3 .576
– B5 .838 – B5 .571
– B15 .834 CELoss ACL .560
Random RCL .829 Random RCL .558
CELoss ACL .829 – B15 .555
comparison models, it seems that a good CL setting does not
have substantially beneficial effects on the performance over
the standard training baselines.
4) Late Fusion Performance: Our last line of experiments
focuses on the question of whether models trained with a focus
on different samples arrive at states that are complementary to
each other. We investigate this through the lens of late fusion,
which is performed by averaging the predicted class probabili-
ties of the respective models after the softmax layer. In this
context, we expect higher performance gains for combinations
that show higher independence in their voting [50], which
we interpret as more disconnected concepts across models.
Therefore, we select the best-performing model trained with
easy-to-hard, hard-to-easy, and random orderings alongside the
baselines from IV-C3. We denote the fusion of the top two
baselines as B2, the fusion of the top three baselines as B3,
and so forth.
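Late fusion as used here reduces to averaging the per-model class probabilities, as in the following sketch; the random logits merely illustrate the call.

```python
import torch


def late_fusion(logits_per_model: list[torch.Tensor]) -> torch.Tensor:
    """Average class probabilities after the softmax layer and return the fused predictions."""
    probs = torch.stack([torch.softmax(logits, dim=-1) for logits in logits_per_model])
    return probs.mean(dim=0).argmax(dim=-1)


# e.g., fusing a CL, an ACL, and a baseline model on a batch of 8 samples with 10 classes
fused_predictions = late_fusion([torch.randn(8, 10) for _ in range(3)])
```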
Figure 4 provides an overview of the performance obtained
through late fusion of differently trained model states. Across
all experiments, we notice a clear performance increase when
fusing multiple models, as would be expected from the
literature [51], [52]. For CIFAR-10, the highest performance
across all experiments is achieved by the fusion of all baseline
models. In contrast, for the DCASE2020 dataset, the highest
performance is obtained from the fusion of CL, ACL, RCL,
and the best baseline. Among fusions of two, three, and four
models, we find that combinations including any of the CL-
based orderings tend to outperform baseline fusions of the same
size across both datasets, especially if CL and ACL are part of
the combination. This is particularly interesting as ACL and
RCL have a low performance compared to the best baselines
(cf. Table IV). The fact that the fusion of CL and ACL performs
better suggests that these opposing difficulty orderings may
complement each other, indicating an exploitable difference in
the learnt concepts between CL and ACL.

Fig. 4: Late fusion results for combinations of curriculum (CL), random curriculum (RCL), and anti-curriculum learning (ACL), as well as the best baselines abbreviated with B; for instance, B4 represents the fusion of the best 4 baseline runs.
V. CONCLUSION
In this contribution, we extensively explored the robustness
and alignment of sample difficulty estimation through scoring
functions and their implications for curriculum learning on
benchmark computer vision and computer audition datasets.
We discovered that model-based scoring functions are impacted almost as strongly by randomness, in terms of parameter initialisation and dataset order, as they are by the choice of hyperparameters in the respective training setting. Robustness
towards randomness, however, was shown to be effectively
increased through ensemble scoring functions. A generally
high agreement across different ensembles of scoring functions
based on various concepts could be observed as long as their
difficulty predictions allowed for a fine-grained estimation of
difficulty and, correspondingly, a clear profile of the difficulty
ordering.
In the context of curriculum learning, we first showed
a clear benefit of easy-to-hard over hard-to-easy difficulty
orderings, in particular for pacing functions that slowly saturate
on the full dataset. We further found evidence that more
robust scoring functions benefit curriculum learning, likely
through a more robust notion of difficulty. As we could not
show that curriculum learning has a general advantage over
traditional deep learning training, we are left to hypothesise
that many of the reported benefits in the literature are not
universally generalisable. Nevertheless, models trained with
opposing difficulty orderings revealed beneficial properties
for late fusion. This leads to the conclusion that different,
complementary concepts can be acquired when models learn
samples easy-to-hard versus hard-to-easy. Future work could
expand our findings by investigating the characteristics of
different samples and how those impact their difficulty ranking.
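For illustration only, a pacing function of the kind referred to above, i.e. one that slowly saturates on the full dataset, could be sketched as follows in Python; the function name, the root-style form, and the default values are assumptions rather than the exact pacing functions used in our experiments:

def root_pacing(step, total_steps, start_fraction=0.2, power=2.0):
    # Fraction of the difficulty-sorted training set available at a given step.
    # Grows quickly at first and then slowly saturates towards the full dataset.
    progress = min(step / total_steps, 1.0)
    return min(1.0, start_fraction + (1.0 - start_fraction) * progress ** (1.0 / power))

# The curriculum then trains on the easiest fraction * N samples at each step.
for step in (0, 250, 500, 750, 1000):
    print(step, round(root_pacing(step, 1000), 2))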
ACKNOWLEDGMENT
This work has received funding from the DFG’s Reinhart
Koselleck project No. 442218748 (AUDI0NOMOUS).
APPENDIX
BASELINES
TABLE V: Grid search hyperparameters utilised to establish the baseline performance. Each model was trained for 50 epochs, with the final model selection based on the best-performing state on the respective validation set.
Hyperparameter   Values
Architecture     EfficientNet-B0, -B4, ResNet50, CNN10, CNN14⁷
Initialisation   Random, Pre-trained⁸
Optimiser        Adam [53], SGD⁹ [54], SAM¹⁰ [55]
Learning rate    .01, .001, .0001
Batch size       16
Epochs           50
Random Seed      1
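As a rough illustration of how the grid in Table V unfolds (dictionary keys and helper names are ours, not the aucurriculum configuration schema; the architecture split per dataset follows footnote 7), the baseline runs can be enumerated as:

from itertools import product

ARCHITECTURES = {
    "CIFAR-10": ["EfficientNet-B0", "EfficientNet-B4", "ResNet50"],
    "DCASE2020": ["EfficientNet-B0", "CNN10", "CNN14"],
}
INITIALISATIONS = ["random", "pre-trained"]
OPTIMISERS = ["Adam", "SGD", "SAM"]
LEARNING_RATES = [0.01, 0.001, 0.0001]

def baseline_grid(dataset):
    # Yield every baseline training configuration for the given dataset.
    for arch, init, opt, lr in product(
        ARCHITECTURES[dataset], INITIALISATIONS, OPTIMISERS, LEARNING_RATES
    ):
        yield {"architecture": arch, "initialisation": init, "optimiser": opt,
               "learning_rate": lr, "batch_size": 16, "epochs": 50, "seed": 1}

print(sum(1 for _ in baseline_grid("CIFAR-10")))  # 3 * 2 * 3 * 3 = 54 runs per dataset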
TABLE VI: Best baseline performance of each model on
CIFAR-10 and DCASE2020 with variations in optimisers and
learning rates as shown in Table V. The suffix -T indicates
pre-training (on ImageNet for CIFAR-10 and AudioSet for
DCASE2020). All models were trained for 50 epochs, with
final selection based on the best validation performance.
Model Optimiser Learning Rate Accuracy
CIFAR-10
ResNet-50-T SAM .001 .949
EfficientNet-B4-T SAM .001 .945
EfficientNet-B0-T SAM .001 .936
EfficientNet-B4 Adam .001 .848
EfficientNet-B0 Adam .001 .835
ResNet-50 Adam .001 .813
DCASE2020
CNN14-T Adam .0001 .678
CNN10-T SAM .01 .653
EfficientNet-B0-T SAM .01 .644
CNN10 SAM .001 .609
CNN14 SAM .001 .595
EfficientNet-B0 SAM .01 .583
⁷ EfficientNet-B4 and ResNet50 are only trained for CIFAR-10; CNN10 and CNN14 only for DCASE2020.
⁸ EfficientNets and ResNet50 are pre-trained on ImageNet; CNN10 and CNN14 are pre-trained on AudioSet.
⁹ With momentum (0.9).
¹⁰ With SGD momentum (0.9).
AGREEMENT OF DIFFICULTY NOTION
TABLE VII: Overview of variations in random seed, model
architecture (with EfficientNets abbreviated to B0 and B4)
and initialisation, and optimiser-learning rate combinations
used for SF calculation. The suffix -T indicates pre-training
(on ImageNet for CIFAR-10 and AudioSet for DCASE2020)
across model architectures. Each training setting is varied
independently, with other parameters fixed to the first entry in
each row, based on the EfficientNet-B0 reference configuration.
Variations   CIFAR-10                               DCASE2020
Seed         1, 2, 3, 4, 5, 6                       1, 2, 3, 4, 5, 6
Model        B0, B0-T, B4, B4-T,                    B0, B0-T, CNN10, CNN10-T,
             ResNet50, ResNet50-T                   CNN14¹¹, CNN14-T
Optim + LR   Adam, SAM, SGD (all with .001, .01)    SAM, Adam, SGD (all with .01, .001)
Fig. 5 (panels: CIFAR-10, DCASE2020): Agreement of different scoring functions with varying random seeds. Displayed is the pairwise Spearman correlation of the respective ensemble orderings of ensemble size six. The individual orderings building up the ensemble are obtained from the reference configuration and five additional variations of the random seed.
¹¹ The learning rate for CNN14 was reduced, as the model did not converge with all other parameters fixed to the first entry in each row.
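As a minimal sketch of the agreement measure reported in Figs. 5-7 (assuming the ensemble ordering is obtained by averaging the per-run difficulty scores before ranking; all variable names and the random data are illustrative):

import numpy as np
from scipy.stats import spearmanr

def ensemble_ordering(scores):
    # scores: (runs, samples) array of per-run difficulty scores;
    # the ensemble score of a sample is its mean over the runs.
    return scores.mean(axis=0)

rng = np.random.default_rng(0)
sf_a = rng.random((6, 1000))  # scoring function A: 6 runs (reference + 5 variations), 1000 samples
sf_b = rng.random((6, 1000))  # scoring function B

rho, _ = spearmanr(ensemble_ordering(sf_a), ensemble_ordering(sf_b))
print(f"Spearman correlation of the two ensemble orderings: {rho:.3f}")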
Fig. 6 (panels: CIFAR-10, DCASE2020): Agreement of different scoring functions with varying models. Displayed is the pairwise Spearman correlation of the respective ensemble orderings of ensemble size six. The individual orderings building up the ensemble are obtained from the reference configuration and five additional variations of the model architecture and initialisation.
REFERENCES
[1]
S. Basu and J. Christensen, “Teaching classification boundaries
to humans,” Proceedings of the AAAI Conference on Artificial
Intelligence, vol. 27, no. 1, p. 109–115, Jun. 2013. [Online]. Available:
http://dx.doi.org/10.1609/aaai.v27i1.8623
[2]
F. Khan, B. Mutlu, and J. Zhu, “How do humans teach: On
curriculum learning and teaching dimension,” in Advances in Neural
Information Processing Systems, J. Shawe-Taylor, R. Zemel, P. Bartlett,
F. Pereira, and K. Weinberger, Eds., vol. 24. Curran Associates,
Inc., 2011. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2011/file/f9028faec74be6ec9b852b0a542e2f39-Paper.pdf
[3]
Y. Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculum
learning,” in Proceedings of the 26th Annual International Conference
on Machine Learning. ACM, Jun. 2009. [Online]. Available:
https://doi.org/10.1145/1553374.1553380
[4]
X. Wang, Y. Chen, and W. Zhu, “A survey on curriculum learning,”
2020. [Online]. Available: https://arxiv.org/abs/2010.13166
[5]
G. Hacohen and D. Weinshall, “On the power of curriculum
learning in training deep networks,” 2019. [Online]. Available:
https://arxiv.org/abs/1904.03626
[6]
P. Soviany, R. T. Ionescu, P. Rota, and N. Sebe, “Curriculum
learning: A survey,” International Journal of Computer Vision,
vol. 130, no. 6, pp. 1526–1565, Apr. 2022. [Online]. Available:
https://doi.org/10.1007/s11263-022-01611-x
[7]
F. Liu, T. Zhang, C. Zhang, L. Liu, L. Wang, and B. Liu,
“A review of the evaluation system for curriculum learning,”
Electronics, vol. 12, no. 7, p. 1676, Apr. 2023. [Online]. Available:
http://dx.doi.org/10.3390/electronics12071676
[8]
D. Mayo, J. Cummings, X. Lin, D. Gutfreund, B. Katz, and A. Barbu,
“How hard are computer vision datasets? calibrating dataset difficulty
to viewing time,” in Thirty-seventh Conference on Neural Information
Processing Systems Datasets and Benchmarks Track, 2023. [Online].
Available: https://openreview.net/forum?id=RADrFxYqIH
Fig. 7 (panels: CIFAR-10, DCASE2020): Agreement of different scoring functions with varying optimiser and learning rate combinations. Displayed is the pairwise Spearman correlation of the respective ensemble orderings of ensemble size six. The individual orderings building up the ensemble are obtained from the reference configuration, two additional optimiser variations, and the two best-performing learning rates for each optimiser.
[9]
X. Wang, Y. Chen, and W. Zhu, “A survey on curriculum learning,”
2020. [Online]. Available: https://arxiv.org/abs/2010.13166
[10]
Y. Song, B. Millidge, T. Salvatori, T. Lukasiewicz, Z. Xu,
and R. Bogacz, “Inferring neural activity before plasticity as a
foundation for learning beyond backpropagation,” Nature Neuroscience,
vol. 27, no. 2, p. 348–358, Jan. 2024. [Online]. Available:
http://dx.doi.org/10.1038/s41593-023-01514-1
[11]
K. Meding, L. M. S. Buschoff, R. Geirhos, and F. A. Wichmann,
“Trivial or impossible – dichotomous data difficulty masks model
differences (on imagenet and beyond),” 2021. [Online]. Available:
https://arxiv.org/abs/2110.05922
[12]
H. T. Kesgin and M. F. Amasyali, “Development and comparison of
scoring functions in curriculum learning,” 2022. [Online]. Available:
https://arxiv.org/abs/2202.06823
[13]
R. Lotfian and C. Busso, “Curriculum learning for speech emotion
recognition from crowdsourced labels,” IEEE/ACM Transactions on
Audio, Speech, and Language Processing, vol. 27, no. 4, p. 815–826, Apr.
2019. [Online]. Available: http://dx.doi.org/10.1109/TASLP.2019.2898816
[14]
A. Mallol-Ragolta, S. Liu, N. Cummins, and B. Schuller, “A curriculum
learning approach for pain intensity recognition from facial expressions,”
in 2020 15th IEEE international conference on automatic face and
gesture recognition (FG 2020). IEEE, 2020, pp. 829–833.
[15]
X. Wu, E. Dyer, and B. Neyshabur, “When do curricula work?” 2020.
[Online]. Available: https://arxiv.org/abs/2012.03107
[16]
H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein, “Visualizing
the loss landscape of neural nets,” in Advances in Neural Information
Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman,
N. Cesa-Bianchi, and R. Garnett, Eds., vol. 31. Curran Associates,
Inc., 2018. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2018/file/a41b3bb3e6b050b6c9067c67f663b915-Paper.pdf
[17]
A. Triantafyllopoulos and B. W. Schuller, “The role of task and acoustic
similarity in audio transfer learning: Insights from the speech emotion
recognition case,” in ICASSP 2021-2021 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021,
pp. 7268–7272.
[18]
M. Milling, A. Triantafyllopoulos, I. Tsangko, S. D. N. Rampp, and B. W.
Schuller, “Bringing the discussion of minima sharpness to the audio
domain: A filter-normalised evaluation for acoustic scene classification,”
in ICASSP 2024-2024 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 391–395.
[19]
V. I. Spitkovsky, H. Alshawi, and D. Jurafsky, “From baby steps to
leapfrog: how ”less is more” in unsupervised dependency parsing,” USA,
p. 751–759, 2010.
[20]
E. A. Platanios, O. Stretcu, G. Neubig, B. Poczos, and T. M. Mitchell,
“Competence-based curriculum learning for neural machine translation,”
2019. [Online]. Available: https://arxiv.org/abs/1903.09848
[21]
Y. Tay, S. Wang, A. T. Luu, J. Fu, M. C. Phan, X. Yuan, J. Rao,
S. C. Hui, and A. Zhang, “Simple and effective curriculum pointer-
generator networks for reading comprehension over long narratives,”
in Proceedings of the 57th Annual Meeting of the Association for
Computational Linguistics. Association for Computational Linguistics,
2019. [Online]. Available: http://dx.doi.org/10.18653/v1/P19-1486
[22]
S. Braun, D. Neil, and S.-C. Liu, “A curriculum learning method for
improved noise robustness in automatic speech recognition,” in 2017 25th
European Signal Processing Conference (EUSIPCO). IEEE, Aug. 2017.
[Online]. Available: http://dx.doi.org/10.23919/EUSIPCO.2017.8081267
[23]
S. Ranjan and J. H. L. Hansen, “Curriculum learning based approaches
for noise robust speaker recognition,” IEEE/ACM Transactions on Audio,
Speech, and Language Processing, vol. 26, no. 1, p. 197–210, Jan. 2018.
[Online]. Available: http://dx.doi.org/10.1109/TASLP.2017.2765832
[24]
B. W. Schuller, “Speech emotion recognition: Two decades in a nutshell,
benchmarks, and ongoing trends,” Communications of the ACM, vol. 61,
no. 5, pp. 90–99, 2018.
[25]
F. Ringeval, B. Schuller, M. Valstar, N. Cummins, R. Cowie, L. Tavabi,
M. Schmitt, S. Alisamir, S. Amiriparian, E.-M. Messner et al., “Avec 2019
workshop and challenge: state-of-mind, detecting depression with ai, and
cross-cultural affect recognition,” in Proceedings of the 9th International
on Audio/visual Emotion Challenge and Workshop, 2019, pp. 3–12.
[26]
S. Narvekar, B. Peng, M. Leonetti, J. Sinapov, M. E. Taylor, and P. Stone,
“Curriculum learning for reinforcement learning domains: A framework
and survey,” Journal of Machine Learning Research, vol. 21, no. 181, pp.
1–50, 2020. [Online]. Available: http://jmlr.org/papers/v21/20-212.html
[27]
G. Karakasidis, T. Grósz, and M. Kurimo, “Comparison and analysis of
new curriculum criteria for end-to-end asr,” 2022. [Online]. Available:
https://arxiv.org/abs/2208.05782
[28]
A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features
from tiny images,” 2009.
[29]
X. Glorot and Y. Bengio, “Understanding the difficulty of training
deep feedforward neural networks,” in Proceedings of the thirteenth
international conference on artificial intelligence and statistics. JMLR
Workshop and Conference Proceedings, 2010, pp. 249–256.
[30]
H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond
empirical risk minimization,” arXiv preprint arXiv:1710.09412, 2017.
[31]
M. Tan and Q. V. Le, “Efficientnet: Rethinking model scaling
for convolutional neural networks,” 2019. [Online]. Available:
https://arxiv.org/abs/1905.11946
[32]
——, “Efficientnetv2: Smaller models and faster training,” 2021.
[Online]. Available: https://arxiv.org/abs/2104.00298
[33]
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai,
T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly,
J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words:
Transformers for image recognition at scale,” 2020. [Online]. Available:
https://arxiv.org/abs/2010.11929
[34]
J. Feng, H. Tan, W. Li, and M. Xie, “Conv2next: Reconsidering conv
next network design for image recognition,” in 2022 International
Conference on Computers and Artificial Intelligence Technologies
(CAIT). IEEE, Nov. 2022. [Online]. Available: http://dx.doi.org/10.
1109/CAIT56099.2022.10072172
[35]
T. Heittola, A. Mesaros, and T. Virtanen, “Acoustic scene classification
in dcase 2020 challenge: generalization across devices and low
complexity solutions,” in Proceedings of the Detection and Classification
of Acoustic Scenes and Events 2020 Workshop (DCASE2020), 2020, pp.
56–60. [Online]. Available: https://arxiv.org/abs/2005.14623
[36]
Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, and M. D. Plumbley,
“Panns: Large-scale pretrained audio neural networks for audio pattern
recognition,” 2019. [Online]. Available: https://arxiv.org/abs/1912.10211
[37]
S. Dong, P. Wang, and K. Abbas, “A survey on deep learning and its
applications,” Computer Science Review, vol. 40, p. 100379, May 2021.
[Online]. Available: http://dx.doi.org/10.1016/j.cosrev.2021.100379
[38]
B. Ding, T. Zhang, C. Wang, G. Liu, J. Liang, R. Hu, Y. Wu, and
D. Guo, “Acoustic scene classification: A comprehensive survey,” Expert
Systems with Applications, vol. 238, p. 121902, Mar. 2024. [Online].
Available: http://dx.doi.org/10.1016/j.eswa.2023.121902
[39]
K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” 2015. [Online]. Available: https://arxiv.org/abs/1512.03385
[40]
M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen,
“Mobilenetv2: Inverted residuals and linear bottlenecks,” 2018. [Online].
Available: https://arxiv.org/abs/1801.04381
[41]
J. Hu, L. Shen, S. Albanie, G. Sun, and E. Wu, “Squeeze-and-excitation
networks,” 2017. [Online]. Available: https://arxiv.org/abs/1709.01507
[42]
K. Simonyan and A. Zisserman, “Very deep convolutional networks
for large-scale image recognition,” 2014. [Online]. Available: https:
//arxiv.org/abs/1409.1556
[43]
O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang,
A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual
recognition challenge,” International journal of computer vision, vol.
115, pp. 211–252, 2015.
[44]
J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen,
W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio
set: An ontology and human-labeled dataset for audio events,” in
2017 IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP). IEEE, Mar. 2017. [Online]. Available:
http://dx.doi.org/10.1109/ICASSP.2017.7952261
[45]
Z. Jiang, C. Zhang, K. Talwar, and M. C. Mozer, “Characterizing
structural regularities of labeled data in overparameterized models,”
2020. [Online]. Available: https://arxiv.org/abs/2002.03206
[46]
M. Toneva, A. Sordoni, R. T. d. Combes, A. Trischler, Y. Bengio,
and G. J. Gordon, “An empirical study of example forgetting
during deep neural network learning,” 2018. [Online]. Available:
https://arxiv.org/abs/1812.05159
[47]
D. Weinshall, G. Cohen, and D. Amir, “Curriculum learning by transfer
learning: Theory and experiments with deep networks,” 2018. [Online].
Available: https://arxiv.org/abs/1802.03796
[48]
R. J. N. Baldock, H. Maennel, and B. Neyshabur, “Deep learning
through the lens of example difficulty,” 2021. [Online]. Available:
https://arxiv.org/abs/2106.09647
[49]
A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan,
T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf,
E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner,
L. Fang, J. Bai, and S. Chintala, “Pytorch: An imperative style,
high-performance deep learning library,” 2019. [Online]. Available:
https://arxiv.org/abs/1912.01703
[50]
L. I. Kuncheva, C. J. Whitaker, C. A. Shipp, and R. P. Duin, “Limits
on the majority vote accuracy in classifier fusion,” Pattern Analysis &
Applications, vol. 6, pp. 22–31, 2003.
[51]
T. G. Dietterich, “Ensemble methods in machine learning,” in Interna-
tional workshop on multiple classifier systems. Springer, 2000, pp.
1–15.
[52]
S. Fort, H. Hu, and B. Lakshminarayanan, “Deep ensembles: A loss
landscape perspective,” arXiv preprint arXiv:1912.02757, 2019.
[53]
D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,”
2014. [Online]. Available: https://arxiv.org/abs/1412.6980
[54]
N. Qian, “On the momentum term in gradient descent learning
algorithms,” Neural Networks, vol. 12, no. 1, p. 145–151, Jan. 1999.
[Online]. Available: http://dx.doi.org/10.1016/S0893-6080(98)00116-6
[55]
P. Foret, A. Kleiner, H. Mobahi, and B. Neyshabur, “Sharpness-aware
minimization for efficiently improving generalization,” 2020. [Online].
Available: https://arxiv.org/abs/2010.01412
Simon Rampp received his Bachelor of Science in
Computer Science from the University of Augsburg
in 2022 and his Master of Science from the same
university in 2024. He is currently conducting re-
search as a guest at the chair of Health Informatics,
Technical University of Munich. His work focuses
on deep learning for computer vision and computer
audition, as well as understanding sample difficulty
for curriculum learning.
Manuel Milling received his Bachelor of Science in
Physics and in Computer Science from the University
of Augsburg in 2014 and 2015, respectively, and his
Master of Science in Physics from the same university
in 2018. He is currently a PhD candidate in Computer
Science at the chair of Health Informatics, Technical
University of Munich. His research interests include
machine learning, with a particular focus on the
core understanding and applications of deep learning
methodologies.
Andreas Triantafyllopoulos received the diploma in
ECE from the University of Patras, Greece, in 2017.
He is working toward the doctoral degree with the
Chair of Health Informatics, Technical University of
Munich. His current focus is on deep learning meth-
ods for auditory intelligence and affective computing.
Björn W. Schuller received his diploma in 1999,
his doctoral degree in 2006, and his habilitation and
was entitled Adjunct Teaching Professor in 2012, all
in electrical engineering and information technology
from TUM in Munich/Germany. He is Full Professor
of Artificial Intelligence and the Head of GLAM
at Imperial College London/UK, Chair of CHI –
the Chair for Health Informatics, MRI, Technical
University of Munich, Munich, Germany, amongst
other Professorships and Affiliations. He is a Fellow
of the IEEE and Golden Core Awardee of the IEEE
Computer Society, Fellow of the ACM, Fellow and President-Emeritus of
the AAAC, Fellow of the BCS, Fellow of the ELLIS, Fellow of the ISCA,
and Elected Full Member Sigma Xi. He (co-)authored 1,400+ publications
(60,000+ citations, h-index=111).