IPAD: Iterative pruning with activation deviation for sclera biometrics
Matej Vitek a,*, Matic Bizjak a, Peter Peer a, Vitomir Štruc b
a Faculty of Computer and Information Science, University of Ljubljana, Večna pot 113, SI-1000 Ljubljana, Slovenia
b Faculty of Electrical Engineering, University of Ljubljana, Tržaška 25, SI-1000 Ljubljana, Slovenia
article info
Article history:
Received 23 January 2023
Revised 2 June 2023
Accepted 16 June 2023
Available online 28 June 2023
Keywords:
Biometrics
Sclera segmentation
Ocular biometrics
Ocular segmentation
Model pruning
Lightweight deep learning
abstract
The sclera has recently been gaining attention as a biometric modality due to its various desirable char-
acteristics. A key step in any type of ocular biometric recognition, including sclera recognition, is the seg-
mentation of the relevant part(s) of the eye. However, the high computational complexity of the (deep)
segmentation models used in this task can limit their applicability on resource-constrained devices such
as smartphones or head-mounted displays. As these devices are a common desired target for such bio-
metric systems, lightweight solutions for ocular segmentation are critically needed. To address this issue,
this paper introduces IPAD (Iterative Pruning with Activation Deviation), a novel method for developing
lightweight convolutional networks based on model pruning. IPAD uses a novel filter-activation-
based criterion (ADC) to determine low-importance filters and employs an iterative model pruning pro-
cedure to derive the final lightweight model. To evaluate the proposed pruning procedure, we conduct
extensive experiments with two diverse segmentation models, over four publicly available datasets
(SBVPI, SLD, SMD and MOBIUS), in four distinct problem configurations and in comparison to state-of-
the-art methods from the literature. The results of the experiments show that the proposed filter-importance criterion outperforms the standard $L_1$ and $L_2$ approaches from the literature. Furthermore, the results also suggest that: (i) the pruned models are able to retain (or even improve on) the performance of the unpruned originals, as long as they are not over-pruned, with RITnet and U-Net at 50% of their original FLOPs reaching up to 4% and 7% higher IoU values than their unpruned versions, respectively, (ii) smaller models require more careful pruning, as the pruning process can hurt the model's generalization capabilities, and (iii) the novel criterion most convincingly outperforms the classic approaches when sufficient training data is available, implying that the abundance of data leads to more robust activation-based importance computation.
© 2023 The Author(s). Published by Elsevier B.V. on behalf of King Saud University. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
1. Introduction
Sclera biometrics is a subfield of biometric identity recognition
research. It studies the recognition of individuals using traits from
the sclera vasculature, i.e., the vasculature contained in the white
portion of the human eye (Vitek et al., 2020a; Das et al., 2013).
Unlike competing ocular biometric modalities, such as the retina,
the vasculature of the sclera is a visible ocular characteristic and,
therefore, does not require specialized acquisition hardware for
the imaging process, which makes it suitable for everyday applica-
tions. Furthermore, it can be imaged in the visible spectrum (VIS)
and is not affected by the presence of eye lenses, unlike the iris
(Derakhshani and Ross, 2007; Rot et al., 2018). These characteris-
tics make sclera recognition an ideal candidate for mobile authen-
tication schemes, either as a standalone modality or as part of
multi-modal authentication solutions. However, much of the
recent sclera-biometrics research has focused primarily on model
accuracy and has largely ignored the memory footprint and com-
putational complexity of the processing pipelines. This makes the
results of such research difficult to apply to mobile (and edge)
authentication schemes in practice, and provides strong motiva-
tion for the development of lightweight models and mechanisms
capable of reducing the (time/space) complexity of the overall
recognition pipelines. The development of such lightweight mod-
els is the main goal of the research work presented in this paper.
https://doi.org/10.1016/j.jksuci.2023.101630
* Corresponding author. E-mail address: matej.vitek@fri.uni-lj.si (M. Vitek). URL: https://sclera.fri.uni-lj.si/
A vital part of the sclera recognition process, one that also accounts for a significant share of the overall computational and memory load, is sclera segmentation. Recent research focusing on compar-
ative performance evaluations of sclera segmentation models
(Das et al., 2018; Das et al., 2019; Vitek et al., 2020b; Vitek et al.,
2023) has demonstrated the superiority of deep learning solutions
for this task. However, the top-performing models based on general-purpose architectures, such as U-Net (Ronneberger et al., 2015) and DeepLab (Chen et al., 2018), are typically over-parameterized
and, as a result, are quite demanding with respect to the hardware
needed for processing, both in terms of memory footprint as well
as the number of operations required for real-time processing. To
address some of these challenges, the OpenEDS competition was
organized recently (Garbin et al., 2019). The goal of the competi-
tion was to design ocular segmentation models that could be run
on modest hardware (available with virtual-reality (VR) head
mounted displays) and to encourage research into lightweight
model-design strategies. While several participants entered the
competition, the most successful models included hand-designed
architectures (that met the 1 MB memory-footprint constraint of
OpenEDS) following the common, U-Net inspired, encoder-
decoder network topology (Chaudhary et al., 2019; Perry and
Fernandez, 2019; Boutros et al., 2019). The best performing of
these models, RITnet (Chaudhary et al., 2019), featured only around 250,000 parameters, but due to its hand-crafted design still
led to a suboptimal trade-off between model complexity and per-
formance – as also demonstrated in the experimental part of this
paper.
A more structured approach towards reducing the memory footprint and computational complexity of contemporary deep learning models is to adopt solutions from the field of model compression (Choudhary et al., 2020). While different techniques have
been proposed in the literature in this area over the years, includ-
ing knowledge distillation procedures (Hinton et al., 2014; Romero
et al., 2015; Zhang et al., 2018; Schmid et al., 2023), quantization
mechanisms (Zhou et al., 2017; Zeng et al., 2022; Nevarez et al.,
2023), low-rank approximation techniques (Chang et al., 2022;
Kozyrskiy and Phan, 2020), weight-sharing strategies (Yi et al.,
2017; Dupuis et al., 2021; Dupuis et al., 2022) and approximate-
computation schemes (Kim et al., 2018; Masadeh et al., 2018; Hu
et al., 2022), one of the most popular and generally applicable solu-
tions towards developing lightweight models for various
computer-vision tasks, that also alleviates the need for (sub-
optimal) hand-crafted model architectures and is also at the core
of this work, is model pruning (LeCun et al., 1990; Liang et al.,
2021). Model pruning starts with a large model and removes
low-impact filters (or neurons) to decrease its complexity while
keeping the accuracy as high as possible. However, the mecha-
nisms that decide which filters of a convolutional neural network
to prune remain underexplored, with most of the existing
approaches relying on simplistic L
1
(Li et al., 2017)orL
2
(He
et al., 2018; Chin et al., 2020) norms of the kernel weights. As noted
by He et al. (2018), the L
p
norms utilized with these solutions are
commonly used as an approximation of which filter will result in
the lowest activations, and some works do in fact rely on
activation-based criteria directly (Polyak and Wolf, 2015; Hu
et al., 2016; Luo and Wu, 2017). Although considerable reductions
in model size and complexity have been achieved with the existing
pruning methods, identifying the most suitable filters to prune
remains challenging (especially globally, across model layers)
and requires more effective filter-importance criteria and pruning
techniques.
In this paper, we address this gap and propose both a new pruning criterion and a new iterative pruning approach that jointly lead to state-of-the-art model compression results. Here, we
deviate from the established pruning concepts based on kernel
norms and conjecture that the most critical filters in a layer are
not necessarily the ones leading to the strongest activations, but
rather the ones with very distinct activations relative to all other
filters in the given layer. The main premise behind this conjecture
is that such filters carry the highest amount of new information
and should not be discarded due to potentially low corresponding
activations. Based on this insight, we propose a new Activation-
Deviation Criterion (ADC) in this paper that quantifies filter
importance by estimating the amount of new information the filter
contributes to the activation map of a given model layer. We show
that ADC can be combined with standard $L_p$-norm criteria and that such a combination convincingly outperforms the basic $L_1$ and $L_2$
approaches from the literature. Furthermore, we demonstrate that
the novel criterion is easily adapted into existing solutions that
address global filter importance (i.e. the importance of a specific
filter in the entire network), rather than the more commonly used
local importance (i.e. the importance of a specific filter in a layer),
such as the recent state-of-the-art pruning approach LeGR (Chin
et al., 2020). Finally, we develop a novel Iterative Pruning approach based on the proposed Activation Deviation criterion (IPAD) and
evaluate it in comprehensive experiments with two sclera segmen-
tation models and across four ocular datasets with diverse charac-
teristics. Our experimental results show that IPAD yields highly
competitive performance when compared to competing solutions
from the literature and that the proposed ADC criterion leads to
well pruned models capable of retaining (or even improving) the
performance of the initial (over-parameterized) segmentation
models.
In summary, this paper makes the following main
contributions:
We propose IPAD (Iterative Pruning with Activation Deviation),
a state-of-the-art model pruning approach that iterates
between: (1) pruning low-importance filters using a novel crite-
rion for filter importance, and (2) model retraining. The main motivation for such an approach is to progressively reduce the (time/space) complexity of the initial model,
while ensuring maximum performance after each pruning stage
through the iterative retraining. To ensure reproducibility of our
results and to facilitate further research into (mobile) ocular
biometrics, we make all experimental code publicly available
from sclera.fri.uni-lj.si.
We introduce a novel Activation Deviation Criterion (ADC) for
filter importance, designed for convolutional neural network
pruning. ADC estimates the amount of new information a filter
contributes to the activation map of the given layer. Rigorous
experiments with four different datasets and two CNN models
with distinct characteristics show that the proposed criterion
improves on the performance of the literature-standard $L_1$ (Li et al., 2017) and $L_2$ (He et al., 2018; Chin et al., 2020) norms
of the filter weights, and that it can easily be incorporated into
other existing pruning solutions, such as the global-ranking-
based filter pruning approach from (Chin et al., 2020).
Supported by a comprehensive experimental analysis, we pro-
vide several new insights into the use of pruning and the pro-
posed IPAD approach in sclera segmentation. Namely, we
observe that the pruned models are in general able to retain
(and in several cases even improve on) the performance of the
initial unpruned models for different target FLOP (floating point
operation) counts. A notable exception are the lightweight models used in cross-dataset experiments: after the initial pruning, performance improvements are observed at first.
With higher FLOP reductions, on the other hand, the perfor-
mance starts dropping quickly. This implies that the pruning
procedure helps identify less relevant filters and optimize the
already small model architecture. However, after the weakest
filters are eliminated, the removal of additional ones starts
hurting the model’s performance. Additionally, this observation
also implies that when the models are already small, care needs
to be taken when pruning them further, as this might reduce
the models’ generalization capabilities. Finally, we also observe
that ADC most convincingly outperforms the classic $L_1$ and $L_2$
pruning as well as the other baselines when there is sufficient
training data available. This is likely due to the abundance of
data leading to more robust activation-based importance
computation.
The remainder of this paper is structured as follows. In Section 2, we review and summarize the relevant literature, outlining the drawbacks and limitations of the existing approaches to model compression and sclera segmentation. Section 3 introduces our novel pruning criterion and elaborates on the key new ideas. Section 4 first discusses the datasets, performance metrics, and baseline models utilized for the experiments and then presents the results of our experimental evaluation. Finally, in Section 5, we summarize the main findings of the paper and conclude with some parting thoughts and directions for future research.
2. Related work
In this section, we review closely related literature most rele-
vant to the proposed pruning approach. Specifically, we first dis-
cuss general model compression mechanisms, then elaborate on
the main concepts and ideas behind model pruning, and finally
review existing work in the targeted application domain, i.e., sclera
segmentation. The goal of this section is to provide context for our
work and further motivate the main contributions made. For a
more in-depth overview of the existing literature, we refer the
reader to Cheng et al. (2018), Liang et al. (2021), Choudhary et al.
(2020) for comprehensive review papers on model compression
and to Nigam et al. (2015); Das et al. (2013); Vitek et al. (2023)
for surveys and comparative studies related to sclera
segmentation.
2.1. Model compression
With the increasing availability of computational power, vari-
ous areas of machine learning have been shifting towards ever lar-
ger models. While such model scaling has led to major
breakthroughs, it has also introduced significant challenges when
deploying the models on devices with limited computing
resources, such as mobile phones or edge devices. Consequently,
the field of model compression techniques has advanced in parallel
with the development of deeper and more heavily parameterized
models. The primary objective of model compression techniques
is to reduce the memory footprint and/or computational complex-
ity of deep learning models, making it feasible to deploy them on
less capable computing hardware. Numerous solutions have
emerged in this area over the years that can conveniently be
grouped into three main categories, as also depicted in the taxon-
omy in Fig. 1, i.e.:
Architecture compression techniques that aim at reducing the
size (or space complexity) of the trained deep learning models
while maximizing performance using pruning, knowledge dis-
tillation or dedicated model-design schemes. Techniques from
this category are the most widely applicable and largely inde-
pendent of the targeted deployment device and implementation
frameworks.
Data-redundancy compression techniques that mostly focus on
decreasing the computational complexity of the deep learning
models through quantization procedures, low-rank approxima-
tion schemes and weight-sharing strategies, but, depending on
the mechanism used, also often reduce the space complexity.
These techniques are typically dependent on the targeted pro-
cessor architecture and model implementation.
Computation approximation techniques that speed up computation in deep learning models through approximation schemes based on approximate multipliers or Winograd networks. Techniques from this category commonly reduce the models' computational complexity, while leaving their spatial requirements the same.

Fig. 1. Taxonomy of model compression techniques. Depending on the goal of the compression techniques, existing solutions in this area can be partitioned into three categories that target the architecture, the data representations or the mathematical operations of the deep learning models. The procedure proposed in this work falls into the group of architecture-compression algorithms based on pruning (marked orange).
Below, we briefly review some of the recent examples from
each of the three categories.
Architecture compression. One of the most notable solutions
towards compressing pretrained deep learning models is knowl-
edge distillation (Hinton et al., 2014). Here, the key idea is to first
train a large (teacher) network and then use the outputs of the tea-
cher network as the target (reference) outputs of a smaller (stu-
dent) network. In this way, the student learns to approximate
the same function as the teacher, but does so with far fewer
parameters. Romero et al. (2015) generalized the distillation pro-
cedure to students that are deeper (but thinner) than the original teacher model, possibly achieving even better results than the teacher itself. Zhang et al. (2018) extended the idea of distil-
lation by training several student models cooperatively, rather
than having a single strict teacher-student relation. In Zagoruyko
and Komodakis (2017), the authors incorporated attention in the
distillation process, a very common concept in recent deep neural
networks. Several further improvements were made in Yim et al.
(2017), which introduced a novel distillation technique, useful
for fast optimization, that made distillation applicable to transfer
learning scenarios. Some examples of distillation techniques successfully applied to various vision problems are presented in Liu et al. (2019), Chen et al. (2017), Gong et al. (2022), and Luo et al. (2016).
Despite its promise, knowledge distillation has also been shown
to leave significant gaps in the predictive power of the student
models, even when the student should be able to match the tea-
cher (Stanton et al., 2021).
Another popular approach from the architecture-compression
category is the design and training of lightweight models from
scratch. The main objective here is to construct small and light-
weight models that mimic the behavior of larger and, in general,
more capable models, but are trained with standard learning tech-
niques using the available training data only. This approach has
been shown to perform well in many cases (Zhu and Gupta,
2017), but in a sense predetermines a specific model size and the
corresponding performance during the design stage. Additionally,
it is limited in its flexibility and does not allow one to effectively investigate different trade-offs between model size and performance.
The approach proposed in this paper falls into the group of pruning methods (discussed in Section 2.2) and addresses
many of the shortcomings discussed above. As we show in the
experimental section, it maximizes performance, while decreasing
the space complexity of the initial models and allowing for control
over the complexity-vs.-performance trade-off.
Data-redundancy compression. A common approach towards
reducing the data-representation redundancy in deep learning
models is to rely on quantization (Zhou et al., 2017) and/or low-
rank approximations (Jaderberg et al., 2014; Tai et al., 2016), both
of which aim at decreasing the memory footprint of the filter-
weight matrices through approximation. Quantization achieves
this by replacing the floating-point representation of the weights
and biases (Young et al., 2021; Nevarez et al., 2023), and possibly
activations as well (Mei et al., 2021), with quantized low-bit inte-
gers (Mei et al., 2021) or even single-bit boolean values (i.e. bina-
rization) (Zhao et al., 2017; Zeng et al., 2022). Low-rank
approximation, on the other hand, focuses on optimizing each
matrix or tensor as a whole, using known mathematical methods
for low-rank matrix approximation, such as SVD (Chang et al.,
2022) or the higher-order Tucker decomposition (Zeng et al.,
2020; Kozyrskiy and Phan, 2020). Since the two methods are
related, some recent work even combines them into a single cohe-
sive method for network compression (Kozyrskiy and Phan, 2020).
As noted in Jaderberg et al. (2014), low-rank approximation meth-
ods exploit the large amount of redundancy in the network’s filter
base. A different method of exploiting this same redundancy is
weight-sharing, where certain weights are shared between various
filters, thereby eliminating their redundancy (Yi et al., 2017;
Dupuis et al., 2021; Dupuis et al., 2022) for a more efficient compu-
tational model. Techniques from this category are in general com-
plementary to the architecture-compression techniques discussed
above and can be used to further reduce the memory footprint of
the distilled or pruned models.
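As a toy illustration of the quantization idea mentioned above (a generic uniform scheme of our own, not any specific method from the cited works):

```python
import numpy as np

def quantize_uniform(w: np.ndarray, bits: int = 8):
    """Uniform symmetric quantization of a weight tensor.

    Maps float weights onto signed integer codes and returns both the
    codes and their dequantized (float) view.
    """
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    codes = np.round(w / scale).astype(np.int8)
    return codes, codes.astype(np.float32) * scale

# Example: 8-bit codes shrink float32 weights by roughly 4x in memory.
codes, w_hat = quantize_uniform(np.random.randn(64, 32, 3, 3).astype(np.float32))
```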
Computation approximation. Techniques from this category
aim to replace the computations in the deep learning models with
simpler alternatives. Solutions based on approximate multiplica-
tion (Blake et al., 1998; Lotrič and Bulić, 2012; Gysel et al., 2016),
for example, reduce the computational complexity of the models
at a low level, replacing every multiplication operation in the net-
work with an approximate multiplier. These approximate multipli-
ers can result in much lower power consumption and faster
execution at the cost of accuracy. However, as seen in e.g. Kim
et al. (2018, 2022), the drop in accuracy is oftentimes negligible.
A comprehensive comparison of approximate multipliers is avail-
able in Masadeh et al. (2018). Winograd networks, on the other
hand, focus on speeding up the convolution operation by replacing
it with the Winograd transformation (Yu et al., 2017; Yang et al.,
2020; Wu et al., 2021; Wu et al., 2022). Similarly to data-redundancy techniques, solutions from this category represent a complementary addition, rather than an alternative to architecture-compression techniques, and can be applied to further simplify the targeted models.
2.2. Pruning
Model pruning is a way to reduce the memory footprint of deep
neural networks by removing low-impact neurons (or filters in the
case of convolutional neural networks (CNNs)) with the goal of
reducing the size of the network with as little reduction in accu-
racy as possible. The concept of model pruning has existed since
the inception of deep learning (LeCun et al., 1990) and its effective-
ness has been studied extensively in Zhu and Gupta (2017). Pruning represents one of the most popular mechanisms for deep neural network compression, mainly due to its flexibility and the fact that it can easily be combined with other model compression mechanisms, such as knowledge distillation, quantization, and approximate multiplication (Han et al., 2016).
Many different pruning approaches were introduced in the lit-
erature over the years (Liang et al., 2021), but the vast majority
of modern techniques focus on quantifying the importance of each
filter (or neuron) within a given network layer and then pruning
away a certain fraction of the least important filters from within
the analyzed layer. Here, different scoring criteria are typically uti-
lized, but most pruning approaches rely on $L_p$ norms (Li et al., 2017; He et al., 2018; Chin et al., 2020) of the filter weights (or
some extension of this concept) as a proxy for filter importance.
A notable line of research also analyzes the activations generated
by the filters directly to quantify their importance (Polyak and
Wolf, 2015; Hu et al., 2016; Luo and Wu, 2017) or uses other
derived criteria for this task (He et al., 2017; Luo et al., 2017).
Data-driven methods, such as the one proposed by Huang et al. (2018), have also been proposed and were shown to
offer a straightforward way of controlling the model size/accuracy
trade-off.
More recently, pruning approaches have also been presented
that allow the estimation of filter importance across network lay-
ers and not only within the layers as discussed above. LeGR (Chin
et al., 2020), for example, learns a global ranking of the filters in
the network through a data-driven approach. As a result, it is able
to derive different variants of the pruned model with different
complexities and accuracies. Even though the global ranking intro-
duces another level of flexibility into the pruning process, at their
core, such procedures still rely on base $L_p$ norms to quantify the
impact of the individual filters.
In this paper, we build on the research outlined above, and pre-
sent a novel criterion for quantifying filter importance that can be
used with standard pruning techniques for within-layer filter rank-
ing, but also global techniques for ranking filters across different
model layers. Different from existing techniques, the criterion is
based on filter activations, judging filter impact based on the
amount of information a given activation contributes to the overall
output of the given layer. To the best of our knowledge, we are the
first to consider such a differential criterion, which, as we show in
the experimental section, leads to highly competitive pruning per-
formance and is complementary to the standard $L_p$-norm-based
criteria.
2.3. Sclera segmentation
Our work studies the development of lightweight models
specifically for the task of sclera segmentation. An important
source of information on sclera segmentation approaches and their
performance is the annual Sclera Segmentation Benchmarking
Competition (SSBC) (Das et al., 2015; Das et al., 2016; Das et al.,
2017; Das et al., 2018; Das et al., 2019; Vitek et al., 2020b), which
has been organized as part of major biometrics conferences for sev-
eral years now.
Before the era of deep learning, most solutions for semantic seg-
mentation tasks used handcrafted features and filter-based meth-
ods. Few of these were sclera-specific. Two recent examples of
such methods evaluated in the scope of the SSBC competition ser-
ies are the Unsupervised Sclera Segmentation (USS) approach
(Riccio et al., 2017), which ranked 2nd in SSBC 2017 (Das et al., 2017), and the Sclera Segmentation using Image Properties (SSIP) technique, which was the only entry from SSBC 2020 (Vitek
et al., 2020b) not based on deep learning. Dimauro et al. (2023) present a handcrafted approach developed for medical analysis
of the sclera. Most recent sclera-segmentation solutions, including
the top performers of the latest editions of SSBC, use (deep)
general-purpose semantic segmentation models, usually based on
the convolutional encoder-decoder (CED) architectures, such as
U-Net (Ronneberger et al., 2015; Rot et al., 2020; Vitek et al.,
2020b; Lv et al., 2022; Wang et al., 2022; Das et al., 2022), SegNet
(Badrinarayanan et al., 2017; Das et al., 2017; Rot et al., 2020; Vitek
et al., 2020a; Rot et al., 2018), ScleraSegNet (Wang et al., 2019; Das
et al., 2019), RefineNet (Lin et al., 2017, 2018, 2020a), and DeepLab
(Chen et al., 2018; Vitek et al., 2020b).
Such models perform quite well, but have a large number of
parameters and complex architectures, which makes them expen-
sive in terms of both computation and memory requirements. We
aim to implement a lightweight model that comes as close as pos-
sible to the performance of these larger, heavily parameterized
models, but exhibits only a fraction of the memory footprint and
a significantly reduced FLOP count.
3. Methodology
The main contribution of this work is the IPAD (Iterative Prun-
ing with Activation Deviation) pruning approach, which iteratively
reduces the computational complexity of deep learning models by
pruning away the lowest-impact filters, identified through a novel
activation-deviation criterion (ADC), as shown in Fig. 2. We study
IPAD in the context of sclera-segmentation models. In this section,
we first present a high-level overview of the proposed approach
and discuss its characteristics, then introduce the novel pruning criterion, and finally recapitulate the entire approach in a step-by-step summary.

Fig. 2. Overview of the Iterative Pruning with Activation Deviation (IPAD) approach. IPAD takes an existing deep learning model as input (top left) and then iteratively prunes the lowest-impact filters to produce a simpler model with reduced complexity (bottom right). During each iteration, IPAD reduces the number of FLOPs of the model by a predefined decrement $\Delta F$. After each pruning step, the model is retrained to ensure optimal performance given the current topology. The procedure is repeated until the desired FLOP count is reached. At the core of IPAD is a novel Activation-Deviation Criterion (ADC) that together with the standard $L_p$-norm criterion drives the pruning procedure.
3.1. Iterative Pruning with Activation Deviation (IPAD)

Assume a well-trained (overparameterized) deep-learning model $X$ with $N$ layers and $l_i$ filter kernels $\{K_n\}_{n=1}^{l_i}$ in each layer, where $i \in \{1, \ldots, N\}$. The goal of the IPAD procedure is to produce an (optimal) pruned model $Y^*$ based on the following constrained maximization problem (He et al., 2017):

$$Y^* = \underset{Y \in \mathcal{Y}}{\arg\max}\, \{P(Y)\}, \quad \text{s.t.}\ \mathrm{FLOPs}(Y) \leq p \cdot \mathrm{FLOPs}(X), \tag{1}$$

where $\mathcal{Y}$ is the set of deep models with a subset $K$ of $X$'s filters (i.e., $K \subseteq \{K_n\}_{n=1}^{l_i}$), $P$ is the scoring function used to evaluate the performance of the model, and $p \in [0, 1]$ is the targeted fraction of the floating-point operations (FLOPs) to be retained. Thus, the overall objective is to identify a pruned model $Y$ that maximizes performance, while needing fewer FLOPs than the initial model $X$ for processing a given input. Because we are targeting segmentation models in this work, $P$ is defined as a function that returns the average Intersection-over-Union (IoU) over the available training data (Vitek et al., 2020a; Rot et al., 2018; Lozej et al., 2018).
As illustrated in Fig. 2, IPAD approaches the optimization process in Eq. (1) through an iterative procedure that prunes the filters of $X$ gradually, in increments that correspond to a predefined reduction $\Delta F$ in the FLOP count of the model. In each step of the optimization process, the least important filters are pruned and the model is retrained for a fixed number of epochs. This process is repeated until the desired model complexity is reached, at which point the final (pruned) model is retrained once more until convergence. Because the filter importance is determined for each layer separately, we impose an upper limit on how many filters can be pruned from a given layer to maintain the desired overall model architecture, similarly to competing approaches from the literature (Shang et al., 2022).
The main motivation for the iterative pruning process used with
IPAD is twofold:
Complexity-performance trade-off: The iterative nature of
IPAD and incremental removal of filters corresponding to a
computing budget of $\Delta F$ FLOPs, allows for fine-grained control
of the complexity-performance trade-off ensured by the pruned
models. Because the model is retrained after each pruning step,
the loss in model performance due to the pruning-induced
model reparameterization is explicitly minimized. This charac-
teristic represents a unique aspect of IPAD not available with
the majority of competing techniques.
Global relevance: IPAD quantifies filter importance in a local
manner, i.e., separately for each model layer, similarly to the
majority of existing pruning techniques (Liang et al., 2021).
Since filters are pruned locally, the fixed-epoch retraining step
(conducted at each iteration) updates the remaining filters
and, in a sense, readjusts their global importance. Thus, the iter-
ative procedure contributes towards the global relevance of the
local (iterative) pruning process and addresses the limitations
of existing pruning approaches that are typically of a local
nature.
The key component for successful and well-performing model
pruning is the criterion utilized for determining filter importance.
For IPAD, we develop a novel criterion for this task that relies on
filter activations rather than kernel norms and is presented in
detail in the next section.
3.2. Novel pruning criterion

One of the most important parts of any pruning procedure is the selection of the low-impact neuron(s) or (in our case) filter(s) that can be removed with minimal impact on performance. A common way of quantifying filter importance is to compute the $L_1$ or $L_2$ norm of the filter weights (Li et al., 2017; He et al., 2018; Chin et al., 2020), i.e.:

$$\psi_{w1}(K_n) = \|K_n\|_1 = \sum_{i=1}^{h_k} \sum_{j=1}^{w_k} |K_n(i,j)| \quad \text{or} \tag{2a}$$

$$\psi_{w2}(K_n) = \|K_n\|_2 = \sqrt{\sum_{i=1}^{h_k} \sum_{j=1}^{w_k} K_n(i,j)^2}, \tag{2b}$$

where $\psi_{wx}: \mathbb{R}^{h_k \times w_k \times c} \to \mathbb{R}$ is the standard weights-based criterion, $K_n \in \mathbb{R}^{h_k \times w_k \times c}$ is the weight matrix of the $n$-th filter in the given layer, $h_k$ and $w_k$ are the kernel height and width, and $c$ is the number of channels, i.e., the kernel depth. The main assumption here is that filters with small weights, and consequently small norms, contribute little to the outputs/activations of the model and, in turn, can be removed from the model.

Fig. 3. Computation of the activation maps $A_n$ from an input image within a given layer. The activation maps are used to compute the mean activation map $\bar{A}$ of the whole layer. The figure is illustrative.
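For concreteness, the weight-based criterion can be sketched as follows. This is a minimal example assuming PyTorch and its (out, in, h, w) convolutional weight layout; the function name is illustrative and not taken from the authors' released code:

```python
import torch

def weight_norm_importance(conv_weight: torch.Tensor, p: int = 1) -> torch.Tensor:
    """Per-filter L_p norm of a convolutional weight tensor (Eqs. 2a/2b).

    conv_weight has shape (n_filters, c, h_k, w_k); the norm is computed
    over all weights that belong to each individual filter.
    """
    return conv_weight.reshape(conv_weight.shape[0], -1).norm(p=p, dim=1)

# Example: scores for a layer with 64 filters of spatial size 3x3 and depth 32.
scores = weight_norm_importance(torch.randn(64, 32, 3, 3), p=2)
candidate = int(scores.argmin())  # least important filter under the L2 criterion
```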
However, because such weight-based criteria serve only as proxies for the expected filter activations (He et al., 2018), which correlate more closely with filter importance, we propose a novel Activation-Deviation Criterion (ADC) in this section that predicts filter impact directly from the activations produced over some training dataset. To derive our activation-based criterion, we begin with the activation maps $A_n \in \mathbb{R}^{h \times w}$, $n = 1, \ldots, l_i$, of the model, presented in Fig. 3, where $h$ and $w$ are the height and width of the activation maps, respectively, and $l_i$ is the number of output channels, i.e., the number of filters in the layer. Note that in the actual implementation, the criterion is computed over batches of input images to increase robustness; however, for a simplified illustration we rely on a single input image throughout the entire derivation. To capture the filter importance, ADC quantifies the deviation of each filter's activation map from the overall mean activation $\bar{A}$ of all filters within a given layer. Thus, ADC first computes the mean activation $\bar{A}$, as illustrated in Fig. 3:

$$\bar{A} = \frac{1}{l_i} \sum_{n=1}^{l_i} A_n, \tag{3}$$

where the $n$-th activation map $A_n$ is computed by convolving the input with the $n$-th kernel $K_n$ of the given layer. The corresponding deviation from the mean activation is then determined as:

$$D_n = A_n - \bar{A}. \tag{4}$$

These deviations $D_n$ serve as a measure of how much new information the $n$-th filter brings to the overall activation map of a given model layer. The computation procedure is illustrated in Fig. 4.
Finally, to obtain a scalar measure of the importance of a filter, ADC uses the standard $L_1$ or $L_2$ norms by applying them to the difference matrices computed in the previous step:

$$\psi_{a1}(A_n) = \frac{\|D_n\|_1}{h \cdot w} = \frac{\sum_{i=1}^{h} \sum_{j=1}^{w} |D_n(i,j)|}{h \cdot w} \quad \text{or} \tag{5a}$$

$$\psi_{a2}(A_n) = \frac{\|D_n\|_2}{h \cdot w} = \frac{\sqrt{\sum_{i=1}^{h} \sum_{j=1}^{w} D_n(i,j)^2}}{h \cdot w}, \tag{5b}$$
where $\psi_{ax}$ is the ADC criterion and $x$ defines the norm type. Note that the standard weight-based pruning criteria from Eqs. (2a) and (2b) operate on $h_k \times w_k$ kernels, whereas the ADC criterion operates on $h \times w$ activation maps, where typically $w \gg w_k$ and $h \gg h_k$. For IPAD, we therefore define a combined criterion $\psi(n)$ to obtain a comprehensive and complementary description of the importance of the $n$-th filter, i.e.:

$$\psi(n) = \alpha\, \psi_w(K_n) + (1 - \alpha)\, \psi_a(A_n), \tag{6}$$

where $\alpha$ is the weighting parameter that determines the trade-off between the two criteria. In each IPAD iteration, we prune the filter with the lowest importance.

Fig. 4. Computation of the difference matrices $D_n$. The difference matrices capture the amount of new information each filter's activation brings to the overall layer's activation map. The scalar filter importance is then computed as the mean intensity of each difference matrix. Note that for illustration purposes, we show the absolute values of $D_n$.

Table 1
High-level characteristics of the four datasets used in the experiments. The datasets differ in terms of image resolution, acquisition devices used, gaze directions and blur, but also in the amount of data available.

| Dataset | #Images | #IDs | #Eyes | Resolution [px] | Sources of Variability† |
| --- | --- | --- | --- | --- | --- |
| SMD (Das, 2017) | 500 | 25 | 50 | 3264 × 2448 | BL, CN |
| SLD (Vitek et al., 2023) | 108 | 27 | 54 | 3264 × 2448 | BL, CN |
| SBVPI (Vitek et al., 2020a; Rot et al., 2020) | 1858 | 55 | 110 | 3000 × 1700 | GZ, BL |
| MOBIUS (Vitek et al., 2020b; Vitek et al., 2023) | 3542 | 35 | 70 | 3000 × 1700 | MD, CN, GZ, BL |

† GZ - gaze, BL - blur, CN - acquisition condition, MD - mobile device.
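To make the criterion concrete, below is a minimal sketch of ADC and the combined criterion from Eqs. (3)-(6). It assumes PyTorch and a batch of activations from one layer (the batch averaging follows the robustness note above); the names and the handling of the two terms' relative scales are illustrative, not taken from the authors' released code:

```python
import torch

def adc_importance(activations: torch.Tensor, p: int = 1) -> torch.Tensor:
    """Activation-Deviation Criterion (Eqs. 3-5).

    activations: tensor of shape (batch, l_i, h, w) holding the feature
    maps of one layer. Returns one score per filter: the batch-averaged
    L_p norm of the deviation of each filter's activation map from the
    layer's mean activation map, normalized by the map size.
    """
    mean_map = activations.mean(dim=1, keepdim=True)          # Eq. (3)
    deviations = activations - mean_map                       # Eq. (4)
    b, l_i, h, w = activations.shape
    norms = deviations.reshape(b, l_i, -1).norm(p=p, dim=2)   # Eq. (5)
    return norms.mean(dim=0) / (h * w)

def combined_importance(conv_weight, activations, alpha=0.5, p=1):
    """Combined IPAD criterion (Eq. 6): alpha*psi_w + (1 - alpha)*psi_a."""
    psi_w = conv_weight.reshape(conv_weight.shape[0], -1).norm(p=p, dim=1)
    psi_a = adc_importance(activations, p=p)
    return alpha * psi_w + (1 - alpha) * psi_a
```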
3.3. IPAD pseudocode
Using the newly proposed criterion, the complete IPAD pruning
approach is implemented in accordance with the pseudocode pro-
vided in Fig. 1.
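At a high level, the procedure can be sketched as follows. This is a condensed outline, not the authors' released implementation; the helpers `flops`, `select_least_important_filter`, `remove_filter`, and `retrain` are hypothetical stand-ins for the corresponding steps of Algorithm 1:

```python
def ipad_prune(model, train_data, target_fraction=0.5, delta_f=1e9):
    """Condensed outline of the IPAD prune-retrain loop.

    Repeatedly removes the least important filter (combined criterion,
    Eq. 6) until the FLOP count drops by delta_f, retrains briefly, and
    repeats until only target_fraction of the initial FLOPs remain.
    """
    target_flops = target_fraction * flops(model)
    while flops(model) > target_flops:
        step_goal = max(flops(model) - delta_f, target_flops)
        while flops(model) > step_goal:
            layer, idx = select_least_important_filter(model, train_data)
            remove_filter(model, layer, idx)   # respects the per-layer pruning limit
        retrain(model, train_data, epochs=5)   # short retraining per iteration
    retrain(model, train_data, until_convergence=True)  # final retraining
    return model
```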
4. Experiments and results
In this section, we present experiments conducted to evaluate
IPAD and the proposed filter-importance criterion. We start the
section with a description of the experimental datasets, perfor-
mance metrics, and hyperparameters used for the evaluations
and then proceed with the presentation and discussion of the
results.
4.1. Datasets
Four datasets were used for our experimental work, two of
which (MOBIUS and SBVPI) were collected at the University of
Ljubljana and are publicly available for research purposes from
sclera.fri.uni-lj.si. The remaining two (SMD and SLD) are external
datasets but are also publicly available on request (Vitek et al.,
2023). Details on the four datasets are given below and their key
characteristics are summarized in Table 1.
SBVPI (Rot et al., 2020; Vitek et al., 2020a) is a dataset of 1858
images of 55 subjects (i.e. 110 eyes) with corresponding hand-
crafted markups of the sclera and the periocular region.
A subset of roughly 130 images is also annotated with the sclera
vessels, pupil, iris, canthus, and eyelashes. The samples in the
dataset were captured in laboratory conditions with a DSLR
camera, and they are therefore high-quality high-resolution
ocular images acquired in well-lit conditions. The images come
with labels for the corresponding subject ID, eye (left/right), and
gaze direction (left/right/up/straight). The dataset also contains
additional subject information, namely age, gender, and eye col-
our. We show some sample images, as well as region markups
from the dataset in Fig. 5(a).
The MOBIUS dataset (Vitek et al., 2020b; Vitek et al., 2023),
shown in Fig. 5(b), is a mobile ocular dataset of almost 17000
ocular images belonging to 100 subjects (i.e. 200 eyes). Its seg-
mentation subset – which contains 3542 images from 35 sub-
jects (70 eyes) – comes bundled with manually crafted ground
truth markups for the sclera, pupil, iris, and periocular region.
The images in the dataset were acquired in different capturing
conditions: using 3 different mobile phones (Sony Xperia Z5
Compact, Apple iPhone 6s, Xiaomi Pocophone F1), in 3 different
lighting conditions (natural lighting, indoor lighting, unlit
indoor room), and with 4 different gaze directions (left/right/
up/straight). The dataset additionally contains some deliber-
ately unusable (“bad”) images, which contain image noise (such
as motion blur, obstructions, etc.) and are intended as negative
samples in quality control. Since our experiments do not
include the study of quality assessment, we discard the bad
images to obtain the final dataset of 3475 images. All capturing
conditions, as well as the corresponding subject ID and eye (left/
right), are labelled in the image names. Additionally, the dataset
contains rich subject metadata, including information about
their age, gender, eye colour, dioptres and other medical condi-
tions, allergies, whether they smoke, and whether they wore
lenses or used eyedrops at the time of the image acquisition.
The SMD (Das, 2017) and SLD (Vitek et al., 2023), shown in
Figs. 5(c), 5(d), are external datasets, obtained and used with
the permission of their author. SMD is a dataset of 500 images
from 25 individuals (i.e. 50 eyes), captured using a mobile cam-
era in different lighting conditions. It has been used in several
SSBC competitions (Das et al., 2019; Vitek et al., 2020b). SLD
is a smaller dataset of 108 images from 27 individuals (54 eyes)
captured by a mobile camera under different gaze directions,
which was developed primarily for sclera liveness detection. It
was also utilized for the recent exploration of demographic
and algorithmic bias in sclera segmentation methods (Vitek
et al., 2023).
We split all of our datasets into a training set (used for training
the segmentation models), validation set (used for early stopping
and hyperparameter selection), and testing set (used for evalua-
tion) in a 70/20/10% split. The cross-dataset experiments (see Sec-
tion 4.5) instead use a 70/30% split on SMD for training and
validation data, and use the entire SLD dataset for evaluation
(i.e., performance reporting). Depending on the ground-truth
annotations available with the four datasets we conduct either 2-
class (sclera vs. the rest) or 4-class (sclera, iris, pupil, and back-
ground) segmentation experiments.
4.2. Performance metrics
The primary performance indicator used throughout our exper-
iments as a measure of model accuracy is IoU (Intersection-over-
Union), a standard measure in the field of semantic segmentation
(Vitek et al., 2023; Vitek et al., 2020a). For the 2-class segmentation
task, we use the IoU of the positive class (i.e., the sclera), defined as:

$$\mathrm{IoU} = \frac{|P \cap T|}{|P \cup T|}, \tag{7}$$

where $P$ is the set of pixels predicted to belong to the sclera by the model and $T$ is the set of actual sclera pixels. For the 4-class ocular segmentation problem, we use mIoU (mean intersection-over-union), defined as:

$$\mathrm{mIoU} = \frac{1}{C} \sum_{c=1}^{C} \mathrm{IoU}_c, \tag{8}$$

where $c \in \{1, \ldots, C\}$ is the class index, $P_c$ is the set of pixels recognized by the model as class $c$, $T_c$ is the set of pixels actually belonging to class $c$, and $\mathrm{IoU}_c = \frac{|P_c \cap T_c|}{|P_c \cup T_c|}$ is the class-specific intersection-over-union.

Fig. 5. Sample images from the four datasets used in the experiments. The top two rows of (a) show different gaze directions and eye colours present in SBVPI, while the bottom two rows show (left-to-right and top-to-bottom) a sample and the ground truth markups of the: sclera, iris, pupil, periocular region, scleral vessels, medial canthus, and eyelashes. (b) shows four samples from MOBIUS captured in different capturing conditions. The image in the first row was captured in natural sunlight, the second row in a well-lit indoor room, the third in an unlit room, while the last row displays an intentionally unusable (“bad”) image intended for quality control. Each row additionally contains the corresponding ground-truth multi-class markups and the individual masks for the sclera (red), iris (green), and pupil (blue). Finally, (c) and (d) show varied images from SMD and SLD, respectively, along with their corresponding sclera ground truth markups.
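A minimal sketch of both metrics over integer label maps is shown below (assuming NumPy; the function names are illustrative):

```python
import numpy as np

def iou(pred: np.ndarray, target: np.ndarray, cls: int = 1) -> float:
    """Intersection-over-Union of a single class (Eq. 7)."""
    p, t = pred == cls, target == cls
    union = np.logical_or(p, t).sum()
    return float(np.logical_and(p, t).sum() / union) if union else 1.0

def miou(pred: np.ndarray, target: np.ndarray, num_classes: int) -> float:
    """Mean IoU over all classes (Eq. 8)."""
    return sum(iou(pred, target, c) for c in range(num_classes)) / num_classes
```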
4.3. Baseline models and training procedure
We use two distinct segmentation models for the experiments
to explore different aspects of the proposed IPAD pruning proce-
dure and ADC criterion, i.e., RITnet (Chaudhary et al., 2019) and
U-Net (Ronneberger et al., 2015):
RITnet: This is a fairly lightweight model and was shown to be
very effective for the task of ocular segmentation in the 2019
OpenEDS challenge (Garbin et al., 2019). Having been designed
specifically for multi-class (sclera, pupil, iris, background) seg-
mentation problems, it is also well suited for our intended pur-
poses. From an architectural point of view, RITnet is a
lightweight (16 GFLOPs, 250,000 parameters) convolutional
encoder-decoder (CED) model inspired by the DenseNet
(Huang et al., 2017) architecture. The encoder of the model con-
sists of 5 Downsampling-Blocks, which contain 5 convolutional
layers each, followed by an average pooling layer. The decoder
consists of 4 Upsampling-Blocks that upsample the output of
the encoder back to the original image resolution via the
nearest-neighbor method. The Upsampling-Blocks each contain
4 convolution layers and skip-connections to their respective
Downsampling-Block. The model serves in our experiments to
demonstrate the performance of the proposed pruning proce-
dure on an already compact segmentation model.
U-Net: This model represents a go-to solution for many
image-to-image translation tasks, including semantic image
segmentation (Rot et al., 2018; Vitek et al., 2023). Similarly to
RITnet, U-Net also features an encoder-decoder architecture,
where the encoder begins with a Double-Convolution-Block
(DCB), which consists of two pairs of convolutional layers and
batch normalization with ReLU activation. The encoder then
continues with 4 Downsampling-Blocks, each of which consists
of a max-pooling layer followed by a DCB. The decoder contains
4 Upsampling-Blocks, each of which consists of a bilinear
upsampling layer followed by a DCB, and a final 1 × 1 convolu-
tion that ensures the number of the model’s output channels
matches the desired number of classes. The Upsampling-
Blocks again contain skip-connections to their respective
Downsampling-Blocks. The model configuration used in our
experiments has a total of 17.3 million parameters with roughly 160 GFLOPs (10× as many as RITnet) and is used to demonstrate
the characteristics of IPAD with a heavily parameterized and
computationally more complex model topology.
We train both models using the same training process and loss
to ensure a fair comparison. Specifically, we utilize the learning
objective proposed in Chaudhary et al. (2019), which was designed
specifically with ocular segmentation in mind, and is defined as:
$$L_R = l_{CE}\,(\lambda_1 + \lambda_2\, l_E) + \lambda_3\, l_{GD} + \lambda_4\, l_S, \tag{9}$$
where $l_{CE}$ is the pixel-wise cross-entropy loss, which penalizes incorrect pixel classifications and is a standard loss in semantic segmentation, but is primarily designed for use with balanced classes; $l_E$ is the Canny edge loss, which maximizes the accuracy of the detection of edges between regions by weighting the pixels by their distances to the nearest two image segments; $l_{GD}$ is the generalized dice loss, which ensures stable gradients in the case of imbalanced classes (which are common for ocular segmentation problems) by weighting the dice score by the squared inverse of the class frequency; and $l_S$ is the surface loss, which is based on the contour distances and aims to preserve smaller areas ignored by the previous two losses. For further details about the loss components and the selection of the $\lambda$ parameters, we refer the reader to the original paper (Chaudhary et al., 2019).
We train the models for 200 epochs, which was determined to
be sufficient for proper convergence of both models. Additionally,
we use a separate set of data samples for validation (distinct from
the training and testing data) to implement early stopping criteria
that help avoid overfitting. Specifically, we consider the model to
have converged and end the training early if the loss on the valida-
tion data does not improve in 10 consecutive epochs. The final
retraining after the pruning procedure is carried out in the same
manner, using the same loss function, on the smaller model with
pruned filters. The brief retraining during pruning also follows
the outlined procedure, but is only executed for 5 epochs with
no stopping criteria. The number of epochs for the retraining dur-
ing the pruning iterations was determined through preliminary
experiments by selecting a trade-off between IPAD runtime com-
plexity and impact on the final accuracy score.
For the learning process, we use the Adam optimizer with a learning rate of 0.001. For the weight parameters $\lambda$ from Eq. (9), we use the optimal values advocated in the original paper (Chaudhary et al., 2019), i.e., $\lambda_1 = 1$, $\lambda_2 = 20$, $\lambda_3 = 1 - \lambda_4$, and $\lambda_4 = \max\left(1 - \frac{\mathrm{epoch}}{125},\, 0\right)$, and prune filters away in increments corresponding to $\Delta F = 1$ GFLOP. During pruning, we limit the number of
filters that can be pruned from each layer or each of RITnet’s blocks
to 75%. When we prune away a filter, we also adjust the dimen-
sions of the subsequent filters and batch normalization layers that
depend on it and remove its corresponding bias. We never prune
batch normalization layers directly. For the experiments, all
images are resized to 640 400 pixels. The training and experi-
ments are conducted on various graphic cards, specifically GeForce
TITAN V, GeForce RTX A5000, and several GeForce RTX 3090s and
GeForce RTX 2070s.
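Adjusting the dependent layers is the error-prone part of the filter-removal step described above; the sketch below illustrates the idea for a plain conv-BN-conv chain (assuming PyTorch; skip connections and RITnet's dense blocks need extra bookkeeping that is omitted here, and the function is our own illustration, not the released code):

```python
import torch
import torch.nn as nn

def remove_output_filter(conv: nn.Conv2d, bn: nn.BatchNorm2d,
                         next_conv: nn.Conv2d, idx: int) -> None:
    """Remove filter `idx` from `conv` and patch the layers depending on it."""
    keep = [i for i in range(conv.out_channels) if i != idx]
    # Drop the filter's weights and its corresponding bias.
    conv.weight = nn.Parameter(conv.weight.data[keep])
    if conv.bias is not None:
        conv.bias = nn.Parameter(conv.bias.data[keep])
    conv.out_channels -= 1
    # Shrink the batch-norm affine parameters and running statistics.
    bn.weight = nn.Parameter(bn.weight.data[keep])
    bn.bias = nn.Parameter(bn.bias.data[keep])
    bn.running_mean = bn.running_mean[keep]
    bn.running_var = bn.running_var[keep]
    bn.num_features -= 1
    # Drop the matching input channel of the next convolutional layer.
    next_conv.weight = nn.Parameter(next_conv.weight.data[:, keep])
    next_conv.in_channels -= 1
```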
4.4. Benchmark methods
We compare IPAD to several pruning methods from the literature, including the standard $L_1$ (Li et al., 2017) and $L_2$ (He et al., 2018; Chin et al., 2020) norm-based pruning methods, and the global LeGR approach from Chin et al. (2020). Additionally,
the global LeGR approach from Chin et al. (2020). Additionally,
we also implement uniform and random pruning procedures for
reference comparisons, neither of which uses any data/model
derived criterion to determine filter importance but instead prunes
filters at random in slightly different ways. Thus, we compare IPAD
to the following pruning methods:
LeGR (Chin et al., 2020) is a state-of-the-art pruning method
from the literature which uses a layer-local pruning criterion
(such as $L_1$ or $L_2$ weight norms or our criterion), but additionally
learns scale and shift parameters for each layer to facilitate glo-
bal filter comparisons. In this way the method obtains a global
ranking of the model’s filters and different from most compet-
ing solutions allows for efficient global model pruning.
Weights-based $L_1$ (Li et al., 2017) and $L_2$ (He et al., 2018; Chin et al., 2020) pruning are classical pruning methods from the literature, which use the $L_1$ and $L_2$ norms of the kernel weights as
the criteria for filter importance. We apply these criteria in our
experiments using the same pruning/retraining strategy as
implemented for the proposed IPAD method to ensure a fair
comparison.
Uniform pruning (Liu et al., 2018) iterates over the layers in the
network and prunes a single randomly selected filter from each
layer of the network. After reaching the final layer of the net-
work, the procedure starts over with the first layer and contin-
ues in this manner until the desired amount of FLOPs for the given iteration ($\Delta F = 1$ GFLOP in our implementation) is removed.
Random pruning (Li et al., 2022) is the simplest possible prun-
ing procedure that takes a single randomly selected filter from
the entire network and prunes it. It then repeats this process
until the desired amount of FLOPs is removed.
Finally, we also compare the pruned models with the original
unpruned version, i.e., the result of the initial model training
(see Fig. 2). In all graphs and figures, the results for the pruned
models are reported at 50% of the FLOP count of the unpruned
model (i.e. roughly 8 GFLOPs for RITnet and 80 GFLOPs for U-
Net), unless specified otherwise. It is important to note that we
count one multiplication and one addition as a single operation
when reporting the computational complexity of the models, as
modern processor architectures implement such a pair as a single
MAC (multiply-accumulate) instruction (IEEE Standard, 2008).
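Under this convention, the reported cost of a convolutional layer is its MAC count; a small illustrative helper (our own, not from the paper's codebase):

```python
def conv_macs(out_h: int, out_w: int, c_in: int, c_out: int,
              k_h: int, k_w: int) -> int:
    """MAC count of a convolutional layer (bias terms ignored).

    Each output value costs k_h * k_w * c_in multiply-accumulates.
    """
    return out_h * out_w * c_out * (k_h * k_w * c_in)

# E.g., a 3x3 convolution mapping 32 to 64 channels on a 640x400 map:
print(conv_macs(400, 640, 32, 64, 3, 3))  # 4718592000, i.e. ~4.7 GMACs
```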
4.5. Results
In this section, we present the results of our experimental
assessment. We compare the pruning methods on RITnet and
U-Net in 4 different settings:
High-quality sclera segmentation using the laboratory-quality
images from the SBVPI dataset. This setting allows us to test the
models’ performance in settings where we have a high degree of
control over the environmental factors, such as lighting. The SBVPI dataset contains high-quality images captured in ideal conditions and is large enough to facilitate effective training of our
model. Using this scenario, we therefore explore the impact of
pruning in ideal conditions, i.e., with plentiful high-quality
and well annotated training data, which should help both small
and large models achieve relatively good performance.
Limited-data in-the-wild sclera segmentation using the SMD
dataset, which contains a smaller number of images, all captured
by a mobile camera in real-world conditions. This scenario tests
the performance of the segmentation models in more uncon-
strained, real-world environments, such as the ones encoun-
tered in mobile-phone unlocking tasks. With the smaller SMD
dataset, the segmentation models additionally have a lower
number of training images available, introducing another source
of difficulty. The small amount of training data can cause larger
models to overfit, and it has been shown (Brigato and Iocchi,
2020) that smaller networks can actually outperform larger ones
when lacking training samples, making this setting very inter-
esting for the investigation of the proposed pruning procedure.
Cross-dataset sclera segmentation, where the segmentation
models are trained on images from SMD and evaluated on
SLD – a small dataset of ocular images acquired in real-world
conditions, intended for sclera liveness detection. With this
experiment, we evaluate the ability of the segmentation models
to generalize and adapt to new data samples that are signifi-
cantly different from anything the model saw during training.
Studying the impact of pruning in this scenario is particularly
interesting, as it contains two converse problems: (i) a low amount of training data (with which, as described above, smaller models can actually perform better), and (ii) generalization
to distinct unseen data, where it is known (Neyshabur et al.,
2015; Novak et al., 2018) that more complex networks tend to
generalize better.
Four-class ocular segmentation on the images from the MOBIUS dataset, which were captured using three different mobile cameras in three different real-world lighting conditions – outdoor natural light, indoor lighting, and a poorly lit indoor room. This scenario provides the segmentation models with a large number of training images but again places them in an unconstrained real-world environment and the more challenging task of four-class segmentation. With this experiment, we explore the adaptability of our pruning criterion and method to different tasks, while also evaluating the models in a hybrid setting with (i) a large number of training examples (similar to the high-quality setting), but (ii) worse capturing conditions (similar to the in-the-wild setting).
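All four settings are evaluated with the intersection-over-union (IoU) metric, averaged over classes (mIoU) in the four-class case. For reference, a minimal NumPy sketch of the metric is given below; the function names are our own illustrative assumptions.

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray, cls: int = 1) -> float:
    """Intersection-over-union of one class in two integer label masks."""
    p, g = pred == cls, gt == cls
    union = np.logical_or(p, g).sum()
    return float(np.logical_and(p, g).sum() / union) if union else 1.0

def miou(pred: np.ndarray, gt: np.ndarray, n_classes: int = 4) -> float:
    """Mean IoU over all classes (four in the MOBIUS experiments)."""
    return float(np.mean([iou(pred, gt, c) for c in range(n_classes)]))
```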
We first demonstrate the superior performance of our criterion relative to the classic weights-based criterion in Subsection 4.5.1, where we compare the performance of the two side-by-side. Next, in Subsection 4.5.2, we study how pruning affects the models' performance as more and more filters are removed, comparing the models at different FLOP counts under each of the pruning procedures. In Subsection 4.5.3 we investigate the importance of the proper choice of the α parameter. The ablation study in Subsection 4.5.4 explores the removal of the weights-based and activation-based components, as well as of the α selection process, and finally also shows: (i) the impact of removing 1×1 pruning from our pruning procedure entirely, and (ii) the impact of not using our criterion on the 1×1 filters.
4.5.1. Comparison with previous work
In the first set of experiments, we look at the performance differences that arise from the use of the proposed filter-importance criterion w(·) from Eq. (6), which forms the basis for IPAD. We note that the proposed criterion extends the standard L1 and L2 (filter-weights) norm criteria to also consider activation deviations when determining filter importance and, therefore, includes the standard criteria as a special case when α = 1. Furthermore, all existing baseline techniques can also be implemented using the iterative procedure introduced for IPAD. In the experimental evaluations, we therefore compare all considered baseline pruning methods side-by-side, first using:

the optimal balancing weight α (denoted "IPAD (α = opt.)" in the figures) that resulted in the best performance (i.e., the highest IoU scores) on the validation data, where α ≠ 1, and

only the filter-weights norm criterion, denoted as "Weights only (α = 1)" in the figures.
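For illustration, the combined criterion can be sketched as the convex combination below. This is our paraphrase of Eq. (6), not a verbatim reproduction: in particular, the per-channel standard deviation used for the activation-deviation term and the max-normalization of the two components are simplifying assumptions; the exact ADC formulation is given earlier in the paper.

```python
import torch

def filter_importance(weights: torch.Tensor, activations: torch.Tensor,
                      alpha: float = 0.5, p: int = 1) -> torch.Tensor:
    """Combined importance of the C_out filters of one layer.
    weights: (C_out, C_in, k, k); activations: (N, C_out, H, W).
    alpha = 1 recovers the classic weights-only (Lp norm) criterion,
    alpha = 0 uses the activation-deviation term alone."""
    w_norm = weights.flatten(1).norm(p=p, dim=1)             # (C_out,)
    adc = activations.transpose(0, 1).flatten(1).std(dim=1)  # (C_out,)
    # normalize both terms so the convex combination is scale-free
    w_norm = w_norm / (w_norm.max() + 1e-12)
    adc = adc / (adc.max() + 1e-12)
    return alpha * w_norm + (1.0 - alpha) * adc
```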
In all visualizations, the orange bars correspond to the use of the proposed (combined) criterion w(·) with different configurations of the IPAD method (L1 vs. L2 vs. L1 or L2 LeGR), the pink bars correspond to the respective implementations with the weights-based criterion only, and the blue bars correspond to the reference methods. The results of the pruned models are reported at 50% of the original unpruned model's FLOP count. The reader is referred to the Appendix for the full table of results and a statistical analysis of the impact of our criterion.
High-quality sclera segmentation: On the high-quality data of
SBVPI both segmentation models achieve high IoU scores, as shown
in Fig. 6. In 7 out of the 8 pairs, the proposed criterion outperforms
the standard weights-based criterion. Also note that the best result
overall for both models (RITnet and U-Net) is achieved using the
proposed combined criterion. Additionally, while the standard cri-
terion is in some cases outperformed even by random and uniform
pruning (in terms of final IoU score), this is never the case for the
newly proposed criterion. Finally, note that both criteria on both
considered segmentation models consistently outperform the orig-
inal unpruned model, despite having only 50% of its computational
complexity, which is also in line with the fact that RITnet (a light-
weight model designed specifically for ocular segmentation) out-
performs U-Net (a larger, more general model), despite having
only 10% of its FLOP count. Overall, we observe that the proposed
criterion leads to both better and more consistent results than
the classic weights-based criterion alone.
Limited-data in-the-wild sclera segmentation: In this more
challenging setting, the models achieve noticeably lower IoU scores
overall, as expected due to the more challenging images and lower
number of training samples available. As can be seen from Fig. 7,
the proposed combined criterion outperforms the standard criterion in 6 out of the 8 pairs, with the best overall result for both segmentation models again being achieved by the combined criterion w(·). Additionally, while the standard criterion is in some cases outperformed even by random and uniform pruning, this happens only once with our criterion, namely for L2 pruning on U-Net, where both criteria performed poorly (worse than uniform pruning). Finally, note that both criteria with both segmentation models again consistently lead to smaller models that even outperform the
again consistently lead to smaller models that even outperform the
original unpruned model. This result suggests that both initial
models (RITnet and U-Net) are over-parameterized given the stud-
ied segmentation task and that (after pruning) the retraining
results in more capable segmentation networks, whose complexity
better suits the targeted task.
Cross-dataset sclera segmentation: In the cross-dataset exper-
iments, the performance of both segmentation models is overall
worse (with lower IoU scores) than in the previous experiments,
where training and testing were conducted on (disjoint) images
coming from the same dataset, as shown in Fig. 8. In 5 out of the 8 pairs of pruning methods, the combined filter-importance criterion outperforms the standard weights-based criterion, and the best overall result for both segmentation models is again achieved with the proposed criterion w(·). However, we do observe that the results are less consistent than in the within-dataset experiments due to the more challenging setting. In the cross-dataset setting, the standard criterion is outperformed by random and uniform pruning in 5 out of 8 cases, whereas this happens in only 2 out of 8 cases with the proposed filter-importance criterion. Additionally, RITnet in this experiment exhibits far better performance in its unpruned state, being outperformed only through the use of the proposed criterion in 2 of the 4 cases and never by the standard criterion. Overall, our criterion
maintains its superiority over the classic weights-based criterion
and in general leads to better performing pruned models, but is
still less consistent than in the previous experiments due to the
more challenging task, in which filter pruning has a bigger impact
on performance.
Four-class ocular segmentation: In this multi-class problem,
the segmentation models achieve decently high mIoU scores, as
shown in Fig. 9. In 7 out of the 8 pairs, the proposed criterion outperforms the standard criterion and once again leads to the best
overall result for both models (RITnet and U-Net) in terms of seg-
mentation performance. Additionally, while the standard criterion
is in some cases outperformed even by random and uniform prun-
ing, this is never the case with our criterion. Finally, note that both
criteria on both segmentation models consistently outperform the
original unpruned model despite the simpler architecture and
reduced FLOP count. Similarly as with the experiments discussed
above, the proposed combined filter-importance criterion once
again performs better and more consistently than the classic
weights-based criterion.
4.5.2. Performance across different complexities
In the previous section, we observed that the pruned models
very often outperform the corresponding original (unpruned) mod-
els despite their much lower computational complexity in terms of
FLOP count. This observation can be attributed to the fact that,
given the task at hand, both considered segmentation models are
over-parameterized. This prompts us to also look at the impact
of different model complexities on the final segmentation performance achieved by the pruned RITnet and U-Net models.

Fig. 6. Impact of the proposed filter-importance criterion w(·) on SBVPI. The segmentation performance in the high-quality sclera segmentation setting is reported for the pruning methods implemented with the proposed criterion vs. the standard weights-based criterion normally used in the literature.

Fig. 7. Impact of the proposed filter-importance criterion w(·) on SMD. The segmentation performance in the limited-data in-the-wild sclera segmentation setting is reported for the pruning methods implemented with the proposed criterion vs. the standard weights-based criterion normally used in the literature.

To study
the behavior of the proposed pruning procedure with different tar-
get FLOP counts and explore the segmentation performance of the
pruned models in this experimental series, we set three different targets at 25%, 50%, and 75% of the initial FLOP count of the unpruned RITnet and U-Net models. We report results for four IPAD variants (L1-, L2-, L1 LeGR-, and L2 LeGR-based) implemented with the combined filter-importance criterion w(·) at a fixed α, i.e., α = 0.5, for consistency and fair comparison, as no data-dependent optimization on the validation data is involved in this setting.
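In code, this experimental series reduces to a sweep over the IPAD variants and FLOP budgets at a fixed α; the sketch below assumes caller-supplied prune_to and evaluate_iou helpers (hypothetical names) wrapping the pruning pipeline and the IoU evaluation.

```python
def complexity_sweep(prune_to, evaluate_iou, alpha=0.5):
    """Prune each IPAD variant to several FLOP budgets at a fixed alpha
    (no validation-based tuning) and record the resulting test IoU."""
    results = {}
    for variant in ("L1", "L2", "L1-LeGR", "L2-LeGR"):
        for target in (0.75, 0.50, 0.25):  # fraction of the initial FLOPs
            model = prune_to(variant, target, alpha=alpha)
            results[(variant, target)] = evaluate_iou(model)
    return results
```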
High-quality sclera segmentation: As shown in Fig. 10(a), in this simplest setting, all of the IPAD variants, regardless of the targeted FLOP count, lead to pruned segmentation models that outperform the original unpruned model. The exceptions here are the reference random and uniform pruning techniques. In the case of RITnet, the pruned models exhibit a particularly evident upward trend in performance when reducing the FLOP count to 75%, implying that the pruning of irrelevant filters does in fact initially bring a performance boost. However, after the first batch of removed filters, the model's ever smaller size can no longer keep up with the problem's complexity and the performance slowly starts degrading. The trend in U-Net's case is less consistent, which also fits the above explanation, as U-Net is a much larger model. It starts with roughly 10× as many FLOPs as RITnet, so even at 25% of the initial FLOPs it is still more than twice as large as the unpruned RITnet. As such, the model initially exhibits the same performance boost achieved by the removal of irrelevant filters, but the decline after that is much less pronounced.
Limited-data in-the-wild sclera segmentation: In Fig. 10(b)
we show the results on the more challenging SMD dataset, where
a smaller number of training images with higher diversity is avail-
able for the experiments. We again observe that the pruned models
quite consistently outperform the original unpruned model with
most IPAD configurations. The upward trend on RITnet is still pre-
sent, particularly at the 75% to 50% FLOP targets, while on U-Net
this trend is less evident.
Cross-dataset sclera segmentation: The results of the cross-
dataset experiments, where the models are trained on SMD and
evaluated on SLD, are shown in Fig. 10(c). Because this cross-
dataset segmentation problem is more challenging, we observe a
different behavior of the pruned models. For most IPAD variants
on the RITnet model, performance starts degrading slightly with
any reduction in the models’ FLOP counts. While some of the
pruned models perform better at specific target FLOP percentages,
the overall trend is towards weaker results. For the more heavily
parameterized U-Net model the opposite can be observed. Here,
the segmentation performance generally increases compared to
the unpruned model for all FLOP targets, since even the smaller
models are still large enough to generalize well to new and unseen
ocular data.
Four-class ocular segmentation: The results of the four-class
segmentation on the MOBIUS data, shown in Fig. 10(d), follow sim-
ilar trends as in the previous experiments. The upward trend in the
case of RITnet, particularly for the 75% to 50% FLOP targets, is
clearly present, and most of the smaller models outperform the
original unpruned version. With U-Net, we see a considerable boost in performance at the first reduction in complexity (the 75% FLOP target), which then remains steady at smaller FLOP counts as well. Overall, the initial unpruned segmentation model
once again gets significantly outperformed by all the smaller mod-
els, which agrees with our previous analysis.
4.5.3. Pruning criterion weighting
Fig. 8. Impact of the proposed filter-importance criterion w(·) on SLD. The segmentation performance in the cross-dataset sclera segmentation setting is reported for the pruning methods implemented with the proposed criterion vs. the standard weights-based criterion normally used in the literature.

Fig. 9. Impact of the proposed filter-importance criterion w(·) on MOBIUS. The segmentation performance in the four-class ocular segmentation setting is reported for the pruning methods implemented with the proposed criterion vs. the standard weights-based criterion normally used in the literature.

As shown in Subsection 4.5.1, the pruning methods using our criterion fairly consistently beat the literature-standard weights-based criterion when the optimal α, determined on the validation data, is used. In this section, we study how the segmentation performance of the pruned models changes with different values of α in the proposed combined filter-importance criterion, to determine how vital the proper selection of this balancing parameter is. For consistency, we present the results at a fixed 50% of the FLOPs of the initial models. Additionally, we also report results for the uniform/random pruning approaches and the original unpruned model. It is important to note that these reference approaches do not rely on the value of α and are therefore represented as horizontal lines in the presented graphs. The left-most point in each graph (α = 0) corresponds to using only the ADC criterion, the right-most point (α = 1) to using only the weights-based criterion, whereas all other points represent possible operating points for the proposed combined filter-importance criterion. The experiments are again conducted for four different IPAD versions.
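The optimal α used in Subsection 4.5.1 is found by a one-dimensional sweep on the validation split, which the following sketch summarizes; prune_with_alpha and validate_iou are hypothetical stand-ins, and the grid deliberately excludes α = 1 (the weights-only special case).

```python
def select_alpha(prune_with_alpha, validate_iou,
                 grid=(0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)):
    """Return the balancing weight alpha with the highest validation IoU."""
    scored = {a: validate_iou(prune_with_alpha(a)) for a in grid}
    return max(scored, key=scored.get)
```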
High-quality sclera segmentation: From the results in Fig. 11(a) we can see that, while most IPAD variants have multiple values of α at which they outperform the weights-based criterion (the right-most point in the graphs), there are also, fairly consistently, values of α that lead to somewhat worse performance. This implies that the selection of the correct α parameter is crucial for the pruning procedure and the consequent segmentation performance. What is more, while the ranking of different values of α is quite similar for L1 and L1 LeGR or for L2 and L2 LeGR, this is not the case when comparing L1-based methods to L2-based methods. This observation suggests that the selection of the correct α depends mainly on the norms used in the calculation of the criterion.
Limited-data in-the-wild sclera segmentation: In Fig. 11(b) we can again see that the weights-based criterion (the right-most value) outperforms some of the poorly chosen values of α in all the methods except L1 pruning on RITnet. The L1 methods and the L2 ones once again follow similar trends, although the difference between the L1 and L2 methods is less pronounced than in the previous experiment. The selection of the α parameter thus remains an important factor for the higher level of performance exhibited by the segmentation models pruned with the proposed combined filter-importance criterion.
Cross-dataset sclera segmentation: In this experiment, which focuses primarily on the models' ability to generalize to unseen data, the results in Fig. 11(c) tell a different story from the previous two experiments. This time, most of the combined-criterion-based methods outperform their weights-based counterpart (the right-most point) regardless of the choice of α. This observation implies that the proposed activation-based criterion, irrespective of how strongly it is weighted, critically contributes to the determination of filter importance and consequently leads to pruned models with better generalization capabilities. Additionally, note that in RITnet's case, all results corresponding to the weights-based criterion (the right-most points) actually fall below the original unpruned model, as already discussed in Section 4.5.1. Conversely, several other variants of the evaluated pruning methods still lead to segmentation models that outperform the original model with the optimal choice of α. Interestingly, the choice of the IPAD variant seems particularly relevant for the selection of the α parameter in this experiment, especially with the U-Net segmentation model. As can be seen, L1 LeGR and L2 LeGR follow almost the same trend across the different values of α, and so do the L1 and L2 IPAD implementations. Given that the performance of the pruned models varies with the value of α used in the implementation of the pruning method, a proper choice of the parameter again appears critical for good performance.
Four-class ocular segmentation: With the four-class results in Fig. 11(d), we observe more consistent performance across different choices of α than in the previous two experiments, with only L1 IPAD pruning performing somewhat inconsistently. Here, the results with the best-performing α are always better than the results achieved with the weights-based criterion only, and even the worst choice of α still leads to better performance than the purely weights-based criterion in 2 out of 8 cases.
4.5.4. Ablation study
In the previous section, we explored the importance of the proper choice of the α parameter. In this ablation study, we examine this aspect more explicitly. Specifically, we look at the performance differences when we (i) remove the weights-based criterion component (i.e., α = 0), (ii) remove the activation-based component (i.e., α = 1), or (iii) remove the α selection process on the validation data (as described in Section 4.5.1) and use a predetermined fixed α instead (i.e., α = 0.5). All the pruned models' results are reported at 50% of the FLOPs of the original models in Fig. 12.
Fig. 10. Performance of the pruned models across different target FLOP counts. The right-most value in each graph is the performance of the unpruned model. The top row shows the results for RITnet (16 GFLOPs unpruned), while the bottom row contains the results for U-Net (160 GFLOPs unpruned).
No weights-based criterion: The α = 0 results in Fig. 12 show the effect of turning off the weights-based component of the proposed filter-importance criterion w(·) completely. In 25 of the 32 considered cases, the chosen combination of the two criterion components matches or outperforms the activation-based component alone. In 19 out of the 32 cases, even the fixed α = 0.5 combination outperforms the activation-based component alone, which is still more than half of all cases, although by a less convincing margin.
No activation-based criterion: The α = 1 results in Fig. 12 show the impact of disabling the activation-based component of our criterion. In 27 of the 32 cases, the best choice of α matches or outperforms the weights-based component alone, following a similar trend as observed in the previous experiment. In 21 of the 32 cases, the weights-based component alone is outperformed by the fixed α = 0.5 combined criterion, again mirroring the previous experiment. Note that in both of these case counts the weights-based component alone fared worse by 2 cases (27 vs. 25 and 21 vs. 19) than the activation-based component alone, again pointing to the superiority of our activation-based criterion.
No α selection: The α = 0.5 results in Fig. 12 show how using a fixed α changes the results relative to determining the best α value on the validation data. In 27 of the 32 cases, the best α choice matches or outperforms the fixed-value α = 0.5 result, which is again the vast majority of the cases and points to the importance of optimizing the α parameter on some hold-out data.
Overall, the presented results suggest that, while our activation-based criterion component alone performs slightly better than the standard weights-based criterion alone, the combination of the two remains the far superior choice. Additionally, the proper choice of α is shown to be crucial for the overall success of the pruning process. However, even with a fixed-α combination of the two criteria, the pruned models generally exhibit better performance than models pruned with either criterion alone.
Another important aspect of the pruning procedure is the type of convolutional layer the procedure is applied to. In general, 1×1 convolutions are intrinsically different from 3×3 (and other) convolutions, since they are typically used for channel mixing and dimensionality reduction rather than spatial filtering. To this end, we next study the impact of: (i) completely removing the pruning of 1×1 convolutions from our pruning procedure, and (ii) pruning 1×1 convolutions but applying our criterion only to the 3×3 convolutions, while the 1×1 convolutions in this case use the classic weights-based criterion only. Since U-Net only has 3×3 convolutions, we report the results of this ablation only for RITnet. The results are reported in the bar graphs of Fig. 13 at 50% of the total FLOPs, with the optimal α values selected on the validation data.
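The second ablation variant amounts to a simple dispatch on kernel size: spatial (e.g., 3×3) filters are scored with the combined criterion, while 1×1 filters fall back to the weights-only norm. A minimal sketch under these assumptions (the normalization and the activation-deviation term mirror the earlier sketch and remain simplifying assumptions):

```python
import torch
import torch.nn as nn

def layer_scores(conv: nn.Conv2d, acts: torch.Tensor,
                 alpha: float = 0.5, p: int = 1) -> torch.Tensor:
    """Filter scores for one conv layer: weights-only norm for 1x1
    convolutions, the combined criterion for spatial convolutions.
    acts: (N, C_out, H, W) activations of this layer on a batch."""
    w_norm = conv.weight.flatten(1).norm(p=p, dim=1)   # (C_out,)
    if conv.kernel_size == (1, 1):
        return w_norm                                  # classic criterion only
    adc = acts.transpose(0, 1).flatten(1).std(dim=1)   # activation deviation
    w_norm = w_norm / (w_norm.max() + 1e-12)
    adc = adc / (adc.max() + 1e-12)
    return alpha * w_norm + (1.0 - alpha) * adc
```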
No channel pruning: We note from Fig. 13 that removing 1×1 pruning quite significantly decreases performance in general, relative to both the L1- and L2-based IPAD variants (denoted as Classic) and the L1 and L2 LeGR IPAD methods. The only exception to this general trend is the cross-dataset experiment, where excluding the 1×1 convolutions from the pruning process still leads to the 2nd-best result overall. In all other experiments, the removal of this pruning step is detrimental to the segmentation performance of the pruned models.
Classic weights-based criterion for channel pruning: The milder variant, which still prunes 1×1 convolutions but relies only on the weights-based criterion for them (since our criterion was developed with spatial convolutions in mind), performs significantly better overall. It outperforms all other methods in the two cases with very limited training data (SMD and SLD) and outperforms the procedure with 1×1 pruning fully removed in 6 out of 8 cases. However, with sufficient training data (SBVPI and MOBIUS), it is still consistently outperformed by the IPAD variants based on the classic methods as well as LeGR.
4.5.5. Qualitative comparison
Finally, we show a few examples of the model predictions
before and after the pruning procedure in Fig. 14. The goal of these
visualizations is to explore the impact of the pruning process on
the behavior of the segmentation models.

Fig. 11. Performance of the pruned models across different values of the α balancing parameter. The top row shows the results for RITnet, while the bottom row contains the results for U-Net. Points at α = 1 correspond to the standard weights-based criterion, points at α = 0 to the ADC criterion, and all the rest to different variants of the proposed criterion from Eq. (6).

Fig. 12. Results of the ablation study. The graphs show segmentation performance differences when components of the proposed pruning process are selectively turned off. The results show the impact of removing the weights-based criterion component (α = 0), removing the activation-based component (α = 1), and removing the α selection process (α = 0.5). The rows show the results on different datasets, in top-to-bottom order: SBVPI, SMD, SLD, MOBIUS.

Fig. 13. Ablation study results w.r.t. the filter types pruned. The results show the segmentation performance of pruned RITnet models with different variants of IPAD when the pruning of 1×1 filters is completely disabled or when the weights-based criterion is used to prune the 1×1 filters.

The results for all pruned models are shown at 50% of the FLOPs of the initial unpruned model.
Note that the pruned model predictions stay close to the original
unpruned model. Even with poor segmentation results, such as
the ones presented for the SMD and SLD datasets, the IoU between
the pruned models’ predictions and the original unpruned model’s
prediction remains high. This consistency of predictions is pre-
cisely the goal of the pruning procedure. Also note how the false positive in the SLD block, which appears with both RITnet and U-Net due to the specular reflection in the original image, is removed by either pruning procedure for both segmentation models, demonstrating the advantage of simplifying the models through pruning, as discussed throughout Section 4.5.2.
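The consistency discussed above can be quantified directly, without ground truth, as the IoU between the pruned and unpruned models' predicted masks; a minimal sketch (our own helper, not part of the paper's pipeline):

```python
import numpy as np

def prediction_consistency(pred_pruned: np.ndarray,
                           pred_unpruned: np.ndarray, cls: int = 1) -> float:
    """IoU agreement between the pruned and unpruned models' masks."""
    p, q = pred_pruned == cls, pred_unpruned == cls
    union = np.logical_or(p, q).sum()
    return float(np.logical_and(p, q).sum() / union) if union else 1.0
```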
5. Conclusion
In this paper, we presented a novel criterion for determining fil-
ter importance in convolutional neural networks (CNNs) in the
process of filter pruning and designed an iterative pruning proce-
dure around this novel