GHN-Q: Parameter Prediction for Unseen Quantized
Convolutional Architectures via Graph
Hypernetworks
Stone Yun
Vision and Image Processing Group, University of Waterloo
Waterloo Artificial Intelligence Institute
s22yun@uwaterloo.ca
Alexander Wong
Vision and Image Processing Group, University of Waterloo
Waterloo Artificial Intelligence Institute
a28wong@uwaterloo.ca
Abstract
Deep convolutional neural network (CNN) training via iterative optimization has had incredible success in finding optimal parameters. However, modern CNN architectures often contain millions of parameters. Thus, any given model for a single architecture resides in a massive parameter space. Models with similar loss could have drastically different characteristics such as adversarial robustness, generalizability, and quantization robustness. For deep learning on the edge, quantization robustness is often crucial. Finding a model that is quantization-robust can sometimes require significant effort. Recent works using Graph Hypernetworks (GHN) have shown remarkable performance predicting high-performant parameters of varying CNN architectures. Inspired by these successes, we wonder if the graph representations of GHN-2 can also be leveraged to predict quantization-robust parameters; we call this adapted model GHN-Q. We conduct the first-ever study exploring the use of graph hypernetworks for predicting parameters of unseen quantized CNN architectures. We focus on a reduced CNN search space and find that GHN-Q can in fact predict quantization-robust parameters for various 8-bit quantized CNNs. Decent quantized accuracies are observed even with 4-bit quantization, despite GHN-Q not being trained on it. Quantized finetuning of GHN-Q at lower bitwidths may bring further improvements and is currently being explored.
1 Introduction
AI-on-the-edge is an exciting area as we move towards an increasingly connected but mobile world. However, there are tight constraints on latency, power, and area when enabling deep neural networks (DNNs) for the edge. Consequently, fixed-precision integer quantization such as in [1] has become an essential tool for fast, efficient CNNs.
Developing efficient models that can be deployed for low-power, quantized inference while retaining close to original floating-point accuracy is a challenging task. In some cases, state-of-the-art models can suffer significant performance degradation after quantization of weights and activations [2].
So how can we find high-performant, quantization-robust parameters for a given CNN?
Recent works by [3] and [4] have shown remarkable performance using Graph Hypernetworks (GHN) to predict all trainable parameters of unseen DNNs in a single forward pass.
Figure 1: Finetuning GHN-Q on ConvNets-250K. We generate a large number of CNN graphs, which are then quantized to the target bitwidth for training. Once trained, GHN-Q can predict robust parameters for unseen CNNs.
For example, [3] report that their GHN-2 can predict the parameters of an unseen ResNet-50 to achieve 60% accuracy on CIFAR-10. Inspired by these successes, we wonder if the graph representational power of GHN-2 can be leveraged to predict quantization-robust parameters for unseen CNN architectures. We present the first-ever study exploring the use of GHNs to predict the parameters of unseen, quantized CNN architectures; we call our approach GHN-Q. By finetuning GHNs on a mobile-friendly CNN architecture space, we explore adapting GHNs specifically to target efficient, low-power quantized CNNs. We find that even a floating-point-finetuned GHN-Q can indeed predict quantization-robust parameters for various CNNs.
2 Experiment
Table 1: Testing GHN-Q on unseen quantized networks. CIFAR-10 top-1 test accuracy of quantized CNNs (Quant8, Quant4) is compared to their full-precision (Float32) accuracy. Reported as (mean% ± standard error of the mean; max%). The Test split is in-distribution (IID); the Deep, Wide, and BN-Free splits are out-of-distribution (OOD).

Precision | Test (IID)     | Deep (OOD)     | Wide (OOD)     | BN-Free (OOD)
Float32   | 71.1±0.3; 80.2 | 68.1±0.7; 79.8 | 69.8±0.5; 79.0 | 37.8±1.3; 56.1
Quant8    | 70.9±0.3; 80.1 | 67.9±0.7; 79.5 | 69.6±0.5; 78.6 | 37.5±1.3; 56.4
Quant4    | 37.2±0.3; 52.6 | 30.7±0.5; 50.2 | 34.7±0.5; 50.7 | 21.5±0.8; 36.7
We would first like to investigate whether full-precision floating-point training on a target design space can train GHN-Q to predict high-performant CNN parameters that are robust to 8-bit uniform quantization. A couple of aspects of the method in [3] suggest that the parameters predicted by GHN-Q should be compact and quantization-friendly, namely the channel-wise weight tiling and differentiable parameter normalization. We finetuned a CIFAR-10, DeepNets-1M-pretrained GHN-2 model obtained from [5] on a set of $2.5\times10^5$ mobile-friendly architectures that we call ConvNets-250K. Figure 1 shows how we train GHN-Q for predicting quantization-robust CNN parameters.
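As intuition for the channel-wise weight tiling, the following is a rough PyTorch sketch in the spirit of [3], not their exact implementation: a predicted weight slab of fixed channel width is repeated along the channel dimensions and cropped to the target layer's shape, so every channel reuses the same few predicted slices. Tensor and function names here are ours.

```python
import torch

def tile_to_shape(pred: torch.Tensor, target_shape: tuple) -> torch.Tensor:
    """Tile a predicted weight slab along its dimensions to reach a target shape.

    pred:         predicted slab, e.g. (c_out_pred, c_in_pred, kH, kW)
    target_shape: desired layer shape, e.g. (c_out, c_in, kH, kW)
    """
    reps = [-(-t // s) for t, s in zip(target_shape, pred.shape)]  # ceil division per dim
    tiled = pred.repeat(*reps)                                     # repeat the slab
    slices = tuple(slice(0, t) for t in target_shape)              # crop to exact shape
    return tiled[slices]
```

For example, `tile_to_shape(pred, (128, 64, 3, 3))` produces a 128x64x3x3 weight from a smaller predicted slab, with identical slices repeated across channels.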
The ConvNets-250K search space consists solely of convolution¹ (including residual blocks), batch-normalization (BatchNorm), pooling, and linear layers, as these are typically the easiest to accelerate on edge devices. We also limit the maximum number of parameters in sampled CNNs to $10^7$. We could have set a lower constraint, but we also wanted to allow for a more diverse search space.

¹Various convolutions such as depthwise, dilated, and regular.

Finetuning on ConvNets-250K was run for 100 epochs using CIFAR-10 [6]. The initial learning rate was 0.001 and was reduced by a factor of 0.1 at epoch 75. GHN-Q is trained with the Adam optimizer using $\beta_1 = 0.9$, $\beta_2 = 0.999$, a weight decay of $10^{-5}$, a training batch size of 32, and a meta-batch size of 4. We continue to use the weight-tiling and parameter normalization described in [3], and we use $s^{(max)} = 10$ as the maximum shortest path for virtual edges.
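To make these hyperparameters concrete, here is a minimal PyTorch sketch of the finetuning loop as we understand it; `ghn_q` and `train_loader` are placeholders rather than the actual ppuda API, and the meta-batch sampling of architectures is only indicated in a comment.

```python
import torch

def finetune_ghn_q(ghn_q: torch.nn.Module, train_loader, num_epochs: int = 100):
    """Finetuning schedule described above: Adam, lr 1e-3 reduced 10x at epoch 75."""
    optimizer = torch.optim.Adam(ghn_q.parameters(), lr=1e-3,
                                 betas=(0.9, 0.999), weight_decay=1e-5)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                     milestones=[75], gamma=0.1)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(num_epochs):
        for images, labels in train_loader:   # CIFAR-10 batches of size 32
            # In GHN training, each step also samples a meta-batch of 4 architectures
            # and predicts their parameters; here we simply assume ghn_q(images)
            # returns logits aggregated over that meta-batch.
            optimizer.zero_grad()
            loss = loss_fn(ghn_q(images), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()
```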
While we start with full-precision finetuning of GHN-Q on the target design space, we can easily quantize the predicted CNNs at arbitrary precision or model other scalar quantization methods. Thus, GHN-Q becomes a powerful tool for quantization-aware design of efficient CNN architectures.
3 Results
We follow a testing procedure similar to [3] and evaluate the trained GHN-Q by comparing the mean CIFAR-10 test accuracy at precisions of 32-bit floating point (Float32), 8-bit quantization (Quant8), and 4-bit quantization² (Quant4). Table 1 shows the results on our different testing splits. BN-Free networks have no BatchNorm layers; the Wide and Deep splits contain much wider and deeper nets than those seen during training.

²We use asymmetric, uniform quantization throughout.
For handling BatchNorm, we use a test batch size of 64 to get batch statistics [7]. We also have to recompute the BatchNorm-folded (BN-Fold) weights each batch before quantizing the BN-Fold weights, as in [1]. See Eq. 1 for the BN-Fold operation, where $\gamma$ is a learnable BatchNorm scale parameter, $\sigma^2_B$ is the batch variance (this could also be an exponential moving average of the variance), and $\epsilon$ is a small constant. Quantization encodings use the absolute tensor ranges.

$$w_{\mathrm{fold}} = \frac{\gamma w}{\sqrt{\sigma^2_B + \epsilon}} \qquad (1)$$
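A minimal PyTorch sketch of Eq. 1, assuming per-output-channel BatchNorm statistics (tensor and function names are ours):

```python
import torch

def bn_fold_weights(w: torch.Tensor, gamma: torch.Tensor,
                    batch_var: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Fold the BatchNorm scale and variance into conv weights (Eq. 1).

    w:         conv weights of shape (out_channels, in_channels, kH, kW)
    gamma:     learnable BatchNorm scale, shape (out_channels,)
    batch_var: per-channel batch variance (or its moving average), shape (out_channels,)
    """
    scale = gamma / torch.sqrt(batch_var + eps)   # per-output-channel folding factor
    return w * scale.reshape(-1, 1, 1, 1)         # broadcast over in_channels, kH, kW
```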
It is worth noting that we only quantize the weights and activations³ with a Quantize() operator instead of running fully fixed-point inference. Rounding and truncation errors of fully fixed-point arithmetic will lead to some additional error. However, as most of the quantization noise is due to weights and activations, the simulated quantization generally correlates well with on-device accuracy [1, 8, 9].

³Note that we did not yet add Quantize() after concatenation.
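The exact Quantize() implementation is not spelled out here; the following is a hedged sketch of asymmetric, uniform fake quantization with encodings taken from the absolute tensor range (the function name and defaults are our own):

```python
import torch

def quantize(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Simulated (fake) asymmetric uniform quantization of a tensor.

    Encodings are derived from the tensor's absolute min/max range; values are
    rounded onto the integer grid and then mapped back to floating point.
    """
    qmin, qmax = 0, 2 ** num_bits - 1
    x_min, x_max = x.min(), x.max()
    scale = (x_max - x_min).clamp(min=1e-8) / (qmax - qmin)
    zero_point = torch.round(qmin - x_min / scale).clamp(qmin, qmax)
    x_int = torch.clamp(torch.round(x / scale + zero_point), qmin, qmax)
    return (x_int - zero_point) * scale            # dequantize back to float
```

In the evaluation flow described above, such an operator would be applied to the BN-folded weights and to the activations of each layer.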
4 Discussion
The parameters predicted by GHN-Q are surprisingly robust despite not having been trained on any kind of quantization. It is particularly interesting that even for 4-bit quantization, the average test accuracy is significantly better than random chance. A likely explanation is that the channel-wise weight tiling and differentiable parameter normalization lead to layerwise distributions that are compact and quantization-friendly. In [2], the authors find that a mismatch between channelwise distributions can lead to significant accuracy loss for post-training quantization. The weight tiling method in [3] copies predicted parameters across channels and thus minimizes such distributional mismatch by construction. Additionally, parameter normalization could help produce less heavy-tailed distributions. A detailed analysis of the distributions of GHN-Q-predicted parameters would yield a clearer picture. As depthwise-separable convolution is particularly susceptible to distributional mismatch, it would be interesting to test the 8-bit quantized performance of GHN-Q on a test set consisting solely of MobileNet-like CNNs such as those in [10, 11].
Besides analyzing mean quantized accuracy, quantization robustness needs to be quantified on a
per-network basis. An analysis of the mean accuracy change and quantization error (e.g., quantized
mean squared error) of individual networks would provide better insight. It would be interesting to
see how quantization error may change after 8-bit quantized GHN-Q finetuning even if accuracy
remains similar.
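As one example of such a per-network measurement, here is a small sketch building on the quantize() function sketched in Section 3 (the bias/BatchNorm filtering heuristic is ours):

```python
import torch

def quantization_mse(model: torch.nn.Module, num_bits: int = 8) -> dict:
    """Per-layer MSE between float weights and their fake-quantized counterparts."""
    errors = {}
    for name, param in model.named_parameters():
        if param.dim() < 2:                  # skip biases and BatchNorm scale/shift
            continue
        q = quantize(param.data, num_bits)   # quantize() as sketched in Section 3
        errors[name] = torch.mean((param.data - q) ** 2).item()
    return errors
```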
We believe these results demonstrate great potential for leveraging the powerful graph representation of GHNs for edge-AI. Finetuning GHN-Q on lower bitwidths such as 4-bit and 2-bit quantized networks should further improve the quantization robustness of predicted parameters. Furthermore, GHN-Q could be a useful weight initialization for quantized CNN training. Quantized models usually require full-precision training to convergence before quantization-aware finetuning. In [3] there are open questions about how well GHN-2-predicted parameters can be used for finetuning on the source task. However, if GHN-Q could be adapted such that its predicted parameters can directly start quantization-aware training, there would be significant savings in training CNNs for quantization.
References
[1] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko, "Quantization and training of neural networks for efficient integer-arithmetic-only inference," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[2] S. Yun and A. Wong, "Do all MobileNets quantize poorly? Gaining insights into the effect of quantization on depthwise separable convolutional networks through the eyes of multi-scale distributional dynamics," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2021, pp. 2447–2456.

[3] B. Knyazev, M. Drozdzal, G. W. Taylor, and A. Romero-Soriano, "Parameter prediction for unseen deep architectures," in Advances in Neural Information Processing Systems, 2021.

[4] C. Zhang, M. Ren, and R. Urtasun, "Graph hypernetworks for neural architecture search," in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=rkgW0oA9FX

[5] B. Knyazev, M. Drozdzal, G. Taylor, and A. Romero-Soriano, "facebookresearch/ppuda: Code for Parameter Prediction for Unseen Deep Architectures (NeurIPS 2021)," Oct 2021. [Online]. Available: https://github.com/facebookresearch/ppuda

[6] A. Krizhevsky, Learning Multiple Layers of Features from Tiny Images, Apr 2009. [Online]. Available: https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf

[7] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proceedings of the 32nd International Conference on Machine Learning (ICML), Lille, France: JMLR.org, 2015, pp. 448–456.

[8] R. Krishnamoorthi, "Quantizing deep convolutional networks for efficient inference: A whitepaper," CoRR, vol. abs/1806.08342, 2018. [Online]. Available: http://arxiv.org/abs/1806.08342

[9] Qualcomm, "AIMET quantization simulation," 2020. [Online]. Available: https://quic.github.io/aimet-pages/releases/latest/user_guide/quantization_sim.html

[10] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv, vol. abs/1704.04861, 2017.

[11] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.