Aention-based Multi-Patch Aggregation for
Image Aesthetic Assessment
Kekai Sheng
NLPR, Institute of Automation,
Chinese Academy of Sciences &
University of Chinese Academy of
Sciences
shengkekai2014@ia.ac.cn
Weiming Dong
NLPR, Institute of Automation,
Chinese Academy of Sciences
weiming.dong@ia.ac.cn
Chongyang Ma
Snap Inc.
cma@snap.com
Xing Mei
Snap Inc.
xing.mei@snap.com
Feiyue Huang
Youtu Lab, Tencent
garyhuang@tencent.com
Bao-Gang Hu
NLPR, Institute of Automation,
Chinese Academy of Sciences
hubg@nlpr.ia.ac.cn
ABSTRACT
Aggregation structures with explicit information, such as image attributes and scene semantics, are effective and popular for intelligent systems for assessing aesthetics of visual data. However, useful information may not be available due to the high cost of manual annotation and expert design. In this paper, we present a novel multi-patch (MP) aggregation method for image aesthetic assessment. Different from state-of-the-art methods, which augment an MP aggregation network with various visual attributes, we train the model in an end-to-end manner with aesthetic labels only (i.e., aesthetically positive or negative). We achieve the goal by resorting to an attention-based mechanism that adaptively adjusts the weight of each patch during the training process to improve learning efficiency. In addition, we propose a set of objectives with three typical attention mechanisms (i.e., average, minimum, and adaptive) and evaluate their effectiveness on the Aesthetic Visual Analysis (AVA) benchmark. Numerical results show that our approach outperforms existing methods by a large margin. We further verify the effectiveness of the proposed attention-based objectives via ablation studies and shed light on the design of aesthetic assessment systems.
CCS CONCEPTS
• Computing methodologies → Computational photography; Neural networks;
KEYWORDS
Image aesthetic assessment, attention mechanism, multi-patch aggregation, convolutional neural network
∗ Corresponding author
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
MM ’18, October 22–26, 2018, Seoul, Republic of Korea
©2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-5665-7/18/10. . . $15.00
https://doi.org/10.1145/3240508.3240554
ACM Reference Format:
Kekai Sheng, Weiming Dong, Chongyang Ma, Xing Mei, Feiyue Huang,
and Bao-Gang Hu. 2018. Attention-based Multi-Patch Aggregation for Image
Aesthetic Assessment. In 2018 ACM Multimedia Conference (MM ’18), October
22–26, 2018, Seoul, Republic of Korea. ACM, New York, NY, USA, 8 pages.
https://doi.org/10.1145/3240508.3240554
1 INTRODUCTION
As the volume of visual data grows exponentially each year, the capability of assessing image aesthetics becomes crucial for various applications such as photo enhancement, image stream ranking, and album thumbnail composition [1, 3, 4]. The aesthetic assessment process of the human visual system involves numerous factors such as lighting, contrast, composition, and texture [7, 26]. Some of these factors belong to holistic scene information, whereas others are fine-grained image details. Designing an artificial intelligence system that accommodates all these factors is a challenging task.
Many studies have focused on this problem in the last decade. Although some early methods only consider global factors [5, 13, 22], most recent approaches propose combining holistic scene information and fine-grained details in a multi-patch (MP) aggregation network [6, 12, 23, 34]. A common practice is to leverage explicit information, such as image attributes [34, 36], scene semantics [19], and intrinsic components [6, 23], in the network design. Using explicit information encodes various complementary visual cues and can significantly outperform alternative methods that rely only on aesthetic labels [22]. However, explicit information might not always be available due to the high cost of manual annotation and the expert knowledge required for feature design. As images with aesthetic labels become available at a large scale from online photography communities, we revisit the problem of image aesthetic assessment and explore how to learn an effective aesthetic-aware model in an end-to-end manner with aesthetic labels only.
Learning with aesthetic labels only is challenging because the labels may not provide sufficient signals for training and can lead to poor assessment results. In the absence of explicit information, deciding which image patches are useful in making the correct prediction is difficult. Therefore, in most previous methods, image patches are usually considered equally important both in the training stage and at inference time.

Figure 1: System overview. We use an attention-based objective to enhance training signals by assigning relatively larger weights to misclassified image patches.
Recently, Ma et al. [23] proposed an effective patch selection module to select useful patches heuristically during the training stage and showed that patch selection alone improved the aesthetic assessment performance by 7%. However, their heuristic patch selection was indirectly learned from the training data and might not fully leverage meaningful information from aesthetic labels. We notice that the scheme of patch selection shares an idea similar to attention mechanisms [2, 10, 30], in which human visual attention is not distributed evenly within an image. Recent successes in visual analysis [18, 33, 35] have demonstrated that a well-designed attention-based module can significantly improve the performance of a learning system.
Motivated by patch selection and attention mechanisms, we propose a simple yet effective solution for image aesthetic assessment, as shown in Figure 1. The key ingredient is an attention-based objective that strengthens training signals by assigning large weights to patches on which the current model has made incorrect predictions. In this manner, we improve the learning efficiency and eventually achieve better assessment results compared with existing approaches that weight each patch equally. For comparison purposes, we present and evaluate three typical attention mechanisms (i.e., average, minimum, and adaptive). In comparison with the heuristic patch selection scheme [23], our method simplifies the design of the network architecture and, more importantly, enables an end-to-end way to train an assessment model with aesthetic labels only. To the best of our knowledge, the attention mechanism has not been explored for image aesthetic assessment. Our quantitative results demonstrate that our proposed solution outperforms state-of-the-art methods on the large-scale Aesthetic Visual Analysis (AVA) benchmark [25]. We also conduct ablation studies and provide additional visualizations to analyze our learned models.
2 RELATED WORK
The estimation of image styles, aesthetics, and quality has been actively investigated over the past few decades. Early studies started from distinguishing snapshots from professional photographs by modeling well-established photographic rules based on low-level handcrafted features [5, 13, 16, 32].

Figure 2: Typical aggregation-based architectures for image aesthetic assessment: (a) MP aggregation; (b) multi-column aggregation; (c) aggregation with explicit information.

Recently, machine learning
based approaches have been successfully applied to various computer vision tasks [9, 18, 33]. Deep learning methods, such as deep convolutional neural networks (CNNs) and deep belief networks, have been successfully applied to the photo aesthetic assessment task with promising results. In Figure 2, we divide recent deep learning based aesthetic assessment methods into three categories based on their different aggregation structures.
MP aggregation (Figure 2a) concatenates vector representations extracted from multiple patches of the input image for aesthetic assessment. Typical examples include the deep multi-patch aggregation network (DMA-Net) [22], the multi-net adaptive spatial pooling CNN (MNA-CNN) [19], and an MP subnet with an effective patch selection scheme (New-MP-Net) [23].
Multi-column aggregation (Figure 2b) focuses on boosting training signals of aesthetic modeling with additional task-related explicit information, such as multi-column networks for various attribute modeling [15, 36], brain-inspired deep networks (BDN) [34], a two-column CNN for rating pictorial aesthetics (RAPID) [21], the aesthetic-attention net (AA-Net) [35], a multi-task CNN (MTCNN) with semantic prediction [11], aesthetic quality regression with simultaneous image categorization (A&C CNN) [12], and a two-column deep aesthetic net (DAN) with triplet pre-training and category prediction [6].
Representation aggregation with explicit information (Figure 2c) is built on an MP aggregation module and uses explicit information as complementary visual cues for good results. Common examples of explicit information include object instances (e.g., DMA-Net with an object-oriented model [22]), scene semantics (e.g., MNA-CNN with scene-aware aggregation [19]), and expert-designed photographic attributes (e.g., depth of field and color harmonization [23]).
Although explicit information provides meaningful cues for image aesthetic assessment, useful information may not always be available due to the high cost of manual annotation and expert knowledge in design. Therefore, compared with methods having explicit information, training a CNN with aesthetic labels alone is useful if it achieves similar or even better assessment results.
3 APPROACH
3.1 Problem Statement
We denote $N$ pairs of input image $I$ and its corresponding ground-truth aesthetic label $\hat{y}$ as our dataset $\{I_i, \hat{y}_i\}_{i=1}^{N}$. Here, $\hat{y}_i = 1$ means that the image $I_i$ is aesthetically positive, whereas $\hat{y}_i = 0$ denotes an aesthetically negative image. Given the dataset, the problem of learning image aesthetic assessment can be formulated as follows:

$$\arg\max_{\theta} \frac{1}{|P|} \sum_{p \in P} \Pr(\tilde{y} = \hat{y} \mid p, \theta) \qquad (1)$$
where $P$ is a set of square patches $\{p\}$ cropped from images in the dataset, $\tilde{y}$ denotes the predicted aesthetic label, and $\theta$ refers to all the network parameters that need to be learned for the image aesthetic assessment task. In Equation 1, $\Pr(\tilde{y} = \hat{y} \mid p, \theta)$ represents the probability that the patch is correctly predicted as the ground-truth label and is computed as the output of the last softmax layer in our network.
Directly optimizing Equation 1 is computationally expensive and might lead to unwanted artifacts, such as overfitting, especially when only aesthetic labels are available. This issue can be linked to the fact that a large batch size usually introduces detrimental effects (e.g., sharp local minima) during the training process via mini-batch stochastic gradient descent (SGD) [14]. Therefore, it is desirable to design an objective function for efficient and effective learning of aesthetic-aware image representations.
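To make the quantity in Equation 1 concrete, the sketch below (ours, not from the paper; the authors' implementation uses TensorFlow, while we use PyTorch-style code here and in the following sketches) computes the empirical value of the objective for a batch of patches:

```python
import torch
import torch.nn.functional as F

def eq1_objective(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Average probability of the ground-truth class over all patches.

    logits: [num_patches, 2] raw network outputs for the cropped patches
    labels: [num_patches] binary aesthetic labels (0 or 1) of the source images
    """
    probs = F.softmax(logits, dim=1)                 # Pr(y_tilde | p, theta)
    correct = probs.gather(1, labels.unsqueeze(1))   # Pr(y_tilde = y_hat | p, theta)
    return correct.mean()                            # the objective in Eq. (1)
```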
3.2 Attention-based Objective Functions
Inspired by patch selection [23] and attention mechanisms [2, 10, 30], we propose assigning different weights to different image patches for effective learning of the aesthetic assessment model. For comparison purposes, we propose three different MP weight assignment schemes, namely, $MP_{avg}$, $MP_{min}$, and $MP_{ada}$.
$MP_{avg}$ scheme. Recall Jensen's inequality: given a real-valued concave function $f$ and a set of points $\{x\}$ in a domain $S$, it can be stated as follows:

$$f\left(\frac{1}{|S|} \sum_{x \in S} x\right) \ge \frac{1}{|S|} \sum_{x \in S} f(x) \qquad (2)$$

where the equality holds if and only if $x_i = x_j\ (\forall x_i, x_j \in S)$ or $f$ is linear. On the basis of Jensen's inequality, $MP_{avg}$ can be proposed as an efficient relaxation of the original objective in Equation 1, as shown below:
$$\log \frac{1}{|P|} \sum_{p \in P} \Pr(\tilde{y} = \hat{y} \mid p, \theta) \;\ge\; \underbrace{\frac{1}{|P|} \sum_{p \in P} \log \Pr(\tilde{y} = \hat{y} \mid p, \theta)}_{MP_{avg}} \qquad (3)$$

$$\frac{\partial MP_{avg}}{\partial \theta} = \frac{1}{|P|} \sum_{p \in P} \underbrace{\frac{1}{\Pr(\tilde{y} = \hat{y} \mid p, \theta)}}_{\text{weights}} \cdot \frac{\partial \Pr(\tilde{y} = \hat{y} \mid p, \theta)}{\partial \theta} \qquad (4)$$
If we sample one patch from each image, the $MP_{avg}$ scheme shares a training pipeline similar to that of a common image classification model. Therefore, this scheme can be trained efficiently compared with existing MP aggregation models.
$MP_{min}$ scheme. In many machine learning algorithms, another typical attention mechanism is to focus on improving results at data points with moderate confidences, such as the hinge loss and hard example mining [20, 28]. Inspired by these prior methods, we propose the $MP_{min}$ scheme as another relaxation of Equation 1:
$$\log \frac{1}{|P|} \sum_{p \in P} \Pr(\tilde{y} = \hat{y} \mid p, \theta) \;\ge\; \min_{p \in P} \frac{1}{|P|} \log \Pr(\tilde{y} = \hat{y} \mid p, \theta) = \underbrace{\frac{1}{|P|} \log \Pr(\tilde{y} = \hat{y} \mid p_m, \theta)}_{MP_{min}} \qquad (5)$$

where $p_m = \arg\min_{p \in P} \Pr(\tilde{y} = \hat{y} \mid p, \theta)$.
$$\frac{\partial MP_{min}}{\partial \theta} = \frac{1}{|P|} \frac{1}{\Pr(\tilde{y} = \hat{y} \mid p_m, \theta)} \cdot \frac{\partial \Pr(\tilde{y} = \hat{y} \mid p_m, \theta)}{\partial \theta} = \frac{1}{|P|} \sum_{p \in P} \frac{\mathbb{I}(p = p_m)}{\Pr(\tilde{y} = \hat{y} \mid p, \theta)} \cdot \frac{\partial \Pr(\tilde{y} = \hat{y} \mid p, \theta)}{\partial \theta} \qquad (6)$$
where $\mathbb{I}(\cdot)$ equals 1 if $p$ is $p_m$, and 0 otherwise. As shown in Equation 5, the $MP_{min}$ scheme only considers the image patch with the lowest prediction confidence to search for meaningful visual cues, while ignoring other patches from the same image. In practice, we implement a softer version of $MP_{min}$ to avoid a potentially unstable training process. Specifically, the probability that $p$ is selected in the SGD process is proportional to $1 - \Pr(\tilde{y} = \hat{y} \mid p, \theta)$.
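In code, the softer $MP_{min}$ variant amounts to sampling one patch per image with probability proportional to $1 - \Pr(\tilde{y} = \hat{y} \mid p, \theta)$. The sketch below is our PyTorch interpretation; the exact sampling granularity in the authors' TensorFlow code is an assumption:

```python
def mp_min_soft_loss(logits: torch.Tensor, label: torch.Tensor) -> torch.Tensor:
    """Soft MP_min for the patches of a single image.

    logits: [num_patches, 2] predictions for patches cropped from one image
    label:  scalar tensor, the image's binary aesthetic label
    """
    probs = F.softmax(logits, dim=1)
    correct = probs[:, label]                       # Pr(y_tilde=y_hat|p) per patch
    scores = (1.0 - correct).clamp(min=1e-6)        # selection probabilities
    idx = torch.multinomial(scores, num_samples=1)  # soft "argmin" pick
    return -torch.log(correct[idx] + 1e-12).squeeze()
```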
$MP_{ada}$ scheme. To take advantage of patch selection in the training stage in an end-to-end manner, we design $MP_{ada}$ to adaptively assign larger weights to meaningful training instances, i.e., patches on which the current model predicts incorrect aesthetic labels, as follows:

$$MP_{ada} = \frac{1}{|P|} \sum_{p \in P} \omega_{\beta} \cdot \log \Pr(\tilde{y} = \hat{y} \mid p, \theta), \qquad \omega_{\beta} = \frac{\Pr(\tilde{y} = \hat{y} \mid p, \theta)^{-\beta} - 1}{\Pr(\tilde{y} = \hat{y} \mid p, \theta)^{-\beta}} = 1 - \Pr(\tilde{y} = \hat{y} \mid p, \theta)^{\beta} \qquad (7)$$
$$\frac{\partial MP_{ada}}{\partial \theta} = \frac{1}{|P|} \sum_{p \in P} \lambda \cdot \frac{\partial \Pr(\tilde{y} = \hat{y} \mid p, \theta)}{\partial \theta}, \qquad \lambda = \frac{1 - \left(1 + \beta \cdot \log \Pr(\tilde{y} = \hat{y} \mid p, \theta)\right) \cdot (1 - \omega_{\beta})}{\Pr(\tilde{y} = \hat{y} \mid p, \theta)} \qquad (8)$$
where $\beta$ is a positive number that controls the adaptiveness of the weight assignment. Given that $\Pr(\tilde{y} = \hat{y} \mid p, \theta)^{-\beta}$ lies in the range $(1, +\infty)$, we normalize its value by subtracting 1 and dividing the result by $\Pr(\tilde{y} = \hat{y} \mid p, \theta)^{-\beta}$. Figure 3 shows how the curves of $\omega_{\beta}$ change as a function of $\Pr(\tilde{y} = \hat{y} \mid p, \theta)$ for different values of $\beta$. Intuitively, as the optimization progresses, the ratio of instances with high decision confidence increases. Consequently, the overall loss decreases, and meaningful training signals become weaker. To maintain meaningful training signals, we should allocate more computational resources to data points with relatively lower prediction confidence. Different from the hinge loss, which ignores data points that have been classified correctly, the $MP_{ada}$ scheme constantly assigns certain positive weights to those patches to help maintain correct predictions, similar to the focal loss [17]. Moreover, this scheme prevents the training process from becoming unstable due to potentially noisy data points or outliers.

Figure 3: Curves of the adaptive weight $\omega_{\beta} = 1 - \Pr(\tilde{y} = \hat{y} \mid p, \theta)^{\beta}$ for different values of the hyperparameter $\beta$.
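Concretely, Equation 7 can be implemented in a few lines; the sketch below is our PyTorch rendering. Note that $\omega_{\beta}$ stays inside the autograd graph, so differentiating the loss reproduces the $\lambda$ weighting of Equation 8:

```python
def mp_ada_loss(logits: torch.Tensor, labels: torch.Tensor,
                beta: float = 0.5) -> torch.Tensor:
    """Negative MP_ada from Eq. (7): adaptively weighted log-loss.

    Misclassified patches (low Pr of the correct class) receive weights
    close to 1, while confidently correct patches keep a small positive
    weight, similar in spirit to the focal loss.
    """
    probs = F.softmax(logits, dim=1)
    correct = probs.gather(1, labels.unsqueeze(1)).squeeze(1)  # Pr(y_tilde=y_hat|p)
    omega = 1.0 - correct.pow(beta)                            # Eq. (7) weight
    return -(omega * torch.log(correct + 1e-12)).mean()
```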
The task of aesthetic assessment can be considered a two-class classification problem, and the classifier may suffer from the accuracy paradox if the dataset is unbalanced. In the training partition of the AVA benchmark [25], more than 75% of the photos are labeled as aesthetically positive, whereas the rest are negative. To achieve high accuracy, an assessment model may thus tend to predict every photo as aesthetically positive. The $MP_{ada}$ scheme can resolve this issue effectively because it assigns large weights $\omega_{\beta}$ to misclassified data points, which essentially biases the model to pay attention to the minority class.
3.3 Network Architecture
In addition to the objective function, the network architecture is another indispensable part of an effective machine learning system. Among the popular CNN architectures, ResNet [9] is a good choice for our task because of its computational efficiency in both training and inference.
Table 1 shows the details of our network architecture for all the experiments. For a fair comparison, we adopt the 18-layer ResNet architecture (plus the last classification layer), with a depth similar to that of models based on VGG16 nets [29], which are commonly used in previous methods for aesthetic assessment [19, 22, 23].
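For reference, a minimal sketch of this backbone (ours, using torchvision; the authors fine-tuned an ImageNet-pretrained ResNet-18 in TensorFlow):

```python
import torch.nn as nn
from torchvision import models

def build_aesthetic_net(num_classes: int = 2) -> nn.Module:
    """ResNet-18 pretrained on ImageNet, with the 1000-way classifier
    replaced by a 2-way (positive/negative) head, as in Table 1.
    The softmax is applied inside the loss function."""
    net = models.resnet18(pretrained=True)
    net.fc = nn.Linear(net.fc.in_features, num_classes)  # global-avg-pool -> fc
    return net
```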
4 EXPERIMENTAL RESULTS
In this section, we describe our implementation details and present the experimental results. We compare our approach with state-of-the-art methods on the AVA benchmark dataset and validate the proposed solution via ablation studies.
4.1 Training Data
We conduct experiments based on the AVA benchmark [25], which is the largest publicly available dataset for image aesthetic assessment. The AVA benchmark contains about 250,000 photos in total. Each photo has an aesthetic score, which is obtained by averaging the ratings from about 200 people. The scores range from 1 to 10, where 10 indicates the highest aesthetic quality.

Table 1: The architecture of the deep CNN used in our experiments. We use the 18-layer ResNet for a fair comparison with alternative approaches. Each x-x-ReLU is the output of a residual block.

| Layers | Output names | Output shape |
| conv, 7x7, 64, stride 2 | - | [112, 112, 64] |
| max pool, 3x3, stride 2 | - | [56, 56, 64] |
| [conv, 3x3, 64; conv, 3x3, 64] x 2 | 0-0-BNReLU1, 0-0-BNReLU2, 0-0-ReLU; 0-1-BNReLU1, 0-1-BNReLU2, 0-1-ReLU | [56, 56, 64] |
| [conv, 3x3, 128; conv, 3x3, 128] x 2 | 1-0-BNReLU1, 1-0-BNReLU2, 1-0-ReLU; 1-1-BNReLU1, 1-1-BNReLU2, 1-1-ReLU | [28, 28, 128] |
| [conv, 3x3, 256; conv, 3x3, 256] x 2 | 2-0-BNReLU1, 2-0-BNReLU2, 2-0-ReLU; 2-1-BNReLU1, 2-1-BNReLU2, 2-1-ReLU | [14, 14, 256] |
| [conv, 3x3, 512; conv, 3x3, 512] x 2 | 3-0-BNReLU1, 3-0-BNReLU2, 3-0-ReLU; 3-1-BNReLU1, 3-1-BNReLU2, 3-1-ReLU | [7, 7, 512] |
| global average pooling | - | [512] |
| 2-d fc, softmax | - | [2] |
For a fair comparison, we use the same partition of training and test data as prior methods [19, 22, 23, 25] (i.e., 235,599 images for training and the rest for testing). We also follow the same procedure to assign a binary aesthetic label to each image in AVA. Specifically, images with average ratings less than or equal to 5 are aesthetically negative, whereas the others are aesthetically positive.
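As a sketch (ours), the label assignment is a one-liner over the mean scores:

```python
def binarize_ava(mean_scores):
    """AVA mean ratings (1-10) -> binary labels:
    score <= 5 -> 0 (negative), score > 5 -> 1 (positive)."""
    return [0 if s <= 5.0 else 1 for s in mean_scores]
```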
4.2 Implementation Details
Given an image of arbitrary resolution in our dataset, we initially resize its shorter edge to 256 pixels while keeping its aspect ratio. In this manner, we keep the testing data pipeline compatible with the training data pipeline without changing the aspect ratios of the original images, which we assume is important for photo aesthetic assessment. Figure 4 shows that the distributions of aesthetic scores of images with different aspect ratios share similar patterns.
We then randomly crop several patches of resolution 224 × 224 from each resized image. Random horizontal flipping (50% probability) is conducted for data augmentation, without color jittering or multi-scale cropping. Next, the collected patches are fed into the CNN to learn the feature representation for aesthetic assessment. An attention-based MP objective function (Section 3) is used to organize the separated patches effectively and to ensure that the error back-propagation process eventually leads to desirable convergence with satisfactory assessment performance.

Figure 4: The distributions of aesthetic scores of images with various aspect ratios share similar patterns.
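A sketch of this input pipeline (ours, using torchvision transforms; the paper's TensorFlow pipeline may differ in detail):

```python
from torchvision import transforms

# Resize the shorter edge to 256 (aspect ratio preserved), then crop
# random 224x224 patches and flip horizontally with probability 0.5.
patch_transform = transforms.Compose([
    transforms.Resize(256),                  # shorter edge -> 256
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(p=0.5),  # no color jitter / multi-scale
    transforms.ToTensor(),
])
```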
We build our system based on publicly available implementations of ResNet [9] using TensorFlow. Instead of training from scratch, our network is fine-tuned from a model pretrained on the ImageNet ILSVRC2012 dataset [27]. Each training process contains 32 epochs and takes about 10 hours to complete on an NVIDIA TITAN X graphics card. At test time, it takes less than 0.1 ms to predict the aesthetic label of an image.
We apply a ve-fold cross-validation technique on the training
set of AVA to select the model hyperparameters. The hyperparame-
ters include the learning rate (begins with 0
.
001 and is reduced by
half every 20 epochs), xed weight decay with 0
.
0001, and
β=
0
.
5
in
ωβ
. The optimization process is performed using Nesterov SGD
with a mini-batch size of 32.
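These settings map onto a standard training setup; a sketch (ours, in PyTorch, reusing the hypothetical build_aesthetic_net helper from the earlier sketch; the momentum value is our assumption, since the paper only states that Nesterov SGD is used):

```python
import torch

net = build_aesthetic_net()  # hypothetical helper defined earlier
# Nesterov SGD with mini-batch size 32; momentum 0.9 is an assumption.
optimizer = torch.optim.SGD(net.parameters(), lr=0.001, momentum=0.9,
                            weight_decay=1e-4, nesterov=True)
# Halve the learning rate every 20 epochs, as selected via cross-validation.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)
```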
4.3 Performance Evaluations
We use the total classication accuracy on the canonical AVA testing
partition to validate the eectiveness of our proposed objectives
for aesthetic assessment. In Table 2, we compare our results with
several existing techniques, including handcrafted features [
25
],
three VGG-Net based methods [
19
], a single-patch network [
22
]
based on spatial pyramid pooling (SPP) [
8
], three types of aggre-
gation based approaches as reviewed in Section 2, and a recent
method [
31
] which casts aesthetic assessment as a regression task.
4.3.1 Our methods versus state-of-the-art approaches. Our proposed objective functions based on the attention mechanism generally work better than existing techniques for image aesthetic assessment. The $MP_{ada}$ scheme even outperforms models with well-designed features that utilize hybrid information (e.g., [11, 19, 22, 23]). In Figure 5, we show four groups of aesthetic assessment results predicted by our $MP_{ada}$ scheme based on the ResNet architecture, including aesthetically positive and negative predictions with high and low decision confidence.

Table 2: Comparisons between several state-of-the-art approaches and our proposed schemes. We list the core features of each method and the corresponding total classification accuracy on the AVA test set.

| Method | Core features | Accuracy (%) |
| AVA [25] | handcrafted features | 68.0 |
| VGG-Scale [19] | non-uniform scaling | 73.8 |
| VGG-Pad [19] | uniform scaling + padding | 72.9 |
| SPP [22] | spatial pooling | 76.0 |
| VGG-Crop [19] | MP aggregation | 71.2 |
| DMA-Net [22] | MP aggregation | 75.41 |
| MNA-CNN [19] | MP aggregation | 77.1 |
| New-MP-Net [23] | MP aggregation | 81.7 |
| DCNN [21] | multi-column aggregation | 73.25 |
| RAPID [21] | multi-column aggregation | 75.42 |
| A&C CNN [12] | multi-column aggregation | 74.51 |
| MTCNN [11] | multi-column aggregation | 78.56 |
| MTRLCNN [11] | multi-column aggregation | 79.08 |
| BDN [34] | multi-column aggregation | 78.08 |
| Two-column DAN [6] | multi-column aggregation | 78.72 |
| AA-Net [35] | multi-column aggregation | 76.9 |
| DMA-Net-IF [22] | representation aggregation with explicit information | 75.4 |
| MNA-CNN-Scene [19] | representation aggregation with explicit information | 77.4 |
| A-Lamp [23] | representation aggregation with explicit information | 82.5 |
| NIMA [31] | distributions of human opinion scores | 81.51 |
| MP_avg (ours) | average weights | 81.76 |
| MP_min (ours) | minimum selection | 80.50 |
| MP_ada (ours) | adaptive weights | 83.03 |
4.3.2 Comparisons between different attention-based schemes. Among the three attention-based objectives proposed in Section 3, the $MP_{ada}$ scheme achieves the highest aesthetic assessment accuracy. Figure 6 shows that the $MP_{ada}$ scheme tends to assign larger weights to aesthetically negative examples than to positive ones. This adaptive scheme helps resolve the intrinsic class imbalance of our dataset and leads to better assessment performance. We also found that training based on the $MP_{ada}$ scheme converges faster and reaches a lower minimum compared with the other two schemes.
4.3.3 β value for adaptive weight assignment. For a better understanding of the $MP_{ada}$ scheme, we conduct additional experiments to train the model with different values of $\beta$ (i.e., the hyperparameter of the adaptive weight $\omega_{\beta}$ in Equation 7). Figure 3 shows that when $\beta \in (0, 1)$, patches with smaller probability, and thus lower prediction confidence, are assigned considerably larger weights; that is, a model with $\beta \in (0, 1)$ is more adaptive than one with $\beta \in (1, \infty)$. Our experimental results show that a model trained with $\beta \in (0, 1)$ generally outperforms those with $\beta \in (1, \infty)$ by a margin of 1% in terms of total classification accuracy.
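A quick numeric check (ours) illustrates this: with $\omega_{\beta} = 1 - \Pr^{\beta}$, a small $\beta$ produces a much sharper relative contrast between low- and high-confidence patches:

```python
for beta in (0.5, 2.0):
    for p in (0.3, 0.9):  # low vs. high confidence in the correct class
        print(f"beta={beta}: Pr={p} -> omega={1 - p ** beta:.3f}")
# beta=0.5: omega = 0.452 (Pr=0.3) vs. 0.051 (Pr=0.9), ratio ~ 8.8
# beta=2.0: omega = 0.910 (Pr=0.3) vs. 0.190 (Pr=0.9), ratio ~ 4.8
```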
Figure 5: Our aesthetic assessment results on the AVA test set predicted by the $MP_{ada}$ scheme (negative and positive predictions, each with high and low decision confidence).
Figure 6: Average weights for patches from aesthetically
positive and negative instances during the training process.
4.4 Further Investigation
In this section, we conduct ablation studies to verify the effectiveness of our proposed method. We also investigate the learned representations to further understand aesthetic-aware models.
4.4.1 Ablation study: ResNet versus VGG. In addition to using the 18-layer ResNet [9], we also test the aesthetic assessment performance of the $MP_{ada}$ scheme based on VGG16 nets, which are commonly used in previous studies [19, 22, 23]. Compared with the number in Table 2, the accuracy of the implementation using VGG16 nets is only approximately 0.5% lower. Note that a VGG16 net has a considerably larger model size (around 500 MB) than ResNet-18 (approximately 40 MB) and requires considerably more computational resources and time.
4.4.2 Representation correlation. For a further understanding of the trained aesthetic-aware neural network, we investigate the representation vectors extracted from different layers of the ResNet architecture. We resort to singular vector canonical correlation analysis (SVCCA)¹ [24], a recently proposed powerful tool for understanding the relationships among representations of various layers. We use $ResNet_{ImageNet}$, $ResNet_{AVA}$, and $ResNet_{Rand}$ to denote the original model pretrained on ImageNet, the model fine-tuned from $ResNet_{ImageNet}$, and the model trained from scratch, respectively. The SVCCA maps of $ResNet_{AVA}$ and $ResNet_{Rand}$ are shown in Figures 7 and 8, respectively. The notations of the output nodes in the SVCCA maps are listed in Table 1. Figure 7 shows a positive correlation among various layers in the trained model, especially for the high-level layers (e.g., the last six layers).
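For readers who wish to reproduce such maps, the following is a bare-bones NumPy sketch of an SVCCA-style layer similarity (our simplification of the recipe in [24]; the released tool at https://github.com/google/svcca implements the full method, e.g., variance-based truncation instead of our fixed k):

```python
import numpy as np

def svcca_similarity(X: np.ndarray, Y: np.ndarray, k: int = 20) -> float:
    """Mean canonical correlation between two layers' activations.

    X, Y: [num_datapoints, num_neurons] activation matrices (num_datapoints > k).
    k: number of top singular directions kept per layer (the "SV" step).
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # SV step: project each layer onto its top-k singular directions.
    _, _, Vx = np.linalg.svd(X, full_matrices=False)
    _, _, Vy = np.linalg.svd(Y, full_matrices=False)
    Xk, Yk = X @ Vx[:k].T, Y @ Vy[:k].T
    # CCA step: canonical correlations via QR + SVD of the cross-product.
    Qx, _ = np.linalg.qr(Xk)
    Qy, _ = np.linalg.qr(Yk)
    rho = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return float(rho.mean())
```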
Our experiments show that $ResNet_{Rand}$ can still achieve good aesthetic classification results, although it is inferior to $ResNet_{AVA}$ by a moderate gap (around 1%). Compared with the SVCCA map of $ResNet_{AVA}$ in Figure 7, the correlations of representations in $ResNet_{Rand}$ are relatively weaker, as shown in Figure 8. This result indicates that a weak representation correlation can cause degradation of aesthetic assessment performance.
4.4.3 Aesthetic-aware model versus object-oriented model. Since we start from a pretrained object-oriented model $ResNet_{ImageNet}$ and then fine-tune it to obtain a model $ResNet_{AVA}$ for aesthetic assessment, it is interesting to see which parts of the network change the most during the fine-tuning process. Figure 9 shows the heatmap visualization of where the SVCCA map of $ResNet_{AVA}$ has larger components than the SVCCA map of $ResNet_{ImageNet}$. This figure indicates that the high-level layers of a model for aesthetic assessment present stronger correlation compared with an object-oriented model.
¹ The code of SVCCA we use can be found at https://github.com/google/svcca.
Figure 7: SVCCA map visualization of correlations between different layer representations in $ResNet_{AVA}$.

Figure 8: SVCCA map visualization of correlations between different layer representations in $ResNet_{Rand}$.
4.4.4 Eect of image resizing methods. In our current system,
we initially resize the shorter edge to be 256 for all the images to
make the training and test data compatible with each other. This
preprocessing operation is valid if aesthetic assessment result only
relies on scale-invariant features and will not change after uniform
resizing. In certain cases, such as the pictures with distinctive local
contrast, resizing the images may change the aesthetic assessment
Figure 9: Heatmap visualization showing where the SVCCA
map of Res N etAV A has larger components than the SVCCA
map of Res N etI maдe N e t .
1 / 0.740 1 / 0.562 1 / 0.670 0 / 0.521 1 / 0.736 1 / 0.606
1 / 0.710 1 / 0.596 0 / 0.582 0 / 0.692 0 / 0.573 0 / 0.651
0 / 0.539 0 / 0.761 0 / 0.638 0 / 0.812 1 / 0.757 1 / 0.572
Figure 10: Our aesthetic assessment results and the cor-
responding prediction condence with two scaling ap-
proaches: resizing shorter edge to be
256
while keeping the
aspect ratio (left) and resizing to a resolution
256
×
256
(right).
results. We designate the study of image aesthetic assessment at
the original resolution as future work.
In Figure 10, we compare our uniform scaling strategy with non-uniform scaling to a resolution of 256 × 256. For each image, we show our aesthetic assessment result and the corresponding prediction confidence. The figure shows that resizing without keeping the original aspect ratio reduces the confidence of positive predictions and increases the confidence of negative ones. This result is consistent with human visual perception, because changing the aspect ratio of an ordinary image is likely to degrade the image aesthetically.
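The two scaling strategies compared in Figure 10 differ by a single transform; a sketch of both (ours, using torchvision):

```python
from torchvision import transforms

# Aspect-preserving: shorter edge -> 256 (our default).
uniform_scale = transforms.Resize(256)
# Non-uniform: force 256x256, distorting the aspect ratio.
nonuniform_scale = transforms.Resize((256, 256))
```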
5 CONCLUSIONS
In this study, we revisit the problem of image aesthetic assessment and propose a simple yet effective solution inspired by the attention mechanism. To learn a neural network based model for aesthetic assessment from training data with aesthetic labels only, we investigate three different weight assignment schemes for MP aggregation, namely, $MP_{avg}$, $MP_{min}$, and $MP_{ada}$. Our experimental results on the AVA test dataset show that our approach outperforms state-of-the-art approaches for image aesthetic assessment by a large margin. Among the three schemes presented, the adaptive weight assignment $MP_{ada}$ achieves the best aesthetic assessment performance due to the larger weights assigned to meaningful instances during the optimization process, which help strengthen training signals and resolve the class imbalance issue of the dataset. We further validate our design choices via ablation studies and evaluate the learned models by comparing different training strategies.
In the future, we plan to investigate learning from unlabeled data to further improve assessment performance. Although our major goal is to improve the accuracy of image aesthetic assessment, another possible avenue is to explore more compact aesthetic assessment models for mobile applications, such as image enhancement and album thumbnail generation. Finally, combining image aesthetic assessment with other visual analysis tasks within a unified learning framework is also interesting.
ACKNOWLEDGMENTS
We thank the anonymous reviewers for their valuable comments and suggestions. This work was supported by the National Natural Science Foundation of China under Nos. 61672520, 61573348, 61702488 and 61720106006, by Beijing Natural Science Foundation under No. 4162056, by the independent research project of the National Laboratory of Pattern Recognition, and by the CASIA-Tencent Youtu joint research project.
REFERENCES
[1] Subhabrata Bhattacharya, Rahul Sukthankar, and Mubarak Shah. 2011. A holistic approach to aesthetic enhancement of photographs. ACM Transactions on Multimedia Computing, Communications, and Applications 7, 1 (2011), 21.
[2] Chunshui Cao, Xianming Liu, Yi Yang, Yinan Yu, Jiang Wang, Zilei Wang, Yongzhen Huang, Liang Wang, Chang Huang, Wei Xu, et al. 2015. Look and think twice: Capturing top-down visual attention with feedback convolutional neural networks. In Proceedings of the IEEE International Conference on Computer Vision. 2956–2964.
[3] Kuang-Yu Chang, Kung-Hung Lu, and Chu-Song Chen. 2017. Aesthetic Critiques Generation for Photos. In IEEE International Conference on Computer Vision. IEEE, 3534–3543.
[4] Yi-Ling Chen, Jan Klopp, Min Sun, Shao-Yi Chien, and Kwan-Liu Ma. 2017. Learning to Compose with Professional Photographs on the Web. In Proceedings of the ACM on Multimedia Conference. ACM, 37–45.
[5] Ritendra Datta, Dhiraj Joshi, Jia Li, and James Z Wang. 2006. Studying aesthetics in photographic images using a computational approach. In European Conference on Computer Vision. Springer, 288–301.
[6] Yubin Deng, Chen Change Loy, and Xiaoou Tang. 2017. Image Aesthetic Assessment: An experimental survey. IEEE Signal Processing Magazine 34, 4 (2017), 80–106.
[7] Michael Freeman. 2006. The Complete Guide to Light and Lighting in Digital Photography (A Lark Photography Book). Lark Books.
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2014. Spatial pyramid pooling in deep convolutional networks for visual recognition. In European Conference on Computer Vision. Springer, 346–361.
[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In IEEE Computer Vision and Pattern Recognition. IEEE, 770–778.
[10] Laurent Itti and Christof Koch. 2001. Computational modelling of visual attention. Nature Reviews Neuroscience 2, 3 (2001), 194.
[11] Yueying Kao, Ran He, and Kaiqi Huang. 2017. Deep aesthetic quality assessment with semantic information. IEEE Transactions on Image Processing 26, 3 (2017), 1482–1495.
[12] Yueying Kao, Kaiqi Huang, and Steve Maybank. 2016. Hierarchical aesthetic quality assessment using deep convolutional neural networks. Signal Processing: Image Communication 47, C (2016), 500–510.
[13] Yan Ke, Xiaoou Tang, and Feng Jing. 2006. The Design of High-Level Features for Photo Quality Assessment. In IEEE Computer Vision and Pattern Recognition. IEEE, 419–426.
[14] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. 2016. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836 (2016).
[15] Shu Kong, Xiaohui Shen, Zhe Lin, Radomir Mech, and Charless C Fowlkes. 2016. Photo Aesthetics Ranking Network with Attributes and Content Adaptation. In European Conference on Computer Vision. 662–679.
[16] Yan Kong, Weiming Dong, Xing Mei, Chongyang Ma, Tong-Yee Lee, Siwei Lyu, Feiyue Huang, and Xiaopeng Zhang. 2016. Measuring and Predicting Visual Importance of Similar Objects. IEEE Transactions on Visualization and Computer Graphics 22, 12 (2016), 2564–2578.
[17] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollar. 2017. Focal Loss for Dense Object Detection. In IEEE International Conference on Computer Vision. IEEE, 2999–3007.
[18] Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2015. Fully convolutional networks for semantic segmentation. In IEEE Computer Vision and Pattern Recognition. IEEE, 3431–3440.
[19] Long Mai, Hailin Jin, and Feng Liu. 2016. Composition-preserving deep photo aesthetics assessment. In IEEE Computer Vision and Pattern Recognition. IEEE, 497–506.
[20] Ilya Loshchilov and Frank Hutter. 2015. Online Batch Selection for Faster Training of Neural Networks. arXiv preprint (2015).
[21] Xin Lu, Zhe Lin, Hailin Jin, Jianchao Yang, and James Z Wang. 2015. Rating image aesthetics using deep learning. IEEE Transactions on Multimedia 17, 11 (2015), 2021–2034.
[22] Xin Lu, Zhe Lin, Xiaohui Shen, Radomir Mech, and James Z Wang. 2015. Deep multi-patch aggregation network for image style, aesthetics, and quality estimation. In IEEE International Conference on Computer Vision. IEEE, 990–998.
[23] Shuang Ma, Jing Liu, and Chang Wen Chen. 2017. A-Lamp: Adaptive layout-aware multi-patch deep convolutional neural network for photo aesthetic assessment. In IEEE Computer Vision and Pattern Recognition. IEEE, 722–731.
[24] Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. 2017. SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. In Advances in Neural Information Processing Systems. 6076–6085.
[25] Naila Murray, Luca Marchesotti, and Florent Perronnin. 2012. AVA: A large-scale database for aesthetic visual analysis. In IEEE Computer Vision and Pattern Recognition. IEEE, 2408–2415.
[26] Clark V. Poling. 1975. Johannes Itten, Design and Form: The Basic Course at the Bauhaus and Later. Thames and Hudson. 368–370 pages.
[27] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S Bernstein, et al. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision 115, 3 (2015), 211–252.
[28] Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. 2016. Training Region-Based Object Detectors with Online Hard Example Mining. In IEEE Computer Vision and Pattern Recognition. IEEE, 761–769.
[29] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[30] Marijn F Stollenga, Jonathan Masci, Faustino Gomez, and Jürgen Schmidhuber. 2014. Deep networks with internal selective attention through feedback connections. In Advances in Neural Information Processing Systems. 3545–3553.
[31] Hossein Talebi and Peyman Milanfar. 2018. NIMA: Neural image assessment. IEEE Transactions on Image Processing 27, 8 (2018), 3998–4011.
[32] Hanghang Tong, Mingjing Li, Hong-Jiang Zhang, Jingrui He, and Changshui Zhang. 2004. Classification of digital photos taken by photographers or home users. In Pacific-Rim Conference on Multimedia. Springer, 198–205.
[33] Fei Wang, Mengqing Jiang, Chen Qian, Shuo Yang, Cheng Li, Honggang Zhang, Xiaogang Wang, and Xiaoou Tang. 2017. Residual attention network for image classification. In IEEE Computer Vision and Pattern Recognition. 3156–3164.
[34] Zhangyang Wang, Shiyu Chang, Florin Dolcos, Diane Beck, Ding Liu, and Thomas S Huang. 2016. Brain-inspired deep networks for image aesthetics assessment. arXiv preprint arXiv:1601.04155 (2016).
[35] Wenguan Wang and Jianbing Shen. 2017. Deep cropping via attention box prediction and aesthetic assessment. In IEEE International Conference on Computer Vision. IEEE, 2186–2194.
[36] Luming Zhang. 2016. Describing Human Aesthetic Perception by Deeply-learned Attributes from Flickr. arXiv preprint arXiv:1605.07699 (2016).
Human beings often assess the aesthetic quality of an image coupled with the identification of the image’s semantic content. This paper addresses the correlation issue between automatic aesthetic quality assessment and semantic recognition. We cast the assessment problem as the main task among a multitask deep model, and argue that semantic recognition task offers the key to address this problem. Based on convolutional neural networks, we employ a single and simple multi-task framework to efficiently utilize the supervision of aesthetic and semantic labels. A correlation item between these two tasks is further introduced to the framework by incorporating the inter-task relationship learning. This item not only provides some useful insight about the correlation but also improves assessment accuracy of the aesthetic task. Particularly, an effective strategy is developed to keep a balance between the two tasks, which facilitates to optimize the parameters of the framework. Extensive experiments on the challenging AVA dataset and Photo.net dataset validate the importance of semantic recognition in aesthetic quality assessment, and demonstrate that multi-task deep models can discover an effective aesthetic representation to achieve state-ofthe- art results.