The Journal of Engineering
Received: 4 March 2022 | Revised: 11 June 2022 | Accepted: 3 August 2022
DOI: 10.1049/tje2.12185
ORIGINAL RESEARCH
Auto-encode the synthesis pseudo features for generalized
zero-shot learning
Lin Wang1,2 | Zhenjun Shen1 | Guoyong Wang3 | Jianqiang Song1 | Qingtao Wu1
1School of Information Engineering, Henan
University of Science and Technology, Luoyang,
China
2Henan Qunzhi Information Technology Co., Ltd.,
Luoyang, China
3School of Computer and Information Engineering,
Luoyang Institute of Science and Technology,
Luoyang, China
Correspondence
Jianqiang Song, School of Information Engineering,
Henan University of Science and Technology,
Luoyang 471023, China.
Email: 9943517@haust.edu.cn
Funding information
Key Technologies R & D Program of Henan
Province, Grant/Award Numbers: 202102210169,
212102210088; Luoyang Major Scientific and
Technological Innovation Projects, Grant/Award
Number: 2101017A; National Natural Science
Foundation of China (NSFC), Grant/Award
Numbers: 62002102, 62072121, 62176113
Abstract
Zero-shot learning (ZSL) aims to identify target categories without labeled data, using semantic information to transfer knowledge from seen categories. In existing Generalized Zero-Shot Learning (GZSL) methods, the domain shift problem often appears during the feature generation stage. To solve this problem, a new method that Auto-Encodes the Synthesis Pseudo Features for the GZSL task (AESPF-GZSL) is proposed in this manuscript. Specifically, the AESPF-GZSL method trains the generated features under the semantic auto-encoder framework and then exploits an attention mechanism to refine them further. The refined features are finally fed to the classifier. The proposed method is evaluated on three benchmark data sets, referred to as AWA, CUB and SUN. The experimental results show that the proposed method achieves state-of-the-art classification accuracy in both ZSL and GZSL settings. In the ZSL setting, the classification accuracy of our method is superior to the compared algorithms by 0.40% on AWA and 0.30% on SUN, respectively. In the GZSL setting, it is superior to the compared algorithms by 0.41% in harmonic mean on AWA, and by 1.01%, 0.62%, and 1.05% in unseen-class accuracy, seen-class accuracy, and harmonic mean on SUN.
1 INTRODUCTION
Zero-shot learning (ZSL) imitates the human ability to accurately identify completely unseen object categories based on semantic descriptions and past experience, and studies how to associate visual features with semantic attributes. Even though ZSL seems very difficult to solve as a brand-new problem in machine learning [1–6], a reasonable and effective learning theory and approach can be used to train a classifier that correctly classifies object classes absent from the training stage. Researchers use computers to obtain low-level visual features consistent with image semantics, introduce the concept of mid-level image representations, such as attributes, and use them as an intermediate expression between image feature information and high-level semantics to solve the zero-shot problem. Knowledge transfer from known categories to unknown categories can alleviate the ZSL problem effectively. In ZSL, models usually use samples from unseen classes as the test set to evaluate the performance of the classifier.
Jin et al. [7] proposed a ZSL method with a center loss, which makes instances of the same class more compact by extracting the distinguishing parts. In [8], SeeNet with High-order Attribute Features (SeeNet-HAF) is proposed to solve the challenging ZSL task. Compared with ZSL, GZSL models test samples from both seen and unseen classes to evaluate classification accuracy, and they also need to generate features of unseen classes for classification. The Region Graph Embedding Network (RGEN) is proposed for the ZSL and GZSL tasks [9]; its balance loss is especially valuable for mitigating the extreme domain bias in deep GZSL models and provides inherent insights for GZSL. An Attentive Region Embedding Network (AREN) [10] is also proposed for solving the challenging ZSL/GZSL task. Li et al. [11] proposed a method to Learn Discriminative and Meaningful Samples for generalized zero-shot classification tasks (LDMS), which uses a generative adversarial network to regularize class consistency and semantic consistency. Han et al. [12] proposed to learn
redundancy-free features for GZSL, projecting the original visual features into a new (redundancy-free) feature space and then limiting the statistical dependence between the two feature spaces. Wu et al. [13] proposed an end-to-end Self-supervised Domain-aware Generative Network (SDGN), which integrates self-supervised learning into an unbiased GZSL feature generation model. Under an unrestrained training process, the generated samples can be very different from the real samples; Ma et al. [14] therefore proposed a similarity preserving loss to regularize the generative network, which reduces the distance between synthesized samples and real samples.
In ZSL, the seen and unseen categories are disjoint for image classification; thus a large deviation between the mapped semantics and the real category semantics arises when a model learned from the training set is applied to the test samples, which is called the domain shift problem. Recently, many methods have been proposed to solve this problem, such as data enhancement, self-training and hub correction [15]. Transductive ZSL was first put forward in [16], in which the visual attribute distribution of unseen data is explored using the nearest neighbor method on label prototypes. In [17], the domain shift problem is addressed by transductive multi-view label propagation, which uses the manifold information of unseen classes to compensate for sparse semantic vectors without supervision information. Rohrbach et al. [18] proposed a transductive strategy that uses a graph-based label expansion approach to mine the manifold structure of unseen category samples. Different from the above methods, Xu et al. [19] proposed a data enhancement method that trains a more universal classification model by adding auxiliary data to the seen categories.
In order to generate unseen features similar to the original unseen features, the synthesis pseudo features of GZSL [20] are synthesized from features of seen classes and semantic similarity scores between seen and unseen classes. Related works relying on GAN [21] or SAE [22] have been proposed to generate unseen features for zero-shot learning. In this way, these models transform the ZSL problem into a supervised classification task, and the classifier can be applied to both real and generated features. However, most generation models neglect the fact that the generated features are not always equally important for the classifier. To this end, we exploit the attention mechanism to pay more attention to generating effective discriminative features.
In this paper, we propose a novel method for GZSL. The original visual features are reconstructed by encoding and decoding the synthesized pseudo features. Then, to optimize the generated features, they are trained again under an attention mechanism, which encourages the model to attend to the internal information of the noisy generated features and thereby reduces the difference between the original and generated features. The goal is to capture the relationships among classes, further improve the discriminative ability of the model, alleviate the confusion between similar categories, scale the attention layer proportionally in the generated visual space, and gradually guide the generation of higher-quality visual features.
The main contributions are presented as follows:
∙ We propose a method to optimize the generated features in GZSL. The optimization model consists of an auto-encoder and an attention mechanism, with which a good classification result can be obtained.
∙ We mainly solve the domain shift problem and consider the information inside the features to reduce the difference between the generated features and the original features; therefore, the robustness on the test set is improved.
∙ Experiments on three benchmark data sets show the superiority of the proposed method. In ZSL, the classification accuracy of the proposed method is superior to the compared algorithms by 0.40% on AWA and 0.30% on SUN, respectively. In GZSL, it is superior to the compared algorithms by 0.41% in harmonic mean on AWA, and by 1.01%, 0.62%, and 1.05% in unseen-class accuracy, seen-class accuracy, and harmonic mean on SUN.
The rest of the paper is organized as follows: Section 2 reviews related work. Section 3 describes the construction and optimization of the proposed method. Section 4 presents the experimental results of the proposed method on three benchmark data sets and gives a brief analysis. Section 5 concludes the paper.
2 RELATED WORK
2.1 Zero-shot learning (ZSL)
The image classification task requires training samples of all categories to be available. However, it is impractical to establish a unified image classification system for all object categories in the world, and there are no training samples for newly emerging object categories, so recognizing objects of never-seen categories is very meaningful work. In large-scale object recognition [27, 28], the number of categories is huge, and labeling training samples for all categories is very expensive. In fine-grained classification [29, 30], the differences between categories are very small, and labeling is similarly costly; for new (unseen) object categories, no labeled image is available at all. It is therefore urgent to study algorithms that identify samples of new categories from labeled samples of known ones. In recent years, in order to identify new sample categories and reduce the cost of manual labeling, ZSL has received more and more attention. In ZSL, to recognize the unknown (unseen, test) categories without their training samples, one can only use samples and attribute knowledge of the known (seen, training) categories to learn a classifier, and then transfer it to the unseen categories to classify their samples. Early ZSL study dates back to 2008, when Larochelle et al. [31] proposed a ZSL method for the character classification problem, reaching a classification accuracy of 60%. Akata et al. [32] studied the problem of ZSL on the CUB 200-2011 bird dataset and obtained a recognition rate higher than 18%. Lampert et al.
[23] proposed the Animals with Attributes (AWA) dataset for ZSL experiments and achieved a classification accuracy higher than 40%. The development of ZSL is reviewed comprehensively in [24], including its evolution, key technologies, mainstream models, current research hotspots and future research directions. The above research results show that the ZSL problem is not intractable. Although the final results have not yet reached the expected level, ZSL is a new research hotspot whose methods and theory still need to be explored, completed and improved constantly. The main difference between ZSL and GZSL is that the former only classifies unseen samples, while GZSL recognizes both seen and unseen samples.
2.2 Projection domain shift
The problem of projection domain shift in ZSL was first identified in [17]. To overcome this problem, a transductive multi-view embedding framework was proposed, which requires access to all test data simultaneously. Similar transductive methods are proposed in [18, 28]. This assumption is usually invalid in the context of ZSL, because new classes usually appear dynamically and are unavailable before model learning. Instead of assuming access to all unseen test classes for transductive learning, our model is based on inductive learning and only relies on imposing a reconstruction constraint on the training data to combat domain shift.
2.3 Semantic auto-encoder (SAE)
Recently, many methods have been proposed to solve the domain shift problem. The SAE [33] for ZSL takes the encoder-decoder paradigm. The goal of the encoder is to map a visual feature vector into the semantic space, as in existing ZSL models. In addition, to reconstruct the original visual features, the decoder exploits an extra constraint, which enables the projection function learned from the seen classes to generalize better to the new unseen classes. The encoder and decoder are assumed to be linear and symmetric to simplify the SAE model. Recently, Liu et al. [34] put forward a Graph and Autoencoder based Feature Extraction (GAFE) model, introducing the idea of the autoencoder into ZSL; however, GAFE ignores that the hand-designed semantic attributes are meant to distinguish different categories. In addition, the differences between attributes may be very large, which makes it difficult to learn an appropriate classifier, so the mapping learned by GAFE cannot preserve the latent difference information in the data. In our work, SAE is used to reconstruct the features generated for GZSL, which solves the domain shift problem to some extent.
2.4 Attention mechanism
Although feature-generating methods are growing rapidly, overfitting of the synthetic pseudo features always exists, resulting in poor recognition performance of the generated features. The attention mechanism [35] can be used to focus on the intrinsic information of the noisy generated features, which moderates the confusion among generated features and improves their quality and discriminability. Better features can be learned through the attention mechanism, which helps to refine category features, making similar images have smaller distances and different images have larger distances. In addition, the class attention obtained from the previously generated process acts as the intrinsic information. The results indicate that this policy has the potential to extend to other tasks. In this way, the attention layer is treated as an auxiliary generation module, allowing the network to first rely on specific types of cues and then gradually learn to assign more weight to the evidence among classes. The attention mechanism also alleviates the tendency of the classifier toward certain classes. Therefore, the generated features are of higher quality and more discriminative.
3 AESPF-GZSL METHOD

Most existing methods ignore the domain shift problem during the generation phase, so they easily overfit the source classes in ZSL and GZSL tasks. In the proposed AESPF-GZSL method, the generated features are first trained in the SAE model and then refined by an attention mechanism to produce better features. Secondly, the generated unseen-class features are fed into the classifier. Experiments show that this greatly enhances the robustness of the model. The overall framework of the proposed method is shown in Figure 1. In this section, some relevant definitions and notations are introduced first, and then the algorithm is described in detail.
3.1 Definitions and notations
In this work, $D_S = \{(x_i, s_i, y_i) \mid x_i \in X_S, y_i \in Y_S, s_i \in \mathbb{R}^b\}_{i=1}^{K}$ denotes the dataset of seen classes for training, where each $x_i$ is an image feature of $X_S$ extracted by a convolutional neural network (CNN) for seen class $i$, $s_i$ is a $b$-dimensional semantic vector for seen class $i$, and $y_i$ is the label of $Y_S$ for seen class $i$. $D_U = \{(x_t, s_t, y_t) \mid x_t \in X_U, y_t \in Y_U, s_t \in \mathbb{R}^b\}_{t=K+1}^{K+L}$ denotes the dataset of unseen classes for testing, where each $x_t$ is an image feature of $X_U$ extracted by a CNN for unseen class $t$, $s_t$ is the semantic vector of unseen class $t$, and $y_t$ is the label of $Y_U$ for unseen class $t$. Note that there is no intersection between the seen classes and the unseen classes, that is, $Y_S \cap Y_U = \emptyset$. Formally, the goal of ZSL is to learn a classifier $f_{zsl}: X_S, X_U \to Y_U$, and the goal of GZSL is to learn a classifier $f_{gzsl}: X_S, X_U \to Y_S \cup Y_U$.
3.2 The learning algorithm
The cosine distance $\mu_{it}$ is chosen as the measurement of similarity between different classes in our method; the similarity score
FIGURE 1 Framework of the proposed AESPF-GZSL method
$\mu_{it}$ is used to weight the sum of the features of similar instances, where $i$ represents a seen class and $t$ an unseen class, and the synthetic pseudo feature of the unseen class is obtained as follows:

$$v_t = \sum_{j=1}^{N} \mu_{it}\, v_i^j, \qquad (1)$$

where $N$ represents the number of similar seen classes, $j$ indexes one of them, and $v_i^j$ represents the feature vector of the $j$th similar instance of class $i$.
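To make Equation (1) concrete, the following is a minimal NumPy sketch of how the pseudo feature of one unseen class might be synthesized; the function name, the top-$N$ selection and the renormalization of the similarity scores are illustrative assumptions of this sketch, not details published with the paper.

```python
import numpy as np

def synthesize_pseudo_features(seen_semantics, unseen_semantic,
                               seen_features, n_similar=5):
    """Illustrative sketch of Equation (1) for one unseen class t.

    seen_semantics  : (S, b) semantic vectors of the S seen classes
    unseen_semantic : (b,)   semantic vector of the unseen class t
    seen_features   : (S, d) one prototype visual feature per seen class
    n_similar       : N, the number of most similar seen classes used
    """
    # Cosine similarity mu_it between unseen class t and each seen class i
    norms = (np.linalg.norm(seen_semantics, axis=1)
             * np.linalg.norm(unseen_semantic))
    mu = seen_semantics @ unseen_semantic / np.clip(norms, 1e-12, None)

    # Keep the N most similar seen classes; renormalizing their scores
    # to sum to one is an assumption of this sketch
    top = np.argsort(mu)[-n_similar:]
    weights = mu[top] / mu[top].sum()

    # v_t = sum_j mu_it * v_i^j over the similar seen-class features
    return weights @ seen_features[top]
```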
Given an input data matrix $M \in \mathbb{R}^{d \times N}$ with $N$ feature vectors of dimension $d$, a projection matrix $W \in \mathbb{R}^{k \times d}$ projects it to a latent space of dimension $k$, which gives a latent representation $P \in \mathbb{R}^{k \times N}$. Then, a projection matrix $W^* \in \mathbb{R}^{d \times k}$ is used to project the latent representation back to the feature space, $\hat{M} \in \mathbb{R}^{d \times N}$, where $k < d$. The latent representation reduces the dimensionality of the original data. To minimize the reconstruction error, $\hat{M}$ is forced to be as similar as possible to $M$. The objective function is:

$$\min_{W, W^*} \|M - W^* W M\|_F^2, \quad \text{s.t.}\; WM = P. \qquad (2)$$

To simplify the model, let $W^* = W^{\mathsf{T}}$; we then have the following formulation:

$$\min_{W} \|M - W^{\mathsf{T}} W M\|_F^2, \quad \text{s.t.}\; WM = P. \qquad (3)$$

It is very difficult to solve an objective function with a hard constraint such as $WM = P$; therefore, we relax it into a soft constraint, and the objective becomes:

$$\min_{W} \|M - W^{\mathsf{T}} P\|_F^2 + \lambda \|WM - P\|_F^2, \qquad (4)$$

where $W^{\mathsf{T}}$ and $W$ act as the decoder and the encoder, respectively, and $\lambda$ is a weighting factor that balances the first and second terms, which correspond to the losses of the decoder and the encoder, respectively.
By using trace properties, Equation (4) can be rewritten as:

$$\min_{W} \|M^{\mathsf{T}} - P^{\mathsf{T}} W\|_F^2 + \lambda \|WM - P\|_F^2. \qquad (5)$$

Setting the derivative of Equation (5) to zero gives:

$$-P(M^{\mathsf{T}} - P^{\mathsf{T}} W) + \lambda (WM - P) M^{\mathsf{T}} = 0, \qquad (6)$$

$$P P^{\mathsf{T}} W + \lambda W M M^{\mathsf{T}} = P M^{\mathsf{T}} + \lambda P M^{\mathsf{T}}. \qquad (7)$$

Let:

$$A = P P^{\mathsf{T}}, \quad B = \lambda M M^{\mathsf{T}}, \quad C = (1 + \lambda) P M^{\mathsf{T}}. \qquad (8)$$

Then we have the following formula:

$$AW + WB = C, \qquad (9)$$

which is the well-known Sylvester equation [36] and can be solved efficiently by the Bartels-Stewart algorithm:

$$W = \mathrm{Sylvester}(A, B, C). \qquad (10)$$
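Since SciPy exposes the Bartels-Stewart algorithm directly, Equations (8)–(10) admit a short closed-form implementation. The sketch below assumes NumPy arrays; `learn_sae_projection` and the default value of $\lambda$ are our illustrative choices, not the paper's.

```python
import numpy as np
from scipy.linalg import solve_sylvester

def learn_sae_projection(M, P, lam=0.2):
    """Sketch of Equations (8)-(10): closed-form SAE encoder W.

    M   : (d, N) matrix of generated visual features
    P   : (k, N) matrix of the corresponding semantic vectors
    lam : the weighting factor lambda of Equation (4) (illustrative value)
    """
    A = P @ P.T                  # A = P P^T             (k x k)
    B = lam * (M @ M.T)          # B = lambda M M^T      (d x d)
    C = (1.0 + lam) * (P @ M.T)  # C = (1+lambda) P M^T  (k x d)
    # Solve A W + W B = C with the Bartels-Stewart algorithm
    return solve_sylvester(A, B, C)
```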
Here, $v_t$ can be reconstructed as follows:

$$s_t = W v_t, \quad v_t = W^{\mathsf{T}} s_t, \qquad (11)$$

where $s_t$ is the semantic representation of unseen class $t$; the encoder $W$ projects the feature vector $v_t$ to $s_t$, and the decoder $W^{\mathsf{T}}$ projects the latent representation $s_t$ back to $v_t$, the reconstructed generated feature. This method can alleviate the domain shift problem efficiently.
The attention mechanism is introduced into AESPF-GZSL to improve the quality of the generated features and make them more discriminative. The generated features $v_t$ are first copied into two feature spaces, $f(v_t)$ and $g(v_t)$, while $h(v_t)$ represents the original features. The equation is as follows:

$$k_{it} = f(v_t^i)\, g(v_t^t)^{\mathsf{T}}, \quad Q_{i,t} = \frac{\exp(k_{it})}{\sum_{i=1}^{d} \exp(k_{it})}, \qquad (12)$$
ALGORITHM 1 AESPF-GZSL Algorithm

Input: $D_S = \{(x_i, s_i, y_i) \mid x_i \in X_S, y_i \in Y_S, s_i \in \mathbb{R}^b\}_{i=1}^{K}$, $D_U = \{(x_t, s_t, y_t) \mid x_t \in X_U, y_t \in Y_U, s_t \in \mathbb{R}^b\}_{t=K+1}^{K+L}$, $\lambda$, $\alpha$.
Output: $v_t$, $\varpi_t$, $acc_{tr}$, $acc_{te}$, $H$
1: According to Equation (1), synthesize the pseudo features: $v_t \leftarrow \sum_{j=1}^{N} \mu_{it} v_i^j$.
2: According to Equation (10), learn the projection function: $W \leftarrow \mathrm{SAE}$.
3: According to Equation (11), with the learned $W$, reconstruct $v_t$: $s_t \leftarrow W v_t$, $v_t \leftarrow W^{\mathsf{T}} s_t$.
4: According to Equation (12), compute $k_{it} \leftarrow f(v_t^i)\, g(v_t^t)^{\mathsf{T}}$ and $Q_{i,t} \leftarrow \exp(k_{it}) / \sum_{i=1}^{d} \exp(k_{it})$.
5: According to Equation (13), compute $\varpi_t \leftarrow \alpha \sum_{i=1}^{d} Q_{i,t} h(v_t) + (1 - \alpha) h(v_t)$.
6: Feed a part of the seen-class samples and the unseen-class samples into the classifier.
7: Obtain $acc_{tr}$, $acc_{te}$, and $H$.
where the $i$th row and the $t$th column of $Q$ reflect the effect of the $t$th class on the $i$th class; that is, the product between the different generated features is regarded as a category correlation matrix. Each column of $Q$ after the softmax (normalized over $i$) therefore represents a mode of attention. The output formula is:

$$\varpi_t = \alpha \sum_{i=1}^{d} Q_{i,t}\, h(v_t) + (1 - \alpha)\, h(v_t), \qquad (13)$$

where $Q$ is the attention map and $\varpi$ is the category attention representation, obtained by weighting the original features with the attention weights. The parameter $\alpha$ controls the effect of attention on the resulting representation $\varpi$.
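Read as standard self-attention, that is, mixing $h$ across the generated class features with the weights $Q$, Equations (12) and (13) can be sketched in PyTorch as follows; the linear forms of $f$, $g$, $h$, the hidden size, and the default $\alpha$ are illustrative assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def attend_generated_features(V, d_hidden=64, alpha=0.4):
    """Illustrative sketch of Equations (12)-(13).

    V     : (n, d) batch of generated pseudo features
    alpha : weight balancing attended and original representations
    """
    n, d = V.shape
    # f, g, h are taken here as (randomly initialized) linear maps; in
    # practice they would be learned together with the rest of the model
    f = torch.nn.Linear(d, d_hidden, bias=False)
    g = torch.nn.Linear(d, d_hidden, bias=False)
    h = torch.nn.Linear(d, d, bias=False)

    # Equation (12): k_it = f(v_i) g(v_t)^T, softmax-normalized over i
    K = f(V) @ g(V).T        # (n, n) raw correlation scores
    Q = F.softmax(K, dim=0)  # each column of Q sums to one over i

    # Equation (13): mix h(v) across classes with the attention weights
    H = h(V)                 # (n, d)
    attended = Q.T @ H       # row t is sum_i Q_{i,t} h(v_i)
    return alpha * attended + (1.0 - alpha) * H
```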
We can learn better features through the attention mechanism, which helps to refine the category features, so that similar samples have smaller distances and dissimilar samples have larger distances. Therefore, the features generated by our method have higher quality and stronger discriminability. The AESPF-GZSL method is described in Algorithm 1; a minimal end-to-end sketch in code follows.
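The following sketch wires Algorithm 1 together end to end. It reuses the illustrative helpers sketched above (`synthesize_pseudo_features`, `learn_sae_projection`, `attend_generated_features`), which are our names for the steps rather than code published with the paper.

```python
import numpy as np
import torch

def aespf_gzsl(seen_semantics, unseen_semantics, seen_features,
               lam=0.2, alpha=0.4):
    """Illustrative wiring of Algorithm 1, steps 1-5."""
    # Step 1 (Eq. 1): one synthesized pseudo feature per unseen class
    V = np.stack([synthesize_pseudo_features(seen_semantics, s_u, seen_features)
                  for s_u in unseen_semantics])            # (U, d)

    # Step 2 (Eq. 10): learn the SAE projection from the generated features
    W = learn_sae_projection(V.T, unseen_semantics.T, lam=lam)

    # Step 3 (Eq. 11): encode to the semantic space, decode back
    V = (W.T @ (W @ V.T)).T

    # Steps 4-5 (Eqs. 12-13): refine the reconstructed features by attention
    V = attend_generated_features(torch.from_numpy(V).float(), alpha=alpha)

    # Steps 6-7: a supervised classifier trained on real seen features plus
    # these refined pseudo features then yields acc_tr, acc_ts and H
    return V.detach().numpy()
```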
4 EXPERIMENT

In this section, experiments are implemented to verify the effectiveness of the proposed method, which is applied to three benchmark data sets and compared with recent models under the same conditions. All experiments are carried out with PyTorch [37] in an Ubuntu 14.04 environment.
4.1 Datasets and implementation details
AWA is a coarse-grained dataset which consists of 30,475
images for 50 animal classes and each class has an 85-
dimensional attribute vector. CUB is a fine-grained dataset
which consists of 11,788 images for 200 bird species and it
provides an instance-level attribute vectors; a 312-dimensional
class-level attribute vector is used in the experiments for sim-
plicity. SUN is also a fine-grained dataset which consists of
14,340 images for 717 different common scenes, and each class has a 102-dimensional attribute vector.

TABLE 1 The statistics of the three benchmark datasets

Dataset   Instances   Attributes   Seen/unseen classes
AWA       30,475      85           40/10
CUB       11,788      312          150/50
SUN       14,340      102          645/72
TABLE 2 ZSL classification accuracies (%) with various approaches

Methods          AWA     CUB     SUN     Average
CONSE            63.60   36.70   44.20   48.17
SSE              68.80   43.70   54.50   55.67
LATEM            74.80   49.40   56.90   60.37
ALE              78.60   53.20   59.10   63.63
DEVISE           72.90   53.20   57.50   61.20
SJE              76.70   55.30   57.10   63.03
ESZSL            74.70   55.10   57.30   62.37
SYNC             72.20   54.10   59.10   61.80
GFZSL            80.50   53.00   62.90   65.47
SCoRe            -       59.50   -       -
CSSD             81.20   52.50   -       -
NCE-based MIE    83.10   57.40   64.40   68.30
SAE              80.60   33.40   42.40   52.13
SE-ZSL           83.80   60.30   64.50   69.53
AESPF-GZSL       84.20   59.80   64.80   69.60

Note: The best results are marked in bold, and the second-best ones are underlined.
The dataset segmentation follows the setting of [38]. Table 1 summarizes the detailed
information of the three datasets, including AWA [23], CUB [25]
and SUN [26].
In our experiments, 24,790 images from 40 categories in the AWA dataset are used as seen classes and the remaining images are used as unseen classes. Similarly, 8,821 images from 150 categories in the CUB dataset and 12,904 images from 645 categories in the SUN dataset are used as seen classes and the remaining images are used as unseen classes, respectively. Then, 20% of the images from the seen classes are randomly selected as the first test set and all the images from the unseen classes are used as the second test set for the AWA, CUB and SUN datasets separately. Our model mainly follows the setting of [20]. Once the synthesis pseudo features are obtained, we reconstruct the generated features with $W$ and $W^{\mathsf{T}}$. Since the comparison models and our model use the same datasets, no further operations on these datasets are needed. Then, we train the generated features in the attention mechanism. In this stage, we introduce a parameter $\alpha$ to control the proportion between the original features and the category attention representation. Finally, the generated features are introduced into the GZSL model to join the classification operation in place of the originally generated features.
TABLE 3 GZSL classification accuracies (%) of different approaches

                 AWA                    CUB                    SUN
Methods       ts      tr      H      ts      tr      H      ts      tr      H
DEM           32.80   84.70   47.30  19.60   54.00   13.60  20.50   34.30   25.60
LESAE         19.10   70.20   30.00  24.30   53.00   33.30  21.90   34.70   26.90
TVN           27.00   67.90   38.60  26.50   62.30   37.20  22.20   38.30   28.10
ZSKL          18.30   79.30   29.80  24.20   63.90   35.10  21.00   31.00   25.10
CSSD          34.70   87.10   49.60  19.10   62.70   29.30  -       -       -
BZSL          19.90   23.90   21.70  18.90   25.10   20.90  17.30   17.60   17.40
UVDS          15.30   79.50   25.70  23.80   76.50   36.30  -       -       -
DCN           25.50   84.20   39.10  28.40   60.70   38.70  25.50   37.00   30.20
NIWT          -       -       -      20.70   41.80   27.70  -       -       -
RN            31.40   91.30   46.70  38.10   61.10   47.00  -       -       -
SPF-GZSL      48.50   59.80   53.60  30.20   63.40   40.90  32.20   59.00   41.60
AESPF-GZSL    48.12   61.54   54.01  27.20   62.62   37.93  33.21   59.62   42.65

Note: ts = acc_ts and tr = acc_tr (top-1 per-class accuracy on Y_u and Y_s, respectively); H is the harmonic mean. The results of NIWT, RN, DEM, and SPF-GZSL come from the papers published by their own authors. The best results are marked in bold, and the second-best ones are underlined.
4.2 Results on traditional ZSL
In the training phase under the traditional ZSL setting, we first train the generalized model on $X_s$ and then synthesize the unseen-class samples. In the test phase, the test data all come from unseen classes. Therefore, if we let $\eta = 0$, the test data contain no seen classes, which differs from GZSL. The average precision of each class is shown in Table 2. To evaluate the classification performance of the algorithm, we list 14 existing methods for comparison: CONSE [39], SSE [40], LATEM [41], ALE [32], DEVISE [42], SJE [29], ESZSL [43], SYNC [44], GFZSL [45], SCoRe [46], CSSD [47], NCE-based MIE [48], SAE [33], and SE-ZSL [49].
4.3 Results on generalized ZSL (GZSL)
Under the GZSL setting, the examples used for evaluation come from both seen and unseen classes, which differs from traditional ZSL. Specifically, we take samples from the seen classes with probability $\eta$ and mix them with samples from the unseen classes. Our goal is to achieve high accuracy on both. Therefore, we choose the harmonic mean as the main evaluation index, computed by the following function [38]:

$$H = \frac{2 \times acc_{tr} \times acc_{ts}}{acc_{tr} + acc_{ts}}, \qquad (14)$$

where $acc_{tr}$ and $acc_{ts}$ are the average per-class top-1 (T1) accuracies of the test images from seen and unseen classes, respectively.
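As a quick sanity check, Equation (14) is a one-liner; the example values below reproduce the AESPF-GZSL row of Table 3 on AWA.

```python
def harmonic_mean(acc_tr: float, acc_ts: float) -> float:
    """Equation (14): harmonic mean of seen (tr) and unseen (ts) accuracies."""
    if acc_tr + acc_ts == 0:
        return 0.0
    return 2.0 * acc_tr * acc_ts / (acc_tr + acc_ts)

# harmonic_mean(61.54, 48.12) -> 54.01, the AESPF-GZSL H value on AWA
```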
To evaluate the classification performance of the proposed method, 11 up-to-date methods are listed for comparison, with results shown in Table 3: DEM [50], LESAE [51], TVN [52], ZSKL [53], CSSD [54], BZSL [55], UVDS [56], DCN [57], NIWT [58], RN [59], and SPF-GZSL [20]. The results show that the H-mean value of our approach on AWA is better than that of all other methods, with a significant increase of 0.41% in the H-mean (harmonic mean), while the ts value (the top-1 per-class accuracy on the test set $Y_u$) ranks second. On the three evaluation indexes of SUN, we achieve the best results: compared with the best existing results, ts, tr (the top-1 per-class accuracy on the train set $Y_s$) and H-mean increase by 1.01%, 0.62% and 1.05%, respectively. These results show that the method proposed in this paper achieves remarkable results.
4.4 Analysis of experimental results
4.4.1 Classification accuracy
We observe from Table 2 that on the AWA data set, the AESPF-ZSL algorithm is better than the existing baseline methods, and the classification accuracy of our method is superior to the SE-ZSL method by 0.4%. On the CUB data set, compared with the existing methods, the accuracy of our method ranks second. In addition, on the SUN data set, AESPF is superior to the existing baseline methods in all tasks, and the classification accuracy of our method exceeds the SE-ZSL method by 0.3%. Moreover, our method achieves the highest average accuracy.

We observe from Table 3 that on the AWA data set, the AESPF-GZSL algorithm is better than the existing baseline methods in one third of the tasks. Compared with the SPF-GZSL method, the proposed method improves the H average by 0.41% and the tr average significantly by
FIGURE 2 ZSL classification accuracy with $\alpha$ on three datasets
1.74%. In addition, compared with all the baseline methods, the ts value of our method ranks second. AESPF can assign appropriate attention weights according to different scenarios, so as to significantly improve the accuracy. On the CUB data set, AESPF-GZSL does not obtain a good accuracy. On the SUN data set, the AESPF-GZSL algorithm is superior to the existing baseline methods in all tasks.
4.4.2 Parameter sensitivity
To investigate the sensitivity of the AESPF-ZSL algorithm to the parameter $\alpha$, classification experiments are conducted on the AWA, CUB and SUN data sets, respectively. The results are shown in Figure 2, from which we conclude: on AWA, the optimal classification accuracy over $\alpha \in [0.1, 0.9]$ is obtained at $\alpha = 0.6$; on CUB, the classification accuracy is optimal at $\alpha = 0.4$; and on SUN, the best value is reached at $\alpha = 0.7$.

To investigate the sensitivity of the AESPF-GZSL algorithm to $\alpha$, classification experiments are likewise conducted on the AWA, CUB and SUN data sets. The results are shown in Figures 3-5, from which it can be seen that the AESPF algorithm is less robust to the parameter on AWA: as shown in Figure 3, $\alpha$ has a large influence on the classification accuracy, which peaks at $\alpha = 0.4$. The robustness on the CUB and SUN data sets is stronger. On CUB, the classification accuracy of AESPF decreases as $\alpha$ increases over $\alpha \in [0.1, 0.9]$ (as shown in Figure 4), and the H-mean reaches its maximum at $\alpha = 0.3$, where the classification accuracy is optimal. As shown in Figure 5, on SUN the classification accuracy tends to be stable as $\alpha$ increases within $\alpha \in [0.3, 0.8]$, and the best classification effect is obtained at $\alpha = 0.4$.
FIGURE 3 GZSL classification accuracies with $\alpha$ on AWA

FIGURE 4 GZSL classification accuracies with $\alpha$ on CUB

FIGURE 5 GZSL classification accuracies with $\alpha$ on SUN
5 CONCLUSION

In this paper, a novel method was proposed to optimize synthesis pseudo features for GZSL. The generated features were introduced into the SAE model and an attention mechanism, which optimizes the generated features and alleviates the domain shift problem; to some extent, a higher classification accuracy was obtained. We evaluated our method on three benchmark data sets, that is, AWA, CUB and SUN, for ZSL and GZSL, and the experimental results show the superiority of the proposed approach in both the traditional zero-shot and generalized zero-shot settings. However, a limitation of the proposed AESPF-GZSL model is that the running time increases as the number of generated features grows. In the future, we will attempt to design a more reasonable procedure for selecting generated features. Besides, extending the proposed AESPF-GZSL model to domain adaptation and cross-modal image retrieval will be of considerable interest.
ACKNOWLEDGEMENTS
This work was supported in part by the National Natural Sci-
ence Foundation of China (NSFC) under Grant Nos. 62002102,
62176113 and 62072121, and in part by the Key Technolo-
gies R & D Program of Henan Province under Grant Nos.
202102210169 and 212102210088, and in part by the Luoyang
Major Scientific and Technological Innovation Projects under
Grant No. 2101017A.
CONFLICT OF INTEREST
The authors have declared no conflict of interest.
DATA AVAILABILITY STATEMENT
There is no data available for this manuscript.
ORCID
Jianqiang Song https://orcid.org/0000-0002-9643-799X
REFERENCES
1. Zhang, W., Zhang, Z., Zeadally, S., et al.: MASM: A multiple-algorithm
service model for energy-delay optimization in edge artificial intelligence.
IEEE Trans. Ind. Inf. 15(7), 4216–4224 (2019)
2. Zhu, J., Xie, P., Zhang, M., et al.: Distributed stochastic subgradient pro-
jection algorithms based on weight-balancing over time-varying directed
graphs. Complexity 2019, 8030792 (2019)
3. Zhu, J., Xu, C., Guan, J., et al.: Differentially private distributed online
algorithms over time-varying directed networks. IEEE Trans. Signal Inf.
Process. Networks 4(1), 4–17 (2018)
4. Zhou, Y., Zhang, M., Zhu, J., et al.: A randomized block-coordinate adam
online learning optimization algorithm. Neural Comput. Appl. 32, 12671–
12684 (2020)
5. Zhang, M., Zhou, Y., Quan, W., et al.: Online learning for IoT optimiza-
tion: A Frank-Wolfe Adam-Based Algorithm. IEEE Internet Things J.
7(9), 8228–8237 (2020)
6. Wang, M., Jia, S., Chen, E., et al.: A derived least square fast learning
network model. Appl. Intell. 50(12), 4176–4194 (2020)
7. Jin, X.B., Xie, G.S., Huang, K., et al.: Discriminant zero-shot learning with
center loss. Cognit. Comput. 11(4), 503–512 (2019)
8. Jin, X.B., Xie, G.S., Huang, K., et al.: Beyond attributes: high-order
attribute features for zero-shot learning. In: Proceedings of the IEEE
International Conference on Computer Vision Workshops. IEEE, Piscat-
away (2019)
9. Xie, G.S., Liu, L., Zhu, F., et al.: Region graph embedding network for zero-
shot learning. In: European Conference on Computer Vision, pp. 562–580.
Springer, Berlin (2020)
10. Xie, G.S., Liu, L., Jin, X., et al.: Attentive region embedding network for
zero-shot learning. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pp. 9384–9393. IEEE, Piscataway (2019)
11. Li, X., Fang, M., Li, H., et al.: Learning discriminative and meaningful
samples for generalized zero shot classification. Signal Process. Image
Commun. 87, 115920 (2020)
12. Han, Z., Fu, Z., Yang, J.: Learning the redundancy-free features for general-
ized zero-shot object recognition. In: Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition, pp. 12865–12874.
IEEE, Piscataway (2020)
13. Wu, J., Zhang, T., Zha, Z.J., et al.: Self-supervised domain-aware gener-
ative network for generalized zero-shot learning. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.
12767–12776. IEEE, Piscataway (2020)
14. Ma, Y., Xu, X., Shen, F., et al.: Similarity preserving feature generating
networks for zero-shot learning. Neurocomputing 406, 333–342 (2020)
15. Yu, Y., Ji, Z., Guo, J., et al.: Transductive zero-shot learning with adaptive
structural embedding. IEEE Trans. Neural Networks Learn. Syst. 29(9),
4116–4127 (2017)
16. Fu, Y., Hospedales, T.M., Xiang, T., et al.: Attribute learning for
understanding unstructured social activity. In: European Conference on
Computer Vision, pp. 530–543. Springer, Berlin (2012)
17. Fu, Y., Hospedales, T.M., Xiang, T., et al.: Transductive multi-view zero-
shot learning. IEEE Trans. Pattern Anal. Mach. Intell. 37(11), 2332–2345
(2015)
18. Rohrbach, M., Ebert, S., Schiele, B.: Transfer learning in a transductive
setting. Adv. Neural Inf. Process. Syst. 15(4), 229–237 (2013)
19. Xu, X., Hospedales, T., Gong, S.: Semantic embedding space for zero-
shot action recognition. In: IEEE International Conference on Image
Processing, pp. 63–67. IEEE, Piscataway (2015)
20. Li, C., Ye, X., Yang, H., et al.: Generalized zero shot learning via synthesis
pseudo features. IEEE Access 7, 87827–87836 (2019)
21. Xie, G.S., Zhang, Z., Liu, G.S., et al.: Generalized zero-shot learning
with multiple graph adaptive generative networks. IEEE Trans. Neural
Networks Learn. Syst. 33(7), 2903–2915 (2022)
22. Xie, G.S., Zhang, X.Y., Yao, Y.Z., et al.: VMAN: A virtual mainstay alignment
network for transductive zero-shot learning. IEEE Trans. Image Process.
30, 4316–4329 (2021)
23. Lampert, C.H., Nickisch, H., Harmeling, S.: Attribute-based classification
for zero-shot visual object categorization. IEEE Trans. Pattern Anal. Mach.
Intell. 36(3), 453–465 (2014)
24. Sun, X., Gu, J., Sun, H.: Research progress of zero-shot learning. Appl.
Intell. 51(2), 1–15 (2020)
25. Wah, C., Branson, S., Perona, P., et al.: Multiclass recognition and part
localization with humans in the loop. In: Proceedings of the International
Conference Computer Vision (ICCV), pp. 2524–2531. IEEE, Piscataway
(2011)
26. Patterson, G., Hays, J.: SUN attribute database: Discovering, annotating,
and recognizing scene attributes. In: Proceedings of IEEE Computer
Vision Pattern Recognition (CVPR), pp. 2751–2758. IEEE, Piscataway
(2012)
27. Deng, J., Ding, N., Jia, Y., et al.: Large-scale object classification using label
relation graphs. In: European Conference on Computer Vision, pp. 48–64.
Springer, Berlin (2014)
28. Rohrbach, M., Stark, M., Schiele, B.: Evaluating knowledge transfer and
zero-shot learning in a large-scale setting. In: Proceedings of IEEE Com-
puter Vision Pattern Recognition, CVPR 2011, pp. 1641–1648. IEEE,
Piscataway (2011)
29. Akata, Z., Reed, S., Walter, D., et al.: Evaluation of output embeddings
for fine-grained image classification. In: Proceedings of the IEEE Confer-
ence on Computer Vision and Pattern Recognition, pp. 2927–2936. IEEE,
Piscataway (2015)
30. Duan, K., Parikh, D., Crandall, D., et al.: Discovering localized attributes
for fine-grained recognition. In: 2012 IEEE Conference on Computer
Vision and Pattern Recognition, pp. 3474–3481. IEEE, Piscataway (2012)
31. Lampert, C.H., Nickisch, H., Harmeling, S.: Learning to detect unseen
object classes by between-class attribute transfer. In: 2009 IEEE Confer-
ence on Computer Vision and Pattern Recognition, pp. 951–958. IEEE,
Piscataway (2009)
32. Akata, Z., Perronnin, F., Harchaoui, Z., et al.: Label-embedding for
attribute-based classification. In: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pp. 819–826. IEEE, Piscataway
(2013)
33. Kodirov, E., Xiang, T., Gong, S.: Semantic autoencoder for zero-shot
learning. In: Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pp. 3174–3183. IEEE, Piscataway (2017)
34. Liu, Y., Gao, Q.X., Han, J.G., et al.: Graph and autoencoder based fea-
ture extraction for zero-shot learning. In: International Joint Conference
on Artificial Intelligence, pp. 15–36. Morgan Kaufmann, San Francisco
(2019)
35. Lu, Z., Yu, Y., Lu, Z.M., et al.: Attentive semantic preservation network
for zero-shot learning. In: Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition Workshops, pp. 682–683.
IEEE, Piscataway (2020)
36. Bartels, R.H., Stewart, G.W.: Solution of the matrix equation AX+XB=C.
Commun. ACM 15(9), 820–826 (1972)
37. PyTorch. https://github.com/pytorch/pytorch. Accessed 21 June 2012
38. Xian, Y., Lampert, C.H., Schiele, B., et al.: Zero-shot learning–A compre-
hensive evaluation of the good, the bad and the ugly. IEEE Trans. Pattern
Anal. Mach. Intell. 41(9), 2251–2265 (2018)
39. Norouzi, M., Mikolov, T., Frome, A., et al.: Zero-shot learning by convex
combination of semantic embeddings. In: Proceedings of the Interna-
tional Conference on Machine Learning (ICML), p. 19. ACM, New York
(2014)
40. Zhang, Z., Saligrama, V.: Learning joint feature adaptation for zero-shot
recognition (2016)
41. Xian, Y., Akata, Z., Sharma, G., et al.: Latent embeddings for zero-shot
classification. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pp. 69–77. IEEE, Piscataway (2016)
42. Frome, A., Corrado, G.S., Shlens, J., et al.: Devise: A deep visual-semantic
embedding model. In: Proceedings of the Conference and Workshop on
Neural Information Processing Systems (NIPS), pp. 2121–2129. NIPS
Foundation, La Jolla (2013)
43. Romera-Paredes, B., Torr, P.H.: An embarrassingly simple approach to
zero-shot learning. In: International Conference on Machine Learning
(ICML), pp. 2152–2161. ACM, New York (2015)
44. Changpinyo, S., Chao, W.L., Gong, B., et al.: Synthesized classifiers for
zero-shot learning. In: Proceedings of the IEEE Conference on Com-
puter Vision and Pattern Recognition, pp. 5327–5336. IEEE, Piscataway
(2016)
45. Verma, V.K., Rai, P.: A simple exponential family framework for zero-
shot learning. In: Joint European Conference on Machine Learning and
Knowledge Discovery in Databases. Springer, Berlin (2017)
46. Morgado, P., Vasconcelos, N.: Semantically consistent regularization for
zero-shot recognition. In: Proceedings of the IEEE Conference on Com-
puter Vision and Pattern Recognition, pp. 6060–6069. IEEE, Piscataway
(2017)
47. Ji, Z., Wang, J., Yu, Y., et al.: Class-specific synthesized dictionary model
for zero-shot learning. Neurocomputing 329, 339–347 (2019)
48. Tang, C., Yang, X., Lv, J., et al.: Zero-shot learning by mutual informa-
tion estimation and maximization. Knowledge-Based Systems 194, 105490
(2020)
49. Verma, V.K., Arora, G., Mishra, A., et al.: Generalized zero-shot learn-
ing via synthesized examples. In: Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pp. 4281–4289. IEEE,
Piscataway (2018)
50. Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for
zero-shot learning. In: Proceedings of the IEEE Conference on Com-
puter Vision and Pattern Recognition (CVPR), pp. 2021–2030. IEEE,
Piscataway (2017)
51. Liu, Y., Gao, Q., Li, J., et al.: Zero shot learning via low-rank embed-
ded semantic autoencoder. In: Proceedings of the Twenty-Seventh
International Joint Conference on Artificial Intelligence (IJCAI-18), pp.
2490–2496. Morgan Kaufmann, San Francisco (2018)
52. Zhang, H., Long, Y., Guan, Y., et al.: Triple verification network for gen-
eralized zero-shot learning. IEEE Trans. Image Process. 28(1), 506–517
(2019)
53. Zhang, H., Koniusz, P.: Zero-shot kernel learning. In: Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, pp.
7670–7679. IEEE, Piscataway (2018)
54. Han, J., Pang, Y., Ji, Z., et al.: Class-specific synthesized dictionary model
for zero-shot learning. Neurocomputing 329, 339–347 (2019)
55. Shen, F., Zhou, X., Yu, J., et al.: Scalable zero-shot learning via
binary visual-semantic embeddings. IEEE Trans. Image Process. 28(7),
3662–3674 (2019)
56. Long, Y., Liu, L., Shen, F., et al.: Zero-shot learning using synthesised
unseen visual data with diffusion regularisation. IEEE Trans. Pattern Anal.
Mach. Intell. 40(10), 2498–2512 (2018)
57. Liu, S., Long, M., Wang, J., et al.: Generalized zero-shot learning with deep
calibration network. In: Neural Information Processing Systems (NIPS),
pp. 2009–2019. NIPS Foundation, La Jolla (2018)
58. Selvaraju, R.R., Chattopadhyay, P., Elhoseiny, M., et al.: Choose your neu-
ron: Incorporating domain knowledge through neuron-importance. In:
Proceedings of the European Conference on Computer Vision (ECCV),
pp. 526–541. Springer, Berlin (2018)
59. Sung, F., Yang, Y., Zhang, L., et al.: Learning to compare: Relation network
for few-shot learning. In: Proceedings of the IEEE Conference on Com-
puter Vision and Pattern Recognition, pp. 1199–1208. IEEE, Piscataway
(2018)
How to cite this article: Wang, L., Shen, Z., Wang, G.,
Song, J., Wu, Q.: Auto-encode the synthesis pseudo
features for generalized zero-shot learning. J. Eng. 2022,
985–993 (2022). https://doi.org/10.1049/tje2.12185