
Tensor-based Subspace Factorization for StyleGAN

René Haas, Stella Graßhof, and Sami S. Brandt

IT University of Copenhagen, Copenhagen, Denmark

Abstract— In this paper, we propose τGAN, a tensor-based method for modeling the latent space of generative models. The objective is to identify semantic directions in latent space. To this end, we propose to fit a multilinear tensor model on a structured facial expression database, which is initially embedded into latent space. We validate our approach on StyleGAN trained on FFHQ, using BU-3DFE as a structured facial expression database. We show how the parameters of the multilinear tensor model can be approximated by Alternating Least Squares. Further, we introduce a stacked style-separated tensor model, defined as an ensemble of style-specific models, to integrate our approach with the extended latent space of StyleGAN. We show that taking the individual styles of the extended latent space into account leads to higher model flexibility and lower reconstruction error. Finally, we perform several experiments comparing our approach to prior work on both GANs and multilinear models. Concretely, we analyze the expression subspace and find that the expression trajectories meet at an apathetic face, which is consistent with earlier work. We also show that when changing the pose of a person, the image generated by our approach is closer to the ground truth than the results of two competing approaches.

I. INTRODUCTION

In this paper, we propose a novel framework for finding semantic directions in the latent space of Generative Adversarial Networks (GANs) [10]. GANs have, since their proposal, emerged as one of the most dominant approaches to unsupervised representation learning in Computer Vision and beyond [23].

Architecturally, a GAN consists of two neural networks trained simultaneously: a generator and a discriminator. The generator produces images by sampling from its latent space, while the discriminator, a binary classifier, tries to discriminate the generated images from the training images. The goal of training is to reach the equilibrium of the min-max game between the two adversaries, such that neither can improve by changing its parameter values. At equilibrium, the discriminator can be discarded, and the generator can then be used to produce new data by sampling from the latent distribution. The new data points follow the same statistics as the training data but are not contained in it. Modern state-of-the-art GAN variants have borrowed from the style-transfer literature [14], [22] to disentangle the latent space and synthesize high-quality face images. Work by [17], [18], and most recently [16], showed how to train a state-of-the-art StyleGAN model, even in cases of limited data.

A recent goal has been to find semantically interpretable directions in GAN latent spaces, and several approaches for semantic face editing have been proposed. Semantic face editing refers to the ability to change various semantic attributes, such as identity, expression, rotation, and gender, of the generated images. Early work used an information criterion (InfoGAN) [6] to determine semantic directions. However, as pointed out in [8], there is no guarantee that the latent codes produced by this method are semantically meaningful. Additional unsupervised approaches for finding semantic directions in StyleGAN include Principal Component Analysis (PCA) on sampled latent codes [15] and the closed-form factorization suggested by [25].

Fig. 1: Overview of the proposed approach: from an input image we estimate model parameters, reconstruct the image, and edit the parameters to produce, e.g., a rotated version.

A recent approach for finding semantic directions in StyleGAN in a supervised fashion is to train binary linear classifiers (SVMs) to detect single binary semantic attributes, such as smile vs. no smile, male vs. female, or glasses vs. no glasses. For a given semantic attribute, the semantic direction can then be defined as the normal to the supporting hyperplanes of the trained SVM [24].

In the literature, a wide collection of multilinear methods has been proposed to model and analyze faces and expressions. Early PCA- or dictionary-based 3D Morphable Models (3DMMs) [3], [9] capture the variation in shape and texture of neutral 3D faces. Recently, 3DMMs have also been used to make semantic edits to images generated by StyleGAN [27]. More recently, factorization methods based on higher-order data representations were introduced, with the benefit of better disentanglement of dimensions, such as person-specific shape and expression, when compared to matrix methods [28], [30]. These models are built on the Higher-Order Singular Value Decomposition (HOSVD) to factorize the data, and have successfully been used to model faces and their 3D reconstruction, as well as to transfer expressions [4], [5]. Moreover, in [11], [12] an HOSVD tensor model was constructed from the Binghamton 3D facial expression database (BU-3DFE) [33], which revealed a practically planar expression subspace in which the six basic emotions form one-dimensional affine subspaces [11]. These six lines intersect in a common vertex, the origin of expressions, which surprisingly does not represent the neutral face but an extrapolated expression referred to as apathetic.

The main novelty of this work is to use a multilinear face model to analyze the latent space of GANs. More specifically, we propose to use the HOSVD to factorize the latent space into semantically meaningful linear subspaces that yield a multilinear tensor model. Given an input image, we estimate the model parameters to approximate the input, and then change one attribute, such as rotation, as illustrated in Fig. 1.

Fig. 2: Architecture of the StyleGAN generator: the mapping network f: Z → W (fully connected layers) followed by the synthesis network g: W → X (convolutional layers), taking z ∈ Z to w ∈ W to the image x ∈ X.

The main contributions of this paper are as follows:
• We propose a novel method for semantic face editing with StyleGAN.
• We propose a method to estimate model parameters and present reasonable regularization, enabling stable parameter transfer.
• We show that expression trajectories intersect at a unique point, corresponding to the origin of expressions, which differs from the neutral face, confirming the earlier findings [11], [12] based on BU-3DFE.
• We propose an extended model, based on style separation, which leads to greater model flexibility and lower reconstruction error for independent test images.

The paper is organized as follows: In Sec. II we review the StyleGAN architecture and outline how to embed reference images into the latent space of StyleGAN. In Sec. III we present our tensor-based GAN model, which we build "on top of" the StyleGAN latent space, and elaborate on how to approximate the model parameters for a given latent vector. Experiments and results of our proposed approach are presented in Sec. IV, followed by a summary and conclusion in Sec. V.

II. STYLEGAN

In this section, we will review the StyleGAN architecture

and explain how to embed reference images into the latent

space of the pre-trained models released by Nvidia [17], [18].

A. StyleGAN Architecture

The StyleGAN generator G: Z → X, where G = g ∘ f, is composed of two networks, the mapping network f: Z → W and the synthesis network g: W → X, see Fig. 2. The mapping network f maps the latent vector z ∈ Z to the vector w = f(z) in the auxiliary latent space W, while the synthesis network g: W → X maps the vector w ∈ W to the final output image x ∈ X in image space. The full generator G thus maps the latent vector z to an image x. The notation used in this paper is summarized in Tab. I.

Fig. 3: Diagram illustrating image embedding into the auxiliary latent space W: a ResNet encoder provides an initial estimate ŵ_0 for the input x, which is then iteratively refined by minimizing a VGG16 feature loss between x and x̂_i = g(ŵ_i).

B. Generator Inversion

GANs do not include an encoder as part of their architecture. Therefore, a goal in GAN research has been to find a latent code that produces an image as close as possible to a given reference image, which we refer to as embedding an image into the latent space. The problem can be considered as inverting the synthesis network, g^{-1}: X → W [1], [21], while inverting G, and thereby embedding into Z space, has been investigated in [18]. Contemporary techniques for W-space embedding, i.e., finding g^{-1}, use a VGG network [26]. Our approach for embedding into the auxiliary latent space W is illustrated in Fig. 3. The inverse generator G^{-1}: X → Z yields the latent vector z = G^{-1}(x), with G^{-1} = f^{-1} ∘ g^{-1}, for the input image x.

The initial estimate of the auxiliary latent vector for a given reference image is computed as follows. We use the pre-trained weights of StyleGAN [17] and the recently revised StyleGAN2 [18] architecture. Then, as proposed in [2], we train a ResNet [13] in a supervised setting on synthetic StyleGAN data to approximate g^{-1}, which yields the initial estimate ŵ_0 for the latent vector. The auxiliary latent vector is then refined using the VGG16 network [26], pre-trained on the ImageNet database, with its classification layer removed, so that the truncated network produces a high-dimensional feature vector for a given input image, as described in [34]. Since the trained generator is fully differentiable, the loss can be calculated in VGG feature space and the gradients back-propagated through the generator, so the latent code can be updated iteratively. This approach is also used in [21]. We found that using the ResNet estimate to initialize the VGG optimization leads to faster convergence than optimizing without it.
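A minimal sketch of this two-stage embedding is given below (PyTorch). The handles `g`, `resnet_encoder`, and `vgg_features` are assumptions standing in for the differentiable synthesis network, the ResNet of [2], and VGG16 truncated before its classification layer; they are not part of any released API.

```python
# Sketch of the two-stage W-space embedding: ResNet initialization followed
# by iterative refinement of w under a VGG16 feature loss.
import torch

def embed_image(x, g, resnet_encoder, vgg_features, steps=500, lr=0.01):
    """Find w such that g(w) approximates the reference image x."""
    with torch.no_grad():
        target = vgg_features(x)      # perceptual target for the input image
        w = resnet_encoder(x)         # initial estimate w_0
    w = w.clone().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # loss is computed in VGG feature space; gradients flow through g
        loss = torch.nn.functional.mse_loss(vgg_features(g(w)), target)
        loss.backward()
        opt.step()
    return w.detach()
```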

III. MULTILINEAR MODEL

This section introduces τGAN, our latent space factorization method for GANs, which augments the StyleGAN synthesis network g with a multilinear tensor model. We do this by embedding a facial expression database into the auxiliary latent space W of StyleGAN. We then order the embedded database into a tensor, which we factorize into semantic subspaces. The resulting parameter space Q is the Cartesian product of the semantic subspaces, Q = Q_P × Q_E × Q_R, where Q_P is the person subspace, Q_E the expression subspace, and Q_R the rotation subspace. An overview of the different spaces and the operators that relate them is given in Fig. 4 and Tab. I.

Fig. 4: Overview of the different spaces and how the functions relate them, cf. Tab. I. The blue line indicates a manual change of one of the parameter vectors for the transfer of person, expression, or rotation.

TABLE I: Overview of the notation used in this work.

Symbol          Description
X               Image Space
Z               Latent Space
W               Auxiliary Latent Space
Q               Parameter Space

Operator        Name
f: Z → W        Mapping Network
g: W → X        Synthesis Network
g^{-1}: X → W   StyleGAN Embedder
τ^{-1}: W → Q   Parameter Estimator
τ: Q → W        Tensor Model

A. Tensor Factorization

The Higher-Order Singular Value Decomposition (HOSVD) is a generalization of the matrix SVD to higher-order tensors [7], [32], [11], [29], [28], [19]. The starting point for our analysis is a standardized data tensor T ∈ R^{N×P×E×R}, where N refers to the number of elements in the latent vector, P is the number of people, E the number of expressions, and R the number of viewpoints or rotations. Using the HOSVD, T can be factorized as

    T ≃ C ×_1 U_1 ×_2 U_2 ×_3 U_3 ×_4 U_4,    (1)

where ×_k denotes the k-way product, C ∈ R^{Ñ×P̃×Ẽ×R̃} is the core tensor, and U_1 ∈ R^{N×Ñ}, U_2 ∈ R^{P×P̃}, U_3 ∈ R^{E×Ẽ}, U_4 ∈ R^{R×R̃} are matrices with orthonormal columns constructed from the singular vectors of the k-mode matrix unfoldings of T. In general, Ñ ≤ N, P̃ ≤ P, Ẽ ≤ E, and R̃ ≤ R.
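The following is a minimal NumPy sketch of the truncated HOSVD in eq. (1); `tensorly.decomposition.tucker` would be a library alternative. The factor matrices are the leading left singular vectors of the mode-k unfoldings, and the core is obtained by projecting T onto them.

```python
import numpy as np

def unfold(T, mode):
    """Mode-k unfolding: move axis `mode` to the front, flatten the rest."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_product(T, M, mode):
    """k-way (mode-k) product T x_k M for a matrix M."""
    moved = np.moveaxis(T, mode, 0)
    out = (M @ moved.reshape(T.shape[mode], -1)).reshape((M.shape[0],) + moved.shape[1:])
    return np.moveaxis(out, 0, mode)

def hosvd(T, ranks):
    """Truncated HOSVD: T ~ C x_1 U_1 x_2 U_2 ... x_K U_K."""
    Us = [np.linalg.svd(unfold(T, k), full_matrices=False)[0][:, :r]
          for k, r in enumerate(ranks)]
    C = T
    for k, U in enumerate(Us):
        C = mode_product(C, U.T, k)   # project onto each mode's subspace
    return C, Us
```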

B. Multilinear Tensor Model for GANs

The HOSVD (1) factorizes the data tensor into a core tensor and a set of factor matrices U_i, one for each subspace. By selecting appropriate rows from U_i, i = 2, 3, 4, one normalized latent vector, i.e., a single mode-1 fiber of T, can be recovered. For example, to recover the latent vector of person p performing expression e with rotation r, the p-th row of U_2, the e-th row of U_3, and the r-th row of U_4 are selected. This can be conveniently formulated by a canonical basis, where the parameter vectors q′_2 ∈ R^P, q′_3 ∈ R^E, and q′_4 ∈ R^R pick a weighted linear combination of the rows of the U_i matrices. Therefore, a given latent code y′ can be approximated by the model prediction ŷ′ as

    ŷ′ = C ×_1 U_1 ×_2 q′_2^T U_2 ×_3 q′_3^T U_3 ×_4 q′_4^T U_4.    (2)

This expression can be further simplified by defining q_i^T ≡ q′_i^T U_i and, analogously, ŷ = U_1^T ŷ′. Applying ×_1 U_1^T to both sides of (2), and recalling that the columns of the respective U matrices are orthonormal, we can write the more compact model representation

    ŷ = C ×_2 q_2^T ×_3 q_3^T ×_4 q_4^T,    (3)

where the unprimed coordinates refer to the latent code in the eigenspace spanned by the columns of the U_i matrices. In this formulation, we have three individual parameter vectors and use repeated n-mode products to relate them to the model prediction.

We can rewrite (3) in a more general form to illustrate the mathematical structure of our model. Let us define the P×E×R, rank-1 parameter tensor Q = q_2 ⊗ q_3 ⊗ q_4, where ⊗ refers to the tensor product. The components of the rank-1 parameter tensor Q ∈ R^{P×E×R} are then given by Q_{νρλ} = q^{(2)}_ν q^{(3)}_ρ q^{(4)}_λ, where q^{(k)}_ν refers to the ν-th component of the subspace vector q_k ∈ Q_k for k = 2, 3, 4. With this definition, we can write (3) in a more compact and convenient representation using the Einstein summation convention,

    Ŷ_μ = C_{μνρλ} Q_{νρλ}.    (4)

This lets us write the latent code, in the auxiliary latent space W, as an application of the multilinear map, defined by the core tensor C, on the parameter tensor Q.

Our entire tensor model τ can thus be written as the composite map of the core C followed by the change-of-basis transformation defined by U_1: W → W, and the inverse standardization operator Ω^{-1}: W → W, where Ω^{-1} translates and scales a latent vector back to the original scale of the W space according to the mean and variance of the BU-3DFE data.
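As a sketch, eq. (4) is a single Einstein summation over the core, and the map back to W space chains the change of basis U_1 with Ω^{-1}. The helper `to_w_space` and the `mean`/`std` statistics are our names for these steps, assuming the standardization of Sec. IV-B.

```python
import numpy as np

def predict_latent(C, q2, q3, q4):
    """y_hat_mu = C_{mu nu rho lam} q2_nu q3_rho q4_lam, cf. eq. (4)."""
    return np.einsum('mnrl,n,r,l->m', C, q2, q3, q4)

def to_w_space(y_hat, U1, mean, std):
    """Change of basis by U_1, then undo the standardization (Omega^{-1})."""
    return (U1 @ y_hat) * std + mean
```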

C. Stacked Style-Separated Model

In addition to the previously presented model, we propose an alternative approach, where styles are separated instead of vectorizing the latent code. That is, we interpret the S styles of w as separate vectors of dimension L, as also indicated in Fig. 2. To separate the S styles, we propose to order the latent codes into the data tensor T_style ∈ R^{S×L×P×E×R}. The shape dimension can then be addressed separately by defining the style-specific tensors

    T_s ∈ R^{L×P×E×R},  s = 1, 2, ..., S.    (5)

We factorize each style-specific tensor T_s and define a style-specific tensor model τ_s. The ensemble of these models is referred to as the stacked style-separated model τ_S. While the vectorized model τ, based on T, has P + E + R parameters, τ_S has S(P + E + R) parameters, since it models each style separately.
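A hedged sketch of this construction, reusing `hosvd` from the sketch in Sec. III-A: one HOSVD model is fitted per style slice. The values S = 18 and L = 512 below are assumptions matching StyleGAN on FFHQ as used later in Sec. IV.

```python
import numpy as np

def fit_style_separated(T_style, ranks):
    """T_style: array of shape (S, L, P, E, R), e.g. S = 18, L = 512.
    Returns one (core, factor matrices) pair per style, cf. eq. (5)."""
    return [hosvd(T_style[s], ranks) for s in range(T_style.shape[0])]
```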

D. Optimization

Our next aim is to estimate the model parameters by constructing the estimator τ^{-1}: W → Q. The estimator is defined as the solution to the optimization problem

    min_Q ||ŷ − y||_2^2  subject to  ||q_i||_2^2 ≤ c_2 and ||U_i q_i||_1 ≤ c_1, for i = 2, 3, 4.    (6)

The form of (6) is inspired by [11], [12], and enforces constraints on the model parameters to retrieve a stable representation of new latent vectors by linear combinations within the training data. We regularize the model using Ridge and Lasso regression. The Lagrangian of the constrained problem (6) can then be written as

    L(Q, λ_1, λ_2) = ||ŷ − y||_2^2 + ∑_{k=2}^{4} ( λ_{2,k} ||q_k||_2^2 + λ_{1,k} ||q′_k||_1 ),    (7)

where λ_{1,k}, λ_{2,k} ≥ 0 refer to the Lasso and Ridge regularization parameters, respectively. Note that there is no prime on the Ridge term, since ||q′_i||_2^2 = (U_i^T q_i)^T (U_i^T q_i) = ||q_i||_2^2 because U_i^T U_i = I. We now present a strategy for solving the constrained optimization problem (6) by Alternating Least Squares (ALS).

As in [11], [12], the minimization can be solved by first rewriting (3) as a matrix-vector multiplication separately for each of the three model parameter vectors,

    ŷ = A(k) q_k,  k = 2, 3, 4,    (8)

where the matrices A(k) are given by

    A(2) = C ×_3 q_3^T ×_4 q_4^T,    (9)
    A(3) = C ×_2 q_2^T ×_4 q_4^T,    (10)
    A(4) = C ×_2 q_2^T ×_3 q_3^T.    (11)

Therefore, an unknown latent vector y can be estimated by alternating between the systems in (8), while updating the matrices A(k) in each step.
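The sketch below implements this ALS loop for a latent vector y expressed in the eigenspace coordinates of (3). Each step fixes two parameter vectors and solves a ridge-regularized least-squares system for the third. The L1 (Lasso) term of (7) is omitted for brevity (a proximal soft-thresholding step could handle it), and the uniform initialization is our assumption, not the authors' stated choice.

```python
import numpy as np

def als_estimate(C, y, lam2=1.0, iters=50):
    """Estimate q2, q3, q4 such that C x2 q2^T x3 q3^T x4 q4^T ~ y."""
    _, P, E, R = C.shape                  # truncated core dimensions
    q2, q3, q4 = np.ones(P) / P, np.ones(E) / E, np.ones(R) / R
    for _ in range(iters):
        # A(2) = C x3 q3^T x4 q4^T, cf. eq. (9); analogously for (10), (11)
        A2 = np.einsum('mnrl,r,l->mn', C, q3, q4)
        q2 = np.linalg.solve(A2.T @ A2 + lam2 * np.eye(P), A2.T @ y)
        A3 = np.einsum('mnrl,n,l->mr', C, q2, q4)
        q3 = np.linalg.solve(A3.T @ A3 + lam2 * np.eye(E), A3.T @ y)
        A4 = np.einsum('mnrl,n,r->ml', C, q2, q3)
        q4 = np.linalg.solve(A4.T @ A4 + lam2 * np.eye(R), A4.T @ y)
    return q2, q3, q4
```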

IV. EXPERIMENTS

In the following, we give additional details on the BU-3DFE database and then report our experimental results.

Fig. 5: Validation of the decomposition results: energy of the singular values for each mode of T.

A. Facial Expression Database

As mentioned in the introduction, we use the BU-3DFE

database [33]. The database contains 3D face scans and

images of 100 persons (56 female and 44 male), with varying

ages (18-70 years) and diverse ethnic/racial ancestries. Each

subject was asked to perform the six basic emotions: anger,

disgust, happiness, fear, sadness, and surprise, each with

four levels of intensity. Additionally, for each participant,

the neutral face was recorded. Hence, for each person, there

are a total of 25 facial expressions recorded from two pose

directions, left and right, resulting in 5000 face images.

B. Data Preprocessing

As a pre-processing step, we embedded each face image from the BU-3DFE database into the latent space of StyleGAN, as described in Sec. II-B. We collected the resulting latent vectors into the 4-way data tensor T_0 ∈ R^{N×P×E×R} and calculated its mode-1 unfolding T_0^{(1)} ∈ R^{N×PER}, which contains all PER latent vectors. We then standardized this matrix to zero mean and unit variance for each latent variable, and finally folded the standardized matrix back into an N×P×E×R tensor T, which we used for all subsequent experiments.
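A sketch of this step in NumPy; for a row-major array of shape (N, P, E, R), `reshape(N, -1)` is exactly the mode-1 unfolding. The returned statistics are what the inverse standardization Ω^{-1} needs later.

```python
import numpy as np

def standardize_tensor(T0):
    """T0 has shape (N, P, E, R); returns the standardized tensor T plus
    the per-variable statistics needed for the inverse map Omega^{-1}."""
    N = T0.shape[0]
    M = T0.reshape(N, -1)                       # mode-1 unfolding, N x PER
    mean = M.mean(axis=1, keepdims=True)
    std = M.std(axis=1, keepdims=True)
    T = ((M - mean) / std).reshape(T0.shape)    # fold back to N x P x E x R
    return T, mean.ravel(), std.ravel()
```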

C. Subspace Analysis

The standardized tensor T was factorized by the HOSVD, as described in (1), yielding the four subspaces spanned by the columns of U_k, k = 1, ..., 4. The distribution of the energy across the subspaces is shown in Fig. 5, which illustrates the compactness of the subspaces.

In Fig. 6 we show a visualization of the expression subspace. As an initial step, we truncated the expression subspace from 25 dimensions to three. It can be seen that, for each emotion, the variation in expression strength forms a linear trajectory in expression space. These trajectories are star-shaped and meet at an origin of expressions shared by all emotions. This is neither the neutral nor the mean face, but the "apathetic" face found in [11], [12], see Fig. 7(a)-(c). In this case, the apathetic face in Fig. 7(c) is closer to the mean face than the one in [11], [12], displayed in Fig. 7(f) for comparison.

D. Vectorized vs. Stacked Style-Separated Model

In Sec. III we proposed two versions of the tensor model: (1) the vectorized model, which flattens the latent code of each image and orders the codes into the tensor T ∈ R^{N×P×E×R}, and (2) the stacked style-separated model, based on T_style ∈ R^{S×L×P×E×R}, which considers the S = 18 styles of StyleGAN separately. We estimated the parameters of the two models using the ALS procedure (8). The results are illustrated in Fig. 8. It can be seen that the ground truth (Fig. 8a) is visually closer to the stacked style-separated model (Fig. 8c) than to the vectorized model (Fig. 8b), for test images from the BU-3DFE data set (top row) as well as for arbitrary images (2nd and 3rd rows). We conclude that the proposed adaptation to the separate styles improves performance.

Fig. 6: Projection of the expression subspace, defined by U_3, onto 3 dimensions.

Fig. 7: Synthesized faces for (a) the mean face, (b) the neutral face, and (c) the apathetic face. Accordingly, (d), (e), (f) show the 3D faces synthesized by the method in [11].

E. Validation of Regularization Parameters

The optimization problem defined in (7) contains six regularization parameters λ_{1,k} and λ_{2,k}, k = 2, 3, 4, two for each of the three parameter vectors, which must be set manually. In the following experiment we investigated how these hyperparameters influence the quality of the results, assuming that they are the same for all three parameter vectors, hence λ_1 = λ_{1,k} and λ_2 = λ_{2,k}. Here we used the vectorized model on the basis of the standardized latent codes in (3). Initially, we divided the data into training, validation, and test sets by a randomized 90–5–5 split over the P = 100 person identities; the validation set thus contained a total of 5ER = 250 samples. We estimated the tensor model on the training set. For each latent vector in the validation set we then estimated the subspace parameters q_i by ALS using (8).

Fig. 8: Reconstructions: (a) ground-truth images, and the results from (b) the vectorized model and (c) the style-separated model. The top row shows an example from the BU-3DFE database, while the 2nd and 3rd rows illustrate reconstructions of novel images which are not part of BU-3DFE.

We evaluated three kinds of errors on the validation set: the approximation error, and the expression and rotation transfer errors. The approximation error between the ground truth y and the estimated latent code ŷ is defined as ε_approx = ||ŷ − y||_2^2. The transfer errors result from exchanging estimated parameters q̂_k with known values q̃_k. Hence, using ỹ_expr ≡ τ(q̂_person ⊗ q̃_expr ⊗ q̂_rot) gives rise to the expression transfer error, which we define as ε_expr = ||ỹ_expr − y||_2^2. Analogously, the rotation transfer error is defined as the error arising from changing only the parameters associated with the rotation subspace, ε_rot = ||ỹ_rot − y||_2^2.
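A sketch of these three errors, reusing `predict_latent` from the eq. (4) sketch in Sec. III; the hatted q's are ALS estimates, while the tilded q's are the known ground-truth rows of U_3 and U_4 for the sample.

```python
import numpy as np

def validation_errors(C, y, q2_hat, q3_hat, q4_hat, q3_true, q4_true):
    """Approximation, expression-transfer, and rotation-transfer errors."""
    e_approx = np.sum((predict_latent(C, q2_hat, q3_hat, q4_hat) - y) ** 2)
    e_expr = np.sum((predict_latent(C, q2_hat, q3_true, q4_hat) - y) ** 2)
    e_rot = np.sum((predict_latent(C, q2_hat, q3_hat, q4_true) - y) ** 2)
    return e_approx, e_expr, e_rot
```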

The three error metrics ε_approx, ε_expr, and ε_rot were then calculated for each sample with varying hyperparameter values λ_1 and λ_2. In this experiment we investigated Lasso and Ridge regression independently, i.e., we set λ_1 = 0 while varying λ_2, and vice versa, and we restricted ourselves to cases where the regularization strength is equal for all subspaces. The results are illustrated in Fig. 9. In general, the approximation error is more stable than the other two errors. Fig. 9a suggests that high values of λ_1 should be chosen for rotation transfer, while for expression transfer λ_1 ≈ 1 seems a reasonable choice. Fig. 9b reveals that for λ_2 ≈ 1 all error metrics are small, hence this region is a good choice.

Fig. 9: Influence of the hyperparameters λ_1 and λ_2, steering (a) the Lasso (L1) and (b) the Ridge (L2) constraints, on (from top to bottom row) the approximation error, the expression transfer error, and the rotation transfer error.

F. Regularization and Parameter Transfer

We used the regularization parameters determined above to perform expression and rotation transfer on samples from the test set. We synthesized images from the estimated parameters by applying the composite transformation x̂ = g(τ(Q̂)) to the estimated subspace parameters Q̂. Additionally, we performed expression and rotation transfer by replacing one of the three estimated parameter vectors with known values, as described before. We did this for the regularized model (λ_1 > 0, λ_2 > 0) and the non-regularized model (λ_1 = λ_2 = 0). Fig. 10 shows how well the ground truth in W space (Fig. 10a) is approximated by the non-regularized solution (Fig. 10b) and the regularized solution (Fig. 10c). The non-regularized solution matches the ground truth slightly better with respect to approximation and expression transfer. However, for rotation transfer (Fig. 10e) the regularized solution clearly outperforms the non-regularized one: the non-regularized result is no longer recognizable as a face, while the regularized solution is not deformed and the rotation of the depicted faces conforms to the ground truth. This experiment thus shows that adding a small L2 regularization term yields stable rotation transfer.

Fig. 10: Reconstruction and regularization results. (a) Ground truth, (b) approximation by the non-regularized model, and (c) by the regularized model. (d, e) Results of expression and rotation transfer, showing the ground truth (top row), the non-regularized solutions (middle row), and the regularized solutions (bottom row).

Fig. 11: To find the optimal interpolation strength α for rotation transfer with InterFaceGAN [24] and GANSpace [15], we shift the latent code corresponding to an image from one rotation towards the other and compare the generated result with the ground truth.

G. Quantitative Comparison

Finally, we compare τGAN to InterFaceGAN [24] and GANSpace [15] on the task of semantic face editing, using rotation transfer as an example.

The BU-3DFE database [33], see Sec. IV-A, contains 5000 face images, 2500 from the left and 2500 from the right. We chose one of the two views as the reference image, and then used InterFaceGAN, GANSpace, and τGAN to estimate a reconstruction of the image from the complementary rotation. The resulting image was then compared to the Ground Truth (GT) by 1) the Pearson correlation coefficient (pcorr), 2) the Structural Similarity Index Measure (SSIM) [31], and 3) the Learned Perceptual Image Patch Similarity (LPIPS) [34]. For the LPIPS measure we employed two versions: one based on VGG [26], referred to as lpips-vgg, and the other, lpips-alex, based on AlexNet [20].
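A sketch of these four metrics using SciPy, scikit-image, and the `lpips` package; the preprocessing details (the [-1, 1] scaling expected by LPIPS, image ranges) follow the library conventions and are our assumptions, not taken from the paper.

```python
import numpy as np
import torch
import lpips
from scipy.stats import pearsonr
from skimage.metrics import structural_similarity

lpips_vgg = lpips.LPIPS(net='vgg')    # lpips-vgg
lpips_alex = lpips.LPIPS(net='alex')  # lpips-alex

def compare_to_gt(img, gt):
    """img, gt: HxWx3 float arrays in [0, 1]; returns the four metrics."""
    pcorr = pearsonr(img.ravel(), gt.ravel())[0]
    ssim = structural_similarity(img, gt, channel_axis=-1, data_range=1.0)
    # LPIPS expects NCHW tensors scaled to [-1, 1]
    to_t = lambda a: torch.from_numpy(a).permute(2, 0, 1)[None].float() * 2 - 1
    ta, tb = to_t(img), to_t(gt)
    return pcorr, ssim, lpips_vgg(ta, tb).item(), lpips_alex(ta, tb).item()
```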

In InterFaceGAN [24], the authors find semantic directions of StyleGAN by fitting SVMs to single semantic attributes using an annotated data set. Using these directions, semantic editing can be performed for a given latent code w ∈ W by interpolating along the direction n ∈ R^N defined by the SVM hyperplane normal vector,

    w_edit = w + αn,    (12)

where α is the strength of the shift in the semantic direction associated with n. To perform rotation transfer, we chose as n the pose direction provided by [24] for the StyleGAN1 model trained on FFHQ.

GANSpace finds semantic directions in an unsupervised fashion using PCA; the semantic meaning of the found principal components has to be assigned by a one-time manual labeling. In the paper the authors report that the 10th principal component, applied only to the first 7 layers, produces a shift in rotation for the pretrained StyleGAN1 network. Using this definition and the resulting rotation direction, we can perform semantic edits with GANSpace analogously to (12), as sketched below.
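A minimal sketch of eq. (12); the `layers` argument is our addition to cover the GANSpace case, where the shift is applied only to a subset of style layers of an (S, L) style matrix.

```python
import numpy as np

def semantic_edit(w, n, alpha, layers=None):
    """Return w_edit = w + alpha * n, optionally restricted to some layers."""
    w_edit = np.array(w, copy=True)
    if layers is None:
        w_edit += alpha * n
    else:
        w_edit[layers] += alpha * n    # e.g. layers = range(7) for GANSpace pose
    return w_edit
```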

To determine the optimal interpolation strength α for both methods, we designed an experiment where we perform rotation transfer with varying values of α. Starting from the latent code representing an image of one rotation, we edit the latent code towards the complementary rotation, resulting in a latent vector w_edit which is then used to synthesize an edited image. We then compare the edited image to the ground truth using the four metrics mentioned above, average each metric over the samples, and pick the α that minimizes it. The results are presented in Fig. 11: the best performance is reached at α = 2.77 for InterFaceGAN and at α = 1.66 for GANSpace. These values are used in the quantitative comparison presented in Fig. 13.

To perform rotation transfer with the τGAN model, we first estimated the model parameter vectors q̂_k, k = 2, 3, 4, for a given input image, as described in Sec. III-D. We then used the rotation subspace defined by U_4 in (1). For τGAN we take the subspace direction m = u_2^{(4)} − u_1^{(4)} ∈ Q_R, where u_1^{(4)} and u_2^{(4)} are the first and second rows of U_4, respectively. The rotation parameter is then changed as

    q̃_4 = q̂_4 + γm,    (13)

which yields the edited latent code

    w_{τ,edit} = τ(q̂_2 ⊗ q̂_3 ⊗ q̃_4).    (14)
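A sketch of eqs. (13)-(14), chaining the `predict_latent` and `to_w_space` helpers from the sketches in Sec. III to realize τ.

```python
import numpy as np

def tau_rotation_transfer(C, U1, U4, q2_hat, q3_hat, q4_hat, mean, std, gamma=1.0):
    """Shift the rotation parameters along m = u_2^(4) - u_1^(4) and re-synthesize."""
    m = U4[1] - U4[0]                  # rotation direction in Q_R
    q4_edit = q4_hat + gamma * m       # eq. (13)
    y_hat = predict_latent(C, q2_hat, q3_hat, q4_edit)
    return to_w_space(y_hat, U1, mean, std)   # eq. (14): w_tau,edit
```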

Fig. 12 shows synthesized images produced by InterFaceGAN, GANSpace, and τGAN, respectively. These are compared against the reconstructions generated by latent codes interpolated directly in W space as w = βw_left + (1 − β)w_right, where w_left and w_right refer to the left and right rotation, respectively. The results show that τGAN provides an alternative way of generating rotations in the StyleGAN latent space. Compared to InterFaceGAN, our model seems to create rotations that better preserve features like skin tone and gaze direction, and compared to GANSpace, the face shape seems better preserved. However, for all methods we note that the identity of the person changes slightly in this example.

Fig. 12: Comparison of rotation transfer among the different methods. The ground-truth images in pixel space are shown in the top row in the outermost columns. We use the latent code corresponding to the left-hand rotation (top left) and try to recover the right-hand rotation (top right). The images were created by (a) direct interpolation, (b) InterFaceGAN, (c) GANSpace, and (d) our proposed τGAN.

Additionally, we objectively compare the quality of rotation transfer across the different methods as follows. We apply the three previously introduced methods, InterFaceGAN, GANSpace, and our proposed τGAN, to shift the rotation of the 125 left-oriented images in the validation set towards the right orientation. We then compare the edited images to the known ground truth using the same four metrics introduced at the beginning of this section. The results in Fig. 13 show that τGAN has the lowest median value for all metrics when compared with InterFaceGAN and GANSpace.

V. CONCLUSIONS

In this work, we proposed τGAN, a tensor-based model for the auxiliary latent space of StyleGAN. It is constructed by first embedding the images of the BU-3DFE database into the latent space of StyleGAN. The latent codes are stored in a tensor, which is then factorized into semantically meaningful subspaces by the HOSVD. This construction ensures that the semantic directions are directly interpretable, in contrast to unsupervised methods, where this is not always the case.

Fig. 13: Quantitative comparison of rotation transfer performed by the different methods. We start with images from the left rotation and shift the latent codes towards the right rotation using τGAN, InterFaceGAN, and GANSpace. The edited images are then compared to the GT based on the previously used metrics, adapted so that lower is better. We observe that the edited images produced by τGAN are more similar to the GT across all four metrics.

We generalized previous results on face analysis [11] by showing that the expression subspace is structured such that the expression trajectories meet in a specific apathetic expression, which differs from both the mean and the neutral face. We evaluated our approach quantitatively and qualitatively, compared different versions of the proposed tensor models on the basis of the approximation of unseen samples, and demonstrated the stability of expression and rotation transfer. From the results, we conclude that the proposed approach is a powerful way of characterizing and parameterizing the latent space of StyleGAN.

The current setting assumes complete data, i.e., measurements of all persons performing the same expressions from each rotation, without any missing data. This requirement could be relaxed by low-rank completion methods, which is left for future work. To conclude, we employed a model trained on FFHQ and obtained promising results on the BU-3DFE data set.

REFERENCES

[1] R. Abdal, Y. Qin, and P. Wonka. Image2StyleGAN: How to embed images into the StyleGAN latent space? In Proc. ICCV, pages 4431–4440, 2019.
[2] P. Baylies. StyleGAN encoder – converts real images to latent space. https://github.com/pbaylies/stylegan-encoder/, 2019.
[3] V. Blanz and T. Vetter. A morphable model for the synthesis of 3D faces. In Proc. SIGGRAPH, pages 187–194, 1999.
[4] A. Brunton, T. Bolkart, and S. Wuhrer. Multilinear wavelets: A statistical shape space for human faces. In Proc. ECCV, pages 297–312, 2014.
[5] C. Cao, Y. Weng, S. Zhou, Y. Tong, and K. Zhou. FaceWarehouse: A 3D facial expression database for visual computing. IEEE Transactions on Visualization and Computer Graphics, 20(3):413–425, 2014.
[6] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Proc. NIPS, pages 2172–2180, 2016.
[7] L. De Lathauwer, B. De Moor, and J. Vandewalle. A multilinear singular value decomposition. SIAM Journal on Matrix Analysis and Applications, 21(4):1253–1278, 2000.
[8] Y. Deng, J. Yang, D. Chen, F. Wen, and X. Tong. Disentangled and controllable face image generation via 3D imitative-contrastive learning. In Proc. CVPR, 2020.
[9] C. Ferrari, G. Lisanti, S. Berretti, and A. D. Bimbo. A dictionary learning-based 3D morphable shape model. IEEE Transactions on Multimedia, 19(12):2666–2679, 2017.
[10] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Proc. NIPS, pages 2672–2680, 2014.
[11] S. Graßhof, H. Ackermann, S. Brandt, and J. Ostermann. Apathy is the root of all expressions. In Proc. 12th IEEE Conference on Automatic Face and Gesture Recognition (FG2017), 2017.
[12] S. Graßhof, H. Ackermann, S. S. Brandt, and J. Ostermann. Multilinear modelling of faces and expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(10):3540–3554, 2021.
[13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. CVPR, pages 770–778, 2016.
[14] X. Huang and S. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In Proc. ICCV, 2017.
[15] E. Härkönen, A. Hertzmann, J. Lehtinen, and S. Paris. GANSpace: Discovering interpretable GAN controls. In Proc. NeurIPS, 2020.
[16] T. Karras, M. Aittala, J. Hellsten, S. Laine, J. Lehtinen, and T. Aila. Training generative adversarial networks with limited data. In Proc. NeurIPS, 2020.
[17] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. In Proc. CVPR, pages 4396–4405, 2019.
[18] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila. Analyzing and improving the image quality of StyleGAN. In Proc. CVPR, 2020.
[19] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.
[20] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proc. NIPS, pages 1097–1105, 2012.
[21] D. Nikitko. StyleGAN – encoder for official TensorFlow implementation. https://github.com/puzer/stylegan-encoder/, 2019.
[22] D. Y. Park and K. H. Lee. Arbitrary style transfer with style-attentional networks. In Proc. CVPR, pages 5873–5881, 2019.
[23] S. Pidhorskyi, D. A. Adjeroh, and G. Doretto. Adversarial latent autoencoders. In Proc. CVPR, 2020.
[24] Y. Shen, J. Gu, X. Tang, and B. Zhou. Interpreting the latent space of GANs for semantic face editing. In Proc. CVPR, 2020.
[25] Y. Shen and B. Zhou. Closed-form factorization of latent semantics in GANs. In Proc. CVPR, 2021.
[26] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[27] A. Tewari, M. Elgharib, G. Bharaj, F. Bernard, H.-P. Seidel, P. Pérez, M. Zöllhofer, and C. Theobalt. StyleRig: Rigging StyleGAN for 3D control over portrait images. In Proc. CVPR, 2020.
[28] M. A. O. Vasilescu and D. Terzopoulos. Multilinear analysis of image ensembles: TensorFaces. In Proc. ECCV, pages 447–460, 2002.
[29] M. A. O. Vasilescu and D. Terzopoulos. Multilinear subspace analysis of image ensembles. In Proc. CVPR, volume 2, pages II-93, 2003.
[30] D. Vlasic, M. Brand, H. Pfister, and J. Popović. Face transfer with multilinear models. In ACM SIGGRAPH, pages 426–433, 2005.
[31] Z. Wang and A. Bovik. A universal image quality index. IEEE Signal Processing Letters, 9(3):81–84, 2002.
[32] K. Yano and K. Harada. Multilinear face model. In Visualization, Imaging, and Image Processing (VIIP 2008), 2008.
[33] L. Yin, X. Wei, Y. Sun, J. Wang, and M. Rosato. A 3D facial expression database for facial behavior research. In Proc. 7th International Conference on Automatic Face and Gesture Recognition (FGR06), pages 211–216, 2006.
[34] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proc. CVPR, 2018.