Texture improvement for human shape estimation from a single image
Jorge González Escribano, Susana Ruano, Archana Swaminathan, David Smyth, and Aljosa Smolic
V-SENSE, Trinity College Dublin, Dublin, Ireland
Abstract
Current human digitization techniques from a single image show promising results in terms of the quality of the estimated geometry, but they often fall short on the texture of the generated 3D model, especially on the occluded side of the person, while some do not output a texture at all. Our goal in this paper is to improve the predicted texture of these models without requiring any additional input beyond the original image used to generate the 3D model in the first place. To that end, we propose a novel way to predict the back view of the person by including semantic and positional information, which outperforms state-of-the-art techniques. Our method is based on a general-purpose image-to-image translation algorithm with conditional adversarial networks, adapted to predict the back view of a human. Furthermore, we use the predicted image to improve the texture of the estimated 3D model, and we provide a 3D dataset, V-Human, to train our method as well as any 3D human shape estimation algorithm that uses meshes, such as PIFu.
Keywords: human shape estimation, neural networks, dataset
1 Introduction
Figure 1: Examples from V-Human dataset
Human shape estimation has traditionally been tackled with classic computer vision techniques which require multiple viewpoints as input. Methods to reduce the number of viewpoints needed have been explored, and nowadays machine learning techniques can use a single viewpoint to predict the shape and color of a person. One of the most successful approaches makes use of implicit functions to represent a surface, which is much more efficient in terms of the memory needed to store the 3D asset than others such as voxel-based representations. One of the most relevant techniques, which has been considered the baseline for many others, is PIFu [Saito et al., 2019]. Nevertheless, although this method has been improved in many works [Saito et al., 2020, Huang et al., 2020, He et al., 2020, Hong et al., 2021], the majority of them focus on improving only the shape, whereas the color and appearance of the reconstructed model are not taken into account. Deep learning techniques for tasks involving 3D data require copious amounts of 3D data for training, especially if they are supervised [Zolanvari et al., 2019, Ruano and Smolic, 2021]. The most popular datasets typically used to train learning-based 3D human reconstruction methods requiring images as input are RenderPeople and Twindom. These datasets contain 3D scans of people of different ethnicities, wearing different fashion styles, with a substantial level of detail. However, these are expensive commercial datasets and therefore their accessibility is limited to companies that have the financial resources to purchase them [Zhang et al., 2021]. Consequently, researchers have created other databases of 3D human models to train deep learning methods [Zheng et al., 2019, Zhang et al., 2017, Yu et al., 2021, Pumarola et al., 2019, Gabeur et al., 2019, Caliskan et al., 2020]. However, these datasets are limited in the quality of the models and the number of poses because of the effort and equipment needed for the capture and preparation of the data.
Our contribution in this paper is a method that improves the color prediction of 3D reconstruction methods, needs only a single image as input, and does not rely on parametric models. The novelty is the use of semantic information and UV positional information to predict the back view of the person. We show how it outperforms state-of-the-art methods and how it also improves commercial solutions. Furthermore, we contribute the creation and release of a freely available, synthetically generated dataset of 3D human models (samples shown in Fig. 1) called V-Human, which is used to train our method but can also be used to train other deep learning reconstruction methods that use meshes for training.
2 Related work
Human shape and color estimation. Human shape estimation has been widely studied in the literature; classic techniques to solve 3D reconstruction problems require a large number of viewpoints, but nowadays there are methods that use a single image as input. Initial approaches for estimating the 3D shape of a human
from a single image used a parametric model of the body [Loper et al., 2015]. The main drawback of these
techniques is that the model represents a naked human. Although these parametric approaches have been proven
to be effective strategies for capturing motion with accurate body proportions, they cannot handle clothes and
props. Other methods such as BodyNet [Varol et al., 2018] use voxel-based approaches but a well-known
disadvantage of this data format is the high storage requirement necessary to capture fine details. In contrast
to volumetric approaches, implicit functions are a memory efficient way to represent a surface since there is no
need to store the space in which the surface is enclosed. PIFu [Saito et al., 2019] is one of the first methods
that successfully reconstructed humans from a single image with a pixel-aligned implicit function strategy.
They not only provide a solution for the shape but also a method to predict the colors on the surface of the
geometry. However, the color estimation comes at a price: not only the surface is considered but also a relatively small 3D space around it. Consequently, it has advantages for estimating color in the occluded parts, but it does not allow for sharp definition. PIFuHD [Saito et al., 2020] and ARCH [Huang et al., 2020]
are extensions of this work. The former provides a more detailed reconstruction due to a multi-level architecture
but no color estimation is provided. The latter produces animatable reconstructions by incorporating body
semantic knowledge and takes into account the color estimation but the approach is similar to PIFu. Many other
techniques build upon the PIFu baseline such as Geo-PIFu [He et al., 2020] and StereoPIFu [Hong et al., 2021]
but they do not consider the color estimation. DIMNet [Zhang et al., 2021] improves PIFu’s sampling strategy
but it uses several views to perform feature fusion and improve the color estimation.
Training datasets. Learning-based techniques are heavily dependent on the data used for the training,
and estimating 3D human shapes is especially data demanding. There are several commercial datasets that are used in human shape estimation papers: RenderPeople, Twindom and AXYZ. Those datasets contain a great variety of human scans, with people of different ethnicities wearing a broad variety of clothes and hair styles and acting in natural poses. Around 500 models from RenderPeople were used in PIFu [Saito et al., 2019] and PIFuHD [Saito et al., 2020]. More than double that number of scans (in particular, 1016) are used in [Zins et al., 2021]. Scans from Twindom are also used to train many algorithms [Chibane et al., 2020, Zheng et al., 2021] (1600 and 1700 models, respectively).
A notable drawback of these datasets is that they are not available for many researchers because they are
commercial and financially expensive, as noted in [Zhang et al., 2021]. Consequently, the creation of freely
available datasets has been of interest to the wider research community. THuman [Zheng et al., 2019] has
been used in many published works [Zhang et al., 2021, He et al., 2020]. It has approximately 7000 3D scans of 100 different subjects. The meshes lack detail, so despite the large number of models, the accuracy of the reconstruction will be limited if this dataset is used for training. A new version, the THuman2.0 dataset [Yu et al., 2021], was recently released and contains 500 high-resolution scans of people. In comparison with the first version, the quality is improved but the variety of poses is drastically reduced.
Figure 2: Pipeline overview
The MGN dataset [Bhatnagar et al., 2019] contains 96 models with segmented clothes and is
also used in [Zhang et al., 2021], but they complement the training set with other models. 3D-HUMANS
[Gabeur et al., 2019] also contains scans of people performing different activities, but it is limited to 21 subjects. 3Dpeople [Pumarola et al., 2019] has 80 people performing 70 actions, but the 3D meshes are not shared due to copyright reasons. Finally, 3DVH [Caliskan et al., 2020] is a synthetic dataset, but it is focused on providing a set of realistic renderings of the models with different backgrounds. The number of 3D models is not specified, and the dataset is still to be released.
3 Proposed method
Our goal in this paper is to improve the color estimation of PIFu without requiring any additional information
and we show an overview of our system in Figure 2. In particular, we develop a method which uses the same
input as in PIFu to predict the 2D back-view of the person (in blue), and then we use this prediction along with
the 3D model generated with PIFu (in yellow) to improve its texture (in green). We first describe in Section 3.1
our novel strategy to predict the 2D back-view; then, in Section 3.2 we present how to improve the 3D model
texture with the predicted occluded side and finally, in Section 3.3 we describe the dataset used for training.
3.1 2D back-view prediction
Our idea is inspired by the work in [Natsume et al., 2019], where the back view of the person is predicted in 2D space, allowing for more detailed information about the clothed human. We observed that the silhouette of the person is the same in the front view and the back view, so we can apply the same idea of using an image-to-image translation method. The difference is that we enrich the information used for training. We examined the results from PIFu and observed that the parts corresponding to the back of the person were the most difficult to predict, and this seemed to be related to where the extremities were located or to the type of clothes being worn. Consequently, we hypothesized that having some type of semantic information during training could help to better predict the back side of the person. Also, to enforce a stronger specialization of a generic image-to-image translation algorithm towards predicting the occluded side of a human, we added positional information as input. Therefore, we can train the algorithm with the mapping between the pixels of the RGB images and the 3D surface of the body.
As the base neural model for our experiments, we chose pix2pix [Isola et al., 2017], an image-to-image
translation cGAN that has been proven to perform well in a wide variety of image translation problems. The
reason behind this choice of architecture is that generating the back side of a 3D model from its front side image
is best framed as a conditional GAN problem, as the predicted texture is directly conditioned by the texture of
the model on the opposite side, and the versatility of this network made it the best candidate. As semantic
information, we use clothing segmentation data inferred by the neural network presented in [Li et al., 2020], encoded as an RGB image. More specifically, we use an implementation of this model trained with the ATR dataset, which includes 18 labels for the different garments that the person in the input image may be wearing. To the output of the segmentation network we add a mask of the silhouette of the person in the picture as the background, so that any portion of the input image the segmentation process fails to recognise is not shown as empty. An example of the semantic input can be seen in the blue box in Figure 2 (middle). As positional information, we use the output of the DensePose neural network [Güler et al., 2018], which consists of the estimated UV coordinates (2D mapped texture coordinates) of the person shown in an image, encoded using the default 'rainbow' encoding, which uses the full RGB spectrum to represent these coordinates as a color image. An example of the positional information can be seen in Figure 2 (right part inside the blue box).
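As an illustration of how these three cues could be combined into a single conditional input for pix2pix, the sketch below stacks the front RGB image, the clothing-segmentation map and the DensePose UV rendering channel-wise. The file names and the 9-channel layout are assumptions for illustration; the paper does not specify the exact tensor format fed to the network.

```python
import numpy as np
from PIL import Image

# Sketch only: load the three aligned RGB cues described above.
# The file names are hypothetical placeholders and the images are
# assumed to share the same resolution.
def load_rgb(path):
    return np.asarray(Image.open(path).convert("RGB"), dtype=np.float32) / 255.0

rgb = load_rgb("front_rgb.png")        # original front view
seg = load_rgb("clothing_seg.png")     # clothing segmentation + silhouette background
uv  = load_rgb("densepose_uv.png")     # DensePose UV coordinates, 'rainbow' encoded

# Stack the cues channel-wise so the generator sees a 9-channel conditional
# image instead of the usual 3-channel pix2pix input (an assumed layout).
conditional_input = np.concatenate([rgb, seg, uv], axis=-1)  # H x W x 9
```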
3.2 3D texture improvement
We are able to achieve a higher quality texture on 3D models by combining the original input view, our gener-
ated back-side view and the 3D textured model output from a neural model such as PIFu. First, we perform the
orthogonal projection of the input image onto the 3D model from the front, aligning the 3D model exactly with
the texture. Then, we perform this same step using our generated back view texture from the exact opposite
angle. After performing this step, we end up with a high resolution texture on the parts of the 3D model that
are on the line of sight of the projection, but with the occluded parts showing either no texture or the texture
belonging to the occluding geometry. To solve this issue, we perform occlusion detection to find those vertices
which are not in the line of sight in either the front or the back view. We locate these vertices on the UV
projection and find the triangles that they form on it, in order to create an occlusion mask. We use the occlusion
mask to show the color-per-vertex from PIFu on the masked pixels, which correspond to those occluded by the model itself, while showing either the front or the generated back texture on the unmasked pixels. This way, we show the higher-quality texture on the pixels for which it is available, while using the lower-quality but always available color-per-vertex on those occluded from the camera.
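A minimal sketch of the per-vertex occlusion test described above is shown below, assuming the reconstructed mesh is loaded with the trimesh library and that the front and back views correspond to orthographic cameras along the Z axis. The file name, the epsilon offset and the use of trimesh are assumptions, not part of the original implementation.

```python
import numpy as np
import trimesh

# Sketch of the occlusion detection step, under the assumptions stated
# above. Vertices hidden from both the front and the back view fall back
# to PIFu's per-vertex color; the rest take the projected textures.
mesh = trimesh.load("pifu_output.obj", process=False)

# Offset ray origins slightly along the vertex normals to avoid hitting
# the triangles the vertex itself belongs to.
origins = mesh.vertices + mesh.vertex_normals * 1e-4
to_front = np.tile([0.0, 0.0, 1.0], (len(origins), 1))   # towards the front camera
to_back = -to_front                                       # towards the back camera

ray = trimesh.ray.ray_triangle.RayMeshIntersector(mesh)
blocked_front = ray.intersects_any(origins, to_front)
blocked_back = ray.intersects_any(origins, to_back)

# Occlusion mask: True where neither the front image nor the generated
# back view can provide a texture for the vertex.
occluded_both = blocked_front & blocked_back
```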
3.3 V-Human dataset
In order to alleviate the cost and quality issues of existing commercial and freely available datasets, respectively, we have created a dataset from synthetic models suitable for training deep learning algorithms for 3D human reconstruction from images. To prepare the dataset we used the fully rigged avatars from the Microsoft Rocketbox library [Gonzalez-Franco et al., 2020], re-targeted Mixamo animations to them, and refined them to make them suitable as direct input for learning techniques with implicit functions. Following this procedure we created our dataset, V-Human, which consists of a collection of 1620 models. These models were created with the 90 Microsoft Rocketbox avatars and, to increase the variety of poses included in the dataset, we do not always use all the frames of a single animation. Instead, we select a varying number of frames, adapted to the pace of the action represented in the animation; thus, we avoid having very similar poses if a particular action is almost static. Each avatar adopts 18 different poses, which makes a total of 1620 unique poses in the dataset.
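The paper does not publish its exact frame-selection rule, but one plausible reading of the pace-adaptive selection above is to keep a frame only when the pose has changed enough since the last kept frame. The sketch below illustrates that idea; the joint-position array and the displacement threshold are hypothetical.

```python
import numpy as np

# Hypothetical pace-adaptive frame selection: keep a frame only when its
# joints have moved enough (on average) since the last selected frame, so
# near-static actions contribute fewer, more distinct poses.
def select_frames(joint_positions: np.ndarray, min_displacement: float = 0.05) -> list:
    """joint_positions: array of shape (num_frames, num_joints, 3)."""
    selected = [0]
    for f in range(1, len(joint_positions)):
        delta = np.linalg.norm(
            joint_positions[f] - joint_positions[selected[-1]], axis=-1
        ).mean()
        if delta >= min_displacement:
            selected.append(f)
    return selected
```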
4 Experiments
We design two different kinds of experiments to test the performance of our texture method. The first has associated ground-truth data, and the second shows the performance in a realistic environment.
                 V-Human         RenderPeople    Volograms
                 mean   median   mean   median   mean   median
LSGAN            76.1   82.3     65.0   69.5     73.7   80.8
WGAN-GP          71.8   80.6     62.0   69.1     72.8   79.4
128 filters      75.3   81.5     60.2   68.0     70.4   79.1
PatchGAN 9       76.2   81.5     67.4   74.4     76.8   81.6
PIFu             52.8   56.0     54.1   57.2     58.2   63.1
PIFu retrained   50.5   51.7     42.0   39.3     48.4   53.2
Table 1: Results of the segmentation experiment, in % of correctly classified pixels (4000 epochs)
Experiments with ground-truth. First, we evaluate the quality of the estimated back side as an image and we compare our results with PIFu and with PIFu retrained on V-Human (18 epochs for the shape training and 6 epochs for the color training). Furthermore, we perform an ablation study to fine-tune our pix2pix model by using four different variants: the first one uses the default parameters of pix2pix (LSGAN), the second one uses the WGAN-GP [Gulrajani et al., 2017] loss instead of LSGAN (WGAN-GP), the third one has 128 filters in the last layer of both the generator and the discriminator instead of 64 (128 filters), and the fourth one changes the number of layers in the PatchGAN discriminator from 3 to 9 (PatchGAN 9). We train the models for 4000 epochs.
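For reference, the WGAN-GP variant replaces the least-squares adversarial loss with a Wasserstein critic plus the gradient penalty of [Gulrajani et al., 2017]; a minimal PyTorch sketch of that penalty is shown below. The discriminator interface is simplified for illustration; in the conditional pix2pix setting it would also receive the input image.

```python
import torch

# Sketch of the WGAN-GP gradient penalty [Gulrajani et al., 2017] used in
# the second ablation variant; the discriminator call signature here is a
# simplified assumption (an unconditional critic on image batches).
def gradient_penalty(discriminator, real, fake, lambda_gp=10.0):
    # Random interpolation between real and generated samples.
    alpha = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    interp = (alpha * real + (1.0 - alpha) * fake.detach()).requires_grad_(True)
    scores = discriminator(interp)
    # Gradient of the critic score with respect to the interpolated input.
    grads, = torch.autograd.grad(outputs=scores.sum(), inputs=interp,
                                 create_graph=True)
    grads = grads.reshape(grads.size(0), -1)
    # Penalize deviation of the gradient norm from 1.
    return lambda_gp * ((grads.norm(2, dim=1) - 1.0) ** 2).mean()
```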
Evaluation metric. As quality metric, inspired by [Isola et al., 2017], we use a clothing segmentation model
to classify each pixel of the ground-truth and the predicted image. Then, we calculate the percentage of pixels
correctly classified, i.e., those assigned the same feature type (e.g., shirt, trousers, hat, or skin for no clothing) in the predicted and ground-truth images. We discard the background pixels.
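A straightforward way to compute this metric, assuming the clothing-segmentation model has already produced integer label maps for both images (with 0 used for background), is sketched below; the label convention is an assumption for illustration.

```python
import numpy as np

# Sketch of the pixel-agreement metric described above. Assumes integer
# label maps from the clothing-segmentation model, with 0 = background.
def segmentation_agreement(pred_labels: np.ndarray, gt_labels: np.ndarray) -> float:
    """Percentage of non-background pixels assigned the same garment label."""
    valid = gt_labels != 0                       # discard background pixels
    correct = (pred_labels == gt_labels) & valid
    return 100.0 * correct.sum() / valid.sum()
```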
Training and test sets. On the one hand, as training set we use 1458 models from V-Human (90%), leaving 10% aside for testing purposes, as is done in [Saito et al., 2019, Saito et al., 2020]. The training models correspond to 81 different identities with 18 different poses each. On the other hand, for testing we use data from
three different sources: our V-Human dataset, RenderPeople and Volograms. The nine subjects left from the
complete V-Human dataset were used to create the test set, which has 162 models with 18 different poses per
subject. We used nine rigged RenderPeople avatars and created a dataset of 162 posed models following a
similar pipeline with Mixamo animations. Finally, we used 162 models from nine different volumetric video
sequences captured by Volograms using their studio technology [Pagés et al., 2021].
Experiments in the wild. We also explore how our method performs in a realistic environment. For that, we apply our 2D back-view prediction to images captured with a smartphone. This situation is different from the previous experiments, whose inputs are weak-perspective renderings of 3D models rather than images from a standard camera. Furthermore, we qualitatively compare the result with the back view of the 3D textured model created with Volograms' mobile technology (www.volograms.com).
5 Results
The results from the image segmentation experiments are shown in Table 1. We can see that all of our models
used in the ablation study consistently outperform PIFu, with the best one achieving 24% better mean accuracy and 26% better median accuracy than PIFu. Of the four variants tested, the one using 9 layers in the PatchGAN discriminator (PatchGAN 9) achieves the best results on all of the datasets.
Figure 3: Texture results comparison. On the left, wrinkle texture closeup: ground truth, PIFu, ours. On the right, example results comparison: PIFu (top row), ours (bottom row)
Figure 4: 3D result example. From left to right: our texture, PIFu texture, combined texture, PIFu back view, our back view
In the left part of Figure 3 we show a closeup of a ground-truth texture and the corresponding predictions made by PIFu and our method. We can clearly see that PIFu produces no wrinkles while our method does, increasing the perceived quality, although the wrinkles do not match those of the ground truth. Clothing wrinkles can be seen as a type of "noise" because it is very unlikely that estimated and ground-truth wrinkles match; therefore, PSNR and SSIM are not adequate evaluation metrics. Nevertheless, the wrinkles on the clothes give the result a more textile and realistic appearance, whereas the texture of the clothes generated by PIFu looks a little more like modeling clay, as it does not preserve high-frequency details. On the right part of Figure 3, some more example comparisons between PIFu and our method can be seen.
Regarding the results of applying our full pipeline to improve 3D model textures, Figure 4 shows a 3D model of a person holding their hand in front of the chest, occluding some of the clothing, viewed from a perspective that shows this occluded part. The first image on the left shows the high-resolution texture generated using our approach, but with the hand texture projected onto the occluded part of the clothing. In the next image, the 3D model is textured using the output from PIFu, which has some noticeable artifacts, and right next to it is the combined texture from our method, which is free of them.
Results in the wild. We show in Figure 5 an example of the experiments done with images captured with a smartphone. In particular, Figure 5a shows the input to our system, which is a picture taken with a smartphone with the background removed, and Figure 5b shows the output, our predicted back view. We do not have information about how the person looks from the back but, as we can observe, the predicted back view is quite plausible. In particular, the prediction is very good for the jeans, because it creates wrinkles that are quite credible. Furthermore, we can compare the prediction with the results obtained with Volograms' mobile technology. Although their predicted 3D model is very sharp in the front view, the back part, shown in Figure 5c, is not. We can see that our 2D prediction is much more detailed, so it could potentially be used to improve that view.
6 Conclusion
Figure 5: Qualitative results in the wild. (a) input image, (b) 2D prediction with our method, (c) back of the 3D model with Volograms' mobile technology
In this paper we have presented a new method for improving the texture of 3D models predicted from a single image of a person. We presented a strategy to predict the 2D back view of a person which includes semantic information about the person and their clothes along with positional information. Moreover, we show how this result can be incorporated into the 3D model output of available systems, demonstrating how we can improve state-of-the-art solutions by generating a sharper prediction. Furthermore, we provide an open-source dataset to train the model, which is also suitable for training deep learning algorithms that use 3D meshes as input. Finally, we show how our method helps to improve existing commercial solutions in natural environments.
Acknowledgments
This publication has emanated from research conducted with the financial support of Science Foundation Ire-
land (SFI) under the Grant Number 15/RP/2776. We thank Volograms for providing their data.
References
[Bhatnagar et al., 2019] Bhatnagar, B. L., Tiwari, G., Theobalt, C., and Pons-Moll, G. (2019). Multi-garment
net: Learning to dress 3d people from images. In ICCV, pages 5420–5430.
[Caliskan et al., 2020] Caliskan, A., Mustafa, A., Imre, E., and Hilton, A. (2020). Multi-view consistency loss
for improved single-image 3d reconstruction of clothed people. In ACCV.
[Chibane et al., 2020] Chibane, J., Alldieck, T., and Pons-Moll, G. (2020). Implicit functions in feature space
for 3d shape reconstruction and completion. In CVPR, pages 6970–6981.
[Gabeur et al., 2019] Gabeur, V., Franco, J.-S., Martin, X., Schmid, C., and Rogez, G. (2019). Moulding
humans: Non-parametric 3d human shape estimation from single images. In ICCV, pages 2232–2241.
[Gonzalez-Franco et al., 2020] Gonzalez-Franco, M., Ofek, E., Pan, Y., Antley, A., Steed, A., Spanlang, B.,
Maselli, A., Banakou, D., Pelechano Gómez, N., Orts-Escolano, S., et al. (2020). The rocketbox library and
the utility of freely available rigged avatars. Frontiers in virtual reality, 1(article 561558):1–23.
[Güler et al., 2018] Güler, R. A., Neverova, N., and Kokkinos, I. (2018). Densepose: Dense human pose
estimation in the wild. In CVPR, pages 7297–7306.
[Gulrajani et al., 2017] Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. (2017).
Improved training of wasserstein gans. CoRR, abs/1704.00028.
[He et al., 2020] He, T., Collomosse, J., Jin, H., and Soatto, S. (2020). Geo-pifu: Geometry and pixel aligned
implicit functions for single-view human reconstruction. arXiv preprint arXiv:2006.08072.
[Hong et al., 2021] Hong, Y., Zhang, J., Jiang, B., Guo, Y., Liu, L., and Bao, H. (2021). Stereopifu: Depth
aware clothed human digitization via stereo vision. In CVPR, pages 535–545.
[Huang et al., 2020] Huang, Z., Xu, Y., Lassner, C., Li, H., and Tung, T. (2020). Arch: Animatable reconstruc-
tion of clothed humans. In CVPR, pages 3093–3102.
[Isola et al., 2017] Isola, P., Zhu, J.-Y., Zhou, T., and Efros, A. A. (2017). Image-to-image translation with
conditional adversarial networks. In CVPR, pages 1125–1134.
[Li et al., 2020] Li, P., Xu, Y., Wei, Y., and Yang, Y. (2020). Self-correction for human parsing. TPAMI.
[Loper et al., 2015] Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., and Black, M. J. (2015). Smpl: A
skinned multi-person linear model. ACM transactions on graphics (TOG), 34(6):1–16.
[Natsume et al., 2019] Natsume, R., Saito, S., Huang, Z., Chen, W., Ma, C., Li, H., and Morishima, S. (2019).
Siclope: Silhouette-based clothed people. In CVPR, pages 4480–4490.
[Pagés et al., 2021] Pagés, R., Zerman, E., Amplianitis, K., Ondřej, J., and Smolic, A. (2021). Volograms & V-SENSE Volumetric Video Dataset. ISO/IEC JTC1/SC29/WG07 MPEG2021/m56767.
[Pumarola et al., 2019] Pumarola, A., Sanchez-Riera, J., Choi, G., Sanfeliu, A., and Moreno-Noguer, F.
(2019). 3dpeople: Modeling the geometry of dressed humans. In ICCV, pages 2242–2251.
[Ruano and Smolic, 2021] Ruano, S. and Smolic, A. (2021). A benchmark for 3D reconstruction from aerial
imagery in an urban environment. In VISIGRAPP (5: VISAPP), pages 732–741.
[Saito et al., 2019] Saito, S., Huang, Z., Natsume, R., Morishima, S., Kanazawa, A., and Li, H. (2019). Pifu:
Pixel-aligned implicit function for high-resolution clothed human digitization. In ICCV, pages 2304–2314.
[Saito et al., 2020] Saito, S., Simon, T., Saragih, J., and Joo, H. (2020). Pifuhd: Multi-level pixel-aligned
implicit function for high-resolution 3d human digitization. In CVPR, pages 84–93.
[Varol et al., 2018] Varol, G., Ceylan, D., Russell, B., Yang, J., Yumer, E., Laptev, I., and Schmid, C. (2018).
Bodynet: Volumetric inference of 3d human body shapes. In ECCV, pages 20–36.
[Yu et al., 2021] Yu, T., Zheng, Z., Guo, K., Liu, P., Dai, Q., and Liu, Y. (2021). Function4d: Real-time human
volumetric capture from very sparse consumer rgbd sensors. In ICCV.
[Zhang et al., 2017] Zhang, C., Pujades, S., Black, M. J., and Pons-Moll, G. (2017). Detailed, accurate, human
shape estimation from clothed 3d scan sequences. In CVPR.
[Zhang et al., 2021] Zhang, S., Liu, J., Liu, Y., and Ling, N. (2021). Dimnet: Dense implicit function network
for 3D human body reconstruction. Computers & Graphics.
[Zheng et al., 2021] Zheng, Y., Shao, R., Zhang, Y., Yu, T., Zheng, Z., Dai, Q., and Liu, Y. (2021). Deep-
multicap: Performance capture of multiple characters using sparse multiview cameras. arXiv preprint
arXiv:2105.00261.
[Zheng et al., 2019] Zheng, Z., Yu, T., Wei, Y., Dai, Q., and Liu, Y. (2019). Deephuman: 3d human recon-
struction from a single image. In ICCV, pages 7739–7749.
[Zins et al., 2021] Zins, P., Xu, Y., Boyer, E., Wuhrer, S., and Tung, T. (2021). Learning implicit 3d represen-
tations of dressed humans from sparse views. arXiv preprint arXiv:2104.08013.
[Zolanvari et al., 2019] Zolanvari, S. I., Ruano, S., Rana, A., Cummins, A., da Silva, R. E., Rahbar, M., and
Smolic, A. (2019). Dublincity: Annotated lidar point cloud and its applications. In BMVC.