Strike a Pose: Image-Based Pose Synthesis
Cedric Vanaken, Chris Hermans, Tom Mertens, Fabian Di Fiore, Philippe Bekaert, Frank Van Reeth
Hasselt University - tUL - IBBT
Expertise Centre for Digital Media
Abstract

In this paper, our objective is to facilitate the creation of novel human poses by synthesizing images. Existing approaches commonly deform one single image, which often results in a distorted image due to texture and illumination artefacts. We present a novel image-based pose synthesis technique that accurately reconstructs texture details by combining information from multiple photographs. Given a user-specified 2D target pose, our solution merges different parts of the input photographs in order to conform to the desired pose, solely using 2D operations. We illustrate how novel poses can be generated from only a few example images, requiring little user intervention.
CR Descriptors: Image-based, pose, 2D, synthesis
Editing and creating photographs is becoming increasingly important, judging by the many powerful tools that are available today [1, 2, 3, 4]. Image-based pose synthesis, which allows for the creation of new human poses by synthesizing new images, can be an essential component for some of these tools. Existing approaches create novel poses from single photographs by deforming a character using meshes [5, 6].

While image-based deformation suffices for simple (rigid) objects and cartoon-like subjects [7], it often results in distorted images when dealing with photorealistic images and human subjects, as it cannot reproduce changes in texture and illumination. Using multiple input photographs provides higher realism for texture changes in local regions, like creases in fabrics at bent limbs. Standard deformation techniques do not account for these details. Distortions can also occur with these techniques when large deformations are required. Using multiple input photographs and poses can eliminate the need to perform large and possibly distorting deformations.

Our contribution lies in relaxing the restriction of using only a single photograph for synthesizing novel poses in images, while balancing user convenience and image quality. Synthesized images featuring characters in a user-specified pose are created from a small set of images (typically 2 to 4). Different parts from the sample poses are merged into a new whole, such that the desired user-specified pose is obtained.
In order to reach these goals, we provide an intuitive solution where the user annotates the input images with a simple approximate skeleton, which can be obtained through only a few mouse clicks. The user can also draw a similar skeleton which serves as the target pose.

Figure 1: Schematic overview of our algorithm. Starting from a desired target skeleton and a set of input images tagged with associated skeletons, we take the parts of the input poses that best match the target pose. The segmented input sprites are subdivided according to these selected skeleton parts, while the best solution for the remaining body parts is inferred using our fusing algorithm. The color code in the skeleton is used for visualization purposes.
In addition, we also allow for mesh deformation in order to offer a larger degree of freedom when synthesizing images for a desired target pose.
Technically, there are two main challenges for our approach. First, we have to find correspondences between the character in the images and the bones of the user-specified skeleton. Afterwards, the character has to be subdivided into different body parts, which can be merged into a composite. Secondly, once we have a set of separated body parts, they need to be fused seamlessly into a whole, while preserving texture details. When necessary, the synthesized image can finally be deformed to better match the user-specified target pose.
Recently, researchers have proposed many ad-
vanced photo editing tools [1, 2, 3, 4] based on
computational techniques, ranging from seamless
cloning [8] to automatic image-based rendering [9].
Techniques that allow for manipulating objects
and characters in pictures are most relevant to our
approach. Barrett et al. [10] provide object-level
user control in photographs through segmentation
and deformation. Igarashi et al. [5] propose as-
rigid-as-possible shape manipulation, which em-
ploys a mesh-based representation to deform ob-
jects in a photograph. Wang et al. [11] propose a
matching method for 2D shape deformation. Their
method places a control mesh over the subject and
uses an iterative solver which utilizes rigid transfor-
mations. Hornung et al. [6] propose a more elab-
orate method for deformation-based pose editing,
and even allow for character animation using 3D
motion data, demanding a high degree of user in-
teraction. In general, deformation-based techniques
provide the user with flexible control over the shape
of an object or character. Unfortunately, deforma-
tion by itself is unable to model changes in textures.
Kavan et al. [12] proposed a system of 2D polygonal impostors called Polypostors. Characters are manually decomposed into body parts, which are overlaid with polygons. These polygons are used for character deformation and animation. One of the limitations of Polypostors is that deformations cannot deviate far from the initial key-frame, restricting the application to simple walk cycles.
Vanaken et al. [13] present an extension to video
sprites [14] for articulated characters. They pro-
pose a distance measure aimed at working on high-
level 2D skeletal representations of the characters,
instead of their visual appearance. Through a com-
bination of different heuristics, they are able to an-
imate a character according to a new sequence of
target skeletons. Unfortunately, tens or hundreds of
input images are needed to find a plausible match,
whereas our solution requires significantly fewer in-
put images. Plugging our algorithm into the work
of Vanaken et al. would definitely provide a larger
freedom of possible target animations and would re-
duce the number of required input poses drastically.
When considering multi-camera based pose
synthesis and animation techniques like the work of
Starck et al. [15, 16], it is obvious that even though
multi-camera setups allow for a higher sense of
realism in reconstruction and animation, their cost
and complexity currently make them inaccessible
to standard PC users.
In this work, we take images from more than one
pose into account, which allows for synthesizing
more realistic results. At first sight our method may
seem to require a significant amount of user interac-
tion, because more than a single input photograph
is used. However, specifying a very simple skele-
ton already leads to good results. The entire pro-
cess typically requires less than a minute of work
per input image, as one skeleton consists of only 22 joints.
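As a concrete illustration, such an annotated 2D skeleton can be represented as a set of clicked joint positions with bones grouped into body parts. The joint names and the grouping below are illustrative assumptions, not the exact scheme of our implementation.

```python
# Illustrative sketch of a 2D skeleton annotation: joint positions obtained
# from mouse clicks, with bones grouped into the body parts used for matching.
# Joint names and the grouping are assumptions for illustration only.
from dataclasses import dataclass, field

@dataclass
class Skeleton2D:
    # Maps a joint name to its clicked (x, y) position in the image.
    joints: dict = field(default_factory=dict)

    # Bones are (parent_joint, child_joint) pairs, grouped per body part.
    BODY_PARTS = {
        "torso":     [("neck", "pelvis")],
        "head":      [("head_top", "neck")],
        "left_arm":  [("l_shoulder", "l_elbow"), ("l_elbow", "l_wrist")],
        "right_arm": [("r_shoulder", "r_elbow"), ("r_elbow", "r_wrist")],
        "left_leg":  [("l_hip", "l_knee"), ("l_knee", "l_ankle")],
        "right_leg": [("r_hip", "r_knee"), ("r_knee", "r_ankle")],
    }

    def bones(self, part):
        """Yield the 2D endpoints of every bone belonging to a body part."""
        for a, b in self.BODY_PARTS[part]:
            yield self.joints[a], self.joints[b]
```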
An overview of our approach is shown in Figure 1.
First, the user overlays 2D skeletons on each in-
put image, and specifies the target skeleton. This
is done by simply marking the joints of the body
in the input images using mouse input. In addition,
if background images are available, we automati-
cally extract the foreground sprite (i.e., the char-
acter) from each input image using a background
subtraction technique [17]. If not, the sprite can be
manually segmented, or a more sophisticated fore-
ground extraction technique such as GrabCut [18]
can be used.
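As a concrete illustration of this foreground extraction step, the sketch below uses OpenCV's GrabCut. The rectangle initialisation and the conversion of the resulting mask into an RGBA sprite are assumptions for illustration, not part of our pipeline.

```python
# Minimal sketch: extracting a foreground sprite with OpenCV's GrabCut.
import cv2
import numpy as np

def extract_sprite(image_bgr, rect, iterations=5):
    """Return an RGBA sprite whose alpha channel masks the foreground subject."""
    mask = np.zeros(image_bgr.shape[:2], np.uint8)
    bgd_model = np.zeros((1, 65), np.float64)  # internal GrabCut state
    fgd_model = np.zeros((1, 65), np.float64)
    cv2.grabCut(image_bgr, mask, rect, bgd_model, fgd_model,
                iterations, cv2.GC_INIT_WITH_RECT)
    # Keep pixels labelled as definite or probable foreground.
    alpha = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 255, 0)
    return np.dstack([image_bgr, alpha.astype(np.uint8)])

# Hypothetical usage: rect is a rough bounding box around the subject,
# e.g. derived from the annotated skeleton joints.
# sprite = extract_sprite(cv2.imread("input1.png"), rect=(50, 20, 300, 560))
```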
For each body part in the target skeleton, we
search for its best match amongst the input skele-
tons using a suitable skeleton-based distance mea-
sure. Afterwards, the optimal parts are transferred
to the resulting image.
For each input image, a mesh is overlaid on
the character in an edge-aware fashion, using con-
strained Delaunay triangulation [19]. We do this
in order to associate image regions with each skele-
ton bone, and in turn also with each body part. If
all possible combinations of the input poses deviate
too far from the desired target pose, this mesh can
eventually be deformed to better match the target
pose.
Once all input image regions have been selected,
we overlay them to form the final image. Some
parts of these regions might overlap with each other
in image space. Simple averaging could be used
to arrive at the final image, yet this might introduce
ghosting artefacts. We therefore use a patch-based
fusion approach to merge the overlaps while respecting
the continuity of the image.
4 Matching Body Parts
In this section, we describe how body parts are
matched and extracted. At a later stage, these
extracted body parts will be used to form the
final pose, as described in Section 5.
As a first step, we split the skeletons into predefined
body parts: legs, arms, head and torso. Match-
ing is performed on each separate body part. For
the matching itself, we compare the 2D angles of
consecutive skeleton joints, a matching cost
which is robust against foreshortening of the skele-
ton bones. For each target body part, we keep the
best match, and discard the others.
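The sketch below illustrates such an angle-based matching cost. Because only angles are compared and bone lengths are ignored, the cost is insensitive to foreshortening. Whether absolute bone orientations or relative joint angles are compared, and how the per-bone differences are weighted, are assumptions made for illustration.

```python
# Sketch of the per-body-part matching cost: compare the 2D orientations of
# corresponding bones of the target and of a candidate input skeleton.
# The plain sum of absolute angular differences is an assumed weighting.
import math

def bone_angle(p, q):
    """Orientation of the bone from joint p to joint q, in radians."""
    return math.atan2(q[1] - p[1], q[0] - p[0])

def part_matching_cost(target_bones, input_bones):
    """Sum of angular differences between corresponding bones of one body part."""
    cost = 0.0
    for (tp, tq), (ip, iq) in zip(target_bones, input_bones):
        diff = bone_angle(tp, tq) - bone_angle(ip, iq)
        # Wrap to [-pi, pi] so that e.g. 350 and -10 degrees compare as equal.
        diff = (diff + math.pi) % (2 * math.pi) - math.pi
        cost += abs(diff)
    return cost

# For each target body part, the input skeleton with the lowest cost is kept:
# best = min(inputs, key=lambda s: part_matching_cost(list(target.bones(part)),
#                                                     list(s.bones(part))))
```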
It may occur that a target body part matches well
with several input skeletons, which often happens
for the torso. This is detected by computing the dif-
ference between the matching costs. If these cost
differences are below a given threshold, we can
conclude there is no unique winner, and retain all
possible candidates. Later on, these (overlapping)
parts will be combined into a single image (see Section 5).
Figure 2: The mesh creation process. Using the silhouette of the sprite combined with the skeleton bones
(a) and an edge image of the sprite (b), we create a mesh (c) that fits the input sprite (d). The colored circles
in (d) represent the user-indicated skeleton joints.
In the second step, the input skeletons need to be as-
sociated with the image data itself in order to trans-
fer the whole body parts to the final image. To this
end, we perform background subtraction in order
to separate the subject from the background (from
here on we refer to the resulting image region as
a “sprite”), after which we overlay a mesh on the
sprites in order to divide the body into different re-
gions covering different body parts.
This mesh is constructed in an edge-aware fash-
ion. Initially, the outer vertices are placed on the
silhouette, while the inner vertices are constrained
to the bones of the input skeleton as well as the
edges obtained from an edge detector, for which
we employed directional Sobel filters [20]. Edge-
awareness ensures that cuts occur where smooth
transitions are needed, avoiding seams in the final
composite image. See Figure 2 for an illustration of
the mesh creation process.
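As an illustration of the edge detection feeding this step, the sketch below computes a directional Sobel edge map with OpenCV; the gradient-magnitude threshold is an arbitrary illustrative value.

```python
# Sketch: building the edge image used to constrain inner mesh vertices.
# Directional Sobel filters give horizontal and vertical gradients, whose
# magnitude is thresholded into a binary edge map.
import cv2
import numpy as np

def edge_map(sprite_bgr, threshold=60):
    gray = cv2.cvtColor(sprite_bgr, cv2.COLOR_BGR2GRAY)
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)  # horizontal derivative
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)  # vertical derivative
    magnitude = cv2.magnitude(gx, gy)
    return (magnitude > threshold).astype(np.uint8) * 255

# Edge pixels, together with the silhouette and the skeleton bones, then act
# as constraints for the constrained Delaunay triangulation of the sprite.
```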
To allow for a large variety of attainable target
poses, the mesh can be deformed using the as-
rigid-as-possible shape manipulation algorithm of
Igarashi et al. [5], where the skeleton joints act as
control points for the mesh. As we have motivated
in Sections 1 and 2, using this deformation algo-
rithm on a single image creates too large distortions.
However, in our case it is merely used for small de-
formations in the final result.
To find image regions that cover a given body part,
we assign each triangle of the mesh to its nearest
skeleton bones as follows. If one of a triangle’s ver-
tices is located on a skeleton bone, we assign it as
being part of that bone’s body part and mark its sta-
tus as “confident”. For the remaining triangles, we
look for the two closest skeleton bones by compar-
ing the L2 distance between the triangle’s centroid
and the skeleton joints. If a triangle’s two closest
bones belong to the same body part then its status is
marked as “confident”, otherwise it is marked as “uncertain”.
From the set of confident triangles, we keep those
which belong to the required body part. The uncer-
tain triangles which belong to a required body part
will be fused in the next step of the algorithm.
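The sketch below illustrates the centroid-based part of this classification. It measures distance to the bone segments rather than to the joints, which is a slight simplification, and it assumes the vertex-on-bone test has already been handled separately.

```python
# Sketch: classifying a mesh triangle by the two skeleton bones closest to
# its centroid. If both bones belong to the same body part the triangle is
# "confident", otherwise it is "uncertain".
import numpy as np

def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the bone segment a-b."""
    p, a, b = (np.asarray(x, dtype=float) for x in (p, a, b))
    ab = b - a
    t = np.clip(np.dot(p - a, ab) / max(np.dot(ab, ab), 1e-12), 0.0, 1.0)
    return float(np.linalg.norm(p - (a + t * ab)))

def classify_triangle(centroid, bones):
    """bones: list of (body_part_name, (joint_a, joint_b)) tuples.
    Returns the assigned body part and a "confident"/"uncertain" status."""
    ranked = sorted(bones, key=lambda pb: point_segment_distance(centroid, *pb[1]))
    part1, part2 = ranked[0][0], ranked[1][0]
    status = "confident" if part1 == part2 else "uncertain"
    return part1, status
```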
5 Fusing Body Parts
At this point we have collected a set of “confident”
and “uncertain” image regions, which conform to
the desired pose, from the different input images.
The final step is to fuse these regions into a consistent whole.
In order to avoid mismatches, we first translate
the position of the individual input body parts us-
ing the target skeleton as a reference. The final
sprite could now be obtained by simply averaging
overlapping regions. Unfortunately, this may lead
to ghosting artefacts, as the input images may con-
tain different texture details (e.g. due to wrinkles in
clothing, see Figure 5). These texture details are
clearly important to preserve a sense of realism in
the synthesized images.
We therefore take a more elaborate approach.
The final image is subdivided into a lattice of square
patches. For each patch, we want to pick a corre-
sponding input patch that is compatible with the
“confident” image regions (which can be seen as
fixed, as there is no choice to be made for those
parts). Selecting these patches can be seen
as a labeling problem: for each patch in the final im-
age, we need to select a patch from n patches in the
input images, where n is the number of overlapping
“uncertain” image regions.
The optimal labeling will be computed based on
a cost function, consisting of a data term and a
smoothness term. The data term expresses that we
want patches to respect the non-overlapping body
parts. We define this term as the sum of squared dif-
ferences (SSD) of the region pixels where the patch
intersects with the fixed part. Second, we want con-
tinuity from one patch to another, which will be ex-
pressed in the smoothness term. This term is de-
fined as the SSD of the overlap between adjacent
patches (we therefore let patches overlap by a few
pixels). The optimal labeling is the one for which
the cost function is minimal. Optimization prob-
lems such as the one described here can be solved
using standard inference methods like belief propa-
gation [21].
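The sketch below illustrates the two cost terms for a single candidate patch. The patch size, the overlap width and the restriction to horizontal neighbours are illustrative assumptions; in practice, the summed pairwise energy would be minimised with an off-the-shelf inference method.

```python
# Sketch of the two cost terms of the patch labelling. Each patch in the
# final image takes a label selecting one of the overlapping "uncertain"
# input regions.
import numpy as np

PATCH = 16    # assumed patch size in pixels
OVERLAP = 4   # assumed overlap between adjacent patches

def data_term(candidate, fixed, fixed_mask):
    """SSD between a candidate patch and the fixed ("confident") pixels it covers."""
    if not fixed_mask.any():
        return 0.0
    diff = candidate.astype(np.float64) - fixed.astype(np.float64)
    return float(np.sum(diff[fixed_mask] ** 2))

def smoothness_term(left_patch, right_patch):
    """SSD over the strip where two horizontally adjacent patches overlap."""
    a = left_patch[:, -OVERLAP:].astype(np.float64)
    b = right_patch[:, :OVERLAP].astype(np.float64)
    return float(np.sum((a - b) ** 2))

# The total cost of a labelling sums the data terms over all patches and the
# smoothness terms over all adjacent patch pairs; this pairwise energy can be
# minimised with standard inference such as loopy belief propagation [21].
```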
In this section, we discuss results obtained with our
technique. The inset comprises three input pho-
tos, taken in front of a green screen. Chroma-keying
is used to extract the sprites. The target pose con-
sists of legs and arms that are spread open. Our
algorithm has automatically chosen the lower body
from input image 3, while the arms in the result
were taken from input images 1 and 2. The torso and
the head were taken from all, and combined using
our fusion method.
We used background subtraction [17] for sprite
segmentation in the remainder of the examples, unless
stated otherwise.

The second example (Figure 3) uses two input
images that were taken from different camera po-
sitions. This shows that our technique does not nec-
essarily restrict the input images to originate from a
single camera position.

Figure 3: Synthesized image on the right-hand side,
composed from two input images shown on the left.
Notice that the input images are not taken from the
same camera position.

Figure 4: Synthesized image on the right-hand side,
composed from two input images shown on the
left. Note that the desired target pose cannot be
reached solely using the input poses. Mesh defor-
mation is used to obtain the final result. Holes in the
background were in-painted manually. Input im-
ages originate from the Starpulse Supermodels im-
age gallery, shown at the highest available image
resolution.
Figure 4 shows two input images of supermodel
Jordan, found online in the Starpulse Supermod-
els image gallery. The user-specified target
pose cannot be composed out of the input poses;
therefore the best available matches are automati-
cally chosen and then deformed into the target pose.
Since no background image was available for this
scene, the input sprites were manually segmented
and the background in the resulting image has been
in-painted manually.
The example in Figure 5 consists of two input
images. The first image shows a person sitting on
a table, while the second image features this per-
son standing up straight with arms spread. The re-
sult shows this person sitting on the table with his
arms and legs open. To arrive at the desired tar-
get pose, skeleton and mesh deformation were per-
formed. Notice the indicated areas: a close-up of
a fused overlap region is shown in the lower-right
corner and is compared with a close-up of the same
region in the top-right corner, obtained by aver-
aging instead of the patch-based fusion approach.
The ghosting artefacts in the averaged version are
clearly not present in the fused version.
The last example uses a target skeleton in which
both feet of the subject are lifted off the ground.
Four input images are used, with one input image
per leg, one for the head and one for the arms (Fig-
ure 6). The final result is obtained by using mesh
deformation to better fit the legs and arms to the
target skeleton.

For the results discussed in this section, the
amount of time spent by the user was limited
to indicating the skeleton joints on the input im-
ages and specifying a target skeleton. Marking the
joints in the skeleton can be done with only a few
mouse clicks, whereas a target skeleton is created
starting from an input skeleton and then drag-and-
dropping its joints to the desired positions.

The precision of the positions of these joints is
not highly important, as long as the joint se-
mantics used are consistent throughout all input
and target poses. When the joint positions are not
placed consistently in all poses, for example if the
hand is indicated once near the wrist and once at
the end of the fingers, the mesh deformation algo-
rithm might stretch the associated limb in an un-
natural way.

Depending on the size of the overlap areas in the
fusing part of our solution, our unoptimised min-
imisation algorithm required two to ten minutes of
computation time.
In this paper we presented a novel technique for
pose synthesis from a set of photographs, based on
selecting and merging different body parts into a
desired pose. Only little user input is required to
specify the poses (2D skeletons) of the input images
and the target pose. For each body part in the tar-
get skeleton, best matches are computed in the input
poses, and the associated image parts are transferred
to the final image. A triangle mesh is used to iden-
tify which pixels belong to which body part. Over-
lapping regions in the resulting image are merged
while respecting the continuity of the image.
Even though our method allows for generating a
wide variety of poses from only a small set of pho-
tographs, a target pose can only be met approxi-
mately. More variety is obtained by incorporating
mesh deformation. Possible improvements on this
work can be found for different parts of the algo-
rithm. Automatic skeleton extraction could reduce
the required user interaction even more.
When combined with animation and retargeting
algorithms, our method could allow for creating a
wide variety of animations of the subject, or for re-
targeting motion from video sequences without the
need for 3D models.
Our algorithm is currently unable to cope with
situations where body parts occlude one an-
other, or where the subject is captured sideways, as
well as with images where the subject is shot under
large perspective differences. The availability of 3D
skeletons and/or multi-camera information would
be of great value when dealing with these problems.
If this information is available, our technique would
be highly suitable for use in 3D character animation
applications [15, 16].
Incorporating color correction would finally al-
low for the combination of photographs that either
feature the subject standing in different positions
under a direct light source, or that are captured in
different lighting environments, e.g. combining in-
door and outdoor images.
Figure 5: Pose synthesis example with input images on the left and result on the right. The inset red
rectangle in the bottom right shows a close-up of our patch-based fusion approach, with simple averaging
of the overlap input regions in the top right red rectangle. Close-ups of the same region in the input images
are shown in the green and yellow rectangles. Notice how the averaged version exhibits ghosting artefacts
on the shirt near the seam of the sweater. The final result was obtained using mesh deformation.
Figure 6: Synthesized image on the right hand side composed from four different input images shown on
the left. The resulting image was deformed to match the target skeleton.
We express our gratitude to the European
Regional Development Fund (ERDF), the Flemish
Government and the Flemish Interdisciplinary Insti-
tute for Broadband Technology (IBBT), which are
kindly funding part of the research at the Exper-
tise Centre for Digital Media. Part of the work is
also funded by the European research project IST-
2-511316-IP : IP-RACINE (Integrated Project Re-
search Area CINE). We would also like to thank our col-
leagues, especially Mark Gerrits, Yannick Francken
and Tom Cuypers, for their time and effort in this
work, as well as the friends that helped us by pos-
ing for the example images.
References

[1] B. M. Oh, M. Chen, J. Dorsey and F. Durand. Image-based modeling and photo editing. In SIGGRAPH '01: ACM SIGGRAPH 2001 Papers, 433-442, 2001.
[2] Y. Chuang, D. B. Goldman, K. C. Zheng, B. Curless, D. H. Salesin and R. Szeliski. Animating pictures with stochastic motion textures. In SIGGRAPH '05: ACM SIGGRAPH 2005 Papers, 853-860, 2005.
[3] J.-F. Lalonde, D. Hoiem, A. A. Efros, C. Rother, J. Winn and A. Criminisi. Photo clip art. In SIGGRAPH '07: ACM SIGGRAPH 2007 Papers, 26(3):577-584, 2007.
[4] S. Avidan and A. Shamir. Seam carving for content-aware image resizing. In SIGGRAPH '07: ACM SIGGRAPH 2007 Papers, 26(3):10, 2007.
[5] T. Igarashi, T. Moscovich and J. F. Hughes. As-rigid-as-possible shape manipulation. In SIGGRAPH '05: ACM SIGGRAPH 2005 Papers, 2005.
[6] A. Hornung, E. Dekkers and L. Kobbelt. Character animation from 2D pictures and 3D motion data. ACM Transactions on Graphics, 2007.
[7] W. Van Haevre, F. Di Fiore and F. Van Reeth. Uniting Cartoon Textures with Computer Assisted Animation. In Proceedings of the 3rd International Conference on Computer Graphics and Interactive Techniques in Australasia and Southeast Asia, 245-253, 2005.
[8] P. Pérez, M. Gangnet and A. Blake. Poisson image editing. In SIGGRAPH '03: ACM SIGGRAPH 2003 Papers, 313-318, 2003.
[9] D. Hoiem, A. A. Efros and M. Hebert. Automatic photo pop-up. In SIGGRAPH '05: ACM SIGGRAPH 2005 Papers, 577-584, 2005.
[10] W. A. Barrett and A. S. Cheney. Object-based image editing. In SIGGRAPH '02: ACM SIGGRAPH 2002 Papers, 777-784, 2002.
[11] Y. Wang, K. Xu, Y. Xiong and Z. Cheng. 2D shape deformation based on rigid square matching. In Computer Animation and Social Agents (CASA 2008), Journal of Computer Animation and Virtual Worlds, 2008.
[12] L. Kavan, S. Dobbyn, S. Collins, J. Žára and C. O'Sullivan. Polypostors: 2D polygonal impostors for 3D crowds. In SI3D '08: Proceedings of the 2008 symposium on Interactive 3D graphics and games, 149-155, 2008.
[13] C. Vanaken, M. Gerrits and P. Bekaert. Articulated video sprites. In Proceedings of Eurographics Short Papers, 2006.
[14] A. Schödl and I. A. Essa. Controlled animation of video sprites. In SCA '02: Proceedings of the 2002 ACM SIGGRAPH/Eurographics symposium on Computer animation, 121-127, 2002.
[15] J. Starck, G. Miller and A. Hilton. Video-based character animation. In SCA '05: Proceedings of the 2005 ACM SIGGRAPH/Eurographics symposium on Computer animation, 49-58, 2005.
[16] J. Starck and A. Hilton. Surface capture for performance based animation. IEEE Computer Graphics and Applications, 27(3):21-31, 2007.
[17] A. M. McIvor. Background subtraction techniques. In Proceedings of Image and Vision Computing, Auckland, New Zealand, 2000.
[18] C. Rother, V. Kolmogorov and A. Blake. "GrabCut": Interactive foreground extraction using iterated graph cuts. In SIGGRAPH '04: ACM SIGGRAPH 2004 Papers, 309-314, 2004.
[19] L. P. Chew. Constrained Delaunay triangulations. In SCG '87: Proceedings of the third annual symposium on Computational geometry, 1987.
[20] R. C. Gonzalez and R. E. Woods. Digital Image Processing. Addison-Wesley, 2001.
[21] J. S. Yedidia, W. T. Freeman and Y. Weiss. Understanding belief propagation and its generalizations. In International Joint Conference on Artificial Intelligence 2001 Distinguished Lecture Track, 2001.