CG2Real: Improving the Realism of Computer Generated Images Using a Large Collection of Photographs
ABSTRACT Computer-generated (CG) images have achieved high levels of realism. This realism, however, comes at the cost of long and expensive manual modeling, and often humans can still distinguish between CG and real images. We introduce a new data-driven approach for rendering realistic imagery that uses a large collection of photographs gathered from online repositories. Given a CG image, we retrieve a small number of real images with similar global structure. We identify corresponding regions between the CG and real images using a mean-shift cosegmentation algorithm. The user can then automatically transfer color, tone, and texture from matching regions to the CG image. Our system only uses image processing operations and does not require a 3D model of the scene, making it fast and easy to integrate into digital content creation workflows. Results of a user study show that our hybrid images appear more realistic than the originals.
Computer Science and Artificial Intelligence Laboratory
Technical Report
Massachusetts Institute of Technology, Cambridge, MA 02139 USA, www.csail.mit.edu
MIT-CSAIL-TR-2009-034, July 15, 2009
CG2Real: Improving the Realism of
Computer Generated Images using a
Large Collection of Photographs
Micah K. Johnson, Kevin Dale, Shai Avidan,
Hanspeter Pfister, William T. Freeman, and
Wojciech Matusik
CG2Real: Improving the Realism of Computer Generated Images using a Large
Collection of Photographs
Micah K. Johnson (1,2), Kevin Dale (3), Shai Avidan (2), Hanspeter Pfister (3), William T. Freeman (1), Wojciech Matusik (2)
(1) MIT, {kimo, billf}@mit.edu
(2) Adobe Systems, Inc., {avidan, wmatusik}@adobe.com
(3) Harvard University, {kdale,pfister}@seas.harvard.edu
Figure 1: Given an input CG image (left), our system finds the most similar photographs to the input image (not shown). Next, it identifies similar regions
between the CG image and photographs, transfers these regions into the CG image (center), and uses seamless compositing to blend the regions. Finally, it
transfers local color and gradient statistics from the photographs to the input image to create a color and tone adjusted image (right).
Abstract
Computer Graphics (CG) has achieved a high level of realism, pro-
ducing strikingly vivid images. This realism, however, comes at the
cost of long and often expensive manual modeling, and most often
humans can still distinguish between CG images and real images.
We present a novel method to make CG images look more realis-
tic that is simple and accessible to novice users. Our system uses
a large collection of photographs gathered from online reposito-
ries. Given a CG image, we retrieve a small number of real images
with similar global structure. We identify corresponding regions between
the CG and real images using a novel mean-shift cosegmentation
algorithm. The user can then automatically transfer color,
tone, and texture from matching regions to the CG image. Our sys-
tem only uses image processing operations and does not require a
3D model of the scene, making it fast and easy to integrate into dig-
ital content creation workflows. Results of a user study show that
our improved CG images appear more realistic than the originals.
1 Introduction
The field of image synthesis has matured to the point where photo-
realistic Computer Graphics (CG) images can be produced with
commercially available software packages (e.g., Renderman and
POVRay). However, reproducing the details and quality of a nat-
ural image often requires a considerable investment of time by a
highly skilled artist. Even with large budgets and many man-hours
of work, it is still often surprisingly easy to distinguish CG images
from photographs.
CG images differ from real photographs in three major ways.
First, the color distribution of CG images is often overly saturated
and exaggerated. Second, multi-scale image statistics, such as the
histogram of filter outputs at different scales, rarely match the statistics
of natural images. Finally, CG images often lack details (i.e.,
high frequencies, texture, and noise), which makes them look too
pristine.
Recently, the proliferation of images available online through
photo-sharing sites such as Flickr has allowed researchers to col-
lect large databases of natural images and to develop data-driven
methods for improving photographs. This work leverages a large
collection of images to improve the realism of computer generated
images in a data-driven manner with minimal user input.
CG2Real takes a CG image to be improved, retrieves and aligns a
small number of similar natural images from a database, and trans-
fers the color, tone, and texture from the natural images to the CG
image. A key ingredient in the system is a novel mean-shift coseg-
mentation algorithm that matches regions in the CG image with re-
gions in the real images. If the structure of a real image does not fit
that of the CG image (e.g., because of differences in perspective),
we provide a user interface to correct basic perspective differences
interactively. After cosegmentation, we use local style transfer be-
tween image regions, which greatly improves the quality of these
transfers compared to global transfers based on histogram match-
ing. The user has full control over which regions and which styles
are being transferred. Color and tone transfers are completely au-
tomatic, and texture transfer can be controlled by adjusting a few
parameters. In addition, all operations are reasonably fast: an average
computer can run the cosegmentation and all three transfer
operations in less than a minute for a 600×400 pixel image.
The primary contribution of this paper is a novel data-driven
approach for improving the look of CG images using real pho-
tographs. Within this system, several novel individual operations
also further the state of the art, including (1) an improved image
search tuned for matching global image structure between CG and
real images; (2) an image cosegmentation algorithm that is both fast
and sufficiently accurate for color and tone transfer; and (3) meth-
ods for local transfer of color, tone, and texture that take advantage
of region correspondences. As a final contribution, we describe
several user studies that demonstrate that our improved CG images
appear more realistic than the originals.
[Figure 2 diagram, panel labels: Input; Image Database; Similar images; Cosegmentation; Local style transfer (Color, Tone, Texture); Output. Segmented regions are labeled A and B.]
Figure 2: An overview of our system. We start by querying a large collection of photographs to retrieve the most similar images. The user selects the k closest
matches and the images, both real and CG, are cosegmented to identify similar regions. Finally, the real images are used by the local style transfer algorithms
to upgrade the color, tone, and/or texture of the CG image.
2 Previous Work
Adding realistic texture to an image is an effective tool to improve
the photo-realistic look and feel of CG images. In their seminal
work, Heeger and Bergen [1995] proposed a novel texture synthe-
sis approach. Their method starts with a random noise image and
iteratively adjusts its statistics at different scales to match those of
the target texture, leading to new instances of the target texture.
This approach was later extended by De Bonet [1997] to use joint
multi-scale statistics. Alternatively, one can take an exemplar-based
approach to texture synthesis. This idea was first illustrated in the
work of Efros and Leung [1999] and was later extended to work on
patches instead of pixels [Efros and Freeman 2001; Kwatra et al.
2003]. The image analogies framework [Hertzmann et al. 2001]
extends non-parametric texture synthesis by learning a mapping be-
tween a given exemplar pair of images and applying the mapping
to novel images. Freeman et al. [2002] proposed a learning-based
approach to solve a range of low-level image processing problems
(e.g., image super-resolution) that relies on having a dictionary of
corresponding patches that is used to process a given image.
Unfortunately, these approaches require correspondence be-
tween the source and target images (or patches), a fairly strong as-
sumption that cannot always be satisfied. Rosales et al. [2003] later
relaxed this assumption by framing the problem as a large inference
problem, where both the position and appearance of the patches are
inferred from a pair of images without correspondence. While the
results look convincing for a variety of applications, the specific
problem of improving realism in CG images was not addressed.
Instead of requiring corresponding images (or patches) in order
to learn a mapping, one can take a global approach that attempts
to transfer color or style between images. Reinhard et al. [2001]
modified the color distribution of an image to give it the look and
feel of another image. They showed results on both photographic
and synthetic images. Alternatively, Pitié et al. [2005] consider this
problem as estimating a continuous N-dimensional transfer func-
tion between two probability distribution functions and present an
iterative non-linear algorithm. Bae et al. [2006] take yet another
approach, using a two-scale nonlinear decomposition of an image
to transfer style between images. In their approach, histograms of
each layer are modified independently and then recombined to ob-
tain the final output image. Finally, Wen et al. [2008] provide a
stroke-based interface for performing local color transfer between
images. In this system, the user provides a target image and input
in the form of stroke pairs.
We build on and extend this line of work with several important
distinctions. First, the work discussed so far does not consider the
question of how the model images are chosen, and instead it is as-
sumed that the user provides them. However, we believe a system
capable of handling a variety of types of input images should be
able to obtain model images with a minimum of user assistance.
Given a large collection of photographs we assume that we can find
images with similar global structure (e.g., trees next to mountains
below a blue sky) and transfer their look and feel to the CG image.
Moreover, because the photographs are semantically and contextu-
ally similar to the CG image, we can find corresponding regions
using cosegmentation, and thus can more easily apply local style
transfer methods to improve realism.
Recently, several authors have demonstrated the use of large col-
lections of images for image editing operations. In one instance,
Hays and Efros [2007] use a large collection of images to complete
missing information in a target image. The system works by
retrieving a number of images that are similar to the query image
and using their data to complete a user-specified region. We take a
different approach by automatically identifying matching regions
and by stitching together regions from multiple images. Liu et
al. [2008] use images from the web to perform example-based image
colorization that is robust to illumination differences. However,
their method involves image registration between search results and
input and requires exact scene matches. Our approach instead uses
a visual search based on image data, and our transfers only assume
similar content between CG input and real search results. Finally,
Sivic et al. [2008] show a novel use of large image collections by
retrieving and stitching together images that match a transformed
version of the query real image. While not related to image editing,
this work provides a unique image-browsing experience.
In work based on annotated image datasets, Lalonde and
Efros [2007] use image regions drawn from the LabelMe
database [Russell et al. 2008] to populate the query image with new
objects. Johnson et al. [2006] allow the user to create novel com-
posite images by typing in a few nouns at different image locations.
Here the user input is used to retrieve and composite relevant parts
of images from a large annotated image database. In both cases,
the system relies on image annotations to identify image regions
and region correspondences. In contrast, our approach uses an au-
tomatic cosegmentation algorithm for identifying local regions and
inter-region correspondences.
Researchers have also studied the characteristics of natural ver-
sus synthetic images. For example, in digital forensics, Lyu and
Farid [2005] examine high order image statistics to distinguish be-
tween synthetic and natural images. Lalonde and Efros [2007] use
color information to predict if a composite image will look natu-
ral or not. Others have focused solely on learning a model for the
statistics of natural images [Weiss and Freeman 2007; Roth and
Black 2005]. These works suggest that natural images have rela-
tively consistent statistical properties and that these properties can
be used to distinguish between synthetic and natural images. Based
on this observation, our color and tone transfer algorithms work
statistically, adjusting color and gradient distributions to match cor-
responding distributions from real images.
3 Image and Region Matching
Figure 2 shows an overview of our system. First, we retrieve the
N closest real images to the query CG image. The N images are
shown to the user, who selects the k most relevant images; typically,
N = 30 and k = 5. Next, we perform a cosegmentation of the k real
images with the CG image to identify similar image regions. Once
the images are segmented, the user chooses among three different
types of transfer from the real images to the CG image: texture,
color and tone. We find that all three types of style transfer can
improve the realism of low-quality CG images. Since high-quality
CG images often have realistic textures, we typically transfer only
color and tone for these inputs.
3.1 Image Database
Our system leverages a database of 4.5 million natural images
crawled from the photo-sharing site Flickr using keywords related
to outdoor scenes, such as ‘beach’, ‘forest’, ‘city’, etc. Each image,
originally of Flickr’s large size with a maximum dimension
of 1024 pixels, was downsampled to approximately 75% of its original
size and stored in PNG format (24-bit color) to minimize the
impact of JPEG compression artifacts on our algorithms.
3.2 Visual Search
The goal of the visual search is to retrieve semantically similar images
for a CG input. For example, for a CG image depicting a park
with a tree line on the horizon, the results of the query should depict
similar scenes at approximately the same scale, with similar
lighting, viewpoint, and spatial layout of objects within the image
(e.g., a park with trees and a skyline). Using a very large database
of real photographs greatly enhances the chances of finding good
matches. Searching a large database, however, requires an efficient,
yet descriptive, image representation.
The gist scene descriptor [Oliva and Torralba 2001] is one choice
of representation that has been used successfully for image matching
tasks [Hays and Efros 2007]. The gist descriptor uses histograms
of Gabor filter responses at a single level. We used gist
in an early implementation of our system and were not fully satisfied
with the results. In a recent study, Gabor-based descriptors,
such as gist, were out-performed by SIFT-based descriptors for texture
classification [Zhang et al. 2007], justifying our decision to use
a more detailed image representation.
Our representation is based on visual words, or quantized SIFT
features [Lowe 1999], and the spatial pyramid matching scheme of
Lazebnik et al. [2006]. This approach has been shown to perform
well for semantic scene classification. Although our transfer operations
are local, the system benefits from global structural alignment
between the CG input and real matches, justifying a descriptor
with significant spatial resolution. Additionally, since textures
in the original CG image often only faintly resemble the real-world
appearance of objects they represent, we use smaller visual word
vocabularies than is typical to more coarsely quantize appearance.
Specifically, we use two vocabularies of 10 and 50 words and
grid resolutions of 1×1, for the 10-word vocabulary, and 1×1,
2×2, 4×4, and 8×8, for the 50-word vocabulary, for a final pyramid
descriptor with 4346 elements. This representation has some
redundancy, since a visual word will occur multiple times across
pyramid levels. The weighting scheme specified by the pyramid
match kernel [Grauman and Darrell 2005] accounts for this; it also
effectively provides term-frequency (tf) weighting. We also apply
inverse document frequency (idf) weighting to the pyramid descriptor.
[Figure 3 images: three rows of cosegmented images, with corresponding regions labeled A, B, and C.]
Figure 3: Results from our cosegmentation algorithm. In each row, the CG
image is shown on the left and two real image matches, on the right. Note
that in all cases, segment correspondences are correct, and the images are
not over-segmented.
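As an illustration of the spatial pyramid of visual words described in this section, the following minimal sketch builds per-cell word histograms over several grid resolutions and computes idf weights per descriptor bin. It is a sketch under stated assumptions, not the authors' implementation: the function names `pyramid_histogram` and `idf_weights` are hypothetical, the input is assumed to be a list of already-quantized features with normalized coordinates, the idf formula log(N/df) is one common choice, and the per-level weights of the pyramid match kernel are omitted for brevity.

```python
import math

def pyramid_histogram(points, vocab_size, grids):
    """Build a flat spatial-pyramid descriptor of visual-word counts.

    points: list of (x, y, word), with x, y normalized to [0, 1] and
            word an integer in range(vocab_size) (assumed pre-quantized).
    grids:  grid resolutions, e.g. [1, 2, 4] for 1x1, 2x2 and 4x4 cells.
    """
    desc = []
    for g in grids:
        hist = [0.0] * (g * g * vocab_size)
        for x, y, w in points:
            # Clamp so that x == 1.0 or y == 1.0 falls in the last cell.
            cx = min(int(x * g), g - 1)
            cy = min(int(y * g), g - 1)
            hist[(cy * g + cx) * vocab_size + w] += 1.0
        desc.extend(hist)
    return desc

def idf_weights(descriptors):
    """One idf weight per bin: log(N / number of images with a nonzero count).

    Assumption: bins that never occur get weight 0.
    """
    n = len(descriptors)
    weights = []
    for d in range(len(descriptors[0])):
        df = sum(1 for desc in descriptors if desc[d] > 0)
        weights.append(math.log(n / df) if df else 0.0)
    return weights
```

With a two-word toy vocabulary and grids `[1, 2]`, the descriptor has 2 + 8 = 10 bins; multiplying each bin by its idf weight down-weights words that occur in nearly every image.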
In addition, we represent a rough spatial layout of color with
an 8×8 downsampled version of the image in CIE L*a*b* space
(192 elements). Since the search is part of an interactive system,
we use principal component analysis (PCA) to reduce the descrip-
tor dimensionality to allow for an efficient in-core search. We keep
700 elements for the pyramid term and 48 for the color term and
L2-normalize each. The final descriptor is the concatenation of the
spatialpyramidandcolorterms, weightedbyα and(1−α), respec-
tively, for α ∈[0,1]. Similarity between two images is measured by
Euclidean distance between their descriptors.
For low-quality CG images, texture is only a weak cue, so smaller
α values achieve a better balance of color versus texture cues.
We found that presenting the user with 15 results obtained with
α = 0.25 and 15 with α = 0.75 yielded a good balance between
the quality of matches, robustness to differences in fidelity of CG
inputs, and time spent by the user during selection. We use a
kd-tree-based exact nearest-neighbor search, which requires about 2
seconds per query on a 3 GHz dual-core machine.
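The descriptor combination and two-α retrieval described above can be sketched as follows. This is a minimal illustration under assumptions: all function names are hypothetical, the descriptors are taken to be already PCA-reduced, a brute-force scan stands in for the kd-tree search, and dropping duplicates between the two result lists is our own choice rather than something the text specifies.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def combine(pyramid, color, alpha):
    """Concatenate the L2-normalized terms, weighted by alpha and (1 - alpha)."""
    p = [alpha * x for x in l2_normalize(pyramid)]
    c = [(1.0 - alpha) * x for x in l2_normalize(color)]
    return p + c

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(query, database, alpha, n):
    """Return the ids of the n closest images.

    query:    (pyramid, color) pair of descriptor terms.
    database: list of (image_id, pyramid, color).
    Assumption: a linear scan replaces the kd-tree for brevity.
    """
    q = combine(query[0], query[1], alpha)
    scored = [(euclidean(q, combine(p, c, alpha)), image_id)
              for image_id, p, c in database]
    scored.sort()
    return [image_id for _, image_id in scored[:n]]

def candidates(query, database, per_alpha=15, alphas=(0.25, 0.75)):
    """Merge the results for each alpha, keeping first occurrences only."""
    seen, out = set(), []
    for a in alphas:
        for image_id in nearest(query, database, a, per_alpha):
            if image_id not in seen:
                seen.add(image_id)
                out.append(image_id)
    return out
```

A small α emphasizes the color term, so images whose color layout matches rank higher; a large α emphasizes the visual-word term.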
3.3 Cosegmentation
Global transfer operations between two images, such as color and
tone transfer, work best when the images have similarly-sized re-
gions, e.g., when there are similar amounts of sky, ground, or build-
ings. If the images have different regions, or if one image contains
a large region that is not in the other image, global transfers can
fail. Similar to Tai et al. [2006], we find that segmenting the im-
ages and identifying regional correspondences before color transfer
greatly improves the quality and robustness of the results. But in
contrast to their work, we use cosegmentation [Rother et al. 2006]
to segment and match regions in a single step. This approach is
better than segmenting each image independently and matching re-
gions after the fact because the content of all images is taken into
account during the cosegmentation process and matching regions
are automatically produced as a byproduct.
[Figure 4 column labels: CG input, Color model, Global histogram matching, Local color transfer]
Figure 4: Transferring image color using cosegmentation. On the left are CG images and real images that serve as color models; white lines are superimposed
on the images to denote the cosegmentation boundaries. On the right are results from two color transfer algorithms: a global algorithm based on N-dimensional
histogram matching, and our local color transfer algorithm. In the top example, the global result has a bluish color cast. In the bottom example, the global
result swaps the colors of the building and the sky. Local color transfer yields better results in both examples.
The cosegmentation approach of Rother et al. [2006] uses an NP-hard
energy function with terms to encode both spatial coherency
and appearance histograms. To optimize it, they present a novel
scheme that they call trust-region graph cuts. It uses an approximate
minimization technique to obtain an initial estimate and then refines
the estimate in the dual space to the original problem.
While our goal is similar to Rother et al., we take a simpler ap-
proach. Building upon the mean-shift framework [Fukunaga and
Hostetler 1975], we define a new feature vector with color, spatial,
and image-index terms. We can compute reasonable cosegmenta-
tions in seconds using a standard mean-shift implementation.
Our feature vector at every pixel p is the concatenation of the
pixel color in L*a*b* space, the normalized x and y coordinates at
p, and a binary indicator vector $(i_0, \ldots, i_k)$ such that $i_j$ is 1 when
pixel p is in the $j$-th image and 0 otherwise. Note that the problem
of segmenting a set of related images is different from the problem
of segmenting video—there is no notion of distance across the im-
age index dimension as there is in a video stream (i.e., there is no
time dimension). Thus, the final components of the feature vector
only differentiate between pixels that come from the same image
versus those that come from different images and do not introduce
an artificial distance along this dimension. In addition, the compo-
nents of the feature vector are weighted by three weights to balance
the color, spatial, and index components. We find that the weights
and the mean-shift bandwidth parameter do not need to be adjusted
for individual image sets to achieve the types of segmentations that
are useful to our color and tone transfer algorithms.
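To make the feature construction concrete, here is a minimal sketch of the per-pixel vector. The function name and the default weight values are hypothetical (the text does not give the actual weights), and the pixel color is assumed to be already converted to L*a*b*.

```python
def feature_vector(lab, x, y, width, height, image_index, num_images,
                   w_color=1.0, w_spatial=1.0, w_index=1.0):
    """Weighted concatenation of color, normalized position and image indicator.

    lab:         (L, a, b) color of the pixel (assumed precomputed).
    x, y:        integer pixel coordinates.
    image_index: which image of the set the pixel belongs to.
    The three weights are illustrative placeholders, not published values.
    """
    color = [w_color * c for c in lab]
    spatial = [w_spatial * x / max(width - 1, 1),
               w_spatial * y / max(height - 1, 1)]
    indicator = [w_index if j == image_index else 0.0
                 for j in range(num_images)]
    return color + spatial + indicator
```

Because the indicator is one-hot, two pixels from different images always differ by the same amount along the index components, whichever two images they come from. This matches the observation above that the index dimension should not introduce an artificial ordering or distance between images.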
A disadvantage of mean-shift is that it can be costly to compute
at every pixel of an image without using specific assumptions about
feature vectors or the kernel [Paris and Durand 2007]. Since we are after
coarse regional correspondences, we reduce the size of the image
by a factor of 8 along each dimension and use a standard
mean-shift algorithm with the feature vectors described above. We then
upsample the cosegmentation maps to full resolution using joint bi-
lateral upsampling [Kopf et al. 2007].
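For readers unfamiliar with mean-shift, the sketch below shows a basic flat-kernel variant operating on a small set of feature vectors, such as those of a heavily downsampled image set. It is an illustration under assumptions rather than the implementation used here: the flat kernel, the convergence tolerance, and the rule for merging nearby modes are our own choices, and the quadratic cost is acceptable only because the input is small.

```python
import math

def mean_shift(points, bandwidth, max_iter=50, tol=1e-3):
    """Flat-kernel mean shift; returns one cluster label per input point."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    modes = []
    for p in points:
        m = list(p)
        for _ in range(max_iter):
            # Move the estimate to the mean of all points within the bandwidth.
            neighbors = [q for q in points if dist(m, q) <= bandwidth]
            new_m = [sum(c) / len(neighbors) for c in zip(*neighbors)]
            shift = dist(m, new_m)
            m = new_m
            if shift < tol:
                break
        modes.append(m)

    # Assumption: modes closer than half a bandwidth are the same cluster.
    centers, labels = [], []
    for m in modes:
        for i, c in enumerate(centers):
            if dist(m, c) <= bandwidth / 2:
                labels.append(i)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return labels
```

When the points are pixels from several images, pixels that share a label across images form a matched region, so the correspondences fall out of the clustering itself.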
In Fig. 3, we show three cosegmentation results, each with three
images (one CG, two real). In the first two cases, the algorithm
segments the images into sky and non-sky. In the last case, the
images are segmented into three regions: ground, sky, and water.
In all cases, the segment correspondences are correct and the images
have not been over-segmented, although our color and tone transfer
algorithms are robust to over-segmentation.
4 Local Style Transfer Operators
After cosegmentation, we apply local style transfer operations for
color, tone and texture.
The simplest style transfer is color transfer, where colors of
the real images are transferred to the CG image by transferring
the statistics of a multi-dimensional histogram. This method was
shown to work quite well for color transfer between real images,
but it often fails when applied to CG images. The main difficulty
is that the color histogram of CG images is typically different from
the histogram of real images—it is much more sparse (fewer colors
are used). The sparsity and simplicity of the color distributions can
lead to instability during global transfer where colors are mapped
arbitrarily, as shown in the bottom row of Fig. 4.
We mitigate these problems by a combination of joint bilateral
upsampling and local color transfer. We downsample the images,
compute the color transfer offsets per region from the lower resolu-
tion images, and then smooth and upsample the offsets using joint
bilateral upsampling. Working on regions addresses the problem of
images that contain different proportions of colors and joint bilat-
eral upsampling smooths color transfer in the spatial domain.
Within each sub-sampled region, our color transfer algorithm
uses 2D histogram matching on the a* and b* channels, and 1D his-
togram matching on the L* channel. The advantage of histogram
matching methods is that they do not require per pixel correspon-
dences, which we do not have. Unfortunately, unlike 1D histogram
matching, there is no closed form solution for 2D histogram trans-
fer. We use an iterative algorithm by Pitié et al. [2005] that projects
the 2D histogram onto random 1D axes, performs standard 1D his-
togram matching, and reprojects the data back to 2D. The algo-
rithm typically converges in fewer than 10 iterations. We found
that marginalizing the distributions and performing the remapping
independently for the a* and b* channels produces inferior results.
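The two matching steps can be sketched as follows: rank-based 1D matching (as would be applied to L*), and an iterative 2D transfer for (a*, b*) in the spirit of Pitié et al. that rotates to a random orthogonal basis, matches each 1D marginal, and rotates back. This is a simplified sketch under assumptions: it works on raw samples rather than binned histograms, the function names are hypothetical, and the fixed iteration count and seed are illustrative.

```python
import math
import random

def match_1d(source, target):
    """Map each source value to the target value at the same quantile."""
    order = sorted(range(len(source)), key=lambda i: source[i])
    tgt = sorted(target)
    n, m = len(source), len(tgt)
    out = [0.0] * n
    for rank, i in enumerate(order):
        # Rescale the rank in [0, n-1] to an index in [0, m-1].
        j = round(rank * (m - 1) / (n - 1)) if n > 1 else 0
        out[i] = tgt[j]
    return out

def match_2d(source, target, iterations=10, seed=0):
    """Iteratively match a 2D point distribution via random 1D projections."""
    rng = random.Random(seed)
    pts = [tuple(p) for p in source]
    for _ in range(iterations):
        theta = rng.uniform(0.0, math.pi)
        # Two orthonormal axes obtained by rotating the coordinate frame.
        axes = [(math.cos(theta), math.sin(theta)),
                (-math.sin(theta), math.cos(theta))]
        src_proj = [[p[0] * ax + p[1] * ay for p in pts] for ax, ay in axes]
        tgt_proj = [[p[0] * ax + p[1] * ay for p in target] for ax, ay in axes]
        new_proj = [match_1d(s, t) for s, t in zip(src_proj, tgt_proj)]
        # Rotate the matched coordinates back to the original frame.
        pts = [(u * axes[0][0] + v * axes[1][0],
                u * axes[0][1] + v * axes[1][1])
               for u, v in zip(new_proj[0], new_proj[1])]
    return pts
```

Matching along rotated axes lets the transfer account for correlations between a* and b*, which independent per-channel remapping cannot capture.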
In Fig. 4, we show two examples of color transfer. From left
to right, we show the original CG images, the real images used
as color models, the results of global N-dimensional color transfer
(in L*a*b* space), and results of our region-based transfer. In the