SwiftSketch: A Diffusion Model for Image-to-Vector Sketch Generation
Ellie Arar¹  Yarden Frenkel¹  Daniel Cohen-Or¹  Ariel Shamir²  Yael Vinker¹,³
1Tel Aviv University
{elliearar, Yf2, dcor}@mail.tau.ac.il
2Reichman University
arik@runi.ac.il
3MIT
yaelvink@mit.edu
https://swiftsketch.github.io/
Figure 1. SwiftSketch is a diffusion model that generates vector sketches by denoising a Gaussian in stroke coordinate space (top). It
generalizes effectively across diverse classes and takes under a second to produce a single high-quality sketch (bottom).
Abstract
Recent advancements in large vision-language models have enabled highly expressive and diverse vector sketch generation. However, state-of-the-art methods rely on a time-consuming optimization process involving repeated feedback from a pretrained model to determine stroke placement. Consequently, despite producing impressive sketches, these methods are limited in practical applications. In this work, we introduce SwiftSketch, a diffusion model for image-conditioned vector sketch generation that can produce high-quality sketches in less than a second. SwiftSketch operates by progressively denoising stroke control points sampled from a Gaussian distribution. Its transformer-decoder architecture is designed to effectively handle the discrete nature of vector representation and capture the inherent global dependencies between strokes. To train SwiftSketch, we construct a synthetic dataset of image-sketch pairs, addressing the limitations of existing sketch datasets, which are often created by non-artists and lack professional quality. For generating these synthetic sketches, we introduce ControlSketch, a method that enhances SDS-based techniques by incorporating precise spatial control through a depth-aware ControlNet. We demonstrate that SwiftSketch generalizes across diverse concepts, efficiently producing sketches that combine high fidelity with a natural and visually appealing style.
1. Introduction
In recent years, several works have explored the task of gen-
erating sketches from images, tackling both scene-level and
object-level sketching [6,46,47,55]. This task involves
transforming an input image into a line drawing that cap-
tures its key features, such as structure, contours, and over-
all visual essence. Sketches can be represented as pixels
or vector graphics, with the latter often preferred for their
resolution independence, enhanced editability, and ability
to capture sketches’ sequential and abstract nature. Exist-
ing vector sketch generation methods often involve train-
ing a network to learn the distribution of human-drawn
sketches [56]. However, collecting human-drawn sketch
datasets is labor-intensive, and crowd-sourced contributors
often lack artistic expertise, resulting in datasets that pri-
marily feature amateur-style sketches (Fig. 2, left). On the
other hand, sketch datasets created by professional design-
ers or artists are typically limited in scale, comprising only
a few hundred samples, and are often restricted to specific
domains, such as portraits or product design (Fig. 2, right).
Therefore, existing data-driven sketch generation methods
are often restricted to specific domains or reflect a non-
professional style present in the training data.
With recent advancements in Vision-Language Models
(VLMs) [57], new approaches have emerged in the sketch
domain, shifting sketch generation from reliance on human-
drawn datasets to leveraging the priors of pretrained mod-
els [11,46,47,54]. These methods generate professional-
looking sketches by optimizing parametric curves to repre-
sent an input concept, guided by the pretrained VLM. How-
ever, they have a significant drawback: The generation pro-
cess depends on repeated feedback (backpropagation) from
the pretrained model, which is inherently time-consuming,
often requiring anywhere from several minutes to over an hour to
produce a single sketch. This makes these approaches im-
practical for interactive applications or for tasks that require
large-scale sketch data generation.
In this work, we introduce SwiftSketch, a diffusion-based
object sketching method capable of generating high-quality
vector sketches in under a second per sketch. SwiftSketch
can generalize across a wide range of concepts and produce
sketches with high fidelity to the input image (see Figure 1).
Inspired by recent advancements in diffusion models for
non-pixel data [29,43,44], we train a diffusion model that
learns to map a Gaussian distribution in the space of stroke
coordinates to the data distribution (see Figure 1, top). To
address the discrete nature of vector graphics and the com-
plex global topological relationships between shapes, we
employ a transformer-decoder architecture with self- and
cross-attention layers, trained to reconstruct ground truth
sketches in both vector and pixel spaces. The image con-
dition is integrated into the generation process through the
cross-attention mechanism, where meaningful features are
first extracted from the input image using a pretrained CLIP
image encoder [36].
With the lack of available professional-quality paired
vector sketch datasets, we construct a synthetic dataset to
train our network. The input images are generated with
SDXL [33], and their corresponding vector sketches are
produced with a novel optimization-based technique we in-
troduce called ControlSketch. ControlSketch enhances the
SDS loss [34], commonly used for text-conditioned gen-
eration, by integrating a depth ControlNet [58] into the
loss, enabling object sketch generation with spatial con-
trol. Our dataset comprises over 35,000 high-quality vec-
tor sketches across 100 classes and is designed for scala-
bility. We demonstrate SwiftSketch’s capability to generate
high-quality vector sketches of diverse concepts, balancing
fidelity to input images and the abstract appearance of nat-
ural sketches.
2. Related Work
Sketch Datasets Existing sketch datasets are primarily
composed of human-drawn sketches, and are designed to
accomplish different sketching tasks. Class-conditioned
datasets [10,16] are particularly common, with the largest
being the QuickDraw dataset [16], containing 50 million
sketches spanning 345 categories. Datasets of image-
referenced sketches cover a spectrum of styles, including
Figure 2. Amateur vs. Professional Sketches. (a) QuickDraw [16]
and (b) Sketchy [40] are large-scale datasets, with Sketchy offering
more fine-grained sketches, though both exhibit non-professional
style. (c) OpenSketch [15] and (d) Berger et al. [3] contain pro-
fessional sketches but are limited in scale and focus on specific
domains.
image trace and contours [1,10,25,50], or more abstract
but still fine-grained depictions [14,40], and very abstract
sketches [31]. These large-scale datasets are often created
by non-artists. Efforts have been made to collect sketches
from professionals [3,15,17,53], but these datasets are
often smaller in scale, and are limited to specific domains
like portraits [3] or household items [15]. These constraints
make them unsuitable for training generative models that
can generalize broadly to diverse concepts.
Data-Driven Sketch Generation These datasets have fa-
cilitated data-driven approaches for various sketch-related
tasks [56]. Multiple generative frameworks and archi-
tectures have been explored for vector sketch generation,
including RNNs [16], BERT [27], Transformers [4,37],
CNNs [8,23,42], LSTMs [35,42], GANs [45], reinforce-
ment learning [30,60], and diffusion models [49]. How-
ever, these methods are fundamentally designed to oper-
ate in a class-conditioned manner, restricting their ability to
generate sketches to only the classes included in the train-
ing data. Additionally, they rely on crowd-sourced datasets
which contain non-professional sketches, restricting their
ability to handle more complex or artistic styles. On the
other hand, existing works for generating more profession-
ally looking sketches are either restricted to specific do-
mains [28] or can only generate sketches in pixel space
[6,25]. Note that image-to-sketch generation can be formulated as a style transfer task, and recent works that employ text-to-image diffusion priors achieve highly artistic results with high fidelity [12,18,48]; however, all of these
works also operate only in pixel space. In contrast, we fo-
cus on vector sketches due to their resolution independence,
smooth and clean appearance, control over abstraction, and
editable nature.
VLMs for Vector Sketches To reduce reliance on exist-
ing vector datasets, recent research leverages the rich priors
of large pre-trained vision-language models (VLMs) in a
zero-shot manner. Early methods [11,46,47] utilize CLIP
[36] as the backbone for image- and text-conditioned gen-
eration. These approaches iteratively optimize a randomly
initialized set of strokes using a differentiable renderer [26]
to bridge the gap between vector and pixel representations.
More recently, text-to-image diffusion models [38] have
been employed as backbones, with the SDS loss [34] used
to guide the optimization process, achieving superior results
[22,54,55]. However, the use of the SDS loss has so far
been limited to text-conditioned generation. While these
approaches yield highly artistic results across diverse con-
cepts, they are computationally expensive, relying on itera-
tive backpropagation.
Diffusion Models for Non-Pixel Data Diffusion models
have emerged as a powerful generative framework, extend-
ing their impact beyond traditional pixel-based data. Re-
cent research demonstrates their versatility across diverse
domains, including tasks such as human motion synthesis
[43], 3D point cloud generation [21,29], and object detec-
tion reframed as a generative process [7]. Some prior works
have explored diffusion models for vector graphics synthe-
sis. VecFusion [44] uses a two-stage diffusion process for
vector font generation but its architecture and vector rep-
resentation are highly complex and specialized for fonts,
limiting adaptability to other vector tasks. SketchKnitter
[49] and Ashcroft et al. [2] generate vector sketches us-
ing a diffusion-based model trained on the QuickDraw and Anime-Vec10k datasets, but without conditioning on images
or text inputs.
3. Preliminaries
Diffusion Models Diffusion models [20,41] are a class of generative models that learn a distribution by gradually denoising a Gaussian. They consist of a forward process $q(x_t \mid x_{t-1})$ that progressively noises data samples $x_0 \sim p_{\text{data}}$ at timesteps $t \in [1, T]$, and a backward (reverse) process $p(x_{t-1} \mid x_t)$ that progressively cleans the noised signal. The reverse process is the generative process and is approximated with a neural network $\epsilon_\theta(x_t, t)$. During training, a noised signal at different timesteps is derived from a sample $x_0$ as follows:
$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad (1)$$
where $\epsilon \sim \mathcal{N}(0, I)$, and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$ is called the noise scheduler. The common approach for training the model is with the following simplified objective:
$$\mathcal{L}_{\text{simple}} = \mathbb{E}_{x_0 \sim q(x_0),\, \epsilon \sim \mathcal{N}(0, I),\, t \sim \mathcal{U}(1, T)} \left[ \lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2 \right]. \qquad (2)$$
At inference, to generate a new sample, the process starts with Gaussian noise $x_T \sim \mathcal{N}(0, I)$ and the denoising network is applied iteratively for $T$ steps, yielding a final sample $x_0$.
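To make the notation above concrete, a minimal PyTorch sketch of the forward noising of Eq. (1) and the simplified objective of Eq. (2) is given below (illustrative only; tensor shapes and function names are our own assumptions, not from a released implementation):

```python
import torch
import torch.nn.functional as F

def forward_noise(x0, t, alpha_bar):
    """Eq. (1): x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    eps = torch.randn_like(x0)
    abar_t = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast over sample dims
    xt = abar_t.sqrt() * x0 + (1.0 - abar_t).sqrt() * eps
    return xt, eps

def diffusion_loss(eps_model, x0, alpha_bar, T):
    """Eq. (2): L_simple = E[ || eps - eps_theta(x_t, t) ||^2 ]."""
    t = torch.randint(1, T + 1, (x0.shape[0],), device=x0.device)  # t ~ U(1, T)
    xt, eps = forward_noise(x0, t - 1, alpha_bar)  # alpha_bar is 0-indexed over t = 1..T
    return F.mse_loss(eps_model(xt, t), eps)
```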
SDS Loss The Score Distillation Sampling (SDS) loss [34] is used to extract signals from a pretrained text-to-image diffusion model to optimize a parametric representation. For vector graphics, the parameters $\phi$ defining an SVG can be optimized using the SDS loss to represent a desired textual concept. A differentiable rasterizer [26] rasterizes $\phi$ into a pixel image $x$, which is then noised to produce $x_t$ at a sampled timestep $t$. This noised image, conditioned on a text prompt $y$, is passed through the pretrained diffusion model, $\epsilon_\theta(x_t, t, y)$. The deviation of the model's noise prediction from the injected noise, as in the diffusion loss of Eq. (2), is used to approximate the gradients of the image synthesis parameters $\phi$, so that the rendered output better aligns with the conditioning prompt. Specifically, the gradient of the SDS loss is defined as:
$$\nabla_\phi \mathcal{L}_{\text{SDS}} = w(t)\,\big(\epsilon_\theta(x_t, t, y) - \epsilon\big)\,\frac{\partial x}{\partial \phi}, \qquad (3)$$
where $w(t)$ is a constant that depends on $\alpha_t$. This optimization process iteratively adjusts the parametric model.
4. Method
Our method consists of three key components: (1) Con-
trolSketch, an optimization-based technique for generating
high-quality vector sketches of input objects; (2) a syn-
thetic paired image-sketch dataset, created using ControlS-
ketch; and (3) SwiftSketch, a diffusion model trained on our
dataset for efficient sketch generation.
4.1. ControlSketch
Given an input image $I$ depicting an object, our goal is to generate a corresponding sketch $S$ that maintains high fidelity to the input while preserving a natural sketch-like appearance. Following common practice in the field, we define $S$ as a set of $n$ strokes $\{s_i\}_{i=1}^{n}$, where each stroke is a two-dimensional cubic Bézier curve: $s_i = \{p_j^i\}_{j=1}^{4} = \{(x_j, y_j)_i\}_{j=1}^{4}$. We optimize the set of strokes using the
standard SDS-based optimization pipeline, as described in
Section 3, with two key enhancements: an improved stroke
initialization process and the introduction of spatial control.
Our process relies on the image's attention map $I_{\text{attn}}$, depth map $I_{\text{depth}}$, and caption $y$, extracted using DDIM inversion [41], MiDaS [5], and BLIP2 [24], respectively. While pre-
vious approaches [47,54] sample initial stroke locations
based on the image’s attention map, we observe that this
method often results in missing areas in the output sketch,
especially when spatial control is applied. To address this,
we propose an enhanced initialization method (see Fig. 3,
left) that ensures better coverage. We divide the object area
into $k = 6$ equal-area regions (Fig. 3c), using a weighted K-Means method that accounts for both attention weights and pixel locations. We distribute $n/2$ points equally across the regions, while the remaining $n/2$ points are allocated proportionally to the average attention value in each region. This
Figure 3. ControlSketch Pipeline. Left: The object area is divided into $k$ regions (c), with $n$ points distributed based on attention values from (b) while ensuring a minimum allocation per region. (d) The initial strokes are derived from these points. Right: The initial strokes are iteratively optimized to form the sketch. At each iteration, the rasterized sketch is noised based on $t$ and $\epsilon$ and fed into a diffusion model with a depth ControlNet conditioned on the image's depth and caption $y$. The predicted noise $\hat{\epsilon}$ is used for the SDS loss.
Figure 4. (a) Input image. (b) Object mask. (c) The object’s con-
tour is extracted from the mask using morphological operations,
and sketch pixels that intersect with the contour are given higher
weight. (d) Attention map. (e) We sort the strokes based on a
combination of contour intersection count and attention score. (f)
A visualization of the first 16 strokes in the ordered sketch, demon-
strating the effectiveness of our sorting scheme.
means that more points are assigned to regions with higher
attention. Within each region, the points are evenly spaced
to further ensure good coverage. This process determines
the location of the initial set of strokes’ control points to be
optimized, as demonstrated in Figure 3d.
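For illustration, the allocation scheme described above could be implemented roughly as follows (a sketch under our own assumptions: a scikit-learn weighted k-means stands in for the weighted K-Means step, and points are sampled within each region rather than evenly spaced as in our method):

```python
import numpy as np
from sklearn.cluster import KMeans

def init_stroke_points(attn, mask, n_strokes=32, k=6, seed=0):
    """Split the object area into k regions with attention-weighted k-means,
    then allocate n_strokes starting points: roughly half uniformly across
    regions, half proportionally to each region's mean attention."""
    ys, xs = np.nonzero(mask)                          # pixels inside the object
    coords = np.stack([xs, ys], axis=1).astype(float)
    weights = attn[ys, xs] + 1e-6                      # attention value per pixel
    km = KMeans(n_clusters=k, n_init=10, random_state=seed)
    labels = km.fit_predict(coords, sample_weight=weights)

    mean_attn = np.array([weights[labels == r].mean() for r in range(k)])
    base = n_strokes // 2 // k                         # uniform share per region
    prop = np.floor(n_strokes / 2 * mean_attn / mean_attn.sum()).astype(int)
    counts = base + prop
    counts[np.argmax(mean_attn)] += n_strokes - counts.sum()   # fix rounding

    rng = np.random.default_rng(seed)
    points = []
    for r in range(k):
        idx = rng.choice(np.nonzero(labels == r)[0], size=counts[r], replace=True)
        points.append(coords[idx])
    return np.concatenate(points, axis=0)              # (n_strokes, 2) initial locations
```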
The stroke optimization process is depicted in Figure 3,
right. At each optimization step, the rasterized sketch
$\mathcal{R}(\{s_i\}_{i=1}^{n})$ is noised based on $t$ and $\epsilon$, then fed into a depth ControlNet text-to-image diffusion model [58]. The model predicts the noise $\hat{\epsilon}$ conditioned on the caption $y$ and the depth map $I_{\text{depth}}$. We balance the weighting between the spatial and textual conditions to achieve an optimal trade-off between "semantic" fidelity, derived from $y$ (ensuring the sketch is recognizable), and "geometric" fidelity, derived from $I_{\text{depth}}$, which governs the accuracy of the spatial structure.
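A single ControlSketch optimization step can be sketched as follows (illustrative only; it assumes the diffusers ControlNet/UNet interfaces for injecting the depth condition, a differentiable rasterizer such as diffvg for $\mathcal{R}$, and a preprocessed three-channel depth map; the timestep range and the weighting $w(t)$ are simplified):

```python
import torch

def controlsketch_sds_step(svg_params, rasterize, vae, unet, controlnet,
                           text_emb, depth_map, alpha_bar, T, cond_scale=1.5):
    """One SDS step with spatial control: rasterize the strokes, noise the
    latent, predict the noise with a depth-ControlNet-conditioned UNet, and
    backpropagate the (stop-gradient) noise residual into the stroke points."""
    img = rasterize(svg_params)                     # differentiable raster, (1, 3, H, W) in [0, 1]
    z0 = vae.encode(img * 2 - 1).latent_dist.sample() * vae.config.scaling_factor

    t = torch.randint(50, int(0.98 * T), (1,), device=z0.device)   # sampled timestep
    eps = torch.randn_like(z0)
    a = alpha_bar[t].view(-1, 1, 1, 1)
    zt = a.sqrt() * z0 + (1 - a).sqrt() * eps       # Eq. (1), applied in latent space

    with torch.no_grad():                           # the pretrained models stay frozen
        down_res, mid_res = controlnet(zt, t, encoder_hidden_states=text_emb,
                                       controlnet_cond=depth_map,
                                       conditioning_scale=cond_scale,
                                       return_dict=False)
        eps_hat = unet(zt, t, encoder_hidden_states=text_emb,
                       down_block_additional_residuals=down_res,
                       mid_block_additional_residual=mid_res).sample

    grad = eps_hat - eps                            # Eq. (3) with w(t) = 1
    loss = (grad.detach() * z0).sum()               # routes grad into the strokes via R and the VAE encoder
    loss.backward()
```

A full implementation would typically also apply classifier-free guidance to the noise prediction and a timestep-dependent weight $w(t)$; both are omitted here for brevity.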
4.2. The ControlSketch Dataset
We utilize ControlSketch to generate a paired image-vector
sketch dataset. Each data sample comprises the set $\{I, I_{\text{attn}}, I_{\text{depth}}, I_{\text{mask}}, S, c, y\}$, which includes, respectively,
the image, its attention map, depth map, and object mask,
along with the corresponding vector sketch of the object,
class label, and caption. To generate the images, we uti-
lize SDXL [33], along with a prompt template designed to
produce images for each desired class $c$ (an example of a
generated image for the class “lion” is shown in Figure 3a).
We then apply ControlSketch on the masked images to gen-
erate the corresponding vector sketches. Additional de-
tails are provided in the supplementary. Optimization-based
methods, such as ControlSketch, do not impose an inherent
stroke ordering. Learning an internal stroke order enables
the generation of sketches with varying levels of abstrac-
tion by controlling the number of strokes generated. Thus,
we propose a heuristic stroke-sorting scheme that prioritizes
contour strokes and those depicting salient regions (illus-
trated in Figure 4). Consequently, each vector sketch $S$ is represented as an ordered sequence of strokes $(s_1, \ldots, s_n)$.
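A possible realization of this sorting heuristic is sketched below (illustrative; the relative weighting of contour hits versus attention is our own choice, as the text above does not fix it):

```python
import numpy as np

def sort_strokes(strokes_pixels, contour_mask, attn, w_contour=1.0, w_attn=1.0):
    """Order strokes by a score combining how many of their rasterized pixels
    lie on the object contour and their average attention value.

    strokes_pixels: list of (m_i, 2) integer pixel coordinates per rasterized stroke.
    contour_mask:   binary map of the object's contour (from morphological ops on the mask).
    attn:           normalized attention map in [0, 1].
    """
    scores = []
    for pix in strokes_pixels:
        ys, xs = pix[:, 1], pix[:, 0]
        contour_hits = contour_mask[ys, xs].sum() / len(pix)   # fraction of pixels on the contour
        attn_score = attn[ys, xs].mean()                       # saliency of the stroke
        scores.append(w_contour * contour_hits + w_attn * attn_score)
    order = np.argsort(scores)[::-1]                           # most important strokes first
    return order
```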
4.3. SwiftSketch
We utilize the ControlSketch dataset to train a generative
model $M_\theta$ that learns to efficiently produce a vector sketch from an input image $I$. We define $M_\theta$ as a transformer decoder to account for the discrete and long-range dependencies inherent in vector sketches. The training of $M_\theta$ follows
the standard conditional diffusion framework, as outlined
in Section 3, with task-specific modifications to address the
characteristics of vector data and the image-to-sketch task.
In our case, the model learns to denoise the set of (x, y)
coordinates that define the strokes in the sketch.
The training process is depicted in Figure 5. At each it-
eration, a pair (I , S0)is sampled from the dataset, where
S0R2×4×nis the clean sketch in vector representation,
and R(S0)denotes the corresponding rasterized sketch in
pixel space, with Rbeing a differentiable rasterizer [26].
The image Iis processed using a pretrained CLIP ResNet
model [36], where features are extracted from its fourth
layer, recognized for effectively capturing both geometric
and semantic information [47]. These features are then re-
fined through a lightweight CNN to enhance learning and
align dimensions for compatibility with Mθ. This process
yields the image embedding Ie. At each iteration, we sam-
ple a timestep t U(1, T )and noise ϵ N(0,I)to define
St:
St=¯αtS0+1¯αtϵ, (4)
where ¯αtis the noise scheduler as a function of t. As
illustrated in Figure 5,Strepresents a noised version of
S0in vector space, with the level of noise determined by
the timestep t. The control points {st
1,...st
n}are fed into
Figure 5. SwiftSketch Training Pipeline. At each training iteration, an image $I$ is passed through a frozen CLIP image encoder, followed by a lightweight CNN, to produce the image embedding $I_e$. The corresponding vector sketch $S_0$ is noised based on the sampled timestep $t$ and noise $\epsilon$, forming $S_t$ (with $\mathcal{R}(S_t)$ illustrating the rasterized noised sketch, which is not used in training). The network $M_\theta$, a transformer decoder, receives the noised signal $S_t$ and is tasked with predicting the clean signal $\hat{S}_0$, conditioned on the image embedding $I_e$ and the timestep $t$ (fed through the cross-attention mechanism). The network is trained with two loss functions: one based on the distance between the control points and the other on the similarity of the rasterized sketches.
Figure 6. Inference Process. Starting with randomly sampled Gaussian noise $S_T \sim \mathcal{N}(0, I)$, the model $M_\theta$ predicts the clean sketch $\hat{S}_0 = M_\theta(S_t, t, I_e)$ at each step $t$, which is then re-noised to $S_{t-1}$. This iterative process is repeated for $T$ steps and is followed by a final feed-forward pass through a refinement network, a trainable copy of $M_\theta$ specifically trained to correct very small residual noise.
the network $M_\theta$, where they are first encoded via a linear layer (depicted in green), and combined with a standard positional embedding before being passed through the transformer decoder (in pink), which consists of 8 layers of cross-attention and self-attention. The encoded timestep $t$ and image features $I_e$ are fed into the transformer through the cross-attention mechanism. The decoder output is projected back to the original points dimension through a linear layer, yielding the prediction $M_\theta(S_t, t, I_e) = \hat{S}_0$.
We train $M_\theta$ with two training objectives, $\mathcal{L}_{\text{points}}$ and $\mathcal{L}_{\text{raster}}$, applied on both the vector and raster representations of the sketch:
$$\mathcal{L}_{\text{points}} = \lVert S_0 - \hat{S}_0 \rVert_1 = \sum_{i=1}^{n} \lVert s_i^0 - \hat{s}_i^0 \rVert_1, \qquad \mathcal{L}_{\text{raster}} = \mathcal{L}_{\text{LPIPS}}\big(\mathcal{R}(S_0), \mathcal{R}(\hat{S}_0)\big), \qquad (5)$$
where $\mathcal{L}_{\text{points}}$ is defined by the $L_1$ distance between the sorted control points of the ground truth sketch $S_0$ and the predicted sketch $\hat{S}_0$, and $\mathcal{L}_{\text{raster}}$ is the LPIPS distance [59] between the rasterized sketches. $\mathcal{L}_{\text{points}}$ encourages per-stroke precision, while $\mathcal{L}_{\text{raster}}$ encourages the generated sketch to align well with the overall structure of the ground truth sketch. Together, our training loss is $\mathcal{L} = \mathcal{L}_{\text{points}} + \lambda \mathcal{L}_{\text{raster}}$, with $\lambda = 0.2$.
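In code, the combined objective of Eq. (5) reduces to a few lines (hedged sketch; `lpips` stands for any LPIPS implementation, such as the lpips package, and `rasterize` for the differentiable rasterizer $\mathcal{R}$):

```python
def swiftsketch_loss(s0, s0_hat, rasterize, lpips, lam=0.2):
    """L = L_points + lambda * L_raster, Eq. (5)."""
    l_points = (s0 - s0_hat).abs().mean()                       # L1 over sorted control points
    l_raster = lpips(rasterize(s0), rasterize(s0_hat)).mean()   # LPIPS between rendered sketches
    return l_points + lam * l_raster
```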
As is common practice, to apply classifier-free guidance [19] at inference, we train $M_\theta$ to learn both the conditioned and the unconditioned distributions by randomly setting $I = \varnothing$ for 10% of the training steps.
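The condition dropout amounts to replacing the image embedding with a null embedding for a random 10% of the batch, for instance (illustrative sketch; the shape conventions are ours):

```python
import torch

def drop_condition(img_emb, null_emb, p=0.1):
    """Randomly replace the image embedding with a null embedding so the model
    also learns the unconditional distribution (for classifier-free guidance)."""
    keep = (torch.rand(img_emb.shape[0], device=img_emb.device) > p).float()
    return keep.view(-1, 1, 1) * img_emb + (1 - keep).view(-1, 1, 1) * null_emb
```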
The inference process is illustrated in Figure 6. The model $M_\theta$ generates a new sketch by progressively denoising randomly sampled Gaussian noise, $S_T \sim \mathcal{N}(0, I)$. At each step $t$, $M_\theta$ predicts the clean sketch $\hat{S}_0 = M_\theta(S_t, t, I_e)$, conditioned on the image embedding $I_e$ and time step $t$. The next intermediate sketch, $S_{t-1}$, is derived from $\hat{S}_0$ using Equation (4). This process is repeated for $T$ steps. We observe that the final output sketches from the denoising process may retain slight noise. This is likely because the network prioritizes learning to clean heavily noised signals during training, while small inaccuracies in control point locations have a smaller impact on the loss function, leading to reduced precision at finer timesteps. To address this, we introduce a refinement stage, where
Figure 7. Examples of the denoising process. From left to right: strokes’ control points are sampled from a Gaussian distribution, and our
network progressively refines the signal to generate a sketch.
Figure 8. Stroke Order Visualization. Generated sketches are vi-
sualized progressively, with the stroke count shown on top. Early
strokes capture the object’s contour and key features, while later
strokes add finer details.
a learned copy of our network $M_\theta$ is fine-tuned to perform an additional cleaning step. This refinement network
is trained in a manner similar to the original model, with
the objective of denoising a slightly noised sketch, condi-
tioned on the same input image features, while the timestep
condition is fixed at 0. More details are provided in the
supplementary. This refinement stage is inspired by simi-
lar strategies employed in the pixel domain [33,39], where
additional processing steps are used to improve the quality
and resolution of generated images. As illustrated in Fig-
ure 6, after the final denoising step of Mθis applied, ˆ
S0is
passed through Mθto perform the additional refinement.
4.4. Implementation Details
ControlSketch requires approximately 2,000 steps to con-
verge, taking around 10 minutes on a standard RTX3090
GPU. SwiftSketch is trained with T = 50 noising steps to
support fast generation. To encourage the model to focus
on fine details, we adjust the noise scheduler to perturb the
signal more subtly for small timesteps compared to the co-
sine noise schedule proposed in [32]. The model is trained
on images from 15 classes, with 1,000 samples per class.
The training process spans 400K steps, requiring approxi-
mately six days on a single A100 GPU. At inference, we
use a guidance scale of 2.5. Our synthetic dataset includes
an additional 200 test samples for the 15 training classes, as
well as 85 additional object categories, each with 200 sam-
ples. Additional implementation details, as well as detailed
class labels and dataset visualizations are provided in the
supplementary material.
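The schedule adjustment mentioned above could, for example, be realized by warping the normalized timestep inside a cosine-style schedule (this is purely our own illustration of the idea, not the exact schedule used):

```python
import math
import torch

def subtle_cosine_alpha_bar(T=50, s=0.008, p=1.5):
    """A cosine-style schedule where the normalized timestep is raised to a
    power p > 1, so alpha_bar stays closer to 1 (i.e., less noise is added)
    at small timesteps than with the standard cosine schedule (p = 1)."""
    def f(u):
        return math.cos((u + s) / (1 + s) * math.pi / 2) ** 2
    warped = [(t / T) ** p for t in range(T + 1)]
    alpha_bar = torch.tensor([f(u) / f(0.0) for u in warped])
    return alpha_bar[1:]          # alpha_bar for t = 1..T
```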
5. Results
We begin by showcasing SwiftSketch’s ability to generate
high-quality vector sketches for a diverse set of input im-
ages. SwiftSketch successfully generalizes to unseen im-
ages within the training categories (Figure 11), creating
sketches that depict the input images well while demon-
strating a plausible and detailed appearance. On images
of unseen categories that pose greater challenges, SwiftS-
ketch effectively captures the essential features of the in-
put images, producing abstract yet faithful representations
(Figure 12). Notably, all sketches are provided in vector
format, and are generated in just 50 diffusion steps, fol-
lowed by a single refinement step, with the entire process
taking less than one second. In Figure 7, we illustrate the
denoising steps of the generation process, starting from a
Gaussian distribution and progressively refining towards the
data distribution. In Figure 8, we demonstrate the ability of
our method to create level-of-abstraction using our ordered
stroke technique. We visualize the progressive addition of
strokes in the sequence they appear in the output SVG file.
Note how the first strokes already convey the intended con-
cept effectively. Additional results of both SwiftSketch and
ControlSketch are available in the supplementary.
5.1. Comparisons
We evaluate the performance of SwiftSketch and ControlS-
ketch with respect to state-of-the-art methods for image-to-
sketch generation, including Photo-Sketching [25], Chan et
al. [6], InstantStyle [48], and CLIPasso [47]. InstantStyle
Time | Pixel/Vector: XDoG 0.1 sec. (P), Chan et al. 0.04 sec. (P), InstantStyle 1 min. (P), Photo-Sketching 0.6 sec. (P), CLIPasso 5 min. (V), ControlSketch 10 min. (V), SwiftSketch 0.5 sec. (V)
Figure 9. Qualitative Comparison. Input images are shown on the left, with the time required to produce a single sketch and whether the
sketches are in Pixel or Vector format indicated at the top. From left to right, the sketches are generated using XDoG [52], PhotoSketching
[25], Chan et al. [6] (in anime style), InstantStyle [48], and CLIPasso [47]. On the right are the resulting sketches from our proposed
methods, ControlSketch and SwiftSketch.
is applied with a sketch image as the style reference. Fig-
ure 9shows representative results from each method, with
XDoG [52], a classic edge detection technique, shown on
the left as a baseline. The sketches of Chan et al. and In-
stantStyle are detailed and align well with the overall struc-
ture of the input images. However, note that they closely
follow the edge maps shown on the left. The sketches of
Photo-Sketching (fifth column) are more abstract, but can
fail to effectively capture the images’ content in a natu-
ral way. While these approaches are efficient, producing
sketches in less than a minute, they focus on generating
raster sketches. In contrast, our method produces vector
sketches, which are resolution-independent, easily editable,
and exhibit a smooth, clean style. CLIPasso (sixth column)
generates vector sketches that achieve a good balance be-
tween fidelity and semantics. However, it is significantly
slower, requiring 5 minutes to produce a single sketch, and
it may introduce artifacts, such as the noisy overlapping
strokes observed in the robot example. ControlSketch (sev-
enth column) produces high-fidelity sketches that remain
abstract, smooth, and natural, effectively depicting the in-
put images while avoiding artifacts. However, it is even
slower than CLIPasso, as SDS-based methods generally re-
quire more time to converge, making it impractical for in-
teractive applications. SwiftSketch, shown in the rightmost
column, successfully learns the data distribution from Con-
trolSketch samples, enabling it to produce sketches that ap-
proach the quality of optimization-based techniques but in
real time. Additional results are available in the supplementary material.
Quantitative Evaluation We sample 4,000 images from
our dataset (2,000 from our test set of categories seen dur-
ing training and 2,000 from unseen categories) and an addi-
tional 2,000 images from the SketchyCOCO [14] dataset to
assess generalization on external data. Each set consists of
10 randomly selected categories with 200 images per cate-
gory. Following common practice in the field, we use the
CLIP zero-shot classifier [36] to assess class-level recogni-
tion, MS-SSIM [51] for image-sketch fidelity following the
settings proposed in CLIPascene [46], and DreamSim [13].
The results are presented in Table 1, where scores for each
data type are reported separately, with human sketches from
the SketchyCOCO dataset included as a baseline. Chan et
al. and InstantStyle achieve the highest scores across most
metrics due to their highly detailed sketches, which closely
resemble the image’s edge map. This level of detail ensures
that their sketches are both easily recognizable as depict-
ing the correct class (as indicated by the CLIP score) and
exhibit high fidelity (as reflected in other measurements).
The results show that SwiftSketch generalizes well to test
set images from seen categories, as evidenced by its similar
scores to ControlSketch (which serves as the ground truth in
our case). However, its performance decreases for unseen
categories, particularly in class-based recognition. This is
especially apparent on the SketchyCOCO dataset, which is
highly challenging due to its low-resolution images and dif-
ficult lighting conditions. It is important to note that SwiftS-
ketch is trained on only 15 image categories due to limited
resources, suggesting that more extensive training could im-
prove its generalization capabilities.
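For reference, the CLIP-based recognition metric can be computed along the following lines (a sketch using the Hugging Face transformers CLIP interface; the prompt template and preprocessing are our own assumptions):

```python
import torch
from transformers import CLIPModel, CLIPProcessor

@torch.no_grad()
def clip_topk_accuracy(sketch_images, true_labels, class_names, k=1):
    """Zero-shot class recognition of rendered sketches with CLIP ViT-B/32."""
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
    proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    prompts = [f"a sketch of a {c}" for c in class_names]
    correct = 0
    for img, label in zip(sketch_images, true_labels):
        inputs = proc(text=prompts, images=img, return_tensors="pt", padding=True)
        logits = model(**inputs).logits_per_image[0]          # similarity to each class prompt
        topk = logits.topk(k).indices.tolist()
        correct += int(class_names.index(label) in topk)
    return correct / len(sketch_images)
```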
The results demonstrate that ControlSketch produces
Table 1. Quantitative Comparison of Sketch Generation Methods. The scores for Top-1 and Top-3 CLIP recognition accuracy, MS-SSIM, and DreamSim are presented. Results are based on 6,000 random samples: 2,000 from our dataset's test set, 2,000 from unseen categories in our dataset, and 2,000 from the external SketchyCOCO dataset [14].

                                        CLIP Top-1             CLIP Top-3             MS-SSIM                DreamSim
Method                     Time         Seen  Unseen  Exter.   Seen  Unseen  Exter.   Seen  Unseen  Exter.   Seen  Unseen  Exter.
Human [14]                 --           --    --      0.85     --    --      0.93     --    --      0.16     --    --      0.66
Chan et al. (Anime) [6]    0.04 sec.    0.94  0.99    0.87     0.99  0.99    0.96     0.79  0.77    0.45     0.32  0.43    0.45
Chan et al. (Contour) [6]  0.04 sec.    0.96  0.92    0.80     0.98  0.95    0.91     0.77  0.73    0.49     0.51  0.59    0.54
InstantStyle [48]          1 min.       0.97  0.99    --       0.99  0.99    --       0.89  0.90    --       0.36  0.44    --
Photo-Sketching [25]       0.6 sec.     0.90  0.81    0.65     0.95  0.87    0.78     0.61  0.55    0.40     0.58  0.65    0.63
CLIPasso [47]              5 min.       0.98  0.97    0.88     0.99  0.99    0.95     0.65  0.60    0.52     0.48  0.57    0.57
ControlSketch              10 min.      0.97  0.97    0.91     0.99  0.99    0.97     0.68  0.63    0.53     0.52  0.59    0.60
SwiftSketch                0.5 sec.     0.95  0.70    0.56     0.98  0.82    0.70     0.62  0.56    0.47     0.53  0.66    0.64
sketches that are both highly recognizable and of high fi-
delity, outperforming alternative methods, particularly on
the SketchyCOCO dataset. To further highlight the advan-
tages of ControlSketch over CLIPasso, we conduct a two-
alternative forced-choice (2AFC) perceptual study with 40
participants. Each participant was shown pairs of sketches
generated by the two methods (presented in random order)
alongside the input image and asked to choose the sketch
they perceived to be of higher quality. The study included
24 randomly selected sketches from both our dataset and
SketchyCOCO, spanning 24 object classes. Participants
rated sketches generated by ControlSketch as higher quality
in 89% of cases. Examples of sketches presented in the user
study are shown in Figure 13.
6. Ablation
We evaluate the contribution of SwiftSketch’s main compo-
nents by systematically removing each one and retraining
the network. Specifically, we examine the impact of exclud-
ing the LPIPS loss, the L1 loss, and the sorting technique,
as well as the effect of incorporating the refinement net-
work. The results are summarized in Table 2, where “Full”
represents our complete diffusion pipeline prior to refine-
ment, and “+Refine” denotes the inclusion of the refinement
stage. Notably, removing the L1 loss results in a signifi-
cant drop in performance, highlighting its essential role in
the training process. Excluding the LPIPS loss negatively
impacts performance, particularly in unseen classes. The
metrics indicate comparable performance in the absence of
the sorting stage. While the resulting sketches may appear
visually similar, the sorting stage is crucial for supporting
varying levels of abstraction. Although the network can be
trained without this stage and still achieve reasonable re-
sults, learning an internal stroke order provides a foundation
for training across abstraction levels, where sketches im-
plicitly encode the importance of strokes. The refinement
stage enhances recognizability, especially in unseen categories, where the output sketches from the diffusion process are noisier. We further illustrate the impact of the refinement network in Figure 14, with additional results provided in the supplementary.

Table 2. Ablation Study. We systematically remove each component in our pipeline and retrain our network. Scores are computed on 4,000 sketches in total: 2,000 from seen categories and 2,000 from unseen categories.

             CLIP Top-1       CLIP Top-3       MS-SSIM          DreamSim
             Seen   Unseen    Seen   Unseen    Seen   Unseen    Seen   Unseen
w/o LPIPS    0.88   0.36      0.94   0.49      0.58   0.52      0.58   0.72
w/o L1       0.03   0.01      0.06   0.03      0.24   0.22      0.80   0.84
w/o Sort     0.93   0.63      0.97   0.76      0.62   0.57      0.55   0.68
Full         0.94   0.61      0.97   0.73      0.62   0.57      0.54   0.68
+Refine      0.95   0.70      0.98   0.82      0.62   0.56      0.53   0.66
7. Limitations and Future Work
While SwiftSketch can generate vector sketches from im-
ages efficiently, it comes with limitations. First, although
SwiftSketch performs well on seen categories, as evidenced
by our evaluation, its performance decreases for unseen cat-
egories. This is particularly apparent in categories that dif-
fer significantly from those seen during training (e.g., non-
human or non-animal objects). Failure cases often exhibit
a noisy appearance or are entirely unrecognizable, such as
the carrot in Figure 10. Expanding the number of train-
ing categories in future work could enhance the model’s
generalization. Second, our refinement stage, which is
meant to fix the noisy appearance, might over-simplify the
sketches, resulting in lost details such as the nose and eyes
of the cow in Fig. 10. Lastly, in the scope of this paper,
we trained SwiftSketch on sketches with a fixed number
of strokes (32). Extending the training to sketches with
varying numbers of strokes, spanning multiple levels of ab-
straction, presents an exciting direction for future research.
Our transformer-decoder architecture is inherently suited
for such an extension, and our results show that the network
can capture essential features from sorted sketches, high-
lighting its potential to effectively handle more challenging
levels of abstraction.
Figure 10. Limitations. (a) Sketches may appear unrecognizable
(e.g., carrot) or noisy (e.g., Eiffel Tower). (b) The refinement stage
can lead to the loss of fine details, such as the cow’s nose and eye.
8. Conclusions
We introduced SwiftSketch, a method for object sketch-
ing capable of generating plausible vector sketches in un-
der a second. SwiftSketch employs a diffusion model with
a transformer-decoder architecture, generating sketches
by progressively denoising a Gaussian distribution in the
space of stroke control points. To address the scarcity
of professional-quality paired vector sketch datasets, we
constructed a synthetic dataset spanning 100 classes and
over 35,000 sketches. This dataset was generated us-
ing ControlSketch, an improved SDS-based sketch gener-
ation method enhanced with a depth ControlNet for bet-
ter spatial control. We demonstrated both visually and nu-
merically that ControlSketch produces high-quality, high-
fidelity sketches and that SwiftSketch effectively learns the
data distribution of ControlSketch, achieving high-quality
sketch generation while reducing generation time from ap-
proximately 10 minutes to 0.5 seconds. We believe this
work represents a meaningful step toward real-time, high-
quality vector sketch generation with the potential to en-
able more interactive processes. Additionally, our extensi-
ble dataset construction process will be made publicly avail-
able to support future research in this field.
9. Acknowledgements
We thank Guy Tevet and Oren Katzir for their valuable
insights and engaging discussions. We also thank Yuval
Alaluf, Elad Richardson, and Sagi Polaczek for providing
feedback on early versions of our manuscript. This work
was partially supported by Joint NSFC-ISF Research Grant
no. 3077/23 and ISF 3441/21.
References
[1] Pablo Arbelaez, Michael Maire, Charless Fowlkes, and Ji-
tendra Malik. Contour detection and hierarchical image
segmentation. IEEE Trans. Pattern Anal. Mach. Intell.,
33(5):898–916, May 2011. 2
[2] Alexander Ashcroft, Ayan Das, Yulia Gryaditskaya, Zhiyu
Qu, and Yi-Zhe Song. Modelling complex vector drawings
with stroke-clouds. In The Twelfth International Conference
on Learning Representations, 2024. 3
[3] Itamar Berger, Ariel Shamir, Moshe Mahler, Eliza-
beth Jeanne Carter, and Jessica K. Hodgins. Style and ab-
straction in portrait sketching. ACM Transactions on Graph-
ics (TOG), 32:1 12, 2013. 2
[4] Kumar Bhunia, Umar Ayan Das, Riaz Muhammad, Yongxin
Yang, Timothy M. Hospedales, Tao Xiang, Yulia Gryadit-
skaya, and Yi-Zhe Song. Edinburgh research explorer pix-
elor: A competitive sketching ai agent. so you think you can
sketch? 2020. 2
[5] Reiner Birkl, Diana Wofk, and Matthias Müller. MiDaS v3.1 –
a model zoo for robust monocular relative depth estimation,
2023. 3
[6] Caroline Chan, Frédo Durand, and Phillip Isola. Learning
to generate line drawings that convey geometry and seman-
tics. 2022 IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), pages 7905–7915, 2022. 1,2,
6,7,8,4
[7] Shoufa Chen, Pei Sun, Yibing Song, and Ping Luo. Dif-
fusiondet: Diffusion model for object detection. 2023
IEEE/CVF International Conference on Computer Vision
(ICCV), pages 19773–19786, 2022. 3
[8] Yajing Chen, Shikui Tu, Yuqi Yi, and Lei Xu. Sketch-
pix2seq: a model to generate sketches of multiple categories.
ArXiv, abs/1709.04121, 2017. 2
[9] Chenxwh. BRIA Background Removal v1.4 Model, 2025. 1
[10] Mathias Eitz, James Hays, and Marc Alexa. How do humans
sketch objects? ACM Transactions on Graphics (TOG), 31:1
10, 2012. 2,1
[11] Kevin Frans, Lisa B. Soros, and Olaf Witkowski. Clipdraw:
Exploring text-to-drawing synthesis through language-image
encoders. ArXiv, abs/2106.14843, 2021. 2
[12] Yarden Frenkel, Yael Vinker, Ariel Shamir, and Daniel
Cohen-Or. Implicit style-content separation using b-lora,
2024. 2
[13] Stephanie Fu, Netanel Tamir, Shobhita Sundaram, Lucy
Chai, Richard Zhang, Tali Dekel, and Phillip Isola. Dream-
sim: Learning new dimensions of human visual similarity
using synthetic data. In Advances in Neural Information Pro-
cessing Systems, volume 36, pages 50742–50768, 2023. 7
[14] Chengying Gao, Qi Liu, Qi Xu, Limin Wang, Jianzhuang
Liu, and Changqing Zou. Sketchycoco: Image generation
from freehand scene sketches. In 2020 IEEE/CVF Confer-
ence on Computer Vision and Pattern Recognition (CVPR),
pages 5173–5182, 2020. 2,7,8,1
[15] Yulia Gryaditskaya, Mark Sypesteyn, Jan Willem Hoftijzer,
Sylvia C. Pont, Frédo Durand, and Adrien Bousseau. Opens-
ketch. ACM Transactions on Graphics (TOG), 38:1 16,
2019. 2
[16] David Ha and Douglas Eck. A neural representation of
sketch drawings. CoRR, abs/1704.03477, 2017. 2,1
[17] Yue Han, Jiangning Zhang, Junwei Zhu, Xiangtai Li, Yanhao Ge, Wei Li, Chengjie Wang, Yong Liu, Xiaoming Liu, and Ying Tai. A generalist facex via learning unified facial representation. ArXiv, abs/2401.00551, 2023. 2
Figure 11. Sketches generated by SwiftSketch for seen categories, using input images not included in the training data.
Figure 12. Sketches generated by SwiftSketch for unseen categories.
Figure 13. Comparison of ControlSketch with CLIPasso [47]. ControlSketch captures fine details (e.g., camel and bear), avoids artifacts in small objects (e.g., dog and duck), and handles challenging inputs effectively (last row).
Figure 14. Effect of the refinement network. The output sketches from the diffusion model may contain slight noise, which the refinement network addresses by performing an additional cleaning step. However, this process can sometimes reduce details in the sketch (see Limitations).
[18] Amir Hertz, Andrey Voynov, Shlomi Fruchter, and Daniel
Cohen-Or. Style aligned image generation via shared atten-
tion. In 2024 IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR), pages 4775–4785, 2024. 2
[19] Jonathan Ho. Classifier-free diffusion guidance. ArXiv,
abs/2207.12598, 2022. 5
[20] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu-
sion probabilistic models. In Proceedings of the 34th Inter-
national Conference on Neural Information Processing Sys-
tems, NIPS ’20, Red Hook, NY, USA, 2020. Curran Asso-
ciates Inc. 3
[21] Zixuan Huang, Mark Boss, Aaryaman Vasishta, James M.
Rehg, and Varun Jampani. Spar3d: Stable point-aware re-
construction of 3d objects from single images. 2025. 3
[22] Ajay Jain, Amber Xie, and Pieter Abbeel. Vectorfusion:
Text-to-svg by abstracting pixel-based diffusion models.
arXiv, 2022. 3
[23] Moritz Kampelmühler and Axel Pinz. Synthesizing human-
like sketches from natural images using a conditional convo-
lutional decoder. 2020 IEEE Winter Conference on Applica-
tions of Computer Vision (WACV), pages 3192–3200, 2020.
2
[24] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi.
Blip-2: Bootstrapping language-image pre-training with
frozen image encoders and large language models, 2023. 3,
1
[25] Mengtian Li, Zhe L. Lin, Radomír Měch, Ersin Yumer, and
Deva Ramanan. Photo-sketching: Inferring contour draw-
ings from images. 2019 IEEE Winter Conference on Ap-
plications of Computer Vision (WACV), pages 1403–1412,
2019. 2,6,7,8,4
[26] Tzu-Mao Li, Michal Lukáč, Michaël Gharbi, and Jonathan
Ragan-Kelley. Differentiable vector graphics rasterization
for editing and learning. ACM Transactions on Graphics
(TOG), 39:1 15, 2020. 3,4
[27] Hangyu Lin, Yanwei Fu, Yu-Gang Jiang, and X. Xue.
Sketch-bert: Learning sketch bidirectional encoder repre-
sentation from transformers by self-supervised learning of
sketch gestalt. 2020 IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), pages 6757–6766,
2020. 2
[28] Difan Liu, Matthew Fisher, Aaron Hertzmann, and Evange-
los Kalogerakis. Neural strokes: Stylized line drawing of
3d shapes. In Proceedings of the IEEE/CVF International
Conference on Computer Vision, 2021. 2
[29] Shitong Luo and Wei Hu. Diffusion probabilistic models
for 3d point cloud generation. 2021 IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR), pages
2836–2844, 2021. 2,3
[30] Umar Riaz Muhammad, Yongxin Yang, Yi-Zhe Song, Tao
Xiang, and Timothy M. Hospedales. Learning deep sketch
abstraction. 2018 IEEE/CVF Conference on Computer Vi-
sion and Pattern Recognition, pages 8014–8023, 2018. 2
[31] Kushin Mukherjee, Holly Huey, Xuanchen Lu, Yael Vinker,
Rio Aguina-Kang, Ariel Shamir, and Judith E. Fan. Seva:
Leveraging sketches to evaluate alignment between human
and machine visual abstraction. ArXiv, abs/2312.03035,
2023. 2,1
[32] Alex Nichol and Prafulla Dhariwal. Improved denoising dif-
fusion probabilistic models. ArXiv, abs/2102.09672, 2021.
6,4
[33] Dustin Podell, Zion English, Kyle Lacey, Andreas
Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and
Robin Rombach. Sdxl: Improving latent diffusion models
for high-resolution image synthesis, 2023. 2,4,6,1
[34] Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Milden-
hall. Dreamfusion: Text-to-3d using 2d diffusion. ArXiv,
abs/2209.14988, 2022. 2,3
[35] Yonggang Qi, Guoyao Su, Pinaki Nath Chowdhury,
Mingkang Li, and Yi-Zhe Song. Sketchlattice: Latticed rep-
resentation for sketch manipulation. 2021 IEEE/CVF In-
ternational Conference on Computer Vision (ICCV), pages
933–941, 2021. 2
[36] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya
Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry,
Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen
Krueger, and Ilya Sutskever. Learning transferable vi-
sual models from natural language supervision. CoRR,
abs/2103.00020, 2021. 2,4,7,3
[37] Leo Sampaio Ferraz Ribeiro, Tu Bui, John P. Collomosse,
and Moacir Antonelli Ponti. Sketchformer: Transformer-
based representation for sketched structure. 2020 IEEE/CVF
Conference on Computer Vision and Pattern Recognition
(CVPR), pages 14141–14150, 2020. 2
[38] Robin Rombach, Andreas Blattmann, Dominik Lorenz,
Patrick Esser, and Björn Ommer. High-resolution image syn-
thesis with latent diffusion models, 2022. 3
[39] Chitwan Saharia, William Chan, Saurabh Saxena, Lala
Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour,
Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans,
et al. Photorealistic text-to-image diffusion models with deep
language understanding. Advances in Neural Information
Processing Systems, 35:36479–36494, 2022. 6
[40] Patsorn Sangkloy, Nathan Burnell, Cusuh Ham, and James
Hays. The sketchy database: learning to retrieve badly drawn
bunnies. ACM Trans. Graph., 35(4), July 2016. 2
[41] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denois-
ing diffusion implicit models. In International Conference
on Learning Representations, 2021. 3
[42] Jifei Song, Kaiyue Pang, Yi-Zhe Song, Tao Xiang, and Tim-
othy M. Hospedales. Learning to sketch with shortcut cycle
consistency. 2018 IEEE/CVF Conference on Computer Vi-
sion and Pattern Recognition, pages 801–810, 2018. 2
[43] Guy Tevet, Sigal Raab, Brian Gordon, Yoni Shafir, Daniel
Cohen-or, and Amit Haim Bermano. Human motion diffu-
sion model. In The Eleventh International Conference on
Learning Representations, 2023. 2,3
[44] Vikas Thamizharasan, Difan Liu, Shantanu Agarwal,
Matthew Fisher, Michael Gharbi, Oliver Wang, Alec Jacob-
son, and Evangelos Kalogerakis. Vecfusion: Vector font
generation with diffusion. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition
(CVPR), pages 7943–7952, June 2024. 2,3
[45] Varshaneya V, Balasubramanian S, and Vineeth N. Balasub-
ramanian. Teaching gans to sketch in vector format. Pro-
ceedings of the Twelfth Indian Conference on Computer Vi-
sion, Graphics and Image Processing, 2019. 2
[46] Yael Vinker, Yuval Alaluf, Daniel Cohen-Or, and Ariel
Shamir. Clipascene: Scene sketching with different types
and levels of abstraction. 2023 IEEE/CVF International
Conference on Computer Vision (ICCV), pages 4123–4133,
2022. 1,2,7
[47] Yael Vinker, Ehsan Pajouheshgar, Jessica Y. Bo, Ro-
man Christian Bachmann, Amit Haim Bermano, Daniel
Cohen-Or, Amir Zamir, and Ariel Shamir. Clipasso:
Semantically-aware object sketching. ACM Trans. Graph.,
41(4), jul 2022. 1,2,3,4,6,7,8,11
[48] Haofan Wang, Matteo Spinelli, Qixun Wang, Xu Bai, Zekui
Qin, and Anthony Chen. Instantstyle: Free lunch to-
wards style-preserving in text-to-image generation. ArXiv,
abs/2404.02733, 2024. 2,6,7,8,4
[49] Qiang Wang, Haoge Deng, Yonggang Qi, Da Li, and Yi-Zhe
Song. Sketchknitter: Vectorized sketch generation with dif-
fusion models. In The Eleventh International Conference on
Learning Representations, 2023. 2,3
[50] Zeyu Wang, Sherry Qiu, Nicole Feng, Holly Rushmeier,
Leonard McMillan, and Julie Dorsey. Tracing versus free-
hand for evaluating computer-generated drawings. ACM
Trans. Graph., 40(4), Aug. 2021. 2
[51] Zhou Wang, Eero P Simoncelli, and Alan C Bovik. Multi-
scale structural similarity for image quality assessment. In
The Thrity-Seventh Asilomar Conference on Signals, Sys-
tems & Computers, 2003, volume 2, pages 1398–1402. Ieee,
2003. 7
[52] Holger Winnemöller. XDoG: advanced image stylization with
extended difference-of-gaussians. In International Sympo-
sium on Non-Photorealistic Animation and Rendering, 2011.
7
[53] Chufeng Xiao, Wanchao Su, Jing Liao, Zhouhui Lian, Yi-
Zhe Song, and Hongbo Fu. Differsketching: How differently
do people sketch 3d objects? ACM Transactions on Graphics
(Proceedings of ACM SIGGRAPH Asia 2022), 41(4):1–16,
2022. 2
[54] Ximing Xing, Chuan Wang, Haitao Zhou, Jing Zhang,
Qian Yu, and Dong Xu. Diffsketcher: Text guided vec-
tor sketch synthesis through latent diffusion models. ArXiv,
abs/2306.14685, 2023. 2,3
[55] Ximing Xing, Haitao Zhou, Chuang Wang, Jing Zhang,
Dong Xu, and Qian Yu. Svgdreamer: Text guided svg gener-
ation with diffusion model. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition
(CVPR), pages 4546–4555, June 2024. 1,3
[56] Peng Xu, Timothy M. Hospedales, Qiyue Yin, Yi-Zhe Song,
Tao Xiang, and Liang Wang. Deep learning for free-hand
sketch: A survey and a toolbox, 2020. 1,2
[57] Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu.
Vision-language models for vision tasks: A survey, 2024.
1
[58] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding
conditional control to text-to-image diffusion models. 2023
IEEE/CVF International Conference on Computer Vision
(ICCV), pages 3813–3824, 2023. 2,4,1
[59] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman,
and Oliver Wang. The unreasonable effectiveness of deep
features as a perceptual metric. In CVPR, 2018. 5
[60] Tao Zhou, Chen Fang, Zhaowen Wang, Jimei Yang, Byung-
moon Kim, Zhili Chen, Jonathan Brandt, and Demetri Ter-
zopoulos. Learning to doodle with stroke demonstrations
and deep q-networks. In British Machine Vision Conference,
2018. 2
SwiftSketch: A Diffusion Model for Image-to-Vector Sketch Generation
Supplementary Material
Figure 15. An example from the ControlSketch dataset, which
includes the input image, object mask, attention map, and the cor-
responding sketch generated using ControlSketch.
Table of Contents
A. ControlSketch Dataset
B. ControlSketch Method
C. SwiftSketch
D. Qualitative Comparison
E. Quantitative Evaluation
F. Ablation
G. Limitations
A. ControlSketch Dataset
The data generation process begins with generating images,
followed by creating corresponding sketches using the Con-
trolSketch framework, as described in section 4.2 in the
main paper. We generate images using the SDXL model
[33] with the following prompt: "A highly detailed wide-shot image of one <c>, set against a plain mesmerizing background. Center.", where c is the class label. Additionally, a negative prompt, "close up, few, multiple", is applied to ensure images depict a single object in a clear and high-quality pose. The generated images are of size 1024×1024.
An example output image for the class “cat” is shown in
Figure 15.
During the image generation process, we retain cross at-
tention maps of the class label token extracted from internal
layers of the model for future use. To isolate the object,
we employ the BRIA Background Removal v1.4 Model [9]
to extract an object mask. After generating the image, we
use BLIP2 [24] to extract the image caption that provides
context beyond the object’s class. For example, for the im-
age in Fig. 15, the caption describe the cat as sitting, of-
fering richer semantic information. The sketches are gener-
ated using the ControlSketch method with 32 strokes. These
strokes are subsequently arranged according to our stroke-
sorting schema. The final SVG files contain the sorted
strokes. We use the Hugging Face implementation of SDXL
version 1.0 [33] with its default parameters. Generating a
single image with SDXL takes approximately 10 seconds,
while sketch generation using the ControlSketch method on
an NVIDIA RTX3090 GPU requires about 10 minutes.
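The image-generation step of this pipeline can be reproduced roughly as follows (a sketch using the diffusers SDXL pipeline; everything except the prompt strings above is illustrative):

```python
import torch
from diffusers import StableDiffusionXLPipeline

def generate_class_images(class_label, n_images=4, seed=0):
    """Generate candidate images for one dataset class with SDXL 1.0."""
    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
    ).to("cuda")
    prompt = (f"A highly detailed wide-shot image of one {class_label}, "
              "set against a plain mesmerizing background. Center.")
    negative = "close up, few, multiple"
    gen = torch.Generator("cuda").manual_seed(seed)
    images = pipe(prompt, negative_prompt=negative,
                  num_images_per_prompt=n_images,
                  height=1024, width=1024, generator=gen).images
    return images   # PIL images, to be masked and sketched with ControlSketch
```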
Our dataset comprises 35,000 pairs of images and their
corresponding sketches in SVG format, spanning 100 object
categories. These categories are derived by combining com-
mon ones from existing sketch datasets [10,14,16,31] with
additional, unique categories such as astronaut, robot and
sculpture. These unique categories are not present in prior
datasets, highlighting the advantages of a synthetic data ap-
proach. The full list of categories is available in Table 3. All
the sketches in our data were manually verified; we filtered out the few generated images whose artifacts caused artifacts in the generated sketches. The 15 categories used in
training are: angel, bear, car, chair, crab, fish, rabbit, sculp-
ture, astronaut, bicycle, cat, dog, horse, robot, woman. For
each of these categories we generated 1200 image-sketch
pairs, where 1000 samples are used for training and the rest
for testing. For the rest of 85 categories we created 200
samples per class. We show 78 random samples from each
class of the training data in Figures 28 to 42, and 100 ran-
dom samples from the entire dataset (one of each class) in
Figure 16. Since the entire data creation pipeline is fully
automated, we continuously extend the dataset and plan to
release the code to enable future work in this area.
B. ControlSketch Method
Technical details In the ControlSketch optimization, we
leverage the pretrained depth ControlNet model [58] to
compute the SDS loss. The Adam optimizer is employed
with a learning rate of 0.8. The optimization process runs
for 2000 iterations, taking approximately 10 minutes to gen-
erate a single sketch on an RTX 3090 GPU. However, after
700 iterations most images already yield a clearly identifi-
able sketch.
Strokes initialization The number of areas, k, is defined
as the rounded square root of the total number of strokes n
(for our default number of strokes, 32, k is set to 6). Our initialization technique combines saliency with full coverage of the object, which we find to be important when the SDS loss is applied with our spatial control.
Figure 16. 100 random samples of sketches generated with ControlSketch.
In Figure 18 we demonstrate how the final sketches look with and without our enhanced initialization, where the default case is defined based on the attention map, as proposed in CLIPasso [47]. As seen, our approach
ensures comprehensive object coverage while emphasizing
critical areas, resulting in visually effective and recogniz-
able sketches without omitting essential elements. For ex-
ample, in the lion image, initializing strokes based solely on
saliency results in almost all strokes focusing on the lion’s
head. Consequently, the final sketch omits significant por-
tions of the lion’s body.
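A minimal sketch of such an initialization is given below. It assumes the object's attention (saliency) map and mask are already available; k-means over mask pixels stands in for the spatial partition into k areas, and the per-area allocation is an illustrative choice rather than the paper's exact rule.

```python
import numpy as np
from sklearn.cluster import KMeans

def init_stroke_points(attn_map, object_mask, n_strokes=32, seed=0):
    """Illustrative initialization combining saliency with full object coverage.

    attn_map:    (H, W) attention/saliency map for the class token.
    object_mask: (H, W) binary mask of the object.
    Returns (n_strokes, 2) initial stroke locations in pixel coordinates.
    """
    rng = np.random.default_rng(seed)
    k = int(round(np.sqrt(n_strokes)))            # e.g., 32 strokes -> k = 6 areas

    # Partition the object into k spatial areas to guarantee coverage.
    ys, xs = np.nonzero(object_mask)
    coords = np.stack([xs, ys], axis=1).astype(np.float32)
    areas = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(coords)

    # Within each area, sample stroke locations proportionally to saliency.
    per_area = np.full(k, n_strokes // k)
    per_area[: n_strokes % k] += 1                # distribute the remainder
    points = []
    for a in range(k):
        area_coords = coords[areas == a]
        weights = attn_map[area_coords[:, 1].astype(int), area_coords[:, 0].astype(int)]
        weights = np.clip(weights, 1e-8, None)
        idx = rng.choice(len(area_coords), size=per_area[a], p=weights / weights.sum())
        points.append(area_coords[idx])
    return np.concatenate(points, axis=0)
```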
airplane, alarm clock, angel, astronaut, backpack, bear, bed, bee, beer, bicycle, boat, broccoli, burger, bus, butterfly, cabin, cake, camel, camera, candle, car, carrot, castle, cat, cell phone, chair, chicken, child, cow, crab, cup, deer, doctor, dog, dolphin, dragon, drill, duck, elephant, fish, flamingo, floor lamp, flower, fork, giraffe, goat, hammer, hat, helicopter, horse, house, ice cream, jacket, kangaroo, kimono, laptop, lion, lobster, margarita, mermaid, motorcycle, mountain, octopus, parrot, pen, pickup truck, pig, purse, quiche, rabbit, robot, sandwich, scissors, sculpture, shark, sheep, spider, squirrel, strawberry, sword, t-shirt, table, teapot, television, tiger, tomato, train, tree, truck, umbrella, vase, waffle, watch, whale, wine bottle, woman, yoga, zebra, The Eiffel Tower, book
Table 3. The 100 categories of the ControlSketch dataset.
Spatial control The ControlNet model receives two conditions: the text prompt and the depth map. The balance between them, determined by the conditioning-scale parameter, influences the attributes of the final sketch. We found that a conditioning scale of 1.5 provides the best results, effectively preserving both the semantic and geometric attributes of the subject.
The depth ControlNet used in the control SDS loss can be replaced with any other ControlNet model, provided the corresponding condition is extracted from the input image. Different ControlNet models influence the style and attributes of the final sketch. Examples of sketches generated with different ControlNet models and conditions are shown in Figure 17.
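For reference, the snippet below shows how a depth ControlNet (or any other variant, such as scribble or segmentation) is typically loaded and applied with a conditioning scale of 1.5 using the diffusers library. The checkpoint identifiers are common public models, and the call illustrates only the conditioning-scale parameter, not the SDS optimization itself.

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Depth-conditioned ControlNet; swapping the checkpoint (e.g., scribble or segmentation)
# changes the style and attributes of the resulting guidance.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11f1p_sd15_depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

depth_map = Image.open("depth.png")  # precomputed depth map of the input image (assumed available)

result = pipe(
    prompt="a photo of a cat",
    image=depth_map,                      # spatial condition
    controlnet_conditioning_scale=1.5,    # balance between text and depth conditions
).images[0]
```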
C. SwiftSketch
Our implementation is built on the MDM codebase [43].
Our model consists of 8 self- and cross-attention layers.
Figure 17. Examples of sketches generated by ControlSketch using different ControlNet models (Depth, Scribble, Segmentation), alongside their respective conditions, which influence the style and attributes of the final sketches.
It was trained with a batch size of 32 and a learning rate of $5\times10^{-5}$ for 400,000 steps. The refinement network shares the same architecture as our diffusion model and is initialized with its final weights. The timestep condition is fixed at 0. We train the refinement network on the diffusion output sketches from the training dataset, using only the LPIPS loss between the network's rendered output sketch and the target rendered sketch, as we found this yields more polished and visually appealing final sketches. The refinement network was trained for 30,000 steps with a learning rate of $5\times10^{-6}$.
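A condensed sketch of that refinement objective is shown below; it assumes a differentiable rasterizer behind a hypothetical render helper and a hypothetical refiner signature, and uses the public lpips package, so it illustrates the loss computation rather than reproducing the exact training code.

```python
import torch
import lpips

perceptual = lpips.LPIPS(net="vgg")  # LPIPS perceptual loss on rendered sketches

def refinement_step(refiner, diffusion_strokes, target_strokes, render, optimizer):
    """One training step of the refinement network (illustrative).

    refiner:           network with the diffusion model's architecture (timestep fixed to 0)
    diffusion_strokes: strokes produced by the diffusion model, shape (B, N, P, 2)
    target_strokes:    ground-truth ControlSketch strokes
    render:            hypothetical differentiable rasterizer, strokes -> (B, 3, H, W) in [-1, 1]
    """
    optimizer.zero_grad()
    refined = refiner(diffusion_strokes, timestep=0)
    loss = perceptual(render(refined), render(target_strokes)).mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```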
For training, we scaled the ground-truth (x, y) coordinates to the range [-2, 2]. Our experiments revealed that a scaling factor of 2 outperforms the standard value of 1.0 used in image generation tasks. To extract input image features for our model, the image is processed with a pretrained CLIP ResNet model [36], and features are extracted from its fourth layer. These features are refined through three convolutional layers to capture additional spatial detail, and each patch embedding is further processed by three linear layers, which enhances feature learning and aligns dimensions with the model. The resulting feature representation is integrated into the generation process via a cross-attention mechanism.
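The image-conditioning path could be sketched roughly as below; the RN50 backbone, the hook placement, and the projection dimensions are assumptions made for illustration, since the text does not pin them down.

```python
import torch
import torch.nn as nn
import clip  # OpenAI CLIP package

model, preprocess = clip.load("RN50", device="cpu")  # assumed ResNet backbone

features = {}
def grab(_, __, output):
    features["layer4"] = output  # spatial feature map from the fourth residual stage

model.visual.layer4.register_forward_hook(grab)

# Illustrative refinement head: three conv layers, then three linear layers per patch.
conv_refine = nn.Sequential(
    nn.Conv2d(2048, 512, 3, padding=1), nn.ReLU(),
    nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(),
    nn.Conv2d(512, 256, 3, padding=1),
)
linear_refine = nn.Sequential(
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 256),
)

def encode_image(pil_image):
    x = preprocess(pil_image).unsqueeze(0)
    with torch.no_grad():
        model.encode_image(x)                       # triggers the layer4 hook
    fmap = conv_refine(features["layer4"].float())  # (1, 256, H', W')
    patches = fmap.flatten(2).transpose(1, 2)       # (1, H'*W', 256) patch embeddings
    return linear_refine(patches)                   # tokens consumed via cross-attention
```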
Figure 18. Strokes initialization in the ControlSketch method. The “Saliency” column shows the result when strokes are initialized based solely on the attention map (following common practice [47]), which often leads to an overemphasis on critical regions, such as the lion's head, at the expense of other important parts like the body. The “Saliency + coverage” column shows our enhanced initialization, which combines saliency with full object coverage, preserving both essential details and the global object structure and yielding complete, recognizable sketches.
To encourage the diffusion model to focus on fine details,
we adjust the noise scheduler to perturb the signal more sub-
tly for small timesteps, by reducing the exponent in the stan-
dard cosine noise schedule proposed in [32] from 2 to 0.4.
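Concretely, this modification only changes the exponent of the cosine schedule of [32]; a minimal sketch of the resulting $\bar{\alpha}_t$ computation is shown below (the offset $s = 0.008$ follows the original schedule, and the clipping of per-step betas is the usual safeguard).

```python
import numpy as np

def cosine_alpha_bar(t, exponent=0.4, s=0.008):
    """Cumulative signal level alpha_bar at normalized time t in [0, 1].

    exponent=2 recovers the standard cosine schedule [32]; we use 0.4 so that
    small timesteps perturb the signal more subtly.
    """
    return np.cos((t + s) / (1 + s) * np.pi / 2) ** exponent

def make_betas(num_steps=1000, exponent=0.4, max_beta=0.999):
    ts = np.arange(num_steps + 1) / num_steps
    alpha_bar = cosine_alpha_bar(ts, exponent)
    betas = 1.0 - alpha_bar[1:] / alpha_bar[:-1]
    return np.clip(betas, 0.0, max_beta)
```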
Our model $M_\theta$ was trained using classifier-free guidance, so during inference we enhance fidelity to the input image by extrapolating the conditional and unconditional predictions with a guidance scale $s = 2.5$:
$$M_\theta^{s}(s_t, t, I) = M_\theta(s_t, t, \varnothing) + s \cdot \big( M_\theta(s_t, t, I) - M_\theta(s_t, t, \varnothing) \big) \quad (6)$$
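In code, the guided prediction amounts to two forward passes per denoising step, one with the image condition and one with the null condition; the sketch below assumes a model following the signature in Eq. (6), where null_cond stands for the empty condition $\varnothing$.

```python
def guided_prediction(model, s_t, t, image_cond, null_cond, guidance_scale=2.5):
    """Classifier-free guidance as in Eq. (6).

    s_t:        noisy stroke control points at timestep t
    image_cond: features of the input image I
    null_cond:  the empty (unconditional) condition
    """
    uncond = model(s_t, t, null_cond)    # M_theta(s_t, t, empty)
    cond = model(s_t, t, image_cond)     # M_theta(s_t, t, I)
    return uncond + guidance_scale * (cond - uncond)
```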
Figure 22 showcases 100 random SwiftSketch samples
across all categories in the ControlSketch dataset. The last
three rows correspond to classes our model was trained on,
while the remaining rows are unseen classes. Each sketch
is generated in under a second. The results demonstrate that
our model generalizes well to unseen categories, producing
sketches with high fidelity to the input images.
Figure 19. Stroke Order Visualization. SwiftSketch-generated sketches are visualized progressively, with the stroke count shown on top. The first row of each example uses our sorting technique (w), while the second row omits it (w/o).
However, in some cases high-level details are absent, and the sketch's category label can be difficult to identify. More examples for unseen classes are shown in Figures 23, 24, and 25.
D. Qualitative Comparison
Figure 26 and Figure 27 show additional qualitative comparisons on seen and unseen categories. Input images are shown on the left. From left to right, the sketches are generated by Chan et al. [6] (in anime style), InstantStyle [48], Photo-Sketching [25], and CLIPasso [47]. On the right are the resulting sketches from our proposed methods, ControlSketch and SwiftSketch.
E. Quantitative Evaluation
In this section, we present the details of the user study con-
ducted to compare our new optimization method, ControlS-
ketch, with the state-of-the-art optimization method for ob-
ject sketching, CLIPasso. We selected 24 distinct categories
for the user study: 16 categories from our ControlSketch
dataset, and 8 categories from the SketchyCOCO dataset.
For each category, we randomly sampled one image. Par-
ticipants were presented with the input image alongside
two sketches—one generated by CLIPasso and the other by
ControlSketch—displayed in random order. We asked par-
ticipants two questions for each pair of sketches: 1. Which
sketch more effectively depicts the input image? 2. Which
sketch is of higher quality? Participants were required to
choose one sketch for each question. A total of 40 individu-
als participated in the survey. The results are as follows: For
the ControlSketch dataset, 87% of participants chose Con-
trolSketch for the first question, and 88% for the second
question. For the SketchyCOCO dataset—which is more
challenging due to its low-resolution images and difficult
lighting conditions—90% chose ControlSketch for the first question, and 93% for the second question.
Figure 20. Limitations of SwiftSketch. (a) When trained solely on masked object images, SwiftSketch struggles to generate accurate sketches for complex scenes. As shown, it incorrectly assigns strokes to the image frame instead of capturing the scene's key elements. (b) During the refinement stage, fine details, particularly facial features, are often lost, resulting in oversimplified representations. (c) Sketches may appear unrecognizable.
These results
highlight the significant advantages of ControlSketch over
CLIPasso across diverse categories and datasets.
F. Ablation
Figure 21 presents a comparison of results with and with-
out the refinement step in the SwiftSketch pipeline. As can
be seen, the final output sketches generated by the denois-
ing process of our diffusion model may still retain slight
noise. Incorporating the refinement stage significantly enhances the quality and cleanliness of the sketches. Figure 19 illustrates the impact of the stroke-sorting technique used for
training. Early strokes effectively capture the object’s con-
tour and key features, while later strokes refine the details.
With sorting, the object is significantly more recognizable
with fewer strokes compared to the case without sorting.
G. Limitations
SwiftSketch, which was trained only on masked object im-
ages, faces challenges in handling complex scenes. When
provided with a scene image, as illustrated in Figure 20(a),
SwiftSketch struggles to generate accurate sketches, often
misplacing strokes onto the image frame instead of captur-
ing key elements of the scene. Another significant limi-
tation is its tendency to omit fine details, particularly fa-
cial features, leading to oversimplified representations, as
shown in Figure 20(b). In some cases, sketches may appear
unrecognizable, as shown in Figure 20(c).
Figure 21. Comparison of SwiftSketch sketches with (right, “W Refine”) and without (left, “W/O Refine”) the refinement step, for seen and unseen classes. This highlights the critical role of the refinement network in significantly improving the quality of the generated sketches and reducing noise.
Figure 22. 100 random samples of SwiftSketch sketches. The last three rows are seen classes, while the remaining rows are unseen classes.
Figure 23. Sketches generated by SwiftSketch for unseen categories.
Figure 24. Sketches generated by SwiftSketch for unseen categories.
Figure 25. Sketches generated by SwiftSketch for unseen categories.
Figure 26. Qualitative comparison on seen categories. Columns, left to right: Input, Chan et al., InstantStyle, Photo-Sketching, CLIPasso, ControlSketch, SwiftSketch.
Figure 27. Qualitative comparison on unseen categories. Columns, left to right: Input, Chan et al., InstantStyle, Photo-Sketching, CLIPasso, ControlSketch, SwiftSketch.
Figure 28. Dog - SwiftSketch training data examples
Figure 29. Horse - SwiftSketch training data examples
Figure 30. Cat - SwiftSketch training data examples
Figure 31. Angel - SwiftSketch training data examples
Figure 32. Astronaut - SwiftSketch training data examples
Figure 33. Bear - SwiftSketch training data examples
Figure 34. Bicycle - SwiftSketch training data examples
Figure 35. Car - SwiftSketch training data examples
Figure 36. Chair - SwiftSketch training data examples
Figure 37. Crab - SwiftSketch training data examples
Figure 38. Fish - SwiftSketch training data examples
Figure 39. Rabbit - SwiftSketch training data examples
Figure 40. Sculpture - SwiftSketch training data examples
Figure 41. Robot - SwiftSketch training data examples
Figure 42. Woman - SwiftSketch training data examples