Segment Any Anomaly without Training via
Hybrid Prompt Regularization
Yunkang Cao1∗  Xiaohao Xu1∗  Chen Sun1  Yuqi Cheng1
Zongwei Du1  Liang Gao1  Weiming Shen1§
1State Key Laboratory of Digital Manufacturing Equipment and Technology,
Huazhong University of Science and Technology, China
{cyk_hust, sun_chen, chengyuqi, duzongwei, gaoliang}@hust.edu.cn
xxh11102019@outlook.com, chengyuqi.c@qq.com, wshen@ieee.org
Abstract
We present a novel framework, i.e., Segment Any Anomaly + (SAA+), for zero-shot anomaly segmentation with hybrid prompt regularization to improve the
adaptability of modern foundation models. Existing anomaly segmentation models
typically rely on domain-specific fine-tuning, limiting their generalization across
countless anomaly patterns. In this work, inspired by the great zero-shot general-
ization ability of foundation models like Segment Anything, we first explore their
assembly to leverage diverse multi-modal prior knowledge for anomaly localiza-
tion. For non-parameter foundation model adaptation to anomaly segmentation,
we further introduce hybrid prompts derived from domain expert knowledge and
target image context as regularization. Our proposed SAA+ model achieves state-
of-the-art performance on several anomaly segmentation benchmarks, including
VisA, MVTec-AD, MTD, and KSDD2, in the zero-shot setting. We will release
the code at https://github.com/caoyunkang/Segment-Any-Anomaly.
1 Introduction
Anomaly segmentation models [1, 2, 3] have attracted great interest in various domains, e.g., industrial quality control [4, 5] and medical diagnosis [6]. The key to reliable anomaly segmentation is to
discriminate the distribution of anomaly data from normal data. Specifically, this paper considers
zero-shot anomaly segmentation (ZSAS) on images, which is a promising yet unexplored setting
where neither normal nor abnormal images are provided for the target category during training.
Due to the scarcity of abnormal samples for training, many works pursue unsupervised or self-supervised anomaly segmentation, which targets learning a representation of the normal samples during training. Then, the anomalies can be segmented by calculating the discrepancy between the test sample and the learned normal distribution. Specifically, these models, including auto-encoder-based reconstruction [7, 8, 9, 10, 11, 12], one-class classification [13, 14, 15], and memory-based normal distribution [3, 2, 16, 17, 18] methods, typically require training separate models for certain limited categories. However, in real-world scenarios, there are millions of industrial products, and it is not cost-effective to collect a large training set for individual objects, which hinders their deployment in cases when efficient deployment is required, e.g., the initial stage of production.
Recently, foundation models, e.g., SAM [19] and CLIP [20], exhibit great zero-shot visual perception abilities by retrieving prior knowledge stored in these models via prompting [21, 22]. In this work, we would like to explore how to adapt foundation models to realize anomaly segmentation under the zero-shot setting.
*Equal Contribution.
§Corresponding Author.
Preprint. Under review.
Figure 1: Towards segmenting any anomaly without training, we first construct a vanilla baseline (SAA) by prompting a cascade of anomaly region generator (e.g., a prompt-guided object detection foundation model [23]) and anomaly region refiner (e.g., a segmentation foundation model [19]) modules via a naive class-agnostic language prompt (e.g., “Anomaly”). However, SAA shows a severe false-alarm problem: it falsely detects all the “wick” rather than the ground-truth anomaly region (the “overlong wick”). Thus, we further strengthen the regularization with hybrid prompts in the revamped model (SAA+), which successfully helps identify the anomaly region.
To this end, as shown in Fig. 1, we first construct a vanilla baseline, i.e., Segment Any Anomaly (SAA), by cascading prompt-guided object detection [23] and segmentation foundation models [19], which serve as the Anomaly Region Generator and the Anomaly Region Refiner, respectively. Following the practice of unlocking foundation model knowledge [24, 25], naive language prompts, e.g., “defect” or “anomaly”, are utilized to segment desired anomalies for a target image. Specifically, the language prompt is used to prompt the Anomaly Region Generator to generate prompt-conditioned box-level regions for desired anomaly regions. Then these regions are refined by the Anomaly Region Refiner to produce final predictions, i.e., masks, for anomaly segmentation.
However, as shown in Fig. 1, the vanilla foundation model assembly (SAA) tends to cause significant false alarms, e.g., SAA wrongly refers to all wicks as anomalies whereas only the overlong wick is a real anomaly, which we attribute to the ambiguity brought by naive language prompts. Firstly, conventional language prompts may become ineffective when facing the domain shift between the pre-training data distribution of foundation models and downstream datasets for anomaly segmentation. Secondly, the degree of “anomaly” for a target depends on the object context, which is hard for naive coarse-grained language prompts, e.g., “an anomaly region”, to express exactly.
Thus, going beyond naive language prompts, we incorporate domain expert knowledge and target image context in our revamped framework, i.e., Segment Any Anomaly + (SAA+). On the one hand, expert knowledge provides detailed descriptions of anomalies that are relevant to the target in open-world scenarios. We utilize more specific descriptions as in-context prompts, effectively aligning the image content in both pre-trained and target datasets. On the other hand, we utilize the target image context to reliably identify and adaptively calibrate anomaly segmentation predictions [26, 27]. By leveraging the rich contextual information present in the target image, we can accurately associate the object context with the final anomaly predictions.
Technically, apart from naive class-agnostic prompts, we leverage domain expert knowledge to construct target-oriented anomaly language prompts, i.e., class-specific language expressions. Besides, as language cannot precisely retrieve regions with certain object characteristics, such as number, size, and location [28, 29], we introduce object property prompts in the form of thresholding filters. These prompts assist in identifying and removing region candidates that do not satisfy desired properties. Furthermore, to fully exploit the target image context, we suggest utilizing image saliency and region confidence ranking as prompts, which model the anomaly degree of a region by considering the similarities, e.g., Euclidean distance, between it and other regions within the image. Finally, we conduct thorough experiments to confirm the efficacy of our hybrid prompts in adapting foundation models to zero-shot anomaly segmentation. Specifically, our final model (SAA+) attains new state-of-the-art performance on various anomaly segmentation datasets under the zero-shot setting. To summarize, our main contributions are:
• We propose the SAA framework for anomaly segmentation, allowing the collaborative assembly of diverse foundation models without the need for training.
• We introduce hybrid prompts as a regularization technique, leveraging domain expert knowledge and target image context to adapt foundation models for anomaly segmentation. This leads to the development of SAA+, an enhanced version of our framework.
• Our method achieves state-of-the-art performance in zero-shot anomaly segmentation on several benchmark datasets, including VisA, MVTec-AD, KSDD2, and MTD. Notably, SAA/SAA+ demonstrates remarkable capability in detecting texture-related anomalies without requiring any annotation.
2 Related Work
Anomaly Segmentation. Due to the limited availability and high cost of abnormal images in industrial settings, much of the current research on anomaly segmentation focuses on unsupervised methods that rely solely on normal images. Reconstruction-based approaches, such as those proposed in [7, 8, 9, 10, 11, 12], train an encoder-decoder model to reconstruct images and score anomalies for segmentation purposes. By comparing the input image with the reconstructed version, these methods can predict the location of anomalies. Feature embedding-based methods, on the other hand, typically employ teacher-student architectures [30, 31, 32, 33, 34, 35, 36, 1], one-class classification techniques [13, 14, 15], or memory-based normal distributions [3, 2, 16] to segment anomalies by identifying differences in feature distribution between normal and abnormal images.
Recently, researchers have begun to explore the potential of ZSAS [37, 38, 39, 40], which eliminates the need for either normal or abnormal images during the training process. Among them, WinClip [25] pioneers the use of foundation models, e.g., visual-language models, for the ZSAS task. Unlike WinClip [25], which segments anomalies through text-visual similarity, we propose to generate proposals and score their anomaly degree, achieving much better segmentation performance.
Foundation Model. Foundation models show an impressive ability to solve diverse vision tasks in a zero-shot manner. Specifically, these models can learn a strong representation by training on large-scale datasets [41]. While early works [20, 42] focus on developing robust image-wise recognition capacity, recent works [43, 44, 45, 46, 47, 23] introduce foundation models or their applications for dense visual tasks. For instance, Grounding DINO [23] achieves encouraging open-set object detection ability using arbitrary texts as queries. Recently, SAM [19] demonstrates a powerful ability to extract high-quality object segmentation masks in the open world. Impressed by the success of these foundation models, we would like to explore how to adapt these off-the-shelf models to detect anomalies without any training on the downstream datasets for anomaly segmentation.
Prompt Engineering. Prompt engineering is a widely employed technique for adapting foundation models to downstream tasks. Generally, this approach involves appending a set of learnable tokens to the input. Prior studies have investigated prompting with text inputs [48], vision inputs [49, 50, 51], and both text and visual inputs [52, 53, 54]. Despite their effectiveness in adapting foundation models to various downstream tasks, these prompting methods cannot be employed in ZSAS because they require training data, which is not available in ZSAS. In contrast, some methods employ heuristic prompts [55] that do not require any training, making them more feasible for tasks without any data. In this paper, we propose using hybrid prompts derived from domain expert knowledge and target image context for ZSAS.
3 SAA: Vanilla Foundation Model Assembly for ZSAS
3.1 Problem Definition: Zero-shot Anomaly Segmentation (ZSAS)
The goal of ZSAS is to perform anomaly segmentation on new objects without requiring any corresponding object training data. ZSAS seeks to create an anomaly map $A \in [0, 1]^{h \times w \times 1}$ based on an empty training set $\emptyset$, in order to identify the anomaly degree for individual pixels in an image $I \in \mathbb{R}^{h \times w \times 3}$ that includes novel objects. The ZSAS task has the potential to significantly reduce the need for training data and lower the costs associated with real-world inspection deployments.
Figure 2: Overview of the proposed Segment Any Anomaly + (SAA+) framework. We adapt foundation models to zero-shot anomaly segmentation via hybrid prompt regularization. Specifically, apart from naive class-agnostic language prompts, the regularization comes from both domain expert knowledge, including more detailed class-specific language and object property prompts, and target image context, including visual saliency and confidence ranking-related prompts.
3.2 Baseline Model Assembly: Segment Any Anomaly (SAA)
For ZSAS, we start by constructing a vanilla foundation model assembly, i.e., Segment Any Anomaly (SAA), as shown in Fig. 1. Specifically, given a query image for anomaly segmentation, we first use language as the initial prompt to roughly retrieve coarse anomaly region proposals via an Anomaly Region Generator implemented with a language-driven visual grounding foundation model, i.e., GroundingDINO [23]. Afterward, the anomaly region proposals are refined into pixel-wise high-quality segmentation masks by the Anomaly Region Refiner, in which a prompt-driven segmentation foundation model, i.e., SAM [19], is used.
3.2.1 Anomaly Region Generator
With the recent booming development of language-vision models, some foundation models [24, 23, 46] gradually acquire the ability to retrieve objects in images through language prompts. Given language prompts $\mathcal{T}$ that describe desired regions to be detected, e.g., “anomaly”, foundation models can generate the desired regions for a query image $I$. Here, we base the architecture of the region detector on a text-guided open-set object detection architecture for visual grounding. Specifically, we take a GroundingDINO [23] architecture that has been pre-trained on large-scale language-vision datasets [41]. Such a network first extracts the features of the language prompt and the query image via a text encoder and a visual encoder, respectively. Then the rough object regions are generated in the form of bounding boxes with a cross-modality decoder. Given the bounding-box-level region set $\mathcal{R}^B$ and their corresponding confidence score set $\mathcal{S}$, the Anomaly Region Generator module (Generator) can be formulated as,

$$\mathcal{R}^B, \mathcal{S} := \mathrm{Generator}(I, \mathcal{T}) \qquad (1)$$
3.2.2 Anomaly Region Refiner
To generate pixel-wise anomaly segmentation results, we propose the Anomaly Region Refiner to refine the bounding-box-level anomaly region candidates into an anomaly segmentation mask set. To this end, we use a sophisticated foundation model for open-world visual segmentation, i.e., SAM [19]. This model mainly includes a ViT-based [56] backbone and a prompt-conditioned mask decoder. Specifically, the model is trained on a large-scale image segmentation dataset [19] with one billion fine-grained masks, which enables high-quality mask generation under an open-set segmentation setting. The prompt-conditioned mask decoder accepts various types of prompts as input. We regard the bounding box candidates $\mathcal{R}^B$ as prompts and obtain pixel-wise segmentation masks $\mathcal{R}$. The Anomaly Region Refiner module (Refiner) can be formulated as follows,

$$\mathcal{R} := \mathrm{Refiner}(I, \mathcal{R}^B) \qquad (2)$$

We thereby obtain the set of regions in the form of high-quality segmentation masks $\mathcal{R}$ with corresponding confidence scores $\mathcal{S}$. To sum up, we summarize the framework (SAA) as follows,

$$\mathcal{R}, \mathcal{S} := \mathrm{SAA}(I, \mathcal{T}_n) \qquad (3)$$

where $\mathcal{T}_n$ is a naive class-agnostic language prompt, e.g., “anomaly”, utilized in SAA.
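To make the assembly concrete, the following is a minimal sketch of the two-stage pipeline in Eqs. (1)-(3). The generator and refiner are passed in as callables; in our setting they would wrap GroundingDINO and SAM inference, but the wrapper interfaces shown here are assumptions for illustration rather than the released implementation.

```python
from typing import Callable, Tuple
import numpy as np

# Assumed interfaces: a language-prompted box detector (e.g., a GroundingDINO
# wrapper) and a box-prompted mask refiner (e.g., a SAM wrapper).
BoxGenerator = Callable[[np.ndarray, str], Tuple[np.ndarray, np.ndarray]]
MaskRefiner = Callable[[np.ndarray, np.ndarray], np.ndarray]


def saa(image: np.ndarray, text_prompt: str,
        generator: BoxGenerator, refiner: MaskRefiner) -> Tuple[np.ndarray, np.ndarray]:
    """Vanilla Segment Any Anomaly (SAA) assembly, mirroring Eqs. (1)-(3).

    generator: returns box proposals R^B of shape (N, 4) and scores S of shape (N,).
    refiner:   returns one binary mask per box, shape (N, H, W).
    """
    boxes, scores = generator(image, text_prompt)  # Eq. (1): R^B, S
    masks = refiner(image, boxes)                  # Eq. (2): R
    return masks, scores                           # Eq. (3): R, S
```

A caller would bind `generator` and `refiner` to the official GroundingDINO and SAM inference utilities and invoke, e.g., `saa(image, "anomaly", generator, refiner)`.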
3.3 Analysis on the ZSAS Performance of Vanilla Foundation Model Assembly
We present some preliminary experiments to evaluate the efficacy of the vanilla foundation model assembly for ZSAS. Despite the simplicity and intuitiveness of the solution, we observe a language ambiguity issue. Specifically, certain language prompts, such as “anomaly”, may fail to detect the desired anomaly regions. For instance, as depicted in Fig. 1, all “wick” regions are erroneously identified as anomalies by SAA with the “anomaly” prompt.

We attribute this language ambiguity to the domain gap between the pretraining language-vision datasets and the targeted ZSAS datasets, which means that some language prompts may have different meanings and be associated with different image contents in distinct datasets. In addition, there is hardly any adjective expression like “anomaly” in those large-scale datasets, which makes this kind of prompt design poor at conveying what an anomaly region is. Additionally, the exact meaning of “anomaly” is object-specific and varies across objects: for example, it denotes scratches on leather or cracks on a hazelnut. The language ambiguity issue leads to severe false alarms on ZSAS datasets.

We propose introducing hybrid prompts generated by domain expert knowledge and the target image context to reduce language ambiguity, thereby achieving better ZSAS performance.
4 SAA+: Foundation Model Adaptation via Hybrid Prompt Regularization
To address language ambiguity in SAA and improve its ability on ZSAS, we propose an upgraded version called SAA+ that incorporates hybrid prompts, as shown in Fig. 2. In addition to leveraging the knowledge gained from pre-trained foundation models, SAA+ utilizes both domain expert knowledge and target image context to generate more accurate anomaly region masks. We provide further details on these hybrid prompts below.
4.1 Prompts Generated from Domain Expert Knowledge
Following the trend of prompt learning [48, 54], we initialize the prompt, which unlocks the knowledge of foundation models, in the form of language. However, the language ambiguity issue caused by the domain gap is particularly severe when using only the naive language prompt “anomaly”. To address this problem, we leverage domain expert knowledge that contains useful prior information about the target anomaly regions. Specifically, although experts may not provide a comprehensive list of potential open-world anomalies for a new product, they can identify some candidates based on their past experiences with similar products. Domain expert knowledge enables us to refine the naive “anomaly” prompt into more specific prompts that describe the anomaly state in greater detail. In addition to language prompts, we introduce property prompts to compensate for the lack of awareness of specific properties like “count” and “area” in existing foundation models [28].
4.1.1 Anomaly Language Expression as Prompt
To describe potential open-world anomalies, we propose designing more precise language prompts. These prompts are categorized into two types: class-agnostic and class-specific prompts.

Class-agnostic prompts ($\mathcal{T}_a$) are general prompts that describe anomalies not specific to any particular category, e.g., “anomaly” and “defect”. Despite the domain gap between the pre-trained datasets and the targeted ZSAS datasets, our empirical analysis (Sec. 5.3) shows that these generic prompts provide encouraging initial performance.
Class-specific prompts ($\mathcal{T}_s$) are designed based on expert knowledge of abnormal patterns in similar products to supplement more specific anomaly details. We use prompts already present in the pre-trained visual-linguistic dataset, e.g., “black hole” and “white bubble”, to query the desired regions. This approach reformulates the task of finding an anomaly region into locating objects with a specific anomaly state expression, which is more straightforward for foundation models than identifying an “anomaly” within an object context.

By prompting SAA with the anomaly language prompts $\mathcal{P}_L = \{\mathcal{T}_a, \mathcal{T}_s\}$ derived from domain expert knowledge, we generate finer anomaly region candidates $\mathcal{R}$ and corresponding confidence scores $\mathcal{S}$.
4.1.2 Anomaly Object Property as Prompt
Current foundation models [23, 57] have limitations when it comes to querying objects with specific property descriptions, such as size or location, which are important for describing anomalies, e.g., “The small black hole on the left of the cable.” To incorporate this critical expert knowledge, we propose using anomaly property prompts formulated as rules rather than language. Specifically, we consider the location and area of anomalies.
Anomaly Location. Accurate localization of anomalies plays a critical role in distinguishing true anomalies from false positives. Typically, anomalies are expected to be located within the objects of interest during inference. However, due to the influence of background context, anomalies may occasionally appear outside the inspected objects. To tackle this challenge, we leverage the open-world detection capability of foundation models to determine the location of the inspected object. Subsequently, we calculate the intersection over union (IoU) between the potential anomaly regions and the inspected object. By applying an expert-derived IoU threshold, denoted as $\theta_{IoU}$, we filter out anomaly candidates with IoU values below this threshold. This process ensures that the retained anomaly candidates are more likely to represent true anomalies located within the inspected object.
Anomaly Area. The size of an anomaly, as reflected by its area, is also a property that can provide useful information. In general, anomalies should be smaller than the inspected object. Experts can provide a suitable threshold value $\theta_{area}$ for the specific type of anomaly being considered. Candidates whose areas do not match $\theta_{area} \cdot \mathrm{ObjectArea}$ can then be filtered out.
By combining the two property prompts $\mathcal{P}_P = \{\theta_{area}, \theta_{IoU}\}$, we can filter the set of candidate regions $\mathcal{R}$ to obtain a subset of selected candidates $\mathcal{R}^P$ with corresponding confidence scores $\mathcal{S}^P$ using the filter function (Filter),

$$\mathcal{R}^P, \mathcal{S}^P := \mathrm{Filter}(\mathcal{R}, \mathcal{P}_P) \qquad (4)$$
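A minimal sketch of the property filter in Eq. (4) is given below, assuming the inspected object is available as a binary mask (e.g., from prompting the detector with the object name). The default thresholds are illustrative placeholders; the actual values derived from expert knowledge are listed in the supplementary material.

```python
import numpy as np


def property_filter(masks, scores, object_mask,
                    theta_iou: float = 0.1, theta_area: float = 0.8):
    """Sketch of Filter(R, P_P) in Eq. (4).

    masks:       list of binary candidate masks, each of shape (H, W)
    scores:      list of confidence scores, one per candidate
    object_mask: binary mask of the inspected object, shape (H, W)
    Keeps candidates whose IoU with the object reaches theta_iou and whose
    area does not exceed theta_area times the object area.
    """
    object_area = float(object_mask.sum())
    kept_masks, kept_scores = [], []
    for mask, score in zip(masks, scores):
        inter = np.logical_and(mask, object_mask).sum()
        union = np.logical_or(mask, object_mask).sum()
        iou = inter / max(union, 1)
        if iou >= theta_iou and mask.sum() <= theta_area * object_area:
            kept_masks.append(mask)
            kept_scores.append(score)
    return kept_masks, kept_scores
```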
4.2 Prompts Derived from Target Image Context
Besides incorporating domain expert knowledge, we can leverage the information provided by the
input image itself to improve the accuracy of anomaly region detection. In this regard, we propose
two prompts induced by the image context.
4.2.1 Anomaly Saliency as Prompt
Predictions generated by foundation models like [23] using the prompt “defect” can be unreliable due to the domain gap between pre-trained language-vision datasets [41] and targeted anomaly segmentation datasets [4, 58]. To calibrate the confidence scores of individual predictions, we propose the Anomaly Saliency Prompt, mimicking human intuition. Specifically, humans can recognize anomaly regions by their discrepancy with their surrounding regions [40], i.e., visual saliency contains valuable information indicating the anomaly degree. Hence, we calculate a saliency map ($s$) for the input image by computing the average distances between the corresponding pixel feature ($f$) and its $N$ nearest neighbors,

$$s_{ij} := \frac{1}{N} \sum_{f \in N_p(f_{ij})} \left(1 - \langle f_{ij}, f \rangle\right) \qquad (5)$$

where $(i, j)$ denotes the pixel location, $N_p(f_{ij})$ denotes the $N$ nearest neighbors of the corresponding pixel feature, and $\langle \cdot, \cdot \rangle$ refers to the cosine similarity. We use CNNs pre-trained on large-scale image datasets [59] to extract image features, ensuring the descriptiveness of the features. The saliency map indicates how different a region is from other regions. The saliency prompts $\mathcal{P}_S$ are defined as the exponential of the average saliency value within the corresponding region masks,

$$\mathcal{P}_S := \left\{ \exp\!\left(\frac{\sum_{ij} r_{ij} s_{ij}}{\sum_{ij} r_{ij}}\right) \;\middle|\; r \in \mathcal{R}^P \right\} \qquad (6)$$

The saliency prompts provide reliable indications of the confidence of anomaly regions. These prompts are employed to recalibrate the confidence scores generated by the foundation models, yielding new rescaled scores $\mathcal{S}^S$ based on the anomaly saliency prompts $\mathcal{P}_S$. These rescaled scores provide a combined measure that takes into account both the confidence derived from the foundation models and the saliency of the region candidate. The process is formulated as follows,

$$\mathcal{S}^S := \{ p \cdot s \mid p \in \mathcal{P}_S,\ s \in \mathcal{S}^P \} \qquad (7)$$
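The following sketch approximates Eqs. (5)-(7) with a pretrained feature map and a brute-force nearest-neighbor search; the feature resolution, mask resizing, and other engineering details of the actual implementation are assumptions here.

```python
import torch
import torch.nn.functional as F


def saliency_map(features: torch.Tensor, n_neighbors: int = 400) -> torch.Tensor:
    """Eq. (5): saliency per location of a feature map with shape (C, H, W).

    Saliency at (i, j) is the mean cosine distance between the feature at (i, j)
    and its n_neighbors most similar locations in the same image (assumes H*W > n_neighbors).
    """
    c, h, w = features.shape
    feats = F.normalize(features.reshape(c, h * w).t(), dim=1)  # (HW, C), unit norm
    sim = feats @ feats.t()                                     # pairwise cosine similarity
    sim.fill_diagonal_(-1.0)                                    # exclude self-matches
    topk_sim, _ = sim.topk(n_neighbors, dim=1)                  # N nearest neighbors
    return (1.0 - topk_sim).mean(dim=1).reshape(h, w)           # mean cosine distance


def rescale_scores(masks, scores, saliency: torch.Tensor):
    """Eqs. (6)-(7): rescale each region score by the exponential mean saliency
    inside its mask (masks are assumed resized to the saliency resolution)."""
    rescaled = []
    for mask, score in zip(masks, scores):
        m = torch.as_tensor(mask, dtype=torch.float32)
        p_s = torch.exp((m * saliency).sum() / m.sum().clamp(min=1.0))
        rescaled.append(float(p_s) * float(score))
    return rescaled
```

In our setting, the features come from a WideResNet50 pre-trained on ImageNet with $N = 400$ (see Sec. 5.1).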
4.2.2 Anomaly Confidence as Prompt
Typically, the number of anomaly regions in an inspected object is limited. Therefore, we propose anomaly confidence prompts $\mathcal{P}_C$ to identify the $K$ candidates with the highest confidence scores based on the image content and use their average values for the final anomaly region estimation. This is achieved by selecting the top $K$ candidate regions based on their corresponding confidence scores, as shown in the following,

$$\mathcal{R}^C, \mathcal{S}^C := \mathrm{TopK}(\mathcal{R}^P, \mathcal{S}^S) \qquad (8)$$

Denoting a single region and its corresponding score as $r^C$ and $s^C$, we then use these $K$ candidate regions to estimate the final anomaly map,

$$A_{ij} := \frac{\sum_{r^C \in \mathcal{R}^C} r^C_{ij} \cdot s^C}{\sum_{r^C \in \mathcal{R}^C} r^C_{ij}} \qquad (9)$$
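Read literally, Eqs. (8)-(9) amount to the aggregation sketched below, with $K = 5$ matching the default reported in Sec. 5.1; the handling of pixels covered by no selected region is an assumption of this sketch.

```python
import numpy as np


def top_k_anomaly_map(masks, scores, k: int = 5, eps: float = 1e-8) -> np.ndarray:
    """Eqs. (8)-(9): keep the K highest-scoring candidates and fuse them into
    a pixel-wise anomaly map via a score-weighted average over covering regions."""
    if len(masks) == 0:
        raise ValueError("no candidate regions to aggregate")
    order = np.argsort(scores)[::-1][:k]                      # TopK(R^P, S^S)
    weighted = np.zeros_like(np.asarray(masks[0], dtype=np.float32))
    coverage = np.zeros_like(weighted)
    for i in order:
        mask = np.asarray(masks[i], dtype=np.float32)
        weighted += mask * float(scores[i])                   # numerator of Eq. (9)
        coverage += mask                                      # denominator of Eq. (9)
    return weighted / (coverage + eps)                        # anomaly map A
```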
With the proposed hybrid prompts ($\mathcal{P}_L$, $\mathcal{P}_P$, $\mathcal{P}_S$, and $\mathcal{P}_C$), SAA is regularized into our final framework, i.e., Segment Any Anomaly + (SAA+), which makes more reliable anomaly predictions.
5 Experiments
In this section, we first assess the performance of SAA/SAA+ on several anomaly segmentation benchmarks. Then, we extensively study the effectiveness of individual hybrid prompts.
5.1 Experimental Setup
Datasets. We leverage four datasets with pixel-level annotations: VisA [58], MVTec-AD [4], KSDD2 [60], and MTD [61]. VisA and MVTec-AD comprise a variety of object subsets, e.g., circuit boards, while KSDD2 and MTD are comprised of texture anomalies. In summary, we categorize the subsets of all of these datasets into texture, which typically exhibits similar patterns within a single image (e.g., carpets), and object, which includes more diverse distributions (e.g., candles).
Evaluation Metrics. ZSAS performance is evaluated based on two metrics: (I) max-F1-pixel ($F_p$) [25], which measures the F1-score for pixel-wise segmentation at the optimal threshold; (II) max-F1-region ($F_r$), which is proposed in this paper to mitigate the bias towards large defects observed with max-F1-pixel [4]. Specifically, we compute the F1-score for region-wise segmentation at the optimal threshold, considering a prediction positive if the overlap value exceeds 0.6.
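For reference, max-F1-pixel can be approximated by sweeping binarization thresholds over the predicted anomaly map and keeping the best pixel-wise F1; the threshold grid below is an assumption of this sketch rather than the exact protocol of [25].

```python
import numpy as np


def max_f1_pixel(anomaly_map: np.ndarray, gt_mask: np.ndarray, num_thresholds: int = 200) -> float:
    """Best pixel-wise F1 over a grid of thresholds applied to the anomaly map (sketch)."""
    scores = anomaly_map.ravel().astype(np.float64)
    labels = gt_mask.ravel().astype(bool)
    best_f1 = 0.0
    for thr in np.linspace(scores.min(), scores.max(), num_thresholds):
        pred = scores >= thr
        tp = np.logical_and(pred, labels).sum()
        fp = np.logical_and(pred, ~labels).sum()
        fn = np.logical_and(~pred, labels).sum()
        f1 = 2.0 * tp / max(2.0 * tp + fp + fn, 1.0)
        best_f1 = max(best_f1, float(f1))
    return best_f1
```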
Implementation Details. We adopt the official implementations of GroundingDINO¹ and the Segment Anything Model² to construct the vanilla baseline (SAA). Details about the prompts derived from domain expert knowledge are explained in the supplementary material. For the saliency prompts induced from the image content, we utilize a WideResNet50 [62] network pre-trained on ImageNet [59] and set $N = 400$ in line with prior studies [40]. For anomaly confidence prompts, we set the hyperparameter $K$ to $5$ by default. Input images are fixed at a resolution of $400 \times 400$ for evaluation.
¹ https://github.com/IDEA-Research/GroundingDINO
² https://github.com/facebookresearch/segment-anything
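The hyperparameters listed above can be gathered into a small configuration object, sketched below; the field names are illustrative and not those of the released code.

```python
from dataclasses import dataclass


@dataclass
class SAAPlusConfig:
    """Illustrative collection of the SAA+ hyperparameters stated in Sec. 5.1."""
    image_size: int = 400                       # inputs resized to 400 x 400
    saliency_backbone: str = "wide_resnet50_2"  # ImageNet-pretrained feature extractor
    n_neighbors: int = 400                      # N in Eq. (5)
    top_k: int = 5                              # K for confidence prompts (Eq. (8))
```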
Table 1: Quantitative comparisons between SAA+ and other concurrent methods on zero-shot anomaly segmentation. Best scores are highlighted in bold; the second-best scores are underlined. (Per Dataset columns: VisA, MVTec-AD, KSDD2, MTD; Per Defect Type columns: Texture, Object.)

Metric  Method         VisA    MVTec-AD   KSDD2   MTD     Texture   Object   Total
Fp      WinClip [25]   14.82   31.65      -       -       -         20.93    -
        ClipSeg [24]   14.32   25.42      34.27   9.39    27.75     18.30    20.58
        UTAD [40]      6.95    23.48      22.53   11.37   29.13     12.07    16.19
        SAA            12.76   23.44      8.79    14.78   20.94     17.35    18.22
        SAA+           27.07   39.40      59.19   35.40   53.79     28.82    34.85
Fr      ClipSeg [24]   5.65    19.68      9.05    6.55    21.37     10.41    13.06
        UTAD [40]      5.32    17.53      3.56    2.95    16.38     9.94     11.49
        SAA            4.83    32.49      16.40   10.63   40.31     13.19    19.74
        SAA+           14.46   49.67      39.34   30.27   60.40     25.70    34.07
Figure 3: Qualitative comparisons on zero-shot anomaly segmentation for ClipSeg [24], UTAD [40], SAA, and SAA+ on four datasets, i.e., VisA [58], MVTec-AD [4], KSDD2 [60], and MTD [61].
5.2 Main Results
Methods for Comparison. We compare our final model, i.e., Segment Any Anomaly + (SAA+), with several concurrent state-of-the-art methods, including WinClip [25], UTAD [40], ClipSeg [24], and our vanilla baseline (SAA). For WinClip, we report its official results on VisA and MVTec-AD. For the other three methods, we use the official implementations and adapt them to the ZSAS task. Notably, as all methods require no training process, their performance is stable with a variance of ±0.00.
Quantitative Results: As shown in Table 1, the SAA+ method outperforms other methods in both $F_p$ and $F_r$ by a significant margin. Although WinClip [25], ClipSeg [24], and SAA also use foundation models, SAA+ better unleashes the capacity of foundation models and adapts them to tackle ZSAS. The remarkable performance of SAA+ meets the expectation of segmenting any anomaly without training.
Qualitative Results: Fig. 3 presents qualitative comparisons between SAA+ and previous competitive methods, where SAA+ achieves better performance. Moreover, the visualization shows SAA+ is capable of detecting texture anomalies, e.g., small scratches on the leather.
Table 2: Ablation study on the proposed hybrid prompts, including language prompt ($\mathcal{P}_L$), object property prompt ($\mathcal{P}_P$), saliency prompt ($\mathcal{P}_S$), and confidence prompt ($\mathcal{P}_C$). The best scores are highlighted in bold.

Metric  Model Variants              Texture   Object   Total
Fp      w/o PL   w/o Ta & Ts        50.30     24.79    30.95
                 w/o Ta             51.15     25.88    31.80
                 w/o Ts             53.51     26.55    33.06
        w/o PP                      21.83     21.40    21.50
        w/o PS                      50.58     24.72    30.96
        w/o PC                      50.41     27.99    34.13
        full model (SAA+)           53.79     28.82    34.85
Fr      w/o PL   w/o Ta & Ts        50.58     22.36    29.17
                 w/o Ta             55.26     20.28    28.72
                 w/o Ts             54.21     23.13    30.64
        w/o PP                      33.94     20.99    24.11
        w/o PS                      57.66     24.36    32.39
        w/o PC                      53.65     25.18    32.05
        full model (SAA+)           60.40     25.70    34.07
Figure 4: Effects of disabling (w/o) and enabling (w/) the prompts ($\mathcal{P}_S$) derived from saliency maps ($s$) on the final anomaly segmentation.
Figure 5: Sensitivity analysis of hyperparameter $K$ of confidence prompts ($\mathcal{P}_C$).
5.3 Ablation Study
In Table 2, we perform component-wise analysis to ablate specific prompt designs in our framework.
Language prompt ($\mathcal{P}_L$). Table 2 verifies the effectiveness of the language prompts derived from domain expert knowledge (+3.90% in $F_p$ and +4.90% in $F_r$). We then dig into the efficacy of $\mathcal{T}_a$ and $\mathcal{T}_s$, which clearly indicates that both the general descriptions and the specifically designed descriptions of anomalies achieve reasonable performance. Moreover, their combination creates a synergy, enhancing anomaly segmentation performance. The improvement from $\mathcal{P}_L$ helps unlock the language-driven region detection capacity of current foundation models [23, 19].
Property prompt ($\mathcal{P}_P$). Apart from the improvement in overall performance, property prompts bring dramatic improvements (from 21.83% to 53.79% in $F_p$) on texture categories, thanks to the filtering mechanism, which filters out a significant number of falsely detected anomaly region candidates via high-level characteristics, e.g., the location and area of regions in the target image.
Saliency prompt ($\mathcal{P}_S$). Table 2 provides clear evidence of the efficacy of $\mathcal{P}_S$ for anomaly segmentation. This is because region saliencies can accurately describe the degree of deviation of a region from its surroundings. In Fig. 4, we showcase the qualitative impact of $\mathcal{P}_S$ on anomaly segmentation, which illustrates that visual saliency maps help highlight abnormal regions, i.e., regions that show higher saliency values than other regions. By incorporating $\mathcal{P}_S$ to calibrate the confidence scores, more precise segmentation results can be achieved. For example, the use of $\mathcal{P}_S$ enables the effective localization of the cracked region of the hazelnut and the overlong wick on the candles.
Confidence prompt ($\mathcal{P}_C$). With the incorporation of anomaly confidence prompts, we limit the number of anomaly regions, which effectively reduces false positives, leading to a 0.72% average improvement in $F_p$ across all categories, as shown in Table 2. The influence of the hyperparameter $K$ in $\mathcal{P}_C$ is illustrated in Fig. 5. The figure shows that performance initially increases as $K$ grows, as more anomaly regions are accurately detected. However, when $K$ exceeds a certain threshold (around $K = 5$), the performance drops slightly as more regions are wrongly identified as abnormal. The best results are obtained at around $K = 5$, with an average $F_p$ of 34.85% across all categories.
6 Conclusion
In this work, we explore how to segment any anomaly without any further training by unleashing the full power of modern foundation models. We attribute the difficulty of adapting foundation model assemblies to anomaly segmentation to prompt design, which is the key to controlling the behavior of off-the-shelf foundation models. Thus, we propose a novel framework, i.e., Segment Any Anomaly + (SAA+), to leverage hybrid prompts derived from both expert knowledge and target image context to regularize foundation models free of training. Finally, we successfully adapt multiple foundation models to tackle zero-shot anomaly segmentation, achieving new SoTA results on several benchmarks. We hope our work can shed light on the design of label-free model adaptation for anomaly segmentation.
Limitations. Due to computational restrictions, we have not yet tested our method on larger-scale foundation models. We have completed the exploration of our methodology with representative foundation models, and we will explore the scaling effect of the models in the future.
References
[1]
Yunkang Cao, Xiaohao Xu, Zhaoge Liu, and Weiming Shen. Collaborative discrepancy optimization for
reliable image anomaly localization. IEEE Transactions on Industrial Informatics, pages 1–10, 2023.
[2]
Qian Wan, Liang Gao, Xinyu Li, and Long Wen. Industrial image anomaly localization based on gaussian
clustering of pretrained feature. IEEE Transactions on Industrial Electronics, 69(6):6182–6192, 2021.
[3]
Karsten Roth, Latha Pemula, Joaquin Zepeda, Bernhard Schölkopf, Thomas Brox, and Peter Gehler.
Towards total recall in industrial anomaly detection. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 14318–14328, 2022.
[4]
Paul Bergmann, Michael Fauser, David Sattlegger, and Carsten Steger. MVTec AD – A comprehensive
real-world dataset for unsupervised anomaly detection. In Proceedings of the IEEE/CVF conference on
Computer Vision and Pattern Recognition, pages 9592–9600, 2019.
[5]
Paul Bergmann, Michael Fauser, David Sattlegger, and Carsten Steger. Uninformed students: Student-
teacher anomaly detection with discriminative latent embeddings. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pages 4183–4192, 2020.
[6]
Christoph Baur, Stefan Denner, Benedikt Wiestler, Nassir Navab, and Shadi Albarqouni. Autoencoders for
unsupervised anomaly segmentation in brain mr images: a comparative study. Medical Image Analysis,
69:101952, 2021.
[7]
Kang Zhou, Yuting Xiao, Jianlong Yang, Jun Cheng, Wen Liu, Weixin Luo, Zaiwang Gu, Jiang Liu,
and Shenghua Gao. Encoding structure-texture relation with p-net for anomaly detection in retinal
images. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020,
Proceedings, Part XX 16, pages 360–377. Springer, 2020.
[8]
Jinlei Hou, Yingying Zhang, Qiaoyong Zhong, Di Xie, Shiliang Pu, and Hong Zhou. Divide-and-assemble:
Learning block-wise memory for unsupervised anomaly detection. In Proceedings of the IEEE/CVF
International Conference on Computer Vision, pages 8791–8800, 2021.
[9]
Vitjan Zavrtanik, Matej Kristan, and Danijel Skočaj. DRAEM – A discriminatively trained reconstruction
embedding for surface anomaly detection. In Proceedings of the IEEE/CVF International Conference on
Computer Vision, pages 8330–8339, 2021.
[10]
Takashi Matsubara, Kazuki Sato, Kenta Hama, Ryosuke Tachibana, and Kuniaki Uehara. Deep generative
model using unregularized score for anomaly detection with heterogeneous complexity. IEEE Transactions
on Cybernetics, 52(6):5161–5173, 2020.
[11]
Xudong Yan, Huaidong Zhang, Xuemiao Xu, Xiaowei Hu, and Pheng-Ann Heng. Learning semantic
context from normal samples for unsupervised anomaly detection. In Proceedings of the AAAI Conference
on Artificial Intelligence, volume 35, pages 3110–3118, 2021.
[12]
Jielin Jiang, Jiale Zhu, Muhammad Bilal, Yan Cui, Neeraj Kumar, Ruihan Dou, Feng Su, and Xiaolong
Xu. Masked swin transformer unet for industrial anomaly detection. IEEE Transactions on Industrial
Informatics, 19(2):2200–2209, 2022.
[13]
Jihun Yi and Sungroh Yoon. Patch SVDD: Patch-level SVDD for anomaly detection and segmentation. In
Proceedings of the Asian Conference on Computer Vision, 2020.
[14]
Fabio Valerio Massoli, Fabrizio Falchi, Alperen Kantarci, Şeymanur Akti, Hazim Kemal Ekenel, and
Giuseppe Amato. Mocca: Multilayer one-class classification for anomaly detection. IEEE Transactions on
Neural Networks and Learning Systems, 33(6):2313–2323, 2021.
[15]
Kihyuk Sohn, Chun-Liang Li, Jinsung Yoon, Minho Jin, and Tomas Pfister. Learning and evaluating
representations for deep one-class classification. In International Conference on Learning Representations,
2020.
[16]
Yunkang Cao, Xiaohao Xu, and Weiming Shen. Complementary pseudo multimodal feature for point
cloud anomaly detection. arXiv preprint arXiv:2303.13194, 2023.
[17]
Yue Wang, Jinlong Peng, Jiangning Zhang, Ran Yi, Yabiao Wang, and Chengjie Wang. Multimodal
industrial anomaly detection via hybrid fusion. In 2023 IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), 2023.
[18]
Xi Jiang, Jianlin Liu, Jinbao Wang, Qiang Nie, Kai Wu, Yong Liu, Chengjie Wang, and Feng Zheng.
SoftPatch: Unsupervised anomaly detection with noisy data. In Advances in neural information processing
systems, 2022.
[19]
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete
Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. arXiv preprint
arXiv:2304.02643, 2023.
[20]
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish
Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from
natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR,
2021.
[21]
Dongxu Li, Junnan Li, Hongdong Li, Juan Carlos Niebles, and Steven CH Hoi. Align and prompt:
Video-and-language pre-training with entity prompts. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 4953–4963, 2022.
[22]
Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S
Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of
foundation models. arXiv preprint arXiv:2108.07258, 2021.
[23]
Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang,
Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object
detection. arXiv preprint arXiv:2303.05499, 2023.
[24]
Timo Lüddecke and Alexander Ecker. Image segmentation using text and image prompts. In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7086–7096, 2022.
[25]
Jongheon Jeong, Yang Zou, Taewan Kim, Dongqing Zhang, Avinash Ravichandran, and Onkar Dabeer.
Winclip: Zero-/few-shot anomaly classification and segmentation. arXiv preprint arXiv:2303.14814, 2023.
[26]
Xiaohao Xu, Jinglu Wang, Xiang Ming, and Yan Lu. Towards robust video object segmentation with
adaptive object calibration. In Proceedings of the 30th ACM International Conference on Multimedia,
pages 1–10, 2022.
[27]
Xiaohao Xu, Jinglu Wang, Xiao Li, and Yan Lu. Reliable propagation-correction modulation for video
object segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2946–2954,
2022.
[28]
Roni Paiss, Ariel Ephrat, Omer Tov, Shiran Zada, Inbar Mosseri, Michal Irani, and Tali Dekel. Teaching
clip to count to ten. arXiv preprint arXiv:2302.12066, 2023.
[29]
Xiang Li, Jinglu Wang, Xiaohao Xu, Xiao Li, Yan Lu, and Bhiksha Raj. R^2VOS: Robust referring video
object segmentation via relational multimodal cycle consistency. arXiv preprint arXiv:2207.01203, 2022.
[30]
Mohammadreza Salehi, Niousha Sadjadi, Soroosh Baselizadeh, Mohammad H Rohban, and Hamid R
Rabiee. Multiresolution knowledge distillation for anomaly detection. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pages 14902–14912, 2021.
[31]
Guodong Wang, Shumin Han, Errui Ding, and Di Huang. Student-teacher feature pyramid matching for
anomaly detection. arXiv preprint arXiv:2103.04257, 2021.
[32]
Hanqiu Deng and Xingyu Li. Anomaly detection via reverse distillation from one-class embedding. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9737–9746,
2022.
[33]
Yunkang Cao, Qian Wan, Weiming Shen, and Liang Gao. Informative knowledge distillation for image
anomaly segmentation. Knowledge-Based Systems, 248:108846, 2022.
[34]
Yunkang Cao, Yanan Song, Xiaohao Xu, Shuya Li, Yuhao Yu, Yiheng Zhang, and Weiming Shen. Semi-
supervised knowledge distillation for tiny defect detection. In 2022 IEEE 25th International Conference
on Computer Supported Cooperative Work in Design (CSCWD), pages 1010–1015, 2022.
[35]
Qian Wan, Liang Gao, Xinyu Li, and Long Wen. Unsupervised image anomaly detection and segmentation
based on pre-trained feature mapping. IEEE Transactions on Industrial Informatics, 2022.
[36]
Qian Wan, Yunkang Cao, Liang Gao, Weiming Shen, and Xinyu Li. Position encoding enhanced feature
mapping for image anomaly detection. In 2022 IEEE 18th International Conference on Automation Science
and Engineering (CASE), pages 876–881. IEEE, 2022.
[37]
Amr M Nagy and László Czúni. Zero-shot learning and classification of steel surface defects. In Fourteenth
International Conference on Machine Vision (ICMV 2021), volume 12084, pages 386–394. SPIE, 2022.
[38]
Jiahui Liu, Xiaojuan Qi, Songzhi Su, Tony Prescott, and Li Sun. Zero-shot anomalous object detection
using unsupervised metric learning. In 2021 IEEE/RSJ International Conference on Intelligent Robots and
Systems (IROS 2021) Proceedings. Sheffield, 2021.
[39]
Adín Ramírez Rivera, Adil Khan, Imad Eddine Ibrahim Bekkouch, and Taimoor Shakeel Sheikh. Anomaly
detection based on zero-shot outlier synthesis and hierarchical feature distillation. IEEE Transactions on
Neural Networks and Learning Systems, 33(1):281–291, 2020.
[40]
Toshimichi Aota, Lloyd Teh Tzer Tong, and Takayuki Okatani. Zero-shot versus many-shot: Unsupervised
texture anomaly detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of
Computer Vision, pages 5564–5572, 2023.
[41]
Christoph Schuhmann, Robert Kaczmarczyk, Aran Komatsuzaki, Aarush Katta, Richard Vencu, Romain
Beaumont, Jenia Jitsev, Theo Coombes, and Clayton Mullis. Laion-400m: Open dataset of clip-filtered
400 million image-text pairs. In NeurIPS Workshop Datacentric AI. Jülich Supercomputing Center, 2021.
[42]
Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong
Hoi. Align before fuse: Vision and language representation learning with momentum distillation. In
Advances in neural information processing systems, volume 34, pages 9694–9705, 2021.
[43]
Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, and Aniruddha Kembhavi. Unified-IO: A
unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916, 2022.
[44]
Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren
Zhou, and Hongxia Yang. Unifying architectures, tasks, and modalities through a simple sequence-to-
sequence learning framework. arXiv preprint arXiv:2202.03052, 2022.
[45]
Yongming Rao, Wenliang Zhao, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, and
Jiwen Lu. Denseclip: Language-guided dense prediction with context-aware prompting. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[46]
Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei
Zhou, Xiyang Dai, Lu Yuan, Yin Li, et al. Regionclip: Region-based language-image pretraining. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16793–
16803, 2022.
[47]
Chong Zhou, Chen Change Loy, and Bo Dai. Extract free dense labels from clip. In Computer Vision–
ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVIII,
pages 696–712. Springer, 2022.
[48]
Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-
language models. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
2022.
[49]
Chen Ju, Tengda Han, Kunhao Zheng, Ya Zhang, and Weidi Xie. Prompting visual-language models for
efficient video understanding. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv,
Israel, October 23–27, 2022, Proceedings, Part XXXV, pages 105–124. Springer, 2022.
[50]
Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and
Ser-Nam Lim. Visual prompt tuning. In Computer Vision–ECCV 2022: 17th European Conference, Tel
Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXIII, pages 709–727. Springer, 2022.
[51]
Hyojin Bahng, Ali Jahanian, Swami Sankaranarayanan, and Phillip Isola. Exploring visual prompts for
adapting large-scale models. arXiv preprint arXiv:2203.17274, 1(3):4, 2022.
[52]
Yuhang Zang, Wei Li, Kaiyang Zhou, Chen Huang, and Chen Change Loy. Unified vision and language
prompt learning. arXiv preprint arXiv:2210.07225, 2022.
[53]
Sheng Shen, Shijia Yang, Tianjun Zhang, Bohan Zhai, Joseph E Gonzalez, Kurt Keutzer, and Trevor
Darrell. Multitask vision-language prompt tuning. arXiv preprint arXiv:2211.11720, 2022.
[54]
Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language
models. Int J Comput Vis, 130(9):2337–2348, 2022.
[55]
Aleksandar Shtedritski, Christian Rupprecht, and Andrea Vedaldi. What does clip know about a red circle?
visual prompt engineering for vlms. arXiv preprint arXiv:2304.06712, 2023.
[56]
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas
Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit,
and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In
International Conference on Learning Representations, 2021.
[57]
Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan
Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10965–
10975, 2022.
[58]
Yang Zou, Jongheon Jeong, Latha Pemula, Dongqing Zhang, and Onkar Dabeer. SPot-the-Difference
self-supervised pre-training for anomaly detection and segmentation. In Proceedings of the European
Conference on Computer Vision, 2022.
[59]
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional
neural networks. Advances in Neural Information Processing Systems, 25(1106-1114):1, 2012.
[60]
Jakob Božič, Domen Tabernik, and Danijel Skočaj. Mixed supervision for surface-defect detection: From
weakly to fully supervised learning. Computers in Industry, 129:103459, 2021.
[61]
Yibin Huang, Congying Qiu, Yue Guo, Xiaonan Wang, and Kui Yuan. Surface defect saliency of magnetic
tile. In 2018 IEEE 14th International Conference on Automation Science and Engineering (CASE), pages
612–617, 2018.
[62]
Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock Richard C. Wilson
and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages
87.1–87.12. BMVA Press, September 2016.