Figure (from International Journal of Computer Vision): Examples for the two F-OOD detection benchmarks, COVID and OBJECTS. Each benchmark consists of a training ID dataset, two covariate-shifted ID datasets, two near-OOD datasets, and four far-OOD datasets.
Source publication
Existing out-of-distribution (OOD) detection literature clearly defines semantic shift as a sign of OOD but does not have a consensus over covariate shift. Samples experiencing covariate shift but not semantic shift from the in-distribution (ID) are either excluded from the test set or treated as OOD, which contradicts the primary goal in machine learning ...
Citations
... For challenging and practical evaluations, three perspectives are crucial: (1) the semantic similarity between ID and OOD data, (2) the two distribution shifts, namely semantic shift (class shift) and covariate shift (feature distribution shift), and (3) the dataset's alignment with real-world scenarios. Here, we define label shifts as 'semantic shifts' and changes in image style or quality as 'covariate shifts,' following [16]. Research addressing both shifts in (2) is termed full-spectrum OOD detection [16]. These perspectives reflect real-world scenarios and enable more robust evaluations across different OOD detection methods. ...
... Background. An existing problem setting for OOD detection with covariate shifts is full-spectrum OOD (FS-OOD) [16,22]. In FS-OOD, the ID data is constructed by adding data with the same semantics but different covariate distributions (covariate-shifted ID) to the training data (training ID). ...
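To make the FS-OOD data construction concrete, here is a minimal sketch of how such an evaluation split can be organized; the dataset names are placeholders rather than the exact benchmarks used in the cited works.

```python
# Illustrative sketch of a full-spectrum OOD evaluation split.
# Dataset names are placeholders, not the exact benchmarks of the cited papers.
from typing import Dict, List

fs_ood_splits: Dict[str, List[str]] = {
    # ID data used to train the classifier
    "training_id": ["imagenet_train"],
    # same classes, shifted covariates (style/corruption) -- still ID in FS-OOD
    "covariate_shifted_id": ["imagenet_c", "imagenet_r_subset"],
    # unseen classes that are semantically close to the ID classes
    "near_ood": ["species_outside_id_classes"],
    # unseen classes that are semantically far from the ID classes
    "far_ood": ["textures", "svhn"],
}

def expected_decision(split: str) -> str:
    """In full-spectrum OOD detection, covariate-shifted ID must be accepted."""
    return "accept" if split in ("training_id", "covariate_shifted_id") else "reject"

for split in fs_ood_splits:
    print(split, "->", expected_decision(split))
```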
Out-of-distribution (OOD) detection is the task of detecting OOD samples at inference time to ensure the safety of deployed models. However, conventional benchmarks have reached performance saturation, making it difficult to compare recent OOD detection methods. To address this challenge, we introduce three novel OOD detection benchmarks that enable a deeper understanding of method characteristics and reflect real-world conditions. First, we present ImageNet-X, designed to evaluate performance under challenging semantic shifts. Second, we propose ImageNet-FS-X for full-spectrum OOD detection, assessing robustness to covariate shifts (feature distribution shifts). Finally, we propose Wilds-FS-X, which extends these evaluations to real-world datasets, offering a more comprehensive testbed. Our experiments reveal that recent CLIP-based OOD detection methods struggle to varying degrees across the three proposed benchmarks, and none of them consistently outperforms the others. We hope the community will go beyond specific benchmarks and include more challenging conditions that reflect real-world scenarios. The code is available at https://github.com/hoshi23/OOD-X-Banchmarks.
... Notably, recent studies [16] have pointed out that simply rejecting OOD samples may be insufficient for ensuring model robustness; addressing errors within in-distribution data is equally important, a challenge often addressed within the Selective Prediction framework [17]. Additionally, researchers have emphasized the necessity of incorporating covariate shift benchmarks to achieve robust models [18]. These varied perspectives illustrate the complexity of evaluating model robustness, underscoring the need for an evaluation framework that not only focuses on OOD performance but also manages overall prediction risk, ensuring that both ID and OOD risks are adequately addressed. ...
This paper challenges the conventional approach of treating out-of-distribution (OOD) risk as uniform and aiming to reduce it on average. We argue that managing OOD risk on average fails to account for the potential impact of rare, high-consequence events, which can undermine trust in a model even with just a single OOD incident. First, we show that OOD performance depends on both the rate of outliers and the number of samples processed by a machine learning (ML) model. Second, we introduce a novel perspective that assesses OOD risk by considering the expected maximum risk within a limited sample size. Our theoretical findings clearly distinguish when OOD detection is essential and when it becomes redundant, allowing efforts to be directed towards improving ID performance once adequate OOD robustness is achieved. Finally, an analysis of popular computer vision benchmarks reveals that ID errors often dominate overall risk, highlighting the importance of strong ID performance as a foundation for effective OOD detection. Our framework offers both theoretical insights and practical guidelines for deploying ML models in high-stakes applications, where trust and reliability are paramount.
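As a back-of-the-envelope illustration of how the outlier rate and the number of processed samples interact (an illustrative calculation, not necessarily the paper's exact formulation): with outlier rate p and n independent test samples,

```latex
\[
  \Pr[\text{at least one OOD sample among } n] \;=\; 1 - (1 - p)^{n}
\]
```

which approaches 1 as n grows even for small p, so the risk of a single high-consequence OOD event accumulates over the deployment horizon rather than staying at its per-sample average.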
... Averly et al. presented a unifying framework to detect OOD scenarios caused by both semantic and covariate shifts in uncontrolled environments for a variety of models [17]. Yang et al. presented a full-spectrum OOD detection model that uses a simple feature-based semantics score function to account for semantic shifts and become tolerant to covariate shifts in image data [18]. However, none of these works focus on OOD detection scenarios for RL-agent-based autonomous systems. ...
Autonomous agents for cyber applications take advantage of modern defense techniques by adopting intelligent agents with conventional and learning-enabled components. These intelligent agents are trained via reinforcement learning (RL) algorithms, and can learn, adapt to, reason about and deploy security rules to defend networked computer systems while maintaining critical operational workflows. However, the knowledge available during training about the state of the operational network and its environment may be limited. The agents should be trustworthy so that they can reliably detect situations they cannot handle, and hand them over to cyber experts. In this work, we develop an out-of-distribution (OOD) Monitoring algorithm that uses a Probabilistic Neural Network (PNN) to detect anomalous or OOD situations of RL-based agents with discrete states and discrete actions. To demonstrate the effectiveness of the proposed approach, we integrate the OOD monitoring algorithm with a neurosymbolic autonomous cyber agent that uses behavior trees with learning-enabled components. We evaluate the proposed approach in a simulated cyber environment under different adversarial strategies. Experimental results over a large number of episodes illustrate the overall efficiency of our proposed approach.
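As an illustration of the general idea of PNN-based OOD monitoring (a generic Parzen-window-style density score over state features, not the cited paper's exact algorithm), a minimal sketch might look like this:

```python
# Minimal sketch of a PNN-style (Parzen-window) OOD monitor for an RL agent,
# assuming discrete states are encoded as numeric feature vectors.
# Generic illustration only, not the cited paper's exact monitor.
import numpy as np

class PNNMonitor:
    def __init__(self, train_states: np.ndarray, sigma: float = 1.0, threshold: float = 1e-3):
        self.train_states = train_states.astype(float)  # (N, d) states seen during training
        self.sigma = sigma
        self.threshold = threshold

    def density(self, state: np.ndarray) -> float:
        # Average Gaussian kernel response over all training states.
        diffs = self.train_states - state.astype(float)
        sq = np.sum(diffs ** 2, axis=1)
        return float(np.mean(np.exp(-sq / (2.0 * self.sigma ** 2))))

    def is_ood(self, state: np.ndarray) -> bool:
        # Low estimated density -> hand the situation over to a human expert.
        return self.density(state) < self.threshold

# Usage: monitor = PNNMonitor(train_states); monitor.is_ood(current_state)
```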
... We use AUROC as our metric, which is commonly employed in existing OOD detection studies (Yang et al., 2021; 2023), to evaluate the performance of OOD detection methods in our experiments. For specific training hyperparameter settings across all experiments, please refer to Appendix B. ...
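For reference, AUROC for OOD detection is typically computed by treating ID-vs-OOD as a binary ranking problem over the detector's scores; a minimal sketch (assuming scores where higher means "more ID") is:

```python
# Minimal sketch of AUROC evaluation for OOD detection: labels are 1 for ID
# and 0 for OOD test samples, and higher scores are assumed to mean "more ID".
import numpy as np
from sklearn.metrics import roc_auc_score

def ood_auroc(id_scores: np.ndarray, ood_scores: np.ndarray) -> float:
    labels = np.concatenate([np.ones_like(id_scores), np.zeros_like(ood_scores)])
    scores = np.concatenate([id_scores, ood_scores])
    return roc_auc_score(labels, scores)

# Example: ood_auroc(np.array([0.9, 0.8, 0.95]), np.array([0.2, 0.6, 0.1]))
```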
The primary goal of out-of-distribution (OOD) detection tasks is to identify inputs with semantic shifts, i.e., if samples from novel classes are absent in the in-distribution (ID) dataset used for training, we should reject these OOD samples rather than misclassifying them into existing ID classes. However, we find the current definition of "semantic shift" is ambiguous, which renders certain OOD testing protocols intractable for the post-hoc OOD detection methods based on a classifier trained on the ID dataset. In this paper, we offer a more precise definition of the Semantic Space and the Covariate Space for the ID distribution, allowing us to theoretically analyze which types of OOD distributions make the detection task intractable. To avoid the flaw in the existing OOD settings, we further define the "Tractable OOD" setting which ensures the distinguishability of OOD and ID distributions for the post-hoc OOD detection methods. Finally, we conduct several experiments to demonstrate the necessity of our definitions and validate the correctness of our theorems.
... Or the data could simply be corrupted (without any content deviation), which challenges the model's robustness, as ML models are vulnerable to covariate shifts [20]. In such cases, the decision to detect such samples as OOD or not largely hinges on the model's ability to generalize across existing covariate shifts [21]. ...
... Fig. 3 provides example thumbnails to illustrate our criteria for classifying images as ID or OOD. Our approach is closely related to Semantic AD, where research aims to define OOD scores that capture the semantic essence of images (e.g., [21, 44-46]). In this context, FBP reconstructions from sparse measurements of ID samples can be interpreted as covariate shifts. ...
Recent works demonstrate the effectiveness of diffusion models as unsupervised solvers for inverse imaging problems. Sparse-view computed tomography (CT) has greatly benefited from these advancements, achieving improved generalization without reliance on measurement parameters. However, this comes at the cost of potential hallucinations, especially when handling out-of-distribution (OOD) data. To ensure reliability, it is essential to study OOD detection for CT reconstruction across both clinical and industrial applications. This need further extends to enabling the OOD detector to function effectively as an anomaly inspection tool. In this paper, we explore the use of a diffusion model, trained to capture the target distribution for CT reconstruction, as an in-distribution prior. Building on recent research, we employ the model to reconstruct partially diffused input images and assess OOD-ness through multiple reconstruction errors. Adapting this approach for sparse-view CT requires redefining the notions of "input" and "reconstruction error". Here, we use filtered backprojection (FBP) reconstructions as input and investigate various definitions of reconstruction error. Our proof-of-concept experiments on the MNIST dataset highlight both successes and failures, demonstrating the potential and limitations of integrating such an OOD detector into a CT reconstruction system. Our findings suggest that effective OOD detection can be achieved by comparing measurements with forward-projected reconstructions, provided that reconstructions from noisy FBP inputs are conditioned on the measurements. However, conditioning can sometimes lead the OOD detector to inadvertently reconstruct OOD images well. To counter this, we introduce a weighting approach that improves robustness against highly informative OOD measurements, albeit with a trade-off in performance in certain cases.
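A toy sketch of the measurement-consistency idea described above (comparing measurements with forward-projected reconstructions) is given below; the forward operator and the reconstructor are placeholders, not the diffusion-based pipeline of the paper.

```python
# Toy sketch of measurement-consistency OOD scoring for CT reconstruction:
# score the discrepancy between the observed measurements and the forward
# projection of the reconstruction. The forward operator and reconstructor
# below are placeholders, not the cited work's diffusion-based pipeline.
import numpy as np

def forward_project(image: np.ndarray) -> np.ndarray:
    # Toy 2-view "projector": row sums and column sums stacked together.
    return np.concatenate([image.sum(axis=0), image.sum(axis=1)])

def ood_score(reconstruction: np.ndarray, measurements: np.ndarray) -> float:
    # A large residual means the reconstruction does not explain the
    # measurements, which is treated as evidence that the input is OOD.
    return float(np.linalg.norm(forward_project(reconstruction) - measurements))

# Usage with a hypothetical conditional reconstructor:
# recon = some_conditional_reconstructor(fbp_input, measurements)
# if ood_score(recon, measurements) > tau: flag_as_ood()
```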
... However, they struggle with recognizing inliers and outliers effectively. In contrast, outlier exposure adds a regularization term to the training objective to help recognize OOD samples [49-51]. Yet, this could significantly harm the training objectives [17] and impose unnecessary constraints on the model for recognizing specific types of OOD data. ...
... As stated in Section 2, conventional outlier exposure introduces a new training objective, namely producing anomalous outputs for anomalous samples [49-51]. As such, the objective function of conventional outlier exposure can be formulated as: ...
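The formula itself is elided in the excerpt; for orientation, a common outlier-exposure objective from the literature (which may differ from the cited paper's exact formulation) trains the classifier on ID data while pushing predictions on auxiliary outliers toward the uniform distribution:

```latex
% A common outlier-exposure objective (illustrative only; the cited paper's
% exact formulation is elided in the excerpt above):
\[
  \mathcal{L}
  = \mathbb{E}_{(x,y)\sim\mathcal{D}_{\mathrm{in}}}\!\big[\ell_{\mathrm{CE}}(f(x),\,y)\big]
  + \lambda\,
    \mathbb{E}_{\tilde{x}\sim\mathcal{D}_{\mathrm{out}}}\!\big[\ell_{\mathrm{CE}}(f(\tilde{x}),\,\mathcal{U})\big]
\]
% where \mathcal{U} is the uniform distribution over the ID classes, \ell_CE is
% the cross-entropy loss, and \lambda weights the auxiliary outlier term.
```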
In real-world scenarios, deep learning models often face challenges from both imbalanced (long-tailed) and out-of-distribution (OOD) data. However, existing joint methods rely on real OOD data, which leads to unnecessary trade-offs. In contrast, our research shows that data mixing, a potent augmentation technique for long-tailed recognition, can generate pseudo-OOD data that exhibit the features of both in-distribution (ID) data and OOD data. Therefore, by using mixed data instead of real OOD data, we can address long-tailed recognition and OOD detection holistically. We propose a unified framework called Reinforced Imbalance Learning with Class-Aware Self-Supervised Outliers Exposure (RICASSO), where "self-supervised" denotes that we only use ID data for outlier exposure. RICASSO includes three main strategies: (1) Norm-Odd-Duality-Based Outlier Exposure, which uses mixed data as pseudo-OOD data, enabling simultaneous ID data rebalancing and outlier exposure through a single loss function; (2) Ambiguity-Aware Logits Adjustment, which utilizes the ambiguity of ID data to adaptively recalibrate logits; and (3) Contrastive Boundary-Center Learning, which combines Virtual Boundary Learning and Dual-Entropy Center Learning to use mixed data for better feature separation and clustering, with Representation Consistency Learning for robustness. Extensive experiments demonstrate that RICASSO achieves state-of-the-art performance in long-tailed recognition and significantly improves OOD detection compared to our baseline (27% improvement in AUROC and 61% reduction in FPR on the iNaturalist2018 dataset). On iNaturalist2018, we even outperform methods using real OOD data. The code will be made public soon.
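To illustrate the data-mixing idea in isolation (a generic mixup-style sketch, not RICASSO's exact recipe), pseudo-OOD samples can be generated from ID data alone by convexly combining images from different classes:

```python
# Sketch of generating pseudo-OOD samples from ID data alone via mixup-style
# data mixing. Generic illustration of the idea, not RICASSO's exact method.
from typing import Optional
import numpy as np

def mix_pseudo_ood(x_a: np.ndarray, x_b: np.ndarray, alpha: float = 1.0,
                   rng: Optional[np.random.Generator] = None) -> np.ndarray:
    """Convexly combine two ID images from different classes; the result carries
    features of both classes and can serve as a pseudo-OOD sample."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * x_a + (1.0 - lam) * x_b

# Usage: pseudo_ood = mix_pseudo_ood(img_cat, img_dog)
```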
... For scoring rules, we compare the maximum softmax probability (MSP) (Hendrycks & Gimpel, 2017), the Maximum Logit Score (MLS) (Vaze et al., 2022), ODIN (Liang et al., 2018), GODIN (Hsu et al., 2020), Energy scoring (Liu et al., 2020), GradNorm and SEM (Yang et al., 2023). We further experiment with ReAct, an activation pruning technique which can be employed in conjunction with any scoring rule. ...
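For orientation, the first three scoring rules listed above can be computed directly from classifier logits; a minimal sketch (higher score meaning "more ID") is:

```python
# Minimal sketch of three common post-hoc OOD scores computed from classifier
# logits of shape (batch, num_classes); higher score = more ID.
import numpy as np

def msp_score(logits: np.ndarray) -> np.ndarray:
    # Maximum softmax probability (Hendrycks & Gimpel, 2017).
    z = logits - logits.max(axis=1, keepdims=True)      # numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return probs.max(axis=1)

def mls_score(logits: np.ndarray) -> np.ndarray:
    # Maximum logit score (Vaze et al., 2022).
    return logits.max(axis=1)

def energy_score(logits: np.ndarray, T: float = 1.0) -> np.ndarray:
    # Negative free energy (Liu et al., 2020); larger values indicate ID.
    return T * np.log(np.exp(logits / T).sum(axis=1))
```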
Detecting test-time distribution shift has emerged as a key capability for safely deployed machine learning models, with the question being tackled under various guises in recent years. In this paper, we aim to provide a consolidated view of the two largest sub-fields within the community: out-of-distribution (OOD) detection and open-set recognition (OSR). In particular, we aim to provide rigorous empirical analysis of different methods across settings and provide actionable takeaways for practitioners and researchers. Concretely, we make the following contributions: (i) We perform rigorous cross-evaluation between state-of-the-art methods in the OOD detection and OSR settings and identify a strong correlation between the performances of methods for them; (ii) We propose a new, large-scale benchmark setting which we suggest better disentangles the problem tackled by OOD detection and OSR, re-evaluating state-of-the-art OOD detection and OSR methods in this setting; (iii) We surprisingly find that the best performing method on standard benchmarks (Outlier Exposure) struggles when tested at scale, while scoring rules which are sensitive to the deep feature magnitude consistently show promise; and (iv) We conduct empirical analysis to explain these phenomena and highlight directions for future research. Code: https://github.com/Visual-AI/Dissect-OOD-OSR
... The work of Hendrycks and Gimpel (2017) first introduced the baseline (i.e., MSP) as the scoring function to detect OOD inputs. Later, various methods focusing on designing scoring functions were proposed, like ODIN (Liang et al., 2018), Mahalanobis distance (Lee et al., 2018), energy score (Liu et al., 2020), ViM, KNN, and SEM (Yang et al., 2023). The methods mentioned above are post-hoc techniques, and there are also many training-time methods like LogitNorm (Wei et al., 2022). ...
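As a sketch of the Mahalanobis-distance scoring idea mentioned above (in the spirit of Lee et al., 2018, with a shared covariance; details simplified), one can score a test feature by its distance to the nearest class-conditional Gaussian:

```python
# Sketch of a Mahalanobis-distance OOD score: fit class-conditional Gaussians
# with a shared covariance on ID features, then score a test feature by its
# squared distance to the closest class mean. Simplified illustration only.
import numpy as np

def fit_gaussians(features: np.ndarray, labels: np.ndarray):
    classes = np.unique(labels)
    means = np.stack([features[labels == c].mean(axis=0) for c in classes])
    centered = features - means[np.searchsorted(classes, labels)]
    cov = centered.T @ centered / len(features)
    return means, np.linalg.pinv(cov)

def mahalanobis_score(x: np.ndarray, means: np.ndarray, prec: np.ndarray) -> float:
    # Higher (less negative) score = closer to some ID class mean = more ID.
    d = x - means                                  # (num_classes, dim)
    dists = np.einsum("cd,de,ce->c", d, prec, d)   # squared Mahalanobis distances
    return float(-dists.min())
```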
In open-world recognition of safety-critical applications, providing reliable prediction for deep neural networks has become a critical requirement. Many methods have been proposed for reliable prediction related tasks such as confidence calibration, misclassification detection, and out-of-distribution detection. Recently, pre-training has been shown to be one of the most effective methods for improving reliable prediction, particularly for modern networks like ViT, which require a large amount of training data. However, collecting data manually is time-consuming. In this paper, taking advantage of the breakthrough of generative models, we investigate whether and how expanding the training set using generated data can improve reliable prediction. Our experiments reveal that training with a large quantity of generated data can eliminate overfitting in reliable prediction, leading to significantly improved performance. Surprisingly, classical networks like ResNet-18, when trained on a notably extensive volume of generated data, can sometimes exhibit performance competitive to pre-training ViT with a substantial real dataset.
... Recently, Yang et al. [33] and Bai et al. [166] have proposed that we should consider cases where covariate shift occurs in ID data, which was not previously taken into account. This is crucial to prevent the loss of model generalization. ...
... Various OOD datasets are constructed accordingly for evaluating different methods [175], with the most frequently employed datasets being iNaturalist [176], SUN [177], Places [178] and Textures [179]. Additionally, full-spectrum OOD is typically evaluated on three benchmarks: DIGITS, OBJECTS, and COVID, as proposed by Yang et al. [33]. • Semantic Segmentation. ...
Out-of-distribution (OOD) detection aims to detect test samples outside the training category space, which is an essential component in building reliable machine learning systems. Existing reviews on OOD detection primarily focus on method taxonomy, surveying the field by categorizing various approaches. However, many recent works concentrate on non-traditional OOD detection scenarios, such as test-time adaptation, multi-modal data sources, and other novel contexts. In this survey, we uniquely review recent advances in OOD detection from the problem scenario perspective for the first time. According to whether the training process is completely controlled, we divide OOD detection methods into training-driven and training-agnostic. In addition, considering the rapid development of pre-trained models, large pre-trained model-based OOD detection is also regarded as an important category and discussed separately. Furthermore, we provide a discussion of the evaluation scenarios, a variety of applications, and several future research directions. We believe this survey with its new taxonomy will benefit the proposal of new methods and the expansion of more practical scenarios. A curated list of related papers is provided in the GitHub repository: https://github.com/shuolucs/Awesome-Out-Of-Distribution-Detection
... In the context of graphs, OOD data encompasses graphs with structural properties, node features, or edge distributions that are not well-represented in the training set [30]. While it is simple to state that data is ID or OOD, this demarcation is an oversimplification, since ID and OOD data exist along a spectrum [31,32]. ...
This thesis presents a new metric named Graph Distributional Analytics (GDA). This approach uses Weisfeiler-Leman kernels, cosine similarity, and traditional statistical metrics to better characterize graph-structured data. It focuses on enhancing the analysis of graph-structured data and improving the explainability and power of Graph Neural Networks (GNNs) without introducing a new model architecture. Within existing GNN research, strong claims of out-of-distribution (OOD) generalizability are frequently made, but these claims fail when exposed to real-world data. We propose that existing standards for identifying OOD data are insufficient and that a metric is needed which accurately and efficiently identifies data that genuinely differs from the training data. Our metric accurately identifies OOD data, which allows researchers to make realistic claims about model generalizability. Extensive experiments confirm the effectiveness of this metric through comparative analysis against traditional methods. Our study shows that GDA outperforms existing metrics in detecting OOD instances. This is needed for applications where the generalizability of GNNs is necessary, such as drug effectiveness studies, protein interaction classification, and complex network systems in telecommunications and social media analysis. The thesis also explores how this metric affects the explainability of GNNs, revealing the behavior and decision-making processes of these models. Applying GDA to curriculum transfer learning optimizes data usage and computational efficiency: by strategically introducing training data, the models progressively adapt, improving accuracy and generalization capabilities across various graph-based tasks.
This work does not propose a new GNN architecture. Instead, it offers a methodology for better understanding and analyzing the data these models process. The contributions of this thesis extend beyond academic theory to practical applications, where improving the accuracy and efficiency of GNNs can lead to significant advancements in bioinformatics, chemistry, code analysis, and network security.
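To illustrate the ingredients named above (Weisfeiler-Leman labels plus cosine similarity) without claiming to reproduce the thesis's exact GDA metric, a minimal sketch of a WL-histogram-based OOD score for graphs might look like this:

```python
# Illustrative sketch of a WL-kernel-style OOD check for graphs: represent each
# graph by a histogram of Weisfeiler-Leman labels and compare a test graph to
# the training graphs via cosine similarity. Generic illustration only.
from collections import Counter
from typing import Dict, List, Set
import numpy as np

Graph = Dict[int, Set[int]]  # adjacency list: node -> set of neighbors

def wl_histogram(graph: Graph, iterations: int = 2) -> Counter:
    labels = {v: str(len(nbrs)) for v, nbrs in graph.items()}  # init: degree labels
    hist = Counter(labels.values())
    for _ in range(iterations):
        # WL refinement: new label = old label + sorted multiset of neighbor labels.
        labels = {v: labels[v] + "|" + ",".join(sorted(labels[u] for u in graph[v]))
                  for v in graph}
        hist.update(labels.values())
    return hist

def cosine(a: Counter, b: Counter) -> float:
    keys = set(a) | set(b)
    va = np.array([a[k] for k in keys], dtype=float)
    vb = np.array([b[k] for k in keys], dtype=float)
    denom = np.linalg.norm(va) * np.linalg.norm(vb)
    return float(va @ vb / denom) if denom else 0.0

def ood_score(test_graph: Graph, train_graphs: List[Graph]) -> float:
    # Low maximum similarity to any training graph -> likely OOD.
    test_hist = wl_histogram(test_graph)
    return 1.0 - max(cosine(test_hist, wl_histogram(g)) for g in train_graphs)
```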