Fig. 2 - available from: International Journal of Computer Vision
Source publication
As the success of deep models has led to their deployment in all areas of computer vision, it is increasingly important to understand how these representations work and what they are capturing. In this paper, we shed light on deep spatiotemporal representations by visualizing the internal representation of models that have been trained to recognize...
Contexts in source publication
Context 1
... overview of our approach is shown in Fig. 2. A randomly initialized input is presented to the optical flow and the appearance pathways of our model. We compute the feature maps up to a particular layer that we would like to visualize. A single target feature channel, c, is selected and activation maximization is performed to generate the preferred input in two steps. First, the ...
Context 2
... to generate the preferred input in two steps. First, the derivatives on the input that affect c are calculated by backpropagating the target loss, summed over all locations, to the input layer. Second, the propagated gradient is scaled by the learning rate and added to the current input. These operations are illustrated by the dotted red lines in Fig. 2. Gradient-based optimization performs these steps iteratively with an adaptively decreasing learning rate until the input converges. Importantly, during this optimization process the network weights are not altered; only the input is updated. The detailed procedure is outlined in the remainder of this ...
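The two-step update described in these contexts amounts to a short gradient-ascent loop over the input. Below is a minimal sketch of that loop in PyTorch, assuming a generic frozen model and target layer; it illustrates the procedure only and omits the paper's additional regularization of the optimized appearance and flow inputs (cf. the γ parameter in Context 5).

```python
import torch

def activation_maximization(model, layer, channel, input_shape,
                            steps=400, lr=0.05, decay=0.98):
    """Minimal activation-maximization sketch (not the paper's exact code).

    model       -- a frozen network (weights are never updated)
    layer       -- the module whose feature channel we want to visualize
    channel     -- index c of the target feature channel
    input_shape -- shape of the optimized input, e.g. (1, 3, 224, 224)
    """
    # Start from a randomly initialized input; only this tensor is optimized.
    x = torch.randn(input_shape, requires_grad=True)

    # Capture the activations of the target layer with a forward hook.
    acts = {}
    handle = layer.register_forward_hook(
        lambda m, i, o: acts.__setitem__('feat', o))

    for step in range(steps):
        model.zero_grad()
        if x.grad is not None:
            x.grad = None

        model(x)                                  # forward pass up to (and past) the target layer
        loss = acts['feat'][:, channel].sum()     # target channel, summed over all locations

        # Step 1: backpropagate the target loss to the input layer.
        loss.backward()

        # Step 2: scale the gradient by the learning rate and add it to the input.
        with torch.no_grad():
            x += lr * x.grad

        lr *= decay                               # adaptively decreasing learning rate

    handle.remove()
    return x.detach()
```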
Context 3
... our detailed discussion in Sects. 4.1, 4.2 and 4.3, we focus our experimental studies on a VGG-16 two-stream fusion model (Feichtenhofer et al. 2016b) that is illustrated in Fig. 2 and trained on UCF-101. Our visualization technique, however, is generally applicable to any spatiotemporal architecture. In Sect. 4.4, we visualize various other architectures trained on multiple ...
Context 4
... first study the conv5_fusion layer (i.e. the last local layer; see Fig. 2 for the overall architecture and Table 1 for the filter specification of the layers), which takes in features from the appearance and motion streams and learns a local fusion representation for subsequent fully-connected layers with global receptive fields. Therefore, this layer is of particular interest as it is the first point in the ...
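For intuition, a local fusion layer of this kind can be realized as channel-wise concatenation of the two streams followed by a convolution. The sketch below is one such generic realization; the channel count and kernel size are illustrative assumptions, not the exact specification from Table 1.

```python
import torch
import torch.nn as nn

class ConvFusion(nn.Module):
    """Sketch of a local fusion layer over appearance and motion features.

    The two streams are stacked along the channel axis and mixed by a
    convolution, so each output unit sees both modalities at the same
    spatial location: a local fusion, before any fully-connected layers
    with global receptive fields.
    """

    def __init__(self, channels=512):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, appearance_feat, motion_feat):
        # appearance_feat, motion_feat: (N, C, H, W) conv5 feature maps
        stacked = torch.cat([appearance_feat, motion_feat], dim=1)
        return torch.relu(self.fuse(stacked))
```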
Context 5
... [figure panel labels: flow with γ = 10, γ = 5, γ = 1 and γ = 0; test set activity] Finally, we visualize the ultimate class prediction layers of the architecture, where the unit outputs correspond to different classes; thus, we know to what they should be matched. In Fig. 12, we show the fast motion activation of the classes Archery, BabyCrawling, PlayingFlute, CleanAndJerk and BenchPress. The learned features for archery (e.g., the elongated bow shape and positioning of the bow as well as the shooting motion of the arrow) are markedly ... Further, Clean and Jerk actions (where a barbell weight is pushed ...
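The γ values in the figure labels control how strongly the optimized flow input is regularized during activation maximization. One plausible way to write the resulting objective (a sketch consistent with the procedure in Context 2, not necessarily the paper's exact formulation) is

\[
\mathbf{x}^{\ast} = \arg\max_{\mathbf{x}} \; \sum_{\text{locations}} a_{c}(\mathbf{x}) \;-\; \gamma\, R\big(\mathbf{x}_{\text{flow}}\big),
\]

where \(a_{c}\) is the activation of the target unit \(c\) and \(R\) penalizes large or fast-varying optical flow, so that \(\gamma = 0\) leaves the motion unconstrained (fast motion) while larger \(\gamma\) yields slower, smoother motion.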
Context 6
... we show class prediction units of the Inception-v3 architecture trained on Kinetics, for both the appearance and motion streams of a two-stream ConvNet. In Fig. 20 we show the class prediction units of these two streams for 20 sample classes. Notably, Kinetics includes many actions that are hard to predict just from optical flow information. The first row shows classes that are easily classified by the appearance stream with recognition accuracies above 90% and the last row shows classes that ...
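The split into appearance-dominated and motion-dependent classes can be reproduced by comparing per-class accuracies of the two streams on held-out data. The following is a generic sketch; the array names and the number of classes are assumptions for illustration.

```python
import numpy as np

def per_class_accuracy(predictions, labels, num_classes):
    """Fraction of correctly classified samples for each class."""
    acc = np.zeros(num_classes)
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            acc[c] = (predictions[mask] == c).mean()
    return acc

# Hypothetical usage: rank classes by how much better the appearance
# stream does than the motion (optical-flow) stream.
# app_acc  = per_class_accuracy(app_preds,  labels, 400)
# flow_acc = per_class_accuracy(flow_preds, labels, 400)
# appearance_dominant = np.argsort(flow_acc - app_acc)[:20]
```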
Similar publications
One of the most prominent problems in machine learning in the age of deep learning is the availability of sufficiently large annotated datasets. For specific domains, e.g. animal species, a long-tail distribution means that some classes are observed and annotated insufficiently. Additional labels can be prohibitively expensive, e.g. because domain...
Damage estimation is part of daily operation of power utilities, often requiring a manual process of crew deployment and damage report to quantify and locate damages. Advancement in unmanned aerial vehicles (UAVs) as well as real-time communication and learning technologies could be harnessed towards efficient and accurate automation of this proces...
Citations
... Simultaneously, the examination of scene bias places the focus on activities, thereby enhancing nuanced behavior differentiation. These collective insights significantly contribute to advancements in AI-based vision systems, improving safety and driving performance within autonomous vehicles [34,129,130]. ...
The flourishing realm of advanced driver-assistance systems (ADAS) as well as autonomous vehicles (AVs) presents exceptional opportunities to enhance safe driving. An essential aspect of this transformation involves monitoring driver behavior through observable physiological indicators, including the driver’s facial expressions, hand placement on the wheel, and the driver’s body posture. An artificial intelligence (AI) system under consideration alerts drivers about potentially unsafe behaviors using real-time voice notifications. This paper offers an all-embracing survey of neural network-based methodologies for studying these driver biometrics, presenting an exhaustive examination of their advantages and drawbacks. The evaluation includes two relevant datasets, separately categorizing ten different in-cabin behaviors, providing a systematic classification for driver behavior detection. The ultimate aim is to inform the development of driver behavior monitoring systems. This survey is a valuable guide for those dedicated to enhancing vehicle safety and preventing accidents caused by careless driving. The paper’s structure encompasses sections on autonomous vehicles, neural networks, driver behavior analysis methods, dataset utilization, and final findings and future suggestions, ensuring accessibility for audiences with diverse levels of understanding regarding the subject matter.
... Unlike fully connected networks, the architecture of CNNs is compatible with 2D structured inputs (such as images or any other 2D signals), which helps effectively preserve the spatial structure of inputs. Feichtenhofer et al. [37] present a deep insight into convolutional neural networks for video recognition tasks. Convolutional layers are composed of multiple kernels, which are convolved with the input image or mid-layer activation maps to produce next-level activations. ...
Video anomaly detection (VAD) is currently a trending research area within computer vision, given that anomalies form a key detection objective in surveillance systems, often requiring immediate responses. The primary challenges associated with video anomaly detection tasks stem from the scarcity of anomaly samples and the context-dependent nature of anomaly definitions. In light of the limited availability of labeled data for training (specifically, a shortage of labeled data for abnormalities), there has been a growing interest in semi-supervised anomaly detection methods. These techniques work by identifying anomalies through the detection of deviations from normal patterns. This paper provides a new perspective to researchers in the field, by categorizing semi-supervised VAD methods according to the proxy task type they employ to model normal data and consequently to detect anomalies. It also reviews recent deep learning based semi-supervised VAD methods, emphasizing their common tactic of slightly overfitting their models on normal data using a proxy task to detect anomalies. Our goal is to help researchers develop more effective video anomaly detection methods. As the selection of the right Deep Neural Network (DNN) plays an important role in several parts of this task, a quick comparative review of DNNs is also included. Unlike previous surveys, DNNs are reviewed from a spatiotemporal feature extraction viewpoint, customized for video anomaly detection. This part of the review can help researchers select suitable networks for different parts of their methods. The review provides a novel and deep look at existing methods and states the shortcomings of these approaches, which can serve as hints for future work.
... In addition to AM, model inversion [9,19,30,65,67] has introduced the task of maximizing the classification score of the synthesized image instead of maximizing a class activation. The only extension of feature visualization approaches to videos has been introduced by Feichtenhofer et al. [13], in which they use AM to create visual representations of class features from two-stream models [14,63], trained on individual RGB frames (spatial stream) and optical flow (temporal stream). In this paper, we instead propose a model inversion method for inverting video architectures that concurrently model space and time modalities. ...
... Activation Maximization (AM) [31] optimizes a random noise image by gradient ascent to maximize the activation of a specific class. We specifically adapt [13] for visualizing concurrent spatiotemporal representations, as it is the only prior method for visualizing features over space and time. ...
The success of deep learning models has led to their adaptation and adoption by prominent video understanding methods. The majority of these approaches encode features in a joint space-time modality for which the inner workings and learned representations are difficult to visually interpret. We propose LEArned Preconscious Synthesis (LEAPS), an architecture-agnostic method for synthesizing videos from the internal spatiotemporal representations of models. Using a stimulus video and a target class, we prime a fixed space-time model and iteratively optimize a video initialized with random noise. We incorporate additional regularizers to improve the feature diversity of the synthesized videos as well as the cross-frame temporal coherence of motions. We quantitatively and qualitatively evaluate the applicability of LEAPS by inverting a range of spatiotemporal convolutional and attention-based architectures trained on Kinetics-400, which to the best of our knowledge has not been previously accomplished.
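The cross-frame temporal coherence regularizer mentioned in the abstract is commonly implemented as a penalty on differences between consecutive synthesized frames; the snippet below shows such a generic penalty as an assumption for illustration, not LEAPS's exact regularizer.

```python
import torch

def temporal_coherence_penalty(video):
    """Penalize large changes between consecutive frames.

    video: tensor of shape (T, C, H, W), the synthesized clip being optimized.
    Returns the mean squared difference between neighbouring frames, which can
    be added (with a weight) to the activation-maximization loss.
    """
    return ((video[1:] - video[:-1]) ** 2).mean()
```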
... While a few video interpretation methods exist, they have various limitations, e.g. being primarily qualitative [10], using a certain dataset that prevents evaluating the effect of the training dataset [11] or using classification accuracy as a metric without quantifying a model's internal representations [9], [11], [12]. In response, we present a quantitative paradigm for evaluating the extent that spatiotemporal models are biased toward static or dynamic information in their internal representations. ...
... dataset to completely decouple static and dynamic information, but used it only to examine overall architecture performance on action recognition and did not examine intermediate representations [9]. Other work focused on understanding latent representations in spatiotemporal models was either mostly concerned with qualitative visualization [10] or restricted to a specific architecture type [20]. A related task is understanding the scene representation bias of action recognition datasets [16], [21]. ...
... Our proposed interpretability technique is the first to quantify static and dynamic biases on intermediate representations learned in off-the-shelf models for multiple video-based tasks. Most prior efforts focused on a single task, and studied either datasets [16] or architectures [10], [22]. In contrast, our unified study covers seven datasets and dozens of architectures on three different tasks, i.e. action recognition, AVOS and VIS. ...
There is limited understanding of the information captured by deep spatiotemporal models in their intermediate representations. For example, while evidence suggests that action recognition algorithms are heavily influenced by visual appearance in single frames, no quantitative methodology exists for evaluating such static bias in the latent representation compared to bias toward dynamics. We tackle this challenge by proposing an approach for quantifying the static and dynamic biases of any spatiotemporal model, and apply our approach to three tasks, action recognition, automatic video object segmentation (AVOS) and video instance segmentation (VIS). Our key findings are: (i) Most examined models are biased toward static information. (ii) Some datasets that are assumed to be biased toward dynamics are actually biased toward static information. (iii) Individual channels in an architecture can be biased toward static, dynamic or a combination of the two. (iv) Most models converge to their culminating biases in the first half of training. We then explore how these biases affect performance on dynamically biased datasets. For action recognition, we propose StaticDropout, a semantically guided dropout that debiases a model from static information toward dynamics. For AVOS, we design a better combination of fusion and cross connection layers compared with previous architectures.
... Deep Action Recognition. Two-stream CNNs (Simonyan & Zisserman, 2014; Feichtenhofer et al., 2016, 2020) are the earliest works on deep action recognition. Later, many methods (Zhou et al., 2018; Lin et al., 2019; Liu et al., 2020a, b; Luo et al., 2019; Wu et al., 2021; Khowaja & Lee, 2020; Tian et al., 2021) enhance 2D CNNs with various temporal modules and achieve promising results. ...
Efficiently modeling spatial–temporal information in videos is crucial for action recognition. To achieve this goal, state-of-the-art methods typically employ the convolution operator and dense interaction modules such as non-local blocks. However, these methods cannot accurately fit the diverse events in videos. On the one hand, the adopted convolutions have fixed scales, thus struggling with events of various scales. On the other hand, the dense interaction modeling paradigm only achieves sub-optimal performance as action-irrelevant parts bring additional noise for the final prediction. In this paper, we propose a unified action recognition framework to investigate the dynamic nature of video content by introducing the following designs. First, when extracting local cues, we generate spatial–temporal kernels of dynamic scale to adaptively fit the diverse events. Second, to accurately aggregate these cues into a global video representation, we propose to mine the interactions only among a few selected foreground objects by a Transformer, which yields a sparse paradigm. We call the proposed framework Event Adaptive Network because both key designs are adaptive to the input video content. To exploit the short-term motions within local segments, we propose a novel and efficient Latent Motion Code module, further improving the performance of the framework. Extensive experiments on several large-scale video datasets, e.g., Something-Something V1 & V2, Kinetics, and Diving48, verify that our models achieve state-of-the-art or competitive performances at low FLOPs. Codes are available at: https://github.com/tianyuan168326/EAN-Pytorch.
... colour, spatial texture, contours, shape) drives performance, rather than dynamic information (e.g., motion, temporal texture). While some methodologies have been aimed at understanding the representations learned by these architectures [10,11,12], none provide an approach to completely disentangle static vs. dynamic information. In response, we have developed an Appearance Free Dataset (AFD) for evaluating action recognition architectures when presented with purely dynamic information; see Fig. 1. ...
... Interpretability. Various efforts have addressed the representational abilities of video understanding architectures, ranging from dynamic texture recognition [54] and future frame selection [55] to comparing 3D convolutional vs. LSTM architectures [56]. Other work centering on action recognition focused on visualization of learned filters in convolutional architectures [11,12], or on trying to remove scene biases for action recognition [10,18]. Evidence also suggests that optical flow is useful exactly because it contains invariances related to single frame appearance information [57]. ...
Intuition might suggest that motion and dynamic information are key to video-based action recognition. In contrast, there is evidence that state-of-the-art deep-learning video understanding architectures are biased toward static information available in single frames. Presently, a methodology and corresponding dataset to isolate the effects of dynamic information in video are missing. Their absence makes it difficult to understand how well contemporary architectures capitalize on dynamic vs. static information. We respond with a novel Appearance Free Dataset (AFD) for action recognition. AFD is devoid of static information relevant to action recognition in a single frame. Modeling of the dynamics is necessary for solving the task, as the action is only apparent through consideration of the temporal dimension. We evaluated 11 contemporary action recognition architectures on AFD as well as its related RGB video. Our results show a notable decrease in performance for all architectures on AFD compared to RGB. We also conducted a complementary study with humans that shows their recognition accuracy on AFD and RGB is very similar and much better than that of the evaluated architectures on AFD. Our results motivate a novel architecture that revives explicit recovery of optical flow, within a contemporary design, for best performance on AFD and RGB.
... While a few video interpretation methods exist, they have various limitations, e.g. being primarily qualitative [16], using a certain dataset that prevents evaluating the effect of the training dataset [20] or using classification accuracy as a metric without quantifying a model's internal representations [20,39]. ...
... These approaches do not interpret the learned representations in the intermediate layers and in some cases require training to be performed on specific datasets [20]. Other work focused on understanding latent representations in spatiotemporal models was either mostly concerned with qualitative visualization [16] or restricted to a specific architecture type [54]. A related task is understanding the scene representation bias of action recognition datasets [33,34]. ...
... Our proposed interpretability technique is the first to quantify static and dynamic biases on intermediate representations learned in off-the-shelf models for multiple video-based tasks. Most prior efforts focused on a single task, and studied either datasets [33] or architectures [16,35]. In contrast, our unified study covers six datasets and dozens of architectures on two different tasks, i.e. action recognition and video object segmentation. ...
Deep spatiotemporal models are used in a variety of computer vision tasks, such as action recognition and video object segmentation. Currently, there is a limited understanding of what information is captured by these models in their intermediate representations. For example, while it has been observed that action recognition algorithms are heavily influenced by visual appearance in single static frames, there is no quantitative methodology for evaluating such static bias in the latent representation compared to bias toward dynamic information (e.g. motion). We tackle this challenge by proposing a novel approach for quantifying the static and dynamic biases of any spatiotemporal model. To show the efficacy of our approach, we analyse two widely studied tasks, action recognition and video object segmentation. Our key findings are threefold: (i) Most examined spatiotemporal models are biased toward static information; although, certain two-stream architectures with cross-connections show a better balance between the static and dynamic information captured. (ii) Some datasets that are commonly assumed to be biased toward dynamics are actually biased toward static information. (iii) Individual units (channels) in an architecture can be biased toward static, dynamic or a combination of the two.
... The proposed method uses a 3D fully convolutional network to encode video frames, extracts action proposal segments, and finally classifies and refines the results in an action classification subnet. Feichtenhofer et al. (2020) study several ways of fusing CNNs spatially and temporally, and propose a new architecture for spatiotemporal fusion of video snippets. Chen et al. (2020) introduce graph convolutional networks for skeleton-based action recognition. ...
In recent years, the popularity of exercise has risen rapidly. Weight training for sculpting body shapes is viewed as a particular trend, but incorrect weight training postures or forms can not only nullify the benefits of exercise, but also cause permanent damage to the body. Therefore, weight trainees usually hire a coach or an athletic trainer for guidance. However, the cost of hiring a trainer is high and may be prohibitive in the long term. In this study, the OpenPose system and inexpensive webcams are used to develop the WTPose algorithm that can determine whether a weight trainee's posture is correct in real time. When there is deviation in the weight trainee's posture, the algorithm will immediately display the correct posture, thereby helping the weight trainee to correct her/his weight training posture at only a small cost. As demonstrated through experiments, regardless of the user's body shape and gender, the WTPose algorithm can accurately determine whether her/his weight training posture is correct.
... In this paper, we provide a deep analysis of temporal modeling for action recognition. Previous works focus on performance benchmarks [10,54], spatiotemporal feature visualization [18,40] or saliency analysis [5,22,39,48] to gain a better understanding of action models. For example, comprehensive studies of CNN-based models have been conducted recently in [10,54] to compare the performance of different action models. ...
... There are a few works that have assessed the temporal importance in a video. For example, Huang et al. [23] proposed approaches to identify the crucial motion information in a video based on the C3D model and then used it to discard frames of the video without much motion information; on the other hand, Sigurdsson et al. [42] analyzed the action category by measuring the complexity at different levels, such as verb, object and motion complexity, and then composed those attributes to form the action class. Feichtenhofer et al. [18] visualized the features learned from various models trained on optical flow to explain why the network fails in certain cases. Separately, the receptive field is typically used to determine the range a network can theoretically see in both spatial and temporal dimensions. ...
In this paper, we provide a deep analysis of temporal modeling for action recognition, an important but underexplored problem in the literature. We first propose a new approach to quantify the temporal relationships between frames captured by CNN-based action models based on layer-wise relevance propagation. We then conduct comprehensive experiments and in-depth analysis to provide a better understanding of how temporal modeling is affected by various factors such as dataset, network architecture, and input frames. With this, we further study some important questions for action recognition that lead to interesting findings. Our analysis shows that there is no strong correlation between temporal relevance and model performance; and action models tend to capture local temporal information, but less long-range dependencies. Our codes and models will be publicly available.
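As a rough, simpler stand-in for the layer-wise relevance propagation used in this work, per-frame importance can be approximated with a gradient-times-input score on the clip. The sketch below implements that proxy (not the paper's LRP procedure), assuming a 3D-CNN action model with input shape (N, C, T, H, W).

```python
import torch

def frame_relevance(model, clip, target_class):
    """Approximate per-frame relevance via gradient x input.

    clip: tensor of shape (1, C, T, H, W) for a 3D-CNN action model.
    Returns one non-negative score per frame; frames with higher scores
    contribute more to the target class score under this proxy.
    """
    clip = clip.detach().clone().requires_grad_(True)
    score = model(clip)[0, target_class]          # logit of the target class
    score.backward()

    # Aggregate |gradient * input| over batch, channel and spatial dims,
    # leaving one relevance value per frame (the temporal dimension T).
    relevance = (clip.grad * clip).abs().sum(dim=(0, 1, 3, 4))
    return relevance / relevance.sum()
```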
... For example, dumbbell images in ImageNet often contain arms [37], leading to pictures of arms being classified as dumbbells. Likewise, specialized saliency techniques developed for video recognition models can show biases in video datasets [17]. Shared Interest [6] and Activation Atlas [7] enable systematic analysis of these issues on the dataset level. ...
We survey a number of data visualization techniques for analyzing Computer Vision (CV) datasets. These techniques help us understand properties and latent patterns in such data, by applying dataset-level analysis. We present various examples of how such analysis helps predict the potential impact of the dataset properties on CV models and informs appropriate mitigation of their shortcomings. Finally, we explore avenues for further visualization techniques of different modalities of CV datasets as well as ones that are tailored to support specific CV tasks and analysis needs.