Article · Publisher preview available

Proposal-Based Graph Attention Networks for Workflow Detection

Authors: Min Zhang · Haiyang Hu · Zhongjin Li · Jie Chen

Abstract

In the era of "Industry 4.0", video analysis plays a vital role in a variety of industrial applications. Video-based action detection has obtained promising performance in the computer vision community. However, in complex factory environments, detecting the workflow of both machines and workers in the production process remains unresolved. To address this issue, we propose generic proposal-based Graph Attention Networks for workflow detection. Specifically, an efficient and effective action proposal method is first employed to generate workflow proposals. Then, these proposals and their relations are exploited to construct a proposal graph. Two types of relationships are considered for identifying workflow phases: contextual relations, which capture context information, and surrounding relations, which characterize the correlations between different workflow instances. To improve recognition accuracy, within-category and between-category attention are incorporated to learn long-range and dynamic dependencies, respectively. Thus, the feature representation capability for workflow detection is greatly enhanced. Experimental results verify that the proposed approach considerably improves upon the state of the art on THUMOS'14 and a practical workflow dataset, achieving 6.7% and 3.9% absolute improvement, respectively, over the advanced GTAN detector at a tIoU threshold of 0.4. Moreover, augmentation experiments on ActivityNet1.3 demonstrate the performance gains obtained by modeling workflow proposal relationships.
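To make the pipeline concrete, here is a minimal PyTorch sketch of the two relation types the abstract describes: hypothetical workflow proposals are linked by temporal overlap (contextual relations) or by proximity between disjoint intervals (surrounding relations), and proposal features are then aggregated with a single attention pass. The thresholds, feature dimension, and attention form are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): build a proposal graph with
# "contextual" edges (temporal overlap) and "surrounding" edges (nearby but
# disjoint proposals), then aggregate features with one attention pass.
import torch
import torch.nn.functional as F

def tiou(a, b):
    """Temporal IoU between two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def build_edges(proposals, ctx_thresh=0.1, surround_dist=2.0):
    """Thresholds here are assumed values, chosen only for illustration."""
    edges = []
    for i, p in enumerate(proposals):
        for j, q in enumerate(proposals):
            if i == j:
                continue
            if tiou(p, q) > ctx_thresh:              # contextual relation
                edges.append((i, j))
            else:
                ci, cj = (p[0] + p[1]) / 2, (q[0] + q[1]) / 2
                if abs(ci - cj) < surround_dist:     # surrounding relation
                    edges.append((i, j))
    return edges

def attention_pass(feats, edges):
    """One scaled dot-product attention aggregation over graph edges."""
    n, d = feats.shape
    scores = torch.full((n, n), float("-inf"))
    for i, j in edges:
        scores[i, j] = feats[i] @ feats[j] / d ** 0.5
    scores.fill_diagonal_(0.0)   # keep self-loops so isolated nodes survive
    alpha = F.softmax(scores, dim=1)
    return alpha @ feats

proposals = [(0.0, 3.0), (2.5, 6.0), (7.0, 9.0)]  # (start, end) in seconds
feats = torch.randn(3, 16)                        # toy proposal features
out = attention_pass(feats, build_edges(proposals))
print(out.shape)                                  # torch.Size([3, 16])
```

In the paper's full model, within-category and between-category attention would further specialize this aggregation; here a single dot-product attention stands in for both.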
Neural Processing Letters (2022) 54:101–123
https://doi.org/10.1007/s11063-021-10622-7
Min Zhang¹,² · Haiyang Hu¹ · Zhongjin Li¹ · Jie Chen¹
Accepted: 5 August 2021 / Published online: 13 August 2021
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2021
Keywords: Workflow detection · Graph attention networks · Temporal action localization
1 Introduction
With the prominent achievements of deep learning, a growing number of cameras are being installed for intelligent monitoring. Many practical applications require cameras to record scene videos for real-time activity detection, such as smart surveillance [1], autonomous driving [2] and human behavior analysis [3]. In large-scale factories, the essential ingredients of production
Corresponding author: Haiyang Hu (huhaiyang@hdu.edu.cn)
¹ School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China
² Department of Design and Art, Zhejiang Industry Polytechnic College, Shaoxing, China
... Recently, the graph convolutional network (GCN) [50] was proposed to perform convolution operations on non-grid structures. Owing to its effectiveness in relation modeling, GCN has been introduced to video understanding tasks [51,52]. Zeng et al. [2] model proposal-proposal relations with graph convolutional networks to learn powerful representations for proposal classification and refinement. ...
... Bai et al. [55] propose to model the relations between the boundary and the action content of temporal proposals through a graph neural network. Zhang et al. [52] utilize graph attention networks to model proposal relations, incorporating both within-category and between-category attention. Rashid et al. [11] leverage the similarity and dissimilarity between video moments to construct a graph, aiming to inform detection and classification under weak supervision. ...
Article
Full-text available
Weakly supervised temporal action detection uses extracted appearance and motion features to localize action segments in untrimmed videos with only action category labels. Most previous methods detect action segments based on temporally local features, and employ early-fusion or late-fusion mechanisms to combine the knowledge of the two feature modalities. However, temporally local features generally lead to incomplete detection results, and the above-mentioned fusion mechanisms cannot fully use the complementary information between different modalities. In this paper, we propose the separately guided context-aware network to exploit global contexts and sufficiently leverage different modality information to detect action segments. Specifically, we propose to construct graphs by modeling the co-occurrence relations between frames to gather global contexts. To fully combine the complementary information of the two modalities, a separately guided scheme is proposed, which utilizes two graphs for each feature modality to integrate the contexts revealed by intra-modality and cross-modality information, respectively. This scheme sufficiently enhances frame representations based on the two modalities and facilitates the detection of action frames. We also present a co-occurrence relation learning strategy under weak supervision to better guide the graphs in gathering contexts. Extensive experiments on the THUMOS14 and ActivityNet datasets demonstrate the superior performance of the proposed method. In particular, it achieves a mean average precision of 39.1% and 42.0% on THUMOS14 and ActivityNet, respectively, at an IoU threshold of 0.5.
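A loose sketch of the guiding idea, under assumed shapes and a cosine-affinity adjacency (the paper's actual graph construction and relation-learning strategy are not reproduced here): per-frame features of one modality are enhanced with contexts gathered through an intra-modality graph and a cross-modality graph.

```python
# Minimal sketch (assumptions, not the paper's code): enhance per-frame
# features of one modality with contexts gathered through an intra-modality
# graph and a cross-modality graph built from cosine co-occurrence affinity.
import torch
import torch.nn.functional as F

def affinity(x, y):
    """Row-normalized cosine affinity used as a soft adjacency matrix."""
    a = F.normalize(x, dim=1) @ F.normalize(y, dim=1).T
    return F.softmax(a, dim=1)

def guided_context(rgb, flow):
    intra = affinity(rgb, rgb) @ rgb       # contexts from the same modality
    cross = affinity(rgb, flow) @ flow     # contexts from the other modality
    return rgb + intra + cross             # residual fusion (assumed)

T, d = 50, 128                             # frames, feature dim (toy sizes)
rgb, flow = torch.randn(T, d), torch.randn(T, d)
print(guided_context(rgb, flow).shape)     # torch.Size([50, 128])
```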
... Specialized models have demonstrated the applicability of attention-based graph techniques to support workflow generation. Zhang et al. [19] developed a GAT-based model for workflow detection in complex industrial environments. Similarly, Graph Recurrent Attention Networks (GRANs) [20] have shown promise in generating large graphs efficiently by iteratively constructing nodes and edges with attention-driven conditioning. ...
Preprint
Full-text available
This paper introduces Opus, a novel framework for generating and optimizing Workflows tailored to complex Business Process Outsourcing (BPO) use cases, focusing on cost reduction and quality enhancement while adhering to established industry processes and operational constraints. Our approach generates executable Workflows from Intention, defined as the alignment of Client Input, Client Output, and Process Context. These Workflows are represented as Directed Acyclic Graphs (DAGs), with nodes as Tasks consisting of sequences of executable Instructions, including tools and human expert reviews. We adopt a two-phase methodology: Workflow Generation and Workflow Optimization. In the Generation phase, Workflows are generated using a Large Work Model (LWM) informed by a Work Knowledge Graph (WKG) that encodes domain-specific procedural and operational knowledge. In the Optimization phase, Workflows are transformed into Workflow Graphs (WFGs), where optimal Workflows are determined through path optimization. Our experiments demonstrate that state-of-the-art Large Language Models (LLMs) face challenges in reliably retrieving detailed process data as well as generating industry-compliant workflows. The key contributions of this paper include:
- The integration of a Work Knowledge Graph (WKG) into a Large Work Model (LWM), enabling the generation of context-aware, semantically aligned, structured and auditable Workflows.
- A two-phase approach that combines Workflow Generation from Intention with graph-based Workflow Optimization.
- Opus Alpha 1 Large and Opus Alpha 1 Small, models that outperform state-of-the-art LLMs by 38% and 29% respectively in Workflow Generation for a Medical Coding use case.
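As a toy illustration of the Workflow-as-DAG representation described above (task names, dependencies, and instruction strings are invented; Opus itself is not public code), a workflow can be stored as a dependency mapping and executed in topological order:

```python
# Toy sketch of Workflow-as-DAG: nodes are Tasks holding instruction
# sequences, edges are dependencies, and execution order comes from a
# topological sort. All task names here are hypothetical.
from graphlib import TopologicalSorter

workflow = {                        # task -> set of prerequisite tasks
    "extract_claim": set(),
    "assign_codes": {"extract_claim"},
    "human_review": {"assign_codes"},
    "submit": {"human_review"},
}
instructions = {t: [f"step 1 of {t}", f"step 2 of {t}"] for t in workflow}

for task in TopologicalSorter(workflow).static_order():
    for step in instructions[task]:
        pass  # a real system would run a tool call or human review here
    print("completed:", task)
```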
... However, these methods require complex manual labeling or the installation of auxiliary equipment to collect production behavior information, which results in additional labor costs during the production process. Therefore, recent research has focused on workflow detection based solely on surveillance videos [7, 8], avoiding the additional workload caused by installing auxiliary devices. ...
... Action recognition for trimmed videos has made remarkable achievements in the past few years [19], aiming to recognize the target action in a trimmed video containing only a single action. Thanks to the development of deep learning in the image domain, a large number of excellent networks have been proposed [20-24], so that image-level RGB and optical-flow features can be modeled to solve this task. To further improve the performance of recognition models and break the limitation that 2D convolution can only handle a single frame, 3D convolution is applied to cope with the additional temporal dimension [21]. ...
Article
Full-text available
Weakly-supervised temporal action localization aims to detect the temporal boundaries of action instances in untrimmed videos by relying only on video-level action labels. The main challenge is to accurately segment actions from the background in the absence of frame-level labels. Previous methods consider action-related context in the background the main factor restricting segmentation performance. Most of them take action labels as pseudo-labels for context and suppress context frames in class activation sequences using an attention mechanism. However, this only applies to fixed shots or videos with a single theme. For videos with frequent scene switching and complicated themes, such as casual shots of unexpected events and secret shots, the strong randomness and weak continuity of the action invalidate this assumption. In addition, wrong pseudo-labels enhance the weight of context frames, which harms segmentation performance. To address these issues, we define a new video-frame division standard (action instance, action-related context, no-action background) and propose an Action-aware Network with Upper and Lower loss (AUL-Net), which limits the activation of context to a reasonable range through a two-branch weight-sharing framework with a three-branch attention mechanism, so that the model has wider applicability while accurately suppressing context and background. We conducted extensive experiments on the self-built food-safety video dataset FS-VA, and the results show that our method outperforms the state-of-the-art model.
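AUL-Net's specific two-branch, three-branch design is not reconstructed here; instead, the sketch below shows the generic weakly-supervised building block this line of work shares: per-frame class activation sequences pooled by top-k into a video-level prediction, trained against the video-level label alone. All sizes are toy values.

```python
# Generic weakly-supervised TAL building block (not AUL-Net itself): score
# every frame per class, pool the top-k scores into a video-level logit,
# and train against the video-level label alone.
import torch
import torch.nn as nn

class CASHead(nn.Module):
    def __init__(self, dim=128, num_classes=20, k=8):
        super().__init__()
        self.fc = nn.Linear(dim, num_classes)
        self.k = k

    def forward(self, frames):                # frames: (T, dim)
        cas = self.fc(frames)                 # class activation seq, (T, C)
        video_logits = cas.topk(self.k, dim=0).values.mean(dim=0)
        return cas, video_logits

head = CASHead()
frames = torch.randn(100, 128)                # toy untrimmed-video features
cas, logits = head(frames)
label = torch.zeros(20); label[3] = 1         # video-level action label only
loss = nn.BCEWithLogitsLoss()(logits, label)
print(cas.shape, loss.item())
```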
Article
Multivariate time series (MTS) forecasting plays an important role in industrial process monitoring, control, and optimization. Hierarchical interactive behaviors among industrial MTS typically form complex nonlinear causal characteristics, which greatly hinder the application of existing predictive models. Graph attention networks (GAT) provide technical ideas to meet this challenge; however, the unknown directed graph and the linear conversion of node information make conventional GAT less suitable for industrial settings. In this paper, we propose a novel prediction model, termed temporal causal graph attention networks with nonlinear paradigms (TC-GATN), to adequately capture the inherent dependencies in industrial MTS. Specifically, a graph learning algorithm based on Granger causality (GC) is exploited to extract potential relationships among multiple variables, guiding the directional edge connections of the hierarchy. Then, parallel GRU encoders located in the graph neighborhood space are introduced to perform nonlinear interaction of node features, accomplishing adaptive transformation and transmission. A self-attention mechanism is further employed to aggregate encoder hidden states across all stages. Finally, a temporal module is appended to process information from the graph layer, achieving satisfactory predictions. The feasibility and effectiveness of TC-GATN are validated on two real datasets from methanol production and chlorosilane distillation.
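A hedged sketch of the graph-learning step alone, using pairwise Granger-causality tests (via statsmodels) to decide directed edges that could then guide a GAT's adjacency; the lag, p-value threshold, and synthetic data are assumptions, and the GRU and attention stages are omitted.

```python
# Sketch: pairwise Granger-causality tests produce a directed adjacency
# matrix among variables. Lag and threshold are assumed values.
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(0)
T, n = 300, 4
series = rng.standard_normal((T, n))
series[2:, 1] += 0.8 * series[:-2, 0]      # make variable 0 "cause" variable 1

adj = np.zeros((n, n), dtype=int)          # adj[i, j] = 1 means i causes j
for i in range(n):
    for j in range(n):
        if i == j:
            continue
        # column order for the test: [effect, candidate cause]
        res = grangercausalitytests(series[:, [j, i]], maxlag=2)
        pval = min(res[lag][0]["ssr_ftest"][1] for lag in res)
        adj[i, j] = int(pval < 0.01)
print(adj)
```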
Conference Paper
Full-text available
Training classical (vanilla) deep neural networks (DNNs) with many layers is problematic due to optimization problems. Interestingly, skip connections of various forms (e.g. those performing summation or concatenation of hidden representations or layer outputs) have been shown to allow the successful training of very deep DNNs. Although there are ongoing theoretical works to understand very deep DNNs that employ summation of the outputs of different layers (e.g. as in the residual network), there is none, to the best of our knowledge, that has studied why DNNs that concatenate the outputs of different layers (e.g. as seen in Inception, FractalNet and DenseNet) work. As such, we present in this paper the first theoretical analysis of very deep DNNs with concatenated hidden representations, based on a general framework that can be extended to specific cases. Our results reveal that DNNs with concatenated hidden representations circumnavigate the singularity of hidden representations, which is catastrophic for optimization. To substantiate the theoretical results, extensive experiments are reported on standard datasets such as MNIST and CIFAR-10.
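A small sketch of the concatenation-style skip connection the abstract analyzes (DenseNet-like; layer widths here are arbitrary): each layer receives the concatenation of all earlier outputs, so hidden representations are stacked rather than summed.

```python
# Concatenation skip connections: every layer sees the concatenation of the
# input and all earlier layer outputs, as in DenseNet-style blocks.
import torch
import torch.nn as nn

class ConcatBlock(nn.Module):
    def __init__(self, in_dim, growth=32, layers=3):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Linear(in_dim + i * growth, growth) for i in range(layers)
        )

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(torch.relu(layer(torch.cat(feats, dim=1))))
        return torch.cat(feats, dim=1)   # stacked, never summed

block = ConcatBlock(in_dim=64)
print(block(torch.randn(8, 64)).shape)   # torch.Size([8, 160]) = 64 + 3*32
```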
Article
Full-text available
The recent advances in 3D Convolutional Neural Networks (3D CNNs) have shown promising performance for untrimmed-video action detection, employing the popular detection framework that relies heavily on temporal action proposal generation as the input of the action detector and localization regressor. In practice, the proposals usually exhibit strong intra- and inter-relations, mainly stemming from the temporal and spatial variations in video actions. However, most existing 3D CNNs ignore these relations and thus suffer from redundant proposals that degrade detection performance and efficiency. To address this problem, we propose graph-attention-based proposal 3D ConvNets (AGCN-P-3DCNNs) for video action detection. Specifically, the proposed graph attention is composed of an intra-attention-based GCN and an inter-attention-based GCN. We use intra attention to learn the long-range dependencies inside each action proposal and update the node matrix of the intra-attention-based GCN, and use inter attention to learn the dependencies between different action proposals as the adjacency matrix of the inter-attention-based GCN. Afterwards, we fuse intra and inter attention to model long-range dependencies within proposals and dependencies across proposals simultaneously. Another contribution is a simple and effective framewise classifier, which enhances the feature representation capability of the backbone model. Experiments on two proposal-based 3D ConvNet models (P-C3D and P-ResNet) and two popular action detection benchmarks (THUMOS 2014, ActivityNet v1.3) demonstrate the state-of-the-art performance achieved by our method. In particular, P-C3D embedded with our module achieves a 3.7% average mAP improvement on the THUMOS 2014 dataset compared to the original model.
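A rough sketch of the intra/inter distinction under assumed shapes (not the authors' code): intra attention relates snippets inside one proposal, inter attention relates pooled proposal descriptors to each other, and the two levels are fused.

```python
# Sketch of intra/inter attention over proposals (shapes are assumptions):
# intra = self-attention within each proposal's snippets,
# inter = self-attention across pooled proposal descriptors.
import torch
import torch.nn.functional as F

def self_attention(x):
    """Plain scaled dot-product self-attention, x: (n, d)."""
    w = F.softmax(x @ x.T / x.shape[1] ** 0.5, dim=1)
    return w @ x

P, S, d = 5, 10, 32                      # proposals, snippets each, feat dim
snippets = torch.randn(P, S, d)

intra = torch.stack([self_attention(p) for p in snippets])  # within proposal
descriptors = intra.mean(dim=1)                             # pool to (P, d)
inter = self_attention(descriptors)                         # across proposals

fused = descriptors + inter              # fuse intra- and inter-level context
print(fused.shape)                       # torch.Size([5, 32])
```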
Article
Full-text available
This paper addresses an important and challenging task, namely detecting the temporal intervals of actions in untrimmed videos. Specifically, we present a framework called structured segment network (SSN), which is built on temporal proposals of actions. SSN models the temporal structure of each action instance via a structured temporal pyramid. On top of the pyramid, we further introduce a decomposed discriminative model comprising two classifiers, respectively for classifying actions and determining completeness. This allows the framework to effectively distinguish positive proposals from background or incomplete ones, leading to both accurate recognition and precise localization. These components are integrated into a unified network that can be efficiently trained end-to-end. Additionally, a simple yet effective temporal action proposal scheme, dubbed temporal actionness grouping, is devised to generate high-quality action proposals. We further study the importance of the decomposed discriminative model and discover a way to achieve similar accuracy using a single classifier, which is also complementary to the original SSN design. On two challenging benchmarks, THUMOS'14 and ActivityNet, our method remarkably outperforms previous state-of-the-art methods, demonstrating superior accuracy and strong adaptivity in handling actions with various temporal structures.
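An illustrative sketch of the structured temporal pyramid alone (the pyramid levels and sizes are assumed, and the completeness classifier is omitted): a proposal's snippet features are split coarse-to-fine, each part is averaged, and the parts are concatenated into one structure-aware descriptor.

```python
# Structured temporal pyramid sketch: average-pool a proposal's snippet
# features at several granularities and concatenate the pooled parts.
import torch

def temporal_pyramid(feats, levels=(1, 2)):
    """feats: (T, d) snippet features of one proposal; levels are assumed."""
    parts = []
    for level in levels:
        for chunk in torch.chunk(feats, level, dim=0):
            parts.append(chunk.mean(dim=0))
    return torch.cat(parts)               # (d * sum(levels),)

feats = torch.randn(24, 64)               # toy proposal of 24 snippets
desc = temporal_pyramid(feats)
print(desc.shape)                          # torch.Size([192]) = 64 * (1 + 2)
```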
Conference Paper
Full-text available
In this paper, we study the problem of learning Graph Convolutional Networks (GCNs) for regression. Current architectures of GCNs are limited by the small receptive field of convolution filters and a shared transformation matrix for each node. To address these limitations, we propose Semantic Graph Convolutional Networks (SemGCN), a novel neural network architecture that operates on regression tasks with graph-structured data. SemGCN learns to capture semantic information such as local and global node relationships, which is not explicitly represented in the graph. These semantic relationships can be learned through end-to-end training from the ground truth without additional supervision or hand-crafted rules. We further investigate applying SemGCN to 3D human pose regression. Our formulation is intuitive and sufficient since both 2D and 3D human poses can be represented as a structured graph encoding the relationships between joints in the skeleton of a human body. We carry out comprehensive studies to validate our method. The results prove that SemGCN outperforms the state of the art while using 90% fewer parameters.
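A hedged reading of the core idea in code (a stand-in, not the released SemGCN): a graph convolution whose per-edge weights are learnable parameters, softmax-normalized over each node's neighbors in a fixed skeleton mask, so the layer can learn how strongly one joint informs another.

```python
# Sketch of a semantic graph convolution: edge weights are learned logits
# restricted by a fixed skeleton mask (plus self-loops). Sizes are toy.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemGraphConv(nn.Module):
    def __init__(self, in_dim, out_dim, mask):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim)
        self.mask = mask                          # (J, J) 0/1 skeleton + self
        self.logits = nn.Parameter(torch.zeros_like(mask))

    def forward(self, x):                         # x: (J, in_dim) joint feats
        w = self.logits.masked_fill(self.mask == 0, float("-inf"))
        return F.softmax(w, dim=1) @ self.fc(x)   # learned neighbor weighting

J = 4                                             # a tiny 4-joint chain
mask = torch.eye(J) + torch.tensor(
    [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]],
    dtype=torch.float,
)
layer = SemGraphConv(2, 16, mask)
print(layer(torch.randn(J, 2)).shape)             # torch.Size([4, 16])
```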
Article
The click feature of an image, defined as the user-click-frequency vector of the image over a pre-defined word vocabulary, is known to effectively reduce the semantic gap for fine-grained image recognition. Unfortunately, user click-frequency data are usually absent in practice. Predicting the click feature from the visual feature remains challenging because the user-click-frequency vector of an image is always noisy and sparse. In this paper, we devise a Hierarchical Deep Word Embedding (HDWE) model that integrates sparse constraints and an improved ReLU operator to address click-feature prediction from visual features. HDWE is a coarse-to-fine click-feature predictor that is learned with the help of an auxiliary image dataset containing click information, and can therefore discover the hierarchy of word semantics. We evaluate HDWE on three dog image datasets and one bird image dataset, with Clickture-Dog and Clickture-Bird respectively utilized as auxiliary datasets providing click data. Our empirical studies show that HDWE has (1) higher recognition accuracy, (2) a larger compression ratio, and (3) good one-shot learning ability and scalability to unseen categories.
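A loose sketch of the prediction problem HDWE addresses; the two-layer network, loss weights, and synthetic data below are generic stand-ins rather than the paper's hierarchical architecture: map a visual feature to a sparse, non-negative click-frequency vector, with an L1 term encouraging sparsity and ReLU keeping predictions non-negative.

```python
# Generic stand-in for click-feature prediction (not the HDWE model): a
# small regressor from visual features to a sparse click-frequency vector.
import torch
import torch.nn as nn

class ClickPredictor(nn.Module):
    def __init__(self, visual_dim=512, vocab=1000):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(visual_dim, 256), nn.ReLU(),
            nn.Linear(256, vocab), nn.ReLU(),   # ReLU keeps clicks >= 0
        )

    def forward(self, v):
        return self.net(v)

model = ClickPredictor()
visual = torch.randn(16, 512)                   # toy visual features
target = torch.rand(16, 1000) * (torch.rand(16, 1000) > 0.95)  # sparse clicks
pred = model(visual)
loss = nn.MSELoss()(pred, target) + 1e-3 * pred.abs().mean()   # sparsity term
print(loss.item())
```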