Figure 1
(A) The New York Grand Central station. Two semantic regions learned by our algorithm are plotted on the background image. They correspond to paths of pedestrians. Colors indicate different moving directions of pedestrians. Activities observed on the same semantic region have similar semantic interpretations, such as "pedestrians enter the hall from entrance a and leave from exit b". (B) Examples of tracklets collected in the scene. The goal of this work is to learn semantic regions from tracklets.

Source publication
Conference Paper
Full-text available
In this paper, a Random Field Topic (RFT) model is proposed for semantic region analysis from motions of objects in crowded scenes. Unlike existing approaches, which learn semantic regions either from optical flows or from complete trajectories, our model assumes that fragments of trajectories (called tracklets) are observed in crowded scene...
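The abstract above describes learning semantic regions from tracklets. As a rough illustration of the kind of preprocessing such topic-model-based approaches rely on (not the RFT model itself), the sketch below quantizes tracklet points into location-direction codewords; the cell size and the four-direction quantization are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def tracklet_to_codewords(points, frame_w, cell=10, n_dirs=4):
    """Quantize a tracklet (sequence of (x, y) points) into discrete
    location-direction codewords, a common preprocessing step before
    topic-model-based semantic region learning.

    Cell size and the 4-direction quantization are illustrative choices.
    """
    points = np.asarray(points, dtype=float)
    n_cells_x = int(np.ceil(frame_w / cell))
    words = []
    for (x0, y0), (x1, y1) in zip(points[:-1], points[1:]):
        # Spatial bin of the starting point.
        cx, cy = int(x0 // cell), int(y0 // cell)
        # Moving direction quantized into n_dirs sectors.
        angle = np.arctan2(y1 - y0, x1 - x0) % (2 * np.pi)
        d = int(angle / (2 * np.pi / n_dirs)) % n_dirs
        # One integer codeword per (cell, direction) pair.
        words.append((cy * n_cells_x + cx) * n_dirs + d)
    return words

# Example: a short tracklet moving roughly to the right.
print(tracklet_to_codewords([(12, 40), (18, 41), (25, 43)], frame_w=720))
```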

Contexts in source publication

Context 1
... semantic regions correspond to different paths commonly taken by objects, and activities observed in the same semantic region have similar semantic interpretation. Some examples are shown in Figure 1 (A). Semantic regions can be used for activity analysis in a single camera view [21,9,10,24,20] or in multiple camera views [13,11,23] at later stages. ...
Context 2
... worked well for traffic scenes where, at different times, different subsets of activities were observed. However, our experiments show that it fails in a scene like Figure 1 (A), where all types of activities happen together most of the time with significant temporal overlaps. In this type of scene, the temporal co-occurrence information is not discriminative enough. ...

Citations

... For trajectory prediction, there are datasets that only contain pedestrians, such as the Subway Station dataset [129] and the CUHK Crowd Dataset [130] used by Xu et al. [30], and the Town Center Dataset [131] used by Xue et al. [49]. Besides, there are several datasets that contain urban traffic, such as ApolloScape [132] as used by Ma et al. [59], Interaction Dataset [133] as used by Li et al. [47], and nuScenes [134] as used by Yao et al. [25]. ...
Article
Full-text available
The prediction of pedestrian behavior is essential for automated driving in urban traffic and has attracted increasing attention in the vehicle industry. This task is challenging because pedestrian behavior is driven by various factors, including their individual properties, the interactions with other road users, and the interactions with the environment. Deep learning approaches have become increasingly popular because of their superior performance in complex scenarios compared to traditional approaches such as the social force or constant velocity models. In this paper, we provide a comprehensive review of deep learning-based approaches for pedestrian behavior prediction. We review and categorize a large selection of scientific contributions covering both trajectory and intention prediction from the last five years. We categorize existing works by prediction tasks, input data, model features, and network structures. Besides, we provide an overview of existing datasets and the evaluation metrics. We analyze, compare, and discuss the performance of existing work. Finally, we point out the research gaps and outline possible directions for future research.
... The research directions related to changes and dynamics have received, and are still receiving, a substantial amount of attention within the computer vision (CV) community. The research topics include, but are not limited to, surveillance and anomaly detection (Owens and Hunter 2000; Weiming Hu et al. 2004; Piciarelli et al. 2005; Zhouyu Fu et al. 2005; Weiming Hu et al. 2006; Piciarelli and Foresti 2006; Anjum and Cavallaro 2008; Piotto et al. 2009; Khan et al. 2016; Atev et al. 2006; Nawaz et al. 2014), crowd analysis (Cheriyadat and Radke 2008; Khan et al. 2016; Zhou et al. 2011; Sharma and Guha 2016), and appearance change (Lowry et al. 2016). Not all these research directions play an equally important role in the context of MoDs; thus, in the following, we will focus on the three fields presenting the highest intersection with MoDs, namely (i) anomaly detection, (ii) crowd monitoring, and (iii) trajectory clustering. It is important to emphasize that the cut between fields is very often arbitrary and contributions exist at the intersection of fields. ...
Article
Full-text available
Robotic mapping provides spatial information for autonomous agents. Depending on the tasks they seek to enable, the maps created range from simple 2D representations of the environment geometry to complex, multilayered semantic maps. This survey article is about maps of dynamics (MoDs), which store semantic information about typical motion patterns in a given environment. Some MoDs use trajectories as input, and some can be built from short, disconnected observations of motion. Robots can use MoDs, for example, for global motion planning, improved localization, or human motion prediction. Accounting for the increasing importance of maps of dynamics, we present a comprehensive survey that organizes the knowledge accumulated in the field and identifies promising directions for future work. Specifically, we introduce field-specific vocabulary, summarize existing work according to a novel taxonomy, and describe possible applications and open research problems. We conclude that the field is mature enough, and we expect that maps of dynamics will be increasingly used to improve robot performance in real-world use cases. At the same time, the field is still in a phase of rapid development where novel contributions could significantly impact this research area.
... (3) Leveraging the generated spatiotemporal crowd data, we develop real-time congestion alerts and future-time prediction visualizations to assist with manual crowd congestion monitoring. Furthermore, to quantify the benefits of our proposed framework, we validate our implementation with a fully annotated publicly available dataset from New York's Grand Central Station, a busy public urban location [9]. In order to demonstrate the generalizability and capacity in providing qualitative analysis, we also collect an unannotated video from a stadium during a crowded football game. ...
... To quantify the performance of the modular framework, the three components, namely the trajectory generation, congestion prediction, and congestion visualization were experimented on with the New York Grand Central Station (GCS) dataset, collected by Zhou et al. [9]. Point-wise individual trajectories were manually annotated by Yi et al. [43]. ...
... We also design a GCN-GRU model that demonstrates how CMGraphs may be utilized for spatiotemporal forecasting. To evaluate the framework, quantitative experiments are conducted on an annotated public dataset at the Grand Central Station, which has been widely used by researchers studying crowd scenes [9,43]. To further illustrate the practical application of the framework, qualitative congestion analysis is conducted on supplementary video captured at Stanford Stadium. ...
Article
Full-text available
Crowd congestion is one of the main causes of modern public safety issues such as stampedes. Conventional crowd congestion monitoring using closed-circuit television (CCTV) video surveillance relies on manual observation, which is tedious and often error-prone in public urban spaces where crowds are dense and occlusions are prominent. With the aim of managing crowded spaces safely, this study proposes a framework that combines spatial and temporal information to automatically map the trajectories of individual occupants, as well as to assist in real-time congestion monitoring and prediction. Through exploiting both features from CCTV footage and spatial information of the public space, the framework fuses raw CCTV video and floor plan information to create visual aids for crowd monitoring, as well as a sequence of crowd mobility graphs (CMGraphs) to store spatiotemporal features. This framework uses deep learning-based computer vision models, geometric transformations, and Kalman filter-based tracking algorithms to automate the retrieval of crowd congestion data, specifically the spatiotemporal distribution of individuals and the overall crowd flow. The resulting collective crowd movement data is then stored in the CMGraphs, which are designed to facilitate congestion forecasting at key exit/entry regions. We demonstrate our framework on two videos, one from a public train station dataset and the other recorded at a stadium following a crowded football game. Using both qualitative and quantitative insights from the experiments, we demonstrate that the suggested framework can be useful in helping urban planners and infrastructure operators manage congestion hazards.
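The framework above mentions Kalman filter-based tracking of individuals. As a hedged illustration of the basic mechanism (a generic constant-velocity filter, not the paper's exact tracker or noise settings), the sketch below runs one predict/update cycle on a 2D pedestrian position.

```python
import numpy as np

# Constant-velocity Kalman filter for a 2D pedestrian track.
# State: [x, y, vx, vy]; measurement: [x, y]. Noise levels are illustrative.
dt = 1.0
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)          # state transition
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)          # measurement model
Q = np.eye(4) * 0.01                               # process noise
R = np.eye(2) * 1.0                                # measurement noise

def kalman_step(x, P, z):
    """One predict/update cycle given state x, covariance P, measurement z."""
    # Predict
    x = F @ x
    P = F @ P @ F.T + Q
    # Update
    y = z - H @ x                                  # innovation
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)                 # Kalman gain
    x = x + K @ y
    P = (np.eye(4) - K @ H) @ P
    return x, P

x, P = np.array([100.0, 200.0, 0.0, 0.0]), np.eye(4)
x, P = kalman_step(x, P, np.array([102.0, 198.5]))
print(x[:2])   # filtered position estimate
```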
... Based on the principle of crowd motion detection [4,29], the existing methods of crowd motion segmentation and clustering can be classified into three main categories, including the flow field model-based [30][31][32], probability model-based [33][34][35], and similarity-based methods [36][37][38][39][40]. The first category uses flow field models to simulate image spatial segmentation and produce spatially continuous segments consequently. ...
Article
Full-text available
Coherent motions depict the individuals' collective movements in widely existing moving crowds in physical, biological, and other systems. In recent years, similarity-based clustering algorithms, particularly the Coherent Filtering (CF) clustering approach, have achieved wide-scale popularity and acceptance in the field of coherent motion detection. In this work, a tracklet-before-clustering initialization strategy is introduced to enhance coherent motion detection. Moreover, a Hierarchical Tracklet Association (HTA) algorithm is proposed to address the disconnected KLT tracklets problem of the input motion feature, thereby properly repairing trajectories to optimize the CF performance of moving-crowd clustering. The experimental results showed that the proposed method is effective and capable of extracting significant motion patterns from crowd scenes. Quantitative evaluations, such as Purity, Normalized Mutual Information Index (NMI), Rand Index (RI), and F-measure (Fm), were conducted on real-world data using a large number of video clips. This work has established a key initial step toward achieving rich pattern recognition.
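The abstract reports Purity, NMI, and Rand Index for evaluating coherent-motion clustering. The sketch below shows how such external clustering metrics can be computed from ground-truth and predicted cluster labels with scikit-learn; the toy label arrays are made up for illustration, and the F-measure variant used in the paper is not reproduced.

```python
import numpy as np
from sklearn import metrics

def purity(labels_true, labels_pred):
    """Purity: fraction of samples assigned to the majority true class
    of their predicted cluster."""
    cm = metrics.cluster.contingency_matrix(labels_true, labels_pred)
    return cm.max(axis=0).sum() / cm.sum()

# Toy ground-truth and predicted cluster assignments (illustrative only).
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 1, 2, 2])

print("Purity:", purity(y_true, y_pred))
print("NMI:   ", metrics.normalized_mutual_info_score(y_true, y_pred))
print("RI:    ", metrics.rand_score(y_true, y_pred))
```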
... In this paper, we propose a general yet elegantly simple graph-based algorithm that can achieve high accuracy in an efficient and effective manner. An overview of our framework is illustrated in Fig. 2. First, we extract movement features as tracklets by applying a KLT feature point tracker [23], where a tracklet is a fragment of a trajectory obtained over a short period [24]. These tracklets are then used to build dynamic graph sequences based on spatial and temporal features in order to model crowd behavior. ...
... The motion features are then extracted and represented as tracklets, where a tracklet is a fragment of a trajectory obtained over a short period [24], using a KLT tracker [25] over a small window of consecutive frames. The feature detectors we have considered include the Scale Invariant Feature Transform (SIFT) and Speeded Up Robust Features (SURF). ...
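Both excerpts above describe extracting tracklets with a KLT tracker over a short window of consecutive frames. A minimal OpenCV sketch of that idea is given below; the detector settings and window length are assumptions, not the cited papers' exact configuration.

```python
import cv2
import numpy as np

def extract_tracklets(frames, window=10, max_corners=300):
    """Track corner features through a short window of grayscale frames
    with pyramidal Lucas-Kanade (KLT); each surviving feature yields one
    tracklet (a short fragment of a trajectory)."""
    prev = frames[0]
    pts = cv2.goodFeaturesToTrack(prev, maxCorners=max_corners,
                                  qualityLevel=0.01, minDistance=5)
    tracklets = [[tuple(p.ravel())] for p in pts]
    alive = list(range(len(tracklets)))
    for frame in frames[1:window]:
        nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev, frame, pts, None)
        new_pts, new_alive = [], []
        for i, (p, ok) in enumerate(zip(nxt, status.ravel())):
            if ok:                       # keep only successfully tracked points
                tracklets[alive[i]].append(tuple(p.ravel()))
                new_pts.append(p)
                new_alive.append(alive[i])
        if not new_pts:
            break
        pts, alive, prev = np.array(new_pts), new_alive, frame
    return tracklets
```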
Article
Full-text available
Detecting anomalous crowd behavioral patterns from videos is an important task in video surveillance and maintaining public safety. In this work, we propose a novel architecture to detect anomalous patterns of crowd movements via graph networks. We represent individuals as nodes and individual movements with respect to other people as the node-edge relationship of an evolving graph network. We then extract the motion information of individuals using optical flow between video frames and represent their motion patterns using graph edge weights. In particular, we detect the anomalies in crowded videos by modeling pedestrian movements as graphs and then by identifying the network bottlenecks through a max-flow/min-cut pedestrian flow optimization scheme (MFMCPOS). The experiment demonstrates that the proposed framework achieves superior detection performance compared to other recently published state-of-the-art methods. Moreover, our proposed approach has relatively low computational complexity and can be used in real-time environments, which is crucial for present-day video analytics for automated surveillance.
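The abstract above identifies congestion bottlenecks via a max-flow/min-cut formulation over a pedestrian-flow graph. The sketch below is a generic illustration of that step using networkx; the graph, capacities, and source/sink choices are invented for the example, and the paper's MFMCPOS scheme is not reproduced here.

```python
import networkx as nx

# Toy pedestrian-flow graph: nodes are scene regions, edge capacities are
# the observed pedestrian throughput between regions (illustrative numbers).
G = nx.DiGraph()
G.add_edge("entrance", "hall", capacity=30)
G.add_edge("hall", "corridor", capacity=8)     # narrow passage
G.add_edge("hall", "stairs", capacity=12)
G.add_edge("corridor", "exit", capacity=25)
G.add_edge("stairs", "exit", capacity=25)

# Max-flow equals min-cut capacity; the cut edges are the flow bottlenecks.
flow_value, (reachable, non_reachable) = nx.minimum_cut(G, "entrance", "exit")
bottlenecks = [(u, v) for u in reachable for v in G[u] if v in non_reachable]
print("max flow:", flow_value)
print("bottleneck edges:", bottlenecks)
```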
... In the following, we first introduce the two datasets and the pre-processing. [34] is a dataset collected from Grand Central Station in New York. The videos were originally collected by [53], and the trajectory dataset was later created by [34]. The dataset is much larger, meeting the requirements of deep learning; it contains both video frames and density heatmaps that can be converted from trajectory data. ...
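The excerpt notes that density heatmaps can be converted from trajectory data. A simple way to perform that conversion (bin the pedestrian positions of one frame into a grid and smooth with a Gaussian kernel) is sketched below; the grid size and kernel width are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def density_heatmap(positions, height, width, sigma=4.0):
    """Convert pedestrian positions (x, y) observed in one frame into a
    smoothed crowd-density heatmap of shape (height, width)."""
    heat = np.zeros((height, width), dtype=float)
    for x, y in positions:
        xi, yi = int(round(x)), int(round(y))
        if 0 <= yi < height and 0 <= xi < width:
            heat[yi, xi] += 1.0                 # one count per person
    return gaussian_filter(heat, sigma=sigma)   # spread counts into a density

# Illustrative positions for a single frame.
hm = density_heatmap([(120.3, 80.7), (125.0, 84.2), (300.5, 210.0)], 480, 720)
print(hm.shape, hm.sum())   # total mass is (approximately) preserved by the blur
```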
Preprint
Full-text available
Monitoring and predicting crowd movements in indoor environments are of great importance in crowd management to prevent crushing and trampling. Existing works mostly focus on individual trajectory forecasting in less crowded scenes, or on crowd counting and density estimation. Only a few works predict the crowd density distribution; however, these studies either fail to realize multi-step prediction or exploit only the density-heatmap modality, ignoring the complementary information in the corresponding video frames. Therefore, we are motivated to predict the crowd density distribution over multiple time steps to facilitate long-term prediction. In this paper, a Multi-Step Crowd Density Predictor (MSCDP), which fuses video frame sequences and the corresponding density heatmaps, is proposed to accurately forecast future crowd density heatmaps. To capture long-term periodic movement features, a long-term optical flow context memory (LOFCM) module is designed to store learnable patterns. We conducted extensive experiments on two real-world datasets. Evaluation results show that our MSCDP outperforms state-of-the-art baseline techniques and MSCDP variants in terms of various prediction errors, demonstrating the effectiveness of MSCDP and each of its key components in multi-step crowd density prediction.
... Statistical machine learning has been used for trajectory analysis in computer vision [42,15,67,26,65,11]. These methods aim to learn individual motion dynamics [76], structured latent patterns in data [65,64], anomalies [12,11], etc. They provide a certain level of explainability, but are limited in model capacity for learning from large amounts of data. ...
Preprint
Full-text available
Trajectory prediction has been widely pursued in many fields, and many model-based and model-free methods have been explored. The former include rule-based, geometric or optimization-based models, and the latter are mainly comprised of deep learning approaches. In this paper, we propose a new method combining both methodologies based on a new Neural Differential Equation model. Our new model (Neural Social Physics or NSP) is a deep neural network within which we use an explicit physics model with learnable parameters. The explicit physics model serves as a strong inductive bias in modeling pedestrian behaviors, while the rest of the network provides a strong data-fitting capability in terms of system parameter estimation and dynamics stochasticity modeling. We compare NSP with 15 recent deep learning methods on 6 datasets and improve the state-of-the-art performance by 5.56%-70%. Besides, we show that NSP has better generalizability in predicting plausible trajectories in drastically different scenarios where the density is 2-5 times as high as the testing data. Finally, we show that the physics model in NSP can provide plausible explanations for pedestrian behaviors, as opposed to black-box deep learning. Code is available: https://github.com/realcrane/Human-Trajectory-Prediction-via-Neural-Social-Physics.
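NSP embeds an explicit physics model with learnable parameters inside a neural network. The sketch below is not the NSP architecture; it only illustrates the general idea with a classic social-force-style update whose parameters (tau, A, B) are made learnable in PyTorch, so gradient descent could fit them to data.

```python
import torch
import torch.nn as nn

class SocialForceStep(nn.Module):
    """One Euler step of a social-force-style pedestrian model with
    learnable parameters (illustrative, not the NSP formulation)."""
    def __init__(self):
        super().__init__()
        self.tau = nn.Parameter(torch.tensor(0.5))   # relaxation time
        self.A = nn.Parameter(torch.tensor(2.0))     # repulsion strength
        self.B = nn.Parameter(torch.tensor(0.3))     # repulsion range

    def forward(self, pos, vel, goal_vel, dt=0.4):
        # Driving force: relax toward the desired (goal) velocity.
        force = (goal_vel - vel) / self.tau
        # Pairwise repulsion from other pedestrians.
        diff = pos.unsqueeze(1) - pos.unsqueeze(0)          # (N, N, 2)
        dist = diff.norm(dim=-1, keepdim=True).clamp(min=1e-6)
        rep = self.A * torch.exp(-dist / self.B) * diff / dist
        mask = 1.0 - torch.eye(pos.size(0)).unsqueeze(-1)   # ignore self-pairs
        force = force + (rep * mask).sum(dim=1)
        vel = vel + dt * force
        return pos + dt * vel, vel

step = SocialForceStep()
pos = torch.rand(5, 2) * 10          # 5 pedestrians
vel = torch.zeros(5, 2)
goal = torch.tensor([[1.0, 0.0]]).expand(5, 2)
new_pos, new_vel = step(pos, vel, goal)
print(new_pos.shape)                 # torch.Size([5, 2])
```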
... The objective was for behaviors in each segmented semantic region to have similar characteristics and to be represented by some kind of atomic video events. Zhou et al. [35] proposed a random field topic (RFT) model including sources and sinks as high-level semantic priors to learn semantic regions from tracklets (fragments of trajectories). Similar to Wang et al. [31], these semantic regions corresponded to common paths taken by objects, whose motions in the same semantic region have similar semantic interpretations. ...
Article
Full-text available
Group behavior pattern mining in traffic scenarios is a challenging problem due to group variability and behavioral regionality. Most methods are either based on trajectory data stored in static databases regardless of the variability of group members or do not consider the influence of scene structures on behaviors. However, in traffic scenarios, information about group members may change over time, and objects' motions show regional characteristics owing to scene structures. To address these issues, we present a general framework of a moving cluster with scene constraints (MCSC) discovery consisting of semantic region segmentation, mapping, and an MCSC decision. In the first phase, a hidden Markov chain is adopted to model the evolution of behaviors along a video clip sequence, and a Markov topic model is proposed for semantic region analysis. During the mapping procedure, to generate snapshot clusters, moving objects are mapped into the corresponding sets of moving objects according to the semantic regions where they are located at each timestamp. In the MCSC decision phase, a candidate MCSC recognition algorithm and screening algorithm are designed to incrementally identify and output MCSCs. The effectiveness of the proposed approach is verified by experiments carried out using public road traffic data.
... Vision-based motion trajectory prediction is essential for practical applications such as visual surveillance and self-driving cars (see Fig. 21), in which reasoning about the future motion patterns of a pedestrian is critical. A large body of work learns motion patterns by clustering trajectories (Zhou et al., 2011; Morris & Trivedi, 2011; Kim et al., 2011; Hu et al., 2007). However, forecasting the future motion trajectory of a person is challenging, as the prediction cannot be made in isolation. ...
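The excerpt above refers to work that learns motion patterns by clustering trajectories. A minimal version of that idea (resample every trajectory to a fixed number of points and cluster the flattened coordinates with k-means) is sketched below; the resampling length and the number of clusters are arbitrary choices, not taken from the cited papers.

```python
import numpy as np
from sklearn.cluster import KMeans

def resample(traj, n=16):
    """Resample a trajectory (list of (x, y) points) to n equally spaced
    points along its arc length so trajectories become comparable vectors."""
    traj = np.asarray(traj, dtype=float)
    seg = np.linalg.norm(np.diff(traj, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])
    t = np.linspace(0.0, s[-1], n)
    x = np.interp(t, s, traj[:, 0])
    y = np.interp(t, s, traj[:, 1])
    return np.stack([x, y], axis=1).ravel()       # shape (2n,)

def cluster_trajectories(trajectories, k=3):
    """Cluster fixed-length trajectory vectors into k motion patterns."""
    X = np.stack([resample(t) for t in trajectories])
    return KMeans(n_clusters=k, n_init=10).fit_predict(X)

# Illustrative trajectories: two go right, one goes up.
trajs = [[(0, 0), (5, 1), (10, 2)],
         [(0, 1), (6, 1), (11, 1)],
         [(0, 0), (1, 5), (2, 10)]]
print(cluster_trajectories(trajs, k=2))
```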
Article
Full-text available
Derived from rapid advances in computer vision and machine learning, video analysis tasks have been moving from inferring the present state to predicting the future state. Vision-based action recognition and prediction from videos are such tasks, where action recognition infers human actions (present state) from complete action executions, and action prediction predicts human actions (future state) from incomplete action executions. These two tasks have become particularly prevalent topics recently because of their rapidly emerging real-world applications, such as visual surveillance, autonomous driving vehicles, entertainment, and video retrieval. Many attempts have been made over the last few decades to build a robust and effective framework for action recognition and prediction. In this paper, we survey the state-of-the-art techniques in action recognition and prediction. Existing models, popular algorithms, technical difficulties, popular action databases, evaluation protocols, and promising future directions are also provided with systematic discussions.
... Interframe-motion-based methods [9][10][11] rely on using motion information of targets between two consecutive frames to extract motion patterns. These methods generally perform more robustly in crowded scenarios but are often considered not suitable for extracting long-range motion patterns [12]. Multiframe-motion-based methods [1][2][3]13] instead use motion information across multiple frames, e.g. ...
... short-duration tracks or complete trajectories, as estimated with video tracking, and are regarded as more helpful in extracting long-range patterns. These methods generally involve first estimating tracked trajectories of targets [1][2][3][12], followed by encoding trajectory information into feature space(s) [2,13-15], and performing trajectory clustering [1,2,4,13,14] or classification [3,15] to determine dominant patterns. Some of these approaches [2,4,13,14] work in an offline manner, as they assume the availability of all estimated trajectories a priori and cannot incorporate new trajectories incrementally at the clustering or classification stage. ...
Article
Full-text available
This paper investigates the use of Siamese networks for trajectory similarity analysis in surveillance tasks. Specifically, the proposed approach uses an auto-encoder as part of training a discriminative twin (Siamese) network to perform trajectory similarity analysis, thus presenting an end-to-end framework for online motion-pattern extraction in the scene, with the ability to incorporate new incoming trajectories incrementally. The effectiveness of the proposed method is evaluated on four challenging public real-world datasets containing both vehicle and person targets, and compared with five existing methods. The proposed method consistently shows performance better than or comparable to the existing methods on all datasets.
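The abstract describes training a Siamese network (with an auto-encoder component) for trajectory similarity. The sketch below only illustrates the Siamese half of such a design: a shared GRU encoder maps two trajectories to embeddings whose distance serves as the similarity score. The architecture sizes, contrastive margin, and omission of the auto-encoder branch are assumptions for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrajectoryEncoder(nn.Module):
    """Shared encoder of a Siamese pair: a GRU over (x, y) points,
    embedding each trajectory as a projection of its final hidden state."""
    def __init__(self, hidden=32, embed=16):
        super().__init__()
        self.rnn = nn.GRU(input_size=2, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, embed)

    def forward(self, traj):                 # traj: (batch, T, 2)
        _, h = self.rnn(traj)
        return self.head(h[-1])              # (batch, embed)

def contrastive_loss(z1, z2, same, margin=1.0):
    """Pull embeddings of similar trajectories together, push dissimilar
    ones apart by at least `margin`."""
    d = F.pairwise_distance(z1, z2)
    return (same * d.pow(2) + (1 - same) * F.relu(margin - d).pow(2)).mean()

encoder = TrajectoryEncoder()
a = torch.rand(4, 20, 2)                     # 4 trajectories, 20 points each
b = torch.rand(4, 20, 2)
same = torch.tensor([1.0, 0.0, 1.0, 0.0])    # toy similarity labels
loss = contrastive_loss(encoder(a), encoder(b), same)
loss.backward()
print(float(loss))
```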