Conference Paper

Collecting and Annotating Human Activities in Web Videos

Authors: Fabian Caba Heilbron and Juan Carlos Niebles

Abstract

Recent efforts in computer vision tackle the problem of human activity understanding in video sequences. Traditionally, these algorithms require annotated video data to learn models. In this paper, we introduce a novel data collection framework, to take advantage of the large amount of video data available on the web. We use this new framework to retrieve videos of human activities in order to build datasets for training and evaluating computer vision algorithms. We rely on Amazon Mechanical Turk workers to obtain high accuracy annotations. An agglomerative clustering technique brings the possibility to achieve reliable and consistent annotations for temporal localization of human activities in videos. Using two different datasets, Olympics Sports and our novel Daily Human Activities dataset, we show that our collection/annotation framework achieves robust annotations for human activities in large amount of video data.
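The agglomerative clustering step mentioned in the abstract can be pictured with a short sketch. This is a minimal illustration of the idea rather than the authors' exact procedure: it assumes each Mechanical Turk worker returns one (start, end) pair in seconds for an activity instance, clusters those pairs hierarchically, and reports the median interval of the largest cluster as the consensus temporal annotation; the worker responses below are invented.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical AMT responses for one video: (start, end) in seconds.
worker_annotations = np.array([
    [12.0, 31.5],
    [11.4, 30.9],
    [13.1, 32.0],
    [55.0, 70.2],   # outlier: one worker marked a different segment
    [12.6, 31.1],
])

# Agglomerative (average-linkage) clustering of the interval endpoints.
Z = linkage(worker_annotations, method="average", metric="euclidean")

# Cut the dendrogram: intervals whose endpoints differ by more than
# roughly 5 seconds end up in different clusters.
labels = fcluster(Z, t=5.0, criterion="distance")

# Keep the largest cluster and report its median interval as consensus.
largest = np.bincount(labels).argmax()
consensus = np.median(worker_annotations[labels == largest], axis=0)
print(f"consensus segment: {consensus[0]:.1f}s - {consensus[1]:.1f}s")
```

The distance threshold used to cut the dendrogram (5 seconds here) is a free parameter that trades off how much disagreement among workers is tolerated before an annotation is treated as an outlier.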


... We found a substantial gap in the literature despite the emergence of newer datasets in the last 5 years (Carreira et al. 2018, 2019; Damen et al. 2018; Goyal et al. 2017; Heilbron and Niebles 2014; Kay et al. 2017; Sigurdsson et al. 2016). Researchers have chosen the UCF101 (Soomro, Roshan Zamir, and Shah 2012) and HMDB51 (Kuehne et al. 2011) datasets very frequently. ...
... Three steps of the ActivityNet human activity collection and annotation process (Source: Heilbron and Niebles 2014). ...
Article
Full-text available
Different types of research have been done on video data using Artificial Intelligence (AI) deep learning techniques. Most of them concern behavior analysis, scene understanding, scene labeling, human activity recognition (HAR), object localization, and event recognition. Among all these, HAR is one of the most challenging tasks and thrust areas of video data processing research. HAR is applicable in different areas, such as video surveillance systems, human-computer interaction, human behavior characterization, and robotics. This paper presents a comprehensive comparative review of vision-based human activity recognition, with the main focus on deep learning techniques evaluated on various benchmark video datasets. We propose a new taxonomy for categorizing the literature into CNN- and RNN-based approaches. We further divide these approaches into four sub-categories and present various methodologies with their experimental datasets and efficiency. A short comparison is also made with the handcrafted feature-based approach and its fusion with deep learning to show the evolution of HAR methods. Finally, we discuss future research directions and some open challenges in human activity recognition. The objective of this survey is to present the current progress of vision-based deep learning HAR methods with an up-to-date study of the literature.
... Mechanical Turk, which requires simple training for annotators, can be widely used to efficiently collect large amounts of data, especially for visual tasks with high data redundancy and diverse variations between images [5][6][7][8][9]. The process of securing the annotation database in the visual recognition problem is essential for defining and training algorithms but is considered separately from the model training process. ...
... They are generally annotated at a level that allows training by the public with a mild iterative training process. For this reason, crowdsourcing platforms [5][6][7][8][9] that use web-based annotation tools are widely used in visual recognition problems. As the performance of visual recognition technology improves, various annotation tools for training are being introduced in the field of medical imaging [10,11]. ...
Preprint
Full-text available
As the area of application of deep neural networks expands to areas requiring expertise, e.g., in medicine and law, more exquisite annotation processes for expert-knowledge training are required. In particular, it is difficult to guarantee generalization performance in the clinical field in the case of expert-knowledge training, where opinions on annotations may differ even among experts. To raise the issue of the annotation generation process for expertise training of CNNs, we verified the annotations for surgical phase recognition of laparoscopic cholecystectomy and subtotal gastrectomy for gastric cancer. We produce calibrated annotations for the seven phases of cholecystectomy by analyzing the discrepancies of previously annotated labels and by discussing the criteria of surgical phases. For gastrectomy for gastric cancer, which has twenty-one more complex surgical phases, we generate a consensus annotation through a revision process with five specialists. By training the CNN-based surgical phase recognition networks with revised annotations, we achieved improved generalization performance over models trained with the original annotations under the same cross-validation settings. We show that the expertise data annotation pipeline for deep neural networks should be made more rigorous according to the type of problem before being applied in the clinical field.
... Recent advances in 3D convolutional neural networks (CNNs) [1] provide a method to yield a semantic representation of each short video segment that also embeds temporal dynamics. However, for longer and more complicated video sequences (such as the 120-second videos in the ActivityNet dataset [2]), the context understanding of a video becomes more important for generating natural descriptions. Since long videos involve multiple events ranging across multiple time scales, capturing diverse temporal context between events is the key to natural video description with context understanding. ...
... [1], [2], and [3] denote three read modes: backward, content-based, and forward; a read mode is assigned to each read head. The read weighting is defined as a weighted sum of the read-mode vectors: the backward weighting, the content-based weighting, and the forward weighting (see the worked form after this entry). ...
Article
Full-text available
Recent video captioning models aim at describing all events in a long video. However, their event descriptions do not fully exploit the contextual information included in a video because they lack the ability to remember information changes over time. To address this problem, we propose a novel context-aware video captioning model that generates natural language descriptions based on the improved video context understanding. We introduce an external memory, differential neural computer (DNC), to improve video context understanding. DNC naturally learns to use its internal memory for context understanding and also provides contents of its memory as an output for additional connection. By sequentially connecting DNC-based caption models (DNC augmented LSTM) through this memory information, our consecutively connected DNC architecture can understand the context in a video without explicitly searching for event-wise correlation. Our consecutive DNC is sequentially trained with its language model (LSTM) for each video clip to generate context-aware captions with superior quality. In experiments, we demonstrate that our model provides more natural and coherent captions which reflect previous contextual information. Our model also shows superior quantitative performance on video captioning in terms of BLEU (BLEU@4 4.37), METEOR (9.57), and CIDEr-D (28.08).
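For reference, the read-mode weighting paraphrased in the excerpt above follows the standard differentiable neural computer formulation of Graves et al. (2016); for read head i at time t it can be written as

```latex
\mathbf{w}_t^{r,i} \;=\; \pi_t^{i}[1]\,\mathbf{b}_t^{i} \;+\; \pi_t^{i}[2]\,\mathbf{c}_t^{i} \;+\; \pi_t^{i}[3]\,\mathbf{f}_t^{i}
```

where b, c, and f are the backward, content-based, and forward weightings and the read-mode distribution pi is emitted by the controller.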
... MSR-VTT currently has the most extensive vocabulary compared to other datasets. ActivityNet 200 (Release 1.3) contains 10,024 training, 4926 validation, and 5044 testing videos, totaling around 20,000 videos from 200 activity classes like eating and drinking, recreation, and household activities [49]. Table 1 summarizes the statistics for these datasets. ...
... The different sizes of words in the figure signify how frequently certain words are used in the annotations provided in the dataset; the larger the size of a word, the more frequent is its use. [49]. Table 1 summarizes the statistics for these datasets. ...
Article
Full-text available
Traditionally, searching for videos on popular streaming sites like YouTube is performed by taking into consideration the keywords, titles, and descriptions that are already tagged along with the video. However, the video content itself is not utilized in answering the user's query because of the difficulty of encoding the events in a video and comparing them to the search query. One solution to tackle this problem is to encode the events in a video and then compare them to the query in the same space. A method of encoding meaning to a video could be video captioning. The captioned events in the video can be compared to the query of the user, and we can get the optimal search space for the videos. There have been many developments over the course of the past few years in modeling video-caption generators and sentence embedding. In this paper, we exploit an end-to-end video captioning model and various sentence embedding techniques that collectively help in building the proposed video-searching method. The YouCook2 dataset was used for the experimentation. Seven sentence embedding techniques were used, out of which the Universal Sentence Encoder outperformed all the other six, with a median percentile score of 99.51. Thus, this method of searching, when integrated with traditional methods, can help improve the quality of search results.
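The retrieval step this paper builds on, comparing the user's query to generated captions in a shared embedding space, reduces to a cosine-similarity ranking. A minimal sketch, assuming caption and query embeddings have already been produced by some sentence encoder (the random arrays below are placeholders, not Universal Sentence Encoder outputs):

```python
import numpy as np

def rank_videos(query_emb: np.ndarray, caption_embs: np.ndarray, top_k: int = 5):
    """Rank captioned videos by cosine similarity to the query embedding.

    query_emb: (d,) embedding of the search query.
    caption_embs: (n_videos, d) embeddings of generated captions.
    """
    q = query_emb / np.linalg.norm(query_emb)
    c = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    sims = c @ q                       # cosine similarity per video
    order = np.argsort(-sims)[:top_k]  # best matches first
    return order, sims[order]

# Placeholder embeddings (e.g., 512-dimensional sentence vectors).
rng = np.random.default_rng(0)
caption_embs = rng.normal(size=(100, 512))
query_emb = rng.normal(size=512)
idx, scores = rank_videos(query_emb, caption_embs)
print(idx, scores)
```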
... Recently, there is a rising number of works studying the temporal annotation of actions, activities, and relations [6,7,9,20]. Intuitively, scaling up the temporal annotation is much easier than the spatiotemporal annotation in video object annotation. ...
... Current strategies in temporal annotation can be categorized into two types. One category is direct temporal localization by labeling the starting and ending frames of targets in a video [7,9]. In order to find the boundary frame accurately, it normally requires the annotator to spend much time browsing the video. ...
Conference Paper
Understanding the objects and relations between them is indispensable to fine-grained video content analysis, which is widely studied in recent research works in multimedia and computer vision. However, existing works are limited to evaluating with either small datasets or indirect metrics, such as the performance over images. The underlying reason is that the construction of a large-scale video dataset with dense annotation is tricky and costly. In this paper, we address several main issues in annotating objects and relations in user-generated videos, and propose an annotation pipeline that can be executed at a modest cost. As a result, we present a new dataset, named VidOR, consisting of 10k videos (84 hours) together with dense annotations that localize 80 categories of objects and 50 categories of predicates in each video. We have made the training and validation set public and extendable for more tasks to facilitate future research on video object and relation recognition.
... For evaluation, we use 3 open-vocabulary test datasets: TAO (Tracking Any Object) [5], ActivityNet [12] and RareAct [34]. TAO and ActivityNet are exclusively object and action focused datasets respectively. ...
Preprint
Pre-trained vision-language models (VLMs) have enabled significant progress in open vocabulary computer vision tasks such as image classification, object detection and image segmentation. Some recent works have focused on extending VLMs to open vocabulary single label action classification in videos. However, previous methods fall short in holistic video understanding which requires the ability to simultaneously recognize multiple actions and entities e.g., objects in the video in an open vocabulary setting. We formulate this problem as open vocabulary multilabel video classification and propose a method to adapt a pre-trained VLM such as CLIP to solve this task. We leverage large language models (LLMs) to provide semantic guidance to the VLM about class labels to improve its open vocabulary performance with two key contributions. First, we propose an end-to-end trainable architecture that learns to prompt an LLM to generate soft attributes for the CLIP text-encoder to enable it to recognize novel classes. Second, we integrate a temporal modeling module into CLIP's vision encoder to effectively model the spatio-temporal dynamics of video concepts as well as propose a novel regularized finetuning technique to ensure strong open vocabulary classification performance in the video domain. Our extensive experimentation showcases the efficacy of our approach on multiple benchmark datasets.
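The underlying task formulation, open-vocabulary multilabel classification with a CLIP-style model, can be sketched independently of the paper's prompting and temporal-modeling contributions. A hedged illustration that assumes video and label-text embeddings are already available from some VLM; the label list, temperature, and 0.5 threshold are illustrative only:

```python
import numpy as np

def multilabel_scores(video_emb, text_embs, temperature=0.07):
    """Sigmoid-scaled cosine similarities between one video and each label text."""
    v = video_emb / np.linalg.norm(video_emb)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = (t @ v) / temperature
    return 1.0 / (1.0 + np.exp(-logits))   # independent per-label probabilities

labels = ["washing dishes", "dog", "pouring water", "guitar"]  # open vocabulary
rng = np.random.default_rng(1)
video_emb = rng.normal(size=512)
text_embs = rng.normal(size=(len(labels), 512))
probs = multilabel_scores(video_emb, text_embs)
predicted = [label for label, p in zip(labels, probs) if p > 0.5]
print(dict(zip(labels, probs.round(3))), predicted)
```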
... Data curation is an expensive process that typically involves extensive manual annotation. In some cases, datasets have employed a semi-automatic crowdsourcing approach for collection and annotation [12,18,40,13]. (Figure 2 of the citing paper: the average, maximum, and minimum number of action segments for each procedure.) ...
Conference Paper
Full-text available
The application of deep learning to nursing procedure activity understanding has the potential to greatly enhance the quality and safety of nurse-patient interactions. By utilizing the technique, we can facilitate training and education, improve quality control, and enable operational compliance monitoring. However, the development of automatic recognition systems in this field is currently hindered by the scarcity of appropriately labeled datasets. The existing video datasets pose several limitations: 1) these datasets are small-scale in size to support comprehensive investigations of nursing activity; 2) they primarily focus on single procedures, lacking expert-level annotations for various nursing procedures and action steps; and 3) they lack temporally localized annotations, which prevents the effective localization of targeted actions within longer video sequences. To mitigate these limitations, we propose NurViD, a large video dataset with expert-level annotation for nursing procedure activity understanding. NurViD consists of over 1.5k videos totaling 144 hours, making it approximately four times longer than the existing largest nursing activity datasets. Notably, it encompasses 51 distinct nursing procedures and 177 action steps, providing a much more comprehensive coverage compared to existing datasets that primarily focus on limited procedures. To evaluate the efficacy of current deep learning methods on nursing activity understanding, we establish three benchmarks on NurViD: procedure recognition on untrimmed videos, procedure and action recognition on trimmed videos, and action detection. Our benchmark and code will be available at https://github.com/minghu0830/NurViD-benchmark.
... Pedestrian action segmentation from video stimuli: Given our interest in studying affective responses of participants toward pedestrian crossing actions, videos from the JAAD dataset had to be segmented to the relevant aspect of the video, i.e., the part where the pedestrian action occurred. The duration of the pedestrian actions in the videos was identified by adapting the temporal localization method used in activity annotation (Heilbron & Niebles, 2014). Annotators (N = 7) from our institute were asked to mark the beginning and end of a pedestrian action in the 10 JAAD videos. ...
... Some datasets are annotated by domain experts to encourage more reliable labeling, particularly for challenging tasks, but this expertise comes at a much higher cost and is thus hard to scale. We adapted a semi-automatic crowdsourcing approach to collect and annotate our datasets, inspired by many previous works [9,15,23,49]. ...
Preprint
Full-text available
Monitoring animal behavior can facilitate conservation efforts by providing key insights into wildlife health, population status, and ecosystem function. Automatic recognition of animals and their behaviors is critical for capitalizing on the large unlabeled datasets generated by modern video devices and for accelerating monitoring efforts at scale. However, the development of automated recognition systems is currently hindered by a lack of appropriately labeled datasets. Existing video datasets 1) do not classify animals according to established biological taxonomies; 2) are too small to facilitate large-scale behavioral studies and are often limited to a single species; and 3) do not feature temporally localized annotations and therefore do not facilitate localization of targeted behaviors within longer video sequences. Thus, we propose MammalNet, a new large-scale animal behavior dataset with taxonomy-guided annotations of mammals and their common behaviors. MammalNet contains over 18K videos totaling 539 hours, which is ~10 times larger than the largest existing animal behavior dataset. It covers 17 orders, 69 families, and 173 mammal categories for animal categorization and captures 12 high-level animal behaviors that received focus in previous animal behavior studies. We establish three benchmarks on MammalNet: standard animal and behavior recognition, compositional low-shot animal and behavior recognition, and behavior detection. Our dataset and code have been made available at: https://mammal-net.github.io.
... The 1,000 videos in the test set are from movies independent of the training and validation splits. (d) ActivityNet [12,17] consists of 20,000 YouTube videos, some of which are minutes long. We follow [13,67] in concatenating the descriptions of a video to form a paragraph and evaluate the model with video-paragraph retrieval on the val1 split. ...
Preprint
Full-text available
Recently, large-scale pre-training methods like CLIP have made great progress in multi-modal research such as text-video retrieval. In CLIP, transformers are vital for modeling complex multi-modal relations. However, in the vision transformer of CLIP, the essential visual tokenization process, which produces discrete visual token sequences, generates many homogeneous tokens due to the redundant nature of consecutive and similar frames in videos. This significantly increases computation costs and hinders the deployment of video retrieval models in web applications. In this paper, to reduce the number of redundant video tokens, we design a multi-segment token clustering algorithm to find the most representative tokens and drop the non-essential ones. As the frame redundancy occurs mostly in consecutive frames, we divide videos into multiple segments and conduct segment-level clustering. Center tokens from each segment are later concatenated into a new sequence, while their original spatial-temporal relations are well maintained. We instantiate two clustering algorithms to efficiently find deterministic medoids and iteratively partition groups in high-dimensional space. Through this token clustering and center selection procedure, we successfully reduce computation costs by removing redundant visual tokens. This method further enhances segment-level semantic alignment between video and text representations, enforcing the spatio-temporal interactions of tokens from within-segment frames. Our method, coined CenterCLIP, surpasses the existing state of the art by a large margin on typical text-video benchmarks, while reducing the training memory cost by 35% and accelerating the inference speed by 14% in the best case. The code is available at https://github.com/mzhaoshuai/CenterCLIP.
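The segment-level clustering idea can be pictured with a small medoid-selection sketch. This is not the CenterCLIP algorithm itself, only an illustration of the principle: split the token sequence into temporal segments and keep, per segment, the token with the smallest total distance to its peers.

```python
import numpy as np

def segment_medoids(tokens: np.ndarray, n_segments: int) -> np.ndarray:
    """Keep one representative (medoid) token per temporal segment.

    tokens: (n_tokens, d) visual tokens ordered in time.
    Returns an (n_segments, d) array of medoid tokens.
    """
    medoids = []
    for seg in np.array_split(tokens, n_segments):
        # Pairwise Euclidean distances within the segment.
        diff = seg[:, None, :] - seg[None, :, :]
        dists = np.sqrt((diff ** 2).sum(-1))
        medoids.append(seg[dists.sum(axis=1).argmin()])  # smallest total distance
    return np.stack(medoids)

tokens = np.random.default_rng(2).normal(size=(196, 64))  # toy token sequence
print(segment_medoids(tokens, n_segments=8).shape)        # (8, 64)
```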
... Existing work, including research and applications, mostly exposes categories and options to users directly when the number of options is within an acceptable range (D. et al., 1979; Miller, 1956). When there are many options to choose from, a common best practice is to put those decision criteria into separate filters (see Figure 4.5), which are used as a facilitating mechanism to give users an iterative searching experience and immediate feedback on the narrowed-down subsets of interest (Shneiderman, 1994; Heilbron and Niebles, 2014). Having seen different classification workflows and how they might impact the classification result, it is interesting to investigate this as a whole in a case where the same objects are being classified with different multiple-step task designs. ...
Thesis
Microtask crowdsourcing has been applied in many fields in the past decades, but there are still important challenges not fully addressed, especially in task/workflow design and in aggregation methods that help produce a correct result or assess the quality of the result. This research took a deeper look at crowdsourcing classification tasks and explored how task and workflow design can impact the quality of the classification result. This research used a large online knowledge base and three citizen science projects as examples to investigate workflow design variations and their impacts on the quality of the classification result, based on statistical, probabilistic, or machine learning models for true label inference, such that design principles can be recommended and applied in other citizen science projects or other human-computer hybrid systems to improve overall quality. It is noticeable that most of the existing research on aggregation methods to infer true labels focuses on simple single-step classification, though a large portion of classification tasks are not simple single-step classifications. There is only limited research looking into such multiple-step classification tasks in recent years, and each effort has a domain-specific or problem-specific focus, making it difficult to apply to other multiple-step classification cases. This research focused on multiple-step classification, modeling the classification task as a path-searching problem in a graph, and explored alternative aggregation strategies to infer correct label paths by leveraging established individual algorithms, from simple majority voting to more sophisticated algorithms like message passing and expectation-maximisation. This research also looked at alternative workflow designs to classify objects, using DBpedia entity classification as a case study, and demonstrated the pros and cons of automatic, hybrid, and completely human-based workflows. As a result, it is able to provide suggestions to task requesters for crowdsourcing classification task design and help them choose the aggregation method that will achieve a good-quality result.
... The ActivityNet dataset [73], [94] family was produced for both action recognition and detection. Example human action classes include "Drinking coffee," "Getting a tattoo," and "Ironing clothes." ...
Article
Full-text available
Many believe that the successes of deep learning on image understanding problems can be replicated in the realm of video understanding. However, due to the scale and temporal nature of video, the span of video understanding problems and the set of proposed deep learning solutions is arguably wider and more diverse than those of their 2D image siblings. Finding, identifying, and predicting actions are a few of the most salient tasks in this emerging and rapidly evolving field. With a pedagogical emphasis, this tutorial introduces and systematizes fundamental topics, basic concepts, and notable examples in supervised video action understanding. Specifically, we clarify a taxonomy of action problems, catalog and highlight video datasets, describe common video data preparation methods, present the building blocks of state-of-the-art deep learning model architectures, and formalize domain-specific metrics to baseline proposed solutions. This tutorial is intended to be accessible to a general computer science audience and assumes a conceptual understanding of supervised learning.
... An effective deep neural network model requires a large amount of accurately labeled samples for training, because millions or even billions of parameters need to be learned from the labeled samples. However, reliable labels rely heavily on extensive human labor [18,19]. This time-consuming and labor-intensive labeling work has hindered the rapid and widespread application of these technologies in practice. ...
Article
Full-text available
To reduce the discrepancy between the source and target domains, a new multi-label adaptation network (ML-ANet) based on multiple kernel variants with maximum mean discrepancies is proposed in this paper. The hidden representations of the task-specific layers in ML-ANet are embedded in the reproducing kernel Hilbert space (RKHS) so that the mean-embeddings of specific features in different domains could be precisely matched. Multiple kernel functions are used to improve feature distribution efficiency for explicit mean embedding matching, which can further reduce domain discrepancy. Adverse weather and cross-camera adaptation examinations are conducted to verify the effectiveness of our proposed ML-ANet. The results show that our proposed ML-ANet achieves higher accuracies than the compared state-of-the-art methods for multi-label image classification in both the adverse weather adaptation and cross-camera adaptation experiments. These results indicate that ML-ANet can alleviate the reliance on fully labeled training data and improve the accuracy of multi-label image classification in various domain shift scenarios.
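The maximum mean discrepancy with multiple kernels that this abstract builds on is a standard quantity; a compact numpy version, as a generic sketch rather than the ML-ANet implementation, looks like this:

```python
import numpy as np

def rbf(x, y, gamma):
    """RBF kernel matrix between rows of x and rows of y."""
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def multi_kernel_mmd2(source, target, gammas=(0.5, 1.0, 2.0)):
    """Squared MMD between source and target features, averaged over RBF kernels."""
    mmd2 = 0.0
    for g in gammas:
        k_ss = rbf(source, source, g).mean()
        k_tt = rbf(target, target, g).mean()
        k_st = rbf(source, target, g).mean()
        mmd2 += k_ss + k_tt - 2.0 * k_st
    return mmd2 / len(gammas)

rng = np.random.default_rng(3)
src = rng.normal(loc=0.0, size=(64, 16))
tgt = rng.normal(loc=0.5, size=(64, 16))   # shifted (adapted-to) domain
print(multi_kernel_mmd2(src, tgt))
```

This is the biased estimator (diagonal kernel terms included); domain-adaptation losses typically minimize such a quantity between source and target feature batches.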
... The annotation process follows the same method as in ImageNet. It then spread to image and video annotation (Heilbron and Niebles, 2014; Vondrick et al., 2010). The strategy is twofold: ...
Thesis
Full-text available
Action recognition in videos is one of the key problems in visual data interpretation. Despite intensive research, differentiating and recognizing similar actions remains a challenge. This thesis deals with fine-grained classification of sport gestures from videos, with an application to table tennis. In this manuscript, we propose a method based on deep learning for automatically segmenting and classifying table tennis strokes in videos. Our aim is to design a smart system for students and teachers to analyze their performances. By profiling the players, a teacher can tailor the training sessions more efficiently in order to improve their skills. Players can also get instant feedback on their performances. For developing such a system with fine-grained classification, a very specific dataset is needed to supervise the learning process. To that aim, we built the “TTStroke-21” dataset, which is composed of 20 stroke classes plus a rejection class. The TTStroke-21 dataset comprises video clips of recorded table tennis exercises performed by students at the sport faculty of the University of Bordeaux - STAPS. These recorded sessions were annotated by professional players or teachers using a crowdsourced annotation platform. The annotations consist of a description of the handedness of the player and information for each stroke performed (starting and ending frames, class of the stroke). Fine-grained action recognition has some notable differences from coarse-grained action recognition. In datasets used for coarse-grained action recognition, the background context often provides discriminative information that methods can use to classify the action, rather than focusing on the action itself. In fine-grained classification, where the inter-class similarity is high, discriminative visual features are harder to extract and motion plays a key role in characterizing an action. In this thesis, we introduce a Twin Spatio-Temporal Convolutional Neural Network. This deep learning network takes as inputs an RGB image sequence and its computed optical flow. The RGB image sequence allows our model to capture appearance features, while the optical flow captures motion features. The two streams are processed in parallel using 3D convolutions and fused at the last stage of the network. Spatio-temporal features extracted in the network allow efficient classification of video clips from TTStroke-21. Our method achieves an average classification performance of 87.3%, with a best run of 93.2% accuracy on the test set. When applied to the joint detection and classification task, the proposed method reaches an accuracy of 82.6%. A systematic study of the influence of each stream and of the fusion type on classification accuracy has been performed, giving clues on how to obtain the best performance. A comparison of different optical flow methods and the role of their normalization on the classification score is also provided. The extracted features are further analyzed by back-tracing strong features from the last convolutional layer to understand the decision path of the trained model. Finally, we introduce an attention mechanism to help the model focus on particular characteristic features and also to speed up the training process. For comparison purposes, we provide the performance of other methods on TTStroke-21 and test our model on other datasets.
We notice that models performing well on coarse-grained action datasets do not always perform well on our fine-grained action dataset. The research presented in this manuscript was validated with publications in one international journal, five international conference papers, and two international workshop papers, as well as a recurring task in the MediaEval workshop in which participants can apply their action recognition methods to TTStroke-21. Two additional international workshop papers are in progress, along with one book chapter.
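The twin spatio-temporal network described above, an RGB stream and an optical-flow stream processed with 3D convolutions and fused late, can be sketched as follows. This is a minimal PyTorch illustration of the two-stream idea with invented layer sizes, not the architecture used in the thesis:

```python
import torch
import torch.nn as nn

class TinyStream(nn.Module):
    """One 3D-convolutional branch over a (B, C, T, H, W) clip."""
    def __init__(self, in_channels: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )

    def forward(self, x):
        return self.features(x).flatten(1)   # (B, 64)

class TwinStrokeClassifier(nn.Module):
    """Late fusion of an RGB stream and an optical-flow stream."""
    def __init__(self, n_classes: int = 21):   # 20 stroke classes + rejection
        super().__init__()
        self.rgb = TinyStream(in_channels=3)
        self.flow = TinyStream(in_channels=2)   # (dx, dy) flow channels
        self.classifier = nn.Linear(64 + 64, n_classes)

    def forward(self, rgb_clip, flow_clip):
        fused = torch.cat([self.rgb(rgb_clip), self.flow(flow_clip)], dim=1)
        return self.classifier(fused)

model = TwinStrokeClassifier()
rgb = torch.randn(2, 3, 16, 112, 112)    # batch of 16-frame RGB clips
flow = torch.randn(2, 2, 16, 112, 112)   # matching optical-flow clips
print(model(rgb, flow).shape)            # torch.Size([2, 21])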
... In addition to pervasive sensing, a large body of work in HCI has investigated video analysis interfaces to support researchers in analyzing longitudinal datasets efficiently. Inspired by Vcode [44] which identified the efficiency of synchronized video playback and a timeline-based annotation interface, similar techniques were applied and tested in various research fields including behavioural studies [20,34], multimedia analysis [49,64,65] or learning contexts [23,26]. However, this approach is still insufficient for analyzing the behaviour of specific objects or people in the scene (e.g., trajectories, territories, movements, etc.). ...
Article
Full-text available
A well-designed workplace has a direct and significant impact on our work experiences and productivity. In this paper, we investigate how office interior layouts influence the way we socially experience office buildings. We extend the previous work that examined static social formations of office workers by looking at their dynamic movements during informal desk visiting interactions. With a month of video data collected in the office, we implemented a vision-based analysis system that enables us to examine how people occupy space in social contexts in relation to desk configurations. The results showed that both social territoriality and approach path highlight social comfort in human-building interactions, which are different from efficiency or path optimization. From these findings, we propose the concepts of socio-spatial comfort: social buffers, privacy buffers, and varying proxemics to inform a user-centered way of designing human building interactions and architecture.
... Here we describe some of the largest and highest quality among them. The ActivityNet dataset [21,88] family was produced "to compare algorithms for human activity understanding: global video classification, trimmed activity recognition and activity detection." Example human action classes include "Drinking coffee", "Getting a tattoo", and "Ironing clothes". ...
Preprint
Full-text available
Many believe that the successes of deep learning on image understanding problems can be replicated in the realm of video understanding. However, the span of video action problems and the set of proposed deep learning solutions is arguably wider and more diverse than those of their 2D image siblings. Finding, identifying, and predicting actions are a few of the most salient tasks in video action understanding. This tutorial clarifies a taxonomy of video action problems, highlights datasets and metrics used to baseline each problem, describes common data preparation methods, and presents the building blocks of state-of-the-art deep learning model architectures.
... Problems that must be addressed include low image or video quality, as well as duplication of data due to repeated data entry by crowds. Normalization and image/video enhancement techniques will be applied, as well as near-duplicate detection techniques as in [18]. The completeness of the metadata contents of objects can be handled by a semi-automatic annotation process using a tag recommendation system [20]; a georeferencing process that adds the metadata content of geographic locations; and a contextualization process that enriches the description of objects and relates them to other object data, which utilizes the knowledge possessed by experts. ...
... Particularly, crowd-sourced video annotation requires cost-aware and efficient methods instead of frame-by-frame labelling. The large number of frames in a video requires a more intelligent mechanism for propagating annotations from a subset of keyframes [23,24]; otherwise video crowdsourcing methods will not be scalable. ...
... Some other works have used a Bluetooth headset combined with speech recognition software to perform the annotations [33] whereas others [24] take annotations manually. There are also some works that have used a crowdsourcing approach to label activities from video [29,42]. Furthermore, once the data is labeled, the boundaries between labels may be shifted so further pre-processing to correct boundaries may be needed [35]. ...
Article
The technology trend of context-aware computer systems carries the promise of more flexible automated systems with a high degree of adaptation to the user's situation, but it implies as a precondition that the context information (such as the place, time, activity, preferences, etc.) is indeed available. One very important aspect of the user context is the activity in which the human is currently involved. Human Activity Recognition (HAR) has become a trending topic in the last years because of its potential applications in pervasive health care, assisted living, exercise monitoring, etc. Most works on HAR either require the user to label the activities as they are performed so the system can learn them, or rely on a trained device that expects a "typical" ideal user. The first approach is impractical, as the training process easily becomes time-consuming, expensive, etc., while the second one reduces the HAR precision for many non-typical users. In this work we propose a "crowdsourcing" method for building personalized models for HAR by combining the advantages of both user-dependent and general models, finding class similarities between the target user and the community users. We evaluated our approach on 4 different public datasets and showed that the personalized models outperformed the user-dependent and user-independent models when labeled data is scarce.
... There are two recent large-scale video annotation efforts that successfully utilize crowdsourcing. The first effort is ActivityNet (Heilbron and Niebles 2014) which uses a proposal/verification framework similar to that of ImageNet (Deng et al. 2009). They define a target set of actions, query video search engines for proposal videos of those actions and then ask crowd workers to clean up the results. ...
Article
Full-text available
Large-scale annotated datasets allow AI systems to learn from and build upon the knowledge of the crowd. Many crowdsourcing techniques have been developed for collecting image annotations. These techniques often implicitly rely on the fact that a new input image takes a negligible amount of time to perceive. In contrast, we investigate and determine the most cost-effective way of obtaining high-quality multi-label annotations for temporal data such as videos. Watching even a short 30-second video clip requires a significant time investment from a crowd worker; thus, requesting multiple annotations following a single viewing is an important cost-saving strategy. But how many questions should we ask per video? We conclude that the optimal strategy is to ask as many questions as possible in a HIT (up to 52 binary questions after watching a 30-second video clip in our experiments). We demonstrate that while workers may not correctly answer all questions, the cost-benefit analysis nevertheless favors consensus from multiple such cheap-yet-imperfect iterations over more complex alternatives. When compared with a one-question-per-video baseline, our method is able to achieve a 10% improvement in recall (76.7% ours versus 66.7% baseline) at comparable precision (83.8% ours versus 83.0% baseline) in about half the annotation time (3.8 minutes ours compared to 7.1 minutes baseline). We demonstrate the effectiveness of our method by collecting multi-label annotations of 157 human activities on 1,815 videos.
... For example, in the work of Kirkham et al. [39], they leveraged the error-prone task of defining activity annotation boundaries to a set of annotators to reduce "label-jittering" (activity start and end times do not align perfectly with the annotation). In the work of Heilbron and Niebles [40], the Amazon Mechanical Turk was used to recruit workers to annotate activities from video, and they achieved high quality annotations when combining the work of six annotators. Lasecki et al. [41] also used Mechanical Turk workers to annotate dependencies between actions to identify high-level home activities. ...
Article
Full-text available
Human Activity Recognition (HAR) is an important part of ambient intelligence systems since it can provide user-context information, thus allowing a greater personalization of services. One of the problems with HAR systems is that the labeling process for the training data is costly, which has hindered its practical application. A common approach is to train a general model with the aggregated data from all users. The problem is that for a new target user, this model can perform poorly because it is biased towards the majority type of users and does not take into account the particular characteristics of the target user. To overcome this limitation, a user-dependent model can be trained with data only from the target user that will be optimal for this particular user; however, this requires a considerable amount of labeled data, which is cumbersome to obtain. In this work, we propose a method to build a personalized model for a given target user that does not require large amounts of labeled data. Our method uses data already labeled by a community of users to complement the scarce labeled data of the target user. Our results showed that the personalized model outperformed the general and the user-dependent models when labeled data is scarce.
... We now describe the collection and annotation process for obtaining ActivityNet. Inspired by [5,11,38], we follow a semi-automatic crowdsourcing strategy to collect and annotate videos (Figure 2). We first search the web for potential videos depicting a particular human activity. ...
... There exist different tools for machine analysis of face videos, gathering information on head position, movement, and estimation of affective states [24,25]. In addition, multimedia annotation tools assist labeling by human analysts; this software is generally designed for user-friendly crowdsourced use [26,27] and is very helpful in annotating large-scale picture and video databases. Crowdsourced annotations usually serve to pre-select subsets of a single database for expert annotators; they are also used as teaching databases for machine learning in computer vision applications [28,25,29]. ...
Article
We studied an artificial-intelligence-assisted interaction between a computer and a human with severe speech and physical impairments (SSPI). In order to speed up AAC, we extended a former study of typing performance optimization using a framework that included head-movement-controlled assistive technology and an onscreen writing device. Quantitative and qualitative data were collected and analysed with mathematical methods, manual interpretation, and semi-supervised machine video annotation. As a result of our research, and in contrast to the former experiment's conclusions, we found that our participant had at least two different typing strategies. To maximize his communication efficiency, a more complex assistive tool is suggested, which takes the different methods into consideration.
... To overcome this, for future work we will use a crowdsourcing approach to complete missing information. The idea of using crowdsourcing for activity recognition from video data is already being explored [58,59]. However, for accelerometer data it presents several challenges because it is hard to classify an activity based on visual inspection. ...
Article
Full-text available
With the development of wearable devices that have several embedded sensors, it is possible to collect data that can be analyzed in order to understand the user's needs and provide personalized services. Examples of these types of devices are smartphones, fitness bracelets, and smartwatches, just to mention a few. In recent years, several works have used these devices to recognize simple activities like running, walking, sleeping, and other physical activities. There has also been research on recognizing complex activities like cooking, doing sports, and taking medication, but these generally require the installation of external sensors that may become obtrusive to the user. In this work we used acceleration data from a wristwatch in order to identify long-term activities. We compare the use of Hidden Markov Models and Conditional Random Fields for the segmentation task. We also added prior knowledge about the duration of the activities into the models by coding it as constraints, and sequence patterns were added in the form of feature functions. We also performed subclassing in order to deal with the problem of intra-class fragmentation, which arises when the same label is applied to activities that are conceptually the same but very different from the acceleration point of view.
Article
Localizing video moments based on the movement patterns of objects is an important task in video analytics. Existing video analytics systems offer two types of querying interfaces based on natural language and SQL, respectively. However, both types of interfaces have major limitations. SQL-based systems require high query specification time, whereas natural language-based systems require large training datasets to achieve satisfactory retrieval accuracy. To address these limitations, we present SketchQL, a video database management system (VDBMS) for offline, exploratory video moment retrieval that is both easy to use and generalizes well across multiple video moment datasets. To improve ease-of-use, SketchQL features a visual query interface that enables users to sketch complex visual queries through intuitive drag-and-drop actions. To improve generalizability, SketchQL operates on object-tracking primitives that are reliably extracted across various datasets using pre-trained models. We present a learned similarity search algorithm for retrieving video moments closely matching the user's visual query based on object trajectories. SketchQL trains the model on a diverse dataset generated with a novel simulator, that enhances its accuracy across a wide array of datasets and queries. We evaluate SketchQL on four real-world datasets with nine queries, demonstrating its superior usability and retrieval accuracy over state-of-the-art VDBMSs.
Chapter
With the increasing number of videos available on the internet and in security settings, human activity recognition has become a highly important topic in machine learning. In this chapter, we cover the topic of deep learning for human activity recognition and the tasks of trimmed action recognition, temporal action localization, and spatiotemporal action localization. Throughout our treatise, we discuss different architectural building blocks including 2D Convolutional Neural Networks (CNNs), 3D-CNNs, Recurrent Neural Networks (RNNs), and Spatial-Temporal Graph Convolution Networks (ST-GCNs), and how they were used in state-of-the-art models for human activity recognition. Moreover, we discuss how multiple modalities and data resolutions can be fused via a multistream network topology, and how video-classification models can be extended for usage in (spatio)temporal action localization. Finally, we provide a curated list of data sets for human activity recognition tasks.
Chapter
Despite tremendous progress achieved in temporal action localization, state-of-the-art methods still struggle to train accurate models when annotated data is scarce. In this paper, we introduce a novel active learning framework for temporal localization that aims to mitigate this data dependency issue. We equip our framework with active selection functions that can reuse knowledge from previously annotated datasets. We study the performance of two state-of-the-art active selection functions as well as two widely used active learning baselines. To validate the effectiveness of each one of these selection functions, we conduct simulated experiments on ActivityNet. We find that using previously acquired knowledge as a bootstrapping source is crucial for active learners aiming to localize actions. When equipped with the right selection function, our proposed framework exhibits significantly better performance than standard active learning strategies, such as uncertainty sampling. Finally, we employ our framework to augment the newly compiled Kinetics action dataset with ground-truth temporal annotations. As a result, we collect Kinetics-Localization, a novel large-scale dataset for temporal action localization, which contains more than 15K YouTube videos.
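As a point of reference for the selection functions discussed above, the uncertainty-sampling baseline mentioned in the abstract can be written in a few lines; this is a generic entropy-based version, not the chapter's implementation:

```python
import numpy as np

def uncertainty_sampling(probs: np.ndarray, budget: int) -> np.ndarray:
    """Pick the `budget` unlabeled clips with the highest predictive entropy.

    probs: (n_clips, n_classes) class probabilities from the current model.
    Returns indices of clips to send for temporal annotation.
    """
    eps = 1e-12
    entropy = -(probs * np.log(probs + eps)).sum(axis=1)
    return np.argsort(-entropy)[:budget]

rng = np.random.default_rng(4)
logits = rng.normal(size=(1000, 200))                      # e.g., 200 activity classes
probs = np.exp(logits) / np.exp(logits).sum(1, keepdims=True)
print(uncertainty_sampling(probs, budget=50)[:10])
```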
Article
Querying uncertain data has become a prominent application due to the proliferation of user-generated content from social media and of data streams from sensors. When data ambiguity cannot be reduced algorithmically, crowdsourcing proves a viable approach, which consists of posting tasks to humans and harnessing their judgment for improving the confidence about data values or relationships. This paper tackles the problem of processing top-K queries over uncertain data with the help of crowdsourcing for quickly converging to the real ordering of relevant results. Several offline and online approaches for addressing questions to a crowd are defined and contrasted on both synthetic and real data sets, with the aim of minimizing the crowd interactions necessary to find the real ordering of the result set.
Article
The advent of affordable jobsite cameras is reshaping the way on-site construction activities are monitored. To facilitate the analysis of large collections of videos, research has focused on addressing the problem of manual workface assessment by recognizing worker and equipment activities using computer-vision algorithms. Despite the explosion of these methods, the ability to automatically recognize and understand worker and equipment activities from videos is still rather limited. The current algorithms require large-scale annotated workface assessment video data to learn models that can deal with the high degree of intraclass variability among activity categories. To address current limitations, this study proposes crowdsourcing the task of workface assessment from jobsite video streams. By introducing an intuitive web-based platform for massive marketplaces such as Amazon Mechanical Turk (AMT) and several automated methods, the intelligence of the crowd is engaged for interpreting jobsite videos. The goal is to overcome the limitations of the current practices of workface assessment and also provide significantly large empirical data sets together with their ground truth that can serve as the basis for developing video-based activity recognition methods. Six extensive experiments have shown that engaging nonexperts on AMT to annotate construction activities in jobsite videos can provide complete and detailed workface assessment results with 85% accuracy. It has been demonstrated that crowdsourcing has the potential to minimize time needed for workface assessment, provides ground truth for algorithmic developments, and most importantly allows on-site professionals to focus their time on the more important task of root-cause analysis and performance improvements.
Article
Full-text available
Is it possible to crowdsource categorization? Amongst the challenges: (a) each worker has only a partial view of the data, (b) different workers may have different clustering criteria and may produce different numbers of categories, (c) the underlying category structure may be hierarchical. We propose a Bayesian model of how workers may approach clustering and show how one may infer clusters/categories, as well as worker parameters, using this model. Our experiments, carried out on large collections of images, suggest that Bayesian crowdclustering works well and may be superior to single-expert annotations.
Conference Paper
Full-text available
Activity annotation in videos is necessary to create a training dataset for most activity recognition systems. This is a very time-consuming and repetitive task. Crowdsourcing has gained popularity as a way to distribute annotation tasks to a large pool of taggers. We present for the first time an approach to achieve good quality for activity annotation in videos through crowdsourcing on the Amazon Mechanical Turk (AMT) platform. Taggers must annotate the start and end boundaries and the label of all occurrences of activities in videos. Two strategies to detect non-serious taggers according to their temporal annotation results are presented. Individual filtering checks the consistency of each tagger's answers with the characteristics of the dataset to identify and remove non-serious taggers. Collaborative filtering checks the agreement in annotations among taggers. The filtering techniques detect and remove non-serious taggers and, finally, majority voting is applied to the AMT temporal tags to generate one final AMT activity annotation set. We conduct experiments to obtain activity annotations from AMT on a subset of two rich datasets frequently used in activity recognition. The results show that our proposed filtering strategies can increase the accuracy by up to 40%. The final annotation set is of comparable quality to the annotations of experts, with high accuracy (76% to 92%).
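The final majority-voting step over temporal tags can be illustrated frame by frame. The following is a simplified sketch for a single activity class and a handful of taggers, not the paper's full filtering pipeline:

```python
import numpy as np

def majority_vote_segments(tags, n_frames):
    """Fuse taggers' (start, end) frame intervals by per-frame majority vote.

    tags: list of per-tagger lists of (start_frame, end_frame) tuples.
    Returns a boolean array marking frames labeled by a strict majority of taggers.
    """
    votes = np.zeros(n_frames, dtype=int)
    for tagger in tags:
        for start, end in tagger:
            votes[start:end + 1] += 1
    return votes > len(tags) / 2            # strict majority per frame

# Three taggers annotating one activity occurrence in a 300-frame clip.
tags = [[(40, 120)], [(45, 118)], [(60, 130), (200, 210)]]  # last tagger is noisy
mask = majority_vote_segments(tags, n_frames=300)
print(mask.sum())   # number of frames the majority agrees on
```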
Article
Full-text available
We introduce UCF101, which is currently the largest dataset of human actions. It consists of 101 action classes, over 13k clips, and 27 hours of video data. The database consists of realistic user-uploaded videos containing camera motion and cluttered backgrounds. Additionally, we provide baseline action recognition results on this new dataset using a standard bag-of-words approach with an overall performance of 44.5%. To the best of our knowledge, UCF101 is currently the most challenging dataset of actions due to its large number of classes, large number of clips, and the unconstrained nature of those clips.
Conference Paper
Full-text available
Activity recognition in video is dominated by low- and mid-level features, and while demonstrably capable, these features by nature carry little semantic meaning. Inspired by the recent object bank approach to image representation, we present Action Bank, a new high-level representation of video. Action Bank is comprised of many individual action detectors sampled broadly in semantic space as well as viewpoint space. Our representation is constructed to be semantically rich and, even when paired with simple linear SVM classifiers, is capable of highly discriminative performance. We have tested Action Bank on four major activity recognition benchmarks. In all cases, our performance is better than the state of the art, namely 98.2% on KTH (better by 3.3%), 95.0% on UCF Sports (better by 3.7%), 57.9% on UCF50 (baseline is 47.9%), and 26.9% on HMDB51 (baseline is 23.2%). Furthermore, when we analyze the classifiers, we find strong transfer of semantics from the constituent action detectors to the bank classifier.
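The claim that a semantically rich representation pairs well with a plain linear SVM can be pictured with a short scikit-learn sketch; the feature matrix below is random stand-in data rather than real action-bank responses, so the score stays near chance:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
n_videos, n_bank_features = 200, 1000              # pooled responses of many detectors
X = rng.normal(size=(n_videos, n_bank_features))   # stand-in bank features
y = rng.integers(0, 6, size=n_videos)              # e.g., 6 KTH action classes

clf = LinearSVC(C=1.0, max_iter=5000)              # simple linear classifier
print(cross_val_score(clf, X, y, cv=5).mean())     # chance-level on random data
```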
Article
Full-text available
This paper summarizes the 28 video sequences available for result comparison in the PETS04 workshop. The sequences are from about 500 to 1400 frames in length, for a total of about 26500 frames. The sequences are annotated with both target position and activities by the CAVIAR research team members.
Article
Full-text available
A video sequence may contain any number of persons, objects and activities. ViPER-GT is a tool for annotating a video with detailed spatial and temporal information about its contents. This paper is for people who wish to extend it for their own needs, or curious users who wish to understand some of its design choices. It presents information about ViPER-GT's predecessors (and antecedents), information about its design and implementation and some use cases.
Conference Paper
Full-text available
In this paper, we present a systematic framework for recognizing realistic actions from videos “in the wild”. Such unconstrained videos are abundant in personal collections as well as on the Web. Recognizing action from such videos has not been addressed extensively, primarily due to the tremendous variations that result from camera motion, background clutter, changes in object appearance, and scale, etc. The main challenge is how to extract reliable and informative features from the unconstrained videos. We extract both motion and static features from the videos. Since the raw features of both types are dense yet noisy, we propose strategies to prune these features. We use motion statistics to acquire stable motion features and clean static features. Furthermore, PageRank is used to mine the most informative static features. In order to further construct compact yet discriminative visual vocabularies, a divisive information-theoretic algorithm is employed to group semantically related features. Finally, AdaBoost is chosen to integrate all the heterogeneous yet complementary features for recognition. We have tested the framework on the KTH dataset and our own dataset consisting of 11 categories of actions collected from YouTube and personal videos, and have obtained impressive results for action recognition and action localization.
Conference Paper
Full-text available
This paper exploits the context of natural dynamic scenes for human action recognition in video. Human actions are frequently constrained by the purpose and the physical properties of scenes and demonstrate high correlation with particular scene classes. For example, eating often happens in a kitchen while running is more common outdoors. The contribution of this paper is three-fold: (a) we automatically discover relevant scene classes and their correlation with human actions, (b) we show how to learn selected scene classes from video without manual supervision and (c) we develop a joint framework for action and scene recognition and demonstrate improved recognition of both in natural video. We use movie scripts as a means of automatic supervision for training. For selected action classes we identify correlated scene classes in text and then retrieve video samples of actions and scenes for training using script-to-video alignment. Our visual models for scenes and actions are formulated within the bag-of-features framework and are combined in a joint scene-action SVM-based classifier. We report experimental results and validate the method on a new large dataset with twelve action classes and ten scene classes acquired from 69 movies.
Conference Paper
Full-text available
We propose a new learning method which exploits temporal consistency to successfully learn a complex appearance model from a sparsely labeled training video. Our approach consists of iteratively improving an appearance-based model built with a Boosting procedure, and the reconstruction of trajectories corresponding to the motion of multiple targets. We demonstrate the efficiency of our procedure on pedestrian detection in videos and cell detection in microscopy image sequences. In both cases, our method is demonstrated to reduce the labeling requirement by one to two orders of magnitude. We show that in some instances, our method trained with sparse labels on a video sequence is able to outperform a standard learning procedure trained with the fully labeled sequence.
Conference Paper
Full-text available
User studies are important for many aspects of the design process and involve techniques ranging from informal surveys to rigorous laboratory studies. However, the costs involved in engaging users often require practitioners to trade off between sample size, time requirements, and monetary costs. Micro-task markets, such as Amazon's Mechanical Turk, offer a potential paradigm for engaging a large number of users at low time and monetary cost. Here we investigate the utility of a micro-task market for collecting user measurements, and discuss design considerations for developing remote micro user evaluation tasks. Although micro-task markets have great potential for rapidly collecting user measurements at low costs, we found that special care is needed in formulating tasks in order to harness the capabilities of the approach.
Conference Paper
Full-text available
In this paper we introduce a template-based method for recognizing human actions called Action MACH. Our approach is based on a Maximum Average Correlation Height (MACH) filter. A common limitation of template-based methods is their inability to generate a single template using a collection of examples. MACH is capable of capturing intra-class variability by synthesizing a single Action MACH filter for a given action class. We generalize the traditional MACH filter to video (3D spatiotemporal volume) and vector-valued data. By analyzing the response of the filter in the frequency domain, we avoid the high computational cost commonly incurred in template-based approaches. Vector-valued data is analyzed using the Clifford Fourier transform, a generalization of the Fourier transform intended for both scalar and vector-valued data. Finally, we perform an extensive set of experiments and compare our method with some of the most recent approaches in the field by using publicly available datasets, and two new annotated human action datasets which include actions performed in classic feature films and sports broadcast television.
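A toy sketch of the frequency-domain template idea behind correlation filters: average the spectra of several example volumes into a single template and correlate it with a query clip via FFTs. This is a simplified average-correlation illustration, not the full MACH or Clifford Fourier formulation; all names and the normalization choice are assumptions.

# Sketch: synthesize a single correlation template from several example clips
# in the frequency domain, then filter a query clip. Simplified; the full MACH
# filter also includes noise and similarity terms in the denominator.
import numpy as np

def synthesize_template(examples, eps=1e-8):
    # examples: list of equally sized 3D arrays (t, h, w).
    spectra = np.stack([np.fft.fftn(x) for x in examples])
    mean_spec = spectra.mean(axis=0)
    avg_power = (np.abs(spectra) ** 2).mean(axis=0)
    # Emphasize frequencies that are consistent across the examples.
    return mean_spec / (avg_power + eps)

def correlate(template_spec, query):
    # Cross-correlation in the frequency domain; peak height indicates a match.
    return np.fft.ifftn(np.conj(template_spec) * np.fft.fftn(query)).real

# Usage: peaks of correlate(synthesize_template(train_clips), test_clip)
# suggest candidate spatio-temporal locations of the action.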
Conference Paper
Full-text available
The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called "ImageNet", a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.
Conference Paper
Full-text available
Much recent research in human activity recognition has focused on the problem of recognizing simple repetitive (walking, running, waving) and punctual actions (sitting up, opening a door, hugging). However, many interesting human activities are characterized by a complex temporal composition of simple actions. Automatic recognition of such complex actions can benefit from a good understanding of the temporal structures. We present in this paper a framework for modeling motion by exploiting the temporal structure of the human activities. In our framework, we represent activities as temporal compositions of motion segments. We train a discriminative model that encodes a temporal decomposition of video sequences, and appearance models for each motion segment. In recognition, a query video is matched to the model according to the learned appearances and motion segment decomposition. Classification is made based on the quality of matching between the motion segment classifiers and the temporal segments in the query sequence. To validate our approach, we introduce a new dataset of complex Olympic Sports activities. We show that our algorithm performs better than other state-of-the-art methods.
Conference Paper
Full-text available
In this paper, we make three main contributions in the area of action recognition: (i) We introduce the concept of Joint Self-Similarity Volume (Joint SSV) for modeling dynamical systems, and show that by using a new optimized rank-1 tensor approximation of Joint SSV one can obtain compact low-dimensional descriptors that very accurately preserve the dynamics of the original system, e.g. an action video sequence; (ii) The descriptor vectors derived from the optimized rank-1 approximation make it possible to recognize actions without explicitly aligning the action sequences of varying speed of execution or different frame rates; (iii) The method is generic and can be applied using different low-level features such as silhouettes, histogram of oriented gradients, etc. Hence, it does not necessarily require explicit tracking of features in the space-time volume. Our experimental results on three public datasets demonstrate that our method produces remarkably good results and outperforms all baseline methods.
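A minimal sketch of building a frame-level self-similarity matrix (a 2D analogue of the Joint SSV described above) from per-frame descriptors; the rank-1 tensor approximation and the choice of low-level features are beyond this sketch, and the function and variable names are assumptions.

# Sketch: self-similarity matrix of a video from per-frame feature vectors.
# frame_feats: (num_frames x dim) array, e.g. per-frame HOG or silhouette features.
import numpy as np
from scipy.spatial.distance import cdist

def self_similarity_matrix(frame_feats, metric="euclidean"):
    # Entry (i, j) is the distance between frames i and j; the matrix pattern
    # captures the dynamics of the motion largely independently of appearance.
    return cdist(frame_feats, frame_feats, metric=metric)

def ssm_descriptor(frame_feats, size=32):
    # Resample the SSM to a fixed size so sequences of different lengths and
    # frame rates become comparable (a crude stand-in for the paper's
    # optimized rank-1 approximation of the Joint SSV).
    ssm = self_similarity_matrix(frame_feats)
    idx = np.linspace(0, len(ssm) - 1, size).astype(int)
    return ssm[np.ix_(idx, idx)].ravel()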
Conference Paper
Full-text available
With nearly one billion online videos viewed every day, an emerging new frontier in computer vision research is recognition and search in video. While much effort has been devoted to the collection and annotation of large scalable static image datasets containing thousands of image categories, human action datasets lag far behind. Current action recognition databases contain on the order of ten different action categories collected under fairly controlled conditions. State-of-the-art performance on these datasets is now near ceiling and thus there is a need for the design and creation of new benchmarks. To address this issue we collected the largest action video database to-date with 51 action categories, which in total contain around 7,000 manually annotated clips extracted from a variety of sources ranging from digitized movies to YouTube. We use this database to evaluate the performance of two representative computer vision systems for action recognition and explore the robustness of these methods under various conditions such as camera motion, viewpoint, video quality and occlusion.
Article
Full-text available
Machine Learning competitions such as the Netflix Prize have proven reasonably successful as a method of "crowdsourcing" prediction tasks. But these competitions have a number of weaknesses, particularly in the incentive structure they create for the participants. We propose a new approach, called a Crowdsourced Learning Mechanism, in which participants collaboratively "learn" a hypothesis for a given prediction task. The approach draws heavily from the concept of a prediction market, where traders bet on the likelihood of a future event. In our framework, the mechanism continues to publish the current hypothesis, and participants can modify this hypothesis by wagering on an update. The critical incentive property is that a participant will profit an amount that scales according to how much her update improves performance on a released test set.
Article
Full-text available
People exert large amounts of problem-solving effort playing computer games. Simple image- and text-recognition tasks have been successfully 'crowd-sourced' through games, but it is not clear if more complex scientific problems can be solved with human-directed computing. Protein structure prediction is one such problem: locating the biologically relevant native conformation of a protein is a formidable computational challenge given the very large size of the search space. Here we describe Foldit, a multiplayer online game that engages non-scientists in solving hard prediction problems. Foldit players interact with protein structures using direct manipulation tools and user-friendly versions of algorithms from the Rosetta structure prediction methodology, while they compete and collaborate to optimize the computed energy. We show that top-ranked Foldit players excel at solving challenging structure refinement problems in which substantial backbone rearrangements are necessary to achieve the burial of hydrophobic residues. Players working collaboratively develop a rich assortment of new strategies and algorithms; unlike computational approaches, they explore not only the conformational space but also the space of possible search strategies. The integration of human visual problem-solving and strategy development capabilities with traditional computational algorithms through interactive multiplayer games is a powerful new approach to solving computationally-limited scientific problems.
Article
Full-text available
Human action in video sequences can be seen as silhouettes of a moving torso and protruding limbs undergoing articulated motion. We regard human actions as three-dimensional shapes induced by the silhouettes in the space-time volume. We adopt a recent approach for analyzing 2D shapes and generalize it to deal with volumetric space-time action shapes. Our method utilizes properties of the solution to the Poisson equation to extract space-time features such as local space-time saliency, action dynamics, shape structure and orientation. We show that these features are useful for action recognition, detection and clustering. The method is fast, does not require video alignment and is applicable in (but not limited to) many scenarios where the background is known. Moreover, we demonstrate the robustness of our method to partial occlusions, non-rigid deformations, significant changes in scale and viewpoint, high irregularities in the performance of an action, and low quality video.
Conference Paper
Full-text available
Human action in video sequences can be seen as silhouettes of a moving torso and protruding limbs undergoing articulated motion. We regard human actions as three-dimensional shapes induced by the silhouettes in the space-time volume. We adopt a recent approach by Gorelick et al. (2004) for analyzing 2D shapes and generalize it to deal with volumetric space-time action shapes. Our method utilizes properties of the solution to the Poisson equation to extract space-time features such as local space-time saliency, action dynamics, shape structure and orientation. We show that these features are useful for action recognition, detection and clustering. The method is fast, does not require video alignment and is applicable in (but not limited to) many scenarios where the background is known. Moreover, we demonstrate the robustness of our method to partial occlusions, non-rigid deformations, significant changes in scale and viewpoint, high irregularities in the performance of an action and low quality video.
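A minimal numerical sketch of the Poisson-equation idea behind these two papers: solve the Poisson equation with a constant source term inside the binary space-time silhouette volume, with zero boundary values on the background, using simple Jacobi iterations. The resulting solution is large deep inside the torso and small near thin limbs, and serves as a local space-time saliency map; the feature extraction that follows (moments, orientation, dynamics) is omitted, and the iteration count is an arbitrary assumption.

# Sketch: solve the Poisson equation inside a binary space-time silhouette
# volume by Jacobi iteration. mask: boolean array (t, h, w), True on the actor.
# Assumes the silhouette does not touch the array borders (np.roll wraps around).
import numpy as np

def poisson_saliency(mask, iters=200):
    u = np.zeros(mask.shape, dtype=float)
    for _ in range(iters):
        # Average of the six space-time neighbours plus the constant source term.
        avg = (np.roll(u, 1, 0) + np.roll(u, -1, 0) +
               np.roll(u, 1, 1) + np.roll(u, -1, 1) +
               np.roll(u, 1, 2) + np.roll(u, -1, 2)) / 6.0
        u = np.where(mask, avg + 1.0 / 6.0, 0.0)   # Dirichlet boundary: u = 0 off the silhouette
    return u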
Conference Paper
Full-text available
Local space-time features capture local events in video and can be adapted to the size, the frequency and the velocity of moving patterns. In this paper, we demonstrate how such features can be used for recognizing complex motion patterns. We construct video representations in terms of local space-time features and integrate such representations with SVM classification schemes for recognition. For the purpose of evaluation we introduce a new video database containing 2391 sequences of six human actions performed by 25 people in four different scenarios. The presented results of action recognition justify the proposed method and demonstrate its advantage compared to other related approaches for action recognition.
Article
Full-text available
We present an improved method for clustering in the presence of very limited supervisory information, given as pairwise instance constraints. By allowing instance-level constraints to have space-level inductive implications, we are able to successfully incorporate constraints for a wide range of data set types. Our method greatly improves on the previously studied constrained k-means algorithm, generally requiring less than half as many constraints to achieve a given accuracy on a range of real-world data, while also being more robust when over-constrained. We additionally discuss an active learning algorithm which increases the value of constraints even further.
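A compact sketch of k-means with pairwise must-link / cannot-link constraints in the spirit of constrained k-means; the paper's space-level constraint propagation and active constraint selection are not reproduced here, and all names and parameters are illustrative.

# Sketch: k-means whose assignment step respects must-link / cannot-link constraints
# (COP-KMeans style). X: (n x d) data, must/cannot: lists of index pairs.
import numpy as np

def violates(i, c, labels, must, cannot):
    for a, b in must:
        j = b if a == i else a if b == i else None
        if j is not None and labels[j] != -1 and labels[j] != c:
            return True      # must-link partner already sits in another cluster
    for a, b in cannot:
        j = b if a == i else a if b == i else None
        if j is not None and labels[j] == c:
            return True      # cannot-link partner already sits in this cluster
    return False

def constrained_kmeans(X, k, must, cannot, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = -np.ones(len(X), dtype=int)
        for i in rng.permutation(len(X)):                      # random assignment order
            order = np.argsort(((X[i] - centers) ** 2).sum(axis=1))
            for c in order:                                    # nearest non-violating cluster
                if not violates(i, c, labels, must, cannot):
                    labels[i] = c
                    break
            if labels[i] == -1:
                labels[i] = order[0]                           # fall back if all clusters violate
        centers = np.array([X[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
                            for c in range(k)])
    return labels, centers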
Article
Crowdsourcing systems, in which tasks are electronically distributed to numerous "information piece-workers", have emerged as an effective paradigm for human-powered solving of large scale problems in domains such as image classification, data entry, optical character recognition, recommendation, and proofreading. Because these low-paid workers can be unreliable, nearly all crowdsourcers must devise schemes to increase confidence in their answers, typically by assigning each task multiple times and combining the answers in some way such as majority voting. In this paper, we consider a general model of such crowdsourcing tasks, and pose the problem of minimizing the total price (i.e., number of task assignments) that must be paid to achieve a target overall reliability. We give a new algorithm for deciding which tasks to assign to which workers and for inferring correct answers from the workers' answers. We show that our algorithm significantly outperforms majority voting and, in fact, is asymptotically optimal through comparison to an oracle that knows the reliability of every worker.
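For contrast with the adaptive task-assignment algorithm described above, here is a minimal sketch of the plain redundancy-plus-majority-voting baseline it is compared against (binary tasks only; the paper's inference algorithm is not shown, and the data layout is an assumption).

# Sketch: majority voting over redundant crowd answers for binary tasks.
# answers: dict mapping task_id -> list of worker answers in {-1, +1}.
from collections import Counter

def majority_vote(answers):
    consensus = {}
    for task, votes in answers.items():
        tally = Counter(votes)
        # Ties are broken arbitrarily toward +1 here.
        consensus[task] = +1 if tally[+1] >= tally[-1] else -1
    return consensus

def total_price(answers):
    # Total price in the paper's sense: the number of task assignments paid for.
    return sum(len(votes) for votes in answers.values())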
Article
With the advent of crowdsourcing services it has become quite cheap and reasonably effective to get a data set labeled by multiple annotators in a short amount of time. Various methods have been proposed to estimate the consensus labels by correcting for the bias of annotators with different kinds of expertise. Since we do not have control over the quality of the annotators, very often the annotations can be dominated by spammers, defined as annotators who assign labels randomly without actually looking at the instance. Spammers can make the cost of acquiring labels very expensive and can potentially degrade the quality of the final consensus labels. In this paper we propose an empirical Bayesian algorithm called SpEM that iteratively eliminates the spammers and estimates the consensus labels based only on the good annotators. The algorithm is motivated by defining a spammer score that can be used to rank the annotators. Experiments on simulated and real data show that the proposed approach is better than (or as good as) the earlier approaches in terms of the accuracy and uses a significantly smaller number of annotators.
Conference Paper
We study quality control mechanisms for a crowdsourcing system where workers perform object comparison tasks. We study error masking techniques (e.g., voting) and detection of bad workers. For the latter, we consider using gold-standard questions, as well as disagreement with the plurality answer. We perform experiments on Mechanical Turk that yield insights as to the role of task difficulty in quality control, and the effectiveness of the schemes.
Article
We present an extensive three year study on economically annotating video with crowdsourced marketplaces. Our public framework has annotated thousands of real world videos, including massive data sets unprecedented for their size, complexity, and cost. To accomplish this, we designed a state-of-the-art video annotation user interface and demonstrate that, despite common intuition, many contemporary interfaces are sub-optimal. We present several user studies that evaluate different aspects of our system and demonstrate that minimizing the cognitive load of the user is crucial when designing an annotation platform. We then deploy this interface on Amazon Mechanical Turk and discover expert and talented workers who are capable of annotating difficult videos with dense and closely cropped labels. We argue that video annotation requires specialized skill; most workers are poor annotators, mandating robust quality control protocols. We show that traditional crowdsourced micro-tasks are not suitable for video annotation and instead demonstrate that deploying time-consuming macro-tasks on MTurk is effective. Finally, we show that by extracting pixel-based features from manually labeled key frames, we are able to leverage more sophisticated interpolation strategies to maximize performance given a fixed budget. We validate the power of our framework on difficult, real-world data sets and we demonstrate an inherent trade-off between the mix of human and cloud computing used vs. the accuracy and cost of the labeling. We further introduce a novel, cost-based evaluation criterion that compares vision algorithms by the budget required to achieve an acceptable performance. We hope our findings will spur innovation in the creation of massive labeled video data sets and enable novel data-driven computer vision applications.
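A minimal sketch of the key-frame strategy mentioned above: annotators draw boxes on sparse key frames and the remaining frames are filled in by interpolation. Plain linear interpolation is used here for illustration; the paper leverages richer, feature-based interpolation, and the data format is an assumption.

# Sketch: linear interpolation of bounding boxes between annotated key frames.
# keyframes: dict {frame_index: (x, y, w, h)} with at least two entries.
import numpy as np

def interpolate_boxes(keyframes):
    frames = sorted(keyframes)
    boxes = {}
    for f0, f1 in zip(frames[:-1], frames[1:]):
        b0 = np.array(keyframes[f0], dtype=float)
        b1 = np.array(keyframes[f1], dtype=float)
        for f in range(f0, f1 + 1):
            t = (f - f0) / (f1 - f0)
            boxes[f] = tuple((1 - t) * b0 + t * b1)   # per-coordinate linear blend
    return boxes

# Usage: interpolate_boxes({0: (10, 20, 50, 80), 30: (40, 22, 52, 78)})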
Article
Pedestrian detection is a key problem in computer vision, with several applications including robotics, surveillance and automotive safety. Much of the progress of the past few years has been driven by the availability of challenging public datasets. To continue the rapid rate of innovation, we introduce the Caltech Pedestrian Dataset, which is two orders of magnitude larger than existing datasets. The dataset contains richly annotated video, recorded from a moving vehicle, with challenging images of low resolution and frequently occluded people. We propose improved evaluation metrics, demonstrating that commonly used per-window measures are flawed and can fail to predict performance on full images. We also benchmark several promising detection systems, providing an overview of state-of-the-art performance and a direct, unbiased comparison of existing methods. Finally, by analyzing common failure cases, we help identify future research directions for the field.
Article
Crowdsourcing services, such as Amazon Mechanical Turk, allow for easy distribution of small tasks to a large number of workers. Unfortunately, since manually verifying the quality of the submitted results is hard, malicious workers often take advantage of the verification difficulty and submit answers of low quality. Currently, most requesters rely on redundancy to identify the correct answers. However, redundancy is not a panacea. Massive redundancy is expensive, increasing significantly the cost of crowdsourced solutions. Therefore, we need techniques that will accurately estimate the quality of the workers, allowing for the rejection and blocking of the low-performing workers and spammers. However, existing techniques cannot separate the true (unrecoverable) error rate from the (recoverable) biases that some workers exhibit. This lack of separation leads to incorrect assessments of a worker's quality. We present algorithms that improve the existing state-of-the-art techniques, enabling the separation of bias and error. Our algorithm generates a scalar score representing the inherent quality of each worker. We illustrate how to incorporate cost-sensitive classification errors in the overall framework and how to seamlessly integrate unsupervised and supervised techniques for inferring the quality of the workers.  We present experimental results demonstrating the performance of the proposed algorithm under a variety of settings.
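A compact sketch in the spirit of confusion-matrix worker models (Dawid-Skene style EM) for binary labels: alternating between soft estimates of the true labels and per-worker error rates lets a consistent, recoverable bias be told apart from random error. This is not the paper's exact algorithm; the scalar quality score and cost-sensitive extensions are omitted, and the data layout and initialization are assumptions.

# Sketch: EM over per-worker sensitivity/specificity for binary crowd labels.
# labels: dict {(task_id, worker_id): 0 or 1}.
import numpy as np
from collections import defaultdict

def em_worker_quality(labels, n_iter=50):
    tasks = sorted({t for t, _ in labels})
    workers = sorted({w for _, w in labels})
    by_task = defaultdict(list)
    for (t, w), y in labels.items():
        by_task[t].append((w, y))
    mu = {t: float(np.mean([y for _, y in by_task[t]])) for t in tasks}   # init: vote fraction
    alpha, beta = {}, {}                                                  # per-worker quality
    for _ in range(n_iter):
        # M-step: re-estimate each worker's error rates from the soft labels.
        for w in workers:
            obs = [(mu[t], y) for (t, w2), y in labels.items() if w2 == w]
            pos = sum(m for m, _ in obs)
            neg = sum(1 - m for m, _ in obs)
            alpha[w] = sum(m * y for m, y in obs) / max(pos, 1e-9)          # P(report 1 | true 1)
            beta[w] = sum((1 - m) * (1 - y) for m, y in obs) / max(neg, 1e-9)  # P(report 0 | true 0)
        prior = float(np.mean(list(mu.values())))
        # E-step: posterior probability that each task's true label is 1.
        for t in tasks:
            a, b = prior, 1.0 - prior
            for w, y in by_task[t]:
                a *= alpha[w] if y == 1 else 1.0 - alpha[w]
                b *= (1.0 - beta[w]) if y == 1 else beta[w]
            mu[t] = a / (a + b + 1e-12)
    return mu, alpha, beta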
Article
Crowdsourcing has emerged in the past decade as a popular model for online distributed problem solving and production. The creation of Amazon Mechanical Turk (MTurk) as a "micro-task" marketplace has facilitated this growth by connecting willing workers with available tasks. Within computer vision, MTurk has proven especially useful in large-scale image label collection, as many computer vision algorithms require substantial amounts of training data. In this survey, we discuss different types of worker incentives, various considerations for MTurk task design, methods for annotation quality analysis, and cost-effective ways of obtaining labels in a selective manner. We present several examples of how MTurk is being utilized in the computer vision community. Finally, we discuss the implications that MTurk usage will have on future computer vision research.
Article
This paper addresses the repeated acquisition of labels for data items when the labeling is imperfect. We examine the improvement (or lack thereof) in data quality via repeated labeling, and focus especially on the improvement of training labels for supervised induction. With the outsourcing of small tasks becoming easier, for example via Amazon's Mechanical Turk, it often is possible to obtain less-than-expert labeling at low cost. With low-cost labeling, preparing the unlabeled part of the data can become considerably more expensive than labeling. We present repeated-labeling strategies of increasing complexity, and show several main results. (i) Repeated labeling can improve label quality and model quality, but not always. (ii) When labels are noisy, repeated labeling can be preferable to single labeling even in the traditional setting where labels are not particularly cheap. (iii) As soon as the cost of processing the unlabeled data is not free, even the simple strategy of labeling everything multiple times can give considerable advantage. (iv) Repeatedly labeling a carefully chosen set of points is generally preferable, and we present a set of robust techniques that combine different notions of uncertainty to select data points for which quality should be improved. The bottom line: the results show clearly that when labeling is not perfect, selective acquisition of multiple labels is a strategy that data miners should have in their repertoire. For certain label-quality/cost regimes, the benefit is substantial.
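A small sketch of the selective repeated-labeling idea: rather than relabeling every item uniformly, spend the next labels on the items whose current vote is least certain. A simple vote-margin measure is used here as an illustrative stand-in; the paper combines several notions of uncertainty, and the data layout is an assumption.

# Sketch: choose the items whose current label sets are most uncertain for relabeling.
# label_sets: dict item_id -> list of binary labels collected so far.
from collections import Counter

def vote_margin(votes):
    counts = Counter(votes)
    top_two = [n for _, n in counts.most_common(2)] + [0]
    return (top_two[0] - top_two[1]) / len(votes)   # 1.0 = unanimous, 0.0 = evenly split

def items_to_relabel(label_sets, budget):
    # Smallest margin first: spend the relabeling budget on the most contested items.
    ranked = sorted(label_sets, key=lambda item: vote_margin(label_sets[item]))
    return ranked[:budget]

# Usage: items_to_relabel({"a": [1, 1, 0], "b": [1, 1, 1], "c": [0, 1]}, budget=1) -> ["c"]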
Chapter
Recognizing human activities has become an important topic in the past few years. A variety of techniques for representing and modeling different human activities have been proposed, achieving reasonable performances in many scenarios. On the other hand, different benchmarks have also been collected and published. Different from other chapters focusing on the algorithmic aspects, this chapter gives an overview of different benchmarking datasets, summarizes the performances of the state-of-the-art algorithms, and analyzes these datasets.
Conference Paper
Active learning methods aim to select the most informative unlabeled instances to label first, and can help to focus image or video annotations on the examples that will most improve a recognition system. However, most existing methods only make myopic queries for a single label at a time, retraining at each iteration. We consider the problem where at each iteration the active learner must select a set of examples meeting a given budget of supervision, where the budget is determined by the funds (or time) available to spend on annotation. We formulate the budgeted selection task as a continuous optimization problem where we determine which subset of possible queries should maximize the improvement to the classifier's objective, without overspending the budget. To ensure far-sighted batch requests, we show how to incorporate the predicted change in the model that the candidate examples will induce. We demonstrate the proposed algorithm on three datasets for object recognition, activity recognition, and content-based retrieval, and we show its clear practical advantages over random, myopic, and batch selection baselines.
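A simplified sketch of budgeted batch selection: greedily pick unlabeled examples by an informativeness-per-cost ratio until the annotation budget is exhausted. The paper optimizes the expected change to the classifier objective in a far-sighted way; plain per-example uncertainty is used here only as an illustrative stand-in, and the data format is an assumption.

# Sketch: greedy budgeted selection of examples to send for annotation.
# candidates: list of (example_id, uncertainty_score, annotation_cost).
def select_batch(candidates, budget):
    chosen, spent = [], 0.0
    # Rank by informativeness per unit annotation cost.
    for ex_id, score, cost in sorted(candidates, key=lambda c: c[1] / c[2], reverse=True):
        if spent + cost <= budget:
            chosen.append(ex_id)
            spent += cost
    return chosen

# Usage: select_batch([("v1", 0.9, 3.0), ("v2", 0.5, 1.0), ("v3", 0.4, 1.0)], budget=2.0)
# -> ["v2", "v3"]  (two cheap, reasonably informative clips instead of one expensive one)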
Conference Paper
The aim of this paper is to address recognition of natural human actions in diverse and realistic video settings. This challenging but important subject has mostly been ignored in the past due to several problems, one of which is the lack of realistic and annotated video datasets. Our first contribution is to address this limitation and to investigate the use of movie scripts for automatic annotation of human actions in videos. We evaluate alternative methods for action retrieval from scripts and show benefits of a text-based classifier. Using the retrieved action samples for visual learning, we next turn to the problem of action classification in video. We present a new method for video classification that builds upon and extends several recent ideas including local space-time features, space-time pyramids and multi-channel non-linear SVMs. The method is shown to improve state-of-the-art results on the standard KTH action dataset by achieving 91.8% accuracy. Given the inherent problem of noisy labels in automatic annotation, we particularly investigate and show high tolerance of our method to annotation errors in the training set. We finally apply the method to learning and classifying challenging action classes in movies and show promising results.
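A minimal sketch of the multi-channel kernel combination commonly used with such non-linear SVMs: per-channel chi-square distances between bag-of-features histograms are normalized and summed, then exponentiated into a single kernel. Channel definitions and pyramid construction are omitted, and all names are assumptions; strictly, the per-channel normalizer should be fixed from training distances rather than recomputed per call, as noted in the comment.

# Sketch: multi-channel chi-square kernel between bag-of-features histograms.
# X_by_channel, Y_by_channel: dicts channel_name -> (n x d_c) / (m x d_c) histogram arrays.
import numpy as np

def chi2_distances(A, B, eps=1e-10):
    # Pairwise chi-square distances between the rows of A and the rows of B.
    diff = A[:, None, :] - B[None, :, :]
    summ = A[:, None, :] + B[None, :, :] + eps
    return 0.5 * (diff ** 2 / summ).sum(axis=2)

def multichannel_kernel(X_by_channel, Y_by_channel):
    total = 0.0
    for c in X_by_channel:
        D = chi2_distances(X_by_channel[c], Y_by_channel[c])
        # The channel normalizer is approximated by the mean distance of this pair
        # of sets; strictly it should be fixed from the training distances.
        total = total + D / max(float(D.mean()), 1e-10)
    return np.exp(-total)

# Usage with a precomputed-kernel SVM:
#   from sklearn.svm import SVC
#   K_train = multichannel_kernel(train_ch, train_ch)
#   clf = SVC(kernel="precomputed").fit(K_train, y_train)
#   preds = clf.predict(multichannel_kernel(test_ch, train_ch))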
Conference Paper
Currently, video analysis algorithms suffer from a lack of information regarding the objects present and their interactions, as well as from the absence of comprehensive annotated video databases for benchmarking. We designed an online and openly accessible video annotation system that allows anyone with a browser and internet access to efficiently annotate object category, shape, motion, and activity information in real-world videos. The annotations are also complemented with knowledge from static image databases to infer occlusion and depth information. Using this system, we have built a scalable video database composed of diverse video samples and paired with human-guided annotations. We complement this paper by demonstrating potential uses of this database by studying motion statistics as well as cause-effect motion relationships between objects.
Article
Remember outsourcing? Sending jobs to India and China is so 2003. The new pool of cheap labor: everyday people using their spare cycles to create content, solve problems, even do corporate R&D.
Article
Because meaningful sentences are composed of meaningful words, any system that hopes to process natural languages as people do must have information about words and their meanings. This information is traditionally provided through dictionaries, and machine-readable dictionaries are now widely available. But dictionary entries evolved for the convenience of human readers, not for machines. WordNet provides a more effective combination of traditional lexicographic information and modern computing. WordNet is an online lexical database designed for use under program control. English nouns, verbs, adjectives, and adverbs are organized into sets of synonyms, each representing a lexicalized concept. Semantic relations link the synonym sets [4].
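WordNet is directly scriptable; as a small example, the following queries verb synsets and their hypernyms through NLTK's WordNet interface (it requires the wordnet corpus to be downloaded once via nltk.download), which is one common way activity labels are mapped onto a semantic hierarchy. The choice of query word is illustrative.

# Example: querying WordNet verb synsets with NLTK (run nltk.download("wordnet") once).
from nltk.corpus import wordnet as wn

for synset in wn.synsets("run", pos=wn.VERB)[:3]:
    print(synset.name(), "-", synset.definition())
    print("  hypernyms:", [h.name() for h in synset.hypernyms()])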
Article
CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart) are widespread security measures on the World Wide Web that prevent automated programs from abusing online services. They do so by asking humans to perform a task that computers cannot yet perform, such as deciphering distorted characters. Our research explored whether such human effort can be channeled into a useful purpose: helping to digitize old printed material by asking users to decipher scanned words from books that computerized optical character recognition failed to recognize. We showed that this method can transcribe text with a word accuracy exceeding 99%, matching the guarantee of professional human transcribers. Our apparatus is deployed in more than 40,000 Web sites and has transcribed over 440 million words.
J. Howe. The rise of crowdsourcing. Wired Magazine.
D. Mihalcik and D. Doermann. The design and implementation of ViPER.