Article

Analyzing sedentary behavior in life-logging images


Abstract

We describe a study that aims to understand physical activity and sedentary behavior in free-living settings. We used a wearable camera to record 3 to 5 days of imaging data from 40 participants, resulting in over 360,000 images. These images were fully annotated by experienced staff following a rigorous coding protocol. We designed a deep-learning-based classifier by adapting a model originally trained on ImageNet [1] and then augmenting it with a spatio-temporal pyramid. Our results show that the proposed method outperforms state-of-the-art visual classification methods on our dataset. For most labels, our system achieves average accuracy across individuals of more than 90% for frequent labels and more than 80% for rare labels.
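As a rough illustration of the ImageNet-adaptation step described in the abstract, the sketch below replaces the classification head of a pretrained torchvision model with a multi-label head for lifelogging annotations. It is not the authors' code: the backbone, the label set, and the training step are assumptions, and the spatio-temporal pyramid component is omitted here (a separate temporal-pyramid sketch appears further down this list).

# Minimal sketch (not the authors' code): adapt an ImageNet-pretrained CNN to
# multi-label lifelogging annotations by replacing its classification head.
# Assumes PyTorch/torchvision; the label set below is hypothetical.
import torch
import torch.nn as nn
from torchvision import models

LABELS = ["sitting", "standing", "walking", "watching_tv", "eating"]  # hypothetical

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, len(LABELS))  # new multi-label head

criterion = nn.BCEWithLogitsLoss()          # one sigmoid output per label
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(images, targets):
    """images: (B, 3, 224, 224) tensor; targets: (B, len(LABELS)) 0/1 tensor."""
    optimizer.zero_grad()
    loss = criterion(model(images), targets)
    loss.backward()
    optimizer.step()
    return loss.item()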


... Multimedia Appendix 3 [14-17,25,27-46] presents the main results of primary and secondary data collection (Figure 2). ...
... Studies undertaking primary or secondary data analysis (n=25) were predominantly feasibility or pilot studies (n=13, 52%) [15,18,25,27-32,34-36,43,46], followed by methodological studies (n=6, 24%) [37,39-41,44] and validation studies (n=4, 16%) [14,33,42,46]. There was 1 randomized controlled trial [38] conducted with acquired brain injury patients where camera images formed part of a health intervention. ...
... There was also 1 descriptive study, which described the context of sedentary time in older adults [17]. The majority of studies were conducted in the United States (n=8, 32%) [27,32,35-37,42-44], the United Kingdom (n=6, 24%) [15-17,28,31,45], and Ireland (n=4, 16%) [30,33,39,41]. A total of 3 studies (12%) [29,34,46] were international multicenter studies, while the remaining studies were from New Zealand (n=3, 12%) [14,25,40] and Spain (n=1, 4%) [38]. ...
... Studies classified as validation [14,33,42,46] compared the validity of wearable cameras either against, or in combination with, other measurement approaches, including self-reported diaries and questionnaires, as well as objective measurement techniques such as accelerometers and doubly labelled water. For the methodological papers [35,37,39-41,44], outcomes focused on testing software or analytic approaches to classify and analyse generated images. ...
Article
Full-text available
Background: Self-management is a critical component of chronic disease management and can include a host of activities, such as adhering to prescribed medications, undertaking daily care activities, managing dietary intake and body weight, and proactively contacting medical practitioners. The rise of technologies (mobile phones, wearable cameras) for health care use offers potential support for people to better manage their disease in collaboration with their treating health professionals. Wearable cameras can be used to provide rich contextual data and insight into everyday activities and aid in recall. This information can then be used to prompt memory recall or guide the development of interventions to support self-management. Application of wearable cameras to better understand and augment self-management by people with chronic disease has yet to be investigated. Objective: The objective of our review was to ascertain the scope of the literature on the use of wearable cameras for self-management by people with chronic disease and to determine the potential of wearable cameras to assist people to better manage their disease. Methods: We conducted a scoping review, which involved a comprehensive electronic literature search of 9 databases in July 2017. The search strategy focused on studies that used wearable cameras to capture one or more modifiable lifestyle risk factors associated with chronic disease or to capture typical self-management behaviors, or studies that involved a chronic disease population. We then categorized and described included studies according to their characteristics (eg, behaviors measured, study design or type, characteristics of the sample). Results: We identified 31 studies: 25 studies involved primary or secondary data analysis, and 6 were review, discussion, or descriptive articles. Wearable cameras were predominantly used to capture dietary intake, physical activity, activities of daily living, and sedentary behavior. Populations studied were predominantly healthy volunteers, school students, and sports people, with only 1 study examining an intervention using wearable cameras for people with an acquired brain injury. Most studies highlighted technical or ethical issues associated with using wearable cameras, many of which were overcome. Conclusions: This scoping review highlighted the potential of wearable cameras to capture health-related behaviors and risk factors of chronic disease, such as diet, exercise, and sedentary behaviors. Data collected from wearable cameras can be used as an adjunct to traditional data collection methods such as self-reported diaries in addition to providing valuable contextual information. While most studies to date have focused on healthy populations, wearable cameras offer promise to better understand self-management of chronic disease and its context.
... Table 4 indicates that we analyzed how often each type of data has been captured. For example, 56 out of 86 publications captured at least one kind of individual data (65%). Also, Table 4 shows the absolute and relative frequency of inferences for each construct domain in relation to the overall number of publications (e.g., 15%, 13 out of 86 publications, made an inference relevant to the task context). ...
Book
In today’s society, stress is one of the most prevalent phenomena affecting people both in private life and working environments. Despite the fact that academic research has revealed significant insights into the nature of stress in organizations, as well as its antecedents, consequences, and moderating factors, stress researchers still face a number of challenges today. One major challenge is related to construct measurement, particularly if stress is to be understood from a longitudinal perspective, which implies longitudinal measurement of stress and related phenomena. A novel research opportunity has emerged in the form of “lifelogging” in recent years. This concept is based on the idea that unobtrusive computer technology can be used to continuously collect data on an individual’s current state (psychological, physiological, or behavioral) and context (ranging from temperature to social interaction information). Based on a review of the lifelogging literature (N = 155 articles), this article discusses the potential of lifelogging for construct measurement in organizational stress research. The primary contribution of this article is to showcase how modern computer technology can be used to study the temporal nature of stress and related phenomena (e.g., coping with stress) in organizations.
... Table 4 indicates that we analyzed how often each type of data has been captured. For example, 56 out of 86 publications captured at least one kind of individual data (65%). Also, Table 4 shows the absolute and relative frequency of inferences for each construct domain in relation to the overall number of publications (e.g., 15%, 13 out of 86 publications, made an inference relevant to the task context). ...
Chapter
Appendix to the book "Lifelogging for Organizational Stress Measurement: Theory and Applications" including a full list of the reviewed articles.
... These new devices come in various types and styles, from the GoPro, which is marketed for recording high-quality video of sports and other adventures, to Google Glass, which is a heads-up display interface for smartphones but includes a camera, to Narrative Clip and Autographer, which capture "lifelogs" by automatically taking photos throughout one's day (e.g., every 30 seconds). These devices, and others like them, are being used for a variety of applications, from documenting police officers' interactions with the public [39], to studying people's activities at a fine grain resolution for psychological studies [12,26], to keeping visual diaries of people's lives for promoting health [40] or just for personal use [11,21]. No matter the purpose, however, all of these devices can record huge amounts of imagery, which makes it difficult for users to organize and browse their image data. ...
Article
Full-text available
Lifelogging cameras capture everyday life from a first-person perspective, but generate so much data that it is hard for users to browse and organize their image collections effectively. In this paper, we propose to use automatic image captioning algorithms to generate textual representations of these collections. We develop and explore novel techniques based on deep learning to generate captions for both individual images and image streams, using temporal consistency constraints to create summaries that are both more compact and less noisy. We evaluate our techniques with quantitative and qualitative results, and apply captioning to an image retrieval application for finding potentially private images. Our results suggest that our automatic captioning algorithms, while imperfect, may work well enough to help users manage lifelogging photo collections.
Article
Full-text available
Persuasive technology (PT) is increasingly being used in the health and wellness domain to motivate and assist users with different lifestyles and behavioral health issues to change their attitudes and/or behaviors. There is growing evidence that PT can be effective at promoting behaviors in many health and wellness domains, including promoting physical activity (PA), healthy eating, and reducing sedentary behavior (SB). SB has been shown to pose a risk to overall health. Thus, reducing SB and increasing PA have been the focus of much PT work. This paper aims to provide a systematic review of PTs for promoting PA and reducing SB. Specifically, we answer some fundamental questions regarding their design and effectiveness, based on an empirical review of the literature on PTs for promoting PA and discouraging SB from 2003 to 2019 (170 papers). There are three main objectives: (1) to evaluate the effectiveness of PT in promoting PA and reducing SB; (2) to summarize and highlight trends in the outcomes, such as system design, research methods, persuasive strategies employed and their implementations, behavioral theories, and employed technological platforms; (3) to reveal the pitfalls and gaps in the present literature that can be leveraged and used to inform future research on designing PT for PA and SB.
Article
Automatic image captioning has been studied extensively over the last few years, driven by breakthroughs in deep learning-based image-to-text translation models. However, most of this work has considered captioning web images from standard data sets like MS-COCO, and has considered single images in isolation. To what extent can automatic captioning models learn finer-grained contextual information specific to a given person's day-to-day visual experiences? In this paper, we consider captioning image sequences collected from wearable, life-logging cameras. Automatically-generated captions could help people find and recall photos among their large-scale life-logging photo collections, or even to produce textual “diaries” that summarize their day. But unlike web images, photos from wearable cameras are often blurry and poorly composed, without an obvious single subject. Their content also tends to be highly dependent on the context and characteristics of the particular camera wearer. To address these challenges, we introduce a technique to jointly caption sequences of photos, which allows captions to take advantage of temporal constraints and evidence across time, and we introduce a technique to increase the diversity of generated captions, so that they can describe a photo from multiple perspectives (e.g., first-person versus third-person). To test these techniques, we collect a dataset of about 8000 realistic lifelogging images, a subset of which are annotated with nearly 5000 human-generated reference sentences. We evaluate the quality of image captions both quantitatively and qualitatively using Amazon Mechanical Turk, finding that while these algorithms are not perfect, they could be an important step towards helping to organize and summarize lifelogging photos.
Conference Paper
Full-text available
MyLifeBits is a project to fulfill the Memex vision first posited by Vannevar Bush in 1945. It is a system for storing all of one's digital media, including documents, images, sounds, and videos. It is built on four principles: (1) collections and search must replace hierarchy for organization (2) many visualizations should be supported (3) annotations are critical to non-text media and must be made easy, and (4) authoring should be via transclusion.
Conference Paper
Full-text available
The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called "ImageNet", a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.
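Since ImageNet is organized around WordNet's noun synsets, a quick way to get a feel for that backbone hierarchy is to walk it with NLTK. This is only an illustration of the WordNet structure the abstract refers to, not code from the ImageNet project; the synset chosen below is arbitrary.

# Sketch of the WordNet noun hierarchy that ImageNet's synsets are built on,
# using NLTK (not part of the ImageNet project itself). Requires:
#   pip install nltk; python -c "import nltk; nltk.download('wordnet')"
from nltk.corpus import wordnet as wn

def count_subtree(synset):
    """Count how many synsets fall under `synset` (the closure of its hyponyms)."""
    return sum(1 for _ in synset.closure(lambda s: s.hyponyms()))

dog = wn.synset("dog.n.01")
print(dog.definition())
print("hyponym subtree size:", count_subtree(dog))
for child in dog.hyponyms()[:5]:
    print("  ", child.name())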
Conference Paper
Full-text available
The Fisher kernel (FK) is a generic framework which combines the benefits of generative and discriminative approaches. In the context of image classification the FK was shown to extend the popular bag-of-visual-words (BOV) by going beyond count statistics. However, in practice, this enriched representation has not yet shown its superiority over the BOV. In the first part we show that with several well-motivated modifications over the original framework we can boost the accuracy of the FK. On PASCAL VOC 2007 we increase the Average Precision (AP) from 47.9% to 58.3%. Similarly, we demonstrate state-of-the-art accuracy on CalTech 256. A major advantage is that these results are obtained using only SIFT descriptors and costless linear classifiers. Equipped with this representation, we can now explore image classification on a larger scale. In the second part, as an application, we compare two abundant resources of labeled images to learn classifiers: ImageNet and Flickr groups. In an evaluation involving hundreds of thousands of training images we show that classifiers learned on Flickr groups perform surprisingly well (although they were not intended for this purpose) and that they can complement classifiers learned on more carefully annotated datasets.
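For readers unfamiliar with the representation, the following is a rough sketch of a Fisher Vector image signature (gradients with respect to the GMM means only, with the power and L2 normalisation of the improved FK). It assumes local descriptors such as SIFT have already been extracted and is not the authors' implementation.

# Rough Fisher Vector sketch: fit a diagonal-covariance GMM on pooled local
# descriptors, then encode one image as the normalised gradient w.r.t. the means.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm(descriptors, n_components=64):
    """descriptors: (N, D) array pooled from many training images."""
    return GaussianMixture(n_components, covariance_type="diag").fit(descriptors)

def fisher_vector(descriptors, gmm):
    """descriptors: (n, D) array from one image -> (K*D,) Fisher Vector."""
    n, _ = descriptors.shape
    q = gmm.predict_proba(descriptors)                      # (n, K) soft assignments
    diff = (descriptors[:, None, :] - gmm.means_[None]) / np.sqrt(gmm.covariances_)[None]
    fv = (q[:, :, None] * diff).sum(axis=0)                 # (K, D) gradient w.r.t. means
    fv /= n * np.sqrt(gmm.weights_)[:, None]
    fv = fv.ravel()
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                  # power normalisation
    return fv / (np.linalg.norm(fv) + 1e-12)                # L2 normalisation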
Article
Full-text available
Lifelogging is the process of automatically recording aspects of one's life in digital form. This includes visual lifelogging using wearable cameras such as the SenseCam, and in recent years many interesting applications for this have emerged and are being actively researched. One of the most interesting of these, and possibly the most far-reaching, is using visual lifelogs as a memory prosthesis, but there are also applications in job-specific activity recording, general lifestyle analysis and market analysis. In this work we describe a technique which allowed us to develop automatic classifiers for visual lifelogs to infer different lifestyle traits or characteristics. Their accuracy was validated on a set of 95k manually annotated images and through one-on-one interviews with those who gathered the images. These automatic classifiers were then applied to a collection of over 3 million lifelog images collected by 33 individuals sporadically over a period of 3.5 years. From this collection we present a number of anecdotal observations to demonstrate the future potential of lifelogging to capture human behaviour. These anecdotes range from the eating habits of office workers, to the amount of time researchers spend outdoors through the year, to the observation that retired people in our study appear to spend quite a bit of time indoors eating with friends. We believe this work demonstrates the potential of lifelogging techniques to assist behavioural scientists in future.
Article
Full-text available
LIBLINEAR is an open source library for large-scale linear classification. It supports logistic regression and linear support vector machines. We provide easy-to-use command-line tools and library calls for users and developers. Comprehensive documents are available for both beginners and advanced users. Experiments demonstrate that LIBLINEAR is very efficient on large sparse data sets.
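One convenient way to try the LIBLINEAR solver from Python is through scikit-learn's LinearSVC, which wraps it; the data below is synthetic and only stands in for a large sparse problem.

# Usage sketch: LinearSVC uses the LIBLINEAR backend for linear SVM training.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=20000, n_features=500, n_informative=50)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LinearSVC(C=1.0, dual=True)   # L2-regularised hinge loss, LIBLINEAR solver
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))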
Conference Paper
Full-text available
We address image classification on a large-scale, i.e. when a large number of images and classes are involved. First, we study classification accuracy as a function of the image signature dimensionality and the training set size. We show experimentally that the larger the training set, the higher the impact of the dimensionality on the accuracy. In other words, high-dimensional signatures are important to obtain state-of-the-art results on large datasets. Second, we tackle the problem of data compression on very large signatures (on the order of 10^5 dimensions) using two lossy compression strategies: a dimensionality reduction technique known as the hash kernel and an encoding technique based on product quantizers. We explain how the gain in storage can be traded against a loss in accuracy and/or an increase in CPU cost. We report results on two large databases, ImageNet and a dataset of 1M Flickr images, showing that we can reduce the storage of our signatures by a factor of 64 to 128 with little loss in accuracy. Integrating the decompression in the classifier learning yields an efficient and scalable training algorithm. On ILSVRC2010 we report a 74.3% accuracy at top-5, which corresponds to a 2.5% absolute improvement with respect to the state-of-the-art. On a subset of 10K classes of ImageNet we report a top-1 accuracy of 16.7%, a relative improvement of 160% with respect to the state-of-the-art.
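To make the product-quantizer idea concrete, here is a toy sketch: split each high-dimensional signature into sub-vectors and k-means-quantise each sub-space independently, storing only codebook indices. It is a generic illustration of product quantisation, not the paper's implementation.

# Toy product quantisation: compress (N, D) signatures into (N, m) small codes.
import numpy as np
from sklearn.cluster import KMeans

def pq_train(X, m=8, k=256):
    """X: (N, D) with D divisible by m and N >> k -> list of m KMeans codebooks."""
    subs = np.split(X, m, axis=1)
    return [KMeans(n_clusters=k, n_init=4, random_state=0).fit(s) for s in subs]

def pq_encode(X, codebooks):
    """Return (N, m) codes; storage drops from D floats to m small integers."""
    subs = np.split(X, len(codebooks), axis=1)
    return np.stack([cb.predict(s) for cb, s in zip(codebooks, subs)], axis=1)

def pq_decode(codes, codebooks):
    """Approximate reconstruction of the signatures from their codes."""
    return np.hstack([cb.cluster_centers_[codes[:, i]]
                      for i, cb in enumerate(codebooks)])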
Article
Full-text available
SenseCam is a wearable digital camera that captures an electronic record of the wearer's day. It does this by automatically recording a series of still images through its wide-angle lens, and simultaneously capturing a log of data from a number of built-in electronic sensors. Subsequently reviewing a sequence of images appears to provide a powerful autobiographical memory cue. A preliminary evaluation of SenseCam with a patient diagnosed with severe memory impairment was extremely positive; periodic review of images of events recorded by SenseCam resulted in significant recall of those events. Following this, a great deal of work has been undertaken to explore this phenomenon and there are early indications that SenseCam technology may be beneficial to a variety of patients with physical and mental health problems, and is valuable as a tool for investigating normal memory through behavioural and neuroimaging means. Elsewhere, it is becoming clear that judicious use of SenseCam could significantly impact the study of human behaviour. Meanwhile, research and development of the technology itself continues with the aim of providing robust hardware and software tools to meet the needs of clinicians, patients, carers, and researchers. In this paper we describe the history of SenseCam, and the design and operation of the SenseCam device and the associated viewing software, and we discuss some of the ongoing research questions being addressed with the help of SenseCam.
Article
Full-text available
OWEN, N., G.N. HEALY, C.E. MATTHEWS, and D.W. DUNSTAN. Too much sitting: the population health science of sedentary behavior. Exerc. Sport Sci. Rev., Vol. 38, No. 3, pp. 105-113, 2010. Even when adults meet physical activity guidelines, sitting for prolonged periods can compromise metabolic health. Television (TV) time and objective-measurement studies show associations, and breaking up sedentary time is beneficial. Sitting time, TV time, and time sitting in automobiles increase premature mortality risk. Further evidence from prospective studies, intervention trials, and population-based behavioral studies is required.
Article
Full-text available
In this paper, we describe an approach designed to exploit context information in order to aid the detection of landmark images from a large collection of photographs. The photographs were generated using Microsoft's SenseCam, a device designed to passively record a visual diary and cover a typical day of the user wearing the camera. The proliferation of digital photos along with the associated problems of managing and organising these collections provide the background motivation for this work. We believe more ubiquitous cameras, such as SenseCam, will become the norm in the future and the management of the volume of data generated by such devices is a key issue. The goal of the work reported here is to use context information to assist in the detection of landmark images or sequences of images from the thousands of photos taken daily by SenseCam. We will achieve this by analysing the images using low-level MPEG-7 features along with metadata provided by SenseCam, followed by simple clustering to identify the landmark images.
Article
Full-text available
The Microsoft SenseCam is a small multi-sensor camera worn around the user's neck. It was designed primarily for lifelog recording. At present, the SenseCam passively records up to 3,000 images per day as well as logging data from several on-board sensors. The sheer volume of image and sensor data captured by the SenseCam creates a number of challenges in the areas of segmenting whole day recordings into events, and searching for events. In this paper, we use content and contextual information to help aid in automatic event segmentation of a user's SenseCam images. We also propose and evaluate a number of novel techniques using Bluetooth and GPS context data to accurately locate and retrieve similar events within a user's lifelog photoset.
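The paper above segments whole-day SenseCam recordings into events using image content together with sensor context. The toy sketch below shows only the simplest form of that idea, thresholding the dissimilarity between consecutive per-image feature vectors; the feature extraction and the Bluetooth/GPS fusion are left out and would be needed in practice.

# Illustrative event segmentation of a time-ordered lifelog image stream.
import numpy as np

def segment_events(features, threshold=0.35):
    """features: (N, D) L2-normalised vectors, one per image, in time order.
    Returns a list of (start, end) index pairs, one per detected event."""
    dist = 1.0 - np.sum(features[1:] * features[:-1], axis=1)   # cosine distance
    boundaries = np.flatnonzero(dist > threshold) + 1           # event boundaries
    edges = np.concatenate(([0], boundaries, [len(features)]))
    return [(int(s), int(e)) for s, e in zip(edges[:-1], edges[1:]) if e > s]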
Article
Full-text available
MyLifeBits is a project to fulfill the Memex vision first posited by Vannevar Bush in 1945. It is a system for storing all of one's digital media, including documents, images, sounds, and videos.
Conference Paper
Can a large convolutional neural network trained for whole-image classification on ImageNet be coaxed into detecting objects in PASCAL? We show that the answer is yes, and that the resulting system is simple, scalable, and boosts mean average precision, relative to the venerable deformable part model, by more than 40% (achieving a final mAP of 48% on VOC 2007). Our framework combines powerful computer vision techniques for generating bottom-up region proposals with recent advances in learning high-capacity convolutional neural networks. We call the resulting system R-CNN: Regions with CNN features. The same framework is also competitive with state-of-the-art semantic segmentation methods, demonstrating its flexibility. Beyond these results, we execute a battery of experiments that provide insight into what the network learns to represent, revealing a rich hierarchy of discriminative and often semantically meaningful features.
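R-CNN pairs bottom-up region proposals with CNN features and per-class linear classifiers. The sketch below only illustrates that data flow: it substitutes a coarse grid of boxes for selective search, a torchvision ResNet for the original AlexNet backbone, and an untrained linear layer for the per-class SVMs, so it is not the published system.

# Illustration of the R-CNN data flow: candidate boxes -> warped crops ->
# CNN features -> linear scoring. Image is assumed to be at least 224x224.
import torch
import torch.nn as nn
from torchvision import models
from torchvision.transforms.functional import resized_crop

backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
backbone.fc = nn.Identity()                 # keep 512-d pooled features
backbone.eval()
scorer = nn.Linear(512, 21)                 # placeholder: 20 classes + background

def grid_proposals(h, w, step=112, size=224):
    """Stand-in for selective search: a coarse grid of fixed-size boxes."""
    return [(y, x, size, size) for y in range(0, h - size + 1, step)
                               for x in range(0, w - size + 1, step)]

@torch.no_grad()
def score_image(image):                     # image: (3, H, W) float tensor in [0, 1]
    boxes = grid_proposals(image.shape[1], image.shape[2])
    crops = torch.stack([resized_crop(image, y, x, hh, ww, [224, 224])
                         for y, x, hh, ww in boxes])
    return boxes, scorer(backbone(crops))   # (num_boxes, 21) class scores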
Conference Paper
We evaluate whether features extracted from the activation of a deep convolutional network trained in a fully supervised fashion on a large, fixed set of object recognition tasks can be re-purposed to novel generic tasks. Our generic tasks may differ significantly from the originally trained tasks and there may be insufficient labeled or unlabeled data to conventionally train or adapt a deep architecture to the new tasks. We investigate and visualize the semantic clustering of deep convolutional features with respect to a variety of such tasks, including scene recognition, domain adaptation, and fine-grained recognition challenges. We compare the efficacy of relying on various network levels to define a fixed feature, and report novel results that significantly outperform the state-of-the-art on several important vision challenges. We are releasing DeCAF, an open-source implementation of these deep convolutional activation features, along with all associated network parameters to enable vision researchers to be able to conduct experimentation with deep representations across a range of visual concept learning paradigms.
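The core DeCAF recipe is to treat a mid or late layer of an ImageNet-pretrained network as a fixed feature extractor and fit a simple classifier on top for the new task. The sketch below follows that recipe with torchvision's AlexNet rather than the original DeCAF code, and the downstream dataset is a placeholder.

# Sketch of CNN-feature transfer: frozen pretrained AlexNet up to fc6 as features.
import torch
from torchvision import models
from sklearn.linear_model import LogisticRegression

cnn = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
cnn.classifier = cnn.classifier[:3]   # keep up to fc6 + ReLU (~4096-d, DeCAF6-style)
cnn.eval()

@torch.no_grad()
def extract(batch):                   # batch: (B, 3, 224, 224), ImageNet-normalised
    return cnn(batch).numpy()

# For a new task, stack feature rows and fit any simple classifier, e.g.:
# clf = LogisticRegression(max_iter=1000).fit(extract(train_images), train_labels)
# print(clf.score(extract(test_images), test_labels))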
Conference Paper
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
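For a quick look at the architecture described above, torchvision ships an AlexNet implementation that is a close single-GPU variant of the original model (not the exact two-GPU network from the paper).

# Inspect the AlexNet-style architecture and its parameter count.
from torchvision import models

net = models.alexnet(weights=None)
n_params = sum(p.numel() for p in net.parameters())
print(net)                                   # 5 conv layers + 3 fully connected layers
print(f"parameters: {n_params/1e6:.1f}M")    # ~61M, close to the 60M quoted above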
Article
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
Conference Paper
We present a novel dataset and novel algorithms for the problem of detecting activities of daily living (ADL) in firstperson camera views. We have collected a dataset of 1 million frames of dozens of people performing unscripted, everyday activities. The dataset is annotated with activities, object tracks, hand positions, and interaction events. ADLs differ from typical actions in that they can involve long-scale temporal structure (making tea can take a few minutes) and complex object interactions (a fridge looks different when its door is open). We develop novel representations including (1) temporal pyramids, which generalize the well-known spatial pyramid to approximate temporal correspondence when scoring a model and (2) composite object models that exploit the fact that objects look different when being interacted with. We perform an extensive empirical evaluation and demonstrate that our novel representations produce a two-fold improvement over traditional approaches. Our analysis suggests that real-world ADL recognition is “all about the objects,” and in particular, “all about the objects being interacted with.”
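The temporal pyramid mentioned above pools frame-level features over the whole clip, then over halves, quarters, and so on, and concatenates the results, much as a spatial pyramid does for image regions. The sketch below is a generic version of that pooling step, assuming per-frame feature vectors are already available; it is not the authors' full ADL model.

# Temporal pyramid pooling over per-frame feature vectors.
import numpy as np

def temporal_pyramid(frame_features, levels=3):
    """frame_features: (T, D) array -> (D * (2**levels - 1),) pooled descriptor."""
    pooled = []
    for level in range(levels):
        for segment in np.array_split(frame_features, 2 ** level, axis=0):
            pooled.append(segment.mean(axis=0) if len(segment) else
                          np.zeros(frame_features.shape[1]))
    return np.concatenate(pooled)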
Conference Paper
Large Convolutional Neural Network models have recently demonstrated impressive classification performance on the ImageNet benchmark \cite{Kriz12}. However there is no clear understanding of why they perform so well, or how they might be improved. In this paper we address both issues. We introduce a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier. We also perform an ablation study to discover the performance contribution from different model layers. This enables us to find model architectures that outperform Krizhevsky \etal on the ImageNet classification benchmark. We show our ImageNet model generalizes well to other datasets: when the softmax classifier is retrained, it convincingly beats the current state-of-the-art results on Caltech-101 and Caltech-256 datasets.
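The paper uses a deconvnet to project activations back to pixel space. As a much simpler stand-in for that kind of inspection, forward hooks can at least expose the intermediate feature maps of each convolutional layer; the network and input below are placeholders.

# Capture intermediate conv feature maps with forward hooks (not a deconvnet).
import torch
import torch.nn as nn
from torchvision import models

net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()
activations = {}

def hook(name):
    def _store(module, inputs, output):
        activations[name] = output.detach()
    return _store

for i, layer in enumerate(net.features):
    if isinstance(layer, nn.Conv2d):
        layer.register_forward_hook(hook(f"conv{i}"))

with torch.no_grad():
    net(torch.randn(1, 3, 224, 224))          # dummy image stands in for real input
for name, fmap in activations.items():
    print(name, tuple(fmap.shape))            # e.g. conv0 (1, 64, 55, 55)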
Article
Can a large convolutional neural network trained for whole-image classification on ImageNet be coaxed into detecting objects in PASCAL? We show that the answer is yes, and that the resulting system is simple, scalable, and boosts mean average precision, relative to the venerable deformable part model, by more than 40% (achieving a final mAP of 48% on VOC 2007). Our framework combines powerful computer vision techniques for generating bottom-up region proposals with recent advances in learning high-capacity convolutional neural networks. We call the resulting system R-CNN: Regions with CNN features. The same framework is also competitive with state-of-the-art semantic segmentation methods, demonstrating its flexibility. Beyond these results, we execute a battery of experiments that provide insight into what the network learns to represent, revealing a rich hierarchy of discriminative and often semantically meaningful features.
Article
We evaluate whether features extracted from the activation of a deep convolutional network trained in a fully supervised fashion on a large, fixed set of object recognition tasks can be re-purposed to novel generic tasks. Our generic tasks may differ significantly from the originally trained tasks and there may be insufficient labeled or unlabeled data to conventionally train or adapt a deep architecture to the new tasks. We investigate and visualize the semantic clustering of deep convolutional features with respect to a variety of such tasks, including scene recognition, domain adaptation, and fine-grained recognition challenges. We compare the efficacy of relying on various network levels to define a fixed feature, and report novel results that significantly outperform the state-of-the-art on several important vision challenges. We are releasing DeCAF, an open-source implementation of these deep convolutional activation features, along with all associated network parameters to enable vision researchers to be able to conduct experimentation with deep representations across a range of visual concept learning paradigms.
Conference Paper
In this work, algorithms are developed and evaluated to detect physical activities from data acquired using five small biaxial accelerometers worn simultaneously on different parts of the body. Acceleration data was collected from 20 subjects without researcher supervision or observation. Subjects were asked to perform a sequence of everyday tasks but not told specifically where or how to do them. Mean, energy, frequency-domain entropy, and correlation of acceleration data was calculated and several classifiers using these features were tested. Decision tree classifiers showed the best performance, recognizing everyday activities with an overall accuracy rate of 84%. The results show that although some activities are recognized well with subject-independent training data, others appear to require subject-specific training data. The results suggest that multiple accelerometers aid in recognition because conjunctions in acceleration feature values can effectively discriminate many activities. With just two biaxial accelerometers - thigh and wrist - the recognition performance dropped only slightly. This is the first work to investigate performance of recognition algorithms with multiple, wire-free accelerometers on 20 activities using datasets annotated by the subjects themselves.
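The feature set named above (per-axis mean, energy, frequency-domain entropy, and inter-axis correlation over a window) is easy to reproduce in outline; the sketch below computes those features for one window and hands them to a decision tree, with the window data, labels, and tree depth as assumptions rather than the paper's exact settings.

# Window-level accelerometer features + decision tree classifier.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def window_features(window):
    """window: (n_samples, n_axes) raw acceleration -> 1-D feature vector."""
    feats = [window.mean(axis=0)]
    fft_mag = np.abs(np.fft.rfft(window, axis=0))[1:]          # drop DC component
    feats.append((fft_mag ** 2).sum(axis=0) / len(window))     # energy
    p = fft_mag / (fft_mag.sum(axis=0, keepdims=True) + 1e-12)
    feats.append(-(p * np.log2(p + 1e-12)).sum(axis=0))        # frequency-domain entropy
    corr = np.corrcoef(window.T)                               # inter-axis correlation
    feats.append(corr[np.triu_indices_from(corr, k=1)])
    return np.concatenate(feats)

# X = np.array([window_features(w) for w in windows]); y = activity labels
# clf = DecisionTreeClassifier(max_depth=10).fit(X, y)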
Article
In this paper, we propose a computational model of the recognition of real world scenes that bypasses the segmentation and the processing of individual objects or regions. The procedure is based on a very low dimensional representation of the scene, that we term the Spatial Envelope. We propose a set of perceptual dimensions (naturalness, openness, roughness, expansion, ruggedness) that represent the dominant spatial structure of a scene. Then, we show that these dimensions may be reliably estimated using spectral and coarsely localized information. The model generates a multidimensional space in which scenes sharing membership in semantic categories (e.g., streets, highways, coasts) are projected closed together. The performance of the spatial envelope model shows that specific information about object shape or identity is not a requirement for scene categorization and that modeling a holistic representation of the scene informs about its probable semantic category.
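A very rough way to approximate the holistic, segmentation-free descriptor described above is to pool the responses of a small Gabor filter bank over a coarse grid. The sketch below is only GIST-like, not the authors' Spatial Envelope implementation, and the orientations, frequencies, and grid size are arbitrary choices.

# GIST-like holistic scene descriptor from pooled Gabor responses (skimage).
import numpy as np
from skimage.filters import gabor
from skimage.transform import resize

def gist_like(gray_image, orientations=4, frequencies=(0.1, 0.25), grid=4):
    """gray_image: 2-D float array -> (len(frequencies)*orientations*grid*grid,) vector."""
    img = resize(gray_image, (128, 128), anti_aliasing=True)
    feats = []
    for f in frequencies:
        for k in range(orientations):
            real, imag = gabor(img, frequency=f, theta=k * np.pi / orientations)
            mag = np.hypot(real, imag)                    # Gabor response magnitude
            cells = [c for row in np.array_split(mag, grid, axis=0)
                       for c in np.array_split(row, grid, axis=1)]
            feats.extend(cell.mean() for cell in cells)   # average per grid cell
    return np.asarray(feats)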
Conference Paper
An object recognition system has been developed that uses a new class of local image features. The features are invariant to image scaling, translation, and rotation, and partially invariant to illumination changes and affine or 3D projection. These features share similar properties with neurons in inferior temporal cortex that are used for object recognition in primate vision. Features are efficiently detected through a staged filtering approach that identifies stable points in scale space. Image keys are created that allow for local geometric deformations by representing blurred image gradients in multiple orientation planes and at multiple scales. The keys are used as input to a nearest neighbor indexing method that identifies candidate object matches. Final verification of each match is achieved by finding a low residual least squares solution for the unknown model parameters. Experimental results show that robust object recognition can be achieved in cluttered partially occluded images with a computation time of under 2 seconds
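The scale-invariant features described above are what became SIFT, which is available in OpenCV. The example below matches two images with the standard ratio test; the file names are placeholders and it requires opencv-python 4.4 or later.

# SIFT keypoint detection and matching with Lowe's ratio test (OpenCV).
import cv2

img1 = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)    # placeholder file names
img2 = cv2.imread("object.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

matcher = cv2.BFMatcher(cv2.NORM_L2)
good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
        if m.distance < 0.75 * n.distance]               # keep distinctive matches only
print(f"{len(good)} putative matches")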
Matthew D. Zeiler and Rob Fergus, "Visualizing and Understanding Convolutional Neural Networks," arXiv preprint arXiv:1311.2901, 2013.
David G. Lowe, "Object Recognition from Local Scale-Invariant Features," in Proceedings of the Seventh IEEE International Conference on Computer Vision, IEEE, 1999, vol. 2, pp. 1150-1157.
Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin, "LIBLINEAR: A Library for Large Linear Classification," Journal of Machine Learning Research, vol. 9, pp. 1871-1874, 2008.