My Day in Review
Visually Summarising Noisy Lifelog Data
Soumyadeb Chowdhury1, Philip J. McParlane2, Md. Sadek Ferdous3, Joemon Jose4
School of Computing Science, University of Glasgow
{Soumyadeb.Chowdhury1, Sadek.Ferdous3, Joemon.Jose4},
Abstract
Lifelogging devices, which seamlessly gather various data about a
user as they go about their daily life, have resulted in users
amassing large collections of noisy photographs (e.g. visual
duplicates, image blur), which are difficult to navigate, especially
if they want to review their day in photographs. Social media
websites, such as Facebook, have faced a similar information
overload problem for which a number of summarization methods
have been proposed (e.g. news story clustering, comment ranking
etc.). In particular, Facebook’s Year in Review received much
user interest where the objective for the model was to identify key
moments in a user’s year, offering an automatic visual summary
based on their uploaded content. In this paper, we follow this
notion by automatically creating a review of a user’s day using
lifelogging images. Specifically, we address the quality issues
faced by the photographs taken on lifelogging devices and attempt
to create visual summaries by promoting visual and temporal-
spatial diversity in the top ranks. Conducting two crowdsourced
evaluations based on 9k images, we show the merits of combining
time, location and visual appearance for summarization purposes.
Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing
Keywords
Wearable camera; Autographer; lifelog images; GPS; temporal; spatial; clustering; GIST features; key moments; crowdsourcing.
1. Introduction
Lifelogging represents a way of digitally recording data (referred to as lifelogs) which capture a lifelogger’s experiences, in varying amounts of detail, for a variety of purposes, using a lifelogging device. These devices capture experiences in our daily routine without the need for explicit interaction, due to their hands-free nature. Most prior works [e.g. 1, 2] have segmented unprocessed lifelog data into meaningful units called events: collections of temporally related lifelog data, over a period of time, with a defined beginning and end. To
our knowledge, no previous study has extended the notion of
temporal-spatial clustering to consider the visual aspects for
lifelogging data abstraction. In this paper, we combine visual scene evidence with time and location information in order to identify the most representative lifelog images as key moments in
order to summarise a user’s day. This feature is similar to Facebook’s ‘Year in Review’, which automatically compiles some of the most liked images in a user’s feed and presents them in a neat timeline. The feature deployed by Facebook relied upon the number of likes (a user input) and did not consider whether the images belonged to the user or were publicly available web content (e.g. quotes, cartoons). In the context of
the relevant use cases, the automatic generation of key moments would be useful to an array of actors, including, but not limited to: lifeloggers, researchers interested in lifeloggers’ daily life experiences, and community councils interested in community biographies of a sample population. The research presented in this
paper attempts to address the following research questions:
RQ1: How can we effectively structure lifelog images to generate
key moments, i.e. a review of a lifelogger’s day?
The above research question is further partitioned into:
RQ1.1: How do we reduce the amount of noise in lifelogging
collection (i.e. low quality, blurry and repetitive images)?
RQ1.2: Can we effectively exploit temporal-spatial information
obtained from the lifelogs to generate a summary of the daily
moments of one’s life?
RQ1.3: Can we combine visual scene features in addition to the
temporal-spatial information to improve the summarisation of
daily moments?
2. Related Work
In most of the existing research with lifelogging devices, the
lifelogs are structured into activities or events [1, 2]. These events
merge various sources of sensed data together into a meaningful
and logical unit. Anguera et al. [3] and many other lifelogging
researchers have shown the potential of using meta information
obtained from the lifelogs to automatically generate annotations.
Prior work reported by Wang and Smeaton [4] categorised daily
activities, which were used to define events. Lazer et al. [5] have
used GPS sensors in cell phones to annotate the lifelogs using
their respective location. Gurrin et al. [6] have used WiFi sensors
to identify fine-grained locations of events. A recent study by Kikhia et al. [7] presented an approach to organising lifelogs based on places and activities obtained from GPS data. However, this work does not report eliminating noisy and duplicate images before clustering the lifelogs by GPS location. One of the earliest works, presented by Doherty and
Smeaton [8] used lifelog images with geographic data, to examine
how the visual and location information might be useful to recall
events from the past. It was concluded that lifelog images helped
in recalling past activities, whereas location data supports
inferential processes.
3. Proposed Approach
In this section we give a brief overview of the devices used to capture the lifelogs, followed by the techniques used to eliminate noisy (blurred, duplicate) lifelog images, and finally the clustering techniques employed to generate key moments.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ICMR '15, June 23 - 26, 2015, Shanghai, China
Copyright 2015 ACM 978-1-4503-3274-3/15/06…$15.00
3.1 Lifelog Capturing Phase
For the research reported in this paper, we used two devices to collect the lifelogs: (i) the Autographer, a wearable camera, which can record more than 100 images per hour [9]; and (ii) a GPS device recording location logs every 5 seconds. In practice, the location sensor in the Autographer takes almost 5 to 6 minutes to capture a GPS fix, as opposed to the 1 to 2 minutes claimed by the manufacturer. The lifelogs were collected by one of the authors.
3.2 Image Selection
One of the main problems in the automatic organization of the
lifelogs obtained from the wearable camera is managing noisy
photographs, most predominantly: (i) blurry images and (ii) visual
duplicates. In the following subsections, we discuss the
techniques employed to address RQ1.1.
3.2.1 Blurry Images
As the images are shot hands-free and often whilst on the move, the image quality, in particular image sharpness, is often very low. In order to automatically identify blurry images, we adopt the technique proposed by Tong et al. [10], which employs edge type and sharpness analysis using a Haar wavelet transform. This technique computes a blur score B (where 0 ≤ B ≤ 1) for a given image. Figure 1 highlights the blur distribution for all the images in our collection, where a lower score (Blur %) implies a sharper image. In this work, we select only those images where B < 0.38. This threshold was chosen empirically to remove the long tail of the distribution, which contains images that are blurrier than 80% of the overall population. We believe that blurred images will not be useful, as they provide too little information. However, such low-quality images can still be used for tasks such as elicitation-based surveys, where a social scientist may require all the images, irrespective of their quality.
Figure 1. Blur score distribution of the lifelog collection
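The paper adopts the full edge-type analysis of Tong et al. [10]; as a simplified illustration of the same idea (not the authors' implementation), the sketch below derives a blur score from the detail energy of a one-level Haar wavelet transform and applies a percentile cut-off, mirroring the empirical long-tail removal above. All function names and the exact score formula are our assumptions.

```python
import numpy as np

def haar_detail_energy(img):
    """Mean absolute detail energy of a one-level 2D Haar transform.

    img: 2-D greyscale array. Sharp images retain more high-frequency
    (detail) energy than blurred ones."""
    img = np.asarray(img, dtype=float)
    h, w = (img.shape[0] // 2) * 2, (img.shape[1] // 2) * 2
    img = img[:h, :w]
    lo = (img[:, 0::2] + img[:, 1::2]) / 2.0   # horizontal averages
    hi = (img[:, 0::2] - img[:, 1::2]) / 2.0   # horizontal details
    lh = (lo[0::2, :] - lo[1::2, :]) / 2.0     # vertical detail of averages
    hl = (hi[0::2, :] + hi[1::2, :]) / 2.0
    hh = (hi[0::2, :] - hi[1::2, :]) / 2.0
    return float(np.abs(lh).mean() + np.abs(hl).mean() + np.abs(hh).mean())

def blur_score(img):
    """Illustrative blur score B in (0, 1]: lower means sharper."""
    return 1.0 / (1.0 + haar_detail_energy(img))

def filter_sharp(images, percentile=80):
    """Keep only images whose blur score falls below the given
    percentile, mirroring the empirical cut-off that removes the
    blurriest ~20% of the collection."""
    scores = np.array([blur_score(im) for im in images])
    threshold = np.percentile(scores, percentile)
    return [im for im, s in zip(images, scores) if s < threshold]
```

A blurred copy of an image should receive a higher (blurrier) score than the sharp original, so the filter discards it first.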
3.2.2 Duplicate Images
As lifeloggers may be stationary for long periods of time (e.g. sitting at an office desk, travelling in a car or bus) and have limited control over when an image is captured, multiple image duplicates tend to exist within the lifelogs. These visual duplicates must be identified in order to avoid selecting duplicate images as key moments in a user’s day, which would further alleviate the information overload problem in our scenario.
In this work we have detected the duplicate and near-duplicate
images using a popular hashing function technique presented in
Tang et al. [11]. Hashing functions are used to generate fixed-
length output strings, which act as a shortened reference to the
initial data. In this work, we have used a hashing method, called
the Perceptual Hash (pHash), which has been shown to give high
detection performance for resized, cropped and exposure
compensated images [11]. We hypothesise that detecting images
which have these alterations will capture many of the duplicates
taken by lifelogging devices. We chose a hashing function to detect duplicate images in the lifelog collection due to its high performance and its low computational expense in extracting and matching hashes, which also ensures scalability to large lifelogging collections. To detect the visual duplicates in our lifelog collection, we adopt the following procedure:
Step 1: We first compute the pHash string for each image, using publicly available tools.
Step 2: Single-pass clustering is then employed on all the images in our collection, using the Hamming distance between the pHash strings of a given pair of images as the comparison measure.
Step 3: An image is added to an existing cluster if its Hamming distance to the cluster is small enough (T < 8, as suggested in existing work [12]); otherwise, the image is added to a new cluster.
For clusters containing multiple images (i.e. duplicates), we select only the sharpest image (i.e. the lowest B).
3.3 Image Ranking
Once the blurred and duplicate images are removed, a second
stage of ranking is carried out, in order to generate the key
moments. The goal of image ranking in web search is to maximize
the relevance of images in the top ranks with respect to a textual
query. This differs from ranking lifelog images to generate automatic key moments, due to the absence of both a query (or similar user involvement) and textual annotations, which makes the ranking problem non-trivial and ambiguous. We achieve the
automatic generation of key moments by clustering images based
on a number of visual and geospatial features, as follows.
3.3.1 Visual Clustering
Our first approach attempts to cluster images based on their visual
scene, i.e. group images which have a similar “visual landscape”
(e.g. images taken in a single room or place). In order to achieve
our goal, we extract the GIST visual features [13]. This feature has been adopted by many prior works to achieve state-of-the-art scene classification accuracy, and is well suited to our purpose. For each day’s lifelog images obtained from the lifelogger, we execute the following approach:
Step 1: We consider the images which have passed our selection process, i.e. all blurred and duplicate images are removed.
Step 2: The normalised 512-D GIST features are extracted for each of the images obtained in Step 1.
Step 3: Clustering is employed using the Expectation-Maximization (EM) algorithm. We employ this method over other popular clustering techniques (such as K-means) because the EM approach we use does not require the number of clusters (i.e. K) to be set in advance. This is important as we do not know a priori the number of clusters or ‘moments’ for any given day.
3.3.2 Temporal-spatial Clustering
Time and location are crucial evidence for the purpose of segmenting images into various clusters or moments. Therefore, we consider these features in our work. Firstly, we model an image’s time as the minute of the day in which it was captured, normalised by the number of minutes in a day (e.g. 2:40pm is modelled as 880/1440 ≈ 0.61), referred to as t (where 0 ≤ t ≤ 1). Secondly, we model an image’s location using its GPS co-ordinates. We normalise the longitude (l1) and latitude (l2) values for each image based on the maximum value recorded for each day. As the GPS equipment does not function well indoors, there are many images which do not contain GPS co-ordinates. In this case, we set the longitude and latitude to the mean value for a given day. This approach is employed, as opposed to setting these values to 0, in order to avoid skewing the dataset when an image lacks GPS information. We therefore model an image as a 3-D vector (i.e. [t, l1, l2]) of its time and location. This relates to RQ1.2, which was presented in Section 1.
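The construction of the [t, l1, l2] vector can be sketched as follows (a minimal illustration: the dictionary keys, and the simplifying assumption of positive coordinates, are ours, not the paper's):

```python
from datetime import datetime

def temporal_spatial_features(images):
    """Build one [t, l1, l2] vector per image.

    images: list of dicts with a 'time' (datetime) and optional
    'lon'/'lat' keys (absent when there was no GPS fix). Longitude and
    latitude are normalised by the day's maxima (assumed positive here
    for simplicity), and images without a fix receive the day's mean
    values rather than zeros, to avoid skewing the dataset."""
    with_fix = [im for im in images if 'lon' in im and 'lat' in im]
    max_lon = max(im['lon'] for im in with_fix)
    max_lat = max(im['lat'] for im in with_fix)
    norm = [(im['lon'] / max_lon, im['lat'] / max_lat) for im in with_fix]
    mean_lon = sum(l1 for l1, _ in norm) / len(norm)
    mean_lat = sum(l2 for _, l2 in norm) / len(norm)

    vectors = []
    for im in images:
        minute = im['time'].hour * 60 + im['time'].minute
        t = minute / 1440.0                       # minute of day, normalised
        if 'lon' in im and 'lat' in im:
            l1, l2 = im['lon'] / max_lon, im['lat'] / max_lat
        else:
            l1, l2 = mean_lon, mean_lat           # mean imputation, no GPS fix
        vectors.append([t, l1, l2])
    return vectors
```

For example, an image taken at 2:40pm gets t = 880/1440, and an image without a GPS fix inherits the day's mean normalised coordinates.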
3.3.3 Combining Visual and Temporal-spatial Clustering
Finally, we combine the visual and temporal-spatial aspects by
concatenating the two feature vectors, resulting in a 515-D vector,
before clustering using the EM approach. We hypothesise that
visual appearance, time and location are all essential in the
clustering process and will complement each other to summarise
large collections of lifelog images, by automatically generating
key moments within a user’s day. Such a combined clustering approach has not been studied in previous works, and it remains applicable in scenarios where location information is unavailable, either due to device constraints or other factors, since the visual component can compensate. This relates to RQ1.3, which was presented in Section 1.
4. Experiments
In this section, we discuss a number of statistics and limitations of
the lifelogs collected, followed by the various systems used for the
crowdsourced evaluation and the experimental procedure.
4.1 Data Collection
Table 1 shows the statistics related to the lifelogs captured by one
of the authors over a period of 13 days. The lifelogging devices
(Autographer and GPS) were used from 8:30am to 6pm. However,
the image capturing was stopped on a number of occasions, as
required by the lifelogger, which is not discussed further in this
paper. According to the statistics reported in Table 1, only 14% of the images contain location information, because the GPS device did not log locations while the lifelogger was indoors, i.e. inside buildings. The collection of lifelogs for the research presented in this paper was ethically approved, based on the lifelogger’s informed consent.
Table 1. Lifelog Image Collection Statistics
Rows: total images; number of days; average # images captured per day; % of blurred images; % of images containing location information.
4.2 Experimental Systems
Random (Srandom): Due to the lack of benchmarks, especially in
the context of generating the key moments of a user’s day, we
firstly propose ranking images randomly for each day, as a weak
baseline. In this system, 5 random images are selected.
Removing blurred images and duplicates (Sselect): The noisy
images, i.e. blurred and visual duplicates are removed using the
selection approaches presented in Sections 3.2.1 and 3.2.2
respectively, before 5 images are selected randomly. This system
will be considered a stronger baseline compared to Srandom.
Visual Clustering (Svisual): This approach, presented in Section 3.3.1, visually clusters all images remaining after the selection process. We select the sharpest image (i.e. lowest B) from each of the 5 largest clusters. Where fewer than 5 clusters exist, we additionally select the 2nd sharpest image from each cluster, and so on.
Temporal-spatial clustering (Stemp-spatial): This approach has been
detailed in Section 3.3.2, to cluster images obtained after the
selection process, based upon their spatial and temporal
information. We use the same image selection process as in Svisual
by selecting the sharpest images from the largest clusters.
Combined clustering (Scombined): Our final approach combines
Svisual and Stemp-spatial to rank the images obtained after the selection
process. We select the sharpest images from the largest clusters.
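The sharpest-from-largest-clusters selection shared by Svisual, Stemp-spatial and Scombined can be sketched as follows (a minimal illustration with hypothetical names: clusters are lists of image ids, and blur maps each id to its blur score B, where lower means sharper):

```python
def pick_key_moments(clusters, blur, n=5):
    """Select n key-moment images from clusters of image ids.

    Takes the sharpest image from each of the largest clusters; if
    fewer than n clusters exist, falls back to the 2nd sharpest image
    of each cluster, then the 3rd, and so on."""
    ordered = sorted(clusters, key=len, reverse=True)      # largest first
    ranked = [sorted(c, key=lambda i: blur[i]) for c in ordered]
    picks, depth = [], 0
    while len(picks) < n and any(depth < len(c) for c in ranked):
        for c in ranked:
            if depth < len(c) and len(picks) < n:
                picks.append(c[depth])
        depth += 1
    return picks
```

With three clusters and n = 5, the selector first takes the sharpest image of each cluster, then returns to the larger clusters for their second-sharpest images.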
4.3 Crowdsourcing Evaluation
As we are not evaluating events, activities or similar segmentation
units, which can be only identified by the lifelogger, we instead
employ a taskforce of crowdsourced evaluators in order to judge
the quality and diversity of the top tanked images for each
experimental system, thus increasing the speed and broadening
the opinion of our evaluation. Moreover a crowdsourced
evaluation will benefit use cases like generating community
biographies for city councils and policy makers, where the lifelog
images are not necessarily used by the lifeloggers, but sourced to
external sources for a specific purpose, where quality and
diversity are essential dimensions. Two separate crowdsourced
evaluations were taken out on a popular crowdsourcing platform
CrowdFlower (CF) [14].
4.3.1 Evaluating Image Quality
Our first evaluation attempts to judge the quality of five top
ranked images for each system. The evaluators are asked the
following questions with regards to a presented image: (1) How
clear is the photograph? Each evaluator is asked to rate on a
Likert scale ranging from 0 (very blurry) to 5 (very sharp). (2)
How “interesting” is the photograph? Consider the scene and the
objects, for example an image of a door or wall would be
considered ‘uninteresting’. An image of a street or depicting an
activity (using a computer, eating etc.), would be considered
‘interesting’. Each evaluator is asked to rate on a Likert scale
from 0 (very boring) to 5 (very interesting). The first question aims to validate the blur detection part of our image selection process, with the second attempting to gauge the noise reduction from an image interestingness perspective. Although image interestingness is ill-defined and “in the eye of the beholder”, we believe that by measuring perceived interest across a wide spectrum of crowdsourced evaluators, we will gain some insight into users’ engagement.
Each evaluator was presented with 5 images from those selected by the various systems detailed in Section 4.2. Each image was evaluated by 3 separate evaluators, with the survey scores averaged. Each evaluator was paid $0.03 on completion of this task. For our two questions (both with 6 options), evaluator agreement of 66% and 56% was achieved. Evaluator agreement, computed by CrowdFlower, describes the level of agreement between multiple contributors (weighted by the contributors’ trust scores). The fairly low agreement level achieved is mainly due to the high number of options available for each question (i.e. 6) and the difficulty of gauging how relevant an image is for summarization purposes in the context of lifelogging. Hence, we take the average scores of 3 different users for each question to reduce the diversity in opinion.
4.3.2 Evaluating Image Diversity
In order to evaluate the visual diversity of the top ranked images,
a second crowdsourced evaluation is conducted. The evaluators
are asked to judge the “visual similarity” of the image pairs. Each
evaluator is asked to rate the visual similarity on a Likert scale
ranging from 0 (completely different) to 5 (Identical). For our
question judging the visual similarity (having 6 different options),
the users achieved a 70.4% agreement level.
5. Results
Based on the judgements made in the two evaluations, we are able
to quantify the sharpness, interestingness and visual similarity in
the top 5 ranked images for each system. Table 2 shows the average statistics obtained from both evaluations, and clearly demonstrates the effectiveness of the Scombined system. Statistical significance against our Srandom baseline is denoted as * for p < 0.05, ** for p < 0.01 and *** for p < 0.001.
Table 2. Results from Evaluation (‘v’ denotes very).
Columns: sharpness (0 = v blurry, 5 = v sharp); interestingness (0 = v boring, 5 = v interesting); visual similarity (0 = v different, 5 = v similar).
In relation to RQ1.1, considering the effectiveness of the selection methods (i.e. those used by Sselect), we observe that the two major problems faced by lifelogging devices (i.e. blurred and duplicate images) can be automatically alleviated, resulting in a 7% increase in image sharpness and a 31% decrease in visual duplicates in the top 5 ranks, compared to our Srandom baseline.
In relation to RQ1.2 and RQ1.3, i.e. the effectiveness of the ranking/clustering methods, the evaluation demonstrates that the images clustered based on the combination of all three aspects proposed within this paper (i.e. Scombined) received the highest average score (Table 2, last row) for each of the three aforementioned measures. Scombined also achieved a 21% increase (on average) over the Srandom baseline across all three measures. The reduction of noise, i.e. eliminating blurry images and visual duplicates, significantly improves image interestingness, implying a positive correlation between the two.
By improving the quality (i.e. sharpness and interestingness) and
visual diversity of the images taken by lifeloggers, we would
expect to improve the user’s search and retrieval experience when
reviewing their images for a day. Firstly, we would expect our
filtering approach to significantly reduce browsing time as almost
60% of images in our collection were either too blurry or a visual
duplicate. Secondly, by promoting visual diversity using our
proposed clustering techniques, we would expect a user to more
effectively review their lifelogging visual data by presenting them
with 5 visually diverse images (out of thousands of images).
6. Conclusion
This paper extends the notion of structuring lifelog data by
evaluating a number of techniques to automatically generate good
quality and visually diverse images in order to form key moments,
summarising a lifelogger’s daily life. Two crowdsourced
evaluations demonstrated that the most effective technique to
generate such key moments would rely upon eliminating blurred
and visually duplicate images, followed by temporal, spatial and
visual scene (using GIST) clustering. The techniques reported in
this paper also contribute to decreasing the information overload
problem posed by lifelogging devices. The techniques presented in this paper will be applied to a project in which we are piloting with 100 lifeloggers for ~12 months to capture user experiences in a modern city on a daily basis. The objective is to archive data that facilitates researchers from various disciplines in conducting human
computational tasks. We firmly believe that it is also crucial to consider how to present the lifelog data in a way that conforms to the needs of all the possible actors using it: for example, urban science researchers could explore trends in transportation usage behavior, while policy makers and city councils could use community biographies to better understand the needs of a community and improve its lifestyle.
Acknowledgments
The authors acknowledge support from Integrated Multimedia
City Data (iMCD), a project within the ESRC-funded Urban Big
Data Centre (ES/L011921/1).
References
[1] Doherty, A., Kelly, P., & Foster, C. "Wearable Cameras: Identifying
Healthy Transportation Choices." IEEE Pervasive Computing 12.1
(2013): 44-47.
[2] Doherty, Aiden R., et al. "Experiences of aiding autobiographical
memory using the SenseCam." Human–Computer Interaction 27.
1-2 (2012): 151-174.
[3] Anguera, X., Xu, J., & Oliver, N. "Multimodal photo annotation and
retrieval on a mobile phone.” ACM MIR, 2008.
[4] Wang, P., & Smeaton, A. F. "Semantics-based selection of everyday
concepts in visual lifelogging." International Journal of Multimedia
Information Retrieval 1.2 (2012): 87-101
[5] Lazer, D, et al. "Life in the network: the coming age of
computational social science." Science (NY) 323.5915 (2009): 721.
[6] Gurrin, C, et al. "The smartphone as a platform for wearable cameras
in health research." American journal of preventive medicine 44.3
(2013): 308-313.
[7] Kikhia, B, et al. "Structuring and Presenting Lifelogs Based on
Location Data." Pervasive Computing Paradigms for Mental
Health. Springer International Publishing, 2014. 133-144.
[8] Doherty, A. R., & Smeaton, A. F."Automatically augmenting lifelog
events using pervasively generated content from millions of people."
Sensors 10.3 (2010): 1423-1446.
[9] Autographer, last accessed: 1 April, 2015.
[10] Tong, H, et al. "Blur detection for digital images using wavelet
transform." Multimedia and Expo, 2004. ICME'04.
[11] Tang, Z., Dai, Y., & Zhang, X. "Perceptual hashing for color images
using invariant moments." Appl. Math 6.2S (2012): 643S-650S.
[12] Chum, O., Philbin, J., & Zisserman, A. "Near Duplicate Image
Detection: min-Hash and tf-idf Weighting." BMVC. Vol. 810. 2008.
[13] Oliva, A., & Torralba, A.. "Building the gist of a scene: The role of
global image features in recognition." Progress in brain research
155 (2006): 23-36.
[14] CrowdFlower, last accessed: 1 April, 2015.
... Since photos are VOLUME 4, 2016 taken whilst on the move, there is a problem of blurriness. In fact, the blurred images will not provide enough information; yet reduce the efficiency of search performance due to the wasted computation time [59], [60]. The third issue is about images redundancy. ...
... The third issue is about images redundancy. Since the lifelogger may be in stationary situations during the day, duplicates photos tend to exist within the lifelogs [59]. The retrieval of such images is time consuming without any benefit. ...
... Soumyadeb et al. estimated Blur score using a Haar wavelet transform. Images below the threshold are pruned [59]. The UPB team [60] deals with the uninformativeness and blurriness. ...
Full-text available
Lifelogging is the process of digital tracking of person’s daily experiences in varying amounts of details, for a variety of purposes. In recent years, lifelogging has become an increasingly popular area of research due to the growing demands from many applications, such as video surveillance, entertainment, healthcare systems, and intelligent environments. Furthermore, the advancements in devices technology offer the promise to record and store large volumes of personal data in a very cheap manner, using an inexpensive tool. However, the rapid access to this huge deluge of unlabeled and unstructured data and automatically processing it to recognize everyday experiences, still present major challenges. A large number of research have been conducted in recent years to cover different lifelogging aspects, but there is still a lack of studies that provide a comprehensive survey of the available literature, and most of the existing lifelogging surveys generally focus on only one aspect. This review highlights the advances of state-of-the-art in lifelogging from different angles, including its research history, current applications, activity recognition techniques, moment retrieval, storytelling, privacy and security issues, as well as challenges and future research trends.
... • Summarising daily activities captured by an egocentric or lifelogging camera [5,9,36], including identifying frames which look like intentionally taken photos [59]. ...
... Among the 'available' classes, find the class with the highest proportion of misclassified instances, say class j. 9 Replace temporarily the current instance x j ∈ S with each of the remaining instances from class j, one at a time. Identify the instance x j * which gives the minimum resubstitution error E. 10 Mark class j as 'not-available' for another t iterations. ...
Full-text available
A keyframe summary of a video must be concise, comprehensive and diverse. Current video summarisation methods may not be able to enforce diversity of the summary if the events have highly similar visual content, as is the case of egocentric videos. We cast the problem of selecting a keyframe summary as a problem of prototype (instance) selection for the nearest neighbour classifier (1-nn). Assuming that the video is already segmented into events of interest (classes), and represented as a dataset in some feature space, we propose a Greedy Tabu Selector algorithm (GTS) which picks one frame to represent each class. An experiment with the UT (Egocentric) video database and seven feature representations illustrates the proposed keyframe summarisation method. GTS leads to improved match to the user ground truth compared to the closest-to-centroid baseline summarisation method. Best results were obtained with feature spaces obtained from a convolutional neural network (CNN).
... Which are the costs in terms of physical computation and user's acceptability of such ideas? Such problems are similar to those that are currently emerging in lifelogging research (Chowdhury et al. 2015). Federica Cena has been an assistant professor at the Computer Science Department at University of Torino since 2011. ...
Full-text available
Over the last few years, user modeling scenery is changing. With the recent advancements in ubiquitous and wearables technologies, the amount and type of data that can be gathered about users and used to build user models is expanding. User Model can now be enriched with data regarding different aspects of people’s everyday lives. All these changes bring forth new research questions about the kinds of services which could be provided, the ways for effectively conveying new forms of personalisation and recommendation, and how traditional user modeling should change to exploit ubiquitous and wearable technology to provide these services. In this paper we follow the evolution of user modeling process, starting from the traditional User Model and progressing to RWUM - Real World User Model, which contains data from a person’s everyday life. We tried to answer the above questions and to present a conceptual framework that represents the RWUM process, which might be used as a reference model for designing RWUM-based systems. Finally, we propose some inspiring usage scenarios and design directions that can guide researchers in designing novel, robust and versatile services based on RWUM.
... Prior work reported by Chowdhury et al.[3] gave the hints of ...
Full-text available
In this paper, we investigate the effectiveness of two distinct techniques (Special Moment Approach & Spatial Frequency Approach) for reviewing the lifelogs, which were collected by lifeloggers who were willing to use a wearable camera and a bracelet simultaneously for two days. Generally, Special moment approach is a technique for extracting episodic events and Spatial frequency approach is a technique for associating visual with temporal and location information, especially heat map is applied as the spatial data for expressing frequency awareness. Based on that, the participants were asked to fill in two post-study questionnaires for evaluating the effectiveness of those two techniques and their combination. The preliminary result showed the positive potential of exploring individual lifelogs using our approaches.
The visual lifelogging activity enables a user, the lifelogger, to passively capture images from a first-person perspective and ultimately create a visual diary encoding every possible aspect of her life in unprecedented detail. In recent years, it has gained popularity among different groups of users. However, the possibility of the ubiquitous presence of lifelogging devices, specifically in private spheres, has raised serious concerns with respect to personal privacy. In this article, we present a thorough discussion of privacy with respect to visual lifelogging. We readjust the existing definition of lifelogging to reflect different aspects of privacy and introduce a first-ever privacy threat model identifying several threats with respect to visual lifelogging. We also show how the existing privacy guidelines and approaches are inadequate to mitigate the identified threats. Finally, we outline a set of requirements and guidelines that can be used to mitigate the identified threats while designing and developing a privacy-preserving framework for visual lifelogging.
In today’s society, stress is one of the most prevalent phenomena affecting people both in private life and working environments. Despite the fact that academic research has revealed significant insights into the nature of stress in organizations, as well as its antecedents, consequences, and moderating factors, stress researchers still face a number of challenges today. One major challenge is related to construct measurement, particularly if stress is to be understood from a longitudinal perspective, which implies longitudinal measurement of stress and related phenomena. A novel research opportunity has emerged in the form of “lifelogging” in recent years. This concept is based on the idea that unobtrusive computer technology can be used to continuously collect data on an individual’s current state (psychological, physiological, or behavioral) and context (ranging from temperature to social interaction information). Based on a review of the lifelogging literature (N = 155 articles), this article discusses the potential of lifelogging for construct measurement in organizational stress research. The primary contribution of this article is to showcase how modern computer technology can be used to study the temporal nature of stress and related phenomena (e.g., coping with stress) in organizations.
Appendix to the book "Lifelogging for Organizational Stress Measurement: Theory and Applications" including a full list of the reviewed articles.
Wearable cameras allow us to capture large amounts of video or still images in an automatic and implicit manner. However, only the necessary images should be retained, filtering out the meaningless and/or redundant information in the captured data. In this paper, we propose a method that uses audio and video data to identify a set of still images, with the aim of making the images pleasurable for users to watch later.
Human memory is a dynamic system that makes certain memories of events accessible based on a hierarchy of information, arguably driven by personal significance. Not all events are remembered, but those that are tend to be more psychologically relevant. In contrast, lifelogging is the process of automatically recording aspects of one's life in digital form without loss of information. In this article we share our experiences in designing computer-based solutions that assist people in reviewing their visual lifelogs and address this contrast. The technical basis for our work is automatically segmenting visual lifelogs into events, allowing event similarity and event importance to be computed, ideas that are motivated by cognitive science considerations of how human memory works and can be assisted. Our work has been based on visual lifelogs gathered by dozens of people, some of them with collections spanning multiple years. In this review article we summarize a series of studies that have led to the development of a browser based on human memory systems, and discuss the inherent tension between storing large amounts of data and making the most relevant material the most accessible.
Mobile phones are becoming multimedia devices. It is common to observe users capturing photos and videos on their mobile phones on a regular basis. As the amount of digital multimedia content expands, it becomes increasingly difficult to find specific images on the device. In this paper, we present a multimodal and mobile image retrieval prototype named MAMI (Multimodal Automatic Mobile Indexing). It allows users to annotate, index and search for digital photos on their phones via speech or image input. Speech annotations can be added at the time of capturing photos or at a later time. Additional metadata such as location, user identification, date and time of capture is stored in the phone automatically. A key advantage of MAMI is that it is implemented as a stand-alone application which runs in real-time on the phone. Therefore, users can search for photos in their personal archives without the need of connectivity to a server. In this paper, we compare multimodal and monomodal approaches for image retrieval and we propose a novel algorithm named the Multimodal Redundancy Reduction (MR2) Algorithm. In addition to describing the proposed approaches in detail, we present our experimental results and compare the retrieval accuracy of monomodal versus multimodal algorithms.
In sensor research we take advantage of additional contextual sensor information to disambiguate potentially erroneous sensor readings or to make better informed decisions on a single sensor’s output. This use of additional information reinforces, validates, semantically enriches, and augments sensed data. Lifelog data is challenging to augment, as it tracks one’s life with many images including the places they go, making it non-trivial to find associated sources of information. We investigate realising the goal of pervasive user-generated content based on sensors, by augmenting passive visual lifelogs with “Web 2.0” content collected by millions of other individuals.
Lifelogging techniques help individuals to log their life and retrieve important events, memories and experiences. Structuring lifelogs is a major challenge in lifelogging systems since the system should present the logs in a concise and meaningful way to the user. In this paper the authors present an approach for structuring lifelogs as places and activities based on location data. The structured lifelogs are achieved using a combination of density-based clustering algorithms and convex hull construction to identify the places of interest. The periods of time where the user lingers at the same place are then identified as possible activities. In addition to structuring lifelogs the authors present an application in which images are associated to the structuring results and presented to the user for reviewing. The system is evaluated through a user study consisting of 12 users, who used the system for 1 day and then answered a survey. The proposed approach in this paper allows automatic inference of information about significant places and activities, which generates structured image-annotated logs of everyday life. © Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2014.
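The pipeline this abstract describes, density-based clustering of location fixes followed by a convex hull around each cluster to delimit a place of interest, can be sketched roughly as follows. This is a minimal stdlib-only illustration, not the authors' implementation; the naive DBSCAN-style variant and the `eps`/`min_pts` thresholds are assumptions chosen for the example.

```python
import math

def density_clusters(points, eps=0.001, min_pts=3):
    """Naive DBSCAN-style clustering of (lat, lon) fixes.

    A point is a core point if at least min_pts points (itself
    included) lie within eps; clusters grow by chaining core points.
    Returns one label per point; -1 marks noise."""
    labels = [None] * len(points)       # None = unvisited, -1 = noise
    cluster_id = 0

    def neighbours(i):
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1              # noise (may be claimed later)
            continue
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # border point, do not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            more = neighbours(j)
            if len(more) >= min_pts:    # only core points expand
                queue.extend(more)
        cluster_id += 1
    return labels

def convex_hull(points):
    """Andrew's monotone chain: hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]
```

Periods during which consecutive fixes share a cluster label would then be candidate activities, with the hull of each cluster outlining the place itself.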
Image hashing is a new technology in multimedia security. It maps visually identical images to the same or similar short strings called image hashes, and finds applications in image retrieval, image authentication, digital watermarking, image indexing, and image copy detection. This paper presents a perceptual hashing scheme for color images. The input image in RGB color space is first converted into a normalized image by interpolation and filtering. Color space conversions from RGB to YCbCr and HSI are then performed. Next, invariant moments of each component of the above two color spaces are calculated. The image hash is finally obtained by concatenating the invariant moments of these components. Similarity between image hashes is evaluated by the L2 norm. Experiments show that the proposed hashing is robust against normal digital processing, such as JPEG compression, watermark embedding, gamma correction, Gaussian low-pass filtering, adjustments of brightness and contrast, image scaling, and image rotation. Receiver operating characteristic (ROC) comparisons between the proposed hashing and singular value decomposition (SVD) based hashing, also called SVD-SVD hashing, presented by Kozat et al. at the 11th International Conference on Image Processing (ICIP'04), indicate that the proposed hashing shows better performance in robustness and discriminative capability than the SVD-SVD hashing.
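The recipe above, invariant moments per colour component, concatenated into a hash and compared by L2 norm, can be illustrated in miniature. This is a hedged sketch, not the paper's implementation: it computes only the first two Hu moment invariants directly on raw 2-D channels and skips the normalization, filtering, and YCbCr/HSI conversion steps the paper performs.

```python
import math

def hu_invariants(channel):
    """First two Hu moment invariants of a 2-D intensity channel
    (a list of rows); invariant to translation and scale."""
    h, w = len(channel), len(channel[0])
    m00 = sum(sum(row) for row in channel) or 1e-12   # avoid /0 on blank channel
    xbar = sum(x * channel[y][x] for y in range(h) for x in range(w)) / m00
    ybar = sum(y * channel[y][x] for y in range(h) for x in range(w)) / m00

    def mu(p, q):                       # central moment
        return sum((x - xbar) ** p * (y - ybar) ** q * channel[y][x]
                   for y in range(h) for x in range(w))

    def eta(p, q):                      # scale-normalised central moment
        return mu(p, q) / m00 ** (1 + (p + q) / 2)

    n20, n02, n11 = eta(2, 0), eta(0, 2), eta(1, 1)
    return [n20 + n02, (n20 - n02) ** 2 + 4 * n11 ** 2]

def image_hash(channels):
    """Concatenate the invariants of every colour-space component."""
    return [v for ch in channels for v in hu_invariants(ch)]

def hash_distance(h1, h2):
    """L2 norm between two hashes; a small distance means similar images."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(h1, h2)))
```

In the full scheme the `channels` list would hold the Y, Cb, Cr, H, S and I components of the normalized image, so perceptually identical images land at near-zero distance while distinct images do not.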
Traditionally, health researchers have used large-scale travel surveys to measure existing travel behavior and identify the determinants driving it. However, such surveys rely on self-reporting, which can be unreliable. Here, the authors discuss using wearable cameras that capture first-person point-of-view images to help objectively identify the duration, frequency, and mode of journeys and reveal potential errors inherent in self-reporting. Their approach could ultimately lead to a better understanding of the environments offering individuals opportunities to engage in more active forms of transportation. This column is part of a special issue on transit and transport.
Concept-based indexing, based on identifying various semantic concepts appearing in multimedia, is an attractive option for multimedia retrieval and much research tries to bridge the semantic gap between the media’s low-level features and high-level semantics. Research into concept-based multimedia retrieval has generally focussed on detecting concepts from high-quality media such as broadcast TV or movies, but it is not well addressed in other domains like lifelogging where the original data is captured with poorer quality. We argue that in noisy domains such as lifelogging, the management of data needs to include semantic reasoning in order to deduce a set of concepts to represent lifelog content for applications like searching, browsing or summarization. Using semantic concepts to manage lifelog data relies on the fusion of automatically detected concepts to provide a better understanding of the lifelog data. In this paper, we investigate the selection of semantic concepts for lifelogging which includes reasoning on semantic networks using a density-based approach. In a series of experiments we compare different semantic reasoning approaches and the experimental evaluations we report on lifelog data show the efficacy of our approach.
The Microsoft SenseCam, a small camera that is worn on the chest via a lanyard, increasingly is being deployed in health research. However, the SenseCam and other wearable cameras are not yet in widespread use because of a variety of factors. It is proposed that the ubiquitous smartphones can provide a more accessible alternative to SenseCam and similar devices. To perform an initial evaluation of the potential of smartphones to become an alternative to a wearable camera such as the SenseCam. In 2012, adults were supplied with a smartphone, which they wore on a lanyard, that ran life-logging software. Participants wore the smartphone for up to 1 day and the resulting life-log data were both manually annotated and automatically analyzed for the presence of visual concepts. The results were compared to prior work using the SenseCam. In total, 166,000 smartphone photos were gathered from 47 individuals, along with associated sensor readings. The average time spent wearing the device across all users was 5 hours 39 minutes (SD=4 hours 11 minutes). A subset of 36,698 photos was selected for manual annotation by five researchers. Software analysis of these photos supports the automatic identification of activities to a similar level of accuracy as for SenseCam images in a previous study. Many aspects of the functionality of a SenseCam largely can be replicated, and in some cases enhanced, by the ubiquitous smartphone platform. This makes smartphones good candidates for a new generation of wearable sensing devices in health research, because of their widespread use across many populations. It is envisioned that smartphones will provide a compelling alternative to the dedicated SenseCam hardware for a number of users and application areas. This will be achieved by integrating new types of sensor data, leveraging the smartphone's real-time connectivity and rich user interface, and providing support for a range of relatively sophisticated applications.
This paper proposes two novel image similarity measures for fast indexing via locality sensitive hashing. The similarity measures are applied and evaluated in the context of near-duplicate image detection. The proposed method uses a visual vocabulary of vector-quantized local feature descriptors (SIFT) and, for retrieval, exploits enhanced min-Hash techniques. Standard min-Hash uses an approximate set intersection between document descriptors as a similarity measure. We propose an efficient way of exploiting more sophisticated similarity measures that have proven to be essential in image / particular object retrieval. The proposed similarity measures do not require extra computational effort compared to the original measure. We focus primarily on scalability to very large image and video databases, where fast query processing is necessary. The method requires only a small amount of data to be stored for each image. We demonstrate our method on the TrecVid 2006 data set, which contains approximately 146K key frames, and also on the challenging University of Kentucky image retrieval database.
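The standard min-Hash idea that this work builds on, estimating set overlap between bags of quantized visual words from short signatures, can be sketched as follows. The universal hash family and parameter choices here are illustrative assumptions, not the paper's enhanced variants.

```python
import random

def make_hash_funcs(k, universe=2**31 - 1, seed=7):
    """k random affine hash functions h(w) = (a*w + b) mod universe."""
    rng = random.Random(seed)
    funcs = []
    for _ in range(k):
        a, b = rng.randrange(1, universe), rng.randrange(universe)
        funcs.append(lambda w, a=a, b=b: (a * w + b) % universe)
    return funcs

def minhash_signature(word_set, hash_funcs):
    """Signature = minimum of each hash function over the visual-word IDs."""
    return [min(h(w) for w in word_set) for h in hash_funcs]

def estimated_jaccard(sig1, sig2):
    """Fraction of agreeing signature slots; an unbiased estimate of the
    Jaccard similarity |A ∩ B| / |A ∪ B| of the underlying word sets."""
    return sum(s1 == s2 for s1, s2 in zip(sig1, sig2)) / len(sig1)
```

Two near-duplicate images share most visual words, so most signature slots agree; indexing images by small bands of signature slots is what makes retrieval sub-linear in collection size.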