Enhancing Digital Literacy by
Multi-modal Data Mining of the Digital Lifespan
John Collomosse
Stuart James
CVSSP
University of Surrey
Guildford, UK
j.collomosse@surrey.ac.uk
Abigail Durrant
Diego Trujillo-Pisanty
Culture Lab
Newcastle University
Newcastle upon Tyne, UK
abigail.durrant@ncl.ac.uk
Wendy Moncur
Kathryn M. Orzech
University of Dundee
Dundee, UK
wmoncur@dundee.ac.uk
Sarah Martindale
Horizon
University of Nottingham
Nottingham, UK
Sarah.Martindale@nott.ac.uk
Mike Chantler
Heriot-Watt University
Edinburgh, UK
m.j.chantler@hw.ac.uk
ABSTRACT
Social media pervades modern digital society, yet despite its
regular use the general level of digital literacy and aware-
ness of online representations of self remain low. We re-
port progress on the RCUK DE funded ‘Charting the Digital
Lifespan’ project, which promotes digital literacy of social
media through design of novel technological interventions
that raise awareness of the digital personhood. We describe
progress towards automatic social media data-mining capa-
ble of categorising photographs and associated comments
into high-level semantic concept groups elicited from digital
anthropological study. We describe how this pilot system
has been incorporated into live trials of technologies that
visualize and encourage reflection upon digital personhood.
Categories and Subject Descriptors
D.4 [Human-centered computing]: Social Media
1. INTRODUCTION
Over two billion individuals world-wide, and 57% of the
UK population, maintain digital reflections of their ana-
logue lives through the lens of social media [1]. The paths
of these digital and physical lives run in parallel, converg-
ing and diverging as they mediate personhood. The RCUK
Digital Economy Theme funded Charting the Digital Lifes-
pan (CDL) project is embarking upon the second half of a
two year interdisciplinary study exploring this space. A key
goal for CDL is to enhance the digital literacy of individu-
als by facilitating reflection upon their digital personhood.
Frequently individuals are unaware of the permanence of
their digital footprints online, which are often contributed
in an ad-hoc incremental manner without regard to the holis-
tic representations being constructed of their digital selves.
Repercussions can occur when these digital representations
are revealed to, or interact with, others in unexpected ways.
1.1 Methodology and Context
CDL studies drivers and use patterns of social media at
three life stages: emerging adults; new parents; and re-
tirees. An experience-centred design (ECD) [13] methodol-
ogy is adopted, in which technology probes [8] — hardware
or software devices — are introduced into individuals’ lives
enabling novel forms of reflection upon their digital person-
hood. The probes provoke individual reflection by present-
ing a visualisation of the digital personhood, instantiated
through automatic analysis (‘data mining’) of online social
media presence. The design of the technology probe itself is
informed by both semi-structured interviews conducted with
individuals also at that life-stage (digital anthropology [7]),
and exploratory workshops centred around design fictions
that encourage critical speculation about future possibilities
involving digital personhood [2]. The needs of the technol-
ogy probe drive technical innovation in social media data
mining, which forms the focus of this paper.
One concrete example of the CDL methodology was the ‘Admixed
Portrait’ [12], a visualisation aimed at promoting re-
flection on online identity in the new parent user-group. For
this technology probe, a visual amalgam was generated by
data mining face portrait images recently posted by a par-
ticipant to their Facebook timeline (Fig. 1). New parents
were able to reflect upon how they are portrayed online,
through the nature and content of their posts (e.g. a trend
towards posts of their new child) and posts made by others.
This intervention was designed around new parent use cases
identified through anthropological study.
In our current instantiation with the emerging adult group,
we are encouraging reflection on the kinds of material pre-
viously shared on their Facebook newsfeed. An exploratory
user study was used to determine a set of nine representative
‘concept groups’, e. g. sport, friends and family (Sec. 3.1).
Computer Vision (CV) and Machine Learning (ML) algo-
rithms have been developed to automatically classify partic-
ipants’ social media posts into one or more of these groups.
Due to the challenges of automatic classification, only posts of
photographs (visuals) with associated free-text comments
were considered. Visualizations indicating the frequency
with which posts were made into each group were then shown
to participants to encourage reflection on the kinds of ac-
tivity that they were depicting through their digital per-
sonhood. We elaborate upon the CV/ML algorithms em-
ployed in performing the post classification, and the results
obtained, in the remainder of this paper.
2. RELATED WORK
The identification of semantic concepts within images (im-
age classification) is a fundamental CV/ML challenge. Mod-
ern approaches tackle the problem using a three-stage pipeline.
First, a set of features are extracted from the image en-
coding texture and visual structure in a high dimensional
space. Commonly features are gradient domain (e. g. SIFT
[9], HOG [5]). Second, the feature space is simplified whilst
maintaining its ability to discriminate content. This is typi-
cally performed by clustering using k-means [11] or Gaussian
mixtures [10] to form a visual dictionary, and expressing all
features with respect to that dictionary. Third, the image
descriptors for each concept are used to train a classifier us-
ing a set of training examples. More recently the pipeline
has been adapted to learn the classifier and appropriate first-
stage features simultaneously; so-called ‘deep learning’ [4].
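The three-stage pipeline described above can be sketched as follows (an illustrative outline only: synthetic vectors stand in for SIFT/HOG descriptors, and scikit-learn's KMeans and SVC stand in for the dictionary and classifier stages):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Stage 1 (stand-in): each "image" yields a set of local descriptors.
# A real system would extract SIFT or HOG here; we synthesise 128-D
# vectors whose distribution depends on the class.
def extract_descriptors(label):
    return rng.normal(label * 5.0, 1.0, size=(50, 128))

labels = [0, 1] * 20
images = [extract_descriptors(y) for y in labels]

# Stage 2: cluster descriptors into a visual dictionary (k-means) and
# express each image as a normalised histogram over the codewords.
k = 8
dictionary = KMeans(n_clusters=k, n_init=10, random_state=0)
dictionary.fit(np.vstack(images))

def encode(descriptors):
    hist = np.bincount(dictionary.predict(descriptors), minlength=k)
    return hist / hist.sum()

X = np.array([encode(d) for d in images])

# Stage 3: train a classifier on the encoded images and test it.
clf = SVC(kernel="rbf").fit(X[:30], labels[:30])
accuracy = clf.score(X[30:], labels[30:])
```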
These frameworks have been applied classically to lab-scale
datasets – a few tens of videos, or thousands of images – with
success rates approaching 70% for object recognition, e.g.
PASCAL VOC [10]. More recently, large datasets such as
ImageNet [6], containing many hundreds of categories each
with tens of exemplars have been explored. In our setup
we face the complementary problem of only a few categories
but many thousands of diverse exemplars (high intra-class
variance) and sparse text data that previous, domain-constrained
social classification work does not accommodate.
3. SOCIAL MEDIA CLASSIFICATION
We adopt a standard supervised classification approach in
which a set of social media posts (SMPs) are marked up
manually by tagging any of the nine concepts present. A
fraction of this data is used to train our proposed classi-
fier, and the remainder used to test the system to evaluate
accuracy. The train-test split is randomised and repeated
several times in a cross-validation framework to report a
mean average precision (MAP) value for accuracy.
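This evaluation protocol can be sketched as follows (a minimal illustration on synthetic stand-in data: the nine one-vs-all concept labels and repeated randomised splits mirror the description above, but all features and labels here are fabricated):

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit
from sklearn.svm import SVC
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(1)

# Synthetic stand-ins for encoded posts: 9 one-vs-all concept labels
# per post, with the features made informative for concept 0 only.
n_posts, n_concepts = 200, 9
X = rng.normal(size=(n_posts, 20))
Y = (rng.random((n_posts, n_concepts)) < 0.3).astype(int)
X[:, 0] += 2.0 * Y[:, 0]

# Repeated randomised train/test splits; MAP averages the per-concept
# average precision over concepts, then over splits.
splitter = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
per_split_map = []
for train, test in splitter.split(X):
    aps = []
    for c in range(n_concepts):
        if len(set(Y[train, c])) < 2:  # skip degenerate concepts
            continue
        clf = SVC(kernel="rbf").fit(X[train], Y[train, c])
        scores = clf.decision_function(X[test])
        aps.append(average_precision_score(Y[test, c], scores))
    per_split_map.append(np.mean(aps))
map_score = float(np.mean(per_split_map))
```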
3.1 Datasets and Concept Groups
Two datasets were harvested from Facebook using a bespoke
web crawler. The first indexed private Facebook profiles of
20 participants in the emerging adult group. The sec-
ond indexed publicly available profiles linked from emerg-
ing adults. Approximately 47k posts were harvested, and
6k usable records containing both photos and English text
comments were retained for the combined dataset. Manual
annotation was performed to establish ground-truth. For
public data, the CDL team annotated the data. For pri-
vate data, annotation was crowd-sourced from participants
contributing their data.
The nine concept groups for annotation and subsequent clas-
sification were scoped initially by CDL staff, and subse-
quently refined during 15 semi-structured interviews with
emerging adults conducted by the project anthropologist.
Participants volunteered category names, and these were
grouped after all interviews with emerging adults were com-
plete, aligning these with the preliminary concept groups
when appropriate. The resulting concept groups were: Art;
Attitude & Beliefs; Family & Pets; Food; Friends; Travel;
Celebrations; Personal style and self-imagery (e. g. selfie);
Sports. To collect private Facebook data and have it marked
up, the project anthropologist enlisted first-year art students
to submit their Facebook image data and classify images
donated by their peers to explore image categories. This ac-
tivity took place in the context of a design project aimed at
creating a novel way to share digital photos. The students
had the option to opt-out of the data collection and classifi-
cation, but most (22) participated, allowing a Facebook app
designed within the project to access their photos and asso-
ciated comments. A web interface presented random photos
to participants, who chose appropriate category labels.
3.2 Text mining
For each SMP a set of nouns, verbs and adjectives were
extracted from text comments provided by both the au-
thor and social media contributors. Topic discovery was
performed using LDA based on keyword co-occurrence [3],
using the set of keywords present for each annotated concept
group. The result was a set of 10 representative keywords for
each concept group, determined by SMP content. A binary
feature space was defined over the resulting 9 × 10 = 90-dimensional
set of keywords, with each dimension indicating the
presence of a word in SMP comments. Non-linear support
vector machines were trained in a one-vs-all pattern using
SMPs from the training class. A mean average precision
(MAP) of 32.0% was obtained (Fig. 2, green).
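The binary keyword features and one-vs-all SVMs can be sketched as follows (the keyword lists here are invented placeholders – the project derived ten LDA keywords per concept group – and only three of the nine groups are shown):

```python
import numpy as np
from sklearn.svm import SVC

# Invented keyword lists (two per group for brevity; the project used
# ten LDA-derived keywords for each of nine concept groups).
keywords = {
    "food":   ["dinner", "cake"],
    "sports": ["match", "goal"],
    "travel": ["flight", "beach"],
}
vocab = [w for words in keywords.values() for w in words]

def featurise(comment):
    # Binary vector: 1 where a group keyword occurs in the comment.
    tokens = set(comment.lower().split())
    return np.array([1.0 if w in tokens else 0.0 for w in vocab])

posts = [
    ("birthday cake after dinner", "food"),
    ("great goal in the match", "sports"),
    ("flight booked to the beach", "travel"),
    ("cake at the beach cafe", "food"),
    ("last minute goal", "sports"),
    ("beach sunset", "travel"),
]
X = np.array([featurise(text) for text, _ in posts])
y = [label for _, label in posts]

# One non-linear SVM per concept group, trained one-vs-all.
classifiers = {
    group: SVC(kernel="rbf").fit(X, [1 if g == group else 0 for g in y])
    for group in keywords
}

def classify(comment):
    x = featurise(comment).reshape(1, -1)
    return max(classifiers,
               key=lambda g: classifiers[g].decision_function(x)[0])

pred = classify("stunning goal tonight")
```

A post mentioning ‘goal’ scores highest under the sports classifier, since that dimension is active only in sports training examples.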
3.3 Visual mining
Dense color-SIFT features were extracted from each pho-
tograph in the SMP training set, and 10% of these were
clustered using k-means to quantize the feature space into
k = 2000 codewords. Histograms of codeword occurrence
were built for each training SMP yielding a Bag of Visual
Words [11] representation for the media, which were used
to train a non-linear support vector machine in a one-vs-all
pattern. A MAP of 25.3% was obtained (Fig. 2, blue).
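A minimal sketch of this encoding step (synthetic vectors stand in for dense colour-SIFT, and a small k replaces the k = 2000 used in the paper):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)

# Stand-in for dense colour-SIFT: each photo yields many local
# descriptors (16-D here; real SIFT descriptors are 128-D).
photos = [rng.normal(size=(300, 16)) for _ in range(12)]
all_desc = np.vstack(photos)

# Build the dictionary from a 10% sample of the descriptors, as in
# the text (a small k stands in for the k = 2000 used in the paper).
sample_idx = rng.choice(len(all_desc), len(all_desc) // 10,
                        replace=False)
k = 32
dictionary = KMeans(n_clusters=k, n_init=5, random_state=0)
dictionary.fit(all_desc[sample_idx])

# Bag of Visual Words: normalised histogram of codeword occurrences.
def bovw(descriptors):
    hist = np.bincount(dictionary.predict(descriptors), minlength=k)
    return hist / hist.sum()

histograms = np.array([bovw(p) for p in photos])
```

The resulting histograms are the per-photo descriptors on which the one-vs-all SVMs are trained.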
3.4 Multimodal fusion
An iterative fusion method adapted from Zhang et al.’s Co-
Trade [14] was used to integrate both the trained Text and
Visual SVM classifiers to produce a combined classifier in-
formed by both modalities. Training SMPs were partitioned
further, into a Training and Validation set on a 50:50 basis.
Validation set SMPs were labelled using the resulting classi-
fiers. A hyper-graph was built for each modality connecting
each training and validation SMP (node) with its K closest
neighbours (in our experiments K = 5), as measured by Eu-
clidean distance in the respective modality. Each node was
evaluated for label coherency with its immediate neighbours,
and ranked. An initially empty set of ‘coherent’ training
data was incrementally built by steadily adding the most
coherent nodes from the unified training and validation set,
until classification accuracy over validation data reached a
maximum. This resulted in a single, overall classifier with
performance comparable to (a fraction of a percentage point
greater than) that of the better of the two uni-modal classifiers.
This reflected the complementary performance of
those independent classifiers (Fig. 2).
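The coherency-ranking step can be sketched as follows (an illustrative simplification of the CoTrade-style procedure, for one modality only; features and labels are synthetic, and the stopping criterion – validation accuracy – is replaced by a fixed cut-off):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(2)

# Stand-ins for one modality: feature vectors and predicted labels
# over the pooled training + validation posts. Nodes 0-9 are planted
# as a tight, consistently-labelled cluster.
X = rng.normal(size=(60, 8))
labels = rng.integers(0, 3, size=60)
X[:10] = rng.normal(5.0, 0.2, size=(10, 8))
labels[:10] = 1

# Coherency of a node: fraction of its K nearest neighbours (in this
# modality's feature space) that share its label.
K = 5
nn = NearestNeighbors(n_neighbors=K + 1).fit(X)
_, idx = nn.kneighbors(X)  # idx[:, 0] is the node itself
coherency = np.array(
    [np.mean(labels[idx[i, 1:]] == labels[i]) for i in range(len(X))]
)

# Rank nodes by coherency and grow the 'coherent' training set from
# the top; in the full method, growth stops when validation accuracy
# peaks, whereas a fixed cut-off is used here.
order = np.argsort(-coherency)
coherent_set = order[:20]
```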
4. DISCUSSION AND CONCLUSIONS
Social media is highly diverse, complicating efforts to au-
tomatically categorise it. A system has been outlined to
attempt this through multi-modal fusion of visual and text
classifiers, yielding performance of 30% MAP on unstruc-
tured raw social media in the wild. Whilst encouraging, the
system does not yet generalise to accommodate unseen data
well – e. g. new keywords (‘Whiskey’ would not be regarded
as similar to the word ‘drink’, in the celebrations concept).
We are augmenting our system to incorporate WordNet as
a semantic distance measure in our text classifier to enable
such inferences. Visual classification is also a challenge, due
to noise in the training set and difficulty of obtaining a suf-
ficiently large quantity of training mark-up to counter the
high visual diversity. We are exploring domain adaptation
from web-based sources, and user-specific annotation (e. g.
in the form of user supplied ‘corrections’ to categorisations)
to improve accuracy. Our classification is applied to generate
visualizations that facilitate reflection on the digital lifespan.
As such, ‘perfect’ classification is not needed – rather,
a ‘mostly correct’ result, sufficient to visualise the general
trends in SMPs, is acceptable; such trends tend to be the most
effective form of data presentation within the technology
probes we continue to explore within CDL.
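The intended effect of a semantic distance measure can be illustrated with a toy hypernym graph (hand-built purely for illustration; a real implementation would query WordNet's hypernym hierarchy):

```python
# Toy hypernym graph standing in for WordNet (hand-built purely for
# illustration; a real system would use WordNet's hypernym hierarchy).
hypernyms = {
    "whiskey": "alcohol",
    "wine": "alcohol",
    "alcohol": "drink",
    "juice": "drink",
    "drink": "substance",
    "car": "vehicle",
    "vehicle": "artifact",
}

def path_to_root(word):
    path = [word]
    while path[-1] in hypernyms:
        path.append(hypernyms[path[-1]])
    return path

def semantic_distance(a, b):
    # Edges from a and b up to their lowest common ancestor; None if
    # the two words share no ancestor in the graph.
    path_a, path_b = path_to_root(a), path_to_root(b)
    for i, node in enumerate(path_a):
        if node in path_b:
            return i + path_b.index(node)
    return None

# 'whiskey' is now measurably close to 'drink', enabling the inference
# missed by exact keyword matching.
d_related = semantic_distance("whiskey", "drink")
d_unrelated = semantic_distance("whiskey", "car")
```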
Figure 1: The Admixed Portrait, used to promote
reflection amongst new parents on their digital rep-
resentations of self on social media (Facebook) [12].
Figure 2: Result of independent modalities used to
classify SMPs in the wild: Text (green) and Visual
(blue) yielded independent accuracies of 32.0% and
25.3% MAP respectively, and exhibited complemen-
tary performance across the categories. This indi-
cates value in pursuing a multi-modal SMP classifi-
cation strategy.
Acknowledgements
The Charting the Digital Lifespan (CDL) project is funded
by RCUK (EP/L00383X/1). The third author is addition-
ally supported by The Leverhulme Trust (ECF-2012-642).
5. REFERENCES
[1] Internet access - households and individuals 2012.
Technical report, Office of National Statistics, London,
UK, February 2013.
[2] J. Bleecker. Design fiction: A short essay on design,
science, fact and fiction.
http://nearfuturelaboratory.com/2009/03/17/
design-fiction-a-short-essay-on-design-science-fact-and-fiction.
Accessed: 2014-09-12.
[3] D. Blei, A. Ng, and M. Jordan. Latent dirichlet
allocation. Journal of Machine Learning Research, 3,
2003.
[4] K. Chatfield, K. Simonyan, A. Vedaldi, and
A. Zisserman. Return of the devil in the details:
Delving deep into convolutional nets. In Proceedings of
the British Machine Vision Conference (BMVC),
2014.
[5] N. Dalal and B. Triggs. Histograms of oriented
gradients for human detection. In Proceedings Intl.
Conf. on Computer Vision and Pattern Recognition
(CVPR), pages 886–893, 2005.
[6] T. Deselaers and V. Ferrari. Visual and semantic
similarity in imagenet. In Proceedings of the 2011
IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pages 1777–1784, 2011.
[7] H. A. Horst and D. Miller. Digital Anthropology. Berg,
London, 2012.
[8] H. Hutchinson, W. Mackay, B. Westerlund, A. Druin,
C. Plaisant, M. Beaudouin-Lafon, S. Conversy,
H. Evans, H. Hansen, N. Roussel, and B. Eiderback.
Technology probes: inspiring design for and with
families. In Proceedings of ACM CHI, pages 17–24,
2003.
[9] D. Lowe. Object recognition from local scale-invariant
features. In Proceedings Intl. Conf. on Computer
Vision (ICCV), pages 1150–1157, 1999.
[10] J. Sanchez, F. Perronnin, T. Mensink, and J. Verbeek.
Image classification with the fisher vector: Theory and
practice. Technical Report 8209, INRIA, Project
LEAR, Grenoble, France, May 2013.
[11] J. Sivic and A. Zisserman. Video Google: Efficient
visual search of videos. In J. Ponce, M. Hebert,
C. Schmid, and A. Zisserman, editors, Toward
Category-Level Object Recognition, volume 4170 of
LNCS, pages 127–144. Springer, 2006.
[12] D. Trujillo-Pisanty, A. Durrant, S. Martindale,
S. James, and J. Collomosse. Admixed portrait:
Reflections on being online as a new parent. In
Proceedings of ACM Designing Interactive Systems
(DIS), pages 503–512. ACM Press, 2014.
[13] P. Wright and J. McCarthy. Experience-Centred
Design: Designers, Users and Communities in
Dialogue. Morgan & Claypool, 2010.
[14] M. Zhang and Z. Zhou. Cotrade: Confident
co-training with data editing. IEEE Transactions on
Systems, Man, and Cybernetics, Part B: Cybernetics,
41(6), December 2011.