Exploring the urban environment with a camera phone:
Lessons from a user study
Norman Höller, Arjan Geven,
Manfred Tscheligi
CURE-Center for Usability Research
and Engineering
Hauffgasse 3-5, A-1110 Vienna
{hoeller, geven,
Lucas Paletta, Katrin
Amlacher, Patrik Luley
Joanneum Research, Steyrergasse
17-19, 8010 Graz
{lucas.paletta, katrin.amlacher,
Dusan Omercevic
University of Ljubljana
Kongresni trg 12,
1000 Ljubljana
ABSTRACT
We present a study investigating two novel mobile services
supporting querying for information in the urban environment
using camera equipped smart phones as well as two different ways
to visualize results – icon-based visualization and text-based
visualization. Both applications enable the user to access
information about an object by snapping a photo of it. We
investigate how users would use a photo-based tourist guide in a
free exploration setting in general as well as the
acceptance/preference of two different ways to visualize results.
Categories and Subject Descriptors
H.5.2 [User Interfaces]: GUI, Interaction Styles
General Terms
Performance, Design, Experimentation, Human Factors.
Keywords
mobile devices, computer vision, augmented reality
1. INTRODUCTION
When we travel to unfamiliar cities and places, we use a tourist
guide or the Internet to get information about buildings, streets,
restaurants, and places to shop. This work presents two systems
that support ubiquitous interaction with the real world: running on
a lightweight camera phone, they provide immediate access to virtual
information spaces and act as a reading glass for the stories behind
the urban environment.
The two applications presented enable users to get information
about things they see (e.g., buildings, neighborhoods) simply by
taking a photograph of them. They return information about the
photographed object and its surroundings in distinct ways. The
two systems are (a) "Object Recognition" (OR), built around
geo-indexed object recognition, and (b) "Hyperlinking Reality"
(HR), built around purely image-based recognition.
2. RELATED WORK
Mobile image-based recognition and localisation have recently
been proposed as mobile vision services supporting urban nomadic
users. HPAT (hyper-polyhedron with adaptive threshold) indexing
was one of the first attempts at building identification, proposing
local affine features for object matching [5]. An image-retrieval
methodology that indexes visually relevant information from the web
for mobile location recognition was introduced in [6]. Exploiting
knowledge about the current geo-context in a probabilistic framework,
using attentive, geo-indexed object recognition, was demonstrated
by [1]. Powerful but computationally demanding computer vision
techniques based on local invariant features are described in [3];
this approach to image matching was pioneered by [4] and [3].
Work similar to ours focused on user behavior with regard to embodied
interaction with a mobile photo-annotation system [2], which was
evaluated with a guided tour on a campus.
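The local-feature pipelines cited above all hinge on nearest-neighbour matching of invariant descriptors. As a rough, self-contained illustration (our own sketch using toy 2-D descriptors, not the actual implementation of [3] or [4]), the following applies the widely used nearest-neighbour ratio test:

```python
import numpy as np

def ratio_test_matches(query_desc, ref_desc, ratio=0.8):
    """Nearest-neighbour matching with Lowe's ratio test: a query
    descriptor is matched only if its nearest reference descriptor is
    clearly closer than the second nearest (needs >= 2 references)."""
    matches = []
    for i, q in enumerate(query_desc):
        dists = np.linalg.norm(ref_desc - q, axis=1)  # distance to every reference
        j1, j2 = np.argsort(dists)[:2]                # two nearest neighbours
        if dists[j1] < ratio * dists[j2]:             # keep unambiguous matches only
            matches.append((i, int(j1)))
    return matches
```

In practice the descriptors would be high-dimensional SIFT vectors [3], and the brute-force search would be replaced by an efficient index such as HPAT [5].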
3. SYSTEM DESCRIPTION
We evaluated two systems: Object Recognition (OR) and
Hyperlinking Reality (HR).
3.1 Object Recognition (OR)
When the user takes a picture with the OR application, s/he
receives a picture of that very object annotated with text and
further links (Figure 1a). The server's matching algorithm cuts
down the visual search space to a subset of relevant object
hypotheses based on contextual processing of multi-sensor
information [1].
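The geo-indexed pruning can be illustrated with a minimal sketch. All names here (`geo_prune`, the 150 m radius, the object records) are our own illustrative assumptions, not the system's actual parameters; the idea is simply that only reference objects near the phone's GPS fix remain candidate hypotheses for visual matching:

```python
from math import asin, cos, radians, sin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 coordinates."""
    r = 6371000.0  # mean Earth radius in metres
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * r * asin(sqrt(a))

def geo_prune(objects, gps_fix, radius_m=150.0):
    """Keep only the reference objects within radius_m of the GPS fix;
    visual matching then runs on this reduced hypothesis set."""
    lat, lon = gps_fix
    return [o for o in objects
            if haversine_m(lat, lon, o["lat"], o["lon"]) <= radius_m]
```

The reduced hypothesis set would then be handed to the appearance-based recognizer, which is where geo-context yields its gains in recognition accuracy and computation [1].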
3.2 Hyperlinking Reality (HR)
A picture taken with the Hyperlinking Reality (HR) application is
returned with icons placed on the annotated objects; selecting an
icon reveals information about that object (Figure 1b). Despite
their common purpose, the two applications differ in several
respects, which are summarized in Table 1.
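Conceptually, HR registers the query image against annotated reference panoramas and transfers the annotation positions into the query image. The following is a deliberately simplified stand-in (our own sketch; the real registration uses local invariant features and panorama geometry rather than a plain scale-and-translation fit): given matched keypoint pairs, it maps icon coordinates from the reference into the query image.

```python
import numpy as np

def transfer_icons(ref_pts, query_pts, icon_pts):
    """Map annotation icon positions from a reference image into the
    query image via a least-squares scale-and-translation fit to the
    matched keypoints (assumes >= 2 distinct reference points)."""
    ref = np.asarray(ref_pts, float)
    qry = np.asarray(query_pts, float)
    ref_c, qry_c = ref.mean(axis=0), qry.mean(axis=0)
    # isotropic scale: ratio of point spreads around the two centroids
    s = np.linalg.norm(qry - qry_c) / np.linalg.norm(ref - ref_c)
    t = qry_c - s * ref_c  # translation aligning the centroids
    return s * np.asarray(icon_pts, float) + t
```

Because the icons are drawn directly on the query image, the user can tap them on the phone's screen to follow the associated hyperlinks.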
We addressed the following research questions:
- Which buildings do participants choose to photograph, and how
do they take these photographs?
- What is the preferred visualization of results?
- How much feedback do users need to receive while the system is
processing the information? How important is response time for
users of these systems?
- In which real-life applications can users imagine using this
technology? What are areas for improvement and further
development?
Copyright is held by the author/owner(s).
MobileHCI'09, September 15-18, 2009, Bonn, Germany.
To answer these research questions, we created a setting in which 16
participants (9 female, 7 male; aged 22 - 30) could experiment
with the technology in an urban environment measuring
approximately 400 by 100 meters.
Thinking aloud (transmitted over Bluetooth) and shadowing were
used to understand what each user was doing and thinking during
the free exploration.
A short on-site interview was conducted with each user after the
exploration to assess general first impressions of the systems.
Additionally, two focus groups were held.
Participants took an average of 7.1 photos with the OR
application and 4.6 photos with the HR application. Thirteen of
the 16 users reported interest in installing the application on
their own mobile phone if it were available.
The time that the system needed to generate results was
considered almost acceptable for OR, but too long for HR by some
users. In particular, the lack of feedback during server-side
processing made it difficult for users to know what was going on
and caused irritation.
The icon-based visualization of results in the HR application
(Figure 1b) was preferred by users over the text-based
visualization in the OR application (Figure 1a). Different
possible usage scenarios were mentioned during the focus groups,
mainly in the context of shopping and concert tickets.
Users reacted positively to the applications and were highly
motivated to take advantage of the intuitive interface, offering
some important remarks regarding technical features (response
time, reliability), information visualization, and future
applications of the technology.
ACKNOWLEDGMENTS
This work is part of the project MOBVIS and has been partly
funded by the European Commission under contract IST-FP6-
511051. We would like to thank all partners of the MOBVIS
consortium and the European Commission for their support.
Figure 1: Annotated photos from the OR (a) and HR (b)
applications
REFERENCES
[1] Amlacher, K. and Paletta, L. (2008). An Attentive Machine
Interface Using Geo-Contextual Awareness for Mobile Vision
Tasks. Proc. European Conference on Artificial Intelligence,
ECAI 2008, pp. 601-605.
[2] Cuellar G., Eckles D. and Spasojevic M. (2008). Photos for
Information: A Field Study of Cameraphone Computer
Vision Interactions in Tourism. Proc. CHI 2008.
[3] Lowe, D.G. (2004). Distinctive image features from scale-
invariant keypoints. IJCV 60(2), pp. 91-110.
[4] Schmid, C. and Mohr, R. (1997). Local gray-value invariants
for image retrieval. IEEE PAMI 19(5), pp. 530-535.
[5] Shao, H., Svoboda, T., and van Gool, L. (2003). HPAT
indexing for fast object/scene recognition based on local
appearance. Proc. International Conference on Image and
Video Retrieval, CIVR 2003, pp. 71–80.
[6] Yeh, T., Tollmar, K., and Darrell, T. (2004). Searching the
web with mobile images for location recognition. Proc. IEEE
Computer Vision and Pattern Recognition, CVPR 2004.
Table 1: Comparison of usability aspects between the
investigated system functionalities

Aspect                      | Object Recognition (OR)                                | Hyperlinking Reality (HR)
Annotation of urban objects | Single objects (e.g., facades) selected by the user    | Multiple objects in the urban environment
Visualization of annotation | List of information including URLs in response message | Icon-based annotation with URLs, directly on query image
Typical response time       | ~15 sec. (1 MP image, 1 CPU)                           | ~50 sec. (3 MP image, 8-core CPU)
Use of geo-information      | Geo-indexed object recognition                         | Purely image-based annotation
Localisation                | GPS-based position, without annotation                 | Position and orientation of user derived from a single image