MOBILE AUGMENTED REALITY FOR BOOKS ON A SHELF
David Chen1, Sam Tsai1, Cheng-Hsin Hsu2, Jatinder Pal Singh2, Bernd Girod1
1Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
Email: {dmchen, sstsai, bgirod}@stanford.edu
2Deutsche Telekom R&D Laboratories USA, Los Altos, CA 94022, USA
Email: {cheng-hsin.hsu, j.singh}@telekom.com
ABSTRACT
Retrieving information about books on a bookshelf by snapping a
photo of book spines with a mobile device is very useful for book-
stores, libraries, offices, and homes. In this paper, we develop a new
mobile augmented reality system for book spine recognition. Our
system achieves very low recognition delays, around 1 second, to
support real-time augmentation on a mobile device’s viewfinder. We
infer user interest by analyzing the motion of objects seen in the
viewfinder. Our system initiates a query during each low-motion in-
terval. This selection mechanism eliminates the need to press a but-
ton and avoids using degraded motion-blurred query frames during
high-motion intervals. The viewfinder is augmented with a book’s
identity, prices from different vendors, average user rating, location
within the enclosing bookshelf, and a digital compass marker. We
present a new tiled search strategy for finding the location in the
bookshelf with improved accuracy in half the time of a previous
state-of-the-art system. Our AR system has been implemented on an
Android smartphone.
Index Terms— Mobile Visual Search, Mobile Augmented Reality, Book Spine Recognition
1. INTRODUCTION
Many visual search applications [1, 2, 3, 4] now enable mobile de-
vices to retrieve information about products simply by snapping a
photo. Typically, robust image-based features like SIFT [5], SURF
[6], or CHoG [7] are extracted from the photo and matched against
an online database, yielding accurate retrieval results even in the
presence of photometric and geometric distortions.
There is also growing interest in applications that continuously
augment the mobile device’s video viewfinder with relevant informa-
tion about the objects currently visible. Existing augmented reality
(AR) applications use a smartphone’s camera, digital compass, and
Global Positioning System (GPS) sensors to create virtual layers on
top of building facades and product packages seen in the phone’s
viewfinder [8, 2, 9, 10]. Low-latency, robust augmentation is often
achieved using a combination of server-side visual search and client-
side visual tracking.
In this paper, we develop a mobile AR system for a class of
objects not considered in previous AR systems: book spines. Au-
tomatic book spine recognition is useful for generating an inventory
of books and for retrieving information about a book without tak-
ing it off the bookshelf. However, identifying individual book spines
in a bookshelf rack photo is challenging because each spine has a
small area relative to the whole image and other spines act as clutter.
Lee et al. [11] quantize each spine’s colors and match spines by
color indices. Quoc and Choi [12] segment spine regions and extract
titles by optical character recognition (OCR). Crasto et al. [13]
deploy a calibrated projector-camera system to track the books that
are removed from a shelf. In our previous work [14, 15], we used
a combination of line-based spine segmentation and feature-based
image retrieval to recognize book spines which are photographed in
arbitrary orientations and under various lighting conditions. These
prior systems all have recognition latencies of at least several sec-
onds, making them less suitable for real-time mobile AR.

Fig. 1: Our new mobile augmented reality system for book spine
recognition. (Top View) The user points the magnifying glass at
a particular book spine. (Bottom View) About 1 second later,
the viewfinder is augmented with the book’s title, prices from
competing vendors, an average user rating in stars, and a yellow
box highlighting the location in the larger bookshelf. The dig-
ital compass arrow in the lower right-hand corner continuously
shows the direction in which the phone is pointing. Demo video:
http://www.youtube.com/watch?v=fWOw2K1TzFk
As depicted in Fig. 1, our new mobile AR system enables a user
to point the camera at a book spine and see the book’s title, prices
from competing vendors, and an average user rating augmented in
the video viewfinder after about 1second. Optionally, images of the
book’s front and back covers can also be shown in the viewfinder to
provide more information about the book. To show the location of
the books currently visible in the viewfinder, we provide two visual
aides: (1) a thumbnail of the surrounding bookshelf is displayed on
the left side of the viewfinder and a yellow box highlights where the
books are placed in the bookshelf, and (2) a digital compass arrow
is drawn in the lower right-hand corner indicating the direction in
which the phone is pointing. As another possible augmentation, our
system plays an audio review of the book using the phone’s text-to-
speech function.
On the mobile device, the motion of objects seen in the
viewfinder is analyzed to detect periods of low motion, when the
user is likely interested in the contents of the viewfinder. At the start
of each low-motion interval, a new query is triggered. Since user
interest is automatically inferred, there is no need to press a button
to initiate a query. This selection mechanism also has the positive
effect of avoiding query frames severely degraded by motion blur,
which occur when the user rapidly moves the phone. On the server,
the spines are segmented from the query frame, and each spine is ef-
ficiently matched against a database of spine images by vocabulary
tree scoring [16] and RANSAC-based geometric verification [17] on
a shortlist of database candidates. To determine the precise loca-
tion within the surrounding bookshelf, the query frame which shows
spines on a single rack is matched against an image which shows the
whole bookshelf.
The remainder of the paper is organized as follows. Sec. 2
gives background on line-based spine segmentation and feature-
based spine recognition algorithms. Then, Sec. 3 presents our new
mobile AR system, introducing our intuitive user interface, explain-
ing how a user’s intent to focus on a new book spine is inferred from
the motion of objects seen in the viewfinder, and describing how
fast rack-to-shelf matching is performed through a tile-based search
scheme. Experimental results in Sec. 4 show the performance and
advantages of our methods. Finally, Sec. 5 concludes the paper.
2. MOBILE BOOK SPINE RECOGNITION
In the book spine recognition system of [15], the user has to press a
button on the mobile device to initiate a new query. A photo is taken
by the onboard camera and transmitted over a wireless network (e.g.,
WLAN, 3G, 4G) to a server which contains a large database of la-
beled book spines. Matching the query photo directly against the
database of spines yields poor retrieval results, because the spines
in the query photo act as clutter toward one another. Thus, the
spines in the query photo are first segmented by detecting edges
and finding long, straight edges of similar orientation, correspond-
ing to the boundaries between book spines. Then, robust image-
based features are extracted from the individually segmented spines
and matched against the database of spines using vocabulary tree
scoring [16] and RANSAC-based geometric verification [17] on a
shortlist of database candidates. The recognized spines’ identities
and boundaries are sent back to the mobile device and displayed on
the viewfinder.
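To make the shortlist-then-verify stage concrete, the following is a minimal Python sketch of matching one segmented spine against the database, under assumptions stated in the comments: extract_features, ratio_test_matches, and the vt.score vocabulary-tree interface are hypothetical stand-ins for the components of [15, 16], and OpenCV's RANSAC affine estimator plays the role of the geometric verification step [17]. It illustrates the pipeline rather than reproducing the actual server code.

```python
import numpy as np
import cv2  # OpenCV, used here only for RANSAC-based affine verification


def recognize_spine(spine_img, database, vt, shortlist_size=50, min_inliers=25):
    """Shortlist-then-verify retrieval for one segmented spine image.

    Assumptions (illustrative, not the paper's implementation):
      - extract_features(img) returns (keypoints as an Nx2 array of (x, y),
        descriptors as an NxD array), e.g. upright SURF.
      - vt.score(descriptors) returns one vocabulary-tree similarity score
        per database spine [16].
      - ratio_test_matches(q, d) returns tentative (query_idx, db_idx) pairs.
    """
    kp_q, desc_q = extract_features(spine_img)
    scores = vt.score(desc_q)                              # coarse ranking
    shortlist = np.argsort(scores)[::-1][:shortlist_size]  # top candidates

    best_id, best_inliers = None, 0
    for db_id in shortlist:
        kp_d, desc_d = database[db_id].features
        matches = ratio_test_matches(desc_q, desc_d)
        if len(matches) < 4:
            continue
        src = np.float32([kp_q[i] for i, _ in matches])
        dst = np.float32([kp_d[j] for _, j in matches])
        # Geometric verification with RANSAC and an affine model [17]
        _, inlier_mask = cv2.estimateAffine2D(src, dst, method=cv2.RANSAC)
        n_inliers = int(inlier_mask.sum()) if inlier_mask is not None else 0
        if n_inliers > best_inliers:
            best_id, best_inliers = db_id, n_inliers

    # Accept only well-verified matches to keep precision high
    return best_id if best_inliers >= min_inliers else None
```

The default of 25 verified matches mirrors the acceptance threshold reported later in Sec. 4.2; the shortlist size of 50 matches the setting described above.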
Compared to the system reported in [15], our new AR system
has several new features and important advantages:
• There is no need to press a button, as user interest is automat-
ically inferred by analyzing the motion of objects shown in
the viewfinder.
• Recognition latency is reduced from about 3 seconds in the
previous system to about 1 second in the new system, by
quickly selecting a query frame from viewfinder frames at
the start of a low-motion interval.
• The location of the current books is highlighted in a thumb-
nail of the bookshelf in the viewfinder, whereas the previous
system just stored this location on the server. Finding the
location is also made faster through a new tile-based search
scheme.
With these improved features, the new AR system supports substan-
tially greater interactivity and faster response.
3. MOBILE AUGMENTED REALITY SYSTEM FOR BOOK
SPINE RECOGNITION
Fig. 2: Block diagram of our mobile augmented reality system.
A block diagram of our mobile AR system is drawn in Fig. 2.
On the mobile device, motion analysis is performed on viewfinder
frames, and a query frame is captured during each low-motion inter-
val and transmitted to a server. On the server, to identify the book
spines shown in the query frame, the spines are segmented and rec-
ognized using the methods of [15] as summarized in Sec. 2. The
titles, authors, prices, and ratings of recognized spines are retrieved
from a database and sent back to the mobile device. Meanwhile,
feature-based image matching between the query frame and a photo
of the whole shelf previously taken enables us to precisely determine
the location of the book spines in the surrounding shelf. Coordinates
representing the location of the books are also sent back to the mo-
bile device.
3.1. Motion Analysis for Initiating Queries
As the user rapidly moves the smartphone, the user is most likely
not interested in the viewfinder’s contents during this high-motion
period. Conversely, during a low-motion period, the user is likely
interested in the viewfinder’s contents. Our system initiates a new
query at the beginning of each low-motion period by uploading a
viewfinder frame of 640 × 480 pixels to the server. Among all the
recognized book spines, the center-most spine has its information
augmented in the viewfinder.
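As a small illustration of that last step, this sketch picks the center-most spine from a set of recognition results; the result format, a bounding box per recognized spine, is our own assumption for illustration.

```python
def centermost_spine(spines, frame_w=640, frame_h=480):
    """Pick the recognized spine whose bounding-box center lies closest to
    the center of the viewfinder frame.

    `spines` is assumed to be a list of dicts such as
    {"title": "...", "bbox": (x_min, y_min, x_max, y_max)} in frame pixels.
    """
    cx, cy = frame_w / 2.0, frame_h / 2.0

    def dist_sq_to_center(spine):
        x0, y0, x1, y1 = spine["bbox"]
        bx, by = (x0 + x1) / 2.0, (y0 + y1) / 2.0
        return (bx - cx) ** 2 + (by - cy) ** 2

    return min(spines, key=dist_sq_to_center) if spines else None
```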
The speed at which the smartphone is moving can be reliably
estimated by the motion of objects seen in the viewfinder. This mo-
tion is computed by extracting and tracking Rotation Invariant Fast
Features (RIFF) [10] from viewfinder frames captured at 15 Hz. We
demonstrate our motion analysis technique on two test viewfinder
sequences1,2 captured with a Motorola Droid smartphone. The first
sequence contains 9 different low-motion intervals, separated by 9
different high-motion intervals. Within each low-motion interval,
there is a fair amount of hand jitter. The second sequence contains
17 different low-motion intervals, and most of them are shorter in
duration than those in the first sequence.

Fig. 3: Statistics for two different viewfinder sequences. (a1, a2) Number of
tracked RIFF features between viewfinder frames. (b1, b2) Classification of
motion into low and high states. (c1, c2) Number of SURF features for
viewfinder frames.
Fig. 4: Finite state machine for determining how to transition be-
tween low-motion and high-motion states on the mobile device.
Fig. 3(a1,a2) show traces of the number of tracked RIFF fea-
tures for both sequences. Since the raw trace is very noisy, a
median filter with a window of 7 samples is applied for more
stable motion estimation. If 𝑅[𝑘] denotes samples in the raw
trace, samples in the median-filtered trace are given by
𝑀[𝑘] = median({𝑅[𝑘+𝛿] : 𝛿 = −3, …, 3}). Since 𝑀[𝑘] depends on the
future samples 𝑅[𝑘+1], 𝑅[𝑘+2], 𝑅[𝑘+3] and the samples are collected
at 15 Hz, a small delay of 200 milliseconds is incurred compared to
directly using 𝑅[𝑘]. RIFF uses FAST corner keypoints [18] whose
repeatability decreases sharply when there is motion blur, so a low
(high) number of tracked features indicates a period of high (low)
motion. A low (high) threshold is determined so that the number of
tracked features during high-motion (low-motion) intervals lies below
(above) the low (high) threshold. Subsequently, we use the finite
state machine (FSM) in Fig. 4 to switch between low-motion and
high-motion states. Having two thresholds instead of one is important
to prevent rapid switching between states in a short duration due to
noise, and the distance between the low and high thresholds is scaled
in relation to the standard deviation of the noise in the median-
filtered trace. The motion classifications given by the FSM are
plotted in Fig. 3(b1,b2).

1 http://www.youtube.com/watch?v=9Py1Q0jz6DQ
2 http://www.youtube.com/watch?v=RpGtpLOikdk

Fig. 5: Viewfinder frames selected from (a) low-motion and (b) high-
motion intervals.
Fig. 5 shows two frames, one selected from a low-motion interval
and the other from a high-motion interval. As can be observed, the
low-motion frame has more clearly defined details, while the high-
motion frame suffers from motion blur which can severely degrade
the line-based spine segmentation and feature-based spine recogni-
tion methods. Fig. 3(c1,c2) show traces of the number of SURF [6]
features in both test sequences. During each high-motion period,
there is a significant drop in the number of SURF features due to
motion blur. A frame with few SURF features is likely to yield an
inaccurate image retrieval result. Thus, our choice to initiate a query
during a low-motion interval not only corresponds to a period of
very probable user interest, but also avoids selecting useless blurry
frames.
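The motion classifier described in this section can be summarized in a short sketch: median-filter the per-frame counts of tracked RIFF features and run a two-threshold (hysteresis) state machine over the filtered trace, issuing a query at each high-to-low transition. The threshold values below are illustrative placeholders, not the tuned values used in our system.

```python
def classify_motion(raw_counts, window=7, low_thresh=10, high_thresh=20):
    """Label each frame 'LOW' or 'HIGH' motion from tracked-feature counts.

    raw_counts: number of tracked RIFF features per viewfinder frame (15 Hz).
    A centered median filter of length `window` smooths the trace; the two
    thresholds provide hysteresis so that noise around a single threshold
    cannot cause rapid state flipping.  Thresholds here are placeholders.
    """
    half = window // 2
    states, state = [], "HIGH"          # start conservatively in high motion
    for k in range(len(raw_counts)):
        # Centered median window; using `half` future samples is what incurs
        # the 3/15 s = 200 ms delay mentioned in the text.
        lo, hi = max(0, k - half), min(len(raw_counts), k + half + 1)
        m_k = sorted(raw_counts[lo:hi])[(hi - lo) // 2]

        if state == "HIGH" and m_k > high_thresh:
            state = "LOW"               # many features tracked -> low motion
        elif state == "LOW" and m_k < low_thresh:
            state = "HIGH"              # features lost to blur -> high motion
        states.append(state)
    return states


def query_frames(states):
    """Indices where a query would be triggered: each HIGH-to-LOW transition."""
    return [k for k in range(1, len(states))
            if states[k] == "LOW" and states[k - 1] == "HIGH"]
```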
3.2. Fast Tiled Search for Rack-to-Shelf Matching
As a user identifies books with our AR application, an inventory pro-
gram on the server records all the books queried by the user. The in-
ventory information currently includes (1) location-agnostic details
such as the book titles, authors, prices, user ratings, and reviews, and
(2) location-aware details such as the direction that a person should
be facing in a room to see the books and the specific position of a
set of books within the surrounding bookshelf. When a query is ini-
tiated, we compute the phone’s direction from the onboard magnetic
field sensors. The estimated direction is shown as a digital compass
arrow in the lower right-hand corner of the viewfinder (see Fig. 1).
In this section, we focus on the more challenging problem of pre-
cisely locating books within the surrounding bookshelf. Note that
the methods discussed in this section are not used to recognize the in-
dividual book spines in a query viewfinder frame; spine recognition
is performed using the vocabulary tree scoring and RANSAC-based
geometric verification methods described in Sec. 2.
To localize the books currently visible in the viewfinder within
the surrounding bookshelf, two types of approaches are possible:
(1) location estimation based on a recent trace of the accelerometer
readings and knowledge of an anchor point, and (2) location estima-
tion based on matching the viewfinder frame against an image of the
whole bookshelf. Both approaches have been previously evaluated
[15], and the image-based approach has been found to give notice-
ably higher localization accuracy. In this section, we describe a new
image-based localization strategy that is faster and more accurate
than the method in [15].
Before querying individual books, the user takes a 960 × 1280
photo that shows the entire bookshelf (e.g., Fig. 6(a)). This book-
shelf photo 𝐼shelf can be repeatedly reused for localization purposes,
even if a small number of books are subsequently removed from or
misplaced in the bookshelf. 𝐼shelf is only retaken when we focus on
a new shelf or when the contents of the current shelf change signif-
icantly. Each 640 × 480 query frame 𝐼query (e.g., Fig. 5(a)) shows a
particular rack in the shelf. Feature-based image matching between
𝐼query and 𝐼shelf allows us to precisely localize where the spines in
𝐼query reside within the whole bookshelf shown in 𝐼shelf. Note that
𝐼shelf is used for localization only and is not used to recognize indi-
vidual spines in 𝐼query.
The system in [15] used all the local feature descriptors in 𝐼shelf
to build a k-d tree. For each descriptor in 𝐼query , the first and second
nearest descriptors in 𝐼shelf are found by searching the k-d tree, and a
tentative match is formed with the first nearest descriptor in 𝐼shelf if a
distance ratio test is passed [5]. Tentative matches are then verified
using RANSAC with an affine model. We refer to this scheme as
Full Search.
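A compact sketch of this Full Search stage is shown below, using a SciPy k-d tree and the distance ratio test of [5]; the 0.8 ratio is an illustrative choice rather than a value taken from our system, and the returned tentative matches would then be verified with RANSAC and an affine model.

```python
from scipy.spatial import cKDTree


def full_search_matches(desc_query, desc_shelf, ratio=0.8):
    """Tentative matches between query-frame and whole-shelf descriptors.

    Builds one k-d tree over all shelf descriptors and applies the distance
    ratio test [5] to every query descriptor.  Returns (query_idx, shelf_idx)
    pairs; geometric verification is performed afterwards.
    """
    tree = cKDTree(desc_shelf)
    dists, idxs = tree.query(desc_query, k=2)   # 1st and 2nd nearest neighbors
    matches = []
    for qi in range(len(desc_query)):
        d1, d2 = dists[qi]
        if d1 < ratio * d2:                     # distinctive enough -> keep
            matches.append((qi, int(idxs[qi, 0])))
    return matches
```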
Although 𝐼query covers only a portion of 𝐼shelf, Full Search po-
tentially compares every descriptor in 𝐼query to every descriptor in
𝐼shelf. Thus, many descriptors in 𝐼shelf act as outliers, making the
matching process less accurate and slower. We address this problem
with a new Tiled Search strategy, depicted in Fig. 6(b). First, long
nearly horizontal edges are detected in 𝐼shelf to find the boundaries
between racks in the bookshelf. Second, each rack is split into 𝐶rack
nonoverlapping tiles of equal width, where 𝐶rack is an adjustable sys-
tem parameter.

Fig. 6: (a) Image of the whole bookshelf with feature keypoints over-
laid. (b) Same image split into 𝐶rack = 2 tiles per rack.

Fig. 6(b) illustrates a sample 3-rack bookshelf with 𝐶rack = 2 tiles
per rack. For each tile, all the descriptors falling
within that tile are used to build a k-d tree specific to that tile. Next,
we exploit the fact that consecutive query frames tend to cover dif-
ferent portions of the same rack. If the previous query frame was
matched to a tile in the 𝑖th rack, for the current query frame, we first
search the tiles in the 𝑖th rack and terminate the search if the number
of post-RANSAC inliers exceeds a threshold 𝑇RANSAC ; no false pos-
itive image matches are ever observed for a sufficiently high value
of 𝑇RANSAC. Only if fewer than 𝑇RANSAC inliers are found does the
search continue into tiles in the other racks. As we will show in
Sec. 4.3, Tiled Search significantly reduces the rack-to-shelf match-
ing latency while actually giving a slight boost in matching accuracy
compared to Full Search. We will also show empirically the tradeoff
between search latency and number of feature matches as the param-
eter 𝐶rack is varied.
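The search order and early-termination rule of Tiled Search can be sketched as follows. The per-tile feature sets and the ransac_inliers helper are assumptions standing in for the rack-boundary detection, per-tile k-d trees, and RANSAC verification described above; full_search_matches is the ratio-test matcher sketched earlier.

```python
def tiled_search(desc_query, kp_query, tiles, prev_rack, t_ransac=50):
    """Locate the query rack within the shelf image via per-tile matching.

    Assumptions for illustration:
      - `tiles` maps rack index -> list of tiles, each tile a dict holding
        the shelf keypoints ("kp") and descriptors ("desc") inside it.
      - ransac_inliers(kp_q, kp_t, matches) is a hypothetical helper that
        returns the geometrically verified subset of tentative matches.
    """
    def verified_matches(tile):
        tentative = full_search_matches(desc_query, tile["desc"])
        return ransac_inliers(kp_query, tile["kp"], tentative)

    # Search the rack matched by the previous query frame first, since
    # consecutive frames tend to show different parts of the same rack.
    rack_order = [prev_rack] + [r for r in sorted(tiles) if r != prev_rack]

    best = (0, None, None)                       # (inlier count, rack, tile)
    for rack in rack_order:
        for t_idx, tile in enumerate(tiles[rack]):
            inliers = verified_matches(tile)
            if len(inliers) > best[0]:
                best = (len(inliers), rack, t_idx)
            if len(inliers) >= t_ransac:
                # Early termination: a sufficiently well-verified tile match
                # is reliable, so the remaining tiles need not be searched.
                return best
    return best
```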
4. EXPERIMENTAL RESULTS
4.1. Recognition Latency
In this section, we report the performance of our new AR system and
show it has much lower recognition latency than the system reported
in [15]. Both systems use a Motorola Droid smartphone running An-
droid 2.1 on a 550 MHz processor. The recognition server has a 3.2
GHz processor. This server performs line-based spine segmentation,
extraction of upright SURF features [6], vocabulary tree scoring [16]
with a set of 1 million visual words, soft binned quantization [19],
and RANSAC-based geometric verification [17] on a shortlist of the
top 50 candidates out of a database of 2148 labeled book spines.
Query viewfinder frames are uploaded over a WiFi network with 1
Mbps transfer rate; our system would very likely be deployed in a
library, bookstore, office, or home with a WiFi network.
Fig. 7 compares the latencies for different operations in the pre-
vious system [15] and our new AR system. Both systems are tested
on a set of 40 rack images (all 640×480 resolution) which are avail-
able online3. This collection also includes a 960×1280 image show-
ing the entire surrounding bookshelf, where the shelf contains all the
books shown in the 40 rack images. Book spines are photographed
in different orientations and under different lighting conditions.
3http://tinyurl.com/3k9skw2
First, for image capture, the previous system initiates a photo
capture operation after the user presses a button, a process that takes
2 seconds on average. When the camera shutter closes during photo
capture, the viewfinder screen also turns black momentarily, which is
an undesirable effect for continuous AR. In contrast, our AR system
captures a viewfinder frame at the beginning of a low-motion period,
taking 200 milliseconds to collect enough samples for the median-
filtered trace and 100 milliseconds to copy a query frame into an
upload buffer, with no interruption of the viewfinder stream. Sec-
ond, the latencies for image upload, line-based spine segmentation,
and feature-based spine recognition are similar in the two systems.
Then, the rack-to-shelf matching method is faster in our new system
because we use a more efficient Tiled Search compared to the Full
Search used in the prior system. In total, recognition latency is re-
duced from about 3seconds in the previous system to about 1second
in the new system. The low latency of the new system is very im-
portant for supporting real-time AR. Interestingly, since queries are
triggered automatically, rather than by a conscious user input, the re-
maining 1 second latency is hardly noticeable. Recognition results
“magically” appear, as soon as the user hovers over the book spine
of interest.
[Fig. 7 bar charts: latency in milliseconds of the previous and proposed systems for image capture, image upload, spine segmentation, spine recognition, rack-to-shelf matching, and the entire system.]
Fig. 7: Comparison of latencies for different operations between the
previous system [15] and our newly proposed system. The error bars
indicate standard deviations.
4.2. Recognition Accuracy
For book spine recognition, we use the retrieval system described in
[15]. Each query spine in the aforementioned 40 test rack images is
matched against the database of 2148 labeled spines, by vocabulary
tree scoring and RANSAC-based geometric verification on a short-
list of 50 database spine candidates. If at least 𝑇RANSAC = 25 feature
matches are found between the query spine and the best database
candidate, a good match is deemed to be found and the information
for that matching database spine is retrieved to be displayed on the
phone’s screen. With these settings, we achieve 80 percent recall
and 95 percent precision in identifying all the spines shown in the
40 test rack images. To avoid returning false positives to users, it
is important to attain high precision at the expense of slightly lower
recall.
4.3. Rack-to-Shelf Matching
Fig. 8: Statistics for localizing query book spines within an image
of the whole shelf. (a) Cumulative distribution function (CDF) for
number of feature matches. (b) CDF for search latency. (c) Number
of feature matches versus search latency, as the number of tiles 𝐶rack
per rack is varied.
In Sec. 3.2, we described the Full Search and Tiled Search meth-
ods for matching a query image to an image of the entire bookshelf.
For the same 40 test images, the distribution of the number of fea-
ture matches between a rack image and the larger bookshelf images
with 𝐶rack = 2 is plotted in Fig. 8(a), where it can be seen that
Tiled Search and Full Search perform comparably. Tiled Search
obtains 115 feature matches on average, slightly higher than the
111 matches obtained on average by Full Search, due to the avoid-
ance of outliers in bookshelf regions distant from the current rack.
In our design, Tiled Search will terminate whenever any particular
tile in the bookshelf image matches the rack image with more than
𝑇RANSAC = 50 post-RANSAC inliers. Due to this early termination
option, Tiled Search significantly reduces the latency compared to
Full Search, as shown in Fig. 8(b). On average, Tiled Search takes
54 milliseconds per query image compared to 107 milliseconds for
Full Search.
The parameter 𝐶rack can be adjusted to reduce search latency or
increase the number of feature matches. Fig. 8(c) shows this tradeoff
for 𝐶rack = 2, 3, 4. Having fewer tiles per rack causes each tile to be-
come wider, which increases the number of feature matches between
a whole bookshelf image and a query frame, but also increases the
image matching latency. We observe that using 𝐶rack = 4 tiles still
yields a decent number of feature matches while cutting the latency
by 35 percent compared to 𝐶rack = 2 tiles.
5. CONCLUSIONS
We have developed a new mobile augmented reality system for rec-
ognizing book spines. Our system achieves a very low recognition
latency around 1 second, which is crucial for near-instantaneous
augmentation on the mobile device’s viewfinder. There is no need
to press a button to initiate a query, because user interest is auto-
matically inferred from the motion of objects seen in the phone’s
viewfinder. In addition to augmenting the viewfinder with a rec-
ognized book spine’s identity, we also highlight the location of the
books in the surrounding bookshelf. Our book spine recognition
system provides a fast way of retrieving information with a mobile
device about books in a library, bookstore, office, or home, without
ever taking a book off the bookshelf. Book spine recognition can be
easily combined with book cover recognition to create a joint sys-
tem that can recognize any facade of a book. Other potential appli-
cations of our mobile AR system include helping librarians reshelve
misplaced books; aiding bookstore clerks in organizing books on
shelves according to the books’ subjects; and guiding an individual
toward a particular book of interest in a library or bookstore.
6. ACKNOWLEDGMENTS
We thank Gabriel Takacs for sharing his RIFF code and helping us
port it to the Android platform. We also thank the reviewers for
their insightful comments, which were very helpful in improving this
paper.
7. REFERENCES
[1] Google, “Google Goggles: use pictures to search the web,” http://www.google.com/mobile/goggles.
[2] Kooaba, “Kooaba Visual Search: get instant product information,” http://www.kooaba.com.
[3] Nokia, “Nokia Point and Find: tag places and objects,” http://europe.nokia.com/services-and-apps/nokia-point-and-find.
[4] Amazon, “Amazon Remembers: create a visual list of products,” http://www.amazon.com/gp/remembers.
[5] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, November 2004.
[6] H. Bay, A. Ess, T. Tuytelaars, and L. V. Gool, “Speeded-up robust features (SURF),” Computer Vision and Image Understanding, vol. 110, pp. 346–359, June 2008.
[7] V. Chandrasekhar, Y. Reznik, G. Takacs, D. Chen, S. Tsai, R. Grzeszczuk, and B. Girod, “Quantization schemes for low bitrate compressed histogram of gradients descriptors,” in IEEE Computer Vision and Pattern Recognition Workshops (CVPRW), San Francisco, CA, USA, June 2010, pp. 33–40.
[8] Layar, “Layar Reality Browser: digital information on top of the real world,” http://site.layar.com/download/layar.
[9] D. Chen, S. Tsai, R. Vedantham, R. Grzeszczuk, and B. Girod, “Streaming mobile augmented reality on mobile phones,” in International Symposium on Mixed and Augmented Reality (ISMAR), Orlando, FL, USA, October 2009, pp. 181–182.
[10] G. Takacs, V. Chandrasekhar, S. Tsai, D. Chen, R. Grzeszczuk, and B. Girod, “Unified real-time tracking and recognition with rotation-invariant fast features,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), San Francisco, CA, USA, June 2010, pp. 934–941.
[11] D. Lee, Y. Chang, J. Archibald, and C. Pitzak, “Matching book-spine images for library shelf-reading process automation,” in IEEE International Conference on Automation Science and Engineering (CASE), Arlington, VA, USA, September 2008, pp. 738–743.
[12] N. Quoc and W. Choi, “A framework for recognition books on bookshelves,” in Proc. International Conference on Intelligent Computing (ICIC), Ulsan, Korea, September 2009, pp. 386–395.
[13] D. Crasto, A. Kale, and C. Jaynes, “The smart bookshelf: A study of camera projector scene augmentation of an everyday environment,” in Proc. IEEE Workshop on Applications of Computer Vision (WACV), Breckenridge, CO, USA, January 2005, pp. 218–225.
[14] D. Chen, S. Tsai, C.-H. Hsu, K.-H. Kim, J. P. Singh, and B. Girod, “Building book inventories using smartphones,” in ACM International Conference on Multimedia (MM), Firenze, Italy, October 2010, pp. 651–654.
[15] D. Chen, S. Tsai, K.-H. Kim, C.-H. Hsu, J. P. Singh, and B. Girod, “Low-cost asset tracking using location-aware camera phones,” in Applications of Digital Image Processing (ADIP) XXXIII, San Diego, CA, USA, August 2010, p. 77980R.
[16] D. Nister and H. Stewenius, “Scalable recognition with a vocabulary tree,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), New York, NY, USA, June 2006, pp. 2161–2168.
[17] M. Fischler and R. Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.
[18] E. Rosten and T. Drummond, “Machine learning for high-speed corner detection,” in European Conference on Computer Vision (ECCV), Graz, Austria, May 2006, vol. 1, pp. 430–443.
[19] J. Philbin, M. Isard, J. Sivic, and A. Zisserman, “Lost in quantization: Improving particular object retrieval in large scale image databases,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Anchorage, AK, USA, June 2008, pp. 1–8.