ABSTRACT: We present a machine learning technique for estimating absolute, per-pixel depth using any conventional monocular 2D camera, with minor hardware modifications. Our approach targets close-range human capture and interaction where dense 3D estimation of hands and faces is desired. We use hybrid classification-regression forests to learn how to map from near infrared intensity images to absolute, metric depth in real-time. We demonstrate a variety of human-computer interaction and capture scenarios. Experiments show an accuracy that outperforms a conventional light fall-off baseline, and is comparable to high-quality consumer depth cameras, but with a dramatically reduced cost, power consumption, and form-factor.
ABSTRACT: We propose 'filter forests' (FF), an efficient new discriminative approach for predicting continuous variables given a signal and its context. FF can be used for general signal restoration tasks that can be tackled via convolutional filtering, where it attempts to learn the optimal filtering kernels to be applied to each data point. The model can learn both the size of the kernel and its values, conditioned on the observation and its spatial or temporal context. We show that FF compares favorably to both Markov random field based and recently proposed regression forest based approaches for labeling problems in terms of efficiency and accuracy. In particular, we demonstrate how FF can be used to learn optimal denoising filters for natural images as well as for other tasks such as depth image refinement and 1D signal magnitude estimation. Numerous experiments and quantitative comparisons show that FFs achieve accuracy at par or superior to recent state-of-the-art techniques, while being several orders of magnitude faster.
ABSTRACT: Conversations abound with uncertainties of various kinds. Treating conversation as inference and decision making under uncertainty, we propose a task-independent, multimodal architecture for supporting robust continuous spoken dialog called Quartet. We introduce four interdependent levels of analysis, and describe representations, inference procedures, and decision strategies for managing uncertainties within and between the levels. We highlight the approach by reviewing interactions between a user and two spoken dialog systems developed using the Quartet architecture: Presenter, a prototype system for navigating Microsoft PowerPoint presentations, and the Bayesian Receptionist, a prototype system for dealing with tasks typically handled by front desk receptionists at the Microsoft corporate campus.
ABSTRACT: Dictation using speech recognition could potentially serve as an efficient input method for touchscreen devices. However, dictation systems today follow a mentally disruptive speech interaction model: users must first formulate utterances and then produce them, as they would with a voice recorder. Because utterances do not get transcribed until users have finished speaking, the entire output appears at once, and users must break their train of thought to verify and correct it. In this paper, we introduce Voice Typing, a new speech interaction model where users' utterances are transcribed as they produce them to enable real-time error identification. For fast correction, users leverage a marking menu using touch gestures. Voice Typing aspires to create an experience akin to having a secretary type for you while you monitor and correct the text. In a user study where participants composed emails using both Voice Typing and traditional dictation, they not only reported lower cognitive demand for Voice Typing but also exhibited a 29% relative reduction in user corrections. Overall, they also preferred Voice Typing.
ABSTRACT: Prior research has shown that when drivers look away from the road to view a personal navigation device (PND), driving performance is affected. To keep visual attention on the road, an augmented reality (AR) PND using a heads-up display could overlay a navigation route. In this paper, we compare the AR PND, a technology that does not currently exist but can be simulated, with two PND technologies that are popular today: an egocentric street view PND and the standard map-based PND. Using a high-fidelity driving simulator, we examine the effect of all three PNDs on driving performance in a city traffic environment where constant, alert attention is required. Based on both objective and subjective measures, experimental results show that the AR PND exhibits the least negative impact on driving. We discuss the implications of these findings on PND design as well as methods for potential improvement.
ABSTRACT: Information Technology (IT) has had a significant impact on society and has touched all aspects of our lives. Until now, computers and expensive devices have fueled this growth, resulting in several benefits to society. The challenge now is to take this success of IT to the next level, where IT services can be accessed by users in developing regions. The focus of the workshop in 2011 is to identify alternative sources of intelligence and use them to ease the process of interacting with information technology. We would like to explore the different modalities, their usage by the community, the intelligence that can be derived from that usage, and finally the design implications for the user interface. We would also like to explore ways in which people in developing regions would react to collaborative technologies and/or use collaborative interfaces that require community support to build knowledge bases (for example, Wikipedia) or to enable effective navigation of content and access to services.
ABSTRACT: Text entry experiments evaluating the effectiveness of various input techniques often employ a procedure whereby users are prompted with natural language phrases which they are instructed to enter as stimuli. For experimental validity, it is desirable to control the stimuli and present text that is representative of a target task, domain or language. MacKenzie and Soukoreff (2001) manually selected a set of 500 phrases for text entry experiments. To demonstrate representativeness, they correlated the distribution of single letters in their phrase set to a relatively small (by current standards) corpus of English prior to 1966, which may not reflect the style of text input today. In this paper, we ground the notion of representativeness in terms of information theory and propose a procedure for sampling representative phrases from any large corpus so that researchers can curate their own stimuli. We then describe the characteristics of phrase sets we generated using the procedure for email and social media (Facebook and Twitter). The phrase sets and code for the procedure are publicly available for download.
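One plausible reading of the information-theoretic criterion this abstract describes is to score candidate phrases by their cross-entropy under a language model trained on the target corpus, then keep phrases whose score is closest to the corpus's own average. The sketch below is an illustrative assumption, not the authors' published procedure: it uses a tiny add-one-smoothed character bigram model, whereas the actual paper may use a different model and selection rule.

```python
import math
from collections import Counter

def char_bigram_model(text):
    """Train an add-one-smoothed character bigram model on the corpus."""
    bigrams = Counter(zip(text, text[1:]))
    unigrams = Counter(text)
    vocab = len(set(text))
    def logprob(a, b):
        # Smoothing keeps unseen characters and bigrams finite.
        return math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
    return logprob

def cross_entropy(phrase, logprob):
    """Mean negative log-probability per character transition."""
    pairs = list(zip(phrase, phrase[1:]))
    return -sum(logprob(a, b) for a, b in pairs) / len(pairs)

def sample_representative(phrases, corpus, k):
    """Keep the k phrases whose cross-entropy under the corpus model
    lies closest to the corpus's own mean cross-entropy."""
    logprob = char_bigram_model(corpus)
    target = cross_entropy(corpus, logprob)
    return sorted(phrases,
                  key=lambda p: abs(cross_entropy(p, logprob) - target))[:k]

corpus = "the cat sat on the mat and the dog sat on the log"
phrases = ["the cat sat", "zzzz qqqq", "the dog sat on the mat"]
print(sample_representative(phrases, corpus, 2))
```

Out-of-model gibberish like "zzzz qqqq" receives a cross-entropy far from the corpus mean and is filtered out, which is the intuition behind representativeness here.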
ABSTRACT: Mobile devices often utilize touchscreen keyboards for text input. However, due to the lack of tactile feedback and generally small key sizes, users often produce typing errors. Key-target resizing, which dynamically adjusts the underlying target areas of the keys based on their probabilities, can significantly reduce errors, but requires training data in the form of touch points for intended keys. In this paper, we introduce Text Text Revolution (TTR), a game that helps users improve their typing experience on mobile touchscreen keyboards in three ways: first, by providing targeting practice; second, by highlighting areas for improvement; and third, by generating ideal training data for key-target resizing as a side effect of playing the game. In a user study, participants who played 20 rounds of TTR not only improved in accuracy over time, but also generated useful data for key-target resizing. To demonstrate usefulness, we trained key-target resizing on touch points collected from the first 10 rounds, and simulated how participants would have performed had personalized key-target resizing been used in the second 10 rounds. Key-target resizing reduced errors by 21.4%.
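Key-target resizing of the kind described above is commonly framed as touch-point classification: a tap is assigned to the key maximizing P(touch | key) × P(key), so keys with higher probability effectively claim a larger share of the touch surface. The following is a minimal sketch under that framing; the Gaussian touch model, key coordinates, and uniform prior are illustrative assumptions, not the system evaluated in the paper.

```python
import math

# Hypothetical per-key touch models fitted from collected touch points:
# each key has a mean touch location (mu_x, mu_y) in pixels and an
# isotropic variance. Values are illustrative, not from the study.
touch_models = {
    "q": {"mu": (15.0, 20.0), "var": 25.0},
    "w": {"mu": (45.0, 20.0), "var": 25.0},
    "e": {"mu": (75.0, 20.0), "var": 25.0},
}

# Prior over the next key; uniform here, but a language model would
# make likely letters claim larger effective target areas.
key_prior = {k: 1.0 / len(touch_models) for k in touch_models}

def touch_likelihood(point, model):
    """Isotropic 2D Gaussian likelihood of a touch point given a key."""
    dx = point[0] - model["mu"][0]
    dy = point[1] - model["mu"][1]
    return math.exp(-(dx * dx + dy * dy) / (2.0 * model["var"])) / (
        2.0 * math.pi * model["var"]
    )

def resolve_key(point):
    """Assign the touch to the key maximizing P(touch | key) * P(key)."""
    return max(
        touch_models,
        key=lambda k: touch_likelihood(point, touch_models[k]) * key_prior[k],
    )

# A touch landing between "w" and "e" but closer to "w":
print(resolve_key((55.0, 20.0)))  # → "w"
```

The training data TTR collects would be used to fit the per-key means and variances; with a non-uniform prior, the decision boundaries shift toward less likely keys, which is the "resizing" effect.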
ABSTRACT: With the proliferation of pervasive devices and the increase in their processing capabilities, client-side speech processing has been emerging as a viable alternative. The SiMPE workshop series started in 2006 with the goal of enabling speech processing on mobile and embedded devices to meet the challenges of pervasive environments (such as noise) and leveraging the context they offer (such as location). SiMPE 2010, the fifth in the series, will continue to explore issues, possibilities, and approaches for enabling speech processing as well as convenient and effective speech and multimodal user interfaces. Over the years, SiMPE has been evolving too, and since last year, one of our major goals has been to increase the participation of speech/multimodal HCI designers and increase their interactions with speech processing experts. Multimodality received more attention in SiMPE 2008 than it had in previous years. In SiMPE 2007, the focus was on developing regions. Given the importance of speech in developing regions, SiMPE 2008 had "SiMPE for developing regions" as a topic of interest. Speech user interaction in cars was a focus area in 2009. Given the multi-disciplinary nature of our goal, we hope that SiMPE will become the prime meeting ground for experts in these varied fields to bring to fruition novel, useful, and usable mobile speech applications.
ABSTRACT: Mobile devices with touch capabilities often utilize touchscreen keyboards. However, due to the lack of tactile feedback, users often have to switch their focus of attention between the keyboard area, where they must locate and click the correct keys, and the text area, where they must verify the typed output. This can impair user experience and performance. In this paper, we examine multimodal feedback and guidance signals that keep users' focus of attention in the keyboard area but also provide the kind of information users would normally get in the text area. We first conducted a usability study to assess and refine the user experience of these signals and their combinations. Then we evaluated whether those signals which users preferred could also improve typing performance in a controlled experiment. One combination of multimodal signals significantly improved typing speed by 11%, reduced keystrokes-per-character by 8%, and reduced backspaces by 28%. We discuss design implications.
ABSTRACT: Short Message Service (SMS) messaging is very popular, especially among teens. Because research has shown that SMS messaging while driving results in 35% slower reaction time than being intoxicated, campaigns have been launched by states, governments, and even cell phone carriers to discourage and ban SMS users from messaging while driving. At the same time, automotive infotainment systems such as the Ford Sync now provide drivers the ability to hear incoming messages using text-to-speech (TTS). But how should users respond to these messages while driving in a safe manner? Automatic speech recognition (ASR) affords automobile drivers a hands-free, eyes-free method of replying to SMS messages. In prior work, we examined three approaches to leveraging ASR for SMS replies: dictation using a language model trained on SMS responses, canned responses using a probabilistic context-free grammar (PCFG), and a "voice search" approach based on template matching. Voice search proceeds in two steps: an utterance is first converted into text, which is then used as a search query to match the most similar items of an index. For SMS replies, we created an index of SMS response templates, with slots for concepts such as time and place, from a large SMS data collection. After convolving recorded SMS replies so that the audio would exhibit the acoustic characteristics of in-car recognition, we compared how the three approaches handled the convolved audio with respect to the top n-best reply candidates. The voice search approach consistently outperformed dictation and canned responses, achieving as high as 89.7% task completion with respect to the top 5 reply candidates. Even if the voice search approach may be more robust to in-car noise, this does not guarantee that it will be more usable. Indeed, users may have difficulties verifying whether SMS response templates match their intended meaning, especially while driving.
Using a high-fidelity driving simulator, we compared the voice search approach to the dictation approach in terms of both driving performance and task performance measures. Although the two approaches did not differ in terms of driving performance, users made five times more errors on average using dictation than voice search. Hence, verifying whether SMS response templates matched the meaning of an intended reply is much less prone to error than deciphering the sometimes nonsensical misrecognitions of dictation. And as prior research has shown, because ASR errors with in-car speech interfaces negatively impact driving performance, the safest way to respond to SMS messages in automobiles may just be the voice search approach. For MIAA, we will demonstrate a multimodal interface for SMS replies based on voice search, as shown in Figure 1.
ABSTRACT: Speech recognition affords automobile drivers a hands-free, eyes-free method of replying to Short Message Service (SMS) text messages. Although a voice search approach based on template matching has been shown to be more robust to the challenging acoustic environment of automobiles than using dictation, users may have difficulties verifying whether SMS response templates match their intended meaning, especially while driving. Using a high-fidelity driving simulator, we compared dictation for SMS replies versus voice search in increasingly difficult driving conditions. Although the two approaches did not differ in terms of driving performance measures, users made about six times more errors on average using dictation than voice search.
ABSTRACT: Soft keyboards offer touch-capable mobile and tabletop devices many advantages, such as multiple language support and room for larger displays. On the other hand, because soft keyboards lack haptic feedback, users often produce more typing errors. In order to make soft keyboards more robust to noisy input, researchers have developed key-target resizing algorithms, where the underlying target areas for keys are dynamically resized based on their probabilities. In this paper, we describe how overly aggressive key-target resizing can sometimes prevent users from typing their desired text, violating basic user expectations about keyboard functionality. We propose an anchored key-target method that incorporates these usability principles so that soft keyboards can remain robust to noisy input without violating user expectations. In an empirical evaluation, we found that using anchored dynamic key-targets significantly reduces keystroke errors as compared to the state of the art.
ABSTRACT: As users of social networking websites expand their network of friends, they are often flooded with newsfeed posts and status updates, most of which they consider to be "unimportant" and not newsworthy. In order to better understand how people judge the importance of their newsfeed, we conducted a study in which Facebook users were asked to rate the importance of their newsfeed posts as well as of their friends. We learned classifiers of newsfeed and friend importance to identify predictive sets of features related to social media properties, the message text, and shared background information. For classifying friend importance, the best performing model achieved 85% accuracy and 25% error reduction. By leveraging this model for classifying newsfeed posts, the best newsfeed classifier achieved 64% accuracy and 27% error reduction.
ABSTRACT: With the proliferation of pervasive devices and the increase in their processing capabilities, client-side speech processing has been emerging as a viable alternative. SiMPE 2009, the fourth in the series, will continue to explore issues, possibilities, and approaches for enabling speech processing as well as convenient and effective speech and multimodal user interfaces. One of our major goals for SiMPE 2009 is to increase the participation of speech/multimodal HCI designers, and increase their interactions with speech processing experts. Multimodality received more attention in SiMPE 2008 than it had in previous years. In SiMPE 2007 (3), the focus was on developing regions. Given the importance of speech in developing regions, SiMPE 2008 had "SiMPE for developing regions" as a topic of interest. We think of this as a key emerging area for mobile speech applications, and will continue this in 2009 as well.
ABSTRACT: Nowadays, personal navigation devices (PNDs) that provide GPS-based directions are common in vehicles. These devices typically display the real-time location of the vehicle on a map and play spoken prompts when drivers need to turn. While such devices appear to be less distracting than paper directions, in-car displays may distract drivers from their primary task of driving. In experiments conducted with a high-fidelity driving simulator, we found that using paper directions degrades driving performance and visual attention significantly more than using a navigation device that provides either a map with spoken prompts or spoken prompts only. This was expected. However, we also found that having just spoken prompts affected visual attention the least. We discuss the implications of these findings on PND design for vehicles.
ABSTRACT: As users enter web queries, real-time query expansion (RTQE) interfaces offer suggestions based on an index garnered from query logs. In selecting a suggestion, users can potentially reduce keystrokes, which can be very beneficial on mobile devices with limited input means. Unfortunately, RTQE interfaces typically provide little assistance when only parts of an intended query appear among the suggestion choices. In this paper, we introduce Phrase Builder, an RTQE interface that reduces keystrokes by facilitating the selection of individual query words and by leveraging back-off query techniques to offer completions for out-of-index queries. We describe how we implemented a small memory footprint index and retrieval algorithm, and discuss lessons learned from three versions of the user interface, which was iteratively designed through user studies. Compared to standard auto-completion and typing, the last version of Phrase Builder saved more keystrokes per character, was perceived to be faster, and was overall preferred by users.
ABSTRACT: Automotive infotainment systems now provide drivers the ability to hear incoming Short Message Service (SMS) text messages using text-to-speech. However, the question of how best to allow users to respond to these messages using speech recognition remains unsettled. In this paper, we propose a robust voice search approach to replying to SMS messages based on template matching. The templates are empirically derived from a large SMS corpus and matches are accurately retrieved using a vector space model. In evaluating SMS replies within the acoustically challenging environment of automobiles, the voice search approach consistently outperformed using just the recognition results of a statistical language model or a probabilistic context-free grammar. For SMS replies covered by our templates, the approach achieved as high as 89.7% task completion when evaluating the top five reply candidates. Index Terms: SMS, information retrieval, voice UI, voice search
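The vector space model described in this abstract can be illustrated with a small sketch: reply templates are indexed as TF-IDF vectors, and the ASR output is matched against them by cosine similarity to produce an n-best list. The templates, tokenization, and smoothing below are illustrative assumptions, not the system evaluated in the paper, which derived its templates from a large SMS corpus.

```python
import math
from collections import Counter

# Hypothetical SMS reply templates; slots like <time> stand in for
# concepts (time, place) extracted from an SMS collection.
TEMPLATES = [
    "i will be there at <time>",
    "running late see you at <time>",
    "ok sounds good",
    "sorry i cannot make it",
]

def vectorize(tokens, df, n_docs):
    """TF-IDF vector with add-one-smoothed document frequencies."""
    tf = Counter(tokens)
    return {t: tf[t] * math.log((n_docs + 1) / (df.get(t, 0) + 1)) for t in tf}

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def top_templates(recognized, n_best=5):
    """Rank reply templates by cosine similarity to the ASR output."""
    docs = [t.split() for t in TEMPLATES]
    df = Counter(tok for d in docs for tok in set(d))
    index = [vectorize(d, df, len(docs)) for d in docs]
    query = vectorize(recognized.split(), df, len(docs))
    ranked = sorted(zip(TEMPLATES, (cosine(query, v) for v in index)),
                    key=lambda x: -x[1])
    return [t for t, _ in ranked[:n_best]]

print(top_templates("i am running late will be there at eight", n_best=2))
# → ['i will be there at <time>', 'running late see you at <time>']
```

Because the query only has to land near a template in vector space rather than be transcribed verbatim, this style of matching degrades more gracefully under in-car recognition errors than raw dictation.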
ABSTRACT: Nowadays, personal navigation devices (PNDs) that provide GPS-based directions are widespread in vehicles. These devices typically display the real-time location of the vehicle on a map and play spoken prompts when drivers need to turn. While such devices are less distracting than paper directions, their graphical display may distract users from their primary task of driving. In experiments conducted with a high-fidelity driving simulator, we found that drivers using a navigation system with a graphical display indeed spent less time looking at the road compared to those using a navigation system with spoken directions only. Furthermore, glancing at the display was correlated with higher variance in driving performance measures. We discuss the implications of these findings on PND design for vehicles.