Article

Nouse ‘use your nose as a mouse’ perceptual vision technology for hands-free games and interfaces

Authors: Dmitry O. Gorodnichy and Gerhard Roth
Abstract

Due to the recent increase in computer power and decrease in camera cost, it has become very common to see a camera on top of a computer monitor. This paper presents vision-based technology that allows one, in such a setup, to significantly enhance the perceptual power of the computer. The described techniques for tracking a face using a convex-shape nose feature, as well as for face tracking with two off-the-shelf cameras, allow faces to be tracked robustly and precisely in both 2D and 3D with low-resolution cameras. Supplemented by a mechanism for detecting multiple eye blinks, this technology provides a complete solution for building intelligent hands-free input devices. The theory behind the technology is presented, and results from several perceptual user interfaces built with this technology are shown.
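As a rough illustration of the general idea of tracking a nose feature with a low-resolution webcam, the sketch below uses OpenCV normalized cross-correlation template matching. It is an illustrative stand-in, not the paper's convex-shape nose detector; the camera index, manual template selection, and confidence threshold are assumptions.

    # Minimal sketch: track a nose-like template across webcam frames with
    # normalized cross-correlation. Illustrative only, not the paper's method.
    import cv2

    cap = cv2.VideoCapture(0)                      # assumed default webcam
    ok, frame = cap.read()
    assert ok, "camera not available"

    # Assumption: the user selects the nose region once to initialize the template.
    x, y, w, h = cv2.selectROI("select nose", frame, showCrosshair=True)
    template = cv2.cvtColor(frame[y:y+h, x:x+w], cv2.COLOR_BGR2GRAY)

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Correlate the template with the current frame; the peak gives the nose.
        result = cv2.matchTemplate(gray, template, cv2.TM_CCOEFF_NORMED)
        _, score, _, top_left = cv2.minMaxLoc(result)
        if score > 0.5:                            # assumed confidence threshold
            cv2.rectangle(frame, top_left,
                          (top_left[0] + w, top_left[1] + h), (0, 255, 0), 2)
        cv2.imshow("nose tracking sketch", frame)
        if cv2.waitKey(1) & 0xFF == 27:            # Esc to quit
            break

    cap.release()
    cv2.destroyAllWindows()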


... Kocejko et al. [8] and Lupu et al. [9] have controlled the mouse cursor by tracking the eye gaze movements of the user. Betke et al. [10], Epstein et al. [11], Nabati et al. [12], Chareonsuk et al. [13], Varona et al. [14], Bian et al. [15], Gorodnichy et al. [16], Gyawal et al. [17] and Morris et al. [18] have avoided the overhead of head-mounted devices and high-cost hardware by capturing the user's head motions through a web camera to control the mouse pointer. Fathi et al. [19] achieved this by tracking the eye movement, whereas Sugano et al. [20], Sambrekar et al. [21] and M. Nasor et al. [22] have attempted tracking the eye gazes. ...
... Fathi et al. [19] achieved this by tracking the eye movement, whereas Sugano et al. [20], Sambrekar et al. [21] and M. Nasor et al. [22] have attempted tracking the eye gazes. Betke et al. [10], Nabati et al. [12], Chareonsuk et al. [13], Varona et al. [14], Bian et al. [15], Gorodnichy et al. [16], Fathi et al. [19], Hegde et al. [23] and Arai et al. [24] have developed camera-based mouse replacement solutions for implementing mouse click events such as single- and double-clicking and dragging. Accurately tracking and converting the facial expression of the user to mouse operations is still acknowledged as a research challenge and opportunity. ...
... The total error for the neural network is the sum of these errors (Eq. 15). Applying (14) in (15) gives (16). During backpropagation, each weight in the network is updated such that the predicted value comes closer to the target value, thereby minimizing the error. ...
Article
Full-text available
Many solutions have been proposed in the past decades to assist people with movement disabilities in interacting with personal computers. Various approaches have been proposed to simulate mouse cursor movement and click operations through facial expressions captured by a camera. Accurately tracking and converting the facial expression of the user to mouse operations is still acknowledged as a research challenge and opportunity. The proposed system introduces prediction of the items to be selected by the user in a GUI-based system, applying backpropagation neural network techniques to improve the performance of the overall selection process.
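The excerpt above summarizes the backpropagation update without reproducing Eqs. (14)-(16). As a generic reminder of how a summed-squared-error backpropagation update works (a minimal sketch; the network sizes, learning rate, and data are illustrative assumptions, not values from the paper):

    # Generic backpropagation sketch for a tiny one-hidden-layer network trained
    # with a summed squared error. Shapes and learning rate are assumptions.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.random((8, 4))          # 8 samples, 4 input features (assumed)
    T = rng.random((8, 2))          # target values in [0, 1]

    W1 = rng.normal(scale=0.5, size=(4, 5))   # input -> hidden weights
    W2 = rng.normal(scale=0.5, size=(5, 2))   # hidden -> output weights
    lr = 0.5                                   # assumed learning rate

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    for epoch in range(1000):
        # Forward pass
        H = sigmoid(X @ W1)
        Y = sigmoid(H @ W2)
        # Total error: sum of squared differences between prediction and target
        E = 0.5 * np.sum((Y - T) ** 2)
        # Backward pass: propagate the error and update each weight so the
        # prediction moves closer to the target.
        dY = (Y - T) * Y * (1 - Y)
        dH = (dY @ W2.T) * H * (1 - H)
        W2 -= lr * H.T @ dY
        W1 -= lr * X.T @ dH

    print("final total error:", E)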
... The user can move freely in front of the display while a camera captures images of the user and the positions of the user's eyes are tracked. Gorodnichy and Roth [2] described techniques for tracking a face using a convex-shape nose feature, as well as for face tracking with two off-the-shelf cameras, which allow faces to be tracked robustly and precisely in both 2D and 3D with low-resolution cameras on top of a computer display. ...
... Notice that if the real display uses a right-handed coordinate system, the virtual display will use a left-handed system. The real rotation matrix R_r and translation vector T_r can be computed by Eq. (2). ...
... The position and orientation of the camera is chosen randomly and used as the ground truth in the error analysis. If the rotation matrix is denoted as a vector R = [r_1, r_2, r_3], and the translation vector is denoted as T = [t_1, t_2 ...
Conference Paper
Our work focuses on the extrinsic calibration of a display-camera system in which the camera has no direct view of the display. An annular planar mirror is used to reflect the display so that the camera can capture the reflection images. The position of the mirror can be obtained after the outer circle is detected. The position parameters are determined uniquely by putting two orthogonal lines in the background and are optimized by re-projecting the inner circle to the image plane. The display pixels are encoded using gray code, and their imaged positions in the mirror can be obtained by solving the PnP problem. Finally, the real extrinsic parameters between the camera and the display are obtained according to the mirror imaging principle. Our approach is simple and fully automatic, without any manual intervention. Both simulation and real experiments validate our approach.
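The excerpts above note that the mirrored (virtual) pose has flipped handedness and that the real pose follows from the mirror imaging principle (Eq. 2 of the cited work, not reproduced here). One common way to express such a relation uses a Householder reflection about the mirror plane; the sketch below shows that standard relation under assumed conventions, and is not the cited paper's exact formulation. The mirror normal n, distance d, and pose values are placeholders.

    # Sketch: map a mirror-reflected (virtual) pose back to a real pose using a
    # Householder reflection about the mirror plane. Values are placeholders.
    import numpy as np

    def reflect_pose(R_v, T_v, n, d):
        """n: unit normal of the mirror plane (camera frame); d: plane distance."""
        n = n / np.linalg.norm(n)
        S = np.eye(3) - 2.0 * np.outer(n, n)   # reflection, det(S) = -1,
                                               # which accounts for the handedness flip
        R_r = S @ R_v
        T_r = S @ T_v + 2.0 * d * n
        return R_r, T_r

    # Illustrative numbers only
    R_v = np.eye(3)
    T_v = np.array([0.1, 0.0, 1.0])
    n = np.array([0.0, 0.0, 1.0])
    d = 0.5

    R_r, T_r = reflect_pose(R_v, T_v, n, d)
    print(R_r, T_r)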
... The user can move freely in front of the display while a camera captures images of the user and the positions of the user's eyes are tracked. Gorodnichy and Roth [2] described techniques for tracking a face using a convex-shape nose feature, as well as for face tracking with two off-the-shelf cameras, which allow faces to be tracked robustly and precisely in both 2D and 3D with low-resolution cameras on top of a computer display. ...
... Notice that if the real display uses a right-handed coordinate system, the virtual display will use a left-handed system. The real rotation matrix R_r and translation vector T_r can be computed by Eq. (2). ...
... The position and orientation of the camera is chosen randomly and used as the ground truth in the error analysis. If the rotation matrix is denoted as a vector R = [r_1, r_2, r_3], and the translation vector is denoted as T = [t_1, t_2 ...
Article
A new extrinsic calibration method is proposed for a display-camera system in which the display is not in the direct view of the camera. An annular mirror is used so that the camera can capture the virtual image of the display. The position of the mirror can be obtained after the outer circle is detected. The position parameters are determined uniquely by putting two orthogonal lines in the background and are optimized by re-projecting the inner circle to the image plane. The display pixels are encoded using gray code, and their imaged positions in the mirror can be obtained by solving the PnP problem. Finally, the real extrinsic parameters between the camera and the display are obtained according to the mirror imaging principle. Compared with other existing methods, our approach is simple and fully automatic, without any manual intervention. The mirror only needs to be fixed at one position, and the degenerate case arising with a common planar mirror is avoided. Both simulation and real experiments validate our approach.
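The pipeline above decodes gray-code correspondences between display pixels and their images in the mirror and then solves a PnP problem. A minimal sketch of that last step with OpenCV is shown below; the 3D display-pixel coordinates, decoded image points, and camera intrinsics are placeholder values, not data from the paper.

    # Sketch: estimate the pose of the (virtual) display from decoded gray-code
    # correspondences by solving a PnP problem. All numbers are placeholders.
    import numpy as np
    import cv2

    # 3D positions of four encoded display pixels on the display plane (metres),
    # expressed in the display coordinate frame (z = 0 on the plane).
    object_points = np.array([[0.00, 0.00, 0.0],
                              [0.30, 0.00, 0.0],
                              [0.30, 0.20, 0.0],
                              [0.00, 0.20, 0.0]], dtype=np.float64)

    # Where those pixels were observed in the mirror image after decoding.
    image_points = np.array([[320.0, 240.0],
                             [420.0, 238.0],
                             [422.0, 310.0],
                             [318.0, 312.0]], dtype=np.float64)

    # Assumed pinhole intrinsics with no distortion.
    K = np.array([[800.0, 0.0, 320.0],
                  [0.0, 800.0, 240.0],
                  [0.0, 0.0, 1.0]])
    dist = np.zeros(5)

    ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, dist)
    R, _ = cv2.Rodrigues(rvec)   # rotation of the virtual display w.r.t. camera
    print("R =\n", R, "\nT =", tvec.ravel())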
... Another example of head location estimation by means of facial feature tracking can be found in Gorodnichy et al.'s work [38]. In this case, the goal is to track the nose, whose main characteristic is that it extends in front of the face and ends with a somewhat universal rounded shape. ...
... The speed of the cursor is determined by the amount of the offset. Several systems allow switching between both modes [38,81]. ...
... When we first started with this project, few computer-vision-based hands-free systems without infrared lighting existed. As mentioned before, not all the systems were created to offer accessibility to disabled users [13,106,38]. Ideally, when designing a system for disabled users, they should be taken into account during the whole development process. ...
... mouse click). [4] The earlier application Nouse (http://nouse.ca) of Gorodnichy et al. tracked the nose and eyes as well and used multiple blinks as control commands [5]. They also proposed and implemented a visual feedback mechanism whereby a small image follows the cursor and shows how the system is interpreting the current state of the user's head pose [6]. ...
... Other applications and implementations were considered but they lacked the availability, functionality, or performance required for our user interface. These include, among others, Nouse [5], Camera Mouse [9], faceShift (faceshift.com), SINA [4], SHORE (Fraunhofer IIS) scale invariant feature transform (SIFT) [13], local binary pattern, iterative closest point [14], and various template matching algorithms. ...
... The velocity pointer may be particularly helpful where the control space has significantly lower resolution than the target screen space. [4,5] ...
Conference Paper
Full-text available
Some individuals have difficulty using standard hand-manipulated computer input devices such as a mouse and a keyboard effectively. However, if these users have sufficient control over their face and head movement, a robust face tracking user interface can bring significant usability benefits. Using consumer-grade computer vision devices and signal processing techniques, a robust user interface can be made readily available at low cost and can provide a number of benefits, including non-intrusive usage. Designing and implementing this type of user interface presents many challenges, particularly with regard to accuracy and usability. Continuing previously published research, we now present results based on an analysis and comparison of different options for face tracking user interfaces. Five different options are evaluated, each corresponding to a different architectural stage of a face tracking user interface -- namely user input, capture technology, feature retrieval, feature processing, and pointer behavior. Usability factors were also included in the evaluation. A prototype system, configured to use different options, was created and compared with existing similar solutions. Tests were designed that ran in an Internet browser, and a quantitative evaluation was done. The results show which of the evaluated options performed better than the others and how the best performing prototype compares to currently available solutions. These findings can serve as a precursor to a full-scale usability study, various improvements, and future deployment for public use.
... [23] The earlier application Nouse of Gorodnichy et al. tracked the nose and eyes as well and used multiple blinks as control commands. [9] They also proposed and implemented a visual feedback mechanism where a small image follows the cursor and shows how the system is interpreting the current state of the user's face pose. [7] Morris et al. used skin color to detect the face and the nostrils, producing a system that they admit is vulnerable to noise from similarly colored regions. ...
... Nouse implements a fixed interaction zone within the field of view that can be re-initialized (i.e. the process is still not fully automatic). [9] Camera Mouse requires the user or a helper to specify the reference point and sensitivity (interaction zone size) by a mouse click. [2] We believe that a fully automatic definition of the interaction zone will improve usability significantly. ...
... It would also be helpful to program a gesture to turn the cursor or selection control on and off similar to the Snap Clutch of Istance et al. [12] and other face tracking user interfaces. [9,23] ...
Conference Paper
Full-text available
Using face and head movements to control a computer can be especially helpful for users who, for various reasons, cannot effectively use common input devices with their hands. Using vision-based consumer devices makes such a user interface readily available and allows its use to be non-intrusive. However, a characteristic problem with this approach is accurate control. Consumer devices capture already small face movements at a resolution that is usually lower than the screen resolution. The computer vision algorithms and technologies that enable such interfaces also introduce noise, adversely affecting usability. This paper describes how different components of this perceptual user interface contribute to the problem of accuracy and presents potential solutions. The interface was implemented with different configurations and was statistically evaluated to support the analysis. The different configurations include, among other things, the use of 2D and depth images from consumer devices, different input styles, and the use of the Kalman filter.
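One of the configurations mentioned above smooths the noisy tracking signal with a Kalman filter before it drives the pointer. A minimal constant-velocity Kalman filter over 2D cursor positions might look like the following sketch; the frame rate and noise covariances are illustrative assumptions, not the paper's settings.

    # Sketch: constant-velocity Kalman filter smoothing a noisy 2D cursor signal.
    # State = [x, y, vx, vy]; noise levels and frame rate are assumptions.
    import numpy as np

    dt = 1.0 / 30.0                      # assumed 30 fps capture
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)   # state transition
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)   # we only measure position
    Q = np.eye(4) * 1e-3                 # process noise (assumed)
    R = np.eye(2) * 4.0                  # measurement noise (assumed, pixels^2)

    x = np.zeros(4)                      # initial state
    P = np.eye(4) * 100.0                # initial uncertainty

    def kalman_step(x, P, z):
        # Predict
        x = F @ x
        P = F @ P @ F.T + Q
        # Update with the measured (noisy) cursor position z = [px, py]
        y = z - H @ x
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ y
        P = (np.eye(4) - K @ H) @ P
        return x, P

    rng = np.random.default_rng(1)
    for t in range(100):
        true_pos = np.array([t * 2.0, 100.0 + t])        # synthetic trajectory
        z = true_pos + rng.normal(scale=2.0, size=2)     # noisy tracker output
        x, P = kalman_step(x, P, z)

    print("smoothed position:", x[:2])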
... However, a potential drawback of optical mouse sensors is the requirement of an additional light source to work properly, as demonstrated by [6,91]. With regard to the methods used in the aforementioned works, virtual interface-based methods for making computers accessible to the physically disabled are popular in the literature [28,39,94]. Although common, in practice dwell-time-based click actuation suffers from unwanted actuation of mouse clicks due to eye gaze fixation, generally known as the "Midas Touch" problem [37]. ...
... A critical requirement for vision-based AMCs to work properly is to ensure proper lighting condition for calibration and accurate detection of facial [28,39,41,94] or eye gaze [6,91,106] features. For eye gaze-based AMCs, gaze tracking Manuscript submitted to ACM may be challenging due to image resolution, different lighting conditions, user's dependency on eyeglasses due to poor eyesight, and even user's skin complexion. ...
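The excerpts above mention dwell-time click actuation and its "Midas Touch" problem. A minimal sketch of a dwell-click detector, with a stillness radius and a refractory period as simple (partial) mitigations, is shown below; all thresholds are illustrative assumptions, not values from the cited works.

    # Sketch: dwell-time click detection on a stream of cursor positions.
    # A click fires when the cursor stays within a small radius for `dwell_s`
    # seconds; a refractory period limits repeated accidental clicks.
    import math

    class DwellClicker:
        def __init__(self, radius_px=15.0, dwell_s=1.0, refractory_s=1.5):
            self.radius = radius_px          # assumed stillness radius
            self.dwell = dwell_s             # assumed dwell time
            self.refractory = refractory_s   # assumed lock-out after a click
            self.anchor = None               # (x, y, t) where stillness began
            self.last_click_t = -1e9

        def update(self, x, y, t):
            """Feed one cursor sample; returns True when a click should fire."""
            if self.anchor is None or \
               math.hypot(x - self.anchor[0], y - self.anchor[1]) > self.radius:
                self.anchor = (x, y, t)      # cursor moved: restart the dwell timer
                return False
            if (t - self.anchor[2]) >= self.dwell and \
               (t - self.last_click_t) >= self.refractory:
                self.last_click_t = t
                self.anchor = (x, y, t)
                return True
            return False

    # Usage: feed (x, y, timestamp) samples from any head or gaze tracker.
    clicker = DwellClicker()
    for i in range(60):
        t = i / 30.0                          # 30 Hz samples
        if clicker.update(100.0, 200.0, t):   # cursor held still
            print("click at t =", t)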
Preprint
Full-text available
Upper limb disability may be caused either due to accidents, neurological disorders, or even birth defects, imposing limitations and restrictions on the interaction with a computer for the concerned individuals using a generic optical mouse. Our work proposes the design and development of a working prototype of a sensor-based wireless head-mounted Assistive Mouse Controller (AMC), Auxilio, facilitating interaction with a computer for people with upper limb disability. Combining commercially available, low-cost motion and infrared sensors, Auxilio solely utilizes head and cheek movements for mouse control. Its performance has been juxtaposed with that of a generic optical mouse in different pointing tasks as well as in typing tasks, using a virtual keyboard. Furthermore, our work also analyzes the usability of Auxilio, featuring the System Usability Scale. The results of different experiments reveal the practicality and effectiveness of Auxilio as a head-mounted AMC for empowering the upper limb disabled community.
... Head movement is a common social communication method [16,24] that has also been used to interact with interactive systems. For example, to move a cursor on a desktop computer [11], as command gestures while using head-mounted displays [41], or to interact with mobile devices [8]. However, formerly additional hardware external to the primary device was needed for tracking. ...
... Still, head movement can also be a useful additional input method for able-bodied users [10,23]. It has been used as continuous input to move the cursor on desktop computers [11,29], or to change the viewport in a 3D-application [14]. Discrete operations are also promising application areas for head gestures. ...
... In Gorodnichy and Roth (2004), the authors used a convex-shaped nose feature to track the nose position in a frame. Nose features are also used in (Varona et al., 2008), where the eyes are tracked using the eyes histogram and a meanshift algorithm. ...
... In (Gorodnichy and Roth, 2004) to (Pallejà et al., 2011), fine control of the head is mandatory, prohibiting usage by persons who suffer from involuntary movements or an impairment that interferes with head movement. Our work presents a novel approach in which a markerless head tracking system is implemented that is suitable for recognizing the functional head movement of the user. Our initial intention is to give impaired persons an alternative way to use the computer, even if they do not have fine control of head movement. ...
Conference Paper
Full-text available
The use of computers as communication tools is trivial nowadays, but the use of a PC by an impaired person is often a challenge. Augmentative and alternative communication (AAC) devices can empower these subjects through the use of their remaining functional movements, including head movements. Current computer vision AAC solutions present limited performance in the presence of involuntary body movement or spasticity (stiff or rigid muscles). Our work proposes a novel human computer interface (HCI) based on the functional head movements of each user. After calibration, a Hidden Markov Model (HMM) classifier represents the desired functional movement based on the velocity components of the estimated head position. New segmented movements are then classified as valid or invalid based on the HMM. Valid segments can generate mouse “click” events that can be used with scanning virtual keyboards, enabling text editing, and within scanning-based software that can control mouse functions.
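The abstract above scores segmented head-velocity sequences against a calibrated HMM. A minimal sketch of that idea with the hmmlearn library is shown below; the synthetic velocity data, model size, and acceptance threshold are all illustrative assumptions, not the cited paper's design.

    # Sketch: train a Gaussian HMM on velocity sequences of a calibrated
    # functional head movement, then accept/reject new segments by likelihood.
    import numpy as np
    from hmmlearn.hmm import GaussianHMM

    rng = np.random.default_rng(0)

    def synthetic_gesture(n=30):
        # 2D head velocity (vx, vy) of a nod-like functional movement (made up)
        t = np.linspace(0, np.pi, n)
        return np.column_stack([0.1 * rng.normal(size=n), np.sin(t)]) \
            + 0.05 * rng.normal(size=(n, 2))

    train_segments = [synthetic_gesture() for _ in range(20)]
    X = np.vstack(train_segments)
    lengths = [len(s) for s in train_segments]

    model = GaussianHMM(n_components=3, covariance_type="diag", n_iter=50)
    model.fit(X, lengths)

    # Acceptance threshold from training data (assumed: mean - 2*std of
    # per-sample log-likelihoods).
    scores = np.array([model.score(s) / len(s) for s in train_segments])
    threshold = scores.mean() - 2 * scores.std()

    new_segment = synthetic_gesture()
    is_valid = model.score(new_segment) / len(new_segment) >= threshold
    print("valid functional movement" if is_valid else "invalid movement")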
... Over the years, a number of different techniques have been used for head tracking. These techniques could be classified as template-based approaches [3]; 3D systems using stereo cameras [4]; vision-based systems [5]; and using wearable markers [6]. In Kim et al [3], a dual template based tracking system was proposed. ...
... The templates are used for tracking the nose and the tip of the nose. Gorodnichy & Roth [4] presented a system for tracking the nose both in 2D and 3D with low resolution images. The nose was selected as the facial feature to track because it is the most prominent feature of the face; it is always visible even with different head orientations. ...
Conference Paper
Full-text available
This paper presents the architecture for a novel RGB-D based assistive device that incorporates depth as well as RGB data to enhance head tracking and facial gesture based control for severely disabled users. Using depth information it is possible to remove background clutter and therefore achieve a more accurate and robust performance. The system is compared with the CameraMouse, SmartNav and our previous 2D head tracking system. For the RGB-D system, the effective throughput of dwell clicking increased by a third (from 0.21 to 0.30 bits per second) and that of blink clicking doubled (from 0.15 to 0.28 bits per second) compared to the 2D system.
... Rurainsky and Eisert (2003) used deformable templates to detect pupils, mouth corners and inner mouth lip lines from face images. Gorodnichy and Roth (2004) tracked the nose with a method that used 3D templates. The 3D templates were created from two images captured by two web cameras positioned next to each other. ...
... The more complex examples include controlling a game with face or head movements. Gorodnichy and Roth (2004) presented two such games: BubbleFrenzy and NousePong. The head movements were registered by tracking the player's nose. ...
... Gaze movement is also used as an input method [2,3,24,25,35] because gaze can reflect the user's intention. There is also a method that uses movement of the head, for purposes such as turning pages when browsing [45] and operating the cursor (e.g., desktop devices [46], mobile devices [47]). There are methods that use a combination of head movement and gaze [48] and a combination of brain and gaze [49]. ...
Article
Full-text available
With the spread of eyewear devices, people are increasingly using information devices in various everyday situations. In these situations, it is important for eyewear devices to have eye-based interaction functions for simple hands-free input at a low cost. This paper proposes a gaze movement recognition method for simple hands-free interaction that uses eyewear equipped with an infrared distance sensor. The proposed method measures eyelid skin movement using an infrared distance sensor inside the eyewear and applies machine learning to the time-series sensor data to recognize gaze movements (e.g., up, down, left, and right). We implemented a prototype system and conducted evaluations with gaze movements including factors such as movement directions at 45-degree intervals and the movement distance difference in the same direction. The results showed the feasibility of the proposed method. The proposed method recognized 5 to 20 types of gaze movements with an F-value of 0.96 to 1.0. In addition, the proposed method was available with a limited number of sensors, such as two or three, and robust against disturbance in some usage conditions (e.g., body vibration, facial expression change). This paper provides helpful findings for the design of gaze movement recognition methods for simple hands-free interaction using eyewear devices at a low cost.
... Since gaze can also reflect the user's intention [29], gaze can be used for a hands-free gesture input method [9,30]. There are also methods that use the movement of the head, such as a method that can adjust the input value by rotating or tilting the user's head (HeadTurn [31]), a method to turn pages when browsing (HeadPager [32]), a method to operate the cursor (e.g., desktop devices [33,34], mobile devices [35]), and a method to select a target [36]. There are methods for gesture input that combine multiple factors, such as head movement and gaze [37,38]. ...
Article
Full-text available
Simple hands-free input methods using ear accessories have been proposed to broaden the range of scenarios in which information devices can be operated without hands. Although many previous studies use canal-type earphones, few studies focused on the following two points: (1) A method applicable to ear accessories other than canal-type earphones. (2) A method enabling various ear accessories with different styles to have the same hands-free input function. To realize these two points, this study proposes a method to recognize the user’s facial gesture using an infrared distance sensor attached to the ear accessory. The proposed method detects skin movement around the ear and face, which differs for each facial expression gesture. We created a prototype system for three ear accessories for the root of the ear, earlobe, and tragus. The evaluation results for nine gestures and 10 subjects showed that the F-value of each device was 0.95 or more, and the F-value of the pattern combining multiple devices was 0.99 or more, which showed the feasibility of the proposed method. Although many ear accessories could not interact with information devices, our findings enable various ear accessories with different styles to have eye-free and hands-free input ability based on facial gestures.
... Studies have shown that sensing head orientation and position can help the calibration of gaze interaction and promote the accuracy of gaze-based selection tasks [51,52]. Head movements were also leveraged to control desktop cursors [22,58] and mobile devices [15], by mapping the position of the head to the cursor. Head gestures were also proposed for performing discrete operations on the desktop [40,45,54] and HMD glasses [19,64]. ...
Article
We propose HeadGesture, a hands-free input approach to interact with Head Mounted Display (HMD) devices. Using HeadGesture, users do not need to raise their arms to perform gestures or operate remote controllers in the air. Instead, they perform simple gestures with head movement to interact with the devices. In this way, users' hands are free to perform other tasks, e.g., taking notes or manipulating tools. This approach also reduces the hand occlusion of the field of view [11] and alleviates arm fatigue [7]. However, one main challenge for HeadGesture is to distinguish the defined gestures from unintentional movements. To generate intuitive gestures and address the issue of gesture recognition, we proceed through a process of Exploration - Design - Implementation - Evaluation. We first design the gesture set through experiments on gesture space exploration and gesture elicitation with users. Then, we implement algorithms to recognize the gestures, including gesture segmentation, data reformation and unification, feature extraction, and machine learning based classification. Finally, we evaluate user performance of HeadGesture in the target selection experiment and application tests. The results demonstrate that the performance of HeadGesture is comparable to mid-air hand gestures, measured by completion time. Additionally, users feel significantly less fatigue than when using hand gestures and can learn and remember the gestures easily. Based on these findings, we expect HeadGesture to be an efficient supplementary input approach for HMD devices.
... Compared to the eyes and mouth, there are few methods that address the nose detection problem, even though the nose is no less important than the eyes or other facial landmarks: it is not affected much by facial expressions and in several cases is the only facial feature that is clearly visible during head motion [16,72,73]. Generally, nose detection methods are mainly based on characteristic points such as the nostrils and the tip of the nose [74]. ...
Chapter
Detection of facial landmarks and their feature points plays an important role in many facial image-related applications such as face recognition/verification, facial expression analysis, pose normalization, and 3D face reconstruction. Generally, detection of facial features is easy for people; however, for machines it is not an easy task at all. The difficulty comes from high inter-personal variation (e.g., gender, race), intra-personal changes (e.g., pose, expression), and acquisition conditions (e.g., lighting, image resolution). This chapter discusses basic concepts related to the problem of facial landmark detection and overviews the successes and failures of existing solutions. It also explores the difficulties that hinder the path of progress in the topic and the challenges involved in adapting existing approaches to build successful systems that can be utilized in real-world facial image-related applications. Additionally, it discusses the performance evaluation metrics and the available benchmarking datasets. Finally, it suggests some possible future directions for research in the topic.
... The Nouse™ uses a standard webcam and advanced video recognition algorithms to instantaneously map the movement of the user's nose and head to the movement of a computer mouse device, thereby allowing a user to operate a computer hands-free (Gorodnichy & Roth, 2004; http://www.nouse.ca/en/). The Nouse™ costs CDN$150, is easy to install and requires only the accompanying software and any commercially available webcam. ...
Article
Assistive technology devices for computer access can facilitate social reintegration and promote independence for people who have had a stroke. This work describes the exploration of the usefulness and acceptability of a new computer access device called Nouse™ (Nose-as-mouse). The device uses standard webcam and video recognition algorithms to map the movement of the user’s nose to a computer cursor, thereby allowing hands-free computer operation. Ten participants receiving in- or outpatient stroke rehabilitation completed a series of standardized and everyday computer tasks using Nouse™ and then completed a device usability questionnaire. Task completion rates were high (90%) for computer activities only in the absence of time constraints. Most of the participants were satisfied with ease of use (70%) and liked using Nouse™ (60%), indicating they could resume most of their usual computer activities apart from word-processing using the device. The findings suggest that hands-free computer access devices like Nouse™ may be an option for people who experience upper motor impairment caused by stroke and are highly motivated to resume personal computing. More research is necessary to further evaluate the effectiveness of this technology, especially in relation to other computer access assistive technology devices.
... Generally, users with motor difficulties present coordination problems, reduced muscle strength and restricted movement, including the inability to move the mouse and handle the keyboard. For these people there are several devices and applications that allow them to browse the Web and use the computer as a whole, such as mouth-operated devices to control the movements of the mouse cursor on the screen, cameras that capture the movements of the eyes or head to guide the cursor [Dmitry and Roth, 2004], speech recognition to replace typing and perform simple tasks, among others. ...
... The feedback from 2500 users showed that head tracking improved the game's immersion and realism. Furthermore, Gorodnichy and Roth's study [14] showed that test participants rated playing the game "Aim-n-shoot BubbleFrenzy" with the hands-free 'nose as mouse' technique as more fun and less tiring than playing the game with a mouse. ...
Article
This study aimed to develop and test a hands-free video game that utilizes information on the player's real-time face position and facial expressions as intrinsic elements of gameplay. Special focus was given to investigating the user's subjective experiences in utilizing computer vision input in the game interaction. The player's goal was to steer a drunken character home as quickly as possible by moving their head. Additionally, the player could influence the behavior of game characters by using the facial expressions of frowning and smiling. The participants played the game with computer vision and a conventional joystick and rated the functionality of the control methods and their emotional and game experiences. The results showed that although the functionality of the joystick steering was rated higher than that of the computer vision method, the use of head movements and facial expressions enhanced the experience of game playing in many ways. The participants rated playing with the computer vision technique as more entertaining, interesting, challenging, immersive, and arousing than doing so with a joystick. The results suggested that a high level of experienced arousal in the case of computer vision-based interaction may be a key factor for better experiences of game playing.
... A majority of the proposed vision-based interfaces provide point-only functionality by tracking face/head or facial features [14,7,12] and using the location of the tracked object as a camera mouse. Betke et al. [1] tested normalized correlation template feature tracking in a typing board application. ...
Article
Full-text available
We present a novel vision-based perceptual user interface for hands-free text entry that utilizes face detection and visual gesture detection to manipulate a scrollable virtual keyboard. A thorough experiment was undertaken to quantitatively define the performance of the interface in hands-free pointing, selection and scrolling tasks. The experiments were conducted with nine participants in laboratory conditions. Several face and head gestures were examined for detection robustness and user convenience. The system gave a reasonable performance in terms of a high gesture detection rate and a small false alarm rate. The participants reported that the new interface was easy to understand and operate. Encouraged by these results, we discuss advantages and constraints of the interface and suggest possibilities for design improvements.
... Recently, head/face or facial feature tracking has been applied for pointing at objects in a graphical user interface (aka camera mouse) [1,4,13,14]. Such systems utilize off-the-shelf hardware components and, therefore, do not require external equipment other than a PC and a webcam. ...
Article
Full-text available
Video-based human-computer interaction has received increasing interest over the years. However, earlier research has mainly focused on the technical characteristics of different methods rather than on user performance and experiences in using computer vision technology. This study aims to investigate the performance characteristics of novice users and their subjective experiences in typing text with several video-based pointing and selection techniques. In Experiment 1, eye tracking and head tracking were applied to the task of pointing at the keys of a virtual keyboard. The results showed that gaze pointing was a significantly faster but also more error-prone technique compared with head pointing. Self-reported subjective ratings revealed that it was generally better, faster, more pleasant and more efficient to type using gaze pointing than head pointing. In Experiment 2, mouth-open and brows-up facial gestures were utilized for confirming the selection of a given character. The results showed that text entry speed was approximately the same for both selection techniques, while mouth interaction caused significantly fewer errors than brow interaction. Subjective ratings did not reveal any significant differences between the techniques. Possibilities for design improvements are discussed.
Book
Full-text available
This book is the result of teaching and research in the areas of Virtual Reality, Augmented Reality and vision-based interfaces, carried out at three universities, two of them in Argentina (Universidad Nacional de La Plata (UNLP) and Universidad Nacional del Centro de la Provincia de Buenos Aires (UNICEN)) and one in Spain (Universidad de las Islas Baleares (UIB)). The text is structured in two main parts: the first part is related to Virtual Reality (VR) and Augmented Reality (AR), and the second part to the so-called advanced or Vision-Based Interfaces (VBI). The first part consists of three chapters. Chapter 1 presents an introduction to concepts and technology shared by virtual reality and augmented reality applications. Chapter 2 presents the current challenges for the development of training simulators that use virtual reality, and describes the simulators developed by the Pladema Institute of UNICEN. Chapter 3 presents Augmented Reality, its foundations, tracking algorithms and the libraries used for application development. The content of this chapter is used as teaching material in a course of the Doctorate in Computer Science of the UNLP, currently taught by a professor of that institution and researcher at III-LIDI. The second part, Advanced Interfaces, consists of two chapters. The material included is the result of the teaching and research of two researchers of the Graphics, Computer Vision and Artificial Intelligence Unit of the UIB. Chapter 4 gives an introduction to vision-based interfaces and explains the SINA project developed at the UIB. Chapter 5 presents multi-touch interaction systems and also explains a case study of the design of a multi-touch table.
Article
We propose HeadCross, a head-based interaction method to select targets on VR and AR head-mounted displays (HMD). Using HeadCross, users control the pointer with head movements; to select a target, users move the pointer into the target and then back across the target boundary. In this way, users can select targets without using their hands, which is helpful when users' hands are occupied by other tasks, e.g., while holding handrails. However, a major challenge for head-based methods is the false positive problem: unintentional head movements may be incorrectly recognized as HeadCross gestures and trigger selections. To address this issue, we first conduct a user study (Study 1) to observe user behavior while performing HeadCross and identify the behavior differences between HeadCross and other types of head movements. Based on the results, we discuss design implications, extract useful features, and develop the recognition algorithm for HeadCross. To evaluate HeadCross, we conduct two user studies. In Study 2, we compared HeadCross to the dwell-based selection method, button-press method, and mid-air gesture-based method. Two typical target selection tasks (text entry and menu selection) are tested on both VR and AR interfaces. Results showed that compared to the dwell-based method, HeadCross improved the sense of control; and compared to the two hand-based methods, HeadCross improved interaction efficiency and reduced fatigue. In Study 3, we compared HeadCross to three alternative designs of head-only selection methods. Results show that HeadCross was perceived to be significantly faster than the alternatives. We conclude with a discussion of the interaction potential and limitations of HeadCross.
Article
Full-text available
Background: Gesture is a basic interaction channel that is frequently used by humans to communicate in daily life. In this paper, we explore the use of gesture-based approaches for target acquisition in virtual and augmented reality. A typical process of gesture-based target acquisition is: when a user intends to acquire a target, she performs a gesture with her hands, head or other parts of the body; the computer senses and recognizes the gesture and infers the most probable target. Methods: We build a mental model and a behavior model of the user to study two key parts of the interaction process. The mental model describes how the user thinks up a gesture for acquiring a target, and can be the intuitive mapping between gestures and targets. The behavior model describes how the user moves the body parts to perform the gestures, and the relationship between the gesture that the user intends to perform and the signals that the computer senses. Results: We present and discuss three pieces of research that focus on the mental model and behavior model of gesture-based target acquisition in VR and AR. Conclusions: We show that, by leveraging these two models, interaction experience and performance can be improved in VR and AR environments. Keywords: Gesture-based interaction, Mental model, Behavior model, Virtual reality, Augmented reality
Chapter
This chapter talks about related work on head gesture recognition and usage of head tracking for several applications including games and virtual reality. An experiment which systematically explores the effects of head tracking, in complex gaming environments typically found in commercial video games is presented. This experiment seeks to find if there are any performance benefits of head tracking in games and how it affects the user experience. We present the results of this experiment along with some guidelines for the game designers who wish to use head tracking for games.
Article
A non-invasive eye control approach based on Kinect 2.0 was proposed to improve quality of life for people with upper-limb disabilities. First, Kinect 2.0 was used to track the critical points of the canthus and eyelid. Second, a greyscale differential algorithm was used to position the iris. Finally, a novel eye control model based on the canthus coordinates, the central coordinates of the pupils and the eyelid distance was established for cursor control. To reduce the calibration difficulty and the interference from eye jitter on the control, incremental indirect control was performed to ensure smooth and gradual cursor motion to the fixation point. To solve the Midas touch problem, an eye switch was proposed to reduce excessive unconscious cursor movement. Benefitting from the Kinect 2.0 advantages of high resolution, high efficiency and good stability, the control speed reached 19.9 fps in Matlab, which meets the requirement for real-time control. The method also offers good adaptability to various noise sources, such as rotation, occlusion and scaling. The experimental results suggest the availability and robustness of the new approach.
Conference Paper
We introduce techniques enabling interactive guidance for better self-portrait photos ("selfies") using a smartphone camera. Aesthetic quality is estimated using empirical models for three parameterized composition principles: face size, face position, and lighting direction. The models are built using 2,700 crowdworker assessments of highly-controlled synthetic selfies. These are generated by manipulating a virtual camera and lighting when rendering a realistic 3D model of a human to methodically explore the parameter space. A camera application uses the models to estimate the aesthetic quality of a live selfie preview based on parameters measured by computer vision. The photographer is guided towards a better selfie by directional hints overlaid on the live preview. A study shows the technique provides a 26% increase in aesthetic quality compared to a standard camera application.
Chapter
Body parts have an important role in the life of a human being, and in today's era smartphones have virtually become one of these body parts. Smartphones and touch devices are a distinct interaction medium and have essentially become part and parcel of our daily life; equipped with touch-based screens, they let the user interact with the device more effectively. The objective of our project is to use smartphones without the touch feature, using facial features to select applications, thus making life more comfortable for end users, especially people with amputated or no fingers. The invention of these smartphones has played a vital role in improving the lives of people with such disabilities. These smartphones will be equipped with both types of features: touch-screen operation with and without using hands.
Article
In this paper, an improved Mean-shift algorithm was integrated with a standard tracking–learning–detection (TLD) tracker to improve the tracking performance of the standard TLD model and to enhance its anti-occlusion capability and its ability to distinguish similar objects. The target region obtained by the improved Mean-shift algorithm and the target region obtained by the TLD tracker are integrated to achieve favorable tracking results. The optimized TLD tracking system was then applied to human eye tracking. In the tests, the model adapted itself to partial occlusion, such as eyeglasses, closed eyes and hand occlusion, and the roll angle could approach 90°, the yaw angle 45° and the pitch angle 60°. In addition, the model never mistakenly transferred the tracking region to the other eye (a similar target on the same face) in long-term tracking. Experimental results indicate that: (1) the optimized TLD model shows sound tracking stability even when targets are partially occluded or rotated; (2) tracking speed and accuracy are superior to those of the standard TLD and some mainstream tracking methods. In summary, the optimized TLD model shows higher robustness and stability and responds better to complex eye tracking requirements.
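As a baseline illustration of the plain mean-shift step only (not the paper's improved variant or its TLD integration), OpenCV's hue-histogram back-projection with cv2.meanShift can be sketched as follows; the camera index and the manually selected initial eye window are assumptions.

    # Sketch: plain mean-shift tracking of a manually selected eye region using a
    # hue-histogram back-projection. Baseline only; not the paper's method.
    import cv2
    import numpy as np

    cap = cv2.VideoCapture(0)                       # assumed webcam
    ok, frame = cap.read()
    assert ok

    x, y, w, h = cv2.selectROI("select eye region", frame)   # assumed init
    track_window = (x, y, w, h)

    hsv_roi = cv2.cvtColor(frame[y:y+h, x:x+w], cv2.COLOR_BGR2HSV)
    roi_hist = cv2.calcHist([hsv_roi], [0], None, [180], [0, 180])
    cv2.normalize(roi_hist, roi_hist, 0, 255, cv2.NORM_MINMAX)

    term_crit = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        back_proj = cv2.calcBackProject([hsv], [0], roi_hist, [0, 180], 1)
        # Shift the window towards the mode of the back-projection.
        _, track_window = cv2.meanShift(back_proj, track_window, term_crit)
        x, y, w, h = track_window
        cv2.rectangle(frame, (x, y), (x + w, y + h), (255, 0, 0), 2)
        cv2.imshow("mean-shift sketch", frame)
        if cv2.waitKey(1) & 0xFF == 27:
            break

    cap.release()
    cv2.destroyAllWindows()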
Article
To improve single-handed operation of mobile devices, rear touch panel operation to control commands and facial feature detection to control cursor position are proposed. Operational control is achieved through finger chord gestures on a rear touch panel, and nose movement is used to control cursor movement. Zooming is achieved by detecting the apparent distance between the left and right eyes in conjunction with a finger chord gesture. We have evaluated movement time, error rates, and the throughputs of these techniques in comparison with the conventional single-handed front touch panel thumb operations using Fitts's law. Experiments have been conducted to evaluate two operation modes, selection and zooming, in the form of reciprocal 1-D pointing tasks with 12 participants. For the target selection task, the proposed technique achieved 12% (0.26 s) shorter movement time and 4.7% smaller error rate than the conventional method on average. Especially, for long distance targets, the performance of the conventional method became remarkably inferior due to the limit of reach of the thumb, whereas the proposed technique achieved much less deterioration and obtained expected performance because the cursor could reach anywhere on the display. For the target size adjustment task, the proposed technique achieved 9% (0.22 s) shorter movement time than the conventional method and obtained a comparable error rate of less than 4%. Consequently, we could demonstrate the techniques that make single-handed select and zoom operations available anywhere on a large-sized tablet device with no blockage of the display.
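The evaluation above relies on Fitts's law. As a short reminder of the computation behind movement-time and throughput comparisons of this kind (the distance, width, and time below are made-up example values, not the paper's data):

    # Sketch: Fitts's law index of difficulty and throughput for a pointing trial.
    # ID = log2(D / W + 1) bits (Shannon formulation); throughput = ID / MT.
    import math

    D = 600.0    # movement distance in pixels (example value)
    W = 40.0     # target width in pixels (example value)
    MT = 1.2     # observed movement time in seconds (example value)

    ID = math.log2(D / W + 1.0)          # index of difficulty in bits
    throughput = ID / MT                 # bits per second
    print(f"ID = {ID:.2f} bits, throughput = {throughput:.2f} bit/s")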
Article
Because human face tracking is prone to failure under complex postures (large tilt, rotation, occlusion and color interference), this paper proposes a new face tracking algorithm based on online modification, which includes two modules: a detection module and a tracking module. The tracking module takes over when detection fails, and, to reduce the continuous accumulation of errors during tracking, real-time face detection is used to correct the parameters (including tracking location and scale) of the tracking module, making use of the respective advantages of face detection and tracking. Experiments show that the algorithm is able to accurately track a face under large tilt and rotation and is also robust to color interference and occlusion. In addition, the algorithm is used to control commercial games, which provides a new interaction method for HCI.
Conference Paper
Nose localization is important for face recognition, face pose recognition, 3D face reconstruction and so on. In this paper, a novel method for nose localization is proposed. Our method includes two Subclass Discriminant Analysis (SDA) based steps. The first step locates nose from the whole face image and some randomly selected image patches are used as negative samples for the training of SDA classifier. The second step refines nose position by using some nose context patches as negative samples. The proposed method detects nose from the whole face image and no prior knowledge about the layout of face components on a face is employed. Experimental results on AR images show that the proposed method can accurately locate nose from face images, and is robust to lighting and facial expression changes.
Conference Paper
We propose a selection method, "target reverse crossing," for use with camera-based mouse-replacement for people with motion impairments. We assessed the method by comparing it to the selection mechanism "dwell-time clicking," which is widely used by camera-based mouse-replacement systems. Our results show that target reverse crossing is more efficient than dwell-time clicking, while its one-time success accuracy is lower. We found that target directions have effects on the accuracy of reverse crossing. We also show that increasing the target size improves the performance of reverse crossing significantly, which provides future interface design implications for this selection method.
Article
Training surgeons in minimally invasive surgery (MIS) requires surgical residents to operate under the direction of a consultant. The inability of the instructing surgeon to point at the laparoscopic monitor without releasing the instruments remains a barrier to effective instruction. The wireless hands-free surgical pointer (WHaSP) has been developed to aid instruction during MIS. The objective of this study was to evaluate the effectiveness and likeability of the WHaSP as an instructional tool compared with the conventional methods. Data were successfully collected during 103 laparoscopic cholecystectomy procedures, which had been randomized to use or not use the WHaSP as a teaching tool. Audio and video from the surgeries were recorded and analyzed. Instructing surgeons, operating surgeons, and camera assistants provided feedback through a post-operative questionnaire that used a five-level Likert scale. The questionnaire results were analyzed using a Mann-Whitney U test. There were no negative effects on surgery completion time or instruction practice due to the use of the WHaSP. The number of times an instructor surgeon pointed to the laparoscopic screen with their hand was significantly reduced when the WHaSP was utilized (p < 0.001). The questionnaires showed that WHaSP users found it to be comfortable, easy to use, and easy to control. Compared to when the WHaSP was not used, users found that communication was more effective (p = 0.002), locations were easier to communicate (p < 0.001), and instructions were easier to follow (p = 0.005). The WHaSP system was successfully used in surgery. It integrated seamlessly into existing equipment within the operating room and did not affect flow. The positive outcomes of utilizing the WHaSP were improved communication in the OR, improved efficiency and safety of the surgery, easy to use, and comfortable to wear. The surgeons showed a preference for utilizing the WHaSP if given a choice.
Article
A robust and precise scheme for detecting faces and locating the facial components in images in the presence of varying facial contexts as well as complex backgrounds is presented. The system is based on the estimation of the pixel-wise colour distribution of facial components and geometrical information of faces. Probability maps for facial elements are constructed using a Gaussian mixture model (GMM) based on the chroma and luma character of facial components. Face candidates are generated based on the AdaBoost detection algorithm, and the local skin patch is extracted to generate a skin probability map based on the GMM. A series of fusion strategies on the probability maps is then designed to construct eye, mouth and skin binary maps for verifying each face candidate and locating its facial components, taking facial geometry into consideration. Morphological operators are used for post-processing. Experiments show that more accurate detection results can be obtained as compared to other state-of-the-art methods.
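A condensed sketch of the skin-probability-map step with scikit-learn is shown below. The synthetic training pixels, the (Cr, Cb) colour space, and the threshold are illustrative assumptions, not the cited paper's exact setup.

    # Sketch: learn a GMM over skin chroma values (Cr, Cb) and score every pixel
    # of a new image to obtain a skin probability map. Placeholder data only.
    import numpy as np
    import cv2
    from sklearn.mixture import GaussianMixture

    def crcb(bgr):
        ycrcb = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb)
        return ycrcb[:, :, 1:3].reshape(-1, 2).astype(np.float64)

    rng = np.random.default_rng(0)
    # Placeholder "skin" training patch: BGR values clustered around a skin tone
    # (in practice these would come from detected face candidates).
    skin_patch = np.clip(rng.normal([80, 120, 180], 12, size=(40, 40, 3)),
                         0, 255).astype(np.uint8)

    gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
    gmm.fit(crcb(skin_patch))

    # Score a test image pixel by pixel; higher log-likelihood = more skin-like.
    test_image = np.clip(rng.normal(128, 40, size=(120, 160, 3)),
                         0, 255).astype(np.uint8)
    skin_map = gmm.score_samples(crcb(test_image)).reshape(test_image.shape[:2])
    skin_binary = (skin_map > np.percentile(skin_map, 80)).astype(np.uint8)  # assumed threshold
    print(skin_binary.mean())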
Conference Paper
This paper reports the design, implementation, and results of a carefully designed experiment that examined the performance of a camera-based mouse-replacement interface that was supported with visual feedback. Four different visual feedback modes were tested during the pointing-task experiment. Quantitative results, based on three metrics, do not show statistically significant difference between these modes. Qualitative feedback from the participants of the experiments, however, shows that user experience is improved by static and animated visual feedback during the pointing task.
Article
Nose is one of the salient features in a human face, and its localization is important for face recognition, face pose recognition, 3D face reconstruction, etc. In this paper, a novel nose tip localization method is proposed, which is based on two-stage subclass discriminant analysis (SDA). At the first stage, some randomly selected image patches are used as negative samples for the training of SDA classifier, and nose is detected from the whole face image. The second stage refines nose tip position by using some nose context patches as negative samples for the training of SDA classifier. The proposed method detects nose from the whole face image and no a priori knowledge about the layout of face components is used. Experimental results on AR images show that the proposed method can achieve high nose tip localization rates, and is robust to changes of illumination and facial expression.
Conference Paper
This paper describes a 3D multimodal user interface integrating 2D mouse input and 3D head tracking for 3D desktop virtual environments (DVE). A novel test-bed is proposed to evaluate the functionality and usability of this 3D interface by combining traditional direct manipulation techniques with head rotation to interact with objects in a 3D DVE, using a setup composed of a relatively up-to-date desktop computer, a webcam and a mouse. We presume the results will indicate how suitable head tracking is as a 3D interface integrated with a 2D interaction device in a 3D DVE.
Conference Paper
This paper considers the problems of minimizing the completion time and reducing decoding complexity for relay-aided wireless broadcast. Both network coding and scheduling problems are considered. A deterministic network coding algorithm is designed to select innovative encoding vectors, which is applicable to both base station and relay. Compared with random linear network coding, the proposed algorithm can reduce decoding complexity significantly by selecting sparse encoding vectors. Integrating with the proposed network coding algorithm, a scheduling scheme based on dynamic programming is proposed, which is proved to be optimal in terms of minimizing expected completion time. Simulation shows that the proposed network coding algorithm and scheduling scheme work very well on both reducing completion time and decoding complexity.
Article
Full-text available
This work focuses on camera-based systems that are designed for mouse replacement. Usually, these interfaces are based on computer vision techniques that capture the user’s face or head movements and are specifically designed for users with disabilities. The work identifies and reviews the key factors of these interfaces based on the lessons learnt by the authors’ experience and by a comprehensive analysis of the literature to describe the specific points to consider in their design. These factors are as follows: user features to track, initial user detection (calibration), position mapping, feedback, error recovery, event execution, profiles and ergonomics of the system. The work compiles the solutions offered by different systems to help new designers avoid problems already discussed by the others.
Article
To address the challenges of surgical instruction during minimally invasive surgery (MIS), a wireless hands-free pointer system has been developed. The Wireless Hands-free Surgical Pointer system incorporates infrared and inertial tracking technologies to address the need for hands-free pointing during MIS. The combination of these technologies allows for optimal movement of the pointer and excellent accuracy while the user is located at a realistic distance from the surgical monitor. Several experimental evaluations were performed to optimize the settings of the sensors, and to validate the system when compared to a commercially available hands-free pointing system. The results show improved performance with the proposed technology as measured by the total trajectory travelled by the pointer and the smoothness of the curve. The technology presented has the potential to significantly improve surgical instruction and guidance during MIS.
Article
This paper presents an evaluation of a mouse interface device using tooth-touch sound and expiration signals, which we developed as a pointing device for disabled persons. Our device enabled disabled persons to operate a personal computer easily using a mouse driven by their tooth-touch and expiration. It also had superior features, being easy to handle, light weight, user-friendly and inexpensive to make. The performance of our device was evaluated using Fitts' law, which estimated the comparative usability of the pointing device against that of a conventional ball-type mouse. Finally, we designed a rounding-type menu to improve the input efficiency of the device and apply it in a TV controller. We then compared the input velocity of our device against that of a conventional mouse.
Article
This paper introduces a drowsiness scale that provides instantaneous overall predictions about observed anomalous driver behavior. The driver can be informed about his/her own driving condition by a camera mounted inside the vehicle. Data obtained by observing driver behavior is not sufficient to make a correct decision about the overall vehicle and driver state unless road and vehicle conditions are also considered. Various driver-related observations are involved in the design of an observatory system in collaboration with external road sensory inputs. In our system, we propose a Bayesian learning method for the driver awareness state in the learning phase. An auto-regressive moving average (ARMA) model is devised as the driver drowsiness predictor. A mean-square tracking error is measured for different head positions to determine the predictor's reliability and robustness under different illumination and driving conditions. An empirical set of plots is derived for the head positions corresponding to normal and drowsy driving conditions.
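In the spirit of the ARMA predictor described above, a compact sketch with statsmodels is given below; the model order and the synthetic head-position signal are assumptions, not the paper's data or configuration.

    # Sketch: fit an ARMA(2,1) model to a synthetic head-position signal and
    # forecast the next samples. Order and data are illustrative assumptions.
    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    rng = np.random.default_rng(0)
    t = np.arange(300)
    # Synthetic vertical head position: slow nodding drift plus noise.
    y = 0.5 * np.sin(2 * np.pi * t / 60.0) + 0.1 * rng.normal(size=t.size)

    model = ARIMA(y, order=(2, 0, 1))   # ARMA(p=2, q=1); d=0 means no differencing
    result = model.fit()

    forecast = result.forecast(steps=10)                  # predicted head positions
    residual_rms = np.sqrt(np.mean(result.resid ** 2))    # tracking-error measure
    print("10-step forecast:", np.round(forecast, 3))
    print("RMS residual:", round(float(residual_rms), 3))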
Conference Paper
Full-text available
A system able to detect the presence of the tongue and locate its relative position within the mouth region, using video information obtained from a web camera, is proposed in this paper. The system consists of an offline phase, prior to operation by the final user, in which a 3-layer cascade of SVM classifiers is trained using a database of 'tongue vs. not-tongue' images, i.e. segmented images containing our region of interest: the mouth with the tongue in three possible positions, center, left or right. The first training stage discerns whether the tongue is present or not, passing its output to the next stage, in which the presence of the tongue in the center of the mouth is evaluated; finally, in the last stage, a left vs. right position detection is assessed. Due to the novelty of the proposed system, a database needed to be created using information gathered from people of distinct ethnic backgrounds. While the system has yet to be tested in an online stage, results obtained from the offline phase show that real-time performance is feasible in the near future. Finally, diverse applications of this prototype system are introduced, demonstrating that the tongue can be effectively used as an alternative input device by a broad range of users, including people with a physical disability.
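A rough sketch of such a staged cascade, using scikit-learn SVMs on synthetic stand-in feature vectors; the features, labels and kernels here are assumptions, and the authors' database and training procedure are not reproduced:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Synthetic stand-ins for mouth-region feature vectors and their labels.
X = rng.normal(size=(600, 64))
present = rng.integers(0, 2, 600)          # stage 1: tongue present?
centered = rng.integers(0, 2, 600)         # stage 2: centered vs. to a side
left = rng.integers(0, 2, 600)             # stage 3: left vs. right

stage1 = SVC(kernel="rbf").fit(X, present)
stage2 = SVC(kernel="rbf").fit(X[present == 1], centered[present == 1])
side_mask = (present == 1) & (centered == 0)
stage3 = SVC(kernel="rbf").fit(X[side_mask], left[side_mask])

def classify(x):
    """Run the cascade on one feature vector; returns a coarse tongue state."""
    x = x.reshape(1, -1)
    if stage1.predict(x)[0] == 0:
        return "no tongue"
    if stage2.predict(x)[0] == 1:
        return "center"
    return "left" if stage3.predict(x)[0] == 1 else "right"

print(classify(X[0]))
```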
Conference Paper
Full-text available
This paper describes a work in progress relating to the development of a virtual monitoring environment for space telemanipulation systems. The focus is on the improvement of performance and safety of current control systems by using the potential offered by virtualized reality, as well as good human factors practices.
Article
Full-text available
Classical least squares regression consists of minimizing the sum of the squared residuals. Many authors have produced more robust versions of this estimator by replacing the square by something else, such as the absolute value. In this article a different approach is introduced in which the sum is replaced by the median of the squared residuals. The resulting estimator can resist the effect of nearly 50% of contamination in the data. In the special case of simple regression, it corresponds to finding the narrowest strip covering half of the observations. Generalizations are possible to multivariate location, orthogonal regression, and hypothesis testing in linear models.
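The estimator can be sketched for simple regression by repeatedly fitting a line to a random minimal subset (two points) and keeping the line with the smallest median of squared residuals; the trial count and data below are illustrative only:

```python
import numpy as np

def lmeds_line(x, y, n_trials=500, rng=np.random.default_rng(0)):
    """Least-median-of-squares fit of y = a*x + b via random minimal subsets.

    Each trial fits a line to two randomly chosen points and scores it by the
    median of the squared residuals over all points; the best-scoring line wins.
    """
    best, best_med = None, np.inf
    for _ in range(n_trials):
        i, j = rng.choice(len(x), size=2, replace=False)
        if x[i] == x[j]:
            continue
        a = (y[j] - y[i]) / (x[j] - x[i])
        b = y[i] - a * x[i]
        med = np.median((y - (a * x + b)) ** 2)
        if med < best_med:
            best, best_med = (a, b), med
    return best

# Data with 40% gross outliers: LMedS still recovers the underlying line.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 0.1, 100)
y[:40] += rng.uniform(20, 50, 40)            # contaminate 40% of the points
print(lmeds_line(x, y))                      # close to (2.0, 1.0)
```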
Article
Full-text available
Traditionally, image intensities have been processed to segment an image into regions or to find edge-fragments. Image intensities carry a great deal more information about three-dimensional shape, however. To exploit this information, it is necessary to understand how images are formed and what determines the observed intensity in the image. The gradient space, popularized by Huffman and Mackworth in a slightly different context, is a helpful tool in the development of new methods.
Conference Paper
Full-text available
There is a physiological reason, backed by the theory of visual attention in living organisms, why animals look into each other's eyes. This illustrates the two main properties in which recognizing faces in video differs from its static counterpart, recognizing faces in images. First, the lack of resolution in video is abundantly compensated by the information coming from the time dimension; video data is inherently dynamic. Second, video processing is a phenomenon occurring all the time around us in biological systems, and many results unravelling the intricacies of biological vision have already been obtained. At the same time, as we examine the way video-based face recognition is approached by computer scientists, we notice that up till now video information is often used only partially and therefore not very efficiently. This work aims at bridging this gap. We develop a multi-channel framework for video-based face processing which incorporates the dynamic component of video. The utility of the framework is shown on the example of detecting and recognizing faces from blinking. While doing so, we derive a canonical representation of a face best suited for the task.
Conference Paper
Full-text available
We present a neural network-based upright frontal face detection system. A retinally connected neural network examines small windows of an image and decides whether each window contains a face. The system arbitrates between multiple networks to improve performance over a single network. We present a straightforward procedure for aligning positive face examples for training. To collect negative examples, we use a bootstrap algorithm, which adds false detections into the training set as training progresses. This eliminates the difficult task of manually selecting nonface training examples, which must be chosen to span the entire space of nonface images. Simple heuristics, such as using the fact that faces rarely overlap in images, can further improve the accuracy. Comparisons with several other state-of-the-art face detection systems are presented, showing that our system has comparable performance in terms of detection and false-positive rates.
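The bootstrap idea for negatives can be illustrated independently of the retinally connected architecture: train a detector, scan a pool of non-face samples, add the false detections to the negative training set, and retrain. The sketch below uses a generic scikit-learn classifier on synthetic vectors purely to show the loop structure; it is not the authors' network or data:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

faces = rng.normal(loc=1.0, size=(200, 20))          # stand-in face windows
nonface_pool = rng.normal(loc=0.0, size=(5000, 20))  # large pool of non-face windows

negatives = nonface_pool[:200]                       # small initial negative set
for round_ in range(3):
    X = np.vstack([faces, negatives])
    y = np.hstack([np.ones(len(faces)), np.zeros(len(negatives))])
    clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500,
                        random_state=0).fit(X, y)
    # Bootstrap step: any non-face window the detector calls a face is a
    # false detection and gets added to the negative training set.
    false_pos = nonface_pool[clf.predict(nonface_pool) == 1]
    print(f"round {round_}: {len(false_pos)} false detections")
    negatives = np.vstack([negatives, false_pos])
```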
Conference Paper
Full-text available
We develop an automatic system to analyze subtle changes in upper face expressions based on both permanent facial features (brows, eyes, mouth) and transient facial features (deepening of facial furrows) in a nearly frontal image sequence. Our system recognizes fine-grained changes in facial expression based on Facial Action Coding System (FACS) action units (AUs). Multi-state facial component models are proposed for tracking and modeling different facial features, including eyes, brows, cheeks, and furrows. We then convert the results of tracking into detailed parametric descriptions of the facial features. These feature parameters are fed to a neural network which recognizes 7 upper face action units. A recognition rate of 95% is obtained for test data that include both single action units and AU combinations
Conference Paper
Full-text available
This paper presents an analysis of the performance of two different skin chrominance models and of nine different chrominance spaces for the color segmentation and subsequent detection of human faces in two-dimensional static images. For each space, we use the single Gaussian model based on the Mahalanobis metric and a Gaussian mixture density model to segment faces from scene backgrounds. In the case of the mixture density model, the skin chrominance distribution is estimated by use of the expectation-maximisation (EM) algorithm. Feature extraction is performed on the segmented images by use of invariant Fourier-Mellin moments. A multilayer perceptron neural network (NN), with the invariant moments as the input vector, is then applied to distinguish faces from distractors. With the single Gaussian model, normalized color spaces are shown to produce the best segmentation results, and subsequently the highest rate of face detection. The results are comparable to those obtained with the more sophisticated mixture density model. However, the mixture density model improves the segmentation and face detection results significantly for most of the un-normalized color spaces. Ultimately, we show that, for each chrominance space, the detection efficiency depends on the capacity of each model to estimate the skin chrominance distribution and, most importantly, on the discriminability between skin and “non-skin” distributions
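A minimal sketch of the single-Gaussian chrominance model, assuming normalized r-g chrominance and a hand-picked Mahalanobis-distance threshold; the color space, threshold and training pixels are illustrative, and the mixture-density/EM variant is not shown:

```python
import numpy as np

def to_rg(pixels_rgb):
    """Map RGB pixels to normalized r-g chrominance (illumination-normalized)."""
    p = pixels_rgb.astype(np.float64) + 1e-6
    s = p.sum(axis=-1, keepdims=True)
    return (p[..., :2] / s).reshape(-1, 2)

# Fit the single-Gaussian skin model from labelled skin pixels (synthetic here).
rng = np.random.default_rng(0)
skin_pixels = rng.normal([180, 120, 90], 10, size=(1000, 3)).clip(0, 255)
rg = to_rg(skin_pixels)
mu = rg.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(rg, rowvar=False))

def skin_mask(image_rgb, threshold=9.0):
    """Label pixels whose squared Mahalanobis distance to the skin mean is small."""
    d = to_rg(image_rgb) - mu
    maha2 = np.einsum("ij,jk,ik->i", d, cov_inv, d)
    return (maha2 < threshold).reshape(image_rgb.shape[:-1])

# Usage on a (height, width, 3) image; here the training pixels are reused as a toy image.
mask = skin_mask(skin_pixels.reshape(10, 100, 3))
print(mask.mean())
```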
Article
Full-text available
We show how partial reduction of the self-connections of a network designed with the pseudo-inverse learning rule increases the direct attraction radius of the network. A theoretical formula is obtained, and data obtained by simulation are presented.
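A rough illustration of the idea, assuming the pseudo-inverse (projection) rule, i.e. the weight matrix is the prototype matrix times its pseudo-inverse, and a scalar factor that partially removes the diagonal self-connections; the reduction factor, recall procedure and test below are assumptions made for the sketch, not the paper's formula:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 64, 10
V = rng.choice([-1, 1], size=(N, M)).astype(float)   # M bipolar prototypes as columns

W = V @ np.linalg.pinv(V)                             # pseudo-inverse learning rule

def reduce_self_connections(W, d):
    """Scale the diagonal (self-connections) by (1 - d); d = 0 keeps them, d = 1 removes them."""
    Wd = W.copy()
    np.fill_diagonal(Wd, (1.0 - d) * np.diag(W))
    return Wd

def recall(Wd, x, steps=20):
    """Synchronous sign-threshold iterations starting from a noisy probe."""
    for _ in range(steps):
        x = np.sign(Wd @ x)
        x[x == 0] = 1
    return x

# Probe with a corrupted version of the first prototype.
probe = V[:, 0].copy()
flip = rng.choice(N, size=8, replace=False)
probe[flip] *= -1
retrieved = recall(reduce_self_connections(W, d=0.7), probe)
print(int(np.sum(retrieved == V[:, 0])), "of", N, "bits recovered")
```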
Article
Full-text available
The concept of second-order change detection, which allows one to discriminate the local (most recent) change in an image, such as a blink of the eyes, from the global (long-lasting) change, such as the motion of the head, is introduced. This concept sets the base for designing complete face-operated control systems in which, using the analogy with a mouse, "pointing" is performed by the nose and "clicking" is performed by double-blinking of the eyes. The implementation of such systems is described.
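One loose reading of second-order change detection is to keep the pixels that change in the current frame pair but were stable in the previous pair, so an abrupt local event such as a blink stands out from ongoing global motion. A webcam sketch under that assumption follows; the threshold and the formulation are illustrative, not the paper's exact operator:

```python
import cv2
import numpy as np

cap = cv2.VideoCapture(0)                 # assumed webcam at index 0
frames = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    frames.append(gray)
    if len(frames) < 3:
        continue
    frames = frames[-3:]
    d_prev = cv2.absdiff(frames[1], frames[0])   # change that was already happening
    d_curr = cv2.absdiff(frames[2], frames[1])   # most recent change
    # Second-order change: pixels that changed just now but were stable before,
    # which picks out an abrupt local event (a blink) amid long-lasting motion.
    local = (d_curr > 20) & (d_prev <= 20)
    cv2.imshow("second-order change", local.astype(np.uint8) * 255)
    if cv2.waitKey(1) == 27:                     # Esc quits
        break
cap.release()
cv2.destroyAllWindows()
```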
Article
Full-text available
For humans, to view a scene with two eyes is clearly more advantageous than to do so with one eye. In computer vision, however, most high-level vision tasks, an example of which is face tracking, are still done with one camera only. This is due to the fact that, unlike in human brains, the relationship between the images observed by two arbitrary video cameras is in many cases not known. Recent advances in projective vision theory, however, have produced the methodology which allows one to compute this relationship. This relationship is naturally obtained while observing the same scene with both cameras, and knowing it not only makes it possible to track features in 3D, but also makes tracking much more robust and precise. In this paper, we establish a framework based on projective vision for tracking faces in 3D using two arbitrary cameras, and describe a stereo tracking system which uses the proposed framework to track faces in 3D with the aid of two USB cameras. While being very affordable, our stereo tracker exhibits pixel-level precision and is robust to head rotation about all three axes.
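The inter-camera relationship the paper relies on can be illustrated with a standard fundamental-matrix estimate from matched points between the two views. The sketch below uses OpenCV's ORB matching and the LMedS option of findFundamentalMat and is only a stand-in for the paper's projective-vision pipeline; the image files are hypothetical:

```python
import cv2
import numpy as np

def epipolar_relation(img_left, img_right):
    """Estimate the fundamental matrix relating two arbitrary, uncalibrated views.

    Matches ORB keypoints between the views and fits F robustly (LMedS); F is the
    relationship that lets a feature found in one view constrain its position in
    the other, which is what makes two-camera tracking more robust and precise.
    """
    orb = cv2.ORB_create(1000)
    k1, d1 = orb.detectAndCompute(img_left, None)
    k2, d2 = orb.detectAndCompute(img_right, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(d1, d2), key=lambda m: m.distance)[:200]
    pts1 = np.float32([k1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([k2[m.trainIdx].pt for m in matches])
    F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_LMEDS)
    keep = inlier_mask.ravel() == 1
    return F, pts1[keep], pts2[keep]

# Hypothetical usage with snapshots from the two USB cameras:
# F, p1, p2 = epipolar_relation(cv2.imread("left.png", 0), cv2.imread("right.png", 0))
```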
Conference Paper
Full-text available
The human nose, while in many cases the only facial feature clearly visible throughout head motion, seems to be greatly undervalued in face-tracking technology. This paper shows, theoretically and by experiments conducted with ordinary USB cameras, that by properly defining the nose as an extremum of the 3D curvature of the nose surface, it becomes the most robust feature: one that can be seen for almost any position of the head and can be tracked very precisely, even with low-resolution cameras
Article
Full-text available
Human face detection plays an important role in applications such as video surveillance, human computer interface, face recognition, and face image database management. We propose a face detection algorithm for color images in the presence of varying lighting conditions as well as complex backgrounds. Based on a novel lighting compensation technique and a nonlinear color transformation, our method detects skin regions over the entire image and then generates face candidates based on the spatial arrangement of these skin patches. The algorithm constructs eye, mouth, and boundary maps for verifying each face candidate. Experimental results demonstrate successful face detection over a wide range of facial variations in color, position, scale, orientation, 3D pose, and expression in images from several photo collections (both indoors and outdoors)
Article
Full-text available
Images containing faces are essential to intelligent vision-based human-computer interaction, and research efforts in face processing include face recognition, face tracking, pose estimation and expression recognition. However, many reported methods assume that the faces in an image or an image sequence have been identified and localized. To build fully automated systems that analyze the information contained in face images, robust and efficient face detection algorithms are required. Given a single image, the goal of face detection is to identify all image regions which contain a face, regardless of its 3D position, orientation and lighting conditions. Such a problem is challenging because faces are non-rigid and have a high degree of variability in size, shape, color and texture. Numerous techniques have been developed to detect faces in a single image, and the purpose of this paper is to categorize and evaluate these algorithms. We also discuss relevant issues such as data collection, evaluation metrics and benchmarking. After analyzing these algorithms and identifying their limitations, we conclude with several promising directions for future research
Article
Full-text available
Most automatic expression analysis systems attempt to recognize a small set of prototypic expressions, such as happiness, anger, surprise, and fear. Such prototypic expressions, however, occur rather infrequently. Human emotions and intentions are more often communicated by changes in one or a few discrete facial features. In this paper, we develop an automatic face analysis (AFA) system to analyze facial expressions based on both permanent facial features (brows, eyes, mouth) and transient facial features (deepening of facial furrows) in a nearly frontal-view face image sequence. The AFA system recognizes fine-grained changes in facial expression into action units (AU) of the Facial Action Coding System (FACS), instead of a few prototypic expressions. Multistate face and facial component models are proposed for tracking and modeling the various facial features, including lips, eyes, brows, cheeks, and furrows. During tracking, detailed parametric descriptions of the facial features are extracted. With these parameters as the inputs, a group of action units (neutral expression, six upper face AU and 10 lower face AU) are recognized whether they occur alone or in combinations. The system has achieved average recognition rates of 96.4 percent (95.4 percent if neutral expressions are excluded) for upper face AU and 96.7 percent (95.6 percent with neutral expressions excluded) for lower face AU. The generalizability of the system has been tested by using independent image databases collected and FACS-coded for ground-truth by different research teams
Article
Full-text available
Detecting faces in images with complex backgrounds is a difficult task. Our approach, which obtains state of the art results, is based on a neural network model: the constrained generative model (CGM). Generative, since the goal of the learning process is to evaluate the probability that the model has generated the input data, and constrained since some counter-examples are used to increase the quality of the estimation performed by the model. To detect side view faces and to decrease the number of false alarms, a conditional mixture of networks is used. To decrease the computational time cost, a fast search algorithm is proposed. The level of performance reached, in terms of detection accuracy and processing time, allows us to apply this detector to a real world application: the indexing of images and videos
Article
Full-text available
Humans detect and interpret faces and facial expressions in a scene with little or no effort. Still, development of an automated system that accomplishes this task is rather difficult. There are several related problems: detection of an image segment as a face, extraction of the facial expression information, and classification of the expression (e.g., in emotion categories). A system that performs these operations accurately and in real time would form a big step in achieving a human-like interaction between man and machine. The paper surveys the past work in solving these problems. The capability of the human visual system with respect to these problems is discussed, too. It is meant to serve as an ultimate goal and a guide for determining recommendations for development of an automatic facial expression analyzer.
Article
Full-text available
Over the last 20 years, several different techniques have been proposed for computer recognition of human faces. The purpose of this paper is to compare two simple but general strategies on a common database (frontal images of faces of 47 people: 26 males and 21 females, four images per person). We have developed and implemented two new algorithms; the first one is based on the computation of a set of geometrical features, such as nose width and length, mouth position, and chin shape, and the second one is based on almost-grey-level template matching. The results obtained on the testing sets (about 90% correct recognition using geometrical features and perfect recognition using template matching) favor our implementation of the template-matching approach.
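The grey-level template-matching strategy can be sketched with normalized cross-correlation: each enrolled person contributes a face template, and the probe image is assigned to whoever's template correlates best. The file names and the simple decision rule below are hypothetical, not the authors' protocol:

```python
import cv2
import numpy as np

def best_match(probe_gray, templates):
    """Identify a face by normalized cross-correlation against stored templates.

    `templates` maps person -> grey-level face template; the person whose template
    correlates best (anywhere in the probe image) is returned along with the score.
    """
    best_person, best_score = None, -1.0
    for person, tmpl in templates.items():
        result = cv2.matchTemplate(probe_gray, tmpl, cv2.TM_CCOEFF_NORMED)
        score = float(result.max())
        if score > best_score:
            best_person, best_score = person, score
    return best_person, best_score

# Hypothetical usage with grey-level images on disk:
# templates = {"alice": cv2.imread("alice_face.png", 0), "bob": cv2.imread("bob_face.png", 0)}
# print(best_match(cv2.imread("probe.png", 0), templates))
```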
Article
Full-text available
The paradigm of projective vision has recently become popular. In this paper we describe a system for computing camera positions from an image sequence using projective methods. Projective methods are normally used to deal with uncalibrated images. However, we claim that even when calibration information is available it is often better to use the projective approach. By computing the trilinear tensor it is possible to produce a reliable and accurate set of correspondences. When calibration information is available these correspondences can be sent directly to a photogrammetric program to produce a set of camera positions. We show one way of dealing with the problem of cumulative error in the tensor computation and demonstrate that projective methods can handle surprisingly large baselines, in certain cases one third of the image size. In practice projective methods, along with random sampling algorithms, solve the correspondence problem for many image sequences. To aid in the understanding of this relatively new paradigm we make our binaries available for others on the web. Our software is structured in a way that makes experimentation easy and includes a viewer for displaying the final results.
Article
As a first step towards a perceptual user interface, a computer vision color tracking algorithm is developed and applied towards tracking human faces. Computer vision algorithms that are intended to form part of a perceptual user interface must be fast and efficient. They must be able to track in real time yet not absorb a major share of computational resources: other tasks must be able to run while the visual interface is being used. The new algorithm developed here is based on a robust...
Article
The advances in computer vision, a branch of artificial intelligence that focuses on providing computers with the functions typical of human vision, are discussed. Computer vision has produced important applications in fields such as robotics, biomedicine, industrial automation and satellite observation of Earth. The basic idea behind the use of computer vision in HCIs is that a computer can be instructed more naturally by human gestures than by using a keyboard or mouse. The potential of computer vision for improving plant and public safety is attracting increasing attention in the security-conscious community.
Article
Human face tracking (HFT) is one of several technologies useful in vision-based interaction (VBI), which is one of several technologies useful in the broader area of perceptual user interfaces (PUI). In this paper we motivate our interests in PUI and VBI, and describe our recent efforts in various aspects of face tracking in the Interaction Lab at UCSB. The HFT methods (GWN, EHT, and CFD), in the context of VBI and PUI, are part of an overall "TLA approach" to face tracking. TLA /T-L-A/ n. [Three-Letter Acronym] 1. Self-describing abbreviation for a species with which computing terminology is infested. 2. Any confusing acronym…. (From the Jargon File v. 4.3.1)
Conference Paper
This paper presents progress toward an integrated, robust, real-time face detection and demographic analysis system. Faces are detected and extracted using the fast algorithm proposed by P. Viola and M.J. Jones (2001). Detected faces are passed to a demographic (gender and ethnicity) classifier which uses the same architecture as the face detector. This demographic classifier is extremely fast, and delivers error rates slightly better than the best-known classifiers. To counter the unconstrained and noisy sensing environment, demographic information is integrated across time for each individual. Therefore, the final demographic classification combines estimates from many facial detections in order to reduce the error rate. The entire system processes 10 frames per second on an 800-MHz Intel Pentium III
Article
This paper describes a new approach to low level image processing; in particular, edge and corner detection and structure preserving noise reduction. Non-linear filtering is used to define which parts of the image are closely related to each individual pixel; each pixel has associated with it a local image region which is of similar brightness to that pixel. The new feature detectors are based on the minimization of this local image region, and the noise reduction method uses this region as the smoothing neighbourhood. The resulting methods are accurate, noise resistant and fast. Details of the new feature detectors and of the new noise reduction method are described, along with test results.
Article
An adaptive logic network (ALN) is a multilayer perceptron that accepts vectors of real (or floating point) values as inputs and produces a logic 0 or 1 as output. The ALN has a number of linear threshold units (perceptrons) acting on the network inputs, and their (Boolean) outputs feed into a tree of logic gates of types AND and OR. An ALN represents a real-valued function of real variables by giving a logic 1 response to points on and under the graph of the function, and a logic 0 otherwise. It cannot compute a real-valued function directly, but it can provide information about how to perform that computation in a separate decision-tree-based program. If a function is invertible, then the same ALN can be used to derive a second decision tree to compute an inverse. Another way to look at function synthesis is that linear functions are combined by a tree expression of MAXIMUM and MINIMUM operations. In this way, ALNs can approximate any continuous function defined on a compact set to any degree of precision. The logic tree structure can control qualitative properties of learned functions, for example convexity. Constraints can be imposed on monotonicities and partial derivatives. ALNs can be used for prediction, data analysis, pattern recognition and control applications. They may be particularly useful for extremely large systems, where lazy evaluation allows large parts of a computation to be omitted. A second, earlier type of ALN is also discussed where the inputs are fixed thresholds on variables and the nodes adapt by changing their logical functions.
Conference Paper
As a first step towards a perceptual user interface, a computer vision color tracking algorithm is developed and applied towards tracking human faces. Computer vision algorithms that are intended to form part of a perceptual user interface must be fast and efficient. They must be able to track in real time yet not absorb a major share of computational resources: other tasks must be able to run while the visual interface is being used. The new algorithm developed here is based on a robust non- parametric technique for climbing density gradients to find the mode (peak) of probability distributions called the mean shift algorithm. In our case, we want to find the mode of a color distribution within a video scene. Therefore, the mean shift algorithm is modified to deal with dynamically changing color probability distributions derived from video frame sequences. The modified algorithm is called the Continuously Adaptive Mean Shift (CAMSHIFT) algorithm. CAMSHIFT's tracking accuracy is compared against a Polhemus tracker. Tolerance to noise, distractors and performance is studied. CAMSHIFT is then used as a computer interface for controlling commercial computer games and for exploring immersive 3D graphic worlds.
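OpenCV ships an implementation of this algorithm, so a minimal tracking loop can be sketched as follows; the webcam index and initial face window are assumed to be given by hand, and a hue-only histogram is used for brevity:

```python
import cv2
import numpy as np

cap = cv2.VideoCapture(0)                      # assumed webcam
ok, frame = cap.read()
x, y, w, h = 200, 150, 80, 80                  # assumed initial face window
roi = frame[y:y+h, x:x+w]

# Hue histogram of the initial face region becomes the color model to track.
hsv_roi = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
hist = cv2.calcHist([hsv_roi], [0], None, [180], [0, 180])
cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)

term = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
window = (x, y, w, h)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    backproj = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
    rotated_box, window = cv2.CamShift(backproj, window, term)  # adapts size/orientation
    cv2.polylines(frame, [np.int32(cv2.boxPoints(rotated_box))], True, (0, 255, 0), 2)
    cv2.imshow("CAMSHIFT face tracking", frame)
    if cv2.waitKey(1) == 27:                   # Esc quits
        break
cap.release()
cv2.destroyAllWindows()
```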
Article
Extracting geometric primitives is an important task in model-based computer vision. The Hough transform is the most common method of extracting geometric primitives. Recently, methods derived from the field of robust statistics have been used for this purpose. We show that extracting a single geometric primitive is equivalent to finding the optimum value of a cost function which has potentially many local minima. Besides providing a unifying way of understanding different primitive extraction algorithms, this model also shows that for efficient extraction the true global minimum must be found with as few evaluations of the cost function as possible. In order to extract a single geometric primitive we choose a number of minimal subsets randomly from the geometric data. The cost function is evaluated for each of these, and the primitive defined by the subset with the best value of the cost function is extracted from the geometric data. To extract multiple primitives, this process is repeated on the geometric data that do not belong to the primitive. The resulting extraction algorithm can be used with a wide variety of geometric primitives and geometric data. It is easily parallelized, and we describe some possible implementations on a variety of parallel architectures. We make a detailed comparison with the Hough transform and show that it has a number of advantages over this classic technique.
Article
In this paper we present a comprehensive and critical survey of face detection algorithms. Face detection is a necessary first-step in face recognition systems, with the purpose of localizing and extracting the face region from the background. It also has several applications in areas such as content-based image retrieval, video coding, video conferencing, crowd surveillance, and intelligent human–computer interfaces. However, it was not until recently that the face detection problem received considerable attention among researchers. The human face is a dynamic object and has a high degree of variability in its appearance, which makes face detection a difficult problem in computer vision. A wide variety of techniques have been proposed, ranging from simple edge-based algorithms to composite high-level approaches utilizing advanced pattern recognition methods. The algorithms presented in this paper are classified as either feature-based or image-based and are discussed in terms of their technical approach and performance. Due to the lack of standardized tests, we do not provide a comprehensive comparative evaluation, but in cases where results are reported on common datasets, comparisons are presented. We also give a presentation of some proposed applications and possible application areas.
Article
At the heart of every model-based visual tracker lies a pose estimation routine. Recent work has emphasized the use of least-squares techniques which employ all the available data to estimate the pose. Such techniques are, however, susceptible to the sort of spurious measurements produced by visual feature detectors, often resulting in an unrecoverable tracking failure. This paper investigates an alternative approach, where a minimal subset of the data provides the pose estimate, and a robust regression scheme selects the best subset. Bayesian inference in the regression stage combines measurements taken in one frame with predictions from previous frames, eliminating the need to further filter the pose estimates. The resulting tracker performs very well on the difficult task of tracking a human face, even when the face is partially occluded. Since the tracker is tolerant of noisy, computationally cheap feature detectors, frame-rate operation is comfortably achieved on standard hardware.
Article
This paper proposes a robust approach to image matching by exploiting the only available geometric constraint, namely, the epipolar constraint. The images are uncalibrated, namely the motion between them and the camera parameters are not known. Thus, the images can be taken by different cameras or a single camera at different time instants. If we make an exhaustive search for the epipolar geometry, the complexity is prohibitively high. The idea underlying our approach is to use classical techniques (correlation and relaxation methods in our particular implementation) to find an initial set of matches, and then use a robust technique—the Least Median of Squares (LMedS)—to discard false matches in this set. The epipolar geometry can then be accurately estimated using a meaningful image criterion. More matches are eventually found, as in stereo matching, by using the recovered epipolar geometry. A large number of experiments have been carried out, and very good results have been obtained. Regarding the relaxation technique, we define a new measure of matching support, which allows a higher tolerance to deformation with respect to rigid transformations in the image plane and a smaller contribution for distant matches than for nearby ones. A new strategy for updating matches is developed, which only selects those matches having both high matching support and low matching ambiguity. The update strategy is different from the classical “winner-take-all”, which is easily stuck at a local minimum, and also from “loser-take-nothing”, which is usually very slow. The proposed algorithm has been widely tested and works remarkably well in a scene with many repetitive patterns.
Conference Paper
We have developed an artificial neural network based gaze tracking system which can be customized to individual users. A three layer feed forward network, trained with standard error back propagation, is used to determine the position of a user's gaze from the appearance of the user's eye. Unlike other gaze trackers, which normally require the user to wear cumbersome headgear, or to use a chin rest to ensure head immobility, our system is entirely non-intrusive. Currently, the best intrusive gaze tracking systems are accurate to approximately 0.75 degrees. In our experiments, we have been able to achieve an accuracy of 1.5 degrees, while allowing head mobility. In its current implementation, our system works at 15 Hz. In this paper we present an empirical analysis of the performance of a large number of artificial neural network architectures for this task. Suggestions for further explorations for neurally based gaze trackers are presented, and are related to other similar artificial neural network applications such as autonomous road following.
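The regression at the core of such a tracker (eye-image pixels in, screen gaze coordinates out) can be sketched with a small feed-forward network. The synthetic data below merely stands in for eye images collected while the user fixates known calibration targets, and the network size is an assumption:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Stand-ins for low-resolution eye images (15x40 pixels, flattened) and the
# on-screen gaze coordinates recorded while the user looked at known targets.
X = rng.uniform(0, 1, size=(2000, 15 * 40))
gaze_xy = rng.uniform(0, 1, size=(2000, 2))

net = MLPRegressor(hidden_layer_sizes=(50,), max_iter=300, random_state=0)
net.fit(X[:1500], gaze_xy[:1500])                     # train on calibration samples
pred = net.predict(X[1500:])                          # predict gaze for held-out samples
print("mean absolute error:", np.mean(np.abs(pred - gaze_xy[1500:])))
```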
Conference Paper
This paper provides an introduction to the field of reasoning with uncertainty in Artificial Intelligence (AI), with an emphasis on reasoning with numeric uncertainty. The considered formalisms are Probability Theory and some of its generalizations, the Certainty Factor Model, Dempster-Shafer Theory, and Probabilistic Networks.
Conference Paper
To build smart human interfaces, it is necessary for a system to know a user's intention and point of attention. Since the motion of a person's head pose and gaze direction are deeply related with his/her intention and attention, detection of such information can be utilized to build natural and intuitive interfaces. We describe our real-time stereo face tracking and gaze detection system to measure head pose and gaze direction simultaneously. The key aspect of our system is the use of real-time stereo vision together with a simple algorithm which is suitable for real-time processing. Since the 3D coordinates of the features on a face can be directly measured in our system, we can significantly simplify the algorithm for 3D model fitting to obtain the full 3D pose of the head compared with conventional systems that use monocular camera. Consequently we achieved a non-contact, passive, real-time, robust, accurate and compact measurement system for head pose and gaze direction
Conference Paper
Computer systems which analyse human face/head motion have attracted significant attention recently as there are a number of interesting and useful applications. Not least among these is the goal of tracking the head in real time. A useful extension of this problem is to estimate the subject's gaze point in addition to his/her head pose. This paper describes a real-time stereo vision system which determines the head pose and gaze direction of a human subject. Its accuracy makes it useful for a number of applications including human/computer interaction, consumer research and ergonomic assessment
Article
Five important trends have emerged from recent work on computational models of focal visual attention that emphasize the bottom-up, image-based control of attentional deployment. First, the perceptual saliency of stimuli critically depends on the surrounding context. Second, a unique 'saliency map' that topographically encodes for stimulus conspicuity over the visual scene has proved to be an efficient and plausible bottom-up control strategy. Third, inhibition of return, the process by which the currently attended location is prevented from being attended again, is a crucial element of attentional deployment. Fourth, attention and eye movements tightly interplay, posing computational challenges with respect to the coordinate system used to control attention. And last, scene understanding and object recognition strongly constrain the selection of attended locations. Insights from these five key areas provide a framework for a computational and neurobiological understanding of visual attention.
Conference Paper
We describe a real-time system for face and facial feature detection and tracking in continuous video. The core of this system consists of a set of novel facial feature detectors based on our previously proposed information-based maximum discrimination learning technique. These classifiers are very fast and allow us to implement a fully automatic, real-time system for detecting and tracking multiple faces. In addition to locking onto up to four target faces, this system locates and tracks nine facial features as they move under facial expression changes
Conference Paper
An approach to detect human head pose by reconstructing 3D positions of facial points from stereo images is proposed for the implementation of an active face recognition system where fast, correct and automatic head pose detection is of critical importance. Four facial points (pupils and mouth corners) are extracted using a simple but efficient method and their three-dimensional coordinates are reconstructed from stereo images. The orientation of the face relative to the camera plane can be computed from the triangular points and thus eliminating the need to know the image vs. model correspondence or the head physical parameters. Errors of pose detection are analyzed and experimental results are shown. Using the head pose detection system, facial images with suitable pose can be selected automatically from the image sequence as input to the face recognition system
Conference Paper
We demonstrate real-time face tracking and pose estimation in an unconstrained office environment with an active foveated camera. Using vision routines previously implemented for an interactive environment, we determine the spatial location of a user's head and guide an active camera to obtain foveated images of the face. Faces are analyzed using a set of eigenspaces indexed over both pose and world location. Closed loop feedback from the estimated facial location is used to guide the camera when a face is present in the foveated view. Our system can detect the head pose of an unconstrained user in real-time as he or she moves about an open room
Conference Paper
This paper compares two artificial neural network (ANN) techniques for facial feature location with an algorithmic method, namely template matching. All three techniques work on windowed data from 128×128 pixel facial images. The ANN techniques used are the multilayer perceptron, and a method using a Kohonen self-organising feature map to classify input patterns and a multilayer perceptron to interpret the output of the Kohonen network. The data used to train the ANNs is described along with the learning parameters and conditions used. The effect of normalising the input data is described. The results show that ANN techniques can equal and in some cases better the performance of template matching for facial feature location
Article
The probability density function of the range estimation error and the expected value of the range error magnitude are derived in terms of the various design parameters of a stereo imaging system. In addition, the relative range error is proposed as a better way of quantifying the range resolution of a stereo imaging system than the percent range error when the depths in the scene lie within a narrow range
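For context, the standard rectified-stereo relations that such an analysis builds on can be written with focal length f, baseline B and disparity d; the error propagation shown is the usual first-order approximation, not the paper's full density derivation:

```latex
Z = \frac{fB}{d}, \qquad
\left|\frac{\partial Z}{\partial d}\right| = \frac{fB}{d^{2}} = \frac{Z^{2}}{fB}
\qquad\Longrightarrow\qquad
\Delta Z \;\approx\; \frac{Z^{2}}{fB}\,\Delta d .
```

That is, the expected range error grows quadratically with range and shrinks as the baseline or focal length increases, which is why these design parameters govern the achievable range resolution.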
Article
Feature measures were highly consistent between methods, Pearson's r = 0.96 or higher, p = 0.001 for each of the action tasks. The mean differences between the methods were small; the mean error between methods was comparable to the error within the manual method (less than 1 pixel). The AFA demonstrated strong concurrent validity with the MSRA for pixel-wise displacement. Tracking was fully automated and provided motion vectors, which may be useful in guiding surgical and rehabilitative approaches to restoring facial function in patients with facial neuromuscular disorders. (Plast. Reconstr. Surg. 107: 1124, 2001.) Facial neuromuscular dysfunction severely impacts adaptive and expressive behavior and emotional well-being. The patient with a facial neuromuscular disorder often has difficulty performing basic daily functions such as eating, drinking, and swallowing, and communicating his or her feelings and intentions to other persons. The risk for moderate to serious levels of depres
Article
A human face provides a variety of different communicative functions. In this paper, we present approaches for real-time face/facial feature tracking and their applications. First, we present techniques for tracking human faces. It is revealed that human skin color can be used as a major feature for tracking human faces. An adaptive stochastic model has been developed to characterize the skin-color distributions. Based on the maximum likelihood method, the model parameters can be adapted for different people and different lighting conditions. The feasibility of the model has been demonstrated by the development of a real-time face tracker. We then present a top-down approach for tracking facial features such as eyes, nostrils, and lip corners. These real-time tracking techniques have been successfully applied to many applications such as eye-gaze monitoring, head pose tracking, and lip-reading.