Article

Investigating cognitive workload in concurrent speech-based information communication


Abstract

Users are capable of noticing, listening to, and comprehending multiple streams of information simultaneously, yet conventional speech-based interaction methods communicate information to users sequentially. This mismatch implies that the sequential approach may be under-utilising human perception capabilities and restricting users' information seeking to sub-optimal levels. This paper reports on an experiment that investigates the cognitive workload users experience when listening to a variety of combinations of information types presented concurrently. Fifteen different combinations of concurrent information streams were investigated, and the subjective listening workload for each combination was measured using NASA-TLX. The results showed that the perceived workload index score varies across the concurrent combinations and depends on the types and the amount of information presented to users. The perceived workload index score in concurrent listening was highest for the Monolog with Interview combination (three concurrent talkers), intermediate for the Monolog with News Headlines combination (two talkers, one of them intermittent), and lowest for the Monolog with Music combination (one talker and a concurrent music stream). Users' descriptive feedback aligned with the NASA-TLX-based results. It is expected that the results of this experiment will help digital content creators and interaction designers communicate information to users more efficiently.
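For context on the workload measure: the overall NASA-TLX index is conventionally computed as a weighted average of six subscale ratings (mental demand, physical demand, temporal demand, performance, effort, frustration), with weights derived from 15 pairwise comparisons between the dimensions. The short Python sketch below illustrates that standard calculation; the ratings and weights are hypothetical placeholders, not data from this experiment.

def nasa_tlx_index(ratings, weights):
    """Weighted NASA-TLX score: sum(rating * weight) / 15, because the six
    dimensions are compared pairwise 15 times in the weighting procedure."""
    assert sum(weights.values()) == 15, "weights must come from 15 pairwise choices"
    return sum(ratings[d] * weights[d] for d in ratings) / 15.0

# Hypothetical subscale ratings (0-100) for one listening condition.
ratings = {"mental_demand": 70, "physical_demand": 10, "temporal_demand": 55,
           "performance": 40, "effort": 65, "frustration": 50}
# Hypothetical tally of how often each dimension was chosen in the 15 comparisons.
weights = {"mental_demand": 5, "physical_demand": 0, "temporal_demand": 3,
           "performance": 2, "effort": 4, "frustration": 1}

print(f"Overall workload index: {nasa_tlx_index(ratings, weights):.1f}")  # ~60.3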


... The knowledge or cognitive aspect in Islamic education is an integral component that requires its own assessment technique in order to measure students' level of understanding of the material that has been taught. This cognitive domain relates to mastery or understanding of the content of the lesson material (Fazal et al., 2022; Sönmez, 2017; Soucy et al., 2016). The cognitive domain encompasses mental exercise or anything related to brain activity (Fadilah & Efendi, 2020; Kuboja & Ngussa, 2015). ...
Article
Full-text available
The teacher does not correctly and adequately understand the implementation of a valid assessment, so students do not understand the implementation of a valid assessment. This study aims to analyze how authentic assessment techniques on cognitive aspects are used by Islamic religious education teachers in elementary schools. The type of research used is qualitative research with a descriptive approach. This research uses a case study design because the researcher will reveal in-depth or analyze the research problem to obtain more specific results. The subjects in the study were one teacher of Islamic religious education and two students. The data collection used is primary and secondary sources. The primary sources used were observation, interviews, and documentation, while the secondary sources were reputable books and journals that matched the themes discussed. The data analysis using descriptive-analytic is divided into three stages, namely the critical analysis stage, critical interpretation, and concluding. The results of this study are several authentic assessment techniques on cognitive aspects used by Islamic religious education teachers, including written tests, oral tests, and assignments. This assessment technique is used to get a complete picture of student competencies in the realm of knowledge and can be used as a measuring tool for learning success so that assessment has an essential role in education.
Article
Many situations require focusing attention on one speaker while monitoring the environment for potentially important information. Some have proposed that dividing attention among two speakers involves behavioral trade-offs due to limited cognitive resources. However, the severity of these trade-offs, particularly under ecologically valid circumstances, is not well understood. We investigated the capacity to process simultaneous speech using a dual-task paradigm simulating task demands and stimuli encountered in real life. Participants listened to conversational narratives (Narrative Stream) and monitored a stream of announcements (Barista Stream) to detect when their order was called. We measured participants' performance, neural activity, and skin conductance as they engaged in this dual task. Participants achieved extremely high dual-task accuracy, with no apparent behavioral trade-offs. Moreover, robust neural and physiological responses were observed for target stimuli in the Barista Stream, alongside significant neural speech-tracking of the Narrative Stream. These results suggest that humans have substantial capacity to process simultaneous speech and do not suffer from insufficient processing resources, at least for this highly ecological task combination and level of perceptual load. The results also confirmed the ecological validity of the advantage for detecting one's own name at the behavioral, neural, and physiological levels, highlighting the contribution of personal relevance when processing simultaneous speech.
Article
Several intelligent speech interaction (ISI) systems have emerged over the past four decades to serve the human community. The research literature shows that these systems are closely connected to the cognitive hexagon and to six hybrid approaches: the hexagon identifies six distinct cognitive areas, and one of the six hybrid perspectives gives rise to the dimensions of speech quality. This survey was undertaken to identify the dimensions of speech quality and to discuss, through these hybrid approaches, the role of the cognitive hexagon's regions in those dimensions. ISI systems, treated here as cognitive machines, support this discussion. An overview of the state of the art in ISI systems is also provided, covering techniques such as natural language processing, speech synthesis (speech-to-text or text-to-speech), voice and mobile computing, and audio mining. These techniques integrate well with technologies such as the Internet of things, voice over Internet protocol, and cloud-based systems. In addition, stochastic components such as reliability, availability, and failure rate are discussed to analyse whether the quality of service of these ISI systems can be described. Finally, several applications are discussed along with their essential advantages and significant drawbacks.
Article
Full-text available
This research aims to help users seek information efficiently while interacting with speech-based information, particularly in multimedia delivery, and reports on an experiment that tested two speech-based designs for communicating multiple speech-based information streams efficiently. In this experiment, a high-rate playback design and a concurrent playback design were investigated. In the high-rate playback design, two speech-based information streams were communicated at double the normal playback rate, and in the concurrent playback design, two speech-based information streams were played concurrently. Comprehension of content in both designs was also compared with a benchmark set by a regular baseline condition. The results showed that users' comprehension of the main information dropped significantly in the high-rate playback and concurrent playback designs compared to the baseline condition. However, for questions drawn from the detailed information, comprehension was not significantly different across the three designs. It is expected that such efficient communication methods may increase productivity by providing information efficiently while interacting with an interactive multimedia system.
Conference Paper
Full-text available
Walking is an everyday act that we, as humans, often take for granted. To walk requires the synergy of somatosensory, neurological and physiological processes for us to move at a regular pace by lifting and setting down each foot in turn. It can be argued that walking is also a source of creativity and exploration when conducted as an intentional act of somatic or self-awareness. This design case study explores the kinds of somatic awareness and aesthetic engagement with walking that emerge through the introduction of a pressure-mediated sound-generating surface with a group of Feldenkrais movement practitioners. These explorations reveal an awareness of tempo and rhythm during the step cycle. This awareness takes on an internal focus as shifts in attention and bodily organization. Another key finding is that exploration and play are enabled by the rich timbral qualities of the pressure-mediated auditory feedback. The significance and contribution of this work lie in its implications for the design of technologies that support kinaesthetic awareness through aesthetic and exploratory strategies.
Article
Full-text available
The widespread availability of digital media has changed the way people consume information and has impacted the consumption of auditory information. Despite this recent popularity among sighted people, the use of auditory feedback to access digital information is not new for visually impaired users. However, its sequential nature undermines both blind and sighted people's ability to efficiently find relevant information in the midst of several potentially useful items. We propose taking advantage of the Cocktail Party Effect, which states that people are able to focus on a single speech source among several conversations, yet still identify relevant content in the background. In contrast to a single sequential speech channel, we hypothesize that people can leverage concurrent speech channels to quickly get the gist of digital information. In this paper, we present an experiment with 46 participants (23 blind, 23 sighted) that aims to understand people's ability to search for relevant content while listening to two, three or four concurrent speech channels. Our results suggest that both blind and sighted people are able to process concurrent speech in scanning scenarios. In particular, two concurrent sources may be used both to identify and to understand the content of the relevant sentence. Moreover, three sources may be used by most people, depending on the intelligibility demands of the task and user characteristics. In contrast with related work, the use of different voices did not affect the perception of concurrent speech but was highly preferred by participants. To complement the analysis, we propose a set of scenarios that may benefit from the use of concurrent speech sources, for both blind and sighted people, towards a Design for All paradigm.
Article
Full-text available
Auditory interfaces offer a solution to the problem of effective eyes-free mobile interactions. In this article, we investigate the use of multilevel auditory displays to enable eyes-free mobile interaction with indoor location-based information in non-guided audio-augmented environments. A top-level exocentric sonification layer advertises information in a gallery-like space. A secondary interactive layer is used to evaluate three different conditions that varied in the presentation (sequential versus simultaneous) and spatialisation (non-spatialised versus egocentric/exocentric spatialisation) of multiple auditory sources. Our findings show that (1) participants spent significantly more time interacting with spatialised displays; (2) using the same design for the primary and the interactive secondary display (simultaneous exocentric) had a negative impact on the user experience, increased workload, and substantially increased participant movement; and (3) the other spatial interactive secondary display designs (simultaneous egocentric, sequential egocentric, and sequential exocentric) showed an increase in time spent stationary but no negative impact on the user experience, suggesting a more exploratory experience. A follow-up qualitative and quantitative analysis of user behaviour supports these conclusions. These results provide practical guidelines for designing effective eyes-free interactions for far richer auditory soundscapes.
Article
Full-text available
This study investigated whether unattended speech is processed at a semantic level in dichotic listening using a semantic priming paradigm. A lexical decision task was administered in which target words were presented in the attended auditory channel, preceded by two prime words presented simultaneously in the attended and unattended channels, respectively. Both attended and unattended primes were either semantically related or unrelated to the attended targets. Attended prime-target pairs were presented in isolation, whereas unattended primes were presented in the context of a series of rapidly presented words. The fundamental frequency of the attended stimuli was increased by 40 Hz relative to the unattended stimuli, and the unattended stimuli were attenuated by 12 dB [+12 dB signal-to-noise ratio (SNR)] or presented at the same intensity level as the attended stimuli (0 dB SNR). The results revealed robust semantic priming of attended targets by attended primes at both the +12 and 0 dB SNRs. However, semantic priming by unattended primes emerged only at the 0 dB SNR. These findings suggest that the semantic processing of unattended speech in dichotic listening depends critically on the relative intensities of the attended and competing signals.
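As a side note on the level manipulation above: an attenuation of 12 dB corresponds to scaling the unattended signal's amplitude by a factor of 10^(-12/20), roughly one quarter of the attended signal's amplitude. A minimal illustrative calculation in Python (not code from the study):

def db_to_amplitude_ratio(db):
    """Convert a level change in dB to a linear amplitude scaling factor."""
    return 10 ** (db / 20)

def db_to_power_ratio(db):
    """Convert a level change in dB to a linear power scaling factor."""
    return 10 ** (db / 10)

attenuation_db = -12  # unattended channel attenuated by 12 dB (+12 dB SNR condition)
print(f"amplitude scale: {db_to_amplitude_ratio(attenuation_db):.3f}")  # ~0.251
print(f"power scale:     {db_to_power_ratio(attenuation_db):.3f}")      # ~0.063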
Conference Paper
Full-text available
This paper explores the use of multilevel auditory displays to enable eyes-free mobile interaction with location-based information in a conceptual art exhibition space. Multilevel auditory displays enable user interaction with concentrated areas of information. However, it is necessary to consider how to present the auditory streams without overloading the user. We present an initial study in which a top-level exocentric sonification layer was used to advertise information present in a gallery-like space. Then, in a secondary interactive layer, three different conditions were evaluated that varied in the presentation (sequential versus simultaneous) and spatialisation (non-spatialised versus egocentric spatialisation) of multiple auditory sources. Results show that 1) participants spent significantly more time interacting with spatialised displays, 2) there was no evidence that a switch from an exocentric to an egocentric display increased workload or lowered satisfaction, and 3) there was no evidence that simultaneous presentation of spatialised Earcons in the secondary display increased workload.
Article
Full-text available
Speech reception thresholds were measured for a voice against two different maskers: Either two concurrent voices with the same fundamental frequency (F0) or a harmonic complex with the same long-term excitation pattern and broadband temporal envelope as the masking sentences (speech-modulated buzz). All sources had steady F0s. A difference in F0 of 2 or 8 semitones provided a 5-dB benefit for buzz maskers, whereas it provided a 3- and 8-dB benefit, respectively, for masking sentences. Whether intelligibility of a voice increases abruptly with small ΔF0s or gradually toward larger ΔF0s seems to depend on the nature of the masker.
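For reference, a fundamental-frequency difference expressed in semitones corresponds to a frequency ratio of 2^(ΔF0/12), so the 2- and 8-semitone separations above amount to roughly 12% and 59% differences in F0. A minimal illustrative calculation (the 100 Hz base F0 is an arbitrary example, not a value from the study):

def semitones_to_ratio(semitones):
    """Frequency ratio corresponding to a pitch interval given in semitones."""
    return 2 ** (semitones / 12)

base_f0 = 100.0  # Hz, an arbitrary example masker F0
for delta in (2, 8):
    ratio = semitones_to_ratio(delta)
    print(f"{delta} semitones -> ratio {ratio:.3f} "
          f"({base_f0:.0f} Hz vs {base_f0 * ratio:.1f} Hz)")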
Conference Paper
Full-text available
Sighted users are able to sift through a website quickly to find the information they are interested in. In contrast, screen readers present information sequentially to blind users, which contrasts with the visual presentation on screen that conveys more information at once. We believe that blind users will benefit from multiple simultaneous sound sources while scanning websites with several information items, in order to find their information of interest faster.
Article
Full-text available
This study examined the effects of competing speech on auditory semantic comprehension using a dichotic sentence-word priming paradigm. Lexical decision performance for target words presented in spoken sentences was compared in strongly and weakly biasing semantic contexts. Targets were either congruent or incongruent with the sentential bias. Sentences were presented to one auditory channel (right or left), either in isolation or with competing speech produced by a single talker of the same gender presented simultaneously. The competing speech signal was either presented in the same auditory channel as the sentence context, or in a different auditory channel, and was either meaningful (played forward) or unintelligible (time-reversed). Biasing contexts presented in isolation facilitated responses to congruent targets and inhibited responses to incongruent targets, relative to a neutral baseline. Facilitation priming was reduced or eliminated by competing speech presented in the same auditory channel, supporting previous findings that semantic activation is highly sensitive to the intelligibility of the context signal. Competing speech presented in a different auditory channel affected facilitation priming differentially depending upon ear of presentation, suggesting hemispheric differences in the processing of the attended and competing signals. Results were consistent with previous claims of a right ear advantage for meaningful speech, as well as with visual word recognition findings implicating the left hemisphere in the generation of semantic predictions and the right hemisphere in the integration of newly encountered words into the sentence-level meaning. Unlike facilitation priming, inhibition was relatively robust to the energetic and informational masking effects of competing speech, and was not influenced by the strength of the contextual bias or the meaningfulness of the competing signal, supporting a two-process model of sentence priming in which inhibition reflects later-stage, expectancy-driven strategic processes that may benefit from perceptual reanalysis after initial semantic activation.
Conference Paper
Full-text available
Auditory interfaces offer a solution to the problem of effective eyes-free mobile interactions. However, a problem with audio, as opposed to visual displays, is dealing with multiple simultaneous outputs. Any audio interface needs to consider: 1) simultaneous versus sequential presentation of multiple audio streams, 2) 3D audio techniques to place sounds in different spatial locations versus a single point of presentation, 3) dynamic movement versus fixed locations of audio sources. We present an experiment using a divided-attention task where a continuous podcast and an audio menu compete for attention. A sequential presentation baseline assessed the impact of cognitive load, and as expected, dividing attention had a significant effect on overall performance. However, spatial audio still increased the users' ability to attend to two streams, while dynamic movement of streams led to higher perceived workload. These results will provide guidelines for designers when building eyes-free auditory interfaces for mobile applications.
Article
Full-text available
Two investigations into the identification of concurrently presented, structured sounds, called earcons, were carried out. The first experiment investigated how varying the number of concurrently presented earcons affected their identification. It was found that varying the number had a significant effect on the proportion of earcons identified: reducing the number of concurrently presented earcons led to a general increase in the proportion of presented earcons successfully identified. The second experiment investigated how modifying the earcons and their presentation, using techniques influenced by auditory scene analysis, affected earcon identification. Both modifying the earcons so that each was presented with a unique timbre, and altering their presentation so that there was a 300 ms onset-to-onset delay between earcons, were found to significantly increase identification. Guidelines were drawn from this work to assist future interface designers when incorporating concurrently presented earcons.
Article
Full-text available
Animals often use acoustic signals to communicate in groups or social aggregations in which multiple individuals signal within a receiver's hearing range. Consequently, receivers face challenges related to acoustic interference and auditory masking that are not unlike the human cocktail party problem, which refers to the problem of perceiving speech in noisy social settings. Understanding the sensory solutions to the cocktail party problem has been a goal of research on human hearing and speech communication for several decades. Despite a general interest in acoustic signaling in groups, animal behaviorists have devoted comparatively less attention toward understanding how animals solve problems equivalent to the human cocktail party problem. After illustrating how humans and nonhuman animals experience and overcome similar perceptual challenges in cocktail-party-like social environments, this article reviews previous psychophysical and physiological studies of humans and nonhuman animals to describe how the cocktail party problem can be solved. This review also outlines several basic and applied benefits that could result from studies of the cocktail party problem in the context of animal acoustic communication.
Article
Full-text available
The objective is to lay out the rationale for multiple resource theory and the particular 4-D multiple resource model, as well as to show how the model is useful both as a design tool and as a means of predicting multitask workload overload. I describe the discoveries and developments regarding multiple resource theory that have emerged over the past 50 years that contribute to performance and workload prediction. The article presents a history of the multiple resource concept, a computational version of the multiple resource model applied to multitask driving simulation data, and the relation of multiple resources to workload. Research revealed the importance of the four dimensions in accounting for task interference and the association of resources with brain structure. Multiple resource models yielded high correlations between model predictions and data. Lower correlations also identified the existence of additional resources. The model was shown to be partially relevant to the concept of mental workload, with greatest relevance to performance breakdowns related to dual-task overload. Future challenges are identified. The most important application of the multiple resource model is to recommend design changes when conditions of multitask resource overload exist.
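The computational flavour of the model can be illustrated with a toy sketch: each task is described by the levels it occupies on the resource dimensions (e.g. stage, code, modality), and predicted interference grows with task demand and with the number of dimensions on which two concurrent tasks compete for the same resource. The dimension names, demand values, and scoring rule below are simplified placeholders for illustration, not the published computational model.

from dataclasses import dataclass

@dataclass
class Task:
    name: str
    demand: float     # overall demand, e.g. 0 (none) to 2 (high)
    resources: dict   # level occupied on each resource dimension

def predicted_interference(a, b):
    """Toy score: total demand plus one point per resource dimension on which
    the two tasks compete for the same resource."""
    shared = sum(1 for dim in a.resources if a.resources[dim] == b.resources.get(dim))
    return a.demand + b.demand + shared

listening = Task("monitor speech", 1.5,
                 {"stage": "perceptual", "code": "verbal", "modality": "auditory"})
driving = Task("lane keeping", 1.0,
               {"stage": "response", "code": "spatial", "modality": "visual"})
second_talker = Task("second talker", 1.5,
                     {"stage": "perceptual", "code": "verbal", "modality": "auditory"})

print(predicted_interference(listening, driving))        # little overlap -> lower score
print(predicted_interference(listening, second_talker))  # full overlap -> higher score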
Article
Full-text available
The authors found splenial lesions to be associated with left ear suppression in dichotic listening of consonant-vowel syllables. This was found in both a rapid presentation dichotic monitoring task and a standard dichotic listening task, ruling out attentional limitations in the processing of high stimulus loads as a confounding factor. Moreover, directed attention to the left ear did not improve left ear target detection in the patients, independent of callosal lesion location. The authors' data may indicate that auditory callosal fibers pass through the splenium more posterior than previously thought. However, further studies should investigate whether callosal fibers between primary and secondary auditory cortices, or between higher level multimodal cortices, are vital for the detection of left ear targets in dichotic listening.
Article
In human-computer interaction, particularly in multimedia delivery, information is communicated to users sequentially, whereas users are capable of receiving information from multiple sources concurrently. This mismatch indicates that a sequential mode of communication does not utilise human perception capabilities as efficiently as possible. This article reports an experiment that investigated various speech-based (audio) concurrent designs and evaluated the comprehension depth of information by comparing comprehension performance across several different formats of questions (main/detailed, implied/stated). The results showed that users, besides answering the main questions, were also successful in answering the implied questions as well as the questions that required detailed information, and that the pattern of comprehension depth remained similar to that seen in a baseline condition, where only one speech source was presented. However, participants answered more questions correctly when the questions were drawn from the main information, and performance remained low where the questions were drawn from detailed information. The results are encouraging for exploring concurrent methods further for communicating multiple information streams efficiently in human-computer interaction, including multimedia.
Conference Paper
In this paper, we discuss investigations conducted with 10 visually challenged users (VCUs) and 8 sighted users (SUs) that aimed to determine users' experience, interest and expectations of concurrent information communication systems. In the first study, we played two continuous voice-based streams concurrently to both ears, and in the second study, we communicated one stream continuously in one ear and three news headlines as interval-based short interruptions in the other ear. We first report the participants' experience qualitatively and then, based on the feedback received from the users, propose a framework that may help in developing systems that communicate multiple voice-based information streams to users. It is expected that applying this new framework to information systems that provide multiple concurrent communication will provide a better user experience, subject to users' contextual and perceptual needs and limitations.
Conference Paper
Speech-based information is usually communicated to users in a sequential manner, but users are capable of obtaining information from multiple voices concurrently. This fact implies that the sequential approach is possibly under-utilizing human perception capabilities to some extent and restricting users from performing optimally in an immersive environment. This paper reports on an experiment that aimed to test different speech-based designs for concurrent information communication. Two audio streams from two types of content were played concurrently to 34 users, in either continuous or intermittent form, with the manipulation of a variety of spatial configurations (i.e. Diotic, Diotic-Monotic, and Dichotic). In total, 12 concurrent speech-based design configurations were tested with each user. The results showed that concurrent speech-based information designs involving intermittent form and a spatial difference between information streams produce comprehensibility equal to the level achieved in sequential information communication.
Article
Sound synthesis is the process of generating artificial sounds through some form of simulation or modelling. This article aims to identify which sound synthesis methods achieve the goal of producing a believable audio sample that may replace a recorded sound sample. A perceptual evaluation experiment of five different sound synthesis techniques was undertaken. Additive synthesis, statistical modelling synthesis with two different feature sets, physically inspired synthesis, concatenative synthesis, and sinusoidal modelling synthesis were all compared. The evaluation used eight different sound class stimuli and 66 different samples. The additive synthesizer is the only synthesis method not considered significantly different from the reference sample across all sound classes. The results demonstrate that sound synthesis can be considered as realistic as a recorded sample, and recommendations are made for the use of synthesis methods in different sound class contexts.
Conference Paper
Traditional interfaces are continuously being replaced by mobile, wearable, or pervasive interfaces. Yet when it comes to the input and output modalities enabling our interactions, we have yet to fully embrace some of the most natural forms of communication and information processing that humans possess: speech, language, gestures, thoughts. Very little HCI attention has been dedicated to designing and developing spoken language, acoustic-based, or multimodal interaction techniques, especially for mobile and wearable devices. In addition to the enormous, recent engineering progress in processing such modalities, there is now sufficient evidence that many real-life applications do not require 100% accuracy of processing multimodal input to be useful, particularly if such modalities complement each other. This multidisciplinary, one-day workshop will bring together interaction designers, usability researchers, and general HCI practitioners to analyze the opportunities and directions to take in designing more natural interactions especially with mobile and wearable devices, and to look at how we can leverage recent advances in speech, acoustic, and multimodal processing.
Article
An observational workflow time study was conducted involving doctors in the emergency department (ED) of a large Australian hospital. During 121.7 h across 58 sessions, we observed interruptive events, conceptualised as prompts, and doctors' strategies to handle those prompts (task-switching, multitasking, acknowledgement, deferral and deflection) to assess the role of multiple work system factors influencing doctors' work in the ED. Prompt rates varied vastly between work scenarios, being highest during non-verbal solo tasks. The propensity to use certain strategies also differed with task type, prompt type and location within the department, although task-switching was by far the most frequent. Communicative prompts were important in patient treatment and workload management. Clinicians appear to adjust their communication strategies in response to contextual factors in order to deliver patient care. Risk due to the interruptive nature of ED communication is potentially outweighed by the positive effects of timely information transfer and advice provision.
Chapter
Ubiquitous computing has enabled users to perform their computer activities anytime, anyplace, anywhere while performing other routine activities. Voice-based interaction often plays a significant role in making this possible. Presently, in voice-based interaction, the system communicates information to the user sequentially, whereas users are capable of noticing, listening to and comprehending multiple voices simultaneously. Therefore, providing information sequentially to users may not be an ideal approach. There is a need to develop a design strategy in which information can be communicated to users through multiple channels. In this paper, a design possibility is investigated for how information could be communicated simultaneously in voice-based interaction, so that users can fulfil their growing information needs and ultimately complete multiple tasks at hand efficiently.
Conference Paper
Traditional interfaces are continuously being replaced by mobile, wearable, or pervasive interfaces. Yet when it comes to the input and output modalities enabling our interactions, we have yet to fully embrace some of the most natural forms of communication and information processing that humans possess: speech, language, gestures, thoughts. Very little HCI attention has been dedicated to designing and developing spoken language and multimodal interaction techniques, especially for mobile and wearable devices. In addition to the enormous, recent engineering progress in processing such modalities, there is now sufficient evidence that many real-life applications do not require 100% accuracy of processing multimodal input to be useful, particularly if such modalities complement each other. This multidisciplinary, two-day workshop will bring together interaction designers, usability researchers, and general HCI practitioners to analyze the opportunities and directions to take in designing more natural interactions with mobile and wearable devices, and to look at how we can leverage recent advances in speech and multimodal processing.
Article
Providing better information access to blind users is an important goal in the context of accessible interface design. Similarly, designers of user interfaces benefit from alternative interface techniques for usage scenarios in which visual (graphical) interfaces are either not possible or suboptimal. In our study we compared a traditional serial aural presentation of menu items to a new simultaneous aural presentation of up to seven menu items. These continuously present VoiceScapes allow the user to actively scan the auditory display to find the most appropriate command. While VoiceScapes are more difficult and attentionally more demanding than other formats of presentation, extended use might allow experienced users to more efficiently navigate complex menu hierarchies. A first pilot experiment with 13 sighted participants presented here tested the basic viability of this approach.
Article
In this article I review the early auditory laterality and dichotic listening research from the perspective of the legacy of Phil Bryden's pioneering contributions to not only empirical work, but also on theory, critical interpretations of results, and statistical issues, with a focus on the role of attention. In doing so, I am describing how my own research was shaped and influenced by Phil Bryden and his work on auditory laterality and dichotic listening. In addition to personal recollections of my meetings and discussions with Phil that had a profound impact on my later career, I have focused the overview on Phil's early dichotic listening papers from the 1960s, to be followed by a detailed review and discussion of the seminal [Bryden, M. P., Munhall, K., & Allard, F. (1983). Attentional biases and the right-ear effect in dichotic listening. Brain and Language, 18, 236-248] paper on attentional effects on the ear advantage in dichotic listening. Finally, I review the little known fact that Phil was also a contributor to the very first functional neuroimaging study that used dichotic stimuli.
Article
This study investigated whether spatial separation between talkers helps reduce cognitive processing load, and how hearing impairment interacts with the cognitive load of individuals listening in multi-talker environments. A dual-task paradigm was used in which performance on a secondary task (visual tracking) served as a measure of the cognitive load imposed by a speech recognition task. Visual tracking performance was measured under four conditions in which the target and the interferers were distinguished by (1) gender and spatial location, (2) gender only, (3) spatial location only, and (4) neither gender nor spatial location. Results showed that when gender cues were available, a 15° spatial separation between talkers reduced the cognitive load of listening even though it did not provide further improvement in speech recognition (Experiment I). Compared to normal-hearing listeners, large individual variability in spatial release of cognitive load was observed among hearing-impaired listeners. Cognitive load was lower when talkers were spatially separated by 60° than when talkers were of different genders, even though speech recognition was comparable in these two conditions (Experiment II). These results suggest that a measure of cognitive load might provide valuable insight into the benefit of spatial cues in multi-talker environments.
Article
Recent human performance research at the Naval Surface Warfare Center, Dahlgren Division (NSWCDD) has shown that increasing the number of concurrent voice communications tasks individual Navy watchstanders must handle is an uncompromising empirical barrier to streamlining crew sizes in future shipboard combat information centers. Subsequent work on this problem at the Naval Research Laboratory (NRL) has resulted in a serialized communications monitoring prototype (U.S. Patent Application Pub. No. US. 2007/0299657) that uses a patented NRL technology known as “pitch synchronous segmentation” (U.S. Patent 5,933,808) to accelerate buffered human speech up to 100% faster than its normal rate without a meaningful decline in intelligibility. In conjunction with this research effort, a series of ongoing human subjects studies at NRL has shown that rate-accelerated, serialized communications monitoring overwhelmingly improves performance measures of attention, comprehension, and effort in comparison to concurrent listening in the same span of time. This paper provides an overview of NRL's concurrent communications monitoring solution and summarizes the empirical performance questions addressed by, and the outcomes of, the Lab's associated program of listening studies.
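The time arithmetic behind that comparison is simple to illustrate: playing buffered messages one after another at double rate takes the same wall-clock time as playing two equal-length messages concurrently at normal rate, which is why serialized, rate-accelerated monitoring can be compared with concurrent listening "in the same span of time". A small illustrative calculation (the message durations and speed-up factor are arbitrary example values, not figures from the NRL studies):

durations = [12.0, 12.0]   # seconds: two hypothetical buffered transmissions
speedup = 2.0              # "up to 100% faster" than the normal rate

serialized_time = sum(durations) / speedup   # one after another, accelerated
concurrent_time = max(durations)             # both at once, at normal rate

print(f"serialized at {speedup:.0f}x rate: {serialized_time:.1f} s")
print(f"concurrent at normal rate: {concurrent_time:.1f} s")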
Article
Continuous samples of light fiction were read to the listener, whose task was to reproduce the speech. The speech signal was periodically interrupted, with each half cycle presented alternately to each of the two ears. Sharp deterioration in performance was obtained for switching rates in the region of 3-5 interruptions per second, which leads to a calculated "dead time" for switching attention of about 1/6 s. The effect of switching upon recognition markedly decreases as the rate of speaking of the signal decreases. The interaural displacement threshold for continuous speech was found to be about 13 ms.
Conference Paper
In this paper, we describe a novel voice menu presentation method, the vCocktail, designed for efficient human-computer interaction in wearable computing. This method is devised to reduce the length of the serial presentation of voice menus by introducing spatiotemporally multiplexed voices with enhanced separation cues. Perception error in judging voice direction was measured first to determine appropriate directions and interval angles at which menu items could be placed, allowing the user a clear distinction among the multiple items. Voice menu items were then presented under spatiotemporally multiplexed conditions with several different settings of spatial localization, number of words, and onset interval. The results of the experiments showed that subjects could hear items very accurately with localization cues and appropriate onset intervals. In addition, the proposed attenuating menu voice and cross-type spatial sequence of presentation increased the correct answer ratio, effectively improving distinction between menu items. A correct answer ratio of 99.7% was achieved for four-item multiplexing when an attenuating voice and a 0.2 s onset interval were used with the cross-type spatial sequence.
Conference Paper
While the usability of voice-based Web navigation has been steadily improving, it is still not as easy for users with visual impairments as it is for sighted users. One reason is that sequential voice representation can only convey a limited amount of information at a time. Another challenge comes from the fact that current voice browsers omit various visual cues such as text styles and page structures, and lack meaningful feedback about the current focus. To address these issues, we created Sasayaki, an intelligent voice-based user agent that augments the primary voice output of a voice browser with a secondary voice that whispers contextually relevant information as appropriate or in response to user requests. A prototype has been implemented as a plug-in for a voice browser. The results from a pilot study show that our Sasayaki agent is able to improve users' information search task time and their overall confidence level. We believe that our intelligent voice-based agent has great potential to enrich the Web browsing experiences of users with visual impairments.
Article
Although there is substantial evidence that performance in multitalker listening tasks can be improved by spatially separating the apparent locations of the competing talkers, very little effort has been made to determine the best locations and presentation levels for the talkers in a multichannel speech display. In this experiment, a call sign based color and number identification task was used to evaluate the effectiveness of three different spatial configurations and two different level normalization schemes in a seven-channel binaural speech display. When only two spatially adjacent channels of the seven-channel system were active, overall performance was substantially better with a geometrically spaced spatial configuration (with far-field talkers at −90°, −30°, −10°, 0°, +10°, +30°, and +90° azimuth) or a hybrid near-far configuration (with far-field talkers at −90°, −30°, 0°, +30°, and +90° azimuth and near-field talkers at ±90°) than with a more conventional linearly spaced configuration (with far-field talkers at −90°, −60°, −30°, 0°, +30°, +60°, and +90° azimuth). When all seven channels were active, performance was generally better with a “better-ear” normalization scheme that equalized the levels of the talkers in the more intense ear than with a default normalization scheme that equalized the levels of the talkers at the center of the head. The best overall performance in the seven-talker task occurred when the hybrid near-far spatial configuration was combined with the better-ear normalization scheme. This combination resulted in a 20% increase in the number of correct identifications relative to the baseline condition with linearly spaced talker locations and no level normalization. Although this is a relatively modest improvement, it should be noted that it could be achieved at little or no cost simply by reconfiguring the HRTFs used in a multitalker speech display.
Chapter
The results of a multi-year research program to identify the factors associated with variations in subjective workload within and between different types of tasks are reviewed. Subjective evaluations of 10 workload-related factors were obtained from 16 different experiments. The experimental tasks included simple cognitive and manual control tasks, complex laboratory and supervisory control tasks, and aircraft simulation. Task-, behavior-, and subject-related correlates of subjective workload experiences varied as a function of difficulty manipulations within experiments, different sources of workload between experiments, and individual differences in workload definition. A multi-dimensional rating scale is proposed in which information about the magnitude and sources of six workload-related factors are combined to derive a sensitive and reliable estimate of workload.
Article
A vision of the future of intraoperative monitoring for anesthesia is presented: a multimodal world based on advanced sensing capabilities. I explore progress towards this vision, outlining the general nature of the anesthetist's monitoring task and the dangers of attentional capture. Research in attention indicates different kinds of attentional control, such as endogenous and exogenous orienting, which are critical to how awareness of patient state is maintained, but which may work differently across different modalities. Four kinds of medical monitoring displays are surveyed: (1) integrated visual displays, (2) head-mounted displays, (3) advanced auditory displays and (4) auditory alarms. Achievements and challenges in each area are outlined. In future research, we should focus more clearly on identifying anesthetists' information needs, and we should develop models of attention in different modalities and across different modalities that are more capable of guiding design.
Article
The dichotic listening paradigm using verbal stimulus material typically yields a right ear advantage (REA), which indicates the left-hemisphere dominance for speech processing. Although this interpretation is widely accepted, the cerebral hemispheres also interact through the corpus callosum. Moreover, the two most influential theoretical models of dichotic listening, the structural and the attentional model, both refer to the functional integrity of the corpus callosum when explaining the REA. However, the current review of the available data reveals several aspects that cannot be explained by the dichotic listening models. For example, an individual's ability to direct attention to either ear is mediated by callosal fibers. Consequently, the corpus callosum not only has to be considered as a channel for the automatic exchange of information between the cerebral hemispheres; it rather allows for a dynamic and flexible interaction in supporting both bottom-up and top-down stimulus processing. The review has also revealed how inter-individual variability in callosal fiber structure affects both bottom-up and top-down performance on the dichotic listening task.
Article
This paper describes the Audio Hallway, a virtual acoustic environment for browsing collections of related audio files. The user travels up and down the Hallway by head motion, passing "rooms" alternately on the left and right sides. Emanating from each room is an auditory collage of "braided audio" which acoustically indicates the contents of the room. Each room represents a broadcast radio news story, and the contents are a collection of individual "sound bites" or actualities related to that story. Upon entering a room, the individual sounds comprising that story are arrayed spatially in front of the listener, with auditory focus controlled by head rotation. The main design challenge for the Audio Hallway is adequately controlling the auditory interface to position sounds so that spatial memory can facilitate navigation and recall in the absence of visual cues. Keywords: digitized speech, virtual environments, spatial audio, auditory user interface.
Concurrent voice-based multiple information communication: a study report of profile-based users’ interaction
  • Fazal
Evaluating listeners’ attention to and comprehension of spatialized concurrent and serial talkers at normal and a synthetically faster rate of speech
  • Brock
Multiple comparisons of simple effects in the two-way analysis of variance with fixed effects
  • Copenhaver
Exploring auditory gist: comprehension of two dichotic, simultaneously presented stories
  • Iyer
Designing interfaces for multiple-goal environments
  • Truschin