
Spoken Mathematics using Prosody, Earcons and

Spearcons

Enda Bates1 and Donal Fitzpatrick1

1 Dublin City University, Dublin 9, Ireland

dfitzpat@computing.dcu.ie

Abstract. Printed notation provides a highly succinct and unambiguous

description of the structure of mathematical formulae in a manner which is

difficult to replicate for the visually impaired. A number of different approaches

to the verbal presentation of mathematical material have been explored,

however, the fundamental differences between the two modalities of vision and

audition are often ignored. The use of additional lexical cues, spatial audio or

complex hierarchies of non-speech sounds to represent the structure and scope

of equations may be cognitively demanding to process, and this can detract

from the perception of the mathematical content. In this paper, a new

methodology is proposed which uses the prosodic component found in spoken

language, in conjunction with a limited set of spatialized earcons and spearcons,

to disambiguate the structure of mathematical formulae. This system can

potentially represent this information in an intuitive and unambiguous manner

which takes advantage of the specific strengths and capabilities of audition.

Keywords: Math, auditory interfaces, visual impairment, earcons, spearcons.

1 Introduction

Printed mathematical equations provide a significant amount of information in a

highly succinct manner. The visual representation immediately and unambiguously

indicates structural information such as the presence and scope of a fraction or the

terms contained within a square root operator. Developing a similarly efficient and

unambiguous representation for the visually impaired is a significant challenge.

Tactile representations such as Braille represent one entirely valid approach to this

issue, however, this paper will concentrate on the primary alternative to Braille,

speech synthesis.

2 Visual and Audible Representations of Mathematical Material

Printed notation provides a persistent visual cue which is effectively non-

temporal, however, audible representations are necessarily fleeting due to the

inherently temporal nature of sound. Sighted users can therefore utilize printed

material as a form of external memory and do not need to memorize the structure and

layout of an equation. This conclusion is supported by the results of a series of

cognitive experiments examining equation reading in sighted users, conducted by

Gillan et al [1], which found that sighted subjects process operators and numbers

more intensively than parentheses. This is perhaps unsurprising as the spatial structure

of the equation (which is implied through the use of parentheses and other graphical

symbols and delimiters) is unambiguous and persistent when presented visually. This

suggests that working with mathematical material in a non-visual medium will result

in an inevitable increase in cognitive load as this structural information must now be

held in memory. This strongly implies that any method of presenting the spatial

structure of an equation via audition must be as easy as possible to cognitively

process.

2.1 Active Browsing of Mathematical Material

Printed mathematics provides a strong spatial definition of the mathematical

content and the user provides the temporal dimension by actively directing their

attention back and forth through the material [1]. A simple audio representation works

in reverse, as the serial stream of audio is highly temporal and the user must now infer

and remember the spatial structure from the presented material. In effect, a printed

symbol exists outside of time, and a sighted user can concentrate on this symbol

whenever they wish. Audition on the other hand is highly temporal and omni-

directional, and cannot be consciously directed in the same way as vision. Some

equivalent of this directed attention is clearly critical if a visually impaired user is to have effective control over the

temporal representation of the mathematical material, and indeed the majority of the

available software for the presentation of spoken mathematics incorporates some form

of active browsing which allows visually impaired users to navigate and

segment mathematical equations [1,2,3,4].
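As a rough illustration of this browsing functionality, the sketch below models an equation as a tree of labelled sub-expressions and lets a cursor step into, out of, and across them at the listener's own pace. The node layout and method names are illustrative assumptions, not taken from any of the cited systems.

```python
# Hypothetical sketch of active browsing: a cursor over an expression
# tree.  Each node is a (label, children) pair; this encoding is an
# assumption made for the example.

class EquationCursor:
    def __init__(self, tree):
        self.tree = tree
        self.path = []                 # child indices from root to current node

    def current(self):
        node = self.tree
        for i in self.path:
            node = node[1][i]
        return node[0]

    def descend(self):                 # step into the first sub-expression
        node = self.tree
        for i in self.path:
            node = node[1][i]
        if node[1]:
            self.path.append(0)
        return self.current()

    def ascend(self):                  # back out to the parent expression
        if self.path:
            self.path.pop()
        return self.current()

    def next_sibling(self):            # move across at the same level
        if self.path:
            self.path[-1] += 1         # caller is assumed to check bounds
        return self.current()

# "sum of a^i plus (i+1)/(i-1)" as a toy tree
tree = ("sum", [("a^i", []), ("fraction", [("i+1", []), ("i-1", [])])])
c = EquationCursor(tree)
print(c.current(), c.descend(), c.next_sibling(), c.descend())
```

Stepping in and out of sub-expressions in this way gives the listener the temporal control that the printed page gives a sighted reader spatially.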

2.2 Spatial Representations

Various projects have attempted to directly map the visible spatial structure of

printed notation to an equivalent audible structure. It has been suggested that this will

reduce the mental effort required by the user to process and solve the equation [5],

however, this may not in fact be the case. The results of a number of experiments

have found that the minimum detectable changes in position for visual and auditory

stimuli are approximately 1′ (one minute of arc) for visual targets and 2°-4° for auditory targets [6]. The

spatial resolution of vision is clearly therefore much more accurate than audition and

this suggests that accurately replicating the spatial layout of a printed equation with

spatial audio will be difficult to achieve.

While dynamic sonic trajectories can be used to create sonic shapes, this appears to

be a cognitive process, rather than a perceptual one [7]. In effect, the shape is

cognitively inferred from our perception of the moving sound and is not perceived

directly. Therefore it cannot be assumed that mirroring the spatial layout of a written

equation with spatial audio will reduce the cognitive load on the listener. In fact, it has

been found that the additional mental processing required to determine the spatial

trajectory detracts from the processing of the mathematical content [8]. The vertical

layout of mathematical equations would appear to be well matched to a spatial audio

presentation of an equation, however, auditory localization is particularly inaccurate

in the vertical dimension and extremely difficult to synthesize, particularly with

binaural spatialization techniques based on the head-related-transfer function, or

HRTF [9, 10]. Poor results have also been found when HRTF techniques are replaced

with real loudspeakers positioned vertically in front of the listener [10].

These issues suggest that a linear spatial mapping from vision to audition is not

particularly useful due to the fundamental differences between the two senses,

however, spatial audio may be beneficial in other ways. Numerous studies have

shown that sounds produced from different spatial locations are easier to distinguish,

which suggests that if additional sounds are added to the main speech signal, these

should be produced from different spatial locations as in the system developed by

Goose et al [8]. The externalization effect which generally occurs when binaural

material is presented via headphones has also been found to be much less fatiguing

than standard stereo, and this may also be of some benefit [9].

3 Existing Approaches to Spoken Mathematics

3.1 Lexical Cues

The use of additional spoken cues to indicate the structure of equations and

formulas is reminiscent of the linear representation of mathematical material in

computer programming languages (which also lack the two dimensional spatial

component of traditional printed mathematics). Nemeth Mathspeak is one system

which adopts this approach and concentrates on removing the ambiguities which arise

in spoken mathematics [11]. Mathspeak uses additional lexical cues such as “begin

fraction” and “end-fraction” (abbreviated to “B-frac” and “E-frac”) to clearly mark

the beginning and end of the fraction. Although this solution entirely removes the

ambiguity from spoken mathematics, it becomes confusing as the equations become

more complex. For example, the formula for quadratic equations would be described

as, “B-frac minus B plus or minus B-rad B sup 2 base minus 4 AC E-rad over 2 A E-

frac”. In addition, there is evidence to suggest that if the end of an utterance is marked

with a verbal message, then this message may detract from the listener’s memory of

the utterance, a process which is referred to as the suffix effect [12]. Although

Nemeth’s approach largely eliminates structural ambiguity in the presentation, the

cognitive effort required to remember and parse the equation is quite significant for

all but the simplest material.
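The quadratic-formula rendering above can be reproduced by a simple recursive walk over an expression tree. The sketch below is a hypothetical minimal implementation; the node encoding is an assumption, and only the cue words (“B-frac”, “E-frac”, “B-rad”, “sup”, “base”) follow the MathSpeak examples quoted here.

```python
# Hypothetical sketch of MathSpeak-style linearization over a small
# expression tree; the (op, args) tuple encoding is an assumption,
# not the Nemeth MathSpeak specification itself.

def speak(node):
    """Render a nested (op, args) tuple as a MathSpeak-style string."""
    if isinstance(node, str):          # leaf symbol, e.g. "b" or "2 a"
        return node
    op, *args = node
    if op == "frac":                   # delimit fractions explicitly
        return f"B-frac {speak(args[0])} over {speak(args[1])} E-frac"
    if op == "rad":                    # delimit square roots
        return f"B-rad {speak(args[0])} E-rad"
    if op == "sup":                    # superscripts return to "base"
        return f"{speak(args[0])} sup {speak(args[1])} base"
    return f" {op} ".join(speak(a) for a in args)  # infix operators

# The quadratic formula: (-b ± sqrt(b^2 - 4ac)) / 2a
quadratic = ("frac",
             ("plus or minus", "minus b",
              ("rad", ("minus", ("sup", "b", "2"), "4 a c"))),
             "2 a")
print(speak(quadratic))
```

Running this reproduces the rendering quoted above, which makes the paper's point concrete: the output is unambiguous but difficult to hold in memory.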

3.2 Earcons, Auditory Icons and Spearcons

In contrast to a purely lexical approach, a number of systems have been developed

which use additional, non-speech sounds to indicate various aspects of the

mathematical material, such as the structure of an equation. Non-speech sounds have

long been used to improve access to graphical user interfaces and are often classified

as earcons, auditory icons or spearcons. Earcons are generally constructed from

abstract sounds such as the error beep in the Windows OS and may consist of melodic

or rhythmic musical material. Although the exact meaning of earcons must be learned

by the user, the abstract nature of the sound means that earcons are relatively easy to

synthesize and modify. In contrast, auditory icons use sampled sounds which are

intended to resemble or refer to the object they represent. This means that auditory

icons are more difficult to manipulate than earcons but their meaning may be easier

for the listener to recognize. Spearcons are a more recent development and are created

by time compressing a spoken phrase so that the resulting short sound is not entirely

comprehensible as speech [13].

Using auditory icons to represent structural mathematical constructs like

parentheses or fractions is inherently difficult as there is no obvious relationship

between this abstract mathematical syntax and real-world sounds. As it is relatively

straightforward to create a hierarchical range of earcons, these may be more

appropriate to represent nested parentheses or other structural notation. This approach

was adopted by the MathTalk system which uses musical earcons to indicate

structural delimiters (such as a pattern of 3 short ascending and descending notes to

indicate the beginning and end of an expression respectively) and also provide an

abstract overview of the entire equation [14]. The significant drawback to this

approach is that a complex and highly specific musical grammar must be learned prior

to using the system. In addition, remembering and decoding musical patterns may be

quite difficult for non-musicians and again, the additional cognitive effort required to

decode each pattern could detract from the processing of the mathematical content.

Spearcons lie somewhere between recognizable speech and abstract sounds such as

earcons. A number of studies have found that spearcons were easier to learn and

resulted in an increase in performance in interactive interface tasks such as menu

browsing [13]. The major advantage of spearcons is that they can potentially function

as either a descriptive lexical phrase or an abstract sound depending on their

familiarity to the listener or their function as a structural delimiter.

3.3 Prosody

Prosody, i.e. the patterns of stress and intonation in a language, can also be used to

indicate the structure of an equation through the manipulation of parameters such as

pause duration, speaking rate and pitch. Raman’s AsTeR system used prosody in

conjunction with additional lexical cues such as “fraction” and “end-fraction”, and

“quantity” and “end-quantity” as parentheses to indicate sub-expressions [2]. Different

levels of exponents and nested expressions were indicated using persistent speech

cues, such as a speaking voice with raised pitch to indicate an exponent. Other

systems have attempted to use prosodic schemes more reminiscent of normal speech.

Stevens analyzed recordings of spoken algebraic equations and used the pitch, timing

and amplitude characteristics of these recordings as the basis for an audio interface

designed specifically for the presentation of mathematics [4]. Fitzpatrick argues that

the prosodic model used by Stevens conflicts with current research in prosodic and

intonational phonology, and instead proposed an alternative model which relates the

structure found in mathematical expressions with the inherent composition and

structure of the English language [15]. In this model, the nesting of clauses within

sentence structures is equated to the nesting of sub-expressions within a larger

equation whose structure is similarly indicated through the insertion of pauses and

alterations to the speaking rate. Additional lexical cues are added to indicate certain

specific terms such as fractions and superscripts, in a similar fashion to Nemeth

MathSpeak [11].

Experiments with these systems have found that prosody can be used to determine

the structure of an equation and, significantly, required less effort from the user than

lexical cues [4]. Natural prosody would therefore appear to be highly suitable for the

presentation of structural information as the subtle prosodic inflections of natural

speech are intuitively understood and relatively undemanding in terms of cognitive

effort. The significant drawback to this approach is that the structure of complex,

nested equations is difficult to determine based on prosody alone, suggesting that

some form of additional delimiter is required [4, 15].

4 Prosody and Audio Punctuation, Earcons and Spearcons

The prosodic model developed by Fitzpatrick [15] based on the natural structures

found in spoken English appears to be the most cognitively efficient means of

conveying the structure of mathematical content via audition. However, in this new

model, the prosodic grouping is reinforced with a limited number of additional sounds

which function as a form of audio punctuation. These include simple beep sounds

which are used as audio parentheses to delimit sub-expressions and reinforce the

segmentation implied by the prosodic model. The additional lexical cues used by

Fitzpatrick to indicate fractions and exponents are also replaced with equivalent

spearcons whose shorter duration should reduce the negative influence of the suffix

effect. It is beyond the scope of this paper to describe how the prosodic model caters

for each individual construct found in mathematical material, however, the following

sections will describe how some of the most common terms and expressions will be

represented.

The prosodic model developed by Fitzpatrick assumes that a mathematical

expression is a structure containing an arbitrary number of nested sub-expressions,

which can be resolved into either a sentence structure containing various clauses, or a

paragraph of sentences [15]. The degree of nesting in the equation is therefore highly

important, and once this factor has been determined, the appropriate pauses and

changes in speech rate can be determined and added to the basic verbalization of the

mathematical content, as illustrated by the following example.

\[
\sum_{i=1}^{n-1} \left( a^{i} + \frac{i+1}{i-1} \right) \tag{1}
\]

In this equation, the summation which applies to both of the remaining terms

represents the first level of nesting. Two terms are contained within the summation,

however, the second of these terms itself contains two complex terms in the

numerator and denominator of the fraction, and this will require another level of

nesting. The representation of this equation must therefore include a lengthy pause to

indicate that the remaining terms are all contained within the summation, followed by

additional pauses to indicate the scope of the superscript and fraction. The

approximate spoken rendering of this equation would therefore be, “the sum from i

equals 1 to n minus 1 of…. a to the i.. plus begin fraction, i plus 1 over i minus 1, end

fraction”. Adjustments to the speaking rate are also applied at the various clause

boundaries to further reinforce the segmentation of the material. In the above

example, the complex terms contained within the fraction would be indicated by an

increase in the speaking rate relative to the rest of the equation. However, these

adjustments must be carefully controlled to ensure that sub-expressions at deeper

nesting levels are not uttered too quickly. With this in mind, a 6% increase in

speaking rate was found to produce acceptable results for a number of nesting levels

[15].
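The adjustments described above can be summarised in a few lines. In the sketch below only the 6% rate increase per nesting level is taken from the text; the baseline pause length and the rule that deeper boundaries receive shorter pauses are illustrative assumptions.

```python
# Sketch of the prosodic adjustments: speaking rate rises with nesting
# depth while boundary pauses shorten.  BASE_PAUSE_MS and the pause
# rule are assumed values; RATE_STEP (6% per level) is from the text.

BASE_RATE = 1.0          # normal speaking rate (relative)
RATE_STEP = 0.06         # 6% faster per nesting level
BASE_PAUSE_MS = 600      # pause at the outermost boundary (assumed)

def speaking_rate(depth):
    """Relative speaking rate for a sub-expression at a given depth."""
    return BASE_RATE * (1 + RATE_STEP) ** depth

def boundary_pause_ms(depth):
    """Deeper boundaries get proportionally shorter pauses (assumed)."""
    return BASE_PAUSE_MS / (depth + 1)

for depth in range(3):
    print(depth, round(speaking_rate(depth), 3), boundary_pause_ms(depth))
```

Keeping the per-level increase small, as the compounding in `speaking_rate` shows, is what prevents deeply nested sub-expressions from being uttered too quickly.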

The prosodic model described in the above example indicates the structure of the

equation in a highly intuitive manner, however clearly this approach will never be

entirely sufficient, as indicated by the presence of an additional lexical cue to indicate

the fraction. In this new model, lexical terms such as “begin fraction” are replaced

with spearcons, and these are used to support the prosodic segmentation of all

superscripts, subscripts and fractions. Both these spearcons and the simple, beep-like

earcons which function as audio parentheses are applied to the prosodically enhanced

speech using a similar strategy which reinforces the segmentation implied with

prosodic parameters such as pauses and adjustments to the speaking rate. These short

beeps function as audio parentheses in much the same way as brackets are used to

delimit equations in many programming languages. An opening audio parenthesis is

represented by a short beep, specifically a 20 ms sine-wave pulse, which is

positioned to the left of the main voice signal using binaural processing. A closing

audio parenthesis is represented in the same way except the sine wave pulse is now

positioned to the right. This scheme significantly enhances the structural segmentation

implied by the speech prosody, and in addition, the extremely short duration and the

externalizing effect of the binaural processing minimizes the detrimental effect of

these additional sounds on the primary speech content. The use of spatial position to

indicate opening or closing parenthesis is also very cognitively efficient as it relies on

a relatively coarse and intuitive sense of spatial hearing. These two earcons were

designed to support this left-right distribution as the frequency of each sine tone

ramps up or down depending on whether it is an opening or closing parenthesis

respectively. This frequency ramp, or glissando, is not highly apparent due to the

quite short duration of the pulse, however, it does help to further distinguish the two

earcons and is conceptually suggestive of the implied opening and closing delimiters.

Multiple nested levels of audio parenthesis are created by adjusting both the spatial

position and the frequency range of the sine wave pulse.
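A minimal synthesis of these audio parentheses might look as follows. The 20 ms duration, the left/right placement and the up/down glissando follow the description above, while the sample rate, frequency range, and the use of hard panning in place of true binaural processing are simplifying assumptions.

```python
import math

# Sketch of the opening/closing audio-parenthesis earcons: a 20 ms
# sine pulse whose frequency ramps up (opening) or down (closing).
# Hard left/right panning stands in for the binaural processing
# described in the text; frequencies and sample rate are assumed.

SR = 44100               # sample rate in Hz (assumed)
DUR = 0.020              # 20 ms pulse, as described above

def parenthesis_earcon(opening=True, f_lo=800.0, f_hi=1200.0):
    """Return stereo (left, right) sample frames for one earcon."""
    n = int(SR * DUR)
    phase = 0.0
    frames = []
    for k in range(n):
        t = k / n
        # linear glissando: up for an opening, down for a closing
        f = f_lo + (f_hi - f_lo) * (t if opening else 1.0 - t)
        phase += 2 * math.pi * f / SR
        s = math.sin(phase)
        # opening earcons to the left of the voice, closing to the right
        frames.append((s, 0.0) if opening else (0.0, s))
    return frames

open_paren = parenthesis_earcon(opening=True)
close_paren = parenthesis_earcon(opening=False)
print(len(open_paren))   # 882 frames = 20 ms at 44.1 kHz
```

Nested levels could then be produced by shifting `f_lo`/`f_hi` and the pan position per level, as the text suggests.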

Parentheses can be used to remove many of the structural ambiguities which may

arise in spoken mathematics, however, the vertical layout of printed notation, with

fractions, subscripts and superscripts for example, must also be indicated using some

form of additional cue. In this case, the use of non-speech sounds will require the user

to learn the meaning of different abstract sounds, however, the use of lexical cues is

also problematic as the additional speech content may detract from the main spoken

content. For this reason, spearcons based on the phrases “frac”, “sup” and “sub” are

used to indicate fractions, superscripts and subscripts respectively. In each case, the

spearcon is presented using the same left-right, binaural processing to indicate the

beginning or ending of the fraction, superscript or subscript. This approach is

advantageous as the original phrase provides a lexical indicator of the type of

structure while the significantly reduced duration (time-compressed to approximately

200 ms) and simplified phrasing should also lessen the impact of the suffix effect. In

addition, the use of spatial position to indicate the beginning or end of a sub-expression further reduces the duration and detrimental effect of the required lexical cues.
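The time compression used to generate these spearcons can be sketched naively as uniform sample dropping. The roughly 200 ms target follows the text, but real spearcon generation uses pitch-preserving time-scale modification, so the resampling shortcut below should be read as an illustrative assumption.

```python
# Naive sketch of spearcon-style time compression: shorten a recorded
# phrase to roughly 200 ms by uniform sample dropping.  This changes
# pitch as well as duration; proper spearcon generation would use a
# pitch-preserving time-scale modification algorithm instead.

def compress_to(samples, sample_rate, target_ms=200):
    """Return samples shortened to approximately target_ms."""
    target_n = int(sample_rate * target_ms / 1000)
    if len(samples) <= target_n:
        return list(samples)           # already short enough
    step = len(samples) / target_n     # uniform decimation ratio
    return [samples[int(i * step)] for i in range(target_n)]

phrase = [0.0] * 44100                 # a 1 s phrase at 44.1 kHz
spearcon = compress_to(phrase, 44100)
print(len(spearcon) / 44100)           # ≈ 0.2 s
```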

5 Conclusion

Working with mathematical material via audition rather than vision places an

increased cognitive load on the listener. This is largely unavoidable due to the critical

importance of spatial structures in mathematical notation and the significant

difference between audition and vision in terms of spatial resolution. The design of a

system for visually impaired users must therefore concentrate on resolving structural

ambiguities in a cognitively efficient manner. In addition, the system must provide the

user with temporal control over the material in the form of browsing and overview

capabilities, so the user can direct their own path through the material at their own

pace.

In his PhD thesis on the AsTeR system, T. V. Raman stressed the importance of

identifying dimensions in the audio space to parallel the functionality of the

dimensions in the visual setting [2]. This does not imply that these dimensions should

be simply mapped from one modality to another, instead it is the functionality of these

dimensions that should be replicated between these two modalities. The spatial layout

of printed mathematical material should not therefore be simply replicated with

sound. Instead the functionality of this layout must be provided via audition using a

method that is appropriate for this modality. The human auditory system is

particularly attuned to speech and this suggests that the most cognitively efficient

means of presenting mathematical structural information via audition is through the

replication of natural speech patterns, such as prosody. However, prosody alone

cannot entirely resolve structural ambiguities and so additional cues are required.

While non-speech sounds such as earcons can be constructed to represent a

hierarchical structure (such as nested parenthesis or menu items) this will require

additional cognitive processing on the part of the listener which may distract from the

processing of the mathematical material, and this is particularly true of complex

schemes based upon musical structures. The audio punctuation strategy proposed in

this paper overcomes this issue, as in this instance it is not the actual audio content of

the earcon that is important, but rather its coarse spatial position (simply either to the

left or to the right) relative to the main speech signal. In this way, the earcons function

as audio parentheses which punctuate and augment the speech content in an intuitive

and non-distracting manner.

Spearcons are an interesting new development as they lie somewhere between a

clearly comprehensible spoken utterance and an abstract non-speech sound.

Spearcons are therefore an excellent way to indicate structural elements such as

fractions, superscripts and subscripts as they are less distracting than traditional lexical

cues but still provide a description of the particular structural element involved.

References

1. Gillan, D., Barraz, P., et al.: Cognitive Analysis of Equation Reading: Application to the

development of the MathGenie. ICCHP 2004, LNCS, vol. 3118. pp. 630-637, Springer,

Heidelberg (2004).

2. Raman, T. V.: Audio Systems for Technical Reading. PhD Thesis, Department of Computer

Science, Cornell University, NY, USA, May 1994.

3. Gaura, P.: REMathEx – Reader and Editor of the Mathematical Expressions for Blind

Students. ICCHP 2002, LNCS, vol. 2398. pp. 486-493, Springer, Heidelberg (2002).

4. Stevens, R. D.: Principles for the Design of Auditory Interfaces to Present Complex

Information to Blind Computer Users. PhD Thesis, University of York, UK (1996).

5. Harling, P. A., Stevens, R. and Edwards, A.: Mathgrasp: The design of an algebra

manipulation tool for visually disabled mathematicians using spatial-sound and manual

gestures. HCI Group, University of York, UK (1995).

6. Grantham, D. W.: Detection and discrimination of simulated motion of auditory targets in

the horizontal plane. J. Acoust. Soc. Am. vol. 79, pp. 1939–1949. (1986)

7. Hollander, A. J. and Furness, T. A.: Perception of Virtual Auditory Shapes. Proceedings of

the International Conference on Auditory Displays. November, 1994.

8. Goose, S. and Möller, C.: A 3D Audio Only Interactive Web Browser: Using Spatialization

to Convey Hypermedia Document Structure. Proceedings of the ACM International

Conference on Multimedia, pp. 363–371, Orlando, USA (1999).

9. Begault, D. R. and Erbe, T. R.: Multi-channel spatial auditory display for speech

communications. Audio Engineering Society 95th Convention, Preprint No. 3707 (1993).

10. Crispien, K. and Petrie, H.: Providing Access to Graphical-Based User Interfaces for Blind

People: Using Multimedia System Based on Spatial Audio Representation. 95th AES

Convention, J. Audio Eng. Soc, (Abstracts), vol. 41, pp. 1060 (1993).

11. Nemeth, A.: Abraham Nemeth’s Anthology on the Unified Braille Code. Available at

http://www.math.virginia.edu/ arm4r/nemeth

12. Baddeley, A.: Human Memory: Theory and Practice. London: Lawrence Erlbaum

Associates (1990).

13. Walker, B., Nance, A. and Lindsay, J.: Spearcons: Speech-based Earcons Improve

Navigation Performance in Auditory Menus. Proceedings of the 12th International

Conference on Auditory Display, London, UK, June 20-23 (2006).

14. Edwards, A. D. N.: Using sounds to convey complex information. In: A. Schick and M.

Klatte (ed.), Contributions to Psychological Acoustics: Results of the Seventh Oldenburg

Symposium on Psychological Acoustics, Oldenburg, pp. 341-358 (1997).

15. Fitzpatrick, D.: Mathematics: how and what to speak. ICCHP, Springer Verlag, pp. 1199-

1206 (2006).