Cortical processing of phonetic and emotional information in speech: A cross-modal priming study

Erin Diamond a, Yang Zhang a,b,c,d,*
a Department of Speech-Language-Hearing Sciences, University of Minnesota, Minneapolis, MN
55455, USA
b Center for Neurobehavioral Development, University of Minnesota, Minneapolis, MN 55455, USA
c Center for Applied and Translational Sensory Science, University of Minnesota, Minneapolis, MN
55455, USA
d School of Foreign Languages, Shanghai Jiao Tong University, Shanghai, 200240, China
* Corresponding author at: Department of Speech-Language-Hearing Sciences, University of
Minnesota, Minneapolis, MN 55455, USA
E-mail address: (Y. Zhang)
Telephone: +1 612 624 7818
Fax: +1 612 624 7586
The current study used behavioral and electrophysiological measures to investigate the timing,
localization, and oscillation characteristics of cortical activities associated with phonetic
and emotional information processing in speech. The experimental design employed a cross-
modal priming paradigm in which normal adult participants were presented with a visual prime
followed by an auditory target. Primes were facial expressions that systematically varied in
emotional content (happy or angry) and mouth shape (corresponding to /a/ or /i/ vowels).
Targets were spoken words that varied by emotional prosody (happy or angry) and vowel (/a/ or
/i/). In both the phonetic and prosodic conditions, participants were asked to judge the congruency
status of the visual prime and the auditory target. Behavioral results showed a congruency effect
for both percent correct and reaction time. Two ERP responses, the N400 and late positive
response (LPR), were identified in both conditions. Source localization and inter-trial phase
coherence of the N400 and LPR components further revealed different cortical contributions and
neural oscillation patterns for selective processing of phonetic and emotional information in
speech. The results provide corroborating evidence for the necessity of differentiating brain
mechanisms underlying the representation and processing of co-existing linguistic and
paralinguistic information in spoken language, which has important implications for theoretical
models of speech recognition as well as clinical studies on the neural bases of language and
social communication deficits.
Keywords: EEG; phonetic perception; emotional prosody; N400; LPR; cortical oscillation
Speech carries both linguistic (e.g., phonetic) and paralinguistic (e.g., emotional prosodic)
information. Emotional prosody involves the manipulation of acoustic cues such as fundamental
frequency, loudness, and voice quality, which allows the speaker to communicate emotion
(Kotz & Paulmann, 2011; Patel et al., 2011). In the current experiment, we were
particularly interested in cortical mechanisms underlying the processing of the emotional aspect
of speech prosody as opposed to the recognition of phonetic identity in the same spoken words.
Understanding the brain mechanisms that govern the proper use of the affective cues along with
the expression of linguistic content is important for theories on the neural representations of
language as well as practical applications such as intervention for individuals with
communication difficulties in terms of affective speech comprehension/production (Izdebski,
A number of important neuroanatomical models for speech and voice processing based on
functional imaging studies have been proposed to explain how the adult human brain represents
sublexical phonological and lexical semantic information (Hickok & Poeppel, 2007) as well as
the paralinguistic vocal variations that convey the speaker’s emotion and identity (Belin,
Fecteau, & Bédard, 2004; Klasen, Chen, & Mathiak, 2012; Schirmer & Kotz, 2006). In the
Hickok & Poeppel (2007) model for speech perception, the cortical processing system for spoken
language features bilateral ventral streams projecting from auditory cortex to middle and inferior
posterior temporal regions for mapping sound onto semantic representations and a left-dominant
dorsal stream projecting from auditory cortex to parietal-temporal boundary area and frontal
regions for mapping sound onto articulatory representations. One limitation of this model is that
it does not consider the social-indexical aspects of speech such as emotional prosody and speaker
identity. The other models endeavor to overcome this limitation under the general theme of voice
perception by emphasizing functional specialization and heterogeneity of the temporal and
frontal cortices, the cortico-subcortical pathways involving regions such as amygdala for
emotion processing, and the differential contributions of the right and left hemispheres in vocal
emotion processing. In the Schirmer & Kotz (2006) model, for example, vocal emotion
processing undergoes three stages. The first consists of initial bilateral acoustic analyses in the
auditory cortex. The second involves projections to specialized voice-selective areas in the
superior temporal sulci and gyri for more complex analysis and synthesis of the emotionally
salient information. The third stage continues on with projections to frontal areas for higher-
order evaluation and cognition. While this adult model provides both temporal and spatial
specifications towards understanding the neural networks for processing emotional prosody,
more research is needed to test its predictions, delineate the temporal windows and functional
properties (including hemispheric lateralization) of the implicated cortical regions, and
understand how the neural networks function and change in relation to emotional valence,
sensory modality, age, gender, linguistic experience, and pathological conditions (see Blasi et
al., 2011; Izdebski, 2008; Wittfoth et al., 2010 for discussion).
Priming paradigms have been employed in behavioral and event-related potential studies to
evaluate the intersection between pairs of stimuli both within and across modalities. In these
paradigms, primes and targets are combined to investigate the effects of their congruency or
incongruency in a specified aspect of interest. Greene, Eaton, and LaShell (2001) explored the
effect of within-modal and cross-modal priming on spatio-temporal event processing. They
observed an effect for visual priming of auditory targets, but not the reverse. The authors
therefore proposed that a visual event provided specific information that could facilitate
processing of an auditory target, whereas auditory primes produced weaker priming effects
because a given sound could correspond to a wide range of visual events.
An adjective evaluation task created by Fazio, Sanbonmatsu, Powell, and Kardes (1986)
was an early example of affective priming. In that experiment, affectively related primes
facilitated an evaluative decision of adjective targets, as demonstrated by shorter latencies
preceding the adjective evaluation (“good” or “bad”). This has been described as “automatic
attitude activation” (for a review, see Fazio, 2001). Affective priming paradigms since that time
have extended beyond the traditional visual word prime-target pairs to explore interactions
between stimulus domains and modalities. Picture primes and written word targets have been
used to investigate the neural mechanisms at play during cross-domain visual affective priming
paradigms (Zhang, Lawson, Guo & Jiang, 2006; Zhang, Li, Gold & Jiang, 2010). Other prime-
target pairs have spanned two modalities, combining stimuli such as affective sentence prosody
and written words (Schirmer, Kotz & Friederici, 2002) or musical stimuli and written words
(Goerlich, Witteman, Schiller, Van Heuven, Aleman & Martens, 2012). Facial expressions have
been used as targets in cross-modal priming paradigms with sentence primes (Czerwon,
Hohlfeld, Wiese & Werheid, 2013) and musical primes (Lense, Gordon, Key & Dykens, 2014).
Other cross-modal affective priming paradigms have used facial expressions as primes for
emotional words (Schirmer & Kotz, 2006) and musical stimuli (Kamiyama, Abla, Iwanaga &
Okanoya, 2013).
In order to determine the facilitative effect of primes on targets, a variety of behavioral and
brain measures have been utilized. As mentioned previously, differences in reaction time in
response to congruent vs. incongruent pairs are thought to reflect increased or decreased
facilitation of target processing. In conjunction with this measure, event-related potentials
(ERPs) are especially valuable for investigating the timing of neural components underlying both
phonetic and prosodic processing. Of particular interest to the current study are the N400 and
the late positive response (LPR, also referred to as the late positive component [LPC] or late
positive potential [LPP]).
The event-related potential (ERP) technique is a noninvasive method of measuring time-locked
neural responses to specific events. It provides high temporal resolution suitable for
investigating brain responses to acoustic and linguistic processing at the millisecond level. The
N400 component is a negativity occurring approximately 400 ms after the onset of target words
involving a violation of meaning (Kutas & Hillyard, 1980). Traditionally, this component has
been studied in the context of semantic expectancy violations in sentences (for a review, see
Kutas & Federmeier, 2011), but it has also been observed in response to incongruency in
affective priming paradigms (e.g., Kamiyama, Abla, Iwanaga & Okanoya, 2013; Paulmann &
Pell, 2010). A late positive response (LPR) following the N400 has also been observed in
affective priming experiments and is generally seen to reflect increased attention to unexpected
targets (Werheid, Alpay, Jentzsch & Sommer, 2005; Zhang, Li, Gold & Jiang, 2010). This
positive-going deflection is often discussed as a possible variant of the P300 component,
reflecting the updating of working memory (Hajcak, Dunning & Foti, 2009; Zhang et al., 2010; also see
Donchin & Coles, 1988). The P300 is known to be a neurocognitive index of novelty detection
and attentional capture, and its amplitude is strongly dependent on the stimulus context and task
demands (e.g., Nie, Zhang, & Nelson, 2014). Other studies have identified a late positive
response in experiments of emotion or prosody. The late positivity was identified in various
tasks during which the participant was consciously attending to some characteristic of the
stimuli, such as congruency (Aguado et al., 2013; Chen, Zhao, Jiang & Yang, 2011; Kamiyama et
al., 2013), sound intensity deviation (Chen et al., 2011), or level of arousal (Paulmann, Bleichner
& Kotz, 2013).
Localization of these brain processes is another area of investigation. It is generally
accepted that language is primarily localized in the left hemisphere whereas emotion processing
has greater right hemisphere involvement. However, Schirmer & Kotz (2006) challenge the idea
that vocal emotion is uniquely a right hemisphere process. Rather, they propose that prosodic
processing is a multi-step process with differential involvement of both hemispheres. Recent
investigations into the lateralization of prosody support this model (Iredale, Rushby, McDonald,
Dimoska-Di Marco & Swift, 2013; Paulmann, Bleichner & Kotz, 2013; Witteman et al., 2014).
A relatively new area of analysis in ERP research is the application of time-frequency
analysis to examine the degree of trial-by-trial coherence in cortical rhythms that may give rise to
the salient components in the averaged ERP waveforms (Koerner & Zhang, 2015; Luck, 2014).
There is a growing body of literature on the cortical rhythms that mediate phonetic and prosodic
processing in an audiovisual priming paradigm. The different cortical rhythms are considered to
reflect resonant neural networks that code and transfer information across brain regions to
support various sensory, motor and cognitive processes. Researchers have investigated how
frequency bands are modulated in response to different auditory and visual cues (for a review of
EEG coherence, see Weiss & Mueller, 2003). Of particular interest in the current investigation
are the following oscillations: delta (1–4 Hz), theta (4–8 Hz), beta (12–30 Hz) and gamma (>30
Hz). Senkowski, Schneider, Foxe, and Engel (2008) reviewed the implications of these cortical
oscillations on cross-modal sensory integration. Increased gamma oscillations have been
observed in response to incongruency between visual primes and auditory targets (Schneider,
Debener, Oostenveld & Engel, 2008). Theta band power has been proposed to increase in
response to visual presentation of emotional faces (Balconi & Pozzoli, 2009; Knyazev,
Slobodskoj-Plusnin & Bocharov, 2009). It may also increase in response to semantic violations
(Hald, Bastiaansen & Hagoort, 2006). Delta modulations have been shown to increase in
response to linguistic processing (Giraud & Poeppel, 2012; Scheeringa, Petersson, Oostenveld,
Norris, Hagoort & Bastiaansen, 2009). Similarly, delta band power may be more synchronized
in response to emotional faces compared to neutral ones (Knyazev et al., 2009). In addition, it is
generally observed that beta oscillations occur in response specifically to the visual information
of emotional facial stimuli (Balconi & Pozzoli, 2009).
In this event-related potential study, we explored phonetic and emotional prosodic
(henceforth referred to as prosodic) processing using two visual priming conditions to examine
behavioral and neural responses associated with the identification of phonetic mismatch and
prosodic mismatch. In line with the priming literature, we hypothesized that participants would
take more time to react to incongruent audiovisual stimuli than to congruent audiovisual stimuli,
regardless of condition (prosodic vs. phonetic). We anticipated that behavioral accuracy for
detecting incongruent audiovisual pairs would be dependent on dimensional information
(prosodic vs. phonetic) with potential dimensional interaction effects. We further predicted that
participants would exhibit both an N400 response and a late positive response to incongruent
audiovisual stimuli in both conditions. In addition to the conventional ERP waveform analysis,
we applied a source localization method to test whether different cortical regions were involved in
generating the N400 and LPR responses for the two conditions (prosodic vs. phonetic). We were
also interested in determining whether different cortical rhythms mediated the generation of the
N400 and LPR responses in the two conditions. The results would collectively provide a better
understanding of the brain mechanisms underlying the processing of prosodic and phonetic
information in spoken language.
Twelve right-handed adults (6 males, 6 females) between the ages of 18 and 24 (mean =
20.6) participated in the experiment. All participants were native English speakers with no
history of speech, language, or hearing impairment. All had normal hearing in audiometric
assessment and normal or corrected-to-normal vision.
The stimuli included both visual primes and auditory targets. The visual primes were
four photographs of a male face showing a happy or an angry expression with a mouth shape that
was representative of either an /ɑ/ or an /i/ vowel. The same male speaker produced the four
auditory targets. These were consonant-vowel-consonant (CVC) words, /bɑb/ (“bob”) and /bib/
(“beeb”), produced with happy or angry prosody. The naturally recorded spoken words were
then normalized in Praat to have the same duration and average RMS
(root mean square) intensity.
During the EEG recording session, participants were seated in a comfortable chair in a
soundproof booth (ETS-Lindgren Acoustic Systems). Participants were fitted with a stretchable
64-channel Waveguard cap, and continuous EEG data were recorded using the Advanced Neuro
Technology system. The Ag/AgCl electrodes were arranged to match the standard International
10-20 Montage System and intermediate locations, with the ground electrode located at the AFz
electrode. Horizontal eye movements (HEOG) and eye blinks (VEOG) were monitored with
four bipolar facial electrodes, positioned on the outer canthi of each eye and in the inferior and
superior areas of each orbit, respectively. The bandpass filter for EEG recording was set to
0.016–200 Hz, and the sampling rate was 512 Hz. Impedances for the individual
electrodes were kept at or below 5 kΩ.
Visuals were presented in the center of the screen against a green background. Each
visual prime was presented for 400 ms before the onset of the target auditory stimulus whose
duration was edited to be 295 ms (Figure 1). The pitch-synchronous-overlap-add method in
Praat (Boersma & Weenink, 2015) was used to normalize word duration. Auditory stimuli were
presented at 60 dB sensation level (Rao, Zhang, & Miller, 2010). There were 160 trials in each
block, and two blocks were presented for each condition (phonetic and
prosodic). Within each block, subjects were given a 10 s break after every 40 trials. No two
consecutive blocks used the same condition. The inter-block interval was 30 s. The presentation
order for the phonetic and prosodic blocks was counterbalanced across participants. The total
duration of the experiment was approximately 60 minutes.
In the prosodic condition, participants were instructed to evaluate a match or mismatch
between the emotion of the face and the emotion of the voice. In the phonetic condition,
participants were instructed to evaluate a match or mismatch between the articulation shown by
the mouth shape and the vowel of the auditory word target. In both conditions, they indicated
their responses (match vs. mismatch) by pressing the left or right arrow key on a keyboard
according to the instruction given in the experimental session.
Data Analysis
Behavioral data analysis
Behavioral responses were analyzed for percent correct accuracy and mean reaction time.
Analysis of percent accuracy accounted for all possible response categories (hits, correct
rejections, misses, and false alarms). The pairings of a face and a voice were classified as either
congruent or incongruent. A “correct” response was agreement (“yes”) with a congruent pairing
or disagreement (“no”) with an incongruent pairing.
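As an illustration of the scoring rule just described, the following Python sketch tallies the four response categories and computes percent correct. The function names, the tuple layout of a trial, and the example trials are hypothetical and are not taken from the study's analysis scripts; treating a congruent pairing as the "signal" is likewise an assumption made only for labeling.

```python
from collections import Counter


def classify_response(is_congruent: bool, said_match: bool) -> str:
    """Label one trial with a signal-detection category.

    Assumption: a congruent pairing is treated as the "signal", so agreeing
    with it is a hit and disagreeing with it is a miss.
    """
    if is_congruent:
        return "hit" if said_match else "miss"
    return "false_alarm" if said_match else "correct_rejection"


def percent_correct(trials):
    """Percent correct over an iterable of (is_congruent, said_match) pairs."""
    counts = Counter(classify_response(c, r) for c, r in trials)
    n_correct = counts["hit"] + counts["correct_rejection"]
    total = sum(counts.values())
    return 100.0 * n_correct / total


# Made-up illustrative trials, not data from the study.
example_trials = [
    (True, True),    # hit
    (True, False),   # miss
    (False, False),  # correct rejection
    (False, True),   # false alarm
    (True, True),    # hit
]
example_accuracy = percent_correct(example_trials)
```

Computing the score from all four categories, rather than from hits alone, keeps a participant who always answers "match" from appearing highly accurate.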
Mean reaction times were calculated for each subject for each of the four conditions
(phonetic congruent, phonetic incongruent, prosodic congruent, prosodic incongruent). Mixed
repeated-measure ANOVA tests evaluated two main factors and their interaction: congruency
(congruent vs. incongruent) and condition (phonetic vs. prosodic). Post-hoc tests were also
conducted to further investigate interaction effects.
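For a 2 × 2 fully within-subject design such as this one, each effect has one degree of freedom, so its F(1, n − 1) statistic equals the squared one-sample t statistic of the corresponding per-subject contrast. The sketch below uses that identity with NumPy. It is not the statistical package procedure used in the study, and the array layout, function names, and simulated reaction times are assumptions for illustration only.

```python
import numpy as np


def f_from_contrast(scores):
    """F(1, n-1) for a 1-df within-subject effect (squared one-sample t)."""
    scores = np.asarray(scores, dtype=float)
    n = scores.size
    t = scores.mean() / (scores.std(ddof=1) / np.sqrt(n))
    return t ** 2


def rm_anova_2x2(rt):
    """F statistics for a 2 x 2 repeated-measures design.

    rt has shape (n_subjects, 2, 2):
      axis 1 = condition (0 = phonetic, 1 = prosodic)
      axis 2 = congruency (0 = congruent, 1 = incongruent)
    This layout is an assumption of the sketch.
    """
    rt = np.asarray(rt, dtype=float)
    congruency = rt[:, :, 1].mean(axis=1) - rt[:, :, 0].mean(axis=1)
    condition = rt[:, 1, :].mean(axis=1) - rt[:, 0, :].mean(axis=1)
    interaction = (rt[:, 1, 1] - rt[:, 1, 0]) - (rt[:, 0, 1] - rt[:, 0, 0])
    return {
        "congruency": f_from_contrast(congruency),
        "condition": f_from_contrast(condition),
        "interaction": f_from_contrast(interaction),
    }


# Simulated reaction times (ms) for 12 subjects; not data from the study.
rng = np.random.default_rng(0)
simulated_rt = 600.0 + rng.normal(0.0, 30.0, size=(12, 2, 2))
simulated_rt[:, :, 1] += 40.0  # build in a slower response to incongruent pairs
results = rm_anova_2x2(simulated_rt)
```

With 12 subjects, each F value in `results` would be evaluated against an F distribution with 1 and 11 degrees of freedom.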
ERP waveform analysis
ERP averaging was performed offline in BESA (Version 6.0, MEGIS Software, GmbH,
Germany). Artifact correction parameters were set at 100.0 μV for HEOG and 150.0 μV for
VEOG to minimize the effects of eye drift and blink, respectively. After the artifact correction
was applied, the raw EEG data were bandpass filtered at 0.5–40 Hz. The ERP epoch length was 1500
milliseconds, including a pre-stimulus baseline of 100 milliseconds. The automatic artifact
scanning tool in BESA was applied to detect noisy signals. The automatic artifact rejection
criterion was set at ±50 μV. Additionally, trials where the difference between two
adjacent sample points exceeded 75 μV were excluded from analysis. To improve the signal-to-
noise ratio of the data, nine electrode regions were defined for analysis, which were organized
from anterior to posterior and left to right (Figure 3). Similar channel groupings were used in
previous studies (Chen et al., 2011; Schneider et al., 2008; Stevens & Zhang, 2014; Zhang et al.,
Based on previous literature and visual inspection of the grand mean ERP waveform data,
two time windows were selected for analysis: an early time window from 250–450 ms (N400
component, Aguado et al., 2013; Kamiyama et al., 2013; Kotz & Paulmann, 2011) and a late
time window from 700–1000 ms (late positive response, Chen et al., 2011; Paulmann,
Bleichner & Kotz, 2013; Kotz & Paulmann, 2011). These ERP component latencies were
assessed relative to the onset of the auditory stimulus, which occurred at 400 ms after the onset
of visual prime. The 100 ms baseline for the entire epoch (including ERP responses to both the
visual prime and the auditory target) was taken relative to the onset of the visual prime.
Repeated measures ANOVA tests were performed for these two peaks of interest. Within-
subject factors were condition (phonetic and prosodic), laterality (left, middle, and right), and
site (anterior, central, and posterior).
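The epoch rejection criteria and the window arithmetic described above can be sketched in Python as follows. This is not the BESA implementation: the function names are hypothetical, the amplitude criterion is assumed to apply to absolute amplitude in each epoch, and the window conversion assumes that sample 0 of the epoch is the start of the 100 ms baseline before the visual prime.

```python
import numpy as np

FS = 512                  # sampling rate in Hz
BASELINE_MS = 100         # baseline preceding the visual prime
PRIME_TO_TARGET_MS = 400  # visual prime onset to auditory target onset


def keep_epoch(epoch_uv, amp_limit=50.0, step_limit=75.0):
    """Return True if an epoch (channels x samples, in microvolts) passes.

    An epoch fails if any sample exceeds +/- amp_limit or if any two
    adjacent samples differ by more than step_limit.
    """
    epoch_uv = np.asarray(epoch_uv, dtype=float)
    too_large = np.abs(epoch_uv).max() > amp_limit
    too_steep = np.abs(np.diff(epoch_uv, axis=-1)).max() > step_limit
    return not (too_large or too_steep)


def window_to_samples(start_ms, end_ms):
    """Map a window defined relative to auditory onset onto epoch samples."""
    offset_ms = BASELINE_MS + PRIME_TO_TARGET_MS
    start = int(round((start_ms + offset_ms) * FS / 1000))
    end = int(round((end_ms + offset_ms) * FS / 1000))
    return start, end


def mean_amplitude(erp_uv, start_ms, end_ms):
    """Mean amplitude of an averaged waveform inside the given window."""
    start, end = window_to_samples(start_ms, end_ms)
    return np.asarray(erp_uv, dtype=float)[..., start:end].mean(axis=-1)


n400_window = window_to_samples(250, 450)   # early window
lpr_window = window_to_samples(700, 1000)   # late window
```

Because the windows are defined relative to the auditory target while the epoch starts before the visual prime, the 500 ms offset has to be added before converting to samples; omitting it would place both windows inside the response to the prime.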
Source localization analysis
Source localization analysis was performed using the minimum norm estimation (MNE)
in BESA software (Zhang et al., 2011). MNE analysis approximated the current source space
with minimal a priori assumptions about the active sources, using the smallest norm to explain
the measured ERP signals. MNE was implemented in the process outlined below:
1. The electrode montage was calculated by using the standard positions for the WaveGuard
EEG cap relative to the standard head model (Boundary Element Model) in BESA.
2. Depth weighting and spatio-temporal weighting were adopted to avoid bias towards
superficial sources and improve the focality and reliability of the source activities.
3. The total activity at each source location (750 dipole locations in the left hemisphere and 750
in the right hemisphere) was calculated as the root mean square of the dipole source
activities. These solutions were then projected to the standard realistic brain model in BESA.
The current source data for the prefixed locations at all latencies were further analyzed for
temporal and spatial interpretations.
4. The total MNE activities in each hemisphere were added at each time point. A two-tailed z-
test relative to baseline mean and variance was applied to the MNE differences between the
two stimuli at each sample point.
5. Regional contributions to the total MNE activities were examined using standard anatomical
boundaries in the Talairach space for each region of interest (ROI) in the brain space.
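Steps 3 and 4 above can be sketched in Python under stated assumptions. The code does not compute a minimum norm solution and is not the BESA implementation; it only illustrates how source activities, once estimated, might be combined and tested. The array layout (locations × dipole components × samples), the function names, and the simulated values are all hypothetical, and the default criterion of |z| ≥ 4 follows the threshold reported in the localization results.

```python
import numpy as np


def hemisphere_total(source_activity):
    """Sum location-wise RMS activity over one hemisphere.

    source_activity has shape (n_locations, n_components, n_samples).
    Assumption: the RMS over the dipole components at a location gives
    the total activity at that location.
    """
    source_activity = np.asarray(source_activity, dtype=float)
    per_location = np.sqrt((source_activity ** 2).mean(axis=1))
    return per_location.sum(axis=0)


def baseline_z(difference_wave, n_baseline):
    """z-score every sample against the baseline mean and SD."""
    difference_wave = np.asarray(difference_wave, dtype=float)
    baseline = difference_wave[:n_baseline]
    return (difference_wave - baseline.mean()) / baseline.std(ddof=1)


def significant_samples(z, z_crit=4.0):
    """Two-tailed test: flag samples whose |z| meets the criterion."""
    return np.abs(np.asarray(z, dtype=float)) >= z_crit


# Simulated source activities for two stimulus types; not data from the study.
rng = np.random.default_rng(1)
n_baseline = 51  # roughly 100 ms at 512 Hz
congruent = rng.normal(0.0, 1.0, size=(20, 3, 200))
incongruent = rng.normal(0.0, 1.0, size=(20, 3, 200))
difference = hemisphere_total(incongruent) - hemisphere_total(congruent)
z_scores = baseline_z(difference, n_baseline)
flags = significant_samples(z_scores)
```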
Time frequency analysis
Time frequency analysis was also performed for the nine regions of interest (ROIs): left
anterior (LA), middle anterior (MA), right anterior (RA), left central (LC), middle central (MC),
right central (RC), left posterior (LP), middle posterior (MP), and right posterior (RP). Inter-trial
coherence in terms of phase locking values in delta (1–4 Hz), theta (4–8 Hz), beta (13–30 Hz)
and gamma (>30 Hz) frequency bands was computed for each subject in each of the two
conditions (phonetic vs. prosodic) with the open source EEGLAB package (Delorme & Makeig,
2004). The inter-trial coherence measure is the magnitude of the mean normalized phase vector across trials,
which can range from 0 (indicating random phase or a complete lack of
synchronization) to 1 (indicating perfect phase synchrony across trials). The inter-trial coherence
data (also referred to as phase locking values) were averaged across the frequencies within the
range of each frequency band (Koerner & Zhang, 2015). The peak phase locking values corresponding
to the N400 and late positive response components in their respective windows were identified
for each frequency band for each listening condition on an individual basis. For each
experimental condition (prosodic vs. phonetic), a direct comparison of the time-frequency
analysis data with a false discovery rate (FDR) (Benjamini & Hochberg, 1995) corrected p-value
threshold of 0.01 was conducted to determine what frequency bands mediated the N400 and late
positive responses.
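The inter-trial coherence measure and the band averaging described above can be illustrated with a short NumPy sketch. It is not the EEGLAB routine used in the study: it assumes that single-trial phase angles have already been extracted (for example, by a wavelet decomposition), and the function names, array layout, inclusive band edges, and simulated phases are assumptions for illustration.

```python
import numpy as np


def inter_trial_coherence(phases):
    """Magnitude of the mean unit phase vector across trials.

    phases holds phase angles in radians with trials on axis 0.
    The result lies between 0 (random phase) and 1 (identical phase).
    """
    phases = np.asarray(phases, dtype=float)
    return np.abs(np.exp(1j * phases).mean(axis=0))


def band_average(itc, freqs, low_hz, high_hz):
    """Average ITC over the frequencies inside [low_hz, high_hz]."""
    itc = np.asarray(itc, dtype=float)
    freqs = np.asarray(freqs, dtype=float)
    in_band = (freqs >= low_hz) & (freqs <= high_hz)
    return itc[in_band].mean(axis=0)


# Simulated random phases: 100 trials x 40 frequencies x 50 time points.
rng = np.random.default_rng(0)
freqs = np.arange(1, 41)
phases = rng.uniform(-np.pi, np.pi, size=(100, freqs.size, 50))
itc = inter_trial_coherence(phases)         # shape (40, 50)
theta_itc = band_average(itc, freqs, 4, 8)  # shape (50,)
```

Because only the phase of each trial enters the calculation, the measure is insensitive to trial-to-trial amplitude differences, which is what distinguishes it from a power estimate.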
Statistical analysis
All statistical analyses were completed in Systat 10. A repeated measures analysis of
variance (ANOVA), with α = 0.05, was conducted to examine the statistical significance of
listening condition (phonetic vs. prosodic) on N400 and late positive response latencies and
amplitudes recorded at the selected regions of interest. The repeated-measures ANOVA was
also applied in evaluating behavioral and neural responses in the phonetic and prosodic
conditions. Post-hoc paired Student’s t-tests (two-tailed) were also conducted to further
determine how the different factors contributed to the significant interaction effects in the
ANOVA tests.
Behavioral results
The percent correct data showed a significant main effect of congruency (F(1,11) = 5.79,
p < 0.05), and there was also a significant interaction between congruency and condition (F(1,
11) = 8.16, p < 0.05) (Table 1; Figure 2). Post-hoc t-tests revealed that the prosodic condition
(but not the phonetic condition) showed a significant accuracy difference between the congruent
and incongruent trials (p < 0.05). In addition, the reaction times showed a congruency effect
(F(1,11) = 28.67, p < 0.01) (Table 1). Participants took more time to respond to incongruent
stimuli than congruent stimuli, regardless of condition. Post-hoc t-tests revealed that both the
prosodic condition and the phonetic condition showed a congruency effect in reaction time (p <
ERP Results
The subtracted waveforms show clear N400 peaks (in the early window of 250–450 ms
after onset of the auditory target) across all regions of interest (Figures 3-6) for both phonetic and
prosodic conditions. These peaks are followed by a late positivity. A repeated-measures
ANOVA for the N400 response revealed a significant main effect for listening condition
(F(1,11) = 7.95, p < 0.05) with the phonetic condition showing stronger N400 activity than the
prosodic condition. There was also a significant main effect for laterality (left, middle, right)
(F(1,11) = 7.17, p < 0.01). Furthermore, there was a condition × laterality interaction (F(1,11) =
4.33, p < 0.05), indicating that the stronger N400 activity for incongruent stimuli in the phonetic
condition was dependent on lateral location (left, middle, right). There was also a condition ×
site × laterality effect (F(1,11) = 4.04, p < 0.01) for the N400 component. Post-hoc t-tests
revealed that the N400 in the phonetic condition was left dominant in the central and posterior
sites (p < 0.05). No significant laterality effect was found for the N400 in the prosodic condition.
The topographic distribution (Figure 6B) for the N400 component confirms left hemisphere
dominance for the phonetic condition, whereas the distribution appears to be more bilateral for
the prosodic condition.
A repeated-measures ANOVA for the LPR response in the late time window (700–1000
ms after onset of the auditory target) revealed significant effects for electrode site (anterior,
central, posterior) (F(1,11) = 7.49, p < 0.01) and laterality (left, middle, right) (F(1,11) = 19.03,
p < 0.01). In contrast to the N400 component results, there was a significant interaction effect
for condition × site (F(1,11) = 3.89, p < 0.05). Finally, there was a condition × site × laterality
effect (F(1,11) = 3.10, p < 0.05). Post-hoc t-tests showed anterior vs. posterior significant
differences in both hemispheres for both the prosodic and phonetic conditions (p < 0.05). But
there was no significant laterality effect at any of the anterior, central, or posterior sites in either
condition. The topographic distribution (Figure 6) for the late positive response appears
relatively similar between the two conditions, with the exception of a greater frontal negativity in
the phonetic condition. However, there is not a clear hemispheric pattern like that seen for the
N400 component.
Localization Results
Source localization analysis provided a rough estimation of cortical activation patterns
for the N400 and late positive responses in the current experiment (Figure 7). Total activity
waveforms for the phonetic and prosodic conditions are not highly revealing because of the
differential patterns of cortical activation. Regions where the minimum norm estimation analysis
yielded a z score of 4 or greater (p < 0.001) are described below for each condition.
Source localization patterns for the N400 response show strong left hemisphere
lateralization in the phonetic condition, with contributions from the superior temporal and
inferior parietal regions as well as the primary motor cortex. In the prosodic condition, the N400
response shows a pattern of right hemisphere dominance with superior temporal and inferior
parietal region activations.
The LPR response appears to have more distributed regions of activation for the phonetic
condition. These regions generally include the parietal lobe in addition to the primary motor
cortex, with possible contributions from the occipital region. Left hemisphere activations in the
prosodic condition include parietal and occipital regions, while right hemisphere activity
includes occipital, temporal, and inferior parietal regions.
Cortical Rhythm Results
Time-frequency analysis evaluated the contributions of delta (1–4 Hz), theta (4–8 Hz),
beta (12–30 Hz) and gamma (>30 Hz) oscillations to ERP responses (Figure 8). In the phonetic
condition, the lower frequency bands (delta, theta and beta rhythms) contributed to the N400
response and theta rhythm contributed to late positive response (p < 0.01). In the prosodic
condition, the primary contributors to both N400 and LPR responses were beta and gamma
rhythms. Theta band oscillations also showed a significant difference between the congruent and
incongruent trials (p < 0.01) in the late positivity window in the prosodic condition.
Furthermore, there was significant early gamma activity before the onset of the auditory target
for the prosodic condition (p < 0.01).
Congruency effect in behavioral data
Overall, participants were more accurate at identifying congruent face and voice pairs
than incongruent combinations (Table 1). Furthermore, percent correct accuracy was dependent
upon dimensional information. Participants were less accurate at identifying prosodic
incongruency than prosodic congruency while percent correct accuracy for the phonetic
condition was nearly identical for the incongruent and congruent conditions (Figure 2). In a
recent experiment by Chen et al. (2011), participants were more accurate at identifying a
prosodic match compared to a prosodic mismatch when attending to sound intensity deviation in
a sentence. However, the same authors found no effect for accuracy in a congruency detection
task using the same stimuli. Kamiyama et al. (2013) observed high accuracy (above 90%) for
both congruent and incongruent face/music pairs. As different studies tested different
informational dimensions, it is difficult to reach a simple consensus across the studies. Our data
suggest that phonetic judgment accuracy was largely unaffected by whether the visual prime
matched the auditory target, whereas prosodic judgment accuracy was significantly
Results for reaction time demonstrated that participants took longer to respond to
incongruent stimuli than congruent stimuli, irrespective of whether the condition was phonetic or
prosodic. A similar effect has been observed in previous studies (Kamiyama et al., 2013;
Stevens & Zhang, 2014). In the experiment by Kamiyama et al., participants with and without
musical experience judged congruent face-music pairs more quickly than incongruent pairs. In a
cross-language comparison, Stevens & Zhang identified an audiovisual congruency effect
regardless of language background or inclusion of an independent gesture variable.
N400 audiovisual congruency effect
Consistent with our hypotheses, we observed both an N400 response and a late positive
response to incongruent audiovisual stimuli. Our results were in line with the results of previous
studies evaluating responses to various congruent and incongruent stimuli (Kamiyama et al.,
2013; Schirmer & Kotz, 2013; Stevens & Zhang, 2014). In the current experiment, the phonetic
condition elicited a larger N400 component than the prosodic condition. Furthermore, the N400
for the linguistic processing condition showed left hemisphere dominance whereas the N400 for
the prosodic processing did not (see further discussion in the localization results section).
Existing literature evaluating emotional processing spans a wide range of experimental
approaches and identifies multiple ERP components. Among the studies that identified an N400
response, there was a mixture of priming and reverse priming effects (Aguado et al., 2013;
Kamiyama et al., 2013; Paulmann & Pell, 2010). In a facial affective decision task presented in
a priming paradigm, Paulmann and Pell observed both of these effects. They observed a normal
effect for the medium-length prosodic prime of 400 ms (the length of the visual prime in the
current experiment), whereas there was a greater response for congruent compared to
incongruent stimuli when participants were presented with a short prosodic prime (200 ms). As a
possible explanation for the reverse priming effect they observed in their experiment, Aguado et
al. (2013) cite their use of stimuli with complex affective valence. Their design differed from
that of the current experiment in that the facial presentation was followed by the visual
presentation of a word with emotional content rather than an auditory presentation of a non-word
(devoid of emotional content) spoken with different emotional prosody.
Given that McGurk-type fusion might take place in individual subjects with varying
temporal asynchrony of up to 400 ms or more between visual and auditory presentation
(van Wassenhove et al., 2007), could the incongruency data (especially in the phonetic condition)
reflect a similar fusion effect? Our answer is negative, based on evidence from our previous
audiovisual integration study (Zhang et al., 2013) and the literature. For instance, Green and Gerdeman (1995)
investigated the effect of vowel discrepancy on the McGurk effect in consonant-vowel (CV)
segments. Their findings suggest that vowel discrepancy significantly reduces the magnitude of
the fusion effect. In our previous experiment for the vowels, /i/ and /a/, following the McGurk
design using synchronized videos (Zhang et al., 2013), the incongruent audio-visual pairing did
not lead to perceived fusion of an altered vowel identity in our adult listeners. Since our present
study used a static face exhibiting the mouth shape corresponding to a vowel rather than a
consonant, with a 400 ms time lag before the following audio, it is unlikely that there is a fusion-type
effect in the responses. Further studies that manipulate the temporal asynchrony between the video
and audio, as well as the vowel sounds (for instance, both front or both back vowels), can be designed to
explore under what conditions illusory fusion can be elicited for vowels and how the brain
responses differ between the conditions.
Late positive response congruency effect across conditions
Throughout the literature, researchers have posited a variety of explanations for the late
positivity response. While there is not a clear consensus regarding its functional significance, the
late positive response is generally characterized as reflecting increased attention to unexpected
targets. It has been identified in response to incongruency of affective face or word
stimuli (Werheid, Alpay, Jentzsch & Sommer, 2005; Zhang et al., 2010). Our findings in the
current experiment support this explanation; we observed a late positive response to
incongruent stimuli, regardless of condition. In contrast to these results, the late positive
potential (LPP) identified by Aguado et al. (2013) was modulated by affective valence (positive
vs. negative) of the target word, but not by congruency of the visual stimuli. The authors
propose that the LPP reflected an evaluative priming effect.
Because the late positive response was observed following the N400 component, it is
appropriate to entertain the possibility that the late response is in fact a variant of the P300
component, reflecting working memory updating (see Hajcak, Dunning & Foti, 2009; Zhang et
al., 2010). Conscious processing of visual stimuli might also contribute to the late response, as
suggested by Zhang et al. (2010) to account for the lack of a late positivity in response to other
priming paradigms (word-word: Zhang et al., 2006; subliminal affective priming: Li, Zinbarg,
Boehm & Paller, 2008). The late positive response has been observed in experiments where the
participant consciously attended to congruency of emotional or prosodic stimuli (Aguado et al.,
2013; Chen et al., 2011; Kamiyama et al., 2013).
Witteman et al. (2014) observed an effect of the late positive potential (LPP) in an
emotional task, but not the linguistic one. This effect was larger at posterior sites in the left
hemisphere, but proximal sites in the right hemisphere. Our results show that the late positivity
did not vary by condition in isolation, but interaction effects show an effect of condition
dependent on site as well as an effect dependent on site and laterality. Effects for site and
laterality were also observed independently of each other, indicating clear localization of the late
positive response.
Cortical regions for the N400 and LPR responses
In the current experiment, participants were asked to attend to either the phonetic or the
prosodic dimension of the same stimuli. Overall, our results indicate that different brain regions
are recruited for each of these conditions for both the N400 and LPR. This may reflect the
underlying mechanisms of how the individuals selectively tune to one dimension of information
(Rao, Zhang & Miller, 2010). For the N400 component, we observed left hemisphere
lateralization in the phonetic condition but right hemisphere lateralization for the prosodic
condition. In the same conditions, cortical activations for the late positive response were more
broadly distributed. It appears that there was some primary motor cortex involvement for both
the N400 and late positive response. However, this pattern of activation was only seen in the
phonetic condition. We speculate that the motor cortex involvement might partly be due to the
participants indicating their response by pressing a key on a computer keyboard, which was more
consistent in timing across trials for the phonetic condition than for the prosodic condition.
The cortical localization and laterality patterns in the phonetic condition are consistent
with the dual-stream speech perception model (Hickok & Poeppel, 2007). Similarly, the
localization and laterality results in the prosodic condition are also in line with the three-stage
model (Schirmer & Kotz, 2006) with the N400 and LPR responses reflecting higher-order
evaluative processes for emotional prosody. It is important to note that source localization
analysis represents a weak area in event-related potential research (Luck, 2014). Despite its
imprecision, source localization analysis can help to determine which neural networks are
implicated in a given task, particularly when compared to the oscillation rhythm patterns across
the cortical network activations (see discussion in the following section).
Cortical rhythms mediating audiovisual congruency effects
The MNE analysis and time-frequency analysis results revealed cortical network
activation patterns that varied across conditions. Delta, theta, and beta oscillations contributed to
the N400 response in the phonetic condition. The late positive response in the same condition
was modulated only by theta rhythms. In contrast, beta and gamma activity contributed to both
the N400 and LPR in the prosodic condition. There was also a significant difference in theta
activity between incongruent and congruent conditions for the late positive response.
Theta oscillations have been observed in response to the visual presentation of emotional
faces (Balconi & Pozzoli, 2009; Knyazev et al., 2009). Although both conditions involved the
visual presentation of happy or angry faces, theta activity was only observed for both ERP
responses in the phonetic condition and not for the prosodic condition. There was also increased
theta activity for the difference between congruent and incongruent face/voice pairs in the late
positive response.
Increased gamma activity is often seen in response to incongruent audiovisual
information (for a review, see Senkowski, Schneider, Foxe & Engel, 2008). While cortical
rhythms in the gamma frequency range were observed for incongruency in the prosodic
condition, no such activity was present for incongruency in the phonetic condition in our study.
A similar rest-state gamma rhythm, which was argued to reflect predictive coding of the
following auditory target based on the visual information preceding the sound, was previously
reported in an experiment investigating the McGurk effect (Keil et al., 2012). Güntekin and
Basar (2007) identified an increase in beta activity in response to angry compared to happy facial
stimuli, but no such effect between emotions was observed in the current investigation.
These distinct patterns support our interpretation that different cortical regions are
involved in mediating selective attention to phonetic and prosodic processes. How exactly these
processes are related, however, remains unclear. Specific cortical sites may contribute to each
ERP response, with different cortical rhythm patterns arising from these neural networks.
Conversely, distinct cortical rhythms may be at play, leading to the recruitment of localized
neural networks.
In addition to these cortical rhythm patterns for the N400 and LPR, time-frequency
analysis yielded significant gamma activity preceding the onset of the auditory target in the
prosodic condition. A similar predictive coding response has been observed during the resting
state in experiments investigating the McGurk effect (Keil et al., 2012). This finding highlights
the utility of time-frequency analysis in ERP research. In this case, time-frequency analysis
revealed an apparent difference that was not visible in the waveform.
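For readers less familiar with the measure, the inter-trial phase coherence (ITPC) underlying this kind of time-frequency result can be sketched as follows. This is a minimal illustration on synthetic data, not the analysis pipeline used in this study: the complex Morlet wavelet, the function names (`morlet_phase`, `itpc`), the sampling rate, the 6 Hz test frequency, and the trial count are all assumptions chosen for the example. ITPC is the length of the trial-averaged unit phase vector, so it approaches 1 when phase is consistent across trials and falls toward 0 when phase is random.

```python
import numpy as np

def morlet_phase(trials, sfreq, freq, n_cycles=5.0):
    """Return the instantaneous phase of each trial at a single frequency.

    trials: array of shape (n_trials, n_samples) for one channel or source.
    The phase is taken from a convolution with a complex Morlet wavelet.
    """
    sigma_t = n_cycles / (2.0 * np.pi * freq)  # Gaussian width in seconds
    t = np.arange(-4.0 * sigma_t, 4.0 * sigma_t, 1.0 / sfreq)
    wavelet = np.exp(2j * np.pi * freq * t) * np.exp(-t**2 / (2.0 * sigma_t**2))
    if wavelet.size > trials.shape[1]:
        raise ValueError("epoch is shorter than the wavelet")
    analytic = np.array(
        [np.convolve(trial, wavelet, mode="same") for trial in trials]
    )
    return np.angle(analytic)

def itpc(phases):
    """Inter-trial phase coherence per sample, bounded between 0 and 1.

    phases: array of shape (n_trials, n_samples), in radians.
    """
    return np.abs(np.mean(np.exp(1j * phases), axis=0))

# Synthetic demonstration (hypothetical values, not data from the study).
rng = np.random.default_rng(0)
sfreq, freq, n_trials = 250.0, 6.0, 200
time = np.arange(0.0, 2.0, 1.0 / sfreq)

# Phase-locked trials: the same 6 Hz phase on every trial, plus noise.
locked = np.sin(2 * np.pi * freq * time) \
    + 0.5 * rng.standard_normal((n_trials, time.size))

# Non-phase-locked trials: a random starting phase on every trial.
jitter = rng.uniform(0.0, 2 * np.pi, size=(n_trials, 1))
unlocked = np.sin(2 * np.pi * freq * time + jitter) \
    + 0.5 * rng.standard_normal((n_trials, time.size))

mid = time.size // 2  # evaluate away from the epoch edges
print("ITPC, phase-locked:", itpc(morlet_phase(locked, sfreq, freq))[mid])
print("ITPC, random phase:", itpc(morlet_phase(unlocked, sfreq, freq))[mid])
```

Because ITPC discards amplitude, two conditions with similar averaged waveforms can still differ in phase consistency, which is one way a time-frequency measure can expose a difference that the waveform does not show.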
Limitations and future directions
One concern with our experimental design is the amount of lag time between the visual
and auditory presentations. When Paulmann and Pell (2010) presented visual primes of different
lengths, the resulting ERP patterns differed. While our results were consistent with the results of
Paulmann and Pell’s medium-length prime, we did not investigate the possible differences
resulting from visual primes of different lengths.
Another factor to consider is the relationship between the selected vowel sounds and
emotions. The /i/ vowel is produced with spread lips and lends itself more easily to a happy
facial expression, while the /a/ vowel is taller and corresponds more closely to a large, angry
facial expression. To combine the vowel sound and emotions that are inherently in conflict with
each other (i.e. angry + /i/ and happy + /a/) required that the expression of happiness or anger be
clearly displayed without sacrificing the appropriate mouth shape for the vowel. The male
speaker in the photographs compensated for this limitation by manipulating his brow to express
happiness or anger.
As discussed earlier, many studies on cross-modal recognition of emotional prosody or
facial expression combined voices with static pictures of faces. The use of static facial images
and the priming task clearly does not simulate real-life audiovisual experience, which
exhibits temporal synchrony of dynamic facial motion and voice in a natural environment. Given
that natural facial motion enhances cortical selectivity and responses for face processing and
cross-modal speech processing (Klasen et al., 2012; Pitcher et al., 2011; Riedel et al., 2015;
Schelinski et al., 2014; von Kriegstein et al., 2008), it remains to be tested whether the use of
dynamic facial stimuli would have differential enhancement effects in the phonetic vs. prosodic
conditions.
Our experiment was designed in a way that required behavioral responses from the
participants. It is worth considering whether a format that does not rely on an overt response
might be effective and more suitable for testing individuals whose behavioral responses are
limited because of impaired language or cognition. For example, the mismatch negativity
(MMN) paradigm is a passive listening task, which would be more user-friendly to a wider range
of participants. MMN experiments investigating the pre-attentive responses to emotional prosody
have identified differences in the ERP responses of men and women (Schirmer & Kotz, 2006;
Fan, Tsu & Cheng, 2013). These and other experiments suggest that emotional prosody
processing is dependent on gender (Schirmer, Kotz & Friederici, 2002; Schirmer, Kotz &
Friederici, 2005). These differences have also been documented in gender comparisons of
responses to facial expressions (Güntekin & Basar, 2007). Future research could further
evaluate whether the ERP waveform, localization, and cortical rhythm findings presented here
would differ between men and women.
Conventional ERP research provides excellent time resolution but comparatively poor
localization information. Minimum norm estimation can identify which broad regions at the
cortical level are implicated in certain processes, but its resolution cannot rival that of more
precise imaging techniques like fMRI. Nevertheless, the MNE localization results taken in
tandem with the topographical potential distribution data and cortical rhythm activities can help
elucidate the neural dynamics underlying phonetic and prosodic processing. As we only used one
male speaker for the spoken word stimuli, further research would be needed to address how
speaker identity information may interfere with or facilitate the processing of phonetic vs.
emotional prosodic information and determine what neural mechanisms are responsible for the
important socio-indexical information of speaker identity (von Kriegstein et al., 2010).
Time-frequency analysis is still a relatively novel area of ERP research, but it is one that
warrants further investigation. The dynamic information it provides beyond the level of the
waveform is analogous to spectral analysis of speech, which can provide highly valuable
information about the neural dynamics underlying the processing of the multidimensional speech
signal (Zhang, 2008). In typical development, children appear to implicitly and effortlessly tease
apart the many informational dimensions that make up speech. These dimensions include
linguistic cues, like phonetic and semantic information, as well as non-linguistic cues, such as
emotion and speaker identity. However, research suggests that some clinical populations may
exhibit difficulty with sorting these multiple sources of information (Gebauer et al., 2014; Marshall et
al., 2009; Paul et al., 2005; Wang & Tsao, 2015), which would lead to developmental delays or
disorders in language learning.
The current investigation provides a baseline from normal adults for comparison to
clinical populations, such as individuals with autism, language impairment, or aphasia. Many
similar studies have utilized behavioral measures and event-related potentials or other functional
imaging methods to investigate brain mechanisms for emotional prosody processing (see Belin,
Fecteau, & Bédard, 2004; Klasen, Chen, & Mathiak, 2012; Schirmer & Kotz, 2006 for reviews).
Our study has two novel features. First, the same set of words was used as targets for phonetic
or prosodic congruency judgment depending on the visual prime, which eliminated potential
confounds due to acoustic differences in the two conditions. Previous studies did not make such
a direct comparison of the linguistic vs. paralinguistic congruency effects in the same set of
target spoken words. Take Paulmann and Pell (2010) for example. Their study used face targets
preceded by prosodic primes, and their protocol thus relied on a “facial affective decision task.”
Second, in addition to the temporal components of N400 and LPR, we integrated source
localization and time-frequency analysis to verify different neural networks for linguistic and
paralinguistic processing and identify differences in cortical rhythms associated with audiovisual
congruence for the two experimental conditions. In neurophysiological studies on speech
processing with time-frequency analysis (Giraud & Poeppel, 2012), relatively few have
incorporated cross-modal presentation with single-word stimuli containing both phonetic and
prosodic contrasts. Future research should investigate how individuals with pathological
conditions integrate audiovisual speech information and selectively attend to a single dimension
of information while ignoring the others (Schelinski et al., 2014). If we can identify the neural
markers that are tied to the clinical manifestation of language impairment, electrophysiological
measures like those utilized in the current investigation may provide more objective diagnostic
criteria for clinical practice. However, further research efforts are necessary to establish a robust
system that allows identification of reliable neural markers at the individual level.
Acknowledgments
This research project was supported in part by the University of Minnesota’s
Undergraduate Research Opportunity Program (UROP), the Bryng Bryngelson Research Fund,
and the Brain Imaging Research Project Award from the College of Liberal Arts. Portions of the
work were written during the corresponding author’s visiting professorship at Shanghai Jiao
Tong University. We would like to thank Drs. Sharon Miller, Tess Koerner, Aparna Rao,
Benjamin Munson, and Edward Carney as well as two anonymous reviewers for their
suggestions and help.
Figure legends
Figure 1. Visual schematic of the affective cross-modal priming protocol. The visual prime was
presented for 400 ms before the onset of the auditory target, whose duration was 295 ms. The
baseline for ERP epochs was assessed relative to the onset of the visual prime.
Figure 2. Behavioral data, percent correct accuracy (A) and reaction times (B), for the phonetic
and prosodic conditions. The error bars represent standard error for each category.
Figure 3. Electrode grouping. Electrode channels were grouped into nine regions by laterality
and site for statistical analysis: left anterior (LA, including F7, F5, F3, FT7, FC5, and FC3), left
central (LC, including T7, C5, C3, TP7, CP5, and CP3), left posterior (LP, including P7, P5, P3,
PO7, PO3, and O1), middle anterior (MA, including F1, FZ, F2, FCZ, FC1, and FC2), middle
central (MC, including C1, CZ, C2, CP1, CP2, and CPZ), middle posterior (MP, including P1,
PZ, P2, POZ, and OZ), right anterior (RA, including F8, F6, F4, FT8, FC6, and FC4), right
central (RC, including T8, C6, C4, TP8, CP6, and CP4), and right posterior (RP, including P8,
P6, P4, PO8, PO4, and O2).
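The grouping in Figure 3 can be written out as a simple lookup table, which also makes the two statistical factors (laterality and site) explicit. The sketch below is illustrative only: the names `ELECTRODE_REGIONS`, `region_factors`, and `region_means` are hypothetical, and the last middle-anterior channel is taken here to be FC2, an assumption based on symmetry with FC1.

```python
# Region code -> channel labels, transcribed from the Figure 3 legend.
ELECTRODE_REGIONS = {
    "LA": ["F7", "F5", "F3", "FT7", "FC5", "FC3"],
    "LC": ["T7", "C5", "C3", "TP7", "CP5", "CP3"],
    "LP": ["P7", "P5", "P3", "PO7", "PO3", "O1"],
    "MA": ["F1", "FZ", "F2", "FCZ", "FC1", "FC2"],  # FC2 is assumed
    "MC": ["C1", "CZ", "C2", "CP1", "CP2", "CPZ"],
    "MP": ["P1", "PZ", "P2", "POZ", "OZ"],
    "RA": ["F8", "F6", "F4", "FT8", "FC6", "FC4"],
    "RC": ["T8", "C6", "C4", "TP8", "CP6", "CP4"],
    "RP": ["P8", "P6", "P4", "PO8", "PO4", "O2"],
}

LATERALITY = {"L": "left", "M": "middle", "R": "right"}
SITE = {"A": "anterior", "C": "central", "P": "posterior"}

def region_factors(region):
    """Split a region code such as 'LP' into (laterality, site)."""
    return LATERALITY[region[0]], SITE[region[1]]

def region_means(channel_amplitudes):
    """Average per-channel amplitudes within each of the nine regions.

    channel_amplitudes: dict mapping channel label -> amplitude (e.g. in
    microvolts) for one participant, condition, and time window.
    """
    return {
        region: sum(channel_amplitudes[ch] for ch in channels) / len(channels)
        for region, channels in ELECTRODE_REGIONS.items()
    }

# Example with made-up amplitudes: every channel set to 1.0.
all_channels = [ch for chans in ELECTRODE_REGIONS.values() for ch in chans]
example = region_means({ch: 1.0 for ch in all_channels})
print(region_factors("LP"), example["LP"])
```

Averaging within regions before the statistical analysis reduces the channel dimension to a 3 (laterality) by 3 (site) layout.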
Figure 4. ERP average waveforms comparing responses to congruent and incongruent stimuli in
the phonetic condition. In the waveform plot for the LP (left posterior) electrode region, the dark
gray bar indicates the length of the visual prime and the light gray bar indicates the duration of
the auditory target. The baseline for ERP epochs was assessed prior to the onset of the visual
prime. The N400 and late positive response (LPR) latencies were assessed from onset of the
auditory target at 400 ms in the epoch. These components are denoted with arrows on the MP
(middle posterior) waveform plot.
Figure 5. ERP average waveforms comparing responses to congruent and incongruent stimuli in
the emotional prosodic condition. The same plotting convention was used as in Figure 4.
Figure 6. Grand mean ERP difference (incongruent - congruent) waveforms and scalp
topography. In the average waveforms (A), the gray bar highlights the N400 component. Scalp
topography of the N400 and LPR for the two listening conditions is shown in (B).
Figure 7. Minimum norm estimation results. (A) Boundary Element Method (BEM) head model.
(B) total activity waveforms for phonetic and prosodic conditions in the left and right
hemispheres. (C) Minimum norm estimation (MNE) activity for the N400 response. (D) MNE
activity for the late positive response.
Figure 8. Inter-trial phase coherence results showing cortical oscillations for incongruent and
congruent trials and their differences (Incongruent vs. Congruent) with significant data points
plotted (p < 0.01) in phonetic and prosodic conditions.
References
Aguado, L., Dieguez-Risco, T., Méndez-Bértolo, C., Pozo, M. A., & Hinojosa, J. A. (2013).
Priming effects on the N400 in the affective priming paradigm with facial expressions of
emotion. Cognitive, Affective, & Behavioral Neuroscience, 13(2), 284-296.
Balconi, M., & Pozzoli, U. (2009). Arousal effect on emotional face comprehension: Frequency
band changes in different time intervals. Physiology & Behavior, 97(3), 455-462.
Belin, P., Fecteau, S., & Bédard, C. (2004). Thinking the voice: neural correlates of voice
perception. Trends in Cognitive Sciences, 8(3), 129-135.
Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and
powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B
(Methodological), 57, 289–300.
Blasi, A., Mercure, E., Lloyd-Fox, S., Thomson, A., Brammer, M., Sauter, D., . . . Murphy,
D. G. M. (2011). Early specialization for voice and emotion
processing in the infant brain. Current Biology, 21(14), 1220-1224.
Boersma, P. & Weenink, D. (2015). Praat: doing phonetics by computer [Computer program].
Version 6.0, retrieved from
Buchanan, T. W., Lutz, K., Mirzazade, S., Specht, K., Shah, N. J., Zilles, K., & Jäncke, L.
(2000). Recognition of emotional prosody and verbal components of spoken language: an
fMRI study. Cognitive Brain Research, 9(3), 227-238.
Chen, X., Zhao, L., Jiang, A., & Yang, Y. (2011). Event-related potential correlates of the
expectancy violation effect during emotional prosody processing. Biological
Psychology, 86(3), 158-167.
Czerwon, B., Hohlfeld, A., Wiese, H., & Werheid, K. (2013). Syntactic structural parallelisms
influence processing of positive stimuli: Evidence from cross-modal ERP
priming. International Journal of Psychophysiology, 87(1), 28-34.
Delorme, A., & Makeig, S. (2004). EEGLAB: an open source toolbox for analysis of single-trial
EEG dynamics. Journal of Neuroscience Methods, 134, 9–21.
Donchin, E., & Coles, M. G. (1988). Is the P300 component a manifestation of context
updating?. Behavioral and Brain Sciences, 11(03), 357-374.
Fazio, R. H. (2001). On the automatic activation of associated evaluations: An
overview. Cognition & Emotion, 15(2), 115-141.
Fazio, R.H., Sanbonmatsu, D.M., Powell, M.C., & Kardes, F.R. (1986). On the automatic
activation of attitudes. Journal of Personality and Social Psychology, 50, 229–238.
Gebauer, L., Skewes, J., Hørlyck,L., & Vuust, P. (2014). Atypical perception of affective
prosody in Autism Spectrum Disorder. NeuroImage: Clinical, 6, 370-378.
Giraud, A.-L., & Poeppel, D. (2012). Cortical oscillations and speech processing:
emerging computational principles and operations. Nature Neuroscience, 15(4), 511-
Grandjean, D., Sander, D., Pourtois, G., Schwartz, S., Seghier, M. L., Scherer, K. R., &
Vuilleumier, P. (2005). The voices of wrath: brain responses to angry prosody in
meaningless speech. Nature Neuroscience, 8(2), 145-146.
Greene, A. J., Easton, R. D., & LaShell, L. S. (2001). Visual–auditory events: cross-modal
perceptual priming and recognition memory. Consciousness and Cognition, 10(3), 425-
Green, K. P. & Gerdeman, A. (1995). Cross-modal discrepancies in coarticulation and the
integration of speech information: the McGurk effect with mismatched vowels. Journal
of Experimental Psychology: Human Perception and Performance, 21, 1409-1426.
Grimshaw, G. M., Kwasny, K. M., Covell, E., & Johnson, R. A. (2003). The dynamic nature of
language lateralization: effects of lexical and prosodic factors. Neuropsychologia, 41(8),
Goerlich, K. S., Witteman, J., Schiller, N. O., Van Heuven, V. J., Aleman, A., & Martens, S.
(2012). The nature of affective priming in music and speech. Journal of Cognitive
Neuroscience, 24(8), 1725-1741.
Güntekin, B., & Basar, E. (2007). Emotional face expressions are differentiated with brain
oscillations. International Journal of Psychophysiology, 64(1), 91-100.
Güntekin, B., & Tülay, E. (2014). Event related beta and gamma oscillatory responses during
perception of affective pictures. Brain Research, 1577, 45-56.
Hajcak, G., Dunning, J. P., & Foti, D. (2009). Motivated and controlled attention to emotion:
time-course of the late positive potential. Clinical Neurophysiology, 120(3), 505-510.
Hald, L. A., Bastiaansen, M. C., & Hagoort, P. (2006). EEG theta and gamma responses to
semantic violations in online sentence processing. Brain and Language, 96(1), 90-105.
Haxby, J. V., Hoffman, E. A., & Gobbini, M. I. (2000). The distributed human neural system for
face perception. Trends in Cognitive Sciences, 4(6), 223-233.
Hickok, G., & Poeppel, D. (2007). The cortical organization of speech processing. Nature
Reviews Neuroscience, 8(5), 393-402.
Iredale, J. M., Rushby, J. A., McDonald, S., Dimoska-Di Marco, A., & Swift, J. (2013). Emotion
in voice matters: Neural correlates of emotional prosody perception. International
Journal of Psychophysiology, 89(3), 483-490.
Izdebski, K. (Ed.) (2008). Emotions in the Human Voice, Vols. 1-3. San Diego, CA: Plural
Publishing.
Kamiyama, K. S., Abla, D., Iwanaga, K., & Okanoya, K. (2013). Interaction
between musical emotion and facial expression as measured by event-related
potentials. Neuropsychologia, 51(3), 500-505.
Knyazev, G. G., Slobodskoj-Plusnin, J. Y., & Bocharov, A. V. (2009). Event-related delta and
theta synchronization during explicit and implicit emotion
processing. Neuroscience, 164(4), 1588-1600.
Klasen, M., Chen, Y.-H., & Mathiak, K. (2012). Multisensory emotions: perception, combination
and underlying neural processes. Reviews in the Neurosciences, 23(4), 381.
Koerner, T., & Zhang, Y. (2015). Effects of background noise on inter-trial phase coherence and
auditory N1-P2 responses to speech stimulus. Hearing Research, 328, 113-119.
Kotz, S. A., Kalberlah, C., Bahlmann, J., Friederici, A. D., & Haynes, J. D. (2013). Predicting
vocal emotion expressions from the human brain. Human Brain Mapping, 34(8), 1971-
Kotz, S. A., & Paulmann, S. (2007). When emotional prosody and semantics dance cheek to
cheek: ERP evidence. Brain Research, 1151, 107-118.
Kotz, S. A., & Paulmann, S. (2011). Emotion, language, and the brain. Language and Linguistics
Compass, 5(3), 108-125.
Kutas, M., & Federmeier, K. D. (2011). Thirty years and counting: finding meaning in the N400
component of the event-related brain potential (ERP). Annual Review of Psychology, 62,
Kutas, M., & Hillyard, S. A. (1980). Reading senseless sentences: Brain potentials reflect
semantic incongruity. Science, 207(4427), 203-205.
Lense, M. D., Gordon, R. L., Key, A. P., & Dykens, E. M. (2014). Neural correlates of cross-
modal affective priming by music in Williams syndrome. Social Cognitive and Affective
Neuroscience, 9(4), 529-537.
Li, W., Zinbarg, R. E., Boehm, S. G., & Paller, K. A. (2008). Neural and behavioral evidence for
affective priming from unconsciously perceived emotional facial expressions and the
influence of trait anxiety. Journal of Cognitive Neuroscience, 20(1), 95-107.
Luck, S. J. (2014). An Introduction to the Event-Related Potential Technique (2nd ed.).
Cambridge, Massachusetts: The MIT Press.
Marshall, C.R., Harcourt-Brown, S., Ramus, F., & van der Lely, H. K. J. (2009). The link
between prosody and language skills in children with specific language impairment (SLI)
and/or dyslexia. International Journal of Language and Communicative Disorders, 44
(4), 466-488.
Patel, S., Scherer, K. R., Björkner, E., & Sundberg, J. (2011). Mapping emotions into acoustic
space: the role of voice production. Biological Psychology, 87, 93–98.
Paul, R., Augustyn, A., Klin, A., & Volkmar, F. R. (2005). Perception and production of
prosody by speakers with autism spectrum disorders. Journal of Autism and
Developmental Disorders, 35(2), 205-220.
Paulmann, S., Bleichner, M., & Kotz, S. A. (2013). Valence, arousal, and task effects in
emotional prosody processing. Frontiers in Psychology, 4.
Paulmann, S., Jessen, S., & Kotz, S. A. (2012). It's special the way you say it: An ERP
investigation on the temporal dynamics of two types of prosody.
Neuropsychologia, 50(7), 1609-1620.
Paulmann, S., & Pell, M. D. (2010). Contextual influences of emotional speech prosody on face
processing: how much is enough?. Cognitive, Affective, & Behavioral
Neuroscience, 10(2), 230-242.
Pell, M. D. (2006). Cerebral mechanisms for understanding emotional prosody in speech. Brain
and Language, 96(2), 221-234.
Pitcher, D., Dilks, D. D., Saxe, R. R., Triantafyllou, C., & Kanwisher, N. (2011). Differential
selectivity for dynamic versus static information in face-selective cortical regions.
Neuroimage, 56(4), 2356-2363.
Rao, A., Zhang, Y., & Miller, S. (2010). Selective listening of concurrent auditory stimuli: an
event-related potential study. Hearing Research, 268(1), 123-132.
Riedel, P., Ragert, P., Schelinski, S., Kiebel, S. J., & von Kriegstein, K. (2015). Visual face-
movement sensitive cortex is relevant for auditory-only speech recognition. Cortex, 68,
Scheeringa, R., Petersson, K. M., Oostenveld, R., Norris, D. G., Hagoort, P., & Bastiaansen, M.
C. (2009). Trial-by-trial coupling between EEG and BOLD identifies networks related to
alpha and theta EEG power increases during working memory
maintenance. Neuroimage, 44(3), 1224-1238.
Schelinski, S., Riedel, P., & von Kriegstein, K. (2014). Visual abilities are important for
auditory-only speech recognition: Evidence from autism spectrum disorder.
Neuropsychologia, 65, 1-11.
Schirmer, A., Kotz, S. A., & Friederici, A. D. (2002). Sex differentiates the role of emotional
prosody during word processing. Cognitive Brain Research, 14(2), 228-233.
Schirmer, A., Kotz, S. A., & Friederici, A. D. (2005). On the role of attention for the processing
of emotions in speech: Sex differences revisited. Cognitive Brain Research, 24(3), 442-
Schirmer, A., & Kotz, S. A. (2003). ERP evidence for a sex-specific Stroop effect in emotional
speech. Journal of Cognitive Neuroscience, 15(8), 1135-1148.
Schirmer, A., & Kotz, S. A. (2006). Beyond the right hemisphere: brain mechanisms mediating
vocal emotional processing. Trends in Cognitive Sciences, 10(1), 24-30.
Schneider, T. R., Debener, S., Oostenveld, R., & Engel, A. K. (2008). Enhanced EEG gamma-
band activity reflects multisensory semantic matching in visual-to-auditory object
priming. Neuroimage, 42(3), 1244-1254.
Stevens, J., & Zhang, Y. (2014). Brain mechanisms for processing co-speech gesture: A cross-
language study of spatial demonstratives. Journal of Neurolinguistics, 30, 27-47.
Tse, C. Y., Tien, K. R., & Penney, T. B. (2006). Event-related optical imaging reveals the
temporal dynamics of right temporal and frontal cortex activation in pre-attentive change
detection. Neuroimage, 29(1), 314-320.
von Kriegstein, K., Dogan, O., Gruter, M., Giraud, A. L., Kell, C. A., Gruter, T., . . . Kiebel, S. J.
(2008). Simulation of talking faces in the human brain improves auditory speech
recognition. Proceedings of the National Academy of Sciences of the United States of
America, 105(18), 6747-6752.
von Kriegstein, K., Smith, D. R. R., Patterson, R. D., Kiebel, S. J., & Griffiths, T. D. (2010).
How the human brain recognizes speech in the context of changing speakers. The Journal
of Neuroscience, 30(2), 629-638.
Wang, J. E., & Tsao, F. M. (2015). Emotional prosody perception and its association with
pragmatic language in school-aged children with high-function autism. Research in
Developmental Disabilities, 37, 162-170.
Weiss, S., & Mueller, H. M. (2003). The contribution of EEG coherence to the
investigation of language. Brain and Language, 85(2), 325-343.
Werheid, K., Alpay, G., Jentzsch, I., & Sommer, W. (2005). Priming emotional facial
expressions as evidenced by event-related brain potentials. International Journal of
Psychophysiology, 55(2), 209-219.
Wildgruber, D., Riecker, A., Hertrich, I., Erb, M., Grodd, W., Ethofer, T., & Ackermann, H.
(2005). Identification of emotional intonation evaluated by fMRI. Neuroimage, 24(4),
Witteman, J., Goerlich-Dobre, K. S., Martens, S., Aleman, A., Van Heuven, V. J., & Schiller, N.
O. (2014). The nature of hemispheric specialization for prosody perception. Cognitive,
Affective, & Behavioral Neuroscience, 1-11.
Wittfoth, M., Schroder, C., Schardt, D. M., Dengler, R., Heinze, H. J., & Kotz, S. A. (2010). On
emotional conflict: interference resolution of happy and angry prosody reveals valence-
specific effects. Cerebral Cortex, 20(2), 383-392.
Zhang, Q., Lawson, A., Guo, C., & Jiang, Y. (2006). Electrophysiological correlates of visual
affective priming. Brain Research Bulletin, 71(1), 316-323.
Zhang, Q., Li, X., Gold, B. T., & Jiang, Y. (2010). Neural correlates of cross-domain affective
priming. Brain Research, 1329, 142-151.
Zhang, Y. (2008). Plurality and plasticity of neural representation for speech sounds. The
Journal of the Acoustical Society of America, 123, 3880.
Zhang, Y., Cheng, B., Koerner, T., Cao, C., Carney, E., & Wang, Y. (2013). Cortical processing
of audiovisual speech perception in infancy and adulthood. The Journal of the Acoustical
Society of America, 134, 4234.
Zhang, Y., Koerner, T., Miller, S., Grice-Patil, Z., Svec, A., Akbari, D., Tusler, L., & Carney, E.
(2011). Neural coding of formant-exaggerated speech in the infant brain. Developmental
Science, 14(3), 566-581.
Table 1. Behavioral results of percent correct responses and reaction time (mean ± standard
deviation) in the phonetic and prosodic conditions. Significant congruency effects were observed
in percent correct data for the prosodic condition (p < 0.05), and in the reaction time data for
phonetic (p < 0.05) and prosodic (p < 0.01) conditions.
Condition   Percent Correct (%)            Reaction Time (ms)
            Congruent     Incongruent      Congruent       Incongruent
Phonetic    88.5 ± 2.9    88.1 ± 3.2       709.1 ± 23.2    753.2 ± 24.6
Prosodic    89.5 ± 2.5    78.6 ± 4.5       695.4 ± 25.9    764.9 ± 18.6
Figure 1. Visual schematic of the affective cross-modal priming protocol. The visual prime was
presented for 400 ms before the onset of the auditory target, whose duration was 295 ms. The
baseline for ERP epochs was assessed relative to the onset of the visual prime.
Figure 2. Behavioral data, percent correct accuracy (A) and reaction times (B), for the phonetic
and prosodic conditions. The error bars represent standard error for each category.
Figure 3. Electrode grouping. Electrode channels were grouped into nine regions by laterality
and site for statistical analysis: left anterior (LA, including F7, F5, F3, FT7, FC5, and FC3), left
central (LC, including T7, C5, C3, TP7, CP5, and CP3), left posterior (LP, including P7, P5, P3,
PO7, PO3, and O1), middle anterior (MA, including F1, FZ, F2, FCZ, FC1, and FC2), middle
central (MC, including C1, CZ, C2, CP1, CP2, and CPZ), middle posterior (MP, including P1,
PZ, P2, POZ, and OZ), right anterior (RA, including F8, F6, F4, FT8, FC6, and FC4), right
central (RC, including T8, C6, C4, TP8, CP6, and CP4), and right posterior (RP, including P8,
P6, P4, PO8, PO4, and O2).
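The grouping in Figure 3 can be written down as a simple lookup table. The Python sketch below is illustrative only: the caption states the grouping but not how channels were combined, so the sample-by-sample averaging in `region_average` is an assumption, and the names `REGIONS` and `region_average` are hypothetical. The sixth MA channel is taken to be FC2, the right-hand counterpart of FC1.

```python
from statistics import fmean

# Region -> channel labels, transcribed from the Figure 3 caption.
# MA's sixth channel is taken to be FC2 (the counterpart of FC1).
REGIONS = {
    "LA": ["F7", "F5", "F3", "FT7", "FC5", "FC3"],
    "LC": ["T7", "C5", "C3", "TP7", "CP5", "CP3"],
    "LP": ["P7", "P5", "P3", "PO7", "PO3", "O1"],
    "MA": ["F1", "FZ", "F2", "FCZ", "FC1", "FC2"],
    "MC": ["C1", "CZ", "C2", "CP1", "CP2", "CPZ"],
    "MP": ["P1", "PZ", "P2", "POZ", "OZ"],
    "RA": ["F8", "F6", "F4", "FT8", "FC6", "FC4"],
    "RC": ["T8", "C6", "C4", "TP8", "CP6", "CP4"],
    "RP": ["P8", "P6", "P4", "PO8", "PO4", "O2"],
}


def region_average(channel_data):
    """Average per-channel waveforms within each region, sample by sample.

    channel_data maps a channel label to a list of amplitudes; all lists
    are expected to have the same length. Channels missing from the input
    are skipped, and regions with no available channel are omitted.
    Averaging is an assumed way of combining channels, not a confirmed one.
    """
    averaged = {}
    for region, channels in REGIONS.items():
        waves = [channel_data[ch] for ch in channels if ch in channel_data]
        if waves:
            averaged[region] = [fmean(samples) for samples in zip(*waves)]
    return averaged
```

Passing a dictionary of per-channel waveforms returns one waveform per region, which could then feed a region-level statistical comparison.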
Figure 4. ERP average waveforms comparing responses to congruent and incongruent stimuli in
the phonetic condition. In the waveform plot for the LP (left posterior) electrode region, the dark
gray bar indicates the length of the visual prime and the light gray bar indicates the duration of
the auditory target. The baseline for ERP epochs was assessed prior to the onset of the visual
prime. The N400 and late positive response (LPR) latencies were measured from the onset of the
auditory target, which occurred at 400 ms in the epoch. These components are marked with arrows on the MP
(middle posterior) waveform plot.
Figure 5. ERP average waveforms comparing responses to congruent and incongruent stimuli in
the emotional prosodic condition. The same plotting convention was used as in Figure 4.
Figure 6. Grand mean ERP difference (incongruent - congruent) waveforms and scalp
topography. In the average waveforms (A), the gray bar highlights the N400 component. Scalp
topography of the N400 and LPR for the two listening conditions is shown in (B).
Figure 7. Minimum norm estimation results. (A) Boundary Element Method (BEM) head model.
(B) total activity waveforms for phonetic and prosodic conditions in the left and right
hemispheres. (C) Minimum norm estimation (MNE) activity for the N400 response. (D) MNE
activity for the late positive response.
Figure 8. Inter-trial phase coherence results showing cortical oscillations for incongruent and
congruent trials and their differences (Incongruent vs. Congruent) with significant data points
plotted (p < 0.01) in phonetic and prosodic conditions.
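For readers unfamiliar with the measure in Figure 8, inter-trial phase coherence at a single time-frequency point is commonly defined as the length of the mean unit phase vector across trials. The Python sketch below illustrates that standard definition only. It assumes the per-trial phase angles have already been extracted by some time-frequency decomposition, which this caption does not specify, and the function name `itpc` is hypothetical.

```python
import cmath
import math


def itpc(phases):
    """Inter-trial phase coherence for one time-frequency point.

    phases: iterable of per-trial phase angles in radians.
    Returns the length of the mean unit phase vector: close to 1 when
    phases align across trials, close to 0 when they cancel out.
    """
    phases = list(phases)
    if not phases:
        raise ValueError("need at least one trial")
    # Map each phase to a unit vector on the complex plane, then average.
    mean_vector = sum(cmath.exp(1j * p) for p in phases) / len(phases)
    return abs(mean_vector)


# Illustrative inputs, not data from the study.
aligned = itpc([0.3] * 10)       # identical phases on every trial
opposed = itpc([0.0, math.pi])   # two trials in exact anti-phase
```

Values near 1 mean the phase is consistent across trials; values near 0 mean the phases cancel. A congruency contrast such as the one plotted would compare this quantity between incongruent and congruent trials at each time-frequency point.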
... The activation of these brain areas is significantly related to the attention function enhancement of mindfulness meditators [42]. Brown et al. (2012) [32] used electroencephalography (EEG) to evaluate the relationship between the late positive component (LPC), also referred to as the late positive response (LPR) [43] or late positive potential (LPP) [44] and trait mindfulness among college students. The results showed that a reduction in LPC amplitude was significantly associated with high trait mindfulness. ...
... The facial recognition and emotional arousal tasks ( Figure 1) were revised by the cross-modal affective priming paradigm [43,[46][47][48]50] and were used to measure and evaluate the differences between-and within-groups in the MAEP. The priming stimuli were complete Chinese classical folk instrumental works with three emotional levels (calm, happy, and sad); target stimuli consisted of a combination of 40 pictures of the same faces with three different emotion levels (calm, happy and sad). ...
... Based on previous studies and across-modal emotion processing of the topographical distribution of the grand-averaged ERP activities (see Liu et al., 2021b [27]), their time epochs were selected for analysis: two early time windows from 250 to 500 ms (N400 component [27,46,47,60]) and from 300 to 600 ms (P3 component, [27,54]), and a late time window from 600 to 1000 ms (late positive component: LPC, [27,43,46]). These ERP component latencies were assessed relative to the onset of the auditory stimulus, which included three levels of musical emotion (calm, happy, and sad music). ...
Full-text available
This study explored the behavioral and neural correlates of mindfulness meditation improvement in musical aesthetic emotion processing (MAEP) in young adults, using the revised across-modal priming paradigm. Sixty-two participants were selected from 652 college students who assessed their mindfulness traits using the Mindful Attention Awareness Scale (MAAS). According to the 27% ratio of the high and low total scores, participants were divided into two subgroups: high trait group (n =31) and low trait group (n =31). Participants underwent facial recognition and emotional arousal tasks while listening to music, and simultaneously recorded event-related potentials (ERPs). The N400, P3, and late positive component (LPC) were investigated. The behavioral results showed that mindfulness meditation improved executive control abilities in emotional face processing and effectively regulated the emotional arousal of repeated listening to familiar music among young adults. These improvements were associated with positive changes in key neural signatures of facial recognition (smaller P3 and larger LPC effects) and emotional arousal (smaller N400 and larger LPC effects). Our results show that P3, N400, and LPC are important neural markers for the improvement of executive control and regulating emotional arousal in musical aesthetic emotion processing, providing new evidence for exploring attention training and emotional processing. We revised the affecting priming paradigm and E-prime 3.0 procedure to fulfill the simultaneous measurement of music listening and experimental tasks and provide a new experimental paradigm to simultaneously detect the behavioral and neural correlates of mindfulness-based musical aesthetic processing.
... In mature listeners, it is well established that correct prosody facilitates speech recognition, whereas incorrect prosody results in interference (Kjelgaard & Speer, 1999). Therefore, it is important to elucidate how linguistic prosody as opposed to emotional prosody (Diamond & Zhang, 2016) and phonetic knowledge are represented in the brain. ...
... For sentence processing, research have established the role of theta frequency phase alignment for suprasyllabic processing of linguistic content and rhythm-based speech parsing (Doelling et al., 2014;Giraud & Poeppel, 2012;Peelle et al., 2013). For spoken word processing, theta ITPC was found enhanced for targets with incongruent emotional prosody from visual primes in the N400 and later response windows (Diamond & Zhang, 2016). A study of children with autism and speech impairment has found smaller theta ITPC post-200 ms for both pure tone and word processing (Yu et al., 2018). ...
... Our result corresponds with studies using a passive listening procedure that does not require attending to speech content (Y. Zhang et al., 2005), and those using nonlinguistic prosody (Diamond & Zhang, 2016) and nonsense word stimuli (M. Friedrich & Friederici, 2005). ...
Purpose This study aimed to examine whether abstract knowledge of word-level linguistic prosody is independent of or integrated with phonetic knowledge. Method Event-related potential (ERP) responses were measured from 18 adult listeners while they listened to native and nonnative word-level prosody in speech and in nonspeech. The prosodic phonology (speech) conditions included disyllabic pseudowords spoken in Chinese and in English matched for syllabic structure, duration, and intensity. The prosodic acoustic (nonspeech) conditions were hummed versions of the speech stimuli, which eliminated the phonetic content while preserving the acoustic prosodic features. Results We observed language-specific effects on the ERP that native stimuli elicited larger late negative response (LNR) amplitude than nonnative stimuli in the prosodic phonology conditions. However, no such effect was observed in the phoneme-free prosodic acoustic control conditions. Conclusions The results support the integration view that word-level linguistic prosody likely relies on the phonetic content where the acoustic cues embedded in. It remains to be examined whether the LNR may serve as a neural signature for language-specific processing of prosodic phonology beyond auditory processing of the critical acoustic cues at the suprasyllabic level.
... This facilitation is usually shown in better and faster speech recognition (Paulmann & Pell, 2011;Stekelenburg & Vroomen, 2007). On the contrary, incongruent cues lengthen speech processing and increase the perceiver's response error rate (Ben-David et al., 2016;Diamond & Zhang, 2016;Gerdes et al., 2014;Wurm et al., 2001). Furthermore, perceivers may process different domains of cues in the same speech sound differently. ...
... Cross-modally, Grainger et al. (2001) found that spoken words as auditory primes could accelerate and improve the following lexical decision task performance when the prime and the target were phonetically related, providing evidence of an audiovisual phonetic priming effect. Combining phonetic and another informational domain in speech, Diamond and Zhang (2016) investigated how native listeners undertake phonetic and emotional prosodic information using a crossmodal selective priming task. In this task, pictures with a speaker's face with both phonetic (lip shape) and emotional (facial expression) information were used as visual primes, and sound files with spoken syllables in varying emotional prosodies were used as auditory targets. ...
... To further disentangle whether paralinguistic information processing of emotional speech takes precedence over linguistic processing in L2 learners, experimental manipulations of naturally produced speech with concurrent phonetic and affective cues can be adopted to determine the processing of multidimensional L2 speech at the syllable level without directly involving semantic processing. In this regard, the current study extended the work of Diamond and Zhang (2016), which employed a cross-modal priming task to demonstrate that native listeners process affective and phonetic information in both behavioral and neurophysiological responses differently from late L2 learners. ...
Full-text available
Purpose: Spoken language is inherently multimodal and multidimensional in natural settings, but very little is known about how second language (L2) learners undertake multilayered speech signals with both phonetic and affective cues. This study investigated how late L2 learners undertake parallel processing of linguistic and affective information in the speech signal at behavioral and neurophysiological levels. Method: Behavioral and event-related potential measures were taken in a selective cross-modal priming paradigm to examine how late L2 learners (N = 24, Mage = 25.54 years) assessed the congruency of phonetic (target vowel: /a/ or /i/) and emotional (target affect: happy or angry) information between the visual primes of facial pictures and the auditory targets of spoken syllables. Results: Behavioral accuracy data showed a significant congruency effect in affective (but not phonetic) priming. Unlike a previous report on monolingual first language (L1) users, the L2 users showed no facilitation in reaction time for congruency detection in either selective priming task. The neurophysiological results revealed a robust N400 response that was stronger in the phonetic condition but without clear lateralization and that the N400 effect was weaker in late L2 listeners than in monolingual L1 listeners. Following the N400, late L2 learners showed a weaker late positive response than the monolingual L1 users, particularly in the left central to posterior electrode regions. Conclusions: The results demonstrate distinct patterns of behavioral and neural processing of phonetic and affective information in L2 speech with reduced neural representations in both the N400 and the later processing stage, and they provide an impetus for further research on similarities and differences in L1 and L2 multisensory speech perception in bilingualism.
... According to Schirmer and Kotz [35], there are three stages for emotional speech processing: (1) analyzing the acoustic features in vocalizations, (2) deriving the emotional salience from a set of acoustic signals, and (3) integrating emotional significance to higher-order cognitive processes. The first two stages have largely been studied with the N100 and P200 components using the ERP technique, and the third stage can be probed with the late positive component (LPC) as well as behavioral measures [21,[36][37][38][39][40]. However, it remains unclear how the relative salience of semantic versus prosodic channels unfolds across the different emotional speech processing stages. ...
... Grand average ERP waveforms ( Figure 2) were computed for each emotion (happy, neutral and sad) in each channel (semantic vs. prosodic) under each task (explicit vs. implicit). Four time windows were chosen for analyses based on previous literature and visual inspection of the grand mean auditory ERP data (i.e., N100: 65-170 ms; P200: 150-300 ms; LPC:500-900 ms) [36][37][38]40,41,70]. Since maximal effects were observed at the fronto-central and central sites, we selected six electrodes (FC3, FCz, FC4, C3, Cz, C4) for statistical analyses, which was consistent with previous reports [36,37,41,80]. ...
Full-text available
How language mediates emotional perception and experience is poorly understood. The present event-related potential (ERP) study examined the explicit and implicit processing of emotional speech to differentiate the relative influences of communication channel, emotion category and task type in the prosodic salience effect. Thirty participants (15 women) were presented with spoken words denoting happiness, sadness and neutrality in either the prosodic or semantic channel. They were asked to judge the emotional content (explicit task) and speakers’ gender (implicit task) of the stimuli. Results indicated that emotional prosody (relative to semantics) triggered larger N100 and P200 amplitudes with greater delta, theta and alpha inter-trial phase coherence (ITPC) values in the corresponding early time windows, and continued to produce larger LPC amplitudes and faster responses during late stages of higher-order cognitive processing. The relative salience of prosodic and semantics was modulated by emotion and task, though such modulatory effects varied across different processing stages. The prosodic salience effect was reduced for sadness processing and in the implicit task during early auditory processing and decision-making but reduced for happiness processing in the explicit task during conscious emotion processing. Additionally, across-trial synchronization of delta, theta and alpha bands predicted the ERP components with higher ITPC values significantly associated with stronger N100, P200 and LPC enhancement. These findings reveal the neurocognitive dynamics of emotional speech processing with prosodic salience tied to stage-dependent emotion- and task-specific effects, which can reveal insights to research reconciling language and emotion processing from cross-linguistic/cultural and clinical perspectives.
... The activation of each brain region was obtained by averaging the temporal activities of the electrodes within that region. Grouping of electrodes also could improve the signal-to-noise ratio (Diamond & Zhang, 2016;Zhang et al., 2011) and has been widely used in EEG studies (Chen et al., 2011;Diamond & Zhang, 2016;Elchlepp et al., 2016;Giertuga et al., 2017;Martinovic et al., 2014;Stevens & Zhang, 2014;Zhang et al., 2011). These nine collapsed brain regions were studied in all further analyses instead of 61 individual channels. ...
... The activation of each brain region was obtained by averaging the temporal activities of the electrodes within that region. Grouping of electrodes also could improve the signal-to-noise ratio (Diamond & Zhang, 2016;Zhang et al., 2011) and has been widely used in EEG studies (Chen et al., 2011;Diamond & Zhang, 2016;Elchlepp et al., 2016;Giertuga et al., 2017;Martinovic et al., 2014;Stevens & Zhang, 2014;Zhang et al., 2011). These nine collapsed brain regions were studied in all further analyses instead of 61 individual channels. ...
This study examined hypnotizability-related modulation of the cortical network following expected and nonexpected nociceptive stimulation. The electroencephalogram (EEG) was recorded in 9 high (highs) and 8 low (lows) hypnotizable participants receiving nociceptive stimulation with (W1) and without (noW) a visual warning preceding the stimulation by 1 second. W1 and noW were compared to baseline conditions to assess the presence of any later effect and between each other to assess the effects of expectation. The studied EEG variables measured local and global features of the cortical connectivity. With respect to lows, highs exhibited scarce differences between experimental conditions. The hypnotizability-related differences in the later processing of nociceptive information could be relevant to the development of pain-related individual traits. Present findings suggest a lower impact of nociceptive stimulation in highs than in lows.
... In addition to examining the influence of auditory information on visual affect processing, a reverse experimental paradigm can be developed to investigate how emotional information conveyed visually facilitates or interferes with auditory processing in terms of temporal dynamics. Furthermore, we can extend well-established research paradigms originally targeting healthy participants to patients with psychotic disorders (Diamond and Zhang, 2016;Filippi et al., 2017), which would allow us to gain a more comprehensive understanding of how patients and healthy participants differ in MSI of emotion in terms of neurophysiological representations. ...
... Though some key ERP indices and spatial localization of brain functions during MSI of emotion have been identified in the related studies, there are many unknowns regarding the cortical and subcortical dynamics responsible for the generation of the possible ERP components in schizophrenics. In other words, more thorough investigations should be made, which can combine the timing and source localization of neural responses in the MSI of emotion (Diamond and Zhang, 2016). ...
Full-text available
Multisensory integration (MSI) of emotion has been increasingly recognized as an essential element of schizophrenic patients’ impairments, leading to the breakdown of their interpersonal functioning. The present review provides an updated synopsis of schizophrenics’ MSI abilities in emotion processing by examining relevant behavioral and neurological research. Existing behavioral studies have adopted well-established experimental paradigms to investigate how participants understand multisensory emotion stimuli, and interpret their reciprocal interactions. Yet it remains controversial with regard to congruence-induced facilitation effects, modality dominance effects, and generalized vs. specific impairment hypotheses. Such inconsistencies are likely due to differences and variations in experimental manipulations, participants’ clinical symptomatology, and cognitive abilities. Recent electrophysiological and neuroimaging research has revealed aberrant indices in event-related potential (ERP) and brain activation patterns, further suggesting impaired temporal processing and dysfunctional brain regions, connectivity and circuities at different stages of MSI in emotion processing. The limitations of existing studies and implications for future MSI work are discussed in light of research designs and techniques, study samples and stimuli, and clinical applications.
... We are particularly interested in investigating three key issues [4][5][6]: 1) How processing linguistic and paralinguistic information differentially recruits specific cortical regions (superior temporal, inferior parietal and inferior frontal) in the two hemispheres; 2) How the dimensional processing mechanisms interact with each other; 3) How language experience affects linguistic and paralinguistic processing. Yang Zhang 1 , Jo-fu Lotus Lin 2 , Keita Tanaka 3 , Toshiaki Imada 4  Behavioral results revealed an increasing order of difficulty from gender to affect to phoneme perception across the subjects. ...
Full-text available
CONFERENCE ABSTRACT Human speech contains not only the linguistic content but also important information about speaker identity and affect. This study employed whole-head magnetoencephalography (MEG) to examine how brain activities were modulated by selective listening of phoneme, affect and gender information with different degrees of task difficulty. The participants were 10 male Japanese adults with normal hearing. The words were ‘right’ and ‘light’ recorded from native English speakers and binaurally presented at 50dB SL. The participants were asked to judge congruency between the visual prime and the spoken word for each trial. The experiment started with a familiarization phase, which was immediately followed by the test phase with 200 trials in each condition. Behavioral results confirmed an increasing order of difficulty from gender to affect to phoneme conditions. Significant priming effects were found only for the affect and gender conditions. In line with the behavioral results, the MEG data revealed distinct patterns of hemispheric and regional involvement and neural oscillatory activities for evaluating the cross-modal congruency in the three conditions. These results demonstrate the neural dynamics and complexity in processing linguistic and paralinguistic information in spoken words with differential influences of language experience.
... However, much more work is needed to discover how to optimize audiovisual training to mitigate the negative effects of hearing impairment (Picou et al., 2018;Yu et al., 2017). If successful, clinical applications have the potential to shape the intervention trajectory of emotion cognition and advance speech communication and social life for a large number of special populations such as patients with psychotic disorders (e.g., schizophrenia, autism, Alzheimer's dementia), people with hearing impairments (e.g., severe-profound hearing loss, recipients of cochlear implants), the elderly people, and children with learning disabilities (e.g., dyslexia; Agustí et al., 2017;de Jong et al., 2009;Diamond & Zhang, 2016;Irwin & DiBlasi, 2017). ...
Full-text available
Purpose: Emotional speech communication involves multisensory integration of linguistic (e.g., semantic content) and paralinguistic (e.g., prosody and facial expressions) messages. Previous studies on linguistic versus paralinguistic salience effects in emotional speech processing have produced inconsistent findings. In this study, we investigated the relative perceptual saliency of emotion cues in cross-channel auditory alone task (i.e., semantics-prosody Stroop task) and cross-modal audiovisual task (i.e., semantics-prosody-face Stroop task). Method: Thirty normal Chinese adults participated in two Stroop experiments with spoken emotion adjectives in Mandarin Chinese. Experiment 1 manipulated auditory pairing of emotional prosody (happy or sad) and lexical semantic content in congruent and incongruent conditions. Experiment 2 extended the protocol to cross-modal integration by introducing visual facial expression during auditory stimulus presentation. Participants were asked to judge emotional information for each test trial according to the instruction of selective attention. Results: Accuracy and reaction time data indicated that, despite an increase in cognitive demand and task complexity in Experiment 2, prosody was consistently more salient than semantic content for emotion word processing and did not take precedence over facial expression. While congruent stimuli enhanced performance in both experiments, the facilitatory effect was smaller in Experiment 2. Conclusion: Together, the results demonstrate the salient role of paralinguistic prosodic cues in emotion word processing and congruence facilitation effect in multisensory integration. Our study contributes tonal language data on how linguistic and paralinguistic messages converge in multisensory speech processing and lays a foundation for further exploring the brain mechanisms of cross-channel/modal emotion integration with potential clinical applications.
... De Silva et al. demonstrated that in normal subjects, some emotions such as sadness and fear in videos are better identified in the auditory modality whereas other emotions such as anger and happiness are better recognized in the visual modality [69]. A recent EEG study on normal subjects further revealed visual-auditory priming effects in distinct neural oscillatory activities for emotional prosody processing as against phonetic processing [70]. It remains to be tested how patients with schizophrenia differ from normal individuals in such multimodal experimental paradigms. ...
Full-text available
Emotional prosody (EP) has been increasingly recognized as an important area of schizophrenic patients’ dysfunctions in their language use and social communication. The present review aims to provide an updated synopsis on emotional prosody processing (EPP) in schizophrenic disorders, with a specific focus on performance characteristics, the influential factors and underlying neural mechanisms. A literature search up to 2018 was conducted with online databases, and final selections were limited to empirical studies which investigated the prosodic processing of at least one of the six basic emotions in patients with a clear diagnosis of schizophrenia without co-morbid diseases. A narrative synthesis was performed, covering the range of research topics, task paradigms, stimulus presentation, study populations and statistical power with a quantitative meta-analytic approach in Comprehensive Meta-Analysis Version 2.0. Study outcomes indicated that schizophrenic patients’ EPP deficits were consistently observed across studies (d = −0.92, 95% CI = −1.06 < δ < −0.78), with identification tasks (d = −0.95, 95% CI = −1.11 < δ < −0.80) being more difficult to process than discrimination tasks (d = −0.74, 95% CI = −1.03 < δ < −0.44) and emotional stimuli being more difficult than neutral stimuli. Patients’ performance was influenced by both participant- and experiment-related factors. Their social cognitive deficits in EP could be further explained by right-lateralized impairments and abnormalities in primary auditory cortex, medial prefrontal cortex and auditory-insula connectivity. The data pointed to impaired pre-attentive and attentive processes, both of which played important roles in the abnormal EPP in the schizophrenic population. The current selective review and meta-analysis support the clinical advocacy of including EP in early diagnosis and rehabilitation in the general framework of social cognition and neurocognition deficits in schizophrenic disorders. 
Future cross-sectional and longitudinal studies are further suggested to investigate schizophrenic patients’ perception and production of EP in different languages and cultures, modality forms and neuro-cognitive domains.
Full-text available
The event-related potential (ERP) technique was used to investigate whether there are different neural responses to musical emotion when the same melodies are presented in the voice and instrumental timbre such as the violin. With a crossmodal affective priming paradigm, target faces were primed by affectively congruent or incongruent vocal and instrumental music. Participants were asked to judge whether the prime-target pair was affectively congruent or incongruent. The results revealed a larger late positive component (LPC) at the time window of 473~677 ms in response to affectively incongruent versus congruent trials in the vocal version, whereas a larger N400 effect at the time window of 281~471 ms was observed in the instrumental version. These results indicate differential patterns of neurophysiological responses to emotion processing of vocal and instrumental music.
Full-text available
We hypothesized that attitudes characterized by a strong association between the attitude object and an evaluation of that object are capable of being activated from memory automatically upon mere presentation of the attitude object. We used a priming procedure to examine the extent to which the mere presentation of an attitude object would facilitate the latency with which subjects could indicate whether a subsequently presented target adjective had a positive or a negative connotation. Across three experiments, facilitation was observed on trials involving evaluatively congruent primes (attitude objects) and targets, provided that the attitude object possessed a strong evaluative association. In Experiments 1 and 2, preexperimentally strong and weak associations were identified via a measurement procedure. In Experiment 3, the strength of the object-evaluation association was manipulated. The results indicated that attitudes can be automatically activated and that the strength of the objectevaluation association determines the likelihood of such automatic activation. The implications of these findings for a variety of issues regarding attitudes—including their functional value, stability, effects on later behavior, and measurement—are discussed.
Autism Spectrum Disorder (ASD) is characterized by impairments in language and social–emotional cognition. Yet, findings of emotion recognition from affective prosody in individuals with ASD are inconsistent. This study investigated emotion recognition and neural processing of affective prosody in high-functioning adults with ASD relative to neurotypical (NT) adults. Individuals with ASD showed mostly typical brain activation of the fronto-temporal and subcortical brain regions in response to affective prosody. Yet, the ASD group showed a trend towards increased activation of the right caudate during processing of affective prosody and rated the emotional intensity lower than did NT individuals. This is likely associated with increased attentional task demands in this group, which might contribute to social–emotional impairments.
The current study measured neural responses to investigate auditory stream segregation of noise stimuli with or without clear spectral contrast. Sequences of alternating A and B noise bursts were presented to elicit stream segregation in normal-hearing listeners. The successive B bursts in each sequence maintained an equal amount of temporal separation with manipulations introduced on the last stimulus. The last B burst was either delayed for 50% of the sequences or not delayed for the other 50%. The A bursts were jittered in between every two adjacent B bursts. To study the effects of spectral separation on streaming, the A and B bursts were further manipulated by using either bandpass-filtered noises widely spaced in center frequency or broadband noises. Event-related potentials (ERPs) to the last B bursts were analyzed to compare the neural responses to the delay vs. no-delay trials in both passive and attentive listening conditions. In the passive listening condition, a trend for a possible late mismatch negativity (MMN) or late discriminative negativity (LDN) response was observed only when the A and B bursts were spectrally separate, suggesting that spectral separation in the A and B burst sequences could be conducive to stream segregation at the pre-attentive level. In the attentive condition, a P300 response was consistently elicited regardless of whether there was spectral separation between the A and B bursts, indicating the facilitative role of voluntary attention in stream segregation. The results suggest that reliable ERP measures can be used as indirect indicators for auditory stream segregation in conditions of weak spectral contrast. These findings have important implications for cochlear implant (CI) studies – as spectral information available through a CI device or simulation is substantially degraded, it may require more attention to achieve stream segregation.
It is commonly assumed that the recruitment of visual areas during audition is not relevant for performing auditory tasks ('auditory-only view'). According to an alternative view, however, the recruitment of visual cortices is thought to optimize auditory-only task performance ('auditory-visual view'). This alternative view is based on functional magnetic resonance imaging (fMRI) studies. These studies have shown, for example, that even if there is only auditory input available, face-movement sensitive areas within the posterior superior temporal sulcus (pSTS) are involved in understanding what is said (auditory-only speech recognition). This is particularly the case when speakers are known audio-visually, that is, after brief voice-face learning. Here we tested whether the left pSTS involvement is causally related to performance in auditory-only speech recognition when speakers are known by face. To test this hypothesis, we applied cathodal transcranial direct current stimulation (tDCS) to the pSTS during (i) visual-only speech recognition of a speaker known only visually to participants and (ii) auditory-only speech recognition of speakers they learned by voice and face. We defined the cathode as the active electrode to down-regulate cortical excitability by hyperpolarization of neurons. tDCS to the pSTS interfered with visual-only speech recognition performance compared to a control group without pSTS stimulation (tDCS to BA6/44 or sham). Critically, compared to controls, pSTS stimulation additionally decreased auditory-only speech recognition performance selectively for voice-face learned speakers. These results are important in two ways. First, they provide direct evidence that the pSTS is causally involved in visual-only speech recognition; this confirms a long-standing prediction of current face-processing models. Second, they show that visual face-sensitive pSTS is causally involved in optimizing auditory-only speech recognition.
These results are in line with the 'auditory-visual view' of auditory speech perception, which assumes that auditory speech recognition is optimized by using predictions from previously encoded speaker-specific audio-visual internal models.
In auditory-only conditions, for example when we listen to someone on the phone, it is essential to recognize what is said quickly and accurately (speech recognition). Previous studies have shown that speech recognition performance in auditory-only conditions is better if the speaker is known not only by voice, but also by face. Here, we tested the hypothesis that such an improvement in auditory-only speech recognition depends on the ability to lip-read. To test this we recruited a group of adults with autism spectrum disorder (ASD), a condition associated with difficulties in lip-reading, and typically developed controls. All participants were trained to identify six speakers by name and voice. Three speakers were learned by a video showing their face and three others were learned in a matched control condition without face. After training, participants performed an auditory-only speech recognition test that consisted of sentences spoken by the trained speakers. As a control condition, the test also included speaker identity recognition on the same auditory material. The results showed that, in the control group, performance in speech recognition was improved for speakers known by face in comparison to speakers learned in the matched control condition without face. The ASD group lacked such a performance benefit. For the ASD group auditory-only speech recognition was even worse for speakers known by face compared to speakers not known by face. In speaker identity recognition, the ASD group performed worse than the control group independent of whether the speakers were learned with or without face. Two additional visual experiments showed that the ASD group performed worse in lip-reading whereas face identity recognition was within the normal range. The findings support the view that auditory-only communication involves specific visual mechanisms.
Further, they indicate that in ASD, speaker-specific dynamic visual information is not available to optimize auditory-only speech recognition.
Several studies reveal that unpleasant pictures elicit higher beta and gamma responses than pleasant and/or neutral pictures; however, the effect of stimulation design (block or random) has not been studied before. The aim of the study is to analyze the common and distinct parameters of affective picture perception in block and random designs by means of analysis of high-frequency oscillatory dynamics (beta and gamma). EEG of 22 healthy subjects was recorded at 32 locations. The participants passively viewed 120 emotional pictures (10×4 unpleasant, 10×4 pleasant, 10×4 neutral) in block and random designs. The phase-locking and power of event-related beta (14–28 Hz) and gamma (29–48 Hz) oscillations were analyzed for two different time windows (0–200 ms and 200–400 ms). Statistical analysis showed that in the 0–200 ms time window, during the block design, unpleasant stimulation elicited higher beta phase-locking and beta power than the pleasant and neutral stimulation (p<0.05). In the 200–400 ms time window, during the block design, over occipital electrodes unpleasant stimulation elicited higher gamma response power than the pleasant and neutral stimulation (p<0.05). Unpleasant stimulation did not elicit higher beta or gamma responses in the random design. The present study showed that experimental design strongly influences the perception of IAPS pictures. Unpleasant stimulation elicited higher event-related beta and gamma phase-locking and power only in the block design, not in the random design. It seems that longer blocks of aversive pictures affect the brain more than the rapid observation of these pictures.