
Parham Mokhtari, PhD
Toyama Prefectural University · Department of Intelligent Robotics
About
96 Publications
12,633 Reads
1,087 Citations
Introduction
Parham Mokhtari is an educator and research scientist in human vocal communication, spatial hearing, and related computational methods in signal processing and acoustic simulation. He is a full Professor in the Department of Intelligent Robotics at Toyama Prefectural University, Japan. Broadly, his research seeks to clarify and discover links between the physical, psychophysical, physiological, and neural mechanisms involved in the acoustic modality of human perception and communication.
Additional affiliations
April 2019 - March 2023
April 2016 - March 2019
National Institute of Information and Communications Technology (NICT)
Position: Senior Researcher
April 2013 - March 2018
Doshisha University
Position: Lecturer
Description: Graduate course in Computer Science
Education
March 1993 - June 1998
University of New South Wales (UNSW)
Field of study: "An Acoustic-Phonetic and Articulatory Study of Speech-Speaker Dichotomy"
February 1989 - December 1992
Publications (96)
The first (lowest) peak of head-related transfer functions (HRTFs) is known to be a concha depth resonance and a spectral cue in human sound localization. However, there is still no established model to estimate its center-frequency F1 and amplitude A1 from pinna anthropometry. Here, with geometries of 38 pinnae measured and their median-plane HRTF...
In this paper, we first review previous studies on control of voice quality in human speech communication, from the perspective of phonetics and signal processing. With the aim of building a quantitative model of the glottal source waveform capable of representing various voice qualities, we consider the descriptive framework of voice quality propo...
Beyond the first peak of head-related transfer functions or pinna-related transfer functions (PRTFs), human pinnae are known to have two normal modes with “vertical” resonance patterns, involving two or three pressure anti-nodes in cavum, cymba, and fossa. However, little is known about individual variations in these modes, and there is no establish...
In order to clarify the effect of visual stimuli on the spatially split perception of sound images induced by interaural differences, we conducted spatially split perception experiments under conditions in which visual stimuli were presented on a head-mounted display. The auditory stimuli were synthesized binaural signals consisting of two uncorrelated pin...
This article describes a linear microphone array used for measuring head-related impulse responses simultaneously at various radial distances using the reciprocal method. The microphone array consists of miniature 5.8 mm diameter electret condenser microphones (ECMs) arranged on a boom, using a 3D printed microphone holder with pillars. The frequen...
Spectral cues (SCs) formed by the pinna are known to be essential for sound externalization and for accurate localization of sound-source azimuth and elevation by binaural listeners. SCs are also known to play a key role in monaural sound localization. The experiments described in this article were intended to clarify how changes in SCs associated with head...
We measured the input impedance characteristics, input voltage versus output sound pressure characteristics, harmonic distortion characteristics, frequency characteristics, and impulse response of a currently available miniature electrodynamic driver unit (Foster Electric, MT006B) when used as a loudspeaker with an open space load. The nominal inpu...
Humans can externalise and localise sound-sources in three-dimensional (3D) space because approaching sound waves interact with the head and external ears, adding auditory cues by (de-)emphasising the level in different frequency bands depending on the direction of arrival. While virtual audio systems reproduce these acoustic filtering effects with...
The aim of this study is to comparatively review and evaluate three variants of the glottal inverse filtering algorithm based on iterative adaptive inverse filtering (IAIF): the Standard algorithm, and two recently proposed variants that use iterative optimal preemphasis (IOP) and a glottal flow model (GFM), respectively. To enable an objective eva...
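For orientation, the sketch below illustrates the basic LPC-based inverse-filtering idea that the IAIF family builds on; it is not the Standard, IOP, or GFM variant evaluated in the paper, and the model orders, the single-pass structure, and the leaky-integration constant are illustrative assumptions.

```python
# Minimal sketch of LPC-based glottal inverse filtering in the spirit of IAIF
# (not the Standard/IOP/GFM variants evaluated in the paper). Model orders and
# the single-pass structure are illustrative assumptions.
import numpy as np
from scipy.signal import lfilter

def lpc(x, order):
    """LPC coefficients [1, a1, ..., ap] via the autocorrelation method (Levinson-Durbin)."""
    n = len(x)
    r = np.array([np.dot(x[:n - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a_prev = a.copy()
        a[1:i] = a_prev[1:i] + k * a_prev[i - 1:0:-1]
        a[i] = k
        err *= (1.0 - k * k)
    return a

def simple_glottal_inverse_filter(speech, fs, vt_order=None, tilt_order=1):
    """One-pass sketch: remove a gross glottal tilt, estimate the vocal-tract
    filter, inverse-filter the speech, then leaky-integrate to cancel the
    lip-radiation differentiation."""
    if vt_order is None:
        vt_order = 2 + fs // 1000                       # common rule of thumb
    x = speech - np.mean(speech)
    g1 = lpc(x, tilt_order)                             # 1st-order glottal/tilt estimate
    x_no_tilt = lfilter(g1, [1.0], x)
    vt = lpc(x_no_tilt, vt_order)                       # vocal-tract estimate
    dglottal = lfilter(vt, [1.0], x)                    # inverse-filter the speech
    glottal = lfilter([1.0], [1.0, -0.99], dglottal)    # leaky integration
    return glottal

if __name__ == "__main__":
    fs = 16000
    # crude synthetic "voiced" test signal: impulse train through one resonance
    src = (np.arange(fs) % (fs // 100) == 0).astype(float)
    speech = lfilter([1.0], [1.0, -1.6, 0.9], src)      # toy one-resonance "vocal tract"
    print(simple_glottal_inverse_filter(speech, fs)[:10])
```

In the actual IAIF algorithms, the glottal and vocal-tract estimates are refined over repeated passes rather than in a single pass as above.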
The first (lowest) spectral notch (N1) of head-related transfer functions is known as a cue for sound localization in the median plane. This may be due to the fact that N1 frequency gradually increases as the sound source approaches the direction above the head. The mechanism of this phenomenon, however, is still unclear. To clarify the mechanism,...
Acoustic characteristics of the vocal tract have been investigated extensively in the literature using a one-dimensional (1D) acoustic simulation method. Because the 1D method assumes plane wave propagation only, it is recognized to be valid only in the low frequency region (below about 4 or 5 kHz). Recently, a three-dimensional (3D) acoustic simul...
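As a point of reference for the 1D method mentioned above, the following sketch computes a vocal-tract transfer function from an area function under the plane-wave assumption, using a lossless chain-matrix (transmission-line) formulation with an idealized open end at the lips; the two-tube area function and the physical constants are illustrative, not data from the study.

```python
# Minimal 1D plane-wave (lossless chain-matrix) vocal-tract transfer function.
# The two-tube area function below is a toy shape, not data from the paper.
import numpy as np

C = 350.0      # speed of sound in warm, moist air [m/s] (assumed)
RHO = 1.14     # air density [kg/m^3] (assumed)

def transfer_function(areas_m2, lengths_m, freqs_hz):
    """Volume-velocity transfer U_lips/U_glottis for a chain of lossless tube
    sections, assuming an ideal open end (pressure = 0) at the lips."""
    H = np.zeros(len(freqs_hz), dtype=complex)
    for i, f in enumerate(freqs_hz):
        k = 2.0 * np.pi * f / C
        M = np.eye(2, dtype=complex)
        for A, L in zip(areas_m2, lengths_m):
            Zc = RHO * C / A                            # characteristic impedance
            T = np.array([[np.cos(k * L), 1j * Zc * np.sin(k * L)],
                          [1j * np.sin(k * L) / Zc, np.cos(k * L)]])
            M = M @ T
        # [p_glottis, U_glottis]^T = M [p_lips, U_lips]^T, and p_lips = 0,
        # so U_glottis = M[1,1] * U_lips and H = 1 / M[1,1].
        H[i] = 1.0 / M[1, 1]
    return H

freqs = np.linspace(50, 8000, 800)
areas = [3e-4] * 10 + [8e-4] * 10            # toy shape: narrow back, wide front [m^2]
lengths = [0.017 / 2] * 20                   # 17 cm total length in 20 sections
mag = np.abs(transfer_function(areas, lengths, freqs))
peaks = freqs[1:-1][(mag[1:-1] > mag[:-2]) & (mag[1:-1] > mag[2:])]
print("formant estimates [Hz]:", np.round(peaks[:4]))
```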
As the peaks of head-related transfer functions are generated by normal modes of the pinna (external ear), the peak center-frequency and the three-dimensional pattern of each normal mode depend on individual pinna geometry. Traditionally, normal modes are visualized mainly in terms of pressure anti-nodes. To better understand the relations between...
Copyright is important to protect the proprietary rights in multimedia content, but it can pose a challenge when applying complicated signal processing to commercial multimedia products. It is usually forbidden to make copies of copyrighted material, so signal processing must be applied on the fly. As a result, such signal processing ne...
It is known that the right and left piriform fossae generate two deep dips on speech spectra and that acoustic interaction exists in generating the dips: if only one piriform fossa is modified, both the dips change in frequency and amplitude. In the present study, using a simple geometrical model and measured vocal tract shapes, the acoustic intera...
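As a first-order reference (not the coupled-fossae analysis of this study), a single piriform fossa can be approximated as a closed side branch of depth L_f that short-circuits the main tract at its quarter-wave resonances, producing spectral dips near the frequencies below; the depth and sound speed used in the comment are illustrative values only.

```latex
% First-order estimate of a piriform-fossa dip, treating one fossa as a
% closed quarter-wave side branch of depth L_f (illustrative numbers only).
f_{\mathrm{dip}} \approx \frac{(2n-1)\,c}{4 L_f}, \qquad n = 1, 2, \dots
% e.g.\ L_f \approx 2\,\mathrm{cm},\; c \approx 350\,\mathrm{m/s}
% \;\Rightarrow\; f_{\mathrm{dip}} \approx 4.4\,\mathrm{kHz}
```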
It has been suggested that the first spectral peak and the first two spectral notches of head-related transfer functions (HRTFs) are cues for sound localization in the median plane. Therefore, to examine the mechanism for generating spectral peaks and notches, HRTFs were calculated from four head shapes using the finite-difference time-domain metho...
We conducted quantitative analyses of a magnetic resonance imaging (MRI) database to examine the correlation between physical measures (vocal tract length and body height) and acoustic parameters (pitch and formant frequencies) of vowels. The vocal tract length was measured from MRI data for the five Japanese vowels produced by fifteen male Japanes...
A long-standing issue in virtual three-dimensional (3D) audio is personalization of head- and pinna-related transfer functions (HRTFs/PRTFs) to the head and pinna geometry of each individual listener. Despite research advances and the availability of some multi-subject HRTF databases, not enough is known about the range and type of individual diffe...
Spatial audio is one of the promising techniques for next-generation enriched audio media. Fundamental signal processing techniques for creating spatial audio have been developed and are prevalent in this research field. But special attention must be paid when applying them to practical applications, otherwise the sound could be perceived differe...
This paper provides an analysis of headphone calibration for reproducing sound pressure at the eardrums of the listener. If the vibrations of the listener's eardrums were reproduced exactly, the listener would perceive an auditory sensation as if he/she were in the original sound scene. The visual and other sensations also affect sound impressions, but...
Acoustic coupling between the voiced sound source and the time-varying acoustic load during phonation was simulated by combining the vocal-fold model of [S. Adachi et al., J. Acoust. Soc. Am. 117(5) (2005)] with the vocal-tract model of [P. Mokhtari et al., Speech Commun. 50, 179-190 (2008)]. The combined simulation model enables analysis of the dynami...
An apparatus enabling automatic determination of a portion that reliably represents a feature of a speech waveform includes: an acoustic/prosodic analysis unit calculating, from data, distribution of an energy of a prescribed frequency range of the speech waveform on a time axis, and for extracting, among various syllables of the speech waveform, a...
A speaker identifying apparatus includes: a module for performing a principal component analysis on predetermined vocal tract geometrical parameters of a plurality of speakers and calculating an average and principal component vectors representing speaker-dependent variation; a module for performing acoustic analysis on the speech data being uttere...
Finite-Difference Time Domain (FDTD) acoustic simulation was used to calculate Pinna-Related Transfer Functions (PRTFs) of the KEMAR manikin's DB60 pinna. A baseline set of 25 PRTFs was first calculated at regular intervals of elevation angle in the front median plane. The simulation was then repeated 1784 times, corresponding to every unique, sin...
The Finite-Difference Time-Domain (FDTD) method was used to simulate Head-Related Transfer Functions (HRTFs) of KEMAR (Knowles Electronics Manikin for Acoustic Research). Compared with KEMAR's measured HRTFs available in the CIPIC database, the mean spectral mismatch on a linear frequency scale up to 14 kHz was 2.3 dB; this was better than the 3.1...
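A spectral mismatch score like the one quoted above can be computed in several ways; the sketch below uses an RMS difference of dB magnitude spectra on a linear frequency grid up to 14 kHz, which is an assumed definition since the excerpt does not give the exact formula, and the impulse responses are random placeholders rather than measured or simulated HRIRs.

```python
# Sketch of a mean spectral mismatch between measured and simulated HRTFs,
# in dB on a linear frequency grid up to 14 kHz. The RMS-of-dB-differences
# definition is an assumption; the paper's exact formula is not in the excerpt.
import numpy as np

def spectral_mismatch_db(h_meas, h_sim, fs, f_max=14000.0, n_fft=1024):
    """RMS difference of magnitude spectra (in dB) of two impulse responses."""
    H1 = np.fft.rfft(h_meas, n_fft)
    H2 = np.fft.rfft(h_sim, n_fft)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    band = (freqs > 0) & (freqs <= f_max)
    d = 20.0 * np.log10(np.abs(H1[band]) + 1e-12) - 20.0 * np.log10(np.abs(H2[band]) + 1e-12)
    return np.sqrt(np.mean(d ** 2))

# toy usage with random impulse responses (placeholders for measured/simulated HRIRs)
rng = np.random.default_rng(0)
h_a = rng.standard_normal(256) * np.exp(-np.arange(256) / 40.0)
h_b = h_a + 0.05 * rng.standard_normal(256)
print(f"mismatch: {spectral_mismatch_db(h_a, h_b, fs=44100):.2f} dB")
```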
To reveal the mechanism generating spectral peaks and notches of the head-related transfer functions (HRTFs) in the median plane, head shapes were measured using magnetic resonance imaging. Then HRTFs were calculated from the shapes using the finite-difference time-domain method. Results showed that the pinna shape was the dominant factor for the b...
The vocal tract shape is three-dimensionally complex. For accurate acoustic analysis, a finite-difference time-domain method was introduced in the present study. By this method, transfer functions of the vocal tract for the five Japanese vowels were calculated from three-dimensionally reconstructed magnetic resonance imaging (MRI) data. The calcula...
This paper summarizes an empirical study exploring whether or not human listeners can tell the facing direction of a human speaker solely by the auditory sense, and if so, how accurately they are able to do it. The purpose is to find the sound information necessary for ultimately realistic and human-centered telecommunications. The study consists o...
In pursuit of an ultimately realistic human-to-human telecommunication technology, the ability to auditorily perceive the facing direction of a human speaker was explored. Listeners' performance was assessed in an anechoic chamber. A male speaker sat on a pivot chair and spoke a short sentence while facing a direction that was randomly chosen from...
A perfectly matched layer (PML) is commonly used in finite-difference time-domain (FDTD) simulation to absorb outgoing waves and thereby reduce artifactual reflections from the computational domain boundaries. However, previous two-dimensional studies have noted that increasing the PML loss factor does not monotonically improve the PML's performanc...
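The trade-off mentioned above can be probed qualitatively even in one dimension: the toy FDTD below terminates the grid with a graded loss layer (the 1D analogue of a PML) and sweeps its peak loss factor, so that too little loss lets the pulse reach the far boundary and reflect back, while excessive loss reflects it off the front of the layer. The grid size, cubic grading, and loss values are illustrative assumptions, not the setup of the study.

```python
# 1D toy FDTD with a graded absorbing layer at the right boundary,
# sweeping the peak loss factor to probe absorption versus reflection.
import numpy as np

def residual_after_absorption(sigma_max, n=400, layer=40, steps=1200):
    """Run a 1D acoustic FDTD with a graded loss layer on the right and
    return the largest pressure amplitude left outside the layer."""
    dt, dx, c = 0.5, 1.0, 1.0                              # Courant number 0.5 (stable)
    x = np.arange(n)
    p = np.exp(-((x - 100.0) ** 2) / 50.0)                 # initial Gaussian pressure pulse
    u = np.zeros(n - 1)                                    # particle velocity (staggered grid)
    sigma = np.zeros(n)
    sigma[-layer:] = sigma_max * (np.arange(1, layer + 1) / layer) ** 3   # cubic grading
    damp_p = np.exp(-sigma * dt)                           # exponential damping (always stable)
    damp_u = np.exp(-0.5 * (sigma[:-1] + sigma[1:]) * dt)
    for _ in range(steps):
        u = damp_u * u - (dt / dx) * (p[1:] - p[:-1])
        p[1:-1] = damp_p[1:-1] * p[1:-1] - c * c * (dt / dx) * (u[1:] - u[:-1])
    return float(np.max(np.abs(p[:n - layer])))            # residue outside the layer

for s in (0.0, 0.05, 0.5, 5.0, 100.0):
    print(f"peak loss factor {s:6.2f} -> residual amplitude {residual_after_absorption(s):.3e}")
```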
If the same sound pressure could be reproduced at the eardrum as when the listener hears the sound without headphones, the listener would perceive three-dimensional sound even when the sound is presented through headphones. Headphone calibration is therefore required to compensate for individual variations in the transfer function of a l...
To better understand the relations between pinna anthropometry and acoustic features used for sound localisation, acoustic sensitivity analysis was carried out on the DB60 pinna of the Knowles Electronics Manikin for Acoustic Research (KEMAR), with the aid of computer simulations with the Finite-Difference Time Domain (FDTD) method. Starting with...
There is a common peak-notch pattern in head-related transfer functions (HRTFs) for the median plane, and the pattern provides cues for perceiving the elevation of the sound source. In the present study, to examine morphological features necessary for generating the typical peak-notch pattern, the pinna was modeled as a rectangular plate with a rec...
The hypopharyngeal cavities consist of the laryngeal cavity and bilateral piriform fossa, constituting the bottom part of the vocal tract near the larynx. Visualisation of these cavities with magnetic resonance imaging (MRI) techniques reveals that during speech, the laryngeal cavity takes the form of a long-neck flask and the piriform fossa takes...
Humans can perceive sound three-dimensionally thanks to head-related transfer functions (HRTFs) that result from complex reflection and diffraction by the head and pinna, whose shapes vary greatly among individuals. In three-dimensional auditory reproduction systems, this variation should be considered. To visualize and understand the acoustic phenomen...
This paper proposes a new headphone calibration function for precise reproduction of 3D audio generated using simulated head-related transfer functions (HRTFs) or binaural recordings. In order to compensate for individual characteristics of the earcanal transfer functions and the eardrum impedance, which are generally different from person to perso...
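For context, one generic way to build such a compensation filter is regularized frequency-domain inversion of a measured headphone-to-eardrum response with a modelling delay; the sketch below shows that textbook approach, not the calibration function proposed in the paper, and the "measured" response is a synthetic placeholder.

```python
# Generic regularized inverse filter for headphone equalization:
# H_inv = conj(H) / (|H|^2 + beta), with a pure-delay target for causality.
# Not the calibration function proposed in the paper.
import numpy as np

def regularized_inverse(h, n_fft=2048, beta=1e-3, delay=256):
    """Return an FIR filter that approximately inverts impulse response h."""
    H = np.fft.rfft(h, n_fft)
    target = np.exp(-2j * np.pi * np.fft.rfftfreq(n_fft) * delay)   # delayed-impulse target
    H_inv = np.conj(H) * target / (np.abs(H) ** 2 + beta)
    return np.fft.irfft(H_inv, n_fft)

# toy "headphone-to-eardrum" response: a damped resonance (placeholder, not measured data)
fs = 48000
t = np.arange(512) / fs
h = np.exp(-t * 3000.0) * np.sin(2 * np.pi * 3500.0 * t)
h_inv = regularized_inverse(h)
# convolving h with h_inv should approximate a delayed impulse
residual = np.convolve(h, h_inv)
print("peak at sample", int(np.argmax(np.abs(residual))), "(expected near the modelling delay)")
```

The regularization constant beta trades inversion accuracy against excessive gain at frequencies where the measured response has little energy.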
In pursuit of an ultimately realistic human-to-human telecommunication technology, the ability to auditorily perceive the facing direction of a human speaker was measured. A male speaker sat on a pivot chair in an anechoic chamber and spoke a short sentence (about 5 s) while facing either of eight azimuth angles (0=listener's direction, 45, 90, 135...
To give listeners a vivid sense of 3D spatial audio, virtual auditory display technology relies crucially on head related transfer functions (HRTFs). However, as each person has unique morphological characteristics of their head and ears, for a realistic auditory experience it is important to use personalized HRTFs. Our approach to HRTF personaliza...
Sound localization tests were carried out with two subjects using a Virtual Auditory Display (VAD) to determine the inter-subject effects on localization accuracy, of employing either acoustically measured or Finite Difference Time Domain (FDTD)-simulated Head Related Transfer Functions (HRTFs). Results indicate that the simulated HRTFs were able t...
The hypopharyngeal cavities are the narrow, complex parts of the lower vocal tract that include the supraglottal laryngeal cavity and bilateral cavities of the piriform fossa. These small regions exhibit rather strong acoustic influence on vowel spectra in the higher frequencies and contribute to determining voice quality and speaker characteristic...
An acoustic simulator based on the finite‐difference time‐domain (FDTD) method was evaluated by acoustic measurements on solid models of the vocal tract. Three‐dimensional vocal tract (3D VT) shapes for a male subject during production of the five Japanese vowels were measured by magnetic resonance imaging. Transfer functions of the 3D VT shapes we...
Although it has been found that the piriform fossae play an important role in speech production and acoustics, the popular time domain articulatory synthesizer of [Maeda, S., 1982. A digital simulation method of the vocal-tract system. Speech Comm. 1 (3–4), 199–229] currently cannot include any more than one side branch to the acoustic tube that re...
This paper presents a comparison of computer-simulated versus acoustically measured, front-hemisphere head related transfer functions (HRTFs) of two human subjects. Simulations were carried out with a 3D finite difference time domain (FDTD) method, using magnetic resonance imaging (MRI) data of each subject's head. A spectral distortion measure was...
An alternative and complete derivation of the vocal tract length sensitivity function, which is an equation for finding a change in formant frequency due to perturbation of the vocal tract length [Fant, Quarterly Progress and Status Rep. No. 4, Speech Transmission Laboratory, Kungliga Tekniska Högskolan, Stockholm, 1975, pp. 1-14], is presented. It...
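The general derivation is the subject of the paper; for the special case of a uniform tube closed at the glottis and open at the lips, the length sensitivity reduces to the familiar relation below (an illustrative special case, not the full sensitivity function).

```latex
% Uniform-tube special case (closed at the glottis, open at the lips):
% lengthening the tract lowers every formant in proportion to the relative
% change in length.
F_n = \frac{(2n-1)\,c}{4L}
\quad\Longrightarrow\quad
\frac{\partial F_n}{\partial L} = -\frac{F_n}{L},
\qquad
\frac{\Delta F_n}{F_n} \approx -\frac{\Delta L}{L}.
```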
This paper addresses the following two hypotheses: (i) vocal-tract area functions of Japanese vowels can be accurately represented by a linear combination of only a few principal components which, furthermore, are similar to those reported in the literature for different languages; and (ii) the principal components’ weights can be predicted and are...
Frequency‐domain simulations of the human vocal tract (VT) have previously shown the importance of including the piriform fossae, which impart a pole and two zeros in the 4–5‐kHz frequency range and thereby contribute to speaker individualities. The literature has also shown that time‐domain simulation of VT acoustics can result in high‐quality synt...
A speech synthesis system was developed based on Maeda's method [S. Maeda, Speech Commun. 1, 199–229 (1982)], which simulates acoustic wave propagation in the vocal tract in the time domain. This system has a GUI that allows fine control of synthesis parameters and timing. In addition, the piriform fossae were included in the vocal tract m...
The acoustic effects of the laryngeal cavity on the vocal tract resonance were investigated by using vocal tract area functions for the five Japanese vowels obtained from an adult male speaker. Transfer functions were examined with the laryngeal cavity eliminated from the whole vocal tract, volume velocity distribution patterns were calculated, and...
Acoustic effects of the time-varying glottal area due to vocal fold vibration on the laryngeal cavity resonance were investigated based on vocal tract area functions and acoustic analysis. The laryngeal cavity consists of the vestibular and ventricular parts of the larynx, and gives rise to a regional acoustic resonance within the vocal tract, with...
The aim of this study is to explore possible speaker characteristics common to speech sounds, through two psychoacoustic experiments. In the first experiment, sustained Japanese vowels produced by four adult male speakers were used, and ABX tests were carried out to confirm whether speaker individualities common to sustained vowels exist by t...
Vocal tract data from 3D cine-MRI are used together with synchronised acoustics to evaluate a linear regression model for inversion. The first two principal components of vocalic area functions are predicted with correlations 0.99 and 0.97 respectively, from 24 FFT-cepstra measured in the frequency band 0-4 kHz. This best regression model together...
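The pipeline described above (PCA of area functions, then linear regression from cepstra to the leading PC weights) can be sketched as follows; the random arrays stand in for the cine-MRI area functions and the 24 FFT-cepstra, scikit-learn is used only for brevity, and no numerical results of the study are reproduced.

```python
# Sketch of the inversion scheme described above: PCA on vocal-tract area
# functions, then linear regression from cepstral features to the first two
# PC weights. The random data below are placeholders; only the pipeline
# structure follows the abstract.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n_frames, n_sections, n_cepstra = 300, 40, 24

# placeholder "area functions" (frames x tube sections) and synchronized cepstra
area_functions = np.abs(rng.standard_normal((n_frames, n_sections))) + 1.0
cepstra = rng.standard_normal((n_frames, n_cepstra))

pca = PCA(n_components=2).fit(area_functions)
pc_weights = pca.transform(area_functions)            # targets: first two PC weights

reg = LinearRegression().fit(cepstra, pc_weights)     # acoustic-to-articulatory mapping
pred = reg.predict(cepstra)

for k in range(2):
    r = np.corrcoef(pc_weights[:, k], pred[:, k])[0, 1]
    print(f"PC{k + 1} prediction correlation on training data: {r:.2f}")

# reconstruct area functions from predicted weights (mean + weighted PCs)
recon = pca.inverse_transform(pred)
print("reconstruction RMS error:", float(np.sqrt(np.mean((recon - area_functions) ** 2))))
```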
With the aim of automatically categorizing phrase final tones, investigations are conducted on the relationship between acoustic-prosodic parameters and perceptual tone categories. Three types of acoustic parameters are proposed: one related to pitch movement within the phrase final, one related to pitch reset prior to the phrase final, and one rel...
In this paper we propose methods of speech segmentation and unit characterization which are motivated by prosodic and physiological principles. In particular, we motivate and describe algorithms for unit-database creation on the basis of quasi-syllables and quasi-articulatory-gestures defined and parameterized purely by acoustic measurements. This...
This paper presents data from an analysis of a large conversational-speech corpus, showing evidence that voice quality, as measured on a continuum from pressed to breathy using a normalized amplitude quotient (NAQ), is varied consistently, and in much the same way as, but independently of, fundamental frequency, to signal paralinguistic information...
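The NAQ measure referred to above is conventionally computed per glottal cycle from the glottal flow and its derivative; a minimal sketch follows, assuming the flow has already been obtained (e.g., by inverse filtering) and that cycle boundaries are known. The raised-cosine pulse train is a placeholder, not data from the corpus.

```python
# Per-cycle normalized amplitude quotient (NAQ), using the standard definition
# NAQ = A_ac / (d_peak * T): A_ac is the peak-to-peak glottal flow amplitude in
# the cycle, d_peak the magnitude of the negative peak of its derivative, and
# T the period in seconds.
import numpy as np

def naq_per_cycle(glottal_flow, fs, cycle_starts):
    """Compute NAQ for each glottal cycle delimited by successive start samples."""
    d_flow = np.gradient(glottal_flow) * fs            # flow derivative [1/s]
    values = []
    for a, b in zip(cycle_starts[:-1], cycle_starts[1:]):
        cycle = glottal_flow[a:b]
        a_ac = cycle.max() - cycle.min()               # AC flow amplitude
        d_peak = -d_flow[a:b].min()                    # magnitude of negative peak
        T = (b - a) / fs                               # period [s]
        values.append(a_ac / (d_peak * T))
    return np.array(values)

# toy glottal flow: 100 Hz raised-cosine pulses (placeholder for inverse-filtered speech)
fs, f0 = 16000, 100.0
period = int(fs / f0)
one_cycle = np.zeros(period)
open_phase = int(0.6 * period)                          # 60% open quotient (illustrative)
one_cycle[:open_phase] = 0.5 * (1 - np.cos(2 * np.pi * np.arange(open_phase) / open_phase))
flow = np.tile(one_cycle, 20)
starts = np.arange(0, len(flow), period)
print("mean NAQ:", float(naq_per_cycle(flow, fs, starts).mean()))
```

Lower NAQ values correspond to more pressed phonation and higher values to breathier phonation, which is the continuum the corpus analysis refers to.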
This paper describes our research into the relation between paralinguistic information and acoustic features of monosyllabic, typically backchannel and filler utterances in Japanese. For this study, we extracted 141 examples of "hai", "un", and "ah" from recordings of spontaneous conversational speech from one Japanese female. These utteran...
With the aim of enabling concatenative synthesis of expressive speech, we herein report progress towards developing robust and automatic algorithms for paralinguistic annotation of very large recorded-speech corpora. In particular, we describe a method of combining robust acoustic-prosodic and cepstral analyses to locate centres of acoustic-phoneti...
International conference with published proceedings and peer review.
This paper describes a dataset of formant patterns measured in the steady states of recorded Japanese vowels. Five adult, male, native speakers of Japanese were selected from the "ETL-WD-I and II" balanced word dataset; and for each of the five vowels /i, e, a, o, u/, 22 different words were selected on the basis of consistently finding the lengthi...
In this paper we propose a more complete model of inter-speaker variability, which accounts quantitatively for structural differences of the vocal-tract (VT) and for learned differences in articulatory setting and in phoneme-specific strategy. This tripartite modelling is applied to a dataset of VT area-functions estimated by acoustic-to-articulato...
National conference with published proceedings and peer review.
Conference with published proceedings and peer review.