Article

Freddie Mercury—acoustic analysis of speaking fundamental frequency, vibrato, and subharmonics


Abstract

Freddie Mercury was one of the twentieth century's best-known singers of commercial contemporary music. This study presents an acoustical analysis of his voice production and singing style, based on perceptual and quantitative analysis of publicly available sound recordings. Analysis of six interviews revealed a median speaking fundamental frequency of 117.3 Hz, which is typically found for a baritone voice. Analysis of voice tracks isolated from full band recordings suggested that the singing voice range was 37 semitones within the pitch range of F#2 (about 92.2 Hz) to G5 (about 784 Hz). Evidence for higher phonations up to a fundamental frequency of 1,347 Hz was not deemed reliable. Analysis of 240 sustained notes from 21 a cappella recordings revealed a surprisingly high mean fundamental frequency modulation rate (vibrato) of 7.0 Hz, reaching the range of vocal tremor. Quantitative analysis utilizing a newly introduced parameter to assess the regularity of vocal vibrato corroborated its perceptually irregular nature, suggesting that vibrato (ir)regularity is a distinctive feature of the singing voice. Imitation of subharmonic phonation samples by a professional rock singer, documented by endoscopic high-speed video at 4,132 frames per second, revealed a 3:1 frequency-locked vibratory pattern of vocal folds and ventricular folds.
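The 37-semitone figure follows directly from the ratio of the two pitch limits; a minimal sketch (using the equal-tempered F#2 ≈ 92.5 Hz, which the abstract rounds to 92.2 Hz):

```python
import math

def semitone_span(f_low, f_high):
    """Number of equal-tempered semitones between two frequencies."""
    return 12 * math.log2(f_high / f_low)

# F#2 (~92.5 Hz in A440 tuning) up to G5 (~784 Hz)
print(round(semitone_span(92.5, 784.0)))  # → 37
```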


... The modulation frequency can be independent of fo or coupled with it in some rational fraction such as 3:2 or 2:1, in which case we speak of subharmonics instead of modulation [73]. To complicate matters further, irregular AM with variable frequency and amplitude does not create visible sidebands, but simply produces a noisy-looking spectrogram and an irregular, harsh-sounding voice quality that may look and sound similar to chaos (figure 4). ...
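The sideband behaviour described above can be verified numerically; a toy sketch with assumed carrier and modulation frequencies (not values from the paper), showing that regular AM places sidebands at fc ± fm:

```python
import numpy as np

sr = 8000                      # 1 s of signal at 8 kHz -> 1 Hz bin spacing
t = np.arange(sr) / sr
fc, fm = 440.0, 7.0            # carrier and modulation frequency (toy values)
am = (1 + 0.5 * np.sin(2 * np.pi * fm * t)) * np.sin(2 * np.pi * fc * t)

mag = np.abs(np.fft.rfft(am))
peaks = np.argsort(mag)[-3:]   # three strongest bins: fc and fc ± fm
print(sorted(int(p) for p in peaks))  # → [433, 440, 447]
```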
... When conceptually considering a nonsinusoidal FM, the spectrum of the fo contour can be searched for spectral peaks that correspond to FM frequencies. For instance, Herbst et al. [73] found two such peaks (two dominant frequencies) in the vibrato produced by Freddie Mercury. Two main metrics are typically computed: modulation rate (the vibrato frequency) and modulation extent (the vibrato amplitude). ...
... They can be produced by partly desynchronized, but still strongly coupled vocal folds or parts thereof that vibrate at harmonically related frequencies, which can be caused by the entrainment of two vibratory modes of the vocal folds [81][82][83], asymmetric tension on the two vocal folds [84][85][86][87], or source-filter interactions with supraglottal [51] or subglottal [53] resonances. Another possible origin of subharmonics is simultaneous frequency-locked vibration of two oscillators such as the vocal folds and the ventricular folds [73,[88][89][90] or aryepiglottic folds [91]. When subharmonics are caused by AM, the modulation depth can be defined as the difference in the amplitude of adjacent glottal cycles expressed as a proportion of the sum of the two amplitudes; in the case of FM, it is defined as the difference in the periods instead of amplitudes of adjacent cycles [92,93]. ...
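The modulation-depth definitions quoted above map directly onto a pair of ratios; a minimal sketch (function names are illustrative, not from the cited works):

```python
def am_modulation_depth(a1, a2):
    """AM depth: amplitude difference of two adjacent glottal cycles
    as a proportion of the sum of the two amplitudes."""
    return abs(a1 - a2) / (a1 + a2)

def fm_modulation_depth(t1, t2):
    """FM depth: the same ratio applied to the periods of adjacent
    cycles instead of their amplitudes."""
    return abs(t1 - t2) / (t1 + t2)

# Adjacent cycles with amplitudes 1.0 and 0.5 -> depth 1/3
print(am_modulation_depth(1.0, 0.5))
```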
Article
Full-text available
We address two research applications in this methodological review: starting from an audio recording, the goal may be to characterize nonlinear phenomena (NLP) at the level of voice production or to test their perceptual effects on listeners. A crucial prerequisite for this work is the ability to detect NLP in acoustic signals, which can then be correlated with biologically relevant information about the caller and with listeners’ reaction. NLP are often annotated manually, but this is labour-intensive and not very reliable, although we describe potentially helpful advanced visualization aids such as reassigned spectrograms and phasegrams. Objective acoustic features can also be useful, including general descriptives (harmonics-to-noise ratio, cepstral peak prominence, vocal roughness), statistics derived from nonlinear dynamics (correlation dimension) and NLP-specific measures (depth of modulation and subharmonics). On the perception side, playback studies can greatly benefit from tools for directly manipulating NLP in recordings. Adding frequency jumps, amplitude modulation and subharmonics is relatively straightforward. Creating biphonation, imitating chaos or removing NLP from a recording are more challenging, but feasible with parametric voice synthesis. We describe the most promising algorithms for analysing and manipulating NLP and provide detailed examples with audio files and R code in supplementary material. This article is part of the theme issue ‘Nonlinear phenomena in vertebrate vocalizations: mechanisms and communicative functions’.
... Finally, a number of sustained vowels were manually extracted from the above musical phrases creating a collection of 33 vowels for both Ninou and Bellou containing the phonemes /a/, /i/, /ε/, /u/ and /ɔ/ in relatively equal proportions. These sustained tones were used to acquire characteristics of the vibrato that is deemed important in providing identity to a singing voice (Herbst, et al., 2016). The sound stimuli are available online 1 . ...
... As mentioned above, the timbral features were complemented by a calculation of the basic vibrato characteristics from the isolated vowels, namely the rate, the extent and the regularity (Sundberg, 1995; Herbst et al., 2016). The rate is defined as the dominant modulation frequency in Hz, the extent as the maximum amplitude deviation from the mean in cents ((Δf/2)/fo) and the regularity as the deviation of the modulation pattern from a pure sine wave. ...
... The rate is defined as the dominant modulation frequency in Hz, the extent as the maximum amplitude deviation from the mean in cents ((Δf/2)/fo) and the regularity as the deviation of the modulation pattern from a pure sine wave. For the latter we have adopted the formula by Herbst et al. (2016), which estimates this deviation by comparing the amplitudes of the two most prominent frequency components (A1 and A2) of the vibrato spectrum (regularity = 1 − A2/A1). This metric ranges from 0, indicating the presence of two equally strong frequency components, to 1 for a vibrato featuring a pure sine-wave modulation pattern. ...
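The regularity metric can be sketched as follows, assuming a uniformly sampled fo contour; for simplicity the two most prominent components are taken to be the two largest spectrum bins, which real-world use would replace with proper windowing and peak picking:

```python
import numpy as np

def vibrato_regularity(f0_contour):
    """Regularity after Herbst et al. (2016): 1 - A2/A1, where A1 >= A2
    are the magnitudes of the two most prominent components of the
    fo-modulation spectrum. 1 = pure sinusoidal vibrato; 0 = two equally
    strong modulation frequencies."""
    contour = np.asarray(f0_contour, dtype=float)
    mag = np.abs(np.fft.rfft(contour - contour.mean()))
    a2, a1 = np.sort(mag)[-2:]          # two largest magnitudes, ascending
    return 1.0 - a2 / (a1 + 1e-12)      # guard against division by zero

# 1 s of a pure 7 Hz vibrato around 110 Hz, sampled at 100 Hz
t = np.arange(100) / 100.0
pure = 110 + 2 * np.sin(2 * np.pi * 7 * t)
print(vibrato_regularity(pure))         # close to 1
```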
... The experimental results of this algorithm on multiple data sets show that its classification accuracy is better than that of traditional methods such as the self-organizing map algorithm; literature [13] proposes a deep network built from stacked Boltzmann machines limited to binary random units, and the results show that the deep network model has strong generalization ability in the classification of binary speech and English signals; literature [14] introduces artificial neural networks into lung sound signal classification problems to help diagnose respiratory diseases. The performance of neural networks on the same data set is comparable to, and in some cases better than, that of other classification methods such as Gaussian mixture models; literature [15] constructs a speech English classifier based on deep belief networks (DBNs), and the network can mine audio features more deeply. ...
... Therefore, the Meddis model of inner hair cells can be described by formulas (11), (12), (13), and (14). ...
Article
Full-text available
To improve the effect of English learning in the context of smart education, this study combines speech coding to improve the intelligent speech recognition algorithm, builds an intelligent English learning system, and studies a coding strategy based on a psychoacoustic masking model that reflects the characteristics of the human ear. Moreover, this study analyzes in detail the basic principles and implementation process of this coding strategy and completes channel selection by calculating the masking threshold. In addition, this study verifies the effectiveness of the algorithm through simulation experiments. Finally, this study builds a smart speech recognition system based on this model and uses simulation experiments to verify the effect of smart speech recognition on English learning. To improve the voice recognition effect, band-pass filtering and envelope detection adopt the gammatone filter bank and the Meddis inner hair cell model from the mathematical model of the cochlear system; at the same time, the psychoacoustic masking effect model is introduced in the channel selection stage, which improves noise robustness and the recognition effect of smart speech. The analysis shows that the intelligent speech recognition system proposed in this study can effectively improve the effect of English learning, and in particular has a great effect on improving oral learning.
... The method of 3D speaker modeling based on data-driven synthesis uses digital image processing technology to extract features from digital images [8]. Literature [9] established an acoustic-to-articulatory inversion model based on generalized variable parameter hidden Markov models (GVP-HMM) to achieve 3D speaker modeling. Literature [10] modeled a 3D speaker from the anatomy of the head. ...
... In formula (9), ci is the context information at time i. In formula (10), the hidden state of the decoder at time i is computed, ci is the input of the decoder at time i, and f is the recurrent neural network. ...
Article
Full-text available
In order to improve the pronunciation accuracy of spoken English reading, this paper combines artificial intelligence technology to construct a correction model of the spoken pronunciation accuracy of AI virtual English reading. Moreover, this paper analyzes the process of speech synthesis with intelligent speech technology, proposes statistical parametric speech synthesis based on hidden Markov chains, and improves the system algorithm to make it an intelligent algorithm that meets the requirements of the correction system of spoken pronunciation accuracy of AI virtual English reading. Finally, this paper combines the simulation research to analyze the English reading, spoken pronunciation, and pronunciation correction of the intelligent system. From the experimental research results, the correction system of spoken pronunciation accuracy of AI virtual English reading proposed in this paper basically meets the needs of building such a system.
... This 1 : 2 entrainment has been explored, for example, in kargyraa throat singing [19,20]. Also 1 : 3 entrainment and chaotic behaviour have been observed between the ventricular and vocal folds in vivo [20,[168][169][170]. Entrainment ratios of 1 : 4 to 1 : 7, along with chaotic oscillations, were reported between the tissues of the aryepiglottic complex and the vocal folds in metal singing [22]. ...
Article
Full-text available
The theory of nonlinear dynamics was introduced to voice science in the 1990s and revolutionized our understanding of human voice production mechanisms. This theory elegantly explains highly complex phenomena in the human voice, such as subharmonic and rough-sounding voice, register breaks, and intermittent aphonic breaks. These phenomena occur not only in pathologic, dysphonic voices but are also explored for artistic purposes, such as contemporary singing. The theory reveals that sudden changes in vocal fold vibratory patterns and fundamental frequency can result from subtle alterations in vocal fold geometry, mechanical properties, adduction, symmetry or lung pressure. Furthermore, these changes can be influenced by interactions with supraglottal tract and subglottal tract resonances. Crucially, the eigenmodes (modes of vibration) of the vocal folds play a significant role in these phenomena. Understanding how the left and right vocal fold eigenmodes interact and entrain with each other, as well as their interplay with supraglottal tissues, glottal airflow and acoustic resonances, is essential for more sophisticated diagnosis and targeted treatment of voice disorders in the future. Additionally, this knowledge can be helpful in modern vocal pedagogy. This article reviews the concepts of nonlinear dynamics that are important for understanding normal and pathologic voice production in humans. This article is part of the theme issue ‘Nonlinear phenomena in vertebrate vocalizations: mechanisms and communicative functions’.
... [62]). Herbst et al. [63] described different terms used by researchers for vocal production modes associated with vibrating ventricular folds (sometimes called 'false' vocal folds), and vibration regimes of other supraglottal structures, including growl, dist, throat-singing and distortion. ...
Article
Full-text available
Music traditions worldwide are subject to remarkable diversity but the origins of this variation are not well understood. Musical behaviour is the product of a multicomponent collection of abilities, some possibly evolved for music but most derived from traits serving nonmusical functions. Cultural evolution has stitched together these systems, generating variable normative practices across cultures and musical genres. Here, we describe the cultural evolution of musical distortion, a noisy manipulation of instrumental and vocal timbre that emulates nonlinear phenomena (NLP) present in the vocal signals of many animals. We suggest that listeners’ sensitivity to NLP has facilitated technological developments for altering musical instruments and singing with distortion, which continues to evolve culturally via the need for groups to both coordinate internally and differentiate themselves from other groups. To support this idea, we present an agent-based model of norm evolution illustrating possible dynamics of continuous traits such as timbral distortion in music, dependent on (i) a functional optimum, (ii) intra-group cohesion and inter-group differentiation and (iii) groupishness for assortment and social learning. This account illustrates how cultural transmission dynamics can lead to diversity in musical sounds and genres, and also provides a more general explanation for the emergence of subgroup-differentiating norms. This article is part of the theme issue ‘Nonlinear phenomena in vertebrate vocalizations: mechanisms and communicative functions’.
... Subharmonic voicing is also possible for vocally healthy speakers. Subharmonics can be voluntarily produced as vocal fry [6] or as a singing technique [7]. Perceptually, the subharmonic voice presents a rough voice quality [8] or a lowered pitch [9,10,11]. ...
Preprint
Many voice disorders induce subharmonic phonation, but voice signal analysis is currently lacking a technique to detect the presence of subharmonics reliably. Distinguishing subharmonic phonation from normal phonation is a challenging task as both are nearly periodic phenomena. Subharmonic phonation adds cyclical variations to the normal glottal cycles. Hence, the estimation of subharmonic period requires a holistic analysis of the signals. Deep learning is an effective solution to this type of complex problem. This paper describes fully convolutional neural networks which are trained with synthesized subharmonic voice signals to classify the subharmonic periods. Synthetic evaluation shows over 98% classification accuracy, and assessment of sustained vowel recordings demonstrates encouraging outcomes as well as the areas for future improvements.
... Many of the fo estimation algorithms (also known as pitch detection algorithms), however, are designed to detect the true fundamental frequency of their input signals rather than the speaking fo, based on the assertion that the voice is a nearly periodic phenomenon. This assertion is appropriate for normal speech processing tasks because subharmonic phonation is rare in the vocally healthy population, although it can be voluntarily produced as vocal fry [14] or as a singing technique [15]. ...
Preprint
In clinical voice signal analysis, mishandling of subharmonic voicing may cause an acoustic parameter to signal false negatives. As such, the ability of a fundamental frequency estimator to identify speaking fundamental frequency is critical. This paper presents a sustained-vowel study, which used a quality-of-estimate classification to identify subharmonic errors and subharmonics-to-harmonics ratio (SHR) to measure the strength of subharmonic voicing. Five estimators were studied with a sustained vowel dataset: Praat, YAAPT, Harvest, CREPE, and FCN-F0. FCN-F0, a deep-learning model, performed the best both in overall accuracy and in correctly resolving subharmonic signals. CREPE and Harvest are also highly capable estimators for sustained vowel analysis.
... Thus, without losing its distinct identity, the mature Kazantzidis voice is characterised by statistically significant timbral changes: weaker high-frequency content, less inharmonicity and stronger harmonic-to-noise energy. At the same time, his vibrato, which also constitutes an important element of vocal identity [2], is significantly different at maturity: faster, more regular but less deep. Of the above, only the loss of spectral richness could be attributed to ageing [3]. ...
Poster
Full-text available
Stelios Kazantzidis is one of the most prominent singers of popular Greek music, with a career spanning from the 1950s till the end of the 20th century. His collaboration with the iconic composer Vassilis Tsitsanis shaped the post-war Greek music scene, and he later collaborated with most of the prominent Greek composers of popular music. One of his last recordings, before a long 12-year pause, was the album “Stin Anatoli” by Mikis Theodorakis in 1974. In this work we compare Kazantzidis’ timbral characteristics between his early ‘Tsitsanis period’ and his singing style featured in the ‘Stin Anatoli’ album. The vocal parts of 16 recorded compositions by Vassilis Tsitsanis made between 1956 and 1963 and of the 11 songs included in ‘Stin Anatoli’ were extracted using Demucs (v4) Music Source Separation (Défossez, 2021; Rouard et al., 2023), which is available on GitHub. Subsequently, musical phrases from each song were retained through manual editing and silent parts within the phrases were computationally eliminated. This resulted in 21 vocal phrases originating from the Tsitsanis collection and 11 vocal phrases originating from ‘Stin Anatoli’. Finally, a number of sustained vowels were manually extracted from the above musical phrases, creating a collection of 80 vowels from the Tsitsanis collection and 60 vowels from ‘Stin Anatoli’ containing the phonemes /a/, /i/, /ε/, /u/ and /ɔ/. These sustained tones were used to acquire characteristics of the vibrato, which is deemed important in providing identity to a singing voice (Herbst et al., 2016). An updated version of the Timbre Toolbox (Kazazis et al., 2021) was used for extracting harmonic audio features from both musical phrases and isolated vowels. Comparison between the two collections showed that the earlier Kazantzidis voice featured significant timbral differences in comparison to his more mature delivery.
In particular, the mature Kazantzidis voice features weaker high-frequency content and less inharmonicity, together with stronger harmonic-to-noise energy and a faster, more regular but less deep vibrato compared to his younger self.
... As suggested by these results, phase spaces and recurrence plots could be a visual aid in assessing dynamic vibrato behavior and a useful resource in vocal pedagogy. A possible application, for instance, would be the detection of vibrato tones with more than one sinusoid [24]. ...
... Researchers have created further statistical spectrum estimation approaches, such as the minimum mean square error (MMSE) log-spectral amplitude estimation method, the maximum likelihood (ML) spectral amplitude estimation method, and the maximum a posteriori (MAP) method. The Linear Predictive Coding (LPC) model and the Kalman filter were utilised in [13] to reduce noise and raise the signal-to-noise ratio of speech signals. The literature [14] provided further endpoint detection algorithms based on frequency-domain spectrum analysis of the voice signal, after using the Fourier transform to obtain its frequency-domain information. ...
Article
Full-text available
The use of deep learning to improve English speaking has seen tremendous development in recent years. This study evaluates the noise that is present in the English speech environment, employs a two-way search method to select the optimum feature set, and applies a quick correlation filter to remove redundant features in order to increase the accuracy of English voice feature identification. In addition, this article designs a low-pass filter in the complex cepstrum domain to filter the room impulse response in order to obtain the estimated value of the complex cepstrum of the original speech signal. After doing so, the authors transform this estimated value into the time domain in order to obtain the estimated value of the original speech signal. In addition, this paper proposes a corresponding noise elimination model for the purpose of eliminating noise from English speech in a reverberant environment. It also designs a complex cepstrum domain filter in order to conduct simulation research on the different characteristics of the reverberation signal and the pure speech signal in the complex cepstrum domain. In conclusion, this study develops an English voice feature recognition model that is founded on a deep neural network. Furthermore, this paper uses experimental research to validate the validity of the algorithm model that was developed in this study.
... Next, two screening methods were used, one after Zelinski and the other after McCowan, to further filter the interventions and conduct a simulation system experiment [9]. Through the built-in PocketSphinx speech recognition test in the Ubuntu system, the MVDR and SDBF beam methods have been proven to effectively improve the performance of remote speakers, and beamforming and filtering have been proven to yield better system performance [10]. ...
Article
Full-text available
Based on the method of the abnormal network traffic classification system of the CNN network, the traffic is encrypted according to the split-and-capture strategy, which makes it difficult to find the most important value globally. An unconventional network is proposed based on CNN. This method combines several substeps, such as image creation, image selection, sorting, and end-to-end structure; the automatic learning of indirect relations is detected from the required original input and output, and it is potentially applicable globally. The ideological and political courses of these distance universities are aimed at ensuring the political direction of college students and a comprehensive understanding of socialism in order to effectively conduct higher education courses. The ideological and political courses of a certain college have characteristics different from other courses. The teaching system provides students with an independent learning environment. Students can use the courses, teaching materials, text, drawings, and video information provided by the system to deepen their understanding and apply knowledge.
... Since the 1990s, based on HMM technology, the key research in keyword detection has been to combine other pattern recognition methods to improve performance and to improve the search and recognition algorithms to increase speed. During this period, CMU's School of Computer Science, MIT's Lincoln Laboratory, and Dragon Systems reported their research results [9][10][11]. At the same time, as keyword detection based on the filler model template became widely recognized, keyword detection based on large-vocabulary continuous speech recognition was proposed, mainly by taking the results of continuous speech recognition after acoustic decoding as input and then performing keyword detection [12,13]. This research is mainly based on the N-best structure. ...
Article
Full-text available
The keyword detection of Japanese speech in streaming media has a certain effect on our study of Japanese information and a certain promotion effect on Japanese teaching. Currently, there is a problem of stability in the detection model of Japanese speech keywords. In order to improve the detection effect of Japanese speech keywords in streaming media, based on SVM, this study constructed a detection model of Japanese speech keywords in streaming media based on support vector machine. Moreover, this study analyzes the problem of SVM probability output and the comprehensive problem of SVM confidence, etc. In addition, by comparing the effect of confidence synthesis with the arithmetic average method, we found that the confidence obtained by SVM can obtain a higher recognition rate under the same rejection rate and improve the overall performance of the system. Finally, this study uses the difference comparison test to analyze the performance of the model proposed in this study. The research results show that the algorithm proposed in this paper has good performance and can be used as a follow-up system algorithm.
... The literature used two acoustic characteristics, zero-crossing rate and short-term energy, to classify voice and music in broadcast signals [13]. The literature first divided the audio signal in TV into mute, signal with a music component and signal without a music component by using four audio characteristics: short-term energy, zero-crossing rate, pitch frequency, and spectral peak trajectory [14]. Then, it further divided the signal containing a music component into pure music, singing voice, and voice with a music background, and further divided the signal without a music component into pure audio and noisy audio. ...
Article
Full-text available
Existing speech recognition systems are only for mainstream audio types; there is little research on language types; the system is subject to relatively large restrictions; and the recognition rate is not high. Therefore, how to use an efficient classifier to make a speech recognition system with a high recognition rate is one of the current research focuses. Based on the idea of machine learning, this study combines the computational random forest classification method to improve the algorithm and builds an English speech recognition model based on machine learning. Moreover, this study uses a lightweight model and its improved model to recognize speech signals and directly performs adaptive wavelet threshold shrinkage and denoising on the generated time-frequency images. In addition, this study uses the EI strong classifier to replace the softmax of the lightweight AlexNet model, which further improves the recognition accuracy under a low signal-to-noise ratio. Finally, this study designs experiments to verify the model effect. The research results show that the effect of the model constructed in this study is good.
... But it is inconvenient to carry and test on-site. Moreover, the development cost of the upper-layer management software is also very high [15]. Although portable dynamic signal analyzers have successively appeared abroad, their functions are relatively limited and their performance cannot meet complex test requirements, so their application range is not very wide [16]. ...
Article
Full-text available
This study analyses the connotation of sound design in the later stages of film production and how to grasp the truth of art under subjective creative thinking, and then proposes multiple methods of sound element selection and organisation, as well as sound element combination modes, in order to improve the effect of sound design in the later stages of the film. Furthermore, this study incorporates digital and intelligent technologies to create a sound design system for the later stages of the film, examines a number of technologies, and selects the appropriate sound design. Finally, this article blends experimental research with system performance analysis. The experimental investigation shows that the sound design system based on computer intelligence suggested in this study has a specific influence.
... Because there is currently no relevant literature to merge multiple channels of analog voice signals, there is currently no literature on the synchronization of multiple voice signals. There are many methods of diversity combining, among which the most widely used are maximum ratio combining, selective combining, and equal gain combining [11]. These merging methods are equivalent to performing linear dimensionality reduction operations [12], and the three merging methods have their own advantages and disadvantages. ...
Article
Full-text available
In order to improve the effect of spoken English processing, it is necessary to improve the spoken English processing technology from the perspective of the characteristics of spoken English, combined with intelligent algorithms. This paper combines the intelligent speech analysis technology to improve the spoken English recognition technology and combines the actual and needs of English learning to improve the system algorithm. Moreover, this paper combines the intelligent speech analysis to construct the intelligent spoken English learning model structure and combines the statistical method and the intelligent evaluation method to analyze the model effect. After obtaining the system function structure, this paper designs experiments to verify the effect of the model proposed in this paper. From the experimental analysis results, it can be seen that the intelligent English speech analysis model proposed in this paper can play an important role in the learning of spoken English.
... The literature [14] proposed a disguised voice hidden telephone system. Moreover, due to the consideration of the real-time nature of the system, it combines the classic hiding method to successfully realize the secure transmission of real-time voice. ...
Article
Full-text available
It is necessary to study the application of digital technology in English speech feature recognition. This paper combines the actual needs of English speech feature recognition to improve the digital algorithm. Moreover, this paper combines fuzzy algorithm to analyze English speech features, analyzes the shortcomings of traditional algorithms, proposes the fuzzy digitized English speech recognition algorithm, and builds an English speech feature recognition model on this basis. In addition, this paper conducts time-frequency analysis on chaotic signals and speech signals, eliminates noise in English speech features, improves the recognition effect of English speech features, and builds an English speech feature recognition system based on digital means. Finally, this paper conducts grouping experiments by inputting students’ English pronunciation forms and counts the results of the experiments to test the performance of the system. The research results show that the method proposed in this paper has a certain effect.
... Measurements of SF0 range were made at the 95th percentile to reduce the impact of potential high/low frequency outliers and produce results more representative of the speaker's actual range (compared to reporting the absolute minimum and maximum), as seen in some studies measuring speaking SF0 variability [51,52] or SF0 range [53,54]. A decision was made not to exclude or separately analyse portions of connected speech featuring glottal fry. While glottal fry is associated with a lower SF0 than speech without glottal fry, it is also associated with listener perceptions of a speaking voice that is lower in pitch and less perceptually female/feminine than speech without glottal fry [35], and reducing glottal fry may have been a potential training target for individuals within the program. ...
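A percentile-based range measurement of this kind is straightforward to sketch; the choice of a central 95% interval (2.5th to 97.5th percentiles) is an assumption here, since the snippet says only "95th percentile" without specifying the convention:

```python
import numpy as np

def sf0_range(f0_values):
    """Speaking-fundamental-frequency range from the 2.5th and 97.5th
    percentiles (a central 95% interval) rather than absolute min/max,
    limiting the influence of stray high/low f0 estimates."""
    lo, hi = np.percentile(np.asarray(f0_values, dtype=float), [2.5, 97.5])
    return lo, hi
```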
Article
Introduction Gender affirming voice training is a service provided by speech language pathologists to members of the trans and gender diverse community. While there is some evidence to support the effectiveness of this training, the evidence base is limited by a lack of prospective studies with large sample sizes. Furthermore, there has been only limited research investigating the effectiveness of this training when delivered on intensive (compressed) schedules, even though such schedules are used in clinical practice and may have practical benefits such as increasing service access for this vulnerable population. Methodology This study aimed to investigate and compare the effectiveness of gender affirming voice training among 34 trans individuals presumed male at birth who shared a goal of developing a ‘female-sounding voice’. Among these 34 participants, 17 received their training on a traditional schedule (one 45-minute session per week over 12 weeks) and 17 on an intensive schedule (three 45-minute sessions per week over 4 weeks). Building on a previous mixed-methodological study which indicated that these two training groups were equally satisfied with training outcomes, the current study utilised a wide range of self-report, acoustic, and auditory-perceptual outcome measures (including self-ratings and listener-ratings of voice) to investigate training effectiveness. Discussion Results from this study indicated that both training programs were similarly effective, producing positive, statistically significant change among participants on a range of outcome measures. Participants in both groups demonstrated significant auditory-perceptual and acoustic voice change and reported increased satisfaction with voice, increased congruence between gender identity and expression, and a reduction in the negative impact of voice concerns on everyday life.
However, as has been the case in past studies, training was not sufficient for all participants to achieve their specific goal of developing a consistently ‘female-sounding voice’. Conclusion This study provides evidence to suggest that gender affirming voice training for transfeminine clients may be similarly effective whether delivered intensively or traditionally. It also supports the practice of using a wide range of outcome measures to gain holistic insight into client progress in gender affirming voice training programs.
... From the perspective of language skills, literature [4] argues that oral English class exercises should attend to the selection and use of words and sentences, sentence cohesion and organization, shifts of style, rhetorical technique, and speech strategies. Literature [5] takes lesson type as its starting point and, by comparing practice methods across lesson types, concludes that classroom practice in oral English should include not only phonetic exercises but also word- and sentence-formation usage, so that through actual communicative exercises students can truly appreciate the charm of spoken English. ...
Article
Full-text available
Spoken English practice requires a combination of listening, speaking, reading, and writing, of which listening and speaking are the most difficult. To improve a learner's speaking ability, spoken English pronunciation needs to be corrected promptly; however, the workload of manual evaluation is too large, so intelligent methods for spoken-language recognition are needed. Based on the needs of spoken English pronunciation correction, this paper combines computer-based English speech recognition technology to construct a spoken English recognition and correction model, studies English speech recognition with coding technology, and builds a spoken English practice system around the actual needs of spoken English practice. Finally, the paper verifies the reliability of the system through experimental research, providing a reliable means for subsequent intelligent learning of spoken English.
... Digital signal processing is the use of computers or dedicated processing equipment to collect, transform, filter, estimate, enhance, compress, and identify signals in digital form, obtaining signals that meet people's needs. Digital signal processing (DSP) is an emerging discipline that draws on many fields and is widely applied [11]. In mathematics, for example, calculus, probability and statistics, stochastic processes, and numerical analysis are all basic tools of digital signal processing, which is also closely related to network theory, signals and systems, cybernetics, communication theory, and fault diagnosis. ...
Article
Full-text available
To improve the effect of English listening teaching, this paper analyzes current problems in English listening teaching in light of its interactive requirements. According to the actual needs of an intelligent English teaching system, it studies a higher-order cumulant signal-to-noise ratio estimation method and improves the algorithm through data analysis to obtain an intelligent algorithm suitable for English listening teaching. The paper then uses this algorithm to construct an English listening teaching system based on a multimedia intelligent embedded processor and applies the embedded processor to intelligent English listening teaching. Finally, the paper builds an intelligent system to address current problems in English listening teaching and improve the effect of English teaching.
... Furthermore, even if such parallel data is gathered, most voice conversion algorithms still require the training data to be time-aligned. Alignment methods inevitably introduce errors, so overcoming temporal-alignment problems requires more sophisticated processes such as careful corpus preparation or manual correction [5]. ...
Article
Full-text available
Intelligent music teaching is the direction of coming reforms in music teaching methods. To improve the intelligence of music teaching, this paper studies speech recognition technology. A speech conversion system based on Multiscale Star GAN extracts multiscale features at different levels from the global features of musical utterances, enhancing the detail of the converted speech, and uses residual connections to alleviate vanishing gradients so the network can be trained more deeply. After improving the speech recognition algorithm, the paper constructs a music teaching system based on speech recognition and artificial intelligence to meet the needs of music teaching, and designs its functional modules. Finally, the system's performance is evaluated through teaching experiments, whose analysis shows that the system achieves a good teaching effect.
... To describe the characteristics of the fatigue state, the optimal descriptive features are selected by comparing different characteristics, and the information from different features is integrated and optimized to find an optimal feature set, achieving completeness and complementarity of the fatigue information (Herbst et al. 2017). Feature extraction followed by classification is a typical speech emotion recognition paradigm. ...
Article
Full-text available
To improve the effect of intelligent language translation, this paper analyzes the problems with the MSE cost function used by most current DNN-based speech enhancement algorithms and proposes a deep-learning speech enhancement algorithm based on perceptually relevant cost functions. The paper embeds suppression-gain parameter estimation into the architecture of a traditional speech enhancement algorithm and, combining suppression gain with deep learning, reduces the relationship between the noisy and enhanced speech spectra to a simple multiplication, on which an intelligent language translation system is built. The paper evaluates the system's translation quality, analyzes the results, and verifies the model's performance with simulation tests. The experimental results show that the deep-learning-based intelligent language translation system performs well.
... Estimates of the extent and rate of intensity modulation would not likely be improved by using the manual analyses from previous studies due to the need to visually identify periodic maximum and minimum peaks [15,16]. However, analysis of the extent of intensity modulation might be improved using estimates of the coefficient of variation of amplitude as reported in Lester and Story [38], and analysis of the rate of intensity modulation improved using spectral analysis of the intensity trace as reported in Herbst, Hertegard, Zangger-Borch, and Lindestad [54] because these analyses do not require identification of peaks. ...
Article
Full-text available
Purpose Studies on medical and behavioral interventions for essential vocal tremor (EVT) have shown inconsistent effects on acoustical and perceptual outcome measures across studies and across participants. Remote acoustical and perceptual assessments might facilitate studies with larger samples of participants and repeated measures that could clarify treatment effects and identify optimal treatment candidates. Furthermore, remote acoustical and perceptual assessment might allow clinicians to monitor clients’ treatment responses and optimize treatment approaches during telepractice. Thus, the purpose of this study was to evaluate the accuracy of remote signal transmission and recording for acoustical and perceptual assessment of EVT. Method Simulations of EVT were produced using a computational model and were recorded using local and remote procedures to represent client- and clinician-end recordings, respectively. Acoustical analyses measured the extent and rate of fundamental frequency (fo) and intensity modulation to represent vocal tremor severity and the cepstral peak prominence (CPPS) to represent voice quality. The data were analyzed using repeated measures analysis of variance (ANOVA) with recording as the within-subjects factor and sex of the computational model as the between-subjects factor. Results There was a significant main effect of recording on the rate of fo modulation and significant interactions of recording and sex for the extent of intensity modulation, rate of intensity modulation, and CPPS. Post hoc pairwise comparisons and analysis of effect size indicated that recording procedures had the largest effect on the extent of intensity modulation for male simulations, the rate of intensity modulation for male and female simulations, and the CPPS for male and female simulations. 
Despite having disabled all known software and computer audio enhancing options and having stable ethernet connections, there was inconsistent attenuation of signal amplitude in remote recordings that was most problematic for samples with a breathy voice quality but also affected samples with typical and pressed voice qualities. Conclusions Acoustical measures that correlate to perception of vocal tremor and voice quality were altered by remote signal transmission and recording. In particular, signal transmission and recording in Zoom altered time-based estimates of intensity modulation and CPPS with male and female simulations of EVT and magnitude-based estimates of intensity modulation with male simulations of EVT. In contrast, signal transmission and recording in Zoom minimally altered time- and magnitude-based estimates of fo modulation with male and female simulations of EVT. Therefore, acoustical and perceptual assessments of EVT should be performed using audio recordings that are collected locally on the participant- or client-end, particularly when measuring modulation of intensity and CPP or estimating vocal tremor severity and voice quality. Development of procedures for collecting local audio recordings in remote settings may expand data collection for treatment research and enhance telepractice.
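The peak-free modulation measures mentioned in the context above (coefficient of variation of amplitude for extent; spectral analysis of the intensity trace for rate) can be sketched in a few lines. This is a minimal illustration under the assumption of a uniformly sampled intensity trace; `intensity_modulation_stats` is a hypothetical helper, not code from any of the cited studies:

```python
import numpy as np

def intensity_modulation_stats(intensity, fs, fmin=2.0, fmax=12.0):
    """Estimate modulation extent and rate of an intensity trace sampled at fs Hz.

    Extent: coefficient of variation (std/mean), which needs no peak-picking.
    Rate: frequency of the largest spectral component in a plausible
    tremor/vibrato band (fmin..fmax Hz) of the detrended trace.
    """
    x = np.asarray(intensity, dtype=float)
    cv = np.std(x) / np.mean(x)                      # extent as a dimensionless ratio
    d = x - np.mean(x)                               # remove DC before spectral analysis
    spec = np.abs(np.fft.rfft(d * np.hanning(len(d))))
    freqs = np.fft.rfftfreq(len(d), 1.0 / fs)
    band = (freqs >= fmin) & (freqs <= fmax)         # restrict search to tremor band
    rate = freqs[band][np.argmax(spec[band])]
    return cv, rate

# Synthetic trace: mean level 70 with a 5 Hz, +/-3 modulation, sampled at 100 Hz
fs = 100.0
t = np.arange(0.0, 2.0, 1.0 / fs)
trace = 70.0 + 3.0 * np.sin(2 * np.pi * 5.0 * t)
cv, rate = intensity_modulation_stats(trace, fs)
```

On the synthetic trace, `cv` comes out near 0.03 and `rate` at 5 Hz, matching the generating modulation; real recordings would of course be noisier and may need smoothing first.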
... Combining the advantages of compressed sensing theory with the characteristics of voice signals, much research internationally has combined voice compression with compressed sensing, achieving a series of results. The literature [17] successfully applied CS theory to voice compression coding. The literature [18] studied the sparsity of voice signals in depth based on an excitation–vocal tract model and, combined with compressed sensing theory, made a breakthrough in processing sparse excitation signals. ...
Article
Full-text available
The high complexity of deep neural networks hinders their deployment on mobile devices, motivating model compression and acceleration. To improve the operation of a voice recognition acoustic compression system, this paper improves on the traditional neural network by stitching together the transfer features of multiple deep convolutional neural networks, obtaining a more discriminative feature representation than the transfer-learning features of a single convolutional network. The paper then constructs a voice recognition acoustic compression system based on deep convolutional neural networks according to actual needs. After building the system framework, the performance and recognition accuracy of the voice acoustic system are studied from two perspectives. The experimental findings show that the system performs well: its voice data processing time is 111 s and its average accuracy is 94.1%.
... Pitch varied, remaining between medium and medium-to-high across all decades. The Jovem Guarda, an expression of the Brazilian rock movement, used the electric guitar in the 1960s 18,19, which may explain the rise in frequency in the songs and the presence of a metallic quality in the voice 20,21. In the 1970s, 1980s, and 1990s, the variation is driven by the themes of the songs, spoken singing, and comprehension of the lyrics 9. ...
Article
Objective: To describe the voice of the singer Roberto Carlos through auditory-perceptual evaluation of selected parameters in songs released across the 1960s to the 1990s. Methods: Eight songs representative of the singer's career were selected for descriptive auditory-perceptual voice evaluation, two from each decade. Results: Roberto Carlos maintained pneumophonoarticulatory coordination; loudness varied from adequate to strong; pitch varied from medium-to-high to medium; articulation was precise; vocal attack varied from hard to soft; the voice lacked brilliance; laryngopharyngeal resonance showed the greatest variation, appearing with compensatory nasal focus, marked nasal focus, and slight nasal focus; the vocal register was modal chest, without projection; vibrato was absent; tessitura was restricted; and vocal quality was adapted, adapted with tension, or adapted with slight breathiness. Conclusion: In the auditory-perceptual evaluation, some characteristics remained unchanged, such as pneumophonoarticulatory coordination, precise articulation, modal chest register, the lack of brilliance and projection, the absence of vibrato, and the restricted tessitura. There was variation in pitch, loudness, and vocal attack, and resonance was characterized as laryngopharyngeal with variation in nasal focus. The greatest changes observed in the singer's voice over the decades stem from the variety of musical genres he sang.
... In popular singing, vibrato is often used intentionally as a vocal effect to modulate straight tones and create emotion [20]- [22]. Other studies analyzed vibrato as an individual characteristic in order to identify a pop singer [23]- [25]. Vibrato can be analyzed as a more integral, essential parameter in classical singing. ...
Article
Vibrato is one of the most frequently investigated parameters in singing. Most studies examine a very limited number of sound samples to answer specific questions. In the novel approach of our interdisciplinary vibrato project, a sophisticated database of 1723 sound samples was compiled, containing examples from 30 operas, one oratorio and 73 vocal exercises. Single tones undisturbed by orchestral sound were selected in the vocal registers soprano, tenor, baritone and bass according to a standardized classification system dividing voices into either a lyric or dramatic voice type. Various algorithms were implemented in a Matlab environment to automatically evaluate sound samples, with respect to vibrato extent and rate. An AM/FM demodulation scheme is presented with postprocessing required for automatic calculation of these two parameters. The vibrato extent was higher for the dramatic compared to the lyric voice type in soprano and tenor voices. In contrast, the extent of lyric baritones was significantly higher than for dramatic ones. Moreover, the vibrato rate was slightly lower for the dramatic voice type in all vocal registers. It turned out that the classification of bass voices into lyric and dramatic, unlike the other vocal registers, was not scientifically reproducible with our methods. Further vibrato results regarding aging, intra-individual and general stylistic development over time are presented and their essential relevance for vocal pedagogy are discussed.
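As a rough illustration of extracting the two parameters above (vibrato rate and extent) from a fundamental-frequency contour, one can work on a log-frequency (cents) scale and pick the dominant spectral component in the typical vibrato band. This is only a sketch loosely inspired by the AM/FM-demodulation idea, not the authors' Matlab pipeline; `vibrato_rate_extent` is a hypothetical name:

```python
import numpy as np

def vibrato_rate_extent(fo, fs):
    """Estimate vibrato rate (Hz) and extent (+/- cents) from an fo contour
    sampled at fs Hz.

    The contour is converted to cents so that extent is measured on a musical
    (log-frequency) scale, the carrier pitch is removed as the mean, and the
    dominant component in the 3-10 Hz vibrato band is taken as the vibrato.
    """
    cents = 1200.0 * np.log2(np.asarray(fo, dtype=float))
    d = cents - np.mean(cents)                    # remove the carrier pitch
    n = len(d)
    spec = np.abs(np.fft.rfft(d)) * 2.0 / n       # single-sided amplitude spectrum
    freqs = np.fft.rfftfreq(n, 1.0 / fs)
    band = (freqs >= 3.0) & (freqs <= 10.0)       # typical vibrato rate range
    k = np.argmax(spec[band])
    rate = freqs[band][k]
    extent = spec[band][k]                        # cents, half peak-to-peak
    return rate, extent

# Synthetic contour: A4 (440 Hz) with 6 Hz vibrato and +/-100 cents extent
fs = 200.0
t = np.arange(0.0, 2.0, 1.0 / fs)
fo = 440.0 * 2.0 ** ((100.0 * np.sin(2 * np.pi * 6.0 * t)) / 1200.0)
rate, extent = vibrato_rate_extent(fo, fs)
```

On this clean synthetic contour the estimate recovers 6 Hz and about 100 cents; with real, irregular vibrato the spectral peak broadens, which is exactly why regularity measures such as the one in the target article become interesting.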
Article
Introduction: Effective communication is a key feature of vocal music. Singers can communicate during singing by changing their voice qualities to express emotion. Performers apply varying standards of acceptable voice quality depending on musical genre. Vocal effects are types of voice qualities that have historically been perceived as abusive by some teachers of singing (ToS) and speech-language pathologists (SLPs). This study investigates the perceptions of vocal effects in professional and nonprofessional listeners (NPLs). Methods: Participants (n = 100) completed an online survey and were divided into four professional groups: classical ToS, contemporary ToS, SLPs, and NPLs. Participants first completed an identification task to assess their ability to identify the use of a vocal effect. Second, participants analyzed a singer performing a vocal effect, rated their preferences toward the effect, and gave objective performance ratings on a Likert scale. Finally, participants were asked whether they had concerns about the singer's voice; if yes, they were asked whether they would refer the singer to an SLP, a ToS, or a medical doctor (MD). Results: Statistically significant differences were observed in SLPs' ability to identify the use of vocal effects compared with classical ToS (P = 0.01) and contemporary ToS (P = 0.001), and in NPLs compared with contemporary ToS (P = 0.009). NPLs reported a statistically lower rate of concern than professional listeners (P = .006). Statistically significant differences were found when comparing performance ratings by preference for the vocal effect whenever the comparison spanned more than one Likert interval, with listeners giving higher performance ratings if they reported higher preference ratings. Finally, no significant differences were identified when comparing referral scores by occupation.
Conclusions: The findings support the presence of specific biases toward the use of vocal effects, although no bias was found in management and care recommendations. Future research is recommended to investigate the nature of these biases.
Article
Full-text available
Objectives/Hypothesis: Vibrato is a core aesthetic element in singing. It varies considerably by both genre and era. Though studied extensively in Western classical singing over the years, there is a dearth of studies on vibrato in contemporary commercial music. In addressing this research gap, the objective of this study was to find and investigate common crossover song material from the opera, operetta, and Schlager singing styles from the historical early 20th to the contemporary 21st century epochs. Study Design/Methods: A total of 51 commercial recordings of two songs, “Es muss was Wunderbares sein” by Ralph Benatzky, and “Die ganze Welt ist himmelblau” by Robert Stolz, from "The White Horse Inn" ("Im weißen Rößl") were collected from opera, operetta, and Schlager singers. Each sample was annotated using Praat and analyzed in a custom Matlab- and Python-based algorithmic approach of singing voice separation and sine wave fitting novel to vibrato research. Results: With respect to vibrato rate and extent, the three most notable findings were that (1) fo and vibrato were inherently connected; (2) Schlager, as a historical aesthetic category, has unique vibrato characteristics, with higher overall rate and lower overall extent; and (3) fo and vibrato extent varied over time based on the historical or contemporary recording year for each genre. Conclusions: Though these results should be interpreted with caution due to the limited sample size, conducting such acoustical analysis is relevant for voice pedagogy. This study sheds light on the complexity of vocal vibrato production physiology and acoustics while providing insight into various aesthetic choices when performing music of different genres and stylistic time periods. 
In the age of crossover singing training and commercially available recordings, this investigation reveals important distinctions regarding vocal vibrato across genres and eras that bear beneficial implications for singers and teachers of singing.
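The sine-wave-fitting step named above can be approximated with an ordinary least-squares fit on a sin/cos basis at a candidate vibrato rate. This sketch assumes the rate is already known (the study's actual singing-voice-separation and fitting pipeline is more elaborate); `fit_vibrato_sine` is an illustrative name:

```python
import numpy as np

def fit_vibrato_sine(t, cents, rate_hz):
    """Least-squares fit of cents(t) ~ a + b*sin(wt) + c*cos(wt) at a fixed
    vibrato rate.  Returns (offset, extent) with extent = sqrt(b^2 + c^2),
    i.e. the +/- amplitude in cents regardless of the vibrato's phase.
    """
    w = 2.0 * np.pi * rate_hz
    # Design matrix: constant term plus quadrature pair at the vibrato rate
    A = np.column_stack([np.ones_like(t), np.sin(w * t), np.cos(w * t)])
    coef, *_ = np.linalg.lstsq(A, cents, rcond=None)
    a, b, c = coef
    return float(a), float(np.hypot(b, c))

# Synthetic tone: ~C#6 region (5500 cents re 1 Hz is arbitrary here),
# 5.5 Hz vibrato, +/-80 cents, unknown phase
t = np.arange(0.0, 1.0, 0.005)
cents = 5500.0 + 80.0 * np.sin(2 * np.pi * 5.5 * t + 0.3)
offset, extent = fit_vibrato_sine(t, cents, 5.5)
```

Fitting the quadrature pair (sin and cos) rather than a single phased sinusoid keeps the problem linear, so no iterative optimizer is needed; on this noiseless example the fit recovers the 80-cent extent and the 5500-cent offset essentially exactly.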
Article
This study explores the content and pertinence of a curricular framework for training popular music (PM) singing teachers. The author conducted three extensive literature review studies over two years via an action research framework to investigate common areas of interest for potential pedagogical strategy. She developed a draft curricular topic framework, incorporating five PM-specific areas: vocal health and hygiene, PM stylistic advice, microphones/audio technology, resonance and breathing technique. Five international PM voice teaching representatives took part in the research as focus group participants and provided feedback on the curriculum content. Data from this feedback were analysed in an iterative cycle process and adjustments suggested. Findings from this study indicate that such a curriculum would be both necessary and timely. Suggestions were made to amend some elements and to place greater emphasis on aesthetic artistry, cultural context and emotional intent.
Article
Full-text available
Drawing on Nietzsche’s dichotomy between the Apollonian and Dionysian principles pertaining to classic tragedy, in this article, I consider the poem “Fan mail Freddy Mercury” by Marlene van Niekerk, from Kaar (2013) from that perspective. The poem deals with the rock musician Freddy Mercury and from my reading is shown to be exemplifying the Dionysian spirit through Mercury’s movements, his performance and his vocal range. From my reading of the text it is evident that there was a strong fusion of the Apollonian and Dionysian in the life and work of Freddy Mercury. Despite his rebellious spirit and his negation of the faith of his childhood, culture dictated that he be buried in his faith. Van Niekerk, with her insight into the philosophy of Nietzsche, adds another dimension to the poem when she invokes Zarathustra, Nietzsche’s prophetic figure, to be distinguished from the historical god of the Zoroaster faith. The metaphorical language used by the poet suggests her Dionysian playfulness.
Article
Intelligent music teaching is the direction of coming reforms in music teaching methods. To improve the intelligence of music teaching, this paper studies speech recognition technology in the context of optic communication. A speech conversion system based on Multi-Scale Star GAN extracts multiscale features at different levels from the global features of musical utterances through its multi-scale structure, enhancing the detail of the converted speech, and uses residual connections to alleviate vanishing gradients so the network can be trained more deeply. After improving the speech recognition algorithm, the paper constructs a music teaching system based on speech recognition and artificial intelligence to meet the needs of music teaching, and designs its functional modules. Finally, the system's performance is evaluated through teaching experiments, whose analysis shows that it achieves a good teaching effect.
Article
Full-text available
Purpose Vocal vibrato is a singing technique that involves periodic modulation of fundamental frequency (fo) and intensity. The physiological sources of modulation within the speech mechanism and the interactions between the laryngeal source and vocal tract filter in vibrato are not fully understood. Therefore, the purpose of this study was to determine if differences in the rate and extent of fo and intensity modulation could be captured using simultaneously recorded signals from a neck-surface vibration sensor and a microphone, which represent features of the source before and after supraglottal vocal tract filtering. Method Nine classically-trained singers produced sustained vowels with vibrato while simultaneous signals were recorded using a vibration sensor and a microphone. Acoustical analyses were performed to measure the rate and extent of fo and intensity modulation for each trial. Paired-samples sign tests were used to analyze differences between the rate and extent of fo and intensity modulation in the vibration sensor and microphone signals. Results The rate and extent of fo modulation and the extent of intensity modulation were equivalent in the vibration sensor and microphone signals, but the rate of intensity modulation was significantly higher in the microphone signal than in the vibration sensor signal. Larger differences in the rate of intensity modulation were seen with vowels that typically have smaller differences between the first and second formant frequencies. Conclusions This study demonstrated that the rate of intensity modulation at the source prior to supraglottal vocal tract filtering, as measured in neck-surface vibration sensor signals, was lower than the rate of intensity modulation after supraglottal vocal tract filtering, as measured in microphone signals. The difference in rate varied based on the vowel. These findings provide further support of the resonance-harmonics interaction in vocal vibrato. 
Further investigation is warranted to determine if differences in the physiological source(s) of vibrato account for inconsistent relationships between the extent of intensity modulation in neck-surface vibration sensor and microphone signals.
Article
To improve students' pronunciation in the music classroom, this paper draws on computer speech simulation technology, combining speech recognition and speech feature extraction to summarize the acoustic parameters of speech and the human ear's perception of that information, and establishes physical and digital models of the speech signal. Through analysis of voice signals, the paper selects several acoustic parameters that reflect individual characteristics and studies methods for extracting and adjusting them, in order to construct a pronunciation correction system for music students based on computer voice simulation. Finally, experiments verify the system's performance. The results show that the system has practical value and enriches the expressiveness of machine speech; the speech synthesis process is simple and effective, meeting current music teaching needs.
Article
To build an efficient translation system, this paper constructs a corpus translation system based on Web Services. It builds a network term detection system based on machine learning algorithms, expands the corpus data with the support of a crawler system, and uses Web retrieval translation technology. To address sentence-length variation in English abstracts, the paper proposes a method for obtaining standard sentence-length variation based on edit distance and SVM ranking. Based on the requirements, the paper designs the architecture and data integration process of the data integration system, details the design and implementation of each system module, proposes a performance optimization plan, and combines translation requirements to construct the Web Services-based corpus translation system. Finally, experiments verify the model's performance; the results show that the system works well in practice.
Article
Full-text available
This article discusses the concept of musical nuances from a process-oriented perspective, with a particular emphasis on the aesthetic experience of hooks in Western popular music. First, the text elaborates on the particularities of nuances from the perspective of cognitive psychology. Second, it highlights their importance for musical interpretation, characterization, memorization, and valuation. Third, it critically reflects on analytical approaches to rhythmic and melodic nuances and explores alternative methods for analyzing such microscopic subtleties in the context of musical hooks. Fourth, analytical examples examine nuance-related intricacies in song phrases as processes regarding the aesthetic experience of increasing and decreasing intensity, tension, and motion. Finally, the findings and theoretical considerations are discussed in the broader context of mainstream popular music analysis.
Article
The traditional English teaching mode relies largely on rote memorization of textbooks, neglects the training of oral expression skills, and offers students no intelligent guidance. Using machine learning as the core system algorithm, this paper combines the CA-IAFSA algorithm to construct an artificial intelligence-based English learning system. The system applies image recognition technology, introduces population and tribal pheromones, and adopts multiple-ant-colony planning with a dual pheromone feedback strategy. The heuristic information search strategy, pheromone update strategy, and state transition probability of the basic ant colony algorithm are improved, and the MACDPA path planning algorithm is proposed to enable intelligent analysis of English textbook images. After constructing the model, its performance is studied using controlled experiments and mathematical statistics. The results show that the model performs well in assisted teaching and intelligent translation and meets the expected requirements.
Article
Feature recognition of spoken Japanese is an effective vehicle for Sino-Japanese communication. Most existing intelligent translation devices convert only English into other languages, and some Japanese translation systems suffer from problems with accuracy and real-time performance. This study therefore uses support vector machines to recognize the input features of spoken Japanese, improving traditional algorithms to suit the needs of spoken-language recognition. Improved spectral subtraction based on spectral entropy is used for enhancement processing, the Mel filter bank is modified, and several improved MFCC feature parameters are introduced. An improved feature recognition algorithm suited to the system is selected, and experiments on input feature recognition of spoken Japanese are conducted on the basis of this model. The results show improved recognition speed and accuracy; the model meets the system requirements and can serve as a reference for subsequent related research.
Article
Unlike Chinese, English places heavy emphasis on syllable stress, so recognizing emotion in English speech plays an important role in learning the language. This study applies transfer learning to English speech emotion recognition. An acoustic model based on weight transfer is trained under two strategies: single-stage training and two-stage training. The performance of a CNN-based English speech emotion recognition model is compared with that of the proposed model, and the comparison data are plotted as statistical graphs. The results show that transfer learning offers advantages over other algorithms for English speech emotion recognition and can be applied to English models in subsequent teaching and real-time translation research.
Article
The performance of speech recognition systems for English classroom teaching is strongly affected by the surrounding environment: interfering signals seriously degrade the quality and intelligibility of the speech signal and thus the performance of far-field recognition. For word-order detection in English classroom teaching, this paper proposes an analysis model based on block coding and an improved genetic algorithm. For DNN-based single-channel speech enhancement, it proposes PDNNs and PLSTMs to address the severe performance degradation of prototype DNN enhancement at low signal-to-noise ratios. The method decomposes the enhancement task into multiple subtasks, with each completed subtask providing prior knowledge for the next so that subsequent subtasks can learn their targets better. Overall, the experimental results confirm the reliability of the model constructed in this paper.
Article
The accelerating process of international integration places new demands on college students' communicative and English abilities, so it is necessary to improve students' cross-cultural communicative competence through English teaching. This paper combines machine learning and fuzzy mathematics to build an evaluation model of English cross-cultural communication ability. Starting from basic assumptions about evaluating college students' oral communication ability, a baseline evaluation model is constructed; factor analysis and correlation analysis are then used to verify the model's hypotheses and obtain an optimized version. After hypothesis testing and a series of statistical analyses, an evaluation system for college students' oral communication ability is obtained. Finally, survey-based testing shows that the evaluation model performs well.
Article
The posterior probability measures widely used in English speech recognition cannot consistently assess pronunciation quality across different phonemes, and the acoustic modeling used for recognition is mismatched with the evaluation target. To improve the evaluation of English pronunciation quality in colleges and universities, this article uses artificial emotion recognition and a high-speed hybrid model to analyze and filter the various kinds of clutter that degrade speech quality and thus improve students' English speech recognition. Exploiting the different statistical distributions of clutter and target in the data, clutter is suppressed on the basis of its measured distribution characteristics, improving target detection performance and overcoming the limitations of clutter suppression in traditional voice detection systems. To study the model's effect on pronunciation quality evaluation and English teaching, a controlled experiment is designed to analyze its performance. The results show that the model performs well.
Article
The difficulty of obtaining corpus features is one factor hindering the development of neural machine translation. To improve English intelligent translation, this paper improves a multi-objective optimization algorithm, based on machine learning, to construct an English intelligent translation system. Parallel and monolingual corpora are used for model training, and a semi-supervised neural machine translation method is applied, with detailed analysis of the data processing path, node distribution, and data processing flow. Data-dependent regularization terms, derived from the probabilistic nature of the neural machine translation model, are applied to the monolingual corpus to aid training. Finally, experiments verify the model's performance; the results show that the translation model is highly intelligent and can meet practical translation needs.
Article
English reading plays an important role in promoting both oral English and comprehensive English ability, but the traditional online reading mode is of limited effectiveness. To address these shortcomings, this article builds a system on artificial intelligence algorithms combined with a spoken-language spectrum algorithm. Based on practical needs, endpoint detection and judgment criteria using spectral entropy information are proposed, a mathematical model of knowledge forgetting is established, and an intelligent memory algorithm is derived to guide students in personalized learning. To verify the model, students in an experimental class and a control class are compared on spoken pronunciation and comprehensive English scores after the experiment. The results show that the AI-based English multimodal online reading platform is effective and can improve students' English scores.
Article
Artificial intelligence speech recognition is an important direction in the field of human-computer interaction. Using speech recognition to assist teachers in correcting spoken English pronunciation can help students learn without the constraints of place, time, or teacher availability. This paper improves and analyzes speech recognition algorithms based on AI speech recognition technology and adopts the effective algorithms in an artificial intelligence model. Building on phoneme-level speech error correction, it introduces the construction and training of acoustic models and then elaborates the basic process of speech segmentation, including front-end processing and feature parameter extraction. A controlled experiment verifies and analyzes the AI speech recognition correction model, and the results show that the proposed method is effective.
Article
The traditional English online teaching model is limited by teaching location and the difficulty of online supervision, which prevents teachers from monitoring students. To improve the ability of online English teaching to supervise students and recognize their status, this paper proposes an online teaching model based on artificial intelligence, using a positioning method built on an improved deep belief network for real-time position control and status recognition of students during online learning. Intelligent algorithms are combined to build the model structure, and performance tests show that the model performs well. On this basis, the recognition effect of the AI-based student online learning recognition model is evaluated; the results show that the proposed model is effective and meets the practical needs of intelligent teaching.
Article
Full-text available
Web 2.0 is transforming project management in organizations by improving communication and collaboration. The new generation of web-based collaborative tools provides a much better experience than traditional software packages, allowing document sharing, integrated task tracking, enforcement of team processes, and agile planning. Despite the indubitable benefits of web 2.0, the use of these technologies to promote knowledge management remains unexplored, and for many project managers in global organizations, obtaining and integrating information from the tools of previous similar projects remains a challenge. This theoretical paper proposes an innovation in knowledge management based on web 2.0 technologies. The main goal is to provide an integrated vision of a set of technologies that organizations could use to better manage lessons learned. The proposal covers the lessons-learned processes (e.g., capture, sharing, and dissemination), process-based methods (e.g., project review and after-action review), and documentation-based methods (e.g., micro-articles and learning histories). Results show how web 2.0 technologies can help project managers and project teams cope with the main lessons-learned processes and methods to learn from experience. Recommendations are made for the effective use of web 2.0 components to promote innovation and support lessons-learned management in projects.

Keywords: project management; lessons-learned processes; lessons-learned methods; project learning; web 2.0 technologies; innovation.
Article
Full-text available
This paper presents a previously unreported method of laryngeal vocal sound production that is capable of producing pitches even higher than the whistle register (M3). Colloquially known as the glottal whistle (here referred to as M4), this method has a wider range than M3 and features frequent instances of biphonation, which is of interest for those involved with contemporary and improvised music. Pitch profile analyses of M4 have found the majority of fundamental frequency (fo) activity to be between 1 and 3 kHz, while the most frequently seen range was between 1,000 and 1,500 Hz. Remarkably, multiple singers were able to produce fo higher than the highest tone on the piano.
Article
Full-text available
This article reports three studies about performance of lieder, and in particular in comparison with opera performance. In study 1, 21 participants with experience in music performance and teaching completed a survey concerning various characteristics of lieder performance. The results showed that there was consensus between the literature and the assessment of an expert panel-that a "natural" and "unoperatic" vibrato was favored, and that diction, text, and variation of tone are all important aspects of lieder performance. Two acoustic analyses were conducted to investigate genre-specific differences of the singer's formant and vibrato parameters. The first analysis (study 2) used 18 single quasi-unaccompanied notes from commercial recordings of two lieder, and, for comparison, 20 single unaccompanied notes from an opera. Vibrato rate was statistically identical between the two genres at ∼6.4 Hz; however, lieder featured a longer delay in vibrato onset. Vibrato extent was smaller for lieder (∼112 cents) compared with opera (∼138 cents). The singer's formant, which is generally associated with opera, was at times observed in the lieder recordings; however, this was at an overall significantly weaker intensity than in the opera recordings. The results were replicated in study 3, where recordings using only singers who performed in both lied and opera were analyzed. This direct comparison used 45 lieder notes and 55 opera notes and also investigated three different methods of analyzing the singer's formant. A number of consistencies and inconsistencies were identified between acoustic parameters reported in studies 2 and 3, and the beliefs of singing teachers and scholars in the literature and study 1. Copyright © 2015 The Voice Foundation. Published by Elsevier Inc. All rights reserved.
Article
Full-text available
Vibrato is one of the most expressive aesthetic characteristics of the singing voice. Indicative of good voice quality, it is typical of lyrical singing but is also found in other styles of popular music. Acoustically, vibrato is defined as a long-term periodic modulation of the fundamental frequency. It arises from the laryngeal muscular system and comprises three main parameters: rate, extent, and amplitude variation. The main controversy concerns the physiological mechanism of vibrato production, specifically its conscious neurological control, as well as the intra-subject variability of its acoustic parameters. In this study, we compare characteristics related to vibrato rate (VR) in 423 emissions from recorded samples produced by 15 professional singers, publicly and artistically acclaimed in Western culture, representing three musical styles: opera, rock, and Brazilian country (sertanejo). We analyzed the samples with GRAM 5.01 and found that VR was kept constant within subjects, independently of singing style. The mean VR values for opera and Brazilian country singers were higher than for rock singers. The effects of vocal training, kinship, and aging on vibrato rate, as well as the technical skills needed to control it, are objects of our future studies.
Article
Full-text available
Among the so-called extended vocal techniques, vocal growl is a rather common effect in some ethnic (e.g., the Xhosa people in South Africa) and popular styles (e.g., jazz, Louis Armstrong-type) of music. Growl usually consists of simultaneous vibrations of the vocal folds and supra-glottal structures of the larynx, in either harmonic or subharmonic co-oscillation. This paper examines the growl mechanism using videofluoroscopy and high-speed imaging, and its acoustical characteristics by spectral analysis and model simulation. In growl, the larynx position is usually high and the aryepiglottic folds vibrate. The aryepiglottic constriction is associated with a unique shape of the vocal tract, including the larynx tube, and characterizes growl.
Article
Full-text available
Fiji is a distribution of the popular open-source software ImageJ focused on biological-image analysis. Fiji uses modern software engineering practices to combine powerful software libraries with a broad range of scripting languages to enable rapid prototyping of image-processing algorithms. Fiji facilitates the transformation of new algorithms into ImageJ plugins that can be shared with end users through an integrated update system. We propose Fiji as a platform for productive collaboration between computer science and biology research communities.
Article
Full-text available
Recent work on human vocal production demonstrates that certain irregular phenomena seen in human pathological voices and baby crying result from nonlinearities in the vocal production system. Equivalent phenomena are quite common in nonhuman mammal vocal repertoires. In particular, bifurcations and chaos are ubiquitous aspects of the normal adult repertoire in many primate species. Here we argue that these phenomena result from properties inherent in the peripheral production mechanism, which allows individuals to generate highly complex and unpredictable vocalizations without requiring equivalently complex neural control mechanisms. We provide examples from the vocal repertoire of rhesus macaques, Macaca mulatta, and other species illustrating the different classes of nonlinear phenomena, and review the concepts from nonlinear dynamics that explicate these calls. Finally, we discuss the evolutionary significance of nonlinear vocal phenomena. We suggest that nonlinear phenomena may subserve individual recognition and the estimation of size or fluctuating asymmetry from vocalizations. Furthermore, neurally ‘cheap’ unpredictability may serve the valuable adaptive function of making chaotic calls difficult to predict and ignore. While noting that nonlinear phenomena are in some cases probably nonadaptive by-products of the physics of the sound-generating mechanism, we suggest that these functional hypotheses provide at least a partial explanation for the ubiquity of nonlinear calls in nonhuman vocal repertoires.
Article
Full-text available
Occurrences of period-doubling are found in human phonation, in particular for pathological and some singing phonations such as Sardinian A Tenore Bassu vocal performance. The combined vibration of the vocal folds and the ventricular folds has been observed during the production of such low pitch bass-type sound. The present study aims to characterize the physiological correlates of this acoustical production and to provide a better understanding of the physical interaction between ventricular fold vibration and vocal fold self-sustained oscillation. The vibratory properties of the vocal folds and the ventricular folds during phonation produced by a professional singer are analyzed by means of acoustical and electroglottographic signals and by synchronized glottal images obtained by high-speed cinematography. The periodic variation in glottal cycle duration and the effect of ventricular fold closing on glottal closing time are demonstrated. Using the detected glottal and ventricular areas, the aerodynamic behavior of the laryngeal system is simulated using a simplified physical modeling previously validated in vitro using a larynx replica. An estimate of the ventricular aperture extracted from the in vivo data allows a theoretical prediction of the glottal aperture. The in vivo measurements of the glottal aperture are then compared to the simulated estimations.
Article
Full-text available
Purpose: This tutorial addresses fundamental characteristics of microphones (frequency response, frequency range, dynamic range, and directionality), which are important for accurate measurements of voice and speech.

Method: Technical and voice literature was reviewed and analyzed. The following recommendations on desirable microphone characteristics were formulated: the frequency response of microphones should be flat (i.e., variation of less than 2 dB) within the frequency range between the lowest expected fundamental frequency of voice and the highest spectral component of interest; the equivalent noise level of the microphone is recommended to be at least 15 dB lower than the sound level of the softest phonations; the upper limit of the dynamic range of the microphone should be above the sound level of the loudest phonations; and directional microphones should be placed at the distance that corresponds to their maximally flat frequency response, to avoid the proximity effect, otherwise they will be unsuitable for spectral and level measurements. Numerical values for these recommendations were derived for microphone distances of 30 cm and 5 cm.

Conclusions: The recommendations, while preliminary and in need of further numerical justification, should provide the basis for better accuracy and repeatability of studies on voice and speech production in the future.
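The numeric thresholds in the abstract above can be collected into a simple spec check. This is an illustrative sketch, not part of the tutorial itself; the function name and parameter names are invented for this example, and the thresholds are taken directly from the abstract (flatness under 2 dB, noise floor at least 15 dB below the softest phonation, clipping point above the loudest phonation):

```python
def microphone_ok(flatness_db, noise_floor_db, softest_phonation_db,
                  max_input_db, loudest_phonation_db):
    """Check a microphone spec against the tutorial's recommendations:
    frequency response flat within 2 dB, equivalent noise level at
    least 15 dB below the softest phonation, and a dynamic-range
    ceiling above the loudest phonation."""
    return (flatness_db < 2.0
            and noise_floor_db <= softest_phonation_db - 15.0
            and max_input_db > loudest_phonation_db)

# Example: 1.5 dB ripple, 20 dB noise floor, softest phonation 50 dB,
# clipping at 130 dB, loudest phonation 110 dB -> acceptable.
ok = microphone_ok(1.5, 20.0, 50.0, 130.0, 110.0)
```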
Article
Full-text available
Several authors have recently demonstrated the intimate relationship between nonlinear dynamics and observations in vocal fold vibration (Herzel, 1993; Mende, Herzel, & Wermke, 1990; Titze, Baken, & Herzel, 1993). The aim of this paper is to analyze vocal disorders from a nonlinear dynamics point of view. Basic concepts and analysis techniques from nonlinear dynamics are reviewed and related to voice. The voices of several patients with vocal disorders are analyzed using traditional voice analysis techniques and methods from nonlinear dynamics. The two methods are shown to complement each other in many ways. Likely physiological mechanisms of the observed nonlinear phenomena are presented, and it is shown how much of the terminology in the literature describing rough voice can be unified within the framework of nonlinear dynamics.
Article
Full-text available
Fifteen patients, 13 women and 2 men, with a mean age of 72.7 years (56 to 86 years) and a clinical diagnosis of essential voice tremor, were treated with botulinum injections to the thyroarytenoid muscles, and in some cases, to the cricothyroid or thyrohyoid muscles. Evaluations were based on subjective judgments by the patients, and on perceptual and acoustic analysis of voice recordings. Subjective evaluations indicated that the treatment had a beneficial effect in 67% of the patients. Perceptual evaluations showed a significant decrease in voice tremor during connected speech (p < .05). Acoustic analysis showed a nearly significant decrease in the fundamental frequency variations (p = .06) and a significant decrease in fundamental frequency during sustained vowel phonation (p < .01). The results of perceptual evaluation coincided most closely with the subjective judgments. It was concluded that the treatment was successful in 50% to 65% of the patients, depending on the method of evaluation.
Article
Full-text available
For the diagnosis of voice disorders, and especially for the classification of hoarseness, direct observation of vocal fold vibration is essential. Furthermore, a quantitative description of vocal fold movement becomes increasingly necessary to document and compare findings as well as the progression of speech therapy. On the basis of digital high-speed sequences of vocal fold vibration, multiple "functional images", also called digital kymograms, are obtained using image- and signal-processing algorithms. Digital kymograms can serve as a powerful aid for the visualization, description, and classification of vocal fold vibration and as an intermediate step toward image interpretation by biomechanical modeling. This visualization technique is discussed and compared with other currently available techniques: videokymography and videostroboscopy. The technique is applied to several clinical examples: aperiodic processes (phonation onset), irregular vocal fold vibration (paralysis of the recurrent nerve), particular vibration modes (anterior-posterior modes), and running speech.
Article
Full-text available
A reflex mechanism with a long latency (>40 ms) is implicated as a plausible cause of vocal vibrato. At least one pair of agonist-antagonist muscles that can change vocal-fold length is needed, such as the cricothyroid muscle paired with the thyroarytenoid muscle, or the cricothyroid muscle paired with the lateral cricoarytenoid muscle or a strap muscle. Such an agonist-antagonist muscle pair can produce negative feedback instability in vocal-fold length with this long reflex latency, producing oscillations on the order of 5-7 Hz. It is shown that singers appear to increase the gain in the reflex loop to cultivate the vibrato, which grows out of a spectrum of 0-15-Hz physiologic tremors in raw form.
Article
Full-text available
Irregularities in voiced speech are often observed as a consequence of vocal fold lesions, paralyses, and other pathological conditions. Many of these instabilities are related to the intrinsic nonlinearities in the vibrations of the vocal folds. In this paper, bifurcations in voice signals are analyzed using narrow-band spectrograms. We study sustained phonation of patients with laryngeal paralysis and data from an excised larynx experiment. These spectrograms are compared with computer simulations of an asymmetric 2-mass model of the vocal folds. (c) 1995 American Institute of Physics.
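The spectral signature of period-doubling described above can be illustrated with a minimal synthetic example: amplitude-modulating every other glottal cycle (a modulation locked at fo/2) makes subharmonic components appear between the harmonics, exactly where a narrow-band spectrogram would show them. This sketch is mine, not the paper's model; the signal and function names are invented for illustration:

```python
import numpy as np

def spectrum(signal, sr):
    """Magnitude spectrum of a Hann-windowed 1-D signal."""
    mags = np.abs(np.fft.rfft(signal * np.hanning(len(signal))))
    freqs = np.fft.rfftfreq(len(signal), 1.0 / sr)
    return freqs, mags

def energy_near(freq_hz, freqs, mags, tol=5.0):
    """Peak spectral magnitude within +-tol Hz of a target frequency."""
    band = (freqs > freq_hz - tol) & (freqs < freq_hz + tol)
    return mags[band].max()

sr, f0 = 16000, 200.0
t = np.arange(sr) / sr  # one second of signal

# Regular phonation: energy only at harmonics of f0 (200 Hz, 400 Hz).
periodic = np.sin(2 * np.pi * f0 * t) + 0.5 * np.sin(2 * np.pi * 2 * f0 * t)

# Period-doubled phonation: alternating cycle amplitudes, i.e. an
# amplitude modulation locked at f0/2, creating subharmonic sidebands
# at 100, 300, 500 Hz between the original harmonics.
period_doubled = periodic * (1.0 + 0.3 * np.sin(2 * np.pi * (f0 / 2) * t))

freqs_p, mags_p = spectrum(periodic, sr)
freqs_d, mags_d = spectrum(period_doubled, sr)
```

Comparing `energy_near(f0 / 2, ...)` for the two signals shows the f0/2 component emerging only after period doubling, which is the bifurcation such narrow-band spectrograms visualize.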
Article
Full-text available
The acoustic characteristics of so-called 'dist' tones, commonly used in singing rock music, are analyzed in a case study. In an initial experiment a professional rock singer produced examples of 'dist' tones. The tones were found to contain aperiodicity, SPL at 0.3 m varied between 90 and 96 dB, and subglottal pressure varied in the range of 20-43 cm H2O, a doubling yielding, on average, an SPL increase of 2.3 dB. In a second experiment, the associated vocal fold vibration patterns were recorded by digital high-speed imaging of the same singer. Inverse filtering of the simultaneously recorded audio signal showed that the aperiodicity was caused by a low frequency modulation of the flow glottogram pulse amplitude. This modulation was produced by an aperiodic or periodic vibration of the supraglottic mucosa. This vibration reduced the pulse amplitude by obstructing the airway for some of the pulses produced by the apparently periodically vibrating vocal folds. The supraglottic mucosa vibration can be assumed to be driven by the high airflow produced by the elevated subglottal pressure.
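The reported average of 2.3 dB SPL gain per doubling of subglottal pressure implies a logarithmic relation of the form SPL(P) = SPL_ref + 2.3 * log2(P / P_ref). As a minimal sketch (the function name and the 90 dB reference point are illustrative assumptions, not values fitted in the study):

```python
import math

DB_PER_DOUBLING = 2.3  # average SPL gain per doubling of subglottal
                       # pressure reported for the 'dist' tones

def spl_from_pressure(p_cmh2o, p_ref_cmh2o, spl_ref_db):
    """SPL predicted from subglottal pressure, assuming a fixed dB
    gain per pressure doubling (illustrative, not the study's fit)."""
    return spl_ref_db + DB_PER_DOUBLING * math.log2(p_cmh2o / p_ref_cmh2o)

# Doubling pressure from 20 to 40 cm H2O adds exactly 2.3 dB.
spl = spl_from_pressure(40.0, 20.0, 90.0)
```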
Article
Full-text available
We present a straightforward and robust algorithm for periodicity detection, working in the lag (autocorrelation) domain. When it is tested on periodic signals and on signals with additive noise or jitter, it proves to be several orders of magnitude more accurate than the methods commonly used for speech analysis. This makes our method capable of measuring harmonics-to-noise ratios in the lag domain with an accuracy and reliability much greater than that of any of the usual frequency-domain methods. By definition, the best candidate for the acoustic pitch period of a sound can be found from the position of the maximum of the autocorrelation function of the sound, while the degree of periodicity (the harmonics-to-noise ratio) of the sound can be found from the relative height of this maximum. However, sampling and windowing cause problems in accurately determining the position and height of the maximum. These problems have led to inaccurate time-domain and cepstral methods for p...
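The core idea above, taking the pitch period from the lag of the autocorrelation maximum and the degree of periodicity from its relative height, can be sketched in a few lines. This is a simplified raw-autocorrelation version, not the paper's full algorithm (which additionally corrects for the window's own autocorrelation and interpolates the maximum); the function name and search limits are my own:

```python
import numpy as np

def autocorr_pitch(signal, sr, fmin=75.0, fmax=600.0):
    """Estimate f0 from the maximum of the normalized autocorrelation,
    searching lags between 1/fmax and 1/fmin seconds. Returns
    (f0_hz, periodicity), where periodicity is the relative height
    of the autocorrelation maximum (near 1 for a clean periodic tone)."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]
    ac = ac / ac[0]                    # normalize so lag 0 == 1
    lo = int(sr / fmax)                # shortest candidate lag
    hi = int(sr / fmin)                # longest candidate lag
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag, ac[lag]

# A 220 Hz tone with mild additive noise, 0.5 s at 8 kHz.
sr = 8000
t = np.arange(sr // 2) / sr
rng = np.random.default_rng(0)
tone = np.sin(2 * np.pi * 220.0 * t) + 0.05 * rng.standard_normal(len(t))
f0, periodicity = autocorr_pitch(tone, sr)
```

The integer-lag quantization visible here (the estimate lands on sr/lag for whole-sample lags) is precisely one of the accuracy problems the paper's interpolation scheme addresses.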
Article
Fletcher has proposed the use of a logarithmic frequency scale such that the frequency level equals the number of octaves, tones, or semitones that a given frequency lies above a reference frequency of 16.35 cycles/sec., a frequency which is in the neighborhood of that producing the lowest pitch audible to the average ear. The merits of such a scale are here briefly discussed, and arguments are presented in favor of this choice of reference frequency. Using frequency level as a count of octaves or semitones from the reference C0, a rational system of subscript notation follows logically for the designation of musical tones without the aid of staff notation. In addition to certain conveniences such as uniformity of characters and simplicity of subscripts (the eight C's of the piano, for example, are represented by C1 to C8) this method shows by a glance at the subscript the frequency level of a given tone counted in octaves from the reference C0 = 16.352 cycles/sec. From middle C4, frequency 261.63 cycles/sec., the interval is four octaves to the reference frequency, so that below C4 there are roughly four octaves of audible sound. Various subdivisions of the octave are considered in the light of their ease of calculation and significance, and the semitone, including its hundredth part, the cent, is shown to be particularly suitable. Consequently, for general use in which a unit smaller than the octave is necessary it is recommended that frequency level counted in semitones from the reference frequency be employed.
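The frequency-level scale described above is a direct computation: the level in semitones is 12 * log2(f / C0) with C0 = 16.352 Hz, and its hundredth part is the cent. A minimal sketch (function names are mine):

```python
import math

C0 = 16.352  # reference frequency in Hz (the C0 of the abstract)

def frequency_level_semitones(f_hz):
    """Number of semitones a frequency lies above the reference C0."""
    return 12 * math.log2(f_hz / C0)

def nearest_note(f_hz):
    """Split a frequency level into (whole semitones above C0,
    remaining deviation in cents)."""
    level = frequency_level_semitones(f_hz)
    semis = round(level)
    return semis, 100 * (level - semis)

# Middle C (C4, 261.63 Hz) lies 48 semitones, i.e. four octaves,
# above the reference C0, matching the abstract's example.
semis, cents = nearest_note(261.63)
```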
Article
Previous research suggests that independent variation of vocal loudness and glottal configuration (type and degree of vocal fold adduction) does not occur in untrained speech production. This study investigated whether these factors can be varied independently in trained singing and how subglottal pressure is related to average glottal airflow, voice source properties, and sound level under these conditions. A classically trained baritone produced sustained phonations on the endoscopic vowel [i:] at pitch D4 (approximately 294 Hz), exclusively varying either (a) vocal register; (b) phonation type (from “breathy” to “pressed” via cartilaginous adduction); or (c) vocal loudness, while keeping the others constant. Phonation was documented by simultaneous recording of videokymographic, electroglottographic, airflow and voice source data, and by percutaneous measurement of relative subglottal pressure. Register shifts were clearly marked in the electroglottographic wavegram display. Compared with chest register, falsetto was produced with greater pulse amplitude of the glottal flow, H1-H2, mean airflow, and with lower maximum flow declination rate (MFDR), subglottal pressure, and sound pressure. Shifts of phonation type (breathy/flow/neutral/pressed) induced comparable systematic changes. Increase of vocal loudness resulted in increased subglottal pressure, average flow, sound pressure, MFDR, glottal flow pulse amplitude, and H1-H2. When changing either vocal register or phonation type, subglottal pressure and mean airflow showed an inverse relationship, that is, variation of glottal flow resistance. The direct relation between subglottal pressure and airflow when varying only vocal loudness demonstrated independent control of vocal loudness and glottal configuration. Achieving such independent control of phonatory control parameters would be an important target in vocal pedagogy and in voice therapy.
Article
The obvious perceptual differences between various singing styles like Western operatic and jazz rely on specific dissimilarities in vocal technique. The present study focuses on differences in vibrato acoustics and in the singer's formant as analyzed by a novel software tool, named BioVoice, based on robust high-resolution and adaptive techniques whose validity has been proven on synthetic voice signals. A total of 48 professional singers were investigated (29 females; 19 males; 29 Western operatic; and 19 jazz). They were asked to sing "a cappella," but with artistic expression, a well-known musical phrase from Gershwin's Porgy and Bess, in their own style: either operatic or jazz. A specific sustained note was extracted for detailed vibrato analysis. Besides rate (s−1) and extent (cents), duration (seconds) and regularity were computed. Two new concepts are introduced: vibrato jitter and vibrato shimmer, by analogy with the traditional jitter and shimmer of voice signals. For the singer's formant, on the same sustained tone, the ratio of the acoustic energy in formants 1-2 to the energy in formants 3, 4, and 5 was automatically computed, providing a quality ratio (QR). Vibrato rates did not differ among groups. Extent was significantly larger in operatic singers, particularly females. Vibrato jitter and vibrato shimmer were significantly smaller in operatic singers. Duration of vibrato was also significantly longer in operatic singers. QR was significantly lower in male operatic singers. Some vibrato characteristics (extent, regularity, and duration) very clearly differentiate the Western operatic singing style from the jazz singing style. The singer's formant is typical of male operatic singers. The new software tool is well suited to provide useful feedback in a pedagogical context.
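The abstract defines vibrato jitter and shimmer only "by analogy with the traditional jitter and shimmer of voice signals"; BioVoice's exact formulas are not given. Under that assumption, a plausible sketch is the mean absolute cycle-to-cycle difference normalized by the mean, applied to per-cycle vibrato periods (jitter) and extents (shimmer). The input values below are hypothetical, not from the study.

```python
import numpy as np

def vibrato_jitter(cycle_periods):
    """Mean absolute cycle-to-cycle period difference over mean period (%)."""
    p = np.asarray(cycle_periods, dtype=float)
    return 100 * np.mean(np.abs(np.diff(p))) / p.mean()

def vibrato_shimmer(cycle_extents):
    """The same measure applied to per-cycle vibrato extents (%)."""
    a = np.asarray(cycle_extents, dtype=float)
    return 100 * np.mean(np.abs(np.diff(a))) / a.mean()

# Hypothetical per-cycle measurements for one sustained tone (~7 Hz vibrato):
periods_s = [0.142, 0.145, 0.139, 0.144, 0.141]
extents_cents = [68, 75, 71, 66, 73]
j = vibrato_jitter(periods_s)
s = vibrato_shimmer(extents_cents)
```

Lower values on both measures indicate a more regular vibrato, which is the direction the study reports for operatic singers.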
Article
Singers are able to create aperiodic vibration of the vocal folds by either desynchronizing the primary oscillators (the vocal folds) or by engaging secondary sound sources (the false vocal folds, the aryepiglottic folds, or the velum). The desynchronization of the primary oscillators can be accomplished by creating a slight left‐right asymmetry and then using the vocal tract (an acoustic oscillator) to entrain one vocal fold, leaving the other vocal fold to oscillate independently. Other aperiodicities can be created by driving the vocal folds at large amplitudes (with the critical parameter being A/L, the amplitude to length ratio of the vocal folds). In this mode of distortion, the highly nonlinear stress‐strain characteristics of the vocal fold tissues are exploited. Infant cries, shouts, and groans are characteristic of this type of nonlinearity. Some videotaped examples and some computer simulations will be given to illustrate these phenomena.
Article
Acoustic analyses were carried out on vocal vibrato produced by nine opera singers and vocal tremor accompanying the sustained phonation of patients with the following diagnoses: Parkinson's disease, amyotrophic lateral sclerosis, spinal muscular atrophy, essential tremor, and adductor spastic dysphonia. While vocal tremor on average had a faster oscillatory rate and greater amplitude extent when compared to vocal vibrato, only the cycle-to-cycle measures of shimmer and jitter differed significantly between these groups. However, these differences existed even when the effect of the oscillation was removed. These data are consistent with the hypothesis that vocal vibrato in singers and vocal tremor in patients may be part of the same continuum.
Article
A subharmonic route to chaos including period-doubling bifurcations up to f/8 has been observed in experiments on acoustical turbulence (acoustic cavitation noise). The system also shows signs of reverse bifurcation with increasing control parameter (acoustic driving pressure amplitude). In view of the large variety of phenomena observed and yet to be expected the system investigated may well serve as a further experimental paradigm of nonlinear dynamical systems besides Rayleigh-Benard and circular couette flow.
Article
Two important aspects of singers' F0 control have been investigated: vibrato extent and intonation. From ten commercially available compact disc recordings of F. Schubert's Ave Maria, 25 tones were selected for analysis. Fundamental frequency was determined by spectrograph analysis of a high overtone at each turning point of the vibrato undulations. It was found that the mean vibrato extent for individual tones ranged between ± 34 and ± 123 cent and that the mean across tones and singers amounted to ±71 cent. Informal measurements on Verdi opera arias showed much higher figures. With regard to intonation substantial departures from equally tempered tuning were found for individual tones. A tone's vibrato extent was found to have a negative correlation with tone duration and a positive correlation with intonation.
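The cent measurements above follow directly from the logarithmic interval definition (100 cents = 1 semitone, 1200 cents = 1 octave). A small sketch, assuming the ± extent is taken as half the excursion between the lower and upper turning points of the vibrato undulation (function names are ours):

```python
import math

def cents(f1, f2):
    """Interval between two frequencies in cents (100 cents = 1 semitone)."""
    return 1200 * math.log2(f2 / f1)

def vibrato_extent_cents(f_low, f_high):
    """Half the peak-to-peak excursion, i.e. the +/- extent around the mean."""
    return cents(f_low, f_high) / 2
```

An octave (440 to 880 Hz) is exactly 1200 cents, and turning points 142 cents apart give a ±71 cent extent — the across-tone mean reported in the study.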
Article
The purposes of this project were to discover (1) if the speaking fundamental frequency (SFF) levels of professional singers differ significantly from those of nonsingers and (2) if the age-related SFF patterns are similar for these two classes of individuals. Sixty professional singers and 94 nonsingers were recorded reading the first paragraph of the "Rainbow Passage;" both males and females were included. Three paired groups (young, middle, and old age) were studied; they were selected on the basis of health and age. The professional singer groups were further divided by a binary voice classification system, specifically that of soprano/alto for women and tenor/baritone for men. It was found that the sopranos and tenors exhibited significantly higher SFF levels than did the age-matched nonsingers, whereas the altos and baritones did not differ significantly from the controls. Relationships within the performer groups were mixed. For example, there appeared to be a systematic trend for the sopranos and tenors to exhibit higher SFF levels than the altos and baritones. Finally, although the nonsinger SFF levels varied significantly as a function of age, those for the professional singers did not.
Article
The objective of this study was to determine the speaking fundamental frequency (SFF) in professional opera singers and its dependence on their voice type, if any. A total of 75 persons were available for observation using a special computer clinical program. Male voices were categorized into three groups (viz, tenor, baritone, and bass), female ones into two groups (soprano and mezzo-soprano). It was shown that borderlines between SFF types varied within a wide range in all study groups. Significant differences in SFF were documented between tenors, baritones, and basses and between sopranos and mezzo-sopranos; the differences were insignificant between baritones and basses. It is concluded that the speaking fundamental frequency depends on the type of the singing voice; however, this characteristic may serve only as an auxiliary tool and cannot be used for the classification of singing voices.
Article
This investigation aims at describing voice function of four nonclassical styles of singing, Rock, Pop, Soul, and Swedish Dance Band. A male singer, professionally experienced in performing in these genres, sang representative tunes, both with their original lyrics and on the syllable /pae/. In addition, he sang tones in a triad pattern ranging from the pitch Bb2 to the pitch C4 on the syllable /pae/ in pressed and neutral phonation. An expert panel was successful in classifying the samples, thus suggesting that the samples were representative of the various styles. Subglottal pressure was estimated from oral pressure during the occlusion for the consonant [p]. Flow glottograms were obtained from inverse filtering. The four lowest formant frequencies differed between the styles. The mean of the subglottal pressure and the mean of the normalized amplitude quotient (NAQ), that is, the ratio between the flow pulse amplitude and the product of period and maximum flow declination rate, were plotted against the mean of fundamental frequency. In these graphs, Rock and Swedish Dance Band assumed opposite extreme positions with respect to subglottal pressure and mean phonation frequency, whereas the mean NAQ values differed less between the styles.
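The normalized amplitude quotient (NAQ) used above is defined in the abstract itself: the ratio between the flow pulse amplitude and the product of the period and the maximum flow declination rate. A direct sketch, with illustrative values that are not from the study:

```python
def normalized_amplitude_quotient(pulse_amplitude, period_s, mfdr):
    """NAQ = flow pulse amplitude / (period * max flow declination rate).

    pulse_amplitude : AC amplitude of the glottal flow pulse (e.g. l/s)
    period_s        : fundamental period T0 in seconds
    mfdr            : maximum flow declination rate (same flow unit per second)
    """
    return pulse_amplitude / (period_s * mfdr)

# Illustrative values only: a 110 Hz phonation with a 0.3 l/s pulse
# amplitude and an MFDR of 300 l/s^2.
naq = normalized_amplitude_quotient(0.3, 1 / 110, 300.0)
```

Because both the numerator and the denominator scale with overall flow, NAQ is dimensionless; lower values are generally associated with more pressed phonation.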
Article
The purpose of this study was to examine several factors of vocal quality that might be affected by changes in vocal fold vibratory patterns. Four voice types were examined: modal, vocal fry, falsetto, and breathy. Three categories of analysis techniques were developed to extract source-related features from speech and electroglottographic (EGG) signals. Four factors were found to be important for characterizing the glottal excitations for the four voice types: the glottal pulse width, the glottal pulse skewness, the abruptness of glottal closure, and the turbulent noise component. The significance of these factors for voice synthesis was studied and a new voice source model that accounted for certain physiological aspects of vocal fold motion was developed and tested using speech synthesis. Perceptual listening tests were conducted to evaluate the auditory effects of the source model parameters upon synthesized speech. The effects of the spectral slope of the source excitation, the shape of the glottal excitation pulse, and the characteristics of the turbulent noise source were considered. Applications for these research results include synthesis of natural sounding speech, synthesis and modeling of vocal disorders, and the development of speaker independent (or adaptive) speech recognition systems.
Article
The purpose of this study was to compare the mean speaking fundamental frequency (SFF), speaking frequency range, and mean speaking intensity for a group of trained male singers and a group of age-matched non-singers in three age ranges: 20 to 35 years old; 40 to 55 years old; and older than 65 years. Each subject was recorded as he read "The Rainbow Passage" and produced the vowel /a/ to the limits of his phonational frequency range. The data indicated that the mean SFF of the nonsingers was significantly lower among the middle-aged speakers than with the young or elderly. In contrast, the tenors exhibited no age-related SFF trends, and the young bass/baritones exhibited lower SFF levels than the middle-aged or elderly. The elderly nonsingers produced frequency ranges that were smaller than any other group. Finally, the young nonsingers used greater speech intensity than did the other groups.
Article
The biomechanics of wave propagation in viscoelastic materials can be useful in understanding the nature of normal and pathologic vocal fold vibration. Mucosal wave movement is the primary means by which the larynx transforms the egressive pulmonary air flow into sound. This short tutorial describes a number of concepts fundamental to the understanding of the vocal fold traveling wave. The displacement velocity of the vocal folds is shown to be proportional to the wave speed, which in turn is proportional to the elastic modulus or stiffness of the vocal folds. Finally, a few cases of unilateral paralysis are used to demonstrate how vocal fold stiffness, entrainment, and degree of vocal fold closure interact to create the complex vibratory patterns that occur in disordered laryngeal states. It is emphasized that surgical voice restoration must consider these properties of the mucosal wave to improve phonatory function.
Article
This study details a comparison of the speaking F0 and intensity values of young male and female adults with and without vocal training, as well as the superimposition of the speaking F0 and intensity data upon phonetograms. Results indicated that (a) trained vocalists have mean speaking F0s similar to those of untrained vocalists, but exhibit significantly greater speaking F0 ranges than do untrained vocalists; (b) trained vocalists exhibit significantly greater mean intensity levels in speech, as well as significantly greater speaking intensity ranges, than do untrained vocalists; (c) the mean speaking F0 for both trained and untrained vocalists was found in the vicinity of the 5-7% frequency level of the entire phonational F0 range (in Hz), equivalent to 12-16% of the phonational F0 range in semitones; (d) the overall speech area (mean speaking F0 and minimum and maximum speaking F0 peaks) was found in the lower 23-31% of the entire phonational F0 range (in semitones), with the untrained subjects utilizing the lower 25% of the phonational range (in semitones) and the trained subjects extending this area to the lower 28-31%; and (e) significant correlations were observed between the total intensity range and intensity range used in speech in trained female vocalists and between total F0 range and speaking F0 range in the combined trained male and female group. These results have important implications for the use of the phonetogram, as well as the clinical applicability of vocal training exercises in various speech and voice therapy cases.
Article
A digital technique for high-speed visualization of vibration, called videokymography, was developed and applied to the vocal folds. The system uses a modified video camera able to work in two modes: high-speed (nearly 8,000 images/s) and standard (50 images/s in CCIR norm). In the high-speed mode, the camera selects one active horizontal line (transverse to the glottis) from the whole laryngeal image. The successive line images are presented in real time on a commercial TV monitor, filling each video frame from top to bottom. The system makes it possible to observe left-right asymmetries, open quotient, propagation of mucosal waves, movement of the upper and, in the closing phase, the lower margins of the vocal folds, etc. The technique is suitable for further processing and quantification of recorded vibration.
Article
Mongolian "throat singing" can be performed in different modes. In Mongolia, the bass-type is called Kargyraa. The voice source in bass-type throat singing was studied in one male singer. The subject alternated between modal voice and the throat singing mode. Vocal fold vibrations were observed with high-speed photography, using a computerized recording system. The spectral characteristics of the sound signal were analyzed. Kymographic image data were compared to the sound signal and flow inverse filtering data from the same singer were obtained on a separate occasion. It was found that the vocal folds vibrated at the same frequency throughout both modes of singing. During throat singing the ventricular folds vibrated with complete but short closures at half the frequency of the true vocal folds, covering every second vocal fold closure. Kymographic data confirmed the findings. The spectrum contained added subharmonics compared to modal voice. In the inverse filtered signal the amplitude of every second airflow pulse was considerably lowered. The ventricular folds appeared to modulate the sound by reducing the glottal flow of every other vocal fold vibratory cycle.
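The mechanism described above — the ventricular folds lowering the amplitude of every second glottal flow pulse — is acoustically a 2:1 amplitude modulation that adds subharmonics to the spectrum. The following is a synthesis sketch of that effect only, not a model of the physiology; the carrier frequency and modulation depth are arbitrary choices of ours.

```python
import numpy as np

fs = 16000
f0 = 220.0
t = np.arange(0, 0.5, 1 / fs)
carrier = np.sin(2 * np.pi * f0 * t)

# Attenuate every second cycle: amplitude modulation at f0/2, analogous
# to the ventricular folds covering every second vocal fold closure.
modulator = 1.0 - 0.4 * (0.5 + 0.5 * np.cos(2 * np.pi * (f0 / 2) * t))
subharmonic = carrier * modulator

# The spectrum now shows sidebands at f0/2 spacing (110 and 330 Hz)
# around the still-dominant 220 Hz carrier.
spec = np.abs(np.fft.rfft(subharmonic))
freqs = np.fft.rfftfreq(len(subharmonic), 1 / fs)
```

This matches the abstract's observation that the modulated flow "contained added subharmonics compared to modal voice": the perceived pitch can drop an octave even though the vocal folds themselves still vibrate at f0.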
Article
A multidimensional protocol has been established by the ELS in order to reach better agreement and standardisation in the functional assessment of pathologic voices. In order to evaluate the validity, practicability and applicability of this protocol, the experiences of six European voice centres were analysed in a retrospective study. The ELS protocol comprises five dimensions: perceptual voice evaluation, videostroboscopy, acoustics, aerodynamics and subjective rating by the patient. Results obtained in 94 patients with benign voice disorders were evaluated retrospectively in a multicentre study. According to our results, the validity, practicability and applicability of the ELS protocol were largely satisfactory. This was true for all "common" voice disorders, but not for extreme voice alterations (e.g. spasmodic dysphonia, aphonia, substitution voices). The five dimensions proved not to be redundant and were able to selectively differentiate pre-/post-treatment changes among various etiologies of voice disorders, various types of treatment, and genders.
  • Bret D. Living on the Edge: The Freddie Mercury Story. London, UK: Robson Books, 1996.
  • Sullivan C. Freddie Mercury: the great enigma. The Guardian; 2012. Available from: http://www.theguardian.com/music
  • Boersma P, Weenink D. Praat: doing phonetics by computer. Amsterdam, The Netherlands: Institute of Phonetic Sciences.
  • Miller DG. Resonance in Singing. Princeton, NJ: Inside View Press, 2008.
  • Wikipedia. Freddie Mercury. Wikimedia Foundation; 2015 [cited 1 June 2015]. Available from: http://en.wikipedia.org/wiki/Freddie_Mercury.