
Dominant Distortion Classification for Pre-Processing of Vowels in Remote Biomedical Voice Analysis

  • FaunaPhotonics
  • Audio Analysis Lab, Aalborg University


Advances in speech signal analysis facilitate the development of techniques for remote biomedical voice assessment. However, the performance of these techniques is affected by noise and distortion in signals. In this paper, we focus on the vowel /a/ as the most widely-used voice signal for pathological voice assessments and investigate the impact of four major types of distortion that are commonly present during recording or transmission in voice analysis, namely background noise, reverberation, clipping and compression, on Mel-frequency cepstral coefficients (MFCCs), the most widely-used features in biomedical voice analysis. Then, we propose a new distortion classification approach to detect the most dominant distortion in such voice signals. The proposed method involves MFCCs as frame-level features and a support vector machine as classifier to detect the presence and type of distortion in frames of a given voice signal. Experimental results obtained from healthy and Parkinson's voices show the effectiveness of the proposed approach in distortion detection and classification.
Amir Hossein Poorjam 1, Jesper Rindom Jensen 1, Max A. Little 2 and Mads Græsbøll Christensen 1
1Audio Analysis Lab, AD:MT, Aalborg University, Aalborg, DK
2Engineering and Applied Science, Aston University, Birmingham, UK
2Media Lab, MIT, Cambridge, Massachusetts, USA
1{ahp,jrj,mgc} ,
Index Terms: distortion analysis, MFCC, remote biomedical
voice assessment, support vector machine
1. Introduction
Sustained vowels are widely used for evaluation of pathologi-
cal voice caused by a range of medical disorders. Vowels have
two main advantages: first, the complexity of modeling artic-
ulatory movement during running speech is avoided [1], and
second, experimental studies show that most dysphonic speakers
cannot produce steady, sustained vowel sounds [2]. Among
vowels, the vowel /a/ is sufficient for many voice analysis
applications [3], [4]. During production of the vowel /a/, the vocal
tract is more open than for other vowels, resulting in minimal air
pulse reflections between the vocal tract and the vocal folds [1].
Using clean and sustained /a/ vowels, Tsanas et al. [4] achieved
almost 99% overall accuracy in detecting Parkinson’s disease
(PD) from voice recordings, for example.
Due to advances in automatic voice analysis, remote voice
assessment is becoming feasible [5], [6]. For example, smartphones
have recently been investigated as tools for measuring
pathological voice [7], since they are ubiquitous and
inexpensive devices with built-in, high-quality microphones.
Compared to samples recorded in a clinic or a sound booth,
recordings from smartphones in most environments are subject
to many types of linear and nonlinear distortion. The presence
of distortion in voice signals degrades the performance of
algorithms designed to quantify medical symptoms from voice
recordings [8]. In particular, the performance of different
algorithms for PD detection under a variety of acoustic conditions
has been evaluated in [9], and it has been demonstrated
that background noise and the use of codecs significantly
degrade detection performance.

This work was funded by the Danish Council for Independent Research,
grant ID: DFF 4184-00056.
Several approaches to detect different types of noise and
distortion in voice signals have been studied, most of which have
focused on detecting a single, specific kind of distortion in
voice [10–14]. In this study, we consider the vowel /a/ and aim
to detect four different types of noise and distortion that are
commonly present during recording or transmission in remote
voice analysis, namely: background noise, room reverberation,
peak clipping and coding (i.e. speech compression). Although
there are an infinite number of possible levels, types and combi-
nations of distortion in real-world scenarios, this study aims to
provide a simplified approach to detect the most dominant dis-
tortion in the signal, which would be useful in practical applica-
tions where it is important to know whether a frame is distortion
free or needs enhancement. We assume that if a given frame is
considered as corrupted, there is a specific type of noise or dis-
tortion which dominates over other distortions. Following this,
we investigate the behavior of Mel-frequency cepstral coeffi-
cients (MFCCs, widely-used features in voice-based biomedi-
cal applications [8], [15]) in the presence of the four kinds of
distortion and noise. Then, a new method is proposed which
uses a support vector machine (SVM) as classifier and MFCCs
as features for that classifier, to detect distortion in each frame.
MFCCs are selected because of their sensitivity to changes in
signal characteristics due to noise, distortions or articulatory
movements [16].
2. Effects of distortion on MFCCs
The proposed method is based upon experimental observations
of the effect of distortions on MFCCs, reported next. This
experimental analysis reveals that different levels and types of
distortion cause MFCCs to shift to different regions of the space
spanned by the MFCC values, and change the covariance of
these values. To explore this effect, we take successive frames
from the center of the clean vowel /a/ uttered by 45 healthy
speakers and extract the MFCCs under different types and levels
of distortion and noise. We then evaluate the shift in the sample
mean and covariances of the MFCCs computed on the distorted signals.
2.1. MFCC Features
MFCCs are based on the source-filter theory of speech production
[17]. To compute MFCCs, we take the discrete Fourier
transform (DFT) of the speech frames. Then, the power spectrum
is computed and passed through a set of triangular filter
banks, linearly spaced on the Mel-frequency scale. The log-energy
output of the filter bank, which is sensitive to small
changes in signal characteristics due to noise, distortions or
articulatory movements [16], is then passed through the discrete
cosine transform (DCT). The MFCCs are the amplitudes of the
DCT coefficients. Specifically, the pth MFCC coefficient of the
kth frame is calculated as [8]:

$$\phi_k[p] = \frac{1}{M+1} \sum_{q=1}^{M} \log\left|\tilde{S}_k(q)\right| \cos\left(\frac{\pi q}{M+1}\, p\right), \tag{1}$$
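where $M$ is the number of Mel-band filters and $\tilde{S}_k(q)$ is the
estimate of the spectral energy in the $q$th band calculated as:

$$\tilde{S}_k(q) = \sum_{f_i^{\mathrm{Mel}} \in I_q^{\mathrm{Mel}}} \left(1 - \frac{\left| f_i^{\mathrm{Mel}} - \frac{q}{M+1} F^{\mathrm{Mel}} \right|}{\Delta f^{\mathrm{Mel}}/2}\right) \left|S_k(i)\right|, \tag{2}$$

where $I_q^{\mathrm{Mel}} = \left[\frac{q-1}{M+1} F^{\mathrm{Mel}}, \frac{q+1}{M+1} F^{\mathrm{Mel}}\right]$ is the $q$th filter band in
the Mel-frequency scale, $f_i^{\mathrm{Mel}}$ is the $i$th Mel frequency, $F^{\mathrm{Mel}}$ is the
maximum frequency in the Mel domain, $\Delta f^{\mathrm{Mel}}/2$ is the width
of the Mel bands and $S_k$ is the short-time DFT of the $k$th frame.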
The transformation from the linear domain to the Mel domain
is performed by [18]:

$$f^{\mathrm{Mel}} = \frac{1000}{\log_{10} 2} \log_{10}\left(1 + \frac{f}{1000}\right). \tag{3}$$
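As a small illustration, the Hz-to-Mel mapping in (3) can be written in a few lines of Python. This is only a sketch: the function names `hz_to_mel` and `mel_to_hz` are ours, and the inverse mapping is not given in the text but follows algebraically from (3).

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Map a frequency in Hz to the Mel domain: 1000 / log10(2) * log10(1 + f / 1000)."""
    return 1000.0 / math.log10(2.0) * math.log10(1.0 + f_hz / 1000.0)


def mel_to_hz(f_mel: float) -> float:
    """Algebraic inverse of hz_to_mel (derived here, not stated in the paper)."""
    return 1000.0 * (2.0 ** (f_mel / 1000.0) - 1.0)
```

With this definition, 1000 Hz maps to 1000 Mel, which is the usual anchoring point of this form of the scale.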
In this study, 13 MFCC coefficients are extracted for each
frame. In addition, delta and double-delta coefficients, defined
as the first- and second-order time-differences of the MFCC co-
efficients which capture the dynamic changes between frames,
are appended to the MFCCs to form a 39-dimensional vector.
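The construction of the 39-dimensional vector can be sketched as follows. The exact delta computation is not specified in the text, so this example assumes a simple backward difference between consecutive frames, with the first frame's difference set to zero; the function name `add_deltas` is hypothetical.

```python
def add_deltas(mfcc):
    """Append delta and double-delta coefficients to each frame.

    mfcc: list of frames, each a list of 13 static coefficients.
    Returns a list of frames with 39 values each (static, delta, double-delta).
    """
    def time_difference(seq):
        # Backward difference; the first frame has no predecessor, so it gets zeros (assumption).
        out = [[0.0] * len(seq[0])]
        for prev, cur in zip(seq, seq[1:]):
            out.append([c - p for p, c in zip(prev, cur)])
        return out

    delta = time_difference(mfcc)
    delta2 = time_difference(delta)
    return [m + d + dd for m, d, dd in zip(mfcc, delta, delta2)]
```

Regression-based deltas over a window of several frames are a common alternative and may be what was actually used.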
Considering (1)–(3), the effects of distortions on MFCCs
are complex since during the MFCC calculations, a corrupted
signal passes through several nonlinear functions. These effects
can even be more complex when a signal has been subject to a
nonlinear distortion such as clipping or compression. To evaluate
the behavior of MFCCs in the presence of noise and distortion,
we take successive 30 ms long frames of the vowel /a/
uttered by 45 healthy speakers and compute the change in the
covariance matrix and the mean of the MFCCs under different
types and levels of distortion. Specifically, the mean shift can
be defined as:

$$\xi(j) = \frac{1}{N} \sum_{n=1}^{N} \left\| \mu_n^{c} - \mu_n^{d_j} \right\|_2, \tag{4}$$

where $N$ is the number of speakers, $\|\cdot\|_2$ represents the 2-norm,
and $\mu_n^{c}$ and $\mu_n^{d_j}$ are the means of the MFCCs computed
respectively from clean signal and distorted signals from the
$n$th speaker subject to the $j$th distortion level. $\xi = 0$ indicates
that the mean of the corrupted MFCCs is unchanged in feature
space. The larger the value of $\xi$, the farther the MFCC vector
is moved with respect to the clean one. The change in the
covariance matrix of the MFCC under the $j$th distortion level is
measured as:

$$\delta(j) = \frac{1}{N} \sum_{n=1}^{N} \frac{\operatorname{tr}\left(\Sigma_n^{d_j}\right)}{\operatorname{tr}\left(\Sigma_n^{c}\right)}, \tag{5}$$

where $\Sigma_n^{c}$ and $\Sigma_n^{d_j}$ are respectively the covariance matrices of
the MFCCs extracted from the clean and corrupted utterances
of the $n$th speaker, and $\operatorname{tr}(\cdot)$ is the trace operator that maps the
MFCC covariance matrix to a single real number which represents
the sum of variances for individual dimensions of the
MFCC vector. $\delta = 1$ represents no change in covariance. A
value of $\delta < 1$ indicates a reduction in covariance with respect
to the covariance of the clean MFCC. That is, the MFCCs become
more compact in the feature space.
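The two measures can be computed with a short sketch such as the one below. It assumes that ξ is the per-speaker Euclidean distance between clean and distorted MFCC means, averaged over speakers, and that δ is the per-speaker ratio of the distorted to the clean covariance trace, averaged over speakers (which matches δ = 1 meaning no change and δ < 1 meaning a more compact distribution). All function names are hypothetical.

```python
import math


def mean_vector(frames):
    """Per-dimension mean of a list of feature vectors."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def covariance_trace(frames):
    """Trace of the sample covariance, i.e. the sum of per-dimension sample variances."""
    n = len(frames)
    mu = mean_vector(frames)
    return sum(
        sum((x - m) ** 2 for x in col) / (n - 1)
        for col, m in zip(zip(*frames), mu)
    )


def mean_shift(clean, distorted):
    """xi: clean and distorted are lists (one entry per speaker) of lists of MFCC frames."""
    dists = [math.dist(mean_vector(c), mean_vector(d)) for c, d in zip(clean, distorted)]
    return sum(dists) / len(dists)


def covariance_change(clean, distorted):
    """delta: average over speakers of tr(cov distorted) / tr(cov clean)."""
    ratios = [covariance_trace(d) / covariance_trace(c) for c, d in zip(clean, distorted)]
    return sum(ratios) / len(ratios)
```

For example, halving every MFCC value of a speaker divides each variance by four, so `covariance_change` returns 0.25 for that speaker.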
2.2. Impact of different distortions on MFCCs
In the first experiment, we investigate the impact of background
noise on MFCCs by corrupting clean vowels /a/ uttered by 45
healthy speakers with three commonly encountered environmental
noise types, namely “white Gaussian noise”, “quiet office
ambience noise” and “babble noise”, under different signal-to-noise
ratio (SNR) conditions (ranging from -20 dB to 60 dB
in 1 dB steps). Babble noise, which consists of multiple speakers
talking in the background, has a rapid, time-evolving structure
and is considered a challenging type of noise in many speech-based
applications due to its similarity to the target speech [19].
The office environment noise represents a general atmosphere
of a medium size room including the sound of air conditioning
systems and very weak background noise from outside. Fig-
ure 1(a) shows the impact of different types and levels of noise
on the mean and the covariance matrix of MFCCs. The left
vertical axis represents the amount of mean shift as defined in
(4) and the right vertical axis represents the relative change in
the covariance matrix as defined in (5). The plot suggests that
variable noise levels shift the mean of MFCCs to different, but
predictable, regions in the feature space. It can be observed
that the amount of shift monotonically increases as the level of
noise increases. Moreover, it can be noticed that the covariance
of the noisy MFCCs is always smaller than that of the clean one.
However, the covariance does not monotonically reduce. This
is probably due to the fact that as the SNR goes below 0 dB, the
noise dominates the signal and the MFCCs take on a different
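The corruption step of this experiment, mixing a noise recording into a clean vowel at a chosen SNR, can be sketched as follows. This is an illustrative helper, not the authors' code: the name `mix_at_snr` is hypothetical, and SNR is assumed to be the ratio of average signal power to average noise power over the whole utterance.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that 10*log10(P_speech / P_noise) equals snr_db, then add it to `speech`.

    Both inputs are lists of samples; `noise` is assumed to have the same length as `speech`.
    """
    if len(noise) != len(speech):
        raise ValueError("speech and noise must have the same length")
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    # Gain that brings the noise power to p_speech / 10^(snr_db / 10).
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]
```

Sweeping `snr_db` from -20 to 60 in steps of 1 would reproduce the range of conditions described above.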
Reverberation in voice recordings is caused by superimposed
reflections of the original sound wave coming from different
surfaces in an acoustic environment and is known to have
a detrimental impact on numerous signal processing tasks. To
study the effect of reverberation on MFCCs, we filtered the
clean signal with synthetic room impulse responses (RIRs) with
reverberation times (RTs) varying from 150 ms to 1 s, generated
for a room of dimensions 5 m × 4 m × 3 m. Furthermore, to
evaluate the effect of different source-to-receiver distances on
the MFCCs, the experiments are repeated with three different
speaker-to-microphone distances, namely 0.5 m, 1 m and
1.5 m. The RIRs are generated using the image method [20],
as implemented in the RIR Generator toolbox [21]
in MATLAB. Figure 1(b) illustrates similar trends for MFCCs
under different speaker-to-microphone distances. The mean
shift increases as the RT increases. We can observe that when
the microphone records from a close distance, the amount of
shift is always smaller than when the microphone records
from a larger distance from the speaker. For large
speaker-to-microphone distances, however, we observe a different trend as
the RT exceeds 250 ms. Reverberation reduces the covariance
of the MFCCs as the RT increases.
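Filtering a clean signal with an RIR is a plain convolution, which the sketch below illustrates. Note that `toy_rir` is not the image method used in the paper: it is a deliberately crude stand-in (exponentially decaying Gaussian noise whose amplitude drops by 60 dB after the requested RT), included only so that the example is self-contained. Both function names are hypothetical.

```python
import math
import random


def toy_rir(rt60_s, fs, seed=0):
    """Crude synthetic RIR: Gaussian noise with an exponential decay reaching -60 dB at rt60_s."""
    rng = random.Random(seed)
    n = int(round(rt60_s * fs))
    decay = 3.0 * math.log(10.0) / rt60_s  # amplitude factor 10^-3 (i.e. -60 dB) at t = rt60_s
    return [rng.gauss(0.0, 1.0) * math.exp(-decay * t / fs) for t in range(n)]


def convolve(x, h):
    """Direct-form linear convolution of signal x with impulse response h."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y
```

For realistic experiments, an RIR from an image-method implementation or a measured database would replace `toy_rir`, and an FFT-based convolution would be preferable for long signals.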
[Figure 1: four panels, each plotting ξ (left vertical axis) and δ (right vertical axis).
(a) Background noise: signal SNR, for white Gaussian noise, babble noise and office ambience noise.
(b) Reverberation: reverberation time (in sec), for speaker-to-microphone distances of 0.5 m, 1 m and 1.5 m.
(c) Peak clipping: clipping level.
(d) Compression: bit rate (kbps).]
Figure 1: Impact of different types and levels of distortion on the mean and covariance matrix of the MFCCs. The left vertical axes
represent ξ defined in (4), which is the amount of mean shift. ξ = 0 indicates that the mean of the corrupted MFCCs is not shifted in
the feature space. The larger the value of ξ, the farther the MFCC vector is positioned with respect to the clean one. The right vertical
axes represent δ defined in (5), which is the relative change in the covariance matrix. δ = 1 indicates no change in covariance of the
corrupted MFCCs, δ > 1 represents an increase in covariance with respect to the covariance of the clean MFCCs and δ < 1 indicates
that the MFCCs become more compact in the feature space.

Peak clipping and speech coding are two common nonlinear
speech signal modifications. Peak clipping occurs when
the amplitude of a speech signal exceeds the dynamic range
of the analogue-to-digital converter, which introduces nonlinear
distortion into the signal and affects the subjective quality of
speech [22]. On the other hand, communication channels typically
use lossy codecs such as code-excited linear prediction
(CELP) to compress speech signals to lower bit rates, which
inevitably degrades the quality of the speech [23]. To study the
effect of peak clipping on MFCCs, we define the clipping level
as a proportion of the unclipped peak absolute signal amplitude
to which samples greater than this threshold are limited. The
clean recordings of the vowel /a/ are clipped with clipping lev-
els varying from 0.1 to 1 in 0.025 steps. Figure 1 (c) illustrates
the impact of peak clipping on the MFCCs. As the clipping
level increases, the mean of the MFCCs is positioned farther
away from that of the clean signal. MFCCs of a clipped signal
possess smaller covariance matrix values compared to that of
the clean MFCCs and become smaller as the clipping level in-
creases. To investigate the behavior of MFCCs when a speech
signal has undergone the distortion of a speech codec, the clean
vowels /a/ are coded by a CELP codec with three different stan-
dard bit rates, namely 6.3, 9.6 and 16 kbps [24]. Figure 1 (d)
shows the impact of speech compression on MFCCs. The plot
is produced by fitting a second-order power function to the
calculated ξ and δ, as functions of the bit rate j (in kbps):

$$\xi(j) = 2.35 \times 10^{5} \times j^{-3.88} + 3.59, \tag{6}$$

$$\delta(j) = 1797 \times j^{-4.89} + 0.91. \tag{7}$$
We can observe that speech compression shifts the MFCCs to a
farther position (with respect to the position of the clean ones)
as the compression rate increases. On the other hand, although
MFCCs of a voice signal coded at 16 kbps and 9.6 kbps possess
smaller covariance matrices compared to the covariance of the
clean one, we observe a larger covariance than that of the clean
MFCCs when a signal is coded at 6.3 kbps. The empirical curve
fitted to δalso suggests that MFCCs of a signal compressed at
7.3 kbps are expected to have a comparable covariance matrix
with respect to the covariance of the clean MFCCs.
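The clipping-level definition used in this section (a proportion of the unclipped peak absolute amplitude, to which larger samples are limited) translates directly into code. The sketch below is illustrative and the function name `peak_clip` is hypothetical.

```python
def peak_clip(signal, level):
    """Limit samples to +/- level * max|signal|.

    level is a proportion in (0, 1]; level = 1 leaves the signal unchanged,
    and smaller values clip a larger part of the waveform.
    """
    if not 0.0 < level <= 1.0:
        raise ValueError("level must be in (0, 1]")
    threshold = level * max(abs(s) for s in signal)
    return [max(-threshold, min(threshold, s)) for s in signal]
```

For instance, `peak_clip([0.5, -1.0, 0.2, 0.8], 0.5)` limits the samples to the range [-0.5, 0.5] and leaves samples already inside that range untouched.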
3. The proposed distortion classification
Motivated by the experimental findings above, we introduce a
new method for noise and distortion classification to detect the
presence and type of noise/distortion in /a/ vowels. Although a
recording can be subject to an infinite number of possible types
and levels of noise and distortion in real scenarios, our approach
focuses on detecting the most dominant corruption present in
any frame. We assume the simplifying model that if a given
frame of a voice recording is corrupted, there is a single type of
noise or distortion which dominates. The block diagram of the
proposed approach in training and testing phases is illustrated
in Figure 2. Using a Hamming window, recordings are segmented
into frames of 30 ms. For each frame of a vowel (which
can be clean or corrupted), a 39-dimensional MFCC vector is
computed. Using an energy-based voice activity detection
algorithm [25], silent frames at the beginning and the end of the
signals are excluded. Then, a multiclass SVM with a radial
basis function kernel estimated on the training frames is used
to classify distortion in an unseen frame during testing.
Introduced by Vapnik [26], SVMs are powerful discriminative
pattern classifiers which find an optimal separating hyperplane
in a high dimensional nonlinear feature space formed using
kernels applied to the input feature space.

[Figure 2: block diagram with a training phase and a testing phase, each containing MFCC extraction and VAD indices.]
Figure 2: Block diagram of the proposed method for distortion/noise
classification, training and testing phases.
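The framing and silence-exclusion steps of this pipeline can be sketched as below. Several details are assumptions made for illustration only: frames are taken without overlap, the voice activity decision is a simple relative energy threshold (30 dB below the loudest frame) rather than the specific algorithm of [25], and the function names are hypothetical.

```python
import math


def frame_signal(signal, fs, frame_ms=30.0):
    """Split a signal into non-overlapping, Hamming-windowed frames of frame_ms milliseconds."""
    n = int(round(fs * frame_ms / 1000.0))
    window = [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]
    return [
        [s * w for s, w in zip(signal[start:start + n], window)]
        for start in range(0, len(signal) - n + 1, n)
    ]


def trim_silence(frames, rel_threshold_db=-30.0):
    """Drop leading and trailing frames whose energy is far below that of the loudest frame."""
    if not frames:
        return []
    energies = [sum(x * x for x in f) for f in frames]
    floor = max(energies) * 10.0 ** (rel_threshold_db / 10.0)
    active = [i for i, e in enumerate(energies) if e > floor]
    return frames[active[0]:active[-1] + 1] if active else []
```

At a sampling rate of 16 kHz a 30 ms frame holds 480 samples, and at 8 kHz it holds 240. Each retained frame would then be converted to a 39-dimensional MFCC vector and passed to the classifier.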
4. Experimental setup
The proposed system for distortion/noise recognition in /a/ vowels
was developed and validated using two different databases.
The first database consists of speech samples of healthy speakers.
This database contains different clean vowels uttered by 45
men, 48 women and 46 children, recorded by a dynamic microphone
and sampled at 16 kHz, with durations ranging from 370 ms to
780 ms [27]. There is no dysphonia variability. The only uncontrolled
parameter is the speaker variability. From this database,
we have chosen 93 samples of /a/ vowels produced by 45 male
and 48 female speakers. Furthermore, to evaluate the proposed
system with more realistic pathological voice signals, we used a
PD voice database, since the vast majority of people with PD
exhibit some form of vocal disorder [28]. This database was
generated through collaboration between Sage Bionetworks,
PatientsLikeMe and Dr. Max Little as part of the Patient Voice
Analysis study (PVA)¹. The samples of this database are
telephone recordings of the sustained vowel /a/ produced by
779 PD patients of both genders, sampled at 8 kHz, with durations
ranging from 3 s to 30 s. From this database, we randomly
selected 48 female and 26 male samples of 7 s to 15 s duration.
Then, we used 3 s from the middle of the signals, where the
speakers produced a steady sustained vowel. This database has both
speaker and dysphonia variability. Moreover, the recordings
may already contain some types of distortion, such as background
acoustic environmental noise and reverberation, or may have
been through one or more codecs, since they were collected over
the telephone network, which makes the noise/distortion
detection more challenging.

Table 1: Frame- and recording-level classification performance for the healthy voice and the Parkinson’s voice databases in the form
mean ± STD computed using a 5-fold CV.

Frame-level classification accuracy (in % ± STD)
Database            Clean    Noisy    Clipped  Coded    Reverb.  Overall
Healthy voice       61±6     92±3     82±3     71±6     85±4     78±1
Parkinson’s voice   48±5     89±3     74±6     77±8     66±5     72±4

Recording-level classification accuracy (in % ± STD)
Database            Clean    Noisy    Clipped  Coded    Reverb.  Overall
Healthy voice       77±12    100±0    98±3     82±11    90±7     89±4
Parkinson’s voice   55±11    97±4     82±7     85±9     77±4     79±3
To create a database for distortion/noise detection, we enlarged
the databases by adding distorted versions of all
recordings, obtained by applying different types and levels of noise
and distortion which are typically present in the recordings of remote
voice analysis. Specifically, for noise, we added “babble”,
“white Gaussian” and “office ambience” noises at SNRs of 15 dB,
10 dB and 5 dB. For peak clipping, the clipping level was set
to 0.3, 0.4, 0.5 and 0.6. Signals were compressed using 6.3
kbps, 9.6 kbps and 16 kbps CELP codecs. To provide reverberant
signals, recordings were filtered by 8 different real RIRs
of the AIR database [29]. The RIRs were measured with a mock-up
phone in hand-held and hands-free positions in four realistic
indoor environments, namely an office, a lecture room, a corridor
and a stairway. The measured RTs range from 390 ms
to 1.47 s [30]. Then, using a Hamming window of length 30
ms, we created a database of 30 ms clean and corrupted frames
for both databases. The recordings of each database are then
divided into two subsets: a training subset consisting of 80%
of the speakers, and a testing subset consisting of 20% of the
speakers. The resulting training and testing subsets of the
enlarged healthy vowel database consist of 5105 and 1360 frames,
respectively. The training and the testing subsets of the enlarged
PD voice database consist of 30150 and 7800 frames,
respectively. The enlarged databases have the same number of frames
per class of noise/distortion.
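Because the split is made over speakers rather than over frames, all frames of a given speaker (clean and corrupted) end up on the same side of the split. A minimal sketch of such a speaker-level split is given below; the function name, the use of a fixed random seed and the rounding of the test-set size are assumptions made for illustration.

```python
import random


def split_by_speaker(speaker_ids, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) so that no speaker appears in both subsets.

    speaker_ids: one speaker label per frame (or per recording).
    """
    speakers = sorted(set(speaker_ids))
    random.Random(seed).shuffle(speakers)
    n_test = max(1, round(test_fraction * len(speakers)))
    test_speakers = set(speakers[:n_test])
    train_idx = [i for i, s in enumerate(speaker_ids) if s not in test_speakers]
    test_idx = [i for i, s in enumerate(speaker_ids) if s in test_speakers]
    return train_idx, test_idx
```

Splitting at the frame level instead would let frames of the same utterance appear in both subsets, which would make the test accuracy optimistic.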
To detect different types of distortion in a given frame, a
multiclass SVM classifier implemented in the LIBSVM tool-
box [31] in MATLAB is used. The hyper-parameters of the
SVM, namely the RBF kernel spread and SVM regularization
parameter, were selected by 5-fold cross-validation (CV) on
10% of the training data assigned as the tuning subset.
5. Results and discussion
We used 5-fold CV to evaluate the classification performance
in terms of the number of correctly classified test frames. The
results over all CV repetitions using healthy and the PD voices
are reported in the first and the second rows of Table 1,
respectively. Assuming that the most dominant distortion in an
utterance usually affects the majority of frames, we also extend the
proposed method to the recording level by applying a majority
voting algorithm over all frames of a signal. The table reports
the classification accuracy at both the frame and recording levels
in the form mean ± one standard deviation.

¹ The PVA recordings were obtained through Synapse ID [syn2321745].
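The recording-level decision described in this section reduces to a majority vote over the frame-level labels. A minimal sketch is shown below; the function name and the label strings are hypothetical, and ties are resolved in favour of the label that occurs first in the sequence, which is a property of `Counter.most_common` rather than something specified in the text.

```python
from collections import Counter


def recording_label(frame_labels):
    """Return the most frequent frame-level label of one recording."""
    if not frame_labels:
        raise ValueError("at least one frame label is required")
    return Counter(frame_labels).most_common(1)[0][0]
```

For example, a recording whose frames are labelled `["noisy", "clean", "noisy", "coded"]` is assigned the label `"noisy"`.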
The effectiveness of MFCCs in distortion classification
(particularly for noisy frames) can be observed. The results for
healthy voices are consistent with the behavior of MFCCs in the
presence of different types and levels of distortion observed in
Section 2.2. Considering Figure 1, MFCCs of the compressed
signals are, on average, positioned closer to the MFCCs of the
clean signals, while noise, reverberation and peak clipping shift
the MFCCs farther away from the position of clean MFCCs.
Moreover, the covariance of MFCCs extracted from coded sig-
nals is comparable to that of the clean signal, while the MFCCs
for noisy, reverberant and peak clipped signals are more com-
pact in the feature space. Taking these two observations into
account, the MFCCs of the coded and clean frames are more
likely to be overlapping in the feature space which results in
misclassification, particularly when there is speaker variability
in the database.
Although the proposed method is effective in distortion
classification for both healthy and pathological voices, we ob-
serve a degradation in overall classification performance (par-
ticularly for clean frame detection) when the system is eval-
uated using the PD voice database. The first factor affecting
the results is the dysphonia variability in the PD voice database
since the presence of pathologies in speech is related to signal
variability. Moreover, bearing in mind that the recordings in the
PD voice database have been collected over the telephone, these
signals may have already been through one or more codecs.
This means that some coded frames have been presented to the
classifier as “clean” ones during the training phase which will
result in some classification performance degradation.
6. Conclusions
In this study, the impact of four major types of distortion,
namely background noise, reverberation, clipping and speech
compression on MFCCs of the frames of the vowel /a/ has been
analyzed. These distortions are commonly present in voice sig-
nals during recording or transmission in remote pathological
voice assessments. It has been demonstrated experimentally
that introducing different types and levels of distortion to the
vowel results in predictable changes in mean and covariance
matrix of the MFCCs. Motivated by this observation, a new
approach for detecting the dominant type of distortion is pro-
posed, which uses MFCCs as frame-level acoustic features and
an SVM as the classifier. Experimental results using record-
ings of healthy speakers and speakers with PD (as an example
of people with voice disorders) show the effectiveness of the
proposed system in distortion classification. Since the presence
of disorders in speech is closely related to signal variability, a
slight degradation in classification performance has been ob-
served when the PD voices were analyzed.
7. References
[1] I. Titze, Principles of voice production, 2nd ed. Iowa City: Na-
tional Center for Voice and Speech, 1999.
[2] J. Schoentgen and R. De Guchteneere, “Time series analysis of
jitter,” J. Phon., vol. 73, pp. 189–201, 1995.
[3] F. Klingholtz, “Acoustic recognition of voice disorders: a compar-
ative study of running speech versus sustained vowels.” J. Acoust.
Soc. Am., vol. 87, no. 5, pp. 2218–24, may 1990.
[4] A. Tsanas, M. A. Little, P. E. McSharry, J. Spielman, and
L. O. Ramig, “Novel speech signal processing algorithms for
high-accuracy classification of Parkinson’s disease,” IEEE Trans.
Biomed. Eng., vol. 59, pp. 1264–1271, 2012.
[5] R. J. Moran, R. B. Reilly, P. De Chazal, and P. D. Lacy,
“Telephony-based voice pathology assessment using automated
speech analysis,” IEEE Trans. Biomed. Eng., vol. 53, no. 3, pp.
468–477, 2006.
[6] P. A. Mashima and C. R. Doarn, “Overview of telehealth activities
in speech-language pathology,” Telemed. e-Health, vol. 14,
no. 10, pp. 1101–1117, Dec. 2008.
[7] S. Arora, V. Venkataraman, A. Zhan, S. Donohue, K. Biglan,
E. Dorsey, and M. Little, “Detecting and monitoring the symp-
toms of Parkinson’s disease using smartphones: A pilot study,”
Parkinsonism Relat. Disord., vol. 21, no. 6, pp. 650–653, 2015.
[8] R. Fraile, N. Sáenz-Lechón, J. I. Godino-Llorente, and V. Osma-Ruiz,
“Use of Mel frequency cepstral coefficients for automatic
pathology detection on sustained vowel phonations: mathematical
and statistical justification,” in 4th Int. Symp. Image/Video
Commun. over Fixed Mob. Networks, no. 3, 2008.
[9] J. Vasquez-Correa, J. Serra, J. F. Orozco-Arroyave, J. R. Vargas-Bonilla,
and E. Nöth, “Effect of acoustic conditions on algorithms
to detect Parkinson’s disease from speech,” in ICASSP, 2017, pp.
[10] W. Yuan and B. Xia, “A speech enhancement approach based on
noise classification,” Appl. Acoust., vol. 96, pp. 11–19, 2015.
[11] K. El-maleh, A. Samouelian, and P. Kabal, “Frame-level noise
classification in mobile environments,” ICASSP, pp. 237–240,
[12] S. Aleinik and Y. Matveev, “Detection of clipped fragments in
speech signals,” Int. J. Electr. Comput. Energ. Electron. Commun.
Eng., vol. 8, no. 2, pp. 286–292, 2014.
[13] J. Eaton and P. A. Naylor, “Noise-robust detection of peak-
clipping in decoded speech,” ICASSP, pp. 7019–7023, 2014.
[14] J. M. Desmond, L. M. Collins, and C. S. Throckmorton, “Us-
ing channel-specific statistical models to detect reverberation in
cochlear implant stimuli.” J. Acoust. Soc. Am., vol. 134, no. 2, pp.
1112–20, 2013.
[15] A. Dibazar, S. Narayanan, and T. Berger, “Feature analysis
for automatic detection of pathological speech,” in Second Jt.
EMBS/BMES Conf., vol. 1, 2002, pp. 182–183.
[16] M. Sahidullah and G. Saha, “Design, analysis and experimental
evaluation of block based transformation in MFCC computation
for speaker recognition,” Speech Commun., vol. 54, no. 4, pp.
543–565, 2012.
[17] J. R. Deller, J. H. L. Hansen, and J. G. Proakis, Discrete-time
processing of speech signals, 2nd ed. New York: IEEE Press,
[18] J. Harrington and S. Cassidy, Techniques in Speech Acoustics.
Kluwer Academic Publishers, 1999.
[19] N. Krishnamurthy and J. Hansen, “Babble noise: modeling, anal-
ysis, and applications,” IEEE Trans. Audio. Speech. Lang. Pro-
cessing, vol. 17, no. 7, pp. 1394–1407, 2009.
[20] J. B. Allen and D. A. Berkley, “Image method for efficiently sim-
ulating small-room acoustics,” J. Acoust. Soc. Am., vol. 65, no. 4,
p. 943, 1979.
[21] E. A. P. Habets, “Room impulse response generator,” International
Audio Laboratories Erlangen, Tech. Rep., 2010.
[22] J. Gruber and L. Strawczynski, “Subjective effects of variable de-
lay and speech clipping in dynamically managed voice systems,”
IEEE Trans. Commun., vol. 33, no. 8, pp. 801–808, 1985.
[23] P. E. Souza, “Effects of compression on speech acoustics, intel-
ligibility, and sound quality.” Trends Amplif., vol. 6, no. 4, pp.
131–65, 2002.
[24] R. Goldberg and L. Riek, A practical handbook of speech coders.
CRC Press, 2000.
[25] T. Kinnunen and H. Li, “An Overview of Text-Independent Speaker
Recognition: from Features to Supervectors,” Speech Commun.,
vol. 1, 2009.
[26] C. Cortes and V. Vapnik, “Support-vector networks,” Mach.
Learn., vol. 20, no. 3, pp. 273–297, 1995.
[27] J. Hillenbrand, L. A. Getty, M. J. Clark, and
K. Wheeler, “Acoustic characteristics of American English
vowels,” J. Acoust. Soc. Am., 1995. [Online]. Available:
[28] A. K. Ho, R. Iansek, C. Marigliani, J. L. Bradshaw, and S. Gates,
“Speech impairment in a large sample of patients with Parkinson’s
disease.” Behav. Neurol., vol. 11, no. 3, pp. 131–137, 1998.
[29] M. Jeub, M. Schäfer, and P. Vary, “A binaural room impulse response
database for the evaluation of dereverberation algorithms,”
in 16th Int. Conf. Digit. Signal Process., 2009, pp. 1–5.
[30] M. Jeub, M. Schäfer, and H. Krüger, “Do we need dereverberation
for hand-held telephony?” in Int. Congr. Acoust., 2010, pp. 1–7.
[31] C.-W. Hsu, C.-C. Chang, and C.-J. Lin, “A practical guide to
support vector classification,” National Taiwan University, Tech.
Rep., 2016.
... Even though very effective in finding outliers, it is not capable of detecting the type of degradation nor identifying short-term protocol violations in recordings. To identify the type of degradation in pathological voices, Poorjam et al. proposed two different parametric and non-parametric approaches to classify degradations commonly encountered in remote pathological voice analysis into four major types, namely background noise, reverberation, clipping and coding [21,22]. However, the performance of these approaches is limited when new degradation types are introduced. ...
... The major limitation of the multi-class classification-based approaches for identifying the type of degradation in a voice signal [21,22] is that they do not In this approach, the task of each detector is to determine whether a feature vector of the time frame t of a voice signal, x t , was contaminated by the P r e p r i n t corresponding degradation, H 0 , or not, H 1 . The decision about the adherence of each frame of a given speech signal to the hypothesized degradation is then computed as: ...
... To parametrize the recordings for the purpose of degradation detection, we propose to use mel-frequency cepstral coefficients (MFCCs) [76]. The main motivation for choosing a different speech parametrization for degradation detection than that used for PD detection is that not only do the MFCCs convey information about the speech context, but also they encode the presence and the level of degradation in signals due to their sensitivity to small changes in signal characteristics caused by degradation [21,77,78,45]. We have demonstrated in [21] and [45] that degradation in speech signals predictably modifies the distribution of MFCCs by changing the covariance of the features and shifting the mean to different regions in feature space, and the amount of change is related to the degradation level. ...
The performance of voice-based Parkinson’s disease (PD) detection systems degrades when there is an acoustic mismatch between training and operating conditions caused mainly by degradation in test signals. In this paper, we address this mismatch by considering three types of degradation commonly encountered in remote voice analysis, namely background noise, reverberation and nonlinear distortion, and investigate how these degradations influence the performance of a PD detection system. Given that the specific degradation is known, we explore the effectiveness of a variety of enhancement algorithms in compensating this mismatch and improving the PD detection accuracy. Then, we propose two approaches to automatically control the quality of recordings by identifying the presence and type of short-term and long-term degradations and protocol violations in voice signals. Finally, we experiment with using the proposed quality control methods to inform the choice of enhancement algorithm. Experimental results using the voice recordings of the mPower mobile PD data set under different degradation conditions show the effectiveness of the quality control approaches in selecting an appropriate enhancement method and, consequently, in improving the PD detection accuracy. This study is a step towards the development of a remote PD detection system capable of operating in unseen acoustic environments.
... A number of known clipping detection methods are based on precisely this principle; in most cases, the degree of difference between the shape or the parameters of the probability density function (PDF) of the analyzed and undistorted signals is proposed as the measure of the clipping level [1][2][3][4][5][6][7][8]. ...
... The choice of mel-frequency cepstral coefficients as classification features [8] for clipping detection is explained by the specifics of the task being solved, namely the remote (telephone or network) diagnosis of Parkinson's disease from the pronunciation characteristics of the vowel /a/. Evidently, the main drawback of this method is its comparatively high computational complexity. ...
It is shown that the fourth standardized moment (kurtosis) and some of its functional transformations (the reciprocal and the square root of the reciprocal) can serve as objective measures of clipping and of the quality of speech and music signals. The essential advantages of the suggested measures are that they require neither a prior estimate of the probability density of the analyzed signal nor any information about the undistorted signal. Correlation indices were calculated and correspondence maps were created that represent the relationships between estimates of the suggested measures and subjective quality evaluations of clipped sound signals, which enables calibration of the objective measures. It was shown that the correspondence maps of the simple functional transformations of kurtosis can be approximated by polynomials of the first or second order, whereas approximating the correspondence maps of kurtosis itself requires polynomials of the fourth order. This fact, combined with the bounded range of possible values of these measures, means that in engineering applications the functional transformations of kurtosis can be preferable. The suggested measures were compared with different measures represented by the clipping factor. The clipping factor was shown to be less effective than the suggested measures at high clipping levels of speech and music signals.
... Summarizing the results presented above, we can conclude that the human auditory system is slightly more sensitive to the clipping of musical signals than to the clipping of speech signals, but this difference is small. In the future, it will be useful to compare these results with those of [14], [19]-[21] and other papers in order to find out how common the identified phenomenon is. ...
... To solve this problem for the kurtosis parameter, one can calculate the ratio of ... Thus, we can conclude that the studied objective measures, namely the kurtosis, its reciprocal, and the square root of its reciprocal, are practically insensitive to the kind of acoustic signal. Note that a similar situation was previously discovered in studies of phase distortion of speech and music signals [14]. ...
This paper compares the results of subjective and objective assessments of the quality of speech and music signals distorted during clipping when large instantaneous signal values are replaced by a certain threshold constant or by values close to it. It was proposed in recent works to use kurtosis and some of its simple functional transforms such as reciprocal of kurtosis and square root of reciprocal of kurtosis as objective (instrumental) clipping value measures. This paper clarifies the results of a subjective assessment of the quality of speech and music signals distorted by clipping. A comparison of the obtained estimates allows one to conclude that the human auditory system is slightly more sensitive to the clipping of musical signals than to the clipping of speech signals, but this difference is small. Similarly, objective quality measures of clipped signals are almost equally sensitive to the clipping value of speech and music signals. An analysis of the variability of the kurtosis estimates, depending on the time of estimation, showed that the relative standard deviation of the kurtosis estimates is close to 10% for the analysis time interval of 1-40 s.
... In the future, it would be advisable to optimize the developed graphical interface, taking into account the recommendations of professional phoniatricians [9], [10]. ...
Every year, people face a growing number of diseases that require timely detection and diagnosis without causing discomfort to the patient. This need gave rise to systems for the objective and subjective assessment of hearing quality, as well as the first systems for analyzing the state of the vocal tract. Voice pathology affects a fairly large risk group that includes teachers, artists, and call center operators. A particular requirement in developing such systems is that the hardware and software complex built by the engineer should be convenient for a doctor to use and comparable to existing solutions for identifying certain acoustic parameters of the voice. The study of the voice spectrum can be the main diagnostic test for persistent voice disorders and, along with the study of the voice field and vibrometry, makes it possible to determine the form of voice disorders in professionals. Acoustic tests for the presence of high and low singing forms in the singing voice can be highly important in determining a singer's performance and professional prognosis, can serve as a criterion in the diagnosis of persistent voice disorders, and, when used in the early stages of occupational laryngeal diseases, can support preventive measures. MATLAB comes with a large number of tools that facilitate the implementation of many engineering, mathematical, and computational tasks in the development and study of processes in any field of research. These tools provide a large number of basic functions for digital signal analysis, including the analysis of audio signals. Fast Fourier transform (FFT) algorithms are chosen as a basis; with certain modifications, the resulting spectral estimation methods are divided into parametric and nonparametric ones. Here, the nonparametric Welch method and the parametric Burg method are chosen. The user can set the parameters each method requires: for the former, the length of the weight (window) function; for the latter, the order of the autoregressive model. Together, these make it possible to analyze the spectrum of vowel phonemes. The App Designer package provides extensive capabilities for creating interfaces in software development. By working with callback functions, the program can be tuned down to the finest settings, which may seem invisible at first but overall make the program comfortable to work with. It is important to define the behavior of each component. It is often necessary to handle details for which a given component is not itself responsible, for example changing the labels of other components. To simplify code writing, it is important to create m-functions; however, any later edits to them must be reconciled with the main script that calls them. Using this toolbox, a software interface was developed that is divided into two working areas: a time-domain part and a spectral part. In addition, the interface contains controls for the input data and the spectrum analyzer parameters, as well as spectrum analysis tools.
... Earlier approaches to developing an ASR system were based on the interpretation of phonemes for the effective creation and recognition of vowel sounds. However, the development of noise-robust ASR systems has been greatly affected by acoustic environments containing background noise, reverberation, and other distortions caused by interfering signals [27]. Increasing the efficiency of an ASR system in a native language is a primary requirement and depends essentially on a compact representation of information, obtained with various techniques for filtering out noise and undesired information present in an input speech signal. ...
Development of a robust native-language ASR framework is very challenging and is an active area of research. Effective front-end and back-end approaches need to be investigated to tackle environmental differences, high training complexity, and inter-speaker variability in order to build a successful recognition system. In this paper, four front-end approaches, namely mel-frequency cepstral coefficients (MFCC), Gammatone frequency cepstral coefficients (GFCC), relative spectral-perceptual linear prediction (RASTA-PLP), and power-normalized cepstral coefficients (PNCC), have been investigated to generate unique and robust feature vectors at different SNR values. Furthermore, to handle the large training data complexity, parameter optimization has been performed with sequence-discriminative training techniques: maximum mutual information (MMI), minimum phone error (MPE), boosted-MMI (bMMI), and state-level minimum Bayes risk (sMBR). This has been demonstrated by selecting optimal parameter values using lattice generation and by adjusting learning rates. In the proposed framework, four different systems have been tested by analyzing various feature extraction approaches (with or without speaker normalization through vocal tract length normalization (VTLN) on the test set) and classification strategies, with or without artificial extension of the training dataset. To compare the performance of each system, true matched (S1: adult train and test; S2: child train and test) and mismatched (S3: adult train and child test; S4: adult + child train and child test) systems have been demonstrated on a large adult and a very small Punjabi clean speech corpus. Consequently, gender-based in-domain data augmentation is used to moderate acoustic and phonetic variations between adult and children's speech under mismatched conditions. The experimental results show that the framework built on the PNCC + VTLN front end with a TDNN-sMBR-based model and parameter optimization yields a relative improvement (RI) of 40.18%, 47.51%, and 49.87% in the matched, mismatched, and gender-based in-domain augmented systems, respectively, under typical clean and noisy conditions.
... Multiclass classification, on the other hand, can be used to detect different types of degradation. In [18,19], Poorjam et al. proposed two generalized multiclass classification-based approaches for detecting various types of degradation, which were investigated only on pathological voice signals and whose accuracy was still inadequate. Moreover, there is no control over the class assignment in these approaches when a new type of degradation is observed for which the classifier has not been trained. ...
The presence of degradations in speech signals, which causes acoustic mismatch between training and operating conditions, deteriorates the performance of many speech-based systems. A variety of enhancement techniques have been developed to compensate for the acoustic mismatch in speech-based applications. To apply these signal enhancement techniques, however, it is necessary to have prior information about the presence and the type of degradations in speech signals. In this paper, we propose a new convolutional neural network (CNN)-based approach to automatically identify the major types of degradations commonly encountered in speech-based applications, namely additive noise, nonlinear distortion, and reverberation. In this approach, a set of parallel CNNs, each detecting a certain degradation type, is applied to the log-mel spectrogram of audio signals. Experimental results using two different speech types, namely pathological voice and normal running speech, show the effectiveness of the proposed method in detecting the presence and the type of degradations in speech signals, and show that it outperforms the state-of-the-art method. Using score-weighted class activation mapping, we provide a visual analysis of how the network makes decisions when identifying different types of degradation in speech signals by highlighting the regions of the log-mel spectrogram that are most influential for the target degradation.
... A method for detecting clipping by measuring mel-frequency cepstral coefficients was proposed in [6], but an obvious disadvantage of this method is its relatively high computational complexity that can be justified when using equipment for medical purposes. ...
It is shown that the kurtosis and the normalized variance can be used as measures of the clipping value of speech signals. The use of the proposed measures makes it possible to significantly simplify and speed up the clipping value calculations compared to methods that require a preliminary estimate of the probability density function of the analyzed speech signal. Subjective estimates of the quality of clipped speech signals were obtained. Matching maps between the proposed objective measures and the subjective estimates of the quality of clipped speech signals have been built. It was shown that the maps can be well approximated by polynomials of low (1st to 4th) order. This fact indicates that computationally simple algorithms can be constructed for controlling the quality of clipped speech signals.
... A method of clipping detection based on measuring mel-frequency cepstral coefficients is considered in [8]. An obvious disadvantage of this method is its high computational complexity. ...
When transmitting music signals via infocommunication channels, there is a risk of distortion of these signals due to clipping. These distortions lower the subjective assessments of the quality of musical signals, so clipping detection is an important issue. In this paper, the kurtosis and the reciprocal of the kurtosis are proposed as measures for the objective assessment of the clipping degree of musical signals. The matching maps between objective and subjective estimates of the clipping degree are monotonic functions without sharp kinks, which allows not only evaluating the clipping degree but also estimating the quality of music signals. An important advantage of the proposed measures of the clipping degree is that they significantly simplify and speed up the calculations, because no preliminary estimate of the probability density function of the analyzed musical signal is needed.
... A number of known methods of clipping detection are based on this principle and the difference in the shape or parameters of the probability density function (PDF) of the analyzed and undistorted signal is used as a measure of the clipping value in most cases [1], [2], [3], [4], [5]. A method for detecting clipping by measuring mel-frequency cepstral coefficients was proposed in [6], but an obvious disadvantage of this method is its relatively high computational complexity that can be justified when using equipment for medical purposes. ...
In this paper, it is shown that kurtosis and some of its functional transformations are expedient to use as measures of the clipping value of speech signals. In addition to the kurtosis, two more measures of the clipping value are considered: the reciprocal of the kurtosis and the square root of the reciprocal of the kurtosis. The use of the proposed measures makes it possible to significantly simplify and speed up the calculations in comparison with methods that require a preliminary estimate of the probability density function of the analyzed speech signal. It is also shown that the matching maps between the proposed objective measures and the subjective estimates of speech signal quality are well approximated by a linear function or by polynomials of the second and fourth orders. This fact indicates that computationally economical algorithms can be built for controlling the quality of speech signals.
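The three clipping measures named in this abstract (the kurtosis, its reciprocal, and the square root of its reciprocal) can be illustrated with a minimal sketch. This is not the cited authors' implementation: the function names `kurtosis`, `clipping_measures`, and `hard_clip` are hypothetical, symmetric hard clipping is assumed, and Laplace-distributed noise is used only as a heavy-tailed stand-in for a speech signal.

```python
import numpy as np


def kurtosis(x):
    """Fourth standardized moment (non-excess kurtosis) of a 1-D signal."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    variance = np.mean(centered ** 2)
    return np.mean(centered ** 4) / variance ** 2


def clipping_measures(x):
    """Kurtosis plus the two functional transformations named in the abstract."""
    k = kurtosis(x)
    return {
        "kurtosis": k,
        "reciprocal": 1.0 / k,
        "sqrt_reciprocal": np.sqrt(1.0 / k),
    }


def hard_clip(x, threshold):
    """Symmetric hard clipping (an assumption): limit samples to [-threshold, threshold]."""
    return np.clip(x, -threshold, threshold)


# Assumption: Laplace noise as a heavy-tailed stand-in for speech samples.
rng = np.random.default_rng(0)
signal = rng.laplace(size=50_000)

# A lower threshold means heavier clipping; np.inf leaves the signal unclipped.
for threshold in (np.inf, 2.0, 1.0, 0.5):
    m = clipping_measures(hard_clip(signal, threshold))
    print(
        f"threshold={threshold:>4}: kurtosis={m['kurtosis']:.2f}, "
        f"1/kurtosis={m['reciprocal']:.3f}, sqrt(1/kurtosis)={m['sqrt_reciprocal']:.3f}"
    )
```

Under these assumptions the kurtosis decreases as the clipping threshold is lowered, while both transformations increase and stay within (0, 1], which matches the bounded range the abstracts cite as an advantage. None of the measures requires an estimate of the probability density function or access to the undistorted signal.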
Remote, non-invasive and objective tests that can be used to support expert diagnosis for Parkinson's disease (PD) are lacking. Participants underwent baseline in-clinic assessments, including the Unified Parkinson's Disease Rating Scale (UPDRS), and were provided smartphones with an Android operating system that contained a smartphone application that assessed voice, posture, gait, finger tapping, and response time. Participants then took the smartphones home to perform the five tasks four times a day for a month. Once a week, participants had a remote (telemedicine) visit with a Parkinson disease specialist in which a modified UPDRS (excluding assessments of rigidity and balance) was performed. Using statistical analyses of the five tasks recorded using the smartphone from 10 individuals with PD and 10 controls, we sought to: (1) discriminate whether the participant had PD and (2) predict the modified motor portion of the UPDRS. Twenty participants performed an average of 2.7 tests per day (68.9% adherence) for the study duration (average of 34.4 days) in a home and community setting. The analyses of the five tasks differed between those with Parkinson disease and those without. In discriminating participants with PD from controls, the mean sensitivity was 96.2% (SD 2%) and mean specificity was 96.9% (SD 1.9%). The mean error in predicting the modified motor component of the UPDRS (range 11-34) was 1.26 UPDRS points (SD 0.16). Measuring PD symptoms via a smartphone is feasible and has potential value as a diagnostic support tool.
In this paper a novel method for the detection of clipping in speech signals is described. It is shown that the new method has better performance than known clipping detection methods, is easy to implement, and is robust to changes in signal amplitude, size of data, etc. Statistical simulation results are presented.
Reverberation is especially detrimental for cochlear implant listeners; thus, mitigating its effects has the potential to provide significant improvements to cochlear implant communication. Efforts to model and correct for reverberation in acoustic listening scenarios can be quite complex, requiring estimation of the room transfer function and localization of the source and receiver. However, due to the limited resolution associated with cochlear implant stimulation, simpler processing for reverberation detection and mitigation may be possible for cochlear implants. This study models speech stimuli in a cochlear implant on a per-channel basis both in quiet and in reverberation, and assesses the efficacy of these models for detecting the presence of reverberation. This study was able to successfully detect reverberation in cochlear implant pulse trains, and the results appear to be robust to varying room conditions and cochlear implant stimulation parameters. Reverberant signals were detected 100% of the time for a long reverberation time of 1.2 s and 86% of the time for a shorter reverberation time of 0.5 s.
The topic of compression has been discussed quite extensively in the last 20 years (eg, Braida et al., 1982; Dillon, 1996, 2000; Dreschler, 1992; Hickson, 1994; Kuk, 2000 and 2002; Kuk and Ludvigsen, 1999; Moore, 1990; Van Tasell, 1993; Venema, 2000; Verschuure et al., 1996; Walker and Dillon, 1982). However, the latest comprehensive update by this journal was published in 1996 (Kuk, 1996). Since that time, use of compression hearing aids has increased dramatically, from half of hearing aids dispensed only 5 years ago to four out of five hearing aids dispensed today (Strom, 2002b). Most of today's digital and digitally programmable hearing aids are compression devices (Strom, 2002a). It is probable that within a few years, very few patients will be fit with linear hearing aids. Furthermore, compression has increased in complexity, with greater numbers of parameters under the clinician's control. Ideally, these changes will translate to greater flexibility and precision in fitting and selection. However, they also increase the need for information about the effects of compression amplification on speech perception and speech quality. As evidenced by the large number of sessions at professional conferences on fitting compression hearing aids, clinicians continue to have questions about compression technology and when and how it should be used. How does compression work? Who are the best candidates for this technology? How should adjustable parameters be set to provide optimal speech recognition? What effect will compression have on speech quality? These and other questions continue to drive our interest in this technology. This article reviews the effects of compression on the speech signal and the implications for speech intelligibility, quality, and design of clinical procedures.
The support-vector network is a new learning machine for two-group classification problems. The machine conceptually implements the following idea: input vectors are non-linearly mapped to a very high-dimension feature space. In this feature space a linear decision surface is constructed. Special properties of the decision surface ensure high generalization ability of the learning machine. The idea behind the support-vector network was previously implemented for the restricted case where the training data can be separated without errors. We here extend this result to non-separable training data. High generalization ability of support-vector networks utilizing polynomial input transformations is demonstrated. We also compare the performance of the support-vector network to various classical learning algorithms that all took part in a benchmark study of Optical Character Recognition.
For speech enhancement, most existing approaches do not consider the differences between various types of noise, which significantly affect the performance of speech enhancement. In this paper, we propose a novel speech enhancement approach that takes into account the different characteristic statistical properties of various noise types on the basis of noise classification. To classify noise, an effective noise classification method is first developed by exploiting the features of the noise energy distribution in the Bark domain. Then, based on the noise types, the speech enhancement approach is obtained by forming the optimal parameter combinations for the optimally modified log-spectral amplitude (OM-LSA) speech estimator with the improved minima controlled recursive averaging (IMCRA) noise estimator. The parameter combinations, which consist of the smoothing parameters for smoothing the noisy power spectrum and for the recursive averaging in the noise spectrum estimation, as well as the weighting factor for the a priori SNR estimation, are built through the enhancement of noisy speech samples. Finally, extensive experiments are carried out in terms of objective evaluation under various noise conditions, and the experimental results show that the proposed approach yields better performance than the conventional OM-LSA with IMCRA in speech enhancement.
This article concerns the time series analysis of jitter. Jitter involves small fluctuations in glottal cycle lengths. Time series analysis is the statistical processing of data that are recorded over a period of time. Conventionally, the amount of jitter in an analysis interval is estimated by a measure of dispersion of the glottal cycle lengths. The problem is that measures of dispersion only describe the fluctuations in the cycle lengths unambiguously when these are statistically independent. This means that the fluctuations are white noise and that changing the order of the cycles does not change their statistical properties. But it can be shown experimentally that neighbouring cycle lengths are not statistically independent because they are correlated. We therefore studied jitter by means of time series analysis methods. These dispense with the assumption that glottal cycle lengths are statistically independent. They make it possible to distinguish between mean- and short-term perturbations and to remove correlations between neighbouring perturbations. We studied dispersion measures of raw and whitened jitter (i.e., jitter from which correlations had been removed). Jitter time series were obtained from vowels [a], [i], [u] sustained by male and female healthy and dysphonic speakers. Results showed that the inter-speaker differences were smaller for whitened jitter than for raw jitter. Inter-speaker variability was reduced because time series analysis separated random from non-random perturbations.
The demand for digital speech coding algorithms grows every day, fueled by applications such as streaming speech over the Internet, Internet telephone, digital cellular telephony, wireless teleconferencing, and various multimedia applications. Until now, most of the books available on audio coding have been collections of individually authored papers. Others have discussed the fundamental coders, but neglected many of the innovations currently in use. Unlike these books, A Practical Handbook of Speech Coders offers in-depth treatment of the basics of speech coding plus the innovations to the basic methods that make the coders useful and efficient. The authors designed this work for engineers, scientists, and manager who need to understand the emerging speech coding techniques and telecommunication standards. However, it will prove useful to people at all levels of speech coder experience: · If you want to simply download the code for an existing algorithm, this book helps you evaluate the strengths and weaknesses of all publicly available codes and choose the right one, then points you to the Internet location where the code is available for download. · For experts who want to improve on existing coders, this book provides the parameters of current coders and the techniques to improve upon them. You can download an existing algorithm or code it using the algorithmic descriptions in the book, make your innovations, and then test the code with the procedures given. · If you want to become an expert and have some basic knowledge of digital signal processing, you can learn the innovative steps taken by the inventor of each coder, explore the rigorous research techniques needed to develop your own coder, and become proficient in existing vocoder technology.