A new phase model for sinusoidal transform coding of speech

Dept. of Electr. Eng., Arizona State Univ., Tempe, AZ
IEEE Transactions on Speech and Audio Processing, 10/1998; 6(5):495-501. DOI: 10.1109/89.709675
Source: IEEE Xplore


A phase modeling algorithm for sinusoidal analysis-synthesis of speech is presented, in which short-time sinusoidal phases are approximated using a combination of linear prediction, spectral sampling, delay compensation, and phase correction techniques. The algorithm differs from phase compensation methods proposed for source-system LPC in that it is tailored to the sinusoidal representation of speech. Performance analysis on a large speech database reveals an improvement in temporal and spectral signal matching, as well as in the subjective quality of reconstructed speech. The method can be applied to enhance phase matching in low bit rate sinusoidal coders, where the underlying sine wave amplitudes are extracted from an all-pole model. Preliminary subjective results are presented for a 2.4 kb/s sinusoidal coder.
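The core idea, sampling sinusoidal phases from an all-pole (LPC) model and adding a linear delay term before summing the sine waves, can be sketched as follows. This is a minimal illustration of the general approach, not the paper's full algorithm (which also includes spectral sampling and phase correction); all function and parameter names here are illustrative.

```python
import numpy as np
from scipy.signal import freqz

def synthesize_frame(amps, freqs_hz, lpc_a, delay, n_samples, fs):
    """Sum-of-sinusoids frame synthesis with phases taken from the
    all-pole model 1/A(z), plus a linear-phase delay term (sketch)."""
    # Radian frequencies of the sine-wave tracks (rad/sample)
    w = 2.0 * np.pi * np.asarray(freqs_hz, dtype=float) / fs
    # Sample the all-pole frequency response at those frequencies
    _, h = freqz(1.0, lpc_a, worN=w)
    phases = np.angle(h) - w * delay       # model phase + delay compensation
    n = np.arange(n_samples)
    frame = np.zeros(n_samples)
    for a, wk, ph in zip(amps, w, phases):
        frame += a * np.cos(wk * n + ph)   # accumulate each sinusoid
    return frame
```

A usage pattern would be to re-estimate `lpc_a`, the amplitudes, and the delay once per analysis frame, then overlap-add the synthesized frames.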

    • "Hence, although the use of zero or minimum phase is attractive for coding because of the bits that are saved, these methods cannot be used for high-quality speech synthesis. All-pass filters [18], [19] have been proposed for improving the quality of the minimum phase approach. However, even with all-pass filtering the resulting speech quality cannot be characterized as being natural. "
    ABSTRACT: Many current text-to-speech (TTS) systems are based on the concatenation of acoustic units of recorded speech. While this approach is believed to lead to higher intelligibility and naturalness than synthesis-by-rule, it has to cope with the issues of concatenating acoustic units that were recorded at different times and in a different order. One important issue related to the concatenation of these acoustic units is their synchronization. In terms of signal processing, this means removing linear phase mismatches between concatenated speech frames. This paper presents two novel approaches to the problem of synchronization of speech frames, with an application to concatenative speech synthesis. Both methods are based on the processing of phase spectra without, however, decreasing the quality of the output speech, in contrast to previously proposed methods. The first method is based on the notion of center of gravity and the second on differentiated phase data. They are applied off-line, during the preparation of the speech database, and therefore place no computational burden on synthesis. The proposed methods have been tested with the harmonic plus noise model, HNM, and the TTS system of AT&T Labs. The resulting synthetic speech is free of linear phase mismatches.
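The center-of-gravity notion mentioned in this abstract can be illustrated with a toy calculation: estimate the energy-weighted temporal center of a frame and compute the shift that would align it to the frame center, removing a linear phase mismatch. This is a simplified sketch of the idea, not the cited paper's exact procedure; the function name is illustrative.

```python
import numpy as np

def center_of_gravity_shift(frame):
    """Return the integer shift that moves the frame's energy
    center of gravity to the middle of the frame (toy sketch)."""
    e = np.asarray(frame, dtype=float) ** 2   # sample energies
    n = np.arange(len(frame))
    cog = np.sum(n * e) / np.sum(e)           # energy-weighted center
    return int(round(len(frame) / 2 - cog))
```

Applying the returned shift (e.g., by circular rotation) to each stored unit off-line would synchronize frames before concatenation.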
    IEEE Transactions on Speech and Audio Processing, 04/2001; 9(3):232-239. DOI: 10.1109/89.905997
    • "Various approaches are employed for coding the phase information in recent harmonic+noise coders. However, these phase coding schemes generally require more than 20 bits per speech segment [1], [3], [9]. "
    ABSTRACT: This paper presents a novel technique for modeling and quantization of the phase information in low-rate harmonic+noise coding. In the proposed phase model, each frequency track is adjusted by a frequency deviation (FD) that reduces the error between measured and predicted phases. By exploiting the intra-frame relationship of the FDs, the phase information is represented more efficiently than by measured phases or by phase prediction residuals. An efficient FD quantization scheme based on closed-loop analysis is also developed. In this scheme, the FD of the first harmonic and a vector of the FD differences are quantized by minimizing a perceptually weighted distortion measure between the measured phases and the quantized phases. The proposed technique reproduces the temporal events of the original speech signal and improves the subjective quality of the synthesized speech using only 13 bits per frame.
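The frequency-deviation idea in this abstract can be sketched numerically: predict a harmonic's end-of-frame phase by linear phase evolution, then find the small frequency offset that would cancel the (wrapped) error between the measured and predicted phase. This is an illustrative reconstruction under assumed conventions (phase in radians, frequency in rad/sample), not the cited paper's exact formulation.

```python
import numpy as np

def frequency_deviation(prev_phase, meas_phase, harmonic_freq, frame_len):
    """Per-harmonic frequency deviation (rad/sample) that makes the
    linearly predicted phase match the measured phase (sketch)."""
    predicted = prev_phase + harmonic_freq * frame_len        # linear phase prediction
    err = np.angle(np.exp(1j * (meas_phase - predicted)))     # wrap error to (-pi, pi]
    return err / frame_len                                    # deviation per sample
```

Quantizing these deviations (first harmonic plus a vector of differences, as the abstract describes) is what brings the phase bit budget down.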
    ABSTRACT: This paper describes the application of the Harmonic plus Noise Model, HNM, to concatenative Text-to-Speech (TTS) synthesis. In the context of HNM, speech signals are represented as a time-varying harmonic component plus a modulated noise component. The decomposition of the speech signal into these two components allows for more natural-sounding modifications (e.g., source and filter modifications) of the signal. The parametric representation of speech using HNM provides a straightforward way of smoothing discontinuities of acoustic units around concatenation points. Formal listening tests have shown that HNM provides high-quality speech synthesis while outperforming other models for synthesis (e.g., TD-PSOLA) in intelligibility, naturalness, and pleasantness.
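The harmonic-plus-noise decomposition described above can be illustrated with a toy synthesis routine: a sum of harmonics of a fundamental plus an additive noise component. Real HNM restricts harmonics below a maximum voiced frequency and spectrally shapes and time-modulates the noise; this minimal sketch, with illustrative names, only shows the two-component structure.

```python
import numpy as np

def hnm_frame(amps, f0, noise_std, n_samples, fs, rng=None):
    """Toy HNM-style frame: harmonic sum at multiples of f0 plus
    white noise (sketch; real HNM shapes and modulates the noise)."""
    rng = np.random.default_rng(0) if rng is None else rng
    n = np.arange(n_samples)
    harm = np.zeros(n_samples)
    for k, a in enumerate(amps, start=1):             # k-th harmonic of f0
        harm += a * np.cos(2.0 * np.pi * k * f0 * n / fs)
    noise = noise_std * rng.standard_normal(n_samples)
    return harm + noise
```

Because the two components are parametric, each can be interpolated separately across a concatenation point, which is what makes the smoothing of unit boundaries straightforward.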