Acoustics, Speech, and Signal Processing, 1988. ICASSP-88., 1988 International Conference on

Published by Institute of Electrical and Electronics Engineers
Three research prototype speech recognition systems are described, all of which use recently developed methods from artificial intelligence (specifically support vector machines, dynamic Bayesian networks, and maximum entropy classification) in order to implement, in the form of an automatic speech recognizer, current theories of human speech perception and phonology (specifically landmark-based speech perception, nonlinear phonology, and articulatory phonology). All three systems begin with a high-dimensional multiframe acoustic-to-distinctive feature transformation, implemented using support vector machines trained to detect and classify acoustic phonetic landmarks. Distinctive feature probabilities estimated by the support vector machines are then integrated using one of three pronunciation models: a dynamic programming algorithm that assumes canonical pronunciation of each word, a dynamic Bayesian network implementation of articulatory phonology, or a discriminative pronunciation model trained using the methods of maximum entropy classification. Log probability scores computed by these models are then combined, using log-linear combination, with other word scores available in the lattice output of a first-pass recognizer, and the resulting combination score is used to compute a second-pass speech recognition output.
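The log-linear combination step described above can be sketched as a weighted sum of log-domain scores, followed by picking the best-scoring hypothesis. This is a minimal illustration only: the function names, the example hypotheses, the scores, and the weights are all hypothetical and do not come from the described systems.

```python
def combine_scores(log_scores, weights):
    """Log-linear combination: weighted sum of log-domain scores."""
    if len(log_scores) != len(weights):
        raise ValueError("need one weight per score")
    return sum(w * s for w, s in zip(weights, log_scores))


def rescore(hypotheses, weights):
    """Return the hypothesis label with the highest combined score.

    hypotheses maps a label to its list of log scores.
    """
    return max(hypotheses, key=lambda h: combine_scores(hypotheses[h], weights))


# Hypothetical lattice entries: [first-pass acoustic, language model, pronunciation model]
hypotheses = {
    "recognize speech": [-120.0, -8.0, -15.0],
    "wreck a nice beach": [-118.0, -14.0, -22.0],
}
weights = [1.0, 10.0, 2.0]  # illustrative weights, not tuned values
best = rescore(hypotheses, weights)
```

In practice the weights would be tuned on held-out data; here they only show how a pronunciation-model score can change the ranking produced by the first pass.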
Motivated by linguistic theories of prosodic categoricity, symbolic representations of prosody have recently attracted the attention of speech technologists. Categorical representations such as ToBI not only bear linguistic relevance, but also have the advantage that they can be easily modeled and integrated within applications. Since manual labeling of these categories is time-consuming and expensive, there has been significant interest in automatic prosody labeling. This paper presents a fine-grained ToBI-style prosody labeling system that makes use of features derived from RFC and TILT parameterization of F0 together with an n-gram prosodic language model for 4-way pitch accent labeling and 2-way boundary tone labeling. For this task, our system achieves pitch accent labeling accuracy of 56.4% and boundary tone labeling accuracy of 67.7% on the Boston University Radio News Corpus.
Prosody is an important cue for identifying dialog acts. In this paper, we show that modeling the sequence of acoustic-prosodic values as n-gram features with a maximum entropy model for dialog act (DA) tagging can perform better than conventional approaches that use a coarse representation of the prosodic contour through acoustic correlates of prosody. We also propose a discriminative framework that exploits preceding context in the form of lexical and prosodic cues from previous discourse segments. Such a scheme facilitates online DA tagging and offers robustness in the decoding process, unlike greedy decoding schemes that can potentially propagate errors. Using only lexical and prosodic cues from 3 previous utterances, we achieve a DA tagging accuracy of 72%, compared to the best-case scenario with accurate knowledge of the previous DA tag, which results in 74% accuracy.
Symbolic representations of prosodic events have been shown to be useful for spoken language applications such as speech recognition. However, a major drawback with categorical prosody models is their lack of scalability due to the difficulty in annotating large corpora with prosodic tags for training. In this paper, we present a novel, unsupervised adaptation technique for bootstrapping categorical prosodic language models (PLMs) from a small, annotated training set. Our experiments indicate that the adaptation algorithm significantly improves the quality and coverage of the PLM. On a test set derived from the Boston University Radio News corpus, the adapted PLM gave a relative improvement of 13.8% over the seed PLM on the binary pitch accent detection task, while reducing the OOV rate by 16.5% absolute.
Evaluation of baroreflex control of heart rate (HR) has important implications in clinical practice of anesthesia and postoperative care. In this paper, we present a point process method to assess the dynamic baroreflex gain using a closed-loop model of the cardiovascular system. Specifically, the inverse Gaussian probability distribution is used to model the heartbeat interval, whereas the instantaneous mean is identified by a linear or bilinear bivariate regression on the previous R-R intervals and blood pressure (BP) measures. The instantaneous baroreflex gain is estimated in the feedback loop with a point process filter, while the RR→BP feedforward frequency response is estimated by a Kalman filter. In addition, the instantaneous cross-spectrum and cross-bispectrum (as well as their ratio) can also be estimated. All statistical indices provide a valuable quantitative assessment of the interaction between heartbeat dynamics and hemodynamics during general anesthesia.
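As a rough illustration of the heartbeat interval model described above, the sketch below evaluates an inverse Gaussian density whose mean comes from a linear regression on past R-R intervals and blood pressure values. The regression coefficients, the shape parameter, and the sample measurements are hypothetical placeholders; the bilinear variant and the point process filter that tracks the parameters over time are not shown.

```python
import math


def inverse_gaussian_pdf(t, mu, lam):
    """Inverse Gaussian density of an interval t > 0 with mean mu and shape lam."""
    if t <= 0.0:
        return 0.0
    return math.sqrt(lam / (2.0 * math.pi * t ** 3)) * math.exp(
        -lam * (t - mu) ** 2 / (2.0 * mu ** 2 * t)
    )


def instantaneous_mean(a0, a, rr_history, b, bp_history):
    """Linear bivariate regression on past R-R intervals and BP values."""
    return a0 + sum(ai * r for ai, r in zip(a, rr_history)) + sum(
        bj * p for bj, p in zip(b, bp_history)
    )


# Hypothetical coefficients and recent measurements (seconds, mmHg).
mu = instantaneous_mean(0.1, [0.5, 0.3], [0.82, 0.80], [0.0005], [120.0])
density = inverse_gaussian_pdf(0.8, mu, lam=50.0)
```

The density gives the likelihood of the next beat arriving after 0.8 s given the recent history, which is the quantity a point process filter would use to update its parameter estimates.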
A top-down task-dependent model guides attention to likely target locations in cluttered scenes. Here, a novel biologically plausible top-down auditory attention model is presented to model such task-dependent influences on a given task. First, multi-scale features are extracted based on the processing stages in the central auditory system, and converted to low-level auditory "gist" features. These features capture rough information about the overall scene. Then, the top-down model learns the mapping between auditory gist features and the scene categories. The proposed top-down attention model is tested on a prominent syllable detection task in speech. When tested on broadcast news-style read speech using the BU Radio News Corpus, the model achieves 85.8% prominence detection accuracy at the syllable level. The results compare well to the reported human performance on this task.
The ability to identify speech acts reliably is desirable in any spoken language system that interacts with humans. Minimally, such a system should be capable of distinguishing between question-bearing turns and other types of utterances. However, this is a non-trivial task, since spontaneous speech tends to have incomplete syntactic, and even ungrammatical, structure and is characterized by disfluencies, repairs and other non-linguistic vocalizations that make simple rule-based pattern learning difficult. In this paper, we present a system for identifying question-bearing turns in spontaneous multi-party speech (ICSI Meeting Corpus) using lexical and prosodic evidence. On a balanced test set, our system achieves an accuracy of 71.9% for the binary question vs. non-question classification task. Further, we investigate the robustness of our proposed technique to uncertainty in the lexical feature stream (e.g. caused by speech recognition errors). Our experiments indicate that classification accuracy of the proposed method is robust to errors in the text stream, dropping only about 0.8% for every 10% increase in word error rate (WER).
In this paper, we compare and validate different probabilistic models of human heartbeat intervals for assessment of electrocardiogram data recorded under varying conditions of posture and pharmacological autonomic blockade. The models are validated using the adaptive point process filtering paradigm and the Kolmogorov-Smirnov test. The inverse Gaussian model was found to achieve the overall best performance in the analysis of autonomic control. We further improve the model by incorporating the respiratory covariate measurements and present dynamic respiratory sinus arrhythmia (RSA) analysis. Our results suggest the instantaneous RSA gain computed from our proposed model as a potential index of vagal control dynamics.
One of the biggest challenges in averaging ECG or EEG signals is to overcome temporal misalignments and distortions, due to uncertain timing or complex non-stationary dynamics. Standard methods average individual leads over a collection of epochs on a time-sample by time-sample basis, even when multi-electrode signals are available. Here we propose a method that averages multi-electrode recordings simultaneously by using spatial patterns and without relying on time or frequency.
Mask-based objective speech-intelligibility measures have been successfully proposed for evaluating the performance of binary masking algorithms. These objective measures were computed directly by comparing the estimated binary mask against the ground truth ideal binary mask (IdBM). Most of these objective measures, however, assign equal weight to all time-frequency (T-F) units. In this study, we propose to improve the existing mask-based objective measures by weighting each T-F unit according to its target or masker loudness. The proposed objective measure shows significantly better performance than two other existing mask-based objective measures.
We study the convergence behavior of the Active Mask (AM) framework, originally designed for segmenting punctate image patterns. AM combines the flexibility of traditional active contours, the statistical modeling power of region-growing methods, and the computational efficiency of multiscale and multiresolution methods. Additionally, it achieves experimental convergence to zero-change (fixed-point) configurations, a desirable property for segmentation algorithms. At its core lies a voting-based distributing function which behaves as a majority cellular automaton. This paper proposes an empirical measure correlated to the convergence behavior of AM, and provides sufficient theoretical conditions on the smoothing filter operator to enforce convergence.
This paper presents an approach for selecting optimal components for discriminant analysis. Such an approach is useful when further detailed analyses for discrimination or characterization require dimensionality reduction. Our approach can accommodate a categorical variable such as diagnosis (e.g. schizophrenic patient or healthy control), or a continuous variable like severity of the disorder. This information is utilized as a reference for measuring a component's discriminant power after principal component decomposition. After sorting each component according to its discriminant power, we extract the best components for discriminant analysis. An application of our reference selection approach is shown using a functional magnetic resonance imaging data set in which the sample size is much less than the dimensionality. The results show that the reference selection approach provides an improved discriminant component set as compared to other approaches. Our approach is general and provides a solid foundation for further discrimination and classification studies.
Sample visual presentation screens; matrix presentation (on the left) and rapid serial visual presentation (on the right) 
Duration to copy 8 phrases for each subject for each scenario for simulation and experiment. This corresponds to the total duration spent on attempting to type all of the phrases. Top bars represent the experimental results and bottom bars represent the simulation results; shorter bars are better. (Consider only the absolute value of the vertical axis values.)
Humans need communication. The desire to communicate remains one of the primary issues for people with locked-in syndrome (LIS). While many assistive and augmentative communication systems that use various physiological signals are available commercially, the need is not satisfactorily met. Brain interfaces, in particular those that utilize event-related potentials (ERP) in electroencephalography (EEG) to detect the intent of a person noninvasively, are emerging as a promising communication interface to meet this need where existing options are insufficient. Existing brain interfaces for typing use many repetitions of the visual stimuli in order to increase accuracy at the cost of speed. However, speed is also crucial and is an integral part of peer-to-peer communication; a message that is not delivered in a timely manner often loses its importance. Consequently, we utilize rapid serial visual presentation (RSVP) in conjunction with language models in order to assist letter selection during the brain-typing process, with the final goal of developing a system that achieves high accuracy and speed simultaneously. This paper presents initial results from the RSVP Keyboard system that is under development. These initial results on healthy and locked-in subjects show that single-trial or few-trial accurate letter selection may be possible with the RSVP Keyboard paradigm.
In many areas of signal processing, the phases of complex-valued random variables are used to estimate system parameters, the magnitudes being discarded. In this paper, we consider the implications of doing this: the loss of statistical information and subsequent increase in asymptotic variance. Two particular cases, those of estimating the phase of the mean of a complex distribution, and estimating the frequency of a complex sinusoid in white noise, are considered. The estimators are motivated by estimation under von Mises distributional assumptions. The asymptotic distributional properties are obtained under general assumptions, and are tested using a small number of simulations.
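To make the trade-off above concrete, the sketch below contrasts two estimates of the phase of the mean: one that uses the full complex samples, and one that keeps only their phases (the circular mean, which is the maximum likelihood estimate of the mean direction under a von Mises model). The function names and the two-sample data set are illustrative assumptions and are not the estimators analyzed in the paper.

```python
import cmath
import math


def phase_of_mean(samples):
    """Phase of the sample mean, using both magnitude and phase."""
    return cmath.phase(sum(samples) / len(samples))


def phase_only_estimate(samples):
    """Circular mean of the phases; the magnitudes are discarded."""
    unit = [cmath.exp(1j * cmath.phase(z)) for z in samples]
    return cmath.phase(sum(unit))


# Two illustrative samples with very different magnitudes.
samples = [2.0 + 0.0j, 0.0 + 0.5j]
full = phase_of_mean(samples)              # dominated by the large sample
phase_only = phase_only_estimate(samples)  # both samples count equally
```

With these samples the full estimate stays close to the phase of the large-magnitude sample, while the phase-only estimate lands midway between the two phases, which shows how discarding magnitudes changes the weighting of the data.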
sqrt(MSE) versus the length of data N.
Principal component analysis (PCA) has been proposed for the estimation of the self-similarity parameter H, namely the Hurst parameter of 1/f processes, and an analytical proof is provided only for H/=0.5 in a recent study [1]. In our paper, we extend this study by deriving explicit expressions and presenting an analytical proof for the range of 0 < H < 0.5 (the anti-persistent part of the fractional Brownian motion). We also show via simulations that the accuracy of the estimated H values may decrease considerably as the theoretical H value increases towards the persistent part (0.5 < H < 1).
The conference proceedings are published in six volumes. Volume I deals with speech processing. Volume II deals with: speech processing; industry technology track; design and implementation of signal processing systems; neural networks for signal processing. Volume III deals with: image and multidimensional signal processing; multimedia signal processing. Volume IV deals with signal processing for communications. Volume V deals with: signal processing education; sensor array and multichannel signal processing; audio and electroacoustics. Volume VI deals with signal processing theory and methods.
the features we found (through heuristic search) to perform best, alone and in various combinations. The results are provided for the decile-quantized data, and unigram target-only models.
While there has been a long tradition of research seeking to use prosodic features, especially pitch, in speaker recognition systems, results have generally been disappointing when such features are used in isolation and only modest improvements have been seen when used in conjunction with traditional cepstral GMM systems. In contrast, we report here on work from the JHU 2002 Summer Workshop exploring a range of prosodic features, using as testbed the 2001 NIST Extended Data task. We examined a variety of modeling techniques, such as n-gram models of turn-level prosodic features and simple vectors of summary statistics per conversation side scored by kth nearest-neighbor classifiers. We found that purely prosodic models were able to achieve equal error rates of under 10%, and yielded significant gains when combined with more traditional systems. We also report on exploratory work on "conversational" features, capturing properties of the interaction across conversation sides, such as turn-taking patterns.
%Pur metrics for the NIST RT'09 dataset (MDM condition) before and after purification (solid and dashed profiles respectively).
There are two approaches to speaker diarization: bottom-up and top-down. Our work on top-down systems shows that they can deliver competitive results compared to bottom-up systems and that they are extremely computationally efficient, but also that they are particularly prone to poor model initialisation and cluster impurities. In this paper we present enhancements to our state-of-the-art, top-down approach to speaker diarization that deliver improved stability across three different datasets composed of conference meetings from five standard NIST RT evaluations. We report an improved approach to speaker modelling which, despite having greater chances for cluster impurities, delivers a 35% relative improvement in DER for the MDM condition. We also describe new work to incorporate cluster purification into a top-down system which delivers relative improvements of 44% over the baseline system without compromising computational efficiency.
The two most important constant modulus criteria are studied and compared, exploiting recently obtained results. A theoretical analysis of the performance is provided and excess output MSE figures are derived. The answer to the title question is found to depend on the output error power. In applications where the lower bound for the output signal-to-noise ratio is small, typically less than 8 dB, the CM(2,2) criterion can be employed. Otherwise, the CM(1,2) criterion is preferable. This result can be of great help to system designers, to select the constant modulus criterion that best suits their application and their performance objectives.
This paper presents a new hardware implementation of additive synthesis for high quality musical sound generation. The single-chip configuration is capable of performing 1,200 sinusoid real-time synthesis; the system is expandable to 13,200 partials by series connecting 11 chips. Each sinusoid is generated by a marginally stable second order IIR filter, and its frequency, amplitude and phase can be independently specified. The system is clocked at 60 MHz when working with a 44.1 kHz sampling rate. Two completely independent channels are available as output, and each sample relies on a 20 bit representation to achieve an SNR of at least 110 dB, thanks to the internal 24 bit word length. The IC is designed in a 0.5 μm CMOS technology and has a core area of approximately 19 mm².
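The idea of generating each sinusoid with a marginally stable second-order IIR filter can be sketched in software as the recursion y[n] = 2 cos(w) y[n-1] - y[n-2], whose poles lie on the unit circle. The code below is a floating-point illustration with hypothetical function names; it does not model the chip's fixed-point word lengths or its scheduling.

```python
import math


def resonator(freq_hz, amp, phase, fs, n_samples):
    """Generate amp * sin(w * n + phase) with a two-pole recursion."""
    w = 2.0 * math.pi * freq_hz / fs
    coeff = 2.0 * math.cos(w)
    # Initial state chosen so that the first output equals amp * sin(phase).
    y1 = amp * math.sin(phase - w)        # y[-1]
    y2 = amp * math.sin(phase - 2.0 * w)  # y[-2]
    out = []
    for _ in range(n_samples):
        y0 = coeff * y1 - y2  # one multiply and one subtract per sample
        out.append(y0)
        y2, y1 = y1, y0
    return out


def additive_synthesis(partials, fs, n_samples):
    """Sum the outputs of one resonator per (freq, amp, phase) partial."""
    voices = [resonator(f, a, p, fs, n_samples) for f, a, p in partials]
    return [sum(v[n] for v in voices) for n in range(n_samples)]


# Clock cycles available per output sample at the stated rates.
cycles_per_sample = 60e6 / 44100.0
```

At the stated rates there are roughly 1360 clock cycles per output sample, which is consistent with updating 1,200 resonators in each sample period.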
Signal Flow Graph of the 8 point IDCT.
Block diagram of the architecture.
We have designed and fabricated a low power IC to perform the inverse 8×8 DCT transform according to the CCITT precision specifications, suitable for portable video communication devices. Several design techniques have been used to reduce the power, such as a fast algorithm, an architecture that can exploit input signal correlation, and a large amount of parallelism. The chip is fabricated in a triple metal 0.5 μm gate array CMOS technology. The maximum throughput is 400 Kpix/s at 1.1 V, and 27 Mpix/s at 3.3 V. The measured power consumption is 35 μW for typical image sequences in color QCIF format at 10 frames/sec with a 1.1 V power supply, making this device ideal for low power portable applications.
The recently-introduced waveform interpolation (WI) coders provide good-quality speech at low rates but may be too complex for commercial use. This paper proposes new approaches to low-complexity WI speech coding at rates of 1.2 and 2.4 kbps. The proposed coders are 4 to 5 times faster than the previously reported ones. At 2.4 kbps, the complexity is about 7.5 and 2.5 MFLOPS for the encoder and decoder, respectively. At 1.2 kbps, the complexity is about 6 and 2.3 MFLOPS for the encoder and decoder, respectively. Informal subjective evaluation shows that, at 2.4 kbps, the quality is close to that of the high-complexity coders. The quality does not significantly degrade at 1.2 kbps and it is considered sufficient for messaging applications
This paper describes our new mixed excitation linear predictive (MELP) coder designed for very low bit rate applications. This new coder, through algorithmic improvements and enhanced quantization techniques, produces better speech quality at 1.7 kb/s than the new U.S. Federal Standard MELP coder at 2.4 kb/s. Key features of the coder are an improved pitch estimation algorithm and a line spectral frequencies (LSF) quantization scheme that requires only 21 bits per frame. With channel coding, this new MELP coder is capable of maintaining good speech quality even in severely degraded channels, at a total bit rate of only 3 kb/s
In mobile communications, smart antenna systems that utilize an antenna array and perform advanced signal processing techniques can achieve greater channel capacity and improve link quality by selective reception/transmission at the base station. However, development of adaptive signal processing algorithms for smart antenna system applications requires accurate knowledge of the multichannel propagation characteristics. A few experimental results on the spatial signature variation of a uniform linear array (ULA) at 900 MHz have been reported. This paper presents the experimental results on the channel propagation characteristic variations of a 1.8 GHz smart antenna system using a uniform circular array (UCA) in moving mobile scenarios. The results indicate the stability of the directions-of-arrival (DOAs) of multipath components in all scenarios, and the instability of the spatial signatures in scenarios with strong multipaths.
The modifications and improvements of the acoustic recognition component of the SPICOS system for the DARPA naval resource management task are described. These modifications and improvements include: the modeling of the continuous mixture densities of the acoustic vectors, the choice of suitable context-dependent phoneme units and the construction of generalized context phoneme units, and the modeling of transitional information in the acoustic vector. The experimental results show that critical factors are the acoustic resolution of the probability distributions and the context information captured in the acoustic vectors. By these enhancements, the system was able to attain a word error rate of 23.6% and 26.5% on two test sets in speaker-independent recognition mode, when trained on 80 speakers. The word pair grammar reduced the word error rate to 7.1% and 9.3% respectively
The system was trained in a speaker dependent mode on 28 minutes of speech from each of 8 speakers, and was tested on independent test material for each speaker. The system was tested with three artificial grammars spanning a broad perplexity range. The average performance of the system measured in percent word error was: 1.4% for a pattern grammar of perplexity 9, 7.5% for a word-pair grammar of perplexity 62, and 32.4% for a null grammar of perplexity 1000
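Percent word error figures such as those above are conventionally computed from the minimum number of word substitutions, deletions, and insertions needed to turn the recognizer output into the reference transcript. The sketch below shows that standard edit-distance calculation; the function name and example sentences are hypothetical, and the exact scoring rules used in the original evaluation are not stated in the abstract.

```python
def word_error_rate(reference, hypothesis):
    """Percent word error: word-level edit distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return 100.0 * prev[-1] / len(ref)


# One deletion ("all") and one insertion ("the") against a 5-word reference.
example = word_error_rate("show all ships in port", "show ships in the port")
```

Because insertions count as errors, this measure can exceed 100% when the hypothesis is much longer than the reference.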
An algorithm based on hidden Markov models is applied to the task of speaker-independent continuous-speech recognition for a vocabulary of 1000 words with no syntactic constraints. The signal is limited to 4000 Hz. Word models were built from three-state representations of phonetic units, concatenated according to entries in a lexicon. Performance as measured on DARPA's resource management database was 40% correct word recognition. It was found that the use of several different acoustic features and the use of word-specific phonetic modeling, where possible, improved system performance.
A database of continuous read speech has been designed and recorded within the DARPA strategic computing speech recognition program. The data is intended for use in designing and evaluating algorithms for speaker-independent, speaker-adaptive and speaker-dependent speech recognition. The data consists of read sentences appropriate to a naval resource management task built around existing interactive database and graphics programs. The 1000-word task vocabulary is intended to be logically complete and habitable. The database, which represents over 21000 recorded utterances from 160 talkers with a variety of dialects, includes a partition of sentences and talkers for training and for testing purposes
Some results obtained when the recognition vocabulary size of a phoneme-based speaker-dependent continuous-speech recognizer was increased from 1000 to 10000 words are reported. The potential search space increased from 46000 to 516000 states without problems for the data-driven search. Increasing the recognition vocabulary by a factor of 10 (from a perplexity of 917 to 9686) increased the word error rate by a factor of two (from 21.8% to 43.1%). Phoneme models were tested with both discrete probabilities and continuous mixture densities. The mixture density models performed better; moreover, they saved about half of the search costs. A language model was found to be very important for a larger vocabulary size. With a test set perplexity of 388 (i.e. a reduction by a factor of 25 compared to the case without a bigram model) the error rate decreased by a factor of 2.4. In order to check how meaningful perplexity is for the prediction of the system's performance, a stochastic language model was constructed with a perplexity of 1000, the size of the vocabulary used in previous experiments, and about the same error rate was obtained
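Perplexity, as used above, is the inverse geometric mean of the probabilities a language model assigns to the words of a test text, so a model that spreads probability uniformly over V words has perplexity V. The sketch below shows that definition and rechecks the ratios quoted in the abstract; the function name and the uniform example are illustrative assumptions.

```python
import math


def perplexity(word_probs):
    """Inverse geometric mean of the per-word model probabilities."""
    if not word_probs:
        raise ValueError("need at least one word probability")
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))


# A uniform model over 1000 words scores every word at 1/1000.
uniform_ppl = perplexity([1.0 / 1000.0] * 50)

# Ratios quoted in the abstract, recomputed from its own figures.
perplexity_reduction = 9686.0 / 388.0  # bigram model vs. no language model
error_increase = 43.1 / 21.8           # 10000-word vs. 1000-word vocabulary
```

The recomputed ratios come out at about 25 and about 2, matching the factors stated in the text.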
The authors describe some techniques for boosting the accuracy of a 100000-word Japanese speech recognition system. They review the structure and performance of the system and introduce two methods for improving the recognition accuracy. One method involves the selection of the training word set, and the other involves the number of word candidates selected by the preselection method. A method for reducing the amount of computation is also proposed. The performance of the improved system is evaluated. The top-20 recognition error rate was 1.5% for 10000 test utterances of five females and five males. The improved system could reduce top-20 recognition errors by 75% without a significant increase in computation
Speech recognition for a vocabulary of 100000 words is described. Acoustic-segment networks are used as word templates in recognition. The acoustic-segment networks are automatically generated from orthographic strings of the words using rules that account for several kinds of variations in speech. To reduce the amount of computation in recognition, a tree representation of the networks and a preselection method based on input-frame sampling are used. It is confirmed that 98.75% of the computation can be eliminated without a significant increase of error, when using the preselection which outputs 500 candidates for main matching. Top-20 recognition accuracy is 93.5% for 10000 test utterances of five males and five females
Traditionally, equalization is performed individually for 10GBASE-T, and FEXT is treated as noise to be cancelled at the receiver. However, FEXT contains information about the symbols transmitted from remote transmitters and it can be viewed as a signal rather than noise to facilitate signal recovery. This paper proposes a MIMO (multi-input multi-output) equalization technique to deal with FEXT in 10GBASE-T. In the proposed MIMO technique, FEXT is treated as signal, which improves SNR. Instead of using long FEXT cancellers, a short MIMO-DFE is used to remove post-cursor ISI. Our simulation results show that, by using the proposed MIMO equalization, we are able to achieve an SNR (signal-to-noise ratio) improvement of around 0.5-9 dB with 13% less complexity than the traditional equalization technique in a twisted-pair channel environment.
Line spectrum pair (LSP) representation of linear predictive coding (LPC) parameters is widely used in speech coding applications. An efficient method for LPC to LSP conversion is Kabal's method. In this method the LSPs are the roots of two polynomials P'_p(x) and Q'_p(x), and are found by a zero crossing search followed by successive bisections and interpolation. The precision of the obtained LSPs is higher than required by most applications, but the number of bisections cannot be decreased without compromising the zero crossing search. In this paper, it is shown that, in the case of 10th-order LPC, five intervals each containing only one zero crossing of P'_10(x) and one zero crossing of Q'_10(x) can be calculated, avoiding the zero crossing search. This allows a trade-off between LSP precision and computational complexity, resulting in considerable computational saving.
A class of practical fast algorithms is introduced for the discrete cosine transform (DCT). For an 8-point DCT only 11 multiplications and 29 additions are required. A systematic approach is presented for generating the different members in this class, all having the same minimum arithmetic complexity. The structure of many of the published algorithms can be found in members of this class. An extension of the algorithm to longer transformations is presented. The resulting 16-point DCT requires only 31 multiplications and 81 additions, which is, to the authors' knowledge, less than required by previously published algorithms
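For scale, a direct evaluation of the N-point DCT-II from its definition costs N² multiplications (64 for N = 8, against the 11 quoted above). The sketch below is only that direct, unnormalized reference form, useful for checking the output of a fast implementation; it is not the fast algorithm introduced in the paper, and the function name is hypothetical.

```python
import math


def dct_ii(x):
    """Unnormalized DCT-II computed directly from the definition."""
    n = len(x)
    return [
        sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n))
        for k in range(n)
    ]


# A constant block has all of its energy in the DC coefficient.
flat = dct_ii([1.0] * 8)

# Multiplications used by the direct form for an 8-point transform.
direct_multiplies = 8 * 8
```

A fast algorithm's output can be compared against this reference coefficient by coefficient, after accounting for whatever scaling convention the fast version uses.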
The performance of error protected transformed binary pulse excited (TBPE) coders over Rayleigh fading channels is investigated. The TBPE coder is a stochastically excited linear predictive coding (LPC) coder which produces near-toll quality speech in the vicinity of 8 kb/s utilizing an extremely simple excitation search procedure. A TBPE coder operating at 7.4 kb/s is used, and two methods of embedding channel coding into the speech coder are studied, aiming at the overall bit rate of 11.4 kb/s specified for the half-rate GSM coder. A method of embedding Reed-Solomon codes into speech coding is presented, and it is compared with rate-compatible punctured convolutional (RCPC) codes. Results show the superiority of the proposed scheme over RCPC codes in both performance and implementation simplicity
The application of higher-order spectral methods to certain multichannel inverse problems where the observed time signals are linear combinations of unknown sources is proposed. The transfer matrix is static and known beforehand but is highly ill-conditioned, leading to the failure of standard least squares regularization techniques or low rank approximation when there is additive noise. The noise suppression properties of third order cumulants are used, along with the ability to obtain multiple estimates by extracting individual signals from their third order cross-correlations via cross-bicepstrum operations. Variations of this approach are presented, and normalization issues are discussed
This paper presents a 1.2 kbps speech coder based on the mixed excitation linear prediction (MELP) analysis algorithm. In the proposed coder, the MELP parameters of three consecutive frames are grouped into a superframe and jointly quantized to obtain a high coding efficiency. The interframe redundancy is exploited with distinct quantization schemes for different unvoiced/voiced (U/V) frame combinations in the superframe. Novel techniques for improving performance make use of the superframe structure. These include pitch vector quantization using pitch differentials, joint quantization of pitch and U/V decisions and LSF quantization with a forward-backward interpolation method. Subjective test results indicate that the 1.2 kbps speech coder achieves approximately the same quality as the proposed federal standard 2.4 kbps MELP coder
A new 1200 bps speech coder designed with a tree searched multistage matrix quantization scheme is proposed. To improve speech quality and reduce the average bit rate, we have developed a new residual multistage matrix quantization method with the joint design technique. The new,joint design algorithm reduces the codebook training complexity. Other new techniques for improving the performance include joint quantization of pitch and voiced/unvoiced/mixed decisions and gain interpolation. For the new matrix quantization based speech coder (MQBC), the listening tests have proven that an efficient and high quality coding has been achieved at bit rate 1200 bps. Test results are compared with the 2400 bps LPC10e coder and the new 2400 bps MELP coder which has been chosen as the new 2400 bps Federal Standard
An approach to wideband digital audio compression of CD-quality signals at data rates of 128 kb/s per channel and below is presented. A form of adaptive transform coding, this technique features a nonuniform frequency division and coding scheme to exploit known characteristics of human perception. The algorithm has low computational complexity and can be adapted for use at other bit rates. A windowed overlap-add process is used with the forward/inverse transforms, which have been efficiently implemented using FFTs. Transform coefficients are converted into a subband block-companded format consisting of exponent words and associated mantissas, which are then coded with an adaptive quantizer. A real-time, single-chip programmable digital signal processing (DSP) implementation encodes 48-kHz-sampled stereo audio signals at a variety of bit rates. At 128 kb/s, the coder's subjective performance is appropriate for highest-quality 15-kHz professional audio applications.
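The exponent-and-mantissa format mentioned in the abstract above can be illustrated with a minimal block floating-point sketch. This is not the coder's actual scheme: the function names, the 8-bit mantissa width, the exponent limit, and the assumption that coefficients lie in (-1, 1) are all hypothetical choices made for illustration.

```python
import numpy as np

def block_compand(coeffs, mantissa_bits=8, max_exp=15):
    """Encode one block as a shared exponent plus integer mantissas.

    Assumes |coeffs| < 1.0 (hypothetical normalization); larger values clip.
    """
    peak = np.max(np.abs(coeffs))
    # Shared exponent: number of doublings that keep the block peak below 1.0.
    exp = 0
    while exp < max_exp and peak * 2.0 ** (exp + 1) < 1.0:
        exp += 1
    scale = 2 ** (mantissa_bits - 1)
    # Scale the block up by 2**exp, then quantize uniformly to signed integers.
    mant = np.round(coeffs * 2.0 ** exp * scale)
    mant = np.clip(mant, -scale, scale - 1).astype(int)
    return exp, mant

def block_expand(exp, mant, mantissa_bits=8):
    """Invert block_compand up to quantization error."""
    scale = 2 ** (mantissa_bits - 1)
    return mant / scale / 2.0 ** exp
```

Because the exponent is shared across the block, small-amplitude blocks keep more effective resolution than a fixed-point quantizer with the same mantissa width would give them.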
H.264 intra coding flow  
Nine modes for intra 4x4 prediction  
This paper presents an HD720p, 30 frames per second H.264 intra encoder that operates at 61 MHz with a gate count of just 72 K. We achieve the low cost and low operating frequency with highly utilized variable pixel scheduling and a modified three-step fast algorithm. The resulting design needs only half the operating frequency and reduces area cost by 30% compared to the previous HD720p intra encoder design.
This paper describes a wideband (7 kHz) speech compression scheme operating at a bit rate of 13.0 kbit/s, i.e. 0.8 bit per sample. We apply a split-band (SB) technique, where the 0-6 kHz band is critically subsampled and coded by an ACELP approach. The high frequency signal components (6-7 kHz) are generated by an improved high-frequency-resynthesis (HFR) at the decoder such that no additional information has to be transmitted. In informal listening tests, the subjective speech quality was rated to be comparable to the CCITT G.722 wideband codec at 48 kbit/s
In this paper, we present a combined speech quality enhancement solution for IS-136 systems. Since echo and background noise are the two major factors that adversely affect speech quality in most transmission systems, our solution consists of echo cancellation and noise reduction elements. These are used in conjunction with the IS-641 coder (the enhanced full-rate standard for IS-136 systems) to form an integrated speech-processing unit. The echo canceller uses the normalized least mean square (NLMS) method. Because of the high levels of background noise, variable step-size techniques are employed. The noise reduction is a single-microphone method that uses a spectral amplitude enhancement gain function with minimal spectral distortion. It is applied in the pre-compression configuration, after the echo canceller on the send path, so that it reduces residual echo as well as noise.
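As a rough illustration of the NLMS method named in the abstract above, the following is a minimal textbook sketch of an adaptive echo canceller. It uses a fixed step size `mu` rather than the variable step-size techniques the paper describes, and the function name, tap count, and regularization constant `eps` are assumptions made for this example.

```python
import numpy as np

def nlms_echo_canceller(far_end, mic, num_taps=32, mu=0.5, eps=1e-8):
    """Return the echo-cancelled signal and the final filter weights.

    far_end: reference signal sent to the loudspeaker.
    mic: microphone signal containing the echo of far_end.
    """
    w = np.zeros(num_taps)      # adaptive estimate of the echo path
    buf = np.zeros(num_taps)    # most recent far-end samples, newest first
    err = np.zeros(len(mic))
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = far_end[n]
        echo_est = w @ buf
        err[n] = mic[n] - echo_est
        # Normalized LMS update: step scaled by the input energy in the buffer.
        w += mu * err[n] * buf / (buf @ buf + eps)
    return err, w
```

The normalization by the buffer energy keeps the update stable when the far-end level changes, which is the main reason NLMS is usually preferred over plain LMS for speech signals.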
Optimum detection in randomly time-varying channels requires an efficient adaptive receiver structure. The complexity of signal processing in the receiver is limited by the amount of processing power and power consumption of the receiver. Therefore, efficiency and convergence of adaptive matched filtering and equalization techniques are very important. We extend common results in tracking theory for system identification to the equalization case for systems with short impulse response. Unlike in system identification where a steady-state error energy is minimized, the optimization criterion here is the minimum of the BER. However, they are related by a monotone function and therefore minimizing the BER is equivalent to minimizing the steady-state-error energy. Optimum parameters for LMS as well as RLS algorithms are derived and simulation results indicate that under the conditions defined in the TDMA standards and small delay spread, the performance of the two methods is comparable
In this paper, we describe the enhanced full rate (EFR) speech codec that has recently been standardised for the North American TDMA digital cellular system (IS-136). The EFR codec, specified in the IS-641 standard, has been jointly developed by Nokia and University of Sherbrooke. The codec consists of 7.4 kbit/s speech (source) coding and 5.6 kbit/s channel coding (error protection) resulting in a 13.0 kbit/s gross bit-rate in the channel. Speech coding is based on the ACELP algorithm (algebraic code excited linear prediction). The codec offers speech quality close to that of wireline telephony (G.726 32 kbit/s ADPCM used as a wireline reference) and provides a substantial improvement over the quality of the current speech channel. The improved speech quality is not only achieved in error-free conditions, but also in typical cellular operating conditions including transmission errors, environmental noise, and tandeming of speech codecs
This paper describes the low-complexity 14 kHz audio coding algorithm which has been recently standardized by ITU-T as Recommendation G.722.1 Annex C ("G.722.1C"). The algorithm is an extension to ITU-T Recommendation G.722.1 and doubles the G.722.1 algorithm to permit 14 kHz audio bandwidth using a 32 kHz audio sample rate, at 24, 32, and 48 kbit/s. The G.722.1C codec features very high audio quality and extremely low computational complexity compared to other state-of-the-art audio coding algorithms. This codec is suitable for use in video conferencing, teleconferencing, and Internet streaming applications. Subjective test results from the characterization phase of G.722.1C are also presented in the paper.
This paper describes a new 14 kb/s wideband speech coder. The coder uses a split-band approach, where the input signal, sampled at 16 kHz, is split into two equal frequency bands from 0-4 kHz and 4-8 kHz, each of which is decimated to an 8 kHz sampling rate. The lower band is coded with a high-quality narrowband speech coder, the 11.8 kb/s G.729 Annex E, while the higher band is represented by a simple but effective parametric model. Two new features facilitate efficient coding of the high-band signal: noise modulation and high-frequency reversal. Since the encoding of the lower band is independent of the high-band signal, the narrowband encoder output can be embedded in the overall bitstream. Subjective test results show that this wideband speech coder is capable of producing high quality output speech
The development of an adaptive vector quantizer (AVQ) CODEC chip using open architecture silicon implementation system (OASIS) tools is presented. The goal of this effort is the development of a 16-kbps AVQ coder prototype system which can be used as an alternative to the currently used 32-kbps adaptive differential pulse code modulation (ADPCM). The AVQ semi-custom chip layout uses 1.2-μm SCMOS technology and consists of over 157000 transistors. The AVQ system is being designed with Mentor Graphics tools and is a 4-layer PC board encompassing over 50 analog and digital components.
We present a 16 kb/s CELP coder with a complexity as low as 3 MIPS. The main thrust is to reduce the complexity as much as possible while maintaining toll quality. This low-complexity CELP (LC-CELP) coder has the following features: (1) fast LPC quantization, (2) 3-tap pitch prediction with efficient open-loop pitch search and predictor tap quantization, (3) backward-adaptive excitation gain, and (4) a trained excitation codebook with a small vector dimension and a small codebook size. Most CELP coders require one full DSP chip or even two to implement in real time. In contrast, 3 to 6 full-duplex LC-CELP coders can fit into a single DSP chip, since each takes only around 3 MIPS to implement. This coder achieved slightly higher mean opinion scores (MOS) than the CCITT 32 kb/s ADPCM. It also exhibits good performance when tandemed with itself or transcoded with other coders.
The authors present a real-time prototype of a 16-kb/s subband speech coder, implemented on a single digital signal processor, for application to digital portable radio communications. The system architecture, coding algorithm, firmware, and hardware of this coder are addressed. Careful attention has been given to the many special design requirements for interfacing to a time-division multiple-access radio link. The coder uses nonlinear quantization to improve speech quality with negligible added complexity. A single-board full-duplex prototype, based on a TMS320C25, has been designed and delivers good speech quality.
A real-time 16 kb/s waveform coder which uses a combination of subband analysis/synthesis and vector quantization (VQ) to achieve a one-way communication delay of 20 ms is described. Gain-shape decomposition is used to vector-quantize the outputs of a generalized quadrature mirror filter bank. The short-time filter bank energy (gain) is used to dynamically allocate bits for coding the gain-normalized filter outputs (shape). The coder-decoder, implemented on a single AT&T DSP32 signal processing chip with 256 kB of off-chip memory for codebook storage, has a typical segmental SNR of 20.44 dB for sentences outside the training set
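The gain-shape decomposition mentioned in the abstract above can be sketched generically as follows. This is a textbook gain-shape vector quantizer, not the paper's coder: it omits the filter bank and the dynamic bit allocation, and the function names, the unit-norm shape codebook, and the scalar gain levels are hypothetical.

```python
import numpy as np

def gain_shape_encode(x, shape_codebook, gain_levels):
    """Quantize x as (gain index, shape index).

    shape_codebook: array of unit-norm codewords, one per row (assumed).
    gain_levels: 1-D array of allowed gain values (assumed).
    """
    gain = np.linalg.norm(x)
    shape = x / gain if gain > 0 else x
    # For unit-norm codewords, the nearest shape maximizes the inner product.
    shape_idx = int(np.argmax(shape_codebook @ shape))
    gain_idx = int(np.argmin(np.abs(gain_levels - gain)))
    return gain_idx, shape_idx

def gain_shape_decode(gain_idx, shape_idx, shape_codebook, gain_levels):
    """Reconstruct the vector from the two indices."""
    return gain_levels[gain_idx] * shape_codebook[shape_idx]
```

Separating gain from shape lets one shape codebook serve signals of very different levels, which keeps codebook storage small for a given quality.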