Content uploaded by Michael S. Lewicki on Oct 01, 2014. Content may be subject to copyright.

... Gammatones have been used to characterize the impulse response of cochlear nerve fibers in cats; specifically, the output of a gammatone filter was shown to be a good predictor of the corresponding fiber's firing probability [De Boer and De Jongh, 1978]. Therefore, each occurrence of auditory kernels φ_m in (4.1) may be interpreted as a population of auditory nerve spikes, with average firing rate encoded by the kernel gain α [Lewicki, 2002]. ...
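The gammatone shape invoked above can be written down directly. Below is a minimal NumPy sketch, assuming a 4th-order filter and the ERB-scaled bandwidth of Glasberg and Moore; the function name and parameter defaults are illustrative, not taken from the cited works:

```python
import numpy as np

def gammatone_kernel(f_c, order=4, fs=16000, duration=0.025):
    """Gammatone kernel g(t) = t^(n-1) * exp(-2*pi*b*t) * cos(2*pi*f_c*t).

    The bandwidth b follows the ERB scale of Glasberg & Moore
    (an assumption here, not specified in the cited work)."""
    b = 1.019 * 24.7 * (4.37 * f_c / 1000.0 + 1.0)   # ERB-scaled bandwidth
    t = np.arange(int(fs * duration)) / fs
    g = (t ** (order - 1)
         * np.exp(-2 * np.pi * b * t)
         * np.cos(2 * np.pi * f_c * t))
    return g / np.linalg.norm(g)   # unit norm, so the gain alpha is comparable

kernel = gammatone_kernel(f_c=1000.0)
```

Normalizing each kernel to unit norm keeps the gain α directly comparable across kernels of different center frequencies.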

... The sparsity of noise in a basis of auditory kernels has no direct physical or perceptual interpretation. When the kernels resemble cochlear filter shapes, each kernel occurrence in the representation can be thought of as a local population of auditory nerve spikes [Lewicki, 2002]. The spike rate is commonly thought to be related to loudness, although the precise relationship appears to be more complex [Moore, 2003]. ...

This thesis deals with signal-based methods that predict how listeners perceive speech quality in telecommunications. Such tools, called objective quality measures, are of great interest in the telecommunications industry to evaluate how new or deployed systems affect the end-user quality of experience. Two widely used measures, ITU-T Recommendations P.862 "PESQ" and P.863 "POLQA", predict the overall listening quality of a speech signal as it would be rated by an average listener, but do not provide further insight into the composition of that score. This is in contrast to modern telecommunication systems, in which components such as noise reduction or speech coding process speech and non-speech signal parts differently. Therefore, there has been growing interest in objective measures that assess different quality features of speech signals, allowing for a more nuanced analysis of how these components affect quality. In this context, the present thesis addresses the objective assessment of two quality features: background noise intrusiveness and speech intelligibility. The perception of background noise is investigated with newly collected datasets, including signals that go beyond the traditional telephone bandwidth, as well as Lombard (effortful) speech. We analyze listener scores for noise intrusiveness and their relation to scores for perceived speech distortion and overall quality. We then propose a novel objective measure of noise intrusiveness that uses a sparse representation of noise as a model of high-level auditory coding. The proposed approach is shown to yield results that correlate highly with listener scores, without requiring training data. With respect to speech intelligibility, we focus on the case where the signal is degraded by strong background noise or very low bit-rate coding. Considering that listeners use prior linguistic knowledge in assessing intelligibility, we propose an objective measure that works at the phoneme level and compares phoneme class-conditional probability estimates. The proposed approach is evaluated on a large corpus of recordings from public safety communication systems that use low bit-rate coding, and is further extended to the assessment of synthetic speech, showing its applicability to a wide range of distortion types. The effectiveness of both measures is evaluated with standardized performance metrics, using corpora that follow established recommendations for subjective listening tests.

... This small degree of correlation between actual neuronal responses implies that the "universal" value of the Minkowski summation exponent should be a little greater than suggested by Shepard, but still a lot lower than infinity (maximum rule). Since this degree of correlation is likely to be shaped by the natural statistics of the world, we suspect that this reflects an overall economical and efficient encoding mechanism underlying perceptual integration of features in the natural world (Field, 1994; Laughlin, Steveninck & Anderson, 1998; Nirenberg, Carcieri, Jacobs & Latham, 2001; Barlow, 2001; Lewicki, 2002). ...

Abstract: The world is rich in sensory information, and the challenge for any neural sensory system is to piece together the diverse messages from large arrays of feature detectors. In vision and auditory research, there has been speculation about the rules governing combination of signals from different neural channels: e.g. linear (city-block) addition, Euclidean (energy) summation, or a maximum rule. These are all special cases of a more general Minkowski summation rule (Cue_1^m + Cue_2^m)^(1/m), where m = 1, 2, and infinity respectively. Recently, we reported that Minkowski summation with exponent m=2.84 accurately models combination of visual cues in photographs [To et al. (2008). Proc Roy Soc B, 275, 2299]. Here, we ask whether this rule is equally applicable to cue combinations across different auditory dimensions: such as intensity, pitch, timbre and content. We found that in suprathreshold discrimination tasks using musical sequences, a Minkowski summation with exponent close to 3 (m=2.95) outperformed city-block, Euclidean or maximum combination rules in describing cue integration across feature dimensions. That the same exponent is found in this music experiment and our previous vision experiments, suggests the possibility of a universal "Minkowski summation Law" in sensory feature integration. We postulate that this particular Minkowski exponent relates to the degree of correlation in activity between different sensory neurons when stimulated by natural stimuli, and could reflect an overall economical and efficient encoding mechanism underlying perceptual integration of features in the natural world.
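The general rule described above is easy to state in code. A minimal sketch, assuming scalar per-dimension cue magnitudes; the function name is illustrative:

```python
import numpy as np

def minkowski_sum(cues, m):
    """Minkowski combination of per-dimension cue magnitudes.

    m = 1: city-block addition; m = 2: Euclidean (energy) summation;
    m -> infinity: maximum rule."""
    cues = np.asarray(cues, dtype=float)
    return float((cues ** m).sum() ** (1.0 / m))

# With the exponent close to 3 reported for the music experiments,
# the combined response lies between Euclidean summation and the max rule:
combined = minkowski_sum([1.0, 1.0], 2.95)
```

For two equal cues the combined value decreases monotonically toward the maximum rule as m grows, which is the sense in which m=2.95 sits "a little greater" than Euclidean summation.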

Nonstationary acoustic features provide essential cues for many auditory tasks, including sound localization, auditory stream analysis, and speech recognition. These features can best be characterized relative to a precise point in time, such as the onset of a sound or the beginning of a harmonic periodicity. Extracting these types of features is a difficult problem. Part of the difficulty is that with standard block-based signal analysis methods, the representation is sensitive to the arbitrary alignment of the blocks with respect to the signal. Convolutional techniques such as shift-invariant transformations can reduce this sensitivity, but these do not yield a code that is efficient, that is, one that forms a nonredundant representation of the underlying structure. Here, we develop a non-block-based method for signal representation that is both time relative and efficient. Signals are represented using a linear superposition of time-shiftable kernel functions, each with an associated magnitude and temporal position. Signal decomposition in this method is a non-linear process that consists of optimizing the kernel function scaling coefficients and temporal positions to form an efficient, shift-invariant representation. We demonstrate the properties of this representation for the purpose of characterizing structure in various types of nonstationary acoustic signals. The computational problem investigated here has direct relevance to the neural coding at the auditory nerve and the more general issue of how to encode complex, time-varying signals with a population of spiking neurons.
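The decomposition described above is closely related to greedy matching pursuit with time-shiftable kernels. A rough sketch of that simpler greedy variant, assuming unit-norm kernels; the paper itself optimizes coefficients and temporal positions more globally, and all names here are illustrative:

```python
import numpy as np

def shift_invariant_mp(signal, kernels, n_spikes):
    """Greedily decompose `signal` into time-shifted, scaled kernels.

    Each iteration picks the (kernel, shift) pair with the largest
    cross-correlation magnitude, records a (kernel index, position,
    magnitude) event -- a 'spike'-like, time-relative code -- and
    subtracts the scaled kernel from the residual.
    Kernels are assumed unit-norm so the correlation value is the
    projection coefficient."""
    residual = signal.astype(float).copy()
    events = []
    for _ in range(n_spikes):
        best = None
        for k, phi in enumerate(kernels):
            corr = np.correlate(residual, phi, mode="valid")
            t = int(np.argmax(np.abs(corr)))
            if best is None or abs(corr[t]) > abs(best[2]):
                best = (k, t, float(corr[t]))
        k, t, a = best
        residual[t:t + len(kernels[k])] -= a * kernels[k]
        events.append((k, t, a))
    return events, residual
```

Because events carry explicit positions, the code is time-relative: shifting the input simply shifts the event times, rather than redistributing energy across block boundaries.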

We apply a Bayesian method for inferring an optimal basis to the problem of finding efficient image codes for natural scenes. The basis functions learned by the algorithm are oriented and localized in both space and frequency, bearing a resemblance to Gabor functions, and increasing the number of basis functions results in a greater sampling density in position, orientation, and scale. These properties also resemble the spatial receptive fields of neurons in the primary visual cortex of mammals, suggesting that the receptive field structure of these neurons can be accounted for by a general efficient coding principle. The probabilistic framework provides a method for comparing the coding efficiency of different bases objectively by calculating their probability given the observed data or by measuring the entropy of the basis function coefficients. The learned bases are shown to have better coding efficiency compared to traditional Fourier and wavelet bases. This framework also provides a Bayesian solution to the problems of image denoising and filling-in of missing pixels. We demonstrate that the results obtained by applying the learned bases to these problems are improved over those obtained with traditional techniques.

Two different procedures are studied by which a frequency analysis of a time-dependent signal can be effected, locally in time. The first procedure is the short-time or windowed Fourier transform, the second is the "wavelet transform," in which high frequency components are studied with sharper time resolution than low frequency components. The similarities and the differences between these two methods are discussed. For both schemes a detailed study is made of the reconstruction method and its stability, as a function of the chosen time-frequency density. Finally the notion of "time-frequency localization" is made precise, within this framework, by two localization theorems.
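The first procedure is straightforward to sketch. A minimal windowed Fourier transform in NumPy, assuming a Hann window; window length and hop are illustrative choices:

```python
import numpy as np

def stft(x, win_len=256, hop=128):
    """Short-time (windowed) Fourier transform.

    Fixed-length windows give the same time resolution at every
    frequency -- the key contrast with the wavelet transform, which
    sharpens time resolution as frequency increases."""
    win = np.hanning(win_len)
    frames = [x[i:i + win_len] * win
              for i in range(0, len(x) - win_len + 1, hop)]
    return np.array([np.fft.rfft(f) for f in frames])

fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)   # one second of a 440 Hz tone
X = stft(x)                       # rows: time frames, columns: frequency bins
```

The time-frequency density mentioned in the abstract corresponds here to the choice of `hop` relative to `win_len`: oversampling (small hop) improves reconstruction stability at the cost of redundancy.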

The time-frequency and time-scale communities have recently developed a large number of overcomplete waveform dictionaries—stationary wavelets, wavelet packets, cosine packets, chirplets, and warplets, to name a few. Decomposition into overcomplete systems is not unique, and several methods for decomposition have been proposed, including the method of frames (MOF), matching pursuit (MP), and, for special dictionaries, the best orthogonal basis (BOB). Basis Pursuit (BP) is a principle for decomposing a signal into an "optimal" superposition of dictionary elements, where optimal means having the smallest l1 norm of coefficients among all such decompositions. We give examples exhibiting several advantages over MOF, MP, and BOB, including better sparsity and superresolution. BP has interesting relations to ideas in areas as diverse as ill-posed problems, abstract harmonic analysis, total variation denoising, and multiscale edge denoising. BP in highly overcomplete dictionaries leads to large-scale optimization problems. With signals of length 8192 and a wavelet packet dictionary, one gets an equivalent linear program of size 8192 by 212,992. Such problems can be attacked successfully only because of recent advances in linear programming by interior-point methods. We obtain reasonable success with a primal-dual logarithmic barrier method and conjugate-gradient solver.
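The l1 principle behind BP maps onto a linear program via the standard split x = u - v with u, v >= 0. A minimal sketch using SciPy's linprog (an assumption of this sketch; the paper's own solver is a primal-dual interior-point method, but the LP is the same):

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(A, b):
    """Minimum-l1 decomposition: minimize ||x||_1 subject to A @ x = b.

    Reformulated as an LP: with x = u - v and u, v >= 0,
    minimize sum(u) + sum(v) subject to [A, -A] @ [u; v] = b."""
    n = A.shape[1]
    c = np.ones(2 * n)              # objective: total l1 mass
    A_eq = np.hstack([A, -A])       # equality constraint A @ (u - v) = b
    res = linprog(c, A_eq=A_eq, b_eq=b, bounds=(0, None), method="highs")
    u, v = res.x[:n], res.x[n:]
    return u - v
```

Note the doubling of variables: a length-8192 signal with a 212,992-atom dictionary becomes an LP in 425,984 nonnegative unknowns, which is why the scale of the interior-point advances mattered.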

We derive new unsupervised learning rules for blind separation of mixed and convolved sources. These rules are nonlinear in the signals and thus exploit high-order spatiotemporal statistics to achieve separation. The derivation is based on a global optimization formulation of the separation problem, yielding a stable algorithm. Different rules are obtained from frequency- and time-domain optimization. We illustrate the performance of this method by successfully separating convolutive mixtures of speech signals.
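For the simpler instantaneous (non-convolutive) case, a nonlinear rule of this family fits in a few lines. This is the textbook natural-gradient Infomax update with a tanh nonlinearity, shown only to illustrate how a nonlinearity brings higher-order statistics into play; it is not the convolutive frequency-/time-domain rules derived in the paper:

```python
import numpy as np

def infomax_ica(X, lr=0.05, n_iter=500):
    """Natural-gradient Infomax for an instantaneous mixture X = A @ S.

    Because tanh is nonlinear, the update depends on statistics beyond
    second order, which is what makes separation possible at all
    (decorrelation alone cannot identify the sources)."""
    n, T = X.shape
    W = np.eye(n)
    for _ in range(n_iter):
        Y = W @ X
        g = np.tanh(Y)
        # Fixed point: E[tanh(y) y^T] = I, i.e. mutually independent outputs
        W += lr * (np.eye(n) - (g @ Y.T) / T) @ W
    return W
```

The convolutive problem treated in the paper replaces the mixing matrix with mixing filters, so the analogous rules act on filter taps (time domain) or per-frequency matrices (frequency domain).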

What use can the brain make of the massive flow of sensory information that occurs without any associated rewards or punishments? This question is reviewed in the light of connectionist models of unsupervised learning and some older ideas, namely the cognitive maps and working models of Tolman and Craik, and the idea that redundancy is important for understanding perception (Attneave 1954), the physiology of sensory pathways (Barlow 1959), and pattern recognition (Watanabe 1960). It is argued that (1) The redundancy of sensory messages provides the knowledge incorporated in the maps or models. (2) Some of this knowledge can be obtained by observations of mean, variance, and covariance of sensory messages, and perhaps also by a method called “minimum entropy coding.” (3) Such knowledge may be incorporated in a model of “what usually happens” with which incoming messages are automatically compared, enabling unexpected discrepancies to be immediately identified. (4) Knowledge of the sort incorporated into such a filter is a necessary prerequisite of ordinary learning, and a representation whose elements are independent makes it possible to form associations with logical functions of the elements, not just with the elements themselves.

We develop model-independent methods for characterizing the information carried by particular features of a neural spike train as it encodes continuously varying stimuli. These methods consist, in essence, of an inverse statistical approach; instead of asking for the statistics of neural responses to a given stimulus we describe the probability distribution of stimuli that give rise to a certain short pattern of spikes. These "response-conditional ensembles" contain all the information about the stimulus that a hypothetical observer of the spike train may obtain. The structure of these distributions thus provides a quantitative picture of the neural code, and certain integrals of these distributions determine the absolute information in bits carried by a given spike sequence. These methods are applied to a movement-sensitive neuron (H1) in the visual system of the blowfly Calliphora erythrocephala. The stimulus is chosen as the time-varying angular velocity of a (spatially) random pattern, and we consider segments of the spike train of up to three spikes with specified spike-intervals. We demonstrate that, with extensive analysis, a single experiment of roughly one hour's duration is sufficient to provide reliable estimates of the relevant probability distributions. From the experimentally determined probability distributions we are able to draw several conclusions. (1) Under the conditions of our experiment, observation of a single spike carries roughly 0.36 bits of information, but spike pairs carry an interval-dependent signal that can be much larger than 0.72 bits; estimates of the total information capacity are in rough agreement with the maximum possible capacity given the signal-to-noise characteristics of the photoreceptors. (2) On average a single spike signals the occurrence of a velocity waveform that is positive (movement in the excitatory direction) at all times before the spike, whereas spike pairs can signal both positive and negative velocities, depending on the inter-spike interval. (3) Although inter-spike intervals are crucial in extracting all the coded information, the code is robust to several millisecond errors in the estimate of spike arrival times. (4) Short spike sequences give reliable information about specific features of the stimulus waveform, and this specificity can be quantified. (5) Our results suggest approximate strategies for reading the neural code—reconstructing the stimulus from observations of the spike train—and some preliminary reconstructions are presented. Some tentative attempts are made to relate our results to the more general questions of coding and computation in the nervous system.
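The simplest statistic of a response-conditional ensemble is its mean, the spike-triggered average used in conclusion (2). A minimal sketch, with illustrative names; the paper characterizes full distributions, not just means:

```python
import numpy as np

def spike_triggered_average(stimulus, spike_times, window=50):
    """Mean stimulus waveform in the `window` samples preceding each spike.

    Collecting the pre-spike segments themselves (rather than averaging)
    gives the full response-conditional ensemble for a single spike."""
    segments = [stimulus[t - window:t]
                for t in spike_times if t >= window]
    return np.mean(segments, axis=0)
```

Extending the conditioning from single spikes to spike pairs with a specified interval, as in the paper, simply means collecting segments keyed on (t, t + interval) pairs instead of single spike times.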