Streaming Waveform Data Processing by Hermite Expansion for Text-Independent Speaker Indexing from Continuous Speech
Andrey S. Krylov, Danil N. Kortchagine, Alexey S. Lukin
Faculty of Computational Mathematics and Cybernetics, Moscow State University
Moscow, Russia
Abstract
In this paper we consider a new projection scheme of streaming waveform data processing for text-independent speaker indexing from continuous speech. It is based on an expansion into a series of eigenfunctions of the Fourier transform. Parts of this scheme can also be used for speech recognition.
Keywords: Fourier transform, Hermite functions, wave
processing, speaker recognition.
1. INTRODUCTION
Fourier analysis plays a very important role in wave processing and wave analysis. At the same time, wave parameterization, i.e. coding wave information by some kind of mathematical formula, enables many wave processing procedures to be performed most effectively. The aim of this work is streaming waveform data (Russian speech) processing by Hermite expansion for text-independent speaker indexing.
The proposed method is based on the features of Hermite functions and quasiperiods. An expansion of signal information into a series of these functions enables us to perform information analysis of the signal and its Fourier transform at the same time, because the Hermite functions are the eigenfunctions of the Fourier transform. These functions are widely used in pure mathematics, where the expansion into Hermite functions is also known as the Gram-Charlier series [1], [2], and in image analysis [3-6]. It is also necessary to underline that the joint localization of Hermite functions in both the frequency and temporal domains makes these functions very stable to information errors. On the other hand, a quasiperiod is a time period of the sound corresponding to a period of the base tone for vowels or resonant consonants, so extraction of quasiperiods means separating quasiperiods in a continuous speech waveform. Hence the suggested method is very stable and flexible for use in streaming waveform data processing.
This work illustrates some possibilities to take full advantage of the use of this method.
2. HERMITE FUNCTIONS
The Hermite functions possess an important feature for wave processing: they form a complete orthonormal system of functions in L_2(-\infty, \infty).
The Hermite functions are defined as:

\psi_n(x) = \frac{(-1)^n}{\sqrt{2^n n! \sqrt{\pi}}}\, e^{x^2/2} \frac{d^n}{dx^n} e^{-x^2}.
They also can be determined by the following recurrent formulae:
\psi_0(x) = \pi^{-1/4} e^{-x^2/2},
\psi_1(x) = \sqrt{2}\, \pi^{-1/4}\, x\, e^{-x^2/2},
\psi_n(x) = x \sqrt{\frac{2}{n}}\, \psi_{n-1}(x) - \sqrt{\frac{n-1}{n}}\, \psi_{n-2}(x), \quad n \ge 2.
Moreover the Hermite functions are the eigenfunctions of the
Fourier transform:
F(\psi_n) = (-i)^n \psi_n,
where F denotes Fourier transform operator.
The graphs of the Hermite functions look like the following:
Figure 1: Hermite functions
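To make the definitions above concrete, here is a small numerical sketch (ours, not part of the original software), assuming NumPy is available. It evaluates the recurrence on a grid and checks the orthonormality and the eigenfunction property numerically:

```python
import numpy as np

def hermite_functions(n_max, x):
    """Evaluate psi_0 .. psi_{n_max} at the points x using the three-term
    recurrence given above (numerically stable for moderate orders)."""
    psi = np.zeros((n_max + 1, x.size))
    psi[0] = np.pi ** -0.25 * np.exp(-x ** 2 / 2)
    if n_max >= 1:
        psi[1] = np.sqrt(2.0) * np.pi ** -0.25 * x * np.exp(-x ** 2 / 2)
    for n in range(2, n_max + 1):
        psi[n] = (np.sqrt(2.0 / n) * x * psi[n - 1]
                  - np.sqrt((n - 1.0) / n) * psi[n - 2])
    return psi

x = np.linspace(-20, 20, 4001)
dx = x[1] - x[0]
psi = hermite_functions(8, x)

# orthonormality: the Gram matrix of psi_0..psi_8 is (numerically) the identity
gram = psi @ psi.T * dx

# eigenfunction property at one point: (F psi_1)(w) = (-i)^1 psi_1(w)
w = 1.0
f_psi1 = (psi[1] * np.exp(-1j * w * x)).sum() * dx / np.sqrt(2 * np.pi)
```

The quadrature is a plain Riemann sum, which is accurate here because the integrands decay rapidly and are fully resolved by the grid.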
3. HIERARCHY CODING IN HERMITE EXPANSION
The hierarchy coding algorithm that we used works as follows. First, we approximate the whole quasiperiod with one function only; then we subtract the obtained result from the original and repeat this operation on the difference, but using 2 functions. At every next step we use twice as many functions as at the previous step (and stretch our interval of audio data so that all the Hermite functions used are concentrated on the interval, to obtain the best approximation at this step). This channel separation with Hermite functions enables us to associate the obtained results with character formants.
Figure 2: Scheme of hierarchy coding
Figure 3: Example of quasiperiod for character "o"
Figure 4: Hermite expansion coefficients for the quasiperiod above
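The layered scheme can be sketched as follows. This is a simplified illustration of ours: it assumes that each layer uses the next consecutive block of Hermite orders (1 + 2 + 4 + 8 + 16 + 32 = 63 functions in total, matching the counts used later in the paper), and it omits the per-layer interval stretching described above:

```python
import numpy as np

def hermite_functions(n_max, x):
    psi = np.zeros((n_max + 1, x.size))
    psi[0] = np.pi ** -0.25 * np.exp(-x ** 2 / 2)
    if n_max >= 1:
        psi[1] = np.sqrt(2.0) * np.pi ** -0.25 * x * np.exp(-x ** 2 / 2)
    for n in range(2, n_max + 1):
        psi[n] = (np.sqrt(2.0 / n) * x * psi[n - 1]
                  - np.sqrt((n - 1.0) / n) * psi[n - 2])
    return psi

def hierarchy_code(signal, x, n_layers=6):
    """Layer k fits the current residual with 2**(k-1) Hermite functions
    and subtracts the fit; 6 layers use 1+2+4+8+16+32 = 63 functions."""
    dx = x[1] - x[0]
    residual = signal.astype(float).copy()
    all_psi = hermite_functions(2 ** n_layers - 2, x)   # orders 0 .. 62
    layers, start = [], 0
    for k in range(n_layers):
        n_funcs = 2 ** k
        psi = all_psi[start:start + n_funcs]
        coeffs = psi @ residual * dx       # project the residual on this layer
        residual -= coeffs @ psi           # subtract the layer approximation
        layers.append(coeffs)
        start += n_funcs
    return layers, residual

x = np.linspace(-15, 15, 3001)
signal = np.exp(-x ** 2 / 2) * np.cos(2 * x)   # stand-in for one quasiperiod
layers, residual = hierarchy_code(signal, x)
```

For a smooth, well-localized signal like this one, the residual after all 63 functions is negligible.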
4. THE ALGORITHM
At this stage we have designed an algorithm for text-independent speaker indexing with fixed/manual adjustment of the recognition threshold and the time filter length. Indexing of vowels and resonant consonants by this approach was also designed to be used in a speech recognition system. The default values of the time filter length and the recognition threshold used in this paper were 0.2 sec and 0.05, respectively.
General scheme of the algorithm:
Figure 5: Scheme of indexing algorithm
At the first step we extract quasiperiods from the analyzed wave file and sort them by length (to improve performance).
After elimination of noise blocks (using a noise threshold), we apply the Hermite expansion to speech blocks (quasiperiods) at the second stage.
At the next step we index resonant quasiperiods using a content-independent indexing algorithm.
Next we index resonant quasiperiods using database information and correct them depending on their arrangement.
The speaker indexing is performed using resonant quasiperiods. It is also based on database information: the Hermite coefficients retrieved from the database are compared with the Hermite coefficients calculated for the given waveform.
Speaker indexing correction algorithms are used to treat incorrectly indexed speakers. The first of these algorithms analyzes the correspondence between known resonant quasiperiods and indexed speakers to re-index segments with wrongly detected speakers. The second algorithm is based on the use of a time filter. The optimal time filter length for the dialogs used in this paper was found to be 0.2 sec.
This algorithm has been implemented in the “Hermite coder PWE” software.
4.1 Quasiperiods extraction algorithm
4.1.1 Basic algorithm
A quasiperiod is a time period of the sound corresponding to a period of the base tone for vowels or resonant consonants. Extraction of quasiperiods means separating quasiperiods in a continuous speech waveform.
First of all, the input waveform is pre-processed to increase algorithm robustness: the DC (direct current) offset is eliminated, and (if needed) the waveform is inverted to make the sharpest peaks point upwards (see fig. 6). After that we scan the waveform to find the block with the largest RMS value (RMS window size = 13 ms). We expect this block to correspond to a vowel, which has sharp quasiperiods (because sibilants and consonants usually have smaller RMS). Within this block we find the maximal positive sample value and consider it to be a starting boundary between two quasiperiods.
Figure 6: Now we have found the first boundary between quasiperiods. It will be a starting point.
After that we continue finding quasiperiod boundaries in the part of the waveform to the right of the starting point. After the right part is processed, we reverse the left part of the waveform and find quasiperiods there using the same routine. The last step is to sort (simply rearrange) the array of quasiperiod boundaries.
Now we will describe how the main routine works. During each step we find the next border between quasiperiods (see fig. 7).
Figure 7: During each step we find the next border between quasiperiods.
In our algorithm we find quasiperiod boundaries using a probability concept. For example, if a number of previous quasiperiods were detected to have a base frequency around 120 Hz (1/T, where T is the average duration), then we can expect the next quasiperiod to have the same duration, because the fundamental frequency of human speech varies relatively slowly. If the previous quasiperiods were detected with a high degree of confidence (see below), then the mentioned probability becomes even greater.
Given the current fundamental frequency F_current, we can suggest the most probable position for the next quasiperiod boundary. After that we construct a time-domain window with the shape of a raised cosine and apply this window to our waveform (see fig. 8), so that the center of the window falls on the most probable boundary position:

WindowCenterOffset = \frac{1}{F_{current}}

Figure 8: Raised cosine window increases the probability of preserving a frequency.
Now we can find the maximum of the windowed waveform and take it as the next quasiperiod boundary. The current frequency is adjusted to match the new conditions. Older quasiperiods also take part in the correction, but with lower weights in the compound formula:

F_{current} = 0.7\, F_{current} + 0.3\, \frac{1}{T_{last}}

Here T_last is the duration of the last detected quasiperiod.
After every 10 quasiperiods are found, the current frequency is adjusted by the pitch detection routine (see below):

F_{current} = 0.5\, F_{current} + 0.5\, F_{detected}

If the pitch detection routine fails to detect the pitch of the signal at the given period of the waveform, then the current frequency is not changed.
The next step is to decide whether the last quasiperiod has been found with a high degree of confidence. To perform this task we check whether there are any other maxima in the non-windowed (original) waveform during the last detected quasiperiod. If we can find a maximum that is higher than the boundary values, then the confidence degree is considered to be low, and the window selected for the next step will be wider than the current one. If the confidence degree is high (see fig. 7), we can choose a narrower window for the next step.
In this way we move through the input waveform until the end is reached.
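One step of this routine can be sketched as follows. This is an illustrative reconstruction of ours: the raised-cosine window and the 0.7/0.3 frequency update come from the description above, while the window-width rule (`width_factor`) is an assumption:

```python
import numpy as np

def next_boundary(wave, sr, last_idx, f_current, width_factor=0.6):
    """Predict the next boundary one period ahead, weight the waveform with
    a raised-cosine window centred there, and take the peak.
    width_factor (an assumption) sets the window half-width vs. the period."""
    period = int(round(sr / f_current))
    center = last_idx + period
    half = max(2, int(width_factor * period))
    lo, hi = center - half, center + half
    if lo < 0 or hi >= len(wave):
        return None, f_current
    offsets = np.arange(lo, hi) - center
    window = 0.5 * (1.0 + np.cos(np.pi * offsets / half))  # raised cosine
    idx = lo + int(np.argmax(wave[lo:hi] * window))
    t_last = (idx - last_idx) / sr            # duration of this quasiperiod
    f_new = 0.7 * f_current + 0.3 / t_last    # update rule from the text
    return idx, f_new

# demo: a synthetic 120 Hz pulse train, starting from the first peak
sr = 16000
t = np.arange(sr) / sr
wave = np.maximum(0.0, np.sin(2 * np.pi * 120 * t)) ** 8
first_peak = int(np.argmax(wave[:200]))
idx, f = next_boundary(wave, sr, first_peak, f_current=118.0)
```

Even with a slightly wrong current frequency (118 Hz vs. the true 120 Hz), the windowed peak lands on the next true pulse, and the update pulls the frequency estimate toward the true value.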
4.1.2 Algorithm modification
Here we modify our algorithm so that quasiperiods’ boundaries
were corresponding not to maximums, but to zero-crossing points
of waveform. This modification enables us to perform hierarchy
coding of quasiperiods more effectively. We try to find a sample
lying between two old boundaries, which is close enough to the
center of the old quasiperiod and has a low absolute value. We
use a V-shaped time window (see fig. 9) to increase a probability
of snapping not only to zero, but also to the center of the old
quasiperiod. After applying the window we search for a sample
with absolute value, lower than a certain threshold (0.02 at our
program). The search starts from the center of the widowed
waveform to ensure that selected sample will lie as close as
possible to the center.
Figure 9: Finding a new (zero-snapped) quasiperiod boundary using a V-shaped window. Thick lines show old boundaries; the thin line shows the new boundary.
The center point of the V-shaped window at the next old quasiperiod is shifted to ensure preserving the frequency. An example of quasiperiod boundaries resulting from this algorithm is shown in fig. 10.
Figure 10: Quasiperiods detected by the modified algorithm.
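The zero-snapping step can be sketched as follows. This is one plausible reading of the scheme (ours): the V-shaped weight grows with distance from the old quasiperiod center, so a sample far from the center must be closer to zero to be accepted; the 0.02 threshold is the one quoted in the text:

```python
import numpy as np

def snap_to_zero(wave, left, right, threshold=0.02):
    """Replace a peak boundary by a near-zero sample close to the centre of
    the old quasiperiod, searching outward from the centre as described."""
    center = (left + right) // 2
    half = max(1, (right - left) // 2)
    for offset in range(half + 1):
        for k in (center - offset, center + offset):
            if left <= k < right:
                v = 1.0 + abs(k - center) / half       # V-shaped weight
                if abs(wave[k]) * v < threshold:
                    return k
    return center  # nothing below the threshold: keep the centre sample

# demo: boundaries at two sine peaks snap to a zero crossing between them
wave = np.sin(2 * np.pi * np.arange(200) / 100.0)
new_boundary = snap_to_zero(wave, left=25, right=125)
```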
4.1.3 Pitch detection algorithm
To decrease the error rate of quasiperiod separation, we compared two algorithms for estimating the pitch of the speaker. Being able to extract the pitch of the signal locally, we can dynamically correct distances between quasiperiod boundaries.
Two algorithms for pitch detection were considered. The first of them uses the spectrum of the signal to find the fundamental. The second one uses the cepstrum of the signal to analyze its harmonic structure. The second algorithm has proved more stable even when the fundamental is severely masked by overtones (as in a phone line), so it is currently used in our speaker indexing software and is described below.
First, the algorithm analyses the spectrum of the signal to decide whether a vocal or a sibilant phoneme is present. To decide the type of phoneme, the algorithm calculates the averaged (per-frequency) energy in 2 frequency bands: from 200 to 2000 Hz and from 3800 to 10000 Hz. If the first energy is at least 8 times higher than the second one, we decide that the phoneme is vocal. If there is a sibilant phoneme, the fundamental cannot be detected.
The algorithm employs the cepstrum of the signal to analyze the
periodic structure of the spectrum of the harmonic signal (see fig.
11). Periodic structure of spectrum corresponds to the
fundamental frequency and harmonics with integer-multiple
frequencies. When we take a Fourier transform of the logarithmic
speech spectrum, we get a peak at the cepstrum, which
corresponds to the fundamental frequency.
Figure 11: Harmonic speech signal has a spectrum with a
periodic structure.
To achieve even better results, we use only the real part of the cepstrum (instead of its magnitude) for finding the peak, because the harmonic structure of the spectrum corresponds to a zero-phase cosine harmonic.
This algorithm has shown stable results on various speech material.
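The cepstral pitch detector can be sketched as follows (a minimal illustration of ours, not the authors' implementation; the quefrency search range of 60-400 Hz and the Hann analysis window are assumptions):

```python
import numpy as np

def detect_pitch(frame, sr, fmin=60.0, fmax=400.0):
    """Cepstrum-based pitch estimate: peak of the real cepstrum (per the
    text, the real part is used rather than the magnitude) within a
    plausible quefrency range for the voice fundamental."""
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
    log_mag = np.log(np.abs(spectrum) + 1e-10)
    cepstrum = np.fft.irfft(log_mag)          # real cepstrum
    q_lo = int(sr / fmax)                     # quefrency bounds in samples
    q_hi = int(sr / fmin)
    q = q_lo + int(np.argmax(cepstrum[q_lo:q_hi]))
    return sr / q

# harmonic signal at 120 Hz with a weak fundamental (phone-line-like)
sr = 16000
t = np.arange(2048) / sr
frame = sum(a * np.sin(2 * np.pi * 120 * k * t)
            for k, a in [(1, 0.1), (2, 1.0), (3, 0.8), (4, 0.6)])
f0 = detect_pitch(frame, sr)
```

Note that the fundamental is nearly absent from the test signal, yet the cepstral peak at the 120 Hz quefrency is still found, which mirrors the masked-fundamental robustness claimed above.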
4.2 Speech/music/silence detection algorithm
An algorithm was developed for detecting silent or musical blocks in the file being processed. The algorithm helps us exclude these blocks from the speaker indexing process (see fig. 12).
Figure 12: The result of the speech/music/silence detection algorithm. Detected music is marked with dark horizontal bars; detected silence is marked with a light horizontal bar. Vertical bars correspond to detected quasiperiods.
At the first stage the algorithm detects silent blocks in the input waveform. Silent blocks usually have a low signal level and contain only background noise. To find such blocks more effectively, we create a separate copy of the input waveform, preprocessed by a high-pass FIR filter with a slow roll-off of 3 dB per octave. The filter transforms pink background noise into white noise and also cuts off low-frequency rumble. This yields a waveform with significantly reduced noise amplitude and increased amplitude of sibilant consonants in speech. Without this preprocessing, sibilants are often lost in low-frequency background noise.
After the filtering we apply an amplitude gate to the filtered waveform. The gate forces all blocks with a volume below a certain threshold to digital zero. The threshold is selected automatically. The gate features separate thresholds for opening and closing, and soft attacks and releases with a look-ahead (attack) time of 125 ms and a release time of 170 ms.
After the gating, the blocks with digital zeroes in the processed waveform are considered silent blocks in the original waveform.
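The gating stage can be sketched as follows. This is a simplified illustration of ours: the two-threshold (open/close) hysteresis follows the description above, but the threshold values are hypothetical, and the 3 dB/octave pre-filter and the attack/release smoothing are omitted:

```python
import numpy as np

def gate_silence(wave, sr, open_db=-30.0, close_db=-40.0, win_ms=20):
    """Mark silent blocks with a two-threshold (hysteresis) gate applied
    to short-time RMS levels; thresholds here are assumptions."""
    win = int(sr * win_ms / 1000)
    n_blocks = len(wave) // win
    rms = np.sqrt(np.mean(wave[:n_blocks * win].reshape(-1, win) ** 2, axis=1))
    rms_db = 20 * np.log10(rms + 1e-12)
    silent = np.zeros(n_blocks, dtype=bool)
    gate_open = False
    for i, level in enumerate(rms_db):
        if gate_open and level < close_db:
            gate_open = False                 # close on the lower threshold
        elif not gate_open and level > open_db:
            gate_open = True                  # open on the higher threshold
        silent[i] = not gate_open
    return silent

# demo: half a second of quiet noise followed by half a second of tone
sr = 8000
rng = np.random.default_rng(1)
noise = 0.003 * rng.standard_normal(sr // 2)
tone = 0.5 * np.sin(2 * np.pi * 220 * np.arange(sr // 2) / sr)
silent = gate_silence(np.concatenate([noise, tone]), sr)
```

The separate open/close thresholds prevent the gate from chattering on signals that hover near a single threshold.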
At the second stage the algorithm detects music in the input signal to exclude it from speaker indexing. The remaining part of the waveform is divided into 1-second intervals and fed to the speech/music detection algorithm.
The decision is generated for each block separately. The final decision is a compound of 2 factors. The first factor is the amplitude dynamics of the block. The second factor is the distribution of the fundamental in time within the block.
To calculate the first factor, we obtain the amplitude envelope of the given block by calculating RMS with a moving 20 ms window. The standard deviations of RMS values over the entire block and over 3 sub-blocks (of 0.3 seconds) are measured; their weighted sum is taken as the first factor. Hereby we assume that speech usually has wider dynamics than music, because in speech we continuously observe fast changes of vowel phonemes with large amplitudes and short pauses between different phonemes or between words (see fig. 13). Music, on the contrary, usually has low dynamics (especially heavily compressed music on TV or radio).
Figure 13: Amplitude envelope of speech (upper) and music
(lower). The dynamics of speech is significantly higher.
To calculate the second factor for the given block, we obtain the distribution of the fundamental frequency in time. Here we use a slightly modified version of our pitch detection algorithm, featuring very high frequency accuracy. We obtain the high degree of frequency accuracy by analyzing the harmonic structure of the signal. After we get a rough estimate of the fundamental frequency from the cepstrum, we analyze the spectrum of the signal to find higher harmonics. Because of the linear frequency grid, we get higher frequency resolution when analyzing higher harmonics. To find the spectral peaks corresponding to harmonics even more precisely, we use spline interpolation of spectrum values between FFT frequency bins. The final fundamental frequency is calculated as follows:

F_0 = \frac{f_1 + f_2/2 + f_3/3}{3}

Here f_1, f_2 and f_3 are the frequencies of the 3 first harmonics (including the fundamental f_1) found in the interpolated spectrum.
After we calculate the fundamental frequency over the block with a step of 20 ms, we smooth the obtained array with a median filter (kernel size = 3) to eliminate dropped-off samples.
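The harmonic refinement formula can be illustrated as follows (our sketch; the +/-5% search band around each expected harmonic is an assumption, and the spline interpolation between FFT bins is omitted for brevity):

```python
import numpy as np

def refine_f0(freqs, mags, f0_rough):
    """Refine a rough cepstral F0 using the first three harmonic peaks:
    F0 = (f1 + f2/2 + f3/3) / 3, each peak searched near k * f0_rough."""
    estimates = []
    for k in (1, 2, 3):
        target = k * f0_rough
        band = (freqs > 0.95 * target) & (freqs < 1.05 * target)
        idx = np.where(band)[0]
        peak = idx[np.argmax(mags[idx])]      # strongest bin near harmonic k
        estimates.append(freqs[peak] / k)     # divide back to the fundamental
    return sum(estimates) / 3.0

# demo: harmonic signal at 123 Hz, rough estimate of 120 Hz from the cepstrum
sr, n = 16000, 8192
t = np.arange(n) / sr
signal = sum(np.sin(2 * np.pi * 123.0 * k * t) / k for k in (1, 2, 3))
mags = np.abs(np.fft.rfft(signal * np.hanning(n)))
freqs = np.fft.rfftfreq(n, 1 / sr)
f0 = refine_f0(freqs, mags, f0_rough=120.0)
```

Dividing the k-th harmonic frequency by k shrinks the bin-quantization error by a factor of k, which is why higher harmonics improve the accuracy on a linear frequency grid.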
The next step is to decide whether the obtained curve corresponds to speech or to music. To analyze the curve we introduce the value dF_0, the first derivative of F_0 over time (that is, dF_0/dt). We calculate the dF_0 value at each point of the F_0 array and count the number of points where the absolute value of dF_0 is between 30 and 500 Hz/second. Then at all such points we check the sign of dF_0 and exclude every point dF_0(n) where

sign(dF_0(n)) \ne sign(dF_0(n-1))
or
sign(dF_0(n)) \ne sign(dF_0(n-2)).

Then we calculate the percentage of selected points, and this percentage is taken as the second factor. A higher percentage means a higher probability that the given block corresponds to speech.
Here we assume that speech has a definite fundamental frequency, which is always varying in time at a speed from 30 to 500 Hz/second. Music, on the contrary, usually has no definite fundamental frequency, so our pitch detection routine either returns nothing, or the resulting fundamental curve is not smooth (unlike for speech) and consists of random samples.
The final decision for each block is based on these 2 criteria (it is a weighted sum of them). The algorithm has shown good results on various sound samples. Still, there are opportunities to improve its performance by carefully optimizing the weights of the different factors in the compound formula and by performing a more thorough spectrum analysis.
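The second factor can be sketched as follows (our illustration; the 20 ms step and the 30-500 Hz/s bounds come from the text, while the synthetic F0 tracks are invented for the demo):

```python
import numpy as np

def speechiness(f0_track, step_s=0.02, lo=30.0, hi=500.0):
    """Fraction of F0 samples whose derivative magnitude lies in
    [30, 500] Hz/s and whose sign agrees with the previous two derivative
    samples (i.e. a smooth pitch glide, as in voiced speech)."""
    df = np.diff(f0_track) / step_s           # dF0/dt in Hz per second
    selected = 0
    for n in range(2, len(df)):
        if (lo < abs(df[n]) < hi
                and np.sign(df[n]) == np.sign(df[n - 1])
                and np.sign(df[n]) == np.sign(df[n - 2])):
            selected += 1
    return selected / max(1, len(df) - 2)

# smooth speech-like pitch glide vs. an erratic, music-like track
t = np.arange(0, 1.0, 0.02)
speech_f0 = 120 + 10 * np.sin(2 * np.pi * 1.5 * t)     # slow glide
rng = np.random.default_rng(0)
music_f0 = 120 + rng.uniform(-40, 40, t.size)          # random jumps
s_speech = speechiness(speech_f0)
s_music = speechiness(music_f0)
```

The glide keeps its derivative inside the band with a consistent sign most of the time, while the random track rarely does, so the two cases separate cleanly.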
4.3 Vowels detection algorithm
A fast algorithm for detecting the quasiperiods corresponding to vowels in speech has been developed. The algorithm works after the input waveform has already been separated into quasiperiods.
The final decision on each quasiperiod is based on 2 criteria: waveform shape similarity within 2 adjacent quasiperiods and the number of zero crossings in the given quasiperiod.
The first criterion estimates the shape similarity by finding the norm of the difference between the waveforms of 2 adjacent quasiperiods:
Similarity = \frac{\sum_{i=1}^{n} (x_i - y_i)^2}{\sum_{i=1}^{n} x_i^2 + \sum_{i=1}^{n} y_i^2}
Here we assume that the shape of the waveform corresponding to a vowel changes slowly (see fig. 10).
The second criterion calculates the number of zero-crossing points in the waveform of a given quasiperiod. If this number is between 2 and 15, then the probability of a vowel quasiperiod is high. Here we assume that sibilants and noise have a much larger number of zero-crossing points (because of a significant HF component).
The final decision is calculated as a weighted sum of the probabilities from these 2 criteria.
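The two criteria can be combined as in the following sketch (ours; the equal weights and the mapping of each criterion to a probability are assumptions, while the similarity formula and the 2-15 zero-crossing range come from the text):

```python
import numpy as np

def vowel_score(q1, q2, zc_lo=2, zc_hi=15, w_shape=0.5):
    """Weighted sum of the shape-similarity and zero-crossing criteria."""
    n = min(len(q1), len(q2))
    a, b = np.asarray(q1[:n], float), np.asarray(q2[:n], float)
    similarity = np.sum((a - b) ** 2) / (np.sum(a ** 2) + np.sum(b ** 2) + 1e-12)
    zero_crossings = int(np.sum(np.abs(np.diff(np.sign(a))) > 0))
    p_shape = 1.0 - min(1.0, similarity)   # small difference => vowel-like
    p_zc = 1.0 if zc_lo <= zero_crossings <= zc_hi else 0.0
    return w_shape * p_shape + (1.0 - w_shape) * p_zc

# demo: two nearly identical sine periods vs. a pair of noise bursts
rng = np.random.default_rng(2)
period = np.sin(2 * np.pi * np.arange(100) / 100.0)
vowel = vowel_score(period, period + 0.05 * rng.standard_normal(100))
noise = vowel_score(rng.standard_normal(100), rng.standard_normal(100))
```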
This algorithm has been integrated into our BSS program as a new
criterion for speaker separation, and into our Hermite Coder
program to exclude non-vocal quasiperiods from indexing and to
increase its speed.
4.4 Approximated waves
At this stage, we should first select the number of Hermite functions used for speaker indexing. The optimal number for this task using hierarchy coding is 63 functions (32 functions for the last layer). Further, for every layer we stretch the approximation's quasiperiod [-A_0, A_0] to the segment [-A_1, A_1], defined from the following criterion:

\int_{-A_1}^{A_1} \psi_n^2(x)\, dx = 0.99,

where n is the number of Hermite functions for this layer of the approximation.
Then we decompose the wave function f(x) into a Fourier series by Hermite functions:

value_k(x) = \sum_{i=0}^{n_k - 1} c_{ik}\, \psi_i(x),
\quad
c_{ik} = \int_{-A_1}^{A_1} f_k(x)\, \psi_i(x)\, dx,
\quad
n_k = 2^{k-1}, \; k = 1, \dots, l,

where f_k(x) = f_{k-1}(x) - value_{k-1}(x), f_0(x) = f(x), n_k is the number of functions for the current layer, and l is the current layer.
Since the Hermite functions are the eigenfunctions of the Fourier transform, we have thereby also found the Fourier transform of the approximation for every quasiperiod of the original wave.
We used pre-sorting of the quasiperiods by length and the odd/even property of Hermite functions to accelerate this algorithm.
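The interval selection and the projection step can be sketched as follows (an illustration of ours; grid sizes and the stand-in quasiperiod are arbitrary):

```python
import numpy as np

def hermite_functions(n_max, x):
    psi = np.zeros((n_max + 1, x.size))
    psi[0] = np.pi ** -0.25 * np.exp(-x ** 2 / 2)
    if n_max >= 1:
        psi[1] = np.sqrt(2.0) * np.pi ** -0.25 * x * np.exp(-x ** 2 / 2)
    for n in range(2, n_max + 1):
        psi[n] = (np.sqrt(2.0 / n) * x * psi[n - 1]
                  - np.sqrt((n - 1.0) / n) * psi[n - 2])
    return psi

def concentration_interval(n, energy=0.99, x_max=30.0, dx=0.001):
    """A1 such that the integral of psi_n^2 over [-A1, A1] equals `energy`
    (the 0.99 criterion from the text); psi_n^2 is even, so the half-line
    integral is accumulated and doubled."""
    x = np.arange(0.0, x_max, dx)
    psi_n = hermite_functions(n, x)[n]
    cum = 2.0 * np.cumsum(psi_n ** 2) * dx
    return float(x[np.searchsorted(cum, energy)])

# project a quasiperiod, resampled onto [-A1, A1], onto n Hermite functions
n = 8
a1 = concentration_interval(n - 1)
x = np.linspace(-a1, a1, 1024)
f = np.exp(-x ** 2 / 4) * np.cos(3 * x)        # stand-in quasiperiod
psi = hermite_functions(n - 1, x)
coeffs = psi @ f * (x[1] - x[0])               # c_i = integral of f * psi_i
```

Because the stand-in quasiperiod here is even, all odd-order coefficients vanish, which is the odd/even shortcut mentioned above.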
4.5 Indexing results for a three-speaker conversation
1 2 3 1 3 2
Figure 15: Original dialog (15 sec, 3 speakers, 5 changes)
The shown waveform represents a conversation (in Russian) of 3 speakers on a news program on NTV Russian television. For every speaker we have trained a database on his independent monolog (length = 8 seconds). Each training took about 14 seconds (on a PIII-750). Indexing took about 21 seconds. (It should be noted that even longer dialogues are processed in real time on a PIII-850.) Frequently we can reduce indexing errors by using a manual threshold, but sometimes the best threshold coincides with the default one. An example of reducing errors by using a manual threshold can be seen below (fig. 16, fig. 17).
1 2 1 2 3 1 3 2 1 3 2
Figure 16: Indexed dialog based on Hermite expansion
(time filter is off, fixed recognition threshold = 0.05)
1 2 3 1 3 2
Figure 17: Indexed dialog based on Hermite expansion
(time filter is off, manual (best) recognition threshold = 0.0474)
Let's compare results obtained by Hermite transform and results
obtained by Fourier series (fig. 18, fig. 19).
1 3 1 2 1 3 2 3 1 3 2 1 3 1 2
Figure 18: Indexed dialog based on Fourier expansion
(time filter is off, fixed recognition threshold = 0.05)
1 2 1 3 2 3 1 3 2
Figure 19: Indexed dialog based on Fourier expansion
(time filter is off, manual (best) recognition threshold = 0.042)
As we can see, in this case the application of the Hermite transform is more justified than the application of Fourier series. As we will see further, statistics confirm this statement.
However, at this stage the algorithm does not determine indexing borders precisely. This is because only resonant consonants and vowels are processed. Their distribution can be seen in figure 20.
Figure 20: Indexed resonant quasiperiods
In figure 20, colored vertical bars correspond to sections of the waveform consisting of resonant quasiperiods. The change of color indicates a change of speaker or a change of phoneme. In figures 15-19, colored vertical bars correspond to sections of the waveform belonging to different speakers. The change of color indicates a change of speaker. The numbers below the figures show the detected speakers.
As we saw before, thresholds are necessary to eliminate incorrect recognition of the periods lying between different phonemes. The minimal threshold is zero (all quasiperiods will be missed). The maximal threshold is one (every quasiperiod will be recognized). The optimal threshold found for the tested data is 0.05, but sometimes it must be corrected for better recognition.
Another way to eliminate incorrectly indexed speakers is to use a time filter. The minimal time filter length is zero (time filter off). The optimal time filter length found is 0.2 sec, but sometimes it must be corrected for better recognition. Examples of reducing errors by using the time filter for the waveforms from figures 16 and 18 can be seen in fig. 21 and fig. 22, respectively.
1 2 3 1 3 2
Figure 21: Indexed dialog based on Hermite expansion (time filter length fixed to 0.2, fixed recognition threshold = 0.05)
1 3 2 3 1 3 2 3 1 2
Figure 22: Indexed dialog based on Fourier expansion (time filter length fixed to 0.2, fixed recognition threshold = 0.05)
As we can see, with the time filter the application of the Hermite transform is also more justified (compared to Fourier series). As we will see further, statistics confirm this statement too.
4.6 Statistics of indexing results
When acquiring statistics, dialogues of two types were used: with two speakers and with three speakers. For each dialogue we also had a set of solo monologues of each of the speakers. Training monologues of two types were used: short monologues (7-15 seconds each) and long monologues (20-45 seconds each). All database training on these monologues was performed automatically. All dialogues were tested both using Hermite transform based indexing and using Fourier series based indexing. In most cases it was necessary to correct the recognition threshold manually. Less often, it was necessary to correct the length of the time filter manually.
We have calculated the error rate using the following formula:

er = \frac{N - M}{N},

where er is the error rate, N is the number of all inclusions of different speakers, and M is the number of correctly indexed inclusions of different speakers.
The obtained results are shown in the following tables:

Short training monologues (7-15 seconds each) without time filter:

Error rate          Fixed parameters    Manual parameters
Hermite transform   46.3%               32.8%
Fourier series      71.4%               52.3%

Short training monologues (7-15 seconds each) with time filter:

Error rate          Fixed parameters    Manual parameters
Hermite transform   43.2%               7.6%
Fourier series      61.0%               16.6%

Long training monologues (20-45 seconds each) without time filter:

Error rate          Fixed parameters    Manual parameters
Hermite transform   48.9%               6.2%
Fourier series      62.3%               8.0%

Long training monologues (20-45 seconds each) with time filter:

Error rate          Fixed parameters    Manual parameters
Hermite transform   10.5%               4.4%
Fourier series      31.5%               6.2%

The best achieved error rate without the time filter is 6.2% (Hermite transform with long training monologues).
The best achieved error rate with the time filter is 4.4% (Hermite transform with long training monologues).
It can be seen that the best error rate corresponds to the Hermite transform with manual parameter adjustment. It is necessary to emphasize that quite good results are also achieved with the default parameters on long trainings when the time filter is turned on. The shown results confirm that when dynamically changing time windows are involved, the application of the Hermite transform for speaker indexing is more justified.
5. CONCLUSION
In this paper we considered a new projection scheme of streaming waveform data processing for text-independent speaker indexing from continuous speech, based on the features of Hermite functions and quasiperiods. We have used an expansion into a series of eigenfunctions of the Fourier transform, which has enabled us to use the advantages of time-frequency analysis.
6. REFERENCES
[1] Gabor Szego. “Orthogonal Polynomials”. American Mathematical Society Colloquium Publications, vol. 23, NY, 1959.
[2] Dunham Jackson. “Fourier Series and Orthogonal Polynomials”. Carus Mathematical Monographs, No. 6, Chicago, 1941.
[3] Jean-Bernard Martens. “The Hermite Transform – Theory”. IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 38 (1990), pp. 1595-1606.
[4] Jean-Bernard Martens. “The Hermite Transform – Applications”. IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 38 (1990), pp. 1607-1618.
[5] Andrey Krylov and Danil Kortchagine. “Projection filtering in image processing”. Graphicon'2000 Conference proceedings, Moscow (2000), pp. 42-45.
[6] Andrey Krylov and Anton Liakishev. “Numerical Projection Method For Inverse Fourier Transform and its Application”. Numerical Functional Analysis and Optimization, vol. 21 (2000), pp. 205-216.
About the authors
Dr. Andrey S. Krylov is a head scientist at Moscow State University.
E-mail: kryl@cs.msu.su
Danil N. Kortchagine is a student at Moscow State University.
E-mail: dan_msu@euro.ru
Alexey S. Lukin is a student at Moscow State University.
E-mail: lukin@ixbt.com
Address:
Faculty of Computational Mathematics & Cybernetics, Moscow State University, Vorob'evy Gory, 119992, Moscow, Russia.
... An expansion of signal information into a series of these functions enables us to perform information analysis of the signal and its Fourier transform at the same time, because the Hermite functions are the eigenfunctions of Fourier transform. These functions are widely used in image processing [4], [5], [6] and streaming waveform data processing [7], [8]. It is also necessary to underline that the joint localization of Hermite functions in the both frequency and space domains makes using these functions very stable to information errors. ...
Article
Full-text available
In this paper we will consider a new scheme of image database retrieval by fast Hermite projection method. The database contained 4100 images. The method is based on an expansion into series of eigenfunctions of the Fourier transform. Photo normalization includes following steps of preprocessing: resampling, corners detection, rotation, perspective and parallelogram elimination, painting cutting, ranging and color plane elimination. The searching is based on the database query by fast Hermite coefficients and retrieving from the database nearest record by quadratic discrepancy.
... 9, many speech fragments have a quasi-periodic structure (it should be emphasized that the lengths of neighboring quasi-periods, as well as the forms of the signals on the neighboring quasi-periods, may slightly differ). For the input interval, we successively take these quasi-periods [14]. The endpoints of these intervals are selected such that the extremum of the signal on the interval is attained, approximately, in the middle of the interval, and the values at the endpoints are close to zero. ...
Article
Full-text available
Currently, various time-frequency representations are often used for sound analysis. These representations, on the one hand, are convenient for visible sensation of sound by a human and, on the other hand, can be used for automatically analyzing sound pictures. In this paper, various methods for representation of sound as two-dimensional time-frequency vectors of a fixed dimension and their use for speech and speaker recognition problems are discussed. Probabilistic, distance-based, and neural-network methods for the recognition of these vectors by examples of separate words are considered. Numerical experiments showed that the best among them is the method based on a three-layer neural network, the short-time Fourier transform, and the two-dimensional wavelet transformation. For the speaker recognition problem, a distance-based recognition method employing the adaptive Hermite transform turned out the best among all.
... Hermite functions vanish at the infinity and represent the eigenfunctions of the Fourier transform. Due to their good properties, these functions have been used in various applications: texture analysis, projection filtering, image foveation, speech processing, etc. [13] [16]. Hermite functions for the first six orders (Ψ k , k=0,1,2,3,4,5) are shown inFig 1. Starting from the idea introduced by Thom- son [9], the multiple windows spectrogram has been introduced in [10]. ...
Article
Full-text available
A new distribution that provides high concentration in the time-frequency domain is proposed. It is based on the S-method and multiwindow approach, where different order Hermite functions are employed as multiple windows. The resulting distribution will be referred to as the multiwindow S-method. It preserves favourable properties of the standard S-method, whereas the distribution concentration is improved by using Hermite functions of just a few first orders. The proposed distribution is appropriate for radar signal analysis, as it will be proven by experimental examples.
... In this respect , the proposed method does not predicate on learning or reference sets, as in the case in both [14], [15] and, as such, avoids the potential mismatch between learning and test sets which is likely responsible for the rather high classification errors exhibited in applying those methods. It is noted that the multiple windows are obtained by using the Hermite functions , which have good time-frequency localization property [19], [20]. By employing only a few Hermite functions (of lowest orders), the complexity of realization is slightly increased comparing to the standard S-method. ...
Article
We introduceanewandsimpletechniqueforhumangaitclassificationbasedonthe time-frequencyanalysisofradardata.Thefocusisontheclassificationofarm movementstodiscernfreevs.confinedarmswingingmotion.Thelattermayarisein hostagesituationormaybeindicativetocarryingobjectswithoneorbothhands.The motion signaturescorrespondingtothearmandlegmovementsarebothextracted from thetime-frequencyrepresentationofthemicro-Doppler.Thetime-frequency analysisisperformedusingthemultiwindowS-method.WiththeHermitefunctions acting asmultiwindows,itisshownthattheHermiteS-methodprovidesanefficient representationofthecomplexDopplerassociatedwithhumanwalking.Theproposed humangaitclassificationtechniqueutilizesthearmpositiveandnegativeDoppler frequenciesandtheirrelativetimeofoccurrence.Itistestedonvariousrealradar signalsandshowntoprovideanaccurateclassification.
Article
This paper models the disk brake automatic drilling system for the first time and points out its high-order, time-varying characteristics and its high sensitivity to disturbance. A classic linear time-invariant controller insufficiently addresses the deep and deviated directional drilling process from the control standpoint. Herein, computer-based compound control strategies, including cascade control and feed-forward control, are introduced to eliminate disturbances such as the friction between drilling tools and the borehole wall, the resonance of the rope and drill pipes, and the vibration caused by downhole motors. The reliability and practicability of this automatic driller are increased by the adoption of a Programmable Controller (PC). The system underwent field testing. The field experiments showed that the computer control methods above can effectively suppress interference during deviated deep hole drilling operations, achieving satisfactory and reliable control performance. In addition, the study and control methods for this drilling system have universal significance for the application of other automatic drilling systems.
Conference Paper
A fast Hermite projection scheme for image processing and analysis is introduced. It is based on an expansion of the image intensity function into a Fourier series using the full orthonormal system of Hermite functions instead of the trigonometric basis. Hermite functions are the eigenfunctions of the Fourier transform and, in contrast to the trigonometric functions, are computationally localized in both the frequency and spatial domains. The acceleration of this expansion procedure is based on a Gauss-Hermite quadrature scheme simplified by replacing the Hermite associated weights and Hermite polynomials with an array of associated constants, which depends on the values of the Hermite functions. Image database retrieval and image foveation applications based on the 2D fast Hermite projection method have been considered. The proposed acceleration algorithm can also be used efficiently in the Hermite transform method.
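As a rough illustration of the quadrature idea described above (a sketch in the spirit of the abstract, not the authors' implementation; all names are illustrative), the Gauss-Hermite weights and the Hermite-function values at the nodes can be folded into one precomputed array of constants, so that each projection coefficient becomes a single dot product:

```python
import numpy as np

def hermite_projection_constants(max_order, num_nodes):
    """Precompute A[n, i] = w_i * exp(x_i^2) * psi_n(x_i) so that the
    projection coefficient c_n = integral f(x) psi_n(x) dx is approximated
    by c_n ~= sum_i A[n, i] * f(x_i) (Gauss-Hermite quadrature)."""
    x, w = np.polynomial.hermite.hermgauss(num_nodes)  # nodes/weights for weight e^{-x^2}
    psi = np.zeros((max_order + 1, num_nodes))
    psi[0] = np.pi ** -0.25 * np.exp(-x ** 2 / 2)
    if max_order >= 1:
        psi[1] = np.sqrt(2.0) * x * psi[0]
    for n in range(2, max_order + 1):
        psi[n] = np.sqrt(2.0 / n) * x * psi[n - 1] - np.sqrt((n - 1) / n) * psi[n - 2]
    return x, w * np.exp(x ** 2) * psi  # the array of associated constants

def hermite_project(f, max_order=8, num_nodes=32):
    """Expansion coefficients c_0 .. c_max_order of f in the Hermite-function basis."""
    x, A = hermite_projection_constants(max_order, num_nodes)
    return A @ f(x)
```

For example, projecting f(x) = exp(-x^2 / 2), which equals pi^{1/4} * psi_0 exactly, recovers c_0 = pi^{1/4} with the remaining coefficients near zero.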
Article
Full-text available
A numerical projection method for inverting the Fourier transform from data given on a finite interval is proposed. It is based on an expansion of the solution into a series of eigenfunctions of the Fourier transform. The number of terms of the expansion depends on the length of the data interval. Convergence of the method is proved. The projection method for the case of the sine Fourier transform, with the set of odd Hermite functions as its eigenfunctions, is examined and applied to numerical Fourier filtering.
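The eigenfunction property underlying this method, F[psi_n] = (-i)^n psi_n for the unitary Fourier transform, can be checked numerically; a minimal sketch (function names and grids are illustrative) might look like:

```python
import numpy as np

def hermite_fn(n, x):
    """n-th Hermite function via the standard three-term recurrence."""
    p_prev = np.pi ** -0.25 * np.exp(-x ** 2 / 2)
    if n == 0:
        return p_prev
    p_curr = np.sqrt(2.0) * x * p_prev
    for k in range(2, n + 1):
        p_prev, p_curr = p_curr, np.sqrt(2.0 / k) * x * p_curr - np.sqrt((k - 1) / k) * p_prev
    return p_curr

def unitary_ft(f_vals, x, omega):
    """Unitary Fourier transform (1/sqrt(2*pi)) * integral f(x) exp(-i*w*x) dx,
    approximated by the trapezoidal rule on the uniform grid x."""
    dx = x[1] - x[0]
    vals = np.exp(-1j * np.outer(omega, x)) * f_vals
    return (vals.sum(axis=1) - 0.5 * (vals[:, 0] + vals[:, -1])) * dx / np.sqrt(2 * np.pi)
```

On a grid wide enough for psi_n to have decayed, the transform of psi_n agrees with (-i)^n * psi_n to high accuracy, which is exactly what makes joint time/frequency analysis by Hermite expansion possible.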
Article
Full-text available
In this paper we consider a new projection scheme for local processing of visual information. It is based on an expansion into a series of eigenfunctions of the Fourier transform. This scheme can be used for compression of images and other media data, their filtering, tracing of outlines, and determination of the structure and properties of objects.
Article
Thesis (M.S. in Mathematics)--Louisiana Polytechnic Institute, July 1962. Bibliography: leaf 41.
Article
It is demonstrated how the Hermite transform can be used for image coding and analysis. Hierarchical coding structures based on increasingly specified basic patterns, i.e. general 2-D patterns, general 1-D patterns, and specific 1-D patterns such as edges and corners, are presented. In the image coding application, the relation with existing pyramid coders is described. A new coding scheme, based on local one-dimensional image approximations, is introduced. In the image analysis application, the relation between the Hermite transform and existing line/edge detection schemes is described. It is shown that, by concentrating on more specific patterns, the coding efficiency can be increased since fewer coefficients have to be coded. Meanwhile, sufficient descriptive power can be maintained for approximating the most interesting features in natural images.
Article
The author introduces a scheme for the local processing of visual information, called the Hermite transform. The problem is addressed from the point of view of image coding, and therefore the scheme is presented as an analysis/resynthesis system. The objectives of the present work, however, are not restricted to coding. The analysis part is designed so that it can also serve applications in the area of computer vision. Indeed, derivatives of Gaussians, which have found widespread application in feature detection over the past few years, play a central role in the Hermite analysis. It is also argued that the proposed processing scheme is in close agreement with current insight into the image processing that is carried out by the human visual system. In particular, it is demonstrated that the Hermite transform is in better agreement with human visual modeling than Gabor expansions.
Dr. Andrey S. Krylov is the head scientist of Moscow State University.
Danil N. Kortchagine is a student at Moscow State University.