Computational efficient real-time capable constant-Q spectrum analyzer
Felix Holzmüller¹,², Paul Bereuter¹,², Philipp Merz¹,², Daniel Rudrich¹,³, and Alois Sontacchi¹,³
¹ University of Music and Performing Arts, Graz, Austria
² Graz University of Technology, Austria
³ Institute of Electronic Music and Acoustics, Graz, Austria
May 2020
Abstract
The constant-Q transform (CQT) is a valuable tool for music information retrieval, e.g. for chroma calculation and harmonic analysis. In this e-Brief, we propose a block-based, real-time capable, efficient analysis algorithm based on a subsampling technique performed with the fast Fourier transform. In addition, advanced features such as time resolution enhancement towards lower frequencies and a robust CQT-based tuner are presented. Finally, a reference implementation in C++ in the form of a VST3 plugin is introduced. The plugin's source code will be openly available for further development.
1 Introduction
VST plugins that display a visual representation of the short-time Fourier transform (STFT) are a common tool in modern DAWs. In a musical context, however, the linear frequency bin spacing is not optimal, since the scale of Western music is geometrically spaced. For musical analysis, linear spacing leads to insufficient resolution in the lower frequency regions, while the higher frequencies are resolved more finely than necessary. In signal processing, a general solution to this problem has been found: the constant-Q transform (CQT). So far, it has not been implemented for efficient, accurate real-time applications.
Since the CQT was proposed by J.C. Brown in 1991 [1] for musical analysis, numerous authors have contributed to making its calculation more efficient. With the latest additions by G.A. Velasco et al. [6], N. Holighaus et al. [3] and an implementation by C. Schörkhuber et al. [4], it is possible to create a computationally inexpensive, fast Fourier transform (FFT) based CQT analyzer. Based on this approach, it is possible to develop a real-time capable algorithm and a corresponding implementation as a VST plugin using the JUCE framework [5]. Section 2 explains the analyzer's algorithm in detail, going from its original direct implementation through the various optimization steps. Section 3 features a manual describing the analyzer's user interface. A short summary in section 4 concludes this publication. The program, its source code and documentation featuring a detailed overview of the framework as well as the plugin's dataflow can be found here: https://git.iem.at/audioplugins/cqt-analyzer
Author's accepted manuscript as presented at the 148th AES Convention. The AES published version can be found at https://aes2.org/publications/elibrary-page/?id=20805
2 Algorithm
The CQT can be seen as a filter bank with logarithmically spaced center frequencies $f_k$ and a constant quality factor $Q$ for all filters. The modulated localization functions (the so-called atoms) $a_k[n]$ use a window function $g_k[n]$ and a sampling frequency $f_s$. The samples of a discrete input-signal block $x[m]$ are weighted with the atoms and summed. This defines the CQT as [4, p. 2]
$$X[k, n] = \sum_{m=0}^{N_k-1} x[m]\, a_k^*[m-n], \quad n, k \in \mathbb{N}, \text{ with} \tag{1}$$
$$a_k[m] = g_k[m]\, e^{j 2\pi m \frac{f_k}{f_s}}, \quad m \in \mathbb{Z}. \tag{2}$$
In this case $n$ and $k$ denote the time and frequency indices in the CQT domain, respectively. In this notation one can see the CQT's relation to the discrete Fourier transform (DFT): the same kernel function is used, but with logarithmically spaced analysis frequencies $f_k$, bandwidths $B_k$¹ and the analysis window length in the time domain $N_k$.
$$f_k = f_0 \cdot 2^{\frac{k}{b}} \tag{3}$$
$$B_k = \frac{f_k}{Q} + \gamma \tag{4}$$
$$Q_{\mathrm{new},k} = \frac{f_k}{B_k} \tag{5}$$
$$b_{\mathrm{new},k} = \frac{\ln(2)}{\operatorname{arsinh}\left(\frac{1}{2 Q_{\mathrm{new},k}}\right)} \tag{6}$$
$$N_k = Q_{\mathrm{new},k} \frac{f_s}{f_k} \tag{7}$$
$f_0$ is the lowest frequency to be analyzed and $b$ is the resolution in bins per octave. In the constant-Q case, $B_k$ is directly proportional to $f_k$. With $\gamma > 0$, a frequency-dependent recalculation of $Q_{\mathrm{new},k}$ and $b_{\mathrm{new},k}$ is necessary. The constant-Q case is obtained with $\gamma = 0$. The purpose of $\gamma$ is discussed in section 2.3.
¹ In this case $B_k$ is defined as the absolute bandwidth, not as the 3 dB bandwidth.
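As a rough illustration, the per-bin quantities of eqs. (3) to (7) can be tabulated with a few lines of Python. This is only a sketch and not the plugin's code: the function name `cqt_parameters` and the example values are hypothetical, and the relation between $Q$ and $b$ is an assumption obtained by inverting eq. (6) for the constant-Q case.

```python
import math

def cqt_parameters(f0, bins_per_octave, n_octaves, fs, gamma=0.0):
    """Per-bin CQT parameters following eqs. (3) to (7)."""
    b = bins_per_octave
    # Assumption: Q follows from b by inverting eq. (6) for gamma = 0.
    q = 1.0 / (2.0 * math.sinh(math.log(2.0) / b))
    bins = []
    for k in range(b * n_octaves):
        fk = f0 * 2.0 ** (k / b)                                  # eq. (3)
        bk = fk / q + gamma                                       # eq. (4)
        q_new = fk / bk                                           # eq. (5)
        b_new = math.log(2.0) / math.asinh(1.0 / (2.0 * q_new))   # eq. (6)
        nk = q_new * fs / fk                                      # eq. (7)
        bins.append({"fk": fk, "Bk": bk, "Q_new": q_new, "b_new": b_new, "Nk": nk})
    return bins

# Example values (hypothetical): two octaves above 55 Hz, 12 bins per octave.
params = cqt_parameters(f0=55.0, bins_per_octave=12, n_octaves=2, fs=48000.0)
```

With `gamma=0.0` every bin recovers `b_new` equal to `bins_per_octave`. A positive `gamma` lowers `Q_new` and shortens `Nk`, most strongly at the lowest bins.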
2.1 Calculating the CQT with a fast convolution
As a consequence of these differences, the optimized algorithm used to calculate the DFT (the FFT) is not directly applicable in this case, making a direct implementation of eq. (1) either computationally too expensive or too inaccurate for real-time applications.
The FFT is still of use, since the transformation in eq. (1) can be rewritten as a convolution, assuming the atoms $a_k[n]$ are symmetric (in the sense that $a_k^*[-m] = a_k[m]$, which holds for a real, symmetric window $g_k$):
$$X[k, n] = \sum_{m=0}^{N_k-1} x[m]\, a_k^*[m-n] = \sum_{m=0}^{N_k-1} x[m]\, a_k[n-m] = (x * a_k)[n] \tag{8}$$
The convolution can be computed efficiently by a fast convolution, that is, by using the FFT to carry out the convolution as a multiplication in the frequency domain and returning to the time domain via an inverse DFT (IDFT):
$$X[k, n] = (x * a_k)[n] = \mathcal{F}^{-1}_{i \mapsto n}\left\{\left[\mathcal{F}_{n \mapsto i}\{x\} \cdot \mathcal{F}_{n \mapsto i}\{a_k\}\right](i)\right\}[n] \tag{9}$$
where $i$ denotes a DFT bin and $k$ a CQT bin.
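The equivalence behind eq. (9) can be checked numerically on a small block. The sketch below uses a naive $O(N^2)$ DFT in place of an FFT so that it needs no external library; the signal `x`, the short atom `a` and all function names are hypothetical examples. Note that multiplying DFTs yields a circular convolution, which matches the linear one only if the block is zero-padded sufficiently.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(N^2) DFT, standing in for an FFT to keep the sketch dependency-free."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = []
    for i in range(n):
        acc = sum(x[m] * cmath.exp(sign * 2j * cmath.pi * i * m / n) for m in range(n))
        out.append(acc / n if inverse else acc)
    return out

def circular_convolution(x, a):
    """Direct circular convolution (x * a)[n] = sum_m x[m] a[(n - m) mod N]."""
    n = len(x)
    return [sum(x[m] * a[(i - m) % n] for m in range(n)) for i in range(n)]

def fast_convolution(x, a):
    """Same result via a multiplication in the frequency domain, as in eq. (9)."""
    spectrum = [xi * ai for xi, ai in zip(dft(x), dft(a))]
    return dft(spectrum, inverse=True)

# Hypothetical block and a short windowed complex exponential as atom.
x = [1.0, 2.0, 0.0, -1.0, 0.5, 0.0, 0.0, 0.0]
weights = [0.5, 1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0]
a = [w * cmath.exp(2j * cmath.pi * 0.25 * m) for m, w in enumerate(weights)]
direct = circular_convolution(x, a)
fast = fast_convolution(x, a)
```

Both results agree up to rounding errors. The benefit of the frequency-domain route only appears once the naive DFT is replaced by an actual FFT.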
This requires block processing of the signal. The block length corresponds to the DFT length $N_{\mathrm{DFT}}$, which depends on the maximum window length $N_{\max} = \max(N_k)$, occurring at the lowest analysis frequency $f_0$. The signal is split into blocks with 50% overlap. Each block is windowed with a Hann window before its transformation into the frequency domain. For a computationally more efficient implementation with less overlap, Tukey windows as proposed in [3] could also be used. Using oversampling with a factor $os \in \mathbb{N}$, the DFT length is defined as
$$N_{\mathrm{DFT}} = os \cdot \mathrm{nextPower2}(N_{\max}) = os \cdot \mathrm{nextPower2}\left(Q_{\mathrm{new}} \frac{f_s}{f_0}\right) = os \cdot 2^{\left\lceil \log_2\left(Q_{\mathrm{new}} \frac{f_s}{f_0}\right) \right\rceil}. \tag{10}$$
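A minimal sketch of eq. (10) follows. The helper names and the example values ($Q_{\mathrm{new}} = 17$, $f_s = 48$ kHz, $f_0 = 55$ Hz, $os = 2$) are assumptions chosen purely for illustration.

```python
import math

def next_power_of_two(value):
    """Smallest power of two that is >= value (for value > 0)."""
    return 2 ** math.ceil(math.log2(value))

def dft_length(q_new, fs, f0, oversampling=1):
    """DFT block length following eq. (10)."""
    n_max = q_new * fs / f0   # longest analysis window, eq. (7) evaluated at f0
    return oversampling * next_power_of_two(n_max)

# Hypothetical example: N_max is about 14836 samples, so nextPower2 gives 16384.
n_dft = dft_length(q_new=17.0, fs=48000.0, f0=55.0, oversampling=2)
```

Because the lowest analysis frequency determines the block length, lowering `f0` by one octave doubles `n_dft` and therefore also the latency of the block processing.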
As the algorithm only deals with real-valued input signals, it can be optimized by calculating the DFT for positive frequencies only.
There are two immediate optimizations for this procedure. Firstly, the input signal needs to be transformed to the frequency domain only once for the calculation of all frequency bins. Secondly, there is no need to repeatedly transform the localization functions: they can be designed and stored in the frequency domain prior to the transformation itself, where they constitute window functions $A_k$. This gives the advantage of a window with compact support, so that applying the window is computationally cheap. As a drawback, slight ripples can be observed after transforming back into the CQT domain, due to the window's infinite support in the time domain. We propose using a Hann window due to its good sidelobe suppression and narrow mainlobe as well as its perfect overlapping.
$$Y(i) = \mathcal{F}_{n \mapsto i}\{x[n]\}(i) \tag{11}$$
$$X[k, n] = \mathcal{F}^{-1}_{i \mapsto n}\{Y(i)\, A_k(i)\}[n] \tag{12}$$
2.2 Subsampling in the frequency domain
In this state the algorithm's output, namely one coefficient for each bin in each time step, contains a large amount of redundancy. The Shannon-Nyquist sampling theorem (in its extension to non-baseband signals) states that this redundancy can be removed by subsampling the output of each bin: as long as the sampling rate after subsampling $f_s^k$ is at least as large as the absolute bandwidth $B_k$, no information is lost.
$$f_s^k \geq B_k \tag{13}$$
To further reduce the computational effort, the subsampling can also be performed in the frequency domain. This is done by applying an IDFT with $N_{\mathrm{IDFT}} < N_{\mathrm{DFT}}$ only along the range where the respective frequency-domain window is non-zero, or in other words, by shifting the windowed spectrum to the baseband before transforming back to the time domain with a lower-resolution IDFT.
$$i_{u,k} = f_k \cdot 2^{\frac{1}{b_{\mathrm{new},k}}} \cdot \frac{N_{\mathrm{DFT}}}{f_s} \tag{14}$$
$$i_{l,k} = f_k \cdot 2^{-\frac{1}{b_{\mathrm{new},k}}} \cdot \frac{N_{\mathrm{DFT}}}{f_s} \tag{15}$$
$i_{u,k}$ and $i_{l,k}$ are the DFT bins that mark the upper and lower bounds of $A_k$. The values of the windows themselves are obtained from a large, precalculated Hann window $A_{\mathrm{lookup}}$ of length $M$. For every sampling point of $A_{\mathrm{lookup}}$, a frequency $f_{A_{\mathrm{lookup}}}[k, m]$ is assigned for each bin $k$, based on the CQT's logarithmic frequency spacing. These are calculated as
$$f_{A_{\mathrm{lookup}}}[k, m] = f_k \cdot 2^{\frac{-\lfloor M/2 \rfloor + m}{\lfloor M/2 \rfloor \cdot b_{\mathrm{new},k}}} \tag{16}$$
for $m = 0, 1, \dots, M-1$.
A length of $M = 8 \cdot N_{\mathrm{IDFT}} + 1$ (see eq. (19)) is more than sufficient for use without further interpolation. The values of $A_{\mathrm{lookup}}$ whose corresponding frequencies $f_{A_{\mathrm{lookup}}}[k, m]$ are closest to the frequencies of the DFT bins $f_i = i \cdot \frac{f_s}{N_{\mathrm{DFT}}}$ for $i_{l,k} \leq i \leq i_{u,k}$ are chosen and stored as the window function $A_k$ for the $k$th CQT bin.
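The lookup procedure described above could be sketched as follows. This is an illustrative reading of eqs. (14) to (16), not the reference implementation: the function name, the example values and in particular the inward rounding of the band edges to whole DFT bins are assumptions.

```python
import math

def window_from_lookup(fk, b_new, n_dft, fs, m_len):
    """Sample a precalculated Hann window at the DFT-bin frequencies of one CQT bin."""
    half = m_len // 2
    # Precalculated symmetric Hann window A_lookup of odd length M.
    lookup = [0.5 - 0.5 * math.cos(2.0 * math.pi * m / (m_len - 1)) for m in range(m_len)]
    # Frequency assigned to every lookup sample, eq. (16).
    f_lookup = [fk * 2.0 ** ((m - half) / (half * b_new)) for m in range(m_len)]
    # Assumption: the bounds of eqs. (14) and (15) are rounded inwards to whole bins.
    i_lower = math.ceil(fk * 2.0 ** (-1.0 / b_new) * n_dft / fs)
    i_upper = math.floor(fk * 2.0 ** (1.0 / b_new) * n_dft / fs)
    window = []
    for i in range(i_lower, i_upper + 1):
        fi = i * fs / n_dft   # frequency of DFT bin i
        nearest = min(range(m_len), key=lambda m: abs(f_lookup[m] - fi))
        window.append(lookup[nearest])
    return i_lower, i_upper, window

# Hypothetical example: the bin centered at 440 Hz with 12 bins per octave.
i_l, i_u, A_k = window_from_lookup(fk=440.0, b_new=12.0, n_dft=32768, fs=48000.0, m_len=257)
```

Since the lookup frequencies are spaced logarithmically while the DFT bins are spaced linearly, the resulting window is symmetric on the log-frequency axis and slightly skewed on the linear one.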
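The shift to the baseband is computed with:
$$Y(i, k) = \begin{cases} Y(i_{l,k} + i) \cdot A_k(i_{l,k} + i), & \text{for } 0 \leq i \leq i_{u,k} - i_{l,k} \\ 0, & \text{else} \end{cases} \tag{17}$$
The spectrum is then transformed back to obtain $X$.
$$X[k, n_k] = \mathcal{F}^{-1}_{i \mapsto n_k}\{Y(i, k)\} \tag{18}$$
$X$ is the subsampled CQT of $x$, with the time variables $n_k$ of the individual channels progressing with $f_s^k$, if the IDFT size is chosen to be the minimum, namely
$$N_{\mathrm{IDFT}} = \mathrm{nextPower2}\left(N_{\mathrm{DFT}} \frac{B_{k,\max}}{f_s}\right). \tag{19}$$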
In this implementation the maximum required IDFT length (that of the highest frequency bin, which has the largest bandwidth $B_{k,\max}$) is used for all bins, applying zero-padding where necessary. Although slightly less efficient, this has the advantage of all channels running at the same rate. Finally, the time-CQT representation is obtained using an overlap-and-add algorithm on the absolute values $|X[k, n_k]|$ using
$$h_s = 2\,\frac{N_{\mathrm{IDFT}}}{os} \tag{20}$$
as the hop size for reconstruction of the blocks.
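The baseband shift of eq. (17) followed by the short IDFT of eq. (18) can be sketched on a toy block. All names and numbers below are hypothetical: a block of 256 samples, a band spanning DFT bins 28 to 36, a Hann-shaped band window and a naive DFT standing in for the FFT.

```python
import cmath
import math

def dft(x, inverse=False):
    """Naive O(N^2) DFT, standing in for an FFT to keep the sketch dependency-free."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = []
    for i in range(n):
        acc = sum(x[m] * cmath.exp(sign * 2j * cmath.pi * i * m / n) for m in range(n))
        out.append(acc / n if inverse else acc)
    return out

def subsampled_bin(spectrum, i_lower, i_upper, window, n_idft):
    """Window one band, shift it to the baseband (eq. 17) and apply a short IDFT (eq. 18)."""
    band = [0j] * n_idft   # zero-padded baseband spectrum
    for i in range(i_upper - i_lower + 1):
        band[i] = spectrum[i_lower + i] * window[i]
    return dft(band, inverse=True)

n_dft, n_idft = 256, 16
i_lower, i_upper = 28, 36   # hypothetical band edges in DFT bins
width = i_upper - i_lower
window = [0.5 - 0.5 * math.cos(2.0 * math.pi * i / width) for i in range(width + 1)]
x = [math.cos(2.0 * math.pi * 32 * n / n_dft) for n in range(n_dft)]  # tone on DFT bin 32
coeffs = subsampled_bin(dft(x), i_lower, i_upper, window, n_idft)
```

The 256 input samples yield only 16 coefficients for this bin, a subsampling factor of $N_{\mathrm{DFT}} / N_{\mathrm{IDFT}}$. For the stationary tone their magnitude is constant, which is the quantity used for the overlap-and-add step.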
2.3 Low frequency time resolution enhancement
The parameter $\gamma$ introduced in eq. (4) enables the modification of a CQT bin's bandwidth $B_k$ [4, p. 5]. This is done to enhance the low-frequency time resolution at the expense of the frequency resolution. Note that the center frequencies of the CQT bins $f_k$ remain the same. In the context of human perception this makes sense, since the filter bank of the human auditory system only resembles a constant-Q system above approximately 500 Hz [4, p. 5]. These properties can be explained using the theory of equivalent rectangular bandwidths (ERB) [2]. Strictly speaking, the resulting transformation is no longer a constant-Q transform when $\gamma \neq 0$.
2.4 Tuner algorithm
The implemented tuner algorithm requires a resolution $b$ of at least 36 bins per octave (i.e. 3 bins per semitone). The algorithm is based on parabolic interpolation, where the location of the maximum of the parabola through three samples is determined. Initially, the index $k_c$ of the CQT bin nearest to the given tuning frequency $f_{\mathrm{ref}}$ is found. For parabolic interpolation, three sampling points are needed. Therefore the bins are summed semitone-wise at the relative position of the tuning bin, as well as one bin above and one bin below², resulting in an average amplitude distribution over all semitones. This can be calculated as
$$\begin{aligned}
X_{\mathrm{sum},c}[n_k] &= \sum_k X[k, n_k], & \text{for } \left\{k \mid k \bmod \tfrac{b}{12} = k_c \bmod \tfrac{b}{12}\right\} \\
X_{\mathrm{sum},l}[n_k] &= \sum_k X[k, n_k], & \text{for } \left\{k \mid k \bmod \tfrac{b}{12} = (k_c - 1) \bmod \tfrac{b}{12}\right\} \\
X_{\mathrm{sum},u}[n_k] &= \sum_k X[k, n_k], & \text{for } \left\{k \mid k \bmod \tfrac{b}{12} = (k_c + 1) \bmod \tfrac{b}{12}\right\}.
\end{aligned} \tag{21}$$
² E.g. for a resolution of 3 bins per semitone, all bins at the center of a semitone, all bins below and all bins above are summed up.
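The semitone-wise summation of eq. (21) amounts to grouping the bins by their index modulo $b/12$. A minimal sketch, assuming that $b$ is a multiple of 12 and that the frame already holds magnitudes (the function name and the example frame are hypothetical):

```python
def semitone_sums(magnitudes, k_c, bins_per_octave):
    """Sum CQT magnitudes semitone-wise below, at and above the tuning bin, eq. (21)."""
    per_semitone = bins_per_octave // 12   # b / 12, e.g. 3 for b = 36

    def summed(offset):
        target = (k_c + offset) % per_semitone
        return sum(v for k, v in enumerate(magnitudes) if k % per_semitone == target)

    return summed(-1), summed(0), summed(1)

# Hypothetical frame with b = 36 over one octave: most energy at the semitone centers.
frame = [0.2, 1.0, 0.4] * 12
x_l, x_c, x_u = semitone_sums(frame, k_c=19, bins_per_octave=36)
```

In this example the upper sum exceeds the lower one, which would indicate that the input is tuned slightly sharp relative to the reference.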
The subscripts $l$ and $u$ denote the bins below and above the center bin $k_c$. If the maximum value does not lie on the reference bin (e.g. in case of larger detuning), the center-bin index $k_c$ is shifted up or down accordingly. The parabolic interpolation for unevenly spaced sampling points according to [7] is given as
$$x_{\max} = b + \frac{(f(a) - f(b))(c - b)^2 - (f(c) - f(b))(b - a)^2}{2\left[(f(a) - f(b))(c - b) + (f(c) - f(b))(b - a)\right]}, \tag{22}$$
where
$$\begin{aligned}
a &= f_{k_l}, & f(a) &= X_{\mathrm{sum},l}[n_k], \\
b &= f_{k_c}, & f(b) &= X_{\mathrm{sum},c}[n_k], \\
c &= f_{k_u}, & f(c) &= X_{\mathrm{sum},u}[n_k], \\
f_{\mathrm{tuning}}[n_k] &= x_{\max}.
\end{aligned} \tag{23}$$
(23)
The interpolation is visualized in figure 1.
Figure 1: Parabolic interpolation.
Integration or averaging can smooth fluctuations of the calculated value. With
$$\mathrm{det}[n_k] = 1200 \log_2\left(\frac{f_{\mathrm{tuning}}[n_k]}{f_{\mathrm{ref}}}\right) \tag{24}$$
the detuning in cents is calculated. If the value exceeds the range $[-50, 50]$, 100 cents are either added or subtracted.
Note that the accuracy decreases with increasing $\gamma$ due to the impaired frequency resolution.
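Eqs. (22) and (24) translate almost directly into code. The following sketch is only an illustration: the function names, the summed magnitudes and the reference frequency of 440 Hz are hypothetical, and the wrapping loop is one possible reading of "100 cents are either added or subtracted".

```python
import math

def parabolic_peak(a, fa, b, fb, c, fc):
    """Abscissa of the vertex of the parabola through (a, fa), (b, fb), (c, fc), eq. (22)."""
    numerator = (fa - fb) * (c - b) ** 2 - (fc - fb) * (b - a) ** 2
    denominator = 2.0 * ((fa - fb) * (c - b) + (fc - fb) * (b - a))
    return b + numerator / denominator

def detuning_cents(f_tuning, f_ref):
    """Detuning in cents, eq. (24), wrapped into [-50, 50] in steps of 100 cents."""
    det = 1200.0 * math.log2(f_tuning / f_ref)
    while det > 50.0:
        det -= 100.0
    while det < -50.0:
        det += 100.0
    return det

# Hypothetical summed magnitudes at three neighbouring bins (b = 36, f_ref = 440 Hz).
f_l, f_c, f_u = 440.0 * 2 ** (-1 / 36), 440.0, 440.0 * 2 ** (1 / 36)
f_tuning = parabolic_peak(f_l, 2.4, f_c, 12.0, f_u, 4.8)
cents = detuning_cents(f_tuning, 440.0)
```

Because the upper neighbour carries more energy than the lower one, the interpolated frequency lies slightly above 440 Hz and the detuning is a small positive number of cents. The denominator vanishes when the three points are collinear, so a practical implementation would need to guard against that case.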
3 VST3-Implementation
The interface of the VST3 implementation in C++ using the JUCE framework [5] is shown in figure 2. In total, there are six controls for varying the CQT parameters, as listed in table 1.
Figure 2: The RT-CQT VST User Interface
| Control Element | Action |
|---|---|
| fMin | Center frequency of the lowest bin (cf. $f_0$) |
| Octaves | Number of octaves to analyze, starting from fMin; defines the maximum frequency |
| Bins per Octave | Number of CQT bins per octave, e.g. set to 24 for a quarter-tone resolution (cf. $b$) |
| Gamma | Low frequency time resolution enhancement (cf. $B_k$ and $\gamma$) |
| Tuner | Tuner reference frequency |
| Gain | Gain for visualization |
Table 1: Control parameters of RT-CQT VST
4 Summary
This efficient real-time CQT analyzer is based on the algorithms proposed in [3] and [4]. The algorithm can be summarized by the following steps. The time-domain signal is transformed into the frequency domain by means of an FFT. Then, for each CQT bin, the spectral components of the DFT between the CQT bin's lower and upper bounds are extracted by means of a frequency-domain window. The extract is then shifted to the baseband in order to perform subsampling in the frequency domain by applying an IDFT that is substantially shorter than the DFT. After applying overlap-and-add, this yields the time series of CQT coefficients. The whole signal processing chain is block-based. By using the parameter $\gamma$, the time resolution towards lower frequencies can be increased. This leads to a decrease of the Q-factor towards lower frequencies.
Furthermore, a CQT-based tuner algorithm is proposed. By semitone-wise summing of the absolute values of the CQT coefficients below, at and above the tuning frequency bin, which is fixed by the set tuning frequency, an average amplitude distribution over all semitones is calculated. The three summed values are taken as the representatives at the frequencies below, at and above the tuning frequency bin. These values and the aforementioned frequencies are used for the parabolic interpolation shown in eq. (22). The parabolic interpolation yields a more accurate estimate of the input signal's tuning frequency. The mistuning of the estimated tuning frequency with respect to the tuning frequency set in the plugin's Tuner section is calculated and displayed in cents.
For further information on the VST plugin's structure and a more detailed insight into the created framework, refer to the documentation at https://git.iem.at/audioplugins/cqt-analyzer.
References
[1] Judith C. Brown. Calculation of a constant Q spectral transform. Journal of the Acoustical Society of America, 89:425–434, January 1991.
[2] Brian R. Glasberg and Brian C.J. Moore. Derivation of auditory filter shapes from notched-noise data. Hearing Research, 47(1-2):103–138, 1990.
[3] Nicki Holighaus, Monika Dörfler, Gino Angelo Velasco, and Thomas Grill. A framework for invertible, real-time constant-Q transforms. IEEE Transactions on Audio, Speech, and Language Processing, 21(4):775–785, 2013.
[4] Christian Schörkhuber, Anssi Klapuri, Nicki Holighaus, and Monika Dörfler. A Matlab toolbox for efficient perfect reconstruction time-frequency transforms with log-frequency resolution. In AES 53rd International Conference, London, UK, January 2014.
[5] Jules Storer and ROLI. JUCE - Jules' Utility Class Extensions, October 2019. Version 5.4.5, available at https://github.com/WeAreROLI/JUCE.
[6] Gino Angelo Velasco, Nicki Holighaus, Monika Dörfler, and Thomas Grill. Constructing an invertible constant-Q transform with nonstationary Gabor frames. In Proc. of the 14th International Conference on Digital Audio Effects, Paris, France, September 2011.
[7] Ruye Wang. Harvey Mudd College, lecture notes, parabolic interpolation, February 2015. Available at http://fourier.eng.hmc.edu/e176/lectures/NM/node25.html, last accessed on 26.03.2020.