Computational efficient real-time capable constant-Q spectrum
analyzer
Felix Holzmüller1,2, Paul Bereuter1,2, Philipp Merz1,2, Daniel Rudrich1,3, and Alois Sontacchi1,3
1University of Music and Performing Arts, Graz, Austria
2Graz University of Technology, Austria
3Institute of Electronic Music and Acoustics, Graz, Austria
May 2020
Abstract
The constant-Q transform (CQT) is a valuable tool for music information retrieval, e.g. for chroma calculation and harmonic analysis. In this e-Brief, we propose a block-based, real-time capable, efficient analysis algorithm that rests upon a subsampling technique performed with the fast Fourier transform. In addition, advanced features such as time resolution enhancement towards lower frequencies and a robust CQT-based tuner are presented. Finally, a reference implementation in C++ in the form of a VST3 plugin is introduced. The plugin's source code will be openly available for further development.
1 Introduction
VST-Plugins that show a visual representation of the short-time Fourier transform (STFT) are a common tool in modern DAWs. In a musical context, the linear frequency bin spacing is not optimal, since the scale of Western music is geometrically spaced. For musical analysis, linear spacing leads to an insufficient resolution in the lower frequency regions, while the higher frequencies are resolved more finely than necessary. In signal processing, a general solution to this problem has been found: the constant-Q transform (CQT). So far, it has not been implemented for efficient, accurate real-time applications.
Since the CQT was proposed by J. C. Brown in 1991 [1] for musical analysis, numerous authors have contributed to making its calculation more efficient. With the latest additions by G. A. Velasco et al. [6], N. Holighaus et al. [3] and an implementation by C. Schörkhuber et al. [4], it is possible to create a computationally inexpensive, fast Fourier transform (FFT) based CQT-analyzer. Based on this approach, it is possible to develop a real-time capable algorithm and a corresponding implementation as a VST-Plugin using the JUCE framework [5]. Section 2 explains the analyzer's algorithm in detail, going from its original direct implementation through the various optimization steps. Section 3 features a manual describing the analyzer's user interface. A short summary in section 4 concludes this publication. The program, its source code and a documentation featuring a detailed overview of the framework as well as the plugin's dataflow can be found here: https://git.iem.at/audioplugins/cqt-analyzer

Author's accepted manuscript as presented at the 148th AES Convention. The AES published version can be found at https://aes2.org/publications/elibrary-page/?id=20805
2 Algorithm
The CQT can be seen as a filter bank with logarithmically spaced center frequencies $f_k$ and a constant quality factor $Q$ for all filters. The modulated localization functions (the so-called atoms) $a_k[n]$ use a window function $g_k[n]$ and a sampling frequency $f_s$. The samples of a discrete input-signal block $x[m]$ are weighted with the atoms and summed. This defines the CQT as [4, p. 2]
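
$$X[k, n] = \sum_{m=0}^{N_k-1} x[m]\, a_k^*[m-n], \quad n, k \in \mathbb{N}, \quad \text{with} \tag{1}$$

$$a_k[m] = g_k[m]\, e^{j 2\pi m \frac{f_k}{f_s}}, \quad m \in \mathbb{Z}. \tag{2}$$
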
In this case, $n$ and $k$ denote the time and frequency indices in the CQT domain, respectively. In this notation, one can see the CQT's relation to the discrete Fourier transform (DFT): the same kernel function is used, but with logarithmically spaced analysis frequencies $f_k$, bandwidths $B_k$¹ and analysis window lengths $N_k$ in the time domain.
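
$$f_k = f_0 \, 2^{k/b} \tag{3}$$

$$B_k = \frac{f_k}{Q} + \gamma \tag{4}$$

$$Q_{\mathrm{new},k} = \frac{f_k}{B_k} \tag{5}$$

$$b_{\mathrm{new},k} = \frac{\ln(2)}{\operatorname{arsinh}\left(\frac{1}{2 Q_{\mathrm{new},k}}\right)} \tag{6}$$

$$N_k = Q_{\mathrm{new},k} \frac{f_s}{f_k} \tag{7}$$
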
$f_0$ is the lowest frequency to be analyzed and $b$ is the resolution in bins per octave. In the constant-Q case, $B_k$ is directly proportional to $f_k$. With $\gamma > 0$, a frequency-dependent recalculation of $Q_{\mathrm{new},k}$ and $b_{\mathrm{new},k}$ is necessary. The constant-Q case is obtained with $\gamma = 0$. The purpose of $\gamma$ is discussed in section 2.3.
¹ In this case, $B_k$ is defined as the absolute bandwidth, not as the −3 dB bandwidth.
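To make eqs. (3) to (7) concrete, the following minimal sketch evaluates them for every bin. The function name `cqt_bin_parameters` and the example settings are hypothetical. The text does not state how $Q$ follows from $b$; the sketch assumes that $Q$ is chosen such that eq. (6) returns $b_{\mathrm{new},k} = b$ for $\gamma = 0$.

```python
import math


def cqt_bin_parameters(f0, bins_per_octave, num_bins, fs, gamma=0.0):
    """Evaluate eqs. (3)-(7) for bins k = 0 .. num_bins - 1."""
    # Assumption (not stated in the text): Q is chosen such that eq. (6)
    # returns b_new == b for gamma == 0, i.e. Q = 1 / (2 sinh(ln 2 / b)).
    q = 1.0 / (2.0 * math.sinh(math.log(2.0) / bins_per_octave))
    bins = []
    for k in range(num_bins):
        f_k = f0 * 2.0 ** (k / bins_per_octave)                   # eq. (3)
        bandwidth = f_k / q + gamma                               # eq. (4)
        q_new = f_k / bandwidth                                   # eq. (5)
        b_new = math.log(2.0) / math.asinh(1.0 / (2.0 * q_new))   # eq. (6)
        n_k = q_new * fs / f_k                                    # eq. (7)
        bins.append({"f": f_k, "B": bandwidth, "Q": q_new, "b": b_new, "N": n_k})
    return bins


# Hypothetical settings: 55 Hz lowest bin, 36 bins per octave, 2 octaves, 48 kHz.
constant_q = cqt_bin_parameters(55.0, 36, 72, 48000.0)
widened = cqt_bin_parameters(55.0, 36, 72, 48000.0, gamma=10.0)
# Window lengths at f0 in samples, without and with gamma.
print(round(constant_q[0]["N"]), round(widened[0]["N"]))
```

With $\gamma = 0$, all bins share the same $Q$ and $b_{\mathrm{new},k} = b$; with $\gamma > 0$, the window length $N_k$ shrinks most strongly at the lowest bins.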
2.1 Calculating the CQT with a fast convolution
As a consequence of these differences, the optimized algorithm used to calculate the DFT (the FFT) is not directly applicable in this case, which makes a direct implementation of eq. (1) computationally too expensive or too inaccurate for real-time applications.
The FFT is still of use, since the transformation in eq. (1) can be rewritten as a convolution, assuming the atoms are conjugate-symmetric, i.e. $a_k^*[m] = a_k[-m]$, which holds for a real, symmetric window $g_k$:

$$\begin{aligned} X[k, n] &= \sum_{m=0}^{N_k-1} x[m]\, a_k^*[m-n] \\ &= \sum_{m=0}^{N_k-1} x[m]\, a_k[n-m] \\ &= (x * a_k)[n] \end{aligned} \tag{8}$$

The convolution can be efficiently computed by a
fast convolution, that is, using the FFT to compute
the convolution by means of a multiplication in the
frequency domain and going back to the time domain
via an inverse-DFT (IDFT):
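
$$X[k, n] = (x * a_k)[n] = \mathcal{F}^{-1}_{i \mapsto n}\left\{ \left[ \mathcal{F}_{n \mapsto i}\{x\} \cdot \mathcal{F}_{n \mapsto i}\{a_k\} \right](i) \right\}[n] \tag{9}$$
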
where $i$ denotes a DFT bin and $k$ a CQT bin. This requires block processing of the signal. The block length corresponds to the DFT length $N_{\mathrm{DFT}}$, which depends on the maximum window length $N_{\max} = \max(N_k)$, occurring at the lowest analysis frequency $f_0$. The signal is split into blocks with 50% overlap. Each block is windowed with a Hann window before its transformation into the frequency domain. For a computationally more efficient implementation with less overlap, Tukey windows as proposed in [3] could also be used. Using oversampling with a factor $os \in \mathbb{N}^*$, the DFT length is defined as

$$N_{\mathrm{DFT}} = os \cdot \mathrm{nextPower2}(N_{\max}) = os \cdot \mathrm{nextPower2}\left(Q_{\mathrm{new}} \frac{f_s}{f_0}\right) = os \cdot 2^{\left\lceil \log_2\left(Q_{\mathrm{new}} \frac{f_s}{f_0}\right) \right\rceil}. \tag{10}$$
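As a small worked example of eq. (10), the sketch below computes the DFT length from the longest analysis window. The helper names and the numeric settings are illustrative assumptions, not values taken from the plugin.

```python
import math


def next_power_of_two(value):
    """Smallest power of two that is greater than or equal to value."""
    return 2 ** math.ceil(math.log2(value))


def dft_length(q_new, fs, f0, oversampling=1):
    """Eq. (10): DFT length from the longest window N_max = Q_new * fs / f0."""
    n_max = q_new * fs / f0
    return oversampling * next_power_of_two(n_max)


# Hypothetical settings: Q_new = 26, fs = 48 kHz, f0 = 55 Hz, twofold oversampling.
print(dft_length(26.0, 48000.0, 55.0, oversampling=2))
```

Here $N_{\max} \approx 22691$ samples, so the next power of two is 32768 and the oversampled DFT length is 65536.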
As the algorithm only deals with real-valued input signals, we can optimize by only calculating the DFT for positive frequencies.
There are two immediate optimizations for this procedure. Firstly, the input signal needs to be transformed to the frequency domain only once for the calculation of all frequency bins. Secondly, there is no need to repeatedly transform the localization functions; they can be designed and stored in the frequency domain prior to the transformation itself, where they constitute window functions $A_k$. This gives the advantage of a window with compact support, so that applying the window is computationally cheap. As a drawback, slight ripples can be observed after transforming back into the CQT domain, due to the window's infinite support in the time domain. We propose using a Hann window due to its good sidelobe suppression and narrow mainlobe as well as its perfect overlapping.

$$Y(i) = \mathcal{F}_{n \mapsto i}\{x[n]\}(i) \tag{11}$$

$$X[k, n] = \mathcal{F}^{-1}_{i \mapsto n}\{Y(i)\, A_k(i)\}[n] \tag{12}$$

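The fast convolution of eqs. (9), (11) and (12) can be sketched as follows. This is a minimal illustration, not the plugin's code: the naive `dft` helper stands in for an FFT routine, the toy window `window` is a hypothetical $A_k$, and the comparison against a direct circular convolution only serves to check the equivalence (a DFT-based product always corresponds to a circular convolution).

```python
import cmath


def dft(values, inverse=False):
    """Naive O(N^2) DFT or inverse DFT; a real implementation would call an FFT."""
    n = len(values)
    sign = 1.0 if inverse else -1.0
    result = []
    for i in range(n):
        acc = sum(v * cmath.exp(sign * 2j * cmath.pi * i * m / n)
                  for m, v in enumerate(values))
        result.append(acc / n if inverse else acc)
    return result


def cqt_full_rate(block, freq_windows):
    """Eqs. (11)/(12): one forward DFT, then one multiply and IDFT per CQT bin."""
    spectrum = dft(block)                                   # eq. (11), computed once
    return [dft([y * a for y, a in zip(spectrum, window)], inverse=True)
            for window in freq_windows]                     # eq. (12) for every bin k


def circular_convolution(x, a):
    """Direct time-domain reference that the fast convolution must reproduce."""
    n = len(x)
    return [sum(x[m] * a[(t - m) % n] for m in range(n)) for t in range(n)]


block = [0.0, 1.0, 0.5, -0.5, -1.0, 0.25, 0.75, -0.25]
window = [0.0, 0.5, 1.0, 0.5, 0.0, 0.0, 0.0, 0.0]   # toy A_k on DFT bins 1..3
fast = cqt_full_rate(block, [window])[0]
atom = dft(window, inverse=True)                     # time-domain atom a_k
direct = circular_convolution(block, atom)
print(max(abs(f - d) for f, d in zip(fast, direct)))  # numerically close to zero
```

Because the window is stored in the frequency domain, only the forward transform of the input block and one inverse transform per bin remain at run time.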
2.2 Subsampling in the frequency domain
In this state, the algorithm's output, namely one coefficient for each bin in each time step, contains a large amount of redundancy. The Shannon-Nyquist sampling theorem (in its extension to non-baseband signals) states that this redundancy can be removed by subsampling the output of each bin: as long as the sampling rate after subsampling $f_s^k$ is at least as large as the absolute bandwidth $B_k$, no information will be lost.

$$f_s^k \geq B_k \tag{13}$$

To further reduce the computational effort, it is also possible to perform the subsampling in the frequency domain. This is done by applying an IDFT with $N_{\mathrm{IDFT}} < N_{\mathrm{DFT}}$ only along the range where the respective frequency domain window is non-zero, or in other words, by shifting the windowed spectrum to the baseband before transforming back to the time domain with a lower resolution IDFT.

$$i_{u,k} = f_k \cdot 2^{\frac{1}{b_{\mathrm{new},k}}} \frac{N_{\mathrm{DFT}}}{f_s} \tag{14}$$

$$i_{l,k} = f_k \cdot 2^{-\frac{1}{b_{\mathrm{new},k}}} \frac{N_{\mathrm{DFT}}}{f_s} \tag{15}$$

$i_{u,k}$ and $i_{l,k}$ are the DFT bins that mark the upper and lower bounds of $A_k$. The values of the windows themselves are obtained from a large, precalculated Hann window $A_{\mathrm{lookup}}$ of length $M$. For every sampling point of $A_{\mathrm{lookup}}$, a frequency $f_{A_{\mathrm{lookup}}}[k, m]$ is assigned for each bin $k$, based on the CQT's logarithmic frequency spacing. These are calculated as
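
$$f_{A_{\mathrm{lookup}}}[k, m] = f_k \cdot 2^{\frac{-\lfloor M/2 \rfloor + m}{\lfloor M/2 \rfloor \cdot b_{\mathrm{new},k}}} \quad \text{for } m = 0, 1, \ldots, M-1. \tag{16}$$
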
A length of $M = 8 \cdot N_{\mathrm{IDFT}} + 1$ (see eq. (19)) is more than sufficient for usage without further interpolation. The values of $A_{\mathrm{lookup}}$ whose corresponding frequencies $f_{A_{\mathrm{lookup}}}[k, m]$ are closest to the frequencies of the DFT bins $f_i = i \cdot \frac{f_s}{N_{\mathrm{DFT}}}$ for $i_{l,k} \leq i \leq i_{u,k}$ are chosen and stored as the window function $A_k$ for the $k$th CQT bin.
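The construction of one frequency-domain window from the lookup table can be sketched as below. The function names, the inward rounding of the bounds from eqs. (14) and (15), the brute-force nearest-frequency search and the short lookup length are assumptions made for illustration; the text does not specify these details.

```python
import math


def hann(length):
    """Symmetric Hann window whose first and last samples are zero."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * m / (length - 1)) for m in range(length)]


def frequency_window(f_k, b_new, n_dft, fs, lookup):
    """Return (i_l, i_u, A_k) for one CQT bin following eqs. (14)-(16)."""
    half = len(lookup) // 2
    # Assumption: the bounds are rounded inwards so that every selected DFT bin
    # lies inside the support of the window.
    i_u = math.floor(f_k * 2.0 ** (1.0 / b_new) * n_dft / fs)    # eq. (14)
    i_l = math.ceil(f_k * 2.0 ** (-1.0 / b_new) * n_dft / fs)    # eq. (15)
    # eq. (16): frequency assigned to every sample of the lookup window
    f_lookup = [f_k * 2.0 ** ((m - half) / (half * b_new)) for m in range(len(lookup))]
    window = []
    for i in range(i_l, i_u + 1):
        f_i = i * fs / n_dft                                     # frequency of DFT bin i
        nearest = min(range(len(lookup)), key=lambda m: abs(f_lookup[m] - f_i))
        window.append(lookup[nearest])
    return i_l, i_u, window


# Hypothetical bin: 1 kHz center frequency, 12 bins per octave, 4096-point DFT.
i_l, i_u, a_k = frequency_window(1000.0, 12.0, 4096, 48000.0, hann(513))
print(i_l, i_u, [round(v, 3) for v in a_k])
```

Since the lookup frequencies are logarithmically spaced while the DFT bins are linearly spaced, the resulting window is slightly asymmetric on the DFT-bin axis.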
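The shift to the baseband is computed with:

$$Y'(i, k) = \begin{cases} Y(i_{l,k} + i) \cdot A_k(i_{l,k} + i), & \text{for } 0 \leq i \leq i_{u,k} - i_{l,k} \\ 0, & \text{else} \end{cases} \tag{17}$$

The spectrum is then transformed back to obtain $X'$.

$$X'[k, n_k] = \mathcal{F}^{-1}_{i \mapsto n_k}\{Y'(i, k)\} \tag{18}$$
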
$X'$ is the subsampled CQT of $x$, with the time variables $n_k$ of the individual channels progressing with $f_s^k$, if the IDFT size is chosen to be the minimum, namely

$$N_{\mathrm{IDFT}} = \mathrm{nextPower2}\left(N_{\mathrm{DFT}} \frac{B_{k,\max}}{f_s}\right). \tag{19}$$

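The baseband shift and the short inverse transform of eqs. (17) and (18) can be illustrated with a toy example. The naive `dft` helper again stands in for an FFT, and the three-tap window, the block length of 64 and the IDFT length of 8 are hypothetical values chosen to keep the example small; the sketch requires the window to be no longer than the IDFT.

```python
import cmath
import math


def dft(values, inverse=False):
    """Naive O(N^2) DFT or inverse DFT; a real implementation would call an FFT."""
    n = len(values)
    sign = 1.0 if inverse else -1.0
    result = []
    for i in range(n):
        acc = sum(v * cmath.exp(sign * 2j * cmath.pi * i * m / n)
                  for m, v in enumerate(values))
        result.append(acc / n if inverse else acc)
    return result


def subsampled_bin(spectrum, window, i_l, n_idft):
    """Eqs. (17)/(18): window the spectrum, shift it to the baseband, apply a short IDFT."""
    shifted = [0j] * n_idft                      # zero-padded baseband spectrum Y'(i, k)
    for offset, weight in enumerate(window):     # offset runs over 0 .. i_u - i_l
        shifted[offset] = spectrum[i_l + offset] * weight
    return dft(shifted, inverse=True)            # N_IDFT coefficients instead of N_DFT


n_dft, n_idft = 64, 8
block = [math.cos(2.0 * math.pi * 10 * n / n_dft) for n in range(n_dft)]  # tone on DFT bin 10
coefficients = subsampled_bin(dft(block), [0.5, 1.0, 0.5], i_l=9, n_idft=n_idft)
print([round(abs(c), 6) for c in coefficients])
```

For this stationary tone, the 8 output coefficients all have the same magnitude, which shows that the envelope of the bin is represented at one eighth of the original rate.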
In this implementation, the maximum required IDFT length (that of the bin with the largest bandwidth $B_{k,\max}$, which by eq. (4) is the highest frequency bin) is used for all bins, applying zero-padding when necessary. Although slightly less efficient, this has the advantage of all channels running at the same rate. Finally, the time-CQT representation is obtained using an overlap-and-add algorithm on the absolute values $|X'[k, n_k]|$ using

$$h_s = \frac{2 N_{\mathrm{IDFT}}}{os} \tag{20}$$

as the hop size for reconstruction of the blocks.
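The common IDFT length of eq. (19) and the resulting channel rate can be checked against the sampling condition of eq. (13) with a few lines. The helper names and the bandwidth of 500 Hz are illustrative assumptions; the rate follows from the fact that each block of $N_{\mathrm{DFT}}$ input samples yields $N_{\mathrm{IDFT}}$ output samples.

```python
import math


def next_power_of_two(value):
    """Smallest power of two that is greater than or equal to value."""
    return 2 ** math.ceil(math.log2(value))


def idft_length(n_dft, max_bandwidth, fs):
    """Eq. (19): shortest power-of-two IDFT that still covers the widest bin."""
    return next_power_of_two(n_dft * max_bandwidth / fs)


def subsampled_rate(n_dft, n_idft, fs):
    """Channel rate: N_IDFT output samples per block of N_DFT input samples."""
    return fs * n_idft / n_dft


n_idft = idft_length(65536, 500.0, 48000.0)      # hypothetical B_k,max of 500 Hz
rate = subsampled_rate(65536, n_idft, 48000.0)
print(n_idft, rate)
```

In this example the channels run at 750 Hz, which satisfies eq. (13) for every bin because no bandwidth exceeds the assumed 500 Hz.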
2.3 Low frequency time resolution enhancement
The parameter $\gamma$ introduced in eq. (4) enables the modification of a CQT bin's bandwidth $B_k$ [4, p. 5]. This is done to enhance the low frequency time resolution at the expense of the frequency resolution. Note that the center frequencies $f_k$ of the CQT bins remain the same. In the context of human perception this makes sense, since the filter bank of the human auditory system only resembles a constant-Q system above approximately 500 Hz [4, p. 5]. These properties can be explained using the theory of the equivalent rectangular bandwidths (ERB) [2]. Strictly speaking, the resulting transformation is no longer a constant-Q transform when $\gamma \neq 0$.
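Combining eqs. (4), (5) and (7) gives the window duration $N_k / f_s = 1 / B_k = 1 / (f_k / Q + \gamma)$, which makes the effect of $\gamma$ easy to quantify. The sketch below uses a hypothetical quality factor of 26 and $\gamma = 10$ Hz purely for illustration.

```python
def window_duration(f_k, q, gamma):
    """N_k / fs = 1 / B_k = 1 / (f_k / Q + gamma), from eqs. (4), (5) and (7)."""
    return 1.0 / (f_k / q + gamma)


q = 26.0  # hypothetical quality factor
for f_k in (55.0, 440.0, 3520.0):
    without_gamma = 1000.0 * window_duration(f_k, q, 0.0)
    with_gamma = 1000.0 * window_duration(f_k, q, 10.0)
    print(f_k, round(without_gamma, 1), round(with_gamma, 1))  # durations in ms
```

The window at 55 Hz becomes several times shorter, while the window at 3520 Hz is hardly affected, which is the intended low frequency time resolution enhancement.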
2.4 Tuner algorithm
The implemented tuner algorithm is only suitable for a resolution $b$ of at least 36 bins per octave (i.e. 3 bins per semitone). The algorithm is based upon parabolic interpolation, where the location of the maximum of the parabola through three samples is determined. Initially, the index $k_c$ of the CQT bin nearest to the given tuning frequency $f_{\mathrm{ref}}$ is found. For parabolic interpolation, three sampling points are needed. Therefore, the bins are summed semitone-wise at the relative position of the tuning bin as well as one bin above and one bin below it, resulting in an average amplitude distribution over all semitones. For a resolution of 3 bins per semitone, for example, all bins at the center of a semitone, all bins below the center and all bins above the center are summed up separately. This can be calculated as
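
$$\begin{aligned}
X_{\mathrm{sum},c}[n_k] &= \sum_k X'[k, n_k], \quad \text{for } \left\{ k \,\middle|\, k \bmod \tfrac{b}{12} = k_c \bmod \tfrac{b}{12} \right\} \\
X_{\mathrm{sum},l}[n_k] &= \sum_k X'[k, n_k], \quad \text{for } \left\{ k \,\middle|\, k \bmod \tfrac{b}{12} = (k_c - 1) \bmod \tfrac{b}{12} \right\} \\
X_{\mathrm{sum},u}[n_k] &= \sum_k X'[k, n_k], \quad \text{for } \left\{ k \,\middle|\, k \bmod \tfrac{b}{12} = (k_c + 1) \bmod \tfrac{b}{12} \right\}.
\end{aligned} \tag{21}$$
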
The subscripts $l$ and $u$ describe the bin below and above the center bin $k_c$. If the maximum value does not rest upon the reference bin (e.g. in case of larger detuning), the center-bin index $k_c$ is shifted up or down accordingly. The parabolic interpolation for unevenly spaced sampling points according to [7] is given as
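
$$x_{\max} = b + \frac{(f(a) - f(b))(c - b)^2 - (f(c) - f(b))(b - a)^2}{2\left[(f(a) - f(b))(c - b) + (f(c) - f(b))(b - a)\right]}, \tag{22}$$

where

$$\begin{aligned}
a &= f_{k_l}, & f(a) &= X_{\mathrm{sum},l}[n_k], \\
b &= f_{k_c}, & f(b) &= X_{\mathrm{sum},c}[n_k], \\
c &= f_{k_u}, & f(c) &= X_{\mathrm{sum},u}[n_k], \\
f_{\mathrm{tuning}}[n_k] &= x_{\max}.
\end{aligned} \tag{23}$$
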
The interpolation is visualized in figure 1.
Figure 1: Parabolic interpolation.
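The semitone-wise summation of eq. (21) and the interpolation of eq. (22) can be sketched as follows. The function names and the toy magnitudes are hypothetical, and the sketch does not guard against the degenerate case of three collinear points, for which the denominator of eq. (22) vanishes.

```python
def semitone_sums(magnitudes, k_c, bins_per_semitone):
    """Eq. (21): sum all bins that share the semitone-relative position of
    k_c - 1, k_c and k_c + 1; returns (lower, center, upper)."""
    sums = []
    for target in (k_c - 1, k_c, k_c + 1):
        position = target % bins_per_semitone
        sums.append(sum(value for k, value in enumerate(magnitudes)
                        if k % bins_per_semitone == position))
    return tuple(sums)


def parabola_peak(a, fa, b, fb, c, fc):
    """Eq. (22): abscissa of the vertex of the parabola through three points."""
    numerator = (fa - fb) * (c - b) ** 2 - (fc - fb) * (b - a) ** 2
    denominator = 2.0 * ((fa - fb) * (c - b) + (fc - fb) * (b - a))
    return b + numerator / denominator


# Toy spectrum: two semitones with 3 bins each, energy slightly above the center.
lower, center, upper = semitone_sums([1, 5, 2, 1, 6, 2], k_c=1, bins_per_semitone=3)
# Center frequencies of the three bins around 440 Hz for 36 bins per octave, eq. (3).
f_l, f_c, f_u = 440.0 * 2 ** (-1 / 36), 440.0, 440.0 * 2 ** (1 / 36)
print(parabola_peak(f_l, lower, f_c, center, f_u, upper))
```

Because the upper sum exceeds the lower sum in this toy case, the interpolated frequency lies slightly above 440 Hz.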
An integration or averaging can smooth fluctuations of the calculated value. With

$$\mathrm{det}[n_k] = 1200 \log_2\left(\frac{f_{\mathrm{tuning}}[n_k]}{f_{\mathrm{ref}}}\right) \tag{24}$$

the detuning in cents is calculated. If the value falls outside the range $[-50, 50]$, 100 cents are either added or subtracted.
Note that the accuracy decreases with increasing $\gamma$ due to the impaired frequency resolution.
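Eq. (24) and the folding into the range $[-50, 50]$ can be sketched as below. The function name is hypothetical, and repeating the 100 cent correction until the value lies inside the range is an assumption that generalizes the single correction described in the text.

```python
import math


def detuning_cents(f_tuning, f_ref):
    """Eq. (24) followed by folding the result into the range [-50, 50] cents."""
    cents = 1200.0 * math.log2(f_tuning / f_ref)
    # Assumption: the 100 cent correction is repeated as often as necessary.
    while cents > 50.0:
        cents -= 100.0
    while cents < -50.0:
        cents += 100.0
    return cents


print(detuning_cents(440.0 * 2 ** (10 / 1200), 440.0))   # about 10 cents sharp
print(detuning_cents(440.0 * 2 ** (60 / 1200), 440.0))   # reads as about 40 cents flat
```

A tone that is 60 cents above the reference is therefore displayed as roughly 40 cents below the next semitone.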
3 VST3-Implementation
The interface of the VST3 implementation in C++ using the JUCE framework [5] is shown in figure 2. There are six controls in total for varying the CQT parameters, as shown in table 1.
Figure 2: The RT-CQT VST User Interface
| Control Element | Action |
|---|---|
| fMin | Center frequency of the lowest bin (cf. $f_0$) |
| Octaves | Number of octaves to analyze, starting from fMin; defines the maximum frequency |
| Bins per Octave | Number of CQT bins per octave, e.g. set to 24 for a quarter-tone resolution (cf. $b$) |
| Gamma | Low frequency time resolution enhancement (cf. $B_k$ and $\gamma$) |
| Tuner | Tuner reference frequency |
| Gain | Gain for visualization |

Table 1: Control parameters of RT-CQT VST
4 Summary
This efficient real-time CQT analyzer is based on the algorithm proposed in [3] and [4]. The algorithm can be summarized by the following steps. The time domain signal is transformed into the frequency domain by means of an FFT. Then, for each CQT bin, the spectral components of the DFT between the CQT bin's lower bound and its upper bound are extracted by means of a frequency-domain window. The extract is then shifted to the baseband in order to perform subsampling in the frequency domain by applying an IDFT of a length substantially shorter than the length of the DFT. After applying overlap-and-add, this yields the time series of CQT coefficients. The whole signal processing chain is block-based. By using the parameter $\gamma$, the time resolution towards lower frequencies can be increased. This leads to a decrease of the Q-factor towards lower frequencies.
Furthermore, a CQT-based tuner algorithm is proposed. By semitone-wise summing the absolute values of the CQT coefficients below, at and above the tuning frequency bin, which is fixed by the set tuning frequency, an average amplitude distribution over all semitones is calculated. The three summed values are taken as the representatives at the frequencies below, at and above the tuning frequency bin. These and the aforementioned frequencies are used for the parabolic interpolation shown in eq. (22). With the parabolic interpolation, a more accurate estimate of the input signal's tuning frequency is obtained. The mistuning of the estimated tuning frequency with respect to the tuning frequency set in the plugin's Tuner section is calculated and displayed in cents.
For further information on the VST-Plugin's structure and a more detailed insight into the created framework, refer to the documentation at https://git.iem.at/audioplugins/cqt-analyzer.
References
[1] Judith C. Brown. Calculation of a constant Q spectral transform. Journal of the Acoustical Society of America, 89:425–434, January 1991.
[2] Brian R. Glasberg and Brian C. J. Moore. Derivation of auditory filter shapes from notched-noise data. Hearing Research, 47(1-2):103–138, 1990.
[3] Nicki Holighaus, Monika Dörfler, Gino Angelo Velasco, and Thomas Grill. A framework for invertible, real-time constant-Q transforms. IEEE Transactions on Audio, Speech, and Language Processing, 21(4):775–785, 2013.
[4] Christian Schörkhuber, Anssi Klapuri, Nicki Holighaus, and Monika Dörfler. A Matlab toolbox for efficient perfect reconstruction time-frequency transforms with log-frequency resolution. In AES 53rd International Conference, London, UK, January 2014.
[5] Jules Storer and ROLI. JUCE - Jules' Utility Class Extensions, October 2019. Version 5.4.5, available at https://github.com/WeAreROLI/JUCE.
[6] Gino Angelo Velasco, Nicki Holighaus, Monika Dörfler, and Thomas Grill. Constructing an invertible constant-Q transform with nonstationary Gabor frames. In Proc. of the 14th International Conference on Digital Audio Effects, Paris, France, September 2011.
[7] Ruye Wang. Harvey Mudd College, lecture notes, parabolic interpolation, February 2015. Ruye Wang 2015-02-12, available at http://fourier.eng.hmc.edu/e176/lectures/NM/node25.html, last accessed on 26.03.2020.