librosa: Audio and Music Signal Analysis in Python

Brian McFee, Colin Raffel, Dawen Liang, Daniel P.W. Ellis, Matt McVicar, Eric Battenberg, Oriol Nieto

Center for Data Science, New York University
Music and Audio Research Laboratory, New York University
LabROSA, Columbia University
Department of Engineering Mathematics, University of Bristol
Silicon Valley AI Lab, Baidu, Inc.

Corresponding author: brian.mcfee@nyu.edu

Copyright © 2015 Brian McFee et al. This is an open-access article distributed
under the terms of the Creative Commons Attribution License, which permits
unrestricted use, distribution, and reproduction in any medium, provided the
original author and source are credited.
Abstract—This document describes version 0.4.0 of librosa: a Python pack-
age for audio and music signal processing. At a high level, librosa provides
implementations of a variety of common functions used throughout the field of
music information retrieval. In this document, a brief overview of the library’s
functionality is provided, along with explanations of the design goals, software
development practices, and notational conventions.
Index Terms—audio, music, signal processing
Introduction
The emerging research field of music information retrieval
(MIR) broadly covers topics at the intersection of musicol-
ogy, digital signal processing, machine learning, information
retrieval, and library science. Although the field is relatively
young—the first international symposium on music informa-
tion retrieval (ISMIR)1 was held in October of 2000—it is
rapidly developing, thanks in part to the proliferation and
practical scientific needs of digital music services, such as
iTunes, Pandora, and Spotify. While the preponderance of MIR
research has been conducted with custom tools and scripts
developed by researchers in a variety of languages such as
MATLAB or C++, the stability, scalability, and ease of use
of these tools have often left much to be desired.
In recent years, interest has grown within the MIR commu-
nity in using (scientific) Python as a viable alternative. This has
been driven by a confluence of several factors, including the
availability of high-quality machine learning libraries such as
scikit-learn [Pedregosa11] and tools based on Theano
[Bergstra11], as well as Python’s vast catalog of packages
for dealing with text data and web services. However, the
adoption of Python has been slowed by the absence of a
stable core library that provides the basic routines upon which
many MIR applications are built. To remedy this situation,
we have developed librosa:2 a Python package for audio and
music signal processing.3 In doing so, we hope to both ease
the transition of MIR researchers into Python (and modern
software development practices), and also to make core MIR
techniques readily available to the broader community of
scientists and Python programmers.
Design principles
In designing librosa, we have prioritized a few key concepts.
First, we strive for a low barrier to entry for researchers
familiar with MATLAB. In particular, we opted for a relatively
flat package layout and, following scipy [Jones01], rely upon
numpy data types and functions [VanDerWalt11], rather than
abstract class hierarchies.
Second, we expended considerable effort in standardizing
interfaces, variable names, and (default) parameter settings
across the various analysis functions. This task was compli-
cated by the fact that reference implementations from which
our implementations are derived come from various authors,
and are often designed as one-off scripts rather than proper
library functions with well-defined interfaces.
Third, wherever possible, we retain backwards compatibility
against existing reference implementations. This is achieved
via regression testing for numerical equivalence of outputs.
All tests are implemented in the nose framework.4
Fourth, because MIR is a rapidly evolving field, we rec-
ognize that the exact implementations provided by librosa
may not represent the state of the art for any particular task.
Consequently, functions are designed to be modular, allowing
practitioners to provide their own functions when appropriate,
e.g., a custom onset strength estimate may be provided to the
beat tracker as a function argument. This allows researchers to
leverage existing library functions while experimenting with
improvements to specific components. Although this seems
simple and obvious, from a practical standpoint the monolithic
designs and lack of interoperability between different research
codebases have historically made this difficult.
Finally, we strive for readable code, thorough documen-
tation and exhaustive testing. All development is conducted
on GitHub. We apply modern software development prac-
tices, such as continuous integration testing (via Travis5) and
coverage (via Coveralls6). All functions are implemented in
pure Python, thoroughly documented using Sphinx, and in-
clude example code demonstrating usage. The implementation
mostly complies with PEP-8 recommendations, with a small
1. http://ismir.net
2. https://github.com/bmcfee/librosa
3. The name librosa is borrowed from LabROSA: the LABoratory for the
Recognition and Organization of Speech and Audio at Columbia University,
where the initial development of librosa took place.
4. https://nose.readthedocs.org/en/latest/
set of exceptions for variable names that make the code
more concise without sacrificing clarity: e.g., y and sr are
preferred over more verbose names such as audio_buffer
and sampling_rate.
Conventions
In general, librosa’s functions tend to expose all relevant
parameters to the caller. While this provides a great deal
of flexibility to expert users, it can be overwhelming to
novice users who simply need a consistent interface to process
audio files. To satisfy both needs, we define a set of general
conventions and standardized default parameter values shared
across many functions.
An audio signal is represented as a one-dimensional numpy
array, denoted as y throughout librosa. Typically the signal y is
accompanied by the sampling rate (denoted sr) which denotes
the frequency (in Hz) at which values of y are sampled. The
duration of a signal can then be computed by dividing the
number of samples by the sampling rate:
>>> duration_seconds = float(len(y)) / sr
By default, when loading stereo audio files, the
librosa.load() function downmixes to mono by
averaging left- and right-channels, and then resamples the
monophonic signal to the default rate sr=22050 Hz.
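These defaults can be overridden through keyword arguments. As a minimal
sketch (the file path is a placeholder, and the keyword names follow the
0.4-series load interface), the signal can instead be loaded at its native
rate with both channels preserved:
>>> import librosa
>>> # Preserve the native sampling rate and stereo channels:
>>> # sr=None disables resampling, mono=False disables the downmix.
>>> y_stereo, sr_native = librosa.load('some_file.ogg',
...                                    sr=None,
...                                    mono=False)
>>> # Resample to 44100 Hz while downmixing to mono (the default behavior)
>>> y_44k, sr_44k = librosa.load('some_file.ogg', sr=44100)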
Most audio analysis methods operate not at the native
sampling rate of the signal, but over small frames of the
signal which are spaced by a hop length (in samples). The
default frame and hop lengths are set to 2048 and 512 samples,
respectively. At the default sampling rate of 22050 Hz, this
corresponds to overlapping frames of approximately 93 ms
spaced by 23 ms. Frames are centered by default, so frame
index t corresponds to the slice:
y[(t * hop_length - frame_length / 2):
  (t * hop_length + frame_length / 2)],
where boundary conditions are handled by reflection-padding
the input signal y. Unless otherwise specified, all sliding-
window analyses use Hann windows by default. For analyses
that do not use fixed-width frames (such as the constant-
Q transform), the default hop length of 512 is retained to
facilitate alignment of results.
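These relationships can be verified with a little arithmetic; the
frames_to_time helper used below is one of the conversion functions
described later in the core submodule.
>>> sr, frame_length, hop_length = 22050, 2048, 512
>>> frame_duration = frame_length / float(sr)   # about 0.093 seconds
>>> hop_duration = hop_length / float(sr)       # about 0.023 seconds
>>> # Timestamps (in seconds) of the first three frames under these defaults
>>> librosa.frames_to_time([0, 1, 2], sr=sr, hop_length=hop_length)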
The majority of feature analyses implemented by
librosa produce two-dimensional outputs stored as
numpy.ndarray, e.g., S[f, t] might contain the
energy within a particular frequency band f at frame index
t. We follow the convention that the final dimension provides
the index over time, e.g., S[:, 0] and S[:, 1] access
features at the first and second frames. Feature arrays are
organized column-major (Fortran style) in memory, so that
common access patterns benefit from cache locality.
By default, all pitch-based analyses are assumed to be
relative to a 12-bin equal-tempered chromatic scale with a
reference tuning of A440 = 440.0 Hz. Pitch and pitch-
class analyses are arranged such that the 0th bin corresponds
to C for pitch class or C1 (32.7 Hz) for absolute pitch
measurements.
5. https://travis-ci.org
6. https://coveralls.io
Package organization
In this section, we give a brief overview of the structure
of the librosa software package. This overview is intended
to be superficial and cover only the most commonly used
functionality. A complete API reference can be found at
https://bmcfee.github.io/librosa.
Core functionality
The librosa.core submodule includes a range of com-
monly used functions. Broadly, core functionality falls into
four categories: audio and time-series operations, spectro-
gram calculation, time and frequency conversion, and pitch
operations. For convenience, all functions within the core
submodule are aliased at the top level of the package hierarchy,
e.g., librosa.core.load is aliased to librosa.load.
Audio and time-series operations include functions
such as: reading audio from disk via the audioread
package7 (core.load), resampling a signal at a desired
rate (core.resample), stereo to mono conversion
(core.to_mono), time-domain bounded auto-correlation
(core.autocorrelate), and zero-crossing detection
(core.zero_crossings).
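For illustration, the following sketch applies a few of these operations to a
previously loaded signal y with sampling rate sr; the target rate and the
one-second analysis window are arbitrary choices.
>>> # Resample from the native rate sr down to 8000 Hz
>>> y_8k = librosa.resample(y, sr, 8000)
>>> # Bounded auto-correlation over the first second of audio
>>> ac = librosa.autocorrelate(y[:sr], max_size=sr // 2)
>>> # Boolean array marking sign changes between consecutive samples
>>> zc = librosa.zero_crossings(y)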
Spectrogram operations include the short-time Fourier trans-
form (stft), inverse STFT (istft), and instantaneous
frequency spectrogram (ifgram) [Abe95], which provide
much of the core functionality for down-stream feature anal-
ysis. Additionally, an efficient constant-Q transform (cqt)
implementation based upon the recursive down-sampling
method of Schoerkhuber and Klapuri [Schoerkhuber10] is
provided, which produces logarithmically-spaced frequency
representations suitable for pitch-based signal analysis. Fi-
nally, logamplitude provides a flexible and robust imple-
mentation of log-amplitude scaling, which can be used to avoid
numerical underflow and set an adaptive noise floor when
converting from linear amplitude.
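As a brief example (assuming numpy is imported as np, and y and sr are
defined as above), the sketch below computes an STFT, inverts it, computes a
constant-Q transform, and log-scales the power spectrogram relative to its
peak; the ref_power keyword reflects the 0.4-era logamplitude interface.
>>> D = librosa.stft(y)               # complex STFT matrix
>>> y_hat = librosa.istft(D)          # time-domain reconstruction
>>> C = librosa.cqt(y, sr=sr)         # constant-Q transform
>>> # Log-scale the power spectrogram, using its peak as reference
>>> log_power = librosa.logamplitude(np.abs(D)**2, ref_power=np.max)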
Because data may be represented in a variety of time or fre-
quency units, we provide a comprehensive set of convenience
functions to map between different time representations: sec-
onds, frames, or samples; and frequency representations: hertz,
constant-Q basis index, Fourier basis index, Mel basis index,
MIDI note number, or note in scientific pitch notation.
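A few of these helpers are illustrated below; the set of conversion functions
is considerably larger than this sketch suggests.
>>> # Frame indices to timestamps (seconds) under the default hop length
>>> librosa.frames_to_time([0, 43, 86], sr=22050, hop_length=512)
>>> # Timestamps back to frame indices
>>> librosa.time_to_frames([0.0, 1.0, 2.0], sr=22050, hop_length=512)
>>> # Frequency (Hz) to MIDI note number: 440 Hz maps to MIDI note 69
>>> librosa.hz_to_midi(440.0)
>>> # Note name to frequency, following the C1 = 32.7 Hz convention above
>>> librosa.note_to_hz('C1')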
Finally, the core submodule provides functionality to esti-
mate the dominant frequency of STFT bins via parabolic in-
terpolation (piptrack) [Smith11], and estimation of tuning
deviation (in cents) from the reference A440. These functions
allow pitch-based analyses (e.g., cqt) to dynamically adapt
filter banks to match the global tuning offset of a particular
audio signal.
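For example, a global tuning estimate can be computed and passed along to the
constant-Q transform. This sketch assumes the top-level aliases piptrack and
estimate_tuning, and the tuning keyword of cqt, as provided in the 0.4 series:
>>> # Per-bin pitch and magnitude estimates derived from the STFT
>>> pitches, magnitudes = librosa.piptrack(y=y, sr=sr)
>>> # Estimate the global tuning deviation from A440
>>> tuning = librosa.estimate_tuning(y=y, sr=sr)
>>> # Adapt the constant-Q filter bank to the estimated tuning
>>> C = librosa.cqt(y, sr=sr, tuning=tuning)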
Spectral features
Spectral representations—the distributions of energy over a
set of frequencies—form the basis of many analysis tech-
niques in MIR and digital signal processing in general. The
librosa.feature module implements a variety of spectral
representations, most of which are based upon the short-time
Fourier transform.
7. https://github.com/sampsyo/audioread
Fig. 1: First: the short-time Fourier transform of a
20-second audio clip (librosa.stft). Second: the
corresponding Mel spectrogram, using 128 Mel bands
(librosa.feature.melspectrogram). Third: the
corresponding chromagram (librosa.feature.chroma_cqt).
Fourth: the Tonnetz features (librosa.feature.tonnetz).
The Mel frequency scale is commonly used to represent
audio signals, as it provides a rough model of human fre-
quency perception [Stevens37]. Both a Mel-scale spectro-
gram (librosa.feature.melspectrogram) and the
commonly used Mel-frequency Cepstral Coefficients (MFCC)
(librosa.feature.mfcc) are provided. By default, Mel
scales are defined to match the implementation provided by
Slaney’s auditory toolbox [Slaney98], but they can be made
to match the Hidden Markov Model Toolkit (HTK) by setting
the flag htk=True [Young97].
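For instance, a 13-coefficient MFCC matrix and an HTK-style Mel spectrogram
might be computed as follows; the n_mfcc keyword and the pass-through of
htk=True to the Mel filter constructor are assumptions based on the
feature-module interface and should be checked against the installed version.
>>> mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
>>> # Mel spectrogram built with HTK-style filters instead of Slaney's
>>> S_htk = librosa.feature.melspectrogram(y=y, sr=sr, htk=True)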
While Mel-scaled representations are commonly used to
capture timbral aspects of music, they provide poor reso-
lution of pitches and pitch classes. Pitch class (or chroma)
representations are often used to encode harmony while
suppressing variations in octave height, loudness, or timbre.
Two flexible chroma implementations are provided: one uses
a fixed-window STFT analysis (chroma_stft)8and the
other uses variable-window constant-Q transform analysis
(chroma_cqt). An alternative representation of pitch and
harmony can be obtained by the tonnetz function, which es-
timates tonal centroids as coordinates in a six-dimensional in-
terval space using the method of Harte et al. [Harte06]. Figure
1 illustrates the difference between STFT, Mel spectrogram,
chromagram, and Tonnetz representations, as constructed by
the following code fragment:9
>>> filename = librosa.util.example_audio_file()
>>> y, sr = librosa.load(filename,
...                      offset=25.0,
...                      duration=20.0)
>>> spectrogram = np.abs(librosa.stft(y))
>>> melspec = librosa.feature.melspectrogram(y=y,
...                                          sr=sr)
>>> chroma = librosa.feature.chroma_cqt(y=y,
...                                     sr=sr)
>>> tonnetz = librosa.feature.tonnetz(y=y, sr=sr)
In addition to Mel and chroma features, the feature
submodule provides a number of spectral statistic
representations, including spectral_centroid,
spectral_bandwidth, spectral_rolloff
[Klapuri07], and spectral_contrast [Jiang02].10
Finally, the feature submodule provides a few func-
tions to implement common transformations of time-series
features in MIR. This includes delta, which provides a
smoothed estimate of the time derivative; stack_memory,
which concatenates an input feature array with time-lagged
copies of itself (effectively simulating feature n-grams); and
sync, which applies a user-supplied aggregation function
(e.g., numpy.mean or median) across specified column
intervals.
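Continuing the MFCC example, these transformations can be chained as in the
sketch below; beat_frames stands in for any frame-indexed segmentation, such
as the beat tracker output described in a later section, and np denotes numpy.
>>> mfcc = librosa.feature.mfcc(y=y, sr=sr)
>>> # Smoothed estimate of the first-order time derivative
>>> mfcc_delta = librosa.feature.delta(mfcc)
>>> # Stack each frame with a time-lagged copy of its predecessor
>>> mfcc_stack = librosa.feature.stack_memory(mfcc, n_steps=2)
>>> # Median-aggregate the MFCCs between consecutive beat frames
>>> mfcc_sync = librosa.feature.sync(mfcc, beat_frames,
...                                  aggregate=np.median)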
Display
The display module provides simple interfaces to visu-
ally render audio data through matplotlib [Hunter07].
The first function, display.waveplot simply renders the
amplitude envelope of an audio signal y using matplotlib’s
fill_between function. For efficiency purposes, the signal
is dynamically down-sampled. Mono signals are rendered
symmetrically about the horizontal axis; stereo signals are
rendered with the left-channel’s amplitude above the axis
and the right-channel’s below. An example of waveplot is
depicted in Figure 2 (top).
The second function, display.specshow wraps mat-
plotlib’s imshow function with default settings (origin and
aspect) adapted to the expected defaults for visualizing
spectrograms. Additionally, specshow dynamically selects
appropriate colormaps (binary, sequential, or diverging) from
the data type and range.11 Finally, specshow provides a
variety of acoustically relevant axis labeling and scaling pa-
rameters. Examples of specshow output are displayed in
Figures 1 and 2 (middle).
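A typical usage pattern is sketched below, plotting the waveform and a
log-power Mel spectrogram of y; the x_axis and y_axis strings select the axis
labeling described above, and matplotlib is assumed to be available.
>>> import matplotlib.pyplot as plt
>>> plt.figure()
>>> librosa.display.waveplot(y, sr=sr)
>>> lms = librosa.logamplitude(
...         librosa.feature.melspectrogram(y=y, sr=sr))
>>> plt.figure()
>>> librosa.display.specshow(lms, sr=sr,
...                          x_axis='time', y_axis='mel')
>>> plt.show()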
Onsets, tempo, and beats
While the spectral feature representations described above
capture frequency information, time information is equally
important for many applications in MIR. For instance, it can
be beneficial to analyze signals indexed by note or beat events,
rather than absolute time. The onset and beat submodules
implement functions to estimate various aspects of timing in
music.
8. chroma_stft is based upon the reference implementation provided
at http://labrosa.ee.columbia.edu/matlab/chroma-ansyn/
9. For display purposes, spectrograms are scaled by
librosa.logamplitude. We refer readers to the accompanying
IPython notebook for the full source code to reconstruct figures.
10. spectral_* functions are derived from MATLAB reference imple-
mentations provided by the METLab at Drexel University. http://music.ece.
drexel.edu/
11. If the seaborn package [Waskom14] is available, its version of
cubehelix is used for sequential data.
More specifically, the onset module provides two
functions: onset_strength and onset_detect. The
onset_strength function calculates a thresholded spec-
tral flux operation over a spectrogram, and returns a one-
dimensional array representing the amount of increasing spec-
tral energy at each frame. This is illustrated as the blue curve in
the bottom panel of Figure 2. The onset_detect function,
on the other hand, selects peak positions from the onset
strength curve following the heuristic described by Boeck et
al. [Boeck12]. The output of onset_detect is depicted as
red circles in the bottom panel of Figure 2.
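In code, the onset envelope and the detected onset events can be obtained as
follows, and converted to timestamps with the helpers described earlier:
>>> oenv = librosa.onset.onset_strength(y=y, sr=sr)
>>> onset_frames = librosa.onset.onset_detect(onset_envelope=oenv,
...                                           sr=sr)
>>> onset_times = librosa.frames_to_time(onset_frames, sr=sr)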
The beat module provides functions to estimate the global
tempo and positions of beat events from the onset strength
function, using the method of Ellis [Ellis07]. More specifically,
the beat tracker first estimates the tempo, which is then used
to set the target spacing between peaks in an onset strength
function. The output of the beat tracker is displayed as the
dashed green lines in Figure 2 (bottom).
Tying this all together, the tempo and beat positions for an
input signal can be easily calculated by the following code
fragment:
>>> y, sr = librosa.load(FILENAME)
>>> tempo, frames = librosa.beat.beat_track(y=y,
...                                         sr=sr)
>>> beat_times = librosa.frames_to_time(frames,
...                                     sr=sr)
Any of the default parameters and analyses may be overridden.
For example, if the user has calculated an onset strength
envelope by some other means, it can be provided to the beat
tracker as follows:
>>> oenv = some_other_onset_function(y, sr)
>>> librosa.beat.beat_track(onset_envelope=oenv)
All detection functions (beat and onset) return events as frame
indices, rather than absolute timing. The downside of this is
that it is left to the user to convert frame indices back to
absolute time. However, in our opinion, this is outweighed
by two practical benefits: it simplifies the implementations,
and it makes the results directly accessible to frame-indexed
functions such as librosa.feature.sync.
Structural analysis
Onsets and beats provide relatively low-level timing cues
for music signal processing. Higher-level analyses attempt to
detect larger structure in music, e.g., at the level of bars or
functional components such as verse and chorus. While this
is an active area of research that has seen rapid progress in
recent years, there are some useful features common to many
approaches. The segment submodule contains a few useful
functions to facilitate structural analysis in music, falling
broadly into two categories.
First, there are functions to calculate and
manipulate recurrence or self-similarity plots. The
segment.recurrence_matrix constructs a binary
k-nearest-neighbor similarity matrix from a given feature
array and a user-specified distance function. As displayed
in Figure 3 (left), repeating sequences often appear as
diagonal bands in the recurrence plot, which can be used
to detect musical structure. It is sometimes more convenient
to operate in time-lag coordinates, rather than time-time,
which transforms diagonal structures into more easily
detectable horizontal structures (Figure 3, right) [Serra12].
This is facilitated by the recurrence_to_lag (and
lag_to_recurrence) functions.
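As a minimal sketch using the chroma features computed earlier, a recurrence
matrix and its time-lag counterpart can be constructed as follows; the width
and sym arguments are optional settings of recurrence_matrix and are shown
here only for illustration.
>>> R = librosa.segment.recurrence_matrix(chroma,
...                                       width=3,
...                                       sym=True)
>>> # Convert time-time recurrence into time-lag coordinates
>>> L = librosa.segment.recurrence_to_lag(R)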
Second, temporally constrained clustering can be
used to detect feature change-points without relying
upon repetition. This is implemented in librosa by
the segment.agglomerative function, which
uses scikit-learn’s implementation of Ward’s
agglomerative clustering method [Ward63] to partition
the input into a user-defined number of contiguous
components. In practice, a user can override the
default clustering parameters by providing an existing
sklearn.cluster.AgglomerativeClustering
object as an argument to segment.agglomerative().
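For example, the chroma features can be partitioned into a fixed number of
contiguous segments, and the resulting boundary frames converted to
timestamps; the choice of 10 segments here is arbitrary.
>>> # Partition the chroma features into 10 contiguous segments
>>> boundaries = librosa.segment.agglomerative(chroma, 10)
>>> boundary_times = librosa.frames_to_time(boundaries, sr=sr)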
Decompositions
Many applications in MIR operate upon latent factor represen-
tations, or other decompositions of spectrograms. For exam-
ple, it is common to apply non-negative matrix factorization
(NMF) [Lee99] to magnitude spectra, and analyze the statistics
of the resulting time-varying activation functions, rather than
the raw observations.
The decompose module provides a simple interface to fac-
tor spectrograms (or general feature arrays) into components
and activations:
>>> comps, acts = librosa.decompose.decompose(S)
By default, the decompose() function constructs
a scikit-learn NMF object, and applies its
fit_transform() method to the transpose of S. The
resulting basis components and activations are accordingly
transposed, so that comps.dot(acts) approximates S. If
the user wishes to apply some other decomposition technique,
any object fitting the sklearn.decomposition interface
may be substituted:
>>> T = SomeDecomposer()
>>> librosa.decompose.decompose(S, transformer=T)
In addition to general-purpose matrix decomposition tech-
niques, librosa also implements the harmonic-percussive
source separation (HPSS) method of Fitzgerald [Fitzgerald10]
as decompose.hpss. This technique is commonly used in
MIR to suppress transients when analyzing pitch content, or
suppress stationary signals when detecting onsets or other
rhythmic elements. An example application of HPSS is illus-
trated in Figure 4.
Effects
The effects module provides convenience functions for
applying spectrogram-based transformations to time-domain
signals. For instance, rather than writing
>>> D = librosa.stft(y)
>>> Dh, Dp = librosa.decompose.hpss(D)
>>> y_harmonic = librosa.istft(Dh)
one may simply write
Fig. 2: Top: a waveform plot for a 20-second audio clip y, generated by librosa.display.waveplot. Middle: the log-power short-time
Fourier transform (STFT) spectrum for y plotted on a logarithmic frequency scale, generated by librosa.display.specshow. Bottom:
the onset strength function (librosa.onset.onset_strength), detected onset events (librosa.onset.onset_detect), and
detected beat events (librosa.beat.beat_track) for y.
Fig. 3: Left: the recurrence plot derived from the chroma features displayed in Figure 1. Right: the corresponding time-lag plot.
>>> y_harmonic = librosa.effects.harmonic(y)
Convenience functions are provided for HPSS (retaining the
harmonic, percussive, or both components), time-stretching
and pitch-shifting. Although these functions provide no ad-
ditional functionality, their inclusion results in simpler, more
readable application code.
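A brief sketch of the remaining effects follows; the stretch rate and pitch
step are arbitrary values chosen for illustration.
>>> # Percussive component of the signal
>>> y_percussive = librosa.effects.percussive(y)
>>> # Speed up by 50% without altering the pitch
>>> y_fast = librosa.effects.time_stretch(y, 1.5)
>>> # Shift the pitch up by four semitones
>>> y_shifted = librosa.effects.pitch_shift(y, sr, n_steps=4)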
Output
The output module includes utility functions to save the
results of audio analysis to disk. Most often, this takes the form
of annotated instantaneous event timings or time intervals,
which are saved in plain text (comma- or tab-separated values)
via output.times_csv and output.annotation, re-
spectively. These functions are somewhat redundant with al-
ternative functions for text output (e.g., numpy.savetxt),
but provide sanity checks for length agreement and semantic
validation of time intervals. The resulting outputs are designed
to work with other common MIR tools, such as mir_eval
[Raffel14] and sonic-visualiser [Cannam10].
The output module also provides the write_wav
function for saving audio in .wav format. write_wav simply
wraps the built-in scipy wav-file writer
(scipy.io.wavfile.write) with validation and
optional normalization, thus ensuring that the resulting audio
files are well-formed.
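For instance, the beat times and harmonic signal computed in earlier examples
can be written to disk as follows; the file names are placeholders.
>>> # Save beat timestamps (in seconds) as comma-separated values
>>> librosa.output.times_csv('beat_times.csv', beat_times)
>>> # Save the harmonic component as a .wav file
>>> librosa.output.write_wav('harmonic.wav', y_harmonic, sr)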
Caching
MIR applications typically require computing a variety of fea-
tures (e.g., MFCCs, chroma, beat timings, etc) from each audio
signal in a collection. Assuming the application programmer
is content with default parameters, the simplest way to achieve
this is to call each function using audio time-series input, e.g.:
Fig. 4: Top: the separated harmonic and percussive waveforms.
Middle: the Mel spectrogram of the harmonic component. Bottom:
the Mel spectrogram of the percussive component.
>>> mfcc = librosa.feature.mfcc(y=y, sr=sr)
>>> tempo, beats = librosa.beat.beat_track(y=y,
...                                        sr=sr)
However, because there are shared computations between the
different functions—mfcc and beat_track both compute
log-scaled Mel spectrograms, for example—this results in
redundant (and inefficient) computation. A more efficient
implementation of the above example would factor out the
redundant features:
>>> lms = librosa.logamplitude(
...         librosa.feature.melspectrogram(y=y,
...                                        sr=sr))
>>> mfcc = librosa.feature.mfcc(S=lms)
>>> tempo, beats = librosa.beat.beat_track(S=lms,
...                                        sr=sr)
Although it is more computationally efficient, the above ex-
ample is less concise, and it requires more knowledge of
the implementations on the part of the application programmer.
More generally, nearly all functions in librosa eventually de-
pend upon STFT calculation, but it is rare that the application
programmer will need the STFT matrix as an end-result.
One approach to eliminate redundant computation is to
decompose the various functions into blocks which can be
arranged in a computation graph, as is done in Essentia
[Bogdanov13]. However, this approach necessarily constrains
the function interfaces, and may become unwieldy for com-
mon, simple applications.
Instead, librosa takes a lazy approach to eliminating redun-
dancy via output caching. Caching is implemented through an
extension of the Memory class from the joblib package12,
which provides disk-backed memoization of function outputs.
The cache object (librosa.cache) operates as a decorator
on all non-trivial computations. This way, a user can write
simple application code (i.e., the first example above) while
transparently eliminating redundancies and achieving speed
comparable to the more advanced implementation (the second
example).
Parameter   Description                                    Values
fmax        Maximum frequency value (Hz)                   8000, 11025
n_mels      Number of Mel bands                            32, 64, 128
aggregate   Spectral flux aggregation function             np.mean, np.median
ac_size     Maximum lag for onset autocorrelation (s)      2, 4, 8
std_bpm     Deviation of tempo estimates from 120.0 BPM    0.5, 1.0, 2.0
tightness   Penalty for deviation from estimated tempo     50, 100, 400

TABLE 1: The parameter grid for beat tracking optimization. The
best configuration is indicated in bold.
The cache object is disabled by default, but can be activated
by setting the environment variable LIBROSA_CACHE_DIR
prior to importing the package. Because the Memory object
does not implement a cache eviction policy (as of version
0.8.4), it is recommended that users purge the cache after
processing each audio file to prevent the cache from filling
all available disk space.13 We note that this can potentially
introduce race conditions in multi-processing environments
(i.e., parallel batch processing of a corpus), so care must be
taken when scheduling cache purges.
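In practice, enabling and purging the cache amounts to the following sketch;
the cache directory path is only an example, and the clear() call is the
purge mechanism noted in footnote 13.
>>> import os
>>> # Must be set before librosa is first imported
>>> os.environ['LIBROSA_CACHE_DIR'] = '/tmp/librosa_cache'
>>> import librosa
>>> # ... process one audio file ...
>>> # Purge cached outputs before moving on to the next file
>>> librosa.cache.clear()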
Parameter tuning
Some of librosa’s functions have parameters that require some
degree of tuning to optimize performance. In particular, the
performance of the beat tracker and onset detection functions
can vary substantially with small changes in certain key
parameters.
After standardizing certain default parameters—sampling
rate, frame length, and hop length—across all functions, we
optimized the beat tracker settings using the parameter grid
given in Table 1. To select the best-performing configura-
tion, we evaluated the performance on a data set comprised
of the Isophonics Beatles corpus14 and the SMC Dataset2
[Holzapfel12] beat annotations. Each configuration was eval-
uated using mir_eval [Raffel14], and the configuration was
chosen to maximize the Correct Metric Level (Total) metric
[Davies14].
Similarly, the onset detection parameters (listed in Table 2)
were selected to optimize the F1-score on the Johannes Kepler
University onset database.15
We note that the "optimal" default parameter settings are
merely estimates, and depend upon the datasets over which
they are selected. The parameter settings are therefore subject
to change in the future as larger reference collections become
available. The optimization framework has been factored out
into a separate repository, which may in subsequent versions
grow to include additional parameters.16
12. https://github.com/joblib/joblib
13. The cache can be purged by calling librosa.cache.clear().
14. http://isophonics.net/content/reference-annotations
15. https://github.com/CPJKU/onset_db
16. https://github.com/bmcfee/librosa_parameters
Parameter   Description                           Values
fmax        Maximum frequency value (Hz)          8000, 11025
n_mels      Number of Mel bands                   32, 64, 128
aggregate   Spectral flux aggregation function    np.mean, np.median
delta       Peak picking threshold                0.0--0.10 (0.07)

TABLE 2: The parameter grid for onset detection optimization. The
best configuration is indicated in bold.
Conclusion
This document provides a brief summary of the design consid-
erations and functionality of librosa. More detailed examples,
notebooks, and documentation can be found in our devel-
opment repository and project website. The project is under
active development, and our roadmap for future work includes
efficiency improvements and enhanced functionality of audio
coding and file system interactions.
Citing librosa
We request that when using librosa in academic work, authors
cite the Zenodo reference [McFee15]. For references to the
design of the library, citation of the present document is
appropriate.
Acknowledgements
BM acknowledges support from the Moore-Sloan Data Sci-
ence Environment at NYU. Additional support was provided
by NSF grant IIS-1117015.
REFERENCES
[Pedregosa11] Pedregosa, Fabian, Gaël Varoquaux, Alexandre Gramfort,
Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu
Blondel et al. Scikit-learn: Machine learning in Python.
The Journal of Machine Learning Research 12 (2011):
2825-2830.
[Bergstra11] Bergstra, James, Frédéric Bastien, Olivier Breuleux, Pascal
Lamblin, Razvan Pascanu, Olivier Delalleau, Guillaume
Desjardins et al. Theano: Deep learning on GPUs with
python. In NIPS 2011, BigLearning Workshop, Granada,
Spain. 2011.
[Jones01] Jones, Eric, Travis Oliphant, and Pearu Peterson. SciPy:
Open source scientific tools for Python. http://www.scipy.
org/ (2001).
[VanDerWalt11] Van Der Walt, Stefan, S. Chris Colbert, and Gael Varo-
quaux. The NumPy array: a structure for efficient numeri-
cal computation. Computing in Science & Engineering 13,
no. 2 (2011): 22-30.
[Abe95] Abe, Toshihiko, Takao Kobayashi, and Satoshi Imai. Har-
monics tracking and pitch extraction based on instanta-
neous frequency. International Conference on Acoustics,
Speech, and Signal Processing, ICASSP-95., Vol. 1. IEEE,
1995.
[Schoerkhuber10] Schoerkhuber, Christian, and Anssi Klapuri. Constant-Q
transform toolbox for music processing. 7th Sound and
Music Computing Conference, Barcelona, Spain. 2010.
[Smith11] Smith, J.O. "Sinusoidal Peak Interpolation", in Spectral
Audio Signal Processing, https://ccrma.stanford.edu/~jos/
sasp/Sinusoidal_Peak_Interpolation.html , online book,
2011 edition, accessed 2015-06-15.
[Stevens37] Stevens, Stanley Smith, John Volkmann, and Edwin B.
Newman. A scale for the measurement of the psychological
magnitude pitch. The Journal of the Acoustical Society of
America 8, no. 3 (1937): 185-190.
[Slaney98] Slaney, Malcolm. Auditory toolbox. Interval Research Cor-
poration, Tech. Rep 10 (1998): 1998.
[Young97] Young, Steve, Evermann, Gunnar, Gales, Mark, Hain,
Thomas, Kershaw, Dan, Liu, Xunying (Andrew), Moore,
Gareth, Odell, Julian, Ollason, Dave, Povey, Dan, Valtchev,
Valtcho, and Woodland, Phil. The HTK book. Vol. 2. Cam-
bridge: Entropic Cambridge Research Laboratory, 1997.
[Harte06] Harte, C., Sandler, M., & Gasser, M. (2006). Detecting
Harmonic Change in Musical Audio. In Proceedings of
the 1st ACM Workshop on Audio and Music Computing
Multimedia (pp. 21-26). Santa Barbara, CA, USA: ACM
Press. doi:10.1145/1178723.1178727.
[Jiang02] Jiang, Dan-Ning, Lie Lu, Hong-Jiang Zhang, Jian-Hua
Tao, and Lian-Hong Cai. Music type classification by
spectral contrast feature. In ICME’02. vol. 1, pp. 113-116.
IEEE, 2002.
[Klapuri07] Klapuri, Anssi, and Manuel Davy, eds. Signal processing
methods for music transcription. Springer Science & Busi-
ness Media, 2007.
[Hunter07] Hunter, John D. Matplotlib: A 2D graphics environment.
Computing in science and engineering 9, no. 3 (2007):
90-95.
[Waskom14] Michael Waskom, Olga Botvinnik, Paul Hobson, John B.
Cole, Yaroslav Halchenko, Stephan Hoyer, Alistair Miles,
et al. Seaborn: v0.5.0 (November 2014). ZENODO, 2014.
doi:10.5281/zenodo.12710.
[Boeck12] Böck, Sebastian, Florian Krebs, and Markus Schedl. Eval-
uating the Online Capabilities of Onset Detection Meth-
ods. In 11th International Society for Music Information
Retrieval Conference (ISMIR 2012), pp. 49-54. 2012.
[Ellis07] Ellis, Daniel P.W. Beat tracking by dynamic programming.
Journal of New Music Research 36, no. 1 (2007): 51-60.
[Serra12] Serra, Joan, Meinard Müller, Peter Grosche, and Josep
Lluis Arcos. Unsupervised detection of music boundaries
by time series structure features. In Twenty-Sixth AAAI
Conference on Artificial Intelligence. 2012.
[Ward63] Ward Jr, Joe H. Hierarchical grouping to optimize an
objective function. Journal of the American statistical
association 58, no. 301 (1963): 236-244.
[Lee99] Lee, Daniel D., and H. Sebastian Seung. Learning the parts
of objects by non-negative matrix factorization. Nature
401, no. 6755 (1999): 788-791.
[Fitzgerald10] Fitzgerald, Derry. Harmonic/percussive separation using
median filtering. 13th International Conference on Digital
Audio Effects (DAFX10), Graz, Austria, 2010.
[Cannam10] Cannam, Chris, Christian Landone, and Mark Sandler.
Sonic visualiser: An open source application for viewing,
analysing, and annotating music audio files. In Proceed-
ings of the international conference on Multimedia, pp.
1467-1468. ACM, 2010.
[Holzapfel12] Holzapfel, Andre, Matthew E.P. Davies, José R. Zapata,
João Lobato Oliveira, and Fabien Gouyon. Selective sam-
pling for beat tracking evaluation. Audio, Speech, and
Language Processing, IEEE Transactions on 20, no. 9
(2012): 2539-2548.
[Davies14] Davies, Matthew E.P., and Boeck, Sebastian. Evaluating
the evaluation measures for beat tracking. In 15th Interna-
tional Society for Music Information Retrieval Conference
(ISMIR 2014), 2014.
[Raffel14] Raffel, Colin, Brian McFee, Eric J. Humphrey, Justin
Salamon, Oriol Nieto, Dawen Liang, and Daniel PW Ellis.
mir eval: A transparent implementation of common MIR
metrics. In 15th International Society for Music Infor-
mation Retrieval Conference (ISMIR 2014), pp. 367-372.
2014.
[Bogdanov13] Bogdanov, Dmitry, Nicolas Wack, Emilia Gómez, Sankalp
Gulati, Perfecto Herrera, Oscar Mayor, Gerard Roma,
Justin Salamon, José R. Zapata, and Xavier Serra. Essen-
tia: An Audio Analysis Library for Music Information Re-
trieval. In 12th International Society for Music Information
Retrieval Conference (ISMIR 2013), pp. 493-498. 2013.
[McFee15] Brian McFee, Matt McVicar, Colin Raffel, Dawen Liang,
Oriol Nieto, Josh Moore, Dan Ellis, et al. Librosa: v0.4.0.
Zenodo, 2015. doi:10.5281/zenodo.18369.