Percussive audio mixing with Wave-U-Nets
Bradley Aldous
School of Electronic Engineering and Computer Science
Queen Mary University of London, UK
Email: b.aldous@se14.qmul.ac.uk
Abstract—A crucial step in the music production process is creating a cohesive mixture from the separately recorded components of a piece of music, aided by a variety of audio effects and techniques. Audio mixing is a far more creative and varied task than its subsequent post-production process, mastering, for which there are numerous AI-driven applications on the market. This research examines a promising deep learning method for the automatic mixing of drums in dry and wet scenarios: the Wave-U-Net. Modifications are made to the convolutional filters of the current Wave-U-Net architecture to increase the number of trainable parameters of the model, with the idea that this will aid the model in better capturing patterns in the audio. Both the Wave-U-Net and the modified Wave-U-Net are trained and tested, with the output mixes evaluated through subjective listening tests and an objective analysis of the waveforms and spectrograms. It is found that the modified Wave-U-Net architecture outperforms the current standard Wave-U-Net in the automatic dry mixing of drums, rivalling the performance of human engineers. However, its performance in wet mixing is hindered by an inability to recreate certain wet audio effects. Potential future directions are also established.
1 INTRODUCTION
The availability of music production software, like DAWs (Digital
Audio Workstations), has seen music production become fairly
commonplace as a hobby amongst younger generations. The skills
needed to play or record instruments vary from those needed in
the post-production stages of obtaining a finished track: mixing
and mastering. These latter stages require a more tuned ear for
subtleties in the audio, thus many aspiring producers struggle to
create professional sounding mixes and masters due to a lack of
experience.
In music production, once the individual tracks for a song are
recorded, the producer or a separate mixing engineer creates an
audio mixture that is ready for the mastering process. The process
of audio mixing refers to the balancing of multitrack audio to pro-
duce a single cohesive audio mixture. A suite of audio effects and
manipulation techniques are employed in this process, typically
including: gain staging or level balancing (setting the relative level
of the elements in the mix), panning (placement of the elements
within the stereo field), equalization or EQ (adjustment of the
timbre of elements within a mix), compression (application of
dynamic range compression), and reverberation (application of
artificial reverberation). Once a mix is produced, it is sent into the
mastering process, which is designed to enhance the overall sound
and create consistency across a record, ensuring the tracks are
ready for release. At a bare minimum, mastering chains typically include EQ, compression, and limiting (a type of dynamic range compression used to increase the average volume of a track).
The field of intelligent music production has seen consid-
erable developments over recent years in automating aspects of
the music production process, notably in developing commercial
applications that provide automated mastering functionalities like
LANDR [1] and BandLab [2], based on [3, 4]. However, in
comparison, there is a distinct lack of commercial applications for
automating the mixing process because of the relative complexity
of the two tasks. There is a general set of rules to be learned by human mix engineers which help in obtaining a professional-sounding mix; however, it is also crucial to treat each song individually in order to create the optimal mix. This special treatment
is defined by a vast number of creative decisions made by audio
engineers, whereas mastering chains will often contain the same
(or a similar) composition of audio effects for songs of the same
genre, meaning more of a strict formula can be followed. This
more in-depth treatment makes automating the mixing process a
far more difficult task compared to automated mastering.
The primary aim of this project is to explore a current state-
of-the-art automatic mixing method, with the intention of building
on and improving this method, as well as providing an indication
as to potential avenues for future research in this field. Overall,
the contributions of this paper can be summarised as follows:
• Modifications to the standard Wave-U-Net architecture are made with the motivation that a larger number of trainable parameters will be able to predict complexities in the audio to a better degree.
• A listening test is conducted, which shows that the modified architecture outperforms the standard architecture in both dry and wet mixing scenarios.
• An exploratory analysis of the resulting waveforms and spectrograms is performed, finding similar results to the listening test.
• There were no publicly available PyTorch implementations of a mixing variant of the Wave-U-Net at the time of writing, so implementations of both architectures have been produced for this paper.
The outline of this paper is as follows. Section 2 presents a
review of prior research in the field of audio, focusing
on the three key areas of automatic mixing research: knowledge-
based approaches, machine learning approaches and deep learning
approaches. The dataset used in this paper is detailed in depth
in Section 3, as well as details of the split made for training,
validation and testing partitions. Section 4 presents details of the
Wave-U-Net architectures and the processes used to train them.
Subjective and objective evaluations are performed in Section 5,
in the form of a MUSHRA style listening test and a spectrogram
analysis respectively. Conclusions are drawn from the results and
discussed in Section 6, with potential future areas of research
identified in Section 7.
2 RELATED WORK
Deep learning has a long history dating back to the 1940s,
originating from the development of perceptrons which mimic
biological neurons [5]; however, it has only risen to prominence
in research in the last two decades following the application
of a fast, greedy algorithm in learning deep belief nets [6, 7].
Numerous fields of research have benefited greatly through the
use of neural networks, and whilst computer vision and natural
language processing (NLP) get a lot of buzz for their obvious
impact on day to day life [8, 9, 10], there are many important uses
for deep learning with audio data. Prior to deep learning, audio
machine learning applications would rely on more traditional
digital signal processing techniques to perform feature extraction,
whereas many deep learning architectures learn these features
through observing the data that is given.
Many aspects of audio research outside of audio mixing have benefited from the application of machine learning methods: research based on the manipulation of audio signals, such as noise suppression and speech enhancement [11, 12]; vocal-related fields, such as speech generation [13] and automatic speech recognition [14]; and machine listening tasks, such as automatic music transcription [15, 16]. In comparison to some of these fields,
audio mixing research is still fairly niche, and has a much smaller
research effort geared towards it, though it has seen a considerable
rise in deep learning approaches in the last decade. The body
of work attributed to automatic mixing research can generally
be divided into two main approaches: knowledge-based systems
(KBS) and machine learning (ML) approaches, which can further
be broken down into traditional ML approaches and deep learning
(DL) approaches. KBS are built on a knowledge base collected
from professionals, like a set of rules or guidelines, from which
an algorithm is constructed and then applied to a task, whereas
ML and DL approaches consist of using statistical methods and
algorithms directly on data to create a system that can be applied
to a task. The following literature review is divided into these
three sections, which are further divided into segments focusing
on particular audio effects where they have been researched. Table
1 clearly presents these groupings in the literature for convenience
to the reader.
2.1 Knowledge-based approaches to automatic mixing
A large body of work has been geared towards approaches using KBS in automated gain balancing. In [17], a novel system is designed, based on an interactive algorithm whereby a user can rate mixes and thus guide the system to their personal taste, while making minimal or no assumptions. A significant portion of earlier
research in the field focused on the application of automated gain
balancing systems to a live environment [18, 19, 20, 21], which
is of particular difficulty due to the spatial differences between
musicians in a live performance. The presented models in these
works generally aim to consider the requirements of all listeners
in a space whilst optimising the monitor levels accordingly. There
has been some research into producing clear speech audio for
TV through background ducking [22], though this is minimal in
regard to other applications. Systems employing loudness models
in measuring the loudness of each track have been explored
in [23, 24, 25], including comparisons of simple energy-based
loudness models and more sophisticated psychoacoustic models,
whereby it was found that simple energy-based loudness models
performed better. An approach to identify source interference,
through the calculation of Source to Distortion Ratios (SDRs),
to aid in adjusting gains was developed in [26], where it was
found that this approach does not yield ideal results, but it does allow interference levels in recordings to be better identified. Gain normalization techniques for changing linear
systems suitable for both live and studio use are outlined [27, 28],
as well as an implementation whereby the gain of a source is
adjusted based on unmasking its spectral content from spectrally
related channels [29]. The mathematical relationships between the
components within a mix are explored in [30], demonstrating how
problems in a mix can be overcome through careful consideration
of optimisation parameters and strategies.
Methods for automating panning for multitrack mixes tend to
follow similar methodologies to implementations for gain level
balancing as the concepts are similar in that panning can be seen
as separate gains for right and left channels, though these can be
made more complex through identifying source placements in a
mix. The first efforts in developing an autonomous stereo panner
were framed for a live scenario [37], where spectral analysis
aided in determining the position of sources in the stereo space.
Further research involved the utilisation of spectral decomposition,
constraint rules and cross-adaptive algorithms in performing real-
time placement of sources in a mix [38], whereby there was no
statistical difference found between mixes attained through this
method and those produced by a human engineer. An expansion
upon this method can be found in [39], where the constructed
system is highly versatile and can be used in both live and post-
production environments. Furthermore, two more recent methods focus on the reduction of masking effects between source signals [40, 41]; both achieve higher ratings in subjective listening tests than those that came before them, attaining standards comparable to professional human mix engineers.
Whilst not as fundamental to obtaining an audio mix as gain level balancing, automated EQ has seen a notable research effort attributed to it. Early attempts
to automate EQ of multitrack audio involved the use of cross-
adaptive methods [42], with the design intention to reduce set-
up time for live and studio environments. Approaches towards
tonal balancing through the use of EQ in mixing and mastering
processes have been explored [43, 44], with the former employing
the Yule-Walker method, whereby an autoregressive model is fit
to the input data to match the target EQ curve. Offline and real-
time autonomous EQ systems based on masking reduction have
been explored [45], in which it was found that the systems reduce
both perceived and objective spectral masking, thus improving the
quality of a mix.
Some early work in the field of automated compression re-
search focused on noise gates, a compressor style audio processor
that attenuates signals registering below a certain threshold volume
[49]. In this work, an algorithm is explored that automatically sets the attack, release, threshold and hold of a noise gate applied to noisy kick drum recordings (containing bleed from secondary sources); while the resulting gate parameters are intuitively correct, no further evaluation was made. Subsequently, a
multiple dynamic range compression (DRC) system designed for
application in a multitrack mixing scenario was tested against an
Effect/technique | Knowledge-based | Machine learning | Deep learning
Gain | [17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30] | [31, 32, 33, 34, 35, 36] | -
Panning | [37, 38, 39, 40, 41] | - | -
EQ | [42, 43, 44, 45] | [46] | [47, 48]
Compression | [49, 50, 51, 52, 53, 54, 55] | - | [56, 57]
Reverb | - | [58, 59, 60, 61] | -
Multiple | [62, 63, 64, 65, 66, 67, 68, 69, 70] | [71] | [72, 73, 74]
TABLE 1: A review of automatic mixing literature, grouped by type of approach and effect/technique.
expert engineer where the results were found to be comparable
[50]. Further work was done to automate the choice of parameter
values on a compressor based on feature extraction of an input
signal [51], leaving just the threshold as a user specified parameter.
This method was later expanded to also take the threshold parameter into account [52], where the results can compete with or outperform those created by semi-professional human engineers. Compression
is a critical effect in a typical mastering chain as it aids in achiev-
ing a greater desired loudness for a track, and as such a statistical
approach geared towards this purpose has been explored in [53].
Following this, a novel method of compression intending to take
the individual listener’s environment into account is devised in
[54], whereby noise levels taken through the microphone of a
listener’s mobile device affect the controls of the compressor.
More recently, work has been done to construct a system that
can adaptively change compression timing parameters based on
an incoming audio signal with the aim of emphasising transients
in drum tracks [55], though as of yet there has been no attempt at
an in-depth evaluation of this method.
A large body of work has been aimed at developing compre-
hensive systems involving more than one of the above effects or
techniques. A hugely detailed exploration of almost all aspects
of automatic mixing is contained in [62], including in-depth
breakdowns of each effect used in the mixing process. An early
attempt at automating gain levels, panning and EQ based on
probabilistic graphical models can be found in [63], where the
implementation was found to perform significantly better than
other data modelling tools at the time. Further exploratory analysis
has been done on the efficacy of a model using these three effects
based on the identification of instruments present in a mix [64].
Autonomous mixing systems based on semantic mixing rules
gleaned from mixing engineering literature have been researched
[65, 66, 67], with listening tests suggesting a performance equal
to that of human engineers. Sound source separation techniques
have been used as a preprocessing step of the mixing process in a
framework designed for automatic mixing of early jazz recordings
[68], where the desired improvements were found to be attainable
through a block based framework (which includes panning and
EQ). A novel method defining a parameter space containing all possible mixes based on gain, panning and EQ is described in [69], where flaws were found in the tempo-estimation algorithms used, and as such improvements still need to be made. Most recently, a study
into the minimisation of masking in audio mixing using a new
multitrack masking metric alongside the use of subgrouping has
been conducted in [70], where it was found that both investigated
methods can be used to improve the perceived quality and clarity
of a mix.
The overwhelming majority of early automatic mixing ap-
proaches were based on KBS, though in recent years the research
effort has shifted more towards ML and DL approaches.
2.2 Machine learning approaches to automatic mixing
A significant portion of ML approaches to automatic mixing
are specifically tailored to gain balancing as it is the most
fundamental step in obtaining a professional audio mixture. The
earliest research into this presents a method whereby the Euclidean
distance between spectral histograms of a mix and target sound
is calculated and fed into an optimisation algorithm to find the
best parameters for that mix [31], where the system was found
to accurately and successfully create the mixes. Following this,
two works presenting similar methods based on the use of linear
dynamical systems can be found in [32, 33], where both systems
are found to accurately predict the desired mixing coefficients.
A novel method designed to reduce masking effects in gain
mixes through the employment of genetic programming has been
constructed [34]; however, after inspection of the generated mixes, it is concluded that significant improvement is required for this style of method. Further work using genetic algorithms
to optimise an audio mix following user input of a listener is
researched [35], where the system is found to be comparable in
usability to a traditional fader mixer. A recent method focused on
gain mixing of drums [36], where gain parameters are estimated
through the least-squares equation and used as target values in
various machine learning methods, found that a random forest
method significantly outperforms others.
There are very few attempts to strictly automate EQ on an
audio track, though very early work using inductive learning
through pattern recognition can be found in [46], outperforming
existing methods at the time.
However, the automation of reverberation in an ML context
has been thoroughly researched in the literature. Work that uses
a variety of classifiers on features extracted from audio tracks to
control the parameters of the reverberation found that a support
vector machine classifier performed best in estimating the param-
eters [58]. An evolution of this method can be found in [59],
yielding similar results. Probabilistic Soft Logic (PSL) rules have
been defined and weighted to form a model for use in extract-
ing parameters for reverberation [60], where the resulting mixes did not have the desired perceptual qualities, though no data pre-processing, which could have improved the output, was performed. More
recently, a similar method using rules extracted from literature
was constructed to allow the real-time control of reverberation
parameters based on the incoming signal [61]; although no formal evaluation was performed, the authors found some high-frequency reverberation to be unpleasant.
An approach to classify the importance of audio features in
producing a model for processing raw audio to individual stems
has been researched [71], where the selected features are used to train regression models that allow this feature set to be reduced further.
2.3 Deep learning approaches to automatic mixing
In recent years deep neural networks (DNNs) have seen a signif-
icant growth in their application to various processes in music
production, including automatic mixing, often achieving state-
of-the-art performance. Some attempts at DL approaches to au-
tomating EQ have been made. A novel end-to-end convolutional
neural network architecture has been implemented in [47], which
approximates the target EQ by processing the audio directly. The
model is tested for shelving, peaking, lowpass and highpass EQ,
whereby the model was found to perform best in the highpass
context. An additional method is designed that takes as input an untreated recording and a target, and predicts a set of parameters which can then be used to control an audio effect [48].
This method was first constructed for EQ, though it affords the
possibility of expansion to apply to other post-production audio
effects.
There have been a few attempts at automating DRC with deep
learning approaches, with one notable method focusing on mas-
tering applications [56]. This method uses DNNs to predict DRC
and spectral balance enhancement parameters, which are then
applied to unmastered audio tracks, however it was found through
subjective listening tests that this method could not perform to a
human mix standard. Further work done utilising siamese DNN
architectures, a class of neural network containing two or more
identical subnetworks, to learn feature embeddings characterising
the effects of a compressor was performed in [57]. These models
were found to outperform previous methods in predicting DRC
parameters.
A novel approach that makes use of DNNs for automatic
mixing of early jazz recordings, an expansion on a previous
method [68], has been presented in [72], for which a brief informal
listening test suggested a good performance. Motivated by the
limited number of available multitrack audio datasets, a domain-
inspired model has been constructed which produces parameters
that afford the user the ability to adjust a console of audio effects
in order to refine a mix [73]. Through a perceptual evaluation
this model was found to outperform baseline methods. Drums
are generally one of the first parts of a piece of music to be
mixed (dependent on genre) and as such much of the research
is geared towards this purpose. A recent deep learning method for
performing autonomous drum mixing is adapted in [74], whereby
a Wave-U-Net model is trained, and found to perform drum mixing
to the standard of a professional mix engineer.
3 DATASET
This work makes use of the ENST drums dataset [75], an audio-
visual database designed for use in audio signal processing re-
search. The dataset is comprised of multitrack audio for various
drum recordings, including 8 audio stems (each stem correspond-
ing to a separate microphone used in the recording process)
as well as dry and wet mixes of each track. Recordings were
taken from sessions with three separate drummers, amounting to
roughly 75 minutes of recorded audio for each drummer. The
drum kits were built with a basic kick, hi-hat and snare setup
with varying numbers of toms and cymbals. Recorded sequences
were performed with either sticks, brushes, rods or mallets. There
is an additional visual component to this database whereby the
drummers were recorded from two different angles, however the
video recordings were not of benefit to this study, thus this data
was not utilised here.
Training Validation Test Total
Hits 86 11 11 108
Phrase 111 12 12 135
Solo 7 2 2 11
Minus-One 50 7 7 64
Total 254 32 32 318
Percentage 79.87% 10.06% 10.06% 100%
TABLE 2: The distribution of the dataset following a train,
validation, test split of 8:1:1.
The two stereo audio mixes, a dry mix and a wet mix, were made
from the 8 audio stems. The dry mixing chain is composed of
gain level balancing and panning, without any further processing.
In the wet mixing chain, each instrument is additionally processed with appropriate equalization and compression, as well as slight reverberation, along with dynamic processing (limiter/maximiser).
There were four styles of drumming recorded for this dataset:
hits, solo, phrases, and minus-one.
• Hits refer to a recording whereby the drummer hits a single piece of their drum kit with a single type of stick.
• Solo refers to a freestyle recording whereby the drummer is able to use the entire kit however they like.
• Phrases are short recordings performed in a specific style or genre at a range of tempos and complexities.
• Minus-one recordings are performances whereby the drummer plays to a rhythm determined by a CD or MIDI file that the drummer is listening to.
The recordings for drummers 1 and 2 in the dataset only consist of
7 microphone recordings (the third tom microphone was not used in the recording process); thus, for these multitracks, a zero-padded
silent input track is incorporated in the third tom’s place. All tracks
in the data were used in this research, following a train, validation,
test split outlined in detail in Table 2.
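To make the handling of the missing third tom concrete, the following is a minimal sketch of how an 8-stem input could be assembled with a zero-padded silent track in the missing microphone's place; the stem naming scheme, file layout and helper function are hypothetical and not taken from the released implementation.

```python
import numpy as np
import soundfile as sf  # assumed available for reading the WAV stems

# Hypothetical stem naming scheme; the actual dataset layout differs.
STEM_NAMES = ["kick", "snare", "hihat", "tom1", "tom2", "tom3",
              "overhead_left", "overhead_right"]

def load_multitrack(track_dir, n_samples):
    """Stack the 8 microphone stems into a (8, n_samples) array, inserting a
    silent zero-padded stem wherever a microphone recording is missing
    (e.g. the third tom for drummers 1 and 2)."""
    stems = []
    for name in STEM_NAMES:
        try:
            audio, _ = sf.read(f"{track_dir}/{name}.wav", dtype="float32")
            if audio.ndim > 1:          # fold any stereo stem down to mono
                audio = audio.mean(axis=1)
        except (FileNotFoundError, RuntimeError):
            audio = np.zeros(n_samples, dtype="float32")  # silent placeholder
        # Trim or pad each stem to a common length so they can be stacked.
        audio = audio[:n_samples]
        audio = np.pad(audio, (0, n_samples - len(audio)))
        stems.append(audio)
    return np.stack(stems)              # shape: (8, n_samples)
```

The resulting array can then be converted to a tensor and segmented into the fixed-length excerpts used for training.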
4 METHODOLOGY
Presented here is an in-depth description of the two deep learning
model architectures, as well as the processes used to train them.
4.1 Wave-U-Net architecture
Recently a model that has gained much interest in the audio
research field is the Wave-U-Net model [76] (shortened to WN
here), inspired by the U-Net architecture, which was first designed
for biological image segmentation [77], and went on to find use in
Fig. 1: A block diagram showing the modified Wave-U-Net architecture for K layers, where K = 10. Dashed lines represent a crop and concatenation operation. Each convolutional layer is followed by a leaky ReLU activation layer with a negative slope of 0.2, except for the final output convolutional layer of size 1.
similar tasks like binding site prediction of protein structure [78]
and image reconstruction [79]. The U-Net class of models is based on encoder-decoder schemes, whereby an image is fed into
a series of downsampling blocks comprised of small unpadded
convolutional layers and a rectified linear unit (ReLU), and then
followed by a small max pooling layer for downsampling. The
output of these downsampling layers is then put through a series
of upsampling layers, comprised of the same convolutional and
ReLU layers as each downsampling block, with a small up-
convolutional layer in between to upsample. The U-Net model
has seen some applications to the audio field, specifically in audio
separation tasks [80, 81] like in [82], where a U-Net architecture
is applied to an audio mix in order to split it into the constituent
vocal and instrumental components. U-Net has also proven to be
effective in the denoising and enhancement of speech [11, 12], as
well as the suppression of echo and reverberation [83, 84].
The key differences between the U-Net and WN architectures are as follows. The WN model takes, as input, the raw audio signal,
rather than a magnitude spectrogram of the audio signal. Addi-
tionally, the downsampling and upsampling layers are composed
differently; for the U-Net in [82] each downsampling layer is made
up of a 5x5 2D convolutional layer, 2D batch normalisation and a
leaky ReLU activation layer, whereas in WN, each downsampling
layer consists of a 1D convolutional layer of size 15, and a leaky
ReLU activation layer. The main benefit of operating on the raw waveform is that the problems related to reconstructing the audio signal from a magnitude spectrogram, such as phase estimation, are avoided.
The WN also has a final 1D convolutional layer of size 1 after
the last concatenation. WN has proven to be efficient at many of
the audio related tasks that U-Net found use in, including audio
separation tasks [85], acoustic echo cancellation [86], and speech
enhancement [87, 88], as well as mixing tasks [74]. The results
in [76, 87] seem to indicate that the WN can achieve a similar or better performance to the U-Net; however, the limited datasets used mean that these conclusions would benefit from further evaluation.
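For concreteness, a minimal PyTorch sketch of the downsampling block structure described above is shown below; it illustrates only the size-15 1D convolution followed by a leaky ReLU, and omits the decimation, upsampling, cropping and concatenation steps of the full Wave-U-Net, so it is an illustration rather than the implementation produced for this paper.

```python
import torch
import torch.nn as nn

class DownBlock(nn.Module):
    """One Wave-U-Net style downsampling block: an unpadded 1D convolution
    of kernel size 15 followed by a leaky ReLU with negative slope 0.2."""
    def __init__(self, in_channels, out_channels, kernel_size=15):
        super().__init__()
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x):                  # x: (batch, channels, samples)
        return self.act(self.conv(x))

# Example: the first block maps the 8 input stems onto the initial filters.
block = DownBlock(in_channels=8, out_channels=24)
y = block(torch.randn(1, 8, 121843))
print(y.shape)  # torch.Size([1, 24, 121829]); the unpadded conv shortens the signal
```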
4.2 Modified Wave-U-Net architecture
An additional WN variant was also tested in this research and
consisted of the following modifications. The kernel size of the decoder convolutions in the model was decreased from 5 to 1. Furthermore, the number of initial filters in the model was in-
creased from 24 to 48. These modifications allow for an increased
number of trainable parameters, up to 17,118,882 from 6,183,450,
which in turn leads to increased training times. The changes made
in this modified Wave-U-Net (MWN) architecture are motivated
by the idea that a higher number of trainable parameters can allow
for more complex patterns to be identified in the audio. A block
diagram of this architecture can be seen in Fig. 1.
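The difference between the two architectures therefore reduces to two hyperparameters; the sketch below records them side by side (the dictionary keys and the `build_wave_u_net` factory are hypothetical, while the numeric values and parameter counts are those reported above).

```python
# Hyperparameters of the standard Wave-U-Net (WN) configuration.
WN_CONFIG = {
    "num_layers": 10,           # K downsampling/upsampling blocks
    "encoder_kernel_size": 15,  # 1D convolutions in the downsampling path
    "decoder_kernel_size": 5,   # 1D convolutions in the upsampling path
    "initial_filters": 24,      # roughly 6,183,450 trainable parameters
}

# The modified Wave-U-Net (MWN) changes only the decoder kernel size and
# the initial filter count, giving roughly 17,118,882 trainable parameters.
MWN_CONFIG = {**WN_CONFIG, "decoder_kernel_size": 1, "initial_filters": 48}

# model = build_wave_u_net(**MWN_CONFIG)  # hypothetical model factory
```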
4.3 Training procedure
Due to time and hardware limitations, there were some restrictions
on aspects of the training process. Firstly, batch sizes up to 16 were
tested, where it was found that a batch size of 4 gave the best loss
convergence, and so this was chosen for the training.
The models were trained for a total of 150 epochs, with a
learning rate scheduler put in place, such that the first 100 epochs ran with a learning rate of 10^-4, decreasing to 10^-5 for the final 50 epochs in a fine-tuning stage. The L1 loss and Adam optimiser
were used in the training of these models.
Additionally, input sizes were restricted due to CUDA memory
limits, and so input stems 121,843 samples long were used,
giving outputs of 89,093 samples in length. This is similar to
the methodology applied in [74]. Some experimentation found
that initial renditions of the models lacked high frequency content
(likely due to the limited length of the training process), so pre-
emphasis filters of varying degrees were applied to the inputs
before training to account for these shortcomings.
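A minimal sketch of this training configuration is given below, assuming a PyTorch `model` and a `train_loader` that yields (stems, target) pairs with the target already cropped to the model's output length; the pre-emphasis coefficient and helper are illustrative rather than the exact filters used.

```python
import torch
import torch.nn as nn

def pre_emphasis(x, coeff=0.95):
    """First-order pre-emphasis, y[n] = x[n] - coeff * x[n-1], applied along
    the sample axis to boost high-frequency content (coefficient illustrative)."""
    return torch.cat([x[..., :1], x[..., 1:] - coeff * x[..., :-1]], dim=-1)

def train(model, train_loader, device="cuda"):
    model.to(device)
    criterion = nn.L1Loss()                                     # L1 loss
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # Adam, lr 10^-4
    # Drop the learning rate by a factor of 10 after epoch 100 (fine-tuning).
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[100], gamma=0.1)

    for epoch in range(150):                    # 150 epochs in total
        for stems, target in train_loader:      # batches of 4 excerpts
            stems = pre_emphasis(stems.to(device))
            target = target.to(device)          # 89,093-sample target excerpt
            output = model(stems)               # 121,843 samples in -> 89,093 out
            loss = criterion(output, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
```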
5 EVALUATION
The models have been evaluated in two ways. The first is a subjective perceptual evaluation in the form of a blind listening test; the second is an objective evaluation in which the waveforms and spectrograms of the mixes produced by each method are visually compared.
5.1 Subjective evaluation
A MUSHRA style listening test was conducted to subjectively
evaluate the quality of mixes produced by the two Wave-U-Net
models, in turn providing an indication as to the effectiveness of
each model in their application to audio mixing.
5.1.1 Setup
The test was conducted online, using the end-to-end listening test
platform GoListen [89], following a MUSHRA style test format
where a hidden reference and anchor are used to provide an
indication of how the systems being tested compare to well-
known audio quality levels. A listening test of this style has been
shown to be suitable, reliable and accurate in evaluating audio
quality [90]. An image of the user interface for the listening test
can be seen in Fig. 2.
Participants were first asked to ensure that they were in a quiet
environment and were wearing a suitable pair of headphones, such
that low frequency data was audible and was taken into account
when judging mix quality. Each participant was then presented
with a volume adjustment page where they were asked to set the
volume of their headphones to a comfortable level given an audio
prompt, after which they were asked not to change this level so as
to avoid loudness bias.
Following this, each participant was presented with a series of
preliminary questions, inspired by the Gold-MSI [91], which were
designed to gauge the participant’s level of audio knowledge and
critical listening experience, as well as the degree of any hearing
impairments they may have.
Fig. 2: A screenshot of the listening test user interface on the
GoListen platform [89].
5.1.2 Participants
There were a total of 9 participants in the listening test, of which 1 reported exhibiting significant hearing impairment symptoms, and thus their response was removed from the
analysis. Additionally, one of the wet mix responses for one of
the participants was skipped over, so just the remaining nine track
responses for this participant were included in the analysis. The
test was planned to take no longer than 20 minutes, and as such
there were no issues of fatigue for the participants. There were
also no financial incentives for anyone involved.
Of the remaining 8 participants, all declared they were over 18
years of age and experienced no hearing impairment symptoms.
3 participants noted that they were experts in the field of audio
research, and so a separate analysis has also been done on this
specific group. 5 of the 8 participants reported that they spend at
least 1 hour a day critically listening to music.
5.1.3 Composition of listening test
The main body of the test was divided into two sections, dry mix
testing and wet mix testing. Each section consisted of 5 questions,
each of these questions corresponding to a different track in the
test partition of the dataset (10 tracks in total were chosen at
random, ignoring ’hit’ tracks, as little consideration is needed for
mixing these). Participants were presented with 4 different audio
mixes in each question: the WN output mix, the MWN output
mix, a reference track (the dry or wet human engineered mix
present in the dataset), and an anchor (constructed in a DAW with
randomised gain levels and audio effect parameter values). These
tracks were randomised in each question to ensure the reference
and anchor are both hidden, as well as ensuring the participant
can’t spot patterns across the questions. From these 4 stimuli,
participants were asked to score each one from 0-100 based on
the perceived quality and balance of the mix, keeping in mind to
reflect any differences between the mixes in their scores, so as to give a better comparison between the stimuli.
5.1.4 Results
Mean ranking Mean expert ranking
Anchor Dry 4.0 4.0
Reference Dry 2.0 2.2
WN Dry 2.2 1.8
MWN Dry 1.8 2.0
Anchor Wet 4.0 3.8
Reference Wet 1.2 1.0
WN Wet 2.6 2.8
MWN Wet 2.2 2.4
TABLE 3: The mean comparative rankings of each type of dry and
wet mix for all songs included in the listening test, where Wave-
U-Net is denoted WN, and the modified Wave-U-Net is denoted
MWN.
To get a better evaluation for the different mixing purposes, the
following analysis has been segmented into dry and wet mixing.
The mean of the results for each mix was calculated and used
to rank the mixes for each song from best to worst (1 to 4, for
reference, anchor, WN and MWN). The means of these rankings
were then calculated to find the average ranking, across all songs,
for each of the four types of mix included in the listening test.
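This ranking procedure amounts to ranking the four stimuli within each song by their mean listener score and then averaging those per-song ranks; the short sketch below illustrates the computation with hypothetical scores.

```python
import numpy as np
from scipy.stats import rankdata

# Hypothetical mean listener scores with shape (songs, stimuli), where the
# stimuli are ordered as [reference, anchor, WN, MWN].
mean_scores = np.array([
    [78.0, 25.0, 74.0, 81.0],
    [80.0, 30.0, 72.0, 75.0],
])

# Rank within each song so that 1 = best (highest score) and 4 = worst.
per_song_ranks = rankdata(-mean_scores, axis=1)

# Average each stimulus's rank across all songs.
print(per_song_ranks.mean(axis=0))  # [1.5, 4.0, 3.0, 1.5] for these toy scores
```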
As expected, the anchor performed worst in all cases; however, there were some unexpected results regarding the other models. In general, both WN models performed the dry mixing process to the same level as a human engineer. The average ranking across all participants for the dry test reference mixes was 2.0, whereas for the standard WN it was 2.2, and for the MWN it was 1.8. This indicates that the modified architecture outperforms both the current WN architecture and a human engineer in creating a dry mix. However, narrowing this analysis down to just include
those participants who reported being experts in the field of audio
research yields some slightly different results. For the expert subcategory, the average ranking for the reference mixes was 2.2, for the WN it was 1.8, and for the MWN it was 2.0, implying that
the standard WN outperforms the modified architecture, but both
outperform a human engineer. It should be noted here that the
expert subcategory is only comprised of three sets of responses,
and so the data here is more liable to extreme deviation due to
outlying values compared to the analysis of the whole group of
participants.
For wet mixing, an informal listening test found that both WN
models failed to sufficiently capture the reverberation effects on
the reference mix tracks, which has been previously reported [74].
Breaking down the listening test results reinforces this observation. The
average ranking of the wet reference mixes was 1.2, whereas for
the WN it was 2.6 and for the MWN it was 2.2, demonstrating that
the modified architecture outperforms the traditional architecture
in wet drum mixing. This same conclusion can be drawn from
the expert subcategory where the rankings are 1, 2.8 and 2.4
respectively. This data is shown in Table 3.
5.2 Objective evaluation
Waveforms and spectrograms have been generated for specific
tracks in the test dataset: a disco phrase by drummer 3 for the
dry mix, and a blues shuffle phrase by drummer 3 for the wet
mix. Drummer 3 tracks have been chosen as the setup for these
recordings included a third tom microphone, and phrases have
been chosen as they give the largest variety of drum hits in a
recording. Both of these choices lead to a greater variety of data
to analyse in the resulting waveforms and spectrograms.
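A minimal sketch of how each waveform and spectrogram pair can be generated is given below, using librosa and matplotlib; the file path is a placeholder for whichever mix (reference, anchor, WN or MWN output) is being plotted.

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Placeholder path; substitute the mix produced by a given method.
audio, sr = librosa.load("mix_output.wav", sr=None, mono=True)

fig, (ax_wave, ax_spec) = plt.subplots(1, 2, figsize=(12, 4))

# Waveform of the whole mix.
librosa.display.waveshow(audio, sr=sr, ax=ax_wave)
ax_wave.set_title("Waveform")

# Log-frequency, log-magnitude spectrogram.
stft_db = librosa.amplitude_to_db(np.abs(librosa.stft(audio)), ref=np.max)
img = librosa.display.specshow(stft_db, sr=sr, x_axis="time",
                               y_axis="log", ax=ax_spec)
ax_spec.set_title("Spectrogram (dB)")
fig.colorbar(img, ax=ax_spec, format="%+2.0f dB")

plt.tight_layout()
plt.show()
```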
Fig. 3 displays the waveforms and spectrograms for the
reference, anchor, WN and MWN mixes for both dry and wet
mixing scenarios. For the case of dry mixing, it can be seen that comparison of these plots reinforces the conclusions drawn from the listening test, in that both the WN and MWN models produce output waveforms of an extremely similar quality to the reference mix. This can also be seen in the corresponding spectrograms. Scrutinising these leads to the conclusion that the MWN produces output mixes that are more similar to the professional mix than the WN model does, as demonstrated by the low-end frequencies in the tail of the track behaving more similarly to those of the reference.
For the wet mixes, it can be seen that the WN and MWN outputs are more similar to each other than either one is to the reference mix. This is explicitly illustrated when observing the space between transients in the spectrograms: these spaces are relatively empty for the WN and MWN mixes, whereas there is much more spectral content between the transients for the reference mix. This supports the same conclusion drawn from informal listening, in that the reverberation effects in the WN and MWN wet mixes are severely underestimated when compared to the reference mix.
6 CONCLUSION
A modification to the current Wave-U-Net architecture has been
proposed and tested, whereby it was found that the performance of both models is as good as or better than a human engineer's performance in a dry mixing scenario, in both subjective and objective evaluation. However, wet mix testing has shown that both models are lacking in their replication of reverberation effects, and are outperformed by a human engineer. Despite this, the modified Wave-U-Net model was found to marginally outperform the standard Wave-U-Net model in wet mixing and in most cases of dry mixing, although expert listeners found the standard Wave-U-Net model to outperform the modified model in dry mixing.
The benefit of deep learning models of this kind is that they have the ability to learn the audio manipulations performed by multiple audio effects more readily than traditional machine learning approaches. These models learn these manipulations directly through observing the data, and even though the models presented here failed to sufficiently capture the reverberation effects, the subjective and objective evaluation results imply that the models successfully capture the manipulations of the other effects present in the mixing chain. The nature of
these models, and the way in which they take in and process data,
imply that deep learning approaches are the most promising and
interesting candidates for the development of a fully autonomous
mixing system.
7 FUTURE WORK
There are certain aspects of this project that could be improved,
given fewer limitations. Firstly, experimenting with different hy-
perparameters, losses and batch sizes could allow for increased
performance. Various different losses could be tested to find which
function works best for the models included here. Secondly, the
training process that each model was run through could be adapted
and refined, either by increasing the number of epochs, or using
a different learning rate schedule with an early stopping imple-
mentation. These changes could all aid in improving the models' efficiency and accuracy in producing professional-standard audio mixtures. Additionally, pre-emphasis filters were applied to the inputs to each model in varying degrees to account for any shortcomings in the high-frequency output of the models. However, it would be better to account for these frequency differences within
the architectures themselves, rather than with filters.
Other aspects of this work were also constrained by hardware. For instance, batch sizes, training times and input data sizes could all have been increased had the work not been restricted by GPU capabilities.
The evaluation methods could also be expanded as the number
of listening test participants was not as great as initially desired,
and so a larger study could yield more reliable, and different,
results.
Multitrack datasets are very few in number, mainly due to the time required to create and compile such a dataset; however, if a sufficiently large and varied dataset were available, then the models
could be trained to learn the mixing processes of instrumentation
beyond drums. Also, due to the nature of multitrack datasets
following a strict format, the resultant models are only trained
to work on input data of the same size and shape as the dataset.
Thus, there is the potential for work to be done to generalise and
extend these models to work with any number of input tracks.
Additionally, there could be work to build presets of these
models that are trained to produce output mixes possessing certain
user-specified qualities. This can potentially be achieved in a
number of ways, like through the augmentation of target mixes
before they’re fed into the model, or it could be attained as a
consequence of modifying the model and its hyperparameters.
The central motivation for this work was the observation that there seem to be no user-friendly, completely
Fig. 3: A plot for comparison of the various mixes outputted by each method. The tracks chosen are track 55 from drummer 3 for the
dry mix (a disco phrase), and track 89 from drummer 3 for the wet mix (a shuffle blues phrase). The left side of the plot displays the
waveform for each type of wet and dry mix, with the right side displaying the corresponding spectrograms.
autonomous mixing systems available commercially, unlike mas-
tering applications. Many of the above-mentioned possibilities have the potential to combine and culminate in the eventual
design of a system of this nature.
ACKNOWLEDGEMENTS
Firstly, I would like to thank Prof Gaël Richard for his assistance
in providing the ENST Drums Dataset for use in this project, and
Dr Dan Barry for allowing me the use of the GoListen platform
for the evaluation.
I would also like to thank Dr Huy Phan and Dr Mathieu Barthet
for providing me with access to crucial materials in understanding
some of the foundational concepts underlying this research topic.
Additionally, I would like to thank Joseph T. Colonel for giving
me his time and assistance in understanding some of the more
technical aspects of this project.
Finally, I would like to thank my supervisor, Dr Emmanouil
Benetos, for his truly invaluable support and advice throughout
every stage of this research, from forming the initial ideas, to the
final execution of this project. This work could not have been
completed without his help.
REFERENCES
[1] LANDR: Creative Tools for Musicians. [Online]. Available:
https://www.landr.com/
[2] BandLab: Make Music Online. [Online]. Available:
https://www.bandlab.com/?lang=en
[3] M. J. Terrell, S. Mansbridge, B. D. Man, and
J. D. Reiss, “System and method for performing
automatic audio production using semantic data,”
US Patent 9 304 988, 2016. [Online]. Available:
https://patents.justia.com/patent/9304988
[4] J. D. Reiss, S. Mansbridge, A. Clifford, Z. Ma, S. Hafezi, and
N. Jillings, “System and method for autonomous multi-track
audio processing,” US Patent 9 654 869, 2017. [Online].
Available: https://patents.justia.com/patent/9654869
[5] W. S. McCulloch and W. Pitts, A logical
calculus of the ideas immanent in nervous activity,”
The bulletin of mathematical biophysics 1943 5:4,
vol. 5, pp. 115–133, 12 1943. [Online]. Available:
https://link.springer.com/article/10.1007/BF02478259
[6] G. E. Hinton and S. Osindero, A fast learning algorithm for
deep belief nets,” Neural Computation, vol. 18, pp. 1527–
1554, 2006.
[7] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimen-
sionality of data with neural networks,” Science, vol. 313, pp.
504–507, July 2006.
[8] T. S. Huang, “Computer vision: Evolution and promise,”
CERN European Organization for Nuclear Research-
Reports-CERN, pp. 21–26.
[9] K. Nalbant and Uyanık, “Computer vision in the metaverse,
vol. 1, pp. 9–12, December 2021.
[10] T. Teoh, Chatbot, Speech, and NLP, January 2022, pp. 277–
287.
[11] E. Moliner and V. Välimäki, “A two-stage u-net for high-
fidelity denoising of historical recordings,” February 2022.
[12] Z. Kang, Z. Huang, and C. Lu, “Speech enhancement using
u-net with compressed sensing,” Applied Sciences, vol. 12,
p. 4161, April 2022.
[13] Y. Chen, “Speech generation by generative adversarial net-
work, 2nd International Conference on Big Data Artificial
Intelligence Software Engineering (ICBASE), pp. 435–438,
September 2021.
[14] P. Chowdhary, Automatic Speech Recognition, April 2020,
pp. 651–668.
[15] E. Benetos, S. Dixon, D. Giannoulis, H. Kirchhoff, and
A. Klapuri, Automatic music transcription: Challenges and
future directions,” Journal of Intelligent Information Sys-
tems, vol. 41, pp. 407–434, December 2013.
[16] A. Klapuri and M. Davy, Eds., Signal Processing Methods
for Music Transcription. Springer, 2006.
[17] A. Wilson, “An evolutionary computation approach to intelli-
gent music production informed by experimentally gathered
domain knowledge, in Proceedings of the 2nd AES Work-
shop on Intelligent Music Production, 2016.
[18] M. J. Terrell and J. D. Reiss, “Automatic Monitor Mixing for
Live Musical Performance, J. Audio Eng. Soc., vol. 57, pp.
927–936, 2009.
[19] E. Perez-Gonzalez and J. D. Reiss, “Automatic Gain and
Fader Control For Live Mixing, in IEEE Workshop on
Applications of Signal Processing to Audio and Acoustics,
2009.
[20] M. Terrell and M. Sandler, An offline, automatic mix-
ing method for live music, incorporating multiple sources,
loudspeakers, and room effects, Computer Music Journal,
vol. 36, pp. 37–54, 2012.
[21] J. Jiménez-Sauma, “Real-Time Multi-Track Mixing For Live Performance,” 2019. [Online]. Available:
https://zenodo.org/record/2550903
[22] M. Torcoli, A. Freke-Morin, J. Paulus, C. Simon, and
B. Shirley, “Background Ducking to Produce Esthetically
Pleasing Audio for TV with Clear Speech,” in 146th Con-
vention of the Audio Engineering Society, 2019.
[23] S. Fenton, “Automatic Mixing of Multitrack Material Using
Modified Loudness Models,” in 145th Convention of the
Audio Engineering Society, 2018.
[24] D. Ward, J. D. Reiss, and C. Athwal, “Multi-track mixing
using a model of loudness and partial loudness,” in 133rd
Convention of the Audio Engineering Society, 2012.
[25] G. Wichern, A. S. Wishnick, A. Lukin, and H. Robertson,
“Comparison of Loudness Features for Automatic Level
Adjustment in Mixing,” in 140th Convention of the Audio
Engineering Society, 2015.
[26] D. Moffat and M. Sandler, Automatic Mixing Level Balanc-
ing Enhanced through Source Interference Identification,” in
146th Convention of the Audio Engineering Society, 2019.
[27] S. Mansbridge, S. Finn, and J. D. Reiss, “Implementation
and Evaluation of Autonomous Multi-track Fader Control,
in 132nd Convention of the Audio Engineering Society, 2012.
[28] E. Perez-Gonzalez and J. D. Reiss, “An automatic maximum
gain normalization technique with applications to audio mix-
ing,” in 124th Convention of the Audio Engineering Society,
2008.
[29] E. Perez-Gonzalez and J. D. Reiss, “Improved control for
selective minimization of masking using inter-channel de-
pendancy effects, in Proc. of the 11th Int. Conference on
Digital Audio Effects (DAFx-08), Espoo, Finland, 2008.
[30] M. Terrell, A. Simpson, and M. Sandler, “The Mathematics
of Mixing,” J. Audio Eng. Soc., vol. 62, pp. 4–13, 2014.
[31] B. A. Kolasinski, “A Framework for Automatic Mixing Us-
ing Timbral Similarity Measures and Genetic Optimization,
in 124th Convention of the Audio Engineering Society, 2008.
[32] J. Scott and Y. E. Kim, “Analysis of acoustic features for
automated multi-track mixing,” in 12th International Society
for Music Information Retrieval Conference, 2011.
[33] J. Scott, M. Prockup, E. M. Schmidt, and Y. E. Kim, “Auto-
matic multi-track mixing using linear dynamical systems,”
in Proceedings of the 8th Sound and Music Computing
Conference, 2011.
[34] N. Jillings and R. Stables, “Automatic Masking Reduction
in Balance Mixes Using Evolutionary Computing, in 143rd
Convention of the Audio Engineering Society, 2017.
[35] A. Wilson and B. M. Fazenda, “User-Guided Rendering of
Audio Objects Using an Interactive Genetic Algorithm, J.
Audio Eng. Soc, vol. 67, pp. 522–530, 2019.
[36] D. Moffat and M. Sandler, “Machine Learning Multitrack
Gain Mixing of Drums,” in 147th Convention of the Audio
Engineering Society, 2019.
[37] E. Perez-Gonzalez and J. D. Reiss, Automatic mixing:
live downmixing stereo panner,” in Proc. of the 10th Int.
Conference on Digital Audio Effects (DAFx-07), Bordeaux,
France, 2007.
[38] E. Perez-Gonzalez and J. D. Reiss, “A Real-Time Semi-
autonomous Audio Panning System for Music Mixing,
EURASIP Journal on Advances in Signal Processing, pp.
1–10, 2010.
[39] S. Mansbridge, S. Finn, and J. D. Reiss, “An Autonomous
System for Multi-track Stereo Pan Positioning, in 133rd
Convention of the Audio Engineering Society, 2012.
[40] P. D. Pestana and J. D. Reiss, A cross-adaptive dynamic
spectral panning technique,” in Proc. of the 17th Int. Con-
ference on Digital Audio Effects (DAFx-14), Erlangen, Ger-
many, 2014.
[41] A. Tom, J. Reiss, and P. Depalle, An automatic mixing
system for multitrack spatialization for stereo based on
unmasking and best panning practices,” in 146th Convention
of the Audio Engineering Society, 2019.
[42] E. Perez-Gonzalez and J. Reiss, “Automatic equalization of
multi-channel audio using cross-adaptive methods, in 127th
Convention of the Audio Engineering Society, 2009.
[43] Z. Ma, J. D. Reiss, and D. A. A. Black, “Implementation of
an intelligent equalization tool using Yule-Walker for music
mixing and mastering,” in 134th Convention of the Audio
Engineering Society, 2013.
[44] S. I. Mimilakis, K. Drossos, A. Floros, and D. T. G. Katere-
los, Automated Tonal Balance Enhancement for Audio
Mastering Applications,” in 134th Convention of the Audio
Engineering Society, 2013.
[45] S. Hafezi and J. D. Reiss, “Autonomous Multitrack Equal-
ization Based on Masking Reduction,” J. Audio Eng. Soc,
vol. 63, pp. 312–323, 2015.
[46] D. Reed, “Perceptual assistant to do sound equalization,” in
5th International Conference on Intelligent User Interfaces,
2000, pp. 212–218.
[47] M. A. M. Ramírez and J. D. Reiss, “End-to-end equaliza-
tion with convolutional neural networks, in Proceedings of
the 21st International Conference on Digital Audio Effects
(DAFx-18), Aveiro, Portugal, 2018.
[48] S. I. Mimilakis, N. J. Bryan, and P. Smaragdis, “One-Shot
Parametric Audio Production Style Transfer with Applica-
tion to Frequency Equalization, in ICASSP, IEEE Interna-
tional Conference on Acoustics, Speech and Signal Process-
ing, 2020.
[49] M. Terrell, J. D. Reiss, and M. Sandler, Automatic noise
gate settings for drum recordings containing bleed from sec-
ondary sources,” EURASIP Journal on Advances in Signal
Processing, pp. 1–9, 2010.
[50] J. A. Maddams, S. Finn, and J. D. Reiss, “An autonomous
method for multi-track dynamic range compression,” in Proc.
of the 15th Int. Conference on Digital Audio Effects (DAFx-
12), York, UK, 2012.
[51] D. Giannoulis, M. Massberg, and J. D. Reiss, “Parameter
Automation in a Dynamic Range Compressor,” J. Audio Eng.
Soc., vol. 61, pp. 716–726, 2013.
[52] Z. Ma, B. D. Man, P. D. L. Pestana, D. A. A. Black,
and J. D. Reiss, “Intelligent Multitrack Dynamic Range
Compression,” J. Audio Eng. Soc, vol. 63, pp. 412–426,
2015.
[53] M. Hilsamer and S. Herzog, “A statistical approach to
automated offline dynamic processing in the audio mastering
process,” in Proc. of the 17th Int. Conference on Digital
Audio Effects (DAFx-14), Erlangen, Germany, 2014.
[54] A. Mason, N. Jillings, Z. Ma, J. D. Reiss, and F. Melchior,
Adaptive audio reproduction using personal compression,”
in AES 57th International Conference, Hollywood, CA, USA,
2015.
[55] D. Moffat and M. B. Sandler, “Adaptive Ballistics Control
of Dynamic Range Compression for Percussive Tracks, in
145th Convention of the Audio Engineering Society, 2018.
[56] S. I. Mimilakis, K. Drossos, T. Virtanen, and G. Schuller,
“Deep Neural Networks for Dynamic Range Compression in
Mastering Applications,” in 140th Convention of the Audio
Engineering Society, 2016.
[57] D. Sheng and G. Fazekas, “A Feature Learning
Siamese Model for Intelligent Control of the
Dynamic Range Compressor, 2019. [Online]. Available:
https://doi.org/10.48550/arXiv.1905.01022
[58] E. T. Chourdakis and J. D. Reiss, “Automatic Control of a
Digital Reverberation Effect using Hybrid Models, in 60th
International Conference: DREAMS (Dereverberation and
Reverberation of Audio, Music, and Speech), 2016.
[59] E. T. Chourdakis and J. D. Reiss, “A Machine-Learning Ap-
proach to Application of Intelligent Artificial Reverberation,
J. Audio Eng. Soc, vol. 65, pp. 56–65, 2017.
[60] A. L. Benito and J. D. Reiss, “Intelligent Multitrack Rever-
beration Based on Hinge-Loss Markov Random Fields, in
Conference on Semantic Audio, 2017.
[61] D. Moffat and M. Sandler, “An Automated Approach to the
Application of Reverberation, in 147th Convention of the
Audio Engineering Society, 2019.
[62] E. Perez-Gonzalez, “Advanced automatic mixing tools
for music,” Ph.D. dissertation, 2010. [Online]. Available:
https://qmro.qmul.ac.uk/jspui/handle/123456789/614
[63] R. Gang, G. Bocko, J. Lundberg, D. Headlam, and M. F.
Bocko, Automatic music production system employing
probabilistic expert systems, in 129th Convention of the
Audio Engineering Society, 2010.
[64] J. Scott and Y. E. Kim, “Instrument identification informed
multi-track mixing,” in 14th International Society for Music
Information Retrieval Conference, 2013.
[65] B. D. Man and J. D. Reiss, “A knowledge-engineered au-
tonomous mixing system,” in 135th Convention of the Audio
Engineering Society, 2013.
[66] F. Everardo, “Towards an Automated Multitrack Mixing Tool
using Answer Set Programming,” in Proceedings of the 14th
Sound and Music Computing Conference, Espoo, Finland,
2017.
[67] D. Moffat, F. Thalmann, and M. B. Sandler, “Towards a
semantic web representation and application of audio mixing
rules,” in Proceedings of the 4th Workshop on Intelligent
Music Production, Huddersfield, UK, 2018.
[68] D. Matz, E. Cano, and J. Abeßer, “New sonorities for early
jazz recordings using sound source separation and automatic
mixing tools,” in Proceedings of the 16th ISMIR Conference,
Málaga, Spain, 2015.
[69] A. Wilson and B. M. Fazenda, “Populating the Mix Space:
Parametric Methods for Generating Multitrack Audio Mix-
tures,” Applied Sciences, vol. 7, p. 1329, 2017.
[70] D. Ronan, Z. Ma, P. Mc Namara, H. Gunes, and J. D. Reiss, “Automatic Minimisation of Masking in Multitrack Audio using Subgroups,” ArXiv e-prints, 2018.
[71] M. A. M. Ramírez and J. D. Reiss, “Analysis and Prediction of the Audio Feature Space when Mixing Raw Recordings into Individual Stems,” in 143rd Convention of the Audio Engineering Society, 2017.
[72] S. I. Mimilakis, E. Cano, J. Abeßer, and G. Schuller, “New sonorities for jazz recordings: separation and mixing using deep neural networks,” in Proceedings of the 2nd AES Workshop on Intelligent Music Production, London, UK, 2016.
[73] C. J. Steinmetz, J. Pons, S. Pascual, and J. Serrà, “Automatic multitrack mixing with a differentiable mixing console of neural audio effects,” in ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, 2020.
[74] M. A. Martínez Ramírez, D. Stoller, and D. Moffat, “A Deep Learning Approach to Intelligent Drum Mixing With the Wave-U-Net,” J. Audio Eng. Soc, vol. 69, pp. 142–151, 2021.
[75] O. Gillet and G. Richard, “ENST-Drums: an extensive audio-
visual database for drum signals processing,” in 7th Interna-
tional Conference on Music Information Retrieval (ISMIR
2006), January 2006, pp. 156–159.
[76] D. Stoller, S. Ewert, and S. Dixon, “Wave-U-Net: A Multi-
Scale Neural Network for End-to-End Audio Source Sep-
aration,” 19th International Society for Music Information
Retrieval Conference (ISMIR 2018), 2018.
[77] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional Networks for Biomedical Image Segmentation,” LNCS, vol. 9351, pp. 234–241, October 2015.
[78] F. Nazem, F. Ghasemi, A. Fassihi, and A. M. Dehnavi, “3d
u-net: A voxel-based method in binding site prediction of
protein structure,” Journal of Bioinformatics and Computa-
tional Biology, vol. 19, April 2021.
[79] J. Andersson, H. Ahlström, and J. Kullberg, “Separation of water and fat signal in whole-body gradient echo scans using convolutional neural networks,” Magnetic Resonance in Medicine, vol. 82, pp. 1177–1186, September 2019.
[80] B.-W. Chen, Y.-M. Hsu, and H.-Y. Lee, “J-net: Randomly
weighted u-net for audio source separation,” November
2019.
[81] V. S. Kadandale, J. F. Montesinos, G. Haro, and E. Gómez, “Multi-task U-Net for Music Source Separation,” 2020.
[82] A. Jansson, E. Humphrey, N. Montecchio, R. Bittner, A. Kumar, and T. Weyde, “Singing voice separation with deep U-Net convolutional networks,” 2017, pp. 745–751.
[83] D. León and F. Tobar, “Late reverberation suppression using u-nets,” October 2021.
[84] J. Silva-Rodríguez, M. F. Dolz, M. Ferrer, A. Castelló, V. Naranjo, and G. Pinero, “Acoustic echo cancellation using residual u-nets,” September 2021.
[85] A. Cohen-Hadria, A. Roebel, and G. Peeters, “Improving
singing voice separation using deep u-net and wave-u-net
with data augmentation,” pp. 1–5, September 2019.
[86] J.-H. Kim and J.-H. Chang, “Attention wave-u-net for acous-
tic echo cancellation,” Interspeech, pp. 3969–3973, October
2020.
[87] C. Macartney and T. Weyde, “Improved speech enhancement with the wave-u-net,” November 2018.
[88] R. Giri, U. Isik, and A. Krishnaswamy, “Attention wave-u-net for speech enhancement,” pp. 249–253, October 2019.
[89] D. Barry, Q. Zhang, P. W. Sun, and A. Hines, “Go listen: An end-to-end online listening test platform,” Journal of Open Research Software, vol. 9, pp. 1–12, 2021.
[90] ITU-R, “BS 1534-1, method for the subjective assessment of intermediate quality level of coding systems,” 2001.
[91] D. Müllensiefen, B. Gingras, J. Musil, and L. Stewart, “The musicality of non-musicians: An index for assessing musical sophistication in the general population,” PLOS ONE, vol. 9, 2014.