Wen-Yi Hsiao’s research while affiliated with Academia Sinica and other places


Publications (18)


PyNeuralFx: A Python Package for Neural Audio Effect Modeling
  • Preprint
  • File available

August 2024 · 88 Reads · Wen-Yi Hsiao

We present PyNeuralFx, an open-source Python toolkit designed for research on neural audio effect modeling. The toolkit provides an intuitive framework and offers a comprehensive suite of features, including standardized implementations of well-established model architectures, loss functions, and easy-to-use visualization tools. As such, it helps promote reproducibility in research on neural audio effect modeling and enables in-depth performance comparison of different models, offering insight into their behavior and operational characteristics through DSP methodology. The toolkit can be found at https://github.com/ytsrt66589/pyneuralfx.
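
For readers outside the field, a concrete example of a "well-established loss function" in this area is the error-to-signal ratio (ESR) with pre-emphasis. The sketch below is a minimal, generic PyTorch illustration written for this page; it does not reproduce PyNeuralFx's own interfaces, which are documented in the repository linked above.

    import torch

    def pre_emphasis(x: torch.Tensor, coeff: float = 0.95) -> torch.Tensor:
        # First-order high-pass filter commonly applied before computing ESR.
        return torch.cat([x[..., :1], x[..., 1:] - coeff * x[..., :-1]], dim=-1)

    def esr_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
        # Error-to-signal ratio: energy of the residual normalized by the
        # energy of the target signal.
        pred, target = pre_emphasis(pred), pre_emphasis(target)
        return torch.sum((target - pred) ** 2) / (torch.sum(target ** 2) + eps)

    # Example: compare a model output against the recorded device output.
    y_hat = torch.randn(1, 44100)   # hypothetical model output, 1 s at 44.1 kHz
    y_ref = torch.randn(1, 44100)   # hypothetical target recording
    print(esr_loss(y_hat, y_ref))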


Figure 4: The diagram illustrates the proposed transient metric. The blue color represents the signal in the time domain, while the orange color signifies the signal in the discrete cosine transform (DCT) domain. The algorithm extracts the transient signal and calculates the spectral loss in the DCT domain.
Hyper Recurrent Neural Network: Condition Mechanisms for Black-box Audio Effect Modeling

August 2024 · 39 Reads

Recurrent neural networks (RNNs) have demonstrated impressive results for virtual analog modeling of audio effects. These networks process time-domain audio signals using a series of matrix multiplications and nonlinear activation functions to accurately emulate the behavior of the target device. To additionally model the effect of the knobs in an RNN-based model, existing approaches integrate control parameters by concatenating them channel-wise with some intermediate representation of the input signal. While this method is parameter-efficient, there is room to further improve the quality of the generated audio, because concatenation-based conditioning has limited capacity for modulating signals. In this paper, we propose three novel conditioning mechanisms for RNNs, tailored for black-box virtual analog modeling. These advanced conditioning mechanisms modulate the model based on control parameters, yielding results superior to existing RNN- and CNN-based architectures across various evaluation metrics.
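
To make the baseline concrete, the sketch below contrasts the concatenation-based conditioning described above with a generic FiLM-style modulation of the hidden state. Both modules are illustrative PyTorch code with hypothetical sizes; neither is one of the three mechanisms actually proposed in the paper.

    import torch
    import torch.nn as nn

    class ConcatConditionedGRU(nn.Module):
        # Baseline: control parameters are concatenated channel-wise with the input.
        def __init__(self, hidden: int = 32, n_controls: int = 2):
            super().__init__()
            self.rnn = nn.GRU(1 + n_controls, hidden, batch_first=True)
            self.out = nn.Linear(hidden, 1)

        def forward(self, x, controls):
            # x: (batch, time, 1) audio; controls: (batch, n_controls) knob settings.
            c = controls.unsqueeze(1).expand(-1, x.shape[1], -1)
            h, _ = self.rnn(torch.cat([x, c], dim=-1))
            return self.out(h)

    class FiLMConditionedGRU(nn.Module):
        # Illustrative alternative: the controls predict a scale and shift that
        # modulate the hidden state (FiLM-style), giving stronger modulation
        # capacity than plain concatenation.
        def __init__(self, hidden: int = 32, n_controls: int = 2):
            super().__init__()
            self.rnn = nn.GRU(1, hidden, batch_first=True)
            self.film = nn.Linear(n_controls, 2 * hidden)
            self.out = nn.Linear(hidden, 1)

        def forward(self, x, controls):
            h, _ = self.rnn(x)
            gamma, beta = self.film(controls).unsqueeze(1).chunk(2, dim=-1)
            return self.out(gamma * h + beta)

    x, knobs = torch.randn(4, 2048, 1), torch.rand(4, 2)
    y = FiLMConditionedGRU()(x, knobs)        # (4, 2048, 1)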


MusiConGen: Rhythm and Chord Control for Transformer-Based Text-to-Music Generation

July 2024 · 48 Reads

Existing text-to-music models can produce high-quality audio with great diversity. However, textual prompts alone cannot precisely control temporal musical features such as the chords and rhythm of the generated music. To address this challenge, we introduce MusiConGen, a temporally-conditioned Transformer-based text-to-music model that builds upon the pretrained MusicGen framework. Our innovation lies in an efficient finetuning mechanism, tailored for consumer-grade GPUs, that integrates automatically-extracted rhythm and chords as the condition signal. During inference, the condition can either be musical features extracted from a reference audio signal, or a user-defined symbolic chord sequence, BPM, and textual prompt. Our performance evaluation on two datasets, one derived from extracted features and the other from user-created inputs, demonstrates that MusiConGen can generate realistic backing track music that aligns well with the specified conditions. We open-source the code and model checkpoints, and provide audio examples online at https://musicongen.github.io/musicongen_demo/.
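
As a rough illustration of the user-defined condition path (symbolic chords plus BPM), the snippet below converts a chord-per-bar sequence into frame-level chroma features. The frame rate, chord vocabulary, and function names are assumptions for this sketch, not MusiConGen's actual preprocessing.

    import numpy as np

    # Hypothetical chord vocabulary: pitch classes of each triad.
    CHORD_TEMPLATES = {
        "C": [0, 4, 7], "F": [5, 9, 0], "G": [7, 11, 2], "Am": [9, 0, 4],
    }

    def chords_to_frames(chords, bpm, frame_rate=50, beats_per_chord=4):
        # Expand one chord per bar into frame-level 12-dimensional chroma vectors.
        sec_per_chord = beats_per_chord * 60.0 / bpm
        frames_per_chord = int(round(sec_per_chord * frame_rate))
        frames = []
        for name in chords:
            chroma = np.zeros(12, dtype=np.float32)
            chroma[CHORD_TEMPLATES[name]] = 1.0
            frames.append(np.tile(chroma, (frames_per_chord, 1)))
        return np.concatenate(frames, axis=0)    # (n_frames, 12)

    cond = chords_to_frames(["C", "Am", "F", "G"], bpm=120)
    print(cond.shape)   # (400, 12) at 50 frames/s and 120 BPM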


DDSP-based Singing Vocoders: A New Subtractive-based Synthesizer and A Comprehensive Evaluation

August 2022 · 28 Reads · 1 Citation

Da-Yi · Wen-Yi Hsiao · Fu-Rong Yang · [...]

A vocoder is a conditional audio generation model that converts acoustic features such as mel-spectrograms into waveforms. Taking inspiration from Differentiable Digital Signal Processing (DDSP), we propose a new vocoder named SawSing for singing voices. SawSing synthesizes the harmonic part of singing voices by filtering a sawtooth source signal with a linear time-variant finite impulse response filter whose coefficients are estimated from the input mel-spectrogram by a neural network. As this approach enforces phase continuity, SawSing can generate singing voices without the phase-discontinuity glitch of many existing vocoders. Moreover, the source-filter assumption provides an inductive bias that allows SawSing to be trained on a small amount of data. Our experiments show that SawSing converges much faster and outperforms state-of-the-art generative adversarial network and diffusion-based vocoders in a resource-limited scenario with only 3 training recordings and a 3-hour training time.
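
The source-filter idea can be sketched numerically as follows: build a sawtooth from an f0 contour and filter it frame by frame with time-varying FIR coefficients. In SawSing those coefficients are predicted by a neural network from the mel-spectrogram; here they are random stand-ins, and the sample rate, hop size, and filter length are hypothetical, so this is an illustration of the signal path only.

    import numpy as np

    def sawtooth_from_f0(f0, sr=24000):
        # Naive sawtooth source: accumulate phase from a per-sample f0 contour.
        phase = np.cumsum(2 * np.pi * f0 / sr)
        return ((phase / np.pi + 1) % 2) - 1      # values in [-1, 1]

    def filter_framewise(source, firs, hop=240):
        # Apply a different FIR filter to each windowed frame via overlap-add.
        out = np.zeros(len(source) + firs.shape[1] - 1)
        win = np.hanning(hop * 2)
        for i, h in enumerate(firs):
            start = i * hop
            frame = source[start:start + hop * 2]
            if len(frame) < hop * 2:
                break
            out[start:start + hop * 2 + len(h) - 1] += np.convolve(frame * win, h)
        return out[:len(source)]

    sr, hop = 24000, 240
    f0 = np.full(sr, 220.0)                       # 1 s of a flat 220 Hz contour
    firs = np.random.randn(sr // hop, 64) * 0.01  # stand-in for NN-predicted coefficients
    harmonic = filter_framewise(sawtooth_from_f0(f0, sr), firs, hop)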



Fig. 1. Schematic diagram of the proposed multi-loss seq2seq Transformer model for AMT, with additional losses at the output of the encoder compared to the model proposed by Hawthorne et al. [17]. The input to the "frame stack" is a concatenation of the output from the "onset," "offset," and "activation" stacks.
Towards Automatic Transcription of Polyphonic Electric Guitar Music: A New Dataset and a Multi-Loss Transformer Model

February 2022 · 152 Reads

In this paper, we propose a new dataset named EGDB that contains transcriptions of the electric guitar performance of 240 tablatures rendered with different tones. Moreover, we benchmark the performance of two well-known transcription models, proposed originally for the piano, on this dataset, along with a multi-loss Transformer model that we newly propose. Our evaluation on this dataset and a separate set of real-world recordings demonstrates the influence of timbre on the accuracy of guitar sheet transcription, the potential of using multiple losses for Transformers, as well as the room for further improvement on this task.
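
A minimal sketch of how such a multi-loss objective can be wired: auxiliary frame-level losses at the encoder output are added to the usual token-level cross-entropy of the seq2seq decoder. The head names, shapes, and loss weight below are hypothetical, not values from the paper.

    import torch
    import torch.nn.functional as F

    def multi_loss(enc_heads, enc_targets, token_logits, token_targets, aux_weight=0.5):
        # Auxiliary frame-level losses at the encoder output
        # (e.g. onset / offset / activation rolls), plus the usual
        # cross-entropy over the decoder's output token sequence.
        aux = sum(F.binary_cross_entropy_with_logits(enc_heads[k], enc_targets[k])
                  for k in enc_heads)
        seq = F.cross_entropy(token_logits.transpose(1, 2), token_targets)
        return seq + aux_weight * aux

    # Shapes are illustrative: (batch, frames, pitches) rolls and
    # (batch, steps, vocab) decoder logits.
    enc_heads = {k: torch.randn(2, 100, 88) for k in ("onset", "offset", "activation")}
    enc_targets = {k: torch.randint(0, 2, (2, 100, 88)).float() for k in enc_heads}
    token_logits, token_targets = torch.randn(2, 50, 400), torch.randint(0, 400, (2, 50))
    print(multi_loss(enc_heads, enc_targets, token_logits, token_targets))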



Source Separation-based Data Augmentation for Improved Joint Beat and Downbeat Tracking

June 2021 · 23 Reads

Due to advances in deep learning, the performance of automatic beat and downbeat tracking in musical audio signals has seen great improvement in recent years. In training such deep learning based models, data augmentation has been found to be an important technique. However, existing data augmentation methods for this task mainly aim at balancing the distribution of the training data with respect to tempo. In this paper, we investigate another approach to data augmentation that accounts for the composition of the training data in terms of percussive and non-percussive sound sources. Specifically, we propose to employ a blind drum separation model to segregate the drum and non-drum sounds from each training audio signal, filter out training signals that are drumless, and then use the obtained drum and non-drum stems to augment the training data. We report experiments on four completely unseen test sets, validating the effectiveness of the proposed method and, accordingly, the importance of drum sound composition in the training data for beat and downbeat tracking.
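
The augmentation recipe described above can be outlined as follows; separate_drums stands in for whichever blind drum separation model is used, and the energy threshold for deciding that a clip is drumless is a made-up placeholder.

    def augment_with_drum_separation(dataset, separate_drums, energy_thresh=1e-3):
        """Build an augmented training set from (audio, beat_annotation) pairs.

        separate_drums(audio) is assumed to return (drum_stem, non_drum_stem);
        the beat/downbeat annotations of the original clip are reused for the
        stems, since separation does not change timing.
        """
        augmented = []
        for audio, annotation in dataset:
            drums, rest = separate_drums(audio)
            if (drums ** 2).mean() < energy_thresh:
                continue                            # skip drumless training signals
            augmented.append((audio, annotation))   # original mixture
            augmented.append((drums, annotation))   # drum-only stem
            augmented.append((rest, annotation))    # non-drum stem
        return augmented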


Compound Word Transformer: Learning to Compose Full-Song Music over Dynamic Directed Hypergraphs

May 2021 · 74 Reads · 151 Citations

Proceedings of the AAAI Conference on Artificial Intelligence

To apply neural sequence models such as Transformers to music generation tasks, one has to represent a piece of music by a sequence of tokens drawn from a finite set of pre-defined vocabulary. Such a vocabulary usually involves tokens of various types. For example, to describe a musical note, one needs separate tokens to indicate the note's pitch, duration, velocity (dynamics), and placement (onset time) along the time grid. While different types of tokens may possess different properties, existing models usually treat them equally, in the same way as modeling words in natural languages. In this paper, we present a conceptually different approach that explicitly takes into account the type of the tokens, such as note types and metric types, and we propose a new Transformer decoder architecture that uses different feed-forward heads to model tokens of different types. With an expansion-compression trick, we convert a piece of music to a sequence of compound words by grouping neighboring tokens, greatly reducing the length of the token sequences. We show that the resulting model can be viewed as a learner over dynamic directed hypergraphs, and we employ it to learn to compose expressive Pop piano music of full-song length (involving up to 10K individual tokens per song), both conditionally and unconditionally. Our experiments show that, compared to state-of-the-art models, the proposed model converges 5 to 10 times faster at training (i.e., within a day on a single GPU with 11 GB memory) and generates music of comparable quality.
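
A toy sketch of the token-grouping idea: each compound word holds one index per token type, embedded with type-specific tables on the input side and predicted with type-specific feed-forward heads on the output side. Field names, vocabulary sizes, and the way the per-field embeddings are combined are simplifications for illustration, not the released implementation.

    import torch
    import torch.nn as nn

    FIELDS = {"type": 4, "pitch": 128, "duration": 64, "velocity": 32}

    class CompoundWordHeads(nn.Module):
        # One embedding table per token type and one output head per token type.
        def __init__(self, d_model: int = 256):
            super().__init__()
            self.emb = nn.ModuleDict({k: nn.Embedding(v, d_model) for k, v in FIELDS.items()})
            self.heads = nn.ModuleDict({k: nn.Linear(d_model, v) for k, v in FIELDS.items()})

        def embed(self, compound):
            # compound: dict of (batch, seq) index tensors, one per field.
            # Summing the per-field embeddings is one simple way to obtain a
            # single vector per compound word.
            return sum(self.emb[k](compound[k]) for k in FIELDS)

        def predict(self, hidden):
            # hidden: (batch, seq, d_model) from the Transformer decoder backbone.
            return {k: self.heads[k](hidden) for k in FIELDS}

    model = CompoundWordHeads()
    compound = {k: torch.randint(0, v, (2, 16)) for k, v in FIELDS.items()}
    hidden = model.embed(compound)              # (2, 16, 256), fed to the decoder
    logits = model.predict(hidden)              # per-field logits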


Figure 2. Diagram of the proposed MTHarmonizer, a deep multitask model extended from the model (Lim et al., 2017) depicted in Figure 1. See Section 2.5 for details.
Figure 3. A harmonization example (in major key) from The Beatles: Hey Jude. We can see that, while the non-deep learning models change the harmonization in different phrases, the MTHarmonizer generates a V-I progression nicely to close the phrase.
Figure 4. A harmonization example (in minor key) from ABBA: Gimme Gimme Gimme A Man After Midnight. Similar to the example shown in Figure 3, the result of the MTHarmonizer appears to be more diverse and functionally correct. We also see that the result of GA is quite "interesting", e.g., it uses the nondiatonic chord D-flat major and closes the music phrase with a Picardy third (i.e., a major chord on the tonic at the end of a chord sequence that is in a minor key). We also see that the non-deep learning methods seem to be weaker in handling the tonality of music.
Figure 5. "Win probabilities" of different model pairs. Each entry represents the probability that the model in that column scores higher than the model in that row.
Figure 6. The mean rating scores in subjective evaluation, along with the standard deviation (the error bars).
Automatic melody harmonization with triad chords: A comparative study

January 2021 · 275 Reads · 55 Citations

Journal of New Music Research

The task of automatic melody harmonization aims to build a model that generates a chord sequence as the harmonic accompaniment of a given multiple-bar melody sequence. In this paper, we present a comparative study evaluating the performance of canonical approaches to this task, including template matching, hidden Markov model, genetic algorithm and deep learning. The evaluation is conducted on a dataset of 9226 melody/chord pairs, considering 48 different triad chords. We report the result of an objective evaluation using six different metrics and a subjective study with 202 participants, showing that a deep learning method performs the best.
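
To make the task setup concrete, one straightforward deep learning formulation (close in spirit to, though not identical with, the models compared in the paper) treats harmonization as per-step classification over the 48 triad classes. A minimal sketch with hypothetical feature dimensions:

    import torch
    import torch.nn as nn

    N_TRIADS = 48      # 12 roots x {major, minor, diminished, augmented}

    class MelodyHarmonizer(nn.Module):
        # Encode the melody with a BiLSTM and classify one triad per melody
        # segment; the segment granularity here is an arbitrary choice.
        def __init__(self, melody_dim: int = 12, hidden: int = 64):
            super().__init__()
            self.rnn = nn.LSTM(melody_dim, hidden, batch_first=True, bidirectional=True)
            self.cls = nn.Linear(2 * hidden, N_TRIADS)

        def forward(self, melody):
            # melody: (batch, steps, melody_dim) pitch-class features.
            h, _ = self.rnn(melody)
            return self.cls(h)                  # (batch, steps, N_TRIADS)

    logits = MelodyHarmonizer()(torch.randn(2, 32, 12))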


Citations (9)


... Similarly, Wu et al. [37] introduced a new variant of the Transformer model called Sparse Transformer for modeling long sequences in music. Additionally, Hsiao et al. [39] based their model architecture on Linear Transformer using a linear attention mechanism, effectively reducing model complexity using low-rank matrix approximation methods. ...

Reference:

MAML-XL: a symbolic music generation method based on meta-learning and Transformer-XL
Compound Word Transformer: Learning to Compose Full-Song Music over Dynamic Directed Hypergraphs
  • Citing Article
  • May 2021

Proceedings of the AAAI Conference on Artificial Intelligence

... Alonso and Erkut 2021 Experiments on autoencoder from Engel et al. (2020a) for singing voice synthesis. Wu et al. 2022a Differentiable subtractive singing voice synthesiser. Guo et al. 2022a Differentiable filtering of sine excitation for adversarial SVC. ...

DDSP-based Singing Vocoders: A New Subtractive-based Synthesizer and A Comprehensive Evaluation
  • Citing Preprint
  • August 2022

... It also provided distinct methods of generating tracks: both all tracks without delay (simultaneous technology) or one after another (sequential technology), which added a layer of flexibility depending on the creative goal. By focusing on how to harmonize multiple instruments correctly, MuseGAN tackled one of the hardest demanding situations in the automated tune era and unfolded new possibilities for growing sensible, layered musical portions using AI [2]. In 2018, Roberts et al. introduced a creative new approach to AI tune technology through developing a hierarchical latent vector model that makes use of recurrent variational autoencoders (VAEs). ...

MuseGAN: Multi-track Sequential Generative Adversarial Networks for Symbolic Music Generation and Accompaniment
  • Citing Article
  • April 2018

Proceedings of the AAAI Conference on Artificial Intelligence

... To evaluate the impact of expanded and diverse training data on multi-pitch prediction accuracy, we introduce EGDB-PG, a dataset created by rendering the EGDB dataset [11] with BiasFX2 plugins from Positive Grid, a world-leading guitar amplifier and plugin company. This dataset addresses the scarcity of publicly available guitar datasets that feature effect-rendered audio with alignment labels, a critical limitation for training robust transcription models. ...

Towards Automatic Transcription of Polyphonic Electric Guitar Music: A New Dataset and a Multi-Loss Transformer Model
  • Citing Conference Paper
  • May 2022

... Source separation in music applications often focuses on remixing separated musical stems [37]. Separation has also been widely used for music transcription (i.e., scoring), either as a pre-processing frontend [38], [39] or within a joint approach [20], [40]. In [41], the authors extensively explore the idea of source separation specifically applied to choir ensemble mixtures, allowing for a set of potential downstream applications such as F 0 contour analysis, synthesis, transposition, unison analysis, as well as singing group remixing. ...

Source Separation-based Data Augmentation for Improved Joint Beat and Downbeat Tracking
  • Citing Conference Paper
  • August 2021

... Notably, When in Rome is an actively maintained corpus where new harmonic annotations (in RomanText format) are also contributed and internally validated by experts. Chord datasets not included in ChoCo. Although other collections providing harmonic information exist in the literature, some of them were currently discarded for the reasons explained below. The Leadsheet dataset [416] separately annotates chord progressions for each segment (e.g. intro, chorus) but does not provide information on how structures are laid out in the piece. ...

Automatic melody harmonization with triad chords: A comparative study
  • Citing Article
  • January 2021

Journal of New Music Research

... MSS for classical music is much less explored than popular music. Chiu et al. [14] trained an Open-Unmix model [15] to separate piano/violin duets. For 6 tracks from the MedleyDB dataset, they achieved SDRs of 9.66 and 1.56 dB for piano and violin, respectively. ...

Mixing-Specific Data Augmentation Techniques for Improved Blind Violin/Piano Source Separation
  • Citing Conference Paper
  • September 2020

... Common approaches in this field combine convolutional neural networks (CNNs) and recurrent neural networks (RNNs). In contrast, automatic music generation has gained significant attention, with models like RNNs and transformers [8,9] and generative models such as generative adversarial networks (GANs) and diffusion models [10,11]. Datasets supporting these efforts include MIDI datasets [12,13] and waveform datasets [14,15]. ...

MuseGAN: Multi-track Sequential Generative Adversarial Networks for Symbolic Music Generation and Accompaniment