Yi-Hsuan Yang
Academia Sinica · Research Center for Information Technology Innovation

PhD

About

197 Publications
81,899 Reads
4,926 Citations
Additional affiliations
September 2011 - present · Academia Sinica · Position: Research Assistant
July 2006 - July 2010 · National Taiwan University · Position: PhD Student

Publications

Publications (197)
Preprint
Full-text available
Even with strong sequence models like Transformers, generating expressive piano performances with long-range musical structures remains challenging. Meanwhile, methods to compose well-structured melodies or lead sheets (melody + chords), i.e., simpler forms of music, gained more success. Observing the above, we devise a two-stage Transformer-based...
Preprint
While generative adversarial networks (GANs) have been widely used in research on audio generation, the training of a GAN model is known to be unstable, time consuming, and data inefficient. Among the attempts to ameliorate the training process of GANs, the idea of Projected GAN emerges as an effective solution for GAN-based image generation, estab...
Preprint
A vocoder is a conditional audio generation model that converts acoustic features such as mel-spectrograms into waveforms. Taking inspiration from Differentiable Digital Signal Processing (DDSP), we propose a new vocoder named SawSing for singing voices. SawSing synthesizes the harmonic part of singing voices by filtering a sawtooth source signal w...
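To illustrate the general DDSP-style idea of driving an oscillator with a pitch contour, the following minimal sketch renders a sawtooth source from a frame-level f0 curve. The function name, sample rate, and hop size are assumptions for illustration; this is not the SawSing implementation.

```python
import numpy as np

def sawtooth_from_f0(f0_hz, sr=24000, hop=240):
    """Render a sawtooth source from a frame-level f0 contour (toy sketch)."""
    f0 = np.repeat(f0_hz, hop)          # upsample f0 to the sample rate (nearest frame)
    phase = np.cumsum(f0 / sr)          # instantaneous phase in cycles
    saw = 2.0 * (phase % 1.0) - 1.0     # sawtooth waveform in [-1, 1]
    saw[f0 == 0] = 0.0                  # silence unvoiced frames
    return saw

f0 = np.full(100, 220.0)                # 100 frames of a constant 220 Hz pitch
audio = sawtooth_from_f0(f0)            # about one second of audio at 24 kHz
```

In a DDSP-style vocoder, such a source would then be shaped by learned filters to produce the harmonic part of the voice.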
Preprint
Full-text available
In this paper, we propose a new dataset named EGDB, that contains transcriptions of the electric guitar performance of 240 tablatures rendered with different tones. Moreover, we benchmark the performance of two well-known transcription models proposed originally for the piano on this dataset, along with a multi-loss Transformer model that we new...
Conference Paper
Singing voice synthesis (SVS) systems are built to generate human-like voice signals from lyrics and the corresponding musical scores. In most SVS systems, a neural network-based auxiliary duration model is employed to control the duration of phonemes. In this paper, a rule-based algorithm inspired by Mandarin phonology is proposed for the duration...
Preprint
Full-text available
Attention-based Transformer models have been increasingly employed for automatic music generation. To condition the generation process of such a model with a user-specified sequence, a popular approach is to take that conditioning sequence as a priming sequence and ask a Transformer decoder to generate a continuation. However, this prompt-based con...
Preprint
Full-text available
Recent years have witnessed a growing interest in research related to the detection of piano pedals from audio signals in the music information retrieval community. However, to our best knowledge, recent generative models for symbolic music have rarely taken piano pedals into account. In this work, we employ the transcription model proposed by Kong...
Preprint
In this paper, we propose a novel neural network model called KaraSinger for a less-studied singing voice synthesis (SVS) task named score-free SVS, in which the prosody and melody are spontaneously decided by machine. KaraSinger comprises a vector-quantized variational autoencoder (VQ-VAE) that compresses the Mel-spectrograms of singing audio to s...
Preprint
Full-text available
While there are many music datasets with emotion labels in the literature, they cannot be used for research on symbolic-domain music analysis or generation, as there are usually audio files only. In this paper, we present the EMOPIA (pronounced 'yee-mò-pi-uh') dataset, a shared multi-modal (audio and MIDI) database focusing on perceived emotion...
Article
This letter presents a novel system architecture that integrates blind source separation with joint beat and downbeat tracking in musical audio signals. The source separation module segregates the percussive and non-percussive components of the input signal, over which beat and downbeat tracking are performed separately and then the results are agg...
Preprint
Recent advances in Transformer models allow for unprecedented sequence lengths, due to linear space and time complexity. In the meantime, relative positional encoding (RPE) was proposed as beneficial for classical Transformers and consists in exploiting lags instead of absolute positions for inference. Still, RPE is not available for the recent lin...
Article
To apply neural sequence models such as the Transformers to music generation tasks, one has to represent a piece of music by a sequence of tokens drawn from a finite set of pre-defined vocabulary. Such a vocabulary usually involves tokens of various types. For example, to describe a musical note, one needs separate tokens to indicate the note’s pit...
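For readers unfamiliar with such multi-type token vocabularies, the toy sketch below encodes a single note as separate typed tokens. The token names are hypothetical and only illustrate the idea described in the abstract.

```python
def note_to_tokens(pitch, duration, velocity):
    """Encode one note as several typed tokens (hypothetical vocabulary)."""
    return [f"Pitch_{pitch}", f"Duration_{duration}", f"Velocity_{velocity}"]

print(note_to_tokens(60, 8, 90))   # ['Pitch_60', 'Duration_8', 'Velocity_90']
```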
Preprint
Full-text available
Transformers and variational autoencoders (VAE) have been extensively employed for symbolic (e.g., MIDI) domain music generation. While the former boast an impressive capability in modeling long sequences, the latter allow users to willingly exert control over different parts (e.g., bars) of the music to be generated. In this paper, we are interest...
Article
The task of automatic melody harmonization aims to build a model that generates a chord sequence as the harmonic accompaniment of a given multiple-bar melody sequence. In this paper, we present a comparative study evaluating the performance of canonical approaches to this task, including template matching, hidden Markov model, genetic algorithm and...
Preprint
Full-text available
To apply neural sequence models such as the Transformers to music generation tasks, one has to represent a piece of music by a sequence of tokens drawn from a finite set of pre-defined vocabulary. Such a vocabulary usually involves tokens of various types. For example, to describe a musical note, one needs separate tokens to indicate the note's pit...
Preprint
A DJ mix is a sequence of music tracks concatenated seamlessly, typically rendered for audiences in a live setting by a DJ on stage. As a DJ mix is produced in a studio or the live version is recorded for music streaming services, computational methods to analyze DJ mixes, for example, extracting track information or understanding DJ techniques, ha...
Preprint
Full-text available
Blind music source separation has been a popular and active subject of research in both the music information retrieval and signal processing communities. To counter the lack of available multi-track data for supervised model training, a data augmentation method that creates artificial mixtures by combining tracks from different songs has been show...
Preprint
Deep learning algorithms are increasingly developed for learning to compose music in the form of MIDI files. However, whether such algorithms work well for composing guitar tabs, which are quite different from MIDIs, remains relatively unexplored. To address this, we build a model for composing fingerstyle guitar tabs with Transformer-XL, a neural s...
Preprint
Full-text available
This paper presents the Jazz Transformer, a generative model that utilizes a neural sequence model called the Transformer-XL for modeling lead sheets of Jazz music. Moreover, the model endeavors to incorporate structural events present in the Weimar Jazz Database (WJazzD) for inducing structures in the generated music. While we are able to reduce t...
Preprint
In a recent paper, we have presented a generative adversarial network (GAN)-based model for unconditional generation of the mel-spectrograms of singing voices. As the generator of the model is designed to take a variable-length sequence of noise vectors as input, it can generate mel-spectrograms of variable length. However, our previous listening t...
Preprint
Identifying singers is an important task with many applications. However, the task remains challenging due to many issues. One major issue is related to the confounding factors from the background instrumental music that is mixed with the vocals in music production. A singer identification model may learn to extract non-vocal related features from...
Preprint
The task of automatic music composition entails generative modeling of music in symbolic formats such as the musical scores. By serializing a score as a sequence of MIDI-like events, recent work has demonstrated that state-of-the-art sequence models with self-attention work nicely for this task, especially for composing music with long-range coherence...
Preprint
Full-text available
Several prior works have proposed various methods for the task of automatic melody harmonization, in which a model aims to generate a sequence of chords to serve as the harmonic accompaniment of a given multiple-bar melody sequence. In this paper, we present a comparative study evaluating and comparing the performance of a set of canonical approach...
Preprint
Generative models for singing voice have been mostly concerned with the task of "singing voice synthesis," i.e., to produce singing voice waveforms given musical scores and text lyrics. In this work, we explore a novel yet challenging alternative: singing voice generation without pre-assigned scores and lyrics, in both training and inference time....
Article
Vector-valued neural learning has emerged as a promising direction in deep learning recently. Traditionally, training data for neural networks (NNs) are formulated as a vector of scalars; however, its performance may not be optimal since associations among adjacent scalars are not modeled. In this article, we propose a new vector neural architectur...
Preprint
In this paper, we tackle the problem of transfer learning for automatic Jazz generation. Jazz is one of the representative types of music, but the lack of Jazz data in the MIDI format hinders the construction of a generative model for Jazz. Transfer learning is an approach aiming to solve the problem of data insufficiency, so as to transfer the common...
Conference Paper
Full-text available
Music creation involves not only composing the different parts (e.g., melody, chords) of a musical work but also arranging/selecting the instruments to play the different parts. While the former has received increasing attention, the latter has not been much investigated. This paper presents, to the best of our knowledge, the first deep learning mo...
Conference Paper
Stacked dilated convolutions used in Wavenet have been shown effective for generating high-quality audio. By replacing pooling/striding with dilation in convolution layers, they can preserve high-resolution information and still reach distant locations. Producing high-resolution predictions is also crucial in music source separation, whose goal is...
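The mechanism named here can be sketched in a few lines of PyTorch: a stack of dilated 1-D convolutions whose receptive field grows exponentially while the temporal resolution stays unchanged. The channel widths and dilation rates below are arbitrary choices for illustration, not those of the paper.

```python
import torch
import torch.nn as nn

# Dilation (instead of pooling/striding) reaches distant time steps without downsampling.
layers = []
for d in (1, 2, 4, 8):
    layers += [nn.Conv1d(32, 32, kernel_size=3, dilation=d, padding=d), nn.ReLU()]
net = nn.Sequential(*layers)

x = torch.randn(1, 32, 1000)
print(net(x).shape)   # torch.Size([1, 32, 1000]): same temporal length as the input
```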
Conference Paper
Full-text available
We present in this paper PerformanceNet, a neural network model we proposed recently to achieve score-to-audio music generation. The model learns to convert a music piece from the symbolic domain to the audio domain, assigning performance-level attributes such as changes in velocity automatically to the music and then synthesizing the audio. The m...
Article
Full-text available
Music creation is typically composed of two parts: composing the musical score, and then performing the score with instruments to make sounds. While recent work has made much progress in automatic music generation in the symbolic domain, few attempts have been made to build an AI model that can render realistic music audio from musical scores. Dire...
Article
Full-text available
The employment of playing techniques such as string bend and vibrato in electric guitar performance makes it difficult to transcribe the note events using general note tracking methods. These methods analyze the contour of fundamental frequency computed from a given audio signal, but they do not consider the variation in the contour caused by the p...
Preprint
Stacked dilated convolutions used in Wavenet have been shown effective for generating high-quality audio. By replacing pooling/striding with dilation in convolution layers, they can preserve high-resolution information and still reach distant locations. Producing high-resolution predictions is also crucial in music source separation, whose goal is...
Preprint
Full-text available
Music creation involves not only composing the different parts (e.g., melody, chords) of a musical work but also arranging/selecting the instruments to play the different parts. While the former has received increasing attention, the latter has not been much investigated. This paper presents, to the best of our knowledge, the first deep learning mo...
Conference Paper
We present collaborative similarity embedding (CSE), a unified framework that exploits comprehensive collaborative relations available in a user-item bipartite graph for representation learning and recommendation. In the proposed framework, we differentiate two types of proximity relations: direct proximity and k-th order neighborhood proximity. Wh...
Preprint
We present collaborative similarity embedding (CSE), a unified framework that exploits comprehensive collaborative relations available in a user-item bipartite graph for representation learning and recommendation. In the proposed framework, we differentiate two types of proximity relations: direct proximity and k-th order neighborhood proximity. Wh...
Preprint
Full-text available
New machine learning algorithms are being developed to solve problems in different areas, including music. Intuitive, accessible, and understandable demonstrations of the newly built models could help attract the attention of people from different disciplines and evoke discussions. However, we notice that it has not been a common practice for resea...
Preprint
Full-text available
Recent work has proposed various adversarial losses for training generative adversarial networks. Yet, it remains unclear which types of functions are valid adversarial loss functions, and how these loss functions perform against one another. In this paper, we aim to gain a deeper understanding of adversarial losses by decoupling the effects...
Article
Over the last decade, music-streaming services have grown dramatically. Pandora, one company in the field, has pioneered and popularized streaming music by successfully deploying the Music Genome Project [1] (https://www.pandora.com/about/mgp) based on human-annotated content analysis. Another company, Spotify, has a catalog of over 40 million song...
Article
This paper presents a fast Tensor Factorization (TF) algorithm for context-aware recommendation from implicit feedback. For such a recommendation problem, the observed data indicate the (positive) association between users and items in some given contexts. For better accuracy, it has been shown essential to include unobserved data that indicate the...
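As a rough illustration of how a tensor factorization scores a (user, item, context) triple, here is a CP-style three-way inner product in NumPy. The latent dimensionality and the factorization form are assumptions for illustration, not necessarily the exact model of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                  # latent dimensionality (arbitrary)
P = rng.normal(size=(1000, d))         # user factors
Q = rng.normal(size=(5000, d))         # item factors
C = rng.normal(size=(20, d))           # context factors

def score(u, i, c):
    # CP-style prediction: element-wise product of the three latent vectors, summed
    return float(np.sum(P[u] * Q[i] * C[c]))

print(score(3, 42, 1))
```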
Preprint
In this paper, we introduce a novel attentional similarity module for the problem of few-shot sound recognition. Given a few examples of an unseen sound event, a classifier must be quickly adapted to recognize the new sound event without much fine-tuning. The proposed attentional similarity module can be plugged into any metric-based learning metho...
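For context, the metric-based few-shot setup such a module plugs into can be sketched as nearest-prototype classification over embeddings. The code below is that generic baseline, not the proposed attentional similarity module.

```python
import torch
import torch.nn.functional as F

def prototype_classify(support, support_labels, query, n_classes):
    """Generic metric-based few-shot classifier: nearest class prototype."""
    protos = torch.stack(
        [support[support_labels == c].mean(0) for c in range(n_classes)])
    dists = torch.cdist(query, protos)     # distance of each query to each prototype
    return F.softmax(-dists, dim=-1)       # smaller distance -> higher class probability

support = torch.randn(10, 64)              # 10 labelled support embeddings
labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])
query = torch.randn(3, 64)
print(prototype_classify(support, labels, query, n_classes=5).shape)   # (3, 5)
```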
Preprint
Full-text available
Timbre and pitch are the two main perceptual properties of musical sounds. Depending on the target applications, we sometimes prefer to focus on one of them, while reducing the effect of the other. Researchers have managed to hand-craft such timbre-invariant or pitch-invariant features using domain knowledge and signal processing techniques, but it...
Preprint
Full-text available
For many music analysis problems, we need to know the presence of instruments for each time frame in a multi-instrument musical piece. However, such a frame-level instrument recognition task remains difficult, mainly due to the lack of labeled datasets. To address this issue, we present in this paper a large-scale dataset that contains synthetic po...
Conference Paper
Hypergraph is a data structure commonly used to represent connections and relations between multiple objects. Embedding a hypergraph into a low-dimensional space and representing each vertex as a vector is useful in various tasks such as visualization, classification, and link prediction. However, most hypergraph embedding or learning algorithms re...
Preprint
Full-text available
We propose the BinaryGAN, a novel generative adversarial network (GAN) that uses binary neurons at the output layer of the generator. We employ the sigmoid-adjusted straight-through estimators to estimate the gradients for the binary neurons and train the whole network by end-to-end backpropagation. The proposed model is able to directly generate b...
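The sigmoid-adjusted straight-through estimator mentioned in the abstract can be written in a few lines of PyTorch. This is a generic sketch of the estimator, not the BinaryGAN code.

```python
import torch

class SigmoidSTE(torch.autograd.Function):
    """Binary neuron: hard threshold forward, sigmoid gradient backward (sketch)."""
    @staticmethod
    def forward(ctx, logits):
        probs = torch.sigmoid(logits)
        ctx.save_for_backward(probs)
        return (probs > 0.5).float()                 # binary output in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        (probs,) = ctx.saved_tensors
        return grad_output * probs * (1.0 - probs)   # back-propagate as if it were a sigmoid

x = torch.randn(4, requires_grad=True)
y = SigmoidSTE.apply(x)
y.sum().backward()                                   # gradients flow despite the hard threshold
print(x.grad)
```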
Article
Music videos are one of the most popular types of content on video streaming services, and instrument playing is among the most common scenes in such videos. In order to understand the instrument-playing scenes in the videos, it is important to know what instruments are played, when they are played, and where the playing actions occur in the scene. While audi...
Conference Paper
Full-text available
It has been shown recently that deep convolutional generative adversarial networks (GANs) can learn to generate music in the form of piano-rolls, which represent music by binary-valued time-pitch matrices. However, existing models can only generate real-valued piano-rolls and require further post-processing, such as hard thresholding (HT) or Bernou...
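The two post-processing steps named in the abstract, hard thresholding (HT) and Bernoulli sampling (BS), amount to the following one-liners on a real-valued piano-roll; the array shape is arbitrary and only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
probs = rng.uniform(size=(4, 96, 84))                  # (track, time step, pitch), values in [0, 1]

binary_ht = (probs > 0.5).astype(np.uint8)             # hard thresholding (HT)
binary_bs = rng.binomial(1, probs).astype(np.uint8)    # Bernoulli sampling (BS)
```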
Preprint
Can we make a famous rap singer like Eminem sing any song we like? Singing style transfer attempts to make this possible by replacing the vocals of the source singer with those of the target singer. This paper presents a method that learns from unpaired data for singing style transfer using generative adversarial networks.
Preprint
Convolutional neural networks with skip connections have shown good performance in music source separation. In this work, we propose a denoising Auto-encoder with Recurrent skip Connections (ARC). We use 1D convolution along the temporal axis of the time-frequency feature map in all layers of the fully-convolutional network. The use of 1D convoluti...
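To make concrete what 1D convolution along the temporal axis of a time-frequency feature map means, the sketch below treats frequency bins as channels and convolves over time only. The sizes are illustrative, not the ARC architecture.

```python
import torch
import torch.nn as nn

spec = torch.randn(1, 513, 256)    # (batch, frequency bins as channels, time frames)
conv = nn.Conv1d(in_channels=513, out_channels=256, kernel_size=3, padding=1)
print(conv(spec).shape)            # torch.Size([1, 256, 256]): filtered along time only
```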
Conference Paper
Full-text available
Music recommender systems can offer users personalized and contextualized recommendation and are therefore important for music information retrieval. An increasing number of datasets have been compiled to facilitate research on different topics, such as content-based, context-based or next-song recommendation. However, these topics are usually addr...
Conference Paper
Making sense of the surrounding context and ongoing events through not only the visual inputs but also acoustic cues is critical for various AI applications. This paper presents an attempt to learn a neural network model that recognizes more than 500 different sound events from the audio part of user generated videos (UGV). Aside from the large num...
Preprint
Full-text available
Instrument recognition is a fundamental task in music information retrieval, yet little has been done to predict the presence of instruments in multi-instrument music for each time frame. This task is important for not only automatic transcription but also many retrieval problems. In this paper, we use the newly released MusicNet dataset to study t...
Article
Mood and emotion play an important role when it comes to choosing musical tracks to listen to. In the field of music information retrieval and recommendation, emotion is considered contextual information that is hard to capture, albeit highly influential. In this study, we analyze the connection between users' emotional states and their musical cho...
Preprint
Vector-valued neural learning has emerged as a promising direction in deep learning recently. Traditionally, training data for neural networks (NNs) are formulated as a vector of scalars; however, its performance may not be optimal since associations among adjacent scalars are not modeled. In this paper, we propose a new vector neural architecture...