Ye Wang’s research while affiliated with National University of Singapore and other places


Publications (57)


Prosody-Adaptable Audio Codecs for Zero-Shot Voice Conversion via In-Context Learning
  • Preprint

May 2025 · 2 Reads · Xintong Wang · Ye Wang

Recent advances in discrete audio codecs have significantly improved speech representation modeling, while codec language models have enabled in-context learning for zero-shot speech synthesis. Inspired by this, we propose a voice conversion (VC) model within the VALLE-X framework, leveraging its strong in-context learning capabilities for speaker adaptation. To enhance prosody control, we introduce a prosody-aware audio codec encoder (PACE) module, which isolates and refines prosody from other sources, improving expressiveness and control. By integrating PACE into our VC model, we achieve greater flexibility in prosody manipulation while preserving speaker timbre. Experimental results demonstrate that our approach outperforms baseline VC systems in prosody preservation, timbre consistency, and overall naturalness.
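
The abstract does not describe the PACE module's internals; as a loose illustration of conditioning a codec language model on prosody features isolated from other sources, here is a minimal PyTorch sketch (the module name, inputs, and dimensions are assumptions, not the authors' implementation):

# Minimal sketch (not the authors' PACE module): a prosody encoder that maps
# frame-level pitch and energy contours to embeddings which can be combined
# with content/codec token embeddings. All names and sizes are assumptions.
import torch
import torch.nn as nn

class ProsodyEncoder(nn.Module):
    def __init__(self, d_model: int = 256):
        super().__init__()
        # 2 input channels: log-F0 and energy per frame
        self.conv = nn.Sequential(
            nn.Conv1d(2, d_model, kernel_size=5, padding=2),
            nn.GELU(),
            nn.Conv1d(d_model, d_model, kernel_size=5, padding=2),
        )
        self.norm = nn.LayerNorm(d_model)

    def forward(self, f0: torch.Tensor, energy: torch.Tensor) -> torch.Tensor:
        # f0, energy: (batch, frames) -> prosody embedding (batch, frames, d_model)
        x = torch.stack([f0, energy], dim=1)      # (B, 2, T)
        h = self.conv(x).transpose(1, 2)          # (B, T, d_model)
        return self.norm(h)

# Usage: condition the codec language model on prosody by adding this embedding
# to the content token embeddings before decoding.
enc = ProsodyEncoder()
f0 = torch.randn(1, 100)       # placeholder log-F0 contour
energy = torch.randn(1, 100)   # placeholder energy contour
prosody = enc(f0, energy)      # (1, 100, 256)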




Lead Instrument Detection from Multitrack Music

March 2025

Prior approaches to lead instrument detection primarily analyze mixture audio, limited to coarse classifications and lacking generalization ability. This paper presents a novel approach to lead instrument detection in multitrack music audio by crafting expertly annotated datasets and designing a novel framework that integrates a self-supervised learning model with a track-wise, frame-level attention-based classifier. This attention mechanism dynamically extracts and aggregates track-specific features based on their auditory importance, enabling precise detection across varied instrument types and combinations. Enhanced by track classification and permutation augmentation, our model substantially outperforms existing SVM and CRNN models, showing robustness on unseen instruments and out-of-domain testing. We believe our exploration provides valuable insights for future research on audio content analysis in multitrack music settings.
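
As a rough sketch of the track-wise, frame-level attention idea (not the paper's exact architecture; the feature dimension and the upstream self-supervised encoder are assumptions), per-track frame features can be aggregated with learned attention weights before a frame-level classifier:

# Minimal sketch of track-wise, frame-level attention pooling. Given per-track
# frame features, a scorer assigns each track a weight per frame; the weighted
# sum feeds a frame-level classifier, and the weights themselves indicate which
# track dominates each frame.
import torch
import torch.nn as nn

class TrackAttentionClassifier(nn.Module):
    def __init__(self, feat_dim: int = 768, n_classes: int = 2):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)          # per-track attention score
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, tracks, frames, feat_dim), e.g. from a self-supervised
        # audio encoder run independently on each track.
        attn = torch.softmax(self.score(feats), dim=1)   # weights over tracks
        mix = (attn * feats).sum(dim=1)                  # (batch, frames, feat_dim)
        return self.classifier(mix)                      # frame-level logits

model = TrackAttentionClassifier()
x = torch.randn(2, 5, 200, 768)   # 2 clips, 5 tracks, 200 frames
logits = model(x)                  # (2, 200, 2)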


KeYric: Unsupervised Keywords Extraction and Expansion from Music for Coherent Lyrics Generation

October 2024 · 5 Reads · 1 Citation

ACM Transactions on Multimedia Computing, Communications and Applications

Xichu Ma · Varun Sharma · Min-Yen Kan · [...] · Ye Wang

We address the challenge of enhancing coherence in generated lyrics from symbolic music, particularly for creating singing-based language learning materials. Coherence, defined as the quality of being logical and consistent, forming a unified whole, is crucial for lyrics at multiple levels—word, sentence, and full-text. Additionally, it involves lyrics’ musicality—matching of style and sentiment of the music. To tackle this, we introduce KeYric, a novel system that leverages keyword skeletons to strengthen both coherence and musicality in lyrics generation. KeYric employs an innovative approach with an unsupervised keyword skeleton extractor and a graph-based skeleton expander, designed to produce a style-appropriate keyword skeleton from input music. This framework integrates the skeleton with the input music via a three-layer coherence mechanism, significantly enhancing lyric coherence by 5% in objective evaluations. Subjective assessments confirm that KeYric-generated lyrics are perceived as 19% more coherent and suitable for language learning through singing compared to existing models. Our analyses indicate that integrating genre-relevant elements, such as pitch, into music encoding is crucial, as musical genres significantly affect lyric coherence.
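
The KeYric extractor is not described here in enough detail to reproduce; as a stand-in illustration of picking a per-line keyword skeleton from lyrics in an unsupervised way, a small TF-IDF sketch (the function name and the choice of TF-IDF are assumptions, and the paper's graph-based skeleton expander is omitted):

# Minimal sketch: pick the highest-TF-IDF word of each lyric line as its
# keyword, as a stand-in for an unsupervised keyword-skeleton extractor.
from sklearn.feature_extraction.text import TfidfVectorizer

def keyword_skeleton(lyric_lines, keywords_per_line=1):
    """Return a list of top-scoring keywords per lyric line."""
    vec = TfidfVectorizer(stop_words="english")
    tfidf = vec.fit_transform(lyric_lines)        # (lines, vocab)
    vocab = vec.get_feature_names_out()
    skeleton = []
    for row in tfidf.toarray():
        top = row.argsort()[::-1][:keywords_per_line]
        skeleton.append([vocab[i] for i in top if row[i] > 0])
    return skeleton

lines = ["the river carries every word I never said",
         "morning light is breaking on the empty street"]
print(keyword_skeleton(lines))   # one keyword per line, e.g. [['river'], ['morning']]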


[Figures: an example of REMI-z tokenization; the content sequence derived by the operator C(·) from the REMI-z sequence; instrument probing results.]
Unlocking Potential in Pre-Trained Music Language Models for Versatile Multi-Track Music Arrangement
  • Preprint
  • File available

August 2024 · 25 Reads

Large language models have shown significant capabilities across various domains, including symbolic music generation. However, leveraging these pre-trained models for controllable music arrangement tasks, each requiring different forms of musical information as control, remains a novel challenge. In this paper, we propose a unified sequence-to-sequence framework that enables the fine-tuning of a symbolic music language model for multiple multi-track arrangement tasks, including band arrangement, piano reduction, drum arrangement, and voice separation. Our experiments demonstrate that the proposed approach consistently achieves higher musical quality compared to task-specific baselines across all four tasks. Furthermore, through additional probing experiments, we show that the pre-training phase equips the model with essential knowledge for understanding musical conditions, which is hard to acquire solely through task-specific fine-tuning.
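
As a small sketch of how several arrangement tasks can be cast into one sequence-to-sequence format, one option is to prefix each conditioning sequence with a task token (the token names and the toy note encoding below are assumptions, not the paper's tokenization):

# Minimal sketch: build (source, target) pairs for seq2seq fine-tuning, with a
# task token distinguishing band arrangement, piano reduction, etc.
def make_example(task, condition_tokens, target_tokens):
    """Prefix the conditioning sequence with a task token."""
    source = ["<task:%s>" % task] + condition_tokens
    return source, target_tokens

# Band arrangement: condition on a melody plus the desired instrument set.
src, tgt = make_example(
    "band_arrangement",
    ["<instruments>", "guitar", "bass", "drums", "<melody>", "C4_q", "E4_q"],
    ["<track:guitar>", "C3_q", "<track:bass>", "C2_h"],
)

# Piano reduction: condition on the full multi-track score.
src2, tgt2 = make_example(
    "piano_reduction",
    ["<track:strings>", "G4_h", "<track:brass>", "B3_h"],
    ["<track:piano>", "G4_h", "B3_h"],
)
print(src, "->", tgt)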


End-to-End Real-World Polyphonic Piano Audio-to-Score Transcription with Hierarchical Decoding

August 2024 · 1 Read

Piano audio-to-score transcription (A2S) is an important yet underexplored task with extensive applications for music composition, practice, and analysis. However, existing end-to-end piano A2S systems face difficulties in retrieving bar-level information such as key and time signatures, and have been trained and evaluated only on synthetic data. To address these limitations, we propose a sequence-to-sequence (Seq2Seq) model with a hierarchical decoder that aligns with the hierarchical structure of musical scores, enabling the transcription of score information at both the bar and note levels through multi-task learning. To bridge the gap between synthetic data and recordings of human performance, we propose a two-stage training scheme: pre-training the model using an expressive performance rendering (EPR) system on synthetic audio, followed by fine-tuning the model on recordings of human performance. To preserve the voicing structure for score reconstruction, we propose a pre-processing method for **Kern scores in scenarios with an unconstrained number of voices. Experimental results demonstrate the effectiveness of our approach, both in transcription performance on synthetic audio compared with the current state-of-the-art and in the first experiment on human recordings.
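
A minimal sketch of the bar-then-note hierarchical decoding structure (layer sizes, vocabularies, and the way bar states condition the note decoder are assumptions; this illustrates the structure only, not the paper's model):

# Minimal sketch: a bar-level decoder predicts bar attributes (e.g. key/time
# signature); its hidden states then condition a note-level decoder.
# Causal and padding masks are omitted for brevity.
import torch
import torch.nn as nn

class HierarchicalDecoder(nn.Module):
    def __init__(self, d_model=256, bar_vocab=64, note_vocab=512):
        super().__init__()
        layer = lambda: nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.bar_emb = nn.Embedding(bar_vocab, d_model)
        self.note_emb = nn.Embedding(note_vocab, d_model)
        self.bar_dec = nn.TransformerDecoder(layer(), num_layers=2)
        self.note_dec = nn.TransformerDecoder(layer(), num_layers=2)
        self.bar_head = nn.Linear(d_model, bar_vocab)
        self.note_head = nn.Linear(d_model, note_vocab)

    def forward(self, audio_enc, bar_tokens, note_tokens):
        # audio_enc: (B, T_audio, d); bar_tokens: (B, n_bars); note_tokens: (B, n_notes)
        bar_h = self.bar_dec(self.bar_emb(bar_tokens), memory=audio_enc)
        # The note decoder attends to both the audio encoding and the bar states.
        memory = torch.cat([audio_enc, bar_h], dim=1)
        note_h = self.note_dec(self.note_emb(note_tokens), memory=memory)
        return self.bar_head(bar_h), self.note_head(note_h)

dec = HierarchicalDecoder()
audio = torch.randn(1, 300, 256)
bars = torch.randint(0, 64, (1, 8))
notes = torch.randint(0, 512, (1, 120))
bar_logits, note_logits = dec(audio, bars, notes)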


XAI-Lyricist: Improving the Singability of AI-Generated Lyrics with Prosody Explanations

August 2024 · 23 Reads · 1 Citation

Explaining the singability of lyrics is an important but missing ability of language models (LMs) in song lyrics generation. This ability allows songwriters to quickly assess if LM-generated lyrics can be sung harmoniously with melodies and helps singers align lyrics with melodies during practice. This paper presents XAI-Lyricist, leveraging musical prosody to guide LMs in generating singable lyrics and providing human-understandable singability explanations. We employ a Transformer model to generate lyrics under musical prosody constraints and provide demonstrations of the lyrics' prosody patterns as singability explanations. XAI-Lyricist is evaluated by computational metrics (perplexity, prosody-BLEU) and a human-grounded study (human ratings, average time and number of attempts for singing). Experimental results show that musical prosody can significantly improve the singability of LM-generated lyrics. A controlled study with 14 singers also confirms the usefulness of the provided explanations in helping them to interpret lyrical singability faster than reading plain text lyrics.
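
As an illustration of one simple way a melody's prosody could be turned into a human-readable constraint, the sketch below maps note durations to a stressed/unstressed syllable template (the threshold and the representation are assumptions; the paper's prosody encoding may differ):

# Minimal sketch: derive a syllable-stress template from note durations.
def prosody_template(durations_in_beats, stress_threshold=1.0):
    """Mark notes at or above the threshold as stressed ('1'), else unstressed ('0')."""
    return "".join("1" if d >= stress_threshold else "0" for d in durations_in_beats)

melody = [0.5, 0.5, 1.0, 0.5, 2.0]   # quarter note = 1.0 beat
print(prosody_template(melody))       # "00101"
# A generated line could then be checked against this stressed/unstressed
# pattern, and the template itself shown as a singability explanation.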


End-to-End Real-World Polyphonic Piano Audio-to-Score Transcription with Hierarchical Decoding

May 2024 · 3 Reads

Piano audio-to-score transcription (A2S) is an important yet underexplored task with extensive applications for music composition, practice, and analysis. However, existing end-to-end piano A2S systems face difficulties in retrieving bar-level information such as key and time signatures, and have been trained and evaluated only on synthetic data. To address these limitations, we propose a sequence-to-sequence (Seq2Seq) model with a hierarchical decoder that aligns with the hierarchical structure of musical scores, enabling the transcription of score information at both the bar and note levels through multi-task learning. To bridge the gap between synthetic data and recordings of human performance, we propose a two-stage training scheme: pre-training the model using an expressive performance rendering (EPR) system on synthetic audio, followed by fine-tuning the model on recordings of human performance. To preserve the voicing structure for score reconstruction, we propose a pre-processing method for **Kern scores in scenarios with an unconstrained number of voices. Experimental results demonstrate the effectiveness of our approach, both in transcription performance on synthetic audio compared with the current state-of-the-art and in the first experiment on human recordings.



Citations (29)


... Recently, multi-modal machine learning has significantly advanced various fields in Music Information Retrieval (MIR), including video-to-music generation [32], music video (MV) generation [4], and lyrics generation [21]. Among these techniques, Automatic Stage ...

Reference:

Automatic Stage Lighting Control: Is it a Rule-Driven Process or Generative Task?
KeYric: Unsupervised Keywords Extraction and Expansion from Music for Coherent Lyrics Generation
  • Citing Article
  • October 2024

ACM Transactions on Multimedia Computing, Communications and Applications

... This section outlines the key privacy challenges associated with LLM systems. Our comprehensive analysis of the literature reveals four main categories of privacy challenges: (i) privacy issues related to LLM training data [36,80], (ii) privacy challenges in the interaction with LLM systems via user prompts [68,146], (iii) privacy vulnerabilities in LLM-generated outputs [86,124], and (iv) privacy challenges involving LLM agents [39,109]. Figure 1 shows the multi-faceted view of four identified privacy challenges in LLM systems. ...

Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast
  • Citing Conference Paper
  • May 2024

... Some works [8,9,10] have focused on handling style characteristics of singing, including pitch styles. Vibrato control is achieved in [8] by extracting vibrato extent from the power spectrogram of the first-order difference. ...

SinTechSVS: A Singing Technique Controllable Singing Voice Synthesis System
  • Citing Article
  • January 2024

IEEE/ACM Transactions on Audio Speech and Language Processing

... This message reflects concern about possible outside influences that could change local customs and culture. Regional songs can raise awareness of the importance of preserving local culture (Gu et al., 2024). By informing the community about the meaning and values contained in these songs, it is hoped that people will better appreciate and preserve their cultural heritage. ...

Automatic Lyric Transcription and Automatic Music Transcription from Multimodal Singing
  • Citing Article
  • March 2024

ACM Transactions on Multimedia Computing, Communications and Applications

... [Fragment of a survey table comparing cross-modal music generation systems (e.g., MusicLM, Noise2Music, Moûsai, AudioLDM 2, SongMASS, TeleMelody, ReLyMe, SongGLM, SDMuse, Drawlody, LORIS), listing each system's year, conditioning input, music representation, fusion method, and generative model.] Ensuring expressiveness and realism in synthesized audio, Wu et al. [176] further enhanced controllability, allowing users to comprehensively control details at various levels. However, these previous works overlooked the impact of expressive timing in real performances on musical expressiveness. ...

Drawlody: Sketch-Based Melody Creation With Enhanced Usability and Interpretability
  • Citing Article
  • January 2024

IEEE Transactions on Multimedia

... While the tradeoffs between privacy and utility and fairness and utility have been extensively discussed [23,24,25], there remains a gap in understanding the interplay between privacy and fairness specifically in the speech processing field. Existing research in other domains has shown conflicting findings regarding the relationship between privacy and fairness [1]. ...

Elucidate Gender Fairness in Singing Voice Transcription
  • Citing Conference Paper
  • October 2023

... With the advent of generative models such as Transformers and Diffusion models, various works have investigated the quality and diversity of AI-generated music for different applications [2][3][4]. Examples of tasks that can be accomplished include style change [5], orchestration [6], and generating novel music pieces from a given text prompt [7]. ...

Q&A: Query-Based Representation Learning for Multi-Track Symbolic Music re-Arrangement
  • Citing Conference Paper
  • August 2023

... Tian et al. (2023) generated lyrics to a melody without needing melody-lyrics aligned data for training. Studies on automatic song translation have been done mainly on Chinese: Guo et al. (2022) focused on translating lyrics for tonal languages, and Ou et al. (2023) used prompted machine translation with melody-based word boundaries for Chinese lyrics translation. ...

Songs Across Borders: Singable and Controllable Neural Lyric Translation

... FedNH [36] and Fed-CBS [37] alleviate client-wise class imbalance by a client sampling strategy that generates grouped class-balanced datasets and by utilizing the uniformity and semantics of class prototypes, respectively. Scaffold [38] and FedNP [39] alleviate non-iid data issues by deliberately handling distribution diversity in distributed datasets. FedST [40] and FedA3I [41] enhance the contribution of high-quality local models during aggregation, aiming to achieve superior aggregated models. ...

FedNP: Towards Non-IID Federated Learning via Federated Neural Propagation
  • Citing Article
  • June 2023

Proceedings of the AAAI Conference on Artificial Intelligence

... For the quality of generated results, we assess how well the infilled part aligns with its past (p) and future (f) content using the average overlapped area of pitch distribution (PD), duration distribution (DD), and chord distribution (CD). These metrics, widely recognized in music composition for similarity comparisons (Hu et al. 2024; Dai et al. 2023; Min et al. 2023; Chang, Lee, and Yang 2021), yield six measures: PD_p, PD_f, DD_p, DD_f, CD_p, and CD_f. For user responses, we focus on two perspectives: the increase in user satisfaction (IUS) and the reduction in interaction cost (RIC). ...
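
For concreteness, the overlapped area between two distributions can be computed as the sum of element-wise minima of the normalized histograms; a minimal NumPy sketch (the 12-bin pitch-class histograms below are made-up example data, and the cited papers' exact binning choices are not reproduced):

# Minimal sketch: overlapped area of two discrete distributions
# (1.0 means identical histograms, 0.0 means disjoint support).
import numpy as np

def overlapped_area(p, q):
    """Sum of element-wise minima of two normalized histograms."""
    p = np.asarray(p, dtype=float); p = p / p.sum()
    q = np.asarray(q, dtype=float); q = q / q.sum()
    return float(np.minimum(p, q).sum())

pitch_hist_infilled = [4, 2, 0, 1, 3, 0, 0, 2, 0, 1, 0, 0]
pitch_hist_past     = [3, 1, 1, 2, 4, 0, 0, 3, 0, 0, 0, 0]
print(round(overlapped_area(pitch_hist_infilled, pitch_hist_past), 3))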

Personalised popular music generation using imitation and structure

Journal of New Music Research