Abstract
We adopt the recent cycle-consistent generative adversarial network MaskCycleGAN-VC, which allows a specific singing technique to be converted using only a few articulations of the singing voice as examples. Because the conversion often fails to preserve the content of the singing voice owing to distortion and noise, we propose a self-supervised learning module, built on this network as the basic framework, to enforce content consistency without additional annotations. We evaluate the proposed methods on datasets covering three singing techniques commonly used in pop songs: breathy voice, vibrato, and vocal fry. Experiments show that our methods outperform the baseline in audio quality and content preservation, including melody and the singer's timbral identity, without affecting the perception of the singing techniques.
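To illustrate the two ingredients named above, the sketch below shows one way a cycle-consistency reconstruction loss can be combined with a triplet-based content-consistency term. The generator names, content encoder, loss weights, and negative-sampling strategy are illustrative assumptions for this sketch, not the paper's actual implementation.

```python
# Minimal sketch (PyTorch): cycle-consistency + triplet content consistency.
# All module names and hyperparameters here are assumptions for illustration.
import torch
import torch.nn.functional as F

def conversion_losses(G_AB, G_BA, content_encoder, mel_A, mel_B,
                      lambda_cyc=10.0, margin=0.5):
    """mel_A, mel_B: batches of mel-spectrograms from two singing techniques."""
    fake_B = G_AB(mel_A)   # convert technique A -> B
    rec_A = G_BA(fake_B)   # convert back B -> A

    # Cycle consistency: the round-trip reconstruction should match the input.
    loss_cyc = F.l1_loss(rec_A, mel_A)

    # Self-supervised content consistency via a triplet objective:
    # in content-embedding space, the converted clip (anchor) should stay
    # closer to its own source (positive) than to an unrelated clip (negative).
    anchor = content_encoder(fake_B)
    positive = content_encoder(mel_A)
    negative = content_encoder(mel_B[torch.randperm(mel_B.size(0))])
    loss_triplet = F.triplet_margin_loss(anchor, positive, negative, margin=margin)

    return lambda_cyc * loss_cyc + loss_triplet
```

In practice this total loss would be added to the adversarial objective of the GAN; the triplet term needs no labels, which is what makes the content-consistency supervision self-supervised.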
| Original language | English |
|---|---|
| Journal | ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings |
| DOIs | |
| State | Published - 2023 |
| Event | 48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023, Rhodes Island, Greece. Duration: 4 June 2023 → 10 June 2023 |
Bibliographical note
Publisher Copyright: © 2023 IEEE.
Keywords
- cycle consistency
- few-shot learning
- generative adversarial networks
- singing technique conversion
- triplet learning