ProsodyBERT: Self-Supervised Prosody Representation for Expressive TTS
Abstract
We propose ProsodyBERT, a self-supervised approach to learning prosody representations from raw audio. Unlike most previous work, which relies on information bottlenecks to disentangle prosody features from lexical content and speaker
information, we perform offline clustering of speaker-normalized prosody-related features (energy, pitch, etc.) and
use the cluster labels as targets for HuBERT-like masked unit prediction. We also introduce a span boundary loss to capture
long-range prosodic information. We demonstrate the effectiveness of ProsodyBERT on a multi-speaker, style-controllable
text-to-speech (TTS) system, showing that a TTS system trained with ProsodyBERT features generates natural and
expressive speech, surpassing FastSpeech 2 (which directly models pitch and energy) in subjective human
evaluation.
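To make the target-construction step concrete, the sketch below illustrates per-speaker normalization of frame-level prosody features followed by offline k-means clustering, whose labels would serve as discrete targets for masked unit prediction. The feature layout, cluster count, and normalization scheme here are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch (assumed details): z-score pitch/energy frames per speaker,
# then cluster offline with k-means to obtain discrete prosody targets.
import numpy as np
from sklearn.cluster import KMeans

def normalize_per_speaker(features, speaker_ids):
    """Z-score each feature dimension within each speaker's frames.

    features:    (num_frames, num_feats) array, e.g. [log-F0, energy]
    speaker_ids: (num_frames,) array of speaker labels, one per frame
    """
    normalized = np.empty_like(features, dtype=np.float64)
    for spk in np.unique(speaker_ids):
        mask = speaker_ids == spk
        mean = features[mask].mean(axis=0)
        std = features[mask].std(axis=0) + 1e-8  # avoid division by zero
        normalized[mask] = (features[mask] - mean) / std
    return normalized

# Toy data: synthetic pitch/energy frames from two speakers.
rng = np.random.default_rng(0)
feats = np.concatenate([rng.normal(200.0, 30.0, (500, 2)),   # speaker A
                        rng.normal(120.0, 20.0, (500, 2))])  # speaker B
spk = np.array([0] * 500 + [1] * 500)

norm_feats = normalize_per_speaker(feats, spk)

# Offline clustering; the labels act as per-frame "prosody units" that a
# HuBERT-like model would be trained to predict at masked positions.
kmeans = KMeans(n_clusters=64, n_init=10, random_state=0).fit(norm_feats)
prosody_targets = kmeans.labels_
```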