ProsodyBERT: Self-Supervised Prosody Representation for Expressive TTS
Abstract
We propose ProsodyBERT, a self-supervised approach to learning prosody representations from raw audio. Unlike most previous work, which relies on information bottlenecks to disentangle prosody features from lexical content and speaker
information, we perform offline clustering of speaker-normalized prosody-related features (energy, pitch, etc.) and
use the cluster labels as targets for HuBERT-like masked unit prediction. We also introduce a span boundary loss to capture
long-range prosodic information. We demonstrate the effectiveness of ProsodyBERT on a multi-speaker, style-controllable
text-to-speech (TTS) system, showing that a TTS system trained with ProsodyBERT features generates natural and
expressive speech, surpassing FastSpeech 2 (which directly models pitch and energy) in subjective human
evaluation.
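To make the target-construction step concrete, the sketch below illustrates per-speaker normalization of frame-level prosody features followed by offline k-means clustering, whose labels would serve as discrete targets for masked unit prediction. The feature layout, cluster count, and normalization scheme here are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch (assumed details): z-score pitch/energy frames per speaker,
# then cluster offline with k-means to obtain discrete prosody targets.
import numpy as np
from sklearn.cluster import KMeans

def normalize_per_speaker(features, speaker_ids):
    """Z-score each feature dimension within each speaker's frames.

    features:    (num_frames, num_feats) array, e.g. [log-F0, energy]
    speaker_ids: (num_frames,) array of speaker labels, one per frame
    """
    normalized = np.empty_like(features, dtype=np.float64)
    for spk in np.unique(speaker_ids):
        mask = speaker_ids == spk
        mean = features[mask].mean(axis=0)
        std = features[mask].std(axis=0) + 1e-8  # avoid division by zero
        normalized[mask] = (features[mask] - mean) / std
    return normalized

# Toy data: synthetic pitch/energy frames from two speakers.
rng = np.random.default_rng(0)
feats = np.concatenate([rng.normal(200.0, 30.0, (500, 2)),   # speaker A
                        rng.normal(120.0, 20.0, (500, 2))])  # speaker B
spk = np.array([0] * 500 + [1] * 500)

norm_feats = normalize_per_speaker(feats, spk)

# Offline clustering; the labels act as per-frame "prosody units" that a
# HuBERT-like model would be trained to predict at masked positions.
kmeans = KMeans(n_clusters=64, n_init=10, random_state=0).fit(norm_feats)
prosody_targets = kmeans.labels_
```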