PiCo-VITS: Leveraging Pitch Contours for Fine-grained Emotional Speech Synthesis

-----------------------------> Audio Samples <---------------------------

Prosody-Tacotron [1]: A state-of-the-art extension to Tacotron that synthesizes speech to match the prosody of a reference audio clip.

VITS-E: VITS with an additional embedding bank for global emotion control.

VITS-ES: VITS with an additional embedding bank and a spectrogram encoder for mixed emotion control.

PiCo-VITS (proposed): An end-to-end speech synthesis architecture that leverages pitch contours to synthesize speech with mixed emotions.

Baseline Comparison
Sentence	Speaker	Ground Truth	Prosody-Tacotron	VITS-E	VITS-ES	PiCo-VITS (Ours)
I can smell the breath of an English.	Speaker 11
Said the American to Chinese.	Speaker 11
I suppose no, it doesn't!	Speaker 12
When such wanderers meet.	Speaker 13
On the twenty second of last march.	Speaker 14
I lent george three pounds.	Speaker 15
How are you, dear child?	Speaker 15
Enough, you a foolish chatter.	Speaker 16
And be with you, Tom!	Speaker 17
That was his chief thought.	Speaker 18
At the end of four.	Speaker 18
The nine the eggs, I keep.	Speaker 18
They were going fast, with a light clip.	Speaker 19
Name of the song is called haddocks.	Speaker 20
Emotion Control with Embedding Bank
	Sentence	Speaker	Target Emotion	Target Emotion Pitch Contour	Result
	Hello, how are you?	Speaker 16	Angry	Neutral
	Hello, how are you?	Speaker 16	Happy	Neutral
	Hello, how are you?	Speaker 16	Neutral	Neutral
	Hello, how are you?	Speaker 16	Sad	Neutral
	Hello, how are you?	Speaker 16	Surprise	Neutral
Mixed Emotion Controllability
	Sentence	Speaker	Global Emotion	Reference Pitch Contour	Result
	She was born on april nineteen forty three.	Speaker 18	Sad	(Surprise)
	Said the American to Chinese?	Speaker 18	Neutral	(Surprise)
	Said the American to Chinese?	Speaker 18	Happy	(Surprise)
	Said the American to Chinese?	Speaker 12	Happy	(Surprise)
	No, I burst the balloon!	Speaker 19	Angry	(Angry)
	No, I burst the balloon!	Speaker 19	Angry	(Surprise)
	No, I burst the balloon!	Speaker 19	Angry	(Sad)
Mixed Emotion Transition
	Sentence	Speaker	Emotion Sequence	Result
	Give me your hand or I will cry harder than before.	Speaker 16	Happy+Happy to Sad+Sad
	Give me your hand or I will cry harder than before.	Speaker 16	Happy+Angry to Sad+Sad
	Give me your hand or I will cry harder than before.	Speaker 16	Happy+Angry to Sad+Angry
	Suppose I take grandmother a fresh vegetable.	Speaker 20	Angry+Angry to Sad+Sad
	Suppose I take grandmother a fresh vegetable.	Speaker 20	Angry+Happy to Sad+Sad
	Suppose I take grandmother a fresh vegetable.	Speaker 18	Happy+Happy to Happy+Surprise
	Suppose I take grandmother a fresh vegetable.	Speaker 15	Happy+Happy to Happy+Happy
	Suppose I take grandmother a fresh vegetable.	Speaker 15	Happy+Happy to Happy+Surprise
	Suppose I take grandmother a fresh vegetable.	Speaker 16	Angry+Happy to Surprise+Surprise
	Suppose I take grandmother a fresh vegetable.	Speaker 16	Surprise+Happy to Surprise+Surprise

[1] Skerry-Ryan, R., et al., “Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron”, arXiv e-prints, 2018. doi:10.48550/arXiv.1803.09047.
