-----------------------------> Audio Samples <---------------------------



Prosody-Tacotron [1]: A state-of-the-art extension to Tacotron that synthesizes speech to match the prosody of a reference audio clip.
VITS-E: VITS with an additional embedding bank for global emotion control.
VITS-ES: VITS with an additional embedding bank and a spectrogram encoder for mixed emotion control.
PiCo-VITS (proposed): An end-to-end speech synthesis architecture that leverages pitch contours to synthesize speech with mixed emotions.
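
As a rough illustration of what "an embedding bank for global emotion control" can mean, the sketch below keeps one vector per emotion label and adds the selected vector to every frame of a hidden sequence. This is not the implementation behind any of the systems listed above: the names `make_embedding_bank` and `condition_on_emotion`, the random initialization, and the additive conditioning are all assumptions made for illustration. Only the five emotion labels are taken from the samples on this page.

```python
import random

# Emotion labels used in the samples on this page.
EMOTIONS = ["Angry", "Happy", "Neutral", "Sad", "Surprise"]


def make_embedding_bank(dim, seed=0):
    """Build a hypothetical bank: one vector per emotion label.

    In a trained model these vectors would be learned; here they are
    random placeholders so the example runs without any dependencies.
    """
    rng = random.Random(seed)
    return {e: [rng.gauss(0.0, 1.0) for _ in range(dim)] for e in EMOTIONS}


def condition_on_emotion(hidden_frames, bank, emotion):
    """Add the same emotion vector to every frame (assumed global control)."""
    vec = bank[emotion]
    return [[h + v for h, v in zip(frame, vec)] for frame in hidden_frames]


bank = make_embedding_bank(dim=4)
frames = [[0.0] * 4 for _ in range(3)]  # stand-in for text-encoder outputs
conditioned = condition_on_emotion(frames, bank, "Sad")
```

Because the same vector is applied to every frame, this kind of conditioning sets one emotion for the whole utterance, which is why the descriptions above distinguish global control from mixed emotion control.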

Baseline Comparison

Sentence Speaker Groundtruth Prosody-Tacotron VITS-E VITS-ES PiCo-VITS (Ours)
I can smell the breath of an English. Speaker 11
Said the American to Chinese. Speaker 11
I suppose no, it doesn't! Speaker 12
When such wanderers meet. Speaker 13
On the twenty second of last march. Speaker 14
I lent george three pounds. Speaker 15
How are you, dear child? Speaker 15
Enough, you a foolish chatter. Speaker 16
And be with you, Tom! Speaker 17
That was his chief thought. Speaker 18
At the end of four. Speaker 18
The nine the eggs, I keep. Speaker 18
They were going fast, with a light clip. Speaker 19
Name of the song is called haddocks. Speaker 20

Emotion Control with Embedding Bank

Sentence Speaker Target Emotion Target Emotion Pitch Contour Result
Hello, how are you? Speaker 16 Angry Neutral
Hello, how are you? Speaker 16 Happy Neutral
Hello, how are you? Speaker 16 Neutral Neutral
Hello, how are you? Speaker 16 Sad Neutral
Hello, how are you? Speaker 16 Surprise Neutral

Mixed Emotion Controllability

Sentence Speaker Global Emotion Reference Pitch Contour Result
She was born on april nineteen forty three. Speaker 18 Sad (Surprise)
Said the American to Chinese? Speaker 18 Neutral (Surprise)
Said the American to Chinese? Speaker 18 Happy (Surprise)
Said the American to Chinese? Speaker 12 Happy (Surprise)
No, I burst the balloon! Speaker 19 Angry (Angry)
No, I burst the balloon! Speaker 19 Angry (Surprise)
No, I burst the balloon! Speaker 19 Angry (Sad)

Mixed Emotion Transition

Sentence Speaker Emotion Sequence Result
Give me your hand or I will cry harder than before. Speaker 16 Happy+Happy to Sad+Sad
Give me your hand or I will cry harder than before. Speaker 16 Happy+Angry to Sad+Sad
Give me your hand or I will cry harder than before. Speaker 16 Happy+Angry to Sad+Angry
Suppose I take grandmother a fresh vegetable. Speaker 20 Angry+Angry to Sad+Sad
Suppose I take grandmother a fresh vegetable. Speaker 20 Angry+Happy to Sad+Sad
Suppose I take grandmother a fresh vegetable. Speaker 18 Happy+Happy to Happy+Surprise
Suppose I take grandmother a fresh vegetable. Speaker 15 Happy+Happy to Happy+Happy
Suppose I take grandmother a fresh vegetable. Speaker 15 Happy+Happy to Happy+Surprise
Suppose I take grandmother a fresh vegetable. Speaker 16 Angry+Happy to Surprise+Surprise
Suppose I take grandmother a fresh vegetable. Speaker 16 Surprise+Happy to Surprise+Surprise

[1] Skerry-Ryan, R., et al., “Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron”, arXiv e-prints, 2018. doi:10.48550/arXiv.1803.09047.