Prosody-Tacotron [1]: A state-of-the-art extension to Tacotron that synthesizes speech to match the prosody of reference audio.
VITS-E: VITS with an additional embedding bank for global emotion control.
VITS-ES: VITS with an additional embedding bank and a spectrogram encoder for mixed emotion control.
PiCo-VITS (proposed): An end-to-end speech synthesis architecture that leverages pitch contours to synthesize speech with mixed emotions.
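
As a rough illustration of what an embedding bank for global emotion control might look like: a bank can be viewed as a lookup table holding one vector per emotion label, with the selected vector applied to every frame of an utterance. The models above are described only in one line each, so everything in this sketch is an assumption rather than the authors' implementation. The names `EMOTIONS`, `make_bank`, and `condition` are hypothetical, the vectors are random rather than learned, and adding the vector to each frame is just one plausible way to condition.

```python
import random

# Emotion labels that appear in the sample tables on this page.
EMOTIONS = ["Angry", "Happy", "Neutral", "Sad", "Surprise"]


def make_bank(dim, seed=0):
    """Build a toy embedding bank: one vector per emotion label.

    Assumption: in a trained model these vectors would be learned;
    here they are random placeholders.
    """
    rng = random.Random(seed)
    return {e: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for e in EMOTIONS}


def condition(frames, bank, emotion):
    """Add the chosen emotion vector to every frame (global control)."""
    vec = bank[emotion]
    return [[x + v for x, v in zip(frame, vec)] for frame in frames]


bank = make_bank(dim=4)
frames = [[0.0] * 4 for _ in range(3)]  # stand-in for 3 frames of encoder output
out = condition(frames, bank, "Sad")
```

Because the same vector is applied to every frame, this kind of control is global: it cannot by itself express an emotion that changes within the utterance.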

| Sentence | Speaker | Ground Truth | Prosody-Tacotron | VITS-E | VITS-ES | PiCo-VITS (Ours) |
| --- | --- | --- | --- | --- | --- | --- |
| I can smell the breath of an English. | Speaker 11 | | | | | |
| Said the American to Chinese. | Speaker 11 | | | | | |
| I suppose no, it doesn't! | Speaker 12 | | | | | |
| When such wanderers meet. | Speaker 13 | | | | | |
| On the twenty second of last March. | Speaker 14 | | | | | |
| I lent George three pounds. | Speaker 15 | | | | | |
| How are you, dear child? | Speaker 15 | | | | | |
| Enough, you a foolish chatter. | Speaker 16 | | | | | |
| And be with you, Tom! | Speaker 17 | | | | | |
| That was his chief thought. | Speaker 18 | | | | | |
| At the end of four. | Speaker 18 | | | | | |
| The nine the eggs, I keep. | Speaker 18 | | | | | |
| They were going fast, with a light clip. | Speaker 19 | | | | | |
| Name of the song is called haddocks. | Speaker 20 | | | | | |

Emotion Control with Embedding Bank

| Sentence | Speaker | Target Emotion | Target Emotion Pitch Contour | Result |
| --- | --- | --- | --- | --- |
| Hello, how are you? | Speaker 16 | Angry | Neutral | |
| Hello, how are you? | Speaker 16 | Happy | Neutral | |
| Hello, how are you? | Speaker 16 | Neutral | Neutral | |
| Hello, how are you? | Speaker 16 | Sad | Neutral | |
| Hello, how are you? | Speaker 16 | Surprise | Neutral | |

Mixed Emotion Controllability

| Sentence | Speaker | Global Emotion | Reference Pitch Contour | Result |
| --- | --- | --- | --- | --- |
| She was born on April nineteen forty three. | Speaker 18 | Sad | (Surprise) | |
| Said the American to Chinese? | Speaker 18 | Neutral | (Surprise) | |
| Said the American to Chinese? | Speaker 18 | Happy | (Surprise) | |
| Said the American to Chinese? | Speaker 12 | Happy | (Surprise) | |
| No, I burst the balloon! | Speaker 19 | Angry | (Angry) | |
| No, I burst the balloon! | Speaker 19 | Angry | (Surprise) | |
| No, I burst the balloon! | Speaker 19 | Angry | (Sad) | |

Mixed Emotion Transition

| Sentence | Speaker | Emotion Sequence | Result |
| --- | --- | --- | --- |
| Give me your hand or I will cry harder than before. | Speaker 16 | Happy+Happy to Sad+Sad | |
| Give me your hand or I will cry harder than before. | Speaker 16 | Happy+Angry to Sad+Sad | |
| Give me your hand or I will cry harder than before. | Speaker 16 | Happy+Angry to Sad+Angry | |
| Suppose I take grandmother a fresh vegetable. | Speaker 20 | Angry+Angry to Sad+Sad | |
| Suppose I take grandmother a fresh vegetable. | Speaker 20 | Angry+Happy to Sad+Sad | |
| Suppose I take grandmother a fresh vegetable. | Speaker 18 | Happy+Happy to Happy+Surprise | |
| Suppose I take grandmother a fresh vegetable. | Speaker 15 | Happy+Happy to Happy+Happy | |
| Suppose I take grandmother a fresh vegetable. | Speaker 15 | Happy+Happy to Happy+Surprise | |
| Suppose I take grandmother a fresh vegetable. | Speaker 16 | Angry+Happy to Surprise+Surprise | |
| Suppose I take grandmother a fresh vegetable. | Speaker 16 | Surprise+Happy to Surprise+Surprise | |

[1] Skerry-Ryan, R., et al., "Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron," arXiv e-prints, 2018. doi:10.48550/arXiv.1803.09047.