Task Vector in TTS: Toward Emotionally Expressive Dialectal Speech Synthesis


Overview

We propose Hierarchical Expressive Vector (HE-Vector), a two-stage method for emotional dialectal TTS. In the first stage, we construct separate task vectors that model dialectal and emotional styles independently, and enhance single-style synthesis by scaling their weights; we call this the Expressive Vector (E-Vector). In the second stage, we integrate these vectors hierarchically to achieve controllable, emotionally expressive dialect synthesis without requiring jointly labeled data; we call this the Hierarchical Expressive Vector (HE-Vector).
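The first stage can be pictured with a minimal sketch. It assumes the usual task-vector definition (fine-tuned weights minus pre-trained weights) and treats "adjusting their weights" as multiplying the vector by a scalar before adding it back. The helper names `task_vector` and `apply_vector`, the parameter names, and the toy values are hypothetical; a real model would hold tensors rather than Python lists.

```python
from typing import Dict, List

# Toy stand-in for a model checkpoint: parameter name -> flattened values.
Weights = Dict[str, List[float]]


def task_vector(pretrained: Weights, finetuned: Weights) -> Weights:
    """Element-wise difference (finetuned - pretrained) for every parameter."""
    return {
        name: [f - p for f, p in zip(finetuned[name], pretrained[name])]
        for name in pretrained
    }


def apply_vector(pretrained: Weights, vector: Weights, beta: float) -> Weights:
    """Add the task vector, scaled by beta, to the pre-trained weights."""
    return {
        name: [p + beta * v for p, v in zip(pretrained[name], vector[name])]
        for name in pretrained
    }


# Hypothetical weights: a base model and a copy fine-tuned on one dialect.
base = {"layer0.w": [1.0, 2.0], "layer1.w": [0.5, -0.5]}
dialect_ft = {"layer0.w": [1.5, 2.0], "layer1.w": [0.5, 0.5]}

tau_dialect = task_vector(base, dialect_ft)

# beta = 1.0 reproduces the fine-tuned model; beta > 1.0 pushes further
# along the same direction (the "enhancement" in this sketch).
enhanced = apply_vector(base, tau_dialect, beta=2.0)
print(enhanced)
```

In this reading, the same two helpers serve both dialect and emotion vectors; only the fine-tuned checkpoint changes.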

(a) Construction of the E-Vector and enhancement of F5-TTS

(b) Full merging strategy for dialect and emotion E-Vectors

(c) Hierarchical merging strategy for dialect and emotion E-Vectors
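The difference between panels (b) and (c) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: full merging is taken to mean adding both vectors to every parameter, and hierarchical merging is taken to mean that different parameter groups receive different coefficients. The `lower.`/`upper.` split, the coefficient values, and all names are hypothetical, since the page does not state which layers receive which vector.

```python
from typing import Callable, Dict, List, Tuple

Weights = Dict[str, List[float]]
# Maps a parameter name to (dialect coefficient, emotion coefficient).
CoeffFn = Callable[[str], Tuple[float, float]]


def merge(pretrained: Weights, tau_dialect: Weights, tau_emotion: Weights,
          coeffs: CoeffFn) -> Weights:
    """Add both task vectors to the base weights with per-parameter scales."""
    merged = {}
    for name, base_vals in pretrained.items():
        a, b = coeffs(name)
        merged[name] = [
            p + a * d + b * e
            for p, d, e in zip(base_vals, tau_dialect[name], tau_emotion[name])
        ]
    return merged


def full(name: str) -> Tuple[float, float]:
    """Full merge: both vectors applied to every parameter."""
    return (1.0, 1.0)


def hierarchical(name: str) -> Tuple[float, float]:
    """Hypothetical split: dialect on 'lower.' params, emotion on the rest."""
    return (1.0, 0.0) if name.startswith("lower.") else (0.0, 1.0)


# Toy values chosen only to make the two strategies visibly differ.
base = {"lower.w": [0.0, 0.0], "upper.w": [1.0, 1.0]}
tau_d = {"lower.w": [0.5, 0.5], "upper.w": [0.25, 0.0]}
tau_e = {"lower.w": [-0.5, 1.0], "upper.w": [0.5, 0.5]}

full_merged = merge(base, tau_d, tau_e, full)
hier_merged = merge(base, tau_d, tau_e, hierarchical)
print(full_merged)
print(hier_merged)
```

Passing the coefficients as a function of the parameter name keeps both strategies in one code path, so other layer groupings can be tried by swapping the function.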

Dialect Synthesis

Synthesizing dialectal speech from a Mandarin prompt.
Target Text Target Speaker Target Dialect CosyVoice2 FT FT-last Enhanced LoRA Enhanced
所以啊这个豆子就生不出来哈所以后头他们打兵的时候他就就那个就非常的非常的
于是军士归心,因此这雄主二字可谓是搔到了他的痒处。
四川话
当时真害怕风雨过来揭掉屋顶铁皮
我说你这只大鸟,真是不讲理,我对你做什么了呀,你就要吞了我!
陕西话
你唔讲系人都知你系神经质儿童嚟噶啦
前段时间我面试了六个年轻人,我是倒吸了一口凉气。
广东话
给俺设置上下管加风扇一百二十度六小时
于是军士归心,因此这雄主二字可谓是搔到了他的痒处。
山东话
不用拍马屁帮我挑时间
主播说联播,今天我来说
郑州话
读书的辰光学堂里从小就教英文所以讲也会的讲英文
人生就像一场马拉松比赛,重要的是坚持不懈地向前跑,而不仅仅是关注眼前的一小段路程。
上海话
关于此事的报道成了人们生活中议论的焦点
前段时间我面试了六个年轻人,我是倒吸了一口凉气。
天津话
被黑滴童鞋们要团结起来
于是军士归心,因此这雄主二字可谓是搔到了他的痒处。
长沙话

Controllable Degree of Emotional Speech Synthesis

Synthesizing speech from a neutral prompt together with the target emotion label.
By adjusting the enhancement coefficient β, we can control the degree of emotional expression in the synthesized speech. The following examples show the effect of varying β from 0.0 to 2.5 for each target emotion:
Target Text Target Speaker Target Emotion β = 0.0 β = 0.5 β = 1.0 β = 1.5 β = 2.0 β = 2.5
我也想去看可爱的熊猫。
我们乘船漂游了三峡,真是刺激。
happy
跳舞好难呀,我还正在练基本功。
我老家在北京。
sad
我们俩合不来,还经常吵架,我拿她真没办法。
前几天我碰见了一件有趣的事儿。
angry
真想不到,游泳竟有如此多的好处,我下周还想来。
除非你打飞碟球,但这是不可能的。
surprise
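
The role of β in the table above can be written compactly. This assumes the standard task-vector form; the symbols $\theta_{\text{pre}}$, $\theta_{\text{emo}}$ and $\tau_{\text{emo}}$ are notation introduced here, not taken from the page.

```latex
\tau_{\text{emo}} = \theta_{\text{emo}} - \theta_{\text{pre}}, \qquad
\theta(\beta) = \theta_{\text{pre}} + \beta\,\tau_{\text{emo}}
% \beta = 0 : \theta(0) = \theta_{\text{pre}}   (base model, no emotion vector)
% \beta = 1 : \theta(1) = \theta_{\text{emo}}   (the emotion fine-tuned model)
% \beta > 1 : extrapolation beyond the fine-tuned weights along \tau_{\text{emo}}
```

Under this reading, the columns from β = 1.5 to β = 2.5 correspond to extrapolated weights rather than to any model that was trained directly.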

Emotionally Expressive Dialectal Speech Synthesis

Note: In real-world scenarios, when the same speaker expresses different emotions, the perceived timbre often changes as well, as illustrated in the following examples:
Example Text Neutral Happy Sad Angry Surprise
Male 英国的哲学家曾经说过
Female 不管怎么说,主队好像志在夺魁。
Synthesizing speech from a Mandarin prompt together with the target emotion and dialect labels.
Target Text Target Speaker Target Labels CosyVoice2 Two-stage Direct Merge LoRA Merge (LoRA rank = 8) LoRA Merge (LoRA rank = 64)
别吓我啊怎么解决
我老猪本是上界的天蓬元帅,不想下界之后错投了猪胎。
河南话+happy
抑或产业群聚集度高导致的成本低
哈尔滨亚冬会中国体育代表团正式成立了
天津话+sad
早上一只苹果和桃子我和豆豆一人一半
主播说联播,今天我来说。
上海话+angry
当时真害怕风雨过来揭掉屋顶铁皮
我说你这只大鸟,真是不讲理,我对你做什么了呀,你就要吞了我!
陕西话+surprise