PHONOLOGICAL TOKENIZER: PROSODY-AWARE PHONETIC TOKEN VIA
MULTI-OBJECTIVE FINE-TUNING WITH DIFFERENTIABLE K-MEANS
4.3. Evaluation on generative tasks
Voice Conversion (src: Expresso)
| | src | Discrete WavLM | SpeechTokenizer | WavTokenizer | Ours (α=0) | Ours (α=0.1) | Ours (α=0.3) | Ours (α=0.5) | Ours (α=1) |
| ex01_sad_00366 | | | | | | | | | |
| ex01_whisper_00366 | | | | | | | | | |
| ex02_confused_00376 | | | | | | | | | |
| ex02_happy_00376 | | | | | | | | | |
Voice Conversion (src: TIMIT)
| | src | Discrete WavLM | SpeechTokenizer | WavTokenizer | Ours (α=0) | Ours (α=0.1) | Ours (α=0.3) | Ours (α=0.5) | Ours (α=1) |
| DR1_MJWT0_SX391 | | | | | | | | | |
| DR3_FCKE0_SX391 | | | | | | | | | |
| DR4_FALR0_SX245 | | | | | | | | | |
| DR8_MRDM0_SX245 | | | | | | | | | |
In-domain Reconstruction (LJSpeech)
| | src | Discrete WavLM | SpeechTokenizer | WavTokenizer | Ours (α=0) | Ours (α=0.1) | Ours (α=0.3) | Ours (α=0.5) | Ours (α=1) |
| LJ050-0029 | | | | | | | | | |
| LJ050-0030 | | | | | | | | | |
| LJ050-0031 | | | | | | | | | |
| LJ050-0032 | | | | | | | | | |
4.4. Evaluation on speechLMs
Speech Continuation (prompt: first 3s of LibriSpeech test-clean)
| | prompt | Discrete WavLM | SpeechTokenizer | WavTokenizer | Ours (α=0) | Ours (α=0.1) | Ours (α=0.3) | Ours (α=0.5) | Ours (α=1) |
| 1580-141084-0049 | | | | | | | | | |
| 5683-32879-0010 | | | | | | | | | |