PHONOLOGICAL TOKENIZER: PROSODY-AWARE PHONETIC TOKEN VIA
MULTI-OBJECTIVE FINE-TUNING WITH DIFFERENTIABLE K-MEANS

4.3. Evaluation on generative tasks

Voice Conversion (src: Expresso)

src | Discrete WavLM | SpeechTokenizer | WavTokenizer | Ours (α=0) | Ours (α=0.1) | Ours (α=0.3) | Ours (α=0.5) | Ours (α=1)
ex01_sad_00366
ex01_whisper_00366
ex02_confused_00376
ex02_happy_00376


Voice Conversion (src: TIMIT)

src | Discrete WavLM | SpeechTokenizer | WavTokenizer | Ours (α=0) | Ours (α=0.1) | Ours (α=0.3) | Ours (α=0.5) | Ours (α=1)
DR1_MJWT0_SX391
DR3_FCKE0_SX391
DR4_FALR0_SX245
DR8_MRDM0_SX245


In-domain Reconstruction (LJSpeech)

src | Discrete WavLM | SpeechTokenizer | WavTokenizer | Ours (α=0) | Ours (α=0.1) | Ours (α=0.3) | Ours (α=0.5) | Ours (α=1)
LJ050-0029
LJ050-0030
LJ050-0031
LJ050-0032


4.4. Evaluation on speechLMs

Speech Continuation (prompt: first 3s of LibriSpeech test-clean)

prompt | Discrete WavLM | SpeechTokenizer | WavTokenizer | Ours (α=0) | Ours (α=0.1) | Ours (α=0.3) | Ours (α=0.5) | Ours (α=1)
1580-141084-0049
5683-32879-0010