Leveraging Soft Distributions of SSL-Derived Discrete Speech Tokens for Downstream Inference

Voice Conversion (src: TIMIT)

Columns: src | cont. | k=128 (hard, soft) | k=1024 (hard, soft) | k=4096 (hard, soft)
dr1_faks0_si2203
dr2_mmdm2_sx12
dr3_mmab0_sx192
dr4_fjmg0_sx281


In-domain Reconstruction (LJSpeech)

Columns: gt | cont. | k=128 (hard, soft) | k=1024 (hard, soft) | k=4096 (hard, soft)
LJ050-0029
LJ050-0197