Leveraging Soft Distributions of SSL-Derived Discrete Speech Tokens for Downstream Inference
Voice Conversion (src: TIMIT)
Columns: src | cont. | k=128 (hard, soft) | k=1024 (hard, soft) | k=4096 (hard, soft)
Rows (utterances): dr_faks0_si2203, dr2_mmdm2_sx12, dr3_mmab0_sx192, dr4_fjmg0_sx281
In-domain Reconstruction (LJSpeech)
Columns: gt | cont. | k=128 (hard, soft) | k=1024 (hard, soft) | k=4096 (hard, soft)
Rows (utterances): LJ50-0029, LJ050-0197