Zero-Shot vs. Few-Shot Multi-Speaker TTS Using Pre-trained Czech SpeechT5 Model

This page presents audio examples that illustrate the quality and speaker similarity of the voices evaluated in the paper "Zero-Shot vs. Few-Shot Multi-Speaker TTS Using Pre-trained Czech SpeechT5 Model", accepted for the TSD 2024 conference.

You can listen to and compare sentences for three types of source data used for few-shot fine-tuning: oration, interview, and read speech.

Pre-print of the paper: https://arxiv.org/abs/2407.17167

Released weights of the pre-trained SpeechT5 model: https://huggingface.co/fav-kky/SpeechT5-base-cs-tts

Examples trained from Oration

ID | Original voice | Zero-shot approach | Few-shot approach (10 seconds) | Few-shot approach (1 minute) | Few-shot approach (5 minutes)
voice1, sentence1
voice1, sentence2
voice2, sentence1
voice2, sentence2
voice3, sentence1
voice3, sentence2
voice4, sentence1
voice4, sentence2

Examples trained from Interview

ID | Original voice | Zero-shot approach | Few-shot approach (10 seconds) | Few-shot approach (1 minute) | Few-shot approach (5 minutes)
voice1, sentence1
voice1, sentence2
voice2, sentence1
voice2, sentence2
voice3, sentence1
voice3, sentence2
voice4, sentence1
voice4, sentence2
voice5, sentence1
voice5, sentence2
voice6, sentence1
voice6, sentence2

Examples trained from Read Speech

ID | Original voice | Zero-shot approach | Few-shot approach (10 seconds) | Few-shot approach (1 minute) | Few-shot approach (5 minutes)
voice1, sentence1
voice1, sentence2
voice2, sentence1
voice2, sentence2
voice3, sentence1
voice3, sentence2
voice4, sentence1
voice4, sentence2
voice5, sentence1
voice5, sentence2