Zero-Shot vs. Few-Shot Multi-Speaker TTS Using Pre-trained Czech SpeechT5 Model

This page presents audio examples that illustrate the quality and speaker similarity of the voices evaluated in the paper "Zero-Shot vs. Few-Shot Multi-Speaker TTS Using Pre-trained Czech SpeechT5 Model", accepted at the TSD 2024 conference.

You can listen to and compare sentences synthesized after few-shot fine-tuning on three different types of source data:

Oration -- Few-shot fine-tuning on a major public speech addressed to the whole nation on an important occasion, such as the President's New Year's speech. These speeches are typically not spontaneous but are read from a reading device; however, the speakers usually aim for an emotional and solemn delivery. The orations we used are typically five to ten minutes long.
Interview -- Few-shot fine-tuning on an interview of the target speaker with a moderator. We searched for interviews with low background noise and a duration of about 30 minutes to ensure at least 5 minutes of clean speech from the target speaker. We mainly used public interviews broadcast on the radio. Speech from interviews is spontaneous and can contain disfluencies, unfinished sentences, imperfect pronunciation, and non-speech events such as laughter or coughing.
Read Speech -- Few-shot fine-tuning on a collection of spoken sentences recorded in a recording studio. Voices in this group belong to non-professional speakers unknown to the public; their data was self-recorded at our department.
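
The tables below compare few-shot models fine-tuned on 10 seconds, 1 minute, and 5 minutes of target-speaker audio. As a minimal sketch of how such duration-limited subsets could be assembled from a list of clean clips, the following Python snippet greedily fills a time budget. It is an illustration only: the selection procedure actually used in the paper is not described on this page, and the names `select_subset`, `BUDGETS_SECONDS`, and the clip IDs and durations are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical clip record: (clip id, duration in seconds).
Clip = Tuple[str, float]

# Budgets mirroring the table columns (10 seconds, 1 minute, 5 minutes).
BUDGETS_SECONDS: Dict[str, float] = {"10s": 10.0, "1min": 60.0, "5min": 300.0}


def select_subset(clips: List[Clip], budget_seconds: float) -> List[Clip]:
    """Walk the clips in order and keep each one that still fits the budget.

    This is an assumed, simple greedy strategy, not the authors' method.
    """
    selected: List[Clip] = []
    total = 0.0
    for clip_id, duration in clips:
        if total + duration <= budget_seconds:
            selected.append((clip_id, duration))
            total += duration
    return selected


def total_duration(clips: List[Clip]) -> float:
    """Sum the durations of the given clips in seconds."""
    return sum(duration for _, duration in clips)


# Made-up example data for one target speaker.
clips: List[Clip] = [
    ("utt001", 4.0),
    ("utt002", 7.5),
    ("utt003", 5.5),
    ("utt004", 30.0),
    ("utt005", 12.0),
]

for name, budget in BUDGETS_SECONDS.items():
    subset = select_subset(clips, budget)
    print(name, [clip_id for clip_id, _ in subset], total_duration(subset))
```

With only 59 seconds of example audio, the 1-minute and 5-minute budgets select the same clips; in practice the source recording has to contain at least as much clean speech as the largest budget, which is why roughly 30-minute interviews were sought to obtain 5 minutes of usable speech.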

Pre-print of the paper: https://arxiv.org/abs/2407.17167

Released weights of the pre-trained SpeechT5 model: https://huggingface.co/fav-kky/SpeechT5-base-cs-tts

Examples trained from Oration

ID	Original voice	Zero-shot approach	Few-shot approach (10 seconds)	Few-shot approach (1 minute)	Few-shot approach (5 minutes)
voice1, sentence1
voice1, sentence2
voice2, sentence1
voice2, sentence2
voice3, sentence1
voice3, sentence2
voice4, sentence1
voice4, sentence2

Examples trained from Interview

ID	Original voice	Zero-shot approach	Few-shot approach (10 seconds)	Few-shot approach (1 minute)	Few-shot approach (5 minutes)
voice1, sentence1
voice1, sentence2
voice2, sentence1
voice2, sentence2
voice3, sentence1
voice3, sentence2
voice4, sentence1
voice4, sentence2
voice5, sentence1
voice5, sentence2
voice6, sentence1
voice6, sentence2

Examples trained from Read Speech

ID	Original voice	Zero-shot approach	Few-shot approach (10 seconds)	Few-shot approach (1 minute)	Few-shot approach (5 minutes)
voice1, sentence1
voice1, sentence2
voice2, sentence1
voice2, sentence2
voice3, sentence1
voice3, sentence2
voice4, sentence1
voice4, sentence2
voice5, sentence1
voice5, sentence2