For the TTS pipeline, we tried the top models from HuggingFace and Reddit.
The goal was to use models that are easy to set up and sound less robotic, with the ability to include sound effects such as laughter.
Minimal code to run the Parler model:
import torch
import IPython.display as ipd
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
# Use a GPU when one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-v1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")
# Define text and description
text_prompt = "This is where the actual words to be spoken go"
description = """
Laura's voice is expressive and dramatic in delivery, speaking at a fast pace with a very close recording that almost has no background noise.
"""
# The description conditions the voice; the text prompt is what gets spoken
input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(text_prompt, return_tensors="pt").input_ids.to(device)
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
# Convert the generated waveform to a NumPy array and play it in the notebook
audio_arr = generation.cpu().numpy().squeeze()
ipd.Audio(audio_arr, rate=model.config.sampling_rate)
The really cool aspect of these models is the ability to prompt the description, which can change the speaker profile and pacing of the outputs.
Surprisingly, Parler's mini model sounded more natural.
The Parler repo shares the names of speakers that we can use in the description prompt.
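As a rough illustration of how the description can be varied, the sketch below builds description strings from a few attributes. The helper `build_description` and its parameter names are hypothetical and not part of Parler; the only speaker name used is "Laura", taken from the example above.

```python
def build_description(
    speaker,
    delivery="expressive and dramatic",
    pace="fast",
    recording="very close",
):
    """Build a Parler-style description string (hypothetical helper, not a Parler API)."""
    return (
        f"{speaker}'s voice is {delivery} in delivery, "
        f"speaking at a {pace} pace with a {recording} recording "
        "that has almost no background noise."
    )


# Same speaker, two different pacing prompts
fast_description = build_description("Laura")
slow_description = build_description("Laura", delivery="calm", pace="slow")
print(fast_description)
print(slow_description)
```

Each string can be passed as `description` in the snippet above; how closely the model follows a given attribute is something to check by ear.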
Minimal code to run Bark:
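import torch
from IPython.display import Audio
from transformers import AutoProcessor, BarkModel
device = "cuda" if torch.cuda.is_available() else "cpu"
# Assumption: the "suno/bark" checkpoint; the original snippet does not show how the model was loaded
processor = AutoProcessor.from_pretrained("suno/bark")
model = BarkModel.from_pretrained("suno/bark").to(device)
voice_preset = "v2/en_speaker_6"
sampling_rate = 24000
text_prompt = """
Exactly! [sigh] And the distillation part is where you take a LARGE-model,and compress-it down into a smaller, more efficient model that can run on devices with limited resources.
"""
inputs = processor(text_prompt, voice_preset=voice_preset).to(device)
speech_output = model.generate(**inputs, temperature=0.9, semantic_temperature=0.8)
Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)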
Similar to the Parler models, Suno has a library of speakers for Bark.
v9 from their library sounded robotic, so we use Parler for our first speaker and the best-sounding Bark speaker for the other.
The incredible thing about the Bark model is being able to add sound effects: [Laugh], [Gasps], [Sigh], [clears throat]. Writing a word in capitals causes the model to emphasize it.
Adding a hyphen (-) gives a break in the text. We use this knowledge when we rewrite the transcript with the 8B model to add effects to it.
Note: The authors suggest using "...". However, this didn't work as effectively as adding a hyphen during trials.
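To make the markup conventions above concrete, here is a small sketch that applies them to plain text. These helpers (`emphasize`, `add_effect`, `add_break`) are hypothetical and only illustrate the conventions; in the pipeline the rewrite is done by the 8B model, not by string functions like these.

```python
import re

# Effect tags quoted in the text above; this list is not claimed to be exhaustive.
EFFECT_TAGS = ("[Laugh]", "[Gasps]", "[Sigh]", "[clears throat]")


def emphasize(text, words):
    """Upper-case each whole-word occurrence of the given words (case-insensitive)."""
    for word in words:
        pattern = re.compile(rf"\b{re.escape(word)}\b", flags=re.IGNORECASE)
        text = pattern.sub(lambda match: match.group(0).upper(), text)
    return text


def add_effect(text, tag):
    """Prefix the text with one of the effect tags listed above."""
    if tag not in EFFECT_TAGS:
        raise ValueError(f"unknown effect tag: {tag}")
    return f"{tag} {text}"


def add_break(left, right):
    """Join two fragments with a hyphen, as in 'compress-it' in the Bark prompt."""
    return f"{left.rstrip()}-{right.lstrip()}"


line = emphasize("you take a large model and compress it down", ["large"])
line = add_effect(line, "[Sigh]")
print(line)
print(add_break("compress", "it down"))
```

The resulting strings can be used as `text_prompt` in the Bark snippet; whether a given tag is actually rendered still has to be checked by listening.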
Bark models have two parameters we can tweak: temperature and semantic_temperature.
Below are the notes from a sweep. The prompt and speaker were fixed, and this was a vibe test to see which combination gives the best results. Each entry lists temperature and semantic_temperature, in that order.
First, fix temperature and sweep semantic_temperature:
0.7, 0.2: Quite bland and boring
0.7, 0.3: An improvement over the previous one
0.7, 0.4: Further improvement
0.7, 0.5: This one didn't work
0.7, 0.6: So-so, didn't stand out
0.7, 0.7: The best so far
0.7, 0.8: Further improvement
0.7, 0.9: Mixed feelings on this one
Now sweeping the temperature:
0.1, 0.9: Very robotic
0.2, 0.9: Less robotic but not convincing
0.3, 0.9: Slight improvement, still not fun
0.4, 0.9: Still has a robotic tinge
0.5, 0.9: The laugh was weird on this one, but the voice modulates so much it feels like the speaker is changing
0.6, 0.9: Most consistent voice but has a robotic after-taste
0.7, 0.9: Very robotic and the laugh was weird
0.8, 0.9: Completely ignores the laughter but it was more natural
0.9, 0.9: We have a winner, probably
After this, about 30 more sweeps were done with the promising combinations.
Best results are at temperature = 0.9 and semantic_temperature = 0.8:
speech_output = model.generate(**inputs, temperature=0.9, semantic_temperature=0.8)
Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)
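For anyone who wants to repeat the two sweeps from the notes, the parameter grid can be sketched as below. The function names `sweep_pairs` and `run_sweep` are hypothetical, and `generate_fn` stands in for whatever call produces audio, so the sketch runs without loading Bark.

```python
def sweep_pairs():
    """Return (temperature, semantic_temperature) pairs in the order of the notes."""
    # Sweep 1: temperature fixed at 0.7, semantic_temperature from 0.2 to 0.9
    first = [(0.7, round(s / 10, 1)) for s in range(2, 10)]
    # Sweep 2: semantic_temperature fixed at 0.9, temperature from 0.1 to 0.9
    second = [(round(t / 10, 1), 0.9) for t in range(1, 10)]
    return first + second


def run_sweep(generate_fn, pairs):
    """Call generate_fn(temperature, semantic_temperature) for each pair.

    Returns a list of ((temperature, semantic_temperature), output) tuples.
    """
    return [(pair, generate_fn(*pair)) for pair in pairs]


# Stand-in generator so the sketch runs without a model
results = run_sweep(lambda t, s: f"audio(t={t}, s={s})", sweep_pairs())
for pair, output in results:
    print(pair, output)
```

With Bark loaded as in the snippet above, `generate_fn` could be `lambda t, s: model.generate(**inputs, temperature=t, semantic_temperature=s)`. Judging the outputs is still a listening exercise.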
Promising directions to explore in the future:
In other words, we leave this as an exercise to readers :D