				@@ -6,45 +6,91 @@ The goal was to use the models that were easy to setup and sounded less robotic 
 
 #### Parler-TTS
 
+Minimal code to run their models:
 
-
-Surprisingly, Parler's mini model sounded more natural. In their [repo]() they share names of speakers that we can use in prompt 
-
-Actually this IS THE MOST CONSISTENT PROMPT:
-Small:
 ```
+import torch
+from parler_tts import ParlerTTSForConditionalGeneration
+from transformers import AutoTokenizer
+import IPython.display as ipd
+
+device = "cuda" if torch.cuda.is_available() else "cpu"
+
+model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-v1").to(device)
+tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")
+
+# Define text and description
+text_prompt = "This is where the actual words to be spoken go"
 description = """
 Laura's voice is expressive and dramatic in delivery, speaking at a fast pace with a very close recording that almost has no background noise.
 """
-```
 
-Large: 
-```
-description = """
-Alisa's voice is consistent, quite expressive and dramatic in delivery, with a very close recording that almost has no background noise.
-"""
+input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
+prompt_input_ids = tokenizer(text_prompt, return_tensors="pt").input_ids.to(device)
+
+generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
+audio_arr = generation.cpu().numpy().squeeze()
+
+ipd.Audio(audio_arr, rate=model.config.sampling_rate)
 ```
-Small:
+
+The really cool aspect of these models is the ability to prompt the `description`, which can change the speaker profile and the pacing of the outputs.
+
+Surprisingly, Parler's mini model sounded more natural than the large one.
+
+In their [repo](https://github.com/huggingface/parler-tts/blob/main/INFERENCE.md#speaker-consistency) they share the names of speakers that we can use in the prompt.
+
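+For example (a minimal sketch reusing `model`, `tokenizer`, `text_prompt`, and `device` from the snippet above), swapping the speaker name and delivery cues in the description is all it takes to get a different voice:
+
+```
+# Different named speaker, same text prompt
+description = """
+Jenna's voice is consistent, quite expressive and dramatic in delivery, with a very close recording that almost has no background noise.
+"""
+
+input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
+prompt_input_ids = tokenizer(text_prompt, return_tensors="pt").input_ids.to(device)
+
+generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
+ipd.Audio(generation.cpu().numpy().squeeze(), rate=model.config.sampling_rate)
+```
+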
+#### Suno/Bark
+
+Minimal code to run Bark:
+
 ```
-description = """
-Jenna's voice is consistent, quite expressive and dramatic in delivery, with a very close recording that almost has no background noise.
+import torch
+from transformers import AutoProcessor, BarkModel
+from IPython.display import Audio
+
+device = "cuda" if torch.cuda.is_available() else "cpu"
+
+processor = AutoProcessor.from_pretrained("suno/bark")
+model = BarkModel.from_pretrained("suno/bark").to(device)
+
+voice_preset = "v2/en_speaker_6"
+sampling_rate = 24000
+
+text_prompt = """
+Exactly! [sigh] And the distillation part is where you take a LARGE-model, and compress-it down into a smaller, more efficient model that can run on devices with limited resources.
 """
+inputs = processor(text_prompt, voice_preset=voice_preset).to(device)
+
+speech_output = model.generate(**inputs, temperature=0.9, semantic_temperature=0.8)
+Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)
 ```
 
-#### Suno/Bark
+Similar to the Parler models, Suno has a [library](https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c) of speakers.
 
-Bark is cool but just v6 works great, I tried v9 but its quite robotic and that is sad. 
+v9 from their library sounded robotic, so we use Parler for our first speaker and the best one from Bark (`v2/en_speaker_6`) for the second.
 
-Bark-Tests: Best results for speaker/v6 are at ```speech_output = model.generate(**inputs, temperature = 0.9, semantic_temperature = 0.8)
-Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)```
+The incredible thing about the Bark model is the ability to add sound effects: `[Laugh]`, `[Gasps]`, `[Sigh]`, `[clears throat]`. Making words capital causes the model to emphasise them.
+
+Adding `-` gives a break in the text. We use this knowledge when re-writing the transcript with the 8B model to add effects to it.
+
+Note: The authors suggest using `...` for breaks; however, in our trials this didn't work as effectively as adding a hyphen.
+
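+For instance (an illustrative prompt of ours, not from the final transcript, reusing `processor`, `model`, and `voice_preset` from the snippet above), combining these cues looks like this:
+
+```
+# Hypothetical example: [Laugh] effect, capitalised emphasis, and a hyphen break
+text_prompt = """
+It actually worked [Laugh] the SMALLER model kept almost all of the quality - with a fraction of the parameters.
+"""
+inputs = processor(text_prompt, voice_preset=voice_preset).to(device)
+speech_output = model.generate(**inputs, temperature=0.9, semantic_temperature=0.8)
+Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)
+```
+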
+#### Hyper-parameters:
 
-Tested sound effects:
+Bark models have two parameters we can tweak: `temperature` and `semantic_temperature`.
 
-- Laugh is probably most effective
-- Sigh is hit or miss
-- Gasps doesn't work
-- A singly hypen is effective
-- Captilisation makes it louder
+Below are notes from a sweep; the prompt and speaker were fixed, and this was a vibe test to see which combination gives the best results. The values below are `temperature` and `semantic_temperature`, respectively:
+
+First, fix `temperature` at `0.7` and sweep `semantic_temperature`:
+- `0.7`, `0.2`: Quite bland and boring
+- `0.7`, `0.3`: An improvement over the previous one
+- `0.7`, `0.4`: Further improvement
+- `0.7`, `0.5`: This one didn't work
+- `0.7`, `0.6`: So-so, didn't stand out
+- `0.7`, `0.7`: The best so far
+- `0.7`, `0.8`: Further improvement
+- `0.7`, `0.9`: Mixed feelings on this one
+
+Now fix `semantic_temperature` at `0.9` and sweep `temperature`:
+- `0.1`, `0.9`: Very robotic
+- `0.2`, `0.9`: Less robotic but not convincing
+- `0.3`, `0.9`: Slight improvement, still not fun
+- `0.4`, `0.9`: Still has a robotic tinge
+- `0.5`, `0.9`: The laugh was weird on this one, and the voice modulates so much it feels like the speaker is changing
+- `0.6`, `0.9`: Most consistent voice, but has a robotic aftertaste
+- `0.7`, `0.9`: Very robotic, and the laugh was weird
+- `0.8`, `0.9`: Completely ignored the laughter, but more natural
+- `0.9`, `0.9`: We probably have a winner
+
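+A sketch of how such a sweep can be run (illustrative, not the exact script used: it reuses `processor`, `model`, `voice_preset`, `text_prompt`, `device`, and `sampling_rate` from the Bark snippet, and assumes `scipy` is available for saving the clips):
+
+```
+import itertools
+from scipy.io.wavfile import write as write_wav
+
+# One clip per (temperature, semantic_temperature) pair; listen through afterwards
+for temp, sem_temp in itertools.product([0.7, 0.8, 0.9], [0.7, 0.8, 0.9]):
+    inputs = processor(text_prompt, voice_preset=voice_preset).to(device)
+    speech_output = model.generate(**inputs, temperature=temp, semantic_temperature=sem_temp)
+    write_wav(f"sweep_t{temp}_s{sem_temp}.wav", sampling_rate, speech_output[0].cpu().numpy())
+```
+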
+After this, about 30 more sweeps were done with the promising combinations.
+
+Best results were at:
+```
+speech_output = model.generate(**inputs, temperature=0.9, semantic_temperature=0.8)
+Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)
+```
 
 
				 ### Notes from other models that were tested: 
				@@ -57,8 +103,6 @@ Promising directions to explore in future: 
 - E2-TTS: r/locallama claims this to be a little better; however, it didn't pass the vibe test
 - [xTTS](https://coqui.ai/blog/tts/open_xtts): it has great documentation and also seems promising
 
-
-
				 #### Some more models that weren't tested: 
 
 In other words, we leave this as an exercise to the readers :D