				@@ -6,45 +6,91 @@ The goal was to use the models that were easy to setup and sounded less robotic 
 
 #### Parler-TTS
 
+Minimal code to run their models:
 
-
-Surprisingly, Parler's mini model sounded more natural. In their [repo]() they share names of speakers that we can use in prompt 
-
-Actually this IS THE MOST CONSISTENT PROMPT:
-Small:
 ```
+import torch
+from parler_tts import ParlerTTSForConditionalGeneration
+from transformers import AutoTokenizer
+import IPython.display as ipd
+
+device = "cuda" if torch.cuda.is_available() else "cpu"
+
+model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-v1").to(device)
+tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")
+
+# Define text and description
+text_prompt = "This is where the actual words to be spoken go"
 description = """
 Laura's voice is expressive and dramatic in delivery, speaking at a fast pace with a very close recording that almost has no background noise.
 """
-```
 
-Large: 
-```
-description = """
-Alisa's voice is consistent, quite expressive and dramatic in delivery, with a very close recording that almost has no background noise.
-"""
+input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
+prompt_input_ids = tokenizer(text_prompt, return_tensors="pt").input_ids.to(device)
+
+generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
+audio_arr = generation.cpu().numpy().squeeze()
+
+ipd.Audio(audio_arr, rate=model.config.sampling_rate)
 ```
-Small:
+
+The really cool aspect of these models is the ability to prompt the `description`, which can change the speaker profile and the pacing of the outputs.
+
+Surprisingly, Parler's mini model sounded more natural than the large one.
+
+In their [repo](https://github.com/huggingface/parler-tts/blob/main/INFERENCE.md#speaker-consistency) they share the names of speakers that we can use in the prompt.
+
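+For example (a minimal sketch reusing `model`, `tokenizer`, `text_prompt`, and `device` from the snippet above), swapping the speaker name and delivery cues in the description is all it takes to get a different voice:
+
+```
+# Different named speaker, same text prompt
+description = """
+Jenna's voice is consistent, quite expressive and dramatic in delivery, with a very close recording that almost has no background noise.
+"""
+
+input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
+prompt_input_ids = tokenizer(text_prompt, return_tensors="pt").input_ids.to(device)
+
+generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
+ipd.Audio(generation.cpu().numpy().squeeze(), rate=model.config.sampling_rate)
+```
+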
+#### Suno/Bark
+
+Minimal code to run Bark:
+
 ```
-description = """
-Jenna's voice is consistent, quite expressive and dramatic in delivery, with a very close recording that almost has no background noise.
+import torch
+from transformers import AutoProcessor, BarkModel
+from IPython.display import Audio
+
+device = "cuda" if torch.cuda.is_available() else "cpu"
+
+processor = AutoProcessor.from_pretrained("suno/bark")
+model = BarkModel.from_pretrained("suno/bark").to(device)
+
+voice_preset = "v2/en_speaker_6"
+sampling_rate = 24000
+
+text_prompt = """
+Exactly! [sigh] And the distillation part is where you take a LARGE-model, and compress-it down into a smaller, more efficient model that can run on devices with limited resources.
 """
+inputs = processor(text_prompt, voice_preset=voice_preset).to(device)
+
+speech_output = model.generate(**inputs, temperature=0.9, semantic_temperature=0.8)
+Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)
 ```
 
-#### Suno/Bark
+Similar to the Parler models, Suno has a [library](https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c) of speakers.
 
-Bark is cool but just v6 works great, I tried v9 but its quite robotic and that is sad. 
+v9 from their library sounded robotic, so we use Parler for our first speaker and the best one from Bark (`v2/en_speaker_6`) for the second.
 
-Bark-Tests: Best results for speaker/v6 are at ```speech_output = model.generate(**inputs, temperature = 0.9, semantic_temperature = 0.8)
-Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)```
+The incredible thing about the Bark model is the ability to add sound effects: `[Laugh]`, `[Gasps]`, `[Sigh]`, `[clears throat]`. Making words capital causes the model to emphasise them.
+
+Adding `-` gives a break in the text. We use this knowledge when re-writing the transcript with the 8B model to add effects to it.
+
+Note: The authors suggest using `...` for breaks; however, in our trials this didn't work as effectively as adding a hyphen.
+
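+For instance (an illustrative prompt of ours, not from the final transcript, reusing `processor`, `model`, and `voice_preset` from the snippet above), combining these cues looks like this:
+
+```
+# Hypothetical example: [Laugh] effect, capitalised emphasis, and a hyphen break
+text_prompt = """
+It actually worked [Laugh] the SMALLER model kept almost all of the quality - with a fraction of the parameters.
+"""
+inputs = processor(text_prompt, voice_preset=voice_preset).to(device)
+speech_output = model.generate(**inputs, temperature=0.9, semantic_temperature=0.8)
+Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)
+```
+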
+#### Hyper-parameters:
 
-Tested sound effects:
+Bark models have two parameters we can tweak: `temperature` and `semantic_temperature`.
 
-- Laugh is probably most effective
-- Sigh is hit or miss
-- Gasps doesn't work
-- A singly hypen is effective
-- Captilisation makes it louder
+Below are notes from a sweep; the prompt and speaker were fixed, and this was a vibe test to see which combination gives the best results. The values below are `temperature` and `semantic_temperature`, respectively:
+
+First, fix `temperature` at `0.7` and sweep `semantic_temperature`:
+- `0.7`, `0.2`: Quite bland and boring
+- `0.7`, `0.3`: An improvement over the previous one
+- `0.7`, `0.4`: Further improvement
+- `0.7`, `0.5`: This one didn't work
+- `0.7`, `0.6`: So-so, didn't stand out
+- `0.7`, `0.7`: The best so far
+- `0.7`, `0.8`: Further improvement
+- `0.7`, `0.9`: Mixed feelings on this one
+
+Now fix `semantic_temperature` at `0.9` and sweep `temperature`:
+- `0.1`, `0.9`: Very robotic
+- `0.2`, `0.9`: Less robotic but not convincing
+- `0.3`, `0.9`: Slight improvement, still not fun
+- `0.4`, `0.9`: Still has a robotic tinge
+- `0.5`, `0.9`: The laugh was weird on this one, and the voice modulates so much it feels like the speaker is changing
+- `0.6`, `0.9`: Most consistent voice, but has a robotic aftertaste
+- `0.7`, `0.9`: Very robotic, and the laugh was weird
+- `0.8`, `0.9`: Completely ignored the laughter, but more natural
+- `0.9`, `0.9`: We probably have a winner
+
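+A sketch of how such a sweep can be run (illustrative, not the exact script used: it reuses `processor`, `model`, `voice_preset`, `text_prompt`, `device`, and `sampling_rate` from the Bark snippet, and assumes `scipy` is available for saving the clips):
+
+```
+import itertools
+from scipy.io.wavfile import write as write_wav
+
+# One clip per (temperature, semantic_temperature) pair; listen through afterwards
+for temp, sem_temp in itertools.product([0.7, 0.8, 0.9], [0.7, 0.8, 0.9]):
+    inputs = processor(text_prompt, voice_preset=voice_preset).to(device)
+    speech_output = model.generate(**inputs, temperature=temp, semantic_temperature=sem_temp)
+    write_wav(f"sweep_t{temp}_s{sem_temp}.wav", sampling_rate, speech_output[0].cpu().numpy())
+```
+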
+After this, about 30 more sweeps were done with the promising combinations.
+
+Best results were at:
+```
+speech_output = model.generate(**inputs, temperature=0.9, semantic_temperature=0.8)
+Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)
+```
 
 
				 ### Notes from other models that were tested: 
				@@ -57,8 +103,6 @@ Promising directions to explore in future: 
 - E2-TTS: r/locallama claims this to be a little better; however, it didn't pass the vibe test
 - [xTTS](https://coqui.ai/blog/tts/open_xtts): it has great documentation and also seems promising
 
-
-
				 #### Some more models that weren't tested: 
 
 In other words, we leave this as an exercise to the readers :D