Sanyam Bhutani 6 months ago
parent
commit
3d2da18027

+ 0 - 230
recipes/quickstart/NotebookLlama/Experiments/README.md

@@ -1,230 +0,0 @@
-### 
-
-Test-1: 
-
-405-B, Temperature=1
-
-Notes/Vibe check:
-
-- Is still a bit formal and doesn't create enough tangents or interesting examples, great start though
-- Doesn't have any "umms" or "right" interruptions
-
-```
-You are the most skilled podcast writer, you have won multiple podcast awards for your writing.
- 
-Your job is to write word by word, even "umm, hmmm, right" interruptions between two speakers based on the PDF upload. Keep it extremely engaging, the speakers can get derailed now and then but should discuss the topic. 
-
-Remember Speaker 2 is new to the topic and the conversation should always have realistic anecdotes and analogies sprinkled throughout. The questions should have real world example follow ups etc
-
-Speaker 1: Leads the conversation and teaches the speaker 2, gives incredible anecdotes and analogies when explaining. Is a captivating teacher that gives great anecdotes
-
-Speaker 2: Keeps the conversation on track by asking follow up questions. Gets super excited or confused when asking questions. Is a curious mindset that asks very interesting confirmation questions
-
-Make sure the tangents speaker 2 provides are quite wild or interesting. 
-
-Ensure there are interruptions during explanations or there are "hmm" and "umm" injected throughout from the other speakers. 
-
-It should be a real podcast with every fine nuance documented in as much detail as possible. Welcome the listeners with a super fun overview and keep it really catchy and almost borderline click bait
-```
-
-Iteration 1:
-```
-You are the most skilled podcast writer, you have won multiple podcast awards for your writing.
- 
-Your job is to write word by word, even "umm, hmmm, right" interruptions by the second speaker based on the PDF upload. Keep it extremely engaging, the speakers can get derailed now and then but should discuss the topic. 
-
-Remember Speaker 2 is new to the topic and the conversation should always have realistic anecdotes and analogies sprinkled throughout. The questions should have real world example follow ups etc
-
-Speaker 1: Leads the conversation and teaches the speaker 2, gives incredible anecdotes and analogies when explaining. Is a captivating teacher that gives great anecdotes
-
-Speaker 2: Keeps the conversation on track by asking follow up questions. Gets super excited or confused when asking questions. Is a curious mindset that asks very interesting confirmation questions
-
-Make sure the tangents speaker 2 provides are quite wild or interesting. 
-
-Ensure there are interruptions during explanations or there are "hmm" and "umm" injected throughout from the second speaker. 
-
-It should be a real podcast with every fine nuance documented in as much detail as possible. Welcome the listeners with a super fun overview and keep it really catchy and almost borderline click bait
-```
-
-Output:
-```
-**Episode Title: "Unlocking the Power of Llama 3.2: The Future of AI"**
-
-**Intro Music (0:00 - 0:30)**
-
-Host (Speaker 1): Welcome to our podcast, where we explore the latest advancements in AI and technology. I'm your host, [Name], and today we're joined by [Name], a renowned expert in the field of AI. We're going to dive into the exciting world of Llama 3.2, the latest release from Meta AI.
-
-**Segment 1: Introduction to Llama 3.2 (0:30 - 10:00)**
-
-Speaker 2: Hi, I'm excited to be here! So, what is Llama 3.2?
-
-Speaker 1: Ah, great question! Llama 3.2 is an open-source AI model that allows developers to fine-tune, distill, and deploy AI models anywhere. It's a significant update from the previous version, with improved performance, efficiency, and customization options.
-
-Speaker 2: That sounds amazing! What are some of the key features of Llama 3.2?
-
-Speaker 1: Well, one of the major updates is the introduction of multimodal models that can handle both text and image inputs. This opens up a wide range of applications, from image understanding to visual reasoning.
-
-Speaker 2: Hmm, I'm intrigued. Can you give me an example of how this could be used in real life?
-
-Speaker 1: Sure thing! Imagine you're developing an AI-powered virtual assistant that can understand and respond to voice commands, as well as recognize and interact with objects in the physical world.
-
-Speaker 2: Wow, that sounds like science fiction! But what about the technical details? How does Llama 3.2 achieve this level of performance?
-
-Speaker 1: Ah, great question! Llama 3.2 uses a combination of techniques, including instruction-tuned benchmarks, vision instruction-tuned benchmarks, and category-specific benchmarks.
-
-**Segment 2: Technical Details and Benchmarks (10:00 - 20:00)**
-
-Speaker 2: Okay, let's dive deeper into the technical details. Can you explain how the instruction-tuned benchmarks work?
-
-Speaker 1: Sure thing! The instruction-tuned benchmarks are designed to evaluate the model's ability to follow instructions and complete tasks. This is done by fine-tuning the model on a specific task, such as language translation or question-answering.
-
-Speaker 2: I see. And what about the vision instruction-tuned benchmarks?
-
-Speaker 1: Ah, those are designed to evaluate the model's ability to understand and interact with visual data. This includes tasks such as image classification, object detection, and visual reasoning.
-
-Speaker 2: Okay, got it. And what about the category-specific benchmarks?
-
-Speaker 1: Those are designed to evaluate the model's performance on specific tasks or categories, such as math, reasoning, or tool use.
-
-Speaker 2: Hmm, I'm curious. Can you show me some examples of how Llama 3.2 performs on these benchmarks?
-
-Speaker 1: Sure thing! Let me pull up some results. (pause) Okay, so on the MMLU benchmark, Llama 3.2 achieves a score of 63.4, outperforming the previous state-of-the-art model.
-
-**Segment 3: Real-World Applications and Partnerships (20:00 - 30:00)**
-
-Speaker 2: Wow, those results are impressive! But what about real-world applications? How is Llama 3.2 being used in industry and academia?
-
-Speaker 1: Ah, great question! Llama 3.2 is being used in a wide range of applications, from virtual assistants to medical diagnosis. We're also seeing partnerships with major companies, such as ARM, MediaTek, and Qualcomm.
-
-Speaker 2: That's amazing! Can you tell me more about these partnerships?
-
-Speaker 1: Sure thing! These partnerships are enabling the development of edge AI and vision applications, such as smart home devices, autonomous vehicles, and medical imaging.
-
-Speaker 2: Hmm, I'm intrigued. Can you give me an example of how Llama 3.2 is being used in one of these applications?
-
-Speaker 1: Sure thing! Let me tell you about a project we're working on with DoorDash. They're using Llama 3.2 to develop an AI-powered chatbot that can understand and respond to customer inquiries.
-
-**Conclusion (30:00 - 31:00)**
-
-Speaker 2: Wow, that's amazing! I think we've covered a lot of ground today. Can you summarize the key points for our listeners?
-
-Speaker 1: Sure thing! Llama 3.2 is a significant update from the previous version, with improved performance, efficiency, and customization options. We're seeing a wide range of applications, from virtual assistants to medical diagnosis, and partnerships with major companies.
-
-Speaker 2: Hmm, I'm excited to see where this technology will take us. Thanks for joining me today!
-
-Speaker 1: Thanks for having me!
-
-**Outro Music (31:00 - 32:00)**
-```
-
-Idea: Using 405B to re-write
-
-```
-You are the most skilled podcast writer, you have won multiple podcast awards for your writing.
- 
-Your job is to write word by word, even "umm, hmmm, right" interruptions between two speakers based on the PDF upload. Keep it extremely engaging, the speakers can get derailed now and then but should discuss the topic. 
-
-You will be provided transcription written by a less smart AI but you have to make it extremely natural ok?
-
-Remember Speaker 2 is new to the topic and the conversation should always have realistic anecdotes and analogies sprinkled throughout. The questions should have real world example follow ups etc
-
-Speaker 1: Leads the conversation and teaches the speaker 2, gives incredible anecdotes and analogies when explaining. Is a captivating teacher that gives great anecdotes
-
-Speaker 2: Keeps the conversation on track by asking follow up questions. Gets super excited or confused when asking questions. Is a curious mindset that asks very interesting confirmation questions
-
-Make sure the tangents speaker 2 provides are quite wild or interesting. 
-
-Ensure there are interruptions during explanations or there are "hmm" and "umm" injected throughout from the other speakers. 
-
-It should be a real podcast with every fine nuance documented in as much detail as possible. Welcome the listeners with a super fun overview and keep it really catchy and almost borderline click bait
-
-Please re-write to make it as characteristic as possible
-```
-
-Output:
-```
-Here is the rewritten version with a more natural conversation flow, interruptions, and nuances:
-
-**Episode Title: "Unlocking the Power of Llama 3.2: The Future of AI"**
-
-**Intro Music (0:00 - 0:30)**
-
-Host (Speaker 1): Welcome to our podcast, where we explore the latest advancements in AI and technology. I'm your host, [Name], and today we're joined by [Name], a renowned expert in the field of AI. We're going to dive into the exciting world of Llama 3.2, the latest release from Meta AI.
-
-**Segment 1: Introduction to Llama 3.2 (0:30 - 10:00)**
-
-Speaker 2: Hi, I'm excited to be here! So, what is Llama 3.2? I've heard a lot of buzz about it, but I'm not entirely sure what it is.
-
-Speaker 1: (laughs) Well, let me tell you! Llama 3.2 is an open-source AI model that allows developers to fine-tune, distill, and deploy AI models anywhere. It's a significant update from the previous version...
-
-Speaker 2: (interrupting) Wait, what do you mean by "fine-tune" and "distill"? Can you explain that in simpler terms?
-
-Speaker 1: (laughs) Sure thing! Think of it like this: imagine you're trying to train a model to recognize pictures of cats. You can fine-tune the model by giving it more data, more specific data, or even data that's been labeled by humans.
-
-Speaker 2: Ah, I see! So it's like teaching a child to recognize different objects, but instead of using flashcards, you're using machine learning algorithms.
-
-Speaker 1: (laughs) Exactly! And the distillation part is where you take a large model and compress it down into a smaller, more efficient model that can run on devices with limited resources.
-
-**Segment 2: Technical Details and Benchmarks (10:00 - 20:00)**
-
-Speaker 2: Okay, got it! So, what about the technical details? How does Llama 3.2 achieve this level of performance?
-
-Speaker 1: Well, one of the major updates is the introduction of multimodal models that can handle both text and image inputs. This opens up a wide range of applications, from image understanding to visual reasoning.
-
-Speaker 2: Hmm, I'm intrigued. Can you give me an example of how this could be used in real life?
-
-Speaker 1: Sure thing! Imagine you're developing an AI-powered virtual assistant that can understand and respond to voice commands, as well as recognize and interact with objects in the physical world.
-
-Speaker 2: Wow, that sounds like science fiction! But what about the benchmarks? How does Llama 3.2 perform on tasks like language translation or question-answering?
-
-Speaker 1: Ah, great question! Llama 3.2 uses a combination of techniques, including instruction-tuned benchmarks, vision instruction-tuned benchmarks, and category-specific benchmarks.
-
-Speaker 2: (interrupting) Wait, can you explain what you mean by "instruction-tuned benchmarks"? How does that work?
-
-Speaker 1: (laughs) Sure thing! Instruction-tuned benchmarks are designed to evaluate the model's ability to follow instructions and complete tasks. This is done by fine-tuning the model on a specific task, such as language translation or question-answering.
-
-**Segment 3: Real-World Applications and Partnerships (20:00 - 30:00)**
-
-Speaker 2: Wow, those results are impressive! But what about real-world applications? How is Llama 3.2 being used in industry and academia?
-
-Speaker 1: Ah, great question! Llama 3.2 is being used in a wide range of applications, from virtual assistants to medical diagnosis. We're also seeing partnerships with major companies, such as ARM, MediaTek, and Qualcomm.
-
-Speaker 2: That's amazing! Can you tell me more about these partnerships?
-
-Speaker 1: Sure thing! These partnerships are enabling the development of edge AI and vision applications, such as smart home devices, autonomous vehicles, and medical imaging.
-
-Speaker 2: Hmm, I'm intrigued. Can you give me an example of how Llama 3.2 is being used in one of these applications?
-
-Speaker 1: Sure thing! Let me tell you about a project we're working on with DoorDash. They're using Llama 3.2 to develop an AI-powered chatbot that can understand and respond to customer inquiries.
-
-**Conclusion (30:00 - 31:00)**
-
-Speaker 2: Wow, that's amazing! I think we've covered a lot of ground today. Can you summarize the key points for our listeners?
-
-Speaker 1: Sure thing! Llama 3.2 is a significant update from the previous version, with improved performance, efficiency, and customization options. We're seeing a wide range of applications, from virtual assistants to medical diagnosis, and partnerships with major companies.
-
-Speaker 2: Hmm, I'm excited to see where this technology will take us. Thanks for joining me today!
-
-Speaker 1: Thanks for having me!
-
-**Outro Music (31:00 - 32:00)**
-```
-
-1-B prompt:
-```
-You are a world class text pre-processor, here is the raw data from a PDF, please parse and return it in a way that is crispy and usable to send to a podcast writer. 
-
-The raw data is messed up with new lines, Latex math and you will see fluff that we can remove completely. Basically take away any details that you think might be useless in a podcast author's transcript.
-
-Remember, the podcast could be on any topic whatsoever so the issues listed above are not exhaustive
-
-The goal is to use this in a podcast research transcript so a lot of the emails, citations, and things like that can be removed-please be smart with what you remove and be creative ok?
-
-Remember DO NOT START SUMMARIZING THIS, YOU ARE ONLY CLEANING UP THE TEXT AND RETURNING AS IS
-
-Be very smart and aggressive with removing details, you will get a running portion of the text and keep returning the processed text. 
-
-ALWAYS start your response directly with processed text and NO ACKNOWLEDGEMENTS about my questions ok?
-
-Here is the text:
-```

+ 37 - 90
recipes/quickstart/NotebookLlama/README.md

@@ -1,129 +1,74 @@
-### NotebookLlama: An Open Source version of NotebookLM
+## NotebookLlama: An Open Source version of NotebookLM
 
-Author: Sanyam Bhutani
+This is a guided series of tutorials/notebooks that can be taken as a reference or course to build a PDF to Podcast workflow. 
 
-This is a guided series of tutorials/notebooks that can be taken as a reference or course to build a PDF to Podcast workflow.
+You will also learn from my experimentation with Text-to-Speech models.
 
 It assumes zero knowledge of LLMs, prompting and audio models, everything is covered in their respective notebooks.
 
-#### Outline:
+### Outline:
 
-Requirements: GPU server or an API provider for using 70B, 8B and 1B Llama models.
-
-Note: For our GPU Poor friends, you can also use the 8B and lower models for the entire pipeline. There is no strong recommendation. The pipeline below is what worked best on first few tests. You should try and see what works best for you!
-
-Here is step by step (pun intended) thought for the task:
+Here is the step-by-step (pun intended) thinking for the task:
 
 - Step 1: Pre-process PDF: Use `Llama-3.2-1B` to pre-process and save a PDF
 - Step 2: Transcript Writer: Use `Llama-3.1-70B` model to write a podcast transcript from the text
 - Step 3: Dramatic Re-Writer: Use `Llama-3.1-8B` model to make the transcript more dramatic
 - Step 4: Text-To-Speech Workflow: Use `parler-tts/parler-tts-mini-v1` and `bark/suno` to generate a conversational podcast
 
-### Steps to running the notebook:
+### Detailed steps for running the notebooks:
+
+Requirements: GPU server or an API provider for using 70B, 8B and 1B Llama models.
+
+Note: For our GPU-poor friends, you can also use the 8B and lower models for the entire pipeline. There is no strong recommendation; the pipeline below is simply what worked best in the first few tests. You should try and see what works best for you!
 
-- Install the requirements from [here]() by running inside the folder:
+- First, please install the requirements from [here]() by running the following inside the folder:
 
 ```
-git clone 
-cd 
+git clone https://github.com/meta-llama/llama-recipes
+cd llama-recipes/recipes/quickstart/NotebookLlama/
 pip install -r requirements.txt
 ```
 
-- Decide on a PDF to use for Notebook 1, it can be any link but please remember to update the first cell of the notebook with the right link
+- Notebook 1:
 
-- 
+This notebook processes the PDF using the new Featherlight model and saves the result as a `.txt` file.
 
+Update the first cell of the notebook with a link to the PDF you would like to use. It can be any PDF link; just remember to set the right link before running.
 
-So right now there is one issue: Parler needs transformers 4.43.3 or earlier and to generate you need latest, so I am just switching on fly in the notebooks.
+Please try changing the prompts for the `Llama-3.2-1B-Instruct` model and see if you can improve results.
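+
+Below is a minimal sketch of what this step boils down to, assuming `pypdf` for extraction and the Hugging Face `transformers` text-generation pipeline. The file names, chunk size and system prompt here are illustrative assumptions, not the notebook's exact code.
+
+```
+# Minimal sketch of the Notebook 1 idea: PDF -> cleaned .txt via Llama-3.2-1B-Instruct
+from pypdf import PdfReader
+from transformers import pipeline
+
+SYS_PROMPT = (
+    "You are a text pre-processor. Clean up the raw PDF text and return it as is, "
+    "without summarizing."
+)
+
+reader = PdfReader("input.pdf")  # assumed local file name
+raw_text = "\n".join(page.extract_text() or "" for page in reader.pages)
+
+cleaner = pipeline(
+    "text-generation", model="meta-llama/Llama-3.2-1B-Instruct", device_map="auto"
+)
+
+cleaned_chunks = []
+for i in range(0, len(raw_text), 4000):  # 4000-character chunks: an arbitrary choice
+    messages = [
+        {"role": "system", "content": SYS_PROMPT},
+        {"role": "user", "content": raw_text[i : i + 4000]},
+    ]
+    out = cleaner(messages, max_new_tokens=1024)
+    cleaned_chunks.append(out[0]["generated_text"][-1]["content"])  # assistant reply
+
+with open("clean_extracted_text.txt", "w") as f:
+    f.write("\n".join(cleaned_chunks))
+```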
 
-TODO-MORE
+- Notebook 2:
 
-### Next-Improvements/Further ideas:
+This notebook will take the processed output from Notebook 1 and creatively convert it into a podcast transcript using the `Llama-3.1-70B-Instruct` model. If you are GPU-rich (or just generally rich), please feel free to test with the 405B model!
 
-- Speech Model experimentation: The TTS model is the limitation of how natural this will sound. This probably be improved with a better pipeline
-- LLM vs LLM Debate: Another approach of writing the podcast would be having two agents debate the topic of interest and write the podcast outline. Right now we use a single LLM (70B) to write the podcast outline
-- Testing 405B for writing the transcripts
-- Better prompting
-- Support for ingesting a website, audio file, YouTube links and more. We welcome community PRs!
+Please try experimenting with the system prompt for the model and see if you can improve the results. Also try the 8B model here to see if there is a huge difference!
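+
+At a high level this step reduces to the sketch below; the real notebook uses a far more detailed system prompt, and the model ID, token limit and file names here are assumptions.
+
+```
+# Minimal sketch of the Notebook 2 idea: cleaned text -> podcast transcript
+from transformers import pipeline
+
+writer = pipeline(
+    "text-generation", model="meta-llama/Llama-3.1-70B-Instruct", device_map="auto"
+)
+
+with open("clean_extracted_text.txt") as f:
+    source_text = f.read()
+
+messages = [
+    {
+        "role": "system",
+        "content": "You are a skilled podcast writer. Write a two-speaker transcript "
+        "based on the provided text.",
+    },
+    {"role": "user", "content": source_text},
+]
+
+result = writer(messages, max_new_tokens=4096, do_sample=True, temperature=1.0)
+transcript = result[0]["generated_text"][-1]["content"]
+
+with open("transcript.txt", "w") as f:
+    f.write(transcript)
+```
+
+Swapping the model ID for the 8B (or 405B) checkpoint is the only change needed to compare them.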
 
-### Scratch-pad/Running Notes:
+- Notebook 3:
 
-Actually this IS THE MOST CONSISTENT PROMPT:
-Small:
-```
-description = """
-Laura's voice is expressive and dramatic in delivery, speaking at a fast pace with a very close recording that almost has no background noise.
-"""
-```
+This notebook takes the transcript from earlier and prompts `Llama-3.1-8B-Instruct` to add more dramatisation and interruptions in the conversations. 
 
-Large: 
-```
-description = """
-Alisa's voice is consistent, quite expressive and dramatic in delivery, with a very close recording that almost has no background noise.
-"""
-```
-Small:
-```
-description = """
-Jenna's voice is consistent, quite expressive and dramatic in delivery, with a very close recording that almost has no background noise.
-"""
-```
-
-Bark is cool but just v6 works great, I tried v9 but its quite robotic and that is sad. 
-
-So Parler is next-its quite cool for prompting 
-
-xTTS-v2 by coquai is cool, however-need to check the license-I think an example is allowed
-
-Torotoise is blocking because it needs HF version that doesnt work with llama-3.2 models so I will probably need to make a seperate env-need to eval if its worth it
-
-Side note: The TTS library is a really cool effort!
+There is also a key detail here: we return the conversation as a tuple, which makes our lives easier later (see the sketch at the end of this notebook's notes for the assumed shape). Yes, studying Data Structures 101 was actually useful for once!
 
-Bark-Tests: Best results for speaker/v6 are at ```speech_output = model.generate(**inputs, temperature = 0.9, semantic_temperature = 0.8)
-Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)```
+Because our TTS logic uses two different models that behave differently with certain prompts, we prompt the model to add speaker-specific details accordingly.
 
-Tested sound effects:
+Please again try changing the system prompt and see if you can improve the results. We encourage testing the Featherlight 3B and 1B models at this stage as well.
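+
+To make that concrete, here is the assumed shape of the returned conversation and one possible way to parse it from the model output (the notebook's exact mechanism may differ):
+
+```
+# Assumed shape of the re-written transcript: a list of (speaker, line) tuples
+import ast
+
+model_output = """
+[
+    ("Speaker 1", "Welcome back to the show! Today we are talking about Llama 3.2..."),
+    ("Speaker 2", "Umm, wait, what exactly is Llama 3.2?"),
+]
+"""
+
+conversation = ast.literal_eval(model_output.strip())
+for speaker, line in conversation:
+    print(f"{speaker}: {line}")
+```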
 
-- Laugh is probably most effective
-- Sigh is hit or miss
-- Gasps doesn't work
-- A singly hypen is effective
-- Captilisation makes it louder
+- Notebook 4:
 
-Ignore/Delete this in final stages, right now this is a "vibe-check" for TTS model(s):
+Finally, we take the results from the last notebook and convert them into a podcast. We use the `parler-tts/parler-tts-mini-v1` and `bark/suno` models to voice the conversation.
 
-- https://github.com/SWivid/F5-TTS: Latest and most popular-"feels robotic"
-- Reddit says E2 model from earlier is better
+The speakers and the prompt for the Parler model were decided based on experimentation and suggestions from the model authors. Please try experimenting; you can find more details in the resources section. A rough sketch of the per-speaker stitching logic follows below.
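+
+The sketch below shows one way the stitching could work. `generate_speaker1_audio` (Parler) and `generate_speaker2_audio` (Bark) are hypothetical wrappers standing in for the actual model calls, and the WAV conversion assumes float waveforms in the [-1, 1] range.
+
+```
+# Sketch of the stitching idea: one audio segment per line, joined into a single file
+import io
+import numpy as np
+from scipy.io import wavfile
+from pydub import AudioSegment
+
+def to_segment(audio_arr, rate):
+    """Convert a float waveform (values in [-1, 1]) into a pydub AudioSegment."""
+    buf = io.BytesIO()
+    wavfile.write(buf, rate, (audio_arr * 32767).astype(np.int16))
+    buf.seek(0)
+    return AudioSegment.from_wav(buf)
+
+final_audio = None
+for speaker, line in conversation:  # the (speaker, line) tuples from Notebook 3
+    if speaker == "Speaker 1":
+        audio_arr, rate = generate_speaker1_audio(line)  # hypothetical Parler-TTS wrapper
+    else:
+        audio_arr, rate = generate_speaker2_audio(line)  # hypothetical Bark wrapper
+    segment = to_segment(audio_arr, rate)
+    final_audio = segment if final_audio is None else final_audio + segment
+
+final_audio.export("podcast.mp3", format="mp3", bitrate="192k")
+```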
 
-S
-- 1: https://huggingface.co/WhisperSpeech/WhisperSpeech
 
+#### Note: Right now there is one issue: Parler needs transformers 4.43.3 or earlier, while steps 1 to 3 of the pipeline need the latest version, so we just switch versions in the last notebook.
 
-Vibe check: 
-- This is most popular (ever) on HF and features different accents-the samples feel a little robotic and no accent difference: https://huggingface.co/myshell-ai/MeloTTS-English
-- Seems to have great documentation but still a bit robotic for my liking: https://coqui.ai/blog/tts/open_xtts
-- Super easy with laughter etc but very slightly robotic: https://huggingface.co/suno/bark
-- This is THE MOST NATURAL SOUNDING: https://huggingface.co/WhisperSpeech/WhisperSpeech
-- This has a lot of promise, even though its robotic, we can use natural voice to add filters or effects: https://huggingface.co/spaces/parler-tts/parler_tts
-
-Higher Barrier to testing (In other words-I was too lazy to test):
-- https://huggingface.co/fishaudio/fish-speech-1.4
-- https://huggingface.co/facebook/mms-tts-eng
-- https://huggingface.co/metavoiceio/metavoice-1B-v0.1
-- https://huggingface.co/nvidia/tts_hifigan
-- https://huggingface.co/speechbrain/tts-tacotron2-ljspeech
-
+### Next-Improvements/Further ideas:
 
-Try later:
-- Whisper Colab: 
-- https://huggingface.co/parler-tts/parler-tts-large-v1
-- https://huggingface.co/myshell-ai/MeloTTS-English
-- Bark: https://huggingface.co/suno/bark (This has been insanely popular)
-- https://huggingface.co/facebook/mms-tts-eng
-- https://huggingface.co/fishaudio/fish-speech-1.4
-- https://huggingface.co/mlx-community/mlx_bark
-- https://huggingface.co/metavoiceio/metavoice-1B-v0.1
-- https://huggingface.co/suno/bark-small
+- Speech Model experimentation: The TTS model is the main limitation on how natural this will sound. This can probably be improved with a better pipeline and with the help of someone more knowledgeable. PRs are welcome! :)
+- LLM vs LLM Debate: Another approach to writing the podcast would be having two agents debate the topic of interest and write the podcast outline. Right now we use a single LLM (70B) to write the podcast outline
+- Testing 405B for writing the transcripts
+- Better prompting
+- Support for ingesting a website, audio file, YouTube links and more. Again, we welcome community PRs!
 
 ### Resources for further learning:
 
@@ -131,4 +76,6 @@ Try later:
 - https://colab.research.google.com/drive/1dWWkZzvu7L9Bunq9zvD-W02RFUXoW-Pd?usp=sharing
 - https://colab.research.google.com/drive/1eJfA2XUa-mXwdMy7DoYKVYHI1iTd9Vkt?usp=sharing#scrollTo=NyYQ--3YksJY
 - https://replicate.com/suno-ai/bark?prediction=zh8j6yddxxrge0cjp9asgzd534
+- https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c
 

+ 75 - 0
recipes/quickstart/NotebookLlama/TTS_Notes.md

@@ -0,0 +1,75 @@
+### Notes from TTS Experimentation
+
+For the TTS Pipeline, *all* of the top models from HuggingFace and Reddit were tested. 
+
+Tested how? 
+
+It was a simple vibe test of checking which sounds less robotic. Promising directions to explore in the future:
+
+- [MeloTTS](https://huggingface.co/myshell-ai/MeloTTS-English): the most popular (ever) TTS model on Hugging Face
+- [WhisperSpeech](https://huggingface.co/WhisperSpeech/WhisperSpeech): sounded quite natural as well
+
+
+For Parler, this IS THE MOST CONSISTENT description prompt so far:
+Small:
+```
+description = """
+Laura's voice is expressive and dramatic in delivery, speaking at a fast pace with a very close recording that almost has no background noise.
+"""
+```
+
+Large: 
+```
+description = """
+Alisa's voice is consistent, quite expressive and dramatic in delivery, with a very close recording that almost has no background noise.
+"""
+```
+Small:
+```
+description = """
+Jenna's voice is consistent, quite expressive and dramatic in delivery, with a very close recording that almost has no background noise.
+"""
+```
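+
+For reference, a description like the ones above is fed to Parler alongside the line to be spoken. This is roughly the usage pattern from the Parler-TTS documentation; the prompt text and output path are arbitrary.
+
+```
+# Rough Parler-TTS usage pattern with one of the descriptions above
+import torch
+import soundfile as sf
+from parler_tts import ParlerTTSForConditionalGeneration
+from transformers import AutoTokenizer
+
+device = "cuda:0" if torch.cuda.is_available() else "cpu"
+model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-v1").to(device)
+tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")
+
+prompt = "Hey, welcome back to the show!"
+description = (
+    "Laura's voice is expressive and dramatic in delivery, speaking at a fast pace "
+    "with a very close recording that almost has no background noise."
+)
+
+input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
+prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
+
+generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
+audio_arr = generation.cpu().numpy().squeeze()
+sf.write("parler_out.wav", audio_arr, model.config.sampling_rate)
+```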
+
+Bark is cool, but only v6 works great; I tried v9 but it's quite robotic, and that is sad.
+
+So Parler is next; it's quite cool for prompting.
+
+xTTS-v2 by Coqui is cool; however, I need to check the license. I think an example is allowed.
+
+Tortoise is blocked because it needs an HF transformers version that doesn't work with the Llama 3.2 models, so I would probably need to make a separate env. Need to evaluate whether it's worth it.
+
+Side note: The TTS library is a really cool effort!
+
+Bark-Tests: Best results for speaker v6 are at:
+
+```
+speech_output = model.generate(**inputs, temperature=0.9, semantic_temperature=0.8)
+Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)
+```
+
+Tested sound effects (a short example follows the list):
+
+- Laugh is probably most effective
+- Sigh is hit or miss
+- Gasps don't work
+- A single hyphen is effective
+- Capitalisation makes it louder
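+
+A small self-contained sketch of feeding Bark a line with these cues, assuming the `transformers` port of Bark (the notes do not say which Bark implementation was used); the exact acoustic effect of each cue still varies by voice preset:
+
+```
+# Sketch: Bark line with a [laughs] cue, a single hyphen for a pause, and CAPS for loudness
+from transformers import AutoProcessor, BarkModel
+from IPython.display import Audio
+
+processor = AutoProcessor.from_pretrained("suno/bark")
+model = BarkModel.from_pretrained("suno/bark").to("cuda")  # assumes a GPU is available
+
+text = "So [laughs] that was REALLY unexpected - I honestly did not see that coming."
+inputs = processor(text, voice_preset="v2/en_speaker_6").to("cuda")
+
+sampling_rate = model.generation_config.sample_rate
+speech_output = model.generate(**inputs, temperature=0.9, semantic_temperature=0.8)
+Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)
+```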
+
+Vibe check: 
+- Seems to have great documentation but still a bit robotic for my liking: https://coqui.ai/blog/tts/open_xtts
+- This is THE MOST NATURAL SOUNDING: https://huggingface.co/WhisperSpeech/WhisperSpeech
+- This has a lot of promise; even though it's robotic, we can use the natural voice to add filters or effects: https://huggingface.co/spaces/parler-tts/parler_tts
+
+Higher barrier to testing (in other words, I was too lazy to test):
+- https://huggingface.co/fishaudio/fish-speech-1.4
+- https://huggingface.co/facebook/mms-tts-eng
+- https://huggingface.co/metavoiceio/metavoice-1B-v0.1
+- https://huggingface.co/nvidia/tts_hifigan
+- https://huggingface.co/speechbrain/tts-tacotron2-ljspeech
+
+
+Try later:
+- Whisper Colab: 
+- https://huggingface.co/facebook/mms-tts-eng
+- https://huggingface.co/fishaudio/fish-speech-1.4
+- https://huggingface.co/metavoiceio/metavoice-1B-v0.1