{ "cells": [ { "cell_type": "markdown", "id": "45a1b5d7-fd98-4fa2-9bea-e68c514b9245", "metadata": {}, "source": [ "## Notebook 4: TTS Workflow\n", "\n", "We have the exact podcast transcripts ready now to generate our audio for the Podcast.\n", "\n", "In this notebook, we will learn how to generate Audio using both `suno/bark` and `parler-tts/parler-tts-mini-v1` models first. \n", "\n", "After that, we will use the output from Notebook 3 to generate our complete podcast\n", "\n", "Note: Please feel free to extend this notebook with newer models. The above two were chosen after some tests using a sample prompt." ] }, { "cell_type": "markdown", "id": "534e5f94-66d0-459d-ab01-8599905d8e1b", "metadata": {}, "source": [ "⚠️ Warning: This notebook likes have `transformers` version to be `4.43.3` or earlier so we will downgrade our environment to make sure things run smoothly" ] }, { "cell_type": "markdown", "id": "efd866ac-8ea6-486d-96cd-7594a8c329e0", "metadata": {}, "source": [ "Credit: [This](https://colab.research.google.com/drive/1dWWkZzvu7L9Bunq9zvD-W02RFUXoW-Pd?usp=sharing#scrollTo=68QtoUqPWdLk) Colab was used for starter code\n" ] }, { "cell_type": "markdown", "id": "a4e2c0ee-7527-46e4-9c07-e6dac34376e5", "metadata": {}, "source": [ "We can install these packages for speedups" ] }, { "cell_type": "code", "execution_count": 1, "id": "3ee4811a-50a1-4030-8312-54fccddc221b", "metadata": {}, "outputs": [], "source": [ "#!pip3 install optimum\n", "#!pip install -U flash-attn --no-build-isolation\n", "#pip install transformers==4.43.3 torch optimum accelerate tqdm ipywidgets rich PyPDF2 huggingface-hub jupyter pydub\n", "#pip install bark parler-tts" ] }, { "cell_type": "markdown", "id": "07672295-af30-4b4b-b11c-44ca938436cd", "metadata": {}, "source": [ "Let's import the necessary frameworks" ] }, { "cell_type": "code", "execution_count": 17, "id": "89d75859-e0f9-40e3-931d-64aa3d273f49", "metadata": {}, "outputs": [], "source": [ "from IPython.display import Audio\n", "import IPython.display as ipd\n", "from tqdm import tqdm" ] }, { "cell_type": "code", "execution_count": 18, "id": "f442758d-c48f-48ac-a4b0-558695290aa9", "metadata": {}, "outputs": [], "source": [ "from transformers import BarkModel, AutoProcessor, AutoTokenizer\n", "import torch\n", "import json\n", "import numpy as np\n", "from parler_tts import ParlerTTSForConditionalGeneration" ] }, { "cell_type": "markdown", "id": "31ba1903-59c8-4004-bb39-1761cd3d140e", "metadata": {}, "source": [ "### Testing the Audio Generation" ] }, { "cell_type": "markdown", "id": "2523c565-bb35-4fae-bdcb-cba11ef0b572", "metadata": {}, "source": [ "Let's try generating audio using both the models to understand how they work. \n", "\n", "Note the subtle differences in prompting:\n", "- Parler: Takes in a `description` prompt that can be used to set the speaker profile and generation speeds\n", "- Suno: Takes in expression words like `[sigh]`, `[laughs]` etc. You can find more notes on the experiments that were run for this notebook in the [TTS_Notes.md](./TTS_Notes.md) file to learn more." ] }, { "cell_type": "markdown", "id": "50b62df5-5ea3-4913-832a-da59f7cf8de2", "metadata": {}, "source": [ "Please set `device = \"cuda\"` below if you're using a single GPU node." 
] }, { "cell_type": "markdown", "id": "309d0678-880b-44cb-a54a-9408b3c8d644", "metadata": {}, "source": [ "#### Parler Model\n", "\n", "Let's try using the Parler Model first and generate a short segment with speaker Laura's voice" ] }, { "cell_type": "code", "execution_count": 22, "id": "4e84ed3f-336b-4f45-b098-ce477929fa8a", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Config of the text_encoder: is overwritten by shared text_encoder config: T5Config {\n", " \"_name_or_path\": \"google/flan-t5-large\",\n", " \"architectures\": [\n", " \"T5ForConditionalGeneration\"\n", " ],\n", " \"classifier_dropout\": 0.0,\n", " \"d_ff\": 2816,\n", " \"d_kv\": 64,\n", " \"d_model\": 1024,\n", " \"decoder_start_token_id\": 0,\n", " \"dense_act_fn\": \"gelu_new\",\n", " \"dropout_rate\": 0.1,\n", " \"eos_token_id\": 1,\n", " \"feed_forward_proj\": \"gated-gelu\",\n", " \"initializer_factor\": 1.0,\n", " \"is_encoder_decoder\": true,\n", " \"is_gated_act\": true,\n", " \"layer_norm_epsilon\": 1e-06,\n", " \"model_type\": \"t5\",\n", " \"n_positions\": 512,\n", " \"num_decoder_layers\": 24,\n", " \"num_heads\": 16,\n", " \"num_layers\": 24,\n", " \"output_past\": true,\n", " \"pad_token_id\": 0,\n", " \"relative_attention_max_distance\": 128,\n", " \"relative_attention_num_buckets\": 32,\n", " \"tie_word_embeddings\": false,\n", " \"transformers_version\": \"4.46.1\",\n", " \"use_cache\": true,\n", " \"vocab_size\": 32128\n", "}\n", "\n", "Config of the audio_encoder: is overwritten by shared audio_encoder config: DACConfig {\n", " \"_name_or_path\": \"parler-tts/dac_44khZ_8kbps\",\n", " \"architectures\": [\n", " \"DACModel\"\n", " ],\n", " \"codebook_size\": 1024,\n", " \"frame_rate\": 86,\n", " \"latent_dim\": 1024,\n", " \"model_bitrate\": 8,\n", " \"model_type\": \"dac_on_the_hub\",\n", " \"num_codebooks\": 9,\n", " \"sampling_rate\": 44100,\n", " \"torch_dtype\": \"float32\",\n", " \"transformers_version\": \"4.46.1\"\n", "}\n", "\n", "Config of the decoder: is overwritten by shared decoder config: ParlerTTSDecoderConfig {\n", " \"_name_or_path\": \"/fsx/yoach/tmp/artefacts/parler-tts-mini/decoder\",\n", " \"activation_dropout\": 0.0,\n", " \"activation_function\": \"gelu\",\n", " \"add_cross_attention\": true,\n", " \"architectures\": [\n", " \"ParlerTTSForCausalLM\"\n", " ],\n", " \"attention_dropout\": 0.0,\n", " \"bos_token_id\": 1025,\n", " \"codebook_weights\": null,\n", " \"cross_attention_implementation_strategy\": null,\n", " \"dropout\": 0.1,\n", " \"eos_token_id\": 1024,\n", " \"ffn_dim\": 4096,\n", " \"hidden_size\": 1024,\n", " \"initializer_factor\": 0.02,\n", " \"is_decoder\": true,\n", " \"layerdrop\": 0.0,\n", " \"max_position_embeddings\": 4096,\n", " \"model_type\": \"parler_tts_decoder\",\n", " \"num_attention_heads\": 16,\n", " \"num_codebooks\": 9,\n", " \"num_cross_attention_key_value_heads\": 16,\n", " \"num_hidden_layers\": 24,\n", " \"num_key_value_heads\": 16,\n", " \"pad_token_id\": 1024,\n", " \"rope_embeddings\": false,\n", " \"rope_theta\": 10000.0,\n", " \"scale_embedding\": false,\n", " \"tie_word_embeddings\": false,\n", " \"torch_dtype\": \"float32\",\n", " \"transformers_version\": \"4.46.1\",\n", " \"use_cache\": true,\n", " \"use_fused_lm_heads\": false,\n", " \"vocab_size\": 1088\n", "}\n", "\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Set up device\n", "device = \"cuda\" if 
torch.cuda.is_available() else \"cpu\"\n", "\n", "# Load model and tokenizer\n", "model = ParlerTTSForConditionalGeneration.from_pretrained(\"parler-tts/parler-tts-mini-v1\").to(device)\n", "tokenizer = AutoTokenizer.from_pretrained(\"parler-tts/parler-tts-mini-v1\")\n", "\n", "# Define text and description\n", "text_prompt = \"\"\"\n", "Llama 3 is quite versatile. For instance, it can be used for coding tasks, where it can generate high-quality code based on a description or even help debug existing code. It's also very capable in multilingual tasks, being able to understand and generate text in several languages.\n", "\"\"\"\n", "description = \"\"\"\n", "Laura's voice is expressive and dramatic in delivery, speaking at a fast pace with a very close recording that almost has no background noise.\n", "\"\"\"\n", "# Tokenize inputs\n", "input_ids = tokenizer(description, return_tensors=\"pt\").input_ids.to(device)\n", "prompt_input_ids = tokenizer(text_prompt, return_tensors=\"pt\").input_ids.to(device)\n", "\n", "# Generate audio\n", "generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)\n", "audio_arr = generation.cpu().numpy().squeeze()\n", "\n", "# Play audio in notebook\n", "ipd.Audio(audio_arr, rate=model.config.sampling_rate)" ] }, { "cell_type": "markdown", "id": "03c2abc6-4a1d-4318-af6f-0257dd66a691", "metadata": {}, "source": [ "#### Bark Model\n", "\n", "Amazing! Let's try the same with Bark now:\n", "- We will set the `voice_preset` to our favorite speaker\n", "- This time we can include expression prompts inside our generation prompt\n", "- Note: you can CAPITALISE words to make the model emphasise them\n", "- You can add hyphens to make the model pause on certain words" ] }, { "cell_type": "code", "execution_count": 20, "id": "a20730f0-13dd-48b4-80b6-7c6ef05a0cc4", "metadata": {}, "outputs": [], "source": [ "voice_preset = \"v2/en_speaker_6\"\n", "sampling_rate = 24000" ] }, { "cell_type": "code", "execution_count": 26, "id": "246d0cbc-c5d8-4f34-b8e4-dd18a624cdad", "metadata": {}, "outputs": [], "source": [ "device = \"cuda:7\"\n", "\n", "processor = AutoProcessor.from_pretrained(\"suno/bark\")\n", "\n", "#model = model.to_bettertransformer()\n", "#model = BarkModel.from_pretrained(\"suno/bark\", torch_dtype=torch.float16, attn_implementation=\"flash_attention_2\").to(device)\n", "model = BarkModel.from_pretrained(\"suno/bark\", torch_dtype=torch.float16).to(device)#.to_bettertransformer()" ] }, { "cell_type": "code", "execution_count": null, "id": "2313a899", "metadata": {}, "outputs": [], "source": [ "text_prompt = \"\"\"\n", "That sounds incredible. The potential applications are vast, from helping developers with coding tasks to facilitating communication across languages. (curious) How does it handle tasks that require a deep understanding of context or nuance, like understanding humor or sarcasm? 
(hmm)\n", "\"\"\"\n", "inputs = processor(text_prompt, voice_preset=voice_preset).to(device)\n", "\n", "speech_output = model.generate(**inputs, temperature = 0.9)\n", "Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)" ] }, { "cell_type": "markdown", "id": "dd650176-ab17-47a7-8e02-10dc9ca9e852", "metadata": {}, "source": [ "## Bringing it together: Making the Podcast\n", "\n", "Okay, now that we understand how everything works, we can use the complete pipeline to generate the entire podcast.\n", "\n", "Let's load in our pickle file from earlier and proceed:" ] }, { "cell_type": "code", "execution_count": 28, "id": "b1dca30f-1226-4002-8e02-fd97e78ecc83", "metadata": {}, "outputs": [], "source": [ "import pickle\n", "\n", "with open('./resources/podcast_ready_data.pkl', 'rb') as file:\n", " PODCAST_TEXT = pickle.load(file)" ] }, { "cell_type": "markdown", "id": "c10a3d50-08a7-4786-8e28-8fb6b8b048ab", "metadata": {}, "source": [ "Let's load the Bark model and set its hyper-parameters for the discussion segments" ] }, { "cell_type": "code", "execution_count": null, "id": "a3a4aa8f", "metadata": {}, "outputs": [], "source": [ "bark_processor = AutoProcessor.from_pretrained(\"suno/bark\")\n", "bark_model = BarkModel.from_pretrained(\"suno/bark\", torch_dtype=torch.float16).to(\"cpu\")#\"cuda:3\")\n", "bark_sampling_rate = 24000" ] }, { "cell_type": "markdown", "id": "e03e313a-c727-4489-876b-db71920d49cd", "metadata": {}, "source": [ "Now for the Parler model:" ] }, { "cell_type": "code", "execution_count": 5, "id": "6c04a04d-3686-4932-bd45-72d7f518c602", "metadata": {}, "outputs": [], "source": [ "parler_model = ParlerTTSForConditionalGeneration.from_pretrained(\"parler-tts/parler-tts-mini-v1\").to(\"cuda:3\")\n", "parler_tokenizer = AutoTokenizer.from_pretrained(\"parler-tts/parler-tts-mini-v1\")" ] }, { "cell_type": "code", "execution_count": 6, "id": "efbe1434-37f3-4f77-a5fb-b39625f5e676", "metadata": {}, "outputs": [], "source": [ "speaker1_description = \"\"\"\n", "Laura's voice is expressive and dramatic in delivery, speaking at a moderately fast pace with a very close recording that almost has no background noise.\n", "\"\"\"" ] }, { "cell_type": "markdown", "id": "56f6fa24-fe07-4702-850f-0428bfadd2dc", "metadata": {}, "source": [ "We will collect the generated audio segments along with their respective sampling rates, since we will need both to assemble the final audio" ] }, { "cell_type": "code", "execution_count": 7, "id": "cebfd0f9-8703-4fce-b207-014c6e16cc8a", "metadata": {}, "outputs": [], "source": [ "generated_segments = []\n", "sampling_rates = [] # We'll need to keep track of sampling rates for each segment" ] }, { "cell_type": "code", "execution_count": 8, "id": "9b333e36-9579-4237-b329-e2911229be42", "metadata": {}, "outputs": [], "source": [ "device = \"cuda:3\"" ] }, { "cell_type": "markdown", "id": "d7b2490c-012f-4e35-8890-cd6a5eaf4cc4", "metadata": {}, "source": [ "Function to generate audio for speaker 1" ] }, { "cell_type": "code", "execution_count": 9, "id": "50323f9e-09ed-4c8c-9020-1511ab775969", "metadata": {}, "outputs": [], "source": [ "def generate_speaker1_audio(text):\n", " \"\"\"Generate audio using ParlerTTS for Speaker 1\"\"\"\n", " input_ids = parler_tokenizer(speaker1_description, return_tensors=\"pt\").input_ids.to(device)\n", " prompt_input_ids = parler_tokenizer(text, return_tensors=\"pt\").input_ids.to(device)\n", " generation = parler_model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)\n", " audio_arr = 
generation.cpu().numpy().squeeze()\n", " return audio_arr, parler_model.config.sampling_rate" ] }, { "cell_type": "markdown", "id": "3fb5dac8-30a6-4aa2-a983-b5f1df3d56af", "metadata": {}, "source": [ "Function to generate text for speaker 2" ] }, { "cell_type": "code", "execution_count": 10, "id": "0e6120ba-5190-4739-97ca-4e8b44dddc5e", "metadata": {}, "outputs": [], "source": [ "def generate_speaker2_audio(text):\n", " \"\"\"Generate audio using Bark for Speaker 2\"\"\"\n", " inputs = bark_processor(text, voice_preset=\"v2/en_speaker_6\").to(device)\n", " speech_output = bark_model.generate(**inputs, temperature=0.9, semantic_temperature=0.8)\n", " audio_arr = speech_output[0].cpu().numpy()\n", " return audio_arr, bark_sampling_rate\n" ] }, { "cell_type": "markdown", "id": "7ea67fd1-9405-4fce-b08b-df5e11d0bf37", "metadata": {}, "source": [ "Helper function to convert the numpy output from the models into audio" ] }, { "cell_type": "code", "execution_count": 38, "id": "4482d864-2806-4410-b239-da4b2d0d1340", "metadata": {}, "outputs": [], "source": [ "def numpy_to_audio_segment(audio_arr, sampling_rate):\n", " \"\"\"Convert numpy array to AudioSegment\"\"\"\n", " # Convert to 16-bit PCM\n", " audio_int16 = (audio_arr * 32767).astype(np.int16)\n", " \n", " # Create WAV file in memory\n", " byte_io = io.BytesIO()\n", " wavfile.write(byte_io, sampling_rate, audio_int16)\n", " byte_io.seek(0)\n", " \n", " # Convert to AudioSegment\n", " return AudioSegment.from_wav(byte_io)" ] }, { "cell_type": "code", "execution_count": 29, "id": "c4dbb3b3-cdd3-4a1f-a60a-661e64a67f53", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'[\\n (\"Speaker 1\", \"Welcome to our latest podcast on the advancements in AI, where we\\'re going to dive into the world of Llama 3, a new set of foundation models that are pushing the boundaries of what\\'s possible with language understanding and generation. I\\'m your host, and joining me is my co-host, who is new to this topic. We\\'re excited to explore this together. So, let\\'s start with the basics. Llama 3 is all about improving upon its predecessors by incorporating larger models, more data, and better training techniques. Our largest model boasts an impressive 405B parameters and can process up to 128K tokens. That\\'s right, we\\'re talking about a significant leap in scale and capability.\"),\\n (\"Speaker 2\", \"Wow, 405B parameters? That\\'s... that\\'s enormous! (umm) I mean, I\\'ve heard of large models before, but this is on a whole different level. Can you explain what that means in practical terms? Like, how does it affect what the model can do?\"),\\n (\"Speaker 1\", \"Absolutely. So, having a model with 405B parameters means it has a much more nuanced understanding of language. It can capture subtleties and context in a way that smaller models can\\'t. For instance, our model can understand and generate text based on a much larger context window, up to 128K tokens. To put that into perspective, that\\'s like being able to understand and respond to a lengthy document or even a small book in one go.\"),\\n (\"Speaker 2\", \"Hmm, that\\'s amazing! I can see how that would be super useful for tasks like summarization or question-answering based on a long document. But, (hesitates) how does it handle complexity? I mean, with so many parameters, doesn\\'t it risk being overly complex or even overfitting to the training data? [sigh]\"),\\n (\"Speaker 1\", \"That\\'s a great question. One of the key challenges with large models is managing complexity. 
To address this, we\\'ve made several design choices, such as using a standard Transformer architecture but with some adaptations like grouped query attention to improve inference speed. We\\'ve also been careful about our pre-training data, ensuring it\\'s diverse and of high quality.\"),\\n (\"Speaker 2\", \"Grouped query attention? That\\'s a new one for me. Can you explain how that works and why it\\'s beneficial? (slightly confused) I thought attention mechanisms were already pretty optimized. (laughs)\"),\\n (\"Speaker 1\", \"Grouped query attention is a technique we use to improve the efficiency of our model during inference. Essentially, it allows the model to process queries in groups rather than one by one, which can significantly speed up the process. This is particularly useful when dealing with long sequences or when generating text.\"),\\n (\"Speaker 2\", \"Hmm, that sounds like a significant improvement. And, (curious) what about the pre-training data? You mentioned it\\'s diverse and of high quality. Can you tell me more about that? How do you ensure the data is good enough for such a large and complex model? [sigh]\"),\\n (\"Speaker 1\", \"We\\'ve put a lot of effort into curating our pre-training data. We start with a massive corpus of text, but then we apply various filtering techniques to remove low-quality or redundant data. We also use techniques like deduplication to ensure that our model isn\\'t biased towards any particular subset of the data.\"),\\n (\"Speaker 2\", \"I see. So, it\\'s not just about having a lot of data, but also about making sure that data is relevant and useful for training. That makes sense. (pauses) What about the applications of Llama 3? You mentioned it can do a lot of things, from answering questions to generating code. Can you give some specific examples? (umm)\"),\\n (\"Speaker 1\", \"Llama 3 is quite versatile. For instance, it can be used for coding tasks, where it can generate high-quality code based on a description or even help debug existing code. It\\'s also very capable in multilingual tasks, being able to understand and generate text in several languages.\"),\\n (\"Speaker 2\", \"That sounds incredible. The potential applications are vast, from helping developers with coding tasks to facilitating communication across languages. (curious) How does it handle tasks that require a deep understanding of context or nuance, like understanding humor or sarcasm? (hmm)\"),\\n (\"Speaker 1\", \"That\\'s an area where Llama 3 has shown significant improvement. By being trained on a vast amount of text data, it has developed a better understanding of context and can often pick up on subtleties like humor or sarcasm. However, it\\'s not perfect, and there are still cases where it might not fully understand the nuance.\"),\\n (\"Speaker 2\", \"I can imagine. Understanding humor or sarcasm can be challenging even for humans, so it\\'s not surprising that it\\'s an area for improvement. (pauses) What about the safety and reliability of Llama 3? With models this powerful, there are concerns about potential misuse or generating harmful content. [sigh]\"),\\n (\"Speaker 1\", \"We\\'ve taken several steps to ensure the safety and reliability of Llama 3. This includes incorporating safety mitigations during the training process and testing the model extensively to identify and mitigate any potential risks.\"),\\n (\"Speaker 2\", \"That\\'s good to hear. 
It\\'s crucial that as we develop more powerful AI models, we also prioritize their safety and responsible use. (curious) What\\'s next for Llama 3? Are there plans to continue improving it or expanding its capabilities? (umm)\"),\\n (\"Speaker 1\", \"We\\'re committed to ongoing research and development to further improve Llama 3 and explore new applications. We\\'re excited about the potential of this technology to make a positive impact across various domains.\"),\\n (\"Speaker 2\", \"Well, it\\'s been enlightening to learn more about Llama 3. The advancements in AI are truly remarkable, and it\\'s exciting to think about what\\'s possible with models like this. (concludes) Thanks for having me on the show!\"),\\n (\"Speaker 1\", \"Thank you for joining us on this episode. It was a pleasure to explore the world of Llama 3 with you.\")\\n]'" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PODCAST_TEXT" ] }, { "cell_type": "markdown", "id": "485b4c9e-379f-4004-bdd0-93a53f3f7ee0", "metadata": {}, "source": [ "Most of the times we argue in life that Data Structures isn't very useful. However, this time the knowledge comes in handy. \n", "\n", "We will take the string from the pickle file and load it in as a Tuple with the help of `ast.literal_eval()`" ] }, { "cell_type": "code", "execution_count": 30, "id": "9946e46c-3457-4bf9-9042-b89fa8f5b47a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('Speaker 1',\n", " \"Welcome to our latest podcast on the advancements in AI, where we're going to dive into the world of Llama 3, a new set of foundation models that are pushing the boundaries of what's possible with language understanding and generation. I'm your host, and joining me is my co-host, who is new to this topic. We're excited to explore this together. So, let's start with the basics. Llama 3 is all about improving upon its predecessors by incorporating larger models, more data, and better training techniques. Our largest model boasts an impressive 405B parameters and can process up to 128K tokens. That's right, we're talking about a significant leap in scale and capability.\"),\n", " ('Speaker 2',\n", " \"Wow, 405B parameters? That's... that's enormous! (umm) I mean, I've heard of large models before, but this is on a whole different level. Can you explain what that means in practical terms? Like, how does it affect what the model can do?\"),\n", " ('Speaker 1',\n", " \"Absolutely. So, having a model with 405B parameters means it has a much more nuanced understanding of language. It can capture subtleties and context in a way that smaller models can't. For instance, our model can understand and generate text based on a much larger context window, up to 128K tokens. To put that into perspective, that's like being able to understand and respond to a lengthy document or even a small book in one go.\"),\n", " ('Speaker 2',\n", " \"Hmm, that's amazing! I can see how that would be super useful for tasks like summarization or question-answering based on a long document. But, (hesitates) how does it handle complexity? I mean, with so many parameters, doesn't it risk being overly complex or even overfitting to the training data? [sigh]\"),\n", " ('Speaker 1',\n", " \"That's a great question. One of the key challenges with large models is managing complexity. To address this, we've made several design choices, such as using a standard Transformer architecture but with some adaptations like grouped query attention to improve inference speed. 
We've also been careful about our pre-training data, ensuring it's diverse and of high quality.\"),\n", " ('Speaker 2',\n", " \"Grouped query attention? That's a new one for me. Can you explain how that works and why it's beneficial? (slightly confused) I thought attention mechanisms were already pretty optimized. (laughs)\"),\n", " ('Speaker 1',\n", " 'Grouped query attention is a technique we use to improve the efficiency of our model during inference. Essentially, it allows the model to process queries in groups rather than one by one, which can significantly speed up the process. This is particularly useful when dealing with long sequences or when generating text.'),\n", " ('Speaker 2',\n", " \"Hmm, that sounds like a significant improvement. And, (curious) what about the pre-training data? You mentioned it's diverse and of high quality. Can you tell me more about that? How do you ensure the data is good enough for such a large and complex model? [sigh]\"),\n", " ('Speaker 1',\n", " \"We've put a lot of effort into curating our pre-training data. We start with a massive corpus of text, but then we apply various filtering techniques to remove low-quality or redundant data. We also use techniques like deduplication to ensure that our model isn't biased towards any particular subset of the data.\"),\n", " ('Speaker 2',\n", " \"I see. So, it's not just about having a lot of data, but also about making sure that data is relevant and useful for training. That makes sense. (pauses) What about the applications of Llama 3? You mentioned it can do a lot of things, from answering questions to generating code. Can you give some specific examples? (umm)\"),\n", " ('Speaker 1',\n", " \"Llama 3 is quite versatile. For instance, it can be used for coding tasks, where it can generate high-quality code based on a description or even help debug existing code. It's also very capable in multilingual tasks, being able to understand and generate text in several languages.\"),\n", " ('Speaker 2',\n", " 'That sounds incredible. The potential applications are vast, from helping developers with coding tasks to facilitating communication across languages. (curious) How does it handle tasks that require a deep understanding of context or nuance, like understanding humor or sarcasm? (hmm)'),\n", " ('Speaker 1',\n", " \"That's an area where Llama 3 has shown significant improvement. By being trained on a vast amount of text data, it has developed a better understanding of context and can often pick up on subtleties like humor or sarcasm. However, it's not perfect, and there are still cases where it might not fully understand the nuance.\"),\n", " ('Speaker 2',\n", " \"I can imagine. Understanding humor or sarcasm can be challenging even for humans, so it's not surprising that it's an area for improvement. (pauses) What about the safety and reliability of Llama 3? With models this powerful, there are concerns about potential misuse or generating harmful content. [sigh]\"),\n", " ('Speaker 1',\n", " \"We've taken several steps to ensure the safety and reliability of Llama 3. This includes incorporating safety mitigations during the training process and testing the model extensively to identify and mitigate any potential risks.\"),\n", " ('Speaker 2',\n", " \"That's good to hear. It's crucial that as we develop more powerful AI models, we also prioritize their safety and responsible use. (curious) What's next for Llama 3? Are there plans to continue improving it or expanding its capabilities? 
(umm)\"),\n", " ('Speaker 1',\n", " \"We're committed to ongoing research and development to further improve Llama 3 and explore new applications. We're excited about the potential of this technology to make a positive impact across various domains.\"),\n", " ('Speaker 2',\n", " \"Well, it's been enlightening to learn more about Llama 3. The advancements in AI are truly remarkable, and it's exciting to think about what's possible with models like this. (concludes) Thanks for having me on the show!\"),\n", " ('Speaker 1',\n", " 'Thank you for joining us on this episode. It was a pleasure to explore the world of Llama 3 with you.')]" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import ast\n", "ast.literal_eval(PODCAST_TEXT)" ] }, { "cell_type": "markdown", "id": "5c7b4c11-5526-4b13-b0a2-8ca541c475aa", "metadata": {}, "source": [ "#### Generating the Final Podcast\n", "\n", "Finally, we can loop over the Tuple and use our helper functions to generate the audio" ] }, { "cell_type": "code", "execution_count": 39, "id": "c640fead-2017-478f-a7b6-1b96105d45d6", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Generating podcast segments: 6%|███▉ | 1/16 [00:20<05:02, 20.16s/segment]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n", "Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.\n", "Generating podcast segments: 19%|███████████▋ | 3/16 [01:02<04:33, 21.06s/segment]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n", "Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.\n", "Generating podcast segments: 31%|███████████████████▍ | 5/16 [01:41<03:30, 19.18s/segment]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n", "Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.\n", "Generating podcast segments: 44%|███████████████████████████▏ | 7/16 [02:26<03:05, 20.57s/segment]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n", "Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.\n", "Generating podcast segments: 56%|██████████████████████████████████▉ | 9/16 [03:04<02:13, 19.10s/segment]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n", "Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.\n", "Generating podcast segments: 69%|█████████████████████████████████████████▉ | 11/16 [03:42<01:31, 18.27s/segment]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n", "Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.\n", "Generating podcast segments: 81%|█████████████████████████████████████████████████▌ | 13/16 [04:17<00:50, 16.99s/segment]The attention mask and the pad token id were not set. 
As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n", "Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.\n", "Generating podcast segments: 94%|█████████████████████████████████████████████████████████▏ | 15/16 [04:49<00:15, 15.83s/segment]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n", "Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.\n", "Generating podcast segments: 100%|█████████████████████████████████████████████████████████████| 16/16 [05:13<00:00, 19.57s/segment]\n" ] } ], "source": [ "final_audio = None\n", "\n", "for speaker, text in tqdm(ast.literal_eval(PODCAST_TEXT), desc=\"Generating podcast segments\", unit=\"segment\"):\n", " if speaker == \"Speaker 1\":\n", " audio_arr, rate = generate_speaker1_audio(text)\n", " else: # Speaker 2\n", " audio_arr, rate = generate_speaker2_audio(text)\n", " \n", " # Convert to AudioSegment (pydub will handle sample rate conversion automatically)\n", " audio_segment = numpy_to_audio_segment(audio_arr, rate)\n", " \n", " # Add to final audio\n", " if final_audio is None:\n", " final_audio = audio_segment\n", " else:\n", " final_audio += audio_segment" ] }, { "cell_type": "markdown", "id": "4fbb2228-8023-44c4-aafe-d6e1d22ff8e4", "metadata": {}, "source": [ "### Output the Podcast\n", "\n", "We can now save this as a mp3 file" ] }, { "cell_type": "code", "execution_count": 40, "id": "2eeffdb7-875a-45ec-bdd8-c8c5b34f5a7b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<_io.BufferedRandom name='_podcast.mp3'>" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "final_audio.export(\"./resources/_podcast.mp3\", \n", " format=\"mp3\", \n", " bitrate=\"192k\",\n", " parameters=[\"-q:a\", \"0\"])" ] }, { "cell_type": "markdown", "id": "c7ce5836", "metadata": {}, "source": [ "### Suggested Next Steps:\n", "\n", "- Experiment with the prompts: Please feel free to experiment with the SYSTEM_PROMPT in the notebooks\n", "- Extend workflow beyond two speakers\n", "- Test other TTS Models\n", "- Experiment with Speech Enhancer models as a step 5." ] }, { "cell_type": "code", "execution_count": null, "id": "26cc56c5-b9c9-47c2-b860-0ea9f05c79af", "metadata": {}, "outputs": [], "source": [ "#fin" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.4" } }, "nbformat": 4, "nbformat_minor": 5 }