{
"cells": [
{
"cell_type": "markdown",
"id": "45a1b5d7-fd98-4fa2-9bea-e68c514b9245",
"metadata": {},
"source": [
"## Notebook 4: TTS Workflow\n",
"\n",
"We have the exact podcast transcripts ready now to generate our audio for the Podcast.\n",
"\n",
"In this notebook, we will learn how to generate Audio using both `suno/bark` and `parler-tts/parler-tts-mini-v1` models first. \n",
"\n",
"After that, we will use the output from Notebook 3 to generate our complete podcast\n",
"\n",
"Note: Please feel free to extend this notebook with newer models. The above two were chosen after some tests using a sample prompt."
]
},
{
"cell_type": "markdown",
"id": "534e5f94-66d0-459d-ab01-8599905d8e1b",
"metadata": {},
"source": [
"⚠️ Warning: This notebook likes have `transformers` version to be `4.43.3` or earlier so we will downgrade our environment to make sure things run smoothly"
]
},
{
"cell_type": "markdown",
"id": "efd866ac-8ea6-486d-96cd-7594a8c329e0",
"metadata": {},
"source": [
"Credit: [This](https://colab.research.google.com/drive/1dWWkZzvu7L9Bunq9zvD-W02RFUXoW-Pd?usp=sharing#scrollTo=68QtoUqPWdLk) Colab was used for starter code\n"
]
},
{
"cell_type": "markdown",
"id": "a4e2c0ee-7527-46e4-9c07-e6dac34376e5",
"metadata": {},
"source": [
"We can install these packages for speedups"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "3ee4811a-50a1-4030-8312-54fccddc221b",
"metadata": {},
"outputs": [],
"source": [
"#!pip3 install optimum\n",
"#!pip install -U flash-attn --no-build-isolation\n",
"#!pip install transformers==4.43.3"
]
},
{
"cell_type": "markdown",
"id": "07672295-af30-4b4b-b11c-44ca938436cd",
"metadata": {},
"source": [
"Let's import the necessary frameworks"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "89d75859-e0f9-40e3-931d-64aa3d273f49",
"metadata": {},
"outputs": [],
"source": [
"from IPython.display import Audio\n",
"import IPython.display as ipd\n",
"from tqdm import tqdm"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "f442758d-c48f-48ac-a4b0-558695290aa9",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Flash attention 2 is not installed\n"
]
}
],
"source": [
"from transformers import BarkModel, AutoProcessor, AutoTokenizer\n",
"import torch\n",
"import json\n",
"import numpy as np\n",
"from parler_tts import ParlerTTSForConditionalGeneration"
]
},
{
"cell_type": "markdown",
"id": "31ba1903-59c8-4004-bb39-1761cd3d140e",
"metadata": {},
"source": [
"### Testing the Audio Generation"
]
},
{
"cell_type": "markdown",
"id": "2523c565-bb35-4fae-bdcb-cba11ef0b572",
"metadata": {},
"source": [
"Let's try generating audio using both the models to understand how they work. \n",
"\n",
"Note the subtle differences in prompting:\n",
"- Parler: Takes in a `description` prompt that can be used to set the speaker profile and generation speeds\n",
"- Suno: Takes in expression words like `[sigh]`, `[laughs]` etc. You can find more notes on the experiments that were run for this notebook in the [TTS_Notes.md](./TTS_Notes.md) file to learn more."
]
},
{
"cell_type": "markdown",
"id": "50b62df5-5ea3-4913-832a-da59f7cf8de2",
"metadata": {},
"source": [
"Please set `device = \"cuda\"` below if you're using a single GPU node."
]
},
{
"cell_type": "markdown",
"id": "309d0678-880b-44cb-a54a-9408b3c8d644",
"metadata": {},
"source": [
"#### Parler Model\n",
"\n",
"Let's try using the Parler Model first and generate a short segment with speaker Laura's voice"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "4e84ed3f-336b-4f45-b098-ce477929fa8a",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Set up device\n",
"device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
"\n",
"# Load model and tokenizer\n",
"model = ParlerTTSForConditionalGeneration.from_pretrained(\"parler-tts/parler-tts-mini-v1\").to(device)\n",
"tokenizer = AutoTokenizer.from_pretrained(\"parler-tts/parler-tts-mini-v1\")\n",
"\n",
"# Define text and description\n",
"text_prompt = \"\"\"\n",
"Exactly! And the distillation part is where you take a LARGE-model,and compress-it down into a smaller, more efficient model that can run on devices with limited resources.\n",
"\"\"\"\n",
"description = \"\"\"\n",
"Laura's voice is expressive and dramatic in delivery, speaking at a fast pace with a very close recording that almost has no background noise.\n",
"\"\"\"\n",
"# Tokenize inputs\n",
"input_ids = tokenizer(description, return_tensors=\"pt\").input_ids.to(device)\n",
"prompt_input_ids = tokenizer(text_prompt, return_tensors=\"pt\").input_ids.to(device)\n",
"\n",
"# Generate audio\n",
"generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)\n",
"audio_arr = generation.cpu().numpy().squeeze()\n",
"\n",
"# Play audio in notebook\n",
"ipd.Audio(audio_arr, rate=model.config.sampling_rate)"
]
},
{
"cell_type": "markdown",
"id": "03c2abc6-4a1d-4318-af6f-0257dd66a691",
"metadata": {},
"source": [
"#### Bark Model\n",
"\n",
"Amazing, let's try the same with bark now:\n",
"- We will set the `voice_preset` to our favorite speaker\n",
"- This time we can include expression prompts inside our generation prompt\n",
"- Note you can CAPTILISE words to make the model emphasise on these\n",
"- You can add hyphens to make the model pause on certain words"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "a20730f0-13dd-48b4-80b6-7c6ef05a0cc4",
"metadata": {},
"outputs": [],
"source": [
"voice_preset = \"v2/en_speaker_6\"\n",
"sampling_rate = 24000"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "246d0cbc-c5d8-4f34-b8e4-dd18a624cdad",
"metadata": {},
"outputs": [],
"source": [
"device = \"cuda:7\"\n",
"\n",
"processor = AutoProcessor.from_pretrained(\"suno/bark\")\n",
"\n",
"#model = model.to_bettertransformer()\n",
"#model = BarkModel.from_pretrained(\"suno/bark\", torch_dtype=torch.float16, attn_implementation=\"flash_attention_2\").to(device)\n",
"model = BarkModel.from_pretrained(\"suno/bark\", torch_dtype=torch.float16).to(device)#.to_bettertransformer()"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "5986510c-4a09-4c24-9344-c98fa16947d9",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
"Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"text_prompt = \"\"\"\n",
"Exactly! [sigh] And the distillation part is where you take a LARGE-model,and compress-it down into a smaller, more efficient model that can run on devices with limited resources.\n",
"\"\"\"\n",
"inputs = processor(text_prompt, voice_preset=voice_preset).to(device)\n",
"\n",
"speech_output = model.generate(**inputs, temperature = 0.9, semantic_temperature = 0.8)\n",
"Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)"
]
},
{
"cell_type": "markdown",
"id": "dd650176-ab17-47a7-8e02-10dc9ca9e852",
"metadata": {},
"source": [
"## Bringing it together: Making the Podcast\n",
"\n",
"Okay now that we understand everything-we can now use the complete pipeline to generate the entire podcast\n",
"\n",
"Let's load in our pickle file from earlier and proceed:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "b1dca30f-1226-4002-8e02-fd97e78ecc83",
"metadata": {},
"outputs": [],
"source": [
"import pickle\n",
"\n",
"with open('./resources/podcast_ready_data.pkl', 'rb') as file:\n",
" PODCAST_TEXT = pickle.load(file)"
]
},
{
"cell_type": "markdown",
"id": "c10a3d50-08a7-4786-8e28-8fb6b8b048ab",
"metadata": {},
"source": [
"Let's define load in the bark model and set it's hyper-parameters for discussions"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "8db78921-36c7-4388-b1d9-78dff4f972c2",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/sanyambhutani/.conda/envs/final-checking-meta/lib/python3.11/site-packages/torch/nn/utils/weight_norm.py:143: FutureWarning: `torch.nn.utils.weight_norm` is deprecated in favor of `torch.nn.utils.parametrizations.weight_norm`.\n",
" WeightNorm.apply(module, name, dim)\n",
"/home/sanyambhutani/.conda/envs/final-checking-meta/lib/python3.11/site-packages/transformers/models/encodec/modeling_encodec.py:120: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).\n",
" self.register_buffer(\"padding_total\", torch.tensor(kernel_size - stride, dtype=torch.int64), persistent=False)\n"
]
}
],
"source": [
"bark_processor = AutoProcessor.from_pretrained(\"suno/bark\")\n",
"bark_model = BarkModel.from_pretrained(\"suno/bark\", torch_dtype=torch.float16).to(\"cuda:3\")\n",
"bark_sampling_rate = 24000"
]
},
{
"cell_type": "markdown",
"id": "e03e313a-c727-4489-876b-db71920d49cd",
"metadata": {},
"source": [
"Now for the Parler model:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "6c04a04d-3686-4932-bd45-72d7f518c602",
"metadata": {},
"outputs": [],
"source": [
"parler_model = ParlerTTSForConditionalGeneration.from_pretrained(\"parler-tts/parler-tts-mini-v1\").to(\"cuda:3\")\n",
"parler_tokenizer = AutoTokenizer.from_pretrained(\"parler-tts/parler-tts-mini-v1\")"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "efbe1434-37f3-4f77-a5fb-b39625f5e676",
"metadata": {},
"outputs": [],
"source": [
"speaker1_description = \"\"\"\n",
"Laura's voice is expressive and dramatic in delivery, speaking at a moderately fast pace with a very close recording that almost has no background noise.\n",
"\"\"\""
]
},
{
"cell_type": "markdown",
"id": "56f6fa24-fe07-4702-850f-0428bfadd2dc",
"metadata": {},
"source": [
"We will concatenate the generated segments of audio and also their respective sampling rates since we will require this to generate the final audio"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "cebfd0f9-8703-4fce-b207-014c6e16cc8a",
"metadata": {},
"outputs": [],
"source": [
"generated_segments = []\n",
"sampling_rates = [] # We'll need to keep track of sampling rates for each segment"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "9b333e36-9579-4237-b329-e2911229be42",
"metadata": {},
"outputs": [],
"source": [
"device=\"cuda:3\""
]
},
{
"cell_type": "markdown",
"id": "d7b2490c-012f-4e35-8890-cd6a5eaf4cc4",
"metadata": {},
"source": [
"Function generate text for speaker 1"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "50323f9e-09ed-4c8c-9020-1511ab775969",
"metadata": {},
"outputs": [],
"source": [
"def generate_speaker1_audio(text):\n",
" \"\"\"Generate audio using ParlerTTS for Speaker 1\"\"\"\n",
" input_ids = parler_tokenizer(speaker1_description, return_tensors=\"pt\").input_ids.to(device)\n",
" prompt_input_ids = parler_tokenizer(text, return_tensors=\"pt\").input_ids.to(device)\n",
" generation = parler_model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)\n",
" audio_arr = generation.cpu().numpy().squeeze()\n",
" return audio_arr, parler_model.config.sampling_rate"
]
},
{
"cell_type": "markdown",
"id": "3fb5dac8-30a6-4aa2-a983-b5f1df3d56af",
"metadata": {},
"source": [
"Function to generate text for speaker 2"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "0e6120ba-5190-4739-97ca-4e8b44dddc5e",
"metadata": {},
"outputs": [],
"source": [
"def generate_speaker2_audio(text):\n",
" \"\"\"Generate audio using Bark for Speaker 2\"\"\"\n",
" inputs = bark_processor(text, voice_preset=\"v2/en_speaker_6\").to(device)\n",
" speech_output = bark_model.generate(**inputs, temperature=0.9, semantic_temperature=0.8)\n",
" audio_arr = speech_output[0].cpu().numpy()\n",
" return audio_arr, bark_sampling_rate\n"
]
},
{
"cell_type": "markdown",
"id": "7ea67fd1-9405-4fce-b08b-df5e11d0bf37",
"metadata": {},
"source": [
"Helper function to convert the numpy output from the models into audio"
]
},
{
"cell_type": "code",
"execution_count": 38,
"id": "4482d864-2806-4410-b239-da4b2d0d1340",
"metadata": {},
"outputs": [],
"source": [
"def numpy_to_audio_segment(audio_arr, sampling_rate):\n",
" \"\"\"Convert numpy array to AudioSegment\"\"\"\n",
" # Convert to 16-bit PCM\n",
" audio_int16 = (audio_arr * 32767).astype(np.int16)\n",
" \n",
" # Create WAV file in memory\n",
" byte_io = io.BytesIO()\n",
" wavfile.write(byte_io, sampling_rate, audio_int16)\n",
" byte_io.seek(0)\n",
" \n",
" # Convert to AudioSegment\n",
" return AudioSegment.from_wav(byte_io)"
]
},
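{
"cell_type": "markdown",
"id": "b7f3a2c1-1d2e-4c5a-9f10-2a6e8d4c9b01",
"metadata": {},
"source": [
"As an optional sanity check, we can convert one of the test segments from above into an `AudioSegment` and inspect its duration. This is a minimal sketch and assumes `audio_arr` from the Parler test cell is still in memory."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c1d2e3f4-5a6b-4c7d-8e9f-0a1b2c3d4e5f",
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check: assumes `audio_arr` from the Parler test cell above is still defined\n",
"test_segment = numpy_to_audio_segment(audio_arr, parler_model.config.sampling_rate)\n",
"print(f'Segment duration: {len(test_segment) / 1000:.1f} seconds')  # pydub lengths are in milliseconds"
]
},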
{
"cell_type": "code",
"execution_count": 16,
"id": "c4dbb3b3-cdd3-4a1f-a60a-661e64a67f53",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'[\\n (\"Speaker 1\", \"Welcome to this week\\'s episode of AI Insights, where we explore the latest developments in the field of artificial intelligence. Today, we\\'re going to dive into the fascinating world of knowledge distillation, a methodology that transfers advanced capabilities from leading proprietary Large Language Models, or LLMs, to their open-source counterparts. Joining me on this journey is my co-host, who\\'s new to the topic, and I\\'ll be guiding them through the ins and outs of knowledge distillation. So, let\\'s get started!\"),\\n (\"Speaker 2\", \"Sounds exciting! I\\'ve heard of knowledge distillation, but I\\'m not entirely sure what it\\'s all about. Can you give me a brief overview?\"),\\n (\"Speaker 1\", \"Of course! Knowledge distillation is a technique that enables the transfer of knowledge from a large, complex model, like GPT-4 or Gemini, to a smaller, more efficient model, like LLaMA or Mistral. This process allows the smaller model to learn from the teacher model\\'s output, enabling it to acquire similar capabilities. Think of it like a master chef teaching their apprentice the art of cooking – the apprentice doesn\\'t need to start from scratch.\"),\\n (\"Speaker 2\", \"Hmm, that sounds interesting. So, it\\'s like a teacher-student relationship, where the teacher model guides the student model to learn from its output... Umm, can you explain this process in more detail?\"),\\n (\"Speaker 1\", \"The distillation process involves several stages, including knowledge elicitation, knowledge storage, knowledge inference, and knowledge application. The teacher model shares its knowledge with the student model, which then learns to emulate the teacher\\'s output behavior.\"),\\n (\"Speaker 2\", \"That makes sense, I think. So, it\\'s like the teacher model is saying, \\'Hey, student model, learn from my output, and try to produce similar results.\\' But what about the different approaches to knowledge distillation? I\\'ve heard of supervised fine-tuning, divergence and similarity, reinforcement learning, and rank optimization.\"),\\n (\"Speaker 1\", \"Ah, yes! Those are all valid approaches to knowledge distillation. Supervised fine-tuning involves training the student model on a smaller dataset, while divergence and similarity focus on aligning the hidden states or features of the student model with those of the teacher model. Reinforcement learning and rank optimization are more advanced methods that involve feedback from the teacher model to train the student model. Imagine you\\'re trying to tune a piano – you need to adjust the keys to produce the perfect sound.\"),\\n (\"Speaker 2\", \"[laughs] Okay, I think I\\'m starting to get it. But can you give me some examples of how these approaches are used in real-world applications? I\\'m thinking of something like a language model that can generate human-like text...\"),\\n (\"Speaker 1\", \"Of course! For instance, the Vicuna model uses supervised fine-tuning to distill knowledge from the teacher model, while the UltraChat model employs a combination of knowledge distillation and reinforcement learning to create a powerful chat model.\"),\\n (\"Speaker 2\", \"Wow, that\\'s fascinating! I\\'m starting to see how knowledge distillation can be applied to various domains, like natural language processing, computer vision, and even multimodal tasks... Umm, can we talk more about multimodal tasks? That sounds really interesting.\"),\\n (\"Speaker 1\", \"Exactly! 
Knowledge distillation has far-reaching implications for AI research and applications. It enables the transfer of knowledge across different models, architectures, and domains, making it a powerful tool for building more efficient and effective AI systems.\"),\\n (\"Speaker 2\", \"[sigh] I\\'m starting to see the bigger picture now. Knowledge distillation is not just a technique; it\\'s a way to democratize access to advanced AI capabilities and foster innovation across a broader spectrum of applications and users... Hmm, that\\'s a pretty big deal.\"),\\n (\"Speaker 1\", \"That\\'s right! And as we continue to explore the frontiers of AI, knowledge distillation will play an increasingly important role in shaping the future of artificial intelligence.\"),\\n (\"Speaker 2\", \"Well, I\\'m excited to learn more about knowledge distillation and its applications. Thanks for guiding me through this journey, and I\\'m looking forward to our next episode!\"),\\n (\"Speaker 1\", \"Thank you for joining me on this episode of AI Insights! If you want to learn more about knowledge distillation and its applications, be sure to check out our resources section, where we\\'ve curated a list of papers, articles, and tutorials to help you get started.\"),\\n (\"Speaker 2\", \"And if you\\'re interested in building your own AI model using knowledge distillation, maybe we can even do a follow-up episode on how to get started... Umm, let\\'s discuss that further next time.\"),\\n]'"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"PODCAST_TEXT"
]
},
{
"cell_type": "markdown",
"id": "485b4c9e-379f-4004-bdd0-93a53f3f7ee0",
"metadata": {},
"source": [
"Most of the times we argue in life that Data Structures isn't very useful. However, this time the knowledge comes in handy. \n",
"\n",
"We will take the string from the pickle file and load it in as a Tuple with the help of `ast.literal_eval()`"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "9946e46c-3457-4bf9-9042-b89fa8f5b47a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('Speaker 1',\n",
" \"Welcome to this week's episode of AI Insights, where we explore the latest developments in the field of artificial intelligence. Today, we're going to dive into the fascinating world of knowledge distillation, a methodology that transfers advanced capabilities from leading proprietary Large Language Models, or LLMs, to their open-source counterparts. Joining me on this journey is my co-host, who's new to the topic, and I'll be guiding them through the ins and outs of knowledge distillation. So, let's get started!\"),\n",
" ('Speaker 2',\n",
" \"Sounds exciting! I've heard of knowledge distillation, but I'm not entirely sure what it's all about. Can you give me a brief overview?\"),\n",
" ('Speaker 1',\n",
" \"Of course! Knowledge distillation is a technique that enables the transfer of knowledge from a large, complex model, like GPT-4 or Gemini, to a smaller, more efficient model, like LLaMA or Mistral. This process allows the smaller model to learn from the teacher model's output, enabling it to acquire similar capabilities. Think of it like a master chef teaching their apprentice the art of cooking – the apprentice doesn't need to start from scratch.\"),\n",
" ('Speaker 2',\n",
" \"Hmm, that sounds interesting. So, it's like a teacher-student relationship, where the teacher model guides the student model to learn from its output... Umm, can you explain this process in more detail?\"),\n",
" ('Speaker 1',\n",
" \"The distillation process involves several stages, including knowledge elicitation, knowledge storage, knowledge inference, and knowledge application. The teacher model shares its knowledge with the student model, which then learns to emulate the teacher's output behavior.\"),\n",
" ('Speaker 2',\n",
" \"That makes sense, I think. So, it's like the teacher model is saying, 'Hey, student model, learn from my output, and try to produce similar results.' But what about the different approaches to knowledge distillation? I've heard of supervised fine-tuning, divergence and similarity, reinforcement learning, and rank optimization.\"),\n",
" ('Speaker 1',\n",
" \"Ah, yes! Those are all valid approaches to knowledge distillation. Supervised fine-tuning involves training the student model on a smaller dataset, while divergence and similarity focus on aligning the hidden states or features of the student model with those of the teacher model. Reinforcement learning and rank optimization are more advanced methods that involve feedback from the teacher model to train the student model. Imagine you're trying to tune a piano – you need to adjust the keys to produce the perfect sound.\"),\n",
" ('Speaker 2',\n",
" \"[laughs] Okay, I think I'm starting to get it. But can you give me some examples of how these approaches are used in real-world applications? I'm thinking of something like a language model that can generate human-like text...\"),\n",
" ('Speaker 1',\n",
" 'Of course! For instance, the Vicuna model uses supervised fine-tuning to distill knowledge from the teacher model, while the UltraChat model employs a combination of knowledge distillation and reinforcement learning to create a powerful chat model.'),\n",
" ('Speaker 2',\n",
" \"Wow, that's fascinating! I'm starting to see how knowledge distillation can be applied to various domains, like natural language processing, computer vision, and even multimodal tasks... Umm, can we talk more about multimodal tasks? That sounds really interesting.\"),\n",
" ('Speaker 1',\n",
" 'Exactly! Knowledge distillation has far-reaching implications for AI research and applications. It enables the transfer of knowledge across different models, architectures, and domains, making it a powerful tool for building more efficient and effective AI systems.'),\n",
" ('Speaker 2',\n",
" \"[sigh] I'm starting to see the bigger picture now. Knowledge distillation is not just a technique; it's a way to democratize access to advanced AI capabilities and foster innovation across a broader spectrum of applications and users... Hmm, that's a pretty big deal.\"),\n",
" ('Speaker 1',\n",
" \"That's right! And as we continue to explore the frontiers of AI, knowledge distillation will play an increasingly important role in shaping the future of artificial intelligence.\"),\n",
" ('Speaker 2',\n",
" \"Well, I'm excited to learn more about knowledge distillation and its applications. Thanks for guiding me through this journey, and I'm looking forward to our next episode!\"),\n",
" ('Speaker 1',\n",
" \"Thank you for joining me on this episode of AI Insights! If you want to learn more about knowledge distillation and its applications, be sure to check out our resources section, where we've curated a list of papers, articles, and tutorials to help you get started.\"),\n",
" ('Speaker 2',\n",
" \"And if you're interested in building your own AI model using knowledge distillation, maybe we can even do a follow-up episode on how to get started... Umm, let's discuss that further next time.\")]"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import ast\n",
"ast.literal_eval(PODCAST_TEXT)"
]
},
{
"cell_type": "markdown",
"id": "5c7b4c11-5526-4b13-b0a2-8ca541c475aa",
"metadata": {},
"source": [
"#### Generating the Final Podcast\n",
"\n",
"Finally, we can loop over the Tuple and use our helper functions to generate the audio"
]
},
{
"cell_type": "code",
"execution_count": 39,
"id": "c640fead-2017-478f-a7b6-1b96105d45d6",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Generating podcast segments: 6%|███▉ | 1/16 [00:20<05:02, 20.16s/segment]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
"Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.\n",
"Generating podcast segments: 19%|███████████▋ | 3/16 [01:02<04:33, 21.06s/segment]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
"Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.\n",
"Generating podcast segments: 31%|███████████████████▍ | 5/16 [01:41<03:30, 19.18s/segment]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
"Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.\n",
"Generating podcast segments: 44%|███████████████████████████▏ | 7/16 [02:26<03:05, 20.57s/segment]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
"Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.\n",
"Generating podcast segments: 56%|██████████████████████████████████▉ | 9/16 [03:04<02:13, 19.10s/segment]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
"Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.\n",
"Generating podcast segments: 69%|█████████████████████████████████████████▉ | 11/16 [03:42<01:31, 18.27s/segment]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
"Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.\n",
"Generating podcast segments: 81%|█████████████████████████████████████████████████▌ | 13/16 [04:17<00:50, 16.99s/segment]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
"Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.\n",
"Generating podcast segments: 94%|█████████████████████████████████████████████████████████▏ | 15/16 [04:49<00:15, 15.83s/segment]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
"Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.\n",
"Generating podcast segments: 100%|█████████████████████████████████████████████████████████████| 16/16 [05:13<00:00, 19.57s/segment]\n"
]
}
],
"source": [
"final_audio = None\n",
"\n",
"for speaker, text in tqdm(ast.literal_eval(PODCAST_TEXT), desc=\"Generating podcast segments\", unit=\"segment\"):\n",
" if speaker == \"Speaker 1\":\n",
" audio_arr, rate = generate_speaker1_audio(text)\n",
" else: # Speaker 2\n",
" audio_arr, rate = generate_speaker2_audio(text)\n",
" \n",
" # Convert to AudioSegment (pydub will handle sample rate conversion automatically)\n",
" audio_segment = numpy_to_audio_segment(audio_arr, rate)\n",
" \n",
" # Add to final audio\n",
" if final_audio is None:\n",
" final_audio = audio_segment\n",
" else:\n",
" final_audio += audio_segment"
]
},
{
"cell_type": "markdown",
"id": "4fbb2228-8023-44c4-aafe-d6e1d22ff8e4",
"metadata": {},
"source": [
"### Output the Podcast\n",
"\n",
"We can now save this as a mp3 file"
]
},
{
"cell_type": "code",
"execution_count": 40,
"id": "2eeffdb7-875a-45ec-bdd8-c8c5b34f5a7b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<_io.BufferedRandom name='_podcast.mp3'>"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"final_audio.export(\"./resources/_podcast.mp3\", \n",
" format=\"mp3\", \n",
" bitrate=\"192k\",\n",
" parameters=[\"-q:a\", \"0\"])"
]
},
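{
"cell_type": "markdown",
"id": "d4f8b1a9-2c3e-4b5f-8a6d-7e9c0b1a2d3f",
"metadata": {},
"source": [
"Optionally, we can listen to the exported file directly in the notebook (this assumes the export cell above ran successfully):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e5a6b7c8-9d0e-4f1a-b2c3-d4e5f6a7b8c9",
"metadata": {},
"outputs": [],
"source": [
"# Optional: play the exported podcast inside the notebook\n",
"ipd.Audio(\"./resources/_podcast.mp3\")"
]
},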
{
"cell_type": "markdown",
"id": "c7ce5836",
"metadata": {},
"source": [
"### Suggested Next Steps:\n",
"\n",
"- Experiment with the prompts: Please feel free to experiment with the SYSTEM_PROMPT in the notebooks\n",
"- Extend workflow beyond two speakers\n",
"- Test other TTS Models\n",
"- Experiment with Speech Enhancer models as a step 5."
]
},
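{
"cell_type": "markdown",
"id": "f1a2b3c4-d5e6-4f7a-8b9c-0d1e2f3a4b5c",
"metadata": {},
"source": [
"For the multi-speaker extension, here is a minimal, hedged sketch of how the generation loop could dispatch on the speaker label. The third voice and its `generate_speaker3_audio` helper are hypothetical and not defined in this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a9b8c7d6-e5f4-4a3b-9c2d-1e0f9a8b7c6d",
"metadata": {},
"outputs": [],
"source": [
"# Sketch only: map each speaker label to a generator function.\n",
"# A hypothetical generate_speaker3_audio (e.g. Bark with another voice preset,\n",
"# or Parler with a different description) would be added alongside the two above.\n",
"SPEAKER_GENERATORS = {\n",
"    \"Speaker 1\": generate_speaker1_audio,\n",
"    \"Speaker 2\": generate_speaker2_audio,\n",
"    # \"Speaker 3\": generate_speaker3_audio,  # hypothetical third voice\n",
"}\n",
"\n",
"# The final loop then becomes speaker-agnostic:\n",
"# final_audio = None\n",
"# for speaker, text in tqdm(ast.literal_eval(PODCAST_TEXT), desc=\"Generating podcast segments\", unit=\"segment\"):\n",
"#     audio_arr, rate = SPEAKER_GENERATORS[speaker](text)\n",
"#     audio_segment = numpy_to_audio_segment(audio_arr, rate)\n",
"#     final_audio = audio_segment if final_audio is None else final_audio + audio_segment"
]
},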
{
"cell_type": "code",
"execution_count": null,
"id": "26cc56c5-b9c9-47c2-b860-0ea9f05c79af",
"metadata": {},
"outputs": [],
"source": [
"#fin"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}