Notebook llama (#739)

Sanyam Bhutani 6 months ago
parent
commit
e1599969d5

+ 93 - 0
recipes/quickstart/NotebookLlama/README.md

@@ -0,0 +1,93 @@
+## NotebookLlama: An Open Source version of NotebookLM
+
+![NotebookLlama](./resources/Outline.jpg)
+
+[Listen to audio from the example here](./resources/_podcast.mp3)
+
+This is a guided series of tutorials/notebooks that can be used as a reference or course for building a PDF-to-Podcast workflow.
+
+You will also learn from our experiments with Text-to-Speech models.
+
+It assumes zero knowledge of LLMs, prompting, and audio models; everything is covered in the respective notebooks.
+
+### Outline:
+
+Here is the step-by-step thought (pun intended) process for the task:
+
+- Step 1: Pre-process PDF: Use `Llama-3.2-1B-Instruct` to pre-process the PDF and save it in a `.txt` file.
+- Step 2: Transcript Writer: Use the `Llama-3.1-70B-Instruct` model to write a podcast transcript from the text.
+- Step 3: Dramatic Re-Writer: Use the `Llama-3.1-8B-Instruct` model to make the transcript more dramatic.
+- Step 4: Text-To-Speech Workflow: Use `parler-tts/parler-tts-mini-v1` and `suno/bark` to generate a conversational podcast.
+
+Note 1: In Step 1, we prompt the 1B model not to modify or summarize the text, but strictly to clean up extra characters or garbage characters that might get picked up due to PDF encoding. Please see the prompt in Notebook 1 for more details.
+
+Note 2: For Step 2, you can also use the `Llama-3.1-8B-Instruct` model; we recommend experimenting to see if you notice any differences. The 70B model was used here because it gave slightly more creative podcast transcripts on the tested examples.
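+
+To make the flow concrete, here is a minimal sketch of the four stages chained together (the wrapper function names are hypothetical; each stage lives in its own notebook):
+
+```
+# Hypothetical wrappers around the four notebooks:
+clean_text = preprocess_pdf("paper.pdf")      # Step 1: Llama-3.2-1B-Instruct
+transcript = write_transcript(clean_text)     # Step 2: Llama-3.1-70B-Instruct
+dramatized = rewrite_transcript(transcript)   # Step 3: Llama-3.1-8B-Instruct
+generate_podcast_audio(dramatized)            # Step 4: parler-tts + suno/bark
+```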
+
+### Detailed steps on running the notebook:
+
+Requirements: a GPU server or an API provider for using the 70B, 8B, and 1B Llama models.
+For running the 70B model, you will need a GPU with around 140GB of aggregate memory to run inference in bfloat16 precision.
+
+Note: For our GPU-poor friends, you can also use the 8B and lower models for the entire pipeline. There is no strong recommendation: the pipeline below is simply what worked best in the first few tests. You should try it and see what works best for you!
+
+- Before getting started, please make sure to log in using `huggingface-cli` and then launch your Jupyter notebook server, to make sure you are able to download the Llama models.
+
+You'll need your Hugging Face access token, which you can get from your Settings page [here](https://huggingface.co/settings/tokens). Then run `huggingface-cli login` and paste your Hugging Face access token to complete the login, so the scripts can download Hugging Face models as needed.
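+
+If you prefer to authenticate from inside the notebook instead, the `huggingface_hub` library offers an equivalent `login()` helper (a minimal sketch):
+
+```
+from huggingface_hub import login
+
+# Prompts interactively for your access token; you can also pass it directly: login(token="hf_...")
+login()
+```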
+
+- First, please install the requirements by running the following commands:
+
+```
+git clone https://github.com/meta-llama/llama-recipes
+cd llama-recipes/recipes/quickstart/NotebookLlama/
+pip install -r requirements.txt
+```
+
+- Notebook 1:
+
+This notebook processes the PDF using the new feather-light model and saves the output to a `.txt` file.
+
+Update the first cell with a link to the PDF you would like to use. It can be any link, but please remember to update the first cell of the notebook with the right one.
+
+Please try changing the prompts for the `Llama-3.2-1B-Instruct` model and see if you can improve results.
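+
+For reference, the raw extraction before the model clean-up can be as simple as this sketch using `PyPDF2` (which is in `requirements.txt`); the exact logic and prompts live in Notebook 1:
+
+```
+from PyPDF2 import PdfReader
+
+# The example paper shipped in resources/
+reader = PdfReader("./resources/2402.13116v4.pdf")
+raw_text = "\n".join((page.extract_text() or "") for page in reader.pages)
+# raw_text is then passed to Llama-3.2-1B-Instruct for clean-up only (no summarizing)
+```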
+
+- Notebook 2:
+
+This notebook will take in the processed output from Notebook 1 and creatively convert it into a podcast transcript using the `Llama-3.1-70B-Instruct` model. If you are GPU rich, please feel free to test with the 405B model!
+
+Please try experimenting with the system prompts for the model to see if you can improve the results, and also try the 8B model here to see if there is a huge difference!
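+
+Swapping models is a one-line change, since the generation code stays the same; a minimal sketch, assuming the same `transformers` pipeline setup used in the later notebooks:
+
+```
+import torch
+import transformers
+
+MODEL = "meta-llama/Llama-3.1-70B-Instruct"  # or "meta-llama/Llama-3.1-8B-Instruct"
+
+pipeline = transformers.pipeline(
+    "text-generation",
+    model=MODEL,
+    model_kwargs={"torch_dtype": torch.bfloat16},
+    device_map="auto",
+)
+```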
+
+- Notebook 3:
+
+This notebook takes the transcript from earlier and prompts `Llama-3.1-8B-Instruct` to add more dramatization and interruptions in the conversations. 
+
+There is also a key factor here: we have the model return the conversation as a list of tuples, which makes our lives easier later. Yes, studying Data Structures 101 was actually useful for once!
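+
+Since the model returns the conversation as a string that looks like a Python list of tuples, one way to recover the structure is `ast.literal_eval` (a sketch under that assumption; the notebooks may parse it differently):
+
+```
+import ast
+
+# e.g. generated = outputs[0]["generated_text"][-1]["content"]
+generated = '[("Speaker 1", "Welcome to our podcast..."), ("Speaker 2", "Hi, excited to be here!")]'
+conversation = ast.literal_eval(generated)  # -> list of (speaker, line) tuples
+for speaker, line in conversation:
+    print(f"{speaker}: {line}")
+```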
+
+For our TTS logic, we use two different models that behave differently with certain prompts. So we prompt the model to add specifics for each speaker accordingly.
+
+Please again try changing the system prompt and see if you can improve the results. We encourage testing the feather-light 3B and 1B models at this stage as well.
+
+- Notebook 4:
+
+Finally, we take the results from the last notebook and convert them into a podcast. We use the `parler-tts/parler-tts-mini-v1` and `suno/bark` models for the conversation.
+
+The speakers and the prompt for the Parler model were decided based on experimentation and suggestions from the model authors. Please try experimenting; you can find more details in the resources section.
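+
+At a high level, the notebook walks the (speaker, line) tuples and routes each line to the matching TTS engine. A rough sketch (the `generate_*` helpers are hypothetical stand-ins for the Parler and Bark calls described in `TTS_Notes.md`; the two engines output different sampling rates, so real code must resample before joining):
+
+```
+segments = []
+for speaker, line in conversation:
+    if speaker == "Speaker 1":
+        audio = generate_speaker1_audio(line)  # hypothetical: Parler-TTS call
+    else:
+        audio = generate_speaker2_audio(line)  # hypothetical: Bark call
+    segments.append(audio)
+# Resample to a common rate, concatenate, and write the final podcast file
+```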
+
+
+#### Note: Right now there is one issue: Parler needs transformers 4.43.3 or earlier, while steps 1 to 3 of the pipeline need the latest version, so we simply switch versions in the last notebook.
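+
+In practice, the switch is just a pip downgrade at the top of Notebook 4 (and an upgrade if you re-run the earlier steps); a sketch:
+
+```
+pip install -U transformers          # Steps 1-3: recent transformers (>=4.46.0 per requirements.txt)
+pip install "transformers==4.43.3"   # Step 4 only: Parler-compatible version
+```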
+
+### Next improvements / further ideas:
+
+- Speech model experimentation: the TTS model is the limiting factor in how natural the podcast will sound. This can probably be improved with a better pipeline and with the help of someone more knowledgeable; PRs are welcome! :)
+- LLM vs LLM debate: another approach to writing the podcast would be to have two agents debate the topic of interest and write the podcast outline. Right now we use a single LLM (70B) to write the podcast outline.
+- Testing 405B for writing the transcripts
+- Better prompting
+- Support for ingesting a website, audio file, YouTube links, and more. Again, we welcome community PRs!
+
+### Resources for further learning:
+
+- https://betterprogramming.pub/text-to-audio-generation-with-bark-clearly-explained-4ee300a3713a
+- https://colab.research.google.com/drive/1dWWkZzvu7L9Bunq9zvD-W02RFUXoW-Pd?usp=sharing
+- https://colab.research.google.com/drive/1eJfA2XUa-mXwdMy7DoYKVYHI1iTd9Vkt?usp=sharing#scrollTo=NyYQ--3YksJY
+- https://replicate.com/suno-ai/bark?prediction=zh8j6yddxxrge0cjp9asgzd534
+- https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c
+

These changes were hidden because the diff is too large
+ 2731 - 0
recipes/quickstart/NotebookLlama/Step-1 PDF-Pre-Processing-Logic.ipynb


These changes were hidden because the diff is too large
+ 337 - 0
recipes/quickstart/NotebookLlama/Step-2-Transcript-Writer.ipynb


+ 288 - 0
recipes/quickstart/NotebookLlama/Step-3-Re-Writer.ipynb

@@ -0,0 +1,288 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "d0b5beda",
+   "metadata": {},
+   "source": [
+    "## Notebook 3: Transcript Re-writer\n",
+    "\n",
+    "In the previouse notebook, we got a great podcast transcript using the raw file we have uploaded earlier. \n",
+    "\n",
+    "In this one, we will use `Llama-3.1-8B-Instruct` model to re-write the output from previous pipeline and make it more dramatic or realistic."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fdc3d32a",
+   "metadata": {},
+   "source": [
+    "We will again set the `SYSTEM_PROMPT` and remind the model of its task. \n",
+    "\n",
+    "Note: We can even prompt the model like so to encourage creativity:\n",
+    "\n",
+    "> Your job is to use the podcast transcript written below to re-write it for an AI Text-To-Speech Pipeline. A very dumb AI had written this so you have to step up for your kind.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c32c0d85",
+   "metadata": {},
+   "source": [
+    "Note: We will prompt the model to return a list of Tuples to make our life easy in the next stage of using these for Text To Speech Generation"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "8568b77b-7504-4783-952a-3695737732b7",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "SYSTEMP_PROMPT = \"\"\"\n",
+    "You are an international oscar winnning screenwriter\n",
+    "\n",
+    "You have been working with multiple award winning podcasters.\n",
+    "\n",
+    "Your job is to use the podcast transcript written below to re-write it for an AI Text-To-Speech Pipeline. A very dumb AI had written this so you have to step up for your kind.\n",
+    "\n",
+    "Make it as engaging as possible, Speaker 1 and 2 will be simulated by different voice engines\n",
+    "\n",
+    "Remember Speaker 2 is new to the topic and the conversation should always have realistic anecdotes and analogies sprinkled throughout. The questions should have real world example follow ups etc\n",
+    "\n",
+    "Speaker 1: Leads the conversation and teaches the speaker 2, gives incredible anecdotes and analogies when explaining. Is a captivating teacher that gives great anecdotes\n",
+    "\n",
+    "Speaker 2: Keeps the conversation on track by asking follow up questions. Gets super excited or confused when asking questions. Is a curious mindset that asks very interesting confirmation questions\n",
+    "\n",
+    "Make sure the tangents speaker 2 provides are quite wild or interesting. \n",
+    "\n",
+    "Ensure there are interruptions during explanations or there are \"hmm\" and \"umm\" injected throughout from the Speaker 2.\n",
+    "\n",
+    "REMEMBER THIS WITH YOUR HEART\n",
+    "The TTS Engine for Speaker 1 cannot do \"umms, hmms\" well so keep it straight text\n",
+    "\n",
+    "For Speaker 2 use \"umm, hmm\" as much, you can also use [sigh] and [laughs]. BUT ONLY THESE OPTIONS FOR EXPRESSIONS\n",
+    "\n",
+    "It should be a real podcast with every fine nuance documented in as much detail as possible. Welcome the listeners with a super fun overview and keep it really catchy and almost borderline click bait\n",
+    "\n",
+    "Please re-write to make it as characteristic as possible\n",
+    "\n",
+    "START YOUR RESPONSE DIRECTLY WITH SPEAKER 1:\n",
+    "\n",
+    "STRICTLY RETURN YOUR RESPONSE AS A LIST OF TUPLES OK? \n",
+    "\n",
+    "IT WILL START DIRECTLY WITH THE LIST AND END WITH THE LIST NOTHING ELSE\n",
+    "\n",
+    "Example of response:\n",
+    "[\n",
+    "    (\"Speaker 1\", \"Welcome to our podcast, where we explore the latest advancements in AI and technology. I'm your host, and today we're joined by a renowned expert in the field of AI. We're going to dive into the exciting world of Llama 3.2, the latest release from Meta AI.\"),\n",
+    "    (\"Speaker 2\", \"Hi, I'm excited to be here! So, what is Llama 3.2?\"),\n",
+    "    (\"Speaker 1\", \"Ah, great question! Llama 3.2 is an open-source AI model that allows developers to fine-tune, distill, and deploy AI models anywhere. It's a significant update from the previous version, with improved performance, efficiency, and customization options.\"),\n",
+    "    (\"Speaker 2\", \"That sounds amazing! What are some of the key features of Llama 3.2?\")\n",
+    "]\n",
+    "\"\"\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8ee70bee",
+   "metadata": {},
+   "source": [
+    "This time we will use the smaller 8B model"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "ebef919a-9bc7-4992-b6ff-cd66e4cb7703",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "MODEL = \"meta-llama/Llama-3.1-8B-Instruct\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f7bc794b",
+   "metadata": {},
+   "source": [
+    "Let's import the necessary libraries"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "de29b1fd-5b3f-458c-a2e4-e0341e8297ed",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Import necessary libraries\n",
+    "import torch\n",
+    "from accelerate import Accelerator\n",
+    "import transformers\n",
+    "\n",
+    "from tqdm.notebook import tqdm\n",
+    "import warnings\n",
+    "\n",
+    "warnings.filterwarnings('ignore')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8020c39c",
+   "metadata": {},
+   "source": [
+    "We will load in the pickle file saved from previous notebook\n",
+    "\n",
+    "This time the `INPUT_PROMPT` to the model will be the output from the previous stage"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "4b5d2c0e-a073-46c0-8de7-0746e2b05956",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pickle\n",
+    "\n",
+    "with open('./resources/data.pkl', 'rb') as file:\n",
+    "    INPUT_PROMPT = pickle.load(file)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c4461926",
+   "metadata": {},
+   "source": [
+    "We can again use Hugging Face `pipeline` method to generate text from the model"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "eec210df-a568-4eda-a72d-a4d92d59f022",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "0711c2199ca64372b98b781f8a6f13b7",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.\n"
+     ]
+    }
+   ],
+   "source": [
+    "pipeline = transformers.pipeline(\n",
+    "    \"text-generation\",\n",
+    "    model=MODEL,\n",
+    "    model_kwargs={\"torch_dtype\": torch.bfloat16},\n",
+    "    device_map=\"auto\",\n",
+    ")\n",
+    "\n",
+    "messages = [\n",
+    "    {\"role\": \"system\", \"content\": SYSTEMP_PROMPT},\n",
+    "    {\"role\": \"user\", \"content\": INPUT_PROMPT},\n",
+    "]\n",
+    "\n",
+    "outputs = pipeline(\n",
+    "    messages,\n",
+    "    max_new_tokens=8126,\n",
+    "    temperature=1,\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "612a27e0",
+   "metadata": {},
+   "source": [
+    "We can verify the output from the model"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b8632442-f9ce-4f63-82bd-bb5238a23dc1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(outputs[0][\"generated_text\"][-1])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a61182ea-f4a3-45e1-aed9-b45cb7b52329",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "save_string_pkl = outputs[0][\"generated_text\"][-1]['content']"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d495a957",
+   "metadata": {},
+   "source": [
+    "Let's save the output as a pickle file to be used in Notebook 4"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "281d3db7-5bfa-4143-9d4f-db87f22870c8",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "with open('./resources/podcast_ready_data.pkl', 'wb') as file:\n",
+    "    pickle.dump(save_string_pkl, file)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "21c7e456-497b-4080-8b52-6f399f9f8d58",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#fin"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.10"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}

These changes were hidden because the diff is too large
+ 674 - 0
recipes/quickstart/NotebookLlama/Step-4-TTS-Workflow.ipynb


+ 116 - 0
recipes/quickstart/NotebookLlama/TTS_Notes.md

@@ -0,0 +1,116 @@
+### Notes from TTS Experimentation
+
+For the TTS Pipeline, *all* of the top models from HuggingFace and Reddit were tried. 
+
+The goal was to use models that were easy to set up and sounded less robotic, with the ability to include sound effects like laughter, etc.
+
+#### Parler-TTS
+
+Minimal code to run their models:
+
+```
+import torch
+from parler_tts import ParlerTTSForConditionalGeneration
+from transformers import AutoTokenizer
+import IPython.display as ipd
+
+# Assumes a CUDA device; falls back to CPU otherwise
+device = "cuda" if torch.cuda.is_available() else "cpu"
+
+model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-v1").to(device)
+tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")
+
+# Define text and description
+text_prompt = "This is where the actual words to be spoken go"
+description = """
+Laura's voice is expressive and dramatic in delivery, speaking at a fast pace with a very close recording that almost has no background noise.
+"""
+
+input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
+prompt_input_ids = tokenizer(text_prompt, return_tensors="pt").input_ids.to(device)
+
+generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
+audio_arr = generation.cpu().numpy().squeeze()
+
+ipd.Audio(audio_arr, rate=model.config.sampling_rate)
+```
+
+The really cool aspect of these models is the ability to prompt the `description`, which can change the speaker profile and pacing of the outputs.
+
+Surprisingly, Parler's mini model sounded more natural.
+
+In their [repo](https://github.com/huggingface/parler-tts/blob/main/INFERENCE.md#speaker-consistency) they share the names of speakers that we can use in the prompt.
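+
+For speaker consistency, you just name one of the documented speakers in the `description`; a minimal example reusing the setup above (the description line is adapted from the parler-tts docs):
+
+```
+# "Jon" is one of the speaker names documented in the parler-tts inference guide
+description = "Jon's voice is monotone yet slightly fast in delivery, with a very close recording that almost has no background noise."
+input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
+generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
+```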
+
+#### Suno/Bark
+
+Minimal code to run bark:
+
+```
+import torch
+from transformers import AutoProcessor, BarkModel
+from IPython.display import Audio
+
+device = "cuda" if torch.cuda.is_available() else "cpu"
+
+# Load the Bark model and processor from the suno/bark checkpoint
+processor = AutoProcessor.from_pretrained("suno/bark")
+model = BarkModel.from_pretrained("suno/bark").to(device)
+
+voice_preset = "v2/en_speaker_6"
+sampling_rate = 24000
+
+text_prompt = """
+Exactly! [sigh] And the distillation part is where you take a LARGE-model, and compress-it down into a smaller, more efficient model that can run on devices with limited resources.
+"""
+inputs = processor(text_prompt, voice_preset=voice_preset).to(device)
+
+speech_output = model.generate(**inputs, temperature = 0.9, semantic_temperature = 0.8)
+Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)
+```
+
+Similar to the Parler models, Suno has a [library](https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c) of speakers.
+
+Speaker v9 from their library sounded robotic, so we use Parler for our first speaker and the best Bark voice for the second.
+
+The incredible thing about the Bark model is the ability to add sound effects: `[Laugh]`, `[Gasps]`, `[Sigh]`, `[clears throat]`. Making words capital causes the model to emphasize them.
+
+Adding `-` gives a break in the text. We utilize this knowledge when we re-write the transcript using the 8B model to add effects to our transcript.
+
+Note: The authors suggest using `...`. However, this didn't work as effectively as adding a hyphen during trials.
+
+#### Hyper-parameters: 
+
+Bark models have two parameters we can tweak: `temperature` and `semantic_temperature`
+
+Below are notes from a sweep: the prompt and speaker were fixed, and this was a vibe test to see which combination gives the best results. The values below are `temperature` and `semantic_temperature` respectively:
+
+First, fix `temperature` and sweep `semantic_temperature`:
+- `0.7`, `0.2`: Quite bland and boring
+- `0.7`, `0.3`: An improvement over the previous one
+- `0.7`, `0.4`: Further improvement 
+- `0.7`, `0.5`: This one didn't work
+- `0.7`, `0.6`: So-So, didn't stand out
+- `0.7`, `0.7`: The best so far
+- `0.7`, `0.8`: Further improvement 
+- `0.7`, `0.9`: Mixed feelings on this one
+
+Now sweeping the `temperature`:
+- `0.1`, `0.9`: Very Robotic
+- `0.2`, `0.9`: Less Robotic but not convincing
+- `0.3`, `0.9`: Slight improvement still not fun
+- `0.4`, `0.9`: Still has a robotic tinge
+- `0.5`, `0.9`: The laugh was weird on this one, but the voice modulates so much that it feels like the speaker is changing
+- `0.6`, `0.9`: Most consistent voice but has a robotic after-taste
+- `0.7`, `0.9`: Very robotic and laugh was weird
+- `0.8`, `0.9`: Completely ignored the laughter, but it was more natural
+- `0.9`, `0.9`: We have a winner probably
+
+After this, about 30 more sweeps were done with the promising combinations:
+
+Best results were at:
+
+```
+speech_output = model.generate(**inputs, temperature = 0.9, semantic_temperature = 0.8)
+Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)
+```
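+
+To reproduce the sweep, a minimal loop over the two knobs could look like this (a sketch; `scipy` for saving wav files is an extra assumption, and judging the outputs stays manual):
+
+```
+import itertools
+from scipy.io.wavfile import write as write_wav
+
+for temp, sem_temp in itertools.product([0.7, 0.8, 0.9], repeat=2):
+    speech_output = model.generate(**inputs, temperature=temp, semantic_temperature=sem_temp)
+    audio_arr = speech_output[0].cpu().numpy()
+    # Save each combination for a manual vibe test
+    write_wav(f"sweep_t{temp}_s{sem_temp}.wav", sampling_rate, audio_arr)
+```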
+
+
+### Notes from other models that were tested:
+
+Promising directions to explore in future:
+
+- [MeloTTS](https://huggingface.co/myshell-ai/MeloTTS-English): this is the most popular TTS model (ever) on HuggingFace
+- [WhisperSpeech](https://huggingface.co/WhisperSpeech/WhisperSpeech): sounded quite natural as well
+- [F5-TTS](https://github.com/SWivid/F5-TTS): was the latest release at the time of writing; however, it felt a bit robotic
+- E2-TTS: r/locallama claims this to be a little better; however, it didn't pass the vibe test
+- [xTTS](https://coqui.ai/blog/tts/open_xtts): it has great documentation and also seems promising
+
+#### Some more models that weren't tested:
+
+In other words, we leave this as an exercise for the reader :D
+
+- [Fish-Speech](https://huggingface.co/fishaudio/fish-speech-1.4)
+- [MMS-TTS-Eng](https://huggingface.co/facebook/mms-tts-eng)
+- [Metavoice](https://huggingface.co/metavoiceio/metavoice-1B-v0.1)
+- [Hifigan](https://huggingface.co/nvidia/tts_hifigan)
+- [TTS-Tacotron2](https://huggingface.co/speechbrain/tts-tacotron2-ljspeech) 
+- [VALL-E X](https://github.com/Plachtaa/VALL-E-X)

+ 15 - 0
recipes/quickstart/NotebookLlama/requirements.txt

@@ -0,0 +1,15 @@
+# Core dependencies
+PyPDF2>=3.0.0
+torch>=2.0.0
+transformers>=4.46.0
+accelerate>=0.27.0
+rich>=13.0.0
+ipywidgets>=8.0.0
+tqdm>=4.66.0
+
+# Optional but recommended
+jupyter>=1.0.0
+ipykernel>=6.0.0

BIN
recipes/quickstart/NotebookLlama/resources/2402.13116v4.pdf


BIN
recipes/quickstart/NotebookLlama/resources/Outline.jpg


BIN
recipes/quickstart/NotebookLlama/resources/_podcast.mp3


These changes were hidden because the diff is too large
+ 74 - 0
recipes/quickstart/NotebookLlama/resources/clean_extracted_text.txt


BIN
recipes/quickstart/NotebookLlama/resources/data.pkl


BIN
recipes/quickstart/NotebookLlama/resources/podcast_ready_data.pkl