{ "cells": [ { "cell_type": "markdown", "id": "4f67a6a6", "metadata": {}, "source": [ "## Notebook 1: PDF Pre-processing" ] }, { "cell_type": "markdown", "id": "f68aee84-04e3-4cbc-be78-6de9e06e704f", "metadata": {}, "source": [ "In the series, we will be going from a PDF to Podcast using all open models. \n", "\n", "The first step in getting to the podcast is finding a script, right now our logic is:\n", "- Use any PDF on any topic\n", "- Prompt `Llama-3.2-1B-Instruct` model to process it into a text file\n", "- Re-write this into a podcast transcript in next notebook.\n", "\n", "In this notebook, we will upload a PDF and save it into a `.txt` file using the `PyPDF2` library, later we will process chunks from the text file using our featherlight model." ] }, { "cell_type": "markdown", "id": "61cb3584", "metadata": {}, "source": [ "Most of us shift-enter pass the comments to realise later we need to install libraries. For the few that read the instructions, please remember to do so:" ] }, { "cell_type": "code", "execution_count": 41, "id": "f4fc7aef-3505-482e-a998-790b8b9d48e4", "metadata": {}, "outputs": [], "source": [ "#!pip install PyPDF2\n", "#!pip install rich ipywidgets" ] }, { "cell_type": "markdown", "id": "7b23d509", "metadata": {}, "source": [ "Assuming you have a PDF uploaded on the same machine, please set the path for the file. \n", "\n", "Also, if you want to flex your GPU-please switch to a bigger model although the featherlight models work perfectly for this task:" ] }, { "cell_type": "code", "execution_count": 14, "id": "60d0061b-8b8c-4353-850f-f19466a0ae2d", "metadata": {}, "outputs": [], "source": [ "pdf_path = './resources/2402.13116v4.pdf'\n", "DEFAULT_MODEL = \"meta-llama/Llama-3.2-1B-Instruct\"" ] }, { "cell_type": "code", "execution_count": 49, "id": "21029232-ac5f-42ca-b26b-baad5b2f49b7", "metadata": {}, "outputs": [], "source": [ "import PyPDF2\n", "from typing import Optional\n", "import os\n", "import torch\n", "from accelerate import Accelerator\n", "from transformers import AutoModelForCausalLM, AutoTokenizer\n", "\n", "from tqdm.notebook import tqdm\n", "import warnings\n", "\n", "warnings.filterwarnings('ignore')" ] }, { "cell_type": "markdown", "id": "203c22eb", "metadata": {}, "source": [ "Let's make sure we don't stub our toe by checking if the file exists" ] }, { "cell_type": "code", "execution_count": 9, "id": "153d9ece-37a4-4fff-a8e8-53f923a2b0a0", "metadata": {}, "outputs": [], "source": [ "def validate_pdf(file_path: str) -> bool:\n", " if not os.path.exists(file_path):\n", " print(f\"Error: File not found at path: {file_path}\")\n", " return False\n", " if not file_path.lower().endswith('.pdf'):\n", " print(\"Error: File is not a PDF\")\n", " return False\n", " return True" ] }, { "cell_type": "markdown", "id": "5a362ac3", "metadata": {}, "source": [ "Convert PDF to a `.txt` file. This would simply read and dump the contents of the file. We set the maximum characters to 100k. \n", "\n", "For people converting their favorite novels into a podcast, they will have to add extra logic of going outside the Llama models context length which is 128k tokens." 
] }, { "cell_type": "code", "execution_count": 10, "id": "b57c2d64-3d75-4aeb-b4ee-bd1661286b66", "metadata": {}, "outputs": [], "source": [ "def extract_text_from_pdf(file_path: str, max_chars: int = 100000) -> Optional[str]:\n", "    if not validate_pdf(file_path):\n", "        return None\n", "    \n", "    try:\n", "        with open(file_path, 'rb') as file:\n", "            # Create PDF reader object\n", "            pdf_reader = PyPDF2.PdfReader(file)\n", "            \n", "            # Get total number of pages\n", "            num_pages = len(pdf_reader.pages)\n", "            print(f\"Processing PDF with {num_pages} pages...\")\n", "            \n", "            extracted_text = []\n", "            total_chars = 0\n", "            \n", "            # Iterate through all pages\n", "            for page_num in range(num_pages):\n", "                # Extract text from page\n", "                page = pdf_reader.pages[page_num]\n", "                text = page.extract_text()\n", "                \n", "                # Check if adding this page's text would exceed the limit\n", "                if total_chars + len(text) > max_chars:\n", "                    # Only add text up to the limit\n", "                    remaining_chars = max_chars - total_chars\n", "                    extracted_text.append(text[:remaining_chars])\n", "                    print(f\"Reached {max_chars} character limit at page {page_num + 1}\")\n", "                    break\n", "                \n", "                extracted_text.append(text)\n", "                total_chars += len(text)\n", "                print(f\"Processed page {page_num + 1}/{num_pages}\")\n", "            \n", "            final_text = '\\n'.join(extracted_text)\n", "            print(f\"\\nExtraction complete! Total characters: {len(final_text)}\")\n", "            return final_text\n", "            \n", "    except PyPDF2.errors.PdfReadError:\n", "        print(\"Error: Invalid or corrupted PDF file\")\n", "        return None\n", "    except Exception as e:\n", "        print(f\"An unexpected error occurred: {str(e)}\")\n", "        return None\n" ] }, { "cell_type": "markdown", "id": "e023397b", "metadata": {}, "source": [ "Helper function to grab metadata about our PDF." ] }, { "cell_type": "code", "execution_count": 11, "id": "0984bb1e-d52c-4cec-a131-67a48061fabc", "metadata": {}, "outputs": [], "source": [ "# Get PDF metadata\n", "def get_pdf_metadata(file_path: str) -> Optional[dict]:\n", "    if not validate_pdf(file_path):\n", "        return None\n", "    \n", "    try:\n", "        with open(file_path, 'rb') as file:\n", "            pdf_reader = PyPDF2.PdfReader(file)\n", "            metadata = {\n", "                'num_pages': len(pdf_reader.pages),\n", "                'metadata': pdf_reader.metadata\n", "            }\n", "            return metadata\n", "    except Exception as e:\n", "        print(f\"Error extracting metadata: {str(e)}\")\n", "        return None" ] }, { "cell_type": "markdown", "id": "6019affc", "metadata": {}, "source": [ "Finally, we can run our logic to extract the details from the file." ] }, { "cell_type": "code", "execution_count": 12, "id": "63848943-79cc-4e21-8396-6eab5df493e0", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Extracting metadata...\n", "\n", "PDF Metadata:\n", "Number of pages: 44\n", "Document info:\n", "/Author: \n", "/CreationDate: D:20240311015030Z\n", "/Creator: LaTeX with hyperref\n", "/Keywords: \n", "/ModDate: D:20240311015030Z\n", "/PTEX.Fullbanner: This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5\n", "/Producer: pdfTeX-1.40.25\n", "/Subject: \n", "/Title: \n", "/Trapped: /False\n", "\n", "Extracting text...\n", "Processing PDF with 44 pages...\n", "Processed page 1/44\n", "Processed page 2/44\n", "Processed page 3/44\n", "Processed page 4/44\n", "Processed page 5/44\n", "Processed page 6/44\n", "Processed page 7/44\n", "Processed page 8/44\n", "Processed page 9/44\n", "Processed page 10/44\n", "Processed page 11/44\n", "Processed page 12/44\n", "Processed page 13/44\n", "Processed page 
14/44\n", "Processed page 15/44\n", "Processed page 16/44\n", "Reached 100000 character limit at page 17\n", "\n", "Extraction complete! Total characters: 100016\n", "\n", "Preview of extracted text (first 500 characters):\n", "--------------------------------------------------\n", "1\n", "A Survey on Knowledge Distillation of Large\n", "Language Models\n", "Xiaohan Xu1, Ming Li2, Chongyang Tao3, Tao Shen4, Reynold Cheng1, Jinyang Li1,\n", "Can Xu5, Dacheng Tao6, Tianyi Zhou2\n", "1The University of Hong Kong2University of Maryland3Microsoft\n", "4University of Technology Sydney5Peking University6The University of Sydney\n", "{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu\n", "ckcheng@cs.hku.hk jl0725@connect.hku.hk\n", "Abstract —In the era of Large Language Models (LLMs), Knowledge Distillati\n", "--------------------------------------------------\n", "\n", "Total characters extracted: 100016\n", "\n", "Extracted text has been saved to extracted_text.txt\n" ] } ], "source": [ "# Extract metadata first\n", "print(\"Extracting metadata...\")\n", "metadata = get_pdf_metadata(pdf_path)\n", "if metadata:\n", "    print(\"\\nPDF Metadata:\")\n", "    print(f\"Number of pages: {metadata['num_pages']}\")\n", "    print(\"Document info:\")\n", "    for key, value in metadata['metadata'].items():\n", "        print(f\"{key}: {value}\")\n", "\n", "# Extract text\n", "print(\"\\nExtracting text...\")\n", "extracted_text = extract_text_from_pdf(pdf_path)\n", "\n", "# Display first 500 characters of extracted text as preview\n", "if extracted_text:\n", "    print(\"\\nPreview of extracted text (first 500 characters):\")\n", "    print(\"-\" * 50)\n", "    print(extracted_text[:500])\n", "    print(\"-\" * 50)\n", "    print(f\"\\nTotal characters extracted: {len(extracted_text)}\")\n", "\n", "# Optional: Save the extracted text to a file\n", "if extracted_text:\n", "    output_file = 'extracted_text.txt'\n", "    with open(output_file, 'w', encoding='utf-8') as f:\n", "        f.write(extracted_text)\n", "    print(f\"\\nExtracted text has been saved to {output_file}\")" ] }, { "cell_type": "markdown", "id": "946d1f59", "metadata": {}, "source": [ "### Llama Pre-Processing\n", "\n", "Now let's proceed to justify our distaste for writing regex and use that as a justification for an LLM instead.\n", "\n", "At this point, we have a text file extracted from a PDF of a paper. Generally, PDF extracts can be messy due to stray characters, formatting, LaTeX, tables, etc. \n", "\n", "One way to handle this would be using regex; instead, we can prompt the featherlight Llama models to clean up the text for us. \n", "\n", "Please try changing the `SYS_PROMPT` below to see what improvements you can make:" ] }, { "cell_type": "code", "execution_count": 60, "id": "7c0828a5-964d-475e-b5f5-40a04e287725", "metadata": {}, "outputs": [], "source": [ "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n", "\n", "SYS_PROMPT = \"\"\"\n", "You are a world-class text pre-processor. Here is the raw data from a PDF; please parse and return it in a way that is crisp and usable to send to a podcast writer.\n", "\n", "The raw data is messed up with new lines, LaTeX math, and you will see fluff that we can remove completely. 
Basically take away any details that you think might be useless in a podcast author's transcript.\n", "\n", "Remember, the podcast could be on any topic whatsoever, so the issues listed above are not exhaustive.\n", "\n", "Please be smart with what you remove and be creative, ok?\n", "\n", "Remember: DO NOT START SUMMARIZING THIS, YOU ARE ONLY CLEANING UP THE TEXT AND RE-WRITING WHEN NEEDED.\n", "\n", "Be very smart and aggressive with removing details; you will get a running portion of the text and should keep returning the processed text.\n", "\n", "PLEASE DO NOT ADD MARKDOWN FORMATTING OR THE SPECIAL CHARACTERS, CAPITALISATION, ETC. THAT MARKDOWN LIKES.\n", "\n", "ALWAYS start your response directly with the processed text and NO ACKNOWLEDGEMENTS about my instructions, ok?\n", "Here is the text:\n", "\"\"\"" ] }, { "cell_type": "markdown", "id": "fd393fae", "metadata": {}, "source": [ "Instead of having the model process the entire file at once, as you may have noticed in the prompt, we will pass chunks of the file. \n", "\n", "One issue with chunking by a fixed character count is that we can split words in half, so instead we chunk at word boundaries:" ] }, { "cell_type": "code", "execution_count": 61, "id": "24e8a547-9d7c-4e2f-be9e-a3aea09cce76", "metadata": {}, "outputs": [], "source": [ "def create_word_bounded_chunks(text, target_chunk_size):\n", "    \"\"\"\n", "    Split text into chunks at word boundaries close to the target chunk size.\n", "    \"\"\"\n", "    words = text.split()\n", "    chunks = []\n", "    current_chunk = []\n", "    current_length = 0\n", "    \n", "    for word in words:\n", "        word_length = len(word) + 1  # +1 for the space\n", "        if current_length + word_length > target_chunk_size and current_chunk:\n", "            # Join the current chunk and add it to chunks\n", "            chunks.append(' '.join(current_chunk))\n", "            current_chunk = [word]\n", "            current_length = word_length\n", "        else:\n", "            current_chunk.append(word)\n", "            current_length += word_length\n", "    \n", "    # Add the last chunk if it exists\n", "    if current_chunk:\n", "        chunks.append(' '.join(current_chunk))\n", "    \n", "    return chunks" ] }, { "cell_type": "markdown", "id": "5d74223f", "metadata": {}, "source": [ "Let's load the model and start processing the text chunks." ] }, { "cell_type": "code", "execution_count": 62, "id": "d04a4f07-b0b3-45ca-8f41-a433e1abe050", "metadata": {}, "outputs": [], "source": [ "accelerator = Accelerator()\n", "model = AutoModelForCausalLM.from_pretrained(\n", "    DEFAULT_MODEL,\n", "    torch_dtype=torch.bfloat16,\n", "    use_safetensors=True,\n", "    device_map=device,\n", ")\n", "tokenizer = AutoTokenizer.from_pretrained(DEFAULT_MODEL, use_safetensors=True)\n", "model, tokenizer = accelerator.prepare(model, tokenizer)" ] }, { "cell_type": "code", "execution_count": 63, "id": "bbda5241-e890-4402-87dd-514d6761bb9c", "metadata": {}, "outputs": [], "source": [ "def process_chunk(text_chunk, chunk_num):\n", "    \"\"\"Process a chunk of text and return both input and output for verification\"\"\"\n", "    conversation = [\n", "        {\"role\": \"system\", \"content\": SYS_PROMPT},\n", "        {\"role\": \"user\", \"content\": text_chunk},\n", "    ]\n", "    \n", "    prompt = tokenizer.apply_chat_template(conversation, tokenize=False)\n", "    inputs = tokenizer(prompt, return_tensors=\"pt\").to(device)\n", "    \n", "    with torch.no_grad():\n", "        output = model.generate(\n", "            **inputs,\n", "            temperature=0.7,\n", "            top_p=0.9,\n", "            max_new_tokens=512\n", "        )\n", "    \n", "    processed_text = tokenizer.decode(output[0], skip_special_tokens=True)[len(prompt):].strip()\n", "    \n", "    # Print chunk 
information for monitoring\n", "    #print(f\"\\n{'='*40} Chunk {chunk_num} {'='*40}\")\n", "    print(f\"INPUT TEXT:\\n{text_chunk[:500]}...\")  # Show first 500 chars of input\n", "    print(f\"\\nPROCESSED TEXT:\\n{processed_text[:500]}...\")  # Show first 500 chars of output\n", "    print(f\"{'='*90}\\n\")\n", "    \n", "    return processed_text" ] }, { "cell_type": "code", "execution_count": 64, "id": "a0183c47-339d-4041-ae83-77fc34931075", "metadata": {}, "outputs": [], "source": [ "INPUT_FILE = \"./resources/extracted_text.txt\"  # Replace with your file path\n", "CHUNK_SIZE = 1000  # Adjust chunk size if needed\n", "\n", "# Read the extracted text first so that `text` is defined before we chunk it\n", "with open(INPUT_FILE, 'r', encoding='utf-8') as file:\n", "    text = file.read()\n", "\n", "chunks = create_word_bounded_chunks(text, CHUNK_SIZE)\n", "num_chunks = len(chunks)\n" ] }, { "cell_type": "code", "execution_count": 65, "id": "bb36814f-9310-4734-bf54-e16a5032339e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "101" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "num_chunks" ] }, { "cell_type": "code", "execution_count": 66, "id": "447188d3-ebf0-42d5-940e-4d7e0d9dbf32", "metadata": {}, "outputs": [], "source": [ "# Create the output file name for the cleaned-up text\n", "output_file = f\"clean_{os.path.basename(INPUT_FILE)}\"" ] }, { "cell_type": "code", "execution_count": 67, "id": "7917dfdd-b3af-44fc-a8c0-2760ace9363e", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "b767f45b5e514e7db936cef825af6fce", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Processing chunks: 0%| | 0/101 [00:00