
Use train loop in quickstart finetuning notebook

Matthias Reso 10 months ago
parent
commit
a0917ff176

These changes are not shown because the diff is too large
+ 0 - 669
recipes/finetuning/huggingface_trainer/peft_finetuning.ipynb


+ 321 - 0
recipes/finetuning/quickstart_peft_finetuning.ipynb

@@ -0,0 +1,321 @@
+{
+ "cells": [
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Copyright (c) Meta Platforms, Inc. and affiliates.\n",
+    "This software may be used and distributed according to the terms of the Llama 2 Community License Agreement."
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## PEFT Finetuning Quick Start Notebook\n",
+    "\n",
+    "This notebook shows how to train a Meta Llama 3 model on a single GPU (e.g. A10 with 24GB) using int8 quantization and LoRA."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Step 0: Install pre-requirements and convert checkpoint\n",
+    "\n",
+    "We need to have llama-recipes and its dependencies installed for this notebook. Additionally, we need to log in with the huggingface_cli and make sure that the account is able to to access the Meta Llama weights."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# ! pip install llama-recipes ipywidgets\n",
+    "\n",
+    "# import huggingface_hub\n",
+    "# huggingface_hub.login()"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Step 1: Load the model\n",
+    "\n",
+    "Setup training configuration and load the model and tokenizer."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import torch\n",
+    "from transformers import LlamaForCausalLM, AutoTokenizer\n",
+    "from llama_recipes.configs import train_config as TRAIN_CONFIG\n",
+    "\n",
+    "train_config = TRAIN_CONFIG()\n",
+    "train_config.model_name = \"meta-llama/Meta-Llama-3-8B\"\n",
+    "train_config.num_epochs = 1\n",
+    "train_config.run_validation = False\n",
+    "train_config.gradient_accumulation_steps = 4\n",
+    "train_config.batch_size_training = 1\n",
+    "train_config.lr = 3e-4\n",
+    "train_config.use_fast_kernels = True\n",
+    "train_config.use_fp16 = True\n",
+    "train_config.context_length = 2048\n",
+    "train_config.batching_strategy = \"packing\"\n",
+    "train_config.output_dir = \"meta-llama-samsum\"\n",
+    "\n",
+    "from transformers import BitsAndBytesConfig\n",
+    "config = BitsAndBytesConfig(\n",
+    "    load_in_8bit=True,\n",
+    ")\n",
+    "\n",
+    "model = LlamaForCausalLM.from_pretrained(\n",
+    "            train_config.model_name,\n",
+    "            device_map=\"auto\",\n",
+    "            quantization_config=config,\n",
+    "            use_cache=False,\n",
+    "            attn_implementation=\"sdpa\" if train_config.use_fast_kernels else None,\n",
+    "            torch_dtype=torch.float16,\n",
+    "        )\n",
+    "\n",
+    "tokenizer = AutoTokenizer.from_pretrained(train_config.model_name)\n",
+    "tokenizer.pad_token = tokenizer.eos_token"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Step 2: Check base model\n",
+    "\n",
+    "Run the base model on an example input:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "eval_prompt = \"\"\"\n",
+    "Summarize this dialog:\n",
+    "A: Hi Tom, are you busy tomorrow’s afternoon?\n",
+    "B: I’m pretty sure I am. What’s up?\n",
+    "A: Can you go with me to the animal shelter?.\n",
+    "B: What do you want to do?\n",
+    "A: I want to get a puppy for my son.\n",
+    "B: That will make him so happy.\n",
+    "A: Yeah, we’ve discussed it many times. I think he’s ready now.\n",
+    "B: That’s good. Raising a dog is a tough issue. Like having a baby ;-) \n",
+    "A: I'll get him one of those little dogs.\n",
+    "B: One that won't grow up too big;-)\n",
+    "A: And eat too much;-))\n",
+    "B: Do you know which one he would like?\n",
+    "A: Oh, yes, I took him there last Monday. He showed me one that he really liked.\n",
+    "B: I bet you had to drag him away.\n",
+    "A: He wanted to take it home right away ;-).\n",
+    "B: I wonder what he'll name it.\n",
+    "A: He said he’d name it after his dead hamster – Lemmy  - he's  a great Motorhead fan :-)))\n",
+    "---\n",
+    "Summary:\n",
+    "\"\"\"\n",
+    "\n",
+    "model_input = tokenizer(eval_prompt, return_tensors=\"pt\").to(\"cuda\")\n",
+    "\n",
+    "model.eval()\n",
+    "with torch.no_grad():\n",
+    "    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=100)[0], skip_special_tokens=True))"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We can see that the base model only repeats the conversation.\n",
+    "\n",
+    "### Step 3: Load the preprocessed dataset\n",
+    "\n",
+    "We load and preprocess the samsum dataset which consists of curated pairs of dialogs and their summarization:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from llama_recipes.configs.datasets import samsum_dataset\n",
+    "from llama_recipes.data.concatenator import ConcatDataset\n",
+    "from llama_recipes.utils.config_utils import get_dataloader_kwargs\n",
+    "from llama_recipes.utils.dataset_utils import get_preprocessed_dataset\n",
+    "\n",
+    "train_dataset = get_preprocessed_dataset(tokenizer, samsum_dataset, 'train')\n",
+    "\n",
+    "train_dl_kwargs = get_dataloader_kwargs(train_config, train_dataset, tokenizer, \"train\")\n",
+    "\n",
+    "if train_config.batching_strategy == \"packing\":\n",
+    "        train_dataset = ConcatDataset(train_dataset, chunk_size=train_config.context_length)\n",
+    "\n",
+    "# Create DataLoaders for the training and validation dataset\n",
+    "train_dataloader = torch.utils.data.DataLoader(\n",
+    "    train_dataset,\n",
+    "    num_workers=train_config.num_workers_dataloader,\n",
+    "    pin_memory=True,\n",
+    "    **train_dl_kwargs,\n",
+    ")"
+   ]
+  },
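+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As an optional sanity check, we can print how many packed samples and batches the dataloader yields per epoch (this assumes both the packed dataset and the dataloader support `len()`):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Sketch: assumes the packed dataset and the dataloader both support len()\n",
+    "print(f\"Packed training samples: {len(train_dataset)}\")\n",
+    "print(f\"Training batches per epoch: {len(train_dataloader)}\")"
+   ]
+  },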
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Step 4: Prepare model for PEFT\n",
+    "\n",
+    "Let's prepare the model for Parameter Efficient Fine Tuning (PEFT):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from peft import get_peft_model, prepare_model_for_kbit_training, LoraConfig\n",
+    "from dataclasses import asdict\n",
+    "from llama_recipes.configs import lora_config as LORA_CONFIG\n",
+    "\n",
+    "lora_config = LORA_CONFIG()\n",
+    "lora_config.r = 8\n",
+    "lora_config.lora_alpha = 32\n",
+    "lora_dropout: float=0.01\n",
+    "\n",
+    "peft_config = LoraConfig(**asdict(lora_config))\n",
+    "\n",
+    "model = prepare_model_for_kbit_training(model)\n",
+    "model = get_peft_model(model, peft_config)"
+   ]
+  },
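+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Optionally, peft models expose `print_trainable_parameters()`, which reports how small the LoRA-trainable fraction of the weights is:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Optional check: report trainable vs. total parameter counts after attaching the LoRA adapters\n",
+    "model.print_trainable_parameters()"
+   ]
+  },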
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Step 5: Fine tune the model\n",
+    "\n",
+    "Here, we fine tune the model for a single epoch."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import torch.optim as optim\n",
+    "from llama_recipes.utils.train_utils import train\n",
+    "from torch.optim.lr_scheduler import StepLR\n",
+    "\n",
+    "model.train()\n",
+    "\n",
+    "optimizer = optim.AdamW(\n",
+    "            model.parameters(),\n",
+    "            lr=train_config.lr,\n",
+    "            weight_decay=train_config.weight_decay,\n",
+    "        )\n",
+    "scheduler = StepLR(optimizer, step_size=1, gamma=train_config.gamma)\n",
+    "\n",
+    "# Start the training process\n",
+    "results = train(\n",
+    "    model,\n",
+    "    train_dataloader,\n",
+    "    None,\n",
+    "    tokenizer,\n",
+    "    optimizer,\n",
+    "    scheduler,\n",
+    "    train_config.gradient_accumulation_steps,\n",
+    "    train_config,\n",
+    "    None,\n",
+    "    None,\n",
+    "    None,\n",
+    "    wandb_run=None,\n",
+    ")"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Step 6:\n",
+    "Save model checkpoint"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "model.save_pretrained(train_config.output_dir)"
+   ]
+  },
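+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Note that `save_pretrained` on a peft model only writes the LoRA adapter weights, not the full base model. As an optional sketch, the adapter could later be reloaded on top of a freshly loaded base model; the code is commented out so the 8B model is not loaded a second time in this session:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Sketch (assumes a fresh session): reload the saved LoRA adapter on top of the base model\n",
+    "# from peft import PeftModel\n",
+    "# base_model = LlamaForCausalLM.from_pretrained(\n",
+    "#     train_config.model_name,\n",
+    "#     device_map=\"auto\",\n",
+    "#     quantization_config=config,\n",
+    "#     torch_dtype=torch.float16,\n",
+    "# )\n",
+    "# model = PeftModel.from_pretrained(base_model, train_config.output_dir)"
+   ]
+  },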
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Step 7:\n",
+    "Try the fine tuned model on the same example again to see the learning progress:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "model.eval()\n",
+    "with torch.no_grad():\n",
+    "    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=100)[0], skip_special_tokens=True))\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.9"
+  },
+  "vscode": {
+   "interpreter": {
+    "hash": "2d58e898dde0263bc564c6968b04150abacfd33eed9b19aaa8e45c040360e146"
+   }
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}