11 mēneši atpakaļ · 82bb0087b8
--- a/end-to-end-use-cases/coding/text2sql/README.md
+++ b/end-to-end-use-cases/coding/text2sql/README.md
@@ -1,20 +1,25 @@
 
				 # Improving Llama Text2SQL performance with CoT Fine-tuning
			
 
				 
			
 
				-This recipe is step by step guide to improve Llama performance on Text2SQL measured with the popular [BIRD](https://bird-bench.github.io) benchmark. We generate synthetic Chain of Thought(CoT) dataset and fine-tune Llama models on it.
			
 
				+This recipe is step by step guide to improve Llama performance on Text2SQL measured with the popular [BIRD](https://bird-bench.github.io) benchmark. We generate a synthetic Chain of Thought(CoT) dataset and fine-tune Llama models on it.
			
 
				 
			
 
				-Results: [graph_placeholder]
			
 
				+Results:
			
 
				+|-----------------------------|-------------------------------|
			
 
				+| baseline                    | 39.47%                        |
			
 
				+| CoT, PEFT                   | 43.35%                        |
			
 
				+| CoT, FFT                    | 42.44% (3 epochs)             |
			
 
				+| CoT, FFT                    | 43.87% (10 epochs)            |
			
 
				 
			
 
				-We followed following steps:
			
 
				+The complete steps are:
			
 
				 
			
 
				-1. Pre-processing the BIRD TRAIN datset by converting SQL statements into conversation format
			
 
				+1. Pre-processing the BIRD TRAIN datset by converting SQL statements into the conversation format.
			
 
				 
			
 
				-2. We use the conversations from step 1, add CoT to these existing conversations using Llama-3.3-70B
			
 
				+2. We use the conversations from step 1, add CoT to these existing conversations using Llama-3.3-70B.
			
 
				 
			
 
				-3. Fine-tuning Llama-3.1-8B on the dataset from step 2
			
 
				+3. Fine-tuning Llama-3.1-8B on the dataset from step 2.
			
 
				 
			
 
				-4. We provide scripts to simplify running the [BIRD](https://bird-bench.github.io) benchmark on the fine-tuned models and compare it with out of the model.
			
 
				+4. We provide scripts to simplify running the [BIRD](https://bird-bench.github.io) eval benchmark on the fine-tuned models and compare it with out of the model.
			
 
				 
			
 
				-## Structure:
			
 
				+## Folder Structure
			
 
				 
			
 
				 - quickstart folder: contains a notebook to ask Llama 3.3 to convert natural language queries into SQL queries.
			
 
				 - data folder: contains scripts to download the BIRD TRAIN and DEV datasets;
			
--- a/end-to-end-use-cases/coding/text2sql/eval/README.md
+++ b/end-to-end-use-cases/coding/text2sql/eval/README.md
@@ -17,7 +17,9 @@ Below are the results of the Llama models we have evaluated on the BIRD DEV data
 
				 
			
 
				 ## Quick Start with Llama Models via Llama API
			
 
				 
			
 
				-First, run the commands below to create a new Conda environment and install all the required packages for Text2SQL evaluation and fine-tuning:
			
 
				+Follow the steps below to evaluate Llama 3 & 4 models on Text2SQL using the BIRD benchmark:
			
 
				+
			
 
				+1. Run the commands below to create a new Conda environment and install all the required packages for Text2SQL evaluation:
			
 
				 
			
 
				 ```
			
 
				 conda create -n llama-text2sql python=3.10
			
@@ -28,18 +30,16 @@ cd llama-cookbook/end-to-end-use-cases/coding/text2sql/eval
 
				 pip install -r requirements.txt
			
 
				 ```
			
 
				 
			
 
				-Then, follow the steps below to evaluate Llama 3 & 4 models on Text2SQL using the BIRD benchmark:
			
 
				-
			
 
				-1. Get the DEV dataset:
			
 
				+2. Get the DEV dataset:
			
 
				 ```
			
 
				 cd ../data
			
 
				 sh download_dev_unzip.sh
			
 
				 cd ../eval
			
 
				 ```
			
 
				 
			
 
				-2. Open `llama_eval.sh` and set `YOUR_API_KEY` to your [Llama API](https://llama.developer.meta.com/) key then uncomment a line that starts with `model=` to specify the Llama model to use for the text2sql eval.
			
 
				+3. Open `llama_eval.sh` and set `YOUR_API_KEY` to your [Llama API](https://llama.developer.meta.com/) key then uncomment a line that starts with `model=` to specify the Llama model to use for the text2sql eval.
			
 
				 
			
 
				-3. Run the evaluation script `sh llama_eval.sh`, which will use the BIRD DEV dataset (1534 examples in total) with external knowledge turned on to run the Llama model on each text question and compare the generated SQL with the gold SQL.
			
 
				+4. Run the evaluation script `sh llama_eval.sh`, which will use the BIRD DEV dataset (1534 examples in total) with external knowledge turned on to run the Llama model on each text question and compare the generated SQL with the gold SQL.
			
 
				 
			
 
				 If your API key or model name is incorrect, the script will exit with an authentication or model not supported error.
			
 
				 
			
--- a/end-to-end-use-cases/coding/text2sql/fine-tuning/README.md
+++ b/end-to-end-use-cases/coding/text2sql/fine-tuning/README.md
@@ -4,7 +4,7 @@ This folder contains scripts to:
 
				 
			
 
				 * generate a dataset from the BIRD TRAIN set (with no CoT info) for supervised fine-tuning (SFT);
			
 
				 * generate a dataset from the BIRD TRAIN set (with CoT info by Llama 3.3 70B) for SFT;
			
 
				-* SFT the Llama 3.1 8B model with the generated datasets with different fine-tuning combinations: with or without CoT, using quantization or not,  full fine-tuning (FFT) or parameter-efficient fine-tuning (PEFT).
			
 
				+* SFT the Llama 3.1 8B model with the generated datasets with different fine-tuning combinations: with or without CoT, using quantization or not, full fine-tuning (FFT) or parameter-efficient fine-tuning (PEFT).
			
 
				 
			
 
				 **Note:** CoT stands for Chain of Thought and we will use "CoT" and "reasoning" interchangeably here, although generally, reasoning encompasses a broader concept than CoT.
			
 
				 
			
@@ -22,15 +22,92 @@ The eval results of SFT Llama 3.1 8B with different options (epochs is 3, with a
 
				 
			
 
				 Using Quantization+PEFT on CoT dataset only dropped the accuracy from 43.35% to 42.89%.
			
 
				 
			
 
				-## Creating dataset
			
 
				+## Quick Start with Fine-tuning Llama 3.1 8B
			
 
				 
			
 
				-We use the BIRD TRAIN dataset to prepare for supervised fine-tuning with reasoning info in the dataset. The goal is to see if we can improve the accuracy of the fine-tuned model by adding the reasoning info in the dataset.
			
 
				+1. If you have already run the eval folder's Quick Start Step 1's commands [here](../eval/README.md#quick-start-with-llama-models-via-llama-api) to "create a new Conda environment and install all the required packages for Text2SQL evaluation", just run:
			
 
				+
			
 
				+```
			
 
				+cd llama-cookbook/end-to-end-use-cases/coding/text2sql/fine-tuning
			
 
				+pip install -r requirements.txt
			
 
				+```
			
 
				+
			
 
				+Otherwise, run the commands below to create a new Conda environment and install all the required packages for Text2SQL evaluation and fine-tuning:
			
 
				+
			
 
				+```
			
 
				+conda create -n llama-text2sql python=3.10
			
 
				+conda activate llama-text2sql
			
 
				+git clone https://github.com/meta-llama/llama-cookbook
			
 
				+git checkout text2sql # to be removed after the PR merge
			
 
				+cd llama-cookbook/end-to-end-use-cases/coding/text2sql/fine-tuning
			
 
				+pip install -r requirements.txt
			
 
				+```
			
 
				+
			
 
				+2. Get the TRAIN dataset:
			
 
				+
			
 
				+```
			
 
				+cd ../data
			
 
				+sh download_train_unzip.sh
			
 
				+cd ../fine-tuning
			
 
				+```
			
 
				+
			
 
				+3. Create a CoT reasoning dataset from the TRAIN dataset:
			
 
				+
			
 
				+```
			
 
				+python create_reasoning_dataset.py --input_json ../data/train/train.json --db_root_path ../data/train/train_databases
			
 
				+```
			
 
				+
			
 
				+See the section "About Creating the CoT Dataset" below for more details.
			
 
				+
			
 
				+4. Run one of the commands below to fine-tune the Llama 3.1 8B model with the generated dataset (about 50-70GB GPU memory required):
			
 
				+
			
 
				+```
			
 
				+python trl_sft.py --quantized false --peft false --cot true
			
 
				+python trl_sft.py --quantized false --peft true --cot true
			
 
				+python trl_sft.py --quantized true --peft true --cot true
			
 
				+```
			
 
				+
			
 
				+See the section "About fine-tuning" below for more details.
			
 
				+
			
 
				+## Evaluating the fine-tuned model
			
 
				+
			
 
				+1. Set the `model` value in `llama_eval.sh` to be one of the fine-tuned model folders above, e.g.
			
 
				+
			
 
				+```
			
 
				+YOUR_API_KEY='finetuned'
			
 
				+model='fine_tuning/llama31-8b-text2sql-fft-nonquantized-cot'
			
 
				+```
			
 
				+
			
 
				+2. Start the vllm server by running
			
 
				+```
			
 
				+vllm serve fine_tuning/llama31-8b-text2sql-fft-nonquantized-cot --tensor-parallel-size 1 --max-num-batched-tokens 8192 --max-num-seqs 64
			
 
				+```
			
 
				+If you have multiple GPUs you can run something like
			
 
				+
			
 
				+```
			
 
				+CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 vllm serve fine_tuning/llama31-8b-text2sql-fft-nonquantized-cot --tensor-parallel-size 8 --max-num-batched-tokens 8192 --max-num-seqs 64
			
 
				+```
			
 
				+
			
 
				+to speed up the eval.
			
 
				 
			
 
				-### Creating a reasoning dataset from the TRAIN dataset
			
 
				+3. If you haven't downloaded the DEV dataset, download it and unzip it first:
			
 
				+
			
 
				+```
			
 
				+cd ../data
			
 
				+sh download_dev_unzip.sh
			
 
				+cd ../eval
			
 
				+```
			
 
				+
			
 
				+Then run `sh llama_eval.sh`.
			
 
				+
			
 
				+**Note:** If your fine-tuned model is PEFT based, you may need to run `python merge_peft.py` after modifying its `peft_model_path` and `output_dir` and set the merged folder path after `vllm serve`.
			
 
				+
			
 
				+## About Creating the CoT Dataset
			
 
				+
			
 
				+We use the BIRD TRAIN dataset to prepare for supervised fine-tuning with reasoning info in the dataset. The goal is to see if we can improve the accuracy of the fine-tuned model by adding the reasoning info in the dataset.
			
 
				 
			
 
				 The script `create_reasoning_dataset.py` is used to create a reasoning dataset from the TRAIN dataset by asking Llama 3.3 70B to generate the reasoning for each text question and its corresponding gold SQL. The intent is to use the reasoning dataset to fine-tune the Llama model to improve the accuracy of the generated SQL.
			
 
				 
			
 
				-To run the script, use the following commands:
			
 
				+To run the script, use the following command:
			
 
				 ```
			
 
				 python create_reasoning_dataset.py --input_json ../data/train/train.json --db_root_path ../data/train/train_databases
			
 
				 ```
			
@@ -71,7 +148,7 @@ Let me think through this step by step:\n\n1. First, I need to consider...\n2. T
 
				 """
			
 
				 ```
			
 
				 
			
 
				-### Running fine-tuning
			
 
				+## About fine-tuning
			
 
				 
			
 
				 Run one of the commands below:
			
 
				 
			
@@ -91,26 +168,3 @@ llama31-8b-text2sql-peft-quantized-cot
 
				 
			
 
				 The train loss chart should look like this:
			
 
				 ![](train_loss_cot.png)
			
 
				-
			
 
				-### Evaluating the fine-tuned model
			
 
				-
			
 
				-1. Set the `model` value in `llama_eval.sh` to be one of the fine-tuned model folders above, e.g.
			
 
				-
			
 
				-```
			
 
				-YOUR_API_KEY='finetuned'
			
 
				-model='fine_tuning/llama31-8b-text2sql-fft-nonquantized-cot'
			
 
				-```
			
 
				-
			
 
				-2. Start the vllm server by running
			
 
				-```
			
 
				-vllm serve fine_tuning/llama31-8b-text2sql-fft-nonquantized-cot --tensor-parallel-size 1 --max-num-batched-tokens 8192 --max-num-seqs 64
			
 
				-```
			
 
				-If you have multiple GPUs you can run something like
			
 
				-```
			
 
				-CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 vllm serve fine_tuning/llama31-8b-text2sql-fft-nonquantized-cot --tensor-parallel-size 8 --max-num-batched-tokens 8192 --max-num-seqs 64
			
 
				-```
			
 
				- to speed up the eval.
			
 
				-
			
 
				-3. Run `sh llama_eval.sh`.
			
 
				-
			
 
				-**Note:** If your fine-tuned model is PEFT based, you may need to run `python merge_peft.py` after modifying its `peft_model_path` and `output_dir` and set the merged folder path after `vllm serve`.
			
--- a/end-to-end-use-cases/coding/text2sql/fine-tuning/requirements.txt
+++ b/end-to-end-use-cases/coding/text2sql/fine-tuning/requirements.txt
@@ -1,17 +1,19 @@
 
				-llama_api_client==0.1.1
			
 
				+llama_api_client==0.1.2
			
 
				+func_timeout==4.3.5
			
 
				+tqdm==4.67.1
			
 
				+vllm==0.9.2
			
 
				+openai==1.90.0
			
 
				 langchain-together==0.3.0
			
 
				 sqlparse==0.5.3
			
 
				-torch==2.4.1
			
 
				 tensorboard==2.19.0
			
 
				-liger-kernel==0.4.2
			
 
				+liger_kernel==0.6.1
			
 
				 setuptools==78.1.1
			
 
				-deepspeed==0.15.4
			
 
				-transformers==4.46.3
			
 
				-datasets==3.6.0
			
 
				-accelerate==1.1.1
			
 
				-bitsandbytes==0.44.1
			
 
				-trl==0.12.1
			
 
				-peft==0.13.2
			
 
				-lighteval==0.6.2
			
 
				-hf-transfer==0.1.8
			
 
				-func_timeout==4.3.5
			
 
				+deepspeed==0.17.3
			
 
				+transformers==4.54.0
			
 
				+datasets==4.0.0
			
 
				+accelerate==1.9.0
			
 
				+bitsandbytes==0.46.1
			
 
				+trl==0.19.1
			
 
				+peft==0.16.0
			
 
				+lighteval==0.10.0
			
 
				+hf_transfer==0.1.9