4 READMEs; requirements

Jeff Tang 3 weeks ago
parent
commit
99ead57fb6

+ 19 - 11
end-to-end-use-cases/coding/text2sql/README.md

@@ -1,16 +1,24 @@
-## Text2SQL: Eval and Fine-tuning Tools and Quick Start Notebook
+# Text2SQL: Evaluating and Fine-tuning Llama Models
 
-This folder contains the `tool` subfolder, which has e2e scripts for evaluating Llama (original and fine-tuned) models on the Text2SQL task using the popular [BIRD](https://bird-bench.github.io) dataset, and e2e scripts for generating fine-tuning datasets and fine-tuning Llama 3.1 8B with the datasets.
+This folder contains scripts to:
 
-Before looking into the `tool` folder, you may start with the scripts and notebook in this folder to get familiar with how to interact with a database using natural language inputs bu asking Llama to convert natural language queries into SQL queries.
+1. Evaluate Llama (original and fine-tuned) models on the Text2SQL task using the popular [BIRD](https://bird-bench.github.io) dataset in **3 simple steps**;
 
-For detailed instructions on setting up the environment, creating a database, and executing natural language queries using the Text2SQL interface, please refer to the [quickstart.ipynb](quickstart.ipynb) notebook.
+2. Generate fine-tuning datasets (both with and without CoT reasoning) and fine-tune Llama 3.1 8B with them, gaining a **165% (with no reasoning) and 209% (with reasoning) accuracy improvement** over the original model.
 
-### Structure:
+Our end goal is to maximize the accuracy of Llama models on the Text2SQL task. To do so, we first need to evaluate the current state-of-the-art Llama models on the task, then apply fine-tuning, agentic workflows, and other approaches to improve Llama's performance and re-evaluate it.
 
-- tool: A folder containing scripts for evaluating and fine-tuning Llama models on the Text2SQL task.
-- quickstart.ipynb: A Quick Demo of Text2SQL Using Llama 3.3. This Jupyter Notebook includes examples of how to use the interface to execute natural language queries on the sample data. It uses Llama 3.3 to answer questions about a SQLite database using LangChain and the Llama cloud provider Together.ai.
-- nba.txt: A text file containing NBA roster information, which is used as sample data for demonstration purposes.
-- txt2csv.py: A script that converts text data into a CSV format. This script is used to preprocess the input data before it is fed into csv2db.py.
-- csv2db.py: A script that imports data from a CSV file into a SQLite database. This script is used to populate the database with sample data.
-- nba_roster.db: A SQLite database file created from the nba.txt data, used to test the Text2SQL interface.
+## Structure:
+
+- data: contains the scripts to download the BIRD TRAIN and DEV datasets;
+- eval: contains the scripts to evaluate Llama models (original and fine-tuned) on the BIRD dataset;
+- fine-tuning: contains the scripts to generate non-CoT and CoT datasets based on the BIRD TRAIN set and to fine-tune Llama models using the datasets;
+- quickstart: contains a notebook to ask Llama 3.3 to convert natural language queries into SQL queries.
+
+## Next Steps
+
+1. Try GRPO RFT to further improve the accuracy.
+2. Fine-tune Llama 3.3 70b and Llama 4 models.
+3. Use torchtune.
+4. Try agentic workflow.
+5. Expand the eval to support other enterprise databases.
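
The 165% and 209% figures quoted in this README are relative improvements. A minimal sketch of the arithmetic, using the BIRD DEV accuracies reported elsewhere in this commit (14.02% for the 4-bit quantized baseline, 37.16% and 43.37% after fine-tuning); the helper name `relative_improvement` is hypothetical and not part of the repo:

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the relative improvement of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# Accuracies (in %) on BIRD DEV, as reported in the READMEs of this commit.
BASELINE = 14.02            # quantized Llama 3.1 8B before fine-tuning
SFT_NO_REASONING = 37.16    # fine-tuned on the non-CoT dataset
SFT_WITH_REASONING = 43.37  # fine-tuned on the CoT dataset

gain_no_reasoning = relative_improvement(BASELINE, SFT_NO_REASONING)
gain_with_reasoning = relative_improvement(BASELINE, SFT_WITH_REASONING)

print(f"no reasoning:   {gain_no_reasoning:.0f}%")    # 165%
print(f"with reasoning: {gain_with_reasoning:.0f}%")  # 209%
```

Note that these are gains relative to the quantized baseline, not percentage-point differences.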

+ 8 - 152
end-to-end-use-cases/coding/text2sql/eval/README.md

@@ -1,18 +1,8 @@
-# Text2SQL Evaluation and Fine-Tuning Tools for Llama Models
+# Llama Text2SQL Evaluation
 
-## Overview
+We have updated and simplified the original eval scripts from the BIRD [repo](https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/bird) to 3 simple steps for Llama 3 & 4 models hosted via Meta's [Llama API](https://llama.developer.meta.com) or [Together.ai](https://together.ai), as well as the fine-tuned Llama 3.1 model.
 
-This folder contains scripts to:
-1. Evaluate Llama (original and fine-tuned) models on the Text2SQL task using the popular [BIRD](https://bird-bench.github.io) dataset in **three simple steps**;
-2. Generate fine-tuning datasets (with and without reasoning steps)and fine-tuning Llama 3.1 8B with the datasets, gaining a **165% (with no reasoning) and 209% (with reasoning) accuracy improvement** over the original model.
-
-Our end goal is to maximize the accuracy of Llama models on the Text2SQL task via fine-tuning, agent and other approaches. To do so we need to first evaluate the current state of the art Llama models on the task. In other words, "no eval, no success" AND "eval only is not success". Hence, we have created this tool to quickly evaluate Llama models on the Text2SQL task and, as a first step, to fine-tune Llama models to improve their accuracy on the task.
-
-## Llama Text2SQL Evaluation
-
-We have updated and significantly simplified the original eval scripts from the BIRD [repo](https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/bird) for Llama 3 & 4 models hosted via Meta's [Llama API](https://llama.developer.meta.com) or [Together.ai](https://together.ai), as well as the fine-tuned Llama 3.1 model.
-
-### Evaluation Results
+## Evaluation Results
 
 Below are the results of the Llama models we have evaluated on the BIRD DEV dataset:
 
@@ -28,13 +18,13 @@ Below are the results of the Llama models we have evaluated on the BIRD DEV data
 - Fine-tuned with no reasoning dataset: 37.16%
 - Fine-tuned with reasoning dataset: 43.37%
 
-### Quick Start on Evaluating Llama on Text2SQL
+## Quick Start
 
 First, run the commands below to create a new Conda environment and install all the required packages for Text2SQL evaluation and fine-tuning:
 
 ```
 git clone https://github.com/meta-llama/llama-cookbook
-cd llama-cookbook/end-to-end-use-cases/coding/text2sql/tool
+cd llama-cookbook/end-to-end-use-cases/coding/text2sql
 conda create -n llama-text2sql python=3.10
 conda activate llama-text2sql
 pip install -r requirements.txt
@@ -46,6 +36,7 @@ Then, follow the steps below to evaluate Llama 3 & 4 models on Text2SQL using th
 ```
 cd data
 sh download_dev_unzip.sh
+cd ../eval
 ```
 
 2. Open `llama_eval.sh` and set `YOUR_API_KEY` to your [Llama API](https://llama.developer.meta.com/) key or [Together](https://api.together.ai/) API key, then uncomment a line that starts with `model=` to specify the Llama model to use for the text2sql eval.
@@ -58,7 +49,7 @@ After the script completes, you'll see the accuracy of the Llama model on the BI
 
 To compare your evaluated accuracy of your selected Llama model with other results in the BIRD Dev leaderboard, click [here](https://bird-bench.github.io/).
 
-### Evaluation Process
+## Evaluation Process
 
 1. **SQL Generation**: `llama_text2sql.py` sends natural language questions to the specified Llama model and collects the generated SQL queries.
 
@@ -68,7 +59,7 @@ To compare your evaluated accuracy of your selected Llama model with other resul
 
 4. **Accuracy Calculation**: Accuracy scores are calculated overall and broken down by difficulty levels (simple, moderate, challenging).
 
-### Supported Models for Evaluation
+## Supported Models for Evaluation
 
 Llama models supported on Together AI:
 - meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo
@@ -83,145 +74,3 @@ Llama models supported on Llama API:
 - Llama-4-Maverick-17B-128E-Instruct-FP8
 - Llama-4-Scout-17B-16E-Instruct-FP8
 - other Llama models hosted on Llama API
-
-## Fine-tuning with the BIRD TRAIN dataset (No Reasoning)
-
-We'll first use the BIRD TRAIN dataset to prepare for supervised fine-tuning with no reasoning info in the dataset.
-
-### Using the TRAIN to prepare for supervised fine-tuning
-
-1. Get the TRAIN dataset:
-```
-cd data
-sh download_train_unzip.sh
-```
-
-2. Create the dataset
-
-```
-cd fine_tuning
-python create_sft_dataset.py --input_json ../data/train/train.json --db_root_path ../data/train/train_databases
-```
-
-This will create `train_text2sql_sft_dataset.json` and `test_text2sql_sft_dataset.json` using the TRAIN set. Each line in the json files is in the conversation format ready for fine-tuning:
-
-```
-{"messages":[{"content":"You are a text to SQL query translator. Using the SQLite DB Schema and the External Knowledge, translate the following text question into a SQLite SQL select statement.","role":"system"},{"content":"-- DB Schema: <DB_SCHEMA>\n\n-- External Knowledge: <KNOWLEDGE_FROM_TRAIN>\n\n-- Question: <TEXT_QUESTION>","role":"user"},{"content":"<GOLD_SQL>","role":"assistant"}]}
-```
-
-### Supervised Fine-tuning (No Reasoning)
-
-First, you need to login to HuggingFace (via running `huggingface-cli login` and enter your [HF token](https://huggingface.co/settings/tokens)) and have been granted access to the [Llama 3.1 8B Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) model.
-
-Then run `python trl_sft.py`. After the fine-tuning completes, you'll see the fine-tuned model saved to `llama31-8b-text2sql-fine-tuned`, specified in `output_dir="llama31-8b-text2sql-fine-tuned"` of `TrainingArguments` in `trl_sft.py`.
-
-After running `tensorboard --logdir ./llama31-8b-text2sql-fine_tuning` you can open `http://localhost:6006` to see the train loss chat etc:
-
-![](fine_tuning/train_loss.png)
-
-
-### Evaluating the fine-tuned model (No Reasoning)
-
-First, modify `llama_eval.sh` to use the fine-tuned model:
-
-```
-YOUR_API_KEY='finetuned'
-model='fine_tuning/llama31-8b-text2sql'
-```
-
-Then run `sh llama_eval.sh` to evaluate the fine-tuned model. The accuracy on the BIRD DEV dataset is about 37.16%. This is a 165% improvement over the model before fine-tuning, which has an accuracy of about 14.02% on the same dataset - you can confirm this by comparing the fine-tuned model's accuracy above with the original model's accuracy by modifying `llama_eval.sh` to use the original model:
-
-```
-YOUR_API_KEY='huggingface'
-model='meta-llama/Llama-3.1-8B-Instruct'
-```
-
-Then running `sh llama_eval.sh` to evaluate the original model.
-
-*Note:* We are using the 4-bit quantized Llama 3.1 8b model to reduce the memory footprint and improve the efficiency (as shown in the code nippet of llama_text2sql.py below), hence the accuracy of the quantized version (14.02%) is quite lower than the accuracy of the original Llama 3.1 8b (35.66%).
-
-```
-  bnb_config = BitsAndBytesConfig(
-      load_in_4bit=True,
-      bnb_4bit_use_double_quant=True,
-      bnb_4bit_quant_type="nf4",
-      bnb_4bit_compute_dtype=torch.bfloat16,
-  )
-```
-
-## Fine-tuning with the BIRD TRAIN dataset (With Reasoning)
-
-Next we'll use the BIRD TRAIN dataset to prepare for supervised fine-tuning with reasoning info in the dataset. The goal is to see if we can improve the accuracy of the fine-tuned model by adding the reasoning info in the dataset.
-
-### Creating a reasoning dataset from the TRAIN dataset
-
-The script `create_reasoning_dataset.py` is used to create a reasoning dataset from the TRAIN dataset by asking Llama 3.3 70B to generate the reasoning for each text question and its corresponding gold SQL. The intent is to use the reasoning dataset to fine-tune the Llama model to improve the accuracy of the generated SQL.
-
-To run the script, use the following commands:
-```
-cd fine_tuning
-python create_reasoning_dataset.py --input_json ../data/train/train.json --db_root_path ../data/train/train_databases
-```
-
-This will create a `text2sql_cot_dataset` dataset and `train_text2sql_cot_dataset.json` in the conversation format ready for fine-tuning. Each example in the dataset is generated from the code snippet below:
-
-```
-prompt = f"""
-"""
-cot = {
-    "messages": [
-        {
-            "role": "system",
-            "content": "You are a text to SQL query translator. Using the SQLite DB Schema and the External Knowledge, generate the step-by-step reasoning and the final SQLite SQL select statement from the text question.",
-        },
-        {"role": "user", "content": prompt},
-        {"role": "assistant", "content": reasoning},
-    ]
-}
-```
-
-The prompt for Llama 3.3 70B to generate the `reasoning` above is:
-```
-You are a text to SQL query translator. Based on the DB Schema and External Knowledge, given the Text Question Input and its Gold SQL Output below, generate the step-by-step reasoning to infer the Gold SQL Output from the Text Question Input.
-
-
-Your response should be as follows:\n\n
-Let me think through this step by step:\n\n1. First, I need to consider...\n2. Then...\n3. Next...\n...\n\nFinally, the SQL statement for the text question is:
-```sql ...```\n
-
-"""
-```
-
-### Supervised Fine-tuning (With Reasoning)
-
-Uncomment the line `# FT_DATASET = "train_text2sql_cot_dataset.json"` in trl_sft.py to use the reasoning dataset for fine-tuning. Then run `python trl_sft.py`. After the fine-tuning completes, you'll see the fine-tuned model saved to `llama31-8b-text2sql-fine-tuned`, specified in `output_dir="llama31-8b-text2sql-fine-tuned"` of `TrainingArguments` in `trl_sft.py` - you may want to rename the `output_dir` folder to something else to avoid overwriting the previous fine-tuned model.
-
-The train loss chart will look like this:
-![](fine_tuning/train_loss_cot.png)
-
-### Evaluating the fine-tuned model (With Reasoning)
-
-First, modify `llama_eval.sh` to use the fine-tuned model, which should match the `output_dir` in `TrainingArguments` in `trl_sft.py`:
-
-```
-YOUR_API_KEY='finetuned'
-model='fine_tuning/llama31-8b-text2sql-fine-tuned'
-```
-
-Then uncomment the line `SYSTEM_PROMPT` [here](https://github.com/meta-llama/llama-cookbook/blob/text2sql/end-to-end-use-cases/coding/text2sql/tool/llama_text2sql.py#L31) in `llama_text2sql.py` to use it with the reasoning dataset fine-tuned model.
-
-Now run `sh llama_eval.sh`, which will take longer because the reasoning is needed to generate the SQL. The accuracy this time is 43.37%, compared with 37.16% without reasoning. This is another 16% improvement over the model with fine-tuning without reasoning.
-
-## Next Steps
-1. Add a Colab notebook for fine-tuning and evaluation.
-2. Try reinforcement fine-tuning to improve the accuracy further with reasoning.
-3. Use torchtune for full and non-quantized fine-tuning of Llama 3.3 70b and Llama 4 models.
-4. Introduce agent to try to improve the accuracy further.
-5. Expand the tool to support other databases.
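
The evaluation process in the eval README ends with accuracy scores computed overall and per difficulty level. The intermediate steps are not visible in this diff, so the sketch below assumes an execution-based comparison: a prediction counts as correct when it returns the same set of rows as the gold SQL on the same SQLite database. The table, the example queries, and the function names are hypothetical and only illustrate the idea; the actual scripts may differ.

```python
import sqlite3
from collections import defaultdict


def run_query(conn: sqlite3.Connection, sql: str):
    """Execute `sql` and return its rows as a set, or None if execution fails."""
    try:
        return set(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def execution_accuracy(conn: sqlite3.Connection, examples: list) -> dict:
    """Return accuracy in percent, overall and per difficulty level."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        gold_rows = run_query(conn, ex["gold_sql"])
        predicted_rows = run_query(conn, ex["predicted_sql"])
        # A failed prediction (None) never matches a valid gold result.
        hit = int(gold_rows is not None and predicted_rows == gold_rows)
        for key in (ex["difficulty"], "overall"):
            total[key] += 1
            correct[key] += hit
    return {key: 100.0 * correct[key] / total[key] for key in total}


# Hypothetical toy database standing in for a BIRD DEV database.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE players (name TEXT, team TEXT, age INTEGER);
    INSERT INTO players VALUES ('A', 'X', 25), ('B', 'X', 31), ('C', 'Y', 28);
    """
)

examples = [
    {
        "difficulty": "simple",
        "gold_sql": "SELECT COUNT(*) FROM players",
        "predicted_sql": "SELECT COUNT(name) FROM players",
    },
    {
        "difficulty": "moderate",
        "gold_sql": "SELECT name FROM players WHERE age > 27",
        "predicted_sql": "SELECT name FROM players WHERE age > 30",
    },
]

scores = execution_accuracy(conn, examples)
print(scores)
```

Comparing result sets rather than SQL strings lets syntactically different but equivalent queries count as correct, which is why the first example above is scored as a hit.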

+ 137 - 0
end-to-end-use-cases/coding/text2sql/fine-tuning/README.md

@@ -0,0 +1,137 @@
+# Llama Text2SQL Fine-tuning
+
+This folder contains the scripts to generate datasets from the BIRD TRAIN set, with and without CoT reasoning, and, as a first step, to run supervised fine-tuning (SFT) on the Llama 3.1 8B model. The fine-tuned model improves accuracy over the original model by **165% with no reasoning and 209% with reasoning**.
+
+## SFT with the BIRD TRAIN dataset (No Reasoning)
+
+We'll first use the BIRD TRAIN dataset to prepare for supervised fine-tuning with no reasoning info in the dataset.
+
+### Using the TRAIN set to prepare for supervised fine-tuning
+
+1. Get the TRAIN dataset:
+```
+cd data
+sh download_train_unzip.sh
+```
+
+2. Create the dataset
+
+```
+cd ../fine-tuning
+python create_sft_dataset.py --input_json ../data/train/train.json --db_root_path ../data/train/train_databases
+```
+
+This will create `train_text2sql_sft_dataset.json` and `test_text2sql_sft_dataset.json` using the TRAIN set. Each line in the JSON files is in the conversation format, ready for fine-tuning:
+
+```
+{"messages":[{"content":"You are a text to SQL query translator. Using the SQLite DB Schema and the External Knowledge, translate the following text question into a SQLite SQL select statement.","role":"system"},{"content":"-- DB Schema: <DB_SCHEMA>\n\n-- External Knowledge: <KNOWLEDGE_FROM_TRAIN>\n\n-- Question: <TEXT_QUESTION>","role":"user"},{"content":"<GOLD_SQL>","role":"assistant"}]}
+```
+
+### SFT (No Reasoning)
+
+First, log in to Hugging Face by running `huggingface-cli login` and entering your [HF token](https://huggingface.co/settings/tokens), and make sure you have been granted access to the [Llama 3.1 8B Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) model.
+
+Then run `python trl_sft.py`. After the fine-tuning completes, you'll see the fine-tuned model saved to `llama31-8b-text2sql-fine-tuned`, specified in `output_dir="llama31-8b-text2sql-fine-tuned"` of `TrainingArguments` in `trl_sft.py`.
+
+After running `tensorboard --logdir ./llama31-8b-text2sql-fine_tuning`, you can open `http://localhost:6006` to see the train loss chart and other metrics:
+
+![](fine_tuning/train_loss.png)
+
+
+### Evaluating the fine-tuned model (No Reasoning)
+
+First, modify `llama_eval.sh` to use the fine-tuned model:
+
+```
+YOUR_API_KEY='finetuned'
+model='fine_tuning/llama31-8b-text2sql'
+```
+
+Then run `sh llama_eval.sh` to evaluate the fine-tuned model. The accuracy on the BIRD DEV dataset is about 37.16%. This is a 165% improvement over the model before fine-tuning, which has an accuracy of about 14.02% on the same dataset. You can confirm this by modifying `llama_eval.sh` to use the original model:
+
+```
+YOUR_API_KEY='huggingface'
+model='meta-llama/Llama-3.1-8B-Instruct'
+```
+
+Then run `sh llama_eval.sh` to evaluate the original model.
+
+*Note:* We are using the 4-bit quantized Llama 3.1 8B model to reduce the memory footprint and improve efficiency (as shown in the code snippet of `llama_text2sql.py` below), hence the accuracy of the quantized version (14.02%) is considerably lower than that of the original Llama 3.1 8B (35.66%).
+
+```
+  bnb_config = BitsAndBytesConfig(
+      load_in_4bit=True,
+      bnb_4bit_use_double_quant=True,
+      bnb_4bit_quant_type="nf4",
+      bnb_4bit_compute_dtype=torch.bfloat16,
+  )
+```
+
+## SFT with the BIRD TRAIN dataset (With Reasoning)
+
+Next we'll use the BIRD TRAIN dataset to prepare for supervised fine-tuning with reasoning info in the dataset. The goal is to see if we can improve the accuracy of the fine-tuned model by adding the reasoning info in the dataset.
+
+### Creating a reasoning dataset from the TRAIN dataset
+
+The script `create_reasoning_dataset.py` is used to create a reasoning dataset from the TRAIN dataset by asking Llama 3.3 70B to generate the reasoning for each text question and its corresponding gold SQL. The intent is to use the reasoning dataset to fine-tune the Llama model to improve the accuracy of the generated SQL.
+
+To run the script, use the following command:
+```
+python create_reasoning_dataset.py --input_json ../data/train/train.json --db_root_path ../data/train/train_databases
+```
+
+This will create a `text2sql_cot_dataset` dataset and `train_text2sql_cot_dataset.json` in the conversation format ready for fine-tuning. Each example in the dataset is generated from the code snippet below:
+
+```
+prompt = f"""
+-- DB Schema: {db_schema}
+-- External Knowledge: {external_knowledge}
+-- Text Question: {question}
+"""
+cot = {
+    "messages": [
+        {
+            "role": "system",
+            "content": "You are a text to SQL query translator. Using the SQLite DB Schema and the External Knowledge, generate the step-by-step reasoning and the final SQLite SQL select statement from the text question.",
+        },
+        {"role": "user", "content": prompt},
+        {"role": "assistant", "content": reasoning},
+    ]
+}
+```
+
+The prompt for Llama 3.3 70B to generate the `reasoning` above is:
+```
+You are a text to SQL query translator. Based on the DB Schema and External Knowledge, given the Text Question Input and its Gold SQL Output below, generate the step-by-step reasoning to infer the Gold SQL Output from the Text Question Input.
+
+-- DB Schema: {db_schema}
+-- External Knowledge: {external_knowledge}
+-- Text Question Input: {question}
+-- Gold SQL Output: {gold_SQL}
+
+Your response should be as follows:\n\n
+Let me think through this step by step:\n\n1. First, I need to consider...\n2. Then...\n3. Next...\n...\n\nFinally, the SQL statement for the text question is:
+```sql ...```\n
+
+"""
+```
+
+### SFT (With Reasoning)
+
+Uncomment the line `# FT_DATASET = "train_text2sql_cot_dataset.json"` in `trl_sft.py` to use the reasoning dataset for fine-tuning. Then run `python trl_sft.py`. After the fine-tuning completes, you'll see the fine-tuned model saved to `llama31-8b-text2sql-fine-tuned`, specified in `output_dir="llama31-8b-text2sql-fine-tuned"` of `TrainingArguments` in `trl_sft.py`. You may want to change `output_dir` to a different folder name beforehand to avoid overwriting the previous fine-tuned model.
+
+The train loss chart will look like this:
+![](fine_tuning/train_loss_cot.png)
+
+### Evaluating the fine-tuned model (With Reasoning)
+
+First, modify `llama_eval.sh` to use the fine-tuned model, which should match the `output_dir` in `TrainingArguments` in `trl_sft.py`:
+
+```
+YOUR_API_KEY='finetuned'
+model='fine_tuning/llama31-8b-text2sql-fine-tuned'
+```
+
+Then uncomment the line `SYSTEM_PROMPT` [here](https://github.com/meta-llama/llama-cookbook/blob/text2sql/end-to-end-use-cases/coding/text2sql/eval/llama_text2sql.py#L31) in `llama_text2sql.py` to use it with the reasoning dataset fine-tuned model.
+
+Now run `sh llama_eval.sh`, which will take longer because the model generates the reasoning before the SQL. The accuracy this time is 43.37%, compared with 37.16% without reasoning. This is a further relative improvement of about 17% over the model fine-tuned without reasoning.
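
A model fine-tuned on the reasoning dataset answers with step-by-step text followed by the SQL in a fenced `sql` block, so the SQL has to be separated from the reasoning before it can be executed. The sketch below shows one way to do that, assuming the response follows the format requested in the reasoning prompt above. The function `extract_final_sql` and the sample response are hypothetical; the diff does not show how `llama_text2sql.py` handles this.

```python
import re

# Build the fence marker programmatically so this example can itself
# be shown inside a fenced code block.
FENCE = "`" * 3
SQL_BLOCK = re.compile(
    re.escape(FENCE) + r"sql\s*(.*?)" + re.escape(FENCE),
    re.DOTALL | re.IGNORECASE,
)


def extract_final_sql(response: str) -> str:
    """Return the contents of the last fenced sql block in `response`.

    Falls back to the stripped response when no fenced block is found,
    which covers models that answer with bare SQL.
    """
    blocks = SQL_BLOCK.findall(response)
    if blocks:
        return blocks[-1].strip()
    return response.strip()


# Hypothetical response in the format requested by the reasoning prompt.
sample_response = (
    "Let me think through this step by step:\n\n"
    "1. First, I need to consider the players table.\n"
    "2. Then count its rows.\n\n"
    "Finally, the SQL statement for the text question is:\n"
    f"{FENCE}sql SELECT COUNT(*) FROM players {FENCE}\n"
)

print(extract_final_sql(sample_response))
```

Taking the last block rather than the first is a design choice: intermediate reasoning steps may quote partial SQL, while the requested format puts the final statement at the end.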

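The non-reasoning SFT dataset described in the fine-tuning README stores one conversation per line. As a rough illustration of that format, the sketch below builds one such record from a question, schema, external knowledge, and gold SQL. The function `build_sft_record` and the sample values are hypothetical; `create_sft_dataset.py` itself is not shown in this diff and may be structured differently.

```python
import json

# System prompt copied from the example record in the fine-tuning README.
SYSTEM_PROMPT = (
    "You are a text to SQL query translator. Using the SQLite DB Schema and "
    "the External Knowledge, translate the following text question into a "
    "SQLite SQL select statement."
)


def build_sft_record(db_schema: str, external_knowledge: str,
                     question: str, gold_sql: str) -> dict:
    """Build one conversation-format training example."""
    user_content = (
        f"-- DB Schema: {db_schema}\n\n"
        f"-- External Knowledge: {external_knowledge}\n\n"
        f"-- Question: {question}"
    )
    return {
        "messages": [
            {"content": SYSTEM_PROMPT, "role": "system"},
            {"content": user_content, "role": "user"},
            {"content": gold_sql, "role": "assistant"},
        ]
    }


# Hypothetical sample values, not taken from the BIRD TRAIN set.
record = build_sft_record(
    db_schema="CREATE TABLE players (name TEXT, team TEXT, age INTEGER)",
    external_knowledge="age is measured in years",
    question="How many players are older than 30?",
    gold_sql="SELECT COUNT(*) FROM players WHERE age > 30",
)

# One JSON object per line, matching the format shown in the README.
line = json.dumps(record)
print(line)
```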
+ 17 - 0
end-to-end-use-cases/coding/text2sql/fine-tuning/requirements.txt

@@ -0,0 +1,17 @@
+llama_api_client==0.1.1
+langchain-together==0.3.0
+sqlparse==0.5.3
+torch==2.4.1
+tensorboard==2.19.0
+liger-kernel==0.4.2
+setuptools==78.1.1
+deepspeed==0.15.4
+transformers==4.46.3
+datasets==3.6.0
+accelerate==1.1.1
+bitsandbytes==0.44.1
+trl==0.12.1
+peft==0.13.2
+lighteval==0.6.2
+hf-transfer==0.1.8
+func_timeout==4.3.5

+ 2 - 2
end-to-end-use-cases/coding/text2sql/quickstart/README.md

@@ -1,10 +1,10 @@
-## Quickstart with Text2SQL
+# Quickstart with Text2SQL
 
 The scripts and notebook in this folder let you get familiar with how to interact with a database using natural language inputs by asking Llama to convert natural language queries into SQL queries.
 
 For detailed instructions on setting up the environment, creating a database, and executing natural language queries using the Text2SQL interface, please refer to the [quickstart.ipynb](quickstart.ipynb) notebook.
 
-### Structure:
+## Structure:
 
 - quickstart.ipynb: A Quick Demo of Text2SQL Using Llama 3.3. This Jupyter Notebook includes examples of how to use the interface to execute natural language queries on the sample data. It uses Llama 3.3 to answer questions about a SQLite database using LangChain and the Llama cloud provider Together.ai.
 - nba.txt: A text file containing NBA roster information, which is used as sample data for demonstration purposes.