@@ -2,9 +2,9 @@
## Overview
-This folder contains the scripts for evaluating Llama (original and fine-tuned) models on Text2SQL tasks using the popular [BIRD](https://bird-bench.github.io) dataset, generating fine-tuning datasets, and fine-tuning Llama 3.1 8B with the datasets.
+This folder contains the scripts for evaluating Llama (original and fine-tuned) models on the Text2SQL task using the popular [BIRD](https://bird-bench.github.io) dataset, generating fine-tuning datasets, and fine-tuning Llama 3.1 8B with the datasets.
-We have significantly simplified the original eval scripts from the BIRD [repo](https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/bird) for Llama models hosted via Meta's [Llama API](https://llama.developer.meta.com) or [Together.ai](https://together.ai), so you can quickly evaluate in 1-2-3 steps how well different Llama models perform on the Text2SQL task.
+We have updated and significantly simplified the original eval scripts from the BIRD [repo](https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/bird) for Llama 3 & 4 models hosted via Meta's [Llama API](https://llama.developer.meta.com) or [Together.ai](https://together.ai), as well as the fine-tuned Llama 3.1 model, so you can quickly evaluate in 1-2-3 steps how well different Llama models perform on the Text2SQL task.
We have also provided end-to-end scripts for generating datasets and fine-tuning a quantized Llama 3.1 8B model to gain a **165% accuracy improvement** over the original model.
@@ -22,7 +22,6 @@ Below are the results of the Llama models we have evaluated on the BIRD DEV data
Llama 3.1 8B quantized model: 14.02% (original) -> 37.16% (fine-tuned)
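(For reference, the **165% accuracy improvement** quoted above is the relative gain: (37.16 - 14.02) / 14.02 ≈ 1.65, i.e. roughly 165%.)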
-
## Quick Start on Evaluating Llama on Text2SQL
First, run the commands below to create a new Conda environment and install all the required packages for Text2SQL evaluation and fine-tuning:
@@ -57,7 +56,7 @@ After the script completes, you'll see the accuracy of the Llama model on the BI
2. **SQL Execution**: `text2sql_eval.py` executes both the generated SQL and ground truth SQL against the corresponding databases, then continues with steps 3 and 4 below.
-3. **Result Comparison**: The results from executing the generated SQL are compared with the results from the ground truth SQL to determine correctness.
+3. **Result Comparison**: The results from executing the generated SQL are compared with the results from the ground truth SQL to determine correctness ([source code](text2sql_eval.py#L30)); a minimal sketch follows this list.
4. **Accuracy Calculation**: Accuracy scores are calculated overall and broken down by difficulty levels (simple, moderate, challenging).
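Below is a minimal sketch of the execute-and-compare logic in steps 2-4, assuming BIRD-style SQLite databases; the helper names (`is_correct`, `accuracy`) and the example structure are ours, not necessarily those used in `text2sql_eval.py`:

```python
# A minimal sketch of steps 2-4 (execute, compare, score), assuming
# BIRD-style SQLite databases. All names here are illustrative.
import sqlite3

def is_correct(db_path: str, predicted_sql: str, gold_sql: str) -> bool:
    conn = sqlite3.connect(db_path)
    try:
        pred_rows = conn.execute(predicted_sql).fetchall()
        gold_rows = conn.execute(gold_sql).fetchall()
    except sqlite3.Error:
        return False  # a generated query that fails to execute counts as wrong
    finally:
        conn.close()
    # Execution accuracy: the two result sets must match, ignoring row order
    return set(pred_rows) == set(gold_rows)

def accuracy(examples: list[dict]) -> float:
    # each example: {"db_path": ..., "pred": ..., "gold": ..., "difficulty": ...};
    # filtering on "difficulty" gives the per-level breakdown of step 4
    correct = sum(is_correct(e["db_path"], e["pred"], e["gold"]) for e in examples)
    return 100.0 * correct / len(examples)
```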
@@ -77,7 +76,9 @@ After the script completes, you'll see the accuracy of the Llama model on the BI
- Llama-4-Scout-17B-16E-Instruct-FP8
- other Llama models hosted on Llama API
-## Preparing Fine-tuning Dataset
+## Fine-tuning with the BIRD TRAIN dataset (No Reasoning)
+
+We'll first use the BIRD TRAIN dataset to prepare for supervised fine-tuning with no reasoning info in the dataset.
### Using the TRAIN dataset to prepare for supervised fine-tuning
@@ -100,7 +101,7 @@ This will create `train_text2sql_sft_dataset.json` and `test_text2sql_sft_datase
{"messages":[{"content":"You are a text to SQL query translator. Using the SQLite DB Schema and the External Knowledge, translate the following text question into a SQLite SQL select statement.","role":"system"},{"content":"-- DB Schema: <DB_SCHEMA>\n\n-- External Knowledge: <KNOWLEDGE_FROM_TRAIN>\n\n-- Question: <TEXT_QUESTION>","role":"user"},{"content":"<GOLD_SQL>","role":"assistant"}]}
```
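To make the format concrete, here is a small sketch of how one such JSON line could be assembled from a BIRD TRAIN example; the helper name `make_sft_record` is ours, and the repo's dataset script may structure this differently:

```python
# A small sketch (hypothetical helper) assembling one SFT record in the
# exact "messages" format shown above.
import json

SYSTEM_PROMPT = (
    "You are a text to SQL query translator. Using the SQLite DB Schema "
    "and the External Knowledge, translate the following text question "
    "into a SQLite SQL select statement."
)

def make_sft_record(db_schema: str, knowledge: str, question: str, gold_sql: str) -> str:
    record = {
        "messages": [
            {"content": SYSTEM_PROMPT, "role": "system"},
            {"content": f"-- DB Schema: {db_schema}\n\n"
                        f"-- External Knowledge: {knowledge}\n\n"
                        f"-- Question: {question}",
             "role": "user"},
            {"content": gold_sql, "role": "assistant"},
        ]
    }
    return json.dumps(record)  # one record per line (JSON Lines)
```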
-## Supervised Fine-tuning
+### Supervised Fine-tuning
First, you need to log in to HuggingFace (run `huggingface-cli login` and enter your [HF token](https://huggingface.co/settings/tokens)), and you must have been granted access to the [Llama 3.1 8B Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) model.
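The fine-tuning step can be sketched roughly as follows, assuming a QLoRA-style setup on the TRL/PEFT stack; every hyperparameter below is a placeholder, and the repo's fine-tuning script remains the authoritative reference:

```python
# A rough illustration of QLoRA-style supervised fine-tuning with TRL/PEFT.
# All hyperparameters here are placeholders, not the repo's actual values.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16),
    device_map="auto",
)
train_dataset = load_dataset(
    "json", data_files="train_text2sql_sft_dataset.json", split="train")

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,  # "messages" records in the format above
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
    args=SFTConfig(
        output_dir="./llama31-8b-text2sql-fine_tuning",  # tensorboard logdir
        report_to="tensorboard",
    ),
)
trainer.train()
```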
@@ -111,7 +112,7 @@ After running `tensorboard --logdir ./llama31-8b-text2sql-fine_tuning` you can o

-## Evaluating the fine-tuned model
+### Evaluating the fine-tuned model
First, modify `llama_eval.sh` to use the fine-tuned model:
@@ -140,6 +141,9 @@ Then run `sh llama_eval.sh` to evaluate the original model.
)
```
+## Fine-tuning with the BIRD TRAIN dataset (With Reasoning)
+
+Next, we'll use the BIRD TRAIN dataset to prepare for supervised fine-tuning with reasoning info included in the dataset. The goal is to see whether adding the reasoning info improves the accuracy of the fine-tuned model.
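To make this concrete, a hypothetical reasoning-augmented record could look like the sketch below, reusing the placeholder convention of the no-reasoning format above; the actual format is produced by the dataset-creation step described next:

```python
# A hypothetical reasoning-augmented record: the assistant message carries
# step-by-step reasoning (<REASONING>) before the gold SQL. Placeholders
# follow the convention of the no-reasoning format shown earlier.
import json

record = {
    "messages": [
        {"role": "system",
         "content": ("You are a text to SQL query translator. Using the SQLite "
                     "DB Schema and the External Knowledge, reason step by step, "
                     "then translate the following text question into a SQLite "
                     "SQL select statement.")},
        {"role": "user",
         "content": "-- DB Schema: <DB_SCHEMA>\n\n-- External Knowledge: "
                    "<KNOWLEDGE_FROM_TRAIN>\n\n-- Question: <TEXT_QUESTION>"},
        {"role": "assistant", "content": "<REASONING>\n\n<GOLD_SQL>"},
    ]
}
print(json.dumps(record))
```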
### Creating a reasoning dataset from the TRAIN dataset