
READMEs update based on the new FT results etc

Jeff Tang 3 months ago
parent
commit
4037737fa6

+ 7 - 6
end-to-end-use-cases/coding/text2sql/README.md

@@ -4,7 +4,7 @@ This folder contains scripts to:
 
 1. Evaluate Llama (original and fine-tuned) models on the Text2SQL task using the popular [BIRD](https://bird-bench.github.io) dataset in **3 simple steps**;
 
-2. Generate two fine-tuning datasets (with and without CoT) and fine-tuning Llama 3.1 8B with the datasets, gaining a **165% improvement on the fine-tuned model without CoT (accuracy 37.16%) and 209% with CoT (accuracy 43.37%)** over the original model (accuracy 14.02%).
+2. Generate two supervised fine-tuning (SFT) datasets (with and without CoT) and fine-tune Llama 3.1 8B on them, using different SFT options: with or without CoT, with or without quantization, and full fine-tuning (FFT) or parameter-efficient fine-tuning (PEFT). The non-quantized PEFT SFT with CoT yields the largest gain: from 39.47% accuracy for the original Llama 3.1 8B model to 43.35%. (Note: the results are based on only 3 epochs of SFT.)
 
 Our end goal is to maximize the accuracy of Llama models on the Text2SQL task. To do so we need to first evaluate the current state of the art Llama models on the task, then apply fine-tuning, agent and other approaches to evaluate and improve Llama's performance.
 
@@ -17,8 +17,9 @@ Our end goal is to maximize the accuracy of Llama models on the Text2SQL task. T
 
 ## Next Steps
 
-1. Try GRPO RFT to further improve the accuracy.
-2. Fine-tune Llama 3.3 70b and Llama 4 models.
-3. Use torchtune.
-4. Try agentic workflow.
-5. Expand the eval to support other enterprise databases.
+1. Hyper-parameter tuning of the current SFT scripts.
+2. Try GRPO RFT to further improve the accuracy.
+3. Fine-tune Llama 3.3 70B and Llama 4 models.
+4. Use torchtune.
+5. Try agentic workflow.
+6. Expand the eval to support other enterprise databases.
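To make the SFT combinations above concrete, here is a minimal sketch of a LoRA PEFT run on Llama 3.1 8B with an optional 4-bit quantization toggle (QLoRA-style). The model ID, dataset file, hyperparameters, and field names are illustrative assumptions, not the repo's actual scripts, and exact `trl`/`peft` argument names may differ across library versions.

```python
# Hedged sketch: LoRA PEFT SFT of Llama 3.1 8B, optionally 4-bit quantized.
# All names and hyperparameters below are assumptions for illustration only.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # assumed base model
USE_QUANTIZATION = False                       # toggle for the "quantized" combinations

# Optional 4-bit quantization config for QLoRA-style training.
quant_config = (
    BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
    if USE_QUANTIZATION
    else None
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, quantization_config=quant_config, torch_dtype=torch.bfloat16
)

# Hypothetical JSONL produced by the dataset-generation step: one "text" field per example.
train_dataset = load_dataset("json", data_files="text2sql_sft_train.jsonl", split="train")

# LoRA adapter config; omitting peft_config below approximates full fine-tuning (FFT).
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    peft_config=peft_config,
    args=SFTConfig(
        output_dir="llama31-8b-text2sql-sft",
        num_train_epochs=3,                    # matches the 3-epoch note above
        per_device_train_batch_size=2,
        learning_rate=2e-4,
    ),
)
trainer.train()
```

Dropping `peft_config` switches the run to full fine-tuning (FFT), at a much higher memory cost; toggling `USE_QUANTIZATION` covers the quantized rows of the results table further below.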

+ 3 - 3
end-to-end-use-cases/coding/text2sql/eval/README.md

@@ -14,9 +14,9 @@ Below are the results of the Llama models we have evaluated on the BIRD DEV data
 | Llama 4 Scout          | 44.39%             | 43.94%            |
 | Llama 4 Maverick       | 44.00%             | 41.46%            |
 
-- Llama 3.1 8b quantized model: 14.02% (original)
-- Fine-tuned with no reasoning dataset: 37.16%
-- Fine-tuned with reasoning dataset: 43.37%
+- Llama 3.1 8B on Hugging Face: quantized 14.02%, non-quantized 39.47%
+- Fine-tuned with the no-CoT dataset: 39.31%
+- Fine-tuned with the CoT dataset: 43.35%
 
 ## Quick Start
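The accuracy figures above are execution accuracies: a predicted query counts as correct only if running it against the example's SQLite database returns the same result set as the gold query. Below is a rough sketch of that check; the database path and queries are placeholders, and the official BIRD eval additionally handles timeouts, parallel execution, and per-difficulty breakdowns.

```python
# Hedged sketch of execution-accuracy matching on a SQLite database.
import sqlite3


def run_query(db_path: str, sql: str):
    """Run a query and return all result rows."""
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


def is_execution_match(db_path: str, predicted_sql: str, gold_sql: str) -> bool:
    """True if the predicted query's result set matches the gold query's result set."""
    try:
        pred_rows = run_query(db_path, predicted_sql)
    except sqlite3.Error:
        return False  # an unexecutable prediction counts as wrong
    gold_rows = run_query(db_path, gold_sql)
    return set(pred_rows) == set(gold_rows)  # order-insensitive comparison


def execution_accuracy(examples) -> float:
    """Accuracy over (db_path, predicted_sql, gold_sql) triples, in percent."""
    correct = sum(is_execution_match(*ex) for ex in examples)
    return 100.0 * correct / len(examples)
```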
 

+ 18 - 2
end-to-end-use-cases/coding/text2sql/fine-tuning/README.md

@@ -1,13 +1,29 @@
 # Enhancing Text-to-SQL with CoT: A Fine-Tuning Approach with Llama
 
-CoT stands for Chain of Thought and we will use "CoT" and "reasoning" interchangeably here, although generally, reasoning encompasses a broader concept than CoT.
-
 This folder contains scripts to:
 
 * generate a dataset from the BIRD TRAIN set (with no CoT info) for supervised fine-tuning (SFT);
 * generate a dataset from the BIRD TRAIN set (with CoT info by Llama 3.3 70B) for SFT;
 * SFT the Llama 3.1 8B model with the generated datasets with different fine-tuning combinations: with or without CoT, using quantization or not,  full fine-tuning (FFT) or parameter-efficient fine-tuning (PEFT).
 
+**Note:** CoT stands for Chain of Thought, and we use "CoT" and "reasoning" interchangeably here, although reasoning generally encompasses a broader concept than CoT.
+
+## Eval Results of the Fine-tuned Models
+
+The eval results of Llama 3.1 8B fine-tuned with the different SFT options are summarized in the table below:
+
+| Fine-tuning Combination     | Accuracy |
+|-----------------------------|----------|
+| Non-Quantized, CoT, FFT     | xx.xx%   |
+| Non-Quantized, CoT, PEFT    | 43.35%   |
+| Quantized, CoT, PEFT        | 42.89%   |
+| Non-Quantized, No CoT, PEFT | 39.31%   |
+| Quantized, No CoT, PEFT     | 39.31%   |
+| Non-Quantized, No CoT, FFT  | 33.70%   |
+| Quantized, CoT, FFT         | N/A      |
+| Quantized, No CoT, FFT      | N/A      |
+
+
 ## SFT with the BIRD TRAIN dataset (No Reasoning)
 
 We'll first use the BIRD TRAIN dataset to prepare for supervised fine-tuning with no reasoning info in the dataset.
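For reference, here is a minimal sketch of turning one BIRD TRAIN record into an SFT example, with or without a CoT target. The field names `question`, `evidence`, and `SQL` follow the BIRD JSON, but the prompt wording, file paths, and the single `"text"` output field are assumptions rather than the repo's actual dataset-generation code.

```python
# Hedged sketch: build SFT examples (with or without CoT) from BIRD TRAIN records.
# Prompt wording, paths, and the output format are illustrative assumptions.
import json

SYSTEM_PROMPT = (
    "You are a text-to-SQL assistant. Given a database schema and a question, "
    "write a SQLite query that answers the question."
)


def build_sft_example(record: dict, schema_ddl: str, cot: str | None = None) -> dict:
    user = (
        f"Database schema:\n{schema_ddl}\n\n"
        f"External knowledge: {record.get('evidence', '')}\n\n"
        f"Question: {record['question']}"
    )
    # With CoT, the target holds reasoning (e.g. generated by Llama 3.3 70B) before the SQL;
    # without CoT, the target is the gold SQL alone.
    target = f"{cot}\n\nFinal SQL:\n{record['SQL']}" if cot else record["SQL"]
    # One "text" field per example, matching what the SFT sketch earlier expects.
    return {"text": f"{SYSTEM_PROMPT}\n\n{user}\n\n{target}"}


if __name__ == "__main__":
    with open("train/train.json") as f:  # path inside the BIRD TRAIN download (assumed)
        records = json.load(f)
    with open("text2sql_sft_train.jsonl", "w") as out:
        for rec in records[:3]:  # small demo slice
            out.write(json.dumps(build_sft_example(rec, schema_ddl="-- CREATE TABLE ...")) + "\n")
```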