Improving Llama Text2SQL performance with CoT Fine-tuning

This recipe is step by step guide to improve Llama performance on Text2SQL measured with the popular BIRD benchmark. We generate synthetic Chain of Thought(CoT) dataset and fine-tune Llama models on it.

Results: [graph_placeholder]

We followed following steps:

Pre-processing the BIRD TRAIN datset by converting SQL statements into conversation format
We use the conversations from step 1, add CoT to these existing conversations using Llama-3.3-70B
Fine-tuning Llama-3.1-8B on the dataset from step 2
We provide scripts to simplify running the BIRD benchmark on the fine-tuned models and compare it with out of the model.

Structure:

quickstart folder: contains a notebook to ask Llama 3.3 to convert natural language queries into SQL queries.
data folder: contains scripts to download the BIRD TRAIN and DEV datasets;
fine-tune folder: contains scripts to generate non-CoT and CoT datasets based on the BIRD TRAIN set and to supervised fine-tune Llama models using the datasets, with different SFT options (quantization or not, full fine-tuning or parameter-efficient fine-tuning);
eval folder: contains scripts to evaluate Llama models (original and fine-tuned) on the BIRD dataset;

We also experimented with supervised fine-tuning (SFT) without CoT which resulted in slightly lower accuracy.

Next Steps

Hyper-parameter tuning of the current SFT scripts.
Try GRPO reinforcement learning to further improve the accuracy.
Fine-tune Llama 3.3 70B and Llama 4 models.
Try agentic workflow.
Expand the eval to support other enterprise databases.

README.md 1.7 KB Historia Czysty

Improving Llama Text2SQL performance with CoT Fine-tuning

Structure:

Next Steps

README.md 1.7 KB

Historia Czysty