Amir committed 1 day ago
parent commit 58ea6cb115
1 changed file with 10 additions and 29 deletions
      end-to-end-use-cases/coding/text2sql/eval/README.md

@@ -6,17 +6,14 @@ We have updated and simplified the original eval scripts from the BIRD [repo](ht
 
 Below are the results of the Llama models we have evaluated on the BIRD DEV dataset:
 
-| Model                  | Llama API Accuracy | Together Accuracy |
-|------------------------|--------------------|-------------------|
-| Llama 3.1 8b           | -                  | 35.66%            |
-| Llama 3.3 70b          | 54.11%             | 54.63%            |
-| Llama-3.1-405B         | -                  | 55.80%            |
-| Llama 4 Scout          | 44.39%             | 43.94%            |
-| Llama 4 Maverick       | 44.00%             | 41.46%            |
-
-- Llama 3.1 8b on Hugging Face: quantized 14.02%, non-quantized 39.47%
-- Non-quantized FFT with no CoT dataset: 36.31%
-- Non-quantized FFT with CoT dataset: 43.87%
+| Model                  | Llama API Accuracy |
+|------------------------|--------------------|
+| Llama 3.1 8b           | 39.47% (*)         |
+| Llama 3.3 70b          | 54.11%             |
+| Llama 4 Scout          | 44.39%             |
+| Llama 4 Maverick       | 44.00%             |
+
+- (*) Since the Llama API does not offer a Llama 3.1 8b model, we use the Hugging Face weights to run the eval locally.
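+
+A minimal sketch of such a local run with Hugging Face `transformers` is below. It assumes access to the gated `meta-llama/Llama-3.1-8B-Instruct` checkpoint; the prompt and decoding settings are illustrative, not the exact harness behind the number above:
+
+```python
+# Illustrative local generation; not the repo's actual eval harness.
+# Requires: pip install torch transformers accelerate, plus Hugging Face
+# access to the gated meta-llama/Llama-3.1-8B-Instruct repo.
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+model_id = "meta-llama/Llama-3.1-8B-Instruct"
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+model = AutoModelForCausalLM.from_pretrained(
+    model_id, torch_dtype=torch.bfloat16, device_map="auto"
+)
+
+messages = [
+    {"role": "system", "content": "Write a single SQLite query that answers the question."},
+    {"role": "user", "content": "Question: How many schools have more than 500 students?"},
+]
+input_ids = tokenizer.apply_chat_template(
+    messages, add_generation_prompt=True, return_tensors="pt"
+).to(model.device)
+
+# Greedy decoding keeps the generated SQL deterministic across runs.
+output = model.generate(input_ids, max_new_tokens=256, do_sample=False)
+print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
+```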
 
 ## Quick Start
 
@@ -39,13 +36,13 @@ sh download_dev_unzip.sh
 cd ../eval
 ```
 
-2. Open `llama_eval.sh` and set `YOUR_API_KEY` to your [Llama API](https://llama.developer.meta.com/) key or [Together](https://api.together.ai/) API key, then uncomment a line that starts with `model=` to specify the Llama model to use for the text2sql eval.
+2. Open `llama_eval.sh` and set `YOUR_API_KEY` to your [Llama API](https://llama.developer.meta.com/) key, then uncomment a line that starts with `model=` to specify the Llama model to use for the text2sql eval.
 
 3. Run the evaluation script `sh llama_eval.sh`, which will use the BIRD DEV dataset (1534 examples in total) with external knowledge turned on to run the Llama model on each text question and compare the generated SQL with the gold SQL.
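+
+In code terms, step 3 boils down to one chat-completion call per question. A minimal sketch of that call, assuming the Llama API exposes an OpenAI-compatible endpoint (the `base_url` and prompt below are assumptions for illustration, not values taken from `llama_eval.sh`):
+
+```python
+# Illustrative single-question call; the real script loops over all 1534 DEV
+# questions and includes the database schema plus external knowledge.
+from openai import OpenAI
+
+client = OpenAI(
+    api_key="YOUR_API_KEY",                      # your Llama API key
+    base_url="https://api.llama.com/compat/v1/", # assumed OpenAI-compatible endpoint
+)
+
+prompt = (
+    "Given this SQLite schema, write one SQL query that answers the question.\n"
+    "Schema: CREATE TABLE schools(id INTEGER, name TEXT, students INTEGER);\n"
+    "Question: How many schools have more than 500 students?"
+)
+response = client.chat.completions.create(
+    model="Llama-3.3-70B-Instruct",
+    messages=[{"role": "user", "content": prompt}],
+    temperature=0,  # deterministic output makes accuracy numbers reproducible
+)
+print(response.choices[0].message.content)
+```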
 
 If your API key or model name is incorrect, the script will exit with an authentication error or a model-not-supported error.
 
-After the script completes, you'll see the accuracy of the Llama model on the BIRD DEV text2sql. For example, the total accuracy is about 54.24% with `YOUR_API_KEY` set to your Llama API key and `model='Llama-3.3-70B-Instruct'`, or about 35.07% with `YOUR_API_KEY` set to your Together API key and `model=meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo`.
+After the script completes, you'll see the accuracy of the Llama model on the BIRD DEV text2sql eval. For example, the total accuracy is about 54.24% with `YOUR_API_KEY` set to your Llama API key and `model='Llama-3.3-70B-Instruct'`.
 
 To compare the evaluated accuracy of your selected Llama model with other results on the BIRD Dev leaderboard, click [here](https://bird-bench.github.io/).
 
@@ -58,19 +55,3 @@ To compare your evaluated accuracy of your selected Llama model with other resul
 3. **Result Comparison**: The results from executing the generated SQL are compared ([source code](text2sql_eval.py#L30)) with the results from the ground truth SQL to determine correctness.
 
 4. **Accuracy Calculation**: Accuracy scores are calculated overall and broken down by difficulty levels (simple, moderate, challenging).
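+
+Steps 3 and 4 together amount to: execute both queries on the same SQLite database, compare the result sets, and bucket scores by difficulty. A minimal sketch of that logic (the field names are illustrative; see [text2sql_eval.py](text2sql_eval.py) for the actual implementation):
+
+```python
+# Illustrative comparison and scoring; see text2sql_eval.py for the real logic.
+import sqlite3
+from collections import defaultdict
+
+def same_result(db_path, predicted_sql, gold_sql):
+    """Run both queries and compare rows as sets (order-insensitive)."""
+    conn = sqlite3.connect(db_path)
+    try:
+        predicted = set(conn.execute(predicted_sql).fetchall())
+        gold = set(conn.execute(gold_sql).fetchall())
+    except sqlite3.Error:
+        return False  # SQL that fails to execute counts as incorrect
+    finally:
+        conn.close()
+    return predicted == gold
+
+def accuracy_by_difficulty(examples):
+    """examples: dicts with db_path, predicted_sql, gold_sql, difficulty."""
+    hits, totals = defaultdict(int), defaultdict(int)
+    for ex in examples:
+        correct = same_result(ex["db_path"], ex["predicted_sql"], ex["gold_sql"])
+        for level in ("overall", ex["difficulty"]):  # simple/moderate/challenging
+            totals[level] += 1
+            hits[level] += correct
+    return {level: 100.0 * hits[level] / totals[level] for level in totals}
+```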
-
-## Supported Models for Evaluation
-
-Llama models supported on Together AI:
-- meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo
-- meta-llama/Llama-3.3-70B-Instruct-Turbo
-- meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8
-- meta-llama/Llama-4-Scout-17B-16E-Instruct
-- other Llama models hosted on Together AI
-
-Llama models supported on Llama API:
-- Llama-3.3-8B-Instruct
-- Llama-3.3-70B-Instruct
-- Llama-4-Maverick-17B-128E-Instruct-FP8
-- Llama-4-Scout-17B-16E-Instruct-FP8
-- other Llama models hosted on Llama API