Llama-Recipes makes use of lm-evaluation-harness to evaluate our fine-tuned Llama2 model. It can also serve as a tool to evaluate quantized models, to verify quality at lower precision or after other optimizations have been applied to the model.
lm-evaluation-harness provides a wide range of features:
The Language Model Evaluation Harness is also the backend for 🤗 Hugging Face's (HF) popular Open LLM Leaderboard.
Before running the evaluation script, ensure you have all the necessary dependencies installed.
Clone the lm-evaluation-harness repository and install it:
git clone https://github.com/matthoffner/lm-evaluation-harness.git
cd lm-evaluation-harness
pip install -e .
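As a quick sanity check after installation, the sketch below tests whether the package can be located by Python. It assumes the harness installs under the import name `lm_eval`; the helper `is_installed` is a hypothetical name for illustration and is not part of the harness or of eval.py.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Assumption: the harness is importable as `lm_eval`.
if is_installed("lm_eval"):
    print("lm_eval found")
else:
    print("lm_eval not found; re-run `pip install -e .` inside the cloned repo")
```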
To run evaluation for the Hugging Face Llama2 7B model on a single GPU, run the following:
python eval.py --model hf --model_args pretrained=meta-llama/Llama-2-7b-chat-hf --tasks hellaswag --device cuda:0 --batch_size 8
Multiple tasks can be specified by separating them with commas, for example --tasks hellaswag,arc.
Use --num_fewshot to set the number of examples for few-shot evaluation.
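If you launch evaluations from a script, the command line can be assembled programmatically. The following is a minimal sketch, not part of eval.py: `build_eval_command` is a hypothetical helper, and it only uses the flags already shown in this README. The task names are taken from the example above.

```python
import shlex


def build_eval_command(tasks, num_fewshot=None,
                       model_args="pretrained=meta-llama/Llama-2-7b-chat-hf",
                       device="cuda:0", batch_size=8):
    """Return the eval.py invocation as an argument list (hypothetical helper)."""
    cmd = [
        "python", "eval.py",
        "--model", "hf",
        "--model_args", model_args,
        # Several tasks are passed as one comma-separated value.
        "--tasks", ",".join(tasks),
        "--device", device,
        "--batch_size", str(batch_size),
    ]
    # Only add the few-shot flag when a value was requested.
    if num_fewshot is not None:
        cmd += ["--num_fewshot", str(num_fewshot)]
    return cmd


cmd = build_eval_command(["hellaswag", "arc"], num_fewshot=10)
print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)`; building a list rather than a single string avoids shell-quoting problems.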
If you have fine-tuned your model using PEFT, you can set the path to the PEFT checkpoints with the peft argument in model_args, as shown below:
python eval.py --model hf --model_args pretrained=meta-llama/Llama-2-7b-hf,dtype="float",peft=../peft_output --tasks hellaswag --num_fewshot 10 --device cuda:0 --batch_size 8
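The value given to --model_args is a single comma-separated string of key=value pairs. The sketch below shows one way to compose it from Python; `format_model_args` is a hypothetical helper, not a function provided by eval.py or the harness. Note that the shell removes the double quotes around "float" in the command above, so the script receives dtype=float.

```python
def format_model_args(**kwargs):
    """Join keyword arguments into a key=value,key=value string (hypothetical helper)."""
    return ",".join(f"{key}={value}" for key, value in kwargs.items())


model_args = format_model_args(
    pretrained="meta-llama/Llama-2-7b-hf",
    dtype="float",
    peft="../peft_output",  # path to the PEFT checkpoints, as in the command above
)
print(model_args)
```

Keyword arguments keep their insertion order, so the resulting string lists the pairs in the order they were passed.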
A study from IBM on efficient benchmarking of LLMs found, as its main takeaway, that to identify whether a model is performing poorly, benchmarking on a wider range of tasks is more important than the number of examples in each task. This means you can run the evaluation harness with fewer examples to make an initial decision on whether performance has regressed from the baseline. To limit the number of examples, use the --limit flag with the desired number.
python eval.py --model hf --model_args pretrained=meta-llama/Llama-2-7b-hf,dtype="float",peft=../peft_output --tasks hellaswag --num_fewshot 10 --device cuda:0 --batch_size 8 --limit 100
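To get a rough feel for how noisy a limited run is, the sketch below computes the standard error of an accuracy estimate for a given number of examples. This is the usual binomial approximation and is an assumption added here, not something taken from the IBM study or the harness: it treats each example as an independent right/wrong outcome. The function name `accuracy_standard_error` is hypothetical.

```python
import math


def accuracy_standard_error(accuracy, num_examples):
    """Binomial standard error of an accuracy measured on num_examples items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / num_examples)


# Worst case is accuracy = 0.5; compare a few candidate values for --limit.
for n in (100, 1000, 10000):
    se = accuracy_standard_error(0.5, n)
    print(f"limit={n:>5}  standard error ~ {se:.3f}")
```

With 100 examples and an accuracy near 0.5, the standard error is 0.05, so differences of a few percentage points between a fine-tuned model and its baseline may not be meaningful at that limit.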
Here, we provide a list of tasks from the Open LLM Leaderboard, which can be used by passing --open-llm-leaderboard-tasks instead of --tasks to eval.py.
NOTE: Make sure to run the bash script below, which sets the include paths in the config files. The script will prompt you to enter the path to the cloned lm-evaluation-harness repo. You only need this step the first time.
bash open_llm_eval_prep.sh
Now we can run the eval benchmark:
python eval.py --model hf --model_args pretrained=meta-llama/Llama-2-7b-hf,dtype="float",peft=../peft_output --num_fewshot 10 --device cuda:0 --batch_size 8 --limit 100 --open-llm-leaderboard-tasks
In the HF leaderboard, LLMs are evaluated on 7 benchmarks from the Language Model Evaluation Harness, as described below:
If you have customized the Llama model, for example a quantized version that is loaded differently from a standard HF model, you can follow this guide to add your model to eval.py and run the eval benchmarks.
You can also find the full task list here.