
Update tools/benchmarks/llm_eval_harness/meta_eval_reproduce/README.md

Co-authored-by: Hamid Shojanazeri <hamid.nazeri2010@gmail.com>
Kai Wu, 11 months ago
commit ae10920a03
1 file changed, 1 insertion(+), 1 deletion(-)

tools/benchmarks/llm_eval_harness/meta_eval_reproduce/README.md  +1 -1

@@ -11,7 +11,7 @@ As Meta Llama models gain popularity, evaluating these models has become increas
 
 ### Differences between our evaluation and Hugging Face leaderboard evaluation
 
-There are 4 major differences in terms of the eval configurations and prompts between this tutorial implementation and Hugging Face [leaderboard implementation](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/leaderboard).
+There are 4 major differences in terms of the eval configurations and prompting methods between this implementation and Hugging Face [leaderboard implementation](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/leaderboard).
 
- **Prompts**: We use Chain-of-Thought (CoT) prompts, while the Hugging Face leaderboard does not. The prompts that define the output format are also different.
- **Metric calculation**: For the MMLU-Pro, BBH, and GPQA tasks, we ask the model to generate a response and score the parsed answer from that response, while the Hugging Face leaderboard evaluation compares the log likelihood of all label words, such as [ (A),(B),(C),(D) ].
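
To make the metric-calculation difference concrete, below is a minimal, self-contained sketch of the two scoring styles described in the diff context above. The function names, the label set, and the answer-extraction regex are illustrative assumptions for this sketch, not the code used by this implementation or by the Hugging Face harness.

```python
import re

# Log-likelihood scoring (Hugging Face leaderboard style): pick the label word
# whose continuation the model assigned the highest total log probability.
# `label_logprobs` is assumed to come from summing per-token log-probs elsewhere.
def pick_by_loglikelihood(label_logprobs: dict[str, float]) -> str:
    return max(label_logprobs, key=label_logprobs.get)

# Generative scoring (the approach described in this README): let the model
# produce a chain-of-thought response, then parse the final answer from the text.
ANSWER_RE = re.compile(r"answer is \(?([A-D])\)?", re.IGNORECASE)

def pick_by_generation(generated_text: str) -> str | None:
    match = ANSWER_RE.search(generated_text)
    return match.group(1).upper() if match else None

if __name__ == "__main__":
    # Toy log-probs for each label word; the values are made up.
    print(pick_by_loglikelihood({"(A)": -4.2, "(B)": -1.3, "(C)": -3.8, "(D)": -5.0}))  # -> (B)
    print(pick_by_generation("Let's think step by step ... so the answer is (B)."))     # -> B
```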