See run_e2e.sh for info on how to run the experiment.
In this directory, we run an experiment answering the question:
If we run enough Llama models in parallel, can they outperform GPT-4o on HumanEval?
It seeks to increase model performance not by scaling up parameters, but by scaling up inference-time compute.
This experiment has been built and run by the team at Modal, and is described in the following blog post:
Beat GPT-4o at Python by searching with 100 dumb LLaMAs
The experiment has since been adapted to use the Llama 3.2 3B Instruct model, and run end-to-end using the Modal serverless platform.
From within your virtual environment, run:
pip install modal
And if you're new to Modal, authenticate with:
modal setup
# or if that doesn't work, try 
# python -m modal setup
That's all!
This CLI will execute your Modal apps, which build and run containers in the cloud, on your GPU of choice.
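To give a sense of what that looks like, here is a minimal, hypothetical Modal app that requests a GPU container and invokes a function remotely. The app name and function below are illustrative only and are not one of the scripts in this directory:

```python
import modal

# A minimal, hypothetical Modal app: not one of the scripts in this repo.
app = modal.App("example-hello-gpu")


# Request a container with a GPU attached; Modal builds and runs it in the cloud.
@app.function(gpu="any")
def check_gpu():
    import subprocess

    # Print the GPU visible inside the remote container.
    print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)


# `modal run <this_file>.py` calls the local entrypoint,
# which invokes the function in a remote container.
@app.local_entrypoint()
def main():
    check_gpu.remote()
```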
This command will run every step for you:
bash run_e2e.sh
Or if you prefer to run it manually, you can step through each of the modal commands in the script.
This will execute each stage of the experiment: downloading the model, generating completions, evaluating them against HumanEval, and plotting the results.
The resulting plots of the evals will be saved locally to /tmp/plot-pass-k.jpeg and /tmp/plot-fail-k.jpeg.
/tmp/plot-pass-k.jpeg shows pass@k for the Llama 3.2 3B Instruct model vs pass@1 for GPT-4o.
You'll see that at 100 generations, the Llama model performs on par with GPT-4o; at higher scale, it outperforms GPT-4o.
/tmp/plot-fail-k.jpeg shows fail@k on a log scale, demonstrating the smooth scaling of this method.
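For reference on how these curves are computed: pass@k is conventionally estimated with the unbiased estimator from the HumanEval paper, and fail@k is presumably just 1 - pass@k. A minimal sketch, assuming n total samples per problem with c of them passing the unit tests (this function is illustrative, not taken from eval.py):

```python
import numpy as np


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper.

    n: total completions generated for a problem
    c: number of completions that pass the unit tests
    k: number of completions drawn
    """
    if n - c < k:
        return 1.0
    # 1 - C(n - c, k) / C(n, k), computed stably as a running product.
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))


# Example: 100 samples per problem, 40 of which pass.
print(pass_at_k(n=100, c=40, k=10))      # estimated pass@10
print(1 - pass_at_k(n=100, c=40, k=10))  # corresponding fail@10
```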