# Many-Llamas Human-Eval

In this directory, we run an experiment answering the question:

*If we run enough Llama models in parallel, can they outperform GPT-4o on HumanEval?*

It seeks to increase model performance not by scaling parameters, but by scaling inference-time compute.

### Technical Blog

This experiment was built by the team at [Modal](https://modal.com), and is described in the following blog post:

[Beat GPT-4o at Python by searching with 100 small Llamas](https://modal.com/blog/llama-human-eval)

The experiment has since been upgraded to use the [Llama 3.2 3B Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) model, and to run end-to-end on the Modal serverless platform.

## Run it yourself

### Install the Modal CLI

From within your virtual environment, run:

```bash
pip install modal
```

And if you're new to Modal, authenticate with:

```bash
modal setup
# or if that doesn't work, try
# python -m modal setup
```

That's all! This CLI will execute your Modal apps, which build and run containers in the cloud, on your GPU of choice.

### HuggingFace Pull Access

To download the model, you'll first need to accept the [Llama 3.2 License](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) on HuggingFace and be approved for access.

Then, create a [Modal secret](https://modal.com/secrets) named `huggingface`, to which you'll add your `HF_TOKEN` as an environment variable. (A sketch of how the secret is attached in code appears at the end of this README.)

### Run the Experiment

This command will run every step for you:

```bash
bash run_e2e.sh
```

Or, if you prefer to run it manually, you can step through each of the Modal commands in [the script](./run_e2e.sh). It executes the following steps:

1. Downloading the Llama 3.2 3B Instruct model to a cloud volume
2. Deploying a vLLM inference server to GPUs
3. Running hundreds of parallel generations on the HumanEval test set
4. Running the evaluation script to compute pass@k and fail@k (see the estimator sketch below)
5. Generating graphs of pass@k and fail@k

### Results

The resulting plots of the evals will be saved locally to:

- `/tmp/plot-pass-k.jpeg`
- `/tmp/plot-fail-k.jpeg`

`/tmp/plot-pass-k.jpeg` shows pass@k for the Llama 3.2 3B Instruct model vs. pass@1 for GPT-4o.

![plot-pass-k](https://github.com/user-attachments/assets/11e9dc6e-4322-4d44-b928-4ed7c4ce8262)

You'll see that at 100 generations, the Llama model performs on par with GPT-4o, and at higher scale it outperforms GPT-4o.

`/tmp/plot-fail-k.jpeg` shows fail@k on a log scale, demonstrating that this method scales smoothly.

![plot-fail-k](https://github.com/user-attachments/assets/7286e4ff-5090-4288-bd62-8a078c6dc5a1)
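
### Estimating pass@k and fail@k

For reference, pass@k is conventionally computed with the unbiased estimator from the original HumanEval paper (Chen et al., 2021): given `n` generations per problem, of which `c` pass the unit tests, pass@k = 1 − C(n−c, k) / C(n, k). The evaluation script in this repo may differ in its details; the sketch below assumes the standard estimator, with fail@k taken to be its complement (the probability that all `k` sampled generations fail):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total generations sampled for a problem
    c: number of those generations that passed the unit tests
    k: sample budget being evaluated
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def fail_at_k(n: int, c: int, k: int) -> float:
    """Probability that all k sampled generations fail: the complement of pass@k."""
    return 1.0 - pass_at_k(n, c, k)
```

Averaging `pass_at_k` over all problems yields a dataset-level pass@k curve like the one plotted above; plotting fail@k against `k` on a log scale yields a curve like the fail@k plot.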
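
### Attaching the `huggingface` secret in code

Also for reference, this is roughly how a Modal secret is attached to a function so that `HF_TOKEN` is available in the container's environment. The app and function names here are illustrative, not necessarily those used by this repo's scripts:

```python
import modal

app = modal.App("llama-human-eval-example")  # illustrative app name

# Attaching the secret injects HF_TOKEN into the container's environment.
@app.function(secrets=[modal.Secret.from_name("huggingface")])
def download_model() -> None:
    import os

    token = os.environ["HF_TOKEN"]  # provided by the `huggingface` secret
    print("token is set:", bool(token))
```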