# Overview

This repository hosts a Django application called **LLM Eval**, designed for evaluating and benchmarking large language models against specific datasets. The app integrates clients from leading platforms, including OpenAI, Anthropic, Google, Ollama, and Anyscale.

# Setup

Follow these steps to set up the application on your local machine.

## Clone the repository

Clone the repository to your local machine using the following command:

```bash
git clone
```

## Create and activate a Python virtual environment

Create a virtual environment and activate it:

```bash
python -m venv /path/to/environment
source /path/to/environment/bin/activate  # On Windows, use /path/to/environment/Scripts/activate
```

## Install Python dependencies

Install the required Python dependencies:

```bash
pip install -r requirements.txt
```

## Set up the database connection

The default database is SQLite, but you can configure other types of connections in the configuration file located at `llmeval/llmeval/settings.py`. For more information on database connections, see the [Django documentation](https://docs.djangoproject.com/en/5.0/ref/databases/).

## Run the application as a standalone server

1. Initialize the database

   Run the following command to apply database migrations:

   ```bash
   python manage.py migrate
   ```

2. Create a superuser to access the app

   Create a superuser account to manage the application:

   ```bash
   python manage.py createsuperuser
   ```

3. Start the server

   ```bash
   python manage.py runserver 8000
   ```

The application will be accessible at http://localhost:8000/admin. Log in using the superuser credentials. Once logged in, you will see these menu items:

1. **LLM Backends**. These interfaces connect to various LLM model providers. The initial implementation includes support for Ollama, OpenAI, Google, Anthropic, and Anyscale. The `parameters` attribute contains JSON-serialized arguments passed to the model client.
For example, OpenAI requires an `api_key` in the parameters to access its web services.

2. **LLM Models**. These models are associated with a backend that serves them. The `parameters` attribute includes arguments passed to the chat client, such as `top_p`, `top_k`, and `temperature`.

3. **Eval Configs** are configurations used to evaluate the performance of LLM models against a dataset. In this section, you specify the dataset, the system prompt, and the regular expression for matching the final answer. Additionally, you can include a chat history, such as few-shot examples, to improve evaluation accuracy.

4. **Answer Interpreters** can be used to interpret a model's answers with another LLM-based assistant.

5. **Eval Sessions** are testing sessions executed by the application. Here you choose the Eval Config, the LLM model, the Answer Interpreter (if needed), the dataset target (e.g. train, test, dev, validation), and the delay between requests. The `parameters` attribute overrides the LLM parameters.

# Running an evaluation

## Load a dataset

At the moment there are implementations for loading the MedQA, PubMedQA, and MMLU datasets.

```bash
# Load the MedQA test questions into the target "test"
python manage.py import_medqa --file=datasets/medqa/test.jsonl --target=test --dataset=medqa
```

```bash
# Load the MMLU anatomy subject from Hugging Face into a dataset called "mmlu" and target "test"
python manage.py import_mmlu --dataset=mmlu --target=test --subject=anatomy
```

## Execute a session

```bash
python manage.py eval_qa --session-id=16 --continue
```

## Data loaders

The data loaders are located in `commons/management/commands`. At the moment there are three loaders: `import_medqa`, `import_mmlu`, and `import_pubmedqa`. MedQA and PubMedQA are imported from local files; the MMLU importer pulls the dataset from Hugging Face.
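The various `parameters` attributes described above all hold JSON-serialized keyword arguments: backend parameters carry client settings such as credentials, model parameters carry sampling defaults, and session parameters override them. The sketch below illustrates that precedence; the `ChatClient` class and the field values are illustrative assumptions, not the app's actual code.

```python
import json

class ChatClient:
    """Stand-in for a provider client (e.g. OpenAI's); illustrative only."""
    def __init__(self, api_key: str, **options):
        self.api_key = api_key
        self.options = options

# Backend-level parameters (e.g. credentials) stored as JSON in the admin.
backend_params = json.loads('{"api_key": "sk-placeholder", "timeout": 30}')
# Model-level parameters set sampling defaults for the chat client.
model_params = json.loads('{"temperature": 0.7, "top_p": 0.9}')
# Session-level parameters override the model's defaults.
session_params = json.loads('{"temperature": 0.0}')

# Later keys win, mirroring the override behavior described above.
merged = {**model_params, **session_params}
client = ChatClient(**backend_params)
print(merged["temperature"])  # 0.0
```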
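The final-answer regular expression configured in an Eval Config can be pictured with a short sketch. The pattern and sample reply below are hypothetical, not taken from the repository; a real config would use whatever pattern matches the prompt's answer format.

```python
import re

# Hypothetical pattern for multiple-choice QA: capture the single option
# letter that follows the phrase "Answer:" in the model's reply.
ANSWER_PATTERN = r"Answer:\s*([A-E])\b"

def extract_final_answer(model_output: str):
    """Return the matched option letter, or None when the reply
    contains no parsable final answer."""
    match = re.search(ANSWER_PATTERN, model_output)
    return match.group(1) if match else None

reply = "The aorta arises from the left ventricle. Answer: C"
print(extract_final_answer(reply))  # C
```

When a reply does not match the pattern, an Answer Interpreter (another LLM-based assistant, as described above) can be used instead of a fixed regex.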
# License

MIT License

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

©2024 Radu Boncea, ICI Bucharest