# Overview

This repository hosts a Django application called **LLM Eval**, designed for evaluating and benchmarking large language models against specific datasets. The app integrates clients from leading platforms, including OpenAI, Anthropic, Google, Ollama, and Anyscale.

# Setup

Follow these steps to set up the application on your local machine.

## Clone the repository

Clone the repository to your local machine using the following command:

```bash
git clone
```

## Create and activate a Python virtual environment

Create a virtual environment and activate it:

```bash
python -m venv /path/to/environment
source /path/to/environment/bin/activate  # On Windows, use /path/to/environment/Scripts/activate
```

## Install Python dependencies

Install the required Python dependencies:

```bash
pip install -r requirements.txt
```

## Set up the database connection

The default database is SQLite, but you can configure other types of connections in the configuration file located at `llmeval/llmeval/settings.py`. For more information on database connections, see the [Django documentation](https://docs.djangoproject.com/en/5.0/ref/databases/).

## Run the application as a standalone server

1. Initialize the database

   Run the following command to apply database migrations:

   ```bash
   python manage.py migrate
   ```

2. Create a superuser to access the app

   Create a superuser account to manage the application:

   ```bash
   python manage.py createsuperuser
   ```

3. Start the server

   ```bash
   python manage.py runserver 8000
   ```

The application will be accessible at http://localhost:8000/admin. Log in using the superuser credentials. Once logged in, you will see these menu items:

1. **LLM Backends**. These interfaces connect to various LLM model providers. The initial implementation includes support for Ollama, OpenAI, Google, Anthropic, and Anyscale. The `parameters` attribute contains JSON-serialized arguments passed to the model client.
For example, OpenAI requires an `api_key` in the parameters to access its web services.

2. **LLM Models**. These models are associated with a backend that serves them. The `parameters` attribute includes arguments passed to the chat client, such as `top_p`, `top_k`, and `temperature`.

3. **Eval Configs** are configurations used to evaluate the performance of LLM models against a dataset. In this section, you specify the dataset, the system prompt, and the regular expression for matching the final answer. Additionally, you can include a chat history, such as few-shot examples, to improve evaluation accuracy.

4. **Answer Interpreters** can be used to interpret a model's answers with another LLM-based assistant.

5. **Eval Sessions** are testing sessions executed by the application. Here you choose the Eval Config, the LLM model, the Answer Interpreter (if needed), the dataset target (e.g. train, test, dev, validation), and the delay between requests. The `parameters` attribute overrides the LLM parameters.

# Running an evaluation

## Load a dataset

At the moment there are implementations for loading the MedQA, PubMedQA, and MMLU datasets.

```bash
# Load the MedQA test questions into the target "test"
python manage.py import_medqa --file=datasets/medqa/test.jsonl --target=test --dataset=medqa
```

```bash
# Load the MMLU anatomy subject from Hugging Face into a dataset called "mmlu" and target "test"
python manage.py import_mmlu --dataset=mmlu --target=test --subject=anatomy
```

## Execute a session

```bash
python manage.py eval_qa --session-id=16 --continue
```

## Data loaders

The data loaders are located in `commons/management/commands`. At the moment there are three loaders: `import_medqa`, `import_mmlu`, and `import_pubmedqa`. MedQA and PubMedQA are imported from local files; the MMLU importer pulls the dataset from Hugging Face.
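The various `parameters` attributes described above all hold JSON-serialized keyword arguments: backend parameters carry client settings such as credentials, model parameters carry sampling defaults, and session parameters override them. The sketch below illustrates that precedence; the `ChatClient` class and the field values are illustrative assumptions, not the app's actual code.

```python
import json

class ChatClient:
    """Stand-in for a provider client (e.g. OpenAI's); illustrative only."""
    def __init__(self, api_key: str, **options):
        self.api_key = api_key
        self.options = options

# Backend-level parameters (e.g. credentials) stored as JSON in the admin.
backend_params = json.loads('{"api_key": "sk-placeholder", "timeout": 30}')
# Model-level parameters set sampling defaults for the chat client.
model_params = json.loads('{"temperature": 0.7, "top_p": 0.9}')
# Session-level parameters override the model's defaults.
session_params = json.loads('{"temperature": 0.0}')

# Later keys win, mirroring the override behavior described above.
merged = {**model_params, **session_params}
client = ChatClient(**backend_params)
print(merged["temperature"])  # 0.0
```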
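The final-answer regular expression configured in an Eval Config can be pictured with a short sketch. The pattern and sample reply below are hypothetical, not taken from the repository; a real config would use whatever pattern matches the prompt's answer format.

```python
import re

# Hypothetical pattern for multiple-choice QA: capture the single option
# letter that follows the phrase "Answer:" in the model's reply.
ANSWER_PATTERN = r"Answer:\s*([A-E])\b"

def extract_final_answer(model_output: str):
    """Return the matched option letter, or None when the reply
    contains no parsable final answer."""
    match = re.search(ANSWER_PATTERN, model_output)
    return match.group(1) if match else None

reply = "The aorta arises from the left ventricle. Answer: C"
print(extract_final_answer(reply))  # C
```

When a reply does not match the pattern, an Answer Interpreter (another LLM-based assistant, as described above) can be used instead of a fixed regex.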
# License

MIT License

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

©2024 Radu Boncea, ICI Bucharest