
Overview

This repository hosts a Django application called LLM Eval, designed for evaluating and benchmarking large language models against specific datasets. The app integrates clients from leading platforms, including OpenAI, Anthropic, Google, Ollama, and Anyscale.

Setup

Follow these steps to set up the application on your local machine.

Clone the repository

Clone the repository to your local machine using the following command:

git clone <repository-url>

Create and activate a Python virtual environment

Create a virtual environment and activate it:

python -m venv /path/to/environment
source /path/to/environment/bin/activate  # On Windows, use /path/to/environment/Scripts/activate

Install Python dependencies

Install the required Python dependencies:

pip install -r requirements.txt

Set up the database connection

The default database is SQLite, but you can configure other database connections in the configuration file located at llmeval/llmeval/settings.py. For more information on database connections, see the Django documentation.
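
For example, switching to PostgreSQL would look roughly like this in llmeval/llmeval/settings.py (a minimal sketch with placeholder credentials, assuming the psycopg2 driver is installed):

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "llmeval",        # placeholder database name
        "USER": "llmeval_user",   # placeholder credentials
        "PASSWORD": "change-me",
        "HOST": "localhost",
        "PORT": "5432",
    }
}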

Run the application as a standalone server

  1. Initialize the database

Run the following command to apply database migrations:

python manage.py migrate

  2. Create a superuser to access the app

Create a superuser account to manage the application:

python manage.py createsuperuser

  3. Start the server

python manage.py runserver 8000

The application will be accessible at http://localhost:8000/admin. Log in using the superuser credentials. Once logged in, you will see the following menu items:

  1. LLM Backends. These interfaces connect to various LLM model providers. The initial implementation includes support for Ollama, OpenAI, Google, Anthropic, and Anyscale. The parameters attribute contains JSON-serialized arguments passed to the model client. For example, OpenAI requires an api_key in the parameters to access its web services (see the example after this list).
  2. LLM Models. These models are associated with a backend that serves them. The parameters attribute includes arguments passed to the chat client, such as top_p, top_k, and temperature.
  3. Eval Configs are configurations used to evaluate the performance of LLM models against a dataset. In this section, you specify the dataset, the system prompt, and the regular expression used to match the final answer. Additionally, you can include a chat history, such as few-shot examples, to improve evaluation accuracy.
  4. Answer Interpreters can be used to interpret the model's answers with another LLM-based assistant.
  5. Eval Sessions are testing sessions executed by the application. Here you choose the Eval Config, the LLM model, the Answer Interpreter (if needed), the dataset target (e.g. train, test, dev, validation), and the delay between requests. The parameters attribute overrides the LLM model's parameters.
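
As an illustration, the parameters for an OpenAI backend and one of its models might look like the following (hypothetical values; the exact keys depend on the client each backend wraps):

Backend parameters (OpenAI):

{"api_key": "YOUR_API_KEY"}

Model parameters:

{"temperature": 0.2, "top_p": 0.9, "top_k": 40}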

Running an evaluation

Load dataset

At the moment there are implementations for loading the MedQA, PubMedQA, and MMLU datasets.

python manage.py import_medqa --file=datasets/medqa/test.jsonl --target=test --dataset=medqa  # loads the MedQA test questions into the 'medqa' dataset, target 'test'
python manage.py import_mmlu --dataset=mmlu --target=test --subject=anatomy  # loads the anatomy subject of MMLU from Hugging Face into a dataset called 'mmlu', target 'test'
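
For reference, a record in a MedQA-style JSONL file typically looks like the line below (illustrative, based on the public MedQA release; the importer's exact field expectations may differ):

{"question": "Which nerve provides motor innervation to the diaphragm?", "options": {"A": "Phrenic nerve", "B": "Vagus nerve", "C": "Ulnar nerve", "D": "Median nerve"}, "answer": "Phrenic nerve", "answer_idx": "A"}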

Execute a session

python manage.py eval_qa --session-id=16 --continue
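
If you prefer to trigger a session from Python (for example, from a scheduled job), the same management command can be invoked through Django's call_command. A minimal sketch, assuming the settings module path below and the flags shown above:

import os
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "llmeval.settings")  # adjust to your settings module

import django
django.setup()

from django.core.management import call_command

# Equivalent to: python manage.py eval_qa --session-id=16 --continue
call_command("eval_qa", "--session-id=16", "--continue")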

Data Loaders

The data loaders are located in commons/management/commands. At the moment there are three loaders: import_medqa, import_mmlu, and import_pubmedqa. MedQA and PubMedQA are imported from local files; the MMLU importer pulls the dataset from Hugging Face.
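
New loaders follow the standard Django management command pattern. A minimal sketch of what such a command could look like (illustrative only, not the repository's actual code):

import json

from django.core.management.base import BaseCommand


class Command(BaseCommand):
    help = "Imports a QA dataset from a local JSONL file (illustrative sketch)"

    def add_arguments(self, parser):
        parser.add_argument("--file", required=True, help="Path to the JSONL file")
        parser.add_argument("--dataset", required=True, help="Name of the dataset to create or extend")
        parser.add_argument("--target", default="test", help="Split to load into (train/test/dev/validation)")

    def handle(self, *args, **options):
        with open(options["file"]) as fh:
            for line in fh:
                record = json.loads(line)
                # ... create the corresponding dataset and question objects here ...
        self.stdout.write(self.style.SUCCESS("Import finished"))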

License

MIT License

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

© 2024 Radu Boncea, ICI Bucharest